
Advanced Techniques for RAG and AI Agents

This lesson dives into advanced techniques: making the model reason step-by-step with Chain of Thought, returning structured outputs, and detecting hallucinations with logprobs. You’ll also learn how to chunk documents effectively — and even run an open-source LLM on your own machine.

What Anthropic Thinks About Building Effective Agents

Recently, the team at Anthropic (creators of Claude) released an article titled Building Effective Agents.
Here are a few key insights from it:

  • Context is everything. An agent doesn’t “know” anything by itself. It works with what you give it — the prompt, structure, and dialogue history.
  • An agent is not a single prompt — it’s a sequence of steps. It’s recommended to design agents as processes, where the LLM draws conclusions, makes decisions, stores intermediate data, and passes it between stages.
  • State matters. An agent that makes a request and passes the result to another stage must understand its current state. Without it, it becomes a chatterbox with amnesia.
  • Use multiple prompts. One for analysis, another for decision-making, a third for generation, a fourth for review and improvement. This is architecture — not just one long prompt.

Conclusion: An LLM is just a building block. A real agent is architecture + orchestration. This is where Directual comes in.

Structured Output — So It Doesn’t Just Talk, But Actually Works

If you want the assistant to return not just plain text, but structured data, you need to use Structured Output.

This can be in the form of JSON, XML, markdown, a table, etc.

Why this matters:

  • Easy to parse and use in Directual scenarios
  • Ensures the correctness of responses
  • Precise control over the format

How to Configure Structured Output in Directual

There are two ways to set up Structured Output (SO) on the Directual platform:


Response Format

This is the option where you add "response_format": { "type": "json_object" } to the request, and define the response structure directly in the system prompt. Request example below:
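A sketch of such a request body for a Directual HTTP request step (the model name and the response fields are assumptions, not requirements):

```json
{
  "model": "gpt-4o-mini",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "You are a support assistant. Always answer with a JSON object of the form {\"sentiment\": \"positive | neutral | negative\", \"reply\": \"...\"}."
    },
    { "role": "user", "content": "{{message}}" }
  ]
}
```

Note that `json_object` mode only guarantees syntactically valid JSON, not a specific schema, so the schema itself must be spelled out in the prompt (and the prompt must mention JSON explicitly).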

Function Calling

This option involves using tools. In this case, the response format is guaranteed. Request example below:
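A sketch of a function-calling request (the function name and its fields are assumptions for illustration):

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "{{message}}" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "classify_ticket",
        "description": "Classify a support ticket",
        "parameters": {
          "type": "object",
          "properties": {
            "category": { "type": "string", "enum": ["billing", "bug", "feature_request"] },
            "priority": { "type": "string", "enum": ["low", "medium", "high"] }
          },
          "required": ["category", "priority"]
        }
      }
    }
  ],
  "tool_choice": { "type": "function", "function": { "name": "classify_ticket" } }
}
```

The model's structured arguments come back in `choices[0].message.tool_calls[0].function.arguments` as a JSON string, which a JSON step in the scenario can parse.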

If you need a different format (for example, XML), specify it directly in the prompt text.

Chain of Thought — Making the LLM Think Out Loud

Chain of Thought (CoT) is a technique where the model reasons step by step. This:

  • improves accuracy
  • makes logic traceable
  • helps catch errors

Combo: CoT + Structured Output

The model reasons through the problem, then returns the final JSON.

You can store the reasoning steps for logging, while showing only the result to the user.

Example of a CoT + SO request from Directual:
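A sketch of such a request (the model name and the field names in the schema are assumptions):

```json
{
  "model": "gpt-4o-mini",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "Think step by step. First reason through the problem, then return a JSON object: {\"reasoning\": [\"step 1\", \"step 2\", \"...\"], \"answer\": \"...\"}"
    },
    { "role": "user", "content": "{{question}}" }
  ]
}
```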

As a result, we get a JSON like this, which can be further processed in the scenario:
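A hypothetical example (the `reasoning` and `answer` field names are not fixed by the platform; they are whatever your prompt defines):

```json
{
  "reasoning": [
    "The user asks about a refund.",
    "The order was placed 40 days ago.",
    "The refund window in the knowledge base is 30 days."
  ],
  "answer": "Unfortunately, the 30-day refund window for this order has passed."
}
```

The `reasoning` array can go to a log field, while only `answer` is shown to the user.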

Back to RAG — Let’s Talk About Chunking

When you have a lot of documents, and they’re long — you need to split them into chunks.

Why it matters:

  • Language models have a context length limit
  • Very long texts are poorly vectorized

Chunking Approaches

In Directual, it’s convenient to implement chunking in three steps:

  1. Split the text into a JSON object like this: { "chunks": [ { "text": "..." }, { "text": "..." }, ... ] }
  2. Use a JSON step to create objects in the Chunks structure
  3. Send the array of objects to a LINK scenario step, where you apply embeddings and link each chunk back to the parent document

There are three methods for splitting text into chunks:

1. By length

For example, 100 words with an overlap of 10.

The code for generating JSON is below.
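A sketch of the "by length" splitter for an Edit object step (the helper name and the `{{text}}` template field are assumptions; adjust to your structure):

```javascript
// Chunking by length: fixed-size word windows with overlap.
// Requires ES2022 (ES13) enabled in START => Advanced to use arrow functions.
const splitByLength = (text, size, overlap) => {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const chunks = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push({ text: words.slice(i, i + size).join(' ') });
    if (i + size >= words.length) break; // last window already covers the tail
  }
  return chunks;
};

// The target field has type json, so the object must be stringified before saving:
JSON.stringify({ chunks: splitByLength(`{{text}}`, 100, 10) });
```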

Note: to use arrow functions, you need to enable ECMAScript 2022 (ES13) in the START => Advanced step. By default, ES6 is used.

Also, when saving a JS object into a field of type json, make sure to wrap the expression with JSON.stringify().

2. By structure

Split into paragraphs. If a chunk is shorter than 5 words, merge it with the next one — it’s probably a heading.

Code for the Edit object step:
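A sketch of the "by structure" splitter (the helper name and the `{{text}}` template are assumptions); short fragments are carried forward and merged into the next paragraph:

```javascript
// Chunking by structure: split on blank lines (paragraphs).
// A chunk shorter than minWords is treated as a heading and merged into the next one.
// Requires ES2022 (ES13) enabled in START => Advanced for arrow functions.
const splitByParagraphs = (text, minWords) => {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(p => p.length > 0);
  const chunks = [];
  let pending = ''; // accumulates headings too short to stand alone
  for (const p of paragraphs) {
    const merged = pending ? pending + '\n' + p : p;
    if (merged.split(/\s+/).length < minWords) {
      pending = merged; // still too short, keep merging forward
    } else {
      chunks.push({ text: merged });
      pending = '';
    }
  }
  if (pending) chunks.push({ text: pending }); // trailing short fragment
  return chunks;
};

JSON.stringify({ chunks: splitByParagraphs(`{{text}}`, 5) });
```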

3. By meaning

Send a request to ChatGPT:
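A possible request body (the prompt wording and chunk sizes are assumptions; tune them to your documents):

```json
{
  "model": "gpt-4o-mini",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "Split the user's text into semantically coherent chunks of roughly 50-150 words each. Do not rewrite or omit anything. Return JSON: {\"chunks\": [{\"text\": \"...\"}, ...]}"
    },
    { "role": "user", "content": "{{text}}" }
  ]
}
```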

How to Know If Chunking Is Bad

Poor chunking = repetitive answers, broken logic, or “nothing found.”

How to Test LLM for Hallucinations — Use logprobs

Logprobs = the log-probability of each token.

  • High logprob (closer to 0) = confidence
  • Low logprob = uncertainty

Use this to filter unreliable responses.

What you can do:

  • Don’t show uncertain answers
  • Regenerate the response
  • Show the answer with a warning

In combination with Structured Output, you can check confidence at the field level.

On Directual:

  • Add a request step with logprobs: true
  • Visualize it in HTML (color-coded by confidence level)

Code for visualizing the model’s response with logprobs: true:
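A sketch, assuming the OpenAI response shape where per-token log-probabilities live in `choices[0].logprobs.content`; the class names and confidence thresholds are arbitrary choices:

```javascript
// Build colour-coded HTML from a chat completion requested with "logprobs": true.
// `response` is assumed to be the parsed JSON body of the API reply.
const renderLogprobs = (response) => {
  const tokens = response.choices[0].logprobs.content;
  return tokens.map(t => {
    const p = Math.exp(t.logprob);                 // back from log-prob to probability (0..1)
    const cls = p > 0.9 ? 'lp-high' : p > 0.5 ? 'lp-mid' : 'lp-low';
    return `<span class="${cls}" title="p=${p.toFixed(2)}">${t.token}</span>`;
  }).join('');
};
```

If tokens may contain `<` or `&`, escape them before interpolation.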

This code generates the HTML. Additionally, you need to save the CSS in the Web-app => Custom code section:
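For example (the class names are assumptions and must match whatever your HTML generator emits):

```css
/* Confidence classes for logprobs visualization */
.lp-high { background: #d7f5d7; } /* p > 0.9, confident */
.lp-mid  { background: #fdf3c7; } /* 0.5 < p <= 0.9 */
.lp-low  { background: #fbd5d5; } /* p <= 0.5, likely unreliable */
```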


Running an LLM Locally — No APIs, No Cloud

Model: Qwen1.5-1.8B-Chat — a small model, but good enough for demo purposes on a laptop!

Utility Check

Make sure the necessary tools are installed.
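For example, check that Python 3 and pip are available:

```shell
python3 --version   # Python 3.9+ is a reasonable baseline
pip3 --version
```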

Creating an Isolated Python Environment

We’ll use virtualenv to keep dependencies clean.
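A typical setup (the environment name `qwen-env` is arbitrary):

```shell
pip3 install virtualenv        # if not installed yet
virtualenv qwen-env            # create the environment
source qwen-env/bin/activate   # activate it (on Windows: qwen-env\Scripts\activate)
```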

Installing Required Libraries

  • torch — the PyTorch engine Qwen runs on
  • transformers — Hugging Face library for loading and running LLMs
  • accelerate — auto-detects GPU and optimizes inference
  • flask — a lightweight framework for APIs; we’ll use it to run our local server
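The list above can be installed in one command:

```shell
pip install torch transformers accelerate flask
```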

Connecting to Hugging Face

Think of it like GitHub, but for models and neural networks. Qwen is hosted there.

Go to https://huggingface.co/, log in (or sign up if needed), and create a new Read-only token.


Back to the terminal — log in to Hugging Face and enter your new token.
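The Hugging Face CLI installed alongside `transformers` handles this:

```shell
huggingface-cli login
# paste the Read-only token created above when prompted
```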

Now let’s do a quick check to make sure everything works.

Smoke test

Open Jupyter Notebook — a lightweight interactive environment that makes it easy to write and run Python projects step by step.

If it’s not installed, install it via:
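```shell
pip install notebook
jupyter notebook
```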

Create a file named test_qwen.py
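A minimal sketch using the Hugging Face `transformers` chat-template API; the prompt and generation settings are illustrative:

```python
# test_qwen.py — smoke test: load the model and generate one short reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen1.5-1.8B-Chat"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

messages = [{"role": "user", "content": "Say hello in one short sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=50)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```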

Run it in the terminal.
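```shell
python test_qwen.py
```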

The model download may take 10–15 minutes. After that, it will be stored locally and ready to use.

Launching the API

We’ve confirmed that the model runs and responds — now let’s create a server that will expose an API identical to the ChatGPT API.

Create a file called app.py
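A minimal sketch of such a server: it exposes the same `/v1/chat/completions` route and response shape as the ChatGPT API, implementing only the fields a Directual HTTP step needs (this is an assumption-heavy skeleton, not a production server):

```python
# app.py — OpenAI-style chat endpoint wrapping the local Qwen model.
from flask import Flask, request, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen1.5-1.8B-Chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

app = Flask(__name__)

@app.route("/v1/chat/completions", methods=["POST"])
def chat():
    body = request.get_json()
    inputs = tokenizer.apply_chat_template(
        body["messages"], add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=body.get("max_tokens", 256))
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    # Mirror the shape of a ChatGPT API response
    return jsonify({
        "object": "chat.completion",
        "model": MODEL,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```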

Run it!
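Assuming the server above listens on port 5000:

```shell
python app.py
# in another terminal, a quick check (payload shape mimics the ChatGPT API):
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```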

Now we have a local API, but we need to make it accessible from the internet — including from Directual.

Creating an HTTPS Tunnel for the API

Register an account at ngrok.com, get your token, and run the following in the terminal:
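Assuming the local API runs on port 5000 (replace the placeholder with your own token):

```shell
ngrok config add-authtoken <YOUR_NGROK_TOKEN>   # one-time setup
ngrok http 5000                                 # tunnel to the local Flask API
```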

The generated API endpoint can be used in an HTTP request step in Directual — just like any other LLM API!

Conclusion

You’ve learned:

  • How Anthropic approaches agent design
  • How Structured Output works
  • How to apply Chain of Thought
  • How to properly chunk text
  • How to catch hallucinations using logprobs
  • How to run a local LLM on your laptop

And most importantly — how to connect it all together on Directual.

Now it’s all about practice.

Apply what you’ve learned, build assistants, create your own agents. Good luck!
