
Advanced Techniques for RAG and AI Agents

This lesson dives into advanced techniques: making the model reason step-by-step with Chain of Thought, returning structured outputs, and detecting hallucinations with logprobs. You’ll also learn how to chunk documents effectively — and even run an open-source LLM on your own machine.

What Anthropic Thinks About Building Effective Agents

Recently, the team at Anthropic (creators of Claude) released an article titled Building Effective Agents.
Here are a few key insights from it:

  • Context is everything. An agent doesn’t “know” anything by itself. It works with what you give it — the prompt, structure, and dialogue history.
  • An agent is not a single prompt — it’s a sequence of steps. It’s recommended to design agents as processes, where the LLM draws conclusions, makes decisions, stores intermediate data, and passes it between stages.
  • State matters. An agent that makes a request and passes the result to another stage must understand its current state. Without it, it becomes a chatterbox with amnesia.
  • Use multiple prompts. One for analysis, another for decision-making, a third for generation, a fourth for review and improvement. This is architecture — not just one long prompt.

Conclusion: An LLM is just a building block. A real agent is architecture + orchestration. This is where Directual comes in.

Structured Output — So It Doesn’t Just Talk, But Actually Works

If you want the assistant to return not just plain text, but structured data, you need to use Structured Output.

This can be in the form of JSON, XML, markdown, a table, etc.

Why this matters:

  • Easy to parse and use in Directual scenarios
  • Ensures the correctness of responses
  • Precise control over the format

How to Configure Structured Output in Directual

There are two ways to set up Structured Output (SO) on the Directual platform:


Response Format

This is the option where you add "response_format": { "type": "json_object" } to the request, and define the response structure directly in the system prompt. Request example below:
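A sketch of such a request body for a Directual HTTP request step (the model name and the response fields are assumptions, not requirements):

```json
{
  "model": "gpt-4o-mini",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "You are a support assistant. Always answer with a JSON object of the form {\"sentiment\": \"positive | neutral | negative\", \"reply\": \"...\"}."
    },
    { "role": "user", "content": "{{message}}" }
  ]
}
```

Note that `json_object` mode only guarantees syntactically valid JSON, not a specific schema, so the schema itself must be spelled out in the prompt (and the prompt must mention JSON explicitly).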

Function Calling

This option involves using tools. In this case, the response format is guaranteed. Request example below:
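A sketch of a function-calling request (the function name and its fields are assumptions for illustration):

```json
{
  "model": "gpt-4o-mini",
  "messages": [
    { "role": "user", "content": "{{message}}" }
  ],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "classify_ticket",
        "description": "Classify a support ticket",
        "parameters": {
          "type": "object",
          "properties": {
            "category": { "type": "string", "enum": ["billing", "bug", "feature_request"] },
            "priority": { "type": "string", "enum": ["low", "medium", "high"] }
          },
          "required": ["category", "priority"]
        }
      }
    }
  ],
  "tool_choice": { "type": "function", "function": { "name": "classify_ticket" } }
}
```

The model's structured arguments come back in `choices[0].message.tool_calls[0].function.arguments` as a JSON string, which a JSON step in the scenario can parse.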

If you need a different format (for example, XML), specify it directly in the prompt text.

Chain of Thought — Making the LLM Think Out Loud

Chain of Thought (CoT) is a technique where the model reasons step by step. This:

  • improves accuracy
  • makes logic traceable
  • helps catch errors

Combo: CoT + Structured Output

The model reasons through the problem, then returns the final JSON.

You can store the reasoning steps for logging, while showing only the result to the user.

Example of a CoT + SO request from Directual:
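A sketch of such a request (the model name and the field names in the schema are assumptions):

```json
{
  "model": "gpt-4o-mini",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "Think step by step. First reason through the problem, then return a JSON object: {\"reasoning\": [\"step 1\", \"step 2\", \"...\"], \"answer\": \"...\"}"
    },
    { "role": "user", "content": "{{question}}" }
  ]
}
```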

As a result, we get a JSON like this, which can be further processed in the scenario:
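A hypothetical example (the `reasoning` and `answer` field names are not fixed by the platform; they are whatever your prompt defines):

```json
{
  "reasoning": [
    "The user asks about a refund.",
    "The order was placed 40 days ago.",
    "The refund window in the knowledge base is 30 days."
  ],
  "answer": "Unfortunately, the 30-day refund window for this order has passed."
}
```

The `reasoning` array can go to a log field, while only `answer` is shown to the user.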

Back to RAG — Let’s Talk About Chunking

When you have a lot of documents, and they’re long — you need to split them into chunks.

Why it matters:

  • Language models have a context length limit
  • Very long texts are poorly vectorized

Chunking Approaches

In Directual, it’s convenient to implement chunking in three steps:

  1. Split the text into a JSON object like this: { "chunks": [ { "text": "..." }, { "text": "..." }, ... ] }
  2. Use a JSON step to create objects in the Chunks structure
  3. Send the array of objects to a LINK scenario step, where you apply embeddings and link each chunk back to the parent document

There are three methods for splitting text into chunks:

1. By length

For example, 100 words with an overlap of 10.

The code for generating JSON is below.
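A sketch of the "by length" splitter for an Edit object step (the helper name and the `{{text}}` template field are assumptions; adjust to your structure):

```javascript
// Chunking by length: fixed-size word windows with overlap.
// Requires ES2022 (ES13) enabled in START => Advanced to use arrow functions.
const splitByLength = (text, size, overlap) => {
  const words = text.split(/\s+/).filter(w => w.length > 0);
  const chunks = [];
  for (let i = 0; i < words.length; i += size - overlap) {
    chunks.push({ text: words.slice(i, i + size).join(' ') });
    if (i + size >= words.length) break; // last window already covers the tail
  }
  return chunks;
};

// The target field has type json, so the object must be stringified before saving:
JSON.stringify({ chunks: splitByLength(`{{text}}`, 100, 10) });
```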

Note: to use arrow functions, you need to enable ECMAScript 2022 (ES13) in the START => Advanced step. By default, ES6 is used.

Also, when saving a JS object into a field of type json, make sure to wrap the expression with JSON.stringify().

2. By structure

Split into paragraphs. If a chunk is shorter than 5 words, merge it with the next one — it’s probably a heading.

Code for the Edit object step:
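A sketch of the "by structure" splitter (the helper name and the `{{text}}` template are assumptions); short fragments are carried forward and merged into the next paragraph:

```javascript
// Chunking by structure: split on blank lines (paragraphs).
// A chunk shorter than minWords is treated as a heading and merged into the next one.
// Requires ES2022 (ES13) enabled in START => Advanced for arrow functions.
const splitByParagraphs = (text, minWords) => {
  const paragraphs = text.split(/\n\s*\n/).map(p => p.trim()).filter(p => p.length > 0);
  const chunks = [];
  let pending = ''; // accumulates headings too short to stand alone
  for (const p of paragraphs) {
    const merged = pending ? pending + '\n' + p : p;
    if (merged.split(/\s+/).length < minWords) {
      pending = merged; // still too short, keep merging forward
    } else {
      chunks.push({ text: merged });
      pending = '';
    }
  }
  if (pending) chunks.push({ text: pending }); // trailing short fragment
  return chunks;
};

JSON.stringify({ chunks: splitByParagraphs(`{{text}}`, 5) });
```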

3. By meaning

Send a request to ChatGPT:
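A possible request body (the prompt wording and chunk sizes are assumptions; tune them to your documents):

```json
{
  "model": "gpt-4o-mini",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "Split the user's text into semantically coherent chunks of roughly 50-150 words each. Do not rewrite or omit anything. Return JSON: {\"chunks\": [{\"text\": \"...\"}, ...]}"
    },
    { "role": "user", "content": "{{text}}" }
  ]
}
```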

How to Know If Chunking Is Bad

Poor chunking = repetitive answers, broken logic, or “nothing found.”

How to Test LLM for Hallucinations — Use logprobs

Logprobs = the log-probability of each token.

  • High logprob (closer to 0) = confidence
  • Low logprob = uncertainty

Use this to filter unreliable responses.

What you can do:

  • Don’t show uncertain answers
  • Regenerate the response
  • Show the answer with a warning

In combination with Structured Output, you can check confidence at the field level.

On Directual:

  • Add a request step with logprobs: true
  • Visualize it in HTML (color-coded by confidence level)

Code for visualizing the model’s response with logprobs: true:
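A sketch, assuming the OpenAI response shape where per-token log-probabilities live in `choices[0].logprobs.content`; the class names and confidence thresholds are arbitrary choices:

```javascript
// Build colour-coded HTML from a chat completion requested with "logprobs": true.
// `response` is assumed to be the parsed JSON body of the API reply.
const renderLogprobs = (response) => {
  const tokens = response.choices[0].logprobs.content;
  return tokens.map(t => {
    const p = Math.exp(t.logprob);                 // back from log-prob to probability (0..1)
    const cls = p > 0.9 ? 'lp-high' : p > 0.5 ? 'lp-mid' : 'lp-low';
    return `<span class="${cls}" title="p=${p.toFixed(2)}">${t.token}</span>`;
  }).join('');
};
```

If tokens may contain `<` or `&`, escape them before interpolation.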

This code generates the HTML. Additionally, you need to save the CSS in the Web-app => Custom code section:
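For example (the class names are assumptions and must match whatever your HTML generator emits):

```css
/* Confidence classes for logprobs visualization */
.lp-high { background: #d7f5d7; } /* p > 0.9, confident */
.lp-mid  { background: #fdf3c7; } /* 0.5 < p <= 0.9 */
.lp-low  { background: #fbd5d5; } /* p <= 0.5, likely unreliable */
```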


Running an LLM Locally — No APIs, No Cloud

Model: Qwen1.5-1.8B-Chat — a small model, but good enough for demo purposes on a laptop!

Utility Check

Make sure the necessary tools are installed.
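For example, check that Python 3 and pip are available:

```shell
python3 --version   # Python 3.9+ is a reasonable baseline
pip3 --version
```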

Creating an Isolated Python Environment

We’ll use virtualenv to keep dependencies clean.
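A typical setup (the environment name `qwen-env` is arbitrary):

```shell
pip3 install virtualenv        # if not installed yet
virtualenv qwen-env            # create the environment
source qwen-env/bin/activate   # activate it (on Windows: qwen-env\Scripts\activate)
```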

Installing Required Libraries

  • torch — the PyTorch engine Qwen runs on
  • transformers — Hugging Face library for loading and running LLMs
  • accelerate — auto-detects GPU and optimizes inference
  • flask — a lightweight framework for APIs; we’ll use it to run our local server
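The list above can be installed in one command:

```shell
pip install torch transformers accelerate flask
```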

Connecting to Hugging Face

Think of it like GitHub, but for models and neural networks. Qwen is hosted there.

Go to https://huggingface.co/, log in (or sign up if needed), and create a new Read-only token.


Back to the terminal — log in to Hugging Face and enter your new token.
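The Hugging Face CLI installed alongside `transformers` handles this:

```shell
huggingface-cli login
# paste the Read-only token created above when prompted
```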

Now let’s do a quick check to make sure everything works.

Smoke test

Open Jupyter Notebook — a lightweight interactive environment that makes it easy to write and run Python projects step by step.

If it’s not installed, install it via:
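```shell
pip install notebook
jupyter notebook
```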

Create a file named test_qwen.py
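A minimal sketch using the Hugging Face `transformers` chat-template API; the prompt and generation settings are illustrative:

```python
# test_qwen.py — smoke test: load the model and generate one short reply.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen1.5-1.8B-Chat"

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

messages = [{"role": "user", "content": "Say hello in one short sentence."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=50)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```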

Run it in the terminal.
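```shell
python test_qwen.py
```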

The model download may take 10–15 minutes. After that, it will be stored locally and ready to use.

Launching the API

We’ve confirmed that the model runs and responds — now let’s create a server that will expose an API identical to the ChatGPT API.

Create a file called app.py
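A minimal sketch of such a server: it exposes the same `/v1/chat/completions` route and response shape as the ChatGPT API, implementing only the fields a Directual HTTP step needs (this is an assumption-heavy skeleton, not a production server):

```python
# app.py — OpenAI-style chat endpoint wrapping the local Qwen model.
from flask import Flask, request, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen1.5-1.8B-Chat"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

app = Flask(__name__)

@app.route("/v1/chat/completions", methods=["POST"])
def chat():
    body = request.get_json()
    inputs = tokenizer.apply_chat_template(
        body["messages"], add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=body.get("max_tokens", 256))
    text = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
    # Mirror the shape of a ChatGPT API response
    return jsonify({
        "object": "chat.completion",
        "model": MODEL,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": text},
            "finish_reason": "stop",
        }],
    })

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```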

Run it!
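Assuming the server above listens on port 5000:

```shell
python app.py
# in another terminal, a quick check (payload shape mimics the ChatGPT API):
curl http://localhost:5000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```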

Now we have a local API, but we need to make it accessible from the internet — including from Directual.

Creating an HTTPS Tunnel for the API

Register an account at ngrok.com, get your token, and run the following in the terminal:
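Assuming the local API runs on port 5000 (replace the placeholder with your own token):

```shell
ngrok config add-authtoken <YOUR_NGROK_TOKEN>   # one-time setup
ngrok http 5000                                 # tunnel to the local Flask API
```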

The generated API endpoint can be used in an HTTP request step in Directual — just like any other LLM API!

Conclusion

You’ve learned:

  • How Anthropic approaches agent design
  • How Structured Output works
  • How to apply Chain of Thought
  • How to properly chunk text
  • How to catch hallucinations using logprobs
  • How to run a local LLM on your laptop

And most importantly — how to connect it all together on Directual.

Now it’s all about practice.

Apply what you’ve learned, build assistants, create your own agents. Good luck!
