Advanced Techniques for RAG and AI Agents

This lesson dives into advanced techniques: making the model reason step-by-step with Chain of Thought, returning structured outputs, and detecting hallucinations with logprobs. You’ll also learn how to chunk documents effectively — and even run an open-source LLM on your own machine.

What Anthropic Thinks About Building Effective Agents

Recently, the team at Anthropic (creators of Claude) released an article titled Building Effective Agents.
Here are a few key insights from it:

  • Context is everything. An agent doesn’t “know” anything by itself. It works with what you give it — the prompt, structure, and dialogue history.
  • An agent is not a single prompt — it’s a sequence of steps. It’s recommended to design agents as processes, where the LLM draws conclusions, makes decisions, stores intermediate data, and passes it between stages.
  • State matters. An agent that makes a request and passes the result to another stage must understand its current state. Without it, it becomes a chatterbox with amnesia.
  • Use multiple prompts. One for analysis, another for decision-making, a third for generation, a fourth for review and improvement. This is architecture — not just one long prompt.

Conclusion: An LLM is just a building block. A real agent is architecture + orchestration. This is where Directual comes in.
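
To make the "sequence of steps" idea concrete, here is a minimal sketch in plain JavaScript (Node 18+). The callLLM helper, the four stages, and the OPENAI_API_KEY environment variable are illustrative assumptions, not a Directual API; on the platform you would model the same pipeline as separate scenario steps.

// Illustrative only: each stage has its own prompt, and the state object
// carries intermediate data between stages (the "state matters" point above)
async function callLLM(systemPrompt, userContent) {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}` // assumption: key provided via env
    },
    body: JSON.stringify({
      model: "gpt-4o",
      messages: [
        { role: "system", content: systemPrompt },
        { role: "user", content: userContent }
      ]
    })
  });
  return (await res.json()).choices[0].message.content;
}

async function runAgent(request) {
  const state = { request };
  state.analysis = await callLLM("Extract the key facts from the request.", state.request);
  state.decision = await callLLM("Given these facts, decide which team should handle it.", state.analysis);
  state.draft    = await callLLM("Draft a short reply on behalf of that team.", state.decision);
  state.final    = await callLLM("Review the draft and improve its clarity.", state.draft);
  return state; // keep the whole state for logging, show only state.final to the user
}

runAgent("Summarise this support ticket and route it to the right team.").then(console.log);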

Structured Output — So It Doesn’t Just Talk, But Actually Works

If you want the assistant to return not just plain text, but structured data, you need to use Structured Output (SO).

This can be in the form of JSON, XML, markdown, a table, etc.

Why this matters:

  • Easy to parse and use in Directual scenarios
  • Ensures the correctness of responses
  • Precise control over the format

How to Configure Structured Output in Directual

There are two ways to set up SO on the Directual platform:

Response Format

With this option, you add "response_format": { "type": "json_object" } to the request and describe the required response structure directly in the system prompt. Request example below:

{
  "model": "gpt-3.5-turbo",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "Respond strictly with a JSON object containing the fields: title (string), summary (string), items (array of strings)."
    },
    {
      "role": "user",
      "content": "Compose a summary: {{#escapeJson}}{{text}}{{/escapeJson}}"
    }
  ]
} 

Function Calling

This option uses tools (function calling); here the response format is guaranteed by the function's parameter schema. Request example below:

{
  "model": "gpt-3.5-turbo",
  "tool_choice": "auto",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "generate_summary",
        "description": "Structured output with title, summary, and items.",
        "parameters": {
          "type": "object",
          "properties": {
            "title": { "type": "string" },
            "summary": { "type": "string" },
            "items": {
              "type": "array",
              "items": { "type": "string" }
            }
          },
          "required": ["title", "summary", "items"]
        }
      }
    }
  ],
  "messages": [
    {
      "role": "user",
      "content": "Сделай summary по тексту: {{#escapeJson}}{{text}}{{/escapeJson}}"
    }
  ]
}

If you need a different format (for example, XML), specify it directly in the prompt text.

{
  "model": "gpt-3.5-turbo",
  "messages": [
    {
      "role": "system",
      "content": "You are an assistant that always returns output in valid XML format. Wrap the entire response in a single <response> root tag. Do not include any explanation or commentary. Output only XML. Tags inside: title (string), summary (string), items (array of strings)."
    },
    {
      "role": "user",
      "content": "Compose a summary: {{#escapeJson}}{{text}}{{/escapeJson}}"
    }
  ]
} 

Chain of Thought — Making the LLM Think Out Loud

Chain of Thought (CoT) is a technique where the model reasons step by step. This:

  • improves accuracy
  • makes logic traceable
  • helps catch errors

Combo: CoT + Structured Output

The model reasons through the problem, then returns a final JSON.

You can store the reasoning steps for logging, while showing only the result to the user.

Example of a CoT + SO request from Directual:

{
  "model": "gpt-4o",
  "tool_choice": "auto",
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "math_reasoning",
        "description": "Explaining the math problem solution",
        "parameters": {
          "type": "object",
          "properties": {
            "steps": {
              "type": "array",
              "items": {
                "type": "object",
                "properties": {
                  "explanation": { "type": "string" },
                  "output": { "type": "string" }
                },
                "required": ["explanation", "output"],
                "additionalProperties": false
              }
            },
            "final_answer": { "type": "string" }
          },
          "required": ["steps", "final_answer"],
          "additionalProperties": false
        }
      }
    }
  ],
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful math tutor. Always answer using the math_reasoning function."
    },
    {
      "role": "user",
      "content": "how can I solve 8x + 7 = -23"
    }
  ]
}

As a result, we get a JSON like this, which can be further processed in the scenario:

{
   "steps": [
      {
         "explanation": "We start with the equation 8x + 7 = -23. Our goal is to solve for x.",
         "output": "8x + 7 = -23"
      },
      {
         "explanation": "To isolate the term with x, we first need to get rid of the constant on the left side. We do this by subtracting 7 from both sides of the equation.",
         "output": "8x = -23 - 7"
      },
      {
         "explanation": "Simplify the right side by performing the subtraction, which gives us -30.",
         "output": "8x = -30"
      },
      {
         "explanation": "Now, we need to solve for x. Since 8 is multiplied by x, we divide both sides of the equation by 8 to isolate x.",
         "output": "x = -30 / 8"
      },
      {
         "explanation": "Simplify the right side by performing the division, and reduce the fraction to its simplest form. -30 divided by 8 is -3.75, which can also be expressed as the fraction -15/4.",
         "output": "x = -3.75 or x = -15/4"
      }
   ],
   "final_answer": "x = -3.75 or x = -15/4"
}
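
With the Function Calling option, the arguments come back as a JSON string inside the tool call, so they need to be parsed before use. A minimal sketch for an Edit object step (it assumes the raw ChatGPT response is stored in a json field named response):

// Hypothetical Edit object step: parse the tool-call arguments so the scenario
// can keep the reasoning steps for logging and show only the final answer
JSON.stringify(
  JSON.parse({{response}}.choices[0].message.tool_calls[0].function.arguments)
)

From there, the steps array can be written to a log field while only final_answer is shown to the user.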

Back to RAG — Let’s Talk About Chunking

When you have a lot of long documents, you need to split them into chunks.

Why it matters:

  • Language models have a context length limit
  • Very long texts are poorly vectorized

Chunking Approaches

In Directual, it’s convenient to implement chunking in three steps:

  1. Split the text into a JSON object like this: { "chunks": [ { "text": "..." }, { "text": "..." }, ... ] }
  2. Use a JSON step to create objects in the Chunks structure
  3. Send the array of objects to a LINK scenario step, where you apply embeddings and link each chunk back to the parent document

There are three methods for splitting text into chunks:

1. By length

For example, 100 words with an overlap of 10.

The code for generating JSON is below.

Note: to use arrow functions, you need to enable ECMAScript 2022 (ES13) in the START => Advanced step. By default, ES6 is used.

Also, when saving a JS object into a field of type json, make sure to wrap the expression with JSON.stringify().

JSON.stringify({
  "chunks": _.chain(`{{text}}`.split(/\s+/))
    .thru(words => {
      const chunkSize = 100; // words per chunk
      const overlap = 10;    // words shared with the previous chunk
      const chunks = [];
      // slide through the text in 100-word windows, stepping forward by 90 words
      for (let i = 0; i < words.length; i += (chunkSize - overlap)) {
        chunks.push({ text: words.slice(i, i + chunkSize).join(' ') });
      }
      return chunks;
    })
    .value()
})

2. By structure

Split into paragraphs. If a chunk is shorter than 5 words, merge it with the next one — it’s probably a heading.

Code for the Edit object step:

JSON.stringify({
  "chunks": _.chain(`{{text}}`.split(/\n+/)) // split the text into paragraphs
    .map(_.trim)
    .filter(p => p.length > 0)               // drop empty lines
    .thru(paragraphs => {
      const chunks = [];
      let i = 0;
      while (i < paragraphs.length) {
        const current = paragraphs[i];
        // a paragraph shorter than 5 words is probably a heading, so merge it with the next one
        if (current.split(/\s+/).length < 5 && i + 1 < paragraphs.length) {
          chunks.push({ text: current + ' ' + paragraphs[i + 1] });
          i += 2;
        } else {
          chunks.push({ text: current });
          i += 1;
        }
      }
      return chunks;
    })
    .value()
})

3. By meaning

Send a request to ChatGPT:

{
  "model": "gpt-4o",
  "response_format": { "type": "json_object" },
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant that splits text into meaningful semantic chunks. Each chunk should represent a coherent idea, paragraph, or topic segment. Return your output strictly as a JSON object with the structure: { \"chunks\": [ {\"text\": \"...\"} ] }."
    },
    {
      "role": "user",
      "content": "Split the following text into semantic chunks:{{#escapeJson}}{{text}}{{/escapeJson}}"
    }
  ]
} 

How to Know If Chunking Is Bad

Poor chunking = repetitive answers, broken logic, or “nothing found.”

How to Test an LLM for Hallucinations — Use logprobs

Logprobs = the log-probability of each token.

  • High logprob (closer to 0) = confidence
  • Low logprob = uncertainty

Use this to filter unreliable responses.

What you can do:

  • Don’t show uncertain answers
  • Regenerate the response
  • Show the answer with a warning

In combination with Structured Output, you can check confidence at the field level.
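
One simple way to automate this on the scenario side is to average the token logprobs and branch on the result. A minimal sketch for an Edit object step (the response field name and the -0.3 threshold are illustrative; tune the threshold on your own data):

// Average token logprob as a rough confidence score for the whole response
_.meanBy({{response}}.choices[0].logprobs.content, 'logprob') > -0.3
  ? 'confident'
  : 'needs_review'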

On Directual:

  • Add a request step with logprobs: true
  • Visualize it in HTML (color-coded by confidence level)

Code for visualizing the model’s response with logprobs: true:

// Colour each token according to how confident the model was in it
_.map({{response}}.choices[0].logprobs.content, ({ token, logprob }) => {
  const confidenceClass =
    logprob > -1e-5 ? 'logProbs-high' :    // practically certain
    logprob > -1e-4 ? 'logProbs-medium' :  // still very likely
    'logProbs-low';                        // noticeably less certain

  const safeToken = _.escape(token);       // escape HTML before rendering
  return `<span class="${confidenceClass}">${safeToken}</span>`;
}).join('')

This code generates the HTML. Additionally, you need to save the CSS in the Web-app => Custom code section:

<style>
  .logProbs-high {
      background-color: #d4edda; /* GREEN */
      color: #155724;
    }
    .logProbs-medium {
      background-color: #fff3cd; /* YELLOW */
      color: #856404;
    }
    .logProbs-low {
      background-color: #f8d7da; /* RED */
      color: #721c24;
    }
</style>

Running an LLM Locally — No APIs, No Cloud

Model: Qwen 1.5 1.8B Chat — a small model, but enough for demo purposes on a laptop!

Utility Check

Make sure the necessary tools are installed.

brew --version  
python3 --version  
pip3 --version  
virtualenv --version

Creating an Isolated Python Environment

We’ll use virtualenv to keep dependencies clean.

virtualenv qwen_env  
source qwen_env/bin/activate

Installing Required Libraries

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu  
pip install transformers accelerate flask
  • torch — the PyTorch engine Qwen runs on
  • transformers — Hugging Face library for loading and running LLMs
  • accelerate — auto-detects GPU and optimizes inference
  • flask — a lightweight framework for APIs; we’ll use it to run our local server

Connecting to Hugging Face

Think of it like GitHub, but for models and neural networks. Qwen is hosted there.

Go to https://huggingface.co/, log in (or sign up if needed), and create a new Read-only token.

Back to the terminal — log in to Hugging Face and enter your new token.

pip install huggingface_hub  
huggingface-cli login

Now let’s do a quick check to make sure everything works.

Smoke test

Open Jupyter Notebook — a lightweight interactive environment that makes it easy to write and run Python projects step by step.

jupyter notebook

If it’s not installed, install it via:

pip install notebook

Create a file named test_qwen.py

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen1.5-1.8B-Chat"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

input_text = "Tell me how LLM works!"
inputs = tokenizer(input_text, return_tensors="pt")

output = model.generate(**inputs, max_new_tokens=100)
result = tokenizer.decode(output[0], skip_special_tokens=True)

print(result)

Run it in the terminal.

python test_qwen.py

The model download may take 10–15 minutes. After that, it will be stored locally and ready to use.

Launching the API

We’ve confirmed that the model runs and responds — now let’s create a server that will expose an API identical to the ChatGPT API.

Create a file called app.py

from flask import Flask, request, jsonify
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model and tokenizer
model_name = "Qwen/Qwen1.5-1.8B-Chat"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Create Flask app
app = Flask(__name__)

@app.route('/v1/chat/completions', methods=['POST'])
def chat_completion():
    data = request.json
    messages = data.get("messages", [])

    # Build the prompt with role prefixes
    prompt = "".join([f"{m['role']}: {m['content']}\n" for m in messages])
    prompt += "assistant: "

    # Tokenize input
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids

    # Generate response
    output_ids = model.generate(
        input_ids,
        max_new_tokens=256,
        eos_token_id=tokenizer.eos_token_id
    )

    # Decode model output
    decoded = tokenizer.decode(output_ids[0], skip_special_tokens=True)

    # Extract only assistant's message
    assistant_reply = decoded.split("assistant:")[-1].strip()

    # Build response JSON
    return jsonify({
        "id": "chatcmpl-local",
        "object": "chat.completion",
        "created": 1234567890,
        "model": model_name,
        "choices": [
            {
                "index": 0,
                "message": {
                    "role": "assistant",
                    "content": assistant_reply
                },
                "finish_reason": "stop"
            }
        ]
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

Run it!

python app.py
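
Before exposing the server, it is worth checking that the endpoint answers. A minimal smoke test using the built-in fetch in Node 18+ (any HTTP client will do; the body mirrors the ChatGPT API format):

// Quick local test of the Flask server started above
fetch("http://localhost:8000/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    messages: [{ role: "user", content: "Tell me how LLM works!" }]
  })
})
  .then(res => res.json())
  .then(data => console.log(data.choices[0].message.content));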

Now we have a local API, but we need to make it accessible from the internet — including from Directual.

Creating an HTTPS Tunnel for the API

Register an account at ngrok.com, get your token, and run the following in the terminal:

ngrok http 8000

The generated API endpoint can be used in an HTTP request step in Directual — just like any other LLM API!

Conclusion

You’ve learned:

  • How Anthropic approaches agent design
  • How Structured Output works
  • How to apply Chain of Thought
  • How to properly chunk text
  • How to catch hallucinations using logprobs
  • How to run a local LLM on your laptop

And most importantly — how to connect it all together on Directual.

Now it’s all about practice.

Apply what you’ve learned, build assistants, create your own agents. Good luck!
