<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: angu10</title>
    <description>The latest articles on DEV Community by angu10 (@angu10).</description>
    <link>https://dev.to/angu10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1044663%2F6f0294a9-2404-41a6-b770-c2062d20a809.png</url>
      <title>DEV Community: angu10</title>
      <link>https://dev.to/angu10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/angu10"/>
    <language>en</language>
    <item>
      <title>Building CloudGuide: A Real-Time AI Assistant for Navigating Google Cloud</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Fri, 13 Mar 2026 05:34:40 +0000</pubDate>
      <link>https://dev.to/angu10/building-cloudguide-a-real-time-ai-assistant-for-navigating-google-cloud-m8b</link>
      <guid>https://dev.to/angu10/building-cloudguide-a-real-time-ai-assistant-for-navigating-google-cloud-m8b</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Cloud platforms are incredibly powerful — but navigating them can be confusing even for experienced developers.&lt;/p&gt;

&lt;p&gt;Recently, while setting up a static website on Google Cloud Storage, I realized how easy it is to make small mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A permission checkbox hidden deep in the UI&lt;/li&gt;
&lt;li&gt;A configuration buried under another menu&lt;/li&gt;
&lt;li&gt;A setting that appears correct but fails silently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In many cases, developers spend more time &lt;strong&gt;searching the console UI&lt;/strong&gt; than actually building their application.&lt;/p&gt;

&lt;p&gt;That led to a question:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What if an AI assistant could watch your screen, listen to your question, and guide you step‑by‑step through cloud configuration?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That idea became &lt;strong&gt;CloudGuide&lt;/strong&gt;, a multimodal AI agent built with &lt;strong&gt;Google AI models and Google Cloud&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This project was developed specifically for the &lt;strong&gt;Google AI Hackathon&lt;/strong&gt;, and this post explains how it works under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Idea
&lt;/h2&gt;

&lt;p&gt;CloudGuide is a &lt;strong&gt;voice-enabled AI assistant that helps users configure cloud resources in real time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of reading documentation or watching tutorials, users can simply say:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Help me deploy a website on Google Cloud.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The AI agent then:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Watches the user’s screen&lt;/li&gt;
&lt;li&gt;Listens to voice input&lt;/li&gt;
&lt;li&gt;Speaks instructions&lt;/li&gt;
&lt;li&gt;Highlights UI elements to click&lt;/li&gt;
&lt;li&gt;Verifies steps using real Google Cloud APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The goal is to turn complex cloud configuration into a &lt;strong&gt;guided interactive experience&lt;/strong&gt;.&lt;/p&gt;


&lt;h2&gt;
  
  
  Key Capabilities
&lt;/h2&gt;

&lt;p&gt;CloudGuide combines several multimodal capabilities:&lt;/p&gt;
&lt;h3&gt;
  
  
  Screen Understanding
&lt;/h3&gt;

&lt;p&gt;The system captures periodic screenshots of the user's screen and sends them to the AI model for interpretation.&lt;/p&gt;
&lt;h3&gt;
  
  
  Voice Interaction
&lt;/h3&gt;

&lt;p&gt;Users communicate naturally through a microphone, asking questions or requesting help.&lt;/p&gt;
&lt;h3&gt;
  
  
  Real-Time Voice Responses
&lt;/h3&gt;

&lt;p&gt;The AI responds with &lt;strong&gt;native audio output&lt;/strong&gt; using Gemini’s audio model.&lt;/p&gt;
&lt;h3&gt;
  
  
  API Grounding
&lt;/h3&gt;

&lt;p&gt;Instead of trusting screenshots alone, the system verifies actions using the &lt;strong&gt;Google Cloud Storage API&lt;/strong&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Visual Click Guidance
&lt;/h3&gt;

&lt;p&gt;The system highlights the exact UI element users need to click.&lt;/p&gt;

&lt;p&gt;This dramatically reduces confusion when navigating complex cloud interfaces.&lt;/p&gt;


&lt;h2&gt;
  
  
  System Architecture
&lt;/h2&gt;

&lt;p&gt;The system consists of three primary layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Client&lt;/li&gt;
&lt;li&gt;Backend&lt;/li&gt;
&lt;li&gt;AI Model&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Client (User Machine)
&lt;/h3&gt;

&lt;p&gt;The client runs locally and handles:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Screen capture&lt;/li&gt;
&lt;li&gt;Microphone input&lt;/li&gt;
&lt;li&gt;Speaker output&lt;/li&gt;
&lt;li&gt;Browser highlighting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Technologies used:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;mss&lt;/strong&gt; – screen capture&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;pyaudio&lt;/strong&gt; – audio streaming&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Playwright + Chrome DevTools Protocol&lt;/strong&gt; – UI highlighting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The client streams screenshots and audio to the backend via WebSocket.&lt;/p&gt;
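&lt;p&gt;To make the streaming concrete, here is a minimal sketch of how a client could frame each screenshot and audio chunk as a WebSocket text message. The field names (&lt;code&gt;type&lt;/code&gt;, &lt;code&gt;mime_type&lt;/code&gt;, &lt;code&gt;data&lt;/code&gt;) are assumptions for illustration, not CloudGuide's actual wire protocol; in the real client the bytes would come from mss and pyaudio.&lt;/p&gt;

```python
import base64
import json

def build_frame_message(jpeg_bytes: bytes) -> str:
    # Wrap one captured screenshot as a JSON text frame for the WebSocket.
    # In the real client, jpeg_bytes would come from an mss screen grab.
    return json.dumps({
        "type": "screen",          # hypothetical field names for illustration
        "mime_type": "image/jpeg",
        "data": base64.b64encode(jpeg_bytes).decode("ascii"),
    })

def build_audio_message(pcm_chunk: bytes) -> str:
    # Same framing for a raw PCM chunk read from the pyaudio input stream.
    return json.dumps({
        "type": "audio",
        "mime_type": "audio/pcm",
        "data": base64.b64encode(pcm_chunk).decode("ascii"),
    })
```

&lt;p&gt;Base64 keeps binary payloads safe inside JSON text frames, at the cost of roughly 33% size overhead.&lt;/p&gt;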


&lt;h3&gt;
  
  
  Backend (Google Cloud Run)
&lt;/h3&gt;

&lt;p&gt;The backend is deployed on &lt;strong&gt;Google Cloud Run&lt;/strong&gt; and built with &lt;strong&gt;FastAPI&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Responsibilities include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing WebSocket connections&lt;/li&gt;
&lt;li&gt;Streaming data to the AI model&lt;/li&gt;
&lt;li&gt;Executing tool calls&lt;/li&gt;
&lt;li&gt;Verifying cloud configuration through APIs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All multimodal input flows through a request queue before being sent to the AI model.&lt;/p&gt;
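&lt;p&gt;A queue like this can be sketched with plain &lt;code&gt;asyncio&lt;/code&gt;: producers (the WebSocket handlers) put audio and screenshot items on a shared queue, and a single consumer forwards them to the model session in arrival order. The item shapes and the &lt;code&gt;None&lt;/code&gt; sentinel are illustrative assumptions, not the actual backend code.&lt;/p&gt;

```python
import asyncio

async def forward_requests(queue, send):
    # Drain the shared request queue in arrival order and forward each
    # multimodal item (audio chunk or screenshot) to the model session.
    while True:
        item = await queue.get()
        if item is None:  # sentinel: the client disconnected
            break
        await send(item)

async def demo():
    queue = asyncio.Queue()
    sent = []

    async def send(item):  # stand-in for the real model-session send call
        sent.append(item)

    consumer = asyncio.create_task(forward_requests(queue, send))
    for item in ({"type": "audio"}, {"type": "screen"}, None):
        await queue.put(item)
    await consumer
    return sent
```

&lt;p&gt;Serializing everything through one consumer keeps the interleaving of audio and frames deterministic, which matters when the model treats them as a single conversation stream.&lt;/p&gt;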


&lt;h3&gt;
  
  
  AI Model (Gemini Live)
&lt;/h3&gt;

&lt;p&gt;CloudGuide uses the &lt;strong&gt;Gemini Live API&lt;/strong&gt; with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;gemini-2.5-flash-native-audio-latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model supports:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time audio streaming&lt;/li&gt;
&lt;li&gt;Image understanding&lt;/li&gt;
&lt;li&gt;Tool calling&lt;/li&gt;
&lt;li&gt;Natural voice output&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model processes &lt;strong&gt;audio and screenshots together&lt;/strong&gt; within a single streaming session.&lt;/p&gt;

&lt;p&gt;This creates a natural conversational experience.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tool Calling and API Grounding
&lt;/h2&gt;

&lt;p&gt;One challenge with vision-based AI systems is that screenshots can be misleading.&lt;/p&gt;

&lt;p&gt;For example:&lt;/p&gt;

&lt;p&gt;A UI might show a resource as created even though the underlying API operation failed.&lt;/p&gt;

&lt;p&gt;To solve this, CloudGuide uses &lt;strong&gt;tool calling&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Example tools include:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nf"&gt;check_bucket&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;list_bucket_files&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;check_bucket_permissions&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="nf"&gt;diagnose_bucket_issues&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These tools query the &lt;strong&gt;Google Cloud Storage API&lt;/strong&gt; directly.&lt;/p&gt;

&lt;p&gt;This allows the AI agent to verify that each step actually worked.&lt;/p&gt;
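&lt;p&gt;As a sketch of one such grounding tool (not CloudGuide's exact code): &lt;code&gt;google.cloud.storage.Client.lookup_bucket&lt;/code&gt; returns &lt;code&gt;None&lt;/code&gt; when a bucket does not exist, which lets the tool report ground truth regardless of what the console screenshot shows. The client is passed in as a parameter here so the tool is easy to test with a stub.&lt;/p&gt;

```python
def check_bucket(storage_client, bucket_name: str) -> dict:
    # Ground the agent's belief in reality: ask the Cloud Storage API
    # whether the bucket actually exists, instead of trusting pixels.
    # storage_client is a google.cloud.storage.Client; lookup_bucket()
    # returns None when the bucket is absent.
    bucket = storage_client.lookup_bucket(bucket_name)
    if bucket is None:
        return {"exists": False, "detail": f"Bucket '{bucket_name}' not found"}
    return {"exists": True, "location": bucket.location}
```

&lt;p&gt;The structured return value can be fed straight back to the model as a tool result, so its next spoken instruction is based on API state, not on what the screenshot appears to show.&lt;/p&gt;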




&lt;h2&gt;
  
  
  Visual UI Guidance
&lt;/h2&gt;

&lt;p&gt;Another major feature is &lt;strong&gt;element highlighting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Before asking the user to click something, the system highlights the UI element.&lt;/p&gt;

&lt;p&gt;A pulsing rectangle appears around the button or input field.&lt;/p&gt;

&lt;p&gt;This is implemented using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Playwright&lt;/li&gt;
&lt;li&gt;Chrome DevTools Protocol&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The backend sends highlight instructions, and the client injects an overlay into the browser.&lt;/p&gt;
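&lt;p&gt;A minimal version of that overlay injection can be done with Playwright's &lt;code&gt;page.evaluate&lt;/code&gt;, which runs a JavaScript snippet inside the live tab. The selector, colors, and styling below are illustrative assumptions; a production overlay would also inject the CSS keyframes for the pulsing effect.&lt;/p&gt;

```python
# JavaScript evaluated inside the user's tab; the selector is passed as an argument.
HIGHLIGHT_JS = """
(selector) => {
  const el = document.querySelector(selector);
  if (!el) return false;                    // element not on this page
  el.style.outline = '3px solid #fbbc04';   // draw a visible rectangle
  el.style.outlineOffset = '2px';
  el.scrollIntoView({block: 'center'});     // make sure the user can see it
  return true;
}
"""

def highlight(page, selector: str) -> bool:
    # `page` is a Playwright Page attached over the Chrome DevTools Protocol;
    # evaluate() executes the snippet in the browser and returns its result.
    return page.evaluate(HIGHLIGHT_JS, selector)
```

&lt;p&gt;Returning &lt;code&gt;False&lt;/code&gt; when the element is missing lets the backend fall back to a spoken description instead of silently highlighting nothing.&lt;/p&gt;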




&lt;h2&gt;
  
  
  Challenges Encountered
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Audio Feedback Loops
&lt;/h3&gt;

&lt;p&gt;When the AI speaks through speakers, the microphone can pick up the sound and send it back to the model.&lt;/p&gt;

&lt;p&gt;This creates a feedback loop where the model responds to itself.&lt;/p&gt;

&lt;p&gt;Using headphones mitigates this issue, but future improvements could include built‑in echo cancellation.&lt;/p&gt;
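&lt;p&gt;One simple software mitigation (a hypothetical sketch, not something CloudGuide currently implements) is a half-duplex gate: drop microphone chunks while the assistant's audio is playing, plus a short tail afterwards, so the model never hears its own voice.&lt;/p&gt;

```python
import time

class HalfDuplexGate:
    # Hypothetical mitigation: mute the mic while (and briefly after)
    # assistant audio is playing, so speaker output never loops back in.
    def __init__(self, tail_s=0.5):
        self.tail_s = tail_s          # extra mute time after playback ends
        self._playing_until = 0.0

    def notify_playback(self, duration_s, now=None):
        # Call when a chunk of assistant audio starts playing.
        now = time.monotonic() if now is None else now
        self._playing_until = max(self._playing_until,
                                  now + duration_s + self.tail_s)

    def allow_mic_chunk(self, now=None):
        # Call per mic chunk; drop the chunk when this returns False.
        now = time.monotonic() if now is None else now
        return now >= self._playing_until
```

&lt;p&gt;This sacrifices barge-in (the user cannot interrupt the assistant mid-sentence), which is why proper acoustic echo cancellation remains the better long-term fix.&lt;/p&gt;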




&lt;h3&gt;
  
  
  Voice Activity Detection
&lt;/h3&gt;

&lt;p&gt;The Live API sometimes interprets background noise as speech.&lt;/p&gt;

&lt;p&gt;Adding voice activity detection would improve reliability.&lt;/p&gt;




&lt;h3&gt;
  
  
  UI Changes
&lt;/h3&gt;

&lt;p&gt;Cloud interfaces evolve frequently.&lt;/p&gt;

&lt;p&gt;During development, certain UI paths moved or appeared differently than documented, which required adjustments to the workflow detection logic.&lt;/p&gt;




&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;p&gt;The backend runs on &lt;strong&gt;Google Cloud Run&lt;/strong&gt; and is deployed through a CI/CD pipeline using:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud Build&lt;/li&gt;
&lt;li&gt;Container Registry&lt;/li&gt;
&lt;li&gt;Docker&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployment is automated through a simple script that builds and deploys the backend service.&lt;/p&gt;

&lt;p&gt;This makes the system scalable and accessible from anywhere.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building this project revealed several key insights:&lt;/p&gt;

&lt;h3&gt;
  
  
  Multimodal AI is powerful
&lt;/h3&gt;

&lt;p&gt;Combining screen understanding with voice interaction creates a much more natural interface.&lt;/p&gt;

&lt;h3&gt;
  
  
  API grounding improves reliability
&lt;/h3&gt;

&lt;p&gt;Vision alone is not enough. Verifying system state using APIs is essential.&lt;/p&gt;

&lt;h3&gt;
  
  
  Voice interfaces still need infrastructure improvements
&lt;/h3&gt;

&lt;p&gt;Capabilities like echo cancellation and voice activity detection will make voice agents significantly more robust.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Cloud platforms are incredibly capable, but their complexity often slows developers down.&lt;/p&gt;

&lt;p&gt;Projects like CloudGuide explore a new paradigm:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI agents that guide users through complex systems in real time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;By combining:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Google AI models&lt;/li&gt;
&lt;li&gt;Gemini Live streaming&lt;/li&gt;
&lt;li&gt;Google Cloud Run&lt;/li&gt;
&lt;li&gt;Real API grounding&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;we can create assistants that truly understand what users are doing.&lt;/p&gt;

&lt;p&gt;This project was built with &lt;strong&gt;Google AI models and Google Cloud services&lt;/strong&gt;, and this post was written specifically as an entry to the &lt;strong&gt;Google AI Hackathon&lt;/strong&gt;.&lt;/p&gt;

</description>
      <category>agents</category>
      <category>geminiliveagentchallenge</category>
      <category>python</category>
      <category>gcp</category>
    </item>
    <item>
      <title>Stop Print Debugging Your AI Agents: A Deep Dive into Agent Observability</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Sun, 04 Jan 2026 05:15:46 +0000</pubDate>
      <link>https://dev.to/angu10/stop-print-debugging-your-ai-agents-a-deep-dive-into-agent-observability-29eo</link>
      <guid>https://dev.to/angu10/stop-print-debugging-your-ai-agents-a-deep-dive-into-agent-observability-29eo</guid>
      <description>&lt;h2&gt;
  
  
  Table of Contents
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;The Invisible Agent Problem&lt;/li&gt;
&lt;li&gt;Why Current Solutions Fall Short&lt;/li&gt;
&lt;li&gt;Introducing Agent Recorder&lt;/li&gt;
&lt;li&gt;How It Works: Technical Deep Dive&lt;/li&gt;
&lt;li&gt;Real-World Use Cases&lt;/li&gt;
&lt;li&gt;Comparing Approaches&lt;/li&gt;
&lt;li&gt;Building Production-Ready Agents&lt;/li&gt;
&lt;li&gt;The Road Ahead&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Invisible Agent Problem
&lt;/h2&gt;

&lt;p&gt;It's 2 AM. Your AI agent just went into an infinite loop consuming API credits. Again.&lt;/p&gt;

&lt;p&gt;You've built what should be a simple customer service agent:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parse user question&lt;/li&gt;
&lt;li&gt;Search knowledge base&lt;/li&gt;
&lt;li&gt;Query database if needed&lt;/li&gt;
&lt;li&gt;Format response&lt;/li&gt;
&lt;li&gt;Maybe escalate to human support&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Simple, right? Except somewhere in those 5 steps, your agent:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Called the same database query 15 times&lt;/li&gt;
&lt;li&gt;Got stuck in a loop asking the LLM to "try again"&lt;/li&gt;
&lt;li&gt;Hallucinated data that doesn't exist&lt;/li&gt;
&lt;li&gt;Crashed with a cryptic error in step 4&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And you have &lt;strong&gt;no idea which one&lt;/strong&gt; until you start debugging.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Print Statement Spiral
&lt;/h3&gt;

&lt;p&gt;So you do what every developer does. You add logging:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] Calling LLM with: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;start&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] LLM took &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] Got response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] Searching DB: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; results&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_customer_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] Getting customer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[DEBUG] Customer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Unknown&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;customer&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An hour later, your terminal looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[DEBUG] Calling LLM with: Find all orders for customer John Smith...
[DEBUG] LLM took 1.23s
[DEBUG] Got response: I'll search for that customer...
[DEBUG] Searching DB: customer_name=John Smith
[DEBUG] Found 2 results
[DEBUG] Getting customer 123
[DEBUG] Customer: John Smith
[DEBUG] Calling LLM with: Here are the customer details: {'id': 123...
[DEBUG] LLM took 0.87s
[DEBUG] Got response: Let me get their orders...
[DEBUG] Searching DB: orders WHERE customer_id=123
[DEBUG] Found 3 results
[DEBUG] Calling LLM with: Here are the orders: [{'id': 1001, 'to...
[DEBUG] LLM took 1.45s
[DEBUG] Got response: The customer has 3 orders...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You're staring at hundreds of lines of logs trying to answer basic questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How many times did we call the LLM?&lt;/li&gt;
&lt;li&gt;What was the total execution time?&lt;/li&gt;
&lt;li&gt;Which step failed?&lt;/li&gt;
&lt;li&gt;What were the actual arguments passed to each function?&lt;/li&gt;
&lt;li&gt;When did it start looping?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;This is not sustainable.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Real Cost of Poor Observability
&lt;/h3&gt;

&lt;p&gt;Let me share some real numbers from my experience building AI agents:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time Spent Debugging:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Print debugging: 2-4 hours per bug&lt;/li&gt;
&lt;li&gt;Adding proper logging: 30 minutes per function&lt;/li&gt;
&lt;li&gt;Actually finding the bug: 15 minutes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Total:&lt;/strong&gt; 3-5 hours for issues that should take 15 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Developer Frustration:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Losing context between debugging sessions&lt;/li&gt;
&lt;li&gt;Unable to reproduce issues&lt;/li&gt;
&lt;li&gt;No way to compare "working" vs "broken" runs&lt;/li&gt;
&lt;li&gt;Every new team member asks: "How do I debug this?"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;API Inefficiency:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Agents making 3x more API calls than necessary&lt;/li&gt;
&lt;li&gt;Inefficient prompts using excessive tokens&lt;/li&gt;
&lt;li&gt;Unable to identify performance bottlenecks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We've spent decades building amazing developer tools for web apps, mobile apps, and backend services. But for AI agents? We're back to &lt;code&gt;print()&lt;/code&gt; statements like it's 1995.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Current Solutions Fall Short
&lt;/h2&gt;

&lt;p&gt;Before building Agent Recorder, I tried everything:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Standard Logging Libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;

&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Calling LLM with prompt: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Got response: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Still just text logs in a file&lt;/li&gt;
&lt;li&gt;No structure, no visualization&lt;/li&gt;
&lt;li&gt;Manual instrumentation everywhere&lt;/li&gt;
&lt;li&gt;Hard to correlate across async calls&lt;/li&gt;
&lt;li&gt;No timing information without extra code&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Cloud Observability Tools (DataDog, New Relic, etc.)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Expensive for small teams and individuals&lt;/li&gt;
&lt;li&gt;Send your prompts/responses to third-party servers (security issue)&lt;/li&gt;
&lt;li&gt;Heavy SDKs that bloat your dependencies&lt;/li&gt;
&lt;li&gt;Designed for traditional apps, not agent workflows&lt;/li&gt;
&lt;li&gt;Over-engineered for "just see what my agent did"&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. LLM Provider Dashboards (OpenAI, Anthropic)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only see LLM calls, not your tool calls&lt;/li&gt;
&lt;li&gt;No local context (what led to this call?)&lt;/li&gt;
&lt;li&gt;Delayed (not real-time)&lt;/li&gt;
&lt;li&gt;Can't see your custom logic&lt;/li&gt;
&lt;li&gt;Vendor lock-in&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Framework-Specific Tools (LangSmith for LangChain)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Problems:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Only works with that framework&lt;/li&gt;
&lt;li&gt;Requires rewriting code to use their patterns&lt;/li&gt;
&lt;li&gt;Still cloud-based with subscription fees&lt;/li&gt;
&lt;li&gt;What if you use raw APIs or multiple frameworks?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What I needed was simple:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;See every LLM call and tool call&lt;/li&gt;
&lt;li&gt;Local storage (my data, my machine)&lt;/li&gt;
&lt;li&gt;Framework-agnostic (works with anything)&lt;/li&gt;
&lt;li&gt;Minimal code changes&lt;/li&gt;
&lt;li&gt;Beautiful visualization&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Free and open source&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That tool didn't exist. So I built it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introducing Agent Recorder
&lt;/h2&gt;

&lt;p&gt;Agent Recorder is &lt;strong&gt;Redux DevTools for AI agents&lt;/strong&gt;. If you've ever used Redux DevTools for React development, you know the power of seeing every action, every state change, with the ability to inspect, time-travel, and understand your application flow.&lt;/p&gt;

&lt;p&gt;Now imagine that, but for your AI agent's execution.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Two-Decorator Solution
&lt;/h3&gt;

&lt;p&gt;Here's all you need to add to your code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_recorder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;

&lt;span class="nd"&gt;@llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-service-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="nd"&gt;@tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-service-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;

&lt;span class="nd"&gt;@tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer-service-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_customer_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE customer_id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No context managers, no complex setup, no configuration files.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The &lt;code&gt;run_name&lt;/code&gt; parameter groups related calls together.&lt;/strong&gt; All functions decorated with &lt;code&gt;run_name="customer-service-agent"&lt;/code&gt; will be recorded in the same timeline.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Gets Captured Automatically
&lt;/h3&gt;

&lt;p&gt;Every decorated function automatically logs:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Function name&lt;/strong&gt; - What was called&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Arguments&lt;/strong&gt; - All input parameters with their values&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Return value&lt;/strong&gt; - Complete output from the function&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Duration&lt;/strong&gt; - Execution time in milliseconds&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Timestamp&lt;/strong&gt; - Exact time of invocation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Errors&lt;/strong&gt; - Full exception details if the call failed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parent tracking&lt;/strong&gt; - Caller/callee links for nested function calls&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;No manual annotation needed.&lt;/strong&gt; Just add the decorator.&lt;/p&gt;
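
&lt;p&gt;A minimal sketch of how a decorator can capture these fields (illustrative only, not Agent Recorder's actual implementation; the &lt;code&gt;record&lt;/code&gt; name is hypothetical):&lt;/p&gt;

```python
import functools
import json
import time
import traceback
from datetime import datetime

def record(func):
    """Sketch of a recording decorator capturing the fields listed above."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        event = {
            "function_name": func.__name__,       # 1. what was called
            "args": {"args": repr(args), "kwargs": repr(kwargs)},  # 2. inputs
            "timestamp": datetime.now().isoformat(),               # 5. when
            "error": None,
        }
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
            event["result"] = repr(result)        # 3. output
            return result
        except Exception:
            event["error"] = traceback.format_exc()  # 6. full exception
            raise
        finally:
            event["duration_ms"] = int((time.perf_counter() - start) * 1000)  # 4.
            print(json.dumps(event))  # the real library appends to a JSONL file
    return wrapper
```

&lt;p&gt;Parent tracking (item 7) additionally requires a stack of in-flight event IDs, which is omitted here for brevity.&lt;/p&gt;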

&lt;h3&gt;
  
  
  Running Your Agent
&lt;/h3&gt;

&lt;p&gt;Use your functions exactly as before:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is your agent logic - unchanged!
&lt;/span&gt;&lt;span class="n"&gt;user_question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Find all orders for customer John Smith&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Ask LLM to understand the query
&lt;/span&gt;&lt;span class="n"&gt;intent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;User asks: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;user_question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Search for the customer
&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_database&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;customer_name=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;John Smith&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Get their orders
&lt;/span&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;customer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_customer_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Summarize results
&lt;/span&gt;    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize these orders: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Everything is being recorded in the background.&lt;/p&gt;

&lt;h3&gt;
  
  
  Viewing the Timeline
&lt;/h3&gt;

&lt;p&gt;When your agent finishes (or crashes), run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-recorder view latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Your browser opens to a beautiful web-based timeline showing the complete execution flow.&lt;/p&gt;

&lt;h2&gt;
  
  
  How It Works: Technical Deep Dive
&lt;/h2&gt;

&lt;p&gt;Let me walk you through the architecture and implementation details.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Decorator-Based Instrumentation
&lt;/h3&gt;

&lt;p&gt;When you write:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;my-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;response&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here's what happens under the hood:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Registry Lookup:&lt;/strong&gt; Agent Recorder checks if a &lt;code&gt;Recorder&lt;/code&gt; instance exists for &lt;code&gt;"my-agent"&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-Creation:&lt;/strong&gt; If not, it creates one with a unique run ID (timestamp + UUID)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function Wrapping:&lt;/strong&gt; Your function gets wrapped with timing and logging logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execution:&lt;/strong&gt; When called, it captures args, executes the function, captures the result&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event Writing:&lt;/strong&gt; Writes a structured event to a JSONL file immediately&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The actual implementation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
             &lt;span class="n"&gt;capture_args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Get or create a Recorder instance for this run_name
&lt;/span&gt;    &lt;span class="n"&gt;recorder&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_get_or_create_recorder&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Return the actual decorator that wraps your function
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;recorder&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;capture_args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="n"&gt;capture_result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;capture_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Event Storage Format
&lt;/h3&gt;

&lt;p&gt;All events are stored as &lt;strong&gt;JSONL&lt;/strong&gt; (JSON Lines) - one JSON object per line. This format is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Streamable:&lt;/strong&gt; Can write events as they happen&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Parseable:&lt;/strong&gt; Easy to read line-by-line&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crash-resistant:&lt;/strong&gt; If your program crashes, all events up to that point are saved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tooling-friendly:&lt;/strong&gt; Standard format used by many data tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Example event:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"run_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20260103_192705_c2207bde"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"event_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"4f85a880-2ab7-45bf-a0ba-9c776581a5de"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-03T19:27:06.097562"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"parent_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"function_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User asks: Find all orders for customer John Smith"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;760&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"I'll help you find customer information. Let me search the database."&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Storage location:&lt;/strong&gt; &lt;code&gt;~/.agent-recorder/runs/&amp;lt;run_id&amp;gt;.jsonl&lt;/code&gt;&lt;/p&gt;
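
&lt;p&gt;Because each line is a standalone JSON object, a run can be inspected with nothing beyond the standard library. A small sketch (the &lt;code&gt;load_events&lt;/code&gt; helper is hypothetical; the path follows the storage location above):&lt;/p&gt;

```python
import json
from pathlib import Path

# Storage location used by Agent Recorder
RUNS_DIR = Path.home() / ".agent-recorder" / "runs"

def load_events(path):
    """Parse one run's JSONL file into a list of event dicts."""
    events = []
    with Path(path).open() as f:
        for line in f:
            line = line.strip()
            if line:  # tolerate blank lines
                events.append(json.loads(line))
    return events

# e.g. events = load_events(RUNS_DIR / "20260103_192705_c2207bde.jsonl")
```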

&lt;h3&gt;
  
  
  3. Event Types
&lt;/h3&gt;

&lt;p&gt;Agent Recorder tracks 5 event types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;run_start&lt;/strong&gt; - Marks the beginning of a run
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"run_start"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"customer-service-agent"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"run_id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"20260103_192705_c2207bde"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-01-03T19:27:05.337192"&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;
&lt;strong&gt;llm_call&lt;/strong&gt; - LLM function execution
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"llm_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"function_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"call_llm"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"prompt"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1234&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="3"&gt;
&lt;li&gt;
&lt;strong&gt;tool_call&lt;/strong&gt; - Tool function execution
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tool_call"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"function_name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"search_database"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"args"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"query"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"result"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="err"&gt;...&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"duration_ms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;340&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="4"&gt;
&lt;li&gt;
&lt;strong&gt;error&lt;/strong&gt; - Exception that occurred
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="w"&gt;   &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"data"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"error_type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ValueError"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Customer not found"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"traceback"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"..."&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
   &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="5"&gt;
&lt;li&gt;
&lt;strong&gt;run_end&lt;/strong&gt; - Marks completion (optional in v0.1.1)&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  4. Async Support
&lt;/h3&gt;

&lt;p&gt;The same decorators work seamlessly with async functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;async-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;openai_async&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="nd"&gt;@tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;async-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.weather.com/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Use with asyncio
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_llm_async&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s the weather in SF?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;weather&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;fetch_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;San Francisco&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Agent Recorder detects whether your function is a coroutine and wraps it accordingly.&lt;/p&gt;
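
&lt;p&gt;One plausible way to support both styles from a single decorator is to branch on &lt;code&gt;inspect.iscoroutinefunction&lt;/code&gt;. A sketch (not the library's actual code; it records only duration for brevity):&lt;/p&gt;

```python
import asyncio
import functools
import inspect
import time

def timed(func):
    """Wrap sync and async functions alike, reporting duration_ms."""
    if inspect.iscoroutinefunction(func):
        @functools.wraps(func)
        async def async_wrapper(*args, **kwargs):
            start = time.perf_counter()
            result = await func(*args, **kwargs)  # must be awaited
            duration_ms = int((time.perf_counter() - start) * 1000)
            print(f"{func.__name__}: {duration_ms}ms")
            return result
        return async_wrapper

    @functools.wraps(func)
    def sync_wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        duration_ms = int((time.perf_counter() - start) * 1000)
        print(f"{func.__name__}: {duration_ms}ms")
        return result
    return sync_wrapper
```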

&lt;h3&gt;
  
  
  5. Web Viewer Architecture
&lt;/h3&gt;

&lt;p&gt;The viewer is a &lt;strong&gt;self-contained HTML file&lt;/strong&gt; with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No external dependencies (no CDN calls)&lt;/li&gt;
&lt;li&gt;Vanilla JavaScript for parsing JSONL&lt;/li&gt;
&lt;li&gt;CSS for the timeline UI&lt;/li&gt;
&lt;li&gt;Syntax highlighting for JSON data&lt;/li&gt;
&lt;li&gt;Collapsible event cards&lt;/li&gt;
&lt;li&gt;Search and filter capabilities&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When you run &lt;code&gt;agent-recorder view latest&lt;/code&gt;, it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Finds the latest run in &lt;code&gt;~/.agent-recorder/runs/&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Starts a local HTTP server (default port 8765)&lt;/li&gt;
&lt;li&gt;Serves the HTML viewer + JSONL data&lt;/li&gt;
&lt;li&gt;Opens your browser to &lt;code&gt;http://localhost:8765/runs/&amp;lt;run_id&amp;gt;.html&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Everything stays local.&lt;/strong&gt; No data leaves your machine.&lt;/p&gt;
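
&lt;p&gt;The serving steps above can be approximated with Python's built-in &lt;code&gt;http.server&lt;/code&gt; under the same assumptions (port 8765, runs stored as &lt;code&gt;.jsonl&lt;/code&gt; files); &lt;code&gt;find_latest_run&lt;/code&gt; and &lt;code&gt;serve&lt;/code&gt; are hypothetical names:&lt;/p&gt;

```python
import functools
import http.server
import socketserver
import webbrowser
from pathlib import Path

PORT = 8765  # default port mentioned above

def find_latest_run(runs_dir):
    """Pick the most recently modified .jsonl run file, or None."""
    runs = sorted(runs_dir.glob("*.jsonl"), key=lambda p: p.stat().st_mtime)
    return runs[-1] if runs else None

def serve(runs_dir):
    """Serve the runs directory on localhost and open the newest run."""
    latest = find_latest_run(runs_dir)
    if latest is None:
        raise FileNotFoundError("no recorded runs")
    handler = functools.partial(http.server.SimpleHTTPRequestHandler,
                                directory=str(runs_dir))
    with socketserver.TCPServer(("127.0.0.1", PORT), handler) as httpd:
        webbrowser.open(f"http://localhost:{PORT}/{latest.stem}.html")
        httpd.serve_forever()
```

&lt;p&gt;&lt;code&gt;serve_forever()&lt;/code&gt; blocks until interrupted; the real CLI presumably layers the self-contained HTML viewer on top of this.&lt;/p&gt;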

&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;Let me show you how Agent Recorder solves actual problems I've encountered.&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 1: Debugging Infinite Loops
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Agent keeps calling the same tool over and over.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without Agent Recorder:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
[DEBUG] Calling search_database with query: customer_name='John'
[DEBUG] Got 0 results
[DEBUG] Calling LLM...
[DEBUG] LLM says: Let me search again
... (500 more lines)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You have to manually count log lines and realize it's looping.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Agent Recorder:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Open the timeline and immediately see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. llm_call - "Find customer John"
2. tool_call - search_database(query="customer_name='John'") → []
3. llm_call - "I got no results, let me try again"
4. tool_call - search_database(query="customer_name='John'") → []
5. llm_call - "I got no results, let me try again"
6. tool_call - search_database(query="customer_name='John'") → []
... (pattern visible immediately)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; The database query is wrong (should be &lt;code&gt;customer_name='John Smith'&lt;/code&gt;). Also, the LLM needs explicit instruction to stop after 1 failed attempt.&lt;/p&gt;
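
&lt;p&gt;Because runs are plain JSONL, this looping pattern can even be detected programmatically. A sketch over the event format shown earlier (&lt;code&gt;find_repeated_calls&lt;/code&gt; is a hypothetical helper):&lt;/p&gt;

```python
import json
from collections import Counter

def find_repeated_calls(jsonl_path, threshold=3):
    """Flag tool calls repeated with identical arguments, a common
    signature of an agent stuck in a retry loop."""
    counts = Counter()
    with open(jsonl_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "tool_call":
                data = event["data"]
                # identical function + identical args = same call
                key = (data["function_name"],
                       json.dumps(data["args"], sort_keys=True))
                counts[key] += 1
    return {k: n for k, n in counts.items() if n >= threshold}
```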

&lt;p&gt;&lt;strong&gt;Time saved:&lt;/strong&gt; 2 hours → 5 minutes&lt;/p&gt;

&lt;h3&gt;
  
  
  Use Case 2: Performance Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; Agent is slow but you don't know which part.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Agent Recorder:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Look at the timeline durations:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1. llm_call - 1.2s ⚡ (acceptable)
2. tool_call - search_database - 3.8s 🐌 (SLOW!)
3. tool_call - get_orders - 0.4s ⚡
4. llm_call - 0.9s ⚡
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Add a database index on &lt;code&gt;customer_name&lt;/code&gt;. Duration drops to 0.2s.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Result:&lt;/strong&gt; Total execution time: 6.3s → 2.7s (57% faster)&lt;/p&gt;
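&lt;p&gt;With SQLite, for example, that index is a single statement. The table and column names below mirror this post's example; adapt them to your own schema:&lt;/p&gt;

```python
# Sketch: create an index on the column used for lookups (sqlite3, stdlib).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_customers_name ON customers (name)")

# Confirm the query planner actually uses the index for name lookups
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM customers WHERE name = ?",
    ("John Smith",),
).fetchall()
print(plan)
```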

&lt;h3&gt;
  
  
  Use Case 3: Token Usage Optimization
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; High API usage, unclear why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Agent Recorder:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Export the run to JSON:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;agent-recorder &lt;span class="nb"&gt;export&lt;/span&gt; &amp;lt;run_id&amp;gt; &lt;span class="nt"&gt;-o&lt;/span&gt; run.json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Write a quick script to analyze:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;total_prompt_length&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="n"&gt;total_calls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;run.json&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;total_calls&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;total_prompt_length&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Total LLM calls: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Average prompt length: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;total_prompt_length&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="n"&gt;total_calls&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Discovery:&lt;/strong&gt; One LLM call had a 5000-character prompt that included the entire knowledge base unnecessarily.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix:&lt;/strong&gt; Pass only relevant excerpts to the LLM. Token usage drops significantly.&lt;/p&gt;
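&lt;p&gt;Even a naive keyword-overlap filter beats pasting the whole knowledge base into every prompt. This sketch is not part of Agent Recorder; it just shows the shape of the fix (real retrieval would use embeddings or BM25):&lt;/p&gt;

```python
# Sketch: send only the knowledge-base chunks most relevant to the question.
# Keyword overlap is a crude stand-in for real retrieval.
import re

def words(text):
    return set(re.findall(r"\w+", text.lower()))

def select_excerpts(question, chunks, top_k=2):
    q = words(question)
    # Rank chunks by how many words they share with the question
    return sorted(chunks, key=lambda c: len(q & words(c)), reverse=True)[:top_k]

kb = [
    "Refunds are processed within 5 business days.",
    "Orders ship from our warehouse within 24 hours.",
    "Customers can check order status on the account page.",
]
excerpts = select_excerpts("How do I check my order status?", kb)
prompt = "Context:\n" + "\n".join(excerpts) + "\n\nQuestion: How do I check my order status?"
```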

&lt;h3&gt;
  
  
  Use Case 4: Comparing Runs
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; "It worked yesterday, now it's broken."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Agent Recorder:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# List all runs&lt;/span&gt;
agent-recorder list

&lt;span class="c"&gt;# Output:&lt;/span&gt;
&lt;span class="c"&gt;# 20260102_143022_abc123  customer-agent  2026-01-02 14:30:22 (working)&lt;/span&gt;
&lt;span class="c"&gt;# 20260103_192705_c2207b  customer-agent  2026-01-03 19:27:05 (broken)&lt;/span&gt;

&lt;span class="c"&gt;# Export both&lt;/span&gt;
agent-recorder &lt;span class="nb"&gt;export &lt;/span&gt;20260102_143022_abc123 &lt;span class="nt"&gt;-o&lt;/span&gt; working.json
agent-recorder &lt;span class="nb"&gt;export &lt;/span&gt;20260103_192705_c2207b &lt;span class="nt"&gt;-o&lt;/span&gt; broken.json

&lt;span class="c"&gt;# Compare with diff tool or custom script&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Discovery:&lt;/strong&gt; In the broken version, a new validation step was added that always returns empty results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time saved:&lt;/strong&gt; 4 hours → 15 minutes&lt;/p&gt;
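&lt;p&gt;Once both runs are exported, a short script can pinpoint where their event sequences diverge. The JSON shape here matches the earlier analysis script; inline sample dicts stand in for the two exported files:&lt;/p&gt;

```python
# Sketch: find the first step where two exported runs diverge.
# Inline dicts stand in for working.json and broken.json.
working = {"events": [{"type": "llm_call"}, {"type": "tool_call"}, {"type": "llm_call"}]}
broken = {"events": [{"type": "llm_call"}, {"type": "validation"}, {"type": "tool_call"}]}

def first_divergence(a, b):
    ta = [e["type"] for e in a["events"]]
    tb = [e["type"] for e in b["events"]]
    for i, (x, y) in enumerate(zip(ta, tb)):
        if x != y:
            return i, x, y  # step index plus the two differing event types
    if len(ta) != len(tb):
        return min(len(ta), len(tb)), None, None  # one run has extra steps
    return None

print(first_divergence(working, broken))
```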

&lt;h3&gt;
  
  
  Use Case 5: Onboarding New Team Members
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;The Problem:&lt;/strong&gt; "How does this agent work?"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;With Agent Recorder:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Run a sample execution:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python examples/customer_service_agent.py
agent-recorder view latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Show them the timeline. They instantly understand:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Agent asks LLM to parse the query&lt;/li&gt;
&lt;li&gt;LLM decides which tools to call&lt;/li&gt;
&lt;li&gt;Agent executes tools (database, API calls)&lt;/li&gt;
&lt;li&gt;LLM synthesizes the response&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;No documentation needed.&lt;/strong&gt; The timeline is living documentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Comparing Approaches
&lt;/h2&gt;

&lt;p&gt;Let me compare different debugging approaches with a real scenario:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; Debug why customer order lookup fails for "John Smith"&lt;/p&gt;

&lt;h3&gt;
  
  
  Approach 1: Print Statements
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG: Looking for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG: Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG: No customers found, returning empty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG: Getting orders for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;DEBUG: Got &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Time to find bug:&lt;/strong&gt; 30-60 minutes&lt;br&gt;
&lt;strong&gt;Lines of debug code:&lt;/strong&gt; 15-20&lt;br&gt;
&lt;strong&gt;After fixing:&lt;/strong&gt; Remove all print statements&lt;br&gt;
&lt;strong&gt;If it breaks again:&lt;/strong&gt; Add them all back&lt;/p&gt;
&lt;h3&gt;
  
  
  Approach 2: Logging Framework
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;
&lt;span class="n"&gt;logger&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;logging&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getLogger&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Looking for customer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Found &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;No customers found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Getting orders for customer &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;orders&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Time to find bug:&lt;/strong&gt; 20-30 minutes&lt;br&gt;
&lt;strong&gt;Lines of debug code:&lt;/strong&gt; 20-25 (permanent overhead)&lt;br&gt;
&lt;strong&gt;After fixing:&lt;/strong&gt; Logs stay (clutter over time)&lt;br&gt;
&lt;strong&gt;Visualization:&lt;/strong&gt; Still just text in a file&lt;/p&gt;
&lt;h3&gt;
  
  
  Approach 3: Cloud Observability (e.g., Datadog)
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;ddtrace&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;

&lt;span class="nd"&gt;@tracer.wrap&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;search_customers&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;tracer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;trace&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;get_orders&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Time to find bug:&lt;/strong&gt; 10-15 minutes&lt;br&gt;
&lt;strong&gt;Setup time:&lt;/strong&gt; 2-3 hours (SDK, config, account)&lt;br&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; Monthly subscription&lt;br&gt;
&lt;strong&gt;Security:&lt;/strong&gt; Data sent to third-party&lt;br&gt;
&lt;strong&gt;Lines of instrumentation:&lt;/strong&gt; 15-20&lt;/p&gt;
&lt;h3&gt;
  
  
  Approach 4: Agent Recorder
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agent_recorder&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool_call&lt;/span&gt;

&lt;span class="nd"&gt;@tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order-lookup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;find_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;customers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;search_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;orders&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;orders&lt;/span&gt;

&lt;span class="nd"&gt;@tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order-lookup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;search_customers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM customers WHERE name = &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;order-lookup&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_orders&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SELECT * FROM orders WHERE customer_id = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;customer_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Time to find bug:&lt;/strong&gt; 5-10 minutes&lt;br&gt;
&lt;strong&gt;Setup time:&lt;/strong&gt; 30 seconds (&lt;code&gt;pip install&lt;/code&gt;)&lt;br&gt;
&lt;strong&gt;Ongoing:&lt;/strong&gt; Free&lt;br&gt;
&lt;strong&gt;Security:&lt;/strong&gt; All data local&lt;br&gt;
&lt;strong&gt;Lines of instrumentation:&lt;/strong&gt; 3 decorators&lt;br&gt;
&lt;strong&gt;After fixing:&lt;/strong&gt; Decorators stay (useful for future debugging)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Winner:&lt;/strong&gt; Agent Recorder provides the best balance of simplicity, effectiveness, and privacy.&lt;/p&gt;
&lt;h2&gt;
  
  
  Building Production-Ready Agents
&lt;/h2&gt;

&lt;p&gt;Agent Recorder isn't just for debugging; it's also built for production agents.&lt;/p&gt;
&lt;h3&gt;
  
  
  1. Handling Sensitive Data
&lt;/h3&gt;

&lt;p&gt;Don't log API keys or personal information:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secure-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_args&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm_with_key&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# api_key won't be logged
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

&lt;span class="nd"&gt;@tool_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;secure-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capture_result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_user_pii&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Result won't be logged (but function call and args will)
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_user_sensitive_info&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
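&lt;p&gt;If an argument must be captured but contains a secret inside a larger string, you can also redact it yourself before it reaches any log. The &lt;code&gt;redact&lt;/code&gt; helper below is hypothetical, not an Agent Recorder API, and the key pattern is illustrative only:&lt;/p&gt;

```python
# Sketch: mask values that look like API keys before they reach any log.
import re

def redact(text):
    # Illustrative pattern; match whatever your real keys look like
    return re.sub(r"sk-[A-Za-z0-9]+", "sk-***", text)

print(redact("calling with key sk-abc123XYZ"))
```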



&lt;h3&gt;
  
  
  2. Custom Storage Location
&lt;/h3&gt;

&lt;p&gt;For production deployments:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;run_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prod-agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;storage_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/var/log/agent-recorder&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Cleanup Old Runs
&lt;/h3&gt;

&lt;p&gt;Keep disk usage under control:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Delete runs older than 7 days&lt;/span&gt;
agent-recorder cleanup &lt;span class="nt"&gt;--older-than&lt;/span&gt; 7d

&lt;span class="c"&gt;# Dry run to see what would be deleted&lt;/span&gt;
agent-recorder cleanup &lt;span class="nt"&gt;--older-than&lt;/span&gt; 7d &lt;span class="nt"&gt;--dry-run&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Automated Analysis
&lt;/h3&gt;

&lt;p&gt;Export and analyze runs programmatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;analyze_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Export to JSON
&lt;/span&gt;    &lt;span class="n"&gt;export_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/tmp/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;system&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent-recorder export &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;run_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; -o &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;export_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Load and analyze
&lt;/span&gt;    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;export_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;stats&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_llm_calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_duration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;events&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;llm_call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_llm_calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_duration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;duration_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tool_call&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_tool_calls&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;total_duration&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;duration_ms&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;stats&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;errors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;data&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;stats&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
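&lt;p&gt;The aggregation logic above can be sanity-checked without the CLI by running the same counters over a small in-memory event list (the sample events below are made up for illustration; they mirror the exported shape used in &lt;code&gt;analyze_run&lt;/code&gt;):&lt;/p&gt;

```python
# Stand-alone check of the stats aggregation, using hand-written sample events
events = [
    {"type": "llm_call", "data": {"duration_ms": 120}},
    {"type": "tool_call", "data": {"duration_ms": 30}},
    {"type": "error", "data": {"message": "timeout"}},
]

stats = {"total_llm_calls": 0, "total_tool_calls": 0, "total_duration": 0, "errors": []}
for event in events:
    if event["type"] == "llm_call":
        stats["total_llm_calls"] += 1
        stats["total_duration"] += event["data"]["duration_ms"]
    elif event["type"] == "tool_call":
        stats["total_tool_calls"] += 1
        stats["total_duration"] += event["data"]["duration_ms"]
    elif event["type"] == "error":
        stats["errors"].append(event["data"])

print(stats)
```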



&lt;h2&gt;
  
  
  The Road Ahead
&lt;/h2&gt;

&lt;p&gt;Agent Recorder v0.1.1 is just the beginning. Here's what's coming:&lt;/p&gt;

&lt;h3&gt;
  
  
  v0.2.0 - Enhanced Visualization (Planned)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tree/Graph View:&lt;/strong&gt; See nested calls as a visual tree&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Counting:&lt;/strong&gt; Automatic token counting for OpenAI/Anthropic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Estimation:&lt;/strong&gt; Calculate API costs for each run&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance Metrics:&lt;/strong&gt; Identify bottlenecks automatically&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Export Formats:&lt;/strong&gt; PDF, HTML, CSV for reports&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  v0.3.0 - Framework Integrations (Planned)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangChain Adapter:&lt;/strong&gt; Auto-instrument LangChain agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LlamaIndex Adapter:&lt;/strong&gt; Seamless integration with LlamaIndex&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen Support:&lt;/strong&gt; Track multi-agent conversations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI Integration:&lt;/strong&gt; Monitor crew workflows&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  v0.4.0 - Advanced Features (Planned)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Real-time Streaming:&lt;/strong&gt; Watch agent execution live&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-agent Support:&lt;/strong&gt; Track multiple agents interacting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Diff View:&lt;/strong&gt; Compare two runs side-by-side&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Events:&lt;/strong&gt; Log your own event types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plugin System:&lt;/strong&gt; Extend with custom visualizations&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  v0.5.0 - Language Ports (Community Welcome!)
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;TypeScript/Node.js SDK:&lt;/strong&gt; For JavaScript agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Go SDK:&lt;/strong&gt; For Go-based agents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rust SDK:&lt;/strong&gt; For high-performance agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Want to contribute?&lt;/strong&gt; Check out the GitHub repo for good first issues!&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Observability Is Not Optional
&lt;/h2&gt;

&lt;p&gt;As AI agents move from prototypes to production, observability isn't a nice-to-have - &lt;strong&gt;it's essential.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You can't optimize what you can't measure.&lt;br&gt;
You can't debug what you can't see.&lt;br&gt;
You can't trust what you can't verify.&lt;/p&gt;

&lt;p&gt;Agent Recorder gives you that visibility with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Two simple decorators&lt;/li&gt;
&lt;li&gt;✅ Zero configuration&lt;/li&gt;
&lt;li&gt;✅ Local-first architecture&lt;/li&gt;
&lt;li&gt;✅ Framework-agnostic design&lt;/li&gt;
&lt;li&gt;✅ Beautiful visualization&lt;/li&gt;
&lt;li&gt;✅ Free and open source&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Get Started Today
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone and install&lt;/span&gt;
git clone https://github.com/yourusername/agent-recorder.git
&lt;span class="nb"&gt;cd &lt;/span&gt;agent-recorder
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="nb"&gt;.&lt;/span&gt;

&lt;span class="c"&gt;# Try the example&lt;/span&gt;
python examples/simple_agent.py

&lt;span class="c"&gt;# View the recording&lt;/span&gt;
agent-recorder view latest
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; &lt;a href="https://github.com/yourusername/agent-recorder" rel="noopener noreferrer"&gt;https://github.com/yourusername/agent-recorder&lt;/a&gt;&lt;br&gt;
&lt;strong&gt;License:&lt;/strong&gt; MIT&lt;br&gt;
&lt;strong&gt;Docs:&lt;/strong&gt; See README.md for full documentation&lt;/p&gt;

&lt;h2&gt;
  
  
  Join the Community
&lt;/h2&gt;

&lt;p&gt;Agent Recorder is open source and built for the community. Whether you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Found a bug → Open an issue&lt;/li&gt;
&lt;li&gt;Have a feature idea → Start a discussion&lt;/li&gt;
&lt;li&gt;Want to contribute → Submit a PR&lt;/li&gt;
&lt;li&gt;Built something cool → Share your story&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We're building the future of agent observability together.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Star the repo&lt;/strong&gt; if you find it useful - it helps others discover the project!&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Building Self-Correcting Database Agents with Meta's Llama-4-Scout: From Natural Language to SQL Insights</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Thu, 09 Oct 2025 15:09:03 +0000</pubDate>
      <link>https://dev.to/angu10/building-self-correcting-database-agents-with-metas-llama-4-scout-from-natural-language-to-sql-2k0a</link>
      <guid>https://dev.to/angu10/building-self-correcting-database-agents-with-metas-llama-4-scout-from-natural-language-to-sql-2k0a</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;The ability to query databases using natural language represents one of the most practical applications of large language models in enterprise settings. However, most text-to-SQL implementations suffer from a critical flaw: they generate queries once and hope for the best. When a query fails—due to incorrect table names, misunderstood schema relationships, or logical errors—the system simply returns an error message, leaving users frustrated.&lt;/p&gt;

&lt;p&gt;In this technical guide, we'll build a sophisticated database agent using Meta's Llama-4-Scout that doesn't just generate SQL queries—it thinks through the problem, validates its approach, and most importantly, learns from its mistakes to self-correct. This agent implements a five-phase cognitive framework: &lt;strong&gt;Understand → Plan → Generate → Validate → Execute&lt;/strong&gt;, with intelligent retry logic when queries fail.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Database Agents Matter
&lt;/h2&gt;

&lt;p&gt;Business analysts, product managers, and non-technical stakeholders frequently need data insights but lack SQL expertise. Traditional BI tools require pre-built dashboards, creating bottlenecks when ad-hoc questions arise. A well-designed database agent democratizes data access while maintaining query safety and accuracy.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding Llama-4-Scout for Structured Data Tasks
&lt;/h2&gt;

&lt;p&gt;Llama-4-Scout excels at structured reasoning tasks for several reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Schema comprehension&lt;/strong&gt;: Understands relational database concepts including foreign keys, joins, and aggregations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error diagnosis&lt;/strong&gt;: Can parse SQL error messages and identify root causes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterative refinement&lt;/strong&gt;: Maintains context across retry attempts to avoid repeating mistakes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety awareness&lt;/strong&gt;: Can be prompted to avoid destructive operations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Implementation Architecture
&lt;/h2&gt;

&lt;p&gt;Our implementation consists of four layers:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Database Layer&lt;/strong&gt;: SQLite database with realistic employee/department data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Cognitive Layer&lt;/strong&gt;: Five-phase reasoning pipeline&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety &amp;amp; Validation Layer&lt;/strong&gt;: SQL injection prevention and query validation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;User Interface Layer&lt;/strong&gt;: Streamlit application for interaction&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's build this step by step.&lt;/p&gt;




&lt;h2&gt;
  
  
  Step 1: Environment Setup and Dependencies
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Required packages (sqlite3 ships with Python's standard library)
# pip install streamlit together pandas

import streamlit as st
import sqlite3
import pandas as pd
from together import Together
import re
from datetime import datetime
import json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;h2&gt;
  
  
  Step 2: Database Schema and Sample Data Creation
&lt;/h2&gt;

&lt;p&gt;First, we'll create a realistic database with three interconnected tables representing a company's organizational structure:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_sample_database():
    """Create a sample SQLite database with employee, department, and project tables"""
    conn = sqlite3.connect('company.db')
    cursor = conn.cursor()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Drop existing tables if they exist
cursor.execute('DROP TABLE IF EXISTS projects')
cursor.execute('DROP TABLE IF EXISTS employees')
cursor.execute('DROP TABLE IF EXISTS departments')

# Create departments table
cursor.execute('''
CREATE TABLE departments (
    department_id INTEGER PRIMARY KEY,
    department_name TEXT NOT NULL,
    budget DECIMAL(12, 2),
    location TEXT
)
''')

# Create employees table with self-referential manager relationship
cursor.execute('''
CREATE TABLE employees (
    employee_id INTEGER PRIMARY KEY,
    first_name TEXT NOT NULL,
    last_name TEXT NOT NULL,
    email TEXT UNIQUE,
    hire_date DATE,
    salary DECIMAL(10, 2),
    department_id INTEGER,
    manager_id INTEGER,
    FOREIGN KEY (department_id) REFERENCES departments(department_id),
    FOREIGN KEY (manager_id) REFERENCES employees(employee_id)
)
''')

# Create projects table
cursor.execute('''
CREATE TABLE projects (
    project_id INTEGER PRIMARY KEY,
    project_name TEXT NOT NULL,
    department_id INTEGER,
    start_date DATE,
    end_date DATE,
    status TEXT,
    budget DECIMAL(10, 2),
    FOREIGN KEY (department_id) REFERENCES departments(department_id)
)
''')

# Insert sample departments
departments = [
    (1, 'Engineering', 2500000.00, 'San Francisco'),
    (2, 'Sales', 1500000.00, 'New York'),
    (3, 'Marketing', 800000.00, 'Austin'),
    (4, 'Human Resources', 600000.00, 'San Francisco'),
    (5, 'Finance', 700000.00, 'New York')
]
cursor.executemany('INSERT INTO departments VALUES (?, ?, ?, ?)', departments)

# Insert sample employees (managers first, then their reports)
employees = [
    # Managers (no manager_id)
    (1, 'Sarah', 'Johnson', 'sarah.j@company.com', '2018-01-15', 150000, 1, None),
    (2, 'Michael', 'Chen', 'michael.c@company.com', '2017-03-20', 145000, 2, None),
    (3, 'Emily', 'Rodriguez', 'emily.r@company.com', '2019-06-10', 125000, 3, None),
    (4, 'David', 'Kim', 'david.k@company.com', '2018-09-05', 130000, 4, None),
    (5, 'Lisa', 'Patel', 'lisa.p@company.com', '2017-11-12', 140000, 5, None),

    # Engineering team
    (6, 'James', 'Wilson', 'james.w@company.com', '2020-02-14', 115000, 1, 1),
    (7, 'Anna', 'Martinez', 'anna.m@company.com', '2021-05-18', 95000, 1, 1),
    (8, 'Robert', 'Taylor', 'robert.t@company.com', '2019-08-22', 120000, 1, 1),
    (9, 'Jennifer', 'Anderson', 'jennifer.a@company.com', '2022-01-10', 85000, 1, 6),
    (10, 'William', 'Thomas', 'william.t@company.com', '2021-11-30', 90000, 1, 6),

    # Sales team
    (11, 'Jessica', 'Moore', 'jessica.m@company.com', '2020-04-05', 95000, 2, 2),
    (12, 'Christopher', 'Jackson', 'chris.j@company.com', '2021-07-19', 88000, 2, 2),
    (13, 'Amanda', 'White', 'amanda.w@company.com', '2019-12-01', 105000, 2, 2),
    (14, 'Daniel', 'Harris', 'daniel.h@company.com', '2022-03-15', 75000, 2, 11),

    # Marketing team
    (15, 'Michelle', 'Martin', 'michelle.m@company.com', '2020-09-08', 82000, 3, 3),
    (16, 'Kevin', 'Thompson', 'kevin.t@company.com', '2021-10-20', 78000, 3, 3),
    (17, 'Rachel', 'Garcia', 'rachel.g@company.com', '2022-02-28', 72000, 3, 15),

    # HR team
    (18, 'Brian', 'Martinez', 'brian.m@company.com', '2019-05-14', 85000, 4, 4),
    (19, 'Nicole', 'Robinson', 'nicole.r@company.com', '2021-08-03', 75000, 4, 4),

    # Finance team
    (20, 'Andrew', 'Clark', 'andrew.c@company.com', '2018-07-22', 95000, 5, 5),
    (21, 'Stephanie', 'Lewis', 'stephanie.l@company.com', '2020-11-11', 88000, 5, 5),
    (22, 'Joshua', 'Lee', 'joshua.l@company.com', '2022-04-06', 80000, 5, 20)
]
cursor.executemany('INSERT INTO employees VALUES (?, ?, ?, ?, ?, ?, ?, ?)', employees)

# Insert sample projects
projects = [
    (1, 'Cloud Migration', 1, '2024-01-01', '2024-12-31', 'In Progress', 500000),
    (2, 'Mobile App Redesign', 1, '2024-03-15', '2024-09-30', 'In Progress', 300000),
    (3, 'Q4 Sales Campaign', 2, '2024-10-01', '2024-12-31', 'Planning', 200000),
    (4, 'Enterprise Sales Initiative', 2, '2024-02-01', '2024-11-30', 'In Progress', 150000),
    (5, 'Brand Refresh', 3, '2024-05-01', '2024-08-31', 'Completed', 180000),
    (6, 'Social Media Strategy', 3, '2024-07-01', '2025-06-30', 'In Progress', 120000),
    (7, 'HR System Upgrade', 4, '2024-04-01', '2024-10-31', 'In Progress', 100000),
    (8, 'Financial Audit 2024', 5, '2024-01-01', '2024-03-31', 'Completed', 80000)
]
cursor.executemany('INSERT INTO projects VALUES (?, ?, ?, ?, ?, ?, ?)', projects)

conn.commit()
conn.close()

return "Database created successfully!"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_database_schema():
    """Extract and format database schema for the agent"""
    conn = sqlite3.connect('company.db')
    cursor = conn.cursor()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;schema_info = []

# Get all tables
cursor.execute("SELECT name FROM sqlite_master WHERE type='table'")
tables = cursor.fetchall()

for table in tables:
    table_name = table[0]
    schema_info.append(f"\n--- Table: {table_name} ---")

    # Get column information
    cursor.execute(f"PRAGMA table_info({table_name})")
    columns = cursor.fetchall()

    for col in columns:
        col_name = col[1]
        col_type = col[2]
        is_pk = " (PRIMARY KEY)" if col[5] else ""
        schema_info.append(f"  - {col_name}: {col_type}{is_pk}")

    # Get foreign key information
    cursor.execute(f"PRAGMA foreign_key_list({table_name})")
    fks = cursor.fetchall()

    if fks:
        schema_info.append("  Foreign Keys:")
        for fk in fks:
            schema_info.append(f"    - {fk[3]} -&amp;gt; {fk[2]}({fk[4]})")

conn.close()
return "\n".join(schema_info)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
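&lt;p&gt;The &lt;code&gt;PRAGMA table_info&lt;/code&gt; call used above is easy to try on a throwaway in-memory database; each column row comes back as &lt;code&gt;(cid, name, type, notnull, dflt_value, pk)&lt;/code&gt;:&lt;/p&gt;

```python
import sqlite3

# Inspect a table's columns the same way get_database_schema does
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE departments ("
    "department_id INTEGER PRIMARY KEY, "
    "department_name TEXT NOT NULL)"
)
cols = conn.execute("PRAGMA table_info(departments)").fetchall()
for cid, name, ctype, notnull, default, pk in cols:
    print(name, ctype, bool(pk))
conn.close()
```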
&lt;h2&gt;
  
  
  Step 3: The Five-Phase Agent Cognitive Framework
&lt;/h2&gt;

&lt;p&gt;Now we implement the core intelligence of our agent. Each phase has a specific responsibility in the query generation pipeline:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Initialize Together AI client
together_client = Together(api_key="your_api_key_here")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def phase_1_understand(question, schema):
    """
    PHASE 1: UNDERSTAND
    Analyze the natural language question and understand what data is needed
    """
    instruction = f"""
You are a database expert analyzing a natural language question.

DATABASE SCHEMA:
{schema}

USER QUESTION: "{question}"

TASK: Analyze this question to understand:
1. What specific information is the user requesting?
2. Which tables contain this information?
3. What columns are relevant?
4. Are any calculations, aggregations, or groupings needed?
5. Are there any ambiguities in the question?

Provide a clear analysis of what data needs to be retrieved.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    response = together_client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[{"role": "user", "content": instruction}],
        max_tokens=400,
        temperature=0.2
    )
    return response.choices[0].message.content
except Exception as e:
    return f"Error in understanding phase: {str(e)}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def phase_2_plan(question, understanding, schema, previous_error=None):
    """
    PHASE 2: PLAN
    Create a detailed plan for constructing the SQL query
    """
    error_context = ""
    if previous_error:
        error_context = f"\n\nPREVIOUS ATTEMPT FAILED WITH ERROR:\n{previous_error}\n\nAdjust your plan to avoid this error."

    instruction = f"""
You are a database expert creating a query plan.

DATABASE SCHEMA:
{schema}

USER QUESTION: "{question}"

YOUR UNDERSTANDING: {understanding}
{error_context}

TASK: Create a detailed SQL query plan:
1. Which tables need to be queried?
2. What JOIN operations are required (if any)?
3. What WHERE conditions are needed?
4. Are GROUP BY or aggregate functions needed?
5. Should results be sorted or limited?
6. What specific columns should be selected?

Provide a step-by-step plan in numbered format.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    response = together_client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[{"role": "user", "content": instruction}],
        max_tokens=500,
        temperature=0.2
    )
    return response.choices[0].message.content
except Exception as e:
    return f"Error in planning phase: {str(e)}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def phase_3_generate(question, understanding, plan, schema, previous_error=None, previous_sql=None):
    """
    PHASE 3: GENERATE
    Generate the actual SQL query based on the plan
    """
    error_context = ""
    if previous_error and previous_sql:
        error_context = f"""
PREVIOUS SQL ATTEMPT:
{previous_sql}

ERROR RECEIVED:
{previous_error}

IMPORTANT: Fix the specific error above. Common issues:
- Incorrect table or column names (check schema carefully)
- Missing JOIN conditions
- Syntax errors
- Incorrect aggregate functions
"""

    instruction = f"""
You are a SQL expert generating a query.

DATABASE SCHEMA:
{schema}

USER QUESTION: "{question}"

YOUR UNDERSTANDING: {understanding}

YOUR PLAN: {plan}
{error_context}

TASK: Generate a valid SQLite SQL query that:
1. Follows the plan exactly
2. Uses correct table and column names from the schema
3. Includes proper JOIN syntax if needed
4. Has correct WHERE, GROUP BY, and ORDER BY clauses
5. Is properly formatted and includes comments

CRITICAL RULES:
- Only use tables and columns that exist in the schema above
- For SQLite, use proper syntax (e.g., no TOP keyword, use LIMIT instead)
- Always use table aliases for clarity in JOINs
- Double-check column names match the schema exactly

Respond with ONLY the SQL query, nothing else. Do not include markdown code blocks or explanations.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    response = together_client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[{"role": "user", "content": instruction}],
        max_tokens=600,
        temperature=0.1
    )
    sql_query = response.choices[0].message.content.strip()

    # Clean up the SQL (remove markdown code blocks if present)
    sql_query = re.sub(r'^```sql\s*', '', sql_query)
    sql_query = re.sub(r'^```\s*', '', sql_query)
    sql_query = re.sub(r'\s*```$', '', sql_query)

    return sql_query.strip()
except Exception as e:
    return f"Error in generation phase: {str(e)}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def phase_4_validate(sql_query, schema):
    """
    PHASE 4: VALIDATE
    Validate the SQL query for safety and correctness before execution
    """
    validation_results = {
        "is_safe": True,
        "is_valid": True,
        "warnings": [],
        "errors": []
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;## Check for dangerous operations
dangerous_keywords = ['DROP', 'DELETE', 'UPDATE', 'INSERT', 'ALTER', 'TRUNCATE', 'GRANT', 'REVOKE']
sql_upper = sql_query.upper()

for keyword in dangerous_keywords:
    if keyword in sql_upper:
        validation_results["is_safe"] = False
        validation_results["errors"].append(f"Dangerous operation detected: {keyword}")

# Check for SELECT statement
if not sql_upper.strip().startswith('SELECT'):
    validation_results["is_safe"] = False
    validation_results["errors"].append("Only SELECT queries are allowed")

# Check for semicolon (potential SQL injection)
if sql_query.count(';') &amp;gt; 1:
    validation_results["warnings"].append("Multiple statements detected - potential security risk")

# Basic syntax check
if 'FROM' not in sql_upper:
    validation_results["is_valid"] = False
    validation_results["errors"].append("Invalid SQL: Missing FROM clause")

return validation_results
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
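&lt;p&gt;The rules above are deterministic, so they can be verified with a standalone sketch of the validator (simplified here to omit the warnings check) run against one safe and one unsafe query:&lt;/p&gt;

```python
# Simplified sketch of phase_4_validate's safety checks
def validate(sql_query):
    results = {"is_safe": True, "is_valid": True, "errors": []}
    dangerous = ["DROP", "DELETE", "UPDATE", "INSERT",
                 "ALTER", "TRUNCATE", "GRANT", "REVOKE"]
    sql_upper = sql_query.upper()
    for keyword in dangerous:
        if keyword in sql_upper:
            results["is_safe"] = False
            results["errors"].append(f"Dangerous operation detected: {keyword}")
    if not sql_upper.strip().startswith("SELECT"):
        results["is_safe"] = False
        results["errors"].append("Only SELECT queries are allowed")
    if "FROM" not in sql_upper:
        results["is_valid"] = False
        results["errors"].append("Invalid SQL: Missing FROM clause")
    return results

ok = validate("SELECT first_name FROM employees")
bad = validate("DROP TABLE employees")
print(ok["is_safe"], bad["is_safe"])
```

&lt;p&gt;One caveat: the keyword check is substring-based, so a column alias like &lt;code&gt;last_update&lt;/code&gt; would trip the UPDATE rule; matching on word boundaries with a regex is a reasonable hardening step.&lt;/p&gt;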

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def phase_5_execute(sql_query):
    """
    PHASE 5: EXECUTE
    Execute the SQL query and return results
    """
    try:
        conn = sqlite3.connect('company.db')
        df = pd.read_sql_query(sql_query, conn)
        conn.close()
        return df, None
    except Exception as e:
        return None, str(e)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Step 4: Self-Correction and Retry Logic
&lt;/h2&gt;

&lt;p&gt;This is where our agent becomes truly intelligent. When a query fails, the agent analyzes the error and attempts to fix it:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def diagnose_error(sql_query, error_message, schema):
    """
    Use the LLM to diagnose what went wrong and provide guidance
    """
    instruction = f"""
You are a SQL debugging expert.

DATABASE SCHEMA:
{schema}

FAILED SQL QUERY:
{sql_query}

ERROR MESSAGE:
{error_message}

TASK: Diagnose the specific problem:
1. What caused this error?
2. Which part of the query is incorrect?
3. What should be changed to fix it?

Be specific and concise. Focus on the exact fix needed.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;try:
    response = together_client.chat.completions.create(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        messages=[{"role": "user", "content": instruction}],
        max_tokens=300,
        temperature=0.2
    )
    return response.choices[0].message.content
except Exception as e:
    return f"Error in diagnosis: {str(e)}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def execute_query_with_retry(question, schema, max_attempts=3):
    """
    Complete pipeline with retry logic
    """
    attempt_history = []
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;for attempt in range(max_attempts):
    attempt_info = {
        "attempt_number": attempt + 1,
        "understanding": "",
        "plan": "",
        "sql": "",
        "validation": {},
        "result": None,
        "error": None,
        "diagnosis": ""
    }

    # Get previous error context if this is a retry
    previous_error = attempt_history[-1]["error"] if attempt_history else None
    previous_sql = attempt_history[-1]["sql"] if attempt_history else None

    # Phase 1: Understand
    understanding = phase_1_understand(question, schema)
    attempt_info["understanding"] = understanding

    # Phase 2: Plan (with error context if retrying)
    plan = phase_2_plan(question, understanding, schema, previous_error)
    attempt_info["plan"] = plan

    # Phase 3: Generate SQL (with error context if retrying)
    sql_query = phase_3_generate(question, understanding, plan, schema, previous_error, previous_sql)
    attempt_info["sql"] = sql_query

    # Phase 4: Validate
    validation = phase_4_validate(sql_query, schema)
    attempt_info["validation"] = validation

    if not validation["is_safe"] or not validation["is_valid"]:
        attempt_info["error"] = "Query failed validation: " + ", ".join(validation["errors"])
        attempt_history.append(attempt_info)
        continue

    # Phase 5: Execute
    result, error = phase_5_execute(sql_query)
    attempt_info["result"] = result
    attempt_info["error"] = error

    if error:
        # Diagnose the error
        diagnosis = diagnose_error(sql_query, error, schema)
        attempt_info["diagnosis"] = diagnosis
        attempt_history.append(attempt_info)
        continue

    # Success!
    attempt_history.append(attempt_info)
    return result, sql_query, attempt_history

# All attempts failed
return None, None, attempt_history
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def format_natural_language_response(question, df, sql_query):
    """
    Generate a natural language explanation of the results
    """
    if df is None or df.empty:
        return "No results found for your query."

    result_preview = df.head(10).to_string()
    row_count = len(df)

    instruction = f"""
You are a data analyst explaining query results to a business user.

USER QUESTION: "{question}"

SQL QUERY EXECUTED:
{sql_query}

RESULTS ({row_count} rows total):
{result_preview}

TASK: Provide a clear, business-friendly explanation of these results:

1. Directly answer the user's question
2. Highlight key insights from the data
3. Mention the total number of results if relevant
4. Keep it concise but informative

Do not show the SQL query or technical details. Focus on the business meaning.
"""

    try:
        response = together_client.chat.completions.create(
            model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
            messages=[{"role": "user", "content": instruction}],
            max_tokens=400,
            temperature=0.3
        )
        return response.choices[0].message.content
    except Exception as e:
        return f"Results retrieved successfully. {row_count} rows returned."
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  Step 5: Building the Streamlit User Interface
&lt;/h2&gt;

&lt;p&gt;Now let's create an interactive interface that showcases the agent's cognitive process:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def main():
    st.set_page_config(page_title="SQL Agent with Llama-4-Scout", layout="wide")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;st.title("🤖 Self-Correcting Database Agent")
st.markdown("*Powered by Meta's Llama-4-Scout*")

# Initialize session state
if 'db_initialized' not in st.session_state:
    st.session_state.db_initialized = False
if 'query_history' not in st.session_state:
    st.session_state.query_history = []

# Sidebar for database setup
with st.sidebar:
    st.header("⚙️ Database Setup")

    if not st.session_state.db_initialized:
        if st.button("Initialize Sample Database", type="primary"):
            with st.spinner("Creating database..."):
                result = create_sample_database()
                st.session_state.db_initialized = True
                st.success(result)
                st.rerun()
    else:
        st.success("✅ Database ready")

        if st.button("Reset Database"):
            create_sample_database()
            st.session_state.query_history = []
            st.success("Database reset!")
            st.rerun()

    st.markdown("---")
    st.header("📊 Database Schema")

    if st.session_state.db_initialized:
        with st.expander("View Schema", expanded=False):
            schema = get_database_schema()
            st.code(schema, language="text")

    st.markdown("---")
    st.header("💡 Example Questions")
    st.markdown("""
    **Simple:**
    - How many employees do we have?
    - Show all departments

    **Medium:**
    - What's the average salary by department?
    - Who are the managers?

    **Complex:**
    - Which employees earn more than their manager?
    - Show departments with highest average salaries

    **Advanced:**
    - Compare project budgets to department budgets
    - Find managers with most direct reports
    """)

# Main content area
if not st.session_state.db_initialized:
    st.info("👈 Please initialize the database using the sidebar to get started.")
    return

st.header("💬 Ask Questions About Your Data")

# Query input
col1, col2 = st.columns([3, 1])

with col1:
    user_question = st.text_input(
        "Your Question:",
        placeholder="e.g., Who are the top 5 highest paid employees?",
        key="question_input"
    )

with col2:
    st.write("")  # Spacing
    st.write("")  # Spacing
    show_thinking = st.checkbox("Show reasoning", value=True)

if st.button("🔍 Get Answer", type="primary", use_container_width=True):
    if not user_question:
        st.warning("Please enter a question.")
        return

    # Get schema
    schema = get_database_schema()

    # Execute query with retry logic
    with st.spinner("🧠 Agent is thinking..."):
        result_df, final_sql, attempt_history = execute_query_with_retry(
            user_question, 
            schema, 
            max_attempts=3
        )

    # Display results
    if result_df is not None:
        st.success("✅ Query successful!")

        # Show the agent's thinking process
        if show_thinking:
            with st.expander("🧠 Agent's Cognitive Process", expanded=True):
                for i, attempt in enumerate(attempt_history):
                    if len(attempt_history) &amp;gt; 1:
                        if i &amp;lt; len(attempt_history) - 1:
                            st.error(f"❌ Attempt {attempt['attempt_number']} (Failed)")
                        else:
                            st.success(f"✅ Attempt {attempt['attempt_number']} (Success)")

                    col1, col2 = st.columns(2)

                    with col1:
                        st.markdown("**1️⃣ UNDERSTAND**")
                        st.info(attempt['understanding'])

                    with col2:
                        st.markdown("**2️⃣ PLAN**")
                        st.info(attempt['plan'])

                    st.markdown("**3️⃣ GENERATE SQL**")
                    st.code(attempt['sql'], language="sql")

                    if attempt['validation']['warnings']:
                        st.warning("Validation Warnings: " + ", ".join(attempt['validation']['warnings']))

                    if attempt['error']:
                        st.error(f"**Error:** {attempt['error']}")

                        if attempt['diagnosis']:
                            st.markdown("**🔍 Diagnosis:**")
                            st.warning(attempt['diagnosis'])

                    if i &amp;lt; len(attempt_history) - 1:
                        st.markdown("---")
                        st.markdown("↻ **Retrying with corrections...**")
                        st.markdown("---")

        # Display natural language response
        st.markdown("### 📝 Answer")
        with st.spinner("Generating explanation..."):
            explanation = format_natural_language_response(user_question, result_df, final_sql)
        st.markdown(explanation)

        # Display results table
        st.markdown("### 📊 Data")
        st.dataframe(result_df, use_container_width=True)

        # Show final SQL
        with st.expander("View SQL Query"):
            st.code(final_sql, language="sql")

        # Add to history
        st.session_state.query_history.append({
            "question": user_question,
            "sql": final_sql,
            "rows": len(result_df),
            "attempts": len(attempt_history),
            "timestamp": datetime.now().strftime("%Y-%m-%d %H:%M:%S")
        })

    else:
        st.error("❌ Query failed after all retry attempts")

        if show_thinking:
            with st.expander("🧠 All Attempts", expanded=True):
                for i, attempt in enumerate(attempt_history):
                    st.error(f"**Attempt {attempt['attempt_number']}**")
                    st.code(attempt['sql'], language="sql")
                    st.error(f"Error: {attempt['error']}")
                    if attempt['diagnosis']:
                        st.warning(f"Diagnosis: {attempt['diagnosis']}")
                    st.markdown("---")

# Query history
if st.session_state.query_history:
    st.markdown("---")
    st.header("📜 Query History")

    history_df = pd.DataFrame(st.session_state.query_history)
    st.dataframe(history_df, use_container_width=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;h2&gt;
  
  
  Technical Implementation Insights
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. The Power of Multi-Phase Reasoning
&lt;/h3&gt;

&lt;p&gt;The five-phase approach significantly improves query accuracy compared to single-shot generation. Here's why:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Phase Separation Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Understand&lt;/strong&gt;: Forces the model to explicitly identify required data before coding&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Plan&lt;/strong&gt;: Creates a logical roadmap, reducing the chance of structural errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate&lt;/strong&gt;: Focuses solely on syntax, with context from previous phases&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Validate&lt;/strong&gt;: Catches dangerous operations before execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt;: Provides real-world feedback for learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This separation of concerns mirrors how human SQL experts approach complex queries—they don't immediately write code; they think through the problem first.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Self-Correction Through Error Analysis
&lt;/h3&gt;

&lt;p&gt;The retry mechanism is effective because each retry supplies the agent with explicit context about the previous failure:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;error_context = f"""
PREVIOUS SQL ATTEMPT:
{previous_sql}

ERROR RECEIVED:
{previous_error}

IMPORTANT: Fix the specific error above.
"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This context allows the model to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;See exactly what it tried before&lt;/li&gt;
&lt;li&gt;Understand why it failed&lt;/li&gt;
&lt;li&gt;Adjust its approach specifically&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Most text-to-SQL systems fail silently. Our agent learns and adapts.&lt;/p&gt;
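A minimal sketch of how that context can be assembled before regenerating. The function name and prompt wording here are illustrative, not the article's exact template:

```python
def build_retry_prompt(base_prompt, previous_sql=None, previous_error=None):
    """Prepend the failure context to the generation prompt on retries."""
    if previous_sql is None or previous_error is None:
        return base_prompt  # first attempt: nothing to learn from yet
    error_context = (
        f"PREVIOUS SQL ATTEMPT:\n{previous_sql}\n\n"
        f"ERROR RECEIVED:\n{previous_error}\n\n"
        "IMPORTANT: Fix the specific error above.\n\n"
    )
    return error_context + base_prompt

# First attempt passes through unchanged; retries carry the exact failure forward.
first = build_retry_prompt("Generate SQL for: top 5 salaries")
retry = build_retry_prompt(
    "Generate SQL for: top 5 salaries",
    previous_sql="SELECT nme FROM employees",
    previous_error="no such column: nme",
)
```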

&lt;h3&gt;
  
  
  3. Safety Through Validation
&lt;/h3&gt;

&lt;p&gt;The validation phase prevents dangerous operations:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dangerous_keywords = ['DROP', 'DELETE', 'UPDATE', 'INSERT', 'ALTER', 'TRUNCATE']
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This is critical for production systems where a malicious or misunderstood prompt could corrupt data. The agent is constrained to read-only operations, making it safe for business users.&lt;/p&gt;
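A sketch of how such a check might look, assuming the same keyword list. Matching on word boundaries avoids false positives on identifiers like `updated_at`:

```python
import re

dangerous_keywords = ['DROP', 'DELETE', 'UPDATE', 'INSERT', 'ALTER', 'TRUNCATE']

def is_safe_query(sql_query):
    """Reject any statement containing a write/DDL keyword as a whole word."""
    upper = sql_query.upper()
    return not any(re.search(rf"\b{kw}\b", upper) for kw in dangerous_keywords)
```

A keyword denylist is a coarse filter; running the connection in read-only mode is a sensible second layer of defense.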

&lt;h3&gt;
  
  
  4. Schema-Aware Query Generation
&lt;/h3&gt;

&lt;p&gt;By providing the complete schema to the agent at each phase, we eliminate a common failure mode: queries that reference non-existent tables or columns. The schema serves as ground truth that the agent must respect.&lt;/p&gt;
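For SQLite, that schema text can be assembled directly from the database's own metadata, so it never drifts from the ground truth; a minimal sketch (the output formatting is illustrative):

```python
import sqlite3

def get_schema_text(conn):
    """Describe every table as 'TABLE name: col type, ...' from SQLite metadata."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type='table' ORDER BY name")
    lines = []
    for (table,) in cur.fetchall():
        cur.execute(f"PRAGMA table_info({table})")
        # table_info rows are (cid, name, type, notnull, dflt_value, pk)
        cols = ", ".join(f"{col[1]} {col[2]}" for col in cur.fetchall())
        lines.append(f"TABLE {table}: {cols}")
    return "\n".join(lines)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (id INTEGER, name TEXT, salary REAL)")
schema = get_schema_text(conn)
```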




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Building a self-correcting database agent with Llama-4-Scout demonstrates the power of structured agentic reasoning. Unlike simple prompt-and-hope approaches, our five-phase cognitive framework—&lt;strong&gt;Understand → Plan → Generate → Validate → Execute&lt;/strong&gt;—creates a system that thinks through problems methodically and learns from mistakes.&lt;/p&gt;

&lt;p&gt;The key innovations in this implementation are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Multi-phase reasoning&lt;/strong&gt;: Each phase has a specific cognitive responsibility&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent retry logic&lt;/strong&gt;: The agent learns from failures and adapts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Safety-first design&lt;/strong&gt;: Validation prevents destructive operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Natural language interface&lt;/strong&gt;: Business users can access data without SQL knowledge&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparent reasoning&lt;/strong&gt;: Users can inspect the agent's thought process&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This approach can be extended beyond SQL to other structured tasks: API query generation, configuration file creation, or any domain where precision matters and mistakes have consequences. The think-plan-execute-validate-retry pattern is broadly applicable to agentic systems.&lt;/p&gt;

&lt;p&gt;As LLMs continue to improve, we can expect even more sophisticated error recovery, better schema understanding, and more nuanced query optimization. The combination of powerful base models like Llama-4-Scout with well-designed cognitive frameworks opens up new possibilities for making complex technical systems accessible to everyone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ready to build your own database agent?&lt;/strong&gt; The complete code is provided in this article. Start with the sample database, test with the example questions, and then adapt it to your own data infrastructure. The future of data access is conversational, safe, and intelligent.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>agenticai</category>
      <category>python</category>
    </item>
    <item>
      <title>Student-Teacher Distillation: A Complete Guide for Model Compression</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Tue, 26 Aug 2025 18:48:58 +0000</pubDate>
      <link>https://dev.to/angu10/student-teacher-distillation-a-complete-guide-for-model-compression-37ed</link>
      <guid>https://dev.to/angu10/student-teacher-distillation-a-complete-guide-for-model-compression-37ed</guid>
      <description>&lt;p&gt;&lt;em&gt;Part 1 of our Deep Learning Model Optimization Series&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In the rapidly evolving world of machine learning, deploying large, powerful models in production environments often presents significant challenges. Enter student-teacher distillation—a powerful technique that allows us to compress the knowledge of complex models into smaller, more efficient ones without sacrificing too much accuracy. This comprehensive guide will walk you through everything you need to know about this fascinating approach to model optimization.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Student-Teacher Distillation?
&lt;/h2&gt;

&lt;p&gt;Student-teacher distillation, also known as knowledge distillation, is a model compression technique where a smaller "student" model learns to mimic the behavior of a larger, more complex "teacher" model. Think of it as an experienced professor (teacher) passing on their knowledge and wisdom to a bright student who can then apply that knowledge more efficiently.&lt;/p&gt;

&lt;p&gt;The core idea is beautifully simple: instead of training the student model from scratch on the original data, we train it to match the teacher's outputs. The teacher model provides "soft targets" or probability distributions that contain richer information than the hard labels in the original dataset. These soft targets capture the teacher's uncertainty and the relationships between different classes, which helps the student learn more nuanced decision boundaries.&lt;/p&gt;
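In the standard formulation, the student minimizes a weighted mix of a soft-target loss against the teacher's distribution and the usual hard-label loss; a dependency-free sketch, with the weighting `alpha` as an assumed hyperparameter:

```python
import math

def cross_entropy(p_true, q_pred, eps=1e-12):
    """H(p, q) = -sum_i p_i * log q_i over probability vectors."""
    return -sum(p * math.log(q + eps) for p, q in zip(p_true, q_pred))

def distillation_loss(student_probs, teacher_probs, hard_label, alpha=0.7):
    """Blend the soft-target loss (vs. teacher) with the hard-label loss."""
    soft_loss = cross_entropy(teacher_probs, student_probs)
    one_hot = [1.0 if i == hard_label else 0.0 for i in range(len(student_probs))]
    hard_loss = cross_entropy(one_hot, student_probs)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Student vs. teacher over four classes; the true label is class 0.
loss = distillation_loss(
    student_probs=[0.7, 0.2, 0.07, 0.03],
    teacher_probs=[0.8, 0.15, 0.03, 0.02],
    hard_label=0,
)
```

In practice, frameworks compute the soft term as a KL divergence over temperature-scaled logits, but the weighted-sum structure is the same.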

&lt;h3&gt;
  
  
  The Magic Behind Soft Targets
&lt;/h3&gt;

&lt;p&gt;When a teacher model predicts a class, it doesn't just output a binary decision—it provides a probability distribution across all possible classes. For example, when classifying an image of a dog, the teacher might output:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dog: 0.8&lt;/li&gt;
&lt;li&gt;Wolf: 0.15&lt;/li&gt;
&lt;li&gt;Cat: 0.03&lt;/li&gt;
&lt;li&gt;Other: 0.02&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This distribution tells us that while the model is confident it's a dog, it also sees some wolf-like features. This nuanced understanding, captured in the soft targets, helps the student model learn more effectively than just knowing "this is a dog."&lt;/p&gt;
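Those soft targets come from a temperature-scaled softmax over the teacher's logits; raising the temperature flattens the distribution and surfaces more of the relationships between runner-up classes. A small sketch, with logit values chosen to roughly reproduce the dog/wolf/cat example above:

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; T > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [5.0, 3.3, 1.7, 1.3]  # dog, wolf, cat, other (illustrative values)
p_sharp = softmax(logits, temperature=1.0)  # close to [0.80, 0.15, 0.03, 0.02]
p_soft = softmax(logits, temperature=4.0)   # wolf's share grows noticeably
```

The student is typically trained against the high-temperature targets, which carry more information per example than the one-hot label.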

&lt;h2&gt;
  
  
  Student-Teacher Distillation vs. Fine-Tuning: Understanding the Key Differences
&lt;/h2&gt;

&lt;p&gt;While both techniques involve training smaller models, they serve fundamentally different purposes and follow distinct approaches.&lt;/p&gt;

&lt;h3&gt;
  
  
  Fine-Tuning: Adapting Pre-trained Knowledge
&lt;/h3&gt;

&lt;p&gt;Fine-tuning takes a pre-trained model (often trained on a large, general dataset) and adapts it to a specific task or domain. You start with a model that already understands general patterns and then specialize it for your particular use case. It's like taking a general physician and having them specialize in cardiology.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key characteristics of fine-tuning:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Starts with a pre-trained model of similar or identical size&lt;/li&gt;
&lt;li&gt;Adapts existing knowledge to new domains or tasks&lt;/li&gt;
&lt;li&gt;Typically involves training on task-specific data&lt;/li&gt;
&lt;li&gt;The final model size remains roughly the same&lt;/li&gt;
&lt;li&gt;Primary goal is task adaptation, not compression&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Student-Teacher Distillation: Knowledge Compression
&lt;/h3&gt;

&lt;p&gt;Distillation, on the other hand, is primarily about compression and efficiency. You're taking a large, complex model and teaching a smaller model to replicate its behavior. The student model learns to approximate the teacher's decision-making process within a more constrained architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key characteristics of distillation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a smaller model from a larger one&lt;/li&gt;
&lt;li&gt;Focuses on knowledge transfer and compression&lt;/li&gt;
&lt;li&gt;The student learns from the teacher's outputs, not just original data&lt;/li&gt;
&lt;li&gt;Significant reduction in model size and computational requirements&lt;/li&gt;
&lt;li&gt;Primary goal is efficiency while maintaining accuracy&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to Use Which Approach
&lt;/h3&gt;

&lt;p&gt;Choose &lt;strong&gt;fine-tuning&lt;/strong&gt; when you have a model that's already the right size for your deployment constraints, but you need to adapt it to a specific domain or task. Choose &lt;strong&gt;distillation&lt;/strong&gt; when you have a high-performing model that's too large or slow for your production requirements.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Choose the Right Teacher Model
&lt;/h2&gt;

&lt;p&gt;Selecting an appropriate teacher model is crucial for successful distillation. The teacher sets the accuracy ceiling for your student, so this decision significantly impacts your final results.&lt;/p&gt;

&lt;h3&gt;
  
  
  Accuracy is King
&lt;/h3&gt;

&lt;p&gt;Your teacher model should excel at the target task. There's no point in distilling from a mediocre teacher—the student can only learn what the teacher knows. Look for models with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;High accuracy on your target dataset&lt;/li&gt;
&lt;li&gt;Strong generalization capabilities&lt;/li&gt;
&lt;li&gt;Robust accuracy across different data distributions&lt;/li&gt;
&lt;li&gt;Well-calibrated confidence scores&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architecture Considerations
&lt;/h3&gt;

&lt;p&gt;While the teacher doesn't need to share the same architecture as the student, some considerations matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Complexity advantage&lt;/strong&gt;: The teacher should be significantly more complex than the student to justify the distillation process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task alignment&lt;/strong&gt;: Models designed for similar tasks often make better teachers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output compatibility&lt;/strong&gt;: Ensure the teacher's output format aligns with your distillation setup&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Practical Factors
&lt;/h3&gt;

&lt;p&gt;Don't overlook practical constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Computational resources&lt;/strong&gt;: You need to be able to run inference on the teacher model during training&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Licensing and availability&lt;/strong&gt;: Ensure you have access to the teacher model and can use it for your purposes&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data compatibility&lt;/strong&gt;: The teacher should work well with your training data&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Multi-Teacher Approaches
&lt;/h3&gt;

&lt;p&gt;Consider using multiple teacher models when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Different teachers excel at different aspects of the task&lt;/li&gt;
&lt;li&gt;You want to ensemble knowledge from various sources&lt;/li&gt;
&lt;li&gt;You're working with complex, multi-modal tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Selecting the Perfect Student Model
&lt;/h2&gt;

&lt;p&gt;Choosing the student model involves balancing accuracy goals with deployment constraints. This is where the art of distillation really shines.&lt;/p&gt;

&lt;h3&gt;
  
  
  Size and Efficiency Targets
&lt;/h3&gt;

&lt;p&gt;Start by defining your deployment requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Latency constraints&lt;/strong&gt;: How fast must inference be?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory limitations&lt;/strong&gt;: What's your RAM/storage budget?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Power consumption&lt;/strong&gt;: Are you deploying on mobile or edge devices?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Throughput requirements&lt;/strong&gt;: How many predictions per second do you need?&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Architectural Choices
&lt;/h3&gt;

&lt;p&gt;The student architecture should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Appropriately sized&lt;/strong&gt;: Small enough to meet deployment constraints, large enough to capture essential patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Well-suited to the task&lt;/strong&gt;: Some architectures naturally excel at certain types of problems&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Efficiently designed&lt;/strong&gt;: Modern efficient architectures like MobileNets, EfficientNets, or DistilBERT are often good starting points&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The Goldilocks Principle
&lt;/h3&gt;

&lt;p&gt;Your student model size should be "just right":&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Too small&lt;/strong&gt;: The model lacks the capacity to learn the teacher's knowledge effectively&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Too large&lt;/strong&gt;: You lose the efficiency benefits and might as well use a larger model directly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Just right&lt;/strong&gt;: Provides the best trade-off between accuracy and efficiency&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Advanced Strategy: Progressive Distillation
&lt;/h3&gt;

&lt;p&gt;For very large compression ratios, consider progressive distillation as your student model selection strategy:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with a large teacher&lt;/li&gt;
&lt;li&gt;Distill to a medium-sized intermediate model&lt;/li&gt;
&lt;li&gt;Use the intermediate model as a teacher for an even smaller student&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This stepped approach often yields better results than trying to compress directly from very large to very small models.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pros and Cons of Student-Teacher Distillation
&lt;/h2&gt;

&lt;p&gt;Like any technique, distillation comes with its own set of advantages and limitations. Understanding these will help you make informed decisions about when and how to apply this approach.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Compelling Advantages
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Significant Model Compression&lt;/strong&gt;&lt;br&gt;
The most obvious benefit is the dramatic reduction in model size. You can often achieve 5-10x compression while retaining 90-95% of the original accuracy. This makes deployment feasible in resource-constrained environments.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Faster Inference&lt;/strong&gt;&lt;br&gt;
Smaller models mean faster predictions. This translates to better user experience, lower latency, and the ability to serve more requests with the same hardware.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lower Computational Costs&lt;/strong&gt;&lt;br&gt;
Reduced model size means lower memory usage, less power consumption, and cheaper inference costs—especially important when serving millions of requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Preserved Knowledge Quality&lt;/strong&gt;&lt;br&gt;
Unlike simple pruning or quantization, distillation preserves the nuanced decision-making patterns of the teacher model. The student learns not just what to predict, but how to think about the problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enhanced Generalization&lt;/strong&gt;&lt;br&gt;
Soft targets from the teacher model often help students generalize better than training on hard labels alone. The teacher's uncertainty provides valuable regularization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Flexibility in Architecture&lt;/strong&gt;&lt;br&gt;
You can distill knowledge across different architectures, allowing you to optimize for specific deployment requirements while retaining accuracy.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Notable Limitations
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Accuracy Ceiling&lt;/strong&gt;&lt;br&gt;
The student can rarely exceed the teacher's accuracy. You're fundamentally limited by the teacher's knowledge and capabilities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Training Complexity&lt;/strong&gt;&lt;br&gt;
Distillation requires careful hyperparameter tuning, temperature selection, and loss function balancing. It's more complex than standard training.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Computational Overhead During Training&lt;/strong&gt;&lt;br&gt;
You need to run both teacher and student models during training, which can be computationally expensive and time-consuming.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Teacher Dependency&lt;/strong&gt;&lt;br&gt;
The quality of your distillation is fundamentally limited by your teacher model. A biased or inaccurate teacher will pass those flaws on to the student.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Diminishing Returns&lt;/strong&gt;&lt;br&gt;
Very aggressive compression (e.g., 100x smaller) often leads to significant accuracy degradation. There are practical limits to how much you can compress.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Task-Specific Effectiveness&lt;/strong&gt;&lt;br&gt;
Distillation works better for some tasks than others. Classification tasks often see better results than generation tasks, for instance.&lt;/p&gt;

&lt;h3&gt;
  
  
  When Distillation Shines
&lt;/h3&gt;

&lt;p&gt;Student-teacher distillation is particularly effective when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a high-accuracy large model that's too slow for production&lt;/li&gt;
&lt;li&gt;Deployment constraints (mobile, edge devices) require smaller models&lt;/li&gt;
&lt;li&gt;You need to serve high-volume requests efficiently&lt;/li&gt;
&lt;li&gt;The task has clear input-output relationships&lt;/li&gt;
&lt;li&gt;You have sufficient computational resources for training&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  When to Consider Alternatives
&lt;/h3&gt;

&lt;p&gt;Consider other approaches when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your teacher model isn't significantly better than smaller alternatives&lt;/li&gt;
&lt;li&gt;Training time and computational costs outweigh deployment benefits&lt;/li&gt;
&lt;li&gt;You need the absolute best accuracy regardless of size&lt;/li&gt;
&lt;li&gt;Your deployment environment can accommodate larger models&lt;/li&gt;
&lt;li&gt;The task requires capabilities that are hard to distill (like complex reasoning)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Looking Ahead
&lt;/h2&gt;

&lt;p&gt;Student-teacher distillation represents a powerful tool in the modern ML practitioner's toolkit. As models continue to grow larger and more capable, the ability to efficiently compress and deploy them becomes increasingly valuable.&lt;/p&gt;

&lt;p&gt;In our next article, we'll dive into the practical implementation details, including code examples, loss function design, and training strategies that will help you implement your own distillation pipeline. We'll explore different distillation variants, advanced techniques like attention transfer, and share best practices learned from real-world deployments.&lt;/p&gt;

&lt;p&gt;The journey from understanding the theory to implementing effective distillation systems is both challenging and rewarding. With the foundation we've built here, you're well-equipped to start exploring this fascinating area of machine learning optimization.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Stay tuned for Part 2, where we'll get our hands dirty with implementation details and practical code examples that will bring these concepts to life.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>Beyond the Competition: How Claude Sonnet 4, GPT-4o, and Gemini 2.5 Can Work Together in Agent Harmony</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Sun, 22 Jun 2025 17:51:30 +0000</pubDate>
      <link>https://dev.to/angu10/beyond-the-competition-how-claude-sonnet-4-gpt-4o-and-gemini-25-can-work-together-in-agent-3o6b</link>
      <guid>https://dev.to/angu10/beyond-the-competition-how-claude-sonnet-4-gpt-4o-and-gemini-25-can-work-together-in-agent-3o6b</guid>
      <description>&lt;p&gt;The AI landscape is often portrayed as a zero-sum game where models compete for dominance. But what if we shifted our perspective? Instead of choosing one model to rule them all, what if we leveraged the unique strengths of each model to create a more powerful, complementary system?&lt;/p&gt;

&lt;p&gt;In this article, we'll explore how Claude Sonnet 4, OpenAI's GPT-4o, and Google's Gemini 2.5 can work together in an agentic architecture, creating a symphony of AI capabilities that's greater than the sum of its parts.&lt;/p&gt;
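One hypothetical way to wire this up is a routing table that sends each task category to the model best suited for it, based on the strengths discussed in this article. The category names and model labels below are illustrative placeholders, not real API identifiers:

```python
# Hypothetical routing table mapping task categories to model labels.
ROUTES = {
    "analysis": "claude-sonnet-4",   # methodical reasoning and review
    "multimodal": "gpt-4o",          # text, vision, and audio inputs
    "math": "gemini-2.5",            # scientific and mathematical work
}

def route_task(category, default="gpt-4o"):
    """Pick the model for a task category, falling back to a generalist."""
    return ROUTES.get(category, default)
```

A production router would classify the incoming request first (possibly with one of these same models) before dispatching, but the dictionary captures the core idea.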

&lt;h2&gt;
  
  
  Understanding Each Model's Unique Strengths
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Claude Sonnet 4: The Thoughtful Analyst
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exceptional reasoning and analysis capabilities&lt;/li&gt;
&lt;li&gt;Strong ethical reasoning and safety considerations&lt;/li&gt;
&lt;li&gt;Excellent at breaking down complex problems methodically&lt;/li&gt;
&lt;li&gt;Superior performance in structured thinking and logical reasoning&lt;/li&gt;
&lt;li&gt;Excellent at handling nuanced conversations and context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ideal Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Code review and analysis&lt;/li&gt;
&lt;li&gt;Complex problem decomposition&lt;/li&gt;
&lt;li&gt;Ethical decision-making processes&lt;/li&gt;
&lt;li&gt;Research and analysis tasks&lt;/li&gt;
&lt;li&gt;Long-form content creation&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  GPT-4o: The Versatile Performer
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Excellent multimodal capabilities (text, vision, audio)&lt;/li&gt;
&lt;li&gt;Strong creative writing and content generation&lt;/li&gt;
&lt;li&gt;Robust API ecosystem and integration options&lt;/li&gt;
&lt;li&gt;Consistent performance across diverse tasks&lt;/li&gt;
&lt;li&gt;Great at following specific formatting instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ideal Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Content generation and creative writing&lt;/li&gt;
&lt;li&gt;Multimodal processing tasks&lt;/li&gt;
&lt;li&gt;API integrations and automation&lt;/li&gt;
&lt;li&gt;Quick prototyping and ideation&lt;/li&gt;
&lt;li&gt;Image analysis and description&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Gemini 2.5: The Technical Powerhouse
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Strengths:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exceptional mathematical and scientific reasoning&lt;/li&gt;
&lt;li&gt;Strong coding capabilities and technical documentation&lt;/li&gt;
&lt;li&gt;Excellent at handling large contexts and complex data&lt;/li&gt;
&lt;li&gt;Superior performance in research and technical analysis&lt;/li&gt;
&lt;li&gt;Great integration with Google's ecosystem&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Ideal Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scientific research and analysis&lt;/li&gt;
&lt;li&gt;Complex mathematical computations&lt;/li&gt;
&lt;li&gt;Technical documentation&lt;/li&gt;
&lt;li&gt;Data analysis and processing&lt;/li&gt;
&lt;li&gt;Integration with Google services&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Complementary Architecture: Building a Multi-Agent System
&lt;/h2&gt;

&lt;p&gt;Instead of choosing one model, let's design a system where each model handles what it does best. Here's how we can create a complementary agentic architecture:&lt;/p&gt;
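
&lt;p&gt;The division of labor can be captured in a small routing sketch. This is illustrative only: the task-category names and the fallback choice here are assumptions for demonstration, not part of any vendor API, and the full implementation follows below.&lt;/p&gt;

```python
# Illustrative task router: maps task categories to the model
# best suited to them, per the strengths discussed above.
# Category names are assumptions chosen for this example.
ROUTING_TABLE = {
    "analysis": "claude-sonnet-4",        # structured reasoning, review
    "ethics_review": "claude-sonnet-4",
    "content_generation": "gpt-4o",       # creative synthesis, multimodal
    "image_description": "gpt-4o",
    "technical_research": "gemini-2.5",   # math, large-context analysis
    "data_analysis": "gemini-2.5",
}

def route_task(task_type: str) -> str:
    """Return the model assigned to a task type.

    GPT-4o is used as the general-purpose fallback here, reflecting
    its consistency across diverse tasks (an assumption, not a rule).
    """
    return ROUTING_TABLE.get(task_type, "gpt-4o")
```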

&lt;h2&gt;
  
  
  Implementation: Python-Based Multi-Agent System
&lt;/h2&gt;

&lt;p&gt;Let's build a practical example that demonstrates how these models can work together. We'll create a research assistant that leverages all three models.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dataclasses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dataclass&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;enum&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Enum&lt;/span&gt;

&lt;span class="c1"&gt;# Mock API clients - Here we're going to have only Mock API 
# but reader can replace with actual API implementations
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Enum&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;CLAUDE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude-sonnet-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;GPT4O&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;GEMINI&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gemini-2.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@dataclass&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ModelType&lt;/span&gt;
    &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;
    &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;MultiAgentResearchAssistant&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLAUDE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_claude_client&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GPT4O&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_gpt4o_client&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GEMINI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_init_gemini_client&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_claude_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Initialize Claude client
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Claude Sonnet 4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analyst&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_gpt4o_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Initialize GPT-4o client
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;GPT-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creator&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_init_gemini_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="c1"&gt;# Initialize Gemini client
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Gemini 2.5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;technical_expert&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_topic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Orchestrates a comprehensive research process using all three models
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🔍 Starting research on: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Phase 1: Claude analyzes and breaks down the topic
&lt;/span&gt;        &lt;span class="n"&gt;analysis_task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_claude_analyze_topic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Phase 2: Gemini conducts technical research
&lt;/span&gt;        &lt;span class="n"&gt;technical_research&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_gemini_technical_research&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Phase 3: GPT-4o creates comprehensive content
&lt;/span&gt;        &lt;span class="n"&gt;final_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_gpt4o_synthesize_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis_task&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;technical_research&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Phase 4: Claude reviews and provides final insights
&lt;/span&gt;        &lt;span class="n"&gt;final_review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_claude_review_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;final_content&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;topic&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;analysis_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;technical_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;technical_research&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;final_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;final_review&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_create_summary&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="n"&gt;analysis_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;technical_research&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;final_review&lt;/span&gt;
            &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_claude_analyze_topic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Claude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s role: Thoughtful analysis and problem decomposition&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulate Claude's analytical approach
&lt;/span&gt;        &lt;span class="n"&gt;analysis&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Analysis of &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:

        1. Core Components:
           - Primary research areas to explore
           - Key stakeholders and perspectives
           - Potential challenges and considerations

        2. Research Strategy:
           - Technical aspects requiring deep expertise
           - Creative elements for engaging presentation
           - Ethical considerations and implications

        3. Success Metrics:
           - Accuracy and depth of information
           - Clarity of presentation
           - Practical applicability
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLAUDE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning_steps&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;considerations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_gemini_technical_research&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Gemini&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s role: Deep technical research and data analysis&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulate Gemini's technical research capabilities
&lt;/span&gt;        &lt;span class="n"&gt;research&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Technical Research for &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:

        📊 Data Analysis:
        - Latest statistical trends and patterns
        - Mathematical models and algorithms
        - Scientific papers and research findings

        🔬 Technical Implementation:
        - Code examples and technical specifications
        - Performance benchmarks and comparisons
        - Integration possibilities and frameworks

        📈 Quantitative Insights:
        - Market data and growth projections
        - Technical performance metrics
        - Scalability considerations
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GEMINI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;technical_research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;research&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.95&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data_points&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sources&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_gpt4o_synthesize_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;analysis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                                       &lt;span class="n"&gt;research&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;GPT-4o&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s role: Creative synthesis and content generation&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Simulate GPT-4o's content creation capabilities
&lt;/span&gt;        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        # Comprehensive Guide to &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        ## Executive Summary
        Based on our multi-faceted analysis, &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; represents a significant 
        opportunity with both technical and strategic implications.

        ## Key Findings
        - Strategic insights from analytical review
        - Technical breakthroughs from research data
        - Implementation roadmap for practical application

        ## Creative Applications
        - Innovative use cases and scenarios
        - Engaging examples and case studies
        - Visual concepts and presentation ideas

        ## Actionable Recommendations
        1. Immediate next steps
        2. Long-term strategic planning
        3. Risk mitigation strategies
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GPT4O&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content_synthesis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.89&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sections&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;recommendations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_claude_review_content&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Claude&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s role: Final review and quality assurance&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Quality Review:

        ✅ Strengths:
        - Comprehensive coverage of key topics
        - Well-structured and logical flow
        - Balanced technical and strategic perspectives

        🔧 Recommendations:
        - Consider adding more specific examples
        - Strengthen the conclusion with actionable insights
        - Ensure accessibility for diverse audiences

        📋 Final Assessment:
        Content meets high standards for accuracy, clarity, and usefulness.
        Ready for publication with minor enhancements.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLAUDE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quality_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;review&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.94&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review_criteria&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;passed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_create_summary&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a summary of the collaborative process&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        🤝 Collaborative Research Summary:

        Models Involved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
        Total Tasks: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;
        Average Confidence: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

        Process Flow:
        1. Claude provided analytical framework and strategic thinking
        2. Gemini delivered technical depth and data-driven insights
        3. GPT-4o synthesized information into engaging, actionable content
        4. Claude conducted final quality review and validation

        This complementary approach leveraged each model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s unique strengths
        to produce a more comprehensive and valuable outcome.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="c1"&gt;# Advanced Use Case: Code Review Pipeline
&lt;/span&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CodeReviewPipeline&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;assistant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiAgentResearchAssistant&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;review_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Any&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Multi-model code review process
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Claude: Logical analysis and architecture review
&lt;/span&gt;        &lt;span class="n"&gt;claude_review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_claude_code_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Gemini: Technical optimization and performance analysis
&lt;/span&gt;        &lt;span class="n"&gt;gemini_review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_gemini_performance_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# GPT-4o: Documentation and improvement suggestions
&lt;/span&gt;        &lt;span class="n"&gt;gpt4o_review&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_gpt4o_documentation_review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;logical_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;claude_review&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;performance_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gemini_review&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documentation_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;gpt4o_review&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;combined_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_calculate_combined_score&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
                &lt;span class="n"&gt;claude_review&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gemini_review&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;gpt4o_review&lt;/span&gt;
            &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_claude_code_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Claude analyzes code logic and architecture&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLAUDE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_logic_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Logical structure is sound with clear separation of concerns...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.91&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;issues_found&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;suggestions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_gemini_performance_analysis&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Gemini analyzes performance and optimization opportunities&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GEMINI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;performance_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Performance bottlenecks identified in data processing loops...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.88&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;optimizations&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;complexity_score&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;7.2&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_gpt4o_documentation_review&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;GPT-4o reviews documentation and suggests improvements&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GPT4O&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documentation_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Documentation coverage is 73% with opportunities for improvement...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;confidence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;doc_coverage&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.73&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;improvement_areas&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_calculate_combined_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Calculate a weighted combined score&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;weights&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;code_logic_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;performance_analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.35&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                  &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;documentation_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.25&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

        &lt;span class="n"&gt;total_score&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;weight&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;weights&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.33&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;total_score&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="n"&gt;weight&lt;/span&gt;

        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;total_score&lt;/span&gt;

&lt;span class="c1"&gt;# Usage Example
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="c1"&gt;# Initialize the multi-agent system
&lt;/span&gt;    &lt;span class="n"&gt;research_assistant&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;MultiAgentResearchAssistant&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;code_reviewer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CodeReviewPipeline&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Example 1: Research a complex topic
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== Research Assistant Example ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;research_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;research_assistant&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;research_topic&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Implementing Microservices Architecture with Event-Driven Design&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research completed with &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; phases&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

    &lt;span class="c1"&gt;# Example 2: Code review process
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== Code Review Example ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;sample_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    def process_data(data_list):
        result = []
        for item in data_list:
            if item &amp;gt; 0:
                result.append(item * 2)
        return result
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;review_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;code_reviewer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;review_code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sample_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;python&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Code review completed with combined score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;review_result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;combined_score&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Real-World Applications and Benefits
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Content Creation Pipeline
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt;: Analyzes the audience and creates a content strategy&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: Verifies technical accuracy and validates supporting data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o&lt;/strong&gt;: Generates engaging, well-formatted content&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Software Development
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt;: Architectural decisions and code logic review&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: Performance optimization and technical implementation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o&lt;/strong&gt;: Documentation, testing strategies, and user interface design&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Research and Analysis
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Claude&lt;/strong&gt;: Problem decomposition and critical thinking&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Gemini&lt;/strong&gt;: Data analysis and scientific methodology&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT-4o&lt;/strong&gt;: Report writing and presentation creation&lt;/li&gt;
&lt;/ul&gt;
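
&lt;p&gt;The three role mappings above boil down to a dispatch table. A minimal, self-contained sketch (the pipeline and stage names here are illustrative, not part of any real API):&lt;/p&gt;

```python
from enum import Enum

class ModelType(Enum):
    CLAUDE = "claude"
    GEMINI = "gemini"
    GPT4O = "gpt-4o"

# Illustrative role map mirroring the three pipelines described above
PIPELINE_ROLES = {
    "content_creation": {
        "strategy": ModelType.CLAUDE,       # audience analysis, content strategy
        "fact_checking": ModelType.GEMINI,  # technical accuracy, data validation
        "drafting": ModelType.GPT4O,        # engaging, well-formatted content
    },
    "software_development": {
        "architecture_review": ModelType.CLAUDE,
        "performance_tuning": ModelType.GEMINI,
        "documentation": ModelType.GPT4O,
    },
    "research_analysis": {
        "problem_decomposition": ModelType.CLAUDE,
        "data_analysis": ModelType.GEMINI,
        "report_writing": ModelType.GPT4O,
    },
}

def model_for(pipeline: str, stage: str) -> ModelType:
    """Look up which model owns a given stage of a pipeline."""
    return PIPELINE_ROLES[pipeline][stage]

print(model_for("software_development", "performance_tuning").value)  # gemini
```

&lt;p&gt;Keeping the mapping in data rather than code makes it easy to re-route a stage to a different model as capabilities or pricing change.&lt;/p&gt;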

&lt;h2&gt;
  
  
  Implementation Best Practices
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Task Orchestration
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;TaskOrchestrator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;task_queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_capabilities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLAUDE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analysis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;reasoning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GEMINI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;technical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mathematical&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GPT4O&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;creative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthesis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;formatting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;assign_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Intelligently assign tasks based on model strengths&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;capabilities&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_capabilities&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;task_type&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;capabilities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GPT4O&lt;/span&gt;  &lt;span class="c1"&gt;# Default fallback
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
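
&lt;p&gt;To see the routing in action, here is the orchestrator exercised end to end (re-declared so the snippet runs on its own; the &lt;code&gt;ModelType&lt;/code&gt; values are placeholders):&lt;/p&gt;

```python
from enum import Enum

class ModelType(Enum):
    CLAUDE = "claude"
    GEMINI = "gemini"
    GPT4O = "gpt-4o"

class TaskOrchestrator:
    def __init__(self):
        # First model whose capability list contains the task type wins;
        # GPT-4o acts as the catch-all.
        self.model_capabilities = {
            ModelType.CLAUDE: ["analysis", "reasoning", "review"],
            ModelType.GEMINI: ["technical", "mathematical", "research"],
            ModelType.GPT4O: ["creative", "synthesis", "formatting"],
        }

    def assign_task(self, task_type: str, content: str) -> ModelType:
        for model, capabilities in self.model_capabilities.items():
            if task_type in capabilities:
                return model
        return ModelType.GPT4O  # default fallback

orchestrator = TaskOrchestrator()
print(orchestrator.assign_task("reasoning", "why does X happen?"))  # ModelType.CLAUDE
print(orchestrator.assign_task("poetry", "write a haiku"))          # ModelType.GPT4O (fallback)
```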



&lt;h3&gt;
  
  
  2. Quality Assurance
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;QualityAssurance&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;validate_results&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Validate results across multiple models&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;avg_confidence&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;confidence&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;avg_confidence&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;

    &lt;span class="nd"&gt;@staticmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;consensus_check&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;TaskResult&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check if models agree on key points&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="c1"&gt;# Implementation would compare semantic similarity
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Simplified for example
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
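
&lt;p&gt;The &lt;code&gt;consensus_check&lt;/code&gt; above is left simplified. One concrete (if crude) way to fill it in is pairwise token-overlap similarity; this Jaccard-based sketch is my own assumption, not the implementation above, and a production system would use embedding-based semantic similarity instead:&lt;/p&gt;

```python
from itertools import combinations
from typing import List

def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two result strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta.intersection(tb)) / len(ta.union(tb))

def consensus_check(texts: List[str], threshold: float = 0.7) -> bool:
    """True when every pair of result texts clears the similarity threshold."""
    return all(
        jaccard_similarity(a, b) >= threshold
        for a, b in combinations(texts, 2)
    )

print(consensus_check(["event driven design", "event driven architecture design"], 0.5))  # True
print(consensus_check(["alpha beta", "gamma delta"]))  # False
```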



&lt;h3&gt;
  
  
  3. Cost Optimization
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CostOptimizer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model_costs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CLAUDE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.015&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# per 1k tokens
&lt;/span&gt;            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GEMINI&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.012&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;GPT4O&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.018&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;optimize_task_assignment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Dict&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ModelType&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Assign tasks to minimize cost while maximizing quality&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;assignments&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Logic to assign based on cost-effectiveness
&lt;/span&gt;            &lt;span class="n"&gt;assignments&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;_best_model_for_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;task&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;assignments&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
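&lt;p&gt;The &lt;code&gt;_best_model_for_task&lt;/code&gt; helper above is left abstract. Here is one minimal sketch of a cost-aware chooser using the per-1k-token rates from the example; the quality scores and the minimum-quality rule are illustrative assumptions, not part of the original design:&lt;/p&gt;

```python
from enum import Enum

class ModelType(Enum):
    CLAUDE = "claude"
    GEMINI = "gemini"
    GPT4O = "gpt4o"

# Per-1k-token rates from the example above; quality scores are made-up placeholders
MODEL_COSTS = {ModelType.CLAUDE: 0.015, ModelType.GEMINI: 0.012, ModelType.GPT4O: 0.018}
QUALITY = {ModelType.CLAUDE: 0.90, ModelType.GEMINI: 0.85, ModelType.GPT4O: 0.92}

def best_model_for_task(task: str, min_quality: float = 0.85) -> ModelType:
    """Pick the cheapest model whose assumed quality score meets the floor."""
    candidates = [m for m in ModelType if QUALITY[m] >= min_quality]
    return min(candidates, key=lambda m: MODEL_COSTS[m])
```

&lt;p&gt;With these assumed scores, raising the quality floor shifts assignments from the cheapest model (Gemini) to Claude, which is the cheaper of the two higher-quality options.&lt;/p&gt;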



&lt;h2&gt;
  
  
  The Future of Complementary AI
&lt;/h2&gt;

&lt;p&gt;As AI models continue to evolve, the concept of complementary architectures becomes even more powerful. We're moving toward a future where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Specialized Models&lt;/strong&gt;: Each model excels in specific domains&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Intelligent Orchestration&lt;/strong&gt;: Systems automatically choose the best model for each task&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Continuous Learning&lt;/strong&gt;: Models learn from each other's outputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Seamless Integration&lt;/strong&gt;: Users don't need to know which model is handling their request&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The future of AI isn't about one model dominating all others — it's about creating intelligent systems that leverage the unique strengths of each model. By building complementary architectures with Claude Sonnet 4, GPT-4o, and Gemini 2.5, we can create more robust, accurate, and efficient AI solutions.&lt;/p&gt;

&lt;p&gt;The examples and code provided in this article demonstrate practical approaches to implementing these complementary systems. As you build your own multi-agent architectures, remember that the goal isn't to replace human intelligence but to augment it with the best that each AI model has to offer.&lt;/p&gt;

&lt;p&gt;Start small, experiment with different task assignments, and gradually build more sophisticated orchestration systems. The complementary approach not only provides better results but also creates more resilient and adaptable AI solutions for the future.&lt;/p&gt;

</description>
      <category>python</category>
      <category>ai</category>
    </item>
    <item>
      <title>Code Less, Prompt Better: Unlocking Python's Built-in LLM Enhancers</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Fri, 16 May 2025 22:07:23 +0000</pubDate>
      <link>https://dev.to/angu10/code-less-prompt-better-unlocking-pythons-built-in-llm-enhancers-3hdd</link>
      <guid>https://dev.to/angu10/code-less-prompt-better-unlocking-pythons-built-in-llm-enhancers-3hdd</guid>
      <description>&lt;p&gt;In the rapidly evolving landscape of Large Language Models (LLMs), effective prompt engineering has become a crucial skill. While much attention is given to the art of crafting effective prompts, less focus has been placed on how to efficiently manage these prompts programmatically. Python, with its rich set of built-in features, offers powerful tools to dynamically construct, optimize, and manage LLM prompts.&lt;br&gt;
This article explores how Python's built-in features can transform your approach to LLM prompt engineering, making your code more efficient, maintainable, and powerful.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;1. Using locals() for Dynamic Context Injection&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
When working with LLMs, we often need to inject contextual information into our prompts. The traditional approach involves manual string formatting:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_response(user_name, user_query, previous_context):
    prompt = f"""
    User name: {user_name}
    User query: {user_query}
    Previous context: {previous_context}

    Please respond to the user's query considering the context above.
    """

    return call_llm_api(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works well for simple cases, but becomes unwieldy as the number of variables increases. It's also error-prone – you might forget to include a variable or update a variable name.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution with locals()&lt;/strong&gt;&lt;br&gt;
Python's locals() function returns a dictionary containing all local variables in the current scope. We can leverage this to automatically include all relevant context:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_response(user_name, user_query, previous_context, user_preferences=None, user_history=None):
    # All local variables are now accessible
    context_dict = locals()

    # Build a dynamic prompt section with all available context
    context_sections = []
    for key, value in context_dict.items():
        if value is not None:  # Only include non-None values
            context_sections.append(f"{key}: {value}")

    context_text = "\n".join(context_sections)

    prompt = f"""
    Context information:
    {context_text}

    Please respond to the user's query considering the context above.
    """

    return call_llm_api(prompt)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Benefits:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Automatic variable inclusion: If you add a new parameter to your function, it's automatically included in the context.&lt;br&gt;
Reduced errors: No need to manually update string formatting when variables change.&lt;br&gt;
Cleaner code: Separates the mechanism of context injection from the specific variables.&lt;/p&gt;
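&lt;p&gt;To see the pattern end to end, here is a minimal runnable version with &lt;code&gt;call_llm_api&lt;/code&gt; stubbed out to echo the prompt (a placeholder, as in the article):&lt;/p&gt;

```python
def call_llm_api(prompt):
    # Echo stub standing in for a real LLM call
    return prompt

def generate_response(user_name, user_query, previous_context=None):
    # Snapshot parameters first, before any other locals are defined
    context_dict = locals()
    context_sections = [f"{key}: {value}" for key, value in context_dict.items()
                        if value is not None]
    context_text = "\n".join(context_sections)
    prompt = f"""Context information:
{context_text}

Please respond to the user's query considering the context above."""
    return call_llm_api(prompt)

print(generate_response("Ada", "How do Python dicts work?"))
```

&lt;p&gt;Because &lt;code&gt;locals()&lt;/code&gt; is called before any other variables are defined, the snapshot contains exactly the function's parameters, and the &lt;code&gt;None&lt;/code&gt;-valued &lt;code&gt;previous_context&lt;/code&gt; is filtered out.&lt;/p&gt;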
&lt;h2&gt;
  
  
  2. Using inspect for Function Documentation
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
When creating LLM prompts that involve function execution or code generation, providing accurate function documentation is crucial:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def create_function_prompt(func_name, params):
    prompt = f"""
    Create a Python function named '{func_name}' with the following parameters:
    {params}
    """
    return prompt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach requires manually specifying function details, which can be tedious and error-prone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Solution with inspect&lt;/strong&gt;&lt;br&gt;
Python's inspect module allows us to extract rich metadata from functions:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import inspect

def create_function_prompt(func_reference):
    # Get the function signature
    signature = inspect.signature(func_reference)

    # Get the function docstring
    doc = inspect.getdoc(func_reference) or "No documentation available"

    # Get source code if available
    try:
        source = inspect.getsource(func_reference)
    except (OSError, TypeError):
        source = "Source code not available"

    prompt = f"""
    Function name: {func_reference.__name__}

    Signature: {signature}

    Documentation:
    {doc}

    Original source code:
    {source}

    Please create an improved version of this function.
    """

    return prompt

# Example usage
def example_func(a, b=10):
    """This function adds two numbers together."""
    return a + b

improved_function_prompt = create_function_prompt(example_func)
# Send to LLM for improvement
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This dynamically extracts all relevant information about the function, making the prompt much more informative.&lt;/p&gt;
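&lt;p&gt;A quick sanity check of what &lt;code&gt;inspect&lt;/code&gt; extracts from &lt;code&gt;example_func&lt;/code&gt;:&lt;/p&gt;

```python
import inspect

def example_func(a, b=10):
    """This function adds two numbers together."""
    return a + b

# The signature renders exactly as written in the def line
signature = inspect.signature(example_func)
print(signature)  # (a, b=10)

# getdoc() returns the cleaned docstring
print(inspect.getdoc(example_func))
```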

&lt;h2&gt;
  
  
  3. Context Management with Class Attributes
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
Managing conversation history and context with LLMs often leads to repetitive code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conversation_history = []

def chat_with_llm(user_input):
    # Manually build the prompt with history
    prompt = "Previous conversation:\n"
    for entry in conversation_history:
        prompt += f"{entry['role']}: {entry['content']}\n"

    prompt += f"User: {user_input}\n"
    prompt += "Assistant: "

    response = call_llm_api(prompt)

    # Update history
    conversation_history.append({"role": "User", "content": user_input})
    conversation_history.append({"role": "Assistant", "content": response})

    return response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;The Solution with Class Attributes and &lt;code&gt;__dict__&lt;/code&gt;&lt;/strong&gt;&lt;br&gt;
We can create a conversation manager class that uses Python's object attributes:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class ConversationManager:
    def __init__(self, system_prompt=None, max_history=10):
        self.history = []
        self.system_prompt = system_prompt
        self.max_history = max_history
        self.user_info = {}
        self.conversation_attributes = {
            "tone": "helpful",
            "style": "concise",
            "knowledge_level": "expert"
        }

    def add_user_info(self, **kwargs):
        """Add user-specific information to the conversation context."""
        self.user_info.update(kwargs)

    def set_attribute(self, key, value):
        """Set a conversation attribute."""
        self.conversation_attributes[key] = value

    def build_prompt(self, user_input):
        """Build a complete prompt using object attributes."""
        prompt_parts = []

        # Add system prompt if available
        if self.system_prompt:
            prompt_parts.append(f"System: {self.system_prompt}")

        # Add conversation attributes
        prompt_parts.append("Conversation attributes:")
        for key, value in self.conversation_attributes.items():
            prompt_parts.append(f"- {key}: {value}")

        # Add user info if available
        if self.user_info:
            prompt_parts.append("\nUser information:")
            for key, value in self.user_info.items():
                prompt_parts.append(f"- {key}: {value}")

        # Add conversation history
        if self.history:
            prompt_parts.append("\nConversation history:")
            for entry in self.history[-self.max_history:]:
                prompt_parts.append(f"{entry['role']}: {entry['content']}")

        # Add current user input
        prompt_parts.append(f"\nUser: {user_input}")
        prompt_parts.append("Assistant:")

        return "\n".join(prompt_parts)

    def chat(self, user_input):
        """Process a user message and get response from LLM."""
        prompt = self.build_prompt(user_input)

        response = call_llm_api(prompt)

        # Update history
        self.history.append({"role": "User", "content": user_input})
        self.history.append({"role": "Assistant", "content": response})

        return response

    def get_state_as_dict(self):
        """Return a dictionary of the conversation state using __dict__."""
        return self.__dict__

    def save_state(self, filename):
        """Save the conversation state to a file."""
        import json
        with open(filename, 'w') as f:
            json.dump(self.get_state_as_dict(), f)

    def load_state(self, filename):
        """Load the conversation state from a file."""
        import json
        with open(filename, 'r') as f:
            state = json.load(f)
            self.__dict__.update(state)


# Using this approach:

# Create a conversation manager
convo = ConversationManager(system_prompt="You are a helpful assistant.")

# Add user information
convo.add_user_info(name="John", expertise="beginner", interests=["Python", "AI"])

# Set conversation attributes
convo.set_attribute("tone", "friendly")

# Chat with the LLM
response = convo.chat("Can you help me understand how Python dictionaries work?")
print(response)

# Later, save the conversation state
convo.save_state("conversation_backup.json")

# And load it back
new_convo = ConversationManager()
new_convo.load_state("conversation_backup.json")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;4. Using dir() for Object Exploration&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
When working with complex objects or APIs, it can be challenging to know what data is available to include in prompts:&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

def generate_data_analysis_prompt(dataset):
    # Manually specifying what we think is available
    prompt = f"""
    Dataset name: {dataset.name}
    Number of rows: {len(dataset)}

    Please analyze this dataset.
    """
    return prompt

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Solution with dir()&lt;/strong&gt;&lt;br&gt;
Python's dir() function lets us dynamically discover object attributes and methods:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
def generate_data_analysis_prompt(dataset):
    # Discover available attributes
    attributes = dir(dataset)

    # Filter out private attributes (those starting with _)
    public_attrs = [attr for attr in attributes if not attr.startswith('_')]

    # Build metadata section
    metadata = []
    for attr in public_attrs:
        try:
            value = getattr(dataset, attr)
            # Only include non-method attributes with simple values
            if not callable(value) and not hasattr(value, '__dict__'):
                metadata.append(f"{attr}: {value}")
        except Exception:
            pass  # Skip attributes that can't be accessed

    metadata_text = "\n".join(metadata)

    prompt = f"""
    Dataset metadata:
    {metadata_text}

    Please analyze this dataset based on the metadata above.
    """

    return prompt


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;This approach automatically discovers and includes relevant metadata without requiring us to know the exact structure of the dataset object in advance.&lt;/p&gt;
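&lt;p&gt;Here is the same discovery logic run against a toy dataset object (&lt;code&gt;ToyDataset&lt;/code&gt; is a hypothetical stand-in; real dataset classes will expose different attributes):&lt;/p&gt;

```python
class ToyDataset:
    def __init__(self):
        self.name = "trial_results"
        self.num_rows = 120
        self.source = "synthetic"

    def summary(self):
        return "..."

dataset = ToyDataset()
metadata = []
for attr in dir(dataset):
    # Skip private attributes and methods, keep simple data attributes
    if attr.startswith("_"):
        continue
    value = getattr(dataset, attr)
    if not callable(value):
        metadata.append(f"{attr}: {value}")

print("\n".join(metadata))
```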

&lt;h2&gt;
  
  
  &lt;strong&gt;5. String Manipulation for Prompt Cleaning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem&lt;/strong&gt;&lt;br&gt;
User inputs and other text data often contain formatting issues that can affect LLM performance:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

def process_document(document_text):
    prompt = f"""
    Document:
    {document_text}

    Please summarize the key points from this document.
    """
    return call_llm_api(prompt)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The Solution with String Methods&lt;/strong&gt;&lt;br&gt;
Python's rich set of string manipulation methods can clean and normalize text:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;

def process_document(document_text):
    # Remove excessive whitespace
    cleaned_text = ' '.join(document_text.split())

    # Normalize line breaks
    cleaned_text = cleaned_text.replace('\r\n', '\n').replace('\r', '\n')

    # Limit length (many LLMs have token limits)
    max_chars = 5000
    if len(cleaned_text) &amp;gt; max_chars:
        cleaned_text = cleaned_text[:max_chars] + "... [truncated]"

    # Replace problematic characters
    for char, replacement in [('\u2018', "'"), ('\u2019', "'"), ('\u201c', '"'), ('\u201d', '"')]:
        cleaned_text = cleaned_text.replace(char, replacement)

    prompt = f"""
    Document:
    {cleaned_text}

    Please summarize the key points from this document.
    """

    return call_llm_api(prompt)


&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
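&lt;p&gt;A quick demonstration of the normalization steps on a small messy string (the sample text is made up for illustration):&lt;/p&gt;

```python
raw = "This  is\r\na \u201cmessy\u201d   document\u2019s text"

# Collapse all whitespace runs (including \r\n line breaks) to single spaces
cleaned = " ".join(raw.split())

# Replace curly quotes with their ASCII equivalents
for char, replacement in [("\u2018", "'"), ("\u2019", "'"), ("\u201c", '"'), ("\u201d", '"')]:
    cleaned = cleaned.replace(char, replacement)

print(cleaned)  # This is a "messy" document's text
```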



&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Python's built-in features offer powerful capabilities for enhancing LLM prompts:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic Context:&lt;/strong&gt; Using locals() and &lt;code&gt;__dict__&lt;/code&gt; to automatically include relevant variables&lt;br&gt;
&lt;strong&gt;Introspection:&lt;/strong&gt; Using inspect and dir() to extract rich metadata from objects and functions&lt;br&gt;
&lt;strong&gt;String Manipulation:&lt;/strong&gt; Using Python's string methods to clean and normalize text&lt;/p&gt;

&lt;p&gt;By leveraging these built-in features, you can create more robust, maintainable, and dynamic LLM interactions. The techniques in this article can help you move beyond static prompt templates to create truly adaptive and context-aware LLM applications.&lt;br&gt;
Most importantly, these approaches scale well as your LLM applications become more complex, allowing you to maintain clean, readable code while supporting sophisticated prompt engineering techniques.&lt;br&gt;
Whether you're building a simple chatbot or a complex AI assistant, Python's built-in features can help you create more effective LLM interactions with less code and fewer errors.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>python</category>
      <category>programming</category>
    </item>
    <item>
      <title>AI in the Clinical Arena: Llama 4 Scout vs Claude 3.7 Statistical Showdown</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Fri, 11 Apr 2025 06:04:22 +0000</pubDate>
      <link>https://dev.to/angu10/ai-in-the-clinical-arena-llama-4-scout-vs-claude-37-statistical-showdown-2m43</link>
      <guid>https://dev.to/angu10/ai-in-the-clinical-arena-llama-4-scout-vs-claude-37-statistical-showdown-2m43</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As artificial intelligence advances, there is growing interest in evaluating how different AI models perform in specialized domains like clinical trial statistics. This article compares two state-of-the-art large language models — Llama 4 Scout Reasoning and Claude 3.7 — on their ability to solve common statistical problems in clinical trials. It’s important to emphasize that this study examines only a limited set of three clinical trial problems and should not be interpreted as a comprehensive assessment of these models’ overall capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  Llama 4 Scout Instruct Model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce1akyftlyw2brixi709.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fce1akyftlyw2brixi709.png" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73vwb2fqhx2ryuehf2wz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F73vwb2fqhx2ryuehf2wz.png" alt="Image description" width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmi5le9dn9vslq0l3hov.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmi5le9dn9vslq0l3hov.png" alt="Image description" width="800" height="326"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Claude 3.7
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgr8ystkgqldyvd68f0k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flgr8ystkgqldyvd68f0k.png" alt="Image description" width="800" height="1065"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08rsnioau33t3dmg3a1b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F08rsnioau33t3dmg3a1b.png" alt="Image description" width="786" height="828"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiac30d0svr17bjqmpr0i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiac30d0svr17bjqmpr0i.png" alt="Image description" width="796" height="980"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Problem Selection
&lt;/h2&gt;

&lt;p&gt;Three foundational clinical trial statistical problems were selected to evaluate the models:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treatment Effect Analysis&lt;/strong&gt;: Calculating response rates, absolute risk reduction (ARR), and number needed to treat (NNT) in a cancer treatment study comparing experimental and control arms&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-inferiority Trial Design&lt;/strong&gt;: Determining the minimum cure rate required for a new antibiotic to be considered non-inferior to the standard of care&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Interim Analysis Decision-Making&lt;/strong&gt;: Applying O’Brien-Fleming boundaries to decide whether to stop a trial early based on interim results&lt;/p&gt;

&lt;h2&gt;
  
  
  Evaluation Criteria
&lt;/h2&gt;

&lt;p&gt;The outputs from both models were compared across several dimensions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mathematical accuracy&lt;/li&gt;
&lt;li&gt;Statistical reasoning approach&lt;/li&gt;
&lt;li&gt;Clarity of explanation&lt;/li&gt;
&lt;li&gt;Contextual understanding&lt;/li&gt;
&lt;li&gt;Presentation format&lt;/li&gt;
&lt;li&gt;Result interpretation&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Detailed Findings
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Mathematical Precision
&lt;/h2&gt;

&lt;p&gt;Both models demonstrated excellent mathematical precision, arriving at identical numerical answers for all three problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In Problem 1, both correctly calculated the response rates (55.6% vs 44.4%), ARR (11.2%), and NNT (9)&lt;/li&gt;
&lt;li&gt;In Problem 2, both determined the minimum acceptable cure rate to be 70%&lt;/li&gt;
&lt;li&gt;In Problem 3, both correctly concluded that the trial should not be stopped based on the interim analysis&lt;/li&gt;
&lt;/ul&gt;
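&lt;p&gt;These figures are straightforward to verify with a few lines of Python (the response rates are taken as given; rounding NNT up to the next whole patient is the usual convention):&lt;/p&gt;

```python
import math

# Problem 1: treatment effect
experimental_rate = 0.556
control_rate = 0.444
arr = experimental_rate - control_rate  # absolute risk reduction
nnt = math.ceil(1 / arr)                # number needed to treat, rounded up
print(f"ARR = {arr:.1%}, NNT = {nnt}")

# Problem 3: O'Brien-Fleming interim decision
p_value = 0.01
boundary = 0.0001
# Stop early only if the p-value falls at or below the boundary
stop_early = boundary >= p_value
print("Stop trial early?", stop_early)
```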

&lt;h2&gt;
  
  
  Approach to Statistical Reasoning
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Llama 4 Scout Reasoning
&lt;/h2&gt;

&lt;p&gt;Llama 4 Scout approached problems with a highly structured methodology:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Began by clearly organizing given information in bulleted lists&lt;/li&gt;
&lt;li&gt;Used explicit section headings to demarcate reasoning steps&lt;/li&gt;
&lt;li&gt;Provided direct formula applications with clear variable substitutions&lt;/li&gt;
&lt;li&gt;Included practical interpretations of the final statistical outcomes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Claude 3.7
&lt;/h2&gt;

&lt;p&gt;Claude 3.7 demonstrated a more narrative reasoning style:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Used numbered steps with detailed explanations before formula application&lt;/li&gt;
&lt;li&gt;Provided more extensive context about the statistical principles being applied&lt;/li&gt;
&lt;li&gt;Explained the reasoning behind formula selection&lt;/li&gt;
&lt;li&gt;Included additional interpretation of why certain approaches were appropriate&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Explanation Quality
&lt;/h2&gt;

&lt;p&gt;The models differed somewhat in their explanatory approaches:&lt;/p&gt;

&lt;h2&gt;
  
  
  Llama 4 Scout Reasoning
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Provided concise but complete explanations&lt;/li&gt;
&lt;li&gt;Excellent at clarifying practical implications (e.g., “This means we would need to treat 9 patients with the experimental treatment instead of the control treatment to achieve one additional response”)&lt;/li&gt;
&lt;li&gt;Included additional context about threshold interpretations&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Explicit about Type I error control in the interim analysis problem&lt;/p&gt;
&lt;h2&gt;
  
  
  Claude 3.7
&lt;/h2&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Offered more detailed contextual explanations of statistical concepts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Provided more extensive rationale for calculation approaches&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Explained why certain statistical formulas were appropriate in each context&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Included more discussion of underlying statistical principles (e.g., family-wise error rate)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Visual Presentation
&lt;/h2&gt;

&lt;p&gt;The models showed distinct presentation styles:&lt;/p&gt;

&lt;h2&gt;
  
  
  Llama 4 Scout Reasoning
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;More visually structured with effective use of bulleted lists&lt;/li&gt;
&lt;li&gt;Clear section labeling and organization&lt;/li&gt;
&lt;li&gt;Highlighted key information more prominently&lt;/li&gt;
&lt;li&gt;More consistent formatting throughout responses&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Claude 3.7
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;More narrative in style with detailed steps&lt;/li&gt;
&lt;li&gt;Used numbered steps with explanatory headings&lt;/li&gt;
&lt;li&gt;Integrated formulas within explanatory text&lt;/li&gt;
&lt;li&gt;Maintained professional presentation but with less visual structuring&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Discussion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Complementary Strengths&lt;/strong&gt;&lt;br&gt;
This comparative analysis revealed that both models possess strong capabilities in clinical trial statistical reasoning but with slightly different emphases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Llama 4 Scout Reasoning&lt;/strong&gt; excelled in structural organization, practical interpretation, and concise presentation of statistical concepts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude 3.7&lt;/strong&gt; demonstrated particular strength in contextual explanation, statistical rationale, and discussion of underlying principles.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Practical Implications
&lt;/h2&gt;

&lt;p&gt;For professionals working in clinical trial statistics, the choice between these models might depend on specific needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For quick statistical calculations with a clear structure: Llama 4 Scout Reasoning’s concise and well-organized approach might be preferred.&lt;/li&gt;
&lt;li&gt;For a deeper understanding of statistical principles: Claude 3.7’s more detailed explanations might offer greater educational value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;However, the mathematical precision demonstrated by both models suggests that either would be reliable for calculating key clinical trial statistics.&lt;/p&gt;

&lt;h1&gt;
  
  
  Comparing Llama 4 Scout Reasoning vs Claude 3.7 in Clinical Trial Problem Solving
&lt;/h1&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aspect&lt;/th&gt;
&lt;th&gt;Llama 4 Scout Reasoning&lt;/th&gt;
&lt;th&gt;Claude 3.7&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Information Organization&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Used bulleted lists to highlight given information&lt;/td&gt;
&lt;td&gt;Presented information in numbered steps with headings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Structure&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Clear section headings with step numbers&lt;/td&gt;
&lt;td&gt;Numbered steps with detailed headings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Problem 1: Response Rates&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Experimental arm rate&lt;/td&gt;
&lt;td&gt;55.6%&lt;/td&gt;
&lt;td&gt;55.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Control arm rate&lt;/td&gt;
&lt;td&gt;44.4%&lt;/td&gt;
&lt;td&gt;44.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ARR calculation&lt;/td&gt;
&lt;td&gt;0.556 - 0.444 = 0.112 = 11.2%&lt;/td&gt;
&lt;td&gt;55.6% - 44.4% = 11.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NNT calculation&lt;/td&gt;
&lt;td&gt;1/0.112 = 8.93 → 9&lt;/td&gt;
&lt;td&gt;1/0.112 = 9&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Additional context&lt;/td&gt;
&lt;td&gt;Explained meaning of NNT (9 patients for one additional response)&lt;/td&gt;
&lt;td&gt;Provided explanation of why ARR used experimental minus control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Problem 2: Non-inferiority Trial&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Concept explanation&lt;/td&gt;
&lt;td&gt;Concise explanation of non-inferiority concept&lt;/td&gt;
&lt;td&gt;More detailed explanation of non-inferiority margin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Calculation approach&lt;/td&gt;
&lt;td&gt;Direct formula application&lt;/td&gt;
&lt;td&gt;Step-by-step explanation with formula justification&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final answer&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;td&gt;70%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Additional explanation&lt;/td&gt;
&lt;td&gt;Added what happens if cure rate is below/above threshold&lt;/td&gt;
&lt;td&gt;Included context about the meaning of non-inferiority margin&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Problem 3: O'Brien-Fleming Boundaries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision framework&lt;/td&gt;
&lt;td&gt;Clear comparison of p-value to boundary&lt;/td&gt;
&lt;td&gt;Detailed explanation of boundary concept&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Decision logic&lt;/td&gt;
&lt;td&gt;p-value (0.01) &amp;gt; boundary (0.0001) → don't stop&lt;/td&gt;
&lt;td&gt;Same conclusion with more contextual explanation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Additional explanation&lt;/td&gt;
&lt;td&gt;Included explanation of Type I error control&lt;/td&gt;
&lt;td&gt;Discussed family-wise error rate control&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Overall Characteristics&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;td&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Formatting style&lt;/td&gt;
&lt;td&gt;More visually structured with bulleted lists&lt;/td&gt;
&lt;td&gt;More narrative with detailed steps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mathematical accuracy&lt;/td&gt;
&lt;td&gt;Identical answers across all problems&lt;/td&gt;
&lt;td&gt;Identical answers across all problems&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Result interpretation&lt;/td&gt;
&lt;td&gt;More explicit interpretation of final results&lt;/td&gt;
&lt;td&gt;More context on the statistical principles&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Explanation depth&lt;/td&gt;
&lt;td&gt;Concise but complete&lt;/td&gt;
&lt;td&gt;More detailed statistical context&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
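&lt;p&gt;The Problem 1 arithmetic in the table above is easy to verify yourself. Here is a minimal Python sketch, using the response rates from the table, that reproduces the ARR and NNT values both models arrived at:&lt;/p&gt;

```python
import math

# Response rates from the comparison table above
experimental_rate = 0.556  # 55.6%
control_rate = 0.444       # 44.4%

# Absolute risk reduction (ARR): experimental minus control
arr = experimental_rate - control_rate  # 0.112, i.e. 11.2%

# Number needed to treat (NNT): 1 / ARR, rounded up to a whole patient
nnt = math.ceil(1 / arr)  # 8.93 rounds up to 9

print(f"ARR = {arr:.1%}, NNT = {nnt}")
```

&lt;p&gt;Both models report the same 11.2% and 9, matching this calculation.&lt;/p&gt;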

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This limited comparison suggests that both Llama 4 Scout Reasoning and Claude 3.7 demonstrate strong capabilities in solving clinical trial statistical problems. However, Llama 4 Scout is open-source, and fine-tuning it on your own data could make it even more powerful for your use case.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s worth emphasizing that this analysis is based on only three specific problems and should not be extrapolated to represent overall model capabilities across the broad and complex domain of clinical trial statistics. A more comprehensive evaluation would require testing across a broader range of problem types, complexity levels, and specialized statistical methods used in clinical trials.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llama</category>
      <category>ai</category>
      <category>clinical</category>
      <category>llm</category>
    </item>
    <item>
      <title>Document Whisperer: Llama-4-Scout and the Future of Intelligent Content Extraction</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Sun, 06 Apr 2025 05:28:55 +0000</pubDate>
      <link>https://dev.to/angu10/document-whisperer-llama-4-scout-and-the-future-of-intelligent-content-extraction-1g0k</link>
      <guid>https://dev.to/angu10/document-whisperer-llama-4-scout-and-the-future-of-intelligent-content-extraction-1g0k</guid>
      <description>&lt;p&gt;In today's data-driven world, the ability to quickly extract insights from documents is becoming increasingly valuable. Whether you're a researcher sifting through academic papers, a legal professional analyzing contracts, or a business analyst reviewing reports, the challenge of efficiently navigating through lengthy documents remains universal.&lt;/p&gt;

&lt;p&gt;Enter Llama-4-Scout, Meta's latest language model that's showing remarkable capabilities in document question-answering tasks. Released on April 5, 2025, Scout is a groundbreaking addition to the new Llama 4 family, specifically designed to be more reliable, precise, and effective at finding and extracting information from documents.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Makes Llama-4-Scout Special?
&lt;/h2&gt;

&lt;p&gt;Llama-4-Scout utilizes a Mixture-of-Experts (MoE) architecture with 17 billion active parameters (out of 109 billion total parameters across 16 experts) that strikes an optimal balance between performance and efficiency. What truly sets Scout apart is its unprecedented 10-million-token context window, making it best in class for long-context tasks like summarizing huge codebases and analyzing extensive documents.&lt;/p&gt;

&lt;p&gt;Scout was pretrained on approximately 40 trillion tokens of multimodal data and excels at:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Document comprehension&lt;/strong&gt;: Understanding context across extremely long documents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Information retrieval&lt;/strong&gt;: Precisely locating specific details in complex content&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source attribution&lt;/strong&gt;: Providing clear references to where information was found&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Multimodal reasoning&lt;/strong&gt;: Visual recognition, image reasoning, captioning, and answering questions about document visuals&lt;/p&gt;

&lt;p&gt;This focused design makes Scout particularly well-suited for professional document analysis tasks that require both accuracy and reasonable computational resources. Remarkably, the model fits on a single H100 GPU when quantized to Int4, making it accessible for organizations without massive computing infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-World Example: Document Q&amp;amp;A on the AGENTS Framework
&lt;/h2&gt;

&lt;p&gt;I recently had the opportunity to test Llama-4-Scout's capabilities by building a simple PDF Q&amp;amp;A application with a research paper on the AGENTS framework (a system for autonomous language agents). The results demonstrated Scout's remarkable ability to extract, understand, and contextualize information from technical documents. Here's what I observed:&lt;/p&gt;

&lt;h2&gt;
  
  
  Source Paper: &lt;a href="https://arxiv.org/pdf/2309.07870" rel="noopener noreferrer"&gt;https://arxiv.org/pdf/2309.07870&lt;/a&gt;
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Example 1: Identifying Core Framework Components
&lt;/h2&gt;

&lt;p&gt;When asked about the main features of the framework described in the document, Scout successfully extracted and summarized the key components:&lt;/p&gt;

&lt;p&gt;Query: "What are the main features of the AGENTS framework?"&lt;/p&gt;

&lt;p&gt;Scout's Response: "The main features of the AGENTS framework include planning, memory (specifically long-short term memory implementation), tool usage capabilities, multi-agent communication protocols, and fine-grained symbolic control through SOPs (Standard Operating Procedures). Source: Page 4, Framework Architecture section"&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 2: Understanding Technical Implementation Details
&lt;/h2&gt;

&lt;p&gt;Scout excelled at extracting specific technical implementation details related to memory systems:&lt;/p&gt;

&lt;p&gt;Query: "How does long-short term memory work in AGENTS?"&lt;/p&gt;

&lt;p&gt;Scout's Response: "The AGENTS framework implements a dual-memory system where long-term memories are constructed as action histories embedded by sentence-transformers and stored in VectorDB for semantic search retrieval. In contrast, short-term memories (working memories) are maintained in natural language form and dynamically updated by an LLM through a carefully tuned prompt system. Source: Pages 7-8, Memory Architecture section"&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 3: Explaining Human-Agent Interaction Mechanisms
&lt;/h2&gt;

&lt;p&gt;The model demonstrated its ability to identify specific configuration details that enable key functionalities:&lt;/p&gt;

&lt;p&gt;Query: "How does the AGENTS framework support human-agent interaction?"&lt;/p&gt;

&lt;p&gt;Scout's Response: "The AGENTS framework supports human-agent interaction through a simple but effective configuration mechanism. Users can change an agent's 'is_human' field to 'True' in the config file, which enables humans to take the role of an agent and interact with other language agents in both single-agent and multi-agent scenarios. Source: Page 12, Human-Agent Interaction section"&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 4: Clarifying Domain-Specific Terminology
&lt;/h2&gt;

&lt;p&gt;When asked about specialized terminology in the document, Scout provided clear and accurate explanations:&lt;/p&gt;

&lt;p&gt;Query: "What is SOP in the AGENTS framework?"&lt;/p&gt;

&lt;p&gt;Scout's Response: "In the AGENTS framework, SOP (Standard Operating Procedures) is a symbolic plan that provides fine-grained control of agents' behaviors. Technically, it's implemented as a graph of multiple states that define different situations an agent may encounter, with explicit transition rules between these states. This architecture makes agents' behavior more stable and predictable across various scenarios. Source: Page 9, Agent Control Mechanisms section"&lt;/p&gt;

&lt;h2&gt;
  
  
  Example 5: Identifying Ecosystem Components
&lt;/h2&gt;

&lt;p&gt;Scout successfully extracted information about the broader ecosystem supporting the framework:&lt;/p&gt;

&lt;p&gt;Query: "What is the Agent Hub in the AGENTS framework?"&lt;/p&gt;

&lt;p&gt;Scout's Response: "The Agent Hub is a platform component of the AGENTS framework that enables users to share their fine-tuned language agents and search/download useful language agents created by others. This significantly reduces the effort required for designing, testing, and tuning language agents from scratch, creating a collaborative ecosystem for agent development. Source: Page 15, Ecosystem and Tools section"&lt;/p&gt;

&lt;h2&gt;
  
  
  Building Your Own Document Intelligence Tool
&lt;/h2&gt;

&lt;p&gt;The ability to build custom applications around Llama-4-Scout is straightforward through providers like Together.ai. I created a simple Streamlit application that allows users to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Upload any PDF document&lt;/li&gt;
&lt;li&gt;Process both the text and visual content&lt;/li&gt;
&lt;li&gt;Ask questions in natural language&lt;/li&gt;
&lt;li&gt;Receive answers with source references&lt;/li&gt;
&lt;li&gt;Evaluate answer quality with multiple criteria&lt;/li&gt;
&lt;/ul&gt;
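&lt;p&gt;The core of such an application is a single question-answering call. Below is a stdlib-only Python sketch of that call; note that the endpoint URL and model name shown here are illustrative placeholders for a Together.ai-style OpenAI-compatible API, not details confirmed in this article:&lt;/p&gt;

```python
import json
import urllib.request

# Assumed OpenAI-compatible chat endpoint (illustrative, not confirmed here)
API_URL = "https://api.together.xyz/v1/chat/completions"

def build_prompt(document_text, question):
    """Combine the extracted PDF text with the user's question, asking the
    model to cite where it found the answer (the 'Source: Page N' style
    shown in the examples above)."""
    return (
        "Answer using only the document below, and cite the page or "
        "section where you found the answer.\n\n"
        "Document:\n" + document_text + "\n\nQuestion: " + question
    )

def ask_scout(document_text, question, api_key):
    payload = {
        "model": "meta-llama/Llama-4-Scout-17B-16E-Instruct",  # illustrative name
        "messages": [
            {"role": "user", "content": build_prompt(document_text, question)}
        ],
    }
    request = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": "Bearer " + api_key,
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

&lt;p&gt;A Streamlit front end then only needs to handle the PDF upload, extract its text, and display the answer returned by this function.&lt;/p&gt;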

&lt;p&gt;This type of application can transform how teams work with documents, significantly reducing the time spent searching for information and improving the accuracy of extracted insights. Because Scout is "optimized for visual recognition, image reasoning, captioning, and answering general questions about an image," it can also work with charts, tables, diagrams, and other visual elements that are often critical for full document understanding.&lt;/p&gt;

&lt;h2&gt;
  
  
  Technical Capabilities and Performance
&lt;/h2&gt;

&lt;p&gt;Llama-4-Scout demonstrates impressive performance relative to competing models. In comparative evaluations, Scout has shown "superior performance relative to contemporary models such as Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1 across recognized benchmark datasets."&lt;/p&gt;

&lt;p&gt;What makes Scout particularly practical is its efficiency. Scout "fits on a single H100 GPU when quantized to Int4" while still delivering high-quality results. This efficiency means organizations can implement advanced document intelligence without requiring massive computational resources.&lt;/p&gt;

&lt;h2&gt;
  
  
  Looking Ahead: The Future of Document Intelligence
&lt;/h2&gt;

&lt;p&gt;As models like Llama-4-Scout continue to evolve, we can expect even more sophisticated document intelligence capabilities. Future developments will likely include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deeper reasoning across multiple documents&lt;/li&gt;
&lt;li&gt;More nuanced understanding of domain-specific content&lt;/li&gt;
&lt;li&gt;Better handling of ambiguity and uncertain information&lt;/li&gt;
&lt;li&gt;Enhanced multimodal capabilities for complex visual content&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Llama-4-Scout represents a significant step forward in making advanced document intelligence accessible. Its balanced approach to performance and efficiency makes it particularly valuable for professional applications where accuracy and attribution matter.&lt;/p&gt;

&lt;p&gt;For organizations dealing with large volumes of documents, investing in tools built around models like Scout could yield substantial returns through improved information accessibility and insight generation. The model's ability to "process and work with extremely lengthy documents" makes it ideal for enterprises with extensive documentation needs.&lt;/p&gt;

&lt;p&gt;Have you experimented with Llama-4-Scout or similar models for document analysis? I'd love to hear about your experiences and applications in the comments below.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: The examples provided are based on actual testing of Llama-4-Scout through Together.ai's API integration. Results may vary depending on document complexity and specific implementation details.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
    </item>
    <item>
      <title>OpenAI - Gibili Portrait Assistance: AI-Powered Image Generation Made Simple</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Mon, 31 Mar 2025 17:50:22 +0000</pubDate>
      <link>https://dev.to/angu10/openai-gibili-portrait-assistance-ai-powered-image-generation-made-simple-285e</link>
      <guid>https://dev.to/angu10/openai-gibili-portrait-assistance-ai-powered-image-generation-made-simple-285e</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Ever wished you could create stunning portraits with just a few clicks? Meet Gibili Portrait Assistance, an AI-powered tool that makes generating high-quality portraits effortless. Whether you’re an artist, designer, or simply someone who loves experimenting with AI, Gibili can help bring your ideas to life.&lt;/p&gt;

&lt;p&gt;In this post, we’ll walk you through how to use Gibili Portrait Assistance and explore the OpenAI architecture behind it.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use Gibili Portrait Assistance
&lt;/h2&gt;

&lt;p&gt;Using Gibili is straightforward and requires no prior technical knowledge. Here’s a simple step-by-step guide:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Enter Your Description or Upload an Image&lt;/strong&gt;&lt;br&gt;
You can either type a text description of the portrait you want or upload an existing image to be enhanced or transformed by AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text Prompt Example:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“A realistic portrait of a woman with curly brown hair, wearing a red scarf, in a cinematic lighting style.”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Image Upload:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;If you have an image you want to modify or enhance, simply upload it, and Gibili will apply AI-powered enhancements or transformations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Customize Your Preferences&lt;/strong&gt;&lt;br&gt;
You can fine-tune details such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Art Style&lt;/strong&gt;: Realistic, digital painting, anime, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Background&lt;/strong&gt;: Solid color, blurred, natural scenery.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Facial Expressions&lt;/strong&gt;: Smiling, neutral, surprised.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Additional Features&lt;/strong&gt;: Glasses, hats, jewelry, etc.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. Generate the Image&lt;/strong&gt;&lt;br&gt;
Press Enter, and within seconds, Gibili will produce a high-resolution portrait based on your input or uploaded image.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Refine and Download&lt;/strong&gt;&lt;br&gt;
If you want adjustments, you can tweak your input and regenerate until you’re satisfied. Once ready, download your portrait in high-quality format.&lt;/p&gt;

&lt;h2&gt;
  
  
  The OpenAI Architecture Behind Gibili
&lt;/h2&gt;

&lt;p&gt;Gibili Portrait Assistance is powered by OpenAI’s advanced image generation models, leveraging diffusion models to create highly detailed and realistic portraits. Here’s a simplified breakdown:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Text-to-Image &amp;amp; Image-to-Image Generation&lt;/strong&gt;&lt;br&gt;
When you provide a text prompt, the AI model translates it into a visual representation using deep learning techniques. If you upload an image, the model can enhance, transform, or stylize it while maintaining its core structure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Fine-Tuned on Portrait Data&lt;/strong&gt;&lt;br&gt;
The model has been trained on a vast dataset of portraits across different styles, ensuring high accuracy and creativity in generated images.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Iterative Refinement&lt;/strong&gt;&lt;br&gt;
Instead of creating the final image instantly, the AI gradually refines it through multiple steps, ensuring greater precision and quality.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. User-Guided Adjustments&lt;/strong&gt;&lt;br&gt;
Users can modify parameters like style and background, and the model will intelligently adjust the portrait while maintaining coherence.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Use Gibili Portrait Assistance?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  ✅ Easy to Use
&lt;/h3&gt;

&lt;p&gt;No need for advanced design skills — just describe what you want or upload an image, and AI does the rest.&lt;/p&gt;

&lt;h3&gt;
  
  
  🎨 Customizable Output
&lt;/h3&gt;

&lt;p&gt;From photorealistic portraits to artistic illustrations, you can tailor the results to your liking.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚀 Fast &amp;amp; High-Quality
&lt;/h3&gt;

&lt;p&gt;Generate high-resolution images within seconds.&lt;/p&gt;

&lt;h3&gt;
  
  
  🖌️ Creative Freedom
&lt;/h3&gt;

&lt;p&gt;Perfect for artists, marketers, and content creators looking for unique visuals.&lt;/p&gt;

&lt;h2&gt;
  
  
  Get Started with Gibili Today!
&lt;/h2&gt;

&lt;p&gt;Ready to create amazing AI-generated portraits? Try Gibili Portrait Assistance now and explore the limitless possibilities of AI-powered creativity!&lt;/p&gt;

</description>
      <category>openai</category>
      <category>llm</category>
      <category>ai</category>
    </item>
    <item>
      <title>The Intelligent Loop: A Guide to Modern LLM Agents</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Mon, 24 Feb 2025 06:07:25 +0000</pubDate>
      <link>https://dev.to/angu10/the-intelligent-loop-a-guide-to-modern-llm-agents-1g85</link>
      <guid>https://dev.to/angu10/the-intelligent-loop-a-guide-to-modern-llm-agents-1g85</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Large Language Model (LLM) based AI agents represent a new paradigm in artificial intelligence. Unlike traditional software agents, these systems leverage the powerful capabilities of LLMs to understand, reason, and interact with their environment in more sophisticated ways. This guide will introduce you to the basics of LLM agents and their think-act-observe cycle.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an LLM Agent?
&lt;/h2&gt;

&lt;p&gt;An LLM agent is a system that uses a large language model as its core reasoning engine to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Process natural language instructions&lt;/li&gt;
&lt;li&gt;Make decisions based on context and goals&lt;/li&gt;
&lt;li&gt;Generate human-like responses and actions&lt;/li&gt;
&lt;li&gt;Interact with external tools and APIs&lt;/li&gt;
&lt;li&gt;Learn from interactions and feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Think of an LLM agent as an AI assistant that can understand, respond, and take actions in the digital world, like searching the web, writing code, or analyzing data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04c8aj06tfamfymkf6qt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F04c8aj06tfamfymkf6qt.png" alt=" " width="800" height="294"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Think-Act-Observe Cycle in LLM Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Observe (Input Processing)
&lt;/h3&gt;

&lt;p&gt;LLM agents observe their environment through:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Direct user instructions and queries&lt;/li&gt;
&lt;li&gt;Context from previous conversations&lt;/li&gt;
&lt;li&gt;Data from connected tools and APIs&lt;/li&gt;
&lt;li&gt;System prompts and constraints&lt;/li&gt;
&lt;li&gt;Environmental feedback&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Think (LLM Processing)
&lt;/h3&gt;

&lt;p&gt;The thinking phase for LLM agents involves:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Parsing and understanding input context&lt;/li&gt;
&lt;li&gt;Reasoning about the task and requirements&lt;/li&gt;
&lt;li&gt;Planning necessary steps to achieve goals&lt;/li&gt;
&lt;li&gt;Selecting appropriate tools or actions&lt;/li&gt;
&lt;li&gt;Generating natural language responses&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The LLM is the "brain," using its trained knowledge to process information and make decisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Act (Execution)
&lt;/h3&gt;

&lt;p&gt;LLM agents can take various actions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Generate text responses&lt;/li&gt;
&lt;li&gt;Call external APIs&lt;/li&gt;
&lt;li&gt;Execute code&lt;/li&gt;
&lt;li&gt;Use specialized tools&lt;/li&gt;
&lt;li&gt;Store and retrieve information&lt;/li&gt;
&lt;li&gt;Request clarification from users&lt;/li&gt;
&lt;/ol&gt;
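&lt;p&gt;The think-act-observe cycle above can be sketched in a few lines of Python. This is a toy illustration: the &lt;code&gt;llm_think&lt;/code&gt; function is a deterministic stub standing in for a real model call, and the only tool is a calculator:&lt;/p&gt;

```python
def llm_think(observation, history):
    """Stub reasoning step. A real agent would prompt an LLM with the
    observation plus conversation history; this stub routes arithmetic
    tasks to the calculator tool, then summarizes the result."""
    if observation.startswith("calc:"):
        return ("calculator", observation[5:])
    if history and history[-1][0] == "calculator":
        return ("respond", "The result is " + observation)
    return ("respond", observation)

TOOLS = {
    # Act: each tool is a callable the agent may invoke
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}})),
    "respond": lambda text: text,
}

def run_agent(task, max_steps=5):
    observation, history = task, []
    for _ in range(max_steps):
        action, action_input = llm_think(observation, history)  # Think
        result = TOOLS[action](action_input)                    # Act
        history.append((action, result))
        observation = result                                    # Observe
        if action == "respond":
            return result
    return observation

print(run_agent("calc:2+3"))
```

&lt;p&gt;The loop structure is the important part: the model's decision selects a tool, the tool's output becomes the next observation, and the cycle repeats until the agent decides to respond.&lt;/p&gt;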

&lt;h2&gt;
  
  
  Key Components of LLM Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Core LLM
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Serves as the primary reasoning engine&lt;/li&gt;
&lt;li&gt;Processes natural language input&lt;/li&gt;
&lt;li&gt;Generates responses and decisions&lt;/li&gt;
&lt;li&gt;Maintains conversation context&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Working Memory
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Stores conversation history&lt;/li&gt;
&lt;li&gt;Maintains current context&lt;/li&gt;
&lt;li&gt;Tracks task progress&lt;/li&gt;
&lt;li&gt;Manages temporary information&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Tool Use
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;API integrations&lt;/li&gt;
&lt;li&gt;Code execution capabilities&lt;/li&gt;
&lt;li&gt;Data processing tools&lt;/li&gt;
&lt;li&gt;External knowledge bases&lt;/li&gt;
&lt;li&gt;File manipulation utilities&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Planning System
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Task decomposition&lt;/li&gt;
&lt;li&gt;Step-by-step reasoning&lt;/li&gt;
&lt;li&gt;Goal tracking&lt;/li&gt;
&lt;li&gt;Error handling and recovery&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Types of LLM Agent Architectures
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Simple Agents
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Single LLM with basic tool access&lt;/li&gt;
&lt;li&gt;Direct input-output processing&lt;/li&gt;
&lt;li&gt;Limited memory and context&lt;/li&gt;
&lt;li&gt;Example: Basic chatbots with API access&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  ReAct Agents
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Reasoning and Acting framework&lt;/li&gt;
&lt;li&gt;Step-by-step thought process&lt;/li&gt;
&lt;li&gt;Explicit action planning&lt;/li&gt;
&lt;li&gt;Self-reflection capabilities&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Chain-of-Thought Agents
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Detailed reasoning steps&lt;/li&gt;
&lt;li&gt;Complex problem decomposition&lt;/li&gt;
&lt;li&gt;Transparent decision-making&lt;/li&gt;
&lt;li&gt;Better error handling&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Multi-Agent Systems
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Multiple LLM agents working together&lt;/li&gt;
&lt;li&gt;Specialized roles and capabilities&lt;/li&gt;
&lt;li&gt;Inter-agent communication&lt;/li&gt;
&lt;li&gt;Collaborative problem-solving&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Common Applications
&lt;/h2&gt;

&lt;p&gt;LLM agents are increasingly used for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Personal assistance and task automation&lt;/li&gt;
&lt;li&gt;Code generation and debugging&lt;/li&gt;
&lt;li&gt;Data analysis and research&lt;/li&gt;
&lt;li&gt;Content creation and editing&lt;/li&gt;
&lt;li&gt;Customer service and support&lt;/li&gt;
&lt;li&gt;Process automation and workflow management&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Best Practices for LLM Agent Design
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Clear Instructions
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Provide explicit system prompts&lt;/li&gt;
&lt;li&gt;Define constraints and limitations&lt;/li&gt;
&lt;li&gt;Specify available tools and capabilities&lt;/li&gt;
&lt;li&gt;Set clear success criteria&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Effective Memory Management
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Implement efficient context tracking&lt;/li&gt;
&lt;li&gt;Prioritize relevant information&lt;/li&gt;
&lt;li&gt;Clean up unnecessary data&lt;/li&gt;
&lt;li&gt;Maintain conversation coherence&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Robust Tool Integration
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Define clear tool interfaces&lt;/li&gt;
&lt;li&gt;Handle API errors gracefully&lt;/li&gt;
&lt;li&gt;Validate tool outputs&lt;/li&gt;
&lt;li&gt;Monitor resource usage&lt;/li&gt;
&lt;/ol&gt;
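&lt;p&gt;The tool-integration practices above can be captured in a small, hypothetical wrapper: run the tool, validate its output, and return a fallback string instead of crashing the agent loop. The function and parameter names here are illustrative, not from any particular framework:&lt;/p&gt;

```python
def safe_tool_call(tool, tool_input, validate, fallback="TOOL_ERROR"):
    """Run a tool callable, validate its output, and degrade gracefully:
    any exception or invalid result becomes a fallback message the agent
    can reason about instead of an unhandled crash."""
    try:
        result = tool(tool_input)
    except Exception as error:
        return fallback + ": " + str(error)
    if not validate(result):
        return fallback + ": invalid output"
    return result

# Example: a parsing tool whose output must be an integer
ok = safe_tool_call(int, "42", validate=lambda r: isinstance(r, int))
bad = safe_tool_call(int, "oops", validate=lambda r: isinstance(r, int))
```

&lt;p&gt;Feeding the fallback string back to the LLM as an observation lets the agent retry or ask the user for clarification rather than failing silently.&lt;/p&gt;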

&lt;h3&gt;
  
  
  Safety and Control
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Implement ethical guidelines&lt;/li&gt;
&lt;li&gt;Add safety checks and filters&lt;/li&gt;
&lt;li&gt;Monitor agent behavior&lt;/li&gt;
&lt;li&gt;Maintain user control&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>agents</category>
    </item>
    <item>
      <title>Ever Wonder How AI "Sees" Like You Do? A Beginner's Guide to Attention</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Wed, 19 Feb 2025 02:05:21 +0000</pubDate>
      <link>https://dev.to/angu10/ever-wonder-how-ai-sees-like-you-do-a-beginners-guide-to-attention-30m3</link>
      <guid>https://dev.to/angu10/ever-wonder-how-ai-sees-like-you-do-a-beginners-guide-to-attention-30m3</guid>
      <description>&lt;h2&gt;
  
  
  Understanding Attention in Large Language Models: A Beginner's Guide
&lt;/h2&gt;

&lt;p&gt;Have you ever wondered how ChatGPT or other AI models can understand and respond to your messages so well? The secret lies in a mechanism called ATTENTION - a crucial component that helps these models understand relationships between words and generate meaningful responses. Let's break it down in simple terms!&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Attention?
&lt;/h3&gt;

&lt;p&gt;Imagine you're reading a long sentence: "The cat sat on the mat because it was comfortable." When you read "it," your brain naturally connects back to either "the cat" or "the mat" to understand what "it" refers to. This is exactly what attention does in AI models - it helps the model figure out which words are related to each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Does Attention Work?
&lt;/h3&gt;

&lt;p&gt;The attention mechanism works like a spotlight that can focus on different words when processing each word in a sentence. Here's a simple breakdown:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; For each word, the model calculates how important every other word is in relation to it.&lt;/li&gt;
&lt;li&gt; It then uses these importance scores to create a weighted combination of all words.&lt;/li&gt;
&lt;li&gt; This helps the model understand context and relationships between words.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's visualize this with an example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnmsyiealc4pjtqe8ufm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmnmsyiealc4pjtqe8ufm.png" alt="Image description" width="800" height="1309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this diagram, the word "it" is paying attention to all other words in the sentence. The thickness of the arrows could represent the attention weights. The model would likely assign higher attention weights to "cat" and "mat" to determine which one "it" refers to.&lt;/p&gt;
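&lt;p&gt;Here is a tiny numeric illustration of the three steps above, using made-up 2-dimensional word vectors (real models use hundreds of dimensions per word). The query "it" ends up weighting "cat" more heavily because their vectors are more similar:&lt;/p&gt;

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query word."""
    dim = len(query)
    # 1. Score every word against the query (dot product, scaled)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key in keys
    ]
    # 2. Convert scores into attention weights
    weights = softmax(scores)
    # 3. Weighted combination of all word vectors
    output = [
        sum(w * v[i] for w, v in zip(weights, values))
        for i in range(len(values[0]))
    ]
    return weights, output

# Made-up vectors: "it" is closer to "cat" than to "mat"
cat, mat, it = [1.0, 0.2], [0.3, 0.9], [0.9, 0.1]
weights, _ = attend(it, [cat, mat], [cat, mat])
print(weights)  # the first weight (for "cat") is the larger one
```

&lt;p&gt;With real learned embeddings, the same mechanism is what lets the model resolve "it" to the right referent.&lt;/p&gt;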

&lt;h3&gt;
  
  
  Multi-Head Attention: Looking at Things from Different Angles
&lt;/h3&gt;

&lt;p&gt;In modern language models, we don't just use one attention mechanism - we use several in parallel! This is called Multi-Head Attention. Each "head" can focus on different types of relationships between words.&lt;/p&gt;

&lt;p&gt;Let's consider the sentence: The chef who won the competition prepared a delicious meal.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Head 1 could focus on subject-verb relationships (chef - prepared)&lt;/li&gt;
&lt;li&gt;  Head 2 might attend to adjective-noun pairs (delicious - meal)&lt;/li&gt;
&lt;li&gt;  Head 3 could look at broader context (competition - meal)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's a diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg4zwcmb6dz3jfu31p40.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbg4zwcmb6dz3jfu31p40.png" alt="Image description" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This multi-headed approach helps the model understand text from different perspectives, just like how we humans might read a sentence multiple times to understand different aspects of its meaning.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why Attention Matters
&lt;/h3&gt;

&lt;p&gt;Attention mechanisms have revolutionized natural language processing because they:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Handle long-range dependencies better than previous methods.&lt;/li&gt;
&lt;li&gt; Can process input sequences in parallel.&lt;/li&gt;
&lt;li&gt; Create interpretable connections between words.&lt;/li&gt;
&lt;li&gt; Allow models to focus on relevant information while ignoring irrelevant parts.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Recent Developments and Research
&lt;/h3&gt;

&lt;p&gt;The field of LLMs is rapidly evolving, with new techniques and insights emerging regularly. Here are a few areas of active research:&lt;/p&gt;

&lt;h4&gt;
  
  
  Contextual Hallucinations
&lt;/h4&gt;

&lt;p&gt;Large language models (LLMs) can sometimes hallucinate: they produce details or unsubstantiated answers that are inaccurate with respect to the input context.&lt;/p&gt;

&lt;p&gt;The Lookback Lens technique analyzes attention patterns to detect when a model might be generating information not present in the input context.&lt;/p&gt;

&lt;h4&gt;
  
  
  Extending Context Window
&lt;/h4&gt;

&lt;p&gt;Researchers are working on extending the context window sizes of LLMs, allowing them to process longer text sequences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;While the math behind attention mechanisms can be complex, the core idea is simple: help the model focus on the most relevant parts of the input when processing each word. This allows language models to understand the context and relationships between words better, leading to more accurate and coherent responses.&lt;/p&gt;

&lt;p&gt;Remember, this is just a high-level overview; there's much more to learn about attention mechanisms! Hopefully, this gives you a good foundation for understanding how modern AI models process and understand text.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>chatgpt</category>
      <category>learning</category>
    </item>
    <item>
      <title>A Step-by-Step Guide to LLM Function Calling in Python</title>
      <dc:creator>angu10</dc:creator>
      <pubDate>Wed, 12 Feb 2025 23:06:38 +0000</pubDate>
      <link>https://dev.to/angu10/a-step-by-step-guide-to-llm-function-calling-in-python-4pg7</link>
      <guid>https://dev.to/angu10/a-step-by-step-guide-to-llm-function-calling-in-python-4pg7</guid>
      <description>&lt;p&gt;Function calling allows Claude to interact with external functions and tools in a structured way. This guide will walk you through implementing function calling with Claude using Python, complete with examples and best practices.&lt;/p&gt;

&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;To get started, you'll need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python 3.7+&lt;/li&gt;
&lt;li&gt;anthropic Python package&lt;/li&gt;
&lt;li&gt;A valid API key from Anthropic&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Basic Setup
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from anthropic import Anthropic
import json
# Initialize the client
anthropic = Anthropic(api_key='your-api-key')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Defining Functions
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a specific location",
    "input_schema": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name or coordinates"
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit"
            }
        },
        "required": ["location"]
    }
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Making Function Calls
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A Step-by-Step Guide to LLM Function Calling in Python
Function calling allows Claude to interact with external functions and tools in a structured way. This guide will walk you through implementing function calling with Claude using Python, complete with examples and best practices.
Prerequisites
To get started, you'll need:
Python 3.7+
anthropic Python package
A valid API key from Anthropic

Basic Setup
from anthropic import Anthropic
import json
# Initialize the client
anthropic = Anthropic(api_key='your-api-key')
Defining Functions
function_schema = {
    "name": "get_weather",
    "description": "Get the current weather for a specific location",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name or coordinates"
            },
            "unit": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature unit"
            }
        },
        "required": ["location"]
    }
}
Making Function Calls
def get_weather(location, unit="celsius"):
    # This is a mock implementation but you can all call your API
    return {
        "location": location,
        "temperature": 22 if unit == "celsius" else 72,
        "conditions": "sunny"
    }
def process_function_call(message):
    try:
        # Parse the function call parameters
        params = json.loads(message.content)
        # Call the appropriate function
        if message.name == "get_weather":
            result = get_weather(**params)
            return json.dumps(result)
        else:
            raise ValueError(f"Unknown function: {message.name}")
    except Exception as e:
        return json.dumps({"error": str(e)})
# Example conversation with function calling
messages = [
    {
        "role": "user",
        "content": "What's the weather like in Paris?"
    }
]
while True:
    response = anthropic.messages.create(
        model="claude-3-5-haiku-latest",
        messages=messages,
        tools=[function_schema]
    )
    # Check if Claude wants to call a function
    if response.tool_calls:
        for tool_call in response.tool_calls:
            # Execute the function
            result = process_function_call(tool_call)
            # Add the function result to the conversation
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "name": tool_call.name,
                "content": result
            })
    else:
        # Normal response - print and break
        print(response.content)
        break
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Best Practices
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Clear Function Descriptions

&lt;ul&gt;
&lt;li&gt;Write detailed descriptions for your functions&lt;/li&gt;
&lt;li&gt;Specify parameter types and constraints clearly&lt;/li&gt;
&lt;li&gt;Include examples in the descriptions when helpful&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Input Validation

&lt;ul&gt;
&lt;li&gt;Validate all function inputs before processing&lt;/li&gt;
&lt;li&gt;Return meaningful error messages&lt;/li&gt;
&lt;li&gt;Handle edge cases gracefully&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Response Formatting

&lt;ul&gt;
&lt;li&gt;Return consistent JSON structures&lt;/li&gt;
&lt;li&gt;Include status indicators in responses&lt;/li&gt;
&lt;li&gt;Format error messages uniformly&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Security Considerations

&lt;ul&gt;
&lt;li&gt;Validate and sanitize all inputs&lt;/li&gt;
&lt;li&gt;Implement rate limiting if needed&lt;/li&gt;
&lt;li&gt;Use appropriate authentication&lt;/li&gt;
&lt;li&gt;Don't expose sensitive information in function descriptions&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
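&lt;p&gt;The input-validation and response-formatting guidance above can be sketched as a small pre-dispatch check. The helper names below (&lt;code&gt;validate_weather_input&lt;/code&gt;, &lt;code&gt;safe_call&lt;/code&gt;) are hypothetical, not part of the Anthropic SDK; the point is to reject bad inputs before calling the tool and to return one uniform JSON shape for both success and error.&lt;/p&gt;

```python
import json

VALID_UNITS = ("celsius", "fahrenheit")

def validate_weather_input(params):
    # Returns (ok, error_message); error_message is None when the input is valid.
    if "location" not in params:
        return False, "Missing required parameter: location"
    if not isinstance(params["location"], str) or not params["location"].strip():
        return False, "Parameter 'location' must be a non-empty string"
    unit = params.get("unit", "celsius")
    if unit not in VALID_UNITS:
        return False, f"Parameter 'unit' must be one of {list(VALID_UNITS)}"
    return True, None

def safe_call(params):
    # Validate first, then dispatch; errors come back in the same JSON envelope.
    ok, error = validate_weather_input(params)
    if not ok:
        return json.dumps({"status": "error", "message": error})
    return json.dumps({"status": "ok", "location": params["location"]})
```

&lt;p&gt;Because every branch returns the same &lt;code&gt;status&lt;/code&gt; field, the model (and your own logging) can always tell at a glance whether the tool call succeeded.&lt;/p&gt;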

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Function calling with Claude enables powerful integrations between the language model and external tools. By following these best practices and implementing proper error handling, you can create robust and reliable function-calling implementations.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>ai</category>
      <category>python</category>
      <category>programming</category>
    </item>
  </channel>
</rss>
