<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Louis Sanna</title>
    <description>The latest articles on DEV Community by Louis Sanna (@louis-sanna).</description>
    <link>https://dev.to/louis-sanna</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2442380%2Ff36ea6e2-423c-4c75-8785-0d785e5ca20d.jpeg</url>
      <title>DEV Community: Louis Sanna</title>
      <link>https://dev.to/louis-sanna</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/louis-sanna"/>
    <language>en</language>
    <item>
      <title>The Hidden Bottleneck in LLM Streaming: Function Calls (And How to Fix It)</title>
      <dc:creator>Louis Sanna</dc:creator>
      <pubDate>Thu, 12 Dec 2024 16:27:46 +0000</pubDate>
      <link>https://dev.to/louis-sanna/the-hidden-bottleneck-in-llm-streaming-function-calls-and-how-to-fix-it-1622</link>
      <guid>https://dev.to/louis-sanna/the-hidden-bottleneck-in-llm-streaming-function-calls-and-how-to-fix-it-1622</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;📢 Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Picture this: You’re building a real-time LLM-powered app. Your users are expecting fast, continuous updates from the AI, but instead, they’re staring at a frozen screen. What gives?&lt;/p&gt;

&lt;p&gt;Spoiler alert — it’s not your LLM that’s slowing things down. It’s &lt;strong&gt;your function calls&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Every time your app makes a call to process data, hit an API, or load a large file, you risk blocking the stream. The result? Delays, lag, and an experience that feels anything but “real-time.”&lt;/p&gt;

&lt;p&gt;But don’t worry — this bottleneck has 3 simple fixes. In this post, I’ll show you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Why function calls block LLM streams&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The 3 strategies to prevent bottlenecks&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;How to keep your streams fast, smooth, and uninterrupted&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s get into it. 🚀&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;❌ Why Function Calls Are Slowing You Down&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;LLM streaming works by sending a steady flow of small chunks of text to the client. But here’s the catch: Every time you call a function during the stream — to process data, access an API, or run a calculation — the stream &lt;em&gt;pauses&lt;/em&gt; until the function finishes.&lt;/p&gt;

&lt;p&gt;This happens because most functions are &lt;strong&gt;synchronous&lt;/strong&gt; by default, which means they block the current thread. Imagine you’re in a group chat, but one friend keeps pausing the conversation to answer a phone call. Annoying, right?&lt;/p&gt;

&lt;p&gt;Here’s what’s really happening:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔁 &lt;strong&gt;Synchronous (Blocking) Functions&lt;/strong&gt;: The stream has to “wait” for these functions to finish before sending the next chunk of data.&lt;/li&gt;
&lt;li&gt;🔥 &lt;strong&gt;Non-blocking (Asynchronous) Functions&lt;/strong&gt;: The stream continues while the function does its work in the background.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s a visual of the difference:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[ Blocking Call ] ---&amp;gt; Stream Pauses
[ Async Call ] ------&amp;gt; Stream Continues

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;🛠️ 3 Ways to Fix It&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To avoid blocking the stream, you need to make your app &lt;strong&gt;non-blocking&lt;/strong&gt;. Here are the &lt;strong&gt;3 best techniques&lt;/strong&gt; to do just that:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1️⃣ Use Asynchronous Functions&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If your function is doing I/O (like hitting an API), make it asynchronous so it can "wait" for the API &lt;strong&gt;without pausing the stream&lt;/strong&gt;. Async functions allow the app to keep streaming while the function completes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When calling external APIs&lt;/li&gt;
&lt;li&gt;When reading/writing to files or databases&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use Python’s &lt;code&gt;async def&lt;/code&gt; for your functions.&lt;/li&gt;
&lt;li&gt;Use &lt;code&gt;await&lt;/code&gt; to “pause” the function without blocking the stream.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example: Streaming an LLM While Calling an API&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;async_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate a slow API call
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;processed_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;async_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed_chunk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate delay between chunks
&lt;/span&gt;
&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;stream_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔍 &lt;strong&gt;What’s happening here?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each chunk is processed &lt;strong&gt;asynchronously&lt;/strong&gt;: the two-second wait happens inside the event loop instead of blocking a thread.&lt;/li&gt;
&lt;li&gt;While &lt;code&gt;async_function&lt;/code&gt; awaits, the server stays free to handle other requests and streams.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Use &lt;code&gt;await asyncio.sleep()&lt;/code&gt; to simulate non-blocking behavior. Replace this with actual I/O tasks like API calls, file reads, or database queries.&lt;/p&gt;
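
&lt;p&gt;For instance, here is what &lt;code&gt;async_function&lt;/code&gt; could look like with real I/O. This is a minimal sketch assuming the &lt;code&gt;httpx&lt;/code&gt; client library and a placeholder endpoint URL:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import httpx

API_URL = "https://api.example.com/process"  # placeholder endpoint, not a real service

async def async_function(data):
    # The await hands control back to the event loop while the request is in flight
    async with httpx.AsyncClient() as client:
        response = await client.post(API_URL, json={"data": data})
    return response.json()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;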




&lt;h3&gt;
  
  
  &lt;strong&gt;2️⃣ Leverage Background Tasks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you have heavy computations (like ML inference), you don’t want to keep your stream waiting. Instead, offload the task into the &lt;strong&gt;background&lt;/strong&gt; and continue streaming while the computation runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When you have &lt;strong&gt;CPU-heavy computations&lt;/strong&gt; (e.g., model predictions)&lt;/li&gt;
&lt;li&gt;When dealing with large files or datasets&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Move heavy functions into a &lt;strong&gt;background task&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Spawn it with &lt;code&gt;asyncio.create_task()&lt;/code&gt; so the generator keeps streaming. (FastAPI’s &lt;code&gt;BackgroundTasks&lt;/code&gt; helper only runs &lt;em&gt;after&lt;/em&gt; the response has been sent, so it can’t feed an in-flight stream.)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example: Stream LLM Responses While Running a Background Computation&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BackgroundTasks&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;background_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate a heavy ML computation
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;background_task&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: Processing &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate a slight delay
&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;data_chunks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# Wait for all background tasks
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;BackgroundTasks&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;stream_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;background_tasks&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔍 &lt;strong&gt;What’s happening here?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The heavy computation (&lt;code&gt;background_task&lt;/code&gt;) runs in the background.&lt;/li&gt;
&lt;li&gt;The stream stays responsive, sending "Processing..." updates in real time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; An asyncio task still runs on the event-loop thread, so for truly &lt;strong&gt;CPU-bound operations&lt;/strong&gt; like ML inference, large file processing, and batch jobs, hand the work to a thread or process pool so the loop stays free.&lt;/p&gt;
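
&lt;p&gt;For example, here is a minimal sketch of that offloading with &lt;code&gt;asyncio.to_thread()&lt;/code&gt; (Python 3.9+); &lt;code&gt;run_inference&lt;/code&gt; is a hypothetical stand-in for a heavy model call:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

def run_inference(chunk):
    # Hypothetical CPU-heavy model call that would otherwise block the event loop
    return f"Predicted: {chunk}"

async def stream_with_offload(chunks):
    for chunk in chunks:
        # Runs in a worker thread, so the event loop (and every stream) stays responsive
        result = await asyncio.to_thread(run_inference, chunk)
        yield f"data: {result}\n\n"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;If the work is pure Python and holds the GIL, swap the thread for a process pool via &lt;code&gt;loop.run_in_executor()&lt;/code&gt; with a &lt;code&gt;ProcessPoolExecutor&lt;/code&gt;.&lt;/p&gt;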




&lt;h3&gt;
  
  
  &lt;strong&gt;3️⃣ Chunk Your Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If you have to process &lt;strong&gt;large datasets&lt;/strong&gt;, break them into smaller "chunks" and process each one at a time. This keeps the stream alive, rather than forcing it to wait for the whole dataset to be processed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When to use it&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When dealing with large datasets (e.g., CSV files, large JSON)&lt;/li&gt;
&lt;li&gt;When paginating results from a large database query&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;How it works&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Divide large datasets into chunks.&lt;/li&gt;
&lt;li&gt;Process each chunk and stream it immediately.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Example: Stream Responses While Processing Large Datasets&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate processing time
&lt;/span&gt;    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Processed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;data_chunks&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk3&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chunk5&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;data_chunks&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;processed_chunk&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;process_chunk&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;processed_chunk&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate delay between chunks
&lt;/span&gt;
&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;stream_generator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔍 &lt;strong&gt;What’s happening here?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Instead of processing a &lt;strong&gt;big file all at once&lt;/strong&gt;, the data is processed in chunks.&lt;/li&gt;
&lt;li&gt;The stream stays responsive, sending updates as each chunk finishes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pro Tip:&lt;/strong&gt; Use chunked processing for large datasets (like CSVs) to stream "partial results" instead of waiting for the whole job to finish.&lt;/p&gt;
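
&lt;p&gt;Here is a minimal sketch of that idea for a CSV file (the file name is illustrative); each batch is read in a worker thread so the event loop keeps streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import csv
from itertools import islice

CSV_PATH = "big_dataset.csv"  # hypothetical file

async def csv_chunk_stream(batch_size=500):
    with open(CSV_PATH, newline="") as f:
        reader = csv.reader(f)
        while True:
            # Pull the next batch of rows in a worker thread
            batch = await asyncio.to_thread(lambda: list(islice(reader, batch_size)))
            if not batch:
                break
            yield f"data: processed {len(batch)} rows\n\n"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;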




&lt;h3&gt;
  
  
  &lt;strong&gt;📊 Which Method Should You Use?&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Method&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Use For&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Use Case Example&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Async Functions&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;I/O tasks (like APIs)&lt;/td&gt;
&lt;td&gt;Streaming responses from API calls&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Background Tasks&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Heavy computation&lt;/td&gt;
&lt;td&gt;Running ML inference while streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Chunked Processing&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Large datasets&lt;/td&gt;
&lt;td&gt;Streaming data from large files&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;🚀 Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When it comes to LLM streaming, blocking function calls are a &lt;strong&gt;hidden bottleneck&lt;/strong&gt;. They stop the stream, causing lags and bad user experiences.&lt;/p&gt;

&lt;p&gt;But now you know the &lt;strong&gt;3 ways to fix it&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;1️⃣ Use &lt;strong&gt;Async Functions&lt;/strong&gt; for I/O tasks.&lt;/p&gt;

&lt;p&gt;2️⃣ Use &lt;strong&gt;Background Tasks&lt;/strong&gt; for heavy computations.&lt;/p&gt;

&lt;p&gt;3️⃣ Use &lt;strong&gt;Chunked Processing&lt;/strong&gt; for large datasets.&lt;/p&gt;

&lt;p&gt;By using these techniques, you’ll keep your streams fast, smooth, and &lt;strong&gt;real-time&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;💡 &lt;strong&gt;Want more LLM superpowers?&lt;/strong&gt; Check out my course on newline: &lt;a href="https://www.newline.co/courses/responsive-llm-applications-with-server-sent-events" rel="noopener noreferrer"&gt;Responsive LLM Applications with Server-Sent Events&lt;/a&gt;. It’s a complete toolkit for building high-performance, real-time AI apps.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Building the Ideal AI Agent: From Async Event Streams to Context-Aware State Management</title>
      <dc:creator>Louis Sanna</dc:creator>
      <pubDate>Thu, 12 Dec 2024 16:26:41 +0000</pubDate>
      <link>https://dev.to/louis-sanna/building-the-ideal-ai-agent-from-async-event-streams-to-context-aware-state-management-33</link>
      <guid>https://dev.to/louis-sanna/building-the-ideal-ai-agent-from-async-event-streams-to-context-aware-state-management-33</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The dream of an &lt;strong&gt;autonomous AI agent&lt;/strong&gt; isn’t just about generating smart responses — it’s about making those responses fast, interactive, and context-aware. To achieve this, you need to manage state across asynchronous tasks, handle real-time communication, and separate logic cleanly.&lt;/p&gt;

&lt;p&gt;In this blog, you’ll learn how to design an &lt;strong&gt;ideal AI agent&lt;/strong&gt; by:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Using &lt;strong&gt;asynchronous server-sent events (SSE)&lt;/strong&gt; to create live, real-time AI responses.&lt;/li&gt;
&lt;li&gt;Simplifying state management with &lt;strong&gt;context variables&lt;/strong&gt; (&lt;code&gt;contextvars&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Decoupling logic from network operations to build a &lt;strong&gt;scalable, maintainable architecture&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, you’ll have a step-by-step understanding of how to design an agent that’s efficient, elegant, and easy to scale.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1️⃣ The Architecture of an Ideal AI Agent&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Most "basic" AI agents are tangled in a mess of network calls, domain logic, and asynchronous event management. This makes it difficult to debug and hard to scale.&lt;/p&gt;

&lt;p&gt;The ideal agent separates these concerns:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Event-driven Communication&lt;/strong&gt;: Uses asynchronous &lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt; to stream updates to users in real time.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context-Aware State Management&lt;/strong&gt;: Manages context across multiple async calls using Python’s &lt;strong&gt;contextvars&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decoupled Business Logic&lt;/strong&gt;: Avoids tightly coupling logic with network operations, making it easier to maintain.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Diagram: Ideal AI Agent Architecture&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;+----------------------------+
|     Client (Web Browser)    |
|    Listens for SSE Events   |
+----------------------------+
            ⬇
+----------------------------+
|    FastAPI Backend          |
|    (Async Streaming)        |
+----------------------------+
            ⬇
+----------------------------+
|    Agent Logic              |
|  1️⃣ Generates Output       |
|  2️⃣ Emits Real-Time Events |
+----------------------------+
            ⬇
+----------------------------+
|    Event Queue + Context    |
|   Context-Aware State       |
+----------------------------+

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  &lt;strong&gt;2️⃣ Key Concepts to Build the Ideal Agent&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Asynchronous Event Streaming (SSE)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of waiting for the entire AI response to finish, we can stream each "chunk" of the response to the user. This makes the interaction feel faster, even if the total response time is the same.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How It Works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The client opens an &lt;strong&gt;event stream&lt;/strong&gt; (&lt;code&gt;text/event-stream&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Every time the agent generates new content (like a sentence or paragraph), it streams that chunk to the client (the wire format is shown just below).&lt;/li&gt;
&lt;li&gt;When the full response is complete, the event stream closes.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why It’s Important:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Feels more &lt;strong&gt;interactive&lt;/strong&gt; to users.&lt;/li&gt;
&lt;li&gt;Allows for &lt;strong&gt;partial responses&lt;/strong&gt; — users can see content as it’s created.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
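
&lt;p&gt;On the wire, each chunk is just a &lt;code&gt;data:&lt;/code&gt; line followed by a blank line, which is all the &lt;code&gt;text/event-stream&lt;/code&gt; format requires:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: First chunk of the answer

data: Second chunk of the answer

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;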




&lt;h3&gt;
  
  
&lt;strong&gt;2. Context-Aware State Management (&lt;code&gt;contextvars&lt;/code&gt;)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Agents often deal with &lt;strong&gt;asynchronous tasks&lt;/strong&gt; that happen in parallel. Without context, shared state between these tasks becomes difficult.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Problem:&lt;/p&gt;

&lt;p&gt;Two user requests hit the server at the same time. How do you ensure their states are separate?&lt;/p&gt;

&lt;p&gt;Solution:&lt;/p&gt;

&lt;p&gt;Use Python’s &lt;strong&gt;&lt;code&gt;contextvars&lt;/code&gt;&lt;/strong&gt;. This lets you manage request-specific variables, even when multiple requests happen at once. Think of it like &lt;strong&gt;thread-local storage&lt;/strong&gt; for async code (see the demo just after this note).&lt;/p&gt;

&lt;p&gt;How It Works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a new request arrives, a &lt;strong&gt;queue&lt;/strong&gt; is created in the context.&lt;/li&gt;
&lt;li&gt;This queue holds the &lt;strong&gt;event messages&lt;/strong&gt; (chunks) for that specific request.&lt;/li&gt;
&lt;li&gt;As the agent generates output, it &lt;strong&gt;emits chunks&lt;/strong&gt; into the queue.&lt;/li&gt;
&lt;li&gt;Once the queue is empty and the task is done, the context is &lt;strong&gt;cleaned up&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
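
&lt;p&gt;Here is a minimal, self-contained demo of that isolation (the names are illustrative and separate from the agent code below):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import contextvars

request_id_var = contextvars.ContextVar("request_id")

async def handle(request_id):
    request_id_var.set(request_id)
    await asyncio.sleep(0.1)  # another request runs during this await...
    print(request_id_var.get())  # ...yet each task still sees its own value

async def main():
    await asyncio.gather(handle("req-1"), handle("req-2"))

asyncio.run(main())  # each task prints its own id; values never leak across tasks

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;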




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Decoupled Agent Logic&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The best AI agents keep &lt;strong&gt;network logic&lt;/strong&gt; and &lt;strong&gt;business logic&lt;/strong&gt; separate. Instead of directly streaming from the agent, we push events into a queue and handle event streaming &lt;strong&gt;separately&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This separation makes it easier to test, debug, and maintain the system.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Concepts You’ll Need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;emit_event()&lt;/strong&gt;: Adds events to the queue.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;close()&lt;/strong&gt;: Closes the queue when the task finishes.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Streaming Response&lt;/strong&gt;: Sends chunks from the queue to the client.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;3️⃣ Step-by-Step Guide to Building the Ideal Agent&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Setup the Environment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Install the necessary libraries:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn pydantic

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Build the Context-Aware Event System&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This system tracks and streams events from the agent to the client. Here’s how:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use a &lt;strong&gt;&lt;code&gt;contextvars.ContextVar&lt;/code&gt;&lt;/strong&gt; to store the event queue.&lt;/li&gt;
&lt;li&gt;Create functions to &lt;strong&gt;emit events&lt;/strong&gt; and &lt;strong&gt;close the queue&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The agent generates chunks and adds them to the queue; a streaming response reads them off the queue to the client.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;context.py&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;contextvars&lt;/span&gt;

&lt;span class="c1"&gt;# Create a context variable to store request-specific data
&lt;/span&gt;&lt;span class="n"&gt;chat_context_var&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;contextvars&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ContextVar&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;chat_context&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Function to initialize the context (per request)
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_chat_context&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Queue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;emit_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;close&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Signals to close the stream
&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;emit_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: The Chat Service&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the core logic where the agent processes a request.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;When a client sends a chat request, we create a new context.&lt;/li&gt;
&lt;li&gt;The context tracks the queue where messages (events) are stored.&lt;/li&gt;
&lt;li&gt;Each message chunk from the agent is streamed to the client in real time.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;chat.py&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;APIRouter&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;build_chat_context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat_context_var&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;APIRouter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_messages&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;emit_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chat_context_var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate a delay to send chunks one-by-one
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;emit_event&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# End of stream
&lt;/span&gt;
&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;emit_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_chat_context&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;chat_context_var&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="n"&gt;emit_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;close&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;task&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_task&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;process_messages&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;How are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;queue&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# End of stream
&lt;/span&gt;                &lt;span class="k"&gt;break&lt;/span&gt;
            &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;event_generator&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Run the Service&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Run the FastAPI server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn chat:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Make a POST request to:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://localhost:8000/api/stream

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Watch the &lt;strong&gt;real-time response&lt;/strong&gt; as chunks appear.&lt;/p&gt;
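
&lt;p&gt;For example, with curl (&lt;code&gt;-N&lt;/code&gt; disables buffering so each event shows up as it arrives):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -N -X POST http://localhost:8000/api/stream

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;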




&lt;h2&gt;
  
  
  &lt;strong&gt;4️⃣ Advanced Techniques&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Use &lt;code&gt;async def emit_event()&lt;/code&gt; Instead of &lt;code&gt;yield&lt;/code&gt;&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The "yield" method can cause issues when events come from multiple functions. Instead, push events to a queue using &lt;strong&gt;&lt;code&gt;emit_event()&lt;/code&gt;&lt;/strong&gt;. This avoids yield problems, especially when sub-functions need to send events.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2. Manage Long-Running Tasks&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Use &lt;strong&gt;&lt;code&gt;asyncio.create_task()&lt;/code&gt;&lt;/strong&gt; to process long-running tasks without blocking the entire stream. This allows multiple users to receive independent updates.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;3. Use WebSockets Instead of SSE&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For more interactive experiences, use WebSockets. Unlike SSE, WebSockets support &lt;strong&gt;bi-directional communication&lt;/strong&gt;.&lt;/p&gt;
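
&lt;p&gt;A minimal sketch of what that looks like with FastAPI’s WebSocket support:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from fastapi import FastAPI, WebSocket, WebSocketDisconnect

app = FastAPI()

@app.websocket("/ws")
async def chat_ws(websocket: WebSocket):
    await websocket.accept()
    try:
        while True:
            user_message = await websocket.receive_text()  # client sends to server
            await websocket.send_text(f"echo: {user_message}")  # server replies on the same socket
    except WebSocketDisconnect:
        pass  # client closed the connection

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;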




&lt;h2&gt;
  
  
  &lt;strong&gt;5️⃣ Key Takeaways&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context-aware agents&lt;/strong&gt; can separate network logic from agent logic.&lt;/li&gt;
&lt;li&gt;Use &lt;strong&gt;SSE (Server-Sent Events)&lt;/strong&gt; for real-time feedback.&lt;/li&gt;
&lt;li&gt;Manage agent state using Python’s &lt;strong&gt;contextvars&lt;/strong&gt; to keep state isolated.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;emit_event()&lt;/strong&gt; makes it simple to send updates from any part of the agent logic.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;6️⃣ Full Code Example&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Here’s the complete file structure:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;├── context.py         # Handles contextvars and event system
├── chat.py            # The core logic for the streaming service
└── main.py            # Starts the FastAPI server

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
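
&lt;p&gt;&lt;code&gt;main.py&lt;/code&gt; isn’t shown above; a minimal version, assuming &lt;code&gt;chat.py&lt;/code&gt; exposes the &lt;code&gt;router&lt;/code&gt;, could be:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# main.py (a minimal sketch)
from fastapi import FastAPI
from chat import router

app = FastAPI()
app.include_router(router)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;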






&lt;h2&gt;
  
  
  &lt;strong&gt;7️⃣ Final Thoughts&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Building an "ideal" AI agent isn’t just about improving its intelligence — it’s about making it more interactive, more maintainable, and more scalable. By using &lt;strong&gt;async events&lt;/strong&gt;, &lt;strong&gt;contextvars&lt;/strong&gt;, and &lt;strong&gt;real-time streams&lt;/strong&gt;, you can create an agent that "feels" fast and responsive.&lt;/p&gt;

&lt;p&gt;If you’re ready to &lt;strong&gt;level up your agents&lt;/strong&gt;, apply these principles to your next AI project.&lt;/p&gt;

&lt;p&gt;Want to learn more about building Responsive LLMs? Check out my course on newline: &lt;a href="https://www.newline.co/courses/responsive-llm-applications-with-server-sent-events" rel="noopener noreferrer"&gt;Responsive LLM Applications with Server-Sent Events&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to design systems for AI applications&lt;/li&gt;
&lt;li&gt;How to stream the answer of a Large Language Model&lt;/li&gt;
&lt;li&gt;Differences between Server-Sent Events and WebSockets&lt;/li&gt;
&lt;li&gt;Importance of real-time for GenAI UI&lt;/li&gt;
&lt;li&gt;How asynchronous programming in Python works&lt;/li&gt;
&lt;li&gt;How to integrate LangChain with FastAPI&lt;/li&gt;
&lt;li&gt;What problems Retrieval Augmented Generation can solve&lt;/li&gt;
&lt;li&gt;How to create an AI agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and much more.&lt;/p&gt;

&lt;p&gt;Worth checking out if you want to build your own LLM applications.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Self-Correcting AI Agents: How to Build AI That Learns From Its Mistakes</title>
      <dc:creator>Louis Sanna</dc:creator>
      <pubDate>Thu, 12 Dec 2024 16:25:27 +0000</pubDate>
      <link>https://dev.to/louis-sanna/self-correcting-ai-agents-how-to-build-ai-that-learns-from-its-mistakes-39f1</link>
      <guid>https://dev.to/louis-sanna/self-correcting-ai-agents-how-to-build-ai-that-learns-from-its-mistakes-39f1</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What if your AI agent could recognize its own mistakes, learn from them, and try again — without human intervention? Welcome to the world of &lt;strong&gt;self-correcting AI agents&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Most AI models generate outputs in a single attempt. But self-correcting agents go further. They can identify when an error occurs, analyze the cause, and apply a fix — all in real time. Think of it as an AI with a built-in "trial and error" mindset.&lt;/p&gt;

&lt;p&gt;In this blog, you’ll learn:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What &lt;strong&gt;self-correction&lt;/strong&gt; means for AI agents.&lt;/li&gt;
&lt;li&gt;How to build an agent that &lt;strong&gt;adapts to mistakes&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;How to apply the &lt;strong&gt;reflection pattern&lt;/strong&gt; in agent design.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, you’ll know how to design AI agents that not only fail gracefully but also improve on every attempt.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;1️⃣ What is a Self-Correcting Agent?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;A &lt;strong&gt;self-correcting agent&lt;/strong&gt; is an AI system that can recognize its own failures and attempt a new strategy. If the initial approach doesn't work, the agent re-evaluates and tries an alternative path.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Analogy:&lt;/p&gt;

&lt;p&gt;Imagine asking a chef to bake a cake, but they use too much sugar the first time. A standard AI would keep making the same mistake. But a self-correcting AI would notice the error, reduce the sugar next time, and adjust until the cake tastes perfect.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Do Self-Correcting Agents Matter?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Most AI tools (like ChatGPT) can only give you a single response. If it's wrong, you have to manually ask it to "try again." But a self-correcting agent can &lt;strong&gt;autonomously retry&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;🛠️ Example Use Case:&lt;/p&gt;

&lt;p&gt;An AI is asked to write a Python function that calculates Fibonacci numbers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 1:&lt;/strong&gt; The AI writes a slow recursive function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self-Correction:&lt;/strong&gt; It notices that recursion is too slow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attempt 2:&lt;/strong&gt; The AI rewrites the function using dynamic programming, making it faster.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2️⃣ Key Techniques for Self-Correction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;How do we make an agent self-aware enough to recognize its mistakes? Here are three main techniques:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Error Detection&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Identify if the result is "wrong" (like an API call failure, incorrect output, or bad performance).&lt;/li&gt;
&lt;li&gt;Use error codes, exceptions, or test cases to detect failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Reflection&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Agents reflect on their decisions, ask, "What went wrong?", and plan their next step.&lt;/li&gt;
&lt;li&gt;Reflection can be achieved by logging errors, tracking unsuccessful API calls, or re-evaluating response quality.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Retry Logic&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;After reflection, agents retry with an improved strategy.&lt;/li&gt;
&lt;li&gt;This might mean switching API providers, using more efficient logic, or applying a backup approach (a combined sketch follows below).&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Pro Tip:&lt;/p&gt;

&lt;p&gt;Error logs can be fed back into the AI model to improve future performance.&lt;/p&gt;
&lt;/blockquote&gt;
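
&lt;p&gt;Putting the three techniques together, here is a minimal sketch of the detect/reflect/retry loop (all names are illustrative):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;def run_with_self_correction(generate, test, max_attempts=3):
    feedback = None
    for attempt in range(1, max_attempts + 1):
        candidate = generate(feedback)  # generation, optionally guided by past errors
        ok, error = test(candidate)     # error detection
        if ok:
            return candidate
        feedback = f"Attempt {attempt} failed: {error}"  # reflection fed into the next try
    raise RuntimeError("No working solution after retries")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;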




&lt;h3&gt;
  
  
  &lt;strong&gt;3️⃣ Self-Correction in Practice&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s build a self-correcting AI agent using Python and FastAPI.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;🧑‍💻 Step 1: The Problem&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We want an AI agent that can generate a Python function. If the function fails to run or produces the wrong output, the agent will automatically &lt;strong&gt;correct itself&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Problem:&lt;/strong&gt; Write a Fibonacci function that calculates the 10th Fibonacci number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Challenge:&lt;/strong&gt; If the agent generates a recursive version (which is slow), it should recognize this and rewrite it using dynamic programming.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;🧑‍💻 Step 2: Set Up the Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Install the necessary dependencies (the example code below uses the legacy pre-1.0 OpenAI SDK, so the version is pinned):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openai fastapi uvicorn

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;🧑‍💻 Step 3: Write the Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s how the agent works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It generates a Python function using &lt;strong&gt;OpenAI’s API&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;It &lt;strong&gt;runs the function&lt;/strong&gt; to check if it works.&lt;/li&gt;
&lt;li&gt;If the function fails (slow, wrong, or error), it reflects and &lt;strong&gt;corrects the approach&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code Implementation&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="c1"&gt;# 🔐 Replace with your OpenAI API key
&lt;/span&gt;&lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_openai_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# 🎉 Step 1: Ask the AI to generate a Fibonacci function
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_fibonacci_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Write a Python function to calculate the 10th Fibonacci number.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ChatCompletion&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;function_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;function_code&lt;/span&gt;

&lt;span class="c1"&gt;# 🧪 Step 2: Test the function to see if it works
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_fibonacci_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_code&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;exec&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Run the function in a safe execution environment
&lt;/span&gt;        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;fibonacci(10)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Call the function with n=10
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;55&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;  &lt;span class="c1"&gt;# Correct Fibonacci value for n=10
&lt;/span&gt;            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wrong_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🌀 Step 3: Self-Correct by asking for a new version of the function
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;self_correct_function&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;max_attempts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_attempts&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;🟢 Attempt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate a new Fibonacci function
&lt;/span&gt;        &lt;span class="n"&gt;function_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;generate_fibonacci_function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated function:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;function_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Test the function to see if it works
&lt;/span&gt;        &lt;span class="n"&gt;status&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;test_fibonacci_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;function_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;✅ Success! Fibonacci(10) = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt;
        &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;status&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wrong_output&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Incorrect result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Asking AI to try a better method.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;💥 Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Asking AI to try again.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;❌ Max attempts reached. Could not generate a correct function.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# 🔥 Run the correction process
&lt;/span&gt;&lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;self_correct_function&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
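
&lt;p&gt;One caveat: as written, every retry re-sends the same prompt, so the agent relies on sampling luck rather than true reflection. A small variant (a sketch, assuming the same pre-1.0 &lt;code&gt;openai&lt;/code&gt; client as above) threads the failure back into the next request:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import openai  # assumes openai.api_key is already set, as above

async def generate_fibonacci_function(feedback=None):
    prompt = "Write a Python function named fibonacci(n) that returns the nth Fibonacci number. Return only code."
    if feedback:
        # Reflection in action: tell the model what went wrong last time.
        prompt += f"\nThe previous attempt failed with: {feedback}. Fix this."
    response = await openai.ChatCompletion.acreate(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response['choices'][0]['message']['content']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the retry loop, pass the failure along on the next iteration, e.g. &lt;code&gt;function_code = await generate_fibonacci_function(feedback=result)&lt;/code&gt; whenever the status is not &lt;code&gt;success&lt;/code&gt;.&lt;/p&gt;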






&lt;h3&gt;
  
  
  &lt;strong&gt;4️⃣ How It Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Generate Function&lt;/strong&gt;: The AI writes a Python function for Fibonacci.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Run the Function&lt;/strong&gt;: The agent executes the function and checks the result.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Correct&lt;/strong&gt;: If the result is wrong, it prompts OpenAI to &lt;strong&gt;try again&lt;/strong&gt; with a smarter approach.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;Output Example&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;🟢 Attempt 1
❌ Incorrect result: 42. Asking AI to try a better method.

🟢 Attempt 2
💥 Error: NameError: name 'fibonacci' is not defined. Asking AI to try again.

🟢 Attempt 3
✅ Success! Fibonacci(10) = 55

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;5️⃣ Key Patterns in Self-Correcting Agents&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Error Detection&lt;/strong&gt;: Look for incorrect output, slow performance, or exceptions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reflection&lt;/strong&gt;: Log the problem. Why did it fail?&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retry Logic&lt;/strong&gt;: Call a new version of the function, but smarter this time.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Pro Tip:&lt;/p&gt;

&lt;p&gt;Use a &lt;strong&gt;feedback loop&lt;/strong&gt; to let the agent learn from mistakes. Feed logs back into the agent to help it recognize common issues.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;6️⃣ When Should You Use Self-Correcting Agents?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Self-correcting agents are most useful where failures are frequent and manual intervention is costly.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;API Calls&lt;/strong&gt;: Retry if an API fails.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Generation&lt;/strong&gt;: Re-generate code if it throws errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Analysis&lt;/strong&gt;: Fix incorrect predictions in ML models.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7️⃣ Benefits of Self-Correcting Agents&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Problem&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Agent gets it wrong&lt;/td&gt;
&lt;td&gt;Retry with a better approach&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API request fails&lt;/td&gt;
&lt;td&gt;Retry with exponential backoff&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Code generation error&lt;/td&gt;
&lt;td&gt;Use a smarter prompt&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
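
&lt;p&gt;The "retry with exponential backoff" row deserves a concrete shape. Here’s a minimal sketch in plain asyncio, with jitter so concurrent retries don’t all wake up at once (&lt;code&gt;make_call&lt;/code&gt; is any async callable you supply):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio
import random

async def call_with_backoff(make_call, max_attempts=5, base_delay=1.0):
    """Retry an async call, doubling the wait (plus jitter) after each failure."""
    for attempt in range(max_attempts):
        try:
            return await make_call()
        except Exception as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts, surface the last error
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {delay:.1f}s")
            await asyncio.sleep(delay)

# Usage with a hypothetical tool: await call_with_backoff(lambda: flaky_api_call())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;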




&lt;h3&gt;
  
  
  &lt;strong&gt;8️⃣ Take It to the Next Level&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Use a Cache&lt;/strong&gt;: Store successful outputs so the agent doesn’t start from scratch (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add Feedback Loops&lt;/strong&gt;: If a function fails often, feed logs into a training process.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Track Agent Confidence&lt;/strong&gt;: If the agent is unsure, have it run test cases.&lt;/li&gt;
&lt;/ol&gt;
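
&lt;p&gt;For the cache, even an in-memory dict goes a long way. A minimal sketch (&lt;code&gt;solve_from_scratch&lt;/code&gt; is a hypothetical stand-in for the self-correction loop above):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import asyncio

solution_cache = {}  # problem statement -&gt; validated function code

async def solve_from_scratch(problem):
    # Hypothetical stand-in for the self-correction loop above.
    return f"def solution(): ...  # solves: {problem}"

async def solve_with_cache(problem):
    if problem in solution_cache:
        return solution_cache[problem]  # cache hit: skip generation entirely
    code = await solve_from_scratch(problem)
    solution_cache[problem] = code  # only store output that passed its tests
    return code

print(asyncio.run(solve_with_cache("10th Fibonacci number")))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;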




&lt;h3&gt;
  
  
  &lt;strong&gt;9️⃣ Wrapping Up&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;You now have the blueprint for a &lt;strong&gt;self-correcting agent&lt;/strong&gt; that can write, test, and fix Python functions. Here’s what we covered:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The 3 pillars of self-correction: &lt;strong&gt;Error Detection, Reflection, Retry Logic&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;How to build an agent that generates and tests Python functions.&lt;/li&gt;
&lt;li&gt;Best practices for building smarter, more reliable agents.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💪 Challenge:&lt;/p&gt;

&lt;p&gt;Build a self-correcting agent that not only generates code but &lt;strong&gt;evaluates runtime performance&lt;/strong&gt;. If the function is too slow, have it re-write the function for optimization.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Want to learn more about building Responsive LLMs? Check out my course on newline: &lt;a href="https://www.newline.co/courses/responsive-llm-applications-with-server-sent-events" rel="noopener noreferrer"&gt;Responsive LLM Applications with Server-Sent Events&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to design systems for AI applications&lt;/li&gt;
&lt;li&gt;How to stream the answer of a Large Language Model&lt;/li&gt;
&lt;li&gt;Differences between Server-Sent Events and WebSockets&lt;/li&gt;
&lt;li&gt;Importance of real-time for GenAI UI&lt;/li&gt;
&lt;li&gt;How asynchronous programming in Python works&lt;/li&gt;
&lt;li&gt;How to integrate LangChain with FastAPI&lt;/li&gt;
&lt;li&gt;What problems Retrieval Augmented Generation can solve&lt;/li&gt;
&lt;li&gt;How to create an AI agent
... and much more.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>langchain</category>
      <category>fastapi</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>How to Build Smarter AI Agents with Dynamic Tooling</title>
      <dc:creator>Louis Sanna</dc:creator>
      <pubDate>Thu, 12 Dec 2024 16:23:59 +0000</pubDate>
      <link>https://dev.to/louis-sanna/how-to-build-smarter-ai-agents-with-dynamic-tooling-12i</link>
      <guid>https://dev.to/louis-sanna/how-to-build-smarter-ai-agents-with-dynamic-tooling-12i</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Imagine having an AI agent that can access real-time weather data, process complex calculations, and improve itself after making a mistake — all without human intervention. Sounds kinda neat, right? Well, it’s not as hard to build as you might think.&lt;/p&gt;

&lt;p&gt;Large Language Models (LLMs) like GPT-4 are impressive, but they have limits. Out-of-the-box, they can't access live data or perform calculations that require real-time inputs. But with &lt;strong&gt;dynamic tooling&lt;/strong&gt;, you can break these limits, allowing agents to fetch live information, make decisions, and even self-correct when things go wrong.&lt;/p&gt;

&lt;p&gt;In this guide, we’ll walk you through how to build an AI agent that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Access real-time API data.&lt;/li&gt;
&lt;li&gt;Self-correct and improve its performance.&lt;/li&gt;
&lt;li&gt;Use a clean, maintainable architecture for future upgrades.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end, you'll have the tools you need to build an agent that's as flexible as it is powerful.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;1️⃣ What are Dynamic Tools in AI Agents?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Dynamic tools allow AI agents to go beyond static responses. Instead of just generating text, agents can "call" specific actions, like fetching real-time data, executing scripts, or correcting their own mistakes.&lt;/p&gt;

&lt;p&gt;Here’s a simple analogy:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Static AI is like a person who answers only from their memory.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Dynamic AI&lt;/strong&gt; is like someone who can use a search engine, calculator, or dictionary to give you better answers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Why Does This Matter?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;With dynamic tools, you can build smarter AI agents that can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Call external APIs&lt;/strong&gt; for real-time data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Process information&lt;/strong&gt; like calculations, translations, or data transformation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-correct&lt;/strong&gt; errors in their own logic or execution.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🛠️ Example Use Case:&lt;/p&gt;

&lt;p&gt;"What's the average temperature in Tokyo and Paris right now?"&lt;/p&gt;

&lt;p&gt;A static AI would fail, but a dynamic AI can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Call a weather API for Tokyo.&lt;/li&gt;
&lt;li&gt;Call a weather API for Paris.&lt;/li&gt;
&lt;li&gt;Calculate the average temperature.&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;2️⃣ Building a Real-Time API Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s create an AI agent that can fetch real-time weather data from an API and compute the average temperature between two cities.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Tools Required&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;FastAPI&lt;/strong&gt;: To build the backend.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Asyncio&lt;/strong&gt;: For asynchronous event handling.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;An External Weather API&lt;/strong&gt;: For real-time temperature data (like OpenWeatherMap or WeatherAPI).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;🧑‍💻 Step 1: Setting up the Environment&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Start by installing the necessary packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn httpx

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;🧑‍💻 Step 2: Writing the Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Here’s how the agent works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It receives a list of cities (like Tokyo and Paris).&lt;/li&gt;
&lt;li&gt;It fetches live weather data for each city using &lt;strong&gt;HTTPX&lt;/strong&gt; (an async HTTP client).&lt;/li&gt;
&lt;li&gt;It calculates the average temperature and returns it.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Code Implementation&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# 🌦️ Step 1: Function to get weather data
&lt;/span&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch the current temperature for a given city using an API.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openweathermap.org/data/2.5/weather?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;appid=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;units=metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;temperature&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;

&lt;span class="c1"&gt;# 🌦️ Step 2: Agent to calculate average temperature
&lt;/span&gt;&lt;span class="nd"&gt;@app.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/average-temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_average_temperature&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Takes a list of cities and returns the average temperature.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;cities&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please provide a list of cities&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;temperatures&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;gather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;city&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;cities&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="n"&gt;average_temp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperatures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;temperatures&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;average_temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;average_temp&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;How It Works&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;get_weather(city)&lt;/code&gt;&lt;/strong&gt;: Calls the weather API to get the temperature for a city.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;&lt;code&gt;calculate_average_temperature(request)&lt;/code&gt;&lt;/strong&gt;: Reads the list of cities from the request body, fetches the weather for each city concurrently with &lt;code&gt;asyncio.gather&lt;/code&gt;, and returns the average temperature.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;🔥 Test It!&lt;br&gt;
Start the server:&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn filename:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Send a POST request to &lt;code&gt;http://127.0.0.1:8000/average-temperature&lt;/code&gt; with this JSON body:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cities"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"Tokyo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Paris"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The response will look something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"average_temperature"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mf"&gt;18.5&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;3️⃣ Building Self-Correcting Agents&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;What if the agent calls the API but gets a rate-limiting error or the data is incomplete? Can it &lt;strong&gt;self-correct&lt;/strong&gt;? Yes, it can!&lt;/p&gt;

&lt;p&gt;Self-correcting agents work by analyzing failures and then attempting a new approach. Here’s a simple example.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;🧑‍💻 Step 1: Detecting Errors&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When we call the weather API, we may receive an error. Instead of crashing, the agent should recognize the failure and &lt;strong&gt;retry&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Updated Code for Resilient Agent&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_weather&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Fetch the current temperature, retrying if necessary.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;API_KEY&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.openweathermap.org/data/2.5/weather?q=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;appid=&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;API_KEY&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;&amp;amp;units=metric&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# Retry up to 3 times
&lt;/span&gt;        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;AsyncClient&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
                    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;main&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;temp&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Attempt &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;attempt&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; failed: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to get weather data for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; after 3 attempts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;4️⃣ How to Build a Smarter Agent&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To make our agent &lt;strong&gt;smarter&lt;/strong&gt;, we can:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Add Self-Correction Loops&lt;/strong&gt;: Retry API calls on failure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Use Reflection&lt;/strong&gt;: If a calculation fails, reattempt using a new approach (e.g., switch API providers).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Modularize Tools&lt;/strong&gt;: Use dynamic tools so the agent can call functions &lt;strong&gt;as needed&lt;/strong&gt; rather than using them all the time.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s an enhanced design for our agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1️⃣ Agent Receives User Request (e.g., "What's the average temperature in Tokyo and Paris?")
2️⃣ Agent Uses Tool: Calls get_weather(city) for each city.
3️⃣ Agent Handles Errors: If a call fails, it retries with a new method.
4️⃣ Agent Returns the Result: Final response is sent to the user.

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 Pro Tip:&lt;/p&gt;

&lt;p&gt;Use &lt;strong&gt;reflection&lt;/strong&gt; by logging all errors and decisions the agent makes. This log can be used to self-correct on future requests.&lt;/p&gt;
&lt;/blockquote&gt;
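
&lt;p&gt;One lightweight way to implement that log is a plain in-memory list (a sketch; swap in structured logging or a database for production):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

decision_log = []  # in production, prefer structured logging or a database

def log_decision(tool, status, detail=""):
    """Record one agent decision so later runs (or prompts) can learn from it."""
    decision_log.append({
        "timestamp": time.time(),
        "tool": tool,      # e.g. "get_weather"
        "status": status,  # "success", "error", "retry", ...
        "detail": detail,  # error message, fallback chosen, etc.
    })

log_decision("get_weather", "retry", "HTTP 429 from provider")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;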




&lt;h3&gt;
  
  
  &lt;strong&gt;5️⃣ Best Practices for Dynamic Tool Agents&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Use context-aware logging&lt;/strong&gt;: Track failures and successes so the agent can learn.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Limit API calls&lt;/strong&gt;: Use caching to avoid unnecessary calls to the same data (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decouple logic&lt;/strong&gt;: Split agent logic (like get_weather) into independent "tool" functions.&lt;/li&gt;
&lt;/ul&gt;
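
&lt;p&gt;Here’s what the caching point can look like: a small TTL cache wrapped around the &lt;code&gt;get_weather&lt;/code&gt; tool defined earlier (a sketch; the 5-minute TTL is an assumption, tune it to your data):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import time

weather_cache = {}  # city -&gt; (fetched_at, temperature)
TTL_SECONDS = 300   # assumption: 5-minute-old weather is fresh enough

async def get_weather_cached(city: str) -&gt; float:
    now = time.monotonic()
    cached = weather_cache.get(city)
    if cached and now - cached[0] &lt; TTL_SECONDS:
        return cached[1]                   # cache hit: no API call made
    temperature = await get_weather(city)  # the tool defined earlier
    weather_cache[city] = (now, temperature)
    return temperature

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;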




&lt;h3&gt;
  
  
  &lt;strong&gt;6️⃣ Key Concepts Recap&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Tools&lt;/strong&gt;: Functions that agents can call (like APIs, calculators, or scripts).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Self-Correction&lt;/strong&gt;: Agents analyze their own failures and retry tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple Code, Smart Logic&lt;/strong&gt;: Make your agents smarter with minimal code changes (like retries).&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;7️⃣ Wrapping Up&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Congratulations! You've built a smarter AI agent with &lt;strong&gt;dynamic tools&lt;/strong&gt;. It can access weather APIs, calculate the average temperature, and retry when things go wrong. The key concepts here — &lt;strong&gt;API calls, self-correction, and modular tools&lt;/strong&gt; — are foundational for building more advanced agents that handle &lt;strong&gt;any&lt;/strong&gt; user request.&lt;/p&gt;

&lt;p&gt;Here’s a quick recap of what we did:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Built a &lt;strong&gt;weather-fetching agent&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Added &lt;strong&gt;resilience with self-correction&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Discussed best practices for agent development.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Want to take it to the next level? Here’s a challenge:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build an AI agent that can pull data from multiple sources, merge it, and generate insights.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Want to learn more about building Responsive LLMs? Check out my course on newline: &lt;a href="https://www.newline.co/courses/responsive-llm-applications-with-server-sent-events" rel="noopener noreferrer"&gt;Responsive LLM Applications with Server-Sent Events&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to design systems for AI applications&lt;/li&gt;
&lt;li&gt;How to stream the answer of a Large Language Model&lt;/li&gt;
&lt;li&gt;Differences between Server-Sent Events and WebSockets&lt;/li&gt;
&lt;li&gt;Importance of real-time for GenAI UI&lt;/li&gt;
&lt;li&gt;How asynchronous programming in Python works&lt;/li&gt;
&lt;li&gt;How to integrate LangChain with FastAPI&lt;/li&gt;
&lt;li&gt;What problems Retrieval Augmented Generation can solve&lt;/li&gt;
&lt;li&gt;How to create an AI agent
... and much more.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Worth checking out if you want to build your own LLM applications.&lt;/p&gt;

&lt;p&gt;🚀 &lt;strong&gt;Now it’s your turn to build smarter agents. Happy coding!&lt;/strong&gt; 🚀&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>fastapi</category>
      <category>llm</category>
    </item>
    <item>
      <title>Mastering Real-Time AI: A Developer’s Guide to Building Streaming LLMs with FastAPI and Transformers</title>
      <dc:creator>Louis Sanna</dc:creator>
      <pubDate>Thu, 12 Dec 2024 16:00:08 +0000</pubDate>
      <link>https://dev.to/louis-sanna/mastering-real-time-ai-a-developers-guide-to-building-streaming-llms-with-fastapi-and-transformers-2be8</link>
      <guid>https://dev.to/louis-sanna/mastering-real-time-ai-a-developers-guide-to-building-streaming-llms-with-fastapi-and-transformers-2be8</guid>
      <description>&lt;h3&gt;
  
  
  &lt;strong&gt;Introduction: Why Real-Time Streaming AI is the Future&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Real-time AI is transforming how users experience applications. Gone are the days when users had to wait for entire responses to load. Instead, modern apps stream data in chunks.&lt;/p&gt;

&lt;p&gt;For developers, this shift isn't just a "nice-to-have" — it's essential. Chatbots, search engines, and AI-powered customer support apps are now expected to integrate streaming LLM (Large Language Model) responses. But how do you actually build one?&lt;/p&gt;

&lt;p&gt;This guide walks you through the process, step-by-step, using FastAPI, Transformers, and a healthy dose of asynchronous programming. By the end, you'll have a working streaming endpoint capable of serving LLM-generated text in real-time.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Who This Is For:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Software Engineers&lt;/strong&gt; who want to upgrade their back-end skills with text streaming and event-driven programming.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Scientists&lt;/strong&gt; who want to repurpose ML skills for production-ready AI services.&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Table of Contents&lt;/strong&gt;
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;What Is a Streaming LLM and Why It Matters?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Tech Stack Overview: The Tools You'll Need&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project Walkthrough: Building the Streaming LLM Backend&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Environment Setup&lt;/li&gt;
&lt;li&gt;Setting Up FastAPI&lt;/li&gt;
&lt;li&gt;Building the Streaming Endpoint&lt;/li&gt;
&lt;li&gt;Connecting the LLM with Transformers&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Client-Side Integration: Consuming the Stream&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Deploying Your Streaming AI App&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Conclusion and Next Steps&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;1️⃣ What Is a Streaming LLM and Why It Matters?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When you type into ChatGPT or ask a question in Google Bard, you'll notice the response appears &lt;strong&gt;one word at a time&lt;/strong&gt;. Streaming LLMs send chunks of text as they're generated rather than waiting for the entire message to finish, delivering the response in real time.&lt;/p&gt;

&lt;p&gt;Here’s why you should care as a developer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Faster User Feedback&lt;/strong&gt;: Users see responses sooner.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lower Latency Perception&lt;/strong&gt;: Users &lt;em&gt;feel&lt;/em&gt; like the system is faster, even if total time is the same.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Improved UX for AI Chatbots&lt;/strong&gt;: Streaming text "feels" human, mimicking natural conversation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’ve used ChatGPT, you’ve already experienced this. Now it’s time to learn how to build one yourself.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;2️⃣ Tech Stack Overview: The Tools You'll Need&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;To build your streaming LLM backend, you’ll need the following tools:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;📦 Core Technologies&lt;/strong&gt;
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Tool&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Purpose&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;FastAPI&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handles API requests and real-time streaming&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Uvicorn&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Runs the FastAPI app as an ASGI server&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Transformers&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Access pre-trained language models&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;asyncio&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Handles asynchronous event loops&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;contextvars&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Keeps track of context in async tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Server-Sent Events (SSE)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Streams messages to the client&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Docker&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Optional for containerization and deployment&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Note: Server-Sent Events (SSE) is different from WebSockets. SSE lets the server push data to the client over a plain HTTP connection, while WebSockets support bi-directional communication. For LLM streaming, where data only flows from server to client, SSE is the simpler fit.&lt;/p&gt;
&lt;/blockquote&gt;
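
&lt;p&gt;For reference, the SSE wire format is plain text: each event is one or more &lt;code&gt;data:&lt;/code&gt; lines, optionally preceded by &lt;code&gt;event:&lt;/code&gt; or &lt;code&gt;id:&lt;/code&gt; fields, and terminated by a blank line:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data: a first chunk of text

event: token
data: a chunk with a named event type

id: 42
data: a chunk with an id the client can resume from

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;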




&lt;h2&gt;
  
  
  &lt;strong&gt;3️⃣ Project Walkthrough: Building the Streaming LLM Backend&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Environment Setup&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Install Python and Pip&lt;/strong&gt;: Ensure Python 3.7+ is installed.&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Create a Virtual Environment&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv
&lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate  &lt;span class="c"&gt;# On Windows: venv\Scripts\activate&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Install Dependencies&lt;/strong&gt;:&lt;br&gt;
&lt;/p&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi uvicorn transformers asyncio

&lt;/code&gt;&lt;/pre&gt;

&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Set Up FastAPI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create a file named &lt;code&gt;app.py&lt;/code&gt;. Here’s the basic FastAPI setup.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Response&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;root&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Welcome to Real-Time LLM Streaming!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Run the server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;uvicorn app:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Visit &lt;code&gt;http://127.0.0.1:8000/&lt;/code&gt; in your browser. You should see:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Welcome to Real-Time LLM Streaming!"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h3&gt;
  
  
  &lt;strong&gt;Step 3: Build the Streaming Endpoint&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Instead of returning a single response, we’ll &lt;strong&gt;stream it chunk-by-chunk&lt;/strong&gt;. Here’s the idea:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The client makes a request to &lt;code&gt;/stream&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The server "yields" parts of the response as they are generated.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here’s the code for the streaming endpoint:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;event_stream&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Simulate response delay
&lt;/span&gt;        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: Message &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_response&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;event_stream&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;🔥 Test It:&lt;/p&gt;

&lt;p&gt;Run the server and visit &lt;code&gt;http://127.0.0.1:8000/stream&lt;/code&gt; — you'll see "Message 0", "Message 1", etc., appear every second.&lt;/p&gt;
&lt;/blockquote&gt;
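
&lt;p&gt;You can also watch the raw event stream from a terminal; curl’s &lt;code&gt;-N&lt;/code&gt; flag disables output buffering so each chunk prints as it arrives:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl -N http://127.0.0.1:8000/stream

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;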




&lt;h3&gt;
  
  
  &lt;strong&gt;Step 4: Connect the LLM with Transformers&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, let’s swap out the dummy messages for &lt;strong&gt;LLM-generated responses&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pipeline&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;llm_pipeline&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text-generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;llm_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;return_full_text&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;generated_text&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="n"&gt;asyncio&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mf"&gt;0.1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nd"&gt;@app.get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;generate_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;🔥 Test It:&lt;/p&gt;

&lt;p&gt;Run the server and visit:&lt;/p&gt;


&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;http://127.0.0.1:8000/stream?prompt=Once upon a time

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You’ll see the AI model stream the response live.&lt;/p&gt;
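
&lt;p&gt;One caveat: the &lt;code&gt;text-generation&lt;/code&gt; pipeline generates the whole completion before returning, so the loop above yields one large chunk rather than token-by-token output. For true incremental streaming, Transformers ships &lt;code&gt;TextIteratorStreamer&lt;/code&gt;, which you can swap into &lt;code&gt;generate_response&lt;/code&gt; while keeping the same &lt;code&gt;/stream&lt;/code&gt; route. Here's a minimal sketch (the &lt;code&gt;run_in_executor&lt;/code&gt; hop keeps the blocking queue reads off the event loop):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import asyncio
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

async def generate_response(prompt: str):
    # model.generate feeds the streamer from a background thread
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    inputs = tokenizer(prompt, return_tensors="pt")
    Thread(target=model.generate,
           kwargs={**inputs, "max_new_tokens": 50, "streamer": streamer}).start()

    loop = asyncio.get_running_loop()
    tokens = iter(streamer)
    while True:
        # next() blocks on an internal queue, so run it in a worker thread
        text = await loop.run_in_executor(None, next, tokens, None)
        if text is None:  # generation finished
            break
        yield f"data: {text}\n\n"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;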




&lt;h2&gt;
  
  
  &lt;strong&gt;4️⃣ Client-Side Integration: Consuming the Stream&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;On the front end, you can use &lt;strong&gt;EventSource&lt;/strong&gt; (a native browser API) to consume the stream.&lt;/p&gt;

&lt;p&gt;Here’s the simplest way to do it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!DOCTYPE html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&lt;/span&gt; &lt;span class="na"&gt;lang=&lt;/span&gt;&lt;span class="s"&gt;"en"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;LLM Streaming Demo&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;pre&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;"stream-output"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/pre&amp;gt;&lt;/span&gt;

  &lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;stream-output&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;eventSource&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;EventSource&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;http://127.0.0.1:8000/stream?prompt=Tell me a story&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="nx"&gt;eventSource&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;onmessage&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="nx"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;innerText&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;data&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;};&lt;/span&gt;
  &lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will display a live feed of the AI response on your webpage.&lt;/p&gt;
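
&lt;p&gt;If you prefer to sanity-check the stream outside the browser, a few lines of Python work too. This sketch assumes the &lt;code&gt;httpx&lt;/code&gt; package is installed (&lt;code&gt;pip install httpx&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import httpx

# Open the SSE endpoint and print each data line as it arrives
with httpx.stream("GET", "http://127.0.0.1:8000/stream",
                  params={"prompt": "Tell me a story"}, timeout=None) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            print(line[len("data: "):])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;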




&lt;h2&gt;
  
  
  &lt;strong&gt;5️⃣ Deploying Your Streaming AI App&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;You’ve got it working locally, but now you want to &lt;strong&gt;deploy it to the world&lt;/strong&gt;. Here’s how:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Step 1: Dockerize the App&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Create a file called &lt;code&gt;Dockerfile&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8

WORKDIR /app
COPY . /app

RUN pip install -r /app/requirements.txt

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
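
&lt;p&gt;The image installs whatever &lt;code&gt;requirements.txt&lt;/code&gt; lists. A plausible minimal file for this app (exact versions are up to you; &lt;code&gt;torch&lt;/code&gt; is the backend the pipeline uses by default):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;fastapi
uvicorn
transformers
torch

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;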



&lt;h3&gt;
  
  
  &lt;strong&gt;Step 2: Build and Run the Docker Image&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;docker build &lt;span class="nt"&gt;-t&lt;/span&gt; streaming-llm &lt;span class="nb"&gt;.&lt;/span&gt;
docker run &lt;span class="nt"&gt;-p&lt;/span&gt; 80:80 streaming-llm

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
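
&lt;p&gt;Once the container is up, the same test endpoint is available on port 80: &lt;code&gt;http://localhost/stream?prompt=Once upon a time&lt;/code&gt;.&lt;/p&gt;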






&lt;h2&gt;
  
  
  &lt;strong&gt;6️⃣ Conclusion: What’s Next?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Congratulations! 🎉 You’ve built a real-time LLM streaming app from scratch using &lt;strong&gt;FastAPI&lt;/strong&gt;, &lt;strong&gt;Transformers&lt;/strong&gt;, and &lt;strong&gt;Server-Sent Events&lt;/strong&gt;. Here’s what you’ve learned:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;How streaming works&lt;/strong&gt; (and why it matters).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to use FastAPI for streaming endpoints&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;How to stream LLM responses with Hugging Face Transformers&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Where to Go Next?&lt;/strong&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Optimize Your LLM&lt;/strong&gt;: Swap in a smaller, faster Hugging Face model such as DistilGPT2, or a larger, more capable one such as GPT-J.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Explore WebSockets&lt;/strong&gt;: For two-way streaming (not just server-to-client).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deploy to Cloud&lt;/strong&gt;: Deploy your app to AWS, GCP, or Heroku.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;🧠 Pro Tip: Add an interactive client-side UI, like a chat interface, to create your own mini ChatGPT!&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;With this guide, you're ready to &lt;strong&gt;level up your developer skills&lt;/strong&gt; and build interactive, AI-driven experiences. 🚀&lt;/p&gt;

&lt;p&gt;Want to learn more about building Responsive LLMs? Check out my course on newline: &lt;a href="https://www.newline.co/courses/responsive-llm-applications-with-server-sent-events" rel="noopener noreferrer"&gt;Responsive LLM Applications with Server-Sent Events&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to design systems for AI applications&lt;/li&gt;
&lt;li&gt;How to stream the answer of a Large Language Model&lt;/li&gt;
&lt;li&gt;Differences between Server-Sent Events and WebSockets&lt;/li&gt;
&lt;li&gt;Importance of real-time for GenAI UI&lt;/li&gt;
&lt;li&gt;How asynchronous programming in Python works&lt;/li&gt;
&lt;li&gt;How to integrate LangChain with FastAPI&lt;/li&gt;
&lt;li&gt;What problems Retrieval Augmented Generation can solve&lt;/li&gt;
&lt;li&gt;How to create an AI agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and much more.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>fastapi</category>
      <category>llm</category>
    </item>
    <item>
      <title>Integrating LangChain with FastAPI for Asynchronous Streaming</title>
      <dc:creator>Louis Sanna</dc:creator>
      <pubDate>Thu, 12 Dec 2024 15:55:17 +0000</pubDate>
      <link>https://dev.to/louis-sanna/integrating-langchain-with-fastapi-for-asynchronous-streaming-5d0o</link>
      <guid>https://dev.to/louis-sanna/integrating-langchain-with-fastapi-for-asynchronous-streaming-5d0o</guid>
      <description>&lt;p&gt;LangChain and FastAPI working in tandem provide a strong setup for the asynchronous streaming endpoints that LLM-integrated applications need. Modern chat applications live or die by how effectively they handle live data streams and how quickly they can respond.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to LangChain
&lt;/h2&gt;

&lt;p&gt;LangChain is a library that simplifies the incorporation of language models into applications. It provides an abstracted layer over various components such as large language models (LLMs), data retrievers, and vector storage solutions. This abstraction allows developers to integrate and switch between different backend providers or technologies seamlessly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to FastAPI
&lt;/h2&gt;

&lt;p&gt;FastAPI is a modern, fast (high-performance) web framework for building APIs with Python 3.7+ based on standard Python type hints. It is designed for creating RESTful APIs quickly and efficiently, with automatic interactive API documentation provided by Swagger UI and ReDoc.&lt;/p&gt;

&lt;h2&gt;
  
  
  Combining LangChain with FastAPI
&lt;/h2&gt;

&lt;p&gt;By combining LangChain with FastAPI, developers can create robust, asynchronous streaming APIs that handle real-time data efficiently. This integration is particularly useful for applications that require live updates, such as chat applications or real-time analytics dashboards.&lt;/p&gt;

&lt;h2&gt;
  
  
  Setting Up the FastAPI Project
&lt;/h2&gt;

&lt;p&gt;First, install the necessary packages:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;fastapi langchain pydantic uvicorn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, define the FastAPI router and Pydantic models to structure and validate incoming messages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;APIRouter&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FastAPI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;router&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;APIRouter&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ChatPayload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;List&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Config&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;schema_extra&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;example&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Who are you?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating the Streaming API Endpoint
&lt;/h2&gt;

&lt;p&gt;Create an endpoint to receive chat messages and stream responses back to the client using LangChain. This is done by emitting server-sent events (SSE).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;fastapi.responses&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StreamingResponse&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="nd"&gt;@router.post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/api/completion&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;request&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Request&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatPayload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;chat&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nc"&gt;StreamingResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;send_completion_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;media_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text/event-stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;send_completion_events&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;astream_log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;patch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ops&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;op&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;add&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;path&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/streamed_output/-&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="n"&gt;op&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;value&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
                &lt;span class="n"&gt;json_dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;llm_chunk&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="n"&gt;json_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json_dict&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;json_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="se"&gt;\\&lt;/span&gt;&lt;span class="s"&gt;n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;include_router&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;router&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the FastAPI Application
&lt;/h2&gt;

&lt;p&gt;Run the FastAPI application using Uvicorn, an ASGI server implementation for Python.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;uvicorn main:app --reload
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Navigate to &lt;a href="http://127.0.0.1:8000/docs" rel="noopener noreferrer"&gt;http://127.0.0.1:8000/docs&lt;/a&gt; to see the interactive API documentation generated by FastAPI. This documentation provides an easy way to test the API endpoints.&lt;/p&gt;
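
&lt;p&gt;Note that &lt;code&gt;EventSource&lt;/code&gt; only supports GET requests, so to exercise this POST endpoint outside the docs page you can stream it from a small Python client instead. A sketch assuming &lt;code&gt;httpx&lt;/code&gt; is installed (&lt;code&gt;pip install httpx&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import httpx

payload = {"messages": [{"role": "user", "content": "Who are you?"}]}

# POST the chat payload and print each SSE chunk as it arrives
with httpx.stream("POST", "http://127.0.0.1:8000/api/completion",
                  json=payload, timeout=None) as response:
    for line in response.iter_lines():
        if line.startswith("data: "):
            print(line[len("data: "):])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;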

&lt;h2&gt;
  
  
  Why JSON Patch?
&lt;/h2&gt;

&lt;p&gt;LangChain’s &lt;code&gt;astream_log&lt;/code&gt; method streams its output as a sequence of JSON Patch operations, so a working knowledge of JSON Patch is essential for implementing this integration. JSON Patch provides an efficient way to update parts of a JSON document incrementally, without resending the entire document. This is particularly useful in real-time applications where data changes frequently.&lt;/p&gt;

&lt;h2&gt;
  
  
  Brief Overview of JSON Patch
&lt;/h2&gt;

&lt;p&gt;JSON Patch defines six operation types:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add: Inserts a new value into the JSON document at the specified path.&lt;/li&gt;
&lt;li&gt;Remove: Removes the value at the specified path.&lt;/li&gt;
&lt;li&gt;Replace: Replaces the value at the specified path with a new value.&lt;/li&gt;
&lt;li&gt;Move: Moves a value from one path to another in the document.&lt;/li&gt;
&lt;li&gt;Copy: Copies a value from one path to another.&lt;/li&gt;
&lt;li&gt;Test: Checks that the value at the specified path equals a given value.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider the original document:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"baz"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"qux"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"foo"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"bar"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Applying the patch:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"replace"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/baz"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boo"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"add"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/hello"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"world"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"remove"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"path"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"/foo"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Results in:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"baz"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"boo"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"hello"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"world"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;JSON Patch allows for efficient, incremental updates, making it ideal for applications that require frequent or real-time updates to their data.&lt;/p&gt;
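
&lt;p&gt;You can reproduce this result programmatically with the &lt;code&gt;jsonpatch&lt;/code&gt; package (an assumption here; install it with &lt;code&gt;pip install jsonpatch&lt;/code&gt;):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import jsonpatch

doc = {"baz": "qux", "foo": "bar"}
patch = [
    {"op": "replace", "path": "/baz", "value": "boo"},
    {"op": "add", "path": "/hello", "value": ["world"]},
    {"op": "remove", "path": "/foo"},
]

# apply_patch returns a new document with all three operations applied
print(jsonpatch.apply_patch(doc, patch))
# {'baz': 'boo', 'hello': ['world']}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;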

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;By integrating LangChain with FastAPI, developers can build efficient asynchronous streaming APIs capable of handling real-time data. This setup is ideal for applications like chatbots, where timely responses and data processing are crucial. FastAPI's ease of use and LangChain's abstraction capabilities, combined with the efficiency of JSON Patch, make this combination a powerful tool for modern web development.&lt;/p&gt;

&lt;p&gt;Want to learn more about building Responsive LLMs? Check out my course on newline: &lt;a href="https://www.newline.co/courses/responsive-llm-applications-with-server-sent-events" rel="noopener noreferrer"&gt;Responsive LLM Applications with Server-Sent Events&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to design systems for AI applications&lt;/li&gt;
&lt;li&gt;How to stream the answer of a Large Language Model&lt;/li&gt;
&lt;li&gt;Differences between Server-Sent Events and WebSockets&lt;/li&gt;
&lt;li&gt;Importance of real-time for GenAI UI&lt;/li&gt;
&lt;li&gt;How asynchronous programming in Python works&lt;/li&gt;
&lt;li&gt;How to integrate LangChain with FastAPI&lt;/li&gt;
&lt;li&gt;What problems Retrieval Augmented Generation can solve&lt;/li&gt;
&lt;li&gt;How to create an AI agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;...and much more.&lt;/p&gt;

&lt;p&gt;Worth checking out if you want to build your own LLM applications: the course provides extensive code examples and helps you go from concept to deployment.&lt;/p&gt;

</description>
      <category>langchain</category>
      <category>fastapi</category>
      <category>llm</category>
    </item>
  </channel>
</rss>
