<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Bato</title>
    <description>The latest articles on DEV Community by Bato (@kaniel_outis).</description>
    <link>https://dev.to/kaniel_outis</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3761117%2F99cec290-03c2-4e65-8a70-546ce8545dbb.jpeg</url>
      <title>DEV Community: Bato</title>
      <link>https://dev.to/kaniel_outis</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kaniel_outis"/>
    <language>en</language>
    <item>
      <title>How Well Can OCR Read Doctor Handwriting in 2026?</title>
      <dc:creator>Bato</dc:creator>
      <pubDate>Tue, 07 Apr 2026 07:18:06 +0000</pubDate>
      <link>https://dev.to/kaniel_outis/how-well-can-ocr-read-doctor-handwriting-in-2026-54hn</link>
      <guid>https://dev.to/kaniel_outis/how-well-can-ocr-read-doctor-handwriting-in-2026-54hn</guid>
      <description>&lt;p&gt;&lt;em&gt;Benchmarking four open-source OCR engines on 5,578 handwritten medical prescriptions&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;PP-OCRv5 (5M parameters) and GLM-OCR (0.9B parameters) both achieve 20%+ exact match on handwritten prescriptions, a 10x jump over Tesseract and EasyOCR&lt;/li&gt;
&lt;li&gt;GLM-OCR leads on character accuracy (CER 0.328), while PP-OCRv5 leads on word accuracy (WER 0.789)&lt;/li&gt;
&lt;li&gt;A 5M-parameter model trained on curated data rivals a 900M-parameter vision-language model&lt;/li&gt;
&lt;li&gt;Neither engine is clinically deployable yet: even the best gets only 1 in 3 words exactly right&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;




&lt;p&gt;Last month I spent some time squinting at prescription scans, trying to figure out if a doctor wrote &lt;em&gt;Amoxicillin&lt;/em&gt; or &lt;em&gt;Amitriptyline&lt;/em&gt;. I got it wrong twice. That got me wondering: how would today's OCR engines handle this?&lt;/p&gt;

&lt;p&gt;The stakes are real. Medication errors injure approximately 1.3 million people annually in the United States alone and cost an estimated $42 billion globally (&lt;a href="https://www.who.int/news/item/29-03-2017-who-launches-global-effort-to-halve-medication-related-errors-in-5-years" rel="noopener noreferrer"&gt;WHO&lt;/a&gt;, 2017). Illegible handwriting is a well-documented contributor: 35.7% of handwritten prescriptions contain errors, compared to just 2.5% of electronic ones (&lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4281619/" rel="noopener noreferrer"&gt;Albarrak et al.&lt;/a&gt;, 2014).&lt;/p&gt;

&lt;p&gt;A study of 4,183 prescriptions found that 10.21% were illegible and a further 19.39% barely legible (&lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC10686667/" rel="noopener noreferrer"&gt;Albalushi et al.&lt;/a&gt;, 2023). The global OCR market is projected to reach $32.9 billion by 2030, growing at a 14.8% CAGR (&lt;a href="https://www.grandviewresearch.com/industry-analysis/optical-character-recognition-market" rel="noopener noreferrer"&gt;Grand View Research&lt;/a&gt;, 2025), with healthcare among its fastest-growing verticals. But I couldn't find a public benchmark comparing today's open-source OCR engines on handwritten medical text.&lt;/p&gt;

&lt;p&gt;So I ran one myself. I tested four engines on &lt;strong&gt;5,578 handwritten prescription word images&lt;/strong&gt;, and the results surprised me.&lt;/p&gt;




&lt;h2&gt;Who Are the Four Contenders?&lt;/h2&gt;

&lt;p&gt;These four engines span three generations of OCR thinking, from traditional pattern matching to specialized deep learning to generative vision-language models.&lt;/p&gt;

&lt;h3&gt;Tesseract: The Veteran&lt;/h3&gt;

&lt;p&gt;Google-backed and over 18 years old, Tesseract is the default OCR engine for a generation of developers. It uses an LSTM-based architecture designed primarily for printed text. Stable, well-documented, and runs everywhere, but handwritten cursive is not its strength.&lt;/p&gt;

&lt;h3&gt;EasyOCR: The Accessible One&lt;/h3&gt;

&lt;p&gt;Built on a CRNN (Convolutional Recurrent Neural Network) architecture with roughly &lt;strong&gt;10 million parameters&lt;/strong&gt;, EasyOCR's selling point is simplicity: &lt;code&gt;pip install easyocr&lt;/code&gt; and you're recognizing text in 80+ languages. It uses deep learning but remains a traditional detection-recognition pipeline.&lt;/p&gt;

&lt;h3&gt;PP-OCRv5: The Data-Centric Specialist&lt;/h3&gt;

&lt;p&gt;Baidu's latest, with just &lt;strong&gt;5 million parameters&lt;/strong&gt;. PP-OCRv5 uses an SVTR_LCNet architecture with a Guided Training of CTC (GTC) strategy. The real innovation isn't the architecture, though. It's the training data.&lt;/p&gt;

&lt;p&gt;The PP-OCRv5 paper (&lt;a href="https://arxiv.org/abs/2603.24373" rel="noopener noreferrer"&gt;Cui et al.&lt;/a&gt;, 2026) shows that &lt;em&gt;data quality trumps model scale&lt;/em&gt;. They curated &lt;strong&gt;22.6 million training samples&lt;/strong&gt; by filtering along three dimensions. First, &lt;strong&gt;difficulty&lt;/strong&gt;: they use model confidence as a proxy and found that samples in the [0.95, 0.97] range hit a sweet spot, hard enough to teach the model something new, but not so hard that the labels are unreliable. Second, &lt;strong&gt;accuracy&lt;/strong&gt;: they cross-check predictions against labels to weed out mislabeled samples. Third, &lt;strong&gt;diversity&lt;/strong&gt;: they cluster training images into 1,000 visual groups using CLIP embeddings and ensure each cluster is represented. Together, these filters yielded 2-3x improvements in handwritten recognition from v3 to v5 without changing the model architecture.&lt;/p&gt;
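The difficulty filter is easy to picture in code. Here is a minimal sketch of the idea, assuming a list of samples carrying recognizer confidences; the function name, filenames, labels, and confidence values are all illustrative, not from the paper:

```python
# Sketch of PP-OCRv5-style difficulty filtering: keep samples whose
# recognizer confidence falls in the [0.95, 0.97] "sweet spot":
# hard enough to be informative, confident enough to trust the label.
def in_difficulty_window(confidence, low=0.95, high=0.97):
    return low <= confidence <= high

# Illustrative samples (filenames, labels, and confidences are made up).
samples = [
    {"image": "rx_001.png", "label": "Amoxicillin", "conf": 0.99},  # too easy
    {"image": "rx_002.png", "label": "Metformin",   "conf": 0.96},  # in window
    {"image": "rx_003.png", "label": "Ceftriaxone", "conf": 0.80},  # label unreliable
]

curated = [s for s in samples if in_difficulty_window(s["conf"])]
```

In this toy run only the 0.96-confidence sample survives; the real pipeline applies this alongside the accuracy and diversity filters described above.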

&lt;h3&gt;GLM-OCR: The Compact Vision-Language Model&lt;/h3&gt;

&lt;p&gt;From Zhipu AI and Tsinghua University, GLM-OCR takes a fundamentally different approach. It's a &lt;strong&gt;0.9-billion-parameter&lt;/strong&gt; multimodal model combining a 0.4B CogViT vision encoder with a 0.5B GLM language decoder (&lt;a href="https://arxiv.org/abs/2603.10910" rel="noopener noreferrer"&gt;Duan et al.&lt;/a&gt;, 2026). Rather than traditional CTC or attention-based sequence recognition, it &lt;em&gt;generates&lt;/em&gt; text autoregressively, like a language model that reads images.&lt;/p&gt;

&lt;p&gt;An important note: 0.9B is &lt;strong&gt;compact&lt;/strong&gt; for a vision-language model. For comparison, Qwen3-VL has 235 billion parameters and GPT-4o is even larger. GLM-OCR was designed for efficiency, using Multi-Token Prediction (MTP) to generate approximately 5.2 tokens per decoding step, yielding a roughly 50% throughput improvement over standard autoregressive generation. It's trained through a 4-stage pipeline that includes supervised fine-tuning and GRPO reinforcement learning.&lt;/p&gt;

&lt;p&gt;These four engines represent a clear spectrum: &lt;strong&gt;traditional&lt;/strong&gt; (Tesseract), &lt;strong&gt;specialized deep learning&lt;/strong&gt; (EasyOCR, PP-OCRv5), and &lt;strong&gt;generative VLM&lt;/strong&gt; (GLM-OCR). &lt;/p&gt;




&lt;h2&gt;What Makes the RxHandBD Dataset So Hard?&lt;/h2&gt;

&lt;p&gt;We use &lt;strong&gt;RxHandBD&lt;/strong&gt; (&lt;a href="https://data.mendeley.com/datasets" rel="noopener noreferrer"&gt;Shovon et al.&lt;/a&gt;, Mendeley Data), a dataset of 5,578 cropped word images extracted from handwritten medical prescriptions written by doctors in Bangladesh. Each image contains a single word with a corresponding ground-truth label.&lt;/p&gt;

&lt;p&gt;This dataset is hard for four reasons:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Doctor's handwriting.&lt;/strong&gt; Enough said.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Medical terminology.&lt;/strong&gt; Drug names like "Amoxicillin" and "Metformin" alongside dosage notations like "5% dns" and "1+0+1."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Mixed language.&lt;/strong&gt; English medical terms interspersed with Bangla script.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Inconsistent quality.&lt;/strong&gt; Varying paper backgrounds, pen types, and image capture conditions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here are a few samples to give you a sense of the difficulty range:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjtlg5pbxm179b0val0h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjtlg5pbxm179b0val0h.png" alt="OCR Samples" width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;From top to bottom: a drug name both modern engines nail ("Ronem"), cases where only one engine succeeds ("Vineet" and "Eylox"), and a close call where GLM-OCR gets closest but still misses ("Ambrox").&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;One important methodological note: these are &lt;strong&gt;pre-cropped word-level images&lt;/strong&gt;. We're isolating the &lt;strong&gt;recognition&lt;/strong&gt; half of the OCR pipeline, not testing detection (locating text regions on a full page). This means our results reflect recognition accuracy only. Real-world performance also depends on how well each engine detects text regions before recognizing them.&lt;/p&gt;




&lt;h2&gt;How Did We Run the Benchmark?&lt;/h2&gt;

&lt;h3&gt;Metrics&lt;/h3&gt;

&lt;p&gt;We evaluate using four complementary metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CER (Character Error Rate):&lt;/strong&gt; Edit distance between predicted and reference strings, normalized by reference length. If the label says "Amoxicillin" (11 characters) and the OCR outputs "Amoxicilin" (1 deletion), the CER is 1/11 = 0.09. Lower is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;WER (Word Error Rate):&lt;/strong&gt; Same concept at the word level. Any mistake in a word counts the entire word as wrong. Lower is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exact Match Rate:&lt;/strong&gt; The strictest metric. Did the OCR output match the ground truth character-for-character after normalization? Higher is better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency:&lt;/strong&gt; Wall-clock milliseconds per image, measuring practical throughput.&lt;/li&gt;
&lt;/ul&gt;
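To make the metrics concrete, here is a minimal pure-Python sketch of CER and exact match, using the worked example above. The normalization step (lowercasing plus whitespace stripping) is my assumption; the benchmark only says "after normalization":

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic-programming table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if chars match)
            ))
        prev = curr
    return prev[-1]

def cer(prediction: str, reference: str) -> float:
    """Character Error Rate: edit distance normalized by reference length."""
    return edit_distance(prediction, reference) / len(reference)

def exact_match(prediction: str, reference: str) -> bool:
    """Strict equality after a simple (assumed) normalization."""
    return prediction.strip().lower() == reference.strip().lower()

# Worked example from the text: one deleted 'l' in an 11-character word.
assert round(cer("Amoxicilin", "Amoxicillin"), 2) == 0.09
```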

&lt;h3&gt;Setup&lt;/h3&gt;

&lt;p&gt;All engines run on an Apple Silicon Mac (CPU only). Single run. Default configurations for each engine with no fine-tuning or domain-specific adjustments. PP-OCRv5 runs its full detection + recognition pipeline even on pre-cropped images. GLM-OCR processes each image with the prompt "Text Recognition:".&lt;/p&gt;
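The harness itself isn't published with the article, so here is a minimal sketch of what such a loop might look like, with the engine call stubbed out (any real engine would be dropped in as the `recognize` callable; the lowercase-and-strip exact-match normalization is an assumption):

```python
import time

def run_benchmark(images, labels, recognize):
    """Run one engine over pre-cropped word images, recording per-image
    wall-clock latency and the exact-match rate."""
    latencies_ms, exact = [], 0
    for img, label in zip(images, labels):
        t0 = time.perf_counter()
        pred = recognize(img)  # engine-specific call goes here
        latencies_ms.append((time.perf_counter() - t0) * 1000)
        exact += pred.strip().lower() == label.strip().lower()
    return {
        "exact_match_rate": exact / len(labels),
        "mean_latency_ms": sum(latencies_ms) / len(latencies_ms),
    }

# Stub standing in for Tesseract / EasyOCR / PP-OCRv5 / GLM-OCR.
stats = run_benchmark(["a.png", "b.png"], ["Ronem", "Eylox"],
                      recognize=lambda img: "Ronem")
```

The stub gets one of two words right, so `exact_match_rate` comes out at 0.5; swapping in a real engine only changes the `recognize` argument.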




&lt;h2&gt;What Do the Numbers Say?&lt;/h2&gt;

&lt;p&gt;GLM-OCR achieves the lowest character error rate at 0.328, while PP-OCRv5 leads on word-level accuracy with a WER of 0.789. Both modern engines dramatically outperform the older generation, with exact match rates 8-13x higher than Tesseract or EasyOCR.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;Type&lt;/th&gt;
&lt;th&gt;Parameters&lt;/th&gt;
&lt;th&gt;CER (lower=better)&lt;/th&gt;
&lt;th&gt;WER (lower=better)&lt;/th&gt;
&lt;th&gt;Exact Match (higher=better)&lt;/th&gt;
&lt;th&gt;Latency (ms/img)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract&lt;/td&gt;
&lt;td&gt;LSTM&lt;/td&gt;
&lt;td&gt;n/a&lt;/td&gt;
&lt;td&gt;0.785&lt;/td&gt;
&lt;td&gt;1.043&lt;/td&gt;
&lt;td&gt;2.5%&lt;/td&gt;
&lt;td&gt;49&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EasyOCR&lt;/td&gt;
&lt;td&gt;CRNN&lt;/td&gt;
&lt;td&gt;~10M&lt;/td&gt;
&lt;td&gt;0.695&lt;/td&gt;
&lt;td&gt;1.074&lt;/td&gt;
&lt;td&gt;2.6%&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;PP-OCRv5&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;SVTR+CTC&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;5M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.477&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.789&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;21.4%&lt;/td&gt;
&lt;td&gt;103&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;GLM-OCR&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;VLM&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.9B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;0.328&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0.801&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;32.6%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;141&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7aqwpajca70pmcs3hjl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff7aqwpajca70pmcs3hjl.png" alt="CER Comparison" width="800" height="476"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueyrt23geii16juchya0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fueyrt23geii16juchya0.png" alt="Accuracy vs Latency" width="800" height="598"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me walk through what these numbers actually mean.&lt;/p&gt;

&lt;h3&gt;Finding 1: Tesseract and EasyOCR Barely Function on Handwriting&lt;/h3&gt;

&lt;p&gt;Both Tesseract and EasyOCR achieve under &lt;strong&gt;3% exact match&lt;/strong&gt;, meaning they get only about 1 in 40 words perfectly right. Their WER exceeds 1.0, which means on average they produce &lt;em&gt;more errors than there are words&lt;/em&gt;. For practical purposes, these engines are unusable on handwritten medical text.&lt;/p&gt;

&lt;p&gt;This isn't a knock on either project. Tesseract's LSTM architecture was optimized for printed text, and EasyOCR's CRNN similarly excels on clean, well-formatted inputs. Handwritten medical cursive is simply a different problem.&lt;/p&gt;

&lt;h3&gt;Finding 2: The Modern Engines Are a Generational Leap&lt;/h3&gt;

&lt;p&gt;PP-OCRv5 and GLM-OCR both break the 20% exact match barrier, a qualitative jump from the sub-3% performance of the older engines. The gap between "old" and "new" (a roughly 10x improvement in exact match) is far larger than the gap between PP-OCRv5 and GLM-OCR themselves.&lt;/p&gt;

&lt;h3&gt;Finding 3: Why Does GLM-OCR Win on Characters but Lose on Words?&lt;/h3&gt;

&lt;p&gt;This is the most interesting finding. GLM-OCR achieves a CER of &lt;strong&gt;0.328&lt;/strong&gt;, which is 31% lower than PP-OCRv5's 0.477. At the character level, the vision-language approach genuinely helps. GLM-OCR can use its language decoder's knowledge of likely character sequences to infer partially visible characters.&lt;/p&gt;

&lt;p&gt;But PP-OCRv5 edges ahead on WER: &lt;strong&gt;0.789 vs 0.801&lt;/strong&gt;. It makes fewer word-level mistakes.&lt;/p&gt;

&lt;p&gt;Why the divergence? My hypothesis: GLM-OCR's autoregressive generation occasionally produces subtle extra tokens or formatting variations. The GLM-OCR paper itself acknowledges "minor stochastic variation in formatting behaviors, particularly in line breaks and whitespace handling" (&lt;a href="https://arxiv.org/abs/2603.10910" rel="noopener noreferrer"&gt;Duan et al.&lt;/a&gt;, 2026). These small artifacts barely affect CER but can flip a word from "correct" to "incorrect" in WER/exact-match scoring.&lt;/p&gt;

&lt;p&gt;The practical takeaway: &lt;strong&gt;which engine is "better" depends on your error metric.&lt;/strong&gt; If you care about getting as close as possible character-by-character (for downstream spell-correction, for example), GLM-OCR wins. If you need clean word-level outputs with minimal postprocessing, PP-OCRv5 has the edge.&lt;/p&gt;

&lt;h3&gt;Finding 4: What Are the Real Deployment Trade-offs?&lt;/h3&gt;

&lt;p&gt;Both engines are fast enough for practical use. PP-OCRv5 processes images at &lt;strong&gt;103ms each&lt;/strong&gt; and GLM-OCR at &lt;strong&gt;141ms&lt;/strong&gt;. GLM-OCR's Multi-Token Prediction (generating roughly 5.2 tokens per step instead of one) keeps its inference speed competitive despite the larger model.&lt;/p&gt;

&lt;p&gt;The bigger difference is &lt;strong&gt;model size and memory footprint&lt;/strong&gt;. PP-OCRv5's 5M parameters take up roughly 20MB on disk, small enough for a Raspberry Pi or embedded device. GLM-OCR's 0.9B parameters need around 1.8GB at FP16 (less with quantization). Both run on CPU without a GPU, as our Apple Silicon benchmark shows, but GLM-OCR consumes significantly more RAM. The GLM-OCR paper notes that the model "enables deployment in both large-scale and resource-constrained edge scenarios" and supports frameworks like Ollama for local inference (&lt;a href="https://arxiv.org/abs/2603.10910" rel="noopener noreferrer"&gt;Duan et al.&lt;/a&gt;, 2026). For deploying across thousands of low-spec workstations, PP-OCRv5's small footprint is a clear advantage. For a centralized server or any machine with a few GB of RAM to spare, GLM-OCR is equally practical.&lt;/p&gt;




&lt;h2&gt;What Do the Predictions Actually Look Like?&lt;/h2&gt;

&lt;p&gt;Numbers tell one story. Seeing the actual predictions tells another. Here are four representative samples from the benchmark, hand-picked to illustrate each finding above.&lt;/p&gt;

&lt;h3&gt;GLM-OCR Nails a Complex Drug Name&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff50vz8qzjypfxds957f8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ff50vz8qzjypfxds957f8.png" alt="GLM-OCR nails " width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Ceftriaxone" (a common antibiotic), 11 characters long. Tesseract reads "Wau" and EasyOCR produces "@@FNRH," both useless. PP-OCRv5 gets close with "CEFRAXONE" (CER=0.18, missing the 'ti'). Only GLM-OCR reads it perfectly. On long drug names, the language decoder's knowledge of likely character sequences gives it a real edge.&lt;/p&gt;

&lt;h3&gt;When PP-OCRv5 Is Closer and GLM-OCR Hallucinates&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzhur0d9njtprem6cvds.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhzhur0d9njtprem6cvds.png" alt="PP-OCRv5 closer on " width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Zolfin" (a proton pump inhibitor). PP-OCRv5 reads "zolfiu" (CER=0.17, just the last character wrong). GLM-OCR outputs "2016'u" (CER=1.0), completely misreading the word as a number. This is the flip side of the VLM approach: when the handwriting doesn't match patterns in the training data, the language model can steer the output in the wrong direction entirely.&lt;/p&gt;

&lt;h3&gt;GLM-OCR Nearly Perfect&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45x5mr6cgpg2r54q6xmy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F45x5mr6cgpg2r54q6xmy.png" alt="GLM-OCR nearly perfect on " width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Pantonix" (a pantoprazole brand). GLM-OCR outputs "Pantomix" (CER=0.125, one character off). PP-OCRv5 gets "Pantomy" (CER=0.375). Both are close, but GLM-OCR is three times more accurate by CER. Neither gets an exact match, which is the pattern behind Finding 3: GLM-OCR consistently gets closer character-by-character, even when neither engine gets the word exactly right.&lt;/p&gt;

&lt;h3&gt;When the VLM Goes Off the Rails&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fsukpn1xj6tw0jve4c7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8fsukpn1xj6tw0jve4c7.png" alt="GLM-OCR outputs LaTeX for " width="800" height="206"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;"Fexo" (fexofenadine, an antihistamine). EasyOCR reads "Texo" (CER=0.25) and PP-OCRv5 reads "-exo" (CER=0.25), both reasonable attempts. GLM-OCR outputs LaTeX math notation: &lt;code&gt;$ \sqrt{e} x_{0} $&lt;/code&gt; (CER=4.75). This is a rare but real failure mode of generative VLMs: the model interprets the handwriting as a math expression instead of text. It produced more characters than the ground truth has, which is why CER exceeds 1.0.&lt;/p&gt;




&lt;h2&gt;What Are the Limitations?&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Single dataset, single run.&lt;/strong&gt; RxHandBD is one dataset of Bangladeshi prescriptions. US, European, or East Asian handwriting styles may produce different rankings. We don't have confidence intervals from multiple runs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Word-level only.&lt;/strong&gt; We tested recognition on pre-cropped word images, not full-page detection + recognition. Real-world performance depends on the complete pipeline.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU-only.&lt;/strong&gt; All engines ran on CPU. GPU acceleration could significantly change the latency picture, particularly for GLM-OCR.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Default configs.&lt;/strong&gt; No engine was fine-tuned on medical data. Domain-specific adaptation could improve all of them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No clinical validation.&lt;/strong&gt; OCR accuracy and clinical safety are different things. A 32.6% exact match rate is impressive for research, but not nearly sufficient for automated prescription processing without human review.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;Conclusion&lt;/h2&gt;

&lt;p&gt;Modern OCR has made a genuine leap on handwritten medical text. &lt;strong&gt;PP-OCRv5&lt;/strong&gt; (5M parameters, best word-level accuracy) and &lt;strong&gt;GLM-OCR&lt;/strong&gt; (0.9B parameters, best character-level accuracy) both dramatically outperform Tesseract and EasyOCR.&lt;/p&gt;

&lt;p&gt;The two champions represent fundamentally different design philosophies: a data-centric specialized pipeline vs. a compact vision-language model. Yet they arrive at remarkably similar performance levels. Both are open-source and practically deployable.&lt;/p&gt;

&lt;p&gt;For practitioners building healthcare OCR systems: these two engines deserve serious evaluation. Start with your specific error tolerance, hardware constraints, and whether you need word-level recognition or full-page document understanding.&lt;/p&gt;

&lt;p&gt;For researchers: this is a domain with high clinical impact and, as these results show, plenty of room for improvement. Even the best engine here gets only 1 in 3 words exactly right on doctor handwriting. There's real work left to do.&lt;/p&gt;

&lt;p&gt;What datasets or engines should I test next? Let me know in the comments.&lt;/p&gt;




&lt;h2&gt;References&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;World Health Organization. "WHO launches global effort to halve medication-related errors in 5 years." March 29, 2017. &lt;a href="https://www.who.int/news/item/29-03-2017-who-launches-global-effort-to-halve-medication-related-errors-in-5-years" rel="noopener noreferrer"&gt;https://www.who.int/news/item/29-03-2017-who-launches-global-effort-to-halve-medication-related-errors-in-5-years&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Albarrak AI, Al Rashidi EA, Fatani RK, Al Ageel SI, Mohammed R. "Assessment of legibility and completeness of handwritten and electronic prescriptions." &lt;em&gt;Saudi Pharmaceutical Journal&lt;/em&gt;, 2014. &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC4281619/" rel="noopener noreferrer"&gt;https://pmc.ncbi.nlm.nih.gov/articles/PMC4281619/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Albalushi AK, et al. "Assessment of Legibility of Handwritten Prescriptions and Adherence to W.H.O. Prescription Writing Guidelines." &lt;em&gt;J. of Pharmaceutical Research International&lt;/em&gt;, 2023. &lt;a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC10686667/" rel="noopener noreferrer"&gt;https://pmc.ncbi.nlm.nih.gov/articles/PMC10686667/&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Cui C, Zhang Y, Sun T, et al. "PP-OCRv5: A Specialized 5M-Parameter Model Rivaling Billion-Parameter Vision-Language Models on OCR Tasks." arXiv:2603.24373, March 2026. &lt;a href="https://arxiv.org/abs/2603.24373" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2603.24373&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Duan S, Xue Y, Wang W, et al. "GLM-OCR Technical Report." arXiv:2603.10910, March 2026. &lt;a href="https://arxiv.org/abs/2603.10910" rel="noopener noreferrer"&gt;https://arxiv.org/abs/2603.10910&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Shovon MSH, et al. "RxHandBD: A Handwritten Prescription Recognition Dataset from Bangladesh." &lt;em&gt;Mendeley Data&lt;/em&gt;. &lt;a href="https://data.mendeley.com/datasets" rel="noopener noreferrer"&gt;https://data.mendeley.com/datasets&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Grand View Research. "Optical Character Recognition Market Analysis." 2025. &lt;a href="https://www.grandviewresearch.com/industry-analysis/optical-character-recognition-market" rel="noopener noreferrer"&gt;https://www.grandviewresearch.com/industry-analysis/optical-character-recognition-market&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;Benchmark conducted independently. No affiliation with any OCR project. All engines evaluated using default configurations.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;About the author:&lt;/strong&gt; Botao Deng is an ML/AI engineer and researcher who builds and evaluates production models. &lt;a href="https://github.com/robot010" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>ocr</category>
      <category>medical</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>To the Programmer Quietly Drowning in AI Anxiety</title>
      <dc:creator>Bato</dc:creator>
      <pubDate>Sun, 22 Feb 2026 07:22:04 +0000</pubDate>
      <link>https://dev.to/kaniel_outis/to-the-programmer-quietly-drowning-in-ai-anxiety-42pm</link>
      <guid>https://dev.to/kaniel_outis/to-the-programmer-quietly-drowning-in-ai-anxiety-42pm</guid>
      <description>&lt;p&gt;A quiet word for those who feel like they’re falling behind&lt;/p&gt;




&lt;p&gt;Let me guess how your morning went.&lt;/p&gt;

&lt;p&gt;You opened your phone, scrolled through some tech feed: Twitter, Hacker News, Reddit, whatever your poison is, and within thirty seconds, you saw someone claim they built an entire SaaS product over the weekend using nothing but prompts and vibes. Then you saw a thread about a new model that makes the one you just learned obsolete. Then a CEO somewhere declared that software engineers have maybe five good years left.&lt;/p&gt;

&lt;p&gt;You put your phone down. You picked up your coffee. And somewhere between the first sip and the second, a familiar knot tightened in your chest.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Am I falling behind? Should I be doing more? Is everything I've built going to be worthless?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Yeah. I know that feeling. I want to talk about it.&lt;/p&gt;




&lt;h2&gt;The Treadmill That Never Stops&lt;/h2&gt;

&lt;p&gt;The pace right now is genuinely absurd. It’s not just fast, it’s &lt;em&gt;disorienting&lt;/em&gt;. Last month’s breakthrough is this month’s footnote. You barely finish a tutorial on one framework before the community has already moved on to something shinier. The vocabulary alone is exhausting: RAG, LoRA, Agents, MCP, function calling, each one demanding your attention like a toddler pulling at your sleeve.&lt;/p&gt;

&lt;p&gt;And the showcase culture makes it worse. Every feed is a highlight reel. Everyone seems to be shipping, building, launching. Nobody posts about the afternoon they spent confused, reading the same documentation page four times. Nobody talks about the tools they tried that turned out to be useless.&lt;/p&gt;

&lt;p&gt;It creates this illusion that there's a speeding train, and everyone is on it except you.&lt;/p&gt;




&lt;h2&gt;But Here's What I've Learned From Watching a Few Trains Go By&lt;/h2&gt;

&lt;p&gt;I've been in tech long enough to remember when the shift from classical machine learning to deep learning felt like the sky was falling. People who had spent a decade perfecting feature engineering, tuning gradient-boosted trees, building meticulous pipelines — they woke up one day and the entire conversation had moved to neural networks. A decade of expertise suddenly felt quaint.&lt;/p&gt;

&lt;p&gt;Then deep learning itself went through its own upheavals. CNNs gave way to RNNs, then LSTMs, then attention mechanisms, then Transformers swallowed everything whole. At each turn, someone's specialty became a paragraph in a history chapter.&lt;/p&gt;

&lt;p&gt;Then came BERT, then GPT, and suddenly pre-training plus fine-tuning was the only game in town. Another reshuffling. Another wave of existential dread.&lt;/p&gt;

&lt;p&gt;You know what I noticed, though? The people who came through all of that, the ones who are still here and still relevant, they weren't the ones who had the best grip on any single technology. They were the ones who had learned how to &lt;em&gt;learn&lt;/em&gt;. They had developed a kind of peripheral vision for change: the ability to sense what mattered, what was temporary, and when to invest their energy.&lt;/p&gt;

&lt;p&gt;That skill set doesn't expire.&lt;/p&gt;




&lt;h2&gt;Not Every Wave Deserves Your Weekend&lt;/h2&gt;

&lt;p&gt;Here's something nobody tells you when you're in the thick of it: the shelf life of most technical hype is shockingly short. The vast majority of tools, frameworks, and paradigms that seem world-ending today will be footnotes in two years. Some of them will be footnotes in six months.&lt;/p&gt;

&lt;p&gt;This doesn't mean none of it matters. It means &lt;em&gt;not all of it matters equally&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;And if you try to sprint after every single thing, if you treat every new announcement as a personal emergency, you will burn out. That's not a motivational cliché. It's a mechanical fact. Human beings are not designed to sustain a permanent state of urgency.&lt;/p&gt;

&lt;p&gt;The more useful discipline isn't relentless pursuit. It's &lt;em&gt;discernment&lt;/em&gt;. Learning to sit with the noise long enough to separate the signal. Asking: is this a real shift in how problems get solved, or is this just a new coat of paint on an old idea? Is this changing the &lt;em&gt;questions&lt;/em&gt; we ask, or just the tools we use to answer them?&lt;/p&gt;

&lt;p&gt;That kind of judgment is slow to build. But it's the thing that compounds.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why I Don't Think We're Getting Replaced
&lt;/h2&gt;

&lt;p&gt;I've heard the "programmers are done" narrative enough times to have an opinion on it, so here's mine: I think it's mostly wrong, and wrong in an interesting way.&lt;/p&gt;

&lt;p&gt;The argument assumes that programming is fundamentally about producing code, and if a machine can produce code faster, then programmers lose. But that was never quite right. The hard part of software was never typing. It was &lt;em&gt;figuring out what to type&lt;/em&gt;. Understanding messy requirements. Navigating system constraints. Making tradeoffs that don't have clean answers. Debugging not just logic errors, but &lt;em&gt;conceptual&lt;/em&gt; errors, the kind where the code works perfectly and the product is still wrong.&lt;/p&gt;

&lt;p&gt;AI is extraordinary at generation. It's getting better at reasoning. But it still needs someone to point it at the right problem, to validate its output against reality, to integrate it into systems that have history and politics and technical debt. That "someone" looks a lot like an engineer to me.&lt;/p&gt;

&lt;p&gt;And here's the irony that I think gets lost in the panic: programmers are already the people &lt;em&gt;closest&lt;/em&gt; to this technology. &lt;strong&gt;We're the ones working with the models every day, feeling out their edges, learning their failure modes. The anxiety often comes from proximity: when you're standing right under a wave, it looks like it's going to crush you.&lt;/strong&gt; But proximity is also an advantage. We're not watching this from the shore. We're already in the water.&lt;/p&gt;




&lt;h2&gt;
  
  
  A Case for Going Slow
&lt;/h2&gt;

&lt;p&gt;I want to end with something that might sound counterintuitive in an industry obsessed with speed.&lt;/p&gt;

&lt;p&gt;It's okay to be slow.&lt;/p&gt;

&lt;p&gt;It's okay to not have an opinion on the model that dropped yesterday. It's okay to skip a hype cycle. It's okay to spend your weekend doing something that has nothing to do with AI and not feel guilty about it.&lt;/p&gt;

&lt;p&gt;The people who build lasting careers in technology aren't the ones who mass-produce side projects on every trending tool. They're the ones who develop &lt;em&gt;taste&lt;/em&gt;, a quiet, hard-won instinct for what matters and what doesn't. That kind of taste doesn't come from chasing everything. It comes from watching patiently, choosing deliberately, and trusting that you don't have to catch every wave to have a good ride.&lt;/p&gt;

&lt;p&gt;So if the anxiety has been getting to you, if you've been lying awake wondering whether your skills still matter, whether you're doing enough, whether the ground beneath you is about to give way, let me say this plainly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You are not behind. You are in the middle of a very loud, very confusing moment. And loud, confusing moments always feel more permanent than they are.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The wave will keep moving. So will you. And at your own pace, in your own way, you'll find where you stand.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I wrote this as much for myself as for anyone else. If it landed, I'd love to hear what you're going through. I suspect a lot more of us feel this way than the highlight reels suggest.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>genai</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>A Beginner's Guide to Multi-Agent Systems: How AI Agents Work Together</title>
      <dc:creator>Bato</dc:creator>
      <pubDate>Sat, 21 Feb 2026 18:08:55 +0000</pubDate>
      <link>https://dev.to/kaniel_outis/a-beginners-guide-to-multi-agent-systems-how-ai-agents-work-together-d43</link>
      <guid>https://dev.to/kaniel_outis/a-beginners-guide-to-multi-agent-systems-how-ai-agents-work-together-d43</guid>
      <description>&lt;p&gt;You've probably heard the term "AI agents" thrown around a lot lately. But recently, a new idea has been taking over engineering discussions: &lt;strong&gt;multi-agent systems&lt;/strong&gt;. Not one AI doing everything: but a team of AIs, each with a specific job, collaborating to tackle complex problems.&lt;/p&gt;

&lt;p&gt;Here's a surprise: if you've ever used &lt;strong&gt;Claude Code&lt;/strong&gt; to refactor a large codebase or fix a tricky bug, you've already seen a multi-agent system at work, you just might not have known it.&lt;/p&gt;

&lt;p&gt;If that sounds complicated, don't worry. By the end of this guide, you'll understand what multi-agent systems are, why they matter, and how to build a simple one yourself (no PhD required).&lt;/p&gt;




&lt;h2&gt;
  
  
  First: What Even Is an "Agent"?
&lt;/h2&gt;

&lt;p&gt;Before we go multi, let's make sure we're clear on what a single agent is.&lt;/p&gt;

&lt;p&gt;A traditional LLM (like GPT or Claude) takes input and produces output — one shot, done. An &lt;strong&gt;agent&lt;/strong&gt; goes further: it can &lt;strong&gt;reason&lt;/strong&gt;, &lt;strong&gt;use tools&lt;/strong&gt;, and &lt;strong&gt;take actions in a loop&lt;/strong&gt; until a goal is completed.&lt;/p&gt;

&lt;p&gt;Think of it this way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLM&lt;/strong&gt;: "Here's a summary of that article."&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent&lt;/strong&gt;: "I'll search the web for that article, read it, cross-check it with two other sources, and then give you a summary with citations."&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents typically follow a loop:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Observe → Think → Act → Observe again → ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;A common implementation looks roughly like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;is_final_answer&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;

        &lt;span class="c1"&gt;# The LLM decided to use a tool
&lt;/span&gt;        &lt;span class="n"&gt;tool_result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute_tool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_call&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tool_result&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Simple enough. So why do we need multiple agents?&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem With One Agent Doing Everything
&lt;/h2&gt;

&lt;p&gt;Imagine you ask a single agent to:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Research our top 3 competitors, write a market analysis report, and then draft 5 LinkedIn posts based on it."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That's three very different jobs: researcher, analyst, copywriter. Cramming all of that into one agent creates real problems:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context window overload&lt;/strong&gt; — Long tasks fill up the LLM's memory fast, causing it to "forget" earlier steps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of specialization&lt;/strong&gt; — An agent trying to do everything tends to do nothing particularly well.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hard to debug&lt;/strong&gt; — When something goes wrong, you don't know which "part" failed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No parallelism&lt;/strong&gt; — One agent does things one at a time. What if subtasks could run simultaneously?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is exactly the problem multi-agent systems solve.&lt;/p&gt;
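&lt;p&gt;The parallelism point is worth seeing in code. Here's a minimal sketch using Python's standard library; &lt;code&gt;research_competitor&lt;/code&gt; is a hypothetical stub standing in for an LLM-backed subagent:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor

def research_competitor(name: str) -> str:
    # Stub for an LLM-backed research subagent; a real one would
    # call a model plus web-search tools here.
    return f"Summary of {name}"

def run_in_parallel(competitors: list) -> list:
    # Each competitor is an independent subtask, so the subagents
    # can run concurrently, which a single sequential agent cannot do.
    with ThreadPoolExecutor(max_workers=3) as pool:
        return list(pool.map(research_competitor, competitors))

results = run_in_parallel(["Acme", "Globex", "Initech"])
```

&lt;p&gt;With real LLM calls, the wall-clock win from running independent subtasks concurrently is often the most immediate benefit of splitting one agent into several.&lt;/p&gt;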




&lt;h2&gt;
  
  
  You're Already Using Multi-Agent AI
&lt;/h2&gt;

&lt;p&gt;Before we get to theory, let's look at a tool many developers already have in their terminal: &lt;strong&gt;Claude Code&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When you ask Claude Code something simple like &lt;code&gt;"fix the bug on line 42"&lt;/code&gt;, it handles it in a single pass. But ask it something more complex, like &lt;code&gt;"refactor this entire module, write tests, and check for regressions"&lt;/code&gt;, and something more interesting happens under the hood.&lt;/p&gt;

&lt;p&gt;Claude Code acts as an &lt;strong&gt;orchestrator&lt;/strong&gt;. Instead of trying to hold the entire task in one context window, it breaks the work down and can spin up &lt;strong&gt;subagents&lt;/strong&gt;: separate Claude instances with specific, scoped roles. One subagent might be tasked with exploring the codebase structure, another with writing the actual refactored code, and another with running the test suite and reporting results. Each subagent operates independently, does its job, and reports back.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You
 └─▶ Claude Code (Orchestrator)
        ├─▶ Subagent A: "Explore the repo and map dependencies"
        ├─▶ Subagent B: "Rewrite the module based on the map"
        └─▶ Subagent C: "Run tests and report failures"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The orchestrator then assembles the results and gives you a single coherent answer — as if one very capable developer had done it all.&lt;/p&gt;

&lt;p&gt;This is the multi-agent pattern in action. And the same design is behind tools like Devin, OpenAI's Operator, and many of the AI-powered developer tools launching in 2025–2026. Now let's understand how it works so you can build your own.&lt;/p&gt;




&lt;h2&gt;
  
  
  What Is a Multi-Agent System?
&lt;/h2&gt;

&lt;p&gt;A &lt;strong&gt;multi-agent system (MAS)&lt;/strong&gt; is a setup where multiple AI agents work together — each with a defined role — to complete a larger task. Think of it like a software engineering team: you have a project manager, a frontend dev, a backend dev, and a QA engineer. Each is an expert in their lane, and a coordinator ties their work together.&lt;/p&gt;

&lt;p&gt;The key building blocks are:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. The Orchestrator (a.k.a. the "Manager Agent")
&lt;/h3&gt;

&lt;p&gt;This is the brain that receives the high-level goal, breaks it into subtasks, assigns those subtasks to specialized agents, and assembles the final result. The orchestrator doesn't necessarily do the actual work — it delegates.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Subagents (a.k.a. "Worker Agents")
&lt;/h3&gt;

&lt;p&gt;These agents handle specific, well-scoped tasks. A &lt;code&gt;ResearchAgent&lt;/code&gt; searches the web. A &lt;code&gt;WriterAgent&lt;/code&gt; drafts content. A &lt;code&gt;CodeAgent&lt;/code&gt; writes and runs code. Each has its own set of tools appropriate to its role.&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Tools
&lt;/h3&gt;

&lt;p&gt;Tools are functions that agents can call — web search, code execution, API calls, database queries, file I/O. Tools are what make agents actually &lt;em&gt;useful&lt;/em&gt; in the real world.&lt;/p&gt;
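&lt;p&gt;A common way to wire this up is a small registry mapping tool names to descriptions and callables, which the agent loop dispatches into. This is a sketch, not any particular framework's API; &lt;code&gt;get_weather&lt;/code&gt; is a made-up example tool:&lt;/p&gt;

```python
def get_weather(city: str) -> str:
    # Stand-in for a real weather API call.
    return f"Sunny in {city}"

# Each entry carries what the model needs to pick a tool (name,
# description, parameters) plus the function to actually run.
TOOLS = {
    "get_weather": {
        "description": "Look up current weather for a city",
        "parameters": {"city": "string"},
        "fn": get_weather,
    }
}

def execute_tool(name: str, args: dict) -> str:
    # The agent loop dispatches the model's tool call by name.
    return TOOLS[name]["fn"](**args)

report = execute_tool("get_weather", {"city": "Oslo"})
```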

&lt;h3&gt;
  
  
  4. Memory
&lt;/h3&gt;

&lt;p&gt;Agents need context. Memory can be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Short-term&lt;/strong&gt; (conversation history within a session)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Long-term&lt;/strong&gt; (a vector database or knowledge store that persists between runs)&lt;/li&gt;
&lt;/ul&gt;
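&lt;p&gt;The two tiers can be sketched as a tiny class. &lt;code&gt;AgentMemory&lt;/code&gt; and its method names are illustrative, and the plain dict stands in for a real vector store:&lt;/p&gt;

```python
class AgentMemory:
    def __init__(self):
        self.short_term = []  # conversation history for this session
        self.long_term = {}   # stand-in for a vector DB / knowledge store

    def remember(self, role: str, content: str):
        # Short-term memory: just the running message list.
        self.short_term.append({"role": role, "content": content})

    def persist(self, key: str, fact: str):
        # Long-term memory: in production this would embed the fact
        # and upsert it into a vector database so it survives the session.
        self.long_term[key] = fact

mem = AgentMemory()
mem.remember("user", "Our launch date is March 3")
mem.persist("launch_date", "March 3")
```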

&lt;h3&gt;
  
  
  5. Communication
&lt;/h3&gt;

&lt;p&gt;Agents pass messages to each other — typically as structured text or JSON. The orchestrator sends a task; the subagent returns a result.&lt;/p&gt;
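&lt;p&gt;In practice that message passing often looks like a small JSON envelope. The field names here are one reasonable convention, not a standard:&lt;/p&gt;

```python
import json

def make_task(agent: str, instruction: str) -> str:
    # Envelope the orchestrator sends to a subagent.
    return json.dumps({"to": agent, "type": "task", "instruction": instruction})

def make_result(agent: str, output: str, ok: bool = True) -> str:
    # Envelope the subagent sends back; the ok flag lets the
    # orchestrator decide whether to retry or move on.
    return json.dumps({"from": agent, "type": "result", "ok": ok, "output": output})

task = make_task("ResearchAgent", "Summarize competitor pricing")
reply = make_result("ResearchAgent", "3 key findings...")
parsed = json.loads(reply)
```

&lt;p&gt;Structured envelopes like this make handoffs easy to log, validate, and replay — which matters once you're debugging a chain of agents.&lt;/p&gt;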




&lt;h2&gt;
  
  
  Building a Simple Multi-Agent System
&lt;/h2&gt;

&lt;p&gt;Let's put this into code. We'll build a small, framework-agnostic example: a two-agent system where one agent researches a topic and another writes a blog intro based on the research.&lt;/p&gt;

&lt;p&gt;We'll use Python and the OpenAI API (you can swap this for any LLM provider, the pattern stays the same).&lt;/p&gt;

&lt;h3&gt;
  
  
  Setup
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;openai
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key-here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;A simple wrapper to call an LLM with a system + user prompt.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Research Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A specialized agent whose only job is to gather key facts about a topic.
    In a real system, this agent would have web search tools.
    For simplicity, we&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;re having the LLM draw on its training knowledge.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a research assistant. Your job is to provide a concise,
    factual summary of a given topic — 5 key bullet points, nothing more.
    Focus on accuracy and relevance. Do not editorialize.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Research this topic: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[ResearchAgent] Done. Key facts gathered.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Writer Agent
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;writer_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    A specialized agent whose only job is to write engaging content
    based on provided research. It doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t search — it just writes.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;system_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    You are a skilled technical writer for a developer blog.
    Given a topic and research notes, write a compelling, friendly
    introduction paragraph (3-4 sentences) that hooks the reader.
    Write for developers, not academics.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;user_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Topic: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Research notes:
    &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;research&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

    Write the intro paragraph now.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;call_llm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;system_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[WriterAgent] Done. Intro written.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  The Orchestrator
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    The orchestrator receives a high-level goal, breaks it into subtasks,
    delegates to specialized agents, and assembles the final output.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Orchestrator] Goal received: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Orchestrator] Delegating research task...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 1: Extract the topic from the goal (in a real system,
&lt;/span&gt;    &lt;span class="c1"&gt;# the orchestrator would use an LLM to parse the goal)
&lt;/span&gt;    &lt;span class="n"&gt;topic&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;  &lt;span class="c1"&gt;# Simplified for this example
&lt;/span&gt;
    &lt;span class="c1"&gt;# Step 2: Delegate to ResearchAgent
&lt;/span&gt;    &lt;span class="n"&gt;research_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;research_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 3: Delegate to WriterAgent, passing the research output
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Orchestrator] Delegating writing task...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;final_output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;writer_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research_output&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Step 4: Return assembled result
&lt;/span&gt;    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Orchestrator] All tasks complete. Returning final output.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_output&lt;/span&gt;


&lt;span class="c1"&gt;# Run it
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;orchestrator&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The rise of multi-agent AI systems in 2025&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=== FINAL OUTPUT ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Sample Output
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;[Orchestrator] Goal received: 'The rise of multi-agent AI systems in 2025'
[Orchestrator] Delegating research task...

[ResearchAgent] Done. Key facts gathered.

[Orchestrator] Delegating writing task...

[WriterAgent] Done. Intro written.

[Orchestrator] All tasks complete. Returning final output.

=== FINAL OUTPUT ===
In 2025, AI stopped being a solo act. Multi-agent systems — where
teams of specialized AI models collaborate on complex tasks — emerged
from research labs into production engineering stacks at companies like
Google, OpenAI, and Anthropic. Rather than asking one model to do
everything, developers are now designing pipelines where a "manager"
agent delegates research, writing, coding, and verification to expert
subagents. If you've been wondering what all the buzz is about, you're
in exactly the right place.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the core pattern. In a production system, you'd add real web search tools, error handling, retry logic, agent memory, and parallel execution — but the &lt;strong&gt;orchestrator → delegate → assemble&lt;/strong&gt; structure stays the same.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real-World Use Cases
&lt;/h2&gt;

&lt;p&gt;Multi-agent systems shine whenever a task is too large, complex, or varied for a single agent. Here are three common patterns you'll see in the wild:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Automated research pipelines&lt;/strong&gt;&lt;br&gt;
One agent searches and gathers sources, another reads and extracts key points, a third synthesizes findings into a report. No single agent's context window gets overwhelmed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. AI coding assistants (like Claude Code)&lt;/strong&gt;&lt;br&gt;
This is the most accessible real-world example. Claude Code uses an orchestrator-subagent model: when given a complex task, the main agent breaks it into subtasks and delegates — one subagent explores the codebase, one writes or modifies code, one runs shell commands and tests. Each subagent has a narrow, well-defined job. This same pattern powers tools like Devin and SWE-agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Customer support automation&lt;/strong&gt;&lt;br&gt;
An &lt;code&gt;IntentAgent&lt;/code&gt; classifies the user's issue, a &lt;code&gt;KnowledgeAgent&lt;/code&gt; retrieves the relevant documentation, and a &lt;code&gt;ResponseAgent&lt;/code&gt; drafts the reply. Each agent is small, fast, and easy to tune independently.&lt;/p&gt;
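&lt;p&gt;That three-stage pipeline is easy to prototype. Here's a toy version with stub functions in place of LLM-backed agents (the agent names come from the pattern above; the routing rules are illustrative):&lt;/p&gt;

```python
def intent_agent(message: str) -> str:
    # Classifies the user's issue; a real version would be an LLM call.
    return "billing" if "invoice" in message.lower() else "general"

def knowledge_agent(intent: str) -> str:
    # Retrieves the relevant documentation for the classified intent.
    docs = {"billing": "See the refunds policy.", "general": "See the FAQ."}
    return docs[intent]

def response_agent(message: str, doc: str) -> str:
    # Drafts the final reply from the retrieved documentation.
    return f"Thanks for reaching out. {doc}"

message = "Where is my invoice?"
reply = response_agent(message, knowledge_agent(intent_agent(message)))
```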




&lt;h2&gt;
  
  
  Common Pitfalls to Avoid
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Giving agents too much responsibility.&lt;/strong&gt; The whole point of multi-agent systems is specialization. If your &lt;code&gt;ResearchAgent&lt;/code&gt; is also writing and formatting the output, it's not really specialized.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Forgetting error handling between agents.&lt;/strong&gt; What happens if the research agent returns nothing? Your writer agent will hallucinate. Always validate the output of one agent before passing it to the next.&lt;/p&gt;
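&lt;p&gt;A minimal handoff guard might look like this (the function name and the empty-output rule are illustrative; real validators could also check schemas, citation counts, or token limits):&lt;/p&gt;

```python
def validated_handoff(payload: str, producer: str) -> str:
    """Fail fast at the handoff instead of letting the next agent hallucinate."""
    if payload is None or not payload.strip():
        raise ValueError(f"{producer} returned empty output; aborting the chain")
    return payload

# e.g. writer_input = validated_handoff(research_output, "ResearchAgent")
```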

&lt;p&gt;&lt;strong&gt;Ignoring cost and latency.&lt;/strong&gt; Each agent call costs money and time. More agents ≠ better results. Start with the minimum number of agents needed and add more only when you hit a real bottleneck.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No logging or tracing.&lt;/strong&gt; In a chain of agents, debugging is hard without visibility. Add logs at every handoff (like the &lt;code&gt;print&lt;/code&gt; statements in our example), and consider tools like LangSmith or Langfuse for production tracing.&lt;/p&gt;
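&lt;p&gt;Before reaching for a full tracing service, a plain decorator at each handoff already buys a lot of visibility. A sketch (the agent function below is a stub):&lt;/p&gt;

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("handoff")

def traced(agent_name: str):
    """Log every inter-agent handoff: which agent produced output, and its size."""
    def wrap(fn):
        def inner(*args, **kwargs):
            out = fn(*args, **kwargs)
            log.info("%s produced %d chars", agent_name, len(str(out)))
            return out
        return inner
    return wrap

@traced("ResearchAgent")
def research(topic: str) -> str:
    # stand-in for a real agent call
    return f"notes about {topic}"
```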




&lt;h2&gt;
  
  
  Where to Go From Here
&lt;/h2&gt;

&lt;p&gt;You now understand the fundamentals. Here are some good next steps depending on where you want to go:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Try LangGraph&lt;/strong&gt; if you want a production-grade framework for building stateful, graph-based agent workflows with built-in support for cycles and conditional edges.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try Google's Agent Development Kit (ADK)&lt;/strong&gt; if you want Google's official framework — it was just announced and has great tooling for building hierarchical agent systems.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Try OpenAI's Agents SDK&lt;/strong&gt; if you're already in the OpenAI ecosystem and want handoffs and tool-calling built in out of the box.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read "Patterns for Building LLM-based Systems"&lt;/strong&gt; by Eugene Yan — one of the best practical overviews of agent design patterns available.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Wrapping Up
&lt;/h2&gt;

&lt;p&gt;Multi-agent systems aren't magic, and they're not just hype either. They're a practical engineering pattern for solving problems that are genuinely hard for a single AI to handle — tasks that are too long, too complex, or too diverse.&lt;/p&gt;

&lt;p&gt;The pattern is simple: &lt;strong&gt;break down the goal → assign specialized agents → orchestrate the results&lt;/strong&gt;. Start small, keep your agents focused, and add complexity only when you need it.&lt;/p&gt;

&lt;p&gt;The era of AI teamwork is just getting started, and now you know how to build your own team.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Did this help? Drop a comment with what you're building — I'd love to hear what multi-agent use cases you're exploring.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>genai</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Pandas 3.0's PyArrow String Revolution: A Deep Dive into Memory and Performance</title>
      <dc:creator>Bato</dc:creator>
      <pubDate>Mon, 16 Feb 2026 07:50:49 +0000</pubDate>
      <link>https://dev.to/kaniel_outis/pandas-30s-pyarrow-string-revolution-a-deep-dive-into-memory-and-performance-357g</link>
      <guid>https://dev.to/kaniel_outis/pandas-30s-pyarrow-string-revolution-a-deep-dive-into-memory-and-performance-357g</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Pandas 3.0 made a game-changing decision: &lt;strong&gt;PyArrow-backed strings are now the default&lt;/strong&gt;. Instead of storing strings as Python objects (the old &lt;code&gt;object&lt;/code&gt; dtype), pandas now uses Apache Arrow's columnar format with the new &lt;code&gt;string[pyarrow]&lt;/code&gt; dtype.&lt;/p&gt;

&lt;p&gt;But here's the question that matters: &lt;strong&gt;How much does this new string dtype actually improve performance and memory usage in real-world scenarios?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To find out, I ran comprehensive benchmarks across diverse datasets and common string operations. The results? &lt;strong&gt;51.8% memory savings&lt;/strong&gt; on average, with operations running &lt;strong&gt;2-27x faster&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This isn't a theoretical improvement; it's a fundamental shift in how pandas handles string data.&lt;/p&gt;
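&lt;p&gt;If you want to reproduce the measurement yourself, the core of it is &lt;code&gt;memory_usage(deep=True)&lt;/code&gt;. Here's a minimal sketch; the demo column is synthetic, and the &lt;code&gt;string[pyarrow]&lt;/code&gt; conversion assumes pyarrow is installed (it falls back to pandas' plain string dtype otherwise):&lt;/p&gt;

```python
import pandas as pd

def memory_mb(df: pd.DataFrame) -> float:
    # deep=True matters: without it, object columns report only 8 bytes per pointer
    return float(df.memory_usage(deep=True).sum()) / 1e6

def savings_pct(before_mb: float, after_mb: float) -> float:
    return 100.0 * (1.0 - after_mb / before_mb)

# illustrative frame with categorical-like strings
df_obj = pd.DataFrame({"status": ["pending", "completed", "failed"] * 100_000}).astype(object)

try:
    df_new = df_obj.astype("string[pyarrow]")  # explicit Arrow backing; needs pyarrow
except ImportError:
    df_new = df_obj.astype("string")           # fallback without pyarrow

print(f"object: {memory_mb(df_obj):.1f} MB, new: {memory_mb(df_new):.1f} MB, "
      f"saved: {savings_pct(memory_mb(df_obj), memory_mb(df_new)):.1f}%")
```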




&lt;h2&gt;
  
  
  The Results: Summary Dashboard
&lt;/h2&gt;

&lt;p&gt;Let me start with the headline numbers, then we'll dive into how I got them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb19prr1lm77mc5zf2s0r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb19prr1lm77mc5zf2s0r.png" alt="Result Summary" width="800" height="509"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  The Four Key Metrics
&lt;/h3&gt;

&lt;h3&gt;
  
  
  1. 51.8% Memory Savings
&lt;/h3&gt;

&lt;p&gt;Across all test datasets, the new PyArrow string dtype used half the memory of the old object dtype. This isn't a marginal improvement; it's transformative for memory-constrained environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. 6.17x Average Operation Speedup
&lt;/h3&gt;

&lt;p&gt;String operations aren't just more memory-efficient: they're dramatically faster. On average, operations like &lt;code&gt;str.lower()&lt;/code&gt;, &lt;code&gt;str.contains()&lt;/code&gt;, and &lt;code&gt;str.len()&lt;/code&gt; run 6x faster with PyArrow strings.&lt;/p&gt;

&lt;p&gt;Some operations are even more impressive:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;str.len()&lt;/code&gt;: 27x faster&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;str.startswith()&lt;/code&gt;: 16x faster&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;str.endswith()&lt;/code&gt;: 15x faster&lt;/li&gt;
&lt;/ul&gt;
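&lt;p&gt;You can sanity-check these speedups on your own machine with a quick timing loop. A sketch (absolute timings will vary; the fallback path covers environments without pyarrow):&lt;/p&gt;

```python
import time
import pandas as pd

s_obj = pd.Series(["alpha", "bravo", "charlie"] * 200_000).astype(object)
try:
    s_new = s_obj.astype("string[pyarrow]")  # Arrow-backed; requires pyarrow
except ImportError:
    s_new = s_obj.astype("string")

def bench(series: pd.Series):
    # time a single vectorized string operation
    t0 = time.perf_counter()
    out = series.str.len()
    return time.perf_counter() - t0, out

t_obj, len_obj = bench(s_obj)
t_new, len_new = bench(s_new)
print(f"str.len(): object {t_obj*1e3:.1f} ms, new dtype {t_new*1e3:.1f} ms")
```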

&lt;h3&gt;
  
  
  3. 889 MB Total Memory Saved
&lt;/h3&gt;

&lt;p&gt;Across our test datasets (totaling 645 MB on disk), we saved nearly &lt;strong&gt;1 GB of RAM&lt;/strong&gt;. For a real data pipeline processing dozens of datasets, this compounds quickly.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Memory Overhead: The Game Changer
&lt;/h3&gt;

&lt;p&gt;The bottom chart reveals something crucial about how pandas handles strings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Old string dtype (object):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV files on disk: 645 MB&lt;/li&gt;
&lt;li&gt;Loaded into pandas: 1,714 MB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory overhead: 165.7%&lt;/strong&gt; (more than doubles!)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;New string dtype (PyArrow):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CSV files on disk: 645 MB&lt;/li&gt;
&lt;li&gt;Loaded into pandas: 825 MB&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory overhead: 27.9%&lt;/strong&gt; (minimal overhead)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What does this mean?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When pandas reads a CSV file, it doesn't just store the raw bytes: it creates in-memory data structures for fast operations. The old object dtype was incredibly inefficient, wrapping every value in a full Python object and storing an extra pointer per row. The new PyArrow string dtype keeps overhead minimal with a smarter, contiguous memory layout.&lt;/p&gt;

&lt;p&gt;This is the difference between pandas 2's Python-object approach and pandas 3's columnar Arrow approach.&lt;/p&gt;
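&lt;p&gt;You can observe this overhead gap directly by comparing bytes on disk with bytes in memory. A sketch using an in-memory CSV (exact percentages depend on your pandas version and data):&lt;/p&gt;

```python
import io
import pandas as pd

# synthetic single-column CSV of repeated statuses
csv_text = "status\n" + "\n".join(["pending", "completed", "failed"] * 50_000)
disk_bytes = len(csv_text.encode("utf-8"))

df = pd.read_csv(io.StringIO(csv_text))
mem_bytes = int(df.memory_usage(deep=True).sum())

overhead_pct = 100 * (mem_bytes - disk_bytes) / disk_bytes
print(f"disk {disk_bytes/1e6:.2f} MB, memory {mem_bytes/1e6:.2f} MB, overhead {overhead_pct:.0f}%")
```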




&lt;h2&gt;
  
  
  The Methodology: Why 5 Different Datasets?
&lt;/h2&gt;

&lt;p&gt;Now that you've seen the results, let me explain how I tested this. Real-world data comes in many shapes and sizes. A single benchmark on one type of data wouldn't tell the whole story.&lt;/p&gt;

&lt;p&gt;That's why I created &lt;strong&gt;5 distinct datasets&lt;/strong&gt;, each representing common patterns you'll encounter in production:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;Low Cardinality Dataset&lt;/strong&gt; (1M rows)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Repeated categorical values like product categories, status codes, regions, and priorities.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; This is typical of business data: think order statuses, customer segments, or department codes. The same values repeat millions of times.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example columns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;category&lt;/code&gt;: "Electronics", "Clothing", "Food" (10 unique values)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;status&lt;/code&gt;: "pending", "completed", "failed" (4 unique values)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;High Cardinality Dataset&lt;/strong&gt; (1M rows)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Mostly unique strings like user IDs, email addresses, and session tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; When every row is different (like customer emails or transaction IDs), pandas can't use simple optimizations. This tests worst-case scenarios.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example columns:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;user_id&lt;/code&gt;: "USER_00000001", "USER_00000002"... (1M unique)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;email&lt;/code&gt;: "&lt;a href="mailto:user123@example45.com"&gt;user123@example45.com&lt;/a&gt;" (1M unique)&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. &lt;strong&gt;Mixed String Lengths Dataset&lt;/strong&gt; (1M rows)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A combination of short codes (2-5 chars), medium names (20-50 chars), and long descriptions (100-300 chars).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Real data isn't uniform. You might have product codes next to customer addresses next to order notes. This tests how pandas handles variable-length strings.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Dataset With Nulls&lt;/strong&gt; (1M rows)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; Data with missing values (10-33% nulls in different columns).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Messy data is reality. How does pandas 3.0 handle missing string data compared to pandas 2?&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;Large Dataset&lt;/strong&gt; (10M rows)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What it is:&lt;/strong&gt; A scaled-up version to test performance at scale.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why it matters:&lt;/strong&gt; Memory savings that look good at 1M rows might behave differently at 10M rows. This validates that the findings scale linearly.&lt;/p&gt;




&lt;h2&gt;
  
  
  Memory Savings by Dataset Type
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Frobot010%2FDev-Post-Code%2Fblob%2F04343be535204b9093df487d1bcfb921d9d00f2e%2Fpandas-memory-saving%2Fvisualizations%2Fmemory_comparison.png%3Fraw%3Dtrue" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2Frobot010%2FDev-Post-Code%2Fblob%2F04343be535204b9093df487d1bcfb921d9d00f2e%2Fpandas-memory-saving%2Fvisualizations%2Fmemory_comparison.png%3Fraw%3Dtrue" alt="Memory Consumption Comparison" width="4171" height="2368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The memory savings from PyArrow strings vary significantly by dataset characteristics:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Best Case: Low Cardinality Data (-71.6%)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When data has &lt;strong&gt;repeated values&lt;/strong&gt; (like categories), PyArrow strings shine:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Object dtype: 219 MB&lt;/li&gt;
&lt;li&gt;PyArrow string dtype: 62 MB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Savings: 71.6%&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Worst Case: Mixed String Lengths (-30.6%)&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Variable-length strings see smaller (but still significant) savings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Object dtype: 383 MB&lt;/li&gt;
&lt;li&gt;PyArrow string dtype: 266 MB&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Savings: 30.6%&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Notice how savings correlate with &lt;strong&gt;data characteristics&lt;/strong&gt;:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Repeated values&lt;/strong&gt; (low cardinality) → Best savings (64-72%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unique values&lt;/strong&gt; (high cardinality) → Good savings (53-55%)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Variable length&lt;/strong&gt; (mixed sizes) → Moderate savings (31%)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Takeaway:&lt;/strong&gt; PyArrow strings help everywhere, but they're &lt;em&gt;especially&lt;/em&gt; powerful for categorical-like data.&lt;/p&gt;




&lt;h2&gt;
  
  
  Performance: Operation-Specific Speedups
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5evoggnd0cs0yp0xkf0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg5evoggnd0cs0yp0xkf0.png" alt="Operation Speedup Heatmap" width="800" height="606"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This heatmap shows how much &lt;strong&gt;faster&lt;/strong&gt; PyArrow strings are compared to object dtype for common string operations (values &amp;gt; 1.0 mean PyArrow is faster).&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Fastest Operations&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;str.len()&lt;/code&gt;: 10-27x faster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;str.startswith()&lt;/code&gt; and &lt;code&gt;str.endswith()&lt;/code&gt;: 11-18x faster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;str.contains()&lt;/code&gt;: 3-5x faster&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;code&gt;str.split()&lt;/code&gt;: 1-8x faster&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;The Pattern&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Read operations (like &lt;code&gt;len()&lt;/code&gt;, &lt;code&gt;startswith()&lt;/code&gt;) → Massive speedups (10-27x)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These operations just examine existing data without modification&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Transform operations (like &lt;code&gt;replace()&lt;/code&gt;, &lt;code&gt;split()&lt;/code&gt;) → Good speedups (2-5x)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;These operations create new data, which limits the performance gains&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  The Trade-off: CSV Loading Time
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd0etv119sj0o6yiut5e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwd0etv119sj0o6yiut5e.png" alt="Load Time Comparison" width="800" height="396"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There's no such thing as a free lunch. While PyArrow strings save memory and run operations faster, &lt;strong&gt;loading CSV files is 9-61% slower&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why the Slowdown?
&lt;/h3&gt;

&lt;p&gt;When pandas reads a CSV with PyArrow strings enabled:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It parses the text (same as before)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It converts strings to PyArrow's columnar format&lt;/strong&gt; (extra step)&lt;/li&gt;
&lt;li&gt;This conversion involves building dictionary encodings and optimized memory structures&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Pandas is doing &lt;strong&gt;more work upfront&lt;/strong&gt; to enable better performance downstream.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real-world impact:&lt;/strong&gt; On our 10M row dataset, the difference is &lt;strong&gt;1.63s vs 2.02s&lt;/strong&gt;, an extra 0.4 seconds for 10 million rows. For many data pipelines, this upfront cost might be negligible compared to the 2-27x speedup in subsequent operations.&lt;/p&gt;
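&lt;p&gt;To judge whether the upfront cost matters for &lt;em&gt;your&lt;/em&gt; workload, time both sides of the trade. A sketch with a synthetic CSV:&lt;/p&gt;

```python
import io
import time
import pandas as pd

csv_text = "user,comment\n" + "\n".join(f"u{i},hello world {i}" for i in range(200_000))

t0 = time.perf_counter()
df = pd.read_csv(io.StringIO(csv_text))
load_s = time.perf_counter() - t0

t0 = time.perf_counter()
mask = df["comment"].str.startswith("hello")
op_s = time.perf_counter() - t0

# If op_s times the number of string ops in your pipeline dwarfs the extra
# load time, the Arrow conversion pays for itself.
print(f"load {load_s:.3f}s, one string op {op_s:.3f}s")
```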




&lt;h2&gt;
  
  
  Pros and Cons: Should You Adopt PyArrow Strings?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Benefits of PyArrow String Dtype&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Massive Memory Savings (30-72%)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dramatically Faster String Operations (2-27x)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Minimal Memory Overhead (28% vs 166%)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Modern Data Ecosystem Integration&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Trade-offs to Consider&lt;/strong&gt;
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Slower CSV Loading (9-61% slower)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initial data ingestion takes longer&lt;/li&gt;
&lt;li&gt;May impact workflows that repeatedly load small files&lt;/li&gt;
&lt;li&gt;The trade-off: slower start, much faster operations&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Behavioral Changes&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;String dtype behaves differently from object dtype in edge cases&lt;/li&gt;
&lt;li&gt;Need to update code that explicitly checks for &lt;code&gt;object&lt;/code&gt; dtype&lt;/li&gt;
&lt;li&gt;Testing required for migration&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
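&lt;p&gt;For the second trade-off, the usual migration fix is to stop comparing against &lt;code&gt;object&lt;/code&gt; and use pandas' dtype predicates instead. A sketch (the literal &lt;code&gt;== object&lt;/code&gt; check behaves differently across pandas versions):&lt;/p&gt;

```python
import pandas as pd
from pandas.api.types import is_string_dtype

s = pd.Series(["a", "b", "c"])

# Brittle: True on pandas 2.x (object dtype), False on pandas 3.x (str dtype)
legacy_check = s.dtype == object

# Portable: True for object, string[python], and Arrow-backed string columns
portable_check = is_string_dtype(s)
print(legacy_check, portable_check)
```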

&lt;h3&gt;
  
  
  &lt;strong&gt;The Recommendation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For most data workflows, &lt;strong&gt;PyArrow strings are a clear win&lt;/strong&gt;. The memory and performance benefits far outweigh the trade-offs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Consider staying with object dtype if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You rarely work with string columns&lt;/li&gt;
&lt;li&gt;Your datasets easily fit in memory&lt;/li&gt;
&lt;li&gt;Load time is critical and you rarely perform string operations&lt;/li&gt;
&lt;li&gt;You have legacy code that's deeply coupled to object dtype behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Definitely adopt PyArrow strings if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You process large datasets with text data&lt;/li&gt;
&lt;li&gt;String operations are a significant part of your workflow&lt;/li&gt;
&lt;li&gt;Memory is a constraint in your environment&lt;/li&gt;
&lt;li&gt;You're building production data pipelines&lt;/li&gt;
&lt;li&gt;You work with modern data tools (Parquet, Arrow, DuckDB, etc.)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Our comprehensive analysis across &lt;strong&gt;5 diverse datasets&lt;/strong&gt; and &lt;strong&gt;15+ string operations&lt;/strong&gt; conclusively shows that PyArrow-backed strings deliver transformative improvements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;51.8% average memory savings&lt;/strong&gt; across all dataset types&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;6.17x average operation speedup&lt;/strong&gt; for string operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Minimal memory overhead&lt;/strong&gt; (28% vs 166% with Python objects)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;PyArrow strings aren't just an incremental improvement; they're a fundamental reimagining of how pandas handles text data. By adopting Apache Arrow's proven columnar format, pandas has joined the modern data ecosystem while delivering massive performance and memory improvements.&lt;/p&gt;

&lt;p&gt;For most data practitioners working with text data, the question isn't "Should I use PyArrow strings?" but rather "How quickly can I migrate?"&lt;/p&gt;




&lt;p&gt;Questions or feedback? Feel free to open an issue or contribute to this analysis! The code we used in this analysis has been uploaded to this &lt;a href="https://github.com/robot010/Dev-Post-Code/tree/04343be535204b9093df487d1bcfb921d9d00f2e/pandas-memory-saving" rel="noopener noreferrer"&gt;repo&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>ai</category>
      <category>datascience</category>
      <category>dataengineering</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>We All Accepted the "Python Tax.", Pandas 3.0 Just Reduced It.</title>
      <dc:creator>Bato</dc:creator>
      <pubDate>Sun, 15 Feb 2026 06:51:21 +0000</pubDate>
      <link>https://dev.to/kaniel_outis/we-all-accepted-the-python-tax-pandas-30-just-reduced-it-1n43</link>
      <guid>https://dev.to/kaniel_outis/we-all-accepted-the-python-tax-pandas-30-just-reduced-it-1n43</guid>
      <description>&lt;p&gt;I’ve been there. You have a "small" 3GB CSV file. You load it into a Pandas DataFrame on a 16GB machine, and suddenly everything freezes. You start manually chunking data, deleting columns, and praying to the OOM (Out of Memory) gods 🙃.&lt;/p&gt;

&lt;p&gt;We’ve accepted this as the "Python Tax." We tell ourselves that &lt;code&gt;object&lt;/code&gt; dtypes are just the price we pay for flexibility. Spoiler: They aren't. And we’ve been wasting RAM for years.&lt;/p&gt;

&lt;h2&gt;
  
  
  The "Object" Lie
&lt;/h2&gt;

&lt;p&gt;For a decade, Pandas stored strings as NumPy &lt;code&gt;object&lt;/code&gt; arrays. This was a beautiful abstraction with a dark secret: it’s incredibly inefficient. Each string is wrapped in a heavy Python object header. When you have 10 million rows, you aren’t just storing data; you’re storing a massive, fragmented mess of pointers.&lt;/p&gt;
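&lt;p&gt;You can see that overhead with nothing but the standard library (sizes are for CPython 3.x and vary by build):&lt;/p&gt;

```python
import sys

value = "pending"

# A 7-character ASCII string costs far more than 7 bytes as a Python object:
# it carries a refcount, a type pointer, a length field, and a cached hash.
obj_bytes = sys.getsizeof(value)   # ~56 bytes on CPython 3.x

# An object-dtype column then adds an 8-byte pointer per row on top of that.
row_overhead = obj_bytes + 8
print(obj_bytes, row_overhead)
```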

&lt;h2&gt;
  
  
  The 10-Minute Upgrade That Saved 60% of My RAM
&lt;/h2&gt;

&lt;p&gt;With the release of Pandas 3.0, the game changed. By default, it now uses a dedicated &lt;code&gt;str&lt;/code&gt; type backed by PyArrow.&lt;/p&gt;

&lt;p&gt;I ran the numbers because, honestly, I didn't believe it at first. I kept my code exactly the same: no special flags, no engine tweaks, just a plain &lt;code&gt;pd.read_csv()&lt;/code&gt;. Here is what happens when you stop using legacy NumPy objects:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Results are Actually Insane:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Memory Slashing&lt;/strong&gt;: In a mixed-type dataset of 10M rows, I saw a 53.2% drop in memory usage just by upgrading to version 3.0.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text-Only DataFrame&lt;/strong&gt;: In my experiment with 10M pure string rows, memory usage fell from 658 MB to 267 MB, a 59.4% drop!&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr15fot7k6xzyje7xgu92.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fr15fot7k6xzyje7xgu92.png" alt="Memory consumption comparison" width="800" height="466"&gt;&lt;/a&gt;&lt;/p&gt;
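&lt;p&gt;If you want to check your own frames before and after upgrading, two lines do it. A sketch with an inline CSV standing in for a real file:&lt;/p&gt;

```python
import io
import pandas as pd

# inline stand-in for pd.read_csv("your_file.csv")
buf = io.StringIO("name,city\nalice,berlin\nbob,paris\n")
df = pd.read_csv(buf)

print(df.dtypes)              # pandas 3.x: str; pandas 2.x: object
df.info(memory_usage="deep")  # deep counts the string payloads, not just pointers
```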

&lt;h2&gt;
  
  
  Pragmatism &amp;gt; Perfection
&lt;/h2&gt;

&lt;p&gt;Is Pandas 3.0 perfect? No. But if you are working with text-heavy data, ignoring this upgrade is effectively choosing to pay for cloud resources you don't need.&lt;/p&gt;

&lt;p&gt;What’s your weirdest pandas "Out of Memory" story? This type of error never fails to bring me back to the early days of pandas dev 😁 &lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Repository: &lt;a href="https://github.com/robot010/Dev-Post-Code/tree/aef81da0767404f179dd9e9303de97283c278209/pandas-memory-saving" rel="noopener noreferrer"&gt;GitHub link&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>python</category>
      <category>pandas</category>
      <category>performance</category>
      <category>dataengineering</category>
    </item>
  </channel>
</rss>
