TL;DR
BAML is a domain-specific language and toolchain for defining LLM function interfaces with strict, recoverable output parsing - addressing the reliability gap that makes production LLM systems painful to build and maintain. It generates type-safe client code from schema definitions across Python, TypeScript, Go, Ruby, and several other languages, and uses a parsing approach called Schema Aligned Parsing that recovers structured data even from garbled or partial model responses. For a working reference implementation, see the sample GitHub repository linked at the end of this post.
How I came to know about BAML
I was wondering whether there is something that tries to handle structured output from an LLM when, suddenly, a talk by Vaibhav Gupta landed in my feed. I started exploring more; if you want to explore the way I did rather than read this post, you can try asking these questions to learn it yourself:
- What is BAML?
- What is Pydantic? Does it relate to BAML? If yes, how does it relate to BAML?
- What is PydanticAI? How does it compare to BAML? Can I use PydanticAI just for what BAML does? Does PydanticAI retry to get right output from the model?
- How does BAML handle heavily hallucinated output?
- What is Instructor (https://github.com/567-labs/instructor)? How does it compare with BAML? Follow-up for clarity: if one is already using PydanticAI, is there still a point in using Instructor?
- Where exactly does BAML fit into a standard RAG pipeline?
- How does BAML help in token efficiency?
- What is semantic streaming in BAML? What problems does it solve? How does it help in Generative UI (add information, in short, about what Generative UI is)?
- What is BAML code generator?
- What is Schema Aligned Parsing? And what can it handle?
- What kind of testing is done or can be done in BAML?
- What is union in BAML?
- How does logging and tracing or observability work in BAML?
- How does BAML use Jinja templating to inject dynamic context, loops, and precise chat roles into prompts without messy string concatenation?
- What are dynamic types (or runtime schemas) in BAML?
- What aspects can BAML help in?
- Will BAML make sense with something like Claude Agent SDK?
What BAML Is and the Problem It Solves
Every engineer who has tried building an LLM-powered feature knows the first hour of optimism and the next two weeks of fire-fighting. The model returns JSON with an extra key, or wraps it in markdown fences, or truncates mid-response. The prompt worked fine in the POC/demo. Now the production-grade implementation has three different parsing bugs, all subtly different.
BAML ("Basically, a Made-up Language", from Boundary ML) exists to solve this class of problem at the right level of abstraction. It is a language-level contract between the application and the model. You define what you want the model to return, write the prompt logic in a dedicated templating layer, and BAML handles parsing, type-checking, retries, and client generation across Python, TypeScript, Go, Ruby, and other languages - with opt-in retry policies when you need them.
The project positions itself as the Pydantic of LLM engineering - a statement about philosophy rather than API compatibility. Just as Pydantic introduced runtime type validation into Python codebases that previously relied on convention and hope, BAML introduces structural guarantees into LLM pipelines that previously relied on prompt tuning and defensive try/except blocks.
How BAML Relates to Pydantic and Tools Like Instructor
Pydantic itself does one thing exceptionally well: it validates Python data structures against declared schemas. Feed it a dictionary, and it tells you whether it conforms to the model definition. It does not know anything about language models, prompts, or API calls - it is a validation library, and a very good one.
Instructor builds on top of Pydantic to handle the LLM layer. It takes a Pydantic model, wraps the OpenAI (or Anthropic, or other) API call, and uses function calling or JSON mode to coax the model into returning something the Pydantic validator can accept. When validation fails, Instructor can retry with the validation error message appended to the conversation, giving the model a chance to self-correct. This is practical, widely used, and works well for straightforward extraction tasks. What Instructor does not do is provide a dedicated authoring layer for prompts, generate client code from schema definitions, or go beyond retry logic when the model output is deeply malformed.
PydanticAI goes further than Instructor. It is an agent framework - it handles tool registration, multi-step agent loops, dependency injection, and result validation as part of a unified system. Validation failures feed back into the agent's run loop through a reflection mechanism, giving the model a chance to self-correct - structurally similar to what Instructor does but integrated at the framework level rather than as a wrapper. Comparing PydanticAI and BAML feature-for-feature would miss the point.
The more accurate comparison is about what layer each tool operates at. PydanticAI and BAML both handle structured output and retry behavior, but they do so with different default assumptions. PydanticAI is a Python framework - everything is Python, configured in Python, tested in Python. BAML is a language-level abstraction with its own syntax, its own code generator, and its own parsing engine that operates below what either Pydantic or the model's native JSON mode provides.
If a team is already using PydanticAI and happy with it, BAML is not a necessary replacement. If the team is hitting parsing failures that retry loops do not reliably fix, or needs multi-language client generation, or wants prompt authoring with first-class tooling support, BAML addresses different parts of the problem.
The BAML DSL and Code Generation
BAML is its own language. Not a Python DSL, not a configuration file format - a purpose-built syntax for describing LLM function signatures, data schemas, and prompt templates in a single, unified file format. A .baml file defines the inputs, the expected output structure, and the prompt template that connects them. The BAML compiler - written in Rust - reads those files and generates native client code in Python, TypeScript, Go, Ruby, and other languages. The Rust foundation is also what makes the SAP parsing engine fast enough to run inline on streaming responses without meaningful latency overhead - error correction applies in under 10ms, orders of magnitude cheaper than a retry API call. This is why BAML can credibly claim to be a language-level abstraction rather than a Python-centric library with thin wrappers for other runtimes.
This matters for a reason that is easy to dismiss as aesthetic but is actually structural: when the schema and the prompt live in the same file, they cannot drift apart. In a typical setup, the Pydantic model is in one file, the prompt string is in another, and the parsing logic is somewhere else. When the prompt changes, the schema might not. When the schema changes, the prompt often does not. This is less about convenience and more about eliminating an entire class of bugs - schema drift between prompt, parser, and application code - that is difficult to catch in review and invisible until it surfaces in production. BAML makes these co-located and co-versioned by design.
The generated client code behaves like a typed function call - call the function, pass the inputs, receive the validated return type. The underlying API call, parsing, and error handling are managed by the runtime. Retry behavior is available but opt-in, defined as an explicit policy in the .baml file rather than applied automatically. There is no boilerplate to maintain per endpoint.
Schema Aligned Parsing - BAML's Core Reliability Mechanism
Most structured output approaches rely on either JSON mode (asking the model to emit valid JSON) or function/tool calling (structured prompting that constrains the output format at the API level). Both of these approaches have the same failure mode: when the model output does not conform, parsing fails.
Without BAML, that failure looks like: model returns slightly malformed JSON, the parser throws, the application retries, the model might produce the same output again, and the request either surfaces an error or silently falls back. With BAML, that same malformed output goes through SAP, which extracts the structured data the model clearly intended to produce, and returns a typed object to the application - no retry required.
Schema Aligned Parsing - SAP - takes a different approach. Rather than requiring the model output to be valid JSON before interpretation begins, BAML's parser extracts structured data from whatever the model actually returns, using the declared schema as a guide for what to look for.
Consider what SAP actually handles in practice. A model that wraps its JSON in a markdown code fence - common with instruction-tuned models - would break a strict JSON parser. SAP strips the fences. A model that emits trailing commas or unquoted string values - technically invalid JSON - would fail JSON.parse. SAP corrects them. A reasoning model that outputs chain-of-thought text before the structured object would confuse most parsers. SAP identifies where the structured content begins and parses from there. An enum value returned in a different capitalisation or with surrounding punctuation gets normalised against the declared enum values in the schema.
What SAP does not do is hallucinate missing data. If the model completely omits a required field and there is no recoverable signal in the output, BAML reports a parse failure. The mechanism is about recovery, not invention. The practical result is a substantial reduction in false-negative parse failures - cases where the model actually produced the right conceptual answer but in a form that strict JSON parsing would reject.
This is the technical core of BAML's reliability claim, and it is a real engineering distinction from approaches that rely entirely on the model's ability to produce valid JSON every time.
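To make the class of recovery concrete, here is a toy Python sketch of two of the failure modes described above - markdown fences and trailing commas. This is purely illustrative: BAML's actual SAP engine is schema-driven, written in Rust, and handles far more cases; none of the names below are BAML APIs.

```python
import json
import re

FENCE = "`" * 3  # a literal markdown code fence, built indirectly

def lenient_parse(raw: str) -> dict:
    # Toy illustration of the *kind* of recovery SAP performs:
    # strip a surrounding code fence, then drop trailing commas.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)\s*" + FENCE, raw, re.DOTALL)
    if fenced:
        raw = fenced.group(1)
    # Remove trailing commas before a closing brace or bracket.
    raw = re.sub(r",\s*([}\]])", r"\1", raw)
    return json.loads(raw)

# Typical instruction-tuned-model output: prose, fence, invalid commas.
messy = ("Here is the result:\n" + FENCE +
         'json\n{"name": "Ada", "skills": ["math",],}\n' + FENCE)
print(lenient_parse(messy))  # {'name': 'Ada', 'skills': ['math']}
```

A strict `json.loads` on this input throws immediately; the recovery layer turns a retry-worthy failure into a successful parse, which is the core of SAP's value proposition.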
Prompt Authoring with Jinja Templating
BAML uses Jinja-style syntax for prompt construction - powered by Minijinja, a Rust-native template engine implementing the Jinja templating language - which brings a mature, well-understood templating model into a space where most alternatives are either string concatenation or ad-hoc formatting functions.
The practical benefits are cleaner than they sound. Dynamic context injection - passing a list of documents, a user's history, or a set of retrieved chunks - is expressed as a loop in the template, not as string building in application code. Chat role separation (system prompt, user turn, assistant turn) is handled inline via role macros directly in the template - _.role("system"), _.role("user") - rather than being assembled through data structures outside the prompt. Conditional prompt logic, like including an extended set of instructions only when a particular flag is set, reads like a template rather than a maze of conditional string appends.
The alternative - building prompts through f-strings or concatenation - works until it does not. When prompts reach several hundred tokens with dynamic sections, the only way to debug them is to log the final assembled string and manually reconstruct how it was built - which requires understanding the application code that generated it, not the prompt itself. In BAML, the prompt template is the source of truth and can be inspected, versioned, and tested directly. The Jinja layer also makes it straightforward to separate prompt structure from the data flowing into it, which helps when iterating on prompt content without touching application logic.
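The templating pattern can be sketched with Python's jinja2 library. Note the analogy is approximate: BAML uses Minijinja with its own context objects (such as `ctx` and `_.role`), so the code below only demonstrates the loop-and-variable style, not BAML's template environment.

```python
# Dynamic context injection as a template loop, not string building.
from jinja2 import Template

prompt = Template(
    "{% for doc in documents %}"
    "Document {{ loop.index }}: {{ doc }}\n"
    "{% endfor %}"
    "Answer the question using only the documents above.\n"
    "Question: {{ question }}"
)

rendered = prompt.render(
    documents=["BAML is a DSL.", "SAP recovers malformed output."],
    question="What is BAML?",
)
print(rendered)
```

The retrieved chunks flow in as data; the template alone defines the prompt's shape, which is what makes the rendered prompt inspectable and versionable on its own.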
Unions and Dynamic Types
BAML's type system supports union types - the ability to declare that a field or return value could be one of several distinct schemas. A model that might return either a SearchResult or an ErrorResponse depending on the query can express that distinction in the schema definition rather than through runtime inspection of the output.
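The payoff of a declared union is that callers branch on type rather than inspecting raw keys. A minimal Python sketch of that idea, with illustrative names (`SearchResult`, `ErrorResponse`, `parse_reply` are not BAML APIs - in BAML, SAP performs this discrimination against the declared union):

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class SearchResult:
    title: str
    url: str

@dataclass
class ErrorResponse:
    message: str

def parse_reply(data: dict) -> Union[SearchResult, ErrorResponse]:
    # Discriminate between the two declared shapes.
    if "message" in data:
        return ErrorResponse(message=data["message"])
    return SearchResult(title=data["title"], url=data["url"])

reply = parse_reply({"message": "no results found"})
if isinstance(reply, ErrorResponse):
    print(reply.message)  # no results found
```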
Dynamic types solve a related but different problem. Unions work when the possible schemas are known at compile time. When the schema itself depends on data that only exists at runtime - categories pulled from a database, fields defined by user configuration, or tenant-specific structures - BAML provides a @@dynamic annotation on the type definition and a TypeBuilder API in the generated client. At runtime, application code uses TypeBuilder to add fields or enum variants before making the call, and the parser uses the extended schema to interpret the response.
A concrete example that illustrates both: an extraction pipeline where the possible document types (invoice, contract, medical record) are fixed and known - that is a union, declared once in the .baml file. If those document types and their fields are instead loaded from a database schema at request time, that is where @@dynamic and TypeBuilder come in. The distinction matters: unions are a schema design choice, dynamic types are a runtime extension mechanism.
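The shape of the runtime-extension idea can be sketched in plain Python. Everything here is a stand-in - the fake DB lookup, the `classify` helper - meant only to show why a compile-time enum cannot express this; BAML's actual mechanism is the `@@dynamic` annotation plus the generated TypeBuilder API.

```python
# The set of valid categories is only known at request time.
def load_categories_from_db() -> list[str]:
    # Placeholder for a tenant-specific lookup.
    return ["invoice", "contract", "medical_record"]

def classify(raw_label: str, allowed: list[str]) -> str:
    # Normalise the model's answer against the runtime-extended enum.
    cleaned = raw_label.strip().lower().replace(" ", "_")
    if cleaned not in allowed:
        raise ValueError(f"model returned unknown category: {raw_label!r}")
    return cleaned

allowed = load_categories_from_db()
print(classify(" Medical Record ", allowed))  # medical_record
```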
Token Efficiency
BAML's schema-aware prompting tends to produce shorter system instructions than equivalent prompt engineering done by hand. Because the output structure is declared in the schema and the runtime handles parsing flexibility, prompts do not need extensive instructions about output formatting, JSON validity, or field naming conventions. Those concerns are handled at the tooling layer. For high-volume applications where token costs are meaningful, this reduction in system prompt overhead accumulates.
Semantic Streaming and Generative UI
LLM responses arrive token by token. In a chat interface, streaming the raw text is straightforward. In a structured output pipeline, streaming creates a problem: the output is not parseable until it is complete, so the application has to buffer everything, parse at the end, and only then update the UI. This introduces latency from the user's perspective - the model is working, but nothing is happening on screen.
BAML's semantic streaming solves this by parsing the output incrementally as tokens arrive. Because the parser knows the expected schema, it can identify which field is being populated as the stream progresses. Streaming attributes on schema fields give developers explicit control over atomicity - a field can be configured to surface only when fully complete, or to stream token-by-token as a partial value, depending on what makes sense for the UI.
This enables a pattern often called Generative UI - rendering partial structured data into meaningful interface components as the model generates the response. An interface showing a list of extracted line items from a document does not need to wait for all line items to load simultaneously. Each item can appear as it is parsed. A dashboard that displays model-extracted analytics fields can populate each card progressively rather than flipping from empty to complete.
The mechanism is not unique to any particular UI framework - it is a property of the streaming parser that the generated client exposes. Applications consuming the stream receive typed partial objects they can render directly.
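A toy version of the latency win can be shown with a generator that surfaces each completed object in a streaming JSON array as soon as its closing brace arrives. This is deliberately naive - it counts braces and would break on braces inside strings - whereas BAML's parser is schema-aware and handles partial values; the sketch only demonstrates why a UI need not wait for the full response.

```python
import json

def stream_items(chunks):
    # Yield each top-level {...} object the moment it is complete.
    buf, depth, start = "", 0, None
    for chunk in chunks:
        for ch in chunk:
            buf += ch
            if ch == "{":
                if depth == 0:
                    start = len(buf) - 1
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0 and start is not None:
                    yield json.loads(buf[start:])
                    start = None

# Tokens arrive in arbitrary fragments, as they would from a model.
chunks = ['[{"item": "pe', 'n", "qty": 2}', ', {"item": "ink"', ', "qty": 1}]']
for obj in stream_items(chunks):
    print(obj)  # each line item appears as soon as it is complete
```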
Testing in BAML
BAML includes a testing layer that allows declaring test cases directly in .baml files alongside the function definitions they test. A test case specifies the input and optionally assertions about specific field values or structural properties of the result, using @@assert expressions evaluated against the actual model output.
Tests run against live model APIs, either through the VSCode playground interactively or via baml-cli test from the command line. The CLI runner makes it straightforward to integrate BAML tests into CI pipelines, running them selectively on merge or on a scheduled basis.
The tooling also includes a playground - PromptFiddle - that surfaces prompt rendering, model output, and parse results interactively. This shortens the iteration loop on prompt changes considerably compared to editing, deploying, and inspecting logs.
Observability - Logging and Tracing
BAML provides structured trace data for every function call through a Collector API: the rendered prompt, the raw model response, the parsed output, timing, and token usage are all accessible by attaching a collector to a function call. This data can be pushed to Boundary Cloud for production dashboards and alerting, or routed to an external observability system.
For teams already using LLM observability tools like Langfuse (I have not used this!) or similar OpenTelemetry-compatible platforms, BAML's trace events integrate through standard logging hooks. The key value is that traces include the pre-parsing and post-parsing representations side by side - which makes it possible to distinguish whether a failure is a model issue (the model produced conceptually wrong output) or a parsing boundary issue (the model produced the right answer in a form the parser could not handle). That distinction matters when deciding whether to adjust the prompt, the schema, or the model configuration.
Where BAML Fits in a RAG Pipeline and with Agent Frameworks
A typical RAG pipeline has several identifiable layers: retrieval (vector search, keyword search, or hybrid), context assembly (chunking, ranking, formatting), model invocation (the API call), and response handling (parsing, post-processing, returning to the caller).
BAML operates at the model invocation and response handling layers. It does not replace a vector database, a retrieval library like LlamaIndex, or a reranking model. It does not manage document ingestion or embedding generation. BAML does not make retrieval better; it makes the interface between retrieval and generation reliable. What it replaces is the ad-hoc code that sits between the API call and the application: prompt construction, output parsing, retry logic, and client generation.
In a RAG system, BAML would typically receive the assembled context - the retrieved chunks, formatted by the application layer - as input to a BAML function. The function template injects that context into the prompt, calls the model, and returns a typed result to the application. The retrieval and chunking infrastructure remains unchanged.
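The division of labour above can be sketched as follows. `retrieve` and `Answer` are placeholders, and `answer_from_context` is a plain-Python stand-in for a generated BAML client function (in a real pipeline it would be a typed call on the generated client, with prompt rendering, parsing, and retries handled by the runtime) - none of these names are BAML's actual API.

```python
from dataclasses import dataclass

@dataclass
class Answer:
    text: str
    sources: list[str]

def retrieve(query: str) -> list[str]:
    # Placeholder for vector/keyword search - unchanged by BAML.
    return ["BAML generates typed clients.", "SAP recovers malformed output."]

def answer_from_context(question: str, chunks: list[str]) -> Answer:
    # Stand-in for a BAML-generated function call: takes assembled
    # context as input, returns a validated, typed result.
    return Answer(text=f"Answered: {question}", sources=chunks)

result = answer_from_context("What is BAML?", retrieve("What is BAML?"))
print(result.text)
```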
For agent frameworks - the Claude Agent SDK, LangGraph, Autogen, or similar orchestration tools - BAML serves a similar role. Agent frameworks handle tool registration, loop control, state management, and multi-step planning. BAML-backed functions sit outside that loop as callable tools - the framework invokes them the same way it would any other tool, and BAML handles the structured output guarantees for that specific call. They are not alternatives; they operate at different layers. The combination is particularly useful when tools need to return strongly typed structured data that downstream steps in the agent depend on, rather than freeform text that the orchestrator has to interpret.
What to Do Next
The BAML playground at https://www.promptfiddle.com/ runs entirely in the browser - no installation, no API key setup. It is a good place to experiment with the DSL syntax and see how SAP handles malformed model output before committing to local setup. A broader set of working examples covering extraction, classification, streaming, and agent integration is available at https://baml-examples.vercel.app/.
The documentation at docs.boundaryml.com covers installation, the DSL reference, and integration guides for the major model providers. The thing worth evaluating specifically is SAP behavior under the failure cases that already exist in a current system - feed BAML the actual bad outputs that are currently causing parsing failures and observe how the recovery layer handles them. That test is more informative than any benchmark.
As LLM systems move from prototype to infrastructure, the cost of unreliable parsing compounds. BAML represents a considered answer to where that reliability boundary should live - not in the model, not in retry loops, but in a deterministic layer between them.
Sample GitHub Repository
Resources
These are the resources and links I used to learn more:
- https://docs.boundaryml.com/home
- https://docs.boundaryml.com/guide/comparisons/baml-vs-pydantic
- https://github.com/BoundaryML/baml
- https://github.com/567-labs/instructor
- https://ai.pydantic.dev/#next-steps
- https://docs.boundaryml.com/guide/introduction/what-is-baml#demo-video
- https://thedataquarry.com/blog/baml-and-future-agentic-workflows/
- https://thedataquarry.com/blog/baml-is-building-blocks-for-ai-engineers/
- https://youtu.be/leDdmneq2UA?si=1cjuko9ZMnbuWOmC
- https://towardsai.net/p/machine-learning/the-prompting-language-every-ai-engineer-should-know-a-baml-deep-dive - good deep dive
- https://gradientflow.com/seven-features-that-make-baml-ideal-for-ai-developers/
- https://youtu.be/XDZ5i7hWgaI?si=0_8ZbalUbvyMpmYe
- https://www.youtube.com/watch?v=XwT7MhT_BEY
Sample projects that I found while exploring:
- https://github.com/latlan1/baml-pdf-parsing
- https://github.com/kargarisaac/Hekmatica
- https://github.com/kuzudb/baml-kuzu-demo
Try out BAML:

