The Problem Changed
The first version of Iteration Layer was written in TypeScript. That was the obvious choice at the time. The product looked like a normal web app with a normal API surface: accept a request, call a model or processing library, return a response.
That shape did not last.
Content-processing infrastructure does not behave like a CRUD app once people start using it for real work. A single user upload can turn into document ingestion, OCR, schema extraction, image transformation, document generation, spreadsheet output, retries, confidence checks, and webhook delivery. Some requests are quick. Some are slow. Some fail halfway through because a PDF is malformed, an image is huge, a model call times out, or a downstream step needs to be retried without duplicating work.
The problem stopped looking like "build an API around AI models." It started looking like a distributed systems problem.
So we chose Elixir for the platform we launched.
Not because TypeScript is bad. Not because Elixir is magic. We rebuilt because the operational shape of the product changed, and Elixir gave us better primitives for the system we were actually building.
TypeScript Was Good for the First Version
TypeScript was the right way to get the first product into the world. The ecosystem is excellent for API development, SDK generation, frontend work, and shipping quickly. A small team can move fast because every part of the stack speaks the same language.
For the earliest version, that mattered more than anything else. We needed to prove that developers wanted composable content-processing APIs. We needed to build the first document extraction, image transformation, and generation workflows. We needed to learn where the architecture would break before choosing the launch stack.
TypeScript helped with that. But once the product shape was clear, we chose the runtime we wanted to operate for years before putting Iteration Layer in front of users.
The problems appeared later, when the product moved from individual API calls to complete workflows. The hard part was no longer "can we expose this operation over HTTP?" The hard part was "can we run many long-lived, failure-prone, multi-step jobs at the same time and keep the system understandable when something breaks?"
Those are different engineering problems.
AI Workflows Are Mostly Coordination
Most AI infrastructure discussions focus on models. Which model is best. Which prompt format works. Which provider has the lowest latency.
Those choices matter, but they are not where most of the engineering time goes in a production workflow.
The engineering time goes into coordination:
- Accepting files from different sources without blocking request handlers
- Running CPU-heavy and I/O-heavy steps without starving the rest of the system
- Retrying transient failures without duplicating side effects
- Timing out slow steps without losing the job state
- Isolating one customer's bad batch from everyone else
- Reporting progress and errors in a way humans can act on
- Keeping memory bounded when files are large or many jobs run at once
A simple request like "extract this invoice and generate a summary report" hides a full workflow. The visible product is one API call. The system underneath is a set of supervised processes coordinating file handling, external calls, parsing, validation, billing, logging, and cleanup.
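As a rough sketch of what that one call hides, with hypothetical module names standing in for the real steps, the coordination in Elixir tends to read as a chain of tagged results:

```elixir
defmodule IterationLayer.Workflow do
  # Each step returns {:ok, value} or {:error, reason}; a failing step
  # short-circuits the chain and the error is returned to the caller,
  # which decides whether to retry, report, or give up.
  def process_invoice(upload) do
    with {:ok, file}   <- Ingest.fetch(upload),
         {:ok, text}   <- OCR.extract(file),
         {:ok, fields} <- Schema.validate(text),
         {:ok, report} <- Reports.generate(fields) do
      Webhooks.deliver(report)
    end
  end
end
```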
TypeScript can do this. We did do this. But the more coordination logic we wrote, the more we were rebuilding ideas that the BEAM has treated as first-class concerns for decades.
Supervision Is a Product Feature
The biggest mindset shift was supervision.
In many web stacks, a crashed worker is treated as an exceptional event. You add process managers, queues, retry libraries, health checks, and alerting around the application to keep it alive. The language runtime is not where most of that operational model lives.
In Elixir, supervision is part of the design vocabulary. Processes are cheap. Failure is expected. The question is not "how do we prevent anything from crashing?" It is "what should restart, what state should survive, and what should be isolated?"
That maps directly to content-processing workflows.
If a single document extraction job fails because the input is corrupted, that failure should not affect other jobs. If a temporary provider error hits one step, that step should retry with clear limits. If a background worker crashes, it should restart without taking the API down. If a queue backs up, the system should degrade predictably instead of turning into a pile of stuck promises and mystery memory growth.
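A minimal supervision sketch, with illustrative names rather than our actual tree, shows how that isolation is expressed directly in code:

```elixir
defmodule IterationLayer.Processing.Supervisor do
  use Supervisor

  def start_link(opts) do
    Supervisor.start_link(__MODULE__, opts, name: __MODULE__)
  end

  @impl true
  def init(_opts) do
    children = [
      # each extraction job runs as its own supervised task,
      # so one corrupted input cannot take sibling jobs down
      {Task.Supervisor, name: IterationLayer.JobTasks},
      # a long-lived worker that restarts on crash without touching the API
      IterationLayer.WebhookDispatcher
    ]

    # :one_for_one restarts only the child that failed
    Supervisor.init(children, strategy: :one_for_one)
  end
end
```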
Supervision is not an implementation detail when reliability is part of the product. Customers do not care whether the system used a clever retry helper. They care that their invoice batch does not disappear at 2 AM because one malformed PDF hit an edge case.
Concurrency Without Turning Everything Into a Framework
Iteration Layer runs workflows that mix different kinds of work.
Document extraction is file-heavy and model-heavy. Image transformation can be CPU-heavy. Document generation can involve rendering, fonts, layout, and binary output. Webhook delivery is network-heavy. Billing and usage tracking must be consistent even when processing steps fail.
These tasks do not all have the same latency profile, resource profile, or failure profile. Treating them as one generic background job category makes the system harder to reason about.
Elixir lets us model this more directly. A workflow can be broken into processes with clear responsibilities. Supervisors define what restarts. Queues define backpressure. Tasks can be isolated by operation type. The runtime gives us visibility into what is running instead of forcing every concern through the same event loop.
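As an illustration, assuming Oban for queuing and using made-up queue names and limits rather than our real numbers, per-operation backpressure comes down to a few lines of configuration:

```elixir
# config/config.exs — each operation type gets its own queue and concurrency cap,
# so a flood of image jobs cannot starve extraction or webhook delivery
config :iteration_layer, Oban,
  repo: IterationLayer.Repo,
  queues: [
    document_extraction: 10,
    image_transform: 4,
    document_generation: 6,
    webhook_delivery: 20
  ]
```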
This maps well to a platform built around composability.
Our APIs are designed to chain: extract data from a document, generate a spreadsheet, produce a PDF report, transform images for the final output. The product promise is not that one operation works in isolation. The promise is that the pipeline holds together when multiple operations run in sequence.
The runtime should make that easier, not harder.
Long-Running Work Needs a Different Shape
HTTP requests are a poor mental model for long-running AI work.
Some calls finish quickly. Others depend on document size, page count, file format, model latency, and downstream generation. A workflow may need to stream files, call external systems, wait for a model, transform output, and deliver a webhook. Forcing all of that into a request-response mindset creates pressure in the wrong places.
You start extending timeouts. Then you add background jobs. Then you add polling. Then you add status records. Then you add retries. Then you add cleanup. Eventually the real product is the workflow engine around the API call.
Elixir is a better fit for that shape. The request can validate input and enqueue work. The processing steps can run under supervision. State can move through a controlled lifecycle. Failures can be explicit instead of buried in logs. The web layer and the processing layer can live in the same application without pretending they are the same kind of work.
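A sketch of that split, using hypothetical module names and assuming Oban for the durable job layer: the controller only validates and enqueues, and everything slow happens under supervision elsewhere.

```elixir
defmodule IterationLayerWeb.ExtractionController do
  use IterationLayerWeb, :controller

  # The request path stays fast: validate input, persist a job, return a job id.
  # The processing itself runs in a background worker under supervision.
  def create(conn, params) do
    with {:ok, args} <- IterationLayer.Extractions.validate(params),
         {:ok, job} <- Oban.insert(IterationLayer.Workers.Extract.new(args)) do
      conn
      |> put_status(:accepted)
      |> json(%{job_id: job.id, status: "queued"})
    end
  end
end
```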
That last part is important. We do not want a separate microservice for every operation just to get isolation. We want a system where isolation is cheap enough to use inside the application.
The BEAM gives us that.
A Small Language Is Easier for Agents
Another reason that mattered more than we expected: Elixir is small and standardized.
Iteration Layer is built in the same era as AI coding agents. We use agents while building the product, and we expect more of the surrounding ecosystem to be created by agents over time: SDK examples, integration snippets, workflow definitions, MCP tool descriptions, test cases, and internal maintenance work.
That changes the value of a language and framework stack.
TypeScript gives you many ways to solve the same problem. That flexibility is useful for humans, but it can make agent-generated code noisy. One file uses one async pattern, another uses a different framework convention, another pulls in a helper library for something the runtime already covers. The language is not the problem. The surface area is.
Elixir gives agents fewer reasonable paths. Pattern matching, pipelines, modules, behaviours, supervision trees, Ecto changesets, Phoenix contexts, Oban jobs. The conventions are narrow enough that generated code is easier to review, easier to correct, and easier to align with the existing codebase.
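As an example of how narrow those paths are, here is the kind of Ecto schema and changeset an agent would be expected to produce; the table and field names are illustrative, not our real schema:

```elixir
defmodule IterationLayer.Extractions.Extraction do
  use Ecto.Schema
  import Ecto.Changeset

  schema "extractions" do
    field :source_url, :string
    field :status, :string, default: "queued"
    timestamps()
  end

  # There is roughly one idiomatic way to write this, which is what makes
  # generated code easy to review against the rest of the codebase.
  def changeset(extraction, attrs) do
    extraction
    |> cast(attrs, [:source_url, :status])
    |> validate_required([:source_url])
  end
end
```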
The product itself is agent-native. We ship MCP support because agents are becoming a normal way to build and operate workflows. The implementation stack should work well with the same tools we expect our users to adopt.
Rust Is the Escape Hatch
Content processing eventually hits work that should not be written in a high-level application language.
PDF parsing, image processing, layout rendering, format conversion, and model-adjacent operations can cross into CPU-heavy or library-heavy territory. Sometimes the right answer is not to force everything into the web stack. Sometimes the right answer is Rust.
Elixir makes that boundary practical through NIFs, especially with Rustler. We can keep orchestration, supervision, queues, billing, API shape, and workflow state in Elixir while moving performance-sensitive or systems-level code into Rust when needed.
This split is cleaner than pretending one language should do everything. Elixir coordinates. Rust handles the parts where memory control, native libraries, or raw performance matter. The BEAM still owns the reliability model around that work.
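A sketch of what the Elixir side of that boundary looks like with Rustler; the module, crate, and function names are hypothetical:

```elixir
defmodule IterationLayer.Native.Pdf do
  # Rustler compiles the named Rust crate and swaps this module's stubs
  # for the native implementations when the NIF loads.
  use Rustler, otp_app: :iteration_layer, crate: "pdf_native"

  # Overridden by the NIF at load time; this clause only runs if loading fails.
  def parse(_pdf_binary), do: :erlang.nif_error(:nif_not_loaded)
end
```

Orchestration still calls an ordinary Elixir function; only the body runs in Rust.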
For Iteration Layer, that is important because the product spans formats and workloads. A malformed PDF, a strange image format, or a layout edge case should not force an architectural rewrite. We need an escape hatch that fits the platform, not a sidecar that becomes a second system to operate.
The AI Tooling Is Better Than People Assume
Elixir is not usually the first language people mention for AI infrastructure. Python owns the research ecosystem. TypeScript owns a lot of application-layer AI tooling. But the Elixir ecosystem is stronger than it looks when the job is production AI workflows.
The useful pieces are standardized around the same runtime model as the rest of the application. Nx and EXLA give us tensor and compiler foundations. Ortex gives us a path for ONNX Runtime. Bumblebee gives the ecosystem a standard shape for model loading and inference. Oban handles durable job execution. Phoenix gives us the API and web layer. OTP gives us supervision under all of it.
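As a small example of the Nx side, assuming EXLA is configured as the default compiler, numerical code stays inside the application as ordinary modules rather than a separate service:

```elixir
defmodule IterationLayer.Scoring do
  import Nx.Defn

  # Numerically stable softmax over a vector of logits; with EXLA configured,
  # defn compiles this to a native kernel instead of interpreting it.
  defn softmax(logits) do
    shifted = logits - Nx.reduce_max(logits)
    exp = Nx.exp(shifted)
    exp / Nx.sum(exp)
  end
end
```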
For our use case, this is more useful than chasing the largest model playground. Iteration Layer is not a notebook. It is a production system that needs to run document, image, generation, and agent workflows for customers without turning every capability into a separate service.
The Elixir stack lets us keep those concerns in one coherent system for longer.
Operational Simplicity Matters More Than Language Preference
A rewrite is expensive. The only good reason to do one is if it removes a class of problems that kept coming back.
For us, the recurring problems were operational:
- Too much coordination logic lived in application code
- Worker failure behavior was harder to reason about than it should have been
- Long-running jobs needed clearer lifecycle boundaries
- Different operation types needed different concurrency limits
- Observability needed to follow workflows, not just HTTP requests
Elixir did not remove the need to design those systems. It gave us better defaults for designing them.
Phoenix gives us a boring, productive web layer. Ecto gives us a clear database layer. Oban gives us durable jobs inside Postgres without adding another moving part at our current scale. OTP gives us the supervision model underneath the parts that should stay alive, restart, or isolate failure.
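A sketch of what a durable job looks like with Oban, using an illustrative worker name; the job row lives in Postgres next to the rest of the application's state:

```elixir
defmodule IterationLayer.Workers.Extract do
  use Oban.Worker, queue: :document_extraction, max_attempts: 5

  # The job is persisted before it runs, so a deploy or crash does not lose it;
  # Oban retries failed attempts until max_attempts is exhausted.
  @impl Oban.Worker
  def perform(%Oban.Job{args: %{"document_id" => document_id}}) do
    IterationLayer.Extractions.run(document_id)
  end
end
```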
The result fits the way Iteration Layer is built: self-serve, API-first, workflow-oriented, and operationally lean.
Where Elixir Fits the Product
The product we are building is not a single-purpose OCR API. It is a composable content-processing platform. Customers use one auth token, one credit pool, and one API style to chain operations across documents, images, websites, spreadsheets, and generated outputs.
That positioning has technical consequences.
If a customer only needs one isolated operation, a specialized tool may be better. A point OCR service can go deeper on one document type. A dedicated image CDN can go deeper on global delivery. A cloud giant can offer a wider marketplace.
Iteration Layer wins when the workflow spans multiple steps and the customer wants fewer vendors, fewer credentials, fewer billing systems, and fewer glue layers.
Elixir is a good fit for that bet because it is built around systems that keep running. The language encourages explicit boundaries, message passing, supervision trees, and small processes that do one job. Those ideas match a platform where the product is the pipeline, not any one API call.
Why This Is the Launch Stack
We launched on Elixir because AI infrastructure becomes distributed systems infrastructure faster than you expect.
At the prototype stage, the model call looks like the product. In production, the model call is one step in a longer pipeline. The hard parts become backpressure, retries, observability, isolation, cleanup, data boundaries, and failure handling.
Content workflows make those problems obvious. Files are messy. PDFs are strange. Images arrive in formats nobody tested. Documents have missing fields, bad scans, unexpected layouts, and tables that break assumptions. The platform has to absorb that mess without making every customer rebuild the same operational shell around each API.
Elixir does not make those problems disappear. It gives us a runtime, language, and ecosystem that match them: supervised processes, explicit concurrency, durable jobs, a practical Rust boundary, and standard AI tooling that fits inside the application.
That is why Iteration Layer launched on Elixir, not the original TypeScript stack.
Get Started
If you are building workflows that extract data, transform files, generate outputs, or connect tools through MCP, try Iteration Layer with a free account. Start with one API, then add the next step when the workflow needs it. The APIs share authentication, credits, response patterns, and error handling so the second step does not become another integration project.
The docs cover the API surface, the MCP guide covers agent workflows, and the n8n guide covers automation-builder workflows.