DEV Community: Lena Hoffmann

Holding Many Modalities in One Session: A Parts-Based API Structure

Lena Hoffmann — Fri, 24 Jul 2026 19:05:07 +0000

Multimodal APIs are usually documented one modality at a time — here is image input, here is audio, here is streaming text. What the docs rarely cover is how to hold them in one coherent session, which is what an actual application needs.

Parts, not endpoints

The mental shift that made this tractable: stop thinking in endpoints, start thinking in typed parts. A request is an ordered list where each element declares its own type:

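// One request is a Part[]; `kind` is the discriminant a handler switches on.
type Part =
  | { kind: 'text';  text: string }
  | { kind: 'image'; mime: string; data: Uint8Array | { uri: string } }  // inline bytes or an uploaded reference
  | { kind: 'audio'; mime: string; data: Uint8Array }                    // inline bytes only
  | { kind: 'video'; mime: string; uri: string };                        // always passed by reference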

Once input is a part list, interleaving stops being a special case. "Here are two images, a question about them, and an audio clip for tone reference" is just a four-element array — not three API calls you have to correlate afterwards.
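To make the four-element example concrete, here is a minimal sketch of that request as data. The byte values, the example.com URI, and the `describe` helper are placeholders invented for illustration; only the `Part` shape comes from the post.

```typescript
// Same Part shape as above, repeated so the snippet stands alone.
type Part =
  | { kind: 'text'; text: string }
  | { kind: 'image'; mime: string; data: Uint8Array | { uri: string } }
  | { kind: 'audio'; mime: string; data: Uint8Array }
  | { kind: 'video'; mime: string; uri: string };

// Placeholder bytes; a real app would read these from a file or an upload.
const photoA = new Uint8Array([0xff, 0xd8, 0xff]);
const voiceClip = new Uint8Array([0x52, 0x49, 0x46, 0x46]);

// Two images, a question about them, and an audio clip: one ordered array.
const request: Part[] = [
  { kind: 'image', mime: 'image/jpeg', data: photoA },
  { kind: 'image', mime: 'image/png', data: { uri: 'https://example.com/b.png' } },
  { kind: 'text', text: 'Which of these two photos is better lit?' },
  { kind: 'audio', mime: 'audio/wav', data: voiceClip },
];

// Hypothetical helper: the `kind` discriminant lets the compiler narrow each case.
function describe(part: Part): string {
  switch (part.kind) {
    case 'text':
      return `text(${part.text.length} chars)`;
    case 'image':
      return part.data instanceof Uint8Array
        ? `image(inline ${part.data.length} B)`
        : `image(ref ${part.data.uri})`;
    case 'audio':
      return `audio(inline ${part.data.length} B)`;
    case 'video':
      return `video(ref ${part.uri})`;
  }
}

const summary: string[] = request.map(describe);
```

Because order is part of the array, the model sees the images before the question that refers to them, with no correlation step afterwards.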

Inline versus reference, and why it bites

Small assets go inline as base64. Large ones must be uploaded first and passed by URI. The threshold is provider-specific and the failure mode is bad: exceed it and you get an opaque 400, not a helpful "too large, upload it instead."

It is worth building a size check into your part constructor so that it routes automatically, rather than discovering the limit in production.

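Also: base64 inflates payloads by roughly a third, because every 3 raw bytes become 4 encoded characters. If the limit applies to the encoded request, a file has to be at most about three quarters of that limit to go inline, so "small enough to inline" is meaningfully smaller than you assume.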
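A minimal sketch of that size check, with the base64 overhead included. The 4 MiB figure is an assumption picked for illustration, not any provider's documented limit, and `chooseTransport` and `maxInlineRawBytes` are hypothetical names. Real limits often apply to the whole request body, so treat the per-asset check as a lower bound on caution.

```typescript
// Length in characters of padded base64 for `rawBytes` bytes:
// every group of up to 3 bytes becomes exactly 4 characters.
function base64Length(rawBytes: number): number {
  return 4 * Math.ceil(rawBytes / 3);
}

type Transport = 'inline' | 'reference';

// ASSUMPTION: a 4 MiB cap on the encoded inline payload. Check your provider's docs.
const INLINE_LIMIT_BYTES = 4 * 1024 * 1024;

// Decide on the encoded size, not the raw size, so the limit is never a surprise.
function chooseTransport(rawBytes: number, limit: number = INLINE_LIMIT_BYTES): Transport {
  return base64Length(rawBytes) <= limit ? 'inline' : 'reference';
}

// Largest raw file that still fits under `limit` once encoded.
function maxInlineRawBytes(limit: number = INLINE_LIMIT_BYTES): number {
  return Math.floor(limit / 4) * 3;
}
```

With a 4 MiB encoded limit, the largest raw file that can go inline is 3 MiB; one byte more and the part constructor should upload and pass a URI instead.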

Streaming breaks the tidy model

Text streams token by token. Images arrive whole. Audio may stream as chunks. So your response handler cannot assume a uniform shape — it needs to be a state machine over part-typed events, accumulating text deltas while treating an image part as atomic.

The bug I hit repeatedly: treating every stream event as appendable text, which quietly corrupts binary parts.
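One way to avoid that bug is a small reducer keyed on event type. The event shapes below (`text_delta`, `audio_chunk`, `image`, `done`, each with a part `index`) are hypothetical and invented for this sketch; real providers frame their streams differently, so this would sit behind an adapter.

```typescript
// HYPOTHETICAL event shapes; no provider uses exactly these.
type StreamEvent =
  | { type: 'text_delta'; index: number; text: string }
  | { type: 'audio_chunk'; index: number; mime: string; bytes: Uint8Array }
  | { type: 'image'; index: number; mime: string; bytes: Uint8Array }
  | { type: 'done' };

type OutputPart =
  | { kind: 'text'; text: string }
  | { kind: 'audio'; mime: string; chunks: Uint8Array[] }
  | { kind: 'image'; mime: string; data: Uint8Array };

interface StreamState {
  parts: (OutputPart | undefined)[]; // slot i holds output part i
  finished: boolean;
}

function initialState(): StreamState {
  return { parts: [], finished: false };
}

// Applies one event, mutating and returning the state.
function applyEvent(state: StreamState, ev: StreamEvent): StreamState {
  if (state.finished) throw new Error('event received after done');
  if (ev.type === 'done') {
    state.finished = true;
    return state;
  }
  const prev = state.parts[ev.index];
  switch (ev.type) {
    case 'text_delta':
      // Text is the only kind that is appended as a string.
      if (prev === undefined) state.parts[ev.index] = { kind: 'text', text: ev.text };
      else if (prev.kind === 'text') prev.text += ev.text;
      else throw new Error(`part ${ev.index} is ${prev.kind}, not text`);
      break;
    case 'audio_chunk':
      // Audio chunks are kept as separate byte arrays, never string-concatenated.
      if (prev === undefined) state.parts[ev.index] = { kind: 'audio', mime: ev.mime, chunks: [ev.bytes] };
      else if (prev.kind === 'audio') prev.chunks.push(ev.bytes);
      else throw new Error(`part ${ev.index} is ${prev.kind}, not audio`);
      break;
    case 'image':
      // Images are atomic: a second event for the same slot is an error.
      if (prev !== undefined) throw new Error(`part ${ev.index} already exists`);
      state.parts[ev.index] = { kind: 'image', mime: ev.mime, data: ev.bytes };
      break;
  }
  return state;
}
```

Throwing on a kind mismatch is a deliberate choice here: a loud failure during development is cheaper than a silently corrupted image.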

Context accumulation is the real cost

Multi-turn multimodal sessions grow fast. Every image stays in context and keeps costing tokens on every subsequent turn.

Two things that helped:

Summarise old turns — after N turns, replace early image parts with a text description of what they were
Explicit pinning — let the user mark which assets stay in context, and drop the rest

Without either, a ten-turn session with images becomes surprisingly expensive, and the cost is invisible until the bill arrives.
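Both strategies can live in one compaction pass run before each request. In this sketch the `TrackedPart` wrapper, its `id` and `caption` fields, and `compactHistory` are all hypothetical; it assumes a caption was produced (for example by an earlier model call) when the image first entered the session.

```typescript
type Part =
  | { kind: 'text'; text: string }
  | { kind: 'image'; mime: string; data: Uint8Array | { uri: string } }
  | { kind: 'audio'; mime: string; data: Uint8Array }
  | { kind: 'video'; mime: string; uri: string };

// HYPOTHETICAL wrapper: an id to pin by, and a caption written when the asset was first seen.
interface TrackedPart {
  id: string;
  part: Part;
  caption?: string;
}

interface Turn {
  role: 'user' | 'model';
  parts: TrackedPart[];
}

// Keeps the last `keepRecentTurns` turns untouched. In older turns, pinned
// images stay, unpinned images become their caption, and unpinned images
// without a caption are dropped. Non-image parts are left alone.
function compactHistory(turns: Turn[], keepRecentTurns: number, pinned: Set<string>): Turn[] {
  const cutoff = turns.length - keepRecentTurns;
  return turns.map((turn, i) => {
    if (i >= cutoff) return turn;
    const parts: TrackedPart[] = [];
    for (const tp of turn.parts) {
      if (tp.part.kind !== 'image' || pinned.has(tp.id)) {
        parts.push(tp);
      } else if (tp.caption !== undefined) {
        parts.push({ id: tp.id, part: { kind: 'text', text: `[image omitted: ${tp.caption}]` } });
      }
    }
    return { role: turn.role, parts };
  });
}
```

The function returns a new history rather than mutating the stored one, so the full session can still be shown to the user while the model sees only the compacted version.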

I put this architecture into practice at GeminiOmni — image editing, video generation, and live chat sharing one session model.

Caveat

Part-list APIs are converging across providers but are not identical. This structure ports with modest adapter work — it is not free.

One Session, Many Modalities: Structuring Multimodal API Calls

Lena Hoffmann — Fri, 24 Jul 2026 19:02:51 +0000

What I Learned Building a Multimodal AI Studio Solo on Gemini + Veo

Lena Hoffmann — Wed, 10 Jun 2026 12:44:08 +0000

I spent a weekend wiring Google's Gemini and Veo APIs into a single app just to feel where the edges of multimodal AI actually are. It turned into a small studio I now use daily, and along the way I learned more about these models from plumbing them than from any paper. Here's the honest technical debrief.

Three pipelines, three completely different problems

I wanted one prompt box that could do video, image editing, and document Q&A. Naively I assumed they'd share most of the stack. They don't.

1. Image-to-video: the enemy is time, not pixels

Generating one good frame is largely solved. Video is about temporal coherence: frame 13 must agree with frame 12 or you get flicker and identity drift. Modern video models treat the clip as one object in space and time (latent diffusion over a width × height × time volume, with spatiotemporal attention) rather than as, say, 120 independent images. Conditioning on a reference image as the first frame is what makes image-to-video feel controlled: you've handed the model a strong anchor and asked it to extrapolate motion, not invent a world.

The surprise: native audio sync (Veo 3.1 generating clip + soundtrack jointly) does more for perceived realism than another notch of resolution. A door slam landing on the exact frame the door shuts is uncanny in a good way.

2. Instruction-based image editing: preservation is the hard part

Generating is unconstrained; editing must change one thing and preserve everything else. The approach is to condition the diffusion model on both the instruction and the source image's latents, cross-attend to the instruction so that it steers only the referenced region, and bias hard toward preserving the unedited latents. Set that preservation bias too weak and the subject's face quietly morphs across edits, the classic 'character consistency' failure that makes or breaks storytelling use cases.

3. PDF chat: it's retrieval, not a long context

The naive 'paste the whole PDF' approach dies on long files (models get lost in the middle) and costs you the full document every turn. The version that works is a tiny RAG pipeline: chunk with overlap that respects structure, embed chunks into a vector index, retrieve the few nearest passages per question, and ground the answer in only those passages with a citation. Half the real work is just parsing hostile PDFs (multi-column, scanned, tables) into clean ordered text before any model sees it.
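The chunk-and-retrieve core of that pipeline can be sketched in a few functions. This is a simplification under stated assumptions: chunks are fixed-size character windows (the structure-aware splitting the post recommends is left out), and `toyEmbed` is a keyword-count stand-in for a real embedding model, used only so the example runs without a network call. All function names are hypothetical.

```typescript
interface Chunk {
  id: number;
  start: number; // character offset in the source text, usable for citations
  text: string;
}

// Fixed-size windows that overlap so a sentence cut at one boundary
// still appears whole in the neighbouring chunk.
function chunkWithOverlap(text: string, size: number, overlap: number): Chunk[] {
  if (size <= 0 || overlap < 0 || overlap >= size) throw new Error('need 0 <= overlap < size');
  const chunks: Chunk[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ id: chunks.length, start, text: text.slice(start, start + size) });
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}

// TOY stand-in for an embedding model: occurrence counts of a few keywords.
const VOCAB = ['refund', 'invoice', 'shipping', 'warranty'];
function toyEmbed(text: string): number[] {
  const lower = text.toLowerCase();
  return VOCAB.map((word) => lower.split(word).length - 1);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface IndexEntry {
  chunk: Chunk;
  vector: number[];
}

// Embed every chunk once, up front.
function buildIndex(chunks: Chunk[], embed: (t: string) => number[] = toyEmbed): IndexEntry[] {
  return chunks.map((chunk) => ({ chunk, vector: embed(chunk.text) }));
}

// Return the k chunks most similar to the question.
function retrieve(question: string, index: IndexEntry[], k: number, embed: (t: string) => number[] = toyEmbed): Chunk[] {
  const q = embed(question);
  return index
    .map((entry) => ({ chunk: entry.chunk, score: cosine(q, entry.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((scored) => scored.chunk);
}
```

Only the retrieved chunks go into the prompt, and each chunk's `start` offset gives the answer something concrete to cite.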

What was genuinely hard solo

Cost control. Every modality has a different price curve. I collapsed everything to one credit balance and route to the cheapest model that clears a quality bar per task. Hard-coding model names at call sites is a trap; put them behind one config.
Latency UX. Video takes seconds-to-minutes. The product is mostly about making waiting feel intentional — optimistic UI, job queues, auto-refunding failed jobs so a timeout never costs a user a credit.
Glue > models. The models are an API call. The studio is chunkers, parsers, queues, a credit ledger, and a lot of error handling. That's the actual product.
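The "one config, cheapest model that clears the bar" idea from the cost-control point might look like the sketch below. Every model name, credit price, and quality score here is made up for illustration; in practice the scores would come from your own evaluations.

```typescript
type Task = 'video' | 'image-edit' | 'pdf-chat';

interface ModelOption {
  name: string;
  task: Task;
  creditsPerCall: number;
  quality: number; // 0..1, from your own evals
}

// HYPOTHETICAL catalogue: names, prices and scores are placeholders.
// Call sites never mention a model name; they ask for a task and a quality bar.
const MODEL_CONFIG: ModelOption[] = [
  { name: 'video-fast', task: 'video', creditsPerCall: 20, quality: 0.7 },
  { name: 'video-pro', task: 'video', creditsPerCall: 60, quality: 0.9 },
  { name: 'edit-lite', task: 'image-edit', creditsPerCall: 2, quality: 0.6 },
  { name: 'edit-pro', task: 'image-edit', creditsPerCall: 5, quality: 0.85 },
  { name: 'chat-small', task: 'pdf-chat', creditsPerCall: 1, quality: 0.75 },
];

// Cheapest model for `task` whose quality is at least `minQuality`,
// or undefined when nothing clears the bar.
function pickModel(task: Task, minQuality: number, config: ModelOption[] = MODEL_CONFIG): ModelOption | undefined {
  let best: ModelOption | undefined;
  for (const option of config) {
    if (option.task !== task || option.quality < minQuality) continue;
    if (best === undefined || option.creditsPerCall < best.creditsPerCall) best = option;
  }
  return best;
}
```

Swapping a model then means editing one array entry, and an `undefined` result is an explicit signal to lower the bar or tell the user the task is unavailable.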

The takeaway

If you want to understand these models, stop reading and wire three of them into one app. The cheapest experiment is still the same one I ran: feed a model a single image and watch what it does with time. The result of mine, if you want to poke at it, lives at geminiomni-ai.com — but the real value was the debugging, not the demo.

Happy to compare notes if you're building in this space.