Daniel Anthony

How We Built our API Multimodal Summary Engine

I’m the founder of Fidget, an AI-powered video summarizer. Today’s post covers our multimodal engine’s architecture, complete with code examples.

When we set out to build our Multimodal Summary Engine, the idea was clear: ingest data from many sources (video, audio, metadata, and so on) and use it to produce a neat, human-readable summary. If you rely on off-the-shelf summarizers, you still end up manually parsing transcripts and missing slide cues. That’s why Fidget’s multimodal AI engine was built from day one to capture every visual and audio nuance. Instead of simply transcribing audio, Fidget listens for tonal emphasis, detects slide changes, and integrates on-screen text, all in real time.

Building the Architecture for the Multimodal Engine

First, we needed a home for our new system, so we spec’d out the Fidget API. We knew developers didn’t want extra complexity, so Fidget exposes a single endpoint to handle incoming requests. However, we found that an API without guardrails is like a candy store without lockable cases: rate limiting and user permissions became a top priority. So, from day one, we planned for every request through the endpoint to be checked against per-user quotas, tokens, and roles.

A typical request might look like this:

curl -X POST https://api.getfidget.pro/v1/summarize \
-H "Authorization: Bearer sk-f9b9ba37-33b6-40e6-840e-e874d38e04a4" \
-H "Content-Type: application/json" \
-d '{"video_url": "https://example.com/video.mp4", "language": "en"}'

And the response looks like this:

{
  "success": true,
  "request_id": "fd7c9a1b-e8f2-4d3a-b8c5-2e7f3d8a9b1c",
  "processing_time": "0.87s",
  "video_metadata": {
    "title": "The Future of AI in Healthcare: Breakthroughs and Ethical Considerations",
    "duration": "15:42",
    "creator": "MedTech Insights",
    "language": "en",
    "topics": ["healthcare", "artificial intelligence", "ethics", "medical imaging", "drug discovery"]
  },
  "summary": {
    "executive_summary": "This comprehensive presentation explores how AI is transforming healthcare through advanced diagnostics, personalized treatment plans, and predictive analytics. The speaker appears optimistic and highlights recent breakthroughs in medical imaging analysis that have achieved 97.3% accuracy in early cancer detection, outperforming human radiologists by 11%. The discussion covers how machine learning has accelerated drug discovery timelines by 60% and how predictive analytics now forecast patient outcomes with 85% accuracy across multiple conditions.",
    "chapter_breakdown": [
      {
        "title": "Introduction to AI in Healthcare",
        "timestamp": "00:00 - 03:12",
        "summary": "Overview of current AI adoption in healthcare and historical context. The speaker appears happy and is standing against a whiteboard."
      },
      {
        "title": "Medical Imaging Breakthroughs",
        "timestamp": "03:13 - 07:45",
        "summary": "Detailed analysis of how AI systems detect patterns in medical images with 97.3% accuracy. Various x-ray images are shown to highlight the points being made by the speaker."
      },
      {
        "title": "Drug Discovery Revolution",
        "timestamp": "07:46 - 11:30",
        "summary": "Various scientists are shown working in a lab performing medical tasks while the speaker explains machine learning's role in accelerating pharmaceutical research."
      },
      {
        "title": "Ethical Considerations",
        "timestamp": "11:31 - 15:42",
        "summary": "The video takes a more serious tone while discussing privacy concerns, algorithmic bias, and regulatory frameworks. The speaker is attempting to stay optimistic but they appear pensive."
      }
    ],
    "key_insights": [
      "AI systems can detect patterns in medical images that humans might miss, with 97.3% accuracy",
      "Machine learning has accelerated drug discovery timelines by 60%",
      "Predictive analytics can forecast patient outcomes with 85% accuracy",
      "Ethical frameworks must evolve alongside technological capabilities"
    ],
    "sentiment_analysis": {
      "overall": "positive",
      "confidence": 0.87,
      "segments": {
        "technological_advancements": "very positive",
        "ethical_considerations": "neutral",
        "future_outlook": "positive"
      }
    },
    "related_topics": [
      "precision medicine",
      "neural networks in diagnostics",
      "healthcare data privacy"
    ]
  },
  "model_version": "fidget-v2.3.1",
  "tokens_processed": 5842
}

If the input video is unavailable or otherwise unreadable, our API returns an HTTP 400 status with an error code and clients can try again.
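For illustration, here’s a minimal client-side sketch of how a caller might handle that case using Python’s requests library with a simple backoff-and-retry loop; the error body mentioned in the comment is illustrative rather than our exact schema:

import time
import requests

API_URL = "https://api.getfidget.pro/v1/summarize"
HEADERS = {
    "Authorization": "Bearer <your-api-key>",
    "Content-Type": "application/json",
}

def summarize(video_url: str, retries: int = 3) -> dict:
    payload = {"video_url": video_url, "language": "en"}
    for attempt in range(retries):
        resp = requests.post(API_URL, json=payload, headers=HEADERS, timeout=60)
        if resp.status_code == 200:
            return resp.json()
        if resp.status_code == 400:
            # Illustrative error body, e.g. {"success": false, "error": "video_unreadable"}
            time.sleep(2 ** attempt)  # back off, then try again
            continue
        resp.raise_for_status()       # anything else is unexpected
    raise RuntimeError(f"Could not summarize {video_url} after {retries} attempts")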

After the initial API design we sketched out our system flow. Imagine a request arriving at /v1/summarize: it first passes through an auth layer, then a rate limiter, and finally lands at a dispatcher that invokes the right downstream processes (we ended up calling them “modules”). These gates ensure that a rogue client can’t soak up everyone else’s resources or bypass business rules. This isn’t just about security; it also helps us maintain predictable performance as more users discover the Fidget API, and it lets us scale gracefully.

System diagram of the Fidget Multimodal Summary Engine
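As a rough sketch of those gates in code (the in-memory stores and function names are stand-ins, not our production implementation):

# Simplified sketch of the gates in front of the dispatcher; the dictionaries
# stand in for whatever the real auth, quota, and role services would be.

TOKENS = {"sk-example-token": "user_123"}   # token -> user id
USER_QUOTAS = {"user_123": 100}             # requests left this window
USER_ROLES = {"user_123": {"summarize"}}

def check_auth(token: str) -> str:
    user_id = TOKENS.get(token)
    if user_id is None:
        raise PermissionError("invalid or expired token")
    return user_id

def check_rate_limit(user_id: str) -> None:
    if USER_QUOTAS.get(user_id, 0) <= 0:
        raise RuntimeError("quota exceeded")
    USER_QUOTAS[user_id] -= 1

def check_role(user_id: str, action: str) -> None:
    if action not in USER_ROLES.get(user_id, set()):
        raise PermissionError(f"user lacks the '{action}' role")

def handle_summarize(token: str, body: dict) -> dict:
    user_id = check_auth(token)
    check_rate_limit(user_id)
    check_role(user_id, "summarize")
    # Only now does the request reach the dispatcher and its modules.
    return {"accepted": True, "video_url": body["video_url"]}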

Underpinning all of this is a strict interface between components, which is especially important because we anticipate adding new “modalities” down the road (more on that soon). Every module, whether it handles video frames or audio transcripts, exposes a stable set of input and output parameters. A clearly defined interface means modules talk to each other in a universal dialect: JSON objects with named fields, standardized error codes, and documented versioning. This interface can (and probably will) change with new major versions of the API (/v1/summarize, /v2/summarize, and so on), but we plan to keep supporting older versions in parallel by keeping the same modules around.
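One way to picture that contract is as a small Python protocol; this is a hypothetical rendering of the interface, not our actual class definitions:

from typing import Protocol

class ModalityModule(Protocol):
    """The contract every module honours: a JSON-style dict in, a JSON-style dict out."""

    name: str      # e.g. "audio_transcript" or "frame_snapshotter"
    version: str   # which major interface version the module implements, e.g. "v1"

    def run(self, payload: dict) -> dict:
        """Accepts the shared request payload and returns either the module's
        result object or a standardized error code."""
        ...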

Defining Modal Sources (or Modalities) within the API

A “modality” is just a fancier word for “data type” or “context source.” But not every piece of data is created equal — so we asked ourselves: what makes a good context source?

  • Relevance: If a video file’s metadata says it’s 2160p at 60 fps with a 10 Mbps bitrate, that’s interesting to our engine because it hints at video quality and length, for example.
  • Availability: We prioritized sources that we could reliably extract at scale (standard container formats, well-defined audio codecs, and so on).
  • Signal-to-Noise Ratio: A YouTube “tags” list might be partially user-generated and messy, while the actual audio waveform is unstructured but raw. We needed a sense of which fields tend to carry real, actionable value.

Once we identified our candidate sources (video metadata such as duration, resolution, codec and description text; audio tracks with speech or music; and key-frame snapshots taken at specific intervals), we had to decide how to interpret the data. Metadata often comes as JSON, so parsing fields like duration or bitrate is straightforward. But when we hit audio or visual data, things get messier: speech transcripts can be filled with filler words, and images can be grainy or dark. That’s where our logic to handle noisy data kicks in. For instance, silent parts of audio get flagged and skipped, low-confidence speech segments are marked “uncertain,” and blurred frames are discarded or given a low relevance score.
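To make that concrete, here is a simplified sketch of those filtering rules; the thresholds and field names are illustrative, not the exact values we use:

# Illustrative noise-handling rules for transcript segments and key frames.
# The thresholds are made up for the example.

def clean_transcript(segments: list[dict]) -> list[dict]:
    cleaned = []
    for seg in segments:
        if seg.get("is_silence"):
            continue                                  # flag-and-skip silent spans
        if seg.get("confidence", 1.0) < 0.6:
            seg = {**seg, "label": "uncertain"}       # keep, but mark low confidence
        cleaned.append(seg)
    return cleaned

def score_frame(frame: dict) -> float:
    # Blurred or very dark frames get a low relevance score (or are dropped entirely).
    if frame.get("blur_score", 0.0) > 0.8:
        return 0.0
    if frame.get("brightness", 1.0) < 0.1:
        return 0.1
    return 1.0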

Extracting Data from Distinct Modalities (audio, video, metadata, YouTube)

With our modalities defined, we built unique modules for each one. Each of these modules lives inside the API behind that strict interface we mentioned earlier. We ended up with three core services:

  1. Metadata Extractor: Peels out raw JSON from tools like ffprobe for video or id3v2 for audio.
  2. Audio Transcriber: Pulls audio tracks out of containers and sends them to our custom GPT-style omni model for processing.
  3. Frame Snapshotter: Grabs “key frames” every few seconds, or at a configurable interval, depending on confidence scores (sketched below).
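As a rough sketch of how the Frame Snapshotter can grab frames at a fixed interval by shelling out to ffmpeg (the ten-second default and the paths are just example values):

import subprocess
from pathlib import Path

def snapshot_frames(input_path: str, output_dir: str, interval_seconds: int = 10) -> list[str]:
    """Extract one frame every `interval_seconds` using ffmpeg's fps filter."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    subprocess.run(
        [
            "ffmpeg", "-i", input_path,
            "-vf", f"fps=1/{interval_seconds}",   # one frame per N seconds
            "-q:v", "2",                          # reasonable JPEG quality
            str(out / "frame_%04d.jpg"),
        ],
        check=True,
    )
    return sorted(str(p) for p in out.glob("frame_*.jpg"))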

Each of these modules shares a common set of input/output parameters.

For example, every module accepts a payload like:

{
  "resource_id": "abc123",
  "input_path": "/tmp/abc123/source.mp4",
  "settings": { /* e.g., sampling_interval: 10 */ }
}

…and produces something like:

{
  "resource_id": "abc123",
  "output_path": "/tmp/abc123/frames/",
  "summary_path": "/tmp/abc123/frame_summaries.json"
}

The magic is that any new modality we create in the future (OCR’d subtitles, social media comments, links in the description, and so on) just needs to implement the same interface in the Fidget engine.
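For example, a hypothetical OCR-subtitle module only needs to accept and return the same payload shape as the existing modules; the class and helper below are purely illustrative:

import json
from pathlib import Path

class OcrSubtitleModule:
    """Hypothetical new modality: OCR text pulled from key frames."""

    name = "ocr_subtitles"

    def run(self, payload: dict) -> dict:
        resource_id = payload["resource_id"]
        frames_dir = Path(payload["input_path"])   # for this sketch, a directory of key frames
        settings = payload.get("settings", {})

        results = []
        for frame in sorted(frames_dir.glob("*.jpg")):
            text = self._ocr(frame, settings)      # call out to an OCR engine
            if text:
                results.append({"frame": frame.name, "text": text})

        summary_path = f"/tmp/{resource_id}/ocr_subtitles.json"
        Path(summary_path).parent.mkdir(parents=True, exist_ok=True)
        Path(summary_path).write_text(json.dumps(results))

        # Same output contract as every other module.
        return {
            "resource_id": resource_id,
            "output_path": str(frames_dir),
            "summary_path": summary_path,
        }

    def _ocr(self, frame_path: Path, settings: dict) -> str:
        # Placeholder: plug in pytesseract, a vision model, etc.
        return ""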

From there, we needed to plumb the modules together by registering each one in a central “pipeline orchestrator.” When a request for summarization arrives, the orchestrator fans out to every active modality module in parallel, waits asynchronously for their individual responses, and then moves to the next stage. This approach means we can add or remove a modality with minimal friction.
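Here’s a stripped-down sketch of that fan-out using Python’s asyncio; the registration decorator and the dummy modules are simplified stand-ins for the real thing:

import asyncio

MODULES = {}   # registered modality modules: name -> async callable

def register(name):
    def wrapper(fn):
        MODULES[name] = fn
        return fn
    return wrapper

@register("metadata")
async def extract_metadata(payload: dict) -> dict:
    await asyncio.sleep(0.1)   # stand-in for calling ffprobe etc.
    return {"resource_id": payload["resource_id"],
            "summary_path": f"/tmp/{payload['resource_id']}/metadata.json"}

@register("audio_transcript")
async def transcribe_audio(payload: dict) -> dict:
    await asyncio.sleep(0.1)   # stand-in for the transcription model
    return {"resource_id": payload["resource_id"],
            "summary_path": f"/tmp/{payload['resource_id']}/audio_transcript.json"}

async def orchestrate(payload: dict) -> dict:
    # Fan out to every active module at once, then wait for all of them.
    names = list(MODULES)
    results = await asyncio.gather(*(MODULES[name](payload) for name in names))
    return dict(zip(names, results))

# asyncio.run(orchestrate({"resource_id": "abc123",
#                          "input_path": "/tmp/abc123/source.mp4"}))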

The Video Summary AI Combinator

Once each module finishes its work, we collect everything into a staging area, which (for simplicity’s sake) ends up being a simple directory structure with JSON files and optional assets. To fuse these pieces, we built what we affectionately call “The Combinator.” It’s kind of like a blender where each ingredient (modality) gets measured by a weight slider (relevance or confidence).

First, we had to define modality weights. Some data types are inherently more relevant for particular tasks. For a news clip, speech transcripts might matter most; for a “how-to” cooking video, key frames and on-screen text could carry more weight. We set up a configuration file where we can assign relative weights like:

audio_transcript: 0.4
key_frame_text: 0.3
metadata: 0.3

…to quickly and easily see how the different modalities affect the final output. Eventually, this will be automatic, driven by an initial scan of the content and the confidence values it produces.

When the Combinator runs, it pulls data from all the modules in a single step. Under the hood, it reads in audio_transcript.json, frame_summaries.json and metadata.json. It then normalizes fields (e.g. converting timestamps to a uniform “seconds since start” format) and constructs a consolidated in-memory representation like:

{
  "resource_id": "abc123",
  "modalities": {
    "audio_transcript": [...],
    "frame_summaries": [...],
    "metadata": {...}
  },
  "weights": {
    "audio_transcript": 0.4,
    "frame_summaries": 0.3,
    "metadata": 0.3
  }
}

Finally, the Combinator churns out a combined dataset ready for the next stage: either on-the-fly summarization or feeding into a training pipeline.
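A simplified sketch of that combination step, assuming the staging layout described above (the weight values mirror the earlier config, and the timestamp normalization is boiled down to the essentials):

import json
from pathlib import Path

WEIGHTS = {"audio_transcript": 0.4, "frame_summaries": 0.3, "metadata": 0.3}

def to_seconds(timestamp: str) -> float:
    """Convert 'MM:SS' or 'HH:MM:SS' timestamps to seconds since start."""
    seconds = 0.0
    for part in timestamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds

def combine(staging_dir: str, resource_id: str) -> dict:
    staging = Path(staging_dir)
    combined = {"resource_id": resource_id, "modalities": {}, "weights": WEIGHTS}

    for name in WEIGHTS:
        path = staging / f"{name}.json"
        if path.exists():
            combined["modalities"][name] = json.loads(path.read_text())
        else:
            combined["modalities"][name] = None   # missing modality; its weight goes unused

    # Example normalization: rewrite any 'timestamp' fields to seconds since start.
    for items in combined["modalities"].values():
        if isinstance(items, list):
            for item in items:
                if "timestamp" in item:
                    start = str(item["timestamp"]).split(" - ")[0]
                    item["timestamp_s"] = to_seconds(start)

    return combined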

Adding Modalities to AI Training Data

With the Combinator’s output in hand, we add modalities as context for our custom GPT-style AI model. The idea is that each modality module’s data becomes part of the training context. For example, our LM sees:

[METADATA] Title: “How to Bake Bread”; Duration: 00:05:32
[AUDIO] 0:00–0:03: “Welcome to my bakery show...”
[FRAME] 0:05: Frame description: “Chef kneads dough.”
...

By feeding the LM a structured, modality-tagged dataset, we teach it how to correlate, say, a mention of “kneading” in audio with the corresponding visual frame. During model training, we employ techniques like contextual embedding where each modality’s tokens get their own positional encoding. We also up-weight or down-weight entire modalities based on the Combinator’s weights in this step. This ensures the final LM doesn’t drown in a flood of irrelevant information — no one wants a summarizer that fixates on bitrates instead of human speech!
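As a rough illustration of how the Combinator output might be rendered into the modality-tagged context shown above; the truncate-by-weight logic here is a simplification of the actual up- and down-weighting:

def render_training_context(combined: dict, budget_tokens: int = 4000) -> str:
    """Turn Combinator output into the tagged context shown above, giving each
    modality a share of the budget proportional to its weight."""
    lines = []
    weights = combined["weights"]

    meta = combined["modalities"].get("metadata") or {}
    if meta:
        lines.append(f"[METADATA] Title: {meta.get('title', '?')}; "
                     f"Duration: {meta.get('duration', '?')}")

    for name, tag in (("audio_transcript", "AUDIO"), ("frame_summaries", "FRAME")):
        items = combined["modalities"].get(name) or []
        share = int(budget_tokens * weights.get(name, 0.0))
        used = 0
        for item in items:
            text = item.get("text") or item.get("description", "")
            cost = len(text.split())          # crude token estimate
            if used + cost > share:
                break                          # down-weighted modalities get cut off sooner
            lines.append(f"[{tag}] {item.get('timestamp', '')}: {text}")
            used += cost

    return "\n".join(lines)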

In early tests, our prototype processed one hour of lecture video in under six minutes, with near 100% accuracy.

Once the training data is prepared, we hit the familiar “train” button (submitting jobs to our internal ML instances). Over multiple cycles, the model learns to generate coherent summaries that weave together the information it’s been provided: metadata blurbs, spoken dialogue, and visual descriptions all get combined and correlated. At this stage we also monitor validation loss carefully, making sure the model doesn’t overfit to one modality at the expense of others. However, as mentioned previously, we’re hoping to further automate this in future and have it feed back into the weighting system.

Once all of this is done, we bundle it together in a nice, neat JSON format along with some other data relevant to the task (tokens processed, model used, time taken, and so on) and return the response to the client with a lovely HTTP 200 status.

What’s Next for the API?

We’re currently in alpha with Fidget’s Multimodal Summary Engine, rolling it out to a handful of pilot customers. We’re aiming for a Summer 2025 public launch, after which we’ll be monitoring the wider reception and community feedback carefully.

So far, our post-launch roadmap includes:

  1. Feedback Loops: We’d like to add surveys and usage telemetry within the UX so that users can flag wildly inaccurate summaries or suggest new modalities themselves (like on-screen text recognition).
  2. New Modalities on Deck: Imagine live chat comments for livestreams, social sentiment scores from Reddit posts, or even links to real-world news stories mentioned inside a video.
  3. Fine-Tuning & Iteration: We’ll iteratively tweak modality weights, refine our noise filters, and periodically update the underlying language model to keep pace with slang, jargon and evolving content trends.
  4. Scalability & Availability: We’re working hard to make every part of the Fidget API scalable, both in terms of usage and performance. We’ll be making this a top priority post-launch so you’ll always have Fidget available 24/7.

In short, we’ve laid a robust, extensible foundation: an API that enforces permissions and rate limits, a set of plug-and-play modality extractors, a clever Combinator to merge it all, and a training pipeline that teaches our models the context behind the content. The journey from a raw video link to a concise, readable summary is now as smooth as butter.

👉 Want to shape Fidget’s roadmap? Join our API waitlist and receive early access, priority support and input into the development of Fidget.

We can’t wait to see what you build using the Fidget API!
