Juan David Gómez
When 5 Minutes Isn't Enough: Moving AI Ingestion from Sync to Async (And Saving 99% Compute)

In my last post, I introduced Synapse, the AI system I built for my wife that uses a Knowledge Graph to give her LLM a "Deep Memory."

In the early demos and tests, it looked perfect. She ends a chat, the system processes it, and the graph updates in about 50 seconds.

But demos are lies.

When we started using it for real, with 45-minute chat sessions and dozens of messages, the system fell apart. The "End Session" button would spin for 5 minutes and then crash.

I thought I had a simple timeout bug. It turned out I had a fundamental architecture problem.

Here is how I went from crashing servers and wasting tokens to a 99% reduction in Convex Actions compute time by implementing the Async Request-Reply Pattern.

The "Happy Path" Trap

My initial implementation was naive. I treated the heavy AI processing like a standard web request.

  1. Convex (The Orchestrator) triggers an HTTP POST to my Python backend.
  2. FastAPI (The Brain) calls Graphiti + Gemini to process the text.
  3. FastAPI waits for the result and returns it.
  4. Convex saves the result to the DB.

This is the standard Synchronous pattern.

The problem? Convex Actions have a hard execution limit (usually 5 to 10 minutes depending on the plan).

When my wife had a short conversation, processing took 1 or 2 minutes. Fine.
But when she had a deep conversation, the Graph extraction logic (running on Gemini 3 Flash) took 15 minutes.

You cannot fit a 15-minute task into a 5-minute box.
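The mismatch is easy to reproduce with nothing but the standard library. This is a scaled-down sketch (durations in fractions of a second instead of minutes), not the actual Convex or FastAPI code; `heavy_ingestion` and `orchestrator` are illustrative names:

```python
import asyncio

async def heavy_ingestion() -> str:
    """Stand-in for the 15-minute graph extraction (scaled to 0.3 s)."""
    await asyncio.sleep(0.3)
    return "graph updated"

async def orchestrator() -> str:
    # The orchestrator's hard execution limit (scaled to 0.1 s).
    try:
        return await asyncio.wait_for(heavy_ingestion(), timeout=0.1)
    except asyncio.TimeoutError:
        return "action timed out"

print(asyncio.run(orchestrator()))  # -> action timed out
```

No matter how generous the timeout looks in isolation, any task whose duration exceeds it hits the same wall.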

Attempt #1: The "Brute Force" Retry (And Why It Failed)

At first, I didn't realize it was taking 15 minutes. I assumed the Gemini API was just being flaky or slow.

So, I did what any engineer does when things fail: I added retries.

I configured Convex to retry the action with exponential backoff on failure.

Here is the disaster that followed:

  1. Convex sends the request.
  2. It waits 5 minutes. Timeout.
  3. Convex thinks the request failed, so it schedules a Retry.
  4. It sends the request again.

The Hidden Bug:
The Python backend didn't know Convex had timed out. The first process was still running in the background, consuming LLM tokens and writing to the graph.

Suddenly, I had two heavy processes crunching the same chat log simultaneously. I was paying double the API costs, wasting bandwidth, and clogging my backend with "zombie" processes. And the user still got an error message.
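The bug is subtle because a timeout on the caller's side does not cancel anything on the server's side. Here is a stdlib-only simulation of that failure mode (scaled-down timings, illustrative names; `llm_calls` stands in for billed Gemini invocations):

```python
import asyncio

llm_calls = 0  # count of (simulated) paid LLM invocations

async def ingest(chat_log: str) -> None:
    """The backend keeps working even after the caller gives up."""
    global llm_calls
    llm_calls += 1
    await asyncio.sleep(0.3)  # stand-in for minutes of Gemini calls

async def orchestrator_with_retry(chat_log: str) -> None:
    for attempt in range(2):  # original request + one retry
        task = asyncio.create_task(ingest(chat_log))
        try:
            # shield() mimics an HTTP call: the caller's timeout fires,
            # but the server-side task is NOT cancelled.
            await asyncio.wait_for(asyncio.shield(task), timeout=0.1)
            return
        except asyncio.TimeoutError:
            pass  # caller gave up, but `task` is still running
    await asyncio.sleep(0.5)  # let the zombies finish so we can count them

asyncio.run(orchestrator_with_retry("45-minute session"))
print(llm_calls)  # -> 2: the same chat was processed twice
```

One retry policy on the caller, zero cancellation on the server: every timeout multiplies the work instead of recovering from it.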

The Turning Point: Observability

I couldn't fix what I couldn't see. I installed OpenTelemetry and connected it to Axiom to trace the actual execution time on the Python backend.

The trace was a slap in the face.

[Screenshot: Axiom trace of the ingestion run]

The ingestion wasn't failing; it was just slow. It consistently took 12 to 18 minutes for large sessions.

I realized this wasn't a bug I could "optimize" away. I needed to change the architecture.

The Solution: The Async Request-Reply Pattern

In software engineering, when a task takes longer than a user (or a server) is willing to wait, you decouple the Request from the Response.

I switched to a Polling Architecture.

Instead of Convex waiting for the answer, it just asks for a "ticket."

  1. Convex sends a POST /ingest request.
  2. FastAPI immediately returns 202 Accepted with a jobId. (Time taken: ~300ms).
  3. FastAPI starts the heavy processing in a background task (asyncio.create_task).
  4. Convex goes to sleep and wakes up every few minutes to check the status.

Here is the flow:

[Diagram: the async request-reply flow between Convex and FastAPI]
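The steps above can be sketched framework-free with just the standard library. In the real system this logic sits behind FastAPI's `POST /ingest` and a status route; `JOBS`, `submit`, and `get_status` are illustrative names, not the actual API:

```python
import asyncio
import uuid

JOBS: dict[str, dict] = {}  # jobId -> {"status": ..., "result": ...}

async def heavy_ingestion(chat_log: str) -> str:
    await asyncio.sleep(0.3)  # stand-in for 15 minutes of graph extraction
    return f"graph updated from {len(chat_log)} chars"

async def run_job(job_id: str, chat_log: str) -> None:
    try:
        JOBS[job_id] = {"status": "done",
                        "result": await heavy_ingestion(chat_log)}
    except Exception as exc:
        JOBS[job_id] = {"status": "failed", "result": str(exc)}

def submit(chat_log: str) -> str:
    """POST /ingest: hand back a 202 + jobId in milliseconds."""
    job_id = uuid.uuid4().hex
    JOBS[job_id] = {"status": "running", "result": None}
    asyncio.create_task(run_job(job_id, chat_log))  # fire and forget
    return job_id

def get_status(job_id: str) -> dict:
    """GET /ingest/{job_id}: cheap to call as often as you like."""
    return JOBS.get(job_id, {"status": "unknown", "result": None})

async def demo() -> dict:
    job_id = submit("long chat session...")  # returns immediately
    while get_status(job_id)["status"] == "running":
        await asyncio.sleep(0.1)  # the orchestrator polls on a schedule
    return get_status(job_id)

print(asyncio.run(demo())["status"])  # -> done
```

The caller's only job is to hold the ticket and check back later; the expensive work runs entirely outside any request/response window.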

Why Linear Backoff?

I switched from Exponential to Linear backoff for the polling.

If I know a task takes at least 5 minutes, checking after 10 seconds is a waste of resources. Checking after 2 minutes is also a waste.

I set the scheduler to check after 5 minutes, then 10 minutes, then 10 minutes again. This reduces the noise on my server significantly.
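That schedule is just a capped linear sequence. A tiny sketch of the idea (function name and parameters are mine, not Convex's API; delays in seconds):

```python
def poll_delays_linear(first: int = 300, step: int = 300, cap: int = 600):
    """Yield polling delays: 5 min, then 10 min, then 10 min, ..."""
    delay = first
    while True:
        yield min(delay, cap)
        delay += step

schedule = poll_delays_linear()
print([next(schedule) for _ in range(4)])  # -> [300, 600, 600, 600]
```

Compare that with exponential backoff starting at 10 seconds (10, 20, 40, 80, ...): the early checks are guaranteed to find nothing, because the floor on the task's duration is known in advance.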

The Results: 99% Efficiency Gain

The difference in resource usage is massive.

Before (Synchronous):

  • Convex Action running time: 5 minutes (blocking/waiting).
  • Result: Fail -> Retry -> 5 more minutes.
  • Total "Billed" Compute: ~10-15 minutes.
  • Token Waste: High (re-processing the same data).

After (Async Polling):

  • Request 1 (Trigger): ~300ms.
  • Request 2 (Poll at 5m): ~300ms.
  • Request 3 (Final Fetch): ~300ms.
  • Total "Billed" Compute: < 2 seconds.

We went from wasting 10 minutes of compute just "waiting" for a response, to using less than 2 seconds of active execution time to manage the same job.

More importantly, the Python backend never processes the same job twice. If Convex asks for the status of a job that is already running, FastAPI just says "Still working on it," and the work continues undisturbed.
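That idempotency is the piece that kills the zombie-process bug for good. A minimal sketch of the dedup logic, assuming the session ID is the natural idempotency key (`RUNNING` and `submit_once` are illustrative names):

```python
import asyncio

RUNNING: dict[str, asyncio.Task] = {}  # sessionId -> in-flight task

async def process_session(session_id: str) -> str:
    await asyncio.sleep(0.2)  # stand-in for the long ingestion
    return f"ingested {session_id}"

def submit_once(session_id: str) -> str:
    """A second submit for the same session joins the in-flight job."""
    task = RUNNING.get(session_id)
    if task is not None and not task.done():
        return "still working on it"
    RUNNING[session_id] = asyncio.create_task(process_session(session_id))
    return "accepted"

async def demo() -> tuple[str, str]:
    first = submit_once("session-42")
    second = submit_once("session-42")  # a retry or duplicate trigger
    await RUNNING["session-42"]         # only one task ever existed
    return first, second

print(asyncio.run(demo()))  # -> ('accepted', 'still working on it')
```

With this in place, retries on the Convex side become harmless: the worst a duplicate request can do is ask for a status.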

Conclusion

This project taught me a valuable lesson about building "Vertical AI" apps: AI tasks are slow.

We are used to web requests taking 200ms. In the world of LLMs and Knowledge Graphs, a "fast" task might take 30 seconds, and a "deep" task might take 15 minutes.

If your backend takes longer than your timeout limit, don't just increase the timeout. Decouple the request. It makes your system more resilient, your bills lower, and your architecture cleaner.

I'd love to hear how you handle long-running LLM tasks. Let me know on X or LinkedIn.
