Nicanor Korir
The Architecture Behind a Stateless AI Application

This project has been awesome to work on. Early in Shamba-MedCare, I made an architectural decision that felt risky at the time: no backend database. The app doesn't need to persist user data at this stage; getting a diagnosis back to the user quickly matters most.

Every tutorial, every architecture guide, every "best practice" document assumes you'll store user data on a server. User accounts, session management, and data persistence all living in PostgreSQL, MongoDB, or DynamoDB.

But I kept asking myself: why? What user data does this application actually need to persist across devices? The answer was... nothing. And that realization shaped everything that followed.

The Three-Layer Split

Here's how the system actually works:

Frontend handles all user interaction, data persistence, and UI state. It compresses images before upload, manages the multi-step wizard flow, stores history locally, and renders results.

Backend does exactly one thing: transform image data into diagnosis data. It receives a request, builds a prompt, calls the LLM, parses the response, and returns structured JSON. No state. No sessions. No database.

AI Layer is Claude Vision. It receives images with carefully crafted prompts and returns detailed diagnostic information.

Each layer has one job. Mixing responsibilities, like having the backend store history or the frontend call the LLM directly, would create complexity without benefit.
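
To make the boundary concrete, here's a minimal sketch of the backend's one job in TypeScript. The helper names and request shape are illustrative, not the project's actual code:

```typescript
interface AnalyzeRequest {
  images: string[];                   // base64 images, compressed client-side
  mode: "single" | "batch" | "video"; // how multiple images should be treated
  context?: string;                   // optional user-provided notes
}

// Hypothetical helpers standing in for prompt construction, the Claude
// Vision call, and response parsing.
declare function buildPrompt(req: AnalyzeRequest): string;
declare function callClaudeVision(prompt: string, images: string[]): Promise<string>;
declare function parseDiagnosis(raw: string): object;

// The whole backend, conceptually: request in, structured diagnosis out.
async function handleAnalyze(req: AnalyzeRequest): Promise<object> {
  const prompt = buildPrompt(req);
  const raw = await callClaudeVision(prompt, req.images);
  return parseDiagnosis(raw); // no sessions, no database; output depends only on input
}
```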

Data Interaction

Simplicity is the goal. Here's how it works:

History and settings never leave the user's device. The API key passes through my server but is never stored.
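
To show what "passes through, never stored" looks like on the wire, here's a hedged sketch; the header name is my assumption, not the actual one:

```typescript
// Hypothetical client call: the user's key rides along on each request and is
// forwarded to Anthropic server-side, never written to any store.
async function analyzeWithUserKey(apiKey: string, payload: object): Promise<Response> {
  return fetch("/api/v1/analyze", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      "X-User-API-Key": apiKey, // assumed header name; read once, then discarded
    },
    body: JSON.stringify(payload),
  });
}
```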

The Multi-Step Wizard: Why State Machines

The scan flow has five potential steps: plant part selection, crop type selection, media upload, analysis mode (for multiple images), and context entry. Implementing this as a traditional form with step numbers would be a nightmare.

Here's the problem with step numbers:

If the user uploads a single image, step 4 is context entry. If they upload multiple images, step 4 is mode selection and step 5 is context entry. The step numbers become meaningless because they depend on runtime conditions.

The solution is a state machine. Each state has a meaningful name: "part", "crop", "media", "mode", "context", "analyzing". The UI doesn't care about step numbers. It renders whatever state it's in.

The progress indicator ("Step 3 of 5" vs "Step 3 of 4") is computed dynamically based on whether mode selection will appear. Users see accurate progress without the code caring about arbitrary step numbers.
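
Here's a minimal sketch of the idea, using the state names above; the transition and progress logic are illustrative, not the exact implementation:

```typescript
type WizardState = "part" | "crop" | "media" | "mode" | "context" | "analyzing";

// Transitions depend on runtime conditions, not fixed step numbers.
function nextState(state: WizardState, imageCount: number): WizardState {
  switch (state) {
    case "part":      return "crop";
    case "crop":      return "media";
    case "media":     return imageCount > 1 ? "mode" : "context"; // mode only for multi-image
    case "mode":      return "context";
    case "context":   return "analyzing";
    case "analyzing": return "analyzing";
  }
}

// "Step 3 of 5" vs "Step 3 of 4" falls out of which states will appear.
function progressLabel(state: WizardState, imageCount: number): string {
  const steps: WizardState[] = imageCount > 1
    ? ["part", "crop", "media", "mode", "context"]
    : ["part", "crop", "media", "context"];
  const index = steps.indexOf(state);
  return index === -1 ? "Analyzing..." : `Step ${index + 1} of ${steps.length}`;
}
```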

Storage Architecture: Three Tiers

Different data have different lifetimes. I implemented three distinct tiers:

Session storage holds consent flags. When you close the browser, consent expires. Next session, you make an active choice again. For health-related applications, I think users should consciously opt in each time rather than relying on consent from months ago.

Local storage holds everything that should persist: scan history, accessibility settings (font size, voice preferences), and the API key.

Embedded deep cache is a design choice that trips people up. Each history item doesn't just store a reference to results; it stores the complete diagnosis. All 25+ fields. Treatments, prevention tips, the full thing.

This bloats storage but enables true offline access. A farmer can reference last week's treatment recommendations without any network connection. That's critical for rural users.

The math works out: each scan is about 20-30KB with a thumbnail. At 50 scans maximum, that's roughly 1.5MB, well under the 5MB browser quota. Older scans rotate out automatically.
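
As a sketch of the three tiers (key names and shapes are my assumptions):

```typescript
// Tier 1: session storage. Consent dies with the browser session.
sessionStorage.setItem("consent", "granted");

// Tier 2: local storage. Settings and the API key persist across sessions.
localStorage.setItem("settings", JSON.stringify({ fontSize: "large" }));

// Tier 3: history items embed the complete diagnosis, not a reference,
// so last week's treatments stay readable fully offline.
const MAX_SCANS = 50;

function saveScan(scan: { thumbnail: string; diagnosis: object }): void {
  const history = JSON.parse(localStorage.getItem("history") ?? "[]");
  history.unshift(scan);     // newest first
  history.splice(MAX_SCANS); // older scans rotate out automatically
  localStorage.setItem("history", JSON.stringify(history));
}
```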

The Single Endpoint Philosophy

The backend has one API endpoint: POST to /api/v1/analyze. That's it.

Why not multiple endpoints? I considered having /analyze/single for one plant, /analyze/batch for multiple plants, and /analyze/video for video input. But here's the thing: they all do the same underlying operation. They all send images to the LLM and return structured results.

The only difference is in the prompt construction and response handling. A mode parameter handles that cleanly. Multiple endpoints would mean:

  • Client-side routing logic
  • Duplicate validation code
  • Versioning complexity when the format changes
  • Documentation for three endpoints instead of one

One endpoint with a mode parameter is simpler to understand, test, and maintain.
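
Inside the endpoint, the mode parameter just selects the prompt strategy. A sketch, with assumed mode names matching the alternatives above:

```typescript
type Mode = "single" | "batch" | "video";

// Hypothetical prompt builders; only this part differs per mode.
declare function promptForSinglePlant(images: string[]): string;
declare function promptForMultiplePlants(images: string[]): string;
declare function promptForVideoFrames(images: string[]): string;

function buildPrompt(mode: Mode, images: string[]): string {
  switch (mode) {
    case "single": return promptForSinglePlant(images);
    case "batch":  return promptForMultiplePlants(images);
    case "video":  return promptForVideoFrames(images);
  }
}
```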

User-Provided API Key

This is, of course, temporary: the app is still in testing and development, and user-provided keys keep us from burning through our own API credits. The approach also has real upsides:

Cost transparency: Users see exactly what they're paying. No hidden markup. No surprise bills.

No key management: I don't need a database to store keys, rotation logic, or access controls. This significantly reduces operational complexity.

Scalability: No shared rate limits. Each user has their own Anthropic quota.

Trust: Users control their own credentials. I literally cannot run up their bill unexpectedly.

The downside is friction. Users must create an Anthropic account and generate an API key before using the app. For sophisticated users (the current audience), this is fine. For mass-market adoption, I'd need to revisit this decision.

Batch vs Single Mode: The Same Plant Problem

When users upload multiple images, the system needs to know: are these different plants (analyze separately) or the same plant from different angles (analyze together)?

This isn't something AI can reliably infer. A tomato leaf photographed from above and below might look completely different. Are they the same plant? Only the user knows.

So the UI asks directly. "Same plant, different angles" sends all images in one request; the LLM sees everything together and produces one diagnosis that synthesizes evidence across views. "Different plants" sends each image separately: three images means three independent diagnoses.

For video input, extracted frames default to "same plant" mode. The user was recording continuously, so frames presumably show the same subject.
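
The user's answer changes nothing but request fan-out. A sketch, where requestAnalysis is a hypothetical client helper wrapping POST /api/v1/analyze:

```typescript
declare function requestAnalysis(images: string[]): Promise<object>;

async function runScan(images: string[], samePlant: boolean): Promise<object[]> {
  if (samePlant) {
    // One request with all views: one diagnosis synthesizing the evidence.
    return [await requestAnalysis(images)];
  }
  // One request per image: three images means three independent diagnoses.
  return Promise.all(images.map((img) => requestAnalysis([img])));
}
```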

Response Parsing: Handling Imperfection

Here's something I learned the hard way: the LLM sometimes wraps JSON responses in markdown code blocks, even when the prompt explicitly requests raw JSON.

The prompt says: "Return ONLY valid JSON, no markdown formatting."

The LLM occasionally returns:

````
```json
{"health_score": 45, "disease": "Early Blight"...}
```
````

The backend strips markdown code block markers if present. It handles both raw JSON and markdown-wrapped JSON identically.

The broader principle: prompts aim for perfection; parsing assumes imperfection. Every field has a fallback. Missing health score defaults to 0. Missing severity defaults to "moderate". A partial response is better than a crashed request.
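
A sketch of that defensive parsing, assuming the same field names as the example above:

```typescript
function parseDiagnosis(raw: string) {
  // Strip ```json ... ``` markers if present; raw JSON passes through untouched.
  const cleaned = raw
    .trim()
    .replace(/^```(?:json)?\s*/, "")
    .replace(/```\s*$/, "");

  let data: Record<string, unknown> = {};
  try {
    data = JSON.parse(cleaned);
  } catch {
    // An unparseable response degrades to defaults instead of crashing.
  }

  return {
    health_score: typeof data.health_score === "number" ? data.health_score : 0,
    severity: typeof data.severity === "string" ? data.severity : "moderate",
    // ...same fallback pattern for the remaining fields
  };
}
```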

The Trade-off Table

I believe architecture is really just trade-off documentation. Here's the honest accounting:

| Decision | What We Gave Up | What We Gained |
| --- | --- | --- |
| No backend database | Cross-device sync, unlimited history | Privacy by design, simpler operations |
| User-provided API keys | Frictionless onboarding | Cost transparency, no key management |
| Optional plant/crop selection | Guaranteed input accuracy | Accessibility, faster expert workflow |
| Full results cached in history | Smaller storage footprint | True offline access to treatments |
| Single API endpoint | Clear operation separation | Simpler integration, less client logic |

None of these decisions is universally correct. They're correct for this application's specific constraints: privacy-conscious users, offline usage requirements, cost sensitivity, and a small development team.

Different constraints would lead to different choices. A hospital app would need a database. An enterprise tool would need centralized key management. A children's app would need user accounts.

Architecture isn't about finding the "right" pattern. It's about understanding your constraints and making explicit choices that fit them.

The Result

The system that emerged is simple in a specific way. Not simple as in easy to build, simple as in each piece does one thing.

Frontend handles users. Backend handles transformation. LLM handles intelligence. Storage tiers handle different lifetimes. The scan wizard handles a multi-step flow. Each piece is testable in isolation and replaceable without affecting others.

That's the goal of architecture: not clever abstractions, but clear separations. Not perfect patterns, but honest trade-offs.

When I look at this system, I can explain every decision. That's what makes it maintainable, not the code, but the clarity of intent behind it.


Source code on GitHub
