My Phone AI Pipeline Was a Prototype. Now It's a Real Project.

#ai #python #rag #softwareengineering

Three upgrades, one repo, and a promise kept.

A few weeks ago, I wrote about building a RAG pipeline on my phone. It worked. Barely.

I used subprocess calls to talk to Ollama. Every time I restarted Termux, the bot forgot everything we'd discussed. And I was running the smallest Gemma 4 variant because I was scared the bigger one would crash my device.

I promised I'd rebuild it. Today, that rebuild is live.

What Changed

Three things I said I'd fix and actually fixed:

Native API Instead of Subprocess

My original code shelled out to Ollama using Python's subprocess module. It worked, but it was janky. The new version uses Ollama's native REST API via requests.post(). Cleaner code. Fewer moving parts. Proper error handling. The API returns structured JSON, so I no longer have to parse raw text.
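For anyone who wants to see the shape of that change, here is a minimal sketch, not the repo's actual code. It assumes Ollama is listening on its default local port (11434) and calls the /api/generate endpoint with streaming turned off. MODEL_TAG, build_payload and ask_ollama are illustrative names, and the tag is a placeholder: use whatever ollama list shows on your device.

```python
OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address
MODEL_TAG = "your-gemma-tag"  # placeholder: use the tag shown by `ollama list`


def build_payload(prompt, model=MODEL_TAG):
    """Build the JSON body for a single, non-streaming generation request."""
    return {"model": model, "prompt": prompt, "stream": False}


def ask_ollama(prompt, model=MODEL_TAG, post=None, timeout=120):
    """Send one prompt to Ollama and return the generated text.

    `post` is injectable so the function can be exercised without a server;
    by default it falls back to requests.post.
    """
    if post is None:
        import requests  # third-party: pip install requests
        post = requests.post
    response = post(OLLAMA_URL, json=build_payload(prompt, model), timeout=timeout)
    response.raise_for_status()  # turn HTTP error codes into exceptions
    return response.json()["response"]  # the generated text lives under "response"
```

Calling ask_ollama("Say hello in five words.") with the server running should return the reply as a plain string. If Ollama is not running, requests raises a connection error you can catch instead of guessing from a subprocess exit code.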

Persistent Memory Across Sessions

This was the big one. My old pipeline had amnesia. Restart Termux, lose everything.

Now there's a chat_memory.json file that stores a rolling summary of past conversations. The pipeline injects that memory into every prompt, so the model remembers what we talked about even across restarts. Typing memory in interactive mode shows your conversation history.

It's not a vector database for memory. It's a lightweight JSON log. But it works on a phone without eating RAM. That's the engineering tradeoff.
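Here is a rough sketch of what a JSON-backed memory like this can look like. It is a simplification under my own assumptions, not the code in the repo: it keeps the most recent turns verbatim instead of summarising them, and the 20-turn cap, the field names and the function names are all made up for illustration. Only the chat_memory.json file name comes from the pipeline itself.

```python
import json
from pathlib import Path

MEMORY_FILE = Path("chat_memory.json")  # file name used by the pipeline
MAX_TURNS = 20  # assumed cap so the file and the prompt stay small


def load_memory(path=MEMORY_FILE):
    """Return the saved turns, or an empty list on first run or a corrupt file."""
    try:
        return json.loads(Path(path).read_text(encoding="utf-8"))
    except (FileNotFoundError, json.JSONDecodeError):
        return []


def remember(question, answer, path=MEMORY_FILE, max_turns=MAX_TURNS):
    """Append one exchange, keep only the newest turns, and write the file back."""
    turns = load_memory(path)
    turns.append({"question": question, "answer": answer})
    turns = turns[-max_turns:]  # rolling window: the oldest turns fall off
    Path(path).write_text(json.dumps(turns, indent=2), encoding="utf-8")
    return turns


def build_prompt(question, path=MEMORY_FILE):
    """Prefix the new question with whatever was saved in earlier sessions."""
    history = "\n".join(
        f"User: {t['question']}\nAssistant: {t['answer']}" for t in load_memory(path)
    )
    if not history:
        return question
    return f"Earlier conversation:\n{history}\n\nUser: {question}"
```

Because the state lives in a small file rather than in the process, it survives a Termux restart, and the cap keeps the injected history from crowding out the actual question.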

Upgraded to Gemma 4 E4B

I was running E2B (2.3B params) because I assumed my phone couldn't handle more. I was wrong. The E4B (4.5B params) fits comfortably with quantization. The jump in reasoning quality is noticeable, especially on multi-step questions where the old model would lose the thread.
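A quick back-of-the-envelope check shows why quantization makes the difference. The sketch below estimates the size of the weights alone as parameters times bits per weight, divided by 8 to get bytes. The bit widths are assumptions for illustration (the exact quantization format is not something I'm stating here), and the estimate ignores the context cache and runtime overhead, so real memory use will be higher. At an assumed 4 bits per weight, 4.5B parameters come to roughly 2.25 GB.

```python
def weight_size_gb(params, bits_per_weight):
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Parameter counts as quoted in this post; bit widths are illustrative assumptions.
for name, params in [("E2B", 2.3e9), ("E4B", 4.5e9)]:
    for bits in (16, 8, 4):
        print(f"{name} at {bits}-bit: ~{weight_size_gb(params, bits):.2f} GB")
```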

The Repo

Everything is on GitHub now:

github.com/Dexter2344

The README explains how to set it up, what dependencies you need, and how to run it. If you've got Termux on Android, you can clone it and have your own offline AI running in under 30 minutes.

What's Still Hard

Let me be honest about what I haven't solved:

· The phone still heats up after 20+ minutes of continuous inference. Thermal throttling is real.
· Android will kill the Ollama process if you switch apps for too long. I haven't found a workaround yet.
· The embedding model is still my lightweight hashing approach, not a real transformer. That's next on the list.
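For context on that last point, this is roughly what a hashing-based embedding looks like. It is a generic feature-hashing sketch under my own assumptions (256 dimensions, MD5 for bucketing, a sign bit to soften collisions), not the exact code in the repo. It needs only the standard library, which is why it is so cheap on a phone, but it only captures shared words, not meaning.

```python
import hashlib
import math
import re


def hash_embed(text, dim=256):
    """Map text to a fixed-size unit vector by hashing each token into a bucket."""
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:4], "big") % dim  # which slot the token lands in
        sign = 1.0 if digest[4] % 2 == 0 else -1.0  # signed counts soften collisions
        vec[bucket] += sign
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec  # L2-normalise unless all zeros


def cosine(a, b):
    """Cosine similarity for vectors that are already unit length."""
    return sum(x * y for x, y in zip(a, b))
```

Two texts that share words land in the same buckets and score higher with cosine(); a synonym with different spelling scores nothing, which is exactly the gap a transformer embedding would close.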

Why This Matters

Every time I publish one of these, someone reaches out and says "I didn't know you could do that from a phone." That's the whole point. You don't need a $2,000 laptop or cloud credits to build real AI systems. You need curiosity, patience, and a willingness to break things.

The code is free. The repo is public. Go build something.
