AK DevCraft

Posted on May 11 • Edited on Jun 26

$0 Personal Agentic AI Assistant - Architecture - Part 1

#ai #openclaw #machinelearning #llm

OpenClaw Challenge Submission 🦞

Introduction

You know the pattern: a productivity tool promises to change everything, charges monthly, and quietly becomes background noise. AI assistants are going the same way: another tab, another login, another $20/month for something you open twice a week. Most of us probably regret at least one subscription we pay for today.

Welp! What if you didn't have to pay at all?

Today, the infrastructure exists to run a capable, always-on personal AI assistant: one that lives in the apps you already use every day, like Telegram or WhatsApp, remembers you, browses the web, and handles real tasks, all for exactly zero dollars a month. Not a trial. Not a teaser. Permanently free, on infrastructure you control.

This article explains the architecture that makes it possible and why each piece matters.

The Subscription Trap

Most people's AI setup looks like this: Claude.ai, ChatGPT, or another AI provider in a browser tab or mobile app, opened when needed, closed when done. Conversations are saved, and you can return to an earlier discussion if you stay in the same thread. But that's passive history, not active memory: you have to go and find it. And the whole time, the assistant can't reach out, take action, or do anything unless you open it first.

That's not an assistant. That's a very smart search box.

A real assistant is always on. It knows who you are. It operates in the apps you already use. It can take actions, not just generate text, and it doesn't charge you for existing.

Until recently, building that required either paying cloud AI bills or owning serious hardware, both out of reach for most people. That no longer has to be the case.

Three Shifts That Can Make This Possible

1. Open-weight models are now genuinely capable

Meta's Llama, Google's Gemma, and others have closed the gap with proprietary models significantly over the past few years. A 3-8 billion parameter model running locally can handle the majority of the everyday tasks people actually use AI assistants for: summarising, drafting, answering questions, and light reasoning.

2. Cloud providers offer permanently free compute

Oracle Cloud's Always Free tier gives you up to 4 ARM CPU cores and 24GB of RAM — permanently, with no expiry date. Not a 12-month trial like AWS. Not credits that run out. A real server running 24/7 at zero cost, forever, as long as you keep the account active.

That's enough to run Ollama with a capable local model.
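As a rough sanity check on that claim, here is a back-of-the-envelope estimate of how much RAM the model weights alone need at common quantization levels. Treat it as a sketch, not a benchmark: `weights_gib` is a hypothetical helper name, the bit widths are illustrative assumptions, and the figure ignores the KV cache, runtime overhead, and everything else sharing the same 24 GB.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


SERVER_RAM_GB = 24  # Always Free ARM allocation quoted in the article

# Assumption: 4-, 8-, and 16-bit weights as illustrative precision levels.
for params in (3, 8):
    for bits in (4, 8, 16):
        size = weights_gib(params, bits)
        print(f"{params}B @ {bits}-bit: ~{size:.1f} GiB of weights "
              f"(server has {SERVER_RAM_GB} GB)")
```

Under these assumptions, an 8B model at 4 bits per weight needs roughly 3.7 GiB for its weights, which leaves comfortable room on a 24 GB instance for the context cache and the rest of the stack.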

3. Free API tiers have become genuinely useful

Google's Gemini 2.5 Flash-Lite is generally capped at 250K tokens per minute (TPM) on the free tier, with no credit card required. For a personal assistant handling one person's queries, that's more than enough headroom. When the local model is too slow or too limited for a task, Gemini picks it up, for free.
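To stay inside a free tier you generally want to count usage on your own side before sending a request. The sketch below is one way to do that, assuming the 250K tokens-per-minute cap mentioned above and the 1,000 requests-per-day figure quoted for the Gemini fallback later in this article. `FreeTierBudget` is a hypothetical class, not part of any SDK, and the rolling 24-hour window is a conservative approximation: the provider's actual reset schedule may differ.

```python
import time
from collections import deque


class FreeTierBudget:
    """Client-side guard for a per-minute token cap and a per-day request cap."""

    def __init__(self, tokens_per_minute=250_000, requests_per_day=1_000,
                 clock=time.time):
        self.tpm = tokens_per_minute
        self.rpd = requests_per_day
        self.clock = clock           # injectable so the guard can be tested
        self.token_log = deque()     # (timestamp, tokens) from the last 60 s
        self.request_log = deque()   # timestamps from the last 24 h

    def _prune(self, now):
        # Drop entries that have aged out of their window.
        while self.token_log and now - self.token_log[0][0] >= 60:
            self.token_log.popleft()
        while self.request_log and now - self.request_log[0] >= 86_400:
            self.request_log.popleft()

    def try_spend(self, tokens: int) -> bool:
        """Record the call and return True if it fits both caps, else False."""
        now = self.clock()
        self._prune(now)
        used = sum(t for _, t in self.token_log)
        if used + tokens > self.tpm or len(self.request_log) >= self.rpd:
            return False
        self.token_log.append((now, tokens))
        self.request_log.append(now)
        return True
```

A caller would check `budget.try_spend(estimated_tokens)` before each cloud request and stay on the local model, or wait, when it returns `False`.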

Put these three things together, and the economics change completely.

The Architecture - Tech Stack

First Iteration

Oracle Cloud ARM Instance — your always-on server. 4 CPU cores, 24 GB RAM, permanently free. Hosts everything. Never sleeps, never charges.
Ollama — runs open-weight language models locally on your server. No API calls, no cost, no data leaving your machine. The primary brain for most tasks.
Gemini API (free tier) — Google's model, used as a fallback when the local model is too slow or hits a complex task. 1,000 free requests per day, no credit card.
OpenClaw — the agent layer that ties everything together. Connects to Telegram, maintains memory across conversations, runs scheduled tasks, and routes requests between local and cloud models intelligently.
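The local-first, cloud-fallback idea in this stack can be sketched in a few lines. This is not OpenClaw's routing logic, which this article does not show. It is a minimal illustration under stated assumptions: the `route` function and its prompt-length heuristic are hypothetical, the model tag `llama3.2:3b` is just an example, and the two helper functions call Ollama's default local endpoint and the Gemini REST endpoint directly.

```python
import json
import urllib.request
from typing import Callable, Tuple


def ask_ollama(prompt: str, model: str = "llama3.2:3b", timeout: float = 60.0) -> str:
    """Call a local Ollama server's /api/generate endpoint (non-streaming)."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",  # Ollama's default port
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]


def ask_gemini(prompt: str, api_key: str, model: str = "gemini-2.5-flash-lite",
               timeout: float = 60.0) -> str:
    """Call the Gemini REST generateContent endpoint with an API key header."""
    url = ("https://generativelanguage.googleapis.com/v1beta/models/"
           f"{model}:generateContent")
    body = json.dumps({"contents": [{"parts": [{"text": prompt}]}]}).encode()
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "x-goog-api-key": api_key},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        data = json.load(resp)
    return data["candidates"][0]["content"]["parts"][0]["text"]


def route(prompt: str, local: Callable[[str], str], cloud: Callable[[str], str],
          max_local_chars: int = 2000) -> Tuple[str, str]:
    """Return (backend_name, answer).

    Hypothetical heuristic: long prompts skip the local model entirely,
    and any local failure (timeout, refused connection, malformed reply)
    falls through to the cloud backend.
    """
    if len(prompt) <= max_local_chars:
        try:
            return "local", local(prompt)
        except (OSError, KeyError, ValueError):
            pass  # fall through to the cloud backend
    return "cloud", cloud(prompt)
```

In use, you would pass `ask_ollama` as `local` and something like `lambda p: ask_gemini(p, api_key)` as `cloud`. Keeping the backends as plain callables makes the routing rule easy to test without a network connection.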

Second Iteration

The next iteration optimizes the Personal Agentic AI Assistant with llama.cpp, Gemma 4 12B, MCP, and Tavily. Details are in the follow-up article, Next-Iteration Improvements: Optimizing Personal Agentic AI Assistant.

What It Can Actually Do

This isn't just a toy setup. On this stack, you get:

Telegram access — message your agent from your phone, anywhere, like texting a person
Persistent memory — it remembers your preferences, ongoing projects, and past conversations
Web search — real-time search via Tavily's free tier integrated directly into responses
File operations — read, write, and summarise documents on the server
GitHub integration — search issues, review code, summarise pull requests
Scheduled tasks — set reminders, recurring summaries, automated workflows
Custom agents — define specialised subagents for specific tasks (code review, research, writing)
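To make the persistent memory item in the list above more concrete, here is one simple way such a feature could work: a small key-value store per user that gets prepended to each prompt. This is an illustration only. The `Memory` class, its table layout, and its method names are hypothetical, and nothing here describes how OpenClaw actually stores memory.

```python
import sqlite3


class Memory:
    """Tiny per-user fact store backed by SQLite."""

    def __init__(self, path: str = ":memory:"):
        # Pass a file path instead of ":memory:" to survive restarts.
        self.db = sqlite3.connect(path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS facts ("
            "user_id TEXT, key TEXT, value TEXT, "
            "PRIMARY KEY (user_id, key))"
        )

    def remember(self, user_id: str, key: str, value: str) -> None:
        # A newer value for the same (user, key) overwrites the old one.
        with self.db:
            self.db.execute(
                "INSERT OR REPLACE INTO facts VALUES (?, ?, ?)",
                (user_id, key, value),
            )

    def recall(self, user_id: str) -> dict:
        rows = self.db.execute(
            "SELECT key, value FROM facts WHERE user_id = ? ORDER BY key",
            (user_id,),
        )
        return dict(rows.fetchall())

    def as_prompt_prefix(self, user_id: str) -> str:
        """Format the stored facts as bullet lines to prepend to a prompt."""
        return "\n".join(f"- {k}: {v}" for k, v in self.recall(user_id).items())
```

The point of the sketch is the difference from passive chat history: the assistant looks the facts up itself on every request, so you never have to go and find them.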

What it can't do as well as a paid service: complex multi-step reasoning at speed, very long document analysis, and tasks that push the limits of a 3B parameter model. For those, the Gemini fallback steps in.

The Honest Tradeoffs

Zero cost doesn't mean zero compromise. Know what you're getting into:

Speed — local CPU inference is slower than cloud APIs. A response that takes a few seconds on Claude.ai might take more than 30 seconds locally. With Gemini as a fallback, complex tasks stay fast; simple tasks on the local model are slow but free.
Quality ceiling — a 3B local model is noticeably less capable than Claude Sonnet or GPT-4. For writing, summarisation, and Q&A, it's fine. For nuanced reasoning or complex code, it shows limitations.
Setup effort — this is not a five-minute install. There are VCN configurations, systemd services, API keys, and model downloads involved. It takes an afternoon to set up correctly. Once running, it requires minimal maintenance.
Oracle ARM capacity — Oracle's free ARM instances are in high demand. You may need to retry provisioning multiple times or upgrade to Pay As You Go (which still costs $0 for Always Free resources) to get reliable access.
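The speed tradeoff above comes down to simple arithmetic: generation time grows with the number of output tokens divided by how many tokens per second the backend produces. The sketch below shows that calculation. The throughput numbers are hypothetical placeholders chosen only for illustration, not measurements from this setup, so swap in figures you measure on your own server.

```python
def response_seconds(output_tokens: int, tokens_per_second: float,
                     prompt_seconds: float = 0.0) -> float:
    """Estimated wall-clock time: prompt processing plus generation."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return prompt_seconds + output_tokens / tokens_per_second


# Hypothetical throughput figures, for illustration only.
# Replace them with numbers measured on your own server and API.
LOCAL_TOKENS_PER_SECOND = 5.0
CLOUD_TOKENS_PER_SECOND = 100.0

for answer_tokens in (50, 200, 500):
    local = response_seconds(answer_tokens, LOCAL_TOKENS_PER_SECOND)
    cloud = response_seconds(answer_tokens, CLOUD_TOKENS_PER_SECOND)
    print(f"{answer_tokens:>4} tokens: local ~{local:.0f}s, cloud ~{cloud:.0f}s")
```

With these placeholder rates, a 200-token answer lands around 40 seconds locally, which is in line with the "more than 30 seconds" experience described above, and it shows why short answers are the natural fit for the local model.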

Who This Is For

It makes sense if:

You're comfortable with a terminal and basic Linux
You want AI infrastructure you actually control
You're experimenting and don't want ongoing costs
You're comfortable with slower responses in exchange for zero cost

It doesn't make sense if:

You need production-grade reliability
Response speed is critical
You want a turnkey experience with no configuration
You'd rather pay $10-20/month for something that just works

For the right person, this is the most interesting AI setup you can build right now. Not because it beats the paid alternatives on any individual metric, but because it's yours: running on your server, with your data, on your terms, for nothing. And most importantly, the assistant lives on a separate server, so the private data on your laptop stays well away from accidental exposure.

What's Next

This article is the first in a five-part series:

The Architecture ← you are here
Setting Up Free Cloud Server — VCN, ARM instances, static IPs, the gotchas
Running Ollama on ARM — model selection, disk management, CPU inference reality
Installing OpenClaw on Linux — avoiding every trap
Telegram Integration — Telegram, Gemini fallback, end-to-end testing

The complete series is out, and links are updated.

If you have reached this point, I have made a satisfactory effort to keep you reading. Please be kind enough to leave any comments or share any corrections.


Top comments (3)

Harjot Singh • May 31

A $0 agentic assistant is a great forcing function - the budget constraint makes you architect well instead of papering over inefficiency with a bigger model. Free tiers + local models + ruthless context discipline is exactly the stack that teaches you where the spend actually goes, because you feel every wasted call.

The architecture decisions that keep it at $0 are the same ones that keep a paid system cheap at scale: route trivial work to local/free, cache aggressively, and only escalate to a paid API for the rare hard call. The discipline you're forced into at $0 is what most people should be doing at $200/mo anyway. Same routing thesis behind Moonshift (prompt to a shipped SaaS on your own GitHub+Vercel). Looking forward to the rest of the series; what's handling the local inference - Ollama, or a free API tier? (Moonshift's first run's free if useful.)

AK DevCraft • Jun 1

Spot on! You hit the nail on the head. The $0 budget constraint completely changes how you think about architecture. When every unnecessary token or looped retry impacts your limited CPU cores or hits a hard rate limit, you are forced to design a tight system. You can’t just throw a 128k context window at a lazy prompt and call it a day.

To answer your question: for the local inference layer, I'm now bypassing Ollama because of its heavy CPU overhead and using a compiled llama.cpp server (llama-server) that runs as a background systemd service. I have it serving a quantized 3B model (Qwen 2.5 Coder 3B Instruct) bound to just 3 CPU cores on an Oracle Cloud free-tier instance. But 3B models are limited in how well they can use the agent toolkit; when that happens, the request falls back to the free API tier. In fact, I'm trying to create separate subagents to perform dedicated tasks based on model capabilities. I'll probably write another article covering the recent improvements.

Thanks for the callout on Moonshift! The concept of a prompt-to-deployed DAG pipeline pushing straight to your own GitHub/Vercel infra is super clean and aligns perfectly with keeping operational costs near zero. I'll definitely check it out.

AK DevCraft • May 12

Well! Quite late to the OpenClaw challenge, but sometimes it's better to put yourself out there than to never try at all.