Kevin Djabaku Ocansey
Building JarvisOS.

What is JarvisOS

Local models have been growing fast. Frameworks like Ollama make it easy to load and run models on a desktop or server, but last year I came across Cactus. It is an inference engine built specifically to run LLMs, vision models, and speech models on any smartphone, including low-range devices. Think of Cactus as Ollama, but for mobile.

Cactus provides SDKs in Flutter, Kotlin, and React Native that let developers build workflows with agentic tool use, RAG, and more. But those SDKs are app-level. They sit inside your application. There was nothing treating the phone itself as the compute platform — no system-level orchestration layer, no persistent agentic runtime that other apps could plug into.

That's the gap our team plans to fill. We're building an agentic system on top of Cactus, running as privileged Android system services. Everything stays on the device — no cloud routing, no API calls home. The phone isn't a remote interface to some server somewhere. It is the server.
We did some research, defined our requirements, and started building. The next section covers what you need to get the system running if you want to contribute.

JarvisOS architecture

Setup and Requirements

To build JarvisOS, you need real control over the OS. That means a custom Android distribution. We chose LineageOS — a free, open-source Android distribution that extends the functionality and lifespan of mobile devices from more than 20 manufacturers, and gives us the ability to modify the system server itself.

We started on LineageOS 23, ran into issues, and dropped back to LineageOS 21 for active development. LineageOS 22.1 is based on Android 15 QPR1, which is what we're targeting for actual device deployment when the time comes.

[Build for Nothing Phone — LineageOS docs]

Before anything else, you need the right development machine. Here's what it takes:

  • OS: Linux, Ubuntu 22.04 or newer. We used WSL2 on Windows

  • RAM: 16GB at minimum. However, for LineageOS 21 and up, 64GB is recommended (the less RAM you have, the longer the build will take; your machine will remind you of this constantly)

  • Storage: 400GB free for LineageOS 21 and up.

  • A Nothing Phone 2 (codename Pong) or a Google Pixel 6. Both have NPUs for on-device inference.

Once your machine is ready, the first real step is installing the repo tool. This is Google's tool for managing Android's hundreds of individual git repositories as one coordinated source tree.

Step 1: Initialise the LineageOS source

```shell
mkdir -p ~/android/lineage && cd ~/android/lineage
repo init -u https://github.com/LineageOS/android.git -b lineage-21.0 --git-lfs --no-clone-bundle
```

Step 2: Add the JarvisOS manifest

```shell
mkdir -p ~/android/lineage/.repo/local_manifests
cat > ~/android/lineage/.repo/local_manifests/jarvos.xml << 'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<manifest>
  <remote name="JarvisOs"
          fetch="ssh://git@github.com/"
          revision="main" />

  <project name="ocansey11/android_frameworks_base"
           path="frameworks/base"
           remote="JarvisOs"
           revision="lineage-21.0" />

  <project name="ocansey11/vendor_jarvisos"
           path="vendor/jarvisos"
           remote="JarvisOs"
           revision="main" />

  <project name="ocansey11/cactus"
           path="vendor/cactus"
           remote="JarvisOs"
           revision="main" />
</manifest>
EOF
```

The manifest file is what tells repo where to find each piece of the project and where to put it in your build tree. Each entry has three key things: the GitHub repo to clone (name), where it lands locally (path), and which branch to use (revision).

The remote block at the top just defines GitHub as the source so you don't have to repeat the URL for every project.

That's all it is — a map. Repo reads it, pulls everything down, and your tree is assembled.

Step 3: Sync

```shell
repo sync
```

You should end up with something like this:

```
~/android/lineage/                          ← LineageOS build root
│
├── frameworks/
│   └── base/                              ← ocansey11/android_frameworks_base
│       └── services/core/java/com/android/server/rag/  ← Jarvis as a system service
│           ├── RagService.java
│           ├── core/
│           ├── indexing/
│           ├── inference/
│           ├── model/
│           ├── search/
│           └── tools/
│
└── vendor/
    ├── jarvisos/                          ← ocansey11/vendor_jarvisos
    │   ├── sepolicy/
    │   ├── prebuilts/objectbox/
    │   └── (documentation)
    │
    └── cactus/                            ← ocansey11/cactus
```

System Server and Architecture — How We Built It and Why

We had already explored using Cactus at the application level. In JarvisOS, Jarvis runs as a persistent background service. Think of it as an extension of system_server — the same process that runs ActivityManager, WindowManager, and every other privileged Android service. Our RagService boots alongside those on startup.

Communication follows the standard Android pattern: apps talk to RagManager (our public API), which crosses the process boundary via Binder IPC using an AIDL interface, landing inside RagService which orchestrates everything — RAG pipeline, model selection, tool dispatch.
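To make that boundary concrete, here is a plain-Java sketch of the manager/service call pattern. This is not the real Binder plumbing (no AIDL compiler, no ServiceManager lookup), just the shape of the API surface, and all names here are illustrative.

```java
// IRagService mirrors the AIDL contract; in the real system RagManager would
// hold a Binder proxy obtained from ServiceManager rather than a direct reference.
interface IRagService {
    String query(String prompt);
}

class RagManager {
    private final IRagService service;

    RagManager(IRagService service) {
        this.service = service;
    }

    // In Android this call would cross the process boundary via Binder IPC.
    String query(String prompt) {
        return service.query(prompt);
    }
}

public class BinderPatternSketch {
    public static void main(String[] args) {
        // Stand-in for RagService running inside system_server.
        IRagService service = prompt -> "[rag-context] " + prompt;
        RagManager manager = new RagManager(service);
        System.out.println(manager.query("summarise my notes"));
    }
}
```

The point of the wrapper is that apps never see the Binder details: they call a plain method on RagManager and the IPC happens underneath.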

Is this the most performant approach possible?

Honestly, no.

The fastest possible on-device AI pipeline would be purpose-built native code: custom C++ inference with hand-tuned ARM kernels talking directly to hardware accelerators, zero JVM overhead, no Binder serialisation costs on every call. That's what a team at Qualcomm or Google would ship.

What we have instead is a Java system service calling into Cactus through a JNI wrapper, with Binder IPC adding latency on every query boundary. Each tool dispatch adds a broadcast round-trip with a 10-second timeout window.
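As a rough sketch of that dispatch cost, here is what a timeout-bounded call can look like in plain Java. A CompletableFuture stands in for the broadcast round-trip; the 10-second window matches the text, and everything else is illustrative.

```java
import java.util.concurrent.*;

// Sketch of a timeout-bounded tool dispatch. In JarvisOS the round-trip is an
// Android broadcast to the app that registered the tool; here a Callable
// running on another thread stands in for it.
public class ToolDispatchSketch {
    static final long TIMEOUT_SECONDS = 10;

    static String dispatch(Callable<String> tool) {
        CompletableFuture<String> result = CompletableFuture.supplyAsync(() -> {
            try {
                return tool.call();   // the app-side tool handler
            } catch (Exception e) {
                throw new CompletionException(e);
            }
        });
        try {
            // Block until the tool replies or the timeout window closes.
            return result.get(TIMEOUT_SECONDS, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            return "ERROR: tool did not respond within " + TIMEOUT_SECONDS + "s";
        } catch (Exception e) {
            return "ERROR: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(dispatch(() -> "3 events found"));
    }
}
```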

The reason is that I personally don't have deep experience with Android systems programming, so I lean on books like

  • Android Systems Programming by Roger Ye
  • Effective Java by Joshua Bloch

Along with Claude Code to help architect the specs and code the various pipelines.

Also, Cactus already solved the hard part — ARM SIMD kernels, KV cache quantisation.

Building on top of it meant we could focus on the thing that actually didn't exist: the agentic orchestration layer. The system server approach gave us process isolation, kernel-enforced permissions, and a persistent runtime that survives app restarts — things you can't get from an app-level SDK.

We want to start somewhere. And this is a solid somewhere. After enough experience we can figure out how to do this from scratch properly. Think of it as our experiment to figure out if it can actually work.

Jarvis system service architecture

Below is the new background service we are adding. It extends system_server to handle RAG, persistent memory, and tool dispatch: the foundation for agentic behaviour.

```
frameworks/base/services/core/java/com/android/server/rag/
│
├── RagService.java          ← Main orchestrator, entry point
├── IRagService.aidl         ← Binder contract
├── Android.bp               ← Build file
│
├── core/
│   ├── RagManager.java      ← Public API wrapper
│   ├── RagException.java    ← Exception definitions
│   ├── JarvisStore.java     ← ObjectBox store init
│   ├── ModelRegistry.java   ← Manages model + index handle pairs
│   └── IndexQueue.java      ← Queues indexing tasks
│
├── indexing/
│   ├── RagIndexWorker.java  ← Processes the index queue
│   ├── TextExtractor.java   ← Handles multiple file types
│   ├── ChunkingStrategy.java ← Splits documents into chunks
│   └── JarvisFileObserver.java ← Watches filesystem for changes
│
├── search/
│   └── MetadataSearch.java  ← Metadata-based search
│
├── model/
│   ├── SourceFile.java      ← ObjectBox entity
│   ├── DocumentChunk.java   ← ObjectBox entity
│   ├── Folder.java          ← ObjectBox entity
│   ├── Chunk.java           ← ObjectBox entity
│   ├── Conversation.java    ← ObjectBox entity
│   ├── Message.java         ← ObjectBox entity
│   ├── UserContext.java     ← ObjectBox entity
│   ├── AccessLog.java       ← ObjectBox entity
│   └── TaskMemory.java      ← ObjectBox entity
│
├── inference/
│   └── CactusWrapper.java   ← Single entry point to Cactus
│
└── tools/                   ← Managing tools from apps
    ├── AppRecord.java        ← One entry per installed app
    ├── ToolRecord.java       ← One entry per tool
    ├── ToolScannerService.java ← Scans APKs on install
    └── ToolDispatcher.java   ← Resolves + fires tools
```

Vendor — The Supporting Layer

vendor/jarvisos holds everything that supports the system services but isn't part of them:

Sepolicy — Telling Android to Trust Us

Android doesn't trust new system services by default. SELinux, the security layer baked into every Android device, enforces strict rules about what each process is allowed to do. Without the right policy, our service would be blocked from reading files, sending broadcasts, or talking to other services, regardless of what the code says.

The sepolicy rules in vendor/jarvisos/sepolicy/ are what grant JarvisOS the permissions it needs at the OS level.
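For flavour, a rule in that policy might look like the fragment below. This is an illustrative sketch in standard SELinux policy syntax, not the actual contents of vendor/jarvisos/sepolicy/; the type name is made up.

```
# Illustrative only: declare a service type for the new system service
# (using standard AOSP service attributes) and let system_server read
# app data files. Not the actual JarvisOS policy.
type jarvis_rag_service, system_api_service, system_server_service, service_manager_type;

allow system_server app_data_file:file { read open getattr };
```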

ObjectBox — The Database Layer and Where We're Headed

The compiled ObjectBox libraries sit in vendor/jarvisos/prebuilts/objectbox/ and get picked up at build time. At runtime they power everything in the model/ folder: SourceFile, DocumentChunk, Conversation, TaskMemory, AccessLog. ObjectBox is an embedded database that handles both structured queries and vector search in the same store.

Our indexer stores metadata and pointers: never content, never embeddings. SourceFile holds file paths, hash, and MIME type. DocumentChunk stores a short summary and a cactusIndexId, an integer pointer into Cactus's binary index. The actual embedding lives in Cactus.

When RagIndexWorker indexes a file, it calls CactusWrapper.embed() to generate the vector, then CactusWrapper.indexAdd() to store it.

ObjectBox only keeps the ID that points there.
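That split looks roughly like this in plain Java, with Cactus and ObjectBox replaced by in-memory stand-ins. The method names (embed, indexAdd) follow the text; the placeholder embedding and entity fields are illustrative.

```java
import java.util.*;

// In-memory stand-in for the Cactus side of the indexing flow.
class CactusStub {
    private final List<float[]> index = new ArrayList<>();

    float[] embed(String text) {
        // Real Cactus returns a model embedding; this is a dummy vector.
        return new float[] { text.length(), text.hashCode() % 1000 };
    }

    int indexAdd(float[] vector) {
        index.add(vector);
        return index.size() - 1;  // integer handle into the "binary index"
    }
}

// Mirrors DocumentChunk: a short summary plus a pointer, never the vector.
record DocumentChunk(String summary, int cactusIndexId) {}

public class IndexFlowSketch {
    public static void main(String[] args) {
        CactusStub cactus = new CactusStub();

        float[] vector = cactus.embed("Meeting notes from Tuesday"); // CactusWrapper.embed()
        int id = cactus.indexAdd(vector);                            // CactusWrapper.indexAdd()
        DocumentChunk chunk = new DocumentChunk("meeting notes", id);

        System.out.println("stored chunk -> cactusIndexId=" + chunk.cactusIndexId());
    }
}
```

The entity on the ObjectBox side never holds the vector, only the integer handle Cactus returned.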

This separation also sets up where we want to go. Traditional RAG retrieves chunks that look similar. That's useful, but it can't connect facts that live in different documents. And on mobile, users may refer to the same task in different ways, so semantic similarity alone may not be enough.

GraphRAG adds a knowledge graph between the indexer (ObjectBox) and the retriever (Cactus), so instead of returning isolated chunks it returns entities and the relationships between them. ObjectBox is already positioned to store that graph layer. We haven't built it yet, but the architecture doesn't need to change to get there.

Traditional RAG vs GraphRAG

Cactus — The Engine

vendor/cactus is our fork of the Cactus inference engine. At runtime it powers everything that requires a model — embeddings, vector search, and LLM completions. The entire codebase talks to it through exactly one file: CactusWrapper.java (in the inference/ folder under frameworks/base).

That wrapper exposes a small set of primitives we actually use. init loads a model and returns a handle. embed turns text into a float array. indexInit, indexAdd and indexQuery manage the vector index. complete runs inference with optional RAG context and tool definitions injected as a system message.
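To illustrate what indexQuery does conceptually, here is a brute-force nearest-neighbour lookup in plain Java. The real Cactus index is optimised native code; this sketch only shows the idea, and the map-based "index" is illustrative.

```java
import java.util.*;

// Conceptual sketch of indexQuery: score every stored vector against the
// query by cosine similarity and return the id of the best match.
public class IndexQuerySketch {
    static double cosine(float[] a, float[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na  += a[i] * a[i];
            nb  += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Returns the id of the stored vector most similar to the query.
    static int indexQuery(Map<Integer, float[]> index, float[] query) {
        int best = -1;
        double bestScore = -2;  // cosine similarity is always >= -1
        for (Map.Entry<Integer, float[]> e : index.entrySet()) {
            double s = cosine(e.getValue(), query);
            if (s > bestScore) {
                bestScore = s;
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<Integer, float[]> index = new HashMap<>();
        index.put(0, new float[] { 1f, 0f });
        index.put(1, new float[] { 0f, 1f });
        // The query points almost the same way as vector 0.
        System.out.println(indexQuery(index, new float[] { 0.9f, 0.1f }));
    }
}
```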

Everything goes through JNI — Java calls into the Cactus C++ engine via native methods. It's blocking by design. RagIndexWorker and RagService already run on background threads so that's fine.

The wrapper is intentionally thin. Cactus handles the hard parts — ARM kernels, quantisation, attention. We handle the orchestration above it. If we ever need to swap Cactus out or contribute changes upstream, there's exactly one folder to touch.

The Changing Landscape of Small Models

The SLM space is moving fast. Genuinely fast. Gemma 4 dropped this week — four sizes, with the E2B and E4B built specifically for on-device use. It is up to 4x faster than the previous version and uses up to 60% less battery. The entire family moves beyond simple chat to handle complex logic and agentic workflows, with native function calling built in.

For JarvisOS that is significant. We currently manage multiple model and index handle pairs through ModelRegistry because different tasks need different models. A capable single model that handles inference, embeddings, and tool calling natively starts to change that equation. And on top of that Gemma is multimodal.

But this is exactly why moving slowly and deliberately matters. The landscape shifts every few months. If we had tightly coupled our architecture to a specific model six months ago we would be rewriting it now. Instead CactusWrapper is a clean boundary, ModelRegistry is flexible, and swapping in Gemma 4 or whatever comes next is a simple configuration change.

Important research and papers released recently

Running a model on a phone is a different problem to running one in the cloud. Every token generated costs compute. Every token the model has already seen costs memory. On a server you throw hardware at both problems. On a phone you can't. Two research directions we've been following attack this directly.

CALM

Continuous Autoregressive Language Models (CALM) is a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, which lets the model treat language as a sequence of continuous vectors instead of discrete tokens (Shao et al., 2025).

As the research space grows, this technique will be necessary to further reduce the number of autoregressive steps models need to take, making on-device inference faster and more practical without requiring larger hardware.

Read about CALM here CALM Paper

CALM Continuous Autoregressive Language Models

TurboQuant

Quantization compresses LLMs so they can run on devices with limited RAM, but there's a tradeoff. Compressing weights from 16 bits to 4 bits means rounding values, and those small errors compound across thousands of operations, quietly degrading accuracy. Google's TurboQuant tackles this differently by compressing the KV cache specifically down to 3 bits, with no training or fine-tuning required and no measurable accuracy loss.
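The rounding tradeoff is easy to see with generic uniform quantisation (to be clear, this is the baseline problem, not TurboQuant's algorithm):

```java
// Toy illustration of quantisation error: mapping floats onto a small number
// of discrete levels introduces rounding error that grows as bits shrink.
public class QuantSketch {
    // Quantise x in [min, max] to `bits` bits, then reconstruct it.
    static float roundTrip(float x, float min, float max, int bits) {
        int levels = (1 << bits) - 1;         // e.g. 15 levels for 4 bits
        float step = (max - min) / levels;
        int q = Math.round((x - min) / step); // discrete code
        return min + q * step;                // dequantised value
    }

    public static void main(String[] args) {
        float x = 0.3f;
        float q4 = roundTrip(x, -1f, 1f, 4);
        float q8 = roundTrip(x, -1f, 1f, 8);
        System.out.printf("original=%.4f  4-bit=%.4f  8-bit=%.4f%n", x, q4, q8);
    }
}
```

Run it and the 4-bit reconstruction lands visibly further from the original than the 8-bit one; in a model, thousands of such errors compound per forward pass.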

TurboQuant: Redefining AI efficiency with extreme compression

Reinforcement Learning

Reinforcement learning is how you make a system improve through experience rather than explicit programming. For JarvisOS, the practical application is tool selection. Right now ToolDispatcher picks tools based on semantic similarity. Over time, it should learn which tools actually produce good outcomes for which queries. RL gives you that feedback loop without needing labelled training data, just signals from what worked and what didn't. One paper to start with:
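One minimal shape that feedback loop could take is an epsilon-greedy bandit over tools. This is a sketch, not part of JarvisOS; the tool names, rewards, and parameters are all made up.

```java
import java.util.*;

// Epsilon-greedy bandit: mostly pick the tool with the best mean reward,
// occasionally explore a random one. Rewards come from outcome signals
// (e.g. the user accepted or rejected a tool's result).
public class ToolBanditSketch {
    private final Map<String, double[]> stats = new HashMap<>(); // tool -> {reward sum, pulls}
    private final Random rng;
    private final double epsilon;

    ToolBanditSketch(double epsilon, long seed) {
        this.epsilon = epsilon;
        this.rng = new Random(seed);
    }

    String pick(List<String> tools) {
        if (rng.nextDouble() < epsilon) {             // explore
            return tools.get(rng.nextInt(tools.size()));
        }
        return tools.stream()                          // exploit: best mean reward
                .max(Comparator.comparingDouble(this::meanReward))
                .orElseThrow();
    }

    void feedback(String tool, double reward) {        // signal from what worked
        double[] s = stats.computeIfAbsent(tool, k -> new double[2]);
        s[0] += reward;
        s[1] += 1;
    }

    double meanReward(String tool) {
        double[] s = stats.getOrDefault(tool, new double[] { 0, 0 });
        return s[1] == 0 ? 0 : s[0] / s[1];
    }

    public static void main(String[] args) {
        ToolBanditSketch bandit = new ToolBanditSketch(0.1, 42);
        List<String> tools = List.of("calendar.create", "notes.search");
        bandit.feedback("calendar.create", 1.0);  // user accepted the result
        bandit.feedback("notes.search", 0.0);     // user rejected it
        System.out.println(bandit.pick(tools));   // usually calendar.create
    }
}
```

In a real dispatcher you would condition the statistics on the query, not keep one global table, but the loop is the same: pick, observe, update.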

Reinforcement Learning for Strategic Tool Use in LLMs

Federated learning

Federated learning is a technique to improve models without centralising data. Each device trains locally on its own interactions and shares only the model updates, never the raw data. The common assumption is that this requires a central server to aggregate those updates — but recent research shows that's not actually necessary. Serverless approaches like Plexus demonstrate that devices can coordinate directly with each other peer-to-peer, with no central infrastructure at all. (Dhasade et al., 2025)

Practical Federated Learning without a Server

For JarvisOS that matters. A privacy-first OS that routes model improvements through a central server would be contradicting its own premise. Peer-to-peer federated learning means phones running JarvisOS could get smarter over time from each other, without anyone's data ever leaving their device.

Final Remarks

Right now every app that wants local AI capabilities bundles its own model. One user will have GPT, Claude, DeepSeek, Perplexity, and so on. Imagine a notes app requiring the user to download a model. A calendar app downloads a model. A navigation app downloads another. The phone ends up with multiple copies of similar models eating storage and RAM, each isolated, none aware of the others. That's the wrong direction.

MCP — the Model Context Protocol popularised by Anthropic — tries to solve a similar problem on the desktop. It lets AI models connect to external tools and services through a standardised interface. It's a good idea, but it's designed for a world where the model lives on a server and tools are remote services. On a phone, running an HTTP server in the background is exactly the kind of thing Android will kill to save battery.

JarvisOS takes a different position. The model lives at the OS level. Apps don't download models — they register tools. A calendar app tells JarvisOS what it can do by declaring capabilities in its manifest, and JarvisOS handles the intelligence. Binder IPC replaces HTTP — it's kernel-enforced, microsecond latency, and Android won't kill it because it's a system service.

The shift this requires from app developers is actually small. Instead of building AI into your app, you describe what your app can do: well-defined actions with clear inputs and outputs, declared in your manifest. Think of it like a ContentProvider, but for intent — you're not exposing data, you're exposing capability. JarvisOS figures out when to call it. This way apps get smarter without carrying the weight of a model, and the phone gets a single intelligence layer that sees across all of them rather than being fragmented across dozens of isolated AI stacks.
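For illustration only, a capability declaration could look something like the fragment below. None of these tag or attribute names are a real JarvisOS or Android API; the schema is something we would have to define, and ToolScannerService would read it on install.

```xml
<!-- Hypothetical AndroidManifest.xml fragment: a calendar app declaring a
     capability for JarvisOS to discover. Names are illustrative only. -->
<application>
    <meta-data
        android:name="jarvisos.tools"
        android:resource="@xml/jarvis_tools" />
</application>

<!-- res/xml/jarvis_tools.xml: one entry per well-defined action -->
<tools>
    <tool
        name="calendar.create_event"
        description="Create a calendar event"
        input="title, start_time, end_time"
        output="event_id" />
</tools>
```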

References

Cactus Compute. (2025). Cactus: AI Inference Engine for Phones & Wearables. https://github.com/cactus-compute/cactus

Cactus Compute. (2025). Cactus v1: Cross-Platform LLM Inference on Mobile. https://cactuscompute.com

LineageOS. (2024). Changelog 29: LineageOS 22.1 based on Android 15 QPR1. https://lineageos.org/Changelog-29/

LineageOS Wiki. (2025). Build for Nothing Phone (Pong). https://wiki.lineageos.org/devices/Pong/build/

Android Authority. (2025). Nothing Phone (3): Snapdragon 8s Gen 4. https://androidauthority.com/nothing-phone-3-snapdragon-chip-3568225/

Google DeepMind. (2026). Gemma 4: Byte for byte, the most capable open models. https://blog.google/innovation-and-ai/technology/developers-tools/gemma-4/

Google Android Developers. (2026). Gemma 4: The new standard for local agentic intelligence on Android. https://android-developers.googleblog.com/2026/04/gemma-4-new-standard-for-local-agentic-intelligence.html

Shao, Z., et al. (2025). Continuous Autoregressive Language Models (CALM). https://arxiv.org/abs/2510.27688

Zandieh, A., & Mirrokni, V., et al. (2026). TurboQuant: Redefining AI efficiency with extreme compression. https://research.google/blog/turboquant-redefining-ai-efficiency-with-extreme-compression/

Chi, J., & Zhong, W. (2025). ReTool: Reinforcement Learning for Strategic Tool Use in LLMs. https://arxiv.org/abs/2504.11536

Daly, et al. (2024). Federated Learning in Practice. https://arxiv.org/abs/2410.08892

Dhasade, A., et al. (2025). Practical Federated Learning without a Server. https://arxiv.org/abs/2503.05509

ObjectBox. (2024). ObjectBox: On-device vector database. https://objectbox.io

Acknowledgements
The following articles informed the research and visuals used in this post:
Chitalia, R. (2024). An Introduction to RAG and Simple Complex RAG. Medium. https://medium.com/enterprise-rag/an-introduction-to-rag-and-simple-complex-rag-9c3aa9bd017b

LearnByBuilding. (2024). RAG from Scratch. https://learnbybuilding.ai/tutorial/rag-from-scratch/

GradientFlow. (2024). Techniques, Challenges and Future of Augmented Language Models. https://gradientflow.com/techniques-challenges-and-future-of-augmented-language-models/

Siddique, Y. (2024). Tool Calling for LLMs: A Detailed Tutorial. Medium. https://medium.com/@yasir_siddique/tool-calling-for-llms-a-detailed-tutorial-a2b4d78633e2

LLaVA Team. (2024). LLaVA-NeXT Video. https://llava-vl.github.io/blog/2024-04-30-llava-next-video/

Shemet, R. (2025). Cactus On-Device Inference. HuggingFace. https://huggingface.co/blog/rshemet/cactus-on-device-inference

DigitalOcean. (2024). Model Quantization for Large Language Models. https://www.digitalocean.com/community/tutorials/model-quantization-large-language-models

CodeToDeploy. (2024). How On-Device LLMs Rewrite the Rules of App Development. Medium. https://medium.com/codetodeploy/how-on-device-llms-rewrite-the-rules-of-app-development-ad1fe44e64c4
