Azura: local-first personal assistant (feedback wanted)

Hey devs šŸ‘‹

I'm working solo on a project called Azura and I’d love blunt technical + product feedback before I go too deep.

TL;DR

  • Local-first personal AI assistant (Windows / macOS / Linux)
  • Runs 7B-class models locally on your own machine
  • Optional cloud inference with 70B+ models (potentially up to ~120B if I can get a GPU cluster cheap enough)
  • Cloud only sees temporary context for a given query, then it’s gone
  • Goal: let AI work with highly personalized data while keeping your data on-device and making AI compute more sustainable by offloading work to the user’s hardware

Think of it as Signal, but for AI:

  • private by default
  • transparent about what leaves your device
  • and actually usable as a daily ā€œsecond brainā€.

Problem I’m trying to solve

Most AI tools today:

  • ship all your prompts and files to a remote server
  • keep embeddings / logs indefinitely
  • centralize all compute in big datacenters

That’s bad if you want to:

  • use AI on sensitive data (legal docs, internal company info, personal notes)
  • build a long-term memory of your life and work
  • not rely 100% on someone else’s infrastructure for every tiny inference

On top of that, current AI usage is very cloud-heavy. Every small task hits a GPU in a datacenter, even when a smaller local model would be good enough.

Azura’s goal:

Let AI work deeply with your personal data while keeping that data on your device by default, and offload as much work as possible to the user’s hardware to make AI more sustainable.


Core concept

Azura has two main execution paths:

  1. Local path (default)

    • Desktop app (Win / macOS / Linux)
    • Local backend (Rust / llama.cpp / vector DB)
    • Uses a 7B model running on your machine
    • Good for:
      • day-to-day chat
      • note-taking / journaling
      • searching your own docs/files
      • ā€œsecond brainā€ queries that don’t need super high IQ
  2. Cloud inference path (optional)

    • When a query is too complex / heavy for the local 7B:
      • Azura builds a minimal context (chunks of docs, metadata, etc.)
      • Sends that context + query to a 70B+ model in the cloud (ideally up to ~120B later)
    • Data handling:
      • Files / context are used only temporarily for that request
      • Held in memory or short-lived storage just long enough to run the inference
      • Then discarded – no long-term cloud memory of your life
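To make the routing idea concrete, here's a very rough Rust sketch of the local-vs-cloud decision plus the ephemeral context. Everything in it (the `Route` enum, the heuristic, the struct shapes) is a placeholder for illustration, not actual Azura code:

```rust
// Rough sketch of the local-vs-cloud routing plus ephemeral context.
// The heuristic, names, and shapes are placeholders, not real Azura code.

enum Route {
    Local, // 7B model via llama.cpp on the user's machine
    Cloud, // 70B+ model, ephemeral context only
}

struct Query {
    text: String,
    needs_long_context: bool,
    needs_heavy_reasoning: bool,
}

/// Ephemeral context assembled per request: it lives only for the
/// duration of the cloud call and is dropped right after the response.
struct EphemeralContext {
    chunks: Vec<String>,   // doc chunks retrieved on-device
    metadata: Vec<String>, // tags, timestamps, entities
}

fn route(query: &Query) -> Route {
    // Stay local unless the query clearly exceeds what the 7B handles well.
    // Real routing would probably combine token counts, task type,
    // and maybe a small classifier.
    if query.needs_heavy_reasoning || query.needs_long_context {
        Route::Cloud
    } else {
        Route::Local
    }
}

fn answer(query: Query) -> String {
    match route(&query) {
        Route::Local => {
            // call into the local llama.cpp backend (omitted here)
            format!("[local 7B] {}", query.text)
        }
        Route::Cloud => {
            // Build the minimal context, send it, then let it go out of scope.
            let ctx = EphemeralContext {
                chunks: vec![/* retrieved on-device */],
                metadata: vec![],
            };
            let response = format!(
                "[cloud 70B] {} ({} chunks sent)",
                query.text,
                ctx.chunks.len()
            );
            drop(ctx); // context discarded; no long-term cloud memory
            response
        }
    }
}

fn main() {
    let q = Query {
        text: "Summarize my notes from last week".to_string(),
        needs_long_context: false,
        needs_heavy_reasoning: false,
    };
    println!("{}", answer(q));
}
```

The key point is the explicit `drop(ctx)`: the cloud context is assembled per request and thrown away as soon as the response comes back.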

Context engine (high-level idea)

On top of ā€œjust call an LLMā€, I’m experimenting with a structured context engine:

  • Ingests:
    • files, PDFs, notes, images
  • Stores:
    • embeddings + metadata (timestamps, tags, entities, locations)
  • Builds:
    • a lightweight relationship graph (people, projects, events, topics)
  • Answers:
    • ā€œWhat did I do for project A in March?ā€
    • ā€œShow me everything related to ā€˜Company A’ and ā€˜pricing’.ā€
    • ā€œWhat did I wear at the gala in Tokyo?ā€ (from ingested images)

Standard RAG is part of this, but the goal is an ongoing personal knowledge base that the LLM can query, not just a vector search API.

All of this long-term data lives on-device.
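To make that a bit more concrete, here's a rough sketch of the kind of on-device structures I'm thinking about for the context engine. Field names, types, and the `related_chunks` helper are all illustrative, not a final schema:

```rust
// Rough sketch of the on-device context engine schema.
// Field names, types, and the query helper are illustrative, not final.

use std::collections::HashMap;

/// One ingested chunk of a file, note, or image caption.
struct Chunk {
    id: u64,
    source_path: String,
    text: String,
    embedding: Vec<f32>,   // stored in the local vector DB
    created_at: u64,       // unix timestamp
    tags: Vec<String>,
    entities: Vec<String>, // people, projects, places mentioned
}

/// Lightweight relationship graph on top of the chunks.
#[derive(Hash, PartialEq, Eq, Clone)]
enum NodeKind {
    Person,
    Project,
    Event,
    Topic,
}

#[derive(Hash, PartialEq, Eq, Clone)]
struct Node {
    kind: NodeKind,
    name: String,
}

struct Graph {
    // node -> chunk ids that mention it
    mentions: HashMap<Node, Vec<u64>>,
    // node -> related nodes (co-occurrence, explicit links, ...)
    edges: HashMap<Node, Vec<Node>>,
}

impl Graph {
    /// "Show me everything related to 'Company A' and 'pricing'":
    /// intersect the chunk sets of the two nodes.
    fn related_chunks(&self, a: &Node, b: &Node) -> Vec<u64> {
        let empty = Vec::new();
        let set_a = self.mentions.get(a).unwrap_or(&empty);
        let set_b = self.mentions.get(b).unwrap_or(&empty);
        set_a
            .iter()
            .filter(|id| set_b.contains(*id))
            .copied()
            .collect()
    }
}

fn main() {
    let company = Node { kind: NodeKind::Topic, name: "Company A".into() };
    let pricing = Node { kind: NodeKind::Topic, name: "pricing".into() };

    let mut mentions = HashMap::new();
    mentions.insert(company.clone(), vec![1, 2, 3]);
    mentions.insert(pricing.clone(), vec![2, 3, 4]);

    let graph = Graph { mentions, edges: HashMap::new() };
    println!("{:?}", graph.related_chunks(&company, &pricing)); // [2, 3]
}
```

The embeddings handle fuzzy recall, while the graph handles the ā€œeverything related to X and Yā€ style of question that pure vector search tends to answer badly.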


Sustainability angle (important to me)

Part of the vision is:

  • Don’t hit a giant GPU cluster for every small query.
  • Let the user’s device handle as much as possible (7B locally).
  • Use big cloud models only when they actually add value.

Over time, I’d like Azura to feel like a hybrid compute layer:

  • Local where possible
  • Cloud only for heavy stuff
  • Always explicit and transparent
  • And most of all, PRIVATE
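For the ā€œexplicit and transparentā€ part, the rough idea is to show exactly what would leave the device and ask first. A minimal sketch, assuming a hypothetical `CloudRequest` shape and a simple terminal confirmation (the real thing would be a UI prompt):

```rust
// Sketch of the "explicit and transparent" part: before anything leaves the
// device, show the user exactly what would be sent and ask for confirmation.
// The CloudRequest shape and the confirm step are assumptions, not real code.

use std::io::{self, Write};

struct CloudRequest {
    model: String, // e.g. "70b-cloud"
    query: String,
    context_chunks: Vec<String>,
}

/// Print a summary of the outgoing request and ask the user to confirm.
fn confirm_upload(req: &CloudRequest) -> io::Result<bool> {
    println!("About to send to cloud model '{}':", req.model);
    println!("  query: {}", req.query);
    println!(
        "  context: {} chunk(s), {} bytes total",
        req.context_chunks.len(),
        req.context_chunks.iter().map(|c| c.len()).sum::<usize>()
    );
    print!("Proceed? [y/N] ");
    io::stdout().flush()?;

    let mut answer = String::new();
    io::stdin().read_line(&mut answer)?;
    Ok(answer.trim().eq_ignore_ascii_case("y"))
}

fn main() -> io::Result<()> {
    let req = CloudRequest {
        model: "70b-cloud".to_string(),
        query: "Compare these two contracts".to_string(),
        context_chunks: vec!["chunk one".to_string(), "chunk two".to_string()],
    };
    if confirm_upload(&req)? {
        println!("sending to cloud...");
    } else {
        println!("staying local.");
    }
    Ok(())
}
```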

What I’d love feedback on

  1. Architecture sanity

    • Does the ā€œlocal-first + direct cloud inferenceā€ setup look sane?
    • Any better patterns you’ve used for mixing on-device models with cloud models?
  2. Security + privacy model

    • For ephemeral cloud context: what would you need to see (docs / guarantees / telemetry) to trust this?
    • Anything obvious I’m missing around temporary file handling?
  3. Sustainability / cost

    • As engineers: do you care about offloading compute to end-user devices vs fully cloud?
    • Any horror stories optimizing 7B vs 70B usage that I should know about?
  4. Would you actually use this?

    • If you’re into self-hosting / local LLMs:
      • What’s missing for this to replace ā€œOllama + notebook + random SaaSā€ for you?

Next steps

Right now I’m:

  • Testing 7B models on typical consumer hardware
  • Designing the first version of the context engine and schema

If this resonates, I’d really appreciate:

  • Architecture critiques
  • ā€œThis will break because Xā€ comments
  • Ideas for must-have features for a real, daily-use personal AI

Thanks for reading šŸ™

Happy to dive into any part in more detail if you’re curious.
