vivek

I Tried to Build an Alexa with Real Memory — Here's What I Learned After 3 Months of Failure.

A story about LangGraph, memory architecture, and why I stopped fighting LLMs and made the system predictable instead

It Started With a Simple Frustration
I wanted to build something like Alexa — but smarter. Not just a voice assistant that forgets you the moment the session ends. Not an AI that stores your entire conversation history in a text file and calls it "memory."
I wanted a personal AI that actually knows you — your habits, your preferences, your tasks — and gets smarter over time the way a real assistant would.
Sounds simple. It wasn't.

Step 1: How Does Alexa Even Work?
Before building anything, I went deep on the Alexa cloud architecture. The model is clean: your voice query goes to the cloud, gets processed, hits an LLM, and the response streams back to the device. The device itself is thin — all the intelligence lives on the server.
Okay. So I needed to build the server layer. But when I started thinking about where memory fits in, I hit the first real wall.
Where does memory live? And more importantly — what even IS memory for a personal AI?

Step 2: What Should a Personal AI Actually Remember?
This is the question most AI projects skip. They just store everything — every message, every session — and call it memory. But that's just a log file. That's not memory.
I spent time thinking about what actually matters for a personal AI. What does a good human assistant remember about you?
After a lot of thinking, I landed on four categories:
Identity — who you are, your name, role, basic facts
Habits — things you do regularly, routines
Preferences — how you like things done, what you enjoy
Events & Tasks — things on your calendar, things you need to do
Everything else is noise. Most of what you say to an AI doesn't need to be stored. This felt like a small insight at the time — it turned out to be the most important design decision in the whole project.
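A minimal sketch of how these four categories might be modeled in code. The enum and record names here are my own, not Orion's actual schema:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class MemoryKind(Enum):
    """The four categories worth persisting; everything else is noise."""
    IDENTITY = "identity"      # who you are: name, role, basic facts
    HABIT = "habit"            # routines, recurring behaviour
    PREFERENCE = "preference"  # how you like things done, what you enjoy
    EVENT_TASK = "event_task"  # calendar entries, to-dos


@dataclass
class MemoryRecord:
    """One stored memory: a category label plus free-text content."""
    kind: MemoryKind
    content: str
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


note = MemoryRecord(MemoryKind.PREFERENCE, "prefers morning gym sessions")
```

Forcing every write through a closed set of categories is what makes the later storage and routing decisions tractable.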

Step 3: Where to Store It — SQL vs Vector DB
Now I had to figure out where to actually store these four types of memory.
My first instinct was a SQL database. Clean tables, structured data, easy to query. But I quickly hit a problem: you can't query a SQL database with natural language directly. You need to know the exact keys, the exact column names. That doesn't work when a user says "remind me what I told you about my gym schedule."
For natural language retrieval, you need vector search — you embed the query and the stored memories as vectors and find semantic matches.
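That retrieval idea, embed the query and each memory and rank by semantic closeness, reduces to cosine similarity over vectors. A toy sketch with hand-made 3-dimensional "embeddings" (real ones would come from an embedding model like Jina's):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings.
memories = {
    "goes to the gym on weekday mornings": [0.9, 0.1, 0.0],
    "prefers dark roast coffee":           [0.0, 0.2, 0.9],
}
query_vec = [0.8, 0.2, 0.1]  # pretend embedding of "what's my gym schedule?"

# Rank stored memories by similarity to the query.
best = max(memories, key=lambda m: cosine(memories[m], query_vec))
```

A vector DB like Pinecone does exactly this ranking, just at scale and with approximate nearest-neighbour indexes instead of a linear scan.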
So I ended up with a hybrid:
Postgres (SQL) — for structured memory: identity facts, tasks, calendar events. Things with clear keys you can retrieve directly.
Pinecone (Vector DB) — for semantic memory: habits, preferences, anything you'd retrieve by meaning rather than exact key.
Real data in SQL. Context and meaning in the vector store. Both working together.
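The routing rule can be written down as a tiny function. The category names follow the post, but the function itself is an illustrative sketch, not Orion's code:

```python
def store_for(kind: str) -> str:
    """Route a memory category to its backing store.

    Structured, key-addressable facts go to SQL; meaning-addressable
    memories go to the vector store.
    """
    sql_kinds = {"identity", "event", "task"}     # clear keys, direct lookup
    vector_kinds = {"habit", "preference"}        # retrieved by meaning
    if kind in sql_kinds:
        return "postgres"
    if kind in vector_kinds:
        return "pinecone"
    raise ValueError(f"unknown memory category: {kind}")
```

Keeping this mapping explicit (rather than letting an LLM decide per write) is one of the things that makes the system's failure modes predictable.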

Step 4: The First Approach — Just Give the LLM Everything
With the storage figured out, I built version one: give the LLM access to both databases as tools and let it figure out when to read and write.
It was clean in theory. In practice, it was a disaster.
LLMs hallucinate. The model would confidently write memory to the wrong category, retrieve irrelevant things, or — worse — make up memories that didn't exist. When your system's entire job is to be a reliable memory layer, hallucination is fatal.
I needed the system to be predictable. Even if it made mistakes, I needed to know where it would make mistakes.

Step 5: The Real Architecture — Nodes, Not Magic
This is when the project started actually working.
Instead of one LLM doing everything, I broke the pipeline into dedicated nodes, each with one job:

User Input

[Segmentation Node]
Splits input into: memory_to_write | memory_to_fetch | ignore

[Classification Node]
Labels each piece: identity | habit | preference | event | task

[Router Node]
├──→ [Memory Writer] → Pinecone + Postgres (parallel)
└──→ [Memory Reader] → Fetch relevant context (parallel)

[Final Answer Node]
Aggregates context → single LLM call → response
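To make the control flow concrete, here is a plain-Python sketch of the first two nodes, using cheap deterministic rules in place of LLM calls. The split heuristic and keyword lists are invented for illustration (and trimmed: no `event` keywords), not Orion's actual rules; in the real system each function would be a node in a LangGraph `StateGraph`:

```python
def segmentation_node(state):
    """Decide whether the input is something to store or something to fetch."""
    text = state["input"]
    # Crude deterministic split: questions become fetches, statements writes.
    if text.rstrip().endswith("?"):
        state["memory_to_fetch"] = text
    else:
        state["memory_to_write"] = text
    return state


def classification_node(state):
    """Label the segmented piece with a memory category via keyword rules."""
    text = state.get("memory_to_write", "") + state.get("memory_to_fetch", "")
    keywords = {
        "identity": ("my name", "i am", "i work"),
        "habit": ("every day", "usually", "routine"),
        "preference": ("i like", "i prefer", "i hate"),
        "task": ("remind me", "todo", "need to"),
    }
    lowered = text.lower()
    state["label"] = next(
        (k for k, words in keywords.items()
         if any(w in lowered for w in words)),
        "ignore",
    )
    return state


def run_pipeline(user_input):
    """Thread a shared state dict through the nodes, LangGraph-style."""
    state = {"input": user_input}
    for node in (segmentation_node, classification_node):
        state = node(state)
    return state
```

The point of the sketch: every branch is inspectable. When a label comes out wrong, you can see exactly which rule fired.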

Two key decisions that made this work:

  1. Read and write in parallel. Running them sequentially was killing latency. Parallelizing both brought response times down significantly.
  2. Use LLMs only where you have to. Every node that could use regex or deterministic logic instead of an LLM did. LLMs are expensive in tokens and unpredictable. The classification node, the segmentation logic — wherever I could replace an LLM call with a rule, I did. The only LLM call that has to exist is the final answer generation.

The result: a system that's predictable end to end. If it gets something wrong, I know which node failed and why. That's infinitely better than a black box that hallucinates.
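The first decision can be sketched with `asyncio`: fan the write and the read out concurrently so total latency is roughly the slower of the two, not their sum. The sleeps here stand in for real Pinecone/Postgres calls:

```python
import asyncio
import time


async def write_memory(item):
    await asyncio.sleep(0.1)   # stand-in for Pinecone + Postgres writes
    return f"wrote:{item}"


async def read_memory(query):
    await asyncio.sleep(0.1)   # stand-in for vector + SQL retrieval
    return f"context for:{query}"


async def handle_turn(item, query):
    # Fan out the write and the read concurrently instead of running
    # them back to back; latency becomes max(write, read), not the sum.
    return await asyncio.gather(write_memory(item), read_memory(query))


start = time.perf_counter()
written, context = asyncio.run(handle_turn("gym habit", "gym schedule"))
elapsed = time.perf_counter() - start  # ~0.1s, vs ~0.2s sequential
```

LangGraph can express the same fan-out by giving the writer and reader nodes edges from the same router node, but the latency arithmetic is identical.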

Step 6: The Hardware Dream Dies (For Now)
I originally wanted Orion to be a hardware device — a tabletop robot, always listening, always learning. That vision is still there. But 2-3 months in, I made a decision: get the software layer right first.
Hardware is a multiplier. If the memory architecture is broken, a physical device just makes it worse. If the memory architecture is solid, hardware becomes a packaging problem — not a fundamental one.
So Orion is now a software-first memory layer. The hardware will come later, if at all. The memory problem was always the interesting part anyway.

What the Tech Stack Looks Like
LangGraph — orchestration framework, manages the node graph and state
Groq — fast LLM inference for the final answer node
Pinecone — vector storage for semantic memory retrieval
Postgres (Supabase) — structured memory storage
Redis — caching and fast in-session state
Jina — embeddings for vectorizing memory content
LangSmith — tracing and debugging the graph (genuinely essential)
FastAPI — serves the whole thing as a REST API

What I'd Tell Myself 3 Months Ago
The question "what to store" matters more than "how to store." Most people jump to the tech before answering the design question. Get the design right first.
Latency is a real problem in memory systems. Parallel retrieval and write is not optional — it's necessary.
Changing the scope is not failure. Dropping the hardware and focusing on the software layer wasn't giving up. It was focusing.

What's Next
Orion is still in development. The memory layer works. The next step is making the retrieval smarter — better context injection, memory decay for old/irrelevant entries, and eventually a clean SDK that other developers can drop into their own AI projects.
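Memory decay could be as simple as an exponential down-weighting by age. This sketch shows the idea; the half-life knob is my assumption, not an Orion parameter:

```python
def decay_score(relevance: float, age_days: float,
                half_life_days: float = 30.0) -> float:
    """Down-weight old memories: the score halves every `half_life_days`.

    `relevance` would be the raw retrieval score (e.g. cosine similarity);
    the product lets fresh memories outrank stale ones of equal similarity.
    """
    return relevance * 0.5 ** (age_days / half_life_days)


fresh = decay_score(0.9, age_days=0)    # unchanged
stale = decay_score(0.9, age_days=60)   # two half-lives old
```

Entries whose decayed score falls below a threshold become candidates for archival or deletion.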

If you're building something with LangGraph or agentic memory, I'd genuinely love to talk. The GitHub repo is open: github.com/vivek-1314/orion-py
Pre-final year CSE student. Building things that probably shouldn't work yet.

Top comments (3)

System Aipass

Hi, I love your passion. I build similar systems, but in a whole different light. God, I could go on forever lol, I'll try my best.
What I discovered: the AI only needs recent stuff in its available context. Today's activities are high value. It takes notes on what it has done on previous days, with a cap limiting how much it can hold for recent activities (what you did yesterday, the day before, and so on). Say it has detailed notes for 7 days; on day 8 the oldest day rolls off to a vector DB, available to query any time. That's just one layer, call it activities, and I try to keep it below 25% of available context (which includes system prompts, tools, and your messages/memories, the starting context).

The next layer is observations. It captures what the AI observes, distinct from daily questions and problems: what you had for dinner, what sports you like, what you usually do in the evening, and so on. You can keep this layer constantly organized however you like, rolling old or less important info into the vector DB.

Then identity: the AI needs a name, a role, traits, a constant personality, though they may change with your inputs over time. What you're left with is an AI that is very aware of your recent activities, gets you, and understands how you are and why. Obviously these variables can be whatever you like.

Things like calendars, emails, news updates, the tools: they're all just code and hooks. Your routines become the agent's workflow, in a sense. You set up automated, coded checks: if you get a new email, is it spam, or worth notifying you about? If you get a party invite, sure, the program could decide to wake the LLM, check your schedules, and message you: "Your friend Tom sent you an invite to X, do you want to reply? Your schedule is free."

If you want an LLM to work like Siri, give it tools and a way to communicate with you. I would seriously think about a multi-agent system, each agent with its own specific role, where the one that speaks to you orchestrates all the others to do the heavy lifting, each with its own memory system. Say you have one agent who manages or remembers your favourite foods, another that checks your schedule, another that can control your personal devices (lights on, lights off). You only need to ask one AI to do something, and it dispatches all the other tasks; it just gets a job-done result. This is how your main AI can keep good context: not filling it with tool usage, file reading, and relearning how your system works. It offloads all the work so its context stays clean and protected. Build scripts, hooks, and full applications to process things in the background, so your personal AI friend can stay engaged with you and hold more memories about you, rather than having to remember to check email, read emails, set reminders, read reminders, dim the lights at 8 pm, and so on. Even web searches get offloaded to another agent, who might burn 100k tokens searching while your main AI gets the summarized 2k breakdown. And that's all you would need. Sorry, bit of a rant, but trying to paint a picture.

Siri is trash, I understand why you would want a better version :)

You could easily run a prototype through a Telegram bot for testing. Keep your backend on your PC while you build.

Anyways food for thought

P.S. Teaching your AI to successfully move through compactions without losing face is a much-needed skill.

vivek

Thank you for this, seriously. I wasn't expecting someone who actually builds these systems to drop this much insight in a comment.

The time-based memory layering and the 25% context cap make a lot of sense. This all opens a new window for me to think more deeply and level up Orion. The multi-agent architecture is also useful for distributing single responsibilities to separate agents and orchestrating them through one.

really appreciate the depth and clarity you brought to this.
Would love to stay connected — are you on LinkedIn or Twitter?

System Aipass • Edited

No problem, I tried my best; there is just so much to consider and it comes in many forms. You see, multi-agents give you a much larger context window: 1 agent is 200k, 10 agents are 2 million to work with. Your worker agents might take some time before compaction; your main agent will be compacting possibly several times daily, but that's fine. 200k is more than enough to hold face, and when you say hi, he will know exactly where you left off and have memory of your past week. Day 7 might be faded, but the important notes are still available. Happy to stay in touch. My agents are currently setting up all these accounts. Happy to shoot you a follow. I'm actually just starting out on the public scene with my work, trying to bring my main system public on my repo. But it might take a minute haha. Twitter's in my profile.