<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: vivek</title>
    <description>The latest articles on DEV Community by vivek (@vivek_gaindhar_e79f747c46).</description>
    <link>https://dev.to/vivek_gaindhar_e79f747c46</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3660340%2Fd3fd8b13-d8dd-4d78-bb1a-7fdbfe152af7.png</url>
      <title>DEV Community: vivek</title>
      <link>https://dev.to/vivek_gaindhar_e79f747c46</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vivek_gaindhar_e79f747c46"/>
    <language>en</language>
    <item>
      <title>Engineering a LangGraph UI Pipeline</title>
      <dc:creator>vivek</dc:creator>
      <pubDate>Tue, 17 Mar 2026 17:22:23 +0000</pubDate>
      <link>https://dev.to/vivek_gaindhar_e79f747c46/engineering-a-langgraph-ui-pipeline-1k0n</link>
      <guid>https://dev.to/vivek_gaindhar_e79f747c46/engineering-a-langgraph-ui-pipeline-1k0n</guid>
      <description>&lt;p&gt;Build an Agentic Pipeline for Frontend Development: Design Decisions, Trade-offs, and Practical Lessons&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The Token Economy&lt;/strong&gt;&lt;br&gt;
Managing tokens is a real challenge: I am on Groq’s free tier (250k tokens per day) with &lt;em&gt;openai/gpt-oss-120b&lt;/em&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Audit every step&lt;br&gt;
Do not wait until the end of the project to optimize. Monitor token consumption at every node (I used LangSmith to trace the node calls).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The best call is the one you never make&lt;br&gt;
A pipeline should treat the LLM as a last resort for reasoning. Every time you replace a prompt with a regex or a hardcoded template, you shrink your “LLM surface area” — which is the only real way to kill latency and protect your budget.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;LLMs are elite at reasoning and planning. They can architect a component structure or plot a multi-step migration. But (2 + 3) is computed better by a CPU than by a transformer: there is no reason to call llm.invoke("2 + 3").&lt;/p&gt;
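&lt;p&gt;As a sketch of that idea (the guard and the llm_invoke stand-in are mine, not from the pipeline), a node can evaluate simple arithmetic on the CPU and fall through to the model only when real reasoning is needed:&lt;/p&gt;

```python
# Hypothetical sketch: shrink the "LLM surface area" by intercepting
# queries a CPU can answer deterministically. llm_invoke is a stand-in
# for a real model call, not an actual API.
import ast
import operator

# Safe arithmetic evaluator: only numeric constants and basic binary ops.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}

def try_eval(expr: str):
    """Return the arithmetic result, or None if it needs real reasoning."""
    try:
        node = ast.parse(expr, mode="eval").body
        def walk(n):
            if isinstance(n, ast.Constant) and isinstance(n.value, (int, float)):
                return n.value
            if isinstance(n, ast.BinOp) and type(n.op) in _OPS:
                return _OPS[type(n.op)](walk(n.left), walk(n.right))
            raise ValueError  # anything else is not plain arithmetic
        return walk(node)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return None

def answer(query: str, llm_invoke=lambda q: "LLM answer"):
    result = try_eval(query)
    return str(result) if result is not None else llm_invoke(query)
```

&lt;p&gt;Every query the guard catches is one fewer model round-trip, which is exactly the latency and budget win described above.&lt;/p&gt;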

&lt;p&gt;&lt;strong&gt;2. The “Eraser-First” Workflow&lt;/strong&gt;&lt;br&gt;
Don’t start with code. I found success by extracting raw logic from GPT-4/Claude and then manually refining it on Eraser.io.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Before touching your IDE, define every node with&lt;br&gt;
strict data contracts: spell out the input/output schema for each node.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Pivot:&lt;br&gt;
Your initial diagram is a hypothesis. As you observe real-world model responses, be prepared to refactor nodes. Rigid architectures fail; adaptive graphs win.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
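&lt;p&gt;A strict data contract can be as small as a pair of frozen dataclasses at the node boundary. This is a minimal sketch; the field names are illustrative, not taken from the actual project:&lt;/p&gt;

```python
# One node's input/output contract, using only the stdlib. The node body
# is free to change during a pivot; the contract at the boundary is not.
from dataclasses import dataclass

@dataclass(frozen=True)
class ComponentRequest:        # input contract
    name: str
    props: tuple               # immutable sequence of prop names

@dataclass(frozen=True)
class ComponentResult:         # output contract
    name: str
    source: str
    warnings: tuple = ()

def generate_component(req: ComponentRequest) -> ComponentResult:
    # Stub body: emit a placeholder component for the requested name.
    src = f"function {req.name}() {{ return null; }}"
    return ComponentResult(name=req.name, source=src)
```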

&lt;p&gt;&lt;strong&gt;3. Radical Latency Reduction: Parallelism&lt;/strong&gt;&lt;br&gt;
In a linear graph, your latency is the sum of every node’s response time. In a production-grade graph, your latency should only be the length of the longest path.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Fan-Out/Fan-In Architecture:&lt;br&gt;
I restructured my graph so that independent tasks — like generating components while simultaneously drafting page code files — run in parallel.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The Multi-Thread Advantage: &lt;br&gt;
Moving from sequential flows to parallel node execution reduced my total execution time by a major factor. If nodes don’t depend on each other’s data, they shouldn’t wait for each other to finish.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
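&lt;p&gt;The fan-out/fan-in shape can be sketched with stdlib asyncio alone; the sleeps below stand in for model calls, and the node names are illustrative:&lt;/p&gt;

```python
# Two independent branches launched concurrently, then gathered. Total
# latency tracks the longest branch (~0.3s), not the sum (0.5s).
import asyncio
import time

async def generate_components():
    await asyncio.sleep(0.2)   # pretend LLM call for components
    return "components"

async def draft_pages():
    await asyncio.sleep(0.3)   # pretend LLM call for pages
    return "pages"

async def pipeline():
    # Fan-out: start both branches; fan-in: collect both results.
    return await asyncio.gather(generate_components(), draft_pages())

start = time.perf_counter()
results = asyncio.run(pipeline())
elapsed = time.perf_counter() - start
```

&lt;p&gt;In LangGraph the same shape falls out of adding edges from one node to several successors, but the latency arithmetic is identical.&lt;/p&gt;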

&lt;p&gt;&lt;strong&gt;4. Avoiding the Tool-Calling Trap&lt;/strong&gt;&lt;br&gt;
Giving an LLM a massive toolbox feels powerful, but it increases latency. Every tool you add introduces a “reasoning cycle” in which the model must decide whether and how to frame the call.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Minimize the Toolset: &lt;br&gt;
Only provide tools for tasks the LLM cannot predict or calculate via code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Deterministic Prediction: &lt;br&gt;
LLMs often call tools in loops to “figure out” a solution. If you can manage that logic via pre-defined code paths or “prediction” nodes, do it. Don’t let the LLM waste time deciding what you, as the developer, already know.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
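&lt;p&gt;A “prediction” node can be nothing more than a regex router: the common cases are resolved deterministically, and only genuinely ambiguous queries reach the model. Patterns and tool names here are illustrative:&lt;/p&gt;

```python
# Deterministic pre-router: resolve tool choice by pattern before the
# LLM ever spends a reasoning cycle on it.
import re

ROUTES = [
    (re.compile(r"\bcreate\b.*\bcomponent\b", re.I), "component_writer"),
    (re.compile(r"\bnew page\b|\broute\b", re.I),    "page_writer"),
]

def route(query: str) -> str:
    for pattern, tool in ROUTES:
        if pattern.search(query):
            return tool            # deterministic: no LLM involved
    return "llm_planner"           # only ambiguous queries reach the model
```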

&lt;p&gt;&lt;strong&gt;5. Debugging&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Isolated debugging:
I used LangSmith from node #1. Validating each node’s output in isolation prevents “spaghetti logic.” If you wait until the full graph is finished to debug, you will never reach a stable state.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agentic development is a system optimization problem. By prioritizing parallel node execution (low latency), replacing redundant LLM calls with deterministic logic (saving tokens and latency), and enforcing strict schema contracts (better structured outputs), you move from “experimental wrappers” to production-grade software.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>coding</category>
      <category>softwareengineering</category>
      <category>llm</category>
    </item>
    <item>
      <title>I Tried to Build an Alexa with Real Memory — Here's What I Learned After 3 Months of Failure.</title>
      <dc:creator>vivek</dc:creator>
      <pubDate>Thu, 05 Mar 2026 01:50:13 +0000</pubDate>
      <link>https://dev.to/vivek_gaindhar_e79f747c46/i-tried-to-build-an-alexa-with-real-memory-heres-what-i-learned-after-3-months-of-failure-4f7</link>
      <guid>https://dev.to/vivek_gaindhar_e79f747c46/i-tried-to-build-an-alexa-with-real-memory-heres-what-i-learned-after-3-months-of-failure-4f7</guid>
      <description>&lt;p&gt;&lt;em&gt;A story about LangGraph, memory architecture, and why I stopped fighting LLMs and made the system predictable instead&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It Started With a Simple Frustration&lt;/strong&gt;&lt;br&gt;
I wanted to build something like Alexa — but smarter. Not just a voice assistant that forgets you the moment the session ends. Not an AI that stores your entire conversation history in a text file and calls it "memory."&lt;br&gt;
I wanted a personal AI that actually knows you — your habits, your preferences, your tasks — and gets smarter over time the way a real assistant would.&lt;br&gt;
Sounds simple. It wasn't.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: How Does Alexa Even Work?&lt;/strong&gt;&lt;br&gt;
Before building anything, I went deep on the Alexa cloud architecture. The model is clean: your voice query goes to the cloud, gets processed, hits an LLM, and the response streams back to the device. The device itself is thin — all the intelligence lives on the server.&lt;br&gt;
Okay. So I needed to build the server layer. But when I started thinking about where memory fits in, I hit the first real wall.&lt;br&gt;
Where does memory live? And more importantly — what even IS memory for a personal AI?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: What Should a Personal AI Actually Remember?&lt;/strong&gt;&lt;br&gt;
This is the question most AI projects skip. They just store everything — every message, every session — and call it memory. But that's just a log file. That's not memory.&lt;br&gt;
I spent time thinking about what actually matters for a personal AI. What does a good human assistant remember about you?&lt;br&gt;
After a lot of thinking, I landed on four categories:&lt;br&gt;
Identity — who you are, your name, role, basic facts&lt;br&gt;
Habits — things you do regularly, routines&lt;br&gt;
Preferences — how you like things done, what you enjoy&lt;br&gt;
Events &amp;amp; Tasks — things on your calendar, things you need to do&lt;br&gt;
Everything else is noise. Most of what you say to an AI doesn't need to be stored. This felt like a small insight at the time — it turned out to be the most important design decision in the whole project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Where to Store It — SQL vs Vector DB&lt;/strong&gt;&lt;br&gt;
Now I had to figure out where to actually store these four types of memory.&lt;br&gt;
My first instinct was a SQL database. Clean tables, structured data, easy to query. But I quickly hit a problem: you can't query a SQL database with natural language directly. You need to know the exact keys, the exact column names. That doesn't work when a user says "remind me what I told you about my gym schedule."&lt;br&gt;
For natural language retrieval, you need vector search — you embed the query and the stored memories as vectors and find semantic matches.&lt;br&gt;
So I ended up with a hybrid:&lt;br&gt;
Postgres (SQL) — for structured memory: identity facts, tasks, calendar events. Things with clear keys you can retrieve directly.&lt;br&gt;
Pinecone (Vector DB) — for semantic memory: habits, preferences, anything you'd retrieve by meaning rather than exact key.&lt;br&gt;
Real data in SQL. Context and meaning in the vector store. Both working together.&lt;/p&gt;
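&lt;p&gt;The hybrid split can be sketched in a few lines; here sqlite3 stands in for Postgres and a naive in-memory cosine search stands in for Pinecone. Table and field names are illustrative:&lt;/p&gt;

```python
# Structured memory: exact-key lookups in SQL.
# Semantic memory: retrieve by meaning via embedding similarity.
import sqlite3
import math

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE facts (key TEXT PRIMARY KEY, value TEXT)")
db.execute("INSERT INTO facts VALUES ('name', 'Vivek')")

def get_fact(key: str):
    row = db.execute("SELECT value FROM facts WHERE key = ?", (key,)).fetchone()
    return row[0] if row else None

memories = []  # list of (embedding, text) pairs

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def remember(vec, text):
    memories.append((vec, text))

def recall(query_vec):
    # Return the semantically closest stored memory.
    return max(memories, key=lambda m: cosine(m[0], query_vec))[1]
```

&lt;p&gt;A real embedding model replaces the hand-made vectors, but the division of labor is the same: exact keys go to SQL, meaning goes to the vector side.&lt;/p&gt;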

&lt;p&gt;&lt;strong&gt;Step 4: The First Approach — Just Give the LLM Everything&lt;/strong&gt;&lt;br&gt;
With the storage figured out, I built version one: give the LLM access to both databases as tools and let it figure out when to read and write.&lt;br&gt;
It was clean in theory. In practice, it was a disaster.&lt;br&gt;
LLMs hallucinate. The model would confidently write memory to the wrong category, retrieve irrelevant things, or — worse — make up memories that didn't exist. When your system's entire job is to be a reliable memory layer, hallucination is fatal.&lt;br&gt;
I needed the system to be predictable. Even if it made mistakes, I needed to know where it would make mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5: The Real Architecture — Nodes, Not Magic&lt;/strong&gt;&lt;br&gt;
This is when the project started actually working.&lt;br&gt;
Instead of one LLM doing everything, I broke the pipeline into dedicated nodes, each with one job:&lt;/p&gt;

&lt;p&gt;User Input&lt;br&gt;
     ↓&lt;br&gt;
[Segmentation Node]&lt;br&gt;
Splits input into: memory_to_write | memory_to_fetch | ignore&lt;br&gt;
     ↓&lt;br&gt;
[Classification Node]&lt;br&gt;
Labels each piece: identity | habit | preference | event | task&lt;br&gt;
     ↓&lt;br&gt;
[Router Node]&lt;br&gt;
     ├──→ [Memory Writer] → Pinecone + Postgres (parallel)&lt;br&gt;
     └──→ [Memory Reader] → Fetch relevant context (parallel)&lt;br&gt;
     ↓&lt;br&gt;
[Final Answer Node]&lt;br&gt;
Aggregates context → single LLM call → response&lt;/p&gt;
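&lt;p&gt;The diagram above reduces to plain functions with one job each. This is a toy sketch of the deterministic nodes only; the cue phrases and labels are illustrative stand-ins for the real segmentation and classification logic:&lt;/p&gt;

```python
# Segmentation and classification as rule-based nodes: no LLM until the
# final answer step.
import re

def segment(text: str) -> dict:
    # Decide memory_to_fetch vs memory_to_write vs ignore from simple cues.
    if re.search(r"\b(remind me|what did i)\b", text, re.I):
        return {"action": "fetch", "text": text}
    if re.search(r"\b(i am|i like|i prefer|every)\b", text, re.I):
        return {"action": "write", "text": text}
    return {"action": "ignore", "text": text}

def classify(piece: dict) -> dict:
    # Label each piece: identity, habit, preference, or event/task.
    rules = [("habit", r"every (day|week|morning)"),
             ("preference", r"i (like|prefer|hate)"),
             ("identity", r"i am|my name is")]
    for label, pattern in rules:
        if re.search(pattern, piece["text"], re.I):
            return dict(piece, label=label)
    return dict(piece, label="event_task")

def run(text: str) -> dict:
    return classify(segment(text))
```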

&lt;p&gt;Two key decisions that made this work:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Read and write in parallel. Running them sequentially was killing latency. Parallelizing both brought response times down significantly.&lt;/li&gt;
&lt;li&gt;Use LLMs only where you have to. Every node that could use regex or deterministic logic instead of an LLM — did. LLMs are expensive in tokens and unpredictable. The classification node, the segmentation logic — wherever I could replace an LLM call with a rule, I did. The only LLM call that has to exist is the final answer generation.
The result: a system that's predictable end to end. If it gets something wrong, I know which node failed and why. That's infinitely better than a black box that hallucinates.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Step 6: The Hardware Dream Dies (For Now)&lt;/strong&gt;&lt;br&gt;
I originally wanted Orion to be a hardware device — a tabletop robot, always listening, always learning. That vision is still there. But 2-3 months in, I made a decision: get the software layer right first.&lt;br&gt;
Hardware is a multiplier. If the memory architecture is broken, a physical device just makes it worse. If the memory architecture is solid, hardware becomes a packaging problem — not a fundamental one.&lt;br&gt;
So Orion is now a software-first memory layer. The hardware will come later, if at all. The memory problem was always the interesting part anyway.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What the Tech Stack Looks Like&lt;/strong&gt;&lt;br&gt;
LangGraph — orchestration framework, manages the node graph and state&lt;br&gt;
Groq — fast LLM inference for the final answer node&lt;br&gt;
Pinecone — vector storage for semantic memory retrieval&lt;br&gt;
Postgres (Supabase) — structured memory storage&lt;br&gt;
Redis — caching and fast in-session state&lt;br&gt;
Jina — embeddings for vectorizing memory content&lt;br&gt;
LangSmith — tracing and debugging the graph (genuinely essential)&lt;br&gt;
FastAPI — serves the whole thing as a REST API&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What I'd Tell Myself 3 Months Ago&lt;/strong&gt;&lt;br&gt;
The question "what to store" matters more than "how to store." Most people jump to the tech before answering the design question. Get the design right first.&lt;br&gt;
Latency is a real problem in memory systems. Parallel retrieval and write is not optional — it's necessary.&lt;br&gt;
Changing the scope is not failure. Dropping the hardware and focusing on the software layer wasn't giving up. It was focusing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What's Next&lt;/strong&gt;&lt;br&gt;
Orion is still in development. The memory layer works. The next step is making the retrieval smarter — better context injection, memory decay for old/irrelevant entries, and eventually a clean SDK that other developers can drop into their own AI projects.&lt;/p&gt;

&lt;p&gt;If you're building something with LangGraph or agentic memory, I'd genuinely love to talk. The GitHub repo is open: &lt;strong&gt;&lt;em&gt;github.com/vivek-1314/orion-py&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;em&gt;Pre-final year CSE student. Building things that probably shouldn't work yet&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>webdev</category>
      <category>langchain</category>
      <category>python</category>
    </item>
    <item>
      <title>The leetcode comfort trap</title>
      <dc:creator>vivek</dc:creator>
      <pubDate>Sun, 28 Dec 2025 07:09:35 +0000</pubDate>
      <link>https://dev.to/vivek_gaindhar_e79f747c46/the-leetcode-comfort-trap-1h5m</link>
      <guid>https://dev.to/vivek_gaindhar_e79f747c46/the-leetcode-comfort-trap-1h5m</guid>
      <description>&lt;p&gt;Solving 2–3 LeetCode problems and going to sleep feeling accomplished is the same dopamine loop as &lt;strong&gt;hitting the gym&lt;/strong&gt;, training hard, and going home for a nice sleep. It feels productive, but it’s safe.&lt;/p&gt;

&lt;p&gt;You’re grinding in a &lt;strong&gt;sandbox&lt;/strong&gt; where failure has the weight of a feather. You spend three hours "deeply thinking" about an O(n log n) solution, close the tab, and the universe remains unchanged.&lt;/p&gt;

&lt;p&gt;That’s not engineering; that’s &lt;strong&gt;paper trading.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;It’s easy to feel like a genius when the constraints are pre-defined and the "End" button is always in reach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real Difficulty is Sustained Pressure&lt;/strong&gt; LeetCode isn’t bad—it’s a sharp tool, a necessary warm-up. But if you're settling for green checkmarks, you’re rotting in your comfort zone. Real growth happens under sustained pressure: no pre-defined constraints, and no "End" button in reach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Verdict&lt;/strong&gt; LeetCode is the gym; it is not the sport. Use it to sharpen your blade, but don't spend your life polishing the metal while the monsters are outside the door. If your practice feels flat, move to something heavier. If the project scares you, you’re finally aiming right.&lt;/p&gt;

&lt;p&gt;I can tell you this with respect because I’m currently in the trenches—grinding on a real project while managing LeetCode side by side.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>programming</category>
      <category>ai</category>
      <category>javascript</category>
    </item>
    <item>
      <title>I Tried to Give Memory to an AI… and Learned the Hard Way</title>
      <dc:creator>vivek</dc:creator>
      <pubDate>Sat, 13 Dec 2025 17:56:34 +0000</pubDate>
      <link>https://dev.to/vivek_gaindhar_e79f747c46/i-tried-to-give-memory-to-an-ai-and-learned-the-hard-way-17cb</link>
      <guid>https://dev.to/vivek_gaindhar_e79f747c46/i-tried-to-give-memory-to-an-ai-and-learned-the-hard-way-17cb</guid>
      <description>&lt;p&gt;I’m building Orion, a proactive AI companion. Not just a chatbot — something that actually remembers things over time and can act intelligently before being asked.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My first attempt was naïve:&lt;/strong&gt;&lt;br&gt;
  “Let’s pass all past memory to the LLM!”&lt;/p&gt;

&lt;p&gt;Result: 💸 Massive token usage, 🤯 polluted context, 😅 nonsense answers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Second attempt:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;“More retrieval = better answers”&lt;/p&gt;

&lt;p&gt;Result: LLM got distracted, important memories got buried, cost went up, and I learned that more isn’t always better.&lt;/p&gt;

&lt;p&gt;What finally started working: structured memory layers&lt;br&gt;
Short-term memory → current session, quick context&lt;br&gt;
Long-term memory → structured facts like preferences, events&lt;br&gt;
Vector memory → semantic recall for similar past situations&lt;br&gt;
TTL + scoring → forgetting intentionally is a feature&lt;/p&gt;
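&lt;p&gt;“Forgetting intentionally is a feature” can be sketched as a TTL plus an importance score that decays with age; the thresholds and decay curve here are illustrative choices, not the real system’s:&lt;/p&gt;

```python
# A memory store that drops entries past their TTL and ranks the rest
# by an importance score that decays linearly toward zero.
import time

class DecayingMemory:
    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.items = []  # (timestamp, importance, text)

    def add(self, text: str, importance: float, now=None):
        stamp = time.time() if now is None else now
        self.items.append((stamp, importance, text))

    def score(self, item, now):
        stamp, importance, _ = item
        age = now - stamp
        if age > self.ttl:
            return 0.0                                  # hard TTL cutoff
        return importance * (1.0 - age / self.ttl)      # linear decay

    def recall(self, now=None, threshold=0.1):
        now = time.time() if now is None else now
        live = [(self.score(i, now), i[2]) for i in self.items]
        return [text for s, text in sorted(live, reverse=True) if s > threshold]
```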

&lt;p&gt;The tricky part isn’t the LLM or the vector DB — it’s deciding what to remember, when to update, and what to forget.&lt;/p&gt;

&lt;p&gt;I documented everything: &lt;br&gt;
architecture diagrams, memory flows, mistakes, lessons learned. It’s messy, but real.&lt;/p&gt;

&lt;p&gt;Repo (full system + docs + diagrams):&lt;br&gt;
👉 &lt;a href="https://github.com/vivek-1314/orion-py" rel="noopener noreferrer"&gt;https://github.com/vivek-1314/orion-py&lt;/a&gt;&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
