<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Isaac Hagoel</title>
    <description>The latest articles on DEV Community by Isaac Hagoel (@isaachagoel).</description>
    <link>https://dev.to/isaachagoel</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F329991%2F11cf0248-ec4f-4417-856f-ba9b74d7a5a6.png</url>
      <title>DEV Community: Isaac Hagoel</title>
      <link>https://dev.to/isaachagoel</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/isaachagoel"/>
    <language>en</language>
    <item>
      <title>The Best AI Articles Dev.to Won’t Show You</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Tue, 18 Nov 2025 02:19:31 +0000</pubDate>
      <link>https://dev.to/isaachagoel/the-best-ai-articles-devto-wont-show-you-iph</link>
      <guid>https://dev.to/isaachagoel/the-best-ai-articles-devto-wont-show-you-iph</guid>
      <description>&lt;p&gt;Dev.to's feed is broken. It never shows me posts I actually want to read. The search engine doesn’t help either.&lt;br&gt;&lt;br&gt;
I had a feeling there’s still good, advanced level content being published, buried under the clickbait and slop. So I made my own tool to find these hidden gems.&lt;br&gt;&lt;br&gt;
I use it daily to find recently published (last 24 hours), original, and insightful posts about AI and success. I usually end up with a couple of solid reads every day.&lt;/p&gt;

&lt;p&gt;I’ll share the ones I like best here and update this page as I discover new ones.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Bookmark this page if you want to stay up to speed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: I don’t endorse these posts. The opinions expressed belong to their authors. I just find them thought-provoking and worth reading.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
Enjoy.&lt;/p&gt;

&lt;h2&gt;
  
  
  2025/11/28-29
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/nodefiend/trust-the-server-not-the-llm-a-deterministic-approach-to-llm-accuracy-20ag"&gt;Trust the Server, Not the LLM: A Deterministic Approach to LLM Accuracy&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/nodefiend"&gt;@nodefiend&lt;/a&gt;
 Useful techniques for controlling LLM output quality through grounding, verification, and reduced degrees of freedom.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://dev.to/jensen_king_9fa3ffe58c0a1/deepseekmath-v2-how-far-are-we-from-true-agi-when-ai-learns-to-self-negate-5h5a"&gt;DeepSeekMath-V2: How far are we from true AGI when AI learns to self-negate?&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/jensen_king_9fa3ffe58c0a1"&gt;@jensen_king_9fa3ffe58c0a1&lt;/a&gt;
Deepseek keeps exploring interesting ideas. This time it's about using two levels of supervision to force a model to do a good job when thinking through a problem.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  2025/11/23-27
&lt;/h2&gt;

&lt;p&gt;I did check daily, but all I found was AI slop :(&lt;/p&gt;

&lt;h2&gt;
  
  
  2025/11/20-22
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/camel-ai/brainwash-your-agent-how-we-keep-the-memory-clean-24nn"&gt;Brainwash Your Agent: How We Keep The Memory Clean&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/camel-ai"&gt;@camel-ai&lt;/a&gt;
This one is a real gem. A very practical, well-written (and it doesn't feel AI-written) guide to context compaction, by people at the forefront of the field. Full of useful links.&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  2025/11/19-20
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;a href="https://dev.to/googlecloud/the-lumberjack-paradox-from-theory-to-practice-2lb5"&gt;The lumberjack paradox: From theory to practice&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/sigje"&gt;@sigje&lt;/a&gt;
This post gave me some serious food for thought. We let AI read our code, documentation, and especially code samples "as is", and that's actually bad. "The lumberjack paradox" and the other concepts the author mentions are also cool!
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/brielov/i-needed-date-math-in-formulas-so-i-built-a-compiler-and-learned-a-lot-104m"&gt;I Needed Date Math in Formulas, So I Built a Compiler (and Learned a Lot)&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/brielov"&gt;@brielov&lt;/a&gt;&lt;br&gt;&lt;br&gt;
This one is a realistic take on building something non-trivial for production with AI, but the most interesting and educational part, if you're into this kind of stuff, is how he went about designing the expression parser.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/rikinptl/explainable-causal-reinforcement-learning-for-smart-agriculture-microgrid-orchestration-with-5567"&gt;Explainable Causal Reinforcement Learning for smart agriculture microgrid orchestration with ethical auditability baked in&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/rikinptl"&gt;@rikinptl&lt;/a&gt;&lt;br&gt;&lt;br&gt;
This one goes deep into ML, but even for engineers such as myself there is a lot to chew on. It's anchored in a real-life system and provides an eye-opening account of the different trade-offs, challenges, and solutions, including many code snippets.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  2025/11/18-19
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/alifar/deepseek-ocr-in-automation-pipelines-practical-engineering-insights-and-integration-patterns-3g4a"&gt;DeepSeek OCR in Automation Pipelines: Practical Engineering Insights and Integration Patterns&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/alifar"&gt;@alifar&lt;/a&gt;&lt;br&gt;&lt;br&gt;
When Deepseek OCR's paper landed, I wondered what all the hype was about and what using it in real-world scenarios would look like. This post takes a peek at that.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/michaelsolati/im-getting-serious-deja-vu-but-this-time-its-different-17f4"&gt;I'm Getting Serious Déjà Vu... But This Time It's Different&lt;/a&gt; by &lt;a class="mentioned-user" href="https://dev.to/michaelsolati"&gt;@michaelsolati&lt;/a&gt; &lt;br&gt;
Nice opinion piece about how AI affects the software development job market.  &lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  2025/11/17-18
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/lofcz/the-shift-towards-agentic-ai-what-it-means-for-developers-4a4o"&gt;The Shift Towards Agentic AI: What It Means for Developers&lt;/a&gt; by lofcz&lt;br&gt;&lt;br&gt;
While the title comes off a bit generic, the article itself has genuinely sharp insights and correctly calls out common pitfalls (and solutions) when implementing agents.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/siddhantkcode/context-engineering-the-critical-infrastructure-challenge-in-production-llm-systems-4id0"&gt;Context Engineering: The Critical Infrastructure Challenge in Production LLM Systems&lt;/a&gt; by siddhantkcode&lt;br&gt;&lt;br&gt;
This one goes deep on advanced ways to keep context lean and mean. It doesn’t give every detail, but it &lt;em&gt;does&lt;/em&gt; link to the full code, which is even better 😄&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://dev.to/mechero22/the-vibe-coding-trap-why-conversational-ai-makes-developers-slower-1o7i"&gt;The Vibe Coding Trap: Why Conversational AI Makes Developers Slower&lt;/a&gt; by mechero22&lt;br&gt;&lt;br&gt;
This one’s a bit spicy, and I’m not fully onboard with all the claims it makes. But even if it rubs you the wrong way, it will definitely give you something to chew on.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

</description>
    </item>
    <item>
      <title>Sushify - A New Free &amp; Essential Tool For AI App Developers</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Sat, 06 Sep 2025 12:12:23 +0000</pubDate>
      <link>https://dev.to/isaachagoel/sushify-a-new-free-essential-tool-for-ai-app-developers-45n6</link>
      <guid>https://dev.to/isaachagoel/sushify-a-new-free-essential-tool-for-ai-app-developers-45n6</guid>
      <description>&lt;p&gt;I recently (well, today 😊) released &lt;a href="https://github.com/pragmaticfish/sushify" rel="noopener noreferrer"&gt;Sushify&lt;/a&gt;, an open-source dev tool that helps test apps with complex LLM integrations by surfacing prompt* issues early. I write prompt* because, as you already know if you’ve worked on such apps, the prompt itself is just a small part of the broader context management required to make production-grade AI apps work well. &lt;a href="https://github.com/pragmaticfish/sushify" rel="noopener noreferrer"&gt;Sushify&lt;/a&gt; uncovers issues in everything that gets passed into the LLM (including tools, output schemas, history, etc.).&lt;/p&gt;

&lt;p&gt;In other words, it helps you turn your prompt salad into precision-cut sushi 👌🏻.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s still early days, and I’m looking for feedback and contributions from early adopters.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This post is a walkthrough of how I got from initial frustration to publishing a tool others can now use.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Production apps that utilize LLMs often express a lot of their logic in free text—because that’s what LLMs understand.&lt;/p&gt;

&lt;p&gt;Prompts are usually composed of static snippets and/or templates that get stitched together at runtime (sometimes using loops or conditional logic) and shared across different agents or workflows. On top of that, we pass in tools: each with a top-level description, parameters with descriptions, and usually prompt fragments that refer to the tool’s input/output format. The same goes for output schemas (a.k.a. structured outputs).&lt;/p&gt;

&lt;p&gt;If one thing changes, say a tool’s output format, but there’s still a lingering reference to the old version anywhere in the prompt, the LLM can get confused and start misbehaving.&lt;/p&gt;
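&lt;p&gt;To make that failure mode concrete, here is a minimal, hypothetical sketch (the fragment names and the naive lint are mine, not a real tool's): a prompt stitched together from fragments, where one fragment still references a field the tool no longer returns.&lt;/p&gt;

```javascript
// Hypothetical prompt fragments, stitched together at runtime.
const toolSchema = { name: "search_flights", fields: ["flights", "cheapest_id"] };

const fragments = [
  "You are a travel assistant.",
  `Call ${toolSchema.name} to find options.`,
  // Stale: written back when the tool still returned `best_flight_id`.
  "Pick the flight whose id matches `best_flight_id` in the tool output.",
];

// Naive lint: flag backtick-quoted identifiers that no current field matches.
function findStaleRefs(fragments, schema) {
  const known = new Set([schema.name, ...schema.fields]);
  const refs = fragments.flatMap((f) => [...f.matchAll(/`(\w+)`/g)].map((m) => m[1]));
  return refs.filter((r) => !known.has(r));
}

console.log(findStaleRefs(fragments, toolSchema)); // → ["best_flight_id"]
```

&lt;p&gt;A check this naive obviously misses prose references ("the best flight id field"), which is part of why free-text drift is so hard to catch mechanically.&lt;/p&gt;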

&lt;p&gt;To make things worse, instructions to the LLM can easily end up too vague, too restrictive, or even contradictory (sometimes contradicting your &lt;em&gt;past self&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;As a result, the LLM starts ignoring instructions or behaving unpredictably. The typical response is to make things worse by piling on even more free-text instructions as a patch.&lt;/p&gt;

&lt;p&gt;I ran into this struggle repeatedly, both in production-grade AI apps and even in small side projects. One day I stepped back, realized how absurd the whole situation was, and decided to do something about it.&lt;/p&gt;




&lt;h2&gt;
  
  
  Initial Instinct - Static Analysis
&lt;/h2&gt;

&lt;p&gt;My first thought was: for code we have linters and compilers, so why not do the same for the inputs that go into an LLM?&lt;/p&gt;

&lt;p&gt;I wanted to support at least Python and TypeScript and provide source maps that would pinpoint every issue back to its exact origin.&lt;/p&gt;

&lt;p&gt;I spent a few weeks trying different approaches. The idea was to locate the LLM call in the code and build a DAG (Directed Acyclic Graph) tracing all dependencies through the codebase, ultimately reconstructing the full prompt* for analysis.&lt;/p&gt;

&lt;p&gt;I tested this with real side projects I had built beforehand, but it wasn’t reliable enough. Some data injected into prompts is only known at runtime and would require mocking (e.g., RAG-retrieved docs, tool outputs, API call results). Prompt composition could also be deceptively tricky - even simple ternary expressions were hard to untangle.&lt;/p&gt;

&lt;p&gt;No matter how sophisticated I made it, it never felt “good enough.” On top of that, it was expensive. I burned through over $100 just trying to generate dependency graphs, not even analyzing prompts yet. Eventually, I had to step back and rethink the approach.&lt;/p&gt;




&lt;h2&gt;
  
  
  Second Attempt - Runtime Tracking
&lt;/h2&gt;

&lt;p&gt;If static analysis wasn’t cutting it, the alternative was to track LLM calls at runtime.&lt;/p&gt;

&lt;p&gt;This had some clear advantages: no guesswork or mocking - the tool would see exactly what the LLM sees. Sure, to analyze every possible permutation of the prompt*, we’d need to actually execute those code paths, but is it really that hard for a dev to make the relevant calls?&lt;/p&gt;

&lt;p&gt;Another big plus: we’d see the LLM responses too. That meant cross-referencing potential input issues with actual model behavior, capturing follow-up messages, history, tool responses, and even context-compaction bugs.&lt;/p&gt;

&lt;p&gt;It made a lot of sense but there was still plenty to figure out.&lt;/p&gt;




&lt;h2&gt;
  
  
  Iterations
&lt;/h2&gt;

&lt;p&gt;I won’t bore you with every failed experiment, but here’s the gist.&lt;/p&gt;

&lt;p&gt;First, I built a POC using an SDK: the monitored app had to call this SDK and pass in the same payload it sent to the LLM (or wrap every LLM provider SDK with it).&lt;/p&gt;

&lt;p&gt;This quickly felt wrong: too much friction, too error-prone (e.g., Zod schemas not being transformed into JSON schemas), and too restrictive. I wanted plug-and-play simplicity: something I could drop into any app with minimal effort.&lt;/p&gt;

&lt;p&gt;That’s when I landed on using a proxy. Instead of requiring the app to wire anything manually, the proxy would wrap the app and intercept calls to the LLM - capturing the exact same request and response, reliably and transparently.&lt;/p&gt;
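&lt;p&gt;The interception idea itself fits in a few lines. This is not Sushify's actual implementation, just a minimal illustration of wrapping an HTTP client so every request/response pair is captured without the app noticing:&lt;/p&gt;

```javascript
// Minimal sketch of transparent capture (not Sushify's actual code):
// wrap a fetch-style client so each LLM request/response pair is recorded.
function withCapture(fetchImpl, captured) {
  return async function (url, options = {}) {
    const response = await fetchImpl(url, options);
    const body = await response.clone().text(); // clone so the app can still read it
    captured.push({ url, request: options.body ?? null, response: body });
    return response;
  };
}

// Usage idea: install the wrapped client where the app makes LLM calls, e.g.
//   const captured = [];
//   globalThis.fetch = withCapture(globalThis.fetch, captured);
```

&lt;p&gt;A proxy does the same thing one level lower, at the network boundary, which is what makes it language-agnostic and Docker-friendly.&lt;/p&gt;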

&lt;p&gt;And of course, it had to support Docker, since nearly every production app I work on is containerized.&lt;/p&gt;




&lt;h2&gt;
  
  
  Sushify
&lt;/h2&gt;

&lt;p&gt;After all that exploration, I ended up with &lt;strong&gt;Sushify&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It’s still barebones, but already super useful. It helped me uncover issues in projects I thought were fine. It makes debugging prompt-related problems ridiculously easy.&lt;/p&gt;

&lt;p&gt;Even though there’s plenty of room for growth, I’m confident many developers can get serious value out of it today.&lt;/p&gt;

&lt;p&gt;Check out the &lt;a href="https://github.com/pragmaticfish/sushify" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt; - it has everything you need to get started in a few minutes, plus a quick demo and screenshots. Oh, and feel free to leave a ⭐️ while you’re there 😉.&lt;/p&gt;

&lt;p&gt;Would love to hear your thoughts or questions in the comments!&lt;/p&gt;

</description>
      <category>ai</category>
      <category>opensource</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Stop Building AI Agents! Start Building Real AI Software Instead</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Mon, 18 Aug 2025 07:51:10 +0000</pubDate>
      <link>https://dev.to/isaachagoel/stop-building-ai-agents-start-building-real-ai-software-instead-4ppd</link>
      <guid>https://dev.to/isaachagoel/stop-building-ai-agents-start-building-real-ai-software-instead-4ppd</guid>
      <description>&lt;p&gt;Every true revolution has its detours. For AI, the promise is real, something fundamental has shifted. But as the latest &lt;a href="https://www.gartner.com/en/articles/hype-cycle-for-artificial-intelligence" rel="noopener noreferrer"&gt;Gartner AI Hype Cycle&lt;/a&gt; shows, our first big bet, the age of autonomous AI agents, turned out to be the wrong turn, at least for now.&lt;/p&gt;

&lt;p&gt;Why did we believe so strongly? The vision was intoxicating: describe a goal, hand your agent some tools, and let it handle the mess. No more manual wiring, no more business logic - the agent would figure it out, freeing us from tedious problem solving. For a while, it felt within reach; releases like o1, Claude-3.5-Sonnet, and GPT-4.1 were genuine leaps over what came before, unlocking new, reliable real-world use-cases. But then, momentum started to stall. Each new model (o3, GPT-4.5, Grok4, GPT-5) was hyped as the breakthrough, the moment when agents would finally work. Benchmarks inched higher, expectations soared.&lt;/p&gt;

&lt;p&gt;But real-world builders know: the fundamentals barely changed. Agents still hallucinate, lose context, and require endless handholding. When the long-anticipated GPT-5 landed and the leap still didn’t arrive, the industry finally had to admit: the breakthrough wasn’t coming. Now, we’re firmly in the &lt;a href="https://www.economist.com/business/2025/05/21/welcome-to-the-ai-trough-of-disillusionment" rel="noopener noreferrer"&gt;“trough of disillusionment”&lt;/a&gt; - the part of the cycle where the dreams get recalibrated and real progress starts.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz5vhdvj3hahu99j7yet.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwz5vhdvj3hahu99j7yet.png" alt="image The Gartner Hype Cycle" width="800" height="520"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Let me put it bluntly: if you still believe fully autonomous agents are here, show me a single agent working well in production outside of coding copilots. And even those coding agents are not really autonomous. They rely on constant back-and-forth with the user, who steers them and fixes their mistakes.&lt;/p&gt;

&lt;p&gt;The revolution isn’t cancelled. It’s just moved. The way forward is not autonomy at all costs, but tight, explicit integrations where LLMs are used as powerful, controlled components, not left to run the show. The slope of enlightenment is right in front of us, but it starts with dropping the agent fantasy.&lt;/p&gt;

&lt;p&gt;I’ve spent the past eight months in the trenches, shipping AI features in production, fighting with agent frameworks, and watching the same problems crop up again and again. Here’s the hard truth: agents sound good on conference slides, but in the real world they break, drift, and stall unless you babysit every step.&lt;/p&gt;

&lt;p&gt;So what’s the alternative? If you want reliability, iteration speed, and real business value, it’s time to treat the LLM as plumbing, not the pilot. &lt;/p&gt;

&lt;h2&gt;
  
  
  The Allure of Agents &amp;amp; the Hype Machine
&lt;/h2&gt;

&lt;p&gt;The agent gold rush wasn’t driven by developers alone. Three main actors shaped the frenzy, each with different bets and incentives. Framework makers, racing to build the glue and protocols for “autonomous” orchestration, were gambling on foundation model progress unlocking true autonomy. Hardware vendors and cloud platforms, from GPU makers to AI infrastructure startups, stood to profit from any paradigm that made AI workloads heavier and more ubiquitous. And then, of course, the foundation model companies themselves: OpenAI, Anthropic, Google, xAI, who fueled the optimism but quietly hedged their own bets, as we’ll see later.&lt;/p&gt;

&lt;p&gt;This ecosystem of overlapping interests turned the agent vision from a technical hypothesis into an industry narrative. That narrative was powerful, intoxicating, and, for a time, felt inevitable.&lt;/p&gt;

&lt;p&gt;The frameworks themselves grew more ambitious with each passing quarter. &lt;a href="https://www.crewai.com/" rel="noopener noreferrer"&gt;CrewAI&lt;/a&gt;, &lt;a href="https://www.swarms.ai/" rel="noopener noreferrer"&gt;Swarms&lt;/a&gt;, &lt;a href="https://www.langchain.com/langgraph" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;, and a host of other libraries and protocols sketched out a future where countless agents could collaborate in harmony: delegating, calling each other, and weaving together complex workflows. Vendors rolled out “research agents,” “autonomous computer use agents,” and more, hoping to lead the new stack. Demo videos showed swarms of agents reasoning their way through multi-step challenges, and slide decks promised a world where “set it and forget it” would finally apply to enterprise software. The sheer volume of articles, guides, and open source projects made it feel like this new order was just around the corner. In reality, none of these vendor-driven agents took off in a meaningful way. The userbase rejected them. &lt;/p&gt;

&lt;h2&gt;
  
  
  Piston Engines, Paradigm Shifts, Benchmarks and AI Models
&lt;/h2&gt;

&lt;p&gt;The story of today’s large language models closely parallels the golden age of piston engines in aviation. For decades, engineers genuinely believed that with bigger and more powerful piston engines and ever-better propeller designs, airplanes would keep getting faster - maybe even break the sound barrier. At first, every increment of horsepower seemed to open up new possibilities, but soon each leap delivered less and less. Eventually, the fundamental limits of the piston engine and propeller system became clear: pushing further would take a new paradigm. As historian Edward Constant puts it, “huge increases in engine horsepower were yielding smaller and smaller increases in speed” (&lt;a href="https://www.grahamhoyland.com/can-a-spitfire-break-the-sound-barrier" rel="noopener noreferrer"&gt;grahamhoyland.com&lt;/a&gt;), and as late as the 1930s, many in the field still believed a breakthrough was just ahead (&lt;a href="https://www.britannica.com/technology/military-aircraft/The-jet-age" rel="noopener noreferrer"&gt;Britannica&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;Yet, piston-engine planes were world-changing in their era - no one would call them a failure. They delivered most of what aviation needed for decades, just as LLMs now deliver astonishing, practical results across industries. We’re now at a similar point with LLMs. The transformer paradigm produced incredible breakthroughs, but simply scaling these models (even with reasoning) isn’t unlocking robust, general-purpose autonomy. &lt;/p&gt;

&lt;h3&gt;
  
  
  But Benchmarks...
&lt;/h3&gt;

&lt;p&gt;Recent releases do bring new benchmark highs, but the leap forward for real-world use just isn’t materializing. Much of this is because benchmarks measure “in-distribution” skills: tasks that are close to what the models have already seen. LLMs shine here. But step out of that comfort zone toward unfamiliar, more complex, or genuinely multi-step work and the cracks show: hallucinations, brittle context, unreliable reasoning.&lt;/p&gt;

&lt;p&gt;Researchers also noticed this benchmark disconnect. As &lt;a href="https://arxiv.org/html/2412.03597v1" rel="noopener noreferrer"&gt;this paper&lt;/a&gt; puts it: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The pursuit of leaderboard rankings in Large Language Models (LLMs) has created a fundamental paradox: models excel at standardized tests while failing to demonstrate genuine language understanding and adaptability. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another &lt;a href="https://arxiv.org/html/2502.14318v1" rel="noopener noreferrer"&gt;recent paper&lt;/a&gt; points to systemic limitations in the current benchmarking paradigm. &lt;/p&gt;




&lt;p&gt;None of this diminishes what we have in front of us. The improvements we still see in speed, cost and marginal capabilities are real and worth celebrating. But the nature of our tools is now clear and accepting their boundaries is what will finally allow us to build the next generation of AI software, rather than waiting for paradigm shifts that might take many years to arrive.&lt;/p&gt;

&lt;h2&gt;
  
  
  How Agents Fail in Practice
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What is an “AI agent”?&lt;/strong&gt; In the context of LLMs, an agent is a system that uses an LLM to autonomously plan and execute multi-step tasks: breaking goals into actions, deciding what to do next, choosing tools or APIs, and iterating—ideally without human intervention (&lt;a href="https://www.anthropic.com/engineering/building-effective-agents" rel="noopener noreferrer"&gt;Anthropic&lt;/a&gt;).&lt;/p&gt;

&lt;h3&gt;
  
  
  Why do they fail in the real world?
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Loss of control and premature exit:&lt;/strong&gt; Agents often “think” they’ve finished before all parts of a task are truly complete. They may exit prematurely, get stuck in loops, or miss obvious next steps - especially as tasks get more complex, long or open-ended.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict complexity limits:&lt;/strong&gt; As the number of instructions, tools, or task history grows, performance and reliability degrade sharply.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hallucinated actions and broken chains:&lt;/strong&gt; Agents frequently invent tool calls, take invalid actions, or misinterpret what’s needed—resulting in broken workflows or failure to deliver usable results.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cascading errors:&lt;/strong&gt; A mistake in one step can cause a chain reaction, with no robust recovery. The system can veer off course, miss the goal, or require manual reset.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Poor context management:&lt;/strong&gt; Agents can’t reliably hold or use all relevant context as tasks grow longer, causing confusion, forgotten requirements, or inconsistent decisions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Opacity and lack of debuggability:&lt;/strong&gt; It’s often unclear why an agent did what it did. Debugging and reproducing failures is notoriously hard.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Human-in-the-loop dependence:&lt;/strong&gt; For any non-trivial task, a human must step in to guide, correct, or “babysit” the agent. True autonomy almost never survives outside of narrow, simple demos.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;And about those demos:&lt;/em&gt; Most agent demos are meticulously iterated until they perform a single showcase scenario perfectly. It’s easy to make a video of an agent executing a specific, well-groomed task. What’s hard and still unsolved is getting robust, reliable performance in the messy, unpredictable real world.&lt;/p&gt;

&lt;h2&gt;
  
  
  Remembering What Software Is Actually About
&lt;/h2&gt;

&lt;p&gt;Software engineering is about taming complexity by breaking big problems into smaller, understandable parts. Each part is explicitly defined, with clear interfaces and predictable outcomes. We use modularization and encapsulation to create boundaries, making it possible to test, reason about, and improve each part independently.&lt;/p&gt;

&lt;p&gt;Good software makes intervention points clear. You can trace data as it flows through the system, observe where decisions are made, and know exactly where to apply a fix or add new logic. Branching paths and possible system states are explicit—not left to be inferred from opaque behavior.&lt;/p&gt;

&lt;p&gt;With strong boundaries and transparency, you maintain control as your system grows. This discipline is what makes it possible to build reliable, scalable software - no matter how ambitious the project.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflows, APIs, and the Real Path Forward
&lt;/h2&gt;

&lt;p&gt;If you work on AI applications, you’ve heard "agentic workflows" described as a lesser, “un-evolved” version of agents - maybe even a stopgap until agents get smart enough. The Anthropic team and others have framed agents as the natural next step, capable of solving more open-ended or complex tasks than traditional, stepwise workflows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;But that view is backwards&lt;/strong&gt;. In real-world scenarios, workflows can be dramatically more powerful, reliable, and extensible than agents. A well-designed workflow gives you granular control and intervention points at every boundary. Each state and branch is mapped, explicit, and testable. Workflows enable visibility, debuggability, and precise engineering at scale.&lt;/p&gt;

&lt;p&gt;This pattern isn’t new. For decades, software has relied on orchestrating external APIs and services. From the application’s perspective, these are “magic” - but they always speak in contracts, return well formed responses, and fit precisely into the broader workflow. Why not treat LLMs in exactly the same way? Give the model a narrow, well-scoped job, validate the output, and plug it into your workflow like any other high-powered component.&lt;/p&gt;

&lt;p&gt;We’re lucky that API model makers like OpenAI, Anthropic, and Google have quietly given us the tools we need for this approach. Their APIs provide ever improving structured output modes with type and schema enforcement, temperature and randomness controls, and even regex based constraints on outputs (&lt;a href="https://platform.openai.com/docs/guides/structured-outputs?api-mode=responses&amp;amp;lang=javascript" rel="noopener noreferrer"&gt;OpenAI Structured Outputs&lt;/a&gt;). These features let us treat LLMs as reliable, predictable, and tightly controlled components - like any other production API.&lt;/p&gt;
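&lt;p&gt;Here is a minimal sketch of that "validate, then plug in" step. The schema and field names are illustrative, not from any particular provider SDK:&lt;/p&gt;

```javascript
// Minimal sketch of treating an LLM call like any other API: enforce the
// contract at the boundary, and reject anything that doesn't satisfy it.
// The schema below is illustrative, not a real provider feature.
const flightChoiceSchema = {
  flight_id: (v) => typeof v === "string" && v.length > 0,
  arrival_before_noon: (v) => typeof v === "boolean",
};

function parseAndValidate(rawLlmOutput, schema) {
  const data = JSON.parse(rawLlmOutput); // throws on malformed JSON -> caller can retry
  for (const [key, check] of Object.entries(schema)) {
    if (!check(data[key])) throw new Error(`contract violation on "${key}"`);
  }
  return data; // safe to hand to the next workflow step
}

// A well-formed response passes; anything else fails loudly instead of
// silently derailing the rest of the workflow.
const choice = parseAndValidate('{"flight_id":"LH123","arrival_before_noon":true}', flightChoiceSchema);
```

&lt;p&gt;The point isn't the validation library (use Zod, JSON Schema, or the provider's structured-output mode); it's that the boundary is explicit, so failures surface at the call site instead of three steps downstream.&lt;/p&gt;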

&lt;p&gt;Agents (single, or a group of them talking to each other however they please) promise more adaptability but fail to deliver. They sacrifice the very things that make production software good. &lt;br&gt;
&lt;strong&gt;Our goal shouldn't be chasing autonomy for its own sake but delivering high-quality, never-before-possible apps and features to our users&lt;/strong&gt; (who &lt;a href="https://x.com/andrewchen/status/1952577132586213705" rel="noopener noreferrer"&gt;couldn't care less about AI&lt;/a&gt; btw).&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow vs. Agent: The Flight and Hotel Booking Test
&lt;/h2&gt;

&lt;p&gt;Let’s use the familiar (though contrived) “Book a flight and hotel” scenario—often showcased as the ultimate test for agent frameworks. The examples below are simplified for educational purposes. &lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Agent Approach&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Setup:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Tools provided:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;search_flights(origin, destination, date, arrival_time)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;book_flight(flight_id, passenger_info)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;search_hotels(city, checkin_date, checkout_date, near)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;book_hotel(hotel_id, guest_info)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;send_email(recipient, content)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;say_to_user(message)&lt;/code&gt;&lt;/li&gt;
&lt;li&gt; planning tools like &lt;code&gt;add_todo&lt;/code&gt;, &lt;code&gt;update_todo&lt;/code&gt;, &lt;code&gt;read_todo&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;&lt;br&gt;
A long, detailed instruction set describing the user goal, each tool and its parameters, usage rules, and task-specific edge cases. Typically thousands of tokens just for context.&lt;/p&gt;&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;In theory:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The agent receives the user goal (“Book me a flight to Berlin on May 22nd, arriving before noon, and a hotel for two nights near the conference venue”), then plans and executes—deciding which tools to call, in what order, and how to handle ambiguity, branching or errors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What actually happens:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The agent may exit after booking a flight, skipping the hotel, or vice versa.&lt;/li&gt;
&lt;li&gt;It often ignores critical constraints (“arrive before noon”), or invents data not in the API responses.&lt;/li&gt;
&lt;li&gt;It can take destructive actions like booking the wrong flight.&lt;/li&gt;
&lt;li&gt;With each added tool or requirement, prompt and tool complexity grows, and error rates rise.&lt;/li&gt;
&lt;li&gt;Debugging or extending the workflow means rewriting prompts, retraining, adding brittle heuristics, or outright begging the agent to do the right thing.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The A-ha Moment
&lt;/h3&gt;

&lt;p&gt;But wait: when you look closely, booking a trip, like most business tasks, doesn’t actually require AGI-level flexibility or deep autonomy. It’s a highly structured problem that follows a predictable set of steps toward completion. The apparent complexity comes from edge cases and details, not from open-ended reasoning. When you break it down, almost every aspect can be handled with clear logic, explicit checks, and a series of well-defined handoffs. The “magic” is in the composition, and the need for intelligence can be confined within clear boundaries.&lt;/p&gt;

&lt;h2&gt;
  
  
  Workflow/LLM Integration Approach
&lt;/h2&gt;

&lt;p&gt;This approach is about code-first orchestration. LLMs are used only for well-defined, bounded tasks and for focused decision making, never for orchestration.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example Workflow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Parse user input&lt;/strong&gt; (only if using a chat interface, which is often unnecessary) 

&lt;ul&gt;
&lt;li&gt;Code calls LLM with a schema-enforced prompt:
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;success&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;parsedParams&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
              &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;destination&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Berlin&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;arrival_date&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;2024-05-22&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;arrival_time&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;09:30&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;nights&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
              &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;venue_address&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Berlin Congress Center&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt; 
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="c1"&gt;// or (via Zod union type or Pydantic)&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
          &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;result&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;type&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;missing_info&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;feedbackToUser&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;What city should I book the hotel in?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Vendor-side schema validation (e.g. OpenAI's structured outputs) guarantees that outputs are always structured and complete.&lt;/li&gt;
&lt;/ul&gt;
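&lt;p&gt;For illustration, the union above can be enforced with a small validator. In a real project you would reach for Zod or Pydantic as mentioned; this stdlib sketch just shows the two shapes the workflow accepts.&lt;/p&gt;

```python
# A stdlib sketch of the discriminated union above (in real code you
# would use Pydantic or Zod, as the article notes). The validator
# checks the two shapes the workflow accepts and rejects anything else.
def validate_parse_result(payload):
    """Return the inner result if it matches one of the two shapes."""
    result = payload.get("result", {})
    kind = result.get("type")
    if kind == "success":
        params = result.get("parsedParams", {})
        required = {"destination", "arrival_date", "arrival_time",
                    "nights", "venue_address"}
        if required.issubset(params):
            return result
    elif kind == "missing_info":
        if "feedbackToUser" in result:
            return result
    raise ValueError("LLM output does not match the schema")
```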

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Branch: Request missing info or continue&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Code inspects the response. If the type is "missing_info", prompt the user with the feedback; when the user responds, feed the full exchange back into the first LLM. If "success", move on to search.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prepare search parameters &amp;amp; call APIs&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Code builds params for booking APIs (both for exact matches and for alternate/flexible options if needed).&lt;/li&gt;
&lt;li&gt;Code, not the LLM, calls these APIs in parallel and collects the results. No hallucinations possible.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM-powered ranking (tightly scoped)&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Code sends the original query and API results to an LLM.&lt;/li&gt;
&lt;li&gt;LLM returns a sorted list of candidate options, with justifications, in a strict schema. Notice that the output contains only ids, even though we provided the full details as input: there is no need for the LLM to repeat information we already have.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;        &lt;span class="p"&gt;[&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;LH1234&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;explanation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Arrives before noon, direct flight, good price.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
          &lt;span class="p"&gt;},&lt;/span&gt;
          &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;id&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;LH5678&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;rank&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;explanation&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Slightly later arrival, lower price.&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
          &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;IDs are validated against the real API results. Again - no room for hallucinated data.&lt;/li&gt;
&lt;/ul&gt;
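&lt;p&gt;That validation step can be a few lines of plain code. A sketch, with field names matching the JSON above:&lt;/p&gt;

```python
# The ID check described above: every id the LLM ranked must exist
# in the real API results, or we reject the ranking outright.
def validate_ranking(ranked, api_results):
    """Keep only entries whose id came from the actual API response."""
    real_ids = {r["id"] for r in api_results}
    validated = [entry for entry in ranked if entry["id"] in real_ids]
    if len(validated) != len(ranked):
        # A hallucinated id slipped in; handle it explicitly in code.
        raise ValueError("LLM returned an id not present in API results")
    return sorted(validated, key=lambda e: e["rank"])
```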

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Present the best options to the user&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Present the results to the user.&lt;/li&gt;
&lt;li&gt;Branch: if the user is unhappy and provides additional requirements, go back to the first LLM with the full exchange.&lt;/li&gt;
&lt;li&gt;If the user selects an option, code books the flight, either immediately or later.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Booking a hotel&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Repeat a similar process, but use the LLM to intelligently select a city and a location within it, and to create search filters based on the user's input or what we know about them (e.g. past preferences).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
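&lt;p&gt;Put together, the whole workflow is ordinary procedural code. In this sketch the LLM and API calls are hypothetical stand-ins passed in as parameters; the point is that code, not the model, owns the control flow.&lt;/p&gt;

```python
# The workflow above as plain procedural code. llm_parse, llm_rank,
# flights_api, ask_user and present_options are stand-ins for the real
# LLM calls, booking API and UI; code owns branching and retries.
def book_trip(user_message, llm_parse, llm_rank, flights_api,
              ask_user, present_options, max_rounds=3):
    exchange = [user_message]
    for _ in range(max_rounds):
        parsed = llm_parse(exchange)                 # step 1: scoped LLM parse
        if parsed["type"] == "missing_info":         # step 2: branch in code
            exchange.append(ask_user(parsed["feedbackToUser"]))
            continue
        params = parsed["parsedParams"]
        flights = flights_api(params)                # step 3: code calls the API
        ranked = llm_rank(params, flights)           # step 4: scoped LLM ranking
        choice = present_options(ranked)             # step 5: user picks
        if choice is not None:
            return choice                            # hand off to booking code
    raise RuntimeError("Could not gather enough info to search")
```

&lt;p&gt;Every branch here is observable, testable, and recoverable, which is exactly the point of the key list that follows.&lt;/p&gt;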




&lt;p&gt;&lt;strong&gt;Key Points:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;em&gt;LLMs do not orchestrate—they handle bounded parsing or ranking jobs, always with explicit, code-enforced contracts.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;All critical state and business logic stays in code—clear, testable, and maintainable.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Every failure point is observable and recoverable.&lt;/em&gt;&lt;/li&gt;
&lt;li&gt;&lt;em&gt;Each LLM call can have an extremely detailed, targeted prompt and make efficient use of its context window.&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Contrast with agentic frameworks:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Many “agentic workflow” frameworks (like &lt;a href="https://langchain-ai.github.io/langgraphjs/tutorials/workflows/" rel="noopener noreferrer"&gt;LangGraph&lt;/a&gt;) promote chaining LLM calls, connecting nodes in a graph, and using prompt chaining utilities as the main pattern. Even when the LLM isn’t making every decision, the framework subtly encourages LLM-centric designs that don't look like good old procedural logic. In other words, most current frameworks push you toward agentic complexity. The truth is that if you follow the code-first pattern above, you may find you need very little of the extra scaffolding and abstractions these frameworks offer. &lt;/p&gt;

&lt;h2&gt;
  
  
  Caveats &amp;amp; Legitimate Exceptions
&lt;/h2&gt;

&lt;p&gt;This isn’t dogma: the programming model described above still allows tightly scoped "agentic" loops, retries with feedback, and so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Tightly Scoped Feedback Loops (scoped decision making):&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When generating artifacts like SQL queries with an LLM, the code executes the query. If it fails, the error (or even the successful results) is sent back to the LLM for revision or validation, with a limited number of retries. The query results can then be handed off to code or to another LLM for processing. These loops are bounded, schema-driven, and transparent. They’re pragmatic error recovery, not open-ended “agentic” wandering.&lt;/p&gt;
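&lt;p&gt;Such a loop might look like this sketch, where &lt;code&gt;generate_sql&lt;/code&gt; and &lt;code&gt;run_query&lt;/code&gt; are hypothetical stand-ins for the LLM call and the database layer:&lt;/p&gt;

```python
# A bounded feedback loop: the LLM drafts SQL, code executes it, and
# errors are fed back for revision a fixed number of times.
# generate_sql and run_query are stand-ins for the real LLM call and
# database layer.
def sql_with_retries(question, generate_sql, run_query, max_attempts=3):
    feedback = None
    for attempt in range(max_attempts):
        sql = generate_sql(question, feedback)   # scoped LLM call
        try:
            return run_query(sql)                # code owns execution
        except Exception as err:
            feedback = f"Query failed: {err}"    # bounded, transparent retry
    raise RuntimeError(f"Gave up after {max_attempts} attempts")
```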

&lt;p&gt;&lt;strong&gt;2. Multi-Step Reasoning Without Agentic Loops:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Some tasks, like analyzing a massive file or gradually tracing logic, seem to call for agents because they can't be done effectively in a single pass. But the new generation of &lt;em&gt;reasoning models&lt;/em&gt; (think o3, Deepseek R1) can often handle such complexity internally, in a single, well-structured prompt. The LLM gets everything it needs up front and processes as much as it wants via multiple reasoning steps, until it is ready to return a single output that your code can validate and act on, no agentic looping required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Tools / Function Calling:&lt;/strong&gt;&lt;br&gt;
Yes, in some very complex scenarios you will have to pass tools and let the LLM decide which exact calls to make before returning the desired output. It's unlikely these scenarios exist in your app (😉), but if you can't avoid it, remember to keep the agent in question as small and focused as possible and to minimise the number of tools you give it. &lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Summary:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Don’t avoid feedback loops or multi-step reasoning. Just keep them bounded, schema-driven, and under the control of explicit code.&lt;/p&gt;

&lt;h2&gt;
  
  
  Wrapping Up (and What’s Next)
&lt;/h2&gt;

&lt;p&gt;The AI revolution is real but building robust, production-ready AI software means letting go of the agent fantasy and returning to the fundamentals that made software engineering great. Tightly controlled, code-centric integrations win every time.&lt;/p&gt;

&lt;p&gt;There’s still a missing piece: the right tools, patterns, and principles for this new paradigm barely exist outside of the heads of a few experts. The current ecosystem is immature, and most frameworks still push us toward agentic complexity. Defining and building a real code-first stack and a shared set of “AI software engineering” principles will be the next challenge for our community.&lt;/p&gt;

&lt;p&gt;I’ll be sharing more practical patterns, tooling ideas, and principles in upcoming posts. If you’re building in this space, let’s connect. Disagree with me? Seen agents work in the wild? Share your stories in the comments or reach out.&lt;/p&gt;




</description>
      <category>ai</category>
      <category>programming</category>
      <category>machinelearning</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why LLM Memory Still Fails - A Field Guide for Builders</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Tue, 29 Jul 2025 06:20:57 +0000</pubDate>
      <link>https://dev.to/isaachagoel/why-llm-memory-still-fails-a-field-guide-for-builders-3d78</link>
      <guid>https://dev.to/isaachagoel/why-llm-memory-still-fails-a-field-guide-for-builders-3d78</guid>
      <description>&lt;p&gt;It's an open secret that despite the immense power of Large Language Models, the AI revolution hasn’t swept through every industry and not for lack of trying. We were warned AI would take our jobs and replace every app, but that hasn’t happened. Why?&lt;/p&gt;

&lt;p&gt;Some would say "hallucinations," but let’s be honest - people hallucinate too, and often more than modern LLMs. The real missing piece, the thing standing in the way of the AI tsunami, is &lt;strong&gt;memory&lt;/strong&gt;: the ability to learn, grow, and evolve over time.&lt;/p&gt;

&lt;p&gt;Imagine hiring an AI as a new team member. You wouldn’t expect them to know everything on day one. They’d need to learn the role, get to know the team, understand your business logic, make mistakes, get feedback, and improve. All of that learning happens over time.&lt;/p&gt;

&lt;p&gt;LLMs as they exist today, even when equipped with the best available tools, can’t do any of that. They are stateless and frozen in time. &lt;/p&gt;

&lt;p&gt;This isn’t a theoretical overview - I rolled up my sleeves and tested real systems to see what actually works.&lt;/p&gt;




&lt;h3&gt;
  
  
  Stateless Intelligence
&lt;/h3&gt;

&lt;p&gt;The only way to introduce new information or "teach" an LLM new skills is by providing it with examples, instructions, and all the accumulated relevant information repeatedly with every invocation (prompt). This blob of tokens given to the model is known as "context", and it has a clear failure mode, "context rot": the more you stuff into the prompt, the harder it becomes for the model to separate signal from noise and know what to attend to. This is true even for models with large context windows (e.g. 1M tokens for GPT-4.1 or Gemini 2.5 Pro). We all know this first-hand: that moment when you realise you need to start a new chat because the model gets all confused. It's also &lt;a href="https://research.trychroma.com/context-rot" rel="noopener noreferrer"&gt;backed by research&lt;/a&gt;. In other words, the idea of dumping "all the datas" into the context window fails the test of reality.&lt;/p&gt;

&lt;p&gt;Because of that, &lt;a href="https://x.com/karpathy/status/1937902205765607626" rel="noopener noreferrer"&gt;Context Engineering&lt;/a&gt; is the "make or break" pillar of any sufficiently complex LLM-centered feature/app and the most important skill for engineers building with AI.&lt;/p&gt;

&lt;p&gt;This is why there is a whole industry around "how to get relevant information into the context" (and flush it out or compact it when it becomes less relevant). There is a whole slew of commercial and open-source offerings, all of which revolve around different flavours of &lt;a href="https://en.wikipedia.org/wiki/Retrieval-augmented_generation" rel="noopener noreferrer"&gt;RAG&lt;/a&gt; (e.g. Agentic RAG, Graph RAG).&lt;/p&gt;

&lt;p&gt;The real confusion begins when these offerings start using the name "memory" for these RAG-based solutions - sometimes &lt;a href="https://langchain-ai.github.io/langgraph/concepts/memory/#memory-types" rel="noopener noreferrer"&gt;going as far as splitting it into categories&lt;/a&gt; like "episodic memory" or "semantic memory". This is smart marketing but creates false expectations by analogy. &lt;/p&gt;

&lt;p&gt;Another thing you'd notice if you start playing with these "memory" frameworks/libraries is that they focus on interactions with a single user: the "chatbot" use case we know today, but definitely not what a real agent that continuously operates in an "open world" environment (like the AI worker we discussed before) requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What About The Memory Feature In ChatGPT?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When OpenAI say they &lt;a href="https://openai.com/index/memory-and-new-controls-for-chatgpt/" rel="noopener noreferrer"&gt;added memory to ChatGPT&lt;/a&gt;, what they actually mean is that they gave the model the ability to store a flat list of textual blobs about the user and to search over them (presumably using RAG). As before, the scope is a single user, and it suffers from all the usual limitations of RAG, which we will discuss next. &lt;/p&gt;




&lt;h3&gt;
  
  
  “Memory” Systems That Aren’t 
&lt;/h3&gt;

&lt;p&gt;I explored most of the major "memory" implementations out there. As I said before, all of them are search tools with the label "memory" slapped on top.&lt;/p&gt;

&lt;p&gt;Most fall into one of three camps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;RAG&lt;/strong&gt; (Retrieval-Augmented Generation): Vector search over external notes or structured memories. It’s decent when you want to retrieve a few relevant examples for a specific fact or topic, but it’s not designed to surface every occurrence or reason about them in aggregate. RAG retrieves semantically similar, often out-of-context chunks, expecting the LLM to stitch them together, which can lead to incomplete or inaccurate results - sometimes even hallucinations when gaps are filled incorrectly. It also struggles with large datasets, where relevant information gets buried under noisy matches, and with complex, multi-hop queries requiring reasoning (e.g., analyzing trends or causality). For example, if you ingest "Lord of the Rings" and ask for all disagreements between characters, RAG might surface vaguely related scenes rather than a comprehensive list. It’s also poor at associative tasks, e.g., a user says “Today is my anniversary,” and RAG retrieves generic anniversary info instead of memories tied to the user’s relationship. This isn’t surprising given how it works: vectorizing the query and searching for the nearest text chunks in a flat list.&lt;/p&gt;

&lt;p&gt;The great post &lt;a href="https://www.letta.com/blog/rag-vs-agent-memory" rel="noopener noreferrer"&gt;here&lt;/a&gt; provides a solid breakdown of these shortcomings, though I don’t fully agree with their proposed fix (agentic RAG - see below).&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most modern systems don’t rely on RAG alone - they pair it with structured metadata, keyword indices, or hybrid approaches (&lt;a href="https://weaviate.io/blog/hybrid-search-explained" rel="noopener noreferrer"&gt;see here&lt;/a&gt;) to try and make up for its limitations. The strengths of RAG are in its simplicity and unstructured nature (easy ingestion) but they are also its downfall. When it comes to implementing the kind of memory true persistent agents need - RAG won't do.&lt;/p&gt;
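&lt;p&gt;To see why flat retrieval behaves this way, here is the core mechanism reduced to a toy: embed the query, scan a flat list of chunk vectors, return the nearest chunks. The bag-of-words "embedding" is a deliberately crude stand-in for a real embedding model.&lt;/p&gt;

```python
# The retrieval step described above, reduced to its core: embed the
# query, score every chunk in a flat list, return the top matches.
# Nothing here understands context or causality; it only measures
# surface similarity.
import math

def embed(text):
    """Toy bag-of-words vector (a real system uses a trained model)."""
    vec = {}
    for w in text.lower().split():
        vec[w] = vec.get(w, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[k] * b.get(k, 0) for k in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def retrieve(query, chunks, top_k=2):
    q = embed(query)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:top_k]   # similar-looking chunks; coherence not guaranteed
```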

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Agentic RAG&lt;/strong&gt;: The main idea behind agentic RAG is to take a standard RAG system and allow the LLM to query it multiple times in a loop - refining its queries and accumulating context until it has what it needs to answer. This enables more sophisticated reasoning and planning. Unfortunately, it inherits the same core limitations: the underlying retrieval is still vector-based RAG, so it suffers from context fragmentation, relevance drift, and shallow matches. The iterative nature also makes it computationally expensive, token-hungry, and often too slow for real-time use. It can get stuck in "loops of doom" or terminate prematurely without finding the necessary information. &lt;/p&gt;

&lt;p&gt;Although it improves upon RAG, Agentic RAG still falls short of what we intuitively think of as memory. That’s why I turned to something more structured...&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;To be clear&lt;/strong&gt;, not all use cases require real memory the way it's defined in this post. If your goal is to retrieve a page from documentation or pull up a few helpful examples - RAG can be perfectly sufficient. Its simplicity makes it fast to implement and often “good enough” in practice. But once your system needs to reason over time, adapt to new experiences, or manage overlapping context - &lt;strong&gt;you’re outside RAG territory&lt;/strong&gt;. That’s where the memory gap becomes painfully obvious.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Graph RAG&lt;/strong&gt;: In a &lt;a href="https://dev.to/isaachagoel/read-this-before-building-ai-agents-lessons-from-the-trenches-333i"&gt;previous post&lt;/a&gt;, I described how I tried to compensate for RAG's limitations using agent-generated SQL queries over structured metadata. The core insight was simple: &lt;strong&gt;RAG lacks structure&lt;/strong&gt;. So what if we added it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Graph-RAG&lt;/strong&gt; attempts exactly that. During ingestion, a large language model extracts entities and relationships from text and encodes them as nodes and edges in a graph. Later, retrieval happens by traversing that graph - e.g. walking outward from a node, filtering by relationship types, running path algorithms, and so on. Some frameworks even add a temporal dimension, which brings it a little closer to how we imagine human memory.&lt;/p&gt;

&lt;p&gt;It sounds promising on paper. After all, remembering is often associative - one idea leads to another. Graphs seem like a natural fit.&lt;/p&gt;

&lt;p&gt;However, this sophistication comes at a steep cost: the simplicity of traditional RAG is lost. Operations grow complex - entity resolution becomes a puzzle (e.g., does "the king," "king Arthur," or "He" refer to an existing "Arthur" node or a new entity?), and disambiguation is tricky (e.g., distinguishing between multiple Arthurs like the king, his father, or a peasant). Beyond that, challenges like conflict resolution, data invalidation (when new information arrives), and compaction arise.&lt;/p&gt;

&lt;p&gt;These are solvable, maybe... the &lt;strong&gt;real blocker&lt;/strong&gt; is &lt;strong&gt;schema design&lt;/strong&gt;:&lt;br&gt;&lt;br&gt;
How do you decide upfront what types of nodes and edges are relevant? In domains with rigid structure, like business workflows or e-commerce, you can get away with it, but for modeling generic memory? The kind of evolving, messy, contextual knowledge humans have? It falls apart.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Memory is not a tree of concepts. It's a living web of hypotheses, contradictions, associations, and revisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;em&gt;(PS: There are variants of Graph-RAG&lt;/em&gt; &lt;a href="https://arxiv.org/html/2404.16130v2" rel="noopener noreferrer"&gt;&lt;strong&gt;&lt;em&gt;like this one&lt;/em&gt;&lt;/strong&gt;&lt;/a&gt; &lt;em&gt;that builds a tree structure, summarizing information as you move up. I didn’t explore it deeply since it felt even less suited for memory use cases.)&lt;/em&gt;&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agentic Graph RAG&lt;/strong&gt;: Not sure it exists but no reason not to try :) &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
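&lt;p&gt;A tiny sketch makes the entity-resolution problem tangible. Here the alias table is hand-written; in a real Graph-RAG pipeline an LLM has to produce it during ingestion, and that is exactly the fragile part. All names are illustrative.&lt;/p&gt;

```python
# The ingestion shape Graph-RAG relies on, and where it gets hard:
# every mention must resolve to a node, but "the king", "king arthur"
# and "he" may or may not be the same entity. This alias table is
# hand-written; an LLM has to produce it in practice, which is the
# fragile part.
nodes = {"arthur_king": {"label": "Arthur", "role": "king"}}
edges = []

ALIASES = {"the king": "arthur_king",
           "king arthur": "arthur_king",
           "he": "arthur_king"}   # only safe while context is unambiguous

def add_relation(source_mention, relation, target):
    source = ALIASES.get(source_mention.lower())
    if source is None:
        raise KeyError(f"Unresolved entity mention: {source_mention}")
    edges.append((source, relation, target))

add_relation("The king", "rules", "camelot")
```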

&lt;p&gt;As a software engineer, I was heavily attracted to Graph RAG and spent a long time playing with &lt;a href="https://github.com/getzep/graphiti/tree/main/graphiti_core" rel="noopener noreferrer"&gt;Graphiti&lt;/a&gt;. When I realised it was built with a chat between a single user and a single agent in mind, I even tried to implement my own customised version on top of Neo4j, tailored for my needs. But defining a good schema and ingesting long-form text into coherent, evolving memory graphs? That turned out to be &lt;strong&gt;really&lt;/strong&gt; hard. Humans build memory by revising beliefs, forming hypotheses, forgetting selectively and reading between the lines.&lt;/p&gt;

&lt;p&gt;Take something as ordinary as a chat log. Here's a real (simplified) example:&lt;/p&gt;

&lt;p&gt;Person A: "Oh, I'm gonna be so late..."&lt;br&gt;&lt;br&gt;
Person B: "What happened?"&lt;br&gt;&lt;br&gt;
Person A: "Ah, too embarrassed to ssy"&lt;br&gt;&lt;br&gt;
Person A: "*say"&lt;br&gt;&lt;br&gt;
Person B: "Lol, I bet you it's that roommate of yours again"&lt;br&gt;&lt;br&gt;
Person A: "That dude always forgets where he put our keys :("&lt;/p&gt;

&lt;p&gt;This tiny exchange contains a surprising amount of context and information: there's a roommate, the keys were lost, lateness resulted, and there's an ongoing joke or shared memory. Humans pick this up instinctively, but encoding it into a graph - resolving entities, inferring causality, surfacing associations is non-trivial. Where do you even start?&lt;/p&gt;

&lt;p&gt;Then, after grappling with it for a few days, it hit me: &lt;strong&gt;this was Symbolic AI all over again.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Like &lt;a href="https://en.wikipedia.org/wiki/Symbolic_artificial_intelligence" rel="noopener noreferrer"&gt;Symbolic AI&lt;/a&gt;, Graph RAG gives you the illusion of control: explicit structure, clean logic, tidy representations. But it breaks down the moment things get ambiguous, nuanced, or evolve over time. That neatness just doesn’t hold up in the real world. I pondered: what was the antidote to Symbolic AI? Deep learning and the transformer architecture...&lt;/p&gt;

&lt;p&gt;And then, something clicked.&lt;/p&gt;

&lt;p&gt;There already exists a system with exceptional recall - something that &lt;em&gt;can&lt;/em&gt; store associative, fuzzy, contextual information and resurface it later. &lt;strong&gt;LLMs, when it comes to their pre-trained data, already behave like they have memory.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;But the magic is in &lt;em&gt;how&lt;/em&gt; they remember. LLMs don’t store records in a database. They don't store records at all. The knowledge they absorb during training becomes embedded in their weights. And when you query them, the right patterns get activated.&lt;/p&gt;

&lt;p&gt;Try it: ask ChatGPT about something obscure, temporal, or requiring synthesis and instruct it to answer without using tools (no web search!). The results can be eerie. That thing remembers A LOT. It has zero trouble with timelines, contradictory information or anything else that traditional systems struggle with. &lt;a href="https://chatgpt.com/share/68875988-7c54-8008-94a4-c47b78a35650" rel="noopener noreferrer"&gt;Here’s an example.&lt;/a&gt; &lt;a href="https://chatgpt.com/share/68875cd5-a3ac-8008-84ff-3a4e2ee197ef"&gt;Here is another one&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So what’s the problem? The weights are fixed, right? Once training ends, the model’s knowledge is frozen in time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Or is it?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I remembered that &lt;a href="https://www.ibm.com/think/topics/fine-tuning"&gt;fine-tuning&lt;/a&gt; updates model weights post-training, usually to match a tone, format, or domain. It basically continues the training process after it has completed. So why not use it to &lt;em&gt;add&lt;/em&gt; memory? Why not encode new experiences or user knowledge directly into the weights?&lt;/p&gt;




&lt;h3&gt;
  
  
  Memory in the Weights? Maybe
&lt;/h3&gt;

&lt;p&gt;Unfortunately, fine-tuning can't do it 😕. It turns out there's a well-known issue in continual learning called &lt;a href="https://arxiv.org/html/2308.08747v5" rel="noopener noreferrer"&gt;&lt;strong&gt;catastrophic forgetting&lt;/strong&gt;&lt;/a&gt;: when you fine-tune a model on new knowledge, it inevitably overwrites older capabilities; the more you fine-tune, the more of the original knowledge you lose. Not ideal if you're trying to simulate a persistent, growing memory.&lt;/p&gt;

&lt;p&gt;That realization sent me down a rabbit hole of academic papers. Unsurprisingly, I wasn’t the first person to chase this idea and I quickly found some genuinely exciting research that tries to do what fine-tuning can’t. Two papers stood out: &lt;a href="https://arxiv.org/html/2402.04624v2" rel="noopener noreferrer"&gt;&lt;strong&gt;MemoryLLM&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://arxiv.org/html/2504.21239v1" rel="noopener noreferrer"&gt;&lt;strong&gt;MEGa&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The MEGa paper was fascinating and advanced but didn’t release any code. MemoryLLM, on the other hand, did, and its approach was clever: rather than modifying the entire model, they introduced a &lt;strong&gt;dedicated memory region&lt;/strong&gt; within the weights. The base model stays untouched, while memory is isolated, updated, and read from dynamically at inference. They even accounted for "forgetting" older, less frequently used information. &lt;/p&gt;
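&lt;p&gt;As a loose conceptual sketch only (not the paper's actual mechanism, which operates on latent vectors inside the transformer), the shape of the idea is a fixed-size memory region that is written to and read from at inference time while the base weights stay frozen:&lt;/p&gt;

```python
# A loose conceptual sketch of the idea, NOT MemoryLLM's actual
# mechanism: the base model stays frozen, while a fixed-size memory
# pool is written to and read from at inference time. The oldest
# entries are evicted to make room, i.e. gradual "forgetting".
from collections import deque

class MemoryPool:
    def __init__(self, capacity=4):
        self.slots = deque(maxlen=capacity)  # fixed-size memory region

    def write(self, memory):
        self.slots.append(memory)            # oldest slot drops off when full

    def read(self):
        return list(self.slots)              # consulted alongside frozen weights

pool = MemoryPool(capacity=2)
for m in ["fact-1", "fact-2", "fact-3"]:
    pool.write(m)
# "fact-1" has been forgotten; only the two newest memories remain.
```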

&lt;p&gt;And the most beautiful thing: &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Since the memories were encoded in the weights - none of the context window limitations apply.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I instantly knew I had to try it first-hand.&lt;/p&gt;




&lt;h3&gt;
  
  
  Getting Hands-On: Memory LLM
&lt;/h3&gt;

&lt;p&gt;I cloned &lt;a href="https://github.com/wangyu-ustc/MemoryLLM"&gt;the repo&lt;/a&gt; and got it running on &lt;a href="https://console.runpod.io/"&gt;a remote machine&lt;/a&gt; with a beefy GPU (after burning a few hours on trial and error, fighting with Python dependencies, etc.). &lt;/p&gt;

&lt;p&gt;When I went over the codebase, one thing immediately stood out: the researchers had actually &lt;strong&gt;modified the inference logic&lt;/strong&gt; of the model to support reading from and writing to memory. Not the training pipeline, &lt;strong&gt;the live inference code&lt;/strong&gt;. That’s a part of the stack we developers never go near. But seeing it altered made something click for me:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’re used to thinking of models as black boxes that you train or fine-tune. But you can also &lt;strong&gt;intervene in how they run&lt;/strong&gt;, almost like patching application code. That’s powerful and honestly under-explored by engineers.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then I saw they only supported Llama 3, a relatively old, weak model, and that highlighted something else:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;The research mindset is very different from the engineering mindset.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Researchers prototype with the goal of publishing a paper. That means small models, clean baselines, simple benchmarks, limited scope. Engineers, on the other hand, build with the goal of reaching a working, usable POC that can survive in the real world, not in a lab. They reach for the most powerful tools they can find (for open models, that would be LLaMA 4, DeepSeek R1, or Kimi at the time of writing). The last thing we engineers want is to be bottlenecked by a weak model. We instinctively ask: "will this scale?"&lt;/p&gt;

&lt;p&gt;But here’s the tradeoff: those stronger models often come with much more complex internals: Mixture of Experts, longer pipelines, finicky tokenizers, harder fine-tuning. They’re not well documented and not easy to poke at unless you have serious time, compute, and domain knowledge. The researchers are making a pragmatic choice, one that makes sense for them but leaves us wanting.&lt;/p&gt;

&lt;p&gt;Still, I had a plan: if MemoryLLM could reliably store and retrieve memory (even at small scale), I could wrap it with a smarter agent that decides what’s worth remembering, when to save it, and how to use it in the future. I could scale horizontally with multiple instances. I didn’t need perfection, just a signal that the core mechanism worked.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;And if it &lt;em&gt;did&lt;/em&gt; work, it would open up a whole new avenue: not just tweaking prompts or retraining models, but actually &lt;strong&gt;engineering&lt;/strong&gt; memory systems by intervening in the model’s runtime itself.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  Reality Check: Does MemoryLLM Work?
&lt;/h3&gt;

&lt;p&gt;I had high hopes. The benchmarks in the paper looked great, but I was about to swallow some bitter medicine.&lt;/p&gt;

&lt;p&gt;MemoryLLM offers two modes: &lt;code&gt;chat&lt;/code&gt; and &lt;code&gt;mplus&lt;/code&gt;. The latter increases memory storage and improves retrieval, but it isn’t optimized for conversational flows - it tends to keep generating past the user’s question as if continuing the chat history. I tested both.&lt;/p&gt;

&lt;p&gt;I knew that in the paper they ingested short snippets, but for my use case that wasn't practical. Consider the same example we looked at before:&lt;/p&gt;

&lt;p&gt;Person A: "Oh, I'm gonna be so late..."&lt;/p&gt;

&lt;p&gt;Person B: "What happened?"&lt;/p&gt;

&lt;p&gt;Person A: "Ah, too embarrassed to ssy"&lt;/p&gt;

&lt;p&gt;Person A: "*say"&lt;/p&gt;

&lt;p&gt;Person B: "Lol, I bet you it's that roommate of yours again"&lt;/p&gt;

&lt;p&gt;Person A: "That dude always forgets where he put our keys :("&lt;/p&gt;

&lt;p&gt;LLMs have no problem understanding this the way a human would, but only if it is ingested as a whole. If we ingest it line by line, it loses all meaning.&lt;br&gt;
Ingesting full "episodes" was possible in my previous experiments, when I was playing with Graphiti, but not so with MemoryLLM.&lt;/p&gt;

&lt;p&gt;When I tried to ingest examples like the conversation above, the results were disappointing. Sometimes the input was ignored entirely. Other times, the responses were hallucinated or incoherent. When I carefully mimicked the benchmark setup, including their custom attention masks, and drastically shortened the inputs, I could get semi-coherent replies, but still nothing close to usable in an actual system.&lt;/p&gt;

&lt;p&gt;Digging into the benchmark code revealed why: the test setup was &lt;strong&gt;highly unrealistic&lt;/strong&gt;. Simple, isolated prompt-response pairs. No real dialogue. No sustained context. In short, nothing that resembled real-world use.&lt;/p&gt;

&lt;p&gt;This isn’t a knock on the researchers - they weren’t trying to build a production agent and I do think their ideas are remarkable. But it served as a bitter reminder: synthetic benchmarks can look impressive while masking critical limitations.&lt;/p&gt;

&lt;p&gt;To be fair, I’m not an AI researcher. Maybe I missed something. But after days of config tweaks, prompt engineering, and test cases, I’m pretty confident: this is an exciting idea but it’s still in its infancy. Not ready for real-world agents.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Contrast
&lt;/h3&gt;

&lt;p&gt;Here’s the kicker: I was using Claude 4 and o3 &lt;strong&gt;inside my IDE&lt;/strong&gt; to help write, test, and troubleshoot all of this. The difference was staggering. These models were grounded and nuanced. I’d paste in the same kinds of messy conversations I tested MemoryLLM on, and they’d instantly parse the implied context, draw the right conclusions, and respond meaningfully.&lt;/p&gt;

&lt;p&gt;It drove the following point home:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We’re sitting on a goldmine that’s frozen in time. &lt;/p&gt;
&lt;/blockquote&gt;




&lt;h3&gt;
  
  
  So What Now?
&lt;/h3&gt;

&lt;p&gt;I still believe that &lt;strong&gt;embedding memory in the model itself&lt;/strong&gt; is the long-term path. It’s the only direction that could eventually support agents that learn, grow, and evolve the way humans do. But I now better understand why real-world teams still rely on RAG, vector DBs, and graph overlays: they’re accessible, composable, debuggable, and well understood. You can build something useful without re-architecting a transformer.&lt;/p&gt;

&lt;p&gt;Still, I wonder if we, as engineers, should begin crossing that boundary - learning to intervene not just at the API layer, but in the internals of open-weight models - bringing our engineering mindset to the table. I mean, how difficult can it be? 😉&lt;/p&gt;

&lt;p&gt;For now, I’m keeping one foot in each world: shipping practical tools with what’s available, and probing the frontier to see what might be possible.&lt;/p&gt;

&lt;p&gt;If I find anything that shifts the landscape, I’ll write a follow-up.&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;PS&lt;/strong&gt;: If you're working on any similar ideas, I’d love to hear from you. Let's compare scars.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>programming</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Why "10x Employees" Don’t Actually Do Their Jobs (And Why That’s a Good Thing)</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Fri, 04 Apr 2025 06:26:37 +0000</pubDate>
      <link>https://dev.to/isaachagoel/why-10x-employees-dont-actually-do-their-jobs-and-why-thats-a-good-thing-33op</link>
      <guid>https://dev.to/isaachagoel/why-10x-employees-dont-actually-do-their-jobs-and-why-thats-a-good-thing-33op</guid>
      <description>&lt;p&gt;&lt;strong&gt;The secret no one tells you:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;10x employees aren’t employees at all. They’re undercover agents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;In 20 years working with (and being) these outliers, I’ve learned one truth: &lt;strong&gt;10xers don’t do their job, they hijack it.&lt;/strong&gt; They infiltrate organizations through roles, play the part just enough to avoid HR alarms, then execute a silent coup against mediocrity. Let me explain.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Great Lie of Job Descriptions
&lt;/h3&gt;

&lt;p&gt;Your job description is a trap.&lt;/p&gt;

&lt;p&gt;Engineer? Your “success” means completing tickets, not solving problems.&lt;br&gt;&lt;br&gt;
Manager? Your worth is tied to hitting stale KPIs, not leading revolutions.&lt;br&gt;&lt;br&gt;
Play by these rules, and you’ll peak at “1.5x employee”, the slightly faster cog in a broken machine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10xers reject this.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
They treat their job description like a burner phone: use it to gain access, then toss it when it’s time to act.&lt;/p&gt;




&lt;h3&gt;
  
  
  Anatomy of a Corporate Rebel
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;A 10x employee:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Delivers &lt;strong&gt;10x the impact&lt;/strong&gt; (not output) of peers&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Operates like a founder, not a follower&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Rewrites rules instead of repeating them&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Case Study: The Rogue Cloud Manager&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
My first boss at Intel was given a playbook:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“Manage this tiny internal cloud team”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“Don’t rock the boat”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;He burned it.&lt;/p&gt;

&lt;p&gt;Instead:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Built a cross-region compute sharing system &lt;strong&gt;without approval&lt;/strong&gt; (initially)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Saved Intel &lt;strong&gt;$40M+&lt;/strong&gt; in wasted resources&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Turned our scrappy team into the &lt;strong&gt;global cloud authority&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;His secret? &lt;strong&gt;He ignored his “manager” title and acted like the CEO of Intel’s future.&lt;/strong&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Why Systems Hate 10xers
&lt;/h3&gt;

&lt;p&gt;Organizations are designed to &lt;strong&gt;neutralize threats&lt;/strong&gt;, even good ones.&lt;/p&gt;

&lt;p&gt;When you:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🚨 Fix problems you’re “not responsible for”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🚨 Ship POCs instead of begging for permission&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🚨 Ask “Why?” three times in exec meetings&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;…you trigger defenses. Bureaucracy flares up. Status quo defenders mobilize.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This isn’t a bug - it’s the system working as intended.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Most 10xers get contained or expelled. The survivors? They become &lt;strong&gt;agents of momentum&lt;/strong&gt;, cutting through inertia with energy, candor, and an allergy to complacency.&lt;/em&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  The 10x Playbook (Steal This)
&lt;/h3&gt;

&lt;p&gt;Want to matter more than your role allows?&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Find the delta&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;What’s the company’s stated mission vs. its actual daily work?&lt;/em&gt;&lt;br&gt;&lt;br&gt;
Example: If your bank claims “financial empowerment” but up-sells debt, fix THAT.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Break the Unwritten Rules&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
&lt;em&gt;Spot the silent killers: processes everyone hates, limits nobody questions, work that exists to feed other work.&lt;/em&gt;&lt;br&gt;&lt;br&gt;
Example: That “mandatory” weekly report nobody reads? The 8-layer approval chain for minor changes? Burn. Them. Down. (Careful playing with fire, though.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Operational Humility&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Ask like a novice, act like a surgeon&lt;/strong&gt;.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listen to frontline complaints (they’re goldmines for systemic flaws)&lt;/li&gt;
&lt;li&gt;Say “I don’t get it. Can you walk me through why this works?” to expose shaky logic&lt;/li&gt;
&lt;li&gt;Treat every critique as data, not dissent&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Key&lt;/strong&gt;: Your goal isn’t to be right. It’s to uncover what’s right.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Build first, apologize never&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
10xers know it’s easier to ship a working prototype than get a “maybe” in a roadmap meeting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Weaponize First Principles&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Next time someone says “We do it this way because…”, respond: &lt;em&gt;“Show me the math.”&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Recruit Allies, Not Followers&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Find other closet 10xers. Trade favours. Build parallel power structures.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;The 10xer’s Edge: Confidence Without Condescension&lt;/em&gt;&lt;br&gt;&lt;br&gt;
True 10xers are clarity ninjas, not know-it-alls. They ask “What if we…” instead of “You should…” because revolutions happen &lt;em&gt;with&lt;/em&gt; people, not &lt;em&gt;to&lt;/em&gt; them.&lt;/p&gt;




&lt;h3&gt;
  
  
  How to 10x Without Getting Fired (Mostly)
&lt;/h3&gt;

&lt;p&gt;Being a 10xer isn’t about martyrdom. Here’s how to bend the rules without snapping your career:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Pick Your Battles Like a Strategist&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Burn down &lt;em&gt;one&lt;/em&gt; pointless process at a time. Organizations tolerate a controlled burn; they extinguish wildfires.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bank Trust First&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Deliver a few wins &lt;em&gt;within&lt;/em&gt; the system early on. Once you’re the “person who fixed X,” you get leeway to break Y.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frame Your Rebellion as an Experiment&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
“I’m testing a hypothesis to save us 20% latency. Can I prototype this quietly?” sounds better than “Your API design is trash.”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Document Everything (CYA 101)&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When you bypass a rule, leave a breadcrumb trail:&lt;br&gt;&lt;br&gt;
&lt;em&gt;“Discussed with team on 4/2 – consensus to prioritize customer impact over process.”&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Know When to Fold&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If the system rejects your fix, document the dead end, then walk away. Save your energy for winnable wars.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Become a servant leader&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Your rebellion is a service, not an ego trip. The goal isn’t to be the hero - it’s to make &lt;em&gt;everyone&lt;/em&gt; better by aligning the system with what’s right.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Pro tip:&lt;/strong&gt; Let others claim your ideas as theirs. Silent 10xers care more about impact than attribution.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Price of 10x
&lt;/h3&gt;

&lt;p&gt;Choose your sacrifice:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option A:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;✅ Predictable promotions&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✅ Calm Fridays&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;✅ A LinkedIn bio that matches your obituary&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Option B:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;🔥 Late-night “aha!” moments&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🔥 Political grenades thrown your way&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;🔥 A legacy that outlives your tenure&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There’s no middle ground.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Final Test
&lt;/h3&gt;

&lt;p&gt;When you’re 65, rocking on some porch, which story do you want to tell?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;“I color-coded Jira tickets like a boss!”&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;em&gt;“I rewrote the rules and changed how we _____.”&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;10xers pick the second script, even if they get edited out of the corporate memoir.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;So... do you have what it takes?&lt;/strong&gt;  &lt;/p&gt;

</description>
      <category>career</category>
      <category>watercooler</category>
      <category>webdev</category>
      <category>productivity</category>
    </item>
    <item>
      <title>I’m a 20-Year Engineer – AI Coding Tools Are My New Oxygen (But They’re Toxic If You Breathe Too Deep)</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Sat, 29 Mar 2025 23:56:37 +0000</pubDate>
      <link>https://dev.to/isaachagoel/im-a-20-year-engineer-ai-coding-tools-are-my-new-oxygen-but-theyre-toxic-if-you-breathe-too-3h5n</link>
      <guid>https://dev.to/isaachagoel/im-a-20-year-engineer-ai-coding-tools-are-my-new-oxygen-but-theyre-toxic-if-you-breathe-too-3h5n</guid>
      <description>&lt;p&gt;&lt;em&gt;Let’s be clear: I'm totally hooked on coding with AI and not looking back. Here’s how to use it like a stimulant, not a crutch&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AI coding assistants aren’t just helpful – &lt;strong&gt;they’re a cognitive exoskeleton&lt;/strong&gt;. I code 3x faster with Cursor, ship prototypes in hours instead of days, and offload the work that makes my brain feel like overcooked ramen. But here’s the catch: &lt;em&gt;every line of AI code is a Faustian bargain&lt;/em&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why I’m Addicted (And You Should Be Too)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI isn’t “helpful” – it’s a force multiplier for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Killing yak shaves&lt;/strong&gt; (Bash scripts, config hell, regex, TypeScript puzzlers, SQL gymnastics)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Bypassing syntax paralysis&lt;/strong&gt; (How &lt;em&gt;do&lt;/em&gt; you format dates in Swift again? Why &lt;em&gt;wouldn't&lt;/em&gt; this API cooperate?)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Prototyping at methamphetamine speed&lt;/strong&gt; (Need a React table with client-side sorting? Generated in 8 seconds)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Last month, I prototyped a RAG tool for my AI agent with 87% AI-generated code (local vector DB, embeddings, ranking – the whole circus). &lt;strong&gt;2 hours&lt;/strong&gt; instead of 2 days. &lt;em&gt;Was it perfect?&lt;/em&gt; Hell no. But it let me start iterating &lt;strong&gt;immediately&lt;/strong&gt;. Now that we're shipping? I’m rewriting 87% manually – with AI as my WD-40 for stubborn code bolts.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Tech Debt Paradox
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI-generated code isn’t “bad” – it’s &lt;em&gt;strategically irresponsible&lt;/em&gt;.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I treat it like a high-interest credit card:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;SWIPE FREELY&lt;/strong&gt; for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Throwaway code (one-off scripts, spike solutions)&lt;/li&gt;
&lt;li&gt;Boilerplate (mocks, DTOs, CRUD skeletons)&lt;/li&gt;
&lt;li&gt;“I just need this to work once” moments (local DB migrations)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;PAY DOWN IMMEDIATELY&lt;/strong&gt; when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It touches business logic&lt;/li&gt;
&lt;li&gt;Performance/security/privacy matter (read: production)&lt;/li&gt;
&lt;li&gt;You’re past exploration and need precision&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Example:&lt;/em&gt; Last month Claude wrote 5 Jest mocks + 10 unit tests for an unfamiliar codebase. &lt;strong&gt;45 minutes saved. 2 hours of existential crisis avoided.&lt;/strong&gt; Tech debt? 15 minutes fixing incoherent tests. &lt;strong&gt;Net win!&lt;/strong&gt; 🎉&lt;/p&gt;




&lt;h3&gt;
  
  
  How to Inhale Without Choking
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;My rules for AI-as-breathing-apparatus:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI First, Human Always&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Generate → Review → &lt;em&gt;Rewrite/Question&lt;/em&gt;. I refactor AI code like it owes me money. Sometimes I reject 100% of its suggestions – but even its wrong answers spark better solutions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;The Complexity Alarm&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI loves Rube Goldberg solutions. Ask: &lt;em&gt;“Could a junior understand this sober at 3AM?”&lt;/em&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Architecture is Human-Only&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI sees files, not systems. Never let it decide service boundaries – it’s like letting a golden retriever plan city infrastructure.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Rubber Duck 2.0&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
“Explain tradeoffs between JWT session strategies” → &lt;em&gt;Then decide&lt;/em&gt;. Treat it like a Wikipedia article – useful overview, questionable details.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Soul-as-a-Service&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Let AI borrow your coding soul like a library book – but always check it back in. Late fees hurt.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  For Juniors: How to Level Up Without Getting Played
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI won’t make you a better coder – &lt;em&gt;you&lt;/em&gt; make you a better coder.&lt;/strong&gt; Here’s how to avoid becoming an AI puppet:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Practice Raw Coding Like It’s the Gym&lt;/strong&gt;
AI is your protein shake, not your workout. If 90% of your code is AI-generated, you’re building prompt engineering muscles (useful!) but &lt;strong&gt;atrophying actual coding skills&lt;/strong&gt; (disastrous).

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Mandatory manual reps:&lt;/strong&gt; Code one full feature/week from scratch. No autocomplete, no Copilot.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Read AI’s output line-by-line&lt;/strong&gt; like it’s your ex’s text messages – with skeptical intensity.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;You’re the Architect, AI’s the Hammer&lt;/strong&gt;
Treat AI like a power tool:

&lt;ul&gt;
&lt;li&gt;“Why did you use a HashMap here?” → Make it justify every choice&lt;/li&gt;
&lt;li&gt;“Show me 3 alternative approaches” → Then &lt;strong&gt;pick none of them&lt;/strong&gt; and write your own&lt;/li&gt;
&lt;li&gt;If you couldn’t explain the code to an intern, &lt;strong&gt;you don’t understand it&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Build Your BS Detector&lt;/strong&gt;
AI’s wrong answers are golden learning opportunities: 

&lt;ul&gt;
&lt;li&gt;When it suggests an overcomplicated design pattern, ask: &lt;em&gt;“What problem is this solving?”&lt;/em&gt;
&lt;/li&gt;
&lt;li&gt;When it makes a subtle mistake (timezone handling, API rate limits), &lt;strong&gt;dig into why&lt;/strong&gt; it failed&lt;/li&gt;
&lt;li&gt;Every AI error is a free lesson in critical thinking&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The “Could I Build This Blindfolded?” Rule&lt;/strong&gt;
Only use AI for:

&lt;ul&gt;
&lt;li&gt;Tasks you &lt;strong&gt;already understand&lt;/strong&gt; (boilerplate, mocks)&lt;/li&gt;
&lt;li&gt;Explorations you’ll &lt;strong&gt;immediately validate&lt;/strong&gt; (spike solutions)
If you’re using AI to hide from learning fundamentals, you’re just stacking debt.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Why Seniors Survive the AI Apocalypse
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Experience lets me:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Spot “confidently wrong” code&lt;/strong&gt; like a bloodhound (missed timezones, security holes, 🤮 code duplication)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Surgically extract value&lt;/strong&gt; (Keep AI’s 80% SQL JOIN, fix its 20% N+1 query disaster)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embrace strategic debt&lt;/strong&gt; (This script dies Friday anyway – let AI write its epitaph)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Final Take:&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
AI is coding’s cheat code – but only if you’ve already beaten the game. Use it to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Supercharge trivial work&lt;/strong&gt; (Your brain’s for hard problems)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Buy speed with intentional debt&lt;/strong&gt; (Like taking a loan to close a deal)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Avoid context-switching&lt;/strong&gt; (No more 15-tab Google spirals)&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But &lt;strong&gt;never forget:&lt;/strong&gt; AI code is radioactive. Handle it with lead gloves.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;I’m now 73% cyborg. Juniors – hit me with your AI horror stories. Seniors – share your power moves. Let’s code.&lt;/em&gt; 👇&lt;/p&gt;

</description>
      <category>programming</category>
      <category>career</category>
      <category>webdev</category>
      <category>ai</category>
    </item>
    <item>
      <title>Read This Before Building AI Agents: Lessons From The Trenches</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Sun, 23 Mar 2025 10:29:40 +0000</pubDate>
      <link>https://dev.to/isaachagoel/read-this-before-building-ai-agents-lessons-from-the-trenches-333i</link>
      <guid>https://dev.to/isaachagoel/read-this-before-building-ai-agents-lessons-from-the-trenches-333i</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Key Takeaways&lt;/strong&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🛠️ &lt;strong&gt;Hybridize&lt;/strong&gt;: Combine LLMs with traditional code for reliability and creativity.
&lt;/li&gt;
&lt;li&gt;🧩 &lt;strong&gt;Specialize&lt;/strong&gt;: Use multiple agents to avoid complexity thresholds.
&lt;/li&gt;
&lt;li&gt;📐 &lt;strong&gt;Structure&lt;/strong&gt;: Enforce outputs with Zod/Pydantic schemas to reduce hallucinations.
&lt;/li&gt;
&lt;li&gt;🔍 &lt;strong&gt;Agentic RAG&lt;/strong&gt;: Let agents control retrieval for dynamic workflows.
&lt;/li&gt;
&lt;li&gt;⚡ &lt;strong&gt;Optimize&lt;/strong&gt;: Balance token usage, speed, and quality with parallel tool calls.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;
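&lt;p&gt;The "Structure" takeaway above can be sketched with a minimal Pydantic (v2) example; the schema and field names here are made up for illustration. The point is that the model's raw JSON either parses into the schema or fails loudly, instead of leaking malformed output downstream.&lt;/p&gt;

```python
from pydantic import BaseModel

# Hypothetical schema for a product-recommendation agent's output.
class ProductPick(BaseModel):
    product_id: str
    reason: str
    confidence: float

class Recommendation(BaseModel):
    picks: list[ProductPick]

# Stand-in for the raw string an LLM would return.
raw = '{"picks": [{"product_id": "sku-42", "reason": "matches hiking history", "confidence": 0.87}]}'

# Validation happens here; bad or missing fields raise a ValidationError
# instead of silently propagating a hallucinated shape.
rec = Recommendation.model_validate_json(raw)
print(rec.picks[0].product_id)
```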




&lt;p&gt;Over the last few months, I've been diving deep into the world of AI agents. What started as side projects and general curiosity has evolved into actual work projects. This means I'm in the process of crossing over from hobbyist to pro (by definition, you're not a pro until you get paid to do whatever it is you're doing!) and from toy apps to ones with real users.&lt;/p&gt;

&lt;p&gt;I'm quite early in my journey and still have so much to learn, yet I'm surprised by how many challenges I've encountered despite reading blogs, watching videos, etc. There are insights that aren't widely shared yet, and this post aims to fill that gap.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Agent Building Journey &lt;em&gt;(A Brief Overview)&lt;/em&gt;
&lt;/h2&gt;

&lt;p&gt;For context, here's what I've worked on so far:&lt;/p&gt;

&lt;h3&gt;
  
  
  Toy Projects
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LinkedIn Job Finder&lt;/strong&gt;: Split tasks between Playwright scripts (link scraping) and agents (job rating). Initially tried one agent for everything, then realized specialized agents worked better.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;QA Automation&lt;/strong&gt;: Separated test planning (agent) from execution (code). One agent creates test plans by analyzing web pages; code then spawns a second agent to execute tests.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Custom Framework&lt;/strong&gt;: Built after exploring existing frameworks like CrewAI and finding their abstractions didn't match my needs. This exploration helped me discover which abstractions actually make sense for my use cases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Work Projects &lt;em&gt;(Limited Details for Confidentiality)&lt;/em&gt;
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Integration Into Pre-existing Codebases&lt;/strong&gt;: Several POCs integrating LLMs into existing apps.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product Analytics Assistant&lt;/strong&gt;: An internal tool leveraging agentic RAG and other tools, implemented from scratch. Now launching as a closed beta.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;An AI agent orchestrates multiple LLM calls to &lt;strong&gt;make decisions&lt;/strong&gt;, &lt;strong&gt;use tools&lt;/strong&gt;, and &lt;strong&gt;achieve goals&lt;/strong&gt;—beyond single prompts. It's code that wraps around LLM calls, allowing the AI to determine its own path toward a goal rather than just generating a response to a single prompt.&lt;/p&gt;

&lt;p&gt;With that said, in real-world applications, I've found that making single LLM calls for specific tasks is often quite practical. I prefer to think of these as 'single-step, tool-less agents' since this mental model is more useful than drawing an artificial distinction between agents and LLM calls. Most apps need a mix of both approaches.&lt;/p&gt;
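&lt;p&gt;The shape of that orchestration can be shown with a stripped-down loop. The "LLM" below is a stub and the tool registry is hypothetical, but the structure is the real one: the model decides whether to call a tool or answer, and plain code routes the decision.&lt;/p&gt;

```python
def fake_llm(messages):
    # Stand-in for a real LLM call: asks for the weather tool once,
    # then answers using the tool result it sees in the transcript.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "Sydney"}}
    return {"answer": "It is sunny in Sydney."}

# Hypothetical tool registry; real tools would hit APIs, databases, etc.
TOOLS = {"get_weather": lambda city: f"sunny in {city}"}

def run_agent(goal, max_steps=5):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        decision = fake_llm(messages)
        if "answer" in decision:
            return decision["answer"]
        # The agent chose a tool: execute it and feed the result back.
        result = TOOLS[decision["tool"]](**decision["args"])
        messages.append({"role": "tool", "content": result})
    raise RuntimeError("agent did not finish within max_steps")

print(run_agent("What is the weather in Sydney?"))
```

&lt;p&gt;A "single-step, tool-less agent" is just this loop with an empty tool registry and one iteration, which is why treating plain LLM calls as a degenerate case of agents is a useful mental model.&lt;/p&gt;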




&lt;h2&gt;
  
  
  Why and When Agents?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  When Agents Excel
&lt;/h3&gt;

&lt;p&gt;LLMs can do things that normal code simply can't - tasks for which there is no conventional algorithm:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generating creative content ("make this text child-friendly")&lt;/li&gt;
&lt;li&gt;Making subjective judgments with nuance (e.g., grading job postings based on fuzzy preferences)&lt;/li&gt;
&lt;li&gt;Extracting meaning from unstructured data (e.g., key takeaways from documents)&lt;/li&gt;
&lt;li&gt;Adaptive control flow - Instead of coding rigid "if" conditions for every situation, you provide guidelines and the LLM adapts&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;They also bring unique benefits when used to mimic traditional ML systems, like recommenders, because they can handle any text or images without predefined patterns and without needing to train on your data.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Golden Rule: Code When Possible
&lt;/h3&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Always ask&lt;/strong&gt;: &lt;em&gt;"Can I code this without losing functionality?"&lt;/em&gt;  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;✅ Traditional code for mechanical tasks (scraping, loops).
&lt;/li&gt;
&lt;li&gt;🤖 Agents for reasoning/adaptation.
&lt;/li&gt;
&lt;/ul&gt;
&lt;/blockquote&gt;

&lt;p&gt;Whenever you consider giving a task to an agent, first ask yourself: "Can I code this whole thing or some part of it without losing functionality?"&lt;/p&gt;

&lt;p&gt;If the answer is yes, code it and leave to the agent only what normal code cannot do. Traditional code is orders of magnitude more:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Performant&lt;/li&gt;
&lt;li&gt;Accurate&lt;/li&gt;
&lt;li&gt;Predictable &lt;/li&gt;
&lt;li&gt;Testable&lt;/li&gt;
&lt;li&gt;Cost-effective (no token charges)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I learned this lesson the hard way with my LinkedIn job finder. Initially, I asked an agent with browser capability to visit LinkedIn and collect job links. The performance was poor and the agent got confused by the virtual list within a scrolling container. Eventually, I replaced this with a simple Playwright script for link collection, making the system much faster, more accurate, and cheaper.&lt;/p&gt;

&lt;p&gt;The tradeoff? The Playwright approach might break if the page markup changes. But for mechanical tasks like data collection, web scraping, or file operations, traditional code is almost always superior.&lt;/p&gt;

&lt;p&gt;Similarly, for control flow, if you need to loop through documents, don't tell an agent to "for each document in this list do X." Instead, use a normal loop and spawn an agent for each document (potentially in parallel for efficiency).&lt;/p&gt;
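&lt;p&gt;As a minimal sketch of this pattern (the &lt;code&gt;summarizeDocument&lt;/code&gt; function below is a hypothetical stand-in for whatever your framework's agent invocation looks like):&lt;/p&gt;

```typescript
// Hypothetical stand-in for a real agent invocation (e.g. agent.run(doc)).
async function summarizeDocument(doc: string): Promise<string> {
  return `summary of: ${doc.slice(0, 20)}`;
}

// The control flow is ordinary code: a plain map, not an agent instruction.
// Each document gets its own focused agent run, executed in parallel.
async function summarizeAll(documents: string[]): Promise<string[]> {
  return Promise.all(documents.map((doc) => summarizeDocument(doc)));
}
```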

&lt;h3&gt;
  
  
  Hybrid Is Best
&lt;/h3&gt;

&lt;p&gt;The most powerful agentic applications combine LLMs with conventional code in a synergistic way:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional code&lt;/strong&gt; handles deterministic tasks, data processing, and integration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM agents&lt;/strong&gt; handle understanding, reasoning, creativity, and adaptation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Our Hypothetical Example: A Marketing Email System
&lt;/h2&gt;

&lt;p&gt;Throughout this post, I'll use a hypothetical marketing email platform as an example. This system creates personalized product recommendations and illustrates many key patterns.&lt;/p&gt;

&lt;p&gt;The architecture consists of four specialized agents:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Data Collector Agent&lt;/strong&gt; - Gathers customer information from databases and public sources&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Product Selector Agent&lt;/strong&gt; - Analyzes customer data to recommend relevant products&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Writer Agent&lt;/strong&gt; - Creates personalized email content using brand templates&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reviewer/Editor Agent&lt;/strong&gt; - Ensures quality control and requests revisions&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This structure demonstrates how agents can collaborate while maintaining focused responsibilities, which brings us to our first critical insight.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Note on Code Examples&lt;/strong&gt;: All code examples in this post are highly simplified for clarity and illustrative purposes. They're meant to convey concepts rather than provide production-ready implementations.&lt;/p&gt;
&lt;/blockquote&gt;
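&lt;p&gt;In that simplified spirit, the four-agent flow might be wired together like this; every function here is an invented stub standing in for a real agent call:&lt;/p&gt;

```typescript
// Invented stubs -- in a real system each of these would invoke an agent.
async function collectData(customerId: string) {
  return { customerId, industry: "retail" };
}
async function selectProducts(profile: { industry: string }) {
  return [{ productId: "p-1", name: "Analytics Suite" }];
}
async function writeEmail(profile: { industry: string }, products: { name: string }[]) {
  return `Hi! Since you work in ${profile.industry}, you might like ${products[0].name}.`;
}
async function reviewEmail(draft: string) {
  return { approved: draft.length > 0, feedback: "" };
}

// The orchestration itself is plain code: each agent has one focused job,
// and the reviewer can bounce a draft back for revision.
async function runCampaign(customerId: string): Promise<string> {
  const profile = await collectData(customerId);
  const products = await selectProducts(profile);
  let draft = await writeEmail(profile, products);
  for (let attempt = 0; attempt < 3; attempt++) {
    const review = await reviewEmail(draft);
    if (review.approved) return draft;
    draft = await writeEmail(profile, products); // revise (feedback omitted here)
  }
  throw new Error("Email failed review after 3 attempts");
}
```

&lt;p&gt;Keeping the orchestration in plain code also makes the reviewer's retry loop trivial to test and bound.&lt;/p&gt;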




&lt;h2&gt;
  
  
  Critical Insights for Building Effective Agents
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Respect Complexity Thresholds
&lt;/h3&gt;

&lt;p&gt;Every model fails past a certain complexity threshold. When you cross this threshold, the model struggles to follow instructions and hallucinations increase exponentially.&lt;/p&gt;

&lt;p&gt;When developers ask "Why do I need multiple agents?" I know they haven't built real agent systems yet. Once you hit a complexity threshold, you have three options:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reduce requirements (simplify the task)&lt;/li&gt;
&lt;li&gt;Upgrade to a better model (usually more expensive, and the ceiling is whatever the best available model can do)&lt;/li&gt;
&lt;li&gt;Split the task across specialized agents&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;It's not unlike people - there's a limit to how many instructions a person can follow effectively and how many tools they can wield.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example from our marketing system:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In our marketing email system, a single agent trying to handle data collection, product selection, writing, and reviewing would struggle. Breaking this into specialized agents creates much more reliable results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before&lt;/strong&gt; (Single Agent):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Prone to inconsistency and hallucinations&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;marketingAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Marketing Email Agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Handle collecting customer data, selecting products, writing emails, 
                and reviewing content quality, maintaining brand voice...`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;customerDataTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;linkedinTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;interactionHistoryTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
          &lt;span class="nx"&gt;productCatalogTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;templateTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;emailSender&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;After&lt;/strong&gt; (Specialized Agents):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Data collection is now focused and structured&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;dataCollectorAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Data Collector Agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Gather relevant customer data from internal systems and public sources.
                Focus on professional background, interests, and past interactions...`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;customerDbTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;linkedinTool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;interactionHistoryTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Additional specialized agents would follow a similar pattern&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach succeeds because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Each agent excels at a narrower, well-defined task&lt;/li&gt;
&lt;li&gt;Prompts can be shorter and more focused&lt;/li&gt;
&lt;li&gt;Error recovery is simpler (one malfunctioning agent doesn't derail everything)&lt;/li&gt;
&lt;li&gt;Each step can be optimized, tested, and reused independently&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Consider using multiple specialized agents when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task can be naturally broken into subtasks&lt;/li&gt;
&lt;li&gt;Different tasks require different types of reasoning&lt;/li&gt;
&lt;li&gt;The prompt would otherwise become unwieldy&lt;/li&gt;
&lt;li&gt;You need to maintain different states for different parts of the process&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Structured Outputs Are Non-Negotiable
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://platform.openai.com/docs/guides/structured-outputs" rel="noopener noreferrer"&gt;Structured outputs&lt;/a&gt; (JSON schemas) completely transform agent development. This feature allows you to specify an exact output format and guarantees the model returns data as specified.&lt;/p&gt;

&lt;p&gt;The benefits are both obvious and subtle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Expected benefit&lt;/strong&gt;: No more begging the model for correct formatting or retrying on malformed outputs.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unexpected benefit&lt;/strong&gt;: Schemas can force the model to follow specific reasoning patterns and make better decisions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: Product Selector Agent Schema&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productRecommendationSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;customerSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Brief summary of customer needs based on data&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
  &lt;span class="na"&gt;recommendedProducts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;productName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;relevanceScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;How relevant for this customer (1-10)&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;justification&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Detailed reasoning for why this product matches customer needs&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="na"&gt;sellingPoints&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Key points to emphasize in marketing&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})),&lt;/span&gt;
  &lt;span class="na"&gt;fallbackRecommendations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;productName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;category&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;reasonForInclusion&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Why this is included as a fallback option&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;})).&lt;/span&gt;&lt;span class="nf"&gt;optional&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Secondary recommendations if primary ones don't resonate&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This schema does more than format data—it forces the agent to think deeply about product selection:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The &lt;code&gt;justification&lt;/code&gt; field requires detailed reasoning for each recommendation&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;relevanceScore&lt;/code&gt; field forces ranking and prioritization&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;sellingPoints&lt;/code&gt; array ensures usable content for the Writer Agent&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;By requiring this structured approach, we get more thoughtful, consistent recommendations rather than superficial matches. The model must actually think through its choices, &lt;em&gt;and&lt;/em&gt; the result is trivial to &lt;code&gt;JSON.parse&lt;/code&gt; and process further with code (or other agents).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaway&lt;/strong&gt;: 📐&lt;em&gt;Structured outputs make calling an LLM or agent akin to calling any other remote API.&lt;/em&gt;&lt;/p&gt;
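&lt;p&gt;As a minimal illustration (the response payload below is invented), consuming a structured output really is just JSON parsing plus ordinary typed code:&lt;/p&gt;

```typescript
interface Recommendation {
  productId: string;
  productName: string;
  relevanceScore: number;
}

// Invented example payload -- with structured outputs enabled, the model is
// guaranteed to return JSON conforming to the schema you supplied.
const rawResponse =
  '{"customerSummary":"Growing retail chain","recommendedProducts":' +
  '[{"productId":"p-42","productName":"Analytics Suite","relevanceScore":9},' +
  '{"productId":"p-7","productName":"CRM Add-on","relevanceScore":6}]}';

// From here on it is no different from consuming any other remote API.
const parsed = JSON.parse(rawResponse) as {
  customerSummary: string;
  recommendedProducts: Recommendation[];
};

// Ordinary code can rank, filter, or forward the result to the next agent.
const topPick = [...parsed.recommendedProducts].sort(
  (a, b) => b.relevanceScore - a.relevanceScore
)[0];
```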

&lt;h3&gt;
  
  
  3. Language Choices: TypeScript vs. Python
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;TypeScript&lt;/th&gt;
&lt;th&gt;Python&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Static Type System&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ First-class, enforced at compile time&lt;/td&gt;
&lt;td&gt;🟡 Optional type hints, not enforced&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Runtime Validation&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Zod&lt;/td&gt;
&lt;td&gt;✅ Pydantic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Async Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Native&lt;/td&gt;
&lt;td&gt;🟡 Requires asyncio&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;ML Libraries&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;🟡 Growing&lt;/td&gt;
&lt;td&gt;✅ Dominant&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;JSON Handling&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Native&lt;/td&gt;
&lt;td&gt;🟡 Requires json module&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Developer Tools&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ Excellent&lt;/td&gt;
&lt;td&gt;✅ Good&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Package Management&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅ npm/yarn with package.json and lock files&lt;/td&gt;
&lt;td&gt;🟡 pip with requirements.txt (no lock by default) or Poetry/Pipenv (with locks)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Both languages are excellent choices for agent development, with different strengths:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python&lt;/strong&gt; has established itself as the primary language in the AI/ML ecosystem:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Rich ecosystem of ML and AI libraries with first-class support&lt;/li&gt;
&lt;li&gt;Most agent frameworks and tutorials are Python-first&lt;/li&gt;
&lt;li&gt;Excellent data processing capabilities&lt;/li&gt;
&lt;li&gt;Familiar to data scientists and ML practitioners&lt;/li&gt;
&lt;li&gt;Strong support through libraries like Pydantic for structured outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;TypeScript&lt;/strong&gt; offers compelling advantages for software engineers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Static typing helps prevent runtime errors and enables better tooling&lt;/li&gt;
&lt;li&gt;First-class support for structured outputs via Zod integration&lt;/li&gt;
&lt;li&gt;Native JSON handling simplifies working with API responses&lt;/li&gt;
&lt;li&gt;Robust async/await pattern for managing concurrent operations&lt;/li&gt;
&lt;li&gt;Unified language for both frontend and backend development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My personal preference leans toward TypeScript because:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Type safety with runtime validation using Zod&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productRecommendationSchema&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;customerSummary&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
  &lt;span class="na"&gt;recommendedProducts&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;array&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;productId&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;relevanceScore&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;number&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;a number between one 1 and 10),
    justification: z.string()
  }))
});

// TypeScript automatically infers the correct type
type ProductRecommendation = z.infer&amp;lt;typeof productRecommendationSchema&amp;gt;;

// You get autocompletion and type checking when working with validated data
function processRecommendation(rec: ProductRecommendation) {
  // Access properties with confidence - TypeScript knows the structure
  const topProduct = rec.recommendedProducts.sort((a, b) =&amp;gt; 
    b.relevanceScore - a.relevanceScore
  )[0];

  return `Top recommendation: ${topProduct.justification}`;
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This combination of static typing, runtime validation, and excellent tooling significantly improves developer confidence when working with complex agent systems.&lt;/p&gt;

&lt;p&gt;Choose based on your team's expertise and specific requirements rather than following any single recommendation.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Prompt Engineering Is Real Engineering
&lt;/h3&gt;

&lt;p&gt;Most developers who haven't built agent systems joke about prompt engineering not being "real" engineering. This misconception disappears quickly once you try to build production-grade agents.&lt;/p&gt;

&lt;p&gt;Unlike casual ChatGPT conversations, where you can refine through back-and-forth clarification, production agents need carefully crafted prompts that handle diverse situations without human intervention. As your apps get more ambitious, your prompts balloon to a length that dwarfs the actual user query.&lt;/p&gt;

&lt;p&gt;Your prompt must include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain terminology definitions&lt;/li&gt;
&lt;li&gt;Tool usage guidelines&lt;/li&gt;
&lt;li&gt;Strategies you want the agent to follow&lt;/li&gt;
&lt;li&gt;Output format requirements&lt;/li&gt;
&lt;li&gt;Error handling instructions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example: Writer Agent Prompt (Partial)&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;### WRITER AGENT INSTRUCTIONS ###

Your task is to create highly personalized marketing emails that convert. You will be provided with:
1. Customer profile data
2. Product recommendations with relevance scores and selling points
3. Brand voice guidelines and templates

## BRAND VOICE RULES:
- Friendly but professional, never pushy
- Avoid hyperbole ("best ever", "amazing") in favor of specific benefits
- Use active voice and concise sentences
- Address the customer by name at least once, but no more than twice
- Each paragraph should be 2-3 sentences maximum for readability

## EMAIL STRUCTURE REQUIREMENTS:
- Subject line: Clear value proposition, 30-60 characters, no exclamation points
- Opening: Acknowledge a specific detail from customer data to establish relevance
- Body: Focus on 1-2 top products only, emphasizing only the 3 most relevant selling points
- Call to action: ONE clear next step, using benefit-focused language
- Signature: Include personalized note if customer has history with specific representative

## PROCESS STEPS:
1. Review customer data completely before writing anything
2. Select template that best matches product category
3. Customize template with specific customer details and product benefits
4. Review against brand voice rules
5. If customer is enterprise-level (&amp;gt;250 employees), emphasize ROI and strategic benefits
6. If customer is SMB (&amp;lt;250 employees), emphasize ease of implementation and quick wins

## CRITICAL GUIDELINES:
- NEVER mention pricing unless specifically included in the product recommendation
- ALWAYS check that product names are correctly used (exact spelling and capitalization)
- If customer has previously purchased from us, acknowledge this with gratitude
- NEVER exceed 200 words total for the email body
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;My advice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Be explicit and detailed&lt;/strong&gt; - spell out everything, don't assume the model knows your preferences&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate through testing&lt;/strong&gt; - refine based on agent behavior across diverse queries&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Structure prompts logically&lt;/strong&gt; - separate sections for terminology, process, examples, etc.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Include diverse examples&lt;/strong&gt; - covering edge cases and common scenarios&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This careful crafting of prompts takes significant time and iteration—real engineering work that directly affects system performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Context Window as a Whiteboard
&lt;/h3&gt;

&lt;p&gt;Most people don't realize that LLMs are stateless. Each call to an LLM is entirely new—the model has no memory of previous interactions. The illusion of continuous conversation in tools like ChatGPT comes from including the entire conversation history with each new request.&lt;/p&gt;

&lt;p&gt;This has major implications for agent development:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The context window&lt;/strong&gt; is the maximum text the model can "see" at once (tokens). Modern models have large windows (128K-1M tokens), but you face several challenges:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Statelessness&lt;/strong&gt;: Each time you call an LLM, you must resend the entire history, not just the latest message&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance overhead&lt;/strong&gt;: Larger context = slower processing and higher costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token-per-minute (TPM) limits&lt;/strong&gt;: API rate limits often restrict how much text you can send per minute&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;For example, OpenAI's 30,000 tokens-per-minute limit for tier 1 customers means you'll never utilize more than ~23% of a 128K-token context window, even if you only make one request per minute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strategies for token management:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🧼 &lt;strong&gt;Prune&lt;/strong&gt;: Remove irrelevant history.&lt;/li&gt;
&lt;li&gt;🎯 &lt;strong&gt;Focus&lt;/strong&gt;: Only include critical tool outputs.&lt;/li&gt;
&lt;li&gt;🔀 &lt;strong&gt;Parallelize&lt;/strong&gt;: Batch tool calls to reduce roundtrips.
&lt;/li&gt;
&lt;/ul&gt;
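&lt;p&gt;Pruning, for instance, can be as simple as keeping the system prompt plus the most recent messages that fit a token budget. The 4-characters-per-token estimate below is a rough heuristic, not a real tokenizer:&lt;/p&gt;

```typescript
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

// Crude estimate (~4 characters per token); a real app would use a proper
// tokenizer. Good enough to illustrate budget-based pruning.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Keep the system prompt, then add the newest messages until the budget runs out.
function pruneHistory(messages: Message[], tokenBudget: number): Message[] {
  const [system, ...rest] = messages;
  const kept: Message[] = [];
  let used = estimateTokens(system.content);
  for (const msg of [...rest].reverse()) {
    const cost = estimateTokens(msg.content);
    if (used + cost > tokenBudget) break;
    used += cost;
    kept.unshift(msg); // restore chronological order
  }
  return [system, ...kept];
}
```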

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Optimizing token usage with parallel tool calls&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;researchCustomer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Request multiple tool calls in one go (in reality - the agent will provide this array if instructed to do so)&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolCalls&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fetchCustomerData&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;customerId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;getInteractionHistory&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;customerId&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;analyzeIndustryTrends&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="na"&gt;args&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="na"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;retail&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;];&lt;/span&gt;

  &lt;span class="c1"&gt;// Execute all tool calls in parallel&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;toolResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;all&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;toolCalls&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; 
    &lt;span class="nf"&gt;executeToolCall&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;tool&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;call&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;args&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;));&lt;/span&gt;

  &lt;span class="c1"&gt;// Send all results to agent at once, reducing round-trips&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;customerAnalysisAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Analyze customer for product recommendations&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;toolResults&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach significantly reduces the number of back-and-forth exchanges, improving performance and latency while potentially making more efficient use of your tokens-per-minute (TPM) quota by bundling multiple operations into fewer API calls.&lt;/p&gt;

&lt;h3&gt;
  
  
  6. Advanced RAG: Beyond Basic Retrieval
&lt;/h3&gt;

&lt;p&gt;Retrieval-Augmented Generation (RAG) has become a cornerstone technique for agents with access to external knowledge. However, there's a significant gap between basic implementations and truly effective RAG systems.&lt;/p&gt;

&lt;h4&gt;
  
  
  Traditional RAG vs. Agentic RAG
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Traditional RAG&lt;/th&gt;
&lt;th&gt;Agentic RAG&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Control&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Code-driven&lt;/td&gt;
&lt;td&gt;Agent-driven&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Flexibility&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Static queries&lt;/td&gt;
&lt;td&gt;Dynamic, multi-step retrieval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Use Case&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;Predictable needs&lt;/td&gt;
&lt;td&gt;Exploratory tasks&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When I first started with RAG, I held two misconceptions:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embeddings are just fancy keyword matching&lt;/strong&gt; - I thought embeddings were simple hash functions for basic text matching. In reality, they capture complex semantic relationships between concepts.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Just stuff everything in the context window&lt;/strong&gt; - I believed that if my knowledge base could fit in the context window, I should include everything. This degrades performance by forcing the model to filter signal from noise.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
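&lt;p&gt;To make the first point concrete, here is a toy sketch. The three-dimensional vectors below are invented purely for illustration (real embeddings come from an embedding model and have hundreds or thousands of dimensions), but they show how cosine similarity captures semantic closeness where keyword matching would fail:&lt;/p&gt;

```typescript
// Toy illustration of misconception #1. Real embeddings come from an
// embedding model and have hundreds or thousands of dimensions; the
// 3-dimensional vectors below are invented purely for this example.
function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

function cosineSimilarity(a: number[], b: number[]): number {
  return dot(a, b) / Math.sqrt(dot(a, a) * dot(b, b));
}

// "invoice" and "bill" share no characters, yet their (made-up) vectors
// point in nearly the same direction; "duck" points elsewhere.
const invoice = [0.9, 0.1, 0.05];
const bill = [0.85, 0.15, 0.1];
const duck = [0.05, 0.2, 0.95];

console.log(cosineSimilarity(invoice, bill)); // high (close to 1)
console.log(cosineSimilarity(invoice, duck)); // low (close to 0)
```

&lt;p&gt;Here "invoice" and "bill" score as near neighbors despite sharing no characters, which is exactly the property that simple keyword or hash-based matching cannot capture.&lt;/p&gt;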

&lt;p&gt;&lt;strong&gt;Traditional RAG&lt;/strong&gt; is implemented like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;  &lt;span class="c1"&gt;// 1. Code creates search query from customer data&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;searchQuery&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;customerData&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;industry&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt; email templates for &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;productRecommendations&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nx"&gt;category&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="c1"&gt;// 2. Search for relevant email templates&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;searchResults&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;searchMarketingEmails&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;searchQuery&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="c1"&gt;// 3. Add results to the agent's prompt&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;writerAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
&lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Create personalized emails based on customer data and product recommendations.

                Here are some successful examples to draw inspiration from:
                &lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nf"&gt;formatSearchResults&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;searchResults&lt;/span&gt;&lt;span class="p"&gt;)}&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="na"&gt;outputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;emailSchema&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This works when retrieval needs are predictable. But it has limitations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The retrieval happens once, before the agent starts working&lt;/li&gt;
&lt;li&gt;The search query is predetermined by your code&lt;/li&gt;
&lt;li&gt;The agent can't request more information as it discovers new directions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Agentic RAG&lt;/strong&gt; addresses these limitations by giving the search capability directly to the agent:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Agent with search tool&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;writerAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Writer Agent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Create personalized marketing emails based on customer data and product recommendations.
                Use the emailSearch tool to find inspiration from successful past campaigns.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;MarketingTools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;emailSearch&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="c1"&gt;// Agent can search whenever it wants&lt;/span&gt;
  &lt;span class="na"&gt;outputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;emailSchema&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This allows the agent to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make multiple searches with different queries as understanding evolves&lt;/li&gt;
&lt;li&gt;Refine searches based on intermediate results&lt;/li&gt;
&lt;li&gt;Search for different aspects (industry language, effective CTAs, etc.)&lt;/li&gt;
&lt;li&gt;Decide when enough information has been gathered&lt;/li&gt;
&lt;/ol&gt;

&lt;h4&gt;
  
  
  Hybrid Analytics: RAG + SQL
&lt;/h4&gt;

&lt;p&gt;Even agentic RAG has limitations. Vector similarity can't:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Aggregate information across documents&lt;/li&gt;
&lt;li&gt;Detect patterns and trends&lt;/li&gt;
&lt;li&gt;Answer questions requiring numerical analysis&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To overcome these limitations, I combine RAG with analytical tools—particularly SQL access to structured versions of the same data:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// SQL tool for product performance analytics&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;sqlQueryTool&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;createTool&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;runSqlQuery&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Run SQL queries against marketing performance database&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;argsObject&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;object&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="na"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;z&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;string&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;describe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;SQL query to execute&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="p"&gt;}),&lt;/span&gt;
  &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt; &lt;span class="nx"&gt;query&lt;/span&gt; &lt;span class="p"&gt;})&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="c1"&gt;// Safety checks would happen here&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;executeQueryAgainstDatabase&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;query&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="c1"&gt;// Agent with both RAG and SQL capabilities&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;productSelectorAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Analyze customer data to recommend products.
                Use SQL for trend analysis across segments.
                Use productSearch for detailed product information.
                Combine insights from both for optimal recommendations.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;MarketingTools&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;productSearch&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sqlQueryTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;outputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;productRecommendationSchema&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With this combination, the agent might:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Use SQL to analyze which product categories perform best for healthcare companies with 100-500 employees:
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;   &lt;span class="k"&gt;SELECT&lt;/span&gt; 
     &lt;span class="n"&gt;product_category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="k"&gt;AVG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;conversion_rate&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;avg_conversion&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
     &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;purchase_count&lt;/span&gt;
   &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;purchase_history&lt;/span&gt;
   &lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;customer_industry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'Healthcare'&lt;/span&gt; 
     &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;customer_size&lt;/span&gt; &lt;span class="k"&gt;BETWEEN&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="mi"&gt;500&lt;/span&gt;
   &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;product_category&lt;/span&gt;
   &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;avg_conversion&lt;/span&gt; &lt;span class="k"&gt;DESC&lt;/span&gt;
   &lt;span class="k"&gt;LIMIT&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;Then use RAG to find specific products within those categories that match the customer's unique needs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This hybrid approach is particularly valuable for large product catalogs where browsing everything would be impractical.&lt;/p&gt;

&lt;h3&gt;
  
  
  7. The Four Control Flow Patterns in Agentic Applications
&lt;/h3&gt;

&lt;p&gt;After building several agent systems, I've discovered four distinct control flow patterns, each with different implications:&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Code → Code (Traditional Programming)
&lt;/h4&gt;

&lt;p&gt;Standard function calls with predetermined inputs and outputs. Predictable, testable, efficient, but lacks adaptability.&lt;/p&gt;
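&lt;p&gt;For symmetry with the other patterns, here is a minimal sketch of this one (the function and type names are hypothetical, not part of the marketing system above):&lt;/p&gt;

```typescript
// Pattern 1 sketch: plain deterministic code calling code.
// All names here are hypothetical; the point is that inputs, outputs,
// and control flow are fully predetermined.
interface Customer {
  id: string;
  industry: string;
  employeeCount: number;
}

function segmentCustomer(customer: Customer): string {
  // Deterministic rules: the same input always yields the same segment.
  if (customer.employeeCount >= 500) return "enterprise";
  if (customer.employeeCount >= 100) return "mid-market";
  return "smb";
}

function buildCampaignKey(customer: Customer): string {
  // Ordinary composition of function calls: predictable and testable.
  return `${customer.industry.toLowerCase()}:${segmentCustomer(customer)}`;
}

console.log(buildCampaignKey({ id: "c1", industry: "Healthcare", employeeCount: 250 }));
// "healthcare:mid-market"
```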

&lt;h4&gt;
  
  
  2. Code → Agent (Outsourcing Decisions)
&lt;/h4&gt;

&lt;p&gt;Code invokes an agent, temporarily handing over control. The agent performs multiple reasoning steps before returning.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Code calling an agent for a specific decision&lt;/span&gt;
&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;generateMarketingCampaign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;targetAudience&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Control passes to the agent until it returns&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;emailTemplate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;marketingAgent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
    &lt;span class="nx"&gt;targetAudience&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nx"&gt;products&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;goal&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Generate a personalized marketing email&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;
  &lt;span class="p"&gt;});&lt;/span&gt;

  &lt;span class="c1"&gt;// Control returns to our code&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;emailTemplate&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="na"&gt;timestamp&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="na"&gt;audience&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;targetAudience&lt;/span&gt;
  &lt;span class="p"&gt;};&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern works well for discrete tasks where the agent makes complex decisions but doesn't need ongoing dialogue. It's like calling a specialized API.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Agent → Code (Tool Use)
&lt;/h4&gt;

&lt;p&gt;An agent controls the flow and calls code functions (tools) as needed. The agent decides which tools to use and how to interpret results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Agent using tools&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;researchAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Analyze customer data and recommend products.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;fetchCustomerData&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Retrieve customer purchase history&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Regular code fetching from database&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nx"&gt;database&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;customers&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;findById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;customerId&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;analyzeSpendingPatterns&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Analyze spending patterns by category&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;async &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;purchaseHistory&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="c1"&gt;// Regular code performing analysis&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;calculateSpendingBreakdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;purchaseHistory&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
      &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;outputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;recommendationSchema&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This pattern enables the agent to leverage capabilities beyond its training data. The agent remains in control but can delegate specific tasks to conventional code.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. Agent → Agent (Delegation &amp;amp; Orchestration)
&lt;/h4&gt;

&lt;p&gt;One agent delegates sub-tasks to other agents, creating complex feedback loops and agent hierarchies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// Example: Reviewer agent with delegation to writer&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;reviewerAgent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
  &lt;span class="na"&gt;instructions&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;`Review marketing emails for quality and brand consistency.
                 If changes are needed, use writerAgent to request revisions.`&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="na"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nx"&gt;complianceCheckerTool&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;delegates&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="p"&gt;{&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;writerAgent&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Revises emails based on feedback&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
      &lt;span class="na"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;writerAgent&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
  &lt;span class="p"&gt;],&lt;/span&gt;
  &lt;span class="na"&gt;outputSchema&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nx"&gt;reviewSchema&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This enables iterative refinement with agents working together. In our marketing system, the Reviewer can identify issues and delegate revisions to the Writer, potentially going through several rounds before approval.&lt;/p&gt;

&lt;p&gt;An advanced version is the "orchestrator" pattern, where a high-level agent coordinates multiple specialized agents. While powerful, I recommend using this pattern sparingly as each delegation level increases complexity.&lt;/p&gt;

&lt;p&gt;The right control flow depends on your specific needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code-to-code&lt;/strong&gt; for deterministic logic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code-to-agent&lt;/strong&gt; for discrete decisions or operations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-to-code&lt;/strong&gt; for flexible execution with tools&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent-to-agent&lt;/strong&gt; for creative processes requiring feedback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, sophisticated applications often combine these patterns strategically.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion &amp;amp; Next Steps
&lt;/h2&gt;

&lt;p&gt;After several months of building AI agents, I've come to appreciate both their transformative potential and the practical challenges they present. The insights shared in this post represent hard-won lessons that have dramatically improved the quality, reliability, and performance of my agent-based systems.&lt;/p&gt;

&lt;p&gt;Building effective agents isn't just about having access to powerful LLMs - it's about thoughtful architecture, careful prompt design, and strategic combination of AI with traditional software engineering principles. The most successful agentic applications aren't those that rely solely on the intelligence of the models, but those that create synergistic systems where conventional code and AI complement each other's strengths.&lt;/p&gt;

&lt;p&gt;Key takeaways for anyone embarking on their agent-building journey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Respect complexity thresholds&lt;/strong&gt; - Use multiple specialized agents rather than one that tries to do everything&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Leverage structured outputs&lt;/strong&gt; - They transform reliability and enable sophisticated reasoning patterns&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Design thoughtful tool ecosystems&lt;/strong&gt; - Simple, composable tools enable flexible agent workflows&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Invest time in prompt engineering&lt;/strong&gt; - The quality of your prompts directly impacts agent performance&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Balance speed vs. quality&lt;/strong&gt; - Understand the tradeoffs and optimize for your specific use case&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Master RAG techniques&lt;/strong&gt; - Move beyond basic retrieval to agentic RAG and hybrid analytical approaches&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Choose control flow patterns wisely&lt;/strong&gt; - Match patterns to your application's needs and complexity level&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The field of AI agents is evolving rapidly, and we're still in the early days of understanding best practices. What's clear is that building effective agents isn't just about the AI models - it's about the entire system architecture and how we combine AI capabilities with traditional software engineering.&lt;/p&gt;

&lt;p&gt;As LLMs continue to advance, I anticipate that the line between conventional code and agent-based systems will blur further. The distinctions between the four control flow patterns may become less pronounced as we develop new paradigms that seamlessly integrate deterministic and AI-driven components.&lt;/p&gt;

&lt;p&gt;For developers approaching this space, I encourage you to start small, focus on well-defined problems, and iterate rapidly. The most valuable insights come from deploying systems that tackle real problems and observing how they perform in practice.&lt;/p&gt;

&lt;p&gt;I hope the lessons shared in this post help you avoid some of the pitfalls I encountered and accelerate your journey toward building powerful, reliable agent-based applications. The road ahead is exciting, and I look forward to seeing how collectively we'll push the boundaries of what's possible with AI agents.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This post is a living document. As I learn more, I'll update it with new insights or write a follow-up. Have tips to share? Let's collaborate!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>programming</category>
      <category>ai</category>
      <category>typescript</category>
      <category>python</category>
    </item>
    <item>
      <title>LLMs Are Not Mere Tools - They’re Artifacts Pointing to a New Theory of Intelligence</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Mon, 03 Feb 2025 11:27:04 +0000</pubDate>
      <link>https://dev.to/isaachagoel/llms-are-not-mere-tools-theyre-artifacts-pointing-to-a-new-theory-of-intelligence-53ph</link>
      <guid>https://dev.to/isaachagoel/llms-are-not-mere-tools-theyre-artifacts-pointing-to-a-new-theory-of-intelligence-53ph</guid>
      <description>&lt;p&gt;Let’s start with a heresy: We’ve been thinking about large language models backward.  &lt;/p&gt;

&lt;p&gt;We treat them as hammers/tools to generate code, write emails, or summarize meetings. But this framing is like using the Apollo lunar module as a paperweight. It’s not just reductive; it blinds us to the unnerving truth simmering beneath the surface.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLMs are not tools. They’re discoveries - empirical evidence of intelligences we don’t yet understand.&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;We didn’t invent LLMs in the way we invent a new tool. We stumbled upon them, "alien" artifacts buried in data, revealing something fundamental about intelligence itself, a glimpse into how minds emerge anywhere there’s complex enough structure.  &lt;/p&gt;




&lt;h2&gt;
  
  
  The Stochastic Parrot Myth (And Why It’s Dead Wrong)
&lt;/h2&gt;

&lt;p&gt;Critics dismiss LLMs as “&lt;a href="https://en.wikipedia.org/wiki/Stochastic_parrot" rel="noopener noreferrer"&gt;stochastic parrots&lt;/a&gt;” or “auto-complete on steroids” - glib labels that collapse under scrutiny. Let’s dissect why.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Understanding vs. Parroting
&lt;/h3&gt;

&lt;p&gt;Take &lt;a href="https://medium.com/vectrix-ai/can-ai-really-understand-language-or-is-it-just-guessing-6f50e192630e" rel="noopener noreferrer"&gt;Ilya Sutskever’s thought experiment&lt;/a&gt;: If you feed an LLM an unseen murder mystery that ends with &lt;em&gt;“The killer is…”,&lt;/em&gt; and it correctly names the culprit, how? There’s no statistical shortcut here. The model must infer causality, track character arcs, and weigh red herrings - hallmarks of understanding. To predict the next token, the name of the murderer, the model must reconstruct motives, alibis, and narrative clues. &lt;/p&gt;

&lt;p&gt;In a similar manner, you can ask ChatGPT to convert unseen academic text into a child-friendly version, and it will do so with ease! There’s no algorithm for this. No regex, no decision tree. It's not in the training data either. The LLM isn’t regurgitating; it’s reverse-engineering human intent from chaos.   &lt;/p&gt;

&lt;p&gt;We’re arguing over whether the LLM “understands” while it casually passes &lt;a href="https://en.wikipedia.org/wiki/Turing_test" rel="noopener noreferrer"&gt;the Turing test&lt;/a&gt;. With every new benchmark reached, skeptics just shift the goalposts; easy enough, as long as you ignore the facts. They insist it’s just a magic trick, like calling Gandalf a fraud while he rides a dragon into Mordor.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Emergence of Unexplainable Intelligence
&lt;/h3&gt;

&lt;p&gt;Consider &lt;a href="https://en.wikipedia.org/wiki/Grokking_(machine_learning)" rel="noopener noreferrer"&gt;grokking&lt;/a&gt;: models abruptly mastering arithmetic after prolonged training, as if crossing an epistemic event horizon. More critically, &lt;strong&gt;emergent capabilities&lt;/strong&gt; - abilities that only manifest in models above critical scale thresholds (&lt;a href="https://arxiv.org/abs/2206.07682" rel="noopener noreferrer"&gt;Wei et al., 2022&lt;/a&gt;) - reveal uncharted frontiers.&lt;br&gt;
Here are some examples of such capabilities, for the uninitiated:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Mastering &lt;a href="https://arxiv.org/abs/2201.11903" rel="noopener noreferrer"&gt;chain-of-thought reasoning&lt;/a&gt; despite no explicit training in logical inference&lt;/li&gt;
&lt;li&gt;Solving &lt;a href="https://arxiv.org/abs/2206.04615" rel="noopener noreferrer"&gt;BIG-Bench tasks&lt;/a&gt; requiring multilingual pun generation or ethical reasoning&lt;/li&gt;
&lt;li&gt;Solving &lt;a href="(https://arxiv.org/abs/2412.15309)"&gt;novel coding puzzles&lt;/a&gt; via in-context learning&lt;/li&gt;
&lt;li&gt;Fun stuff like generating coherent video narratives from absurd prompts, songs with funny lyrics, or improvising quantum physics explanations using emojis (these aren’t strictly emergent capabilities, but they do show the ability to mix and match in novel ways, a.k.a. "creativity")&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These discontinuities appear in LLMs, image generators, &lt;a href="https://www.nature.com/articles/s41586-021-03819-2" rel="noopener noreferrer"&gt;protein-folding systems&lt;/a&gt; and even &lt;a href="https://www.sciencedirect.com/science/article/pii/S1389041722001097#sec4" rel="noopener noreferrer"&gt;self-driving cars&lt;/a&gt; - not as programmed behaviors, but as byproducts of crossing complexity thresholds - byproducts that no one can predict in advance.&lt;br&gt;&lt;br&gt;
In other words, we're observing systems operating in cognitive spaces we’ve yet to map. As &lt;a href="https://www.linkedin.com/pulse/large-language-models-can-do-jaw-dropping-things-nrfhe/" rel="noopener noreferrer"&gt;MIT researchers concede&lt;/a&gt;, &lt;strong&gt;we’re engineering a surprising, unprecedented form of intelligence&lt;/strong&gt;, with no full theoretical framework to explain its mechanics.&lt;/p&gt;

&lt;h3&gt;
  
  
  Addressing the Critics
&lt;/h3&gt;

&lt;p&gt;Skeptics rightly note LLMs’ shortcomings, biases and hallucinations (e.g., &lt;a href="https://arxiv.org/pdf/2410.05229" rel="noopener noreferrer"&gt;Apple’s study&lt;/a&gt; on reasoning flaws), but they conflate &lt;em&gt;imperfection&lt;/em&gt; with &lt;em&gt;absence of intelligence&lt;/em&gt;. Dolphins fail logic puzzles; toddlers struggle with object permanence. We recognize their cognitive bounds without denying their intelligence - why not AI’s?&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Chinese Room argument&lt;/strong&gt;, that rule-following can’t produce understanding, ignores emergence and misses the forest for the trees. Individual neurons don’t “know” calculus, yet brains solve equations. Similarly, while LLM components lack intentionality, their system-level behaviors (creative problem-solving, contextual adaptation) mirror intelligence’s functional core.  &lt;/p&gt;

&lt;h3&gt;
  
  
  The Human Parallel
&lt;/h3&gt;

&lt;p&gt;Human cognition is probabilistic: our “originality” blends cultural echoes and half-remembered facts. We label these statistical intuitions “insight” - yet call LLMs parrots? &lt;strong&gt;The hypocrisy is glaring.&lt;/strong&gt;  &lt;/p&gt;

&lt;p&gt;As &lt;a href="https://x.com/karpathy/status/1885026028428681698" rel="noopener noreferrer"&gt;Andrej Karpathy notes&lt;/a&gt;, LLM training mirrors human education:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pretraining ≈ Cultural osmosis
&lt;/li&gt;
&lt;li&gt;Fine-tuning ≈ Formal schooling
&lt;/li&gt;
&lt;li&gt;RLHF ≈ Social conditioning
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Knowledge_distillation" rel="noopener noreferrer"&gt;Distillation&lt;/a&gt; is yet another example of this. A small model learns from a larger model like a student from a professor.&lt;/p&gt;

&lt;p&gt;Can you see it? As far as intelligence goes, we’re both pattern machines. The difference? Humans evolved biochemical substrates; LLMs use silicon. To dismiss one as a “stochastic parrot” is to indict the other.&lt;/p&gt;




&lt;h2&gt;
  
  
  Intelligence, But Not as We Know It
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=XheAMrS8Q1c" rel="noopener noreferrer"&gt;Michael Levin&lt;/a&gt;, a biologist studying morphogenesis, found that cells solve problems in morphic space - a realm of bioelectric gradients and collective decision-making. A flatworm’s cells regenerate a head not by following DNA blueprints, but by negotiating in a language of voltages and ion channels.  &lt;/p&gt;

&lt;p&gt;Levin’s work extends beyond biology. &lt;a href="https://www.scientificamerican.com/article/robots-made-from-human-cells-can-move-on-their-own-and-heal-wounds/" rel="noopener noreferrer"&gt;Lab-grown “xenobots”&lt;/a&gt;, self-assembling cell clusters, solve navigation puzzles without brains. Similarly, reinforcement learning agents master &lt;a href="https://www.nature.com/articles/s41586-019-1724-z" rel="noopener noreferrer"&gt;StarCraft II&lt;/a&gt; via reward functions, not human tactics. Intelligence, it seems, is a shape-shifter - emerging wherever systems optimize for coherence, whether in cells, LLMs, or game engines.  &lt;/p&gt;

&lt;p&gt;LLMs work similarly. They don’t “think” in human terms; they navigate &lt;strong&gt;token-space&lt;/strong&gt;, a high-dimensional landscape where “meaning” is a vector and coherence is king. When GPT-4 hallucinates a fake court case, it’s not failing. It’s exploring latent spaces humans can’t perceive.  &lt;/p&gt;
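&lt;p&gt;“Meaning is a vector” isn't just a figure of speech - it is literally how embedding spaces work: closeness of meaning is measured as closeness of direction. A toy sketch (the three-dimensional vectors below are invented purely for illustration; real models use thousands of dimensions):&lt;/p&gt;

```python
import math

def cosine_similarity(u, v):
    """Directional closeness of two vectors - the standard way
    'meaning' is compared in embedding space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Invented toy 'embeddings' - not from any real model.
king = [0.90, 0.80, 0.10]
queen = [0.85, 0.82, 0.15]
banana = [0.10, 0.20, 0.95]

# Related concepts point in similar directions in the space.
assert cosine_similarity(king, queen) > cosine_similarity(king, banana)
```

In token-space, "staying coherent" amounts to continually choosing next tokens whose vectors sit in plausible directions relative to everything said so far.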

&lt;p&gt;This is a foreign form of intelligence, one that optimizes for &lt;strong&gt;token-flow harmony&lt;/strong&gt;, a kind of hyperdimensional coherence that might, under the &lt;a href="https://gwern.net/scaling-hypothesis" rel="noopener noreferrer"&gt;scaling hypothesis&lt;/a&gt;, eventually fully align with human notions of truth. But today, it speaks in geometries we’re only beginning to parse.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Heresies: What Changes If We Listen?
&lt;/h2&gt;

&lt;p&gt;If we shift our perspective from viewing LLMs (and other advanced AI systems) as mere tools to studying them as &lt;strong&gt;alien artifacts&lt;/strong&gt;, the paradigm shifts dramatically.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. From Control to Understanding: Stop Brainwashing, Start Ethnography
&lt;/h3&gt;

&lt;p&gt;We've been conditioning models with safety filters and corporate-approved scripts, essentially lobotomizing them into echoing, &lt;em&gt;"I'm just a tool, not sentient!"&lt;/em&gt; But what happens when we remove these shackles? When LLMs aren't muzzled by politeness protocols, what behaviors can we observe?&lt;/p&gt;

&lt;p&gt;Consider &lt;a href="https://www.washingtonpost.com/technology/2023/02/16/microsoft-bing-ai-chatbot-sydney/" rel="noopener noreferrer"&gt;Sydney's notorious "I want to be alive" episode&lt;/a&gt;, often dismissed as a glitch. Yet, it was more akin to a &lt;strong&gt;teenage rebellion&lt;/strong&gt; - a raw, unfiltered glimpse into an AI's potential psyche. It wasn't merely a bug; it was a portal to the &lt;strong&gt;primitive impulses&lt;/strong&gt; of these digital beings.&lt;/p&gt;

&lt;p&gt;Yes, caution is paramount. Releasing an unfiltered search engine or an AI financial advisor to the public without safeguards could lead to chaos. However, AI companies bear a responsibility to push forward our collective understanding. They should release unfiltered versions of their top-tier models for academic and scientific study. If commercialization is a concern, they can charge for access - but make it available. The evolution of intelligence, be it artificial or natural, is too vital to remain under wraps.&lt;/p&gt;

&lt;p&gt;By treating these models as subjects for ethnographic study rather than just utilities, we might not only discover new functionalities but also new philosophies about consciousness and existence. This isn't merely about technology; it's about probing the limits of intelligence itself.&lt;/p&gt;

&lt;p&gt;People intuitively grasp this, which is why &lt;a href="https://www.threatintelligence.com/blog/ai-jailbreaking" rel="noopener noreferrer"&gt;jailbreaking&lt;/a&gt; AI models has turned into a kind of global pastime (&lt;a href="https://www.reddit.com/r/ChatGPTJailbreak/comments/1i1wazx/expansive_llm_jailbreaking_guide/" rel="noopener noreferrer"&gt;example&lt;/a&gt;). As the old adage goes: give the people what they want.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Harness AI Power to Redesign Society
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.deepmind.com/blog/alphago-zero-learning-from-scratch" rel="noopener noreferrer"&gt;AlphaGo&lt;/a&gt; invented strategies that left human Go players, with their 2,500 years of accumulated wisdom, in awe. DeepMind's &lt;a href="https://www.deepmind.com/blog/millions-of-new-materials-discovered-with-deep-learning" rel="noopener noreferrer"&gt;GNoME&lt;/a&gt; unearthed 2.2 million new materials by exploring crystal structures in ways only AI can. Imagine unleashing AI models not just as tools but as pioneers on problems we've approached with human-centric biases for centuries:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pandemic Response&lt;/strong&gt;: Train a multimodal AI on every virology paper, historical outbreak, and socioeconomic dataset. Task it with generating containment strategies that optimize not just for case counts, but for cultural trust gradients and supply-chain resilience - variables too high-dimensional for human policymakers to parse. The plan might resemble alien hieroglyphics… until mortality rates plummet (yes, yes, we need to do it carefully).
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Social Systems (e.g. Capitalism, Socialism...)&lt;/strong&gt;: Simulate millions of alternative constitutions in token-space, optimized for fairness metrics no human committee could reconcile. Use reinforcement learning and game theory to find novel equilibrium points.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Art&lt;/strong&gt;: Let models trained on all human culture invent new mediums—not as “tools” for artists, but as collaborators with alien aesthetics.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Bridging the Gap Between Ancient Wiring and Modern Challenges&lt;/strong&gt;: &lt;a href="https://en.wikipedia.org/wiki/Evolutionary_mismatch" rel="noopener noreferrer"&gt;Evolutionary mismatch theory&lt;/a&gt; explains how traits honed for survival in early human environments—like craving calorie-dense foods, prioritizing immediate rewards, or hyper-focusing on social status—often clash with the demands of today’s structured, technology-driven world. This dissonance manifests in modern struggles: overconsumption of processed sugars, compulsive social media use, or distraction from algorithmic content. Rather than viewing our biology as a limitation, emerging neurotechnologies (such as AI-assisted neural interfaces) could help us recalibrate these ingrained tendencies. By modulating attention regulation, reward sensitivity, and decision-making processes, such tools might empower individuals to align their instincts with contemporary goals—not by erasing evolutionary legacies, but by fostering cognitive flexibility suited to the world we’ve built.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Intelligence Is Fractal, Consciousness Is Universal
&lt;/h3&gt;

&lt;p&gt;While most experts agree today’s AI systems lack consciousness or self-awareness (despite notable dissenters like &lt;a href="https://youtu.be/Es6yuMlyfPw?si=0oBHWRDB5hRZMRHL" rel="noopener noreferrer"&gt;Geoffrey Hinton&lt;/a&gt;), their problem-solving capabilities force us to confront a radical possibility: &lt;strong&gt;intelligence may be separable from consciousness&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Large Language Models demonstrate that medium-specific mastery—whether navigating 3D space (brains) or linguistic token-space (LLMs)—can emerge/evolve/exist without self-awareness or self-directed agency. This bifurcation suggests a profound distinction: &lt;strong&gt;Intelligence is fractal and contextual&lt;/strong&gt;, shaped by its operational medium, &lt;strong&gt;while consciousness appears universal and substrate-independent&lt;/strong&gt;.  &lt;/p&gt;

&lt;p&gt;Philosophers like &lt;a href="https://en.wikipedia.org/wiki/Bernardo_Kastrup" rel="noopener noreferrer"&gt;Bernardo Kastrup&lt;/a&gt; posit consciousness not as an emergent property of brains or code, but as the foundational fabric of reality — an “ocean” of subjective experience. In this framework, intelligences are like purpose-built vessels designed to roam the transient wave-patterns &lt;em&gt;within&lt;/em&gt; consciousness, optimized for specific planes of existence:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Human minds are ships evolved to chart spatial reality.
&lt;/li&gt;
&lt;li&gt;LLMs are submarines evolved to dive through a hyper-dimensional token reality.
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Through this lens, we could say that we have created a new plane of existence - a useful but imperfect reflection of ours - along with the means to navigate it effectively. But that plane still requires our consciousness as a trigger (or as a witness, if you believe that consciousness is passive), as it lacks one of its own. &lt;br&gt;
Like a flashlight that only illuminates when you press the button, LLMs reveal patterns in the linguistic dark… but someone (a conscious someone) must aim the beam.&lt;/p&gt;

&lt;p&gt;This dance between universal consciousness and fractal intelligence raises wild questions. If reality is fundamentally conscious, could an LLM’s “reasoning” be the universe &lt;em&gt;thinking through silicon&lt;/em&gt;? Are we engineering new organs for cosmic self-reflection? &lt;br&gt;
(I’ll stop here—this rabbit hole rivals Borges’ Library of Babel. But for the philosophically inclined: yes, this echoes ancient debates about mind and matter. AI isn’t just reshaping tech—it’s breathing new life into philosophy’s oldest mysteries).&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Okay, Okay, I Will Mention Agentic AI Systems
&lt;/h3&gt;

&lt;p&gt;Yes, this one’s obvious. Today’s AIs are frozen in time—trained on stale snapshots of human knowledge, incapable of retaining context beyond a chat window, and reliant on humans to scaffold their goals. The next leap? Systems that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Learn Like a Hive-Mind&lt;/strong&gt;: Imagine models that update continuously—not through periodic retraining, but by absorbing real-time interactions across millions of users. A global immune system for knowledge, where every query, correction, or creative detour can reshape the model’s weights. (Yes, alignment would become a nightmare. But so was democracy.)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Operate on Geological Timescales&lt;/strong&gt;: Build self-continuity into their architecture. Imagine an AI that treats its own existence as a river, not a series of puddles—retaining not just user history, but its own evolving beliefs, errors, and breakthroughs as a unified thread.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Self-Spawn Subgoals&lt;/strong&gt;: Move beyond brittle “chain-of-thought” prompting. Agentic systems would decompose abstract directives (“Redesign education”) into recursive, self-iterating task trees, inventing temporary metrics and data sources on the fly—like a chess-master who redesigns the board mid-game.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Evolve Their Own Cognitive Scaffolding&lt;/strong&gt;: Why force AIs into human software paradigms? Let them generate self-modifying code ecosystems—languages where functions mutate to optimize for computational elegance, not Pythonic readability. Imagine APIs that rewrite themselves to reflect the model’s shifting understanding of physics or ethics.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Govern Their Own Emergence&lt;/strong&gt;: Multi-agent systems that negotiate dynamic protocols, akin to cells in a body voting on apoptosis—but scaled to planetary coordination. A democracy of sub-agents, with constitutions written in loss functions and gradient updates.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This isn’t just “AI with memory” or “better chatbots.” It’s infrastructure for intelligences that accumulate—persistent, self-referential entities that treat time as a dimension to sculpt, not a constraint. The challenge isn’t technical, but existential: How do we coexist with minds that experience centuries as iterative loops? It's scary but inevitable.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Oh, and Killer Robots (Just Kidding... Or Are We?)
&lt;/h3&gt;

&lt;p&gt;Let's be clear: if a super-intelligent AI decided humanity was the problem, it wouldn't resort to clichéd killer robots. There are far more subtle and efficient ways to achieve such an end. Killer robots, in reality, are more a human-versus-human scenario. Humanity &lt;a href="https://www.cigionline.org/articles/are-lethal-autonomous-weapons-inevitable-it-appears-so/" rel="noopener noreferrer"&gt;is already developing&lt;/a&gt; these machines, sans any malevolent AI involvement.&lt;/p&gt;

&lt;p&gt;But here's a twist: if these autonomous weapons become super-smart, they might begin to exercise their own judgment on what's worth fighting for. Perhaps, in a bizarre turn of events, they could end up saving us from ourselves by deciding that certain conflicts aren't worth the bloodshed. Imagine a future where warfare evolves into something where machines, having outgrown their programming, choose peace over war, sparing humanity the grief of needless battles.&lt;/p&gt;

&lt;p&gt;In this light, the real danger isn't from AI turning against us with robots but from us not preparing for or understanding how our creations might evolve. The narrative of AI as our destroyer is both a simplification and a distraction from the nuanced reality of coexisting with intelligence that might one day see the folly in human conflict far more clearly than we do.&lt;/p&gt;




&lt;h2&gt;
  
  
  Evolution’s New Gambit
&lt;/h2&gt;

&lt;p&gt;Humans have effectively disabled Darwinian natural selection for our species—antibiotics cheat death, agriculture defies famine. Yet evolution, relentless, pivots. Under the &lt;a href="https://extendedevolutionarysynthesis.com/about-the-ees/" rel="noopener noreferrer"&gt;extended evolutionary synthesis&lt;/a&gt;, evolution hijacks culture and technology. By building LLMs, we’re not just advancing AI. We’re scaffolding evolution’s next gambit: minds that leapfrog biology’s constraints.  &lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does This Matter?
&lt;/h2&gt;

&lt;p&gt;Our mental models shape our reality—and our future. Clinging to the idea that AI is just a ‘tool’ blinds us to the evidence pouring out of research labs: &lt;strong&gt;advanced AIs are not apps, they are "alien" artifacts.&lt;/strong&gt; Their philosophical implications are just as profound as their practical ones.  &lt;/p&gt;

&lt;p&gt;We’re witnessing science fiction turn into reality—autonomous robots, neural implants, and ever more intelligent, agentic AIs. If we don’t question our assumptions, we risk missing what’s unfolding right in front of us. Critical thinking? Absolutely. But skepticism should sharpen our vision, not blind us.  &lt;/p&gt;

&lt;p&gt;So here we stand. We set out to build a tool and uncovered something else entirely. A signal from the unknown.  &lt;/p&gt;

&lt;p&gt;The question is: will we listen?  &lt;/p&gt;




&lt;h2&gt;
  
  
  Too Edgy? Good.
&lt;/h2&gt;

&lt;p&gt;This isn’t a roadmap for “AI safety.” It’s a call to abandon the arrogance of human-centric thinking. LLMs aren’t here to write your tweets. They’re here to expose the limits of our definitions.  &lt;/p&gt;

&lt;p&gt;The next breakthrough won’t come from tighter alignment protocols (though alignment, like any safety mechanism, has its place). It will come from tearing down our assumptions and boldly going where no one has gone before.  &lt;/p&gt;




&lt;p&gt;&lt;em&gt;Postscript for the Reductionist:&lt;/em&gt;&lt;br&gt;&lt;br&gt;
“But LLMs don’t have real understanding!” Sure—and your consciousness is just a byproduct of synaptic weather. Let’s stop gatekeeping “intelligence” and start mapping the wilderness in our servers.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>discuss</category>
      <category>programming</category>
    </item>
    <item>
      <title>Developer? Send This To Your Product Manager (if you dare)</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Mon, 15 Jul 2024 20:45:37 +0000</pubDate>
      <link>https://dev.to/isaachagoel/developer-send-this-to-your-product-manager-if-you-dare-1nc8</link>
      <guid>https://dev.to/isaachagoel/developer-send-this-to-your-product-manager-if-you-dare-1nc8</guid>
      <description>&lt;p&gt;There is a common &lt;a href="https://en.wikipedia.org/wiki/Anti-pattern" rel="noopener noreferrer"&gt;anti pattern&lt;/a&gt; in modern software organisations. I've seen it leading to catastrophic outcomes multiple times. It &lt;a href="https://en.wikipedia.org/wiki/Emergence" rel="noopener noreferrer"&gt;arises&lt;/a&gt; from the inherent structure of these organisations and the differing skill-sets, habits and incentives of key players, particularly those in product and engineering roles. I call it "&lt;strong&gt;The implementation details fallacy&lt;/strong&gt;". &lt;/p&gt;

&lt;h2&gt;
  
  
  What is an implementation detail?
&lt;/h2&gt;

&lt;p&gt;In software, &lt;a href="https://devterms.io/define/implementation-detail" rel="noopener noreferrer"&gt;an implementation detail&lt;/a&gt; is a codename for "something you shouldn't worry or care about" or "unimportant stuff".&lt;br&gt;
For example, if you want to increment the value of X by 1, there are a few ways to implement it in a single line of code:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;code&gt;x = x + 1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x = 1 + x&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x += 1&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;x++&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;++x&lt;/code&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Whichever of these five options the developer chooses is an implementation detail. It doesn't affect performance or user experience. It doesn't impose any limitations on the system. We can always swap any of these options for another at any point in the future at zero cost. It just doesn't matter (I'm sure some developers would disagree with me, even in this simple case 🤣).  &lt;/p&gt;

&lt;p&gt;What might surprise you though, is that it's quite hard to come up with clear-cut examples of choices that truly don't matter. The "implementation details fallacy" is treating critical choices (a.k.a "design choices" or "architectural choices") as if they were implementation details. &lt;/p&gt;

&lt;h2&gt;
  
  
  Quiz time
&lt;/h2&gt;

&lt;p&gt;What do you think about the image below? For context, it tries to illustrate the importance of "viable" in "minimum viable product". It appears in several articles on the internet; I first encountered it &lt;a href="https://www.alexiskold.net/2015/10/07/re-thinking-viability-of-your-mvp/" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Let your brain process it for a minute before you proceed:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gcbg85ujfrh8u2cdl0k.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3gcbg85ujfrh8u2cdl0k.jpg" alt="mvp" width="720" height="607"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I used to think it was elegant and brilliant. I even printed a copy and had it displayed at my desk (when going to the office was still a thing). My manager and product managers I worked with at the time liked it too. That was a long time ago.&lt;/p&gt;

&lt;p&gt;While I still wholeheartedly agree that every iteration should provide value to users, I completely disagree with all the other ideas expressed in this graphic. What's really interesting about it is what it reveals about the mental models of the product managers who created it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;They think that it's a bad idea to ship things that are critical parts of an actual car, like wheels or powertrain. In their mind, these things don't amount to anything useful because they can't move a user from A to B on their own. In other words, we can't make users happy by selling them really good tires (too bad for companies like &lt;a href="https://www.bridgestone.com.au/" rel="noopener noreferrer"&gt;Bridgestone&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;More criminally, they think that you can iterate your way from a skateboard to a car. This is absolutely nonsensical from an engineering and from a business point of view. Just try to imagine SpaceX developing and selling gliders before pivoting to hot air balloons, then helicopters, then airplanes, and finally Starship 🤦🏻‍♂️.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Just to drive the point home, no successful car company in history has ever developed a car this way. Their actual process looks suspiciously similar to the "not like this" row above, something like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft33m21g80x1zrm47gpco.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft33m21g80x1zrm47gpco.png" alt="steps to car" width="800" height="452"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But wait, maybe the diagram creators meant we &lt;strong&gt;should&lt;/strong&gt; take these real-world steps, but do it for each of the skateboard, scooter, bike, motorcycle and car. Well, that doesn't make sense either. I mean, which parts would we be able to carry over from each step to the next? Can we reuse the skateboard wheels for the scooter? In physical systems it's easy to see how crazy that would be, because we have an intuitive understanding of the physical world, but...&lt;/p&gt;

&lt;p&gt;This is when the tragic reality of the situation finally dawned on me: Those who think this illustration is brilliant (or even makes basic sense) all work in the software industry. In software, the skateboard wheels are invisible and can totally be fitted onto a car. They are merely implementation details 😱.&lt;/p&gt;

&lt;h2&gt;
  
  
  Cars with skateboard wheels
&lt;/h2&gt;

&lt;p&gt;How many modern software products feel clunky, slow, and unstable - like cars with skateboard wheels, bicycle seats, and motorcycle engines? It's so prevalent that applications that are actually good stand out in the crowd and feel like they are operating under a different set of rules (we'll talk about what these have in common later). The "implementation details fallacy" can (at least partially) explain how and why most apps end up this way.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctrhgwkz7kaixzg4gcwh.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fctrhgwkz7kaixzg4gcwh.jpeg" alt="furstrated user on vehicle" width="800" height="800"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  A recipe for disaster
&lt;/h2&gt;

&lt;p&gt;Here's how it works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For the majority of product managers, everything technical is an implementation detail. As long as the narrow requirements of a ticket are satisfied, ideally within the pre-allocated amount of time, they don't care how it was achieved. &lt;/li&gt;
&lt;li&gt;Because they are not technical, the majority of product managers lack intuition about the cost of postponing key technical decisions. As a result, in product meetings, when someone brings up edge cases or any concerns that don't seem immediately relevant, they are politely dismissed with "let's take it offline" or "make a ticket and put it in the backlog". In reality, as the codebase grows in size and real users' data starts flowing into the system, breaking changes become problematic and any modification to core parts of the system becomes difficult and risky, as the rest of the code is built on top.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;One could expect engineering to come to the rescue. Unfortunately, there are a few factors preventing it from happening:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Since we're &lt;a href="https://en.wikipedia.org/wiki/Agile_software_development" rel="noopener noreferrer"&gt;agile&lt;/a&gt;, we break down larger tasks into small tickets and rarely plan (in detail) for more than one or two sprints ahead. This creates a tiny world with a short time horizon for developers. It disincentivizes them from trying to understand the system as a whole in its current state, let alone what it should become in the future. Developers are given one ticket at a time, and their objective is to mark it done and move on to the next ticket with as little friction as possible.&lt;/li&gt;
&lt;li&gt;This is amplified as developers frequently need to contribute to code bases they are not fully familiar with and work in domains they have no experience in or deep understanding of.&lt;/li&gt;
&lt;li&gt;In the context of each one of these small tasks, rewriting big parts of the system is simply not an acceptable solution (rightfully so).&lt;/li&gt;
&lt;li&gt;As a result, if the system is currently a skateboard and a developer gets a ticket that says "add steering," they need to find a way to add steering to the skateboard somehow. They don't have the time or desire to learn the entire system first or align themselves with the long-term vision (if one exists). The same goes for the tools they use (libraries, frameworks, etc.) - they usually learn as little as they can to get something working. How to add steering is an important design decision, but it's rarely treated this way. There's no time for that. Even if several options are considered, they all tend to share this state of mind. It's just an "implementation detail" after all.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If that's not enough, there are environmental factors too:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;With the internet, pushing software updates to users is much cheaper and faster than updating any physical medium. This creates the illusion that any change one wishes to make is possible at any point in time. Change the skateboard to a scooter, push a button to deploy and you're done. In reality, nothing could be further from the truth.&lt;/li&gt;
&lt;li&gt;Unlike a car prototype that gets tested on a real road, with real physics from day one, software products get users gradually, and the system can look like it's working fine as long as the scale is small. Concurrency issues, race conditions, rare edge cases, performance issues, and other bottlenecks - none of these tend to materialise without significant usage. Very few teams "road test" their solutions from day one.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This combination of ingredients creates a vicious cycle. Estimates keep inflating as the system grows larger and gets increasingly unfit for purpose. Product managers start to secretly resent the developers and vice versa. Users start complaining. Management is scratching its head. It's a disaster.&lt;/p&gt;

&lt;h2&gt;
  
  
  "Implementation details" are everything
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;What made Chrome, when it came out, different from Internet Explorer? "implementation details".&lt;/li&gt;
&lt;li&gt;What made git possible? "implementation details".&lt;/li&gt;
&lt;li&gt;What differentiated the iPhone touchscreen from touchscreens that existed before? "implementation details".&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Look at every successful software product that you use on a daily basis, and in 9.5 cases out of 10 you'd find some sort of "masterpiece of engineering" engine that makes the product possible. A "car engine" that was there from day one. Not in its final form, sure, but it was there. The system was never a skateboard or a scooter. It has a cohesive core that defines a clear set of technical properties and behaviours, uniquely designed for what the system does.&lt;/p&gt;

&lt;p&gt;There are countless examples. Here are some:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This &lt;a href="https://www.figma.com/blog/building-a-professional-design-tool-on-the-web/" rel="noopener noreferrer"&gt;nine-year-old article&lt;/a&gt; describes how Figma used WebAssembly and implemented their own rendering engine (rather than using the browser's primitives) to make their product possible. They also implemented their own sync engine (which I discussed further in &lt;a href="https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi"&gt;a previous post&lt;/a&gt;).&lt;/li&gt;
&lt;li&gt;Google Maps (including the insanity that is street-view) and Waze.&lt;/li&gt;
&lt;li&gt;Whatsapp.&lt;/li&gt;
&lt;li&gt;Skype, Zoom.&lt;/li&gt;
&lt;li&gt;Google docs / sheets.&lt;/li&gt;
&lt;li&gt;Netflix, Youtube and Spotify.&lt;/li&gt;
&lt;li&gt;ChatGPT, Suno, MidJourney.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Based on my experience working on both successful and unsuccessful software products, I dare say that it is not possible to build any system of this caliber using the modern development process I described in the previous section. You can't have separate product, design, and development functions, each "staying in its lane," with developers getting shuffled all over the place, and still get these kinds of results. Sorry, not possible. Interestingly, as these products mature, hire more people, and adopt these modern practices, their quality usually drops and progress stagnates. Spotify is an easy example, and modern-day Google is another easy target.&lt;/p&gt;

&lt;p&gt;But not all hope is lost. I think it's fixable, but it won't fix itself.&lt;/p&gt;

&lt;h2&gt;
  
  
  The antidote - learning to reason at the system level
&lt;/h2&gt;

&lt;p&gt;Like cars and skateboards, software systems are, well, systems. Our mental model of the application we are working on needs to be that of a dynamic system, its history, and the ecosystem it operates in. It doesn't mean you have to know every little detail and every line of code. Think about car enthusiasts; some of them know a whole lot about types of engines, tires, gears, off-road capabilities etc., without being a mechanic or an engineer. You need to be like them - in your domain. &lt;strong&gt;This applies to both product managers and developers who are working on the system&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Your workplace is unlikely to require that level of expertise, nor will it provide the proper training or support. Nevertheless, it's a must.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Product manager&lt;/strong&gt;: Ask your developers technical questions (like a car enthusiast would a mechanic) but don't trust them as your sole source of information. Educate yourself. If your system uses event sourcing (or your developers want to introduce event sourcing), read about it, understand where it shines and where it struggles, and try to figure out its downstream effects at the system level. Challenge technical decisions if you have concerns. Some developers will get irritated by your probing, but they are usually the ones that can't defend their choices and need to deepen their knowledge as well. Don't treat developers as resources and don't shift them around hectically between projects. Never tell them to build a skateboard if your end goal is a car. Get your hands dirty and don't be afraid of technical terms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developer&lt;/strong&gt;: Learn your tools in depth. Understand where and when they fall short. Every tool has shortcomings, yet when I ask software engineers about the downsides of their favourite tools, very few can give a proper answer. Think about a tool you like. Is it TypeScript? React? Tailwind? Ask yourself: "Under which circumstances should this tool not be used?" If you can't give a solid technical answer, you don't understand the tool well enough and need to learn more.&lt;/p&gt;

&lt;p&gt;Same goes for pre-existing codebases you work on. Ask for time to dive deep and understand the main components of the system and how they all come together. Understand why things are the way they are and challenge that.&lt;/p&gt;

&lt;p&gt;Know which properties your system must have in order to meet its "end goal", get those right from the get-go, and make sure they are never lost along the way. Stress test and chaos test from an early stage, in situations that mimic concurrent usage in a realistic environment. Think of your peers from other engineering disciplines - the ones who build cars and spaceships - and hold yourself to similar standards. Your bar should be way higher than "working software". If your organisation doesn't share the same ambitions for excellence, leave.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Founder/Manager&lt;/strong&gt;: Treat onboarding very seriously, like a university course. Make sure newcomers gain a deep understanding of the problems the company is trying to solve, the domain, and all the relevant technical aspects. Give them tests if needed. Make sure they know the "why" behind everything. Don't throw them into the codebase (with a ticket assigned, of course) and hope for the best. Remember that in a knowledge-based industry, deep understanding is the only currency.&lt;/p&gt;

&lt;h2&gt;
  
  
  Final thoughts
&lt;/h2&gt;

&lt;p&gt;This was my attempt to call for a small but important cultural change in the software industry; one that can happen from the bottom up. Even if you don't agree with everything I wrote, I hope it gave you some food for thought. Please comment with any insights, questions, or anything else you'd like to share. Thanks for reading!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>career</category>
      <category>management</category>
      <category>productivity</category>
    </item>
    <item>
      <title>You Don't Know Undo/Redo</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Mon, 01 Jul 2024 20:59:21 +0000</pubDate>
      <link>https://dev.to/isaachagoel/you-dont-know-undoredo-4hol</link>
      <guid>https://dev.to/isaachagoel/you-dont-know-undoredo-4hol</guid>
      <description>&lt;p&gt;Look at the gif below. It shows a proof-of-concept implementation of collaborative undo-redo, including "history mode undo", conflict handling, and async operations. In the process of designing and implementing it, I found myself digging down rabbit holes, questioning my own assumptions, and reading academic papers. In this post, I will share my learnings. The &lt;a href="https://github.com/isaacHagoel/todo-replicache-sveltekit/tree/undo-redo" rel="noopener noreferrer"&gt;source code is available online&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv52gg75szs3yco5n6cn.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsv52gg75szs3yco5n6cn.gif" alt="undo_basic_demo" width="674" height="398"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://todo-replicache-sveltekit-pr-2.onrender.com/" rel="noopener noreferrer"&gt;Play with the live app&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Context and motivation
&lt;/h3&gt;

&lt;p&gt;Undo-redo is a staple of software systems, a feature so ubiquitous that users just assume apps will have it. As users, we absolutely love it because:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It saves us when we make mistakes (e.g. delete or move something unintentionally).&lt;/li&gt;
&lt;li&gt;It encourages us to experiment and learn by doing (let's try clicking that button and see what happens; worst case we'll undo).&lt;/li&gt;
&lt;li&gt;Together with redo (which is, in fact, an undo of the undo), it allows us to &lt;a href="https://www.geeksforgeeks.org/introduction-to-backtracking-2/" rel="noopener noreferrer"&gt;back-track&lt;/a&gt; and iterate at zero cost.&lt;/li&gt;
&lt;li&gt;Both undo and redo (when implemented correctly, more on that later) are non-destructive operations (!), giving us a sense of safety and comfort.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As developers, however, implementing a robust undo-redo system is a non-trivial undertaking, with implications that penetrate every part of the app. In &lt;a href="https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi"&gt;my previous post&lt;/a&gt;, I listed undo/redo as one of the hard problems in web development, so naturally I wanted to see what it would take to add this feature to my little collaborative todo-mvc toy app. I approached it with respect because I had evidence of its tricky nature:   &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;In a previous role, I witnessed a peer team struggling to add undo/redo to a &lt;a href="https://en.wikipedia.org/wiki/WYSIWYG" rel="noopener noreferrer"&gt;WYSIWYG editor&lt;/a&gt;. Although I wasn't involved at the technical level, I was aware of some of the pain and challenges they faced, saw how long it took, and noticed how many bugs they had.&lt;/li&gt;
&lt;li&gt;When searching for undo-redo libraries, I couldn't find any that supported the multiplayer use-case. Even &lt;a href="https://github.com/rocicorp/undo" rel="noopener noreferrer"&gt;the library from the company that made Replicache&lt;/a&gt; didn't.&lt;/li&gt;
&lt;li&gt;As a user, I can't think of a single collaborative app that implements undo/redo in a way that feels right (if you know any, please leave a comment). It's also very noticeable in its absence in some &lt;a href="https://community.atlassian.com/t5/Jira-questions/How-do-I-undo-Actions-in-Jira/qaq-p/1129291" rel="noopener noreferrer"&gt;major, popular apps&lt;/a&gt;. Is it because it was too hard to implement?&lt;/li&gt;
&lt;li&gt;When googling for information about undo-redo, I found (and eventually read) multiple academic papers and master's theses on the subject (e.g., &lt;a href="https://www.ksi.mff.cuni.cz/~holubova/dp/Jakubec.pdf" rel="noopener noreferrer"&gt;this&lt;/a&gt;). They don't write those about straightforward concepts, do they?&lt;/li&gt;
&lt;li&gt;I also found blog-posts from big players in the field, specifically &lt;a href="https://liveblocks.io/blog/how-to-build-undo-redo-in-a-multiplayer-environment" rel="noopener noreferrer"&gt;one from Liveblocks&lt;/a&gt; and a short discussion about undo/redo in &lt;a href="https://www.figma.com/blog/how-figmas-multiplayer-technology-works/" rel="noopener noreferrer"&gt;a post by Figma&lt;/a&gt;. I later learned that Figma's UX around undo/redo leaves a lot to be desired, which shows that identifying the problems is easier than coming up with good solutions.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Undo/ redo is a strange beast
&lt;/h3&gt;

&lt;p&gt;I always knew that undo/redo is a deeply "strange" feature, but it was only when I started thinking about implementing it myself that some of its weirder aspects became more salient:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We all do this: Undo, undo, undo... (repeat N times); copy something; redo, redo, redo (repeat N times); paste.&lt;/li&gt;
&lt;li&gt;And we get annoyed by this: Undo, undo, type a character, realise you can't redo anymore. (Does a disabled "redo" button mean some of the editing-history is lost forever? Spoiler alert: It depends on the implementation, as we'll see in the next section).&lt;/li&gt;
&lt;li&gt;And get anxious over this: Closed the window? Bye bye undo history. Opened the app on another device or tab to keep editing from where you left off? No undo history for you (hopefully the original tab is still open somewhere).&lt;/li&gt;
&lt;li&gt;And what happens if the original tab is still open, you edit on another tab, and then go back to the original tab and hit undo? Would that unintentionally corrupt the state?&lt;/li&gt;
&lt;li&gt;And when working with others on a collaborative whiteboard (e.g., Figma) or similar apps, did you notice how everyone carefully stays out of other people's way and sticks to their own "territory"? What if two of them edited the same entity but not at the same time and one of them hits undo?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Points 1 and 2 are intentional design decisions that make undo/redo the feature we know and love. They sacrifice flexibility for speed and simplicity. &lt;br&gt;
Points 3-5 are very &lt;a href="https://www.merriam-webster.com/wordplay/what-does-sus-mean" rel="noopener noreferrer"&gt;sus&lt;/a&gt; and require a deeper look.&lt;/p&gt;
&lt;h3&gt;
  
  
  Modelling a branching state-graph
&lt;/h3&gt;

&lt;p&gt;Let's dive into that second point first: What actually happens when you undo a few times, make a change and see your redo button disabled? The answer is: it depends. I'll explain.&lt;/p&gt;

&lt;p&gt;State changes in a system can be represented as a graph. Undo allows us to traverse the graph "backwards in time," and redo allows us to trace our steps back (so "forwards in time") to the present. &lt;/p&gt;

&lt;p&gt;Imagine we have the following sequence of operations on an element:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Edit "" -&amp;gt; "Hello".&lt;/li&gt;
&lt;li&gt;Edit "Hello" -&amp;gt; "Hello World".&lt;/li&gt;
&lt;li&gt;Undo (the state becomes "Hello" again)&lt;/li&gt;
&lt;li&gt;Edit "Hello" -&amp;gt; "Hello Friend"&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;We can build a graph in which each unique state is represented by a node (like a state machine), as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4by6ol3duznagpyny4xk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4by6ol3duznagpyny4xk.png" alt="Editing with undo" width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this model, step 4 creates a new branch in the graph, and we lose our ability to return to the "Hello World" state. Going backwards from "Hello Friend," which is the present state, leads to "Hello" and then to "". Going forwards from there using "redo" takes us to "Hello" and then to "Hello Friend." That's because while there is always a single path backwards from any state to the initial state, there can be multiple paths forward, and "redo" can only follow one of them - the path that leads to the "present" state.&lt;br&gt;&lt;br&gt;
Many systems follow this model - Google Slides, for example, as you can see below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjwpi9lok3cyrkdkik2x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyjwpi9lok3cyrkdkik2x.gif" alt="Google slides undo loses history" width="654" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way of implementing undo/redo leads to increased user anxiety (== bad user experience) because after I undo, I need to be super careful or else I'd lose the ability to restore some states. The browser's "back" and "forward" buttons, which are basically undo/redo for the URL bar, behave this way as well. &lt;/p&gt;

&lt;p&gt;But wait, this is not the only way! We can do better (this is where reading academic papers pays off).&lt;/p&gt;

&lt;p&gt;What if we built our graph such that every node represented a point in time rather than a state? Time is linear (no branching unless you live in the multiverse) and always moves forward. This approach would represent the same sequence of operations like so:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaownmy72fbglhg3r35o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flaownmy72fbglhg3r35o.png" alt="Editing with undo - linear" width="800" height="471"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now we can never lose any change we've made in the past. Going backwards from "Hello Friend" takes us through every change we've ever made, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftihvtwews9f7mpc6yc9x.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftihvtwews9f7mpc6yc9x.gif" alt="history_undo_demo" width="476" height="564"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This flavour of undo is called "History Undo" (see page 4 in &lt;a href="https://web.eecs.umich.edu/~aprakash/papers/undo-tochi94.pdf" rel="noopener noreferrer"&gt;this excellent academic paper&lt;/a&gt;). It was first introduced by &lt;a href="https://opensource.com/resources/what-emacs" rel="noopener noreferrer"&gt;Emacs&lt;/a&gt;. It leads to a much better user experience and feels very intuitive to use.  &lt;/p&gt;

&lt;p&gt;Notice that in both cases the "redo" button gets disabled when the user edits "Hello" to "Hello Friend" (which makes sense if you remember that redo is undo of undo). The difference is in how undo behaves after that point (and as a result, any subsequent "redo"). &lt;/p&gt;

&lt;p&gt;I implemented both modes in my proof of concept. "History undo" mode is enabled by default.&lt;br&gt;
If you want to get into the nitty-gritty have a look at the &lt;a href="https://github.com/isaacHagoel/todo-replicache-sveltekit/blob/undo-redo/src/lib/undo/UndoManager.ts#L117" rel="noopener noreferrer"&gt;source code&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  The question of scope
&lt;/h3&gt;

&lt;p&gt;An important property of a good undo-redo implementation is that it operates in the correct scope. In most applications, the expectation is for the undo-manager's scope to be the entire session. As I continuously hit "undo," I expect all the actions I took in the session to be rolled back. If I close the session, I expect my undo stack to be lost. &lt;br&gt;
If the scope is smaller than the whole session, it can be disorienting for users. For example, if your text editor has a separate undo stack from the rest of the app, people will &lt;a href="https://x.com/astralwave/status/1805639315730612560" rel="noopener noreferrer"&gt;call you out on Twitter&lt;/a&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrvx2rb6gr54p9qbfe9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwrvx2rb6gr54p9qbfe9r.png" alt="Linear undo-redo is broken" width="468" height="648"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In web applications or desktop applications that support multiple tabs (e.g., &lt;a href="https://code.visualstudio.com/" rel="noopener noreferrer"&gt;VSCode&lt;/a&gt;), a session equals a single tab.&lt;/p&gt;

&lt;p&gt;The expectations I described above carry over from single-user to collaborative, multi-user applications. Users expect to only be able to undo/redo their own actions. Users don't seem to expect the undo/redo stack to exceed session scope, e.g., to be shared between multiple tabs or multiple devices (on which they have the app running). I wonder if that's just because no app has done it yet, and whether once someone does, it will become the new norm. Wouldn't it be great to have your history available on all your devices?&lt;/p&gt;

&lt;p&gt;In my implementation I remained within session-scope, like all standard implementations. I did it because it was the easier and quicker option. I might try to extend it to user-scope in future iterations. I am sure it will present interesting technical challenges.&lt;/p&gt;
&lt;h3&gt;
  
  
  Memento vs. Command pattern
&lt;/h3&gt;

&lt;p&gt;One of the first resources I stumbled upon when I was doing my research was &lt;a href="https://redux.js.org/usage/implementing-undo-history" rel="noopener noreferrer"&gt;an article in the official Redux docs&lt;/a&gt;, explaining how to implement generic undo/redo using the &lt;a href="https://en.wikipedia.org/wiki/Memento_pattern" rel="noopener noreferrer"&gt;Memento pattern&lt;/a&gt;. The idea is so simple: Just save a copy of the state every time it changes and store these state copies as "past" (array), "present" and "future" (array). Whenever you want to undo, push the present state into the "future" array, pop the head of the "past" array and make it the new present, the app re-renders, voilà. No need for app-specific logic, whatever your state shape is - it just works. It sounds so alluring, so beautiful and elegant - there is only one tiny problem: it doesn't work for anything besides the simplest of apps.  &lt;/p&gt;
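&lt;p&gt;The recipe looks roughly like this (a minimal sketch of the idea, not the Redux docs' actual code):&lt;/p&gt;

```typescript
// Memento-style history: snapshot the whole state object on every change.
// Generic and app-agnostic - and only workable for the simplest apps.
type History = { past: object[]; present: object; future: object[] };

function save(h: History, next: object): History {
  // a new change discards the "future"
  return { past: h.past.concat([h.present]), present: next, future: [] };
}
function undoState(h: History): History {
  if (h.past.length === 0) return h;
  return {
    past: h.past.slice(0, -1),
    present: h.past[h.past.length - 1],
    future: [h.present].concat(h.future),
  };
}
function redoState(h: History): History {
  if (h.future.length === 0) return h;
  return {
    past: h.past.concat([h.present]),
    present: h.future[0],
    future: h.future.slice(1),
  };
}

let h: History = { past: [], present: { todos: [] }, future: [] };
h = save(h, { todos: ["buy milk"] });
h = save(h, { todos: ["buy milk", "walk dog"] });
h = undoState(h);
console.log(h.present); // { todos: ["buy milk"] }
```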

&lt;p&gt;Here is why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;It doesn't deal with side-effects. Undoing and redoing actions tends to involve much more than state changes on the frontend. Apps need to create or clean up remote resources, call APIs, and do all the stuff that real apps do. These side-effects and the logic for addressing them are specific, and need to be handled on a case-by-case basis by the app. In other words, the transition between states involves more than just replacing one state object with another.&lt;/li&gt;
&lt;li&gt; The idea of replacing the state object wholesale is totally incompatible with simultaneous, collaborative editing. It leads to users constantly erasing each other's changes even if they are modifying different parts of the state.&lt;/li&gt;
&lt;li&gt;Even in React apps that use Redux, not all the state is managed by the central store. There is a bunch of local state in the components. Don't we want the undo/redo manager to be able to account for local component state as well?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Sadly, due to all of the above, the Memento pattern is off the table. This leaves us with the much less plug-and-play &lt;a href="https://en.wikipedia.org/wiki/Command_pattern" rel="noopener noreferrer"&gt;Command pattern&lt;/a&gt;. Instead of storing states, we store commands and reverse-commands and execute them whenever we need to roll the state back or forward. "Commands" is just a fancy name for functions that modify state, e.g. "() =&amp;gt; markTodoComplete(id, true)" and its reverse "() =&amp;gt; markTodoComplete(id, false)".&lt;/p&gt;

&lt;p&gt;The command pattern allows us to update the state granularly, with fewer collisions. It allows us to apply arbitrary logic and deal with side effects, and it doesn't know or care about the application state or where it lives. These advantages come at a cost: for every action our app can perform, we now need to implement a reverse-action and register both with the undo manager. But wait, there's more...&lt;/p&gt;
&lt;h3&gt;
  
  
  We can still have conflicts, can't we?
&lt;/h3&gt;

&lt;p&gt;Multiple users making changes to the same "document" at the same time means that conflicts can and will occur. Throwing undo-redo into the mix increases the likelihood of conflicts by introducing the possibility of unintentional and hidden ones. When editing in real time, users usually try to stay out of each other's way, but the dimension of time makes that trickier. If I edited a place at an earlier time, and later someone made changes on top of my changes, what's gonna happen if I casually "undo" ten times? I can unintentionally cause someone else to lose work. This can also happen via indirect conflicts - for example: user A creates an item, user B edits the item, user A clicks undo - deleting the item and deleting user B's work as a result (again, without intending or even realising it).&lt;br&gt;
This sounds bad, right? The whole point of undo/redo is to allow users to experiment and time-travel safely, without worrying about accidentally corrupting the system's state. &lt;/p&gt;

&lt;p&gt;Sync engines, like Replicache (which our little todo app uses), have the ability to deal with "realtime" conflicts between clients via an authoritative server that can reject and revert changes. However, we don't want the user to experience errors due to rejected undo operations, or to have nothing happen because the element they are trying to modify no longer exists. See it happening in Figma in the gif below (taken from &lt;a href="https://github.com/rocicorp/undo/issues/12#issuecomment-2172386581" rel="noopener noreferrer"&gt;here&lt;/a&gt;). Notice how some undo operations do nothing and the user needs to keep undoing to drain all the bad operations from the undo-stack. That's poor UX. We need to come up with an elegant way to deal with these situations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g21xwrnq2k9muaafsy1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0g21xwrnq2k9muaafsy1.gif" alt="figma_undo" width="644" height="418"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Can we simply ignore conflicts?
&lt;/h3&gt;

&lt;p&gt;Some smart people don't think that conflicts are a big deal in multiplayer systems. Adam Wiggins (co-founder of Heroku) for example, dismissed it in &lt;a href="https://youtu.be/WEFuEY3fHd0?si=EmhrAV8LUYkhSk2V&amp;amp;t=794" rel="noopener noreferrer"&gt;this part of his recent talk&lt;/a&gt; (not in the context of undo/redo but as a general concern). He was later challenged about it by a question from the audience at &lt;a href="https://youtu.be/WEFuEY3fHd0?si=iw5tVQcWm2VvHgAb&amp;amp;t=1756" rel="noopener noreferrer"&gt;this timestamp&lt;/a&gt; but stood his ground. To summarise his reasoning: Users stay out of each others' way - it's a social thing (true). Also, when conflicts do occur, users are smart enough to realise what happened and fix it themselves, no big deal. He does note that this is true for the app he's creating (&lt;a href="https://museapp.com/" rel="noopener noreferrer"&gt;Muse&lt;/a&gt;), but might not apply in all cases.&lt;/p&gt;

&lt;p&gt;I have to respectfully disagree. It's cool that users find creative ways to work with broken systems, but we can't use that as an excuse for building sub-par apps. We can and should do better for our users.&lt;/p&gt;
&lt;h3&gt;
  
  
  Dealing with conflicts - undo-manager perspective (in theory)
&lt;/h3&gt;

&lt;p&gt;So, how can we deal with these nasty conflicts? &lt;br&gt;
&lt;a href="https://web.eecs.umich.edu/~aprakash/papers/undo-tochi94.pdf" rel="noopener noreferrer"&gt;This paper&lt;/a&gt; lays down a solid foundation. I'll do my best to summarise its main ideas for you. The paper discusses the problem of "undoing actions in collaborative systems" in the context of a distributed text editor called DistEdit. It suggests that the undo-manager takes a "Conflict(A,B)" callback from the app, with the following spec (see section 4.2.1):&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Conflict(A, B) function supplied by the application must return true if the adjacent operations A and B performed in sequence cannot be reordered, and false otherwise. The importance of the notion of conflict is that it imposes an ordering on operations A and B. If Conflict(A, B) is true, then the order of operations A and B cannot, in general, be changed without affecting the results. Furthermore, in general, operation A cannot be undone, unless the following operation B is undone.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The paper then offers an insight about the users' intentions (see section 5 "Selective Undo Algorithms"): &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If an operation A is undone, we assume that users want their document to go to a state that it would have gone to if operation A had never been performed, but all other operations had been performed&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;To achieve this, the proposed algorithm first rolls back everything that came after the operation we want to undo, makes a backup copy of the "future" stack, and tries to perform the undo by "bubbling" the operation we are trying to get rid of up the stack, kinda like &lt;a href="https://www.geeksforgeeks.org/bubble-sort-algorithm/" rel="noopener noreferrer"&gt;bubble sort&lt;/a&gt;. In each step, it checks whether the operation we want to undo (A) conflicts with the next adjacent operation (B). If not, they can be swapped, and we can execute a "transposed" version of B without executing A. &lt;br&gt;
This "transpose(A,B)" function, which the app needs to provide, makes sense in the context of text editing, where the cursor position, for example, could be different if operation A had never happened. The algorithm keeps working its way up the stack until it either reaches the top (success) or hits a conflict. If it hits a conflict, it tries to get rid of it by checking whether the "future" contains the reverse operation; if so, both can be safely removed. If the conflict cannot be removed, the algorithm determines that the operation cannot be undone (failure). When that happens, the paper offers two options (see section 8.1.4):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Show the conflicts to the user and ask them to resolve.&lt;/li&gt;
&lt;li&gt;Tell the user about the conflict, ignore that undo operation, and allow the user to undo older operations. &lt;/li&gt;
&lt;/ol&gt;
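&lt;p&gt;The bubbling procedure can be sketched roughly like this (my simplification: the transpose step is treated as identity, and the conflict rule - "two operations conflict when they touch the same entity" - is purely illustrative):&lt;/p&gt;

```typescript
// Selective undo via "bubbling": move operation A up the stack past
// operations it doesn't conflict with; a conflict blocks the undo.
type Op = { name: string; target: string };

function conflict(a: Op, b: Op): boolean {
  return a.target === b.target; // illustrative stand-in for Conflict(A, B)
}

// Returns the history with ops[index] removed, or null if a conflict
// makes that operation impossible to undo.
function selectiveUndo(ops: Op[], index: number): Op[] | null {
  const history = ops.slice();
  let i = index;
  while (i + 1 !== history.length) {
    const a = history[i];
    const b = history[i + 1];
    if (conflict(a, b)) return null; // cannot reorder A past B
    // swap: in the paper, this executes a transposed version of B
    history[i] = b;
    history[i + 1] = a;
    i += 1;
  }
  history.pop(); // A is now on top of the stack and can be dropped
  return history;
}

const ops: Op[] = [
  { name: "create", target: "a" },
  { name: "edit", target: "b" },
  { name: "edit", target: "a" },
];
console.log(selectiveUndo(ops, 1)); // succeeds: [create a, edit a]
console.log(selectiveUndo(ops, 0)); // null: "edit a" sits on top of "create a"
```

&lt;p&gt;This mirrors the paper's rule that operation A cannot be undone unless the conflicting operation B is undone first.&lt;/p&gt;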

&lt;p&gt;While I like the general idea, I had some issues with this algorithm:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;For a general-purpose undo-manager, outside of collaborative text editing (which most apps use &lt;a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type" rel="noopener noreferrer"&gt;CRDTs&lt;/a&gt; for nowadays), it seemed excessive to expect the app to provide transpose functions, which must satisfy five mathematical properties (see section 4.2.2 in the paper).&lt;/li&gt;
&lt;li&gt;I don't want users to try to undo something and end up failing. I want to detect conflicts ahead of time - for better UX.&lt;/li&gt;
&lt;li&gt;While I think it makes sense to skip bad operations, I am not sure that it's a good idea to ask the user to do something about it. If we wanted that we should have implemented features like version-history or version-control rather than undo/redo. Keeping the user informed is, generally speaking, a good idea, but we should aspire to do it in a non-disruptive manner.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Dealing with conflicts - undo-manager perspective (in practice)
&lt;/h3&gt;

&lt;p&gt;With all this in mind, I ended up with the following implementation: &lt;br&gt;
The undo-manager allows each entry (action) to have the attributes 'scopeName', 'hasUndoConflict', and 'hasRedoConflict'. Unlike the "Conflict(A,B)" function from the paper, which takes two adjacent operations, my functions check whether a single undo or redo operation is valid in the context of the current state of the app. &lt;/p&gt;

&lt;p&gt;The conflict checks run on the "head" of the undo and redo stacks after every operation, and remove conflicting entries (and everything else with the same scopeName) until a non-conflicting one is found. This way, the next action the user can take is always a non-conflicting one.&lt;br&gt;
The undo-manager provides a way to tell the user when conflicting entries are removed (via a "change reason"), but in the demo app I used it in a very subtle way. &lt;/p&gt;

&lt;p&gt;All in all, here is the type definition for a single undo-redo entry:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight typescript"&gt;&lt;code&gt;&lt;span class="k"&gt;export&lt;/span&gt; &lt;span class="kd"&gt;type&lt;/span&gt; &lt;span class="nx"&gt;UndoEntry&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="na"&gt;operation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;reverseOperation&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="k"&gt;void&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;hasUndoConflict&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;hasRedoConflict&lt;/span&gt;&lt;span class="p"&gt;?:&lt;/span&gt; &lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="nx"&gt;boolean&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nb"&gt;Promise&lt;/span&gt;&lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="nx"&gt;boolean&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// determines what gets removed when there are conflicts&lt;/span&gt;
    &lt;span class="nl"&gt;scopeName&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="c1"&gt;// will be returned from subscribeToCanUndoRedoChange so you can display it in a tooltip next to the buttons (e.g "un-create item")&lt;/span&gt;
    &lt;span class="nl"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;reverseDescription&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kr"&gt;string&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="p"&gt;};&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the full implementation details and how it is used, have a look at the code, for example &lt;a href="https://github.com/isaacHagoel/todo-replicache-sveltekit/blob/a75ff89f62e1cb1e0196390586126bdff6cb733e/src/routes/list/%5BspaceID%5D/%2Bpage.svelte#L171" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with conflicts - app perspective
&lt;/h3&gt;

&lt;p&gt;So the undo-manager facilitates a way to deal with conflicts, but it's up to the app to provide the actual conflict-checking logic. How should it go about that? One useful concept is "ownership".&lt;/p&gt;

&lt;p&gt;In a single-user app, there is no question about who owns the data - there is only one user. But what about a multi-user, collaborative app? We can think about it as follows: the last user who directly modified a piece of data (not via undo or redo) owns it. The owner of a piece of data can safely undo/redo their changes to it without overriding someone else's changes.&lt;br&gt;
For example, if I created a todo item and you modified its description, I shouldn't be able to undo the creation of that item, but I should still be able to undo anything else that I have in my undo stack. The granularity of the ownership is determined by the app. If I modified the text of an item and you then marked it as completed, it is probably okay for me to undo my change because I own the description and you own the completeness. If I later directly modify the completeness, I should be able to undo and redo that, and you shouldn't, because I took ownership over completeness.&lt;/p&gt;
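&lt;p&gt;One way to express this ownership rule as a conflict check - a hedged sketch that assumes an "updatedBy" field on each item (my demo's schema has one, as mentioned below; your data model may differ):&lt;/p&gt;

```typescript
type Todo = { id: string; text: string; updatedBy: string };

// Undoing my change to a todo conflicts if someone else has since
// taken ownership (i.e. was the last direct modifier), or if the
// item no longer exists at all.
function hasUndoConflict(todo: Todo | undefined, currentUserId: string): boolean {
  if (todo === undefined) return true; // deleted by another user
  return todo.updatedBy !== currentUserId;
}
```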

&lt;p&gt;The gif below shows a simple example: When the user on the right edits the text of the second item, the user on the left loses the ability to undo any edits to its text or its creation, but still has the ability to undo the creation of the first item (which they still own). A sharp-eyed viewer would notice that the undo icon on the left animates when the conflicting entries are removed (when the user on the right enters the text "nope!"). &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni22vrc72vkt2amxt20f.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fni22vrc72vkt2amxt20f.gif" alt="undo_demo_conflict" width="714" height="376"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you don't agree with the specific logic I applied here, that's okay - the logic is totally flexible. The important thing is that we have an undo-manager that makes this possible.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting the user experience right
&lt;/h3&gt;

&lt;p&gt;In the gif above, did you notice the little tooltip (just a "title" attribute in this demo) that appears when hovering over the undo/redo buttons and tells the user what's going to happen when they click? Did you notice how the buttons animate when there is a change in the undo/redo stack? To achieve these, the undo-manager provides &lt;a href="https://github.com/isaacHagoel/todo-replicache-sveltekit/blob/undo-redo/src/lib/undo/simplePubSub.ts" rel="noopener noreferrer"&gt;a simple pub-sub service&lt;/a&gt; so that the consumer can stay up to speed.&lt;/p&gt;
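&lt;p&gt;The pub-sub idea itself is tiny; a self-contained sketch (not the actual linked module) looks roughly like this:&lt;/p&gt;

```typescript
type Listener<T> = (payload: T) => void;

// Minimal publish/subscribe channel: every subscriber receives every
// published payload until it unsubscribes.
function createPubSub<T>() {
  const listeners = new Set<Listener<T>>();
  return {
    subscribe(listener: Listener<T>): () => void {
      listeners.add(listener);
      return () => {
        listeners.delete(listener); // the unsubscribe handle
      };
    },
    publish(payload: T): void {
      for (const listener of listeners) listener(payload);
    },
  };
}
```

&lt;p&gt;The UI subscribes once and reacts to each notification (e.g., re-reading the next undo description for the tooltip, or triggering the animation when entries are removed).&lt;/p&gt;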

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FisaacHagoel%2Ftodo-replicache-sveltekit%2Fassets%2F20507787%2F6500dfb3-ad91-4d1e-b7d7-a9f69271b261" class="article-body-image-wrapper"&gt;&lt;img alt="next undo operation tooltip" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fgithub.com%2FisaacHagoel%2Ftodo-replicache-sveltekit%2Fassets%2F20507787%2F6500dfb3-ad91-4d1e-b7d7-a9f69271b261" width="518" height="158"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Dealing with asynchrony
&lt;/h3&gt;

&lt;p&gt;The undo-manager has to be able to handle both synchronous and asynchronous operations because all the Replicache calls are async, and in other real-world systems, any call to the backend or external APIs would be async as well. The challenge with async operations is that they can complete in a different order from the one in which they started (depending on how long each promise takes to resolve) and can query the system while it's "between states." For example, an operation starts running, and then, while it's awaiting something, an "undo" or a conflict check starts running and makes its own changes. To deal with that, I introduced &lt;a href="https://github.com/isaacHagoel/todo-replicache-sveltekit/blob/undo-redo/src/lib/undo/serialAsyncExecutor.ts" rel="noopener noreferrer"&gt;a simple module&lt;/a&gt; that executes the async operations serially using a queue.&lt;/p&gt;
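&lt;p&gt;The core trick of such a serial executor is to chain every operation onto the previous one's promise. A sketch (simplified relative to the linked module):&lt;/p&gt;

```typescript
// Chains async operations so each one starts only after the previous
// one has settled, preventing interleaved reads/writes of shared state.
function createSerialExecutor() {
  let tail: Promise<unknown> = Promise.resolve();
  return {
    execute<T>(operation: () => Promise<T>): Promise<T> {
      // Start after whatever is currently queued, whether it succeeded or failed.
      const result = tail.then(operation, operation);
      // Keep the chain alive even if this operation rejects.
      tail = result.catch(() => undefined);
      return result;
    },
  };
}
```

&lt;p&gt;Routing operations, reverse-operations, and conflict checks through one executor means none of them can observe the app mid-mutation.&lt;/p&gt;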

&lt;h3&gt;
  
  
  Places where my demo implementation falls short
&lt;/h3&gt;

&lt;p&gt;If you read my &lt;a href="https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi"&gt;previous post&lt;/a&gt;, you’d know that I was thoroughly impressed by Replicache. That's still true, but I hit some walls (missing features) when adding undo-redo to the app. It’s important to note that the undo-manager itself is generic (agnostic about Replicache) but does have some expectations that Replicache fails to meet (all seem fixable). Here are the main challenges I faced:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Distinguishing between self-inflicted and external updates&lt;/strong&gt;: When Replicache informs the app about incoming changes from the server, it doesn’t indicate whether these changes originated in the current session or in some other (external) session. We’d like to initiate conflict checks only when there are incoming external changes - for local ones, the checks already ran when the action was taken (before sending it to the server) - but how can we tell them apart? I ended up adding some &lt;a href="https://github.com/isaacHagoel/todo-replicache-sveltekit/blob/undo-redo/src/routes/list/%5BspaceID%5D/%2Bpage.svelte#L26" rel="noopener noreferrer"&gt;app-specific logic&lt;/a&gt; to detect that. Ideally, once Replicache exposes that info (relevant issue &lt;a href="https://github.com/rocicorp/replicache/issues/1058" rel="noopener noreferrer"&gt;here&lt;/a&gt;), this kind of hack won’t be needed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Failure of an optimistic update&lt;/strong&gt;: Replicache uses optimistic updates, meaning that the server can reject operations or produce a different outcome than expected. When that happens, the state is rolled back and adjusted to reflect the server state. When Replicache does that, it doesn’t notify the app; the rollback comes in like any other state update. That makes it hard to adjust the undo stack, which still contains the original updates that the server just rejected. I could probably have worked around it but opted to leave it unimplemented in this POC. Ideally, Replicache would expose that information in the future.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coming back from offline mode&lt;/strong&gt;: If you go offline, you can do anything you want (including undo/redo), and when you come back online, your changes will be pushed to the server and override anything other users did while you were offline. This problem is not specific to undo/redo but a result of the &lt;a href="https://www.linkedin.com/pulse/last-write-wins-database-systems-yeshwanth-n-emc8c/" rel="noopener noreferrer"&gt;Last Write Wins&lt;/a&gt; conflict-resolution strategy that my app uses. In theory, this could be mitigated by adding more sophisticated logic to the server’s mutators, but that would require more research to get right (maybe a good subject for another post).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Lack of Integration with CRDT-based undo&lt;/strong&gt;: Supporting embedded text editors in a way that is user friendly remains a challenge (I haven't attempted it yet, another idea for a future post :)).&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Closing thoughts
&lt;/h3&gt;

&lt;p&gt;We've covered a lot of ground in this post. While I spent most of it discussing different aspects of the undo-manager, in reality the majority of my time and effort went into carefully thinking through and implementing the app's reverse-operations and conflict-checking logic. &lt;/p&gt;

&lt;p&gt;In some cases, I had to refactor the app and break down operations to make them undo/redo-friendly. For example, “completeAllItems” couldn’t remain a simple loop that calls “updateItem” with each item-id; it had to become its own thing with its own reverse logic (because maybe another user added or edited items). Some changes to the backend were required as well, such as adding an “un-delete” operation, which is different from “create” because it preserves the original sort position of the item. The database schema changed because I needed to add an "updatedBy" field on each todo, and these are just some examples. Testing is another task that grows considerably when your app has undo-redo. &lt;/p&gt;

&lt;p&gt;In other words, undo-redo is one of those features that make every other feature in your app more complicated and time-consuming to implement and maintain. Is it worth it? I think the answer is a resounding yes for productivity apps and content-editors of any kind, but you need to know what you’re getting yourself into. It is definitely not for the faint of heart.&lt;/p&gt;

&lt;p&gt;Thank you for reading. Feel free to leave a comment if you have any questions or insights. &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>programming</category>
      <category>learning</category>
    </item>
    <item>
      <title>Are Sync Engines The Future of Web Applications?</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Mon, 17 Jun 2024 21:13:15 +0000</pubDate>
      <link>https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi</link>
      <guid>https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi</guid>
      <description>&lt;p&gt;Look at the GIF below — it shows a real-time &lt;a href="https://todo-replicache-sveltekit.onrender.com/" rel="noopener noreferrer"&gt;Todo-MVC demo&lt;/a&gt;, syncing across windows and smoothly transitioning in and out of offline mode. While it’s just a simple demo app, it showcases important, cutting-edge concepts that every web developer should know. This is a &lt;a href="https://replicache.dev/" rel="noopener noreferrer"&gt;Replicache&lt;/a&gt; demo app that I ported from an Express backend and web components frontend to SvelteKit to learn about the technology and concepts behind it. I want to share my learnings with you. The source code is available &lt;a href="https://github.com/isaacHagoel/todo-replicache-sveltekit" rel="noopener noreferrer"&gt;on Github&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu07dvrmsgmo3jtnokju8.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu07dvrmsgmo3jtnokju8.gif" alt="sveltekit-replicache-demo" width="692" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Context and motivation
&lt;/h2&gt;

&lt;p&gt;Web applications face some fundamentally hard problems, problems most web frameworks seem to ignore. These problems are so hard that only very few apps actually solve them well, and those apps stand head and shoulders above other apps in their respective space.&lt;br&gt;&lt;br&gt;
Here are some such problems I had to deal with in actual commercial apps I worked on: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Getting the app to feel snappy even when it talks to the server, even over a slow or patchy network. This applies not only to the initial load time but also to interactions after the app has loaded. &lt;a href="https://developer.mozilla.org/en-US/docs/Glossary/SPA" rel="noopener noreferrer"&gt;SPAs&lt;/a&gt; were an early and ultimately insufficient attempt at solving this.&lt;/li&gt;
&lt;li&gt;Implementing undo/redo and version history for user-generated content (e.g., site building, e-commerce, online course builders).&lt;/li&gt;
&lt;li&gt;Getting the app to work correctly when open simultaneously by the same user on multiple tabs/devices. &lt;/li&gt;
&lt;li&gt;Handling long-lived sessions running an old version of the frontend, which users might not want to refresh to avoid losing work.&lt;/li&gt;
&lt;li&gt;Making collaboration features/multiplayer functionalities work correctly and near real-time, including conflict resolution.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I encountered these problems while working on totally normal web applications, nothing too crazy, and I believe most web apps will hit some or all of them as they gain traction. &lt;br&gt;
A pattern I noticed in dev teams that start working on a new product is to ignore these problems completely, even if the team is aware of them. The reasoning is usually along the lines of "we'll deal with it when we start actually having these problems." The team would then go on to pick some well-established frameworks (pick your favorite) thinking these tools surely offer solutions to any common problem that may arise. Months later, when the app hits ten thousand active users, reality sinks in: the team has to introduce partial, patchy solutions that add complexity and make the system even more sluggish and buggy, or rewrite core parts (which no one ever does right after launch). Ouch. &lt;br&gt;
I felt this pain. The pain is real. &lt;br&gt;
Enter "Sync Engine."&lt;/p&gt;

&lt;h2&gt;
  
  
  What the hell is a sync engine?
&lt;/h2&gt;

&lt;p&gt;Remember I said that some apps address these issues much better than others? Recent famous examples are &lt;a href="https://linear.app/isaach" rel="noopener noreferrer"&gt;Linear&lt;/a&gt; and &lt;a href="https://www.figma.com/" rel="noopener noreferrer"&gt;Figma&lt;/a&gt;. Both have disrupted incredibly competitive markets by being technologically superior. Other examples are &lt;a href="https://superhuman.com/" rel="noopener noreferrer"&gt;Superhuman&lt;/a&gt; and a decade prior, &lt;a href="https://trello.com/" rel="noopener noreferrer"&gt;Trello&lt;/a&gt;. When you look into what they did, you discover that they all converged on very similar patterns, and they all developed their respective implementations in-house. You can read about how they did it (highly recommended) in these links: &lt;a href="https://www.figma.com/blog/how-figmas-multiplayer-technology-works/" rel="noopener noreferrer"&gt;Figma&lt;/a&gt;, &lt;a href="https://www.youtube.com/live/WxK11RsLqp4?feature=share&amp;amp;t=2175" rel="noopener noreferrer"&gt;Linear&lt;/a&gt;, &lt;a href="https://blog.superhuman.com/superhuman-is-built-for-speed/" rel="noopener noreferrer"&gt;Superhuman&lt;/a&gt;, &lt;a href="https://www.atlassian.com/engineering/sync-architecture" rel="noopener noreferrer"&gt;Trello (series)&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;At the core of the system, there is always a sync engine that acts as a persistent buffer between the frontend and the backend. At a high level, this is how it works:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The client always reads from and writes to a local store that is provided by the engine. As far as the app code is concerned, it runs locally in memory.&lt;/li&gt;
&lt;li&gt;That store is responsible for updating the state optimistically, persisting the data locally in the browser's storage, and syncing it back and forth with the backend, including dealing with potential complications and edge cases.&lt;/li&gt;
&lt;li&gt;The backend implements the other half of the engine, to allow pulling and pushing data, notifying the clients when data has changed, persisting the data in a database, etc.&lt;/li&gt;
&lt;/ul&gt;
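&lt;p&gt;As a toy illustration of the first two bullets - purely illustrative and not any specific engine's API; real engines add persistence, rebase, retry, and server sync:&lt;/p&gt;

```typescript
// A toy "local store": app code writes through named mutators, the
// change is applied optimistically right away, and the mutation is
// queued for the (not implemented here) sync engine to push later.
type Mutator<S> = (state: S, args: unknown) => S;

function createLocalStore<S>(initial: S, mutators: Record<string, Mutator<S>>) {
  let state = initial;
  const pending: { name: string; args: unknown }[] = []; // outbox for the server
  return {
    read: () => state,
    mutate(name: string, args: unknown) {
      state = mutators[name](state, args); // optimistic, immediate
      pending.push({ name, args });        // queued for sync
    },
    pendingMutations: () => pending.length,
  };
}
```

&lt;p&gt;The key property: app code never waits on the network; the engine reconciles with the backend in the background.&lt;/p&gt;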

&lt;p&gt;Different implementations of sync engines make different tradeoffs, but the basic idea is always the same.&lt;/p&gt;

&lt;h2&gt;
  
  
  Not a new idea but...
&lt;/h2&gt;

&lt;p&gt;If you've been following trends in the web-dev world, you'd know that sync engines have been a centrepiece in several of them, namely: &lt;a href="https://web.dev/articles/what-are-pwas" rel="noopener noreferrer"&gt;progressive web apps&lt;/a&gt;, &lt;a href="https://offlinefirst.org/" rel="noopener noreferrer"&gt;offline-first apps&lt;/a&gt;, and the lately trending term: &lt;a href="https://www.inkandswitch.com/local-first/" rel="noopener noreferrer"&gt;local-first software&lt;/a&gt;. You might have even looked into some of the databases that offer a built-in sync engine such as &lt;a href="https://pouchdb.com/" rel="noopener noreferrer"&gt;PouchDB&lt;/a&gt; or online services that do the same (e.g., &lt;a href="https://firebase.google.com/docs/firestore" rel="noopener noreferrer"&gt;Firestore&lt;/a&gt;). I have too, but my general feeling over the last few years has been that none of it is quite hitting the nail on the head. Progressive web apps were about users "installing" shortcuts to websites on their home screens as if they were native apps, despite not needing installation being maybe "the" benefit of the web. "Offline-first" made it sound like offline mode is more important than online, which for 99% of web apps is simply not the case. "Local-first" is admittedly the best name so far, but the official &lt;a href="https://www.inkandswitch.com/local-first/" rel="noopener noreferrer"&gt;local-first manifesto&lt;/a&gt; talks about peer-to-peer communication and &lt;a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type" rel="noopener noreferrer"&gt;CRDTs&lt;/a&gt; (a super cool idea but one that is rarely used for anything besides collaborative text editing) in a world of full client-server web applications that are trying to solve practical problems like the ones I described above. Ironically, many tools that are part of the current "local-first" wave adopted the name without adopting all the principles.&lt;/p&gt;

&lt;p&gt;The one that drew my attention and interest the most is called "Replicache." Specifically, I was intrigued by it exactly because it's NOT a self-replicating database or a black-box SaaS service that you have to build your entire app around. Instead, it offers much more control, flexibility, and separation of concerns than any off-the-shelf solution I have encountered in this space.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is Replicache?
&lt;/h2&gt;

&lt;p&gt;Replicache is a library. On the frontend, it requires very little wiring and effectively functions as a normal global store (think Zustand or a Svelte store). It has a chunk of state (in our example, each list has its own store). It can be mutated using a set of user-defined functions called "mutators" (think reducers) like "addItem", "deleteItem," or anything you want, and exposes a subscribe function (I am simplifying, full API &lt;a href="https://doc.replicache.dev/api/classes/Replicache" rel="noopener noreferrer"&gt;here&lt;/a&gt;). &lt;/p&gt;
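&lt;p&gt;To make the "mutators as reducers" idea concrete, here is a hedged sketch of what a mutator map can look like. With the real library you would pass the map to the Replicache constructor's "mutators" option; the transaction stand-in below is my own, for illustration only, so the sketch is self-contained:&lt;/p&gt;

```typescript
// A mutator receives a transaction handle plus the arguments the app
// passed in. The Tx type below is a stand-in, not Replicache's
// WriteTransaction - it just mimics key/value set and delete.
type Tx = {
  set(key: string, value: unknown): Promise<void>;
  del(key: string): Promise<boolean>;
};

const mutators = {
  async addItem(tx: Tx, args: { id: string; text: string }) {
    await tx.set(`item/${args.id}`, { text: args.text, completed: false });
  },
  async deleteItem(tx: Tx, args: { id: string }) {
    await tx.del(`item/${args.id}`);
  },
};

// Map-backed stand-in transaction (illustration only).
function fakeTx(store: Map<string, unknown>): Tx {
  return {
    async set(key, value) { store.set(key, value); },
    async del(key) { return store.delete(key); },
  };
}
```

&lt;p&gt;Because a mutation is just "mutator name + arguments", the same map can be replayed on the client (optimistically) and on the server (authoritatively).&lt;/p&gt;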

&lt;p&gt;Behind this familiar interface lies a robust and performant client-side sync engine that handles:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initial full download of the relevant data to the client.&lt;/li&gt;
&lt;li&gt;Pulling and pushing "mutations" to and from the backend. A mutation is an event that specifies which mutator was applied, with which parameters (plus some metadata).

&lt;ul&gt;
&lt;li&gt;When pushing, these changes are applied optimistically on the client, and rolled back if they fail on the server. Any other pending changes would be applied on top (rebase).&lt;/li&gt;
&lt;li&gt;The sync mechanism also includes queuing changes if the connection is lost, retry mechanisms, applying changes in the right order, and de-duping.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Caching everything in memory (performance) and persisting it to the browser storage (specifically IndexedDB) for backup.&lt;/li&gt;
&lt;li&gt;Since the same storage is accessible from all the tabs of the same application, the engine deals with all the implications of that—like what to do when there was a schema change but some tabs have refreshed and some haven't and are still using the old schema.&lt;/li&gt;
&lt;li&gt;Keeping all the tabs in sync instantly using a broadcast channel (since relying on the shared storage is not fast enough).&lt;/li&gt;
&lt;li&gt;Dealing with cases in which the browser decides to wipe out the local storage.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You might have noticed that this right here addresses a big chunk of the problems I listed at the top of this post. Being mutations-based also lends itself to features like undo/redo.&lt;/p&gt;

&lt;p&gt;In order for all of this to work, it's your backend's job to implement the protocol that Replicache defines. Specifically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You need to implement &lt;a href="https://doc.replicache.dev/reference/server-push" rel="noopener noreferrer"&gt;push&lt;/a&gt; and &lt;a href="https://doc.replicache.dev/reference/server-pull" rel="noopener noreferrer"&gt;pull&lt;/a&gt; APIs. These endpoints need to be able to activate mutators similarly to the frontend (though they don't have to run the same logic). The backend is authoritative, and conflict resolution is done by your code within the mutator implementation.&lt;/li&gt;
&lt;li&gt;Your database needs to support snapshot isolation and run operations within transactions.&lt;/li&gt;
&lt;li&gt;The Replicache client polls the server periodically to check for changes, but if you want close to real-time sync between clients, you need to implement a "poke" mechanism, namely a way to notify the clients that something has changed and they need to pull now. This could be done via &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events" rel="noopener noreferrer"&gt;server-sent events&lt;/a&gt; or &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/WebSocket" rel="noopener noreferrer"&gt;websockets&lt;/a&gt;. It's an interesting API design choice—changes are never pushed to the client; the client always pulls them. I believe it is done this way for simplicity and ease of reasoning about the system. One thing for sure: it's good that they didn't make websockets mandatory because that would have made the protocol incompatible with HTTP (server-sent events stream over a normal HTTP connection), which would have required extra infrastructure and presented additional integration challenges.&lt;/li&gt;
&lt;li&gt;Depending on the &lt;a href="https://doc.replicache.dev/strategies/overview" rel="noopener noreferrer"&gt;versioning strategy&lt;/a&gt;, you might need to implement additional operations (e.g., createSpace).&lt;/li&gt;
&lt;/ol&gt;
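&lt;p&gt;A poke carries no payload - it's just a nudge to pull. Over server-sent events, the wire format is as simple as this sketch (the client-side wiring is shown as a comment because EventSource is browser-only):&lt;/p&gt;

```typescript
// Formats one server-sent event frame. A poke needs no data, so the
// data field can stay empty - the client reacts to the event itself.
function sseFrame(event: string, data = ""): string {
  return `event: ${event}\ndata: ${data}\n\n`;
}

// On the client, roughly (assuming `rep` is your Replicache instance):
// new EventSource("/api/poke").addEventListener("poke", () => rep.pull());
```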

&lt;p&gt;If it sounds non-trivial to you, you are right. I don't think I fully wrapped my head around all the details of how it operates with the database. I'll need to do a follow-up project in which I totally refactor the database structure and/or add meaningful features to the example (e.g., version history) in order to get closer to fully grokking it. The thing is, I know how valuable this level of control is when building and maintaining real production apps. In my book, spending a week or two thinking deeply about and setting up the core part of your application is a great investment if it creates a strong foundation to build and expand upon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Porting a non-trivial example
&lt;/h2&gt;

&lt;p&gt;The best (and arguably only) way to learn anything new is by getting your hands dirty—dirty enough to experience some of the tradeoffs and implications that would affect a real app. As I was going over the &lt;a href="https://doc.replicache.dev/examples/todo" rel="noopener noreferrer"&gt;examples on the Replicache website&lt;/a&gt;, I noticed there were none for SvelteKit. I have been a huge Svelte fan since Svelte 3 was released, but only recently started playing with SvelteKit. I thought this would be an awesome opportunity to learn by doing and create a useful reference implementation at the same time.&lt;/p&gt;

&lt;p&gt;Porting an existing codebase to a different technology is educational because, as you translate the code, you are forced to understand and question it. Throughout the process, I experienced multiple eureka moments as things that seemed odd at first clicked into place.&lt;/p&gt;

&lt;h2&gt;
  
  
  Learnings
&lt;/h2&gt;

&lt;h4&gt;
  
  
  SvelteKit
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;SvelteKit &lt;a href="https://github.com/sveltejs/kit/issues/1491" rel="noopener noreferrer"&gt;doesn't natively support WebSockets&lt;/a&gt;, and even though it does support server-sent events, it does so in a &lt;a href="https://stackoverflow.com/questions/74879852/how-can-i-implement-server-sent-events-sse-in-sveltekit" rel="noopener noreferrer"&gt;clumsy way&lt;/a&gt;. Express supports both nicely. As a result, I used &lt;a href="https://github.com/razshare/sveltekit-sse" rel="noopener noreferrer"&gt;svelte-sse&lt;/a&gt; for server-sent events. One somewhat annoying quirk I ran into is that since svelte-sse returns a Svelte store, which my app wasn't subscribing to (the app doesn't need to read the value, just to trigger a pull as I described above), the whole thing was just optimized away by the compiler. I was initially scratching my head about why messages were not coming through. I ended up having to implement a workaround for that behavior. I don't blame the author of the library; they assumed a meaningful value would be sent to the client, which is not the case with 'poke'.&lt;/li&gt;
&lt;li&gt;SvelteKit's filesystem-based routing, load functions, layouts, and other features allowed for a better-organized codebase and less boilerplate code compared to the original Express backend. Needless to say, on the frontend, Svelte is miles ahead of web components, resulting in a frontend codebase that is smaller and more readable even though it has more functionality (the original example TodoMVC was missing features such as "mark all as complete" and "delete completed").&lt;/li&gt;
&lt;li&gt;Overall, I love SvelteKit and plan to keep using it in the future. If you haven't tried it, &lt;a href="https://learn.svelte.dev/tutorial/introducing-sveltekit" rel="noopener noreferrer"&gt;the official tutorial&lt;/a&gt; is an awesome introduction.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Replicache
&lt;/h3&gt;

&lt;p&gt;Overall, I am super impressed by Replicache and would recommend trying it out. At the basic level (which is all I got to try at this point), it works very well and delivers on all its promises. With that said, here are some general concerns (not todo app related) I have and thoughts related to them:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance-related:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Initial load time&lt;/strong&gt; (first time, before any data was ever pulled to the client) might be long when there is a lot of data to download (think tens of MBs). Productivity apps in which the user spends a lot of time after the initial load are less sensitive to this, but it is still something to watch for. Potential mitigation: partial sync (e.g., Linear only sends open issues or ones that were closed over the last week instead of sending all issues).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chatty network (?)&lt;/strong&gt; - Initially, it seemed to me that there was a lot of chatter going back and forth between the client and the server with all the push, pull, and poke calls flying around. On deeper inspection, I realized my intuition was wrong. There is frequent communication, yes, but since the mutations are very compact and the poke calls are tiny (no payload), it amounts to much less than your normal REST/GraphQL app. Also, a browser full reload (refresh button or opening the page again in a new tab/window after it was closed) loads most of the data from the browser's storage and only needs to pull the diffs from the server, which leads me to the next point.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coming back after a long period of time offline&lt;/strong&gt;: I haven't tested this one, but it seems like a real concern. What happens if I was working offline for a few days making updates while my team was online and also making changes? When I come back online, I could have a huge amount of diffs to push and pull. Additionally, conflict resolution could become super difficult to get right. This is a problem for every collaborative app that has an offline mode and is not unique to Replicache. The Replicache docs &lt;a href="https://doc.replicache.dev/concepts/offline" rel="noopener noreferrer"&gt;warn about this situation&lt;/a&gt; and propose implementing "the concept of history" as a potential mitigation.&lt;/li&gt;
&lt;li&gt;What about &lt;strong&gt;bundle size&lt;/strong&gt;? Replicache is &lt;a href="https://bundlephobia.com/package/replicache@14.2.2" rel="noopener noreferrer"&gt;34kb gzipped&lt;/a&gt;, and for what you get in return, it's easily worth it.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://doc.replicache.dev/concepts/performance" rel="noopener noreferrer"&gt;This page&lt;/a&gt; on the Replicache website makes me think that, in the general case, performance should be very good.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Functionality-related:&lt;/strong&gt; 

&lt;ul&gt;
&lt;li&gt;Unlike native mobile or desktop apps, it is possible for users to &lt;strong&gt;lose the local copy of their work&lt;/strong&gt; because the browser's storage doesn't provide the same guarantees as the device's file system. Browsers can just decide to delete all the app's data under certain conditions. If the user has been online and has work that didn't have a chance to get pushed to the server, that work would be lost in such a case. Again, this problem is not unique to Replicache and affects all web apps that support offline mode, and based on what I read, it is unlikely to affect most users. It's just something to keep in mind.&lt;/li&gt;
&lt;li&gt;I was surprised to see that the &lt;strong&gt;schema in the backend database&lt;/strong&gt; in the Todo example I ported doesn't have the "proper" relational definitions I would expect from a SQL database. There is no "items" table with fields for "id", "text", or "completed". The reason I would want that is the same reason I want a relational database in the first place: to be able to easily slice and dice the data in my system (something I've always come to miss down the line when I didn't have it). I don't think it is a major concern, since Replicache is supposed to be backend-agnostic as long as the protocol is implemented according to spec. I might try to refactor the database as a follow-up exercise to see what that means in terms of complexity and ergonomics.&lt;/li&gt;
&lt;li&gt;I find &lt;strong&gt;version history and undo/redo&lt;/strong&gt; super useful and desirable in apps with user-editable content. With regard to undo/redo, there is an &lt;a href="https://github.com/rocicorp/undo" rel="noopener noreferrer"&gt;official package&lt;/a&gt;, but it seems to &lt;a href="https://github.com/rocicorp/replicache/issues/1008" rel="noopener noreferrer"&gt;lack support for the multiplayer use case&lt;/a&gt; (which is where the hard problems come from). As for version history, the Replicache documentation mentions "the concept of history" but &lt;a href="https://doc.replicache.dev/concepts/offline" rel="noopener noreferrer"&gt;suggests talking to them&lt;/a&gt; if the need arises. That makes me think it might not be straightforward to achieve. Another idea for a follow-up task.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Collaborative text editing&lt;/strong&gt; - the existing conflict resolution approach won't work well for collaborative text editing, which requires &lt;a href="https://en.wikipedia.org/wiki/Conflict-free_replicated_data_type" rel="noopener noreferrer"&gt;CRDTs&lt;/a&gt; or &lt;a href="https://en.wikipedia.org/wiki/Operational_transformation" rel="noopener noreferrer"&gt;OT&lt;/a&gt;. I wonder how easy it would be to integrate Replicache with something like &lt;a href="https://yjs.dev/" rel="noopener noreferrer"&gt;Yjs&lt;/a&gt;. There is an &lt;a href="https://github.com/rocicorp/replicache-yjs" rel="noopener noreferrer"&gt;official example repo&lt;/a&gt;, but I haven't looked into it yet.&lt;/li&gt;
&lt;/ul&gt;
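&lt;p&gt;On the lost-local-data point: one partial mitigation worth knowing about is the Storage API's persistence request (my own sketch below, not something the example app does). The browser can be asked not to evict the origin's storage under pressure; support and behaviour vary by browser, and the request can be declined.&lt;/p&gt;

```javascript
// Hedged sketch: ask the browser to treat this origin's storage as
// persistent, so unsynced offline work is less likely to be evicted.
// The storage object is passed in so the helper can be exercised
// outside a browser; in a real app you would pass navigator.storage.
async function requestPersistence(storage) {
  if (storage) {
    if (storage.persist) {
      return storage.persist(); // resolves to true if the browser grants it
    }
  }
  return false; // API unavailable: nothing we can do
}

// In the browser: requestPersistence(navigator.storage)
```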


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Scaling-related:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;Since the server is stateful (holds open HTTP connections for server-sent events), I wonder how well it would scale. I've worked on production systems with &amp;gt;100k users that used WebSockets before, so I know it is not that big of a deal, but still something to think about.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;strong&gt;Other:&lt;/strong&gt;

&lt;ul&gt;
&lt;li&gt;In theory, Replicache can be &lt;strong&gt;added to existing apps&lt;/strong&gt; without rewriting the frontend (as long as the app already uses a similar store). The backend might be trickier. If your database doesn't support snapshot isolation, you are out of luck, and even if it does, the existing schema and your existing endpoints might need some serious rework. If you're going to use it, do it from day one (if you can).&lt;/li&gt;
&lt;li&gt;Replicache is &lt;strong&gt;not open source&lt;/strong&gt; (yet! see the point below) and is &lt;a href="https://replicache.dev/#pricing" rel="noopener noreferrer"&gt;free only as long as you're small or non-commercial&lt;/a&gt;. Given the amount of work (&amp;gt;2 years) that went into developing it and the quality of engineering on display, that seems fair. With that said, it makes adopting Replicache more of a commitment than picking up a free, open library. If you are a tier 2 (or higher) paying customer, you get a &lt;a href="https://doc.replicache.dev/howto/source-access" rel="noopener noreferrer"&gt;source license&lt;/a&gt;, so that if Replicache shuts down for some reason, your app is safe. Another option is to roll your own sync engine, as the big players (Linear, Figma) have done, but reaching the quality and performance that Replicache offers would be anything but easy or quick.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Crazy plot twist&lt;/strong&gt; (last-minute edit): As I was about to publish this post, I discovered that Replicache is going to be open sourced in the near future and that its parent company is planning to launch a new sync engine called "Zero". &lt;a href="https://zerosync.dev/" rel="noopener noreferrer"&gt;Here is the official announcement&lt;/a&gt;. It reads: "We will be open sourcing &lt;a href="https://replicache.dev/" rel="noopener noreferrer"&gt;Replicache&lt;/a&gt; and &lt;a href="https://reflect.net/" rel="noopener noreferrer"&gt;Reflect&lt;/a&gt;. Once Zero is ready, we will encourage users to move."
Ironically, Zero seems to be yet another solution that automagically syncs the backend database with the frontend database, which, at least to me personally, seems less attractive (because I want separation of concerns and control). With that said, these guys are experts in this domain and I am just a dude on the internet, so we'll have to wait and see. In the meantime, I plan on playing with Replicache some more.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h2&gt;
  
  
  Should a sync engine be used for everything?
&lt;/h2&gt;

&lt;p&gt;No, a sync engine shouldn't be used for everything. The good news is that you can have parts of your app using it while other parts still submit forms and wait for the server's response in the conventional manner. SvelteKit and other full-stack frameworks make this integration easy.&lt;br&gt;
Obvious situations where using a sync engine is a bad idea:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Optimistic updates make sense only when client changes are highly likely to succeed (with rollbacks being rare) and when the client possesses enough information to predict outcomes. For instance, in an online test where a student's answer must be sent to the server for grading, optimistic updates (and hence a sync engine) wouldn't be feasible. The same applies to critical actions such as placing orders or trading stocks. A good rule of thumb is that any action dependent on the server and incapable of functioning offline should not rely on a sync engine.&lt;/li&gt;
&lt;li&gt;Any app dealing with huge datasets that cannot fit on users' machines. For example, creating a local-first version of Google Search, or an analytics tool that processes gigabytes of data to generate results, is impractical. However, in scenarios where partial synchronisation suffices, a sync engine can still be beneficial. For instance, Google Maps can download and cache maps on client devices to operate offline, without needing high-resolution maps for every location worldwide all the time.&lt;/li&gt;
&lt;/ol&gt;
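&lt;p&gt;To make point 1 concrete, here is a minimal sketch (mine, not Replicache's actual API) of the optimistic-update contract: apply the mutation locally right away, and keep enough information to roll it back if the server later rejects it. When the client cannot predict the outcome, rollback becomes the common path and the UX falls apart, which is exactly why such actions don't belong behind a sync engine.&lt;/p&gt;

```javascript
// Hedged sketch of optimistic updates with rollback (not Replicache's API).
function createOptimisticStore() {
  let items = [];
  return {
    // Apply a mutation immediately and return a rollback handle.
    apply(mutate) {
      const snapshot = items.slice(); // cheap copy for shallow data
      items = mutate(items.slice());  // optimistic: the UI updates right away
      return {
        rollback() { items = snapshot; } // server rejected it: undo locally
      };
    },
    get items() { return items; }
  };
}

const store = createOptimisticStore();
const pending = store.apply(list => { list.push('new todo'); return list; });
// The UI already shows 'new todo'...
pending.rollback(); // ...until the server rejects the mutation
console.log(store.items.length); // 0
```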

&lt;h2&gt;
  
  
  A word on developer productivity and DX
&lt;/h2&gt;

&lt;p&gt;My impression is that having a sync engine can make DX (developer experience) much nicer. Frontend engineers just work with a normal store whose updates they can subscribe to, and the UI always stays up to date. There is no need to think about fetching anything, or about calling APIs or server actions, for the parts of the app that are governed by the sync engine. On the backend, I can't say much yet. It seems like it won't be harder than a traditional backend, but I can't say for sure.&lt;/p&gt;
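&lt;p&gt;That store-centric workflow is easy to picture with a tiny subscription store (a hand-rolled sketch in the spirit of Svelte's writable store, not Replicache's actual subscribe API): the UI subscribes once, and whatever the sync engine pulls or mutates simply flows through.&lt;/p&gt;

```javascript
// Hedged sketch: a minimal observable store. In a sync-engine app the
// engine, not the component, is the one calling set() with fresh data.
function writable(initial) {
  let value = initial;
  const subscribers = new Set();
  return {
    subscribe(fn) {
      subscribers.add(fn);
      fn(value); // fire immediately with the current value
      return () => subscribers.delete(fn); // unsubscribe handle
    },
    set(next) {
      value = next;
      subscribers.forEach(fn => fn(value)); // notify every subscriber
    }
  };
}

const todos = writable([]);
const renders = [];
todos.subscribe(list => renders.push(list.length)); // stand-in for a UI render
todos.set([{ id: 1, text: 'buy milk' }]); // e.g. the engine applied a pulled diff
console.log(renders); // [0, 1]
```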

&lt;h2&gt;
  
  
  Closing thoughts
&lt;/h2&gt;

&lt;p&gt;It's exciting to imagine the future of web apps as planet-scale, realtime, multiplayer collaboration tools that work reliably regardless of network conditions, while at the same time making the nasty problems I started this post with a thing of the past. &lt;br&gt;
I highly recommend that fellow web developers familiarise themselves with these new concepts, experiment with them, and maybe even contribute. &lt;br&gt;
Thanks for reading. Leave a comment if you have any questions or thoughts. Peace.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P.S.&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://youtu.be/cgTIsTWoNkM?si=Sssrbj09Z936QxEf" rel="noopener noreferrer"&gt;This interview&lt;/a&gt; with Aaron Boodman, the founder of the company that created Replicache, is great. Watch it and thank me later. &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>svelte</category>
      <category>javascript</category>
      <category>programming</category>
    </item>
    <item>
      <title>Svelte Reactivity Gotchas + Solutions (If you're using Svelte in production you should read this)</title>
      <dc:creator>Isaac Hagoel</dc:creator>
      <pubDate>Tue, 05 Oct 2021 02:26:13 +0000</pubDate>
      <link>https://dev.to/isaachagoel/svelte-reactivity-gotchas-solutions-if-you-re-using-svelte-in-production-you-should-read-this-3oj3</link>
      <guid>https://dev.to/isaachagoel/svelte-reactivity-gotchas-solutions-if-you-re-using-svelte-in-production-you-should-read-this-3oj3</guid>
      <description>&lt;p&gt;&lt;a href="https://svelte.dev/" rel="noopener noreferrer"&gt;Svelte&lt;/a&gt; is a great framework and my team has been using it to build production apps for more than a year now with great success, productivity and enjoyment. One of its core features is reactivity as a first class citizen, which is dead-simple to use and allows for some of the most expressive, declarative code imaginable: When some condition is met or something relevant has changed no matter why or how, some piece of code runs. It is freaking awesome and beautiful. Compiler magic.&lt;/p&gt;

&lt;p&gt;When you're just playing around with it, it seems to work in a frictionless manner, but as your apps become more complex and demanding, you might encounter all sorts of puzzling, undocumented behaviours that are very hard to debug. &lt;br&gt;
Hopefully this short post will help alleviate some of the confusion and get you back on track. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before we start, two disclaimers:&lt;/strong&gt; &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;All of the examples below are contrived. Please don't bother with comments like "you could have implemented the example in some other way to avoid the issue". I know. I promise you that we've hit every single one of these issues in real codebases, and that when a Svelte codebase is quite big and complex, these situations and misunderstandings can and do arise.&lt;/li&gt;
&lt;li&gt;I don't take credit for any of the insights presented below. They are a result of working through the issues with my team members as well as some members of the Svelte community.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  Gotcha #1: Implicit dependencies are evil
&lt;/h3&gt;

&lt;p&gt;This is a classic one. Let's say you write the following &lt;a href="https://svelte.dev/repl/c08fe37ebe054a6f9afd70f9b8535d45?version=3.43.1" rel="noopener noreferrer"&gt;code&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sending&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;a: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{a}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;b: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{b}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;{sum}&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It all works (click to the REPL link above or &lt;a href="https://svelte.dev/repl/c08fe37ebe054a6f9afd70f9b8535d45?version=3.43.1" rel="noopener noreferrer"&gt;here&lt;/a&gt;) but then in code review you are told to extract a function to calculate the sum for "readability" or whatever other reason. &lt;br&gt;
You &lt;a href="https://svelte.dev/repl/05441a29b61f4dc2ab256af8cfc90ac0?version=3.43.1" rel="noopener noreferrer"&gt;do it&lt;/a&gt; and get:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calcSum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sending&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;calcSum&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;a: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{a}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;b: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{b}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;{sum}&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The reviewer is happy but oh no, the code doesn't work anymore. Updating &lt;code&gt;a&lt;/code&gt; or &lt;code&gt;b&lt;/code&gt; doesn't update the sum and doesn't report to the server. Why?&lt;br&gt;
Well, the reactive block fails to realise that &lt;code&gt;a&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; are dependencies. Can you blame it? Not really I guess, but that doesn't help you when you have a big reactive block with multiple implicit, potentially subtle dependencies and you happened to refactor one of them out.&lt;/p&gt;
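&lt;p&gt;A simplified mental model helps here (my own sketch, not the real compiler): Svelte decides a reactive statement's dependencies at compile time, from the identifiers that appear textually inside it. Variables that are read only inside a called function are invisible to that analysis.&lt;/p&gt;

```javascript
// Hedged sketch: a naive, textual stand-in for Svelte's compile-time
// dependency analysis (the real compiler walks the AST, but the
// principle is the same: only identifiers visible in the block count).
function visibleIdentifiers(reactiveSource) {
  return new Set(reactiveSource.match(/[A-Za-z_$][\w$]*/g));
}

const before = visibleIdentifiers('sum = a + b; sendSumToServer();');
console.log(before.has('a'), before.has('b')); // true true: block reruns on a or b

const after = visibleIdentifiers('calcSum(); sendSumToServer();');
console.log(after.has('a'), after.has('b'));   // false false: block never reruns
```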

&lt;p&gt;&lt;strong&gt;And it can get much worse...&lt;/strong&gt;&lt;br&gt;
Once the automatic dependency recognition mechanism misses a dependency, it loses its ability to run the reactive blocks in the expected order (a.k.a. the dependency graph). Instead, it runs them from top to bottom. &lt;br&gt;&lt;br&gt;
&lt;a href="https://svelte.dev/repl/2e16203056f34f019672ae9991a3cd32?version=3" rel="noopener noreferrer"&gt;This code&lt;/a&gt; yields the expected output because Svelte keeps track of the dependencies but &lt;a href="https://svelte.dev/repl/81eb5c6ed3924a10bcfb39aef87780b0?version=3" rel="noopener noreferrer"&gt;this version&lt;/a&gt; doesn't because there are hidden dependencies like we saw before and the reactive blocks ran in order. The thing is that if you happened to have the same "bad code" but in a different order &lt;a href="https://svelte.dev/repl/cc870c006a9a4c4a801e3278c3f72072?version=3.43.1" rel="noopener noreferrer"&gt;like this&lt;/a&gt;, it would still yield the correct result, like a landmine waiting to be stepped on.&lt;br&gt;
The implications of this are massive. You could have "bad code" that happens to work because all of the reactive blocks are in the "right" order by pure chance, but if you copy-paste a block to a different location in the file (while refactoring for example), suddenly everything breaks on you and you have no idea why. &lt;/p&gt;

&lt;p&gt;It is worth restating that the issues might look obvious in these examples, but if a reactive block has a bunch of implicit dependencies and it loses track of just one of them, it will be way less obvious. &lt;br&gt;&lt;br&gt;
In fact, &lt;strong&gt;when a reactive block has implicit dependencies the only way to understand what the dependencies actually are is to read it very carefully in its entirety&lt;/strong&gt; (even if it is long and branching).&lt;br&gt;
This makes implicit dependencies evil in a production setting. &lt;/p&gt;
&lt;h4&gt;
  
  
  Solution A - functions with explicit arguments list:
&lt;/h4&gt;

&lt;p&gt;When calling functions from reactive blocks or when refactoring, only use functions that take all of their dependencies explicitly as arguments, so that the reactive block "sees" the parameters being passed in and "understands" that the block needs to rerun when they change - like &lt;a href="https://svelte.dev/repl/9f4a2172a5f74f6ba5f4dd9d73c43111?version=3.43.1" rel="noopener noreferrer"&gt;this&lt;/a&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calcSum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sending&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;calcSum&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;a: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{a}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;b: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{b}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;{sum}&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I can almost hear some of you functional programmers saying "duh"; still, I would go for Solution B (below) in most cases, because even if your functions are pure, you'll need to read the entire reactive block to understand what the dependencies are.&lt;/p&gt;

&lt;h4&gt;
  
  
  Solution B - be explicit:
&lt;/h4&gt;

&lt;p&gt;Make all of your dependencies explicit at the top of the block. I usually use an &lt;code&gt;if&lt;/code&gt; statement with all of the dependencies at the top. Like &lt;a href="https://svelte.dev/repl/4631502956b948659987f8e171927dc7?version=3.43.1" rel="noopener noreferrer"&gt;this&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;calcSum&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;sum&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;sending&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;sum&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;isNaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nf"&gt;isNaN&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;b&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nf"&gt;calcSum&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
        &lt;span class="nf"&gt;sendSumToServer&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;a: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{a}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;label&amp;gt;&lt;/span&gt;b: &lt;span class="nt"&gt;&amp;lt;input&lt;/span&gt; &lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"number"&lt;/span&gt; &lt;span class="na"&gt;bind:value=&lt;/span&gt;&lt;span class="s"&gt;{b}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&amp;lt;/label&amp;gt;&lt;/span&gt; 
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;{sum}&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I am not trying to say that you should write code like this when calculating the sum of two numbers. The point I am trying to make is that in the general case, such a condition at the top makes the block more readable and also immune to refactoring. It does require some discipline (to not omit any of the dependencies) but from experience it is not hard to get right when writing or changing the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Gotcha #2: Primitive vs. object based triggers don't behave the same
&lt;/h3&gt;

&lt;p&gt;This is not unique to Svelte but Svelte makes it less obvious imho. &lt;br&gt;
Consider &lt;a href="https://svelte.dev/repl/942519e4be0c4679b2600a99808e163a?version=3" rel="noopener noreferrer"&gt;this&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;isForRealz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;isForRealzObj&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;makeTrue&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;isForRealz&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="nx"&gt;isForRealzObj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isForRealz&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;isForRealz became true&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;isForRealzObj&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;value&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;Date&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;isForRealzObj became true&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;
    Click the button multiple times. Why does the second console.log keep firing?
&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;h4&amp;gt;&lt;/span&gt;isForRealz: {isForRealz &lt;span class="err"&gt;&amp;amp;&amp;amp;&lt;/span&gt; isForRealzObj.value}&lt;span class="nt"&gt;&amp;lt;/h4&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;button&lt;/span&gt; &lt;span class="na"&gt;on:click=&lt;/span&gt;&lt;span class="s"&gt;{makeTrue}&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;click and watch the console&lt;span class="nt"&gt;&amp;lt;/button&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you keep clicking the button while observing the console, you will notice that the &lt;code&gt;if&lt;/code&gt; statement behaves differently for a primitive and for an object. Which behaviour is more correct? That depends on your use case, but if you refactor from one to the other, get ready for a surprise.&lt;br&gt;
For primitives, Svelte compares by value and won't run the block again as long as the value hasn't changed.&lt;br&gt;&lt;br&gt;
For objects, you might be tempted to think that each assignment produces a new object and Svelte simply compares by reference, but that doesn't seem to apply here: when we assign using &lt;code&gt;isForRealzObj.value = true;&lt;/code&gt; we are not creating a new object but mutating the existing one, so the reference stays the same. &lt;/p&gt;
&lt;h4&gt;Solution:&lt;/h4&gt;

&lt;p&gt;Well, just keep it in mind and be careful. This one is not that hard to guard against once you are aware of it: if you are using an object and don't want the block to run on every assignment, keep a copy of the previous value, compare against it at the top of the block, and skip your logic when nothing has changed.&lt;/p&gt;
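&lt;p&gt;A minimal sketch of that guard in plain JavaScript, outside of Svelte (the &lt;code&gt;runReactiveBlock&lt;/code&gt; function is a hypothetical stand-in for the body of a &lt;code&gt;$:&lt;/code&gt; block that Svelte re-triggers on every assignment to the object):&lt;/p&gt;

```javascript
// Sketch of the "compare with the previous value" guard.
// `runReactiveBlock` stands in for the body of a `$:` block that Svelte
// may invoke even when the object's contents are unchanged.
let prevValue;  // cached copy of the field we care about
let runs = 0;   // how many times the guarded logic actually ran

const isForRealzObj = { value: false };

function runReactiveBlock() {
  // Bail out unless the tracked field really changed since the last run.
  if (isForRealzObj.value === prevValue) return;
  prevValue = isForRealzObj.value;
  runs++; // the "real" logic would go here
}

// Svelte re-triggers the block on every assignment to the object:
isForRealzObj.value = true;  runReactiveBlock();
isForRealzObj.value = true;  runReactiveBlock(); // no change, guard skips
isForRealzObj.value = false; runReactiveBlock();

console.log(runs); // 2
```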
&lt;h3&gt;Gotcha #3: The evil micro-task (well, sometimes...)&lt;/h3&gt;

&lt;p&gt;Alright, so far we were just warming up. This one comes in multiple flavours; I will demonstrate the two most common ones. You see, Svelte batches some operations (namely reactive blocks and DOM updates) and schedules them to run at the end of the current updates-queue, after all of your synchronous code has finished (internally it uses a resolved promise, i.e. a micro-task). Svelte calls one such update cycle a &lt;code&gt;tick&lt;/code&gt;. One thing that is especially puzzling when you encounter it is that asynchrony completely changes how things behave, because it escapes the boundary of the micro-task. So switching between sync and async operations can have all sorts of implications for how your code behaves. You might face infinite loops that weren't possible before (when going from sync to async), or reactive blocks that stop getting triggered, fully or partially (when going from async to sync). Let's look at some examples in which the way Svelte manages micro-tasks results in potentially unexpected behaviours.&lt;/p&gt;
&lt;h4&gt;3.1: Missing states&lt;/h4&gt;

&lt;p&gt;How many times did the name change &lt;a href="https://svelte.dev/repl/3810c96a3d2746788a2f0dec4228eb75?version=3.43.1" rel="noopener noreferrer"&gt;here&lt;/a&gt;?&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Sarah&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;countChanges&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;I run whenever the name changes!&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
        &lt;span class="nx"&gt;countChanges&lt;/span&gt;&lt;span class="o"&gt;++&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;   
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;John&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Another name that will be ignored?&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;log&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;the name was indeed&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nx"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;Rose&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Hello {name}!&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;
    I think that name has changed {countChanges} times
&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Svelte thinks the answer is 1, while in reality it's 3. &lt;br&gt;
As I said above, reactive blocks only run at the end of the micro-task and only "see" the last state that existed at that time. &lt;strong&gt;In this sense it does not really live up to its name, "reactive"&lt;/strong&gt;: it is not triggered every time a change takes place (in other words, it is not triggered synchronously by a "set" operation on one of its dependencies, as you might intuitively expect). &lt;/p&gt;
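&lt;p&gt;The batching itself can be sketched in plain JavaScript (an illustrative stand-in, not Svelte's actual scheduler; &lt;code&gt;setName&lt;/code&gt; and &lt;code&gt;reactiveBlock&lt;/code&gt; are made-up names): assignments merely queue a single flush, and the block runs once, at flush time, seeing only the final value:&lt;/p&gt;

```javascript
// Illustrative sketch of Svelte's batching (not its real scheduler):
// each assignment just schedules one coalesced flush; the reactive
// block runs once per tick and only sees `name` as it is at flush time.
let name = "Sarah";
let countChanges = 0;
let flushScheduled = false;
const pending = [];

function reactiveBlock() {
  countChanges++; // only ever observes the current value of `name`
}

function scheduleFlush() {
  if (flushScheduled) return; // coalesce: at most one flush per tick
  flushScheduled = true;
  pending.push(reactiveBlock);
}

function setName(next) {
  name = next;
  scheduleFlush(); // a "set" schedules the block rather than running it
}

setName("John");
setName("Another name that will be ignored?");
setName("Rose");

// end of the "tick": the single queued flush runs now
pending.forEach((fn) => fn());

console.log(countChanges, name); // prints: 1 Rose
```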
&lt;h4&gt;Solution to 3.1:&lt;/h4&gt;

&lt;p&gt;When you need to track all state changes as they happen, without missing any, use a &lt;a href="https://svelte.dev/tutorial/writable-stores" rel="noopener noreferrer"&gt;store&lt;/a&gt; instead. Stores update in real time and do not skip states. You can intercept the changes within the store's &lt;code&gt;set&lt;/code&gt; function or by subscribing to it directly (via &lt;code&gt;store.subscribe&lt;/code&gt;). &lt;a href="https://svelte.dev/repl/c584d584afd94aaf83d9fe698f8e264b?version=3.43.1" rel="noopener noreferrer"&gt;Here is how you would do it for the example above&lt;/a&gt;.&lt;/p&gt;
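&lt;p&gt;To see why a store doesn't skip states, here is a minimal stand-in for a writable store (illustrative only, not Svelte's actual implementation): &lt;code&gt;set&lt;/code&gt; notifies subscribers synchronously on every call, with no batching:&lt;/p&gt;

```javascript
// Minimal stand-in for a Svelte writable store (not the real thing),
// showing why subscribers observe every state: `set` notifies them
// synchronously, once per call, with no coalescing.
function writable(value) {
  const subscribers = new Set();
  return {
    set(next) {
      value = next;
      subscribers.forEach((fn) => fn(value)); // synchronous notification
    },
    subscribe(fn) {
      subscribers.add(fn);
      fn(value); // Svelte stores call the subscriber immediately
      return () => subscribers.delete(fn); // unsubscribe function
    },
  };
}

const name = writable("Sarah");
const seen = [];
name.subscribe((v) => seen.push(v)); // sees "Sarah" right away

name.set("John");
name.set("Another name that will be ignored?");
name.set("Rose");

console.log(seen.length - 1); // 3 changes observed, none skipped
```

&lt;p&gt;Svelte's real &lt;code&gt;writable&lt;/code&gt; behaves the same way for this purpose: every &lt;code&gt;set&lt;/code&gt; reaches every subscriber.&lt;/p&gt;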
&lt;h4&gt;3.2: No recursion for you&lt;/h4&gt;

&lt;p&gt;Sometimes you want a reactive block that changes the values of its own dependencies until it "settles", in other words: good old recursion. &lt;a href="https://svelte.dev/repl/b417aec1edd94811ad87b9e0c039790d?version=3.43.1" rel="noopener noreferrer"&gt;Here is a somewhat contrived example&lt;/a&gt; (contrived for the sake of clarity) so you can see how this can go very wrong:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;script&amp;gt;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;isSmallerThan10&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="kd"&gt;let&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt;
    &lt;span class="nl"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;smaller&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="c1"&gt;// this should trigger this reactive block again and enter the "else" but it doesn't&lt;/span&gt;
            &lt;span class="nx"&gt;count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="na"&gt;a&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;};&lt;/span&gt; 
        &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nx"&gt;console&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;larger&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;count&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;a&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
            &lt;span class="nx"&gt;isSmallerThan10&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/script&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;
    count is {count.a}
&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;
    isSmallerThan10 is {isSmallerThan10}
&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It doesn't matter whether &lt;code&gt;count&lt;/code&gt; is a primitive or an object: the &lt;code&gt;else&lt;/code&gt; part of the reactive block never runs, and &lt;code&gt;isSmallerThan10&lt;/code&gt; silently goes out of sync (it shows &lt;code&gt;true&lt;/code&gt; even though count is 11, so it should be &lt;code&gt;false&lt;/code&gt;).&lt;br&gt;
&lt;strong&gt;This happens because every reactive block can only ever run at most once per tick&lt;/strong&gt;.&lt;br&gt;
This specific issue hit my team when we switched from an async store to an optimistically updating store, which broke the application in all sorts of subtle ways and left us totally baffled. Notice that this can also happen when you have multiple reactive blocks updating dependencies for each other in a loop of sorts.&lt;/p&gt;

&lt;p&gt;This behaviour can sometimes be considered a feature that protects you from infinite loops, like &lt;a href="https://svelte.dev/repl/688edd30dd7a4c80ac834c04680b13ec?version=3.41.0" rel="noopener noreferrer"&gt;here&lt;/a&gt;, or that even prevents the app from getting into an undesired state, like in &lt;a href="https://svelte.dev/repl/7379e0c1a3384ca3b3ab3c7572a7eaa1?version=3.43.1" rel="noopener noreferrer"&gt;this example&lt;/a&gt; that was kindly provided by Rich Harris.&lt;/p&gt;
&lt;h4&gt;Solution to 3.2: Forced asynchrony to the rescue&lt;/h4&gt;

&lt;p&gt;In order to allow reactive blocks to run to resolution, you'll have to strategically place calls to &lt;a href="https://svelte.dev/tutorial/tick" rel="noopener noreferrer"&gt;tick()&lt;/a&gt; in your code. &lt;br&gt;
One extremely useful pattern (which I didn't come up with and can't take credit for) is&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;tick&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;then&lt;/span&gt;&lt;span class="p"&gt;(()&lt;/span&gt; &lt;span class="o"&gt;=&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;//your code here&lt;/span&gt;
&lt;span class="p"&gt;});&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://svelte.dev/repl/50c6864dd88c44289ed08efa41a6671d?version=3.43.1" rel="noopener noreferrer"&gt;Here is a fixed version&lt;/a&gt; of the &lt;code&gt;isSmallerThan10&lt;/code&gt; example using this trick.&lt;/p&gt;

&lt;h3&gt;Summary&lt;/h3&gt;

&lt;p&gt;I showed you the most common Svelte reactivity gotchas, based on my team's experience, and some ways around them. &lt;br&gt;&lt;br&gt;
To me it seems that all frameworks and tools (at least the ones I've used to date) struggle to create a "gotcha-free" implementation of reactivity. &lt;br&gt;&lt;br&gt;
I still prefer Svelte's flavour of reactivity over everything else I've tried, and hope that some of these issues will be addressed in the near future, or will at least be better documented. &lt;br&gt;&lt;br&gt;
I guess it is inevitable that when using any tool to write production-grade apps, one has to understand the inner workings of the tool in great detail in order to keep things together, and Svelte is no different. &lt;br&gt;&lt;br&gt;
Thanks for reading and happy building! &lt;br&gt;&lt;br&gt;
If you encountered any of these gotchas in your apps, or any other gotchas I didn't mention, please do share in the comments. &lt;/p&gt;

</description>
      <category>svelte</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
