DEV Community: Hamdi Mechelloukh

Two months building an investment bot. What it taught me about LLMs

Hamdi Mechelloukh — Thu, 18 Jun 2026 10:22:26 +0000

For two months, I cobbled together a small system that watches my portfolio and sends me, once a month, what it thinks I should do: buy, add, lighten, sell.

Wrong ideas, bugs hiding other bugs, decisions redone two or three times. And in the end, a much clearer picture of how language models actually behave, pretty far from what I imagined at the start.

The bot in one sentence

Once a month, a small script runs on a server. It pulls the composition of my portfolio and public market data, asks an artificial intelligence to analyze all that, and sends me a Telegram message with its recommendations.

The rest of the article is how I got to this little automated-monitoring script, by using the power of an LLM "correctly".

Act 1: the illusion of the edge

To be honest, I knew I couldn't have a head start on the market. But you still want to test the fantasy of code that makes you win on the stock market. I told myself that with enough data, a good model and a bit of code, I'd do as well as thousands of professional analysts.

It didn't last long. You can't beat the consensus with the consensus's own information; worse, you'll do worse than a good old "buy and hold on the S&P 500".

So I stopped chasing the edge. The value of this system was elsewhere: finding where the signals are favorable or not, seeing where I can't see, filtering out what I don't need to know, handing me proposals I could check, and then letting me make or not make a decision.

The edge exists, but it comes either from processing information better or faster than others (the whole quant business), or from having information others don't, and there it's either out of my reach, or it's insider trading. Me, scraping the same public numbers as everyone else, I have none.

The project took this realistic turn after that realization.

Act 2: the parade of models

A bit of vocabulary first, because the whole article rests on it.

An LLM (Large Language Model) is a program trained to predict the next word. You give it some text, it computes, for each possible word, a probability of being what comes next, then it picks one, and starts over. ChatGPT, Gemini, Claude are LLMs. That's all they do: predict the next word, one word at a time. The rest, the apparent reasoning, the analyses, emerges from this mechanism repeated billions of times.

My bot delegates its judgment to an LLM. Which one? I had to change the answer 6 times in 2 months, only to end up telling myself there isn't necessarily a right answer; you have to approach the LLM as a simple tool, like most SaaS.

Anyway, here's the path I took in terms of model choice:

Apr      Gemini alone
May 1    + Claude          (two models in parallel, to compare)
May 9    + Bear            (a 3rd model, deliberately pessimistic)
                           -> 3 voices, decision by vote
May 15   STOP. Opus only   (big cleanup, simplify everything)
May 22   back to Gemini    (cost and feature reasons)
Jun 13   MiMo              (Google terms-of-use change, and cost)

As you can see, at the start I began stacking models, because I had a serious lack of consistency in the bot's answers: one time it would tell me "buy Microsoft" and the next "sell Microsoft", even on 2 runs launched back to back.

It was pretty annoying: I was looking for a reliable answer. So the first idea was to reinforce the bot with 2 more LLM runs (Gemini) to get a kind of consensus, and I even went as far as adding a new model, Claude.

It was a bit better, but the bot was aggressive: it could recommend interesting names, but with too few negative theses. I needed a "devil's advocate", hence the idea of the "Bear", a 3rd LLM agent whose job was to look for the theses that lead to structural declines, and to cool down the "optimism" of the other two.

It was good, but it was expensive, and devilishly complex; something was off and it was tied to the architecture. I rewrote the bot, focusing and trying to simplify the prompts, and I still had the consistency problem.

After a few days, I went back to Gemini because my costs on Claude were a bit too high.

I started looking into the information the LLM was pulling in, and that's what made me remove the grounding search, because the LLM is heavily influenced by speculative noise.

In the end, I moved to MiMo: very good benchmark results, and a token usage cost (price per token + tokens needed to actually handle a task) that beats the competition. The terms-of-use changes for the Gemini API also cooled me off; when the free $300 disappear and you get a bill plus a withdrawal from your account worth 3 months of API usage, it kills the appetite.

Act 3: the memory that made the bot paranoid

I had given the bot a memory. Concretely, a file that kept its past analyses over thirty days, fed back in on each new run. The idea seemed obvious: add context so the bot wouldn't answer randomly.

Except the memory created a bias. The model saw that it had suggested selling a position the week before, and that pushed it to confirm that sale, over and over, regardless of new facts. A past opinion became a conviction, with no regard for the fundamentals. I patched. I added a rule to remove sells from the memory. It moved the problem without solving it. I patched again. Then I found that the memory was producing an outright form of self-censorship: the model aligned with its past instead of looking at the present.

It's what I call the paradox of experience, and a lot of older people fall into it: we lean on our experience to decide whether a choice is good or not, but the context of a situation changes, so a decision that was bad in one context can become a good one in another, and experience erases that nuance.

At some point, I counted the patches. When a feature needs fix upon fix and each fix calls for another, the feature itself is the bug. I removed the memory. The bot went back to being stateless: no state, no memory. Every month, it analyzes the situation fresh, as if discovering it.

Adding state (memory) to a system makes it more complex and introduces dependencies on the past that can corrupt the present. Statelessness is often a feature, not a lack. For those who remember their control-theory classes: by feeding the model's past outputs back into its input, I hadn't opened a loop, I had closed one, a positive feedback loop. The output reinforced itself, and the system diverged. Going back to stateless is precisely returning to an open loop, each run independent, with no feedback from the past.

Keep this memory story in mind. Part of the instability I blamed on it maybe wasn't its fault. We'll come back to it a bit later in the article.

Act 4: speculation is not information

For the news monitoring, my first version used what's called grounding (a form of retrieval-augmented generation).

Grounding is when you let the LLM go fetch information from the web in real time while it answers, instead of relying only on what it memorized during training.

On paper, perfect: the model gets to read the latest news. In practice, it mostly brought back rumors, analyst speculation, "word is that...". Over a one-year horizon, that kind of information isn't information. It's noise dressed up as signal.

We're still facing a disturbance, this time at the input: the quality of the input signal itself.

About-face, again. I cut the grounding and built monitoring on verifiable sources only: official regulatory filings (in the US, the documents companies are legally required to publish), established financial news feeds, and for the rest of the world, targeted searches. Then I imposed two hierarchies on the collected information:

AUTHORITY RANKING               SEVERITY SCALE
official source      >          G3  structural (changes the thesis)
established newswire >          G2  notable
the rest                        G1  anecdotal

The goal was no longer to read everything, but to sort. An official filing announcing a change of leadership (G3, official source) doesn't weigh the same as a speculative opinion piece (G1, the rest). The noise filtering promised in Act 1 was taking shape.
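A minimal Java sketch of the two hierarchies, assuming each collected fact carries a source type and a severity. The names `Authority`, `Severity`, `Fact` and `FactSorter` are hypothetical, and sorting by severity first with authority as a tie-breaker is only one possible reading: the article does not say how the two scales are combined.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical names: the article gives the two scales, not the code.
enum Authority { OFFICIAL, NEWSWIRE, OTHER }              // most to least authoritative
enum Severity { G3_STRUCTURAL, G2_NOTABLE, G1_ANECDOTAL } // most to least severe

final class Fact {
    final String text;
    final Authority authority;
    final Severity severity;

    Fact(String text, Authority authority, Severity severity) {
        this.text = text;
        this.authority = authority;
        this.severity = severity;
    }
}

final class FactSorter {
    // Assumption: severity is the primary key, authority breaks ties.
    // Enums compare by declaration order, so the strongest entries come first.
    static List<Fact> sort(List<Fact> facts) {
        List<Fact> sorted = new ArrayList<>(facts);
        sorted.sort(Comparator
            .comparing((Fact f) -> f.severity)
            .thenComparing(f -> f.authority));
        return sorted;
    }

    public static void main(String[] args) {
        List<Fact> facts = List.of(
            new Fact("speculative opinion piece", Authority.OTHER, Severity.G1_ANECDOTAL),
            new Fact("filing: change of leadership", Authority.OFFICIAL, Severity.G3_STRUCTURAL));
        for (Fact f : sort(facts)) {
            System.out.println(f.severity + " / " + f.authority + " : " + f.text);
        }
    }
}
```

Keeping the ranking in code rather than in the prompt means the model receives facts already sorted, instead of being asked to sort them itself.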

Act 5: the phantom facts

The system was now collecting verified facts. But the final recommendations seemed to ignore them. Worse: they were nearly identical to the runs where the monitoring had completely failed and brought back nothing at all. As if the facts didn't exist.

And yet they existed. They were right there in the text sent to the model. But they were grouped in a separate block, far from the place in the text where the model made its decision on each name. The model read them, then forgot them when it came time to decide.

The fix, once the diagnosis was made, was dumb: place each fact right next to the name it concerns, at the exact moment the model rules on that name. The lesson, though, runs deep:

With an LLM, the position of a piece of information matters as much as its presence. A fact present in the context but badly placed relative to the decision point is, in practice, an absent fact, especially when the context is long.

A corollary I wrote into the code right after: if the facts layer fails, the system crashes. It doesn't send a degraded report. A plausible but hollow report is more dangerous than no report, because it looks like real analysis. Better a visible crash than a false certainty nicely dressed up.
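Both ideas (facts placed next to the name, and a hard failure when the facts layer fails) can be sketched in a few lines of Java. Everything here is illustrative: `PromptBuilder`, the prompt wording, and the convention that a `null` map means "the facts layer failed" are assumptions, not the bot's actual code.

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: class, method and prompt wording are illustrative only.
final class PromptBuilder {
    // factsByTicker == null means the facts layer failed. A ticker with no
    // entry means the layer ran and found nothing for it, which is legitimate.
    static String build(List<String> tickers, Map<String, List<String>> factsByTicker) {
        if (factsByTicker == null) {
            // Crash visibly rather than send a plausible but hollow report.
            throw new IllegalStateException("facts layer failed: refusing to build a report");
        }
        StringBuilder prompt = new StringBuilder();
        for (String ticker : tickers) {
            prompt.append("## ").append(ticker).append('\n');
            // Each fact sits right next to the name it concerns.
            for (String fact : factsByTicker.getOrDefault(ticker, List.of())) {
                prompt.append("- verified fact: ").append(fact).append('\n');
            }
            prompt.append("Decision for ").append(ticker).append(":\n\n");
        }
        return prompt.toString();
    }

    public static void main(String[] args) {
        System.out.print(build(
            List.of("MSFT", "AAPL"),
            Map.of("MSFT", List.of("change of leadership (official filing)"))));
    }
}
```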

Act 6: the principle that reorganized everything

By dint of correcting behavioral drifts, I ended up formulating the rule that underpins everything else, and which is probably my biggest realization of the project.

The instructions given to an LLM must be principles, never numerical rules. Determinism must live in the data, not in the text of the instruction.

Deterministic means: always gives the same result from the same inputs. A computation is deterministic. Human judgment is not.

Concretely, from then on I forbade myself from writing things in the instructions like "aim for about 20% on this position" or "only do this". Why? Because a numerical threshold written in natural language gives you the worst of both worlds:

it makes the model rigid where I wanted nuanced judgment;
and it hands it a number to cling to and to make things up around (LLMs have an annoying tendency to embroider around the numbers you give them, because their answer is an estimate of the best answer to give, not the best answer to give).

If I want determinism, it has to be upstream, in the pipe that prepares the data. The mental model became this:

   UPSTREAM: DETERMINISTIC           DOWNSTREAM: NON-DETERMINISTIC
   (code, numbers)                   (the LLM's judgment)
 ┌────────────────────────────┐     ┌──────────────────────────┐
 │ portfolio at market value  │     │ weighs the pros and cons │
 │ universe filtered & scored │ ──> │ arbitrates between names │
 │ verified, dated facts      │     │ writes target weights    │
 │ consensus reliability      │     │ following PRINCIPLES     │
 └────────────────────────────┘     └──────────────────────────┘
   the numbers constrain              no magic number
   (computed, verifiable)             in the instructions

Everything that can be computed cleanly is computed upstream, in code, in a verifiable way, and handed to the model as a numerical constraint. The model, for its part, receives principles ("favor strong convictions", "a reliable consensus beats a high but uncertain target") and judges.
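As an illustration of the upstream side, here is a small Java sketch that computes current portfolio weights at market value. It is an assumption about what such a step could look like: `PortfolioWeights`, its inputs and the error handling are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical upstream step: pure arithmetic, same inputs give the same outputs.
final class PortfolioWeights {
    // quantities and prices are keyed by ticker; weight = market value / total value.
    static Map<String, Double> weights(Map<String, Double> quantities, Map<String, Double> prices) {
        double total = 0.0;
        Map<String, Double> values = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : quantities.entrySet()) {
            Double price = prices.get(e.getKey());
            if (price == null) {
                throw new IllegalArgumentException("no price for " + e.getKey());
            }
            double value = e.getValue() * price;
            values.put(e.getKey(), value);
            total += value;
        }
        if (total <= 0.0) {
            throw new IllegalArgumentException("portfolio has no market value");
        }
        Map<String, Double> weights = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : values.entrySet()) {
            weights.put(e.getKey(), e.getValue() / total);
        }
        return weights;
    }

    public static void main(String[] args) {
        System.out.println(weights(Map.of("A", 10.0, "B", 5.0), Map.of("A", 100.0, "B", 200.0)));
    }
}
```

The model then receives these weights as data; the instruction text carries no number.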

One last rule in the same spirit, on writing instructions: one rule = one statement, said once. Two near-identical rules are worse than one, because the reader (and an LLM even more so) looks for the difference between them. Since it doesn't exist, it invents one. Rewriting to condense isn't cosmetic, it's reducing the surface for error.

Act 7: the discovery, the noise was there from the start

Then comes the move to MiMo. And with it, not a new problem, but the revelation of an old problem I had never seen. To understand it, three definitions, like a staircase.

1. The probability distribution. At each word, the LLM doesn't pick "the" next word. It computes a probability for each possible word. For instance, after "the cat drinks", it might rate: milk 70%, water 25%, coffee 4%, etc.

2. The temperature. It's the knob that sets the randomness of the selection. High temperature: the model sometimes picks unlikely options (more creative, more unpredictable). Zero temperature: it systematically takes the most likely option. Milk, every time.

3. The logits. These are the raw scores the model computes for each word before turning them into probabilities. They're the raw material of the decision.
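The three definitions fit in a few lines of Java. This is a textbook softmax with temperature, not any provider's actual code; the class name `Sampling` and the example logits are made up for illustration.

```java
// Textbook sketch: logits -> probabilities, with a temperature knob.
final class Sampling {
    // Turns raw scores (logits) into probabilities; temperature must be > 0.
    static double[] softmax(double[] logits, double temperature) {
        double max = Double.NEGATIVE_INFINITY;
        for (double l : logits) {
            max = Math.max(max, l);
        }
        double[] probs = new double[logits.length];
        double sum = 0.0;
        for (int i = 0; i < logits.length; i++) {
            // Subtracting the max keeps exp() from overflowing.
            probs[i] = Math.exp((logits[i] - max) / temperature);
            sum += probs[i];
        }
        for (int i = 0; i < probs.length; i++) {
            probs[i] /= sum;
        }
        return probs;
    }

    // Zero temperature is typically treated as a plain argmax
    // rather than a division by zero.
    static int greedy(double[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] logits = {2.0, 1.0, -1.0}; // hypothetical scores for milk, water, coffee
        System.out.println(java.util.Arrays.toString(softmax(logits, 1.0)));
        System.out.println(java.util.Arrays.toString(softmax(logits, 2.0)));
        System.out.println(greedy(logits));
    }
}
```

Raising the temperature flattens the distribution; lowering it concentrates the mass on the top logit.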

At zero temperature, the model always takes the most likely word, so with the same inputs it should produce exactly the same output. Deterministic. In reality, I had misunderstood what an LLM is. It gives the illusion of an exact answer, when in fact it mostly gives an answer that's "good enough". When you code with an LLM, for example, it spits out working code, but not necessarily good code, or rather: not every time.

When I changed models, for the first time I wanted to measure stability before building on it. I ran the same prompt, on the same frozen data, several times in a row, at zero temperature. I expected identical outputs.

I got the opposite. Considerable variance from one run to the next. A few real numbers from this test on about thirty positions:

the sum of the proposed target allocations came to 44% on the first run, 105% on the second, 101% on the third;
out of the thirty-odd names, only 6 got a stable recommendation from one run to the next;
one run in three went completely off the rails.

Same input, same zero temperature, very different outputs.
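The two measurements used in this test (sum of target allocations per run, number of names with an identical recommendation across runs) are simple to compute. A sketch in Java, with hypothetical names and a data layout I chose for the example:

```java
import java.util.List;
import java.util.Map;

// Hypothetical helper: each run is a map from name to recommendation or target weight.
final class StabilityCheck {
    // Sum of the target weights proposed in one run (1.0 means 100%).
    static double totalAllocation(Map<String, Double> targetWeights) {
        double sum = 0.0;
        for (double w : targetWeights.values()) {
            sum += w;
        }
        return sum;
    }

    // Number of names whose recommendation is identical in every run.
    static int stableNames(List<Map<String, String>> runs) {
        int stable = 0;
        for (String name : runs.get(0).keySet()) {
            String first = runs.get(0).get(name);
            boolean same = true;
            for (Map<String, String> run : runs) {
                if (!first.equals(run.get(name))) {
                    same = false;
                    break;
                }
            }
            if (same) {
                stable++;
            }
        }
        return stable;
    }

    public static void main(String[] args) {
        List<Map<String, String>> runs = List.of(
            Map.of("A", "sell", "B", "buy"),
            Map.of("A", "sell", "B", "hold"),
            Map.of("A", "sell", "B", "buy"));
        System.out.println(stableNames(runs));
    }
}
```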

This noise was not new with MiMo. It had been there from the start, with Gemini, then Claude, then every model I had used. I had tried to fix it with deterministic instructions, because I needed control, but that's nonsense: I can't have both full control (deterministic) and an AI that finds me insights (non-deterministic). MiMo didn't bring anything special on this front; it was just the occasion on which I understood where to use the LLM and where not to. It's a tool, admittedly one with a high level of abstraction, but not a solution or a human replacement, even if some companies do very nice marketing to say otherwise.

And there, Act 3 takes on another meaning. The instability I had blamed on the memory, those recommendations that flip-flopped from month to month that I fought with patches, a good chunk of it probably wasn't the memory at all. It was already this noise, invisible for lack of being measured.

Careful, though, not to rewrite history too cleanly: the memory's anchoring bias was real, feeding back its own past sells creates a genuine confirmation bias. So there were two overlapping problems, not a single misdiagnosed one. The memory wasn't innocent; it simply had an invisible accomplice I wasn't measuring.

Why an LLM stays noisy even at zero temperature

Here's the heart of it, and it's subtler than it looks.

At zero temperature, the choice of word is not random. The model takes the maximum, perfectly deterministically. So it's not the selection rule that injects randomness.

The noise comes one notch earlier: the logits themselves don't always land on the same number. And the cause isn't the one people assume. The common explanation ("the parallel computations on the GPU run in a random order") is misleading, and that's exactly what the work cited just below corrects: for a given computation shape, the model is in fact reproducible; re-run identically, it gives back the same logits.

The real culprit is the batch. On a server, your request is rarely handled alone: it's grouped with others, and the composition of that group (how many requests, of what lengths) can change on every call depending on load. To go fast, the GPU splits up and adds the numbers in an order that depends on the shape of the batch. And floating-point addition (the computer's approximate arithmetic on decimal numbers) isn't associative: (a + b) + c doesn't always give exactly a + (b + c). So different batch neighbors lead to a different split, hence a different order of additions, hence logits that move by a hair. Each computation taken in isolation is deterministic; it's the batch context that varies from one call to the next.
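The non-associativity is easy to reproduce on a CPU. The Java sketch below is only an analogy: `chunkedSum` and its chunk size stand in for the batch-dependent split, and the values are chosen to make the gap obvious, whereas real logits move by a hair.

```java
// Analogy only: chunk size plays the role of the batch-dependent split.
final class FloatOrder {
    // Sums the values chunk by chunk, then adds the partial sums together.
    static float chunkedSum(float[] values, int chunkSize) {
        float total = 0f;
        for (int start = 0; start < values.length; start += chunkSize) {
            float partial = 0f;
            int end = Math.min(start + chunkSize, values.length);
            for (int i = start; i < end; i++) {
                partial += values[i];
            }
            total += partial;
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println((0.1 + 0.2) + 0.3);   // grouping from the left
        System.out.println(0.1 + (0.2 + 0.3));   // grouping from the right
        float[] v = {1e8f, 1f, -1e8f, 1f};
        System.out.println(chunkedSum(v, 4));    // one chunk
        System.out.println(chunkedSum(v, 2));    // two chunks: same numbers, other order
    }
}
```

The same four numbers give two different totals depending only on how they are grouped, which is the mechanism the paragraph describes.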

This isn't an absolute fatality, by the way. Recent work by Thinking Machines Lab showed that by rewriting these computations so they always add in the same order, whatever the batch composition, you can make a model perfectly reproducible at zero temperature: in their demo, 1000 generations became bit-for-bit identical, where the standard version produced 80 different ones. The price is a slowdown (on the order of 1.6 to 2 times depending on the kernels), and that's why consumer APIs don't enable it by default. So the noise at zero temperature is a fatality in practice, on common APIs, not in principle.

As long as the best candidate wins comfortably, it doesn't matter. But when two candidates are nearly tied, that hair flips the ranking:

Raw scores (logits) for the next word

  lighten   ████████████████████  8.40
  sell      ███████████████████▉  8.39   <- nearly tied!
  hold      ██████████            4.10

Run 1:  lighten 8.401 , sell 8.399  ->  we pick LIGHTEN
Run 2:  lighten 8.398 , sell 8.402  ->  we pick SELL
                       ^ tiny floating-point gap, and it all flips

And on a model that "reasons" (that generates a long chain of thought before concluding), a single early flip propagates and amplifies all along the reasoning. A hair's difference at the start, an opposite conclusion at the end.
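Using the two runs from the diagram, a short Java sketch shows the flip and the quantity that predicts it, the gap between the two best logits. `NearTie` and `margin` are names I made up for the illustration.

```java
import java.util.Arrays;

// Illustration with the logits from the diagram; names are hypothetical.
final class NearTie {
    static final String[] ACTIONS = {"lighten", "sell", "hold"};

    // Greedy (zero-temperature) choice: the action with the largest logit.
    static String pick(double[] logits) {
        int best = 0;
        for (int i = 1; i < logits.length; i++) {
            if (logits[i] > logits[best]) {
                best = i;
            }
        }
        return ACTIONS[best];
    }

    // Gap between the best and second-best logit: a small gap means a fragile choice.
    static double margin(double[] logits) {
        double[] sorted = logits.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length - 1] - sorted[sorted.length - 2];
    }

    public static void main(String[] args) {
        System.out.println(pick(new double[] {8.401, 8.399, 4.10})); // run 1
        System.out.println(pick(new double[] {8.398, 8.402, 4.10})); // run 2
        System.out.println(margin(new double[] {8.40, 8.39, 4.10}));
    }
}
```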

So the right way to put it is:

It's not noisy because the choice is random, but because the estimated values that a deterministic choice rests on are themselves unstable. A deterministic choice over unstable estimates becomes unstable again near the ties.

And that detail changes the whole interpretation. The noise isn't blind. It concentrates exactly on the close calls, the ones where the model itself is undecided. A position where the conviction is clear never flips (the best candidate wins comfortably). The positions that waltz from one run to the next are precisely the ones where two options are nearly equal.

The noise marks the zones of real uncertainty. It's not a flaw to hide. It's information.

Act 8: don't fight the noise, make it vote

If the noise is information about uncertainty, the right response isn't to eliminate it. It's to aggregate it.

Rather than a single run, I launch several on the same data, then I make the results vote. It's exactly the idea of the Condorcet jury theorem, stated by an 18th-century French mathematician.

Condorcet's theorem (in a few words). If each juror has a better-than-50% probability of finding the right answer, and the jurors err independently of one another, then the more jurors you add, the more the majority vote tends toward the certainty of being right.

As a formula, the probability that the majority of N jurors is right, each correct with probability p:

                N
P(majority) =   Σ    C(N,k) · p^k · (1 − p)^(N−k)
              k=⌊N/2⌋+1

What to take from it without the symbols:

  p (juror quality)            majority vote as N grows
  ─────────────────            ───────────────────────────────────────
  p > 0.5  (better than coin)  ──> tends to 1   (certain to be right)
  p = 0.5  (coin flip)         ──> stays at 0.5 (voting doesn't help)
  p < 0.5  (worse than coin)   ──> tends to 0   (voting makes it worse!)

Watch the trap the formula makes visible: voting only improves things if each juror is already better than chance. If the model is bad on a question, multiplying the runs only amplifies the error. Voting makes a correct-but-noisy juror reliable; it doesn't save an incompetent one.
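The formula can be checked numerically. This Java sketch evaluates it directly for a few values of p and N; it assumes independent jurors and an odd N, exactly as the theorem does, and the class name is arbitrary.

```java
// Direct evaluation of the majority formula above.
final class Condorcet {
    // Probability that a strict majority of n independent jurors is right,
    // each being right with probability p. Intended for odd n.
    static double majority(int n, double p) {
        double total = 0.0;
        for (int k = n / 2 + 1; k <= n; k++) {
            total += binomial(n, k) * Math.pow(p, k) * Math.pow(1 - p, n - k);
        }
        return total;
    }

    // C(n, k) computed multiplicatively in floating point; fine for small n.
    static double binomial(int n, int k) {
        double c = 1.0;
        for (int i = 1; i <= k; i++) {
            c = c * (n - k + i) / i;
        }
        return c;
    }

    public static void main(String[] args) {
        for (double p : new double[] {0.4, 0.5, 0.7}) {
            System.out.printf("p=%.1f  N=1: %.3f  N=5: %.3f  N=25: %.3f%n",
                p, majority(1, p), majority(5, p), majority(25, p));
        }
    }
}
```

For p = 0.7 and N = 5 the three terms are 0.3087 + 0.36015 + 0.16807 = 0.83692, already well above the single-juror 0.7.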

And there's a second trap, more insidious. Condorcet's theorem has two assumptions, not one: jurors better than chance (I just talked about that), and independent errors. But re-running the same model five times is five times the same network, the same biases, the same typical reasoning. The floating-point noise only decorrelates the outputs near the ties, exactly where I want them to vote. But on a systematic error (the model doesn't understand a sector, overrates a thesis), the five runs are wrong together, and worse: they're wrong unanimously. Because a unanimous vote is only a constant answer, and a constant answer is only the reinforcement of the model's thesis, not proof that it's right. 5/5 measures stability, never truth. To settle a consensus, you therefore need a source outside the model; the same model re-run will only repeat its thesis with confidence. Voting neutralizes the sampling noise; it doesn't correct the model's bias.

My model, on most positions, is far better than chance, just noisy near the close calls. An ideal use case for Condorcet. I refined the vote on two levels: first the direction (should this position go up or down?), and only then the degree. Without that, a clear consensus on direction could get buried under slightly different action labels ("lighten" and "sell" both say: go down).

The result is exactly what I'd been looking for since Act 1 without knowing it:

  Stock A  sell 5/5      ->  stable: top of the map (to validate against facts)
  Stock B  add 4/5       ->  stable: consensus
  Stock C  buy 2 / lighten 2 / hold 1
                         ->  NO consensus: shown as "split,
                             your judgment decides"

The clear cases come out by consensus, the close ones show up honestly as split, and I'm the one who decides. But careful not to read this table as a ranking of good recommendations: after everything above, a "5/5" doesn't certify that selling is the right move, it certifies that the model is stable on it. What the ensemble really produces isn't a list of orders, it's a stability map: here's where my judgment is least needed, and here's where it's needed most. The consensus tells me where to look with confidence; it's the primary facts and me who validate the substance. The system stops pretending to a certainty it doesn't have.
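A possible shape for the two-level vote, sketched in Java. The mapping of "buy" and "add" to "up", the `minVotes` threshold and the output labels are my assumptions; the article only states that "lighten" and "sell" both mean "go down" and that cases without consensus are shown as split.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a direction-then-degree vote.
final class TwoLevelVote {
    // Assumed mapping from action label to direction.
    static String direction(String action) {
        switch (action) {
            case "buy": case "add": return "up";
            case "lighten": case "sell": return "down";
            default: return "flat"; // "hold"
        }
    }

    // Returns the most frequent vote if it reaches minVotes, otherwise null.
    static String winner(List<String> votes, int minVotes) {
        Map<String, Integer> counts = new HashMap<>();
        String best = null;
        for (String v : votes) {
            int c = counts.merge(v, 1, Integer::sum);
            if (best == null || c > counts.get(best)) {
                best = v;
            }
        }
        return (best != null && counts.get(best) >= minVotes) ? best : null;
    }

    // First vote on direction; only if a direction wins, vote on the degree within it.
    static String decide(List<String> actions, int minVotes) {
        List<String> directions = new ArrayList<>();
        for (String a : actions) {
            directions.add(direction(a));
        }
        String dir = winner(directions, minVotes);
        if (dir == null) {
            return "split";
        }
        List<String> sameDirection = new ArrayList<>();
        for (String a : actions) {
            if (direction(a).equals(dir)) {
                sameDirection.add(a);
            }
        }
        return dir + "/" + winner(sameDirection, 1); // plurality on the winning side
    }

    public static void main(String[] args) {
        System.out.println(decide(List.of("sell", "sell", "lighten", "sell", "lighten"), 3));
        System.out.println(decide(List.of("buy", "buy", "lighten", "lighten", "hold"), 3));
    }
}
```

With five runs, three "sell" and two "lighten" agree on the direction even though no single label is unanimous.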

I had started by making three different models vote, then I deleted everything to simplify. I end up making a single model vote several times. The structure is the same (a vote), but the reason changed completely: at the start I voted to combine different viewpoints; in the end I vote to neutralize the noise of a single model. It took me two months and a detour through the whole chain to understand the real point of that vote.

Then there's the objection: the version truly faithful to Condorcet would be several different models, each run several times, because different architectures err in a more decorrelated way. And let's be honest: the gain is real, not uncertain. An ensemble of different models really does decorrelate errors, that's been the whole point of ensembles forever. It's simply probably not worth it. Each extra model is one more provider to maintain, to pay, to monitor, formats and prompts to keep in sync, for a benefit that becomes marginal next to that operating cost. You leave Pareto's useful 80%. It's a cost trade-off, not a denial of the benefit. So I decided otherwise. The vote of a single model kills the noise, which is the essential part and nearly free. And for the bias, the source outside the model that I need, I already have it: the primary facts I confront the model with (Acts 4 and 5), and my own judgment on the split cases. The 5/5 tells me where the model is stable; the facts and I say whether it's right.

The question this article dodges

Everything above is about consistency: is the system stable, honest about its doubts? But consistency isn't correctness. A system can be perfectly stable, perfectly clear-eyed about its zones of uncertainty, and mediocre in returns. Perfect consistency is even, as we just saw, exactly what a prejudice repeated without flinching looks like.

I myself dismantled the idea of an edge back in Act 1: I have no serious reason to beat the market. So a tension remains that I don't really resolve: what good are such carefully crafted allocations if nothing guarantees they're better than a simple index fund? My honest answer: it's not a performance tool, it's a monitoring and decision-support tool. The targets it produces are a starting point for my judgment, not autopilot. It's still too early to measure whether I beat an index over time. For now, I'm very close to the S&P 500 and below the NASDAQ, over 2-3 months of development, which proves nothing, one way or the other: over that span, everything is drowned in market noise. Telling skill from luck in equity allocation takes years, and ideally going through a real downturn. So I won't have the answer for a long time, now that the bot is stable.

I wrote this article to explain how I made a machine consistent in its answers and how I gained lucidity about its limits, without claiming that it's right.

What I was really looking for

I started out wanting a machine that's right. I ended up with a machine that's honest about what it doesn't know. And that's far more useful.

After dismantling my illusions one by one, here's what remains, and what's enough to justify the machine: a monitoring system, with a reliable sieve against the noise of information, to surface what matters on the market and in my portfolio. No promise of beating the market. But at the one task I built it for, gathering the right information and discarding the rest, it is, without hesitation, better than me.

The most transferable lesson reaches far beyond finance:

An LLM isn't an oracle, it's a sampler. It draws its answers from a probability distribution. Its variance isn't a flaw to hide, it's a measure of its own confidence. Good systems built around LLMs don't hide uncertainty, they bring it to the surface.

The bot runs once a month and sends me its conclusions. But what it really gave me isn't a shopping list. It's a sharper way to think about decisions under uncertainty, and a deep respect for the difference between a system that answers, and a system that knows when it doesn't know.

Real-time streaming pipeline with Apache Flink 2.0, Kafka and Iceberg

Hamdi Mechelloukh — Tue, 31 Mar 2026 11:09:58 +0000

It's 2:03 PM. A flash sale just started.

In the warehouse, an operator is entering incoming orders into the management system. He types a quantity, makes a mistake, corrects it immediately. Two events, one reality. Thirty seconds apart.

The batch job that runs at 2 AM will see both. It won't know which one is right. Depending on how the reconciliation logic is written, if it exists at all, it picks one of the two, often non-deterministically. And if the correction falls into the next batch window, the problem doesn't surface right away: the morning's numbers are wrong, cleanly, with no technical error in sight.

This is a real and recurring source of data quality problems in data teams.

Processing events as they arrive, in order, with their temporal context intact, fundamentally changes how this problem is handled. That's the starting point for this project: an end-to-end streaming pipeline on the Olist e-commerce dataset, built with Apache Flink 2.0, Kafka and Iceberg.

The dataset and the problem

The Olist dataset is a public Brazilian e-commerce dataset: orders, products, sellers, customers, reviews. 100,000 orders over two years.

I had already built a batch lakehouse on this same dataset. The logical next step was to go to the other extreme: stream processing, one-minute calculation windows, anomaly detection at the second level. Three concrete needs:

Revenue by category in real time — know which category is performing at every minute
Anomaly detection — a customer placing multiple orders within a few minutes, or an order at an abnormal price
Global KPIs — average order value, order rate, total revenue in real time

These are the three jobs that make up the pipeline.

Why Apache Flink?

The question is worth asking. There are other options for streaming in Java:

Kafka Streams — easy to operate, no separate cluster, but limited to Kafka-in/Kafka-out topologies
Apache Spark Structured Streaming — micro-batches, minimum latency of a few seconds, but familiar if you already know Spark
Flink — true event-by-event streaming, native event-time processing, built-in CEP

Flink was the natural choice for two reasons.

CEP (Complex Event Processing). Detecting "3 orders from the same customer within 5 minutes" is not an aggregation, it's a temporal correlation between events. Flink CEP handles this natively with a pattern DSL. In Kafka Streams, it requires maintaining manual state and writing the temporal logic by hand.

Flink 2.0. Version 2.0 brought native Java 21 support. Working on the current version rather than an end-of-life one was a deliberate choice.

The architecture

Olist CSV → Simulator → Kafka (orders)
                              │
              ┌───────────────┼───────────────┐
              ▼               ▼               ▼
    RevenueAggregation  AnomalyDetection  RealtimeKpi
    (tumbling 1 min)    (CEP 5 min)       (windowAll 1 min)
              │               │               │
              ▼               ▼               ▼
      Kafka (revenue)  Kafka (alerts)  Kafka (kpis)
              │               │               │
              └───────────────┴───────────────┘
                              │
                    Apache Iceberg (MinIO)

Three independent jobs, one shared source topic, three output topics, and an optional Iceberg data lake.

The independence of the jobs is a deliberate choice. In production, you want to be able to restart AnomalyDetectionJob without affecting RevenueAggregationJob. Each job has its own checkpoint, its own state, its own topology.

Job 1: RevenueAggregationJob

The simplest of the three. It aggregates revenue by product category over one-minute windows.

orders → filter nulls → map to RevenueByCategory → keyBy(category)
       → TumblingWindow(1 min) → reduce + ProcessWindowFunction
       → Kafka sink + Iceberg sink (optional)

A few details that matter.

Watermark strategy. The pipeline uses event time, meaning the event timestamp in the Kafka message, not the arrival time. The strategy is forBoundedOutOfOrderness(10 seconds) with a 5-second idleness timeout.

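Why idleness? The overall watermark is the minimum across all source partitions. If one partition goes silent for several minutes (no orders routed to it, for example), that minimum stops advancing and windows never close. With withIdleness(5s), Flink marks a partition idle after 5 seconds of silence, ignores it, and advances on the others. If every partition is silent (the simulator is stopped), there is nothing left to advance the watermark, idleness or not.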
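To make the mechanism concrete, here is a simplified plain-Java model of how a combined watermark behaves. It does not use Flink and is not Flink's implementation: `WatermarkModel` is a hypothetical name, and returning `Long.MIN_VALUE` when everything is idle is just this model's way of signalling "no progress".

```java
// Simplified model of watermark combination; not Flink code.
final class WatermarkModel {
    // Per-partition watermark under bounded out-of-orderness:
    // highest timestamp seen minus the allowed lateness.
    static long partitionWatermark(long maxTimestampSeen, long outOfOrdernessMs) {
        return maxTimestampSeen - outOfOrdernessMs;
    }

    // Overall watermark = minimum over the partitions that are not marked idle.
    // Model convention: Long.MIN_VALUE when every partition is idle.
    static long combined(long[] maxTimestamps, boolean[] idle, long outOfOrdernessMs) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < maxTimestamps.length; i++) {
            if (idle[i]) {
                continue; // an idle partition no longer holds the minimum back
            }
            anyActive = true;
            min = Math.min(min, partitionWatermark(maxTimestamps[i], outOfOrdernessMs));
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        long[] seen = {120_000L, 30_000L}; // partition 1 stopped receiving events early
        System.out.println(combined(seen, new boolean[] {false, false}, 10_000L));
        System.out.println(combined(seen, new boolean[] {false, true}, 10_000L));
    }
}
```

In the first call the silent partition pins the watermark; once it is marked idle, the active partition alone decides.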

Side outputs. Invalid events (null price, missing timestamp) are not silently dropped. They are routed to a side output that logs them. This avoids the scenario where events disappear without a trace.

Two-phase reduction. Inside each window, a reduce combines events by category incrementally as they arrive, so the window keeps one running value per category instead of every raw event. The ProcessWindowFunction then only attaches the window start and end timestamps. Less state to store, less work at window closure.

Job 2: AnomalyDetectionJob

This one is more interesting. It detects two types of anomalies through two different mechanisms.

Threshold detection: price anomaly

A filter on price:

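// Strict comparison: an order priced exactly at the threshold is not flagged.
// (priceAlerts is an illustrative variable name.)
DataStream<OrderAlert> priceAlerts = ordersStream
    .filter(event -> event.getPrice() != null
                  && event.getPrice().compareTo(priceThreshold) > 0)
    .map(event -> OrderAlert.priceAnomaly(event.getCustomerId(), event.getPrice()));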

The threshold (500 BRL by default) is configurable via environment variable. One subtlety: the comparison is compareTo(...) > 0, not >= 0. An order at exactly 500 BRL is not an anomaly. This behavior is covered by a specific test.
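The boundary behavior can be isolated in a tiny predicate. This sketch assumes the price is a `BigDecimal`, which the `compareTo` call suggests but the article does not state; `PriceRule` is a hypothetical name.

```java
import java.math.BigDecimal;

// Hypothetical helper isolating the strict-threshold rule.
final class PriceRule {
    // Strictly greater than the threshold. compareTo ignores scale,
    // so 500 and 500.00 compare as equal (unlike equals()).
    static boolean isPriceAnomaly(BigDecimal price, BigDecimal threshold) {
        return price != null && price.compareTo(threshold) > 0;
    }

    public static void main(String[] args) {
        BigDecimal threshold = new BigDecimal("500");
        System.out.println(isPriceAnomaly(new BigDecimal("500.00"), threshold));
        System.out.println(isPriceAnomaly(new BigDecimal("500.01"), threshold));
    }
}
```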

Pattern detection: suspicious frequency

This is where CEP comes in.

Pattern<OrderEvent, ?> pattern = Pattern.<OrderEvent>begin("orders")
    .timesOrMore(suspiciousOrderCount)
    .within(Duration.ofMinutes(5));

PatternStream<OrderEvent> patternStream = CEP.pattern(
    ordersStream.keyBy(OrderEvent::getCustomerId),
    pattern
);

With suspiciousOrderCount set to 3, this pattern says: if the same customer places 3 or more orders within a 5-minute window, it's suspicious.

The key is keyBy(customerId). Without it, Flink would compare orders from different customers. With keyBy, each customer has their own independent CEP state.
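What keyBy buys you can be shown with a plain Python model of per-customer state. This is a sketch of the idea only, not how Flink CEP is implemented: FrequencyDetector is a hypothetical name, events are assumed to arrive in timestamp order for each customer, and the handling of the exact 5-minute boundary is an assumption.

```python
from collections import defaultdict, deque

WINDOW_MS = 5 * 60 * 1000
SUSPICIOUS_ORDER_COUNT = 3


class FrequencyDetector:
    def __init__(self):
        # One independent timestamp queue per customer, like keyed state
        self._recent = defaultdict(deque)

    def on_order(self, customer_id: str, ts_ms: int) -> bool:
        """Return True when this order makes the customer's recent activity suspicious."""
        recent = self._recent[customer_id]
        recent.append(ts_ms)
        # Drop orders that fall outside the 5-minute span ending at this order
        while ts_ms - recent[0] >= WINDOW_MS:
            recent.popleft()
        return len(recent) >= SUSPICIOUS_ORDER_COUNT
```

Two customers placing interleaved orders never trigger each other's alert, because each queue only sees its own customer.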

Both streams, price alerts and frequency alerts, are then merged with union() before being sent to the Kafka output topic.

Job 3: RealtimeKpiJob

Global KPIs: average order value, orders per minute, total revenue. The calculation is straightforward, but the implementation reveals an interesting trade-off.

windowAll: the acknowledged bottleneck

To calculate total revenue across all orders in one minute, you need to aggregate all events together, without splitting by key. In Flink, this is called windowAll.

windowAll forces all events through a single processing instance. It's a bottleneck by design. At this volume (50 events per second), it's more than sufficient. If throughput rose to 50,000 events per second, a pre-aggregation by key followed by a merge would be necessary. We don't do that here because adding complexity for a hypothetical need is not good engineering.

Two-phase aggregation

The KPI calculation uses the AggregateFunction + ProcessAllWindowFunction pattern:

KpiAggregateFunction accumulates the count and sum as events arrive, continuously
KpiWindowFunction computes the average and derived metrics at window closure

This separation maintains minimal state (two numbers) instead of buffering all raw events. The ProcessAllWindowFunction only receives the final accumulator.
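The accumulator pattern can be sketched in a few lines of Python. The names below (KpiAccumulator, add, merge, window_result) are chosen for the illustration and do not come from the project. The merge function also shows what a pre-aggregation by key followed by a merge would rely on.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class KpiAccumulator:
    count: int = 0
    total: Decimal = Decimal("0")


def add(acc: KpiAccumulator, order_value: Decimal) -> KpiAccumulator:
    # Called for every event: the state stays at two numbers
    return KpiAccumulator(acc.count + 1, acc.total + order_value)


def merge(left: KpiAccumulator, right: KpiAccumulator) -> KpiAccumulator:
    # Combining partial accumulators gives the same result as one global pass
    return KpiAccumulator(left.count + right.count, left.total + right.total)


def window_result(acc: KpiAccumulator, window_start_ms: int, window_end_ms: int) -> dict:
    # Called once at window closure: derive the metrics from the accumulator
    average = acc.total / acc.count if acc.count else Decimal("0")
    return {
        "window_start": window_start_ms,
        "window_end": window_end_ms,
        "order_count": acc.count,
        "total_revenue": acc.total,
        "average_order_value": average,
    }
```

The average is deliberately not stored in the accumulator: it cannot be merged, whereas count and sum can.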

The BoundedHistogram

An optional but interesting detail: a custom Flink Histogram implementation.

The Flink Metrics API exposes four standard types: Counter, Gauge, Histogram and Meter. For a Histogram, Flink expects an implementation that returns percentiles, mean and standard deviation via a HistogramStatistics interface.

The BoundedHistogram is a fixed-size circular buffer (1000 values). When the buffer is full, new values overwrite the oldest ones.

public synchronized void update(long value) {
    values[writeIndex % values.length] = value;
    writeIndex++;
}

Simple, thread-safe, bounded memory. It allows Grafana to show the distribution of average order values, not just the latest single value.
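For readers who want to play with the idea outside Flink, here is a Python analogue of the circular buffer. It is a sketch, not a port of the Java class: the snapshot and quantile helpers are assumptions about how statistics could be derived, and the quantile uses a simple index-based rule.

```python
import threading


class BoundedHistogram:
    def __init__(self, capacity: int = 1000):
        self._values = [0] * capacity
        self._write_index = 0
        self._lock = threading.Lock()

    def update(self, value: int) -> None:
        with self._lock:
            # Once the buffer is full, the oldest slot is overwritten
            self._values[self._write_index % len(self._values)] = value
            self._write_index += 1

    def snapshot(self) -> list:
        """Sorted copy of the values currently held (at most `capacity`)."""
        with self._lock:
            size = min(self._write_index, len(self._values))
            return sorted(self._values[:size])


def quantile(sorted_values: list, q: float) -> float:
    if not sorted_values:
        return 0.0
    index = min(len(sorted_values) - 1, int(q * len(sorted_values)))
    return float(sorted_values[index])
```

With a capacity of 3, pushing 1, 2, 3, 4 leaves 2, 3 and 4 in the buffer: memory stays bounded and the statistics reflect recent values only.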

Iceberg integration: what I didn't anticipate

The Apache Iceberg integration was optional in the initial architecture. In practice, this is where I spent the most time.

The classloader problem

Flink 2.0 loads its filesystem plugins (including flink-s3-fs-hadoop) in an isolated classloader, invisible to user code. When iceberg-flink-runtime tries to instantiate S3AFileSystem at write time, it can't find the class provided by the Flink plugin.

The solution: bundle hadoop-aws and the AWS SDK directly in the fat JAR, with aggressive exclusions to avoid dependency conflicts.

implementation("org.apache.hadoop:hadoop-aws:3.4.1") {
    exclude(group = "com.amazonaws", module = "aws-java-sdk-bundle")
}
implementation("com.amazonaws:aws-java-sdk-s3:1.12.780")
implementation("com.amazonaws:aws-java-sdk-sts:1.12.780")

The fat JAR reaches ~710 MB. Not ideal, but that's the real cost of an Iceberg + Flink + S3 integration outside a managed service.

Credential timing

Second surprise: HadoopCatalog reads its S3 configuration at construction time, not after. The intuitive pattern of creating the catalog and then injecting configuration doesn't work:

// Credentials are injected too late
HadoopCatalog catalog = new HadoopCatalog();
catalog.setConf(hadoopConf);

// Credentials must be in the Configuration before construction
Configuration hadoopConf = new Configuration();
config.toProperties().forEach(hadoopConf::set);
HadoopCatalog catalog = new HadoopCatalog(hadoopConf, config.warehouse());

Same applies to CatalogLoader.hadoop(). This behavior is not prominently documented. It's the kind of error you only discover through end-to-end testing.

Docker Compose and .env resolution

A less expected issue: Docker Compose v2 resolves the .env file from the directory containing docker-compose.yml, not from the current working directory.

# From the project root, this command ignores the .env at the root
docker compose -f docker/docker-compose.yml up -d

# You need to pass the path explicitly
docker compose --env-file .env -f docker/docker-compose.yml up -d

Without this, ICEBERG_ENABLED=true in the .env is ignored and jobs start without an Iceberg sink, with no error message.
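The resolution rule can be written down as a tiny Python helper, useful for a pre-flight check before starting the stack. This is a sketch of the behavior described above, not Docker Compose's code, and both function names are hypothetical.

```python
from pathlib import Path


def default_env_file(compose_file: str) -> Path:
    """Where Compose looks for .env by default: next to the compose file."""
    return Path(compose_file).parent / ".env"


def read_env_keys(env_path: Path) -> dict:
    """Parse simple KEY=VALUE lines; returns {} when the file does not exist."""
    if not env_path.is_file():
        return {}
    result = {}
    for line in env_path.read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        result[key.strip()] = value.strip()
    return result
```

Checking that ICEBERG_ENABLED is present in read_env_keys(default_env_file("docker/docker-compose.yml")) before launching turns the silent misconfiguration into an explicit failure.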

Observability

Flink exposes its metrics via Prometheus on port 9249. Each job exposes custom metrics:

Metric	Type	Job
`windowsEmitted`	Counter	RevenueAggregationJob
`kpiWindowsEmitted`	Counter	RealtimeKpiJob
`lastWindowOrderCount`	Gauge	RealtimeKpiJob
`orderValueDistribution`	Histogram	RealtimeKpiJob
`priceAnomalyAlertsEmitted`	Counter	AnomalyDetectionJob
`suspiciousFrequencyAlertsEmitted`	Counter	AnomalyDetectionJob
`deserializationErrors`	Counter	All

Prometheus scrapes these metrics every 15 seconds and Grafana visualizes them. The deserializationErrors metric is particularly useful: if the simulator sends a malformed message, the counter rises and you see it immediately in the dashboard, without the job crashing.

Testing

The tests use Flink's MiniCluster, an embedded Flink cluster that runs in the test process, with no external infrastructure.

This choice has a cost: tests are slower (a few seconds each). But they test the actual behavior of Flink operators, not a mock. The AnomalyDetectionJobTest specifically validates CEP edge cases:

2 orders in 5 minutes → no alert
3 orders in 5 minutes → alert triggered
Order at exactly 500 BRL → no price alert

18 tests in total, covering all three jobs, the BoundedHistogram and the deserialization schema. The CI (GitHub Actions) compiles and runs all tests on every push, with a JaCoCo report as an artifact.

Batch or streaming: the real debate

Back to the opening scene.

Streaming is often perceived as expensive: the cluster runs continuously, the infrastructure never shuts down. That's true. But this comparison is incomplete.

A batch pipeline that handles events which correct themselves over time accumulates its own debt. Timeline reconciliation logic. Re-processing when an event arrives late. Alerts, manual interventions, data engineers spending time explaining why numbers are inconsistent across two windows. This cost is diffuse: it doesn't appear on any cloud bill, but it accumulates in sprints, in support, in technical debt.

Nobody actually does this calculation in practice, because it's too costly to conduct seriously.

This project doesn't claim to settle the debate. What it shows is that an end-to-end streaming pipeline with Flink 2.0 is accessible today without managed infrastructure, without Databricks, without Confluent Cloud. A docker compose up and the pipeline runs. The complexity is in the integration details, not in the paradigm itself.

The code is on GitHub. The start-e2e.sh script launches the entire pipeline in a single command.

You can also read this and other articles on my portfolio.

Building an open-source vendor-neutral lakehouse

Hamdi Mechelloukh — Fri, 20 Mar 2026 11:05:38 +0000

When you work in data, you always end up asking the same question: what happens if we need to switch platforms tomorrow?

I've seen firsthand that software vendors can be aggressive with pricing, and they won't hesitate to sunset a product that isn't generating enough revenue. When that happens, you need to migrate quickly, or face massive costs in migration, redevelopment, and lost time.

This conviction led me to build an end-to-end open-source, vendor-neutral lakehouse, from messaging to visualization. Here are the architecture choices, the trade-offs, and what I learned.

The stack: Kafka → Spark → Iceberg

Sources → Kafka → Spark (Bronze) → Spark (Silver) → Spark (Gold) → Streamlit
                                                                   ↑
                                                          Great Expectations
                                                          (quality at each layer)

Kafka for ingestion

Choosing event-driven for data transfer isn't trivial. In computing, managing time is one of the hardest problems. On the operational side, we're moving more and more toward event-driven architectures precisely for this reason: an event arrives when it arrives, and the system processes it. No batch window to respect, no "the file should have arrived at 6 AM".

Kafka is the de facto standard for this kind of architecture. Open-source, battle-tested, and crucially: no vendor lock-in. You can deploy it on any cloud or on-premise.

Spark for compute

You might ask: why Spark in an event-driven architecture? My position is pragmatic. Pure streaming via Kafka works well for ingestion into bronze, or even silver, to handle temporality upstream. But once you reach heavy transformations (aggregations, joins, enrichments), Spark remains the most battle-tested and portable tool.

Spark's advantage is that it runs everywhere: on a YARN cluster, on Kubernetes, on Databricks, on EMR, or locally for development. It's one of the few compute tools that doesn't lock you in.

Iceberg for the table format

Iceberg is the open table format that's gaining momentum. My choice was partly technical curiosity: I use Delta Lake daily at work, so I wanted to explore the alternative.

But beyond curiosity, Iceberg has properties that make it particularly suited for a vendor-neutral lakehouse:

Open format — no dependency on a specific vendor
Time travel — query data at any point in time
Schema evolution — add or modify columns without rewriting data
Partition evolution — change partitioning scheme without migration
Compatible with all engines — Spark, Trino, Flink, Dremio, Athena...

The project could just as well run with Delta Lake or Hudi. In fact, it would be interesting to offer format choice to anyone forking the project.

The layered architecture: bronze, silver, gold

The medallion pattern (bronze/silver/gold) structures data in three levels of refinement:

Bronze — raw data as it arrives, no transformation
Silver — cleaned, deduplicated, properly typed data
Gold — aggregated data ready for business consumption
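To make the three levels tangible, here is a deliberately tiny Python stand-in for what the Spark jobs do at each layer. The column names (order_id, category, price) and both function names are assumptions made for the example; the real project works on Iceberg tables with Spark.

```python
from collections import defaultdict
from decimal import Decimal, InvalidOperation


def to_silver(bronze_rows: list) -> list:
    """Clean, deduplicate and type the raw rows."""
    seen, silver = set(), []
    for row in bronze_rows:
        order_id = row.get("order_id")
        try:
            price = Decimal(row.get("price"))
        except (InvalidOperation, TypeError):
            continue  # unparseable price: not promoted to silver
        if not order_id or order_id in seen:
            continue  # missing key or duplicate
        seen.add(order_id)
        silver.append({"order_id": order_id,
                       "category": row.get("category") or "unknown",
                       "price": price})
    return silver


def to_gold(silver_rows: list) -> dict:
    """Aggregate for business consumption: revenue per category."""
    revenue = defaultdict(Decimal)
    for row in silver_rows:
        revenue[row["category"]] += row["price"]
    return dict(revenue)
```

Bronze is simply the untouched input list: nothing is rejected or rewritten before silver.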

Honestly, these terms are recent. A few years ago, we called them dataraw, dataprep, dataset. The vocabulary changes, the principle stays the same. What matters is to follow this progressive refinement logic without being rigid. Functional reality always takes precedence over technical rules. If data doesn't need three layers, it doesn't need three layers.

MinIO: S3-compatible without the lock-in

One point that might surprise you: why MinIO rather than S3 directly?

Because S3 is an AWS service, and using S3 means locking yourself into AWS. MinIO implements an S3-compatible API: tools that speak S3 can talk to MinIO once they point at a different endpoint. You can develop and test locally, deploy on any cloud, and migrate to S3 or another S3-compatible object store through configuration alone, without touching application code.

That's exactly the vendor-neutral principle: use open standards rather than proprietary managed services.

Data quality: Great Expectations and its limits

Great Expectations is one of the most widely used data validation tools in the Python/Spark ecosystem. I integrated it at each pipeline layer to validate data on input and output.

The tool does its job well for simple quality rules: nullability, uniqueness, value ranges, formats. It's also a tool I've seen used in enterprise settings, which validated the choice.

But it has real limitations:

Complex quality rules (cross-table consistency, conditional business rules) are hard to express
Resource-intensive checks (massive joins for cross-source duplicate detection) don't scale easily
And most importantly: discovering quality issues is not enough

This last point is crucial and comes directly from my production experience at Decathlon. You can set up all the quality alerts in the world. If source teams have no commitment to fix the issues, nothing will change. You need to work on data quality service-level agreements: SLAs on fix turnaround, shared responsibilities, clear escalation paths. Without that, source teams will make little effort to resolve quality problems.

The difficulty of vendor-neutral

The biggest challenge of this project wasn't technical in the traditional sense. It was resisting the temptation of managed services.

At every step, there's a managed option that saves time:

Why manage your own Kafka when there's Amazon MSK or Confluent Cloud?
Why MinIO when S3 is there, configured in 2 clicks?
Why self-hosted Airflow when there's MWAA?

The answer is always the same: because the day the pricing changes or the service is deprecated, you need to be able to leave. This doesn't mean you should never use managed services. It means you should do it knowingly, and make sure the abstraction layer allows switching.

In practice, building vendor-neutral requires more upfront effort:

Terraform for declarative, multi-cloud infrastructure management
Docker for isolation and portability
Standard interfaces everywhere (S3 API, JDBC, etc.)

But once it's in place, the freedom it provides is invaluable.

Orchestration: Airflow

Airflow is the natural choice for orchestration in a vendor-neutral stack. Open-source, extensible, and above all: the community is massive. When you have an Airflow problem, someone has already had it and posted the solution on Stack Overflow.

Alternatives would be Dagster or Prefect, but Airflow remains the most widely deployed in production and the most in-demand on the market. Pragmatism.

IaC: Terraform for multi-cloud

Terraform is the piece that makes vendor-neutral viable at scale. Infrastructure is described in code, versioned in Git, and deployable on AWS, GCP or Azure. Switching cloud means rewriting the provider-specific resources, not the whole stack.

In this project, Terraform modules provision AWS infrastructure, but the same logic could be ported to another cloud without rebuilding the application architecture.

What I took away

Vendor-neutral has a cost, but so does lock-in

Building vendor-neutral requires more upfront work. But lock-in has a hidden cost that explodes the day you need to migrate. And that day always comes sooner than you think.

Open formats are your data's life insurance

Iceberg, Parquet, Avro: as long as your data is in an open format, you can switch compute engines without losing your data. It's the most important decision in a data architecture.

Data quality is an organizational problem, not a technical one

Tools like Great Expectations are necessary but not sufficient. Without service-level agreements with sources, quality alerts are just noise.

Functional reality takes precedence over patterns

Bronze/silver/gold is a good guide, not a religion. If your data only needs two layers, don't make three to respect a pattern. Architecture should serve the business need, not the other way around.

Streaming doesn't replace batch, it complements it

Kafka for real-time ingestion, Spark for heavy transformations. The two coexist, and that's healthy. Trying to do everything in streaming is as dogmatic as doing everything in batch.

Going further

The source code is available on GitHub. The project uses the Olist dataset (Brazilian e-commerce) as a data source, making it testable without heavy infrastructure.

Lessons from 2 years as Production Manager at Decathlon Digital

Hamdi Mechelloukh — Fri, 20 Mar 2026 10:59:30 +0000

For two and a half years, I stepped away from code to manage data production for sales at Decathlon Digital. A role I discovered upon arrival: the job title said "Production Expert", and I quickly realized it was going to be a full-time commitment.

Here's what I learned from switching to the other side.

Context: Perfeco and sales data

Perfeco was the data product that served the company's economic performance and sales data. In practice, it meant:

An ingestion pipeline built on Talend and Redshift — data was processed and stored in Redshift, then pushed to S3
2 to 3 million sales per day ingested
Data exposed in the datalake and via an API consumed by multiple business teams
Kafka messages with XML payloads converted to CSV before loading
A scheduler (OpCon) to orchestrate ingestion jobs

My role: make sure all of this runs, every day, without interruption.

What "Production Manager" actually means day-to-day

Coming from development, you'd think production is about monitoring and a few alerts. Reality is very different.

Reducing the operational burden

My main goal wasn't to react to incidents, but to reduce their frequency. That meant:

Proactive alerting — setting up the right dashboards (QuickSight, Tableau) and alerts to detect anomalies before they become incidents. Automatic Jira ticket creation when a threshold is breached.
Data quality at the source — analyzing and detecting quality issues upstream, then escalating them to source teams. This is facilitation work, not code: convincing an upstream team that their data is poorly formatted takes time and diplomacy.
Run documentation — writing and maintaining on-call procedures so that any team member can intervene at 3 AM without relying on one person's memory.
Run KPIs — scripting metrics collection to objectively measure stability: incident count, resolution time, data availability.

Facilitating, not coding

The biggest surprise was the relational dimension of the role. I spent more time managing people than technology:

Facilitating consumer teams — improving incident communication. When an ingestion is delayed, 5 different teams need to know why and when it will be resolved. You need a clear channel, a clear message, and consistency.
Facilitating source teams — working with teams that produce upstream data so they fix quality issues at the root.
On-call planning — organizing rotations for the team, making sure everyone is trained and the load is fairly distributed.
Postmortems — I organized regular meetings with both data sources and consumers. Postmortems were filled collaboratively during these sessions: what happened, why, and what actions to take to prevent recurrence. This collaborative format aligned everyone and avoided the blame game.

The typical incident: when the scheduler crashes

My nemesis during those two years was the OpCon scheduler client crashing on the machine. Silently.

The scenario was always the same:

The OpCon client crashes → no jobs are launched
Sales keep arriving via Kafka (messages with XML payloads)
Messages pile up, hundreds of thousands within hours
When we restart the scheduler, the XML → CSV conversion job faces a massive backlog
The Talend job struggles, processing times explode, Redshift is overwhelmed, data arrives late in S3

The biggest incidents we had were all tied to this problem. What made it frustrating was that the client crash was silent: no alert, no explicit log. We'd only discover it by noticing the absence of data downstream.

The lesson: monitoring the absence of events is as important as monitoring errors. If a job that runs every 15 minutes hasn't executed in 30 minutes, that's a strong signal.
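The rule is simple enough to fit in one function. Below is a minimal Python sketch of such a staleness check; the function name and the tolerance factor of 2 are assumptions taken from the 15-minute / 30-minute example, not from the tooling we used at the time.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_silent(last_run: Optional[datetime], now: datetime,
              expected_interval: timedelta, tolerance: float = 2.0) -> bool:
    """True when a job has been quiet for `tolerance` times its normal interval."""
    if last_run is None:
        return True  # a job that has never run is also a signal
    return now - last_run >= expected_interval * tolerance
```

The check has to run from somewhere other than the scheduler it watches, otherwise the crash silences the alert too.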

What I learned

Production is engineering

Reducing operational burden isn't just "adding alerts". It's designing an observability system, automating detection, documenting procedures, and measuring improvement. It's engineering work in its own right.

Communication is a technical skill

Writing a clear incident message, running a blameless postmortem, convincing a source team to fix a data format. These are skills as important as writing code. And they can be practiced.

Proactive alerting changes everything

The difference between a Production Manager who reacts and one who manages is proactivity. When you discover an incident from an automatic alert at 8 AM instead of a call from a business team at 10 AM, you've gained 2 hours and a lot of peace of mind.

Monitor the silence

The most dangerous incidents don't generate errors: they generate silence. A pipeline that stops running, a scheduler that has crashed, a message that never arrives. Alerts on the absence of activity saved me more often than alerts on errors.

Documentation is not optional

In dev, you can sometimes get by with readable code and a few comments. In production, if the on-call procedure isn't written down, it doesn't exist. The person on call at 3 AM doesn't have time to guess.

Why I went back to technical work

After two and a half years, I decided to return to a Data Engineer role. The reason is simple: I felt I was regressing technically.

The Production Manager day-to-day is fascinating: the diversity of problems, the human dimension, the direct impact on data reliability. But I was spending my days facilitating, documenting and communicating, and less and less designing and coding.

I was afraid of falling behind, of no longer being up to speed on fast-evolving technologies: Spark, Databricks, lakehouse architectures. The risk of becoming a purely managerial profile without technical expertise didn't sit well with me.

Today, looking back, I don't regret the experience. It gave me an understanding of production that many developers don't have. When I design a pipeline now, I naturally think about observability, error recovery, and operational documentation. These are reflexes that code alone wouldn't have given me.

In summary

If you're a developer and someone offers you a production-oriented role, here's what I'd tell you:

It's a real job, not a support role. It requires engineering, rigor, and a lot of soft skills.
You'll learn things that development will never teach you — crisis communication, priority management under pressure, the end-to-end view of a data product.
Set a time limit. It's enriching, but if your core expertise is technical, don't stay too long or you risk falling behind.
Bring those reflexes back into your code. Observability, documentation, monitoring the silence — these are skills that make better engineers.

AgenticDev: a multi-LLM framework for generating tested code

Hamdi Mechelloukh — Fri, 20 Mar 2026 10:58:27 +0000

In late 2025, after spending hours prompting LLMs one by one to generate code, a question kept nagging me: what if multiple LLM agents could collaborate to produce a complete project? Not a single agent doing everything, but a specialized team (an architect, a developer, a tester), each with its own role, tools, and constraints.

That's how AgenticDev was born, a Python framework that orchestrates 4 LLM agents to turn a plain-text request into tested, documented code.

In this article, I share the architecture decisions, the problems I ran into, and the lessons learned.

Starting point: testing the limits of multi-agent collaboration

My initial goal was simple: explore how far LLM agents can collaborate autonomously. Not a throwaway POC, but a real pipeline where each agent has a clear responsibility:

Architect — analyzes the request and produces a technical specification (spec.md)
Designer — generates SVG assets from the spec
Developer — implements the code following the spec and integrating the assets
Tester — writes and runs tests, then sends failures back to the Developer

The idea is the Agent as Tool pattern: each agent is a node in an execution graph, not an LLM calling other LLMs chaotically.

Architecture: why LangGraph over an LLM orchestrator

My first approach was letting an orchestrator agent (Gemini) dynamically decide which sub-agent to call, via function calls. It worked, but I quickly identified a problem: the more generic the system, the more unpredictable it became.

The LLM orchestrator could decide to skip the Designer, call the Tester before the Developer, or loop indefinitely. For a framework that needs to produce reliable code, that's a deal-breaker.

So I chose to delegate orchestration to LangGraph, a deterministic graph framework. The pipeline becomes explicit:

Architect → Designer → Developer → Tester
                                      │
                                      ▼ (tests fail?)
                                   Developer ← fix loop (max 3×)

Each node is an autonomous agent, but execution order and retry logic are deterministic. The LLM controls the what (generated content), but not the when (execution flow).

_builder = StateGraph(PipelineState)
_builder.add_edge(START, "architect")
_builder.add_edge("architect", "designer")
_builder.add_edge("designer", "developer")
_builder.add_edge("developer", "tester")
_builder.add_conditional_edges(
    "tester",
    should_fix_or_end,
    {"fix": "fix_developer", "end": END},
)
_builder.add_edge("fix_developer", "tester")

The should_fix_or_end function is pure Python: it parses the Tester's output and decides whether to rerun the Developer or finish. No LLM in the decision loop.

The prompt caching problem and the switch to full Gemini

During the exploration phase, I very quickly hit API rate limits on Gemini. Every agent call sent the full system prompt, tool definitions and project context: thousands of tokens per request.

The solution: prompt caching. But Gemini and Claude handle it very differently.

Gemini: implicit caching

Gemini automatically caches repeated prefixes. If the system prompt and initial instructions are identical between two calls, Google reuses the cached context. On the code side, there's nothing to do: caching is transparent.

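# Savings show up in usage metadata (meta is assumed to be response.usage_metadata)
cached = getattr(meta, "cached_content_token_count", 0) or 0
total = getattr(meta, "prompt_token_count", 0) or 0
if total:  # avoid dividing by zero when no prompt tokens are reported
    logger.info("cache hit: %d/%d tokens (%d%%)", cached, total, cached * 100 // total)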

Claude: explicit caching

Claude requires explicit cache_control: ephemeral markers on the blocks you want cached: the system prompt, tool definitions, and the first user message.

system = [{
    "type": "text",
    "text": self.instructions,
    "cache_control": {"type": "ephemeral"}
}]

claude_tools = [self._fn_to_claude_tool(fn) for fn in self.tools]
if claude_tools:
    claude_tools[-1]["cache_control"] = {"type": "ephemeral"}

Why I switched to full Gemini

I started with a multi-LLM architecture: Gemini for the Architect and Tester, Claude for the Developer. The idea was appealing: use each LLM where it excels.

In practice, Claude's API cost quickly made this approach unsustainable. A full pipeline run with Claude as Developer cost significantly more than with Gemini, especially during fix iterations where the context grows with each turn. So I decided to switch to full Gemini as the default pipeline, while keeping the ClaudeAgent in the framework as a configurable option.

This pragmatic choice also let me fully benefit from Gemini's implicit caching across the entire pipeline, without managing two different caching strategies in production.

The contrast between both approaches still pushed me to design the class hierarchy to isolate these differences:

BaseAgent (ABC)
├── GeminiAgent    → implicit caching, google-genai SDK
│   ├── ArchitectAgent
│   ├── DesignerAgent
│   ├── DeveloperAgent
│   └── TesterAgent
└── ClaudeAgent    → explicit caching, anthropic SDK
    └── DeveloperAgent

Each agent inherits its backend's caching strategy without having to worry about it.

Agent hierarchy: ABC and specialization

The core of the framework relies on a simple hierarchy:

BaseAgent (ABC) — defines the contract: run(context) → AgentResult, tool management
GeminiAgent — implements the agentic loop for Gemini (chat + tool calls)
ClaudeAgent — implements the agentic loop for Claude (messages + tool_use blocks)

Specialized agents (Architect, Developer, Tester) inherit from GeminiAgent and only define their instructions and tools:

class ArchitectAgent(GeminiAgent):
    def __init__(self):
        super().__init__(
            name="Architect",
            instructions="You are a software architect...",
            tools=[web_search, write_file],
            model_name="gemini-3.1-pro-preview",
        )

To add a new agent, just create a class, define its instructions, and add it as a node in the LangGraph pipeline. No need to touch the chat logic, tool calling, or caching.
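Here is a stripped-down sketch of that contract, with a fake backend so it runs without any API key. Only BaseAgent, run and AgentResult are names from the article; EchoBackendAgent and ReviewerAgent are hypothetical, and the real backends contain the chat loop, tool calling and caching that this stand-in skips.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class AgentResult:
    agent: str
    output: str


class BaseAgent(ABC):
    def __init__(self, name: str, instructions: str, tools=()):
        self.name = name
        self.instructions = instructions
        # Tools are plain functions, indexed by name for dispatch
        self.tools = {fn.__name__: fn for fn in tools}

    @abstractmethod
    def run(self, context: str) -> AgentResult:
        """Backend-specific agentic loop."""


class EchoBackendAgent(BaseAgent):
    """Stand-in for GeminiAgent or ClaudeAgent: no LLM call, just the shape."""

    def run(self, context: str) -> AgentResult:
        return AgentResult(self.name, f"[{self.name}] {context}")


class ReviewerAgent(EchoBackendAgent):
    """A new specialized agent: only a name, instructions and tools."""

    def __init__(self):
        super().__init__(name="Reviewer",
                         instructions="You review code...",
                         tools=[])
```

The specialized class contains no backend logic at all, which is what keeps adding an agent cheap.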

The Designer: a special case

The DesignerAgent is an interesting case. Unlike other agents that use the standard agentic loop (chat → tool call → response → tool call → ...), the Designer makes direct API calls to generate SVG.

Why? Because SVG generation is a well-defined two-step workflow:

Planning — "what assets does this project need?" → returns JSON
Generation — "generate these N SVG sprites" → returns parsable text

No need for an agentic loop with tools here. The Designer still inherits from GeminiAgent (for the API client and key validation), but it overrides run() with its own logic.

The automatic fix loop

One of the most useful aspects of the pipeline is the fix loop. When the Tester detects failures, the Developer is relaunched in FIX MODE:

def should_fix_or_end(state: PipelineState) -> Literal["fix", "end"]:
    if (
        _has_test_failures(state.get("test_results", ""))
        and state.get("fix_iterations", 0) < MAX_FIX_ITERATIONS
    ):
        return "fix"
    return "end"

The Developer then receives the test output in its context, with a clear instruction:

"You are in FIX MODE — read existing files and fix these. Do NOT rewrite all files from scratch."

In practice, 3 iterations are enough in most cases to go from 60-70% passing tests to 100%.
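To see the bound at work without LangGraph or an LLM, the loop can be simulated in plain Python. Everything below is a sketch under assumptions: has_test_failures uses a naive "FAILED" substring check because the real parser is not shown here, and run_fix_loop and make_flaky_tester are hypothetical helpers.

```python
from typing import Callable, Literal

MAX_FIX_ITERATIONS = 3


def has_test_failures(test_results: str) -> bool:
    # Naive heuristic for the sketch, not the framework's parser
    return "FAILED" in test_results


def should_fix_or_end(state: dict) -> Literal["fix", "end"]:
    if (has_test_failures(state.get("test_results", ""))
            and state.get("fix_iterations", 0) < MAX_FIX_ITERATIONS):
        return "fix"
    return "end"


def run_fix_loop(state: dict, tester: Callable[[dict], str],
                 fix_developer: Callable[[dict], None]) -> dict:
    """Tester, then Developer in fix mode, until green or out of attempts."""
    state = dict(state)
    state["test_results"] = tester(state)
    while should_fix_or_end(state) == "fix":
        fix_developer(state)
        state["fix_iterations"] = state.get("fix_iterations", 0) + 1
        state["test_results"] = tester(state)
    return state


def make_flaky_tester(failing_runs: int) -> Callable[[dict], str]:
    """Fake Tester that fails a fixed number of times, then passes."""
    calls = {"n": 0}

    def tester(state: dict) -> str:
        calls["n"] += 1
        return "1 FAILED" if calls["n"] <= failing_runs else "all passed"

    return tester
```

Even with a Tester that never passes, the loop stops after three fix attempts: the routing decision is plain Python, so the worst case is known in advance.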

Shared tools

Agents interact with the file system, the shell and the web through 4 simple tools:

Tool	Role
`write_file(path, content)`	Write a file (creates parent directories)
`read_file(path)`	Read an existing file
`execute_code(command)`	Execute a shell command
`web_search(query)`	Web search via DuckDuckGo

These tools are plain Python functions, passed to agents through their constructor. The framework handles exposing them to the LLM in the right format (Gemini function declarations or Claude tool definitions).
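As an illustration of that exposure step, here is one way a plain function could be turned into a tool description using only the standard library. It is a sketch, not the framework's converter: describe_tool and to_claude_tool are hypothetical names, the type mapping is deliberately minimal, and the write_file body is a guess at what such a tool might do.

```python
import inspect
from pathlib import Path

_JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def describe_tool(fn) -> dict:
    """Build a name / description / JSON-schema-style parameters dict from a function."""
    params = inspect.signature(fn).parameters
    properties = {
        name: {"type": _JSON_TYPES.get(p.annotation, "string")}
        for name, p in params.items()
    }
    required = [name for name, p in params.items()
                if p.default is inspect.Parameter.empty]
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "parameters": {"type": "object", "properties": properties,
                       "required": required},
    }


def to_claude_tool(fn) -> dict:
    """Same information, with the schema under the input_schema key."""
    spec = describe_tool(fn)
    return {"name": spec["name"], "description": spec["description"],
            "input_schema": spec["parameters"]}


def write_file(path: str, content: str) -> str:
    """Write a file (creates parent directories)."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content)
    return f"wrote {path}"
```

The function stays the single source of truth: its signature and docstring drive both formats, so a tool is declared once.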

The limits: a solid foundation, not a finished product

Let's be honest about what the framework can and can't do. AgenticDev excels at generating a functional project base: file structure, initial code, tests, documentation. For simple projects (CLI tools, libraries, small APIs), the output is often usable as-is.

But as complexity grows (intricate business logic, multiple integrations, performance constraints), the generated code will be a starting point, not the final product. There will be technical limitations (overly naive architectures, uncovered edge cases) and functional gaps (the LLM doesn't know your business context) that you'll need to fix manually or by vibe-coding with a tool like Claude Code or Cursor.

This is actually the workflow I recommend: let AgenticDev generate the skeleton, then iterate on it with a coding assistant to refine the details. The framework saves you the first hours of setup, not the last hours of polish.

What I learned

Specialization beats generality

An agent that "does everything" is less reliable than a team of specialized agents. The Architect can't code, the Developer can't test, and that's by design. Each agent has precise instructions and a limited scope.

Deterministic orchestration is non-negotiable

Letting an LLM decide the execution flow means accepting that the pipeline behaves differently on every run. For a code generation tool, that's unacceptable. LangGraph let me keep the LLMs' creativity while enforcing a predictable execution order.

Prompt caching is essential in multi-agent systems

Without caching, a 4-agent pipeline easily consumes 100k+ tokens per run, 80% of which is repeated context. Caching significantly reduces both costs and latency.

Cost dictates architecture

Starting with multi-LLM was intellectually satisfying, but economic reality caught up. Keeping the multi-backend abstraction while using a single provider by default is the right trade-off: you only pay for what you use, without sacrificing flexibility.

Agent instructions are code

Agent prompts aren't vague sentences: they're precise specifications with rules, examples, and edge cases. For instance, the Developer's prompt includes rules on Python vs TypeScript conventions, placeholder handling, and a mandatory completion audit before returning its response.

Going further

The source code is available on GitHub. The framework is designed to be extended: adding a new agent takes about ten lines of code.

Next steps I'm considering:

Support for new LLM backends (Mistral, Llama)
Quality metrics on generated code
Interactive mode with human validation between each step