DEV Community: Alessandro Bahgat

Scopewise: Submission for Weekend Challenge: Earth Day Edition

Alessandro Bahgat — Mon, 20 Apr 2026 06:59:13 +0000

This is a submission for Weekend Challenge: Earth Day Edition

What I Built

ScopeWise is an agentic Scope 3 Category 6 (business travel) control center for a fictional company. The landing view isn't an annual report — it's an Opportunities panel that names the four or five policy levers a sustainability lead should pull next, each ranked by estimated tCO₂e recovered. Below that: a filter strip, the headline KPIs against an SBTi trajectory, a baseline-comparison table that benchmarks every route against a cabin-agnostic typical, and the heaviest-routes ledger with a "vs typical" ratio column flagged wherever a route is running >1.2× baseline. A Gemini 2.5 Flash drawer lets the operator ask natural-language follow-ups that call the same tools the UI reads from.

The twist is synthetic trips, real emissions, two TIM products. I curated a catalog of 365 real scheduled flights (42 carriers, 55 airports) from publicly-published airline timetables, then generated 2,000 synthetic corporate trips on top of that catalog. Each trip is submitted to Google's Travel Impact Model via computeScope3FlightEmissions — the enrichment pipeline keeps only rows where TIM returned source: TIM_EMISSIONS (real per-flight model output, not a fallback to a typical or modeled baseline). Of 2,000 candidate trips, TIM had specific data for 810 across 115 routes; the other 1,190 — discontinued services, wet-leased routes, regional carriers outside the catalog — were dropped rather than filled with an estimate. The v2 move: run computeTypicalFlightEmissions over every unique surviving route in one batch call, land that cabin-agnostic baseline alongside the specific number, and turn the gap between them into the signal the control center hangs off. The specific-flight claim stays load-bearing; typical is a separate data product used only for comparison.

Demo

Live URL: https://scopewise-demo.vercel.app

Code

Repository: github.com/abahgat/scopewise

How I Built It

1. The data-integrity problem with synthetic demos

Most "Scope 3 dashboard" demos are built on fabricated emissions values — a constant kg-per-km multiplied by a distance estimate. Judges can't verify the numbers, and a careful reader doesn't have to: the same dashboard with different constants would tell a different story. I wanted every figure on the screen to be defensible.

The move was to curate a catalog of real (carrier, flight_number, origin, destination) tuples from public airline schedules, then build 2,000 synthetic trips on top of that catalog. The enrichment script submits each trip to TIM's Scope 3 endpoint and enforces the specific-flight check at build time:

// scripts/enrich-emissions.ts
if (r.source !== 'TIM_EMISSIONS') {
  droppedFallback++;
  continue;
}

TIM_EMISSIONS is the marker TIM stamps when it has real per-flight model output — actual aircraft, load factor, seat configuration. Anything else (TYPICAL_FLIGHT_EMISSIONS, modeled defaults) means TIM fell back off the per-flight catalog for that market/date, so we drop the row instead of landing an estimate in its place. Of 2,000 candidate trips, 810 survive and 1,190 are dropped (discontinued services, wet-leased routes, regional carriers outside the catalog). Every surviving row carries a tim_model_version stamp and a defensible per-flight number; the 1,190 we can't enrich don't quietly turn into distance-formula guesses.

2. Baseline comparison via `computeTypicalFlightEmissions`

An annual report tells you how much you emitted. A control center tells you what to do about it. The move that turns one into the other is the second TIM endpoint: computeTypicalFlightEmissions. It takes a markets[] array of {origin, destination} pairs (up to 1,000 per batch) and returns a cabin-agnostic baseline in gCO₂/pax. For the 115 unique routes remaining after the specific-flight check, that's a single API call at enrichment time. The pipeline lands the baseline into the same row as the specific number, and every aggregation downstream carries an actual / typical ratio.

That ratio is the signal the Opportunities panel reads off of. Four insight categories fall out naturally:

HIGH_RATIO_ROUTE — routes with ≥3 flights averaging ≥1.8× typical (cabin-mix story)
DEPT_BUSINESS_OVERUSE — departments where BUSINESS+FIRST is >50% of their emissions
SHORT_HAUL_UPGRADE_PATTERN — BUSINESS/FIRST on routes where typical is small (proxy for <3h legs)
UNDERUSED_ECONOMY_OPPORTUNITY — (dept, route) pairs that are 100% non-ECONOMY on a route that sees ECONOMY elsewhere

The impact estimate per insight is the obvious one: sum_actual − sum_actual / avg_ratio, divided by 1,000, annualized. It's not a SBTi-grade abatement curve — it's a reasonable first cut that orders interventions by size and gives a number a human can argue with.

The honest caveat, stated explicitly under the baseline-comparison table in the UI: typical is cabin-agnostic, specific is cabin-aware. A FIRST-cabin trip's 4× ratio is dominated by cabin choice, not airline or aircraft. We lean into that rather than hiding it — the intervention framing ("if this dept flew BUSINESS on <8h routes instead of FIRST, you'd save X tCO₂e") is exactly what a policy lever looks like.

One UX consequence: filters (department, cabin, date range) live in client-side context and re-derive every aggregation from a single trip snapshot rather than re-fetching. At 810-row scale the whole page recomputes in well under a frame, so changing a dept pill updates the chart, the timeline, the baseline table and the Opportunities impact numbers instantly. That's the affordance that makes this feel like a console instead of a static report.

3. Snowflake as the single source of truth

The warehouse holds one fact table (ENRICHED_TRIPS), two dimensions (EMPLOYEES, AIRPORTS), and four analytical views that feed both the REST endpoints and the agent's SQL tool. V_TOP_ROUTES is the most interesting of them because it precomputes the share-of-total column the dashboard needs:

CREATE OR REPLACE VIEW V_TOP_ROUTES AS
WITH total AS (
    SELECT SUM(emissions_kg_co2e) AS company_total FROM ENRICHED_TRIPS
)
SELECT
    e.origin_iata      AS origin,
    e.destination_iata AS destination,
    COUNT(*)           AS flights,
    SUM(e.emissions_kg_co2e) AS total_kg,
    100.0 * SUM(e.emissions_kg_co2e) / NULLIF(t.company_total, 0) AS pct_of_total
FROM ENRICHED_TRIPS e
CROSS JOIN total t
GROUP BY e.origin_iata, e.destination_iata, t.company_total
ORDER BY total_kg DESC;

The other three views (V_EXEC_SUMMARY, V_DEPT_EMISSIONS_MONTHLY, V_CLASS_BREAKDOWN_MONTHLY) handle the KPI tiles, the department rollup, and the stacked-area chart by class of service. Because both the dashboard and the agent read through the same views, adding a new metric means one view change — not two code paths drifting out of sync. Auth is key-pair to a service user, private key stored base64-encoded in Vercel env vars.

4. Gemini with grounded tools

The chat endpoint is Vercel AI SDK v4 + @ai-sdk/google pointed at gemini-2.5-flash. Six tools are exposed, all typed with Zod:

get_top_routes({ limit, dept? }) — pre-baked route ranking
get_department_breakdown({ period }) — per-dept kg/employee for all, ytd, or a YYYY-MM
compare_to_benchmark({ dept }) — dept vs company average with signed delta %
get_insights({ category? }) — the v2 tool: returns the same Opportunities the UI is ranking, so the agent can say "this route shows up on your Opportunities panel as HIGH_RATIO_ROUTE" instead of re-deriving it
explain_chart({ chart_id }) — returns the exact slice behind each dashboard chart so the agent narrates what the user is actually looking at
run_sql({ query }) — the escape hatch

The escape hatch is the interesting one. run_sql is SELECT-only, enforced by this regex:

const SQL_WRITE_GUARD =
  /\b(INSERT|UPDATE|DELETE|DROP|ALTER|CREATE|MERGE|TRUNCATE|GRANT|REVOKE|USE|SET)\b/i;

Any query matching the guard is rejected before it reaches Snowflake, results are capped at 100 rows, and the Snowflake role itself is read-only as a second line of defense. The system prompt caps responses at four sentences, instructs the agent to cite "Google TIM (specific-flight model)" on every numeric answer, and forbids speculation.

An example interaction (representative, not a transcript):

User: Which department has the highest per-employee emissions, and how does it compare to the company average?

tool call: compare_to_benchmark({ dept: "SALES" })
result: { dept_avg: 4821.3, company_avg: 2940.7, delta_pct: 64.0 }

Agent: Sales leads at 4,821.3 kg CO2e per employee — 64.0% above the company average of 2,940.7 kg. Source: Google TIM (specific-flight model).

5. Why a TIM MCP server matters (forward-looking)

I also built timcp, a standalone MCP server wrapping TIM's three endpoints. For the bulk enrichment here I used the REST API directly — the whole dataset fits in two batches of 1,000, so a tool-discovery layer adds latency without buying anything. But for ad-hoc questions like "how does this flight compare to the typical emissions for this route?", exposing computeTypicalFlightEmissions as an MCP tool at agent runtime is the right shape. That's the natural next step: the agent gets a live benchmarking capability that doesn't require precomputing typical values into Snowflake.

6. Architectural decisions

Vercel AI SDK over Genkit — faster to ship on Next.js with native streaming; one runtime instead of two
Snowflake Cortex skipped for v1 — Gemini 3 Pro now runs inside Cortex (Jan 2026), which is a killer Snowflake+Gemini combo. Trial-edition permissions and the 36-hour clock argued against it; noted as future direction
Parallel subagent build — three workstreams (data, backend, frontend) ran concurrently in isolated git worktrees, gated by a frozen src/lib/types.ts contract and a mock fixture so backend and frontend could develop without waiting on Snowflake
Recharts, not Tremor charts — Tremor's chart package has a React 19 peer conflict; I kept Tremor cards and dropped to Recharts directly for the bar and stacked-area charts
Real flight catalog committed to the repo — 365 hand-curated tuples from public carrier schedules. It's reviewable, reproducible, and the "all emissions are TIM-specific" claim is auditable against a file anyone can read

Prize Categories

Snowflake. Business travel is roughly 1% of global CO2 and almost entirely reported as Scope 3 Category 6 — a meaningful Earth Day narrative that isn't already saturated. Technical execution: five analytical views (V_EXEC_SUMMARY, V_DEPT_EMISSIONS_MONTHLY, V_TOP_ROUTES, V_CLASS_BREAKDOWN_MONTHLY, V_ROUTE_DELTAS), key-pair auth from a serverless runtime, a PUT + COPY INTO bootstrap pipeline, and one fact table whose invariant (every row has a TIM_EMISSIONS-sourced emissions number, fallbacks dropped at enrichment time) is enforced before load. V_ROUTE_DELTAS joins the cabin-agnostic typical_emissions_kg_co2e column against the specific actual to compute the ratio the Opportunities panel ranks off of — one view change, two consumers (UI and agent) stay in sync. Creativity: the "synthetic trips, real numbers" design neutralizes the usual credibility gap of demo dashboards — the warehouse is doing real analytical work on numbers an auditor could trace back to Google.

Unfortunately, I did not get to deploy this on Snowflake by the end of the weekend: I tried to sign up for a 30-day trial but my domain kept getting rejected :(

Gemini. The agent never hallucinates a figure because it can't reach one without a tool call, and every tool result carries the TIM provenance string forward into the cited answer. The six-tool surface (five pre-baked + run_sql escape hatch) keeps the model's job narrow: pick the right tool, read the result, narrate in at most four sentences. The v2 tool get_insights({ category? }) returns the same ranked Insight[] the Opportunities panel is reading, so the agent can defer to the policy-lever framing the UI already surfaces instead of re-deriving it from raw SQL. Streaming is wired through the Vercel AI SDK's toDataStreamResponse() so tool-call chips and token-by-token output land in the drawer together. The SELECT-only regex guard on run_sql is a pattern I think is worth stealing — it's one line, it fails closed, and it converts "give an LLM a SQL console" from a scary idea into a reasonable one.

Getting Qwen and Gemma to play Zork (and why they get stuck in the maze)

Alessandro Bahgat — Wed, 08 Apr 2026 12:00:00 +0000

I was testing a local AI model this weekend when it started responding in Thai.

Not gibberish. Actual Thai script, mixed with Chinese characters. I’d asked it to play Zork, the 1981 text adventure, and it was doing everything except that.

This wasn’t what I set out to study. At work, I’ve had good results getting AI agents to respond to cloud alerts. A service throws an error, the agent reads the logs, traces the relevant code, and proposes a fix. But when a fix requires tracing a request from service A through a message queue to service B, then to service C’s database, the agent often gets lost. Not because it can’t reason about each piece. It can reason remarkably well about individual pieces in isolation. It just can’t hold the map.

I wanted to study that limitation in isolation. No pixels, no distributed systems, no production risk. Inspired by Ramp’s experiment getting Claude to play RollerCoaster Tycoon, I picked the simplest possible test of “can an agent find its way around?”: a text adventure.

I also suspected that small local models would struggle with this far more than frontier reasoning models. That made the experiment more interesting, not less: if tools and scaffolding are what let frontier models succeed, then small models are the best stress test for whether your harness is doing its job.

The setup

Zork drops you in front of a white house. You explore rooms, collect items, solve puzzles, and try to score 350 points. The world is a graph of interconnected locations with descriptions, objects, and a few characters. It’s played entirely through typed commands like go north, take lantern, and open trapdoor.

I wired together Jericho (Microsoft’s Python library that runs the original game file and exposes the state machine) with Pi Coding Agent, a TypeScript-based agent framework. Ollama running on an RTX 5080 with 16GB VRAM provided the model, and a custom bridge connected everything together, validating actions against Jericho’s state and logging every turn. I tested two models: Qwen 2.5 14B and Gemma 4 26B.

The goal was simple: tell the agent to play Zork. No hand-holding, no hints. Just: “you are an autonomous explorer, play the game.”

Day 1: why is my agent speaking Thai?

The first attempt was with Qwen 2.5 14B. I gave it a system prompt explaining it was an autonomous Zork player, handed it a tool to send commands to the game, and let it run.

It immediately broke character. Instead of playing, it started explaining how text adventures work. “In Zork, you typically want to explore your surroundings by using commands like LOOK and EXAMINE…” The model has been trained so aggressively on being a conversational assistant that it defaults to helping you play, rather than playing itself.

Fine. I tightened the constraints. Strict system prompt: “You are the autonomous player. Do not speak to me. Execute moves only. English only.”

That’s when it started outputting Thai.

Qwen responding in Thai and Chinese instead of playing Zork

Actual Thai script, interspersed with Chinese characters. Fragments like 推进完毕 (roughly: “progress complete”). Under heavy “no chitchat” constraints, the model was reaching for high-probability tokens outside English. Qwen’s multilingual training means that when you suppress its English conversational patterns hard enough, other languages become the path of least resistance.

This wasn’t random hallucination. It was a pressure valve. And it was the most visceral reminder I’ve had that giving agents uncontrolled access to your systems requires more than a well-written prompt. If you can’t predict what language the model will respond in, you definitely can’t predict what commands it’ll try to run.

Day 2: the architecture pivot

I tried Gemma 4 26B too, Google’s mixture-of-experts model that had been released just two days earlier. It was more stable in English-only mode, which solved the immediate language problem. But swapping the model didn’t change the gameplay much. Both models scored similarly across runs: mostly 0 or 10 points, with occasional flashes of competence.

The real issue was architectural, not model selection.

I’d started with a static prompt template describing how to perceive the game, reason about it, and act. But rigid templates caused the model to output tool calls as plain text instead of actually executing them. The template was teaching it to perform the format, not use the tool.

The fix was to move the intelligence into the dynamic tool output. Every response from the game included not just the text but a state summary: current location, inventory, score, and valid actions. I also added a thought parameter so the model could reason inside the tool call itself, giving it working memory without triggering the conversational assistant pattern.

Gemma 4 playing Zork through Pi's terminal UI, with the thought parameter and game state visible

One unexpected tension: constraining the model too hard (temperature 0.0, strict bans on any non-game output) made it more reliable at calling tools but worse at actually playing. The creative reasoning needed to solve puzzles requires some flexibility. Over-constrained models would execute actions mechanically but make no progress because they never paused to think laterally.

Day 3: the maze

One run hit 35 points in 49 moves: it found the hidden cellar, lit the lantern, navigated the underground, and defeated the troll with the elvish sword. But that was a lucky outlier. What’s consistent across runs is the moment everything breaks: the maze.

Benchmark results: both models scoring between 0 and 10 across automated runs

Zork’s maze is legendary. It’s a set of rooms that all have the same description (“This is a maze of twisty little passages, all alike”) with exits that loop back unpredictably. It’s non-Euclidean: going north and then south doesn’t return you to where you started. Humans solve it by dropping items as breadcrumbs and methodically mapping connections.

The agent walked in circles. For over ten moves, it tried different directions, got the same descriptions, tried again. No breadcrumb strategy. No attempt to map what it had already seen. Each move was a fresh guess with no memory of the previous attempts.

It was stuck.

Why agents get lost

The maze failure isn’t surprising in hindsight, but it’s instructive. The agent wasn’t failing at reasoning. Each individual move was a reasonable attempt to escape. It was failing at spatial cognition: the ability to build and maintain a mental model of a connected space.

A few things made this worse than I expected.

Without a persistent world model, every turn is an isolated event. In one run, the agent found a jewel-encrusted egg (a high-value treasure) and immediately threw it down a grating. No sense that this object might be important later. No concept of consequences spanning multiple turns.

The tooling fought back in unexpected ways, too. Ollama’s repeat_penalty parameter, designed to avoid repetitive output, broke pathfinding above 1.1. The model became reluctant to output go north twice in a row, even when that was the correct path. A parameter designed to improve text quality was destroying navigational logic.

And the interface itself shaped behavior: running the agent through Pi’s chat UI made it more conversational and less autonomous. A headless Python loop with a tight execute-observe-act cycle and no chat played noticeably better.

The microservices parallel

Zork’s maze is a 44-year-old version of a problem I see at work: a graph of nodes that all look similar, where you can only see your immediate surroundings, and where the only way to make progress is to build and maintain a map as you go. Tracing a request through service A → message queue → service B → database C is the same kind of spatial reasoning challenge.

Frontier models with proper tooling handle this much better than my local 14B and 26B models did in Zork. But the pattern that makes them succeed is the same one that would fix the maze: external memory, explicit maps, state injection. The model doesn’t discover the topology on its own. The system provides it. The lesson from Zork isn’t that agents can’t navigate complex systems. It’s that they can’t do it without scaffolding, and the smaller the model, the more scaffolding it needs.

What’s next

This is where I am now: an agent that can sometimes play Zork, sometimes wanders in circles, and reliably gets stuck in the maze. The scores are modest, the tooling is rough, and neither model has a clear edge over the other.

But the experiment is pointing at something real, and I now have a harness to keep pushing. Next, I want to give my agents better tools: maps they can query, memory they can write to, breadcrumbs they don’t have to invent. I’m eager to see how much that changes the scores.

If you want to understand where your agents are getting lost, give one a text adventure. The maze will show you exactly where the reasoning stops and the flailing begins.

This post was originally published on abahgat.com on April 6, 2026. If you enjoyed this piece, you can follow me on LinkedIn for more thoughts on engineering leadership and software.

Visualizing Ukkonen's Suffix Tree Algorithm

Alessandro Bahgat — Wed, 11 Mar 2026 12:05:00 +0000

Learning algorithms from books

I learned most of what I know about algorithms by poring over a copy of Introduction to Algorithms I got while in university. The book is very well known, especially among folks who got a formal education in computer science.

If you have studied it, you know the book: it is over a thousand pages long and it weighs enough to double as a doorstop.

I worked through large sections of it, pen in hand, trying to trace through increasingly complex algorithms, building intuition for their behavior and tradeoffs. The book covers the theory in great depth: correctness proofs, recurrence relations, asymptotic analysis.

But there was often a gap between reading an algorithm and truly understanding it. The book would present pseudocode, sometimes a few diagrams showing state at key moments and theorems about performance characteristics. The work of tracing what actually happens was left as an exercise to the reader. I did that work with pen and paper, drawing trees, crossing out nodes, scribbling indices in the margins. It worked, eventually. But it was slow, error-prone, and the understanding felt fragile.

Implementing from a paper

Years later, I ran into this gap again. I was working on a programming puzzle that required near-instant substring search over a large dataset. After some research, I settled on a Generalized Suffix Tree: a data structure that indexes all suffixes of a set of strings, enabling O(m) lookups where m is the length of the search pattern, even over an extremely large corpus.

The algorithm I chose for building the tree was Ukkonen’s, described in a 1995 paper. The paper is well written and includes the full algorithm in pseudocode:

One of several pseudocode snippets from Ukkonen's paper, describing the update function. Clear on paper, but its translation to working code is much more verbose than this.

It took me a few hours to get right. Not because the pseudocode was wrong: it was precise and correct. The difficulty was that the algorithm manipulates a tree in non-obvious ways. There is an “active point” that walks around the tree. Suffix links connect internal nodes as shortcuts. Three different extension rules fire depending on what is already in the tree and what is being added. The pseudocode tells you what to do, but building an intuition for why it works requires watching it happen.

I did what I always did: I sketched trees by hand. I traced the algorithm on the string cacao, then on banana, drawing and redrawing nodes and edges as each character was processed. When my Java implementation finally produced correct results, I was relieved, but my understanding of the algorithm still felt like it had been assembled from fragments.

The biggest frustration was that I had no way to inspect what my code was actually building. I relied on the usual bag of tricks: print statements, breakpoints, inspecting memory structures one by one in a debugger. But that is like understanding a forest by looking at one tree at a time. What I wanted was to see the whole data structure after each operation — to watch the algorithm work.

The visualization I wish I had

That idea stuck with me: build the algorithm in a language where rendering the data structure is easy, then step through the construction visually. JavaScript and D3.js are a natural fit: the algorithm produces a tree, and D3 is very good at drawing trees.

So here it is. The visualization below builds a suffix tree for the string banana using Ukkonen’s algorithm, step by step. Use the playback controls to move through the construction. The gold-highlighted node is the active point. Dashed arcs are suffix links.

This post was originally published on abahgat.com, where the visualizations are fully interactive — you can step through the algorithm, zoom, and pan. The GIFs here show a condensed version of the construction.

The paper describes the core logic across Sections 2–4. Here is test_and_split, the procedure that decides whether the tree needs to grow, which is a companion to the update function we showed earlier:

Procedure test_and_split from Ukkonen's paper. It returns true when the next character is already in the tree (the end point), and false after splitting an edge to make room for a new branch.

A few things to watch for in the visualization — each one corresponds to something in this procedure:

Branching in update: when test_and_split finds no existing transition for the next character, it splits the edge if needed and update creates a new leaf. These are the moments where the tree visibly grows.
Reaching the end point: when test_and_split finds that a transition for the next character already exists, the algorithm has reached what the paper calls the end point of the current phase. All remaining suffixes are already represented implicitly, so the phase stops. This is the key to the algorithm’s O(n) time: the end point can only move forward through the string across phases, bounding the total work.
Suffix links (the paper’s suffix function f): if an internal node has path-label xα, its suffix link points to the node with path-label α. The update procedure follows these links to jump to the next insertion point instead of walking from the root every time.
Finally, the ”$” terminator converts an implicit suffix tree, where some suffixes may end mid-edge, into an explicit one where every suffix terminates at a distinct leaf.

Adding more strings

A generalized suffix tree indexes multiple strings. Each string is added with its own terminator, and the tree grows incrementally. Below, panama is added after banana. Step through and notice how much of the tree structure already exists from the first string.

Searching

Once the tree is built, searching for a pattern means matching characters along edges from the root. The visualization below has both strings pre-loaded. Try searching for ana, then try pan, ban, xyz.

Try it yourself

An empty tree, yours to experiment with. Add strings, watch the construction, search for patterns. Use the scroll wheel to zoom and click-drag to pan if the tree gets large.

Head over to the original post to experiment with an empty tree: add your own strings, watch the construction step by step, and search for patterns. Use the scroll wheel to zoom and click-drag to pan if the tree gets large.

Beyond suffix trees

What excites me most is how well this generalizes. The gap between an algorithm on paper and an algorithm in memory has always been one of the hardest parts of learning computer science. Textbooks give you static diagrams. Debuggers give you one node at a time. Neither shows you the whole picture in motion.

Browser-based rendering, interactive SVGs, and JavaScript engines fast enough to run non-trivial algorithms client-side make it possible to close that gap for almost any data structure. Red-black trees, B-trees, tries, skip lists, hash tables with open addressing: all of them would benefit from this kind of treatment. Not as a replacement for the theory, but as a companion to it. Read the algorithm, then watch it work.

There is an obvious question lurking here: why bother learning algorithms at all when you can ask an LLM to write one for you? I think the question misses the more interesting possibility. LLMs are not just code generators; they are learning accelerators. You can ask one to explain a single step of an algorithm, to walk through an edge case, or to generate a diagram of how components interact. When I started working in a new codebase recently, the fastest way for me to build a mental model was not reading code or documentation. It was asking an LLM to produce component and sequence diagrams: a much higher-bandwidth channel for understanding, at least for the way I think.

That is the real shift. Not that machines can write algorithms so we don’t have to learn them, but that they can teach us in ways that adapt to how each of us actually learns. Through visualizations, through diagrams, through conversation, through whatever representation makes the concept click. This post is one example. The next one might look completely different, tailored to a different person and a different way of thinking.

We write fewer algorithms from scratch in our day-to-day work than we used to. But we still benefit from understanding them, whether it’s to choose the right data structure, to debug performance issues, or to evaluate tradeoffs. And for those of us who enjoy algorithms for their own sake, the tools for learning them have never been better.

The original Java suffix tree implementation is open source on GitHub. For the full backstory, see the project page and the story of the programming puzzle that started it all. Ukkonen’s original paper remains the definitive reference for the algorithm.

This post was originally published on abahgat.com on March 9, 2026. If you enjoyed this piece, you can follow me on LinkedIn for more thoughts on engineering leadership and software.

The Velocity Paradox

Alessandro Bahgat — Tue, 03 Mar 2026 13:47:00 +0000

We've all been there. You sit down with an AI agent on a Saturday morning to hack on a side project and it feels like magic. Ten minutes in, you are blown away by how quickly the agent can turn even poorly organized thoughts into working prototypes. You feel like you could do this all day.

And clearly, many of us do: we're rediscovering our passion for side projects, and every day a thousand bespoke ToDo apps are born, perfectly tailored to the unique needs of their creators.

At the same time, if you're in an engineering leadership role, you're also seeing your stakeholders dabble with agentic coding. They are shipping side-hustles on the weekend, and respectable work applications in an afternoon. Some of them might even look at you with ill-concealed suspicion. They want to know why their "pet feature" is stuck in a two-week cycle when they just whipped up a functional prototype over coffee.

And they aren't entirely wrong. AI agents have been writing 100% of my code for several months now. Informed by the wins on my side-projects, I wanted to see how much faster we could build at work. During the holiday break, I spent a few hours having Claude write a non-trivial feature that touched our database, cloud infra, mobile app, and the embedded application that runs on our hardware devices at Quilt. What would have taken me a week to write took an afternoon to generate.

Yet it still took weeks to get it tested and merged.

It felt like strapping a rocket engine to a tricycle. Exhilarating, sure, but the road ahead is still full of potholes, and there's a canyon where the bridge used to be. So why isn't the 100x improvement in how fast AI can generate code moving the needle on how fast we can ship features and improvements?

Coding was never 100% of the job. But for those of us managing legacy debt, AI doesn't just fail to solve our problems; it collides with them.

I've been at several conferences recently where I met leaders from "AI-native" companies, organizations founded in an age where agentic coding is the baseline. One founder told me they don't do code reviews at all; their CI pipeline is the reviewer. Another gives agents full control of their production infrastructure. For those of us anchored to a culture that is older than even just two years, these practices feel reckless. Yet even more measured companies are rethinking the fundamentals. OpenAI recently pulled back the curtain with their Harness Engineering article, showing engineering re-architected around AI from the ground up.

For the rest of us, the gap between "generating code" and "shipping value" is becoming a chasm. We are stuck in the Unhappy Middle, where the cost of code is diminishing rapidly, but the cost of review and verification is skyrocketing.

The Unhappy Middle

To understand why the promise of 100x faster progress thanks to AI still feels like an illusion, we have to look at the two forces we're being squeezed by.

On one side, we have the AI-Natives. These are companies and teams founded in the AI era. They have zero legacy debt, they can approach the craft of engineering with an open mind, and they use the same exact "boring" tech stacks the models were trained on. They don't have to go out of their way to "integrate" AI; they are born out of it. They don't have to refactor their code to support automated verification, they never knew a world without it.

On the other side, you have the companies with the slack to reinvent themselves. Shopify's CEO made headlines when he declared that AI proficiency is now a baseline expectation and that teams must justify why a job can't be done by AI before requesting headcount. Companies like that (or Google, I bet) can dedicate teams to rearchitect their codebase, tooling and processes and build the scaffolding that is required to make AI work at scale.

Then, there's the rest of us. I call it the Unhappy Middle.

We support live products and services, with customers trusting us and depending on us daily. The cost of failure is higher than a toy prototype. Unlike your ToDo app, you can't just throw an agent at a problem and hope it doesn't break your production environment.
We have accumulated technical debt as we were racing towards product/market fit, and yet never had the resources to pay it back. We have to balance work on infrastructure and developer experience with business priorities like opening new product lines. Most of these target ambitious schedules which (you guessed right) require taking on additional technical debt.
With the age of Zero Interest-Rate Policies well behind us, but not quite with the coffers of a larger company, we always have to be mindful of our runway, are constantly short-staffed and always "do more with less".

In short, we have to balance the technical complexity of an established company with the reality of a startup. Our survival depends on crossing the chasm as quickly as possible. Not every team is here. If your stack is standard and your tests are green, you may already be seeing the gains. But if any of this sounds familiar, the path forward is harder. Here are some examples from my reality.

Bespoke Frameworks: from Asset to Dead Weight

Before AI, we may have optimized for human speed by building bespoke frameworks, custom boilerplate generators or domain-specific languages and abstractions. For many teams, these were their "secret sauce": internal abstractions that helped teams move fast in 2022. They came at a price (typically, new engineers have to take some time getting comfortable with them), but they often paid off.

Today, those clever optimizations are anchors holding us back. AI agents are brilliant at standard React and Python because they've seen it a billion times. And, at the same time, they are completely illiterate in our proprietary and opinionated internals. Every time I ask an agent to work in our bespoke code, I'm paying an invisible tax: I spend a third of my time fixing hallucinations because our "clever" code isn't in anyone's training set. (I wrote more about why this happens in The Ghost in the Training Set.)

And you know what's funny? That's often why some of the best engineers I know are unimpressed by AI agents: they focus on the last time they saw Claude trip on a gotcha that's specific to their codebase and ignore the fact that it can build flawless React in the blink of an eye.

Zero Slack

We know technical debt is there, we always wanted to increase test coverage, we defer refactoring for testability because we need to fit one more feature before the release cut. We know that frameworks need to be standardized to become "AI-hospitable." But in the Unhappy Middle, you have zero slack. You're always racing, either to hit product-market fit or to extend your runway, and "cleaning up" feels like a luxury you can't afford.

This creates a painful tradeoff. In a side project, or a non-critical business app, failure is cheap. For a company with a legacy codebase, complex release processes and addressing user-critical needs, the stakes are considerably higher. Without the slack to build automated guardrails, we're left with manual human review and auditing.

And that's where the 100x speed gain from AI goes to die.

When Generation Outruns Verification

We often think of the craft of software engineering as composed of several loops, each covering a different stage of the lifecycle, from idea to product. A good visual to illustrate this is the slide below, from a talk Addy Osmani gave at LeadDev New York 2025.

At the center is the Inner Loop: the tight cycle of thinking, coding, building and testing. This is where "flow" happens. Surrounding that is the Submit Loop, where your code goes through linting and code review, and the Outer Loop, where it finally gets deployed and gets tested in the real world.

The promise of AI-assisted engineering is to effectively collapse the Inner Loop. When an agent can "Think" and "Code" a cross-stack feature in a single morning, that center circle feels like it's spinning at the speed of light.

But for those of us who are still in the Unhappy Middle, that loop is often broken before it even starts.

The Broken Inner Loop

The first problem teams are likely to encounter is a broken Inner Loop. Before AI, back in the day when code was expensive to write, tests were the first aspect of a healthy architecture to be sacrificed (or, in the best case scenario, deferred). When we skip writing tests, it's common to end up with code for which it's hard to write tests in the long run.

When you can't give an agent a deterministic way to verify its own work, the feedback cycle doesn't feed back into the AI, it feeds back into you. The agent isn't looping, it's just throwing code over the wall and waiting for you to tell it what happened.

In the best scenario you can imagine, the loop is closed by automation. The agent writes code, runs a test, sees the failure and iterates until it's green. The feedback is a tight, self-correcting circuit.

Without a way to automate verification, you're just making a mountain of work for yourself, or accepting to take an enormous amount of risk by shipping code that hasn't been properly tested.

You were promised AI agents working for you to help you be more effective; instead, you are working for your agents. Not only is it not fun, it's also a huge waste of your time because you are 100x slower than a software agent.

In my world, this isn't just a metaphor. I feel it physically. At Quilt, we make hardware devices, and you can't throw prompt engineering at the physical world. If a test requires me to get up, walk to a test bench and manually press a button, the inner loop isn't just broken; it's wide open.

And there are even worse consequences downstream.

The Slowing Submit Loop

Before AI agents were this capable, the high cost of writing code carried a hidden benefit. If an engineer spent two days wrestling with a complex feature, they effectively distilled a lot of context information into their brain. By the time they put a change up for review, the author was the deepest expert on those 200 lines of code.

That's not how it works today.

As wonderful as the democratizing effect of AI agents is (they enable engineers to contribute well beyond their historical area of expertise), it comes with downsides.

If an agent can't automatically verify its changes, and the author is not the most experienced engineer in the area affected by a change, the bulk of the burden of audit and review will shift to the reviewer.

On the average team, code reviews are assigned to the most experienced engineers in a given area or domain. In this new world, these folks are getting overloaded with more code to review. Worse, they can no longer assume that the author has the same depth of knowledge about the code that reviewers historically could take for granted.

At the extreme, this has multiple effects:

Because the agent did the heavy lifting, the human author may have a shallower understanding of the "why" behind specific implementation choices.
The reviewer is now receiving 10x more code, but with 10x less intent provided by the author. If the reviewer didn't (or couldn't) do a thorough review themselves, it's 10x more code reviews of a higher intensity. Think more of a forensic audit than a style check.
In a legacy codebase with bespoke frameworks, this can be extremely challenging. If neither the author nor the reviewer fully understands the "clever" choices the AI made, they can't distinguish between valuable additions and hallucinations, and therefore are taking a high risk shipping this to production.

The practical consequences are tangible. Code ends up spending more time waiting for review than in development (this is what happened to my proof of concept I mentioned earlier). Your most experienced engineers struggle to be productive themselves because they are drowning in code reviews.

But the most worrisome part is what this does at an emotional level.

From Craftspeople to Janitors

If we take the patterns above to the extreme and let them fester without fixing them, then we are taking on a huge organizational risk by turning our most senior engineers into Janitors.

Instead of going to a challenging workday where, at the end, we experience the joy of having created something new, we now have to pore over someone (or, rather, something) else's code to spot issues and problems. Some engineers feel like they are being paid to clean up AI hallucinations.

This can be deeply demotivating. No one likes being a linear bottleneck downstream of a stage that is accelerating at exponential speed. This is even more difficult at the speed this shift is happening, as many people are mourning the loss of the craft, made worse by simplistic takes about how the world of tomorrow needs fewer engineers.

I still deeply enjoy coding but I recognize that, even in the best of days, a lot of the code I wrote was boilerplate needed to wire together different application components. A very common micro-kitchen joke from my time at Google was that we were all just highly-compensated Protocol Buffer translators.

We miss the 20% of the code we used to write that was high-leverage and intellectually interesting, and forget the other 80% that was toilsome and repetitive.

From Janitors to Gardeners

If you treat every AI-generated PR like a chore to be cleaned up, you are a Janitor. To move fast in a legacy codebase, we need a considerable change in mindset. If you allow me another metaphor, we need to start treating our codebase less like a perfect jewel to polish and more like a plot of land to tend to.

I've been thinking about this metaphor for a while. As you scale an organization, you can't afford to micromanage; you provide structure and support so that decisions happen organically, aligned to what the business needs. The same applies to codebases.

Playing into the metaphor, a gardener may focus their attention on a few things:

Tending the Soil

Hospitable Ground -- Transforming AI-Hostile codebases into an AI-Hospitable playing field requires investing in reducing technical debt, so that AI can't hide behind it. It may mean moving away from bespoke patterns that routinely trip up agents, or making them work reliably. It means standardizing on a well-defined and documented set of abstractions, instead of having 3 different ways to set up an API server because we never finish migrations every time we deprecate an old pattern.
Nutrient-Rich Soil -- Agents are great at brute-forcing their way to a workable solution, but very often they struggle because the codebase lacks information beyond the code itself. Code written in haste often lacks documentation about "Intent" and the "Why" we made decisions. If we don't expose context about tradeoffs and historical decisions, our agents are operating with limited information. Well structured agents.md files are a good start. Checking in architectural guidelines and making them discoverable is increasingly paying off. Ironically, if you keep your design docs locked in Google Docs, your agent is blind to them (hey Google, when can we have MCP access to Google Docs?)

Scaffolding and Direction

Scaffolding -- You don't tell plants how to grow and expect them to listen; you provide scaffolding and support. In software, this can be types, interfaces and architectural boundaries. Well crafted designs that reduce coupling and abstract complexity behind well-defined interfaces are how you give agents a way to grow that is aligned to what you need.
Resilience -- Automated tests, lint checks and verifications are much more helpful for AI agents than they are to humans, as they enable both faster iteration speed and more confidence in the review stage of the submit loop. In the gardening metaphor, this is akin to the sturdy fencing that protects your plants from critters.

I find it ironic that many of the principles above are ones that practitioners have been advocating for under the banner of clean code, test-driven development and many others. We might callously shrug at the idea that we struggled to adopt them for the sake of our human co-workers and are now prioritizing them for the sake of our AI-agents. But the truth is that in the last decade, writing effective tests and good documentation cost us time: the time to think about them, and the time to type them. With AI agents being this capable, the typing cost is approaching zero. What remains is the thinking, and that was always the valuable part.

Building the Dark Factory

Our job is no longer to write the code. It's to build the factory that builds the code.

By now, it should be obvious that if we use AI only to automate the "Coding" stage of the development loop, we may not only struggle to make our team more effective, we may even hurt their effectiveness.

In the same talk by Addy Osmani I referenced earlier, he goes on to show several areas where AI can be effectively adopted to improve developer experience. In my day-to-day work, I've had considerable success using AI agents to troubleshoot bug reports and infrastructure alerts from our production fleet. The gains are real.

There is a growing conversation in engineering circles about "Dark Factories": fully automated systems that run without human intervention. In the age of AI, our job is no longer to write the code; it's to build the factory that builds the code.

Some high-leverage areas to start:

The Verification Machine -- Good test infrastructure should be the top priority. Well-written tests enable AI-agents to have much faster inner loops, but they also greatly help with faster code reviews. With good test scaffolding, you don't just ask "Will this code work in this scenario?" You can ask an agent to demonstrate the expected behavior via a unit test.
Address common tripping hazards for agents -- You likely have a few areas where agents routinely struggle. Don't just scoff when that happens, and use it to say "AI isn't quite there yet". Ask yourself why agents are struggling. Is it because of inconsistent patterns? Lack of context or documentation? Because your bespoke framework requires 1 year of experience in your own codebase to master? Making sure agents don't make the same mistake twice should be part of our responsibilities.
Reducing human dependencies for mechanical tasks -- Invest in building reliable automated end to end tests that rely on production-like observability to spot issues and regressions. Wherever manual testing is required, ask yourself "what would it take for this test to happen automatically?" In a hardware company like Quilt, this means augmenting our ability to perform more tests in software.
The Lights-Out Goal -- Aim to have a "Submit Loop" so robust that if tests pass and the architectural boundaries are respected, the code is "shippable" by default. Even if that goal feels unrealistic (e.g. for code that is security-critical or that runs on devices that are hard to recover), ask yourself "What would it take for me to be 100% confident in a change without needing to review it?"

A word of warning: don't confuse building the factory with building more features. If you ship 10x more features without correspondingly improving your infrastructure, you're taking on a compounding liability. If AI agents today are enabling you to move even just a bit faster than yesterday, aim to put some of those velocity gains towards your scaffolding, instead of putting everything on more features.

Crossing the Chasm

If the smartest AI in the world can't understand your code, it might not be the AI's fault.

The Unhappy Middle is a trap, but it's also an opportunity to rethink what engineering leadership looks like.

This requires a fundamental shift in our ego as developers. Instead of 'pwning' the agent every time it trips on our proprietary abstractions, we need to 'own' our codebase and make it more AI-hospitable. If the smartest AI in the world can't understand your code, it might not be the AI's fault, but it might be a sign that our "cleverness" has become our biggest liability.

If we don't cross the chasm quickly and change our mindset about how we write software, we risk being buried under our own AI-generated slop. The first step is to stop prioritizing just features as our primary output and start prioritizing the speed and accuracy of the factory.

It is notoriously hard to get organizational buy-in to address technical debt. The key is to reframe: this isn't about "cleaning up" to pay off debt, it's about investing in tooling to accomplish 10x velocity.

And even then, there are harder questions ahead. If you actually succeed in building the "factory," you'll quickly find that the technical bottleneck has evaporated, only to leave you with an organizational one. A 10x software factory is effectively useless if it's embedded in a 1x decision-making process. And it is possible that we are approaching a Great Filter-like event for companies in the business of software — one that separates those who adapt from those who drown. But those are topics for another day.

For now, the goal is clear: stop just auditing lines of code and start building the systems that define the future of our industry.

Let us begin.

This article was originally published on abahgat.com on February 23, 2026. If you enjoyed this piece, feel free to follow me on LinkedIn for more thoughts on engineering leadership and software.

The Ghost in the Training Set

Alessandro Bahgat — Mon, 16 Feb 2026 20:10:18 +0000

During the last several weeks, I've run into setting up MCP servers a few times and noticed something surprising. MCP has been gaining popularity and, as things mature, it's running into paradigm shifts. In early 2025, the recommended way to build MCP servers over HTTP switched from SSE (Server-Sent Events) to Streamable HTTP.

To my surprise, the agents I use most (Gemini and Claude) kept reverting to SSE. It wasn't until I started digging that I realized what was happening: the models were haunted by the statistical momentum of their own training data.

Even when LLMs are aware that Streamable HTTP is the standard now—and can competently answer questions about it when asked—the "statistical momentum" in their training data pulls them back to the old standard. Because most of the examples they have seen were written using the old approach, they default to it when generating code.

Note: Astutely, Claude ships with an /mcp-builder skill, which serves as a specialized instruction package. Try building an MCP server with Gemini now and you'll be surprised to get a perfectly functional implementation built on a deprecated pattern.

Why This Is Happening: The Invisible Weight of Training Bias

LLMs don't just "read" instructions in a traditional sense; they weigh them against their internal probability map. If the majority of MCP implementations in their training set used SSE, that creates a massive bias in that direction.

This is a sneaky pattern. We don't naturally think about how old (or new) a model's training set is. If you're working on a bleeding-edge domain, you may find yourself with an agent offering a beautiful implementation that is actually a frozen snapshot of last year's best practices.

Agents thrive on Common Knowledge, but they struggle with Private Context. When we use bespoke patterns or fast-moving standards, we are essentially moving the agent into a zero-shot environment without even realizing it.

From Instructions to Infrastructure

You may be tempted to overcome this through prompting (ALWAYS use Streamable HTTP), but over time, you should move these guardrails into your agents.md files. We need to shift technical standards out of lossy prompts and into the tooling infrastructure.

Well-written skills and tools help a lot here. Anthropic's /mcp-builder is extremely effective in ensuring you land with a well-functioning implementation that overcomes the inherent bias in the models.

The Trap of "Contextual Debt"

Just like code accumulates technical debt, continuously adding to agents.md without cleaning up leads to Contextual Debt. Files become bloated with a mountain of "Don't do X" or "Remember Y."

We are getting to a point where our "Instruction Budget" is as important as our compute budget. If you have clashing instructions across multiple files, you're not just wasting tokens; you're creating "hallucination traps" that are far more expensive to debug than a standard syntax error.

Strategies for Garbage Collecting `agents.md`

Here are a few things that seem to work for me:

Progressive Disclosure: Borrow from the Claude skills playbook. Instead of one giant instruction file, use a modular approach (e.g., a docs/MCP_STANDARDS.md file linked from your root agents.md).
The "Zero-Prompt Test": Periodically run your project with a blank instruction file, especially after model updates. If it works well, your instructions have become cruft. Delete them.
Project-Level Ground Truth: Get your team to own maintaining agent configs as much as they would maintain their editor configurations. Up-to-date documentation is now more precious than ever.

Conclusion: Managing the Agent's AI "Memory"

Regardless of the tropes around "software engineering being dead," more of our job is moving up the stack from writing code.

We are increasingly managing the attention and memory of our agents. The most sustainable systems will be the ones where instructions and scaffolding are pruned as ruthlessly as—if not more than—the code itself.

Enjoyed this?

I write about the intersection of engineering leadership and the "agentic" era. If you're navigating similar paradigm shifts in your own team, let's connect:

LinkedIn: Follow me for more leadership insights
Personal Blog: abahgat.com — where I dive deeper into the code and the strategy.

Receiving Feedback Is A Skill

Alessandro Bahgat — Wed, 02 Sep 2020 15:12:59 +0000

Delivering feedback is a critical part of my day job as a manager at Google. However, it took me a while to realize that receiving feedback is one of the skills that helped me grow the most in my career.

For many of us, our job is the first setting where we receive developmental feedback from people other than our parents or teachers. That experience may be quite shocking.

I still remember the first time I got professional feedback early in my career. I remember almost every single word that my manager chose to use.

What I remember even more vividly though is the strong reaction that feedback caused in me. Within seconds, I got defensive, I felt like I was being criticized, attacked, unappreciated. I heard what they were trying to tell me, but something inside me kept translating that into a personal criticism. A statement about how I, personally, fell short of expectations.

Good feedback sounds like "here's one thing you can do better next time". Better feedback sounds like "here's one thing that you could do differently to achieve a greater result".

Embracing that mindset allowed me to accept, process and build on feedback. While I can't say I prefer criticism over praise, constructive feedback no longer makes me uncomfortable. Instead, I actively seek it.

Changing my mindset around feedback required me to make two key changes:

stop doing things that hurt my ability to improve
start doing things that help build on what I hear

Things I Stopped Doing

Taking It Personally

The main reason I had a difficult time processing feedback is the fact that I often took it personally.

When receiving feedback about something I did, I often read it as feedback about me. Oftentimes, that was not the intention.

Instead of hearing "this email was hard to understand", I heard "you do not communicate effectively". When the other party was saying "this piece of code is brittle", I was hearing "you are a lousy programmer".

I often ended up reacting defensively. I was unable to hear and processing the actual message I needed to receive.

Most developmental feedback will naturally trigger a defensive attitude. That prevents us from getting the full value of what the other person is trying to tell us. We need to make a conscious effort to not jump to defensive mode, and rather engage in active listening.

Arguing With Feedback

Even worse than taking feedback personally, I sometimes found myself wanting to argue with the person delivering it. I wanted to explain why I disagreed with what they were seeing or try to convince them that they were wrong.

In most cases, arguing with feedback is pointless. Take an example from many years ago.

A colleague approached me and told me "I think the comments you left in this review were too harsh".

Now, if they cared enough to bring up this feedback, perhaps they were not the only ones. Or maybe my communication style could have had an unintended effect on some people, some time.

Yes, I could have argued with my colleague, perhaps even convince them that my tone was not that bad. Winning the argument might even have felt better.

That would not have changed the my comments did trigger a negative reaction for them. Quite likely, others might have had the same reaction. Knowing that, having that awareness, made me more thoughtful when writing review comments. I can tell they were better received from that moment on.

Arguing with people who are trying to give us feedback, does not help us. Eventually, people will shy away from telling us where we can improve. It leads us to us working with less information about what we can do to get better. In the long run, we miss out on a significant opportunity.

Things I Learned To Do Instead

Being Thankful

A friend of mine once shared a quote that sounded like "feedback is a gift"

Good feedback is thoughtful and timely. Often, it is as difficult to deliver as it is to receive. It is especially difficult for people we are not very close with.

Any yet, some people choose to take a risk. They let us know where we can do better. They do that knowing well that we may feel hurt by what they say.

Because of this, the first thing I do when receiving feedback is thank whoever is giving it. I thank them because they took a risk and did something uncomfortable. I also thank them because what they are telling me has the potential of making me much better.

Good feedback allows us to identify growth areas. Areas where we could invest more to get better at something we have been trying to do. Even those of us that have good self-awareness often need to work hard to find where they need to improve the most.

If someone is coming to us with feedback, they may be sparing us a lot of hard work required to identify areas of improvement.

The least we can do is thank them profusely for the gift they just gave us and get to work.

Following Up

Whenever I receive feedback about something I can improve and want to work on, I note it down. Over time, this list becomes my feedback log.

Keeping a list of the items I am trying to get better at is a way to hold myself accountable. I go through this feedback log every few weeks and reflect on the progress (or lack of progress) I have seen so far.

This helps me making sure I make the most of the feedback I was generously given and use it to gradually get better. I try to spend some time every week to work on some of the most important items on the feedback log.

Doing this helps me well beyond the result of addressing feedback. It also helps me ground my identity as someone who can accept feedback gracefully and use it as a tool to keep growing every day.

Wrapping Up

A few simple changes in perspective helped me change my view on feedback. I went from seeing it as a threat to my own self-worth to a stepping stone to become a better version of myself.

The results of this attitude compound over time as I keep focusing my energy towards addressing the most critical feedback items.

I originally published this post on my personal website (abahgat.com).

Follow me on Twitter for more content like this.

The Programming Puzzle That Landed Me a Job at Google

Alessandro Bahgat — Mon, 27 Apr 2020 19:27:49 +0000

Back in 2011, as I was getting a bored with my job and I started looking for new options. During my search, my friend Daniele (with whom I had built Novlet and Bitlet years before) forwarded me a link to the careers page of the company he was working for at the time, ITA Software.

While Google was in the process of acquiring ITA Software, ITA still had a number of open positions they were looking to hire for. Unlike Google, however, they required candidates to solve a programming challenge before applying to engineering roles.

The problems to solve were surprisingly varied, ranging from purely algorithmic challenges to more broadly scoped problems that still required some deep technical insight. As I browsed through the options, I ended up settling on a problem that intrigued me because I thought it resembled a problem I might one day wanted to solve in the real world and seemed to try to test both the breadth of my knowledge (it required good full stack skills) as well as my understanding of deep technical details.

I have good memories of the time I spent investigating this problem and coming up with a solution. When I was done, I had learned about a new class of data structures (suffix trees), gained a deeper understanding of Java's internals. A year later, I got a job offer due in part to this puzzle.

The Problem Statement

The brief for the challenge was the following:

Instant Search

Write a Java web application which provides "instant search" over properties listed in the National Register of Historic Places. Rather than waiting for the user to press a submit button, your application will dynamically update search results as input is typed. We provide the file nrhp.xml.gz, which contains selected information from the register's database.

Database The key component of your server-side application is an efficient, in-memory data structure for looking up properties (written in pure Java). A good solution may take several minutes to load, but can answer a query in well under 0.1 ms on a modern PC. (Note that a sequential search of all properties is probably too slow!) An input matches a property if it is found at any position within that property's names, address, or city+state. Matches are case-insensitive, and consider only the characters A-Z and 0-9, e.g. the input "mainst" matches "200 S Main St" and "red" matches "Lakeshore Dr." Note that the server's JVM will be configured with 1024M maximum heap space. Please conform to the interfaces specified in nrhp.jar when creating your database.

Servlet Your servlet should accept an input string as the request parameter to a GET request. Results should include the information for a pre-configured number of properties (e.g. 10), the total number of matches which exist in the database, and the time taken by your search algorithm. Your servlet should be stateless, ie. not depend on any per-user session information. Paginate your additional results as a bonus!

Client Your web page should access the servlet using JavaScript's XMLHttpRequest object. As the user types, your interface should repeatedly refine the list of search results without refreshing the page. Your GUI does not have to be complicated, but should be polished and look good.

Please submit a WAR file, configuration instructions, your source code, and any comments on your approach. Your application will be tested with Tomcat on Sun's 64-bit J2SE and a recent version of Firefox.

Client

I started building this from the UI down.
The puzzle brief mentioned using XMLHttpRequest, so I avoided using any client-side libraries (the functionality I was asked to build on the client was, after all, quite simple).
The screenshot included with the puzzle brief included just a text field for the search query and a list of results.

I wrote a function to listen for key presses, dispatch an asynchronous call to the server and render the response as soon as it came back. By 2011, I had been coding web applications for a while and I was able to implement that functionality in less than an hour of work.

Web application and Servlet code

The Servlet layer was also quite simple, since all it had to was handle an incoming XML request and dispatch it to what the brief called a database. Again, less than an hour of work here.

At this level, I also wrote code to parse the database of strings to index from an XML file containing data from the National Register of Historic Places. The Tomcat server would run this code when loading my web application and use the resulting data to construct a data structure to use as an index for power the fast search functionality I needed to build. I needed to figure that out next.

Finding a suitable data structure

This is, unsurprisingly, the most challenging part of the puzzle and where I focused my efforts the most. As pointed out in the problem description, looping sequentially through the list of landmarks would not work (it would take much longer than the target 0.1ms threshold). I needed to find data structure with good runtime complexity associated with lookup operations.

I spent some time thinking about how I would implement a data structure allowing the fast lookup times required in this case. The most common fast-lookup option I was familiar with, the hash table, would not work straight away with this problem because it would expect the search operation to have the full key string.
In this problem, however, I wanted to be able to look up entries in my index even when given an incomplete substring, which would have required me to store all possible substrings as keys in the table.

After doing some sketching on paper, it seemed reasonable to expect that tries would work better here.

Suffix trees

As I was researching data structures providing fast lookup operations given partial strings, I stumbled upon a number of papers referencing suffix trees, commonly used in computational biology and text processing, offering lookup operations with linear runtime with respect to the length of the string to search for (as opposed to the length of the string to search within).

Plain suffix trees, however, are designed to find matches of a given candidate string sequence within a single, longer, string, while this puzzle revolved around a slightly different use case: instead of having a single long string to look up matches in, I needed to be able to find matches in multiple strings. Thankfully, I read some more and found a good number of papers documenting data structures called generalized suffix trees that do exactly that.

Based on what I had learned so far, I was convinced this type of tree could fit my requirements but I had two likely challenges to overcome:

Suffix trees tend to occupy much more space than the strings they are indexing and, based on the problem statement, "the server's JVM will be configured with 1024M maximum heap space" and that needed to accommodate the Tomcat server, my whole web application and the tree I was looking to build.
Much of the complexity of working with suffix tree lies in constructing the trees themselves. While the puzzle brief was explicitly saying my solution could take "several minutes to load", I did not want the reviewer of my solution to have to wait several hours before they could test my submission.

Ukkonen's algorithm for linear runtime tree construction

Thankfully, had I found a popular algorithm for generating Suffix Trees in linear time (linear in the total length of the strings to be indexed), described by Ukkonen in a paper published in 1995 (On–line construction of suffix trees).

It took me a couple days of intermittent work (remember: I was working on this during nights and weekends -- I had another day job back then) to get my suffix tree to work as expected.

Interestingly, some of the challenges with this stage were revolving around a completely unexpected theme: Ukkonen's paper includes the full algorithm written in pseudo-code and good prose detailing the core steps. However, that same pseudo-code is written at such a high level of abstraction that it did take some work to reconduct it to fast and efficient Java code.

Also, the pseudo-code algorithm is written assuming we are working with a single string represented as a character array, so many of the operations outlined there deal with indices within that large array (e.g. k and i in the procedure above).

In my Java implementation, instead, I wanted to work with String objects as much as possible. I was driven by a few different reasons:

Java implements string interning by default -- there is no memory benefit in representing substrings by manually manipulating indices within an array of characters representing the containing string: the JVM already does that transparently for us.
Working with String references led to code that was much more legible to me.
I knew my next step would be to generalize the algorithm to handle building an index on multiple strings and that was going to be much more difficult if I had to deal with low level specifics about which array of character represented which input string.

Generalized Suffix Trees

This last consideration proved to be critical: generalizing the suffix tree I had up to this point to work with multiple input strings was fairly straightforward. All I had to do was to make sure the nodes in my tree could carry some payload denoting which of the strings in the index would match a given query string. This stage amounted to a couple hours of work, but only because I had good unit tests.

At this point, things were looking great. I had spent maybe a couple days reading papers about suffix trees and another couple days writing all the code I had so far. I was ready to try out running my application with the input data provided with the puzzle brief: the entire National Register of Historic Places, an XML feed totaling a few hundred megabytes.

Trial by fire: `OutOfMemoryError`

The first run of my application was disappointing. I started up Tomcat and deployed my web application archive, which triggered parsing the XML database provided as input and started to build the generalized suffix tree to use as an index for fast search. Not even two minutes into the suffix tree construction, the server crashed with an OutOfMemoryError.

The 1024 megabytes I had were not enough.

Thankfully, a couple years earlier I had worked with a client that had a difficult time keeping their e-commerce site up during peak holiday shopping season. Their servers kept crashing because they were running out of memory. That in turn led me to learn how to read and make sense of JVM memory dumps.

I never thought I would make use of that skill for my own personal projects but this puzzle proved me wrong. I fired up visualvm and started looking for the largest contributors to memory consumption.

It did not take long to find that there were a few memory allocation patterns that were not efficient. Many of these items would hardly be an issue for an average application, but they all ended up making a difference in this case because of the sheer size of the tree data structure being constructed.

Memory micro-optimizations

Analyzing a few heap dumps suggested me a series of possible changes that would lead to savings in memory, usually at the cost of additional complexity or switching from a general purpose data structure implementation (e.g. maps) to special purpose equivalent tailored to this use case and its constraints.

I ranked possible optimizations by their expected return on investment (i.e. comparing value of the memory savings to the additional implementation complexity, slower runtime and other factors) and implemented a few items at the top of the list.

The most impactful changes involved optimizing the memory footprint of the suffix tree nodes: considering my application required constructing a very large graph (featuring tens of thousands of nodes), any marginal savings coming from a more efficient node representation would end up making a meaningful difference.

A property of suffix tree nodes is that no outgoing edges can be labeled with strings sharing a prefix. In practice, this means that the data structure implementing a node must hold a reference to a set of outgoing edges keyed by the first character on the label.

The first version of my solution was using a HashMap<Character,Edge> to represent this. As soon as I looked at the heap dump, I noticed this representation was extremely inefficient for my use case.

Hash Maps in Java are initialized with a load factor of 0.75 (meaning they generally reserve memory for at least 25% more key/value pairs than they hold at any given point) and, more importantly, with enough initial capacity to hold 16 elements.

The latter item was a particularly poor fit for my use case: since I was indexing strings using the English alphabet (26 distinct characters) a map of size 16 would be large enough to accommodate more than half the possible characters and would often be wasteful.

I could have mitigated this problem by tuning the sizing and load factor parameters but I thought I could save even more memory by switching to a specialized collection type.
The default map implementations included in the standard library require the key and value types to be reference types rather than native types (i.e. the map is keyed by Character instead of char) and reference types tend to be much less memory efficient (since their representation is more complex).

I wrote a special-purpose map implementation, called EdgeBag, which featured a few tweaks:

stored keys and values and two parallel arrays,
the arrays would start small gradually grew if more space if necessary,
relied on a linear scan for lookup operation if the bag contained a small number of elements and switched to using binary search on a sorted key set if the bag had grown to contain more than a few units,
used byte[] (instead of char[]) to represent the characters in the keys. Java's 16-bit char type takes twice as much space as a byte. I knew all my keys were ASCII characters, so I could forgo Unicode support here and could squeeze some more savings by casting to a more narrow value range.

Some more specific details on this and other changes to reduce the memory footprint of my suffix tree implementation are in the Problem-specific optimizations section of the Suffix Tree project page.

Conclusion

When I tested out my program after the memory optimizations, I was delighted to see it met the problem requirements: lookups were lightning fast, well under 0.1ms using the machine I had back then (based on an Intel Q6600 2.4GHz CPU) and the unit tests I had written gave me good confidence that the program behaved as required.

I packaged up the solution as a WAR archive, wrote a brief README file outlining design considerations and instructions on how to run it (just deploy on a bare Tomcat 6 server) and sent it over email. Almost a year later, I was packing my bags and moving to Amsterdam to join Google (which had by then acquired ITA Software).

I owe it in no small part to the fun I had with this coding puzzle.

When I think of how much I enjoyed the time I spent building Instant Search, I think it must be because it required both breadth (to design a full stack application, albeit a simple one) and depth (to research the best data structure for the job and follow up with optimizations as required). It allowed me to combine my background as a generalist with my interest with the theoretical foundations of Computer Science.

The careful choice of specifying both memory and runtime constraints as part of the problem requirements made the challenge much more fun. When the first version I coded did not work, I was able to reuse my experience with memory profiling tools to identify which optimizations to follow up with. At the same time, I built a stronger understanding of Java's internals and learned a lot more about implementation details I had, until then, just given for granted.

When ITA retired Instant Search (and other programming puzzles¹), I decided to release the Java Generalized Suffix Tree as open source for others to use. Despite the many problem-specific optimizations I ended up making, it is generic enough that has been used in a few other applications since I built it, which gives me one more thing to be thankful for.

I write about programming, software engineering and technical leadership. You can follow me on twitter for more posts like this.

This post was originally published on abahgat.com on Sep 30, 2019

While the original page is no longer online, the Wayback Machine still has a snapshot of the original page with the original selection of past programming puzzles. They are still a great way to test your programming skills. ↩

Visual and HTML Testing for Static Sites

Alessandro Bahgat — Tue, 01 Oct 2019 18:18:53 +0000

Over a year ago I switched from having my site hosted on a CMS to having it built statically and served as a collection of static pages. I have been extremely happy with the end result for all these months -- the site is very easy to update and effortless to maintain -- but I just made a few changes that made my experience even better.

Why test Static Sites

Even for sites as simple as this, it is surprisingly easy to make breaking changes without realizing. Over the time I have been maintaining abahgat.com, I ended up accidentally introduction bugs more than a few times. Here a few examples of things I ran into:

broken links -- by default, Hugo does not validate any of the links in the content I am editing, which means that I have to be careful and make sure all URLs and paths are valid
incorrect theme configuration -- the more complex the theme I am using is, the more configuration options it will offer. The more options I have to configure, the more likely I am to make mistakes.
bugs in theme customizations -- Hugo is great at allowing to override and customize theme templates. However, this is another source of potential issues.
bugs in the theme code itself -- No software is perfect, and any theme I might be using can have its own bugs and edge cases. This might be especially true for you if you are actively developing your own theme or you frequently update it to the most recent version available.

Most of the issues above still affected me when I was hosting my site on Wordpress (I did break links and styling every now and then) but one advantage of working with a statically generated site is that we can leverage many of the tools that are available to web developers to catch issues early (and potentially block deploys if any issues are detected). So I set out to find what kind of options I had to improve my workflow so that I could make changes with more confidence that I wouldn't accidentally break my site.

What can be tested

Based on the list above, I knew I was looking to set up tests to detect, in order of priority, problems such as:

broken internal links
invalid or malformed HTML
issues with layout or presentation
invalid RSS feed entries

Thankfully, I was able to find a way to cover most of these.

Testing HTML with `html-proofer`

Covering the first items on the list has been fairly straightforward with html-proofer.

Provided you have Ruby installed, you can get html-proofer as a gem via the command below

gem install html-proofer

and then run it via

htmlproofer --extension .html ./public

This will scan the ./public directory for any files with html extension and output a report listing any issues with the markup in those files.

When I first ran it on my site, I got a pretty good list of actionable warnings. The messages are fairly specific and easy to understand, as you can tell by looking at the snippet below:

- ./public/author/abahgat/index.html
  *  356:11: ERROR: Opening and ending tag mismatch: section and div (line 356)
- ./public/author/index.html
  *  356:11: ERROR: Opening and ending tag mismatch: section and div (line 356)
- ./public/blog/index.html
  *  829:2157: ERROR: Unexpected end tag : p (line 829)
- ./public/blog/maps-for-public-transport-users/index.html
  *  internally linking to uploads/2009/01/p-480-320-0e6ac38d-252e-47fa-be79-0ae974dad8d2.jpeg, which does not exist (line 476)
     <a href="uploads/2009/01/p-480-320-0e6ac38d-252e-47fa-be79-0ae974dad8d2.jpeg"><img class="size-full wp-image-364 aligncenter" src="/img/wp-uploads/2009/01/p-480-320-0e6ac38d-252e-47fa-be79-0ae974dad8d2.jpeg" alt="" width="200" height="300"></a>
- ./public/blog/page/2/index.html
  *  linking to internal hash #broken-priorites that does not exist (line 1456)
     <a href="#broken-priorites">The way priorities are managed is broken</a>
  *  linking to internal hash #duplicates that does not exist (line 1453)
     <a href="#duplicates">Lots of issues are duplicates</a>
  *  linking to internal hash #missing-info that does not exist (line 1455)
     <a href="#missing-info">Bug reports do not include enough information</a>
  *  linking to internal hash #processes that does not exist (line 1454)
     <a href="#processes">The system imposes over-engineered processes</a>
  *  linking to internal hash #tracker-misuse that does not exist (line 1452)
     <a href="#tracker-misuse">The issue tracking system is misused</a>

Even with default settings, html-proofer is able to catch most of the issues I was interested in detecting: the list above features a good mix of problems caused by invalid links in my Markdown sources, errors due to how I was misusing my template and bugs in the template I was using.

Fixing the issues required a combination of updating a few broken links, cleaning up the Markdown sources for my site, submitting a few bugs and Pull Requests against the theme I am using.

Overall, all the issues flagged made sense and worth fixing.

Visual Testing with Percy

As useful as html-proofer is, it does not help catching layout and presentational issues that are not due to invalid markup. I have had good experiences with visual testing and review at work and I was interested in using screenshots to detect layout issues and catch any unintended presentational changes on my own site too.

I cared about this because upgrading my Hugo theme sometimes involves non-trivial changes that could go wrong (despite George, the author, keeping really good change logs).

Also, I wanted to make customizations to the theme and having testing in place is the only way I know to make sure I don't inadvertently break anything (since I will not review every single page manually every time I make layout changes, having a way to be warned about any differences is very valuable).

I ended up settling on Percy, a tool that was clearly designed first and foremost for testing dynamic web applications but also offered an option to test static sites via a command line program.

The main idea behind a snapshot testing system is to keep a set of approved snapshots ("goldens"), capture a new set of snapshots upon change and flag any differences for review. Changes can be either intended (in which case the screenshot is approved and becomes the new golden) or accidental (in which case they are flagged as regressions and expected to be fixed before pushing a new version).

Percy offers a nice interface to highlight any difference between snapshots and can be easily integrated with GitHub and other source control systems to make approving any updated snapshots part of the code review process.

Percy runs as a service, so you will need to create an account with them before being able to use it. Once you have done that you can try it by following the instructions on their documentation page and running the following command on your site (where ./public is a directory containing your static pages):

npx percy snapshot ./public

Running tests on every change via CI services

Unlike the HTML tests, which test a specific version of your site in isolation, the value of snapshot testing lies in comparing your site against a previously approved set of snapshots, which need to be kept up to date.

I then configured a simple workflow with CircleCI, having it build my site with Hugo, run html-proofer on the generated sources, grab a fresh set of screenshots on every change and flag any differences for review.

From what I could tell, many other CI services can be configured to do the same; I ended up choosing CircleCI because I thought its Docker-based setup worked better for what I was trying to do and I had little trouble finding Docker images suitable for running the steps in my workflow.

Below the resulting configuration:

version: 2.1

orbs:
  hugo: circleci/hugo@0.3

jobs:
  snapshot:
    docker:
      - image: buildkite/puppeteer:v1.15.0
    steps:
      - attach_workspace:
          at: .
      - run: npm install percy
      - run: PERCY_TOKEN=$PERCY_TOKEN npx percy snapshot ./public

workflows:
  main:
    jobs:
      - hugo/build:
          version: "0.55.6"
          html-proofer: true
      - snapshot:
          requires:
            - hugo/build

The first section sets up build with Hugo via an Orb (Orbs are CircleCI's packages of functionality that can be packaged and reused) that also runs html-proofer tests on the resulting build.

The snapshot task installs percy via npm and then invokes it on the directory containing the sources generated in the previous step. It runs on the Docker Puppeteer image, which comes with most of Percy's package dependencies already installed.

Note:
There seems to be a Docker image maintained by Percy but I could not get it to work. I suspect it is because it ships with an old version of the percy command, I did not investigate this further.

With this configuration, every commit and Pull Request will trigger a Hugo build, run your site through html-proofer and capture a new set of snapshots. If any visual differences are detected, they can be inspected and approved via Percy's web interface.

Note that there is no deploy workflow since I configured Netlify to automatically publish a new version of my site whenever I push to the master branch.

Tweaking the setup

If you got to this point, your configuration will feature sensible defaults and help you capture a number of issues caused by your own mistakes or any issues introduced by the theme upstream.

There are a few opportunities to make the setup more efficient, but they require making changes with the CircleCI configuration above since the Orb we used before does not expose a good way to pass flags to tweak neither the build nor test test. (This might be fixed by the time you read this).

You can click here to see a CircleCI configuration file that you can further customize based on the sections below.

Here some of the tweaks you might consider implementing.

Test pages with a future publish date and drafts

Hugo allows you mark pages as drafts or to set a publish date to a future time (for scheduled content). Neither of these pages will be built by default in your deploy workflow, but you might want to do that when running your tests so that you ensure that content passes validation even as it is being edited (as opposed to being surprised by unexpected errors just when you thought you were ready to publish).

You can do this by passing the -D and -F flags to the hugo command during the build step.

Consider enabling minification

If you are building your site with minification enabled when you are deploying, you might have to make a decision:

if you enable minification only on the deploy workflow (and leave it disabled for development), the version of the site you will be testing will not be identical to the version you are publishing. This might hide subtle bugs that you would not be able to track down easily (such as this one).
on the contrary, if you do enable minification, debugging issues flagged by html-proofer and percy might be slightly more difficult, since the resulting source code will be more difficult to read.

I do not have a firm recommendation here, I am currently working with the latter setup and it has been working fine so far but isolating the cause of an issue is slightly harder this way.

If you want try this, you need to pass --minify to the hugo command during the build step.

Skip redundant screenshots

Just like, when writing unit tests, we don't want to have multiple redundant tests that cover the same behavior, in most cases it is not necessary to take screenshots of pages that use the same template and have very similar content.

For example, if part of your site is a blog that features tags and categories (in Hugo, this would apply to any taxonomy), you will not need to take screenshot of every individual tag page as you won't get much value out of them, since they all look the same. They will rather be a burden to maintain (should your theme ever change, you'd have many more -- very similar -- screenshots to approve).

You can probably make a similar case for directory pages (say, if you have 40 pages of articles, the screenshots for the second to thirty-ninth pages are likely going to be the same.
There could be value in testing the first and last page separately since you'd imagine they would have a different configuration for the next/previous navigation elements, but that is up to you.

Thankfully, the percy command offers a way to manually exclude certain paths from being considered when grabbing screenshots. The syntax for that argument expects globs, which can take some trial and error to get right.

In case it helps, here a configuration that worked reasonably well for me so far:

npx percy snapshot ./public -i \
  'categories/!(coding|coding/**)/*.html',\
  'tags/!(amsterdam|amsterdam/**)/*.html',\
  'blog/page/!(1|2)/*.html'

What the above is doing is excluding all categories but one (Coding) and all tags excluding one (Amsterdam). It is also ignoring any page beyond the second in the /blog directory.

Capture screenshots less frequently

I have yet to run into this limitation but I could see how, if your site is very large and/or if you commit very frequently, you may be concerned about exceeding Percy's free quota (5000 screenshots/month).

I have not had to handle this in any special way so far, but here a few options:

Percy grabs screenshots of each page on your site in both Chrome and Firefox to ensure your site behaves well across browsers. You may decide you are comfortable with taking the risk of having smaller issues undetected and grab screenshots only on one of the two. This will mean you will consume half as many snapshots every time you run visual tests.
Percy will also test your site on a couple different viewport sizes. This is helpful to ensure your site works well on desktop and mobile devices. Again, you may be comfortable with just running tests on one configuration in order to reduce resource consumption by half.
You may configure your CircleCI workflow to toggle the snapshot step manually and run it only when you have meaningful changes to test (e.g. if you are adding new content or upgrading your theme). If you do this, you still want to make sure you refresh your screenshots based on master fairly often, otherwise you might find yourself with visual diffs that cover so many changes together that are no longer informative. And if you run this very infrequently, you might as well just choose to run the percy command locally.

Realistically, for most personal sites, you can likely go a long way with the free quota. If you are considering this for a large corporate site, I would rather consider paying for a higher tier and get more snapshots rather than trying too hard to capture fewer and have a less informative workflow.

Tests are even more valuable if you are a theme developer

If you are developing a theme that others are going to use, testing this way is likely to be even more impactful: you can save yourself quite a bit of time by having a way to catch issues before you ship a new version instead of relying on your users to report problems they run into after they upgrade.

You can apply most of the suggestions above by making sure that you have an example site (the Academic theme I use is great for this) that exercises most of the features in your theme, especially the ones that are not enabled by default. This would also likely reduce the time you spend manually inspecting your pages to make sure they still render as expected.

Conclusion

This has been a great opportunity to learn about great tools that are available out there (I will definitely consider Percy for the next app I will build in my own time) and how they can help greatly even with sites that are statically generated.

I have accomplished most of the goals I had in mind when I started playing with this. There is one item left open for future investigation (mainly, a way to ensure the RSS for my site is valid and well-formed) but the CircleCI workflow I set up gave me a good foundation I can extend to cover more tests.

This post was originally posted on my personal website, abahgat.com, where I write about software, design and human factors.

If you enjoyed this post, you can be notified of more by signing up for this newsletter or following me on twitter

DEV Community: Alessandro Bahgat

Scopewise: Submission for Weekend Challenge: Earth Day Edition

What I Built

Demo

Code

How I Built It

1. The data-integrity problem with synthetic demos

2. Baseline comparison via computeTypicalFlightEmissions

3. Snowflake as the single source of truth

4. Gemini with grounded tools

5. Why a TIM MCP server matters (forward-looking)

6. Architectural decisions

Prize Categories

Getting Qwen and Gemma to play Zork (and why they get stuck in the maze)

The setup

Day 1: why is my agent speaking Thai?

Day 2: the architecture pivot

Day 3: the maze

Why agents get lost

The microservices parallel

What’s next

Visualizing Ukkonen's Suffix Tree Algorithm

Learning algorithms from books

Implementing from a paper

The visualization I wish I had

Adding more strings

Searching

Try it yourself

Beyond suffix trees

The Velocity Paradox

The Unhappy Middle

Bespoke Frameworks: from Asset to Dead Weight

Zero Slack

When Generation Outruns Verification

The Broken Inner Loop

The Slowing Submit Loop

From Craftspeople to Janitors

From Janitors to Gardeners

Building the Dark Factory

Crossing the Chasm

The Ghost in the Training Set

Why This Is Happening: The Invisible Weight of Training Bias

From Instructions to Infrastructure

The Trap of "Contextual Debt"

Strategies for Garbage Collecting agents.md

Conclusion: Managing the Agent's AI "Memory"

Enjoyed this?

Receiving Feedback Is A Skill

Things I Stopped Doing

Taking It Personally

Arguing With Feedback

Things I Learned To Do Instead

Being Thankful

Following Up

Wrapping Up

The Programming Puzzle That Landed Me a Job at Google

The Problem Statement

Client

Web application and Servlet code

Finding a suitable data structure

Suffix trees

Ukkonen's algorithm for linear runtime tree construction

Generalized Suffix Trees

Trial by fire: OutOfMemoryError

Memory micro-optimizations

Conclusion

Visual and HTML Testing for Static Sites

Why test Static Sites

What can be tested

Testing HTML with html-proofer

Visual Testing with Percy

Running tests on every change via CI services

Tweaking the setup

Test pages with a future publish date and drafts

Consider enabling minification

Skip redundant screenshots

Capture screenshots less frequently

Tests are even more valuable if you are a theme developer

Conclusion

2. Baseline comparison via `computeTypicalFlightEmissions`

Strategies for Garbage Collecting `agents.md`

Trial by fire: `OutOfMemoryError`

Testing HTML with `html-proofer`