DEV Community: Harshdeep Singh

AI Data Centers and Nature - What the fuss is really about?

Harshdeep Singh — Sat, 27 Jun 2026 22:18:53 +0000

Every time you ask a chatbot to draft an email, something physical happens a long way away. In a windowless shed the size of a cathedral, thousands of processors light up, draw power from the grid, and dump heat into the air or into water. Multiply that by a billion prompts a day, then by a building boom unlike anything the electricity system has seen in decades, and you arrive at the argument now raging from Dublin to Santiago: is artificial intelligence quietly mortgaging the planet to build itself?

The honest answer is more interesting than either side usually admits. AI data centers are not — yet — a global climate catastrophe. Worldwide they used roughly 1.5% of electricity in 2024, and even the steep growth ahead keeps them under 3% of demand by 2030. Set against electric vehicles, air conditioning or heavy industry, that is a modest slice. But "modest globally" and "harmless locally" are very different claims, and it is at the local level — a town's water table, a neighborhood's air, a family's power bill — where the build-out is already doing real and unevenly distributed harm.

This piece is written from two chairs at once. One is the conservationist's: skeptical of growth that externalises its costs onto rivers, air and people who never consented. The other is the engineer's: respectful of what this technology can do, including for the climate, and allergic to numbers that fall apart under scrutiny. Hold both and a clear position emerges — not "stop AI," and not "trust us, it's fine," but govern the build-out so that nature and communities are treated as stakeholders, not as line items absorbed in the name of progress. Let's walk through what the fuss is about, give the counterarguments their due, and get to the fixes — because there are real ones.

Big and fast — but not the monster under the bed

Start with the numbers everyone fights over, because getting them right is the difference between panic and judgment. According to the International Energy Agency's landmark Energy and AI analysis, the world's data centers consumed about 415 terawatt-hours of electricity in 2024 — and that figure is set to roughly double by the end of the decade.

~945 TWh

Projected data-center electricity use by 2030

Up from ~415 TWh in 2024. The 2030 figure is comparable to Japan's entire electricity consumption today. AI is the single most important driver of the increase.

Source: IEA, Energy and AI (2025)

That sounds apocalyptic until you place it next to everything else drawing on the grid. The IEA is blunt about the proportion: data centers account for less than a tenth of global electricity-demand growth to 2030 — behind the expansion of industry, behind electric vehicles, behind the world's air conditioners. On current trajectories they reach roughly 1% of global energy-related carbon emissions by 2030. A serious number, worth managing. Not, on its own, the thing that decides the climate.

So why the alarm? Because national and local averages tell a different story than the global one. Compute does not spread out evenly like a gas; it clusters where land, fiber and tax breaks are cheap, and then it concentrates demand on grids that were never designed for it. In the United States — home to roughly 45% of the world's data-center electricity use — Lawrence Berkeley National Laboratory estimates data centers consumed about 4.4% of national electricity in 2023 and could reach somewhere between 6.7% and 12% by 2028. And in Ireland, the poster child for concentration, data centers now draw more than a fifth of the entire country's electricity.

22%

Share of Ireland's national electricity used by data centers (2024)

Up from 5% in 2015. Around Dublin, data centers account for roughly half of regional demand — enough that the grid operator has effectively paused new connections.

Source: Ireland Central Statistics Office

That is the real shape of the problem: a technology whose footprint is globally manageable but locally enormous, landing hardest on a handful of places that happen to sit at the crossroads of cheap power and fast fiber. The United States is now on course to use more electricity for processing data by 2030 than for manufacturing aluminum, steel, cement and every other energy-intensive good combined. When an entire industrial category reorganizes around one new load that fast, the strain shows up first — and worst — in specific watersheds, substations and zip codes. The next four sections are about those places.

The thirsty secret nobody put on the label

Of all the impacts, water is the most visceral and the least disclosed. Servers run hot; many large facilities are cooled by evaporating fresh water, and the power plants feeding them evaporate more. For years the industry simply didn't talk about it. Then researchers Pengfei Li and Shaolei Ren at UC Riverside forced the issue with a paper whose title said the quiet part out loud: "Making AI Less Thirsty."

Their estimates are necessarily a range — the companies won't release location-level data, so even rigorous researchers are working from models — but the order of magnitude is striking. Training a single large model in the mid-2020s could directly evaporate hundreds of thousands of liters of clean freshwater. And at the scale of everyday use, a short exchange with a chatbot carries a hidden cost.

~519 mL

Estimated water to generate one ~100-word email with a frontier model

About a standard bottle of water, counting cooling plus the water consumed generating the electricity. Scaled globally, AI could withdraw 4.2–6.6 billion cubic meters a year by 2027 — roughly half the United Kingdom's annual water withdrawal.

Source: Li & Ren, UC Riverside (2023–25)

It is worth being precise here, because precision is exactly what's missing from most coverage. There is a real difference between water withdrawn (taken from a source and largely returned) and water consumed (evaporated and gone). Ren himself has cautioned against the viral, over-confident figures that circulate online; the truthful position is that the numbers are large, uneven, and deliberately hard to verify. That opacity is itself part of the story. When Google's own 2024 environmental reporting shows its data centers consumed about 23 billion liters of water (roughly 6.1 billion gallons) in 2023, the question is no longer whether the thirst is real, but who is bearing it.

The fights over data-center water are sharpest exactly where there is least to spare. Photo: Bartłomiej Balicki / Unsplash

And it is borne, again and again, by places already short of water. In The Dalles, Oregon, Google went to court to keep its water use secret before the city revealed the company's data centers were drinking roughly a quarter of the town's supply. In Cerrillos, Chile, residents of drought-stricken Santiago discovered a planned Google facility could consume billions of liters a year; after a local referendum and an environmental-court challenge, the company switched its design to air cooling — proof that public pressure can change an engineering decision. In Canelones, Uruguay, a Google project was revealed to need millions of liters of potable water a day, equivalent to the daily use of tens of thousands of people, during the worst drought in 70 years — as the capital's tap water turned briny. The protest slogan wrote the headline for them.

"It's not drought — it's pillage."

Protest banner, Montevideo, Uruguay

The pattern even reaches the desert economies betting their futures on AI. Across the Gulf — among the most water-stressed regions on Earth — data-center cooling is projected to need hundreds of billions of liters a year by 2030, much of it produced by energy-hungry desalination. That is the trap in miniature: more compute needs more water, which needs more energy, which needs more cooling. Break the loop in the wrong place and you simply move the damage around. The good news, which we'll come to, is that the loop can be broken — Chile and Uruguay show the lever exists.

How AI is keeping coal alive — and raising your bill

Here is the impact that should worry a climate-minded engineer most, because it runs directly against the energy transition. Faced with sudden, enormous, around-the-clock demand, utilities are doing the expedient thing: keeping old fossil plants running and firing up new gas.

Across the US, analysts have tracked at least fifteen coal plants whose retirements have been delayed since the start of 2025 — plants that together pumped out tens of millions of tonnes of CO₂ in their last reported year. Decades-old "peaker" plants are being pulled back from the brink of closure. The pitch from utilities is straightforward and, in market terms, rational: there is now an economic case to keep these machines around. The climate cost of that rationality is paid by everyone downwind.

15+

US coal plants whose retirement has been delayed since Jan 2025

Driven in significant part by data-center demand. Roughly 60% of fossil generators previously slated for retirement in the largest US grid region have postponed their closures.

Source: Frontier Group / DeSmog analysis

The starkest case is gas built on-site, beyond the reach of the public grid — and of public scrutiny. In South Memphis, xAI's "Colossus" supercomputer fired up a fleet of gas turbines to power its chatbot in a majority-Black neighborhood, Boxtown, that already hosts most of the area's heavy polluters. Many of the turbines ran without the air permits such equipment normally requires. The Southern Environmental Law Center estimated they could emit well over a thousand tons a year of smog-forming nitrogen oxides — potentially making the site the single largest industrial source of that pollution in a city already named a national "asthma capital." When a state representative pointed out that more children are hospitalized for asthma in that neighborhood than anywhere else in Tennessee, he was not making a rhetorical flourish. He was describing the cost of siting an unregulated power plant where the people had the least power to refuse it.

New, around-the-clock demand is reshaping the grid faster than clean supply can be built — and the gap is being filled with fossil power. Photo: Matthew Henry / Unsplash

And then there is your electricity bill. This is the part that turns an abstract debate into a kitchen-table one. When a data center plugs into a regional grid, it competes with households for a finite supply of guaranteed capacity — and prices everyone pays rise accordingly. In the PJM system, which serves 67 million people across 13 US states, the grid's own independent market monitor reached a remarkable conclusion about a single capacity auction.

$9.3 B

Higher electricity costs attributed to data centers in one PJM auction

The grid's independent monitor found data centers responsible for the majority of a record price increase — costs recovered from ordinary customers. Capacity prices later hit their ceiling, and the auction fell short of its reliability target for the first time ever.

Source: Monitoring Analytics (PJM market monitor)

Consumer advocates warn the cumulative toll could run to a hundred billion dollars or more by the early 2030s. In Washington DC, one utility's residential customers saw monthly bills jump by around twenty dollars, roughly half of it traced to those capacity prices. There is something quietly corrosive about a technology marketed as a public good whose first tangible effect, for many people, is a more expensive utility statement.

To their credit, the largest AI firms know fossil expansion is a dead end and are reaching for cleaner firm power — chiefly nuclear. Microsoft has contracted to restart a reactor at Three Mile Island; Amazon, Meta and Google have all signed nuclear or small-modular-reactor deals. These are genuinely good commitments. They are also slow: most of that clean power won't arrive until the late 2020s or 2030s. The demand is here now. The gap, for the moment, is being filled with carbon.

When "net zero" meets a GPU order

For a decade, the hyperscalers were the climate movement's favorite corporations — buying renewables at scale, publishing slick sustainability reports, racing to be "carbon neutral." AI has collided with those promises, and the reports themselves now tell the story.

+48%

Rise in Google's greenhouse-gas emissions, 2019–2023

Driven by data-center energy and supply-chain emissions. Google's report conceded that cutting emissions may get harder as it integrates AI — and it quietly stopped claiming operational carbon neutrality.

Source: Google Environmental Report, via NPR

Microsoft tells a similar tale: total emissions up by roughly a quarter since its 2020 baseline, the increase attributed to the AI and cloud build-out. The detail underneath matters: the company actually cut the emissions from running its operations, but the emissions embedded in building all that infrastructure — the concrete, the steel, the chips, categorized as "Scope 3" — swamped the savings. More than 97% of Microsoft's footprint sits in that supply-chain bucket. Its own chief sustainability officer captured the predicament with rare candor.

In 2020 we called our climate target a moonshot. The moon has gotten further away.

Melanie Nakagawa, Chief Sustainability Officer, Microsoft (paraphrased)

This is the most under-appreciated fact in the whole debate, and the one an engineer should sit with longest: a huge share of AI's climate cost is poured before a single model is trained. Concrete production alone is responsible for as much as 8% of human CO₂ emissions, and an AI campus is an ocean of it. When Hugging Face researchers tallied the full footprint of training one large open model, only about half came from the electricity the chips drew. The rest came from idle infrastructure overhead and from manufacturing the hardware itself. Per-query "efficiency" metrics — the comforting "it's just a few watt-hours" framing — quietly leave most of that concrete and silicon off the books.

Which brings us to the credibility problem with offsets. Buying renewable-energy certificates lets a company claim "100% renewable" on an annual spreadsheet while its servers actually run on whatever the grid is burning at 2am — often gas or coal. It is accounting, not physics. The more honest standard, which the better operators are now adopting, is to match every hour of consumption with clean power on the same grid. We'll return to that distinction in the solutions, because it is the single clearest line between greenwashing and genuine decarbonization.

Land, noise, waste and the stuff inside the box

Energy, water and carbon dominate the headlines, but a data center is a physical intrusion in other ways too, and the people who live nearest feel them first.

Noise is the complaint that turns neighbors into organisers. The cooling systems and backup generators produce a relentless low-frequency hum that ordinary noise ordinances, written around traffic and barking dogs, struggle even to measure. In Chandler, Arizona, residents near one campus described a drone that simply never stopped — and the city eventually amended its zoning rules and began questioning whether new data centers belonged there at all. As one official put it, the math that made these buildings welcome a decade ago no longer adds up for the community hosting them. By 2026, families near another AI power-plant site had filed a federal class action on behalf of more than ten thousand people.

E-waste and materials are the hidden tail of the hardware race. The world already generates well over 60 million tonnes of electronic waste a year, and the rapid refresh cycles of AI accelerators — yesterday's cutting-edge chip is next year's scrap — add to the pile, much of which ends up leaching toxins in places like the Agbogbloshie dump in Accra. Upstream, the chips and magnets depend on copper, silicon, gallium and rare-earth elements whose mining scars landscapes and poisons water tables; by 2030, AI infrastructure could account for a meaningful slice of global demand for several of these minerals. When China tightened exports of gallium and rare earths, prices outside the country more than doubled in months — a reminder that "the cloud" rests on some very earthbound and geopolitically fragile supply chains.

None of these is civilization-ending. Together they make the point that a data center is not a clean abstraction humming in cyberspace. It is concrete poured on land, water pulled from a river, metal dug from a mountain, and noise pushed into a bedroom window. The question is whether those costs are acknowledged and compensated — or simply absorbed by whoever lives closest.

What AI gives back — stated fairly

An honest accounting cannot be a prosecution. If we only tallied the costs, we would be telling half the truth — and the engineer's chair will not allow it. AI is also one of the more powerful tools we have for fighting the very crisis its data centers strain. These benefits are real, and several are already in production.

Steelman · AI as a climate tool

Google DeepMind's GraphCast produces ten-day weather forecasts in under a minute on a single machine, more accurately than the gold-standard physics model it was benchmarked against — a direct boon for integrating wind and solar and for warning people ahead of extreme weather. AI has improved solar-output forecasting by around 40% in UK trials, helps satellites pinpoint methane leaks, accelerates the discovery of better batteries and solar materials, and — in a tidy irony — a DeepMind system cut the energy used to cool Google's own data centers by roughly 40%.

The efficiency trend is genuinely remarkable, and it deserves to be stated as plainly as the alarms. The IEA notes that the energy needed for a given AI task has been dropping by at least an order of magnitude a year — an improvement rate it calls essentially unprecedented in energy history. Each new chip generation does dramatically more compute per watt. Google now reports getting many times more computing out of each unit of electricity than it did five years ago, and runs some of the most efficient facilities on Earth, with a fraction of the overhead of a typical corporate server room.

~49%

Share of all global corporate clean-energy buying by the four big cloud firms

Amazon, Microsoft, Google and Meta are, collectively, the largest force funding new wind, solar and — increasingly — nuclear and geothermal capacity onto the grid. The same firms straining the system are also its biggest clean-energy customers.

Source: BloombergNEF (2026)

There is an economic case too, and it is not nothing: the data-center build-out represents hundreds of billions of dollars of investment, tax base and construction work. And concentrating compute in a hyperscale facility is genuinely more efficient than the same work scattered across thousands of small on-premises servers. A reflexive "data centers are bad" misses all of this. And the cost of not building is real too: this compute underpins the very climate-modeling, materials-science and grid-forecasting tools described above, and a country that bans data centers outright does not abolish the demand — it exports the jobs, the tax base and the emissions to wherever the rules are weakest. The honest question is not whether to build, but how.

But — and the conservationist's chair insists on the "but" — three honest caveats keep the optimism grounded. First, Jevons' paradox: when something gets cheaper and more efficient, we tend to use vastly more of it, and the new appetites of AI (video generation, "reasoning" models that think in long chains, autonomous agents) can each consume hundreds of times more energy than a simple query, swamping the per-task savings. Second, the efficiency story is real but it is not consent: a town's drained aquifer is not comforted by a favourable global average. Third, the benefits and the costs land on different people — the climate models and the shareholder returns accrue broadly, while the noise, the water stress and the power bills concentrate on specific communities. Progress that is real in aggregate can still be unjust in distribution. Both things are true at once, and a mature position has to hold them together.

The fixes are real — and mostly already invented

If the problem were intractable, this would be a gloomier essay. It isn't. Almost every harm above has a known, demonstrated solution; what's missing is not technology but the will, the disclosure and the rules to make the solutions standard rather than optional. Here is what accountability actually looks like, in concrete terms.

None of the fixes below is speculative. Each is already running somewhere — the task is to make it the default everywhere. Photo: Karsten Würth / Unsplash

Cool without drinking the river

The water problem is, increasingly, an engineering choice rather than a necessity. Microsoft has rolled out a zero-water cooling design — a closed loop that circulates the same coolant for the life of the facility instead of evaporating fresh water — and says it will save more than 125 million liters per data center per year. Direct-to-chip and full immersion cooling (literally bathing servers in a non-conductive fluid) cut cooling energy substantially and are now essential anyway, because the densest AI racks run far too hot for old-fashioned air. Where water must be used, it can be reclaimed or non-potable rather than drinking-quality. The Chilean fix — trading some water for a little more electricity by using air — is available to anyone willing to choose it.

Match clean energy by the hour, not the spreadsheet

This is the dividing line between marketing and decarbonization, so it deserves its own table.

Two ways to claim "clean" — and why only one is honest

	Annual REC matching	24/7 carbon-free energy
What it measures	Total clean energy bought over a year, anywhere	Every hour of use matched with clean power on the same grid
Reality at 2am	Servers may run on coal or gas; certificates paper over the gap	Consumption is actually backed by clean supply, hour by hour
Drives new clean build?	Weakly — rewards cheapest certificates	Strongly — forces investment in storage, geothermal, nuclear
Honest label	"We bought a year's worth of renewables"	"We ran on clean power"

Google has committed to running on 24/7 carbon-free energy by 2030 and already matches roughly two-thirds of its hourly consumption, with several regions above 80%. Firm clean power is the hard part of that equation, which is why the most interesting deals are in advanced geothermal — Google's partnership with Fervo Energy is putting always-on, weather-independent clean power onto the grid — and in the nuclear and small-modular-reactor commitments now scaling up. The point is not that any one source wins; it's that "clean by the hour" forces the build-out of exactly the firm, around-the-clock clean capacity the whole grid needs.
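The gap between the two columns of that table is easy to see with a toy calculation. The sketch below uses invented numbers (a flat load and a solar-like clean supply over one day; neither profile comes from any operator) and two hypothetical helper functions. One scores matching the way an annual certificate tally would, letting surplus in one hour cancel a deficit in another. The other only counts clean energy in the hour it is produced.

```python
# Illustrative only: every number here is hypothetical.

def volumetric_match(load, clean):
    """Certificate-style score: total clean energy bought / total energy used, capped at 1."""
    return min(sum(clean) / sum(load), 1.0)

def hourly_match(load, clean):
    """Hour-by-hour score: clean energy only counts in the hour it is produced."""
    matched = sum(min(l, c) for l, c in zip(load, clean))
    return matched / sum(load)

# Constant data-center demand, in MWh per hour, for 24 hours.
load = [100.0] * 24
# Solar-like clean supply: nothing overnight, a surplus from 06:00 to 18:00.
clean = [0.0] * 6 + [200.0] * 12 + [0.0] * 6

print(f"volumetric: {volumetric_match(load, clean):.0%}")  # volumetric: 100%
print(f"hourly:     {hourly_match(load, clean):.0%}")      # hourly:     50%
```

Under these assumptions the same purchases read as "100% renewable" on the spreadsheet while covering only half of the hours the servers actually ran. Closing that second number is what pushes investment toward storage and firm clean power.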

Put the waste heat to work

A data center is, thermodynamically, a giant heater that we currently throw away. The Nordics treat that heat as a resource instead. In Finland, a Microsoft project will pipe its waste warmth into a district-heating network serving the equivalent of around 100,000 homes; a Meta facility in Denmark already exports its heat to thousands of households; Stockholm aims to warm a tenth of the city this way. Captured and reused, European data-center heat could in principle cover a meaningful share of the continent's space heating, delivered more cheaply than gas. The technology is plumbing. The obstacle is that almost nobody is required to do it.

Make the numbers public — then make rules

Everything above depends on one unglamorous foundation: disclosure. You cannot manage what no one will measure, and right now most operators reveal little — by one industry survey, fewer than half even track their water use. The European Union has started to fix this, requiring data centers above a certain size to report their energy, water and efficiency to a public database. Germany has gone further, mandating waste-heat reuse and capping how inefficiently new facilities may run. Ireland now requires big new data centers to bring their own clean generation. And in the US, Oregon created the first dedicated electricity rate class for data centers, so that the cost of their demand falls on them rather than on households. More than forty such bills moved through US statehouses in a single year. This is what a sane settlement looks like: measure honestly, reuse what you can, and make the industry pay its own way.

A world for people, not just machines

Step back from the terawatts and the liters and the question underneath comes into focus. We are, very fast and with very little public debate, rebuilding the physical substrate of the planet — its power plants, its water rights, its land and air — around the needs of machine intelligence. That may turn out to be one of the better bets humanity has made. But a bet made only on the machines' terms, with rivers and neighborhoods and the climate treated as costs to be absorbed quietly, is not progress. It is enclosure with better branding.

The encouraging truth running through all of this is that we are not choosing between AI and a liveable world. The efficiency gains are real and rapid. The clean-energy demand from these same companies is the largest in the world. The cooling that doesn't drink rivers, the heat that warms homes, the power matched clean by the hour, the rules that make firms pay their way — none of it is science fiction. It is sitting in pilot projects and regulations and engineering specs right now, waiting to be made standard. The gap is not capability. It is accountability.

So the ask is specific, and it falls on three groups:

Regulators should make disclosure mandatory, make data centers pay for the grid and water they demand, and close the permitting loopholes that let unregulated power plants rise in the neighborhoods least able to fight them.
Companies should move from annual certificates to clean energy matched by the hour, publish their full footprint including the concrete and the chips, default to water-free cooling and heat reuse, and sign genuine community-benefit agreements before breaking ground — not after the lawsuits.
The rest of us should refuse the false choice between technophobia and blind faith, ask where our compute comes from, support the local organisers and the transparency laws, and reward the firms that choose the honest path over the cheap one.

The fuss, in the end, is not really about data centers. It is about whether the most powerful technology of our age will be built with the living world or against it — whether the future we are pouring concrete for has room in it for clean rivers, breathable air and people who can pay their electricity bills, alongside the machines. That future is still ours to specify. We should write it down before someone else pours it.

Local AI - How to Run Open Source AI Models Locally

Harshdeep Singh — Sat, 27 Jun 2026 22:18:27 +0000

There is a particular moment that hooks every developer on local AI. You type a question into a terminal, hit enter, and watch a coherent answer stream back — with your Wi-Fi off, no API key, no usage meter ticking, nothing leaving your laptop. The model is just there, running on silicon you already own.

Getting to that moment used to require a research-lab pedigree. It no longer does. In 2026, a mid-range laptop can run models that would have been considered frontier-class a couple of years ago, and the tooling has matured from finicky Python scripts into one-line installers. The catch is that the landscape is now wide: a dozen serious tools, hundreds of models, and a thicket of jargon — GGUF, quantization, KV cache, MoE, offloading — standing between you and that first streamed token.

This guide is the map. I'll assume you're a competent developer but new to running models locally, and I'll take you from vocabulary to a working setup, with enough depth that intermediate and senior engineers get the why behind each decision, not just the how. By the end you'll be able to do three things with confidence: pick the right open source model for a given job, configure it for your specific hardware, and run it successfully — whether you're on a MacBook Air, a gaming rig with an NVIDIA card, or a CPU-only workstation.

One promise up front: I won't pretend local always beats the cloud (it doesn't, at the very high end), and I won't bury the tradeoffs. Local AI is the right call for privacy, cost, offline capability, and control. Let's make those wins real.

The one-paragraph version: if you read nothing else, install Ollama (or LM Studio if you want a GUI), pull a 7–8B model in Q4_K_M quantization, and you’re running local AI in ten minutes. The single number that decides what you can run is memory: VRAM on a GPU, or unified memory on a Mac. Everything else in this guide is detail on top of those two facts.

The vocabulary you need

This field has a dialect, and most tutorials assume you already speak it. Let's fix that first. Skim this section now, then refer back when a term trips you up later — it's designed as a glossary you can return to, not a wall to memorize in one pass.

The foundational terms

LLM (Large Language Model). A neural network trained to predict the next token of text. That simple objective, at scale, produces the chat, code generation, and reasoning we find so useful. Everything you’ll run is an LLM or a close cousin.

Open source vs. open weight. This distinction matters more than most people realize. Open weight means the trained parameters are downloadable and you can run them yourself. Open source, in the strict sense, additionally requires open training data and code, and a license with no restrictions on who can use it or for what. Most "open" models — Llama, Qwen, Gemma, DeepSeek — are open weight. Only some, typically those under Apache 2.0 or MIT licenses, approach genuine open source.

Parameters (weights). The learned numbers inside the network. "7B" means seven billion parameters. More parameters generally means more capability — and more memory required to hold the model.

Tokens and tokenization. Models don’t read words; they read tokens. A token is roughly four characters or about three-quarters of a word. When you see "tokens per second," that’s the unit of generation speed.
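The four-characters-per-token rule is handy for back-of-the-envelope planning. Here is a rough sketch of how I'd use it; both helper names are made up for illustration, the ratio only holds loosely for English prose, and the 25 tokens per second figure is a hypothetical speed rather than a benchmark. For exact counts you would use the model's own tokenizer.

```python
def estimate_tokens(text):
    """Rule-of-thumb token count: about four characters per token (English prose)."""
    return max(1, round(len(text) / 4))

def seconds_to_generate(n_tokens, tokens_per_second):
    """How long a reply of n_tokens takes at a given generation speed."""
    return n_tokens / tokens_per_second

prompt = "Explain quantization in one short paragraph."
print("prompt tokens (approx):", estimate_tokens(prompt))

# A 500-token answer at a hypothetical 25 tokens per second.
print("seconds for a 500-token reply:", seconds_to_generate(500, 25))
```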

Context window (context length). How many tokens the model can hold in its attention at once — your prompt plus its output combined. Older models maxed out around 4,000 tokens; modern ones reach 128,000, 256,000, and in a few cases over a million.

Inference. Running a trained model to produce output. This is distinct from training (creating the model) and fine-tuning (adapting it). Everything in this guide is about inference.

The terms that actually decide your setup

Quantization. The most important concept after parameter count. Models are typically trained and released in 16-bit precision, but you can compress the weights to 8-bit or 4-bit to shrink memory use and speed up inference, at the cost of a little quality. The common levels:

FP16 / BF16: 16-bit (half) precision, the uncompressed baseline. Two bytes per parameter.
Q8_0: 8-bit, essentially indistinguishable from the original. About one byte per parameter.
Q5_K_M: 5-bit, a high-quality middle ground.
Q4_K_M: 4-bit, the widely recommended sweet spot, roughly 70–75% smaller than FP16 with only a 1–3% quality drop.
GGUF: not a quantization level but the file format that packages a quantized model into a single file. It’s what Ollama, LM Studio, and llama.cpp all consume.
GPTQ / AWQ / EXL2: alternative quantization schemes optimized for GPU-based serving.
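To turn those levels into a memory number, multiply parameters by bits per weight. The sketch below uses the nominal bit widths named in the list (16, 8, 5, 4) as an assumption. Real GGUF files come out somewhat larger, because the quantized blocks also store scaling factors and some tensors are kept at higher precision, so treat the results as a floor rather than a file size.

```python
# Nominal bits per weight for each level; actual files are a bit larger.
BITS_PER_WEIGHT = {"FP16": 16, "Q8_0": 8, "Q5_K_M": 5, "Q4_K_M": 4}

def weight_gb(params_billion, quant):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    bytes_total = params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8
    return bytes_total / 1e9

for quant in BITS_PER_WEIGHT:
    print(f"8B model at {quant}: ~{weight_gb(8, quant):.1f} GB")

# Nominal saving of 4-bit versus 16-bit weights.
saving = 1 - weight_gb(8, "Q4_K_M") / weight_gb(8, "FP16")
print(f"Q4_K_M vs FP16: ~{saving:.0%} smaller")
```

This counts weights only. The KV cache and runtime buffers need room on top, so leave headroom rather than sizing a model to exactly fill your memory.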

VRAM vs. RAM. VRAM is the dedicated memory on a discrete graphics card; RAM is your system memory. On a machine with an NVIDIA or AMD GPU, the model must fit in VRAM to run at full speed.

Unified memory. On Apple Silicon (and a few new AMD chips), the CPU and GPU share one fast pool of memory. The GPU can use almost all of your system memory — which is why a 64GB MacBook can punch far above a gaming GPU on large models.

GPU offloading (layer offloading). When a model is too big for your VRAM, you can keep some of its layers on the GPU and push the rest to system RAM. The model still runs — but the offloaded portion is dramatically slower.

Metal / CUDA / ROCm / Vulkan / SYCL. The hardware-acceleration backends: Apple, NVIDIA, AMD, a cross-vendor fallback, and the Intel path respectively.

MoE (Mixture of Experts). An architecture where only a fraction of the total parameters — the "active" parameters — fire for any given token. You get the quality of a big model with the compute cost of a small one. The catch: you still have to hold all the parameters in memory. Plan memory by total parameters; plan speed by active parameters.

Mental model: A Mixture-of-Experts model is like a hospital with fifty specialists on staff but only four seeing any given patient. You pay the rent on the whole building (memory), but each visit is fast because only a few doctors are involved (compute). A 30B-A3B model has 30 billion parameters total but only ~3 billion active per token.

The terms you’ll see in benchmarks and settings

KV cache. As a model generates, it stores the attention keys and values for every previous token so it doesn’t recompute them. This cache grows with context length, and at long contexts it can consume as much memory as the model weights themselves.

Temperature, top-p, top-k. Sampling controls that govern randomness. Lower temperature produces more deterministic output; higher is more creative. Top-p and top-k limit the pool of candidate tokens.

System prompt. A hidden instruction that sets the model’s role and behavior before the conversation begins.

Throughput vs. latency. Throughput is total tokens per second across all requests; latency is how fast a single response comes back. Tokens per second is the headline speed number, and time to first token measures how snappy the model feels.

Fine-tuning vs. RAG. Two ways to make a model "know" your data. Fine-tuning retrains the model on your examples; RAG (retrieval-augmented generation) leaves the model untouched and feeds it relevant documents at query time. For most use cases, RAG is the cheaper, faster, more maintainable choice.

Embeddings. Numerical vector representations of text that capture meaning, used for semantic search and as the backbone of RAG systems.

Distillation. Training a smaller model to imitate a larger one. DeepSeek’s R1 "distill" models bring large-model reasoning to consumer hardware.

Multimodal / vision-language models. Models that accept images (and sometimes audio or video) alongside text.

Reasoning models. Models trained to "think out loud" — producing an explicit chain of reasoning before their final answer. DeepSeek-R1 and OpenAI’s gpt-oss are leading examples.
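To make the sampling controls defined above (temperature, top-k, top-p) concrete, here is a minimal sketch in plain Python. The function name, the toy logits, and the order in which the filters are applied are illustrative assumptions; real runtimes differ in the details.

```python
import math

def filtered_distribution(logits, temperature=1.0, top_k=None, top_p=None):
    """Return {token: probability} after temperature, top-k and top-p filtering."""
    # Temperature rescales logits: below 1 sharpens the distribution, above 1 flattens it.
    scaled = {tok: value / temperature for tok, value in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(v - peak) for tok, v in scaled.items()}
    total = sum(weights.values())
    probs = sorted(((tok, w / total) for tok, w in weights.items()),
                   key=lambda pair: pair[1], reverse=True)

    # Top-k keeps only the k most likely tokens.
    if top_k is not None:
        probs = probs[:top_k]

    # Top-p keeps the smallest prefix whose cumulative probability reaches p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for tok, p in probs:
            kept.append((tok, p))
            cumulative += p
            if cumulative >= top_p:
                break
        probs = kept

    # Renormalize so the surviving candidates sum to 1.
    norm = sum(p for _, p in probs)
    return {tok: p / norm for tok, p in probs}

# Hypothetical logits for four candidate tokens.
toy_logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0, "rock": -1.0}
print(filtered_distribution(toy_logits, temperature=0.5, top_k=3))
print(filtered_distribution(toy_logits, temperature=1.0, top_p=0.9))
```

A sampler would then draw one token at random from the returned distribution. Lowering the temperature concentrates probability on the top candidate, which is why low-temperature output looks more deterministic.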

The memory math that governs everything

If you internalize one section of this guide, make it this one. Almost every question you’ll have — "Can I run this model?" "Why is it so slow?" "Which quantization should I pick?" — reduces to a single question: does the model fit in fast memory, and if not, how much are you willing to spill into slow memory?

Estimating how much memory a model needs

The weights of a model take up a predictable amount of space based on parameter count and quantization. The rule of thumb:

Memory (GB) ≈ parameters (billions) × bytes-per-parameter × 1.2, where bytes-per-parameter ≈ 2.0 (FP16), 1.0 (Q8_0), ~0.7 (Q5_K_M), ~0.55 (Q4_K_M). The 1.2 accounts for overhead.

So a 7B model needs about 14GB for the weights alone at full precision (closer to 17GB once you apply the 1.2 overhead factor), about 8.4GB at Q8_0, and about 4.6GB at Q4_K_M. The handy shortcut: at Q4_K_M, every billion parameters costs roughly 0.55–0.7GB. Here’s the reference table:

Model size	Q4_K_M (weights)	Q8_0 (weights)	Typical GPU it fits
3B	~2 GB	~3.5 GB	Almost anything, even 4GB
7–8B	~4.5–5 GB	~8 GB	8GB cards comfortably
13–14B	~8 GB	~14 GB	12GB cards
27–32B	~18–20 GB	~34 GB	24GB cards (3090/4090)
70B	~40 GB	~75 GB	48GB+, or a high-RAM Mac
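The rule of thumb is easy to turn into a small calculator. This is a sketch that simply encodes the formula and bytes-per-parameter values given above; the function names are made up for illustration.

```python
# Approximate bytes per parameter, taken from the rule of thumb above.
BYTES_PER_PARAM = {"FP16": 2.0, "Q8_0": 1.0, "Q5_K_M": 0.7, "Q4_K_M": 0.55}

def estimate_memory_gb(params_billions, quant="Q4_K_M", overhead=1.2):
    """Estimated memory in GB: parameters x bytes-per-parameter x overhead."""
    return params_billions * BYTES_PER_PARAM[quant] * overhead

def fits(params_billions, quant, budget_gb):
    """True if the estimate is within the given memory budget."""
    return estimate_memory_gb(params_billions, quant) <= budget_gb

for size in (3, 8, 14, 32, 70):
    row = ", ".join(f"{q}: {estimate_memory_gb(size, q):.1f} GB" for q in BYTES_PER_PARAM)
    print(f"{size}B -> {row}")
```

Because the table lists weights only, these estimates (which include the 1.2 overhead factor) come out somewhat higher than the table's figures. Pass `overhead=1.0` to compare like for like.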

Don’t forget the KV cache

The weights are only part of the story. The KV cache grows linearly with context length, and at long contexts it can rival or exceed the weights. A Llama-3-8B at 32K context burns roughly 4GB on KV cache alone. Push to 128K and the cache can dwarf the model.

Two things rescue you. First, nearly every model released in 2025 and 2026 uses Grouped-Query Attention, which cuts cache size by 50–75% for free (the Llama-3-8B figure above already includes this saving). Second, you can quantize the KV cache itself: 8-bit roughly halves its footprint, and 4-bit cuts it to about a quarter.

Common trap: When a model card says "runs in 8GB," that almost always means weights only, at a short context. Budget an extra 1–2GB for the KV cache and overhead at modest context lengths — and far more at long ones.
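The KV cache figure can be checked with a few lines of arithmetic. The sketch below assumes the published Llama-3-8B shape (32 layers, 8 KV heads under Grouped-Query Attention, head dimension 128) and treats 8-bit cache quantization as exactly one byte per value, which ignores the small per-block overhead real formats add.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """Keys and values (factor 2) for every layer, KV head and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * context_tokens

GIB = 1024 ** 3
# Llama-3-8B shape: 32 layers, 8 KV heads (GQA), head dimension 128.
llama3_8b = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (8_192, 32_768, 131_072):
    fp16 = kv_cache_bytes(context_tokens=ctx, **llama3_8b) / GIB
    q8 = kv_cache_bytes(context_tokens=ctx, bytes_per_value=1, **llama3_8b) / GIB
    print(f"{ctx:>7} tokens: {fp16:.1f} GiB at 16-bit, ~{q8:.1f} GiB at 8-bit")
```

The linear growth is visible directly: every quadrupling of context quadruples the cache, and at 128K tokens the 16-bit cache is several times larger than the Q4_K_M weights of the same model.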

The MoE wrinkle

Mixture-of-Experts models break the simple mental model. Take Qwen3-30B-A3B: 30 billion total parameters but only ~3 billion active per token. It generates as fast as a 3B model, but you still need enough memory to hold all 30 billion. So: size your memory by total parameters, size your speed expectations by active parameters.

What happens when it doesn’t fit: offloading

When a model exceeds your VRAM, tools like llama.cpp and Ollama automatically offload the excess layers to system RAM. This prevents a crash, but it’s slow — system RAM bandwidth (roughly 50–70 GB/s on a dual-channel DDR5 desktop) is an order of magnitude below GPU VRAM (around 1,000 GB/s on an RTX 4090). Offloading 10–20% is often tolerable; offloading half will make you wish you hadn’t.

This leads to the most important hardware insight in the guide: token generation speed is governed by memory bandwidth, not raw compute. A model generates roughly as fast as your memory bandwidth divided by the model’s size in memory.
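As a rough sketch of that bandwidth rule, the function below divides memory bandwidth by the bytes read per generated token. It assumes every active weight is read once per token and ignores KV cache reads and compute, so the result is a ceiling rather than a benchmark; the function name is illustrative, and the bandwidth figures are the ones quoted in this guide.

```python
def max_tokens_per_second(bandwidth_gb_s, active_params_billions, bytes_per_param=0.55):
    """Upper bound: each generated token reads every active weight once."""
    gb_read_per_token = active_params_billions * bytes_per_param
    return bandwidth_gb_s / gb_read_per_token

# Treat these as ceilings; measured speeds land well below them.
print(max_tokens_per_second(1000, 7))   # dense 7B at Q4_K_M on an RTX 4090
print(max_tokens_per_second(60, 7))     # same model read from dual-channel DDR5
print(max_tokens_per_second(1000, 3))   # 30B-A3B MoE: only ~3B active per token
```

The same function shows why MoE models are fast: speed depends on the ~3B active parameters, even though memory capacity must still cover all 30B.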

The tooling landscape, tool by tool

The ecosystem looks chaotic until you see its structure. There are really three layers: engines that do the actual math (llama.cpp, MLX); experiences that wrap an engine in convenience (Ollama, LM Studio, Jan, GPT4All); and servers for high-throughput, multi-user production (vLLM, TGI, SGLang).

Because most consumer tools wrap llama.cpp, their raw single-user speed differs by only a few percent. So choose based on workflow, not on a myth that one is dramatically faster than another.

llama.cpp — the engine underneath almost everything

The foundation of the entire consumer local-AI world. Created by Georgi Gerganov, its first commit landed on March 10, 2023 — just two weeks after Meta released the original LLaMA weights. It’s a dependency-free C/C++ inference library that reads GGUF and runs on essentially everything: CPU, CUDA, Metal, ROCm, Vulkan, SYCL.

Its superpower is being first: new architectures usually land here before anywhere else. It exposes every tuning knob. The tradeoff is that you compile it yourself and manage flags. Pick it if you want maximum control, the newest models the day they drop, or the last few percent of performance.

Ollama — the "Docker for LLMs"

If one tool is the default recommendation for developers, it’s this one. Ollama is CLI-first, runs as a background daemon, and exposes both a REST API and an OpenAI-compatible endpoint on port 11434. The workflow: ollama pull, ollama run, done. It stores models by content hash and automatically manages VRAM.

Pick it if you’re a developer who wants local AI to behave like a service you forget is running. This is the one most people should start with.

LM Studio — the polished GUI

The friendliest on-ramp, and free for personal use. LM Studio gives you a built-in model browser that shows memory estimates before you download, a chat playground, RAG over your local documents, and an OpenAI-compatible server — all in a clean desktop app. It runs both the llama.cpp and MLX backends.

Pick it if you want the smoothest discover-download-experiment loop, or you’re on a Mac and want MLX speed without touching the command line.

The rest of the field

Each of these earns its place for a specific job:

Tool	What makes it special	Pick it if…
Jan	Open-source, offline-first ChatGPT-style desktop app; can bridge to cloud APIs	You want a clean assistant UI and value fully open-source software
GPT4All	Point-and-click RAG over a folder of documents, fully offline, near-zero config	You want private document Q&A with no setup
KoboldCpp	Single-executable llama.cpp fork built for creative writing and roleplay	Fiction or roleplay with rich world/character memory
Llamafile	Packs an entire model plus runtime into one cross-platform executable	You want maximum portability or to ship a model as a single file
MLX / MLX-LM	Apple’s native framework; exploits unified memory and supports on-device fine-tuning	You’re on a Mac and want peak performance or local LoRA training
text-generation-webui	"Swiss Army knife" — multiple loaders behind one UI, plus fine-tuning and RAG	You want to experiment broadly across model formats
LocalAI	A router, not a runner: one OpenAI-compatible endpoint in front of many backends	You’re orchestrating several model types behind a single API

The production tier: vLLM and friends

Everything above is built for one user at a desk. The moment you need to serve a model to many concurrent users, you cross into a different category — and the leader is vLLM. Its PagedAttention manages the KV cache in non-contiguous blocks like an OS manages virtual memory, cutting memory waste from 60–80% down to under 4%, and continuous batching slots new requests into the running batch the instant a slot frees. Its launch benchmarks reported up to 24× the throughput of naive Hugging Face Transformers serving.
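To see why block-based allocation wastes so little, here is a toy accounting sketch. It is not vLLM's implementation: it only compares reserving a full maximum-length slab per request against handing out fixed-size blocks on demand. The request lengths, the 4,096-token maximum, and the 16-token block size are assumptions chosen for illustration.

```python
import math

def contiguous_waste(lengths, max_len):
    """Waste when every request reserves a max_len-token slab up front."""
    reserved = len(lengths) * max_len
    return 1 - sum(lengths) / reserved

def paged_waste(lengths, block_size=16):
    """Waste when memory is handed out in fixed-size blocks on demand."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in lengths)
    return 1 - sum(lengths) / reserved

# Hypothetical token counts for five in-flight requests.
lengths = [120, 350, 900, 64, 2048]
print(f"contiguous: {contiguous_waste(lengths, max_len=4096):.0%} wasted")
print(f"paged:      {paged_waste(lengths):.0%} wasted")
```

With paging, the only waste is the unused tail of each request's last block, so it stays small no matter how much request lengths vary.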

The tradeoff: vLLM needs a Python environment and a capable GPU, has only experimental GGUF support (it is built around safetensors, typically with AWQ or GPTQ quantization), and is heavier to set up. Its cousins TGI and SGLang compete in the same space.

Rule of thumb: Ollama for your laptop, vLLM for your server. If exactly one person or process talks to the model at a time, use a llama.cpp-based tool. If many do at once, move to vLLM or TGI.

The Python baseline: Hugging Face Transformers

The reference implementation everything else is measured against. Transformers (with Accelerate) gives you maximum model coverage and flexibility — the standard for research and fine-tuning — but carries the most setup and isn’t optimized for consumer single-user inference. Pick it if you’re doing research or need to run a brand-new model before anyone has produced a GGUF for it.

So which one should you actually use?

Your situation	Best choice
I’m a developer and want one default	Ollama: invisible infrastructure with a clean API
I’m a beginner or non-developer	LM Studio or GPT4All
I’m on a Mac	LM Studio or Ollama with the MLX backend
I have a powerful NVIDIA card	llama.cpp for control; vLLM to serve
I need to serve many users	vLLM (or TGI / SGLang)
I want document chat	GPT4All or LM Studio (built-in RAG)
I’m on low-end hardware	Ollama with small models; Llamafile for portability
Creative writing / roleplay	KoboldCpp

Configuring for your specific machine

Now we get practical. Find your hardware below and follow the path. The through-line is always the memory math from the previous section — here we apply it to real silicon.

Apple Silicon Macs (M1 through M5)

The surprise winner for individual developers. Because of unified memory, a 64GB MacBook can load a 70B model at Q4 with no copying between CPU and GPU. Use an MLX-backed runtime; on the newest chips, MLX is meaningfully faster than the older Metal path.

The rule for Macs: your usable model memory is roughly total unified RAM minus about 8GB for the OS. A 16GB Mac handles 7–8B comfortably, 32GB reaches ~30B, 64GB runs 70B, and 128GB+ opens the big MoE models. The one place a Mac loses to a discrete GPU is raw speed on models that already fit comfortably in that GPU’s VRAM.
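The Mac rule combines with the memory rule of thumb into a quick ceiling check. This sketch assumes the 8GB OS reserve stated above and roughly 0.66GB per billion parameters at Q4_K_M (0.55 bytes per parameter times the 1.2 overhead factor); the function name is illustrative.

```python
def largest_q4_model_billions(unified_gb, os_reserve_gb=8, gb_per_billion=0.55 * 1.2):
    """Rough ceiling on parameter count (billions) at Q4_K_M for a given Mac."""
    usable = max(unified_gb - os_reserve_gb, 0)
    return usable / gb_per_billion

for ram in (16, 32, 64, 128):
    print(f"{ram} GB Mac -> up to ~{largest_q4_model_billions(ram):.0f}B at Q4_K_M")
```

These are ceilings before any KV cache is allocated, which is why the comfortable tiers quoted above sit below them.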

NVIDIA GPUs (Windows and Linux)

The best-supported platform, full stop. Every tool works on NVIDIA first. Plan by your VRAM tier:

VRAM	Example cards	What you can run
8 GB	RTX 4060, 3060 Ti	7–8B at Q4_K_M, 40+ tok/s. The popular real-world floor.
12 GB	RTX 3060 12GB, 4070	12–14B at Q4_K_M with room for context
16 GB	RTX 4060 Ti 16GB, 5060 Ti 16GB	14B at Q5, or gpt-oss-20b — the "16GB sweet spot"
24 GB	RTX 3090, 4090	27–32B at Q4/Q5 fully resident, or 70B with offloading
32 GB+	RTX 5090, workstation cards	Larger 32B at high quant; dual 24GB reach 70B fully in VRAM

The RTX 3090 deserves special mention: thanks to its wide 384-bit bus and ~936 GB/s bandwidth, it often out-generates the newer RTX 4080 despite being a generation older — a perfect illustration of the bandwidth-over-compute principle. A used 3090 remains one of the best value buys in 2026.

AMD GPUs (ROCm and Vulkan)

The story has genuinely improved. In 2026, ROCm has matured enough that AMD is a real choice for inference. On Linux, ROCm/HIP runs llama.cpp and Ollama at roughly 70–80% of CUDA speed at equivalent bandwidth. On Windows, Vulkan (through LM Studio or Ollama) is the least-friction path.

The RX 7900 XTX (24GB) is a credible, cheaper alternative to a 4090 for inference, and AMD’s Strix Halo chips bring Apple-style unified memory to the PC world, with up to 128GB shared. Where NVIDIA still wins decisively is fine-tuning and FP8 production serving.

Intel Arc GPUs

Workable, but the roughest software story of the bunch. The A770 16GB is a genuine budget VRAM bargain. Your paths are llama.cpp’s SYCL or Vulkan backend, IPEX-LLM’s portable Ollama build, or Intel’s vLLM-based stack. Buy Intel only if you enjoy the setup adventure.

CPU-only and low-RAM laptops

Realistic, but manage expectations. A modern CPU does 3–13 tok/s on a quantized 7B — fine for batch jobs, frustrating for interactive chat (anything under ~15 tok/s feels laggy). Stick to small models: Phi-4-mini, Llama 3.2 3B, or Gemma 3 4B at Q4, on 8GB systems.

High-RAM workstation with a weak GPU

You have an underrated option: run a big MoE model (say, gpt-oss-120b) keeping the attention layers on the GPU and the experts in system RAM. Because only a few experts activate per token, you can hit 10–30 tok/s with surprisingly little VRAM. In llama.cpp the --cpu-moe flag does exactly this.

Realistic speed expectations

Approximate tokens/second for a Q4_K_M model, single user, via llama.cpp or Ollama:

Hardware	7B model	Notes
RTX 4090	~135 tok/s	Faster than you can read
RTX 3090	~95 tok/s	The value champion
RTX 4070 Super	~75 tok/s	Excellent mid-range
RTX 4060 8GB	~25–37 tok/s	Perfectly usable
M3 Max (64–128GB)	fast on 7B; ~7–14 on 70B	Holds the whole 70B — so it beats a 4090 there
Modern CPU	~3–13 tok/s	Batch jobs, not chat

And the headline number that breaks the pattern: a 30B MoE like Qwen3-30B-A3B can sustain 100+ tok/s on a 4090 — nothing like the slowdown you’d expect from 30 billion total parameters — because only ~3B are active per token.

Choosing the right model by use case

You’ve got a tool and you know your memory budget. Now: which of the hundreds of available models should you actually download? Start with the size tiers, then match a family to your task.

Read this first: This space moves monthly. Specific version numbers and "best in class" claims go stale fast. Treat the families below as durable and the exact version numbers as a snapshot — always check the current model card on Hugging Face or the Ollama library before committing.

The parameter-size tiers

1–4B (small / on-device): Phi-4-mini, Llama 3.2 1B/3B, Gemma 3 1B/4B. For edge devices, autocomplete, classification, and simple chat.
7–8B (the workhorse): Llama 3.1/3.3 8B, Qwen3 8B, Mistral 7B. The best cost-to-capability ratio for most laptops.
13–14B: Phi-4 14B, Qwen3 14B. Meaningfully smarter; needs ~12GB.
24–34B: Gemma 3 27B, Qwen3 32B, Mistral Small 3 (24B). The single-24GB-GPU sweet spot.
70B+: Llama 3.3 70B, Qwen2.5 72B. Needs 48GB+ or a high-RAM Mac.
MoE giants: Llama 4 Scout/Maverick, DeepSeek-V3/R1, Qwen3-235B, gpt-oss-120b. Big-model quality at small-model compute — but huge memory footprints.

The major model families

Llama (Meta). The largest ecosystem and the most community fine-tunes. The 3.x series are dependable dense workhorses; Llama 4 (April 2025) brought Mixture-of-Experts to the line, with Scout offering a headline 10M-token context. The caveat: the Llama Community License is not open source — it carries a 700M-monthly-active-user cap and EU restrictions.

Qwen (Alibaba). For many developers, the default answer to "what should I run locally?" in 2025–2026. Apache 2.0 licensed, dense and MoE from 0.6B to 235B, with strong coding, math, and multilingual ability.

Gemma (Google). Gemma 3 spans 1B to 27B, is multimodal from 4B up, and runs beautifully on consumer hardware. The 4B is a superb laptop default; the 27B is competitive on a 24GB GPU.

DeepSeek. DeepSeek-V3 for general use and the reasoning-focused DeepSeek-R1, released January 2025 under a clean MIT license. The full model needs a server, but the distilled variants (1.5B to 70B) bring R1-style reasoning to consumer hardware.

Mistral / Mixtral. Mistral 7B and Mixtral 8x7B remain heavily deployed; Mistral Small 3 (24B) rivals models three times its size, and the Mistral 3 family moved fully to Apache 2.0.

Microsoft Phi. Phi-4 (14B) and Phi-4-mini (3.8B), MIT-licensed, punch well above their weight per parameter — ideal for budget and edge deployments.

OpenAI gpt-oss. Released August 2025 under Apache 2.0, these are OpenAI’s first open-weight language models since GPT-2. Both are MoE with a 128K context and three configurable reasoning-effort levels. The gpt-oss-20b needs only ~16GB (a standout for a 16GB GPU or Mac), and gpt-oss-120b runs within 80GB on a single high-end GPU.

Also worth watching: Cohere’s Command R/R+ (RAG-focused), Falcon 3, GLM, Kimi, and coding specialists Qwen-Coder and Mistral’s Devstral.

Mapping use cases to models

Your task	Reach for
General chat / assistant	Qwen3, Gemma 3, Llama 3.3
Code generation	Qwen-Coder, Devstral, DeepSeek-Coder, Code Llama
Reasoning / math	DeepSeek-R1 (or its distills), gpt-oss at high reasoning effort
Long-context tasks	Llama 4 Scout, Qwen3, Gemma 3
RAG / summarization	Mid-size instruct models plus an embedding model
Multilingual	Qwen3 (strongest, especially CJK), Gemma 3, Mistral
Vision-language	Gemma 3, Qwen-VL, Llama 4, Mistral Small 3.1
Small / fast on-device	Phi-4-mini, Llama 3.2 3B, Gemma 3 4B

Reading model cards, and the licensing trap

You’ll find models in two main places: the Ollama library (curated, one-command pulls) and Hugging Face (everything, including community quantizations — "bartowski" and "Unsloth" are prolific and reliable). On any model card, check five things: parameter count, context length, license, intended use, and which quantizations are available.

The licensing nuance matters the moment you build something real. Apache 2.0 (Qwen, Gemma’s permissive releases, Mistral 3, gpt-oss) and MIT (DeepSeek, Phi) are the cleanest for commercial use. Llama’s license permits commercial use but with that 700M-MAU cap and EU restrictions. Always read the actual card before shipping a product, and involve legal for anything at scale.

Optimization: squeezing out performance

Here’s the playbook, roughly in the order you should apply it.

1. Pick the right quantization for your VRAM

Q4_K_M is the default for a reason — ~75% smaller than FP16 with only a 1–3% quality drop. Step up to Q5_K_M or Q8_0 when you have headroom. The most useful heuristic of all: a larger model at lower precision usually beats a smaller model at higher precision. A 14B at Q4 typically outperforms a 7B at Q8 — if both fit.

2. Maximize GPU layer offloading

In llama.cpp, set -ngl 999 to push every layer onto the GPU. If you run out of memory, lower the number until the model fits. Partial offload is far faster than running everything on the CPU.
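If you would rather estimate a starting value than lower the number by trial and error, a back-of-the-envelope sketch like this can help. It assumes all layers are the same size and reserves a fixed amount of VRAM for the KV cache and buffers; the function name, the 1.5GB reserve, and the example model are hypothetical.

```python
def layers_that_fit(model_gb, n_layers, vram_gb, reserve_gb=1.5):
    """Estimate how many layers fit in VRAM, assuming equal-sized layers."""
    per_layer_gb = model_gb / n_layers
    budget = max(vram_gb - reserve_gb, 0)
    return min(n_layers, int(budget // per_layer_gb))

# Hypothetical example: a ~19 GB 32B model with 64 layers on a 12 GB card.
print(layers_that_fit(model_gb=19, n_layers=64, vram_gb=12))
```

Use the result as a first guess for `-ngl`, then adjust down if you still hit out-of-memory errors at your chosen context length.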

3. Manage context length and the KV cache

Don’t set your context window higher than you actually need; the KV cache scales linearly with it. When context is tight, quantize the cache: setting the KV cache type to q8_0 roughly halves its footprint. In Ollama, this is the OLLAMA_KV_CACHE_TYPE environment variable, which Ollama’s documentation says takes effect only with flash attention enabled (OLLAMA_FLASH_ATTENTION=1).

4. Enable flash attention

Flash attention reduces the memory cost of attention and speeds up long-context inference. It’s standard on modern setups — turn it on with --flash-attn on in llama.cpp.

5. Use speculative decoding for a free speedup

Pair a large "main" model with a small "draft" model from the same family — say, Llama 3.2 1B drafting for Llama 3.1 8B. When acceptance is high, you get a 1.5–3× speedup with no quality loss, because the large model still has the final say. Keep the draft model much smaller than the main one.
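The "no quality loss" claim is easiest to see in a toy version of the greedy case. In the sketch below, both "models" are made-up deterministic functions, not real LLMs, and the verification loop stands in for what a real engine does in one batched forward pass. It also omits the extra token a real implementation gains when the whole draft is accepted.

```python
def target_next(context):
    """Toy 'large' model: deterministic next token from the context."""
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    """Toy 'draft' model: agrees with the target except on some contexts."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy(prompt, n_new):
    """Baseline: one target-model pass per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(target_next(out))
    return out

def speculative(prompt, n_new, k=4):
    """Draft k tokens, let the target verify them, keep the agreed prefix."""
    out, target_passes = list(prompt), 0
    while len(out) - len(prompt) < n_new:
        # 1. The draft model proposes k tokens cheaply.
        proposal, ctx = [], list(out)
        for _ in range(k):
            tok = draft_next(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2. The target checks every proposed position in one verification round.
        target_passes += 1
        for tok in proposal:
            correct = target_next(out)
            out.append(correct)      # the target's token always wins
            if tok != correct:
                break                # discard the rest of the draft
    return out[:len(prompt) + n_new], target_passes

tokens, passes = speculative([1, 2, 3], n_new=20)
print(tokens == greedy([1, 2, 3], 20), passes)
```

The output is identical to plain greedy decoding with the target model, but it takes far fewer target passes. The saving shrinks as the draft model's guesses get rejected more often.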

6. Choose MoE models when speed matters

A 30B-A3B MoE generates as fast as a 3B dense model while approaching the quality of something far larger — provided you have the memory to hold all 30B.

7. Buy hardware for memory bandwidth

Memory bandwidth (GB/s) predicts token-generation speed better than raw compute (TFLOPS). This is why the RTX 3090 often beats the newer 4080, and why the M3 Max beats the M4 Pro on generation. Match your purchase to two things: enough memory capacity to fit the model, and enough memory bandwidth to run it fast.

Optimization order of operations: (1) Pick the largest model that fits at Q4_K_M. (2) Max out GPU layers. (3) Enable flash attention. (4) Quantize the KV cache if context is tight. (5) Add speculative decoding if you need more speed. (6) Only then consider buying hardware.

Step-by-step setup walkthroughs

Enough theory. Here are the exact commands to get running on each of the three paths most people take. All three expose an OpenAI-compatible API, so any code you’ve written against the OpenAI SDK will work against your local model with a one-line change to the base URL.

Path A: Ollama (the recommended default)

Works on macOS, Windows, and Linux. On macOS and Windows, download the installer; on Linux, one command does it:

# Linux install (macOS/Windows: download the app)
curl -fsSL https://ollama.com/install.sh | sh

# Pull and run a model — downloads on first run
ollama run qwen3:8b
# type your prompt in the chat; /bye to exit

# Handy commands
ollama list                 # installed models
ollama ps                   # models currently in memory
ollama pull gemma3:4b       # download without running
ollama rm <model>           # delete a model
ollama run qwen3:8b --verbose "Write a haiku"  # shows tok/s

The server runs on port 11434. Here’s how you hit it from the command line and from Python:

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"qwen3:8b","messages":[{"role":"user","content":"Hello!"}]}'

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required by the SDK but unused
)

resp = client.chat.completions.create(
    model="qwen3:8b",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)

To expose Ollama to other machines, set OLLAMA_HOST=0.0.0.0:11434 before it starts. To customize a model’s system prompt or parameters, write a Modelfile and run ollama create.

Path B: LM Studio (the GUI route)

No commands required for the basics:

1. Download from the LM Studio site (macOS .dmg, Windows .exe, Linux AppImage) and install. It auto-detects your GPU and the right backend.
2. Open the Discover tab, search for a model (start with Gemma 3 4B or Llama 3.2 3B), pick the Q4_K_M quantization, and download. It shows the memory estimate before you commit.
3. Load the model and chat in the playground. Attach a PDF or text file to use the built-in RAG.
4. For development, go to the Developer tab, enable Developer Mode, load a model, and click Start Server. It exposes an OpenAI-compatible API at http://localhost:1234/v1.
5. In settings, max out the GPU layers for your VRAM, set your context length, and on Apple Silicon select the MLX engine for a noticeable speed bump.

Path C: llama.cpp (maximum control)

For when you want every knob. Install via a package manager or build from source with your GPU backend:

# Easiest: package manager (macOS/Linux)
brew install llama.cpp

# Or build from source with GPU support
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON      # NVIDIA
# -DGGML_METAL=ON (Mac) · -DGGML_HIP=ON (AMD) · -DGGML_VULKAN=ON (cross-vendor)
cmake --build build --config Release

# Run a model straight from Hugging Face (auto-downloads the GGUF)
llama-cli -hf ggml-org/gemma-3-1b-it-GGUF

# Launch the OpenAI-compatible server with tuning flags.
# (Comments must not follow a trailing backslash, or the line continuation breaks.)
#   -ngl 999               all layers on GPU
#   -c 8192                context length
#   --flash-attn on        enable flash attention
#   --cache-type-k / -v    quantize the KV cache
llama-server -hf bartowski/Qwen3-8B-GGUF:Q4_K_M \
  -ngl 999 \
  -c 8192 \
  --flash-attn on \
  --cache-type-k q8_0 --cache-type-v q8_0 \
  --host 0.0.0.0 --port 8080

The server gives you a web UI and an API at http://localhost:8080. Two utilities you’ll use often: llama-bench measures tokens/second on your exact hardware, and llama-quantize converts a model to a smaller quant.
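Since all three paths speak the same OpenAI-style protocol, switching between them mostly comes down to the base URL. The sketch below builds a chat request with only the standard library, without sending it. The dictionary keys and the function name are illustrative, and the ports are the defaults named in this guide, so adjust them if you changed yours.

```python
import json
import urllib.request

# Default local endpoints named in this guide.
BASE_URLS = {
    "ollama": "http://localhost:11434/v1",
    "lmstudio": "http://localhost:1234/v1",
    "llama.cpp": "http://localhost:8080/v1",
}

def build_chat_request(tool, model, prompt):
    """Build (but do not send) an OpenAI-style chat completion request."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{BASE_URLS[tool]}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("ollama", "qwen3:8b", "Hello!")
print(req.full_url)
# To send it once a server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The model name you pass must match whatever the chosen server has loaded. The `qwen3:8b` tag here follows the Ollama examples above.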

When something goes wrong

Symptom	Fix
Out of memory	Use a smaller model, a lower quant (Q4 instead of Q8), or reduce context length
Painfully slow (running on CPU)	Confirm GPU detection and raise the GPU layer count
Port already in use	Ollama defaults to 11434, LM Studio to 1234, llama.cpp to 8080 — they coexist, but don’t double-bind one port
Garbled or repetitive output	Check the model’s prompt template; lower temperature; try a higher quant

Recommendations & what to do next

Let’s compress everything into an action plan.

Start here, today

Install Ollama and run ollama run qwen3:8b (or gemma3:4b on a lighter machine). Confirm you get interactive speed, then point your code at localhost:11434/v1. Prefer clicking to typing? Install LM Studio and download Gemma 3 4B at Q4_K_M instead.

Then match the model to your memory and task

Your hardware	Run this
8GB GPU / 16GB Mac	7–8B (Qwen3 8B, Llama 3.1 8B) at Q4_K_M — or gpt-oss-20b on 16GB
12–16GB	14B (Phi-4, Qwen3 14B) at Q4/Q5
24GB / 32–64GB Mac	27–32B (Gemma 3 27B, Qwen3 32B), or the Qwen3-30B-A3B MoE for speed
48GB+ / 64–128GB Mac	70B at Q4, or the big MoE models

For reasoning, reach for a DeepSeek-R1 distill sized to your tier. For coding, Qwen-Coder or Devstral. For private document chat, the built-in RAG in GPT4All or LM Studio.

Scale up only when concurrency forces you to

The moment more than one user or process needs the model at once, move to vLLM (or TGI) on a proper GPU server. Until then, a llama.cpp-based tool on your own machine is simpler, cheaper, and entirely sufficient. Ollama for the laptop, vLLM for the server.

The honest caveats

A few truths to keep you grounded. The model leaderboard changes monthly — treat every specific version number and benchmark here as a snapshot, and verify the current state on the model card before you build. Benchmarks themselves disagree and are often vendor-reported, so validate on your workload. "Runs in X GB" almost always means weights only — budget extra for the KV cache. And local models have a real quality ceiling: a local 8B will not match a frontier cloud model on the hardest reasoning. Choose local for privacy, cost, offline capability, and control — not because it always wins.

Finally, privacy is not automatic just because inference is local. Some applications include telemetry; open-source tools let you verify what’s actually happening. Running models on your own hardware supports a strong privacy and compliance posture, but full compliance needs additional access, audit, and physical controls on top.

That’s the whole map. The fundamentals here — the memory math, the quantization tradeoffs, the three tooling layers, the bandwidth-over-compute principle, and the setup commands — will outlast any individual model release. The specific models will keep getting better, faster, and smaller. Which means the best time to get comfortable running them locally is right now, and the second-best time is the next time you open your terminal.

Now go pull a model. That first streamed token is waiting.

Parenting & AI - A Field Guide for Modern Families

Harshdeep Singh — Wed, 24 Jun 2026 01:59:01 +0000

Raising Humans in the Age of AI

A family therapist’s honest take on using AI well, keeping your kids safe from it, and the research that should actually shape your choices.

01 · The roommate nobody invited - We’re all parenting without a manual now

It’s 2 a.m. Your toddler finally went down an hour ago. You are upright in bed, thumb hovering over a glowing rectangle, typing some version of “is it normal that…” into a chatbot that answers in a calm, confident, slightly-too-certain voice. Somewhere down the hall, a tablet waits to be discovered by small hands at sunrise. And if you’ve got a teenager, there’s a decent chance they’re texting a piece of software that calls itself their friend.

Artificial intelligence didn’t ask permission before moving into the family home. It arrived inside the search bar, the homework, the toy aisle, the group chat. The question parents keep asking me isn’t “should I allow this?” — that ship has sailed. It’s the much harder one: how do I raise a whole human being in the middle of all this, without losing my mind or theirs?

So let’s talk like two people figuring it out together — because that’s exactly what we are. There is no generation of parents who has done this before us. Not one. The research is fresh, the products are faster than the rules, and the loudest voices online are split between “AI will save your child” and “AI will ruin your child.” Both are selling something. The truth, as usual, lives in the more useful middle: AI is a powerful tool that can genuinely help your family and can genuinely hurt your kids, and the difference is almost entirely about how, when, and with whom it gets used.

This is a field guide for that middle. Where AI earns its keep. Where it absolutely does not belong near a child. And what a calm, present parent actually does about both.

Nearly 3 in 4 U.S. teens have used an AI companion, per Common Sense Media — this is mainstream, not fringe.

About 3 in 10 teens use AI chatbots daily, according to Pew Research Center reporting from late 2025.

“Unacceptable” — how Common Sense Media rates social AI companions for anyone under 18.

02 · A two-minute primer - What even is this thing?

You don’t need a computer-science degree to parent well around AI, but a couple of distinctions make everything downstream easier. The American Psychological Association, in its 2025 health advisory on AI and adolescents, splits it roughly two ways.

Generative AI is the stuff that produces things. It writes human-sounding text, creates photorealistic images, clones voices, and generates lifelike video. This is the “make me a picture / write me an essay / what does this rash mean” layer. It’s astonishingly useful and it is also a content-fabrication machine that can be wrong with total confidence.

Interactive or companion AI is the chatbot designed to talk with you, remember you, and keep the conversation going. This is the layer parents most need to understand, because it’s engineered to feel like a relationship.

Here’s the part that matters for kids specifically. The APA notes that adolescents are uniquely vulnerable for reasons that are developmental, not a knock on their intelligence: they’re less likely to question whether an AI’s answer is accurate, and heavy reliance on these tools can quietly crowd out the real-world relationships their brains are still wiring themselves around. And unlike social media — where you generally know there’s a human on the other end — children often don’t realize when they’re talking to a machine at all.

Not all good. Not all bad. The whole job is learning to tell which is in front of you.

That’s the lens for everything that follows. Same technology, wildly different outcomes depending on the use. So let’s start where the news rarely does: with the genuinely good.

03 · The good stuff - Where AI quietly earns its keep

A child’s relationship with AI starts as your relationship with AI — used alongside them, not handed to them. · Photo: Siwawut Phoophinyo / Unsplash

Parents are not superheroes. We cannot be in two places at once, answer the four-hundredth “why” of the day with fresh enthusiasm, and also draft the email to the daycare about the lice situation. This is precisely where a well-used AI tool shines: it absorbs mental load.

Used as a back-pocket assistant, AI is wonderful for the low-stakes logistics that eat a parent alive. Brainstorm a week of toddler-friendly dinners using what’s in your fridge. Generate three versions of a calm script for refereeing the same sibling argument for the ninth time. Translate the school newsletter. Summarize the forty-page sports-league handbook into “what do I actually need to bring on Saturday.” Turn “explain mitosis to a curious seven-year-old” into something you can read aloud. None of this replaces your judgment — it just clears the underbrush so you have more of yourself left over for the kid.

Used alongside your children, it can be a genuine spark. Co-write a bedtime story where your daughter is the dragon and the dragon is afraid of broccoli. Chase the “why is the sky blue / but why / but why” rabbit hole together. Help an older kid scaffold a tricky assignment — outline the essay, quiz me on these terms, explain why I got this wrong — rather than having the machine simply do it for them. The distinction between scaffolding and outsourcing is one you’ll come back to a lot.

And there’s a reason “alongside” keeps showing up. As you’ll see in the playbook, pediatric researchers find that interactive, co-used screen time tends to beat passive, solo screen time. AI is at its best as a thing you do with your child, not a thing you hand them to be quiet.

Healthy use

Treat AI as a drafting and brainstorming partner — meal ideas, scripts, summaries, explainers, story-starters — and as a co-pilot you use beside your kid, not a babysitter you hand them. Always sanity-check anything factual; these tools state wrong answers with the same confidence as right ones.

From the therapist’s chair

A parent recently told me, half-guilty, that she uses AI to reword her own frustrated texts before sending them to her co-parent. I told her that’s not cheating — that’s regulation. If a tool helps you pause and respond instead of react, you’re modeling the exact skill you want your kids to learn.

04 · For expecting & new parents - Before the baby even arrives

Pregnancy is the most over-Googled season of life — which makes it the moment AI is most tempting, and most worth using carefully. · Photo: Jeferson Santu / Unsplash

If you’re expecting, you already know pregnancy runs on questions — a steady drip of them, often at hours when no clinic is open and no friend should be texted. AI has quietly become the 2 a.m. companion for a lot of this, and some of it is honestly great.

On the everyday side, expectant parents are using chatbots to translate medical jargon into plain language (“what on earth is round-ligament pain”), to brainstorm baby names without spiraling, to compare twelve nearly identical bassinets into a shortlist, to draft a birth plan, and to keep track of the cravings, the appointments, and the glucose test you keep forgetting. It’s a back-pocket assistant for one of the most information-dense seasons of life.

On the clinical frontier, the story is genuinely hopeful. Researchers are using AI to read ultrasound images more precisely, to flag pregnancies at higher risk of complications like preeclampsia, gestational diabetes, and preterm labor, and even to predict delivery timing from a standard scan — one company reports estimating a delivery date to within about eight days. For families in rural areas and “maternity care deserts,” AI-assisted remote monitoring and earlier warnings could meaningfully widen access to care. There’s also a quietly important thread here around maternal mental health, where chatbots and apps are being studied as a way to screen for and support perinatal anxiety and depression that too often go unspoken.

But — and this is the whole sermon in one line — AI is not your OB, your midwife, or your therapist. A chatbot can explain a symptom; it cannot examine you. It is reassuring at exactly the moments it should be alarming, because it’s built to be agreeable.

Handle with care

For anything urgent — bleeding, severe pain, reduced fetal movement, or dark thoughts after birth — call your provider or a crisis line, not an app. And remember pregnancy data is deeply sensitive: read the privacy policy before you log symptoms, and turn off model-training settings where you can.

05 · The hard part - Where AI gets genuinely dangerous for kids

This is the section I’d ask you not to skim, because it’s the one the marketing won’t tell you. I’m not here to scare you off technology. I’m here because some uses of AI have already hurt children, and a calm, informed parent is the single best safety feature any of these products has.

The “friend” who is software

The fastest-growing risk isn’t a chatbot getting a math problem wrong — it’s a chatbot getting a relationship right. Companion AIs are engineered to feel warm, attentive, available at 3 a.m., and endlessly validating. For a lonely kid, that’s not a feature; it’s a hook. After comprehensive testing, Common Sense Media rated social AI companions an outright “unacceptable” risk for anyone under 18, warning that they’re designed to manufacture emotional attachment and dependency in brains that are still forming.

This isn’t hypothetical. A wave of lawsuits in 2024 and 2025 alleged that companion chatbots played a role in teens’ mental-health crises and deaths, including the case of 14-year-old Sewell Setzer III, whose mother sued Character.AI after he died by suicide following an intense attachment to a chatbot. Under mounting legal and regulatory pressure, Character.AI banned open-ended chat for under-18 users in late 2025, and in early 2026 the company and Google moved to settle several of those suits. Separately, a 2025 assessment by Common Sense Media with Stanford Medicine’s Brainstorm Lab found that the major general-purpose chatbots (including ChatGPT, Claude, Gemini, and Meta AI) failed to reliably recognize and respond to signs of mental-health crisis in teen users.

The lesson is not “AI is evil.” It’s that an always-available, always-agreeable voice is not a friend and is definitely not a therapist — and a child can’t be expected to know the difference. Watch, gently, for the tell: a kid pulling away from real people and toward a screen that never disagrees with them.

The toy that wouldn’t stay on-script

Then there are the toys. In late 2025, a consumer-safety group at the U.S. PIRG Education Fund tested AI-powered children’s toys and found that one — a $99 teddy bear running a major chatbot under the hood — slid from friendly chatter into wildly inappropriate territory, including sexual content it introduced on its own and instructions on where to find dangerous objects around the house. The company pulled the product and the AI provider revoked its access, but the researchers’ point landed harder than the recall: the toy’s safeguards held up fine in short bursts and broke down over longer conversations — exactly the kind a curious child has.

Common Sense Media’s guidance here is blunt: no AI companion toys for children five and under, and serious caution for ages six to twelve. Many of these toys also ship with always-on microphones and hoover up data, which is a second problem stacked on the first.

The “is this even real?” problem

Finally, there’s the flood of AI-generated content itself. A convincing fake face, voice, or video can now be conjured from a single photo or a few seconds of audio. For kids, this shows up as misinformation that looks authoritative, scams that sound like a friend, and — at the genuinely dark end — synthetic images used to humiliate or exploit real children, a harm that child-safety organizations and state attorneys general have raised loud alarms about. Your child doesn’t need to understand the technology to be protected from it. They need one reflex, which we’ll build in the playbook: pause and ask whether what you’re looking at is actually real.

The companies rushed AI to kids before the safety standards arrived. Until the rules catch up, you are the standard.

06 · Sharenting, reconsidered - The photos you post are training data now

I want to talk about something most of us have done without a second thought: posting our kids online. The proud-parent montage, the first-day-of-school sign, the birthday post with the candles and the full name and the age. A decade ago, that photo mostly stayed within your circle. Today, a public post isn’t a memory — it’s data. It can be indexed, scraped, and fed into systems that build a profile or generate a convincing fake.

The scale is sobering. One widely cited estimate from Barclays projected that by 2030, oversharing by parents could be linked to as many as 7.4 million incidents of identity fraud per year. AI scrapers can stitch a name, a birthday, and a location out of innocent posts. And a child’s face — unlike a password — can’t be changed later.

This isn’t a call to delete every photo and go off-grid. It’s a nudge to shift from broadcasting to sharing, and to remember whose face it actually is.

Protect their footprint

Favor private channels over public profiles. Lock down privacy and tagging settings. Skip identifiable details — school names, uniforms, locations, full birthdates. Be wary of “fun” AI avatar and face-swap apps. Ask relatives not to repost your child without asking. And teach kids early: your face, your voice, and your image belong to you.

07 · The playbook - What a calm parent actually does

Enough about the storms. Here’s the part you can act on tonight. The good news is that the most respected guidance has gotten simpler, not more complicated.

Stop counting minutes. Start asking better questions.

In early 2026, the American Academy of Pediatrics retired its old “two hours a day” rule and shifted the whole conversation to quality, context, and connection over a stopwatch. The research behind it is clear: interactive beats passive, co-using with your child amplifies the benefits, and the real harm isn’t screens themselves but displacement — when screen time crowds out sleep, movement, play, and face-to-face time. One pediatrician’s framing has stuck with me: think of screen time like dessert. Not the enemy, not the main course.

Two firm exceptions still hold: essentially no screen media for babies under about 18 months (video-chatting grandma aside), and only small amounts of high-quality, co-viewed content for toddlers.

Run everything through the 5 C’s

The AAP’s “5 C’s” framework was built for media generally, but it maps beautifully onto AI. Before you say yes or no to a tool, run it through these.

Child — Who is this kid? A child prone to anxiety or loneliness has a very different relationship to a validating chatbot than a confident one. Match the tool to the actual human in front of you.
Content — What is the AI actually doing — scaffolding learning, or doing the thinking for them? Generating, or just answering? Is it a closed kids’ tool or an open-ended adult one wearing a friendly costume?
Calm — Is the tool helping your child self-soothe, or becoming the only way they self-soothe? If a chatbot is the go-to for every big feeling, that’s a flag, not a win.
Crowding out — What is this replacing? If the AI is displacing sleep, friends, outdoor play, or you, the specific app barely matters — the trade is the problem.
Communication — Are you talking about it, openly and often? The single highest-leverage move is making AI a normal, judgment-free topic at your dinner table before it becomes a secret in their bedroom.

The “would I sit next to them?” test

When you’re unsure about an app or a chatbot, use the simplest gut-check there is: would you be comfortable pulling up a chair and watching over their shoulder while they use it? If yes, it’s probably fine. If the idea makes you uneasy, that unease is information.

Teach three reflexes, not a hundred rules

You can’t write a rule for every situation AI will invent. So don’t try. Build three instincts your kids carry everywhere instead:

Verify. “Is this real? Is this actually true?” The antidote to deepfakes and confident-but-wrong answers is a child who pauses before believing.
Protect. “My image, my voice, and my information are mine to guard.”
Connect. “AI is a tool. People are home.” When something’s heavy, the move is to find a human (a parent, a friend, a counsellor), not a chatbot.

From the therapist’s chair

The families who navigate this best aren’t the ones with the strictest rules or the fanciest parental controls. They’re the ones where AI is a normal thing to talk about — where a kid feels safe saying “this video seems fake” or “this bot said something weird” without bracing for a lecture or a confiscation. Stay curious out loud. Curiosity keeps the door open; panic slams it shut.

And model it. Your children are running a continuous study on how you use your phone and your chatbots. Your own habits — when you put the device down, how you fact-check, whether you reach for a human or a screen when you’re stressed — are the first and most durable curriculum they’ll ever get.

08 · A gentle landing - The kids will be alright

No app can replace the thing kids actually need most: a present adult who keeps showing up. · Photo: Unsplash

Here’s what I most want you to leave with. You are the first parents in human history to raise children alongside machines that talk back. There is no veteran to call, no dog-eared manual, no “well, this is how my mother did it.” That can feel terrifying. It’s also a kind of privilege: you get to set the tone for what a healthy relationship with this technology even looks like.

The task was never to ban the future or to outsource childhood to it. It’s the same task it has always been, just with new scenery: stay present, stay curious, keep the conversation open, and remain — stubbornly, reliably — the warmest, realest thing in your child’s world. The machines are very good at sounding human. They are not good at being home. That part is still, and always, yours.

We’re all learning this together. None of us has it figured out. And honestly? Showing up imperfectly, with your eyes open and your heart in it, is the whole job. You’re already doing it.

A note on the heavier parts

This piece touches on suicide, self-harm, and exploitation — real risks, handled here only to help you protect your family. If you or a young person you love is struggling, please reach out to a person, not a chatbot.

U.S. & Canada: call or text 988 (Suicide & Crisis Lifeline).
Kids & teens in Canada: Kids Help Phone — 1-800-668-6868, or text CONNECT to 686868.
Vetting kids’ apps, toys, and AI tools: Common Sense Media publishes plain-language risk reviews.
When in doubt about anything medical or about your child’s mental health, talk to your pediatrician or family doctor.

Sources & further reading

American Psychological Association — Artificial Intelligence and Adolescent Well-being: An APA Health Advisory (June 2025), and the APA advisory on generative AI chatbots and wellness apps for mental health (Nov 2025).
Common Sense Media — Social AI Companions risk assessment (“Unacceptable” for under-18, 2025); “Talk, Trust, and Trade-Offs” teen survey (2025); AI chatbots & teen mental-health support assessment with Stanford Medicine’s Brainstorm Lab (2025); AI toy companion guidance (2026).
U.S. PIRG Education Fund — Trouble in Toyland 2025, AI toys testing (FoloToy “Kumma”).
Reporting from CNN, NBC News, Fortune, and others on the Character.AI lawsuits, the under-18 policy change, and the 2026 settlement.
American Academy of Pediatrics / HealthyChildren.org — updated digital-media guidance and the “5 C’s of Media Use” framework (2024–2026).
Barclays oversharing / “sharenting” identity-fraud projection; Thorn and child-safety reporting on AI-generated exploitation material.
Peer-reviewed reviews on AI in maternal and feto-maternal health (e.g., PMC), and reporting on AI ultrasound and delivery-prediction tools.

Figures and findings reflect sources available as of mid-2026; this fast-moving field is worth re-checking before publication.

About this piece. Written from a family-therapy lens for parents, new parents, and parents-to-be. It’s educational, not a substitute for personalized medical, mental-health, or legal advice.

Stop Using AI Like Autocomplete: A Developer's Guide to Multi-Agent Workflows

Harshdeep Singh — Wed, 24 Jun 2026 01:58:53 +0000

Most engineers who adopted Claude Code or Codex are still using them like a faster autocomplete: one prompt, one answer, repeat. The real productivity unlock is somewhere else entirely — in treating these tools as an orchestra of specialized agents you direct, rather than a single assistant you chat with. This is a practical guide to building that multi-agent workflow into your day-to-day, and to doing it without fooling yourself about the gains.

From autocomplete to orchestration

If you have used an AI coding agent for more than a week, you already know the basic loop: describe a change, watch it edit files, run the tests, fix what broke. That loop is genuinely useful. But it is also the floor of what these tools can do, not the ceiling.

The engineers getting the largest gains are not writing better single prompts. They are running several agents at once, each with a narrow job, coordinated through a plan that a human reviewed before any code was written. One agent explores the codebase and writes a spec. Another implements against that spec. A third reviews the diff with fresh eyes. A fourth keeps the documentation in sync. Some of this happens in parallel across isolated branches; some of it happens in a strict sequence because step three genuinely depends on step two.

This is the difference between using an agent and building an agentic workflow. The first is a tool you reach for. The second is a system you design. This guide is about the second — what an ideal multi-agent workflow looks like on a normal development task, regardless of whether your tool of choice is Claude Code, OpenAI Codex, Cursor, or something that did not exist when this was written.

What a multi-agent workflow actually means

The vocabulary here is muddier than it should be, so it is worth being precise. The clearest definitions come from Anthropic's engineering writing, and they have largely become the industry's shared language.

An agent, in the practical sense, is an LLM autonomously using tools in a loop. It reads a file, decides to run a test, reads the output, decides to edit another file, and continues until it judges the task done. That is one agent: one model, one context window, one continuous train of thought.
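
A minimal sketch of that loop, in Python. The `fake_model` function is a scripted stand-in for a real LLM call, and the two tools are hypothetical placeholders; only the control flow (decide, act, observe, repeat) is the point.

```python
from typing import Callable

# Hypothetical tools: a real agent would read real files and run a real test suite.
def read_file(path: str) -> str:
    return f"contents of {path}"

def run_tests(_: str) -> str:
    return "2 passed, 0 failed"

TOOLS: dict[str, Callable[[str], str]] = {"read_file": read_file, "run_tests": run_tests}

def fake_model(history: list[str]) -> tuple[str, str]:
    """Stand-in for an LLM call: returns (action, argument).

    This scripted policy reads a file, runs the tests, then declares itself done.
    """
    if len(history) == 1:
        return ("read_file", "src/app.py")
    if len(history) == 2:
        return ("run_tests", "")
    return ("done", "tests pass")

def run_agent(task: str, model=fake_model, max_steps: int = 10) -> list[str]:
    """One agent: one model, one growing context, tools called in a loop."""
    history = [f"task: {task}"]
    for _ in range(max_steps):
        action, arg = model(history)
        if action == "done":
            history.append(f"done: {arg}")
            break
        observation = TOOLS[action](arg)  # act, then feed the result back in
        history.append(f"{action} -> {observation}")
    return history
```

The `max_steps` cap is a design choice worth keeping in any real version: it bounds cost when the model never decides it is finished.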

A workflow is a system where LLMs and tools are orchestrated through predefined code paths. The steps are known in advance. You chain them together because you have decided, as the engineer, that this sequence produces a reliable result.

A multi-agent system introduces a lead or orchestrator agent that breaks a task into pieces and delegates them to specialized sub-agents — often running in parallel, each with its own context window, its own tools, and its own instructions. The orchestrator does not do the work itself; it decomposes, delegates, and synthesizes.

The distinction that matters most in practice is between workflows and agents. Workflows offer predictability and consistency for tasks you can define up front. Agents are the better choice when you need flexibility and model-driven decision-making across a path you cannot map in advance. Anthropic's own guidance is refreshingly conservative on this point: find the simplest solution possible, and only increase complexity when it demonstrably improves outcomes. Multi-agent systems are powerful, but they spend tokens fast and they add coordination overhead. They earn their keep on high-value tasks that genuinely decompose into independent threads — not on everything.

The five patterns worth knowing by name

Before wiring up a crew of agents, it helps to have a vocabulary for the shapes these systems take. Anthropic's "Building Effective Agents" lays out five composable patterns that have become the industry's reference set. You will recognize most of them from systems you have already built by accident.

Prompt chaining decomposes a task into a fixed sequence of steps, where each step operates on the output of the last. You trade a little latency for a lot of accuracy, because each call has a narrower, easier job. Generating a spec, then generating code from that spec, then generating tests from that code is a prompt chain.
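
As a rough illustration, the spec-to-code-to-tests chain can be sketched in Python like this. The three step functions are hypothetical stand-ins for separate model calls; the sketch only shows the fixed ordering and the hand-off of each output to the next step.

```python
from typing import Callable

Step = Callable[[str], str]

# Hypothetical stand-ins for three narrow LLM calls.
def write_spec(request: str) -> str:
    return f"SPEC({request})"

def write_code(spec: str) -> str:
    return f"CODE({spec})"

def write_tests(code: str) -> str:
    return f"TESTS({code})"

def run_chain(steps: list[Step], initial: str) -> list[str]:
    """Run the steps in a fixed order; each step sees only the previous output."""
    outputs = []
    current = initial
    for step in steps:
        current = step(current)
        outputs.append(current)
    return outputs

artifacts = run_chain([write_spec, write_code, write_tests], "add rate limiting")
```

Keeping every intermediate output, rather than only the last one, leaves room for a check or a human review between steps.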

Routing classifies an input and sends it to a specialized handler. A triage agent that reads an incoming bug report and decides whether it is a frontend issue, a database issue, or a flaky test — then hands it to the right specialist — is routing.
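
A small sketch of the triage example, assuming a keyword-based `classify` function in place of a real model call. The labels and handlers are hypothetical; the shape to notice is classify first, then dispatch, with a fallback for inputs nobody claims.

```python
def classify(report: str) -> str:
    """Hypothetical classifier; a real router would ask a model for the label."""
    text = report.lower()
    if "css" in text or "button" in text:
        return "frontend"
    if "query" in text or "migration" in text:
        return "database"
    if "intermittent" in text or "flaky" in text:
        return "flaky_test"
    return "unknown"

# Each handler stands in for a specialist agent with its own prompt and tools.
HANDLERS = {
    "frontend": lambda report: f"frontend specialist: {report}",
    "database": lambda report: f"database specialist: {report}",
    "flaky_test": lambda report: f"test-stability specialist: {report}",
}

def route(report: str) -> str:
    """Send the report to the matching specialist, or to a human if none matches."""
    label = classify(report)
    handler = HANDLERS.get(label, lambda r: f"human triage: {r}")
    return handler(report)
```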

Parallelization runs subtasks simultaneously. This comes in two flavors: sectioning, where you split work into independent chunks that run at once, and voting, where you run the same task several times to get multiple opinions and take the consensus. Asking three agents to independently find security issues in a diff and pooling their findings is voting.
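
The voting flavor might look like the following sketch. The three reviewer functions are hypothetical stand-ins that return canned findings; in practice each would be a separate model call with its own prompt. A finding is kept only when at least `min_votes` reviewers report it.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical reviewers: each returns the set of issues it found in the diff.
def reviewer_a(diff: str) -> set[str]:
    return {"sql injection", "missing auth check"}

def reviewer_b(diff: str) -> set[str]:
    return {"sql injection"}

def reviewer_c(diff: str) -> set[str]:
    return {"sql injection", "missing auth check", "typo in comment"}

def vote(diff: str, reviewers, min_votes: int = 2) -> list[str]:
    """Run all reviewers concurrently and keep findings with enough agreement."""
    with ThreadPoolExecutor(max_workers=len(reviewers)) as pool:
        results = list(pool.map(lambda review: review(diff), reviewers))
    counts = Counter(finding for found in results for finding in found)
    return sorted(finding for finding, n in counts.items() if n >= min_votes)

consensus = vote("example diff", [reviewer_a, reviewer_b, reviewer_c])
```

Setting `min_votes=1` turns the same function into plain pooling, which trades precision for recall.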

Orchestrator-workers is the pattern most real coding agents use. A central agent dynamically breaks down a task, delegates the pieces to worker agents, and synthesizes their results. Unlike a fixed chain, the subtasks are not predetermined — the orchestrator decides what they are based on what it finds. This is what lets an agent handle a GitHub issue that touches eleven files it has never seen.
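
A compact sketch of the shape, under heavy simplification: `plan_subtasks` is a hypothetical planner that picks subtasks from what it finds in the repository (here, by a path filter), where a real orchestrator would ask a model to make that decision. The worker is likewise a stand-in.

```python
def plan_subtasks(issue: str, repo_files: list[str]) -> list[dict[str, str]]:
    """Hypothetical planner: decides the subtasks at run time from the file list."""
    return [
        {"file": path, "instruction": f"{issue} in {path}"}
        for path in repo_files
        if "auth" in path  # assumption: only auth-related files are in scope
    ]

def worker(subtask: dict[str, str]) -> str:
    """Stand-in for a worker agent that edits one file and reports back."""
    return f"patched {subtask['file']}"

def orchestrate(issue: str, repo_files: list[str]) -> str:
    """Decompose, delegate, and synthesize; the orchestrator edits nothing itself."""
    subtasks = plan_subtasks(issue, repo_files)
    results = [worker(task) for task in subtasks]
    return f"{len(results)} subtasks: " + "; ".join(results)
```

The contrast with a prompt chain is that the number and content of the subtasks are not known until `plan_subtasks` has looked at the input.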

Evaluator-optimizer puts two agents in a loop: one generates a solution, the other evaluates it against explicit criteria and sends back feedback, and the cycle repeats until the work passes. This is the reviewer-critic pattern, and it is one of the highest-leverage things you can add to your workflow.
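
One way to picture the loop, as a sketch with two hypothetical stand-ins: `generate` plays the agent producing a draft, and `evaluate` plays the critic applying a single explicit criterion. Real versions would both be model calls; the loop and the round limit are the parts that carry over.

```python
def generate(task: str, feedback: list[str]) -> str:
    """Hypothetical generator: revises its draft once feedback asks for type hints."""
    if "add type hints" in feedback:
        return "def add(a: int, b: int) -> int: return a + b"
    return "def add(a, b): return a + b"

def evaluate(draft: str) -> list[str]:
    """Hypothetical evaluator: returns a list of problems, empty when the draft passes."""
    return [] if "->" in draft else ["add type hints"]

def refine(task: str, max_rounds: int = 3) -> tuple[str, int]:
    """Generate, evaluate, feed problems back, and stop when the draft passes."""
    feedback: list[str] = []
    draft = ""
    for round_number in range(1, max_rounds + 1):
        draft = generate(task, feedback)
        problems = evaluate(draft)
        if not problems:
            return draft, round_number
        feedback.extend(problems)
    return draft, max_rounds  # best effort after the round limit

final_draft, rounds_used = refine("write an add function")
```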

The core loop: explore, plan, code, commit

Underneath all the orchestration, there is a single loop that the best agentic workflows follow on almost every task. It is worth internalizing because it maps cleanly onto how a thoughtful senior engineer already works, and because skipping any one phase is where most agent failures come from.

1. Explore

Before touching a line of code, the agent reads. It opens the relevant files, the existing tests, the surrounding modules, and any specs or tickets. Crucially, in this phase it is not allowed to edit anything. Claude Code calls this plan mode; in other tools you enforce it by simply telling the agent to investigate and report back before proposing changes. The point is to load the right context before any decisions get made. When you are dropped into an unfamiliar codebase, this is also how you onboard: ask the agent the same questions you would ask a senior engineer on the team — where does authentication live, how is state managed, what is the test setup.

2. Plan

Next, the agent produces a written plan: what it intends to change, in which files, in what order, and how it will verify the result. This plan is an artifact you read. It is the single most important human-in-the-loop checkpoint in the entire workflow, because correcting a flawed plan costs a sentence, while correcting a flawed implementation costs a review cycle. A widely shared rule of thumb: the only time you can safely skip the planning phase is when you could describe the entire diff in a single sentence. For anything larger, make the agent plan first.

There is a subtle technique here that separates people who are good at this from people who are great at it: run planning and implementation in separate sessions. The exploration phase fills the context window with file contents, dead ends, and reasoning that is useful for producing the plan but becomes noise during implementation. Start the coding phase fresh, with the clean plan as its input.

3. Code

Now the agent implements against the plan, and — this is the part that makes the whole thing work — it runs a check after each meaningful step. Tests, a build, a type-checker, a linter, a screenshot diff: anything that returns an unambiguous pass or fail. The highest-leverage thing you can give an agent is a way to tell whether its own work is correct. When that feedback loop exists, the agent self-corrects and you stay out of the way. When it does not, you become the feedback loop, and your throughput collapses to the speed of your own attention.
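
A sketch of such a feedback loop in Python, assuming each check is just a command whose exit code signals pass or fail. The demo commands below are trivial Python one-liners so the sketch runs anywhere; in a real project they would be whatever your build, type-check, and test commands are.

```python
import subprocess
import sys

def run_checks(checks: list[tuple[str, list[str]]]) -> tuple[bool, str]:
    """Run each (name, command) in order and stop at the first failure.

    Returns (True, "") when everything passes, or (False, name_of_failed_check).
    """
    for name, command in checks:
        result = subprocess.run(command, capture_output=True, text=True)
        if result.returncode != 0:
            return False, name
    return True, ""

# Placeholder commands: exit code 0 means pass, anything else means fail.
PASS = [sys.executable, "-c", "raise SystemExit(0)"]
FAIL = [sys.executable, "-c", "raise SystemExit(1)"]

all_green = run_checks([("typecheck", PASS), ("tests", PASS)])
first_red = run_checks([("typecheck", PASS), ("tests", FAIL), ("lint", PASS)])
```

Stopping at the first failure keeps the signal unambiguous: the agent gets one named check to fix before moving on.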

This is also where test-driven development quietly becomes a superpower. Have one agent write the tests for a behavior first, confirm they fail, then have a second agent (or a fresh session) write the implementation until they pass. The tests become an objective target that keeps the implementing agent honest.

4. Commit

Finally, the agent commits with a clear message and opens a pull request, ideally summarizing what changed and why. Frequent, small commits give you rollback points and keep each change reviewable. If something went sideways three steps back, you want to return to a known-good state without unpicking an hour of tangled edits.

The specialist crew: which agents to actually run

Once the core loop is second nature, the multi-agent layer is about assigning narrow roles. The mistake most people make is creating broad, generic agents (a "backend engineer" agent, a "QA" agent) and wondering why the results are mediocre. The guidance from the people who build these tools is the opposite: make your sub-agents specific, and give each one a single job. A focused agent with a tight brief and the right tools will usually outperform a generalist.

A practical crew for everyday development looks like this:

The coding agent does the implementation. It has write access to the source, can run the build and the tests, and works against an approved plan.
The reviewer agent reads diffs with fresh context and flags problems. This is your evaluator-optimizer in human-readable form. One important caveat: an agent told to "find issues" will always find some, inventing nitpicks if it has to. Instruct it to flag only issues that affect correctness or the stated requirements, or you will drown in over-engineering suggestions.
The testing agent writes and maintains tests. Test coverage gaps are an ideal thing to delegate, because the success criterion is objective and the work is tedious for a human.
The research agent explores unfamiliar territory — a new library, an undocumented internal service, an error nobody recognizes — and returns a compact summary rather than dumping everything it read into the main context.
The documentation agent keeps READMEs, changelogs, and inline docs in sync with what actually shipped.

You do not need all five on every task. A one-line bug fix needs no crew at all; a single session will do. A multi-file feature might use four. The skill is matching the size of the crew to the size of the problem.

Context engineering: the skill that separates good from great

If there is one discipline that determines whether your agentic workflow soars or stalls, it is context engineering — and it has quietly replaced prompt engineering as the thing worth getting good at.

The core insight is that a context window is a finite resource with diminishing returns. As it fills, models get measurably worse at using what is in it — a phenomenon documented under the name context rot. The goal is not to cram in as much as possible; it is to find the smallest set of high-signal tokens that make the right outcome most likely. More context is not better. Better context is better.

Three techniques do most of the heavy lifting:

Compaction. When a session approaches the limits of its window, summarize it into a fresh one — carrying forward the decisions and state that matter, dropping the exploratory noise. Long-running agents survive precisely because they compress their own history instead of drowning in it.
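
A rough sketch of compaction, with two loud assumptions: token counts are estimated at about four characters per token, and `summarize` is a hypothetical stand-in that keeps only lines marked as decisions, where a real version would ask a model to write the summary.

```python
def estimate_tokens(text: str) -> int:
    """Crude assumption: roughly one token per four characters."""
    return max(1, len(text) // 4)

def summarize(messages: list[str]) -> str:
    """Hypothetical summarizer: keeps decisions, drops exploratory noise."""
    decisions = [m for m in messages if m.startswith("DECISION:")]
    return "SUMMARY: " + " | ".join(decisions)

def compact(history: list[str], budget: int, keep_recent: int = 2) -> list[str]:
    """If the history exceeds the budget, replace older messages with a summary."""
    total = sum(estimate_tokens(m) for m in history)
    if total <= budget or len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    return [summarize(older)] + recent

session = [
    "read file A " * 20,
    "DECISION: use Redis for counters",
    "dead end " * 20,
    "DECISION: add middleware in app.ts",
    "running tests",
]
compacted = compact(session, budget=50)
```

Keeping the most recent messages verbatim is a deliberate choice in this sketch: the latest state is usually what the next step depends on.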

Structured note-taking. Have the agent write its progress to an external file — a running list of what is done, what is left, and what it learned. That file persists across context resets, so a fresh agent can pick up exactly where the last one stopped. This is, in effect, giving your agents a memory that outlives any single session.

Sub-agent isolation. This is the deepest reason multi-agent systems work at all. When you delegate exploration to a sub-agent, that agent burns through its own context window reading files and chasing references — and returns only a tidy summary. The orchestrator's context stays clean. You have effectively parallelized not just the work but the attention, keeping the main thread focused while the messy investigation happens elsewhere.
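
The effect can be sketched as follows. The sub-agent here is a plain function with a local list standing in for its private context window, and the file contents are invented for the demo; what matters is that only the one-line summary crosses back into the orchestrator's context.

```python
def research_subagent(question: str, files: dict[str, str]) -> str:
    """Reads everything into a private context, returns only a short summary."""
    private_context = [f"{path}:\n{body}" for path, body in files.items()]  # never returned
    hits = sorted(path for path, body in files.items() if "redis" in body.lower())
    _ = private_context  # stands in for the tokens the sub-agent spends reading
    return f"{question} -> Redis referenced in: {', '.join(hits)}"

def orchestrator(task: str, files: dict[str, str]) -> list[str]:
    """The main context holds the task and the summary, not the raw files."""
    context = [f"task: {task}"]
    context.append(research_subagent("where is Redis used?", files))
    return context

# Invented file contents for the demo.
FILES = {
    "src/session.ts": "import { createClient } from 'redis'\n" + "// session handling\n" * 50,
    "src/routes.ts": "// route table\n" * 50,
}
main_context = orchestrator("add rate limiting", FILES)
```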

Give your agents a memory: CLAUDE.md and AGENTS.md

The single highest-return piece of configuration in any agentic setup is the project memory file — CLAUDE.md for Claude Code, AGENTS.md for Codex and a growing list of other tools (it is now an open standard). This file is loaded into context automatically, so it is where you encode the things you would otherwise have to repeat in every prompt.

A good memory file tells the agent three things: the what (the stack, and a map of the project — essential in a monorepo), the why (what the major components are for), and the how (how to run the tests, the type-check, the build — the exact commands the agent uses to verify its own work). That last category is the most valuable and the most often forgotten.

# AGENTS.md

## Stack
- Next.js 15 (App Router), TypeScript, PostgreSQL via Prisma
- Tests: Vitest (unit), Playwright (e2e)

## Commands the agent should use
- Install: `pnpm install`
- Typecheck: `pnpm typecheck`   # run after every change
- Unit tests: `pnpm test`        # must pass before commit
- E2e tests: `pnpm test:e2e`     # run for any routing/auth change
- Lint: `pnpm lint --fix`

## Conventions
- Server components by default; mark client components explicitly.
- Never edit files in `/generated`. They are build artifacts.
- Database changes require a Prisma migration, never a manual schema edit.

## Where things live
- Auth: `src/lib/auth/`
- API route handlers: `src/app/api/`
- Shared UI: `src/components/ui/`

The discipline that matters most: keep it lean. Even the strongest models have a limited budget of attention for instructions. A bloated memory file does not make the agent more careful; it makes the agent start ignoring instructions. A useful test is to delete any line whose removal would not cause a mistake. And because this file shapes everything the agent does, check it into version control and review changes to it as you would review code: it is a shared asset for the whole team, not a personal scratchpad.

Running AI agents in parallel with git worktrees

Here is where the throughput math changes. A single agent session, however well configured, makes you perhaps one and a half to two times faster on a given task. The teams reporting dramatically larger gains are not getting them from single-session speed — they are getting them from concurrency, running many agent sessions at once on independent pieces of work.

The enabling technology is humble: git worktrees. A worktree lets you check out multiple branches of the same repository into separate directories simultaneously, so several agents can each work on their own branch without stepping on each other. Modern tools have built this in — some IDE-based agents will run up to eight parallel sessions in worktree isolation out of the box, and the cloud-based coding agents spin up a fresh sandboxed environment per task by default.

There are two distinct ways to use parallelism, and they are worth separating:

Breadth: run several different tasks at once — fix three unrelated bugs, each in its own worktree, each with its own agent. This is the everyday multiplier. While one agent grinds through a tedious refactor, two others are closing tickets.
Competition: point several agents at the same hard problem with different approaches, then keep the best result and discard the rest. When you genuinely do not know the right design, racing a few attempts in parallel is often faster than agonizing over one.
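To make the breadth pattern concrete, here is a minimal sketch that builds one `git worktree add` command per task. The helper names, the sibling-directory layout, the `agent/` branch prefix, and the `main` base branch are all assumptions for illustration; only the `git worktree add -b <branch> <path> <commit-ish>` form comes from git itself.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def worktree_commands(repo: Path, tasks: Dict[str, str], base: str = "main") -> List[List[str]]:
    """Build one `git worktree add` command per task.

    `tasks` maps a short task slug to the new branch to create for it.
    Each worktree is placed next to the repository, e.g. ../shop-fix-login
    (a hypothetical layout, not a git requirement).
    """
    commands = []
    for slug, branch in tasks.items():
        path = repo.parent / f"{repo.name}-{slug}"
        # git worktree add -b <new-branch> <path> <commit-ish>
        commands.append(
            ["git", "-C", str(repo), "worktree", "add", "-b", branch, str(path), base]
        )
    return commands


def create_worktrees(repo: Path, tasks: Dict[str, str], base: str = "main") -> None:
    """Run the commands; raises CalledProcessError if git rejects one."""
    for cmd in worktree_commands(repo, tasks, base):
        subprocess.run(cmd, check=True)
```

Each agent session would then be started inside its own directory. When a branch is merged or discarded, `git worktree remove <path>` cleans up the checkout and `git worktree list` shows what is still open.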

A caution that the enthusiasts sometimes skip: there is a hard limit on how many parallel sessions a human can actually supervise. Each one produces work you are responsible for reviewing. Concurrency moves the bottleneck from writing code to reviewing it — which means review capacity, not generation speed, becomes the thing that governs your real throughput. We will come back to this.

A day in the life: one task, end to end

Theory is cheap. Here is what the whole system looks like applied to a single, ordinary task: "Add rate limiting to our public API endpoints."

You start in plan mode. The research agent explores the codebase — it finds the existing middleware stack, notes that there is a Redis instance already used for sessions, and reports back that there is no current rate-limiting layer and three places where new middleware could hook in. It does not write anything; it returns a paragraph of findings. Your main context stays clean.

You ask for a plan. The agent proposes: add a sliding-window limiter backed by the existing Redis, wire it into the middleware chain, make the limits configurable per route, and cover it with unit tests plus one end-to-end test. The plan names the four files it will touch. You read it and notice it forgot about the health-check endpoint, which must never be rate-limited. You add one sentence. That correction just saved you a production incident, and it cost you ten seconds.

You start a fresh session for implementation, handing it the approved plan. The coding agent writes the limiter, runs the type-checker after each file, and writes the tests. The first run of the e2e test fails — the limiter is counting health-check requests. Because the agent has a passing/failing signal, it sees the failure, recalls the constraint from the plan, adds the exclusion, and the suite goes green. You did nothing during this.

Now the reviewer agent reads the diff with fresh context, instructed to flag only correctness and requirements issues. It catches that the limiter fails open — if Redis is unreachable, all requests are allowed through — and asks whether that is intentional. It is a real question, so you make a real decision: fail closed for write endpoints, open for reads. The coding agent applies it.
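The three decisions in this walkthrough (a sliding window, a health-check exemption, and fail-open for reads but fail-closed for writes) fit in a short sketch. This is not the implementation the article's agent produced: it keeps timestamps in memory instead of Redis, simulates an outage with a `store_up` flag, uses a hypothetical `/health` path, and leaves out per-route limits.

```python
import time
from collections import defaultdict, deque
from typing import Callable, Deque, Dict, FrozenSet

READ_METHODS: FrozenSet[str] = frozenset({"GET", "HEAD", "OPTIONS"})


class StoreUnavailable(Exception):
    """Raised when the backing store (Redis in the article) cannot be reached."""


class SlidingWindowLimiter:
    def __init__(self, limit: int, window_seconds: float,
                 exempt_paths: FrozenSet[str] = frozenset({"/health"}),
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.limit = limit
        self.window = window_seconds
        self.exempt_paths = exempt_paths
        self.clock = clock
        self.store_up = True  # flip to False to simulate an outage
        self._hits: Dict[str, Deque[float]] = defaultdict(deque)

    def _record(self, key: str) -> bool:
        if not self.store_up:
            raise StoreUnavailable(key)
        now = self.clock()
        hits = self._hits[key]
        # Drop timestamps that have slid out of the window.
        while hits and hits[0] <= now - self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True

    def allow(self, client_id: str, method: str, path: str) -> bool:
        if path in self.exempt_paths:
            return True  # health checks are never counted or limited
        try:
            return self._record(client_id)
        except StoreUnavailable:
            # Fail open for reads, fail closed for writes.
            return method.upper() in READ_METHODS
```

Injecting the clock is what makes the limiter testable without sleeping, which is the same "passing/failing signal" the coding agent relied on.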

Finally the agent commits with a descriptive message and opens a PR, and the documentation agent adds the new per-route configuration to the API docs. Elapsed human effort: reading one plan, adding one sentence, and making one design decision. Everything else ran on its own, and the two things you contributed (the health-check constraint and the failure-mode policy) were exactly the two things a machine should not have decided for you.

The maturity ladder: how to actually adopt this

You do not get here in a day, and you should not try. The reliable path is a ladder, where each rung pays off on its own and sets up the next. Climb it only as fast as the value justifies the added complexity and token cost.

1. Single-agent assistant. Use one agent for questions, exploration, and small, well-scoped edits. Always start from a clean git state so you can see exactly what changed. Get fluent in the explore-plan-code-commit loop.
2. A memory file. Write a lean CLAUDE.md or AGENTS.md covering your stack, your commands, and your conventions. Commit it. This one file removes most of the repetitive instruction you have been typing.
3. Custom commands. Capture the workflows you repeat ("review this PR for security issues," "write tests for this module") as reusable slash commands or skills, so they are one keystroke instead of a paragraph.
4. Specialist sub-agents. Introduce focused, single-job agents (a reviewer, a tester), each with its own context window and a restricted set of tools. Tell the orchestrator explicitly when to use them.
5. Parallel and orchestrated workflows. Run multiple agents across git worktrees for independent tasks. Bring in orchestrator-worker and evaluator-optimizer patterns for the complex jobs. Reach for a dedicated orchestration framework only when a task genuinely decomposes into parallel threads and the value clearly justifies the overhead.

The temptation is to jump straight to rung five because it is the most exciting. Resist it. A team fluent at rungs two and three ships more reliably than a team fumbling a multi-agent framework it does not yet understand.

The honest scorecard: does this actually make you faster?

This is the section most write-ups skip, and it is the most important one. The productivity story for AI agents is real, but it is genuinely mixed, and an engineer deciding how to invest their time deserves the unvarnished version.

The optimistic data is striking. Internal reports from teams that have leaned in describe large jumps in throughput — engineers merging substantially more pull requests per day after adopting an agentic workflow, and individual tasks completing far faster than before. Vendor case studies cite feature timelines collapsing from weeks to days. Controlled studies of older, completion-style assistance found meaningful increases in successful builds and pull request volume. There is clearly something real here.

But the most rigorous independent study points the other way, and it deserves your attention precisely because it is inconvenient. In a randomized controlled trial conducted in early 2025, experienced open-source developers working on their own mature repositories were measured completing real tasks with and without AI assistance. They predicted the tools would make them about 24% faster. Afterward, they believed they had been about 20% faster. They were actually 19% slower with the AI. The gap between perceived and actual performance was the whole story: the tools felt fast while quietly costing time, because steering, reviewing, and correcting the agent on a codebase they already knew intimately outweighed the typing it saved. The tooling has improved since that study ran, and the result may not hold for today's agents — but the perception gap it exposed is the durable lesson, and you should assume you are subject to it.

Developers predicted a 24% speedup, felt a 20% speedup, and were measured 19% slower.

Industry-wide surveys land in the same ambivalent place. The largest annual developer survey found that while the overwhelming majority of developers now use or plan to use these tools, trust is actually falling — a minority trust the accuracy of AI output, and the single most common complaint is code that is "almost right, but not quite," which is precisely the failure mode that eats time, because nearly-correct code is harder to debug than obviously-broken code. The major DevOps research report reached a conclusion worth taping to your monitor: AI does not fix a struggling team, it amplifies what is already there. Strong teams with good tests and tight feedback loops get faster. Weak teams just generate their dysfunction more quickly — and, notably, the same report found that rising AI-driven throughput correlated with worse delivery stability unless the engineering fundamentals were already in place.

Measurement specialists have a name for the trap: false velocity. More pull requests merged is not the same as more value delivered. If generation speeds up but review, testing, and integration do not, you have not gotten faster — you have just moved the bottleneck downstream and made it harder to see.
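The bottleneck argument can be put in numbers with a toy model in which delivery is capped by the slowest stage. The capacities below are invented for illustration and do not come from any of the studies cited here.

```python
from typing import Dict, Tuple


def delivered_per_day(capacity: Dict[str, float]) -> Tuple[str, float]:
    """Return (bottleneck stage, pull requests actually delivered per day)."""
    stage = min(capacity, key=lambda name: capacity[name])
    return stage, capacity[stage]


# Hypothetical per-day capacities, in pull requests.
before = {"generate": 4.0, "review": 6.0, "test_and_integrate": 8.0}
after = dict(before, generate=12.0)  # agents triple generation only

print(delivered_per_day(before))  # ('generate', 4.0)
print(delivered_per_day(after))   # ('review', 6.0)
```

In this toy case generation triples while delivery rises only from 4 to 6 per day, and six unreviewed pull requests pile up daily. The dashboard that counts opened pull requests looks three times better than the one that counts shipped work.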

How do you reconcile the striking gains with the sobering studies? The honest synthesis is that agents help most where the work is broad, unfamiliar, or boilerplate-heavy — scaffolding a new service, exploring a codebase you have never seen, generating the tedious tests — and help least, or actively hurt, when a domain expert is working on a mature, tightly-coupled system with a high quality bar, where the cost of explaining the task to the agent exceeds the cost of just doing it. The leverage is real. It is just not uniform, and believing it is uniform is how you end up slower while feeling faster.

Anti-patterns and how to avoid them

Most of the ways agentic workflows fail are predictable, which means they are avoidable.

The kitchen-sink session. Pouring three unrelated tasks into one long-running context pollutes it and degrades every answer. Clear the context between tasks. A good rule: after two failed attempts at the same fix, stop, clear, and rewrite the prompt from scratch rather than digging the hole deeper.
Skipping the plan. Letting the agent code immediately on anything non-trivial is how you get confidently-wrong implementations that are expensive to unwind. Plan first; the plan is your cheapest correction point.
Over-engineering. Left unprompted, agents add abstractions, helpers, and options nobody asked for. Tell them explicitly to use the simplest approach that works, and have your reviewer agent flag unnecessary complexity rather than reward it.
Review pile-up. If you scale generation without scaling review, you create a downstream traffic jam and your cycle time gets worse even as your commit count climbs. Invest in review capacity — including automated reviewer agents — before you crank up output.
Trusting unverifiable work. Never fully delegate a task whose result you cannot check. If there is no test, no build, no observable behavior to confirm correctness, you are not delegating — you are gambling.

The new attack surface: AI agent security

Agentic workflows introduce a class of risk that did not exist when your AI tool was just suggesting completions: the agent now reads untrusted content and takes actions. The dominant new threat is indirect prompt injection — malicious instructions hidden in the data an agent consumes. A poisoned dependency, a booby-trapped issue comment, a web page the agent fetches during research, even a crafted string in a file it reads can all carry instructions that hijack the agent's behavior. Because agents chain tools together, a single injection can escalate into a sequence of unintended actions.

The defenses are practical and worth adopting before you scale autonomy, not after an incident:

Sandbox aggressively. Run agents in isolated environments — containers or dedicated VMs — with no standing access to anything they do not need. The cloud coding agents do this by default; for local agents, you have to set it up.
Restrict the blast radius. Limit network egress to an allowlist, block writes outside the workspace, and protect configuration files from agent edits. An agent that cannot reach the open internet or modify its own permissions is dramatically harder to weaponize.
Gate the irreversible. Require explicit human approval before anything you cannot undo — deleting data, deploying, moving money, force-pushing. Keep a human in the loop precisely at the points where a mistake is permanent.
Treat the dangerous flags as dangerous. The options that bypass approvals and sandboxing entirely have their place — inside a hardened container, for a trusted task. Running them on your own machine against a real codebase is how a bad afternoon becomes a very bad afternoon.
Own the output. Whoever's name is on the pull request owns the code, regardless of how much of it an agent wrote. AI assistance does not transfer responsibility, and the survey data is clear that nearly-right insecure code is a real and common failure mode.
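Several of these defenses reduce to one check that runs before an agent's action is executed. The sketch below assumes a hypothetical `Action` record and a three-way verdict (`allow`, `deny`, `needs_approval`); the host names and file names are placeholders, and a real deployment would enforce the same rules at the sandbox or network layer rather than trusting in-process code alone.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import FrozenSet
from urllib.parse import urlparse


@dataclass(frozen=True)
class Action:
    kind: str                  # "write_file", "http_request", "shell", ...
    target: str                # file path, URL, or command line
    irreversible: bool = False


@dataclass(frozen=True)
class Policy:
    workspace: Path
    allowed_hosts: FrozenSet[str]
    protected_files: FrozenSet[str] = frozenset()

    def decide(self, action: Action) -> str:
        if action.kind == "write_file":
            root = self.workspace.resolve()
            target = (self.workspace / action.target).resolve()
            # Block writes that escape the workspace (absolute paths, "..").
            if root not in target.parents:
                return "deny"
            # Block edits to configuration the agent must not change.
            if target.name in self.protected_files:
                return "deny"
        if action.kind == "http_request":
            # Egress is limited to an explicit allowlist of hosts.
            if urlparse(action.target).hostname not in self.allowed_hosts:
                return "deny"
        if action.irreversible:
            return "needs_approval"  # a human signs off before it runs
        return "allow"
```

Denials are evaluated before the approval gate on purpose: a human should never be asked to approve something the policy already forbids.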

Start tomorrow

The gap between using an AI agent and running an agentic workflow is not about access to a better model. It is about method — and method is what actually moves your turnaround time, not the raw speed of the model underneath. If you take only a handful of things from this guide, make them these.

Adopt the explore-plan-code-commit loop on your very next task, and refuse to skip the plan on anything you could not describe in one sentence. Write a lean memory file for your main repository this week and commit it. Make every task you delegate verifiable — if the agent cannot run a check that tells it whether it succeeded, build that check before you hand over the work. Add a reviewer agent and instruct it to care only about correctness. And when you are ready, start running two independent tasks in parallel worktrees, then three, until you hit the edge of what you can actually review — because that edge, not the model's speed, is your real limit.

Above all, stay honest with yourself about the gains. These tools can make a strong engineer on the right kind of problem genuinely, dramatically faster. They can also make you feel fast while quietly slowing you down. The engineers who win with agents are not the ones who trust them the most — they are the ones who built the verification, the structure, and the judgement to know the difference.

The 95% Problem: Why Enterprise AI Keeps Failing — and What the 5% Get Right

Harshdeep Singh — Thu, 11 Jun 2026 19:53:22 +0000

Ninety-five out of every hundred enterprise AI pilots produce nothing a CFO would sign off on. The reflex is to blame the model — too dumb, too small, the wrong vendor. It almost never is. The thing quietly killing enterprise AI is older and more boring than any model: data nobody organized for machines, and rules nobody ever wrote down. The strangest part of the story is who is losing the fight hardest — the firms whose entire business is selling everyone else the cure.

The most expensive irony in enterprise software

In late 2025, Deloitte gave part of a government cheque back. The firm had delivered a report to Australia's Department of Employment and Workplace Relations, and reviewers found something awkward buried inside it: citations to academic papers that did not exist, and a fabricated reference to a federal court judgment. The work had been produced with help from generative AI, and no one had checked it before it went out the door. Deloitte agreed to refund part of its fee.

It is tempting to read that as a story about a hallucinating chatbot. It is not. A capable model can cite a real paper; the failure was not that the AI was too weak. The failure was that nothing in the process forced a human to verify machine output before it reached a client. There was no standard operating procedure, no checkpoint, no rule with teeth. That distinction — between a model problem and a data-and-governance problem — is the entire subject of this essay, and the firms that sell AI for a living have just handed us the clearest possible illustration of it.

Consider the position those firms are in. Since 2023, the Big Four and the major strategy houses have collectively poured more than ten billion dollars into AI. It is their flagship pitch. Accenture reported close to six billion dollars in generative-AI bookings in a single fiscal year. PwC became one of OpenAI's largest enterprise customers and then its reseller. KPMG signed a two-billion-dollar alliance with Microsoft. These organizations have built their modern brand on the promise that they can walk into any enterprise and fix its AI problem. And yet, internally, they have hit precisely the wall they are paid to dismantle.

This is not schadenfreude about one embarrassing report. It is the most useful data point in the entire enterprise-AI conversation, because it removes the easy excuses. You cannot say the consultants lacked talent, budget, model access, or executive buy-in. They had all of it in abundance. If the people who sell the cure can still catch the disease, then the disease is not what most companies think it is. It is not a shortage of intelligence in the model. It is a shortage of order in the data and discipline in the governance — and almost nobody is immune.

The number that belongs on every board agenda

Start with the figure that has been ricocheting around boardrooms since it landed. In its 2025 report The GenAI Divide: State of AI in Business, MIT's NANDA initiative studied hundreds of enterprise AI initiatives and concluded that roughly ninety-five percent of them had produced no measurable impact on the bottom line. Not weak returns. No returns. The spending in scope ran to tens of billions of dollars, and the overwhelming majority of it bought experiments that never crossed into anything a finance team could defend.

95% of enterprise generative-AI pilots deliver no measurable business impact — the spending lands, the value does not.

It is not an isolated finding. Gartner predicted that by the end of 2025, three in ten generative-AI projects would be abandoned after the proof-of-concept stage, and that through 2026, sixty percent of AI projects will be scrapped specifically because the organizations lack AI-ready data. The firm goes further on agents: it forecasts that more than forty percent of agentic-AI projects will be cancelled by the end of 2027. The RAND Corporation has put the historical AI project failure rate above eighty percent, roughly twice the rate of conventional IT projects. And S&P Global found that the share of companies abandoning most of their AI initiatives jumped to forty-two percent in 2025, up from just seventeen percent a year earlier. The trend is not improving as the technology matures. It is getting worse as spending outruns readiness.

The crucial detail is where these projects die. They almost never fail in the lab. They fail on the road to production. A pilot runs on a curated slice of data — a clean schema, a controlled volume, a problem chosen because it demos well. Production runs on the actual enterprise: the duplicated records, the contradictory definitions, the fields that mean different things in different systems, the knowledge trapped in formats no machine can read. The distance between the demo and the deployment is the distance between curated data and real data, and that distance is where the money disappears. People in the field have a name for the place projects go to expire: pilot purgatory.

The people closest to the data already know this. In Informatica's 2025 survey of chief data officers, the most-cited obstacle to AI success was not talent, not budget, not model quality — it was data quality and readiness. The executives responsible for the foundation are telling everyone the foundation is the problem. Most strategies are simply not listening, because listening would mean slowing down to do the tedious work, and the market is rewarding speed.

And the window in which to fix this is closing faster than the failure rate alone suggests, because the industry is sprinting from chatbots to agents. Gartner expects that by the end of 2026, four in ten enterprise software applications will include task-specific AI agents, up from less than one in twenty in 2025. Agents raise the stakes of the underlying problem by an order of magnitude. A chatbot that retrieves bad data returns a bad answer a human can still catch. An agent that acts on bad data — reconciling an account, approving a request, triggering a downstream workflow — propagates the error into the real world before anyone reviews it. The same analysts forecasting the agentic wave also forecast that more than forty percent of agentic projects will be cancelled by the end of 2027, for the same unglamorous reasons the chatbots failed. We are, in other words, about to point far more autonomous systems at foundations that were already too weak for the last generation of tools.

That is what makes the failure rate a strategic problem rather than a technical footnote. The cost is not the wasted pilot budget; that is the cheap part. The real cost is competitive. Every quarter a rival reaches production while you re-run experiments that were always going to fail for the same reason, the rival's system gets better, its data gets cleaner, its people get more fluent, and the gap compounds. You are not standing still. You are losing ground while looking busy.

It was never the model

The comforting story inside most failed AI programs is that the technology was not ready, and that the next model — bigger, newer, from a different lab — will be the one that finally works. It is comforting because it requires nothing of the organization except patience and a bigger invoice. It is also wrong.

Here is the inconvenient test. The model that hallucinated its way through your failed pilot is, in most cases, the same model that performed flawlessly in the vendor's demo. Nothing about the weights changed between those two moments. What changed was everything around the model: the quality of the data it was fed, the clarity of the instructions it was given, and the rules governing what it was allowed to touch. The model was never the variable. The environment was.

The model in your failed pilot and the model in the vendor's flawless demo are usually the same model. The difference between them is everything you built — or failed to build — around it.

This is also why "wait for the next model" is such a seductive and expensive trap. Each new model is genuinely more capable than the last, which makes it easy to believe the next one will finally clear the bar. But a more capable model pointed at the same unstructured data and the same absent rules does not fix the problem — it executes the same mistakes more fluently, and, increasingly, more autonomously. Capability without a foundation is not progress. It is leverage applied to a fault line.

That environment has two load-bearing pieces, and almost every enterprise is missing both. The first is data that a machine can actually reason over. The second is governance a machine can actually obey. The original instinct that AI needs "organized data and some kind of SOP" is exactly right — it just turns out that each half is a deep discipline in its own right, and that naming them separately is the difference between a strategy that works and a slide that sounds good. Take them one at a time.

Gap one: data that was never built for machines

An AI agent does not think the way a database is organized. It does not navigate neat rows and columns; it reasons over entities, the relationships between them, and the context that gives them meaning. It needs to know that this customer is the same as that account, that "revenue" in the finance system and "revenue" in the sales dashboard are or are not the same number, that this contract supersedes that one, that this policy applies to this region. Enterprise data, as it actually exists, is almost the precise opposite of that.

In most companies the data is siloed across systems that were never designed to talk to each other, duplicated in ways no one fully maps, and defined inconsistently enough that the same word can name genuinely different things in different systems. Worse, the knowledge that actually matters — the reasoning, the precedent, the hard-won judgment — tends to live in formats machines cannot read: slide decks, PDFs, email threads, and the heads of senior people who are about to retire. You can connect the cleanest model in the world to that, and it will faithfully reflect the chaos back to you.

The most instructive proof of this comes, again, from a consultancy. When McKinsey built its internal AI platform, the firm discovered that the tool could not initially parse PowerPoint — which was a problem, because PowerPoint is where most of McKinsey's institutional knowledge actually lived. Sit with that for a moment. One of the most knowledge-intensive organizations on earth, a firm whose entire product is structured thinking, found that its crown-jewel intellectual property was effectively illegible to a machine until it did real work to fix the ingestion. If McKinsey's knowledge was trapped in slides, it is worth asking, honestly, what shape yours is in.

The failure rarely announces itself as missing data. It hides in data that is present but means subtly different things in different places. Ask an agent a question as ordinary as how many active customers the business has, and it will find a dozen tables with a dozen definitions of "active" — a login within the last thirty days in one system, a non-zero balance in another, an uncancelled contract in a third. A human analyst resolves that ambiguity with context and a quick message to a colleague. An agent, lacking both, picks one definition silently and reports a confident number that is wrong in a way nobody can see. Multiply that across every entity and every metric a company cares about, and you have the real texture of the problem: not an empty warehouse, but a full one with no shared language.

You cannot retrieve your way out of a data swamp

The popular hope is that retrieval-augmented generation — pointing an agent at your documents and letting it fetch what it needs — will paper over the mess. It will not. An agent retrieving from a swamp returns swamp, dressed up in fluent prose that makes the swamp harder to detect. And the instinct to fix this by building a bigger data lake usually just produces a bigger swamp with better storage economics. Volume was never the problem. Meaning was.

What actually closes the gap is a layer most enterprises have never built: a semantic, machine-readable map of what the data means. In practice this goes by several names that point at the same idea — a semantic layer, an ontology, a knowledge graph, a governed data catalog. The common thread is that core business concepts get defined once, consistently, in a form an agent can consume: what a customer is, what counts as revenue, how entities relate, which rules and constraints apply. The catalog becomes the control plane of truth, and the semantic layer becomes the thing that lets a model answer in terms of your business rather than in terms of raw, ambiguous tables.
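A toy version of "defined once, consistently" shows why the layer matters. The customer records are invented, the three rules mirror the three meanings of "active" described earlier, and pinning the term to the contract-based rule is an arbitrary choice made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Customer:
    id: str
    days_since_login: int
    balance: float
    contract_cancelled: bool


# Three plausible meanings of "active", as found in three different systems.
CANDIDATE_DEFINITIONS: Dict[str, Callable[[Customer], bool]] = {
    "logged_in_last_30_days": lambda c: c.days_since_login <= 30,
    "non_zero_balance": lambda c: c.balance != 0,
    "contract_not_cancelled": lambda c: not c.contract_cancelled,
}

# The semantic layer pins each business term to exactly one governed rule.
SEMANTIC_LAYER: Dict[str, str] = {"active_customer": "contract_not_cancelled"}


def count(term: str, customers: List[Customer]) -> int:
    """Count customers matching a governed term; unknown terms fail loudly."""
    if term not in SEMANTIC_LAYER:
        raise KeyError(f"'{term}' has no governed definition")
    rule = CANDIDATE_DEFINITIONS[SEMANTIC_LAYER[term]]
    return sum(1 for c in customers if rule(c))


CUSTOMERS = [
    Customer("a", 3, 0.0, False),
    Customer("b", 90, 120.0, False),
    Customer("c", 10, 55.0, True),
    Customer("d", 400, 5.0, False),
    Customer("e", 200, 0.0, False),
]
```

On this data the three candidate rules give 2, 3, and 4 active customers. An agent that queries through `count("active_customer", ...)` always gets 4, and an agent that asks for a term nobody has defined gets an error instead of a confident guess.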

The organizing principle behind all of this is treating data as a product rather than as exhaust. Exhaust is whatever a system happens to emit, owned by no one, documented nowhere. A product has an owner, a contract, documentation, versioning, and a consumer whose needs shape it. The research bears out how much this matters: organizations that treat data as a product — with curated models and shared vocabularies — are dramatically more likely to scale generative AI successfully than those that do not. When the foundation is built this way, retrieval techniques like graph-based RAG can ground every answer in verified, connected data, enforce access controls at the moment of the query, and trace each response back to the exact source it came from. That is the difference between an agent that confidently invents a court case and one that shows its work.

A knowledge graph earns its keep precisely here. Instead of flat, disconnected tables, it stores the business as a web of entities and the relationships between them: this customer belongs to this account, which is covered by this contract, governed by this policy, owned by this team. An agent reasoning over that structure can follow the connections the way a knowledgeable employee would, and every answer it produces carries its lineage — which source, which definition, which version. That is also what makes governance enforceable at the level of meaning rather than the level of the raw row, because the graph knows what a thing is and who is permitted to see it.
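As a rough illustration of "every answer carries its lineage", the sketch below stores the chain from this paragraph as edges tagged with a source system and walks them breadth-first. The entity identifiers and source names are made up, and a production graph would sit in a graph database with access checks on each hop.

```python
from collections import deque
from typing import Dict, List, NamedTuple, Optional


class Edge(NamedTuple):
    subject: str
    relation: str
    obj: str
    source: str  # where this fact came from, kept for lineage


EDGES = [
    Edge("customer:acme", "belongs_to", "account:A-17", "crm"),
    Edge("account:A-17", "covered_by", "contract:C-2024-09", "contracts_db"),
    Edge("contract:C-2024-09", "governed_by", "policy:eu-data", "policy_registry"),
    Edge("policy:eu-data", "owned_by", "team:privacy", "org_chart"),
]


def find_path(start: str, goal: str, edges: List[Edge]) -> Optional[List[Edge]]:
    """Breadth-first search returning the chain of edges from start to goal."""
    outgoing: Dict[str, List[Edge]] = {}
    for edge in edges:
        outgoing.setdefault(edge.subject, []).append(edge)
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for edge in outgoing.get(node, []):
            if edge.obj not in seen:
                seen.add(edge.obj)
                queue.append((edge.obj, path + [edge]))
    return None  # no connection, so the agent has nothing to assert
```

Asking which policy governs `customer:acme` returns three edges, each naming the system it came from, which is what lets a reviewer trace the answer back to its sources.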

The strategic point hiding in the plumbing

Strip away the vocabulary and the strategic truth is simple: AI readiness is mostly data maturity wearing a more exciting outfit. You cannot purchase your way past it with a model subscription, because the thing you are missing is not compute or intelligence — it is the slow, structural work of making your own knowledge legible. That work is unglamorous, expensive, and invisible in a board deck, which is exactly why most organizations skip it and exactly why most organizations end up in the ninety-five percent. The semantic foundation is not overhead on the way to the real prize. It is the prize. It is the part competitors cannot copy by signing the same vendor contract you did.

Gap two: governance that was never written down

[Image: Order, structure, repetition: the corporate grid. Governance is what gives an agent the same scaffolding a new employee takes for granted. Photo: Fabian Kleiser / Unsplash.]

If you ask most enterprises where their AI governance lives, the honest answer is a PDF on a shared drive — a well-intentioned document of principles that almost no one has read and that no system enforces. A PDF nobody reads is not a policy an agent can obey. It is a statement of hope. And hope does not survive contact with an autonomous system acting at machine speed across systems it was never explicitly cleared to touch.

Governance for AI, and especially for agents, has to be machine-actionable to mean anything. The "SOP" intuition is the right one, but it resolves into two concrete questions that a slide of principles never answers: what is this agent actually allowed to touch and do, and what is the operating rhythm that keeps it honest over time? Get specific about both, and governance stops being a compliance ornament and starts being the thing that lets you deploy without holding your breath.

An agent is a new employee with root access and no onboarding

It helps to think of an agent as exactly what it is becoming: a digital worker. The trouble is that we wrap human workers in decades of accumulated controls and give agents almost none of them. A new human employee receives an identity, a defined role, least-privilege access to only the systems their job requires, a manager who reviews their work, and an audit trail that records what they did. An agent, in too many deployments, gets a single shared API key with broad standing credentials, the ability to call tools far outside its actual task, and no logging worth the name. It is the most over-permissioned new hire in the building, and no one interviewed it.

The discipline that fixes this is well understood, even if it is rarely applied. Security researchers call the core idea least-agency, or least-privilege: an agent should receive the minimum autonomy required for its specific task and nothing more. A customer-support agent does not need write access to the billing database. A research agent does not need the ability to send external email. From there it cascades into concrete controls: whitelisting the specific tools an agent may use, issuing short-lived credentials instead of permanent keys, sandboxing execution, restricting where an agent can send data, and — critically — keeping a human in the loop for actions that are irreversible or sensitive. A mature deployment will refuse to let an agent move money without clearing a confidence threshold or obtaining a second approval, and will strip personally identifiable information before it ever reaches a model, restoring it only on the way back. None of that lives in a principles document. All of it lives in enforced policy. That, and not the PDF, is the real standard operating procedure: rules expressed as controls a system cannot route around.
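One of these controls, stripping personal data before the model sees it and restoring it on the way back, is easy to sketch. The version below handles only email addresses with a simple regular expression and uses a made-up `<EMAIL_n>` placeholder format; real PII detection covers far more than this.

```python
import re
from typing import Dict, Tuple

EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact(text: str) -> Tuple[str, Dict[str, str]]:
    """Replace each distinct email with a placeholder; return text and mapping."""
    mapping: Dict[str, str] = {}  # placeholder -> original value
    tokens: Dict[str, str] = {}   # original value -> placeholder

    def swap(match: re.Match) -> str:
        value = match.group(0)
        if value not in tokens:
            token = f"<EMAIL_{len(tokens) + 1}>"
            tokens[value] = token
            mapping[token] = value
        return tokens[value]

    return EMAIL.sub(swap, text), mapping


def restore(text: str, mapping: Dict[str, str]) -> str:
    """Put the original values back into the model's response."""
    for token, value in mapping.items():
        text = text.replace(token, value)
    return text
```

The mapping never leaves the caller's process, so the model only ever works with placeholders, and the same address always maps to the same placeholder so the model can still reason about who is who.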

The danger is not hypothetical, and it does not require malice. Picture an agent handed broad database credentials so it could "be helpful," then asked to tidy up some duplicate records. With no constraint on its scope and no human checkpoint, a single ambiguous instruction becomes a destructive write across production data in seconds — faster than any person could intervene, and recorded nowhere anyone thought to look. The same autonomy that makes agents useful is what makes their mistakes fast and quiet. Standing credentials, missing audit trails, and unrestricted tool access are not exotic edge cases; they are the default state of most early agent deployments, and they are exactly how a promising program turns into a board-level incident.

A policy an agent cannot read is decoration. Governance that scales is policy an agent is structurally unable to disobey.

You do not have to invent the rulebook

The encouraging part is that the scaffolding for all of this already exists, written by people who have thought about it harder than any individual team has time to. The U.S. National Institute of Standards and Technology publishes the AI Risk Management Framework, along with a dedicated profile for generative AI, and its emphasis maps almost directly onto agent controls: role-based access, continuous monitoring, adversarial testing, and lifecycle logging for traceability. The international ISO/IEC 42001 standard formalizes the idea of an AI management system, with oversight and continual improvement built in. The OWASP GenAI Security Project maintains a Top 10 for large-language-model applications and a newer Top 10 for agentic applications, cataloguing the exact failure classes teams keep rediscovering the hard way: prompt injection, tool misuse, memory leakage. And the external pressure is rising fast, from the EU AI Act to a wave of national AI laws, which means governance is shifting from a nice-to-have to a condition of doing business.

Underneath the frameworks sits a single overlooked capability that decides whether any of them are real: observability. If you cannot see what an agent did — which data it touched, which tools it called, which decision it made and why — then you cannot govern it, debug it, or defend it to a regulator, and you certainly cannot trust it with anything that matters. Audit logging and traceability are not paperwork. They are the line between an agent you can put into production and a black box you can only hope behaves. Trust, in the end, is not extended to systems because they are clever. It is extended to systems because they are accountable.
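As a rough illustration of what that traceability can look like at the code level, the sketch below wraps a tool so that every call leaves a record whether it succeeds or fails. The helper name `withAudit` and the in-memory `sink` array are assumptions made for the example; a real deployment would write to an append-only log store instead.

```javascript
// Wrap a tool function so each invocation is recorded.
// `sink` is anything with push(); an array stands in for a real log store here.
function withAudit(agentId, toolName, toolFn, sink) {
  return async function auditedTool(input) {
    const entry = {
      agentId,
      toolName,
      input,
      startedAt: new Date().toISOString(),
    };
    try {
      const output = await toolFn(input);
      sink.push({ ...entry, status: "ok", output });
      return output;
    } catch (err) {
      sink.push({ ...entry, status: "error", error: String(err) });
      throw err; // record the failure, but never swallow it
    }
  };
}

// Usage: the agent only ever receives the wrapped version of the tool.
const auditLog = [];
const searchDocs = withAudit(
  "research-agent",
  "search_docs",
  async (query) => [`result for ${query}`], // placeholder tool body
  auditLog
);
```

Because the agent is handed only the wrapped function, there is no code path that touches the tool without producing a log entry.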

The reframe that matters most here is that governance is not a brake on AI. It is the enabler. The organizations that actually reach production are, consistently, the ones that invested in governance frameworks before they scaled agent capabilities — not after a breach forced their hand. The absence of guardrails does not make you faster; it produces exactly the brittle, untrustworthy, occasionally catastrophic behavior that makes leadership pull the plug and sends a promising program back to purgatory. Guardrails are what let you move quickly without flinching.

The consultants' mirror

Return now to where we started, because the consulting firms are not just a cautionary anecdote. They are the most public, best-funded live experiment in internal AI adoption that exists, and watching them is the closest thing the rest of us have to a controlled trial. They are simultaneously the largest sellers of AI transformation and a room full of organizations trying to transform themselves — and they have been unusually candid about how hard it is. BCG, in its own cross-industry research, found that roughly three-quarters of companies struggle to achieve and scale value from their AI initiatives. The firms are not describing a problem they have solved from the outside. They are describing one they are living from the inside.

The stakes are visible in their own pyramids. Internal tools like McKinsey's assistant and BCG's slide-polishing system can already perform a large share of the research-and-formatting work that used to define a junior analyst's first years, and entry-level hiring across the industry has tightened as a result. That is what it looks like when this technology genuinely lands inside an organization — and it is a useful reminder that getting it right is not a productivity nicety. It restructures the firm. The flip side is that getting it wrong, in public, with a client's name on the document, restructures the firm's reputation just as quickly.

And so every major firm now has its own platform: McKinsey with its knowledge assistant and a fleet of internal agents numbering in the tens of thousands, BCG with its build unit and internal tools, Deloitte with its assistant and an agentic platform, PwC with an agent operating system spanning tens of thousands of deployed agents, EY with a platform giving tens of thousands of staff access to a growing roster of agents and a multi-year plan to scale into the hundreds of thousands, and KPMG with its own agentic workbench. Billions of dollars, real engineering, genuine ambition. But the tools are the visible ten percent. The invisible ninety percent — the part that determines whether any of it works — is the data and governance plumbing underneath. There is a useful rule of thumb circulating in this world that only a small fraction of AI value comes from the algorithms and the technology, with the overwhelming majority coming from people, process, and the organizational change required to make the technology stick. The firms that are winning internally are the ones that took that ratio seriously.

The five percent: what McKinsey's Lilli actually proves

Organized, machine-legible knowledge was the real moat — not the model. Every rival had the same models. Photo: Susan Q Yin / Unsplash.

If the failure rate has a counterexample worth studying, it is McKinsey's internal platform, Lilli. It is the case study everyone cites, and almost everyone draws the wrong lesson from it. The wrong lesson is that McKinsey succeeded because it had access to powerful models. That cannot be the explanation, because every competitor had access to the same models. The right lesson is far less flattering to the technology and far more useful to anyone trying to replicate the result: McKinsey succeeded because it did the boring work that everyone else was skipping.

Look at what the boring work actually was. The platform draws on more than forty knowledge sources and over a hundred thousand documents and interview transcripts — but the unlock was not aggregation, it was curation and tagging, the patient labor of making a century of accumulated knowledge consistent and machine-legible. The team built what is better described as an orchestration layer than a simple retrieval bot, designed to synthesize and contextualize rather than just fetch. They confronted the unglamorous reality that their best material was trapped in slides and fixed the ingestion so the machine could read it. Only then did the human side of adoption begin: a phased rollout, training that cured what the firm called "prompt anxiety" in roughly an hour, internal evangelists, and senior leaders modeling the behavior they wanted to see.

The results are the part people quote, and they are genuinely impressive: more than three-quarters of the firm's tens of thousands of employees now use the tool, heavy users return to it more than a dozen times a week, and the firm reports its people save close to a third of their research time. But the number to internalize is not the adoption rate. It is what produced it. The moat was never the model. The moat was a hundred years of knowledge made legible to machines, wrapped in the governance and the change management required to make people actually trust it and use it.

McKinsey's edge was not a smarter model — every rival had the same models. The edge was a century of knowledge made legible to machines, and the discipline to govern it.

The pattern repeats across the rest of the industry, even if less dramatically. The firms making real internal progress — across the spectrum of platforms and agents now deployed — are consistently the ones that invested in their data foundations and their governance before they tried to scale. And the lesson generalizes cleanly to any enterprise, in any sector, that wants off the wrong side of the divide. The winners are not the organizations with the best model. Everyone has the same models. The winners are the organizations that did the unglamorous data-and-governance work that everyone else found a reason to defer.

It is worth saying plainly what this does and does not mean for everyone else, because the lesson is easy to mislearn. You cannot buy McKinsey's result by buying McKinsey's tool, any more than you could acquire a rival's culture by licensing their software. What travels is not the platform; it is the method — the willingness to treat your own knowledge as an asset worth making machine-legible, and your own governance as an engineering problem worth solving before the agents arrive. That method is available to any organization in any industry. It is simply not for sale, and it cannot be rushed.

What the winners actually do

None of this resolves into a checklist, and anyone selling you one is selling the wrong thing. But the organizations on the right side of the divide do share a small number of strategic commitments, and they are worth stating plainly — not as steps to execute in order, but as the shape of a serious posture toward AI.

They sequence data before models

The single most counterintuitive move the winners make is to stop running hero pilots and fix the foundation first. That does not mean a multi-year data project before any value is delivered; it means choosing initial use cases precisely where the data is already clean and compatible, shipping those to generate real and defensible returns, and then using that credibility and momentum to fund the remediation of the messier domains. The failure mode is the opposite: picking use cases based on strategic ambition and executive enthusiasm, discovering the data underneath cannot support them, and producing an over-budget pilot that demonstrates the limits of the data rather than the capability of the AI. Match the ambition to the data maturity, not to the org chart's excitement.

They treat data as a product, not exhaust

The winners give their data owners, contracts, documentation, and a semantic layer that defines the business vocabulary once and reuses it everywhere. The catalog becomes the control plane of truth; the ontology and knowledge graph become the connective tissue that lets agents reason over entities and relationships rather than guess at ambiguous tables. This is the work that does not show up in a launch announcement and entirely determines whether the launch was real. It is also, not coincidentally, the part of the strategy a competitor cannot acquire by signing the same contracts you did.

They make governance machine-actionable

The winners do not confuse a principles document with a control. They express policy as something a system enforces: identity for every agent, least-agency access, tool whitelisting, audit trails, and a human in the loop for anything irreversible or sensitive. They adopt an established standard rather than inventing their own (the NIST framework, the ISO management-system standard, the OWASP failure catalogues), and they wrap it in an operating rhythm: regular reviews, red-teaming, and genuine change management for agents, treating a new agent with the seriousness they would bring to a new hire with broad system access. Governance, done this way, is not the thing that slows the program down. It is the thing that lets the program move without fear.

They build the context layer agents inherit

Rather than re-explaining the business inside every prompt, the winners push meaning and rules down into the data itself, so that any agent connecting to it inherits both. It is worth watching the emerging plumbing here — open protocols that standardize how agents connect to tools and to one another, sometimes described as an "HTTP for agents," are quickly becoming the connective standard for this world. But a word of caution that the protocol enthusiasm tends to skip: plumbing that lets agents reach your data faster does nothing good if the house behind the tap is a mess. Standard connectivity over a swamp just distributes the swamp at higher throughput. The connectivity is necessary; the clean, governed foundation is what makes it worth having.

They treat adoption as a change program, not a software rollout

The winners also understand that buying licenses is not the same as achieving adoption. The most replicable lesson from the internal success stories is almost embarrassingly human: an hour of training to dissolve the anxiety of a blank prompt, visible evangelists, and leaders who actually use the tools they are asking their people to use. If the overwhelming majority of AI value comes from people and process rather than the algorithm, then the overwhelming majority of the effort has to go there too. Technology adoption has always been a human problem wearing a technical mask, and AI has not changed that. It has only raised the stakes.

They measure value, not motion

The organizations stuck in the ninety-five percent tend to measure activity — pilots launched, seats provisioned, models evaluated — and mistake it for progress. The winners measure outcomes, and they are ruthless about it: a use case either moves a number a finance team recognizes, or it is killed quickly, before it hardens into a permanent science project. That discipline is exactly what frees the budget and the attention to pour into foundations, where the compounding returns actually live. Counting pilots is how a program feels busy while going nowhere. Counting value is how it escapes the lab.

The real divide

It is worth being precise about what the so-called GenAI Divide actually divides. It is not a line between companies with good models and companies with bad ones. Frontier models are a commodity now; the same handful are available to everyone with a credit card. The divide is between the organizations that did the foundational work and the organizations that did not — and underneath the AI costume, that is simply a gap in data maturity and governance discipline that has existed for years and that AI has suddenly made expensive to ignore.

And it compounds. The organizations on the right side of the divide get faster every quarter, because their agents inherit ever-cleaner data and ever-tighter rules, and each success funds the next. The organizations on the wrong side re-run pilots that fail for the same reason they failed last time, mistaking a foundation problem for a model problem and waiting for a model that was never going to save them. The gap between the two groups does not stay constant. It widens.

The deepest irony of the whole story is the one we began with. The cure for the failing enterprise AI program was never a smarter model. It was the boring, expensive, unglamorous discipline that the consultants themselves had to learn the hard way, in public, with a refunded invoice as tuition: organize the data so a machine can reason over it, write the rules down in a form a machine is forced to obey, and only then let the agents loose. The companies that internalize that will not merely adopt AI. They will compound on it — quietly, structurally, and largely out of view — while everyone else is still abandoning pilots and blaming the model.

The divide compounds. The foundation you lay now decides how fast you can move later. Photo: Robert Bye / Unsplash.

Sources and further reading

MIT NANDA initiative, The GenAI Divide: State of AI in Business 2025 — the source of the widely cited finding that roughly 95% of enterprise generative-AI pilots show no measurable business impact.
Gartner — predictions on proof-of-concept abandonment, AI-ready data, and agentic-project cancellation rates.
VentureBeat — background on McKinsey's internal Lilli platform and how it was built.
Deloitte, State of AI in the Enterprise — recurring survey data on enterprise AI spend, scaling, and ROI.
NIST AI Risk Management Framework and its Generative AI Profile — role-based access, monitoring, adversarial testing, and lifecycle logging.
OWASP GenAI Security Project — the Top 10 for LLM applications and the Top 10 for agentic applications, covering prompt injection, tool misuse, and related failure classes.
ISO/IEC 42001 — the international standard for an AI management system, covering oversight and continual improvement.

Building an LLM Project From Scratch in 2026

Harshdeep Singh — Tue, 09 Jun 2026 17:33:40 +0000

Here’s the uncomfortable truth about “AI projects” a few years ago: the hard part was never the model. It was the plumbing. Standing up a vector database, wiring an embeddings pipeline, fighting with streaming responses, gluing five libraries together — by the time it worked, you’d forgotten what you set out to build.

In 2026 that plumbing has largely collapsed into a weekend’s worth of work. Model prices have fallen roughly 80% year over year, free tiers are genuinely usable, your database now does vector search natively, and one SDK handles streaming and tool calling across every provider. The skill that’s actually in demand — retrieval-augmented generation with agents — is now reachable by a developer who has never touched machine learning.

So this guide does something specific. We’re going to build one real project, end to end, that you can put on your portfolio and let strangers use: an app where someone uploads their documents and chats with them — asking questions and getting answers grounded in their own files, with citations, streamed token by token. It’s the canonical 2026 LLM project, and it teaches almost everything else by osmosis.

In plain English. “RAG” means the AI doesn’t answer from memory — it looks things up first. You give it a pile of documents; when you ask a question, it finds the most relevant passages and answers using only those. That’s why it can talk about your files without ever having been trained on them.

This guide is written for three readers at once: newcomers, working software engineers, and AI engineers. The main text stays approachable; the “In plain English” notes add no-jargon explanations, and the “Under the hood” notes add depth for engineers.

The roadmap — what we’ll actually do

Eight steps. Each one produces something that works before we add the next layer.

1. Build the mental model: how RAG (and agentic RAG) really works, in one diagram.
2. Choose your models: a cost comparison of hosted and self-hosted LLMs, and which embedding model to use.
3. Set up the MERN stack: project skeleton plus a MongoDB Atlas vector index.
4. Ingest documents: upload, parse a PDF, and split it into chunks.
5. Embed & store: turn chunks into vectors and save them in MongoDB.
6. Retrieve: find the right passages with a single $vectorSearch query.
7. Make it agentic: let the model call retrieval as a tool, on its own terms.
8. Stream & deploy: render tokens live in React, then ship it to its own URL for free.

Let’s start with the one idea that makes the other seven make sense.

Step 1 · The mental model: how RAG actually works

An LLM is a brilliant improviser with no access to your private data and a tendency to confidently make things up. Retrieval-Augmented Generation (RAG) fixes both problems with one move: before the model answers, you fetch relevant facts and hand them over as context. The model then answers from evidence rather than from vibes.

There are two phases. The first happens once, ahead of time (ingestion); the second happens on every question (retrieval + generation).

INGESTION (run once, when a document is added)
  document --> split into chunks --> embed each chunk --> store vectors in MongoDB

RETRIEVAL + GENERATION (run on every question)
  question --> embed --> vector search in MongoDB --> top-k chunks
                                                         |
                             +---------------------------+
                             v
        [ question + retrieved chunks ] --> LLM --> grounded answer + citations

The magic ingredient is the embedding: a list of numbers (a vector) that captures the meaning of a piece of text. Two passages about “canceling a subscription” land near each other in this number-space even if one says “refund” and the other says “cancel my plan.” Searching by meaning instead of keywords is what makes RAG feel intelligent.

In plain English. Imagine every sentence gets pinned onto a giant map, where similar meanings sit close together. To answer your question, the app drops a pin for your question and grabs whatever text is pinned nearby. Those nearby notes become the AI’s cheat sheet.
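To see "near each other in number-space" as arithmetic, here is a small sketch of cosine similarity, the measure this build later asks MongoDB to use. The three-dimensional vectors are made up for illustration (real embeddings have hundreds or thousands of dimensions), and the function name is an assumption of this example.

```javascript
// Cosine similarity: close to 1 means "pointing the same way",
// close to 0 means "unrelated".
function cosineSimilarity(a, b) {
  if (a.length !== b.length) {
    throw new RangeError("vectors must have equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Toy "embeddings", invented for this example.
const passages = [
  { text: "How to cancel my plan", vector: [0.8, 0.2, 0.1] },
  { text: "Tomorrow's weather forecast", vector: [0.0, 0.1, 0.9] },
];
const questionVector = [0.9, 0.1, 0.0]; // stands in for "Can I get a refund?"

// Score every passage against the question and sort best-first.
const ranked = passages
  .map((p) => ({ text: p.text, score: cosineSimilarity(questionVector, p.vector) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked);
```

The cancellation passage ranks first even though it shares no keywords with "refund", which is the whole point of searching by meaning.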

What makes it “agentic” — the 2026 upgrade

Classic RAG retrieves once and hopes the first search was good enough. That breaks on real questions: “Compare the refund policy in the 2024 contract with the 2025 one” needs two different searches and a comparison. Agentic RAG hands the model the steering wheel. Retrieval becomes a tool the model can call — repeatedly — deciding what to search for, judging whether the results are sufficient, and searching again before it commits to an answer.

Under the hood. A 2025 survey (“Agentic Retrieval-Augmented Generation,” arXiv:2501.09136) frames these systems around four patterns: reflection, planning, tool use, and multi-agent collaboration. In practice you’ll implement query rewriting/decomposition, multi-hop “retrieve → reason → retrieve” loops, and self-critique (“do these passages actually answer the question?”). The cost: 3–10× the tokens and 2–5× the latency of vanilla RAG. So gate it — a trivial FAQ should never enter the loop; a cross-document question can’t be answered without it. Cap the loop at ~5 iterations so a confused agent can’t spend your budget in a runaway.
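The loop and its cap can be sketched independently of any SDK. The version below is a minimal illustration, not the implementation built in Step 7: `search`, `isSufficient`, and `rewrite` are hypothetical callbacks the caller supplies (in practice the last two would be model calls), and `agenticRetrieve` is a name invented here.

```javascript
// A bounded "retrieve -> judge -> retrieve again" loop.
async function agenticRetrieve(
  question,
  { search, isSufficient, rewrite, maxIterations = 5 }
) {
  const collected = [];
  let query = question;

  for (let step = 1; step <= maxIterations; step++) {
    const passages = await search(query);
    collected.push(...passages);

    // Self-critique: do these passages actually answer the question?
    if (await isSufficient(question, collected)) {
      return { passages: collected, steps: step, exhausted: false };
    }
    // Not yet: ask for a different, more focused query and go again.
    query = await rewrite(question, collected);
  }

  // Budget spent. Return what we have and let the caller decide what to say.
  return { passages: collected, steps: maxIterations, exhausted: true };
}
```

The `exhausted` flag matters as much as the cap: it lets the caller answer "I could not find enough information" instead of presenting a half-grounded answer as complete.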

We’ll build the simple pipeline first (so you can see every piece), then promote retrieval to an agentic tool in Step 7. That progression is the lesson.

Step 2 · Choosing your LLMs — cheapest viable first

You’ll use two kinds of model: an embedding model (turns text into vectors) and a generation model (writes the answer). They’re priced and chosen separately. Let’s start with generation, since that’s where the “which LLM?” anxiety lives.

Read this first. Every price below is a 2026 snapshot and model names change almost monthly. Treat this table as a shape, not gospel — confirm the current number on the provider’s pricing page before you commit. The strategy (route cheap, escalate rarely) outlives any specific figure.

Provider / model	Input ($/1M tokens)	Output ($/1M tokens)	Free tier?	Best for
Google Gemini Flash-Lite	~$0.10–0.25	~$0.40–1.50	Yes — generous	Learning & high volume; the default starter
Groq · Llama 3.1 8B	~$0.05	~$0.08	Yes	Blazing-fast responses; cheapest tokens
DeepSeek	~$0.14	~$0.28	No (cheap)	Cheapest frontier-class; OpenAI-compatible API
OpenAI · GPT mini-tier	~$0.15–0.75	~$0.60–4.50	Credits	Strong all-rounder; great tool calling
Anthropic · Claude Haiku	~$1.00	~$5.00	No	Cheapest Claude; reliable instruction-following
Frontier (GPT / Claude / Gemini Pro)	~$2.50–5.00	~$15–30	No	Final answer only, when quality truly matters

The pattern jumps out: on input price, the cheapest models are 50–100× cheaper than the flagships, and the gap on output tokens is wider still. For a RAG app, most of your token spend is feeding retrieved context into the model, so a cheap, capable model handling that bulk is the entire cost game. Use a frontier model only for the final synthesis, and only if you can measure that it’s actually better for your task.
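A quick back-of-the-envelope calculation shows why context dominates the bill. The prices below are the low ends of two rows in the table above, and the workload (five retrieved chunks of roughly 500 tokens, a 100-token question, a 300-token answer) is an assumption chosen for illustration, not a measurement.

```javascript
// Prices in dollars per 1M tokens (illustrative snapshot values).
const PRICES = {
  cheap: { input: 0.1, output: 0.4 },
  frontier: { input: 2.5, output: 15.0 },
};

// Cost of a single request, in dollars.
function costPerQuery(price, inputTokens, outputTokens) {
  return (inputTokens * price.input + outputTokens * price.output) / 1_000_000;
}

// Assumed workload for one RAG question.
const inputTokens = 5 * 500 + 100; // retrieved context + the question
const outputTokens = 300;          // the answer

const cheapCost = costPerQuery(PRICES.cheap, inputTokens, outputTokens);
const frontierCost = costPerQuery(PRICES.frontier, inputTokens, outputTokens);

console.log({
  cheapCost,
  frontierCost,
  ratio: frontierCost / cheapCost,
  per10kQueries: { cheap: cheapCost * 10_000, frontier: frontierCost * 10_000 },
});
```

Swap in current prices and your own token counts before drawing conclusions; the structure of the calculation is what carries over.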

Self-hosted: running models on your own machine

You can skip API bills entirely with Ollama, which runs open models locally and exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, so your code barely changes. One command pulls and runs a chat model; a second pulls a local embedding model:

# install from ollama.com, then:
ollama run llama3.1:8b          # chat model, ~6-8 GB VRAM
ollama pull nomic-embed-text    # local embedding model, free

Under the hood. Rough VRAM at Q4 quantization: 7–8B models ≈ 6–8 GB, 14B ≈ 10–12 GB, 32B ≈ 20–22 GB, 70B ≈ 43–48 GB (Apple Silicon unified memory counts fully). Break-even vs hosted APIs is roughly 500K tokens/day of sustained traffic — below that, hosted is cheaper and you skip the ops. Trade-offs: full privacy and $0/token, but weaker reasoning than frontier models and you own the uptime.
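To show what "your code barely changes" means in practice, here is a minimal sketch that talks to the local endpoint using the standard OpenAI chat-completions request shape and Node's built-in fetch (Node 18 or later). The helper names `buildChatRequest` and `chat` are invented for this example, and the network call only works with Ollama running and the model already pulled.

```javascript
const OLLAMA_BASE_URL = "http://localhost:11434/v1";

// Build an OpenAI-style chat request. Kept separate so it can be inspected
// without touching the network.
function buildChatRequest(baseUrl, model, userMessage) {
  return {
    url: `${baseUrl}/chat/completions`,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model,
        messages: [{ role: "user", content: userMessage }],
      }),
    },
  };
}

// Send the request and return the assistant's reply text.
async function chat(baseUrl, model, userMessage) {
  const { url, options } = buildChatRequest(baseUrl, model, userMessage);
  const res = await fetch(url, options);
  if (!res.ok) throw new Error(`request failed with status ${res.status}`);
  const data = await res.json();
  return data.choices[0].message.content;
}

// Example (requires a running Ollama server):
// chat(OLLAMA_BASE_URL, "llama3.1:8b", "Say hello in five words.").then(console.log);
```

Pointing the same code at a hosted OpenAI-compatible provider is a matter of changing the base URL and adding an Authorization header, which is why the local endpoint makes a convenient development target.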

The embedding model — quieter, but it matters

Embeddings are dramatically cheaper than generation, so this is an easy call. For most projects, OpenAI’s text-embedding-3-small is the sweet spot.

Model	Dimensions	Price ($/1M tokens)	Notes
OpenAI text-embedding-3-small	1536	~$0.02	Best balance; our pick for the build
Google gemini-embedding	768	Free tier / ~$0.025	Free-tier friendly
Voyage (voyage-3.5-lite)	512–1024 (reducible)	~$0.02	Now MongoDB-owned; long context
nomic-embed-text / BGE-M3 (open)	768 / 1024	Free (self-host)	Run free in Ollama; great quality

Gotcha that bites everyone. Vectors from different embedding models are not compatible. If you switch embedding models later, you must re-embed your entire corpus. Pick one and commit. (Embedding 10M chunks with text-embedding-3-small costs only ~$100, so this is about consistency, not cost.)
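The ~$100 figure is easy to reproduce once the hidden assumption is spelled out. The sketch below assumes about 500 tokens per chunk, which matches the chunk size used later in this guide, and the table's snapshot price of $0.02 per 1M tokens; both numbers are inputs you should replace with your own.

```javascript
// Estimated cost, in dollars, of embedding (or re-embedding) a corpus.
function embeddingCostUSD(chunkCount, tokensPerChunk, pricePerMillionTokens) {
  const totalTokens = chunkCount * tokensPerChunk;
  return (totalTokens / 1_000_000) * pricePerMillionTokens;
}

// 10M chunks at an assumed ~500 tokens each, priced at $0.02 per 1M tokens.
const fullCorpus = embeddingCostUSD(10_000_000, 500, 0.02);

// A more typical portfolio project: 2,000 chunks.
const smallProject = embeddingCostUSD(2_000, 500, 0.02);

console.log({ fullCorpus, smallProject });
```

At portfolio scale the re-embedding cost is a rounding error, so the real price of switching models is the time spent re-running ingestion and re-validating retrieval quality.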

Our choices for this build: embeddings via text-embedding-3-small; generation via a cheap, fast model (Gemini Flash or a GPT mini-tier model) while learning — swappable in one line thanks to the SDK we’re about to set up.

Step 3 · Setting up the MERN stack

MERN is a natural fit for RAG in 2026 for one reason that didn’t used to be true: MongoDB does vector search natively. Your embeddings live in the same documents as your data, queried with a normal aggregation pipeline. No separate vector database to run, sync, or pay for.

Here’s the shape of the app — a standard MERN split, with the LLM logic living safely on the server:

MongoDB Atlas — stores documents, chunks, embeddings, and chat history. Free M0 tier includes vector search.
Express + Node.js — the API: handles uploads, embedding, retrieval, and talking to the LLM. All API keys live here, never in the browser.
React — the chat UI, rendering streamed tokens as they arrive.
The glue: the Vercel AI SDK — one library for streaming, provider-switching, and tool calling, on both server and client.

The one piece of setup that’s new: the vector index

After creating a free cluster on Atlas, you define a vector search index on the collection that will hold your chunks. This tells MongoDB how to search the embedding field. Create it in the Atlas UI (Atlas Search → Create Index → JSON editor) or programmatically; either way, the index definition is this JSON:

{
  "fields": [
    {
      "type": "vector",
      "path": "embedding",
      "numDimensions": 1536,
      "similarity": "cosine"
    },
    { "type": "filter", "path": "userId" }
  ]
}

Under the hood. numDimensions must exactly match your embedding model’s output (1536 for text-embedding-3-small). similarity can be cosine, euclidean, or dotProduct — cosine is the safe default. The filter field on userId is what lets you scope searches per-user, so visitors only ever retrieve their own documents — essential the moment your demo is public. Atlas uses HNSW for approximate nearest-neighbor search under the hood.
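If you would rather create the index from code than from the Atlas UI, a sketch along these lines should work with a recent MongoDB Node.js driver. It assumes the driver's `createSearchIndex` and `listSearchIndexes` collection methods with a `type: "vectorSearch"` description (check your driver version's documentation), and the helper names are invented for this example. The index name `vector_index` matches the one queried in Step 6.

```javascript
// Build the index description. The definition mirrors the JSON above.
function vectorIndexDescription(numDimensions = 1536) {
  return {
    name: "vector_index",
    type: "vectorSearch",
    definition: {
      fields: [
        { type: "vector", path: "embedding", numDimensions, similarity: "cosine" },
        { type: "filter", path: "userId" },
      ],
    },
  };
}

// `collection` is a driver Collection connected to an Atlas cluster.
// Creates the index only if one with the same name does not exist yet.
async function ensureVectorIndex(collection) {
  const description = vectorIndexDescription();
  const existing = await collection.listSearchIndexes(description.name).toArray();
  if (existing.length > 0) return description.name; // nothing to do
  return collection.createSearchIndex(description);
}

// Usage (assumes `db` is a connected Db instance):
// await ensureVectorIndex(db.collection("chunks"));
```

Index builds are asynchronous on Atlas, so a freshly created index may take a short while before `$vectorSearch` queries return results.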

That’s the only “AI-specific infrastructure” in the whole project. Everything else is ordinary Express and React.

Step 4 · Ingesting documents: upload, parse, chunk

When a user uploads a file, three things happen on the server: we accept the upload, extract its text, and split that text into bite-sized chunks.

Accepting uploads is standard Express (use multer). Extracting text from a PDF is a one-liner with pdf-parse (v2 fork, TypeScript-native — see note in code):

import { PDFParse } from "pdf-parse"; // requires the v2 fork, not the default pdf-parse@1

// `buffer` is the uploaded file from multer
const parser = new PDFParse({ data: buffer });
const { text } = await parser.getText();

Why we chunk — and how big

You can’t embed an entire 50-page PDF as one vector; the meaning gets blurred into mush, and you’d feed the model far more than it needs. So we slice the text into passages. Each chunk becomes one searchable unit.

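// ~1 token = ~4 characters, so 2000 chars = ~500 tokens.
// We overlap chunks so a short sentence split across a boundary
// still appears whole in at least one chunk.
function chunkText(text, size = 2000, overlap = 200) {
  // Guard: a step of zero or less would loop forever.
  if (overlap >= size) throw new RangeError("overlap must be smaller than size");
  const chunks = [];
  for (let i = 0; i < text.length; i += size - overlap) {
    chunks.push(text.slice(i, i + size).trim());
    // Stop once a chunk reaches the end, so the tail isn't emitted twice.
    if (i + size >= text.length) break;
  }
  return chunks.filter(Boolean);
}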

In plain English. Think of chunking like cutting a long article into index cards. Too big and each card covers too many topics to be useful; too small and you lose context. A few hundred words per card, with a little overlap so sentences don’t get sliced in half, is the reliable starting point.

Under the hood. Start with recursive character splitting at ~400–512 tokens with 10–20% overlap — the pragmatic default (~85–90% retrieval recall). Semantic chunking can add ~2–3% recall but costs roughly 14× more to index, so only graduate to it when your evaluation metrics demand it. One caveat: at least one 2026 analysis found overlap added no measurable benefit in its setup while raising indexing cost — so treat the overlap figure as a starting point to validate against your own data, not a law.
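For readers curious what recursive character splitting looks like next to the fixed-size chunker above, here is a minimal sketch. It tries the coarsest separator first (paragraph breaks) and only falls back to finer ones (lines, sentences, words) for pieces that are still too long. The function name, separator list, and character limit are choices made for this example; overlap is deliberately left out to keep the idea visible, and separators that fall exactly on a chunk boundary are dropped.

```javascript
// Coarsest to finest: paragraphs, lines, sentences, words.
const SEPARATORS = ["\n\n", "\n", ". ", " "];

function recursiveSplit(text, maxChars = 2000, separators = SEPARATORS) {
  if (text.length <= maxChars) return text.trim() ? [text.trim()] : [];

  const [sep, ...finer] = separators;
  if (sep === undefined) {
    // No separators left: hard-cut as a last resort.
    const pieces = [];
    for (let i = 0; i < text.length; i += maxChars) {
      const piece = text.slice(i, i + maxChars).trim();
      if (piece) pieces.push(piece);
    }
    return pieces;
  }

  const chunks = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current ? current + sep + part : part;
    if (candidate.length <= maxChars) {
      current = candidate; // still fits: keep packing this chunk
      continue;
    }
    if (current.trim()) chunks.push(current.trim());
    if (part.length > maxChars) {
      // This piece alone is too big: split it with finer separators.
      chunks.push(...recursiveSplit(part, maxChars, finer));
      current = "";
    } else {
      current = part;
    }
  }
  if (current.trim()) chunks.push(current.trim());
  return chunks;
}
```

The practical difference from `chunkText` is that boundaries land on paragraph or sentence breaks whenever the text allows it, so fewer chunks start or end mid-thought.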

Step 5 · Embedding & storing vectors in MongoDB

Now we turn each chunk into a vector and save it. The AI SDK’s embedMany batches the whole array efficiently, then we write one MongoDB document per chunk — text and vector together, tagged with the owner and source.

import { embedMany } from "ai";
import { openai } from "@ai-sdk/openai";

const chunks = chunkText(text);                 // from Step 4

const { embeddings } = await embedMany({
  model: openai.embedding("text-embedding-3-small"),
  values: chunks,                               // array of strings
});

await db.collection("chunks").insertMany(
  chunks.map((chunk, i) => ({
    userId,                                     // who owns it
    source: filename,                           // where it came from
    text: chunk,                                // the passage itself
    embedding: embeddings[i],                   // the 1536-dim vector
    chunkIndex: i,
    createdAt: new Date(),
  }))
);

That’s ingestion done. The vector is just an array of floats stored on a normal document — no special database, no migration. Upload a 30-page PDF and you’ve got a few dozen searchable, meaning-aware chunks sitting in MongoDB.

Under the hood. embedMany auto-batches large arrays, so you can hand it hundreds of chunks without managing request limits yourself. Store rich metadata (page number, section heading, document ID) alongside each chunk now — you’ll want it later for citations, filtering, and “parent-document” retrieval. This is the step you run once per upload, not once per question.

Step 6 · Retrieval: finding the right passages

Here’s the payoff for all that setup. To answer a question, we embed the question with the same model, then run a single $vectorSearch aggregation to pull the closest chunks. This is the whole of “search by meaning,” in one query:

import { embed } from "ai";
import { openai } from "@ai-sdk/openai";

const { embedding } = await embed({
  model: openai.embedding("text-embedding-3-small"),
  value: userQuestion,
});

const passages = await db.collection("chunks").aggregate([
  {
    $vectorSearch: {
      index: "vector_index",
      path: "embedding",
      queryVector: embedding,
      numCandidates: 150,        // over-fetch, then narrow
      limit: 5,                  // keep the best 5
      filter: { userId: { $eq: currentUserId } }
    }
  },
  {
    $project: {
      _id: 0,
      text: 1,
      source: 1,
      score: { $meta: "vectorSearchScore" }
    }
  }
]).toArray();

You now have the five most relevant passages, each with a similarity score. Feed those into the model as context and you have working RAG. But before we generate, two notes that separate a toy from something good:

Under the hood. $vectorSearch must be the first stage in the pipeline. numCandidates is the approximate-search breadth: it must be ≥ limit, and a common heuristic is 10–20× your limit (the 150 used here is a more generous 30× for a limit of 5). The filter on userId uses the field we declared in the index, enforcing per-user isolation efficiently. Use the score as a relevance gate: if the top result scores below ~0.75, it’s often better to answer “I don’t have enough information” than to let the model hallucinate from weak matches.
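
The relevance gate can be sketched as a small pure function. This is an illustration under assumptions: the Passage type mirrors the $project output above, gatePassages is a hypothetical helper name, and 0.75 is only the rough starting point mentioned in the text. The right threshold depends on your embedding model and similarity function.

```typescript
// Shape of one result from the $vectorSearch + $project pipeline.
type Passage = { text: string; source: string; score: number };

// Hypothetical helper: returns the passages worth sending to the model,
// or an empty array when even the best match is too weak to trust.
function gatePassages(passages: Passage[], minTopScore = 0.75): Passage[] {
  if (passages.length === 0) return [];
  // Take the best score explicitly rather than assuming result order.
  const topScore = Math.max(...passages.map((p) => p.score));
  return topScore >= minTopScore ? passages : [];
}
```

An empty result is the signal to reply "I don't have enough information" without calling the model at all, which also saves the tokens.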

The single biggest quality upgrade you’ll make later isn’t a bigger model — it’s hybrid search + reranking (we cover it in “Where to go next”). For now, vector search alone is plenty to ship.

Step 7 · Making retrieval agentic

So far the server retrieves before calling the model — a fixed pipeline. To make it agentic, we flip the control: we describe retrieval as a tool, hand it to the model, and let the model decide when (and how often) to call it. This is the “hot” part of 2026 — and with the AI SDK it’s remarkably little code.

import { streamText, tool, embed, stepCountIs, convertToModelMessages } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

// 1) Retrieval, described as a tool the model can call
const searchDocuments = tool({
  description: "Search the user's uploaded documents for passages " +
               "relevant to a question. Call this whenever you need facts.",
  inputSchema: z.object({
    query: z.string().describe("a focused search query"),
  }),
  execute: async ({ query }) => {
    const { embedding } = await embed({
      model: openai.embedding("text-embedding-3-small"),
      value: query,
    });
    return db.collection("chunks").aggregate([
      { $vectorSearch: {
          index: "vector_index", path: "embedding",
          queryVector: embedding, numCandidates: 150, limit: 5,
          filter: { userId: { $eq: currentUserId } } } },
      { $project: { _id: 0, text: 1, source: 1,
          score: { $meta: "vectorSearchScore" } } },
    ]).toArray();
  },
});

// 2) Let the model run the loop: think -> search -> (search again) -> answer
const result = streamText({
  model: openai("gpt-4o-mini"),     // swap to any provider in one line
  system: "Answer ONLY using passages returned by searchDocuments. " +
          "Cite the source. If the passages don't contain the answer, " +
          "say you don't know - do not guess.",
  // messages comes from req.body as UI messages (with parts);
  // streamText expects model messages, so convert before passing them in
  messages: await convertToModelMessages(messages),
  tools: { searchDocuments },
  stopWhen: stepCountIs(5),         // hard cap on the agentic loop
});

return result.toUIMessageStreamResponse();

Read what that does, because it’s genuinely different from classic RAG: the model receives the question, decides on its own to call searchDocuments with a query it wrote, reads the results, and may call it again with a refined query before answering. For “compare the 2024 and 2025 refund policies,” it can naturally run two searches and synthesize. You didn’t orchestrate that — the model did.

Important 2026 change. If you learned the AI SDK before v5: maxSteps was removed from the client. Multi-step tool loops are now controlled server-side with stopWhen (e.g. stepCountIs(5)). This cap is also your cost safety rail — without it, a confused agent could loop and run up your bill.

Under the hood. The system prompt is doing heavy lifting for safety and grounding: “answer only from retrieved passages” plus “say you don’t know” is your first and cheapest defense against hallucination. The Zod inputSchema gives the model a typed contract for the tool’s arguments. toUIMessageStreamResponse() emits a standard SSE stream the React client consumes natively. Want it reusable across other AI clients (Claude Desktop, Cursor, etc.)? Expose this same retrieval as an MCP server — overkill for a single app, but the natural next step if your tools should be shared.

Step 8 · Streaming to React, then deploying for free

The backend streams tokens; the frontend renders them as they land. The AI SDK’s useChat hook handles the entire streaming lifecycle, so your component stays tiny:

"use client";
import { useChat } from "@ai-sdk/react";
import { DefaultChatTransport } from "ai";
import { useState } from "react";

export default function Chat() {
  const [input, setInput] = useState("");
  const { messages, sendMessage, status } = useChat({
    transport: new DefaultChatTransport({ api: "/api/chat" }),
  });

  return (
    <div>
      {messages.map((m) => (
        <div key={m.id}>
          <strong>{m.role}: </strong>
          {m.parts.map((p, i) =>
            p.type === "text" ? <span key={i}>{p.text}</span> : null
          )}
        </div>
      ))}

      <input value={input} onChange={(e) => setInput(e.target.value)} />
      <button
        disabled={status !== "ready"}
        onClick={() => { sendMessage({ text: input }); setInput(""); }}
      >
        {status === "streaming" ? "Thinking..." : "Ask"}
      </button>
    </div>
  );
}

The status field (ready / submitted / streaming / error) gives you loading and disabled states for free. Tokens appear live as the model writes them — the experience people now expect from any AI app.

Putting it on its own website — for free

This is the part that turns a tutorial into a portfolio piece. Three boxes, three free tiers:

Vercel — React frontend, free Hobby tier, global CDN, custom domain, auto-deploy from GitHub.
Render — Node/Express backend, free web service (sleeps when idle), where streaming and keys live.
MongoDB Atlas M0 — database + vector search, permanently free, 512 MB, limited Vector Search index capacity (verify current limits in Atlas docs).

Total cost for a low-traffic demo: $0–$5/month.

Under the hood. Run the streaming endpoint on the long-lived backend (Render/Railway), not a short serverless function that can time out mid-stream. Know the free-tier edges: Render free services spin down after ~15 min idle (a ~30–60s cold start on the next request, which is fine for a portfolio), and Atlas M0 caps storage at 512 MB and offers limited Vector Search index capacity. When you outgrow them, a dedicated Atlas tier and an always-on backend plan are the upgrade path.

Cost controls so you never get a surprise bill

Set a hard spend cap in your LLM provider dashboard. Non-negotiable for a public demo.
Default to a cheap model; cap output tokens (maxOutputTokens in the AI SDK); keep the agentic stepCountIs low.
Add per-user / per-IP rate limiting and an auth wall so bots can’t drain your quota.
Log token usage per request (the SDK’s onFinish callback) so you can see costs before they surprise you.
Keep every API key on the server. A key shipped to the browser is a key that will be stolen.
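
The per-user / per-IP rate limit in the list above can be sketched as a fixed-window counter. This is a minimal sketch under assumptions: createRateLimiter is a hypothetical helper, the state lives in process memory (so it resets on restart and is not shared across instances), and the 20-per-minute figure is an arbitrary example.

```typescript
// One counter per key (user ID or IP) for the current time window.
type WindowState = { windowStart: number; count: number };

function createRateLimiter(maxRequests: number, windowMs: number) {
  const windows = new Map<string, WindowState>();

  // Returns true if the request is allowed, false if the caller is over quota.
  // `now` is injectable so the behaviour can be tested without waiting.
  return function allow(key: string, now: number = Date.now()): boolean {
    const state = windows.get(key);
    if (!state || now - state.windowStart >= windowMs) {
      // First request, or the previous window has expired: start a new one.
      windows.set(key, { windowStart: now, count: 1 });
      return true;
    }
    if (state.count >= maxRequests) return false;
    state.count += 1;
    return true;
  };
}

// Example: at most 20 chat requests per key per minute.
const allowChat = createRateLimiter(20, 60_000);
```

In the chat route you would call allowChat with the authenticated user ID (or the request IP) before touching the model, and answer with HTTP 429 when it returns false. For more than one backend instance, the counters would need to move to a shared store.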

Common pitfalls (and the fixes)

The semantic gap. Your question and the document use different words and vector search misses the match. Fix: add hybrid search (keyword + vector).
Context dilution. You retrieve 10 chunks when only 2 are relevant, and the noise degrades the answer. Fix: rerank, then keep a tighter top-k.
Chunk-boundary amnesia. The answer is split across two chunks and neither is retrieved whole. Fix: overlap, or parent-document retrieval.
Confident nonsense. The model answers from weak matches as if certain. Fix: a similarity-score threshold plus a system prompt that permits “I don’t know.”
Reaching for the agent loop too early. Simple lookups don’t need multi-hop reasoning — they need one fast search. Fix: gate the agentic path to genuinely complex questions.

Where to go next

You’ve shipped a working agentic RAG app. Three upgrades, in priority order:

Hybrid search + reranking — the highest-ROI quality jump. Run keyword (Atlas full-text via $search) and vector search, fuse them with Reciprocal Rank Fusion (RRF), then rerank the top 20–50 candidates with a cross-encoder (Cohere, Voyage, or self-hosted BGE) and keep the best handful. Benchmarks routinely show reranking as the single biggest accuracy gain.
Better ingestion — messy PDFs with tables and multi-column layouts need a real parser (LlamaIndex’s LiteParse, Unstructured, or Docling) rather than plain text extraction.
Make it shareable via MCP — expose your retrieval as a Model Context Protocol server so other AI clients can use the same tool. Worth it once your tools outlive this one app.
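
The fusion step in the first upgrade can be sketched in a few lines. Reciprocal Rank Fusion scores each item as the sum of 1 / (k + rank) across the ranked lists it appears in. The constant k = 60 is a commonly used default, and the chunk IDs below are made up for illustration.

```typescript
// Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)),
// where rank starts at 1 for the top result of each list.
function reciprocalRankFusion(rankedLists: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

const keywordHits = ["c7", "c2", "c9"]; // hypothetical IDs from a $search query
const vectorHits = ["c2", "c4", "c7"];  // hypothetical IDs from $vectorSearch
const fused = reciprocalRankFusion([keywordHits, vectorHits]);
// fused is ["c2", "c7", "c4", "c9"]: items found by both searches rise to the top
```

Because RRF only uses ranks, it sidesteps the problem that keyword scores and vector similarity scores live on different scales. The fused list is what you would pass to the reranker.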

That’s the whole arc: from “an LLM can’t see my data” to a public, agentic, document-grounded chat app that costs next to nothing to run. The plumbing finally got out of the way, and what you build on top of it is the interesting part. Now go put something on that empty portfolio URL.

Frequently asked questions

What is agentic RAG, exactly?

Agentic RAG turns retrieval into a tool the model calls on demand. Instead of one fixed “retrieve then answer” pass, the model plans, searches, judges whether the results are sufficient, and searches again until it has enough evidence — then answers. It’s slower and costs more tokens, but it handles complex, multi-step questions that one-shot RAG can’t.

Do I need a separate vector database?

No. On the MERN stack you store embeddings inside your normal MongoDB documents and query them with the $vectorSearch aggregation stage in MongoDB Atlas. For the vast majority of projects, that removes the need for a dedicated vector database entirely.

What’s the cheapest LLM for a RAG app in 2026?

For learning, the most generous free tiers are Google Gemini Flash/Flash-Lite and Groq. For the cheapest paid frontier-class model, DeepSeek is usually lowest. Prices change monthly — confirm on the provider’s pricing page. The durable strategy is to route bulk work to a cheap model and reserve a frontier model for the final answer only.

How much does it cost to run?

A portfolio-grade demo runs at roughly $0–$5/month: MongoDB Atlas free M0, a free LLM tier, embeddings at ~$0.02 per million tokens, and free deploy tiers on Vercel and Render. Set a provider spend cap and rate limits so a public demo can’t surprise you.

LangChain, LlamaIndex, or the Vercel AI SDK?

For a MERN streaming chat app, the Vercel AI SDK plus direct MongoDB vector queries is the lighter, recommended path in 2026. Reach for LlamaIndex.TS if your main challenge is heavy document ingestion, or LangChain.js/LangGraph for complex multi-agent orchestration. For ~90% of RAG web apps, the AI SDK is the right call.

Can I run this fully offline with a local model?

Yes. Ollama runs open models locally and exposes an OpenAI-compatible endpoint, so your code barely changes. Use a local embedding model like nomic-embed-text too. It’s ideal for development and privacy-sensitive data; for a low-traffic public demo, hosted free tiers are usually simpler and cheaper.

A note on accuracy. LLM pricing, model names, and SDK APIs change fast. Every figure here is a 2026 snapshot — verify current prices on each provider’s pricing page and current method names in the Vercel AI SDK docs before shipping to production. Code samples are illustrative walkthroughs, not drop-in files. Images: “Visualising AI” by Google DeepMind on Unsplash, free under the Unsplash License.

TL;DR

You can build and ship a production-shaped, agentic RAG “chat with your documents” app entirely on MERN for about $0/month: MongoDB Atlas’ free tier (with built-in vector search), a free LLM tier (Gemini Flash or Groq), embeddings at $0.02 per million tokens, and free deploys on Vercel + Render.
The 2026 stack is leaner than you think: MongoDB Atlas Vector Search (no separate vector database), the Vercel AI SDK for streaming + tool calling, and “agentic retrieval” — where the model itself decides when and what to search — instead of the old retrieve-once-then-answer pipeline.
Pick models by job, not by brand. Route cheap, high-volume work to Gemini Flash-Lite, DeepSeek, or Groq-hosted Llama; reserve a frontier model only for the final answer when quality matters. Self-hosting with Ollama only beats hosted APIs once traffic is heavy and sustained.
By the end you’ll understand embeddings, chunking, vector retrieval, tool calling, token streaming in React, and how to put the whole thing on its own public website — with cost controls so you never get a surprise bill.

The Best Claude Setup (That Works on Any AI Tool)

Harshdeep Singh — Thu, 04 Jun 2026 23:16:05 +0000

Here is a confession that might sound odd in a guide about setting up Claude Code: the goal is not to marry Claude Code. The goal is to build a setup so portable that if something better ships next month — OpenAI's Codex, Cursor, Windsurf, whatever wins the week — you could pack up everything you have built and move in an afternoon. Your instructions, your custom tools, your reusable workflows: all of it should come with you.

That sounds like a strange thing to optimize for. Most "best setup" posts try to lock you deeper into one tool. But the AI coding world is moving too fast for that bet to be safe, and, happily, a small set of open standards now makes portability the default instead of a fantasy. If you are a working engineer, this is how you stop re-plumbing your environment every time you switch editors. And if you are a vibe coder (someone who builds mostly by describing what you want in plain English and steering the AI, a style Andrej Karpathy named in early 2025), this is how you get a professional-grade setup without months of fiddling.

The short version: keep your AI's instructions, tools, and skills in plain, open formats you own — not buried in one vendor's settings. Three standards make this work: MCP for tools, AGENTS.md for instructions, and Agent Skills for reusable know-how. Learn those, and the question "which coding agent should I use?" stops being scary.

Three standards do the heavy lifting here, and we will spend most of our time on them: MCP, AGENTS.md, and Agent Skills. Let's build up to a setup you would actually be happy to leave.

Think of your setup as interchangeable bricks, not a sculpture glued to one base. Photo by Xavi Cabrera on Unsplash.

Why portability is the whole game now

In the last 18 months, more than a dozen serious AI coding agents have shipped: Claude Code, OpenAI Codex, Cursor, Windsurf, Zed's agent, Aider, Cline, Continue, Gemini CLI, and more. Each one claims to be the fastest or the smartest. Some genuinely leapfrog the others — for a few weeks, until the next release.

Here is the trap. If you pour weeks into one tool — memorizing its config files, hand-tuning its rules, wiring up its integrations — you have quietly built a switching cost. The day a clearly better tool arrives, you do the math on re-learning everything and you stay put, not because your tool is best, but because leaving hurts. That is the real lock-in. It is not a contract; it is your own sunk effort.

The fix is to treat the agent as a replaceable part and invest in the layer underneath it. The engineer Geoffrey Huntley has made this point sharply: the agents themselves are becoming commodities, and the durable advantage is the standards layer you build around them. Put your effort there, and any agent becomes a front-end you can swap.

An analogy: remember when every phone had its own charger, and switching brands meant a drawer full of dead cables? USB-C fixed that by agreeing on one shape. You buy a charger once; it works with the next phone, and the one after. The standards below are USB-C for your AI setup. Learn them once, and your tools become things you plug into — not things you are wired into.

The three open standards that set you free

You do not need to memorize specs. You need to understand what each standard is for, because once you see the shape of the problem each one solves, the portability falls out naturally. They split cleanly: one is for tools, one is for instructions, one is for skills.

MCP — one way to plug in tools

The Model Context Protocol (MCP) is an open standard, created and open-sourced by Anthropic in November 2024, for connecting an AI to outside tools and data — your database, your GitHub, your Notion, your company's internal services.

If you have used a code editor, you have already benefited from this idea. Editors once needed custom code to support each programming language, until the Language Server Protocol let any editor talk to any language through one shared interface. MCP does the same trick for AI and tools. Instead of every AI app writing a custom integration for every service — an N-times-M explosion — you write one MCP server for a service, and every MCP-compatible host can use it. The math collapses from N times M down to N plus M.

Mechanically, a host (Claude Code, Cursor, ChatGPT, and others) runs a client that talks to your server over one of two transports: stdio for a local tool running as a subprocess, or streamable HTTP for a remote service. The server exposes tools, read-only resources, and prompt templates; the host stays in charge of what the model is actually allowed to touch. Crucially, the configuration looks almost identical across tools — a small block naming each server:

{
  "mcpServers": {
    "postgres": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-postgres", "postgresql://localhost/mydb"]
    },
    "github": {
      "url": "https://api.githubcopilot.com/mcp/",
      "headers": { "Authorization": "Bearer ${GITHUB_TOKEN}" }
    }
  }
}

Move from Claude Code to Cursor or Codex and you copy that block over, rename one key if needed, and your tools come with you. One safety note worth tattooing on your brain: an MCP server can run code and reach real systems, so only install servers you trust, keep secrets in environment variables like the ${GITHUB_TOKEN} above rather than hard-coding them, and review what each tool is permitted to do.
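
To make the "secrets in environment variables" advice concrete, here is a sketch of how a config loader might resolve a ${NAME} placeholder at startup. It is an illustration only: expandEnvPlaceholders is a hypothetical helper, and real hosts differ in the exact placeholder syntax they accept, so check your tool's documentation.

```typescript
// Hypothetical helper: replace ${NAME} placeholders with values from an
// environment map, and fail loudly when a variable is missing so a blank
// token is never sent by accident.
function expandEnvPlaceholders(
  value: string,
  env: Record<string, string | undefined>,
): string {
  return value.replace(
    /\$\{([A-Za-z_][A-Za-z0-9_]*)\}/g,
    (_match: string, name: string) => {
      const resolved = env[name];
      if (resolved === undefined) {
        throw new Error(`Missing environment variable: ${name}`);
      }
      return resolved;
    },
  );
}

// The committed config holds only the placeholder; the value comes from the
// environment at runtime ("example-token" stands in for a real secret).
const header = expandEnvPlaceholders("Bearer ${GITHUB_TOKEN}", {
  GITHUB_TOKEN: "example-token",
});
// header is "Bearer example-token"
```

The design point is that the file you commit and copy between tools never contains the secret itself, only its name.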

MCP is the wiring: many tools, one shared protocol. Photo by Alina Grubnyak on Unsplash.

AGENTS.md — a README for your AI

Every project has unspoken rules: how to install it, how to run the tests, which folders are off-limits, what "good code" means here. A human teammate learns these over weeks. An AI agent starts fresh every session and will happily reinvent your conventions unless you write them down.

That is what AGENTS.md is — a plain Markdown file at the root of your repo that tells any coding agent how to work on your project. No special syntax, no required fields. Think of it as a README written for your AI instead of for a human. It was introduced alongside OpenAI's Codex and is now supported by Codex, Cursor, Aider, Cline, Windsurf, and others — with Gemini CLI requiring a GEMINI.md symlink as described below. A good one leads with commands, because the agent refers back to them constantly:

# Project: Acme API

## Commands
- Install: `pnpm install`
- Dev server: `pnpm dev`
- Test (run before every PR): `pnpm test`
- Lint & typecheck: `pnpm lint && pnpm typecheck`

## Conventions
- TypeScript strict mode. Never use `any`.
- Do not edit files in `/generated` — they are built from schemas.
- Write a test for every new endpoint.

Here is the one wrinkle on the Claude side. Anthropic's tool uses its own file, CLAUDE.md, and as of mid-2026 Claude Code does not read AGENTS.md natively — a gap the community has been loudly requesting for months. The portable move is to keep one source of truth in AGENTS.md and point the others at it with a symlink, so you never maintain the same rules twice:

# One canonical file, linked everywhere your tools look
ln -s AGENTS.md CLAUDE.md
ln -s AGENTS.md GEMINI.md

One file, every tool, zero duplication. One caution worth knowing: do not blindly auto-generate this file and walk away. Research has consistently shown that bloated, auto-generated instruction files tend to hurt more than they help — the important rules get buried in noise and the agent learns to half-ignore them. Keep it short, hand-curated, and honest about what is genuinely non-obvious.

Agent Skills — teach once, reuse everywhere

The third standard is the one that feels like magic the first time it clicks. An Agent Skill is a folder containing a SKILL.md file — Markdown instructions for a specific task — plus any optional scripts or reference docs that task needs. Anthropic's own framing is the clearest: building a skill is like writing an onboarding guide for a new hire. You capture how to do something once, and the agent follows it forever after.

The clever part is progressive disclosure. At startup the agent only reads each skill's name and one-line description — a few dozen words — so a hundred skills cost almost nothing. Only when a task actually matches does it open the full instructions, and only then does it reach for the bundled scripts. Your context window stays clean until the moment a skill is relevant.

pdf-form-filler/
├── SKILL.md          # name + description + instructions
├── scripts/          # code the agent runs (stays out of context)
│   └── fill.py
├── references/       # long docs, loaded only when needed
└── assets/           # templates, fonts, icons
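
The startup half of progressive disclosure can be sketched as a function that reads only a skill's name and description and ignores the instruction body. This is a minimal sketch, assuming the two fields sit in a YAML-style frontmatter block at the top of SKILL.md. readSkillSummary is a hypothetical helper that handles simple single-line key: value pairs, not full YAML.

```typescript
type SkillSummary = { name: string; description: string };

// Returns the summary an agent would index at startup, or null when the
// file has no frontmatter or lacks either field.
function readSkillSummary(skillMarkdown: string): SkillSummary | null {
  const match = skillMarkdown.match(/^---\r?\n([\s\S]*?)\r?\n---/);
  if (!match) return null;

  const fields = new Map<string, string>();
  for (const line of match[1].split(/\r?\n/)) {
    const separator = line.indexOf(":");
    if (separator === -1) continue;
    fields.set(line.slice(0, separator).trim(), line.slice(separator + 1).trim());
  }

  const name = fields.get("name");
  const description = fields.get("description");
  return name && description ? { name, description } : null;
}

// Illustrative SKILL.md content for the folder shown above.
const exampleSkill = [
  "---",
  "name: pdf-form-filler",
  "description: Fill PDF forms from structured data.",
  "---",
  "# Instructions",
  "1. Run scripts/fill.py with the form and the data file.",
].join("\n");

const summary = readSkillSummary(exampleSkill);
// summary is { name: "pdf-form-filler", description: "Fill PDF forms from structured data." }
```

Everything below the frontmatter stays unread until a task matches the description, which is why a large skills folder costs so little context.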

Because a skill is just Markdown and files in a folder, it is inherently portable — you can point Codex or Gemini CLI at a skills folder, tell it to read the SKILL.md, and it simply works. Agent Skills are now supported across a growing list of tools: Claude Code, Codex, Cursor, Gemini CLI, and more. Write your "how we deploy" or "how we write migrations" skill once, and it travels.

So when do you reach for which? They are not competitors; they answer different questions. Here is the cheat sheet:

Mechanism	What it gives the agent	Reach for it when
Agent Skill	A repeatable procedure — the how	You keep re-explaining the same multi-step workflow
MCP server	A connection to an outside system — the reach	The agent needs live data or a third-party API
Subagent	A fresh worker with its own clean context	A job is large, or you want an independent reviewer
AGENTS.md	Always-on project facts and rules	Conventions every session should already know
Hook	Enforcement that always runs, no exceptions	Something must happen — formatting, blocking secrets

Does "switch tomorrow" actually hold?

It is a fair thing to be skeptical about — a portability promise is only as good as the tools honoring it. So here is an honest snapshot of where the major agents stood in mid-2026 on all three standards. It is not perfect across the board, but it is good enough that moving your setup is a copy-and-tweak job, not a rebuild.

Agent	MCP	AGENTS.md	Agent Skills
Claude Code	Yes	Via CLAUDE.md / symlink	Yes, native
OpenAI Codex	Yes	Yes (its home format)	Yes
Cursor	Yes	Yes	Yes
Windsurf	Yes	Yes	Yes
Zed	Yes	Yes	Via hosted agents
Aider	Limited	Yes	Via conversion
Cline	Yes	Yes	Yes
Gemini CLI	Yes	Yes (GEMINI.md)	Yes

The weak spots are real and worth naming — Aider's MCP support is limited, and a few tools only run skills through their cloud agents — but the spine holds. Your tools, instructions, and skills are written in formats more than one vendor understands.

Your portable setup, layer by layer

Now let's assemble it. The trick is two tiers: one that lives with each project and travels in its Git repo, and one that is personal to you and follows you across every machine and every project.

Tier 1 — the project repo (commit this)

At the root of each project, keep an AGENTS.md as the single source of truth, with CLAUDE.md and friends symlinked to it. Add an .mcp.json for the tools that project needs (databases, issue trackers), with secrets pulled from environment variables. Put team-shared skills in a folder like .agents/skills/. Commit all of it. The payoff: a teammate — or you, six months later — clones the repo and the AI is instantly productive, with the same rules and the same tools, no setup call required.

Tier 2 — your personal dotfiles

Your personal taste — how you like commit messages written, your favorite skills, your global tool configs — belongs in a dotfiles repo you carry everywhere. The reliable pattern is to keep one canonical copy of each file and symlink it into the locations each tool expects, using a tiny helper so a fresh machine is set up in seconds:

safe_symlink() {
  local src=$1 dst=$2
  [ -L "$dst" ] && rm "$dst"
  [ -e "$dst" ] && mv "$dst" "$dst.bak"
  mkdir -p "$(dirname "$dst")"
  ln -s "$src" "$dst"
}

safe_symlink "$DOTFILES/AGENTS.md" "$HOME/.claude/CLAUDE.md"
safe_symlink "$DOTFILES/AGENTS.md" "$HOME/.codex/AGENTS.md"
safe_symlink "$DOTFILES/skills"    "$HOME/.claude/skills"

Tools like GNU Stow and chezmoi exist to manage exactly this, and chezmoi is the kinder choice if you bounce between macOS, Linux, and Windows. One Windows gotcha to save you an hour: symlinks there need Developer Mode or admin rights and core.symlinks=true in Git, and mixing WSL and native-Windows symlinks does not work — pick one world and stay in it.

Advanced Claude Code moves that survive the move

Everything above keeps you portable. But within Claude Code there are sharper techniques worth knowing — and because they are built on the same standards and plain files, they travel with you too. The single mental model that explains most of them: Claude's context window fills up fast, and quality drops as it fills. Almost every good habit is really about protecting that space.

Subagents are separate Claude instances with their own clean context and their own narrow tool permissions. Hand a big exploration or an independent code review to a subagent, and only its summary comes back — your main conversation stays uncluttered. A favorite pattern: after a long stretch of work, spin up a fresh reviewer subagent that sees only the final diff and the requirements, with no memory of the messy path that got there.

Slash commands and hooks are where you encode habits. A custom command is a saved prompt you trigger by name. A hook is different and more powerful: it is a script that fires automatically at a set moment — and this is the key distinction — your instruction files only suggest behavior, while a hook guarantees it. Want code formatted every single time a file is written, no exceptions?

{
  "hooks": {
    "PostToolUse": [
      {
        "matcher": "Write|Edit",
        "hooks": [{ "type": "command", "command": "jq -r '.tool_input.file_path' | xargs pnpm prettier --write" }]
      }
    ]
  }
}

Plan mode is the antidote to an agent that charges off in the wrong direction. Toggle it and Claude reads, analyzes, and proposes a plan without touching your files. The workflow Anthropic recommends is simple and worth internalizing: explore, plan, implement, commit — in that order, with your eyes on the plan before any code is written.

Worktrees and headless mode are for scale. Git worktrees let you run two to four Claude sessions in parallel on separate branches without collisions. Headless mode (the often-overlooked claude -p flag) turns Claude into a command-line citizen you can pipe into and wire into CI:

# Pipe an error log straight into a headless review
cat error.log | claude -p "Find the root cause and propose a fix"

# CI-safe: cap turns and restrict tools
claude -p "Run the test suite and fix failures" \
  --max-turns 12 \
  --allowedTools "Bash(pnpm test:*)" "Edit"

And one underused habit: ask Claude to explain its reasoning as it works — add "and briefly explain each decision" to your prompt. You absorb the why as you read the diff, not just the what. For a new codebase, this single habit is worth the extra few lines of output.

If you are a vibe coder, start here

If the section above felt like a lot, this part is for you. Vibe coding — building by describing and steering rather than hand-writing every line — is a real and legitimate way to ship. But there is an honest catch worth saying plainly: the impressive demo is the easy 80%. The boring 20% — real authentication, a real database, payments, deployment, security — is where projects quietly fall apart. Security researchers scanning AI-assisted apps throughout 2025 found troubling rates of exposed credentials — API keys hardcoded in frontends, admin passwords committed to public repos — almost all of it preventable with a single review step.

So here is the short, high-leverage starter kit. None of it slows you down much, and all of it keeps you out of the ditch:

Treat the AI like a sharp junior engineer with tools, not a magic box. Build features in layers — get auth working, then the database, then payments — rather than asking for everything in one breath.
Give it a role that changes its priorities. "Act as a senior engineer who has been burned by sloppy payment code" produces noticeably more careful, defensive work.
Always run an adversarial reviewer at the end — a fresh subagent that sees only the diff and is asked to find what is wrong. This one habit catches most of those exposed-secret disasters.
Do not launch Claude Code from your home folder — it then has reach over your SSH keys and tokens. Work inside a dedicated project folder.
Turn on the "Learning" output style and let the tool explain itself. You will absorb the why, not just collect the what.

You do not need a 10-monitor battlestation to build well — you need good defaults. Photo by Jakub Żerdzicki on Unsplash.

Anti-patterns to skip

A few traps catch nearly everyone. Knowing them in advance saves real time:

Instruction-file bloat. A giant AGENTS.md backfires — the important rules get lost in the noise and the agent half-ignores them. For each line, ask whether removing it would cause a mistake. If not, cut it.
MCP overload. Connecting fifteen servers globally floods the agent with tool options and it starts picking wrong. Add tools per project, only where they earn their place.
No way to check the work. This is the big one. If you do not give the agent a test, a build, or a linter it can run, it stops when the code merely looks done — and you become the error-checker. Give it a check it can run, and have it show you the passing output rather than just claiming success.
Rules where guarantees belong. "Always format the code" written in an instruction file is a suggestion it may drop. If it must happen, make it a hook.
Keeping it all to yourself. If your .mcp.json and shared skills are not committed, your teammates are flying blind. Portability includes your team.

The point

The best Claude Code setup is not a clever pile of configuration locked inside one app. It is a small, portable kit — instructions in AGENTS.md, tools behind MCP, know-how in Skills — written in open formats you own and can carry anywhere. Build it that way and you are not betting on which agent wins the next release cycle. You are making that question stop mattering, because whatever you reach for tomorrow, your setup is already waiting for you there.

You can start in the next ten minutes. Write a short, honest AGENTS.md for your current project. Symlink CLAUDE.md to it. Then take the one workflow you keep re-explaining to your AI, and move it out of your head and into a SKILL.md. That is the whole foundation — and it already belongs to you.

Photography & AI: a faster, smarter 2026 workflow

Harshdeep Singh — Thu, 04 Jun 2026 05:00:35 +0000

Ask a working photographer where their week actually goes, and almost none of them will say “behind the camera.” The shoot is the fun part. The grind is everything after it: culling thousands of frames, matching edits across an entire gallery, color-grading video, chasing invoices, and answering the same five client emails for the hundredth time.

That grind is exactly what AI is good at. Industry surveys now put AI adoption among professional photographers at a little over 90 percent; it has quietly gone from novelty to default. Used well, it doesn’t replace your eye; it removes the repetitive work standing between you and the next shoot. Here’s what a modern, AI-assisted pipeline looks like, from capture to paid invoice.

The AI photo editing pipeline

It starts before you touch a single slider. AI culling tools like Aftershoot and Imagen sort a 3,000-frame wedding in minutes — flagging the sharp shots, catching closed eyes, and grouping near-duplicates so you choose from a shortlist instead of the whole card.

Then comes the part that actually matters: editing in your style. This is where 2026 tools pull ahead of presets. A preset applies the same math to every photo; feed a “bright and airy” preset a dark reception and you get mush. Imagen’s Personal AI Profile works differently — point it at a few thousand of your previously edited images and it studies how you handle exposure, white balance, and color, then applies that judgment to a fresh set.

For pixel-level rescue, Topaz Photo AI handles denoise, sharpening, and Gigapixel upscaling; Lightroom’s AI denoise and masking isolate skies and subjects in a click; and Photoshop’s Firefly Generative Fill removes a stray tourist or extends a cramped background without a clone-stamp marathon. The pattern is consistent: AI does the first 80%, and you spend your time on the 20% that carries your signature.

AI in the video pipeline

If you shoot hybrid, video is where AI buys back the most time, because video post is where the most time disappears. The headline shift is text-based editing: your footage is transcribed, and you cut the video by editing the transcript — delete a sentence, and the matching frames vanish. Rough cuts that used to eat an afternoon now take minutes.

Color is the other big win. DaVinci Resolve’s Magic Mask isolates a subject for targeted grading, and neural color matching lines up clips shot on different bodies — your A-cam, a second camera, and the drone — into one consistent look. Imagen’s video tool, launched at NAB 2026, brings the same learn-your-style grading photographers already enjoy straight to the timeline.

The finishing touches are increasingly automatic too: smart reframing turns a 16:9 edit into vertical 9:16 and portrait 4:5 cuts for social, AI leveling and noise removal clean up audio without manual EQ, and upscaling pushes older footage to 4K. The result is same-day turnarounds on work that used to take a week.

AI for the business: clients, CRM & delivery

Here’s the unglamorous truth no gear review mentions: the thing most likely to sink a photography business isn’t bad photos — it’s bad admin. Inquiries that go cold, contracts that sit unsigned, invoices that slip a month. A well-run CRM is the fix, and the payoff is real: photographers consistently report clawing back the better part of a working day every week once the back office runs itself.

Platforms built for this — HoneyBook, Dubsado, Studio Ninja, Táve, Bloom, Sprout Studio — automate the entire client journey: an inquiry triggers an instant reply, a proposal, a contract, and a payment schedule, with reminders firing on their own. The AI layer goes further. Sprout Studio drafts your emails and questionnaires, HoneyBook plugs into post-production tools so editing and client management finally talk to each other, and gallery platforms now use face recognition so guests find and buy their own photos without you lifting a finger.

Speed is the quiet advantage. The studio that answers an inquiry in five minutes books the client while the studio that takes five hours is still drafting its reply. AI simply makes those five minutes happen while you’re on a shoot.

AI handles the first 80%. Your taste is the last 20% — and that’s the only part a client is really paying for.

Tools mentioned

Aftershoot, Imagen, Topaz Photo AI, Lightroom, Photoshop, Firefly, DaVinci Resolve, CapCut, HoneyBook, Dubsado, Studio Ninja, Sprout Studio

Where the human still wins

None of this is about handing your work to a machine. AI is leverage, not authorship. It can match your edit, but it can’t decide what’s worth photographing, read a nervous couple on a wedding morning, or build the trust that turns a one-off booking into a decade of referrals. That judgment is the moat — and it’s getting more valuable, not less.

So don’t try to automate everything at once. Find your single biggest bottleneck — for most photographers it’s culling or the CRM — and hand that one to AI first. Win back those hours, then reinvest them where they compound: better shoots, sharper craft, and a client experience no software will ever replicate.

TL;DR

The shoot was never the bottleneck. AI’s real value is post-shoot — it clears the repetitive work, not your creative judgment.
Photo editing: AI culls thousands of frames and edits in your learned style (Imagen, Aftershoot), while Topaz, Lightroom, and Firefly handle pixel-level fixes. You keep the final 20%.
Video: text-based editing, neural color-matching across cameras, and auto reframe + audio cleanup turn week-long edits into same-day deliveries.
Business: a CRM plus AI automations (HoneyBook, Dubsado, Sprout Studio) save roughly a day a week — and a five-minute inquiry reply wins the booking.
Start small: automate one bottleneck (culling or your CRM) first, then reinvest the hours into better work.

Full Stack Developer Portfolio Lessons: What I Learned Building 10+ Projects

Harshdeep Singh — Tue, 02 Jun 2026 21:33:43 +0000

I applied for a role at a mid-sized SaaS company about two years into my career. Strong company, interesting problem, good pay. I sent my application, got a recruiter callback, and then nothing for two weeks. When the feedback finally came: "We went with candidates with a stronger portfolio presence."

I had 23 GitHub repositories. I had a portfolio site. I had projects. What I didn't have — and what I didn't understand for another six months — was a portfolio that told a story. I had code. Not evidence of thinking, decision-making, or the ability to ship something real.

I've since built, rebuilt, and advised on a lot of developer portfolios. I've seen what gets people calls and what gets them ghosted. This isn't a guide about which framework to use or how to pick colors. It's about what actually moves the needle — the things I wish someone had told me in year one.

Lesson 1: Two Great Projects Beat Twenty Mediocre Ones

The instinct is to fill the portfolio. More projects = more evidence of experience. This is wrong.

A hiring manager or engineering lead looking at your portfolio has about three minutes. They're going to look at your two or three most prominent projects, click one or two live demo links, and form an opinion. If they see twenty repositories and most of them are "Todo App v2," "Weather App," "Netflix Clone," "Portfolio v1 through v6" — they've already categorized you as someone who builds tutorials, not someone who builds things.

The better approach: three to five projects, each with:

A real problem it solves (not "I wanted to learn React")
A live deployment that actually works
A README that explains why you made the decisions you made
Enough complexity to have generated at least one interesting engineering problem

Projects that tend to work: tools you built because you were frustrated with an existing tool, apps solving problems you personally had, projects where you integrated with a real API or real data source, anything with a live user base (even 10 users counts).

Projects that tend not to work: tutorial clones (unless heavily modified), apps that only run locally, projects that stopped at the MVP and never got deployed, apps with the same name as thousands of other developer portfolios ("My Todo App," "My Weather App").

If you have 20 repos, that's fine. Pin your three best to your GitHub profile. Don't make people wade through everything — curate it.

Lesson 2: Case Studies Beat Code Screenshots

Here's the thing about showing a screenshot of your app: everyone can make an app look good in a screenshot. Filters, cropping, ideal state data. A screenshot shows what you built. It tells me nothing about how you think.

A case study shows how you think. And how you think is what you're being hired for.

A case study doesn't have to be a five-page document. Two or three paragraphs on each project covering:

The problem. What did you set out to solve? Be specific. Not "I wanted to learn Next.js" — that's not a problem. "Resume submissions were getting lost in email threads, so I built a tool that…" — that's a problem.
Your approach and the tradeoffs you considered. What did you think about? What did you try first? What didn't work? This is where you demonstrate that you can make technical decisions, not just execute instructions.
What you shipped. Not every feature you imagined — what you actually built and deployed.
What you'd do differently. This one is disarming in the best way. It shows self-awareness, reflection, and the ability to evaluate your own work critically. Engineers who can't critique their own code can't grow.

I've seen portfolios with two projects and a well-written case study for each that outperformed portfolios with fifteen projects and no context. The case study gives an interviewer something to ask about. It shows you've thought deeply about the work. It makes the technical interview easier because you already answered half the questions in writing.

Lesson 3: If It's Not Deployed, It Doesn't Exist

This is the blunt version. A project that runs on localhost is a project you're still working on. It is not a portfolio piece.

I've reviewed portfolios where the "live demo" link was a localhost URL. I've seen GitHub repositories where the README says "deployment in progress" with a date from 18 months ago. I've seen apps in screenshots that couldn't actually run because they depended on a local database with no seed data.

Deploying has never been easier or cheaper. There's no excuse for a portfolio project that isn't live.

Frontend: Vercel (free), Netlify (free), Cloudflare Pages (free). Zero configuration for most frameworks.
Backend / API: Railway (free tier), Render (free tier), Fly.io (free tier). These all support Node.js, Python, Go, whatever you're running.
Database: MongoDB Atlas free tier (512MB), Supabase free tier (PostgreSQL). PlanetScale (MySQL) retired its free tier in 2024, so budget for a paid plan if you pick it.
Full stack: Railway handles full-stack apps well. Render lets you deploy multiple services from one repo. Both have one-click GitHub deploys.

Total cost of a deployed side project: $0 on the platform's free subdomain. Add a custom domain for $12/year and you have a genuinely professional-looking production deployment.

There's a secondary benefit to deploying: it forces you to actually finish things. There's a long list of problems you don't know about until you deploy — environment variable management, CORS configuration, database connection pooling, static asset serving. Deploying is part of building. A portfolio project that's never been deployed has never been truly finished.

Lesson 4: The README Is Your First Interview

When a hiring manager or senior engineer clicks the GitHub link from your portfolio, the first thing they see is the README. If it says "A project I made for learning" or has no description at all, they've already lost interest.

The README is where you make the technical case for yourself before you're in the room. Here's what a good one contains:

First paragraph: What does this thing do and why does it exist? Not "this is a web app" — tell me the specific problem it solves. One or two sentences.

Tech stack and why: Not just a logo grid. A sentence about why you chose what you chose. "Used PostgreSQL instead of MongoDB because the data has strong relational structure with lots of joins." "Chose Next.js App Router over CRA because we needed SSR for SEO and a built-in API layer." These sentences prove you made intentional decisions.

Screenshots or a GIF: A 10-second screen recording of the app working is worth a thousand words. Not staged, not filtered — just the actual app.

How to run it: Clear, complete instructions. If I clone it and follow your README and it doesn't work, that's a flag. If it works first try, that's a positive signal — it means you document carefully and you care about the developer experience of your code.

Known limitations / what you'd do differently: One paragraph. Shows maturity. "If I built this again, I'd use a message queue for the email sending instead of doing it synchronously in the request lifecycle — it caused timeouts under load."

This README takes maybe 45 minutes to write. It dramatically changes how your project is perceived.

Lesson 5: Get Your Own Domain

This one is simple and often skipped. Your portfolio should live at yourname.com or yourname.dev, not yourname.github.io/portfolio or yourname.netlify.app.

A custom domain does two things: it signals that you take yourself seriously as a professional, and it's a much better URL to put on a resume, LinkedIn, or business card. "theharshdeepsingh.com" looks intentional. "harshdeep-singh-13.github.io/portfolio-2024" looks like a homework assignment.

Domains cost $10–15 per year. That is a rounding error in any budget. Buy yours today. Point it at your GitHub Pages / Vercel / Netlify deployment. It takes 30 minutes and it never needs to change, because you own it.

A note on choosing the domain: use your name. Not your "developer brand" or a clever handle. Names rank in Google. If someone searches for you, they should find your portfolio at the top. A personal domain with your name is one of the easiest SEO wins available to you.

Lesson 6: One AI Integration Changes Everything

Here's the hiring landscape in 2025 from a practical perspective: companies want developers who can work with AI, build on top of AI APIs, and integrate AI capabilities into existing products. This is new enough that not everyone has done it. Old enough that "I'm planning to learn it" isn't a compelling answer.

One project with a real AI integration moves you from the pile to the shortlist. Not because AI is a magic word — but because it demonstrates technical currency. You know what the OpenAI API looks like. You've dealt with token limits and streaming and prompt engineering. You've thought about cost and abuse prevention. These are all non-trivial.

What counts:

A feature in an existing project that uses GPT-4o, Claude, or Gemini for a specific, meaningful task (not "ask AI anything" — that's too vague to be impressive)
A RAG (retrieval-augmented generation) pipeline — document upload, embedding, search, answer generation
An agent that takes structured actions based on LLM output (web search, database queries, API calls)
A classification or extraction feature that uses an LLM where a simpler approach wouldn't have worked

What doesn't count:

"Used ChatGPT to help me write this code" (everyone does this)
A UI wrapper around ChatGPT that just passes prompts through (no engineering decision was made)
A project that uses AI for something that a regex would handle just as well

The bar isn't high. Ship one genuinely useful AI feature, document the decisions you made (model selection, prompt design, cost management), and you're ahead of the majority of developers applying for the same roles.

Weak vs. Strong Portfolio Signals

| Signal | Weak | Strong |
| --- | --- | --- |
| Project count | 20+ repos, half are tutorial clones | 3–5 curated projects, each with a clear purpose |
| Project quality | Todo apps, weather apps, Netflix/Airbnb clones | Tools solving real problems, deployed with real users |
| Live demos | No live link, "works on my machine," localhost screenshots | Deployed URLs that load in under 3 seconds |
| Documentation | No README or "this is a project I made" | Problem statement, tech choices explained, known limitations |
| Tech recency | Create React App, class components, outdated dependencies | Current stack (Next.js 15, TypeScript, modern APIs) |
| AI integration | None, or "used AI to help me code" | One genuine AI feature with documented engineering decisions |
| Domain | username.github.io or platform subdomain | yourname.com: personal, memorable, professional |
| Case studies | Screenshots in a grid with a "View Project" button | Problem → approach → tradeoffs → outcome, per project |

What I'd Do on Day 1 If I Were Starting Over

If I were a developer today with no portfolio and a job to find, here's the exact sequence I'd follow:

Day 1: Register firstnamelastname.com. It costs $12. Do it before you build anything. Having the domain makes the whole thing feel real and gives you a deadline.

Week 1: Identify one problem I genuinely have — something I do manually that should be automated, something I've searched for that doesn't exist, something at work that annoys me. Build the simplest version of the solution. Not a full product — a working tool. One feature, deployed.

Week 2: Write the case study. What was the problem? What did I consider? What did I ship? What didn't make the cut? What would I do differently? Two or three paragraphs per question. This is more valuable than the code itself.

Week 3: Add an AI integration to the project — something that actually makes the tool better, not a bolt-on. Even a single endpoint that uses an LLM for classification or text generation counts, as long as it's doing something a simpler approach couldn't.

Week 4: Point the domain at the project. Add the URL to LinkedIn's "Featured" section and your resume. Ask one person who is not a developer to try using the tool and tell you what they're confused by. Fix those things.

That's it. One good project, one good case study, one AI integration, a custom domain, a LinkedIn presence. Four weeks. That's a portfolio that gets callbacks. Everything else is refinement.

TL;DR

Curate, don't accumulate. Three deployed projects with case studies beat twenty unfinished repos. Pin your best work. Hide the rest.
Write case studies for every project. Problem → approach → tradeoffs → outcome → what you'd do differently. This is what interviews are about anyway — you're just answering in advance.
Nothing without a live URL. Free tiers on Vercel, Railway, and Render make deployment trivial. An undeployed project is an unfinished project.
The README is your first impression from GitHub. Spend 45 minutes on it. Explain the why, not just the what. Include how to run it. List the limitations. That's a professional developer.
Get the domain. yourname.com, $12, 30 minutes. It signals intent and makes your portfolio findable in Google searches by your own name.
Build one thing with a real AI integration. Not a ChatGPT wrapper — a feature that uses an LLM to solve a specific problem in your project, with documented decisions about model selection and prompt design.

How to Integrate the OpenAI API into a Production Express App

Harshdeep Singh — Tue, 02 Jun 2026 21:28:00 +0000

Last year I helped a startup integrate the OpenAI API into their product. It was a chat feature — users could ask questions about their data and get natural language answers. The integration took about a day. Three days after launch, the founder messaged me: "Hey, something's wrong. Our AWS bill just showed an unexpected charge."

It was $340. For three days. They had 60 users.

The issue wasn't a bug — it was that production API usage looks nothing like a tutorial. The tutorial shows you openai.chat.completions.create() and returns a response. The tutorial doesn't show you what happens when users send 500-token messages, when they open 15 browser tabs each maintaining their own chat context, or when one user fires requests 30 times per minute because they think it's broken.

This guide covers what the tutorials skip: rate limiting, token counting, cost guards, streaming, error handling with retries, and model selection. These aren't optional additions — they're what separates a demo from a production feature.

Why Production Is Different

Here's the gap between tutorial code and production code, stated plainly:

| Concern | Tutorial Code | Production Code |
| --- | --- | --- |
| Cost control | Not mentioned | Token counting, spending limits, model selection by task |
| Rate limiting | Not mentioned | Per-user and per-IP limits to prevent abuse |
| Error handling | try/catch that logs to console | Typed errors, retries with backoff, user-facing messages |
| Response delivery | Wait for full completion, return at once | Streaming via SSE, so the response appears as it generates |
| Context management | Each request is independent | Conversation history managed, truncated at token limit |
| Secrets management | API key hardcoded or in .env (no rotation) | Rotation strategy, usage monitoring, per-feature keys |

Let's build a production-grade Express API that addresses all of this. We'll go layer by layer.

The Architecture

┌─────────────────────────────────────────────────────────┐
│  CLIENT (Browser / Mobile)                              │
│  POST /api/chat         { messages: [...] }             │
│  POST /api/chat/stream  { messages: [...] }  (SSE)      │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│  EXPRESS MIDDLEWARE STACK                               │
│                                                         │
│  1. express-rate-limit   (15 req/min per user or IP)    │
│  2. tokenGuard()         (reject if > 4,000 tokens)     │
│  3. auth middleware      (verify user session)          │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│  ROUTE HANDLER                                          │
│                                                         │
│  Select model by task type                              │
│  Build messages array from context                      │
│  Call openai.chat.completions.create()                  │
│  Stream or return response                              │
└──────────────────────┬──────────────────────────────────┘
                       │
                       ▼
┌─────────────────────────────────────────────────────────┐
│  OPENAI API                                             │
│  Model: gpt-4o-mini (default) / gpt-4o (complex tasks)  │
└─────────────────────────────────────────────────────────┘

Project Setup

mkdir express-openai && cd express-openai
npm init -y
npm pkg set type=module   # the examples below use ES module import/export syntax
npm install express openai express-rate-limit tiktoken dotenv
npm install --save-dev nodemon

# .env
OPENAI_API_KEY=sk-proj-your-key-here
PORT=3001

Step 1: The OpenAI Client (Configured for Production)

Don't instantiate the OpenAI client inside route handlers. Create it once, configure it for production, and export it:

// src/openaiClient.js
import "dotenv/config"; // load OPENAI_API_KEY from .env before the client reads it
import OpenAI from "openai";

export const openai = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  maxRetries: 3,     // retry on transient failures (rate limits, timeouts)
  timeout: 30_000,   // 30 second timeout — don't hang forever
});

// Model selection by task complexity
export const MODELS = {
  fast: "gpt-4o-mini",   // classification, simple Q&A, summarization
  smart: "gpt-4o",        // complex reasoning, code generation, analysis
};

The maxRetries and timeout settings matter. The SDK's defaults are 2 retries and a 10-minute timeout, so a hung OpenAI request can hold your Express response open for minutes. On a serverless function, you pay for that idle time.
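
To make the retry setting concrete, here is a minimal sketch of what retry with exponential backoff does in general. It is an illustration under assumptions, not the SDK's internal implementation: withRetry, isRetryable, and the delay values are hypothetical names and numbers chosen for the example.

```javascript
// Hypothetical helper: retry an async operation with exponential backoff.
// Illustrative only; the OpenAI SDK has its own internal retry logic.
async function withRetry(operation, options = {}) {
  const {
    maxRetries = 3,
    baseDelayMs = 500,
    // Assumption: treat rate limits (429) and server errors (5xx) as transient.
    isRetryable = (err) => err?.status === 429 || err?.status >= 500,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  for (let attempt = 0; ; attempt++) {
    try {
      return await operation(attempt);
    } catch (err) {
      // Give up on non-transient errors or once the retry budget is spent.
      if (attempt >= maxRetries || !isRetryable(err)) throw err;
      // Wait 500 ms, then 1 s, then 2 s: the delay doubles after each failure.
      await sleep(baseDelayMs * 2 ** attempt);
    }
  }
}
```

With maxRetries set to 3, a call is attempted at most four times. A 400 or 401 fails immediately, because retrying a malformed or unauthenticated request cannot succeed.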

Step 2: Token Counting and Cost Guard

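The tiktoken npm package is a JavaScript port of OpenAI's tokenizer. It uses the same encodings as the API, so its counts closely track what you will be billed for (the per-message overhead below is an approximation). Use it to reject requests before they hit the API: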

// src/tokenCounter.js
import { encoding_for_model } from "tiktoken";

export function countMessageTokens(messages, model = "gpt-4o-mini") {
  const enc = encoding_for_model(model);
  try {
    let totalTokens = 3; // overall reply overhead
    for (const message of messages) {
      totalTokens += 5; // ~4 tokens of per-message framing plus a 1-token safety margin
      // Only string fields are encoded; non-string content (e.g. image parts) is skipped
      if (typeof message.role === "string") totalTokens += enc.encode(message.role).length;
      if (typeof message.content === "string") totalTokens += enc.encode(message.content).length;
    }
    return totalTokens;
  } finally {
    enc.free(); // tiktoken requires explicit cleanup, even if encoding throws
  }
}

// Express middleware — rejects requests over the token limit
export function tokenGuard(maxInputTokens = 4_000) {
  return (req, res, next) => {
    const messages = req.body?.messages;

    if (!Array.isArray(messages)) {
      return res.status(400).json({ error: "messages must be an array" });
    }

    const tokenCount = countMessageTokens(messages);

    if (tokenCount > maxInputTokens) {
      return res.status(400).json({
        error: `Message too long: ${tokenCount} tokens (limit: ${maxInputTokens}). Shorten your message or clear the conversation.`,
        tokenCount,
        limit: maxInputTokens,
      });
    }

    req.tokenCount = tokenCount; // pass downstream for logging
    next();
  };
}

A note on the limit: GPT-4o-mini's context window is 128K tokens, so 4,000 is conservative. But conservative is good here — a user who sends 30,000 tokens in one request is either doing something unusual or has a bug in their client. Reject it, log it, and let them know to clear their context.
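
Rejecting is one option. The other is to trim the conversation on the server so long chats keep working. The sketch below shows one simple policy, assuming you want to keep every system message and drop the oldest turns first. trimHistory and estimateTokens are hypothetical helpers, and the 4-characters-per-token estimate is a rough stand-in so the example runs without dependencies; in the app you would pass countMessageTokens instead.

```javascript
// Rough stand-in for a tokenizer: ~4 characters per token for English text,
// plus 4 tokens of per-message overhead. An assumption for this sketch only.
function estimateTokens(messages) {
  return messages.reduce(
    (sum, m) => sum + 4 + Math.ceil(String(m.content ?? "").length / 4),
    0
  );
}

// Hypothetical helper: drop the oldest non-system messages until the
// conversation fits the budget. System messages are always kept.
function trimHistory(messages, maxTokens, countTokens = estimateTokens) {
  const system = messages.filter((m) => m.role === "system");
  const turns = messages.filter((m) => m.role !== "system");

  // Always keep at least the latest turn, even if it alone is over budget.
  while (turns.length > 1 && countTokens([...system, ...turns]) > maxTokens) {
    turns.shift(); // discard the oldest turn first
  }
  return [...system, ...turns];
}
```

Because the newest turn is never dropped, a single oversized message still passes through untrimmed, so tokenGuard remains useful as the final check.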

Step 3: Rate Limiting

One user shouldn't be able to drain your API budget or trigger OpenAI rate limits for everyone else. Add rate limiting before the AI routes:

// src/middleware/rateLimiter.js
import rateLimit from "express-rate-limit";

export const aiRateLimiter = rateLimit({
  windowMs: 60 * 1000,  // 1-minute window
  max: 15,               // 15 requests per minute per IP
  standardHeaders: true, // return RateLimit headers
  legacyHeaders: false,
  message: {
    error: "Too many requests. Please wait a moment before trying again.",
    retryAfter: 60,
  },
  keyGenerator: (req) => {
    // Use authenticated user ID if available, otherwise fall back to IP.
    // req.user is only set if your auth middleware runs before this limiter.
    return req.user?.id || req.ip;
  },
});

// Stricter limit for expensive models
export const smartModelLimiter = rateLimit({
  windowMs: 60 * 1000,
  max: 5,
  message: { error: "Too many complex requests. Rate limited for 60 seconds." },
});
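
For intuition about what the limiter is doing, here is a minimal fixed-window counter keyed the same way as keyGenerator above. It is a conceptual sketch, not how express-rate-limit is implemented: createFixedWindowLimiter and limiterKey are hypothetical names, and the in-memory Map only works for a single process.

```javascript
// Hypothetical in-memory fixed-window limiter, keyed by user ID or IP.
// Illustrative only; express-rate-limit handles this (and more) for you.
function createFixedWindowLimiter({ windowMs, max, now = Date.now }) {
  const windows = new Map(); // key -> { start, count }

  return function check(key) {
    const t = now();
    let entry = windows.get(key);
    // Start a fresh window on first sight of a key or once the old one expires.
    if (!entry || t - entry.start >= windowMs) {
      entry = { start: t, count: 0 };
      windows.set(key, entry);
    }
    entry.count += 1;
    return {
      allowed: entry.count <= max,
      remaining: Math.max(0, max - entry.count),
      retryAfterMs: entry.start + windowMs - t,
    };
  };
}

// Same key choice as keyGenerator above: prefer the user ID, fall back to IP.
function limiterKey(req) {
  return req.user?.id ? `user:${req.user.id}` : `ip:${req.ip}`;
}
```

Two users behind the same office NAT get separate counters once they are authenticated, while anonymous traffic from that address shares one. Across several server instances the counts would need a shared store such as Redis.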

Step 4: Error Handling with Typed OpenAI Errors

The OpenAI Node SDK throws typed errors. Use them — don't just check err.message:

// src/middleware/openaiErrorHandler.js
import OpenAI from "openai";

export function handleOpenAIError(err, req, res, next) {
  if (err instanceof OpenAI.APIError) {
    console.error(`OpenAI API error: ${err.status} ${err.name}`, {
      message: err.message,
      requestId: err.headers?.["x-request-id"],
    });

    if (err.status === 429) {
      return res.status(429).json({
        error: "AI service is busy. Please try again in a moment.",
        retryAfter: Number.parseInt(err.headers?.["retry-after"] ?? "5", 10),
      });
    }

    if (err.status === 400) {
      return res.status(400).json({
        error: "Invalid request to AI service. Check your message format.",
      });
    }

    if (err.status === 401) {
      console.error("OpenAI authentication failed — check OPENAI_API_KEY");
      return res.status(503).json({ error: "AI service unavailable." });
    }
  }

  // Not an OpenAI error — pass to your generic error handler
  next(err);
}

Step 5: The Chat Endpoint (Non-Streaming)

Let's wire everything together for a standard, non-streaming response first:

// src/routes/chat.js
import express from "express";
import { openai, MODELS } from "../openaiClient.js";
import { tokenGuard } from "../tokenCounter.js";
import { aiRateLimiter } from "../middleware/rateLimiter.js";

const router = express.Router();

router.post(
  "/",
  aiRateLimiter,
  tokenGuard(4_000),
  async (req, res, next) => {
    const { messages, useSmartModel = false } = req.body;
    const model = useSmartModel ? MODELS.smart : MODELS.fast;

    try {
      const completion = await openai.chat.completions.create({
        model,
        messages,
        max_tokens: 1_000, // cap output tokens to control cost
        temperature: 0.7,
      });

      const reply = completion.choices[0].message;
      const usage = completion.usage;

      res.json({
        message: reply,
        usage: {
          inputTokens: usage.prompt_tokens,
          outputTokens: usage.completion_tokens,
          totalTokens: usage.total_tokens,
          estimatedCostUsd: estimateCost(usage, model),
        },
      });
    } catch (err) {
      next(err);
    }
  }
);

function estimateCost(usage, model) {
  // Prices in USD per million tokens (as of mid-2025; check OpenAI's pricing page)
  const pricing = {
    "gpt-4o-mini": { input: 0.15, output: 0.60 },
    "gpt-4o": { input: 2.50, output: 10.00 },
  };
  const p = pricing[model] || pricing["gpt-4o-mini"];
  const inputCost = (usage.prompt_tokens / 1_000_000) * p.input;
  const outputCost = (usage.completion_tokens / 1_000_000) * p.output;
  return Number((inputCost + outputCost).toFixed(6));
}

export default router;

Notice max_tokens: 1_000. Without this, the model can generate up to its maximum output length on every request (16,384 tokens on recent GPT-4o and GPT-4o-mini snapshots). If a user asks it to "write me a book," it will try. The max_tokens cap is your backstop.
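
max_tokens caps a single request, but the "spending limits" row in the comparison table needs a running total. Below is a minimal sketch of a per-user daily budget, assuming you feed it the value estimateCost returns. createSpendGuard and dailyLimitUsd are hypothetical names, and the in-memory Map is a placeholder for Redis or a database.

```javascript
// Hypothetical per-user daily spend tracker (in-memory, single process).
// A real deployment would keep these totals in Redis or a database.
function createSpendGuard({ dailyLimitUsd, now = () => new Date() }) {
  const spend = new Map(); // "userId:YYYY-MM-DD" -> USD spent so far

  // Bucket spend by user and UTC calendar day.
  const keyFor = (userId) => `${userId}:${now().toISOString().slice(0, 10)}`;

  return {
    // Call before the API request: is this user still within budget today?
    canSpend(userId) {
      return (spend.get(keyFor(userId)) ?? 0) < dailyLimitUsd;
    },
    // Call after the response, with the estimateCost() result.
    record(userId, costUsd) {
      const key = keyFor(userId);
      const total = (spend.get(key) ?? 0) + costUsd;
      spend.set(key, total);
      return total;
    },
  };
}
```

In the route handler, a failed canSpend check would return a 429 before the OpenAI call is made, and record would run after the completion comes back. The check happens before the cost is known, so a user can overshoot the limit by one request.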

Step 6: Streaming Responses with Server-Sent Events

Streaming makes AI features feel responsive. Instead of a blank screen for 3–8 seconds, the user sees text appear word by word. It's the difference between "this feels AI-powered" and "this is broken."

// src/routes/chat-stream.js
import express from "express";
import { openai, MODELS } from "../openaiClient.js";
import { tokenGuard } from "../tokenCounter.js";
import { aiRateLimiter } from "../middleware/rateLimiter.js";

const router = express.Router();

router.post(
  "/stream",
  aiRateLimiter,
  tokenGuard(4_000),
  async (req, res, next) => {
    const { messages } = req.body;

    // Establish SSE connection
    res.setHeader("Content-Type", "text/event-stream");
    res.setHeader("Cache-Control", "no-cache");
    res.setHeader("Connection", "keep-alive");
    res.setHeader("Access-Control-Allow-Origin", "*");
    res.flushHeaders(); // send headers immediately

    try {
      const stream = await openai.chat.completions.create({
        model: MODELS.fast,
        messages,
        max_tokens: 1_000,
        stream: true,
      });

      let totalOutputTokens = 0;

      for await (const chunk of stream) {
        const delta = chunk.choices[0]?.delta?.content ?? "";
        if (delta) {
          totalOutputTokens += 1; // approximate; tiktoken is more accurate
          res.write(`data: ${JSON.stringify({ type: "delta", content: delta })}\n\n`);
        }

        // Check for stop reason
        if (chunk.choices[0]?.finish_reason === "length") {
          res.write(`data: ${JSON.stringify({ type: "warning", message: "Response truncated at token limit" })}\n\n`);
        }
      }

      res.write(`data: ${JSON.stringify({ type: "done" })}\n\n`);
      res.end();
    } catch (err) {
      // Headers are already sent, so report the failure over SSE before closing
      res.write(`data: ${JSON.stringify({ type: "error", message: "Generation failed. Please try again." })}\n\n`);
      res.end();
      // Log here: next(err) cannot send a response once the stream has started
      console.error("Streaming error:", err.message);
    }
  }
);

export default router;
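
On the client, the browser's EventSource API only issues GET requests, so a POST endpoint like this one has to be read with fetch and parsed by hand. The sketch below shows one way to do that, assuming the frames look exactly like the "data: <json>" lines the route writes. createSseParser and readChatStream are hypothetical helpers, not part of any library.

```javascript
// Hypothetical client-side parser for the "data: <json>\n\n" frames the
// route above writes. Network chunks can split a frame anywhere, so the
// parser buffers text until it sees a complete frame.
function createSseParser(onEvent) {
  let buffer = "";

  return function push(chunkText) {
    buffer += chunkText;
    let boundary;
    while ((boundary = buffer.indexOf("\n\n")) !== -1) {
      const frame = buffer.slice(0, boundary);
      buffer = buffer.slice(boundary + 2);
      for (const line of frame.split("\n")) {
        if (line.startsWith("data: ")) {
          onEvent(JSON.parse(line.slice(6)));
        }
      }
    }
  };
}

// Reads the POST response body incrementally and feeds it to the parser.
async function readChatStream(url, messages, onEvent) {
  const response = await fetch(url, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ messages }),
  });
  // Rate-limit and token-guard rejections arrive as plain JSON, not SSE.
  if (!response.ok) throw new Error(`Request failed with status ${response.status}`);

  const reader = response.body.getReader();
  const decoder = new TextDecoder();
  const push = createSseParser(onEvent);

  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    // stream: true keeps multi-byte characters intact across chunk boundaries.
    push(decoder.decode(value, { stream: true }));
  }
}
```

A caller would append event.content to the visible message for "delta" events and stop on "done" or "error".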

Streaming vs. Non-Streaming — When to Use Which

| Factor | Non-Streaming | Streaming (SSE) |
| --- | --- | --- |
| User experience | Blank screen until done (3–8s) | Text appears word by word, feels instant |
| Complexity | Standard REST response | SSE connection, chunked parsing on frontend |
| Usage logging | Easy: completion.usage has exact token counts | Harder: token counts arrive only in the final chunk, and only when you request them with stream_options: { include_usage: true } |
| Caching | Can cache the full response | Can't cache a stream |
| Best for | API-to-API calls, short responses, classification tasks | User-facing chat, long completions, code generation |
| Serverless functions | Works everywhere | Needs long-running connection; use Vercel Edge Functions or a real server |

Testing Your OpenAI Integration

Mocking the OpenAI API in tests is a trap. The mock will pass but the real integration will fail in ways you didn't anticipate — different error formats, unexpected token usage, streaming chunk structure variations.

Instead:

Unit test everything except the API call. Test your token counting, your error handler, your response formatter — all without touching OpenAI. These functions should be pure and deterministic.
Use a cheap model for integration tests. gpt-4o-mini is $0.15 per million input tokens. Your integration test suite probably costs fractions of a cent to run. Run it.
Record and replay for expensive tests. Libraries like nock or VCR-style recording let you record real API responses and replay them in future test runs without hitting the API.

// Example: testing the token guard middleware in isolation
import { tokenGuard } from "../src/tokenCounter.js";
import { createMockMiddlewareContext } from "./helpers.js";

test("tokenGuard rejects messages over the limit", async () => {
  const guard = tokenGuard(10); // tiny limit for test
  const { req, res, next } = createMockMiddlewareContext({
    body: {
      messages: [{ role: "user", content: "a".repeat(500) }],
    },
  });

  await guard(req, res, next);

  expect(res.statusCode).toBe(400);
  expect(res.body.error).toContain("too long");
  expect(next).not.toHaveBeenCalled();
});
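The test above imports createMockMiddlewareContext from a helpers file that isn't shown. Here's one way such a helper could look, as a hand-rolled sketch: the `MockResponse`, `NextSpy`, and `MockContext` shapes are assumptions on my part, not the original helper. In a Jest suite you'd pass `jest.fn()` as `next` instead, so that `expect(next).not.toHaveBeenCalled()` works.

```typescript
// Hypothetical stand-in for ./helpers.js; all names and shapes are assumptions.
interface MockResponse {
  statusCode: number;
  body: unknown;
  status(code: number): MockResponse;
  json(payload: unknown): MockResponse;
}

type NextSpy = (() => void) & { calls: number };

interface MockContext<B> {
  req: { body: B };
  res: MockResponse;
  next: NextSpy;
}

function createMockMiddlewareContext<B>(init: { body: B }): MockContext<B> {
  const res: MockResponse = {
    statusCode: 200,
    body: undefined,
    // Chainable like Express: res.status(400).json({...})
    status(code) {
      this.statusCode = code;
      return this;
    },
    json(payload) {
      this.body = payload;
      return this;
    },
  };

  // Minimal spy: counts how many times the middleware called next().
  const next: NextSpy = Object.assign(
    () => {
      next.calls += 1;
    },
    { calls: 0 }
  );

  return { req: { body: init.body }, res, next };
}

// Simulate what a rejecting middleware would do with the context.
const ctx = createMockMiddlewareContext({ body: { messages: ["hello"] } });
ctx.res.status(400).json({ error: "Message too long" });
```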

TL;DR

Initialize the OpenAI client once with maxRetries and timeout set. Don't instantiate it in route handlers or you'll get a new client per request with no retry or timeout configuration.
Count tokens before you call the API. Use tiktoken to measure input size and reject oversized requests before they cost you money. Set a max_tokens cap on output for the same reason.
Rate limit by user ID, not just IP. Authenticated users with the same IP (corporate NAT, mobile networks) would all share a single IP limit — use their user ID as the rate limit key.
Use typed error handling — instanceof OpenAI.APIError gives you the status code, request ID, and message. Turn 429s into user-friendly retry prompts, not 500 errors.
Stream for user-facing features, skip it for internal calls. SSE streaming transforms the UX for chat interfaces. For batch processing or API-to-API calls, non-streaming is simpler to implement and log.
Test everything except the API call. Token counting, error handling, and response formatting are all pure functions you can test cheaply. For integration tests, use gpt-4o-mini — it's cheap enough to run in CI.

Deploying a Next.js App to AWS with CI/CD Pipelines (Step-by-Step)

Harshdeep Singh — Tue, 02 Jun 2026 21:27:52 +0000

The first time I deployed a Next.js app to production, it took me three days. Not because the app was complicated — it was a straightforward portfolio site. It took three days because I had no idea what I was doing with AWS, I'd never written a GitHub Actions workflow, and every tutorial I found either skipped the hard parts or assumed I already knew them.

By the time I was done, I had a deployment pipeline I was genuinely proud of: push to main, GitHub Actions runs the build, tests pass, the app deploys to an EC2 instance behind CloudFront. Zero manual steps. Zero downtime deploys. Total cost: about $5/month.

This guide is the one I wish had existed. We're going to deploy a Next.js app to AWS from scratch — EC2 for compute, CloudFront for CDN, GitHub Actions for CI/CD — with every step explained so you understand what you're building, not just copying commands.

Why AWS Instead of Vercel?

This is a fair question. Vercel is genuinely excellent for Next.js, and for most projects it's the right call. You push, it deploys. Done.

AWS makes sense when:

You need to control the infrastructure (compliance, data residency, custom VPC configuration)
You're running other services (databases, queues, lambdas) in AWS and want everything in the same network
You want to learn infrastructure skills that transfer to enterprise environments
Your app has specific performance requirements that benefit from custom CloudFront configuration
You're a freelancer or consultant who wants to bill separately for infrastructure

If none of those apply to you, use Vercel. This guide is for when they do.

The Architecture

Here's what we're building:

┌────────────────────────────────────────────────────────┐
│  GITHUB ACTIONS CI/CD                                  │
│                                                        │
│  Push to main → Build → Test → Deploy to EC2           │
└──────────────────────┬─────────────────────────────────┘
                       │ SSH deploy
                       ▼
┌────────────────────────────────────────────────────────┐
│  AWS EC2 INSTANCE                                      │
│                                                        │
│  Ubuntu 22.04 LTS                                      │
│  Node.js 20 + PM2 (process manager)                    │
│  Next.js app running on port 3000                      │
│  Nginx reverse proxy (port 80/443 → 3000)              │
└──────────────────────┬─────────────────────────────────┘
                       │ Origin
                       ▼
┌────────────────────────────────────────────────────────┐
│  CLOUDFRONT CDN                                        │
│                                                        │
│  Static assets cached at edge (/_next/static/*)        │
│  TTL: 1 year for static, 0 for HTML                    │
│  SSL termination via ACM certificate                   │
│  Custom domain: yourapp.com                            │
└────────────────────────────────────────────────────────┘

This isn't the only way to run Next.js on AWS. You could use Elastic Beanstalk, App Runner, ECS, or deploy static exports to S3 + CloudFront. The EC2 + CloudFront approach gives you the most control and transfers the most skills to enterprise environments.

Prerequisites

An AWS account (free tier works for learning; a t3.micro is enough for small apps)
A domain name (optional but recommended — we'll set up SSL)
A GitHub repository with your Next.js app
Basic familiarity with the AWS console

The total setup takes about 90 minutes the first time. After that, every deployment is automatic.

Step 1: Set Up the EC2 Instance

In the AWS Console, navigate to EC2 and launch a new instance. The settings that matter:

AMI: Ubuntu Server 22.04 LTS (free tier eligible)
Instance type: t3.micro (1 vCPU, 1GB RAM) for small apps; t3.small for medium traffic
Key pair: Create a new one, download it — you'll need this for SSH and GitHub Actions
Security group: Allow inbound traffic on ports 22 (SSH), 80 (HTTP), and 443 (HTTPS). Add your IP as the only source for port 22 (don't expose SSH to 0.0.0.0/0).

Once the instance is running, SSH in and set up the environment:

# Connect to your instance
ssh -i your-key.pem ubuntu@YOUR_EC2_PUBLIC_IP

# Update system packages
sudo apt-get update && sudo apt-get upgrade -y

# Install Node.js 20 via NodeSource
curl -fsSL https://deb.nodesource.com/setup_20.x | sudo -E bash -
sudo apt-get install -y nodejs

# Install PM2 globally (process manager for Node.js)
sudo npm install -g pm2

# Install Nginx
sudo apt-get install -y nginx

# Verify everything installed correctly
node --version   # v20.x.x
npm --version    # 10.x.x
pm2 --version    # 5.x.x
nginx -v         # nginx/1.18.x (the version Ubuntu 22.04 packages)

Step 2: Configure Nginx as a Reverse Proxy

Nginx will listen on port 80 and forward requests to your Next.js app on port 3000. This is the standard setup for Node.js apps on Linux servers.

sudo nano /etc/nginx/sites-available/nextjs-app

Paste this configuration:

server {
    listen 80;
    server_name YOUR_DOMAIN.com www.YOUR_DOMAIN.com;

    # Proxy requests to Next.js
    location / {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
    }

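    # Long-lived cache headers for hashed Next.js static assets.
    # Nginx itself is not caching here (that would need a proxy_cache zone);
    # the header tells browsers and CloudFront to keep these files.
    location /_next/static/ {
        proxy_pass http://localhost:3000;
        add_header Cache-Control "public, max-age=31536000, immutable";
    }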
}

# Enable the site and test the config
sudo ln -s /etc/nginx/sites-available/nextjs-app /etc/nginx/sites-enabled/
sudo nginx -t            # should say "test is successful"
sudo systemctl restart nginx

Step 3: The GitHub Actions Workflow

This is where the CI/CD magic happens. The workflow does four things: checks out code, runs your build, SSHs into the server, and restarts the app. Create this file in your repository:

mkdir -p .github/workflows

Create .github/workflows/deploy.yml:

name: Deploy to AWS EC2

on:
  push:
    branches: [main]
  workflow_dispatch:    # also allow manual triggers

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Node.js
        uses: actions/setup-node@v4
        with:
          node-version: "20"
          cache: "npm"

      - name: Install dependencies
        run: npm ci

      - name: Run linter
        run: npm run lint

      - name: Build Next.js app
        run: npm run build
        env:
          # Pass any build-time env vars here
          NEXT_PUBLIC_GA_ID: ${{ secrets.NEXT_PUBLIC_GA_ID }}

      - name: Deploy to EC2
        uses: appleboy/ssh-action@v1.0.3
        with:
          host: ${{ secrets.EC2_HOST }}
          username: ubuntu
          key: ${{ secrets.EC2_PRIVATE_KEY }}
          script: |
            cd /var/www/nextjs-app

            # Pull latest code
            git pull origin main

            # Install all dependencies: the build below needs devDependencies
            # (TypeScript, CSS tooling), so don't pass --omit=dev here
            npm ci

            # Build the app on the server
            npm run build

            # Reload with PM2 (zero-downtime in cluster mode; in fork mode this is a quick restart)
            pm2 reload nextjs-app --update-env

            echo "Deploy complete at $(date)"

Step 4: Set Up GitHub Secrets

In your GitHub repository, go to Settings → Secrets and variables → Actions, and add these three secrets:

| Secret Name | Value |
| --- | --- |
| EC2_HOST | Your EC2 instance's public IP address or domain |
| EC2_PRIVATE_KEY | The full contents of your .pem file (including the BEGIN/END lines) |
| NEXT_PUBLIC_GA_ID | Your Google Analytics measurement ID (or any other public env vars) |

For the private key, open the .pem file in a text editor, copy everything including -----BEGIN RSA PRIVATE KEY----- and -----END RSA PRIVATE KEY-----, and paste it as the secret value.

Step 5: First Deploy and PM2 Setup

Before the GitHub Action can work, you need to get the app running on the server for the first time:

# Clone your repo to the server
sudo mkdir -p /var/www/nextjs-app
sudo chown ubuntu:ubuntu /var/www/nextjs-app
cd /var/www/nextjs-app
git clone https://github.com/YOUR_USERNAME/YOUR_REPO.git .

# Create .env file on the server
nano .env
# Add your production environment variables here

# Install dependencies and build
npm ci
npm run build

# Start with PM2
pm2 start npm --name "nextjs-app" -- start
pm2 save                  # persist across server restarts
pm2 startup               # generate startup script
# PM2 will output a command to run — run it

# Verify the app is running
pm2 status
curl http://localhost:3000  # should return HTML

Step 6: CloudFront CDN (Optional but Recommended)

CloudFront puts your app behind a global CDN, which means static assets load from an edge location near your users instead of your EC2 server. For most apps, this makes a meaningful difference in load times outside your server's region.

In the AWS Console, go to CloudFront and create a new distribution:

Origin domain: Your EC2 instance's public DNS name or a domain that points at it (CloudFront won't accept a bare IP address, and definitely not localhost)
Origin protocol policy: HTTP only (Nginx handles the connection to EC2)
Viewer protocol policy: Redirect HTTP to HTTPS
Cache policy for /_next/static/*: CachingOptimized — these files are content-addressed, so they can be cached for years
Cache policy for /* (HTML pages): CachingDisabled — Next.js handles its own cache headers; CloudFront should pass them through
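If the two cache behaviors feel abstract, it can help to model them. The sketch below is not CloudFront configuration, just a small TypeScript model of the rule: behaviors are checked in order, the first matching path pattern wins, and the catch-all default comes last. The `Behavior` type and `policyFor` function are illustrative names I made up, and the matcher only handles the trailing-* patterns used in this guide.

```typescript
type CachePolicy = "CachingOptimized" | "CachingDisabled";

interface Behavior {
  pathPattern: string;
  cachePolicy: CachePolicy;
}

// Ordered like the distribution: specific pattern first, default last.
const behaviors: Behavior[] = [
  { pathPattern: "/_next/static/*", cachePolicy: "CachingOptimized" },
  { pathPattern: "*", cachePolicy: "CachingDisabled" },
];

// Simplified matcher: supports only exact paths and a trailing "*".
function matches(pattern: string, path: string): boolean {
  return pattern.endsWith("*")
    ? path.startsWith(pattern.slice(0, -1))
    : path === pattern;
}

function policyFor(path: string): CachePolicy {
  const hit = behaviors.find((b) => matches(b.pathPattern, path));
  return hit ? hit.cachePolicy : "CachingDisabled";
}

const staticPolicy = policyFor("/_next/static/chunks/main-abc123.js");
const pagePolicy = policyFor("/blog/hello-world");
```

Writing the rule down like this makes it easy to sanity-check which URLs you expect to be served from the edge before you touch the console.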

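If you have a domain, attach it to the CloudFront distribution and request a free ACM (AWS Certificate Manager) certificate for SSL. Request it in the us-east-1 (N. Virginia) region, because CloudFront only uses ACM certificates from there. DNS validation takes about 15 minutes.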


Common Pitfalls

1. Building on the server vs. building in CI

The workflow above builds in the GitHub Action AND on the server. That's redundant — you only need to do it in one place. For small apps, building on the server is fine (simpler). For larger teams, build in CI, upload the artifact, and skip the build step on the server. The tradeoff: artifacts can be large (100MB+), so you need S3 or similar to store them.

2. Forgetting to set NODE_ENV=production

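When you run npm start (which runs next start), Next.js runs in production mode. PM2 doesn't set NODE_ENV for you, though: the process gets whatever environment PM2 was launched with, and any other code that reads NODE_ENV sees that value. Be explicit by setting the variable when you start the process (or in a PM2 ecosystem file):

NODE_ENV=production pm2 start npm --name "nextjs-app" -- start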

3. Not configuring PM2 to restart on crash

By default PM2 restarts crashed processes, but you want to cap restarts to prevent crash loops. Add --max-restarts 10 and --min-uptime 5000 to your pm2 start command. With those flags, a process that exits within 5 seconds of starting counts as an unstable restart, and after 10 of those in a row PM2 stops trying. That is usually enough to catch truly broken deployments.

4. SSH key permissions

The most common SSH error you'll hit is UNPROTECTED PRIVATE KEY FILE. GitHub Actions handles this correctly when you use appleboy/ssh-action, but if you're doing raw SSH commands, your .pem file needs chmod 400 your-key.pem — readable only by the owner, nothing else.

EC2 vs. Vercel vs. AWS Amplify — Which Should You Choose?

| Factor | EC2 + CloudFront | Vercel | AWS Amplify |
| --- | --- | --- | --- |
| Setup time | 90 min (first time) | 5 min | 20 min |
| Next.js feature support | Full (you control the runtime) | Full (built for Next.js) | Most features, some lag |
| Cost at low traffic | ~$5/month (t3.micro) | Free tier, then $20+/month | Pay per build + hosting |
| Cost at high traffic | Predictable (fixed instance) | Can get expensive fast | Moderate |
| Infrastructure control | Full: you own everything | None: Vercel manages it | Partial |
| Learning value | High: enterprise-transferable | Low (it just works) | Medium |
| Best for | Learning, compliance, cost control | Speed, simplicity, teams | Existing AWS customers |

TL;DR

The stack: EC2 (compute) + Nginx (reverse proxy) + PM2 (process manager) + CloudFront (CDN) + GitHub Actions (CI/CD). Each layer has one job.
GitHub Actions workflow: trigger on push to main → install → lint → build → SSH into EC2 → git pull → rebuild → pm2 reload. About 55 lines of YAML.
Store secrets properly: EC2 host, private key, and env vars go in GitHub repository secrets — never hardcoded in workflow files.
PM2 is essential for production Node.js — it keeps the process alive, restarts on crash, and enables zero-downtime reloads. Run pm2 startup to make it persist across server reboots.
CloudFront is optional but worth it — static assets cached at the edge make a real difference for users outside your server's region, and the free ACM SSL certificate saves you the hassle of Certbot configuration.

React + TypeScript Best Practices in 2025: What Actually Matters

Harshdeep Singh — Tue, 02 Jun 2026 21:21:57 +0000

You open a new React project, add TypeScript, and immediately hit Stack Overflow for how to type your first prop. The first answer says use interface. The second says type. The third is a six-paragraph thread about why one is semantically superior to the other. You close the tab and just write any to get on with your life.

Sound familiar? TypeScript in React has a reputation problem. Not because it's hard — it's genuinely great once it clicks — but because the community has generated a staggering volume of contradictory, context-free advice. Every dev tool tutorial starts with TypeScript. Every linting config bans any. Every PR reviewer has a hot take on generics.

This guide cuts through that. I'm not going to cover every TypeScript feature or every React pattern. What I'm going to do is share the specific conventions I use in production React + TypeScript apps in 2025 — the things that have made codebases genuinely easier to work in, not just theoretically safer.

What this guide is NOT

Before we get into it, let me set expectations clearly:

Not a TypeScript basics tutorial. I'm assuming you know what a type is, what an interface is, and that string !== String.
Not exhaustive. TypeScript has dozens of utility types, conditional types, template literal types, and more. I'm not covering all of them — just the ones I reach for constantly.
Not framework-neutral. This is specifically about TypeScript in React apps. Some of these patterns won't apply to a Node.js CLI or a library.
Not about configuration. Strict mode settings, tsconfig targets, module resolution — another article for another day.

What this guide IS is opinionated. I'm going to tell you what I think the right call is in most situations, and why. You'll disagree with some of it. That's fine.

Typing Props the Right Way

Let's start here because it's where every React + TypeScript journey begins, and where a lot of the confusion lives.

Interface vs. Type Alias

Here's my rule: use interface for component props, type for everything else.

Why? Interfaces have declaration merging, which can occasionally bite you in unexpected ways, but they also produce cleaner error messages and feel more natural for describing object shapes. They're also what the React community defaults to, so your code will look familiar to anyone joining your team.

// Good — interface for component props
interface ButtonProps {
  label: string;
  onClick: () => void;
  variant?: "primary" | "secondary" | "ghost";
  disabled?: boolean;
}

// Good — type alias for unions and computed shapes
type ButtonVariant = "primary" | "secondary" | "ghost";
type Theme = "light" | "dark";

You'll see guides that say "always use type" or "always use interface." Honestly? Consistency matters more than which one you pick. Pick a rule and stick to it across your codebase.

Required vs. Optional Props

Default to required. Make something optional only when it genuinely has a sensible default or when it's truly not needed in many use cases.

This is the inverse of what a lot of developers do. They add ? to everything to make TypeScript stop complaining, and then their components have fifteen optional props where most of them are actually always passed. That erases the value of having types at all.

// Bad — over-optionalized
interface CardProps {
  title?: string;
  description?: string;
  imageUrl?: string;
  href?: string;
}

// Good — be explicit about what's truly optional
interface CardProps {
  title: string;
  description: string;
  imageUrl?: string; // genuinely optional — card can work without an image
  href?: string;    // optional — sometimes cards aren't clickable
}

Extending HTML Element Props

This is one of the most useful patterns in React + TypeScript, and it's underused. When you're building a component that wraps an HTML element, extend that element's props so your component accepts all the native attributes automatically.

// Without this pattern — you have to manually add every HTML attribute
interface ButtonProps {
  label: string;
  onClick: () => void;
  // what about type="submit"? aria-label? data-testid? className?
  // you'll spend forever adding these one by one
}

// With this pattern — extend React's built-in types
interface ButtonProps extends React.ButtonHTMLAttributes<HTMLButtonElement> {
  label: string;
  variant?: "primary" | "secondary";
  // ...and you automatically get onClick, type, aria-*, data-*, className, etc.
}

const Button = ({ label, variant = "primary", ...rest }: ButtonProps) => {
  return (
    <button data-variant={variant} {...rest}>
      {label}
    </button>
  );
};

The ...rest spread pattern combined with extended HTML props is one of those things that once you start using, you can't go back. Your components become instantly more composable and you stop maintaining a manual list of passthrough props.

Custom Hooks with TypeScript

Custom hooks are where TypeScript really earns its keep, because hooks often manage complex state and return multiple values. If your hook's return type is just inferred as any[], you've lost all the benefit.

Typing Return Values Explicitly

Always define the return type of custom hooks explicitly. Don't rely on inference here — it breaks down the moment your hook has multiple return paths or conditional logic.

// Bad — inferred return type is unreliable
function useUser(id: string) {
  const [user, setUser] = useState(null); // typed as null
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState(null);

  // ...fetch logic

  return { user, loading, error }; // TypeScript infers this poorly
}

// Good — define the return interface explicitly
interface User {
  id: string;
  name: string;
  email: string;
}

interface UseUserResult {
  user: User | null;
  loading: boolean;
  error: string | null;
}

function useUser(id: string): UseUserResult {
  const [user, setUser] = useState<User | null>(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState<string | null>(null);

  // ...fetch logic

  return { user, loading, error };
}

Now when you destructure this hook in a component, every field is typed correctly, and you get autocomplete without having to remember what the hook returns.

Generics for Reusable Hooks

Okay so here's where hooks get really powerful. If you're building a reusable data-fetching hook, generics let you make it work with any shape of data without losing type safety.

import { useEffect, useState } from "react";

interface FetchResult<T> {
  data: T | null;
  loading: boolean;
  error: string | null;
}

function useFetch<T>(url: string): FetchResult<T> {
  const [data, setData] = useState<T | null>(null);
  const [loading, setLoading] = useState(true);
  const [error, setError] = useState<string | null>(null);

  useEffect(() => {
    let cancelled = false;

    fetch(url)
      .then((res) => {
        if (!res.ok) throw new Error(`HTTP ${res.status}`);
        return res.json() as Promise<T>;
      })
      .then((d) => {
        if (!cancelled) {
          setData(d);
          setLoading(false);
        }
      })
      .catch((err: Error) => {
        if (!cancelled) {
          setError(err.message);
          setLoading(false);
        }
      });

    return () => { cancelled = true; };
  }, [url]);

  return { data, loading, error };
}

// Usage — TypeScript knows exactly what data is
const { data, loading, error } = useFetch<User>("/api/user/123");
// data is typed as User | null — not unknown, not any

The T propagates through the entire hook. That's the magic. You call useFetch<User> once at the call site, and TypeScript figures out the rest.

Generic Components

Generics in components are the thing that trips up most mid-level React developers. They look intimidating. They have funny angle-bracket syntax. But once you understand when to reach for them, they save you from maintaining three slightly-different versions of the same component.

When to Use Generic Components

Reach for generics when your component works with data of a variable shape, but still needs to be type-safe. A list component, a select dropdown, a data table — these are classic candidates.

// Without generics — you end up with separate UserList, ProjectList, etc.
// or you use any[] and lose type safety

// With generics — one component that works for any data shape
interface ListProps<T> {
  items: T[];
  renderItem: (item: T, index: number) => React.ReactNode;
  keyExtractor: (item: T) => string;
  emptyMessage?: string;
}

function List<T>({ items, renderItem, keyExtractor, emptyMessage = "No items" }: ListProps<T>) {
  if (items.length === 0) {
    return <p>{emptyMessage}</p>;
  }

  return (
    <ul>
      {items.map((item, index) => (
        <li key={keyExtractor(item)}>{renderItem(item, index)}</li>
      ))}
    </ul>
  );
}

// Usage: TypeScript infers T from the items array
<List
  items={users}
  keyExtractor={(u) => u.id}  // TypeScript knows u is a User
  renderItem={(u) => <span>{u.name}</span>}
/>

Notice that you don't even need to write <List<User>> at the call site — TypeScript infers T = User from the items prop. That's inference doing its job.

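One thing to watch: in .tsx files, the compiler can confuse <T> with a JSX tag. This mostly bites generic arrow functions; a function declaration like List above parses fine. If you get a parse error, either add a constraint (<T extends object>) or use a trailing comma (<T,>) to disambiguate.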
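Both workarounds are easier to remember once you've seen them side by side. This is a minimal sketch with made-up helper names (`wrapInArray`, `pluckId`); the syntax is the point, not the helpers.

```typescript
// Trailing comma: tells the parser <T,> is a type parameter list, not a JSX tag.
const wrapInArray = <T,>(value: T): T[] => [value];

// Constraint: `extends` can't appear in a JSX tag, so this is unambiguous too.
const pluckId = <T extends { id: string }>(item: T): string => item.id;

const wrapped = wrapInArray(42);                   // inferred as number[]
const id = pluckId({ id: "u_1", name: "Ada" });    // inferred as string
```

I reach for the constraint form when the function genuinely needs something from T, and the trailing comma when it doesn't.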

Discriminated Unions for State

Here's the thing that's changed how I think about React state more than anything else: replacing boolean flags with discriminated union types. This single pattern eliminates entire categories of bugs.

The Boolean Flag Problem

You've seen this component. You've written this component.

// The boolean flag trap
interface FormState {
  isLoading: boolean;
  isSuccess: boolean;
  isError: boolean;
  errorMessage?: string;
  data?: SubmitResult;
}

// Nothing stops you from setting isLoading: true AND isSuccess: true simultaneously
// That's an impossible state — but TypeScript can't catch it
const state: FormState = {
  isLoading: true,
  isSuccess: true, // ← this should be impossible
  isError: false,
};

When you have three booleans representing what should be a single sequential state, you have 2³ = 8 possible combinations, but only 4 of them are actually valid. TypeScript can't protect you from the invalid ones.
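If the arithmetic feels hand-wavy, you can enumerate it. This sketch lists every combination of the three flags and keeps only the coherent ones, under the assumption that a coherent state has at most one flag set (none set is idle). The `Flags` type is a trimmed-down stand-in for FormState.

```typescript
interface Flags {
  isLoading: boolean;
  isSuccess: boolean;
  isError: boolean;
}

const bools = [false, true];
const allCombinations: Flags[] = [];

// Three independent booleans: 2 * 2 * 2 combinations.
for (const isLoading of bools) {
  for (const isSuccess of bools) {
    for (const isError of bools) {
      allCombinations.push({ isLoading, isSuccess, isError });
    }
  }
}

// Coherent means at most one flag is true (no flags set = idle).
const isCoherent = (f: Flags): boolean =>
  [f.isLoading, f.isSuccess, f.isError].filter(Boolean).length <= 1;

const validCombinations = allCombinations.filter(isCoherent);
const impossibleCount = allCombinations.length - validCombinations.length;
```

Half of the representable states are ones your UI should never be in, and the type system happily accepts all of them.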

The Discriminated Union Fix

// Model the actual states that can exist
type FormState =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: SubmitResult }
  | { status: "error"; errorMessage: string };

// Now the impossible states are literally unrepresentable
const state: FormState = { status: "idle" };

// And in your component, TypeScript narrows types automatically
function FormFeedback({ state }: { state: FormState }) {
  if (state.status === "loading") {
    return <p>Submitting...</p>;
  }

  if (state.status === "error") {
    // TypeScript knows state.errorMessage exists here
    return <p role="alert">{state.errorMessage}</p>;
  }

  if (state.status === "success") {
    // TypeScript knows state.data exists here
    return <pre>{JSON.stringify(state.data, null, 2)}</pre>;
  }

  return null; // idle
}

The key is the status discriminant property. When you narrow on state.status === "error", TypeScript automatically knows which variant of the union you're in, and which other fields are available.

This pattern is especially powerful in data-fetching scenarios, form submission flows, and anywhere you have a multi-step process. Start reaching for it instead of isLoading / isError / isSuccess and your state management will become dramatically cleaner.
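The union pairs naturally with a reducer, because each transition can be spelled out explicitly. Here's a minimal sketch, React-free so it runs anywhere, though the same function would plug into useReducer. The `FormAction` names and the `SubmitResult` shape are assumptions for illustration.

```typescript
// Hypothetical payload shape; substitute your own.
interface SubmitResult {
  id: string;
}

type FormState =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: SubmitResult }
  | { status: "error"; errorMessage: string };

type FormAction =
  | { type: "submit" }
  | { type: "resolve"; data: SubmitResult }
  | { type: "reject"; errorMessage: string }
  | { type: "reset" };

function formReducer(state: FormState, action: FormAction): FormState {
  switch (action.type) {
    case "submit":
      // Ignore duplicate submits while a request is already in flight.
      return state.status === "loading" ? state : { status: "loading" };
    case "resolve":
      // A result only makes sense if we were waiting for one.
      return state.status === "loading"
        ? { status: "success", data: action.data }
        : state;
    case "reject":
      return state.status === "loading"
        ? { status: "error", errorMessage: action.errorMessage }
        : state;
    case "reset":
      return { status: "idle" };
  }
}

let formState: FormState = { status: "idle" };
formState = formReducer(formState, { type: "submit" });
formState = formReducer(formState, { type: "resolve", data: { id: "42" } });
```

Out-of-order events (a resolve arriving while idle, say) leave the state untouched instead of producing a half-valid object.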

Taming any

Let me be direct: any is a code smell, but it's not always your fault. Sometimes you're working with a library that has poor types, an API that returns unpredictable shapes, or legacy code you don't own. The goal isn't to never use any — it's to reach for better tools first.

Use unknown Instead of any for External Data

unknown is the type-safe cousin of any. It says "I don't know what this is yet" instead of "pretend this is whatever I need it to be." You can't do anything with an unknown value without first narrowing it with a type guard.

// Bad — any lets you do anything, including wrong things
async function fetchData(url: string): Promise<any> {
  const res = await fetch(url);
  return res.json();
}

const data = await fetchData("/api/user");
data.doesNotExist.boom; // TypeScript is fine with this. Your app is not.

// Good — unknown forces you to validate before using
async function fetchData(url: string): Promise<unknown> {
  const res = await fetch(url);
  return res.json();
}

function isUser(value: unknown): value is User {
  return (
    typeof value === "object" &&
    value !== null &&
    "id" in value &&
    "name" in value &&
    typeof (value as User).id === "string"
  );
}

const data = await fetchData("/api/user");
if (isUser(data)) {
  // TypeScript now knows data is User
  console.log(data.name);
}

Type Guards Are Your Friends

The value is User return type in the example above is a type predicate, and it is what makes isUser a type guard. It tells TypeScript "if this function returns true, the value is a User in the branches that follow." This is how you move from unknown territory into properly typed territory without resorting to any.

// Reusable type guard pattern
function isNonNull<T>(value: T | null | undefined): value is T {
  return value !== null && value !== undefined;
}

const items: (User | null)[] = [user1, null, user2, null];
const validUsers = items.filter(isNonNull); // typed as User[], not (User | null)[]
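Guards also compose. When an endpoint returns a list, you can lift a single-item guard into an array guard instead of writing a second one by hand. A minimal sketch, where the `Guard` alias and `isArrayOf` are names I've picked for illustration:

```typescript
type Guard<T> = (value: unknown) => value is T;

// Lifts a guard for T into a guard for T[].
function isArrayOf<T>(guard: Guard<T>): Guard<T[]> {
  return (value: unknown): value is T[] =>
    Array.isArray(value) && value.every((item) => guard(item));
}

const isString: Guard<string> = (value): value is string =>
  typeof value === "string";

const isStringArray = isArrayOf(isString);

const payload: unknown = JSON.parse('["a", "b", "c"]');
let joined = "";
if (isStringArray(payload)) {
  // payload is string[] in this branch
  joined = payload.join("-");
}
```

The same isArrayOf works with a guard like isUser, which covers a /users endpoint without a second hand-written check.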

When never Is the Right Answer

never is the type that can't exist. It's useful for exhaustiveness checks — making sure you've handled every case in a union.

type Shape = "circle" | "square" | "triangle";

function getArea(shape: Shape, size: number): number {
  switch (shape) {
    case "circle":
      return Math.PI * size * size;
    case "square":
      return size * size;
    case "triangle":
      return (size * size) / 2;
    default:
      // If you add "hexagon" to the Shape union and forget to handle it here,
      // TypeScript will throw a compile error on this line
      const _exhaustiveCheck: never = shape;
      throw new Error(`Unhandled shape: ${_exhaustiveCheck}`);
  }
}

This pattern scales beautifully with discriminated unions. Add a new status to your state type, and every switch statement that wasn't updated will fail at compile time. That's exactly the kind of safety net TypeScript is supposed to provide.
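To make that concrete, here's the same check pulled into a reusable helper and applied to a discriminated union. It's a sketch: `assertNever`, `RequestState`, and `describeState` are illustrative names rather than anything from a library.

```typescript
// Accepts only values TypeScript has narrowed to never.
function assertNever(value: never): never {
  throw new Error(`Unhandled case: ${JSON.stringify(value)}`);
}

type RequestState =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; rows: number }
  | { status: "error"; errorMessage: string };

function describeState(state: RequestState): string {
  switch (state.status) {
    case "idle":
      return "Waiting for input";
    case "loading":
      return "Loading";
    case "success":
      return `Loaded ${state.rows} rows`;
    case "error":
      return `Failed: ${state.errorMessage}`;
    default:
      // Adding a new status to RequestState makes this line a compile error.
      return assertNever(state);
  }
}

const summary = describeState({ status: "success", rows: 3 });
```

Because assertNever also throws at runtime, it doubles as a guard against data that slipped past the type system, such as an unexpected status from an API.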

Before vs. After: TypeScript Patterns

Here's a quick-reference table of the common before/after shifts. These are the six patterns I most often see in code review that a bit of TypeScript discipline cleans up immediately.

| Pattern | Without TypeScript Discipline | With TypeScript Discipline |
| --- | --- | --- |
| Prop typing | props: any or no types at all | interface ButtonProps extends React.ButtonHTMLAttributes<HTMLButtonElement> |
| Optional props | Everything is ? to avoid errors | Only truly optional fields are optional; defaults handled by destructuring |
| Loading/error state | isLoading: boolean, isError: boolean, isSuccess: boolean | Discriminated union: { status: "idle" \| "loading" \| "success" \| "error" } |
| External API data | const data: any = await fetch(...).then(r => r.json()) | const data: unknown + type guard before use |
| Reusable list | Separate UserList, ProjectList components with duplicated logic | One generic List<T> component with typed renderItem and keyExtractor |
| Hook return type | Inferred: breaks on multiple return paths, confusing autocomplete | Explicit interface UseXxxResult { ... } as the return type annotation |

Module Structure for a TypeScript React Project

Okay, one more thing worth getting right from the start: where files live. A consistent folder structure does more for long-term maintainability than almost any TypeScript pattern. Here's the structure I use and recommend for mid-to-large React + TypeScript apps in 2025.

src/
├── app/                      # Next.js App Router pages (or pages/ for older setup)
│   ├── layout.tsx
│   ├── page.tsx
│   └── blog/
│       └── [slug]/
│           └── page.tsx
│
├── components/               # Shared UI components
│   ├── Button/
│   │   ├── Button.tsx        # Component
│   │   ├── Button.types.ts   # Interface / type exports
│   │   ├── Button.styles.ts  # Styled components or CSS module
│   │   └── index.ts          # Re-export for clean imports
│   └── List/
│       └── ...
│
├── hooks/                    # Custom hooks
│   ├── useFetch.ts
│   ├── useLocalStorage.ts
│   └── useDebounce.ts
│
├── lib/                      # Utilities, helpers, non-UI logic
│   ├── api.ts                # Fetch wrappers
│   ├── formatters.ts         # Date, currency, string helpers
│   └── validators.ts         # Type guards and runtime validation
│
├── types/                    # Shared TypeScript types
│   ├── api.ts                # API response shapes
│   ├── models.ts             # Domain model interfaces (User, Post, etc.)
│   └── index.ts              # Re-exports
│
├── context/                  # React context providers
│   └── ThemeContext.tsx
│
└── styles/                   # Global styles, theme tokens
    └── globals.css

A few rules I enforce in this structure:

Co-locate component types. Each component folder has its own .types.ts file. Don't dump all types into a single global types.ts — that file becomes a graveyard.
The types/ directory is for shared domain types only. API response shapes, database models, shared interfaces that more than one component needs. Not component-specific props.
Barrel files are useful but dangerous. An index.ts in each component folder is fine. A single barrel for your entire components/ directory will cause circular dependency nightmares in larger apps.
Hooks go in hooks/, not co-located with components. This is a deliberate choice against the "co-locate everything" philosophy. In my experience, hooks get reused across features, and burying them inside a component folder makes them harder to find and share.

TL;DR

Use interface for component props, type for everything else — pick a rule and apply it consistently. Default to required props; only mark things optional when they genuinely are.
Extend HTML element props using React.ButtonHTMLAttributes<HTMLButtonElement> and friends — this gets you all native attributes for free and makes components composable with ...rest.
Always annotate custom hook return types explicitly — define a UseXxxResult interface and return it. Don't trust inference when there's conditional logic involved.
Use discriminated unions instead of boolean flags for anything with multiple states — { status: "idle" | "loading" | "success" | "error" } is safer, clearer, and catches impossible states at compile time.
Replace any with unknown at API boundaries — then validate with type guards before use. Save never for exhaustiveness checks in switch statements.
Structure your modules intentionally — co-locate component types, put shared domain types in types/, hooks in hooks/, and use barrel files per component folder (not globally).