DEV Community: Ben Halpern

New Gemini models dropped https://blog.google/innovation-and-ai/models-and-research/gemini-models/gemini-3-6-flash-3-5-flash-lite-3-5-flash-cyber/

Ben Halpern — Tue, 21 Jul 2026 16:31:50 +0000

3.6 Flash, 3.5 Flash-Lite, and 3.5 Flash Cyber

We’re introducing new Gemini models, including Gemini 3.6 Flash, 3.5 Flash-Lite and 3.5 Flash Cyber.

blog.google

Context Is King: Rethinking Domain Ownership, Product, and the "Spec Phase"

Ben Halpern — Mon, 20 Jul 2026 14:05:39 +0000

If you’ve spent any time recently writing detailed product requirement documents or meticulously crafting ticket specifications, you’ve probably noticed a frustrating paradox:

Thoroughly specifying an issue is often 90% of the work required to solve it.

By the time you’ve gathered the context, mapped out the edge cases, accounted for user flows, and detailed the expected behavior precisely enough for someone else to execute without friction, you’ve already done the heavy mental lifting. Translating that comprehensive spec into code is increasingly becoming the easy part.

This dynamic is quietly reshaping how software gets built, how teams collaborate, and what it actually means to be a competitive developer today.

The Death of the Developer-as-Cog

Back in the day, software development often operated on an assembly-line model. A Product Manager wrote a detailed spec, handed it off to an engineering lead, who broke it down into sub-tickets, which were then handed off to developers.

The developer’s job was essentially to act as a cog in the machine: take the ticket, write the code according to spec, and move it to QA.

To be clear, this model was never ideal. It created silos, bred disinterest, and led to bloated communication loops. But it technically worked.

Today, operating that way isn't just inefficient—it’s non-competitive. When the friction between ideation and execution drops, the overhead of the traditional handoff becomes the main bottleneck. The games of telephone between product definition and code delivery slow teams down to a crawl while faster, context-rich teams lap them.

The Friction of the "Spec Phase"

If writing a complete spec takes 90% of the cognitive effort, having a constant handoff between "the person who thinks of the task" and "the person who implements the task" creates massive drag.

This is why domain ownership has become table stakes.

When developers truly own a domain—meaning they deeply understand the problem space, the user, the underlying system architecture, and the business goals—they bypass the awkward, high-overhead "spec phase." They don't need a 10-page PRD to build the next feature because they already hold the context in their head. They can move fluidly from one problem to the next, making real-time trade-offs and inline product decisions without waiting for a spec to be baked.

Redefining the Lines Between Dev and PM

This shift doesn't mean Product Managers disappear, but it does mean the boundaries of the roles are becoming far more fluid:

The Developer as PM: In many areas, the developer is the product manager. Because they hold domain context, they identify what needs to be built next, spec it in their mind as they go, and ship it.
The PM as Builder: In other workflows, a PM might leverage low-code tools, scripts, or AI models to take a feature 90% of the way to completion—building a functional prototype, shaping the data flow, or laying out the initial structure. The developer then steps in to handle the remaining 10%.

Why is that last 10% so critical? Because even simple features require deep technical introspection: architecture alignment, performance at scale, edge-case hardening, security, and long-term maintainability.

The Moving Goalpost of "Competitive Software"

It’s tempting to look at current trends and assume that as models and tools improve, that final 10% will simply vanish—that software will eventually build and maintain itself at the press of a button.

That view misses how software markets operate.

As raw code generation becomes cheaper and faster, the baseline for what constitutes "competitive software" continuously moves up. Features that used to take three months now take three days, which means customer expectations rise accordingly. The line of what makes a product stand out shifts to higher levels of polish, deeper integrations, tighter performance, and better user experience.

The last 10% doesn't disappear; it just becomes more nuanced.

The Bottom Line: Problem Context and Quality Standards

We are moving away from an era where value is measured by syntax output or the ability to execute against rigid, pre-packaged specifications.

The real leverage in modern software engineering comes down to two things:

Problem Context Management: Your ability to internalize the user's needs, the business goals, and the system architecture so you can make high-judgment decisions on the fly without waiting for a spec.
Quality Thresholds: Your ability to look at a 90%-complete solution—whether produced by an AI, a teammate, or a PM—and know exactly what it needs to cross the finish line to be production-ready, scalable, and resilient.

Tools will continue to evolve, and the mechanics of writing code will keep changing. But the engineers who cultivate deep domain context and maintain uncompromising standards for quality won't just stay competitive—they’ll be the ones setting the pace.

Meme Monday

Ben Halpern — Mon, 20 Jul 2026 13:04:45 +0000

Meme Monday!

Today's cover image comes from the last thread.

DEV is an inclusive space! Humor in poor taste will be downvoted by mods.

This is remarkable https://bobdahacker.com/blog/fifa-hack

Ben Halpern — Tue, 14 Jul 2026 15:11:42 +0000

I Could've Rickrolled the Entire FIFA World Cup. All I Needed Was My ID. | bobdahacker

How I found that anyone could register on FIFA's public Agent Platform, gain access to the Football Data Platform's Streaming Management panel, and get RTMP ingest URLs and stream keys for every live FIFA World Cup 2026 camera feed. I then spent hours calling FIFA, MediaKind, HBS, CISA, and the FBI trying to get someone to pick up the phone.

bobdahacker.com

HTTP gets a QUERY method so complex searches can stop pretending to be POST https://www.theregister.com/devops/2026/07/13/http-gets-a-query-method-so-complex-searches-can-stop-pretending-to-be-post/5270192

Ben Halpern — Tue, 14 Jul 2026 11:59:15 +0000

HTTP gets a QUERY method so complex searches can stop pretending to be POST

New verb carries request content while remaining safe, idempotent, and cacheable

theregister.com

The Myth of the Post-Documentation Era

Ben Halpern — Mon, 13 Jul 2026 15:59:11 +0000

There is a growing sentiment in engineering circles right now that documentation is a relic of the past. The argument usually goes something like this: We’re living in the era of agent-driven development. If an AI agent can read the raw source code or parse an OpenAPI specification instantly, why waste human engineering hours writing prose? Code churns too fast anyway, and human-written docs are outdated the second they’re committed.

It’s an attractive, black-and-white view of the world. It’s also completely wrong.

Chasing strict determinism in your source of truth is a pipe dream. Code and specs tell a system how something works, but they are fundamentally incapable of explaining why it was built that way in the first place.

The Intent Gap: Why Code Isn't Enough

Even if you’re building entirely for a downstream consumer of AI agents, there is a massive, structural gap between a raw API specification and an operational reality.

Agents are phenomenal at pattern matching and syntax execution, but they struggle with architectural philosophy and human intent. We still need words to contextualize the boundaries. A spec can define an endpoint, its parameters, and its payload. What it can't capture is the nuance of why a specific architectural trade-off was made, or the implicit historical context of a legacy edge case.

Prose provides the guardrails for non-deterministic systems. Even if that prose is ultimately consumed by a machine rather than a human, the written word remains the highest-leverage way to transmit intent.

The Danger of Slop Describing Slop

This doesn't mean we need to return to the days of manually maintaining massive, static wiki pages. Automation has a massive role to play here. Cascading automation—where documentation is dynamically generated alongside code changes—is incredibly powerful.

But there’s a trap here: slop describing slop is entirely useless.

If we completely hand off documentation generation to unchecked LLMs, we end up with a feedback loop of hallucinated context describing rapidly shifting code. It creates noise, not clarity.

The Key is Oversight. Even if the documentation is entirely bot-driven, human engineering oversight is non-negotiable. We need to gut-check and validate the generated prose to ensure it represents an accurate, high-level explanation of the broader context. Think of generated docs as a non-deterministic cousin of the API itself—highly valuable, but only if kept on a tight leash.

The Trust Crisis and the Search for Reputation

Right now, the single biggest blocker to this new paradigm is trust. The current lack of a "gut-check" trustworthiness metric for documentation is a massive bottleneck for both human developers and autonomous agents.

In the open-source eras of the past, we relied on crude but effective reputation proxies. If a repository had 10,000 GitHub stars, a vibrant issue tracker, and recent commits, you could reasonably assume the project (and its documentation) was stable.

We don't have a reliable reputation system for the AI era yet. The absolute novelty of the moment, combined with how incredibly easy it is to game automated metrics, means everything feels a bit unanchored.

The next major shift in developer tooling won't just be about making agents faster or code generation cleaner. It will be about solving the reputation problem—building systems that can automatically verify, score, and guarantee the trustworthiness of the knowledge bases our software relies on.

Until then, don't delete your markdown files. The machines still need to read between the lines.

A lot of good points here https://antirez.com/news/169

Ben Halpern — Mon, 13 Jul 2026 15:17:02 +0000

antirez.com

Meme Monday

Ben Halpern — Mon, 13 Jul 2026 12:27:50 +0000

Meme Monday!

Today's cover image comes from the last thread.

DEV is an inclusive space! Humor in poor taste will be downvoted by mods.

Meme Monday

Ben Halpern — Mon, 06 Jul 2026 12:25:22 +0000

Meme Monday!

Today's cover image comes from the last thread.

DEV is an inclusive space! Humor in poor taste will be downvoted by mods.

Letting the DEV Community Weigh in on the Topics of AIE

Ben Halpern — Thu, 02 Jul 2026 15:24:12 +0000

I’m at the AI Engineer World’s Fair in San Francisco, where the vibes are enthusiastic. However, enthusiasm does not mean hype. The content has largely been grounded in pragmatic problem-solving. My sense is that the industry is finally homing in on the "jobs to be done" conversation over model hype — though I could still do without the “maxxing” suffix applied to everything.

To mirror the tone of the conference itself — where raw hype isn't quite as cool as it used to be — the global DEV Community has been providing excellent commentary on the reporting we’ve been publishing. The Daily Context newspaper has been distributed every day at the conference to help attendees stay caught up on broader themes, but it’s also gone out on DEV for thousands of remote developers to read and weigh in on.

To close the feedback loop and elevate the conversation, here are a few standout quotes and themes from the community that caught my eye.

Infinite Code and Shifting Constraints

We talk a lot about AI enabling us to ship infinite code, but our community quickly pointed out that raw volume is a vanity metric. Raju Dandigam cut straight to the core of the issue, noting that:

"Choke points govern value, not code volume. The teams who win won't be the ones generating the most, they'll be the ones who made the choke points cheap to clear."

When code generation becomes free, our bottlenecks move downstream to architectural cohesion, verification, and code review. As Nazar Boyko added, a development command center only helps if it surfaces the current constraint; otherwise, you've just built a faster way to watch the wrong thing.

Blame Shifting and the Frontier Default

Another fascinating debate unfolded around why developers stubbornly default to expensive frontier models for trivial tasks. While it's easy to preach about "tokenomics," kingai offered a brutally honest psychological perspective. The frontier default isn’t always a capability hedge — it’s a blame-shifting hedge. If a fast model fails, it's your fault; if a massive frontier model fails, you get to blame the model.

To break this habit, Pon argued against making users choose between models upfront via complex dropdowns. Instead, software should default to fast, cheap models out of the gate, gating escalation on a deterministic check of the output structure rather than the model's own self-reported confidence.

Agent Architecture: Claims vs. Evidence

The structural shift toward treating an AI agent as an append-only event log generated some of our sharpest technical pushback. While the log-as-state model ensures exceptional reliability, Alice dropped a brilliant warning: The log faithfully resumes claims, not objective truth. If an agent records a confident status event saying a file is empty without an underlying tool confirmation, the log simply hardcodes a durable hallucination.

Mateo Ruiz proposed an elegant architectural split modeled after double-entry bookkeeping: Maintain a claim ledger for state resumption, but use an independent evidence ledger (file diffs, exit codes) to handle real-world verification.

The Hidden Tax of Autonomous Decisions

Finally, we have to look closely at dependency selection. When you ask an agent to build a feature, it implicitly chooses your library stack for you. FrancisTRᴅᴇᴠ highlighted the profound security edge here, warning that a model's authoritative delivery easily disarms human checkers, leaving the door wide open for typosquatted packages or supply-chain attacks.

Practicality Wins the Cycle

The DEV community isn't getting swept up in the sci-fi dream of fully unsupervised autopilot. The developers winning this cycle are applying basic, defensive engineering principles — making inputs predictable, creating strict code harnesses, and testing outputs rigorously.

Frankly, I think that mirrors the tone of the conference, and this is the feedback loop our industry is in right now. Everyone sees a form of progress, but nobody wants their AI-pilled manager to come back from the market having been sold magic beans.

The Fragile Balance of AI Development: Individual Flow vs. Collective Context

Ben Halpern — Thu, 02 Jul 2026 15:00:21 +0000

As much AI-driven development has normalized, we are still in the Wild West. While we are closer to homing in on what “best practices” actually mean, defining them remains a moving target. Right now, a fascinating tension is emerging between the workflows we build for ourselves and the systems we build for our teams.

At the individual level, best practices are a bit of a “choose your own adventure” setup, and that’s perfectly fine — with one major caveat. It’s incredibly easy to drift into isolated silos when you’re running your own little fleet of developer agents.

True individual mastery isn’t just about prompt engineering; it’s about context management and disorganization control. It’s setting up the right Model Context Protocol (MCP) servers to bridge your tools and services, and mastering the feedback loops necessary to manage parallel work. When you’re orchestrating multiple agent workflows, the core skill is balancing your own cognitive capacity — observing and inferring state across different tasks to ensure the train doesn’t run off the tracks.

But things get exponentially harder at the team level. Collective best practices require finding common ground, which inherently sits just behind the bleeding edge. If a team constantly swaps core architecture for the newest shiny object, velocity stalls. Instead, we need a predictable, accelerated pace for tool adoption that fits into the team’s collective brain without causing whiplash.

This requires a specific archetype of technical leader: someone deeply anchored in “traditional” production engineering, security, and DevOps, but possessing the pragmatism to integrate AI acceleration safely. While individual devs need a security-conscious mindset to protect their environments, the team level is where that mindset becomes mission-critical. It’s the ultimate gatekeeper for what code actually reaches production.

To prevent total divergence, teams must actively invest in intentional knowledge sharing and inspiration sessions. If we don’t intentionally bridge the gap between individual flow and collective guardrails, we risk fracturing our engineering culture.

From Harness Engineering to Evals: What’s Trending at AI Engineer

Ben Halpern — Wed, 01 Jul 2026 14:34:35 +0000

I’m at the AI Engineer conference in San Francisco this week. The event has every major brand-name sponsor you’d expect, a lineup of internet-famous project maintainers on stage, and a massive schedule covering which more or less has something for everyone. It’s easy to get lost in the noise. I spent my time trying to figure out what themes are actually real.

With dozens of tracks and thousands of builders, the ecosystem looks incredibly fractured. But if you look at what engineers are actually putting into production, the chaos collapses into a clear pattern. The industry is moving past simple chat interfaces and treating large language models like central processing units inside a larger, highly structured software architecture—essentially an LLM Operating System.

I cataloged everything I was seeing, dug into the technical tracks, and came away with these six themes. This is not my endorsement, and I have not separated the hype from the real. Take these brief summaries as jumping-off points to help you go deeper if any of these ideas trigger your curiosity.

1. The Shift to Repository-Scale “Software Factories”

For the last few years, AI in development was basically tab-complete. You wrote a line of code, an assistant suggested the next few tokens, and you moved on.

That single-file approach is quickly becoming obsolete. The focus has shifted to repository-scale, multi-agent systems—what people are calling Software Factories.

Instead of writing lines of code alongside an AI assistant, developers are managing fleets of agents that operate across entire codebases. For example, Uber shared details on uReview, their internal code review engine. It uses agents to autonomously review pull requests, spin up localized test suites, catch edge cases, and commit fixes back to the branch before a human even looks at it.

To make this reliable, engineers are plugging compilers and linters directly into the agent’s feedback loop. If the generated code fails to compile, the raw error output is fed right back into the system prompt. The model reads its own error, fixes the bug, and re-runs the check autonomously.

2. Hardening Systems with “Harness Engineering”

There’s a common realization on the conference floor right now: “Everyone is building an agent harness, but nobody calls it that.”

LLMs are inherently probabilistic and non-deterministic. Software infrastructure, however, requires predictable inputs and outputs. To fix this, teams are formalizing a core systems discipline: Harness Engineering.

The “harness” is the strict software wrapper built around a model to enforce constraints, manage state, and prevent infinite execution loops.

+--------------------------------------------------------+
| THE AGENT HARNESS |
+--------------------------------------------------------+
| 1. Durable Execution (State preservation & retries) |
+--------------------------------------------------------+
| 2. Structured Outputs (Schema enforcement / Pydantic) |
+--------------------------------------------------------+
| 3. Dynamic Guardrails (Input/Output sanitization) |
+--------------------------------------------------------+

Instead of letting an agent run unmonitored, developers are using toolchains like Temporal or Inngest to implement durable execution. If an agent is running a complex, multi-hour workflow and hits a network timeout, the harness preserves its memory and state. The process can resume exactly where it failed without repeating expensive API calls. Paired with libraries like Pydantic or Instructor to force strict JSON schema compliance, the harness makes unpredictable models behave like stable infrastructure.

3. Computer Use vs. Custom APIs

For decades, integration meant writing custom API connectors or scraping endpoints. A major theme this year is Computer Use—building agents that navigate software exactly like a human operator does: by looking at a screen, moving a mouse, and typing commands.

Enabled by better vision-language models (VLMs), these systems don’t need structured backend APIs. They take continuous screenshots of a graphical user interface (GUI), parse the visual layout to locate fields and buttons, and execute precise pixel coordinates.

This has forced a shift in local developer setups. Engineers are building isolated, sandboxed terminals and open-source desktop companions (like OpenClaw) that give background agents their own virtual environments. This lets agents spin up local servers and debug files in isolation without taking over the engineer’s active screen and keyboard.

4. Context Engineering & “Tokenmaxxing”

Context windows have scaled to millions of tokens, but dumping an entire codebase into a prompt is an expensive, high-latency anti-pattern.

Time-to-first-token and API costs are the real bottlenecks today. Because of this, developers are focusing heavily on Context Engineering—treating the context window as a highly optimized, dynamic memory cache rather than a static text dump.

The optimization strategy generally follows a three-layer approach:

Prefix Caching: Inference engines like vLLM cache the Key-Value (KV) states of static system instructions or documentation headers. Subsequent requests reuse this cache, significantly cutting down latency and cost.
Context Compression: Middleware layers are introduced to run semantic compression algorithms, pruning irrelevant tokens and summarizing messy chat logs before sending data to the provider.
Graph RAG & Hybrid Retrieval: Instead of pulling raw text blocks indiscriminately, systems use structured knowledge graphs to pass only high-signal data into the active context window.
Finish reading at link.dev.to/aie39.

5. Moving Past “Vibe-Based” Evaluations

If there is one clear operational shift, it’s that vibe-based engineering is dead. Reviewing a few outputs, deciding they look reasonable, and shipping them to production is no longer an acceptable practice.

The core focus of the Evals community is on automated, multi-step simulation benchmarks. Evaluating an agent now requires spinning up an isolated virtual environment—a temporary sandbox with mock databases and network access—and letting the agent attempt a complex task. The evaluation framework doesn't grade the style of the response; it checks if the task was completed successfully, notes how many steps it took, and verifies that no security protocols were broken.

Engineers are also moving away from the “Persona Trap”—giving a model a prompt like “You are a senior staff engineer.” Studies shared at the event show this approach evaluates a stylistic vibe rather than a rigorous technical capability, often introducing silent biases that degrade performance. The standard now is rigid, task-oriented testing.

6. Secure Micro-Sandboxes for Runtime Safety

Giving an agent the authority to write code, modify files, and run terminal commands introduces severe security risks.

Platform engineers are tackling this by focusing on the underlying execution layer. The industry standard has normalized around Micro-Sandboxes. Agent-generated code is executed inside lightweight, ephemeral micro-VMs (like those from E2B or Docker) that spin up in milliseconds, handle the specific computation, and are immediately destroyed to prevent container escape or persistent file system tampering.

There is also a major push toward credential masking. When agents need access to enterprise databases or third-party tools, engineers are using new delegation layers like the AAuth protocol. This grants the agent mission-bounded authority to call a tool, but prevents the agent from ever seeing or interacting with the raw API keys, neutralizing prompt injection leaks.

The Bottom Line

It’s easy to skim these topics, feel a wave of FOMO, and think you’re already lagging behind if you aren’t running a fleet of micro-sandboxes or an autonomous software factory.

Don’t buy into the hype. You don’t need to overhaul your entire stack by next Monday.

The real takeaway from all the noise at Moscone is actually pretty reassuring: AI is just becoming regular software infrastructure. The developers who build useful things over the next few years won't be the ones chasing every flashing new model drop or complex multi-agent framework. They’ll be the ones applying basic, boring engineering principles—making their inputs predictable, testing their code rigorously, and keeping their environments secure.

If you're looking for a place to start, don’t overcomplicate it. Pick a single, repetitive workflow in your day-to-day. Wrap a clean, defensive code harness around it, build a straightforward evaluation script to check its work, and see what happens. Inspiration is great, but pragmatism is what actually ships.