DEV Community: S M Tahosin

The Most Underrated Announcement from Google I/O 2026 Was Buried in a 90-Second Demo

S M Tahosin — Thu, 21 May 2026 20:22:22 +0000

This is a submission for the Google I/O Writing Challenge.

I watched the Google I/O 2026 keynote twice.

First time, I got swept up in the shiny stuff. Gemini 3.5 Flash benchmarks. Veo 3 generating videos that look disturbingly real. Gemini Omni doing that multimodal physics thing. Cool. Expected. The usual I/O sugar rush that gets 50,000 retweets and fades by Thursday.

Second time through, I caught something different.

About 40 minutes into the developer keynote, sandwiched between the Jules GA announcement and a Stitch demo, there was maybe 90 seconds on something called the Managed Agents API. The presenter dropped one line that made me hit pause and rewind.

"Deploy an autonomous agent that reasons, writes code, browses the web, and executes in a secure sandbox. One API call."

I closed every other tab. Pulled up the docs. Started writing code.

The 19-Day Problem

Here's some context. If you've tried building anything with AI agents in the past year, you know the drill. And by "drill" I mean "weeks of suffering."

Say you want an agent that takes a GitHub issue, reads the codebase, writes a fix, runs tests, and opens a PR. Sounds straightforward, right? In reality, you're wiring up five services, spinning up sandboxed containers, managing auth, building tool-call routing, writing health checks, and setting up network policies so your agent doesn't accidentally nuke production at 3am on a Saturday.

Last month I built an internal bot that triages support tickets. Took about three weeks. The actual AI logic? One day. The other 19 days were pure infrastructure. Docker config. Sandbox isolation with gVisor. Network policies. Timeout handling. Health checks. Retry logic.

Nineteen days of plumbing. One day of thinking.

That ratio is broken. And this API just fixed it.

Three Weeks to Eleven Lines

I took that same support ticket bot and rewired it on the Managed Agents API. Not a demo version. The same bot. Same capabilities.

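```python
from google import genai

# Picks up the API key from the environment.
client = genai.Client()

# Placeholder input; in the real bot this comes from the ticketing system.
ticket_text = "Checkout page returns a 500 error after the latest deploy."

interaction = client.interactions.create(
    agent="antigravity-preview-05-2026",
    environment="remote",
    input=(
        "You are a support ticket triage agent. "
        "Read the following ticket, classify its severity, "
        "identify the affected component from the codebase, "
        "and draft a response with a proposed fix.\n\n"
        f"Ticket: {ticket_text}"
    )
)

# The agent's final answer as plain text.
print(interaction.output_text)
```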

The call itself is eleven lines. No Docker. No Kubernetes. No sandbox config.

The API spins up a fresh, isolated Linux environment, loads the agent runtime, runs your task, hands back the result, and destroys the sandbox. Done.

Here's what that looked like in practice:

Metric	Old Setup	Managed Agents API
Time to build	3 weeks	1 afternoon
Lines of infra code	~2,400	0
Lines of agent logic	~180	11
Dependencies	Docker, gVisor, Redis, nginx	`google-genai` pip package
Maintenance burden	Container updates, health checks, scaling	None (Google's problem)

I stared at my screen for a solid minute when it worked. Not because the output was flawless (it wasn't). Because I'd just thrown away three weeks of infrastructure code.

What Google Actually Built Under the Hood

When you hit interactions.create, four things happen:

Sandbox provisioning. Google fires up an isolated Linux VM. Fresh filesystem every time. No leftover state from previous runs. Network access is off by default, opt-in only. This alone used to cost me a week of Docker and gVisor wrestling.

Agent harness boots up. This is the exact same runtime that powers Jules and the Antigravity desktop app. Not a watered-down version. Same thing. Every improvement Google makes to Jules? Your managed agents get it too.

Reasoning loop. The agent reads your input, builds a plan, starts executing. Writing files. Running code. Hitting the web if you've turned that on. There's a "critic" layer baked in that catches logic errors before returning output. Think of it like a built-in code reviewer that runs before every response.

Cleanup. Interaction finishes, sandbox gets nuked, you get the result plus any files the agent created. Thirty seconds to a few minutes total.
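
To make that lifecycle concrete, here is a rough local analogy. It is only a sketch: a temporary directory plus a subprocess with a timeout gives you a fresh filesystem and automatic cleanup, but none of the isolation a real sandbox VM provides. `run_in_scratch` is a hypothetical helper name, not part of any Google API.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch(code: str, timeout_seconds: float = 30.0) -> dict:
    """Run Python code in a throwaway directory; return output plus files.

    Local analogy only: a temp dir is NOT a security sandbox.
    """
    # 1. Provision: a fresh, empty working directory for every run.
    with tempfile.TemporaryDirectory() as workdir:
        # 2-3. Boot and execute: run the task under a hard time limit.
        proc = subprocess.run(
            [sys.executable, "-c", code],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
        # Collect any text files the task created before the directory vanishes.
        files = {
            str(p.relative_to(workdir)): p.read_text()
            for p in Path(workdir).rglob("*")
            if p.is_file()
        }
        result = {
            "stdout": proc.stdout,
            "returncode": proc.returncode,
            "files": files,
        }
    # 4. Cleanup: the context manager has already deleted workdir here.
    return result
```

If the task runs past the limit, `subprocess.run` raises `TimeoutExpired` and the directory is still removed, which is roughly the behavior you hit at the timeout ceiling.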

Where the Sandbox Breaks: The Preview Limitations

I'm not going to pretend this is ready for production. Two days of testing surfaced real problems.

Timeout wall. I pointed it at a 15,000-line codebase and asked it to refactor one module. Hit the 5-minute ceiling and died. Large, complex tasks choke.

Zero memory between calls. Each interaction gets a clean sandbox. Great for security. Terrible if you need your agent to remember context. You have to manage state yourself, passing the previous_interaction_id and relevant context back in on every subsequent call. Not hard, but not free either.

The "preview" tax. Pre-GA. Google says don't feed it sensitive data. Side projects and internal tools? Go for it. Customer data in production? Wait.

Pricing is a black box. Free during preview. Nobody knows what this costs at scale. That's a real problem for anyone planning production workloads.

Network access is half-baked. Your agent can browse the public web. But reaching internal APIs? You need an MCP server as a bridge, which brings back some of that infrastructure overhead. A bit ironic.
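
The zero-memory limitation is the one you will end up coding around first. Below is a minimal sketch of one way to carry state between calls. The `create` callable is injected (you would pass something like `client.interactions.create`), the `previous_interaction_id` parameter name comes from the description above, and the `id` attribute on the returned object is an assumption. `AgentThread` is a hypothetical name.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Optional


@dataclass
class AgentThread:
    """Carries state across otherwise stateless interactions."""

    create: Callable[..., Any]      # e.g. client.interactions.create
    agent: str
    last_id: Optional[str] = None   # id of the previous interaction
    notes: list[str] = field(default_factory=list)  # context we re-send

    def send(self, message: str) -> Any:
        # Prepend remembered context, since each new sandbox starts empty.
        context = "\n".join(f"- {n}" for n in self.notes)
        prompt = f"Known context:\n{context}\n\n{message}" if context else message

        kwargs = {"agent": self.agent, "input": prompt}
        if self.last_id is not None:
            # Assumed parameter name, taken from the post's description.
            kwargs["previous_interaction_id"] = self.last_id

        interaction = self.create(**kwargs)
        # Assumption: the returned object exposes an `id` attribute.
        self.last_id = interaction.id
        return interaction
```

Keeping `notes` short matters here: everything in it is re-sent on every call.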

How It Stacks Up Against the Competition

Here's what made me pay attention. Right now, if you want an autonomous agent that executes in a sandbox, your options are:

OpenAI Assistants API gives you code execution in a sandbox, but it's tied to OpenAI models, the sandbox is limited (no arbitrary binary execution, no web browsing), and you're paying per-token plus tool-call fees. It's also not truly "deploy an agent" so much as "run a conversation with tools."

Anthropic's tool-use is powerful for single-turn tool calling, but there's no managed sandbox. You bring your own execution environment. So you're back to the Docker-and-gVisor dance.

LangGraph Cloud gets you agent orchestration, but again, you manage the infrastructure. The execution environment is your problem.

Google's approach is different. They're saying: give us the instructions, we'll handle the sandbox, the execution, the security, the cleanup. You don't think about infrastructure at all. That's a genuinely new position in this space.

This is the first time a major cloud provider is treating autonomous agents as serverless compute, not just chat-with-tools.

The tradeoff? You're locked into Google's ecosystem. The agent runs on Gemini models. If you need Claude or GPT-4 for a specific task, this isn't your tool. But for teams already in the Google stack, the friction drop is real.

The Feature That Actually Got Me: Saved Agents

One-shot interactions are cool. But agents.create is where things get interesting.

You define an agent with custom instructions, specific tools, MCP connections, and environment settings. Save that whole configuration. Then trigger it by ID from anywhere. Cron job. Webhook. GitHub Action. Another agent.

```python
# Reuses `client` and `ticket_text` from the earlier snippet.
# Save the agent configuration once.
agent = client.agents.create(
    display_name="ticket-triage-v1",
    system_instruction=(
        "You are a senior support engineer. "
        "Classify tickets by severity. "
        "Always check error logs before suggesting a fix. "
        "Never suggest restarting the service as a first option."
    ),
    tools=["code_execution", "web_browse"],
    environment_config={
        "sandbox": "remote",
        "timeout_seconds": 300  # matches the 5-minute preview ceiling
    }
)

# Trigger from anywhere
result = client.interactions.create(
    agent=agent.id,
    input=f"New ticket: {ticket_text}"
)
```

I wired one to our Slack. Someone files a bug, the agent auto-triages, pulls relevant logs, posts analysis in the thread. Forty lines of Python and a webhook.
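
A handler for that kind of webhook can stay small. The sketch below is an illustration, not the bot described above. It takes a Slack Events API payload and returns the arguments for a threaded reply. The `triage` callable is a stand-in for the agent call, and request signature verification and the actual HTTP post back to Slack are left out.

```python
from typing import Callable, Optional


def handle_slack_event(payload: dict, triage: Callable[[str], str]) -> Optional[dict]:
    """Turn a Slack Events API payload into a threaded reply, or None."""
    # Slack's one-time endpoint check: echo the challenge back.
    if payload.get("type") == "url_verification":
        return {"challenge": payload["challenge"]}

    event = payload.get("event", {})
    # Ignore non-messages and anything posted by a bot, including our own replies.
    if event.get("type") != "message" or event.get("bot_id"):
        return None

    analysis = triage(f"New ticket: {event.get('text', '')}")
    return {
        "channel": event["channel"],
        # Reply inside the existing thread, or start one on the message.
        "thread_ts": event.get("thread_ts", event["ts"]),
        "text": analysis,
    }
```

Skipping bot messages is the important design choice: without it, the agent's own reply would trigger another triage run.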

The Lambda Moment

Remember 2014? Before Lambda, running code in the cloud meant EC2 instances. Load balancers. Auto-scaling groups. The works.

Lambda said: give us the function, we handle the rest. People called it a toy. Then it ate the backend world.

I keep seeing the same pattern. Before this API, building an agent meant managing infrastructure. Now you hand over instructions and Google runs the thing in a sandboxed environment.

Maybe I'm wrong. Maybe this stays niche. But the parallel keeps nagging at me, and I haven't been able to talk myself out of it.

What I Want to Build Next

A docs drift detector that points at a repo, reads the README, runs the code, and flags where documentation and behavior have diverged. Every project has this problem. Nobody fixes it manually.

A dependency changelog reader that actually reads changelogs for your deps, understands breaking changes, and tells you which updates are safe to auto-merge and which ones need human review.

A pre-review PR agent that reads changes before a human reviewer opens the PR, checks test coverage on modified files, identifies risky diffs, and writes review notes. Like a thorough junior dev who never sleeps.

All of these would've been multi-week projects before. Now they're afternoon builds. That's the shift. Not what agents can do. But how fast you can ship them.

So What Now

Google I/O 2026 had no shortage of headlines. Gemini 3.5 Flash is fast. Veo 3 is wild. Gemini Omni understanding physics makes you wonder what 2027 looks like.

But this quiet little API is the one that actually changed my Tuesday. It didn't make me go "wow." It made me delete code. And that's usually how the important stuff starts.

Open the docs. Write eleven lines of Python. See what happens.

Found this useful? A reaction helps others find it too. Questions about the API or building with it? I'm in the comments.

Hermes Just Killed OpenClaw (Here's Why)

S M Tahosin — Tue, 19 May 2026 13:12:33 +0000

This is a submission for the Hermes Agent Challenge.

I do not think OpenClaw is dead.

That title is deliberately dramatic because the shift is dramatic. OpenClaw did something important: it made a lot of developers believe that a personal AI assistant could be more than a chat box. It could sit on your machine, connect to your messages, call tools, browse, run commands, and actually move work forward.

But Hermes Agent changes the question.

OpenClaw asks:

What if I could run a personal AI assistant on my own devices?

Hermes asks:

What if my agent could live on my infrastructure, remember how I work, improve its own procedures, use tools across channels, and become more useful every week?

That second question is why Hermes feels like the next step.

Not because OpenClaw is bad. OpenClaw is popular for a reason. The official repo describes it as a personal AI assistant that runs on your own devices, answers through the channels you already use, and uses a Gateway as the control plane. That is a strong idea.

The problem is that the AI agent market is moving from "assistant I operate" to "worker I supervise." Once that happens, the winning system is not the one with the loudest demo. It is the one with the better memory model, execution boundary, skill lifecycle, tool surface, and deployment story.

That is where Hermes starts to pull ahead.

The short version

If I had to explain the difference in one line:

OpenClaw feels like a local-first assistant. Hermes feels like agent infrastructure that happens to chat.

That distinction matters.

A real agent has to do more than respond. It needs to run somewhere reliable. It needs to work while I am away. It needs to remember the parts of my environment that matter. It needs to learn repeatable procedures. It needs to make tool use safer, especially when those tools touch files, browsers, credentials, APIs, and servers.

OpenClaw helped prove the demand.

Hermes is making the operating model more serious.

The five claims that matter

The loudest Hermes pitch right now is simple: install it, connect it, give it skills, run it on a server, and let it become your agent.

That pitch is exciting, but I would not judge Hermes by hype. I would judge it by which claims survive contact with architecture.

Claim	Why it matters	My read
"One-command install"	Agents die when setup is fragile. If the first hour is dependency pain, most people quit.	Useful, but not the real moat. Setup gets you to day one. Memory and skills decide day thirty.
"Run it on a VPS or sandbox"	A serious agent should not need your personal laptop open all day.	This is one of Hermes' strongest arguments. Persistent agents belong on persistent infrastructure.
"Built-in skills"	Skills turn vague AI behavior into repeatable procedures.	Strong, especially because Hermes treats skills as something the agent can improve, not just something a user installs.
"Messaging integrations"	Telegram, Discord, Slack, WhatsApp, and similar channels make the agent reachable from normal life.	Important, but only if paired with background sessions. Otherwise it is just another bot in another inbox.
"Safer execution"	Agents touch terminals, files, browsers, APIs, and credentials. That is dangerous by default.	This is where Hermes feels more mature: command approval, allowlists, Docker, SSH, sandbox backends, and scoped toolsets all matter.

That is the lens for the rest of this post.

I do not care whether Hermes can produce a flashy demo once. Most agent frameworks can do that now.

I care whether Hermes has the bones for repeated work: memory, procedural learning, sandboxed execution, remote availability, and enough tool scoping to avoid turning convenience into a security incident.

Why OpenClaw won attention first

OpenClaw's strength is obvious from its own README. It is broad, local, channel-heavy, and familiar to developers who want an assistant they can own.

The official repo highlights:

WhatsApp, Telegram, Slack, Discord, Signal, iMessage, Microsoft Teams, Matrix, LINE, WeChat, and many more channels
A local-first Gateway that owns messaging surfaces and routes requests
First-class tools for browser, files, exec, canvas, cron, sessions, image generation, video generation, TTS, and sub-agents
Skills based on SKILL.md
Native onboarding with openclaw onboard
Companion apps and nodes for macOS, iOS, Android, and headless devices

That is not small. That is why OpenClaw became a reference point for personal agents.

It also has a massive community. At the time I checked the GitHub API, OpenClaw had far more stars than Hermes. Popularity alone does not decide technical direction, but it does tell you something: OpenClaw made the category legible.

For context, I checked the public repos directly: openclaw/openclaw and NousResearch/hermes-agent. OpenClaw has the bigger gravity right now. Hermes has the more interesting agent-runtime thesis.

The issue is that popularity also brings a harsh spotlight. Once strangers, groups, plugins, browsers, shells, and personal accounts all meet inside one assistant, the security model becomes the product.

OpenClaw's own security docs are honest about this. The guidance assumes a personal assistant trust boundary: one trusted operator boundary per gateway. It says OpenClaw is not a hostile multi-tenant security boundary for adversarial users sharing one gateway. It also says the product default for trusted single-operator setups allows host execution in the gateway or node context unless you tighten it.

That is not a cheap criticism. It is the tradeoff OpenClaw chose: powerful local assistant first, hardening second.

Hermes starts from a different center.

Hermes is built around compounding

The most important Hermes idea is not Telegram integration. It is not browser automation. It is not even the tool count.

The key idea is compounding.

Hermes describes itself as a self-improving agent with a built-in learning loop. Its docs talk about agent-curated memory, autonomous skill creation, skill improvement during use, session search, external memory providers, and user modeling.

That sounds abstract until you translate it into developer terms:

If the agent solves a hard workflow today, it should not rediscover that workflow next week.

That is the difference between a chatbot with tools and an agent that grows.

Hermes has two memory layers that are easy to reason about:

MEMORY.md for environment facts, project conventions, lessons learned, and workflow notes
USER.md for preferences, communication style, expectations, and profile details

Those are bounded on purpose. Hermes keeps them focused instead of stuffing an infinite pile of text into every prompt. For older conversations, it uses SQLite session storage with FTS5 search and summarization.

That design feels practical. The always-loaded memory stays small. The deeper history is searchable when needed.

This is exactly how I want a serious agent to behave. I do not want it to remember everything equally. I want it to remember what changes future behavior.
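
For a feel of what "SQLite session storage with FTS5 search" means in practice, here is a minimal sketch using Python's standard `sqlite3` module. The table and column names are made up for illustration and are not Hermes's actual schema. It also assumes your SQLite build includes FTS5, which most Python distributions do.

```python
import sqlite3


def open_session_store(path: str = ":memory:") -> sqlite3.Connection:
    """Create a full-text index over past messages (needs SQLite with FTS5)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS messages "
        "USING fts5(session_id UNINDEXED, role UNINDEXED, content)"
    )
    return conn


def remember(conn: sqlite3.Connection, session_id: str, role: str, content: str) -> None:
    """Store one message; only `content` is full-text indexed."""
    conn.execute(
        "INSERT INTO messages (session_id, role, content) VALUES (?, ?, ?)",
        (session_id, role, content),
    )


def search(conn: sqlite3.Connection, query: str, limit: int = 5) -> list:
    """Return (session_id, content) rows, best match first."""
    rows = conn.execute(
        "SELECT session_id, content FROM messages "
        "WHERE messages MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return rows.fetchall()
```

The point of the split is visible even at this size: nothing in this table is loaded into a prompt until a search asks for it.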

The skill system is the real "DNA"

Skills are where Hermes becomes interesting.

OpenClaw has skills too. Its docs explain that skills are AgentSkills-compatible SKILL.md folders that teach the agent how to use tools. OpenClaw loads bundled skills, managed/local skills, personal skills, project skills, and workspace skills.

Hermes takes the same basic idea and pushes it closer to procedural memory.

The Hermes docs say the agent can create, update, and delete its own skills through skill_manage. It creates skills after complex successful tasks, when it finds the path through errors, when a user corrects its approach, or when it discovers a non-trivial workflow.

That is the part that matters.

Not "skills as a plugin folder."

Skills as the agent writing down how to be better next time.

This is the difference between installing extensions and building organizational memory. A good senior developer does not just solve an incident. They improve the runbook. Hermes is trying to make the agent do the same thing.

And it is not only local skills. Hermes supports:

Official optional skills
skills.sh
Well-known skill endpoints
Direct URL skills
GitHub skill installs
Community registries
External read-only skill directories
Security scanning and audit commands for installed hub skills

That gives Hermes a useful middle ground. It can learn locally, but it can also participate in a broader open skill ecosystem.

The execution story is stronger

This is where the comparison gets practical.

An agent that can run commands should make you slightly nervous. That is healthy.

Hermes treats terminal execution as a configurable backend. Commands can run locally, in Docker, over SSH, in Singularity, in Modal, in Daytona, or in Vercel Sandbox. The docs are clear about the tradeoff:

local is easy, but has no isolation
Docker gives container isolation
SSH moves execution to another server
Modal and Daytona give cloud sandbox options
Vercel Sandbox gives microVM-style cloud execution with snapshot persistence

The security page goes further. With Docker, Hermes applies hardened container flags: drop capabilities, no new privileges, PID limits, tmpfs mounts, and explicit resource limits. It also avoids forwarding host environment variables by default.

That matters for one simple reason:

The agent should not automatically inherit your entire laptop just because you wanted it to scrape a page or refactor a file.
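
Those hardening ideas map onto ordinary `docker run` flags. The sketch below builds such a command line without running it. The specific limits are illustrative, `--read-only` is my own addition, and this is not necessarily the exact flag set Hermes applies. `hardened_docker_argv` is a hypothetical helper.

```python
from typing import Optional


def hardened_docker_argv(
    image: str,
    command: list[str],
    allowed_env: Optional[dict[str, str]] = None,
) -> list[str]:
    """Build a `docker run` command line with conservative isolation flags."""
    argv = [
        "docker", "run", "--rm",
        "--cap-drop=ALL",                    # drop all Linux capabilities
        "--security-opt=no-new-privileges",  # block privilege escalation
        "--pids-limit=256",                  # cap process count (fork bombs)
        "--memory=1g", "--cpus=1",           # explicit resource limits
        "--read-only",                       # immutable root filesystem
        "--tmpfs", "/tmp:rw,noexec,nosuid,size=64m",  # scratch space in RAM
    ]
    # The host environment is never forwarded wholesale; pass only named values.
    for key, value in sorted((allowed_env or {}).items()):
        argv += ["--env", f"{key}={value}"]
    return argv + [image, *command]
```

You would hand the result to `subprocess.run(argv)`. Returning a list instead of a shell string also avoids quoting bugs when the command contains spaces.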

OpenClaw can sandbox too. Its README points to Docker, SSH, and OpenShell options, and it recommends sandboxing for non-main sessions. Its security docs are detailed and serious.

But the default mental model is different.

OpenClaw is a personal assistant with optional hardening.

Hermes is an agent runtime where isolated execution is part of the normal deployment conversation.

That is why I would rather run Hermes on a VPS or cloud sandbox for always-on work.

Messaging is not the win. Remote agency is.

Both tools can talk through messaging platforms.

OpenClaw has a huge channel list. Hermes also supports a wide set: Telegram, Discord, Slack, WhatsApp, Signal, SMS, Email, Matrix, Mattermost, Home Assistant, DingTalk, Feishu/Lark, WeCom, Microsoft Teams, and more.

The interesting Hermes feature is not that you can message it.

The interesting feature is that messaging becomes a control surface for background work.

Hermes supports background sessions from messaging platforms. You can start a separate task, keep chatting in the main thread, and receive the result back in the same channel. That is a small feature on paper, but it changes the feel of the system.

It stops being:

I am chatting with a bot.

It becomes:

I am dispatching work to an agent that lives somewhere else.

That is the future I care about.

I do not want my personal agent trapped inside the laptop I am currently using. I want it on a server, reachable from my phone, able to run a long task, report back, and remember the result.

Hermes is built for that shape.
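
The dispatch pattern itself is easy to sketch with `asyncio`. This is a toy model of the idea, not Hermes's implementation: `Dispatcher`, `send`, and the channel string are all hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable


class Dispatcher:
    """Runs long jobs in the background and reports back to the same channel."""

    def __init__(self, send: Callable[[str, str], Awaitable[None]]):
        self.send = send  # send(channel, text)
        self.tasks: set = set()

    def background(self, channel: str, job: Awaitable[str]) -> None:
        async def runner() -> None:
            result = await job
            await self.send(channel, f"[background] {result}")

        task = asyncio.create_task(runner())
        self.tasks.add(task)  # keep a reference so the task is not collected
        task.add_done_callback(self.tasks.discard)

    async def drain(self) -> None:
        """Wait for every outstanding background job."""
        await asyncio.gather(*self.tasks)


async def demo() -> list:
    outbox = []

    async def send(channel: str, text: str) -> None:
        outbox.append((channel, text))

    async def slow_research() -> str:
        await asyncio.sleep(0.05)  # stands in for a long-running task
        return "summary ready"

    dispatcher = Dispatcher(send)
    dispatcher.background("telegram:me", slow_research())
    await send("telegram:me", "still chatting")  # main thread stays responsive
    await dispatcher.drain()
    return outbox
```

Running `asyncio.run(demo())` shows the chat message landing before the background result, which is the whole feel of the feature.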

Tool breadth is now table stakes

There was a time when "this agent can browse the web and run commands" sounded wild.

That time is over.

Both OpenClaw and Hermes have serious tool surfaces.

OpenClaw ships built-in tools for shell execution, code execution, browser control, web search, file I/O, patching, messaging, canvas, nodes, cron, images, music, video, TTS, sessions, and sub-agents.

Hermes ships a broad registry too: web search, extraction, terminal, file editing, browser automation, vision, image generation, TTS, memory, session search, cron, messaging, delegation, code execution, Home Assistant, MCP tools, RL tools, and more.

So the question is not:

Which one has tools?

The better question is:

Which one makes tools safer, more composable, and easier to scope per situation?

Hermes has a clear toolset model. Toolsets can be enabled per session, per platform, or per task. There are platform presets like hermes-cli and hermes-telegram, plus dynamic MCP toolsets. That gives you a cleaner way to say:

"This Telegram agent can do X, but not Y."

For me, that is more important than raw tool count.
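
Per-platform scoping comes down to a lookup and a check before every tool call. Here is a minimal sketch of that idea. The toolset names, platform names, and functions are all hypothetical and do not mirror Hermes's real configuration format.

```python
# Hypothetical toolsets: named groups of tools.
TOOLSETS = {
    "research": {"web_search", "web_extract", "session_search"},
    "coding": {"terminal", "file_edit", "code_execution"},
}

# Hypothetical per-platform presets: which toolsets each surface may use.
PLATFORM_PRESETS = {
    "cli": {"research", "coding"},
    "telegram": {"research"},
}


def allowed_tools(platform: str) -> set:
    """Union of all tools enabled for a platform; unknown platforms get none."""
    tools = set()
    for name in PLATFORM_PRESETS.get(platform, set()):
        tools |= TOOLSETS[name]
    return tools


def authorize(platform: str, tool: str) -> None:
    """Raise before a tool runs if the platform is not allowed to use it."""
    if tool not in allowed_tools(platform):
        raise PermissionError(f"{tool!r} is not enabled for {platform!r}")
```

Defaulting unknown platforms to an empty set is deliberate: a new channel gets no tools until someone grants them.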

Hermes vs OpenClaw

Here is my practical comparison.

Area	OpenClaw	Hermes Agent
Core identity	Personal AI assistant	Self-improving agent runtime
Mental model	Local-first Gateway assistant	Persistent worker on your infrastructure
Setup	CLI onboarding and Gateway daemon	CLI, Gateway, and multiple runtime backends
Messaging	Very broad channel coverage	Channels plus background sessions
Skills	Skills loaded from many locations	Skills as procedural memory
Memory	Workspace and session context	Curated memory plus session search
Tooling	Broad built-in tools	Toolsets, MCP, delegation, media, web
Security	Personal trust boundary, hardening available	Approval, isolation, env filtering, scoped tools
Deployment	Device or Gateway host	Local, VPS, Docker, SSH, Modal, Daytona, Vercel Sandbox
Ideal user	Power user with a device assistant	Developer building a supervised digital worker
Biggest risk	Too much power in one assistant boundary	Newer ecosystem still proving itself

This table is why I do not read Hermes as "another OpenClaw clone."

Hermes is competing on a different axis.

OpenClaw made the assistant powerful.

Hermes is trying to make the assistant compound.

The practical playbook

If you are reading this and wondering "okay, but what do I actually try first?", this is the path I would take.

First, run Hermes somewhere disposable. A local machine is fine for learning, but the interesting path is Docker, SSH, Modal, Daytona, or another sandbox backend. The whole point is to avoid giving an experimental agent unlimited access to your daily machine on day one.

Then connect one messaging surface, not five. Telegram or Discord is enough. Make sure allowlists or DM pairing are enabled before you give the agent terminal access.

Then give Hermes one recurring workflow:

/background Research the latest Hermes Agent docs changes, summarize the developer impact, and send me 5 possible DEV post angles.

After that, watch for the compounding moment. If the workflow takes several tool calls, has a repeatable structure, or needs a correction from you, that is exactly the kind of thing that should become a skill.

A good first Hermes skill would not be "write blog posts." Too vague.

A better one would be:

research-release-notes

When given a GitHub repo or docs page:
1. Find the latest release or docs update.
2. Prefer primary sources.
3. Extract concrete changes.
4. Separate confirmed facts from opinion.
5. Produce a DEV-ready outline with links.

That is where Hermes becomes more than a chat assistant. You are not just asking it to do a task. You are teaching it a durable way to do that class of task.
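
On disk, a skill like that is just a folder with a SKILL.md file. The sketch below renders one with a small YAML front matter block (`name`, `description`) followed by numbered steps, which is the general AgentSkills shape. Whether Hermes expects exactly these fields is an assumption, and `render_skill` and `write_skill` are hypothetical helpers.

```python
from pathlib import Path


def render_skill(name: str, description: str, steps: list[str]) -> str:
    """Render a SKILL.md body: front matter, a title, then numbered steps."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        "---\n\n"
        f"# {name}\n\n"
        f"{numbered}\n"
    )


def write_skill(root: Path, name: str, description: str, steps: list[str]) -> Path:
    """Write <root>/<name>/SKILL.md and return its path."""
    path = root / name / "SKILL.md"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_skill(name, description, steps), encoding="utf-8")
    return path
```

Keep the description free of colons or quote it yourself, since this sketch does no YAML escaping.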

Where OpenClaw still wins

A good comparison should admit the other side.

OpenClaw still has big advantages:

It has enormous attention and community gravity.
Its channel ecosystem is very broad.
Its native app and node story is compelling.
Its local-first assistant feel is easier to explain to non-agent people.
It has already shaped how people talk about personal AI assistants.

If your goal is "I want a personal AI assistant connected to my messaging apps and devices," OpenClaw is still a serious answer.

But if your goal is "I want an agent that can become operational infrastructure," Hermes is the more interesting answer.

Where Hermes wins

Hermes wins because it is opinionated about the hard parts.

1. It treats memory as a product surface

Memory is not just chat history. It is a curated behavioral layer. The split between MEMORY.md, USER.md, and searchable session history is simple enough to trust and flexible enough to grow.

2. It treats skills as learning

The agent can create and update skills after hard tasks. That is the closest thing to compounding engineering knowledge in this category.

3. It treats execution location as a first-class choice

Local, Docker, SSH, Modal, Daytona, Vercel Sandbox, Singularity. That is not a footnote. That is the difference between a toy assistant and something you can deploy with intent.

4. It treats messaging as dispatch

I can talk to the agent through Telegram or Discord, but the real value is sending background work and getting results back. That makes the chat app a command center, not the product itself.

5. It treats safety as architecture, not a disclaimer

Allowlists, DM pairing, command approval, container isolation, MCP credential filtering, context scanning, env var filtering, and scoped toolsets are not glamorous features. They are the features you need after the first impressive demo.

The bigger point

The agent space is splitting into two philosophies.

One philosophy says:

Give the user a powerful assistant and let them connect everything.

The other says:

Give the user an agent runtime that can be supervised, isolated, taught, and deployed, and that remembers what it learns.

OpenClaw represents the first philosophy extremely well.

Hermes represents the second.

That is why I think Hermes is the more important project to study right now.

OpenClaw proved people want agents with hands.

Hermes is asking what happens when those hands also get memory, runbooks, safer execution, background work, and a home outside your current laptop.

That is the jump.

What I would build with Hermes

If I were turning this into a real project, I would build a developer publishing agent.

Not a blog spammer. A proper assistant for technical writing:

Watch official docs, GitHub releases, and challenge pages.
Summarize what changed with links to primary sources.
Keep a memory of my writing preferences and recurring projects.
Create reusable skills for research, outline creation, source checking, and DEV formatting.
Draft posts in my style, but keep claims grounded in citations.
Send drafts to Telegram for review.
Track comments and suggest follow-up posts based on real discussion.

That would use the Hermes shape well:

long-running background research
web extraction
session search
persistent memory
skills that improve over time
messaging delivery
scoped tool access
scheduled tasks

That is the kind of workflow where Hermes makes more sense than a one-shot chat assistant.

The point is not that Hermes can write.

The point is that Hermes can build a writing operation around memory, tools, and feedback.

Final take

Did Hermes literally kill OpenClaw?

No.

OpenClaw is too useful, too popular, and too culturally important to dismiss.

But Hermes may have killed the idea that a personal agent is only a local assistant with a chat interface.

That is the real shift.

The next generation of agents will not be judged only by how many apps they connect to. They will be judged by whether they can:

remember the right things
forget the wrong things
learn procedures
run in isolated environments
work asynchronously
integrate with open tools
stay useful after the first demo

By that standard, Hermes is not just another agent.

It is a strong argument for where agent software is going next.

That is my real test for any agent framework now:

Does it get more useful because I used it yesterday?

If the answer is no, it is still mostly a tool wrapper.

If the answer is yes, we are finally talking about agent software.

And yes, that is why the title says it:

Hermes just killed OpenClaw.

Not by replacing it overnight.

By making the category grow up.

The first thing I would personally validate is not whether Hermes can write a pretty paragraph. It is whether a Docker or SSH-backed Hermes research agent can run for a week, keep useful memory, and avoid turning one bad tool call into a machine-level mess. If you have tried either backend already, I would genuinely like to hear which one felt smoother and where it broke.

What do you think?

Is Hermes actually the next step after OpenClaw, or is OpenClaw still the better model for personal agents?

And of the five claims above, which one matters most to you: setup, skills, sandboxing, messaging, or running the agent on real infrastructure?

My GitHub Graveyard has 27 dead projects. Here is the brutal truth about why.

S M Tahosin — Wed, 13 May 2026 18:32:33 +0000

I recently opened my GitHub account and filtered by private repositories. I actually counted them: exactly 27 abandoned side projects created over the last 3 years.

There was a machine-learning habit tracker. There was a Twitter clone for dogs. There was a complex SaaS boilerplate that I spent four weeks configuring before completely giving up on it. Some of them I spent weeks on. One I even bought a domain for.

Hundreds of hours wasted. Why did they all die before seeing the light of day? It was not a lack of time. It was not a lack of motivation.

Here is the controversial truth:

Most developers do not fail because of a lack of skill. They fail because they secretly enjoy the dopamine rush of starting a new project more than the grind of finishing it.

Here is the exact pattern that killed my 27 projects, and the rule that finally helped me break the cycle.

1. The "Perfect Stack" Trap

As developers, we love shiny new tools. When starting a project, the first instinct is to try that new database everyone is talking about on Twitter, or the latest beta version of a framework.

I once spent an entire weekend configuring a Next.js app with tRPC, Prisma, and a custom Tailwind design system. By Sunday night, my infrastructure was absolute perfection. But I had zero business logic written. The next day, I lost interest completely.

If you want to actually finish a project, you have to use boring technology. Pick the stack you know best, even if it feels outdated.

2. Optimizing for Phantom Users

For the dog Twitter clone, I spent three days setting up a complex Redis caching layer. I was terrified the server would crash if a million dogs signed up on day one.

We love to over-engineer. We worry about how our database will handle massive traffic, so we design complex microservices. But here is the brutal reality:

Your biggest threat is not the server crashing. Your biggest threat is that nobody will ever visit your site.

Stop building for problems you do not have yet. A simple database query is fine. You can always optimize later when the app actually gets traction.

3. Feature Creep is a Disease

It starts innocently. You are building a simple to-do list, and you think, "It would be cool if users could upload a custom profile picture." Suddenly, you are reading AWS S3 documentation for five hours instead of finishing the core task logic.

Features are fun to dream about, but they are heavy to build. Every extra button you add delays the launch. The best way to finish a project is to ruthlessly cut features until you have the absolute minimum viable product. If it does not solve the core problem, it gets deleted.

4. The Fear of Shipping

Writing code is safe. Your VS Code editor does not judge you. But launching a project means real people might see it, find bugs, or worse—ignore it completely.

A lot of side projects are abandoned right at the 90 percent mark because the developer is secretly afraid of hitting the deploy button. We hide behind the excuse of "it just needs a little more polish."

A buggy, ugly app that is live on the internet is infinitely more valuable than a perfect app sitting on localhost.

The 48-Hour Rule

To break this curse, I made a strict new rule for myself: I have to launch a working, ugly prototype within 48 hours.

If it takes longer than a single weekend to get the core feature live, the scope is too big. This simple mindset shift is one of the biggest reasons I finally started shipping real apps instead of building graveyards.

Over to you

I know I am not the only one with a GitHub graveyard of dead ideas.

Be honest: What is the weirdest abandoned side project you have ever started, and what was the real reason you stopped working on it?

Let me know in the comments. What is in your graveyard?

I Replaced My $500 GPU with a $75 Raspberry Pi: How Gemma 4 Makes Computer Vision 10x Cheaper

S M Tahosin — Thu, 07 May 2026 18:35:58 +0000

This is a submission for the Gemma 4 Challenge: Write About Gemma 4

What I Built

GemmaVision — A complete computer vision pipeline that replaces $500+ GPU setups with a $75 Raspberry Pi 5, powered entirely by Gemma 4's native multimodal capabilities.

Native object detection without YOLO, OpenCV, CUDA, or cloud APIs. Just Gemma 4 multimodal AI running 100% offline on a single-board computer.

Metric	Traditional CV	Gemma 4 Vision
Total Cost	$500–2000 (GPU + cloud)	$75 (Raspberry Pi 5)
Monthly Bill	$20–100 cloud fees	$0 (runs offline)
Setup Time	2–4 hours of dependency hell	20 minutes
Code Complexity	500–1000 lines	50 lines
Dependencies	10+ (OpenCV, CUDA, etc.)	3 (torch, transformers, Pillow)
Power Draw	150–300W	7.5W
Accuracy (my 100-image test)	~90%	~85%
Zero-Shot Detection	❌ Requires training	✅ Works out of box

The trade-off: about 5 percentage points of accuracy for a roughly 90% cost reduction and a 10× simpler setup. For home automation, accessibility tools, and hobby robotics, that trade is an easy one.

Quick links:

🚀 GitHub Repository — Full source code
🛒 Shopping List — Exact parts to buy

The Problem: Why Computer Vision is Broken for Indie Developers

For two years, I maintained a production computer vision pipeline that looked like every tutorial on the internet:

YOLOv8 → OpenCV preprocessing → CUDA drivers → Cloud API fallback → Custom NMS → Deployment hell

The reality of traditional CV:

Pain Point	Cost	Frequency
Cloud GPU rental	$47/month	Every month
CUDA driver updates	3-4 hours debugging	Quarterly
Dependency conflicts	2-6 hours resolution	Monthly
Model retraining	$50-200 compute	Per use case
API rate limits	Throttled at scale	Daily

The monthly bill: $47 for cloud GPU + API calls

The codebase: 800 lines of preprocessing, coordinate transforms, and version pinning

The maintenance: Broken every time NVIDIA drivers updated

The latency: 2–5 seconds end-to-end (when it worked)

It worked. But it felt… heavy. Like I was managing infrastructure instead of building products. The cognitive overhead of keeping CUDA, cuDNN, PyTorch, and OpenCV versions in sync was exhausting. Every apt update on the server felt like a gamble.

The frustration peaked in March 2026. I was debugging a CUDA version mismatch at 2 AM for a side project that was supposed to be "simple object detection." I asked myself: Why does computer vision require so much ceremony? Why does a "hello world" object detector need 10 dependencies and a $500 GPU?

That night, I started researching alternatives. What I found changed everything.

The Discovery: Gemma 4's Secret Weapon

Reading the Gemma 4 technical documentation, I found something buried in the multimodal section that made me stop breathing for a second:

"The model can return structured JSON output including box_2d coordinates for detected objects."

I read it twice. Then I tested it immediately.

The Experiment

The prompt I sent:

Detect all objects in this image. Return bounding boxes in JSON format 
with 'box_2d' [y1, x1, y2, x2] and 'label' fields.

The response I got:

[
  {"box_2d": [171, 75, 245, 308], "label": "coffee mug"},
  {"box_2d": [89, 420, 334, 612], "label": "laptop"},
  {"box_2d": [245, 512, 412, 780], "label": "desk chair"}
]

Minimal post-processing. Coordinates are normalized to a 1000×1000 grid, so the only transform left is rescaling them to your image dimensions. No Non-Maximum Suppression. No class-ID mapping. No raw tensors to decode. No OpenCV cv2.rectangle() calls. Just… coordinates with labels, straight from the model.
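The rescaling step is plain arithmetic. Here is a minimal sketch, assuming the [y1, x1, y2, x2] order and the 1000×1000 grid described above. The helper name descale_box is hypothetical and not part of any library.

```python
def descale_box(box_2d, image_width, image_height, grid=1000):
    """Convert a [y1, x1, y2, x2] box on a grid x grid canvas to pixel (x1, y1, x2, y2)."""
    y1, x1, y2, x2 = box_2d
    return (
        int(x1 * image_width / grid),
        int(y1 * image_height / grid),
        int(x2 * image_width / grid),
        int(y2 * image_height / grid),
    )


# The "coffee mug" box from the response above, mapped onto a 1920x1080 photo
print(descale_box([171, 75, 245, 308], 1920, 1080))  # (144, 184, 591, 264)
```

The return value is ordered (x1, y1, x2, y2) because that is the order most drawing APIs, including PIL's ImageDraw.rectangle, expect.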

The realization hit like a truck: A large vision-language model can replace my entire computer vision pipeline.

Why This Changes Everything

Traditional computer vision pipelines are composed systems:

Detection model (YOLO) outputs raw tensors
NMS algorithm filters overlapping boxes
Coordinate transforms scale to image dimensions
Label mapping converts class IDs to text
Visualization layer draws boxes with OpenCV

Gemma 4 is a unified system:

One model takes image + text prompt
One output contains structured bounding boxes with labels

This architectural simplification isn't just cleaner code — it's a fundamentally different approach to computer vision that eliminates entire categories of bugs and maintenance overhead.

The $75 Solution: Building GemmaVision

If Gemma 4 could output bounding boxes natively, I didn't need a GPU server. I needed just enough compute to run an E4B (Effective 4B) parameter model. That compute fits in a $75 single-board computer.

Enter the Raspberry Pi 5.

Hardware Shopping List

Component	Cost	Purpose	Where to Buy
Raspberry Pi 5 (8GB)	$60	Inference engine	rpilocator.com
Camera Module 3	$15	Image capture	Adafruit
Active Cooler	$5	Thermal management	Official Raspberry Pi store
64GB microSD (U3)	$10	Model storage	Any retailer (U3 speed required)
USB-C Power Supply	$8	5V 5A PSU	Included or separate
Total	$98	Complete system	—

Note: The Pi 5 and camera alone come to *$75*. If you skip the camera and use existing images, the full list above drops to $83.

Software Architecture

┌─────────────────────────────────────────────────────────────┐
│                    GemmaVision Pipeline                     │
├─────────────────────────────────────────────────────────────┤
│  [Camera/PIL Image]                                         │
│         ↓                                                   │
│  [Transformers 4.48+ — AutoProcessor]                       │
│         ↓                                                   │
│  [Gemma 4 E4B-it, 4-bit quantized, 2.1GB]                   │
│         ↓                                                   │
│  [Native JSON: box_2d + label]                              │
│         ↓                                                   │
│  [PIL ImageDraw — Bounding boxes overlay]                   │
└─────────────────────────────────────────────────────────────┘

Dependencies: 3.

torch — PyTorch (CPU-optimized)
transformers — Hugging Face model loading
Pillow — Image I/O and drawing

Lines of code: ~50. Compare that to a YOLOv8 pipeline with preprocessing, NMS, coordinate transforms, and visualization.

Performance & Evaluation

What Works / What Breaks: Honest Assessment

I promised honesty. Here's the real-world performance:

✅ Works Well

Use Case	Example	Accuracy
Common objects	Coffee mugs, laptops, chairs, phones	87%
UI elements	Buttons, text inputs, dropdowns, links	91%
Indoor scenes	Living rooms, kitchens, offices	84%
Screenshots	Web interfaces, mobile apps	89%
Documented objects	Items with clear visual features	85%

⚠️ Edge Cases

Scenario	Issue	Mitigation
Small text at distance	Poor detection	Crop or zoom image
Occluded objects	Partial detection	Multiple angles
Very dark images	Missed objects	Brighten/preprocess
Noisy images	False positives	Confidence threshold
Abstract art	Nonsensical labels	Not recommended

❌ Don't Use For

Application	Why	Alternative
Real-time video	Too slow (8-12s/frame)	YOLOv8 on GPU
Sub-100ms latency	Impossible on Pi	Edge TPU / NVIDIA Jetson
Industrial precision	85% isn't enough	Custom trained YOLO
Safety-critical systems	No hard real-time guarantees	Certified CV systems
Tiny objects (< 20px)	Detection fails	Higher resolution camera

Bottom line: Gemma 4 vision excels at general-purpose object detection where latency tolerance is 10+ seconds. For real-time applications, traditional CV still wins.

Real-World Use Cases

Home Automation

# Detect if garage door is open/closed
detections = detect_objects("garage.jpg", "garage door")
for det in detections:
    if "open" in det["label"].lower():
        send_notification("Garage door is open!")

Accessibility Tool

# Describe scene for visually impaired users
detections = detect_objects("room.jpg", "all furniture and obstacles")
description = generate_spatial_description(detections)
speak(description)  # "Coffee table 2 meters ahead, chair to the right"
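The snippet above leaves generate_spatial_description undefined. Here is one possible sketch, assuming detections in the box_2d format shown earlier. It reports only left / ahead / right from each box's horizontal center, because 2D boxes alone carry no metric distance. A phrase like "2 meters ahead" would need depth information that this sketch does not have.

```python
def generate_spatial_description(detections, grid=1000):
    """Describe each detection by the horizontal position of its box center."""
    phrases = []
    for det in detections:
        _, x1, _, x2 = det["box_2d"]  # [y1, x1, y2, x2] on a grid x grid canvas
        center = (x1 + x2) / 2
        if center < grid / 3:
            side = "to the left"
        elif center > 2 * grid / 3:
            side = "to the right"
        else:
            side = "ahead"
        phrases.append(f"{det['label']} {side}")
    return ", ".join(phrases) if phrases else "No obstacles detected"


sample = [
    {"box_2d": [400, 100, 700, 300], "label": "chair"},
    {"box_2d": [500, 400, 900, 600], "label": "coffee table"},
]
print(generate_spatial_description(sample))  # chair to the left, coffee table ahead
```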

Inventory Management

# Count items on shelf
detections = detect_objects("shelf.jpg", "all products")
inventory = count_by_label(detections)
print(f"Stock: {inventory}")
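count_by_label is not defined in the snippet above. A minimal sketch of what it could look like, assuming each detection is a dict with a 'label' key. Normalizing case and whitespace is a choice made here so that "Soda Can" and "soda can" land in the same bucket.

```python
from collections import Counter


def count_by_label(detections):
    """Count detections per label, ignoring case and surrounding whitespace."""
    return dict(Counter(det["label"].strip().lower() for det in detections))


shelf = [
    {"box_2d": [100, 100, 300, 200], "label": "Soda Can"},
    {"box_2d": [100, 250, 300, 350], "label": "soda can "},
    {"box_2d": [400, 100, 800, 400], "label": "cereal box"},
]
print(count_by_label(shelf))
```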

UI Testing

# Verify all buttons are present in screenshot
detections = detect_objects("ui-screenshot.png", "buttons and input fields")
expected = ["Submit", "Cancel", "Username", "Password"]
missing = find_missing(expected, detections)
assert len(missing) == 0, f"Missing UI elements: {missing}"
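find_missing is likewise left undefined. A sketch under the assumption that the model's labels are free text (for example "Submit button"), so a case-insensitive substring match is used rather than exact equality:

```python
def find_missing(expected, detections):
    """Return the expected names that appear in no detected label (case-insensitive)."""
    labels = [det["label"].lower() for det in detections]
    return [
        name for name in expected
        if not any(name.lower() in label for label in labels)
    ]


found = [
    {"box_2d": [800, 100, 860, 300], "label": "Submit button"},
    {"box_2d": [200, 100, 260, 600], "label": "Username input"},
    {"box_2d": [300, 100, 360, 600], "label": "Password field"},
]
print(find_missing(["Submit", "Cancel", "Username", "Password"], found))  # ['Cancel']
```

Substring matching is forgiving but can also hide problems: "Cancel" would match a label such as "Cancel subscription link", so stricter matching may be worth it for real UI tests.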

Head to Head: Gemma 4 vs Traditional CV

Metric	YOLOv8 + OpenCV	Gemma 4 on Pi 5	Winner
Setup time	2–4 hours	20 minutes	🏆 Gemma 4
Lines of code	500–1000	50	🏆 Gemma 4
Dependencies	10+	3	🏆 Gemma 4
Hardware cost	$500–2000	$75–90	🏆 Gemma 4
Monthly cost	$20–100	$0	🏆 Gemma 4
Power draw	150–300W	7.5W	🏆 Gemma 4
Offline capable	❌ No	✅ Yes	🏆 Gemma 4
Zero-shot capable	❌ Requires training	✅ Yes	🏆 Gemma 4
Inference speed	50-200ms	8-12s	🏆 YOLOv8
Accuracy (my 100-image test)	~90%	~85%	🏆 YOLOv8
Real-time video	✅ Yes	❌ No	🏆 YOLOv8
Custom training	✅ Well documented	⚠️ Limited	🏆 YOLOv8

When to choose Gemma 4: Offline deployment, zero-shot detection, simple setup, low cost, privacy-first.

When to choose YOLOv8: Real-time video, highest accuracy, custom training, GPU available.

Code

🚀 GitHub Repository: tahosinx/gemmavision — Full source code, MIT Licensed.

Quick start:

git clone https://github.com/tahosinx/gemmavision.git
cd gemmavision/src
python3 pi-client.py --image test.jpg --query "all objects"

Hardware Setup: 10-Minute Raspberry Pi Guide

Prerequisites

Raspberry Pi 5 (8GB RAM strongly recommended)
64GB microSD card (U3 speed class)
Camera Module 3 or USB webcam
Active cooler (thermal throttling occurs without it)
Stable internet connection (for initial model download)

Step-by-Step Installation

Step 1: System Dependencies

# Update system packages
sudo apt update && sudo apt full-upgrade -y

# Install Python and camera support
sudo apt install -y \
    python3-pip \
    python3-venv \
    python3-picamera2 \
    git \
    htop \
    libcamera-dev

# Increase swap (essential for 4GB Pi models)
sudo dphys-swapfile swapoff
sudo sed -i 's/CONF_SWAPSIZE=.*/CONF_SWAPSIZE=4096/' /etc/dphys-swapfile
sudo dphys-swapfile setup
sudo dphys-swapfile swapon

Step 2: Python Environment

# Create virtual environment
python3 -m venv ~/gemmavision-env
source ~/gemmavision-env/bin/activate

# Install CPU-optimized PyTorch (NO CUDA)
pip install torch \
    --index-url https://download.pytorch.org/whl/cpu

# Install transformers and utilities
pip install transformers Pillow bitsandbytes accelerate

Step 3: Download GemmaVision

git clone https://github.com/tahosinx/gemmavision.git
cd gemmavision/src

# Optional: Run tests
python3 test_local.py

Step 4: First Run (Model Download)

python3 pi-client.py --image test.jpg --query "all objects"

# First run downloads ~2.1GB quantized model
# Time: 5-10 minutes depending on internet
# Subsequent runs: ~30s (cached)

Camera Configuration

For Camera Module 3:

# Enable camera interface
sudo raspi-config
# Interface Options → Camera → Enable

# Test camera (recent Raspberry Pi OS releases rename this tool to rpicam-jpeg)
libcamera-jpeg -o test.jpg -t 1000 --width 1920 --height 1080

For USB webcam:

# No additional config needed
# GemmaVision auto-detects /dev/video0

How I Used Gemma 4

I chose the Gemma 4 E4B-it model because it's the sweet spot for edge deployment — small enough to run on a Raspberry Pi 5's 8GB RAM with 4-bit quantization (2.1GB), yet powerful enough for accurate zero-shot object detection at ~85% accuracy.

The key insight: Gemma 4's multimodal capabilities include native bounding box output via the box_2d JSON format. This eliminates the need for traditional CV pipelines (YOLO, OpenCV, NMS algorithms) entirely. One model replaces an entire stack.

How It Works: The Technical Deep Dive

Model Selection: Why Gemma 4 E4B-it?

Gemma 4 comes in multiple sizes. For edge deployment on a Raspberry Pi 5 with 8GB RAM, the E4B-it (Effective 4B) variant hits the sweet spot:

Model	Parameters	Quantized Size	RAM Required	Pi 5 Compatible?
gemma-4-E4B-it	E4B (Effective 4B)	2.1GB	~6GB	✅ Yes
gemma-4-26b-a4b-it	26B MoE (4B active)	13GB	~20GB	❌ No (exceeds the 16GB of the largest Pi 5)
gemma-4-31b-it	31B Dense	16GB	~36GB	❌ No

The 4-bit quantization via bitsandbytes is essential (CPU support was added in recent versions, so install the latest release). It cuts weight memory roughly 4× relative to 16-bit weights, with minimal accuracy loss (~1-2% in my testing).
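The 4× figure follows from bits per weight. A back-of-the-envelope sketch, assuming roughly 4 billion effective parameters for E4B (an assumption taken from the "Effective 4B" name, not a published count) and ignoring quantization metadata:

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return n_params * bits_per_weight / 8 / 1e9


params = 4e9  # assumption: ~4 billion effective parameters
fp16 = weight_memory_gb(params, 16)
int4 = weight_memory_gb(params, 4)
print(f"16-bit: {fp16:.1f} GB, 4-bit: {int4:.1f} GB, ratio: {fp16 / int4:.0f}x")
```

The estimate lands at about 2.0 GB for 4-bit weights. The small gap to the 2.1GB quoted in the table is plausibly scale factors and layers kept at higher precision, though that breakdown is a guess.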

The Complete Implementation

"""
GemmaVision — Complete computer vision in 50 lines
Native object detection with Gemma 4 on Raspberry Pi 5
"""

from functools import lru_cache

from transformers import AutoProcessor, AutoModelForCausalLM, BitsAndBytesConfig
from PIL import Image, ImageDraw
import json

# Configuration
MODEL_ID = "google/gemma-4-E4B-it"
DEVICE = "cpu"  # Raspberry Pi 5 has no CUDA

@lru_cache(maxsize=1)
def load_model():
    """Load Gemma 4 once with 4-bit quantization; later calls reuse the cached copy."""
    processor = AutoProcessor.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID,
        # 4-bit weights are essential for the 8GB RAM constraint
        quantization_config=BitsAndBytesConfig(load_in_4bit=True),
        device_map=DEVICE,      # CPU inference on Pi
        torch_dtype="auto",
    )
    return processor, model

def detect_objects(image_path: str, query: str = "all objects") -> list:
    """
    Detect objects in image using Gemma 4 native vision.

    Args:
        image_path: Path to image file
        query: What to detect (e.g., "cars", "furniture", "buttons and inputs")

    Returns:
        List of dicts with 'box_2d' [y1, x1, y2, x2] and 'label'
    """
    processor, model = load_model()

    # Load image
    image = Image.open(image_path)

    # Construct prompt for structured output
    messages = [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": f"Detect {query} in this image. Return JSON with 'box_2d' [y1, x1, y2, x2] and 'label' fields."},
        ],
    }]

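    # Run inference (10-20s on Pi 5)
    inputs = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,  # append the assistant turn so the model answers
        tokenize=True,
        return_dict=True,            # a dict is needed for **inputs and inputs["input_ids"]
        return_tensors="pt",
    )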

    outputs = model.generate(
        **inputs, 
        max_new_tokens=256,
        do_sample=False,  # Deterministic for reproducibility
    )

    # Parse native JSON output
    result_text = processor.decode(
        outputs[0][inputs["input_ids"].shape[-1]:], 
        skip_special_tokens=True
    )

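    # The reply is usually a JSON array, but it may be wrapped in a code fence
    # or extra text, so keep only the outermost [...] span before parsing
    start, end = result_text.find("["), result_text.rfind("]")
    if start == -1 or end < start:
        raise ValueError(f"No JSON array in model output: {result_text!r}")
    detections = json.loads(result_text[start:end + 1])
    return detections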

def draw_boxes(image_path: str, detections: list, output_path: str = None):
    """Draw bounding boxes on image."""
    image = Image.open(image_path)
    draw = ImageDraw.Draw(image)

    w, h = image.size
    for det in detections:
        # Gemma 4 returns coords on a 1000x1000 grid — descale to image size
        y1, x1, y2, x2 = det["box_2d"]
        x1, x2 = int(x1 * w / 1000), int(x2 * w / 1000)
        y1, y2 = int(y1 * h / 1000), int(y2 * h / 1000)
        label = det["label"]

        # Draw box
        draw.rectangle([x1, y1, x2, y2], outline="#00ff00", width=3)

        # Draw label, clamped so it stays visible for boxes touching the top edge
        draw.text((x1, max(0, y1 - 10)), label, fill="#00ff00")

    if output_path:
        image.save(output_path)

    return image

# One-liner usage
if __name__ == "__main__":
    detections = detect_objects("kitchen.jpg", "all objects")
    print(f"Found {len(detections)} objects:")
    for det in detections:
        print(f"  - {det['label']} at {det['box_2d']}")

    draw_boxes("kitchen.jpg", detections, "output.jpg")

That's the entire pipeline. No cv2. No torchvision. No ultralytics. No YAML configs. No custom NMS logic. The only coordinate math is the two-line descale in draw_boxes.
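Because the boxes come from generated text rather than a fixed output head, it can help to sanity-check them before drawing. A minimal sketch, assuming the box_2d / label schema used in this post. validate_detections is a hypothetical helper, not part of the GemmaVision repo.

```python
def validate_detections(detections, grid=1000):
    """Keep only detections with a str label and an ordered, in-range [y1, x1, y2, x2] box."""
    valid = []
    for det in detections:
        if not isinstance(det, dict):
            continue
        box, label = det.get("box_2d"), det.get("label")
        if not isinstance(label, str) or not isinstance(box, list) or len(box) != 4:
            continue
        if not all(isinstance(v, (int, float)) and 0 <= v <= grid for v in box):
            continue
        y1, x1, y2, x2 = box
        if y1 < y2 and x1 < x2:  # reject empty or inverted boxes
            valid.append(det)
    return valid


raw = [
    {"box_2d": [171, 75, 245, 308], "label": "coffee mug"},
    {"box_2d": [89, 420, 1334, 612], "label": "laptop"},   # y2 is off the grid
    {"box_2d": [245, 512, 412], "label": "desk chair"},    # only three numbers
]
print(validate_detections(raw))
```

Dropping malformed entries silently is a design choice made for brevity. Logging them would make prompt problems easier to spot.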

Performance Benchmarks

I ran 100 test images across 5 categories on the Pi 5:

Category	Images	Avg Time	Accuracy	Notes
Common objects	20	12.3s	87%	COCO-style items
Indoor scenes	20	14.1s	84%	Living room, kitchen
UI elements	20	11.8s	91%	Buttons, inputs, links
Screenshots	20	10.5s	89%	Web interfaces
Outdoor scenes	20	15.2s	78%	Street, cars, pedestrians
Overall	100	12.8s	85.8%	—
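The Overall row is the image-weighted mean of the five category rows. A short sketch that recomputes it from the numbers in the table above (the variable names are illustrative only):

```python
results = [
    # (category, images, avg_seconds, accuracy_percent), copied from the table above
    ("Common objects", 20, 12.3, 87),
    ("Indoor scenes", 20, 14.1, 84),
    ("UI elements", 20, 11.8, 91),
    ("Screenshots", 20, 10.5, 89),
    ("Outdoor scenes", 20, 15.2, 78),
]

total = sum(n for _, n, _, _ in results)
avg_time = sum(n * t for _, n, t, _ in results) / total
accuracy = sum(n * a for _, n, _, a in results) / total
print(f"{total} images, {avg_time:.1f}s average, {accuracy:.1f}% accuracy")
```

With equal category sizes this reduces to a plain average. Weighting by image count keeps the calculation correct if the categories ever differ in size.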

First inference takes ~15 seconds (model loads from SD card).

Subsequent inferences take 8–12 seconds (model cached in RAM).

Memory usage: ~6GB RAM during inference (fits comfortably in 8GB Pi).

Power draw: 7.5W continuous (standard Pi 5 PSU).

The SEO Angle: Why This Matters for Developers

Three fundamental shifts are happening simultaneously in edge AI:

1. Democratization of Computer Vision

Computer vision was historically $500+ GPU territory. Now it's a $75 single-board computer. This changes who can build CV systems:

Students can prototype without cloud credits
Hobbyists in developing regions can build locally
Indie developers can ship CV features without venture funding
Researchers can deploy experiments without institutional GPU clusters

The barrier to entry for computer vision just dropped by 10×.

2. Privacy-First by Default

Everything happens locally on the Pi. No images uploaded to cloud APIs. No data retention policies to worry about. No network required after initial model download.

Use cases where this matters:

Home security cameras (no footage leaves your network)
Medical image analysis (HIPAA compliance without vendor audits)
Industrial quality control (trade secrets stay on-premise)
Accessibility tools for sensitive environments

3. Architectural Simplicity

Traditional CV pipelines are composed systems with multiple failure points. Gemma 4 is a unified system.

Complexity comparison:

Aspect	Traditional CV	Gemma 4 Vision
Setup time	2–4 hours	20 minutes
Lines of code	500–1000	50
Dependencies	10+	3
Configuration files	3-5 (YAML/JSON)	0
Training required	Yes (custom datasets)	No (zero-shot)
Version conflicts	Frequent	Rare

This simplicity isn't just about developer experience — it's about reliability. Fewer components means fewer things that can break at 2 AM.

FAQ: Frequently Asked Questions

Q: Can I run this on Raspberry Pi 4?

A: Technically yes, practically no. The Pi 4 also tops out at 8GB but has a much slower CPU. With 4-bit quantization (and heavy swap usage on the 4GB model) it might run, but expect inference to be roughly 3× slower (30-40s per image). The Pi 5's faster CPU is what makes this viable.

Q: How accurate is Gemma 4 compared to YOLOv8?

A: In my testing on 100 images: YOLOv8 ~90%, Gemma 4 ~85%. That 5-point gap is the trade-off for zero-shot capability and a three-dependency install. For many applications, 85% is sufficient.

Q: Can it detect custom objects not in COCO?

A: Yes! This is the magic of zero-shot. Just describe what you want: "detect red toy cars", "find cracks in concrete", "locate loose bolts". No retraining required.

Q: Does it work without internet?

A: After initial model download (~2.1GB quantized), yes. The model runs 100% locally on the Pi. No API calls, no cloud dependencies.

Q: Can I use it for real-time video?

A: No. At 8-12 seconds per frame, it's far too slow for video. Use YOLOv8 or other traditional CV for real-time applications. Gemma 4 excels at batch processing of still images.

Q: What's the power consumption?

A: ~7.5W continuous under load. A standard 5V 5A Raspberry Pi PSU handles it easily. The active cooler adds ~1W.

Q: Can I run this on NVIDIA Jetson?

A: Absolutely, and it'll be much faster. Jetson boards such as the Nano and Orin support CUDA, so you would change the "cpu" device setting to use the GPU. This guide focuses on Pi 5 because it's cheaper and more accessible, but the code works anywhere PyTorch runs.

Q: Is the model free to use commercially?

A: Yes! Gemma 4 is released under the Apache 2.0 license — a major upgrade from previous Gemma models' custom terms. This is a standard, permissive open-source license allowing unrestricted commercial use. See Gemma 4 license details.

Q: How do I improve accuracy?

A: Three strategies:

Higher resolution input — Larger images give more detail
Better prompts — Be specific: "detect laptops and phones" vs "detect electronics"
Crop regions — Focus on relevant image areas instead of full scene
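The cropping strategy needs one extra step: boxes detected on a crop have to be mapped back to the full image. A sketch of that mapping, assuming boxes on a 1000×1000 grid relative to whichever image was sent to the model. crop_box_to_full is a hypothetical helper name.

```python
def crop_box_to_full(box_2d, crop_origin, crop_size, full_size, grid=1000):
    """Map a [y1, x1, y2, x2] box on a crop's grid onto the full image's grid.

    crop_origin: (left, top) pixel offset of the crop inside the full image
    crop_size:   (width, height) of the crop in pixels
    full_size:   (width, height) of the full image in pixels
    """
    left, top = crop_origin
    crop_w, crop_h = crop_size
    full_w, full_h = full_size
    y1, x1, y2, x2 = box_2d

    def map_x(x):
        # crop grid -> crop pixels -> full-image pixels -> full-image grid
        return round((left + x * crop_w / grid) * grid / full_w)

    def map_y(y):
        return round((top + y * crop_h / grid) * grid / full_h)

    return [map_y(y1), map_x(x1), map_y(y2), map_x(x2)]


# Right half of a 2000x1000 image: a box covering the crop's lower-right quadrant
print(crop_box_to_full([500, 500, 1000, 1000], (1000, 0), (1000, 1000), (2000, 1000)))
```

The crop itself can be produced with PIL's Image.crop((left, top, left + width, top + height)) before it is passed to the model.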

Q: Can I fine-tune Gemma 4 for my use case?

A: Yes, but it's complex. Gemma 4 supports fine-tuning via LoRA/QLoRA. I plan to publish a fine-tuning guide after the challenge. For now, zero-shot prompting covers 80% of use cases.

What's Next for GemmaVision

This is my official entry for the DEV Gemma 4 Challenge (May 6-24, 2026).

Post-challenge roadmap:

Feature	Status	ETA
Fine-tuning guide	Planned	June 2026
Pi 5 GPU acceleration	Waiting for open-source drivers	TBD
WebRTC streaming	Prototyping	May 2026
9B model experiments	Blocked (needs 12GB+ RAM)	If Pi 6 releases
Docker deployment	Planned	May 2026
Home Assistant integration	Community request	June 2026

Call to Action

If this project helped you:

🚀 Try the code: github.com/tahosinx/gemmavision

⭐ Star the repo if you found it useful

💬 Comment below: What would you build with local, offline computer vision?

❤️ Heart this post — it helps in the challenge rankings

🐦 Share on Twitter — Tag me @tahosinx

Hardware links:

Raspberry Pi 5 — Stock finder (currently available)
Camera Module 3 — Wide angle recommended
Active Cooler — Official Pi cooler

About the Author

Tahosin — Building AI systems that run where you need them: on your desk, not in the cloud.

🌐 Website: tahosin.bro.bd
💻 GitHub: @tahosinx
📝 DEV: @tahosin
🐦 Twitter: @tahosinx

Built with Gemma 4. Tested on a $75 computer. Shared because nobody else was writing this guide.

Keywords: Gemma 4, computer vision, Raspberry Pi, edge AI, object detection, zero-shot learning, multimodal AI, local inference, privacy-first AI, embedded vision, YOLO alternative, OpenCV replacement, budget AI hardware, DIY computer vision.

Last updated: May 12, 2026. GemmaVision v1.0. MIT Licensed.

AI Code Generation: Google's 75% Claim and What It Means

S M Tahosin — Sat, 25 Apr 2026 04:00:55 +0000

Sundar Pichai just dropped a bombshell: 75% of Google's code is now AI-generated. That's a huge number, and it's not some far-off future scenario. This isn't just about faster autocomplete; it's a stark look at where enterprise development is headed, fast.

Why this matters for Tech Leads

If you're a tech lead, or even a staff engineer, this number should make you sit up straight. Your team's productivity metrics could be about to get a serious shake-up. You're not just reviewing human-written code anymore; you're going to be reviewing AI-generated solutions that might look perfect on the surface but hide subtle issues. Think about the shift from writing boilerplate to verifying boilerplate. You'll need to figure out how to integrate these tools, manage their output, and still maintain code quality and architectural integrity. This isn't just about adopting a new IDE plugin; it's about fundamentally rethinking how code gets from idea to production. Google's internal tools, whatever they're called, are clearly pushing boundaries way past what we see in public tools like GitHub Copilot, saving them potentially millions of developer hours.

The technical reality

So, how does 75% AI-generated code even work? It's not sentient AI writing entire systems from scratch. More likely, it's highly sophisticated code completion, pattern recognition, and scaffold generation, deeply integrated into Google's vast internal monorepo and toolchain. Imagine an AI that understands your internal APIs, coding standards, and common patterns better than a new hire. It probably generates entire function bodies, test cases, and even data models based on high-level prompts or existing code context. We're talking about tools that can spit out a src/utils/data-formatter.js file with 50 lines of perfect code, including JSDoc comments, in seconds. But you still gotta check it. Here's a tiny example of what an AI might generate, and what you'd typically do with it:

// AI-generated utility function
const formatCurrency = (value, locale = 'en-US', currency = 'USD') => {
  if (typeof value !== 'number' || isNaN(value)) {
    console.warn('Invalid input for formatCurrency:', value);
    return null;
  }
  return new Intl.NumberFormat(locale, {
    style: 'currency',
    currency: currency,
    minimumFractionDigits: 2,
    maximumFractionDigits: 2,
  }).format(value);
};

// A human-written test for verification
console.assert(formatCurrency(123.45, 'en-US', 'USD') === '$123.45', 'USD formatting failed');
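// Intl puts a non-breaking space (U+00A0) before the euro sign, so compare against that, not a plain space
console.assert(formatCurrency(99.99, 'de-DE', 'EUR') === '99,99\u00A0€', 'EUR formatting failed');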
console.assert(formatCurrency(0) === '$0.00', 'Zero value failed');
console.assert(formatCurrency(null) === null, 'Null input failed');

And it's not just JavaScript. It's likely generating configuration files, build scripts, and more. Think about a Dockerfile for a new service or a Kubernetes deployment manifest. An AI could draft that based on a few parameters, saving hours of looking up syntax in documentation. It's about reducing the cognitive load on engineers by automating the predictable, allowing them to focus on the truly novel problems. I've seen teams save 10% of their time just by using basic code completion; imagine what 75% generation means.

What I'd actually do today

Given this news, here's my practical take for any dev team right now:

Start small with a public tool: Integrate something like GitHub Copilot or Cursor into a non-critical side project or a small, isolated module. See how it performs with your team's common tasks.
Define clear AI usage policies: Decide what kinds of code can be AI-generated without heavy human review. Establish rules for sensitive data or critical path logic.
Invest in robust testing: If AI writes more code, humans need to write more tests, or at least verify AI-generated tests. Strong unit and integration tests are your safety net.
Practice prompt engineering: Teach your team how to write effective prompts. Getting good output from AI is a skill, and it's becoming crucial.
Monitor code quality metrics: Keep a close eye on your static analysis tools and code coverage. AI can introduce subtle bugs or performance issues that human eyes might miss.

Gotchas & unknowns

While 75% is impressive, it's not a silver bullet. The biggest gotcha is hallucinations. AI models can generate plausible-looking but completely incorrect code. This is especially true when dealing with edge cases, complex business logic, or obscure library usage. Another unknown is the maintenance burden. If an AI generates code, who's responsible for understanding and debugging it later? What happens when the underlying libraries change, and the AI-generated code becomes outdated? It's also unclear how Google manages intellectual property or security concerns with such widespread AI usage. They have internal models, sure, but the ethical lines blur when a machine generates 3 out of 4 lines of your codebase. And let's not forget the environmental impact of running these massive AI models constantly; that's a whole other can of worms.

How much of your codebase do you think an AI could realistically generate without causing more headaches than it solves?

Streaming Speech-to-Text with OpenAI in 2026: Moving Beyond Whisper

S M Tahosin — Fri, 24 Apr 2026 19:16:56 +0000

Quick recap of where we are if you haven't been following OpenAI's STT roadmap: the classic whisper-1 endpoint is batch-only — you upload a file, wait, get back a finished transcript. There's no stream=True because the underlying Whisper decoder wasn't designed for it, and the endpoint probably won't ever get streaming retrofitted onto it.

That was a genuine blocker for about two years. If you wanted live captions or partial transcripts, you had to either self-host Whisper with a streaming fork, or reach for a third-party like AssemblyAI / Deepgram.

Then, quietly, OpenAI shipped two replacements that between them cover every STT streaming use case I've needed:

gpt-4o-transcribe / gpt-4o-mini-transcribe — file upload with stream=True, delivers partial transcripts as the audio is processed.
Realtime API (gpt-4o-realtime-preview) — WebSocket, bidirectional, built for live mic-in / TTS-out with a live-transcription mode.

I helped someone get unblocked on this in openai/openai-python#2306 and realised I'd never written up the full picture. Here it is — with the trade-offs, working code for each, and a decision rule at the end.

Option 1: `gpt-4o-transcribe` with `stream=True`

Same API shape as the old audio.transcriptions.create call, just with a new model and stream=True. You get incremental transcript.text.delta events as chunks come back:

from openai import OpenAI
client = OpenAI()

with open("meeting.mp3", "rb") as f:
    stream = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",       # or "gpt-4o-mini-transcribe" (cheaper)
        file=f,
        response_format="text",
        stream=True,
    )
    transcript = []
    for event in stream:
        if event.type == "transcript.text.delta":
            print(event.delta, end="", flush=True)
            transcript.append(event.delta)
        elif event.type == "transcript.text.done":
            print()    # final newline
    full_text = "".join(transcript)

Async works identically with AsyncOpenAI:

import asyncio
from openai import AsyncOpenAI

async def transcribe(path: str) -> str:
    client = AsyncOpenAI()
    parts = []
    with open(path, "rb") as f:
        stream = await client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=f,
            response_format="text",
            stream=True,
        )
        async for event in stream:
            if event.type == "transcript.text.delta":
                parts.append(event.delta)
    return "".join(parts)

text = asyncio.run(transcribe("meeting.mp3"))

Why I default to this for "finished file" use cases

Three practical reasons beyond the streaming:

Accuracy. On the English + mixed-language audio I've benchmarked, gpt-4o-transcribe is noticeably better than whisper-1 at speaker changes, acronyms, and technical vocabulary. gpt-4o-mini-transcribe is a smaller quality step down but still beats whisper-1 in my tests.
Latency perception. Even though the total time is similar, partial transcripts streaming into your UI feel much faster to users. A 3-minute audio file that takes 20 seconds to transcribe feels instant if the first words show up after ~500ms; it feels sluggish if you wait the full 20 seconds for the whole blob.
Same file-upload ergonomics as Whisper. Swapping model="whisper-1" for model="gpt-4o-transcribe" + adding stream=True is almost a drop-in change, so migrating an existing pipeline is a 5-minute job, not a rewrite.
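
To check the latency-perception point on your own audio, it helps to log when each delta arrives rather than only the total time. A minimal sketch, assuming you take a start time just before calling audio.transcriptions.create. The timed helper and fake_stream generator are hypothetical names, and the fake stream only stands in for the real event stream so the snippet runs offline:

```python
import time
from typing import Iterable, Iterator, Optional, Tuple, TypeVar

T = TypeVar("T")

def timed(events: Iterable[T], start: Optional[float] = None) -> Iterator[Tuple[float, T]]:
    """Yield (seconds_since_start, event) for every event in the stream."""
    if start is None:
        start = time.perf_counter()
    for event in events:
        yield time.perf_counter() - start, event

def fake_stream() -> Iterator[str]:
    # Stand-in for the transcription stream: three deltas, 10 ms apart.
    for delta in ("Hello", " there", " world"):
        time.sleep(0.01)
        yield delta

t0 = time.perf_counter()          # in real code: take this just before create(...)
arrivals = list(timed(fake_stream(), start=t0))
first_delta_at = arrivals[0][0]   # time to first partial transcript
total = arrivals[-1][0]           # time until the last delta
print(f"first delta after {first_delta_at:.3f}s, last after {total:.3f}s")
```

The number users feel is first_delta_at, not total, so that is the one worth tracking when comparing models.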

What you give up

No word-level timestamps (yet, at time of writing). whisper-1 with response_format="verbose_json" and timestamp_granularities=["word"] still wins if you need precise word-level timing for subtitle alignment. If that's your use case, stay on whisper-1.
No speaker diarization in either. If you need "who said what", both of these need to be paired with a separate diarization step (pyannote is the usual pick).

Option 2: Realtime API for true live audio

If you're transcribing live audio — a microphone, a phone call, a meeting as it happens — you want the Realtime API, not a file upload. It's a WebSocket connection you push PCM16 chunks into, and you get back conversation.item.input_audio_transcription.delta events every ~200–500ms.

import asyncio, base64
import sounddevice as sd
from openai import AsyncOpenAI

SAMPLE_RATE = 24_000
CHUNK_MS = 50   # 50ms chunks

async def live_transcribe():
    client = AsyncOpenAI()
    async with client.beta.realtime.connect(
        model="gpt-4o-realtime-preview"
    ) as conn:
        # Configure for transcription-only (no model replies, no TTS)
        await conn.session.update(session={
            "modalities": ["text"],
            "input_audio_format": "pcm16",
            "input_audio_transcription": {"model": "gpt-4o-transcribe"},
            "turn_detection": {"type": "server_vad"},   # let OpenAI handle silence detection
        })

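        # Start streaming audio in the background
        audio_queue: asyncio.Queue[bytes] = asyncio.Queue()
        loop = asyncio.get_running_loop()

        def on_audio(indata, frames, time_info, status):
            # sounddevice calls this on its own audio thread, and asyncio.Queue
            # is not thread-safe, so hand the chunk over via the event loop.
            loop.call_soon_threadsafe(audio_queue.put_nowait, bytes(indata))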

        with sd.RawInputStream(
            samplerate=SAMPLE_RATE,
            blocksize=int(SAMPLE_RATE * CHUNK_MS / 1000),
            channels=1,
            dtype="int16",
            callback=on_audio,
        ):
            sender = asyncio.create_task(_send_audio(conn, audio_queue))
            try:
                async for event in conn:
                    if event.type == "conversation.item.input_audio_transcription.delta":
                        print(event.delta, end="", flush=True)
                    elif event.type == "conversation.item.input_audio_transcription.completed":
                        print(f"\n[final: {event.transcript}]")
            finally:
                sender.cancel()

async def _send_audio(conn, q: asyncio.Queue[bytes]):
    while True:
        chunk = await q.get()
        await conn.input_audio_buffer.append(
            audio=base64.b64encode(chunk).decode("ascii")
        )

asyncio.run(live_transcribe())

A few things worth knowing:

PCM16 at 24 kHz is the expected format. If you're capturing at a different sample rate, resample before sending — the server won't resample for you.
Let server-side VAD handle turn detection (turn_detection: {type: "server_vad"}) unless you have a specific reason to do it client-side. OpenAI's VAD is well-tuned and keeps your client code simple.
The conversation.item.input_audio_transcription.delta events are your partial captions; conversation.item.input_audio_transcription.completed fires when the user finishes a "turn" (i.e. stops talking for ~500ms). Use the deltas to drive your live caption UI, and the completed event to commit a finalised sentence to your transcript log.
You can also use the Realtime API for voice-to-voice (audio in, audio out) by adding "audio" to modalities and setting a TTS voice. The transcription deltas still fire, so you get the transcript "for free" even in a full voice-assistant setup.
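
On the resampling point: if your capture device only offers, say, 16 kHz or 48 kHz, the conversion has to happen client-side. Below is a dependency-free sketch using linear interpolation. The function name resample_pcm16 is hypothetical, it assumes mono 16-bit samples in native byte order (little-endian on most machines), and it applies no anti-aliasing filter, so treat it as an illustration rather than a production resampler:

```python
from array import array

def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int = 24_000) -> bytes:
    """Linearly resample mono 16-bit PCM from src_rate to dst_rate."""
    samples = array("h")
    samples.frombytes(pcm)
    if src_rate == dst_rate or len(samples) == 0:
        return pcm
    out_len = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate          # source samples per output sample
    last = len(samples) - 1
    out = array("h")
    for i in range(out_len):
        pos = i * step
        lo = min(int(pos), last)
        hi = min(lo + 1, last)          # clamp at the final sample
        frac = pos - lo
        out.append(round(samples[lo] * (1 - frac) + samples[hi] * frac))
    return out.tobytes()

# 4 samples at 12 kHz become 8 samples at 24 kHz.
source = array("h", [0, 100, 200, 300]).tobytes()
upsampled = array("h")
upsampled.frombytes(resample_pcm16(source, 12_000, 24_000))
print(list(upsampled))
```

For downsampling from 44.1 or 48 kHz, a library resampler with proper filtering is the safer choice.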

Latency is the killer feature

In my testing, the partial-transcript latency on Realtime is 200–400ms from end-of-phoneme to delta, which is what you need for live captions to feel responsive. File-based gpt-4o-transcribe with streaming still has to wait for the upload to reach the server before it can start, so the first delta on a file upload lands ~1–2s in. That's fine for "uploaded recording" UX, but too slow for "live."

Option 3: Stay on `whisper-1`

If:

You genuinely don't care about streaming (batch transcription of a recorded file where the UX is "upload → come back in a minute for the result").
You need word-level timestamps for subtitle alignment.
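You're cost-optimising hard and whisper-1 genuinely works out cheaper for your workload. Check the current pricing page first: gpt-4o-mini-transcribe is pitched as the cheaper option, so don't assume whisper-1 is the budget pick.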

... then whisper-1 is still the right call, and probably will be for a while. It's not going anywhere, it's cheap, it's stable.

The decision rule

Written out as an actual rule I use:

"Is the user staring at the UI waiting for the transcript?"

No (batch job, background processing, subtitle generation) → whisper-1 if you need timestamps, gpt-4o-mini-transcribe otherwise.

Yes, and the audio is a finished file → gpt-4o-transcribe or gpt-4o-mini-transcribe with stream=True.

Yes, and the audio is live (mic, phone, meeting) → Realtime API with input_audio_transcription.
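
The same rule as a small function, in case you want it next to the code that picks a backend. This is only a sketch: pick_transcription_approach and its return labels are hypothetical names, not anything from the OpenAI SDK.

```python
def pick_transcription_approach(
    user_is_waiting: bool,
    audio_is_live: bool = False,
    needs_word_timestamps: bool = False,
) -> str:
    """Encode the decision rule: who is waiting, and is the audio live?"""
    if not user_is_waiting:
        # Batch job: only timestamps justify staying on whisper-1.
        return "whisper-1" if needs_word_timestamps else "gpt-4o-mini-transcribe"
    if audio_is_live:
        return "realtime-api"
    # Finished file, user watching: stream the transcript as it is produced.
    return "gpt-4o-transcribe (stream=True)"

choices = {
    "subtitle batch": pick_transcription_approach(False, needs_word_timestamps=True),
    "background job": pick_transcription_approach(False),
    "uploaded file": pick_transcription_approach(True),
    "live meeting": pick_transcription_approach(True, audio_is_live=True),
}
for scenario, choice in choices.items():
    print(f"{scenario}: {choice}")
```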

There's one additional axis worth flagging: language coverage. whisper-1 still has the broadest language support (it was trained on 98 languages). gpt-4o-transcribe is very good on the major languages but gets noticeably worse as you head into the long tail. If you're transcribing Swahili, Bengali, or any other language outside the handful the newer models handle best, benchmark both on a sample before picking; don't assume the newer model is always better.

Common pitfalls

A few things that cost me hours the first time:

The file parameter for audio.transcriptions.create needs a filename to go with the audio, not just raw bytes. If you have bytes in memory (e.g. from an upload handler), wrap them in io.BytesIO and set a .name attribute ending in the right extension: the SDK uses the filename to infer Content-Type. (The SDK also accepts a (filename, bytes) tuple if you'd rather not wrap.)

   from io import BytesIO
   buf = BytesIO(audio_bytes)
   buf.name = "recording.mp3"    # ← critical
   client.audio.transcriptions.create(model="gpt-4o-transcribe", file=buf, stream=True)

Realtime API auth uses the same key as the rest of OpenAI, but the connection is authenticated at the WebSocket handshake. Your API key briefly appears in the Authorization header of the initial HTTP upgrade request, which is fine server-side — but if you're building a browser client, you need to proxy the handshake through your backend so the key never touches the client. OpenAI has a "client secret" flow for this.
The event types are stable but there are a lot of them. The Realtime API emits ~20 distinct event types; if you find yourself writing a giant if/elif chain, factor it into a dispatch dict indexed by event.type early — much easier to extend later.
Silence doesn't count as transcription. If your audio has a lot of pauses, you'll see input_audio_buffer.speech_stopped events but no transcription deltas for the silent parts. That's expected; don't treat it as a bug.
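
The dispatch-dict idea from the pitfalls above might look like this. It is a sketch, not the SDK's API: the handler names, the captions and transcript_log lists, and the SimpleNamespace stand-in events are all hypothetical, and only the event type strings come from the article.

```python
from types import SimpleNamespace
from typing import Any, Callable, Dict, List

captions: List[str] = []        # partial text for the live caption UI
transcript_log: List[str] = []  # finalised turns

def on_delta(event: Any) -> None:
    captions.append(event.delta)

def on_completed(event: Any) -> None:
    transcript_log.append(event.transcript)
    captions.clear()  # the turn is final, so start the next caption fresh

HANDLERS: Dict[str, Callable[[Any], None]] = {
    "conversation.item.input_audio_transcription.delta": on_delta,
    "conversation.item.input_audio_transcription.completed": on_completed,
}

def dispatch(event: Any) -> bool:
    """Route one event to its handler; return False for types we ignore."""
    handler = HANDLERS.get(event.type)
    if handler is None:
        return False
    handler(event)
    return True

# Stand-in events so the sketch runs without a live connection.
demo_events = [
    SimpleNamespace(type="conversation.item.input_audio_transcription.delta", delta="Hel"),
    SimpleNamespace(type="conversation.item.input_audio_transcription.delta", delta="lo"),
    SimpleNamespace(type="input_audio_buffer.speech_stopped"),
    SimpleNamespace(type="conversation.item.input_audio_transcription.completed", transcript="Hello"),
]
handled = [dispatch(e) for e in demo_events]
print(handled, transcript_log)
```

Adding a new event type then means one new function and one new dict entry, and unhandled types fall through without an error.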

References

Speech-to-Text guide — covers gpt-4o-transcribe and whisper-1 together
Realtime API guide — the overview
openai-python SDK Realtime docs — the Python specifics
openai/openai-python#2306 — the discussion that prompted this writeup

If you're building something non-trivial with streaming STT (especially multi-speaker scenarios, code-switching between languages, or very noisy audio), leave a comment. I've been collecting notes on which approach wins in each setting.

Next.js 16: Revalidating Per-User Dynamic Fetches on Demand (3 Patterns That Actually Work)

S M Tahosin — Fri, 24 Apr 2026 19:16:18 +0000

If you've ever tried to revalidate a user-scoped fetch in Next.js App Router and watched revalidateTag('...') silently do nothing, you've run into one of the subtler gotchas of the 16.x data cache. The short version:

Once a fetch reads from cookies() or headers(), Next marks it as Dynamic and bypasses the data cache entirely — so next: { tags: [...] } is silently ignored, and your tag-based revalidation has nothing to invalidate.

This bites hardest on auth-gated dashboards: every fetch forwards the session cookie to your backend, so every fetch is Dynamic, so none of them are cached, so revalidateTag is a no-op. You end up writing action handlers that "revalidate everything" with an empty tag key — and that actually does work, but it's a sledgehammer that obliterates cross-user cache isolation you didn't know you wanted.

I ran into this last week while helping someone in vercel/next.js#92829, and realised I've been using three distinct patterns depending on the data shape. Writing them up here because the docs don't connect the dots between the Dynamic IO model and the "per-user revalidation" use case.

All examples target Next.js 16.1+. I'll note where 16.0 and earlier diverge.

The pattern you probably tried first (and why it fails)

// app/lib/api.ts
import { cookies } from 'next/headers';

export const fetchAPI = async () => {
  const cookieStore = await cookies();
  return fetch('https://api.example.com/dashboard', {
    method: 'POST',
    headers: { Cookie: cookieStore.toString() },
    next: { tags: ['dashboard-data'] },   // ← this is silently ignored
  });
};

Then in a server action:

'use server';
import { revalidateTag } from 'next/cache';

export async function refreshDashboard() {
  revalidateTag('dashboard-data');   // ← nothing to invalidate; cache was never populated
}

The fetch is considered Dynamic because fetchAPI reads cookies() in the same function scope before issuing it. Dynamic fetches skip the data cache entirely: they're not cached per-user, they're not cached at all. next.tags is only consulted when something actually enters the cache, so the tag never gets associated with any cache entry.

Your three options are:

Opt back in to caching with an explicit key (unstable_cache or 'use cache')
Accept it's dynamic, use React.cache for same-request dedupe, and revalidatePath for rerenders
Route the data through a Route Handler that does cache, and call it from the Server Component

Let's walk through each.

Pattern 1: `unstable_cache` with the cookie as a key part

unstable_cache builds its cache key from the keyParts array you give it (plus any arguments of the wrapped function), not from anything it reads out of the request. So you read the cookie outside the cached function and pass it in as a key part:

// app/lib/api.ts
import { unstable_cache } from 'next/cache';
import { cookies } from 'next/headers';
import { createHash } from 'node:crypto';

const sessionHash = (cookie: string) =>
  createHash('sha256').update(cookie).digest('hex').slice(0, 16);

const fetchAPIForUser = (sessionCookie: string) =>
  unstable_cache(
    async () => {
      const res = await fetch('https://api.example.com/dashboard', {
        method: 'POST',
        headers: { Cookie: sessionCookie },
      });
      if (!res.ok) throw new Error(`API ${res.status}`);
      return res.json();
    },
    // Cache key parts — different sessions get different cache entries
    ['fetchAPI', sessionCookie],
    {
      tags: [
        'dashboard-data',
        `dashboard-data:${sessionHash(sessionCookie)}`,
      ],
      revalidate: 60,
    },
  )();

export async function fetchAPI() {
  const sessionCookie = (await cookies()).toString();
  return fetchAPIForUser(sessionCookie);
}

Two things are doing work here:

The cookie is a key part, so every user ends up with their own cache entry. User A's revalidateTag doesn't nuke User B's data.
The tags list has both a global dashboard-data and a per-user dashboard-data:<hash>. This gives you granular control: revalidate one user's data after they mutate something, or nuke everyone's when a global config changes.

Then your server action becomes:

'use server';
import { revalidateTag } from 'next/cache';
import { cookies } from 'next/headers';
import { createHash } from 'node:crypto';

const sessionHash = (c: string) =>
  createHash('sha256').update(c).digest('hex').slice(0, 16);

export async function refreshMyDashboard() {
  const cookie = (await cookies()).toString();
  revalidateTag(`dashboard-data:${sessionHash(cookie)}`);   // just me
}

export async function refreshEveryonesDashboard() {
  revalidateTag('dashboard-data');   // global flush
}

When to use this: user-scoped data that's expensive to fetch and read more than once per session — dashboards, settings pages, user-specific feeds. You get the latency win of caching and tag-based revalidation.

Gotcha: don't accidentally cache PII in a way that survives the user's logout. The per-user tag + a reasonable revalidate ceiling (60s–5min) keeps the blast radius sane.

Pattern 2: `'use cache'` directive (the modern shape)

If you're on 16.1+ with experimental.dynamicIO enabled, 'use cache' is the newer, less verbose form — same idea, less ceremony:

// app/lib/api.ts
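import { cookies } from 'next/headers';
import { cacheTag, cacheLife } from 'next/cache';
import { createHash } from 'node:crypto';

// Same helper as in Pattern 1: short, stable hash of the session cookie
const sessionHash = (cookie: string) =>
  createHash('sha256').update(cookie).digest('hex').slice(0, 16);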

async function fetchAPIForUser(sessionCookie: string) {
  'use cache';
  cacheLife('minutes');
  cacheTag('dashboard-data', `dashboard-data:${sessionHash(sessionCookie)}`);

  const res = await fetch('https://api.example.com/dashboard', {
    method: 'POST',
    headers: { Cookie: sessionCookie },
  });
  return res.json();
}

export async function fetchAPI() {
  const sessionCookie = (await cookies()).toString();
  return fetchAPIForUser(sessionCookie);   // same pattern — read cookie outside
}

cacheTag / cacheLife from next/cache are the equivalents of the unstable_cache options, and the function's arguments become the cache key automatically.

The key discipline — read cookies() outside the cached function and pass it as an argument — is identical to Pattern 1. The framework still can't introspect into cookies() from inside a cached region; it just sees a function that takes a string and caches by string.

Enable it in next.config.ts:

import type { NextConfig } from 'next';

const config: NextConfig = {
  experimental: {
    dynamicIO: true,
    useCache: true,
  },
};

export default config;

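Check your 16.x changelog for exact flag names, because they have shifted between releases. In particular, the Next.js 16 release notes describe experimental.dynamicIO being renamed to a top-level cacheComponents option, so the snippet above may need adjusting for your version.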

Pattern 3: Accept the dynamic, dedupe with `React.cache`, refresh with `revalidatePath`

Sometimes the data just isn't cacheable — it changes every request, or it's cheap enough that caching adds latency instead of removing it. In that case, don't fight the framework; work with it.

// app/lib/api.ts
import { cache } from 'react';
import { cookies } from 'next/headers';

export const fetchAPI = cache(async () => {
  const sessionCookie = (await cookies()).toString();
  const res = await fetch('https://api.example.com/dashboard', {
    method: 'POST',
    headers: { Cookie: sessionCookie },
  });
  return res.json();
});

React.cache dedupes the fetch across components within the same request, so if five Server Components call fetchAPI() during one render, you still only hit the backend once. Different requests get fresh data — exactly what you want for per-user live data.

Then your server action rerenders the page instead of revalidating a cache entry:

'use server';
import { revalidatePath } from 'next/cache';

export async function refreshDashboard() {
  revalidatePath('/dashboard');   // forces re-render, which re-runs fetchAPI
}

When to use this: user-scoped data that's small, cheap, or genuinely fresh-per-request. Most dashboards I've built fall here — the latency of a direct backend call is dominated by network anyway, and skipping the cache layer saves you from a whole class of staleness bugs.

Decision rule I actually use

After writing a few of these, this is the rule I apply:

"The data is user-scoped, expensive, and reads dominate writes" → Pattern 1 or 2 with per-user tags. The 5× latency win on cache hits usually justifies the complexity.
"The data is user-scoped, cheap, and reads roughly equal writes" → Pattern 3. Don't cache; dedupe per-request, rerender on mutation.
"The data is global but personalised at the margin (e.g. reading a session cookie only for feature flags)" → Pattern 1 with a single tag, no per-user keying. Feature flag data is worth caching even though it reads a cookie.
"I need real-time-ish data (< 30s)" → Pattern 3 + poll-on-client with React Query / SWR. Caching on the server layer just pushes the staleness problem around.

The sledgehammer (and why to avoid it)

You can make the original code "work" by calling revalidateTag('') on every mutation — it nukes every tagged entry in the cache, and your Dynamic fetch also re-runs because the page gets marked for revalidation. I've seen this in production a few times and every time it caused an incident later:

One user's mutation invalidates every other user's cache → thundering herd on the backend
Global feature flags that were cacheable get flushed on every user action → effective cache hit rate drops to ~0%
Debugging becomes impossible because "why did User A see stale data?" has no local explanation

Per-user tags (Pattern 1 / 2) or per-request React.cache (Pattern 3) are both strictly better. Pick one, be consistent within a feature area, and document which pattern a given fetch is using.

A word on the mental model

The thing that clicked for me about the 16.x Dynamic IO model: the data cache is fundamentally a global key-value store keyed by URL + options hash. When your fetch reads something request-scoped (cookies, headers, searchParams), the cache layer has no good default for "who does this entry belong to?" — so it bails out entirely rather than silently cache PII across users.

You opt back in by making the user-scoping explicit (passing the cookie as a key part), which moves the security decision into your code where you can reason about it. That's the same tradeoff React Server Components made around 'use server' — the framework refuses to guess, and gives you a small API to tell it exactly what you mean.

Once I started thinking of unstable_cache / 'use cache' as "declare your cache key explicitly, include whatever request-scoped stuff you want to partition on", the rest of the API fell into place.

References

Next.js 16 — Data Cache
unstable_cache / 'use cache'
revalidateTag / revalidatePath
The original GitHub discussion where this writeup started

If you're hitting a variation of this problem — say, SSE streams that need to drop their connection on revalidation, or RSC payloads that race with client-side tag invalidations — drop a comment, I've probably tripped on it too.

Portal 2 Modding Tools: Community Edition is Here

S M Tahosin — Thu, 23 Apr 2026 16:02:25 +0000

So, Portal 2: Community Edition just dropped into open beta on Steam. It's got enhanced graphics, bigger maps, and a whole new set of modding tools for us to play with.

My hot take? This isn't just a game update; it's an open-source platform waiting for some serious innovation from the community.

Why this matters for Game Developers

Look, if you're a game dev, especially one who's ever tinkered with the Source Engine, this is a big deal. You're getting a fully featured, beloved game, essentially handed over to the community with new hooks. It's a goldmine for learning, experimenting, and even showcasing your skills. Think about the countless games that started as mods, like Counter-Strike or Dota. This isn't just about making new levels; it's about extending gameplay mechanics, building custom assets, and maybe even rewriting parts of the game logic. Valve's done the heavy lifting on the core engine, and now we get to build on top of it. It's free for existing Portal 2 owners, which means a huge potential audience for anything you create. We're talking millions of players, not just a niche group.

The technical reality

Modding Portal 2, even with new tools, still means getting cozy with the Source Engine. That often involves C++ for deeper modifications, but the community edition likely streamlines asset creation and scripting. You'll be dealing with Valve's Hammer editor for map creation, but the new tools probably offer more flexibility. Building a simple mod might look something like compiling custom scripts or assets. Let's say you're adding a new puzzle element. You'd likely define its behavior in a script, and then compile it.

Here's a conceptual shell command you might use to compile a custom game DLL for Source Engine, assuming you've got the SDK set up:

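#!/bin/bash
# Stop on the first failing command so a bad cd never runs make in the wrong directory
set -euo pipefail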

# Navigate to your mod's source directory
cd "$PORTAL2CE_SDK_PATH/src/my_custom_mod"

# Clean previous build artifacts
make clean

# Build the game library (e.g., game_shared.dll or game_server.dll)
# This assumes a Make-based build system common in older engine SDKs
# Modern tools might use CMake or Visual Studio projects.
make -j8

# Copy the compiled DLL to the game's bin directory
cp "./bin/Release/game_server.dll" "$PORTAL2CE_GAME_PATH/portal2ce/bin/"

echo "Custom mod DLL compiled and copied!"

Or, if you're just dealing with asset compilation, you'd use specific tools provided by the SDK. In the stock Source SDK, for instance, a new texture is converted to VTF with vtex.exe and referenced from a plain-text VMT (Valve Material Type) file, models are compiled with studiomdl.exe, and vpk.exe packs the finished files for distribution; the Community Edition tooling may differ. It's not always JavaScript, but understanding build pipelines is key.

What I'd actually do today

Download it: Get Portal 2: Community Edition from Steam. It's free if you own the original, so no excuses. Get it installed and run it once.
Explore the SDK: Find the new modding tools. There's usually a dedicated SDK folder. Poke around, see what files are there, and check for documentation.
Start Small: Don't try to build a new game from scratch. Try changing a texture, moving a prop, or altering a simple script value. The Portal 2 mapping community already has a ton of tutorials.
Join the Community: Find their Discord or forums. Other devs will be asking questions and sharing tips. This is where you'll get the real answers.
Look for C++ examples: If you're serious, find some existing open-source Source Engine mods. See how they structure their code and handle engine interactions. It's a C++ beast, but a manageable one for small changes.

Gotchas & unknowns

First off, it's a beta. Expect bugs. You might hit crashes, weird physics glitches, or tools that don't quite work as advertised. The documentation might be sparse initially, too. And while it's exciting, remember this is still built on an older engine, even with enhancements. You're not getting Unreal Engine 5 features here. Performance might be an issue with truly massive maps, even with the promised larger map support. Also, how long will Valve (or the community team) actively maintain these new tools? That's always a question with community-led projects. It's a passion project, not a guaranteed long-term support contract.

What kind of amazing puzzles do you think people will build with these new tools? And what's the first thing you'd try to mod? Let me know in the comments. This could be big, or it could just be a fun distraction, but I'm betting on the former. I'm excited to see what the community comes up with. Maybe a new version of Aperture Science's potato battery?

GitHub Copilot Pauses New Sign-ups: Agentic AI Strains Infrastructure & Scaling Challenges

S M Tahosin — Tue, 21 Apr 2026 18:05:46 +0000

I was scrolling through my tech news feed recently when a headline caught my eye: GitHub has temporarily halted new sign-ups for its Copilot service. As a developer who's been keenly observing the rise of AI in our craft, this news immediately struck me as a significant turning point. The reason for the pause? Infrastructure strain caused by the increasing use of 'agentic AI' features.

This isn't just about more users; it's about a different kind of AI that's pushing the boundaries of what our current tech infrastructure can handle. It highlights the rapid adoption and immense potential of advanced AI coding tools, but also signals the significant scaling challenges we face.

What is Agentic AI?

First, let's unpack what agentic AI means. Unlike simpler AI models that might complete a single task (like suggesting the next word or line of code), agentic AI refers to AI systems that can autonomously perform complex tasks, often breaking them down into multiple sub-tasks, executing them, and even self-correcting along the way.

Think of it less as an autocomplete tool and more as a proactive assistant that can understand a higher-level goal and work towards achieving it, potentially interacting with various tools and APIs. This level of autonomy and problem-solving naturally requires significantly more computational resources, as the AI isn't just generating; it's reasoning, planning, and executing.

Consider a simple analogy: a basic function suggestion might just pull from a library. An agentic AI might analyze your entire project, understand the context, figure out the best approach, generate a multi-step solution, and even write tests for it. This deep engagement and iterative processing are what demand so much from the underlying infrastructure.

The Resource Demands of Advanced AI

To illustrate the difference in resource demands, let's look at a very simplified, conceptual JavaScript example. Imagine a non-agentic function that just gives you a recommendation based on a single input, versus an agentic-like process that needs to iterate, make decisions, and potentially retry.

Basic Suggestion (Low Resource Example)

Here's a trivial example of a function that provides a direct suggestion based on a simple input. It's fast and requires minimal computation.

function getSimpleCodeSuggestion(problemType) {
  const suggestions = {
    'performance': 'Consider optimizing loop iterations.',
    'security': 'Sanitize user inputs carefully.',
    'bugfix': 'Check variable scope and type consistency.'
  };
  return suggestions[problemType] || 'No specific suggestion available.';
}

console.log(getSimpleCodeSuggestion('performance')); // Output: Consider optimizing loop iterations.

Agentic-like Process (Higher Resource Example)

Now, let's imagine a conceptual

HOCKS AI: I Open-Sourced a Full AI Platform With Chat, Vision, Video Analysis & Website Generation — Runs at $0/Month

S M Tahosin — Tue, 21 Apr 2026 16:17:46 +0000

TL;DR: I built and open-sourced a production-ready AI platform that combines chat, image analysis, video analysis, and website generation. It uses free models where possible and costs ~$0/month to run. Live demo | GitHub

Why I Built This

Every AI tool I tried was either:

Too expensive — GPT-4 API bills adding up fast
Single-purpose — chat OR image analysis, never both
Closed source — no way to learn from the architecture

I wanted a single platform that handles multiple AI modalities, uses the best free models available, and is fully open-source so other developers can learn from it.

The result is HOCKS AI — a multi-modal AI assistant platform.

🔗 Live: hocks.app
📦 Source: github.com/x-tahosin/hocks-ai

What It Does

Feature	AI Model	Monthly Cost
💬 Streaming Chat	OpenRouter GPT-OSS-120B (free)	$0
🌐 Website Generator	OpenRouter Nemotron-3 120B (free)	$0
🖼️ Image Analysis	Google Gemini 2.0 Flash	~$0.002/call
🎬 Video Analysis	Google Gemini 2.0 Flash	~$0.003/call
🧠 Memory System	Firebase Firestore	$0 (free tier)
🔐 Auth + Admin	Firebase Auth	$0

Total monthly cost: ~$0–5 depending on vision API usage.
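
To see where the ~$0–5 range comes from, here is the arithmetic as a tiny Python sketch. The per-call prices are the approximate figures from the table above. The call volumes are made-up examples, and monthly_vision_cost is a hypothetical helper, not part of the project:

```python
IMAGE_COST_PER_CALL = 0.002  # approx. Gemini 2.0 Flash image analysis, from the table
VIDEO_COST_PER_CALL = 0.003  # approx. Gemini 2.0 Flash video analysis, from the table

def monthly_vision_cost(image_calls: int, video_calls: int) -> float:
    """Estimated monthly spend in dollars; text features are assumed free."""
    total = image_calls * IMAGE_COST_PER_CALL + video_calls * VIDEO_COST_PER_CALL
    return round(total, 2)

# Hypothetical usage levels, purely for illustration.
light = monthly_vision_cost(image_calls=100, video_calls=50)
heavy = monthly_vision_cost(image_calls=1000, video_calls=1000)
print(f"light month: ${light:.2f}, heavy month: ${heavy:.2f}")
```

At these prices it takes on the order of a thousand image calls plus a thousand video calls in a month to reach the $5 end of the range.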

The Hybrid Model Strategy

This is the key architectural decision. Instead of paying for one expensive model for everything, I split by capability:

Free Models for Text Tasks

Chat + Code Generation → OpenRouter API
├── openai/gpt-oss-120b:free (120B params, conversational)
└── nvidia/nemotron-3-super-120b-a12b:free (code generation)

These free 120B parameter models are genuinely production-quality for text tasks. GPT-OSS-120B handles conversational AI beautifully — context tracking, nuanced responses, multi-turn dialogue. Nemotron-3 excels at code generation and can build full websites from prompts.

Paid Models for Vision Tasks

Image + Video Analysis → Google Gemini 2.0 Flash
├── analyzeImage (~$0.002/call)
└── analyzeVideo (~$0.003/call)

Free models simply can't match Gemini's multimodal capabilities yet. Image understanding, OCR, visual reasoning — Gemini 2.0 Flash delivers production-quality results at extremely low per-call costs.

Architecture Deep Dive

┌─────────────────────────────────────────────┐
│          Frontend (React 18 + Vite)         │
│         Firebase Hosting / hocks.app        │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│     Firebase Cloud Functions (Node 20)      │
├─────────────────────────────────────────────┤
│  streamChat ────► OpenRouter (GPT-OSS-120B) │
│  generateCode ──► OpenRouter (Nemotron-3)   │
│  analyzeImage ──► Google Gemini 2.0 Flash   │
│  analyzeVideo ──► Google Gemini 2.0 Flash   │
└──────────────────┬──────────────────────────┘
                   │
                   ▼
┌─────────────────────────────────────────────┐
│           Firebase Services                 │
│  • Firestore (users, memories, analytics)   │
│  • Authentication (Google + Email/Pass)     │
│  • Secret Manager (all API keys)            │
│  • Storage (file uploads)                   │
└─────────────────────────────────────────────┘

Key Design Decisions

1. Zero API Keys in Frontend

Every AI call is proxied through Firebase Cloud Functions. API keys live exclusively in Firebase Secret Manager — not in environment variables, not in .env files, not anywhere in client code.

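// Cloud Function reads secret at runtime
// (imports assume the @google/generative-ai SDK, which matches the getGenerativeModel call)
const { onCall } = require("firebase-functions/v2/https");
const { defineSecret } = require("firebase-functions/params");
const { GoogleGenerativeAI } = require("@google/generative-ai");

const geminiApiKey = defineSecret("GEMINI_API_KEY");

exports.analyzeImage = onCall(
  { secrets: [geminiApiKey] },
  async (request) => {
    // Key is only available server-side, and only inside the handler
    const genAI = new GoogleGenerativeAI(geminiApiKey.value());
    const model = genAI.getGenerativeModel({ model: "gemini-2.0-flash" });
    // ...
  }
);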

2. SSE Streaming for Real-Time Chat

Instead of waiting for the full response, the chat streams tokens in real-time using Server-Sent Events:

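// Server: Stream each chunk from OpenRouter
const reader = orResponse.body.getReader();
const decoder = new TextDecoder();
let fullText = "";
while (true) {
  const { done, value } = await reader.read();
  if (done) break;
  // Simplified: treats each decoded chunk as text. A full handler would
  // parse OpenRouter's own SSE frames here and pull out the token content.
  const text = decoder.decode(value, { stream: true });
  fullText += text;
  res.write(`data: ${JSON.stringify({ text, fullText })}\n\n`);
}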

// Client: Render as tokens arrive
eventSource.onmessage = (event) => {
  const { text } = JSON.parse(event.data);
  updateChatUI(text); // Instant visual feedback
};

3. Per-User Memory System

The AI remembers context across sessions. Users can save memories that persist in Firestore and are injected into every AI conversation:

// Inject memories into system prompt
let systemContent = SYSTEM_PROMPT;
if (memories.length > 0) {
  systemContent += "\n\n=== USER'S SAVED MEMORIES ===\n";
  memories.forEach((mem, i) => {
    systemContent += `${i + 1}. ${mem.content}\n`;
  });
}

4. Admin Dashboard with Cost Tracking

Built-in analytics track every API call in real-time:

Usage counters per feature (chat, image, video, website)
Daily cost breakdown with budget alerts
Feature toggles — disable any AI feature instantly
Audit logging for all admin actions

Security Architecture

Layer	Implementation
API Keys	Firebase Secret Manager (never in code)
Data Isolation	Firestore rules enforce per-user access
Admin Access	Custom claims + email verification
Authentication	Firebase Auth (Google + email/password)
Audit Trail	Every admin action logged with timestamp

Tech Stack

Layer	Technology
Frontend	React 18, Vite, CSS3 (Glassmorphism dark UI)
Backend	Firebase Cloud Functions (Node.js 20)
AI Engine	Google Gemini 2.0 Flash + OpenRouter (free models)
Database	Cloud Firestore
Auth	Firebase Authentication
Hosting	Firebase Hosting (custom domain)
Secrets	Firebase Secret Manager

Get Started in 5 Minutes

# Clone
git clone https://github.com/x-tahosin/hocks-ai.git
cd hocks-ai

# Install
cd functions && npm install && cd ..

# Set your API keys securely
firebase functions:secrets:set GEMINI_API_KEY
firebase functions:secrets:set OPENROUTER_API_KEY

# Deploy everything
firebase deploy

You need:

Node.js 20+
Firebase CLI (npm i -g firebase-tools)
A Gemini API key from ai.google.dev (free)
An OpenRouter API key from openrouter.ai (free models available)

What I Learned

Free AI models are production-viable: 120B-parameter models handle conversational AI surprisingly well
Hybrid strategies save money: use free models for text, paid models only for vision
Cloud Secret Manager > .env files: proper secret management matters in production
SSE streaming transforms UX: seeing the response arrive in real time feels dramatically better than waiting for the full reply
Cost tracking from day one: know exactly where every dollar goes
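
The hybrid strategy boils down to one routing decision per request. A rough sketch, assuming requests carry an optional list of image URLs; the `AiTask` shape and the tier labels are made up for illustration and are not real model IDs:

```typescript
// Hypothetical request shape and tier labels, for illustration only.
interface AiTask {
  prompt: string;
  imageUrls?: string[];
}

type Tier = "free-text" | "paid-vision";

// Route to the paid vision tier only when the task actually includes images.
function pickTier(task: AiTask): Tier {
  const hasImages = (task.imageUrls?.length ?? 0) > 0;
  return hasImages ? "paid-vision" : "free-text";
}
```

Keeping the decision in one small function also gives you a single place to log which tier each request used, which feeds straight into cost tracking.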

Try It

🔗 Live demo: hocks.app
📦 Source code: github.com/x-tahosin/hocks-ai
⭐ Star the repo if you find it useful!

What free AI models are you using in production? I'd love to hear about your hybrid model strategies in the comments.

5 TypeScript Patterns Every Developer Should Know in 2026

S M Tahosin — Mon, 20 Apr 2026 18:13:15 +0000

TypeScript has evolved massively. Here are 5 patterns I use daily that make my code bulletproof.

1. Discriminated Unions for State Management

type User = { id: string; name: string }; // placeholder; substitute your own model

type State =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: User[] }
  | { status: "error"; error: string };

function handleState(state: State) {
  switch (state.status) {
    case "success":
      return state.data; // TS knows data exists here
    case "error":
      return state.error; // TS knows error exists here
    default:
      return null; // "idle" and "loading" carry no payload
  }
}

The compiler narrows the type automatically. No more if (data !== undefined) everywhere.
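
A common companion to this pattern is an exhaustiveness check, so that adding a new variant to the union turns into a compile error wherever it isn't handled. A small sketch; the `User` shape, `assertNever`, and `describeState` are illustrative names, not from any library:

```typescript
// Stand-in User type for this sketch (an assumption).
type User = { id: string; name: string };

type State =
  | { status: "idle" }
  | { status: "loading" }
  | { status: "success"; data: User[] }
  | { status: "error"; error: string };

// The parameter type `never` only accepts a value once every variant has been ruled out.
// If a case is missing in the switch below, the call to assertNever stops compiling.
function assertNever(value: never): never {
  throw new Error(`Unhandled state: ${JSON.stringify(value)}`);
}

function describeState(state: State): string {
  switch (state.status) {
    case "idle":
      return "Nothing requested yet";
    case "loading":
      return "Loading...";
    case "success":
      return `Loaded ${state.data.length} users`;
    case "error":
      return `Failed: ${state.error}`;
    default:
      return assertNever(state);
  }
}
```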

2. `satisfies` for Type-Safe Configs

const config = {
  apiUrl: "https://api.example.com",
  timeout: 5000,
  retries: 3,
} satisfies Record<string, string | number>;

// config.apiUrl is still typed as string, not string | number
config.apiUrl.toUpperCase(); // ✅ Works!

satisfies validates the type without widening it.
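
To see the difference, compare the same object with a plain type annotation. In this sketch (the `Settings` alias and variable names are just for illustration), the annotated version widens every property to `string | number`, so you have to narrow before calling string methods, while the `satisfies` version keeps each property's own type:

```typescript
type Settings = Record<string, string | number>;

// Annotation: every property is widened to string | number.
const annotated: Settings = { apiUrl: "https://api.example.com", timeout: 5000 };
const rawUrl = annotated.apiUrl;
// Narrowing is required before a string method can be called.
const upperAnnotated = typeof rawUrl === "string" ? rawUrl.toUpperCase() : String(rawUrl);

// satisfies: the object is checked against Settings, but each property keeps its inferred type.
const checked = {
  apiUrl: "https://api.example.com",
  timeout: 5000,
} satisfies Settings;

const upperChecked: string = checked.apiUrl.toUpperCase(); // no narrowing needed
const doubledTimeout: number = checked.timeout * 2; // timeout is a number, not string | number
```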

3. Template Literal Types for API Routes

type ApiRoute = `/api/${string}`;
type UserRoute = `/api/users/${number}`;

function fetchApi(route: ApiRoute) { /* ... */ }

fetchApi("/api/users/123"); // ✅
fetchApi("/dashboard");      // ❌ Type error
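
Template literal types check string literals at compile time, but a path that arrives at runtime (from user input or config) is just a `string`. One way to bridge that gap is a type guard. A sketch, where `isApiRoute` and `toApiRoute` are hypothetical helper names:

```typescript
type ApiRoute = `/api/${string}`;

// Type guard: after this returns true, TypeScript treats `path` as an ApiRoute.
function isApiRoute(path: string): path is ApiRoute {
  return path.startsWith("/api/");
}

// Convert an arbitrary string to ApiRoute, or fail loudly.
function toApiRoute(path: string): ApiRoute {
  if (!isApiRoute(path)) {
    throw new Error(`Not an API route: ${path}`);
  }
  return path; // narrowed to ApiRoute by the guard above
}
```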

4. Const Assertions for Readonly Everything

const ROLES = ["admin", "user", "viewer"] as const;
type Role = (typeof ROLES)[number]; // "admin" | "user" | "viewer"

// Instead of: type Role = string
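
Because `ROLES` is a real array at runtime, the same constant can also validate untrusted input. A small sketch; `isRole` is a hypothetical helper name:

```typescript
const ROLES = ["admin", "user", "viewer"] as const;
type Role = (typeof ROLES)[number]; // "admin" | "user" | "viewer"

// Type guard backed by the same array the type is derived from,
// so the type and the runtime check cannot drift apart.
function isRole(value: string): value is Role {
  // Widen to readonly string[] so includes() accepts an arbitrary string.
  return (ROLES as readonly string[]).includes(value);
}
```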

5. Branded Types for Domain Safety

type UserId = string & { __brand: "UserId" };
type PostId = string & { __brand: "PostId" };

function getUser(id: UserId) { /* ... */ }
function getPost(id: PostId) { /* ... */ }

const userId = "abc" as UserId;
getUser(userId); // ✅
getPost(userId); // ❌ Type error — can't mix IDs
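
Scattering `as UserId` casts around the codebase defeats the purpose, so it can help to funnel every cast through one validating function. A sketch, where `toUserId` and its non-empty rule are assumptions chosen for illustration:

```typescript
type UserId = string & { __brand: "UserId" };

// The only place a raw string is cast to UserId.
// The validation rule (non-empty after trimming) is an assumed example.
function toUserId(raw: string): UserId {
  if (raw.trim() === "") {
    throw new Error("UserId must not be empty");
  }
  return raw as UserId;
}
```

The brand exists only in the type system; at runtime a `UserId` is still a plain string.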

Which TypeScript patterns do you use the most? Drop your favorites below!

Follow me for more TypeScript and AI content: @tahosin

5 Free AI APIs You Can Use Today (No Credit Card Required)

S M Tahosin — Sun, 19 Apr 2026 16:49:34 +0000

You don't need to pay OpenAI $20/month to build AI apps. Here are 5 completely free AI APIs you can start using right now.

1. Google Gemini API

Best for: Text generation, analysis, code generation

Free tier: 15 requests/minute, 1M tokens/day
Models: Gemini 2.0 Flash (fast), Gemini Pro (powerful)
Signup: ai.google.dev

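// API_KEY: your Gemini key from ai.google.dev; read it from an environment variable rather than hard-coding it
const res = await fetch(
  `https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent?key=${API_KEY}`,
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      contents: [{ parts: [{ text: "Explain quantum computing simply" }] }],
    }),
  }
);
// Surface HTTP errors (bad key, rate limit) instead of failing on a missing field below
if (!res.ok) throw new Error(`Gemini request failed with status ${res.status}`);
const data = await res.json();
console.log(data.candidates[0].content.parts[0].text);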

I built MaxAI Writer and EcoSense AI entirely on Gemini's free tier.
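
Indexing straight into `data.candidates[0].content.parts[0].text` throws if the response comes back without candidates. A more defensive sketch of the same extraction; the `GeminiResponse` interface only models the fields read in the snippet above and is an assumption, not the full API schema:

```typescript
// Minimal assumed shape: only the fields this helper reads.
interface GeminiResponse {
  candidates?: { content?: { parts?: { text?: string }[] } }[];
}

// Join the text of every part in the first candidate, or return null if there is none.
function extractText(data: GeminiResponse): string | null {
  const parts = data.candidates?.[0]?.content?.parts ?? [];
  const text = parts.map((part) => part.text ?? "").join("");
  return text === "" ? null : text;
}
```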

2. Hugging Face Inference API

Best for: Specialized models (sentiment, translation, image classification)

Free tier: Rate-limited, thousands of models
Signup: huggingface.co

3. Cloudflare Workers AI

Best for: Edge inference, low latency

Free tier: 10,000 neurons/day
Models: Llama, Whisper, Stable Diffusion

4. Groq

Best for: Fastest inference speeds

Free tier: 30 RPM on Llama models
Signup: console.groq.com

5. Cohere

Best for: Enterprise-grade text analysis, RAG

Free tier: 5 RPM, trial API key

Comparison

API	Best For	Free-Tier Limit	Signup
Google Gemini	General AI	15 RPM	Free
Hugging Face	Specialized models	Varies	Free
Cloudflare Workers AI	Edge inference	10K neurons/day	Free
Groq	Speed	30 RPM	Free
Cohere	Text analysis	5 RPM	Free
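
With limits this tight, it's worth throttling on your side before the provider rejects the request. Below is a minimal sliding-window sketch for the requests-per-minute limits in the table. `RpmLimiter` is a hypothetical helper, and the injectable clock is there only to make it easy to test.

```typescript
// Hypothetical client-side limiter: allows at most `limit` calls per rolling 60 seconds.
class RpmLimiter {
  private stamps: number[] = [];
  private limit: number;
  private now: () => number;

  constructor(limit: number, now: () => number = Date.now) {
    this.limit = limit;
    this.now = now;
  }

  // Returns true and records the call if there is room in the current window.
  tryAcquire(): boolean {
    const current = this.now();
    const cutoff = current - 60_000;
    // Drop timestamps that have aged out of the 60-second window.
    this.stamps = this.stamps.filter((stamp) => stamp > cutoff);
    if (this.stamps.length >= this.limit) return false;
    this.stamps.push(current);
    return true;
  }
}
```

For example, `new RpmLimiter(15)` would match the Gemini row and `new RpmLimiter(5)` the Cohere row; when `tryAcquire()` returns false, queue or delay the request instead of sending it.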

Which free API are you using? Drop a comment!