<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sreejit Pradhan</title>
    <description>The latest articles on DEV Community by Sreejit Pradhan (@sreejit_).</description>
    <link>https://dev.to/sreejit_</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3904430%2Fbd74576f-cef8-4620-a63e-8a001f1e9d6c.png</url>
      <title>DEV Community: Sreejit Pradhan</title>
      <link>https://dev.to/sreejit_</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sreejit_"/>
    <language>en</language>
    <item>
      <title>I Tested Every Gemma 4 Model on a GTX 1650. Here's What Actually Happened.</title>
      <dc:creator>Sreejit Pradhan</dc:creator>
      <pubDate>Mon, 11 May 2026 07:28:29 +0000</pubDate>
      <link>https://dev.to/sreejit_/i-tested-every-gemma-4-model-on-a-gtx-1650-heres-what-actually-happened-59gj</link>
      <guid>https://dev.to/sreejit_/i-tested-every-gemma-4-model-on-a-gtx-1650-heres-what-actually-happened-59gj</guid>
      <description>&lt;p&gt;&lt;em&gt;This is a submission for the &lt;a href="https://dev.to/challenges/google-gemma-2026-05-06"&gt;Gemma 4 Challenge: Write About Gemma 4&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;




&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TL;DR:&lt;/strong&gt; E4B is the model most developers should run locally. Here's why — tested on a GTX 1650 with real tasks, real numbers, and one bug it found that I didn't ask it to find.&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;A GTX 1650 is not an impressive GPU. 4GB of VRAM. A card that benchmarking sites politely describe as "entry-level." It's the kind of hardware that AI demos don't mention — because most AI demos are built for A100s or at least an RTX 4090.&lt;/p&gt;

&lt;p&gt;I mention this upfront because it's the whole point of this post.&lt;/p&gt;

&lt;p&gt;I ran Gemma 4 — two variants of it — on that GTX 1650. I gave it real tasks: a document to analyze, a bug to fix, a photo of handwritten notes to read. And somewhere between watching it handle a coding problem better than I'd planned to, and seeing it transcribe messy handwriting from a photo with no internet connection, I realized the story here isn't about benchmarks.&lt;/p&gt;

&lt;p&gt;It's about who gets to build with capable AI now.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why the Hardware Matters
&lt;/h2&gt;

&lt;p&gt;Before I get into what each model does, I want to make the case for why I'm leading with a GTX 1650 instead of a shiny workstation.&lt;/p&gt;

&lt;p&gt;Most local AI content is written for people who already have great hardware. "Runs on a single H100" is a spec that means nothing to 95% of developers. "Runs on your laptop's GPU" means everything — because that's the machine sitting on your desk right now.&lt;/p&gt;

&lt;p&gt;Gemma 4's model family was designed around a specific philosophy: &lt;strong&gt;every size tier should be the best model of its kind for the hardware it targets.&lt;/strong&gt; That's not marketing language. It's an architecture decision that shows up in the numbers when you actually run it.&lt;/p&gt;

&lt;p&gt;Here's what the family looks like:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Effective Params&lt;/th&gt;
&lt;th&gt;Context&lt;/th&gt;
&lt;th&gt;Modalities&lt;/th&gt;
&lt;th&gt;Targets&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~2B&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Text, Image, Video, &lt;strong&gt;Audio&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Phones, Pi, IoT&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E4B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~4B&lt;/td&gt;
&lt;td&gt;128K&lt;/td&gt;
&lt;td&gt;Text, Image, Video, &lt;strong&gt;Audio&lt;/strong&gt;
&lt;/td&gt;
&lt;td&gt;Laptops, dev machines&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;26B MoE&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~4B active / 26B loaded&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Text, Image, Video&lt;/td&gt;
&lt;td&gt;Workstations, Apple Silicon&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;31B Dense&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;31B&lt;/td&gt;
&lt;td&gt;256K&lt;/td&gt;
&lt;td&gt;Text, Image, Video&lt;/td&gt;
&lt;td&gt;GPU servers, cloud&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;




&lt;h2&gt;
  
  
  Quick Comparison
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;E2B&lt;/th&gt;
&lt;th&gt;E4B&lt;/th&gt;
&lt;th&gt;26B MoE&lt;/th&gt;
&lt;th&gt;31B Dense&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;On-device Friendly&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Audio Support&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Long Context (256K)&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;High-end Reasoning&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;🔥&lt;/td&gt;
&lt;td&gt;🔥🔥&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Best Efficiency&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;🔥&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Cloud-scale Deployment&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;⚠️&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;td&gt;🔥&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;I ran E2B and E4B locally. The 26B and 31B I tested via Google AI Studio. Everything that follows is what actually happened.&lt;/p&gt;




&lt;h2&gt;
  
  
  Benchmark Performance: The Numbers Behind the Claims
&lt;/h2&gt;

&lt;p&gt;I want to be upfront here: I didn't run standardized benchmarks myself — that would take days and dedicated hardware. What I'm sharing below comes from Google's official model card and the Arena AI leaderboard. But I've cross-referenced these with my own hands-on experience across the four tasks, and the numbers track with what I observed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Arena AI Leaderboard (Real Human Votes)
&lt;/h3&gt;

&lt;p&gt;This is the one I trust most. Arena AI ranks models through blind head-to-head comparisons voted on by real users — not automated scripts. You can't game it with careful prompt selection.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Arena Elo Score&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Gemma 4 31B&lt;/strong&gt; (thinking)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1452&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;Gemma 4 26B MoE&lt;/strong&gt; (thinking)&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;1441&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek-V3.2&lt;/td&gt;
&lt;td&gt;~1425&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen 3.5 27B&lt;/td&gt;
&lt;td&gt;1403&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemma 3 27B&lt;/td&gt;
&lt;td&gt;1365&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;An 87-point Elo gap between Gemma 4 31B and Gemma 3 27B is not incremental — it's a generational jump in a single release cycle. The 26B MoE is only 11 points behind the full dense model despite activating a fraction of the parameters. That gap is where the MoE efficiency story lives.&lt;/p&gt;
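&lt;p&gt;To make those gaps concrete: under the standard Elo model (this is the generic formula, not anything Arena-specific), a rating difference maps directly to an expected head-to-head win rate:&lt;/p&gt;

```python
def elo_win_probability(diff: float) -> float:
    """Expected win rate for the higher-rated model, given an Elo gap."""
    return 1 / (1 + 10 ** (-diff / 400))

# The gaps from the leaderboard above:
print(f"{elo_win_probability(87):.0%}")  # 31B vs Gemma 3 27B: 62%
print(f"{elo_win_probability(11):.0%}")  # 31B vs 26B MoE: 52%
```

&lt;p&gt;An 87-point gap means the 31B wins roughly 62% of blind matchups against its predecessor, while the 11-point gap between the two Gemma 4 flagships is close to a coin flip.&lt;/p&gt;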




&lt;h3&gt;
  
  
  Reasoning and Knowledge
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;E4B&lt;/th&gt;
&lt;th&gt;26B MoE&lt;/th&gt;
&lt;th&gt;31B Dense&lt;/th&gt;
&lt;th&gt;Gemma 3 27B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;MMLU Pro&lt;/strong&gt; (multi-task knowledge)&lt;/td&gt;
&lt;td&gt;69.4%&lt;/td&gt;
&lt;td&gt;82.6%&lt;/td&gt;
&lt;td&gt;85.2%&lt;/td&gt;
&lt;td&gt;67.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;GPQA Diamond&lt;/strong&gt; (expert science)&lt;/td&gt;
&lt;td&gt;58.6%&lt;/td&gt;
&lt;td&gt;82.3%&lt;/td&gt;
&lt;td&gt;84.3%&lt;/td&gt;
&lt;td&gt;42.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;AIME 2026&lt;/strong&gt; (competition math)&lt;/td&gt;
&lt;td&gt;42.5%&lt;/td&gt;
&lt;td&gt;88.3%&lt;/td&gt;
&lt;td&gt;89.2%&lt;/td&gt;
&lt;td&gt;20.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;BigBench Extra Hard&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33.1%&lt;/td&gt;
&lt;td&gt;64.8%&lt;/td&gt;
&lt;td&gt;74.4%&lt;/td&gt;
&lt;td&gt;19.3%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The AIME 2026 score deserves a moment. These are competition-level math problems that trip up most humans. Gemma 4 31B at 89.2% is extraordinary for any open model. The previous generation scored 20.8% — that's not an improvement, that's a different model category entirely.&lt;/p&gt;

&lt;p&gt;GPQA Diamond tests PhD-level scientific reasoning. Gemma 4 nearly doubled Gemma 3's score. I saw a smaller version of this in Task 1 — the document analysis caught contradictions that required actual reasoning, not just keyword matching.&lt;/p&gt;




&lt;h3&gt;
  
  
  Coding
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;E4B&lt;/th&gt;
&lt;th&gt;26B MoE&lt;/th&gt;
&lt;th&gt;31B Dense&lt;/th&gt;
&lt;th&gt;Gemma 3 27B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;LiveCodeBench v6&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;52.0%&lt;/td&gt;
&lt;td&gt;77.1%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;29.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Codeforces Elo&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;940&lt;/td&gt;
&lt;td&gt;1718&lt;/td&gt;
&lt;td&gt;2150&lt;/td&gt;
&lt;td&gt;110&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;LiveCodeBench uses fresh competitive programming problems — the model hasn't seen them during training, so there's no memorization at play. Going from 29.1% to 80.0% is nearly a 3× improvement.&lt;/p&gt;

&lt;p&gt;The Codeforces Elo puts this in human terms: Gemma 3's score of 110 was essentially beginner-level. Gemma 4 31B at 2150 is "Candidate Master" — a rank that takes human competitive programmers years to reach. The 26B MoE at 1718 ("Expert" rank) is impressive for a model that only fires 3.8B parameters per token.&lt;/p&gt;

&lt;p&gt;This maps directly to what I saw in Task 2: E4B didn't just clean up my code, it found a better architecture and caught a bug I hadn't asked it to find. These benchmark numbers explain why.&lt;/p&gt;
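&lt;p&gt;One caveat on the MoE point: only ~4B parameters fire per token, but all 26B must be resident in memory. A back-of-envelope sketch shows why the MoE runs at E4B-like speed yet targets workstations (assuming 4-bit quantized weights — my assumption, not an official figure):&lt;/p&gt;

```python
def weight_memory_gb(params_billions: float, bits_per_param: float = 4) -> float:
    """Approximate weight footprint: parameters x bits per parameter, in GB."""
    return params_billions * bits_per_param / 8

print(weight_memory_gb(26))  # 13.0 GB of weights -- workstation territory
print(weight_memory_gb(4))   # 2.0 GB -- the E4B-class footprint
```

&lt;p&gt;Per-token compute scales with the ~4B active parameters; memory scales with all 26B loaded. That asymmetry is the whole MoE trade.&lt;/p&gt;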




&lt;h3&gt;
  
  
  Vision
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;E4B&lt;/th&gt;
&lt;th&gt;26B MoE&lt;/th&gt;
&lt;th&gt;31B Dense&lt;/th&gt;
&lt;th&gt;Gemma 3 27B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;MMMU Pro&lt;/strong&gt; (multimodal reasoning)&lt;/td&gt;
&lt;td&gt;52.6%&lt;/td&gt;
&lt;td&gt;73.8%&lt;/td&gt;
&lt;td&gt;76.9%&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;MATH-Vision&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;59.5%&lt;/td&gt;
&lt;td&gt;82.4%&lt;/td&gt;
&lt;td&gt;85.6%&lt;/td&gt;
&lt;td&gt;46.0%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;OmniDocBench&lt;/strong&gt; (error rate ↓)&lt;/td&gt;
&lt;td&gt;0.181&lt;/td&gt;
&lt;td&gt;0.149&lt;/td&gt;
&lt;td&gt;0.131&lt;/td&gt;
&lt;td&gt;0.365&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;OmniDocBench measures document understanding accuracy — lower is better. Gemma 4 31B cut Gemma 3's error rate by nearly two-thirds. For E4B, the error rate of 0.181 still represents a massive improvement over the previous generation, and it's consistent with my handwriting transcription test: 90% accuracy on messy notes is real-world OmniDocBench territory.&lt;/p&gt;




&lt;h3&gt;
  
  
  Agentic Tool Use
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;E4B&lt;/th&gt;
&lt;th&gt;26B MoE&lt;/th&gt;
&lt;th&gt;31B Dense&lt;/th&gt;
&lt;th&gt;Gemma 3 27B&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;strong&gt;τ2-bench&lt;/strong&gt; (avg. 3 domains)&lt;/td&gt;
&lt;td&gt;42.2%&lt;/td&gt;
&lt;td&gt;68.2%&lt;/td&gt;
&lt;td&gt;76.9%&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;τ2-bench simulates real agentic scenarios — retail, airlines, multi-step tool use — where the model must act, not just respond. Gemma 3 at 16.2% was essentially unusable for autonomous agents. Gemma 4 31B at 76.9% is a model you can actually build workflows around.&lt;/p&gt;




&lt;h3&gt;
  
  
  The Generational Leap at a Glance
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Benchmark&lt;/th&gt;
&lt;th&gt;Gemma 3 27B&lt;/th&gt;
&lt;th&gt;Gemma 4 31B&lt;/th&gt;
&lt;th&gt;Jump&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;MMLU Pro&lt;/td&gt;
&lt;td&gt;67.6%&lt;/td&gt;
&lt;td&gt;85.2%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+17.6 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AIME 2026&lt;/td&gt;
&lt;td&gt;20.8%&lt;/td&gt;
&lt;td&gt;89.2%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+68.4 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;LiveCodeBench v6&lt;/td&gt;
&lt;td&gt;29.1%&lt;/td&gt;
&lt;td&gt;80.0%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+50.9 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPQA Diamond&lt;/td&gt;
&lt;td&gt;42.4%&lt;/td&gt;
&lt;td&gt;84.3%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+41.9 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MMMU Pro&lt;/td&gt;
&lt;td&gt;49.7%&lt;/td&gt;
&lt;td&gt;76.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+27.2 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;τ2-bench&lt;/td&gt;
&lt;td&gt;16.2%&lt;/td&gt;
&lt;td&gt;76.9%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+60.7 pts&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;These aren't incremental gains. Math, coding, and agentic benchmarks improved by 50–68 percentage points in a single generation. That's not a version bump — that's a new category of model wearing the same name.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Benchmark data sourced from Google's official Gemma 4 model card and the Arena AI leaderboard (April 2026).&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up (Faster Than You Think)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Install Ollama: https://ollama.com/download&lt;/span&gt;
&lt;span class="c"&gt;# Then pull whichever model fits your hardware:&lt;/span&gt;

ollama pull gemma4:e2b    &lt;span class="c"&gt;# ~1.4 GB&lt;/span&gt;
ollama pull gemma4:e4b    &lt;span class="c"&gt;# ~2.5 GB&lt;/span&gt;

ollama run gemma4:e4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's it. No Python environment. No CUDA configuration rabbit hole. No API key. The first time I ran this I kept waiting for something to break. It didn't.&lt;/p&gt;

&lt;p&gt;On my GTX 1650 with 4GB VRAM, Ollama automatically offloads layers between GPU and CPU. E2B fits mostly on the GPU. E4B splits across GPU and RAM. Neither one complained about the hardware — they just ran.&lt;/p&gt;

&lt;p&gt;You can browse all available Gemma 4 variants on the &lt;a href="https://ollama.com/library/gemma4" rel="noopener noreferrer"&gt;Ollama model library&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Results at a Glance
&lt;/h2&gt;

&lt;p&gt;Before diving into each task, here's what I actually observed on my machine:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Token Speed&lt;/th&gt;
&lt;th&gt;VRAM Used&lt;/th&gt;
&lt;th&gt;Handwriting Accuracy&lt;/th&gt;
&lt;th&gt;First Token&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E2B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~35 tok/s&lt;/td&gt;
&lt;td&gt;~2.5 GB&lt;/td&gt;
&lt;td&gt;~72%&lt;/td&gt;
&lt;td&gt;&amp;lt;2s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;E4B&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;~22 tok/s&lt;/td&gt;
&lt;td&gt;~3.8 GB&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;td&gt;&amp;lt;3s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;E4B is slower but meaningfully smarter. Whether that trade-off is worth it depends entirely on your task — which is exactly what the next four sections are about.&lt;/p&gt;
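&lt;p&gt;If you want to measure speeds like these yourself: Ollama's non-streaming responses include &lt;code&gt;eval_count&lt;/code&gt; (tokens generated) and &lt;code&gt;eval_duration&lt;/code&gt; (generation time in nanoseconds). A small helper turns those into tok/s — sketched here against a plain metadata dict:&lt;/p&gt;

```python
def tokens_per_second(resp: dict) -> float:
    """Decode speed from Ollama response metadata (eval_duration is in ns)."""
    return resp["eval_count"] / (resp["eval_duration"] / 1e9)

# Hypothetical metadata from one run: 220 tokens in 10 seconds.
resp = {"eval_count": 220, "eval_duration": 10_000_000_000}
print(round(tokens_per_second(resp), 1))  # 22.0
```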




&lt;h2&gt;
  
  
  Task 1: Analyzing a PDF Document
&lt;/h2&gt;

&lt;p&gt;I had a lengthy technical specification document — the kind with dense paragraphs, tables, and section references that make your eyes glaze over. I needed a summary and a list of open questions the document raised but didn't answer.&lt;/p&gt;

&lt;p&gt;I extracted the text from the PDF into a plain-text file and fed it to E4B using Ollama's API:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;spec_document.txt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;doc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gemma4:e4b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Here is a technical specification document:

&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;doc&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;

Please:
1. Summarize the key decisions made in this document in bullet points
2. List any open questions or ambiguities the document raises but doesn&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;t resolve&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The summary was tight and accurate. But what stood out was the second part — the open questions. It didn't just list vague gaps. &lt;strong&gt;It identified specific contradictions between sections, places where a term was used inconsistently, and one assumption that was stated in the introduction but quietly abandoned midway through.&lt;/strong&gt; Those were real issues. Issues I'd skimmed past.&lt;/p&gt;

&lt;p&gt;That's not retrieval. That's reasoning over a document. On a GTX 1650.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2B on the same task:&lt;/strong&gt; Handled the summary well. The open questions were shallower — it caught the obvious gaps but missed the subtle cross-section contradiction. Useful, but the ceiling is lower.&lt;/p&gt;
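&lt;p&gt;One practical note before you try this on your own documents: E4B's 128K-token context is generous, but a long spec can still blow past it. A rough pre-check — using ~4 characters per token for English prose, which is a heuristic, not a real tokenizer:&lt;/p&gt;

```python
def fits_context(text: str, context_tokens: int = 128_000,
                 chars_per_token: float = 4.0) -> bool:
    """Estimate the token count from character count and compare to the window."""
    return context_tokens >= len(text) / chars_per_token

print(fits_context("x" * 400_000))  # ~100K estimated tokens -> True
print(fits_context("x" * 600_000))  # ~150K estimated tokens -> False
```

&lt;p&gt;If the estimate is over budget, split the document and summarize in passes rather than letting the model silently truncate.&lt;/p&gt;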




&lt;h2&gt;
  
  
  Task 2: The Coding Problem
&lt;/h2&gt;

&lt;p&gt;I had a Python function that processed a list of user events and calculated streaks — consecutive days of activity. My implementation worked but felt clunky: nested loops, a flag variable, the kind of code that passes code review but makes you wince when you revisit it three months later.&lt;/p&gt;

&lt;p&gt;I asked E4B to review it and suggest improvements.&lt;/p&gt;

&lt;p&gt;It didn't just clean up my loops. It came back with a completely different approach using &lt;code&gt;itertools.groupby&lt;/code&gt; combined with a date-differencing trick that collapsed the whole thing into a few clean lines. The logic was tighter, the intent was clearer, and — I checked — it handled edge cases my version had silently gotten wrong.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I hadn't asked it to find bugs. I'd asked for improvements. It found a bug anyway because a better structure made the bug visible.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Review&lt;/span&gt; &lt;span class="n"&gt;this&lt;/span&gt; &lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="n"&gt;function&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;suggest&lt;/span&gt; &lt;span class="n"&gt;improvements&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_streak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;max_streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
    &lt;span class="n"&gt;prev_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]):&lt;/span&gt;
        &lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;prev_date&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;date&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;prev_date&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;streak&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
        &lt;span class="n"&gt;max_streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_streak&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;streak&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prev_date&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;date&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;max_streak&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;E4B's response (key part):&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;itertools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;groupby&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;timedelta&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;calculate_streak&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;

    &lt;span class="n"&gt;dates&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;sorted&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;  &lt;span class="c1"&gt;# deduplicate dates
&lt;/span&gt;
    &lt;span class="n"&gt;max_streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;)):&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;dates&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;streak&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
            &lt;span class="n"&gt;max_streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;max&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_streak&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;streak&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;streak&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;max_streak&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It noted that my original didn't deduplicate dates, so if a user had two events on the same day, the streak count would break. That was a real bug I hadn't noticed.&lt;/p&gt;
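&lt;p&gt;The bug is easy to reproduce. Here's a minimal demonstration with hypothetical dates, comparing the original loop against any variant that deduplicates dates first:&lt;/p&gt;

```python
from datetime import date, timedelta

def streak_original(events):
    # The original loop: a second event on the same day resets the run.
    streak = max_streak = 0
    prev_date = None
    for event in sorted(events, key=lambda x: x['date']):
        d = event['date']
        if prev_date and (d - prev_date).days == 1:
            streak += 1
        else:
            streak = 1
        max_streak = max(max_streak, streak)
        prev_date = d
    return max_streak

def streak_deduplicated(events):
    # Deduplicate dates first, then count consecutive runs.
    dates = sorted({e['date'] for e in events})
    if not dates:
        return 0
    max_streak = streak = 1
    for prev, curr in zip(dates, dates[1:]):
        streak = streak + 1 if curr - prev == timedelta(days=1) else 1
        max_streak = max(max_streak, streak)
    return max_streak

# Three consecutive days, with two events logged on day 2:
events = [{'date': date(2026, 5, d)} for d in (1, 2, 2, 3)]
print(streak_original(events))      # 2 -- the duplicate resets the streak
print(streak_deduplicated(events))  # 3 -- correct
```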

&lt;p&gt;&lt;strong&gt;E2B on the same task:&lt;/strong&gt; Suggested sensible variable renames and added a docstring. Didn't find the bug. Didn't suggest the architectural improvement. This is the clearest demonstration I found of where the extra effective parameters in E4B actually show up — not in speed, but in the &lt;em&gt;depth&lt;/em&gt; of what it notices.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 3: Reading Handwritten Notes From a Photo
&lt;/h2&gt;

&lt;p&gt;This is the one that made me stop and stare at the screen for a second.&lt;/p&gt;

&lt;p&gt;I took a photo of handwritten notes — the kind of scrawled, uneven writing you do when you're thinking fast. Arrows connecting ideas. Words crossed out and rewritten. Abbreviations that made sense at the time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;ollama&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;gemma4:e4b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Transcribe all the text in this image, including crossed-out words. Then summarize the main ideas.&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;images&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;./notes_photo.jpg&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;}]&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It transcribed around 90% of the words correctly, including several that I would have described as illegible to a stranger. It correctly identified two crossed-out phrases and labeled them as such. The summary captured the actual ideas, not just the words.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This ran completely offline.&lt;/strong&gt; No API call. No image being uploaded to a server somewhere. My notes — which contained half-formed ideas I wouldn't want indexed anywhere — stayed on my machine.&lt;/p&gt;

&lt;p&gt;That's the detail I keep coming back to. The capability isn't new. Cloud OCR and vision APIs have done this for years. What's new is the location. It's &lt;em&gt;here&lt;/em&gt;, on hardware that cost a few hundred dollars, with no ongoing cost and no data leaving the device.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;E2B on the same task:&lt;/strong&gt; Transcription accuracy dropped to around 70-75%. The summary was reasonable but missed one of the three main ideas entirely. For clean, printed documents E2B would be fine. For messy handwriting, E4B is meaningfully better.&lt;/p&gt;




&lt;h2&gt;
  
  
  Task 4: Creative Writing
&lt;/h2&gt;

&lt;p&gt;I asked both models to write the opening paragraph of a short story with a specific constraint: the main character's emotional state could only be shown through their physical actions, never stated directly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;My prompt:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Write the opening paragraph of a short story. Rule: never state 
the character's emotions directly. Show them only through 
physical actions and behaviour.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;E4B's response:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;She lined up the coffee mugs by handle direction before she'd even taken her coat off. Three mugs, all facing left, then she moved the middle one a quarter-inch to the right, then back. The kettle had already boiled. She didn't touch it.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That paragraph understood the constraint and served it. The anxiety is never named — it's in the compulsive rearranging, the boiled kettle she can't bring herself to use. That's craft, not just instruction-following.&lt;/p&gt;

&lt;p&gt;E2B produced something more literal — actions listed in sequence, readable but without the subtext. Competent, not nuanced.&lt;/p&gt;

&lt;p&gt;For tasks where tone and craft matter — marketing copy, story generation, user-facing text — that gap between the two models is real and worth knowing about before you choose.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Real Comparison: When to Use Which
&lt;/h2&gt;

&lt;p&gt;After running all four tasks, here's my honest take on the decision:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose E2B when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're deploying to a device with under 4GB RAM&lt;/li&gt;
&lt;li&gt;You need audio input — it's exclusive to the edge models&lt;/li&gt;
&lt;li&gt;Your tasks are extraction, classification, summarization of clean text&lt;/li&gt;
&lt;li&gt;Offline, on-device operation is non-negotiable and you can't spare more resources&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose E4B when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're on a developer laptop or a GPU with 4–8GB VRAM (yes, a 1650 works)&lt;/li&gt;
&lt;li&gt;You need multimodal — images, handwriting, documents, audio&lt;/li&gt;
&lt;li&gt;Your tasks require actual reasoning: code review, document analysis, nuanced writing&lt;/li&gt;
&lt;li&gt;You want the best local model that runs on typical developer hardware without compromise&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose 26B MoE when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have 16GB+ RAM or Apple Silicon&lt;/li&gt;
&lt;li&gt;You need 256K context (full repos, long documents)&lt;/li&gt;
&lt;li&gt;You want near-31B quality at something close to E4B speed — the MoE architecture earns its place here&lt;/li&gt;
&lt;li&gt;It's currently ranked &lt;strong&gt;#6 on the open model leaderboard&lt;/strong&gt;, outperforming far larger models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Choose 31B Dense when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're deploying server-side with dedicated GPU resources&lt;/li&gt;
&lt;li&gt;You need the absolute ceiling of open-model quality&lt;/li&gt;
&lt;li&gt;It's currently ranked &lt;strong&gt;#3 among all open models&lt;/strong&gt; on the leaderboard&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What This Actually Changes
&lt;/h2&gt;

&lt;p&gt;I want to end on something that isn't a spec or a benchmark.&lt;/p&gt;

&lt;p&gt;There's a version of local AI that's been available for a while — open models that technically run on your hardware but require you to accept that you're getting a worse result than the cloud API. You'd use it for offline demos, for prototypes, for cases where privacy was mandatory and quality was a secondary concern.&lt;/p&gt;

&lt;p&gt;Gemma 4 is not that. &lt;strong&gt;E4B caught a bug I missed. It transcribed handwriting I would have doubted it could read. It found a better architecture for my code than I was planning to write.&lt;/strong&gt; These are not "good for a local model" results. These are good results.&lt;/p&gt;

&lt;p&gt;The GTX 1650 on my desk is three or four GPU generations old. It's the kind of card that serious ML practitioners apologize for owning. And it ran a model that did genuinely useful work across every task I threw at it — with no internet connection, no API key, no monthly bill, and no copy of my documents sitting on someone else's server.&lt;/p&gt;

&lt;p&gt;That's not a benchmark. That's a change in what's possible. And it's available right now, for free, to anyone with a halfway-decent laptop.&lt;/p&gt;

&lt;p&gt;What I'm curious to explore next: building a local RAG pipeline with E4B as the backbone, and testing audio input on E2B for a voice-triggered assistant. The 128K context window makes both genuinely interesting.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama pull gemma4:e4b
ollama run gemma4:e4b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pull it. Give it something real to do. See what happens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you run this on your own hardware, drop your token speeds and VRAM numbers in the comments — I'm curious how it performs across different setups.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All code from this post is available as a &lt;a href="https://gist.github.com/ogMaverick12/4f3369600d6633ab77b9c144c4eea18e" rel="noopener noreferrer"&gt;GitHub Gist&lt;/a&gt; if you want to run it directly.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tested locally on Windows with a GTX 1650 (4GB VRAM) and 16GB system RAM using Ollama. 26B and 31B tested via Google AI Studio. Model specs from Google DeepMind and Hugging Face documentation. Leaderboard rankings from Arena AI at time of writing.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>ai</category>
    </item>
    <item>
      <title>Everyone's Talking About Gemini. The Real Story at Google Cloud NEXT '26 Was GKE Agent Sandbox.</title>
      <dc:creator>Sreejit Pradhan</dc:creator>
      <pubDate>Wed, 29 Apr 2026 13:46:47 +0000</pubDate>
      <link>https://dev.to/sreejit_/everyones-talking-about-gemini-the-real-story-at-google-cloud-next-26-was-gke-agent-sandbox-19g2</link>
      <guid>https://dev.to/sreejit_/everyones-talking-about-gemini-the-real-story-at-google-cloud-next-26-was-gke-agent-sandbox-19g2</guid>
      <description>&lt;p&gt;Google Cloud NEXT '26 had one clear headline: the &lt;strong&gt;Gemini Enterprise Agent Platform&lt;/strong&gt;. A full-stack rebrand of Vertex AI. An Agent Designer. Long-running agents with persistent memory. Fancy demos with Unilever and Team USA. Thomas Kurian stood on stage in Las Vegas and told 32,000 people that we've left the AI pilot era behind.&lt;/p&gt;

&lt;p&gt;He's right. But if you want to understand &lt;em&gt;why&lt;/em&gt; that transition is actually possible now — technically, mechanically, in production — you need to look at an announcement that got maybe a tenth of the keynote airtime.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GKE Agent Sandbox.&lt;/strong&gt; Now GA.&lt;/p&gt;

&lt;p&gt;Let me tell you why I think this is the most important thing Google shipped at NEXT '26, and why developers building agent workloads should care about it before they care about any of the shiny stuff.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Problem Nobody Likes Talking About
&lt;/h2&gt;

&lt;p&gt;Every agent tutorial ends the same way: the agent reasons about what to do, writes some code, and... executes it. Usually with an &lt;code&gt;exec()&lt;/code&gt; call or a subprocess. Usually directly on whatever machine is running your app.&lt;/p&gt;

&lt;p&gt;If you've shipped this to production, you already know the existential dread that comes with it. LLM-generated code is &lt;strong&gt;fundamentally untrusted&lt;/strong&gt; — it's not code a human engineer reviewed. It could write to the wrong path. It could make outbound network calls. It could loop forever and eat your CPU. And in any multi-tenant environment, one agent's bad output could poison another's environment entirely.&lt;/p&gt;
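&lt;p&gt;To make that concrete: here's the naive pattern most tutorials ship, with the one safeguard it usually gets, a timeout. This is a deliberately minimal sketch, not production advice; the subprocess caps runaway loops, but nothing in it stops file writes or outbound network calls.&lt;/p&gt;

```python
import subprocess
import sys

def run_agent_code(code: str, timeout_s: float = 5.0) -> str:
    """Execute LLM-generated Python in a subprocess on the host.

    The timeout stops infinite loops, but the child still runs as
    the host user: it can write files, open sockets, and make any
    syscall. That gap is what kernel-level isolation addresses.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True, text=True, timeout=timeout_s,
    )
    return result.stdout

print(run_agent_code("print(2 + 2)"))  # 4
```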

&lt;p&gt;The "solutions" most teams reach for aren't great:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Human review gates&lt;/strong&gt;: Defeats the point of automation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Strict output parsers&lt;/strong&gt;: Brittle, breaks with model updates.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Full VMs per agent&lt;/strong&gt;: Slow (10-30s cold start), expensive, operationally heavy.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Docker containers&lt;/strong&gt;: Better, but containers share the host kernel — gVisor or similar isolation still isn't there by default.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So most teams just... accept the risk and move fast. Which works until it doesn't.&lt;/p&gt;




&lt;h2&gt;
  
  
  What GKE Agent Sandbox Actually Does
&lt;/h2&gt;

&lt;p&gt;GKE Agent Sandbox is a GKE add-on that gives you isolated, stateful, single-replica environments for agent code execution — with kernel-level isolation via &lt;strong&gt;gVisor&lt;/strong&gt;, and provisioning speed that actually fits real-time workloads.&lt;/p&gt;

&lt;p&gt;Here are the numbers that matter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sub-second time to first instruction&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;300 sandboxes per second, per cluster&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;~30% better price-performance&lt;/strong&gt; on Axion N4A vs. the next leading hyperscaler&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those first two bullets are what change the equation. When your options were "fast but unsafe" or "safe but slow," teams picked fast. Now the tradeoff is gone. Sub-second isolation means you can sandbox every single agent tool call without your users noticing.&lt;/p&gt;

&lt;p&gt;The architecture is clean. Each sandbox is represented by a Kubernetes CRD (the &lt;code&gt;Sandbox&lt;/code&gt; resource). A controller manages lifecycle — creation, stable identity, networking, and storage. A &lt;strong&gt;Sandbox Router&lt;/strong&gt; gives each sandbox a stable endpoint, so you can route traffic to it without your application needing to track Pod IPs. The whole thing sits on Kubernetes primitives, so if you already operate GKE, there's no new mental model to learn.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# This is the level of simplicity we're talking about&lt;/span&gt;
&lt;span class="na"&gt;`apiVersion&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;sandbox.gke.io/v1&lt;/span&gt;
&lt;span class="na"&gt;kind&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;Sandbox&lt;/span&gt;
&lt;span class="na"&gt;metadata&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;agent-task-abc123&lt;/span&gt;
&lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;template&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;spec&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;containers&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;executor&lt;/span&gt;
          &lt;span class="na"&gt;image&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;my-agent-executor:latest&lt;/span&gt;
          &lt;span class="na"&gt;runtimeClassName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;gvisor`&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  The Claim Model: Why the API Design is Good, Actually
&lt;/h2&gt;

&lt;p&gt;One design decision I want to highlight: the &lt;strong&gt;Claim Model&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In a standard Kubernetes StatefulSet, if you want an isolated Pod, you manage the Pod directly — you know its name, you track its IP, you handle restarts. That's fine for databases. It's a nightmare for ephemeral agent sandboxes that might be created and destroyed thousands of times per hour.&lt;/p&gt;

&lt;p&gt;The Claim Model separates &lt;em&gt;asking for a sandbox&lt;/em&gt; from &lt;em&gt;knowing where the sandbox lives&lt;/em&gt;. Your application says "I need an environment for this task" — the controller handles placement, node assignment, network identity, and volume binding. You get back a stable endpoint via the Sandbox Router. You never touch the underlying Pod.&lt;/p&gt;

&lt;p&gt;This is the same pattern that made PersistentVolumeClaims a developer-friendly abstraction over storage. It's the right call for agent environments too, and I'm glad they shipped it this way rather than just exposing raw StatefulSet management.&lt;/p&gt;
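&lt;p&gt;In application code, that separation means you only ever build and submit a manifest, then talk to the endpoint the router hands back. Here's a sketch of the manifest half, adapted from the CRD example earlier in the post (the helper function and its shape are mine, not the SDK's API):&lt;/p&gt;

```python
def sandbox_manifest(task_id: str, image: str) -> dict:
    """Build a Sandbox manifest; the field layout is adapted from
    the sandbox.gke.io/v1 CRD example earlier in the post."""
    return {
        "apiVersion": "sandbox.gke.io/v1",
        "kind": "Sandbox",
        "metadata": {"name": f"agent-task-{task_id}"},
        "spec": {
            "template": {
                "spec": {
                    # Pod-level field: every container in the sandbox
                    # runs under the gVisor runtime class.
                    "runtimeClassName": "gvisor",
                    "containers": [{"name": "executor", "image": image}],
                }
            }
        },
    }

# Submit this via the SDK or any Kubernetes client; the controller
# handles placement, and you never track the underlying Pod yourself.
manifest = sandbox_manifest("abc123", "my-agent-executor:latest")
print(manifest["metadata"]["name"])  # agent-task-abc123
```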




&lt;h2&gt;
  
  
  Pause and Resume: The Underappreciated Feature
&lt;/h2&gt;

&lt;p&gt;Long-running agents — tasks that take hours, involve many steps, and need to wait for external signals — are a cornerstone of the NEXT '26 pitch. Google showed demos of agents running procurement analysis, sequencing sales follow-ups, doing overnight reconciliation work.&lt;/p&gt;

&lt;p&gt;Those agents need to &lt;em&gt;wait&lt;/em&gt; sometimes. Waiting while holding a hot container wastes money and compute.&lt;/p&gt;

&lt;p&gt;GKE Agent Sandbox integrates with GKE &lt;strong&gt;Pod Snapshots&lt;/strong&gt;: you can pause a sandbox, serialize its full in-memory state, and resume it later from exactly where it left off. An agent paused mid-reasoning picks up where it stopped. No re-running from the beginning, no "the agent forgot what it was doing."&lt;/p&gt;

&lt;p&gt;For genuinely long-horizon agentic tasks, this is table stakes. It's good that it shipped alongside the sandbox, not as a follow-up feature.&lt;/p&gt;




&lt;h2&gt;
  
  
  Real Signal: 200,000 Projects a Day at Lovable
&lt;/h2&gt;

&lt;p&gt;When companies drop GA announcements, they usually come with a reference customer. The sandbox got &lt;strong&gt;Lovable&lt;/strong&gt; — the AI-powered web app builder that spins up isolated development environments on demand, constantly.&lt;/p&gt;

&lt;p&gt;200,000 new projects per day. Each one needs an isolated environment. That's the exact workload GKE Agent Sandbox was built for, and it's running in production.&lt;/p&gt;

&lt;p&gt;That's not a beta signal. That's a "we've already stress-tested this at scale" signal. It matters.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where I'd Push Back
&lt;/h2&gt;

&lt;p&gt;I want to be honest about what the sandbox &lt;em&gt;doesn't&lt;/em&gt; solve.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;gVisor isolates syscalls. It doesn't isolate intent.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If an agent's generated code makes an outbound HTTPS call to exfiltrate data, gVisor won't stop it — that's a valid syscall. If the agent calls an external API with destructive side effects, the isolation layer doesn't know. The sandbox keeps your host kernel safe. It doesn't make your agent &lt;em&gt;trustworthy&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;The answer to that problem is network policy + egress controls + Google's Agent Gateway and Agent Identity features — but the integration story between sandbox-level networking constraints and agent-level permission scoping is still evolving. The documentation is thin on "here is exactly how you configure an agent sandbox to only be able to call APIs X and Y." That's the gap I'll be watching in the months after NEXT.&lt;/p&gt;

&lt;p&gt;Also: the 30% price-performance claim is on &lt;strong&gt;Axion N4A&lt;/strong&gt; specifically. If your workloads run on N2 or C3 instances today for other reasons, the economics look different. Run your own numbers before accepting the headline.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why This Matters More Than the Platform Announcements
&lt;/h2&gt;

&lt;p&gt;Gemini Enterprise Agent Platform is a product. It will evolve. Features will be added, deprecated, rebranded. The roadmap will change.&lt;/p&gt;

&lt;p&gt;GKE Agent Sandbox is a &lt;strong&gt;primitive&lt;/strong&gt;. Infrastructure primitives have a way of outlasting the products built on top of them. When Kubernetes PersistentVolumes shipped, nobody predicted all the ways stateful workloads would eventually use them. When Firecracker shipped at AWS, "fast microVMs" unlocked Lambda use cases that weren't in the original vision.&lt;/p&gt;

&lt;p&gt;The same will happen here. Sub-second, gVisor-isolated, Kubernetes-native ephemeral environments will enable workloads nobody has built yet — not just AI agents. Interactive notebooks that auto-provision per user. Secure eval sandboxes for CI pipelines. Per-request isolation for multi-tenant developer tools.&lt;/p&gt;

&lt;p&gt;Google built a tool for their agent story. Developers will use it for ten other things.&lt;/p&gt;




&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;If you want to try it yourself:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Enable the add-on&lt;/strong&gt; on an existing GKE cluster (Autopilot support is coming; for now, Standard clusters with Axion N4A nodes get the best price-performance).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Install the Agent Sandbox Python SDK&lt;/strong&gt; from GitHub for programmatic sandbox management without dealing with raw Kubernetes resources.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Start with the Claim Model&lt;/strong&gt; — request sandboxes declaratively and let the controller handle placement. Don't reach for raw StatefulSets.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Set egress policies immediately&lt;/strong&gt; — don't leave sandbox network access open while you prototype. The habit is easier to build early.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The docs are at &lt;a href="https://cloud.google.com/kubernetes-engine/docs/concepts/machine-learning/agent-sandbox" rel="noopener noreferrer"&gt;cloud.google.com/kubernetes-engine/docs/concepts/machine-learning/agent-sandbox&lt;/a&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Take
&lt;/h2&gt;

&lt;p&gt;Google Cloud NEXT '26 was the conference where "we're exploring AI" became "we're running AI in production." The Gemini Enterprise Agent Platform got the keynote. The TPU 8t got the infrastructure spotlight. The Agentic Data Cloud got the data engineering talk.&lt;/p&gt;

&lt;p&gt;GKE Agent Sandbox got a slide and a bullet point in a 260-item wrap-up post.&lt;/p&gt;

&lt;p&gt;That's fine. The best infrastructure ships quietly and lets the workloads speak for themselves. 200,000 sandboxes a day at Lovable is speaking pretty loudly already.&lt;/p&gt;

&lt;p&gt;If you're building agents that execute code, I'd spend less time this week exploring the Agent Designer UI and more time reading the gVisor isolation docs. The platform is impressive. The primitive is what makes it real.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Tried GKE Agent Sandbox already? Drop your experience in the comments — especially curious whether anyone has wired up egress controls end-to-end.&lt;/em&gt;&lt;/p&gt;


</description>
      <category>cloudnextchallenge</category>
      <category>googlecloud</category>
      <category>ai</category>
      <category>devchallenge</category>
    </item>
  </channel>
</rss>
