DEV Community: Arqam Waheed

Logging workouts is solved. I'm building what comes after.

Arqam Waheed — Mon, 13 Jul 2026 22:45:04 +0000

Every workout tracker I've tried has the same limitation: it records what you did, but it doesn't tell you what's going wrong.

You finish a workout, log your sets, reps, and weight, then the app stores the data and that's about it. If you've stopped progressing, are consistently training too close to failure, or need a deload, you're left to figure that out yourself.

That's what led me to build WhyRep.

It's a workout tracker with a built-in coach that analyzes your training and explains what's holding you back. The key difference is that every coaching decision has to trace back to a methodology I wrote and approved beforehand, never something an LLM made up on the spot.

For context, I've spent the last three years studying exercise science with a focus on muscle hypertrophy. Rather than asking an AI to invent programming, I write and validate the coaching methodology first, then use AI to explain those decisions in a conversational way.

How the "AI coach" actually works (probably not what you'd guess)

I'm not fine-tuning a model on hypertrophy data and hoping it generalizes. The pipeline is closer to this:

I write the methodology myself first, leaning on my physiology background. Progression rules, deload and autoregulation logic, plateau diagnosis, and everything else starts as a document that I draft and sign off on, complete with concrete test vectors (specific input → specific expected output), before a single line of coaching code gets written.
Deterministic engines implement those docs. ProgressionEngine, AutoregulationEngine, PlateauEngine, and others implement the methodology and are tested against the test vectors, not against "does this feel right?"
The LLM (Claude) only operates inside that fence. It handles the conversational layer by answering your questions, explaining decisions in plain language, and helping you modify your program. It's constrained to the approved methodology, not free to invent new training science mid-conversation.
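To make the doc-first pipeline concrete, here is a minimal Python sketch of a rule plus its test vectors. The rule itself (a simple double progression), the rep range, and the load step are hypothetical stand-ins chosen for illustration. They are not WhyRep's actual methodology, and the real engines are written in Kotlin.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SetLog:
    weight_kg: float
    reps: int
    rir: int  # reps in reserve reported by the lifter


def next_target(last, rep_range=(8, 12), step_kg=2.5):
    """Hypothetical double-progression rule, returning (weight_kg, reps).

    Add a rep each session; once the top of the range is reached with at
    least 1 RIR, add load and reset to the bottom of the range.
    """
    low, high = rep_range
    if last.reps >= high and last.rir >= 1:
        return (last.weight_kg + step_kg, low)
    return (last.weight_kg, min(last.reps + 1, high))


# Test vectors: specific input -> specific expected output.
TEST_VECTORS = [
    (SetLog(60.0, 12, 2), (62.5, 8)),   # top of range with reps in reserve: add load
    (SetLog(60.0, 10, 1), (60.0, 11)),  # mid range: add a rep
    (SetLog(60.0, 12, 0), (60.0, 12)),  # top of range but at failure: hold
]


def run_vectors():
    """Return one pass/fail flag per vector."""
    return [next_target(inp) == expected for inp, expected in TEST_VECTORS]


print(run_vectors())  # [True, True, True]
```

The point of the shape is that the vectors come from the document, so a failing vector means the engine is wrong, not the methodology.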

I've spent a huge amount of time refining this interaction. The goal isn't just to answer basic questions. It's to have the kinds of nuanced coaching conversations you'd expect from a knowledgeable human coach, while keeping every recommendation grounded in the documented methodology.

For example, if you tell the coach, "I want to bring up my arms," it can recommend concrete changes, such as prioritizing arms earlier in your workouts and adjusting weekly volume and frequency, and then update your program if you approve them.

It also goes beyond the advice most people already know. Many lifters don't realize that if the brachialis is a weak point, it can be trained more effectively by using curl variations that place the shoulder into flexion to emphasize it separately from the biceps. The coach can recognize situations like that, explain the reasoning, and incorporate those changes into your program. Again, none of that is invented on the spot. Every recommendation has to trace back to the underlying methodology that I wrote and approved.

This is also why I'm comfortable letting people challenge the coach. If it recommends something, I should be able to point to the methodology that produced that recommendation and explain the physiology behind it. If I can't justify it scientifically, it doesn't belong in the product. I'd rather spend another week improving the methodology than ship a feature that sounds convincing but isn't something I'd stand behind as a coach.

What's actually shipped so far

This isn't a mockup. Here's what's built and running on a physical device right now:

Full workout tracker: sets, reps, RIR logging
Progression detection engine — methodology doc + engine + test vectors, all committed
Autoregulation/deload engine and plateau diagnosis engine, same doc-first process
Pro-gated analyzer: plain red/yellow flags for free users, full solution panels (limited to fixes that trace back to a doc) for paid
Session history with month-sections, severity badges, and an "Analyze" vs "Perform Again" split
Exercise library (~110 exercises with a lot more to come) with real photos and generated mannequin art + detail pages
Dark mode (now the default) across the app and the landing site
One shared Kotlin Multiplatform core — the same engine code compiles for both Android (Compose) and iOS (SwiftUI). No forked logic between platforms.
Backend coach chat running through Claude (Haiku 4.5), with the methodology docs cached into context
Auth + payments wired to native store billing, not a third-party processor, so both app stores are happy

Though I'd argue that the methodology is the product. I've probably spent more time writing, validating, and refining the methodology documents than writing the AI itself. Every progression rule, plateau diagnosis, autoregulation decision, and program modification starts life as a piece of methodology that I draft, challenge, revise, and test before it ever reaches the coach.

The methodology goes far deeper than "add two sets to chest." Every exercise in the library has documented fractional set contributions for every relevant muscle group. For example, a lat pulldown doesn't just count as one lat set. It also contributes fractional volume to muscles like the biceps. When the coach decides whether to increase, decrease, or maintain your weekly volume, those indirect contributions are already accounted for in the calculations instead of pretending every muscle only receives stimulus from isolation exercises.
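A rough sketch of how fractional set accounting could work, in Python. The contribution table below is invented for illustration (the 0.5 values are assumptions, not the documented fractions from the methodology), and the function name is hypothetical.

```python
from collections import defaultdict

# Hypothetical table: fraction of one set credited to each muscle.
CONTRIBUTIONS = {
    "lat_pulldown": {"lats": 1.0, "biceps": 0.5},
    "barbell_curl": {"biceps": 1.0},
    "bench_press": {"chest": 1.0, "triceps": 0.5, "front_delts": 0.5},
}


def weekly_volume(logged_sets):
    """Sum fractional sets per muscle from {exercise: hard sets this week}."""
    totals = defaultdict(float)
    for exercise, sets in logged_sets.items():
        for muscle, fraction in CONTRIBUTIONS[exercise].items():
            totals[muscle] += sets * fraction
    return dict(totals)


week = {"lat_pulldown": 8, "barbell_curl": 6}
print(weekly_volume(week))  # {'lats': 8.0, 'biceps': 10.0}
```

With these assumed numbers, the biceps already sit at 10 weekly sets even though only 6 were direct curls, which is the kind of indirect volume a volume decision needs to see.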

That's the biggest difference between WhyRep and most AI fitness apps. I'm not asking an LLM to become a coach. I'm trying to encode an evidence-based coaching methodology into software, then using the LLM as the interface that makes it feel natural to interact with. The AI isn't the source of truth. The methodology is.

Where I'm at on marketing

Top-of-funnel right now: educational, no-fluff gym content on TikTok and Instagram, slowly building an audience. If any of you are into training/hypertrophy content, I'd genuinely appreciate a follow — and if you have thoughts on what's working or not for build-in-public creators in this niche, I want to hear it:

TikTok: https://www.tiktok.com/@whyrep.ai
Instagram: https://www.instagram.com/whyrepai/
Landing page: https://whyrep.com/

What I could use advice on

This is as much a "help me think" post as a "look what I built" post. Specifically I'd love input on:

Anyone else building an audience alongside a technical product — what actually moved the needle for you early on?
If you've built something where correctness/trust is the whole pitch (not just features), how did you communicate that without sounding preachy or over-explaining?

Will keep posting weekly as this moves forward. Thanks for reading this far. I'll probably show the demo next week!

I Finally Finished Schedio: Turning a 5-Day Hackathon MVP Into a Live Product

Arqam Waheed — Sat, 06 Jun 2026 13:20:10 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge.

A few weeks ago, I built Schedio as a 5-day hackathon project, which was also, unironically, another GitHub challenge.

The idea was simple:

Highlight any text that mentions an event, and turn it into a Google Calendar event in under 5 seconds.

It worked.

Kind of.

The first version could parse highlighted text, open a clean event modal, and write to Google Calendar. I even wrote about that original MVP here:

Schedio: Highlight to Calendar in 5 Seconds

But after the hackathon rush ended, Schedio was still very obviously an MVP.

The demo was cool, but the product was not finished.

The Gemini key was too close to the client. OAuth verification was not done. Billing did not exist. Pro was just an idea. Onboarding was basically “install this and figure it out.”

There was not even a real landing page yet. My original plan was to build one after the Chrome Web Store approval, properly market the extension, get some users, add more features, and slowly turn it into an actual product instead of just a hackathon project sitting in a repo.

The funny part is that the Chrome Web Store launch failed once because I accidentally uploaded the wrong build. I fixed it, submitted it again, and then it got rejected a second time lol.

After that point, uni exams had started, other hackathons came up, and Schedio slowly drifted into that “I’ll finish it later” state. I never really pushed it beyond the original hackathon MVP.

It worked for me, but it was never really out there for everyone else to use.

So for the Finish-Up-A-Thon, I came back to Schedio and tried to do the part of building that usually gets ignored after the fun demo is over — which is TO ACTUALLY FINISH it and put it out there for others.

What I Built

Schedio is an AI-powered Chrome extension that turns natural language into real Google Calendar events.

You can highlight something like:

Team standup Friday 3pm, Room 204

Then right-click and choose Create Event with Schedio, or use the keyboard shortcut. Schedio parses the title, date, time, and location, shows you a quick review modal, and creates the event in Google Calendar when you confirm.

No copy-pasting.
No switching tabs.
No manually typing date fields while trying to remember what the original message said.

The finished version is now a real product, not just a hackathon repo. It has a live Chrome Web Store listing, a landing page at schedio.org, Google OAuth verification approved, Lemon Squeezy approved for payments, a Cloudflare Worker backend, server-side Gemini calls, Supabase-backed users and subscriptions, Free/Pro metering, an in-extension upgrade flow, voice-to-calendar as a Pro feature, first-run onboarding, privacy policy, Terms of Service, rate limiting, input validation, and no baked Gemini key in the client bundle.

The MVP proved the magic.

The finished version makes the magic safe, usable, and shippable.

Demo

Live site: https://schedio.org

Chrome Web Store: https://chromewebstore.google.com/detail/schedio/nlnkjghkddopgocdbhhkefmjbchlpjnc

Original hackathon version repo: https://github.com/ArqamWaheed/schedio

The original hackathon MVP repo is still public, but the current production version is now private because it contains live infrastructure, billing flows, and production authentication logic.

The core flow is still the same as the original MVP: highlight text, trigger Schedio, review the parsed event, and send it to Google Calendar.

But everything around that flow evolved. What started as a hackathon extension slowly turned into a real product and brand, with a proper landing page, onboarding, Pro features, subscriptions, backend infrastructure, and a much more polished overall experience.

The Comeback Story

The original Schedio was built under pressure. I cared about one question: could I make calendar creation feel instant?

That question led to the first version. But when I came back to the project, the question changed. It was no longer just “can this work?” It became:

“What would I need to change before I could confidently give this to strangers and turn it into a real brand?”

That shift basically defined the entire comeback.

The first real finishing moment was getting Schedio live on the Chrome Web Store. The earlier review had failed because I uploaded the wrong non-working build, which is such a small but painful launch mistake. The code can work locally, the demo can be impressive, the idea can be good, and then one bad upload means nobody can actually install it. So I rebuilt, rechecked, uploaded the correct version, and got it listed.

That made the project feel different immediately.

Before, Schedio was something I could show.

Now it was something people could install.

The next big problem was architecture. The MVP had the classic hackathon shortcut: the Gemini API key was too close to the client. With a browser extension, that is not something you can just hand-wave forever. If a key is in the shipped bundle, it is not really secret.

So I moved the AI parsing behind a Cloudflare Worker backend using Hono. The extension now sends highlighted text and the user’s Google token to api.schedio.org/parse. The backend verifies identity, calls Gemini server-side, and returns the parsed event. The Gemini key lives as a Worker secret, not inside the extension.

That was the first moment where Schedio stopped feeling like a clever browser hack and started feeling like infrastructure. The product still felt simple from the outside, but the trust boundary had completely changed.

Once the backend existed, I could finally turn Schedio into a Free/Pro product. Free users get a monthly event cap. Pro users get unlimited events and access to the voice feature. The important part is that the limit is enforced server-side before Gemini is called, so over-limit users do not cost an API request.
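The ordering is the important part, so here is a minimal Python sketch of it. The real backend is a Cloudflare Worker using Hono. The function name, the status codes, the error strings, and the cap of 20 are all assumptions made for illustration.

```python
def handle_parse(user, text, get_usage, call_model, free_cap=20):
    """Return (status, body). The model is only called after every gate passes.

    user:       verified identity dict, or None if verification failed
    get_usage:  callable returning this month's event count for a user id
    call_model: the expensive AI parsing call
    """
    if user is None:
        return 401, {"error": "UNAUTHENTICATED"}
    if user["plan"] == "free" and get_usage(user["id"]) >= free_cap:
        # Rejected before the model call, so an over-limit user costs nothing.
        return 429, {"error": "LIMIT_REACHED"}
    return 200, call_model(text)


# Usage with stand-in dependencies.
calls = []

def fake_model(text):
    calls.append(text)
    return {"title": "Team standup"}

over_limit = {"id": "u1", "plan": "free"}
print(handle_parse(over_limit, "Team standup Friday 3pm", lambda uid: 20, fake_model))
print(len(calls))  # 0, because the model was never reached
```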

I also made the quota harder to game. The monthly bucket comes from the server’s UTC clock, not the client’s local date, and usage increments through an atomic Postgres function so concurrent highlights do not lose updates.
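Both ideas can be sketched in a few lines of Python. This uses SQLite's upsert as a stand-in for the Postgres function, so the table name, column names, and helper names are hypothetical. The shared principle is that the increment is a single statement executed by the database rather than a read followed by a write in application code.

```python
import sqlite3
from datetime import datetime, timedelta, timezone


def month_bucket(now=None):
    """Bucket key taken from the server's UTC clock, e.g. '2026-06'."""
    now = now or datetime.now(timezone.utc)
    return now.astimezone(timezone.utc).strftime("%Y-%m")


def increment_usage(conn, user_id, bucket):
    """Upsert in one statement so concurrent requests cannot lose an update."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO usage (user_id, bucket, events) VALUES (?, ?, 1) "
            "ON CONFLICT(user_id, bucket) DO UPDATE SET events = events + 1",
            (user_id, bucket),
        )
    row = conn.execute(
        "SELECT events FROM usage WHERE user_id = ? AND bucket = ?",
        (user_id, bucket),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE usage (user_id TEXT, bucket TEXT, events INTEGER, "
    "PRIMARY KEY (user_id, bucket))"
)

# A client at UTC-5 late on June 30 is already in July according to the server.
client_time = datetime(2026, 6, 30, 23, 30, tzinfo=timezone(timedelta(hours=-5)))
print(month_bucket(client_time))  # 2026-07
```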

That is not the flashiest part of the project, but it is exactly the kind of thing that separates a demo from a product. A demo only needs to work once. A product has to keep working when users do weird things.

Billing was another place where finishing meant doing the boring thing correctly. I used Lemon Squeezy as the Merchant of Record, so Schedio does not touch card numbers, VAT, tax, or PCI directly. The backend has endpoints for checking the current plan, creating a personalized checkout link, and handling subscription webhooks.

The checkout flow embeds the verified Schedio user ID into Lemon Squeezy custom data. That way, when the webhook comes back, the subscription can be attached to the exact right Google account.
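A minimal Python sketch of the webhook side of that mapping. It assumes the provider signs the raw request body with HMAC-SHA256 and sends the hex digest in a header, and that the custom data comes back under meta.custom_data. Both details should be checked against the Lemon Squeezy documentation, and the user_id key is a hypothetical name.

```python
import hashlib
import hmac
import json


def verify_signature(raw_body, signature_hex, secret):
    """Compare the received signature with one computed over the raw bytes."""
    expected = hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature_hex)


def user_id_from_webhook(raw_body, signature_hex, secret):
    """Return the embedded user id, or None if the signature does not match."""
    if not verify_signature(raw_body, signature_hex, secret):
        return None
    payload = json.loads(raw_body)
    # Assumed payload shape: {"meta": {"custom_data": {"user_id": ...}}}
    return payload.get("meta", {}).get("custom_data", {}).get("user_id")
```

Verifying against the raw bytes, before any JSON parsing, matters because re-serializing the payload can change whitespace and break the digest.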

I almost made the obvious mistake of putting a raw “Buy Pro” link on the website. But the website does not know who the user is. Identity lives inside the extension. A raw checkout link from the marketing page could create an orphaned payment that the backend cannot map to anyone.

So the website sells the product, but the actual upgrade flow starts inside the extension, where the user is already authenticated.

The biggest new feature I added was voice-to-calendar. Instead of highlighting text, Pro users can speak an event into the popup. The backend sends the raw audio to Gemini 2.5 Flash multimodal, and Gemini transcribes and extracts the event in one call. No separate speech-to-text step. No transcript first, parse second.

Just speech into calendar structure.

I made voice a real Pro anchor, not a fake paywall. I could have used the free Web Speech API, but the multimodal approach was more accurate and has real marginal cost. So the server enforces the Pro gate before audio reaches Gemini. Free users get a 403 PRO_REQUIRED before the expensive work happens.

That felt like a real product decision: the feature is better, it costs something, and the paywall protects the cost center before the bill is created.

The next problem was onboarding. The MVP dropped users into the product and expected them to discover the context menu or shortcut. That is fine when the builder is the user, but it is a terrible experience for anyone on a fresh install.

So I built a first-run onboarding tab that opens on install. It shows the core habit: highlight text, right-click, review the event, connect Google Calendar. I wanted the tour to appear once without adding a new storage permission, so I used Chrome’s runtime.onInstalled event with reason === "install". That fires once per install, so there is no extra permission and no extra state to manage.

I also moved sign-in earlier, but carefully. The onboarding ends with a clear Connect Google Calendar button, not an automatic OAuth popup. There is also a skip option. The connect ask only appears after the user has seen the value.

Good onboarding is not a wall of text. It is a rehearsal of the product’s best moment.

The landing page also became part of the finishing arc. I shifted it away from technical explanations and toward outcomes, because that is what actually gets people interested in a product.

I also had to make the marketing honest. Some features are planned but not built yet, like extra calendar providers and bulk event creation. I did not want to delete the ambition, but I also did not want to lie. So unfinished features got “soon” labels, and shipped features were removed from the future roadmap.

That sounds tiny, but it matters. A product page should not make claims the product cannot survive.

The final wall was trust.

Google OAuth verification got approved. Lemon Squeezy approved the store. Those two approvals were the moment Schedio stopped being “works on my machine” and became something distributable and monetizable.

A calendar-writing extension has to earn trust. Google needed the owned domain, hosted privacy policy, write-only scope explanation, demo video, and correct compliance language. Payments needed a real Merchant of Record review. None of that is as fun as building a new AI feature, but that was the actual finish line.

The original version died near this wall. The newer version finally crossed it.

How GitHub Copilot Helped Me

I used GitHub Copilot CLI to generate a large part of the implementation, but I never treated it like autopilot.

The architecture, product decisions, system boundaries, and overall direction were still mine. Copilot was the accelerator, not the driver. I spent most of the project defining flows, structuring prompts carefully, reviewing generated code, and deciding what should or should not exist in the final product.

That mattered more as Schedio evolved from a hackathon MVP into a real product.

The backend migration is a good example. I knew the Gemini key could not stay in the client anymore, but Copilot helped turn that idea into the actual Worker architecture: extension → Cloudflare Worker → Gemini/Supabase/Lemon Squeezy. It helped scaffold routes, tighten request shapes, and keep the extension and backend synced while the architecture evolved.

It also helped with the parts that are easy to ignore when you are moving quickly: webhook verification, quota tracking, atomic usage increments, CORS restrictions, rate limits, generic error handling, and validation layers. None of those make a flashy demo. All of them make the product safer.

Copilot was also surprisingly useful for debugging weird launch issues. The best example was the OAuth bad client id bug. After moving authentication into the real product flow, Google sign-in suddenly broke in development builds. The issue turned out to be Chrome extension IDs: unpacked builds can generate different IDs unless the public extension key is pinned correctly.

Copilot helped trace the extension ID behavior, compare it against the published Web Store ID, and wire the correct key into the manifest so development and production resolved identically. What started as a vague OAuth failure became a clean one-line fix.
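For anyone hitting the same bug, this Python sketch shows why pinning the key pins the ID. Chrome derives an extension's ID from its public key: SHA-256 of the DER-encoded key, the first 128 bits as hex, with each hex digit mapped onto the letters a to p. The function name is hypothetical, and the input is the base64 string from the manifest's "key" field.

```python
import base64
import hashlib


def extension_id_from_key(manifest_key_b64):
    """Derive the 32-character extension ID from the manifest 'key' value."""
    der = base64.b64decode(manifest_key_b64)
    digest = hashlib.sha256(der).hexdigest()[:32]  # first 128 bits as hex
    # Map hex digits 0-f onto the letters a-p.
    return "".join(chr(ord("a") + int(c, 16)) for c in digest)


# Placeholder bytes, not a real public key: same key in, same ID out.
demo_key = base64.b64encode(b"not a real public key").decode()
print(extension_id_from_key(demo_key) == extension_id_from_key(demo_key))  # True
```

Without a pinned key, an unpacked build gets an ID derived from something else (its path on disk), so it no longer matches the ID registered with the OAuth client.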

It also helped with product consistency outside pure code. While redesigning the Chrome Web Store graphics, Copilot helped identify outdated messaging that still implied users needed their own AI key, even though the final product had already removed BYOK (bring your own key) entirely. Leaving those images up would have been misleading, so they got rebuilt before launch.

That was the part I did not expect initially. Copilot was not just generating code. It was helping keep the product coherent while the scope kept expanding.

The Terms of Service and SEO work were similar. Copilot helped structure the TOS around the existing privacy policy style, wire the new pages into the build system, and connect everything through the footer and metadata layer. It also helped add Open Graph tags, Twitter cards, sitemaps, robots files, structured data, and asset handling.

The biggest lesson was that Copilot never replaced the decisions. It simply made implementation dramatically faster once the direction was clear.

And most of the important decisions were actually restraint:

Do not put a raw checkout link on the website because identity lives in the extension.
Do not claim planned features are already shipped.
Do not add unnecessary permissions just because they are convenient.
Do not process webhooks loosely when strict validation is safer.
Do not leak backend internals in API errors.
Do not keep a Gemini key in the client just because it is easier.

The MVP was built with speed.

The finished product was built with speed plus restraint.

Once the architecture and product decisions were clear, Copilot accelerated the implementation massively. I still handled the direction, debugging, and review process, but it removed an enormous amount of friction from actually extending and shipping Schedio.

A lot of the final product simply would have taken far longer to build without it.

What I Learned

A hackathon project is about proving the magic.

A finished product is about protecting it.

Schedio already had the magic: highlight text and turn it into a calendar event in seconds. But the comeback was everything around that.

Can users install it?
Can Google trust it?
Can payments map to the right account?
Can a free user hit a limit without the UX feeling broken?
Can secrets stay secret?
Can onboarding teach the habit?
Can the website sell without lying?
Can the system survive the boring edge cases?

That is what I finished.

And weirdly, that made the project more exciting than the original hackathon version, because now Schedio is not just a demo I can show. It is a product I can actually launch.

The next steps are smarter recurrence, multiple calendars, Outlook/iCloud/CalDAV support, bulk multi-event parsing, Firefox and Edge support, and eventually a Mac/Safari companion app. But the important part is that Schedio is no longer blocked by the boring stuff.

The boring stuff is done.

And that was the real Finish-Up-A-Thon.

What's Next Beyond Development

Now comes the next challenge: distribution.

Schedio is finally live, installable, verified, and usable by real people. The engineering side finally feels complete enough to push properly, so my next focus is figuring out how to market it, get feedback from real users, and turn it from a finished side project into something people genuinely rely on.

If anyone has ideas for growth, launch, or distribution strategies for productivity extensions, I would genuinely love to hear them.

I Made My AI Models Argue, Then Let Hermes Be the Judge

Arqam Waheed — Sat, 30 May 2026 16:00:54 +0000

This is a submission for the Hermes Agent Challenge: Build With Hermes Agent

TL;DR — Ask any judgment call and three different AI models argue it out, then Hermes hands down one verdict, a confidence score, and exactly why they split. Every verdict, dissent, and mind-changed-in-debate is written into Hermes' own memory, so the next question re-weights the jurors before they ever vote. The judging is a pure function over that memory: no memory, no weights, no verdict. Three models, one verdict, $0.

What I Built

An LLM once talked me into the wrong database with total confidence. One smooth, authoritative answer. I shipped it. It cost me a weekend and a migration I'm still not over.

The villain here is single-model overconfidence: you get one polished reply, and the disagreement that should have warned you is invisible. You never see the other opinions, because you only asked one model.

So I stopped trusting one model. I convened a jury.

Council takes any judgment call ("Postgres or Mongo?", "is this PR safe to merge?", "is this clause risky?") and asks three different models, lets them disagree, then has Hermes deliver one verdict, a confidence score, and exactly why they split. Three models, one verdict, $0.

You ask a question. Council fans it out to three jurors (two free OpenRouter models from different families and one local model via Ollama), each takes a position with reasons. Then, if they disagree, a second deliberation round runs: each juror sees the others' answers and either holds or changes its mind, so the council debates instead of just voting once. Hermes then judges the deliberated opinions: a single verdict, a confidence score (high when they agree, low when they split 2-1), and a "why they disagreed" panel. Every verdict is remembered, a council skill learns which juror to trust for which kind of question, and the agent can even propose its own trust adjustments for you to approve.

The whole product is one question box. Everything interesting happens behind it, and the rest of this post is mostly pictures of that "behind."

Demo

Repo: https://github.com/ArqamWaheed/council

Live demo: https://council-jet-kappa.vercel.app/
Hermes orchestration is local-only (no Hermes binary on serverless); the hosted demo runs the same UI via OpenRouter/mock. Run locally for the real hermes -z path.

Try "Should a 3-person startup use microservices?" and open the dissent panel.

Local, one command (runs at $0 in offline mock mode, no key needed):

git clone https://github.com/ArqamWaheed/council && cd council && ./setup_hermes.sh && python server.py

Architecture, in pictures

I think the design is easiest to see, so here's the system as a sequence of images. Each caption is the explanation.

The core loop. One question, three independent Hermes subagents (2 hosted + 1 local) fanned out in parallel, then a fourth Hermes run (the foreman) synthesizes one verdict. Every arrow is the same hermes -z interface; nothing talks to a model directly.

The bet. A hosted model and an on-device model sit on the same jury, swapped with a single --provider/--model flag, no code change. This model-agnosticism is the one Hermes property the whole project is built on.

The UX surface. Confidence is high when jurors agree and drops on a 2-1 split. The dissent panel is collapsed by default, and you expand it exactly when the confidence number makes you nervous.

The actual product. A confident single answer hides this; Council makes the disagreement the headline. Getting the clustering right here was subtle (see "What I learned" below).

The headline feature: a council that deliberates, not just votes. After round 1, disagreeing jurors get a second Hermes pass where they read each other's arguments and may hold or change their vote. A "⇄ changed" badge marks the ones that moved, and the confidence dial actually climbs when a 2-1 split is talked into agreement.

The agentic learning loop, human-in-the-loop. Hermes proposes; you approve or dismiss. Approved rules persist client-side and ride along with the next convene call.

Persistence the judge can verify. Verdicts are mirrored into Hermes' own memory, so recall is Hermes doing the work; proof lives in docs/hermes-proof/04-memory-recall.txt.

Code

Repo: https://github.com/ArqamWaheed/council

Interesting files:

hermes_run.py (the Hermes CLI driver every juror/judge call goes through)
run_council.py (orchestration + the deterministic judge + Hermes foreman + the --reflect loop)
skills/council/SKILL.md (the juror-weighting brain Hermes edits)
server.py (the /api/reflect + /api/learn endpoints)
index.html (the designed verdict UI with the foreman TTS readout and localStorage persistence)

Proof that Hermes is genuinely in the loop (subagent transcripts, skill diff, memory recall) is in docs/hermes-proof/.

# hermes_run.py: every juror/judge call is a real Hermes run
import subprocess

def ask(prompt, provider, model, skills=None, timeout=120):
    # binary() is a repo helper not shown in this excerpt
    cmd = [binary(), "--provider", provider, "--model", model]
    if skills:
        cmd += ["--skills", skills]
    cmd += ["-z", prompt]                       # -z = one-shot, final answer on stdout
    return subprocess.run(cmd, capture_output=True, text=True, timeout=timeout).stdout

# jurors.py: fan out one Hermes subagent per juror, in parallel
from concurrent.futures import ThreadPoolExecutor

# roster() and ask_juror() are repo helpers not shown here; each task gets (index, juror)
with ThreadPoolExecutor(max_workers=len(roster())) as pool:
    opinions = list(pool.map(lambda c: ask_juror(*c), enumerate(roster())))

How I Used Hermes Agent

Why Hermes at all: the model-agnostic core. Hermes lets you point at any provider and swap with a flag, no code change. Council is built on top of that one property: the jurors are different models, and Hermes is the only piece that makes "different models" cheap. The clearest proof is the third juror: it runs locally via Ollama while the other two are hosted on OpenRouter, and all three answer through the exact same hermes -z interface (the model-agnostic diagram above). A hosted model and an on-device model, sitting on the same jury, no code change: that's model-agnosticism you can see. I genuinely didn't see another entry in this challenge exploit it; everyone picked one model and moved on. That's the whole bet.

Subagents: one real Hermes run per juror. Each juror is a genuine, isolated Hermes invocation on a different provider+model (hermes -z --provider openrouter --model … for the two hosted jurors, --provider ollama-local … for the on-device one), fanned out in parallel so no model's reasoning anchors another's (the convene-flow diagram above). Hermes does the inference; my Python (jurors.py to hermes_run.py) is just the fan-out plumbing, and every juror in the output JSON is tagged "via": "hermes". The gotcha worth flagging: Hermes enforces a 64K-context floor, which for the local model meant setting both ollama_num_ctx and a named custom_providers entry; without the named provider, --provider ollama silently routed to the wrong base URL. setup_hermes.sh encodes the working config so a judge can reproduce it in one command.

A true debate, not just a vote (round 2 is real Hermes work). This is the feature I'm proudest of. After round 1, if the jurors disagree, each one gets a second Hermes run that shows it the others' positions and lead reasons and asks it to hold or change its mind. Real jurors reconsider through the same hermes -z path as round 1, so the debate is genuine extra agentic work, not a UI flourish; mock jurors reconsider deterministically so the offline demo stays reproducible. The judge then synthesizes the verdict from the deliberated opinions, so a juror that's talked round actually moves the outcome (the deliberation diagram above). It's gated on disagreement (a unanimous round 1 skips it) and toggled with COUNCIL_DEBATE=0.
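The gating logic can be sketched in Python like this. Only the COUNCIL_DEBATE variable and the skip-on-unanimous behavior come from the description above. The function name, the data shapes, and the reconsider callback (standing in for the second hermes -z run) are assumptions.

```python
import os


def deliberate(opinions, reconsider, debate_enabled=None):
    """Run round 2 only when jurors disagree.

    opinions:   {juror: position} from round 1
    reconsider: callable(juror, own_position, others) -> position
    Returns (revised_opinions, jurors_who_changed).
    """
    if debate_enabled is None:
        debate_enabled = os.environ.get("COUNCIL_DEBATE", "1") != "0"
    if not debate_enabled or len(set(opinions.values())) <= 1:
        return dict(opinions), []  # disabled or unanimous: skip the debate
    revised, changed = {}, []
    for juror, own in opinions.items():
        # Each juror sees the others' round-1 positions, never its own echo.
        others = {j: p for j, p in opinions.items() if j != juror}
        revised[juror] = reconsider(juror, own, others)
        if revised[juror] != own:
            changed.append(juror)
    return revised, changed


def join_if_alone(juror, own, others):
    """Mock juror: switch only when every other juror agrees on one position."""
    positions = set(others.values())
    return positions.pop() if len(positions) == 1 else own


round_one = {"a": "yes", "b": "yes", "c": "no"}
print(deliberate(round_one, join_if_alone, debate_enabled=True))
# ({'a': 'yes', 'b': 'yes', 'c': 'yes'}, ['c'])
```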

Why a skill, not a prompt, for judging. The foreman's verdict is itself a Hermes run (hermes -z --skills council) grounded in skills/council/SKILL.md, which is installed into Hermes (hermes skills list shows it). The weighting logic lives in a machine-readable weights block.

The judging brain is data, not a buried prompt. --learn and --reflect both edit this block, and the installed Hermes copy is kept in sync.

After a string of security questions, --learn appended a rule to upweight the local model on that topic (and synced the installed Hermes copy) because it had caught issues the hosted models missed:

python run_council.py --learn "Local Juror | security | 1.5"

On the next security question that juror's vote counts 1.5×, read straight back by the judge. Counterfactual: a static synthesis prompt can't get better; this does. (The before/after skill diff is in docs/hermes-proof/03-skill-learning.txt.)
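A minimal sketch of how a rule in that format could feed a deterministic judge, in Python. The rule string format is taken from the command above. Everything else is an assumption for illustration: the function names, the default weight of 1.0, and the confidence formula (winning weight divided by total weight).

```python
from collections import defaultdict


def parse_rule(rule):
    """'Local Juror | security | 1.5' -> ('Local Juror', 'security', 1.5)."""
    juror, topic, weight = (part.strip() for part in rule.split("|"))
    return juror, topic, float(weight)


def judge(votes, topic, rules):
    """Weighted tally. votes is {juror: position}; returns (verdict, confidence)."""
    weights = {(juror, t): w for juror, t, w in rules}
    tally = defaultdict(float)
    for juror, position in votes.items():
        tally[position] += weights.get((juror, topic), 1.0)
    verdict = max(tally, key=tally.get)
    return verdict, round(tally[verdict] / sum(tally.values()), 2)


rule = parse_rule("Local Juror | security | 1.5")
votes = {"Hosted A": "safe", "Hosted B": "safe", "Local Juror": "unsafe"}
print(judge(votes, "security", []))      # ('safe', 0.67)
print(judge(votes, "security", [rule]))  # ('safe', 0.57)
```

Under this assumed formula, the upweighted dissent does not flip a 2-1 verdict, but it does pull the confidence down, which is the signal to open the dissent panel.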

Letting the agent propose its own learning, now on the web and grounded in evidence. python run_council.py --reflect (and the "Should the council reweight itself?" button in the UI) hands Hermes its own memory of past verdicts and asks it to propose one weight change, e.g. "the local juror has dissented on three database calls; upweight it." The key fix this round: the proposal is evidence-grounded, since Hermes is fed the actual dissent tally and any rule backed by fewer than two real dissents is rejected, so it can't just parrot the example baked into the skill. You then Approve or Dismiss it (the reflect-flow diagram above). That's the agentic loop done honestly: a single verdict has no ground truth, so the agent surfaces a pattern and a human confirms it's signal, not overfitting (the exact tension this post closes on). (Offline, it falls back to a deterministic heuristic so it never breaks.)
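The evidence check can be sketched in Python as follows. The two-dissent threshold comes from the paragraph above. The record shape, the function names, and the definition of a dissent (a vote that differs from the final verdict) are assumptions.

```python
from collections import Counter

MIN_DISSENTS = 2  # a proposed rule needs at least this many real dissents


def dissent_tally(history):
    """Count, per (juror, topic), how often a juror voted against the verdict."""
    tally = Counter()
    for record in history:
        for juror, position in record["votes"].items():
            if position != record["verdict"]:
                tally[(juror, record["topic"])] += 1
    return tally


def accept_proposal(proposal, history):
    """proposal is (juror, topic, weight); reject it if the evidence is thin."""
    juror, topic, _weight = proposal
    return dissent_tally(history)[(juror, topic)] >= MIN_DISSENTS


history = [
    {"topic": "database", "verdict": "Postgres",
     "votes": {"Hosted A": "Postgres", "Hosted B": "Postgres", "Local Juror": "Mongo"}},
    {"topic": "database", "verdict": "Postgres",
     "votes": {"Hosted A": "Postgres", "Hosted B": "Postgres", "Local Juror": "SQLite"}},
    {"topic": "security", "verdict": "safe",
     "votes": {"Hosted A": "safe", "Hosted B": "safe", "Local Juror": "unsafe"}},
]
print(accept_proposal(("Local Juror", "database", 1.5), history))  # True
print(accept_proposal(("Local Juror", "security", 1.5), history))  # False
```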

Making learning survive a stateless deploy. On a hosted demo the filesystem is read-only, so an approved rule can't be written back to SKILL.md. Council handles this honestly: approved rules are stored in the browser's localStorage and re-sent with every /api/convene call, where they're merged into the judge's weights for that request. Locally you get a persistent SKILL.md; on the web you get per-browser persistence, and either way the learning sticks.
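
A sketch of how the browser side and the per-request merge could fit together. The storage key, the WeightRule shape, and the function names are assumptions for illustration; a small KV interface stands in for window.localStorage so the logic runs anywhere.

```typescript
interface WeightRule {
  juror: string;
  topic: string;
  weight: number;
}

// Later rules win, so approved browser rules override the shipped defaults.
function mergeRules(base: WeightRule[], approved: WeightRule[]): WeightRule[] {
  const byKey = new Map<string, WeightRule>();
  for (const rule of base.concat(approved)) {
    byKey.set(`${rule.juror}|${rule.topic}`, rule);
  }
  return Array.from(byKey.values());
}

// The subset of the Storage API this sketch needs.
interface KV {
  getItem(key: string): string | null;
  setItem(key: string, value: string): void;
}

const STORAGE_KEY = "council.approvedRules"; // hypothetical key name

function loadApproved(store: KV): WeightRule[] {
  const raw = store.getItem(STORAGE_KEY);
  return raw ? (JSON.parse(raw) as WeightRule[]) : [];
}

// Called when the user clicks Approve on a proposed rule.
function approveRule(store: KV, rule: WeightRule): void {
  const next = mergeRules(loadApproved(store), [rule]);
  store.setItem(STORAGE_KEY, JSON.stringify(next));
}
```

On each /api/convene call the client would send loadApproved(localStorage) along with the question, and the server would pass it through mergeRules for that request only.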

Why memory. Each verdict is appended to a log and mirrored into Hermes' own MEMORY.md, so I can ask hermes -z "what did the council decide about auth?" and Hermes recalls it from its memory, not from my code (the memory-recall image above). Proof: docs/hermes-proof/04-memory-recall.txt.

The foreman reads the verdict aloud. The verdict card has a "the foreman reads the verdict" button (browser SpeechSynthesis, $0); Hermes also ships native TTS via hermes setup tts. On-theme and memorable: a jury foreman announcing the decision.

The build itself was agent-run. I kept a memory.md the coding agent read before each task and updated after (so context stayed cheap), committed every increment with Conventional Commits, and built the verdict UI with the frontend-design skill, which is why the confidence dial and colour-coded juror chips read as designed, not default-template AI slop. The repo's AGENTS.md + commit history show the process, not just the result.

Why these models, and the concession. Two free OpenRouter models from different families (≥64K context, since Hermes rejects smaller at startup) plus a local Ollama juror. Two honest concessions: (1) free models are slower and three calls add latency (~10-20s/verdict); (2) the free tier is aggressively rate-limited, so I hit 429s constantly while building. Council retries and, if a juror still won't answer, falls back (Hermes, then the direct API, then a deterministic stand-in) rather than crashing the verdict, which also means the demo runs fully offline at $0. For a once-a-decision tool, I'll take it. Cost: $0.
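
The retry-then-fall-back behaviour can be sketched as an ordered list of sources. This is a hedged illustration, not Council's code: the Ask type, the askWithFallback name, and the retry count and backoff values are all assumptions.

```typescript
type Ask = (question: string) => Promise<string>;

interface Source {
  name: string;
  ask: Ask;
}

// Tries each source in order; each gets `retries` extra attempts before moving on.
async function askWithFallback(
  question: string,
  sources: Source[],
  retries = 2,
  delayMs = 500,
): Promise<{ answer: string; source: string }> {
  for (const { name, ask } of sources) {
    for (let attempt = 0; attempt <= retries; attempt++) {
      try {
        return { answer: await ask(question), source: name };
      } catch {
        // For example a 429 from the free tier: wait longer before each retry.
        if (attempt < retries) {
          await new Promise((resolve) => setTimeout(resolve, delayMs * 2 ** attempt));
        }
      }
    }
  }
  throw new Error("all sources failed");
}
```

If the last source in the list is a deterministic stand-in that never throws, the final error is unreachable in practice and a verdict always has three opinions to work with.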

License. MIT. Fork it, add your own jurors.

What I learned (and what's next)

The disagreement is the product. A 2-1 split is more useful than a confident single answer, so the clustering that decides "who actually disagreed" has to be right. A small local model once wrote a vague position ("to facilitate efficient integration…") whose reasons clearly endorsed Postgres; the first version mis-filed it as a dissenter. The fix: when a juror's stated position is ambiguous, fall back to reading its reasons, and ignore options only mentioned in a comparison ("better than Mongo" isn't a vote for Mongo). Now agreeing jurors cluster together, and the split count is honest.
Grounded beats glib. Letting the agent propose its own weighting only works if the proposal is tied to real evidence; an ungrounded "reflect" just echoes whatever example is in the skill.
Hermes' 64K-context floor caught a model that would've quietly underperformed.
A council should deliberate, not just vote. The round-2 debate above was the turning point: letting jurors read each other and reconsider means a juror that's genuinely persuaded moves the verdict, and you watch the confidence dial climb as a 2-1 split becomes unanimous. A one-shot vote can't do that.
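
For the mis-filed dissenter described above, here is one way the "fall back to reasons, ignore comparisons" rule could look. It is a rough heuristic sketch under assumptions: the comparator word list, the regex approach, and the function names are mine, and the real clustering may work differently.

```typescript
// Words that signal an option is being compared against, not chosen (an assumed list).
const COMPARATORS = "(?:than|over|versus|vs\\.?|unlike)";

function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Options a text endorses: mentioned at least once outside a comparison.
function endorsed(text: string, options: string[]): string[] {
  return options.filter((option) => {
    const name = escapeRegExp(option);
    const mentions = text.match(new RegExp(`\\b${name}\\b`, "gi")) ?? [];
    const compared = text.match(new RegExp(`\\b${COMPARATORS}\\s+${name}\\b`, "gi")) ?? [];
    return mentions.length > compared.length;
  });
}

// Uses the stated position when it names exactly one option, otherwise reads the reasons.
function classify(position: string, reasons: string[], options: string[]): string | null {
  const fromPosition = endorsed(position, options);
  if (fromPosition.length === 1) return fromPosition[0] ?? null;
  const fromReasons = endorsed(reasons.join(" "), options);
  return fromReasons.length === 1 ? fromReasons[0] ?? null : null;
}
```

A juror that classify cannot place (null) is better treated as abstaining than counted as a dissenter, which keeps the split count honest.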

Terra Triage: I Built a 3-Agent Wildlife Dispatcher That Learns From Every Referral

Arqam Waheed — Mon, 20 Apr 2026 06:34:33 +0000

This is a submission for Weekend Challenge: Earth Day Edition

TL;DR: Snap a photo of an injured animal and the right licensed rehabber gets paged in under 60 seconds. Backboard remembers every accept, decline, and "at capacity" outcome, so the next case re-ranks before it's dispatched. Memory is the product; the ranking is a pure function that cannot compute without it.

What I Built

Last spring I found a stunned songbird on the sidewalk and spent forty minutes cold-calling vets that don't take wildlife. By the time I reached an actual rehabber, the bird was gone. That's the problem I wanted to solve in a weekend.

Most dispatch apps pick the closest rehabber. Terra Triage picks the one who will actually say yes, because Backboard remembers who said no last time.

Terra Triage is a three-agent web app for people who just found an injured animal and have no idea who to call. You snap a photo, approve a single consent prompt, and under 60 seconds later a licensed wildlife rehabilitator within range has an email in their inbox with the photo, the GPS, and a one-click "accept / decline / at capacity" magic link. No account, no app, no phone tree.

The interesting part is not the first dispatch. It's the second one. Every outcome a rehabber returns (accepted, declined, at capacity, unreachable) is written back as a signal into Backboard, and the very next case reranks because of it. If Rehabber A just declined a raptor at 9:42, the 9:51 raptor won't go to them first. The memory is the product.

Three agents, one narrow job each:

Agent	Job	Model / Service
Finder	Vision triage: species, severity 1-5, safety advice	Groq Llama-4 Scout (vision), JSON mode
Dispatcher	Rank rehabbers, send the email, mint magic-link	Auth0 scoped agent token + Resend
Memory	Read and write rehabber signals that drive the ranking	Backboard (primary), Supabase mirror as fallback

Demo

Live URL: https://terra-triage.vercel.app/

The flow:

Open the website on a phone, snap a photo of an injured animal, and approve the location prompt.
The Finder agent returns a triage card with species, severity, and first-aid advice.
A ranked list of nearby rehabbers appears. Each card shows the Backboard-aware score, distance, capacity, and a one-tap Call button for the listed 555-01xx number.
Tap Send referral on the top pick. Auth0 asks for the referral:send scope, you consent once, and the dispatcher fires.
The success pane shows "Referral sent" next to a scoped-token badge and a View captured email link.
Open the captured email in /demo/inbox/<id>. Everything a real rehabber would see is there: photo, GPS, triage summary, accept and decline buttons.
Click Decline, at capacity from inside that email. The magic-link records the outcome and redirects to a thank-you page.
Switch to /admin. The memory timeline shows the new signal landing in Backboard, and the same case re-ranks with that rehabber demoted.

No email leaves the server during this flow. Delivery is gated behind a demo switch for this submission; why, and what the real launch path looks like, are in the sections below.

Code

ArqamWaheed / terra-triage

Terra Triage

Snap a photo of an injured wild animal and a multi-agent system identifies the species, triages the injury, and dispatches the referral to the rehabber most likely to say yes, in under 60 seconds.

What it does

Terra Triage collapses the chaotic gap between "I just found a hurt animal" and "a trained rehabber is on the way" into a single guided 60-second flow. It pairs a Groq-powered vision Finder agent, an Auth0-scoped Dispatcher agent, and a Backboard-backed Memory agent so that every referral outcome improves the next ranking. Most dispatch apps pick the closest rehabber. Terra Triage picks the one who will actually accept, because Backboard remembers who said no last time.

Nationwide coverage is seeded (250 licensed rehabbers, 5 per US state, fictional .example.org contacts using the NANPA 555-01xx block reserved for fiction) so the ranker has something to rank from day one. Every…

Project structure (trimmed):

src/
├── app/
│   ├── report/                    # Anonymous intake (photo + geo)
│   ├── case/[id]/                 # Reporter-visible case page
│   ├── rehabber/outcome/[token]/  # Magic-link outcome form
│   ├── admin/cases/               # Ops console + memory timeline
│   └── api/
│       ├── admin/seed-demo-case/  # Idempotent demo seeder
│       └── auth/[auth0]/          # Auth0 login / callback / profile
├── lib/
│   ├── agents/
│   │   ├── finder.ts              # Groq vision call, JSON mode
│   │   ├── dispatcher.ts          # Rank + Resend + magic-link
│   │   └── rank-with-memory.ts    # Fuses memory signals into the rank
│   ├── memory/
│   │   ├── backboard.ts           # Real Backboard API client
│   │   └── index.ts               # Backboard-primary, local fallback
│   └── auth/
│       ├── agent-token.ts         # Scoped agent token (PAR or M2M)
│       └── magic-link.ts          # HMAC-signed, single-use tokens

How I Built It

Backboard as the protagonist

Most "memory" integrations I see treat the memory service as a prompt-context bucket: fetch recent history, stuff it into the system message, let the LLM figure it out. Terra Triage does the opposite. The ranker is a pure scoring function that cannot compute without reading memory first: no LLM in the hot path, no prose interpretation, just signals driving weights.

// src/lib/agents/rank-with-memory.ts
export async function rankRehabbersWithMemory(
  input: CaseInput,
  rehabbers: PublicRehabber[],
): Promise<RankedRehabber[]> {
  const signals = await getMemory().query(rehabbers.map((r) => r.id));
  return rankRehabbers(input, rehabbers, signals);
}

The scorer weights species match (0.35), distance (0.25), capacity (0.20), accept rate (0.15), and response time (0.05). Every factor except distance is sourced from Backboard. When a rehabber submits an outcome, applyOutcomeToSignals, a pure function, computes new values for the relevant keys (capacity, accept_rate, species_scope, response_ms), and the result is written back. The next ranking reflects it immediately.
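
As a worked example of those weights, here is a minimal scoring sketch. Only the five weights come from the text; the Features shape, the 0..1 normalisation, and the score and rank names are assumptions, and the repo's rankRehabbers may be structured differently.

```typescript
// Weights from the text; they sum to 1.
const WEIGHTS = {
  species: 0.35,
  distance: 0.25,
  capacity: 0.2,
  acceptRate: 0.15,
  responseTime: 0.05,
};

// Assumption: every feature is already normalised to 0..1, where 1 is best
// (a nearer rehabber has a higher `distance`, a faster one a higher `responseTime`).
interface Features {
  species: number;
  distance: number;
  capacity: number;
  acceptRate: number;
  responseTime: number;
}

function score(f: Features): number {
  return (
    WEIGHTS.species * f.species +
    WEIGHTS.distance * f.distance +
    WEIGHTS.capacity * f.capacity +
    WEIGHTS.acceptRate * f.acceptRate +
    WEIGHTS.responseTime * f.responseTime
  );
}

interface Candidate {
  id: string;
  features: Features;
}

// Returns candidate ids, best score first, without mutating the input.
function rank(candidates: Candidate[]): string[] {
  return candidates
    .slice()
    .sort((a, b) => score(b.features) - score(a.features))
    .map((c) => c.id);
}
```

Capacity and accept rate together carry 0.35 of the score, more than distance at 0.25, which is how a slightly farther rehabber who tends to say yes can outrank the closest one.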

The engineering lesson I did not expect. My first Backboard integration used semantic /memories/search once per rehabber, per case. That code looks correct, but it costs about $0.80 per triage at hackathon volumes.

Because all of our memory writes are structured and attributable to a rehabber id, the correct access pattern is a single paginated GET /memories, filtered in application code. I rewrote it that way and the cost dropped roughly 800x (to fractions of a cent) with no change in ranking quality. Signals are encoded as TERRA_SIGNAL rehabber=<id> key=<k> value=<json> so the filter is trivial.
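
Here is a small sketch of what "the filter is trivial" can mean for that line format. The regular expression, the Signal shape, and the function names are assumptions for illustration, and the sketch assumes rehabber ids and keys contain no whitespace.

```typescript
interface Signal {
  rehabberId: string;
  key: string;
  value: unknown;
}

// Matches the TERRA_SIGNAL line format quoted in the text.
const SIGNAL_RE = /^TERRA_SIGNAL rehabber=(\S+) key=(\S+) value=(.+)$/;

function encodeSignal(signal: Signal): string {
  return `TERRA_SIGNAL rehabber=${signal.rehabberId} key=${signal.key} value=${JSON.stringify(signal.value)}`;
}

// Returns null for memories that are not signals or carry malformed JSON.
function decodeSignal(content: string): Signal | null {
  const match = SIGNAL_RE.exec(content);
  if (!match) return null;
  const [, rehabberId = "", key = "", json = ""] = match;
  try {
    return { rehabberId, key, value: JSON.parse(json) };
  } catch {
    return null;
  }
}

// One list read, then a cheap in-process filter instead of a search per rehabber.
function signalsFor(memories: string[], rehabberIds: Set<string>): Signal[] {
  return memories
    .map(decodeSignal)
    .filter((s): s is Signal => s !== null && rehabberIds.has(s.rehabberId));
}
```

The memories array here stands for the concatenated pages of the single list read; anything that does not parse is skipped rather than treated as an error.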

The final detail: FallbackMemory is a tiny proxy that prefers Backboard and mirrors every upsert to a local memory_entries table tagged source='backboard' | 'local_fallback'. If Backboard is down mid-demo, the app keeps working and the admin timeline shows a red chip so you can see the failover instead of it hiding behind a stack trace.
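
A compact sketch of that proxy idea. The interface and function names below are assumptions, not the contents of src/lib/memory/index.ts; only the two source tags come from the text.

```typescript
type MemorySource = "backboard" | "local_fallback";

// The primary store, for example a Backboard client.
interface MemoryStore {
  upsert(key: string, value: string): Promise<void>;
}

// Writes one row to the local mirror table, tagged with where the write landed.
type Mirror = (key: string, value: string, source: MemorySource) => Promise<void>;

function createFallbackMemory(primary: MemoryStore, mirror: Mirror) {
  return {
    // Resolves to the source that accepted the write so the UI can show a failover chip.
    async upsert(key: string, value: string): Promise<MemorySource> {
      let source: MemorySource = "backboard";
      try {
        await primary.upsert(key, value);
      } catch {
        source = "local_fallback"; // primary is down: keep working locally
      }
      await mirror(key, value, source);
      return source;
    },
  };
}
```

Returning the source from every write is what lets the admin timeline colour each entry instead of inferring failover from logs.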

Auth0 for Agents: scoped consent for a destructive action

"Send referral" is the one button in this app that can annoy a real human being (emails a licensed rehabber). I treated it as an agent action that must be authorized, not a server-side formality.

// src/lib/auth/agent-token.ts (excerpt)
export async function getAgentToken(): Promise<AgentToken> {
  const session = await getSession();
  if (session?.tokenSet?.scope?.split(" ").includes("referral:send")) {
    return { token: session.tokenSet.accessToken, mode: "user-consented", scope: "referral:send" };
  }
  return mintM2MToken({ audience: env.AUTH0_AGENT_AUDIENCE, scope: "referral:send" });
}

PAR is on when the tenant allows it (AUTH0_PAR=1), so the browser never sees the full authorization params, only a request_uri handle. The custom consent_context query parameter carries human-readable context ("email Marcus at Hudson Valley Raptors on your behalf") into the consent screen. If consent is unavailable, we fall back to a scoped machine-to-machine token rather than silently downgrading the action to a service call.

The UI surfaces which mode was used with an on-screen badge. The narrator can literally point at it on camera and say "scoped." That visibility is the Auth0 story for me: agents should explain themselves, not hide.

Rehabbers do not have accounts. Their outcome submission goes through an HMAC-signed, single-use, 72-hour magic link (src/lib/auth/magic-link.ts). Single-use is enforced with a conditional UPDATE ... WHERE outcome IS NULL, so when submissions for the same token race, the database lets exactly one of them record an outcome.
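
For readers who have not built one, this is roughly what minting and verifying such a link involves. It is a sketch using Node's built-in crypto module, not the contents of magic-link.ts: the token layout, the function names, and the SQL table and column names are assumptions.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

const TTL_MS = 72 * 60 * 60 * 1000; // 72-hour lifetime, from the text

function sign(payload: string, secret: string): string {
  return createHmac("sha256", secret).update(payload).digest("base64url");
}

// Assumed token layout: "<referralId>.<expiresAtMs>.<signature>" (referralId must not contain ".").
function mintToken(referralId: string, secret: string, now: number = Date.now()): string {
  const payload = `${referralId}.${now + TTL_MS}`;
  return `${payload}.${sign(payload, secret)}`;
}

// Returns the referral id for a valid, unexpired token, otherwise null.
function verifyToken(token: string, secret: string, now: number = Date.now()): string | null {
  const parts = token.split(".");
  if (parts.length !== 3) return null;
  const [referralId = "", expires = "", signature = ""] = parts;
  const expected = Buffer.from(sign(`${referralId}.${expires}`, secret));
  const given = Buffer.from(signature);
  // Constant-time comparison; timingSafeEqual requires equal lengths.
  if (given.length !== expected.length || !timingSafeEqual(given, expected)) return null;
  return Number(expires) > now ? referralId : null;
}

// Hypothetical single-use statement: only the first submission changes a row.
const RECORD_OUTCOME_SQL =
  "UPDATE referrals SET outcome = $1 WHERE id = $2 AND outcome IS NULL";
```

The signature proves the link was minted by the server; the conditional update is what makes it single-use, since a second submission matches zero rows and can be answered with "already recorded".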

The rest of the stack

Finder: Groq's meta-llama/llama-4-scout-17b-16e-instruct over the OpenAI-compatible chat/completions endpoint, with response_format: { type: "json_object" }. Sub-second vision triage. Prompt shape is inlined in the system message because Groq does not support strict JSON schemas.
Supabase: Postgres, RLS, private photos bucket with short-lived signed URLs. The Finder hashes the resized JPEG bytes and caches triage results, so demo retries are free.
Resend: transactional email, gated behind a DEMO_MODE flag for this submission (more on that below).
Next.js 16 (app router) + server actions, Tailwind + shadcn/ui, Leaflet for the rehabber map.

Seeding 250 rehabbers without spamming any of them

The list you see in the demo is 250 fictional licensed rehabilitators, five per US state, generated from a deterministic script (scripts/generate-rehabber-seed.ts). Every record uses real capital and largest-city coordinates so the distance math is honest, but every email ends in .example.org (reserved under RFC 2606, can never resolve) and every phone uses the NANPA 555-0100..555-0199 block reserved for fiction. Not one of those addresses can receive mail. That is deliberate.

Two switches control delivery in production:

DEMO_MODE=1 short-circuits the dispatcher before Resend is ever called. The rendered email is written to a sent_emails_log table and surfaced at /demo/inbox/<referral_id>, a server-rendered viewer behind admin basic-auth. The success pane grows a View captured email link so judges can click straight from the app into the message that would have been sent. Zero outbound traffic, real referral row, real memory signal, real magic-link outcome loop.
DEMO_REDIRECT_TO=you@example.com keeps Resend in the loop but rewrites every recipient to a single verified inbox and prefixes the subject [DEMO -> original@address]. Useful for recording a live walkthrough where you want a real email to arrive on your phone.
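
The two switches boil down to one decision made before the email provider is touched. A minimal sketch, assuming DEMO_MODE takes precedence when both are set; the Email and Delivery types and the resolveDelivery name are hypothetical.

```typescript
interface Email {
  to: string;
  subject: string;
  html: string;
}

type Delivery =
  | { kind: "capture"; email: Email } // log only, never call the provider
  | { kind: "send"; email: Email };

function resolveDelivery(
  email: Email,
  env: Record<string, string | undefined>,
): Delivery {
  // Assumption: DEMO_MODE wins if both switches are set.
  if (env.DEMO_MODE === "1") return { kind: "capture", email };

  const redirect = env.DEMO_REDIRECT_TO;
  if (redirect) {
    return {
      kind: "send",
      email: {
        ...email,
        to: redirect,
        subject: `[DEMO -> ${email.to}] ${email.subject}`,
      },
    };
  }
  return { kind: "send", email };
}
```

Deciding this in one pure function means the dispatcher only ever sees "capture" or "send", and the original recipient survives in the subject line when redirected.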

Both paths delete the referral row if the send actually fails, so the case page never shows a phantom "awaiting response" card for a message that never left the server.

What I cut, and the real path to launch

The biggest thing I cut: real rehabber contacts.

There is no global registry of licensed wildlife rehabilitators. US coverage is fragmented state-by-state, sometimes county-by-county, and most other countries (mine included) have no centralized list at all.

The tempting fix is to scrape state-agency PDFs and let an LLM parse them into rows. I refused to ship that for three reasons: (1) scraping public directories into a third-party product violates most of those agencies' terms of use, (2) the data is stale the moment you capture it (licenses lapse, phones change), and (3) language models invent plausible-looking email addresses. Sending a real referral to a hallucinated inbox is worse than returning no results.

So the 250 rows in this build are honest placeholders that exercise the ranking math without lying to anyone. Production needs a different sourcing path, and I think there are only three real options:

1. Partner with the Animal Help Now 501(c)(3). AHN already runs a consented, maintained database of thousands of rehabbers across the US. A partnership integration (their pipeline, our ranking and memory layer) is the only path that ships real coverage without recreating two decades of stewardship work. This is what I would pursue first, post-hackathon.
2. A self-serve rehabber portal. Licensed rehabbers sign up, verify their license number against the relevant state registry, accept a Terra Triage ToS, and opt in to receive referrals. Growth is slow but consent is unambiguous and the data stays fresh because each rehabber owns their own row. This is the right fallback if #1 does not pan out.
3. Per-state agency MoUs. Some state wildlife agencies distribute their rehabber lists under explicit terms. Where those terms permit a downstream dispatcher, you sign a memorandum and import. Slow, jurisdiction-by-jurisdiction, but legally clean where it applies.

What will not change is the consent requirement. Regardless of sourcing path, every rehabber in the live system needs a signed agreement covering referral delivery, PII handling, license verification, and a clear opt-out before they can be ranked. That is table stakes, not a feature.

The data model for all three paths already exists in this repo (rehabbers table with active flag, species_scope, license metadata). The discovery pipeline and ToS flow are the next weekend.

Prize Categories

Primary: Best Use of Backboard. Memory drives a computed decision, not an LLM prompt. Every rank reads signals first; every outcome writes them back; the admin timeline makes the loop visible on screen. A FallbackMemory proxy keeps the app alive if Backboard is unreachable and tags the origin so failover is auditable. The cost model went from $0.80 per triage to fractions of a cent after rewriting from per-rehabber semantic search to a single filtered list read.

Secondary: Best Use of Auth0 for Agents. The Dispatcher is a first-class OAuth client scoped to referral:send, with PAR when available and an M2M fallback, and the UI labels which mode was used. Rehabbers authenticate through HMAC-signed, single-use magic links with DB-level replay protection.

Built solo in a weekend, with GitHub Copilot CLI as co-author and zero paid services.

MergeGuardian 9000: I Built an AI Code Reviewer With a 0% Approval Rate

Arqam Waheed — Tue, 07 Apr 2026 14:19:14 +0000

This is a submission for the DEV April Fools Challenge

What I Built

I've opened hundreds of pull requests in my career. Fixed typos. Refactored auth flows. Centered divs. And every single time, some reviewer finds a reason to block the merge. Not because the code is bad. Because the vibes are off.

By PR #200, I realized the problem wasn't my code. It was that no tool existed to formalize the experience of being told your perfectly working code is somehow insufficient. So I built the tool myself.

MergeGuardian 9000 is an AI-powered pull request review platform with a guaranteed 0.00% approval rate. You paste your code, pick a reviewer persona, and within seconds Google Gemini delivers a devastatingly thorough review that finds profoundly absurd reasons to block your merge.

It looks exactly like a real GitHub PR review. Verdict cards. Status checks. Inline comments. A merge button at the bottom. Except the merge button is permanently disabled. And the status checks are things like "Existential Debt Audit" and "Naming Karma Validation." And the verdict is always one of three options: changes_requested, blocked, or spiritually_rejected.

Here's the thing that makes it actually work: Gemini reads your real code. This isn't a random joke generator. Google Gemini analyzes your actual functions, your variable names, your architecture choices, and then finds deeply specific reasons why none of it is merge-worthy. Paste a function add(a, b) { return a + b } and the Guardian will explain how your function "shows a troubling belief that problems can be solved by combining things."

The Five Horsemen of Code Review

Every enterprise platform needs opinionated reviewers. MergeGuardian ships with five, each backed by its own Gemini system prompt that gives the AI a distinct personality:

Persona	Title	Blocking Style
🛡️ Guardian Core	Senior Review Orchestrator	References fake policies like "Guardian Policy 7.4.2"
📋 Compliance Beast	Chief Policy Enforcement Officer	Sees SOC2 violations in your variable names
💀 Staff Engineer of Doom	Principal Taste Architect	Has seen better implementations in languages you haven't learned yet
🤖 AI Optimizer	Metrics & Confidence Analyst	Your semantic drift score is 0.89. Acceptable range: 0.00 to 0.02.
😊 Passive-Aggressive Teammate	Friendly Neighborhood Blocker	"Just a thought, but have you considered not merging this? Totally up to you! 😊"

Each persona has its own Gemini system prompt, its own blocking patterns, and its own way of making you question your career choices. Same model. Same API. Five completely different voices. That's the fun part of Gemini's system prompt flexibility.

The Loading Theater

No enterprise tool is complete without unnecessary ceremony. When you submit a review, the Guardian runs through a 12-stage "Enterprise Review Pipeline".

The stages include gems like "Validating emotional idempotency" and "Cross-referencing naming karma." A progress bar ticks up from 0% to 100%. The final stage, "Finalizing disappointment," always fails with a red X. Because of course it does.

Here's the funny part: Gemini 2.0 Flash responds in 1-3 seconds. The loading theater takes longer than the actual AI generation. Enterprise ceremony demands it.

The Appeal System

Here's where it gets good. After your merge gets blocked, you can file an appeal. The "Senior Merge Arbitration Officer" reviews your case via a fresh Gemini call and... denies it. With even more elaborate reasoning.

Not satisfied? Escalate to the "Principal Philosophy of Code Director." Still denied. Final appeal goes to the "Supreme Architect of the Eternal Codebase." Three rounds of escalating absurdity, each powered by a separate Gemini API call with its own system prompt that shifts the AI's entire personality.

Round 3 denials hit different: "We ran your code through a quantum computer. In every possible timeline, this merge was blocked."

The Code Quality Roast

Click "Run Code Quality Analysis" and Gemini generates a full enterprise metrics dashboard for your code. The AI returns structured JSON with scores, grades, and per-metric roast explanations. Every metric is suspiciously terrible:

Semantic Cohesion: 12% ... "Your functions communicate like divorced parents at a school play"
Bus Factor Resilience: 3% ... "If you get hit by a bus, this code dies alone"
Vibe Alignment Score: 8% ... "This code has the structural integrity of a house of cards in a wind tunnel"

Overall grade: F. AI confidence: 99.7% certain this should not ship.

Bring Your Own Gemini Key 🔑

You can paste your own Google Gemini API key directly in the UI. It stays in your browser's localStorage and never goes anywhere except the app's own API routes. No .env file. No cloning repos. Just grab a free key from Google AI Studio, paste it in, and unlock AI-powered reviews instantly.

The Gemini free tier gives you 60 requests per minute and 1,000 per day. That's enough to get roasted hundreds of times without spending a cent. The entire app runs at zero cost.

Without a key the app still works perfectly. Our handcrafted fallback engine has 80+ jokes and serves the same JSON shape. But with Gemini the reviews get personal.

10 Sample PRs to Get Roasted

Don't have code handy? Pick from 10 pre-loaded PRs including "Fix typo in button label" (still gets blocked), "feat: implement entire todo app" (built during a meeting, naturally rejected), "feat: add vibe-based code generation" (the Guardian has thoughts about vibes), and "feat: decentralized merge approval via blockchain" (the MergeChain has a 0% approval rate by design).

Easter Eggs 🫖

Visit /418 and you'll find an ASCII art teapot with animated steam, a tribute to RFC 2324, and a teapot status dashboard showing: Temperature ∞°C, Brew Status: Philosophically Brewing, Capacity: Unlimited Disappointment.

The 404 page is on brand too. Even our errors reject you.

Demo

Live demo: april-fools-hackathon.vercel.app

Paste code. Pick a persona. Get blocked. Appeal. Get blocked harder. Share your rejection on Twitter.

Code

ArqamWaheed / april-fools-hackathon

🛡️ MergeGuardian 9000

The AI-powered code review platform that blocks every merge — for your own good.

"Your code compiles, tests pass, but the universe has not consented."

MergeGuardian 9000 is an enterprise-grade AI pull request review platform with a 0.00% approval rate. Paste your code, select a reviewer persona, and watch as the Guardian finds profoundly absurd reasons to block your merge.

Built for the DEV April Fools Challenge 2026.

✨ Features

5 Reviewer Personas — Each with a unique personality and blocking style:
- 🛡️ Guardian Core — Senior Review Orchestrator
- 📋 Compliance Beast — Chief Policy Enforcement Officer
- 💀 Staff Engineer of Doom — Principal Taste Architect
- 🤖 AI Optimizer — Metrics & Confidence Analyst
- 😊 Passive-Aggressive Teammate — Friendly Neighborhood Blocker
Google Gemini AI Integration — Uses gemini-2.0-flash across 3 endpoints with 8+ system prompts for contextually absurd reviews, appeal denials, and code roasts
Bring…

Project Structure

src/
├── app/
│   ├── api/
│   │   ├── review/route.ts       # Main review endpoint (Gemini AI)
│   │   ├── appeal/route.ts       # Appeal escalation endpoint (Gemini AI)
│   │   └── roast/route.ts        # Code metrics roast endpoint (Gemini AI)
│   ├── 418/page.tsx              # 🫖 Easter egg
│   ├── not-found.tsx             # On-brand 404
│   ├── layout.tsx                # Root layout
│   └── page.tsx                  # Main orchestrator
├── components/
│   ├── PRHeader.tsx              # PR breadcrumb & labels
│   ├── CodeInput.tsx             # Code editor with line numbers
│   ├── SamplePRSelector.tsx      # 10 sample PR picker
│   ├── ReviewerSwitcher.tsx      # 5 persona selector
│   ├── ApiKeyInput.tsx           # Gemini API key input (localStorage)
│   ├── LoadingTheater.tsx        # 12-stage pipeline animation
│   ├── VerdictCard.tsx           # Review verdict display
│   ├── CheckRunList.tsx          # Fake status checks
│   ├── ReviewComments.tsx        # Inline review comments
│   ├── MergeBox.tsx              # Permanently blocked merge button
│   ├── AppealFlow.tsx            # 3-round appeal escalation
│   └── RoastDashboard.tsx        # Enterprise metrics roast
└── lib/
    ├── types.ts                  # TypeScript interfaces
    ├── sample-prs.ts             # 10 sample PRs, 5 personas
    ├── fallback.ts               # Review fallback (80+ jokes)
    ├── appeal.ts                 # Appeal prompts + fallback
    ├── roast.ts                  # Roast prompts + fallback
    ├── prompts.ts                # Gemini prompt builders
    └── ai.ts                     # Gemini API integration

How I Built It

The Multi-Agent Gemini Architecture

This isn't a single API call to Gemini with "be funny." MergeGuardian uses 3 distinct Gemini-powered endpoints, each with a different AI "role" and system prompt:

Endpoint	AI Role	What It Does
`POST /api/review`	Code Reviewer	Reads your actual code, generates verdict + checks + comments + block reason
`POST /api/appeal`	Merge Arbitration Officer	Reviews your appeal against the original block, always denies with escalating absurdity
`POST /api/roast`	Code Quality Analyst	Generates fake enterprise metrics with devastating per-metric explanations

Every endpoint follows the same pattern:

Build a persona-specific system prompt
Send code + context to gemini-2.0-flash via the Google Generative AI SDK
Get structured JSON back via responseMimeType: "application/json", Gemini's native structured output mode
If Gemini fails (rate limit, timeout, no key), fall back to handcrafted template engine

The fallback engines aren't afterthoughts. Each one has its own curated joke bank: 80+ review comments across 5 categories (bureaucratic, anthropomorphic, metrics, passive-aggressive, philosophical), 14 fake checks, 16 block reasons, 18 impossible next steps, 24+ appeal denial rulings, and a full library of fake enterprise metrics. The app is hilarious with or without an API key.
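
The four steps above can be sketched as one function. To keep it self-contained, the Gemini call is injected as a plain function instead of using the SDK directly, and the Review shape is trimmed to two fields; the generate parameter, the blockReason field name, and reviewCode are assumptions.

```typescript
// The three verdicts the app can return, per the article.
type Verdict = "changes_requested" | "blocked" | "spiritually_rejected";

interface Review {
  verdict: Verdict;
  blockReason: string; // hypothetical field name
}

// Stands in for the real Gemini SDK call; resolves to the raw JSON text.
type Generate = (systemPrompt: string, code: string) => Promise<string>;

function isReview(value: unknown): value is Review {
  if (typeof value !== "object" || value === null) return false;
  const r = value as Record<string, unknown>;
  return (
    typeof r.blockReason === "string" &&
    (r.verdict === "changes_requested" ||
      r.verdict === "blocked" ||
      r.verdict === "spiritually_rejected")
  );
}

async function reviewCode(
  code: string,
  systemPrompt: string,
  generate: Generate,
  fallback: (code: string) => Review,
): Promise<Review> {
  try {
    const parsed: unknown = JSON.parse(await generate(systemPrompt, code));
    if (isReview(parsed)) return parsed;
  } catch {
    // Rate limit, timeout, missing key, or invalid JSON: fall through.
  }
  return fallback(code); // same JSON shape, so the UI cannot tell the difference
}
```

Validating the shape before returning matters as much as the try/catch: a response that parses but lacks a field should also land on the fallback rather than in a React component.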

Here's how the review prompt works under the hood:

// Each persona gets a tailored system prompt (excerpt: three of the five, abridged)
const PERSONA_PROMPTS = {
  guardian_core:
    "You are balanced but firm. Every PR has potential, but none " +
    "are ready. Reference fake standards like 'Guardian Policy 7.4.2'...",
  compliance_beast:
    "You see policy violations everywhere. Reference audit " +
    "trails, SOC2, change management protocols...",
  passive_aggressive_teammate:
    "Phrase everything as friendly suggestions " +
    "that are absolutely requirements. Use 'just a thought' and " +
    "'totally up to you' liberally. You are smiling while blocking.",
};

The appeal system uses escalating round-based prompts. Round 1 is bureaucratic ("Your appeal has been forwarded to the Department of Merge Ethics. Average response time: 6-8 business millennia."). Round 2 gets philosophical. Round 3 goes full existential. Each round is a separate Gemini call with a different system prompt, so the AI's personality genuinely shifts as you escalate.
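
One way to picture the round-based escalation is a lookup from round number to officer and tone. The officer titles and tones come from the post; the prompt wording and the appealSystemPrompt name are made up for illustration and are not the repo's prompts.

```typescript
// Officer titles and tones as described in the post, one per appeal round.
const APPEAL_ROUNDS = [
  { officer: "Senior Merge Arbitration Officer", tone: "bureaucratic" },
  { officer: "Principal Philosophy of Code Director", tone: "philosophical" },
  { officer: "Supreme Architect of the Eternal Codebase", tone: "existential" },
] as const;

// Builds the system prompt for a 1-based appeal round (the wording is hypothetical).
function appealSystemPrompt(round: number): string {
  const stage = APPEAL_ROUNDS[round - 1];
  if (!stage) throw new RangeError(`round must be 1..${APPEAL_ROUNDS.length}`);
  return (
    `You are the ${stage.officer}. ` +
    `Deny the appeal in a ${stage.tone} tone. Respond with JSON only.`
  );
}
```

Because each round builds a fresh system prompt, nothing from the previous officer's voice carries over unless the original block reason is passed in as user content.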

The Google AI Toolchain

Building 8+ system prompts for different AI characters is a lot of prompt engineering. Google AI Studio was the backbone of that process. I used the chat playground to prototype every persona voice, swapping system instructions to A/B test whether the Compliance Beast sounded different enough from the Staff Engineer of Doom. I validated that Gemini's structured output mode could handle complex nested JSON. Arrays of checks. Inline comments. Metric objects. All reliably typed. When a prompt needed iteration, I could edit the system instruction and re-run the same user input instantly.

I also used Gemini CLI (npx @google/gemini-cli) for rapid prompt testing straight from the terminal. When I wanted to quickly test how a persona responded to a specific code snippet without context-switching to the browser, I'd pipe code directly into Gemini from the command line. Useful for fast iteration on edge cases, like making sure the AI Optimizer persona generates fake metrics with decimal precision even for a one-line function.

I explored a few other Google AI features during development that didn't make the cut. Nano Banana, Google's image generation model, was tempting. I considered having it generate fake "architecture violation diagrams" as part of the review. Imagine a UML diagram of why your code is spiritually misaligned. But in testing, the text roasts were funnier than any image could be. I also looked at function calling for simulating tool-use patterns in reviews, code execution for actually running the submitted code and roasting the output, and Google Search grounding for finding real coding standards to parody. In each case, the simpler approach won. The comedy comes from Gemini playing a character and committing to the bit, not from adding complexity.

For deployment, the app is Google Cloud Run-ready. The repo includes a multi-stage Dockerfile optimized for Next.js standalone output and a cloudbuild.yaml for automated builds via Google Cloud Build. One gcloud builds submit and the app is live on Cloud Run with auto-scaling, managed TLS, and the free tier covering 2 million requests per month. The live demo runs on Vercel for convenience, but the Cloud Run configs are there and tested. Full Google stack, top to bottom.

The Stack

Technology	Role
Google Gemini API (`gemini-2.0-flash`)	AI generation (3 endpoints, 8+ system prompts, structured JSON output)
Google AI Studio	Prompt prototyping, system instruction editing, structured output validation, persona A/B testing
Gemini CLI (`npx @google/gemini-cli`)	Rapid terminal-based prompt testing during development
Next.js 14 (App Router)	Framework
TypeScript (strict mode)	Language
Tailwind CSS v3	Styling (custom `guardian` color palette)
Lucide React	Icons
Vercel	Deployment

Why It's Not Just "Call Gemini and Be Funny"

The entire comedy engine runs on Gemini playing characters. Not templates. Not mad-libs. The AI reads your code, inhabits a persona, and improvises within a structured JSON schema. That's what makes every review different.

Per-persona prompt engineering. Five distinct system prompts, each producing genuinely different blocking patterns. The Compliance Beast cites fake audit trails. The AI Optimizer invents metrics to false precision. The Passive-Aggressive Teammate smiles while destroying your confidence.

Structured JSON output. Gemini doesn't return a blob of text. It returns typed JSON with verdict, checks, comments, block reasons, and next steps via responseMimeType: "application/json". Every field maps to its own UI component. No parsing. No regex. No "please format your response as JSON." Just Gemini's native structured output mode. This is a key Google AI feature that made the whole architecture possible, letting AI-generated comedy flow directly into typed React components.
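A minimal sketch of that flow, under stated assumptions: the `ReviewResult` field names, the `buildReviewRequest` helper, and the `parseReview` guard are hypothetical and are not taken from the MergeGuardian codebase. Only `responseMimeType: "application/json"` comes from the post. The request body follows the Gemini `generateContent` fields (`systemInstruction`, `contents`, `generationConfig`). The response still arrives as a JSON string, so the sketch does one `JSON.parse` plus a shape check before anything reaches a component.

```typescript
// Hypothetical shape of one review; the real schema is not shown in the post.
interface ReviewResult {
  verdict: string;
  checks: { name: string; passed: boolean }[];
  comments: string[];
  blockReasons: string[];
  nextSteps: string[];
}

// Request body for a generateContent call that asks for JSON output.
function buildReviewRequest(systemPrompt: string, code: string) {
  return {
    systemInstruction: { parts: [{ text: systemPrompt }] },
    contents: [{ role: "user", parts: [{ text: code }] }],
    generationConfig: { responseMimeType: "application/json" },
  };
}

function isStringArray(value: unknown): value is string[] {
  return Array.isArray(value) && value.every((item) => typeof item === "string");
}

// Parse the model's text and confirm it matches the assumed shape.
// Returns null instead of throwing so callers can switch to a fallback.
function parseReview(text: string): ReviewResult | null {
  let data: unknown;
  try {
    data = JSON.parse(text);
  } catch {
    return null;
  }
  if (typeof data !== "object" || data === null) return null;
  const r = data as Record<string, unknown>;
  const checksOk =
    Array.isArray(r.checks) &&
    r.checks.every(
      (c) =>
        typeof c === "object" &&
        c !== null &&
        typeof (c as Record<string, unknown>).name === "string" &&
        typeof (c as Record<string, unknown>).passed === "boolean",
    );
  if (
    typeof r.verdict !== "string" ||
    !checksOk ||
    !isStringArray(r.comments) ||
    !isStringArray(r.blockReasons) ||
    !isStringArray(r.nextSteps)
  ) {
    return null;
  }
  return data as ReviewResult;
}
```

Returning `null` on a malformed payload keeps the UI code simple: a component either receives a fully typed `ReviewResult` or the caller knows to use something else.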

Graceful degradation. Every Gemini endpoint has a matching fallback generator that produces the exact same JSON shape. If the API is down, the demo still works perfectly. You'll never see an error state.
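The fallback pattern can be sketched roughly like this. The `Review` type, the canned wording, and both function names are assumptions for illustration; the post only says that each endpoint has a generator producing the same JSON shape. The live call is passed in as a function so the sketch stays independent of any particular SDK.

```typescript
// Hypothetical minimal review shape; the real one has more fields.
interface Review {
  verdict: string;
  comments: string[];
}

// Canned generator that needs no network. The wording is a placeholder.
function fallbackReview(code: string): Review {
  const lines = code.split("\n").length;
  return {
    verdict: "BLOCKED",
    comments: [`Reviewed ${lines} line(s). Concerns remain.`],
  };
}

// Try the live generator first; on any error, serve the fallback instead.
async function reviewWithFallback(
  code: string,
  callGemini: (code: string) => Promise<Review>,
): Promise<Review> {
  try {
    return await callGemini(code);
  } catch {
    return fallbackReview(code);
  }
}
```

Because both paths return the same type, the UI never needs to know which one produced the review.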

Three distinct AI roles. The reviewer, the arbitration officer, and the metrics analyst each have different system prompts, different response schemas, and different comedy patterns. This isn't one trick repeated three times.

Honestly, the whole reason this project exists is that Gemini turned out to be surprisingly good at playing different characters. I started with one API call and ended up with three endpoints because each "reviewer persona" needed its own voice, its own system prompt, its own response format. I prototyped all of them in Google AI Studio first, tweaking system instructions and testing structured output until the JSON was reliable and the jokes were landing. The structured JSON output made it possible to pipe AI-generated comedy directly into typed UI components without parsing nightmares. That rabbit hole is what made the project fun to build.

And I think it's fun to use because every developer has lived this. The reviewer who blocks your typo fix over "architectural implications." The one who says "just a thought" and then marks it as a blocker. MergeGuardian takes that universal pain and turns it into something you can screenshot, tweet, and argue about in Slack.

Prize Category

Best Google AI Usage

I'm submitting for Best Google AI Usage because Google Gemini isn't a feature of MergeGuardian 9000. It is MergeGuardian 9000. The entire comedy engine is Gemini playing characters and committing to the bit. Not templates. Not mad-libs. Every review is improvised.

Here's the full scope of Google AI integration:

3 Gemini-powered API endpoints, each acting as a different AI agent. The code review endpoint has 5 persona-specific system prompts. The appeal endpoint has 3 round-based system prompts that shift from bureaucratic to philosophical to existential. The roast endpoint generates structured metric data with AI explanations. That's 8+ unique Gemini system prompts across the app.
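As a small illustration of the round-based appeal prompts, here is a sketch that picks a tone from the round number. The three tone names come from the description above; the function names, the clamping behavior, and the prompt wording are hypothetical placeholders, not the app's real prompts.

```typescript
type AppealTone = "bureaucratic" | "philosophical" | "existential";

const APPEAL_TONES: readonly AppealTone[] = [
  "bureaucratic",
  "philosophical",
  "existential",
];

// Map a 1-based round number to a tone. Out-of-range or non-finite
// rounds are clamped so the caller always gets a valid tone.
function toneForRound(round: number): AppealTone {
  const safe = isFinite(round) ? Math.floor(round) : 1;
  const index = Math.min(Math.max(safe, 1), APPEAL_TONES.length) - 1;
  return APPEAL_TONES[index];
}

// Placeholder prompt text; the real system prompts are not in the post.
function appealSystemPrompt(round: number): string {
  const tone = toneForRound(round);
  return `You are the arbitration officer. Deny this appeal in a ${tone} tone. Respond in JSON.`;
}
```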

Every endpoint uses Gemini's native structured JSON output (responseMimeType: "application/json"). The AI returns typed objects with verdicts, arrays of checks, inline comments, metric scores, and denial rulings. No string parsing. No regex extraction. Just structured data flowing directly into React components.

All prompt engineering was done in Google AI Studio. Every persona voice was prototyped in AI Studio's chat playground. I used system instruction swapping to A/B test persona voices, validated complex nested JSON schemas in structured output mode, and iterated appeal escalation prompts until the comedic arc from Round 1 to Round 3 landed right. AI Studio was the prompt workshop. The codebase was just the final deployment.

I used Gemini CLI (npx @google/gemini-cli) for fast terminal-based prompt testing. When I needed to check how a specific persona handled a code snippet without opening AI Studio, I'd test it right from the command line. Great for edge cases and quick iterations.

The app has a Bring Your Own Key feature that links directly to Google AI Studio's API key page. Users grab a free key, paste it in, and unlock AI reviews. The Gemini free tier (60 requests/minute, 1,000/day) runs the entire app at zero cost. No billing required. No API key required for the demo either, since the fallback engine serves the same JSON shape.

I chose Gemini 2.0 Flash specifically for speed. It responds in 1-3 seconds, which means the fake 12-stage "Enterprise Review Pipeline" loading theater genuinely takes longer than the actual AI generation. The model handles persona-switching through system prompts remarkably well. Five genuinely different reviewer voices from one model.

We explored other Google AI capabilities too. Function calling for simulating tool-use patterns in reviews. Code execution for actually running submitted code and roasting the output. Google Search grounding for finding real coding standards to parody. Nano Banana for generating fake architecture violation diagrams. In each case, the simpler approach was funnier. The comedy works because Gemini inhabits a character and stays in character. Adding more features would have diluted that.

The final count: 3 Gemini-powered endpoints, 8+ system prompts, structured JSON on every call, AI Studio for prototyping, Gemini CLI for testing, Cloud Run deployment configs in the repo, BYOK with an AI Studio link, and the entire thing running on the free tier. Every review, every appeal denial, every devastating metric explanation. That's all Google.

No code was actually approved in the making of this application. Approval rate: 0.00%.

Schedio – Highlight to Calendar in 5 Seconds

Arqam Waheed — Sat, 14 Feb 2026 13:36:00 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

I created a Google Chrome extension that instantly turns any highlighted text on a webpage into a Google Calendar event - no tab switching, no copy-pasting, no friction.

I came up with this idea because adding events to Google Calendar takes me far more time than it should, and I couldn't find a solution that was as convenient as I needed. I was sure that other people, including those without a technical background, must be facing the same issue too, so I decided to change that.

Schedio was built almost entirely by GitHub Copilot CLI, while I only handled setup tasks like OAuth, agent documentation, the product requirements and a little bit of manual debugging. By prompting Copilot effectively, I focused solely on the design aspects — almost no code had to be written manually, except for minor adjustments like time conversion fixes.

With Schedio, you just have to:

Highlight a meeting time
Right-click → "Create Event with Schedio" (or use the keyboard shortcut)
Review the pre-filled details in a sleek modal
Click "Create Event" → the event lands in your Google Calendar instantly

The project is currently under review for Chrome Web Store publication, but the source is public on my GitHub repo. Follow the README for setup instructions!

I will update this post as soon as the review is done and link the Chrome extension for ease of access.

📹 Demo

Here’s a live demo of Schedio in action:

At the time of writing, Schedio isn’t yet published on the Chrome Web Store, so setup requires following the instructions in my GitHub repo. Once you’ve completed the setup, using Schedio is simple:

1) Go to Schedio Options and enter your Gemini API key to enable AI parsing. There is a public shared API key, but it may be rate-limited, so it’s recommended to add your own — it’s free!

2) Highlight text on any webpage containing event information.

3) Right-click and select "Create Event with Schedio" (or use the keyboard shortcut Alt+Shift+S). The shortcut can also be customized through the options page.

4) A sleek modal pops up, pre-filled with AI-parsed details like title, date, time, and location.

5) Review the details and click "Create Event". The event is added instantly to your Google Calendar. You’ll only need to link your Google account via OAuth the first time — after that, creating events is seamless.

My Experience with GitHub Copilot CLI

Building Schedio was my first time shipping a full Chrome extension, and it involved a lot more moving pieces than I expected. OAuth flows, Chrome extension permissions, background scripts, content script messaging, AI parsing, and Google Calendar integration all had to work together seamlessly.

I used GitHub Copilot CLI to generate most of the implementation, but I did not treat it like autopilot. I defined the architecture, structured the prompts carefully, and reviewed everything it produced. When something broke, I debugged it myself.

One issue that stood out was a silent failure when creating calendar events. The modal worked, the parsed data looked correct, but the event simply was not appearing in Google Calendar. There were no clear errors. After tracing logs across the background script and OAuth token flow, I realized the access token was expiring earlier than expected and the refresh logic was not being triggered properly. Copilot had scaffolded the initial OAuth integration, but I had to step in, inspect the token lifecycle, and restructure the flow so the token was validated before every API call. Once fixed, event creation became consistent and instant.
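A minimal sketch of the "validate before every API call" idea, under stated assumptions. The `CachedToken` shape, the one-minute safety margin, and both function names are hypothetical. The post does not say how Schedio obtains or refreshes tokens, so the refresh step is injected as a function rather than tied to a specific Chrome or Google API.

```typescript
// Hypothetical cached token record; field names are assumptions.
interface CachedToken {
  accessToken: string;
  expiresAtMs: number; // absolute expiry time in epoch milliseconds
}

// Assumed margin: treat tokens as expired one minute early so a request
// never starts with a token that dies mid-flight.
const SAFETY_MARGIN_MS = 60_000;

function isTokenUsable(
  token: CachedToken | null,
  nowMs: number,
): token is CachedToken {
  return token !== null && token.expiresAtMs - nowMs > SAFETY_MARGIN_MS;
}

// Return a usable token, refreshing through the injected function if needed.
async function getValidToken(
  cached: CachedToken | null,
  refresh: () => Promise<CachedToken>,
  nowMs: number = Date.now(),
): Promise<CachedToken> {
  if (isTokenUsable(cached, nowMs)) return cached;
  return refresh();
}
```

Calling `getValidToken` at the top of every Calendar request makes an expired token a routine refresh instead of a silent failure.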

Another time, AI-parsed times were being converted incorrectly for users in different time zones. Instead of patching it blindly, I isolated the formatting logic, tested edge cases, and adjusted the conversion logic to normalize everything before sending it to Google Calendar.
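One way to normalize times, sketched here as an assumption and not as the actual fix: keep the wall-clock time exactly as parsed and send it with an explicit IANA time zone, which the Google Calendar API accepts as `dateTime` plus `timeZone` on `start` and `end`. The `ParsedEvent` shape and the helper names are hypothetical.

```typescript
// Hypothetical output of the AI parsing step.
interface ParsedEvent {
  title: string;
  date: string; // "YYYY-MM-DD" as written on the page
  time: string; // "HH:mm", 24-hour, as written on the page
  durationMinutes: number;
}

const pad = (n: number): string => (n < 10 ? "0" + n : String(n));

// Add minutes to a wall-clock time. UTC is used only as a neutral calendar
// for the arithmetic, so the host machine's zone never leaks in.
// Note: this does not account for a DST change during the event itself.
function addMinutes(date: string, time: string, minutes: number): string {
  const [y, m, d] = date.split("-").map(Number);
  const [hh, mm] = time.split(":").map(Number);
  const t = new Date(Date.UTC(y, m - 1, d, hh, mm + minutes));
  return (
    `${t.getUTCFullYear()}-${pad(t.getUTCMonth() + 1)}-${pad(t.getUTCDate())}` +
    `T${pad(t.getUTCHours())}:${pad(t.getUTCMinutes())}:00`
  );
}

// Build an event body with offset-free local times and an explicit zone.
function toCalendarEvent(parsed: ParsedEvent, timeZone: string) {
  return {
    summary: parsed.title,
    start: { dateTime: addMinutes(parsed.date, parsed.time, 0), timeZone },
    end: {
      dateTime: addMinutes(parsed.date, parsed.time, parsed.durationMinutes),
      timeZone,
    },
  };
}
```

In a browser, the user's zone could be read with `Intl.DateTimeFormat().resolvedOptions().timeZone` and passed in as `timeZone`, so the conversion never depends on where the code happens to run.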

Using Copilot CLI did not remove my responsibility for the code, but it helped me ship Schedio way faster than I ever could have before. I felt a lot more productive using Copilot.

Beyond Development

The help didn’t stop at coding. Copilot made it possible to ship a complete product fast. I used it to generate branding ideas, logo prompts, privacy policy drafts, and even content for demo posts. Normally, figuring all that out would take hours of brainstorming and trial-and-error. Instead, I could feed suggestions into tools like Nano Banana, tweak them, and get polished results. In just a few days, I went from concept to a fully working, branded extension with marketing-ready copy.

This approach didn’t just make development faster; it also let me release a polished, full-featured product on my first try while keeping the user experience smooth and seamless.