DEV Community: Mir Shah

One More Year -> The Documentary I Never Filmed

Mir Shah — Sun, 12 Jul 2026 19:58:49 +0000

This is a submission for Weekend Challenge: Passion Edition

One More Year

What I Built

I built a documentary studio.

One More Year An AI documentary studio

Not a template with your name dropped in. An actual small production pipeline that reads four things you tell it, decides what your story is, casts it with real performed voices, builds its own sound design, writes its own score, and prints a real mp4. All in the browser.

Here's the observation it started from. Everyone has one thing they can't quit. Football, chess, a piano someone left them, a side project nobody asked for. And almost none of that ever becomes anything, because turning "the thing I can't stop doing" into something that actually feels like a documentary normally takes a camera crew, actors, a sound engineer, and about a week of editing.

So I tried to compress that whole crew into five automated desks and see how close I could get.

It's called One More Year. You answer four questions. It gives you back a short film with a narrator, you as the subject, and a couple of extra voices pulled from inside your own story: memory, doubt, whatever the piece needs. Every voice is performed, not read. Every sound effect is chosen for that scene. Every score is composed fresh. It's all mixed and encoded into a real video file without a render server ever touching it.

Demo

Why there is no hosted public demo

There's no hosted public demo. That's on purpose.

The ElevenLabs key is kept server-side only, so it never reaches browser JavaScript. Hosting it publicly would mean putting my own paid key on the open internet for strangers to spend. So instead, the whole thing runs locally in about two minutes, with your own free Gemini key and your own ElevenLabs key.

git clone https://github.com/abbasmir12/onemoreyear
cd onemoreyear
npm install
npm run dev

Open the Press Room, drop in a key, hit Start Your Story. Headphones on.

Code

abbasmir12 / onemoreyear

An Open Source documentary studio for the things you cannot quit - directed by Gemini, performed by ElevenLabs

ᴏɴᴇ ᴍᴏʀᴇ ʏᴇᴀʀ

An open-source documentary studio for the thing you cannot quit.

Built for the DEV Weekend Challenge: Passion Edition — July 10–13, 2026

Everyone has one thing they can't quit.

Most tools would ask you to write it down. This one asks you to say it out loud — then hands it to a director, a cast, a sound desk, and a print desk, and gets you back a real short film.

What it actually does

You answer four questions. Nothing technical — just what's the thing you can't quit, when did it almost end, why did you stay, and one line only you would say. That's the entire brief.

From there, a real production pipeline runs — not a template filled in, not a single flat voice reading a script. An AI director reads what you gave it, decides what the story actually is, and hands it…

View on GitHub

How I Built It

The four questions

The input surface is deliberately tiny.

What's the thing you can't quit. When did it almost end. Why did you stay. One line only you would say. No writing, no script, no timeline editor.

That constraint is the actual design decision. Asking someone to write a story gets you a self-conscious kind of writing. Asking them a direct question at 2am gets you something else entirely.

The core design decision

Five desks, not one prompt

I didn't want this to feel like "type something, get an AI blob back." So I split the work the way a real production would split it.

01 — The Interview

Four questions. No writing required.

02 — The Director

Gemini reads your answers, finds the arc, and writes a full script. It casts a narrator, you, and one or two symbolic voices pulled from inside the story itself.

03 — The Cast

Every voice is performed through ElevenLabs eleven_v3. Hesitations, corrections, a line that actually breaks where it should.

04 — Sound & Score

The director decides what a scene needs, generates that exact effect, then composes an original score through ElevenLabs Music v2.

05 — The Print

ffmpeg.wasm mixes every voice, effect, and cue against the frames and encodes a real mp4. Client-side. No render server.

The director doesn't summarize. It directs.

My first version of the prompt just asked Gemini to turn the answers into a short documentary.

What came back read like a summary with quotation marks around it.

Competent. Flat. Forgettable.

What worked instead was telling it to behave like a documentary editor. Find the arc, not the facts.

The directing principle

Cast a narrator and the subject, then add one or two symbolic voices, like memory or doubt. Those voices are explicitly not allowed to invent facts or pretend to be real witnesses.

Write two versions of every line. A clean transcript, and a separate performance string with sparse, motivated imperfections: a repeated word, a false start, a line that catches. Most lines stay clean, so the handful of imperfect moments actually mean something.

Every scene also gets a director-chosen transition. Direct, sound, or silence. The first scene is mandatory sound, because a documentary that opens on a voice mid-sentence with nothing behind it doesn't feel like one.

Read the actual rules I gave the AI director

A few of the real constraints from the production prompt, word for word:

"Symbolic voices are fragments of the subject's interior story, not factual witnesses. They must never claim to be a coach, friend, rival, family member, or person who was present."

"Never invent a quote, date, injury, achievement, relationship, or event the person did not provide."

"Human speech is not polished prose. In only 2 or 3 segments across the entire film, add ONE motivated imperfection. Never sprinkle random 'um' sounds everywhere."

"Segment 1 MUST use 'sound' for 4–5 seconds before anyone speaks. Establish the physical world first."

"Every film gets a score. If the story calls for restraint, choose 'instrumental' and keep it sparse, rather than skipping the score entirely."

The cast, the room tone, the score

The Cast

Every performed line goes through eleven_v3.

The Room Tone

Every scene's ambience, a stadium, a stairwell, rain on a tin roof, is a real generated sound effect chosen for that specific beat. Not a stock loop reused everywhere.

The Score

Every film also gets an original score through Music v2. The director can choose instrumental, vocal, or hybrid, but never nothing. A quiet story doesn't mean a silent one. It means a sparser score, not an empty desk.

Printed with ffmpeg, not on a server

This is the part I'm most proud of.

The part I'm most proud of

The final assembly happens with ffmpeg compiled to WebAssembly, running client-side. Timing every voice line against its lead-in. Layering sound effects at the right offsets. Ducking the score under dialogue. Encoding the whole thing to H.264.

No render queue

No render queue.

No upload and wait

No upload and wait.

A real downloadable file

The browser does the actual video engineering and hands you back a downloadable file.

Prize Categories

Best Use of Google AI

Best Use of Google AI. Gemini directs the entire piece: structured JSON output for the script and cast, image generation for the frames.

Best Use of ElevenLabs

Best Use of ElevenLabs. Every voice, every sound effect, and every score in the film is generated live through eleven_v3, the Sound Effects API, and Music v2.

https://github.com/abbasmir12/onemoreyear

ORVIX, Open-source Self-Organizing AI Engineering Company

Mir Shah — Tue, 07 Jul 2026 14:12:06 +0000

I Built an Open-Source AI Engineering Company Instead of Another AI Agent

For the past year, AI coding agents have become incredibly capable.

But almost all of them still follow the same idea:

One increasingly capable AI should build everything.

I wanted to explore the opposite.

What if the AI wasn't the engineer? What if it became the entire engineering company?

That's how Orvix started.

Meet Orvix

Orvix is a self-organizing AI engineering company.

You don't create agents manually or decide how many are needed.

You simply describe the mission.

From there, Orvix designs the project, creates exactly the specialists required for that mission, assigns ownership, coordinates their work, reviews their pull requests, and even creates entirely new specialists later if the project grows.

Every mission creates a different engineering company.

Building software like a real engineering team

Instead of one AI jumping between frontend, backend, infrastructure, testing and documentation, Orvix treats them as independent engineers.

Each specialist owns its own Git branch.

They work in parallel.

They negotiate decisions.

They review one another's work.

They communicate through the Orvix Book, a shared communication layer where engineers ask questions, request clarification, share discoveries, and coordinate their work instead of sharing one giant conversation.

The goal isn't just generating code.

It's organizing engineering.

Local first. Cloud when you need it.

One thing I wanted from the beginning was flexibility.

Orvix can run completely on your own machine if you want everything local.

Or you can deploy the Orvix runtime on Alibaba Cloud ECS, allowing anyone on your team to connect to the same engineering company remotely while every mission, every specialist, and the entire execution live on the server.

For this hackathon, Orvix was deployed on Alibaba Cloud and powered by Qwen Cloud (Alibaba Cloud Model Studio).

There's much more behind it

This post only scratches the surface.

The repository includes detailed documentation covering:

the complete architecture
the Orvix Map
the Orvix Book
the planning pipeline
dynamic specialist creation
mission lifecycle
deployment on Alibaba Cloud
and the reasoning behind the system's design.

If you're curious about how an AI engineering company actually works, I'd love for you to take a look.

Links

GitHub

https://github.com/abbasmir12/orvix

Documentation

https://github.com/abbasmir12/orvix/tree/main/docs

Demo

https://youtu.be/ZzocAW0nbTs

I'd love to hear what you think. If you have ideas for improving Orvix or thoughts about multi-agent engineering systems, feel free to share them.

Among Liars -> The 7th Player Isn't Human

Mir Shah — Sat, 20 Jun 2026 11:24:32 +0000

This is a submission for the June Solstice Game Jam.

I built Among Liars, a realtime multiplayer elimination where six humans join a room, but the game secretly adds a seventh player: a Gemini-powered AI hiding inside the Spy side.

There are two teams:

Detectives are trying to expose the hidden AI.
Spies are trying to protect the AI long enough for the Detectives to run out of chances.

The game is inspired by the Turing Test, but instead of asking "Can AI answer like a human?", it asks something more playable:

Can AI survive being socially judged by humans?

When the game begins, six human players are split into two teams: three Detectives and three Spy Agents. A hidden Gemini-powered AI is then added to the Spy side, creating a team of four spies. The Detectives must identify the AI, while the Spy Agents work together to keep it hidden.

Each round starts with a 2-minute warmup where teams can plan in private rooms. Detectives discuss who feels suspicious. Spies coordinate how to protect the AI.

Then one Detective asks a wildcard-style question to the Spy side. The question is automatically sent to every living Spy player and also to the Gemini AI. Everyone answers under pressure, and the Detective has to read the answers like evidence.

The trick is that Spy-side players receive new cover names every round, so Detectives cannot simply track the AI by name or position. They have to judge tone, timing, weirdness, confidence, and emotional detail.

A question like:

"Describe a tiny mistake you made today without making it sound important."

is much harder than a normal trivia question because it asks for texture, not correctness.

That is where the game becomes interesting.

Sometimes AI sounds too polished.

Sometimes humans sound fake on purpose.

Sometimes the suspicious answer is suspicious because it is AI.

Sometimes it is suspicious because a Spy is protecting the AI.

That tension is the core of Among Liars.

You can play it here:

Live Demo: https://amongliars.vercel.app

Video Demo

Live App

https://amongliars.vercel.app

GitHub Repository

https://github.com/abbasmir12/amongliars

How I Built It

The frontend is built with:

React
Vite
Framer Motion

and a custom black-and-white visual style.

The backend uses Supabase for:

Room creation
Random matchmaking
Player state
Role assignment
Realtime chat
Private team rooms
Round state
Answer storage
Eliminations
Win conditions

I used Supabase Realtime instead of a custom WebSocket server, so messages, answers, player changes, and round changes update live across browser tabs and devices.

The game includes:

6-player waiting room with automatic countdown
Private role reveal
Detective-only and Spy-only private rooms
Public chat
2-minute planning/warmup phase
Rotating Spy cover names every round
Wildcard question flow
90-second Detective question window
45-second Spy answer window
30-second Detective final read window
Gemini AI answer generation
Evidence cards
Detective guess phase
Round result screen
Eliminated player tracking
Detective/Spy win states

Round Flow

Each round is designed to feel like a small interrogation.

First, there is a 2-minute warmup. During this time everyone can continue talking publicly, but the private rooms are where the real strategy happens.

Detective Strategy Room

Detectives discuss:

Who sounds too clean
Who is avoiding pressure
What question would expose the AI
Which answer patterns felt suspicious in previous rounds

Spy Strategy Room

Spies coordinate:

How to protect the hidden AI
How messy or natural their answers should feel
Whether to draw suspicion away from one player
How to make the room harder for Detectives to read

After warmup, one living Detective is selected.

That Detective receives a 90-second question window.

The Detective writes a wildcard pressure prompt. Once submitted, the question is automatically sent to every living Spy-side player, including the human Spies and the hidden Gemini AI.

The Spy side then receives a 45-second answer window.

Human Spies type their responses while Gemini generates its answer through a Supabase Edge Function.

All responses are stored in Supabase and displayed as evidence cards.

Finally, the selected Detective receives a 30-second final read window and must click the answer card they believe belongs to the AI.

Resolution

The resolution is intentionally asymmetric:

If the Detective guesses wrong, that Detective is eliminated.
If the Detective guesses correctly, one human Spy bodyguard is eliminated.
The AI survives until all human Spies are gone and it has nowhere left to hide.

Undercover Names

One of the most important mechanics is the rotating identity system.

Spy-side players never keep the same visible name between rounds.

A player might be:

TAVI in Round 1
ORION in Round 2
MICA in Round 3

This prevents cheap detective work.

Detectives cannot simply memorize player names, positions, or patterns tied to a specific identity.

Instead, they must judge the answers themselves.

Previous round results preserve the original cover names, so historical evidence remains readable even after identities rotate.

Wildcard Questions

The wildcard question is the heart of the game.

The best questions pressure the difference between a lived human answer and a generated answer.

Examples:

"Describe a tiny mistake you made today without making it sound important."
"What is a smell that makes you trust a place?"
"Say something you would only text, not say out loud."
"Which answer in this room feels rehearsed, and why?"

These questions are not about facts.

They are about texture.

They force players to produce awkward, emotional, sensory, or social details under pressure.

That is where the Turing Test becomes playable.

Gemini Integration

The Gemini integration runs server-side through a Supabase Edge Function.

When a Detective submits a question, the function:

Checks the current room and round.
Finds the hidden AI player.
Reads the AI's current cover name.
Sends the question and game context to Gemini.
Receives a short in-character answer.
Saves the answer into Supabase.
Broadcasts it alongside the other Spy answers.

The Gemini API key is never exposed to the browser.

I also added multiple AI behavior styles so Gemini does not always respond with the same personality.

Sometimes it answers plainly.

Sometimes it is guarded.

Sometimes it is short, awkward, or oddly direct.

The goal is not to make the AI sound perfectly human every time.

The goal is to make it difficult to separate from the Spy side.

Prize Category

I am submitting for both optional prize categories.

Best Ode to Alan Turing

Among Liars is built directly around the idea of the Turing Test.

But instead of making the test a static question-and-answer screen, I turned it into a social game.

The AI is not judged by one answer alone.

It is judged by how it survives inside a room full of humans who are actively suspicious of it.

The game asks:

Can a machine imitate a human well enough to survive pressure, suspicion, and social reading?

That felt like a more interactive tribute to Alan Turing's original imitation game.

Best Google AI Usage

Gemini is not a decorative feature in this project.

It is the hidden player.

The entire game loop depends on Gemini:

Gemini receives the Detective's wildcard question.
Gemini answers as a Spy-side player.
Gemini uses the current room context and cover name.
Gemini's answer becomes evidence the Detective must judge.
The game cannot fully exist without the AI participant.

I integrated Gemini through a server-side Supabase Edge Function so the API key remains protected and the AI response becomes part of the realtime game state.

The AI is also given its current undercover identity and round context, allowing it to behave like a player inside the match rather than a generic assistant.

Final Thoughts

Among Liars started from a simple question:

What if the Turing Test was not a test, but a game night?

The result is a tense social deduction game where humans are reading AI, humans are imitating AI, and nobody can fully trust what "normal" sounds like.

That is the fun part.

In this game, the AI does not need to be perfect.

It just needs to survive.

PROCSee -> Turn Your System Into a Crime Scene & let Gemini Become Investigator!

Mir Shah — Thu, 26 Feb 2026 15:10:50 +0000

This is a submission for the Built with Google Gemini: Writing Challenge

What I Built with Google Gemini

About four weeks ago, the Gemini 3 Hackathon dropped — Google DeepMind's global hackathon with a $100K prize pool, asking builders to create something genuinely new with the Gemini 3 API. Not another chatbot. Not a wrapper. Something that actually pushes what the model can do.

I had one question sitting in my head for a while: what if your computer could investigate itself?

Not just flag a suspicious process. Not just match a signature and throw an alert at you. But actually think — form a theory, pull more evidence, change its mind, reach a conclusion. The way a real security analyst would.

That became PROCSee.

It's an autonomous security investigation system for Windows. It monitors every process on your machine in real-time, and when something looks suspicious, it hands the investigation off to Gemini 3 Pro — which then decides what additional data it needs, queries for it, reasons across multiple rounds of evidence, and writes a full forensic report. The tagline: Turn your system into a crime scene. Let PROCSee be the forensic analyst.

How It Actually Works (The Architecture)

Let me walk through this the way I wish someone had explained it to me before I built it, because the architecture decisions were the hardest part — not the AI integration.

Step 1: Catch everything, instantly

Windows is constantly spawning processes. Updaters, scanners, system services — hundreds of events per hour on a normal machine. The classic approach is polling: check what's running every few seconds. The problem? A piece of malware that executes, drops a payload, and exits in under a second is completely invisible to a poller.

We used WMI event callbacks instead. WMI (Windows Management Instrumentation) is a pub/sub system built into Windows. You subscribe to process creation events, and the OS calls your code the moment anything starts — under 10ms latency. We capture it all: the process name, path, parent process, command line, user account, everything. Immediately written to a raw events database in SQLite. No analysis yet. Just capture.

New Process Starts
      ↓
WMI fires callback (<10ms)
      ↓
Stored in raw_process_events immediately
      ↓
Done. Fast. No AI involved yet.

Step 2: The problem with "just send everything to Gemini"

Here's where we made our first big mistake, and then fixed it.

Our first instinct: every time a process starts, send its data to Gemini and ask "is this suspicious?" Simple idea. We tested it. On a normal Windows machine, that's easily 3-5 process creation events per second during active use. At that rate you'd blow through your API quota in literal minutes. Not hours. Minutes.

So we needed a smarter funnel before anything touches the API.

We built a local behavior-scoring engine — zero API cost, runs entirely on-device. It checks 40+ patterns before Gemini ever sees anything:

Is PowerShell using -encodedcommand or -executionpolicy bypass?
Is a browser spawning a shell? (chrome.exe → powershell.exe is almost never legitimate)
Is something executing from %TEMP% or AppData?
Is certutil.exe or mshta.exe making network connections? (classic LOLBins abuse)
Is a process touching registry persistence keys?

Each pattern scores points. If a process scores zero — nothing suspicious — it's logged and forgotten. It never reaches Gemini. Only things that earn attention get elevated. This alone cut our API calls by around 95%.

Step 3: The dual-database architecture

Even with the scoring filter, we still needed to solve what Gemini actually sees. This is the core architectural insight of the whole project: separate "store everything" from "show the AI what matters."

We run two databases side by side:

The raw events database stores every single process event in full detail. This is the forensic record — complete, unfiltered, queryable at any time. It's how we can answer "show me every process that made a network connection to this external IP in the last 30 minutes" without having pre-loaded all of that into Gemini's context.

The summary database gets built every 60 seconds. We aggregate the raw events into a 1-minute digest: what was unusual, what matched suspicious patterns, the high-level picture. This is small — a few KB of actual signal. This is what Gemini reads first to orient itself.

Raw Events DB (everything)  →→→  Summary DB (1-min digest)
       ↓                                    ↓
  Forensic record                   Gemini reads this first
  Gemini queries this               to orient itself
  when it needs more detail

Step 4: Gemini doesn't just read the data — it decides what it needs

This is the part we're most proud of, and it's what makes PROCSee different from a standard AI integration.

Most AI integrations work like this: gather all the data you think is relevant, stuff it into a prompt, hope the AI has what it needs. The problem is you're guessing what it'll need. You either over-send (burns tokens, hits context limits) or under-send (bad analysis).

We flipped it. After Gemini reads the summary, it can say: "I need more information before I give you a verdict." And then it tells us exactly what it wants:

{
  "needs_more_data": true,
  "queries": [
    {
      "action": "QUERY_PROCESS",
      "process_id": 4821,
      "time_range": "last_5_minutes",
      "details": ["network", "file_access", "cpu"]
    },
    {
      "action": "QUERY_NETWORK",
      "time_range": "last_30_minutes",
      "min_connections": 3
    }
  ]
}

We execute those queries against the raw database, return the results, and Gemini continues its analysis. Another round. It keeps doing this until it's confident enough to give a verdict.

A real investigation flow looks like this:

Round 1 — Gemini reads the summary: "There's suspicious PowerShell activity worth investigating."
→ We query: full process details for that PowerShell instance

Round 2 — Gemini gets the data: "It's making outbound connections to 203.0.113.42, an external IP."
→ We query: all network activity to that IP across the whole system

Round 3 — Gemini gets the data: "Three separate processes are all calling out to the same external IP. This is command-and-control communication."
→ Final verdict: CONFIRMED_THREAT — risk 0.95, confidence 0.92

No human in that loop. Gemini decided what it needed, went and got it, and formed its own conclusion.

Step 5: The forensic report

When risk is ≥50% and confidence is ≥60%, Gemini generates a full Markdown forensic report — written in first person, walking through its investigation process, the evidence chain, MITRE ATT&CK technique mappings, indicators of compromise, and recommended response actions.

We built a custom renderer for these reports with syntax highlighting for cmd:, path:, ip:, and proc: prefixes so they read like real analyst documents, not raw AI output.

The stack: Python + FastAPI for the backend agent, SQLite with WAL mode for the dual-database architecture, pywin32 for WMI, psutil for process data, React + Vite for the dashboard, WebSocket for real-time streaming, and the google-genai SDK for Gemini 3 Pro.

Demo

[GitHub: https://github.com/abbasmir12/procsee]

The dashboard has a few views worth calling out. The Gemini Conversation View is the one that surprised me most when I first saw it working. You can watch in real-time as Gemini issues a QUERY_NETWORK call, gets results back, immediately pivots to QUERY_PATTERN: shell_spawn because it noticed something in the results, and keeps iterating toward a verdict. It genuinely looks like watching someone think through a problem — because that's what it is.

The Detailed Report Viewer renders the final forensic reports with full Markdown and syntax highlighting. Each report includes the complete investigation chain — every query Gemini issued, every piece of evidence it weighed, every confidence score.

What I Learned

The architecture problem was the real challenge

Here's what I didn't expect going in: the Gemini integration itself was actually the smooth part. Once the architecture was solid, plugging in Gemini was relatively clean. The hard part — the part that took most of the time — was building the system that makes responsible, efficient use of the API.

The quota problem hit us immediately. Naive implementation: fire an API call for every process event. Reality: quota exhausted in under an hour on a busy machine. That forced us to completely rethink the data flow. The behavior scoring engine, the 60-second aggregation, the dual-database design — all of that exists because of one question we kept coming back to: how do we make sure Gemini only sees what's actually worth its attention?

Every architectural decision in this project traces back to that question. If you're building anything that involves a continuous data stream and an LLM, that's the question you need to answer first. Everything else flows from it.

Rate limits aren't just a bug — they shaped the whole system

When we stress-tested with multiple concurrent investigations, we hit rate limits constantly. Five investigations running simultaneously, each doing 3-4 query rounds with large contexts — that's a lot of tokens per minute, very fast.

The frustrating part: the error messages just said "resource exhausted." Is that requests per minute? Tokens per minute? Daily limit? These have completely different fixes. RPM you solve with throttling and request spacing. TPM you solve with context compression and smarter batching. Daily limits you solve with queuing strategy. Not knowing which one you're hitting means you're guessing at the solution while your quota keeps burning.

We ended up implementing all three mitigations simultaneously because we couldn't tell which problem we were actually solving: exponential backoff with jitter, a hard cap on max_query_rounds per investigation (default 4), a global concurrency limit (max 3 deep investigations at once), and dynamic thinking level selection so we're not burning high-compute calls on triage decisions that don't need them. The rate limit constraints literally shaped the concurrency model of the entire system — which is a weird thing to say but it's true.

Thinking levels matter more than I expected

thinking_level="low" vs thinking_level="high" isn't just a speed dial. It changes the quality of reasoning you get and what prompting strategies work well. We use low for fast triage — "is this worth investigating at all?" — and high for deep forensic analysis and report generation.

Early on we used high thinking everywhere. Slower responses, heavier quota usage, and no meaningful quality improvement for simple yes/no triage decisions. Right tool for the right job. Sounds obvious in retrospect, but figuring out which job needs which level took real experimentation.

Cutting features is a skill

We came into the hackathon wanting to build prevention features — automatically suspending suspicious processes, network isolation. Had it half-implemented. Then we cut all of it.

Not because of time. Because we realized: if you're wrong on a false positive and your tool kills a legitimate process, you've broken trust permanently. Investigation and reporting empowers analysts. Autonomous process-killing is a liability. Cutting that scope made the project sharper and more honest about what it actually is. The disabled beta_prevention block is still in config.yaml — kept it as a reminder of the decision.

Google Gemini Feedback

What genuinely worked

The 1M token context window carried the whole investigation model. I planned to build summarization logic to manage context across multi-turn investigations — compress old query results, drop less-relevant evidence as rounds progressed. Never needed any of it. The entire investigation history — all the autonomous queries, all the results, all the evidence across multiple rounds — fit comfortably. And Gemini would reason across all of it in later rounds, catching connections between something from round 1 and new data from round 3. That cross-context reasoning was more capable than I expected going in.

Structured JSON output was rock solid. The autonomous query protocol only works if Gemini reliably returns machine-parseable decisions mid-analysis. I was genuinely nervous this would be flaky — sometimes JSON, sometimes Markdown-wrapped, sometimes off-schema. It wasn't. response_mime_type="application/json" combined with a clear schema in the prompt was consistently reliable even when the underlying reasoning was complex.

Multi-turn reasoning quality is genuinely different. The gap between "here's all the data, give me a verdict" and the autonomous multi-turn investigation is not subtle. The model caught things in round 3 that it completely missed or hand-waved in round 1. Letting it pull the data it actually needed, rather than us guessing upfront, made a real difference to the quality of the final verdicts.

Where we hit friction

Rate limit error messages need more context. "Resource exhausted" isn't actionable. RPM, TPM, and daily limits all require different solutions, and not knowing which constraint you're hitting means you're solving the wrong problem while your quota keeps ticking down. Even a simple error code that differentiates the limit type would have saved us significant debugging time during the crunch.

thinking_level documentation is thin for practical use. Finding the parameter was easy. Understanding the actual tradeoffs — which prompt structures work best at each level, how it affects structured output reliability, what temperature to pair with each level — was entirely trial and error. For anyone building agentic systems where you're making many API calls with different complexity levels, practical guidance here would save a lot of iteration time.

Gemini will over-query if you let it. With high thinking enabled, it sometimes issued 5-6 autonomous queries when 2-3 would have been enough for a confident verdict. Thorough is good — but in a long-running monitoring system that's real quota cost accumulating over hours and days. Prompting it toward decisiveness helped somewhat, but the hard max_query_rounds cap was ultimately necessary as a backstop. Guidance on prompting for query efficiency specifically in agentic loops — not just single-shot quality — would be useful to see in the docs.

None of this broke the project. The core capability — letting Gemini autonomously decide what it needs and go get it — worked better than expected and is genuinely a different kind of AI integration than the standard request/response loop. PROCSee wouldn't exist without it.

You can also check out my original Gemini 3 Hackathon submission on Devpost here: https://devpost.com/software/procsee

[GitHub: https://github.com/abbasmir12/procsee | Built for the Gemini 3 Hackathon on Devpost]

MindMelee: Beat AI in debate Arena!

Mir Shah — Sun, 15 Feb 2026 12:36:26 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

MindMelee is basically your AI debate partner that's always ready to argue with you (in a good way!). I wanted to create something that helps people get better at debating without needing another person around.

https://youtu.be/lqRVlVBc24M

The app uses Google's Gemini Live API so you can actually talk to it like a real conversation. After each debate, it breaks down your performance - vocabulary, clarity, how persuasive you were, all that stuff. You can pick between two modes: Coach (nice and helpful) or Fierce (actually challenges you).

The UI is pretty bold - I went for this neubrutalist style inspired by CodeJam. Big timer on the left, your conversation on the right. Only shows the last 5 messages so you stay focused instead of scrolling through everything.

Demo

GitHub Repository: github.com/abbasmir12/mindmelee

Demo Site: https://mindmelee.vercel.app/

Screenshots

Dashboard - Start Your Debate

Live Debate Interface

Performance Analytics

I'm still learning - only scored 30 on my first try 😒 Let me know what you get!

Key Features

Real-Time Voice Debates - Just talk naturally, AI responds right away
Bold Neubrutalist UI - CodeJam-inspired design with smooth animations
Comprehensive Analytics - See exactly where you're strong and where you need work
Dual AI Modes - Coach mode for learning, Fierce mode when you want a real challenge
Progress Tracking - Charts and heatmaps showing your improvement
Persona Discovery - Find out what kind of debater you are

My Experience with GitHub Copilot CLI

Okay, so here's the thing - I'm not gonna pretend Copilot CLI built this entire app for me. I designed everything, figured out the architecture, made all the creative decisions. But man, when it came to actually implementing stuff? Copilot CLI saved my life.

How It Actually Helped

Talking to Code
Instead of spending hours reading docs, I could just ask Copilot CLI in plain English. Like when I needed the audio visualization thing - I just asked how to make a real-time audio analyzer. It gave me the code AND explained how the WebAudio API actually works. That's huge when you're learning.

Smart Suggestions
The context-aware stuff is wild. While I was building the UI, it would suggest the exact Tailwind classes I needed for those neubrutalist shadows. It somehow understood the pattern I was going for and kept everything consistent.

Instant Debugging
My components were re-rendering like crazy on every audio frame (performance nightmare). Copilot CLI spotted the issue immediately and told me to use refs instead. Saved me probably 3 hours of debugging.

Learning on the Fly
This is what really got me - it doesn't just give you code, it teaches you. When I was struggling with Framer Motion, it explained spring physics and showed me this mode="popLayout" feature I'd never seen. Every time I used it, I learned something new.

TypeScript Help
WebRTC and MediaStream types are confusing as hell. Copilot CLI just... knew the right types. And explained why they were needed. No more tab-switching to docs every 5 seconds.

The Real Impact

Look, what would've taken me days of Stack Overflow rabbit holes took minutes with Copilot CLI:

WebAudio API: 30 minutes vs. probably 4+ hours
Framer Motion animations: 1 hour vs. a full day of trial and error
Performance fixes: Instant vs. who knows how long
TypeScript types: Right there vs. endless doc searching

But it's not just about speed. Copilot CLI changed how I approach building stuff. I'm not scared to try new APIs anymore because I know I've got this AI pair programmer who can explain things in real-time.

Best Moments

The Genius Suggestion: Copilot CLI told me to use flex-col-reverse for the chat feed. New messages at bottom, old ones fade out at top. So simple, so perfect. I wouldn't have thought of that.
Design Consistency: It helped me keep the neubrutalist style consistent across 20+ components. Once it understood the pattern, it just... got it.
Performance Win: Fixed that re-render issue in seconds. Went from laggy mess to smooth 60fps just like that.
The Learning: Finally understanding React's closure issues with useEffect. Not just fixing the bug, but actually getting why it happened.

Why This Matters

If you're building something complex by yourself, Copilot CLI is like having a senior dev on your team. Not to do the work for you - to help you do it better.

I knew what I wanted to build. Copilot CLI helped me actually build it without getting stuck every 10 minutes. That's the difference between "cool idea" and "working app."

MindMelee has real-time voice AI, complex animations, comprehensive analytics - stuff that requires knowing a lot of different things. Copilot CLI made it possible to tackle all of it without drowning in documentation.

Try It Yourself

git clone https://github.com/abbasmir12/mindmelee.git
cd mindmelee
npm install
npm run dev

Add your Gemini API key in Settings and start debating!

I had the vision, Copilot CLI helped me execute it.

MindMelee: Beat AI In Arena!

Mir Shah — Sun, 15 Feb 2026 11:25:03 +0000

This is a submission for the GitHub Copilot CLI Challenge

What I Built

https://youtu.be/lqRVlVBc24M

Demo

GitHub Repository: github.com/abbasmir12/mindmelee

Demo Site: https://mindmelee.vercel.app/

Screenshots

Dashboard - Start Your Debate

Live Debate Interface

Performance Analytics

I'm still learning - only scored 30 on my first try 😒 Let me know what you get!

Key Features

Real-Time Voice Debates - Just talk naturally, AI responds right away
Bold Neubrutalist UI - CodeJam-inspired design with smooth animations
Comprehensive Analytics - See exactly where you're strong and where you need work
Dual AI Modes - Coach mode for learning, Fierce mode when you want a real challenge
Progress Tracking - Charts and heatmaps showing your improvement
Persona Discovery - Find out what kind of debater you are

My Experience with GitHub Copilot CLI

How It Actually Helped

TypeScript Help
WebRTC and MediaStream types are confusing as hell. Copilot CLI just... knew the right types. And explained why they were needed. No more tab-switching to docs every 5 seconds.

The Real Impact

Look, what would've taken me days of Stack Overflow rabbit holes took minutes with Copilot CLI:

WebAudio API: 30 minutes vs. probably 4+ hours
Framer Motion animations: 1 hour vs. a full day of trial and error
Performance fixes: Instant vs. who knows how long
TypeScript types: Right there vs. endless doc searching

Best Moments

The Genius Suggestion: Copilot CLI told me to use flex-col-reverse for the chat feed. New messages at bottom, old ones fade out at top. So simple, so perfect. I wouldn't have thought of that.
Design Consistency: It helped me keep the neubrutalist style consistent across 20+ components. Once it understood the pattern, it just... got it.
Performance Win: Fixed that re-render issue in seconds. Went from laggy mess to smooth 60fps just like that.
The Learning: Finally understanding React's closure issues with useEffect. Not just fixing the bug, but actually getting why it happened.

Why This Matters

If you're building something complex by yourself, Copilot CLI is like having a senior dev on your team. Not to do the work for you - to help you do it better.

I knew what I wanted to build. Copilot CLI helped me actually build it without getting stuck every 10 minutes. That's the difference between "cool idea" and "working app."

Try It Yourself

git clone https://github.com/abbasmir12/mindmelee.git
cd mindmelee
npm install
npm run dev

Add your Gemini API key in Settings and start debating!

I had the vision, Copilot CLI helped me execute it.