Mir Shah

PROCSee -> Turn Your System Into a Crime Scene & Let Gemini Become the Investigator!

This is a submission for the Built with Google Gemini: Writing Challenge


What I Built with Google Gemini

About four weeks ago, the Gemini 3 Hackathon dropped — Google DeepMind's global hackathon with a $100K prize pool, asking builders to create something genuinely new with the Gemini 3 API. Not another chatbot. Not a wrapper. Something that actually pushes what the model can do.

I had one question sitting in my head for a while: what if your computer could investigate itself?

Not just flag a suspicious process. Not just match a signature and throw an alert at you. But actually think — form a theory, pull more evidence, change its mind, reach a conclusion. The way a real security analyst would.

That became PROCSee.

It's an autonomous security investigation system for Windows. It monitors every process on your machine in real time, and when something looks suspicious, it hands the investigation off to Gemini 3 Pro — which then decides what additional data it needs, queries for it, reasons across multiple rounds of evidence, and writes a full forensic report. The tagline: Turn your system into a crime scene. Let PROCSee be the forensic analyst.


How It Actually Works (The Architecture)

Let me walk through this the way I wish someone had explained it to me before I built it, because the architecture decisions were the hardest part — not the AI integration.

Step 1: Catch everything, instantly

Windows is constantly spawning processes. Updaters, scanners, system services — hundreds of events per hour on a normal machine. The classic approach is polling: check what's running every few seconds. The problem? A piece of malware that executes, drops a payload, and exits in under a second is completely invisible to a poller.

We used WMI event callbacks instead. WMI (Windows Management Instrumentation) is a pub/sub system built into Windows. You subscribe to process creation events, and the OS calls your code the moment anything starts — under 10ms latency. We capture it all: the process name, path, parent process, command line, user account, everything. Immediately written to a raw events database in SQLite. No analysis yet. Just capture.

New Process Starts
      ↓
WMI fires callback (<10ms)
      ↓
Stored in raw_process_events immediately
      ↓
Done. Fast. No AI involved yet.
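The capture path is deliberately dumb: append a row and return. A minimal sketch of the write side, assuming a pared-down `raw_process_events` schema (the real table captures more fields); on Windows, a WMI creation callback built on pywin32 would call `record_event` for each new process:

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_process_events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    ts REAL NOT NULL,
    name TEXT,
    path TEXT,
    parent TEXT,
    cmdline TEXT,
    username TEXT
)
"""

def open_events_db(db_path=":memory:"):
    # WAL mode lets the capture thread write while analysis threads read.
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute(SCHEMA)
    return conn

def record_event(conn, event):
    # Capture only: no scoring, no analysis. This path has to stay fast.
    conn.execute(
        "INSERT INTO raw_process_events (ts, name, path, parent, cmdline, username) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (time.time(), event["name"], event["path"], event["parent"],
         event["cmdline"], event["user"]),
    )
    conn.commit()
```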

Step 2: The problem with "just send everything to Gemini"

Here's where we made our first big mistake, and then fixed it.

Our first instinct: every time a process starts, send its data to Gemini and ask "is this suspicious?" Simple idea. We tested it. On a normal Windows machine, that's easily 3-5 process creation events per second during active use. At that rate you'd blow through your API quota in literal minutes. Not hours. Minutes.

So we needed a smarter funnel before anything touches the API.

We built a local behavior-scoring engine — zero API cost, runs entirely on-device. It checks 40+ patterns before Gemini ever sees anything:

  • Is PowerShell using -encodedcommand or -executionpolicy bypass?
  • Is a browser spawning a shell? (chrome.exe → powershell.exe is almost never legitimate)
  • Is something executing from %TEMP% or AppData?
  • Is certutil.exe or mshta.exe making network connections? (classic LOLBins abuse)
  • Is a process touching registry persistence keys?

Each pattern scores points. If a process scores zero — nothing suspicious — it's logged and forgotten. It never reaches Gemini. Only things that earn attention get elevated. This alone cut our API calls by around 95%.
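A toy version of that scoring engine, covering a few of the patterns above (the rule names and point weights here are illustrative, not PROCSee's actual values):

```python
# Each rule is (name, predicate over an event dict, points).
SUSPICIOUS_PARENTS = {"chrome.exe", "firefox.exe", "msedge.exe", "winword.exe"}
SHELLS = {"powershell.exe", "cmd.exe"}

RULES = [
    ("encoded_powershell",
     lambda e: e["name"] == "powershell.exe"
               and ("-encodedcommand" in e["cmdline"].lower()
                    or "-executionpolicy bypass" in e["cmdline"].lower()), 40),
    ("browser_spawns_shell",
     lambda e: e["parent"] in SUSPICIOUS_PARENTS and e["name"] in SHELLS, 50),
    ("temp_execution",
     lambda e: "\\temp\\" in e["path"].lower()
               or "\\appdata\\" in e["path"].lower(), 30),
]

def score_event(event):
    """Return (total score, matched rule names). Zero means: log and forget."""
    total, matched = 0, []
    for name, pred, points in RULES:
        if pred(event):
            total += points
            matched.append(name)
    return total, matched
```

Everything that scores zero dies here, locally, for free; only non-zero scores ever cost an API call.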

Step 3: The dual-database architecture

Even with the scoring filter, we still needed to solve what Gemini actually sees. This is the core architectural insight of the whole project: separate "store everything" from "show the AI what matters."

We run two databases side by side:

The raw events database stores every single process event in full detail. This is the forensic record — complete, unfiltered, queryable at any time. It's how we can answer "show me every process that made a network connection to this external IP in the last 30 minutes" without having pre-loaded all of that into Gemini's context.

The summary database gets built every 60 seconds. We aggregate the raw events into a 1-minute digest: what was unusual, what matched suspicious patterns, the high-level picture. This is small — a few KB of actual signal. This is what Gemini reads first to orient itself.

Raw Events DB (everything)  →→→  Summary DB (1-min digest)
       ↓                                    ↓
  Forensic record                   Gemini reads this first
  Gemini queries this               to orient itself
  when it needs more detail
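The 60-second digest job boils down to a couple of aggregate queries over the raw table. A sketch, again with pared-down assumed columns (including a stored behavior score):

```python
import sqlite3
import time

# Pared-down schema for illustration; the real raw table carries more columns.
RAW_SCHEMA = """
CREATE TABLE IF NOT EXISTS raw_process_events (
    ts REAL, name TEXT, cmdline TEXT, score INTEGER DEFAULT 0
)
"""

def build_minute_summary(conn, now=None):
    """Aggregate the last 60s of raw events into a few-KB digest for Gemini."""
    now = time.time() if now is None else now
    counts = dict(conn.execute(
        "SELECT name, COUNT(*) FROM raw_process_events WHERE ts >= ? "
        "GROUP BY name", (now - 60,)).fetchall())
    flagged = conn.execute(
        "SELECT name, cmdline, score FROM raw_process_events "
        "WHERE ts >= ? AND score > 0", (now - 60,)).fetchall()
    return {
        "window_end": now,
        "process_counts": counts,                     # the high-level picture
        "flagged": [{"name": n, "cmdline": c, "score": s}
                    for n, c, s in flagged],          # what earned attention
    }
```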

Step 4: Gemini doesn't just read the data — it decides what it needs

This is the part we're most proud of, and it's what makes PROCSee different from a standard AI integration.

Most AI integrations work like this: gather all the data you think is relevant, stuff it into a prompt, hope the AI has what it needs. The problem is you're guessing what it'll need. You either over-send (burns tokens, hits context limits) or under-send (bad analysis).

We flipped it. After Gemini reads the summary, it can say: "I need more information before I give you a verdict." And then it tells us exactly what it wants:

{
  "needs_more_data": true,
  "queries": [
    {
      "action": "QUERY_PROCESS",
      "process_id": 4821,
      "time_range": "last_5_minutes",
      "details": ["network", "file_access", "cpu"]
    },
    {
      "action": "QUERY_NETWORK",
      "time_range": "last_30_minutes",
      "min_connections": 3
    }
  ]
}

We execute those queries against the raw database, return the results, and Gemini continues its analysis. Another round. It keeps doing this until it's confident enough to give a verdict.

A real investigation flow looks like this:

Round 1 — Gemini reads the summary: "There's suspicious PowerShell activity worth investigating."
→ We query: full process details for that PowerShell instance

Round 2 — Gemini gets the data: "It's making outbound connections to 203.0.113.42, an external IP."
→ We query: all network activity to that IP across the whole system

Round 3 — Gemini gets the data: "Three separate processes are all calling out to the same external IP. This is command-and-control communication."
Final verdict: CONFIRMED_THREAT — risk 0.95, confidence 0.92

No human in that loop. Gemini decided what it needed, went and got it, and formed its own conclusion.
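That loop can be sketched in a few lines. Here `ask_model` and `run_queries` are hypothetical wrappers — one around the Gemini call in JSON mode, one around the raw-DB query executor — injected so the loop itself has no API dependency:

```python
def run_investigation(ask_model, run_queries, summary, max_rounds=4):
    """Drive the multi-round investigation: each round the model either
    requests more data (per the JSON protocol above) or returns a verdict."""
    evidence = [{"role": "summary", "data": summary}]
    for _ in range(max_rounds):
        decision = ask_model(evidence)          # parsed JSON decision
        if not decision.get("needs_more_data"):
            return decision                     # confident: final verdict
        results = run_queries(decision["queries"])
        evidence.append({"role": "query_results", "data": results})
    # Backstop: round cap reached, force a verdict with evidence so far.
    return ask_model(evidence + [
        {"role": "instruction", "data": "Give your best verdict now."}])
```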

Step 5: The forensic report

When risk is ≥50% and confidence is ≥60%, Gemini generates a full Markdown forensic report — written in first person, walking through its investigation process, the evidence chain, MITRE ATT&CK technique mappings, indicators of compromise, and recommended response actions.
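The gate is simple but worth writing down, since both thresholds have to hold at once (50% and 60% here are the 0.5 and 0.6 on the 0-1 scale the verdicts use):

```python
def should_generate_report(risk, confidence):
    # Report only when risk >= 0.5 AND confidence >= 0.6.
    # High risk with low confidence (or vice versa) stays quiet.
    return risk >= 0.5 and confidence >= 0.6
```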

We built a custom renderer for these reports with syntax highlighting for cmd:, path:, ip:, and proc: prefixes so they read like real analyst documents, not raw AI output.
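The actual renderer is a React component, but the prefix-highlighting idea is easy to sketch; the span class names below are invented for illustration:

```python
import re

# Tokens like cmd:, path:, ip:, proc: get wrapped for styling.
PREFIX_RE = re.compile(r"\b(cmd|path|ip|proc):(\S+)")

def highlight(line):
    # Wrap each prefix:value token in a span the report stylesheet can target.
    return PREFIX_RE.sub(
        lambda m: f'<span class="tok-{m.group(1)}">{m.group(2)}</span>', line)
```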

The stack: Python + FastAPI for the backend agent, SQLite with WAL mode for the dual-database architecture, pywin32 for WMI, psutil for process data, React + Vite for the dashboard, WebSocket for real-time streaming, and the google-genai SDK for Gemini 3 Pro.


Demo

[GitHub: https://github.com/abbasmir12/procsee]

The dashboard has a few views worth calling out. The Gemini Conversation View is the one that surprised me most when I first saw it working. You can watch in real time as Gemini issues a QUERY_NETWORK call, gets results back, immediately pivots to QUERY_PATTERN: shell_spawn because it noticed something in the results, and keeps iterating toward a verdict. It genuinely looks like watching someone think through a problem — because that's what it is.

The Detailed Report Viewer renders the final forensic reports with full Markdown and syntax highlighting. Each report includes the complete investigation chain — every query Gemini issued, every piece of evidence it weighed, every confidence score.


What I Learned

The architecture problem was the real challenge

Here's what I didn't expect going in: the Gemini integration itself was actually the smooth part. Once the architecture was solid, plugging in Gemini was relatively clean. The hard part — the part that took most of the time — was building the system that makes responsible, efficient use of the API.

The quota problem hit us immediately. Naive implementation: fire an API call for every process event. Reality: quota exhausted in under an hour on a busy machine. That forced us to completely rethink the data flow. The behavior scoring engine, the 60-second aggregation, the dual-database design — all of that exists because of one question we kept coming back to: how do we make sure Gemini only sees what's actually worth its attention?

Every architectural decision in this project traces back to that question. If you're building anything that involves a continuous data stream and an LLM, that's the question you need to answer first. Everything else flows from it.

Rate limits aren't just an obstacle — they shaped the whole system

When we stress-tested with multiple concurrent investigations, we hit rate limits constantly. Five investigations running simultaneously, each doing 3-4 query rounds with large contexts — that's a lot of tokens per minute, very fast.

The frustrating part: the error messages just said "resource exhausted." Is that requests per minute? Tokens per minute? Daily limit? These have completely different fixes. RPM you solve with throttling and request spacing. TPM you solve with context compression and smarter batching. Daily limits you solve with queuing strategy. Not knowing which one you're hitting means you're guessing at the solution while your quota keeps burning.

We ended up implementing mitigations for all of them at once, because we couldn't tell which problem we were actually solving: exponential backoff with jitter, a hard cap on max_query_rounds per investigation (default 4), a global concurrency limit (max 3 deep investigations at once), and dynamic thinking level selection so we're not burning high-compute calls on triage decisions that don't need them. The rate limit constraints literally shaped the concurrency model of the entire system — which is a weird thing to say, but it's true.
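The backoff piece, sketched with full jitter and an injectable `sleep` so tests don't actually wait (`QuotaError` stands in for whatever resource-exhausted exception the SDK raises):

```python
import random
import time

class QuotaError(Exception):
    """Stand-in for the SDK's resource-exhausted error."""

def with_backoff(call, max_attempts=5, base=1.0, cap=30.0, sleep=time.sleep):
    """Retry `call` with exponential backoff and full jitter. Because the
    error doesn't say which limit (RPM/TPM/daily) was hit, every quota
    error gets the same treatment."""
    for attempt in range(max_attempts):
        try:
            return call()
        except QuotaError:
            if attempt == max_attempts - 1:
                raise                       # out of attempts; surface it
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```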

Thinking levels matter more than I expected

thinking_level="low" vs thinking_level="high" isn't just a speed dial. It changes the quality of reasoning you get and what prompting strategies work well. We use low for fast triage — "is this worth investigating at all?" — and high for deep forensic analysis and report generation.

Early on we used high thinking everywhere. Slower responses, heavier quota usage, and no meaningful quality improvement for simple yes/no triage decisions. Right tool for the right job. Sounds obvious in retrospect, but figuring out which job needs which level took real experimentation.
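The selection itself ended up being a simple lookup keyed by job type. The task names below are ours; "low"/"high" are Gemini 3 thinking_level values, and the exact plumbing into the generation config depends on your google-genai SDK version:

```python
# Map each job to the cheapest thinking level that handles it well.
THINKING_LEVELS = {
    "triage": "low",          # fast yes/no: is this worth investigating?
    "investigation": "high",  # multi-round forensic reasoning
    "report": "high",         # final narrative report generation
}

def thinking_level_for(task):
    # Default to "low": unknown jobs shouldn't burn high-compute calls.
    return THINKING_LEVELS.get(task, "low")
```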

Cutting features is a skill

We came into the hackathon wanting to build prevention features — automatically suspending suspicious processes, network isolation. Had it half-implemented. Then we cut all of it.

Not because of time. Because we realized: if you're wrong on a false positive and your tool kills a legitimate process, you've broken trust permanently. Investigation and reporting empower analysts. Autonomous process-killing is a liability. Cutting that scope made the project sharper and more honest about what it actually is. The disabled beta_prevention block is still in config.yaml — kept it as a reminder of the decision.


Google Gemini Feedback

What genuinely worked

The 1M token context window carried the whole investigation model. I planned to build summarization logic to manage context across multi-turn investigations — compress old query results, drop less-relevant evidence as rounds progressed. Never needed any of it. The entire investigation history — all the autonomous queries, all the results, all the evidence across multiple rounds — fit comfortably. And Gemini would reason across all of it in later rounds, catching connections between something from round 1 and new data from round 3. That cross-context reasoning was more capable than I expected going in.

Structured JSON output was rock solid. The autonomous query protocol only works if Gemini reliably returns machine-parseable decisions mid-analysis. I was genuinely nervous this would be flaky — sometimes JSON, sometimes Markdown-wrapped, sometimes off-schema. It wasn't. response_mime_type="application/json" combined with a clear schema in the prompt was consistently reliable even when the underlying reasoning was complex.

Multi-turn reasoning quality is genuinely different. The gap between "here's all the data, give me a verdict" and the autonomous multi-turn investigation is not subtle. The model caught things in round 3 that it completely missed or hand-waved in round 1. Letting it pull the data it actually needed, rather than us guessing upfront, made a real difference to the quality of the final verdicts.

Where we hit friction

Rate limit error messages need more context. "Resource exhausted" isn't actionable. RPM, TPM, and daily limits all require different solutions, and not knowing which constraint you're hitting means you're solving the wrong problem while your quota keeps ticking down. Even a simple error code that differentiates the limit type would have saved us significant debugging time during the crunch.

thinking_level documentation is thin for practical use. Finding the parameter was easy. Understanding the actual tradeoffs — which prompt structures work best at each level, how it affects structured output reliability, what temperature to pair with each level — was entirely trial and error. For anyone building agentic systems where you're making many API calls with different complexity levels, practical guidance here would save a lot of iteration time.

Gemini will over-query if you let it. With high thinking enabled, it sometimes issued 5-6 autonomous queries when 2-3 would have been enough for a confident verdict. Thorough is good — but in a long-running monitoring system that's real quota cost accumulating over hours and days. Prompting it toward decisiveness helped somewhat, but the hard max_query_rounds cap was ultimately necessary as a backstop. Guidance on prompting for query efficiency specifically in agentic loops — not just single-shot quality — would be useful to see in the docs.


None of this broke the project. The core capability — letting Gemini autonomously decide what it needs and go get it — worked better than expected and is genuinely a different kind of AI integration than the standard request/response loop. PROCSee wouldn't exist without it.

You can also check out my original Gemini 3 Hackathon submission on Devpost here: https://devpost.com/software/procsee

[GitHub: https://github.com/abbasmir12/procsee | Built for the Gemini 3 Hackathon on Devpost]

Top comments (3)

Mir Shah

Would love to hear your valuable thoughts!

Nadine

Yes, I've experienced the same with Gemini 3, where the model autonomously over-queries. The solution is to give the model constraints, like a JSON output block that forces it to commit to a decision instead of drifting into endless reasoning. Sounds like you did hard-code a query cap.

Mir Shah

Yeah, the JSON constraint idea is solid. We went with kind of a hard cap as a quick fix, but combining both approaches would probably be cleaner. Appreciate the insight!