<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Sujal Gupta</title>
    <description>The latest articles on DEV Community by Sujal Gupta (@sujal_gupta_3dc0d9052e350).</description>
    <link>https://dev.to/sujal_gupta_3dc0d9052e350</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3909504%2F6c46005c-3573-4b08-93e5-dad18d83ea04.png</url>
      <title>DEV Community: Sujal Gupta</title>
      <link>https://dev.to/sujal_gupta_3dc0d9052e350</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/sujal_gupta_3dc0d9052e350"/>
    <language>en</language>
    <item>
      <title>CodeDNA: AI Codebase Archaeologist Built with Gemma 4 Thinking Mode</title>
      <dc:creator>Sujal Gupta</dc:creator>
      <pubDate>Fri, 22 May 2026 22:01:12 +0000</pubDate>
      <link>https://dev.to/sujal_gupta_3dc0d9052e350/codedna-ai-codebase-archaeologist-built-with-gemma-4-thinking-mode-1ihg</link>
      <guid>https://dev.to/sujal_gupta_3dc0d9052e350/codedna-ai-codebase-archaeologist-built-with-gemma-4-thinking-mode-1ihg</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;You inherited this codebase 6 months ago. You can feel something went wrong around 2021. Bug reports spiked. Velocity dropped. The original authors left. The commit history has 3,000 entries — and every answer is in there.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Nobody has time to read 3,000 commits.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;CodeDNA does.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;CodeDNA&lt;/strong&gt; is an AI Codebase Archaeologist. You paste your &lt;code&gt;git log&lt;/code&gt;, and Gemma 4 — using Thinking Mode — reconstructs the story of your codebase: bug storms, architectural pivots, refactor eras, feature bursts, and an overall health score with a transparent breakdown.&lt;/p&gt;

&lt;p&gt;The output is 100% verifiable. You can check every milestone against your actual commit history. No hallucinated CVEs, no unverifiable financial claims — just pattern-extracted facts from structured text you already own.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;GitHub:&lt;/strong&gt; &lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/acchasujal" rel="noopener noreferrer"&gt;
        acchasujal
      &lt;/a&gt; / &lt;a href="https://github.com/acchasujal/codeDNA" rel="noopener noreferrer"&gt;
        codeDNA
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;CodeDNA — AI Codebase Archaeologist&lt;/h1&gt;
&lt;/div&gt;
&lt;blockquote&gt;
&lt;p&gt;Feed Gemma 4 your git history. Discover exactly when — and why — your codebase evolved.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;&lt;a rel="noopener noreferrer" href="https://github.com/acchasujal/codeDNA/./demo.gif"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fraw.githubusercontent.com%2Facchasujal%2FcodeDNA%2FHEAD%2F.%2Fdemo.gif" alt="CodeDNA Demo"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Every codebase has a turning point. The moment before is clean commits and clear intent.
The moment after is hotfixes, reverts, and growing entropy. &lt;strong&gt;CodeDNA finds it.&lt;/strong&gt;&lt;/p&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;What It Does&lt;/h2&gt;
&lt;/div&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Maps your codebase history with Gemma 4&lt;/strong&gt; — up to 400 commits, preprocessed and compressed for maximum analytical signal. The preprocessor extracts monthly commit histograms and per-file change frequency before sending to the model, so insights are grounded in observable data.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Returns a structured archaeological report&lt;/strong&gt; — health score with transparent breakdown, milestone timeline (bug storms, refactors, pivots, feature bursts), and key metrics. Every claim cites a specific commit hash, date, or metadata value.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Streams Gemma 4's live reasoning&lt;/strong&gt; — watch the Thinking Mode trace in real-time as the model identifies causal patterns across years of history. Verifiable: the…&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/acchasujal/codeDNA" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;

&lt;/h2&gt;

&lt;h2&gt;
  
  
  The Problem It Solves
&lt;/h2&gt;

&lt;p&gt;You inherit a codebase. Something went wrong around late 2021 — you can feel it. Bug reports spiked, velocity dropped, the original authors left. The commit history has everything, but nobody has time to read 3,000 commits manually.&lt;/p&gt;

&lt;p&gt;Traditional tools give you graphs of commit frequency. That tells you &lt;em&gt;how much&lt;/em&gt; happened, not &lt;em&gt;what&lt;/em&gt; happened or &lt;em&gt;why&lt;/em&gt; one period was chaotic and another stable.&lt;/p&gt;

&lt;p&gt;CodeDNA uses Gemma 4's Thinking Mode to reason across your entire commit history and surface the narrative that was always there.&lt;/p&gt;




&lt;h2&gt;
  
  
  Live Demo
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidfdbehqzv48hetlre87.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidfdbehqzv48hetlre87.gif" alt="Demo of CodeDNA with React Hooks logs" width="560" height="329"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The live demo in action: CodeDNA processing the React repository’s architectural transition history.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Core Features
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Animated timeline&lt;/td&gt;
&lt;td&gt;Color-coded milestones — red = bug storm, yellow = refactor, green = pivot, blue = feature burst&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Health score + breakdown&lt;/td&gt;
&lt;td&gt;0–100 score with transparent factor table (not a black-box number)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Live Thinking Mode stream&lt;/td&gt;
&lt;td&gt;Watch Gemma 4 reason step-by-step as it analyzes your history&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Smart preprocessing&lt;/td&gt;
&lt;td&gt;Caps at 180 commits, extracts monthly histograms and file hotspots before inference&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multi-provider fallback&lt;/td&gt;
&lt;td&gt;Google AI Studio (26B → 31B) → OpenRouter (gemma-2-27b-it → gemma-3-12b-it → more)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Analysis caching&lt;/td&gt;
&lt;td&gt;Same git log = instant results on repeat runs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Markdown export&lt;/td&gt;
&lt;td&gt;Download a complete archaeological report&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Messy commit handling&lt;/td&gt;
&lt;td&gt;Detects vague history and gives honest, low-confidence analysis instead of hallucinating&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
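&lt;p&gt;The analysis caching row can be pictured with a small sketch. This is a guess at the idea, not CodeDNA's actual code: &lt;code&gt;cache_key&lt;/code&gt; and &lt;code&gt;cached_analysis&lt;/code&gt; are hypothetical names, the SHA-256 key and the whitespace normalization are assumptions, and only the in-memory half of the cache is shown.&lt;/p&gt;

```python
import hashlib

# In-memory cache only; a disk layer could be keyed by the same digest.
_cache = {}


def cache_key(log_text):
    """Hash the log after trimming trailing whitespace (assumed normalization)."""
    normalized = "\n".join(line.rstrip() for line in log_text.strip().splitlines())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def cached_analysis(log_text, analyze):
    """Run `analyze` once per distinct log and reuse the stored result afterwards."""
    key = cache_key(log_text)
    if key not in _cache:
        _cache[key] = analyze(log_text)
    return _cache[key]
```

&lt;p&gt;Keying on the content rather than a file name means a pasted log and an uploaded .txt with the same text would share one cache entry.&lt;/p&gt;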


&lt;h3&gt;
  
  
  Screenshots
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ryjhbp04po1bsxurl0q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1ryjhbp04po1bsxurl0q.png" alt="CodeDNA — Animated timeline showing React Hooks era milestones. Red bug storm card visible for Feb 2019." width="800" height="246"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The timeline builds milestone by milestone. Red = bug storm, yellow = refactor, green = pivot.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh67cc2a0p8ujw88dlw1u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fh67cc2a0p8ujw88dlw1u.png" alt="CodeDNA — Health Score breakdown showing factor-by-factor justification" width="800" height="288"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Health Score is never a black-box number. Every factor cites commit evidence.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15a7tymhrtw6fl1q1ege.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F15a7tymhrtw6fl1q1ege.png" alt="CodeDNA — Live Thinking Mode stream showing Gemma 4 reasoning in real time" width="458" height="434"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The reasoning panel shows Gemma 4's step-by-step analysis as it happens. This is Thinking Mode — not post-hoc summarization.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Architecture
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git log --stat (your paste or .txt upload)
        ↓
preprocessor.py
  → parse commits, build monthly histogram, extract file hotspots
  → metadata header injected: MONTHLY_COUNTS, TOP_CHANGED_FILES, BUG_FIX_RATIO
        ↓
Step 1: Reasoning Stream (REASONING_SYSTEM_PROMPT)
  → Gemma 4 Thinking Mode streams clean markdown report
  → Visible live in right panel
        ↓
Step 2: JSON Structuring (JSON_SYSTEM_PROMPT)
  → Separate Gemma call converts reasoning → typed AnalysisResult JSON
  → Pydantic v2 validates schema
        ↓
React UI
  → Health Score ring + breakdown table (center, always visible)
  → Animated vertical timeline (left)
  → Live reasoning stream (right)
  → Markdown export
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Map-reduce design:&lt;/strong&gt; Splitting reasoning (Step 1) from JSON structuring (Step 2) keeps the Thinking Mode output as clean prose, free of schema-enforcement constraints. Both the reasoning text and the structured timeline data improved once the two steps were separated.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stack:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Backend: FastAPI + httpx async + SSE streaming&lt;/li&gt;
&lt;li&gt;Frontend: React 18 + Vite + Tailwind CSS&lt;/li&gt;
&lt;li&gt;LLM: Gemma 4 via Google AI Studio (primary) + OpenRouter (fallback)&lt;/li&gt;
&lt;li&gt;State: In-memory + disk cache (no database)&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  Why Gemma 4 — Not "Just Any LLM"
&lt;/h2&gt;

&lt;p&gt;This is the most important section for me to get right.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Thinking Mode for causal chain reasoning — not summarization&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Standard completion models count keywords. Gemma 4's Thinking Mode traces &lt;em&gt;why&lt;/em&gt; patterns emerged. When it sees 14 "fix" commits targeting &lt;code&gt;ReactFiberHooks.js&lt;/code&gt; in a 3-week window after a large API change, it connects them causally — it doesn't just report a spike.&lt;/p&gt;

&lt;p&gt;The live reasoning stream in the UI makes this directly observable. Judges (and users) can watch Gemma's chain-of-thought in real time. This is the intentional use criterion — not decorative AI, but AI whose reasoning process is the deliverable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. 128K context — the archaeology window&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;180 commits × ~200 tokens each ≈ 36K tokens of compressed history in one request, well inside the 128K window. No chunking, no context loss, no multi-call stitching. Gemma 4 holds the full narrative arc in one reasoning window, which is what lets it connect events that sit months apart (e.g., a March 2019 API change causing a June 2019 bug cluster).&lt;/p&gt;
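&lt;p&gt;The budget arithmetic can be written down as a quick check. The ~200 tokens per commit figure comes from the estimate above; treating 128K as 128,000 tokens and reserving 8,000 tokens for the prompt and the response are assumptions made only for this sketch, and the function names are hypothetical.&lt;/p&gt;

```python
def estimated_prompt_tokens(commits, tokens_per_commit=200):
    """Rough size of the compressed history, using the per-commit estimate."""
    return commits * tokens_per_commit


def fits_in_window(commits, window=128_000, reserved=8_000, tokens_per_commit=200):
    """True when the history fits after setting aside the reserved tokens."""
    return window - reserved >= estimated_prompt_tokens(commits, tokens_per_commit)
```

&lt;p&gt;With these assumptions, 180 commits come to 36,000 tokens, while a raw 3,000-commit history would not fit in a single request.&lt;/p&gt;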

&lt;p&gt;&lt;strong&gt;3. Structured output drives the UI deterministically&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The JSON schema is strict (Pydantic v2 validated). If Gemma returns valid JSON, the timeline renders. If not, the error is surfaced honestly. No post-processing guesswork.&lt;/p&gt;
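&lt;p&gt;A rough picture of what strict validation means here, using standard-library dataclasses as a stand-in for the Pydantic v2 models so the snippet runs without dependencies. The field names are hypothetical; only the milestone types and confidence levels are taken from this post.&lt;/p&gt;

```python
import json
from dataclasses import dataclass

ALLOWED_TYPES = {"bug_storm", "refactor", "pivot", "feature_burst", "stability"}
ALLOWED_CONFIDENCE = {"high", "medium", "low"}


@dataclass(frozen=True)
class Milestone:
    month: str
    type: str
    confidence: str
    evidence: str

    def __post_init__(self):
        if self.type not in ALLOWED_TYPES:
            raise ValueError(f"unknown milestone type: {self.type!r}")
        if self.confidence not in ALLOWED_CONFIDENCE:
            raise ValueError(f"unknown confidence: {self.confidence!r}")


@dataclass(frozen=True)
class AnalysisResult:
    health_score: int
    milestones: tuple

    def __post_init__(self):
        if self.health_score not in range(0, 101):
            raise ValueError("health_score must be a whole number from 0 to 100")


def parse_analysis(raw_json):
    """Parse model output; invalid JSON or values raise instead of being patched up."""
    data = json.loads(raw_json)
    milestones = tuple(Milestone(**m) for m in data["milestones"])
    return AnalysisResult(health_score=data["health_score"], milestones=milestones)
```

&lt;p&gt;The point of failing loudly is that the UI either renders validated data or shows the error, with no repair step in between.&lt;/p&gt;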

&lt;p&gt;&lt;strong&gt;4. Privacy-first by design&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Git history contains proprietary code, unreleased feature names, security patches, and competitive intelligence in commit messages. CodeDNA runs locally and sends every request under your own API key, and it stores nothing beyond its local cache (there is no database). This is not a UX choice: it's the only architecture engineering teams will actually trust with real repositories.&lt;/p&gt;


&lt;h2&gt;
  
  
  Demo: React Hooks Era (2018–2019)
&lt;/h2&gt;

&lt;p&gt;I ran CodeDNA on React's public git history during the Hooks transition — one of the most architecturally significant periods in any major open-source project.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Gemma 4 found:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;2018-07:&lt;/strong&gt; Feature burst — Scheduler time-slicing and Fiber pool infrastructure added (5 commits, &lt;code&gt;Scheduler.js&lt;/code&gt; dominant)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2018-09–10:&lt;/strong&gt; Pivot — &lt;code&gt;React.lazy&lt;/code&gt;, &lt;code&gt;Suspense&lt;/code&gt;, and &lt;code&gt;createContext v2&lt;/code&gt; introduced across 6 commits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2019-01–02:&lt;/strong&gt; Stability → Bug storm — 4 rapid fixes for &lt;code&gt;useRef&lt;/code&gt; and &lt;code&gt;useEffect&lt;/code&gt; infinite loops following the 16.8.0 release&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;2019-05:&lt;/strong&gt; Feature burst — &lt;code&gt;useTransition&lt;/code&gt;, &lt;code&gt;useDeferredValue&lt;/code&gt;, &lt;code&gt;unstable_createRoot&lt;/code&gt; (5 commits, &lt;code&gt;ReactFiberHooks.js&lt;/code&gt; dominant)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Health score: 58/100&lt;/strong&gt; — justified by 21% bug-fix ratio, two high-severity bug storms in 2019-01 and 2019-02, partially offset by clear feature burst eras and high commit message quality (83% of commits have descriptive messages ≥8 words).&lt;/p&gt;


&lt;h2&gt;
  
  
  Quick Start
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Clone&lt;/span&gt;
git clone https://github.com/acchasujal/codeDNA.git
&lt;span class="nb"&gt;cd &lt;/span&gt;codeDNA

&lt;span class="c"&gt;# Backend&lt;/span&gt;
&lt;span class="nb"&gt;cd &lt;/span&gt;backend
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;span class="nb"&gt;cp&lt;/span&gt; .env.example .env
&lt;span class="c"&gt;# Add your Google AI Studio key as GEMINI_API_KEY&lt;/span&gt;
uvicorn main:app &lt;span class="nt"&gt;--reload&lt;/span&gt;

&lt;span class="c"&gt;# Frontend (new terminal)&lt;/span&gt;
&lt;span class="nb"&gt;cd&lt;/span&gt; ../frontend
npm &lt;span class="nb"&gt;install
&lt;/span&gt;npm run dev
&lt;span class="c"&gt;# Opens http://localhost:5173&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;Get your git log:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Any repo you have locally:&lt;/span&gt;
git log &lt;span class="nt"&gt;--stat&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-3000&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; my_history.txt
&lt;span class="c"&gt;# Upload the .txt file or paste directly&lt;/span&gt;

&lt;span class="c"&gt;# React demo (what the screenshots use):&lt;/span&gt;
git clone https://github.com/facebook/react
&lt;span class="nb"&gt;cd &lt;/span&gt;react
git log &lt;span class="nt"&gt;--stat&lt;/span&gt; &lt;span class="nt"&gt;--after&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2018-09-01"&lt;/span&gt; &lt;span class="nt"&gt;--before&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"2019-06-01"&lt;/span&gt; | &lt;span class="nb"&gt;head&lt;/span&gt; &lt;span class="nt"&gt;-3000&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; react_hooks.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;&lt;strong&gt;.env.example:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GEMINI_API_KEY=your_google_ai_studio_key_here
GEMMA_MODEL=models/gemma-4-26b-a4b-it
MAX_COMMITS=180
OPENROUTER_API_KEY=optional_for_fallback
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Technical Highlights
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Multi-provider fallback chain&lt;/strong&gt; — At startup, CodeDNA queries the OpenRouter API to dynamically discover available Gemma models and builds a priority chain. Google AI Studio is primary; OpenRouter provides up to 9 additional Gemma models as fallback. The chain is logged at startup so you always know what's running.&lt;/p&gt;
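&lt;p&gt;The fallback behaviour can be sketched as a loop over an ordered list of providers. This is an illustration under assumptions, not the project's code: &lt;code&gt;run_with_fallback&lt;/code&gt; is a hypothetical name, each provider is represented by a plain callable, and the dynamic OpenRouter model discovery is left out.&lt;/p&gt;

```python
def run_with_fallback(chain, prompt):
    """Try each (name, call) pair in priority order and return the first success."""
    errors = []
    for name, call in chain:
        try:
            return name, call(prompt)
        except Exception as exc:  # any provider failure moves on to the next entry
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))
```

&lt;p&gt;Returning the provider name alongside the result makes it easy to log which model actually answered.&lt;/p&gt;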

&lt;p&gt;&lt;strong&gt;Preprocessor intelligence&lt;/strong&gt; — Before any model call, the preprocessor extracts a &lt;code&gt;MONTHLY_COMMIT_COUNTS&lt;/code&gt; histogram and &lt;code&gt;TOP_CHANGED_FILES&lt;/code&gt; list from the raw git log. This ground-truth metadata is injected directly into the prompt, so Gemma cites real numbers ("commit count tripled to 47 in March 2019") rather than inferring from prose.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Anti-fluff enforcement&lt;/strong&gt; — The system prompt contains an explicit &lt;code&gt;FORBIDDEN_PHRASES&lt;/code&gt; list (&lt;code&gt;"technical debt"&lt;/code&gt;, &lt;code&gt;"the team"&lt;/code&gt;, &lt;code&gt;"seems like"&lt;/code&gt;, &lt;code&gt;"likely indicates"&lt;/code&gt;, and 12 others). Every insight must cite a specific commit hash, date, file name, or count — or say "insufficient evidence."&lt;/p&gt;
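&lt;p&gt;In CodeDNA these rules live in the system prompt. The same two rules could also be checked after the fact; the sketch below is an assumption about how such a check might look, not something the project is known to ship. It uses only the four phrases named above, and the evidence patterns (commit hash, YYYY-MM date, file name) are rough heuristics.&lt;/p&gt;

```python
import re

# The four phrases named in this post; the full list in the prompt is longer.
FORBIDDEN = ("technical debt", "the team", "seems like", "likely indicates")

EVIDENCE_RE = re.compile(
    r"\b[0-9a-f]{7,40}\b"                # short or full commit hash
    r"|\b\d{4}-\d{2}\b"                  # YYYY-MM, also matches inside YYYY-MM-DD
    r"|\b[\w/-]{2,}\.[A-Za-z]{1,5}\b"    # something shaped like a file name
)


def lint_sentence(sentence):
    """Return a list of rule violations for one sentence of model output."""
    lowered = sentence.lower()
    problems = [f"forbidden phrase: {p}" for p in FORBIDDEN if p in lowered]
    if not EVIDENCE_RE.search(sentence) and "insufficient evidence" not in lowered:
        problems.append("no commit hash, date, or file name cited")
    return problems
```

&lt;p&gt;A sentence such as "There were many fixes in early 2019." is flagged by this check, because a bare year is not treated as evidence.&lt;/p&gt;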

&lt;p&gt;&lt;strong&gt;Honest confidence&lt;/strong&gt; — Every milestone includes a &lt;code&gt;confidence&lt;/code&gt; field (&lt;code&gt;high | medium | low&lt;/code&gt;) with a justification sentence. Low-quality commit histories get a &lt;code&gt;QUALITY_WARNING&lt;/code&gt; header and produce conservative, clearly-labeled micro-analyses rather than dramatic fabrications.&lt;/p&gt;


&lt;h2&gt;
  
  
  The Reasoning System Prompt
&lt;/h2&gt;

&lt;p&gt;The full prompt that drives Step 1 (the reasoning stream):&lt;/p&gt;

&lt;p&gt;
  See the REASONING_SYSTEM_PROMPT
  &lt;br&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are CodeDNA, a concise git-history analyst.
Produce a clean public report, not private reasoning.

Rules:
- Output markdown prose only. No JSON. No code fences.
- No meta-commentary, self-correction, planning notes, or internal monologue.
- Never write "wait", "I used", "the prompt says", or any phrase from this
  forbidden list: technical debt, the team, engineers, developers, working hard,
  prioritized, decided to, management, business logic, seems like, appears to,
  it looks like, likely indicates, possibly, perhaps, might have.
- Use only observable evidence from the metadata header and commit log.
- Cite commit hashes, dates/months, file names, commit counts, and ratios
  whenever making a claim.
- If evidence is thin, say "insufficient evidence" and name the missing signal.
  Do not invent intent, people, architecture, risk, or causality.
- Keep every sentence useful. Avoid repetition.

Format exactly:
## Overview
Two to three factual sentences covering commit count, date range,
most changed files or file types, and BUG_FIX_RATIO.

## Milestones
Four to eight bullets when evidence allows. Each bullet:
- **YYYY-MM** - type - concise evidence sentence with commit hash(es),
  changed file(s), and count(s).
  Allowed types: bug_storm, refactor, pivot, feature_burst, stability.

## Health Signals
Three bullets: one positive signal, one negative signal, one confidence note.
Each bullet must cite evidence.

## Churn Summary
One concise sentence naming the peak period and the files or commits behind it.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;




&lt;/p&gt;


&lt;h2&gt;
  
  
  The Hardest Problem: Making Gemma Say Something Real
&lt;/h2&gt;

&lt;p&gt;The biggest technical challenge wasn't the UI, the SSE streaming, or the fallback chain. It was getting Gemma 4 to produce &lt;em&gt;specific, verifiable&lt;/em&gt; insights instead of confident-sounding nonsense.&lt;/p&gt;

&lt;p&gt;Here's what the first version produced on a repo with commits like &lt;code&gt;"fix navbar bug"&lt;/code&gt;, &lt;code&gt;"update readme"&lt;/code&gt;, &lt;code&gt;"refactor utils"&lt;/code&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"This period reflects a time of organizational growth and technical maturity. The team worked hard to address accumulated complexity while balancing feature delivery with stability concerns."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That output is useless. It contains zero commit references, zero file names, zero numbers. A junior consultant could have written it without looking at the code. A judge would mark it dead on arrival.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Three iterations to fix it.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 1 — Forbidden phrases list.&lt;/strong&gt;&lt;br&gt;
Added an explicit blocklist to the system prompt:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FORBIDDEN PHRASES — never use these:
"technical debt", "the team", "engineers", "developers",
"working hard", "prioritized", "decided to", "management",
"seems like", "appears to", "it looks like", "likely indicates",
"possibly", "perhaps", "might have"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;The output became less flowery but still vague: &lt;em&gt;"There were many fixes in early 2019."&lt;/em&gt; How many? Which files? Which period exactly?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 2 — Mandatory evidence citation.&lt;/strong&gt;&lt;br&gt;
Added to the prompt: &lt;em&gt;"Every milestone description must cite at least one commit hash, date/month, file name, count, or ratio. If you cannot cite evidence, write 'insufficient evidence' and stop."&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Better, but Gemma was still counting commits itself — and sometimes miscounting.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration 3 — Pre-computed metadata injection (the breakthrough).&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of asking Gemma to figure out what happened, I tell it what happened and ask it to &lt;em&gt;interpret&lt;/em&gt; it.&lt;/p&gt;

&lt;p&gt;The preprocessor now builds a metadata header before any model call:&lt;br&gt;
&lt;/p&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# META: 180tot|180ana|Q:HIGH|Fx:21%|Vg:0%
# DATES: 2019-06-20..2018-07-02
# MONTHS: 2018-09:3,2018-10:3,2019-01:4,2019-02:2,2019-05:5,2019-06:2
# HOTSPOTS: ReactFiberHooks.js:8,Scheduler.js:5,package.json:4
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Now instead of asking &lt;em&gt;"were there a lot of fixes in early 2019?"&lt;/em&gt;, I'm asking &lt;em&gt;"given that commits spiked to 5 in 2019-05 and ReactFiberHooks.js was modified 8 times — what does that pattern indicate?"&lt;/em&gt;&lt;/p&gt;
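&lt;p&gt;For illustration, a header in the format shown above could be assembled like this. The function name and parameters are hypothetical, and reading &lt;code&gt;Fx&lt;/code&gt; as the bug-fix percentage and &lt;code&gt;Vg&lt;/code&gt; as the share of vague commit messages is my interpretation of the abbreviations, not something confirmed by the code.&lt;/p&gt;

```python
def build_header(total, analyzed, quality, fix_pct, vague_pct, newest, oldest, months, hotspots):
    """Render precomputed statistics as the compact comment-style header."""
    lines = [
        f"# META: {total}tot|{analyzed}ana|Q:{quality}|Fx:{fix_pct}%|Vg:{vague_pct}%",
        f"# DATES: {newest}..{oldest}",
        "# MONTHS: " + ",".join(f"{month}:{count}" for month, count in months),
        "# HOTSPOTS: " + ",".join(f"{name}:{count}" for name, count in hotspots),
    ]
    return "\n".join(lines)
```

&lt;p&gt;The terse key names keep the header to a few dozen tokens, which leaves the context budget for the commits themselves.&lt;/p&gt;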

&lt;p&gt;The model's job shifted from counting to interpreting. The output became:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"2019-01 through 2019-02 saw 6 commits (bf32345, ca53456, cb54567, cc55678, cd56789, ce57890) concentrated in ReactFiberHooks.js and ReactFiberBeginWork.js. ca53456 fixed an incorrect useRef identity across re-renders; cb54567 resolved an infinite useEffect loop triggered by object dependency comparison. The 16.8.0 release on 2019-02-06 (cd56789) was followed two days later by ce57890 — a hooks state regression fix, indicating at least one edge case reached production."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Every claim is checkable. Every hash is real. That's the difference.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The map-reduce split was the second breakthrough.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Asking Gemma 4 to simultaneously produce flowing Thinking Mode prose &lt;em&gt;and&lt;/em&gt; valid JSON produces neither well. I split it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Step 1 (stream):&lt;/strong&gt; REASONING_SYSTEM_PROMPT produces clean markdown only, with no JSON and no schema constraints&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step 2 (analyze):&lt;/strong&gt; JSON_SYSTEM_PROMPT reads the reasoning trace and outputs strict AnalysisResult JSON&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The reasoning panel now shows actual analytical prose. The timeline data is reliably structured. Both improved dramatically when separated.&lt;/p&gt;
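&lt;p&gt;The two-step split can be sketched as two sequential model calls. This is a simplified outline under assumptions: &lt;code&gt;run_pipeline&lt;/code&gt; and &lt;code&gt;call_model&lt;/code&gt; are hypothetical names, the prompt strings are placeholders, and streaming is omitted.&lt;/p&gt;

```python
import json

# Placeholder prompt text; the real prompts are defined in the CodeDNA repo.
REASONING_SYSTEM_PROMPT = "Write a markdown report. No JSON."
JSON_SYSTEM_PROMPT = "Convert the report into AnalysisResult JSON."


def run_pipeline(preprocessed_log, call_model):
    """Step 1 produces prose; Step 2 turns that prose into structured data."""
    report = call_model(REASONING_SYSTEM_PROMPT, preprocessed_log)
    raw_json = call_model(JSON_SYSTEM_PROMPT, report)
    return report, json.loads(raw_json)
```

&lt;p&gt;Because Step 2 only sees the finished report, the first call never has to juggle prose and schema constraints at the same time.&lt;/p&gt;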


&lt;h2&gt;
  
  
  Limitations (Honest)
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Works best with 100–200 commits. Very large histories (1000+) need more aggressive preprocessing.&lt;/li&gt;
&lt;li&gt;Commit message quality determines insight quality. A repo full of &lt;code&gt;"fix"&lt;/code&gt;, &lt;code&gt;"wip"&lt;/code&gt;, &lt;code&gt;"update"&lt;/code&gt; commits will produce low-confidence analysis (CodeDNA tells you this clearly rather than inventing drama).&lt;/li&gt;
&lt;li&gt;The reasoning stream uses the primary model; fallback models handle JSON structuring. If all Google models are slow, the stream may be empty — but the timeline will still render from the fallback result.&lt;/li&gt;
&lt;li&gt;Currently runs locally only. Cloud deployment would require careful handling of API key security.&lt;/li&gt;
&lt;/ul&gt;


&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Actual GitHub API integration (analyze any public repo by URL, no manual log export)&lt;/li&gt;
&lt;li&gt;Branch comparison (main vs. feature branch health)&lt;/li&gt;
&lt;li&gt;Team velocity metrics (authors per period, bus factor analysis)&lt;/li&gt;
&lt;li&gt;CI/CD integration — run CodeDNA as a PR check to flag risky commit patterns&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;&lt;em&gt;Built solo in 4 days for the Google Gemma 4 Challenge. Every commit in this repo is real — you can run CodeDNA on its own history.&lt;/em&gt;&lt;/p&gt;






</description>
      <category>devchallenge</category>
      <category>gemmachallenge</category>
      <category>gemma</category>
      <category>webdev</category>
    </item>
  </channel>
</rss>
