Apoorv Gupta

Building CarbonSaathi: A Visible-Reasoning Carbon Companion for Indian Metro Professionals

TL;DR

I built CarbonSaathi in roughly 60 hours for PromptWars Challenge 3: a carbon footprint companion for Indian metro professionals that logs activities in plain English and surfaces AI-generated insights. The differentiator is that the reasoning behind each insight and recommendation is streamed live to the UI via Server-Sent Events while the agents are still generating — not appended after the fact. Stack: FastAPI + Gemini 2.5 Flash/Pro + Firestore + Cloud Run, built with Claude Code via GitHub Copilot. Source on GitHub.


The problem (and why "awareness" isn't "tracking")

The challenge brief reads: "Build an application that helps people track and reduce their everyday carbon footprint through simple actions and personalized insights."

The word that matters in that brief is personalized insights, not track. Most carbon apps already track. They have a chart, maybe a calculator, a tally of your CO₂e by category. What they lack is a reason to trust the number and a specific action to take because of it.

The persona I designed for is Riya or Rahul, 28, a software engineer in a Tier-1 Indian metro — Bengaluru, Mumbai, Pune, Hyderabad, Delhi NCR. They pay their own electricity bill, commute by metro some days and Uber others, sometimes work from home. They're vaguely climate-aware but track nothing today, and wouldn't open a dedicated app to do it. The design follows from this: low-friction logging (plain English, not forms), no guilt-tripping copy, specific actionable advice, and Indian context everywhere.

The framing that shaped the architecture: an insight that says "your transport is 71% of this week's footprint" is a tracker. An insight where I can see how the AI bucketed 14 days of activities, what pattern it flagged, and why it proposed switching from a cab to metro — that's awareness. The difference is whether the reasoning is visible.

Almost every submission in a category like this shows the final output. I decided early to surface the reasoning that produced it. That decision drove every architectural choice that followed.


The differentiator: visible agent reasoning streamed live

The insights endpoint returns Server-Sent Events. Here's what it actually looks like:

curl -N -H "Accept: text/event-stream" \
     -H "Authorization: Bearer $TOKEN" \
     http://localhost:8080/api/insights/stream
event: phase_start
data: {"event":"phase_start","phase":"analyst"}

event: reasoning
data: {"event":"reasoning","phase":"analyst","step":"Bucketed 14 days into this_week (8 activities) and last_week (6)."}

event: reasoning
data: {"event":"reasoning","phase":"analyst","step":"Transport is 71% of this week's 19.8 kg CO2e; weekday cab rides dominate."}

event: phase_complete
data: {"event":"phase_complete","phase":"analyst","status":"success","reason":null}

event: phase_start
data: {"event":"phase_start","phase":"coach"}

event: reasoning
data: {"event":"reasoning","phase":"coach","step":"Largest controllable bucket: 8 km weekday cab commute."}

event: reasoning
data: {"event":"reasoning","phase":"coach","step":"emission_service: petrol cab 0.170 vs metro 0.031 kg/km. Computed saving = 0.139 x 8 km x 2 x 5 = 11.1 kg/week."}

event: phase_complete
data: {"event":"phase_complete","phase":"coach","status":"success","reason":null}

event: done
data: {"event":"done","insights":[...],"recommendations":[...],"analyst_status":"success","coach_status":"success"}

The reasoning text is model-generated and varies per run. The protocol structure (phase_start → reasoning → phase_complete → done) is fixed by the orchestrator. There is an 80 ms inter-event pacing on the SSE path only; a JSON-only Accept header gets a consolidated payload with no pacing.
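To make the wire format concrete, here is a minimal sketch of how one frame in the transcript above could be serialised. The helper name format_sse is hypothetical and not taken from the project; it only illustrates the event line, data line, and blank-line terminator that each frame uses.

```python
import json


def format_sse(event: str, payload: dict) -> str:
    """Serialise one event as an SSE frame: an event line, a data line, a blank line."""
    # Compact separators reproduce the JSON spacing seen in the transcript.
    body = json.dumps({"event": event, **payload}, separators=(",", ":"))
    return f"event: {event}\ndata: {body}\n\n"


if __name__ == "__main__":
    print(format_sse("phase_start", {"phase": "analyst"}), end="")
    print(format_sse("phase_complete",
                     {"phase": "analyst", "status": "success", "reason": None}), end="")
```

The event name is repeated inside the JSON body, as in the transcript, so a client that only reads data lines can still dispatch on it.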

[Screenshot: Insights view streaming Analyst and Coach reasoning steps mid-generation]

Each reasoning step and the final insight/recommendation are persisted to Firestore with an agentReasoning field. The UI renders the live stream during generation and falls back to the persisted trace on subsequent views — so it's auditable, not just a visual effect.

Notice the Coach's arithmetic in the stream above: petrol cab 0.170 vs metro 0.031 kg/km. Those numbers come from a local factor table, not from the model's output. The model never sets a carbon number — it proposes the swap, and the agent validates and computes the saving from the emission service. This invariant is one of the most important design decisions in the project, and I'll explain how it emerged in the prompts section.


Architecture

Three sequential AI agents behind an async FastAPI service on Cloud Run. All emission arithmetic runs locally from cited Indian factor data; every model output is validated before it's trusted.

flowchart TD
    U["User browser<br/>Tailwind + vanilla JS"]
    FB["Firebase Auth<br/>Google Sign-In"]
    SM["Secret Manager<br/>gemini-api-key · firebase-api-key"]
    FS[("Firestore<br/>users / activities /<br/>insights / recommendations")]

    U -->|"Google Sign-In"| FB
    FB -->|"ID token"| U
    U -->|"Bearer ID token"| AUTH

    subgraph CR["FastAPI service · Cloud Run · asia-south1"]
        direction TB
        AUTH["verify_firebase_token<br/>uniform 401"]
        R["/api routes"]
        LOG["Logger Agent<br/>Gemini 2.5 Flash"]
        ORCH["Insight Orchestrator"]
        AN["Analyst Agent<br/>Gemini 2.5 Pro"]
        CO["Coach Agent<br/>Gemini 2.5 Flash"]
        EM["Emission Service<br/>grid · transport · food factors"]
        AUTH --> R
        R --> LOG
        R --> ORCH
        ORCH --> AN
        ORCH --> CO
        LOG --> EM
        CO --> EM
    end

    LOG --> FS
    ORCH --> FS
    SM -.->|"runtime env vars"| CR
    ORCH ==>|"SSE: phase_start · reasoning ·<br/>phase_complete · done"| U

The Firestore data model carries agentReasoning on every user-facing document — it's what powers the "show your work" UI:

users/{uid}
  email, displayName, state, homeProfile{ bhk, hasAC, fridgeClass, dietary }
  onboardingComplete, createdAt, lastActive
  ├── activities/{id}      type, rawInput, structuredData, emissionKgCo2e,
  │                        confidence, emissionFactorSource, agentReasoning
  ├── insights/{id}        type, title, description, supportingActivities,
  │                        agentReasoning
  ├── recommendations/{id} type, title, expectedSavingKg, difficulty,
  │                        accepted, agentReasoning
  └── state/generation     analystStatus, coachStatus, lastCompletedAt

All user-facing time aggregations — "today", "this week", the activity streak — are computed in IST (Asia/Kolkata) at read time. Timestamps are stored UTC. The streak uses a Duolingo-style same-day grace: if today has no activity yet, the streak counts backward from yesterday so it doesn't read as broken before you've logged.
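The streak rule can be sketched as follows. This is an illustration under assumptions, not the project's code: the function name current_streak is hypothetical, and IST is modelled as a fixed UTC+05:30 offset (India observes no daylight saving) rather than the Asia/Kolkata zone database entry.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed offset stands in for Asia/Kolkata, which has no DST.
IST = timezone(timedelta(hours=5, minutes=30))


def current_streak(timestamps_utc: list[datetime], now_utc: datetime) -> int:
    """Count consecutive IST calendar days with at least one logged activity."""
    active_days = {ts.astimezone(IST).date() for ts in timestamps_utc}
    day = now_utc.astimezone(IST).date()
    if day not in active_days:
        # Same-day grace: nothing logged yet today, so start counting from yesterday.
        day -= timedelta(days=1)
    streak = 0
    while day in active_days:
        streak += 1
        day -= timedelta(days=1)
    return streak


if __name__ == "__main__":
    utc = timezone.utc
    logs = [
        datetime(2026, 6, 19, 10, 0, tzinfo=utc),  # 15:30 IST on 19 June
        datetime(2026, 6, 19, 19, 0, tzinfo=utc),  # 00:30 IST on 20 June
    ]
    print(current_streak(logs, datetime(2026, 6, 20, 20, 0, tzinfo=utc)))
```

Both sample timestamps fall on the same UTC date but on two different IST dates, which is why the conversion happens before the day comparison.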

[Screenshot: CarbonSaathi dashboard showing today's footprint, 7-day breakdown, and activity streak]


The agent pipeline

flowchart TD
    START(["GET /api/insights/stream"]) --> STALE{"is_pipeline_stale?"}
    STALE -->|"No — cache fresh"| CACHED["2x phase_complete status=cached<br/>+ done · zero agent calls"]
    STALE -->|"Yes"| A1["emit phase_start: analyst"]
    A1 --> AN["Analyst · Gemini Pro<br/>buckets 14d activity by week"]
    AN --> AOUT{"analyst status"}
    AOUT -->|"empty / failed"| ASKIP["phase_complete: analyst<br/>coach skipped → done"]
    AOUT -->|"success"| AR["stream reasoning steps<br/>persist insights"]
    AR --> C1["emit phase_start: coach"]
    C1 --> CO["Coach · Gemini Flash<br/>proposes swaps"]
    CO --> CC["validate saving_basis ·<br/>compute kg from emission_service"]
    CC --> CR2["stream reasoning steps<br/>persist recommendations"]
    CR2 --> DONE(["done: insights + recommendations"])
The three agents and their jobs:

| Agent | Model | Job |
| --- | --- | --- |
| Logger | Gemini 2.5 Flash | Parse free-text input into a typed activity via function calling |
| Analyst | Gemini 2.5 Pro | Find patterns, trends, and milestones over a 14-day activity window |
| Coach | Gemini 2.5 Flash | Propose swap / reduce / challenge recommendations |

Three design rules underpin the pipeline:

The Coach computes savings; it never trusts the model for a number. The model returns a typed saving_basis — a discriminated union describing what swap to make. The agent validates that description against the emission factor table and computes expectedSavingKg locally. A model that says "save 3.2 kg/week" is often hallucinating; a model that says "swap petrol cab for metro, 8 km weekday commute" gives the agent everything it needs to compute the real number from real factors.

Every agent outcome is a typed discriminated union. LoggerOutcome, AnalystOutcome, CoachOutcome are all Annotated[Union[Success, Empty, Rejected, Failed], Field(discriminator="status")]. Expected failures — governance rejection, low data volume, malformed JSON from the model — are values, not exceptions. Routes pattern-match the status field to an HTTP response.

Staleness caching short-circuits the pipeline. If nothing has changed since the last run (IST-day aligned, with separate 10-minute empty-result TTLs for Analyst and Coach), the orchestrator returns cached results and calls zero agents. The cached path emits phase_complete(status="cached") events, so the UI response shape is consistent regardless of whether generation ran.


Built for India: the spec-alignment core

This is where most carbon apps fail the India spec. Every number is India-specific and source-cited.

Electricity — state grid factors (kg CO₂e/kWh, CEA CO₂ Baseline Database v19.0, 2023–24). The same kWh of electricity has very different carbon depending on your state's generation mix:

| State | Factor (kg CO₂e/kWh) | Note |
| --- | --- | --- |
| Sikkim | 0.38 | Hydro-dominant (Teesta cascade) |
| Kerala | 0.58 | Hydro + renewable mix |
| Karnataka / Tamil Nadu | 0.74 | Southern grid |
| Maharashtra | 0.79 | Western regional baseline |
| Delhi | 0.82 | |
| Bihar / West Bengal / Odisha | 0.96 | Eastern thermal grid |
| Jharkhand | 1.05 | Coal-belt, highest modelled |

All 28 states and 8 UTs are in the factor file. Users who enter their electricity bill in rupees get a conversion using AVG_INR_PER_KWH = 8.0 — a flat average that's coarser than real slab-based DISCOM tariffs, so any bill-derived activity is forced to confidence = "estimated" regardless of the grid factor's confidence level. The assumption is typed and documented, not buried in a comment.
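As a worked example of the bill conversion, the sketch below divides a rupee amount by the flat tariff and multiplies by the state grid factor. The function name emission_from_bill and the three-entry factor dictionary are hypothetical; the constant and the factor values are the ones quoted above.

```python
AVG_INR_PER_KWH = 8.0  # flat average tariff quoted in the article

# A small excerpt of the state factors listed above (kg CO2e per kWh).
GRID_FACTORS = {"Karnataka": 0.74, "Delhi": 0.82, "Jharkhand": 1.05}


def emission_from_bill(bill_inr: float, state: str) -> tuple[float, str]:
    """Convert a rupee bill to kg CO2e; bill-derived values are always 'estimated'."""
    if bill_inr < 0:
        raise ValueError("bill amount must be non-negative")
    if state not in GRID_FACTORS:
        raise ValueError(f"no grid factor for state: {state}")
    kwh = bill_inr / AVG_INR_PER_KWH
    kg_co2e = kwh * GRID_FACTORS[state]
    # Confidence is forced down because the flat tariff ignores slab pricing.
    return round(kg_co2e, 2), "estimated"


if __name__ == "__main__":
    # A 1600 rupee bill is 200 kWh at the flat rate.
    print(emission_from_bill(1600, "Karnataka"))
    print(emission_from_bill(1600, "Jharkhand"))
```

The same ₹1,600 bill works out to 148 kg CO₂e in Karnataka and 210 kg CO₂e in Jharkhand, which is the state-level spread the table is meant to capture.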

Transport (kg CO₂e/km — ICCT India, DMRC, India GHG Inventory):

| Mode | Factor | Mode | Factor |
| --- | --- | --- | --- |
| Metro | 0.031 | Auto-rickshaw (CNG) | 0.066 |
| Public bus | 0.039 | CNG taxi | 0.095 |
| Two-wheeler (EV) | 0.047 | Petrol taxi / cab | 0.170 |
| Two-wheeler (petrol) | 0.060 | Petrol car | 0.192 |

Walking and WFH are zero by definition. The Logger prompt explicitly recognises auto-rickshaw, metro, bus, Uber/Ola/Rapido, two-wheeler, and WFH — the transport vocabulary of urban India.

Food (kg CO₂e/serving — FAO Food Emissions Database + Indian dietary survey data; rice includes paddy-field methane via IRRI):

| Item | Factor | Item | Factor |
| --- | --- | --- | --- |
| Veg thali | 0.90 | Egg (1) | 0.25 |
| Chicken meal | 2.10 | Dal serving | 0.35 |
| Mutton (goat) meal | 4.50 | Rice serving | 0.43 |
| Fish meal | 1.20 | Dairy (250 ml) | 0.63 |

Mutton is goat, not sheep — the relevant Indian market context. Dietary categories (vegetarian / non-vegetarian / eggetarian) are shaped for Indian dietary patterns. Paneer appears separately from dairy.


Tools and selection rationale

Claude Code via GitHub Copilot. I used deliberate model rotation across the build. Sonnet 4.6 handled scaffolding-heavy phases (1A, 1C, 1D, 2, 3, 5A, 5B, 6, 7, 8, 9) where output volume mattered and specs were well-defined. Opus 4.8 handled high-stakes phases requiring structural reasoning (1B FastAPI core, 4A base agent + Logger, 4B Analyst + Coach, 5C SSE orchestration, 10 README polish). The logic: Opus costs more per token but makes fewer architectural errors on complex cross-file reasoning tasks where a wrong early decision compounds across the entire session.

Gemini 2.5 Flash for the Logger: function calling to parse free-text into a typed activity. Flash is fast enough for this use case and meaningfully cheaper than Pro at logging frequency. The Coach agent also runs on Flash — originally specified as Pro, but Flash was retained after Phase 9 end-to-end validation showed acceptable recommendation quality. More on this in the war stories section.

Gemini 2.5 Pro for the Analyst: pattern detection across a 14-day activity window is the one task where the quality difference between Flash and Pro is consistent enough to be worth the cost. The Analyst receives pre-bucketed activity data (this_week, last_week, earlier) already grouped in Python — date math is not delegated to the model.
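Pre-bucketing in Python might look like the sketch below. The bucket names come from the article; everything else is an assumption made for illustration, including the function name, the dictionary shape of an activity, and the choice of rolling 7-day windows measured in IST calendar days.

```python
from datetime import datetime, timedelta, timezone

# Assumption: a fixed UTC+05:30 offset stands in for Asia/Kolkata.
IST = timezone(timedelta(hours=5, minutes=30))


def bucket_activities(activities: list[dict], now_utc: datetime) -> dict[str, list[dict]]:
    """Group activities by age in IST days so the model never does date math."""
    today = now_utc.astimezone(IST).date()
    buckets: dict[str, list[dict]] = {"this_week": [], "last_week": [], "earlier": []}
    for activity in activities:
        logged_on = activity["timestamp"].astimezone(IST).date()
        age_days = (today - logged_on).days
        if age_days < 7:
            buckets["this_week"].append(activity)
        elif age_days < 14:
            buckets["last_week"].append(activity)
        else:
            buckets["earlier"].append(activity)
    return buckets
```

The prompt then receives three labelled lists, and the only thing left for the model is comparison and pattern-finding.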

FastAPI + Pydantic v2. Async throughout, which matters when a single request path involves multiple Firestore reads and a Gemini call. Frozen Pydantic models for immutable domain objects. Discriminated unions for agent outcomes — the pattern that makes failure handling exhaustive rather than defensive.

Firestore on the Spark free tier. Integrates natively with Firebase Auth (same project, same uid namespace), handles demo traffic at zero cost, and has no server to manage. The constraint is real: it's sized for the demo, not for scale.

Cloud Run in asia-south1. Lower latency for Indian users than any US region. min-instances=1 keeps a warm instance to avoid cold-start latency during the submission window.

Firebase Authentication with Google Sign-In. The challenge required persistent user data. Firebase handles identity without a passwords database or JWT signing infrastructure.

Tailwind via CDN + vanilla ES modules, no build step. The original spec had HTMX for progressive enhancement. HTMX is designed for hypermedia APIs that return HTML fragments — it expects hx-get/hx-post targets to respond with rendered HTML, not JSON. Every API endpoint in this app is a JSON API. Replacing HTMX with vanilla fetch() + ES modules added hand-rolled client code but was the only design that worked cleanly with both the JSON API and the SSE stream that requires a Bearer header.


How the prompts evolved

The numbered-plan gate

Early in the build, describing a feature to Copilot and pressing Enter would generate 8 files in one shot. Some of what it built was right; some was premature scaffolding for features two phases away. After Phase 1A, every prompt was given a mandatory final instruction: "Output a numbered plan of what you will build and STOP. Do not write any files until I confirm."

The plan gate had a higher return on effort than any other single change I made to the prompting workflow. It paid off in two specific cases. In one, the plan revealed that Copilot intended to put business logic inside the route handler rather than the agent (caught before a line was written). In the other, it exposed a Firestore schema design that would have required a migration in Phase 5. Both were fixed with a clarification at the plan stage, not a refactor.

The Coach-computes-savings invariant

The original Coach prompt asked the model to propose recommendations including expected kg savings. Testing the first implementation immediately revealed the problem: the model confidently returned specific numbers — "switching to metro saves 3.2 kg/week" — that had no relationship to the actual emission factors. The numbers varied between runs and were sometimes off by an order of magnitude.

The fix was architectural, not just prompt-level. I changed the Coach's response schema to a saving_basis discriminated union: either a transport swap (specify modes and distance), an electricity reduction (specify kWh or hours), or a food swap (specify categories and frequency). The model is never asked for a number. The agent receives the saving_basis, validates it against EmissionService, and computes expectedSavingKg from the real factors. The model shapes the description of a recommendation; it may not set its carbon impact.

This pattern generalises: if you need quantitative output from an agent, design a schema that has the model return typed parameters and compute the quantity locally. It removes hallucination from the number and concentrates model responsibility on the part it's actually good at — understanding what kind of action to recommend.
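A minimal sketch of the pattern for the transport case: the model supplies typed parameters, and the number is computed locally from the factor table. The class name TransportSwap, its field names, and the mode keys are hypothetical; the factors are the ones listed in the transport table, and the sample values reproduce the 11.1 kg/week figure from the stream transcript.

```python
from dataclasses import dataclass

# Factors from the article's transport table (kg CO2e per km); keys are illustrative.
TRANSPORT_FACTORS = {
    "metro": 0.031,
    "bus": 0.039,
    "cng_taxi": 0.095,
    "petrol_cab": 0.170,
    "petrol_car": 0.192,
}


@dataclass(frozen=True)
class TransportSwap:
    """What the model proposes: a description of the swap, never a kg figure."""
    from_mode: str
    to_mode: str
    km_per_trip: float
    trips_per_week: int


def weekly_saving_kg(basis: TransportSwap) -> float:
    """Validate the proposed swap, then compute the saving from local factors."""
    for mode in (basis.from_mode, basis.to_mode):
        if mode not in TRANSPORT_FACTORS:
            raise ValueError(f"unknown transport mode: {mode}")
    if basis.km_per_trip <= 0 or basis.trips_per_week <= 0:
        raise ValueError("distance and trip count must be positive")
    delta = TRANSPORT_FACTORS[basis.from_mode] - TRANSPORT_FACTORS[basis.to_mode]
    if delta <= 0:
        raise ValueError("proposed swap does not reduce emissions")
    return round(delta * basis.km_per_trip * basis.trips_per_week, 1)


if __name__ == "__main__":
    # 8 km each way, twice a day, five days a week.
    print(weekly_saving_kg(TransportSwap("petrol_cab", "metro", 8.0, 10)))
```

A proposal that names an unknown mode or a swap in the wrong direction is rejected before any number reaches the user.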

The 307-before-auth information leak

During Phase 5B security review, I tested what happens when an unauthenticated client calls /api/activities/ (with trailing slash) instead of /api/activities. FastAPI's default redirect_slashes=True issues a 307 redirect to the canonical path. That redirect fires before the auth dependency evaluates. An unauthenticated caller learns the route exists with a 307 instead of a 401 — a small but real information leak about the application's route map.

The fix: redirect_slashes=False on the FastAPI() constructor, all bare-resource routes registered with empty string (@router.post("") not @router.post("/")). With this in place, the slashed variant returns 404 to all callers regardless of auth state. Every test client, every curl command, and the entire frontend had to match the slashless paths. Auditing and propagating this took about 20 minutes; not catching it would have been a security finding.

EventSource can't send Bearer headers

The original SSE design used the browser's native EventSource API. Writing the frontend for Phase 6 revealed the constraint: EventSource does not support custom request headers. There is no standard way to send an Authorization: Bearer token. But the uniform auth contract required every protected route to authenticate the same way — a Bearer header in every request, with a consistent 401 response on failure.

The replacement was fetch() + ReadableStream. The client calls fetch("/api/insights/stream", { headers: { Authorization: "Bearer ..." } }), reads the response body as a stream, decodes chunks, and parses the SSE event format manually. More boilerplate than EventSource, but the security contract was not negotiable. The orchestrator has no knowledge of the transport — the 80 ms inter-event pacing lives in the route layer specifically so the orchestrator stays framework-agnostic.
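The project's client is JavaScript, but the buffering logic it needs is language-independent, so here is a Python rendering of the same idea (useful, for instance, in a test client). The function name parse_sse is hypothetical, and the parser is deliberately simplified: it assumes "\n" line endings and ignores id, retry, and comment lines.

```python
import json
from collections.abc import Iterable, Iterator


def parse_sse(chunks: Iterable[str]) -> Iterator[tuple[str, dict]]:
    """Yield (event, data) pairs from decoded text chunks of an SSE response."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # A frame is complete only once its blank-line terminator has arrived.
        while "\n\n" in buffer:
            frame, buffer = buffer.split("\n\n", 1)
            event, data_lines = "message", []
            for line in frame.split("\n"):
                if line.startswith("event:"):
                    event = line[len("event:"):].strip()
                elif line.startswith("data:"):
                    data_lines.append(line[len("data:"):].strip())
            if data_lines:
                yield event, json.loads("\n".join(data_lines))


if __name__ == "__main__":
    # Network chunks can split a frame anywhere, including mid-JSON.
    chunks = [
        'event: phase_start\ndata: {"event":"phase_',
        'start","phase":"analyst"}\n\nevent: done\ndata: {"event":"done"}\n\n',
    ]
    for name, data in parse_sse(chunks):
        print(name, data)
```

The buffer is the part that EventSource normally hides: chunk boundaries do not line up with frame boundaries, so a partial frame has to wait for the next read.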


GenAI vs designed: where the boundary is

This distinction matters for understanding what "vibe coding" actually means when the output is a production-quality service.

AI generated: Route handler skeletons and request/response DTOs. Pydantic model definitions for all domain objects. Test scaffolding and fixture files. Jinja2 HTML templates and CSS. Large portions of the agent prompt structures — GOOD/BAD examples, response schema definitions, system instruction text. Mermaid architecture diagrams. GCP deployment shell scripts. CI workflow YAML.

I designed: The three-agent split (Logger/Analyst/Coach vs a single mega-prompt or a different decomposition). Visible reasoning as the specific rubric differentiator — streaming agent traces to the UI while generation is in flight is not a standard pattern, and manual evaluators notice it. The India-specific data inclusion: deciding that state-level grid factors, INR→kWh conversion, IST timezones, and Indian dietary categories were worth the implementation cost, and sourcing the actual numbers. Every load-bearing convention: slashless routes, uniform 401 contract, Coach computes savings, IST everywhere. The phase-by-phase build sequence and the validation gauntlet at each phase boundary. The model rotation strategy (Sonnet for volume phases, Opus for reasoning-heavy phases). The deliberate trade-offs documented in the amendments log — each is a decision I made, not a default the framework imposed.

The line isn't "AI wrote the code, I had the idea." AI is a capable implementation partner for well-specified subtasks. The hard work is specifying those subtasks accurately, sequencing them correctly, knowing when the output has drifted from the spec, and knowing which decisions require a human to own them.


Engineering war stories

The CSP hole that made Firebase Auth fail generically

Firebase Auth's signInWithPopup was failing with auth/internal-error. That's one of the least informative error codes in the Firebase SDK — it means "something went wrong in the network layer, and I'm not going to tell you what." I checked credentials, verified the Firebase project configuration, tested the token exchange flow manually. Nothing pointed at the cause.

The actual problem was visible in the browser's Network tab: a request to apis.google.com was flagged as (blocked:csp). My Content Security Policy's connect-src directive covered *.googleapis.com but missed apis.google.com — a separate hostname Firebase uses for some Auth operations. The fix was one CSP directive. The debugging cost was about an hour of checking the wrong layer.

The lesson: when Firebase Auth fails with a generic error, open the browser Network tab before touching credentials. CSP blocks appear as (blocked:csp) there and are invisible at the application layer. The auth SDK wraps them as "internal-error" with no further context.

The timestamp format that silently zeroed Firestore range queries

Dashboard range queries were returning empty lists in production against real data. The symptom: list_activities_in_range(uid, start, end) returned nothing even with activities logged in the window. The test suite at 99.7% coverage had never caught it because every Firestore test mocked the client — the mocks returned pre-populated results and never exercised the real storage path.

The root cause was a serialization mismatch. model_dump(mode="json"), used for Firestore writes, serializes UTC datetimes as "2026-06-20T14:30:00Z" (Z suffix). datetime.isoformat(), used to generate the range bounds passed to the filter, produces "2026-06-20T14:30:00+00:00" (+00:00 suffix). Firestore was comparing these as strings, and "Z" ≠ "+00:00" lexicographically. Every range query silently matched zero documents.

The fix was to pass datetime objects directly to Firestore rather than pre-serialized strings, letting the SDK own format consistency. High test coverage on mocked dependencies does not validate the interface contract with the real system. At least one "write then read" path through actual storage — or an emulator — per data type that involves datetime comparison is worth adding before the demo.
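The format mismatch itself is easy to demonstrate with the standard library. The sketch below only shows that two string renderings of the same instant neither compare equal nor order as the instant would; it does not reproduce Firestore's query behaviour, and the strftime call is a stand-in for the Z-suffixed form described above.

```python
from datetime import datetime, timezone

instant = datetime(2026, 6, 20, 14, 30, tzinfo=timezone.utc)

# Stand-in for the Z-suffixed form the article says was written to storage.
stored = instant.strftime("%Y-%m-%dT%H:%M:%SZ")
# The form datetime.isoformat() produces for a UTC-aware datetime.
bound = instant.isoformat()

if __name__ == "__main__":
    print(stored)
    print(bound)
    # Same instant, but the strings differ from the suffix onward.
    print(stored == bound)
    # An inclusive upper bound at this exact instant excludes the stored value,
    # because "Z" sorts after "+" in a character-by-character comparison.
    print(stored <= bound)
```

Comparing datetime objects sidesteps the issue entirely, since ordering is then defined on the instant and not on its textual form.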

The Gemini Pro quota wall at 11 PM Saturday

Phase 7 was the first phase to hit the live Gemini API from a browser frontend. Around 11 PM on Saturday, the insights stream started returning 429 errors for gemini-2.5-pro, with zero daily quota remaining. The free tier has a per-day limit on Pro; I'd exhausted it building and testing Phase 7.

Waiting for midnight wasn't an option — submission was Sunday evening and there was still a full day of work left. Options: switch to Flash temporarily, or push billing changes immediately and hope quota propagated in time (billing changes are not instant). I switched both Analyst and Coach to Flash and continued.

The next morning, after confirming billing was live and Pro quota had reset, I restored the Analyst to Pro. The Coach stayed on Flash. That wasn't emergency triage — it was a validation outcome. After a full end-to-end test with Coach on Flash, recommendation quality was acceptable. Switching Coach back to Pro introduced deployment risk at the submission window with no measurable quality gain for this structured-output use case.

The lesson is about model selection order: test with Flash first, and only upgrade to Pro when Flash demonstrably fails the quality bar. Flash nearly always meets it for structured-output tasks where the schema is tight and the reasoning depth requirement is moderate.


Limitations and trade-offs

Each of these is a trade-off I made knowingly, not an oversight.

  • Food factors are modelled estimates, not measurements. Every food activity carries confidence: "estimated". A veg thali is modelled at 0.90 kg CO₂e on a representative dal + sabzi + roti + rice basis; real meals vary widely.
  • Electricity-from-bill uses a flat ₹8/kWh. Any bill→kWh conversion is forced to confidence: "estimated" regardless of the grid factor's confidence level. Real tariffs are slab-based and DISCOM-specific.
  • Grid factors are state-level annual averages — no DISCOM-level or time-of-day resolution. Coal-belt outliers (Jharkhand 1.05) are modelled adjustments above the regional average, not directly measured.
  • The reasoning stream is a paced replay (80 ms/event) of the agent's real structured trace, not token-level model streaming. It faithfully shows the reasoning steps the agent produced; it is not raw Gemini output.
  • One reasoning trace is denormalised across the 1–3 items a single Gemini call produces. The UI reads the first item's trace as canonical for the session.
  • Three agents, not four. Recommendations are not adversarially reviewed by a Devil's Advocate model — cut for time.
  • English-only, India-only, web-only — deliberate for v1, but a real limit for non-metro and non-English users.
  • Coach runs on Gemini 2.5 Flash, not Pro, a deploy-stability trade-off made during the submission window. Recommendation quality was acceptable in end-to-end testing, though presumably no better than Pro would produce. Reverting to Pro is a single-line change.
  • min-instances=1 on Cloud Run — small standing cost chosen over cold-start latency for the demo.
  • Firestore on the Spark free tier is sized for the demo, not for load.
  • Coverage is 99.68%, not 100% — five defensive branches in firestore_service.py remain intentionally uncovered.

Tech stack and quality bar

| Layer | Choice |
| --- | --- |
| Language | Python 3.13.7 |
| Backend | FastAPI (async, Pydantic v2) |
| Frontend | Server-rendered Jinja2 + Tailwind (CDN) + vanilla ES-module JS, no build step |
| AI: Logger | Gemini 2.5 Flash (function calling) |
| AI: Analyst | Gemini 2.5 Pro |
| AI: Coach | Gemini 2.5 Flash |
| Database | Firestore (Spark free tier) |
| Auth | Firebase Authentication (Google Sign-In) |
| Hosting | Cloud Run, asia-south1 |
| Secrets | Google Secret Manager → env at runtime |
| Logging | Structured JSON → Cloud Logging |

483 tests passing, 99.68% line + branch coverage against a 95% gate. mypy --strict clean across all source files. ruff clean, bandit clean, pip-audit clean. Per-user rate limiting via slowapi. Full OWASP Top 10 (2021) walkthrough shipped as SECURITY.md. CI runs ruff, black, mypy, pytest, bandit, and a Docker build on every push.


Try it

Live demo — sign in with any Google account, complete the onboarding (state + home profile), log an activity, and open Insights to watch the pipeline run live.

Source code

Built solo for PromptWars Challenge 3 (June 2026). Data sources: CEA CO₂ Baseline Database v19.0, ICCT India, DMRC, FAO Food Emissions Database, IRRI, Indian dietary survey data. License: MIT.

