DEV Community

Talel Boussetta
HACKFARMER

I Built a Multi-Agent AI System in Preparatory School — Here's Every Technical Decision, Mistake, and Lesson

A deep dive into LangGraph orchestration, LLM fallback chains, async concurrency, encrypted secrets, WebSocket streaming, and production deployment — written by a second-year prep student in Tunisia


The Premise

Before I explain the code, let me explain the context. I'm in my second year of classes préparatoires in Tunisia — the two-year preparatory program before engineering school. My curriculum is calculus, physics, and thermodynamics. It has nothing to do with software.

I built HackFarmer on weekends and late nights, not because a professor asked me to, but because I had a specific problem I wanted to solve: I wanted to see how far I could push a system where AI agents collaborate autonomously to produce something real — not a chatbot, not a summarizer, but a full-stack GitHub repository generated from nothing but a text description.

What came out of that is a system with 8 specialized LangGraph agents, a custom LLM fallback router, Appwrite as a backend-as-a-service, a React 18 frontend with real-time WebSocket streaming, Fernet-encrypted API key storage, asyncio concurrency control, CI/CD on GitHub Actions, and a full observability stack.

This article is everything I learned building it. Every design decision has a "why." Every mistake has a lesson.


Table of Contents

  1. System Overview — What HackFarmer Actually Does
  2. The Agent Architecture — LangGraph StateGraph Design
  3. The LLM Router — Multi-Provider Fallback Chain
  4. Concurrency and Queue Management — asyncio.Semaphore in Practice
  5. The Appwrite Layer — Database, Auth, Storage, Realtime
  6. API Key Security — Fernet Encryption and IDOR Protection
  7. Real-Time Streaming — WebSocket with Polling Fallback
  8. The GitHub Agent — Git Trees API Without a CLI
  9. Frontend Architecture — React 18, Zustand, and the Agent Stage
  10. The Deployment Puzzle — Single Heroku Dyno, Two Buildpacks
  11. Observability — Sentry, PostHog, and Papertrail
  12. CI/CD — GitHub Actions, Branch Protection, Rollback
  13. The Bugs That Taught Me the Most
  14. What I'd Do Differently

1. System Overview — What HackFarmer Actually Does

The user provides a project description — a paragraph, a PDF, or a DOCX. They click submit. From that point, HackFarmer does the following without any further human input:

  1. Analyst agent parses the input, identifies the domain, user personas, and core features.
  2. Architect agent designs the technical stack, folder structure, and API contracts.
  3. Three agents run in parallel: the Frontend agent writes React components, the Backend agent writes FastAPI routes and models, and the Business agent writes a README, pitch deck outline, and monetization strategy.
  4. Integrator agent combines the parallel outputs into a coherent, unified codebase.
  5. Validator agent runs pure Python AST analysis on the generated code — no LLM — and scores it out of 100.
  6. If the score is below 70, the pipeline loops back to the Integrator with feedback. This retry can happen up to 3 times.
  7. Once score ≥ 70, the GitHub agent creates a real repository via the GitHub REST API and pushes every generated file using the Git Trees API. No git CLI. No subprocess calls.
  8. Throughout all of this, the frontend receives live updates over a WebSocket connection to Appwrite Realtime — every agent state change, every log line, every score.

The user ends up with a working GitHub repository they can clone and run.


2. The Agent Architecture — LangGraph StateGraph Design

Why LangGraph Over Raw Orchestration

Before settling on LangGraph, I considered three options:

  • Raw asyncio orchestration — manually managing agent calls with asyncio.gather() and state dicts. Simple, but it gives you nothing for routing logic or conditional branching. The retry loop would have been an ugly while loop with manual state tracking.
  • CrewAI — higher-level and opinionated. Good for simple sequential pipelines, but I needed a conditional edge (retry loop) and explicit parallel fan-out, and CrewAI's abstractions were fighting me on both.
  • LangGraph — a graph-based state machine built on top of LangChain. You define nodes (agents), edges (transitions), and conditional edges (branching logic). It gave me exactly what I needed.

The StateGraph Topology

analyst → architect → [frontend | backend | business] → integrator → validator
                                                             ▲            │
                                                             │  score < 70 AND retries < 3
                                                             └────────────┤
                                                                          │
                                                               score ≥ 70 OR retries ≥ 3
                                                                          │
                                                                          ▼
                                                                     github → END

The key architectural decision is the ProjectState TypedDict that flows through every node. Every agent reads from it and writes back to it. This is LangGraph's fundamental pattern: shared mutable state, passed by reference through the graph.

from typing import TypedDict

class ProjectState(TypedDict):
    job_id: str
    raw_text: str
    input_type: str           # "text" | "pdf" | "docx"
    analysis: dict
    architecture: dict
    frontend_code: dict       # { filename: content }
    backend_code: dict
    business_docs: dict
    integrated_code: dict
    validation_score: int
    validation_feedback: str
    retry_count: int
    repo_name: str
    repo_private: bool
    github_url: str
    refinement_feedback: str  # user-submitted refinement requests

Lesson learned: I initially forgot to declare repo_name, repo_private, and refinement_feedback in the TypedDict. Python didn't raise an error — it just set them dynamically at runtime. But LangGraph's state serialization silently dropped undeclared keys during checkpointing. The GitHub agent would crash trying to read a key that had been serialized out. The fix was simple — declare everything upfront — but it cost me two hours of confused debugging.
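The failure mode reproduces without LangGraph at all: TypedDict annotations exist only for type checkers, so any serialization step that filters state against the declared keys will silently drop everything else. A stdlib-only sketch (the filtering dict comprehension is a stand-in for LangGraph's checkpoint serialization, not its actual code):

```python
from typing import TypedDict

class ProjectState(TypedDict):
    job_id: str
    # repo_name deliberately NOT declared, reproducing the original bug

state: ProjectState = {"job_id": "abc"}
state["repo_name"] = "my-repo"   # no runtime error: a TypedDict is a plain dict

# Stand-in for checkpoint serialization: keep only the declared keys.
declared = ProjectState.__annotations__.keys()
checkpointed = {k: v for k, v in state.items() if k in declared}

print("repo_name" in checkpointed)  # → False: the key is silently gone
```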

Parallel Fan-Out and Fan-In

LangGraph supports parallel execution through add_node combined with a Send router on the conditional edge. For the three parallel agents, I define a routing function that returns multiple Send objects:

from langgraph.types import Send  # location of Send in recent LangGraph releases

def route_to_parallel(state: ProjectState):
    return [
        Send("frontend_agent", state),
        Send("backend_agent", state),
        Send("business_agent", state),
    ]

graph.add_conditional_edges("architect", route_to_parallel)

The results fan back in to the integrator node, which LangGraph waits for automatically. The integrator receives the complete state object with all three outputs populated.

Lesson learned: The parallel agents all write to different keys in ProjectState (frontend_code, backend_code, business_docs), so there are no write conflicts. But if two agents ever write to the same key, LangGraph's last-write-wins behavior will silently discard one. Always partition your state keys by agent.

The Validator Retry Loop

def route_after_validation(state: ProjectState):
    if state["validation_score"] >= 70 or state["retry_count"] >= 3:
        return "github_agent"
    return "integrator"

graph.add_conditional_edges("validator", route_after_validation)

The retry counter lives in state. The integrator increments it on each pass and also receives validation_feedback — the Validator's explanation of what was wrong — as additional context for its next generation attempt.

Lesson learned: Without the retry_count >= 3 hard ceiling, a project that consistently scores 65 will loop forever. I learned this the hard way during testing when a poorly structured input created an infinite retry loop that consumed tokens for 8 minutes before I killed it manually.
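The ceiling is easy to unit-test in isolation. Simulating a project stuck at score 65 shows the loop terminating after exactly three retries instead of burning tokens forever:

```python
def route_after_validation(state: dict) -> str:
    # Same guard as the real conditional edge: hard ceiling on retries.
    if state["validation_score"] >= 70 or state["retry_count"] >= 3:
        return "github_agent"
    return "integrator"

# A project that always validates at 65: without the ceiling this loops forever.
state = {"validation_score": 65, "retry_count": 0}
hops = []
while route_after_validation(state) == "integrator":
    hops.append("integrator")
    state["retry_count"] += 1   # the Integrator increments on each pass

print(len(hops), route_after_validation(state))  # → 3 github_agent
```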


3. The LLM Router — Multi-Provider Fallback Chain

The Problem With Single-Provider Dependency

Groq's free tier rate limits are aggressive. Gemini's API occasionally returns 503s under load. OpenRouter has higher latency but almost never goes down. No single provider is reliable enough for a production pipeline where users are waiting.

My solution is an LLMRouter class that takes a priority-ordered list of providers per agent and tries each one in sequence, with retries and timeouts.

class LLMRouter:
    def __init__(self, job_id: str, agent_name: str):
        self.job_id = job_id
        self.providers = AGENT_PROVIDER_PRIORITY[agent_name]

    async def complete(self, messages: list, **kwargs) -> str:
        for provider in self.providers:
            for attempt in range(2):  # up to 2 attempts per provider
                try:
                    response = await asyncio.wait_for(
                        self._call_provider(provider, messages, **kwargs),
                        timeout=120.0
                    )
                    return response
                except asyncio.TimeoutError:
                    logger.warning(f"[{self.job_id}] {provider} timed out, attempt {attempt+1}")
                except RateLimitError:
                    break  # Don't retry rate limits, move to next provider immediately
                except Exception as e:
                    logger.error(f"[{self.job_id}] {provider} error: {e}")
        raise AllProvidersExhaustedError(f"All LLM providers failed for {self.job_id}")
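The control flow is easy to exercise with stub providers. Everything in this sketch except the retry/timeout shape is hypothetical: the providers are fakes and the timeout is shrunk for the demo. The first provider rate-limits immediately; the second times out once, then succeeds on its retry:

```python
import asyncio

class RateLimitError(Exception):
    """Stand-in for a real provider SDK's rate-limit exception."""

async def call_provider(provider: str, log: list) -> str:
    log.append(provider)
    if provider == "gemini":
        raise RateLimitError()                 # skip to next provider, no retry
    if provider == "groq" and log.count("groq") == 1:
        await asyncio.sleep(10)                # first groq attempt: simulated hang
    return f"ok:{provider}"

async def complete(providers: list, log: list) -> str:
    for provider in providers:
        for _attempt in range(2):              # at most 2 attempts per provider
            try:
                return await asyncio.wait_for(call_provider(provider, log),
                                              timeout=0.05)
            except asyncio.TimeoutError:
                continue                       # retry the same provider once
            except RateLimitError:
                break                          # move on immediately
    raise RuntimeError("all providers exhausted")

log: list = []
result = asyncio.run(complete(["gemini", "groq"], log))
print(result, log)  # → ok:groq ['gemini', 'groq', 'groq']
```

The attempt log makes the policy visible: one wasted call on the rate-limited provider, then a timeout and a successful retry on the next one.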

Provider Priority Per Agent

Different agents have different characteristics that map to different providers:

AGENT_PROVIDER_PRIORITY = {
    "analyst":        ["gemini", "groq_fast", "groq", "openrouter"],
    "architect":      ["gemini", "groq", "openrouter"],
    "frontend_agent": ["groq", "gemini", "openrouter"],
    "backend_agent":  ["openrouter", "groq", "gemini"],
    "business_agent": ["groq_fast", "groq", "gemini", "openrouter"],
    "integrator":     ["gemini", "groq", "openrouter"],
}

The logic behind these choices:

  • Analyst uses Gemini first because it handles long, messy input documents well. Its context window is generous.
  • Frontend agent uses Groq first because it generates clean React code quickly and Groq's low latency reduces the user's perceived wait time on the first visible output.
  • Backend agent uses OpenRouter first because the meta-llama/llama-3.3-70b:free endpoint on OpenRouter tends to produce more structured FastAPI code than Groq's Llama serving for this specific task — empirical observation from testing.
  • Business agent uses groq_fast (llama-3.1-8b) first. The business docs don't require deep reasoning — a smaller, faster model handles them fine, and using a small model here keeps the pipeline fast.
  • Validator uses no LLM at all — pure Python AST analysis.
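The Validator's no-LLM approach can be sketched with the stdlib ast module. The actual rubric isn't shown in this article, so the checks and weights below are illustrative; only the technique (pure static analysis, score out of 100) is the real one:

```python
import ast

def score_python_file(source: str) -> int:
    """Score generated code with static analysis only, no LLM.
    The rubric (weights, checks) here is made up for illustration."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0                                  # unparseable code scores zero

    score = 100
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            if ast.get_docstring(node) is None:
                score -= 10                       # undocumented function
            if len(node.body) == 1 and isinstance(node.body[0], ast.Pass):
                score -= 20                       # stub body: just `pass`
    return max(score, 0)

good = 'def add(a, b):\n    """Add two numbers."""\n    return a + b\n'
stub = "def todo():\n    pass\n"
print(score_python_file(good), score_python_file(stub))  # → 100 70
```

A validator like this is deterministic and free, which is exactly why it can sit inside a retry loop without any token cost.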

Lesson learned: I initially used the same provider priority for every agent, which overloaded Gemini. When Gemini hit its rate limit, the entire pipeline degraded simultaneously because all agents were trying to fall back at the same moment. Distributing the primary provider across agents means rate limits on one provider affect only a subset of the pipeline.

The groq_fast Distinction

I maintain two separate Groq client instances: one for llama-3.3-70b-versatile and one for llama-3.1-8b-instant. They share the same API key but have different rate limit buckets on Groq's side. Using groq_fast for agents that don't need the larger model preserves the 70B capacity for agents that do.


4. Concurrency and Queue Management — asyncio.Semaphore in Practice

The Core Problem

A single Heroku dyno has limited CPU and memory. Each pipeline run invokes up to 7 LLM API calls, builds a ZIP archive, and pushes to GitHub. Allowing unlimited concurrent pipelines would cause memory exhaustion and degrade every active run simultaneously.

asyncio.Semaphore(3)

class QueueManager:
    def __init__(self):
        self._semaphore = asyncio.Semaphore(3)
        self._queue: asyncio.Queue = asyncio.Queue()
        self._poller_task: asyncio.Task | None = None

    async def start(self):
        self._poller_task = asyncio.create_task(self._poll_loop())

    async def _poll_loop(self):
        while True:
            job_id, raw_text, user_id = await self._queue.get()
            # Acquire the slot INSIDE the spawned task so the semaphore stays
            # held for the pipeline's whole lifetime, not just task creation.
            asyncio.create_task(self._run_with_slot(job_id, raw_text, user_id))

    async def _run_with_slot(self, job_id: str, raw_text: str, user_id: str):
        async with self._semaphore:
            await run_pipeline_task(job_id, raw_text, user_id)

The semaphore limits active pipelines to 3. Additional jobs wait in the asyncio.Queue and are dispatched as slots open. Note that the slot has to be acquired inside the spawned task: wrapping a bare create_task in async with self._semaphore would release the slot as soon as the task was scheduled, because create_task returns immediately.
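The throttling itself is easy to verify in isolation. In this stdlib-only sketch, six fake pipelines contend for a Semaphore(3), and a shared counter records the peak number in flight:

```python
import asyncio

async def fake_pipeline(sem: asyncio.Semaphore, counter: dict):
    async with sem:                       # hold the slot for the whole run
        counter["active"] += 1
        counter["peak"] = max(counter["peak"], counter["active"])
        await asyncio.sleep(0.02)         # stand-in for LLM calls + GitHub push
        counter["active"] -= 1

async def main() -> int:
    sem = asyncio.Semaphore(3)
    counter = {"active": 0, "peak": 0}
    await asyncio.gather(*(fake_pipeline(sem, counter) for _ in range(6)))
    return counter["peak"]

peak = asyncio.run(main())
print(peak)  # → 3: six queued pipelines, never more than three in flight
```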

Critical bug I shipped: The original version of this code marked jobs as "running" in the database but never actually called run_pipeline_task. The queue poller was a stub — it updated status but the real work was supposed to be triggered from the HTTP route, which bypassed the queue entirely. The result: jobs would sit in "running" state indefinitely on startup, the semaphore never filled, and the poller was a no-op. The fix required:

  1. Persisting raw_text in the Appwrite jobs document on creation (I had avoided this to save database reads, a premature optimization that caused the bug)
  2. Having the poller read raw_text from the document and pass it to run_pipeline_task
  3. Adding raw_text as a size: 15000 attribute to the Appwrite collection schema via setup_appwrite.py

Lesson learned: Never separate a side effect (updating a status field) from the action that should follow it. If updating status = "queued" doesn't also enqueue the work, you have a race condition between the status update and whatever triggers the actual work.

Startup Crash Guard

@asynccontextmanager
async def lifespan(app: FastAPI):
    # On startup: reset orphaned running jobs
    orphaned = await db.list_documents(
        database_id=settings.APPWRITE_DATABASE_ID,
        collection_id="jobs",
        queries=[Query.equal("status", "running")]
    )
    for job in orphaned["documents"]:
        await db.update_document(..., data={"status": "failed", "errorMessage": "Server restarted"})

    await queue_manager.start()
    yield
    # On shutdown: cleanup

Heroku dynos restart regularly — on deploys, on memory pressure, on the daily 24-hour cycle. Without this guard, any job running at restart time would be stuck in "running" forever with no way to retry or fail gracefully.


5. The Appwrite Layer — Database, Auth, Storage, Realtime

Why Appwrite Over a Traditional Database

I needed four things that would have required separate services if I used a raw database: a document database with queries, file storage with signed URLs, a complete OAuth2 authentication system, and real-time subscriptions. Appwrite provides all four in one SDK, which is exactly what a solo developer on a single dyno needs.

Collection Design

jobs collection
The canonical job record. status transitions: queued → running → complete | failed. inputType is text | pdf | docx. githubUrl is populated by the GitHub agent on completion.

userId (string)    status (string)    inputType (string)
rawText (string, 15000 chars max)     repoName (string)
githubUrl (string)                    errorMessage (string)

agent-runs collection
One document per agent invocation. Allows the frontend to show which agent is currently running, how many retries have occurred, and per-agent output summaries without embedding everything in the job document.

jobId (string)    agentName (string)    status (string)
retryCount (int)  outputSummary (string, 2000 chars)

user-api-keys collection
Encrypted user-provided API keys. The encryptedKey field stores Fernet-encrypted ciphertext. isValid is updated after each key test. lastUsed enables usage analytics.

userId (string)    provider (string)    encryptedKey (string)
isValid (bool)     lastUsed (datetime)

job-events collection
The real-time event bus. Every significant state change publishes a document here. The frontend subscribes to this collection via Appwrite Realtime WebSocket and renders each event as it arrives.

jobId (string)    eventType (string)    payload (JSON, 5kb max)

Event types: agent_start, agent_thinking, agent_done, agent_failed, job_complete, job_failed, job_queued, job_refining, heartbeat.
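With nine event type strings shared between publisher and frontend, a typo on either side silently drops events. One way to keep the backend honest is a str-based Enum; the JobEvent class name is mine, but the values are exactly the event types listed above:

```python
from enum import Enum

class JobEvent(str, Enum):
    """Canonical event types (class name is illustrative; the values are
    the strings written to the job-events collection)."""
    AGENT_START    = "agent_start"
    AGENT_THINKING = "agent_thinking"
    AGENT_DONE     = "agent_done"
    AGENT_FAILED   = "agent_failed"
    JOB_COMPLETE   = "job_complete"
    JOB_FAILED     = "job_failed"
    JOB_QUEUED     = "job_queued"
    JOB_REFINING   = "job_refining"
    HEARTBEAT      = "heartbeat"

# Subclassing str means members compare and serialize as plain strings:
print(JobEvent.AGENT_START == "agent_start")  # → True
```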

The Event Publishing Pattern

async def publish_event(job_id: str, event_type: str, payload: dict):
    await databases.create_document(
        database_id=settings.APPWRITE_DATABASE_ID,
        collection_id="job-events",
        document_id=ID.unique(),
        data={
            "jobId": job_id,
            "eventType": event_type,
            "payload": json.dumps(payload)[:5000]
        },
        permissions=[Permission.read(Role.user(get_job_owner(job_id)))]
    )

The payload is stored as a JSON string, not a nested object. Appwrite's document model doesn't support arbitrary nested JSON as a first-class field type, so stringifying and truncating to 5kb is the simplest approach that survives schema validation.

Lesson learned: I initially tried to store payload as multiple typed fields (agentName, message, score, etc.). This seemed clean but meant every event type needed a different schema, which made querying and frontend parsing much more complex. A single opaque JSON string field is more flexible. Parse it on the client.
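One wrinkle with json.dumps(payload)[:5000]: a hard slice can cut mid-string and leave unparseable JSON, which would then throw inside the client's JSON.parse. A sketch of a truncation that degrades to a valid-JSON stub instead (the helper name and stub shape are mine, not HackFarmer's):

```python
import json

def safe_payload(payload: dict, limit: int = 5000) -> str:
    """Serialize a payload; on overflow, emit a valid-JSON stub
    rather than slicing mid-token."""
    encoded = json.dumps(payload)
    if len(encoded) <= limit:
        return encoded
    stub = json.dumps({"truncated": True, "keys": sorted(payload)})
    return stub if len(stub) <= limit else "{}"

big = {"log": "x" * 10_000, "agent": "integrator"}
result = safe_payload(big)
print(len(result) <= 5000, json.loads(result)["truncated"])  # → True True
```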


6. API Key Security — Fernet Encryption and IDOR Protection

The Threat Model

Users provide their own Gemini, Groq, and OpenRouter API keys. These are credentials that cost real money if leaked. The threat model has two main vectors:

  1. Database breach — someone reads the user-api-keys collection directly
  2. Insecure direct object reference — a user crafts a request to read another user's keys

Fernet Encryption

Fernet is symmetric authenticated encryption from Python's cryptography library. The key is a urlsafe base64-encoded 32-byte key (the output of Fernet.generate_key()), stored in an environment variable, never in code.

from cryptography.fernet import Fernet

class EncryptionService:
    def __init__(self):
        self.fernet = Fernet(settings.FERNET_KEY.encode())

    def encrypt(self, plaintext: str) -> str:
        return self.fernet.encrypt(plaintext.encode()).decode()

    def decrypt(self, ciphertext: str) -> str:
        return self.fernet.decrypt(ciphertext.encode()).decode()

Keys are encrypted before writing to Appwrite and decrypted only at execution time — immediately before passing to the LLM client. The decrypted key lives in memory for less than one function call and is never logged, never serialized back to the database, and never returned in an API response.

Lesson learned: I almost stored the Fernet key in frontend/.env.local alongside the Appwrite project ID, which would have committed it to git. The Appwrite project ID is not secret (it's essentially a namespace identifier), but the Fernet key absolutely is. I caught this before committing. The lesson: .env files should be listed in .gitignore before you add any content to them, not after.

IDOR Protection

Every route that accesses a job or API key includes an ownership check:

async def get_job(job_id: str, current_user: User = Depends(get_current_user)):
    job = await databases.get_document(
        database_id=settings.APPWRITE_DATABASE_ID,
        collection_id="jobs",
        document_id=job_id
    )
    if job["userId"] != current_user.id:
        raise HTTPException(status_code=403, detail="Forbidden")
    return job

This pattern is repeated on every route that touches user-owned data. It's tedious but necessary. The alternative — relying on Appwrite's document-level permissions — works for direct Appwrite SDK calls from the frontend but doesn't protect server-side routes that use the server API key, which has elevated privileges.


7. Real-Time Streaming — WebSocket with Polling Fallback

Why WebSocket Matters for This Use Case

A pipeline run takes 60–180 seconds. Without streaming, the user stares at a spinner for two minutes with no feedback. With streaming, they watch each agent activate in sequence, see log messages appear, and observe the validator score compute in real time. The subjective experience of waiting changes completely.

Appwrite Realtime

// frontend/src/hooks/useJobStream.js
import { useState, useEffect } from 'react';
import { client } from '../lib/appwrite';

export function useJobStream(jobId) {
  const [events, setEvents] = useState([]);
  const [connected, setConnected] = useState(false);

  useEffect(() => {
    let unsubscribe;
    let pollingInterval;

    const subscribe = () => {
      unsubscribe = client.subscribe(
        `databases.hackfarmer-db.collections.job-events.documents`,
        (response) => {
          const doc = response.payload;
          if (doc.jobId !== jobId) return; // filter client-side
          setEvents(prev => [...prev, {
            type: doc.eventType,
            payload: JSON.parse(doc.payload),
            timestamp: doc.$createdAt
          }]);
        }
      );
      setConnected(true);
    };

    const startPollingFallback = () => {
      setConnected(false);
      pollingInterval = setInterval(async () => {
        const recent = await fetchRecentEvents(jobId);
        setEvents(recent);
      }, 3000);
    };

    try {
      subscribe();
    } catch (err) {
      startPollingFallback();
    }

    return () => {
      unsubscribe?.();
      clearInterval(pollingInterval);
    };
  }, [jobId]);

  return { events, connected };
}

The subscribe call is collection-scoped, not document-scoped, which means the frontend receives all events for all jobs by default. Client-side filtering by jobId is essential. This is slightly wasteful on the WebSocket connection but avoids the complexity of managing per-document subscriptions.

Lesson learned: The polling fallback isn't just for reliability — it was essential during development before I had Appwrite Realtime configured correctly. Starting with polling and adding the WebSocket layer on top meant I always had a working system. Never build a real-time feature without a polling fallback for the same data.

Heartbeat Events

The pipeline publishes a heartbeat event every 30 seconds even when no agent is actively running. This prevents the frontend from showing a stale "last updated 4 minutes ago" timestamp during the long Integrator phase (which can run for 45+ seconds on complex projects) and reassures the user that the system is alive.
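A heartbeat publisher is just a small background task that runs alongside the pipeline. In this sketch the publish callable is a stand-in for the real publish_event(job_id, "heartbeat", ...), and the intervals are shrunk so the demo runs in a fraction of a second:

```python
import asyncio

async def heartbeat(publish, interval: float):
    """Publish a heartbeat on a fixed interval until cancelled.
    `publish` stands in for the real publish_event call."""
    try:
        while True:
            await asyncio.sleep(interval)
            await publish()
    except asyncio.CancelledError:
        pass                              # pipeline finished: stop quietly

async def main() -> int:
    beats = []

    async def publish():
        beats.append("heartbeat")

    task = asyncio.create_task(heartbeat(publish, interval=0.01))
    await asyncio.sleep(0.1)              # the "slow Integrator phase"
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return len(beats)

count = asyncio.run(main())
print(count >= 2)  # → True: the UI keeps receiving signs of life
```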


8. The GitHub Agent — Git Trees API Without a CLI

Why Not subprocess.run(["git", "push"])?

Three reasons:

  1. Heroku dynos don't have git installed by default
  2. subprocess in an async FastAPI application blocks the event loop
  3. Calling git push requires cloning a repo locally first, which means disk I/O on a dyno that might have limited ephemeral storage

The Git Trees API allows creating an entire file tree in a single HTTP request, then pointing a branch at it with a second request. It's significantly faster and has zero local state.

The Implementation

import httpx

async def push_to_github(repo_name: str, files: dict, github_token: str) -> str:
    headers = {
        "Authorization": f"Bearer {github_token}",
        "Accept": "application/vnd.github+json"
    }

    # Step 1: Create the repository
    async with httpx.AsyncClient() as client:
        repo_resp = await client.post(
            "https://api.github.com/user/repos",
            headers=headers,
            json={"name": repo_name, "private": False, "auto_init": False}
        )
        repo_data = repo_resp.json()
        owner = repo_data["owner"]["login"]

    # Step 2: Build the tree — all files in one API call
    tree = [
        {
            "path": filename,
            "mode": "100644",
            "type": "blob",
            "content": content
        }
        for filename, content in files.items()
    ]

    async with httpx.AsyncClient() as client:
        # Create tree object
        tree_resp = await client.post(
            f"https://api.github.com/repos/{owner}/{repo_name}/git/trees",
            headers=headers,
            json={"tree": tree}
        )
        tree_sha = tree_resp.json()["sha"]

        # Create commit
        commit_resp = await client.post(
            f"https://api.github.com/repos/{owner}/{repo_name}/git/commits",
            headers=headers,
            json={
                "message": "Initial commit — generated by HackFarmer",
                "tree": tree_sha,
                "parents": []
            }
        )
        commit_sha = commit_resp.json()["sha"]

        # Create branch pointing at commit
        await client.post(
            f"https://api.github.com/repos/{owner}/{repo_name}/git/refs",
            headers=headers,
            json={"ref": "refs/heads/main", "sha": commit_sha}
        )

    return f"https://github.com/{owner}/{repo_name}"

The entire repository — potentially 30+ files — is created in 4 API calls. No disk I/O, no git CLI, no subprocess.

Lesson learned: The GitHub token comes from Appwrite Identities, not from user-submitted API keys. When a user authenticates with GitHub OAuth2 through Appwrite, Appwrite stores the providerAccessToken in the user's Identity document. The GitHub agent retrieves this token server-side. This is cleaner than asking the user to create a Personal Access Token, but it means the token scope must be set correctly in the OAuth2 app configuration — repo scope is required for private repositories.


9. Frontend Architecture — React 18, Zustand, and the Agent Stage

State Management Philosophy

I chose Zustand over Redux for its simplicity, and over React Context for its performance characteristics. Context re-renders every consumer on any state change — unacceptable for a component like AgentStage that updates on every WebSocket event.

// store.js
import { create } from 'zustand';

export const useStore = create((set) => ({
  user: null,
  currentJob: null,
  jobEvents: [],

  setUser: (user) => set({ user }),
  setCurrentJob: (job) => set({ currentJob: job }),
  appendEvent: (event) => set((state) => ({
    jobEvents: [...state.jobEvents, event]
  })),
  clearEvents: () => set({ jobEvents: [] }),
}));

The Agent Stage Component

AgentStage is the most complex component in the application. It subscribes to the jobEvents array from Zustand, and renders each agent as a card that transitions through states: idle → active → thinking → done | failed. The "thinking" state shows a live log stream.

The visual metaphor is a pipeline — each agent card is connected to the next with an animated line that fills with color as the agent completes. The parallel agents (Frontend, Backend, Business) appear side by side.

Lesson learned: Rendering WebSocket events directly into component state (without Zustand) caused cascading re-renders. Every new event re-rendered the entire AgentStage, which re-rendered every agent card, which caused visible flicker. Moving events to Zustand and using useStore selectors to subscribe to only the events for a specific agent eliminated the flicker completely.

JWT Cache in useAuth

Appwrite's account.get() call verifies the session with the server. Called naively, this would make a network request on every page load. I cache the result for 10 minutes:

import { useState, useEffect } from 'react';
import { account } from '../lib/appwrite';

const AUTH_CACHE_KEY = 'hackfarmer_user_cache';
const CACHE_TTL = 10 * 60 * 1000; // 10 minutes

export function useAuth() {
  const [user, setUser] = useState(null);

  useEffect(() => {
    const cached = localStorage.getItem(AUTH_CACHE_KEY);
    if (cached) {
      const { data, timestamp } = JSON.parse(cached);
      if (Date.now() - timestamp < CACHE_TTL) {
        setUser(data);
        return;
      }
    }

    account.get().then((data) => {
      localStorage.setItem(AUTH_CACHE_KEY, JSON.stringify({
        data,
        timestamp: Date.now()
      }));
      setUser(data);
    }).catch(() => setUser(null)); // no active session
  }, []);

  return { user };
}

10. The Deployment Puzzle — Single Heroku Dyno, Two Buildpacks

The Constraint

I wanted a single Heroku dyno — one URL, one $PORT, one bill. But I had two runtimes: Python (FastAPI) and Node.js (React/Vite build). Heroku supports multiple buildpacks, but you have to be deliberate about the order and how they interact.

The Buildpack Stack

1. heroku-buildpack-monorepo  →  APP_BASE=backend
2. heroku/nodejs              →  reads backend/package.json
3. heroku/python              →  reads backend/requirements.txt

The monorepo buildpack sets APP_BASE=backend, which makes Heroku treat the backend/ directory as the root. The Node.js buildpack then finds backend/package.json (a shim that just declares Node version) and runs npm run build. The Python buildpack finds backend/requirements.txt and installs dependencies.

The Procfile

release: bash build.sh
web: uvicorn src.api.main:app --host 0.0.0.0 --port $PORT

build.sh runs the Vite build inside the frontend/ directory, producing frontend/dist/. This runs on every deploy as a Heroku release phase task, before any traffic reaches the new version.

Serving React from FastAPI

from fastapi import HTTPException
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles
from pathlib import Path

DIST_PATH = Path(__file__).parent.parent.parent / "frontend" / "dist"

if DIST_PATH.exists():
    app.mount("/assets", StaticFiles(directory=DIST_PATH / "assets"), name="assets")

    @app.get("/{full_path:path}")
    async def serve_spa(full_path: str):
        # Don't intercept API routes
        if full_path.startswith("api/") or full_path.startswith("auth/"):
            raise HTTPException(status_code=404)
        return FileResponse(DIST_PATH / "index.html")

The catch-all route returns index.html for any non-API path, which is what React Router's client-side routing needs: the server serves the same shell for every route, and the router resolves the path in the browser. The if DIST_PATH.exists() guard means this code does nothing in local development, where the Vite dev server runs separately.

Lesson learned: The release phase in Heroku blocks the deploy until it completes. If build.sh fails — a missing npm dependency, a TypeScript error, anything — the new release is never promoted and the previous version keeps serving traffic. This is a feature, not a bug. It means a broken frontend build never reaches production.


11. Observability — Sentry, PostHog, and Papertrail

Sentry — Error Tracking

// frontend/src/main.jsx
Sentry.init({
  dsn: import.meta.env.VITE_SENTRY_DSN || '',
  enabled: !!import.meta.env.VITE_SENTRY_DSN && import.meta.env.PROD,
  release: import.meta.env.VITE_COMMIT_SHA,
  integrations: [Sentry.browserTracingIntegration()],
  tracesSampleRate: 0.1,
});

The enabled flag prevents Sentry from capturing noise during local development. The release tag is set to the git commit SHA at build time via vite.config.js:

define: {
  'import.meta.env.VITE_COMMIT_SHA': JSON.stringify(process.env.GITHUB_SHA || 'local')
}

This means every error in Sentry is tagged to the exact commit that caused it.

On the backend, Sentry hooks into FastAPI via sentry_sdk.init() with the FastAPI integration enabled, so every unhandled exception is captured with full request context.

PostHog — Product Analytics

I track five events:

  • job_created — with input_type as a property
  • job_completed — with duration_seconds and validation_score
  • job_failed — with failure_stage (which agent failed)
  • api_key_added — with provider, never the key content
  • api_key_tested — with provider and is_valid

The funnel from job_created to job_completed tells me what percentage of jobs successfully reach GitHub. When that percentage drops, something in the pipeline is degrading.
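The funnel math itself is trivial, but here is a sketch of how I think about it (assume the events arrive as a flat list of event names matching the five above; funnel_rate is an illustrative name, not a function from the codebase):

```python
from collections import Counter

def funnel_rate(events: list[str]) -> float:
    """Fraction of created jobs that reached job_completed."""
    counts = Counter(events)
    created = counts["job_created"]
    if created == 0:
        return 0.0  # no jobs created yet, nothing to measure
    return counts["job_completed"] / created

# 3 jobs created, 2 completed, 1 failed -> rate of 2/3
rate = funnel_rate([
    "job_created", "job_completed",
    "job_created", "job_failed",
    "job_created", "job_completed",
])
```

When `rate` dips below its usual baseline, I know some agent in the pipeline is degrading before the GitHub push.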

Papertrail — Log Aggregation

Five alerts configured in Papertrail:

  1. Pipeline failed — triggers on any log line matching the error pattern from run_pipeline_task
  2. agent CRASHED — triggers on unhandled agent exceptions
  3. All LLM providers exhausted — triggers on AllProvidersExhaustedError
  4. Error R10 — Heroku boot timeout (the dyno took > 60 seconds to bind to $PORT)
  5. Error R14 — memory quota exceeded

The R10 alert has triggered twice — both times during deploys where the Vite build took longer than expected inside the release phase. The fix was letting Heroku's build cache keep the npm dependency install layer warm, so the release-phase build doesn't reinstall everything from scratch.


12. CI/CD — GitHub Actions, Branch Protection, Rollback

Branch Strategy

feature/* → dev → main → auto-deploy to Heroku

main is protected with:

  • No direct pushes (even from me)
  • CI must pass before merge
  • Linear history only (rebase, not merge)

dev is where integration happens. feature/* branches are created from dev and merged back via PR.

The CI Workflow

# .github/workflows/ci.yml
jobs:
  backend:
    steps:
      - run: pip install ruff
      - run: ruff check backend/src
      - run: python -c "import backend.src.api.main"  # import check
      - run: pytest backend/tests/security/ -v

  frontend:
    steps:
      - run: cd frontend && npm ci
      - run: cd frontend && npm run build

The backend import check is lightweight but catches a class of errors that linting misses: circular imports, missing __init__.py files, and module-level side effects that raise exceptions.
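That one-liner generalizes into a small smoke check. This sketch (mine, not from the repo) imports each module in a fresh subprocess, so one module's side effects or crash can't mask another's:

```python
import subprocess
import sys

def imports_cleanly(module: str) -> bool:
    """Return True if `import <module>` succeeds in a fresh interpreter.

    A subprocess per module isolates module-level side effects:
    a crash in one module doesn't poison the check for the rest.
    """
    result = subprocess.run(
        [sys.executable, "-c", f"import {module}"],
        capture_output=True,
    )
    return result.returncode == 0

assert imports_cleanly("json")                    # stdlib imports fine
assert not imports_cleanly("no_such_module_xyz")  # missing module is caught
```

In CI you would loop this over every top-level package instead of only the API entrypoint.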

Rollback Procedure

heroku releases --app hackfarmer-api    # list releases
heroku rollback v47 --app hackfarmer-api  # roll back to specific version

Heroku's slug-based deployment means rollback is instant — you're switching which pre-built slug is running, not rebuilding. The target recovery time is under 2 minutes for any production incident.


13. The Bugs That Taught Me the Most

Bug 1: The Queue That Did Nothing

Described above — the queue_manager marked jobs as "running" but never actually invoked the pipeline. The system appeared to work during manual testing because I was always triggering pipelines directly from the HTTP route, bypassing the queue entirely. The bug only manifested under the queue path, which was the path used after the dyno had already accepted 3 concurrent jobs.

Root lesson: Test the queued path explicitly. Don't assume the non-queued path and the queued path are equivalent.

Bug 2: Sentry DSN Hardcoded in Source

I committed the frontend Sentry DSN directly in main.jsx. The DSN is not a secret in the way an API key is — it's designed to be public-facing — but it's still poor practice to hardcode environment-specific values. The correct fix is moving it to import.meta.env.VITE_SENTRY_DSN and setting it as a build environment variable. This also means you can have different DSNs for staging and production without changing code.

Bug 3: ProjectState Missing Fields

TypedDicts in Python are not enforced at runtime. Setting a key that isn't declared doesn't raise an error — it just gets set. LangGraph's checkpointing serialized the state and silently dropped undeclared keys. The GitHub agent read state["repo_name"] and got KeyError at runtime in production.

Root lesson: Treat your TypedDict like a strict schema. If you're setting a key anywhere in the codebase, it must be declared in the TypedDict and defaulted in the normalizer.
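The runtime behavior is easy to demonstrate. This is a minimal stand-in (ProjectState here is illustrative, not the real schema; the real failure involved the checkpoint round-trip dropping the undeclared key):

```python
from typing import TypedDict

class ProjectState(TypedDict, total=False):
    idea: str
    architecture: str
    # "repo_name" is deliberately NOT declared, mirroring the bug

state: ProjectState = {"idea": "todo app"}
state["repo_name"] = "todo-app"  # a type checker flags this; runtime does not
assert state["repo_name"] == "todo-app"  # silently accepted at runtime

def normalize(state: ProjectState) -> dict:
    """Default every declared key so downstream readers never KeyError."""
    defaults = {"idea": "", "architecture": ""}
    return {**defaults, **state}
```

The normalizer is the cheap half of the fix; the other half is running a type checker in CI so the undeclared write never merges in the first place.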

Bug 4: The Parallel Agent Write Collision That Didn't Happen (But Almost Did)

During one refactor, I temporarily changed the Frontend agent to also write a summary to the architecture key (which the Architect agent writes to). Both agents ran in parallel. The result was non-deterministic: sometimes the Frontend agent's overwrite arrived after the Architect's write, sometimes before. The bug was subtle because the integration step sometimes got correct architecture data and sometimes got a one-line summary.

Root lesson: In a parallel fan-out, each agent must own exactly one set of state keys. Any shared write is a race condition waiting to be a production incident.


14. What I'd Do Differently

Use a proper job queue instead of in-memory asyncio.Queue. The current queue is lost on dyno restart. If 5 jobs are queued and the dyno restarts, those jobs disappear. A proper solution — Redis + Celery, or Heroku's own Scheduler, or even Appwrite Functions — would persist the queue.
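A minimal direction for durability: persist the pending job IDs outside process memory so a restart can re-enqueue them. In this sketch a JSON file stands in for a Redis list purely for illustration (on Heroku the dyno filesystem is ephemeral, so the real fix needs Redis, Celery, or Appwrite; all names here are mine):

```python
import json
from pathlib import Path

QUEUE_FILE = Path("queued_jobs.json")  # stand-in for a Redis list

def persist(job_ids: list[str]) -> None:
    """Write the pending job IDs so a restart can recover them."""
    QUEUE_FILE.write_text(json.dumps(job_ids))

def restore() -> list[str]:
    """On boot, reload whatever was pending when the process died."""
    if QUEUE_FILE.exists():
        return json.loads(QUEUE_FILE.read_text())
    return []

persist(["job-1", "job-2"])
pending = restore()  # after a restart, these jobs go back in the queue
```

The point is the boot-time `restore()` hook: whatever backs the storage, the process has to re-enqueue survivors before accepting new work.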

Add streaming LLM responses. Right now, the frontend receives events at the agent level: agent_start, agent_done. It doesn't see tokens as they stream from the LLM. This means during the 30–45 seconds the Integrator is running, the user sees a spinner with no feedback. Streaming tokens to the frontend via Server-Sent Events would dramatically improve perceived performance.

Move to per-agent microservices for scaling. The current architecture runs everything in one process. If I wanted to scale — multiple Heroku dynos, or Kubernetes pods — I'd need to extract each agent into its own service communicating over a message bus (RabbitMQ, NATS). The LangGraph abstraction actually makes this easier because the StateGraph definition is separate from the execution environment.

Add input validation for repo names. GitHub repository names have strict rules: no spaces, specific allowed characters, maximum 100 characters. The current system trusts whatever the Analyst agent extracts from the input. A malformed repo name causes the GitHub agent to fail with a cryptic 422 error. A regex validation step between the Analyst and the Architect would catch this early.
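That validation step would be a few lines. A hedged sketch (the regex reflects GitHub's documented character set — ASCII letters, digits, `.`, `-`, `_` — and the 100-character limit; `is_valid_repo_name` is my name for it, not existing code):

```python
import re

# GitHub repo names: letters, digits, dot, hyphen, underscore; 1-100 chars
REPO_NAME_RE = re.compile(r"^[A-Za-z0-9._-]{1,100}$")

def is_valid_repo_name(name: str) -> bool:
    """Reject names GitHub's API would refuse with a 422."""
    # "." and ".." are reserved path components, never valid repo names
    return bool(REPO_NAME_RE.match(name)) and name not in {".", ".."}

assert is_valid_repo_name("todo-app")
assert not is_valid_repo_name("my repo")   # spaces rejected
assert not is_valid_repo_name("a" * 101)   # too long
```

Failing fast here, between the Analyst and the Architect, turns a cryptic end-of-pipeline 422 into an immediate, explainable error.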

Write integration tests. I have security unit tests and a pre-deploy checklist script, but no end-to-end integration tests that exercise the full pipeline with mock LLM responses. The absence of these tests is the single largest quality risk in the project.


Closing Thought

I'm still in prépa. I haven't started engineering school yet. The professors who taught me calculus and physics didn't teach me any of this — I learned it by reading documentation, making mistakes, and debugging things that I didn't fully understand until they broke.

If you're early in your education and you're thinking about building something technically ambitious: start before you feel ready. You will not feel ready. Build it anyway.

The codebase is at github.com/talelboussetta/HackFarm.


Talel Boussetta — 2nd-year preparatory student, Tunisia

Stack: FastAPI · LangGraph · Python 3.11 · React 18 · Vite · Zustand · Appwrite · Heroku · Sentry · PostHog · Papertrail · GitHub Actions
