
PREKSHA G P

How Hindsight Generates Contextual Student Tasks

Agent's memory surfacing past decisions

"It flagged Alice for a frontend bug, but Alice is a backend engineer—until Hindsight reminded us she'd been doing exactly that for three weeks." I watched the agent quietly pull a confidence_score from task history and reassign the ticket in seconds, based on nothing but what it had already seen our team do.

That moment snapped into focus what we were actually building: not a task tracker with an AI button bolted on, but a system where the agent's decisions get better because it remembers its own past ones. This is the story of how we got there, and why the memory layer turned out to be the hardest, most interesting part.

What the System Actually Does

ProPilot is an AI-powered project manager built for small engineering teams. At its core it does three things: it manages projects, tasks, and team members through a REST API; it lets a chat endpoint interrogate those records in natural language; and it uses an AI layer to suggest who should do a task and explain why, with a confidence score attached.

The backend is a FastAPI application (main.py) backed by SQLAlchemy and SQLite. The frontend is React + Vite, deployed separately. LLM calls go through Groq and OpenAI via their Python SDKs. Hindsight (hindsight-client==0.4.19, pinned in requirements.txt) sits between the AI calls and the database as an agent memory layer.

The architecture is deliberately flat. There are no microservices, no message queues, no async workers—just a single FastAPI process with nine route groups.

Everything that matters flows through tasks. Tasks are the atomic unit of observed behavior, and every completed task feeds back into the memory layer.

The Data Model Tells the Real Story

The most revealing file in the repo isn't main.py—it's db_models.py. The DBTask table is where the design philosophy becomes concrete:

from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, String

class DBTask(Base):  # Base: the app's SQLAlchemy declarative base
    __tablename__ = "tasks"
    id = Column(Integer, primary_key=True, index=True)
    task_name = Column(String, index=True)
    assigned_to = Column(String, index=True, nullable=True)
    status = Column(String, default="To Do")      # To Do, In Progress, Completed
    priority = Column(String, default="Medium")   # High, Medium, Low
    difficulty = Column(String, default="Medium") # Easy, Medium, Hard
    ai_rationale = Column(String, nullable=True)
    confidence_score = Column(Integer, default=0)
    deadline = Column(DateTime, nullable=True)
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

Two fields here are not typical in a task tracker: ai_rationale and confidence_score. Every task knows why it was assigned to someone, in plain text, and how confident the AI was when it made that call. This isn't just logging—it's the training signal. When Hindsight reads past tasks to decide who should own the next one, it reads these fields. It's learning from its own explanations.
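To make the shape of that training signal concrete, here is a minimal sketch—hypothetical helper, not code from the repo—of how a completed task row can be flattened into a text observation the agent later reads back:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TaskRow:
    """Mirrors the DBTask columns that matter for memory."""
    task_name: str
    assigned_to: Optional[str]
    status: str
    ai_rationale: Optional[str]
    confidence_score: int

def to_observation(task: TaskRow) -> str:
    """Flatten a completed task into one observation: the outcome plus
    the agent's own earlier explanation. The real pipeline lives inside
    the Hindsight integration; only the shape of the signal is shown."""
    return (
        f"Task '{task.task_name}' ({task.status}) was assigned to "
        f"{task.assigned_to or 'nobody'} with confidence "
        f"{task.confidence_score}. Rationale: {task.ai_rationale or 'n/a'}"
    )

obs = to_observation(TaskRow(
    task_name="Build setup database API backend",
    assigned_to="Alice",
    status="Completed",
    ai_rationale="Alice has extensive experience with SQLAlchemy and API design.",
    confidence_score=88,
))
print(obs)
```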

That circular structure—agent explains its decision, agent reads that explanation later to make a better decision—is what makes this qualitatively different from an LLM with a system prompt.

The DBDecision table captures architectural decisions the team makes (tech stack, process rules), and those feed into the same memory:

class DBDecision(Base):
    __tablename__ = "decisions"
    id = Column(Integer, primary_key=True, index=True)
    title = Column(String, index=True)
    content = Column(String)        # Full decision text
    decided_by = Column(String)
    category = Column(String, default="General")   # 'Architecture', 'Process', etc.
    created_at = Column(DateTime, default=lambda: datetime.now(timezone.utc))

The seed data (in main.py) writes three real decisions into this table on first run:

db.add(DBDecision(
    title="Freeze Friday Deployments",
    content="After 3 incidents, team agreed no production deployments on Fridays. "
            "Releases to happen Tuesday-Thursday only.",
    decided_by="Charlie",
    category="Process"
))

That record lives in the same memory pool that the agent queries. So when you ask the chat endpoint "when should we deploy this fix?", the agent doesn't just know your schedule—it knows why that schedule exists.
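A toy illustration of that lookup—naive keyword-overlap retrieval standing in for the real memory query, with the decision text taken from the seed data above:

```python
decisions = [
    {"title": "Freeze Friday Deployments",
     "content": "After 3 incidents, team agreed no production deployments "
                "on Fridays. Releases to happen Tuesday-Thursday only."},
]

def relevant_decisions(question: str, pool):
    """Crude prefix-overlap retrieval -- a stand-in for the memory lookup
    the chat endpoint performs. The point: the hit carries the *why*
    (3 incidents), not just the rule."""
    q_words = [w for w in question.lower().strip("?").split() if len(w) > 3]
    hits = []
    for d in pool:
        content = d["content"].lower()
        # Match on 5-char prefixes so "deploy" finds "deployments".
        if any(content.find(w[:5]) != -1 for w in q_words):
            hits.append(d)
    return hits

hits = relevant_decisions("when should we deploy this fix?", decisions)
print(hits[0]["title"])
```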

How Hindsight Fits In

The conventional approach here is a RAG pipeline: embed task history, shove it into a vector store, retrieve on query. We've done this before. It's fine until you need the memory to be structured—until you need to ask "what is Bob's on-time rate for backend tasks?" and get a number back, not a paragraph.

Hindsight solves a specific version of this problem. Rather than treating agent memory as a document retrieval problem, it treats it as a stateful store that the agent writes to and reads from across sessions. Think less "vector search" and more "the agent has a notebook it can actually reference." The Hindsight documentation describes it as structured agent memory that persists across invocations—which is exactly what we needed when suggest_assignee needed to explain why it was making a call.
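To make the "notebook" idea concrete, here is a toy persistent store. This is explicitly NOT the hindsight-client API—just an illustration of the property that matters: writes survive across sessions because they land on disk, and reads are keyed, not fuzzy.

```python
import json
from pathlib import Path

class NotebookMemory:
    """Toy stand-in for a structured agent memory store (not the real
    hindsight-client API). Entries persist across processes and are
    retrieved by key rather than by embedding similarity."""

    def __init__(self, path: str):
        self.path = Path(path)
        self._data = json.loads(self.path.read_text()) if self.path.exists() else {}

    def write(self, key: str, entry: dict) -> None:
        self._data.setdefault(key, []).append(entry)
        self.path.write_text(json.dumps(self._data))

    def read(self, key: str) -> list:
        return self._data.get(key, [])

# Session 1: the agent records an outcome.
mem = NotebookMemory("/tmp/propilot_memory.json")
mem.write("alice/backend", {"task": "Build API", "on_time": True})

# Session 2: a fresh process reads the same notebook back.
mem2 = NotebookMemory("/tmp/propilot_memory.json")
print(mem2.read("alice/backend"))
```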

The suggest_assignee endpoint (exposed at /suggest-member) takes a task name and returns a structured response:

from typing import Optional

from pydantic import BaseModel

class TaskSuggestionResponse(BaseModel):
    suggested_member: Optional[str] = None
    confidence: float
    reason: str

The reason field is the Hindsight output surfaced directly to the caller. By the time this response hits the frontend, the agent has already read the full task history for the team, matched skills against DBTeamMember.skills, and crossed that against historical completion rate and delay patterns. The confidence score comes out of that same retrieval—it's not a static value, it changes as the team's history grows.
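The real endpoint delegates this to the memory layer plus an LLM, but a hypothetical scoring sketch shows why the confidence moves as history accumulates—skill keyword matches weighted by historical on-time rate (all names and numbers here are illustrative):

```python
from typing import Optional

def suggest_assignee(task_name: str, members: dict) -> dict:
    """Hypothetical sketch: score = skill keyword hits x on-time rate.
    `members` maps name -> {"skills": [...], "on_time_rate": float}."""
    words = set(task_name.lower().split())
    best: Optional[str] = None
    best_score = 0.0
    for name, profile in members.items():
        skill_hits = sum(1 for s in profile["skills"] if s.lower() in words)
        score = skill_hits * profile["on_time_rate"]
        if score > best_score:
            best, best_score = name, score
    return {
        "suggested_member": best,
        "confidence": round(min(best_score / 2, 1.0), 2),
        "reason": (
            f"{best} matches the task keywords and has the strongest on-time history."
            if best else "No member has relevant history yet."
        ),
    }

team = {
    "Alice": {"skills": ["backend", "api", "database"], "on_time_rate": 1.0},
    "Bob": {"skills": ["frontend", "ui"], "on_time_rate": 0.66},
}
print(suggest_assignee("Build REST API backend", team))
```

Note the structural point: with an empty `members` history the same call returns no suggestion and a low confidence, which is exactly the behavior the memory layer produces early in a team's life.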

The Seed Data Is the Best Test Suite We Wrote

One underrated thing in this codebase is the /seed endpoint. It doesn't just insert records—it writes a deliberate performance history designed to create interesting patterns:

# Alice - 2 Backend tasks, On time
t1 = await create_task(db, TaskItemCreate(
    task_name="Build setup database API backend",
    assigned_to="Alice",
    deadline=now + timedelta(days=1),
    priority="High",
    difficulty="Hard",
    ai_rationale="Alice has extensive experience with SQLAlchemy and API design, "
                 "making her ideal for this core architectural task."
))
await mark_task_completed(db, t1.id)

# Alice - 1 UI task, Delayed (deadline in the past)
t3 = await create_task(db, TaskItemCreate(
    task_name="Fix UI dashboard bug",
    assigned_to="Alice",
    deadline=now - timedelta(days=1),   # already past deadline
    priority="Low",
    difficulty="Easy",
    ai_rationale="Assigned to Alice during High-Load phase to balance team throughput."
))
await mark_task_completed(db, t3.id)

Alice completes two backend tasks on time and one frontend task late. Bob completes two frontend tasks on time and one backend task late. Charlie is mixed. This isn't random—it's a structured ground truth that lets us verify the agent's suggestions are actually tracking skill-to-outcome alignment rather than just cycling through members.
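That ground truth reduces to numbers you can assert against. A small sketch (category labels are mine, inferred from the seed tasks) of the structured question a plain RAG pipeline struggles to answer with a number:

```python
from collections import defaultdict

# Seed-style ground truth: (member, category, on_time)
history = [
    ("Alice", "backend", True), ("Alice", "backend", True), ("Alice", "frontend", False),
    ("Bob", "frontend", True), ("Bob", "frontend", True), ("Bob", "backend", False),
]

def on_time_rates(records):
    """Per (member, category) on-time rate -- 'what is Bob's on-time
    rate for backend tasks?' answered with a number, not a paragraph."""
    totals = defaultdict(lambda: [0, 0])  # key -> [on_time_count, total]
    for member, category, on_time in records:
        key = (member, category)
        totals[key][0] += int(on_time)
        totals[key][1] += 1
    return {k: hits / total for k, (hits, total) in totals.items()}

rates = on_time_rates(history)
print(rates[("Alice", "backend")])  # Alice is perfect on backend
print(rates[("Bob", "backend")])    # Bob's backend record is weaker
```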

When you run the seed and then call /suggest-member?task_name=Build+REST+API, Hindsight should surface Alice, not Bob, with a reason that references her track record. If it doesn't, the memory layer isn't reading the right history. It's a behavioral test dressed up as sample data.

The InsightsResponse model makes this explicit:

class InsightsResponse(BaseModel):
    best_performing_member: Optional[str] = None
    most_delayed_member: Optional[str] = None
    stats: TaskCompletionStats
    risk_insights: List[str]

risk_insights is the most interesting field here. It's not a summary; it's a list of flags the agent raised without being asked. In one test run it surfaced a warning about Bob's database migration being late before we'd even looked at the deadline column ourselves. The agent noticed the pattern—Bob + database task + tight deadline = risk—and wrote it into the response unprompted.
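The agent infers that pattern from memory, but the rule it converged on can be written down explicitly. A hypothetical sketch of the Bob-plus-database-plus-tight-deadline flag:

```python
from datetime import datetime, timedelta, timezone

def risk_insights(open_tasks, late_history):
    """Hypothetical rule sketch for the risk_insights list: flag an open
    task when its assignee has been late on similar work AND the
    deadline is tight (within 2 days)."""
    now = datetime.now(timezone.utc)
    flags = []
    for task in open_tasks:
        member, category = task["assigned_to"], task["category"]
        tight = task["deadline"] - now < timedelta(days=2)
        if tight and late_history.get((member, category), 0) > 0:
            flags.append(
                f"{member} has been late on {category} work before and "
                f"'{task['task_name']}' is due within 2 days."
            )
    return flags

flags = risk_insights(
    open_tasks=[{
        "task_name": "Database migration",
        "assigned_to": "Bob",
        "category": "backend",
        "deadline": datetime.now(timezone.utc) + timedelta(days=1),
    }],
    late_history={("Bob", "backend"): 1},
)
print(flags)
```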

What Actually Surprised Us

The memory loop is more powerful than the model. We spent a lot of time choosing between Groq and OpenAI for the LLM calls. In practice, the quality of suggestions correlated more strongly with how much structured history Hindsight had to read than which model was answering. A smaller model with rich task history consistently beat a larger model with no history.

ai_rationale is underused right now. We write a human-readable explanation into every task, but we're not yet doing anything structured with those explanations at query time—Hindsight reads them as text, not as parsed signals. The next version would extract key phrases (skill match, workload, risk) and index them explicitly. Right now the rationale is more diary than database.
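What that extraction step could look like, as a sketch—the signal names and regex patterns here are hypothetical, not anything the repo does today:

```python
import re

# Hypothetical signal taxonomy for parsing ai_rationale text.
SIGNALS = {
    "skill_match": r"experience|expertise|ideal for|skilled",
    "workload": r"throughput|load|balance|capacity",
    "risk": r"risk|delay|late|incident",
}

def tag_rationale(text: str):
    """Turn a free-text ai_rationale into indexed signals instead of
    diary prose, so memory queries can filter on them."""
    text = text.lower()
    return sorted(k for k, pat in SIGNALS.items() if re.search(pat, text))

print(tag_rationale(
    "Assigned to Alice during High-Load phase to balance team throughput."
))
print(tag_rationale(
    "Alice has extensive experience with SQLAlchemy and API design."
))
```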

Combining decisions with tasks in one memory pool has real friction. Architectural decisions ("use FastAPI, not Django") and task performance history ("Bob was late on this") live in the same Hindsight store. That's convenient but it means the retrieval context can get muddled—the agent sometimes surfaces a process decision when you wanted a performance insight. We're considering separate named stores.

The chat endpoint is the part users actually care about. The structured suggestion API is the impressive engineering story. But the /chat endpoint—which lets you ask in plain English "who should own the auth refactor?"—is what non-technical stakeholders interact with. Designing around that interface from the start would have changed several data model decisions.

Lessons Worth Taking Forward

  • Store the agent's reasoning alongside its output, not separately. ai_rationale in the same row as the task means it's always available when you query the task. If we'd logged it to a separate table, it would have been orphaned within two sprints.
  • Seed data is specification. Writing the /seed endpoint first—with deliberate performance patterns and real ai_rationale text—forced us to answer "what does good look like?" before we built the memory layer. Every vague requirement became concrete the moment we had to write it as a DB insert.
  • Confidence scores mean nothing without calibration. The confidence_score field exists everywhere in the model. But without a feedback loop (did the suggested assignee actually perform well?), the score is just a number the agent invented. We need a completed-task signal to close the loop and make the score meaningful over time.
  • A flat architecture survives a hackathon; it won't survive a year. The single-process FastAPI app is fast to build and trivial to deploy. The moment you need background task processing (e.g., auto-generating follow-up tasks when a meeting transcript arrives), the synchronous handler model starts to strain. An async task queue is the obvious next step.
  • Don't treat agent memory as a black box. The most useful debugging we did was reading the raw content Hindsight had stored—treating it like a database we could inspect, not a service we had to trust. If you can't explain what's in your memory layer, you can't explain your agent's behavior.
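On the calibration point specifically: once completed-task outcomes are fed back, checking whether confidence_score means anything is a few lines. A hypothetical sketch of that feedback loop, bucketing predicted confidence against actual on-time outcomes (the sample records are invented):

```python
def calibration_report(records, buckets=(0.5, 0.8, 1.01)):
    """Compare predicted confidence against actual outcomes.

    `records` is a list of (confidence_0_to_1, was_on_time). A
    well-calibrated agent's 0.8-confidence suggestions should land on
    time roughly 80% of the time."""
    report = {}
    lo = 0.0
    for hi in buckets:
        bucket = [on_time for conf, on_time in records if lo <= conf < hi]
        if bucket:
            report[f"[{lo:.1f}, {hi:.1f})"] = sum(bucket) / len(bucket)
        lo = hi
    return report

records = [(0.9, True), (0.9, True), (0.85, False),
           (0.6, True), (0.55, False), (0.3, False)]
print(calibration_report(records))
```

If the high-confidence bucket's hit rate sits well below its nominal range, the score is, as the lesson above puts it, just a number the agent invented.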

The system today is a credible v1: it tracks work, surfaces patterns, explains decisions, and gets measurably better suggestions as the history grows. The interesting engineering is not in the LLM calls—those are three lines. It's in how you shape the data the agent reads and what you make it write down after every decision it makes.

That's what Hindsight made concrete for us: memory isn't a feature you add. It's a discipline about what you store, why you store it, and whether the agent can actually use it the next time you ask.

