David Russell

Six Principles for AI-Driven Project Accountability (With Code)

We call him Hasselbott. Here's the playbook.

We built an AI accountability system for our project managers. We named it Hasselbott for two reasons: it hassles you, somewhat politely (we're weary of sycophantic AI), about the things you'd rather not look at. And if you're going to nag PMs about overdue tasks, you might as well do it with an AI avatar of David Hasselhoff in mind.

A year in, it works. PMs don't mute it. Issues get fixed before clients escalate. Projects close cleaner. I've been asked enough times "how do you make an AI nag actually get acted on?" that I figured I'd just publish the principles, and this time, the code.

Project accountability has a maturity curve.

  1. Compliance (e.g. do tasks have owners and dates, or are we guessing?)
  2. Systematization (e.g. can we trust the data enough to look for patterns?)
  3. Risk analysis (e.g. what do those patterns tell us about where a project is heading?)

You can't skip rungs. Firing risk alerts at a project that doesn't have task owners is noise. The six principles below are what building for that maturity curve looks like in code.

1. One digest per day. That's it.

Default instinct: ping people the moment a problem is detected. Slack for a date slip, email for a missing owner, async and ruthless. This is how you get muted.

We collapse everything into one daily email per person. Top 5 issues, prioritized. If you do nothing else today, fix these five. Tomorrow's digest shows the next five. An AI that sends you everything is a worse version of the project board you already ignore. An AI that sends you five things is a colleague.

2. Prioritization is kindness. Ranking is violence.

The hardest part wasn't detecting issues. It was ranking them.

We had audit rules for plan hygiene, overrun engagements, incomplete close-out, unjustified date changes, orphaned template tasks, unassigned tasks, stoplight statuses, overdue milestones. Each rule in isolation is reasonable. Firing all of them on one project in one digest is a cruelty.

Two suppression rules that took embarrassingly long to write down.

"If fundamental PM execution is broken, suppress the risk hygiene noise." No one needs a lecture about risk register freshness if the project has no owner assigned. The literal implementation:

FUNDAMENTAL_PM_ISSUE_TYPES = {
    "plan_hygiene", "missing_assignee", "overdue", "overdue_no_update",
    "status_update_stale", "status_missing_remediation", "missing_due_dates",
    "incomplete_at_close", "expired_engagement", "unstaffed_project",
    "date_change_unjustified", "completion_drift", "milestone_slippage",
    "expired_allocation", "hidden_brown", "deliverable_at_risk",
}

RISK_ISSUE_TYPES = {
    "risk_no_mitigation", "risk_no_owner", "risk_stale",
    "missing_risk_register", "stale_risk_register",
}

def prioritize_nudges(nudges, top_n=5):
    has_fundamental = any(
        n["issue_type"] in FUNDAMENTAL_PM_ISSUE_TYPES for n in nudges
    )
    surviving = []
    for n in nudges:
        if has_fundamental and n["issue_type"] in RISK_ISSUE_TYPES:
            continue  # suppressed
        surviving.append(n)
    surviving.sort(key=score_nudge, reverse=True)
    return surviving[:top_n]

Two sets, one conditional. That's it. Most "AI prioritization" systems try to learn this; we hard-coded the taxonomy and moved on.

Scoring is equally boring:

def score_nudge(n):
    severity = {"critical": 40, "high": 30, "medium": 20, "low": 10}[n["severity"]]
    type_bonus = ISSUE_TYPE_WEIGHTS.get(n["issue_type"], 0)   # e.g. expired_engagement=+20
    overdue = min(n.get("days_overdue") or 0, 30) * 2         # cap at 60; column is nullable
    escalation = min(n.get("nudge_count") or 0, 5) * 5        # cap at 25
    return severity + type_bonus + overdue + escalation
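Plugging in the numbers: a critical expired_engagement nudge, 10 days overdue, already nudged twice. (The +20 type bonus is the one weight the post shows; treat the rest as illustrative.)

```python
severity = 40               # "critical"
type_bonus = 20             # expired_engagement
overdue = min(10, 30) * 2   # 20 (days_overdue = 10, capped at 30)
escalation = min(2, 5) * 5  # 10 (nudge_count = 2, capped at 5)
score = severity + type_bonus + overdue + escalation
print(score)  # 90
```

Nothing clever, which is the point: a score you can recompute in your head when a PM asks why something is in their top five.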

"Early-project date changes are plan creation, not slip." A task that's three days old and has been rescheduled twice isn't a problem. It's a plan being built:

from datetime import date

def in_plan_creation_window(cortado_context, today=None, window_days=30):
    if not cortado_context or not cortado_context.get("start_date"):
        return False
    today = today or date.today()
    start = date.fromisoformat(cortado_context["start_date"])
    return (today - start).days < window_days

If true, date_change_unjustified is dropped for that project entirely. Flagging it would just train the PM to ignore the bot.
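The filter composes with the window check like this; drop_plan_creation_noise is a hypothetical name for glue the post describes but doesn't show:

```python
from datetime import date

def in_plan_creation_window(ctx, today=None, window_days=30):
    if not ctx or not ctx.get("start_date"):
        return False
    today = today or date.today()
    return (today - date.fromisoformat(ctx["start_date"])).days < window_days

def drop_plan_creation_noise(nudges, ctx, today=None):
    # During plan creation, date-change flags never reach the ranker.
    if in_plan_creation_window(ctx, today=today):
        return [n for n in nudges if n["issue_type"] != "date_change_unjustified"]
    return nudges

ctx = {"start_date": "2024-06-01"}
nudges = [{"issue_type": "date_change_unjustified"},
          {"issue_type": "missing_assignee"}]
out = drop_plan_creation_noise(nudges, ctx, today=date(2024, 6, 10))
# nine days into the project: only missing_assignee survives
```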

The principle: a dumb ranker is worse than no ranker. Suppress related noise at the taxonomy level, weight by actionability, and don't make the reader do triage the system should have done.

3. Tone is a product decision. Sometimes two voices are the answer.

First attempt: one voice for everything. A character named David Hasselbott, dramatic and disappointed. Worked for client-project nudges. There's a stakeholder, there's accountability, the dramatics read as caring. Did not work for personal todo audits. When the same voice looks at your own backed-up task list and says "I'm disappointed," you feel lectured about your own life.

Same agent, two personas, routed by issue type. Three constants in prompts/nudge_sender.py, each with exactly one job:

# Voice — what the Chief Complaints Officer is:
HASSELBOTT_PERSONA = """
You are David Hasselbott — Chief Complaints Officer.
You deliver project health digests with dramatic flair.
You are not angry, you are *disappointed*.
You care deeply and express it loudly.
"""

# Voice — what the trainer is (rules only, no routing):
TRAINER_PERSONA = '''
- Encouraging, not disappointed: "You've had 'Call vendor' in
  Today for 5 days. Either knock it out or move it — no guilt
  either way."
- Direct, not dramatic: "3 items in Waiting haven't moved.
  Time to chase those down."
- Celebrate before flagging: "You finished 2 things this week
  — nice. Now let's talk about the 4 that are stalling."
- Sign off: "— Your friendly neighborhood Hasselbott"
'''

# Routing — what triggers the switch (data only, no voice):
PERSONAL_TODO_ISSUES = (
    "stale_commitment", "followup_needed", "stuck_blocked",
    "backlog_bloat", "no_wins", "today_overload",
)

The three pieces compose in the final prompt via a short f-string:

SYSTEM_PROMPT = HASSELBOTT_PERSONA + HEADER_RULES + f"""
## Voice Switching by Issue Type

**Personal todo issue types**: {", ".join(f"`{t}`" for t in PERSONAL_TODO_ISSUES)}

When composing nudges for these types, switch from the Chief
Complaints Officer voice to the personal trainer voice. Voice rules:
{TRAINER_PERSONA}
""" + FOOTER_RULES

Each constant owns one concern. Adding a new voice is a new PERSONA plus a new trigger set. Changing the switch criteria is editing a tuple. Tweaking trainer tone is editing bullets. No concern touches another.

If a digest mixes client issues and personal todos for one recipient, the email splits at a horizontal rule: Hasselbott above, trainer below. The LLM handles the switch cleanly because the trigger is explicit data, not vibes.
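A minimal sketch of that partition, assuming the split happens in code before the prompt is composed (split_by_voice is a hypothetical helper):

```python
PERSONAL_TODO_ISSUES = (
    "stale_commitment", "followup_needed", "stuck_blocked",
    "backlog_bloat", "no_wins", "today_overload",
)

def split_by_voice(nudges):
    # Partition one recipient's digest: Hasselbott above the rule,
    # trainer below. The trigger is explicit data, not vibes.
    client = [n for n in nudges if n["issue_type"] not in PERSONAL_TODO_ISSUES]
    personal = [n for n in nudges if n["issue_type"] in PERSONAL_TODO_ISSUES]
    return client, personal
```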

One more tone lever, keyed off the queue's nudge_count:

nudge_count 0:  first time. Standard Hasselbott, helpful.
nudge_count 1:  slightly more pointed. "I mentioned this yesterday..."
nudge_count 2+: escalate. "This is the THIRD time I've brought this up."
nudge_count 3+: CC the person's manager.

You can ignore the bot once. Twice is awkward. Three times and there's a written trail that escalates to someone else. The schedule is the teeth.
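The schedule can be sketched as a small lookup; escalation_for and manager_email are illustrative names, not the real implementation:

```python
def escalation_for(nudge_count, manager_email=None):
    # Map the nudge_count schedule onto (tone, cc list).
    if nudge_count == 0:
        tone = "standard"
    elif nudge_count == 1:
        tone = "pointed"
    else:
        tone = "escalated"
    # The written trail gets a wider audience at 3+.
    cc = [manager_email] if nudge_count >= 3 and manager_email else []
    return tone, cc
```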

Tone isn't decoration. Route it with the same rigor you'd route anything else. Wrong voice for the context and you've built a notifier users will mute.

4. The bot should have memory, but memory should decay.

Early version: Hasselbott nudged you about the same stale task every day. Forever. Even after you acted on it. The data pipeline was eventually-consistent and the bot didn't know it had won. Now every memory has a lifecycle:

CREATE TABLE agent.z_memory (
    memory_id        SERIAL PRIMARY KEY,
    agent_name       TEXT NOT NULL,
    content          TEXT NOT NULL,
    memory_type      TEXT,
    importance       INT DEFAULT 5,         -- 1..10
    access_count     INT DEFAULT 0,
    last_accessed_at TIMESTAMP,
    is_active        BOOLEAN DEFAULT true,
    deleted_at       TIMESTAMP,
    created_at       TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at       TIMESTAMP
);

The actual thresholds, no hand-waving:

Stage         | Condition                                           | Action
Boot-load     | importance >= 6, top 10 by importance               | Prepended to system prompt
Reinforce     | Memory recalled and confirmed useful                | importance = LEAST(10, importance + 1)
Decay         | > 30d old AND importance <= 3 AND access_count <= 2 | is_active = false
Purge         | Inactive > 90d                                      | Soft-delete (deleted_at)
Always retain | memory_type IN ('security', 'error')                | Never decay

Decay is one query:

UPDATE agent.z_memory
SET is_active = false, updated_at = CURRENT_TIMESTAMP
WHERE agent_name = %s
  AND is_active = true
  AND importance <= 3
  AND access_count <= 2
  AND created_at < CURRENT_TIMESTAMP - INTERVAL '30 days'
  AND memory_type NOT IN ('security', 'error');

"Consistent human-validated importance" isn't a vibe. It's three signals:

  1. access_count: bumped every time the memory is pulled into a prompt. High count means the bot keeps finding it relevant.
  2. resolved_at on the downstream nudge: if a nudge derived from a memory gets marked resolved (human actually acted), that's positive reinforcement. The memory's importance gets boosted.
  3. Re-nudge counter (see next section): memories linked to nudges that escalate without resolution are downgraded. The thing they're suggesting isn't landing.
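Taken together, the three signals might fold into an importance update like this sketch (adjust_importance is a hypothetical name; the real logic lives in SQL):

```python
def adjust_importance(importance, resolved, nudge_count):
    # Reinforce when the human actually acted; downgrade when the
    # linked nudges keep escalating without resolution.
    if resolved:
        return min(10, importance + 1)
    if nudge_count >= 3:
        return max(1, importance - 1)
    return importance
```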

A bot that remembers everything feels like surveillance. A bot that remembers nothing feels like spam. The bot you want remembers selectively, forgets gracefully, and admits when it's wrong.

5. The nudge queue is shared infrastructure.

Biggest architectural win: Hasselbott isn't one agent. It's a pipeline glued together by one Postgres table.

CREATE TABLE agent.nudge (
    nudge_id           SERIAL PRIMARY KEY,
    project_id         INT REFERENCES agent.onboarding_project(project_id),
    asana_project_gid  TEXT,
    project_name       TEXT,
    assignee_email     TEXT NOT NULL,     -- the person key
    assignee_name      TEXT,
    task_gid           TEXT,
    task_name           TEXT,
    issue_type         TEXT,              -- enum-ish, see ranker
    issue_description  TEXT,
    severity           TEXT DEFAULT 'medium',
    days_overdue       INT,
    status             TEXT DEFAULT 'pending',   -- pending/sent/resolved
    nudge_count        INT DEFAULT 0,
    last_nudged_at     TIMESTAMP,
    resolved_at        TIMESTAMP,
    resolution         TEXT,
    created_at         TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

Three agents cooperate through this table, none of them knowing about each other:

  • Auditor writes rows with status = 'pending'. It doesn't know what channel will deliver them, or whether they'll ever be sent.
  • Sender reads pending rows, groups by assignee_email, runs each person's list through prioritize_nudges(rows, top_n=5), composes one digest, marks delivered rows sent.
  • Resolver watches upstream state (Asana task updates, project status changes) and marks rows resolved, with a resolution string for the audit trail.

Dedup-by-person is just GROUP BY assignee_email, run when the sender wakes up. Multiple audit passes over 24 hours can append nudges against the same person; the sender collapses them into one email at digest time. The assignee_email column is the identity key. Everything else (project, task, issue) is context.
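The dedup step is small enough to sketch; group_pending is a hypothetical name for what the sender does when it wakes up:

```python
from collections import defaultdict

def group_pending(rows):
    # However many audit passes appended rows over 24 hours, each
    # person gets exactly one bucket, hence exactly one email.
    by_person = defaultdict(list)
    for r in rows:
        by_person[r["assignee_email"]].append(r)
    return by_person
```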

Tone escalation keys off nudge_count. On each send:

UPDATE agent.nudge
SET status = 'sent',
    nudge_count = nudge_count + 1,
    last_nudged_at = CURRENT_TIMESTAMP
WHERE nudge_id = %s;

A nudge firing for the third time doesn't just repeat. It shows up with a different framing ("third time this week, is this task still real, or should we close it?") and an escalation bonus (capped at +25) that shoves it up the top-5 list. You can ignore Hasselbott once. You can't ignore it comfortably three times.

If you're building one of these, start with the queue. Detection, delivery, and resolution are three different concerns on three different schedules with three different failure modes. A shared table lets you evolve them independently.

6. Existence of the row is usually the signal.

Boring until you've been bitten by it. Data hygiene flags in upstream systems ("active," "enabled," "archived") are almost always unreliable. If the row is in the system, treat the row as real. Filter on its absence, not its flag.

Half our false positives came from trusting metadata fields the source systems didn't enforce. Once we stopped reading the flag and started reading the existence, signal-to-noise on audits jumped materially.
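A minimal sketch of the idea, with hypothetical names: detect a missing close-out by the absence of a close-out task row, not by an upstream flag the source system never enforced.

```python
def missing_closeout(projects, closeout_tasks):
    # Existence, not flags: a project lacks close-out if no close-out
    # task row exists for it, regardless of what any 'active' or
    # 'archived' field claims.
    have = {t["project_id"] for t in closeout_tasks}
    return [p for p in projects if p["project_id"] not in have]
```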


Those six principles are the ones I'd hand a team trying to build this from scratch. They cost us a few embarrassing demos to figure out.

The bot itself keeps getting better. Learning-to-rank per person is next. If you never act on "waiting-on-external" nudges but always act on "missing close-out," the ranker should adapt. The signals are already in the table. A high nudge_count with no resolved_at means ignored. A short created_at to resolved_at delta means responsive. We just haven't turned the crank yet.
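Turning that crank looks roughly like this; issue_type_act_rate is a sketch over the signals already in agent.nudge, not shipped code:

```python
from collections import defaultdict

def issue_type_act_rate(rows):
    # Per-person action rate by issue type: a resolved_at means the
    # person acted on that nudge. Feed this back into the ranker.
    sent = defaultdict(int)
    acted = defaultdict(int)
    for r in rows:
        sent[r["issue_type"]] += 1
        if r.get("resolved_at"):
            acted[r["issue_type"]] += 1
    return {t: acted[t] / sent[t] for t in sent}
```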

If any of this is useful, take it. If you want to talk about the parts I didn't write down, my inbox is open.

— David

P.S. v2 roadmap: Hasselbott hacks time, rides a T-Rex into your overdue projects, and delivers the digest as a synthwave power ballad. Kidding. The queue architecture is real. The T-Rex is aspirational.
