DEV Community

Daniel Romitelli

Posted on • Originally published at craftedbydaniel.com

User Corrections Always Win: The Streaming Outlook Add‑in UI That Turns Human Edits Into Training Signal (Series Part 4)

I knew I’d built the wrong thing the first time I watched a recruiter hesitate over a single field.

The UI was “working”—the extractor was returning data—but the moment the company name landed wrong, everything downstream became fragile. Not because the AI was bad, but because the system was implicitly asking a human to trust a blob of JSON. That’s not a workflow. That’s a gamble.

This is Part 4 of my series “How to Architect an Enterprise AI System (And Why the Engineer Still Matters)”. In Part 3, I wrote about the six-tier enrichment cascade and why provenance has to be tracked per-field. This post is the next decision in that chain: user corrections always win—and the system learns from them.

The key insight (and why the naive approach fails)

The naive architecture is seductive:

  1. Extract email → 2. Create CRM records → 3. Let users “fix it later.”

It fails for a very human reason: corrections made “later” are expensive, inconsistent, and often never happen. Worse, if the AI writes directly to the database, you’ve inverted accountability. The AI becomes the author of record, and the human becomes a janitor.

So I flipped the direction of trust.

My core decision

The AI populates a form, not a database.

The form is the contract:

  • The AI streams candidate/company/deal fields into the UI.
  • The human reviews, edits, and approves.
  • The payload preserves both:
    • the AI’s original extraction (ai_extraction)
    • the human’s final edits (user_corrections)

That last bullet is the whole trick. The correction isn’t a fallback. It’s the primary design.
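To make the contract concrete, here's a minimal sketch of what that dual payload could look like. Only the two top-level keys (`ai_extraction`, `user_corrections`) come from the post; the field names and shapes inside them are illustrative:

```javascript
// Illustrative intake payload: both versions travel together.
// Field names (company_name, candidate_name) are examples, not the real schema.
const intakePayload = {
  ai_extraction: {
    company_name: { value: 'Acme Capital', confidence: 0.62 },
    candidate_name: { value: 'Jane Doe', confidence: 0.91 }
  },
  user_corrections: {
    // The human overwrote the company name; the original AI value is preserved.
    company_name: { from: 'Acme Capital', to: 'Acme Capital Partners' }
  }
};
```

Because both keys travel in one payload, the backend never has to reconstruct "what the AI originally said" after the fact.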

My mental model is a movie set, not a factory line: the AI is a talented assistant dressing the scene quickly, but the human is the director who decides what’s canon. The system is built so the director’s call is recorded—every time.

How the UX works: streaming extraction into a human-reviewed form

The Outlook Add-in experience is intentionally kinetic. When the user clicks the button, they don’t wait for one monolithic response. They watch fields populate as the extraction arrives. That’s not just for “feel”—it’s an operational choice.

Streaming gives me two things at once:

  1. Time-to-first-value: the user can start reading/editing before the full extraction finishes.
  2. A natural review loop: per-field confidence badges make it obvious where human attention matters.

The streaming extraction call (with a hard timeout + fallback)

In addin/taskpane.js I implemented a streaming extraction flow with a 60s timeout and a fallback path. The important detail isn’t “WebSocket” as a buzzword—it’s that the UI is resilient when the network isn’t.

/**
 * Extract email data and show preview
 */
async function extractAndPreview() {
    console.log('extractAndPreview function started');
    let calendlyHints = null;
    const getCalendlyFallback = () => {
        if (calendlyHints && Object.keys(calendlyHints).length > 0) {
            return calendlyHints;
        }
        return null; // no hints captured yet
    };
    // ... (rest of the streaming flow — 60s timeout + fallback — elided) ...
}

This is the exact moment in the code where I anchored the UX: extraction begins as an interactive session, not a single request/response. What surprised me in practice is how often the fallback path becomes the difference between “annoying” and “unusable” when a corporate network decides to be itself.
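The excerpt doesn't show the timeout wiring itself, so here's one hedged way to express "streaming with a hard 60s timeout and a fallback path" — the function names (`streamExtraction`, `fallbackExtraction`) are hypothetical stand-ins, not the real implementation:

```javascript
// Sketch: race a streaming extraction against a 60s deadline, falling back
// to a plain request/response path on timeout or error.
function withTimeout(promise, ms) {
  let timer;
  const deadline = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('extraction timed out')), ms);
  });
  // Whichever settles first wins; always clear the timer afterwards.
  return Promise.race([promise, deadline]).finally(() => clearTimeout(timer));
}

async function extractWithFallback(streamExtraction, fallbackExtraction) {
  try {
    return await withTimeout(streamExtraction(), 60_000);
  } catch (err) {
    // Network hiccup or timeout: degrade gracefully instead of failing the UI.
    console.warn('Streaming failed, using fallback:', err.message);
    return fallbackExtraction();
  }
}
```

The design point is that the fallback is a first-class path, not an afterthought: the UI's promise always resolves to *some* extraction result.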

Per-field confidence: badges that teach the user where to look

If you want humans to reliably review AI output, you can’t ask them to review everything. You have to tell them where the risk is.

That’s what the confidence badges do. The extractor streams values, but the UI communicates uncertainty per field. Each field gets a confidence value and a color-coded badge.

This is where the earlier provenance decision (Part 3) pays rent: once you treat fields independently, you can also treat confidence independently.

Confidence badges in the UI

The add-in renders color-coded badges per field so the user knows exactly where to focus their attention. A recruiter looking at a form with twelve fields doesn't want to read everything—the badge system is the attention-routing mechanism.

The tiers map directly to what a recruiter needs to know:

if (confidence >= 0.8) {
    indicator.textContent = 'High Confidence';
    indicator.className = 'extracted-indicator confidence-high';
} else if (confidence >= 0.6) {
    indicator.textContent = 'Medium Confidence';
    indicator.className = 'extracted-indicator confidence-medium';
} else if (confidence >= 0.3) {
    indicator.textContent = 'Low Confidence';
    indicator.className = 'extracted-indicator confidence-low';
    field.classList.add('low-confidence');
} else {
    indicator.textContent = 'Please Review';
    indicator.className = 'extracted-indicator confidence-very-low';
    field.classList.add('very-low-confidence');
}

Green means the extractor found this cleanly—trust it and move on. Yellow means it was inferred or partially supported—worth a glance. Red means it's a guess—verify before you send. And below 0.3, the field itself highlights to pull the eye.

I landed on four tiers instead of two or five because binary high/low didn't give recruiters enough signal—a name parsed from a signature and a name guessed from a Calendly link are different kinds of uncertain. But five tiers turned out to be noise; nobody could remember what medium-low meant. Four maps cleanly to the decision a recruiter actually makes: trust it, glance at it, check it, or fill it in yourself.
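The branch shown above reduces to a pure mapping, which makes the thresholds trivial to unit-test. The function name is mine; the thresholds and labels are exactly the ones from the snippet:

```javascript
// Map a 0–1 confidence score to the four tiers used by the badge renderer.
function confidenceTier(confidence) {
  if (confidence >= 0.8) return 'High Confidence';
  if (confidence >= 0.6) return 'Medium Confidence';
  if (confidence >= 0.3) return 'Low Confidence';
  return 'Please Review';
}
```

Keeping the mapping pure also means the DOM code (class names, highlighting) can change without touching the tier logic.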

The non-obvious win is behavioral—people correct more when they believe it matters.

The override pattern: human edits overwrite AI extraction

Here’s the rule I enforce end-to-end:

  • The AI can propose.
  • The user can overwrite.
  • The user’s overwrite is authoritative.

That means when the backend processes the intake request, it must treat the final payload as truth—even if the AI was “confident.” Confidence is advisory; edits are law.
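"Edits are law" can be stated as a one-line merge rule: user values always shadow AI values. A sketch (the real route surely layers validation on top of this):

```javascript
// User corrections always win: merge AI-proposed fields with user edits,
// letting any user-provided value shadow the AI's, regardless of confidence.
function resolveFinalRecord(aiFields, userEdits) {
  const final = { ...aiFields };
  for (const [field, value] of Object.entries(userEdits)) {
    final[field] = value; // authoritative, even if the AI was "confident"
  }
  return final;
}
```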

Step 6: store and learn from user corrections

In app/routes/intake_email.py, Step 6 explicitly calls out storing user corrections for learning by comparing the original AI extraction with the final user-edited version.

        # STEP 6.5: Store user corrections for learning (AI chat, direct edits, filled empty fields)
        # Compare original AI extraction with final user-edited version

I’m deliberately showing this at the comment level because that’s what’s present here: the contract is documented in the route itself. The key is that the route is not merely “saving the final record”—it’s treating the diff as first-class data.

Correction learning: capturing diffs as training signal

A correction pipeline lives or dies on one detail: you must preserve what the AI said originally.

If you only store the final edited values, you lose the learning signal. You can’t tell whether the AI was wrong, whether the user changed their mind, or whether the email itself was ambiguous.

So the payload design matters. The system keeps two versions:

  • ai_extraction: what the model originally produced
  • user_corrections: what the human changed (including filled empty fields)

That’s the dataset.
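Computing `user_corrections` from the two versions is a straightforward diff. Here's a sketch that also captures filled-empty-fields, matching the Step 6.5 comment in the route — all names are mine:

```javascript
// Diff the original AI extraction against the final user-edited fields.
// Captures overwrites AND fields the user filled in that the AI left empty.
function computeCorrections(aiExtraction, finalFields) {
  const corrections = {};
  for (const [field, finalValue] of Object.entries(finalFields)) {
    const aiValue = aiExtraction[field] ?? null; // null = AI left it empty
    if (aiValue !== finalValue) {
      corrections[field] = { ai_value: aiValue, user_value: finalValue };
    }
  }
  return corrections;
}
```

Fields the user left untouched produce no entry, so the stored diff is exactly the learning signal and nothing else.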

And because the add-in is streaming, the UI can also track edits as they happen.

Field change tracking in the add-in

The add-in tracks user field changes so the backend can store meaningful diffs for learning (implemented in addin/taskpane.js in the field-change tracking section).

// 2. Current field is empty OR has invalid placeholder
const newValueIsValid = !invalidValues.includes(companyLower);
const currentIsEmptyOrInvalid = !currentFirmValue || invalidValues.includes(currentFirmValue);

if (newValueIsValid && currentIsEmptyOrInvalid) {
    streamedData.company_name = data.company_name;
    updateFieldFromStream('company_name', data.company_name, data.confidence);
    addStreamingProgressMessage(`Found company: ${data.company_name}`, '');
} else {
    console.log(`Skipping company update - new:'${data.company_name}' current:'${currentFirmValue}'`);
}

This is one of those “small” guardrails that prevents a very real failure mode: once the user starts editing, the stream must not keep fighting them. The thing I like about this pattern is that it’s not philosophical—it’s mechanical. It makes the UI behave like a respectful assistant instead of a stubborn autocomplete.
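One way to make "the stream backs off" fully mechanical is a dirty-set: once a field receives a manual edit, streamed updates for it are dropped. This is a sketch of the pattern, not the add-in's actual code:

```javascript
// Track fields the user has touched; streamed values never overwrite them.
const dirtyFields = new Set();

function onUserEdit(field) {
  dirtyFields.add(field); // the user now owns this field
}

function applyStreamUpdate(fields, field, value) {
  if (dirtyFields.has(field)) return false; // respect the human's edit
  fields[field] = value;
  return true;
}
```

Wiring `onUserEdit` to each input's change handler gives you the "respectful assistant" behavior with no per-field special cases.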

Architecture flow: from streaming extraction to learning signal

Below is the real flow I built: the AI drafts into a form, the human approves, and the system stores both the draft and the edits.

flowchart TD
  userClick[User clicks Extract] --> addinUI[Outlook Add-in form]
  addinUI --> streamPipe[Streaming field updates]
  streamPipe --> confidenceBadges[Per-field confidence badges]
  confidenceBadges --> userEdits[User edits fields]
  userEdits --> sendAction[User clicks Send]
  sendAction --> intakeRoute[intake_email route]
  intakeRoute --> overrideRule[User corrections override AI]
  overrideRule --> diffStore[Store both versions]
  diffStore --> learningLoop[Correction learning pipeline]

The non-obvious design choice is that the learning loop is downstream of an explicit human approval step. That keeps the dataset clean: the correction signal comes from intentional edits, not from silent database fixes weeks later.

Nuances that mattered in production

A few details turned out to be the difference between “nice demo” and “stable workflow.”

1) Streaming must stop overwriting once the user takes control

If the stream keeps populating after the user edits, you get a tug-of-war. The snippet above shows the exact stance: if the current field is already filled (or the new value is invalid), the stream backs off.

2) Confidence is UX, not governance

I treat confidence as a review aid. The system still accepts the user’s edits as the final truth. This keeps the human in charge and avoids the worst kind of AI product bug: “the model was confident, so we ignored you.”

3) “AI learned” feedback is not fluff

That little toast (AI learned: ...) is an incentive mechanism. It makes the user feel like correcting the system is part of the job, not a tax. And it’s honest: the system is explicitly designed to store those diffs.

Closing

The moment I stopped trying to make the extractor “right” and started making it reviewable, the whole system got calmer. Streaming field-by-field updates let the user see what’s happening, confidence badges tell them where to look, and the override rule makes the contract unambiguous: the human decides what becomes truth—and every correction becomes data.

In Part 5 of this series, I’ll show the other half of running AI in a real enterprise environment: feature flags as guardrails, including the hardcoded kill switch that lets me shut off risky behavior without redeploying the world.


🎧 Listen to the Enterprise AI Architecture audiobook
📖 Read the full 13-part series with an AI assistant
