Sunil Kumar
A Technical Framework for Deciding Which Business Workflows to Automate with AI First

Every week, I talk to engineering leads who've been handed the same brief: "Figure out where we should be using AI."

It sounds like a strategy question. It's actually an engineering question in disguise — because the right answer depends on factors that non-technical leadership can't easily evaluate: data availability, task structure, output verifiability, integration complexity.

Here's the framework we use to answer it systematically.

The 4-Dimension Workflow Scoring Matrix

For any workflow you're evaluating, score these four dimensions, each on a 1–5 scale:

workflow_score = {
    "task_structure": 0,       # How structured and repeatable is the task?
    "data_availability": 0,    # Do we have historical data to train/guide the AI?
    "output_verifiability": 0, # Can we verify if the AI output is correct without deep expertise?
    "volume_frequency": 0,     # How often does this task occur?
}

# Scoring rubric
TASK_STRUCTURE_GUIDE = {
    1: "Highly ambiguous — requires accumulated domain judgment",
    3: "Mixed — structured core with some judgment calls",
    5: "Fully structured — clear, effectively rule-based decision criteria"
}

DATA_AVAILABILITY_GUIDE = {
    1: "No historical data exists",
    3: "Some data, partially structured",
    5: "Rich historical data, well-labelled, production-quality"
}

OUTPUT_VERIFIABILITY_GUIDE = {
    1: "Near impossible to verify without deep domain expertise",
    3: "Can sample-review ~20% of outputs",
    5: "Automated verification possible (format, schema, rules)"
}

VOLUME_FREQUENCY_GUIDE = {
    1: "< 10 instances/month",
    3: "50–200 instances/month",
    5: "> 1,000 instances/month or continuous"
}

# Weighted final score
def ai_readiness_score(scores):
    weights = {
        "task_structure": 0.30,
        "data_availability": 0.30,
        "output_verifiability": 0.25,
        "volume_frequency": 0.15
    }
    return sum(scores[k] * weights[k] for k in scores)

# Interpretation
# >= 3.5      → Strong AI candidate. Start here.
# 2.5 to <3.5 → Viable, but address the low-scoring dimension first.
# < 2.5       → Not yet ready. Data or structure work is needed before AI.
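Here's the scoring function applied to a single workflow. The per-dimension rationales in the comments are illustrative — yours will come from your own audit:

```python
weights = {"task_structure": 0.30, "data_availability": 0.30,
           "output_verifiability": 0.25, "volume_frequency": 0.15}

invoice_scores = {
    "task_structure": 5,        # clear, field-by-field extraction rules
    "data_availability": 4,     # years of processed invoices on file
    "output_verifiability": 5,  # totals and schemas can be checked automatically
    "volume_frequency": 4,      # hundreds of invoices a month
}

score = sum(invoice_scores[k] * weights[k] for k in invoice_scores)
print(round(score, 2))  # 4.55 → strong AI candidate
```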

Example: Scoring 5 Common Workflows

Workflow                    | Structure | Data | Verify | Volume | Score
---------------------------|-----------|------|--------|--------|------
Invoice data extraction    |     5     |  4   |   5    |   4    |  4.55  ← Start here
Customer support triage    |     4     |  4   |   3    |   5    |  3.90  ← Strong
Employee onboarding Q&A    |     4     |  3   |   4    |   3    |  3.55  ← Good
Contract risk flagging     |     3     |  3   |   2    |   3    |  2.75  ← Later
Strategic pricing decisions|     1     |  2   |   1    |   2    |  1.45  ← Not AI
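The whole table can be computed and ranked programmatically — useful once you're scoring a dozen or more workflows rather than five. A quick sketch using the weights above, with the dimension tuples taken from the table:

```python
weights = {"task_structure": 0.30, "data_availability": 0.30,
           "output_verifiability": 0.25, "volume_frequency": 0.15}

# (structure, data, verify, volume) per workflow, from the table above
workflows = {
    "Invoice data extraction":     (5, 4, 5, 4),
    "Customer support triage":     (4, 4, 3, 5),
    "Employee onboarding Q&A":     (4, 3, 4, 3),
    "Contract risk flagging":      (3, 3, 2, 3),
    "Strategic pricing decisions": (1, 2, 1, 2),
}

def score(dims):
    # zip pairs each dimension value with its weight, in declaration order
    return sum(v * w for v, w in zip(dims, weights.values()))

ranked = sorted(workflows.items(), key=lambda kv: score(kv[1]), reverse=True)
for name, dims in ranked:
    print(f"{name:<28} {score(dims):.2f}")
```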

Invoice extraction and support triage score the highest. Contract risk flagging is interesting, but the output verifiability gap makes it premature. Strategic pricing stays with humans.

The Pre-Build Data Readiness Checklist

Before scoping any AI implementation, verify:

data_readiness = {
    "volume": "500+ historical examples of the task completed?",
    "quality": "Historical data consistent and trustworthy (not riddled with exceptions)?",
    "labels": "Do we have correct outputs for historical inputs (for supervised approaches)?",
    "recency": "Data representative of current conditions (not stale from 2+ years ago)?"
}
# Any 'No': address the data problem before scoping the AI.
# AI quality is bounded by training/retrieval data quality — no model selection fixes bad data.
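A minimal gate on top of that checklist — the yes/no answer format and the sample answers here are illustrative:

```python
def readiness_gaps(answers):
    """Return the checklist items answered 'no'; an empty list means ready to scope."""
    return [item for item, ok in answers.items() if not ok]

# Hypothetical answers for an invoice extraction audit
answers = {"volume": True, "quality": True, "labels": True, "recency": False}

gaps = readiness_gaps(answers)
if gaps:
    print(f"Fix before scoping AI: {gaps}")
else:
    print("Data-ready: proceed to scoping")
```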

The Two Infrastructure Questions Nobody Asks Until Week 8

Where does the AI output go?

If outputs need to integrate with a legacy system (SAP, Salesforce, or an older ERP), the integration layer is typically 40–60% of the total build effort. Map downstream systems before scoping the AI component.

What happens when the AI is wrong?

Every AI system has a failure rate. Define your acceptable error rate before building. What does a wrong output cost? Can you spot-check a sample? How do you handle low-confidence outputs? These are week-1 architecture decisions, not week-8 discoveries.
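One way to make that decision concrete is a confidence threshold that routes uncertain outputs to human review instead of straight to the downstream system. A sketch, assuming your model returns a confidence score — the 0.85 cut-off is a placeholder you'd tune against the cost of a wrong output:

```python
REVIEW_THRESHOLD = 0.85  # tune per workflow: higher when errors are expensive

def route(output: str, confidence: float) -> str:
    """Decide where an AI output goes: straight through, or to a human queue."""
    if confidence >= REVIEW_THRESHOLD:
        return "auto_approve"  # high confidence: flows to the downstream system
    return "human_review"      # low confidence: queued for a human spot-check

print(route("extracted invoice total: 1250.00", 0.97))  # auto_approve
print(route("extracted invoice total: unclear", 0.41))  # human_review
```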

The One-Page Prioritisation Output

After scoring every workflow:

AI READINESS SCORECARD — [Company Name]
=========================================
TOP 3 AI PRIORITY WORKFLOWS:
1. Invoice Data Extraction        Score: 4.55  ROI: HIGH
   Data ready: YES  Integration: 3 weeks  First value: 4 weeks

2. Customer Support Triage        Score: 3.90  ROI: MEDIUM-HIGH
   Data ready: PARTIAL  Gap: Need 200 labelled examples

3. Employee Onboarding Q&A        Score: 3.55  ROI: MEDIUM
   Data ready: YES (Notion + Confluence)  Integration: 2 weeks

NOT YET READY (address these data gaps first):
- Contract Risk Flagging (output verifiability gap)
- Strategic Pricing (judgment-based, insufficient structure)

Run this before you talk to any AI vendor. Know your scores before you know the solution.
