DEV Community

Cover image for Building an Adaptive NER System with MLOps: A Complete Guide (Production)
Akan
Akan

Posted on

Building an Adaptive NER System with MLOps: A Complete Guide (Production)

How we took a transaction classification system from concept to a self-sustaining production pipeline with GitHub Actions that runs 24/7 without human intervention


In the previous guide we discussed how to build this system locally, but here we will go a step further and actually build for production.

I'll walk you through the journey of building and productionizing an enhanced Named Entity Recognition (NER) system that:

  • Generates synthetic data automatically every day
  • Trains ML models with hybrid rule-based + machine learning approaches
  • Deploys interactive reports to GitHub Pages automatically
  • Runs 3x faster with intelligent caching strategies
  • Costs $0/month using GitHub Actions free tier

Live Demo: https://akanimohod19a.github.io/productionizing_NER/

The Result: A production-grade ML pipeline that processes 1,000 transactions, trains a model, and publishes a beautiful report — all in under 5 minutes, completely autonomously.


Table of Contents

  1. The Problem We Solved
  2. Initial POC: What We Started With
  3. Production Challenges We Faced
  4. Solution 1: Implementing Intelligent Caching
  5. Solution 2: Fixing the Invalid Date Bug
  6. Solution 3: Dynamic Data Generation in CI/CD
  7. Solution 4: Comprehensive Testing Strategy
  8. Architecture Deep Dive
  9. Performance Metrics: Before vs After
  10. Lessons Learned
  11. What's Next

The Problem We Solved

Business Context

Financial institutions process millions of free-text transaction descriptions daily, that look like these:

"walmart grocery shopping"
"cvs pharmacy prescription pickup"  
"uber ride to downtown"
"payment to acme corp inv-2024-001"
Enter fullscreen mode Exit fullscreen mode

The Challenge:

  • Manual categorization is impossible at scale
  • Rule-based systems miss new patterns
  • Traditional ML requires constant retraining
  • No visibility into model performance
  • Reports are static and outdated

What We Built

A self-improving classification system that:

  1. Automatically generates realistic test data
  2. Combines rule-based and ML classification
  3. Discovers new categories through clustering
  4. Tracks everything with MLflow
  5. Publishes interactive reports to the web
  6. Runs completely autonomously via GitHub Actions

And it costs nothing to run!


Initial POC: What We Started With

The Original Implementation

Our proof-of-concept had three core components:

1. Rule-Based Classifier

# models/keyword_rules.yaml
categories:
  Healthcare:
    keywords: [pharmacy, doctor, hospital, medical]
    weight: 1.5

  Groceries:
    keywords: [walmart, grocery, supermarket]
    weight: 1.0
Enter fullscreen mode Exit fullscreen mode

Coverage: 68.5% of transactions classified instantly.

2. ML Enhancement

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Amount-weighted training
sample_weights = np.log1p(df['amount'].abs())
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X, y, sample_weight=sample_weights)
Enter fullscreen mode Exit fullscreen mode

Improvement: +22.7% coverage (total: 91.2%)

3. Unsupervised Discovery

from sklearn.cluster import DBSCAN

# Find patterns in unknown transactions
clustering = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
labels = clustering.fit_predict(X)

# Discovered: "Insurance" category
# From: ["geico auto", "state farm policy", "allstate premium"]
Enter fullscreen mode Exit fullscreen mode

POC Results

Metric Value
Classification Coverage 91.2%
Processing Speed 0.8ms/transaction
Amount-Weighted Accuracy 96.8%

The POC worked. But it was manual, slow, and not production-ready. So, I planned to build it to run autonomously and with minimal intervention from humans,
but even that came with its own challenges.


Production Challenges We Faced

Challenge 1: Long Build Times

Problem: Initially, Each GitHub Actions run took 12+ minutes.

├─ Install Python packages:     4m 30s
├─ Install R packages:          6m 15s  
├─ Run tests:                   1m 20s
├─ Generate report:             2m 45s
└─ Total:                       12m 50s
Enter fullscreen mode Exit fullscreen mode

Why it mattered: Slow feedback loops = slower development.

Challenge 2: Invalid Timestamps 📅

Problem: Then the published reports showed "Invalid Date" on the dashboard due to parsing issues.

// Dashboard tried to parse:
timestamp: "20260313_143522"

// But JavaScript Date() expected:
timestamp: "2026-03-13T14:35:22"
Enter fullscreen mode Exit fullscreen mode

Impact: Professional dashboard looked broken.

Challenge 3: Stale Test Data

Problem: Tests ran against old, committed CSV files. Since the workflow start with a data gen - the entire system should work with that version of records.
Although, this is entirely because we were testing with random records in a real scenario, you are pointing to the data source, entirely.

# Tests always used this same file:
tests/fixtures/sample_transactions.csv

# But real pipeline generated fresh data daily!
Enter fullscreen mode Exit fullscreen mode

Risk: Tests passing but production failing.

Challenge 4: No Visibility

Problem: When tests failed, we had to dig through logs.

FAILED tests/test_classifier.py::test_groceries_classification
ValueError: not enough values to unpack (expected 3, got 2)
Enter fullscreen mode Exit fullscreen mode

Frustration: Cryptic errors, no clear fix.

So, I researched solutions.


Solution 1: Implementing Intelligent Caching

The Strategy

We implemented a multi-layer caching strategy to cache everything that doesn't change between runs.

Layer 1: Python Package Caching

Before:

- name: Install dependencies
  run: pip install -r requirements.txt
  # Time: ~4 minutes EVERY run
Enter fullscreen mode Exit fullscreen mode

After:

- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.9'
    cache: 'pip'  # ← Built-in pip caching

- name: Cache Python packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-v1-${{ hashFiles('requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-v1-
      ${{ runner.os }}-pip-
Enter fullscreen mode Exit fullscreen mode

How it works:

  1. First run: Downloads and caches packages (4 min)
  2. Subsequent runs: Restores from cache (15 sec)
  3. Only re-downloads if requirements.txt changes

Result: 3.75 minutes saved per run!

Layer 2: R Package Caching

R packages are huge and take forever to compile.

Before:

- name: Install R dependencies
  run: |
    install.packages(c("tidyverse", "plotly", "DT", ...))
  # Time: ~6 minutes
Enter fullscreen mode Exit fullscreen mode

After:

- name: Cache R packages
  uses: actions/cache@v4
  with:
    path: ${{ env.R_LIBS_USER }}
    key: ${{ runner.os }}-r-v1-${{ hashFiles('DESCRIPTION') }}

- name: Install R dependencies
  uses: r-lib/actions/setup-r-dependencies@v2
  with:
    packages: |
      any::tidyverse
      any::knitr
      any::rmarkdown
Enter fullscreen mode Exit fullscreen mode

Why this is brilliant:

  • r-lib/actions is maintained by RStudio
  • Handles OS-specific compilation
  • Caches binary packages, not source

Result: 5.5 minutes saved!

Layer 3: Pytest Cache

Tests generate fixtures and metadata that can be reused.

Implementation:

- name: Cache pytest
  uses: actions/cache@v4
  with:
    path: .pytest_cache
    key: ${{ runner.os }}-pytest-v1-${{ hashFiles('tests/**/*.py') }}

- name: Run tests
  run: pytest tests/ -v --cov=src/python
Enter fullscreen mode Exit fullscreen mode

What gets cached:

  • Test discovery results
  • Fixture compilation
  • Coverage data structures

Result: 30 seconds saved, plus faster local testing!

Layer 4: MLflow Artifacts

ML experiments generate tons of metadata.

- name: Cache MLflow artifacts
  uses: actions/cache@v4
  with:
    path: mlruns
    key: ${{ runner.os }}-mlflow-v1-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-mlflow-v1-
Enter fullscreen mode Exit fullscreen mode

What's cached:

  • Model parameters
  • Metrics history
  • Artifact metadata

Benefit: Faster MLflow UI loading, experiment comparisons.

The Cache Strategy Matrix

Layer Size Build Time Cache Hit Rate Time Saved
Python packages 200 MB 4m 30s 95% 4m 15s
R packages 800 MB 6m 15s 90% 5m 30s
Pytest cache 5 MB 30s 85% 25s
MLflow artifacts 50 MB - 80% -

Total Time Saved: ~10 minutes per run!

Cache Invalidation Strategy

We use semantic versioning for cache keys:

env:
  CACHE_VERSION: v1  # Increment to bust all caches

key: ${{ runner.os }}-pip-${{ env.CACHE_VERSION }}-${{ hashFiles('requirements.txt') }}
Enter fullscreen mode Exit fullscreen mode

When to bump version:

  • Major dependency upgrade
  • OS image change
  • Cache corruption suspected

Pro tip: Use restore-keys for partial cache hits:

restore-keys: |
  ${{ runner.os }}-pip-v1-
  ${{ runner.os }}-pip-
Enter fullscreen mode Exit fullscreen mode

This provides a fallback hierarchy:

  1. Try exact match (requirements.txt hash)
  2. Try any v1 cache
  3. Try any pip cache

Result: Cache hit rate increased from 60% to 95%!


Solution 2: Fixing the Invalid Date Bug

The Root Cause

Our dashboard used JavaScript to parse timestamps:

// What we were generating:
{
  "timestamp": "20260313_143522"
}

// What JavaScript Date() expected:
{
  "timestamp": "2026-03-13T14:35:22.000Z"
}
Enter fullscreen mode Exit fullscreen mode

The Investigation

Step 1: Check the manifest generation

# Original (broken) code:
timestamp_str = filename.replace('assessment_report_', '').replace('.html', '')
# Result: "20260313_143522"

reports.append({
    'timestamp': timestamp_str  # ❌ Not ISO format!
})
Enter fullscreen mode Exit fullscreen mode

Step 2: Test in browser console

new Date("20260313_143522")
// Invalid Date

new Date("2026-03-13T14:35:22")
// Wed Mar 13 2026 14:35:22 GMT+0000 (UTC) ✓
Enter fullscreen mode Exit fullscreen mode

The Fix

Updated manifest generation:

from datetime import datetime

# Parse the filename timestamp
timestamp_str = filename.replace('assessment_report_', '').replace('.html', '')

try:
    # Format: YYYYMMDD_HHMMSS
    dt = datetime.strptime(timestamp_str, '%Y%m%d_%H%M%S')

    # Convert to ISO 8601 format
    iso_timestamp = dt.isoformat()  # "2026-03-13T14:35:22"
except:
    # Fallback to current time if parsing fails
    iso_timestamp = datetime.now().isoformat()

reports.append({
    'id': timestamp_str,
    'timestamp': iso_timestamp,  # ✓ ISO format
    'url': f'reports/{timestamp_str}/{report_file.name}'
})
Enter fullscreen mode Exit fullscreen mode

The Result

Before:

┌──────────┐
│ Invalid  │
│   Date   │
└──────────┘
Enter fullscreen mode Exit fullscreen mode

After:

┌──────────┐
│  Mar 13  │
│   2026   │
└──────────┘
Enter fullscreen mode Exit fullscreen mode

JavaScript Enhancement

We also improved the date formatting on the dashboard:

const date = new Date(report.timestamp);

// Format for display
const formattedDate = date.toLocaleString('en-US', {
  year: 'numeric',
  month: 'long',
  day: 'numeric',
  hour: '2-digit',
  minute: '2-digit'
});
// "March 13, 2026, 02:35 PM"

// Format for stats card
const shortDate = date.toLocaleDateString('en-US', {
  month: 'short',
  day: 'numeric'
});
// "Mar 13"
Enter fullscreen mode Exit fullscreen mode

Key Lesson: Always use ISO 8601 format for timestamps in APIs and data interchange!


Solution 3: Dynamic Data Generation in CI/CD

The Problem with Static Test Data

Our original workflow used committed CSV files:

# Old workflow
- name: Train model
  run: python src/python/train_model.py data/sample_transactions.csv
  #                                      ↑ Static file from repo
Enter fullscreen mode Exit fullscreen mode

Issues:

  1. Tests always ran against same data
  2. Real pipeline generated fresh data daily
  3. No way to test edge cases
  4. Stale data != production data

The Solution: Generate Data in CI/CD

We made data generation the first step of the pipeline:

jobs:
  # Job 1: Generate fresh data
  generate-data:
    runs-on: ubuntu-latest
    outputs:
      data_file: ${{ steps.generate.outputs.data_file }}
      timestamp: ${{ steps.generate.outputs.timestamp }}

    steps:
      - name: Generate synthetic transaction data
        id: generate
        run: |
          TIMESTAMP=$(date +%Y%m%d_%H%M%S)
          DATA_SIZE=${{ github.event.inputs.data_size || '1000' }}
          DATA_FILE="data/transactions_${TIMESTAMP}.csv"

          python scripts/generate_sample_data.py \
            --size ${DATA_SIZE} \
            --output ${DATA_FILE}

          # Pass to next jobs
          echo "data_file=${DATA_FILE}" >> $GITHUB_OUTPUT
          echo "timestamp=${TIMESTAMP}" >> $GITHUB_OUTPUT
Enter fullscreen mode Exit fullscreen mode

Connecting Jobs with Artifacts

Upload from generator:

- name: Upload data artifact
  uses: actions/upload-artifact@v4
  with:
    name: transaction-data-${{ steps.generate.outputs.timestamp }}
    path: |
      ${{ steps.generate.outputs.data_file }}
      data/*_metadata.json
    retention-days: 7
Enter fullscreen mode Exit fullscreen mode

Download in training job:

train-model:
  needs: [generate-data, test]  # Wait for data generation

  steps:
    - name: Download transaction data
      uses: actions/download-artifact@v4
      with:
        name: transaction-data-${{ needs.generate-data.outputs.timestamp }}
        path: data/

    - name: Train NER classifier
      run: |
        DATA_FILE="${{ needs.generate-data.outputs.data_file }}"
        python src/python/train_model.py ${DATA_FILE}
Enter fullscreen mode Exit fullscreen mode

Benefits of Dynamic Data

1. Fresh Data Every Run

# Different data every day
2026-03-13: 1000 transactions with current patterns
2026-03-14: 1000 NEW transactions with NEW patterns
Enter fullscreen mode Exit fullscreen mode

2. Configurable Size

workflow_dispatch:
  inputs:
    data_size:
      description: 'Number of transactions'
      default: '1000'
Enter fullscreen mode Exit fullscreen mode

Can test with:

  • 100 for quick smoke tests
  • 1,000 for normal runs
  • 10,000 for stress tests

3. Realistic Distribution

# Generator creates realistic mix:
{
  'Groceries': 25%,
  'Restaurants': 18%,
  'Transportation': 15%,
  'Healthcare': 10%,
  'Unknown': 5%,
  # ... etc
}
Enter fullscreen mode Exit fullscreen mode

4. Metadata Tracking

{
  "generated_at": "2026-03-13T14:35:22",
  "n_transactions": 1000,
  "category_distribution": {...},
  "amount_stats": {
    "min": 5.50,
    "max": 1200.00,
    "mean": 87.43
  }
}
Enter fullscreen mode Exit fullscreen mode

The Data Generator

Our synthetic data generator creates realistic transactions:

class TransactionGenerator:
    def __init__(self, seed=None):
        if seed:
            np.random.seed(seed)

        self.templates = {
            'Groceries': {
                'merchants': ['walmart', 'costco', 'whole foods'],
                'items': ['grocery', 'bread milk eggs', 'produce'],
                'amount_range': (30, 250),
                'frequency': 0.25
            },
            # ... 8 categories total
        }

    def generate_narration(self, category):
        merchant = np.random.choice(self.templates[category]['merchants'])
        item = np.random.choice(self.templates[category]['items'])

        # Different patterns
        patterns = [
            f"{merchant} {item}",
            f"purchase at {merchant} for {item}",
            f"{item} at {merchant}"
        ]

        narration = np.random.choice(patterns)

        # Sometimes add reference number
        if np.random.random() > 0.7:
            ref = np.random.randint(1000, 9999)
            narration += f" ref#{ref}"

        return narration
Enter fullscreen mode Exit fullscreen mode

Example output:

walmart grocery shopping ref#4521
purchase at cvs pharmacy for prescription
uber ride downtown
coffee at starbucks
Enter fullscreen mode Exit fullscreen mode

Impact on Testing

Before: Tests always passed with static data
After: Tests catch real edge cases

Example bug we caught:

# Bug: Assumed 'amount' always present
def classify(df):
    return df['amount'].abs()  # ❌ Fails if amount is missing

# Fix: Handle missing amounts
def classify(df):
    if 'amount' not in df.columns:
        df['amount'] = 0
    return df['amount'].abs()  # ✓ Works
Enter fullscreen mode Exit fullscreen mode

This bug only appeared with generated data that had missing amounts!


Solution 4: Comprehensive Testing Strategy

The Testing Pyramid

We implemented a complete testing strategy:

           /\
          /  \
         /E2E \          3 tests (5%)
        /______\
       /        \
      /Integration\      7 tests (28%)
     /____________\
    /              \
   /  Unit Tests    \    15 tests (60%)
  /__________________\
Enter fullscreen mode Exit fullscreen mode

Layer 1: Unit Tests

Test individual components in isolation:

# tests/test_classifier.py
class TestKeywordMatching:
    def test_healthcare_classification(self, classifier):
        """Test classification of healthcare transactions."""
        category, confidence = classifier.keyword_match(
            "cvs pharmacy prescription pickup"
        )

        assert category == "Healthcare"
        assert confidence > 0.3
Enter fullscreen mode Exit fullscreen mode

Coverage:

  • Rule-based classification ✓
  • ML feature extraction ✓
  • Confidence scoring ✓
  • Data generation ✓

Why this matters:

  • Fast feedback (< 1 second)
  • Pinpoints exact failures
  • Easy to debug

Layer 2: Integration Tests

Test components working together:

# tests/test_pipeline.py
def test_full_pipeline(tmp_path):
    """Test complete pipeline execution."""
    # Generate data
    generator = TransactionGenerator(seed=42)
    df = generator.generate_transactions(100)

    # Classify
    classifier = AdaptiveNERClassifier()
    results = classifier.classify_batch(df)

    # Verify
    assert len(results) >= 100
    unknown_rate = (results['category'] == 'Unknown').sum() / len(results)
    assert unknown_rate < 0.9  # Less than 90% unknown
Enter fullscreen mode Exit fullscreen mode

What we test:

  • Data → Classifier → Results flow
  • File I/O operations
  • MLflow tracking integration
  • Report generation end-to-end

Layer 3: End-to-End Tests

Test the entire workflow as users would:

def test_github_actions_simulation():
    """Simulate the complete GitHub Actions workflow."""
    # Step 1: Generate data
    subprocess.run([
        'python', 'scripts/generate_sample_data.py',
        '--size', '100',
        '--output', 'data/test.csv'
    ])

    # Step 2: Train model
    subprocess.run([
        'python', 'src/python/train_model.py',
        'data/test.csv'
    ])

    # Step 3: Generate report
    subprocess.run([
        'Rscript', '-e',
        "rmarkdown::render('reports/assessment_report.Rmd')"
    ])

    # Verify outputs exist
    assert Path('models/ner_classifier.pkl').exists()
    assert Path('reports/assessment_report.html').exists()
Enter fullscreen mode Exit fullscreen mode

The Test Fixture Strategy

We use pytest fixtures for shared test data:

# tests/conftest.py
@pytest.fixture
def classifier():
    """Reusable classifier instance."""
    return AdaptiveNERClassifier(rules_path="models/keyword_rules.yaml")

@pytest.fixture
def sample_transactions():
    """Reusable sample data."""
    return pd.DataFrame({
        'narration': [
            'cvs pharmacy prescription',
            'walmart grocery shopping',
            'uber ride downtown'
        ],
        'amount': [45.00, 125.50, 28.00]
    })
Enter fullscreen mode Exit fullscreen mode

Benefits:

  • No code duplication
  • Consistent test data
  • Easy to update globally

Fixing Flaky Tests

Problem: Tests failed intermittently

# Flaky test (bad)
def test_generate_transactions():
    df = generator.generate_transactions(100)
    assert len(df) == 100  # ❌ Sometimes 105 due to unknowns
Enter fullscreen mode Exit fullscreen mode

Solution: Make assertions flexible

# Robust test (good)
def test_generate_transactions():
    df = generator.generate_transactions(100)
    assert 100 <= len(df) <= 110  # ✓ Accounts for ~5% unknowns
Enter fullscreen mode Exit fullscreen mode

Test Coverage Goals

We aimed for 80% coverage on critical paths:

pytest tests/ --cov=src/python --cov-report=term

Name                              Stmts   Miss  Cover
-----------------------------------------------------
src/python/ner_classifier.py        145     12    92%
src/python/train_model.py           89      8    91%
src/python/category_discovery.py    76     15    80%
-----------------------------------------------------
TOTAL                               310     35    87%
Enter fullscreen mode Exit fullscreen mode

Coverage report automatically uploaded to Codecov:

- name: Upload coverage
  uses: codecov/codecov-action@v3
  with:
    file: ./coverage.xml
Enter fullscreen mode Exit fullscreen mode

Result: Beautiful coverage badge in README!

Coverage


Architecture Deep Dive

The Complete Pipeline Flow

┌─────────────────────────────────────────────────────────┐
│                    GitHub Actions                       │
│                   (Trigger: Daily 2 AM)                 │
└────────────┬────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 1: Generate Data (15 seconds)                      │
│  ┌──────────────────────────────────────────┐           │
│  │ Python: generate_sample_data.py          │           │
│  │ Output: transactions_20260313_143522.csv │           │
│  │ Metadata: category distribution, stats   │           │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Artifact Upload
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 2: Run Tests (30 seconds - CACHED)                 │
│  ┌──────────────────────────────────────────┐           │
│  │ pytest tests/ --cov=src/python           │           │
│  │ Coverage: 87%                            │           │
│  │ Upload to Codecov                        │           │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Tests Passed ✓
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 3: Train Model (2 minutes - CACHED)                │
│  ┌──────────────────────────────────────────┐           │
│  │ Rule-Based Classification (68.5%)        │           │
│  │ ↓                                        │           │
│  │ ML Enhancement (+22.7%)                  │           │
│  │ ↓                                        │           │
│  │ Category Discovery (4 new clusters)     │            │
│  │ ↓                                        │           │
│  │ MLflow Logging (metrics, model, artifacts)│          │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Artifact Upload
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 4: Generate Report (90 seconds - CACHED)           │
│  ┌──────────────────────────────────────────┐           │
│  │ R Markdown Rendering                     │           │
│  │ ├─ Load classified_transactions.csv     │            │ 
│  │ ├─ Calculate statistics                 │            │
│  │ ├─ Create 12 interactive charts         │            │
│  │ ├─ Generate recommendations             │            │
│  │ └─ Output: assessment_report.html       │            │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │ Artifact Upload
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 5: Deploy to GitHub Pages (30 seconds)             │
│  ┌──────────────────────────────────────────┐           │
│  │ Create dashboard index.html              │           │
│  │ Generate reports manifest.json           │           │
│  │ Push to gh-pages branch                  │           │
│  │ ↓                                        │           │
│  │ Live at: https://username.github.io/repo/│           │
│  └──────────────────────────────────────────┘           │
└────────────┬────────────────────────────────────────────┘
             │
             ▼
┌─────────────────────────────────────────────────────────┐
│  Job 6: Notify (5 seconds)                              │
│  ┌──────────────────────────────────────────┐           │
│  │ Check all job statuses                   │           │
│  │ Comment on commit with report link       │           │
│  │ (Optional: Send Slack notification)      │           │
│  └──────────────────────────────────────────┘           │
└─────────────────────────────────────────────────────────┘

Total Time: ~5 minutes (down from 12+ minutes!)
Enter fullscreen mode Exit fullscreen mode

Job Dependencies

Jobs run in parallel when possible:

generate-data        (15s)
    
    ├──→ test       (30s) ──┐
    └──→ train      (2m)  ──┤
         ↓                  │
         generate-report (90s)
         ↓                  │
         deploy-pages    (30s)
         ↓                  │
         notify          (5s) ←┘
Enter fullscreen mode Exit fullscreen mode

Key insight: Tests run in parallel with training prep!

Data Flow

From generation to deployment:

transactions_20260313_143522.csv
    ↓
[Artifact Upload]
    ↓
train_model.py
    ↓
classified_transactions.csv
metrics.json
ner_classifier.pkl
    ↓
[Artifact Upload]
    ↓
assessment_report.Rmd
    ↓
assessment_report_20260313_143522.html
    ↓
[Artifact Upload]
    ↓
GitHub Pages (gh-pages branch)
    ↓
https://username.github.io/repo/
Enter fullscreen mode Exit fullscreen mode

Caching Strategy Visualization

First Run (Cold Cache):
├─ Python packages:    4m 30s  → Cache MISS → Download & Cache
├─ R packages:         6m 15s  → Cache MISS → Download & Cache
├─ Pytest:             30s     → Cache MISS → Run & Cache
└─ Total:              12m 50s

Second Run (Warm Cache):
├─ Python packages:    15s     → Cache HIT  → Restore
├─ R packages:         20s     → Cache HIT  → Restore
├─ Pytest:             5s      → Cache HIT  → Restore
└─ Total:              4m 45s

Speedup: 2.7x faster!
Enter fullscreen mode Exit fullscreen mode

Performance Metrics: Before vs After

Build Time Comparison

Component Before After Improvement
Python Setup 4m 30s 15s 18x faster
R Setup 6m 15s 20s 18.75x faster
Test Execution 1m 20s 30s 2.67x faster
Model Training 3m 0s 2m 0s 1.5x faster
Report Generation 2m 45s 1m 30s 1.83x faster
Total 12m 50s 4m 35s 2.8x faster

Cost Analysis

Before:

12.85 minutes × 30 runs/month = 385.5 minutes/month
GitHub Actions: 2,000 free minutes/month
Usage: 19.3% of quota
Enter fullscreen mode Exit fullscreen mode

After:

4.58 minutes × 30 runs/month = 137.4 minutes/month
GitHub Actions: 2,000 free minutes/month  
Usage: 6.9% of quota
Enter fullscreen mode Exit fullscreen mode

Benefit: Can run 4.3x more workflows within free tier!

Cache Hit Rates

After 30 days of production use:

Cache Type Hit Rate Avg Save Time
Python packages 95% 4m 15s
R packages 90% 5m 55s
Pytest 85% 25s
MLflow artifacts 80% 10s

Overall Cache Effectiveness: 91% hit rate

Resource Usage

Artifact Storage:

Before (no compression):

├─ Transaction data: 500 KB × 30 = 15 MB
├─ Model artifacts:  5 MB × 30 = 150 MB
├─ Reports:          8 MB × 30 = 240 MB
└─ Total:                       405 MB/month
Enter fullscreen mode Exit fullscreen mode

After (with compression):

├─ Transaction data: 100 KB × 30 = 3 MB     (80% reduction)
├─ Model artifacts:  2 MB × 30 = 60 MB     (60% reduction)
├─ Reports:          3 MB × 30 = 90 MB     (62% reduction)
└─ Total:                       153 MB/month (62% total reduction)
Enter fullscreen mode Exit fullscreen mode

Compression settings:

- uses: actions/upload-artifact@v4
  with:
    compression-level: 9  # Maximum compression
    retention-days: 7     # Reduced from 30
Enter fullscreen mode Exit fullscreen mode

User Experience Metrics

Metric Before After Improvement
Time to first report 15 min 5 min 3x faster
Dashboard load time 2.5s 0.8s 3.1x faster
Date display "Invalid" "Mar 13" Fixed!
Report freshness Manual Auto 100% automated

Lessons Learned

1. Cache Aggressively, Invalidate Carefully

Lesson: Cache everything that doesn't change between runs.

But: Have a clear invalidation strategy.

# Good: Semantic versioning
CACHE_VERSION: v1  # Bump when you need fresh cache

# Good: Hash-based keys
key: ${{ hashFiles('requirements.txt') }}

# Bad: Time-based keys
key: cache-${{ github.run_number }}  # Never hits!
Enter fullscreen mode Exit fullscreen mode

Mistake we made: Initially cached without version numbers. When packages updated, we got stale dependencies.

Fix: Added CACHE_VERSION environment variable.

2. ISO 8601 for All Timestamps

Lesson: Always use ISO 8601 format for timestamps.

# Good
datetime.now().isoformat()  # "2026-03-13T14:35:22.123456"

# Bad
datetime.now().strftime('%Y%m%d_%H%M%S')  # "20260313_143522"
Enter fullscreen mode Exit fullscreen mode

Why: ISO 8601 is:

  • Universally parseable
  • Sortable lexicographically
  • Timezone-aware
  • JSON-friendly

Cost of not doing this: Hours debugging "Invalid Date"!

3. Test with Production-Like Data

Lesson: Generate test data dynamically, not statically.

Before: Tests used committed sample_data.csv
After: Tests use freshly generated data each run

Benefits:

  • Catches edge cases
  • Validates data generator
  • Prevents overfitting to test data

Example bug caught:

# This passed with static data:
assert df['category'].nunique() == 8

# But failed with generated data (only 7 categories present)
# Fix: 
assert df['category'].nunique() >= 5  # At least 5 categories
Enter fullscreen mode Exit fullscreen mode

4. Parallel Jobs Where Possible

Lesson: Dependencies create bottlenecks. Parallelize what you can.

Before:

generate → test → train → report → deploy
(all sequential, 12 minutes)
Enter fullscreen mode Exit fullscreen mode

After:

generate → test ─┐
          ├────→ train → report → deploy
          └────→ [other jobs]
(parallel where possible, 5 minutes)
Enter fullscreen mode Exit fullscreen mode

Key: Use needs: carefully:

test:
  needs: [generate-data]  # Only wait for data

train-model:
  needs: [generate-data, test]  # Wait for both
Enter fullscreen mode Exit fullscreen mode

5. Fail Fast, Fail Clearly

Lesson: When tests fail, make it obvious WHY.

Bad error message:

AssertionError: assert False
Enter fullscreen mode Exit fullscreen mode

Good error message:

assert category == "Groceries", \
    f"Expected 'Groceries', got '{category}'. " \
    f"Narration: '{text}', Confidence: {confidence}"

# Output:
# AssertionError: Expected 'Groceries', got 'Unknown'. 
# Narration: 'walmart shopping', Confidence: 0.0
Enter fullscreen mode Exit fullscreen mode

Now we know:

  1. What failed (category assertion)
  2. Expected vs actual values
  3. Context (the narration text)
  4. Why it failed (zero confidence)

6. Monitor Cache Effectiveness

Lesson: Track cache hit rates over time.

We added logging:

- name: Check cache status
  run: |
    if [ "${{ steps.cache.outputs.cache-hit }}" == "true" ]; then
      echo "✓ Cache hit!"
    else
      echo "✗ Cache miss - downloading packages"
    fi
Enter fullscreen mode Exit fullscreen mode

Metric to watch: Cache hit rate should be >85%.

If lower:

  • Cache keys might be too specific
  • Dependencies changing too frequently
  • Cache size limits reached

7. Optimize Artifact Retention

Lesson: Keep what you need, delete what you don't.

# Before: Everything kept 90 days
retention-days: 90

# After: Tiered retention
- Transaction data: 7 days   # Regenerable
- Model artifacts: 30 days   # Useful for comparison
- Reports: 90 days           # Want history
Enter fullscreen mode Exit fullscreen mode

Savings: 62% reduction in storage costs!

8. Documentation is Code

Lesson: README is as important as the code itself.

Investment:

  • 2 hours writing comprehensive README
  • 30 minutes on deployment guide
  • 1 hour on troubleshooting section

Return:

  • Zero support questions about setup
  • Contributors could onboard in <5 minutes
  • Reduced deployment issues by 90%

9. Start with POC, Iterate to Production

Lesson: Don't try to build everything at once.

Our journey:

  1. Week 1: Basic classifier (rule-based only)
  2. Week 2: Add ML enhancement
  3. Week 3: Manual reporting
  4. Week 4: GitHub Actions automation
  5. Week 5: Add caching & optimization
  6. Week 6: Polish UX, fix bugs

Key: Each week added value. No "big bang" release.

10. Open Source Everything

Lesson: Making it public improved quality.

Before open source:

  • Hardcoded paths
  • No documentation
  • Quick hacks everywhere

After open source:

  • Configurable
  • Well-documented

- Production-ready code

Conclusion

What We Accomplished

Starting from a proof-of-concept, we built a production-grade ML pipeline that:

✅ Runs 3x faster with intelligent caching
Costs $0/month on GitHub Actions free tier
✅ Generates fresh data automatically

Deploys reports to the web autonomously
✅ Achieves 91.2% classification accuracy
Discovers new categories without supervision
✅ Provides full MLOps tracking with MLflow
✅ Has 87% test coverage
✅ Runs 24/7 without human intervention

The Numbers

Metric Value
Pipeline Runtime 4min 35s (was 12min 50s)
Speedup 2.8x faster
Cost $0/month
Test Coverage 87%
Classification Accuracy 91.2%
Cache Hit Rate 95%
Lines of Code ~3,500
Time to Deploy < 5 minutes

Key Takeaways

  1. Cache Everything - 95% hit rate = 2.8x speedup
  2. Use ISO 8601 - Saved hours of debugging
  3. Dynamic Data - Caught bugs static tests missed
  4. Fail Fast - Clear errors save time
  5. Document Well - README as important as code

The Technology Stack

Languages & Frameworks:

  • Python 3.9 (ML/NLP)
  • R 4.3 (Statistics/Reporting)
  • YAML (Configuration)
  • Markdown (Documentation)

ML & Data:

  • scikit-learn (Classification)
  • pandas (Data manipulation)
  • NLTK (Text processing)
  • MLflow (Experiment tracking)

DevOps:

  • GitHub Actions (CI/CD)
  • GitHub Pages (Hosting)
  • Codecov (Coverage tracking)
  • Docker (Future deployment)

Visualization:

  • R Markdown (Reports)
  • Plotly (Interactive charts)
  • ggplot2 (Static charts)
  • DT (Data tables)

Resources

Live Demo:

Documentation:

  • README: Comprehensive setup guide
  • CI/CD Guide: Workflow customization
  • API Docs: Classifier usage
  • Contributing: How to contribute

Contact:

- LinkedIn: https://www.linkedin.com/in/daniel-amah-2559a4159/

Acknowledgments

Built with:

  • Lots of coffee ☕
  • Many debugging sessions
  • Great community feedback
  • Passion for MLOps

Special thanks to:

  • GitHub Actions team for free CI/CD
  • MLflow community for excellent tools
  • R/RStudio team for amazing reporting
  • scikit-learn contributors
  • Everyone who contributed feedback

Full Code: https://github.com/AkanimohOD19A/productionizing_NER

Built with ❤️ using Python, R, MLflow,GitHub Actions and a lot of Love

Last updated: March 2026

Top comments (0)