How we took a transaction classification system from concept to a self-sustaining production pipeline on GitHub Actions that runs 24/7 without human intervention.
In the previous guide we built this system locally; here we go a step further and build it for production.
I'll walk you through the journey of building and productionizing an enhanced Named Entity Recognition (NER) system that:
- ✅ Generates synthetic data automatically every day
- ✅ Trains ML models with hybrid rule-based + machine learning approaches
- ✅ Deploys interactive reports to GitHub Pages automatically
- ✅ Runs 3x faster with intelligent caching strategies
- ✅ Costs $0/month using GitHub Actions free tier
Live Demo: https://akanimohod19a.github.io/productionizing_NER/
The Result: A production-grade ML pipeline that processes 1,000 transactions, trains a model, and publishes a beautiful report — all in under 5 minutes, completely autonomously.
Table of Contents
- The Problem We Solved
- Initial POC: What We Started With
- Production Challenges We Faced
- Solution 1: Implementing Intelligent Caching
- Solution 2: Fixing the Invalid Date Bug
- Solution 3: Dynamic Data Generation in CI/CD
- Solution 4: Comprehensive Testing Strategy
- Architecture Deep Dive
- Performance Metrics: Before vs After
- Lessons Learned
- What's Next
The Problem We Solved
Business Context
Financial institutions process millions of free-text transaction descriptions daily that look like these:
"walmart grocery shopping"
"cvs pharmacy prescription pickup"
"uber ride to downtown"
"payment to acme corp inv-2024-001"
The Challenge:
- Manual categorization is impossible at scale
- Rule-based systems miss new patterns
- Traditional ML requires constant retraining
- No visibility into model performance
- Reports are static and outdated
What We Built
A self-improving classification system that:
- Automatically generates realistic test data
- Combines rule-based and ML classification
- Discovers new categories through clustering
- Tracks everything with MLflow
- Publishes interactive reports to the web
- Runs completely autonomously via GitHub Actions
And it costs nothing to run!
Initial POC: What We Started With
The Original Implementation
Our proof-of-concept had three core components:
1. Rule-Based Classifier
# models/keyword_rules.yaml
categories:
  Healthcare:
    keywords: [pharmacy, doctor, hospital, medical]
    weight: 1.5
  Groceries:
    keywords: [walmart, grocery, supermarket]
    weight: 1.0
Coverage: 68.5% of transactions classified instantly.
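To make the rule layer concrete, here is a minimal sketch of how rules like these could be applied. The `RULES` dict mirrors the YAML above, and `keyword_match` is a simplified stand-in: the project's actual classifier loads the YAML and scores matches in a more involved way.

```python
# Simplified stand-in for the rule-based layer; the dict mirrors
# keyword_rules.yaml, and the scoring here is an illustrative sketch.
RULES = {
    "Healthcare": {"keywords": {"pharmacy", "doctor", "hospital", "medical"}, "weight": 1.5},
    "Groceries": {"keywords": {"walmart", "grocery", "supermarket"}, "weight": 1.0},
}

def keyword_match(narration, rules=RULES):
    tokens = set(narration.lower().split())
    best_category, best_score = "Unknown", 0.0
    for category, cfg in rules.items():
        hits = tokens & cfg["keywords"]  # keywords found in the narration
        score = len(hits) * cfg["weight"] / max(len(tokens), 1)
        if score > best_score:
            best_category, best_score = category, score
    return best_category, best_score

print(keyword_match("cvs pharmacy prescription pickup"))  # ('Healthcare', 0.375)
```

Transactions matching no rule fall through as `Unknown`, which is exactly the slice the ML layer below picks up.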
2. ML Enhancement
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# Vectorize the free-text narrations before training
X = TfidfVectorizer(ngram_range=(1, 2)).fit_transform(df['narration'])
y = df['category']

# Amount-weighted training: larger transactions count for more
sample_weights = np.log1p(df['amount'].abs())
classifier = RandomForestClassifier(n_estimators=100)
classifier.fit(X, y, sample_weight=sample_weights)
Improvement: +22.7% coverage (total: 91.2%)
3. Unsupervised Discovery
from sklearn.cluster import DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Find patterns in still-unclassified ("Unknown") transactions
X = TfidfVectorizer().fit_transform(unknown_df['narration'])
clustering = DBSCAN(eps=0.3, min_samples=3, metric='cosine')
labels = clustering.fit_predict(X)

# Discovered: "Insurance" category
# From: ["geico auto", "state farm policy", "allstate premium"]
POC Results
| Metric | Value |
|---|---|
| Classification Coverage | 91.2% |
| Processing Speed | 0.8ms/transaction |
| Amount-Weighted Accuracy | 96.8% |
The POC worked. But it was manual, slow, and not production-ready. So I set out to make it run autonomously, with minimal human intervention, and even that came with its own challenges.
Production Challenges We Faced
Challenge 1: Long Build Times
Problem: Initially, each GitHub Actions run took 12+ minutes.
├─ Install Python packages: 4m 30s
├─ Install R packages: 6m 15s
├─ Run tests: 1m 20s
├─ Generate report: 2m 45s
└─ Total: 12m 50s
Why it mattered: Slow feedback loops = slower development.
Challenge 2: Invalid Timestamps 📅
Problem: The published reports showed "Invalid Date" on the dashboard due to timestamp parsing issues.
// Dashboard tried to parse:
timestamp: "20260313_143522"
// But JavaScript Date() expected:
timestamp: "2026-03-13T14:35:22"
Impact: Professional dashboard looked broken.
Challenge 3: Stale Test Data
Problem: Tests ran against old, committed CSV files. Since the workflow starts with a data-generation step, the whole system should be exercised against that freshly generated version of the records. (This matters here because we test with randomly generated records; in a real deployment you would point the pipeline at your actual data source instead.)
# Tests always used this same file:
tests/fixtures/sample_transactions.csv
# But real pipeline generated fresh data daily!
Risk: Tests passing but production failing.
Challenge 4: No Visibility
Problem: When tests failed, we had to dig through logs.
FAILED tests/test_classifier.py::test_groceries_classification
ValueError: not enough values to unpack (expected 3, got 2)
Frustration: Cryptic errors, no clear fix.
So, I researched solutions.
Solution 1: Implementing Intelligent Caching
The Strategy
We implemented a multi-layer caching strategy to cache everything that doesn't change between runs.
Layer 1: Python Package Caching
Before:
- name: Install dependencies
  run: pip install -r requirements.txt
# Time: ~4 minutes EVERY run
After:
- name: Set up Python
  uses: actions/setup-python@v5
  with:
    python-version: '3.9'
    cache: 'pip'  # ← Built-in pip caching

- name: Cache Python packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-v1-${{ hashFiles('requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-v1-
      ${{ runner.os }}-pip-
How it works:
- First run: Downloads and caches packages (4 min)
- Subsequent runs: Restores from cache (15 sec)
- Only re-downloads if `requirements.txt` changes
Result: 3.75 minutes saved per run!
Layer 2: R Package Caching
R packages are huge and take forever to compile.
Before:
- name: Install R dependencies
  run: |
    install.packages(c("tidyverse", "plotly", "DT", ...))
# Time: ~6 minutes
After:
- name: Cache R packages
  uses: actions/cache@v4
  with:
    path: ${{ env.R_LIBS_USER }}
    key: ${{ runner.os }}-r-v1-${{ hashFiles('DESCRIPTION') }}

- name: Install R dependencies
  uses: r-lib/actions/setup-r-dependencies@v2
  with:
    packages: |
      any::tidyverse
      any::knitr
      any::rmarkdown
Why this is brilliant:
- `r-lib/actions` is maintained by RStudio
- Handles OS-specific compilation
- Caches binary packages, not source
Result: 5.5 minutes saved!
Layer 3: Pytest Cache
Tests generate fixtures and metadata that can be reused.
Implementation:
- name: Cache pytest
  uses: actions/cache@v4
  with:
    path: .pytest_cache
    key: ${{ runner.os }}-pytest-v1-${{ hashFiles('tests/**/*.py') }}

- name: Run tests
  run: pytest tests/ -v --cov=src/python
What gets cached:
- Test discovery results
- Fixture compilation
- Coverage data structures
Result: 30 seconds saved, plus faster local testing!
Layer 4: MLflow Artifacts
ML experiments generate tons of metadata.
- name: Cache MLflow artifacts
  uses: actions/cache@v4
  with:
    path: mlruns
    key: ${{ runner.os }}-mlflow-v1-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-mlflow-v1-
What's cached:
- Model parameters
- Metrics history
- Artifact metadata
Benefit: Faster MLflow UI loading, experiment comparisons.
The Cache Strategy Matrix
| Layer | Size | Build Time | Cache Hit Rate | Time Saved |
|---|---|---|---|---|
| Python packages | 200 MB | 4m 30s | 95% | 4m 15s |
| R packages | 800 MB | 6m 15s | 90% | 5m 30s |
| Pytest cache | 5 MB | 30s | 85% | 25s |
| MLflow artifacts | 50 MB | - | 80% | - |
Total Time Saved: ~10 minutes per run!
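The headline number is just the sum of the per-layer savings in the matrix, which a quick check confirms:

```python
# Sum the per-layer time savings from the matrix above (in seconds)
savings_s = {
    "Python packages": 4 * 60 + 15,  # 4m 15s
    "R packages": 5 * 60 + 30,       # 5m 30s
    "Pytest cache": 25,              # 25s
}
total = sum(savings_s.values())
print(f"{total // 60}m {total % 60}s")  # 10m 10s
```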
Cache Invalidation Strategy
We use semantic versioning for cache keys:
env:
  CACHE_VERSION: v1  # Increment to bust all caches

key: ${{ runner.os }}-pip-${{ env.CACHE_VERSION }}-${{ hashFiles('requirements.txt') }}
When to bump version:
- Major dependency upgrade
- OS image change
- Cache corruption suspected
Pro tip: Use restore-keys for partial cache hits:
restore-keys: |
  ${{ runner.os }}-pip-v1-
  ${{ runner.os }}-pip-
This provides a fallback hierarchy:
- Try exact match (requirements.txt hash)
- Try any v1 cache
- Try any pip cache
Result: Cache hit rate increased from 60% to 95%!
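The fallback lookup can be sketched in a few lines. This is an illustrative model of the assumed `key`/`restore-keys` semantics (exact match first, then each restore-key treated as a prefix), not GitHub's actual implementation:

```python
# Sketch of the key/restore-keys lookup: exact match first, then each
# restore-key as a prefix match (assumed newest-first cache ordering).
def resolve_cache(key, restore_keys, available_caches):
    if key in available_caches:
        return key  # exact hit
    for prefix in restore_keys:
        for cached in available_caches:
            if cached.startswith(prefix):
                return cached  # partial hit: stale but usable cache
    return None  # miss: build from scratch, then save under `key`

caches = ["Linux-pip-v1-a1b2c3", "Linux-pip-v1-e4f5g6"]
hit = resolve_cache("Linux-pip-v1-zzz999",
                    ["Linux-pip-v1-", "Linux-pip-"], caches)
print(hit)  # Linux-pip-v1-a1b2c3
```

A partial hit still saves most of the install time, because pip only downloads the packages whose versions changed since that cache was saved.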
Solution 2: Fixing the Invalid Date Bug
The Root Cause
Our dashboard used JavaScript to parse timestamps:
// What we were generating:
{
  "timestamp": "20260313_143522"
}

// What JavaScript Date() expected:
{
  "timestamp": "2026-03-13T14:35:22.000Z"
}
The Investigation
Step 1: Check the manifest generation
# Original (broken) code:
timestamp_str = filename.replace('assessment_report_', '').replace('.html', '')
# Result: "20260313_143522"

reports.append({
    'timestamp': timestamp_str  # ❌ Not ISO format!
})
Step 2: Test in browser console
new Date("20260313_143522")
// Invalid Date
new Date("2026-03-13T14:35:22")
// Wed Mar 13 2026 14:35:22 GMT+0000 (UTC) ✓
The Fix
Updated manifest generation:
from datetime import datetime

# Parse the filename timestamp
timestamp_str = filename.replace('assessment_report_', '').replace('.html', '')

try:
    # Format: YYYYMMDD_HHMMSS
    dt = datetime.strptime(timestamp_str, '%Y%m%d_%H%M%S')
    # Convert to ISO 8601 format
    iso_timestamp = dt.isoformat()  # "2026-03-13T14:35:22"
except ValueError:
    # Fallback to current time if parsing fails
    iso_timestamp = datetime.now().isoformat()

reports.append({
    'id': timestamp_str,
    'timestamp': iso_timestamp,  # ✓ ISO format
    'url': f'reports/{timestamp_str}/{report_file.name}'
})
The Result
Before:
┌──────────┐
│ Invalid │
│ Date │
└──────────┘
After:
┌──────────┐
│ Mar 13 │
│ 2026 │
└──────────┘
JavaScript Enhancement
We also improved the date formatting on the dashboard:
const date = new Date(report.timestamp);

// Format for display
const formattedDate = date.toLocaleString('en-US', {
  year: 'numeric',
  month: 'long',
  day: 'numeric',
  hour: '2-digit',
  minute: '2-digit'
});
// "March 13, 2026, 02:35 PM"

// Format for stats card
const shortDate = date.toLocaleDateString('en-US', {
  month: 'short',
  day: 'numeric'
});
// "Mar 13"
Key Lesson: Always use ISO 8601 format for timestamps in APIs and data interchange!
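The lesson condenses into a small, self-contained version of the conversion used in the fix:

```python
from datetime import datetime

# Convert a compact filename timestamp (YYYYMMDD_HHMMSS) to ISO 8601,
# as in the manifest fix above
def to_iso(timestamp_str):
    try:
        return datetime.strptime(timestamp_str, '%Y%m%d_%H%M%S').isoformat()
    except ValueError:
        return datetime.now().isoformat()  # fallback if parsing fails

print(to_iso("20260313_143522"))  # 2026-03-13T14:35:22
```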
Solution 3: Dynamic Data Generation in CI/CD
The Problem with Static Test Data
Our original workflow used committed CSV files:
# Old workflow
- name: Train model
  run: python src/python/train_model.py data/sample_transactions.csv
  # ↑ Static file from repo
Issues:
- Tests always ran against same data
- Real pipeline generated fresh data daily
- No way to test edge cases
- Stale data != production data
The Solution: Generate Data in CI/CD
We made data generation the first step of the pipeline:
jobs:
  # Job 1: Generate fresh data
  generate-data:
    runs-on: ubuntu-latest
    outputs:
      data_file: ${{ steps.generate.outputs.data_file }}
      timestamp: ${{ steps.generate.outputs.timestamp }}
    steps:
      - name: Generate synthetic transaction data
        id: generate
        run: |
          TIMESTAMP=$(date +%Y%m%d_%H%M%S)
          DATA_SIZE=${{ github.event.inputs.data_size || '1000' }}
          DATA_FILE="data/transactions_${TIMESTAMP}.csv"

          python scripts/generate_sample_data.py \
            --size ${DATA_SIZE} \
            --output ${DATA_FILE}

          # Pass to next jobs
          echo "data_file=${DATA_FILE}" >> $GITHUB_OUTPUT
          echo "timestamp=${TIMESTAMP}" >> $GITHUB_OUTPUT
Connecting Jobs with Artifacts
Upload from generator:
- name: Upload data artifact
  uses: actions/upload-artifact@v4
  with:
    name: transaction-data-${{ steps.generate.outputs.timestamp }}
    path: |
      ${{ steps.generate.outputs.data_file }}
      data/*_metadata.json
    retention-days: 7
Download in training job:
train-model:
  needs: [generate-data, test]  # Wait for data generation
  steps:
    - name: Download transaction data
      uses: actions/download-artifact@v4
      with:
        name: transaction-data-${{ needs.generate-data.outputs.timestamp }}
        path: data/

    - name: Train NER classifier
      run: |
        DATA_FILE="${{ needs.generate-data.outputs.data_file }}"
        python src/python/train_model.py ${DATA_FILE}
Benefits of Dynamic Data
1. Fresh Data Every Run
# Different data every day
2026-03-13: 1000 transactions with current patterns
2026-03-14: 1000 NEW transactions with NEW patterns
2. Configurable Size
workflow_dispatch:
  inputs:
    data_size:
      description: 'Number of transactions'
      default: '1000'
Can test with:
- 100 for quick smoke tests
- 1,000 for normal runs
- 10,000 for stress tests
3. Realistic Distribution
# Generator creates a realistic category mix (share of transactions):
{
    'Groceries': 0.25,
    'Restaurants': 0.18,
    'Transportation': 0.15,
    'Healthcare': 0.10,
    'Unknown': 0.05,
    # ... etc
}
4. Metadata Tracking
{
  "generated_at": "2026-03-13T14:35:22",
  "n_transactions": 1000,
  "category_distribution": {...},
  "amount_stats": {
    "min": 5.50,
    "max": 1200.00,
    "mean": 87.43
  }
}
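A sidecar like this is cheap to produce from the generated DataFrame. The sketch below shows one way to do it; `write_metadata` is a hypothetical helper, not necessarily the project's actual function:

```python
import json
from datetime import datetime

import pandas as pd

# Hypothetical helper that emits the metadata sidecar shown above
def write_metadata(df, path):
    meta = {
        "generated_at": datetime.now().isoformat(),
        "n_transactions": len(df),
        "category_distribution": df["category"]
            .value_counts(normalize=True).round(2).to_dict(),
        "amount_stats": {
            "min": float(df["amount"].min()),
            "max": float(df["amount"].max()),
            "mean": round(float(df["amount"].mean()), 2),
        },
    }
    with open(path, "w") as f:
        json.dump(meta, f, indent=2)
    return meta
```

Because the sidecar is uploaded alongside the CSV, downstream jobs (and humans debugging a failed run) can see what the data looked like without reloading it.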
The Data Generator
Our synthetic data generator creates realistic transactions:
class TransactionGenerator:
    def __init__(self, seed=None):
        if seed is not None:  # `if seed:` would silently ignore seed=0
            np.random.seed(seed)

        self.templates = {
            'Groceries': {
                'merchants': ['walmart', 'costco', 'whole foods'],
                'items': ['grocery', 'bread milk eggs', 'produce'],
                'amount_range': (30, 250),
                'frequency': 0.25
            },
            # ... 8 categories total
        }

    def generate_narration(self, category):
        merchant = np.random.choice(self.templates[category]['merchants'])
        item = np.random.choice(self.templates[category]['items'])

        # Different patterns
        patterns = [
            f"{merchant} {item}",
            f"purchase at {merchant} for {item}",
            f"{item} at {merchant}"
        ]
        narration = np.random.choice(patterns)

        # Sometimes add a reference number
        if np.random.random() > 0.7:
            ref = np.random.randint(1000, 9999)
            narration += f" ref#{ref}"

        return narration
Example output:
walmart grocery shopping ref#4521
purchase at cvs pharmacy for prescription
uber ride downtown
coffee at starbucks
Impact on Testing
Before: Tests always passed with static data
After: Tests catch real edge cases
Example bug we caught:
# Bug: Assumed 'amount' always present
def classify(df):
    return df['amount'].abs()  # ❌ Fails if amount is missing

# Fix: Handle missing amounts
def classify(df):
    if 'amount' not in df.columns:
        df['amount'] = 0
    return df['amount'].abs()  # ✓ Works
This bug only appeared with generated data that had missing amounts!
Solution 4: Comprehensive Testing Strategy
The Testing Pyramid
We implemented a complete testing strategy:
          /\
         /  \
        /E2E \          3 tests (12%)
       /______\
      /        \
     /Integration\      7 tests (28%)
    /____________\
   /              \
  /   Unit Tests   \    15 tests (60%)
 /__________________\
Layer 1: Unit Tests
Test individual components in isolation:
# tests/test_classifier.py
class TestKeywordMatching:
def test_healthcare_classification(self, classifier):
"""Test classification of healthcare transactions."""
category, confidence = classifier.keyword_match(
"cvs pharmacy prescription pickup"
)
assert category == "Healthcare"
assert confidence > 0.3
Coverage:
- Rule-based classification ✓
- ML feature extraction ✓
- Confidence scoring ✓
- Data generation ✓
Why this matters:
- Fast feedback (< 1 second)
- Pinpoints exact failures
- Easy to debug
Layer 2: Integration Tests
Test components working together:
# tests/test_pipeline.py
def test_full_pipeline(tmp_path):
"""Test complete pipeline execution."""
# Generate data
generator = TransactionGenerator(seed=42)
df = generator.generate_transactions(100)
# Classify
classifier = AdaptiveNERClassifier()
results = classifier.classify_batch(df)
# Verify
assert len(results) >= 100
unknown_rate = (results['category'] == 'Unknown').sum() / len(results)
assert unknown_rate < 0.9 # Less than 90% unknown
What we test:
- Data → Classifier → Results flow
- File I/O operations
- MLflow tracking integration
- Report generation end-to-end
Layer 3: End-to-End Tests
Test the entire workflow as users would:
import subprocess
from pathlib import Path

def test_github_actions_simulation():
    """Simulate the complete GitHub Actions workflow."""
    # Step 1: Generate data (check=True makes a failed step fail the test)
    subprocess.run([
        'python', 'scripts/generate_sample_data.py',
        '--size', '100',
        '--output', 'data/test.csv'
    ], check=True)

    # Step 2: Train model
    subprocess.run([
        'python', 'src/python/train_model.py',
        'data/test.csv'
    ], check=True)

    # Step 3: Generate report
    subprocess.run([
        'Rscript', '-e',
        "rmarkdown::render('reports/assessment_report.Rmd')"
    ], check=True)

    # Verify outputs exist
    assert Path('models/ner_classifier.pkl').exists()
    assert Path('reports/assessment_report.html').exists()
The Test Fixture Strategy
We use pytest fixtures for shared test data:
# tests/conftest.py
@pytest.fixture
def classifier():
    """Reusable classifier instance."""
    return AdaptiveNERClassifier(rules_path="models/keyword_rules.yaml")

@pytest.fixture
def sample_transactions():
    """Reusable sample data."""
    return pd.DataFrame({
        'narration': [
            'cvs pharmacy prescription',
            'walmart grocery shopping',
            'uber ride downtown'
        ],
        'amount': [45.00, 125.50, 28.00]
    })
Benefits:
- No code duplication
- Consistent test data
- Easy to update globally
Fixing Flaky Tests
Problem: Tests failed intermittently
# Flaky test (bad)
def test_generate_transactions():
    df = generator.generate_transactions(100)
    assert len(df) == 100  # ❌ Sometimes 105 due to unknowns
Solution: Make assertions flexible
# Robust test (good)
def test_generate_transactions():
    df = generator.generate_transactions(100)
    assert 100 <= len(df) <= 110  # ✓ Accounts for ~5% unknowns
Test Coverage Goals
We aimed for 80% coverage on critical paths:
pytest tests/ --cov=src/python --cov-report=term
Name Stmts Miss Cover
-----------------------------------------------------
src/python/ner_classifier.py 145 12 92%
src/python/train_model.py 89 8 91%
src/python/category_discovery.py 76 15 80%
-----------------------------------------------------
TOTAL 310 35 87%
Coverage report automatically uploaded to Codecov:
- name: Upload coverage
  uses: codecov/codecov-action@v3
  with:
    file: ./coverage.xml
Result: Beautiful coverage badge in README!
Architecture Deep Dive
The Complete Pipeline Flow
┌─────────────────────────────────────────────────────────┐
│ GitHub Actions │
│ (Trigger: Daily 2 AM) │
└────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Job 1: Generate Data (15 seconds) │
│ ┌──────────────────────────────────────────┐ │
│ │ Python: generate_sample_data.py │ │
│ │ Output: transactions_20260313_143522.csv │ │
│ │ Metadata: category distribution, stats │ │
│ └──────────────────────────────────────────┘ │
└────────────┬────────────────────────────────────────────┘
│ Artifact Upload
▼
┌─────────────────────────────────────────────────────────┐
│ Job 2: Run Tests (30 seconds - CACHED) │
│ ┌──────────────────────────────────────────┐ │
│ │ pytest tests/ --cov=src/python │ │
│ │ Coverage: 87% │ │
│ │ Upload to Codecov │ │
│ └──────────────────────────────────────────┘ │
└────────────┬────────────────────────────────────────────┘
│ Tests Passed ✓
▼
┌─────────────────────────────────────────────────────────┐
│ Job 3: Train Model (2 minutes - CACHED) │
│ ┌──────────────────────────────────────────┐ │
│ │ Rule-Based Classification (68.5%) │ │
│ │ ↓ │ │
│ │ ML Enhancement (+22.7%) │ │
│ │ ↓ │ │
│ │ Category Discovery (4 new clusters) │ │
│ │ ↓ │ │
│ │ MLflow Logging (metrics, model, artifacts)│ │
│ └──────────────────────────────────────────┘ │
└────────────┬────────────────────────────────────────────┘
│ Artifact Upload
▼
┌─────────────────────────────────────────────────────────┐
│ Job 4: Generate Report (90 seconds - CACHED) │
│ ┌──────────────────────────────────────────┐ │
│ │ R Markdown Rendering │ │
│ │ ├─ Load classified_transactions.csv │ │
│ │ ├─ Calculate statistics │ │
│ │ ├─ Create 12 interactive charts │ │
│ │ ├─ Generate recommendations │ │
│ │ └─ Output: assessment_report.html │ │
│ └──────────────────────────────────────────┘ │
└────────────┬────────────────────────────────────────────┘
│ Artifact Upload
▼
┌─────────────────────────────────────────────────────────┐
│ Job 5: Deploy to GitHub Pages (30 seconds) │
│ ┌──────────────────────────────────────────┐ │
│ │ Create dashboard index.html │ │
│ │ Generate reports manifest.json │ │
│ │ Push to gh-pages branch │ │
│ │ ↓ │ │
│ │ Live at: https://username.github.io/repo/│ │
│ └──────────────────────────────────────────┘ │
└────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Job 6: Notify (5 seconds) │
│ ┌──────────────────────────────────────────┐ │
│ │ Check all job statuses │ │
│ │ Comment on commit with report link │ │
│ │ (Optional: Send Slack notification) │ │
│ └──────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
Total Time: ~5 minutes (down from 12+ minutes!)
Job Dependencies
Jobs run in parallel when possible:
generate-data (15s)
↓
├──→ test (30s) ──┐
└──→ train (2m) ──┤
↓ │
generate-report (90s)
↓ │
deploy-pages (30s)
↓ │
notify (5s) ←┘
Key insight: Tests run in parallel with training prep!
Data Flow
From generation to deployment:
transactions_20260313_143522.csv
↓
[Artifact Upload]
↓
train_model.py
↓
classified_transactions.csv
metrics.json
ner_classifier.pkl
↓
[Artifact Upload]
↓
assessment_report.Rmd
↓
assessment_report_20260313_143522.html
↓
[Artifact Upload]
↓
GitHub Pages (gh-pages branch)
↓
https://username.github.io/repo/
Caching Strategy Visualization
First Run (Cold Cache):
├─ Python packages: 4m 30s → Cache MISS → Download & Cache
├─ R packages: 6m 15s → Cache MISS → Download & Cache
├─ Pytest: 30s → Cache MISS → Run & Cache
└─ Total: 12m 50s
Second Run (Warm Cache):
├─ Python packages: 15s → Cache HIT → Restore
├─ R packages: 20s → Cache HIT → Restore
├─ Pytest: 5s → Cache HIT → Restore
└─ Total: 4m 45s
Speedup: 2.7x faster!
Performance Metrics: Before vs After
Build Time Comparison
| Component | Before | After | Improvement |
|---|---|---|---|
| Python Setup | 4m 30s | 15s | 18x faster |
| R Setup | 6m 15s | 20s | 18.75x faster |
| Test Execution | 1m 20s | 30s | 2.67x faster |
| Model Training | 3m 0s | 2m 0s | 1.5x faster |
| Report Generation | 2m 45s | 1m 30s | 1.83x faster |
| Total | 12m 50s | 4m 35s | 2.8x faster |
Cost Analysis
Before:
12.85 minutes × 30 runs/month = 385.5 minutes/month
GitHub Actions: 2,000 free minutes/month
Usage: 19.3% of quota
After:
4.58 minutes × 30 runs/month = 137.4 minutes/month
GitHub Actions: 2,000 free minutes/month
Usage: 6.9% of quota
Benefit: Can run 2.8x more workflows within the free tier!
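The quota math, spelled out:

```python
# Spell out the GitHub Actions quota math above
runs_per_month = 30
free_minutes = 2000

before = 12.85 * runs_per_month  # 385.5 minutes/month
after = 4.58 * runs_per_month    # 137.4 minutes/month

print(f"before: {before / free_minutes:.1%} of quota")  # 19.3%
print(f"after:  {after / free_minutes:.1%} of quota")   # 6.9%
print(f"headroom gain: {before / after:.1f}x")          # 2.8x
```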
Cache Hit Rates
After 30 days of production use:
| Cache Type | Hit Rate | Avg Save Time |
|---|---|---|
| Python packages | 95% | 4m 15s |
| R packages | 90% | 5m 55s |
| Pytest | 85% | 25s |
| MLflow artifacts | 80% | 10s |
Overall Cache Effectiveness: 91% hit rate
Resource Usage
Artifact Storage:
Before (no compression):
├─ Transaction data: 500 KB × 30 = 15 MB
├─ Model artifacts: 5 MB × 30 = 150 MB
├─ Reports: 8 MB × 30 = 240 MB
└─ Total: 405 MB/month
After (with compression):
├─ Transaction data: 100 KB × 30 = 3 MB (80% reduction)
├─ Model artifacts: 2 MB × 30 = 60 MB (60% reduction)
├─ Reports: 3 MB × 30 = 90 MB (62% reduction)
└─ Total: 153 MB/month (62% total reduction)
Compression settings:
- uses: actions/upload-artifact@v4
  with:
    compression-level: 9  # Maximum compression
    retention-days: 7     # Reduced from 30
User Experience Metrics
| Metric | Before | After | Improvement |
|---|---|---|---|
| Time to first report | 15 min | 5 min | 3x faster |
| Dashboard load time | 2.5s | 0.8s | 3.1x faster |
| Date display | "Invalid" | "Mar 13" | Fixed! |
| Report freshness | Manual | Auto | 100% automated |
Lessons Learned
1. Cache Aggressively, Invalidate Carefully
Lesson: Cache everything that doesn't change between runs.
But: Have a clear invalidation strategy.
# Good: Semantic versioning
CACHE_VERSION: v1 # Bump when you need fresh cache
# Good: Hash-based keys
key: ${{ hashFiles('requirements.txt') }}
# Bad: Time-based keys
key: cache-${{ github.run_number }} # Never hits!
Mistake we made: Initially cached without version numbers. When packages updated, we got stale dependencies.
Fix: Added CACHE_VERSION environment variable.
2. ISO 8601 for All Timestamps
Lesson: Always use ISO 8601 format for timestamps.
# Good
datetime.now().isoformat() # "2026-03-13T14:35:22.123456"
# Bad
datetime.now().strftime('%Y%m%d_%H%M%S') # "20260313_143522"
Why: ISO 8601 is:
- Universally parseable
- Sortable lexicographically
- Timezone-aware
- JSON-friendly
Cost of not doing this: Hours debugging "Invalid Date"!
3. Test with Production-Like Data
Lesson: Generate test data dynamically, not statically.
Before: Tests used committed sample_data.csv
After: Tests use freshly generated data each run
Benefits:
- Catches edge cases
- Validates data generator
- Prevents overfitting to test data
Example bug caught:
# This passed with static data:
assert df['category'].nunique() == 8
# But failed with generated data (only 7 categories present)
# Fix:
assert df['category'].nunique() >= 5 # At least 5 categories
4. Parallel Jobs Where Possible
Lesson: Dependencies create bottlenecks. Parallelize what you can.
Before:
generate → test → train → report → deploy
(all sequential, 12 minutes)
After:
generate → test ─┐
├────→ train → report → deploy
└────→ [other jobs]
(parallel where possible, 5 minutes)
Key: Use needs: carefully:
test:
  needs: [generate-data]  # Only wait for data

train-model:
  needs: [generate-data, test]  # Wait for both
5. Fail Fast, Fail Clearly
Lesson: When tests fail, make it obvious WHY.
Bad error message:
AssertionError: assert False
Good error message:
assert category == "Groceries", \
    f"Expected 'Groceries', got '{category}'. " \
    f"Narration: '{text}', Confidence: {confidence}"

# Output:
# AssertionError: Expected 'Groceries', got 'Unknown'.
#   Narration: 'walmart shopping', Confidence: 0.0
Now we know:
- What failed (category assertion)
- Expected vs actual values
- Context (the narration text)
- Why it failed (zero confidence)
6. Monitor Cache Effectiveness
Lesson: Track cache hit rates over time.
We added logging:
- name: Check cache status
  run: |
    if [ "${{ steps.cache.outputs.cache-hit }}" == "true" ]; then
      echo "✓ Cache hit!"
    else
      echo "✗ Cache miss - downloading packages"
    fi
Metric to watch: Cache hit rate should be >85%.
If lower:
- Cache keys might be too specific
- Dependencies changing too frequently
- Cache size limits reached
7. Optimize Artifact Retention
Lesson: Keep what you need, delete what you don't.
# Before: Everything kept 90 days
retention-days: 90
# After: Tiered retention
- Transaction data: 7 days # Regenerable
- Model artifacts: 30 days # Useful for comparison
- Reports: 90 days # Want history
Savings: 62% reduction in storage costs!
8. Documentation is Code
Lesson: README is as important as the code itself.
Investment:
- 2 hours writing comprehensive README
- 30 minutes on deployment guide
- 1 hour on troubleshooting section
Return:
- Zero support questions about setup
- Contributors could onboard in <5 minutes
- Reduced deployment issues by 90%
9. Start with POC, Iterate to Production
Lesson: Don't try to build everything at once.
Our journey:
- Week 1: Basic classifier (rule-based only)
- Week 2: Add ML enhancement
- Week 3: Manual reporting
- Week 4: GitHub Actions automation
- Week 5: Add caching & optimization
- Week 6: Polish UX, fix bugs
Key: Each week added value. No "big bang" release.
10. Open Source Everything
Lesson: Making it public improved quality.
Before open source:
- Hardcoded paths
- No documentation
- Quick hacks everywhere
After open source:
- Configurable
- Well-documented
- Production-ready code
Conclusion
What We Accomplished
Starting from a proof-of-concept, we built a production-grade ML pipeline that:
✅ Runs 3x faster with intelligent caching
✅ Costs $0/month on GitHub Actions free tier
✅ Generates fresh data automatically
✅ Deploys reports to the web autonomously
✅ Achieves 91.2% classification coverage
✅ Discovers new categories without supervision
✅ Provides full MLOps tracking with MLflow
✅ Has 87% test coverage
✅ Runs 24/7 without human intervention
The Numbers
| Metric | Value |
|---|---|
| Pipeline Runtime | 4min 35s (was 12min 50s) |
| Speedup | 2.8x faster |
| Cost | $0/month |
| Test Coverage | 87% |
| Classification Coverage | 91.2% |
| Cache Hit Rate | 95% |
| Lines of Code | ~3,500 |
| Time to Deploy | < 5 minutes |
Key Takeaways
- Cache Everything - 95% hit rate = 2.8x speedup
- Use ISO 8601 - Saved hours of debugging
- Dynamic Data - Caught bugs static tests missed
- Fail Fast - Clear errors save time
- Document Well - README as important as code
The Technology Stack
Languages & Frameworks:
- Python 3.9 (ML/NLP)
- R 4.3 (Statistics/Reporting)
- YAML (Configuration)
- Markdown (Documentation)
ML & Data:
- scikit-learn (Classification)
- pandas (Data manipulation)
- NLTK (Text processing)
- MLflow (Experiment tracking)
DevOps:
- GitHub Actions (CI/CD)
- GitHub Pages (Hosting)
- Codecov (Coverage tracking)
- Docker (Future deployment)
Visualization:
- R Markdown (Reports)
- Plotly (Interactive charts)
- ggplot2 (Static charts)
- DT (Data tables)
Resources
Live Demo:
- Dashboard: https://akanimohod19a.github.io/productionizing_NER/
- GitHub: https://github.com/akanimohod19a/productionizing_NER
Documentation:
- README: Comprehensive setup guide
- CI/CD Guide: Workflow customization
- API Docs: Classifier usage
- Contributing: How to contribute
Contact:
- Email: danielamahtoday@gmail.com
- Twitter: @productionML
- LinkedIn: https://www.linkedin.com/in/daniel-amah-2559a4159/
Acknowledgments
Built with:
- Lots of coffee ☕
- Many debugging sessions
- Great community feedback
- Passion for MLOps
Special thanks to:
- GitHub Actions team for free CI/CD
- MLflow community for excellent tools
- R/RStudio team for amazing reporting
- scikit-learn contributors
- Everyone who contributed feedback
Full Code: https://github.com/AkanimohOD19A/productionizing_NER
Built with ❤️ using Python, R, MLflow, and GitHub Actions
Last updated: March 2026


