Cayman Roden
Why Your AI Portfolio Needs Observability, Not More Repos

I have 15 GitHub repos and 10,800+ automated tests. My freelance rate was stuck at $40-55/hr. After running a multi-model research pipeline (Perplexity, Gemini, Grok, ChatGPT) to figure out what to build next, every model converged on the same answer:

Stop building. Start observing.

The production deployment gap is the #1 constraint on AI engineer hiring signal and rates in 2026. Only 11% of enterprises have deployed AI agents to production despite 66% experimenting. The engineers who can prove their systems work at runtime (not just at test time) are 10x rarer than the ones who can build the prototype.

Here is what I changed and what I learned.

The Gap: 1,109 Tests, Zero Traces

My RAG pipeline (DocExtract) had everything a hiring manager might want to see in a repo:

  • Agentic RAG with ReAct reasoning loop
  • Circuit breaker model fallback (Sonnet to Haiku)
  • HITL correction workflow with audit trail
  • RAGAS evaluation + LLM-as-judge CI gate
  • Kubernetes manifests, Terraform IaC, Docker multi-stage builds
  • 1,109 tests at 90%+ coverage

What it did NOT have: any evidence it had ever processed a real request. No traces. No latency dashboards. No cost-per-request visibility. No runtime quality monitoring.

A hiring manager clicking the repo would see impressive architecture docs and green CI badges. But they would have no way to verify the system actually works under real conditions.

The Fix: Sync Sidecar Pattern

The key constraint: observability must never slow down the request path.

The pattern is simple. FastAPI BackgroundTasks handle all trace submission after the response is sent:

from fastapi import BackgroundTasks, FastAPI, Request

app = FastAPI()

@app.post("/extract")
async def extract(req: Request, background_tasks: BackgroundTasks):
    # Open a trace keyed to this request (langfuse_trace is a thin app wrapper)
    trace = langfuse_trace("extraction", session_id=req.state.request_id)
    response = await run_extraction(req, trace)
    # Flush runs after the response is sent -- off the request path
    background_tasks.add_task(langfuse_flush)
    return response

The user gets their extraction result immediately. Langfuse receives the full trace (model, tokens, latency, confidence) in the background. This adds ~0ms to the request path.

Tiered Evaluation: The $0 CI Gate

Running LLM-as-a-judge on every PR is financially unviable for a solo engineer. At $0.003+ per metric per test case, a 50-case golden set costs $50-100/month in CI alone.
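The arithmetic behind that range, assuming three judge metrics and roughly 150 CI runs a month (both illustrative numbers, not measurements from my pipeline):

```python
metrics = 3              # e.g. precision, faithfulness, relevancy
cases = 50               # golden-set size
cost_per_metric = 0.003  # dollars per metric per test case (lower bound)
ci_runs_per_month = 150  # illustrative: a handful of PR pushes per day

cost_per_run = metrics * cases * cost_per_metric
monthly = cost_per_run * ci_runs_per_month
print(f"${cost_per_run:.2f} per run, ${monthly:.2f}/month")  # $0.45 per run, $67.50/month
```

At judge prices above the $0.003 floor, or with more active repos, the monthly bill climbs toward the top of that range fast.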

The solution is tiered evaluation:

Tier 1 (every PR, $0): Deterministic checks only

  • Schema conformance (Pydantic validation)
  • Confidence scores in 0.0-1.0 range
  • Field completeness (no empty extractions)
  • Citation grounding (extracted values appear in source text)
  • Baseline accuracy above 90%
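A minimal sketch of what that Tier 1 gate can look like. Field names and checks are illustrative, not DocExtract's actual schema, and the real schema check uses Pydantic; plain dicts here keep the sketch dependency-free:

```python
REQUIRED = {"field": str, "value": str, "confidence": float}

def run_tier1(extractions: list[dict], source_text: str) -> list[str]:
    """Deterministic Tier 1 checks: no LLM calls, $0 on every PR."""
    failures = []
    for ext in extractions:
        # Schema conformance: required keys present with the right types
        if any(not isinstance(ext.get(k), t) for k, t in REQUIRED.items()):
            failures.append(f"schema violation: {ext}")
            continue
        if not 0.0 <= ext["confidence"] <= 1.0:
            failures.append(f"{ext['field']}: confidence out of range")
        if not ext["value"].strip():
            failures.append(f"{ext['field']}: empty extraction")
        elif ext["value"] not in source_text:  # citation grounding
            failures.append(f"{ext['field']}: value not grounded in source")
    return failures
```

In CI, a non-empty failure list fails the job; the baseline-accuracy check is the same loop run against the golden set's expected values.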

Tier 2 (nightly cron, API cost): LLM-as-a-judge via DeepEval

  • Contextual precision
  • Faithfulness (hallucination detection)
  • Answer relevancy

This gives fast, reliable feedback on every change while reserving expensive quality checks for nightly validation.

PII Sanitization: Non-Negotiable Before Tracing

You cannot send production user data to external tracing services in plain text. Before any trace leaves the application:

import re

# Mask tokens and the PII patterns they replace (simplified examples)
_PATTERNS = [
    ("[SSN]", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
    ("[CARD]", re.compile(r"\b(?:\d[ -]?){13,16}\b")),
    ("[PHONE]", re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")),
    ("[EMAIL]", re.compile(r"\b[\w.+-]+@[\w-]+\.\w+\b")),
]

def sanitize_for_trace(data):
    # Recursively mask strings; dicts are sanitized value by value
    if isinstance(data, str):
        for token, pattern in _PATTERNS:
            data = pattern.sub(token, data)
    elif isinstance(data, dict):
        return {k: sanitize_for_trace(v) for k, v in data.items()}
    return data

Zero dependencies. Deterministic. Runs before every Langfuse trace submission.

The HITL Data Advantage

Human corrections are not just UX. They are data assets.

DocExtract's review queue captures structured corrections: original extraction, corrected fields, error type, reviewer ID. This creates organic training data for future fine-tuning without the typical $2K-8K dataset curation cost.

When correction volume reaches critical mass, this feeds directly into a QLoRA fine-tuning pipeline (DPO pairs already exported in JSONL format).
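A sketch of that export step. The record fields here are illustrative (the article's review queue stores original extraction, corrected fields, error type, and reviewer ID; `source_text` is an assumed prompt field), and the pair layout follows the common prompt/chosen/rejected DPO convention rather than any one trainer's required schema:

```python
import json

def corrections_to_dpo_jsonl(corrections: list[dict]) -> str:
    """Each HITL correction becomes one DPO pair: the reviewer's fix
    is 'chosen', the model's original extraction is 'rejected'."""
    lines = []
    for c in corrections:
        pair = {
            "prompt": c["source_text"],
            "chosen": json.dumps(c["corrected_fields"]),
            "rejected": json.dumps(c["original_extraction"]),
        }
        lines.append(json.dumps(pair))
    return "\n".join(lines)
```

One line per pair means the file can be streamed straight into a QLoRA/DPO training run without an intermediate curation pass.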

What I Would Do Differently

  1. Add Langfuse from day one. Retrofitting observability onto 40+ endpoints is more work than building it in. The Sync Sidecar pattern adds about 10 lines per endpoint.

  2. Start with cloud-managed everything. Self-hosting Langfuse requires ClickHouse + Redis + S3. The cloud free tier (1M spans/month) is the right call for a solo engineer.

  3. Tier the eval strategy earlier. Running DeepEval on every PR sounded good in theory. The tiered approach (deterministic CI + nightly LLM-judge) is the sustainable pattern.

The Real Lesson

The multi-model research consensus was clear: 15 repos with 10K+ tests is already more shipped code than most engineers billing double my rate have published. The gap is not engineering skill. It is observable production signal.

Monitoring and observability are the roughly 70% of production AI work that nobody puts in their portfolio. Adding Langfuse tracing, tiered DeepEval CI gates, PII sanitization, and cost tracking transforms a demo project into a production system.

The code changes took 2 days. The positioning shift is worth 10x that in hiring signal.


Stack: FastAPI, PostgreSQL + pgvector, Redis, Claude API, Langfuse, DeepEval, ARQ, Docker, Kubernetes, GitHub Actions

Repo: github.com/ChunkyTortoise/docextract
