<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Arseniy Potapov</title>
    <description>The latest articles on DEV Community by Arseniy Potapov (@potapov).</description>
    <link>https://dev.to/potapov</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3782696%2F7aefe108-ee8b-42f0-b8ee-de6f4dfe9a14.jpeg</url>
      <title>DEV Community: Arseniy Potapov</title>
      <link>https://dev.to/potapov</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/potapov"/>
    <language>en</language>
    <item>
      <title>Using Claude Code Without Technical Debt</title>
      <dc:creator>Arseniy Potapov</dc:creator>
      <pubDate>Tue, 03 Mar 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/potapov/using-claude-code-without-technical-debt-1a7</link>
      <guid>https://dev.to/potapov/using-claude-code-without-technical-debt-1a7</guid>
      <description>&lt;p&gt;Last month I spent three days debugging race conditions where overlapping transactions interlocked and corrupted shared state. The code passed every test. Linters were clean. Type checks passed. In production, under real load, two users hitting the same resource at the same time broke everything.&lt;/p&gt;

&lt;p&gt;Some of that code was AI-assisted. Not all of it - race conditions don't need AI to exist - but the AI-generated parts had sailed through review because they looked impeccable. Syntactically perfect. Well-structured. Exactly the kind of code you glance at and think "looks good." That's the trap. Auto-approving AI output because it reads well is how you end up spending days staring at transaction logs instead of shipping features.&lt;/p&gt;

&lt;p&gt;That experience crystallized something I'd been feeling for months. AI tools write code that's correct in isolation - functions that do what you asked, following patterns that pass every static check. But production isn't isolation. Production is concurrent users, stale caches, network partitions, and data that doesn't look like your test fixtures. The gap between "works in dev" and "survives production" is where technical debt hides.&lt;/p&gt;

&lt;p&gt;I use Claude Code every day for production AI/ML systems. I'm not here to tell you to stop using AI tools - I'd be a hypocrite. I'm here to share the workflows I've built after learning these lessons the hard way. Two modes of working with AI, a context-priming system that makes output dramatically more predictable, and the red flags I've trained myself to catch before they reach production.&lt;/p&gt;

&lt;p&gt;Here's how I use Claude Code without losing sleep.&lt;/p&gt;

&lt;h2&gt;
  
  
  What AI Gets Wrong (And Why It's Hard to Spot)
&lt;/h2&gt;

&lt;p&gt;AI-generated code has a dangerous property: it looks right. Clean variable names, correct syntax, reasonable structure. It passes lint, passes type checks, often passes tests. The problems are in the things you can't see by reading the code.&lt;/p&gt;

&lt;h3&gt;
  
  
  It Doesn't Understand Your Runtime
&lt;/h3&gt;

&lt;p&gt;AI generates code in a vacuum. It doesn't know that your FastAPI endpoint gets hit by 200 concurrent users during batch processing. It doesn't know that your Celery workers share a database connection pool that saturates under load. It writes code that's correct for a single request and completely wrong for a thousand simultaneous ones.&lt;/p&gt;

&lt;p&gt;My race condition? AI-generated async code where two transactions could hit the same resource within milliseconds. Each transaction was correct in isolation. Together, they interlocked and corrupted shared state. Nothing in the code &lt;em&gt;looks&lt;/em&gt; wrong - the bug is in the timing that only exists under production load.&lt;/p&gt;

&lt;h3&gt;
  
  
  It Doesn't Know Your Architecture
&lt;/h3&gt;

&lt;p&gt;Every Claude Code conversation starts fresh. Without context, it invents patterns. Feature A gets a service layer. Feature B gets business logic inline in the route handler. Feature C introduces a repository pattern nobody asked for. Each one works individually. Together, your codebase becomes an archaeology dig where every layer is a different civilization. This kind of architecture drift is how teams end up in the &lt;a href="https://dev.to/blog/rewrite-vs-refactor/"&gt;rewrite-vs-refactor debate&lt;/a&gt; that costs months instead of weeks.&lt;/p&gt;

&lt;h3&gt;
  
  
  It Over-Engineers
&lt;/h3&gt;

&lt;p&gt;AI loves abstractions. Ask for a simple data processor and you'll get an AbstractBaseProcessor with a StrategyFactory and a PluginRegistry. Code that should be 30 lines becomes 150. AI doesn't feel the maintenance cost of abstraction - it just reaches for the most "proper" solution it's seen in training data.&lt;/p&gt;

&lt;h3&gt;
  
  
  It Misses What's Between the Lines
&lt;/h3&gt;

&lt;p&gt;Business rules that nobody documented. The constraint that user IDs can't change after first payment because three downstream systems cache them. The convention that all async tasks must be idempotent because your queue has at-least-once delivery. AI can't know what it was never told, and the stuff that isn't written down is usually the stuff that matters most.&lt;/p&gt;

&lt;h3&gt;
  
  
  The False Confidence Trap
&lt;/h3&gt;

&lt;p&gt;This is the compounding problem. Because AI code is syntactically clean and well-structured, you trust it more than you should. You review it less carefully. You approve faster. And that's exactly when subtle bugs slip through - not because the AI is bad, but because the code &lt;em&gt;looks&lt;/em&gt; too good to question.&lt;/p&gt;

&lt;p&gt;These aren't reasons to stop using AI tools. They're reasons to use them with a system. The rest of this article is that system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart LR
    subgraph AI["What AI Generated"]
        A1["await db.get(sender)"]
        A2["await db.get(receiver)"]
        A3["if balance &amp;gt;= amount"]
        A4["sender.balance -= amount"]
        A5["await db.commit()"]
        A1 --&amp;gt; A2 --&amp;gt; A3 --&amp;gt; A4 --&amp;gt; A5
    end

    subgraph PROD["What Production Needed"]
        P1["select(...).with_for_update()"]
        P2["Lock rows in sorted order"]
        P3["if balance &amp;gt;= amount"]
        P4["sender.balance -= amount"]
        P5["await db.commit()"]
        P1 --&amp;gt; P2 --&amp;gt; P3 --&amp;gt; P4 --&amp;gt; P5
    end

    AI -. "Passes tests\nbut no row locking" .-&amp;gt; PROD
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Two Modes of Working With AI
&lt;/h2&gt;

&lt;p&gt;I've developed two distinct approaches for working with Claude Code in production. Neither eliminates the need to review AI output - both just change &lt;em&gt;when&lt;/em&gt; you invest your thinking time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mode 1: Inline Review (The Conversation)
&lt;/h3&gt;

&lt;p&gt;This is the interactive approach. I work with Claude Code on changes one at a time, reviewing and giving feedback as we go.&lt;/p&gt;

&lt;p&gt;The workflow looks like this: Claude suggests a change, I read it carefully, I either accept it, reject it, or ask for modifications. Then we move to the next change. It's slower per change, but I catch issues immediately when my mental context is still fresh.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I review code changes and immediately give feedback. This is much slower but doesn't force me to review everything in one gigantic PR.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The trade-off is speed. You're not going to ship a feature in an hour with this approach. But you also won't face a 2,000-line PR full of AI-generated code that you need to audit all at once. The review burden is distributed across the work, not concentrated at the end.&lt;/p&gt;

&lt;p&gt;There's a hidden benefit: this mode is educational. I've learned new patterns, algorithms, and technologies through these conversations. When Claude suggests an approach I haven't seen before, I stop and understand it before accepting. Over time, that builds real expertise.&lt;/p&gt;

&lt;p&gt;I use this mode when I'm in unfamiliar territory - exploring a new API, debugging complex business logic, or working on code where the architecture isn't obvious yet. When I don't fully understand the problem space, I want to think through each step rather than delegate execution to AI.&lt;/p&gt;

&lt;h3&gt;
  
  
  Mode 2: Documentation-First (The Blueprint)
&lt;/h3&gt;

&lt;p&gt;This is the upfront investment approach. Before Claude writes a single line of code, I invest significant time creating detailed documentation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I invest time in creating very detailed documentation for the feature and then let AI work on tickets. It's faster, but it requires a very clear understanding and a plan before the work starts.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The documentation includes: the architecture (which services, which databases, how they interact), data flow (what comes in, how it transforms, what goes out), edge cases (what happens when X fails?), and constraints (performance requirements, backwards compatibility, security boundaries).&lt;/p&gt;

&lt;p&gt;Then I break the work into well-defined "tickets" and let Claude Code execute against that specification. For example, last week I needed a new API endpoint: I wrote a one-page spec covering the route, request/response schemas, database queries, error cases, and auth requirements. Claude produced a working implementation on the first pass - because it wasn't guessing, it was following a blueprint.&lt;/p&gt;
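&lt;p&gt;The shape of such a ticket, with hypothetical names - the endpoint and numbers here are illustrative, the density is the point:&lt;/p&gt;

```markdown
## Ticket: GET /api/v1/projects/{id}/usage

- Route: FastAPI router in api/v1/usage.py; auth via get_current_user dependency
- Schemas: UsageResponse in schemas/usage.py (project_id, period, tokens_used)
- Query: aggregate usage_events by project and billing period; one query, no N+1
- Errors: 404 if the project is missing, 403 if the user is not a member
- Constraints: p95 under 200 ms; existing endpoint behavior must not change
```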

&lt;p&gt;This doesn't save me from all AI slop. I still review the results. But the code is more predictable, more consistent across features, and closer to what I actually wanted.&lt;/p&gt;

&lt;p&gt;I use this mode for well-understood features: CRUD operations, database migrations, repetitive refactoring, test writing. When I know exactly what needs to happen, documentation-first is faster overall despite the upfront time investment.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Use Which
&lt;/h3&gt;

&lt;p&gt;The decision is straightforward:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use Mode 1 (Inline Review) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You're in new or unfamiliar territory&lt;/li&gt;
&lt;li&gt;The problem requires complex business logic&lt;/li&gt;
&lt;li&gt;You're debugging and don't know the root cause yet&lt;/li&gt;
&lt;li&gt;The architecture isn't clear and you need to feel your way forward&lt;/li&gt;
&lt;li&gt;You're doing DevOps or infrastructure work (shell commands, network rules, Docker configs)&lt;/li&gt;
&lt;li&gt;The work unfolds as you go - you don't know what you'll find until you look&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use Mode 2 (Documentation-First) when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You understand the requirements completely&lt;/li&gt;
&lt;li&gt;The feature follows established patterns in your codebase&lt;/li&gt;
&lt;li&gt;You're doing repetitive work (migrations, similar CRUD endpoints)&lt;/li&gt;
&lt;li&gt;You're writing tests for well-understood behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I often alternate between them in the same day. Morning: documentation-first for a straightforward API endpoint. Afternoon: inline review for debugging a race condition where I don't know what's broken yet.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Key Insight
&lt;/h3&gt;

&lt;p&gt;Neither mode eliminates the need to review. The question isn't "should I review AI output?" The question is "when do I invest my thinking time?"&lt;/p&gt;

&lt;p&gt;Mode 1: thinking happens &lt;em&gt;during&lt;/em&gt; execution (distributed review).&lt;br&gt;
Mode 2: thinking happens &lt;em&gt;before&lt;/em&gt; execution (upfront design, then review at the end).&lt;/p&gt;

&lt;p&gt;Both require discipline. Both require saying no to AI suggestions that aren't right. The difference is timing.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TB
    subgraph MODE1["Mode 1: Inline Review"]
        direction LR
        M1A["AI suggests change"] --&amp;gt; M1B["You review"] --&amp;gt; M1C["Accept / Reject / Modify"] --&amp;gt; M1A
        style M1B fill:#ffd,stroke:#aa0
    end

    subgraph MODE2["Mode 2: Documentation-First"]
        direction LR
        M2A["You write spec"] --&amp;gt; M2B["AI executes tickets"] --&amp;gt; M2C["You review result"]
        style M2A fill:#ffd,stroke:#aa0
    end

    MODE1 --- T1["Thinking distributed across work"]
    MODE2 --- T2["Thinking upfront, review at end"]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Context Is Everything
&lt;/h2&gt;

&lt;p&gt;The single biggest factor in AI code quality isn't the model, the prompt, or the temperature setting. It's context. The more your AI tool knows about your project, your conventions, and your constraints, the better its output will be.&lt;/p&gt;

&lt;p&gt;I've found three layers of context that dramatically change Claude Code's output quality.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 1: CLAUDE.md - Your Project's AI Constitution
&lt;/h3&gt;

&lt;p&gt;Claude Code reads a &lt;code&gt;CLAUDE.md&lt;/code&gt; file at the root of your project before every conversation. This is where you define the rules of engagement.&lt;/p&gt;

&lt;p&gt;Mine includes things like: "Never wrap code in try-except by default - we handle errors globally." And: "Do not use inline imports, always put imports at module level." These are my coding conventions, and without them Claude would happily generate code that violates both.&lt;/p&gt;

&lt;p&gt;CLAUDE.md should contain your coding style, architecture decisions, what NOT to do, and any project-specific constraints. Keep it concise - this isn't a wiki. It's a dense set of instructions that shapes every line of code Claude generates.&lt;/p&gt;
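&lt;p&gt;A condensed sketch of what that looks like - the specifics here are illustrative, yours will differ:&lt;/p&gt;

```markdown
# CLAUDE.md (condensed)

## Conventions
- Imports at module level only; never inline imports
- No try-except by default; errors are handled by the global handler
- All async tasks must be idempotent (the queue is at-least-once)

## Architecture
- FastAPI routers in api/v1/, Pydantic schemas in schemas/, logic in services/
- Database access only through the service layer; no queries in route handlers

## Never
- Never introduce new abstractions (factories, registries) without asking first
- Never change public API response shapes
- Never write a migration without a downgrade path
```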

&lt;h3&gt;
  
  
  Layer 2: Subdocument References
&lt;/h3&gt;

&lt;p&gt;CLAUDE.md gets long fast. The trick is keeping it lean and linking to deeper documentation for specific areas.&lt;/p&gt;

&lt;p&gt;My CLAUDE.md references separate docs for database schemas, API patterns, deployment procedures, and testing conventions. Claude reads the relevant subdocuments when it needs context for a specific task. This layered approach means CLAUDE.md stays scannable while deeper context is always available.&lt;/p&gt;

&lt;h3&gt;
  
  
  Layer 3: AI-Targeted Documentation
&lt;/h3&gt;

&lt;p&gt;This is the layer most people miss. Traditional documentation is written for humans - it explains concepts, includes tutorials, gives background. AI-targeted documentation is different. It's dense, specific, and architectural.&lt;/p&gt;

&lt;p&gt;Instead of "Our API uses REST principles," you write: "All API endpoints follow this pattern: FastAPI router in &lt;code&gt;api/v1/&lt;/code&gt;, Pydantic models in &lt;code&gt;schemas/&lt;/code&gt;, service layer in &lt;code&gt;services/&lt;/code&gt;. Responses use &lt;code&gt;StandardResponse&lt;/code&gt; wrapper. Auth via &lt;code&gt;get_current_user&lt;/code&gt; dependency."&lt;/p&gt;

&lt;p&gt;That one paragraph gives Claude more useful context than pages of explanation. The AI doesn't need to understand why - it needs to know what patterns to follow.&lt;/p&gt;

&lt;p&gt;The hidden cost: maintaining this documentation takes real effort. Every architecture change, every new convention needs to be reflected in these docs. It's not free. But it's the highest-ROI investment I've found for AI-assisted development. Better context in means dramatically better code out.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Principle
&lt;/h3&gt;

&lt;p&gt;Garbage context in, garbage code out. If your AI tool doesn't know your patterns, it will invent its own. If it doesn't know your constraints, it will ignore them. If it doesn't know your architecture, every feature will look different.&lt;/p&gt;

&lt;p&gt;Invest in documentation. Not for future developers - for your AI tools, right now. The payoff is immediate: more consistent code, fewer review cycles, less time fixing AI's guesses.&lt;/p&gt;

&lt;p&gt;There's a whole topic around automating the maintenance of AI documentation - keeping it fresh as your codebase evolves. I wrote &lt;a href="https://dev.to/blog/claude-md-guide"&gt;a definitive guide to CLAUDE.md&lt;/a&gt; covering the full five-layer configuration system. For now, start with CLAUDE.md and build from there.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;flowchart TB
    L1["CLAUDE.md\nConventions, architecture, constraints"]
    L2["Subdocuments\nDB patterns, API conventions, testing guide"]
    L3["AI-Targeted Feature Docs\nDense specs: routes, schemas, edge cases"]
    OUT["AI Output Quality"]

    L1 --&amp;gt;|"references"| L2
    L2 --&amp;gt;|"details"| L3
    L1 &amp;amp; L2 &amp;amp; L3 --&amp;gt;|"context in"| OUT

    style L1 fill:#e8f4fd,stroke:#1a73e8
    style L2 fill:#d4edda,stroke:#28a745
    style L3 fill:#fff3cd,stroke:#ffc107
    style OUT fill:#f8d7da,stroke:#dc3545
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Red Flags in Your Diff
&lt;/h2&gt;

&lt;p&gt;You already know why AI code goes wrong - runtime blindness, architecture drift, over-engineering. Here's what to actually look for when you're reviewing a diff.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Try-except wrapping everything.&lt;/strong&gt; The most reliable AI tell. You'll see entire function bodies wrapped in &lt;code&gt;try: ... except Exception: logger.error(...)&lt;/code&gt;. Your global error handler never fires because every exception gets caught and buried. If you see a broad &lt;code&gt;except Exception&lt;/code&gt; in a diff, reject it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Magic numbers.&lt;/strong&gt; Timeouts of 30, retry counts of 3, batch sizes of 100. AI picks numbers that look reasonable but aren't tied to anything real. Your timeout should be 5 seconds because the downstream SLA is 3. Check every numeric literal that isn't 0 or 1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Comments explaining "what."&lt;/strong&gt; &lt;code&gt;# Increment the counter&lt;/code&gt; above &lt;code&gt;counter += 1&lt;/code&gt;. If you see a comment describing what the next line does, delete it. The only comments worth keeping explain &lt;em&gt;why&lt;/em&gt; - and AI can't write those because it doesn't know your intent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Near-duplicate functions.&lt;/strong&gt; Three functions that differ by one parameter. AI generates each independently, doesn't see the duplication across sessions. Search for similar function signatures in the same module.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Missing boundary validation.&lt;/strong&gt; Internal functions that trust all inputs. AI treats each function as self-contained, skipping the validation that protects your system at entry points. Check: does new code that handles external input validate it?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Inline imports.&lt;/strong&gt; Imports inside function bodies instead of at module level. AI does this because it's "convenient" for the snippet. If your project convention is module-level imports, this is an instant reject.&lt;/p&gt;

&lt;p&gt;None of these require deep analysis. They're mechanical checks you can spot in seconds during review - which is exactly why they're worth having on a checklist.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RED FLAG: Try-except wrapping everything
# BEFORE - global error handler never fires
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charge&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;logger&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to process order &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;failed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# AFTER - let exceptions propagate
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process_order&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;order_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;validate&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;order&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;charge&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
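&lt;p&gt;Boundary validation follows the same before/after shape. A sketch with hypothetical names, in plain Python so the pattern is visible without a framework:&lt;/p&gt;

```python
# RED FLAG: Missing boundary validation
# BEFORE - external input trusted as if it were internal
def apply_discount_unchecked(order, percent):
    order["total"] = order["total"] * (100 - percent) / 100
    return order

# AFTER - validate once at the entry point; internal helpers then trust inputs
def apply_discount(order, percent):
    if not isinstance(percent, (int, float)):
        raise ValueError("percent must be a number")
    if percent > 100 or 0 > percent:
        raise ValueError("percent must be between 0 and 100")
    order["total"] = order["total"] * (100 - percent) / 100
    return order
```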





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# RED FLAG: Magic numbers
# BEFORE
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# AFTER - tied to a real constraint
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;httpx&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;DOWNSTREAM_TIMEOUT_SEC&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# 5s, SLA is 3s
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
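&lt;p&gt;And the near-duplicate pattern, with hypothetical names - three functions from three separate AI sessions collapsing into one:&lt;/p&gt;

```python
# RED FLAG: Near-duplicate functions
# BEFORE - each generated independently, so the duplication was never visible
def get_active_users(users):
    return [u for u in users if u["status"] == "active"]

def get_banned_users(users):
    return [u for u in users if u["status"] == "banned"]

def get_pending_users(users):
    return [u for u in users if u["status"] == "pending"]

# AFTER - one parameterized function
def get_users_by_status(users, status):
    return [u for u in users if u["status"] == status]
```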



&lt;h2&gt;
  
  
  Automating What You Can
&lt;/h2&gt;

&lt;p&gt;Every red flag from the previous section can be caught by a machine. Linters, type checkers, and pre-commit hooks handle the mechanical stuff - bare &lt;code&gt;except Exception&lt;/code&gt; blocks, inline imports, magic numbers, functions over 100 lines. You already have &lt;code&gt;ruff&lt;/code&gt; and &lt;code&gt;mypy&lt;/code&gt; (or ESLint and TypeScript strict). The step most people skip is adding custom pre-commit checks for the AI-specific patterns: error swallowing, numeric literals outside constants, near-duplicate functions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="c1"&gt;# .pre-commit-config.yaml - AI-specific checks&lt;/span&gt;
&lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;repo&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;local&lt;/span&gt;
  &lt;span class="na"&gt;hooks&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no-bare-except&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no bare except Exception&lt;/span&gt;
      &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pygrep&lt;/span&gt;
      &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;except\s+(Exception|BaseException)\s*:'&lt;/span&gt;
      &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;

    &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;id&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no-inline-imports&lt;/span&gt;
      &lt;span class="na"&gt;name&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;no inline imports&lt;/span&gt;
      &lt;span class="na"&gt;language&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;pygrep&lt;/span&gt;
      &lt;span class="na"&gt;entry&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="s"&gt;^\s{4,}(import&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;|from&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;\S+&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;import&lt;/span&gt;&lt;span class="nv"&gt; &lt;/span&gt;&lt;span class="s"&gt;)'&lt;/span&gt;
      &lt;span class="na"&gt;types&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="pi"&gt;[&lt;/span&gt;&lt;span class="nv"&gt;python&lt;/span&gt;&lt;span class="pi"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The underrated CI trick: run your tests with concurrency. Most test suites run sequentially, so timing-dependent bugs pass locally. Run with &lt;code&gt;pytest -n auto&lt;/code&gt; (via the &lt;code&gt;pytest-xdist&lt;/code&gt; plugin) or your framework's parallel flag. Bugs that only surface under concurrent execution will start failing in CI - which is exactly where you want them caught.&lt;/p&gt;
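&lt;p&gt;You can also write targeted concurrency smoke tests with no plugin at all. A sketch with a hypothetical &lt;code&gt;Counter&lt;/code&gt; - delete the lock and this test starts failing intermittently, which is exactly the signal you want:&lt;/p&gt;

```python
import threading

class Counter:
    # Toy shared resource; the lock is the part AI-generated code tends to skip.
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:  # serialize the read-modify-write
            self.value += 1

def test_concurrent_increments():
    counter = Counter()

    def worker():
        for _ in range(10_000):
            counter.increment()

    threads = [threading.Thread(target=worker) for _ in range(8)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    # Without the lock, lost updates make this total come up short.
    assert counter.value == 80_000

test_concurrent_increments()
```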

&lt;p&gt;But here's what matters more than any of this: &lt;strong&gt;what automation can't catch.&lt;/strong&gt; No tool will tell you whether this feature follows the same patterns as the rest of your codebase. No linter knows your business rules. No type checker can tell you the code solves the wrong problem.&lt;/p&gt;

&lt;p&gt;Automation handles the floor. Human review handles the ceiling. The mistake is confusing which problems belong to which layer.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Honest Trade-offs
&lt;/h2&gt;

&lt;p&gt;Time for the part most AI articles skip.&lt;/p&gt;

&lt;h3&gt;
  
  
  What AI Is Genuinely Great At
&lt;/h3&gt;

&lt;p&gt;Boilerplate. CRUD endpoints, data models, serializers - anything where the pattern is established and you just need more of it. AI excels here because there's nothing to get wrong beyond following the template.&lt;/p&gt;

&lt;p&gt;Tests, when you define scope. Tell Claude exactly which cases to cover and it writes solid tests fast. Let it decide what to test and you'll get 90% happy-path coverage with zero edge cases.&lt;/p&gt;

&lt;p&gt;Mechanical refactoring. Renaming, moving code between modules, converting class patterns. Tedious work that humans mess up because our attention drifts. AI doesn't get bored.&lt;/p&gt;

&lt;p&gt;Exploring unfamiliar APIs. Need to integrate a library you've never used? Claude reads the docs faster than you and produces a working first draft. You still need to understand what it wrote, but the exploration phase shrinks dramatically.&lt;/p&gt;

&lt;p&gt;First-draft documentation. API docs, README updates, docstrings. AI produces a decent starting point that's faster to edit than to write from scratch.&lt;/p&gt;

&lt;h3&gt;
  
  
  What AI Is Genuinely Bad At
&lt;/h3&gt;

&lt;p&gt;Everything that requires understanding beyond the code itself. Concurrency correctness, architecture decisions, implicit business rules, cross-codebase consistency - I covered these in detail earlier, and they remain the core risks.&lt;/p&gt;

&lt;p&gt;But the one I keep coming back to: AI doesn't push back. It's eager to please. Tell it to build something wrong and it will do so enthusiastically. You need to be the one who decides "we shouldn't build this." AI has no judgment about whether the task itself makes sense.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Hidden Costs
&lt;/h3&gt;

&lt;p&gt;Maintaining AI-targeted documentation takes real effort. Every architecture change needs to be reflected in your CLAUDE.md and supporting docs. This is an ongoing tax, not a one-time setup.&lt;/p&gt;

&lt;p&gt;Review time sometimes exceeds writing time. For complex logic, reading and verifying AI output takes longer than writing it yourself would have. You save nothing - you just shifted the work from writing to reading.&lt;/p&gt;

&lt;p&gt;Context-switching between driving and reviewing is cognitively expensive. You're either thinking creatively or thinking critically. Switching between the two hundreds of times a day is draining in a way that pure coding isn't.&lt;/p&gt;

&lt;p&gt;The speed gain is real but smaller than the hype. I'm not 10x faster. I'm not even 5x faster.&lt;/p&gt;

&lt;h3&gt;
  
  
  My Honest Assessment
&lt;/h3&gt;

&lt;p&gt;AI tools make me maybe 1.5-2x faster overall. The biggest gains are in exploration and boilerplate - not in the hard parts that actually matter. Core logic, architecture, debugging - these take the same time they always did, sometimes longer because I'm reviewing AI's work on top of my own thinking.&lt;/p&gt;

&lt;p&gt;Quality is maintained only because I refuse to skip review. If I stopped reviewing, I'd ship faster but sleep worse.&lt;/p&gt;

&lt;h2&gt;
  
  
  You're the Architect, AI Is the Contractor
&lt;/h2&gt;

&lt;p&gt;AI coding tools are the most productive addition to my workflow in years. They're also the easiest way to accumulate technical debt I've ever seen. The difference between the two outcomes is discipline - not talent, not experience, just a system you follow consistently.&lt;/p&gt;

&lt;p&gt;You wouldn't hand a contractor a plot of land and say "build something." You'd give them blueprints, constraints, materials specs, and you'd inspect the work at every stage. AI tools are the same. You design. AI executes. You review. Skip any step and you're gambling.&lt;/p&gt;

&lt;p&gt;Here's what matters:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Invest in documentation - for your AI, not for future developers.&lt;/strong&gt; CLAUDE.md, subdocuments, AI-targeted architecture docs. Better context in, better code out. The payoff is immediate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Choose your review mode deliberately.&lt;/strong&gt; Inline review for the unknown. Documentation-first for the predictable. Match the mode to the work, not your mood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Automate the mechanical checks.&lt;/strong&gt; Linters, type checkers, pre-commit hooks - let tools handle the floor so your review time goes to the ceiling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know the limits honestly.&lt;/strong&gt; 1.5-2x faster, not 10x. Great at boilerplate, bad at concurrency. The speed is real. The hype isn't.&lt;/p&gt;

&lt;p&gt;If you want a concrete starting point: write a CLAUDE.md for your main project this week. Just your coding conventions, your architecture patterns, and three things AI should never do in your codebase. That single file will change the quality of every AI interaction you have.&lt;/p&gt;

&lt;p&gt;The race condition you don't catch today is the production incident you debug next week. And enough unchecked drift turns into the kind of &lt;a href="https://dev.to/blog/why-rewrites-fail/"&gt;big rewrite&lt;/a&gt; nobody wants. Build the system. Follow the system. Sleep well.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article was created with the assistance of Claude Code.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>python</category>
      <category>typescript</category>
      <category>react</category>
    </item>
    <item>
      <title>Build Your Own Passwordless OTP Auth on AWS Lambda</title>
      <dc:creator>Arseniy Potapov</dc:creator>
      <pubDate>Sat, 28 Feb 2026 10:10:03 +0000</pubDate>
      <link>https://dev.to/potapov/build-your-own-passwordless-otp-auth-on-aws-lambda-4eme</link>
      <guid>https://dev.to/potapov/build-your-own-passwordless-otp-auth-on-aws-lambda-4eme</guid>
      <description>&lt;p&gt;I was adding authentication to a side project and started evaluating managed auth services. &lt;a href="https://auth0.com/pricing" rel="noopener noreferrer"&gt;Auth0&lt;/a&gt; gives you 25,000 MAU free. &lt;a href="https://clerk.com/pricing" rel="noopener noreferrer"&gt;Clerk&lt;/a&gt; gives you 50,000. &lt;a href="https://aws.amazon.com/cognito/pricing/" rel="noopener noreferrer"&gt;Cognito&lt;/a&gt; gives you 10,000. For a side project, any of them would cost zero dollars.&lt;/p&gt;

&lt;p&gt;But all of them meant routing my auth through someone else's infrastructure. My users, my verification flow, my data - controlled by a third party. All I needed was for users to prove they own an email or phone number. No passwords, no social login, no user profiles. Just "enter your email, get a code, get a token."&lt;/p&gt;

&lt;p&gt;So I built one. Two Lambda functions, a DynamoDB table, and 180 lines of &lt;a href="https://dev.to/blog/python/"&gt;Python&lt;/a&gt;. It's been running in production since February 2023 and costs about a dollar a month.&lt;/p&gt;

&lt;p&gt;This article walks through the real code, the trade-offs, and a build-vs-buy framework so you can decide whether owning your auth stack is worth it - or whether a managed service is the right call.&lt;/p&gt;

&lt;p&gt;You can try the &lt;a href="https://otp.potapov.dev" rel="noopener noreferrer"&gt;live demo&lt;/a&gt; and browse the &lt;a href="https://github.com/muzhig/simple-otp" rel="noopener noreferrer"&gt;source code on GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  How OTP Works
&lt;/h2&gt;

&lt;p&gt;Most OTP tutorials generate codes with &lt;code&gt;random.randint(100000, 999999)&lt;/code&gt;. That's not a one-time password - it's a random number with no cryptographic guarantees. Python's &lt;code&gt;random&lt;/code&gt; module uses the Mersenne Twister, which is not cryptographically secure: observe enough outputs and the rest become predictable. A proper OTP uses HOTP (HMAC-based One-Time Password), defined in &lt;a href="https://tools.ietf.org/html/rfc4226" rel="noopener noreferrer"&gt;RFC 4226&lt;/a&gt;. HOTP takes two inputs: a shared secret (a random base32 string, minimum 128 bits per the RFC) and a counter (an integer that increments). Run them through HMAC-SHA1 and dynamic truncation, and you get a 6-digit code that's cryptographically tied to that specific secret and counter value. The &lt;code&gt;pyotp&lt;/code&gt; library implements this algorithm, so you don't write any cryptography yourself - you pass in the secret and counter, and it gives you back a code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pyotp&lt;/span&gt;

&lt;span class="n"&gt;secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random_base32&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;hotp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HOTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;secret&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;hotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# first code
&lt;/span&gt;&lt;span class="n"&gt;hotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# True - matches counter 0
&lt;/span&gt;&lt;span class="n"&gt;hotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="c1"&gt;# False - wrong counter
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three properties make HOTP work for authentication:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Can't reuse.&lt;/strong&gt; Each code is tied to a specific counter value. Once verified, the record is deleted, so the same code never works again.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can't predict.&lt;/strong&gt; Without the secret, there's no way to compute the next code. The secret never leaves the server.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Can't go backwards.&lt;/strong&gt; A code generated for counter 5 doesn't work for counter 4.&lt;/li&gt;
&lt;/ol&gt;
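
&lt;p&gt;None of this requires trusting &lt;code&gt;pyotp&lt;/code&gt; as a black box. Here's a stdlib-only sketch of the RFC 4226 algorithm it implements - illustrative, not a replacement for the library:&lt;/p&gt;

```python
import base64
import hashlib
import hmac

def hotp(secret_b32: str, counter: int, digits: int = 6) -> str:
    """RFC 4226: HMAC-SHA1 over the 8-byte counter, then dynamic truncation."""
    key = base64.b32decode(secret_b32, casefold=True)
    mac = hmac.new(key, counter.to_bytes(8, "big"), hashlib.sha1).digest()
    offset = mac[-1] % 16                        # low nibble of last byte picks the offset
    code = int.from_bytes(mac[offset:offset + 4], "big") % 2**31  # 31-bit truncation
    return str(code % 10**digits).zfill(digits)

# RFC 4226 Appendix D test vector: ASCII secret "12345678901234567890", counter 0
secret = base64.b32encode(b"12345678901234567890").decode()
print(hotp(secret, 0))                     # 755224
print(hotp(secret, 5) == hotp(secret, 4))  # False - a counter-5 code fails at counter 4
```

&lt;p&gt;The first output matches RFC 4226's published test vectors, which is a useful sanity check for any HOTP implementation.&lt;/p&gt;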

&lt;h3&gt;
  
  
  HOTP vs TOTP
&lt;/h3&gt;

&lt;p&gt;You've probably used TOTP - Time-based One-Time Password - with authenticator apps like Google Authenticator. TOTP is HOTP with time as the counter. Codes rotate every 30 seconds.&lt;/p&gt;
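
&lt;p&gt;The relationship is literal: TOTP runs the same HOTP algorithm, but derives the counter from the clock. A minimal sketch of RFC 6238's time-step counter (the function name is mine, for illustration):&lt;/p&gt;

```python
import time

def totp_counter(period: int = 30) -> int:
    """RFC 6238: the counter is the number of 30-second steps since the Unix epoch."""
    return int(time.time() // period)

# Feed this counter into HOTP and you have TOTP. Within one 30-second
# window the counter - and therefore the code - stays the same; when the
# window rolls over, the code changes.
```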

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;HOTP (counter-based)&lt;/th&gt;
&lt;th&gt;TOTP (time-based)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Code valid until&lt;/td&gt;
&lt;td&gt;Used or expired by TTL&lt;/td&gt;
&lt;td&gt;~30 seconds&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Use case&lt;/td&gt;
&lt;td&gt;"Send me a code" flows&lt;/td&gt;
&lt;td&gt;Authenticator apps&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Clock sync needed&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;User pressure&lt;/td&gt;
&lt;td&gt;None - enter when ready&lt;/td&gt;
&lt;td&gt;Must type within window&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;For "type your email, get a code" flows, HOTP is the right choice. TOTP would mean the user has 30 seconds to check their email and type the code - that's stressful and leads to failed attempts. With HOTP, the code stays valid until the record expires (5 minutes in my implementation) or the user enters it successfully.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Architecture
&lt;/h2&gt;

&lt;p&gt;The whole system is four components: two Lambda functions, one DynamoDB table, and whichever delivery service you prefer for sending codes. I use Mailgun for email and Twilio for SMS.&lt;/p&gt;

&lt;p&gt;I chose &lt;a href="https://dev.to/blog/aws-choosing-right-services/"&gt;serverless&lt;/a&gt; because OTP verification is the definition of a bursty workload. Most of the time nobody is logging in. Then 50 people sign up after a Product Hunt launch and you need to handle them all. Lambda scales to zero when idle and handles bursts without provisioning anything. For a service that processes a few requests per day on average, paying for a running server would be wasteful.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Data Model
&lt;/h3&gt;

&lt;p&gt;Each OTP record lives in &lt;a href="https://dev.to/blog/types-of-databases/"&gt;DynamoDB&lt;/a&gt; with a TTL that auto-deletes expired codes. The &lt;code&gt;pynamodb&lt;/code&gt; ORM keeps the model definition clean:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pynamodb.models&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pynamodb.attributes&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;UnicodeAttribute&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;NumberAttribute&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;OTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Model&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;Meta&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;table_name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;simple-otp-secrets&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="nb"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UnicodeAttribute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;hash_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;otp_secret&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;UnicodeAttribute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NumberAttribute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;expires&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;NumberAttribute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;id&lt;/code&gt; is the user's email or phone number - one record per identity. &lt;code&gt;otp_secret&lt;/code&gt; is the random base32 string that seeds HOTP generation. &lt;code&gt;counter&lt;/code&gt; tracks how many codes have been generated for this secret. &lt;code&gt;expires&lt;/code&gt; is a Unix timestamp for DynamoDB's TTL - after 5 minutes, DynamoDB deletes the record automatically.&lt;/p&gt;
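
&lt;p&gt;The TTL math is a one-liner, but one caveat is worth encoding: DynamoDB's TTL deletion is lazy and can lag the timestamp, so application code should treat &lt;code&gt;expires&lt;/code&gt; as the source of truth rather than relying on the record's absence. A sketch with the 5-minute lifetime used here (helper names are illustrative):&lt;/p&gt;

```python
import time

TTL_SECONDS = 300  # codes live for 5 minutes

def new_expiry() -> int:
    # Unix timestamp stored in the `expires` attribute; DynamoDB's TTL
    # sweeper deletes the item some time after this passes
    return int(time.time()) + TTL_SECONDS

def is_expired(expires: int) -> bool:
    # TTL deletion can lag, so compare against the clock when verifying
    return time.time() >= expires
```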

&lt;h3&gt;
  
  
  What It Costs
&lt;/h3&gt;

&lt;p&gt;This is where serverless gets interesting. Here's what I actually pay:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Component&lt;/th&gt;
&lt;th&gt;Monthly Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lambda (2 functions)&lt;/td&gt;
&lt;td&gt;$0.00 (free tier: 1M requests/mo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DynamoDB (provisioned, 1 RCU/WCU)&lt;/td&gt;
&lt;td&gt;$0.00 (free tier: 25 GB storage, 25 WCU/RCU)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API Gateway&lt;/td&gt;
&lt;td&gt;$0.00 (free tier: 1M calls/mo)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mailgun (email delivery)&lt;/td&gt;
&lt;td&gt;$0.00 (free tier: 100 emails/day)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Twilio (SMS delivery)&lt;/td&gt;
&lt;td&gt;~$0.008 per SMS + $1.15/mo for a number&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;
&lt;strong&gt;&amp;lt;$2/mo&lt;/strong&gt; (or $0 if email-only)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The only component that costs real money is Twilio SMS. If you only need email verification, the entire service runs within free tiers indefinitely. You could also replace Mailgun with &lt;a href="https://aws.amazon.com/ses/pricing/" rel="noopener noreferrer"&gt;AWS SES&lt;/a&gt; ($0.10 per 1,000 emails) and Twilio with &lt;a href="https://aws.amazon.com/sns/sms-pricing/" rel="noopener noreferrer"&gt;AWS SNS&lt;/a&gt; (~$0.007/SMS) to keep everything in AWS - I used Mailgun and Twilio because I already had accounts, but SES would make the email path truly $0. I've been running this for three years and my AWS bill has never exceeded $0.50 in a single month.&lt;/p&gt;

&lt;p&gt;Deployment uses the Serverless Framework, which wraps CloudFormation and handles the API Gateway + Lambda wiring. Here's the core of &lt;code&gt;serverless.yml&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight yaml"&gt;&lt;code&gt;&lt;span class="na"&gt;functions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;otpVerificationStart&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main.otp_verification_start&lt;/span&gt;
    &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/otp-verification/start"&lt;/span&gt;
          &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
          &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;reservedConcurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

  &lt;span class="na"&gt;otpVerificationComplete&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;handler&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;main.otp_verification_complete&lt;/span&gt;
    &lt;span class="na"&gt;events&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;http&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;path&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/otp-verification/complete"&lt;/span&gt;
          &lt;span class="na"&gt;method&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;POST&lt;/span&gt;
          &lt;span class="na"&gt;cors&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
    &lt;span class="na"&gt;reservedConcurrency&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
    &lt;span class="na"&gt;timeout&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;30&lt;/span&gt;

&lt;span class="na"&gt;resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
  &lt;span class="na"&gt;Resources&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
    &lt;span class="na"&gt;simpleOtpSecrets&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
      &lt;span class="na"&gt;Type&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;AWS::DynamoDB::Table&lt;/span&gt;
      &lt;span class="na"&gt;Properties&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
        &lt;span class="na"&gt;TableName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;simple-otp-secrets&lt;/span&gt;
        &lt;span class="na"&gt;AttributeDefinitions&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;AttributeName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
            &lt;span class="na"&gt;AttributeType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;S&lt;/span&gt;
        &lt;span class="na"&gt;KeySchema&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="pi"&gt;-&lt;/span&gt; &lt;span class="na"&gt;AttributeName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;id&lt;/span&gt;
            &lt;span class="na"&gt;KeyType&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;HASH&lt;/span&gt;
        &lt;span class="na"&gt;ProvisionedThroughput&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;ReadCapacityUnits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
          &lt;span class="na"&gt;WriteCapacityUnits&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="m"&gt;1&lt;/span&gt;
        &lt;span class="na"&gt;TimeToLiveSpecification&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt;
          &lt;span class="na"&gt;AttributeName&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;expires&lt;/span&gt;
          &lt;span class="na"&gt;Enabled&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Two functions, one DynamoDB table with TTL enabled, IAM permissions for DynamoDB access. First deploy takes about 3 minutes. Subsequent deploys take under 30 seconds.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Two Endpoints
&lt;/h2&gt;

&lt;p&gt;The entire service is two Lambda functions behind API Gateway. One creates the OTP and sends it. The other verifies it and issues a JWT.&lt;/p&gt;

&lt;h3&gt;
  
  
  /start - Create and Send the Code
&lt;/h3&gt;

&lt;p&gt;When a user requests a code, the start endpoint creates or resets an OTP record in DynamoDB, generates the next HOTP code, and sends it via Mailgun (email) or Twilio (SMS).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;otp_verification_start&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queryStringParameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
    &lt;span class="n"&gt;phone&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\D&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

    &lt;span class="n"&gt;otp_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;otp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;otp_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;otp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;OTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DoesNotExist&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;otp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;otp_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;otp_secret&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;pyotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random_base32&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                   &lt;span class="n"&gt;expires&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nf"&gt;int&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;time&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;otp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pyotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HOTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;otp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;otp_secret&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;at&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;otp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.mailgun.net/v3/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;MAILGUN_DOMAIN&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MAILGUN_API_KEY&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                      &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;from&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;FROM_EMAIL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;to&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;email&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;subject&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Verify your email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your PIN: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.twilio.com/2010-04-01/Accounts/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;TWILIO_SID&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/Messages.json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                      &lt;span class="n"&gt;auth&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TWILIO_SID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TWILIO_TOKEN&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
                      &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;To&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;+&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;phone&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;From&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;TWILIO_NUMBER&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Body&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Your PIN: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If a record already exists for this email or phone, I increment the counter instead of creating a new secret. This means the previous code is immediately invalidated - &lt;code&gt;pyotp.HOTP.verify()&lt;/code&gt; checks against the exact counter value, so only the latest code works. If a user requests a second code before entering the first one, they need to use the new code.&lt;/p&gt;

&lt;p&gt;The Mailgun and Twilio calls are raw HTTP requests. No SDK. For a function this small, pulling in &lt;code&gt;boto3&lt;/code&gt; or the Twilio SDK would double the deployment package for no benefit.&lt;/p&gt;

&lt;h3&gt;
  
  
  /complete - Verify and Issue JWT
&lt;/h3&gt;

&lt;p&gt;When the user enters their code, the complete endpoint looks up their OTP record, verifies the HOTP code, deletes the record, and returns a signed JWT.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;otp_verification_complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;queryStringParameters&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;pin&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pin&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;otp_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;\D&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;phone&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;otp&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;OTP&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;otp_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# raises DoesNotExist if expired or never created
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;pyotp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;HOTP&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;otp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;otp_secret&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;verify&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;otp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;counter&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;error_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;400&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Invalid PIN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;otp&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;delete&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;sub&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;email&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tel:&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;phone&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;encode&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://otp.potapov.dev/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;aud&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.potapov.dev/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;iat&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exp&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;datetime&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;now&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;timezone&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;utc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="nf"&gt;timedelta&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;days&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JWT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;algorithm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HS256&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;success_response&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I delete the record on successful verification rather than incrementing the counter. This prevents code reuse and avoids stale records piling up in DynamoDB. If verification fails, the record stays - the user can try again with the same code until TTL expires.&lt;/p&gt;

&lt;p&gt;One shortcut I should be honest about: rate limiting. RFC 4226 explicitly requires throttling to resist brute force attacks - a 6-digit code has only a million possible values. I handle this with Lambda's &lt;code&gt;reservedConcurrency&lt;/code&gt; set to 1 per function, which you can see in the &lt;code&gt;serverless.yml&lt;/code&gt; above. That's not real rate limiting per user - it's a concurrency cap that serializes requests to the endpoint, but doesn't stop a patient attacker from trying codes sequentially for a single email. For a personal demo that handles a few logins per day, it's acceptable. For anything beyond that, you'd want per-IP or per-user throttling at the API Gateway level, or a DynamoDB counter that locks out after 3 failed attempts.&lt;/p&gt;
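&lt;p&gt;A sketch of what that lockout counter could look like - an in-memory dict stands in for the DynamoDB item (in the real table this would be an atomic counter update with a TTL attribute), and the thresholds are assumptions, not values from the service:&lt;/p&gt;

```python
import time

MAX_ATTEMPTS = 3        # assumption: lock after 3 failed PINs
LOCKOUT_SECONDS = 900   # assumption: 15-minute window

# Stand-in for a DynamoDB item keyed by email/phone:
# identifier -> (failed_count, window_start_timestamp)
_failures = {}

def may_attempt(identifier: str, now=None) -> bool:
    """Call on each failed verification; returns False once locked out."""
    now = time.time() if now is None else now
    count, since = _failures.get(identifier, (0, now))
    if now - since > LOCKOUT_SECONDS:
        count, since = 0, now          # window expired: reset the counter
    if count >= MAX_ATTEMPTS:
        return False                   # locked out for this window
    _failures[identifier] = (count + 1, since)
    return True
```

&lt;p&gt;Unlike the concurrency cap, this throttles per identifier, which is what RFC 4226 actually asks for.&lt;/p&gt;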

&lt;h2&gt;
  
  
  JWT: From Code to Client
&lt;/h2&gt;

&lt;p&gt;After successful OTP verification, the service issues a signed JWT. This token is how downstream APIs know the user proved ownership of their email or phone number.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verifying on the Receiving End
&lt;/h3&gt;

&lt;p&gt;The /complete endpoint issues the token. The interesting part is what happens when a downstream API receives it. Here's what validation looks like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;

&lt;span class="n"&gt;claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jwt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JWT_SECRET&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;algorithms&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HS256&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;audience&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://api.potapov.dev/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;issuer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://otp.potapov.dev/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sub&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  &lt;span class="c1"&gt;# "email:user@example.com"
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The &lt;code&gt;algorithms&lt;/code&gt; parameter is required in PyJWT 2.x - omitting it raises &lt;code&gt;DecodeError&lt;/code&gt;. The &lt;code&gt;audience&lt;/code&gt; and &lt;code&gt;issuer&lt;/code&gt; parameters enforce that the token was meant for this specific API and issued by our OTP service. If either doesn't match, PyJWT raises an exception before your code ever sees the claims. The 24-hour expiry is generous for a demo - in production I'd cut it to 1-4 hours depending on the use case.&lt;/p&gt;

&lt;p&gt;One Python detail worth noting is the &lt;code&gt;datetime.now(timezone.utc)&lt;/code&gt; call in the /complete endpoint (it needs &lt;code&gt;from datetime import datetime, timedelta, timezone&lt;/code&gt;). If you're reading older tutorials that use &lt;code&gt;datetime.utcnow()&lt;/code&gt;, that method has been deprecated since Python 3.12 and now emits a &lt;code&gt;DeprecationWarning&lt;/code&gt;. The old call returns a naive datetime, the new one a timezone-aware datetime. PyJWT 2.x handles both, but the new form is correct and won't spam your logs with warnings.&lt;/p&gt;
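&lt;p&gt;The aware-vs-naive distinction in one runnable check:&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# The recommended replacement for the deprecated datetime.utcnow():
# an explicitly timezone-aware UTC timestamp.
issued_at = datetime.now(timezone.utc)
assert issued_at.tzinfo is timezone.utc

# Aware datetimes convert to epoch seconds unambiguously, which is what
# the iat/exp claims ultimately become; a naive datetime's .timestamp()
# would be interpreted in the machine's local timezone instead.
expires_at = issued_at + timedelta(days=1)
lifetime = expires_at.timestamp() - issued_at.timestamp()
```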

&lt;h3&gt;
  
  
  HS256 vs RS256
&lt;/h3&gt;

&lt;p&gt;I use HS256 (symmetric signing) because this is a single-service setup. The same secret that signs the token also verifies it. Simple, fast, one environment variable.&lt;/p&gt;

&lt;p&gt;The limitation: only services that know the &lt;code&gt;JWT_SECRET&lt;/code&gt; can verify tokens. If I wanted the OTP service to act as a third-party identity provider - where other apps verify tokens without knowing the signing key - I'd switch to RS256. RS256 uses a private key to sign and a public key to verify. You publish the public key, and any service can validate tokens without accessing secrets.&lt;/p&gt;

&lt;p&gt;For a side project or internal tool, HS256 is the right call. For a product where external services consume your tokens, RS256 is worth the extra setup.&lt;/p&gt;
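&lt;p&gt;For reference, here's roughly what the RS256 switch looks like with PyJWT (it needs the &lt;code&gt;cryptography&lt;/code&gt; package; the throwaway key pair below is for illustration - in practice the private key would live only on the OTP service and the public key would be published):&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

import jwt  # PyJWT; RS256 requires the 'cryptography' dependency
from cryptography.hazmat.primitives import serialization
from cryptography.hazmat.primitives.asymmetric import rsa

# Generate a throwaway RSA key pair for the example.
private_key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
private_pem = private_key.private_bytes(
    serialization.Encoding.PEM,
    serialization.PrivateFormat.PKCS8,
    serialization.NoEncryption(),
)
public_pem = private_key.public_key().public_bytes(
    serialization.Encoding.PEM,
    serialization.PublicFormat.SubjectPublicKeyInfo,
)

# The issuer signs with the private key...
token = jwt.encode(
    {"sub": "email:user@example.com",
     "exp": datetime.now(timezone.utc) + timedelta(hours=1)},
    private_pem,
    algorithm="RS256",
)

# ...and any downstream service verifies with only the public key.
claims = jwt.decode(token, public_pem, algorithms=["RS256"])
```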

&lt;h3&gt;
  
  
  Client-Side Decoding
&lt;/h3&gt;

&lt;p&gt;JWTs are signed, not encrypted. The payload is base64-encoded JSON that any client can read:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="k"&gt;import&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nx"&gt;jwtDecode&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;from&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;jwt-decode&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;claims&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;jwtDecode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;token&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;userId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;claims&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt; &lt;span class="c1"&gt;// "email:user@example.com"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's by design. The client needs to know who's logged in without making a server round-trip. But it means you should never put sensitive data in JWT claims - anything in the payload is readable by anyone with the token.&lt;/p&gt;

&lt;h2&gt;
  
  
  Build vs Buy
&lt;/h2&gt;

&lt;p&gt;Every OTP tutorial skips this part. They show you how to build it, declare victory, and leave you to figure out whether you should have used Auth0 instead. I've run this service for three years alongside projects that use Cognito and Clerk. Here's when each makes sense.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Build Your Own
&lt;/h3&gt;

&lt;p&gt;Build your own OTP service when all of these are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your auth needs are simple: email or phone verification, nothing else.&lt;/li&gt;
&lt;li&gt;You don't need SSO, SAML, or social login.&lt;/li&gt;
&lt;li&gt;You're comfortable deploying to Lambda (or any serverless platform).&lt;/li&gt;
&lt;li&gt;You want to own your auth stack. No vendor can change pricing, deprecate an API, or sunset a feature you depend on. Your data stays in your AWS account.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Cost isn't the main reason to build - managed services have &lt;a href="https://auth0.com/pricing" rel="noopener noreferrer"&gt;generous free tiers&lt;/a&gt; now. The real advantage is simplicity and control. When something breaks (and it will, eventually), you can read every line of the service in 10 minutes. Try doing that with Cognito's documentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to Buy
&lt;/h3&gt;

&lt;p&gt;Buy a managed auth service when any of these are true:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You need MFA beyond OTP (FIDO2, passkeys, authenticator apps).&lt;/li&gt;
&lt;li&gt;Your customers require SSO/SAML integration.&lt;/li&gt;
&lt;li&gt;You have compliance requirements (SOC 2, HIPAA, GDPR consent flows).&lt;/li&gt;
&lt;li&gt;You're building a team product where auth touches user roles, permissions, organizations.&lt;/li&gt;
&lt;li&gt;You'd rather not think about auth at all - the &lt;a href="https://clerk.com/pricing" rel="noopener noreferrer"&gt;free tiers&lt;/a&gt; are generous enough that cost isn't a factor.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I wouldn't use this OTP service for a B2B SaaS product. Enterprise customers expect SSO. Security auditors expect documented auth flows, not a Lambda function someone built on a weekend. The moment you need "forgot password" or "link social account" or "enforce password rotation," you're reinventing a wheel that Auth0 and Clerk have spent years refining.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Decision
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Criteria&lt;/th&gt;
&lt;th&gt;Build (DIY OTP)&lt;/th&gt;
&lt;th&gt;Buy (&lt;a href="https://auth0.com/pricing" rel="noopener noreferrer"&gt;Auth0&lt;/a&gt;/&lt;a href="https://aws.amazon.com/cognito/pricing/" rel="noopener noreferrer"&gt;Cognito&lt;/a&gt;/&lt;a href="https://clerk.com/pricing" rel="noopener noreferrer"&gt;Clerk&lt;/a&gt;)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost (low traffic)&lt;/td&gt;
&lt;td&gt;&amp;lt;$1&lt;/td&gt;
&lt;td&gt;$0 (free tiers: 10-50K users)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Monthly cost at scale&lt;/td&gt;
&lt;td&gt;Scales with SMS only&lt;/td&gt;
&lt;td&gt;$35-240+/mo for paid features&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Setup time&lt;/td&gt;
&lt;td&gt;2-4 hours&lt;/td&gt;
&lt;td&gt;30 minutes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;SSO/SAML&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;MFA options&lt;/td&gt;
&lt;td&gt;OTP only&lt;/td&gt;
&lt;td&gt;OTP, FIDO2, passkeys&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vendor lock-in&lt;/td&gt;
&lt;td&gt;None&lt;/td&gt;
&lt;td&gt;Medium to high&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Compliance certifications&lt;/td&gt;
&lt;td&gt;You own it&lt;/td&gt;
&lt;td&gt;Provider handles it&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Maintenance burden&lt;/td&gt;
&lt;td&gt;Near zero (serverless)&lt;/td&gt;
&lt;td&gt;Near zero (managed)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Customization&lt;/td&gt;
&lt;td&gt;Total&lt;/td&gt;
&lt;td&gt;Limited by provider&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The honest answer is that most teams should buy. Auth is a solved problem, and the managed services handle edge cases (account recovery, brute force protection, session management) that you'll eventually need. I kept running my own because the cost is negligible and I like owning the stack. That's a preference, not a recommendation.&lt;/p&gt;

&lt;p&gt;There's a grey area worth mentioning: projects that start with simple OTP and grow. You build this for your MVP, users love it, now you need "remember this device" and "sign in with Google" and "admin can revoke sessions." At that point you're building an auth platform, not an OTP service. The right move is to migrate to a managed provider before you've reinvented half of it. I've seen teams spend months adding features to DIY auth that Auth0 ships out of the box.&lt;/p&gt;

&lt;p&gt;If you're a solo founder building an MVP, a side project, or anything where "user enters email, gets a code, gets a token" is the entire auth story - building takes an afternoon and costs nothing. The moment auth becomes a feature instead of plumbing, switch to a managed service and spend your time on what makes your product different.&lt;/p&gt;

&lt;h2&gt;
  
  
  Three Years in Production
&lt;/h2&gt;

&lt;p&gt;I deployed this service in February 2023. It's now 2026 and it's still running. Here's what that looks like in practice.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Went Right
&lt;/h3&gt;

&lt;p&gt;The service has had exactly zero outages that I caused. Lambda functions don't go stale, DynamoDB tables don't need vacuuming, and there's no server to patch. I haven't SSH'd into anything because there's nothing to SSH into. The only maintenance I've done in three years is updating the Mailgun sending domain when I moved to a new domain registrar.&lt;/p&gt;

&lt;p&gt;Total operational cost over three years: under $30, almost entirely Twilio SMS charges. The AWS components have never exceeded free tier.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Went Wrong
&lt;/h3&gt;

&lt;p&gt;Twice, Mailgun rate-limited my sending domain because I was on the free tier and hit the 100 emails/day cap during a demo. Not a code problem - just a free tier limit I should have anticipated. I've since switched to a self-hosted mail server, but &lt;a href="https://aws.amazon.com/ses/pricing/" rel="noopener noreferrer"&gt;AWS SES&lt;/a&gt; would have been the simpler fix.&lt;/p&gt;

&lt;p&gt;Once, a DynamoDB TTL deletion was delayed by about 15 minutes (TTL is best-effort, not exact), which meant an expired code still technically existed in the database. The HOTP verification still failed because the counter didn't match, so there was no security issue - the only symptom was the user seeing "Invalid PIN" where the clearer "OTP code not found" would have applied.&lt;/p&gt;

&lt;h3&gt;
  
  
  What I'd Change
&lt;/h3&gt;

&lt;p&gt;If I were hardening this for production use beyond my own projects, four things would change.&lt;/p&gt;

&lt;p&gt;I'd switch to RS256 for JWT signing, for the reasons I described above - the moment a second service needs to verify tokens, symmetric signing becomes a liability. I'd also add real per-user rate limiting instead of the &lt;code&gt;reservedConcurrency&lt;/code&gt; workaround I mentioned in the endpoints section. A DynamoDB counter that locks out after 3 failed attempts per email is straightforward and would actually satisfy the RFC 4226 throttling requirement.&lt;/p&gt;

&lt;p&gt;The error responses need work. Right now they're generic ("Invalid PIN", "OTP code not found"). Distinguishing between "expired" and "never created" helps the frontend show the right UX - "Your code expired, request a new one" is more useful than "something went wrong."&lt;/p&gt;

&lt;p&gt;And I'd wrap the OTP secret with KMS before storing it. DynamoDB encrypts at rest by default, so it's not sitting unencrypted on disk, but application-level encryption with KMS would add negligible cost and measurable security.&lt;/p&gt;

&lt;p&gt;None of these are deal-breakers for a personal demo. They're the kind of improvements that matter when the service grows beyond "my side project" into "something other people depend on."&lt;/p&gt;

&lt;h2&gt;
  
  
  What You Get
&lt;/h2&gt;

&lt;p&gt;The whole service is 180 lines of Python, two Lambda functions, and one DynamoDB table. It handles email and SMS verification with proper HOTP codes, issues signed JWTs, and cleans up after itself. I've been running it for three years without touching it.&lt;/p&gt;

&lt;p&gt;Should you build this? If your app needs auth and your budget is zero, you now have working code. If your app needs auth and your budget isn't zero, you now know exactly when a managed service is worth it - and you can make that call based on features you actually need, not on the assumption that auth is too hard to own.&lt;/p&gt;

&lt;p&gt;Fork the &lt;a href="https://github.com/muzhig/simple-otp" rel="noopener noreferrer"&gt;repo&lt;/a&gt;, try the &lt;a href="https://otp.potapov.dev" rel="noopener noreferrer"&gt;live demo&lt;/a&gt; to see the flow in action, and decide for yourself.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article was created with the assistance of Claude Code.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>python</category>
      <category>security</category>
      <category>backend</category>
      <category>prototype</category>
    </item>
    <item>
      <title>AI Data Cleaning: From Demo to Production</title>
      <dc:creator>Arseniy Potapov</dc:creator>
      <pubDate>Sat, 14 Feb 2026 12:00:00 +0000</pubDate>
      <link>https://dev.to/potapov/ai-data-cleaning-in-production-from-toy-demos-to-real-pipelines-m17</link>
      <guid>https://dev.to/potapov/ai-data-cleaning-in-production-from-toy-demos-to-real-pipelines-m17</guid>
      <description>&lt;p&gt;In December 2022, I wrote an article about using AI to clean data. I was excited. GPT-3 had just become accessible, &lt;code&gt;text-davinci-003&lt;/code&gt; could parse messy CSV rows, and I built a demo with 6 rows of sample data. Six rows. I showed how the model could normalize dates, fix capitalization, and extract structured fields from free text. It worked beautifully.&lt;/p&gt;

&lt;p&gt;It also didn't scale past a demo.&lt;/p&gt;

&lt;p&gt;The model cost $0.02 per 1,000 tokens. Rate limits capped throughput at a few hundred requests per minute. Every API call returned slightly different formatting. There was no validation, no error handling, no way to reproduce results. I was essentially asking the world's most expensive intern to hand-clean each row one at a time.&lt;/p&gt;

&lt;p&gt;Three years later, I build data ingestion systems that process millions of records from dozens of sources, each arriving in its own special flavor of broken. County tax rolls with names like "SMITH JOHN A JR &amp;amp; MARY B SMITH-JONES TTEE OF THE SMITH FAMILY TRUST." Address files where column B switches meaning halfway through. CSV exports with mixed encodings, missing headers, and creative interpretations of what "null" means.&lt;/p&gt;

&lt;p&gt;AI is central to how these systems work. But not the way I imagined in 2022. The LLM doesn't touch every row. It looks at a sample, figures out what's wrong, creates a transformation plan, and hands it off to &lt;code&gt;pandas&lt;/code&gt; and Pydantic to execute and validate. The expensive, intelligent part happens once. The cheap, deterministic part runs on every row.&lt;/p&gt;

&lt;p&gt;This is the article I should have written three years ago. Here's what actually works when you're cleaning data at production scale - not with 6 rows, but with millions.&lt;/p&gt;

&lt;h2&gt;
  
  
  Every Customer's Data Is Broken in Its Own Way
&lt;/h2&gt;

&lt;p&gt;Clean datasets are all alike; every dirty dataset is dirty in its own way.&lt;/p&gt;

&lt;p&gt;I've processed data from county governments, mortgage companies, title agencies, and tax assessors across 14 states. Not one source arrived clean. The problems fall into predictable categories, but the specific combination is unique every time.&lt;/p&gt;

&lt;p&gt;Start with the structural problems. CSV files with no headers, mixed delimiters (tabs in one section, commas in another), multiline fields that break naive parsers. One county sends an Excel workbook with 6 tabs. Another sends a ZIP of CSVs named &lt;code&gt;EXPORT_FINAL_v2_CORRECTED(1).csv&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Then there's encoding. Latin-1 data in a UTF-8 world. Windows line endings mixed with Unix. BOM markers that show up as invisible characters. I once spent a day debugging why 200 property addresses contained &lt;code&gt;Ã©&lt;/code&gt; instead of &lt;code&gt;é&lt;/code&gt; - classic UTF-8 double-encoding from an intermediate system that converted the file twice.&lt;/p&gt;
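&lt;p&gt;That particular bug has a two-line fix once you recognize it - reversing the wrong decode recovers the original bytes:&lt;/p&gt;

```python
# Classic UTF-8 double-encoding: "é" is the bytes C3 A9 in UTF-8. When an
# intermediate system decodes those bytes as Latin-1, they become the two
# characters "Ã©". Encoding back to Latin-1 restores the original bytes,
# and decoding those as UTF-8 repairs the string.
broken = "Ã©"
fixed = broken.encode("latin-1").decode("utf-8")
```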

&lt;p&gt;Semantic ambiguity is worse. "John Smith" in one source, "SMITH, JOHN A JR" in another, "SMITH JOHN &amp;amp; MARY TTEE" in a third. Same person, three representations. Dates arrive in 15+ formats: &lt;code&gt;01/02/2024&lt;/code&gt;, &lt;code&gt;2024-01-02&lt;/code&gt;, &lt;code&gt;Jan 2, 2024&lt;/code&gt;, &lt;code&gt;20240102&lt;/code&gt;, and my favorite, &lt;code&gt;1/2/24&lt;/code&gt; (is that January 2nd or February 1st?).&lt;/p&gt;
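&lt;p&gt;The date ambiguity is easy to demonstrate: the same string parses to two different dates depending on which convention the source used, and nothing in the string itself tells you which one is right.&lt;/p&gt;

```python
from datetime import datetime

# "1/2/24" under the two common conventions:
as_us = datetime.strptime("1/2/24", "%m/%d/%y")    # month-first reading
as_intl = datetime.strptime("1/2/24", "%d/%m/%y")  # day-first reading
```

&lt;p&gt;You need out-of-band knowledge about the source - or a sample large enough to contain a day above 12 - to resolve it.&lt;/p&gt;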

&lt;p&gt;Missing data has personality. Empty string, "N/A", "n/a", "NA", "NULL", "None", "-", "0", and a single space character - I've seen all of these mean "this field has no value" within the same file. Then there are the blanket deletions where an entire column is empty because the export process silently dropped it.&lt;/p&gt;
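&lt;p&gt;In &lt;code&gt;pandas&lt;/code&gt; terms, some of those spellings are handled out of the box and the rest have to be declared per source (the column names here are made up for illustration):&lt;/p&gt;

```python
from io import StringIO

import pandas as pd

raw = """owner,assessed_value
SMITH JOHN,150000
N/A,
-,NULL
"""

# pandas treats "N/A", "NULL", and the empty string as missing by default;
# source-specific spellings like "-" must be added via na_values.
df = pd.read_csv(StringIO(raw), na_values=["-"])
owner_missing = int(df["owner"].isna().sum())
value_missing = int(df["assessed_value"].isna().sum())
```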

&lt;p&gt;And domain-specific landmines. Addresses like "123 MAIN ST APT 4B C/O J SMITH" that need to be split into street, unit, and care-of fields. Property class codes that mean different things in different counties ("01" is "Single Family" in Lee County, "Residential" in Broward, and "Vacant Land" in Marion).&lt;/p&gt;

&lt;p&gt;The problems are predictable by category but unique in combination. You can't write one script that handles all sources. And you can't hire enough people to manually map every new dataset - not when each one arrives with its own encoding, its own column names, its own creative interpretation of how to represent a null value.&lt;/p&gt;

&lt;p&gt;The question is what you do about it. People have tried a lot of things.&lt;/p&gt;

&lt;h2&gt;
  
  
  What People Try Before AI
&lt;/h2&gt;

&lt;p&gt;Every approach to data cleaning exists for a reason, and most of them work fine within their limits. The trouble starts when you outgrow those limits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Manual cleanup in Excel.&lt;/strong&gt; Power Query, VLOOKUP, find-and-replace. This handles one-off tasks with small files perfectly well. If you're fixing 200 rows once a quarter, Excel is the right tool. It's not reproducible, though. Next quarter, when the same file arrives with a new encoding issue, you start from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One-off Python scripts.&lt;/strong&gt; A developer writes a &lt;code&gt;pandas&lt;/code&gt; pipeline per data source. I've done this dozens of times: read CSV, rename columns, parse dates, strip whitespace, export. Works great for 5-10 sources. At 50+, you're maintaining 50 scripts, each encoding tribal knowledge that walks out the door when the developer leaves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ETL platforms&lt;/strong&gt; like &lt;code&gt;dbt&lt;/code&gt;, Fivetran, or Airbyte are excellent for structured sources: APIs, databases, well-defined schemas. They struggle with truly messy data. The CSV with no headers. The Excel file where column B is sometimes "Owner Name" and sometimes "Property Address" depending on which county clerk exported it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data quality frameworks&lt;/strong&gt; (Great Expectations, Soda, &lt;code&gt;dbt&lt;/code&gt; tests) are superb at &lt;em&gt;detecting&lt;/em&gt; problems: "15% of addresses failed to geocode." They don't &lt;em&gt;fix&lt;/em&gt; anything. You still need a human to figure out why and write the fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pasting into ChatGPT&lt;/strong&gt; is where I was in 2022. Send 50 rows, ask for clean output, marvel at the result. It genuinely works for small batches. It doesn't scale: Osmos measured $2,250 for 100K rows through GPT-4 before Microsoft acquired them. Inconsistent output between calls, no validation, and rate limits that kill any automated pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AI-assisted pipeline&lt;/strong&gt; is what I'll spend the rest of this article on. AI analyzes samples, generates transformation configs, quality gates verify the output. It combines deterministic execution from scripts, config-driven structure from ETL, automated checks from quality frameworks, and LLMs' ability to understand messy, ambiguous data.&lt;/p&gt;

&lt;p&gt;Here's how they compare:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Scales to 1M rows&lt;/th&gt;
&lt;th&gt;Reproducible&lt;/th&gt;
&lt;th&gt;Handles ambiguity&lt;/th&gt;
&lt;th&gt;Cost per run&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Excel / manual&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Human judgment&lt;/td&gt;
&lt;td&gt;Free + hours&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;One-off scripts&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Per-script&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Dev time per source&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ETL platforms&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Platform fees&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Quality frameworks&lt;/td&gt;
&lt;td&gt;Detection only&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Config time&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Paste into ChatGPT&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Yes (LLM)&lt;/td&gt;
&lt;td&gt;$2-2,250 per batch&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;AI-assisted pipeline&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes (LLM + rules)&lt;/td&gt;
&lt;td&gt;Hours once, pennies per run&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;None of these approaches is universally wrong. Excel is perfect for one-off cleanup. Scripts are perfect for stable, well-understood sources. ETL platforms are perfect for structured pipelines. The AI-assisted approach fills the gap between them: understanding messy, ambiguous data and generating the transformation logic that traditional tools then execute.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI as Planner, Not Processor
&lt;/h2&gt;

&lt;p&gt;The pattern that works in production is simple to describe: AI looks at a sample of your data, creates a transformation plan, and the system executes that plan with traditional tools. The LLM touches 100 rows. &lt;code&gt;pandas&lt;/code&gt; processes 10 million.&lt;/p&gt;

&lt;p&gt;Here's the concrete version. Five steps, each doing one thing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1: Define the target shape.&lt;/strong&gt; Before any AI gets involved, you need a contract - what should clean data look like? I use Pydantic models for this because they double as validation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;PropertyRecord&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;parcel_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unique parcel identifier, e.g. &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;12-34-56-789&lt;/span&gt;&lt;span class="sh"&gt;'"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;owner_first&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;owner_last&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;address&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;city&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;zip_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;pattern&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;^\d{5}(-\d{4})?$&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;assessed_value&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ge&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;is_homestead&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;bool&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This schema isn't just documentation. It's the success criterion. After cleaning, every row must validate against it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2: Describe expectations.&lt;/strong&gt; Give the AI context about the data - not the schema (that's step 1), but &lt;em&gt;what the data represents&lt;/em&gt;. "This is a property tax roll export from a county government. Each row is a parcel. Owners can be individuals, trusts, or corporations. Addresses include both mailing and property-site addresses."&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3: Feed AI sample data plus a toolkit.&lt;/strong&gt; Send the LLM 50-100 representative rows, the target schema, and a list of available transformation operations: &lt;code&gt;rename_column&lt;/code&gt;, &lt;code&gt;parse_date&lt;/code&gt;, &lt;code&gt;split_name&lt;/code&gt;, &lt;code&gt;filter_rows&lt;/code&gt;, &lt;code&gt;map_values&lt;/code&gt;, &lt;code&gt;convert_type&lt;/code&gt;. The LLM doesn't need to implement these - it just needs to know they exist.&lt;/p&gt;
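&lt;p&gt;A sketch of what that payload can look like - sample rows serialized as CSV (cheaper in tokens than JSON), the target schema, and the toolkit list. The function name and prompt wording here are illustrative, not our exact production prompt:&lt;/p&gt;

```python
import io
import csv
import json

# Toolkit op names from the article; the LLM only needs to know they exist.
TOOLKIT = ["rename_column", "parse_date", "split_name",
           "filter_rows", "map_values", "convert_type"]

def build_planner_prompt(sample_rows, target_schema):
    """Assemble the context the LLM needs to propose a transform config."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(sample_rows[0].keys()))
    writer.writeheader()
    writer.writerows(sample_rows)
    return "\n\n".join([
        "Propose a cleaning plan as JSON, using only these ops: "
        + ", ".join(TOOLKIT),
        "Target schema (JSON Schema):",
        json.dumps(target_schema, indent=2),
        "Sample rows (CSV):",
        buf.getvalue().strip(),
    ])

rows = [{"OWNERNAME1": "SMITH, JOHN A JR", "SITUS_ADDR": "123 MAIN ST"}]
schema = {"properties": {"owner_first": {"type": "string"}}}
prompt = build_planner_prompt(rows, schema)
```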

&lt;p&gt;&lt;strong&gt;Step 4: AI generates the plan.&lt;/strong&gt; The LLM responds with an ordered list of transformations. In our production system, this is a JSON config:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"transforms"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"rename"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"OWNERNAME1"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"owner_raw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"SITUS_ADDR"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"address"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"map_values"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"prop_class"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"mapping"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"01"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Single Family"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"02"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Mobile Home"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"03"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Condo"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"split_name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"source"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"owner_raw"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
     &lt;/span&gt;&lt;span class="nl"&gt;"targets"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"owner_first"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"owner_last"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"trust_flag"&lt;/span&gt;&lt;span class="p"&gt;]},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"filter"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"expr"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"address_type == 'situs'"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nl"&gt;"op"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"convert"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"column"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"assessed_value"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"to"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"float"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"null_value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This config could be a Python script, a SQL query, or a &lt;code&gt;dbt&lt;/code&gt; model. The format matters less than the pattern: it's declarative, versionable, and deterministic. Same input always produces same output.&lt;/p&gt;
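&lt;p&gt;To make that concrete, here's a minimal executor sketch for a config like the one above: a dispatch table over &lt;code&gt;pandas&lt;/code&gt; operations. The op names follow the example config; the implementations are illustrative, not our production code:&lt;/p&gt;

```python
import pandas as pd

def apply_config(df, config):
    """Apply an ordered list of declarative transforms to a DataFrame."""
    for step in config["transforms"]:
        op = step["op"]
        if op == "rename":
            df = df.rename(columns=step["mapping"])
        elif op == "map_values":
            col = step["column"]
            # unmapped codes fall through unchanged
            df[col] = df[col].map(step["mapping"]).fillna(df[col])
        elif op == "filter":
            df = df.query(step["expr"])
        elif op == "convert":
            col = step["column"]
            df[col] = pd.to_numeric(df[col], errors="coerce")
            df[col] = df[col].fillna(step.get("null_value", 0))
        else:
            raise ValueError(f"unknown op: {op}")
    return df

df = pd.DataFrame({"OWNERNAME1": ["SMITH, JOHN"], "prop_class": ["01"]})
config = {"transforms": [
    {"op": "rename", "mapping": {"OWNERNAME1": "owner_raw"}},
    {"op": "map_values", "column": "prop_class",
     "mapping": {"01": "Single Family"}},
]}
clean = apply_config(df, config)
```

&lt;p&gt;Rejecting unknown ops is part of the safety story: the LLM can only combine operations a human has already vetted.&lt;/p&gt;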

&lt;p&gt;&lt;strong&gt;Step 5: Validate, execute, monitor.&lt;/strong&gt; Run the plan on a subset. Check quality gates (more on those next). If gates fail, send the failures back to the AI and let it adjust the plan. Once the subset passes, run on the full dataset. Check gates again.&lt;/p&gt;

&lt;p&gt;This is why it scales. The expensive part (LLM analysis) happens once per data source. The cheap part (&lt;code&gt;pandas&lt;/code&gt; applying the config) runs on every row. When the same county sends next month's data in the same format, the config runs without any AI involvement at all. When a new county arrives with a completely different layout, AI generates a new config.&lt;/p&gt;

&lt;p&gt;In our production system, we run 100+ county configurations. Each one was originally created by a human engineer studying the data. But the process that human follows - look at samples, understand the schema, write the mapping - is exactly what an AI agent can automate. The engineer's job shifts from "write the config" to "review the AI's config and approve it."&lt;/p&gt;

&lt;h2&gt;
  
  
  How Do You Know the Data Is Actually Clean?
&lt;/h2&gt;

&lt;p&gt;You don't trust AI-generated code without tests. You shouldn't trust AI-generated transformation plans without quality gates either. The principle is the same: automate the work, verify the result.&lt;/p&gt;

&lt;p&gt;In our production pipeline, every data load runs through four layers of checks before anything hits the database.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Schema gates&lt;/strong&gt; are the baseline. Every field matches its expected type. IDs are unique. Required fields aren't null. Dates parse to ISO format. Foreign keys point at existing records. If your Pydantic model says &lt;code&gt;zip_code&lt;/code&gt; must match &lt;code&gt;^\d{5}$&lt;/code&gt; and 3% of rows have "N/A" in that field, the gate catches it before those rows propagate.&lt;/p&gt;
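&lt;p&gt;A schema gate can be as small as running every cleaned row through the Pydantic model and collecting the violations. A sketch, trimmed to two fields and assuming Pydantic v2:&lt;/p&gt;

```python
from pydantic import BaseModel, Field, ValidationError

class Record(BaseModel):
    parcel_id: str
    zip_code: str = Field(pattern=r"^\d{5}(-\d{4})?$")

def schema_gate(rows):
    """Return the rows that violate the contract, with Pydantic's errors."""
    failures = []
    for i, row in enumerate(rows):
        try:
            Record(**row)
        except ValidationError as exc:
            failures.append({"row": i, "errors": exc.errors()})
    return failures

bad = schema_gate([
    {"parcel_id": "12-34", "zip_code": "32801"},
    {"parcel_id": "12-35", "zip_code": "N/A"},  # caught by the pattern
])
```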

&lt;p&gt;&lt;strong&gt;Shape gates&lt;/strong&gt; catch structural problems. On a refresh (re-processing the same data source), parcel count shouldn't change by more than 10%. If it does, someone uploaded the wrong file, or the schema changed, or data got truncated. We flag anything above 5% as a warning and above 10% as an error that blocks ingestion.&lt;/p&gt;
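&lt;p&gt;The drift check itself is a few lines. A sketch with the thresholds above, using &lt;code&gt;bisect&lt;/code&gt; to bucket the drift into ok / warning / error:&lt;/p&gt;

```python
import bisect

def shape_gate(previous_count, new_count):
    """Classify a refresh by row-count drift against the previous load."""
    drift = abs(new_count - previous_count) / previous_count
    level = bisect.bisect([0.05, 0.10], drift)  # 0, 1, or 2
    return ["ok", "warning", "error"][level]

shape_gate(100_000, 102_000)  # 2% drift: ok
shape_gate(100_000, 85_000)   # 15% drift: blocks ingestion
```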

&lt;p&gt;&lt;strong&gt;Distribution gates&lt;/strong&gt; are where it gets interesting. After parsing 50,000 owner names, we check the top 20 most common first names. If "SMITH" appears as a first name, the parser swapped first and last name columns. If more than 5 of the top 20 first names are uncommon (not in a standard frequency list), the parser is splitting names wrong - probably treating "JOHN MICHAEL" as first="JOHN MICHAEL" instead of first="JOHN", middle="MICHAEL".&lt;/p&gt;
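&lt;p&gt;A sketch of the first-name check. The real frequency list comes from standard name data; the tiny sets here are stand-ins:&lt;/p&gt;

```python
from collections import Counter

COMMON_FIRST = {"JOHN", "MARY", "JAMES", "LINDA", "ROBERT", "PATRICIA"}
KNOWN_SURNAMES = {"SMITH", "JOHNSON", "WILLIAMS"}

def first_name_gate(first_names):
    """Flag systematic parsing problems in a parsed first-name column."""
    top = [name for name, _ in Counter(first_names).most_common(20)]
    problems = []
    if any(name in KNOWN_SURNAMES for name in top):
        problems.append("surname in first-name column: columns swapped?")
    uncommon = sum(1 for name in top if name not in COMMON_FIRST)
    if min(uncommon, 6) == 6:  # more than 5 of the top 20 look odd
        problems.append("too many uncommon first names: bad splitting?")
    return problems

first_name_gate(["SMITH"] * 30 + ["JOHN"] * 25 + ["MARY"] * 20)
```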

&lt;p&gt;We run similar checks on address geocoding: after standardizing addresses through an address validation API (we use SmartyStreets), at least 60% should validate. Below 40% means a systematic parsing problem - wrong column, encoding mismatch, or format the parser doesn't recognize. On exemption rates, 25-40% of residential parcels typically have a homestead exemption. Below 10% means exemption data is missing entirely. Above 60% means the parser is incorrectly flagging non-exempt parcels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Business logic gates&lt;/strong&gt; check relationships across tables. Every owner record must reference an existing parcel. Every transaction must reference an existing parcel. Exemption codes must exist in the reference table. These catch join errors and mapping mistakes that look fine in isolation but break when you query across tables.&lt;/p&gt;

&lt;p&gt;The thresholds (10%, 40%, 60%) come from processing 100+ data sources over several years. They encode what "normal" looks like. Data is never 100% accurate - there's always noise, always edge cases. The goal isn't perfection. It's catching &lt;em&gt;systematic&lt;/em&gt; problems early, before bad data propagates through the system and produces wrong results downstream.&lt;/p&gt;

&lt;p&gt;Here's the part that connects to the AI pipeline: when a gate fails, the system doesn't crash. It reports which gate failed, which rows caused the failure, and what looks wrong. That report feeds back to the AI agent, which can analyze the failures and adjust the transformation plan. Human reviews the adjustment, approves, and the pipeline reruns. This is the feedback loop that makes the system trustworthy: AI generates the plan, gates verify it, failures get fixed, and each iteration improves the result.&lt;/p&gt;
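&lt;p&gt;The control flow of that loop is simple. Here's a sketch with the LLM call and the gate suite stubbed out, so only the retry logic is real:&lt;/p&gt;

```python
def run_gates(config):
    """Stub: pretend the gates pass once the filter step is present."""
    ops = [step["op"] for step in config["transforms"]]
    return [] if "filter" in ops else ["shape gate: row count doubled"]

def propose_fix(config, failures):
    """Stub: the real version sends the failure report back to the LLM."""
    extra = [{"op": "filter", "expr": "address_type == 'situs'"}]
    return {"transforms": config["transforms"] + extra}

def clean_with_retries(config, max_attempts=3):
    for _ in range(max_attempts):
        failures = run_gates(config)
        if not failures:
            return config, "passed"
        config = propose_fix(config, failures)
    return config, "needs human review"

final, status = clean_with_retries(
    {"transforms": [{"op": "rename", "mapping": {}}]})
# one adjustment later, status == "passed"
```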

&lt;p&gt;This isn't academic idealism. &lt;a href="https://arxiv.org/pdf/2512.04123" rel="noopener noreferrer"&gt;Research on production agent deployments&lt;/a&gt; shows that 68% need human intervention within 10 steps. The best AI systems aren't fully autonomous - they're semi-autonomous with clear checkpoints. Quality gates are those checkpoints for data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Name Parsing - Where the Hybrid Pattern Clicks
&lt;/h2&gt;

&lt;p&gt;Let me show you the "AI as planner" pattern with a concrete case study.&lt;/p&gt;

&lt;p&gt;Our data has a &lt;code&gt;full_name&lt;/code&gt; field. It contains everything from "John Smith" to "SMITH, JOHN A JR &amp;amp; MARY B SMITH-JONES TTEE OF THE SMITH FAMILY TRUST DATED 01/01/2020." We need to extract: first name, last name, trust flag, corporate flag. For every record.&lt;/p&gt;

&lt;p&gt;The traditional approach is a regex-based parser. Ours handles the common patterns well: "JOHN SMITH," "SMITH, JOHN A JR," "DR JOHN MICHAEL SMITH," Hispanic compounds like "DE LA CRUZ." It scores each parse result with a confidence value based on how well the tokens match known name patterns.&lt;/p&gt;

&lt;p&gt;About 80% of names parse cleanly with high confidence. That's 40,000 out of 50,000 records handled instantly by deterministic code. No AI involved, no API cost, no latency.&lt;/p&gt;

&lt;p&gt;The remaining 20% is where regex breaks down. Trust names: "DEBBIE DAVIDSON REVOCABLE TRUST." Multiple owners in a trust: "WALSH JOSEPH T &amp;amp; NHUNG REVOCABLE TRUST." Corporate entities: "M &amp;amp; M MANAGEMENT LLC." Names where context matters: is "LE" a Vietnamese surname or a legal abbreviation?&lt;/p&gt;

&lt;p&gt;These low-confidence results get routed to an LLM. The prompt includes the raw name string, the target schema (first_name, last_name, trust_flag, is_corporate), and a few examples of correct parsing. The LLM handles the ambiguity that regex can't.&lt;/p&gt;
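&lt;p&gt;The routing is a small function. This sketch uses a toy regex and confidence score as stand-ins for the production parser; the two-tier shape is the point:&lt;/p&gt;

```python
import re
import bisect

SIMPLE = re.compile(r"^([A-Z]+), ([A-Z]+)$")  # e.g. "SMITH, JOHN"

def regex_parse(raw):
    """Tier 1: deterministic parse with a confidence score."""
    match = SIMPLE.match(raw)
    if match:
        return {"first": match.group(2), "last": match.group(1)}, 0.95
    return {"raw": raw}, 0.2  # couldn't parse

def route_name(raw, llm_parse):
    parsed, confidence = regex_parse(raw)
    if bisect.bisect([0.8], confidence) == 1:  # confidence at/above 0.8
        return parsed, "regex"
    return llm_parse(raw), "llm"  # hard case: route to tier 2

def fake_llm(raw):
    return {"first": "?", "last": "?", "trust_flag": True}

route_name("SMITH, JOHN", fake_llm)                      # stays in tier 1
route_name("DEBBIE DAVIDSON REVOCABLE TRUST", fake_llm)  # routed to LLM
```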

&lt;p&gt;Out of those 10,000 LLM-routed names, roughly 9,500 resolve cleanly. The remaining 500 get flagged for human review - a queue that a person can clear in a few hours instead of manually fixing all 10,000.&lt;/p&gt;

&lt;p&gt;The cost math: the LLM portion costs about $2 for 10,000 names. The alternative is hiring someone to manually parse 10,000 complex names, which takes days.&lt;/p&gt;

&lt;p&gt;The pattern generalizes beyond names. Any data cleaning task where 80% is predictable and 20% is ambiguous fits this model: let traditional tools handle the easy cases, route the hard cases to AI, validate everything against the schema, and flag what neither can handle. You're not replacing the regex parser. You're adding a second tier that handles what the regex parser can't.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stop Copy-Pasting Into ChatGPT
&lt;/h2&gt;

&lt;p&gt;There's a difference between using AI as a chat tool and using AI as a worker. Most "AI data cleaning" tutorials show the chat version: open ChatGPT, paste 50 rows, ask it to clean them, copy the result back. That works for one-off tasks the way manual Excel cleanup works - fine for small, non-recurring jobs.&lt;/p&gt;

&lt;p&gt;The production version looks completely different from the user's perspective. Here's what actually happens when someone uploads data to our system:&lt;/p&gt;

&lt;p&gt;A user receives a link. They drop their files - a ZIP of CSVs, an Excel workbook, a folder of PDFs. They don't know or care about our schema. They just know they need to get their data into the system.&lt;/p&gt;

&lt;p&gt;Behind the scenes: the system unpacks the archive and identifies file formats. An AI agent examines samples from each file, detects what each column likely represents, and generates a transformation config. The system runs that config, quality gates verify the output, and if something fails, the agent adjusts and retries. If it still can't resolve the issue, a human gets a specific report: "45% of addresses didn't geocode. Likely cause: PO boxes mixed with street addresses. 1,247 rows affected."&lt;/p&gt;

&lt;p&gt;The user who uploaded never sees any of this. They get "success" or "we need to clarify a few things."&lt;/p&gt;

&lt;p&gt;The key difference from the chat approach: an AI agent uses tools. It doesn't try to hold a million rows in its context window. It calls &lt;code&gt;pandas&lt;/code&gt; to read files, runs SQL to check distributions, executes validation functions, and writes configs. It iterates - try, check, adjust, retry - without a human in the loop for each step.&lt;/p&gt;

&lt;p&gt;An agent processing 1,000 files doesn't pipe each file through the LLM. It categorizes files by format and problem type, creates a plan per category, and executes plans with traditional tools. The LLM might make 20 API calls total for 1,000 files. The rest is &lt;code&gt;pandas&lt;/code&gt;, &lt;code&gt;Pydantic&lt;/code&gt;, and SQL doing what they do best: processing data fast and deterministically.&lt;/p&gt;

&lt;p&gt;This isn't a chatbot. It doesn't need a conversation UI. It doesn't need your attention. It reads messy files, figures out what's wrong, writes the fix, tests it, and leaves a report. When it can't fix something, it tells you exactly what's wrong and why. That's the kind of AI I'm excited about - not the one that talks to you, but the one that works for you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Realities
&lt;/h2&gt;

&lt;p&gt;Let me be honest about the parts that aren't elegant.&lt;/p&gt;

&lt;p&gt;Sending every row through an LLM doesn't survive contact with the finance team. One million CSV rows at 500 tokens each through a budget API ($0.10/M tokens) costs $50 per run. Through GPT-4, it's closer to $5,000. Run that weekly across 50 data sources and you're looking at a six-figure annual line item for data cleaning alone. The config-generation approach costs a few dollars per data source (one-time LLM analysis of samples) plus pennies for &lt;code&gt;pandas&lt;/code&gt; execution on every run. That's 3-4 orders of magnitude cheaper. One practical tip: if you are sending data to an LLM, send it as CSV, not JSON. CSV uses 50-56% fewer tokens for the same data and LLMs parse it just as accurately.&lt;/p&gt;
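&lt;p&gt;The CSV-versus-JSON difference is easy to demonstrate: JSON repeats every key on every row, while CSV states the header once. Character count is only a rough proxy for token count, but the gap is the same order:&lt;/p&gt;

```python
import io
import csv
import json

rows = [
    {"parcel_id": f"12-34-{i:04d}", "owner": "SMITH JOHN",
     "assessed_value": 250000}
    for i in range(100)
]

as_json = json.dumps(rows)

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=list(rows[0].keys()))
writer.writeheader()
writer.writerows(rows)
as_csv = buf.getvalue()

savings = 1 - len(as_csv) / len(as_json)
print(f"CSV is {savings:.0%} smaller than JSON")
```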

&lt;p&gt;Rate limits compound the cost problem. We've had API providers throttle us during burst processing. Fifty thousand records hitting an LLM endpoint floods the rate limit within minutes. Batch APIs help (50% discount, higher throughput), but the latency goes from seconds to hours. If you need data cleaned before a morning deadline, that's a problem.&lt;/p&gt;

&lt;p&gt;LLMs also aren't deterministic. Same prompt, different run, slightly different output. For &lt;em&gt;planning&lt;/em&gt; this is fine - the generated config is deterministic once it exists. For &lt;em&gt;row-level processing&lt;/em&gt; it's a deal-breaker. I've seen the same name parsed as "John Smith Jr" in one call and "Smith, John Junior" in the next. Configs don't have this problem. They run the same way every time.&lt;/p&gt;

&lt;p&gt;And there are things AI simply can't do yet. Deduplication across records is still hard - a &lt;a href="https://arxiv.org/html/2511.21708" rel="noopener noreferrer"&gt;2025 evaluation&lt;/a&gt; found that LLMs excel at standardization and profiling but fail at non-exact deduplication. "J. Smith at 123 Main" vs "John Smith at 123 Main St Apt 4" requires comparing millions of pairs, and LLMs are both too slow and too inconsistent for this. Use &lt;code&gt;dedupe&lt;/code&gt; or &lt;code&gt;recordlinkage&lt;/code&gt; for fuzzy matching. AI can help &lt;em&gt;define&lt;/em&gt; the matching rules, but the execution needs traditional algorithms.&lt;/p&gt;
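&lt;p&gt;The blocking-plus-scoring idea behind those libraries can be sketched with the stdlib alone. &lt;code&gt;SequenceMatcher&lt;/code&gt; stands in for a real similarity scorer, and first-token blocking stands in for real blocking keys:&lt;/p&gt;

```python
from difflib import SequenceMatcher
from itertools import combinations

def block_key(record):
    """Crude blocking key: first token of the address."""
    return record["address"].split()[0]

def find_duplicates(records, threshold=0.75):
    """Compare only records sharing a block, never all N^2 pairs."""
    blocks = {}
    for rec in records:
        blocks.setdefault(block_key(rec), []).append(rec)
    pairs = []
    for group in blocks.values():
        for a, b in combinations(group, 2):
            joined_a = a["name"] + " " + a["address"]
            joined_b = b["name"] + " " + b["address"]
            score = SequenceMatcher(None, joined_a, joined_b).ratio()
            if min(score, threshold) == threshold:  # score at/above cutoff
                pairs.append((a["name"], b["name"], round(score, 2)))
    return pairs

records = [
    {"name": "J. Smith", "address": "123 Main St"},
    {"name": "John Smith", "address": "123 Main St Apt 4"},
    {"name": "Mary Jones", "address": "99 Oak Ave"},
]
find_duplicates(records)  # the two Smith records pair up
```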

&lt;p&gt;For high-volume or privacy-sensitive pipelines, self-hosting (Ollama for development, vLLM for production, cloud GPUs for burst) eliminates rate limits and API costs. The break-even is roughly 2 million tokens per day at 70%+ utilization. Below that, APIs are cheaper when you factor in infrastructure overhead.&lt;/p&gt;

&lt;p&gt;The upside that makes all of this worthwhile is resilience. Traditional pipelines are brittle - one unexpected column name, one new encoding, one schema change, and the pipeline crashes at 3 AM. AI-assisted pipelines degrade gracefully. New format arrives? The agent analyzes the sample and generates a new config. Encoding changed? Detected and handled. Unseen edge case? Flagged for review instead of silently producing garbage. Each new data source makes the system smarter, not more fragile. The library of configs grows, the quality gate thresholds get refined, and the percentage of cases handled automatically increases over time.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Silent Helper
&lt;/h2&gt;

&lt;p&gt;Everyone is tired of chatbots. "Talk to AI" has become the default pitch for every product, every feature, every startup. But there's another kind of AI that doesn't get the hype and does most of the useful work.&lt;/p&gt;

&lt;p&gt;It doesn't have a chat interface. It doesn't need your attention. It reads 500 messy files while you sleep, figures out what's wrong with each one, writes the transformation config, tests it against quality gates, and leaves a report on your desk in the morning. When it can't fix something, it tells you exactly what's wrong and why. When a new format arrives that it's never seen before, it adapts instead of crashing.&lt;/p&gt;

&lt;p&gt;What previously required a week of an onboarding engineer's time - staring at spreadsheets, guessing at column meanings, writing one-off scripts, debugging encoding issues - now takes an afternoon of supervised automation. The human becomes the reviewer, not the laborer. That's a better use of judgment.&lt;/p&gt;

&lt;p&gt;The three-year lesson is simple. AI is best as a planner and edge-case resolver, not as a row-level processor. Define your target schema. Sample the data. Let AI generate the transformation plan. Verify with quality gates. Execute at scale with traditional tools. Handle exceptions. That's the pattern, and it works.&lt;/p&gt;

&lt;p&gt;This approach isn't revolutionary - the industry has converged on it independently. Every major data platform shipped some version of "AI suggests, human approves" in 2025. What's still missing is the practitioner knowledge: what thresholds to set, how to structure the feedback loop, when to route to AI versus when traditional tools are sufficient. That's what I tried to share here.&lt;/p&gt;

&lt;p&gt;As agents get better at using tools, this pattern will become standard infrastructure. Today it's an engineering pattern you build. Soon it'll be a checkbox in your data platform. But the fundamentals - target schemas, quality gates, confidence-based routing - those don't change. They're the boring, essential foundation that makes the AI part trustworthy.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;This article was created with the assistance of Claude Code.&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>data</category>
      <category>python</category>
      <category>production</category>
    </item>
  </channel>
</rss>
