Why Your Agent Works Locally But Fails in Production

It worked on your machine. It doesn't work in production.

Classic.

AI agents have their own special reasons for this. Here's what's actually going wrong.

The usual suspects

1. Environment variables aren't there

Local:

export OPENAI_API_KEY="sk-..."
export DATABASE_URL="postgres://localhost/dev"

Production:

Error: OPENAI_API_KEY not found

You forgot to set them. Or you set them wrong. Or they're in the wrong format.

# This works locally
api_key = os.environ["OPENAI_API_KEY"]

# This crashes in production if not set
# KeyError: 'OPENAI_API_KEY'

Fix:

import os

# Fail fast with a clear error
api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise ValueError("OPENAI_API_KEY environment variable is required")

2. File paths are wrong

Local:

config = open("./config.yaml")  # Works from project root

Production:

FileNotFoundError: ./config.yaml

Your working directory is different in production.

Fix:

from pathlib import Path

# Relative to this file, not working directory
CONFIG_PATH = Path(__file__).parent / "config.yaml"
config = open(CONFIG_PATH)

3. Network is different

Local:

localhost:5432 → PostgreSQL ✓
localhost:6379 → Redis ✓
api.example.com → Internet ✓

Production:

localhost:5432 → Nothing (database is on different host)
db.internal:5432 → PostgreSQL (but you're still using localhost)

Fix:

import os

# Use environment variables for all hosts
DATABASE_HOST = os.environ.get("DATABASE_HOST", "localhost")
REDIS_HOST = os.environ.get("REDIS_HOST", "localhost")

4. Permissions are different

Local:

# You're running as yourself with full access
./script.sh  # Works

Production:

# Running as restricted user
./script.sh  # Permission denied

Fix:

# Dockerfile
RUN chmod +x /app/scripts/*.sh
USER appuser  # Test with restricted user locally too

Agent-specific problems

5. Tool binaries aren't installed

Your MCP tools call system binaries.

Local:

- name: search
  script:
    shell: rg "{{query}}" .  # ripgrep installed

Production:

Command not found: rg

Fix:

# Dockerfile
RUN apt-get update && apt-get install -y \
    ripgrep \
    jq \
    curl

Or use Gantz Run, which handles tool dependencies.
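
You can also fail fast at startup instead of at the first tool call. A minimal sketch, assuming the binary list is known up front (REQUIRED_BINARIES and check_binaries are names invented for this example):

import shutil

# Hypothetical startup guard: verify system binaries before serving traffic
REQUIRED_BINARIES = ["rg", "jq", "curl"]

def check_binaries():
    missing = [b for b in REQUIRED_BINARIES if shutil.which(b) is None]
    if missing:
        raise RuntimeError(f"Missing required binaries: {', '.join(missing)}")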

6. Rate limits hit differently

Local:

You: Test one query
API: 200 OK
You: Works!

Production:

Users: 1000 queries/minute
API: 429 Too Many Requests
Agent: Crashes

Fix:

import time
from functools import wraps

def rate_limited(max_per_minute):
    min_interval = 60.0 / max_per_minute
    last_call = [0.0]  # mutable cell shared across calls; not thread-safe as written

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            elapsed = time.time() - last_call[0]
            if elapsed < min_interval:
                time.sleep(min_interval - elapsed)
            result = func(*args, **kwargs)
            last_call[0] = time.time()
            return result
        return wrapper
    return decorator

@rate_limited(max_per_minute=60)
def call_api(query):
    return api.search(query)
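
Client-side throttling alone won't help once the provider is already returning 429s. A common companion is retry with exponential backoff; a sketch, where RateLimitError stands in for whatever exception your API client actually raises:

import random
import time

class RateLimitError(Exception):
    """Placeholder: substitute your API client's 429 exception."""

def with_backoff(func, max_retries=5):
    # Retry rate-limited calls, doubling the wait each attempt plus jitter
    for attempt in range(max_retries):
        try:
            return func()
        except RateLimitError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt + random.random())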

7. Timeouts are too short

Local:

Tool execution: 2 seconds (fast local disk)
Timeout: 30 seconds
Result: Success

Production:

Tool execution: 45 seconds (slow network storage)
Timeout: 30 seconds
Result: Timeout error

Fix:

import os

# Environment-aware timeouts
TIMEOUT = int(os.environ.get("TOOL_TIMEOUT", "30"))

# Or dynamic based on operation
def get_timeout(operation):
    timeouts = {
        "quick_search": 10,
        "file_read": 30,
        "large_query": 120,
        "external_api": 60
    }
    return timeouts.get(operation, 30)
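
Whichever way you compute it, the timeout only matters if it's actually passed to the call. A sketch using requests (the URL and payload are placeholders):

import requests

def call_tool(url, payload, operation="external_api"):
    # requests raises requests.exceptions.Timeout when the limit is exceeded
    return requests.post(url, json=payload, timeout=get_timeout(operation))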

8. Memory limits

Local:

RAM: 32GB
Load entire codebase: Works

Production:

Container RAM: 512MB
Load entire codebase: OOM Killed

Fix:

from pathlib import Path

# Stream instead of loading everything at once
def search_files(pattern):
    # Bad: materializes every path in memory
    # all_files = list(Path(".").rglob("*"))

    # Good: yields matches one file at a time
    for file in Path(".").rglob("*"):
        if not file.is_file():
            continue
        try:
            if pattern in file.read_text():
                yield file
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files

9. Concurrent requests

Local:

You: One request at a time
Agent: Handles it fine

Production:

Users: 50 concurrent requests
Agent: Race conditions, shared state corruption

Fix:

# Don't use global mutable state
class Agent:
    def __init__(self):
        self.context = {}  # Instance state, not global

# Or use proper locks
import threading

class SharedResource:
    def __init__(self):
        self.lock = threading.Lock()
        self.data = {}

    def update(self, key, value):
        with self.lock:
            self.data[key] = value

10. Different model behavior

Local:

Model: gpt-4-turbo-preview (latest)
Response: Perfect

Production:

Model: gpt-4-turbo-preview (but it's a different version now!)
Response: Slightly different, breaks your parsing

Fix:

import json
import re

# Pin model versions
MODEL = "gpt-4-0125-preview"  # specific version, not "latest"

# And handle response variations
def parse_response(response):
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        # Model didn't return JSON; try to extract an embedded object
        match = re.search(r"\{.*\}", response, re.DOTALL)
        if match:
            return json.loads(match.group())
        raise

11. Cold starts

Local:

Agent: Already running, warm
First request: 100ms

Production (serverless):

Agent: Cold, needs to initialize
First request: 5000ms (timeout!)

Fix:

import asyncio

# Lazy loading for heavy dependencies
_model = None

def get_model():
    global _model
    if _model is None:
        _model = load_model()  # only runs on the first call
    return _model

# Or pre-warm in the background (FastAPI startup hook shown)
@app.on_event("startup")
async def startup():
    asyncio.create_task(warm_up_agent())
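
warm_up_agent above is whatever pre-loads your heavy state; one possible shape, reusing the lazy get_model from the same snippet:

import asyncio

async def warm_up_agent():
    # Run the blocking model load in a worker thread so startup isn't blocked
    await asyncio.to_thread(get_model)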

12. Log verbosity

Local:

logging.basicConfig(level=logging.DEBUG)
# See everything, debug easily

Production:

logging.basicConfig(level=logging.ERROR)
# See nothing, debug impossible

Fix:

# Structured logging with appropriate levels
import structlog

logger = structlog.get_logger()

def handle_tool_call(tool, params):
    logger.info("tool_call_start", tool=tool, params=params)

    try:
        result = execute(tool, params)
        logger.info("tool_call_success", tool=tool, result_size=len(str(result)))
        return result
    except Exception as e:
        logger.error("tool_call_failed", tool=tool, error=str(e), exc_info=True)
        raise

The checklist

Before deploying, verify:

Environment

  • [ ] All environment variables set
  • [ ] Correct values (not copy-pasted local values)
  • [ ] Secrets in secure storage (not in code)

Dependencies

  • [ ] All system binaries installed
  • [ ] Correct versions of packages
  • [ ] No dependencies that exist only on your dev machine

Network

  • [ ] Correct hostnames for databases, APIs
  • [ ] Firewall allows outbound connections
  • [ ] DNS resolves correctly

Resources

  • [ ] Adequate memory limits
  • [ ] Adequate CPU
  • [ ] Disk space for temp files

Timeouts

  • [ ] API timeouts appropriate for production latency
  • [ ] Tool execution timeouts realistic
  • [ ] Request timeouts configured

Concurrency

  • [ ] No shared mutable state
  • [ ] Connection pools sized correctly
  • [ ] Rate limiting in place

Observability

  • [ ] Logging configured
  • [ ] Metrics exposed
  • [ ] Errors reported
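
Several of these checks can be automated at boot so a bad deploy fails loudly instead of limping along. A minimal sketch (the variable list is invented for this example):

import os

REQUIRED_ENV_VARS = ["OPENAI_API_KEY", "DATABASE_HOST", "REDIS_HOST"]

def validate_environment():
    # Report every missing variable at once, not just the first
    missing = [v for v in REQUIRED_ENV_VARS if not os.environ.get(v)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")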

Testing for production

Test with production-like config

# tests/conftest.py
import pytest

@pytest.fixture
def production_like_env(monkeypatch):
    """Simulate production environment."""
    # Restrictive timeouts
    monkeypatch.setenv("TOOL_TIMEOUT", "5")
    # Memory pressure
    monkeypatch.setenv("MAX_CONTEXT_SIZE", "1000")
    # No debug mode
    monkeypatch.setenv("DEBUG", "false")

Test failure cases

from unittest import mock

def test_handles_api_timeout():
    with mock.patch("api.call", side_effect=TimeoutError):
        result = agent.run("do something")
        assert result.status == "error"
        assert "timeout" in result.message.lower()

def test_handles_missing_tool():
    agent = Agent(tools=[])  # no tools registered
    result = agent.run("use a tool")
    assert result.status == "error"
    assert "tool" in result.message.lower()

Load testing

import asyncio
import time

import aiohttp

async def load_test(n_requests=100, concurrency=10):
    semaphore = asyncio.Semaphore(concurrency)

    async def single_request():
        async with semaphore:
            async with aiohttp.ClientSession() as session:
                start = time.time()
                # aiohttp needs an absolute URL; adjust host/port for your deployment
                async with session.post("http://localhost:8000/agent", json={"query": "test"}) as resp:
                    duration = time.time() - start
                    return resp.status, duration

    results = await asyncio.gather(*[single_request() for _ in range(n_requests)])

    success = sum(1 for status, _ in results if status == 200)
    avg_time = sum(d for _, d in results) / len(results)

    print(f"Success rate: {success}/{n_requests}")
    print(f"Average time: {avg_time:.2f}s")

Debugging production

Add request tracing

import uuid

class Agent:
    def run(self, query, request_id=None):
        request_id = request_id or str(uuid.uuid4())

        logger.info("request_start", request_id=request_id, query=query[:100])

        try:
            result = self._execute(query)
            logger.info("request_success", request_id=request_id)
            return result
        except Exception as e:
            logger.error("request_failed", request_id=request_id, error=str(e))
            raise

Capture failing inputs

import json
import time
import traceback

def run_with_capture(self, query):
    try:
        return self.run(query)
    except Exception as e:
        # Save the failing case so it can be replayed later
        with open(f"/var/log/agent/failed_{time.time()}.json", "w") as f:
            json.dump({
                "query": query,
                "error": str(e),
                "traceback": traceback.format_exc(),
                "context": self.context,
            }, f)
        raise

Health check endpoint

@app.get("/health")
async def health():
    checks = {
        "database": check_database(),
        "api_key": bool(os.environ.get("OPENAI_API_KEY")),
        "tools": check_tools_available(),
        "memory": get_memory_usage() < 90,  # percent
    }

    healthy = all(checks.values())

    return {
        "status": "healthy" if healthy else "unhealthy",
        "checks": checks
    }
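
check_database, check_tools_available, and get_memory_usage are assumed above but never defined; possible shapes, with psycopg2 and psutil as stand-in libraries:

import os
import shutil

import psutil  # assumption: psutil is installed in the image
import psycopg2  # assumption: PostgreSQL accessed via psycopg2

def check_database():
    # Open and immediately close a connection; any failure means unhealthy
    try:
        conn = psycopg2.connect(os.environ["DATABASE_URL"])
        conn.close()
        return True
    except Exception:
        return False

def check_tools_available():
    # True if every required binary is on PATH (matching the Dockerfile above)
    return all(shutil.which(b) is not None for b in ["rg", "jq", "curl"])

def get_memory_usage():
    # Percent of system memory in use; note this sees the host, not cgroup limits
    return psutil.virtual_memory().percent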

Summary

Your agent fails in production because:

Environment:

  • Missing env vars
  • Wrong file paths
  • Different network topology
  • Restricted permissions

Scale:

  • Rate limits hit
  • Timeouts too short
  • Memory limits exceeded
  • Concurrent request issues

Dependencies:

  • Missing binaries
  • Version mismatches
  • Model behavior differences

Fix it by:

  • Using environment variables for everything configurable
  • Testing with production-like constraints
  • Adding comprehensive logging
  • Implementing proper error handling
  • Load testing before deploy

The agent isn't broken. The environment is different.


What's the weirdest production-only bug you've encountered with AI agents?
