Daniel Gustafsson

One Redis Instance, Three Jobs: DevOps for AI Agents Without the Overkill

I've built my share of microservices over the years. It usually ends the same way: every service has its own database, its own cache, its own queue, and a 200-line YAML file just to hold everything together in Docker Compose.

When I started experimenting with AI agents, I expected the same story. A vector database here, a message queue there, a cache service, a state store. But it turned out that Redis Stack handles all of it. And it simplifies operations more than I expected.

The entire infrastructure: one Docker Compose file

services:
  redis:
    image: redis/redis-stack:latest
    ports:
      - "6379:6379"
      - "8001:8001"  # RedisInsight
    volumes:
      - redis_data:/data
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_models:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - capabilities: [gpu]
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:11434/api/tags || exit 1"]
      interval: 10s
      timeout: 5s
      retries: 3

volumes:
  redis_data:
  ollama_models:

That is it. Two containers. No Postgres, no Pinecone, no RabbitMQ, no separate Memcached. Redis handles all three jobs and Ollama runs models locally.

What Redis actually does (and why it is elegant)

1. Conversation memory (checkpointer)

LangGraph needs somewhere to persist conversation state, such as which messages were sent, which tools were called, and what the results were. RedisSaver handles that:

with RedisSaver.from_conn_string("redis://localhost:6379") as checkpointer:
    checkpointer.setup()
    agent = create_react_agent(..., checkpointer=checkpointer)

Each thread (thread_id) gets its own history. Restart the agent and reconnect to the same thread? It picks up right where it left off. No manual serialization, no migrations.
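In LangGraph, the thread is selected per call through the `configurable` section of the config. A minimal sketch, assuming the agent from the setup above (the thread id `"user-42"` is an arbitrary example):

```python
# Resuming a conversation by thread_id with a Redis-backed checkpointer.
def thread_config(thread_id: str) -> dict:
    # LangGraph reads the thread id from the "configurable" section.
    return {"configurable": {"thread_id": thread_id}}

# First run:
# agent.invoke({"messages": [("user", "What is WCAG?")]},
#              config=thread_config("user-42"))
#
# After a restart, the same call with the same thread_id picks up the
# full history from Redis -- no reload logic in the application.
```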

From an ops perspective: these are just regular Redis keys. You can inspect them in RedisInsight, set TTL to automatically clean up old conversations, and monitor memory usage with INFO memory.

2. Long term memory (vector index)

RedisStore with a vector index gives you semantic search. Data is stored with embeddings and can be queried based on meaning, not exact string matching.

with RedisStore.from_conn_string(
    "redis://localhost:6379",
    index={
        "embed": embeddings,
        "dims": 768,
        "distance_type": "cosine",
        "fields": ["text"],
    },
) as store:
    store.setup()

The point from a DevOps perspective: you do not need to run a separate vector database. No Milvus, no Qdrant, no Weaviate. Redis Stack includes RediSearch out of the box, and it is more than sufficient for this kind of workload.

3. Semantic cache

The same question phrased slightly differently, like "what is WCAG?" versus "explain WCAG to me", produces the same answer. Instead of sending both through the LLM, we cache responses based on vector proximity:

from redisvl.extensions.llmcache import SemanticCache

cache = SemanticCache(
    name="llm_cache",
    redis_url="redis://localhost:6379",
    distance_threshold=0.1,
    ttl=3600,
)

A distance_threshold of 0.1 means queries with cosine distance ≤0.1 get cached responses. In practice, this means very similar questions. Where exactly to set this threshold depends on your embedding model (different models spread their vectors differently), so experiment. TTL of 3600 seconds automatically cleans up stale data.
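To build intuition for the threshold, here is the underlying math with toy 3-dimensional vectors (real embeddings have 768 dimensions, per the index config above; the numbers are made up for illustration):

```python
import math

def cosine_distance(a, b):
    # cosine distance = 1 - cosine similarity
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)

# Two toy "embeddings" of near-identical questions:
q1 = [0.90, 0.10, 0.00]   # "what is WCAG?"
q2 = [0.85, 0.15, 0.05]   # "explain WCAG to me"

d = cosine_distance(q1, q2)
print(d <= 0.1)  # within the 0.1 threshold -> cache hit
```

Identical vectors give distance 0, orthogonal ones give 1; a threshold of 0.1 therefore only catches queries whose embeddings point in almost the same direction.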

Three completely different use cases. Same redis://localhost:6379.

No state in the application

This is the biggest win for operations. The agent itself is stateless. All state lives in Redis:

  • Conversation history: Redis (checkpointer)
  • Saved memories: Redis (vector index)
  • Cached responses: Redis (semantic cache)
  • Scan history: Redis (vector index)

This means you can:

  • Restart the agent without losing anything
  • Run multiple instances behind a load balancer
  • Scale horizontally without shared state in memory
  • Deploy new versions with zero downtime (rolling update)

In practice: docker compose restart agent and the user notices nothing.

Ollama as a local inference server

Ollama abstracts away model management. You pull a model once, then it is exposed as an HTTP API:

ollama pull qwen3.5:4b      # 2.5 GB, requires ~4 GB VRAM
ollama pull nomic-embed-text  # 274 MB, for embeddings

From the agent, it looks like any other API call:

model = ChatOllama(
    model="qwen3.5:4b",
    base_url="http://ollama:11434",  # Docker service name
)

No API key management. No rate limiting. No cost per token. Models run locally on your GPU.

Want to swap models? Change an environment variable:

CHAT_MODEL=qwen3.5:4b
EMBEDDING_MODEL=nomic-embed-text

No code changes. Ollama handles the rest.
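A sketch of what that looks like on the application side: a small, hypothetical config helper that reads the environment and falls back to the defaults used in this article (`OLLAMA_BASE_URL` is an assumed variable name, not part of Ollama itself):

```python
import os

def model_config() -> dict:
    # Model names come from the environment; swapping models is a
    # redeploy with new env vars, not a code change.
    return {
        "chat_model": os.environ.get("CHAT_MODEL", "qwen3.5:4b"),
        "embedding_model": os.environ.get("EMBEDDING_MODEL", "nomic-embed-text"),
        "base_url": os.environ.get("OLLAMA_BASE_URL", "http://ollama:11434"),
    }

# cfg = model_config()
# model = ChatOllama(model=cfg["chat_model"], base_url=cfg["base_url"])
```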

Monitoring without extra tooling

Redis has built in monitoring that goes a long way:

# Memory usage
redis-cli INFO memory | grep used_memory_human

# Key count
redis-cli DBSIZE

# Live command stream
redis-cli MONITOR

# Slow queries
redis-cli SLOWLOG GET 10

RedisInsight (port 8001 in the docker compose) gives you a web UI to inspect keys, run queries, and view memory graphs. It is included in the redis-stack image.

For Ollama:

# Which models are loaded?
curl http://localhost:11434/api/tags

# How much VRAM is being used?
nvidia-smi

Need more? A Prometheus exporter exists for Redis (redis_exporter) and is straightforward to set up. But for most use cases, the built in tools are enough.
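If you want those numbers in a script (say, a cron job that alerts on memory growth) without pulling in an exporter, the INFO output is trivially parseable `key:value` text. A sketch with a sample string mimicking the real format:

```python
# Sample text in the shape of `redis-cli INFO memory` output.
SAMPLE_INFO = """\
# Memory
used_memory:1048576
used_memory_human:1.00M
maxmemory:0
"""

def parse_info(text: str) -> dict:
    # INFO sections are "# Header" lines; data lines are "key:value".
    out = {}
    for line in text.splitlines():
        if line and not line.startswith("#") and ":" in line:
            key, _, value = line.partition(":")
            out[key] = value
    return out

info = parse_info(SAMPLE_INFO)
print(info["used_memory_human"])  # -> 1.00M
```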

The cost model is different

With cloud AI (OpenAI, Claude, etc) you pay per token. That makes costs unpredictable, as an agent that makes many tool calls can get expensive.

With Ollama locally, the cost is fixed to your hardware. A machine with an RTX 4070 (12 GB VRAM) costs around $1,500 and runs qwen3.5:4b fast enough for production.

Rough math:

  • Cloud (e.g. Qwen 3 72b via OpenRouter): ~$0.005 per scan. 200 scans per day = $30 per month
  • Local (Qwen 3.5 4b): ~$0 per scan. Unlimited.

The difference becomes dramatic at volume. And the local model works well enough for tasks with clear context.

You can also mix them. Run locally for 90% of traffic and fall back to a cloud model for complex queries.
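One way to sketch that routing, using query length as a stand-in for complexity (the heuristic and both model names are placeholders to tune for your workload):

```python
LOCAL_MODEL = "qwen3.5:4b"
CLOUD_MODEL = "qwen-3-72b"  # e.g. via OpenRouter

def pick_model(query: str, max_local_words: int = 50) -> str:
    # Short, well-scoped questions go to the local model;
    # long, open-ended ones fall back to the cloud model.
    if len(query.split()) <= max_local_words:
        return LOCAL_MODEL
    return CLOUD_MODEL
```

In practice you might route on tool-call depth or a classifier instead of raw length, but the shape is the same: one cheap default path, one expensive escape hatch.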

Backup and persistence

Redis Stack with appendonly yes (default in redis stack) gives you AOF persistence. Every write is logged to disk. On restart, everything is restored.

Backup:

# Snapshot (BGSAVE runs in the background; check LASTSAVE before copying)
redis-cli BGSAVE
cp /data/dump.rdb /backup/redis.rdb

# Or copy the AOF (Redis 7 writes multiple files to appendonlydir/)
cp -r /data/appendonlydir /backup/

Ollama models are cached in /root/.ollama. Mount it as a Docker volume and models survive container restarts without needing to be downloaded again.

What I would do differently in production

  1. Set maxmemory and an eviction policy: Redis without a memory limit on a shared machine is a ticking time bomb. maxmemory-policy allkeys-lru evicts the least recently used keys when the limit is hit.
  2. TTL on everything that does not need to live forever: Cached LLM responses: 1 hour. Conversation history: 7 days. Scan history: keep.
  3. Separate Redis instances per environment: Dev, staging, and prod should not share data. Use key prefixes (dev:, staging:, prod:) or, ideally, entirely separate Redis instances. Avoid logical databases (/0, /1, /2): RediSearch and the other modules only work on database 0, and Redis Cluster does not support multiple databases either.
  4. Health checks in docker compose: Already included in the example above. If you add an agent service, use depends_on with condition: service_healthy so it does not start before Redis and Ollama are ready.
  5. Log token usage: Even with local models, you want to know how much inference you are running. It helps with capacity planning.

Conclusion

What makes this stack attractive for DevOps is its simplicity. Two containers. No external state. Standard monitoring tools. Predictable cost.

Redis is no longer just a cache. With the Stack distribution, it replaces three or four services that would otherwise require separate operations. LangGraph abstracts away agent orchestration. Ollama turns LLM inference into a local service.

Bottom line: less to operate, less that can break, easier to debug.

Stack: Redis Stack, LangGraph, Ollama. Everything runs in Docker. You need a GPU with ≥8 GB VRAM for local models, or point Ollama at a cloud endpoint.

Published: March 2026 | Daniel Gustafsson
