DEV Community

Jangwook Kim

Originally published at effloow.com

Meta Llama Stack: Deploy Llama 4 With OpenAI-Compatible API

Why Llama Stack Changes the Open-Source Deployment Story

Running open-source LLMs in production has always had a catch: you pick a backend (Ollama, vLLM, llama.cpp), write your integration code against that backend's specific API, and then find yourself locked in. Swap the backend for performance or cost reasons and you're rewriting client code.

Meta Llama Stack solves exactly this problem. It's an open-source AI application server that sits in front of any backend and exposes a single, OpenAI-compatible API layer. The same /v1/chat/completions call that works against your local Ollama instance in development routes to vLLM or AWS Bedrock in production — with zero application-code changes.

As of April 2026, the repository has over 8,200 GitHub stars and is under active development by Meta's open-source team. It ships with native support for Llama 4 Scout and Llama 4 Maverick, alongside older Llama 3.x models. If you're running or planning to run open-weight Llama models in production, Llama Stack is the infrastructure layer worth knowing.

What Llama Stack Actually Is

Most frameworks for LLM deployment focus on one thing — inference serving. Llama Stack goes wider. It provides a unified API layer covering seven concerns:

  • Inference — run Llama models against any supported backend
  • RAG — vector store integration for retrieval-augmented generation
  • Agents — multi-step agent orchestration with tool use
  • Tools — web search, code interpreter, custom tool registration
  • Safety — Llama Guard integration for prompt/output filtering
  • Evals — built-in evaluation harness
  • Telemetry — OpenTelemetry-based tracing and logging

The architecture has two layers: a distribution (a pre-configured bundle of provider implementations) and the Llama Stack server (a single process that routes API calls to whichever providers are configured in that distribution).

Your application only ever talks to the Llama Stack server. Swapping backends, safety models, or vector stores is a config change, not a code change.

Distributions: The Core Concept

A distribution is the unit of deployment in Llama Stack. It bundles together one provider for each API component and packages them into a runnable server.

Meta ships several official distributions out of the box:

Distribution   Inference Backend   Best For
ollama         Ollama              Local development, CPU/Apple Silicon
vllm           vLLM                GPU production servers
tgi            HuggingFace TGI     HuggingFace-native stacks
together       Together AI         Managed API, no GPU needed
fireworks      Fireworks AI        Low-latency managed inference
bedrock        AWS Bedrock         AWS-native production
openai         OpenAI API          Hybrid open/closed LLM routing

The pattern: develop with ollama, deploy with vllm or a managed service. The API your application uses doesn't change.
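Because only the endpoint changes between environments, the dev-to-prod swap can live entirely in configuration. A minimal sketch of that idea (the LLAMA_STACK_URL variable name is my own convention, not part of Llama Stack):

```python
import os

def llama_stack_base_url() -> str:
    """Resolve the Llama Stack endpoint from the environment.

    Defaults to the local quick-start server; a production deployment
    just sets LLAMA_STACK_URL, and no client code changes.
    """
    return os.environ.get("LLAMA_STACK_URL", "http://localhost:8321/v1")

# Dev:  unset               -> http://localhost:8321/v1 (Ollama distribution)
# Prod: LLAMA_STACK_URL=... -> e.g. http://llama.internal:8321/v1 (vLLM)
```

Pass the resolved URL as base_url to whichever client you use, and the backend choice stays out of application code.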

You can also build custom distributions if you need to mix providers — for example, using Fireworks for inference but self-hosted ChromaDB for vector storage.

Installation and Quick Start

Install the Client

pip install llama-stack-client

Run the Server with Ollama (Local Development)

First, make sure Ollama is running and has pulled the model:

ollama pull llama3.3

Then start the Llama Stack server pointing at the Ollama distribution:

export INFERENCE_MODEL="llama3.3"
export LLAMA_STACK_PORT=8321

docker run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -v ~/.llama:/root/.llama \
  -e INFERENCE_MODEL=$INFERENCE_MODEL \
  -e OLLAMA_URL=http://host.docker.internal:11434 \
  llamastack/distribution-ollama:latest

OLLAMA_URL tells the container where to reach the Ollama daemon running on the host; on Linux, add --add-host=host.docker.internal:host-gateway to the docker run command.

The server is now running at http://localhost:8321 and exposes standard OpenAI-compatible endpoints.

Your First API Call

Use the OpenAI Python client directly — point it at the Llama Stack server:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8321/v1",
    api_key="not-required"  # Llama Stack handles auth separately
)

response = client.chat.completions.create(
    model="llama3.3",
    messages=[
        {"role": "user", "content": "Explain attention mechanisms in one paragraph."}
    ]
)

print(response.choices[0].message.content)

That's it. Any existing code that uses the OpenAI Python SDK can point at a Llama Stack server instead, with no further changes.

Or Use the Native Llama Stack Client

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

response = client.inference.chat_completion(
    model_id="llama3.3",
    messages=[{"role": "user", "content": "What are the Llama 4 model sizes?"}]
)

print(response.completion_message.content)

The native client exposes additional Llama Stack-specific APIs (agents, memory, safety) that aren't part of the OpenAI SDK interface.

Production Deployment with vLLM

For GPU production deployments, swap the distribution from ollama to vllm.

Start the vLLM Backend

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-4-Scout-17B-16E-Instruct \
  --tensor-parallel-size 2

Start Llama Stack Pointed at vLLM

export INFERENCE_MODEL="meta-llama/Llama-4-Scout-17B-16E-Instruct"
export VLLM_URL="http://localhost:8000"
export LLAMA_STACK_PORT=8321

docker run -it \
  -p $LLAMA_STACK_PORT:$LLAMA_STACK_PORT \
  -e INFERENCE_MODEL=$INFERENCE_MODEL \
  -e VLLM_URL=$VLLM_URL \
  llamastack/distribution-vllm:latest

Now your application code is unchanged — it still talks to http://your-server:8321/v1. The only thing that moved was the backend.

Llama 4 Model Options

Llama Stack supports the full Llama 4 family. The two currently available production models:

Model              Total Params   Active Params   Context      Best For
Llama 4 Scout      109B           17B             10M tokens   Single GPU, balanced tasks
Llama 4 Maverick   400B           17B             1M tokens    Multi-GPU, high-quality output

Both are Mixture of Experts (MoE) models under Meta's open-weight license: Scout uses 16 experts, Maverick 128, and both activate 17B parameters per token. Scout fits on a single 80GB GPU (H100 or A100) with 4-bit quantization; Maverick requires multiple GPUs, with the count depending on quantization level.
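A quick way to sanity-check GPU sizing: MoE models activate only a subset of experts per token, but all expert weights must stay resident in memory, so VRAM scales with total parameters, not active ones. A back-of-envelope sketch for weights alone (ignoring KV cache and activation overhead):

```python
def weight_memory_gb(total_params: float, bits_per_param: float) -> float:
    """Approximate resident weight memory in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9

# Llama 4 Scout: 109B total parameters, all of which must fit in VRAM
for bits, label in [(16, "BF16"), (8, "INT8"), (4, "INT4")]:
    print(f"Scout {label}: ~{weight_memory_gb(109e9, bits):.1f} GB")
# BF16 ~218.0 GB, INT8 ~109.0 GB, INT4 ~54.5 GB of weights alone
```

The INT4 figure is why Scout needs 4-bit quantization to fit a single 80GB card, with headroom left for the KV cache.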

Agents and Tool Use

Llama Stack's agents API goes well beyond basic chat completion. An agent maintains a session, executes multi-step plans, and calls tools.

from llama_stack_client import LlamaStackClient

client = LlamaStackClient(base_url="http://localhost:8321")

# Create an agent with web search enabled
agent_config = {
    "model": "llama3.3",
    "instructions": "You are a research assistant. Use search when you need current information.",
    "tools": [{"type": "brave_search", "api_key": "YOUR_BRAVE_API_KEY"}],
    "max_infer_iters": 5
}

agent = client.agents.create(**agent_config)
session = client.agents.sessions.create(
    agent_id=agent.agent_id,
    session_name="research-session"
)

# Turn 1
response = client.agents.turns.create(
    agent_id=agent.agent_id,
    session_id=session.session_id,
    messages=[{"role": "user", "content": "What are the latest Llama 4 benchmarks?"}],
    stream=True
)

for chunk in response:
    if hasattr(chunk, "event") and chunk.event.payload.event_type == "turn_complete":
        print(chunk.event.payload.turn.output_message.content)

The agent automatically decides when to call the search tool, reads the results, and synthesizes a final answer — all within Llama Stack's orchestration layer.

Built-in tools include:

  • brave_search — web search via Brave Search API
  • wolfram_alpha — math and science queries
  • code_interpreter — sandboxed Python execution
  • Custom tools via function registration
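Custom tool registration boils down to three pieces: a name, a parameter schema the model sees, and a Python callable the orchestration layer invokes when the model emits a tool call. The exact registration call varies by client version, so the following is a library-agnostic sketch of that mapping only (all names and the schema shape are illustrative, not the llama_stack_client API):

```python
import json

# Tool definition the model sees: name, description, parameter schema
WEATHER_TOOL = {
    "tool_name": "get_weather",
    "description": "Current weather for a city",
    "parameters": {"city": {"param_type": "str", "required": True}},
}

def get_weather(city: str) -> str:
    # Stand-in implementation; a real tool would call a weather API.
    return json.dumps({"city": city, "temp_c": 21})

# Dispatcher: route a model-emitted tool call to the matching function
REGISTRY = {"get_weather": get_weather}

def dispatch(tool_name: str, arguments: dict) -> str:
    return REGISTRY[tool_name](**arguments)
```

The server plays the dispatcher role for built-in tools; for custom tools you provide both the schema and the callable.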

Safety: Llama Guard Integration

Every Llama Stack distribution can run Llama Guard as the safety provider, filtering both inputs and outputs against configurable policy categories.

# Check a response against safety policy
safety_response = client.safety.run_shield(
    shield_id="meta-llama/Llama-Guard-3-8B",
    messages=[{"role": "assistant", "content": response_text}]
)

if safety_response.violation:
    print(f"Safety violation: {safety_response.violation.user_message}")
else:
    print(response_text)

Safety categories can be tuned per deployment. Production use cases often enable all defaults; internal developer tools might relax some restrictions.
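Wiring the shield check into a response path takes only a few lines. A minimal gating helper, assuming nothing beyond the result shape shown above (a .violation attribute that is None when the content passes); the ShieldResult dataclass is a stand-in for the real response object:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ShieldResult:
    """Stand-in for a run_shield response; only .violation matters here."""
    violation: Optional[object] = None

def gate(result, text: str, redacted: str = "[blocked by safety policy]") -> str:
    # Pass the text through only when the shield found no violation.
    return redacted if result.violation is not None else text
```

In a real handler you would call client.safety.run_shield and feed its result straight into gate before returning anything to the user.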

Additional safety capabilities in Llama Stack:

  • Prompt injection detection
  • Output validation before streaming
  • Rate limiting (configurable per session or API key)
  • Human-in-the-loop controls for high-stakes actions
  • Audit logging for compliance

Memory and RAG

The Memory API supports four storage types for different use cases:

Type                                  Best For
Vector (FAISS, ChromaDB, Weaviate)    Semantic similarity search, RAG
Key-Value (Redis, PostgreSQL)         Session state, structured lookup
Keyword (BM25)                        Exact-match and hybrid search
Graph (Neo4j)                         Relationship-based retrieval

Adding RAG to an agent is a configuration change:

# Create a memory bank (vector store)
memory_bank = client.memory_banks.register(
    memory_bank_id="my-docs",
    params={
        "memory_bank_type": "vector",
        "embedding_model": "all-MiniLM-L6-v2",
        "chunk_size_in_tokens": 512,
        "overlap_size_in_tokens": 64
    }
)

# Insert documents
client.memory.insert(
    bank_id="my-docs",
    documents=[
        {"document_id": "doc-1", "content": "Llama 4 Scout has a 10M token context window..."}
    ]
)

In production, PostgreSQL is the recommended backend for both vector storage and key-value persistence, replacing in-memory FAISS for durability across restarts.
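The chunk_size_in_tokens and overlap_size_in_tokens parameters above describe plain sliding-window chunking. A minimal sketch of what that windowing does, over a pre-tokenized sequence (real tokenization is model-specific, and Llama Stack's internal implementation may differ in edge-case handling):

```python
def chunk(tokens: list, size: int, overlap: int) -> list:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, max(len(tokens) - overlap, 1), step)]

# With size=5, overlap=2: each chunk repeats the last 2 tokens of the previous one
print(chunk(list(range(10)), size=5, overlap=2))
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighboring chunks, at the cost of some index redundancy.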

Telemetry and Observability

Llama Stack ships a complete OpenTelemetry-native telemetry system. Traces, spans, and events flow from the server to any OTEL-compatible backend (Jaeger, Grafana Tempo, Datadog, etc.).

# Enable OTEL tracing in your distribution config
OTEL_EXPORTER_OTLP_ENDPOINT=http://jaeger:4317
OTEL_SERVICE_NAME=llama-stack-prod

Every inference call, agent step, tool invocation, and safety check becomes a traced span. This gives you:

  • Per-request token counts and latency
  • Agent tool-call traces with reasoning steps
  • Safety shield evaluation times
  • Provider-level error attribution

For teams already using Langfuse or other LLM observability tools, Llama Stack's OTEL output integrates cleanly with existing dashboards.

Common Mistakes

Using the wrong distribution for your hardware. The ollama distribution works fine on CPU and Apple Silicon, but for A100/H100 servers, vllm gives 3–5x better throughput. Don't use CPU-tier distributions for GPU production workloads.

Not setting model IDs consistently. The model identifier in your API call must match exactly what the provider backend has loaded. With vLLM this is usually the full HuggingFace path (meta-llama/Llama-4-Scout-17B-16E-Instruct); with Ollama it's the short tag (llama3.3). Mismatches return a 404 that looks like a server error.
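A cheap guard against this 404 class of errors is to validate the requested ID against what the server actually has registered (for example, the IDs returned by client.models.list()) before the first inference call. A sketch of the check itself, operating on plain ID strings (the helper name is my own):

```python
def resolve_model(requested: str, available: list) -> str:
    """Return `requested` if registered; otherwise fail with a useful message."""
    if requested in available:
        return requested
    # Surface near-misses such as a short tag vs. a full HuggingFace path
    hints = [m for m in available if requested.lower() in m.lower()]
    raise ValueError(
        f"Model {requested!r} not registered. "
        + (f"Did you mean one of {hints}?" if hints else f"Available: {available}")
    )
```

Failing at startup with a clear message beats a misleading 404 surfacing deep inside request handling.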

Skipping Safety in development. Llama Guard evaluation adds ~50ms latency. Developers sometimes disable it locally to speed up iteration, then forget to re-enable it before production. Treat safety configuration as part of your deployment checklist, not a late addition.

Ignoring session management for agents. Agent sessions accumulate context across turns. For production services that handle many concurrent users, set session_ttl and clean up sessions explicitly, or you'll see memory growth over time.
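If your deployment has no built-in TTL enforcement, the cleanup can live in application code. A minimal sketch of the expiry bookkeeping; the actual delete call to Llama Stack is represented by a callback, since its exact method name varies by client version:

```python
import time

class SessionReaper:
    """Track session last-use times and expire idle ones."""

    def __init__(self, ttl_seconds: float, delete_fn, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.delete_fn = delete_fn   # e.g. wraps the client's session-delete call
        self.clock = clock
        self.last_used = {}

    def touch(self, session_id: str) -> None:
        """Record activity on a session (call on every turn)."""
        self.last_used[session_id] = self.clock()

    def reap(self) -> list:
        """Delete sessions idle longer than the TTL; return their IDs."""
        now = self.clock()
        expired = [s for s, t in self.last_used.items() if now - t > self.ttl]
        for s in expired:
            self.delete_fn(s)
            del self.last_used[s]
        return expired
```

Run reap() on a timer (or per N requests) so idle agent sessions stop accumulating context server-side.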

Mounting volumes incorrectly for model weights. The Docker images expect model weights at specific paths. If the volume mount doesn't match, the container downloads models on startup — slow, expensive, and fragile in autoscaling environments. Pre-pull weights and mount them at the documented path.

FAQ

Q: Can I use Llama Stack with non-Llama models?

Yes. Llama Stack supports any model available through its provider backends. The openai distribution routes to OpenAI-hosted models such as GPT-4o, the anthropic provider connects to Claude, and the vllm distribution serves any HuggingFace-compatible model. The "Llama" branding reflects the defaults, not a hard constraint.

Q: How does Llama Stack compare to LiteLLM?

LiteLLM focuses on unified API routing to managed providers (OpenAI, Anthropic, Azure, etc.) with cost tracking and fallback logic. Llama Stack is broader: it includes self-hosting, agent orchestration, safety, RAG, and evaluation. For a team running managed cloud providers, LiteLLM is simpler. For teams self-hosting Llama models who need agents and safety, Llama Stack adds significant value beyond what LiteLLM offers.

Q: Is Llama Stack production-ready?

The core inference and OpenAI-compatible endpoints are stable and used in production deployments. The agent, memory, and evaluation APIs are under more active development. As of version 0.2.x (April 2026), production use is most reliable for inference + safety use cases. Agents work well but have more API surface area that can change between minor versions.

Q: What's the minimum hardware for running Llama 4 Scout with Llama Stack?

Llama 4 Scout activates 17B parameters per token, but as an MoE model it keeps all 109B total parameters resident in memory: roughly 218GB in BF16, or about 55GB with 4-bit quantization. A single 80GB GPU (H100 or A100) handles the 4-bit configuration; a 40GB card does not. On Apple Silicon, a high-memory machine (64GB+ unified memory) can run the quantized model at reduced throughput. Maverick needs multiple A100/H100 GPUs, with the count depending on quantization level.

Q: Can I run Llama Stack without Docker?

Yes. Install via pip (pip install llama-stack) and launch with the llama CLI, e.g. llama stack run path/to/run.yaml. Docker is recommended in production for isolation and reproducibility, but the Python package works for development and custom deployments.

Key Takeaways

Meta Llama Stack is the cleanest path from local Llama model experimentation to production deployment. Its distribution model — develop with Ollama, deploy with vLLM, never change your API client — removes the most common painful rewrite in open-source LLM adoption.

The OpenAI compatibility layer is the practical unlock: teams already using the OpenAI Python SDK can switch to self-hosted Llama 4 by changing one line (base_url). Combined with built-in agents, safety, RAG, and telemetry, Llama Stack positions itself as a full infrastructure layer, not just a model server.

For the current state of production deployments: inference and safety are solid; agents and RAG are functional with active API evolution. A reasonable approach is to start with inference + safety in production, then evaluate the agents API for lower-stakes workloads while it stabilizes.

Bottom Line

Llama Stack is the best available open-source infrastructure layer for Llama 4 production deployments. The OpenAI-compatible API and swappable distribution model eliminate the usual vendor lock-in trade-off of self-hosting — you get full control without rewriting your application code when you scale up or change backends.


Prefer a deep-dive walkthrough? Watch the full video on YouTube.
