Rost

Posted on • Originally published at glukhov.org

Claude Code install and config for Ollama, llama.cpp, pricing

Claude Code is not autocomplete with better marketing. It is an agentic coding tool: it reads your codebase, edits files, runs commands, and integrates with your development tools.

That difference matters because the unit of work stops being "a line of code" and starts being "a task with an end state".

Anthropic frames the distinction clearly: code completion suggests the next line as you type, while Claude Code operates at the project level, plans across multiple files, executes changes, runs tests, and iterates on failures. In practice, that makes it closer to a terminal-native junior engineer who can do chores fast, but still needs review.

That speed-versus-supervision tension is much of what people bundle under “vibe coding”: What is Vibe Coding? unpacks the term, where it came from, and what efficiency and risk look like in practice.

One detail that is easy to miss when skimming documentation: the Terminal CLI (and the VS Code surface) can be configured to use third-party providers. That is where Ollama and llama.cpp come in.

Once Claude Code is pointed at a local HTTP endpoint, the runtime, hardware, and hosting trade-offs sit outside the client; this comparison of LLM hosting in 2026 lines up Ollama, dedicated inference stacks, and cloud options in one place.

To see how Claude Code fits next to other AI-assisted coding and delivery workflows, this guide to AI developer tools pulls Copilot-style assistants, automation, and editor patterns into one place.

For a tool-by-tool survey of coding assistants in the same bucket, AI Coding Assistants Comparison walks through Cursor, Copilot, Cline, and the rest at a higher level than this install guide.

Claude Code installation and quickstart

Installation options and what they imply

There are several install paths, and they are not equal:

  • Native install scripts are the "always current" option because they auto-update.
  • Homebrew and WinGet are the "controlled change" option because you upgrade explicitly.

Install commands (official quickstart):

# macOS, Linux, WSL
curl -fsSL https://claude.ai/install.sh | bash
# Windows PowerShell
irm https://claude.ai/install.ps1 | iex
:: Windows CMD
curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd

Then start an interactive session from inside a project folder:

cd /path/to/your/project
claude

Login and account types

Claude Code needs an account to run in the first-party mode. The quickstart flow supports logins via a Claude subscription (Pro, Max, Team, Enterprise), a Console account (API credits), or supported cloud providers. A useful operational footnote: on first Console login, a "Claude Code" workspace is created for centralised cost tracking.

Claude Code configuration: settings.json and environment variables

If Claude Code feels magical when it works, it often feels "mysterious" when it does not. The cure is understanding its configuration layering and the few environment variables that actually matter.

Settings files and precedence

Claude Code settings are hierarchical, with three developer-facing files:

  • User scope, applies everywhere: ~/.claude/settings.json
  • Project scope, shared in a repo: .claude/settings.json
  • Local scope, per-machine overrides: .claude/settings.local.json (gitignored)

Precedence is (highest to lowest): managed policy, CLI flags, local, project, user. That ordering explains several "why is my config ignored" moments.
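As a concrete sketch of that layering (file paths per the docs; the model values are placeholder examples), the project and local files can be laid out like this:

```shell
# Sketch of project- and local-scope settings files (model names are examples).
mkdir -p demo-project/.claude

# Project scope: committed and shared with the team.
cat > demo-project/.claude/settings.json <<'EOF'
{
  "model": "claude-sonnet-4-5"
}
EOF

# Local scope: per-machine overrides, kept out of version control.
cat > demo-project/.claude/settings.local.json <<'EOF'
{
  "model": "haiku"
}
EOF
```

Because local outranks project, sessions in demo-project would resolve to haiku; a CLI flag such as `claude --model opus` outranks both, and managed policy outranks everything.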

You can manage settings interactively via the /config command, which opens a settings UI inside the REPL.

Environment variables that control provider routing

Claude Code can be steered at runtime by environment variables. Two behaviour quirks are worth treating as design constraints:

1) If ANTHROPIC_API_KEY is set, Claude Code will use the key instead of a Claude subscription even when you are logged in. In print mode (-p) the key is always used when present.

2) If ANTHROPIC_BASE_URL points to a non-first-party host (a proxy, gateway, or local server), some features are intentionally conservative. For example, MCP tool search is disabled by default unless you explicitly re-enable it.

A minimal "use a gateway" pattern looks like this:

export ANTHROPIC_BASE_URL=https://your-gateway.example
export ANTHROPIC_API_KEY=sk-your-key

Gateway note: Claude Code expects certain API formats. For the Anthropic Messages format, the gateway must expose /v1/messages and /v1/messages/count_tokens and must forward anthropic-beta and anthropic-version headers. If a gateway rejects those headers, there is a dedicated knob to strip experimental betas.
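A quick way to check whether a gateway actually exposes the token-counting route is one direct request against it. This is a sketch: the base URL falls back to a local placeholder, and the model name is an example your gateway may not serve.

```shell
# Smoke-test the /v1/messages/count_tokens route on a gateway (sketch).
BASE="${ANTHROPIC_BASE_URL:-http://127.0.0.1:8080}"
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
  -X POST "$BASE/v1/messages/count_tokens" \
  -H "content-type: application/json" \
  -H "anthropic-version: 2023-06-01" \
  -d '{"model":"claude-sonnet-4-5","messages":[{"role":"user","content":"ping"}]}' \
  || true)
# 200 means the route exists; 404 usually means the gateway does not proxy it.
echo "count_tokens -> HTTP ${code:-000}"
```

If this returns 404 while plain message requests work, the gateway forwards /v1/messages but not the count_tokens route, which is exactly the mismatch the note above warns about.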

Model selection in Claude Code when you are not using Anthropic directly

Claude Code has a concept of aliases (opus, sonnet, haiku) and also supports pinning specific model IDs. There is also an allowlist that can restrict what users can select in the model picker, even when routed through third-party providers.

A pragmatic pattern is to set an initial model and restrict the picker, then pin what "default" resolves to via env:

{
  "model": "claude-sonnet-4-5",
  "availableModels": ["claude-sonnet-4-5", "haiku"],
  "env": {
    "ANTHROPIC_DEFAULT_SONNET_MODEL": "claude-sonnet-4-5"
  }
}

Running self-hosted LLMs via Ollama

Ollama is currently the lowest-friction way to make Claude Code run on non-Anthropic models, because it exposes an Anthropic-compatible API for Claude Code to talk to.

Quick setup with ollama launch

If you have Ollama installed and running, the fast path is:

ollama launch claude

Or specify a model at launch:

ollama launch claude --model glm-4.7-flash

Manual setup with explicit environment variables

The Ollama integration documents a simple manual wiring where Claude Code talks to Ollama through the Anthropic-compatible API endpoint:

export ANTHROPIC_AUTH_TOKEN=ollama
export ANTHROPIC_API_KEY=""
export ANTHROPIC_BASE_URL=http://localhost:11434

claude --model qwen3.5

This pattern is opinionated in a useful way: it treats "provider routing" as an environment concern, not something you click in a GUI.

Context window reality check

Agentic coding is context-hungry. Ollama calls it out bluntly: Claude Code requires a large context window and recommends at least 64k tokens. If your local model tops out at 8k or 16k, Claude Code will still run, but the "project-level" promise becomes fragile.
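One hedged workaround on the Ollama side is to bake a larger `num_ctx` into a model variant via a Modelfile. This is a sketch: `qwen3.5` is the example tag from earlier in this guide, and whether your hardware can actually hold 64k tokens of KV cache is a separate question.

```shell
# Create a 64k-context variant of a model via a Modelfile (sketch).
cat > Modelfile <<'EOF'
FROM qwen3.5
PARAMETER num_ctx 65536
EOF

# Then build the variant and point Claude Code at it:
#   ollama create qwen3.5-64k -f Modelfile
#   claude --model qwen3.5-64k
```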

For hands-on local model behaviour in a similar terminal-agent setup (Ollama and llama.cpp, coding tasks, and frank failure notes), Best LLMs for OpenCode - Tested Locally is a useful cross-check when you are shortlisting GGUF or Ollama tags for Claude Code.

Running self-hosted LLMs via llama.cpp

llama.cpp is attractive for the opposite reason: it is not trying to be a platform. It is a fast, lightweight server that can expose both OpenAI-compatible routes and an Anthropic Messages API compatible route.

For install paths, llama-cli, and llama-server behaviour beyond the snippets below, llama.cpp Quickstart with CLI and Server is the end-to-end reference.

What to run on the server side

The llama.cpp HTTP server (llama-server) supports an Anthropic-compatible Messages API at POST /v1/messages, with streaming via SSE. It also offers count_tokens at /v1/messages/count_tokens.

Two details matter for Claude Code:

  • The server explicitly does not make strong claims of full Anthropic API spec compatibility, but states it works well enough for many apps.
  • Tool use requires starting llama-server with the --jinja flag. If you miss this, Claude Code will behave like it suddenly forgot how to be an agent.

A minimal local run looks like:

# Build or download llama-server, then run with a GGUF model
./llama-server -m /models/your-model.gguf --jinja --host 127.0.0.1 --port 8080

If you want a hard auth boundary, llama-server can be configured with an API key:

./llama-server -m /models/your-model.gguf --jinja --api-key my-local-key --host 127.0.0.1 --port 8080

Point Claude Code at llama-server

With the server running, your Claude Code side is mostly a base URL override:

export ANTHROPIC_BASE_URL=http://127.0.0.1:8080
export ANTHROPIC_API_KEY=my-local-key   # only if you enabled --api-key on llama-server

claude --model your-model-alias

If you do not set an API key or auth token, Claude Code may try to fall back to subscription login, which is the source of many "why is it opening a browser" complaints.

Health checks and first failure triage

llama-server exposes a simple health endpoint that returns "loading model" until the model is ready, and "ok" when it is usable. When Claude Code appears to hang on the first request, checking /health is a fast way to distinguish "client config bug" from "server still loading".
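That triage step can be scripted as a short readiness poll. A sketch: host and port match the llama-server example above, and the retry count and sleep interval are arbitrary.

```shell
# Poll llama-server's /health endpoint until it reports ok (sketch).
HEALTH_URL="http://127.0.0.1:8080/health"
status="unknown"
for attempt in 1 2 3; do
  body=$(curl -s --max-time 2 "$HEALTH_URL" || true)
  case "$body" in
    *'"ok"'*) status="ready"; break ;;
    *)        status="not ready (${body:-no response})"; sleep 1 ;;
  esac
done
echo "llama-server: $status"
```

If this reports ready but Claude Code still stalls, suspect client configuration; if it never leaves "not ready", the server is still loading the model or is not listening where you think.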

Pricing and cost model

Claude Code pricing is less about "buying a CLI" and more about "which billing rail backs the tokens".

Subscription plans include Claude Code

Anthropic includes Claude Code in paid Claude subscription tiers. As of April 2026, the published pricing lists:

  • Pro at $17 per month with an annual discount ($200 billed up front), or $20 billed monthly, and it includes Claude Code.
  • Max plans starting at $100 per month.
  • Team plans priced per seat, with a standard seat at $20 per seat per month billed annually ($25 monthly) and a premium seat at $100 per seat per month billed annually ($125 monthly).

API token pricing

If you use Claude Code via API billing, costs follow token rates. Anthropic publishes per-million-token (MTok) pricing for models such as:

  • Haiku 4.5 at $1/MTok input and $5/MTok output.
  • Sonnet 4.5 at $3/MTok input and $15/MTok output.
  • Opus 4.5 at $5/MTok input and $25/MTok output.

Cost controls in the CLI

Print mode (-p) supports direct budget caps like --max-budget-usd, which is handy when you are scripting tasks and want predictable spend.

Inside interactive sessions, /cost shows token usage statistics.

Local backends change the bill, not the physics

Routing Claude Code to Ollama or llama.cpp can remove per-token API bills, but it does not make the work free. You are swapping cloud costs for local compute, memory, and "someone owns uptime". For some teams, that trade is the entire point.

Typical workflow: from plan to PR

My bias is that Claude Code is strongest when you treat it as a workflow engine, not a chatbot. The tooling hints at this.

Start with the permission model, not the prompt

Claude Code is permission-gated by design. The docs describe a tiered model: read-only operations such as file reads and grep are allowed, while bash commands and file modifications need approval.

Permission modes exist to manage the friction. In the CLI you can cycle modes with Shift+Tab (default -> acceptEdits -> plan). Plan mode reads and proposes changes but does not edit. acceptEdits mode allows Claude Code to create and edit files in your working directory without prompting, while still prompting for commands with side effects outside its safe list.

Auto mode is a newer option that reduces prompts by delegating approvals to a classifier, positioned as a safer middle path between constant prompts and disabling prompts entirely. It requires a minimum Claude Code version and specific plan and model requirements.

Use built-in commands to keep sessions honest

A few commands turn Claude Code from "assistant" into "tooling":

  • /init generates a CLAUDE.md project guide, which is a lightweight way to feed consistent context.
  • /diff gives an interactive view of changes, including per-turn diffs.
  • /rewind lets you rewind conversation and/or code to a previous point, using checkpoints.
  • /debug enables debug logging mid-session.
  • /doctor diagnoses and verifies your installation and settings.

These are not gimmicks; they are the safety rails you lean on when an agent edits more than you expected.

When to go non-interactive

For one-shot tasks (explain, summarise, generate a patch plan), print mode is a good fit:

claude -p "Summarise the repository architecture and list the riskiest modules"

It exits after the answer, which works well in scripts and CI.
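In a CI script, that pattern plus a budget cap keeps runs bounded. A sketch: the prompt and budget value are illustrative, and the guard lets the step no-op on machines without the CLI.

```shell
# Bounded one-shot run for scripts and CI (sketch; flags per the docs).
if command -v claude >/dev/null 2>&1; then
  result=$(claude -p "Summarise the repository architecture" --max-budget-usd 0.50)
else
  result="claude CLI not installed; skipping"
fi
echo "$result"
```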

Troubleshooting checklist

Most Claude Code issues are configuration issues in disguise. Here is a checklist that maps common symptoms to the underlying mechanism.

Claude Code keeps asking to sign in while using a local server

This typically means Claude Code is still trying to use first-party subscription auth. Ensure you set an explicit auth mode for the proxy:

  • Set ANTHROPIC_API_KEY for gateways that expect X-Api-Key.
  • Or set ANTHROPIC_AUTH_TOKEN for gateways that use Authorization Bearer.

Remember that ANTHROPIC_API_KEY overrides subscription usage even if you are logged in, and in interactive mode you may need to approve that override once.

The gateway errors on anthropic-beta headers

Some gateways reject unknown headers or beta fields. There is an environment variable designed for this exact failure mode:

export CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS=1

The LLM gateway documentation also notes you may need this when using the Anthropic Messages format with Bedrock or Vertex.

Tool calling does not work on llama.cpp

Double-check server flags. llama-server documents that tool use requires the --jinja flag. Without it, the server can respond, but the agent loop will degrade.

Permission prompts are interrupting every command

That can be normal, depending on mode and permission rules. Options include:

  • Switching to acceptEdits temporarily (file edits flow faster).
  • Writing explicit allow rules for known-safe bash commands in settings.json.
  • Using /sandbox to isolate the bash tool while reducing prompts.
  • Evaluating auto mode if your plan and version support it, as a middle ground.
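For the allow-rule option, a project settings file can pre-approve specific commands. A sketch: the `permissions.allow` key and the `Bash(command)` rule-string format follow the documented pattern, and the commands listed are examples you should tailor to your repo.

```shell
# Pre-approve known-safe bash commands in project settings (sketch).
mkdir -p .claude
cat > .claude/settings.json <<'EOF'
{
  "permissions": {
    "allow": [
      "Bash(git status)",
      "Bash(npm test)"
    ]
  }
}
EOF
```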

Something feels off and you need observability

Use the built-ins:

  • /doctor to validate installation and settings.
  • /debug to start capturing logs from that point forward.
  • If you are in print mode, consider a tight max budget and max turns to keep experiments bounded.
