DEV Community

韩

Posted on April 20, 2026

10 Hidden Uses of Ollama You Probably Didn't Know πŸ”₯

TL;DR β€” Ollama (169K+ GitHub stars) isn't just ollama run llama3. Here are 10 under-the-radar tricks that power users and AI engineers rely on β€” from structured JSON output and local embeddings to GPU memory tuning and cross-model chaining.


Introduction

You already know Ollama lets you run open LLMs locally. But here's the thing most developers miss: Ollama's REST API turns your laptop into a self-hosted AI platform β€” no cloud API keys, no per-token billing, no data leaving your machine.

The average developer uses 2 commands: ollama run and ollama pull. The rest of the iceberg? That's what this post is about.


1. Structured JSON Output (No Prompt Engineering Required)

Most people don't know that Ollama natively supports JSON mode via the format: json parameter β€” no fragile prompting needed.

import requests

# Ask for structured data without custom prompts
payload = {
    "model": "qwen2.5:7b",
    "messages": [
        {
            "role": "user",
            "content": (
                "Extract: name, age, city from: "
                "John is 34 and lives in Tokyo. "
                "Return ONLY valid JSON."
            )
        }
    ],
    "format": "json",
    "stream": False
}

resp = requests.post("http://localhost:11434/api/chat", json=payload)
data = resp.json()
print(data["message"]["content"])
# {"name": "John", "age": 34, "city": "Tokyo"}

Why it matters: format: json constrains decoding so the response is always parseable JSON β€” far more reliable than hoping the model honors a "please return JSON" instruction.
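If you need more structure than "any valid JSON", recent Ollama releases also accept a full JSON schema object in the format field. A minimal sketch β€” the helper names are mine, qwen2.5:7b is just an example model, and it assumes a local server on the default port:

```python
import json

import requests

def build_extraction_payload(text: str, model: str = "qwen2.5:7b") -> dict:
    """Build an /api/chat payload whose output is constrained by a JSON schema.

    Recent Ollama releases accept a full JSON schema object in "format",
    not just the string "json".
    """
    schema = {
        "type": "object",
        "properties": {
            "name": {"type": "string"},
            "age": {"type": "integer"},
            "city": {"type": "string"},
        },
        "required": ["name", "age", "city"],
    }
    return {
        "model": model,
        "messages": [{"role": "user",
                      "content": f"Extract name, age, city from: {text}"}],
        "format": schema,
        "stream": False,
    }

def extract(text: str) -> dict:
    """POST the payload and parse the schema-constrained reply."""
    resp = requests.post("http://localhost:11434/api/chat",
                         json=build_extraction_payload(text))
    return json.loads(resp.json()["message"]["content"])
```

With a schema in place, missing or mistyped fields are rejected at decode time instead of surfacing later in your pipeline.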


2. Local Embeddings β€” No OpenAI API Needed

Ollama serves embedding models through the same API β€” pull one with ollama pull nomic-embed-text, then compute text similarity entirely locally:

import requests
import math

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Compute embeddings locally with Ollama - free, no API key."""
    resp = requests.post(
        "http://localhost:11434/api/embeddings",
        json={"model": model, "prompt": text}
    )
    return resp.json()["embedding"]

# Test semantic similarity
vec_a = embed("How do I fine-tune a model?")
vec_b = embed("Training a custom LLM tutorial")

# Cosine similarity
sim = sum(a * b for a, b in zip(vec_a, vec_b)) / (
    math.sqrt(sum(x**2 for x in vec_a)) * math.sqrt(sum(x**2 for x in vec_b))
)
print(f"Similarity: {sim:.4f}")  # related sentences score noticeably higher than unrelated ones

Why it matters: Pair with Chroma or FAISS for a 100% local RAG pipeline β€” zero OpenAI dependency.
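For small corpora you don't even need a vector database. A minimal in-memory top-k search sketch built on the endpoint above β€” embed, cosine, and top_k are illustrative helper names, and in practice the corpus vectors would come from embed:

```python
import math

import requests

def embed(text: str, model: str = "nomic-embed-text") -> list[float]:
    """Fetch an embedding vector from the local Ollama server."""
    resp = requests.post("http://localhost:11434/api/embeddings",
                         json={"model": model, "prompt": text})
    return resp.json()["embedding"]

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def top_k(query_vec: list[float],
          corpus: list[tuple[str, list[float]]], k: int = 3) -> list[str]:
    """Return the k corpus texts most similar to the query vector."""
    ranked = sorted(corpus, key=lambda tv: cosine(query_vec, tv[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# In practice: corpus = [(doc, embed(doc)) for doc in docs]
# hits = top_k(embed("fine-tuning guide"), corpus, k=3)
```

Brute-force cosine search is plenty fast up to tens of thousands of documents; reach for FAISS or Chroma only beyond that.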


3. Think Mode β€” Make Models Show Their Reasoning

Ollama supports thinking models (for example deepseek-r1 or qwen3) via the think: true parameter. These models return their reasoning chain alongside the final answer β€” great for debugging and education.

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:8b",
  "prompt": "Should I use Ollama or OpenAI API for a production app? Think step by step.",
  "think": true,
  "stream": false
}'

Sample output (the reasoning arrives in a separate thinking field):

{
  "thinking": "Let me compare cost, latency, and control... for a small user base, local inference avoids per-token billing...",
  "response": "It depends on your scale. Use Ollama if you value data privacy and have GPU capacity; switch to a hosted API when you need frontier-level reasoning.",
  "done": true
}

Why it matters: visible reasoning lets you catch flawed logic before trusting an answer β€” all on your own hardware, with nothing sent to a hosted API.
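A thin Python wrapper makes the two-part reply easy to consume. This is a sketch assuming a recent Ollama version that returns reasoning in a separate thinking field; ask_with_thinking and split_thinking are illustrative names, and deepseek-r1:8b stands in for any thinking-capable model:

```python
import requests

def split_thinking(resp: dict) -> tuple[str, str]:
    """Separate the reasoning chain from the final answer in an Ollama reply."""
    return resp.get("thinking", ""), resp.get("response", "")

def ask_with_thinking(prompt: str, model: str = "deepseek-r1:8b") -> tuple[str, str]:
    """Query a thinking model and return (reasoning, answer)."""
    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": prompt,
        "think": True,
        "stream": False,
    })
    return split_thinking(resp.json())

# reasoning, answer = ask_with_thinking("Is 17 prime? Think step by step.")
```

Log the reasoning, show only the answer to users β€” you get an audit trail for free.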


4. GPU Memory Tuning β€” Run Bigger Models on Smaller GPUs

Ollama lets you override model parameters at request time, including GPU offloading settings. Most people don't know this:

import requests

payload = {
    "model": "llama3:70b-instruct",
    "prompt": "Explain quantum entanglement",
    "options": {
        "num_gpu": 4,        # Offload 4 layers to the GPU; the rest run on CPU
        "num_ctx": 8192,     # 8K context window
        "temperature": 0.7,
        "top_p": 0.9
    }
}
requests.post("http://localhost:11434/api/generate", json=payload)

Or via environment variables:

OLLAMA_NUM_PARALLEL=1 OLLAMA_GPU_OVERHEAD=0 ollama run llama3:70b-instruct

Why it matters: a quantized 70B model can run on a 24GB card if you offload only as many layers as fit in VRAM and leave the rest on CPU β€” slower, but it runs.
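To verify your offload settings actually took effect, Ollama's /api/ps endpoint reports how much of each loaded model is resident in VRAM. A small sketch β€” the helper names are mine, and it assumes the default port:

```python
import requests

def vram_usage(ps_json: dict) -> dict[str, float]:
    """Map each loaded model name to GiB resident in VRAM (/api/ps shape)."""
    return {m["name"]: round(m.get("size_vram", 0) / 2**30, 2)
            for m in ps_json.get("models", [])}

def check_offload() -> dict[str, float]:
    """Ask the local server which models are loaded and how much VRAM they use."""
    return vram_usage(requests.get("http://localhost:11434/api/ps").json())
```

If size_vram is much smaller than the model size, most layers are running on CPU and you can expect slow generation.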


5. Image Understanding β€” Multimodal Models Out of the Box

Ollama ships with vision-capable models like llava and gemma3. Analyze images with zero extra setup:

import base64
import requests

def describe_image(image_path: str, model: str = "llava:7b") -> str:
    """Describe an image using a local multimodal model."""
    with open(image_path, "rb") as f:
        img_b64 = base64.b64encode(f.read()).decode()

    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": "Describe this image in detail.",
        "images": [img_b64],
        "stream": False
    })
    return resp.json()["response"]

caption = describe_image("screenshot.png")
print(caption)

Why it matters: Local vision without OpenAI Vision API β€” $0 cost per image. Integrates with langchain-ai/langchain (134K stars) for production pipelines.
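Scaling this up to a folder of screenshots is a short loop. A sketch with hypothetical helper names β€” only files with common image extensions are sent, each base64-encoded exactly as above:

```python
import base64
import pathlib

import requests

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp"}

def image_files(folder: str) -> list[pathlib.Path]:
    """Collect image files in a folder, sorted for stable ordering."""
    return sorted(p for p in pathlib.Path(folder).iterdir()
                  if p.suffix.lower() in IMAGE_EXTS)

def caption_folder(folder: str, model: str = "llava:7b") -> dict[str, str]:
    """Caption every image in a folder with a local vision model."""
    captions = {}
    for path in image_files(folder):
        img_b64 = base64.b64encode(path.read_bytes()).decode()
        resp = requests.post("http://localhost:11434/api/generate", json={
            "model": model,
            "prompt": "Describe this image in one sentence.",
            "images": [img_b64],
            "stream": False,
        })
        captions[path.name] = resp.json()["response"]
    return captions
```

Dump the resulting dict to JSON and you have a searchable index of your screenshots β€” built for $0.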


6. Modelfile β€” Create Custom Fine-Tuned Prompts

Ollama's Modelfile lets you bake system prompts into a model β€” similar to a fine-tune but without retraining:

# Modelfile (save as `Modelfile`)
FROM llama3:8b-instruct
PARAMETER temperature 0.3
PARAMETER top_p 0.9
SYSTEM """
You are a senior software architect. When explaining code, always include:
1. Time complexity (Big O)
2. Space complexity (Big O)
3. A code example in Python
Keep responses under 200 words.
"""
# Build and run your custom persona
ollama create architect -f Modelfile
ollama run architect "What is a Bloom filter?"

Why it matters: Create domain-specific AI assistants without any ML training β€” the system prompt and sampling defaults get baked into a named model. This is the hidden gem most tutorials skip.
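If you maintain several personas, you can render Modelfiles programmatically. A small sketch β€” make_modelfile is an illustrative helper, not an Ollama API:

```python
def make_modelfile(base: str, system: str, **params) -> str:
    """Render a Modelfile string: base model, sampling params, system prompt."""
    lines = [f"FROM {base}"]
    for key, value in params.items():
        lines.append(f"PARAMETER {key} {value}")
    lines.append(f'SYSTEM """\n{system.strip()}\n"""')
    return "\n".join(lines)

modelfile = make_modelfile(
    "llama3:8b-instruct",
    "You are a strict senior code reviewer. Flag bugs before style.",
    temperature=0.2,
    top_p=0.9,
)
# pathlib.Path("Modelfile").write_text(modelfile)
# then: ollama create reviewer -f Modelfile
```

Keep the persona definitions in version control and regenerate models in CI β€” prompt changes become reviewable diffs.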


7. Serve Multiple Models Simultaneously

Ollama handles concurrent requests. Spin up 3 models at once for A/B testing or chain pipelines:

import concurrent.futures
import requests

MODELS = ["llama3:8b", "qwen2.5:7b", "mistral:7b"]
PROMPT = "What is retrieval-augmented generation (RAG)?"

def query_model(model: str) -> dict:
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": PROMPT, "stream": False},
        timeout=60
    )
    return {"model": model, "response": resp.json()["response"][:200]}

# Query all models in parallel
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as ex:
    futures = [ex.submit(query_model, m) for m in MODELS]
    for f in concurrent.futures.as_completed(futures):
        result = f.result()
        print(f"=== {result['model']} ===\n{result['response']}\n")

Why it matters: A/B test outputs, build ensemble flows, or serve different user tiers from one GPU machine.
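The same setup also supports sequential chaining, where one model's output becomes the next model's input. A sketch with illustrative names; the run parameter is there so the pipeline can be exercised without a live server:

```python
import requests

def generate(model: str, prompt: str) -> str:
    """One non-streaming completion from the local server."""
    resp = requests.post("http://localhost:11434/api/generate",
                         json={"model": model, "prompt": prompt, "stream": False},
                         timeout=120)
    return resp.json()["response"]

def chain(stages: list[tuple[str, str]], seed: str, run=generate) -> str:
    """Cross-model pipeline: each stage is (model, instruction); the previous
    stage's output becomes the next stage's input."""
    text = seed
    for model, instruction in stages:
        text = run(model, f"{instruction}\n\n{text}")
    return text

# chain([("llama3:8b", "Draft a concise answer:"),
#        ("qwen2.5:7b", "Tighten and fact-check this draft:")],
#       "What is retrieval-augmented generation?")
```

Injecting run as a parameter keeps the control flow testable with a stub β€” a cheap trick worth copying in any LLM pipeline code.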


8. Long-Context Windows β€” Feed Entire Codebases

Ollama supports extended context windows (up to 128K tokens with certain models). Feed entire repos for repo-level analysis:

import requests

def analyze_repo_context(repo_path: str, model: str = "qwen2.5:14b") -> str:
    """Load an entire codebase as context - no chunking needed."""
    with open(f"{repo_path}/combined.txt", "r") as f:
        context = f.read()[:60000]  # cap at 60K characters (~15K tokens)

    resp = requests.post("http://localhost:11434/api/generate", json={
        "model": model,
        "prompt": f"Analyze this codebase:\n\n{context}\n\n"
                  f"List: (1) main entry points, (2) potential bugs, "
                  f"(3) refactoring suggestions.",
        "options": {"num_ctx": 65536},
        "stream": False
    })
    return resp.json()["response"]

# For repos that fit in the context window, this replaces a RAG pipeline entirely

Why it matters: for smaller repos, a single long-context call replaces a complex chunk-and-retrieve setup.
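Building that combined.txt is itself a one-function job. A sketch β€” combine_repo is an illustrative helper, and the ~4 characters per token budget is a rough rule of thumb:

```python
import pathlib

def combine_repo(repo_path: str, exts: tuple = (".py", ".md"),
                 max_chars: int = 240_000) -> str:
    """Concatenate source files into one context string, with per-file headers
    so the model can cite paths. ~4 chars/token is a rough budget."""
    parts, total = [], 0
    for path in sorted(pathlib.Path(repo_path).rglob("*")):
        if path.suffix not in exts or not path.is_file():
            continue
        block = f"### FILE: {path}\n{path.read_text(errors='ignore')}\n"
        if total + len(block) > max_chars:
            break
        parts.append(block)
        total += len(block)
    return "".join(parts)
```

The per-file headers matter: they let the model answer "which file defines X?" instead of returning anonymous snippets.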


9. Streaming Tokens β€” Real-Time UI

Ollama's native API streams responses as newline-delimited JSON chunks (its OpenAI-compatible /v1 endpoint uses Server-Sent Events). Build real-time AI UIs with plain Python:

import requests
import json

def stream_response(model: str, prompt: str):
    """Stream tokens as they arrive - like ChatGPT."""
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={"model": model, "messages": [
            {"role": "user", "content": prompt}
        ]},
        stream=True
    )
    for line in resp.iter_lines():
        if line:
            chunk = json.loads(line)
            if "message" in chunk and "content" in chunk["message"]:
                print(chunk["message"]["content"], end="", flush=True)
    print()

stream_response("llama3:8b", "Write a haiku about local AI inference")

Why it matters: with a small model already loaded, first-token latency on local hardware can drop below 100ms β€” no hosted streaming API needed.
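Because each streamed line is a standalone JSON object, reassembling the full reply for logging alongside a live UI is trivial. A sketch with an illustrative helper name:

```python
import json

def accumulate(lines) -> str:
    """Reassemble a full reply from streamed /api/chat chunks: each line is a
    standalone JSON object; the final one carries "done": true."""
    parts = []
    for raw in lines:
        if not raw:  # skip keep-alive blank lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

Feed it resp.iter_lines(decode_unicode=True) from the streaming call above to capture the complete answer while tokens are still being printed.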


10. Docker + Ollama = Production-Grade Local AI

Ollama's official Docker image makes deployment trivial:

# One command to serve Ollama in production
docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  --gpus all \
  ollama/ollama:latest

# Pull models and serve via REST API
docker exec ollama ollama pull llama3:8b-instruct
curl http://localhost:11434/api/chat -d '{"model":"llama3:8b-instruct","messages":[{"role":"user","content":"Hello"}]}'

Why it matters: Your entire AI stack in a docker-compose.yml. No cloud dependency, GDPR-compliant, runs on any cloud VM with a GPU.
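For repeatable deployments, the same flags translate into a compose file. One possible docker-compose.yml mirroring the docker run command above β€” the deploy.resources GPU reservation assumes Docker with the NVIDIA container toolkit installed:

```yaml
services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
volumes:
  ollama_data:
```

The named volume keeps pulled models across container restarts, so redeploys don't re-download gigabytes of weights.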



Summary

Ollama has earned its 169K+ GitHub stars for good reason. Beyond ollama run, it offers:

  1. Native JSON structured output
  2. Free local embeddings (no OpenAI API)
  3. Thinking models with visible reasoning
  4. GPU memory tuning for bigger models
  5. Multimodal image understanding
  6. Modelfile persona creation
  7. Multi-model concurrent serving
  8. 128K context windows for repo analysis
  9. Token streaming for real-time UIs
  10. Docker production deployment

The "boiling frog" discussion on r/artificial warns that organizations over-relying on cloud AI face brittleness when access is removed. Ollama gives you a local fallback that is production-ready today.


What hidden Ollama tricks are you using? Drop them in the comments β€” I will feature the best ones in next week's post! πŸ”₯
