In this post, we’ll explore two ways to connect your Python application to Ollama:

1. Via the HTTP REST API
2. Via the official Ollama Python library

We’ll cover both chat and generate calls, and then discuss how to use “thinking models” effectively.
Ollama has quickly become one of the most convenient ways to run large language models (LLMs) locally.
With its simple interface and support for popular open models like Llama 3, Mistral, Qwen2.5, and even “thinking” variants like qwen3, it’s easy to embed AI capabilities directly into your Python projects — without relying on external cloud APIs.
🧩 Prerequisites
Before diving in, make sure you have:

- Ollama installed and running locally (it serves on http://localhost:11434 by default)
- At least one model pulled, for example `ollama pull llama3.1`
- The `requests` and `ollama` Python packages:

```bash
pip install requests ollama
```
Confirm Ollama is running by executing:

```bash
ollama list
```
You should see your available models, such as `llama3`, `mistral`, or `qwen3`.
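If you prefer to check from Python that the server is reachable before wiring up your app, Ollama’s `/api/tags` endpoint returns the locally pulled models. A quick sanity-check sketch (assuming the default port 11434):

```python
import requests

# Query Ollama's local API for the models you have pulled (default port 11434)
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

models = [m["name"] for m in resp.json().get("models", [])]
print("Available models:", models)
```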
⚙️ Option 1: Using Ollama’s REST API
The REST API is ideal when you want maximum control or when integrating with frameworks that already handle HTTP requests.
Example 1: Chat API
```python
import requests
import json

url = "http://localhost:11434/api/chat"

payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are a Python assistant."},
        {"role": "user", "content": "Write a function that reverses a string."}
    ]
}

# /api/chat streams newline-delimited JSON objects by default
response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("message", {}).get("content", ""), end="")
```
👉 The Ollama REST API streams responses line-by-line (similar to OpenAI’s streaming API). You can accumulate content or display it in real time for chatbots or CLI tools.
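If you need the whole reply as a single string (to store it, post-process it, or pass it to another function), accumulate the chunks as they arrive. A small self-contained sketch of that pattern; the final streamed object is marked with `"done": true`:

```python
import requests
import json

url = "http://localhost:11434/api/chat"
payload = {
    "model": "llama3.1",
    "messages": [{"role": "user", "content": "Write a function that reverses a string."}],
}

full_reply = ""
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        data = json.loads(line)
        # Each chunk carries a partial assistant message; the last one has "done": true
        full_reply += data.get("message", {}).get("content", "")
        if data.get("done"):
            break

print(full_reply)
```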
Example 2: Generate API
If you don’t need chat context or roles, use the simpler `/api/generate` endpoint:
```python
import requests
import json

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence."
}

# /api/generate also streams newline-delimited JSON by default
response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("response", ""), end="", flush=True)
```
This endpoint is great for one-shot text generation tasks — summaries, code snippets, etc.
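If you’d rather receive the whole completion in one response instead of a stream, you can set `"stream": false` in the payload; the generated text then arrives in a single JSON object under `"response"`. A short sketch:

```python
import requests

payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence.",
    "stream": False,  # return a single JSON object instead of streamed chunks
}

response = requests.post("http://localhost:11434/api/generate", json=payload)
response.raise_for_status()

print(response.json()["response"])
```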
🐍 Option 2: Using the Ollama Python Library
The Ollama Python client provides a cleaner interface for developers who prefer to stay fully in Python.
Example 1: Chat API
```python
import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a code assistant."},
        {"role": "user", "content": "Generate a Python script that lists all files in a directory."}
    ]
)

print(response['message']['content'])
```
This returns the final message as a dictionary. If you want streaming, you can iterate over the chat stream:
```python
stream = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Write a haiku about recursion."}
    ],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
```
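Note that `ollama.chat()` is stateless: to hold a multi-turn conversation you resend the growing message history on every call. A minimal sketch of that pattern (the `chat_turn` helper is just an illustrative name, not part of the library):

```python
import ollama

history = [{"role": "system", "content": "You are a code assistant."}]

def chat_turn(user_input: str) -> str:
    """Send one user turn and append both sides to the shared history."""
    history.append({"role": "user", "content": user_input})
    response = ollama.chat(model="llama3.1", messages=history)
    reply = response['message']['content']
    history.append({"role": "assistant", "content": reply})
    return reply

print(chat_turn("Write a function that reverses a string."))
print(chat_turn("Now add type hints to it."))
```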
Example 2: Generate API
```python
import ollama

output = ollama.generate(
    model="llama3.1",
    prompt="Summarize the concept of decorators in Python."
)

print(output['response'])
```
Or stream the result:
```python
stream = ollama.generate(
    model="llama3.1",
    prompt="List three pros of using Python for AI projects.",
    stream=True
)

for chunk in stream:
    print(chunk['response'], end='', flush=True)
```
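Both functions also accept an `options` dictionary for sampling parameters such as `temperature`; the exact set of supported keys depends on your Ollama version, so treat this as a sketch rather than a definitive reference:

```python
import ollama

# Assumption: your Ollama version supports these option keys
output = ollama.generate(
    model="llama3.1",
    prompt="List three pros of using Python for AI projects.",
    options={
        "temperature": 0.2,   # lower values make output more deterministic
        "num_predict": 200,   # cap the number of generated tokens
    },
)

print(output['response'])
```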
🧠 Working with “Thinking” Models
Ollama supports “thinking models” such as qwen3, designed to show their intermediate reasoning steps. These models produce structured output, often in a format like:
```
<think>
Reasoning steps here...
</think>
Final answer here.
```
This makes them useful for:
- Debugging model reasoning
- Research into interpretability
- Building tools that separate thought from output
Example: Using a Thinking Model
```python
import re

import ollama

response = ollama.chat(
    model="qwen3",
    messages=[
        {"role": "user", "content": "What is the capital of Australia?"}
    ]
)

content = response['message']['content']

# Optionally extract the "thinking" part from the <think>...</think> block
thinking = re.findall(r"<think>(.*?)</think>", content, re.DOTALL)
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)

print("🧠 Thought process:\n", thinking[0].strip() if thinking else "N/A")
print("\n✅ Final answer:\n", answer.strip())
```
When to Use Thinking Models
| Use Case | Recommended Model | Why |
|---|---|---|
| Interpretability / Debugging | qwen3 | View reasoning traces |
| Performance-sensitive apps | qwen3 (non-thinking mode) | Faster, less verbose |
| Educational / Explanatory | qwen3 | Shows step-by-step logic |
✅ Summary
| Task | REST API | Python Client |
|---|---|---|
| Simple text generation | `/api/generate` | `ollama.generate()` |
| Conversational chat | `/api/chat` | `ollama.chat()` |
| Streaming support | Yes | Yes |
| Works with thinking models | Yes | Yes |
Ollama’s local-first design makes it ideal for secure, offline, or privacy-sensitive AI applications. Whether you’re building an interactive chatbot or a background data enrichment service, you can integrate LLMs seamlessly into your Python workflow — with full control over models, latency, and data.