Integrating Ollama with Python: REST API and Python Client Examples

In this post, we’ll explore two ways to connect your Python application to Ollama: the HTTP REST API and the official Ollama Python library.

We’ll cover both chat and generate calls, and then discuss how to use “thinking models” effectively.

Ollama has quickly become one of the most convenient ways to run large language models (LLMs) locally. With its simple interface and support for popular open models like Llama 3, Mistral, Qwen2.5, and even “thinking” variants like qwen3, it’s easy to embed AI capabilities directly into your Python projects — without relying on external cloud APIs.


🧩 Prerequisites

Before diving in, make sure you have:

  • Ollama installed and running locally (ollama serve)
  • Python 3.9+
  • Required dependencies:
pip install requests ollama

Confirm Ollama is running by executing:

ollama list

You should see available models such as llama3, mistral, or qwen3.
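
You can also confirm the server is reachable from Python by querying the /api/tags endpoint, which lists the locally installed models. A minimal sketch, assuming the default local address http://localhost:11434:

import requests

# Ask the local Ollama server which models are installed
resp = requests.get("http://localhost:11434/api/tags", timeout=5)
resp.raise_for_status()

for model in resp.json().get("models", []):
    print(model["name"])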


⚙️ Option 1: Using Ollama’s REST API

The REST API is ideal when you want maximum control or when integrating with frameworks that already handle HTTP requests.

Example 1: Chat API

import requests
import json

url = "http://localhost:11434/api/chat"

payload = {
    "model": "llama3.1",
    "messages": [
        {"role": "system", "content": "You are a Python assistant."},
        {"role": "user", "content": "Write a function that reverses a string."}
    ]
}

# Stream the reply; each non-empty line is a JSON chunk carrying a partial message
response = requests.post(url, json=payload, stream=True)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get("message", {}).get("content", ""), end="", flush=True)

👉 By default, the Ollama REST API streams the response as newline-delimited JSON objects (similar in spirit to OpenAI’s streaming API). You can accumulate the content or display it in real time for chatbots or CLI tools.
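
To build the full reply as a single string (for example, to store or post-process it), accumulate the chunks and stop once the final chunk reports done. A small sketch reusing the payload above:

full_reply = ""

response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
    if not line:
        continue
    data = json.loads(line)
    # Each chunk carries a fragment of the assistant's message
    full_reply += data.get("message", {}).get("content", "")
    # The final chunk sets "done": true (and includes timing stats)
    if data.get("done"):
        break

print(full_reply)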


Example 2: Generate API

If you don’t need chat context or roles, use the simpler /api/generate endpoint:

import requests
import json

url = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence."
}

# Each streamed line is a JSON chunk whose "response" field holds a piece of the text
response = requests.post(url, json=payload, stream=True)
for line in response.iter_lines():
    if line:
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)

This endpoint is great for one-shot text generation tasks — summaries, code snippets, etc.
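
If you would rather get the whole completion back in one JSON object, set "stream": False in the payload; the server then returns a single response with the generated text in its "response" field. A minimal sketch:

import requests

payload = {
    "model": "llama3.1",
    "prompt": "Explain recursion in one sentence.",
    "stream": False  # ask for one complete JSON object instead of a stream
}

result = requests.post("http://localhost:11434/api/generate", json=payload).json()
print(result["response"])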


🐍 Option 2: Using the Ollama Python Library

The Ollama Python client provides a cleaner interface for developers who prefer to stay fully in Python.

Example 1: Chat API

import ollama

response = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "system", "content": "You are a code assistant."},
        {"role": "user", "content": "Generate a Python script that lists all files in a directory."}
    ]
)

print(response['message']['content'])

This returns the final message as a dictionary. If you want streaming, you can iterate over the chat stream:

stream = ollama.chat(
    model="llama3.1",
    messages=[
        {"role": "user", "content": "Write a haiku about recursion."}
    ],
    stream=True
)

for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Example 2: Generate API

import ollama

output = ollama.generate(
    model="llama3.1",
    prompt="Summarize the concept of decorators in Python."
)

print(output['response'])

Or stream the result:

stream = ollama.generate(
    model="llama3.1",
    prompt="List three pros of using Python for AI projects.",
    stream=True
)

for chunk in stream:
    print(chunk['response'], end='', flush=True)
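
Both ollama.chat() and ollama.generate() also accept an options dictionary for sampling parameters, mirroring the "options" object of the REST API. The exact keys supported depend on your model and Ollama version; temperature and num_predict below are common examples, not a guarantee:

import ollama

output = ollama.generate(
    model="llama3.1",
    prompt="Suggest a name for a Python linting tool.",
    # Sampling options are passed through to the model runtime
    options={"temperature": 0.7, "num_predict": 64}
)

print(output['response'])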

🧠 Working with “Thinking” Models

Ollama supports “thinking models” such as qwen3, designed to show their intermediate reasoning steps. These models produce structured output, often in a format like:

<think>
  Reasoning steps here...
</think>
Final answer here.

This makes them useful for:

  • Debugging model reasoning
  • Research into interpretability
  • Building tools that separate thought from output

Example: Using a Thinking Model

import re
import ollama

response = ollama.chat(
    model="qwen3",
    messages=[
        {"role": "user", "content": "What is the capital of Australia?"}
    ]
)

content = response['message']['content']

# Optionally split the "thinking" part from the final answer
thinking = re.findall(r"<think>(.*?)</think>", content, re.DOTALL)
answer = re.sub(r"<think>.*?</think>", "", content, flags=re.DOTALL)

print("🧠 Thought process:\n", thinking[0].strip() if thinking else "N/A")
print("\n✅ Final answer:\n", answer.strip())

When to Use Thinking Models

| Use Case | Recommended Model | Why |
| --- | --- | --- |
| Interpretability / Debugging | qwen3 | View reasoning traces |
| Performance-sensitive apps | qwen3 (non-thinking mode) | Faster, less verbose |
| Educational / Explanatory | qwen3 | Shows step-by-step logic |
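
Depending on your Ollama version, you may also be able to toggle reasoning explicitly with a think flag instead of parsing <think> tags yourself; when enabled, the reasoning comes back in a separate field of the message. Treat this sketch as an assumption to verify against your installed version:

import requests

# Assumes an Ollama build whose chat API accepts a "think" flag (check your version)
payload = {
    "model": "qwen3",
    "messages": [{"role": "user", "content": "What is the capital of Australia?"}],
    "think": True,   # set to False for the faster, non-thinking mode
    "stream": False
}

message = requests.post("http://localhost:11434/api/chat", json=payload).json()["message"]

print("🧠 Thought process:\n", message.get("thinking", "N/A"))
print("\n✅ Final answer:\n", message["content"])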

✅ Summary

| Task | REST API | Python Client |
| --- | --- | --- |
| Simple text generation | /api/generate | ollama.generate() |
| Conversational chat | /api/chat | ollama.chat() |
| Streaming support | Yes | Yes |
| Works with thinking models | Yes | Yes |

Ollama’s local-first design makes it ideal for secure, offline, or privacy-sensitive AI applications. Whether you’re building an interactive chatbot or a background data enrichment service, you can integrate LLMs seamlessly into your Python workflow — with full control over models, latency, and data.
