From Zero to Local AI in 10 Minutes With Ollama + Python
Why Ollama (And Why Now)?
If you want production-like experiments without cloud keys or per-call fees, Ollama gives you a local-first developer path:
- Zero friction: Install once; pull models on demand; everything runs on `localhost` by default.
- One API, two runtimes: The same API works for local and (optional) cloud models, so you can start on your laptop and scale later with minimal code changes.
- Batteries included: Simple CLI (`ollama run`, `ollama pull`), a clean REST API, an official Python client, embeddings, and vision support.
- Repeatability: A `Modelfile` (think: Dockerfile for models) captures system prompts and parameters so teams get the same behaviour.
What’s New in Late 2025 (at a Glance)
- Cloud models (preview): Run larger models on managed GPUs with the same API surface; develop locally, scale in the cloud without code changes.
- OpenAI-compatible endpoints: Point OpenAI SDKs at Ollama (`/v1`) for easy migration and local testing (see the Python sketch after this list).
- Windows desktop app: Official GUI for Windows users; drag-and-drop, multimodal inputs, and background service management.
- Safety/quality updates: Recent safety-classification models and runtime optimizations (e.g., flash-attention toggles in select backends) to improve performance.
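To see what the OpenAI-compatible endpoints look like in practice, here is a minimal sketch that points the OpenAI Python SDK at a local Ollama server. It assumes the openai package is installed and a model such as llama3.2 has already been pulled:

from openai import OpenAI

# Point the OpenAI SDK at Ollama's local /v1 endpoint.
# Ollama ignores the API key, but the SDK requires a non-empty value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",  # any locally pulled model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)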
How Ollama Works (Architecture in 90 Seconds)
- Runtime: A lightweight server listens on `localhost:11434` and exposes REST endpoints for chat, generate, and embeddings. Responses stream token-by-token.
- Model format (GGUF): Models are packaged as quantized `.gguf` binaries for efficient CPU/GPU inference and fast memory-mapped loading.
- Inference engine: Built on the `llama.cpp` family of kernels with GPU offload via Metal (Apple Silicon), CUDA (NVIDIA), and others; choose quantization for your hardware.
- Configuration: A `Modelfile` pins base model, system prompt, parameters, adapters (LoRA), and optional templates, so your team's runs are reproducible (a minimal example follows this list).
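As a minimal sketch of that last point, here is a small Modelfile. The system prompt and parameter values are illustrative, and it assumes the llama3.2 base model is available locally:

# Modelfile: pins the base model, system prompt, and sampling parameters
FROM llama3.2
SYSTEM "You are a concise assistant for internal engineering docs."
PARAMETER temperature 0.2
PARAMETER num_ctx 4096

Build and run it (the name docs-assistant is just an example):

ollama create docs-assistant -f Modelfile
ollama run docs-assistant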
Install in 60 Seconds
macOS / Windows / Linux
- Download and install Ollama from the official site (choose your OS).
Get Started with Python
To get started with Ollama in Python, install the official Python client using pip:
pip install ollama
Next, create a new Python file (e.g., main.py) and import the Ollama client:
import ollama
# Initialize the Ollama client
client = ollama.Client()
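The client talks to the local server at http://localhost:11434 by default. Here is a minimal chat call as a sketch, assuming a model such as llama3.2 has already been pulled (the next section shows how):

# Send a single chat message and print the model's reply
response = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])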
Run Your First Model
To run your first model, use the ollama run command with the name of the model you want to load. For example, to pull and chat with a small general-purpose model such as Llama 3.2:
ollama run llama3.2
The first run downloads the model; subsequent runs load it from the local cache. The Ollama background service is exposed at http://localhost:11434, and you can use that endpoint to make requests to the model from code.
Making Requests
To make requests from code, send a POST request to the native /api/generate endpoint with the model name and the prompt you want completed. Here's an example using Python:
import requests

# Ollama's native text-generation endpoint and payload
endpoint = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.2",
    "prompt": "This is a sample input.",
    "stream": False,  # return one JSON object instead of a token stream
}

# Send the request
response = requests.post(endpoint, json=payload)

# Print the generated text
print(response.json()["response"])
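Responses can also be streamed token-by-token. Here is a sketch using the official Python client, again assuming the llama3.2 model:

import ollama

# Stream the reply chunk-by-chunk instead of waiting for the full response
stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about local AI."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()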
Best Practices
- Pull models ahead of time with `ollama pull` so your first API request doesn't block on a download; use `ollama run` for quick interactive testing, while the background service at localhost:11434 handles requests from code.
- Use the native `/api/generate` and `/api/chat` endpoints (or the official Python client) for new code, and the OpenAI-compatible `/v1` endpoints when migrating existing OpenAI SDK code.
- Make sure to handle errors, timeouts, and the slower first request (while a model loads into memory) when calling the API; a sketch follows this list.
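As a sketch of that last point, wrapping the non-streaming /api/generate call from above:

import requests

endpoint = "http://localhost:11434/api/generate"
payload = {"model": "llama3.2", "prompt": "Summarize Ollama in one line.", "stream": False}

try:
    # First requests can be slow while the model loads, so allow a generous timeout
    response = requests.post(endpoint, json=payload, timeout=120)
    response.raise_for_status()
    print(response.json()["response"])
except requests.exceptions.ConnectionError:
    print("Is the Ollama service running on localhost:11434?")
except requests.exceptions.HTTPError as err:
    print(f"Request failed: {err}")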
By following these steps and best practices, you'll be able to get started with Ollama in no time. Happy experimenting!
By Malik Abualzait
