From Zero to Local AI in 10 Minutes With Ollama + Python
Why Ollama (And Why Now)?
If you want production-like experiments without cloud keys or per-call fees, Ollama gives you a local-first developer path:
- Zero friction: Install once; pull models on demand; everything runs on `localhost` by default.
- One API, two runtimes: The same API works for local and (optional) cloud models, so you can start on your laptop and scale later with minimal code changes.
- Batteries included: Simple CLI (`ollama run`, `ollama pull`), a clean REST API, an official Python client, embeddings, and vision support.
- Repeatability: A `Modelfile` (think: Dockerfile for models) captures system prompts and parameters so teams get the same behaviour.
What’s New in Late 2025 (at a Glance)
- Cloud models (preview): Run larger models on managed GPUs with the same API surface; develop locally, scale in the cloud without code changes.
- OpenAI-compatible endpoints: Point OpenAI SDKs at Ollama (`/v1`) for easy migration and local testing (see the Python sketch after this list).
- Windows desktop app: Official GUI for Windows users; drag-and-drop, multimodal inputs, and background service management.
- Safety/quality updates: Recent safety-classification models and runtime optimizations (e.g., flash-attention toggles in select backends) to improve performance.
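To see what the OpenAI-compatible endpoints look like in practice, here is a minimal sketch that points the OpenAI Python SDK at a local Ollama server. It assumes the openai package is installed and a model such as llama3.2 has already been pulled:

from openai import OpenAI

# Point the OpenAI SDK at Ollama's local /v1 endpoint.
# Ollama ignores the API key, but the SDK requires a non-empty value.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")

response = client.chat.completions.create(
    model="llama3.2",  # any locally pulled model
    messages=[{"role": "user", "content": "Say hello in one sentence."}],
)
print(response.choices[0].message.content)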
How Ollama Works (Architecture in 90 Seconds)
- Runtime: A lightweight server listens on `localhost:11434` and exposes REST endpoints for chat, generate, and embeddings. Responses stream token-by-token.
- Model format (GGUF): Models are packaged as quantized `.gguf` binaries for efficient CPU/GPU inference and fast memory-mapped loading.
- Inference engine: Built on the `llama.cpp` family of kernels with GPU offload via Metal (Apple Silicon), CUDA (NVIDIA), and others; choose quantization for your hardware.
- Configuration: A `Modelfile` pins base model, system prompt, parameters, adapters (LoRA), and optional templates, so your team's runs are reproducible (a minimal example follows this list).
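As a minimal sketch of that last point, here is a small Modelfile. The system prompt and parameter values are illustrative, and it assumes the llama3.2 base model is available locally:

# Modelfile: pins the base model, system prompt, and sampling parameters
FROM llama3.2
SYSTEM "You are a concise assistant for internal engineering docs."
PARAMETER temperature 0.2
PARAMETER num_ctx 4096

Build and run it (the name docs-assistant is just an example):

ollama create docs-assistant -f Modelfile
ollama run docs-assistant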
Install in 60 Seconds
macOS / Windows / Linux
- Download and install Ollama from the official site (choose your OS).
Get Started with Python
To get started with Ollama in Python, install the official Python client using pip:
pip install ollama
Next, create a new Python file (e.g., main.py) and import the Ollama client:
import ollama
# Initialize the Ollama client
client = ollama.Client()
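The client talks to the local server at http://localhost:11434 by default. Here is a minimal chat call as a sketch, assuming a model such as llama3.2 has already been pulled (the next section shows how):

# Send a single chat message and print the model's reply
response = client.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Why is the sky blue?"}],
)
print(response["message"]["content"])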
Run Your First Model
To run your first model, use the ollama run command with the name of the model you want to load. For example, to pull and chat with a small general-purpose model such as Llama 3.2:
ollama run llama3.2
The first run downloads the model; subsequent runs load it from the local cache. The Ollama background service is exposed at http://localhost:11434, and you can use that endpoint to make requests to the model from code.
Making Requests
To make requests from code, send a POST request to the native /api/generate endpoint with the model name and the prompt you want completed. Here's an example using Python:
import requests

# Ollama's native text-generation endpoint and payload
endpoint = "http://localhost:11434/api/generate"
payload = {
    "model": "llama3.2",
    "prompt": "This is a sample input.",
    "stream": False,  # return one JSON object instead of a token stream
}

# Send the request
response = requests.post(endpoint, json=payload)

# Print the generated text
print(response.json()["response"])
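Responses can also be streamed token-by-token. Here is a sketch using the official Python client, again assuming the llama3.2 model:

import ollama

# Stream the reply chunk-by-chunk instead of waiting for the full response
stream = ollama.chat(
    model="llama3.2",
    messages=[{"role": "user", "content": "Write a haiku about local AI."}],
    stream=True,
)
for chunk in stream:
    print(chunk["message"]["content"], end="", flush=True)
print()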
Best Practices
- Pull models ahead of time with `ollama pull` so your first API request doesn't block on a download; use `ollama run` for quick interactive testing, while the background service at localhost:11434 handles requests from code.
- Use the native `/api/generate` and `/api/chat` endpoints (or the official Python client) for new code, and the OpenAI-compatible `/v1` endpoints when migrating existing OpenAI SDK code.
- Make sure to handle errors, timeouts, and the slower first request (while a model loads into memory) when calling the API; a sketch follows this list.
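As a sketch of that last point, wrapping the non-streaming /api/generate call from above:

import requests

endpoint = "http://localhost:11434/api/generate"
payload = {"model": "llama3.2", "prompt": "Summarize Ollama in one line.", "stream": False}

try:
    # First requests can be slow while the model loads, so allow a generous timeout
    response = requests.post(endpoint, json=payload, timeout=120)
    response.raise_for_status()
    print(response.json()["response"])
except requests.exceptions.ConnectionError:
    print("Is the Ollama service running on localhost:11434?")
except requests.exceptions.HTTPError as err:
    print(f"Request failed: {err}")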
By following these steps and best practices, you'll be able to get started with Ollama in no time. Happy experimenting!
By Malik Abualzait
