A Beginner's Guide to Ollama Cloud Models

Ollama's cloud models are a new feature that allows users to run large language models without needing a powerful local GPU. These models are automatically offloaded to Ollama's cloud service, providing the same capabilities as local models while enabling the use of larger models that would typically not fit on a personal computer.
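
If you have Ollama installed locally and your machine is signed in to ollama.com (for example via ollama signin), cloud models behave just like local ones from the Python library; a minimal sketch of what that looks like (the prompt is only an example):

from ollama import chat

# Assumes a local Ollama installation that is signed in to ollama.com;
# the -cloud tag tells Ollama to offload this model to its cloud service.
response = chat(
  model='gpt-oss:120b-cloud',
  messages=[{'role': 'user', 'content': 'In one sentence, what are Ollama cloud models?'}],
)
print(response.message.content)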

Ollama currently supports the following cloud models:

  • deepseek-v3.1:671b-cloud
  • gpt-oss:20b-cloud
  • gpt-oss:120b-cloud
  • kimi-k2:1t-cloud
  • qwen3-coder:480b-cloud
  • glm-4.6:cloud
  • qwen3-vl:235b-cloud
  • Browse the latest additions to Ollama's cloud models

Cloud API Access

Cloud models can also be accessed directly through the ollama.com API. In this mode, ollama.com acts as a remote Ollama host.

For direct access to the Ollama cloud API, first create an API key.
Then, set the OLLAMA_API_KEY environment variable to your API key.

export OLLAMA_API_KEY=your_api_key

Install Dependencies

Run the following in a notebook cell (omit the leading ! if you run it in a terminal):

!pip install ollama python-dotenv pydantic[email] IPython

Note: This notebook runs in Google Colab.
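
If you run the notebook somewhere other than Colab, the python-dotenv package installed above can load the key from a local .env file instead; a minimal sketch:

import os
from dotenv import load_dotenv

# Assumes a .env file in the working directory containing a line: OLLAMA_API_KEY=...
load_dotenv()
ollama_api_key = os.getenv('OLLAMA_API_KEY')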

Basic Usage

The snippets below first load your API key (from Colab secrets, an environment variable, or an interactive prompt), then connect to the Ollama cloud API, send a question to gpt-oss:120b, and print the model's answer as it is generated.

import os
from google.colab import userdata

userdata_ollama = None
try:
  # Colab secrets are the preferred place to store the key
  userdata_ollama = userdata.get('OLLAMA_API_KEY')
except Exception:
  pass

if userdata_ollama:
  print("OLLAMA_API_KEY found in google colab\n")
  os.environ['OLLAMA_API_KEY'] = userdata_ollama
elif 'OLLAMA_API_KEY' not in os.environ:
  ollama_api_key = input('Enter your Ollama API key: ')
  os.environ['OLLAMA_API_KEY'] = ollama_api_key

ollama_api_key = os.getenv('OLLAMA_API_KEY')

A custom client can be created by instantiating Client or AsyncClient from ollama.

from ollama import Client

try:
  client = Client(
        host='https://ollama.com',
        headers={'Authorization': f'Bearer {ollama_api_key}'}
  )

  messages = [
    {
      'role': 'user',
      'content': 'Why is the sky blue?',
    },
  ]

  for part in client.chat('gpt-oss:120b', messages=messages, stream=True):
    print(part.message.content, end='', flush=True)
except Exception as e:
  print(f'Error: {e}')
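
The asynchronous variant works the same way. A minimal sketch using AsyncClient, assuming the ollama_api_key loaded above:

from ollama import AsyncClient

async def main():
  # Same host and auth header as the synchronous client
  async_client = AsyncClient(
      host='https://ollama.com',
      headers={'Authorization': f'Bearer {ollama_api_key}'}
  )
  response = await async_client.chat(
      'gpt-oss:120b',
      messages=[{'role': 'user', 'content': 'Why is the sky blue?'}],
  )
  print(response.message.content)

await main()  # top-level await works in Colab; use asyncio.run(main()) in a plain script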

Capabilities

Ollama's cloud models offer advanced capabilities beyond basic text generation, tailored for developers and AI practitioners. Key features include tool calling (for integrating external functions), thinking traces (to reveal the model's reasoning), streaming (for real-time responses), structured outputs (to enforce reliable JSON schemas), and vision (for multimodal image understanding). Together, these enable the development of robust, scalable, and production-ready AI applications with enhanced control and insight.

Tool Calling

Ollama offers support for tool calling, also referred to as function calling. This feature empowers a language model to utilize external tools or functions and integrate the outcomes of these tools into its responses.

Supported models
Code Example
from ollama import web_search, web_fetch

available_tools = {'web_fetch': web_fetch, 'web_search': web_search}

try:
  messages = [
      {
        'role': 'user',
        'content': 'what is ollama?',
      },
  ]

  while True:
    try:
      result = client.chat(model='deepseek-v3.1:671b', messages=messages, tools=[web_fetch, web_search])
    except Exception as e:
      print(f'Error: {e}')
      break
    if result.message.content:
      print(f'Content: {result.message.content}')

    messages.append(result.message)

    if result.message.tool_calls:
      print(f'Tool Calls: {result.message.tool_calls}')

      for tool_call in result.message.tool_calls:
        function_call = available_tools.get(tool_call.function.name)
        if function_call:
          tool_args = tool_call.function.arguments
          tool_result = function_call(**tool_args)
          print(f'Tool Result: {str(tool_result)[:200]}...')
          messages.append({
              'role': 'tool',
              'content': str(tool_result)[:500 * 4],  # pass the tool's output back to the model (truncated)
              'tool_name': tool_call.function.name
            }
          )
        else:
          messages.append({
              'role': 'tool',
              'content': f'Tool {tool_call.function.name} not found',
              'tool_name': tool_call.function.name
            }
          )
    else:
      break
except Exception as e:
  print(f'Error: {e}')
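
Plain Python functions can also be passed as tools; the library derives the tool schema from the function's type hints and docstring. A minimal sketch reusing the client from above (get_temperature is a made-up example function, not part of Ollama):

def get_temperature(city: str) -> str:
  """Return the current temperature for a city (dummy implementation)."""
  return f'The temperature in {city} is 21 degrees Celsius.'

try:
  messages = [{'role': 'user', 'content': 'How warm is it in Berlin right now?'}]
  result = client.chat(model='gpt-oss:120b', messages=messages, tools=[get_temperature])
  messages.append(result.message)

  if result.message.tool_calls:
    for tool_call in result.message.tool_calls:
      tool_result = get_temperature(**tool_call.function.arguments)
      messages.append({
          'role': 'tool',
          'content': tool_result,
          'tool_name': tool_call.function.name
      })
    # Send the tool results back so the model can write the final answer
    final = client.chat(model='gpt-oss:120b', messages=messages)
    print(final.message.content)
  else:
    print(result.message.content)
except Exception as e:
  print(f'Error: {e}')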

Thinking

Thinking capable models can generate a distinct thinking trace that details their reasoning process, separate from the final output. This feature allows for auditing the model's steps, visualizing its thought process in user interfaces, or concealing the trace when only the final answer is required.

Supported models
  • deepseek-v3.1:671b-cloud
  • gpt-oss:20b-cloud
  • gpt-oss:120b-cloud
  • Browse the latest additions under thinking models
Code Example
try:
  message = r"""Solve this expression and explain each algebraic step in plain English: $\frac{\frac{1}{x} + \frac{1}{x + 1}}{\frac{1}{x} - \frac{1}{x + 1}}$"""

  think_messages = [{'role': 'user', 'content': message}]

  thinking_result = client.chat(model='deepseek-v3.1:671b', messages=think_messages, think=True, stream=True)

  in_thinking = False

  for chunk in thinking_result:
    if chunk.message.thinking and not in_thinking:
      in_thinking = True
      print('Thinking:\n', end='')

    if chunk.message.thinking:
      print(chunk.message.thinking, end='')
    elif chunk.message.content:
      if in_thinking:
        print('\n\nAnswer:\n', end='')
        in_thinking = False
      print(chunk.message.content, end='')
except Exception as e:
  print(f'Error: {e}')

Streaming

Streaming lets you display text as the model generates it, rather than waiting for the full response. It is on by default in the REST API but off by default in the SDKs; set stream=True to enable it there.

Key Streaming Concepts
  • Chatting: Receive and render partial assistant messages in real time, as each chunk arrives.

  • Thinking: Some models include a thinking field in chunks, allowing you to optionally show the model's reasoning before the final answer.

  • Tool calling: Tool calls may appear incrementally in the stream; you can detect them, run the tools, and send the results back into the conversation.

Code Example
query = """
Explain this code snippet:

from ollama import chat

messages = [
  {
    'role': 'user',
    'content': 'Why is the sky blue?',
  },
]

response = chat('gemma3', messages=messages)
print(response['message']['content'])
"""

message = [{'role': 'user', 'content': query}]
for stream in client.chat(model='qwen3-coder:480b', messages=message, stream=True):
  print(stream.message.content, end='', flush=True)
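
As noted above, tool calls can also show up mid-stream. A rough sketch of detecting them, assuming the client, web tools, and available_tools mapping from the Tool Calling section:

messages = [{'role': 'user', 'content': 'What is new in Ollama? Search the web.'}]
tool_calls = []

try:
  for chunk in client.chat(model='deepseek-v3.1:671b', messages=messages,
                           tools=[web_search, web_fetch], stream=True):
    if chunk.message.content:
      print(chunk.message.content, end='', flush=True)
    if chunk.message.tool_calls:
      tool_calls.extend(chunk.message.tool_calls)

  # Once the stream ends, the collected calls can be executed and their results
  # appended as 'tool' messages, exactly as in the Tool Calling example above.
  for tool_call in tool_calls:
    print(f'\nModel requested tool: {tool_call.function.name}')
except Exception as e:
  print(f'Error: {e}')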

Structured Outputs

Structured outputs let you enforce a specific JSON schema on model responses, ensuring reliable extraction of structured data, consistent replies, or formatted image descriptions.

You can enable this in two ways:

  1. Generic JSON: Set the format parameter to json to ensure the output is valid JSON (see the sketch after this list).

  2. Specific Schema: Provide a detailed JSON schema (using tools like Pydantic in Python or Zod in JavaScript) to the format parameter. This forces the model to return data matching your exact structure.
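
A minimal sketch of the first, generic mode, reusing the client from above; note that format='json' only guarantees syntactically valid JSON, not any particular shape. The Code Example below then shows the second, schema-based mode.

import json

response = client.chat(
    model='gpt-oss:120b',
    messages=[{'role': 'user', 'content':
               'List three primary colors as a JSON object with a "colors" array.'}],
    format='json',  # ask for valid JSON without enforcing a schema
)
print(json.loads(response.message.content))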

Code Example
from pydantic import BaseModel, EmailStr, HttpUrl, Field
from typing import Optional

class Product(BaseModel):
    name: str = Field(..., description='Name of the product')
    description: str = Field(..., description='Description of the product')
    price: float = Field(0, description='Price of the product (e.g. 199.99)')
    currency: str = Field('USD', description='Currency of the price')
    category: Optional[str] = None # Category is not always present
    in_stock: bool = Field(..., description='Whether the product is in stock')

class ProviderDetails(BaseModel):
    name: str = Field(..., description='Name of the provider')
    description: str = Field(..., description='Description of the provider')
    email: EmailStr = Field(..., description='Email of the provider')
    phone: Optional[str] = None
    website: HttpUrl = Field(..., description='Website of the provider')
    address: str = Field(..., description='Address of the provider')

class Provider(BaseModel):
    provider: ProviderDetails = Field(None, description='Details of the provider')
    products: list[Product] = Field(default_factory=list, description='List of products provided by the provider')

def extract_provider_and_products(text: str) -> Provider:
    """
    Takes a text and extracts Provider and Product information
    structured according to the defined Pydantic models.
    """
    prompt = f"""
    Analyze the following text and extract the information about the Provider and their Products into a JSON format.
    The JSON must contain a top-level key 'provider' which holds an object with the fields:
    'name', 'description', 'email', 'phone', 'website', and 'address'.
    It must also contain a top-level key 'products' which is a list of objects. Each product object must have the fields:
    'name', 'description', 'price' (as a number, e.g., 199.99, not $199.99), 'currency', 'category', and 'in_stock' (as a boolean, true/false).

    Text: {text}

    Ensure the output conforms strictly to the JSON format provided.
    If a field is not present, omit it (do not return null).
    Only return the JSON, do not include any markdown or other text.
    """

    response = client.chat(
        model='gpt-oss:120b', # Replace with your chosen model
        messages=[{'role': 'user', 'content': prompt}],
        format=Provider.model_json_schema(), # Use the schema to enforce structure
        options={'temperature': 0},  # Set temperature to 0 for more deterministic output
    )

    # Validate the model response against the Pydantic schema
    provider_data = Provider.model_validate_json(response.message.content)
    return provider_data


supplier_text = """
Meet TechSolutions Inc., a leading provider of innovative software development tools and services.
We are headquartered in Austin, Texas, and have been delivering cutting-edge solutions to businesses
worldwide for the past eight years. You can reach us at contact@techsolutions.com or call (555) 987-6543.
Visit our website: https://www.techsolutions.com

Our current product catalog includes:
- CodeMaster Pro IDE: Advanced integrated development environment, $199.99/user/year, category software, currently in stock.
- BugFinder Suite: Comprehensive debugging toolkit, $49.99/month, category software, currently in stock.
- CloudDeploy Platform: Automated deployment service, $29.99/month per project, category software, currently in stock.
"""

try:
  extracted_data = extract_provider_and_products(supplier_text)
  print(extracted_data.model_dump_json(indent=2))
except Exception as e:
  print(f'Error during extraction or validation: {e}')

Vision

Vision models are AI models that can process both images and text to perform tasks like describing, classifying, and answering questions about visual content.

Supported models
  • qwen3-vl:235b-cloud
  • Browse the latest additions under vision models

Request (with images)

To submit images to vision models such as qwen3-vl, llava or bakllava, provide a list of base64-encoded images.

Code Example
import base64
from IPython.display import Image, display

image_path = '/content/image.jpg'  # upload an image to this path in Colab first
with open(image_path, 'rb') as f:
  image_base64 = base64.b64encode(f.read()).decode('utf-8')

messages = [
  {
    'role': 'user',
    'content': 'What is in this image?',
    'images': [image_base64]
  }
]

try:
  result = client.chat(model='qwen3-vl:235b', messages=messages, stream=True)
  for part in result:
    print(part.message.content, end='', flush=True)
  display(Image(filename=image_path, width=300))
except Exception as e:
  print(f'Error: {e}')

Conclusion

Ollama's Cloud Models represent a significant leap forward in making powerful, large-scale AI accessible to developers without the need for high-end local hardware. By offloading computation to the cloud, users can seamlessly run massive models, such as deepseek-v3.1:671b-cloud, gpt-oss:120b-cloud, and qwen3-vl:235b-cloud, that would otherwise be impractical on personal machines.

These cloud models deliver a rich set of production-ready capabilities:

  • Tool calling enables dynamic interaction with external APIs and services.
  • Thinking traces provide transparent, step-by-step reasoning for auditability and enhanced UX.
  • Streaming supports real-time, low-latency responses ideal for chat and interactive applications.
  • Structured outputs guarantee reliable, schema-compliant JSON—critical for data extraction, automation, and integration.
  • Vision support unlocks multimodal understanding, allowing models to interpret images alongside text.

With straightforward API access, environment-based authentication, and full compatibility with popular developer tools, Ollama Cloud lowers the barrier to building sophisticated AI applications. Whether you're prototyping, automating workflows, or deploying enterprise solutions, Ollama's cloud offering combines flexibility, control, and scalability, bringing the future of open, on-demand AI within reach.

Resources
