DEV Community

Nube Colectiva

Originally published at nubecolectiva.com

Mastering Ollama AI endpoints: How to use each one correctly

Learn how to use 14 core Ollama API endpoints with real-world examples, best practices, and production-ready insights.

Artificial Intelligence is rapidly moving from cloud-only environments to local deployments. Developers increasingly want privacy, lower latency, reduced costs, and complete control over their AI infrastructure.

This is where Ollama shines.

Ollama allows you to run powerful Large Language Models (LLMs) such as Llama, Gemma, Mistral, Qwen, DeepSeek, and many others directly on your local machine or server. Beyond running models, Ollama provides a robust REST API that enables developers to integrate AI capabilities into applications, automation workflows, chatbots, coding assistants, search engines, and enterprise systems.

In this guide, you'll learn 14 core Ollama API endpoints, understand when to use each one, and see practical examples that go beyond the official documentation.


What Is Ollama?

Ollama is a platform designed to simplify the deployment and execution of large language models locally.

Some advantages include:

  • Privacy-focused AI processing
  • No dependency on external AI providers
  • Reduced API costs
  • Fast local inference
  • OpenAI-compatible API support
  • Easy model management

By default, Ollama runs on:

http://localhost:11434

1. Generate Text

Endpoint

POST /api/generate

Purpose

Generates text from a single prompt.

Example

curl http://localhost:11434/api/generate \
-d '{
  "model":"llama3",
  "prompt":"Explain quantum computing in simple terms."
}'

Real Use Cases

  • Content generation
  • Code generation
  • Documentation writing
  • SEO article creation
  • Email drafting

Expert Tip

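Use /api/generate for one-shot tasks where conversation history is unnecessary. Because no earlier turns are sent along, each request carries a shorter prompt than a chat exchange with accumulated history, which keeps processing time down.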
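
The curl call above can also be made from application code. The following is a minimal sketch using only the Python standard library; the helper names are hypothetical, and it assumes a server at the default address. It sets "stream" to false so the reply arrives as one JSON object with the text under "response" rather than as a stream of partial objects.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address from this article


def build_generate_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for /api/generate with streaming turned off."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    return body["response"]
```

Calling generate("llama3", "Explain quantum computing in simple terms.") would return the model's answer as a string, provided the model has already been pulled.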


2. Chat Conversations

Endpoint

POST /api/chat

Purpose

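Generates the next reply in a conversation. The endpoint itself is stateless: context is maintained by sending the accumulated messages array with every request.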

Example

curl http://localhost:11434/api/chat \
-d '{
  "model":"llama3",
  "messages":[
    {
      "role":"user",
      "content":"Create a Node.js REST API."
    }
  ]
}'

Real Use Cases

  • AI assistants
  • Customer support bots
  • Programming copilots
  • Internal company chatbots

Expert Tip

For production chat applications, always store conversation history externally rather than relying solely on the model context window.
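
One way to keep history outside the model is a small client-side wrapper that resends the stored messages on each turn. The sketch below is illustrative only: Conversation and post_chat are hypothetical names, the history lives in memory (a real system would likely persist it in a database), and the transport function is injectable so the logic can be exercised without a server.

```python
import json
import urllib.request
from typing import Callable

CHAT_URL = "http://localhost:11434/api/chat"


def post_chat(payload: dict) -> dict:
    """POST a payload to /api/chat and return the parsed JSON reply."""
    req = urllib.request.Request(
        CHAT_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


class Conversation:
    """Keeps the message history on the client side (hypothetical helper)."""

    def __init__(self, model: str, send: Callable[[dict], dict] = post_chat):
        self.model = model
        self.send = send
        self.messages = []

    def ask(self, text: str) -> str:
        """Append the user turn, send the full history, store the reply."""
        self.messages.append({"role": "user", "content": text})
        reply = self.send(
            {"model": self.model, "messages": self.messages, "stream": False}
        )
        # Non-streaming replies carry the assistant turn under "message".
        self.messages.append(reply["message"])
        return reply["message"]["content"]
```

Because every call sends the whole list, trimming or summarizing old turns before they exceed the context window is left to the application.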


3. Generate Embeddings

Endpoint

POST /api/embeddings

Purpose

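Converts text into numerical vectors. Recent Ollama documentation lists POST /api/embed, which takes an "input" field instead of "prompt", as the successor to this endpoint, so check which one your version supports.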

Example

curl http://localhost:11434/api/embeddings \
-d '{
  "model":"nomic-embed-text",
  "prompt":"How does machine learning work?"
}'

Real Use Cases

  • Semantic search
  • RAG systems
  • Recommendation engines
  • Knowledge bases

Expert Tip

Embeddings are the foundation of modern Retrieval-Augmented Generation (RAG) systems.
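
To make the idea concrete, here is a hedged sketch of how two embedding vectors might be compared. The embed helper is a hypothetical wrapper around the request shown above and assumes the reply holds the vector under "embedding"; cosine_similarity is plain arithmetic and does not depend on Ollama at all.

```python
import json
import math
import urllib.request


def embed(text: str, model: str = "nomic-embed-text") -> list:
    """Fetch one embedding vector from /api/embeddings (needs a running server)."""
    req = urllib.request.Request(
        "http://localhost:11434/api/embeddings",
        data=json.dumps({"model": model, "prompt": text}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))["embedding"]


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Texts with related meaning should score closer to 1 than unrelated ones, which is the property semantic search and RAG retrieval rely on.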


4. List Installed Models

Endpoint

GET /api/tags

Purpose

Displays all downloaded models.

Example

curl http://localhost:11434/api/tags

Why It Matters

Useful for:

  • Admin dashboards
  • Deployment scripts
  • Health checks
  • Monitoring systems
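
A deployment script or health check could use the listing to confirm that required models are present. The sketch below assumes the reply contains a "models" array whose entries have a "name" such as "llama3:latest"; missing_models is a hypothetical helper, and treating an untagged name as ":latest" is an assumption made for the comparison.

```python
import json
import urllib.request


def fetch_tags(base_url: str = "http://localhost:11434") -> dict:
    """GET /api/tags and return the parsed JSON (needs a running server)."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        return json.loads(resp.read().decode("utf-8"))


def with_default_tag(name: str) -> str:
    """Treat a name without a tag as ':latest' (assumption for matching)."""
    return name if ":" in name else f"{name}:latest"


def missing_models(tags: dict, required: list) -> list:
    """Return the required models that the /api/tags reply does not list."""
    installed = {m["name"] for m in tags.get("models", [])}
    return [r for r in required if with_default_tag(r) not in installed]
```

A health check might call missing_models(fetch_tags(), ["llama3", "nomic-embed-text"]) and report unhealthy when the result is not empty.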

5. Display Model Details

Endpoint

POST /api/show

Purpose

Returns detailed model information.

Example

curl http://localhost:11434/api/show \
-d '{
  "name":"llama3"
}'

Useful Information Returned

  • Parameters
  • Quantization level
  • Model size
  • Context length
  • Architecture details

Expert Tip

Use this endpoint to automatically validate model compatibility before deployment.
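
As an illustration of such a validation step, the sketch below inspects an already parsed /api/show reply. It assumes the reply has a "details" object with "family" and "quantization_level" fields; the function name and the allow-lists are hypothetical policy choices, not something Ollama prescribes.

```python
def compatibility_problems(show: dict, allowed_families: set, allowed_quant: set) -> list:
    """Return human-readable reasons a model fails a (hypothetical) deployment policy."""
    details = show.get("details", {})
    problems = []
    family = details.get("family")
    quant = details.get("quantization_level")
    if family not in allowed_families:
        problems.append(f"unsupported family: {family}")
    if quant not in allowed_quant:
        problems.append(f"unsupported quantization: {quant}")
    return problems


# Sample reply fragment, made up for illustration.
sample = {"details": {"family": "llama", "quantization_level": "Q4_0"}}
print(compatibility_problems(sample, {"llama"}, {"Q4_0", "Q8_0"}))
```

An empty list means the model passed every check, so a deployment pipeline can fail fast when the list is not empty.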


6. Download a Model

Endpoint

POST /api/pull

Purpose

Downloads a model from the Ollama registry.

Example

curl http://localhost:11434/api/pull \
-d '{
  "name":"deepseek-r1"
}'

Automation Scenario

When deploying a new server, a startup script such as startup.sh can automatically pull required models before application startup.
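
A startup step along those lines could look like the sketch below. It assumes the pull endpoint answers with newline-delimited JSON status objects that end in a "success" status, and that a failure shows up as an object with an "error" key; both helper names are hypothetical.

```python
import json
import urllib.request
from typing import Iterable


def pull_succeeded(lines: Iterable[bytes]) -> bool:
    """Scan a newline-delimited JSON status stream for a final success."""
    last_status = None
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        update = json.loads(raw)
        if "error" in update:
            return False
        last_status = update.get("status")
    return last_status == "success"


def pull(model: str, base_url: str = "http://localhost:11434") -> bool:
    """Ask the server to download a model and wait for the outcome."""
    req = urllib.request.Request(
        f"{base_url}/api/pull",
        data=json.dumps({"name": model}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object can be iterated line by line.
        return pull_succeeded(resp)
```

The application would only start once pull(...) has returned True for every model it depends on.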


7. Upload a Model

Endpoint

POST /api/push

Purpose

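Publishes a model to a registry. The model name normally has to include a namespace, in the form <namespace>/<model>:<tag>, so a bare name like the one in the example below may need a prefix first.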

Example

curl http://localhost:11434/api/push \
-d '{
  "name":"mycompany-assistant"
}'

Real Use Cases

  • Internal AI distribution
  • Team collaboration
  • Enterprise model sharing

8. Create a Custom Model

Endpoint

POST /api/create

Purpose

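Creates custom models from a Modelfile. Recent Ollama releases document structured request fields (such as "from" and "system") in place of the "modelfile" string, so check the API reference for your installed version.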

Example

curl http://localhost:11434/api/create \
-d '{
  "name":"seo-expert",
  "modelfile":"FROM llama3"
}'

Why This Is Powerful

You can:

  • Add custom system prompts
  • Create branded assistants
  • Standardize AI behavior
  • Build department-specific AI agents

9. Copy a Model

Endpoint

POST /api/copy

Purpose

Duplicates an existing model.

Example

curl http://localhost:11434/api/copy \
-d '{
  "source":"llama3",
  "destination":"llama3-backup"
}'

Common Use Cases

  • Versioning
  • Testing
  • Experimentation
  • Safe upgrades

10. Delete a Model

Endpoint

DELETE /api/delete

Purpose

Removes a model from local storage.

Example

curl -X DELETE http://localhost:11434/api/delete \
-d '{
  "name":"old-model"
}'

Best Practice

Always verify model usage before deleting in shared environments.
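
One possible form of that verification is to refuse deletion while the model is loaded in memory. The sketch below takes a parsed /api/ps reply (assumed to contain a "models" array with "name" entries) and only builds the DELETE request when the model is idle; plan_delete is a hypothetical helper, and a loaded-model check is just one signal of usage among several a shared environment might need.

```python
import json
import urllib.request
from typing import Optional


def with_default_tag(name: str) -> str:
    """Treat a name without a tag as ':latest' (assumption for matching)."""
    return name if ":" in name else f"{name}:latest"


def plan_delete(
    name: str, ps: dict, base_url: str = "http://localhost:11434"
) -> Optional[urllib.request.Request]:
    """Return a DELETE request for the model, or None if it is currently loaded."""
    loaded = {m["name"] for m in ps.get("models", [])}
    if with_default_tag(name) in loaded:
        return None
    return urllib.request.Request(
        f"{base_url}/api/delete",
        data=json.dumps({"name": name}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="DELETE",
    )
```

The caller passes the returned request to urllib.request.urlopen only when it is not None.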


11. View Running Models

Endpoint

GET /api/ps

Purpose

Shows models currently loaded in memory.

Example

curl http://localhost:11434/api/ps

Why It Matters

Helpful for:

  • Memory monitoring
  • Resource optimization
  • Capacity planning
  • Troubleshooting

Expert Tip

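Large models may occupy several gigabytes of RAM or VRAM while loaded, even when no requests are running. By default Ollama unloads a model about five minutes after its last request; the keep_alive request parameter changes that window.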
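
For memory monitoring, the reply can be reduced to a short table. This sketch assumes each entry in the "models" array carries "size" and "size_vram" in bytes; memory_summary is a hypothetical helper and the sample numbers in the usage note are made up.

```python
GIB = 1024 ** 3


def memory_summary(ps: dict) -> list:
    """Return (name, total_gib, vram_gib) per loaded model, largest first."""
    rows = []
    for m in ps.get("models", []):
        total = m.get("size", 0)
        vram = m.get("size_vram", 0)
        rows.append((m["name"], round(total / GIB, 2), round(vram / GIB, 2)))
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

A model whose VRAM figure is well below its total size is presumably running partly on the CPU, which is a common cause of slow responses.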


12. Check Ollama Version

Endpoint

GET /api/version

Purpose

Returns the installed Ollama version.

Example

curl http://localhost:11434/api/version

Production Use

Useful for:

  • CI/CD validation
  • Compatibility checks
  • Deployment audits

13. OpenAI-Compatible Chat Completions

Endpoint

POST /v1/chat/completions

Purpose

Provides OpenAI API compatibility.

Example

curl http://localhost:11434/v1/chat/completions \
-d '{
  "model":"llama3",
  "messages":[
    {
      "role":"user",
      "content":"Write a Python function for sorting."
    }
  ]
}'

Why Developers Love This

Applications built for OpenAI can often switch to Ollama with minimal code changes.

Real Benefits

  • Lower costs
  • Local execution
  • Better privacy
  • Vendor independence
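
The "minimal code changes" point can be illustrated with a client where only the base URL differs between providers. This is a sketch using the standard library rather than an official SDK; the function names are hypothetical, and the Authorization value is a placeholder on the assumption that a local Ollama server does not check it.

```python
import json
import urllib.request


def build_chat_completion_request(
    base_url: str, model: str, user_text: str, api_key: str = "ollama"
) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request against any base URL."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # placeholder for local use
        },
    )


def extract_reply(completion: dict) -> str:
    """Pull the assistant text out of an OpenAI-format response."""
    return completion["choices"][0]["message"]["content"]
```

Pointing base_url at http://localhost:11434/v1 targets Ollama, while the rest of the calling code stays the same.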

14. OpenAI-Compatible Model Listing

Endpoint

GET /v1/models

Purpose

Lists available models using the OpenAI format.

Example

curl http://localhost:11434/v1/models

Best Use Cases

  • AI gateways
  • SDK integrations
  • Multi-provider platforms
  • Existing OpenAI-based projects

Building Production Systems with Ollama

Many developers stop at generating text, but modern AI applications usually combine several endpoints:

AI Chatbot

/api/chat
/api/show
/api/ps

RAG Search Engine

/api/embeddings
/api/chat

Internal AI Platform

/api/pull
/api/show
/api/chat
/api/delete

OpenAI Replacement

/v1/chat/completions
/v1/models

Combining endpoints intelligently is what separates a proof of concept from a production-ready AI solution.
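
As one example of combining endpoints, the RAG pairing of /api/embeddings and /api/chat comes down to two steps between the calls: rank stored passages against the query vector, then place the best ones into the chat messages. The sketch below covers only that glue logic with hypothetical helper names; the vectors are assumed to come from the embeddings endpoint and the resulting messages would be sent to the chat endpoint.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def top_k(query_vec: list, indexed: list, k: int = 2) -> list:
    """indexed holds (text, vector) pairs; return the k most similar texts."""
    ranked = sorted(indexed, key=lambda pair: cosine(query_vec, pair[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_rag_messages(question: str, passages: list) -> list:
    """Put retrieved passages into a system message ahead of the user question."""
    context = "\n\n".join(passages)
    return [
        {"role": "system", "content": "Answer using only the context below.\n\n" + context},
        {"role": "user", "content": question},
    ]
```

A linear scan like this is fine for small collections; larger ones would normally use a vector database for the ranking step.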


Security Best Practices

Before exposing Ollama publicly:

  • Place it behind a reverse proxy
  • Enable authentication at the proxy layer (Ollama's API does not authenticate requests on its own)
  • Limit access with firewalls
  • Monitor resource consumption
  • Restrict model management endpoints
  • Use HTTPS in production

Never expose an unrestricted Ollama instance directly to the internet.


Performance Optimization Tips

To achieve better performance:

  1. Use quantized models when possible.
  2. Keep frequently used models loaded.
  3. Monitor RAM utilization.
  4. Cache embeddings.
  5. Use SSD storage.
  6. Separate inference and application servers for high traffic.

These practices can significantly reduce latency and improve throughput.
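
Tip 4 is straightforward to sketch. The cache below keys each vector by a hash of the model name and the text, so repeated inputs skip the embedding call. EmbeddingCache is a hypothetical in-memory helper; embed_fn stands for whatever function actually calls the embeddings endpoint, and a production system would probably persist entries and bound the cache size.

```python
import hashlib
from typing import Callable


class EmbeddingCache:
    """In-memory cache in front of an embedding function (illustrative only)."""

    def __init__(self, embed_fn: Callable[[str, str], list]):
        self._embed_fn = embed_fn  # called as embed_fn(model, text)
        self._store = {}
        self.misses = 0

    def get(self, model: str, text: str) -> list:
        """Return the cached vector, computing it on the first request."""
        key = hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed_fn(model, text)
        return self._store[key]
```

Including the model name in the key matters because vectors from different embedding models are not interchangeable.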


Conclusion

Ollama is much more than a tool for running local language models. It is a complete AI platform with endpoints covering text generation, conversational AI, embeddings, model lifecycle management, monitoring, and OpenAI compatibility.

Understanding these 14 endpoints allows developers to build sophisticated AI solutions without relying entirely on external providers. Whether you're creating a chatbot, a RAG-powered knowledge base, a coding assistant, or an enterprise AI platform, Ollama provides the building blocks needed to deploy AI locally, securely, and efficiently.

As organizations increasingly prioritize privacy, cost control, and infrastructure ownership, mastering the Ollama API is becoming a valuable skill for modern software engineers, DevOps professionals, and AI developers.

