Today, I'll discuss the practicality of running a local LLM (Large Language Model) and whether it's a real alternative to cloud-based solutions, based on my own experiences. Thanks to tools like Ollama, we can now use capabilities that were once only accessible to large corporations on our own servers or even powerful workstations. While doing AI-powered production planning in a manufacturing ERP or testing complex scenarios in my side product's financial calculators, I saw cloud API costs inflate rapidly. This is precisely where the appeal of local solutions emerges.
A few months ago, when the bandwidth of the three ISPs we used at the company's network egress proved insufficient, I decided to run a small but critical service on a local machine inside the internal network. The same logic applies to LLMs: keep critical workloads under our own control, without external dependencies. Ollama is one of the key tools that gives us this control. In this post, I will use practical examples to cover setting up local LLMs with Ollama, performance expectations, application integration, and advanced usage scenarios.
Why Ollama Matters: The Cost and Privacy Dilemma of Cloud LLMs
Cloud-based LLM services are undoubtedly valuable for their ease of use and scalability. However, for someone like me, who constantly monitors costs and is obsessed with data privacy, they come with some serious drawbacks. In a client's manufacturing ERP project, where we aimed to enhance production planning algorithms with AI, we initially used cloud LLM APIs. Everything went smoothly during initial tests, but when we moved to real operational load, we realized costs could exceed $5,000 USD per month. This was an unacceptable figure for just a few hundred production orders and hundreds of operator screens.
⚠️ The Cost Trap
While the unit prices of cloud LLM APIs may seem low, pay-per-token charges add up quickly under heavy usage and lead to unexpected expenses. Repeated calls during development and testing are especially hard on the budget.
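To make the arithmetic behind this concrete, here is a minimal sketch of a monthly cost estimate. The request volume, token count, and price per million tokens are hypothetical values chosen for illustration; they are not the figures from the ERP project or any provider's actual price list.

```python
def monthly_token_cost(requests_per_day, tokens_per_request,
                       usd_per_million_tokens, days=30):
    """Estimate monthly spend for a pay-per-token API."""
    total_tokens = requests_per_day * tokens_per_request * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


if __name__ == "__main__":
    # Hypothetical workload: all three numbers are assumptions
    cost = monthly_token_cost(
        requests_per_day=20_000,
        tokens_per_request=2_500,
        usd_per_million_tokens=3.0,
    )
    print(f"Estimated monthly cost: ${cost:,.2f}")
```

Plugging in your own request counts and your provider's current prices before going to production gives an early warning of this kind of surprise.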
Beyond cost, another major concern was data privacy. Sending sensitive data related to production processes—critical company secrets like raw material origins, production recipes, or cost information—to a third-party cloud service's servers was risky. Even though cloud providers claim "your data won't be used for training," I wasn't comfortable unless I had full control. This was a dilemma I also faced with my own side products that handle sensitive data, such as financial calculators. Performing LLM operations with data kept within our own infrastructure, without external dependencies, was a strategic choice to both control costs and ensure data security. This is why local solutions like Ollama are lifesavers in such scenarios.
Ollama Installation and First Steps: Running an LLM on My Own Machine
Installing Ollama is quite straightforward. Since I typically use my Linux servers, I'll explain how I do it there. First, I get the installation script from Ollama's official website or GitHub repository. This script installs the necessary dependencies and starts the Ollama service.
curl -fsSL https://ollama.com/install.sh | sh
This command usually sets up the systemd service automatically. After the installation is complete, we can check if it's installed correctly with the ollama --version command. On my server, I currently get an output like ollama version is 0.1.32.
After installation, the next step is to download a model. Ollama supports many popular models such as llama2, mistral, and phi3. For small and quick experiments, I usually prefer mistral or phi3. For example, to download the mistral model:
ollama pull mistral
This command downloads the model and stores it locally. The download time may vary depending on the model size and your internet speed. The mistral model is usually around 4-5 GB. Once the download is complete, all you need to do to run the model is:
ollama run mistral
This command starts an interactive session, and you can begin chatting with the model. For example, when I ask "Explain the advantages of distributed systems in 5 points," the model starts responding within seconds. With these simple steps, you can get an LLM up and running on your own machine. It's surprising to see that the kind of integration work that once took months on an internal banking platform can now be done with a few commands.
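Once a model is pulled, it can be useful to confirm from a script that the server is up and which models it holds. The sketch below assumes the default address http://localhost:11434 and uses the server's GET /api/tags endpoint; the helper names are my own, and only the standard library is used.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default address; adjust if yours differs


def summarize_models(tags_payload):
    """Turn the JSON body of GET /api/tags into (name, size_in_GB) pairs."""
    return [
        (model["name"], round(model.get("size", 0) / 1e9, 1))
        for model in tags_payload.get("models", [])
    ]


def list_local_models(base_url=OLLAMA_URL, timeout=5):
    """Ask the local Ollama server which models it has downloaded."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return summarize_models(json.load(resp))


if __name__ == "__main__":
    try:
        for name, size_gb in list_local_models():
            print(f"{name}: {size_gb} GB")
    except OSError as exc:  # URLError and timeouts are OSError subclasses
        print(f"Could not reach Ollama at {OLLAMA_URL}: {exc}")
```

If the call fails, the service is probably not running or is listening on a different address.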
Local LLM Performance: Expectations vs. Reality
Running a local LLM presents a different performance equation compared to cloud solutions. The most critical factors here are RAM and VRAM (GPU memory), and the model's size determines how much of each you need. As a rule of thumb, a 7B-parameter model calls for at least 8 GB of RAM or VRAM, and a 13B model for 16 GB. On my own VPS (8GB RAM, 2 CPU cores), when running the 7B quantized version of llama2, I get an average response speed of 10-15 tokens/second. This is sufficient for simple text generation and summarization tasks, but it can be slow for real-time and high-volume operations.
Another important factor affecting performance is quantization. Models are often quantized to lower precision, such as 4-bit or 8-bit, to consume less memory and run faster. This can sometimes slightly reduce the quality of the model's outputs, but in most practical scenarios, this reduction is negligible. In my production planning project, the 4-bit quantized mistral model proved quite successful in providing instant suggestions to operators.
ℹ️ Quantization and Performance
Quantization is a method to reduce memory usage and increase inference speed by converting a model's weights to lower bit precision. For example, models like llama2:7b-chat-q4_0 use 4-bit quantization.
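The effect of bit width on memory is simple arithmetic: parameters times bits per weight, divided by eight to get bytes. The sketch below covers the weights only. It deliberately ignores the context cache, runtime overhead, and the scale metadata that quantized formats store alongside the weights, so treat the results as a lower bound rather than a sizing guide.

```python
def weight_memory_gb(params_billion, bits_per_weight):
    """Approximate memory for the weights alone: parameters x bits / 8 bytes."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for bits in (16, 8, 4):
        print(f"7B model at {bits}-bit: ~{weight_memory_gb(7, bits):.1f} GB")
```

This is why a 7B model that would not fit into 8 GB at 16-bit precision becomes usable on modest hardware once it is quantized to 4 bits.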
In my experience, if you have a GPU (preferably NVIDIA, for CUDA support), local LLM performance increases tremendously. In a client project, we were able to run a 70B parameter model (quantized) on an RTX 4090 with 24GB VRAM at a speed of 30-40 tokens per second. This is a speed that can compete with cloud APIs. However, the cost of such hardware should not be overlooked. There's always a trade-off: either you invest in hardware and maintain control, or you pay the cloud provider per token. My preference for critical workloads is always towards hardware investment and local control. Especially for security layers like audit subsystems and file integrity monitoring, I can manage them more strictly on my own machines.
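To measure throughput on your own hardware rather than rely on quoted numbers, the final JSON object returned by Ollama's /api/generate endpoint includes an eval_count field (tokens generated) and an eval_duration field (in nanoseconds). A minimal sketch of the calculation follows; the sample values are made up for illustration and are not a real measurement.

```python
def tokens_per_second(result):
    """Generation speed from a non-streaming /api/generate response body."""
    eval_count = result.get("eval_count", 0)        # tokens generated
    eval_duration = result.get("eval_duration", 0)  # nanoseconds
    if eval_duration <= 0:
        return 0.0
    return eval_count / eval_duration * 1e9


if __name__ == "__main__":
    # Made-up timing values, not a real benchmark
    sample = {"eval_count": 120, "eval_duration": 10_000_000_000}
    print(f"{tokens_per_second(sample):.1f} tokens/s")
```

Running the same prompt on CPU-only and GPU machines and comparing this figure is a quick way to see whether a hardware investment pays off for your workload.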
Integration with Ollama API: Adding Local Intelligence to My Applications
It's important to note that Ollama is not just an interactive CLI interface; it also offers a powerful REST API. Thanks to this API, we can easily add local LLM capabilities to our own applications. Since I generally use Python, let me show you how I integrate it with an example. In the backend of one of my side products, there's a section that processes user queries in natural language and analyzes financial data. While this used to work only with regex and rule-based systems, it has now become much more flexible thanks to the Ollama API.
First, you need to ensure the Ollama server is running. By default, it listens on http://localhost:11434. Then, you can make a simple call using the requests library in Python:
import requests

def chat_with_ollama(prompt, model="mistral"):
    """Send a prompt to the local Ollama server and return the generated text."""
    url = "http://localhost:11434/api/generate"
    data = {
        "model": model,
        "prompt": prompt,
        "stream": False,  # Set to True to receive the response as a stream
    }
    try:
        # json= serializes the payload and sets the Content-Type header;
        # the timeout keeps a stalled server from blocking the caller forever
        response = requests.post(url, json=data, timeout=120)
        response.raise_for_status()  # Raise on HTTP error status codes
        result = response.json()
        return result["response"]
    except requests.exceptions.RequestException as e:
        print(f"API call failed: {e}")
        return None

# Example usage
if __name__ == "__main__":
    my_prompt = "What is the capital of Turkey?"
    response_text = chat_with_ollama(my_prompt)
    if response_text:
        print(f"Answer: {response_text}")

    # A more complex scenario
    complex_prompt = "Describe 3 main problems a manufacturing company might face in inventory management and their solutions."
    erp_response = chat_with_ollama(complex_prompt)
    if erp_response:
        print(f"\nERP Answer:\n{erp_response}")
This Python code sends a POST request to the Ollama API to get a response from the model. With stream: False, we get the entire response at once, while with stream: True, we can receive the response in chunks and display it in real-time in the user interface. In my ERP project, I used this streaming feature for instant production planning suggestions integrated into operator screens. Operators would start seeing the LLM's suggestions on screen within seconds of entering a work order. This was much more efficient than the delayed reporting in the old system. Furthermore, while doing this integration, I placed Ollama behind an Nginx reverse proxy to add both security and load balancing layers. I protected the API with an authorization mechanism using JWT/OAuth2 patterns, ensuring that only authorized applications could access it.
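For the streaming case, the server sends one JSON object per line, each carrying a fragment in its response field, with done set to true on the last one. The following is a minimal sketch using only the standard library. The function names are my own, and it assumes a direct connection to the default address; if you sit behind a reverse proxy with authentication, you would need to add the appropriate headers.

```python
import json
import urllib.request


def iter_stream_text(lines):
    """Yield text fragments from an iterable of newline-delimited JSON lines."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        if chunk.get("response"):
            yield chunk["response"]
        if chunk.get("done"):
            break


def stream_generate(prompt, model="mistral", base_url="http://localhost:11434"):
    """POST to /api/generate with stream=True and yield fragments as they arrive."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": True}).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        yield from iter_stream_text(resp)


if __name__ == "__main__":
    try:
        for fragment in stream_generate("Summarize the benefits of local inference in one sentence."):
            print(fragment, end="", flush=True)
        print()
    except OSError as exc:  # connection errors and timeouts
        print(f"Could not reach Ollama: {exc}")
```

Keeping the line parser separate from the HTTP call makes it easy to unit-test the parsing without a running server.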
Advanced Usage Scenarios: How RAG and Agent Patterns Work Locally
Local LLMs not only serve for simple text generation but also form a powerful foundation for more advanced AI application architectures. Retrieval-Augmented Generation (RAG) and Agent Patterns are at the forefront of these advanced scenarios. On my anonymous data platform for Turkey, I set up a RAG system that allows users to make natural language queries about specific datasets. Because this system bases its responses on real data retrieved from a knowledge base, the LLM is far less likely to generate outdated or non-specific information.
The steps to set up a local RAG system are as follows:
- Data Preparation and Chunking: I divide my documents (PDFs, text files, database records) into small pieces (chunks).
- Generating Embeddings: I create a vector embedding for each chunk. Ollama also supports embedding models. For example, we can pull and use an embedding model with the ollama pull nomic-embed-text command.
- Vector Database: I store these embeddings in a vector database (e.g., ChromaDB, FAISS, or even PostgreSQL's pgvector extension). I usually prefer PostgreSQL because it's already installed and I have experience with replication.
- Querying and LLM Integration: When a user query arrives, I convert this query into an embedding and search for the most relevant chunks in the vector database. I then send the retrieved chunks, along with the original query, to the LLM in Ollama to get the final response.
# Conceptual Python structure for RAG
from ollama import Client  # Using the Ollama client library is more practical
# from chromadb import Client as ChromaClient  # Vector DB library
# from pgvector.psycopg2 import register_vector  # For PostgreSQL

# Function to retrieve relevant documents from the vector DB (example)
def retrieve_documents(query_embedding):
    # In a real system, this would return the documents closest to the
    # query embedding from your vector database (e.g., pgvector or ChromaDB).
    return [
        "According to the 2023 financial report, Company X's revenue is 1.2 billion TL.",
        "Company X's main product range is concentrated in categories Y and Z.",
    ]

def rag_query_ollama(user_query, client):
    # 1. Convert the user query to an embedding
    # Getting the embedding with Ollama (if an embedding model has been pulled):
    # response = client.embeddings(model='nomic-embed-text', prompt=user_query)
    # query_embedding = response['embedding']
    # For now, a simple placeholder embedding
    query_embedding = [0.1, 0.2, 0.3]  # In practice, the model's full embedding vector

    # 2. Retrieve relevant documents from the vector DB
    context_documents = retrieve_documents(query_embedding)

    # 3. Prepare the prompt, listing each retrieved document as its own bullet
    context_block = "\n".join(f"- {doc}" for doc in context_documents)
    full_prompt = (
        "Answer the question using the following context:\n\n"
        f"Context:\n{context_block}\n\n"
        f"Question: {user_query}\n"
        "Answer:"
    )

    # 4. Send to Ollama and get a response
    response = client.generate(model='mistral', prompt=full_prompt)
    return response['response']

if __name__ == "__main__":
    ollama_client = Client(host='http://localhost:11434')
    query = "What was Company X's 2023 revenue and what are its main products?"
    answer = rag_query_ollama(query, ollama_client)
    print(f"\nOllama RAG Answer: {answer}")
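The chunking and retrieval steps that the placeholder function above stands in for can be sketched without any external dependency. The version below is an in-memory illustration, assuming word-window chunking and cosine similarity; a real deployment would delegate the similarity search to pgvector or another vector store, and the function names and toy vectors here are hypothetical.

```python
import math


def chunk_text(text, max_words=50, overlap=10):
    """Split text into overlapping word windows (one simple chunking strategy)."""
    words = text.split()
    step = max_words - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than max_words")
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_chunks(query_embedding, indexed_chunks, k=2):
    """Return the k chunk texts most similar to the query.

    indexed_chunks is a list of (chunk_text, embedding) pairs.
    """
    ranked = sorted(
        indexed_chunks,
        key=lambda pair: cosine_similarity(query_embedding, pair[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional vectors stand in for real embeddings
    index = [("revenue report", [1.0, 0.0]), ("product list", [0.0, 1.0])]
    print(top_k_chunks([0.9, 0.1], index, k=1))
```

The overlap between windows is a design choice: it reduces the chance that a sentence relevant to the query is split across two chunks and missed by both.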
Agent patterns, on the other hand, are more dynamic systems where the LLM can use tools to perform specific tasks. For example, we can enable an LLM to query a database, send a request to an API, or perform operations on a file system. In my Android spam application, I experimented with a simple agent-like structure to analyze incoming SMS messages and automatically block them based on specific rules. The LLM would evaluate the SMS content and return a potential "spam" or "safe" label. In such systems, Linux service management settings such as the cgroup memory.high soft limit and journald rate limiting are vital for keeping resource consumption under control. Otherwise, critical processes could be OOM-killed in an instant.
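A minimal sketch of that agent-like loop is shown below. It is not the code from the Android application: the prompt wording, the function names, and the idea of passing the model call and the blocking action in as plain callables are all assumptions made to keep the example self-contained and testable without a running model.

```python
def build_prompt(sms_text):
    """Ask for a one-word label so the answer is easy to parse."""
    return (
        "Classify the following SMS as spam or safe. "
        "Reply with exactly one word: spam or safe.\n\n"
        f"SMS: {sms_text}\nLabel:"
    )


def parse_label(model_output):
    """Map free-form model output to 'spam', 'safe', or 'unknown'."""
    words = model_output.strip().lower().split()
    first = words[0].strip(".,!:;\"'") if words else ""
    return first if first in ("spam", "safe") else "unknown"


def handle_sms(sms_text, generate, block):
    """Classify one message and trigger the blocking tool when needed.

    generate: callable(prompt) -> str, e.g. a wrapper around the Ollama API
    block:    callable(sms_text) -> None, the action the agent may take
    """
    label = parse_label(generate(build_prompt(sms_text)))
    if label == "spam":
        block(sms_text)
    return label


if __name__ == "__main__":
    blocked = []
    # A fake model that always answers "spam", used here instead of a real LLM
    label = handle_sms("You won a prize, click here", lambda prompt: "Spam.", blocked.append)
    print(label, len(blocked))
```

Treating anything other than a clean "spam" or "safe" as "unknown" is deliberate: an unparseable answer should never trigger the blocking action.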
Beyond Being an Alternative to Cloud LLMs: What Else Can Be Done?
Running local LLMs with Ollama opens up entirely new possibilities, going beyond merely being an alternative to cloud solutions. One of the biggest advantages is, of course, data privacy and security. Your sensitive data is processed without leaving your company network. This is an indispensable feature, especially for regulated industries like healthcare, finance, or defense. It also offers offline capabilities. Your LLM continues to work even without an internet connection. This is a critical advantage for manufacturing plants in remote locations or field teams.
In my experience, local LLMs are particularly well-suited for custom finetuning and model experimentation. Finetuning a model in the cloud can be both costly and time-consuming. On my own server, I can test different model architectures and finetuned versions much faster and more economically. For example, I had to finetune the AI-powered production planning model in the ERP multiple times based on internal company data. Instead of paying thousands of dollars to cloud APIs for each attempt, I performed hundreds of experiments on my own GPU-equipped server and found the optimal model.
💡 Flexibility and Control
Local LLMs provide full control over the entire model lifecycle: model selection, quantization, finetuning, security policies, and deployment strategies. This is crucial for customized and sensitive applications.
Of course, local LLMs also have their limitations. The biggest disadvantage is scalability. To handle high-volume, concurrent requests, you might need multiple GPU-equipped servers or a cluster. In this case, container orchestration (I started with Docker Compose, but felt the need to switch to Kubernetes) comes into play. Additionally, model updates and maintenance are your responsibility. While cloud providers take this burden off you, locally you have to manage everything yourself. However, this also gives you flexibility. I can implement system security measures like kernel module blacklists or fail2ban patterns as I wish on my own servers.
In conclusion, local LLMs with Ollama offer a real alternative to cloud solutions in specific scenarios. Especially in situations requiring cost control, privacy, and customization, running an LLM on your own infrastructure is a logical and viable path.
| Feature | Local LLM (Ollama) | Cloud LLM (API) |
|---|---|---|
| Cost | High initial investment (hardware), low operational | Low initial investment, high operational (per token) |
| Data Privacy | Full control, data stays within the company | Third-party risks, dependent on privacy policies |
| Performance | Hardware-dependent, high performance potential | Generally high, scalable |
| Customization | Easy finetuning and model experimentation | Generally limited or costly |
| Dependency | Independent of external parties, works offline | Dependent on internet connection and provider |
| Management | Full management and maintenance responsibility | Managed by the provider |
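The cost row of this table can be turned into a rough break-even calculation. In the sketch below, the cloud figure reuses the $5,000 per month mentioned earlier in this post, while the hardware price and the local running cost are hypothetical placeholders; substitute your own quotes and electricity or maintenance estimates.

```python
def break_even_months(hardware_cost_usd, local_monthly_usd, cloud_monthly_usd):
    """Months until a one-time hardware purchase is paid back by monthly savings.

    Returns None when running locally is not cheaper per month.
    """
    savings = cloud_monthly_usd - local_monthly_usd
    if savings <= 0:
        return None
    return hardware_cost_usd / savings


if __name__ == "__main__":
    months = break_even_months(
        hardware_cost_usd=6_000,   # assumption: GPU server purchase price
        local_monthly_usd=200,     # assumption: power and maintenance
        cloud_monthly_usd=5_000,   # monthly cloud figure quoted in this post
    )
    print(f"Break-even after about {months:.2f} months")
```

The calculation leaves out engineering time for maintenance, which the management row of the table reminds us is not free.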
Conclusion
My experience running local LLMs with Ollama has provided me with significant advantages in terms of both cost control and data privacy. Especially when developing complex algorithms in a manufacturing ERP and performing financial calculations in my own side products, it allowed me to overcome the limitations imposed by cloud APIs. Building an AI infrastructure on my own hardware, under my own control, was not just a cost optimization but also a strategic step towards independence.
While local LLMs may not be suitable for every scenario, they are definitely an option to consider for projects that handle sensitive data, have budget constraints, or require offline capabilities. By following the steps in this guide, you can set up your own local LLM, integrate it into your applications, and gain control and flexibility beyond cloud solutions. Your next step could be to pull a nomic-embed-text model on your own server and set up a simple RAG system using pgvector.