Your LLM API bill just hit $5,000 this month. OpenAI went down twice last week. And your legal team is nervous about sending proprietary data to external servers.
Sound familiar? Here's how to take back control with local LLM deployment using Ollama and NeuroLink.
TL;DR
- Install Ollama: `brew install ollama` on macOS, or `curl -fsSL https://ollama.com/install.sh | sh` on Linux
- Run a model: `ollama run llama3.1:8b`
- Connect with NeuroLink: set `provider: "ollama"` in your config
- The payoff: full privacy, zero per-token API costs, and sub-100ms latency on capable hardware
Read on for the complete setup guide...
Table of Contents
- Why Run LLMs Locally?
- Setting Up Ollama
- Configuring NeuroLink for Ollama
- Model Selection Guide
- Performance Optimization
- Hybrid Cloud and Local Patterns
- Troubleshooting Common Issues
Why Run LLMs Locally?
The rise of capable open-source language models has fundamentally changed how developers approach AI integration. No longer are you locked into cloud-only solutions with their associated costs, latency, and privacy concerns.
Privacy and Data Sovereignty
When you run models locally, your data never leaves your infrastructure. This is critical for:
- Healthcare applications handling protected health information (PHI)
- Financial services processing sensitive customer data
- Legal tech working with privileged communications
- Enterprise applications dealing with proprietary business information
Cost Predictability
Cloud LLM APIs charge per token, which can lead to unpredictable costs as usage scales. Local deployment converts this variable cost into a fixed infrastructure investment. Once you have the hardware, your marginal cost per inference approaches zero.
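To make that trade-off concrete, here is a rough break-even sketch in TypeScript. The hardware price, power cost, and the `breakEvenMonths` helper are illustrative assumptions, not benchmarks — plug in your own numbers.

```typescript
// Months until a one-time hardware purchase pays for itself, given
// what you currently spend on a cloud API and an estimate of the
// electricity cost of running inference locally.
function breakEvenMonths(
  hardwareCostUsd: number,   // one-time GPU/server spend
  monthlyApiBillUsd: number, // current cloud API bill
  monthlyPowerUsd: number    // estimated local power cost
): number {
  const monthlySavings = monthlyApiBillUsd - monthlyPowerUsd;
  if (monthlySavings <= 0) return Infinity; // local never pays off
  return hardwareCostUsd / monthlySavings;
}

// e.g. a $2,400 GPU workstation vs. the $5,000/month bill from the
// intro, assuming ~$100/month in power:
console.log(breakEvenMonths(2400, 5000, 100).toFixed(1)); // "0.5"
```

At that spend level the hardware pays for itself in about two weeks; at a $200/month bill the same box takes roughly two years, which is why the calculation is worth doing before buying.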
Latency Optimization
Network round-trips to cloud providers introduce latency that can be unacceptable for real-time applications. Local inference eliminates this network overhead entirely. On properly configured hardware, you can achieve response times measured in milliseconds rather than seconds.
Offline Capability
Local models work without internet connectivity, enabling deployment in:
- Air-gapped environments
- Edge devices with intermittent connectivity
- Mobile applications requiring offline functionality
- Disaster recovery scenarios
Setting Up Ollama
Ollama has emerged as the leading solution for running LLMs locally. It provides a simple, Docker-like experience for model management.
Installation
macOS (Homebrew):
brew install ollama
Linux (Install Script):
curl -fsSL https://ollama.com/install.sh | sh
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Starting the Ollama Server
After installation, start the Ollama service:
ollama serve
On macOS and Windows, Ollama typically runs as a background service automatically. On Linux, you may want to configure it as a systemd service:
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
Enable and start the service:
sudo systemctl enable ollama
sudo systemctl start ollama
Pulling Your First Model
With Ollama running, pull a model to get started:
# Pull Llama 3.1 8B - great balance of capability and speed
ollama pull llama3.1:8b
# Pull Mistral 7B - excellent for general tasks
ollama pull mistral:7b
# Pull CodeLlama for programming tasks
ollama pull codellama:13b
Verify the model is available:
ollama list
Test it with a quick prompt:
ollama run llama3.1:8b "Explain quantum computing in simple terms"
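The CLI is a thin wrapper over Ollama's HTTP API on port 11434, which is what NeuroLink talks to in the next section. A minimal sketch of calling the documented `/api/generate` endpoint from TypeScript — the `generate` helper is ours, and actually running it requires a local server with the model pulled:

```typescript
// Ollama listens on http://localhost:11434 by default.
const OLLAMA_URL = "http://localhost:11434";

// Build the JSON body for POST /api/generate; stream: false asks for a
// single JSON response instead of a token stream.
function buildGenerateBody(model: string, prompt: string): string {
  return JSON.stringify({ model, prompt, stream: false });
}

// Illustrative helper -- needs a running Ollama server and a pulled model.
async function generate(model: string, prompt: string): Promise<string> {
  const res = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: "POST",
    body: buildGenerateBody(model, prompt),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}

// Example usage (uncomment with the server up):
// console.log(await generate("llama3.1:8b", "Explain quantum computing in simple terms"));
```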
Configuring NeuroLink for Ollama
NeuroLink's provider-agnostic architecture makes Ollama integration straightforward.
Basic Configuration
In your NeuroLink configuration file, add Ollama as a provider:
```yaml
# neurolink.config.yaml
providers:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    timeout: 120
    retry:
      max_attempts: 3
      backoff_multiplier: 2
```
Environment Variables
Alternatively, configure via environment variables:
export NEUROLINK_OLLAMA_BASE_URL=http://localhost:11434
export NEUROLINK_OLLAMA_DEFAULT_MODEL=llama3.1:8b
export NEUROLINK_OLLAMA_TIMEOUT=120
Programmatic Configuration
For more control, configure Ollama programmatically:
```typescript
import { NeuroLink } from "@juspay/neurolink";

// Initialize NeuroLink with Ollama
const nl = new NeuroLink({
  providers: [{
    name: "local",
    config: {
      baseUrl: "http://localhost:11434",
      defaultModel: "llama3.1:8b",
      timeout: 120,
      keepAlive: "5m" // Keep model loaded for 5 minutes
    }
  }]
});

// Use the local provider
const response = await nl.generate({
  input: { text: "Write a function to calculate fibonacci numbers" },
  provider: "local"
});
```
Multiple Model Configuration
Configure multiple Ollama models for different use cases:
```yaml
providers:
  ollama-fast:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
  ollama-code:
    type: ollama
    base_url: http://localhost:11434
    default_model: codellama:13b
  ollama-large:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:70b
```
Model Selection Guide
Choosing the right model for your use case is crucial for balancing capability with resource requirements.
General Purpose Models
| Model | Size | VRAM | Best For |
|---|---|---|---|
| Llama 3.1 8B | 4.7GB | 8GB min | General chat, summarization, simple reasoning |
| Llama 3.1 70B | 40GB | 48GB+ | Complex reasoning, nuanced tasks |
| Mistral 7B | 4.1GB | 6GB min | Quick tasks, high throughput |
Coding Models
| Model | Size | VRAM | Best For |
|---|---|---|---|
| CodeLlama 13B | 7.4GB | 12GB min | Code generation, debugging |
| DeepSeek Coder 33B | 19GB | 24GB min | Complex programming tasks |
Quantization Options
Ollama supports various quantization levels that trade quality for reduced resource requirements:
# Full precision (largest, highest quality)
ollama pull llama3.1:8b-instruct-fp16
# 8-bit quantization (good balance)
ollama pull llama3.1:8b-instruct-q8_0
# 4-bit quantization (smallest, slight quality reduction; the plain 8b tag already ships 4-bit)
ollama pull llama3.1:8b-instruct-q4_0
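A back-of-the-envelope way to pick a quantization level: weight-file size is roughly parameters times bits per weight, divided by eight. The helper below is a rule of thumb only — real downloads run somewhat larger because of embeddings and metadata.

```typescript
// Rough weight-file size: params (billions) x bits per weight / 8 = GB.
function approxModelSizeGB(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

console.log(approxModelSizeGB(8, 4));  // 4  -- close to the 4.7GB listed for llama3.1:8b
console.log(approxModelSizeGB(8, 16)); // 16 -- full fp16 needs serious VRAM
console.log(approxModelSizeGB(70, 4)); // 35 -- why the 70B model wants 48GB-class hardware
```

The same arithmetic explains the VRAM column in the tables above: the whole weight file has to fit in memory, plus headroom for the KV cache.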
Performance Optimization
Getting the best performance from local LLMs requires attention to hardware configuration and Ollama settings.
GPU Acceleration
For optimal performance, use a GPU with sufficient VRAM:
# Check if Ollama is using GPU
ollama ps
# Force CPU-only mode by hiding NVIDIA GPUs (if needed)
CUDA_VISIBLE_DEVICES="" ollama serve
Memory Management
Configure system resources appropriately:
# Limit how many models stay loaded at once
export OLLAMA_MAX_LOADED_MODELS=2
# Cap the context window (larger contexts use more memory); on versions
# without this variable, set num_ctx in a Modelfile instead
export OLLAMA_CONTEXT_LENGTH=4096
# List the variables your Ollama build supports
ollama serve --help
Custom Modelfile for Optimization
Create a custom Modelfile for optimized inference:
# Modelfile.optimized
FROM llama3.1:8b
# Increase context window
PARAMETER num_ctx 8192
# Optimize for speed
PARAMETER num_batch 512
PARAMETER num_thread 8
# Lower temperature slightly (Ollama's default is 0.8) for more consistent outputs
PARAMETER temperature 0.7
PARAMETER top_p 0.9
# System prompt for your use case
SYSTEM """You are a helpful assistant optimized for technical questions."""
Build and use the optimized model:
ollama create llama-optimized -f Modelfile.optimized
ollama run llama-optimized
Hybrid Cloud and Local Patterns
One of NeuroLink's most powerful features is the ability to seamlessly combine local and cloud providers.
Fallback Pattern
Use local inference by default, falling back to cloud when local resources are exhausted:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "local", config: { baseUrl: "http://localhost:11434" } },
    { name: "openai", config: { apiKey: process.env.OPENAI_API_KEY } },
    { name: "anthropic", config: { apiKey: process.env.ANTHROPIC_API_KEY } }
  ],
  failover: {
    enabled: true,
    primary: "local",
    fallbackProviders: ["openai", "anthropic"],
    triggerOn: ["timeout", "overload", "error"]
  }
});

// This will try local first, then cloud if needed
const response = await nl.generate({
  input: { text: "Complex analysis task..." },
  maxTokens: 2000
});
```
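Under the hood, a failover chain is just an ordered retry across providers. A minimal synchronous sketch of that control flow — provider names mirror the config above, real calls are async, and NeuroLink's internals may differ:

```typescript
// Try providers in order; any throw (timeout, overload, error) falls
// through to the next one. Synchronous to keep the control flow visible.
type Provider = { name: string; generate: (prompt: string) => string };

function generateWithFallback(providers: Provider[], prompt: string) {
  let lastError: unknown = new Error("no providers configured");
  for (const p of providers) {
    try {
      return { provider: p.name, text: p.generate(prompt) };
    } catch (err) {
      lastError = err; // record the failure and move on to the fallback
    }
  }
  throw lastError; // every provider failed
}

// Stub demo: "local" is down, so the request lands on "openai".
const chain: Provider[] = [
  { name: "local", generate: () => { throw new Error("ECONNREFUSED"); } },
  { name: "openai", generate: (p) => `cloud answer to: ${p}` },
];
console.log(generateWithFallback(chain, "hi").provider); // "openai"
```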
Task-Based Routing
Route requests to appropriate providers based on task characteristics:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "local", config: { baseUrl: "http://localhost:11434" } },
    { name: "anthropic", config: { apiKey: process.env.ANTHROPIC_API_KEY } }
  ],
  routing: {
    rules: [
      { taskType: "simple_qa", provider: "local", model: "llama3.1:8b" },
      { taskType: "code_generation", provider: "local", model: "codellama:13b" },
      { taskType: "complex_reasoning", provider: "anthropic", model: "claude-3-opus" }
    ]
  }
});

// Automatically routes to the matching provider
const response = await nl.generate({
  input: { text: "Write a sorting algorithm" },
  taskType: "code_generation"
});
```
Privacy-Preserving Routing
Automatically route sensitive data to local inference:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "ollama", config: { baseUrl: "http://localhost:11434" } },
    { name: "openai", config: { apiKey: process.env.OPENAI_API_KEY } }
  ],
  middleware: {
    guardrails: {
      piiDetection: {
        enabled: true,
        patterns: [
          { name: "ssn", regex: "\\b\\d{3}-\\d{2}-\\d{4}\\b" },
          { name: "email", regex: "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}" }
        ],
        sensitiveKeywords: ["confidential", "proprietary"],
        localProvider: "ollama",
        cloudProvider: "openai"
      }
    }
  }
});

// If sensitive data is detected, the request stays on local Ollama
const response = await nl.generate({
  input: { text: "Analyze this customer data: SSN 123-45-6789..." }
});
```
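The guardrail logic reduces to: scan the prompt, and pin anything sensitive to the local provider. A standalone sketch using the same patterns as the config above — `pickProvider` is our illustrative helper, not a NeuroLink API:

```typescript
// Patterns mirror the guardrail config: SSN, email, sensitive keywords.
const piiPatterns: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/,                          // US Social Security number
  /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, // email address
];
const sensitiveKeywords = ["confidential", "proprietary"];

function pickProvider(prompt: string): "ollama" | "openai" {
  const lower = prompt.toLowerCase();
  const sensitive =
    piiPatterns.some((re) => re.test(prompt)) ||
    sensitiveKeywords.some((k) => lower.includes(k));
  return sensitive ? "ollama" : "openai"; // sensitive data stays local
}

console.log(pickProvider("Analyze this customer data: SSN 123-45-6789")); // "ollama"
console.log(pickProvider("What is the capital of France?"));              // "openai"
```

Regex-based detection is a coarse filter — it catches formatted identifiers, not free-text PII — so treat it as a floor, not a guarantee, and default ambiguous traffic to local.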
Troubleshooting Common Issues
Model Loading Failures
Symptom: "Error: model not found" or slow initial response
Solutions:
# Verify model is downloaded
ollama list
# Re-pull if corrupted
ollama rm llama3.1:8b
ollama pull llama3.1:8b
# Check disk space
df -h ~/.ollama
Out of Memory Errors
Symptom: "CUDA out of memory" or system freeze
Solutions:
# Use smaller model
ollama pull llama3.1:8b-q4_0
# Reduce context window
export OLLAMA_CONTEXT_LENGTH=2048
# Keep fewer models resident in VRAM
export OLLAMA_MAX_LOADED_MODELS=1
Slow Inference
Symptom: Response times exceeding expectations
Solutions:
# Verify GPU is being used
ollama ps
# Check for thermal throttling
nvidia-smi -l 1
# Increase batch size for throughput
# In Modelfile:
PARAMETER num_batch 1024
Connection Refused
Symptom: NeuroLink cannot connect to Ollama
Solutions:
# Verify Ollama is running
curl http://localhost:11434/api/tags
# Check firewall settings
sudo ufw allow 11434/tcp
# Restart Ollama service
sudo systemctl restart ollama
Conclusion
Running local LLMs with Ollama and NeuroLink provides a powerful, flexible, and privacy-preserving approach to AI integration. By following this guide, you've learned how to:
- Set up and configure Ollama for local inference
- Integrate Ollama with NeuroLink for seamless model access
- Select appropriate models for your use cases
- Optimize performance for your hardware
- Implement hybrid cloud and local patterns
- Troubleshoot common deployment issues
The combination of local and cloud inference gives you unprecedented flexibility in how you deploy AI capabilities. Start with local models for development and privacy-sensitive tasks, scale to cloud providers when you need additional capacity or capabilities, and let NeuroLink handle the complexity of managing multiple providers.
Found this helpful? Drop a comment below with your questions or share your experience with local LLMs!
Want to try NeuroLink?
- Website: neurolink.ink
- GitHub: github.com/juspay/neurolink
- Documentation: docs.neurolink.ink
- Star the repo if you find it useful!
Follow us for more AI development content:
- Dev.to: @neurolink
- Twitter: @Neurolink__