NeuroLink AI

Posted on • Originally published at blog.neurolink.ink
Running Local LLMs with NeuroLink and Ollama: Complete Guide

Your LLM API bill just hit $5,000 this month. OpenAI went down twice last week. And your legal team is nervous about sending proprietary data to external servers.

Sound familiar? Here's how to take back control with local LLM deployment using Ollama and NeuroLink.

TL;DR

  • Install Ollama: brew install ollama on macOS, or curl -fsSL https://ollama.com/install.sh | sh on Linux
  • Run a model: ollama run llama3.1:8b
  • Connect with NeuroLink: provider: "ollama" in config
  • Full privacy, no per-token API costs, sub-100ms first-token latency on capable hardware

Read on for the complete setup guide...


Why Run LLMs Locally?

The rise of capable open-source language models has fundamentally changed how developers approach AI integration. No longer are you locked into cloud-only solutions with their associated costs, latency, and privacy concerns.

Privacy and Data Sovereignty

When you run models locally, your data never leaves your infrastructure. This is critical for:

  • Healthcare applications handling protected health information (PHI)
  • Financial services processing sensitive customer data
  • Legal tech working with privileged communications
  • Enterprise applications dealing with proprietary business information

Cost Predictability

Cloud LLM APIs charge per token, which can lead to unpredictable costs as usage scales. Local deployment converts this variable cost into a fixed infrastructure investment. Once you have the hardware, your marginal cost per inference approaches zero.
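To make that trade-off concrete, here is a back-of-the-envelope break-even sketch. The hardware price is an illustrative assumption; the monthly bill is the figure from the intro:

```typescript
// Back-of-the-envelope break-even for local vs. cloud inference.
// The $10,000 hardware cost is an illustrative assumption; the
// $5,000/month API bill is the figure from the intro above.
function breakEvenMonths(hardwareCost: number, monthlyApiBill: number): number {
  return Math.ceil(hardwareCost / monthlyApiBill);
}

console.log(breakEvenMonths(10_000, 5_000)); // prints 2
```

Past the break-even point, marginal cost is dominated by power and maintenance rather than per-token fees.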

Latency Optimization

Network round-trips to cloud providers introduce latency that can be unacceptable for real-time applications. Local inference eliminates this network overhead entirely. On properly configured hardware, first-token response times are measured in milliseconds rather than seconds.

Offline Capability

Local models work without internet connectivity, enabling deployment in:

  • Air-gapped environments
  • Edge devices with intermittent connectivity
  • Mobile applications requiring offline functionality
  • Disaster recovery scenarios

Setting Up Ollama

Ollama has emerged as the leading solution for running LLMs locally. It provides a simple, Docker-like experience for model management.

Installation

macOS (Homebrew):

brew install ollama

Linux (Install Script):

curl -fsSL https://ollama.com/install.sh | sh

Docker:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Starting the Ollama Server

After installation, start the Ollama service:

ollama serve

On macOS and Windows, Ollama typically runs as a background service automatically. On Linux, you may want to configure it as a systemd service:

# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target

Enable and start the service:

sudo systemctl enable ollama
sudo systemctl start ollama

Pulling Your First Model

With Ollama running, pull a model to get started:

# Pull Llama 3.1 8B - great balance of capability and speed
ollama pull llama3.1:8b

# Pull Mistral 7B - excellent for general tasks
ollama pull mistral:7b

# Pull CodeLlama for programming tasks
ollama pull codellama:13b

Verify the model is available:

ollama list

Test it with a quick prompt:

ollama run llama3.1:8b "Explain quantum computing in simple terms"

Configuring NeuroLink for Ollama

NeuroLink's provider-agnostic architecture makes Ollama integration straightforward.

Basic Configuration

In your NeuroLink configuration file, add Ollama as a provider:

# neurolink.config.yaml
providers:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    timeout: 120
    retry:
      max_attempts: 3
      backoff_multiplier: 2
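To illustrate what the retry block above implies, here is a small sketch of the delay schedule produced by max_attempts: 3 and backoff_multiplier: 2. The 1-second base delay is our assumption for illustration; check your NeuroLink version for the actual default:

```typescript
// Delay schedule implied by the retry settings above.
// The 1-second base delay is an assumption, not a documented default.
function backoffSchedule(
  maxAttempts: number,
  multiplier: number,
  baseSeconds = 1,
): number[] {
  return Array.from({ length: maxAttempts }, (_, i) => baseSeconds * multiplier ** i);
}

console.log(backoffSchedule(3, 2)); // delays of 1s, 2s, 4s between attempts
```

With cold model loads, the first attempt can take longest, which is why a generous timeout matters more than aggressive retries for local providers.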

Environment Variables

Alternatively, configure via environment variables:

export NEUROLINK_OLLAMA_BASE_URL=http://localhost:11434
export NEUROLINK_OLLAMA_DEFAULT_MODEL=llama3.1:8b
export NEUROLINK_OLLAMA_TIMEOUT=120

Programmatic Configuration

For more control, configure Ollama programmatically:

import { NeuroLink } from "@juspay/neurolink";

// Initialize NeuroLink with Ollama
const nl = new NeuroLink({
  providers: [{
    name: "local",
    config: {
      baseUrl: "http://localhost:11434",
      defaultModel: "llama3.1:8b",
      timeout: 120,
      keepAlive: "5m"  // Keep model loaded for 5 minutes
    }
  }]
});

// Use the local provider
const response = await nl.generate({
  input: { text: "Write a function to calculate fibonacci numbers" },
  provider: "local"
});

Multiple Model Configuration

Configure multiple Ollama models for different use cases:

providers:
  ollama-fast:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b

  ollama-code:
    type: ollama
    base_url: http://localhost:11434
    default_model: codellama:13b

  ollama-large:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:70b

Model Selection Guide

Choosing the right model for your use case is crucial for balancing capability with resource requirements.

General Purpose Models

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| Llama 3.1 8B | 4.7GB | 8GB min | General chat, summarization, simple reasoning |
| Llama 3.1 70B | 40GB | 48GB+ | Complex reasoning, nuanced tasks |
| Mistral 7B | 4.1GB | 6GB min | Quick tasks, high throughput |

Coding Models

| Model | Size | VRAM | Best For |
| --- | --- | --- | --- |
| CodeLlama 13B | 7.4GB | 12GB min | Code generation, debugging |
| DeepSeek Coder 33B | 19GB | 24GB min | Complex programming tasks |

Quantization Options

Ollama supports various quantization levels that trade quality for reduced resource requirements:

# Full precision (largest, highest quality)
ollama pull llama3.1:8b-fp16

# 8-bit quantization (good balance)
ollama pull llama3.1:8b-q8_0

# 4-bit quantization (smallest, slight quality reduction)
ollama pull llama3.1:8b-q4_0
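A quick way to estimate what each quantization level costs in memory: weight size is roughly parameter count times bits per weight. The sketch below ignores the KV cache, activations, and file metadata, so treat its numbers as lower bounds:

```typescript
// Rough weight-memory estimate: params × bits-per-weight, in GiB.
// Ignores KV cache, activations, and metadata, so real usage is higher.
function weightsGiB(params: number, bitsPerWeight: number): number {
  return (params * bitsPerWeight) / 8 / 1024 ** 3;
}

const llama8b = 8e9; // Llama 3.1 8B parameter count
console.log(weightsGiB(llama8b, 16).toFixed(1)); // fp16 ≈ 14.9 GiB
console.log(weightsGiB(llama8b, 8).toFixed(1));  // q8_0 ≈ 7.5 GiB
console.log(weightsGiB(llama8b, 4).toFixed(1));  // q4_0 ≈ 3.7 GiB
```

These estimates line up with the published model sizes above: q4 variants fit comfortably on 8GB GPUs where fp16 would not.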

Performance Optimization

Getting the best performance from local LLMs requires attention to hardware configuration and Ollama settings.

GPU Acceleration

For optimal performance, use a GPU with sufficient VRAM:

# Check if Ollama is using GPU
ollama ps

# Force CPU-only mode (if needed)
OLLAMA_GPU_LAYERS=0 ollama serve

Memory Management

Configure system resources appropriately (environment variable support varies between Ollama versions; run ollama serve --help to see which settings your build recognizes):

# Set maximum loaded models
export OLLAMA_MAX_LOADED_MODELS=2

# Set VRAM limit (in bytes)
export OLLAMA_GPU_MEMORY=8589934592  # 8GB

# Configure context window (affects memory usage)
export OLLAMA_NUM_CTX=4096

Custom Modelfile for Optimization

Create a custom Modelfile for optimized inference:

# Modelfile.optimized
FROM llama3.1:8b

# Increase context window
PARAMETER num_ctx 8192

# Optimize for speed
PARAMETER num_batch 512
PARAMETER num_thread 8

# Reduce temperature for more deterministic outputs
PARAMETER temperature 0.7
PARAMETER top_p 0.9

# System prompt for your use case
SYSTEM """You are a helpful assistant optimized for technical questions."""

Build and use the optimized model:

ollama create llama-optimized -f Modelfile.optimized
ollama run llama-optimized

Hybrid Cloud and Local Patterns

One of NeuroLink's most powerful features is the ability to seamlessly combine local and cloud providers.

Fallback Pattern

Use local inference by default, falling back to cloud when local resources are exhausted:

import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "local", config: { baseUrl: "http://localhost:11434" } },
    { name: "openai", config: { apiKey: process.env.OPENAI_API_KEY } },
    { name: "anthropic", config: { apiKey: process.env.ANTHROPIC_API_KEY } }
  ],
  failover: {
    enabled: true,
    primary: "local",
    fallbackProviders: ["openai", "anthropic"],
    triggerOn: ["timeout", "overload", "error"]
  }
});

// This will try local first, then cloud if needed
const response = await nl.generate({
  input: { text: "Complex analysis task..." },
  maxTokens: 2000
});

Task-Based Routing

Route requests to appropriate providers based on task characteristics:

import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "local", config: { baseUrl: "http://localhost:11434" } },
    { name: "anthropic", config: { apiKey: process.env.ANTHROPIC_API_KEY } }
  ],
  routing: {
    rules: [
      {
        taskType: "simple_qa",
        provider: "local",
        model: "llama3.1:8b"
      },
      {
        taskType: "code_generation",
        provider: "local",
        model: "codellama:13b"
      },
      {
        taskType: "complex_reasoning",
        provider: "anthropic",
        model: "claude-3-opus"
      }
    ]
  }
});

// Automatically routes to appropriate provider
const response = await nl.generate({
  input: { text: "Write a sorting algorithm" },
  taskType: "code_generation"
});

Privacy-Preserving Routing

Automatically route sensitive data to local inference:

import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "ollama", config: { baseUrl: "http://localhost:11434" } },
    { name: "openai", config: { apiKey: process.env.OPENAI_API_KEY } }
  ],
  middleware: {
    guardrails: {
      piiDetection: {
        enabled: true,
        patterns: [
          { name: "ssn", regex: "\\b\\d{3}-\\d{2}-\\d{4}\\b" },
          { name: "email", regex: "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}" }
        ],
        sensitiveKeywords: ["confidential", "proprietary"],
        localProvider: "ollama",
        cloudProvider: "openai"
      }
    }
  }
});

// Automatically routes to local if sensitive data detected
const response = await nl.generate({
  input: { text: "Analyze this customer data: SSN 123-45-6789..." }
  // Routes to local Ollama automatically
});
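The routing decision in the config above can be sketched as a standalone function. pickProvider is our illustration of the logic, not part of the NeuroLink API, but it uses the same patterns and keywords as the config:

```typescript
// Standalone sketch of the PII-based routing decision configured above.
// `pickProvider` is our illustration, not part of the NeuroLink API.
const patterns = [
  { name: "ssn", regex: /\b\d{3}-\d{2}-\d{4}\b/ },
  { name: "email", regex: /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/ },
];
const sensitiveKeywords = ["confidential", "proprietary"];

function pickProvider(text: string): "ollama" | "openai" {
  const hasPii = patterns.some((p) => p.regex.test(text));
  const hasKeyword = sensitiveKeywords.some((k) => text.toLowerCase().includes(k));
  return hasPii || hasKeyword ? "ollama" : "openai";
}

console.log(pickProvider("Analyze this customer data: SSN 123-45-6789")); // ollama
console.log(pickProvider("Summarize this public press release"));         // openai
```

Regex-based detection is a first line of defense, not a guarantee; for regulated workloads, defaulting sensitive categories of traffic to the local provider is safer than relying on pattern matching alone.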

Troubleshooting Common Issues

Model Loading Failures

Symptom: "Error: model not found" or slow initial response

Solutions:

# Verify model is downloaded
ollama list

# Re-pull if corrupted
ollama rm llama3.1:8b
ollama pull llama3.1:8b

# Check disk space
df -h ~/.ollama

Out of Memory Errors

Symptom: "CUDA out of memory" or system freeze

Solutions:

# Use smaller model
ollama pull llama3.1:8b-q4_0

# Reduce context window
export OLLAMA_NUM_CTX=2048

# Limit GPU memory
export OLLAMA_GPU_MEMORY=6442450944  # 6GB

Slow Inference

Symptom: Responses noticeably slower than expected

Solutions:

# Verify GPU is being used
ollama ps

# Check for thermal throttling
nvidia-smi -l 1

# Increase batch size for throughput
# In Modelfile:
PARAMETER num_batch 1024

Connection Refused

Symptom: NeuroLink cannot connect to Ollama

Solutions:

# Verify Ollama is running
curl http://localhost:11434/api/tags

# Check firewall settings
sudo ufw allow 11434/tcp

# Restart Ollama service
sudo systemctl restart ollama
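One way to fail fast in application code is a preflight check against Ollama's /api/tags endpoint (the same one queried with curl above). The ollamaReachable helper is ours, not a NeuroLink API; the fetch function is injectable so the check can be tested without a live server:

```typescript
// Preflight check: confirm the Ollama server answers on /api/tags before
// initializing providers, so you get a clear error instead of ECONNREFUSED.
// `ollamaReachable` is our helper, not part of the NeuroLink API.
async function ollamaReachable(
  baseUrl: string,
  fetchFn: typeof fetch = fetch, // injectable for testing
): Promise<boolean> {
  try {
    const res = await fetchFn(`${baseUrl}/api/tags`);
    return res.ok;
  } catch {
    return false; // connection refused, DNS failure, etc.
  }
}

// Usage sketch:
// if (!(await ollamaReachable("http://localhost:11434"))) {
//   throw new Error("Ollama is not running — start it with `ollama serve`");
// }
```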

Conclusion

Running local LLMs with Ollama and NeuroLink provides a powerful, flexible, and privacy-preserving approach to AI integration. By following this guide, you've learned how to:

  1. Set up and configure Ollama for local inference
  2. Integrate Ollama with NeuroLink for seamless model access
  3. Select appropriate models for your use cases
  4. Optimize performance for your hardware
  5. Implement hybrid cloud and local patterns
  6. Troubleshoot common deployment issues

The combination of local and cloud inference gives you unprecedented flexibility in how you deploy AI capabilities. Start with local models for development and privacy-sensitive tasks, scale to cloud providers when you need additional capacity or capabilities, and let NeuroLink handle the complexity of managing multiple providers.


Found this helpful? Drop a comment below with your questions or share your experience with local LLMs!
