Your LLM API bill just hit $5,000 this month. OpenAI went down twice last week. And your legal team is nervous about sending proprietary data to external servers.
Sound familiar? Here's how to take back control with local LLM deployment using Ollama and NeuroLink.
TL;DR
- Install Ollama: `brew install ollama` on macOS, or `curl -fsSL https://ollama.com/install.sh | sh` on Linux
- Run a model: `ollama run llama3.1:8b`
- Connect with NeuroLink: set `provider: "ollama"` in your config
- The payoff: full privacy, zero per-token API costs, and sub-100ms latency on capable hardware
Read on for the complete setup guide...
Table of Contents
- Why Run LLMs Locally?
- Setting Up Ollama
- Configuring NeuroLink for Ollama
- Model Selection Guide
- Performance Optimization
- Hybrid Cloud and Local Patterns
- Troubleshooting Common Issues
Why Run LLMs Locally?
The rise of capable open-source language models has fundamentally changed how developers approach AI integration. No longer are you locked into cloud-only solutions with their associated costs, latency, and privacy concerns.
Privacy and Data Sovereignty
When you run models locally, your data never leaves your infrastructure. This is critical for:
- Healthcare applications handling protected health information (PHI)
- Financial services processing sensitive customer data
- Legal tech working with privileged communications
- Enterprise applications dealing with proprietary business information
Cost Predictability
Cloud LLM APIs charge per token, which can lead to unpredictable costs as usage scales. Local deployment converts this variable cost into a fixed infrastructure investment. Once you have the hardware, your marginal cost per inference approaches zero.
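To make that trade-off concrete, here is a rough break-even sketch in TypeScript. The hardware price, power cost, and the `breakEvenMonths` helper are illustrative assumptions, not benchmarks — plug in your own numbers.

```typescript
// Months until a one-time hardware purchase pays for itself, given
// what you currently spend on a cloud API and an estimate of the
// electricity cost of running inference locally.
function breakEvenMonths(
  hardwareCostUsd: number,   // one-time GPU/server spend
  monthlyApiBillUsd: number, // current cloud API bill
  monthlyPowerUsd: number    // estimated local power cost
): number {
  const monthlySavings = monthlyApiBillUsd - monthlyPowerUsd;
  if (monthlySavings <= 0) return Infinity; // local never pays off
  return hardwareCostUsd / monthlySavings;
}

// e.g. a $2,400 GPU workstation vs. the $5,000/month bill from the
// intro, assuming ~$100/month in power:
console.log(breakEvenMonths(2400, 5000, 100).toFixed(1)); // "0.5"
```

At that spend level the hardware pays for itself in about two weeks; at a $200/month bill the same box takes roughly two years, which is why the calculation is worth doing before buying.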
Latency Optimization
Network round-trips to cloud providers introduce latency that can be unacceptable for real-time applications. Local inference eliminates this network overhead entirely. On properly configured hardware, you can achieve response times measured in milliseconds rather than seconds.
Offline Capability
Local models work without internet connectivity, enabling deployment in:
- Air-gapped environments
- Edge devices with intermittent connectivity
- Mobile applications requiring offline functionality
- Disaster recovery scenarios
Setting Up Ollama
Ollama has emerged as the leading solution for running LLMs locally. It provides a simple, Docker-like experience for model management.
Installation
macOS (Homebrew):
brew install ollama
Linux (Install Script):
curl -fsSL https://ollama.com/install.sh | sh
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Starting the Ollama Server
After installation, start the Ollama service:
ollama serve
On macOS and Windows, Ollama typically runs as a background service automatically. On Linux, you may want to configure it as a systemd service:
# /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
[Install]
WantedBy=default.target
Enable and start the service:
sudo systemctl enable ollama
sudo systemctl start ollama
Pulling Your First Model
With Ollama running, pull a model to get started:
# Pull Llama 3.1 8B - great balance of capability and speed
ollama pull llama3.1:8b
# Pull Mistral 7B - excellent for general tasks
ollama pull mistral:7b
# Pull CodeLlama for programming tasks
ollama pull codellama:13b
Verify the model is available:
ollama list
Test it with a quick prompt:
ollama run llama3.1:8b "Explain quantum computing in simple terms"
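The CLI is a thin wrapper over Ollama's HTTP API on port 11434, which is what NeuroLink talks to in the next section. A minimal sketch of calling the documented `/api/generate` endpoint from TypeScript — the `generate` helper is ours, and actually running it requires a local server with the model pulled:

```typescript
// Ollama listens on http://localhost:11434 by default.
const OLLAMA_URL = "http://localhost:11434";

// Build the JSON body for POST /api/generate; stream: false asks for a
// single JSON response instead of a token stream.
function buildGenerateBody(model: string, prompt: string): string {
  return JSON.stringify({ model, prompt, stream: false });
}

// Illustrative helper -- needs a running Ollama server and a pulled model.
async function generate(model: string, prompt: string): Promise<string> {
  const res = await fetch(`${OLLAMA_URL}/api/generate`, {
    method: "POST",
    body: buildGenerateBody(model, prompt),
  });
  const data = (await res.json()) as { response: string };
  return data.response;
}

// Example usage (uncomment with the server up):
// console.log(await generate("llama3.1:8b", "Explain quantum computing in simple terms"));
```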
Configuring NeuroLink for Ollama
NeuroLink's provider-agnostic architecture makes Ollama integration straightforward.
Basic Configuration
In your NeuroLink configuration file, add Ollama as a provider:
```yaml
# neurolink.config.yaml
providers:
  ollama:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
    timeout: 120
    retry:
      max_attempts: 3
      backoff_multiplier: 2
```
Environment Variables
Alternatively, configure via environment variables:
export NEUROLINK_OLLAMA_BASE_URL=http://localhost:11434
export NEUROLINK_OLLAMA_DEFAULT_MODEL=llama3.1:8b
export NEUROLINK_OLLAMA_TIMEOUT=120
Programmatic Configuration
For more control, configure Ollama programmatically:
```typescript
import { NeuroLink } from "@juspay/neurolink";

// Initialize NeuroLink with Ollama
const nl = new NeuroLink({
  providers: [{
    name: "local",
    config: {
      baseUrl: "http://localhost:11434",
      defaultModel: "llama3.1:8b",
      timeout: 120,
      keepAlive: "5m" // Keep model loaded for 5 minutes
    }
  }]
});

// Use the local provider
const response = await nl.generate({
  input: { text: "Write a function to calculate fibonacci numbers" },
  provider: "local"
});
```
Multiple Model Configuration
Configure multiple Ollama models for different use cases:
```yaml
providers:
  ollama-fast:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:8b
  ollama-code:
    type: ollama
    base_url: http://localhost:11434
    default_model: codellama:13b
  ollama-large:
    type: ollama
    base_url: http://localhost:11434
    default_model: llama3.1:70b
```
Model Selection Guide
Choosing the right model for your use case is crucial for balancing capability with resource requirements.
General Purpose Models
| Model | Size | VRAM | Best For |
|---|---|---|---|
| Llama 3.1 8B | 4.7GB | 8GB min | General chat, summarization, simple reasoning |
| Llama 3.1 70B | 40GB | 48GB+ | Complex reasoning, nuanced tasks |
| Mistral 7B | 4.1GB | 6GB min | Quick tasks, high throughput |
Coding Models
| Model | Size | VRAM | Best For |
|---|---|---|---|
| CodeLlama 13B | 7.4GB | 12GB min | Code generation, debugging |
| DeepSeek Coder 33B | 19GB | 24GB min | Complex programming tasks |
Quantization Options
Ollama supports various quantization levels that trade quality for reduced resource requirements:
# Full precision (largest, highest quality)
ollama pull llama3.1:8b-instruct-fp16
# 8-bit quantization (good balance)
ollama pull llama3.1:8b-instruct-q8_0
# 4-bit quantization (smallest, slight quality reduction; the plain 8b tag already ships 4-bit)
ollama pull llama3.1:8b-instruct-q4_0
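A back-of-the-envelope way to pick a quantization level: weight-file size is roughly parameters times bits per weight, divided by eight. The helper below is a rule of thumb only — real downloads run somewhat larger because of embeddings and metadata.

```typescript
// Rough weight-file size: params (billions) x bits per weight / 8 = GB.
function approxModelSizeGB(paramsBillions: number, bitsPerWeight: number): number {
  return (paramsBillions * bitsPerWeight) / 8;
}

console.log(approxModelSizeGB(8, 4));  // 4  -- close to the 4.7GB listed for llama3.1:8b
console.log(approxModelSizeGB(8, 16)); // 16 -- full fp16 needs serious VRAM
console.log(approxModelSizeGB(70, 4)); // 35 -- why the 70B model wants 48GB-class hardware
```

The same arithmetic explains the VRAM column in the tables above: the whole weight file has to fit in memory, plus headroom for the KV cache.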
Performance Optimization
Getting the best performance from local LLMs requires attention to hardware configuration and Ollama settings.
GPU Acceleration
For optimal performance, use a GPU with sufficient VRAM:
# Check if Ollama is using GPU
ollama ps
# Force CPU-only mode by hiding NVIDIA GPUs (if needed)
CUDA_VISIBLE_DEVICES="" ollama serve
Memory Management
Configure system resources appropriately:
# Limit how many models stay loaded at once
export OLLAMA_MAX_LOADED_MODELS=2
# Cap the context window (larger contexts use more memory); on versions
# without this variable, set num_ctx in a Modelfile instead
export OLLAMA_CONTEXT_LENGTH=4096
# List the variables your Ollama build supports
ollama serve --help
Custom Modelfile for Optimization
Create a custom Modelfile for optimized inference:
# Modelfile.optimized
FROM llama3.1:8b
# Increase context window
PARAMETER num_ctx 8192
# Optimize for speed
PARAMETER num_batch 512
PARAMETER num_thread 8
# Lower temperature slightly (Ollama's default is 0.8) for more consistent outputs
PARAMETER temperature 0.7
PARAMETER top_p 0.9
# System prompt for your use case
SYSTEM """You are a helpful assistant optimized for technical questions."""
Build and use the optimized model:
ollama create llama-optimized -f Modelfile.optimized
ollama run llama-optimized
Hybrid Cloud and Local Patterns
One of NeuroLink's most powerful features is the ability to seamlessly combine local and cloud providers.
Fallback Pattern
Use local inference by default, falling back to cloud when local resources are exhausted:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "local", config: { baseUrl: "http://localhost:11434" } },
    { name: "openai", config: { apiKey: process.env.OPENAI_API_KEY } },
    { name: "anthropic", config: { apiKey: process.env.ANTHROPIC_API_KEY } }
  ],
  failover: {
    enabled: true,
    primary: "local",
    fallbackProviders: ["openai", "anthropic"],
    triggerOn: ["timeout", "overload", "error"]
  }
});

// This will try local first, then cloud if needed
const response = await nl.generate({
  input: { text: "Complex analysis task..." },
  maxTokens: 2000
});
```
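Under the hood, a failover chain is just an ordered retry across providers. A minimal synchronous sketch of that control flow — provider names mirror the config above, real calls are async, and NeuroLink's internals may differ:

```typescript
// Try providers in order; any throw (timeout, overload, error) falls
// through to the next one. Synchronous to keep the control flow visible.
type Provider = { name: string; generate: (prompt: string) => string };

function generateWithFallback(providers: Provider[], prompt: string) {
  let lastError: unknown = new Error("no providers configured");
  for (const p of providers) {
    try {
      return { provider: p.name, text: p.generate(prompt) };
    } catch (err) {
      lastError = err; // record the failure and move on to the fallback
    }
  }
  throw lastError; // every provider failed
}

// Stub demo: "local" is down, so the request lands on "openai".
const chain: Provider[] = [
  { name: "local", generate: () => { throw new Error("ECONNREFUSED"); } },
  { name: "openai", generate: (p) => `cloud answer to: ${p}` },
];
console.log(generateWithFallback(chain, "hi").provider); // "openai"
```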
Task-Based Routing
Route requests to appropriate providers based on task characteristics:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "local", config: { baseUrl: "http://localhost:11434" } },
    { name: "anthropic", config: { apiKey: process.env.ANTHROPIC_API_KEY } }
  ],
  routing: {
    rules: [
      { taskType: "simple_qa", provider: "local", model: "llama3.1:8b" },
      { taskType: "code_generation", provider: "local", model: "codellama:13b" },
      { taskType: "complex_reasoning", provider: "anthropic", model: "claude-3-opus" }
    ]
  }
});

// Automatically routes to the matching provider
const response = await nl.generate({
  input: { text: "Write a sorting algorithm" },
  taskType: "code_generation"
});
```
Privacy-Preserving Routing
Automatically route sensitive data to local inference:
```typescript
import { NeuroLink } from "@juspay/neurolink";

const nl = new NeuroLink({
  providers: [
    { name: "ollama", config: { baseUrl: "http://localhost:11434" } },
    { name: "openai", config: { apiKey: process.env.OPENAI_API_KEY } }
  ],
  middleware: {
    guardrails: {
      piiDetection: {
        enabled: true,
        patterns: [
          { name: "ssn", regex: "\\b\\d{3}-\\d{2}-\\d{4}\\b" },
          { name: "email", regex: "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}" }
        ],
        sensitiveKeywords: ["confidential", "proprietary"],
        localProvider: "ollama",
        cloudProvider: "openai"
      }
    }
  }
});

// If sensitive data is detected, the request stays on local Ollama
const response = await nl.generate({
  input: { text: "Analyze this customer data: SSN 123-45-6789..." }
});
```
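The guardrail logic reduces to: scan the prompt, and pin anything sensitive to the local provider. A standalone sketch using the same patterns as the config above — `pickProvider` is our illustrative helper, not a NeuroLink API:

```typescript
// Patterns mirror the guardrail config: SSN, email, sensitive keywords.
const piiPatterns: RegExp[] = [
  /\b\d{3}-\d{2}-\d{4}\b/,                          // US Social Security number
  /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/, // email address
];
const sensitiveKeywords = ["confidential", "proprietary"];

function pickProvider(prompt: string): "ollama" | "openai" {
  const lower = prompt.toLowerCase();
  const sensitive =
    piiPatterns.some((re) => re.test(prompt)) ||
    sensitiveKeywords.some((k) => lower.includes(k));
  return sensitive ? "ollama" : "openai"; // sensitive data stays local
}

console.log(pickProvider("Analyze this customer data: SSN 123-45-6789")); // "ollama"
console.log(pickProvider("What is the capital of France?"));              // "openai"
```

Regex-based detection is a coarse filter — it catches formatted identifiers, not free-text PII — so treat it as a floor, not a guarantee, and default ambiguous traffic to local.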
Troubleshooting Common Issues
Model Loading Failures
Symptom: "Error: model not found" or slow initial response
Solutions:
# Verify model is downloaded
ollama list
# Re-pull if corrupted
ollama rm llama3.1:8b
ollama pull llama3.1:8b
# Check disk space
df -h ~/.ollama
Out of Memory Errors
Symptom: "CUDA out of memory" or system freeze
Solutions:
# Use smaller model
ollama pull llama3.1:8b-q4_0
# Reduce context window
export OLLAMA_CONTEXT_LENGTH=2048
# Keep fewer models resident in VRAM
export OLLAMA_MAX_LOADED_MODELS=1
Slow Inference
Symptom: Response times exceeding expectations
Solutions:
# Verify GPU is being used
ollama ps
# Check for thermal throttling
nvidia-smi -l 1
# Increase batch size for throughput
# In Modelfile:
PARAMETER num_batch 1024
Connection Refused
Symptom: NeuroLink cannot connect to Ollama
Solutions:
# Verify Ollama is running
curl http://localhost:11434/api/tags
# Check firewall settings
sudo ufw allow 11434/tcp
# Restart Ollama service
sudo systemctl restart ollama
Conclusion
Running local LLMs with Ollama and NeuroLink provides a powerful, flexible, and privacy-preserving approach to AI integration. By following this guide, you've learned how to:
- Set up and configure Ollama for local inference
- Integrate Ollama with NeuroLink for seamless model access
- Select appropriate models for your use cases
- Optimize performance for your hardware
- Implement hybrid cloud and local patterns
- Troubleshoot common deployment issues
The combination of local and cloud inference gives you unprecedented flexibility in how you deploy AI capabilities. Start with local models for development and privacy-sensitive tasks, scale to cloud providers when you need additional capacity or capabilities, and let NeuroLink handle the complexity of managing multiple providers.
Found this helpful? Drop a comment below with your questions or share your experience with local LLMs!
Want to try NeuroLink?
- Website: neurolink.ink
- GitHub: github.com/juspay/neurolink
- Documentation: docs.neurolink.ink
- Star the repo if you find it useful!
Follow us for more AI development content:
- Dev.to: @neurolink
- Twitter: @Neurolink__