DEV Community: Bare Tensor

Why 90,000+ Developers Are Frustrated With Raspberry Pi Inference (And How We Measured It)

Bare Tensor — Fri, 12 Jun 2026 04:37:07 +0000

Our Ollama vs Llama.cpp benchmark on Raspberry Pi just hit 90k views on Reddit.
Reading 100+ comments revealed a pattern: every developer is stuck on the same problems.
Not speed. Not hardware capability.
Configuration. Measurement. Reproducibility.
Here's what we found:

OS overhead costs 30-40% of performance

A default Raspberry Pi OS install runs background services that steal CPU from inference. Strip it down to a minimal Linux install and the same hardware runs up to 40% faster.

Hidden configuration tricks nobody documents

Ollama defaults to a 4096-token context window on a 4GB Pi. Dropping it to 512 with OLLAMA_CONTEXT_LENGTH=512 gave a 26% speedup in our benchmark (5.7 to 7.2 tokens/sec). Nobody mentions it.

Setup complexity is the real blocker

Llama.cpp: Fastest but requires 2.5 hours + ARM NEON knowledge

Ollama: Easiest but misconfigured by default

No reproducible methodology

Everyone benchmarks differently. No standard way to measure. Can't compare your setup to others'.
https://www.reddit.com/r/raspberry_pi/comments/1tz673u/been_testing_llamacpp_vs_ollama_on_my_pi_5_the/

Llama.cpp vs Ollama on Raspberry Pi: The Performance Trade-off Nobody Talks About

Bare Tensor — Sun, 07 Jun 2026 07:25:45 +0000

I've been benchmarking the two main tools for running LLMs on Raspberry Pi, and I want to document what I'm finding because the trade-off between them isn't obvious.
The Setup
Raspberry Pi 5, 4GB RAM, Raspberry Pi OS (64-bit)
Model: TinyLlama 1.1B Q4_K_M
Test: 100 token generation, measured 10 times, averaged
Llama.cpp Results
Tokens per second: 8.2 ± 0.4
Peak RAM: 890MB
Model load time: 2.9 seconds
Total setup time: 2.5 hours
Installation steps: 12 (clone, compile, configure)
To get these numbers, I had to:

Clone the repository
Install build tools (gcc, g++, make)
Compile from source with ARM NEON flags
Test different thread counts to find the optimum (4 threads on the 4-core Pi)
Measure model load time multiple times to eliminate variance
Run benchmark 10 times and average

Each step had potential failure points. The compile step took 45 minutes on the Pi.
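The "run 10 times and average" step is easy to get subtly wrong, so here is a minimal sketch of it in Python. It assumes the ± figures are a sample standard deviation (the numbers above don't say which spread measure was used), and `generate` is a hypothetical hook you would wire to whatever runtime you are testing; it is not part of llama.cpp or Ollama.

```python
import statistics
import time
from typing import Callable, List, Tuple


def time_generation(generate: Callable[[int], int], n_tokens: int) -> float:
    """Return tokens/sec for one call to a caller-supplied generate function.

    `generate` is a hypothetical hook: it should produce up to `n_tokens`
    tokens and return how many it actually produced.
    """
    start = time.perf_counter()
    produced = generate(n_tokens)
    elapsed = time.perf_counter() - start
    return produced / elapsed


def summarize(rates: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of tokens/sec across runs."""
    return statistics.mean(rates), statistics.stdev(rates)


def benchmark(generate: Callable[[int], int],
              n_tokens: int = 100,
              runs: int = 10) -> Tuple[float, float]:
    """Time `runs` generations of `n_tokens` tokens and summarize them."""
    rates = [time_generation(generate, n_tokens) for _ in range(runs)]
    return summarize(rates)
```

Reporting the run count, token count, and spread alongside the mean is what lets someone else compare their numbers to yours.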
Ollama Results (Default Settings)
Tokens per second: 5.7 ± 0.3
Peak RAM: 1.1GB
Model load time: 5.4 seconds
Total setup time: 8 minutes
Installation steps: 1 (curl bash)
Installation literally took 8 minutes. Open terminal, paste one command, wait.
The problem: These numbers don't represent the actual capability of the Pi. They represent Ollama's default configuration on a Pi, which isn't optimal.
Ollama Results (Optimized Settings)
After manually setting OLLAMA_CONTEXT_LENGTH=512:
Tokens per second: 7.2 ± 0.3
Peak RAM: 890MB
Model load time: 4.2 seconds
Total setup time: 12 minutes (8 min install + 4 min config)
Installation steps: 2 (install + set env variable)
Same hardware. Same model. One environment variable changed. 26 percent performance improvement.
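For anyone checking the percentages, this is the arithmetic behind them, using the three tokens/sec measurements reported above. The function name is just illustrative.

```python
def percent_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Mean tokens/sec from the three result blocks above.
ollama_default = 5.7
ollama_tuned = 7.2
llama_cpp = 8.2

speedup = percent_change(ollama_default, ollama_tuned)
default_gap = percent_change(llama_cpp, ollama_default)
tuned_gap = percent_change(llama_cpp, ollama_tuned)

print(f"Tuned vs default Ollama: {speedup:+.1f}%")        # +26.3%
print(f"Default Ollama vs llama.cpp: {default_gap:+.1f}%")  # -30.5%
print(f"Tuned Ollama vs llama.cpp: {tuned_gap:+.1f}%")      # -12.2%
```

Note the baseline matters: the 26 percent gain is measured against default Ollama, while the gaps to llama.cpp are measured against llama.cpp.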
The Trade-off
Llama.cpp:

Pros: Fastest performance (8.2 tokens/sec), lowest RAM (890MB), actively optimized
Cons: 2.5 hour setup, requires technical knowledge, steep learning curve

Ollama:

Pros: 8 minute setup, zero technical knowledge required, user-friendly
Cons: 5.7 tokens/sec by default (you lose 30 percent), doesn't auto-optimize, requires knowledge of environment variables to fix

The Unspoken Problem
Most users encounter Ollama first because it's easier. They get 5.7 tokens/sec. They think the Pi is slow. They don't know that with one configuration change they'd get 7.2 tokens/sec.
Some users dig into llama.cpp. They get 8.2 tokens/sec. But they had to spend 2.5 hours learning compile flags and ARM architecture.
Neither experience is designed for someone who just wants to run AI locally on their Pi.
What Developers Need

Automatic hardware detection. Detect "this is a Pi" and optimize accordingly.
Sensible defaults for the hardware. Not x86 defaults on ARM hardware.
Clear setup path. "From zero to running inference" should be minutes, not hours.
Real-time visibility. Show me what's actually happening (RAM usage, CPU load, temperature).
Honest benchmarking. Let me know if my setup is actually optimal.
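The detection and visibility items above don't need much machinery on Linux. Here is a rough Python sketch that reads the device-tree model string, /proc/meminfo, and the first thermal zone. The paths are the usual Linux locations, but whether all of them exist on a given board is an assumption, so every read falls back gracefully. The function names are mine, not from any existing tool.

```python
import os
from pathlib import Path
from typing import Dict, Optional

# Usual Linux locations; availability on a specific board is an assumption.
MODEL_PATH = Path("/proc/device-tree/model")
MEMINFO_PATH = Path("/proc/meminfo")
THERMAL_PATH = Path("/sys/class/thermal/thermal_zone0/temp")


def read_text(path: Path) -> Optional[str]:
    """Return file contents, or None if the file is missing or unreadable."""
    try:
        return path.read_text(errors="replace")
    except OSError:
        return None


def is_raspberry_pi(model_text: Optional[str]) -> bool:
    """True if the device-tree model string names a Raspberry Pi."""
    return bool(model_text) and "Raspberry Pi" in model_text


def parse_meminfo_kb(text: str) -> Dict[str, int]:
    """Parse lines like 'MemTotal:  4045000 kB' into {name: kilobytes}."""
    values: Dict[str, int] = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def parse_temp_c(text: str) -> float:
    """Thermal zone files report millidegrees Celsius."""
    return int(text.strip()) / 1000.0


def snapshot() -> Dict[str, object]:
    """Collect one reading of board identity, memory, load, and temperature."""
    mem = parse_meminfo_kb(read_text(MEMINFO_PATH) or "")
    temp = read_text(THERMAL_PATH)
    return {
        "is_pi": is_raspberry_pi(read_text(MODEL_PATH)),
        "mem_total_mb": mem.get("MemTotal", 0) // 1024,
        "mem_available_mb": mem.get("MemAvailable", 0) // 1024,
        "load_1min": os.getloadavg()[0] if hasattr(os, "getloadavg") else None,
        "temp_c": parse_temp_c(temp) if temp and temp.strip() else None,
    }


if __name__ == "__main__":
    print(snapshot())
```

A tool could call `snapshot()` before loading a model to pick defaults for the board, and again during generation to show what the hardware is doing.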

What's Coming
The Pi community is growing. The demand for local AI on Pi is real. The gap between what's technically possible (8.2 tokens/sec after a 2.5-hour setup) and what's practically accessible (5.7 tokens/sec after an 8-minute setup) is being noticed.
Some people are working on closing that gap.
In the next few weeks, new tools will ship that attempt to combine the speed of llama.cpp with the accessibility of Ollama.
The interesting part is watching which approach wins and why.

The True Cost of Cloud AI and Why Local Inference Changes the Economics

Bare Tensor — Fri, 05 Jun 2026 17:57:29 +0000

I've been tracking the cost structure of AI infrastructure for projects I've worked on, and I realized most developers haven't actually calculated what cloud AI costs at scale.
Let's do the math.
Cloud API Economics
Using the OpenAI API, the Claude API, or similar services for inference:

GPT-3.5 Turbo: $0.0005 per 1K input tokens, $0.0015 per 1K output tokens
Claude 3.5 Sonnet: $0.003 per 1K input tokens, $0.015 per 1K output tokens
GPT-4: $0.03 per 1K input tokens, $0.06 per 1K output tokens

A typical user interaction (question + response): 300-500 total tokens.
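Single user interaction cost: about $0.00015 to $0.03 depending on model choice (300 tokens at $0.0005 per 1K, up to 500 tokens at $0.06 per 1K).
At Scale
The figures below assume one interaction per user per day, billed at the input rate, over a 30-day month. Costs grow in direct proportion to usage, so 100 interactions per user per day multiplies every figure by 100.
100 daily users using an AI feature:

Low-cost API: 100 × 300 tokens × $0.0005 per 1K = $0.015/day = $0.45/month
Mid-range API: 100 × 400 tokens × $0.003 per 1K = $0.12/day = $3.60/month
High-performance API: 100 × 500 tokens × $0.03 per 1K = $1.50/day = $45/month

1,000 daily users:

Low-cost: $4.50/month
Mid-range: $36/month
High-performance: $450/month

10,000 daily users:

Low-cost: $45/month
Mid-range: $360/month
High-performance: $4,500/month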

These aren't edge cases. These are realistic numbers for apps with moderate adoption.
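If you want to run this math for your own app, here is a small Python sketch. It assumes one interaction per user per day, every token billed at the input rate, and a 30-day month. The step that is easy to drop is dividing the token count by 1,000, because the prices listed above are quoted per 1K tokens. Function and parameter names are illustrative.

```python
def monthly_cost(daily_users: int,
                 tokens_per_interaction: int,
                 price_per_1k_tokens: float,
                 interactions_per_user_per_day: int = 1,
                 days_per_month: int = 30) -> float:
    """Estimated monthly API spend in dollars.

    Prices are per 1,000 tokens, so the token count is divided by 1,000
    before multiplying by the price.
    """
    tokens_per_day = (daily_users
                      * interactions_per_user_per_day
                      * tokens_per_interaction)
    return tokens_per_day / 1000 * price_per_1k_tokens * days_per_month


# (label, tokens per interaction, input price per 1K tokens) from the list above.
TIERS = [
    ("Low-cost", 300, 0.0005),
    ("Mid-range", 400, 0.003),
    ("High-performance", 500, 0.03),
]

if __name__ == "__main__":
    for users in (100, 1_000, 10_000):
        for label, tokens, price in TIERS:
            cost = monthly_cost(users, tokens, price)
            print(f"{users:>6} users, {label}: ${cost:,.2f}/month")
```

Raise `interactions_per_user_per_day` to match your real usage; the bill scales linearly with it.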
The Local Inference Alternative
What if that AI ran on the user's device instead?
Infrastructure cost per inference: $0
The entire operational cost is hardware cost (one-time) and electricity (negligible).
The Device Capability Assumption
Most developers assume devices can't run AI locally. This assumption is outdated.
Devices that can now run real LLMs locally:

Raspberry Pi 4 (4GB): TinyLlama 1.1B at 4 tokens/sec
Raspberry Pi 5 (4GB): TinyLlama 1.1B at 8 tokens/sec
Intel/AMD laptop from 2019 (4GB RAM): Mistral 7B Q4 at 6 tokens/sec
ARM single board computers ($50): Qwen 1.5B at 4 tokens/sec

These aren't high-end systems. These are systems that most people consider weak for modern use.
Yet they can run inference at speeds that are useful for many applications.
Why This Gap Exists
Three separate communities that rarely talk to each other:

Device hardware community (manufacturers, embedded systems engineers) — knows their hardware can run inference
Cloud AI community (developers using APIs) — assumes local inference isn't viable
Local inference community (edge AI builders) — knows it works but small audience

When these communities don't overlap, an information gap emerges. Developers don't know what's possible.
The Economics Flip
When you shift from cloud API to local inference:
Cloud API model:

First 100 users: roughly $0.45 to $45/month at one interaction per user per day, using the per-1K-token prices above
Infrastructure scaling: linear cost increase with users
Economics get worse the more successful you are

Local inference model:

First 100 users: cost of hardware + electricity (essentially free)
Infrastructure scaling: per-device deployment, not per-API-call scaling
Economics stay flat or improve as you scale
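One way to put a number on the flip is a break-even estimate: how many months of avoided API spend it takes to pay for a device. The sketch below is a simplification that assumes a one-time hardware cost, a flat monthly API bill, and an optional electricity cost you supply yourself. The dollar figures in the example are hypothetical inputs, not measurements.

```python
import math


def break_even_months(hardware_cost: float,
                      monthly_api_cost: float,
                      monthly_electricity_cost: float = 0.0) -> float:
    """Months until one-time hardware is paid back by avoided API spend.

    Returns math.inf when running locally saves nothing per month.
    """
    monthly_saving = monthly_api_cost - monthly_electricity_cost
    if monthly_saving <= 0:
        return math.inf
    return hardware_cost / monthly_saving


if __name__ == "__main__":
    # Hypothetical inputs: a $50 board replacing a $25/month API bill.
    print(break_even_months(hardware_cost=50, monthly_api_cost=25))  # 2.0
```

When inference runs on hardware the user already owns, the developer's hardware cost is zero and the question disappears entirely.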

The Constraint
The only real constraint is developer knowledge. Not technical possibility. Not device capability. Developer knowledge of how to actually implement this.
Devices that can run AI models locally have been available for years. Model optimization tooling (GGUF, quantization, int8/int4) has been available. But the developer experience of putting these pieces together on constrained hardware hasn't been solved well.
What This Means
If you're building with cloud AI APIs, understand the actual cost structure. Calculate what scale costs you.
If that number seems large, investigate whether local inference is viable for your use case. For many applications, it is.
The economics of AI infrastructure change completely when you stop paying per inference.