DEV Community

plasmon

Qiita_Blog

Ollama, LM Studio, and GPT4All Are All Just llama.cpp — Here's Why Performance Still Differs

6 min read

99.8% of LLM Inference Power Isn't Spent on Computation

7 min read
Q4 KV Cache Fit 32K Context into 8GB VRAM — Only Math Broke

8 min read
HBM4 Didn't Break the Memory Wall — It Just Moved It

6 min read
Running Just One LLM on 8GB VRAM Is a Waste

8 min read
Light Just Cut KV Cache Memory Traffic to 1/16th

7 min read
Even in Tool Calling, the Bigger Model Didn't Win

4 min read
Measuring RAG Retrieval Accuracy on 3 Axes: The Optimal Setup Changed Completely with the Conditions

3 min read
They Routed Power Through the Back of the Chip and 30% IR Drop Vanished

6 min read
Letting AI Control RAG Search Improved Accuracy by 79%

6 min read
If Memory Could Compute, Would We Still Need GPUs?

6 min read
I Couldn't Build a Local LLM PC for $1,300 — Budget Tiers and the VRAM Cliffs Between Them

6 min read
8-Bit Quantization Destroyed 92% of Code Generation — The Culprit Wasn't Bit Count

5 min read
The Recursive Loop Has Started: AI Is Now Designing AI Chips

7 min read
ML Hit 99% Accuracy on Yield Prediction — The Factory Floor Ignored It

8 min read
3 Classifiers, 3 Answers: Why CoT Faithfulness Scores Are Meaningless

6 min read
Parameter Count Is the Worst Way to Pick a Model on 8GB VRAM

5 min read
The Memory Bandwidth Gap Is 49x and Growing — Why Local LLMs Hit a Ceiling

7 min read
MoE Beat Dense 27B by 2.4x on 8GB VRAM — The 35B-A3B Benchmark Nobody Expected

5 min read
I Designed a Memory System for Claude Code — 'Forgetting' Was the Hardest Part

6 min read
80% of LLM 'Thinking' Is a Lie — What CoT Faithfulness Research Actually Shows

7 min read
Can Spiking Neural Networks Kill the GPU? 3 Papers Show the Reality

6 min read
I Let Claude Code Run My Tech Blog. A Fake Article Passed Every Quality Check.

10 min read
How Many Nanometers Until Physics Says No? The 3 Walls Beyond 2nm, Read Through Papers in 2026

14 min read
Still Picking API vs Local LLM by Gut Feeling? A Framework With Real Benchmarks

6 min read
I Tried Speculative Decoding on RTX 4060 8GB — Every Config Was Slower Than Baseline

8 min read
Stop Letting AI Be Nice — LLM Sycophancy Mode Is Killing Your Engineering Thinking

6 min read
95% of LLM Inference Energy Is Wasted on Data Movement — Why Optical Interconnects (CPO) Can't Fix It

7 min read
3D Chip Stacking Hits a 200μm Warpage Wall — Why Your Next GPU Memory Might Crack

11 min read
I Pitted 3 Qwen3.5 Models Against Each Other on an RTX 4060 8GB — What Spec Sheets Don't Tell You

8 min read
What Happens When You Bring LLMs Into a Semiconductor FAB — 5 ArXiv Papers, Brutally Honest Reviews

9 min read
I Built a Fully Local Paper RAG on an RTX 4060 8GB — BGE-M3 + Qwen2.5-32B + ChromaDB

10 min read
Running Qwen2.5-32B on RTX 4060 8GB — Beating M4 at 10.8 t/s with llama.cpp

7 min read