Needle Review: 26M Function-Calling Model for Edge Devices

Originally published on andrew.ooo — visit the original for updates, fixes to code snippets that have aged, or follow-up posts.

TL;DR

Needle is a 26-million-parameter function-calling model from the team behind Cactus, distilled from Gemini 3.1 Flash-Lite. It runs at 6000 tokens/sec prefill and 1200 tokens/sec decode on consumer devices — fast enough for a smartwatch — and it does one thing: turn natural-language queries into JSON tool calls.

  • 26M parameters — roughly 100x smaller than the smallest "useful" chat LLMs
  • Pure attention architecture — zero feed-forward / MLP layers, encoder-decoder with cross-attention
  • 6000 tok/s prefill, 1200 tok/s decode on consumer hardware via the Cactus runtime
  • Beats FunctionGemma-270M, Qwen-0.6B, Granite-350M, LFM2.5-350M on single-shot function calling
  • Pretrained on 200B tokens in 27 hours across 16 TPU v6e; post-trained in 45 minutes
  • MIT license — code, weights, and dataset generation pipeline all open
  • One-command finetune on your own tools from your Mac via needle playground

If you've been waiting for a "compiler pass for agents" — a tiny, deterministic model that handles the boring tool-routing step so your main LLM never has to — Needle is the first credible attempt I've seen at that exact shape.


Why This Matters

Most agent frameworks today shove the entire conversation, tool schema, and system prompt into a 70B+ model just to extract one JSON object like {"name": "get_weather", "arguments": {"city": "Paris"}}. That's a ~$0.01 LLM call to do what is fundamentally a structured prediction problem.

The waste shows up everywhere:

  • Latency. Even with Groq or Cerebras, a tool-call round-trip is 200–800ms — an eternity in a voice agent or wearable.
  • Cost. Function-calling is the single largest line item in most production agent bills, and most of it is repetitive routing.
  • Privacy. Every tool call leaks query content to a cloud LLM, even when the execution is local (e.g. "turn off the lights").
  • Offline. Phones, watches, glasses, in-car assistants — none can guarantee a cloud round-trip.

Needle's bet is that tool calling is not reasoning — it's retrieval-and-assembly. Match the query to a tool name, extract argument values, emit JSON. None of those steps requires the per-position non-linear computation that FFN layers provide. So Cactus removed them entirely. That's the most interesting architectural claim to come out of the small-model world in a while.


Quick Reference

| Item | Value |
| --- | --- |
| Parameters | 26M |
| Architecture | Encoder–decoder, pure attention (no FFN) |
| Encoder | 12 layers, GQA (8 heads / 4 KV), RoPE, gated residuals |
| Decoder | 8 layers, self-attn + cross-attn, gated residuals |
| d_model | 512 |
| Vocab | 8192 (SentencePiece BPE) |
| Norm | ZCRMSNorm (zero-centered, init = 0) |
| Precision | bfloat16 (INT4 QAT during training) |
| Pretraining | 200B tokens, 16× TPU v6e, 27 hours |
| Post-training | 2B tokens of function-call data, 45 min |
| Distilled from | Gemini 3.1 Flash-Lite |
| Throughput | 6000 tok/s prefill, 1200 tok/s decode (Cactus runtime) |
| License | MIT |
| GitHub | cactus-compute/needle |
| Weights | Cactus-Compute/needle on Hugging Face |

What It Actually Is

Needle is a specialist model that converts (query, tools) into tool_call JSON and nothing else. It is not a chatbot. It does not do RAG. It does not write code. It does not decide whether a tool should be called — that's your job, or the job of the larger model in your stack.

Given the query "What's the weather in San Francisco?" and a get_weather tool, it emits:

[{"name": "get_weather", "arguments": {"location": "San Francisco"}}]

That's the whole product. The reason it matters: 99% of the cycles in a typical agent loop are spent doing exactly this — and Needle does it in milliseconds, locally, for free, while a 70B model is still loading its KV cache.

The "Simple Attention Network" architecture

The Cactus team calls their architecture a Simple Attention Network (SAN). The key choices:

  1. Encoder–decoder, not decoder-only. The encoder reads the full (query, tools) input bidirectionally in one shot; the decoder generates the JSON via cross-attention. The input representation is computed once and stays fixed during generation, rather than accumulating a growing KV cache of input tokens.
  2. No FFN / MLP layers anywhere. Standard transformers use ~2/3 of their parameters in FFN. Needle removes them entirely.
  3. Gated residuals. Without FFNs, plain x = x + Attn(Norm(x)) is limiting. Cactus uses x = x + sigmoid(gate) * Attn(Norm(x)), gate initialised to zero, so each layer starts at half-strength.
  4. ZCRMSNorm. Zero-centered RMSNorm (x * (1 + gamma) / RMS(x), gamma init = 0), identity at init. Pairs with gated residuals so the whole network starts as a damped identity. Points 3 and 4 are sketched in code right after this list.
  5. CLIP-style tool retrieval head. The encoder also produces a unit vector for contrastive search, letting you pre-filter to the top-k most relevant tools from a large catalogue before generation.
  6. Muon + AdamW dual optimiser. Muon (Newton–Schulz orthogonalisation) on Q/K/V/O projections prevents representation collapse in a deep stack of pure linear-then-softmax layers. AdamW for everything else.
  7. INT4 quantisation-aware training. Fake INT4 quantisation every 100 steps with straight-through estimators. The model trains at the same precision it deploys at — no post-training quantisation gap.
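
To make points 3 and 4 concrete, here's a minimal PyTorch sketch of an attention-only block with gated residuals and ZCRMSNorm. This is an illustration of the idea, not the Needle source (which is JAX and adds GQA, RoPE, and cross-attention):

```python
import torch
import torch.nn as nn

class ZCRMSNorm(nn.Module):
    """Zero-centered RMSNorm: x * (1 + gamma) / RMS(x), with gamma
    initialised to zero so the learned scale starts at exactly 1."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(dim))  # zero-centered init
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * inv_rms * (1.0 + self.gamma)

class GatedAttentionBlock(nn.Module):
    """Residual block with attention only, no FFN/MLP sublayer.
    The gate starts at 0, so sigmoid(gate) = 0.5 and every layer
    begins at half-strength, keeping the deep stack near-identity."""
    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.norm = ZCRMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # x = x + sigmoid(gate) * Attn(Norm(x))
        return x + torch.sigmoid(self.gate) * attn_out
```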

If you've read nGPT or DeepSeek-V3, several tricks will look familiar. The novelty is the combination plus the FFN-free claim. Full write-up: docs/simple_attention_networks.md.
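
Point 6 is also easy to miss if you haven't seen Muon before. The orthogonalisation step is a short Newton–Schulz iteration; here's a sketch using the quintic coefficients from Keller Jordan's reference Muon implementation (the Needle training code may differ in details):

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalise a 2D gradient matrix via a quintic
    Newton-Schulz iteration, as in the Muon optimiser. Muon applies this
    to each weight matrix's momentum buffer before the update."""
    a, b, c = 3.4445, -4.7750, 2.0315   # quintic coefficients from Muon
    X = G / (G.norm() + 1e-7)           # normalise so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:
        X = X.T                          # iterate on the wide orientation
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X
```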


Getting Started

The full path from clone to running prediction:

```bash
git clone https://github.com/cactus-compute/needle.git
cd needle && source ./setup
needle playground
```

That opens a Gradio web UI at http://127.0.0.1:7860 where you can paste tool schemas and queries. Weights auto-download from Hugging Face on first run.

For programmatic use:

```python
from needle import load_checkpoint, generate, SimpleAttentionNetwork, get_tokenizer

params, config = load_checkpoint("checkpoints/needle.pkl")
model = SimpleAttentionNetwork(config)
tokenizer = get_tokenizer()

result = generate(
    model, params, tokenizer,
    query="What's the weather in San Francisco?",
    tools='[{"name":"get_weather","parameters":{"location":"string"}}]',
    stream=False,
)
print(result)
# [{"name":"get_weather","arguments":{"location":"San Francisco"}}]
```

Finetuning on your own tools

This is where Needle shines. The model was post-trained on synthetic data across 15 generic tool categories (timers, messaging, navigation, smart home, etc.). For your own product's tools, you almost certainly want to finetune — which the playground UI makes shockingly easy.

```bash
needle finetune data.jsonl
```

The JSONL format has three fields per line — query, tools (a JSON-encoded string), and answers (also JSON-encoded). Cactus recommends at least 120 examples per tool (100 train / 10 val / 10 test); with fewer, the model overfits. You can click "generate-data" in the playground to have Gemini synthesise the dataset from a tool spec, then train immediately. End-to-end "tool spec → finetuned 26M model" in ~10 minutes on a Mac.
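
For concreteness, here's what writing one line of data.jsonl might look like. The set_timer tool and its parameter names are hypothetical; the three-field shape, with tools and answers as JSON-encoded strings, follows the format described above:

```python
import json

# One hypothetical training record for a custom "set_timer" tool.
record = {
    "query": "Set a timer for 8 minutes",
    "tools": json.dumps(
        [{"name": "set_timer", "parameters": {"duration_minutes": "number"}}]
    ),
    "answers": json.dumps(
        [{"name": "set_timer", "arguments": {"duration_minutes": 8}}]
    ),
}

with open("data.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```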


Why It's Trending Now

Needle hit #1 Show HN on May 14, 2026 with 280+ points and 78 comments, then climbed onto GitHub Trending the next morning. Three forces converged:

  1. The "small model" thesis is going mainstream. Apple Intelligence on-device, Gemini Nano in Chrome, Phi-4-mini — everyone is shipping sub-1B models. Needle pushes that an order of magnitude further.
  2. Function calling is the most repetitive thing LLMs do. OpenAI and Anthropic both surfaced this in 2026 product talks: the same JSON-emission pattern is happening billions of times a day at huge cost. A tiny specialist is the obvious answer.
  3. The architecture claim is genuinely novel. "No FFN, just attention, and it still works" is the kind of thing ML Twitter loves to argue about. The Cactus team posted careful ablations, which helped.

What the Community Is Saying

The HN thread surfaced the usual mix of skepticism and excitement:

  • "Is this just a glorified parser?" Several commenters argued tool calling is structured prediction. The author (Henry Ndubuaku) agreed and framed Needle as exactly that — a compiler pass for agents.
  • "How does it compare to constrained decoding?" You can get 99% JSON-validity from any LLM via grammar-constrained sampling (Outlines, Guidance, llama.cpp grammars). The Needle answer: yes, but you still pay the latency cost of a big model. Needle is 100× smaller — you get JSON-validity plus a 200ms→50ms latency reduction.
  • "What about multi-turn?" Single-shot only. Multi-turn agentic loops still need a bigger model. Cactus is explicit: Needle is a routing layer, not a planner.
  • "Will this work for a 500-tool MCP setup?" That's exactly what the CLIP-style retrieval head is for — encode all tool definitions once, cosine-rank per query, feed the top 10 to the decoder.

On Reddit's r/AI_Agents, a thread titled "A 26M tool-router suggests tool calling should be a compiler pass, not a reasoning step" drew the most thoughtful discussion. The Needle-as-compiler-pass framing is going to stick.


Real-World Use Cases

Where Needle fits in a stack:

  • On-device voice assistants. Phone, watch, smart speaker, car. The big LLM handles open-ended conversation in the cloud; Needle handles "set a timer for 8 minutes" without phoning home.
  • Latency-critical loops. Trading bots, robotics, autonomous driving — anywhere a 500ms LLM round-trip is unacceptable.
  • Cost-reduction layer. Route 80% of tool calls through Needle, fall back to GPT-5 or Claude for ambiguous cases.
  • Privacy-sensitive verticals. Healthcare, legal, defense. The big model never sees user data; Needle routes locally and you send only the resolved tool call to your audited backend.
  • MCP fan-out. Sit Needle in front of a large MCP server (50+ tools) as a fast pre-router.

Honest Limitations

Needle is impressive but it is not a general-purpose model:

  • No multi-turn reasoning. Single-shot only. The "decide what to do next" step in an agent loop still needs a bigger model.
  • No conversational fallback. Given any input, Needle emits some tool call — even if no tool fits. You need a guard model or confidence threshold in front.
  • Finicky out-of-the-box. The pretrained checkpoint covers 15 generic tool categories. For domain-specific tools, expect to finetune.
  • Tiny vocabulary. 8192-token BPE splits rare words aggressively. Fine for tool routing, problematic for anything else.
  • One language. Trained predominantly on English. Multilingual function calling needs more post-training.
  • Apples-to-oranges benchmarks. Needle beats FunctionGemma-270M etc. on single-shot tool calling. It is not better at conversation, code, or reasoning — it cannot do those things at all.

Who Should Use This

Use Needle if:

  • You're building anything that runs on a phone, watch, glasses, in-car, or other constrained device
  • You have a high-volume agent loop and your function-call bill is hurting
  • Your tool catalogue is well-defined and stable (so finetuning pays off)
  • You care about latency or offline operation
  • You want to learn how the "no-FFN" architectural bet performs in practice — the code is small, MIT, and beautifully readable

Skip Needle if:

  • You need a model that can also chat, reason, or generate prose
  • Your tool catalogue changes constantly and finetuning is operationally painful
  • You're already happy with constrained decoding on a 7B model
  • You need multi-turn agent loops without a separate planning model

Comparison with Alternatives

| Model | Params | Function Calling | Conversation | Local? | License |
| --- | --- | --- | --- | --- | --- |
| Needle | 26M | ✅ Specialist | ❌ | ✅ Phone-class | MIT |
| FunctionGemma | 270M | ✅ | ⚠️ Limited | ✅ Laptop-class | Gemma TOS |
| Qwen3-0.6B | 600M | ✅ | ✅ | ✅ Laptop-class | Apache 2.0 |
| LFM2.5-350M | 350M | ✅ | ✅ | ✅ Laptop-class | LFM license |
| Granite-3.5-350M | 350M | ✅ | ✅ | ✅ Laptop-class | Apache 2.0 |
| GPT-5-mini (tools mode) | ? | ✅ | ✅ | ❌ Cloud | Proprietary |

The honest read: if you want the cheapest, fastest single-shot tool router that fits on a watch, Needle is currently in a class of one. If you want a generalist that also does tool calling, pick one of the 300M–600M models above.


FAQ

Is Needle production-ready?

Cactus tags it as an experimental run for the Simple Attention Network architecture, so set your expectations accordingly: the weights are MIT-licensed and stable, but the docs themselves warn that "small models can be finicky" — and they are. For production, finetune on your own tool set, set a confidence threshold, and route low-confidence queries to a bigger model.

How does Needle handle ambiguous queries?

It doesn't, really. There's no built-in abstain mechanism — given any input, Needle will emit some tool call, even if no tool fits. In production you want either (a) a separate intent classifier in front of it, or (b) a confidence-score guard derived from the contrastive retrieval head's top-1 cosine similarity. The Cactus team is explicit that this is a routing layer, not a planner.
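
Here's a sketch of option (b). How exactly you obtain the top-1 cosine score from Needle is an assumption here, and the threshold is something you'd tune on held-out in-scope vs. out-of-scope queries:

```python
from typing import Callable

THRESHOLD = 0.75  # hypothetical value; tune on your own traffic

def guarded_route(top1_cosine: float,
                  needle_call: Callable[[], list],
                  fallback: Callable[[], list]) -> list:
    """Take Needle's answer only when the retrieval head is confident;
    otherwise hand the query to the bigger (cloud) model."""
    if top1_cosine >= THRESHOLD:
        return needle_call()   # fast, local, free
    return fallback()          # ambiguous query: let the planner LLM decide
```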

Can I run Needle on a Raspberry Pi or microcontroller?

The Hugging Face weights run on any Python environment with JAX/PyTorch. For real edge deployment (phones, wearables, MCUs), you want the Cactus runtime — Cactus is a separate C++ inference engine built specifically for mobile, wearables, and custom hardware. That's where the 6000 tok/s prefill numbers come from.

How does Needle compare to constrained decoding?

Different layers of the stack. Constrained decoding (Outlines, Guidance, llama.cpp grammars) guarantees JSON-validity on top of any model. Needle is a smaller, faster model that emits valid JSON natively — the architecture is biased toward JSON output. You can stack constrained decoding on top of Needle, and probably should for paranoia-grade reliability.
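
Short of full grammar-constrained decoding, even a cheap post-hoc check catches most failure modes. A minimal sketch (not a library API, just plain JSON validation against your known tool names):

```python
import json
from typing import Optional

def safe_parse(raw: str, known_tools: set[str]) -> Optional[list]:
    """Validate Needle's raw output: it must be a JSON list of tool calls
    whose names we recognise and whose arguments are objects.
    Returns None on any violation so the caller can fall back."""
    try:
        calls = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(calls, list):
        return None
    for call in calls:
        if not isinstance(call, dict):
            return None
        if call.get("name") not in known_tools:
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return calls
```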

What's the license for the weights?

MIT. Code, weights, and dataset generation pipeline are all permissively licensed. You can ship Needle in a commercial product without attribution. The training data was synthesised via Gemini, so check your Gemini terms of service if you plan to regenerate similar datasets.

How much does it cost to retrain from scratch?

200B tokens on 16× TPU v6e for 27 hours works out to about 430 chip-hours; at spot pricing, that's roughly $1,500–$3,000 to reproduce the pretraining run. Post-training (the part you'd actually customise) is 2B tokens in 45 minutes on the same setup — under $100. Finetuning on your own ~120 examples is free on a Mac.


Bottom Line

Needle is the first model I've seen that takes the "tool calling is a structured prediction problem, not a reasoning problem" thesis seriously and ships a working artifact. It won't replace your main LLM, and it isn't supposed to. What it does is collapse the cost and latency of the most repetitive 80% of agentic work down to something you can run on a watch.

The architectural bet — no FFN, pure attention, encoder-decoder — is genuinely novel at this scale and worth studying even if you never deploy the model. The MIT license, the one-command playground, and the 10-minute finetune loop make it trivially easy to try.

If you're building agents in 2026, Needle deserves an afternoon of your time. Even if it doesn't fit your current stack, the pattern — tiny specialist models as compiler passes for big generalist models — is almost certainly what production agent stacks look like in two years.
