DEV Community

Devanshu Biswas
Devanshu Biswas

Posted on

Skeleton of Thought: Make an LLM Answer 2–3 Faster

LLMs write answers one token at a time, strictly left to right. Token 500 can't start until token 499 exists, so a thorough answer feels slow no matter how fast your hardware is. Skeleton of Thought (SoT) attacks exactly that — the length of the sequential critical path.

The idea

Most answers are really a list of semi-independent parts: tips, sections, aspects of a comparison. SoT exploits this in three moves:

  1. Ask for a skeleton — just short point titles, no prose.
  2. Expand every point in parallel — one request each, all at once.
  3. Stitch them back together in order.

Because the expansions don't depend on each other, they can run concurrently.

The skeleton call

const skeleton = await llm(
  `Answer "${q}" as ONLY 3-8 numbered point titles, <=6 words each.`);
const points = skeleton.split("\n").filter(Boolean);
Enter fullscreen mode Exit fullscreen mode

Tiny and fast. It also doubles as a plan, which tends to make the final answer better-organised than free-form writing.

Parallel expansion — where the speed comes from

const parts = await Promise.all(points.map((p, i) =>
  llm(`Q: ${q}\nOutline: ${skeleton}\nExpand point ${i+1} in 1-2 sentences.`)
));
Enter fullscreen mode Exit fullscreen mode

Instead of waiting for point 1, then 2, then 3… you wait once for whichever point takes longest.

The maths

  • Sequential time ≈ skeleton + sum of all points.
  • SoT time ≈ skeleton + the single longest point.

With five similar-length points, that's roughly five point-times versus one — the reported speedups land around 2× and up.

When NOT to use it

Parallelism is only safe when points are independent. Chained reasoning — a maths proof, a step-by-step derivation where point 3 needs point 2 — breaks it. Gate on that and fall back to normal generation.

The trade-off

SoT isn't free: you spend more total tokens (each expansion repeats the question and outline as context) and make more requests. What you buy is latency — the user sees a complete, structured answer much sooner. It's the classic distributed-systems bargain: more total work to shorten the critical path. It pairs beautifully with streaming UIs, too.

Watch a sequential vs SoT race with real timing here:
https://dev48v.infy.uk/prompt/day22-skeleton-of-thought.html

Top comments (0)