Series: The Learn Arc — 50 posts teaching Active Inference through a live BEAM-native workbench. ← Part 20: Session 3.2. This is Part 21.
The session
Chapter 3, §3. Session title: The softmax policy. Route: /learn/session/3/s3_softmax_policy.
You have G(π) for every candidate policy. How do you turn a vector of G-values into a distribution over plans? With a softmax. This session is about the softmax — specifically, the precision parameter on that softmax — and why it turns out to be one of the most important parameters in the whole theory.
The softmax
Q(π) = softmax( −γ · G(π) )
Where γ is the policy precision. The temperature τ = 1/γ; high precision = low temperature = confident policy posterior.
- γ → ∞: the softmax becomes argmax. The agent picks the best policy with probability 1. No stochasticity.
- γ → 0: the softmax becomes uniform. The agent picks policies at random. No information used.
- Middle values: the agent acts proportionally to expected-free-energy advantage.
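The three regimes are easy to see numerically. A minimal NumPy sketch (the G values here are made up for illustration, not taken from the workbench):

```python
import numpy as np

def policy_posterior(G, gamma):
    """Q(pi) = softmax(-gamma * G): lower expected free energy -> higher probability."""
    logits = -gamma * np.asarray(G, dtype=float)
    logits -= logits.max()          # subtract the max for numerical stability
    q = np.exp(logits)
    return q / q.sum()

G = np.array([1.0, 2.0, 4.0])       # hypothetical G for three candidate policies

print(policy_posterior(G, 0.01))    # gamma -> 0: near-uniform
print(policy_posterior(G, 1.0))     # middle: proportional to G-advantage
print(policy_posterior(G, 100.0))   # gamma -> inf: near-argmax on the lowest-G policy
```

Same G vector each time; only the precision changes the shape of the posterior.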
In Chapter 4 (Eq. 4.14) you'll see the full form, Q(π) = softmax(−γG − F). Session 3.3 focuses on γ.
The mid-temperature sweet spot
Why not always run γ = ∞? Because you haven't actually computed G perfectly. Your Q(s) is approximate. Your model is noisy. If you put all your posterior mass on the argmin-G policy and that argmin was wrong, you pay in full. The softmax's stochasticity is insurance against your own approximation error.
This is exactly the exploration-exploitation story from a different angle. Low γ = hedging, exploring. High γ = committing, exploiting. The Session 3.2 crossover is about how the two terms of G trade off over time; Session 3.3 is about how confidently to commit even when G is well-known.
Why biology cares
Chapter 5 turns γ into a testable claim: dopamine sets policy precision. High dopamine → high γ → sharp, confident action selection. Low dopamine → low γ → diffuse, tentative action selection.
Under this mapping, the behavioral signature of low dopamine is exactly the signature of a small γ in the softmax — indecision, hesitation, behavioral variability. Which is exactly what depletion studies find. Chapter 5 makes the mapping rigorous; Session 3.3 plants the parameter.
The sweep
/cookbook/planning-softmax-temperature is the recipe that sweeps γ across an order of magnitude. Run it in Studio and watch the agent transition from indecisive (γ = 0.5) to sharply committed (γ = 8). Same world, same preferences; only the precision changes.
The behavior spectrum is instructive:
- γ = 0.5 — agent visits most cells roughly equally. High behavioral entropy.
- γ = 2 — agent prefers the goal-ward direction but sometimes wanders.
- γ = 4 — agent is mostly pragmatic; occasional epistemic detours.
- γ = 8 — agent beelines. No detours even when ambiguity is high.
None of these is "right." Which γ is best depends on how good your model is (Chapter 9 tells you how to measure that).
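You can quantify that spectrum with the entropy of the policy posterior. A sketch over the recipe's sweep values, using illustrative G values (not the workbench's actual numbers):

```python
import numpy as np

def policy_posterior(G, gamma):
    """Q(pi) = softmax(-gamma * G)."""
    logits = -gamma * np.asarray(G, dtype=float)
    logits -= logits.max()
    q = np.exp(logits)
    return q / q.sum()

def entropy(q):
    """Shannon entropy in nats; high entropy = high behavioral variability."""
    return -np.sum(q * np.log(q + 1e-12))

G = np.array([1.0, 1.3, 1.8, 3.0])  # illustrative G, one value per candidate policy

for gamma in (0.5, 2.0, 4.0, 8.0):  # the recipe's sweep values
    q = policy_posterior(G, gamma)
    print(f"gamma={gamma:>3}: entropy={entropy(q):.3f} nats")
```

Entropy falls monotonically as γ rises: the same "indecisive → committed" transition the Studio panel shows, in one number.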
The horizon knob alongside
/cookbook/planning-horizon-depth sweeps the other planning knob — how far ahead the agent expands policies. Deeper horizons see around corners. Shallower horizons are cheaper. The two knobs (γ and horizon depth) interact: high-γ agents benefit more from deep horizons because they commit hard to whatever the expansion returns.
The Workbench renders both interactions live. You can watch them on the policy-posterior panel.
The runtime spec
Every cookbook recipe has policy_depth and preference_strength fields. preference_strength is approximately γ (up to a constant factor absorbed into C). You can read these off any recipe's Runtime block:
```
Runtime:
  agent_module: AgentPlane.ActiveInferenceAgent
  world: tiny_open_goal
  horizon: 5
  policy_depth: 5
  preference_strength: 4.0
```
Change preference_strength in the Builder canvas; save and re-instantiate. The agent's policy posterior tightens or loosens exactly as predicted.
The concepts this session surfaces
- Policy precision (γ) — softmax temperature inverse.
- Stochastic policy — sampling from Q(π).
- Deterministic policy — argmax of Q(π).
- Behavioral variability — emerges from low γ.
The quiz
Q: An agent with a well-calibrated model but low γ will:
- ☐ Fail catastrophically on most tasks.
- ☐ Behave more variably than the argmax-optimal agent would. ✓
- ☐ Ignore the ambiguity term in G.
- ☐ Learn faster because of increased exploration.
Why: Low γ flattens Q(π), so the agent samples from a broader distribution over policies. The agent's average policy is still near the argmin-G one, but individual choices have more variance. This is the behavioral signature of (e.g.) dopamine depletion.
Run it yourself
- /learn/session/3/s3_softmax_policy — session page.
- /cookbook/planning-softmax-temperature — γ sweep.
- /cookbook/planning-horizon-depth — horizon sweep.
- /cookbook/preference-precision-vs-strength — C's precision vs the agent's γ.
- /equations — Eq. 4.14 with γ explicit.
The mental move
The theory has two precision knobs — one on observations (Chapter 5: ACh), one on policies (Chapter 5: DA). Chapter 3's softmax is where the second knob gets introduced. Don't skip over it; you'll need it for every subsequent chapter, and the neuroscience hanging off it is some of the book's best content.
Next
Part 22: Session §3.4 — What makes an agent active. Chapter 3's final session. The synthesis of everything Chapter 3 built — and the cleanest one-sentence definition of what separates an Active Inference agent from a passive observer.
⭐ Repo: github.com/TMDLRG/TheORCHESTRATEActiveInferenceWorkbench · MIT license
📖 Active Inference, Parr, Pezzulo, Friston — MIT Press 2022, CC BY-NC-ND: mitpress.mit.edu/9780262045353/active-inference
← Part 20: Session 3.2 · Part 21: Session 3.3 (this post) · Part 22: Session 3.4 → coming soon
