ORCHESTRATE

Active Inference, The Learn Arc — Part 21: Session §3.3 — The Softmax Policy and Its Precision Knob

Session 3.3 — The softmax policy

Series: The Learn Arc — 50 posts teaching Active Inference through a live BEAM-native workbench. ← Part 20: Session 3.2. This is Part 21.

The session

Chapter 3, §3. Session title: The softmax policy. Route: /learn/session/3/s3_softmax_policy.

You have G(π) for every candidate policy. How do you turn a vector of G-values into a distribution over plans? With a softmax. This session is about the softmax — specifically, the precision parameter on that softmax — and why it turns out to be one of the most important parameters in the whole theory.

The softmax

Q(π) = softmax( −γ · G(π) )

Here γ is the policy precision and the temperature is τ = 1/γ: high precision = low temperature = a confident policy posterior.

  • γ → ∞: the softmax becomes argmax. The agent picks the best policy with probability 1. No stochasticity.
  • γ → 0: the softmax becomes uniform. The agent picks policies at random. No information used.
  • Middle values: the agent samples policies with probability proportional to exp(−γ·G(π)) — better policies are favored in proportion to their expected-free-energy advantage, but none is ruled out.
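All three regimes fall out of a few lines of NumPy. This is a toy sketch, not the Workbench's BEAM-native implementation, and the G-values are made up:

```python
import numpy as np

def policy_posterior(G, gamma):
    """Q(pi) = softmax(-gamma * G): lower expected free energy, higher probability."""
    logits = -gamma * np.asarray(G, dtype=float)
    logits -= logits.max()                 # shift for numerical stability
    w = np.exp(logits)
    return w / w.sum()

G = [1.0, 2.0, 4.0]                        # hypothetical G-values for 3 policies

print(policy_posterior(G, gamma=0.0))      # uniform: every policy gets 1/3
print(policy_posterior(G, gamma=2.0))      # graded: most mass on the argmin-G policy
print(policy_posterior(G, gamma=50.0))     # effectively argmax: ~[1, 0, 0]
```

The max-subtraction trick changes nothing mathematically (softmax is shift-invariant) but keeps the exponentials from overflowing at large γ.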

In Chapter 4 Eq. 4.14 you'll see the full form (Q(π) = softmax(−γG − F)). Session 3.3 focuses on γ alone.

The mid-temperature sweet spot

Why not always run γ = ∞? Because you haven't actually computed G perfectly. Your Q(s) is approximate. Your model is noisy. If you put all your posterior mass on the argmin-G policy and that argmin was wrong, you pay in full. The softmax's stochasticity is insurance against your own approximation error.

This is exactly the exploration-exploitation story from a different angle. Low γ = hedging, exploring. High γ = committing, exploiting. The Session 3.2 crossover is about how the two terms of G trade off over time; Session 3.3 is about how confidently to commit even when G is well-known.
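Here is a concrete instance of that insurance, with toy numbers (not from the Workbench): suppose the true G ranks policy A best, but approximation noise flips the order in your estimate. Argmax on the estimate abandons A entirely; a moderate-γ softmax keeps substantial mass on it.

```python
import numpy as np

def policy_posterior(G, gamma):
    # softmax(-gamma * G), shifted by the minimum for numerical stability
    w = np.exp(-gamma * (np.asarray(G) - np.min(G)))
    return w / w.sum()

G_true = [1.0, 1.2]       # policy A is genuinely better...
G_est  = [1.3, 1.1]       # ...but approximation noise flipped the estimated order

q = policy_posterior(G_est, gamma=2.0)
print(q)                  # ~[0.40, 0.60]: A keeps ~40% of the mass despite the bad estimate
# argmax on G_est would give A exactly 0 -- paying the wrong argmin in full
```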

Why biology cares

Chapter 5 turns γ into a testable claim: dopamine sets policy precision. High dopamine → high γ → sharp, confident action selection. Low dopamine → low γ → diffuse, tentative action selection.

Under this mapping, the behavioral signature of low dopamine is exactly the signature of a small γ in the softmax — indecision, hesitation, behavioral variability. Which is exactly what depletion studies find. Chapter 5 makes the mapping rigorous; Session 3.3 plants the parameter.

The sweep

/cookbook/planning-softmax-temperature is the recipe that sweeps γ across more than an order of magnitude. Run it in Studio and watch the agent transition from indecisive (γ = 0.5) to sharply committed (γ = 8). Same world, same preferences; only the precision changes.

The behavior spectrum is instructive:

  • γ = 0.5 — agent visits most cells roughly equally. High behavioral entropy.
  • γ = 2 — agent prefers the goal-ward direction but sometimes wanders.
  • γ = 4 — agent is mostly pragmatic; occasional epistemic detours.
  • γ = 8 — agent beelines. No detours even when ambiguity is high.

None of these is "right." Which γ is best depends on how good your model is (Chapter 9 tells you how to measure that).
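One way to quantify that spectrum is the entropy of the policy posterior: sweep γ and watch the distribution sharpen. A minimal sketch with made-up G-values (the real sweep lives in the cookbook recipe):

```python
import numpy as np

def policy_posterior(G, gamma):
    w = np.exp(-gamma * (np.asarray(G) - np.min(G)))
    return w / w.sum()

def entropy(q):
    q = q[q > 0]                            # 0 * log(0) = 0 by convention
    return float(-(q * np.log(q)).sum())

G = [1.0, 1.5, 2.0, 3.5]                    # hypothetical G-values
for gamma in [0.5, 2.0, 4.0, 8.0]:          # the same values as the sweep above
    H = entropy(policy_posterior(G, gamma))
    print(f"gamma={gamma:4}: posterior entropy = {H:.3f} nats")
```

For fixed G-values the entropy falls monotonically as γ grows — the numerical face of "high behavioral entropy" at γ = 0.5 and "beelines" at γ = 8.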

The horizon knob alongside

/cookbook/planning-horizon-depth sweeps the other planning knob — how far ahead the agent expands policies. Deeper horizons see around corners. Shallower horizons are cheaper. The two knobs (γ and horizon depth) interact: high-γ agents benefit more from deep horizons because they commit hard to whatever the expansion returns.

The Workbench renders both knobs live; you can watch their interaction on the policy-posterior panel.

The runtime spec

Every cookbook recipe has policy_depth and preference_strength fields. preference_strength is approximately γ (up to a constant factor absorbed into C). You can read these off any recipe's Runtime block:

Runtime:
  agent_module: AgentPlane.ActiveInferenceAgent
  world: tiny_open_goal
  horizon: 5
  policy_depth: 5
  preference_strength: 4.0

Change preference_strength in the Builder canvas; save and re-instantiate. The agent's policy posterior tightens or loosens exactly as predicted.
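You can predict the direction of that tightening offline. Treating preference_strength as γ (the approximation above) and using hypothetical G-values, a sketch:

```python
import numpy as np

def policy_posterior(G, gamma):
    w = np.exp(-gamma * (np.asarray(G) - np.min(G)))
    return w / w.sum()

G = [1.0, 1.4, 2.2, 3.0]                    # hypothetical G-values for the expanded policies

loose = policy_posterior(G, gamma=1.0)      # preference_strength: 1.0
tight = policy_posterior(G, gamma=4.0)      # preference_strength: 4.0, as in the Runtime block
print(loose.max(), tight.max())             # mass on the best policy: ~0.47 vs ~0.83
```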

The concepts this session surfaces

  • Policy precision (γ) — softmax temperature inverse.
  • Stochastic policy — sampling from Q(π).
  • Deterministic policy — argmax of Q(π).
  • Behavioral variability — emerges from low γ.

The quiz

Q: An agent with a well-calibrated model but low γ will:

  • ☐ Fail catastrophically on most tasks.
  • ☐ Behave more variably than the argmax-optimal agent would. ✓
  • ☐ Ignore the ambiguity term in G.
  • ☐ Learn faster because of increased exploration.

Why: Low γ flattens Q(π), so the agent samples from a broader distribution over policies. The agent's average policy is still near the argmin-G one, but individual choices have more variance. This is the behavioral signature of (e.g.) dopamine depletion.
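The sampling story in that answer is easy to check with toy G-values: at low γ the argmin-G policy is still the most frequent choice, but every policy gets sampled.

```python
import numpy as np

def policy_posterior(G, gamma):
    w = np.exp(-gamma * (np.asarray(G) - np.min(G)))
    return w / w.sum()

G = np.array([1.0, 1.5, 2.0])               # hypothetical G-values
q = policy_posterior(G, gamma=0.5)          # low precision: flattened posterior

rng = np.random.default_rng(0)
picks = rng.choice(len(G), size=10_000, p=q)
counts = np.bincount(picks, minlength=len(G))
print(counts)  # argmin-G policy is the mode, yet all three draw thousands of samples
```

Average behavior still points at the best policy; individual choices carry the extra variance.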


The mental move

The theory has two precision knobs — one on observations (Chapter 5: ACh), one on policies (Chapter 5: DA). Chapter 3's softmax is where the second knob gets introduced. Don't skip over it; you'll need it for every subsequent chapter, and the neuroscience hanging off it is some of the book's best content.

Next

Part 22: Session §3.4 — What makes an agent active. Chapter 3's final session. The synthesis of everything Chapter 3 built — and the cleanest one-sentence definition of what separates an Active Inference agent from a passive observer.


⭐ Repo: github.com/TMDLRG/TheORCHESTRATEActiveInferenceWorkbench · MIT license

📖 Active Inference, Parr, Pezzulo, Friston — MIT Press 2022, CC BY-NC-ND: mitpress.mit.edu/9780262045353/active-inference

Part 20: Session 3.2 · Part 21: Session 3.3 (this post) · Part 22: Session 3.4 → coming soon
