ORCHESTRATE

Active Inference, The Learn Arc — Part 20: Session §3.2 — Epistemic vs Pragmatic Value

Session 3.2 — Epistemic vs pragmatic

Series: The Learn Arc — 50 posts teaching Active Inference through a live BEAM-native workbench. ← Part 19: Session 3.1. This is Part 20.

The session

Chapter 3, §2. Session title: Epistemic vs pragmatic value. Route: /learn/session/3/s2_epistemic_pragmatic.

Session 3.1 wrote the equation. Session 3.2 walks both columns — one at a time — and shows you the crossover point where a good Active Inference agent transitions from exploring to exploiting.

The two columns, one at a time

Risk (pragmatic): KL[ Q(o|π) ‖ C ]

What this quantity is saying, word for word: "under this policy, the observations I'll likely see differ from the observations I'd prefer to see, by this many nats." Small risk = plan lands near preferences. Large risk = plan lands far from preferences.

Risk is zero when the expected observation distribution under the policy exactly matches the preference distribution C. That happens only when the world is fully characterized and you picked a policy that exactly takes you where you want to go.
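The risk term is a one-liner once you have the two distributions in hand. A minimal sketch in Python (the workbench itself is BEAM-native, but the arithmetic is language-agnostic; the distributions below are toy numbers, not anything from the workbench):

```python
import numpy as np

def risk(q_o, c):
    """KL[ Q(o|pi) || C ] in nats: how far the policy's expected
    observations land from the preferred observations."""
    q_o, c = np.asarray(q_o, float), np.asarray(c, float)
    mask = q_o > 0  # 0 * log(0/c) = 0 by convention
    return float(np.sum(q_o[mask] * np.log(q_o[mask] / c[mask])))

C = np.array([0.8, 0.1, 0.1])       # preferences concentrated on outcome 0
print(risk([0.8, 0.1, 0.1], C))     # matches C exactly -> 0.0
print(risk([0.1, 0.1, 0.8], C))     # lands far from C -> large positive
```

Risk hits exactly zero only when the two distributions coincide, which is the fully-characterized-world condition just described.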

Ambiguity (epistemic): E_Q[ H[ P(o|s) ] ]

Word for word: "averaged over what I currently believe about hidden states, how uncertain is my own sensor model about what I'd see?" Small ambiguity = "I know what my sensors would tell me under any of these states." Large ambiguity = "my sensors are noisy or under-specified."

Ambiguity is zero when the sensor model P(o|s) is deterministic for every plausible state — i.e., when seeing the observation tells you unambiguously which state produced it. In a well-lit room with precise sensors, ambiguity is small. In fog, ambiguity is large.
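The ambiguity term is just as direct: compute each state's sensor entropy, then average it under current beliefs. A minimal Python sketch with hypothetical "well-lit" and "fog" sensor models (toy numbers, not the workbench's actual matrices):

```python
import numpy as np

def ambiguity(q_s, A):
    """E_Q[ H[P(o|s)] ]: expected sensor-model entropy under current
    beliefs. A[s] is the observation distribution for hidden state s."""
    q_s, A = np.asarray(q_s, float), np.asarray(A, float)
    logA = np.where(A > 0, np.log(np.where(A > 0, A, 1.0)), 0.0)  # 0*log0 = 0
    H = -np.sum(A * logA, axis=1)  # per-state sensor entropy H[P(o|s)]
    return float(q_s @ H)

q_s = np.array([0.5, 0.5])                   # maximally unsure which state
clear = np.array([[1.0, 0.0], [0.0, 1.0]])   # deterministic sensors
fog   = np.array([[0.5, 0.5], [0.5, 0.5]])   # sensors that tell you nothing
print(ambiguity(q_s, clear))  # 0.0 — each state pins down its observation
print(ambiguity(q_s, fog))    # log 2 ~= 0.693 — every state looks the same
```

Note that ambiguity is a property of the sensor model weighted by belief: no amount of belief sharpening helps if every plausible state has noisy sensors.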

The crossover

Here's where it gets interesting: the two terms can trade off, but not symmetrically.

  • Early in a run (lots of uncertainty about s), ambiguity is high across most policies. Plans that reduce uncertainty — pointing sensors, moving to vantage points, asking questions — have the lowest G. The agent explores.
  • As observations accumulate, Q(s) sharpens. Ambiguity drops across all policies. Risk starts to dominate. Plans that bring expected observations close to C win. The agent exploits.

No tunable ε. No scheduled annealing. The transition is a consequence of the posterior sharpening, which is Chapter 2's machinery running forward.
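The mechanism can be watched directly. A toy Python sketch (illustrative numbers only: two hidden states, state 0 with precise sensors, state 1 noisy, preferences favoring observation 0) that sweeps Q(s) from flat to sharp and prints both columns of G — there is no schedule or annealing anywhere in the loop:

```python
import numpy as np

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

def sensor_entropy(A):
    logA = np.where(A > 0, np.log(np.where(A > 0, A, 1.0)), 0.0)
    return -np.sum(A * logA, axis=1)

# Toy world: state 0 has precise sensors, state 1 noisy ones.
A = np.array([[0.95, 0.05],
              [0.50, 0.50]])
C = np.array([0.9, 0.1])  # preferences favor observation 0

# Q(s) sharpening onto state 0, as Chapter 2's belief updating would produce.
history = []
for q0 in (0.50, 0.70, 0.90, 0.99):
    q_s = np.array([q0, 1 - q0])
    q_o = q_s @ A  # expected observations under the beliefs
    r, a = kl(q_o, C), float(q_s @ sensor_entropy(A))
    history.append((r, a))
    print(f"Q(s0)={q0:.2f}  risk={r:.3f}  ambiguity={a:.3f}")
```

Every row comes from the same two formulas; the only thing that changes is the posterior. That is the whole crossover story in miniature: the ambiguity term falls out of the average as Q(s) concentrates on states whose sensors are informative.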

The side-by-side recipe

Recipe — epistemic info gain vs reward

/cookbook/epistemic-info-gain-vs-reward runs two agents in the same world. One has a sharp preference distribution (low-temperature C concentrated on one cell). The other has a diffuse preference (high-temperature C spread across many cells). Same world, same agent architecture, different C.

The sharp-preference agent beelines. The diffuse-preference agent wanders and gathers information. The divergence is predicted by Chapter 3 without any parameter tweak — the shape of C changes which column of G dominates, which changes what the softmax selects.
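The shape-of-C effect is easy to reproduce in miniature. A hedged Python sketch (the policy names and their predicted-observation distributions are invented for illustration; the real recipe lives at the cookbook route): build C as a softmax over a fixed utility at two temperatures and compare each policy's risk.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def kl(p, q):
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

# Hypothetical predicted-observation distributions, one per policy:
policies = {
    "beeline": np.array([0.85, 0.10, 0.05]),  # heads straight for outcome 0
    "survey":  np.array([0.34, 0.33, 0.33]),  # spreads observations around
}

utility = np.array([1.0, 0.0, 0.0])  # outcome 0 is the "goal cell"

results = {}
for temp in (0.1, 5.0):  # low temp = sharp C, high temp = diffuse C
    C = softmax(utility / temp)
    G = {name: kl(q_o, C) for name, q_o in policies.items()}  # risk column only
    results[temp] = G
    rounded = {k: round(v, 3) for k, v in G.items()}
    print(f"temp={temp}: {rounded}")
```

Under the sharp C the beeline policy carries the lower risk; under the diffuse C the survey policy does. Same world, same arithmetic — only C changed, so the softmax over −G selects differently.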

An important nuance

The book is careful here: ambiguity is not the same as "exploration." Ambiguity is a specific, computable quantity — the expected entropy of the sensor model averaged over Q(s). Exploration is a behavior that can arise from high ambiguity, but it's not a mechanism; it's a consequence.

That distinction matters in practice. When an RL engineer says "we add an exploration bonus," they're admitting their theory doesn't predict exploration, so they bolt it on. Chapter 3's whole move is to say: no, you don't need to bolt it on. The epistemic term is in the math already.

The mini-recipe corpus

Session 3.2 foregrounds three runnable demos, each isolating one side of the crossover.

The concepts this session surfaces

  • Pragmatic value — the negative of the risk term.
  • Epistemic value — the negative of the ambiguity term.
  • Crossover — the point where risk overtakes ambiguity as the dominant term in G.
  • Sensor entropy — H[P(o|s)], the per-state uncertainty.

The quiz

Q: An Active Inference agent in a new environment initially explores and then exploits. Why?

  • ☐ It has a schedule that switches modes at a fixed tick.
  • ☐ Ambiguity dominates G early; risk dominates once beliefs sharpen. ✓
  • ☐ A meta-controller toggles between two policies.
  • ☐ The softmax temperature anneals.

Why: There's no schedule, no meta-controller, no annealing. The crossover falls out of Q(s) sharpening over time, which reduces ambiguity across all policies, which lets risk dominate policy selection. One functional, one softmax — the behavior emerges.

Run it yourself

The mental move

Active Inference's most impressive trick is that it derives the explore-exploit transition from math you already had. No new mechanism. No new scalar. Just the decomposition Session 3.2 spells out. That's why this chapter can take its own rhetoric seriously: the claim is earned by the math rather than asserted.

Next

Part 21: Session §3.3 — The softmax policy. Chapter 3's third session. The precision parameter on Q(π) = softmax(−G/τ) is itself a meaningful biological quantity — and the hinge on which Chapter 5's neuromodulator story swings. We unpack it.


⭐ Repo: github.com/TMDLRG/TheORCHESTRATEActiveInferenceWorkbench · MIT license

📖 Active Inference, Parr, Pezzulo, Friston — MIT Press 2022, CC BY-NC-ND: mitpress.mit.edu/9780262045353/active-inference

Part 19: Session 3.1 · Part 20: Session 3.2 (this post) · Part 21: Session 3.3 → coming soon
