Active Inference — The Learn Arc, Part 38: Session §7.4 — Hierarchical active inference


Series: The Learn Arc — 50 posts through the Active Inference workbench.
Previous: Part 37 — Session §7.3: Learning A and B

Hero line. Hierarchy is not a new algorithm. It is the same Eq 4.13 update, run on a taller factor graph — with the top level's posterior acting as the bottom level's predicted prior.


Stacking POMDPs

One POMDP models one timescale. Stack two, and the top models slow/abstract state (which room you are in), while the bottom models fast/concrete state (which tile you are stepping on). Stack three, and the top gets even slower.

The miracle of Session 7.4 is that the stacking requires zero new math. Every layer runs Eq 4.13 on its own two-node subgraph. The coupling between layers is just a longer graph with more edges.
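
To see how little machinery that is, here is the one update in plain Elixir, the workbench's own language. A hedged sketch: I am assuming Eq 4.13 denotes the usual categorical state update, posterior = softmax(ln prior + ln Aᵀo); the Eq413 module and its helpers are mine, not the workbench's API. Save it as eq413.exs and run elixir eq413.exs.

# eq413.exs — hedged sketch of the per-layer state update.
# Assumes Eq 4.13 is the standard categorical update:
#   q(s) = softmax(ln prior(s) + ln (A^T o)(s))
# Names are illustrative, not the workbench's API.
defmodule Eq413 do
  @eps 1.0e-16

  # Numerically safe softmax over a list of log-values.
  def softmax(logs) do
    m = Enum.max(logs)
    exps = Enum.map(logs, fn x -> :math.exp(x - m) end)
    z = Enum.sum(exps)
    Enum.map(exps, &(&1 / z))
  end

  # A^T o: likelihood of each state given an observation distribution.
  # `a` is p(o | s) as a list of rows (one row per observation).
  def likelihood(a, o) do
    a
    |> Enum.zip(o)
    |> Enum.map(fn {row, p_o} -> Enum.map(row, &(&1 * p_o)) end)
    |> Enum.zip_with(&Enum.sum/1)
  end

  # One Eq-4.13-style step: combine prior and likelihood in log space.
  def update(prior, a, o) do
    logs =
      Enum.zip_with(prior, likelihood(a, o), fn p, l ->
        :math.log(p + @eps) + :math.log(l + @eps)
      end)

    softmax(logs)
  end
end

# Flat prior, two states, an observation that favours state 1:
a = [[0.8, 0.1], [0.2, 0.9]]   # p(o | s): rows = observations
IO.inspect(Eq413.update([0.5, 0.5], a, [0.0, 1.0]))
# => roughly [0.18, 0.82]

Every layer in the stack calls exactly this update; only the prior and the observation change hands between layers.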

Five beats

  1. Top-down message = predicted prior. The upper layer's posterior over its state is the prior factor D on the lower layer's state. Abstract belief biases concrete perception by feeding the message into the same softmax. (Beats 1 and 2 are sketched in code after this list.)

  2. Bottom-up message = observation to the layer above. The lower layer's posterior is the observation the upper layer sees through its own A. Concrete inference becomes abstract evidence.

  3. Timescales fall out of prior stiffness. The upper layer's B transitions are near-identity — abstract state changes slowly. The lower layer's B moves every tick. No scheduler, no explicit timescale parameter; the structure of B does the work. (See the timescale sketch after this list.)

  4. Policies at every layer. Each layer computes its own EFE over policies at its own timescale. The top picks "go to the kitchen"; the bottom picks "step north." The top never spells out footsteps. That is the whole point. (A risk-only EFE sketch follows the list.)

  5. Learning too. Dirichlet counts still apply, layer by layer. The top layer's A learns the mapping "abstract state → observed lower-layer posterior." Hierarchy makes the slow stats tractable by aggregating evidence before it reaches the slow learner. (A Dirichlet sketch follows the list.)
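
Beats 1 and 2 in code — a hedged sketch that reuses the Eq413 module above (run the two files together). The upper posterior is pushed down as the lower layer's prior; the lower posterior comes back up as the upper layer's observation through its own A. The room/tile matrices are invented for illustration, not the cookbook recipe's values.

# two_layer.exs — illustrative top-down / bottom-up coupling.
# Reuses Eq413 from the sketch above. Two rooms, four tiles.
matvec = fn m, v ->
  Enum.map(m, fn row -> Enum.zip_with(row, v, &(&1 * &2)) |> Enum.sum() end)
end

# Beat 1 (top-down): upper posterior -> predicted prior over tiles.
# `down` maps rooms (columns) to the tiles they contain (rows).
down = [
  [0.45, 0.05],
  [0.45, 0.05],
  [0.05, 0.45],
  [0.05, 0.45]
]

upper_posterior = [0.9, 0.1]                  # "probably room 0"
lower_prior = matvec.(down, upper_posterior)  # replaces the lower D

# The lower layer runs the same Eq 4.13 update with that prior.
a_low = [
  [0.7, 0.1, 0.1, 0.1],
  [0.1, 0.7, 0.1, 0.1],
  [0.1, 0.1, 0.7, 0.1],
  [0.1, 0.1, 0.1, 0.7]
]

lower_posterior = Eq413.update(lower_prior, a_low, [0.0, 1.0, 0.0, 0.0])

# Beat 2 (bottom-up): the lower posterior is the upper layer's
# observation, seen through the upper layer's own A.
a_up = [
  [0.45, 0.05],
  [0.45, 0.05],
  [0.05, 0.45],
  [0.05, 0.45]
]

upper_next = Eq413.update(upper_posterior, a_up, lower_posterior)

IO.inspect(lower_posterior, label: "lower q(s)")  # abstract belief biased it
IO.inspect(upper_next, label: "upper q(s)")       # concrete evidence sharpened it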
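
Beat 3 can be verified in a few lines: iterate q ← Bq with a near-identity B and with a fast-mixing B, and watch how quickly each belief forgets where it started. The transition matrices are invented.

# timescales.exs — why near-identity B means a slow layer.
matvec = fn m, v ->
  Enum.map(m, fn row -> Enum.zip_with(row, v, &(&1 * &2)) |> Enum.sum() end)
end

b_slow = [[0.98, 0.02], [0.02, 0.98]]   # upper layer: rooms persist
b_fast = [[0.60, 0.40], [0.40, 0.60]]   # lower layer: tiles churn

step = fn b, q, n -> Enum.reduce(1..n, q, fn _, acc -> matvec.(b, acc) end) end

q0 = [1.0, 0.0]
IO.inspect(step.(b_slow, q0, 10), label: "stiff B, 10 ticks")
# => ~[0.83, 0.17]: the abstract belief has barely drifted
IO.inspect(step.(b_fast, q0, 10), label: "loose B, 10 ticks")
# => ~[0.50, 0.50]: the concrete belief has fully mixed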
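
Beat 4, risk-only for brevity: each layer scores its own actions by the KL divergence between predicted outcomes and its preference vector C. This is a hedged sketch — the ambiguity term of the full EFE is dropped, and every matrix is invented.

# efe.exs — each layer runs its own EFE at its own timescale.
# Risk-only: G(action) = KL(predicted outcomes || preferences C).
matvec = fn m, v ->
  Enum.map(m, fn row -> Enum.zip_with(row, v, &(&1 * &2)) |> Enum.sum() end)
end

kl = fn q, p ->
  Enum.zip_with(q, p, fn a, b -> if a > 0.0, do: a * :math.log(a / b), else: 0.0 end)
  |> Enum.sum()
end

# Upper layer: two rooms, actions :stay and :switch, prefers room 1.
b_up = %{
  stay: [[0.98, 0.02], [0.02, 0.98]],
  switch: [[0.02, 0.98], [0.98, 0.02]]
}
a_up = [[0.9, 0.1], [0.1, 0.9]]   # p(o | s)
c_up = [0.1, 0.9]                 # preferred outcomes
q_up = [0.9, 0.1]                 # current belief: room 0

for {action, b} <- b_up do
  predicted_obs = matvec.(a_up, matvec.(b, q_up))
  IO.puts("upper #{action}: G = #{Float.round(kl.(predicted_obs, c_up), 3)}")
end
# :switch wins (G ≈ 0.04 vs ≈ 1.39). The room-level plan never
# mentions a footstep; the tile layer scores those separately.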
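
And beat 5: the upper layer's A accumulates Dirichlet counts from (lower posterior, upper posterior) pairs, exactly the shape of update a flat model applies to raw observations. A hedged sketch with invented numbers, not the workbench's API.

# dirichlet.exs — layer-by-layer learning of the upper A.
# counts[o][s] += lower_posterior[o] * upper_posterior[s]
update_counts = fn counts, lower_q, upper_q ->
  Enum.zip_with(counts, lower_q, fn row, p_o ->
    Enum.zip_with(row, upper_q, fn c, p_s -> c + p_o * p_s end)
  end)
end

# Normalise columns to read the learned p(o | s) off the counts.
to_a = fn counts ->
  col_sums = Enum.zip_with(counts, &Enum.sum/1)
  Enum.map(counts, fn row -> Enum.zip_with(row, col_sums, &(&1 / &2)) end)
end

counts =
  [[1.0, 1.0], [1.0, 1.0]]                    # flat Dirichlet prior
  |> update_counts.([0.9, 0.1], [1.0, 0.0])   # evidence while in room 0
  |> update_counts.([0.8, 0.2], [1.0, 0.0])
  |> update_counts.([0.1, 0.9], [0.0, 1.0])   # evidence while in room 1

IO.inspect(to_a.(counts), label: "learned upper A")
# column 0 now leans to lower-state 0, column 1 to lower-state 1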

Why it matters

Flat POMDPs explode combinatorially as the horizon grows: with 4 actions, planning 20 steps ahead means weighing 4^20 ≈ 10^12 candidate policies, while a two-layer agent that plans, say, 4 slow ticks at the top and 5 fast ticks at the bottom weighs 4^4 + 4^5 ≈ 1,300. Hierarchy is the escape hatch: each layer only has to model what happens at its own timescale. The framework did not need a new equation to get there. That is the power of "same message, taller graph."

Quiz

  • In a two-layer stack, what replaces the top layer's initial prior D at t > 0?
  • Why does making B near-identity at the top layer produce slower timescales?
  • Where does the lower layer's EFE get its preference vector C from when the upper layer is active?

Run it yourself

mix phx.server
# open http://localhost:4000/learn/session/7/s4_hierarchical

Cookbook recipe: hierarchy/two-layer-room — a two-layer agent: abstract layer picks rooms, concrete layer picks tiles. Watch the upper posterior drift on the scale of minutes while the lower posterior twitches every step.

Next

Part 39: Session §7.5 — Worked example. We put it all together: Dirichlet A and B, two layers, live EFE. The capstone of Chapter 7 — the longest session in the book and the one that makes every earlier piece earn its keep.


Powered by The ORCHESTRATE Active Inference Learning Workbench — Phoenix/LiveView on pure Jido.
