Training a Nematode with Quantum Reinforcement Learning

Chris Zaharia — Sat, 20 Sep 2025 17:06:14 +0000

Can a tiny quantum circuit learn the survival tricks of a tiny worm?

In this project I set out to model the foraging behavior of Caenorhabditis elegans — a 1 mm nematode with a famously compact nervous system — using reinforcement learning (RL). I built two agents that learn chemotaxis (moving up a chemical gradient toward food) in a simple grid-world: one with a classical neural policy, and one with a parameterized quantum circuit (PQC) acting as the "brain." I ran many independent sessions for both agents; the representative results shown below were typical across runs.

I also validated the quantum path end-to-end with short hardware tests (with and without error suppression) to ensure the pipeline works, but all conclusions here come from a quantum circuit simulator. As circuits grow (more qubits, deeper layers), simulators on CPU/GPU will eventually become too slow and memory-hungry — at that point I'll lean more on real devices and revisit the comparison in richer settings.

Why a worm, and why this worm?

C. elegans is a sweet spot for biologically inspired AI: its entire nervous system is mapped (302 neurons wired by roughly 7,000 synapses), yet it still displays rich behaviors like chemotaxis and learning. That scale makes it interpretable and simulable, while still biologically meaningful.

Chemotaxis in C. elegans is well-studied: the worm climbs gradients by reacting to changes in odor concentration and steering accordingly. That gives a clean behavioral target to encode in an RL environment.

Environment: a minimal, testable "world"

I trained a worm-agent in a small grid-world (e.g., 10×10). Each episode starts the agent at one corner and places food in the opposite corner. The agent has orientation (up/left/right/down), a short body (to discourage dithering), and four actions:

Move forward
Turn left
Turn right
Stay in place

State (2 features).

I purposely keep observations minimal and biologically motivated:

Gradient strength: a normalized inverse distance to the food (optionally shaped with a smooth nonlinearity).
Relative direction: bearing from the agent's current heading to the food, normalized to [-1, 1].
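
As a rough illustration of these two features, here is a minimal sketch. The `HEADINGS` table, the `observe` signature, the Euclidean distance, and the `1 / (1 + distance)` form are all assumptions made for the example; the project's actual shaping may differ.

```python
import math

# Hypothetical heading encoding: unit vectors on the grid (x to the right, y up).
HEADINGS = {"up": (0, 1), "right": (1, 0), "down": (0, -1), "left": (-1, 0)}

def observe(agent_pos, heading, food_pos):
    """Return (gradient_strength, relative_direction) for one step."""
    dx = food_pos[0] - agent_pos[0]
    dy = food_pos[1] - agent_pos[1]
    dist = math.hypot(dx, dy)

    # Assumed form of "normalized inverse distance": 1.0 on the food, falling toward 0.
    gradient_strength = 1.0 / (1.0 + dist)
    if dist == 0:
        return gradient_strength, 0.0

    # Signed angle from the heading vector to the food vector, then scaled to [-1, 1].
    hx, hy = HEADINGS[heading]
    angle = math.atan2(hx * dy - hy * dx, hx * dx + hy * dy)
    return gradient_strength, angle / math.pi
```

With this convention, food straight ahead gives a relative direction of 0, food directly to the agent's right gives about -0.5, and food directly behind gives 1.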

Rewards.

Large bonus on reaching food
Small step penalty (encourages efficiency)
Penalties for collisions or oscillation

RL loop. Per step: compute state → sample policy → step environment → get reward → (store) → at episode end, update policy.

A tiny slice of the training loop (simplified)

state = env.observe()  # (gradient_strength, relative_direction)
probs = policy(state)  # classical softmax or quantum measurement distribution
action = sample(probs)

next_state, reward, done, info = env.step(action)
episode.append((state, action, reward))
state = next_state

if done or steps >= max_steps:
    returns = compute_returns(episode, gamma=0.99, baseline=True)
    policy_update(policy, episode, returns, entropy_bonus=1e-3)
    episode.clear()
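
The `compute_returns` helper is not shown above. A minimal sketch of what such a step typically does, assuming plain discounted Monte Carlo returns and an episode-mean baseline (the function names here are hypothetical, and the project's baseline may be different):

```python
def discounted_returns(rewards, gamma=0.99):
    """G_t = r_t + gamma * G_{t+1}, computed backward over one episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def subtract_mean_baseline(returns):
    """Center returns on their episode mean (one simple baseline choice)."""
    mean = sum(returns) / len(returns)
    return [g - mean for g in returns]

rewards = [0.0, 0.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [0.25, 0.5, 1.0]
```

Subtracting a baseline leaves the expected policy gradient unchanged while reducing its variance, which is why it pairs naturally with REINFORCE.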

Classical brain: a small neural policy that just works

My classical policy is a 2-hidden-layer MLP (64 units each, ReLU), mapping the 2-D state to 4 action logits. It's trained with REINFORCE (Monte Carlo policy gradient) using a variance-reduction baseline and a small entropy bonus to sustain exploration.

# PyTorch policy (key bits)
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, input_dim=2, hidden=64, actions=4):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden)
        self.fc2 = nn.Linear(hidden, hidden)
        self.fc3 = nn.Linear(hidden, actions)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        logits = self.fc3(x)
        return torch.softmax(logits, dim=-1)

I also tried a smaller classical network (two hidden layers of 32 units). It does learn, but consistently takes more episodes to converge to a strong policy. The ~4,600-parameter model (2×64) reached good performance faster; different classical designs (e.g., shallower nets, alternative activations, or tiny residual MLPs) might beat this tradeoff and are worth exploring in future work.
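
The parameter counts can be checked with a few lines of arithmetic. This sketch assumes plain fully connected layers with biases, as in `PolicyNet`; the helper name is made up for the example.

```python
def mlp_param_count(layer_sizes):
    """Weights plus biases for a fully connected net with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

print(mlp_param_count([2, 64, 64, 4]))  # 4612
print(mlp_param_count([2, 32, 32, 4]))  # 1284
```

So the 2×64 network has 4,612 trainable parameters, and the 2×32 variant has 1,284.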

Quantum brain: a 2-qubit PQC as the policy

The quantum policy replaces neurons/weights with qubits and gates. The PQC ingests the same two features, encodes them as rotations, mixes information via entanglement, and measures to produce one of four outcomes (00, 01, 10, 11) mapped to the 4 actions. Learning adjusts a small set of rotation angles.

Encoding. Strength → Rx on qubit 0; relative direction → Ry on qubit 1.
Trainable layers. Two layers of per-qubit (Rx, Ry, Rz) plus CZ entanglement.
Output. Measure both qubits; the Born rule turns amplitudes into action probabilities (the policy is inherently stochastic).
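Gradients. I use the parameter-shift rule to compute analytic derivatives of the measurement probabilities with respect to each gate angle (exact in expectation, though every estimate carries shot noise at finite shots); a classical optimizer updates parameters.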
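
To make the Born-rule step concrete, here is a small NumPy statevector sketch of the same layout (feature encoding, then two layers of Rx/Ry/Rz per qubit followed by CZ). It is an illustration under assumptions, not the project's code: the helper names are hypothetical, and the basis index is taken as q0 + 2·q1, which matches Qiskit's little-endian bit order.

```python
import numpy as np

def rx(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def ry(t):
    c, s = np.cos(t / 2), np.sin(t / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(t):
    return np.array([[np.exp(-1j * t / 2), 0], [0, np.exp(1j * t / 2)]])

I2 = np.eye(2, dtype=complex)
CZ = np.diag([1, 1, 1, -1]).astype(complex)

def on_q0(gate):
    # Qubit 0 is the low bit of the basis index, so it is the right factor.
    return np.kron(I2, gate)

def on_q1(gate):
    return np.kron(gate, I2)

def action_probs(strength, reldir, theta):
    """Return the 4 outcome probabilities |amplitude|^2 for 12 trainable angles."""
    state = np.zeros(4, dtype=complex)
    state[0] = 1.0                                # start in |00>
    state = on_q0(rx(strength)) @ state           # feature encoding
    state = on_q1(ry(reldir)) @ state
    for layer in range(2):
        t = theta[6 * layer: 6 * layer + 6]
        for gate, angle in zip((rx, ry, rz), t[:3]):
            state = on_q0(gate(angle)) @ state
        for gate, angle in zip((rx, ry, rz), t[3:]):
            state = on_q1(gate(angle)) @ state
        state = CZ @ state                        # entangling gate
    return np.abs(state) ** 2                     # Born rule

probs = action_probs(0.3, -0.4, np.full(12, 0.1))
print(probs, probs.sum())
```

The four probabilities always sum to 1, so sampling an action is just sampling an index from this vector; on a simulator or device the same distribution is estimated from shot counts instead.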

# Building the policy circuit (condensed)
from qiskit import QuantumCircuit
from qiskit.circuit import Parameter

theta = [Parameter(f"θ{i}") for i in range(12)]   # 2 qubits × 2 layers × 3 rotations
str_p = Parameter("strength")
dir_p = Parameter("reldir")

qc = QuantumCircuit(2, 2)

# Feature encoding
qc.rx(str_p, 0)
qc.ry(dir_p, 1)

# Layer 1
qc.rx(theta[0], 0); qc.ry(theta[1], 0); qc.rz(theta[2], 0)
qc.rx(theta[3], 1); qc.ry(theta[4], 1); qc.rz(theta[5], 1)
qc.cz(0, 1)

# Layer 2
qc.rx(theta[6], 0); qc.ry(theta[7], 0); qc.rz(theta[8], 0)
qc.rx(theta[9], 1); qc.ry(theta[10],1); qc.rz(theta[11],1)
qc.cz(0, 1)

# Measure -> 2-bit action
qc.measure([0,1], [0,1])

Running it. During training and evaluation I primarily use the Qiskit Aer simulator (noiseless, configurable shots). For short, end-to-end hardware checks I also ran via Qiskit Runtime Sampler on IBM backends, and repeated the same runs using Q-CTRL Fire Opal error-suppressed execution. The hardware tests behaved as expected, but all reported results below are from the simulator; I’ll revisit real device-level benchmarking when I scale circuits beyond practical simulation limits.

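# Aer run (ideal simulator)
from qiskit_aer import AerSimulator
from qiskit import transpile

backend = AerSimulator()

# Parameters must be bound to numbers before execution.
# These values are placeholders; in training they come from the current state and weights.
values = {str_p: 0.5, dir_p: -0.25, **{t: 0.1 for t in theta}}
bound = transpile(qc.assign_parameters(values), backend)

job = backend.run(bound, shots=2048)
counts = job.result().get_counts()
# map counts to action probabilities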
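
The last comment above leaves the mapping open. One way to do it, as a sketch: normalize the counts by the number of shots and read each bitstring as a binary index. The `ACTIONS` ordering and the function name are assumptions for illustration. Note that Qiskit prints classical bit 0 as the rightmost character, so `"01"` means bit 0 is 1.

```python
# Hypothetical action ordering; the project's mapping may differ.
ACTIONS = ["forward", "left", "right", "stay"]

def counts_to_action_probs(counts, n_actions=4):
    """Turn a {bitstring: count} dict into a probability per action index."""
    shots = sum(counts.values())
    probs = [0.0] * n_actions
    for bitstring, n in counts.items():
        # Spaces can appear between registers in count keys; strip them first.
        probs[int(bitstring.replace(" ", ""), 2)] += n / shots
    return probs

counts = {"00": 1024, "01": 512, "11": 512}
print(counts_to_action_probs(counts))  # [0.5, 0.25, 0.0, 0.25]
```

Outcomes that never appear in the counts simply keep probability 0, so the result always has one entry per action.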

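# IBM QPU via Runtime Sampler (short tests only)
from qiskit import transpile
from qiskit_ibm_runtime import QiskitRuntimeService, Sampler
service = QiskitRuntimeService()
backend = service.least_busy(operational=True, simulator=False, min_num_qubits=2)
isa_circuit = transpile(bound, backend)  # re-transpile for the device's native gates and layout
sampler = Sampler(mode=backend)
job = sampler.run([isa_circuit], shots=1024)
dist = job.result()[0].data.c.get_counts()  # "c" is the default classical register name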

# Fire Opal "Performance Management" (noise-suppressed runs)
# (Function API mirrors Sampler/Estimator usage)
# https://quantum.cloud.ibm.com/docs/guides/q-ctrl-performance-management

Results (many seeds; representative shown)

I trained and evaluated many runs per agent type (different random seeds, corners, and curricula). The table below summarizes a representative session — similar to the median across repeats:

Metric (10×10; corner→opposite)	Classical (MLP 2×64)	Quantum (2-qubit PQC; sim)
Success rate (≤ max steps)	100%	≈ 99%
Avg. steps to food	≈ 34	≈ 37
Trainable parameters	~4,600	12

Two qualitative differences showed up across runs:

Learning dynamics. The classical policy converged faster and more stably (smoother curves). The quantum policy's returns had higher variance early on — that's the flip side of built-in exploration from sampling measurement outcomes.
Model capacity vs. compactness. The quantum policy used roughly 380× fewer trainable parameters (12 vs. ~4,600) yet reached near-parity on this task, consistent with findings that variational quantum policies can solve RL tasks with small parameter spaces.

A note on classical size: yes, a smaller MLP (e.g., 2×32) does learn this task, but in my runs it took longer to settle on a good strategy than 2×64. I didn't exhaust alternative architectures, and I expect there are more parameter-efficient classical designs worth testing in future work.

Reward shaping & "good taste" matter more than the brain

One repeated lesson: both agents hinge on sensible observations and well-shaped rewards. If the gradient feature is scaled poorly (or the reward doesn't penalize dithering), both agents plateau. When the "smell" and incentives align with chemotaxis, both brains discover the zig-zag-then-home-in behavior you'd expect from worms climbing gradients.

Here’s the distance-shaping component I use, which rewards net progress toward the food and lightly penalizes dithering:

def shaped_reward(prev_pos, pos, goal, history, step_penalty=0.01, dithering_penalty=0.02):
    # history: positions visited so far, oldest first, assumed to end with the current pos
    reward = -step_penalty
    # Manhattan distance to the food before and after the move
    prev_dist = abs(prev_pos[0]-goal[0]) + abs(prev_pos[1]-goal[1])
    curr_dist = abs(pos[0]-goal[0]) + abs(pos[1]-goal[1])
    reward += (prev_dist - curr_dist) * 0.1  # move closer → positive
    # Anti-dithering: penalize returning to the cell occupied two steps ago (A → B → A)
    if len(history) > 2 and pos == history[-3]:
        reward -= dithering_penalty
    return reward

Hardware today, hardware tomorrow

To keep this article evidence-driven, I've reported simulator results only. I did, however, run short sessions on real IBM devices to validate execution (and repeated them with Q-CTRL Fire Opal error suppression enabled). For now, the simulator is the right tool to explore design space quickly; as I scale to more qubits / deeper circuits, I'll shift more emphasis to hardware-first studies.

Why this is interesting (to me, and maybe to you)

Compact quantum policies are viable. Coming within about one percentage point of the classical success rate with 12 parameters is a nice empirical data point.
Exploration "for free." Quantum measurements produce a distribution without adding noise manually; in RL that can be a feature or a training headache.
Chemotaxis is a great benchmark. It's interpretable, grounded in biology, and lets me reason about whether the agent's behavior looks worm-like.
The path to "bigger brains." If the nematode is tractable, stepping up environment richness (and eventually organismal complexity) becomes a sensible research ladder.

Key code paths (for readers who want to tinker)

Classical REINFORCE update (core idea):

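def reinforce_update(policy, optimizer, trajectory, baseline, entropy_beta=1e-3):
    # optimizer is passed in explicitly, e.g. torch.optim.Adam(policy.parameters(), lr=3e-4)
    logps, entropies, rewards = [], [], []
    for s, a, r in trajectory:
        p = policy(torch.tensor(s, dtype=torch.float32))
        dist = torch.distributions.Categorical(probs=p)
        logps.append(dist.log_prob(torch.tensor(a)))
        entropies.append(dist.entropy())  # entropy of the full action distribution
        rewards.append(r)
    # compute_returns_from_rewards and baseline are project helpers; returns is assumed to be a 1-D tensor
    returns = compute_returns_from_rewards(rewards, gamma=0.99)
    adv = returns - baseline.update_and_get(returns)

    loss = -(torch.stack(logps) * adv.detach()).sum()
    # entropy regularization (encourage exploration)
    entropy = torch.stack(entropies).sum()
    optimizer.zero_grad()
    (loss - entropy_beta * entropy).backward()
    optimizer.step()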

Quantum policy: parameter-shift gradient (conceptual):

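import numpy as np

# For a parameter θ_k, evaluate the objective at θ_k ± shift and difference the results.
# expected_return is a project helper (not shown) that estimates J via batched circuit evaluation.
# Note: the rule is exact for a single circuit's measurement probabilities; applied directly
# to a multi-step return, as here, it is best read as a conceptual estimator.
def grad_param_shift(theta_vec, k, shift=np.pi/2):
    plus = theta_vec.copy(); plus[k] += shift
    minus = theta_vec.copy(); minus[k] -= shift
    J_plus  = expected_return(plus)   # estimate via batched circuit eval
    J_minus = expected_return(minus)
    return 0.5 * (J_plus - J_minus)   # parameter-shift rule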
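
A quick way to see why the shift rule works is to test it on the simplest case, where the answer is known in closed form. The sketch below uses a single RY(θ) rotation on |0⟩, for which P(1) = sin²(θ/2); the function names are made up for the example.

```python
import numpy as np

def prob_one(theta):
    """P(measure 1) after RY(theta) on |0>: sin^2(theta / 2)."""
    return np.sin(theta / 2) ** 2

def param_shift_grad(f, theta, shift=np.pi / 2):
    """Two evaluations at theta +/- shift, halved: the parameter-shift rule."""
    return 0.5 * (f(theta + shift) - f(theta - shift))

theta = 0.7
analytic = 0.5 * np.sin(theta)          # d/dtheta of sin^2(theta / 2)
shifted = param_shift_grad(prob_one, theta)
assert np.isclose(analytic, shifted)
```

Unlike a finite difference, the shift here is large (π/2) and the result is not an approximation: for probabilities of this sinusoidal form the two-point difference equals the derivative exactly.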

What's next

Richer environments: larger maps, obstacles, multiple food sources, changing conditions.
Predator evasion: add a chaser/hazard so survival matters alongside hunger.
Architectures: smaller classical designs aimed at faster convergence per parameter; deeper/wider PQCs; and hybrid models mixing classical and quantum modules.
Hardware-first comparisons: once circuits exceed simulator practicality, re-run the study directly on devices with noise-aware tooling.

I'll cover environments with more variation and add predator evasion in the next article.

Finally

You can check out this project's source code at:
👉 https://github.com/SyntheticBrains/nematode

If you want to poke at the code, try swapping the policy: MLP ↔ PQC, tweak reward shaping, or insert obstacles. If you do something cool (or find a better tiny classical design that converges faster), I'd love to hear about it.

DEV Community: Chris Zaharia