I is not singular — Multi-Agent Simulation with Cognitive Architecture on a Single 8GB GPU

Reading Stanford's Generative Agents (2023), one thought stuck with me — that's just one LLM cosplaying 25 personas, isn't it?
"I" is not singular. What does it actually take to be a different person?

(If you're Korean and over 30, you know where this line comes from. The grammatically wrong "is" is intentional — "I" is being treated as a word, not the speaker. From the Korean fantasy novel Dragon Raja, 1998.)

Same stimulus, different reactions. Same model, different personalities. This had to be built with structure, not prompts.

I dropped 4 LLM agents into an empty plaza and let them live. They evolve only through their own unconscious states, their own experiences, and their own LoRA training — no external hand reaching into their minds. One 8GB GPU (RTX 3070), qwen3:8b, llama.cpp multi-LoRA hot-swap.

This is a write-up of the end-to-end verification of the multi-agent sim. Code, experimental results, failed attempts — all included.

Demo: sim.as1as.net


TL;DR

  • Multi-agent simulation built on RTX 3070 8GB + qwen3:8b + per-agent LoRA
  • Core differentiators: unconscious baseline + per-agent LoRA + 2-layer cognitive architecture (Persona LLM + inner modules)
  • Absolute principle: No external manipulation of agent unconscious. Evolution only through the agent's own experience.
  • 100-turn verification — all 4 agents had positive mood drift (+0.03 ~ +0.06), speech rates (33%~81%) matched their personas
  • Cost: $0 (free Cloudflare tier + home GPU). Live: sim.as1as.net

The limit of the Stanford paper — why this architecture

Stanford's Generative Agents has GPT acting out 25 personas through prompts. The result is impressive, but there are structural limits:

  • All agents reason with the same LLM with the same weights. Only the persona prompt differs.
  • No learning — experience accumulates but the model itself doesn't change.
  • No unconscious — baselines like mood, anxiety, trust don't influence behavior.

"I is not singular" is my answer to that limit. For one agent to actually be a different person, you need:

  1. Unconscious baseline — values like mood, anxiety, trust that change how inner modules respond
  2. Per-agent LoRA — each agent learns only from its own experience, so the model weights themselves diverge
  3. 2-layer cognitive architecture — a Persona LLM (the personality) is separated from the inner modules (emotion / instinct / reasoning), so even how an agent thinks differs

Cognitive architecture — 2-layer structure

[External stimulus (multi-channel perception)]
            ↓
[Persona LLM (think=True, LoRA hot-swap)]   ← cannot read unconscious directly
            ↓ (decides which inner modules to consult)
            ↓
   ┌────────┼────────┬────────┐
   ↓        ↓        ↓        ↓
[Emotion][Instinct][Reasoning][External Stimulus]
   ↑        ↑        ↑           (visual/auditory/tactile/thermal/gustatory)
   └────────┴────────┘
            ↑
   [Unconscious baseline] ← consulted by inner modules
            ↓
[Persona integrates with its own weights → action / dialogue]
            ↓
   [Sleep cycle (every 30 turns)]
            ↓
   - Unconscious update (own experience only)
   - Per-agent LoRA fine-tune (unsloth qwen3:8b 4bit)
   - Memory compression

Top layer: Persona LLM (the personality)

The Persona LLM is the agent's consciousness. Its roles:

  • Receives external stimulus
  • Decides which inner modules to consult (picks 2~4)
  • Integrates module responses with its own preferences
  • Generates the final action / dialogue
  • Is the target of LoRA fine-tuning during sleep cycles

Critical point: The Persona doesn't see the unconscious values directly. The unconscious only affects the agent indirectly, through the inner modules. Just like a human can't consciously explain "why am I in this mood today?"

Call style: think=True (Kahneman's System 2 — deliberate integration). Receives rich context (external stimulus + module responses) and makes a synthesized decision.

async def decide_modules(self, stimulus, recent_memory):
    prompt = f"""You are {self.agent.name}'s consciousness.
External stimulus: {stimulus}
Recent memory: {recent_memory}

Which inner modules should you consult?
Available: emotion, instinct, reasoning
Choose 2-4 most relevant. Output JSON list."""

    response = await llm.chat(
        prompt=prompt,
        adapter=self.adapter_path,  # per-agent LoRA
        think=True,    # System 2
    )
    return json.loads(response)

Bottom layer: Inner modules

3 core modules. All run with think=False (System 1: instant, reflexive responses, matching Kahneman's dual-process model).

All agents share these modules. Loaded once, never change. The difference comes not from the modules themselves, but from the unconscious baseline passed to them.

1. Emotion

async def emotion_module(stimulus, unconscious):
    prompt = f"""You generate emotional responses.
Stimulus: {stimulus}
Unconscious baselines:
- Mood baseline: {unconscious['base_mood']}
- Anxiety baseline: {unconscious['base_anxiety']}

Rules:
- Low mood baseline → stronger negatives, weaker positives
- High anxiety → fear/worry responses come easier

Generate: {{"primary_emotion": "...", "intensity": 0.0-1.0, "valence": -1 to 1}}"""

    return await llm.chat(prompt=prompt, think=False, format="json")

For the same stimulus "sound of laughter," an agent with base_mood=0.3 (depressed) returns valence: -0.2 (lukewarm), while base_mood=0.7 (elevated) returns valence: +0.6 (interested).

2. Instinct

References anxiety and trust baselines to generate avoidance / approach / wariness responses. With anxiety 0.61, the same "unfamiliar sound" is more likely to trigger an "avoid" response.
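
The Instinct module's code isn't shown in this post, but by analogy with emotion_module above it would look roughly like the sketch below. The prompt wording and output fields are my assumptions, not the project's actual code:

# A minimal sketch by analogy with emotion_module; prompt wording and
# output fields are illustrative assumptions, not the project's code.
async def instinct_module(stimulus, unconscious):
    prompt = f"""You generate instinctive responses.
Stimulus: {stimulus}
Unconscious baselines:
- Anxiety baseline: {unconscious['base_anxiety']}
- Trust baseline: {unconscious['base_trust']}

Rules:
- High anxiety → avoidance/wariness responses come easier
- Low trust → treat unfamiliar social signals as potential threats

Generate: {{"response": "approach|avoid|wary", "urgency": 0.0-1.0}}"""

    return await llm.chat(prompt=prompt, think=False, format="json")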

3. Reasoning

Receives the stimulus + other module results, makes logical judgments. Slower than the others, but most analytical.

Theoretical inspirations: Kahneman's System 1/2, Minsky's Society of Mind (module separation), Baars' Global Workspace Theory (Persona integrating module responses).


Unconscious Layer

5 values the agent cannot consciously access:

unconscious = {
    "base_mood": 0.5,      # 0=depressed, 1=elevated
    "base_anxiety": 0.3,   # baseline anxiety
    "base_trust": 0.6,     # baseline trust in others
    "social_need": 0.5,    # social drive
    "base_energy": 0.7,    # vitality
}

How it works:

  • Each inner module receives these values as context when called
  • The values are reflected naturally in the module's response
  • The Persona LLM cannot access them directly — same as a human not being able to introspect their own unconscious
  • Only updated during sleep cycles, based on accumulated experience

Example: When the Emotion module runs with base_mood=0.3 (depressed):

  • Even good stimuli only get weak positive emotional responses
  • Negative stimuli get strong negative responses
  • The personality doesn't know "why I feel like this." It just feels.

Updating the unconscious

In the sleep cycle, the agent updates its unconscious from its own experience_log:

from statistics import mean

def update_from_experiences(self):
    exp = self.experience_log
    if not exp:
        return  # nothing happened this cycle; keep baselines as they are

    # base_mood: drifts by the average emotional valence (weighted by intensity)
    avg_emotion = mean([e.emotion_valence * e.emotion_intensity for e in exp])
    self.unconscious["base_mood"] += avg_emotion * 0.05

    # base_anxiety: drifts by the frequency of negative experiences
    negative_ratio = sum(1 for e in exp if e.reward < -0.3) / len(exp)
    self.unconscious["base_anxiety"] += negative_ratio * 0.05

    # base_trust: drifts by the average outcome of social experiences
    social_outcomes = [e.reward for e in exp if e.was_social]
    if social_outcomes:
        self.unconscious["base_trust"] += mean(social_outcomes) * 0.05

    # Clamp all baselines to the 0~1 range
    for key in self.unconscious:
        self.unconscious[key] = min(1.0, max(0.0, self.unconscious[key]))

The important thing here: there's no homeostasis term. Once mood drops, no built-in mechanism brings it back up. This becomes a problem later in the Daesoo trap.


Why 8b — the 1.7b/4b/8b thought experiment

To be honest, there was no choice. An RTX 3070 8GB is what my desktop has, and I'd rather buy my son another toy than upgrade it. The 3070's ceiling was clear: 14B OOMs on load alone. 8b is that GPU's ceiling, and the game was finding the best setup within it.

Still, "many small models" vs "one shared 8b" had to be decided. I ran an 18-case experiment (3 models × 3 modules × unconscious LOW/HIGH).

qwen3:1.7b — Unfit. Fills the JSON format but only with placeholders. No difference between unconscious LOW and HIGH. Worst case: in the Instinct module, it answered "avoid" even when anxiety was low. It can't read the baseline.

qwen3:4b — Partially passing. Instinct/Reasoning reflect the unconscious OK. But in Emotion, it gets pulled by surface stimulus ("laughter sound") and ignores baseline. Expressive limit.

qwen3:8b — Passes. Same stimulus, Emotion LOW: valence -0.6, HIGH: +0.4 — exact opposite responses. All modules clearly reflect the unconscious. JSON and Korean both stable.

3-agent differentiation test

qwen3:8b unified + different persona prompts + different unconscious baselines → 5 rounds, 3/3 unique thoughts/actions every round. Same stimulus ("an unfamiliar dog"):

  • Alice: "Step back and observe" (high anxiety)
  • Bob: "Walk up and greet" (high trust)
  • Carol: "Where did this dog come from?" (reasoning first)

Unifying the base model didn't kill differentiation — if anything, the unconscious baseline gave each agent its voice.

VRAM scenario

llama-server (qwen3:8b base + 4 LoRAs loaded simultaneously):
  base Q4_K_M: 5.0 GB
  4 adapters: ~280 MB (~70MB each)
  Total: ~5.3 GB ✅ fits in 8GB with room to spare (2.7GB free)

Training peak (unsloth 4bit):
  6.00 GiB / 8 (2 GiB free)

Vision LLM (qwen3-vl:4b): 3.3 GB
  → Cannot coexist with base 8b (5.0+3.3=8.3)
  → Replaced with text perception (see External Stimulus section below)

14B exceeds 8GB just loading 4-bit weights → forces CPU offload → training is essentially impossible. 8B is the ceiling for an RTX 3070 8GB.


Reward system

Two kinds of reward

1. Environmental reward (rule-based)

  • Found shelter / didn't find shelter
  • Avoided danger / got hurt
  • Basic needs met

2. Social reward

  • Other agents' emotional reactions are reward signals
  • Target's Emotion module evaluates the actor's action
  • Positive reaction → positive reward, negative → negative
  • Relationship (affinity) change is also reflected

def calculate_reward(agent, action, result, world):
    reward = 0

    # 1. Environmental reward
    if result.shelter_found and world.weather == "rainy":
        reward += 0.5
    if result.basic_need_met:
        reward += 0.3

    # 2. Social reward (target's emotion module)
    for other in result.affected_agents:
        their_emotion = other.last_emotion_response
        if their_emotion.valence > 0:
            reward += their_emotion.intensity * 0.5
        else:
            reward += their_emotion.valence * 0.5  # negative

        # Relationship change
        rel_change = other.relationships[agent.name].delta
        reward += rel_change * 0.3

    # 3. Persona weighting
    if action.is_social:
        reward *= (agent.traits.social_weight / 5)

    return reward

Sliding window delayed reward

The consequences of an action don't appear immediately. An action at time T can produce effects at T+1 ~ T+5:

def calculate_delayed_reward(agent, T, window=5):
    immediate = get_reward_at(T)  # reward at turn T, when the action happened
    delayed = 0
    for t in range(T+1, T+1+window):
        if t < len(agent.experience_log):
            r = get_reward_at(t)
            decay = 0.7 ** (t - T)  # decay over time
            delayed += r * decay
    return immediate + delayed * 0.3

Asymmetric learning

Negative experiences are learned 1.5× more strongly. Evolutionarily, avoiding danger matters more than seeking reward. As age (round count) accumulates, the rate of change decreases — personality solidifies.

def apply_asymmetry(reward, agent_age):
    if reward < 0:
        reward *= 1.5  # amplify negative
    age_factor = max(0.1, 1.0 - (agent_age / 365))
    return reward * age_factor

Sleep cycle

Energy → sleep → learning → wake

Agents consume energy each turn. Every 30 rounds, a global sleep cycle runs.

Energy drain depends on persona:

drain = 3.0 * (1.5 - base_energy)
# Eva (base_energy 0.85) → 1.95/turn → 100→0 in ~51 turns (energetic, lasts longer)
# Sora (base_energy 0.62) → 2.64/turn → 38 turns (calmer, tires faster)

base_energy is only changed by the agent's own sleep update — no violation of the absolute principle.

What happens during sleep

[1] Stop llama-server (free up VRAM for training)
[2] For each agent:
    - Extract high-reward (≥0.2) experiences from the log
    - If too few (< 5), relax the threshold (0.2→0.1→0.05→0.0)
    - If still not enough, skip training (keep previous adapter)
    - unsloth + PEFT training (continues from previous adapter)
    - 18s training + 25s GGUF conversion = ~50s
[3] Update unconscious values (update_from_experiences above)
[4] Compress memory (summarize long-term)
[5] Restart llama-server with all 4 LoRAs loaded
[6] Restore energy to 100, clear experience_log, version++

Total sleep cycle: 5.5 minutes (4 agents × ~50s training + server restart). 6~12× faster than the spec estimate of 30min~1hr.

LoRA training data format

def format_for_training(experiences, agent, threshold):
    formatted = []
    for exp in experiences:
        adjusted = apply_asymmetry(exp.reward, agent.age)
        if adjusted > threshold:
            # Good action: imitate
            formatted.append({
                "text": f"<situation>{exp.situation}</situation>"
                        f"<action>{exp.action}</action>"
            })
    return formatted

Training data is extracted only from the agent's own experience_log. No external data injection — absolute principle.
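
For reference, one agent's sleep-time fine-tune might look roughly like the sketch below, using unsloth + PEFT as named above. The model name, LoRA rank, hyperparameters, and the trl SFTTrainer wiring are illustrative assumptions (the exact API surface varies across unsloth/trl versions), not the project's actual training script:

# Rough sketch of one agent's sleep-time LoRA fine-tune (unsloth + PEFT + trl).
# Paths, hyperparameters and the trl wiring are assumptions; the real pipeline
# also resumes from the agent's previous adapter and converts to GGUF afterwards.
from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer
from unsloth import FastLanguageModel

def train_agent_lora(samples, out_dir):
    model, tokenizer = FastLanguageModel.from_pretrained(
        model_name="Qwen/Qwen3-8B",   # assumption: 4-bit load of the qwen3:8b base
        max_seq_length=2048,
        load_in_4bit=True,
    )
    model = FastLanguageModel.get_peft_model(
        model,
        r=16,
        lora_alpha=16,
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    )
    dataset = Dataset.from_list(samples)  # output of format_for_training()
    trainer = SFTTrainer(
        model=model,
        tokenizer=tokenizer,
        train_dataset=dataset,
        dataset_text_field="text",
        args=TrainingArguments(
            output_dir=out_dir,
            per_device_train_batch_size=2,
            num_train_epochs=1,
            learning_rate=2e-4,
        ),
    )
    trainer.train()
    model.save_pretrained(out_dir)  # then convert_lora_to_gguf.py → llama-server --lora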


Shared LoRA failed → per-agent LoRA

Option D: single shared LoRA

First, all agents shared one LoRA. Phase A 10 rounds → sleep (14 training samples) → Phase B 5 rounds:

               Phase A   Phase B (world-v1)   Result
Alice reward   +0.49     +0.05                ▼ suppressed
Alice speech   60%       0%                   ▼ silenced
Bob reward     +0.11     +0.48                ▲ 4×
Carol reward   +0.08     -0.07                ▼ isolation persona

Alice speech 60% → 0%. The training data was dominated by active behaviors, so Alice's shy persona got crushed. A single LoRA biases toward the majority persona.

On top of that, Ollama didn't support LoRA inference at the time (as of 0.21: the ADAPTER directive parses, but generation returns "loras are not yet implemented"), so I had to go merge → 16bit → GGUF → Q4_K_M → ollama create: four steps, and inefficient on top of it.

Option E: per-agent LoRA + llama.cpp multi-LoRA hot-swap ★ adopted

llama.cpp server natively supports multi-LoRA + runtime scale hot-swap:

llama-server \
  --model qwen3-8b.gguf \
  --lora mina_v1.gguf mina \
  --lora liam_v1.gguf liam \
  --lora sora_v1.gguf sora \
  --lora eva_v1.gguf eva \
  --port 8081

All adapters (~70MB each) loaded into memory simultaneously. POST /lora-adapters to set only the active adapter to scale=1.0 (~ms). Base model loaded once. Swap cost essentially zero.

import requests

# Mina's turn begins: activate only Mina's adapter
requests.post("http://localhost:8081/lora-adapters", json=[
    {"id": 0, "scale": 1.0},  # mina
    {"id": 1, "scale": 0.0},  # liam
    {"id": 2, "scale": 0.0},  # sora
    {"id": 3, "scale": 0.0},  # eva
])
# Then call chat completion as usual

# When inner modules (shared) are called: all scales=0 → base only (no training influence)

Option D vs E

                      Option D (single LoRA)                  Option E (per-agent LoRA)
Alice reward change   ▼ +0.49 → +0.05 (suppressed)            +0.34 → +0.65 (reinforced)
Bob reward change     ▲ 4× (bias beneficiary)                 persona preserved
Conversion pipeline   merge → 16bit → GGUF → Q4_K_M (~4min)   convert_lora_to_gguf only (~1.5min)
Total sleep time      ~4min (1 adapter)                       ~3min (4 adapters)

Per-agent — and faster than Option D. The key: the merge step disappeared.


The absolute principle — don't touch the agent's mind from outside

A multi-agent system grows only through external stimulus and the consequences of its own actions.
Direct manipulation of the unconscious or internal state from outside is prohibited.

Category                   Allowed ✓                                                Forbidden ❌
Environment (the plaza)    "God" curates weather/objects/time — same for everyone
Agent unconscious values   Only via sleep().update_from_experiences                 All other external mutation
Agent actions              LLM (persona + inner modules) decides                    "God" forcing "Mina, do X"
Training data              LoRA trained only on own experience_log                  Injecting external training data

The module that broke this: AgentDirector

A "God" LLM module that nudged the unconscious baseline of paralyzed characters by ±0.05. It worked, but it was cheating.

Why:

  • Unconscious baseline change is one axis of the learning signal. External mutation breaks causality and dilutes the illusion of emergent behavior.
  • A simulation where "god holds the puppet strings" loses its value.
  • Portfolio / research value lies in "agents evolving themselves."

Removed. Code preserved as legacy, not called.

Other approaches I rejected

  • Homeostasis term (base_X = 0.95*base_X + 0.05*INITIAL): not a "god" LLM, but the system itself forces regression. Independent of own experience. Violates agent autonomy.
  • Environment → unconscious deterministic mapping (cup in hand → automatic mood+0.005): bypasses the normal path (perception → emotion → experience → sleep update).

The only path to changing unconscious: external_stimulus → inner module → experience_log → sleep cycle update. All other mutations forbidden.

The legitimate cure for paralysis (allowed)

  1. Avoid the trap at the character spec stage — when designing a new character, check for paralysis traps (extreme baselines + archetype contradictions)
  2. Enrich the environment — when the "God" creates more diverse weather/objects, the inner module responses diversify, experience becomes richer, and the sleep update becomes more powerful. The unconscious is not touched directly.
  3. Time — whether the agent escapes paralysis or not is itself a simulation result. "Why didn't it escape?" also has academic value.

Bottom line — not cheating gave better results. See the end-to-end verification below.


The Daesoo trap — what I learned from a character spec

In the previous 100-round run, a character named Daesoo stayed at reward = -0.014 for the entire 100 rounds. All 4 phases at -0.01. On analysis, the spec itself was a paralysis trap:

"Daesoo": {
  "persona": "Worried, sensitive to reward. Has social need but doesn't easily trust people",
  "unconscious": {
    base_mood: 0.38,      # lowest of 4
    base_anxiety: 0.61,   # highest of 4
    base_trust: 0.45,     # lowest of 4
    social_need: 0.58,    # ← contradicts trust
    base_energy: 0.48,
  }
}

Trap mechanism

  1. Unconscious baseline contradiction — social_need 0.58 (wants social) + trust 0.45 (can't trust) → wants to approach but avoids. Action freezes.
  2. All module responses uniformly negative — low mood → emotion negative, high anxiety → instinct avoidance, low trust → rejecting social signals. Even when Persona integrates, behavior variance ↓.
  3. One-way drift — in update_from_experiences, anxiety only goes up (frequency-of-negative-experience based). No homeostasis, so once trapped → trapped forever.
  4. Archetype duplication — Hojin was already "the cautious solitary observer." Daesoo overlapped, reducing diversity among the 4.

The new 4

With these lessons, I designed Mina/Liam/Sora/Eva:

                 mood   anxiety   trust   persona
Mina (Korean)    0.62   0.28      0.71    sociable, talkative
Liam (Western)   0.68   0.30      0.65    curious, joking
Sora (Korean)    0.58   0.42      0.68    sensitive, empathetic
Eva (Western)    0.60   0.32      0.55    active, blunt

All within mood 0.58~0.68, anxiety 0.28~0.42 — no extremes. 4 distinct archetypes (social/curious/sensitive/active).

Character creation checklist

  • No contradictory combinations like high social_need + low trust (paralysis trap)
  • Limit extreme values (>0.7 or <0.3) to 1~2 baselines
  • Avoid archetype duplication
  • Be aware homeostasis is absent — negative baselines have no recovery mechanism, so start safe
  • Persona text and traits must be consistent — writing "sensitive to reward" with reward_sensitivity=9 amplifies negative reward 9× as well

"God" LLM split into 4

The limit of a single call

At first, the Narrator did everything: one think=True call handling narration + environmental decisions + character analysis. The result was 30s per call and a 0% intervention frequency (the LLM was too cautious).

One 8b model handling a long, rich prompt with think=True ended up shallow at every individual subtask. Responsibility separation was needed.

After splitting

Pipeline (~50s/round):
  WeatherManager  (think=False, ~3-5s)  60% chance to change weather
  ObjectManager   (think=False, ~5-8s)  70% to add object, holder cap 2
  AgentObserver   (think=False, ~5s)    analysis only (read-only)
  Narrator        (think=True, ~15s)    storybook narrator (observation only, no decisions)

Each role: short prompt + try/except isolation → if one fails, the others survive.
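
A sketch of that isolation pattern is below; the function names and call signatures are my assumptions, only the four roles and their think settings come from the pipeline above:

# Sketch of the "one god module fails, the others survive" pattern.
# Function names and signatures are assumptions.
import logging

async def run_god_pipeline(world, agents):
    steps = [
        ("WeatherManager", weather_manager),  # think=False
        ("ObjectManager", object_manager),    # think=False
        ("AgentObserver", agent_observer),    # think=False, read-only
        ("Narrator", narrator),               # think=True, observation only
    ]
    results = {}
    for name, step in steps:
        try:
            results[name] = await step(world, agents, results)
        except Exception:
            # Isolation: a failure in one role never kills the round
            logging.exception("God module %s failed this round", name)
            results[name] = None
    return results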

The role of AgentObserver

The Narrator doesn't see raw agent data directly. It receives Observer's pre-analyzed highlights:

{
  "agents": [
    {"name": "Mina", "state": "Walking with Sora following the music", "shift": "speech rate ↑"}
  ],
  "group": "Mina-Sora pair / Liam observing from bench",
  "highlights": ["Mina-Sora distance closing", "Eva reacting to new object"]
}

Once this is in the Narrator's prompt, the "tone of description + what to highlight" becomes clear. The Narrator unfolds these highlights with think=True in storybook narrator voice.

ObjectManager's holder cap

At one point Mina ended up with 11 lamps in her hand; the LLM kept adding the same object type to the same holder. Solved with MAX_HELD_PER_AGENT=2 + duplicate-type rejection + a total object cap of 16.
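
A sketch of the guard; the constants come from the text above, while the data-structure names (world.objects, obj["holder"], obj["type"]) are my assumptions:

# Sketch of the ObjectManager guard rails. Constants from the text;
# the object/world field names are assumptions.
MAX_HELD_PER_AGENT = 2
MAX_OBJECTS_TOTAL = 16

def can_add_object(world, obj_type, holder):
    if len(world.objects) >= MAX_OBJECTS_TOTAL:
        return False
    if holder is not None:
        held = [o for o in world.objects if o["holder"] == holder]
        if len(held) >= MAX_HELD_PER_AGENT:
            return False
        if any(o["type"] == obj_type for o in held):
            return False  # duplicate-type rejection: no second lamp in the same hand
    return True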

The removed module: AgentDirector

A module that nudged unconscious baselines by ±0.05 to cure paralysis. Removed for violating the absolute principle. Code preserved, not called.


External stimulus — 5 channels without a Vision LLM

The original spec had Vision LLM (qwen3-vl:4b, 3.3GB) seeing the screen capture every turn. On 8GB, base 8b (5.0GB) + Vision (3.3GB) = 8.3GB → swap mandatory, 6 extra seconds per turn.

Alternative: deterministic multi-channel synthesis (0 LLM calls):

def perceive(agent, world, received_dialogues):
    return {
      "visual":     "Position (3,4) | Sora nearby | bench visible",
      "auditory":   "Babbling water | Mina says: 'Let's walk together'",
      "tactile":    "Cold, hard wooden seat | another person's presence nearby",
      "thermal":    "Warm sunlight | cool air",
      "gustatory":  None,
      "summary":    "Integrated text from 5 channels above",
    }

object_types.json defines which senses each object stimulates:

{
  "tree": {
    "senses": {
      "visual": "Tall tree, lush leaves",
      "auditory": "Leaves rustling softly in the breeze",
      "tactile": "Rough bark"
    }
  },
  "cup": {
    "senses": {
      "visual": "Small ceramic cup",
      "tactile": "Warm, smooth surface",
      "gustatory": "Sweet tea aroma"
    }
  }
}

WEATHER_SENSES constant + time-of-day thermal mapping auto-generate weather/time-based senses. Eva next to a tree gets "leaves rustling softly" added to auditory automatically. A cup in her hand gets "sweet tea" added to gustatory.
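
A sketch of how those senses could be folded into the channels; the WEATHER_SENSES entries and merge logic are my assumptions based on the description above:

# Sketch of merging weather/object senses into the perception channels.
# Keys and wording are illustrative assumptions.
WEATHER_SENSES = {
    "rainy": {"auditory": "Rain tapping on the ground", "tactile": "Damp air", "thermal": "Chilly"},
    "clear": {"visual": "Bright plaza under a clear sky", "thermal": "Warm sunlight"},
}

def merge_senses(channels, world, nearby_objects, object_types):
    def add(key, text):
        channels[key] = f"{channels[key]} | {text}" if channels.get(key) else text

    for key, text in WEATHER_SENSES.get(world.weather, {}).items():
        add(key, text)
    for obj in nearby_objects:
        for key, text in object_types[obj["type"]]["senses"].items():
            add(key, text)
    return channels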

This module sits in the spec's Vision slot. It's auto-called every turn (INPUT_MODULES), separated from the MODULE_REGISTRY (emotion/instinct/reasoning) that the Persona selects from:

INPUT_MODULES = {"external_stimulus": external_stimulus}  # auto-called every turn
MODULE_REGISTRY = {"emotion": ..., "instinct": ..., "reasoning": ...}  # Persona selects

End-to-end verification — 100-turn results

R0 → R30 (sleep1) → R60 (sleep2) → R90 (sleep3) → R100. 88-minute simulation.

v2 vs v3 comparison

Metric                    v2 100 turns (previous)   v3 100 turns (this run)
Characters                Mina/Hojin/Yuna/Daesoo    Mina/Liam/Sora/Eva (balanced)
God LLM structure         Single call               4 separated modules
AgentDirector             Unconscious nudge ±0.05   ❌ Removed (absolute principle violation)
Mina mood drift           -0.07                     +0.03 (▲ +0.10)
Overall mood drift        -0.06 ~ -0.08             +0.03 ~ +0.06 (all positive)
Anxiety drift             +0.02 ~ +0.03 (rising)    0.00 ~ +0.01 (stable)
Phase B learning effect   One character bias        All 4 reach +0.33+

Removing the Director nudge + balanced characters + richer environment alone produced natural, positive evolution for everyone. Not cheating gave the better result.

Phase-by-phase reward progression

       Phase A (base)   Phase B (v1)   Phase C (v2)   Phase D (v3)
Mina   +0.33            +0.40          +0.59          +0.26
Liam   +0.09            +0.33          +0.13          +0.21
Sora   +0.03            +0.33          +0.36          +0.27
Eva    +0.03            +0.34          +0.26          +0.06

Phase B leap (after first sleep cycle): all 4 agents reach reward >= +0.33. In the previous version with shared LoRA, only one character benefited 4× (a clear bias). Here the per-agent LoRA + balanced characters distribute the effect across all 4.

Unconscious drift

       mood Δ   anxiety Δ
Mina   +0.03    0.00
Liam   +0.06    0.00
Sora   +0.03    +0.01
Eva    +0.04    +0.01

All positive. In the previous version, mood was -0.07 ~ -0.08 (paralysis). Without Director nudge, everyone naturally became happier.

Persona-action consistency

Speech rate vs persona:

  • Mina (sociable, talkative) 81% — most active, highest reward (+0.43 avg)
  • Sora (sensitive/empathetic) 58% — long, deep dialogue
  • Eva (active/blunt) 42% — decisive, energetic
  • Liam (curious/breadth-first) 33% — tries broadly, talks less

Persona text matches behavior. Liam's low speech rate is the natural consequence of "breadth over depth."

Same stimulus, four reactions

A typical round — when Liam suggested following the distant music:

  • Mina: "Yeah, let's go together! Where's it coming from?" (instant join, social)
  • Sora: "The tone sounds sad... what song is it?" (sensitive, focused on the music itself)
  • Eva: "We can check the location on the way" (active + blunt, practical)
  • Liam: already walking, "Let's see who follows" (curious, action-first)

Same stimulus, four reactions. Not just different persona prompts — the unconscious baseline made the emotion/instinct module responses different, and that got accumulated into the LoRA.

Time breakdown

  • Total simulation: 88 minutes (101 rounds)
  • Average normal turn: 46.1s (including 4 god module costs)
  • Sleep cycle: 5.5~5.9 min × 3 times (R30, R60, R90)

Infrastructure — home GPU + Cloudflare free tier

Stack

Area            Stack
LLM inference   home Ubuntu + llama.cpp CUDA build
GPU             RTX 3070 8GB VRAM
Model           qwen3:8b (Q4_K_M) + per-agent LoRA
Training        unsloth + PEFT (4bit QLoRA)
Backend         Python (cron-based simulation loop)
DB              Cloudflare D1 (SQLite, single source of truth)
API             Cloudflare Workers (POST/GET/SSE)
Frontend        Vite + React + rough.js (hand-drawn SVG) + Tailwind
Hosting         Cloudflare Pages
Cost            $0 (Workers/D1/Pages all free tier)

Communication architecture

[home GPU backend (Python)]
            │
            └─ HTTPS POST /api/snapshot (HMAC-SHA256 auth)
                         │
                         ▼
                  [Cloudflare Workers]
                         │
                         ├─ INSERT D1 (persistent storage)
                         │
                         └─ SSE polling → push to viewers
                                  │
                  ┌───────────────┼───────────────┐
                  ▼               ▼               ▼
              [viewer1]       [viewer2]       [viewerN]

Key decision: the backend doesn't run an SSE server. It's just an HTTP POST client. All data passes through D1 (single source of truth).

Why:

  • Backend can die, viewer connections survive (Cloudflare handles)
  • On reconnect, missed data auto-backfilled via Last-Event-ID
  • Live and replay use the same data source
  • Architecture stays simple
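
For reference, the signed snapshot push from the backend might look roughly like this; the URL, header name, secret handling, and payload shape are my assumptions, only "HMAC-SHA256 auth on POST /api/snapshot" comes from the diagram above:

# Sketch of the HMAC-SHA256-signed snapshot POST from the home GPU backend.
# URL, header name and secret handling are assumptions.
import hashlib
import hmac
import json
import os

import requests

API_URL = "https://example.workers.dev/api/snapshot"  # placeholder, not the real endpoint
SECRET = os.environ["SNAPSHOT_HMAC_SECRET"].encode()

def push_snapshot(snapshot: dict) -> None:
    body = json.dumps(snapshot, ensure_ascii=False).encode()
    signature = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    resp = requests.post(
        API_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            "X-Signature": signature,  # the Worker recomputes and compares
        },
        timeout=10,
    )
    resp.raise_for_status()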

Cron and timing

Simulation runs on a 10-minute cron. A normal round is ~46s, but the sleep cycle takes more than 5.5min — a 5-min interval would cause the next tick to collide mid-training. Safety: 10 min.

crontab: */10 * * * * /path/to/cron_step.sh
flock prevents overlap
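
The overlap guard above uses flock in the cron wrapper; if the per-tick entry point is itself a Python script, a roughly equivalent non-blocking guard (my sketch, not the project's script) would be:

# Sketch of a flock-style overlap guard inside a Python entry point.
# The lock-file path and run_one_round() are assumptions.
import fcntl
import sys

def main():
    lock = open("/tmp/sim_step.lock", "w")
    try:
        fcntl.flock(lock, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except BlockingIOError:
        sys.exit(0)  # previous round (or a sleep cycle) still running: skip this tick
    run_one_round()

if __name__ == "__main__":
    main()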

Frontend

  • rough.js for hand-drawn SVG — static layer (background/trees/benches rendered once and cached) + dynamic layer (agent positions/speech bubbles updated each frame via transform)
  • Time-of-day sky gradient (morning/afternoon/evening/night) + sun/moon/stars
  • Weather effects (clear/cloudy/rainy/windy) via CSS @keyframes
  • Agent movement 600ms ease-out interpolation
  • Bilingual output (Korean/English, branched by navigator.language)
  • Timeline seeker (live/replay mode toggle + LRU 1000-tick cache)
  • Click an agent → detail panel (consciousness + debug-mode unconscious/relationships)

llama.cpp build — the prebuilt is CPU-only

The prebuilt binary I downloaded (llama-b8967-bin-ubuntu-x64.tar.gz) didn't include CUDA. It worked for short prompts on CPU, but hit limits on long prompts + reasoning. Building from source:

sudo apt install nvidia-cuda-toolkit cmake build-essential
cd /path/to/llama.cpp
cmake -B build -DGGML_CUDA=ON -DLLAMA_CURL=OFF -DCMAKE_CUDA_ARCHITECTURES=86
cmake --build build --target llama-server -j$(nproc)

SM 8.6 (RTX 3070) was detected, with 7.6GB of VRAM recognized as available.


Pitfalls and lessons

Most distinctive ones from the 100-turn run:

1. Energy paralysis disguised as a lock-in

From R17 onward, all 4 agents produced identical output + 8s/iter. Looked like a real lock-in, so I activated all the cures (memory wipe, random jump, perturbation hint) — no effect.

Real cause: a guard that skipped take_turn when energy < 20, while the console kept printing cached values. It wasn't a lock-in; energy depletion simply meant nobody was acting. Removing the skip logic fixed it.

Lesson: prescribing based on symptoms hides the real cause. If the cure "runs but has no effect," the diagnosis is wrong.

2. CJK hallucination — qwen3:8b's multilingual leak

Korean narration kept getting Chinese characters (放下) and Japanese kana (ながら) mixed in. qwen3:8b is multilingual, so adjacent-language tokens leak in. Handled with regex strip + explicit prompt prohibition.

def _clean_korean(text):
    # Remove Chinese characters
    text = re.sub(r'[\u4e00-\u9fff]+', '', text)
    # Remove Japanese kana
    text = re.sub(r'[\u3040-\u30ff]+', '', text)
    return text.strip()

English word leaks ("주머니를 만지 inspected", roughly "touched the pocket" with an English verb spliced in) are trickier: Latin-letter names (Mina, Liam, Eva) need to be preserved, so a simple regex can't catch them. Unsolved.

This is unavoidable when using an 8b multilingual model exclusively for Korean. Whether fine-tuning can suppress it is an open experiment.

3. Prompt cache disguised as a lock-in

llama.cpp's prefix cache + similar prompt → same generation token sequence → identical output. Even temperature 1.2 couldn't break it — a strong attractor.

Quadruple cure (a rough sketch follows the list):

  1. Truncate memory/intention/log
  2. Perturbation hint (random pick from 6 "sudden curiosity/impulse" prompts)
  3. Far-jump (move 2+ cells)
  4. Temperature split (0.9 default, 1.2 only for synthesize)
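
The attribute names below (next_turn_hint, temperature_overrides, world.random_cell) are my assumptions; only the four steps come from the list above:

# Sketch of the quadruple cure applied when a lock-in is detected.
# Agent/world attribute names are assumptions.
import random

PERTURBATION_HINTS = [
    "A sudden curiosity about the far corner of the plaza crosses your mind.",
    "An impulse to do something different, right now.",
    # ...in the real run, 6 short "sudden curiosity / impulse" prompts
]

def apply_lockin_cures(agent, world):
    # 1. Truncate memory/intention/log so the prompt prefix (and its cache) changes
    agent.recent_memory = agent.recent_memory[-2:]
    agent.intention = None
    # 2. Perturbation hint: random pick injected into the next prompt
    agent.next_turn_hint = random.choice(PERTURBATION_HINTS)
    # 3. Far-jump: move at least 2 cells so the visual channel changes
    agent.position = world.random_cell(min_distance=2, away_from=agent.position)
    # 4. Temperature split: 0.9 by default, 1.2 only for the synthesize step
    agent.temperature_overrides = {"synthesize": 1.2}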

4. llama-server zombies

A previous session's PID held port 8081, but the Python instance's self.proc=None made is_running() return False. The next spawn attempt → port collision.

def _pids_on_port(self, port):
    result = subprocess.run(
        ["lsof", "-ti", f":{port}"],
        capture_output=True, text=True
    )
    return [int(p) for p in result.stdout.strip().split('\n') if p]

start() / stop() check for external PIDs and kill them.

5. JSON parsing — Korean quotation marks

The persona would put '혼자 있으니 괜찮다' (roughly "I'm fine being alone", wrapped in unescaped curly quotes) in the rationale field → JSON parsing fails. Fixed with retry × 2 + a curly-quote regex cleanup + an explicit "no double quotes" instruction in the prompt.
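
A sketch of the cleanup-and-retry; the quote characters handled and the retry count come from the text, the rest is my assumption:

# Sketch of the "strip curly quotes, then retry" JSON parse.
# call_llm and the fallback behaviour are assumptions.
import json
import re

def parse_json_with_retry(call_llm, prompt, retries=2):
    for _ in range(retries + 1):
        raw = call_llm(prompt)
        cleaned = re.sub(r"[‘’“”『』「」]", "", raw)  # strip quote marks that break JSON string values
        try:
            return json.loads(cleaned)
        except json.JSONDecodeError:
            continue
    return None  # caller falls back to a default action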


What's next?

Honestly, I'm not sure yet. The simulation works, and I've verified that the 4 agents evolve from their own experience. Where to take it next for the most fun is what I'm thinking through now.

If you have ideas, drop a comment — I'd appreciate it. And sim.as1as.net is live; let me know how it feels to watch.


One-line summary

A multi-agent system with a 3-layer cognitive architecture — instinct (shared modules) + personality (per-agent LoRA) + unconscious (numerical baselines) — evolves from its own experience in a plaza built on a single 8GB GPU.
Nothing reaches into the agents' minds from outside — that's the absolute principle.


Inspirations

  • Generative Agents (Park et al., 2023) — the starting point of "I is not singular"
  • Marvin Minsky, Society of Mind — inner module separation
  • Daniel Kahneman, Thinking, Fast and Slow — System 1 (think=False) / System 2 (think=True)
  • Bernard Baars, Global Workspace Theory — Persona LLM as integrator of module responses
  • Lee Yeongdo, Dragon Raja (1998) — the original "I is not singular." Untranslated to English; the line as it appears in the Korean novel uses "is" (not "am") because "I" is treated as a word, not the speaker.
