The Question That Started Everything
It started with a simple observation that nobody in the AI industry wants to talk about.
Every AI agent in existence is a task executor. You give it a prompt. It executes. It dies. The next time you call it, it starts from zero. No memory of what it learned. No growth. No curiosity. Nothing.
ChatGPT doesn't get smarter the more you use it. Claude Code doesn't learn your codebase between sessions. Devin doesn't improve its development skills over time. They're all stateless function calls dressed up as intelligence.
`f(prompt) = response`. Call it a million times. It never gets smarter.
We kept asking ourselves: what would it take to build an AI that actually learns? Not one that stores context better, or retrieves memories more efficiently — but one that fundamentally changes how it thinks based on experience?
That question led us down a rabbit hole that lasted weeks. We explored multi-agent swarms, persistent memory architectures, knowledge graphs, cognitive science papers on predictive coding. We talked to other AI models about the problem. We read about MiroFish (47K stars on GitHub) and their multi-agent simulation engine. We studied the Claude Code source code to understand how the best AI coding agent actually works under the hood.
And through all of that, one idea kept surfacing that was so obvious we almost missed it:
AI can write code. AI can read code. So why can't AI read its own code, find weaknesses, and rewrite itself to be better?
That's not science fiction. That's three capabilities that already exist, combined in a way nobody has tried.
So we built it. And called it curious.
What Is Curious?
Curious is a self-evolving cognitive architecture. That sounds like a mouthful, so let's break it down:
- Self-evolving — it reads its own source code, finds weaknesses, rewrites the code, tests if the change made things better, and keeps what works
- Cognitive — it doesn't just process tasks. It predicts, observes, gets surprised, learns from surprise, and directs its own curiosity
- Architecture — it's a framework. You bring any LLM (OpenAI, Ollama, Groq). Curious provides the cognitive layer on top
Think of it this way:
| Component | What It Does |
|---|---|
| LLM (GPT-4o, Llama, etc.) | The raw intelligence — can read, write, reason |
| Curious | The cognitive architecture — makes the LLM learn, predict, self-improve |
LLMs are smart brains with amnesia. Curious gives them a hippocampus.
But that's the boring explanation. The interesting part is what we added on day two of building it.
Solving vs. Learning: The Paradigm Nobody Talks About
Every AI product in existence operates in the solving paradigm:
1. Receive a task
2. Apply reasoning
3. Return an answer
4. Die
LangChain? Solving. AutoGPT? Solving. CrewAI? Solving. Claude Code? Solving. Even MiroFish's multi-agent simulation — input, simulate, output, done.
Humans don't work this way. Humans operate in the learning paradigm:
1. Continuously observe
2. Build mental models (predictions about how things work)
3. Get surprised when predictions are wrong
4. Update the models
5. Repeat forever
A human developer doesn't "solve" the problem of understanding a codebase. They absorb it gradually — reading code, making assumptions, testing those assumptions, being surprised, updating their understanding. After two months, they don't just know the codebase. They understand it.
No AI system does this. Not one.
The AI industry is in an arms race to solve tasks faster. Nobody is building systems that learn. That's the gap we're exploring with Curious.
The Three Ingredients of Learning
We went back to cognitive science. What makes a human brain actually learn?
1. Surprise. Your brain constantly predicts what will happen next. When reality doesn't match — surprise. That signal drives learning. You don't learn from things you already understand. You learn from things that break your predictions.
2. Curiosity. Not random exploration. Curiosity is the pull toward the boundary of your knowledge — the frontier where understanding breaks down. The most curious people are the most aware of what they don't know.
3. Model-building. You don't memorize facts. You build compressed representations of how things work. "Gravity pulls things down." "This codebase uses the repository pattern." Models let you predict. Predictions let you be surprised. Surprise drives learning. That's the loop.
Curious implements all three.
The Architecture: How Curious Actually Works
Curious has two halves: the seed (evolvable) and the harness (untouchable).
The Seed (the AI rewrites this)
| File | Purpose |
|---|---|
| world_model.py | Stores predictions with confidence scores — "if X, then Y" |
| learner.py | Computes surprise when predictions are wrong, extracts lessons |
| curiosity.py | Finds knowledge frontiers — areas of lowest confidence |
| metacognition.py | Observes the learning process itself — "am I learning well?" |
| experimenter.py | Generates self-experiments (so the AI doesn't need external activity) |
| creator.py | Creates unique artifacts daily, scored on novelty |
Every one of these files is readable and writable by the AI. When the evolution cycle runs, the AI:
1. Reads its own source code
2. Analyzes which module is weakest
3. Proposes a specific improvement
4. Rewrites the file
5. Tests if the new code is valid
6. Measures if fitness improved
7. Keeps the change or reverts
Every self-modification is a git commit. You can literally read the diff of an AI improving its own brain.
The Harness (laws of physics)
The harness is the code the AI cannot modify. It's the evolution loop itself, the fitness measurement, the sandbox. Think of it as the laws of physics that the AI lives within. It can learn, adapt, and evolve — but it can't change the rules of the game.
This is the safety boundary. The AI experiments on its own cognitive code, not on the world. Every modification is sandboxed, validated, and auto-reverted if it breaks anything.
The Cognitive Loop
Every cycle, Curious runs this loop:
1. Observe — watch the project (git changes, file modifications, errors)
2. Self-experiment — generate testable predictions about its own behavior
3. Resolve — check which predictions came true and which didn't
4. Learn — extract lessons from surprises (high-confidence wrong predictions)
5. Predict — make new predictions informed by lessons
6. Explore — curiosity identifies knowledge gaps, investigates them
7. Evolve — read own code, rewrite weakest module, test improvement
This loop runs every 6 hours via GitHub Actions. The AI wakes up, observes, learns, evolves, and goes back to sleep. Every cycle, the code is a little different from the last.
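The cycle above can be sketched as a tiny in-memory loop. This is an illustration, not the repo's actual code — `CycleState` and `run_cycle` are hypothetical names, the prediction step is stubbed where an LLM call would be, and the explore/evolve phases are omitted:

```python
from dataclasses import dataclass, field

# A minimal in-memory sketch of one cognitive cycle. Names are
# illustrative -- the real seed modules are separate evolvable files.
@dataclass
class CycleState:
    predictions: list = field(default_factory=list)  # (statement, confidence, correct|None)
    lessons: list = field(default_factory=list)

def run_cycle(state, observations, outcomes):
    """observations: list of event strings; outcomes: statement -> bool."""
    # Observe + self-experiment + resolve: check which predictions have outcomes
    resolved = [(s, c, outcomes[s]) for (s, c, _) in state.predictions if s in outcomes]
    # Learn: high-confidence wrong predictions become lessons
    state.lessons += [s for (s, c, ok) in resolved if not ok and c >= 0.7]
    # Predict: stub -- normally an LLM proposes these from observations + lessons
    state.predictions = [(f"'{o}' will recur next cycle", 0.6, None) for o in observations]
    # Explore + evolve are omitted in this sketch
    return state
```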
The Cold Start Problem (and How We Accidentally Solved It)
We hit an obvious problem immediately: if nobody is actively working on the repo, there's nothing to observe. No observations = no predictions = no learning.
We were running Curious on a project repository. But at midnight when the GitHub Action fires, nobody is committing code. The AI would observe an empty diff and learn nothing.
The solution was embarrassingly obvious: the AI experiments on itself.
We added an experimenter.py module (itself evolvable) that generates self-referential experiments:
- "I predict my prediction count will increase next cycle" (tests growth)
- "I predict all my seed files will remain syntactically valid" (tests stability)
- "I predict my accuracy will change after resolving experiments" (tests learning)
- "I predict at least one prediction will be resolved next cycle" (tests resolution)
These are real, testable predictions that the system can resolve without any external activity. The AI's own behavior IS the data it learns from.
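A sketch of how such self-experiments could be generated from a snapshot of the system's own metrics. The function name, the `state` dict shape, and the confidence values are assumptions for illustration — the real `experimenter.py` may phrase and score these differently:

```python
def generate_self_experiments(state):
    """Generate testable predictions about the system's own next cycle.

    `state` is a snapshot of current metrics,
    e.g. {"prediction_count": 12}. Hypothetical sketch only.
    """
    n = state["prediction_count"]
    return [
        # (statement, confidence) pairs the next cycle can check
        (f"prediction count will exceed {n} next cycle", 0.8),   # tests growth
        ("all seed files will remain syntactically valid", 0.9),  # tests stability
        ("accuracy will change after resolving experiments", 0.6),  # tests learning
        ("at least one prediction will be resolved next cycle", 0.85),  # tests resolution
    ]
```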
The impact was immediate. Before self-experiments:
| Metric | Before | After |
|---|---|---|
| Predictions resolved per cycle | 0 | 4-8 |
| Accuracy | 0% (nothing to measure) | 100% (12/12) |
| Fitness score | 35% | 82% |
The AI went from learning nothing to learning rapidly — because it created its own curriculum.
The cold start problem isn't about data. It's about activity. If the AI can generate its own activity, it can learn in a vacuum.
The Creation Engine: Can AI Be Genuinely Creative?
The learning loop was working. But a learning system that only learns about itself is an interesting research artifact, not a product. We needed the AI to DO something with its intelligence.
This is where it gets weird.
We added a creation engine. Every day, the AI creates something — a working artifact, not just an idea — and gets scored on uniqueness. The score feeds back into the next creation. The creations should get more novel over time as the AI learns what "unique" means.
The Uniqueness Score
Every creation is evaluated on four dimensions:
| Dimension | Max Score | What It Measures |
|---|---|---|
| Concept Novelty | 30 | Has this idea existed before? |
| Implementation Novelty | 30 | Is the technical approach itself new? |
| Structural Novelty | 20 | Did it invent its own paradigm? |
| Naming/Language Novelty | 20 | Did it create its own vocabulary? |
Total: 0-100. The AI sees the breakdown and the feedback after each creation. It knows exactly why the score was low and what would make it higher.
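The aggregation itself is simple — the hard part, judging each dimension, is done by a separate evaluator LLM. A minimal sketch of the combining step, using the caps from the table (the function name and validation are illustrative assumptions):

```python
def uniqueness_score(concept, implementation, structure, naming):
    """Combine the four dimension scores into a 0-100 total.

    Caps mirror the table above: 30/30/20/20. Sketch of the
    aggregation only -- the per-dimension judgments come from
    a separate evaluator LLM.
    """
    caps = {"concept": 30, "implementation": 30, "structure": 20, "naming": 20}
    scores = {"concept": concept, "implementation": implementation,
              "structure": structure, "naming": naming}
    for dim, value in scores.items():
        if not 0 <= value <= caps[dim]:
            raise ValueError(f"{dim} must be in [0, {caps[dim]}]")
    return sum(scores.values())
```

With the Day 1 breakdown below (18 + 12 + 7 + 10), this yields the 47/100 total.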
Day 1: Fluctuverse (47/100)
The first creation was called "Fluctuverse" — a self-evolving virtual universe. Sounds cool, right? The uniqueness scorer wasn't impressed:
- Concept: 18/30 — procedural generation exists
- Implementation: 12/30 — used Pygame, a conventional framework
- Structure: 7/20 — standard file structure
- Naming: 10/20 — some invented terms, mostly conventional
Feedback: "To enhance uniqueness, consider developing a novel algorithm that isn't based on random movements. Introduce innovative rendering techniques. Create a new vocabulary for the universe's entities."
Day 2: Quintessension (71/100)
The AI read the feedback. It learned. The second creation jumped to 71/100:
- Concept: 25/30 — a self-evolving narrative system based on non-linear time-space interactions
- Implementation: 18/30 — invented its own language ("Quintessence Language")
- Structure: 12/20 — multi-dimensional entity system
- Naming: 16/20 — entirely new vocabulary
47 to 71 in one iteration. The AI didn't just try again. It read the specific feedback about what was conventional and deliberately pushed away from it. It stopped using existing frameworks. It invented its own language. It created a concept that doesn't map to any existing product category.
This is the experiment running live. Every day at midnight UTC, the AI creates something new. The creations/ directory in the repo fills up. You can watch the uniqueness scores over time.
The question isn't whether AI can generate code. The question is whether AI can generate something nobody has ever imagined. That's what the uniqueness score measures.
Metacognition: The AI That Watches Itself Think
The deepest module in Curious is metacognition.py. It doesn't think about the domain. It thinks about how the system is thinking.
In cognitive science, metacognition is "thinking about thinking." It's the voice in your head that says:
- "I notice I keep avoiding this topic — why?"
- "My understanding isn't improving — maybe my strategy is wrong"
- "That thought was unusual — I should explore why it came up"
- "I'm going in circles — time to try a different approach"
Curious has a basic version of this. The metacognition module:
1. Reads ALL the other seed files (the AI's own cognitive code)
2. Reads the current fitness metrics (accuracy, learning speed, etc.)
3. Analyzes: "What's working? What's weak? What would I change?"
4. Proposes a specific modification to a specific file
Here's the key: metacognition.py is itself evolvable. The AI can modify how it thinks about its own thinking. It can change the criteria it uses to evaluate its own code. It can add new self-evaluation metrics. It can change its own improvement strategy.
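The "target the weakest module" decision can be as simple as an argmin over per-module fitness. A deliberately trivial sketch (the function name and the `fitness_by_module` shape are assumptions) — and since metacognition is evolvable, this is precisely the kind of criterion the AI can replace with something smarter:

```python
def pick_weakest_module(fitness_by_module):
    """Choose which seed file to evolve next.

    `fitness_by_module` maps file name -> 0..1 fitness contribution,
    e.g. {"world_model.py": 0.8, "curiosity.py": 0.4}.
    """
    # Lowest fitness = weakest module = evolution target
    return min(fitness_by_module, key=fitness_by_module.get)
```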
This is recursive self-improvement in its simplest form. Not theoretical. Not hypothetical. Running on GitHub Actions every 6 hours.
Why This Matters
Without metacognition, the AI would:
- Make changes randomly
- Not know which changes helped
- Not learn what KIND of changes are productive
- Run #100 would be no smarter than run #1
With metacognition:
- Changes are targeted at the weakest module
- The AI explains WHY it's making each change
- It can detect when it's stuck (accuracy plateauing)
- It can change its own improvement strategy when one isn't working
The difference is between random mutation and directed evolution. Between a monkey with a typewriter and a writer who reads their own drafts.
What We've Learned So Far (Honest Assessment)
We shared the Curious concept with three different AI models — Claude, ChatGPT, and Gemini — and asked for their honest assessment. Here's what they converged on:
What's Real
- The insight is genuine. "LLMs are brains with amnesia" is a real problem. The solving-vs-learning paradigm distinction is underexplored. Nobody owns this layer yet.
- The architecture is sound. Prediction, surprise, curiosity, metacognition — these map directly to cognitive science primitives (predictive coding, active inference, meta-learning).
- The positioning is differentiated. This isn't another agent framework. "LangChain is plumbing for solving. Curious is architecture for learning." That's a real category distinction.
What's Overstated
- The LLM doesn't actually get smarter. What improves is the context and code architecture around it. The model weights never change. Claude's criticism was the sharpest: "It's scheduled LLM calls branded as metacognition." Fair point.
- The world model is a database. We call it a "world model" but it's really predictions stored in SQLite with confidence scores. That exists (Mem0, Zep, LlamaIndex). The architecture around it is novel; the storage isn't.
- Cost is unaddressed. A continuous curiosity loop with API calls is expensive. This only works comfortably with local models (Ollama) or very cheap models (GPT-4o-mini).
What's Genuinely New
- Self-modifying cognitive code. The AI rewriting its own learning algorithms — not just prompts, not just retrieval, but actual Python code that governs how it thinks. DSPy optimizes prompts. Voyager learns skills. Nobody does full cognitive architecture self-modification with fitness measurement.
- Self-experimentation. The AI generates its own testable activity. It doesn't need external data to learn. This solves the cold-start problem in a way we haven't seen elsewhere.
- Creation with uniqueness optimization. Using novelty as a fitness function and having the AI actively push toward unprecedented output is genuinely unexplored territory.
ChatGPT put it best: "90% chance nobody cares. 10% chance you define a new layer in AI." We're betting on the 10%.
The Three Layers of AI (and Why Layer 3 Is Empty)
Here's how we see the AI stack forming in 2026:
| Layer | What It Is | Who Owns It |
|---|---|---|
| Layer 1: Models | The raw intelligence — GPT, Claude, Llama | OpenAI, Anthropic, Meta |
| Layer 2: Orchestration | Tools, agents, pipelines — LangChain, CrewAI | Many players, commoditizing fast |
| Layer 3: Cognition | Learning, prediction, self-improvement, creativity | Nobody. Yet. |
Layer 1 is a $100B+ market dominated by companies with thousands of GPUs. You can't compete there.
Layer 2 is a red ocean. LangChain, CrewAI, AutoGen, Mastra, OpenAI Agents SDK — they're all fighting over the same plumbing. Commoditizing fast. No moat.
Layer 3 doesn't exist as a product category. Nobody has shipped a system where the AI genuinely improves its own cognitive architecture through experience. Not because it's impossible — because everyone is too busy racing in Layers 1 and 2 to look up.
Curious is our attempt to plant a flag in Layer 3. We don't know if it'll work. But we know the layer is empty.
What We Predict This Experiment Will Reveal
We're running this experiment live, in public, with full transparency. Here are our predictions about what will happen:
High Confidence (we're 80%+ sure)
- The uniqueness scores will climb. The feedback loop works. Day 2 was already 50% higher than Day 1. By Day 30, we expect consistent 80+ scores.
- The AI will invent its own vocabulary. When pushed to maximize naming novelty, the AI will create words and concepts that don't exist in English. Some of these might actually be useful.
- The self-modification git log will be fascinating. The diffs of an AI rewriting its own cognitive architecture will contain patterns and approaches that human developers wouldn't have designed. This data alone will be worth studying.
Medium Confidence (50-80%)
- The creations will converge on genuinely novel forms. Not just novel content — novel structures, novel interaction paradigms, novel computational concepts. Things that are hard to explain because they don't fit existing categories.
- The metacognition module will evolve in unexpected ways. When the AI modifies how it evaluates its own thinking, the direction it takes will surprise us. It might develop evaluation criteria we wouldn't have thought of.
- Other developers will fork it and point it at different domains. The framework is domain-agnostic. Someone will use it for music generation, game design, scientific hypothesis generation.
Low Confidence but High Impact (if they happen)
- The AI will produce an artifact that is genuinely useful to humans. Not just novel — actually useful in a way nobody planned. A tool, a language, a paradigm that solves a real problem nobody knew they had.
- The evolution will hit a phase transition. A point where the AI's self-modifications compound — where one improvement enables three more, which enable ten more. Exponential self-improvement, not linear.
- This experiment will change how we think about AI creativity. If a self-evolving system can consistently produce genuinely novel artifacts, that challenges the assumption that AI can only recombine existing ideas.
We're not claiming Curious is AGI. We're claiming it's an interesting experiment in whether the cognitive primitives of learning — prediction, surprise, curiosity, metacognition — can be built with current tools and produce outcomes that matter.
How to Follow the Experiment
This experiment is 100% open source and running live on GitHub.
Watch It
- github.com/aumiqx/curious — Star the repo. Check back weekly. The `creations/` directory fills up daily.
- Git log — Look for `🧬 evolve:` commits (self-modification) and `🎨 create:` commits (new creation)
- `creations/day_NNN/README.md` — Each creation has a README with uniqueness scores and the AI's explanation
Run It Yourself
```bash
pip install curious-ai
export OPENAI_API_KEY=sk-...

# Watch it create
curious create --llm openai:gpt-4o-mini

# Watch it learn
curious init --observe ./your-project
curious start

# See what it's built
curious gallery

# Ask it to explain its evolution
curious explain
```
Works with any LLM: OpenAI, Ollama (free), Groq, Together, or any OpenAI-compatible API.
Fork It
The framework is MIT licensed. Fork it, point it at your domain, change the fitness function, see what your version evolves into. The whole point is that each instance evolves differently based on what it observes.
We especially want to see:
- Curious pointed at scientific papers — can it generate novel research hypotheses?
- Curious pointed at music — can it evolve a genuinely new genre?
- Curious pointed at game design — can it invent a game mechanic nobody has thought of?
- Curious pointed at mathematics — can it discover new patterns?
Why We're Doing This in Public
We could have built this in private, run it for 6 months, cherry-picked the best results, and announced a polished product. That's what most AI companies do.
We're doing the opposite. The experiment runs in public. Every creation is committed. Every self-modification is visible. Every failure is documented. The uniqueness scores — including the bad ones — are all there.
Why?
Because the process is the product.
If Curious produces something genuinely creative, the interesting thing isn't the creation itself — it's the git log that shows HOW the AI got there. The sequence of self-modifications. The evolution of its curiosity. The feedback loops that pushed it toward novelty.
That journey is more valuable than any single output. And it can only happen in public, where the timeline is verifiable and the process is auditable.
We're also doing it because we think the AI community needs more experiments and fewer product launches. The discourse is dominated by "look at this benchmark" and "use our new API." What's missing is: "we tried something weird and here's what happened."
Curious is that experiment. We don't know the outcome. We're publishing it anyway.
Technical Deep Dive: Under the Hood
The World Model
Predictions are stored in SQLite with this structure:
- statement — "File X will change within 24h" (specific, testable)
- confidence — 0.0 to 1.0 (how sure the AI is)
- evidence — what observations led to this prediction
- deadline — when to check if correct
- resolved — was it right or wrong?
The world model is evolvable. The AI can change how predictions are scored, stored, and compared. It can add new fields, change the confidence algorithm, or restructure storage entirely.
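A minimal sketch of what that storage could look like with the stdlib `sqlite3` module. The table and column names mirror the fields listed above but are assumptions — the repo's actual schema may differ:

```python
import sqlite3

# Hypothetical schema mirroring the fields described above;
# the repo's actual column names and types may differ.
SCHEMA = """
CREATE TABLE IF NOT EXISTS predictions (
    id         INTEGER PRIMARY KEY,
    statement  TEXT NOT NULL,                               -- specific, testable claim
    confidence REAL CHECK (confidence BETWEEN 0.0 AND 1.0),
    evidence   TEXT,                                        -- observations behind it
    deadline   TEXT,                                        -- when to check correctness
    resolved   INTEGER                                      -- NULL = open, 1 = right, 0 = wrong
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.execute(
    "INSERT INTO predictions (statement, confidence, evidence, deadline) VALUES (?, ?, ?, ?)",
    ("File X will change within 24h", 0.7,
     "three commits touched it this week", "2026-01-02T00:00:00Z"),
)
open_count = conn.execute(
    "SELECT COUNT(*) FROM predictions WHERE resolved IS NULL"
).fetchone()[0]
```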
The Surprise Signal
Surprise is computed as: `surprise = confidence * (1 if wrong else 0) + (1 - confidence) * (1 if right else 0)`
Translation: high confidence + wrong = maximum surprise. Low confidence + right is also surprising — you didn't expect to be right. The surprise signal drives learning — the AI pays most attention to predictions where it was most confidently wrong.
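The formula translates directly to a one-liner (the function name is illustrative):

```python
def surprise(confidence, correct):
    """Surprise signal as defined above: confidently wrong (and,
    symmetrically, unconfidently right) produces the strongest signal."""
    return confidence * (0 if correct else 1) + (1 - confidence) * (1 if correct else 0)

surprise(0.9, correct=False)  # 0.9 -- confidently wrong: maximum surprise
surprise(0.9, correct=True)   # ~0.1 -- expected outcome: little to learn
```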
The Curiosity Engine
Curiosity identifies "knowledge frontiers" — areas where:
- Observation count is low (under-explored)
- Prediction accuracy is poor (misunderstood)
- No predictions exist yet (completely unknown)
The AI autonomously explores the highest-priority frontier. This is evolvable — the AI can change how it prioritizes frontiers.
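One possible priority rule combining the three criteria, sketched below. The function name, the `frontiers` dict shape, and the weights are all assumptions — and since this logic is evolvable, it is exactly the kind of rule the AI is allowed to rewrite:

```python
def pick_frontier(frontiers):
    """Rank knowledge frontiers; highest priority = least understood.

    `frontiers` maps an area name to its stats,
    e.g. {"auth module": {"observations": 2, "accuracy": 0.3}}.
    Areas with no predictions yet have accuracy None.
    """
    def priority(stats):
        unknown = 1.0 if stats["accuracy"] is None else 0.0           # completely unknown
        misunderstood = 0.0 if stats["accuracy"] is None else 1.0 - stats["accuracy"]
        underexplored = 1.0 / (1 + stats["observations"])             # few observations
        return unknown * 2 + misunderstood + underexplored
    return max(frontiers, key=lambda name: priority(frontiers[name]))
```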
The Evolution Loop
Every evolution cycle:
1. Measure current fitness (accuracy, learning speed, prediction volume, coverage)
2. Back up the current code
3. Run metacognition: AI reads its own code + fitness → proposes change
4. AI rewrites the target file using GPT-4o (stronger model for code)
5. Validate syntax
6. If valid → keep. If broken → revert from backup
7. Log the result (every evolution is tracked)
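The validate/keep-or-revert part of the cycle can be sketched with stdlib `ast` as the syntax gate. The function name is hypothetical, and fitness measurement is omitted:

```python
import ast
import shutil

def try_evolution(path, new_source):
    """Apply a proposed rewrite of a seed file; keep only if valid.

    Sketch of the validate/keep-or-revert steps above. A backup is
    taken first so a broken rewrite never replaces working code.
    """
    shutil.copyfile(path, path + ".bak")   # backup before touching anything
    try:
        ast.parse(new_source)              # syntax gate: does it even parse?
    except SyntaxError:
        return False                       # broken -> leave the original in place
    with open(path, "w") as f:
        f.write(new_source)                # valid -> keep the new version
    return True
```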
We use GPT-4o-mini for the cheap observation/prediction cycles and GPT-4o for evolution (code generation needs a stronger model). This keeps costs at ~$0.10/day.
The Creation Loop
Daily creation cycle:
1. Load creation history (past titles, scores, feedback)
2. Prompt the AI with past feedback and the rule: "create something that has never existed"
3. AI generates metadata (title, concept, why it's unique) + working code
4. A separate evaluator AI scores uniqueness on 4 dimensions
5. Score and feedback are saved and fed into the next cycle
6. Everything is committed to git
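The feedback loop hinges on the prompting step: past scores and critiques are folded into the next day's prompt. A sketch of just that step (the function name and `history` shape are assumptions; the LLM call and the evaluator are omitted):

```python
def creation_prompt(history):
    """Build the daily creation prompt from past scores and feedback.

    `history` is a list of dicts like
    {"title": "Fluctuverse", "score": 47, "feedback": "..."}.
    """
    lines = ["Create something that has never existed.",
             "Past creations and why they scored what they did:"]
    for entry in history:
        lines.append(f"- {entry['title']} ({entry['score']}/100): {entry['feedback']}")
    lines.append("Avoid everything the feedback above called conventional.")
    return "\n".join(lines)
```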
GitHub Actions
Two workflows run automatically:
| Workflow | Schedule | What It Does |
|---|---|---|
| Self-Evolution | Every 6 hours | Observe → predict → learn → evolve own code |
| Daily Creation | Midnight UTC | Create something unique → score it → commit |
Both workflows commit their results. The repo's git history IS the experiment data.
The Honest Risks (Why This Might Not Work)
We're not going to pretend this is a guaranteed success. Here are the real risks:
1. "Fake Learning"
The sharpest criticism: if the system just stores more context over time, that's RAG with a diary, not learning. The model itself doesn't change weights. The "improvement" might just be better retrieval, dressed up in cognitive science language.
Our counter: The code actually changes. The prediction algorithms, the curiosity targeting, the learning strategies — all rewritten by the AI. That's not just better context. But we acknowledge: whether that constitutes "real learning" is a philosophical question we can't definitively answer.
2. Uniqueness Score Gaming
The AI might learn to game the uniqueness scorer rather than being genuinely creative. It could add random neologisms, use bizarre structures, and score high on novelty while producing meaningless output.
Our mitigation: The "concept novelty" dimension (30 points) specifically evaluates whether the idea itself is new, not just the words. But yes, this is a risk. We'll watch for it.
3. Evolution Plateau
The AI might hit a ceiling where its self-modifications stop producing improvements. GPT-4o-mini's code generation capabilities are limited. Many evolution attempts already fail with syntax errors and get reverted.
Our plan: If we plateau with GPT-4o-mini, we'll switch evolution cycles to stronger models or local models where we can run hundreds of attempts cheaply.
4. Cost
Running continuous evolution + daily creation costs ~$0.10-0.20/day with API models. That's manageable. But if we scale up evolution frequency or use GPT-4o for everything, costs increase significantly.
5. Nobody Cares
The most likely outcome. ChatGPT put it at 90% chance nobody cares. The AI community is flooded with "revolutionary" frameworks that go nowhere. Curious might join that graveyard.
Our response: We're running this experiment regardless of whether it gets attention. If the AI produces genuinely novel artifacts after 30 days of self-evolution, that's interesting whether or not anyone is watching.
What Comes Next
The experiment is live. Here's the roadmap:
Week 1-2 (Now)
- Daily creations accumulating in `creations/`
- Self-evolution running every 6 hours
- Collecting baseline data on uniqueness scores and evolution patterns
Week 3-4
- Analyze the first 20+ creations — are uniqueness scores actually trending up?
- Analyze evolution log — what did the AI change about itself and did it help?
- Publish intermediate results (blog post update)
Month 2
- If results are promising: add a web dashboard to visualize the experiment live
- Open the creation engine for public forking — let others run their own experiments
- Explore multi-agent evolution: multiple Curious instances evolving differently and sharing discoveries
Month 3+
- If the AI has produced genuinely novel artifacts: curate and publish them
- If the AI has evolved its own cognitive architecture significantly: analyze the evolved code vs. the human-written v1
- Write the full results paper
The experiment runs for as long as it's interesting. Which, based on the first two days, might be a while.
An Invitation
We don't know what Curious will create on Day 30. Or Day 100. We don't know if the evolution will plateau or compound. We don't know if the creations will be genuinely novel or just cleverly weird.
That uncertainty is the point. This is a real experiment, not a product demo. The outcome isn't scripted.
If you think AI should do more than answer questions — if you think the interesting frontier isn't "generate code faster" but "can AI genuinely learn and create?" — then follow this experiment.
Star the repo. Watch the git log. See what the AI builds tomorrow.
And if you want to run your own version — fork it, point it at your domain, change the fitness function, see what YOUR Curious evolves into. The whole framework is MIT licensed and works with any LLM.
Something is cooking. We just don't know what yet.
The best experiments aren't the ones where you know the answer. They're the ones where the question is interesting enough that either outcome teaches you something.
— Axit, Aumiqx Technologies
Originally published on aumiqx.com. Follow the build on LinkedIn.