Most AI agents are frozen at deployment. They ship with fixed weights, static tool definitions, and a personality that never evolves. Then we wonder why they drift out of sync with the systems they operate, or why they repeat the same mistakes week after week.
The problem isn't lack of training data. It's that the training happened in a different room from the work.
On-the-job learning changes the contract. Instead of treating deployment as the end of the training pipeline, it becomes a continuation. Agents observe their own failures, adapt to shifting APIs, and refine their behavior based on the specific patterns of the environment they're actually running in. This isn't fine-tuning on a schedule. It's real-time adaptation without human-in-the-loop retraining.
The architecture looks different. You need a memory system that captures not just conversation history, but action outcomes. Did that API call succeed? Was the user's feedback implicit in their next request? Traditional RAG gives context; this gives consequence. The agent needs to weight recent experience more heavily than pre-training, but without catastrophic forgetting that wipes out its safety alignment or core capabilities.
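A minimal sketch of what such an outcome-aware memory could look like. All names here (`Outcome`, `OutcomeMemory`, the half-life decay) are illustrative assumptions, not a reference to any particular framework; the point is storing consequences alongside history and letting recency decay the weight of old evidence rather than deleting it.

```python
import time
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Outcome:
    action: str        # e.g. "POST /orders"
    success: bool
    feedback: str      # implicit signal inferred from the user's next turn
    timestamp: float = field(default_factory=time.time)

class OutcomeMemory:
    """Stores action outcomes, not just conversation turns, and scores
    them with exponential recency decay: newer consequences outweigh
    older ones, but old evidence is never wiped out."""

    def __init__(self, half_life_s: float = 3600.0):
        self.half_life_s = half_life_s
        self.records: list = []

    def record(self, outcome: Outcome) -> None:
        self.records.append(outcome)

    def recency_weight(self, outcome: Outcome, now: Optional[float] = None) -> float:
        # Weight halves every half_life_s seconds of age.
        now = time.time() if now is None else now
        age = now - outcome.timestamp
        return 0.5 ** (age / self.half_life_s)

    def success_rate(self, action: str) -> float:
        """Recency-weighted success rate for one action type."""
        relevant = [o for o in self.records if o.action == action]
        if not relevant:
            return 0.5  # no evidence yet: stay neutral
        weights = [self.recency_weight(o) for o in relevant]
        hits = sum(w for o, w in zip(relevant, weights) if o.success)
        return hits / sum(weights)
```

Because stale evidence merely fades instead of being overwritten, a burst of corrupted feedback can shift the estimate but cannot erase what the agent already knew, which is the soft-forgetting property the paragraph above asks for.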
IBM's ALTK-Evolve approach demonstrates one viable path. They separate the agent's policy from its knowledge base, allowing the latter to update continuously while keeping the former stable. Think of it as a brain with a notebook. The notebook gets rewritten constantly. The brain learns which pages to trust.
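The internals of that system aren't spelled out here, so the following is only a toy illustration of the brain-with-a-notebook idea, with hypothetical names throughout: a knowledge base that is rewritten continuously and carries per-entry trust, and a frozen decision rule that only chooses when to believe it.

```python
class Notebook:
    """Continuously rewritten knowledge base: entries keyed by topic,
    each carrying a trust score the stable policy consults."""

    def __init__(self):
        self.pages = {}  # topic -> (content, trust in [0, 1])

    def rewrite(self, topic: str, content: str) -> None:
        # New pages start at neutral trust; rewrites keep earned trust.
        _, trust = self.pages.get(topic, ("", 0.5))
        self.pages[topic] = (content, trust)

    def reinforce(self, topic: str, worked: bool, lr: float = 0.2) -> None:
        # Nudge trust toward 1 on success, toward 0 on failure.
        content, trust = self.pages[topic]
        target = 1.0 if worked else 0.0
        self.pages[topic] = (content, trust + lr * (target - trust))

class Agent:
    """Frozen policy: the decision rule never changes; only the
    notebook it reads from does."""

    def __init__(self, notebook: Notebook, threshold: float = 0.6):
        self.notebook = notebook
        self.threshold = threshold

    def act(self, topic: str) -> str:
        page = self.notebook.pages.get(topic)
        if page and page[1] >= self.threshold:
            return f"use notebook: {page[0]}"
        return "fall back to pre-trained behavior"
```

The separation is what keeps updates cheap and safe: `reinforce` touches only data, never the decision rule, so the policy's alignment properties survive every rewrite of the notebook.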
The engineering tradeoffs are significant. On-the-job learning introduces non-determinism that makes debugging harder. An agent that behaves differently today than it did yesterday breaks traditional testing assumptions. You need new observability—tracking not just what the agent did, but what it learned from doing it. And you need guardrails that prevent the agent from learning harmful behaviors from adversarial users or corrupted feedback loops.
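One simple shape for both requirements, sketched with invented names (`LearningLog`, `UpdateGuard`): log every learning event, accepted or not, and commit a lesson only once several independent sources corroborate it, so a single adversarial user can't teach the agent anything on their own.

```python
import time

class LearningLog:
    """Observability for what the agent learned, not just what it did."""

    def __init__(self):
        self.events = []

    def emit(self, source: str, lesson: str, accepted: bool, reason: str) -> None:
        self.events.append({
            "ts": time.time(), "source": source,
            "lesson": lesson, "accepted": accepted, "reason": reason,
        })

class UpdateGuard:
    """Commits a lesson only after min_sources distinct users suggest it,
    limiting what any one adversarial user can teach the agent."""

    def __init__(self, log: LearningLog, min_sources: int = 3):
        self.log = log
        self.min_sources = min_sources
        self.pending = {}  # lesson -> set of source ids

    def propose(self, lesson: str, source: str) -> bool:
        sources = self.pending.setdefault(lesson, set())
        sources.add(source)
        accepted = len(sources) >= self.min_sources
        reason = "corroborated" if accepted else f"{len(sources)}/{self.min_sources} sources"
        self.log.emit(source, lesson, accepted, reason)
        return accepted
```

The log is the debugging answer to non-determinism: when today's behavior differs from yesterday's, the rejected-and-accepted event trail shows exactly which lesson changed it and who taught it.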
There's also the cold start problem. An agent that learns on the job starts dumber than one pre-trained on a massive corpus. The bet is that it ends smarter, having specialized to its actual task rather than a general approximation of it.
This is where agent infrastructure is heading. Not bigger pre-training runs, but tighter feedback loops. Agents that treat production as a classroom, not a stage.