Kazuya

AWS re:Invent 2025 - The new AI architecture that adapts and thinks just like humans (STP108)

🦄 Making great presentations more accessible.
This project aims to enhance multilingual accessibility and discoverability while maintaining the integrity of the original content. Detailed transcriptions and keyframes preserve the nuances and technical insights that make each session compelling.

Overview

📖 AWS re:Invent 2025 - The new AI architecture that adapts and thinks just like humans (STP108)

In this video, Jan Chorowski and Victor Szczerba from Pathway introduce Baby Dragon Hatchling, a post-Transformer AI architecture inspired by the brain's sparse neural networks. They explain how Transformers lack continuous learning, are inefficient, and are unsuitable for long-running enterprise tasks. Baby Dragon Hatchling features sparse activation and connectivity, enabling continuous learning from thin datasets, attention spans beyond two hours, improved energy efficiency, and model observability for regulated environments. The architecture addresses enterprise needs through sticky inference on corporate data, and launches mid-year in partnership with AWS.


This article is entirely auto-generated while preserving the original presentation content as much as possible. Please note that there may be typos or inaccuracies.

Main Part

Thumbnail 0

The Transformer's Limitations and the Brain-Inspired Baby Dragon Hatchling Architecture

Hello, everyone, and thank you for joining our presentation. I'm Jan Chorowski, CTO and co-founder of Pathway, and together with Victor Szczerba, our Chief Commercial Officer, we want to show you what's happening in AI beyond Transformers. We'll explore what the post-Transformer architecture is and what it offers to the enterprise. Let me tell you a little secret: the Transformer is on its way out. Its days are numbered.

Thumbnail 30

It's been a wonderful eight years with it. The Transformer is a tremendous invention that has scaled beyond all expectations, going from single-sentence translation to powering trillion-parameter language models. In a sense, it has expanded from the coherence of a few words to a few pages and then to basically a small book. It allowed us to dream about the possibilities of AI. However, as we try to implement those possibilities with the Transformer, we start to see limitations.

Thumbnail 60

The main limitation is when you try to employ it to perform long-running tasks. The truth is that no one is really working on this. The frontier models and foundational models are getting smarter and smarter, so you get something really smart, but it doesn't get any better than that. It's not able to follow through with a task that is sufficiently long. What we often need is coherence over a long-running task. There is this new axis, this horizontal axis of long thinking, which appears to be totally neglected.

Yes, you can download a better model every week and switch to it. They all start to feel the same because they basically all participate in the same competition. But then you put it to work, and it doesn't perform better on those long-running tasks. If there is a bug on day one, there will be a bug on day two and a bug on day three. Nothing really changes, despite the constant feeling of progress. We have narrowed this down to basically three main limitations of the Transformer.

Thumbnail 120

First, it lacks a long-term memory that keeps adapting. In the current blueprint of how we do AI models, they are trained once in the lab on multiple tasks, then released as a snapshot, either as a downloadable model or as a version behind an API, and they stay like that throughout their lifetime. You can try to attach external memory to it, but ultimately the base model stays the same. It's a very smart savant. It's extremely brilliant on day one, just as brilliant on day two, and feels just as limited on day three. There is no progress, there is no change.

You cannot solve really long-running tasks without learning in between, without any sort of lessons learned and getting better. Second, it's actually very inefficient. You can keep improving the benchmark by scaling, but every incremental improvement on the benchmarks basically means a ten-times increase in cost, because the model has to get bigger and the dataset has to get bigger. If you want to get better on benchmarks that are hard for humans, you actually have to curate, namely label, assemble, and generate more hard data. This means that to get good data, you have to hire really smart humans. There is no automation; it's basically rewriting.

Yes, the models code better. Yes, you need to hire very good coders to train those models. The value per token, in a sense, is getting smaller and smaller because you have to feed the model with even larger datasets just to keep progressing. As a matter of fact, we ran out of text data on the internet to train the largest models. Taken together, this basically makes the model not suitable for the enterprise. These are one-size-fits-all consumer-grade models. They are not really customizable to enterprise needs. They are not getting better at what they do.

It's really hard to train them on the limited amount of data that enterprises may have, and it's counterproductive to hire more people to generate more data so that you can automate the process one day. You don't really know why the models are failing or why they are responding the way they do. If you spot a bug with this limited interpretability, there is no easy way to fix it beyond a ten-times increase in data or trying a smarter model, and maybe it will just trade one failure mode for another.

Thumbnail 260

However, we know that it's possible to have continuous learning, getting better every day, and we know that it's possible to do so in a predictable way. In fact, we are all here today just to learn, right? Our brains excel at integrating new information and learning on the go. We expect smart people to start on a journey, start doing a task, and finish it changed, having learned something along the way. It's all about the journey. That's not true of the models, but it is true of brains.

We know how the brain differs from the current models. Yes, the Transformer is a deep neural network. Yes, it's brain-inspired. But it's also a very different kind of neural net than the one we have in the brain. The Transformer is basically a very dense network, and in terms of operation, you have to use all of it for every decision it makes.

Every piece of information is mixed together and squeezed into those fully connected dense layers, which are basically the bulk of what you download off the internet when you download the weights of a model. The brain, however, is different. It's sparse in many senses of the word. It's sparsely activated. The old joke that we use only 20% of our brain actually points to something good energy-wise, because we only use the parts of the brain that need to be used for the task at hand. This also means less crosstalk between tasks and less overwriting of the skills for task one when I'm doing task two. It also means energy efficiency.

The brain is sparsely connected: not all the neurons talk to each other, and this allows it to have those little modules and clusters where knowledge is stored. As far as network science goes, the world of sparse networks is a very different world from the world of dense networks. So yes, the Transformer is an artificial neural network, but we want to say no, it's not the proper network. You need to have this special brain-like structure with high dimensionality, sparse activation, and sparse connectivity to have the benefits of the brain and the inductive biases of the brain, to basically learn and behave like a brain.
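To make the dense-versus-sparse contrast concrete, here is a minimal NumPy sketch. The sizes are illustrative, not Pathway's actual architecture: a dense layer, where every weight participates in every step, versus top-k sparse activation, where only a handful of units fire.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 4096      # hidden width (illustrative)
k = 64        # units allowed to "light up" per step (~1.6% of the layer)

W = rng.standard_normal((d, d)) / np.sqrt(d)   # a dense Transformer-style layer
x = rng.standard_normal(d)                     # incoming activation

# Dense operation: every weight participates in every decision.
dense_out = W @ x                              # ~d*d multiply-adds

# Sparse activation: keep only the k strongest units, silence the rest.
h = W @ x
top = np.argsort(np.abs(h))[-k:]
sparse_out = np.zeros(d)
sparse_out[top] = h[top]

# Downstream layers only need to read k nonzero units, so per-step work
# scales with k*d rather than d*d.
print(f"active fraction: {k / d:.1%}")         # ~1.6% of units fire
```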

Thumbnail 390

Thumbnail 410

The good news is we did it. We are introducing the Baby Dragon Hatchling, our AI architecture, which picks the good parts of the brain and basically builds the whole AI around it. We went back to square one and thought about how an artificial neural network could behave like the brain network. As a result of this quest, we are bringing together this brain inspiration about large sparse networks with GPU friendliness. As you know, GPUs love dense matrix multiplication. They don't love sparse matrix multiplication as much. We made the two work together.

We inherited the brain's capacity for learning. We actually have a synapse network which is sparse and which is evolving as the model is solving tasks. So we do have this memory which keeps on adapting to the task at hand. We inherit the computational advantages of the brain. The model has good information processing locality, which means that certain parts are activated for central concepts, and you only activate what's needed. This also means that this can scale because localized information processing means it's easier to decentralize, easier to go from a GPU to a server to a cluster, basically to a data center.
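The idea of a synapse network that evolves while the model solves tasks can be sketched with a toy Hebbian-style update. This is my own illustrative stand-in, not the equations from Pathway's BDH paper: weights exist only on a sparse mask, and only synapses between co-active neurons change.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 512                          # neurons (toy scale)
density = 0.02                   # ~2% of possible synapses exist

# Sparse synapse matrix: most entries are zero; knowledge sits in clusters.
mask = rng.random((n, n)) < density
S = rng.standard_normal((n, n)) * mask

def step(x, S, lr=0.01):
    """One inference step with a local, Hebbian-style synapse update."""
    y = np.maximum(S @ x, 0.0)            # sparse, rectified activation
    y /= np.linalg.norm(y) + 1e-8         # keep activity bounded
    # Local plasticity: only existing synapses between co-active neurons
    # change, so learning on one task barely disturbs the rest of the net.
    S = S + lr * np.outer(y, x) * mask
    return y, S

x = np.maximum(rng.standard_normal(n), 0.0)
x /= np.linalg.norm(x)
for _ in range(5):                        # the network adapts while it works
    x, S = step(x, S)
```

Because every update is local to a handful of synapses, this kind of memory keeps adapting to the task at hand without retraining the whole model.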

Thumbnail 470

Thumbnail 480

Thumbnail 490

No central coordination basically means supreme scaling and going from small tasks to large tasks. We have published a very rigorous scientific paper which draws the full path from the brain all the way to how the model operates, finding this brain-like network inside of the AI. And finally, we have models which really work. On the benchmarks we have tried, the models match or exceed the Transformer's performance. So we do have the correct inductive biases, we do have the correct energy efficiency, and we do have the correct model to build the next AI, to build the post-Transformer AI models.

Thumbnail 520

How Baby Dragon Hatchling Works: Sparse Activation, Continuous Learning, and Enterprise Advantages

Let me now hand over to Victor, who is going to explain how those scientific breakthroughs translate into the enterprise-ready features you actually care about. So let's talk a little bit about how this works conceptually. Think about our model a little bit like a darkened high school gymnasium with millions of little LEDs, all turned off. As data flies into the room, only certain ones of these LEDs light up. These LEDs are literally the components and building blocks of our model. They're a little bit like weights, a little bit like memory, kind of a combination of the two.

Only a very small portion of these things gets used at any given time with any piece of data. An active unit says, "This is important to me and to a few of my buddies here; pass this along and tell them they need to know about this thing," but most of the model does nothing. That is what allows us to keep the model in memory and keep the model going very quickly. That's the image I'm hoping we can implant inside your head. So what do we get from this? Right now we have a model that is continuously learning from the data itself as it comes in. It's no longer a model that has to be fine-tuned, because the fine-tuning effectively happens as data comes in.
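The gymnasium analogy can be sketched as sparse message passing. This is a hypothetical illustration, with made-up sizes, of how a piece of data might light a few units that each tap only a few neighbors:

```python
import numpy as np

rng = np.random.default_rng(2)
n_leds = 100_000                  # "LEDs" in the darkened gymnasium
fan_out = 8                       # each LED only knows a few buddies

# Sparse connectivity: each unit keeps a short list of neighbors.
neighbors = rng.integers(0, n_leds, size=(n_leds, fan_out))

def propagate(lit, hops=2):
    """A piece of data lights a few LEDs; each one taps only its buddies."""
    for _ in range(hops):
        lit = np.unique(neighbors[lit].ravel())   # "pass this along"
    return lit

lit = propagate(rng.integers(0, n_leds, size=20))  # 20 LEDs light up first
print(f"{lit.size / n_leds:.2%} of the model did any work")  # ~1% of units
```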

Your model becomes much faster with improved throughput, and you don't need to use as large a model. This efficiency translates directly to cost savings and performance improvements.

On power efficiency, think back to that high school gymnasium. During traditional model training, you literally have to light up every LED in that room at all times, which is why training models consumes so much power and so many GPU resources. In our model, only a small piece is active at any given time, so it remains lightweight and available, especially for really long reasoning tasks.
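As rough back-of-envelope arithmetic (hypothetical layer sizes, not figures from the talk), here is why lighting only a few LEDs saves so much compute per step:

```python
# Back-of-envelope compute per layer per step (illustrative sizes only).
d = 8192                  # hidden width of a dense layer
dense_flops = 2 * d * d   # every "LED" lit: all d*d weights multiply-add

k = 128                   # lit units in a sparsely activated layer
sparse_flops = 2 * k * d  # only the lit LEDs do any arithmetic

print(f"dense : {dense_flops:>12,} FLOPs")   # 134,217,728
print(f"sparse: {sparse_flops:>12,} FLOPs")  #   2,097,152
print(f"ratio : {dense_flops // sparse_flops}x less work per step")  # 64x
```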

Thumbnail 690

The third major advantage is the increase in functionality. If you think about the diagram we showed you regarding why all frontier models feel the same, it's because they are trained on the same data with the same number of parameters. Think of us as almost like another axis, where we're opening up all this new functionality that you can use for your enterprise applications.

Let's talk about a couple of examples of things you could do in a non-transformer model specifically inside of BDH. Beyond continuous learning, let's discuss attention span. Right now the best models and highest-end reasoning models have an attention span of about two hours, but what nobody really tells you is that's with a fifty percent success rate. If you want to get to an eighty percent success rate, they can only focus on a task for about thirty minutes.

That does not help us in putting the functionality we have inside our corporations into an LLM. Think about a model that could take on a process as complex as closing a quarter. All of a sudden you have eight departments working for weeks on end, coordinating all the different things that have to happen so a company can release and audit its public results. This is the kind of attention span we're talking about with models like this: having a really smart person who remembers the quarter-end process and is there with you the whole time.

The second thing we do is learn from very thin data sets. A lot of corporations don't go through these processes many times, so there isn't much training data to learn from. When we're learning things as kids, how many times do you have to taste soap to remember what it tastes like? Right now, in the world of modern LLMs, you literally have to taste soap thousands of times before the LLM says this is what soap tastes like. Think about something you want to learn and internalize into the model: that's what a thin data set, combined with this long attention span, lets you do.
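For contrast with gradient descent's thousands of repetitions, an associative, Hebbian-style memory can store a lesson from a single exposure. This toy sketch (my own illustration, not BDH's published mechanism) binds two random patterns in one shot:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 256
soap = rng.standard_normal(d)         # the stimulus
bad_taste = rng.standard_normal(d)    # the lesson

# One-shot Hebbian association: a single exposure writes the memory.
S = np.outer(bad_taste, soap) / d     # "taste soap once"

recall = S @ soap                     # the next time soap shows up...
cos = recall @ bad_taste / (np.linalg.norm(recall) * np.linalg.norm(bad_taste))
print(f"recall after ONE example: cosine = {cos:.2f}")   # ~1.00
```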

Thumbnail 870

The third thing that happens is that in highly regulated industries, with lots of regulated data, we can actually look inside the model to see what's happening. If you think about the analogy we painted of all the little small LEDs, we can actually count and see what's going on inside each of these LEDs and keep track of that. So all of a sudden you have observability and auditability, and you can now start using this model in places where you would never really use a black-box type of LLM.
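One hypothetical sketch of what such per-unit observability could look like in practice (not Pathway's actual tooling) is a simple activation ledger that counts which units fired for each request:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
d, k = 4096, 64
audit_log = Counter()                 # activation counts per unit ("LED")

def record(h):
    """Note which k units lit up for this request."""
    active = np.argsort(np.abs(h))[-k:]
    audit_log.update(active.tolist())
    return active

for _ in range(1000):                 # simulated inference requests
    record(rng.standard_normal(d))

# An auditor can now ask which units drove decisions, and how often.
print(audit_log.most_common(5))
```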

Customer Engagement and Design Partner Opportunities

What's next? Right now we literally just started making our announcements and have been peeking out there. Those of you who looked at the Wall Street Journal the other day saw a nice little story on us. We are just opening up our customer engagement and are looking for design partners along with our partners at AWS. This model will be available sometime in the middle of the year, the first half of the year, and we are looking for folks that have use cases.

We are looking for folks that have use cases right now that have not worked using traditional LLMs, whether it's a thin data use case or one of these long attention use cases or a highly regulated use case. Please talk to us about that. Our model of monetization is actually based on a consumption function, so we earn money on all the tokens that we put through the machine. In essence, what we're doing is building this concept of sticky inference.

With a lot of inferences that go through standard models, you're probably going to want to go through whoever's cheapest, because it doesn't matter whether the outcome comes from competitor A or competitor B—the answers are pretty similar. So if they're pretty similar, you choose whoever's the cheapest. But what makes something truly sticky is your internal corporate data. Put that in the model and see the answers that you're going to get out of Baby Dragon. That is the use case we're looking for, so reach out to us.

We have an email, and you could talk to us afterwards if you have these use cases. We are literally starting to sign up design partners today. With that, Jan and I wanted to say thank you very much. We will be able to take questions, but the rule was we had to take questions off to the side. We can't do questions in this format. Thank you and happy building.


This article is entirely auto-generated using Amazon Bedrock.
