The New Role of AI in Science: A Supercomputer Emulator

Artificial intelligence has become an indispensable tool in many scientists’ lives, such that its use by researchers now has its own moniker—AI4Science—used by conferences and laboratories. Last month, Microsoft announced its own AI4Science initiative, employing dozens of people spread across several countries. Chris Bishop, its director, started on the science side before gravitating to AI. He earned a Ph.D. in quantum field theory at the University of Edinburgh, then worked in nuclear fusion before machine learning caught his eye in the 1980s. He began applying neural networks to his own work. “I was kind of 25 years early,” he says, “but it really has taken off.” He joined Microsoft Research’s Cambridge lab in 1997, eventually becoming its director, and now has a new role. We spoke about the evolution of the scientific method, lasers versus beer, and nerdy T-shirts.

All it really is is a new team that we’re building. We see a very exciting opportunity over the next decade at the intersection of machine learning and the natural sciences: chemistry, physics, biology, astronomy, and so on. It goes beyond simply the application of machine learning in the natural sciences.

How does it go beyond that?

The second paradigm is theoretical. Consider Newton's laws of motion or Maxwell's equations. These are typically differential equations. Assuming that they describe the world more broadly is an inductive step. Such an equation fits on a T-shirt, yet it is highly accurate across a wide range of length and time scales.

The third paradigm began in the middle of the 20th century with the invention of digital computers, which made it possible to solve these differential equations through simulation, for weather forecasting and other uses.

The fourth paradigm, which became popular in the twenty-first century, was not about using computers to solve equations from first principles. Instead, it was about using computers to analyze empirical data at large scale. That is where machine learning flourishes. Consider the Large Hadron Collider, the James Webb Space Telescope, or research involving protein binding.

These four paradigms all function in concert.

A fifth paradigm is now beginning to emerge. Its roots go back a long way, but it uses machine learning in the natural sciences in a novel way. In the third paradigm, you run a demanding simulation on a supercomputer, and the next day someone poses a new query. You take a deep breath and feed more coins into the electricity meter. We can now use the simulation's inputs and outputs as training data to train deep neural networks that emulate the simulator. If the emulator is used frequently, the cost of creating the training data and the cost of training are amortized. You then have a hopefully fairly general-purpose emulator that you can run orders of magnitude faster than the simulation.
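As a rough illustration of the emulation idea, the Python sketch below runs a slow simulator on a set of training inputs and then answers new queries from the stored input/output pairs. Everything in it is an assumption made for illustration: the damped-oscillator `simulate` function stands in for a supercomputer simulation, and a piecewise-linear interpolant stands in for the deep neural network Bishop describes. None of it is taken from Microsoft's work.

```python
import bisect


def simulate(k, c=0.5, t_end=1.0, steps=5000):
    """Hypothetical 'expensive' simulator: integrate x'' = -k*x - c*x'
    from x(0)=1, v(0)=0 with semi-implicit Euler and return x(t_end)."""
    dt = t_end / steps
    x, v = 1.0, 0.0
    for _ in range(steps):
        v += (-k * x - c * v) * dt
        x += v * dt
    return x


def build_emulator(inputs):
    """Run the simulator once per training input, then return a cheap
    emulator that interpolates linearly between the stored outputs."""
    xs = sorted(inputs)
    ys = [simulate(k) for k in xs]  # the costly, one-off step

    def emulate(k):
        # Refuse to extrapolate beyond the training data.
        if not xs[0] <= k <= xs[-1]:
            raise ValueError("input outside the training range")
        i = min(max(bisect.bisect_right(xs, k), 1), len(xs) - 1)
        x0, x1 = xs[i - 1], xs[i]
        w = (k - x0) / (x1 - x0)
        return (1 - w) * ys[i - 1] + w * ys[i]

    return emulate


# 50 training inputs spread evenly over the stiffness range [1, 10].
train_inputs = [1.0 + 9.0 * i / 49 for i in range(50)]
emulator = build_emulator(train_inputs)

# A new query costs one table lookup instead of 5,000 integration steps.
query = 4.321
error = abs(emulator(query) - simulate(query))
```

The 50 simulator runs are paid for once, and each later call to `emulator` reuses them. The emulator in this sketch raises an error for inputs outside the range it was built on rather than guessing.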

Roughly how much simulation data is needed to train an emulator?

A lot of machine learning is an empirical science. It involves trying out different architectures and amounts of data and seeing how things scale. You can’t say ahead of time, I need 56 million data points to do this particular task.

What is interesting, though, are techniques in machine learning that are a little bit more intelligent than just regular training, such as active learning and reinforcement learning, where a system has some understanding of its limitations. It could request more data where it has more uncertainty.
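The idea of requesting more data where uncertainty is highest can be sketched in a few lines of Python. This is a toy under stated assumptions, not a description of any real system: `simulate` is a hypothetical stand-in for an expensive simulator, the emulator is a piecewise-linear interpolant rather than a neural network, and the "uncertainty" is a crude proxy (interval width times the largest change in slope relative to neighbouring intervals) rather than a learned estimate.

```python
import math


def simulate(x):
    """Hypothetical expensive simulator with a sharp transition near x = 0.5."""
    return math.tanh(20.0 * (x - 0.5))


def interpolate(xs, ys, x):
    """Piecewise-linear emulator built from sorted training pairs."""
    for i in range(1, len(xs)):
        if x <= xs[i]:
            w = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return (1 - w) * ys[i - 1] + w * ys[i]
    return ys[-1]


def interval_scores(xs, ys):
    """Uncertainty proxy per interval: width times the largest slope
    change between this interval and its immediate neighbours."""
    slopes = [(ys[i + 1] - ys[i]) / (xs[i + 1] - xs[i])
              for i in range(len(xs) - 1)]
    scores = []
    for i, s in enumerate(slopes):
        jumps = [abs(s - slopes[j]) for j in (i - 1, i + 1)
                 if 0 <= j < len(slopes)]
        scores.append((xs[i + 1] - xs[i]) * max(jumps))
    return scores


def active_learning(initial_xs, rounds):
    """Each round, query the simulator at the midpoint of the interval
    with the highest uncertainty score and add the result to the data."""
    xs = sorted(initial_xs)
    ys = [simulate(x) for x in xs]
    for _ in range(rounds):
        scores = interval_scores(xs, ys)
        i = scores.index(max(scores))
        x_new = 0.5 * (xs[i] + xs[i + 1])
        xs.insert(i + 1, x_new)
        ys.insert(i + 1, simulate(x_new))
    return xs, ys


def max_error(xs, ys, n=1001):
    """Largest emulator error on a dense grid over [0, 1]."""
    grid = [k / (n - 1) for k in range(n)]
    return max(abs(interpolate(xs, ys, x) - simulate(x)) for x in grid)


start = [0.0, 0.25, 0.5, 0.75, 1.0]
err_before = max_error(start, [simulate(x) for x in start])
xs, ys = active_learning(start, rounds=20)
err_after = max_error(xs, ys)
inside = sum(1 for x in xs if 0.25 < x < 0.75)
outside = len(xs) - inside
```

In this toy setup most of the 20 extra simulator calls land near the transition, where the emulator is least reliable, instead of being spread evenly over the flat regions.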

What are emulation’s weaknesses?

They can still be computationally very expensive. Additionally, emulators learn from data, so they’re typically not more accurate than the data used to train them. Moreover, they may give insufficiently accurate results when presented with scenarios that are markedly different from those on which they’re trained.

Are all of Microsoft AI4Science’s projects based on emulation?

No. We do quite a bit of work in drug discovery. That’s at the moment entirely fourth-paradigm-based. It’s based on empirical observations of the properties of certain molecules, and using machine learning to infer the properties of molecules that weren’t part of the training set, and then to reverse that process and say, given a set of properties, can we find new molecules which have those properties? We have a five-year research partnership with Novartis.

We’re looking actively at partnerships. Microsoft brings a couple of things. We have a lot of expertise in machine learning. We also have a lot of expertise in very-large-scale compute and cloud computing. What we’re not endeavoring to do, though, is to be domain experts. We don’t want to be a drug company, we don’t want to be an expert in catalysis. We are bringing in people who have expertise in quantum chemistry, quantum physics, catalysis, and so on, but really to allow us to build an interface with collaborators and partners.

The bigger picture is we’re working anywhere we’ve got these differential equations. It could be fluid flows, designing turbines, predicting the weather, large-scale astronomical phenomena, plasma in nuclear reactors. A lot of our emphasis is on molecular-scale simulation. Scientifically, it holds some of the most challenging and some of the most interesting problems, but also the applicability is enormous—drug discovery, sustainability. We’ve been thinking about direct air capture of carbon dioxide.

We have, I guess, three goals. First and foremost, it’s about building up our research. Peer-reviewed publication will be a key outlet.

Second, Microsoft is a company whose business model is empowering others to be successful. So one of the things we’ll be looking for is how we can turn some of the research advances into cloud-based services which can then be used commercially or by the academic world. The breadth of applicability of this is potentially enormous. If you just think about molecular simulation, it’s drugs, it’s lubricants, it’s protecting against corrosion, it’s carbon capture, it’s catalysis for the chemical industry, and so on.

And then the third goal, ultimately, is to see real-world impact: Health care, sustainability, climate change.

Do you foresee advances not just in the domains where you’re helping partners but also in pure computer science and machine learning?

That’s a great question. I believe in “use-inspired basic research.” People think in terms of a very linear model, in which you have basic research at one end and applied research at the other. A great example would be Einstein. He discovers stimulated emission with a pencil and paper and a brain, and then later it gets used to build the laser.

But there’s a different kind of research, which is often characterized by the work of Pasteur. He was a consultant for the brewing industry. Why did this beer keep going sour? He basically founded the whole field of microbiology. I think about that as use-inspired basic research.

I hope to see that as we go after really hard problems. We’re trying to build a neural net that can understand the dynamics of molecules, and we’re going to need new neural-network architectures. And that might spill over into completely different domains.

What will the sixth scientific paradigm be? Will AI generate new hypotheses?

I have no idea what the sixth paradigm is. But I think the fifth paradigm will keep us pretty busy for the next decade or more.

This transcript has been edited for brevity and clarity.