Author: Jason Corso (Professor of Robotics and EECS at University of Michigan | Co-Founder and Chief Science Officer @ Voxel51)
As long as open source AI only contains software and trained model weights, it should not be considered or called open source AI. But, is that a problem?
It’s more than a badge of honor these days; it seems to be a religious proclamation. “I’ve open-sourced my AI.” Like the gospel from atop the mountain. From the not-yet-known research paper I’m handling as Area Chair for CVPR 2024 to the epitome of the AI influencer, Yann LeCun, open source artificial intelligence seems to be top of mind. There’s an (albeit rather brief) Wikipedia page on it. There’s even a grant program from a major venture capital firm, Andreessen Horowitz, for open source AI: https://a16z.com/supporting-the-open-source-ai-community/.
The push to open source AI is both interesting and relevant. Open source software has had a significant impact on computing over the last fifty years: Linux, MySQL, and Apache, for example, are all open source software and formed the backbone of the dot-com web.
But, I find this conversation a bit hard to digest. It is not because I don’t appreciate the value or need of openness in innovation. In fact, I’m a big advocate of openness in innovation. I try to release open source code to reproduce all results in every paper I write, and I require that my students do the same. I also started my own open source AI-software company, Voxel51.
It’s also not because there is a lack of open source software in AI out there. In fact, GitHub claims that there were 65K new open source AI projects created in 2023 alone.
So, what’s the problem? Well, what exactly is open source AI? What does everyone mean when they say things like “Open source models have become a critical part of the AI landscape,” “Open-source AI is critical to empower a whole new generation of tech startups,” and “[AI] is way too dangerous to be proprietary…it will have to be open source”?
For now, the status quo seems to be releasing some functional software and model weights and calling it open source AI. For example, Figure 1 shows Meta’s explanation for why they chose this status quo.
Figure 1: Screenshot from the Llama 2 docs explaining why only the model weights and some code are released for Llama 2. Source: the Llama 2 FAQ.
Is that enough? Or is it bull? “Plenty to work with” sounds like condescending legal lip service to me. Certainly, one can directly run the system with those two pieces. But, can one reproduce the work from scratch? No. Sometimes even the code released is only the core piece and not the ever-important glue around it. But, does it matter?
To get to the bottom of these questions, we need to understand what makes up a modern AI system and come to some agreement on what it means to be open source AI. Only then can we consider what we need out of open source AI. I am not the first to ask these questions, thankfully. But this article provides a concrete proposal for open source AI.
What makes a modern AI system?
At the risk of being verbose, let me spell out what I see as the seven key pieces of an AI system.
System Source Code. The system-specific software that implements the key functionality, including model architecture specification, wrapper code around models for both input and output, and any other classical source code used to implement and deploy the entire AI system. In an ideal world, one also has access to documentation for this code, which is maintained alongside the source code itself (this is often not the case). Importantly, that documentation also needs to describe the hardware and driver requirements.
Model Parameters. The part of the system that complements the system source code to be fully deployable. This includes the binary files for the trained model weights along with any parameters necessary to properly configure the system.
Dataset. The raw media and any annotations on that raw media used by the training source code to produce the model parameters. This may also include held-out validation or similar data against which the best candidate models are selected.
Hyperparameters. The configuration values necessary to train the model parameters, such as learning rate. These can be individual specific values or ranges of values along with selection criteria.
Training Source Code. The part of the software system that transforms hyperparameters, the dataset and part of the system source code into the model parameters. Similar to the system source code, this should be documented, including the training hardware requirements.
Random Number Generation. For certain AI systems, the random number generator used along with the random number seeds may be necessary to fully reproduce a session.
Software Frameworks. These are the libraries and frameworks on which the system source code is built. One needs access not only to the frameworks (many of these are already open source software, such as PyTorch and TensorFlow) but also to the specific versions used in the system source code and the training source code. Details matter.
All of these are required to build and deploy an AI system. All of these are “married” together in the system. One cannot divorce the system source code from the model parameters; one cannot divorce the system source code and the model parameters from the dataset and the hyperparameters. They are one.
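To make the last two pieces concrete, here is a minimal sketch of recording a random seed and the framework versions alongside a run. It uses only the Python standard library. The names `capture_environment` and `sample_run`, the seed value, and the package list are illustrative assumptions, not part of any actual release. Real frameworks have their own seeding calls, and fixing a seed does not by itself guarantee identical results across different hardware.

```python
import json
import platform
import random
from importlib import metadata


def capture_environment(seed, packages):
    """Record the seed plus interpreter and package versions (hypothetical helper)."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # package is not installed in this environment
    return {
        "seed": seed,
        "python": platform.python_version(),
        "packages": versions,
    }


def sample_run(seed, n=3):
    """Stand-in for a stochastic training step: draws n pseudo-random numbers."""
    rng = random.Random(seed)  # a private generator, so global state is untouched
    return [rng.random() for _ in range(n)]


# Assumed seed and package names, chosen only for illustration.
record = capture_environment(seed=1234, packages=["torch", "tensorflow"])
print(json.dumps(record, indent=2))

# The same seed reproduces the same draws; a different seed does not.
print(sample_run(1234) == sample_run(1234))
print(sample_run(1234) == sample_run(4321))
```

Shipping a record like this with a release costs almost nothing, yet without it a reader cannot tell which framework versions or seeds produced the published parameters.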
A definition of Open Source AI
Now, how do those pieces of an AI system relate to open source AI? When someone says open source software, they generally mean people can access the source code to read, understand, modify, extend, and share. There are numerous open source licenses, but I’m talking about a mindset here. And, there is a related notion of free software, but that’s a different direction to take this discussion.
The essence of open source relevant to this article is that sufficient material is provided to fully understand, analyze, and assess the properties of the software, such as safety and privacy, and, ultimately, to reproduce, modify, and extend its capabilities. By being able to independently reproduce a capability, one builds confidence in the ideas, engineering efforts, etc., behind that capability, and can potentially build on it. Without independent reproducibility, one must rely on trust or faith alone.
In the context of modern AI systems, what is required to achieve the same level of sufficiency as open source?
Above, I enumerated the full body of materials needed to create a modern AI system. These seven parts together comprise the full AI system. This represents a significant evolution of “software” from what we were accustomed to over the last few decades. Karpathy, for example, calls this Software 2.0, which I think captures the evolution nicely.
Open Source AI is all of the key elements of an AI system, including the system source code, model parameters, dataset, hyperparameters, training source code, random number generation, and software frameworks that are necessary to understand, reproduce, modify, and extend the functionality of the AI system.
To fully understand the functionality and capabilities of an AI system, each of these seven pieces needs to be included in the definition of open source AI and in the release. The system source code. The model parameters. The dataset. The hyperparameters. The training source code. The random number generation. The software frameworks. Without any single one of these, we are left with almost nothing. Certainly, we lose any ability to fully understand and reproduce the functionality of an AI system. There is no room for proprietary exclusion in any one of the seven parts. Furthermore, one needs all of it: the actual content. The actual content cannot merely be described via some reference, such as “Model Cards,” which have the right mindset but fall short of actually delivering on reproducibility.
Otherwise, claiming open source AI in a release is nothing more than lip service. Why? Open source AI requires an ability to understand, reproduce, modify, and extend the functionality of the AI system. Without all the pieces, it is not possible to achieve those ends. Period.
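One way to read this definition is as an all-or-nothing checklist. The sketch below encodes the seven pieces as fields of a hypothetical `ReleaseManifest` and reports which ones a release omits. The class, the function names, and the file paths are assumptions made up for illustration. It checks only whether each piece is listed, not whether the content behind it is complete or usable.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ReleaseManifest:
    """Hypothetical record of where each of the seven pieces lives in a release."""
    system_source_code: Optional[str] = None
    model_parameters: Optional[str] = None
    dataset: Optional[str] = None
    hyperparameters: Optional[str] = None
    training_source_code: Optional[str] = None
    random_number_generation: Optional[str] = None
    software_frameworks: Optional[str] = None


def missing_pieces(manifest: ReleaseManifest) -> List[str]:
    """Return the names of the pieces that the release does not provide."""
    return [f.name for f in fields(manifest) if not getattr(manifest, f.name)]


def is_open_source_ai(manifest: ReleaseManifest) -> bool:
    """Under the definition above, a release qualifies only if nothing is missing."""
    return not missing_pieces(manifest)


# A weights-plus-code release, with made-up paths.
weights_and_code = ReleaseManifest(
    system_source_code="inference/",
    model_parameters="weights.bin",
)

print(is_open_source_ai(weights_and_code))  # False
for name in missing_pieces(weights_and_code):
    print("missing:", name)
```

Run on a weights-plus-code release, the check fails and lists the other five pieces as missing.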
Importantly, it may not even be possible to actually “fully understand a modern AI system” (see my final remark in the Closing), but it is nevertheless critical that all seven of these pieces be included in the definition of open source AI.
The Status Quo in Open Source AI
Status quo open source AI is bullshit.
Contemporary open source AI falls short of any true or complete notion of openness. It does not come close to my definition. There are pieces of the puzzle being released independently and mostly complementarily by different groups. For example, Google led the development and release of the widely used Open Images dataset; others have released model code to use various datasets, but with limited training hyperparameters or code, such as the team who won the ECCV 2022 Ego4D challenge. I don’t mean to single out any one group here; this is the status quo. Release a dataset. Release a model. I myself have done this too. Perhaps it is easy to find references from Google and Meta because they have done such a good job of releasing elements of open source AI.
So, what are we to do? Releasing elements of open source AI is a great step. But, it’s not truly open source AI. Is it harming the community? No, it even helps in some sense; sometimes impressively. Yet, we are increasingly seeing larger AI systems that are widely leveraged for myriad applications, such as Llama 2, being only partially released under the moniker of open source AI. That’s simply not acceptable to me.
Does everyone have their own right to open source or not? Absolutely. It’s similar to the copyright licensing issue for creators; some choose to give away their content with no licensing requirement while others enforce a need to license their content for further use. No difference.
But the lack of full open source AI limits the community’s ability to understand how these systems work, why they work, and how to best advance. Furthermore, you’re embarrassing yourself if you call a release open source AI without actually meeting a rigorous notion of openness like the one I have provided here. In this era, AI is much more than software.
Closing
The essence of this article is that to fully understand, reproduce, modify, and extend an AI system (parts of which may not yet be possible), one needs access to every aspect of the system: all software, all documentation, all data, and all parameters. This may be a hard pill to swallow. But the lack of full open source AI limits the community’s ability to understand how these systems work, why they work, and how to best advance.
Step up: open source all of your AI or stop giving us the lip service.
Does it matter? Now, for the final remark, I’ll lay the groundwork for some future article. I’m a full advocate for open source AI. But, progress in science sometimes has a different face: in some scenarios, for example, visual psychophysics experiments, the subject (analogous to the AI system) is fully closed source (a human). In these cases, there is no notion of open source. There is, however, a notion of reproducibility, which, again, is the key. When an AI system is open sourced, one is tempted to take a shortcut to this type of experimental reproducibility. But, the ability to repeat an AI system’s output on a different computer does not bring along with it any notion of this type of scientific reproducibility. What does AI reproducibility mean?
Acknowledgements
Thank you to my friends and colleagues who read early versions of this article and inspired powerful changes, especially Jeffrey Siskind, Filipos Bellos, Yayuan Li, Dave Mekelburg, Michelle Brinich, and Jacob Marks. Thank you to everyone who has created or contributed to open source technology. It has changed the world.
Biography
Jason Corso is Professor of Robotics, Electrical Engineering and Computer Science at the University of Michigan and Co-Founder / Chief Science Officer of the AI startup Voxel51. He received his PhD and MSE degrees at Johns Hopkins University in 2005 and 2002, respectively, and a BS Degree with honors from Loyola University Maryland in 2000, all in Computer Science. He is the recipient of the University of Michigan EECS Outstanding Achievement Award 2018, Google Faculty Research Award 2015, Army Research Office Young Investigator Award 2010, National Science Foundation CAREER award 2009, SUNY Buffalo Young Investigator Award 2011, a member of the 2009 DARPA Computer Science Study Group, and a recipient of the Link Foundation Fellowship in Advanced Simulation and Training 2003. Corso has authored more than 150 peer-reviewed papers and hundreds of thousands of lines of open-source code on topics of his interest including computer vision, robotics, data science, and general computing. He is a member of the AAAI, ACM, MAA and a senior member of the IEEE.
Copyright 2024 by Jason J. Corso. All Rights Reserved.
No part of this publication may be reproduced, distributed, or transmitted in any form or by any means, including photocopying, recording, or other electronic or mechanical methods, without the prior written permission of the publisher, except in the case of brief quotations embodied in critical reviews and certain other noncommercial uses permitted by copyright law. For permission requests, write to the publisher via direct message on X/Twitter at JasonCorso.
Top comments (8)
Imagine writing a program. You finish it, you compile it, and you decide that releasing this program to the public would be good marketing for your actual product.
So you upload the compiled binary to the internet, and proclaim that you released a new open source project, but have decided to keep the code to yourself.
You would be, rightfully, ridiculed. YouTube videos would be made making fun of how shamelessly you appropriate the term while doing the exact opposite of what it means.
And yet, somehow, when Facebook does exactly this, they seem to get a pass. It makes sense from their perspective: open source sounds good, it makes people trust your software more, and they might have good reasons for not releasing their training data.
But companies have always had good reasons not to make their projects open source. We have a word for that: Freeware. And we shouldn't allow big corporations to change the words we use for the worse.
Llama 2 is a freeware AI model.
Thank you for sharing this article, and I appreciate your concerns. I recently wrote this article on the topic from a different angle. Most companies releasing the "very usable" models are large corporations that could release new versions with new licenses anytime. Also, they could decide to do the work inside their organization and no longer update the models for the community. We live in a time when people are hungry for these models, and how Meta and others plan to make them financially viable is opaque (at least to me).
Here is the article if you are interested:
The Tyranny of Choice: Open vs. Closed AI Models
A fascinating viewpoint: it's clear that the training data is a required part of the "source" of an AI model. I'll certainly be thinking about this differently.
Thanks for sharing. I totally agree that what we call an open source LLM is actually an open-to-deploy LLM. We have no idea what is going on inside Llama.
"open-to-deploy" is such a great phrase-- I'll definitely be thinking of it this way going forward.
Freeware. It's called freeware. The word has been around for a while.
"You're right, the term 'freeware' has been around for a while, but there's often confusion between freeware and open-source software, especially in discussions like these. Freeware is free to use but doesn’t necessarily provide access to the source code like open-source does. In contrast, open source allows for collaboration and modification, which is crucial for AI development."
My question is: what about BLOOM?