This is the second and final part of our 'Generative AI Openness' series. In the first part, we established a straightforward framework to gauge the openness of Large Language Models (LLMs) and used it to explore LLM development and the positioning of key players. We observed a trend towards increasingly restricted LLM artifacts from OpenAI and Google, contrasted with Meta's more open approach.
Now, let's venture into the realm of collaboration and reuse, keeping our openness matrix in mind, to uncover the multifaceted nature of openness in LLMs.
Before we delve into the specifics of different models, their development, and openness, let's start by considering the openness of some well-known LLMs from a broad perspective.
LLM Model Openness in a Nutshell
LLM | Model (weights) | Pre-training Dataset | Fine-tuning Dataset | Reward Model | Data Processing Code |
---|---|---|---|---|---|
Alpaca | 3 - Open with limitations | 1 - Published research only | 2 - Research use only | Not applicable | 4 - Under Apache 2 license |
Vicuna | 3 - Open with limitations | 1 - Published research only | 2 - Research use only | Not applicable | 4 - Under Apache 2 license |
GPT-J, GPT-Neo | 4 - Completely open | 3 - Open with limitations | Not applicable | Not applicable | 4 - Completely open |
Falcon | 3 - Open with limitations | 4 - Access and reuse without restriction | Not applicable | Not applicable | 1 - No code available |
BLOOM | 3 - Open with limitations | 3 - Open with limitations | Not applicable | Not applicable | 4 - Completely open |
OpenLLaMA | 4 - Access and reuse without restriction | 4 - Access and reuse without restriction | Not applicable | Not applicable | 1 - No complete data processing code available |
MistralAI | 4 - Access and reuse without restriction | 0 - No public information or access | Not applicable | Not applicable | 4 - Complete data processing code available |
Dolly | 4 - Access and reuse without restriction | 3 - Open with limitations | 4 - Access and reuse without restriction | 0 - No public information available | 4 - Access and reuse possible |
BLOOMChat | 3 - Open with limitations | 3 - Open with limitations | 4 - Access and reuse without restriction | 0 - No public information available | 3 - Open with limitations |
Zephyr | 4 - Access and reuse without restriction | 3 - Open with limitations | 3 - Open with limitations | 3 - Open with paper and code examples | 3 - Open with limitations |
AmberChat | 4 - Access and reuse without restriction | 4 - Access and reuse without restriction | 2 - Research use only | 0 - No public information available | 4 - Under Apache 2 license |
We often come across news about a new open-source LLM being released. Upon closer examination, accessing the model weights or using the model without restrictions is usually feasible; reproducing the work, however, is often difficult because the training datasets are unavailable or the data processing code is missing. The table also shows that many models lack fine-tuning with a reward model, which is crucial for the success of current LLMs and plays a significant role in reducing hallucination and toxicity. Even for the most open models, either no reward model is used or the reward model is not publicly accessible.
In the following sections, we will provide details of the LLM models mentioned in the table above, introduce their evolution, and explain their openness score.
1. Fine-tuned Models from LLaMA
Concluding the first part of this series, we highlighted two fine-tuned models based on LLaMA, subject to Meta's licensing constraints. Let's evaluate their openness level.
Alpaca is an instruction-oriented LLM derived from LLaMA, enhanced by Stanford researchers with a dataset of 52,000 examples of following instructions, sourced from OpenAI’s InstructGPT through the self-instruct method. The extensive self-instruct dataset, details of data generation, and the model refinement code were publicly disclosed. This model complies with the licensing requirements of its base model. Due to the utilization of InstructGPT for data generation, it also adheres to OpenAI’s usage terms, which prohibit the creation of models competing with OpenAI. This illustrates how dataset restrictions can indirectly affect the resulting fine-tuned model.
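To make the self-instruct approach more concrete, here is a minimal, illustrative sketch of the bootstrapping loop in Python. It is not the actual Stanford pipeline (which adds prompt templates, similarity-based deduplication, and batched API calls), and `query_teacher_model` is a hypothetical placeholder for a call to a teacher model such as InstructGPT.

```python
import json
import random

def query_teacher_model(prompt: str) -> str:
    """Hypothetical placeholder for a call to a teacher model API (e.g. an
    InstructGPT-class endpoint). Replace with a real client before running."""
    raise NotImplementedError

def self_instruct(seed_tasks: list[dict], target_size: int) -> list[dict]:
    """Grow an instruction dataset by asking a teacher model to imitate seed examples."""
    pool = list(seed_tasks)
    while len(pool) < target_size:
        # Show the teacher a few in-context examples and ask for a new one.
        examples = random.sample(pool, k=min(3, len(pool)))
        prompt = "Here are some instruction/response pairs:\n\n"
        for ex in examples:
            prompt += f"Instruction: {ex['instruction']}\nResponse: {ex['response']}\n\n"
        prompt += "Write one new, different instruction and its response as a JSON object."
        try:
            candidate = json.loads(query_teacher_model(prompt))
        except json.JSONDecodeError:
            continue  # skip malformed generations
        # Naive deduplication; the real pipeline uses similarity-based filtering.
        if candidate.get("instruction") and candidate not in pool:
            pool.append(candidate)
    return pool
```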
Vicuna is another instruction-focused LLM rooted in LLaMA, developed by researchers from UC Berkeley, Carnegie Mellon University, Stanford, and UC San Diego. They adapted Alpaca’s training code and incorporated 70,000 examples from ShareGPT, a platform for sharing ChatGPT interactions.
Alpaca and Vicuna Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | Both Vicuna and Alpaca are based on the LLaMA foundational model. |
Pre-training Dataset | 1 | Published research only | Both Vicuna and Alpaca are derived from the LLaMA foundational model and dataset. |
Fine-tuning Dataset | 2 | Research use only | Both models are constrained by OpenAI’s non-competition clause due to their training with data originating from ChatGPT. |
Reward model | NA | Not Applicable | Neither model underwent Reinforcement Learning from Human Feedback (RLHF) initially, hence there are no reward models for evaluation. It's worth noting that AlpacaFarm, a framework simulating an RLHF process, was released under a non-commercial license, and StableVicuna underwent RLHF fine-tuning on Vicuna. |
Data Processing Code | 4 | Under Apache 2 license | Both projects have shared their code on GitHub (Vicuna, Alpaca). |
Significantly, both projects face dual constraints: initially from LLaMA’s licensing on the model and subsequently from OpenAI due to their fine-tuning data.
2. Collaboration and Open Source in LLM Evolution
In addition to the foundation model Llama and its associated families of fine-tuned LLMs, there are many initiatives that contribute to promoting the openness of foundational models and their fine-tuned ones.
2.1. Foundational Models and Pre-training Datasets
Research in this area highlights the cost-effectiveness of developing instruction-tuned LLMs on top of foundational models through collaborative efforts and reuse. This approach, however, requires genuinely open-source foundational models and pre-training datasets.
2.1.1 EleutherAI
This vision is in line with EleutherAI, a non-profit organization founded in July 2020 by a group of researchers. Driven by the perceived opacity and the challenge of reproducibility in AI, their goal was to create leading open-source language models.
By December 2020, EleutherAI had introduced The Pile, a comprehensive text dataset designed for training models. Subsequently, tech giants such as Microsoft, Meta, and Google used this dataset for training their models. In March 2021, they revealed GPT-Neo, an open-source model under Apache 2.0 license, which was unmatched in size at its launch. EleutherAI’s later projects include the release of GPT-J, a 6 billion parameter model, and GPT-NeoX, a 20 billion parameter model, unveiled in February 2022. Their work demonstrates the viability of high-quality open-source AI models.
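Because the GPT-J weights are published under Apache 2.0 on the Hugging Face Hub, they can be used directly with the transformers library. The following is a minimal sketch, assuming the `transformers` and `torch` packages are installed and enough memory is available for the 6B-parameter weights; the prompt and sampling settings are purely illustrative.

```python
# Minimal sketch: text generation with EleutherAI's GPT-J through Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # Apache 2.0 licensed weights on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # ~24 GB in full precision

inputs = tokenizer("Open-source language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```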
EleutherAI GPT-J Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Completely open | GPT-J’s model weights are freely accessible, in line with EleutherAI's commitment to open-source AI. EleutherAI GitHub |
Pre-training Dataset | 3 | Open with limitations | GPT-J was trained on The Pile, a large-scale dataset curated by EleutherAI. While mostly open, parts of The Pile may have limitations. The Hugging Face page notes: "Licensing Information: Please refer to the specific license depending on the subset you use" |
Fine-tuning Dataset | NA | Not Applicable | GPT-J is a foundational model and wasn't specifically fine-tuned on additional datasets for its initial release. |
Reward model | NA | Not Applicable | GPT-J did not undergo RLHF, making this category non-applicable. |
Data Processing Code | 4 | Completely open | The code for data processing and model training for GPT-J is openly available, fostering transparency and community involvement. |
2.1.2 Falcon
In March 2023, a research team from the Technology Innovation Institute (TII) in the United Arab Emirates introduced a new open model lineage named Falcon, along with its dataset.
Falcon features two versions: the initial with 40 billion parameters trained on one trillion tokens, and the subsequent with 180 billion parameters trained on 3.5 trillion tokens. The latter is said to rival the performance of models like LLaMA 2, PaLM 2, and GPT-4. TII emphasizes Falcon’s distinctiveness in its training data quality, predominantly sourced from public web crawls (~80%), academic papers, legal documents, news outlets, literature, and social media dialogues. Its licensing, based on the open-source Apache License, allows entities to innovate and commercialize using Falcon 180B, including hosting on proprietary or leased infrastructure. However, it explicitly prohibits hosting providers from exploiting direct access to shared Falcon 180B instances and its refinements, especially through API access. Due to this clause, its license does not fully align with OSI’s open-source criteria.
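Because the RefinedWeb pre-training corpus is published on the Hugging Face Hub under ODC-By, it can be inspected directly. The sketch below streams a few documents with the `datasets` library; the dataset id and the `content` field name are taken from the dataset card and may evolve over time.

```python
# Minimal sketch: streaming a few documents from Falcon's RefinedWeb pre-training corpus.
from datasets import load_dataset

refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
for i, doc in enumerate(refinedweb):
    print(doc.get("content", "")[:200])  # first 200 characters of each web document
    if i == 2:
        break
```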
Falcon Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | Falcon's license is inspired by Apache 2 but restricts hosting uses. |
Pre-training Dataset | 4 | Access and reuse without restriction | The RefinedWeb dataset is distributed under the Open Data Commons Attribution License (ODC-By) and also under the CommonCrawl terms, which are quite open. |
Fine-tuning Dataset | NA | Not Applicable | Falcon is a foundational model and can be fine-tuned on various specific datasets as per use case, not provided by the original creators. |
Reward model | NA | Not Applicable | Falcon did not undergo RLHF in its initial training, hence no reward model for evaluation. |
Data Processing Code | 1 | No code available | General instructions are available here. |
2.1.3 LAION
Highlighting the international scope of this field, consider LAION and BLOOM.
LAION (Large-scale Artificial Intelligence Open Network), a German non-profit established in 2020, is dedicated to advancing open-source models and datasets (primarily under Apache 2 and MIT licenses) to foster open research and the evolution of benevolent AI. Their datasets, encompassing both images and text, have been pivotal in the training of renowned text-to-image models like Stable Diffusion.
2.1.4 BLOOM
BLOOM, boasting 176 billion parameters, is capable of generating text in 46 natural and 13 programming languages. It represents the culmination of a year-long collaboration involving over 1000 researchers from more than 70 countries and 250 institutions, which concluded with a 117-day run on the Jean Zay French supercomputer. Distributed under an OpenRAIL license, BLOOM is not considered fully open-source due to usage constraints, such as prohibitions against harmful intent, discrimination, or interpreting medical advice.
BLOOM Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | BLOOM’s model weights are publicly accessible, reflecting a commitment to open science, but under an OpenRAIL license. BigScience GitHub |
Pre-training Dataset | 3 | Open with limitations | BLOOM’s primary pre-training dataset, while officially under an Apache 2 license, is based on various subsets with potential limitations. |
Fine-tuning Dataset | NA | Not Applicable | BLOOM is a foundational model and wasn't fine-tuned on specific datasets for its initial release. |
Reward model | NA | Not Applicable | BLOOM did not undergo RLHF, hence no reward model for evaluation. |
Data Processing Code | 4 | Completely open | The data processing and training code for BLOOM are openly available, encouraging transparency and community participation. BigScience GitHub |
2.1.5 OpenLLaMA and RedPajama
RedPajama, an initiative led by Together AI, a startup founded in 2022 that advocates AI democratization, playfully references LLaMA and the children’s book “Llama Llama Red Pajama”.
The initiative has expanded to include partners like Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. In April 2023, they released a 1.2 trillion token dataset, mirroring LLaMA’s dataset, for training their models. These models, with parameters ranging from 3 to 7 billion, were released in September, licensed under open-source Apache 2.
The RedPajama dataset was adapted by the OpenLLaMA project at UC Berkeley, creating an open-source LLaMA equivalent without Meta’s restrictions. The model's later version also included data from Falcon and StarCoder. This highlights the importance of open-source models and datasets, enabling free repurposing and innovation.
OpenLLaMA Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Models and weights (in PyTorch and JAX formats) are available under the Apache 2 open-source license. |
Pre-training Dataset | 4 | Access and reuse without restriction | Based on RedPajama, Falcon, and StarCoder datasets. |
Fine-tuning Dataset | NA | Not Applicable | OpenLLaMA is a foundational model. |
Reward model | NA | Not Applicable | OpenLLaMA did not undergo RLHF, hence no reward model for evaluation. |
Data Processing Code | 1 | No complete data processing code available | The Hugging Face page provides some usage examples with transformers. |
2.1.6 MistralAI
MistralAI, a French startup, developed a 7.3 billion parameter LLM named Mistral for various applications. Although MistralAI is committed to open-sourcing its technology under Apache 2.0, the training dataset details for Mistral remain undisclosed. The Mistral Instruct model was fine-tuned using publicly available instruction datasets from the Hugging Face repository, though specifics about the licenses and potential constraints are not detailed. Recently, MistralAI released Mixtral 8x7B, a model based on the sparse mixture of experts (SMoE) architecture, consisting of several specialized expert networks (likely eight, as suggested by its name) that are activated as needed.
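As a concrete illustration of this openness, the Apache 2.0 licensed Mistral 7B Instruct checkpoint can be prompted with transformers' chat-template helper. This is a minimal sketch; the model id, device placement, and sampling settings are illustrative, and loading the 7B weights in full precision requires a capable GPU or quantization.

```python
# Minimal sketch: prompting Mistral 7B Instruct with transformers' chat-template helper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs `accelerate`

messages = [{"role": "user", "content": "Summarize what a mixture-of-experts model is."}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=120)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```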
Mistral and Mixtral Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Models and weights are available under the Apache 2 open-source license. |
Pre-training Dataset | 0 | No public information or access | The training dataset for the model is not publicly detailed. |
Fine-tuning Dataset | NA | Not Applicable | Mistral is a foundational model. |
Reward model | NA | Not Applicable | Mistral did not undergo RLHF, hence no reward model for evaluation. |
Data Processing Code | 4 | Complete data processing code available | Instructions and deployment code are available on GitHub. |
These examples illustrate the diverse approaches to openness in foundational models, emphasizing the value of sharing and enabling reuse by others.
2.2. Fine-tuned Models and Instruction Datasets
When pre-trained models and datasets are accessible with minimal restrictions, various entities can repurpose them to develop new foundational models or fine-tuned variants.
2.2.1 Dolly
Drawing inspiration from LLaMA and fine-tuned models like Alpaca and Vicuna, the big data company Databricks introduced Dolly 1.0 in March 2023.
This cost-effective LLM was built upon EleutherAI’s GPT-J, employing the data and training methods of Alpaca. Databricks highlighted that this model was developed for under $30, suggesting that significant advancements in leading-edge models like ChatGPT might be more attributable to specialized instruction-following training data than to larger or more advanced base models. Two weeks later, Databricks launched Dolly 2.0, still based on an EleutherAI model (this time Pythia), but now exclusively fine-tuned with a pristine, human-curated instruction dataset created by Databricks’ staff. They chose to open source the entirety of Dolly 2.0, including the training code, dataset, and model weights, making them suitable for commercial applications. In June 2023, Databricks announced the acquisition of MosaicML, a company that had just released its MPT (MosaicML Pre-trained Transformer) foundational model under the Apache 2.0 license, allowing anyone to train, refine, and deploy their own MPT models.
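The human-curated instruction dataset behind Dolly 2.0, databricks-dolly-15k, is itself published on the Hugging Face Hub under CC-BY-SA. Here is a minimal sketch of inspecting it with the `datasets` library; the field names follow the dataset card.

```python
# Minimal sketch: inspecting Databricks' human-curated instruction dataset.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(dolly), "examples")

example = dolly[0]
print("Category:   ", example["category"])
print("Instruction:", example["instruction"])
print("Response:   ", example["response"][:200])
```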
Dolly Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Dolly 2 is based on the EleutherAI foundational model and is released under the MIT license. |
Pre-training Dataset | 3 | Open with limitations | Dolly 2 is built upon the EleutherAI foundational model and inherits the same limitations as its dataset (The Pile). |
Fine-tuning Dataset | 4 | Access and reuse without restriction | The fine-tuning dataset was created by Databricks employees and released under the CC-BY-SA license. |
Reward model | 0 | No public information available | The Reward model of Dolly is not publicly disclosed. |
Data Processing Code | 4 | Access and reuse possible | Code to train and run the model is available on GitHub under the Apache 2 license. |
2.2.2 OpenAssistant
OpenAssistant stands as another example of the potential of open source fine-tuning datasets.
This initiative aims to create an open-source, chat-based assistant proficient in understanding tasks, interacting with third-party systems, and retrieving dynamic information. Led by LAION and international contributors, the project is distinctive in its reliance on crowdsourcing data from human volunteers. The OpenAssistant team has launched numerous crowdsourcing campaigns to gather data for a variety of tasks, such as generating diverse text formats (poems, code, emails, letters) and providing informative responses to queries. Recently, the LAION initiative also released an open-source dataset named OIG (Open Instruction Generalist), comprising 43M instructions and focusing on data augmentation rather than human feedback.
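The crowdsourced conversations were released on the Hugging Face Hub as the oasst1 dataset, where each row is one message of a conversation tree. A minimal sketch of loading it is shown below; the field names ("role", "lang", "text") follow the dataset card.

```python
# Minimal sketch: loading the crowdsourced OpenAssistant conversations (oasst1).
from datasets import load_dataset

oasst1 = load_dataset("OpenAssistant/oasst1", split="train")
msg = oasst1[0]
print(msg["role"], "-", msg["lang"])
print(msg["text"][:200])
```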
2.2.3 BLOOMChat
BLOOMChat, a fine-tuned chat model with 176 billion parameters, demonstrates the reuse of previous models and datasets. It underwent instruction tuning based on the BLOOM foundational model, incorporating datasets like OIG, OpenAssistant, and Dolly.
BLOOMChat Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | Based on the BLOOM foundational model, it inherits its restrictions (Open RAIL license). |
Pre-training Dataset | 3 | Open with limitations | Based on the BLOOM foundational model and dataset, it inherits potential restrictions. |
Fine-tuning Dataset | 4 | Access and reuse without restriction | The Dolly and LAION fine-tuning datasets are open source. |
Reward model | 0 | No public information available | The Reward model of BLOOMChat is not publicly disclosed. |
Data Processing Code | 3 | Open with limitations | Code to train and run the model is available on GitHub under an Open RAIL type license. |
2.2.4 Zephyr
Zephyr, developed by the Hugging Face H4 project, serves as another exemplary case of reusing an open foundational model. It is a fine-tuned version of the Mistral 7B model, trained on filtered versions of the UltraChat and UltraFeedback datasets. These datasets were generated using ShareGPT and GPT-4, potentially imposing transitive use restrictions on the resulting model.
Zephyr Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Zephyr is released under the MIT license, and the Mistral foundational model is under Apache 2. |
Pre-training Dataset | 3 | Open with limitations | Inherits possible restrictions from the Mistral foundational model and dataset. |
Fine-tuning Dataset | 3 | Open with limitations | ShareGPT and GPT-4 were used to produce the fine-tuning datasets, imposing limitations. |
Reward model | 3 | Open with paper and code examples | Zephyr was fine-tuned using Direct Preference Optimization (DPO) rather than RLHF. A paper, along with code and examples for this technique, is available; a minimal sketch follows this table. |
Data Processing Code | 3 | Open with limitations | Example code to train and run the model is available on Hugging Face. |
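To make the DPO step mentioned in the table more tangible, here is an illustrative sketch of the DPO loss on precomputed sequence log-probabilities. It is not the Hugging Face H4 training recipe itself, merely a toy rendition of the objective: widen the policy's preference margin between a chosen and a rejected answer relative to a frozen reference model, with no separate reward model.

```python
# Illustrative sketch of the Direct Preference Optimization (DPO) loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratio of policy vs. reference model for each answer.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): small when the chosen answer is clearly preferred.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy tensors standing in for summed per-token log-probabilities of two answers.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(float(loss))  # ~0.6 for this toy example
```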
2.2.5 LLM360
LLM360 is an emerging yet intriguing initiative focused on ensuring complete openness of all essential elements required for training models. This includes not only model weights but also checkpoints, training datasets, and source code for data preprocessing and model training. The project is a collaborative effort between the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE and two American AI companies. It has introduced two foundational models: Amber, a 7 billion parameter English LLM, and CrystalCoder, a 7 billion parameter LLM specializing in code and text. Additionally, there are fine-tuned versions of these models, including AmberChat.
AmberChat openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | AmberChat is released under the Apache 2 open source license. |
Pre-training Dataset | 4 | Access and reuse without restriction | Based on RedPajama v1, Falcon RefinedWeb, and StarCoderData. |
Fine-tuning Dataset | 2 | Research use only | The AmberChat dataset is based on WizardLM evol-instruct V2 and ShareGPT 90K, both derived from ShareGPT. |
Reward model | 0 | No public information available | The reward model is not available. DPO is mentioned as the alignment method for AmberSafe, another fine-tuned Amber model. |
Data Processing Code | 4 | Under Apache 2 license | Data processing and model training code are available under the Apache 2 open source license. |
An assessment of the openness of AmberChat reveals that while the Amber foundational model is quite open, providing unrestricted access to weights, dataset, and code, the fine-tuning process in AmberChat introduces certain limitations to this initial level of openness.
These diverse examples highlight how openness enables global collaboration, allowing entities to repurpose earlier models and datasets in ways typical of the research community. This approach clarifies licensing nuances and fosters external contributions. And it's this spirit of collaborative and community-driven innovation that is championed by a notable player: Hugging Face.
2.3. The Hugging Face Ecosystem
The Hugging Face initiative represents a community-focused endeavor, aimed at advancing and democratizing artificial intelligence through open-source and open science methodologies. Originating from the Franco-American company, Hugging Face, Inc., it's renowned for its Transformers library, which offers open-source implementations of transformer models for text, image, and audio tasks. In addition, it provides libraries dedicated to dataset processing, model evaluation, simulations, and machine learning demonstrations.
Hugging Face has spearheaded two significant scientific projects, BigScience and BigCode, leading to the development of two large language models: BLOOM and StarCoder. Moreover, the initiative curates an LLM leaderboard, allowing users to upload and assess models based on text generation tasks.
However, in my opinion, the truly revolutionary aspect is the Hugging Face Hub. This online collaborative space acts as a hub where enthusiasts can explore, experiment, collaborate, and create machine learning-focused technology. Hosting thousands of models, datasets, and demo applications, all transparently accessible, it has quickly become the go-to platform for open AI, similar to GitHub in its field. The Hub's user-friendly interface encourages effortless reuse and integration, cementing its position as a pivotal element in the open AI landscape.
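The Hub is also fully scriptable. The sketch below uses the `huggingface_hub` library to search model repositories and download a single file; argument names can vary slightly between library versions, so treat it as indicative rather than definitive.

```python
# Minimal sketch: programmatic access to the Hugging Face Hub.
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
# List a few public model repositories whose name mentions "zephyr".
for model in api.list_models(search="zephyr", limit=5):
    print(model.id)

# Download a single file (here, the model card) from a public repository.
readme_path = hf_hub_download(repo_id="HuggingFaceH4/zephyr-7b-beta", filename="README.md")
print("Model card saved to:", readme_path)
```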
3. On the Compute Side
3.1 Computing Power
Generative AI extends beyond software, a reality clearly understood by entities like Microsoft/OpenAI and Google as Cloud service providers. They benefit from the ongoing effort to enhance LLM performance through increased computing resources.
In September 2023, AWS announced an investment of up to $4 billion in Anthropic, whose models will be integrated into Amazon Bedrock and developed, trained, and deployed using AWS AI chips.
NVIDIA is at the forefront of expanding computing capacity with its GPUs. This commitment is reflected in their CUDA (Compute Unified Device Architecture) parallel computing platform and the associated programming model. Additionally, NVIDIA manages the NeMo conversational AI toolkit, which serves both research purposes and as an enterprise-grade framework for handling LLMs and performing GPU-based computations, whether in the Cloud or on personal computers.
The AI chip recently unveiled by Intel signifies a major advancement in user empowerment over LLM applications. Diverging from Cloud providers' strategies, Intel's aim is to penetrate the market for AI chips suitable for operations outside data centers, thus enabling the deployment of LLMs on personal desktops, for example, through its OpenVINO open-source framework.
This field is characterized by dynamic competition, with companies like AMD and Alibaba making notable inroads into the AI chip market. The development of proprietary AI chips by Cloud Service Providers (CSPs), such as Google's TPU and AWS's Inferentia, as well as investments by startups like SambaNova, Cerebras, and Rain AI (backed by Sam Altman), further intensifies this competition.
3.2 Democratizing AI Computing
The llama.cpp project, initiated by the developer Georgi Gerganov, exemplifies efforts to democratize AI and drive innovation by promoting CPU-compatible models. It is a C/C++ implementation of LLaMA inference, originally designed for use on MacBooks and now also supporting x86 architectures. Employing quantization, which reduces LLMs' size and computational demands by converting model weights from floating-point to lower-precision integer values, the project created the GGML binary format for efficient model storage and loading. Recently, GGML was succeeded by the more versatile and extensible GGUF format.
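The quantization idea itself is simple to illustrate. The toy NumPy sketch below maps float32 weights to 8-bit integers with a single per-tensor scale and then reconstructs them; llama.cpp's GGML/GGUF formats use more refined block-wise schemes (for example 4-bit groups with per-block scales), but the principle is the same.

```python
# Toy illustration of weight quantization: float32 -> int8 plus a scale, and back.
import numpy as np

weights = np.random.randn(6).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
q = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4
reconstructed = q.astype(np.float32) * scale   # approximate weights used at inference time

print("original     :", weights)
print("quantized    :", q)
print("reconstructed:", reconstructed)
print("max abs error:", np.abs(weights - reconstructed).max())
```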
Since its launch in March 2023, llama.cpp has quickly gained traction among researchers and developers for its scalability and compatibility across MacOS, Linux, Windows, and Docker.
The GPT4All project, which builds on llama.cpp, aims to train and deploy LLMs on conventional hardware. This ecosystem democratizes AI by allowing researchers and developers to work on their own machines, circumventing costly cloud computing services and enhancing the accessibility and affordability of LLMs. GPT4All includes a variety of models derived from GPT-J, MPT, and LLaMA, fine-tuned with datasets including ShareGPT conversations, so the licensing limitations discussed earlier for such data still apply. The ecosystem also features a chat desktop application, as previously discussed in a related article.
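GPT4All also ships Python bindings that run a quantized model entirely on CPU. The following is a minimal sketch; the model filename is a placeholder that must correspond to a model available in the GPT4All catalogue (it is downloaded on first use).

```python
# Minimal sketch of the GPT4All Python bindings (`pip install gpt4all`), CPU-only inference.
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # placeholder: any model from the catalogue
with model.chat_session():
    reply = model.generate("Explain in one sentence what an open-source LLM is.",
                           max_tokens=80)
    print(reply)
```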
LocalAI is another open-source project designed to facilitate running LLMs locally. Previously introduced in a dedicated article, LocalAI leverages llama.cpp and supports GGML/GGUF formats and Hugging Face models, among others. This tool simplifies deploying an open model on a standard computer or within a container, integrating it into a larger, distributed architecture. The BionicGPT open-source project exemplifies its usage.
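Because LocalAI exposes an OpenAI-compatible REST API, the standard `openai` Python client can simply be pointed at the local instance. The sketch below assumes a LocalAI server listening on localhost with at least one model configured; the endpoint, port, and model name are placeholders that depend on your deployment.

```python
# Minimal sketch: querying a LocalAI instance through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # placeholder endpoint
response = client.chat.completions.create(
    model="mistral-7b-instruct",  # placeholder: must match a model configured in LocalAI
    messages=[{"role": "user", "content": "What does it mean for an LLM to be open?"}],
)
print(response.choices[0].message.content)
```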
Conclusion
The evolution of Generative AI mirrors broader technological advancements, fluctuating between open collaboration and proprietary control. The AI landscape, as illustrated by the rise and diversification of LLMs, stands at a crossroads. On one side, tech giants drive towards commercialization and centralization, leveraging immense AI potential and vast computational resources. On the other, a growing movement champions open AI, emphasizing collaboration, transparency, and democratization.
The LLaMA project, despite its limited openness, has inspired numerous others, as evident in names like Alpaca, Vicuna, Dolly, Goat, Dalai (a play on Dalai Lama), RedPajama, and more. We may soon see other animal-themed projects joining this AI menagerie, perhaps flying alongside Falcon?
The range of projects from LLaMA to Falcon, and platforms like Hugging Face, highlight the vitality of the open AI movement. They remind us that while commercial interests are crucial for innovation, the spirit of shared knowledge and collective growth is equally essential. The "Linux moment" of AI beckons a future where openness entails not just access, but also the freedom to innovate, repurpose, and collaborate.
In our view, the centralization of AI computing power tends to lean towards closed-source solutions, whereas its democratization inherently supports open-source alternatives.
As AI increasingly permeates our lives, the choices we make now, whether to centralize or democratize, to close off or open up, will shape not just the future of technology but the essence of our digital society. This moment of reckoning urges us to contemplate the kind of digital world we wish to create and pass on to future generations.