This is the second and final part of our 'Generative AI Openness' series. In the first part, we established a straightforward framework to gauge the openness of Large Language Models (LLMs) and used it to explore LLM development and the positioning of key players. We observed a trend towards increasingly restricted LLM artifacts from OpenAI and Google, contrasted with Meta's more open approach.
Now, let's venture into the realm of collaboration and reuse, keeping our openness matrix in mind, to uncover the multifaceted nature of openness in LLMs.
Before we delve into the specifics of different models, their development, and openness, let's start by considering the openness of some well-known LLMs from a broad perspective.
LLM Model Openness in a Nutshell
LLM | Model (weights) | Pre-training Dataset | Fine-tuning Dataset | Reward Model | Data Processing Code |
---|---|---|---|---|---|
Alpaca | 3 - Open with limitations | 1 - Published research only | 2 - Research use only | Not applicable | 4 - Under Apache 2 license |
Vicuna | 3 - Open with limitations | 1 - Published research only | 2 - Research use only | Not applicable | 4 - Under Apache 2 license |
GPT-J, GPT-Neo | 4 - Completely open | 3 - Open with limitations | Not applicable | Not applicable | 4 - Completely open |
Falcon | 3 - Open with limitations | 4 - Access and reuse without restriction | Not applicable | Not applicable | 1 - No code available |
BLOOM | 3 - Open with limitations | 3 - Open with limitations | Not applicable | Not applicable | 4 - Completely open |
OpenLLaMA | 4 - Access and reuse without restriction | 4 - Access and reuse without restriction | Not applicable | Not applicable | 1 - No complete data processing code available |
MistralAI | 4 - Access and reuse without restriction | 0 - No public information or access | Not applicable | Not applicable | 4 - Complete data processing code available |
Dolly | 4 - Access and reuse without restriction | 3 - Open with limitations | 4 - Access and reuse without restriction | 0 - No public information available | 4 - Access and reuse possible |
BLOOMChat | 3 - Open with limitations | 3 - Open with limitations | 4 - Access and reuse without restriction | 0 - No public information available | 3 - Open with limitations |
Zephyr | 4 - Access and reuse without restriction | 3 - Open with limitations | 3 - Open with limitations | 3 - Open with paper and code examples | 3 - Open with limitations |
AmberChat | 4 - Access and reuse without restriction | 4 - Access and reuse without restriction | 2 - Research use only | 0 - No public information available | 4 - Under Apache 2 license |
We often come across news about a new open-source LLM being released. Upon closer examination, accessing the model weights or using the model without restrictions is usually feasible; reproducing the work, however, is often difficult because the training datasets are unavailable or the data processing code is missing. The table also shows that many models lack fine-tuning with a reward model, which is crucial for the success of current LLMs and plays a significant role in reducing hallucination and toxicity. Even for the most open models, either no reward model is used or the reward model is not publicly accessible.
In the following sections, we will provide details of the LLM models mentioned in the table above, introduce their evolution, and explain their openness score.
1. Fine-tuned Models from LLaMA
Concluding the first part of this series, we highlighted two fine-tuned models based on LLaMA, subject to Meta's licensing constraints. Let's evaluate their openness level.
Alpaca is an instruction-oriented LLM derived from LLaMA, enhanced by Stanford researchers with a dataset of 52,000 examples of following instructions, sourced from OpenAI’s InstructGPT through the self-instruct method. The extensive self-instruct dataset, details of data generation, and the model refinement code were publicly disclosed. This model complies with the licensing requirements of its base model. Due to the utilization of InstructGPT for data generation, it also adheres to OpenAI’s usage terms, which prohibit the creation of models competing with OpenAI. This illustrates how dataset restrictions can indirectly affect the resulting fine-tuned model.
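To make the self-instruct approach more concrete, here is a minimal, illustrative sketch of the bootstrapping loop in Python. It is not the actual Stanford pipeline (which adds prompt templates, similarity-based deduplication, and batched API calls), and `query_teacher_model` is a hypothetical placeholder for a call to a teacher model such as InstructGPT.

```python
import json
import random

def query_teacher_model(prompt: str) -> str:
    """Hypothetical placeholder for a call to a teacher model API (e.g. an
    InstructGPT-class endpoint). Replace with a real client before running."""
    raise NotImplementedError

def self_instruct(seed_tasks: list[dict], target_size: int) -> list[dict]:
    """Grow an instruction dataset by asking a teacher model to imitate seed examples."""
    pool = list(seed_tasks)
    while len(pool) < target_size:
        # Show the teacher a few in-context examples and ask for a new one.
        examples = random.sample(pool, k=min(3, len(pool)))
        prompt = "Here are some instruction/response pairs:\n\n"
        for ex in examples:
            prompt += f"Instruction: {ex['instruction']}\nResponse: {ex['response']}\n\n"
        prompt += "Write one new, different instruction and its response as a JSON object."
        try:
            candidate = json.loads(query_teacher_model(prompt))
        except json.JSONDecodeError:
            continue  # skip malformed generations
        # Naive deduplication; the real pipeline uses similarity-based filtering.
        if candidate.get("instruction") and candidate not in pool:
            pool.append(candidate)
    return pool
```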
Vicuna is another instruction-focused LLM rooted in LLaMA, developed by researchers from UC Berkeley, Carnegie Mellon University, Stanford, and UC San Diego. They adapted Alpaca’s training code and incorporated 70,000 examples from ShareGPT, a platform for sharing ChatGPT interactions.
Alpaca and Vicuna Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | Both Vicuna and Alpaca are based on the LLaMA foundational model. |
Pre-training Dataset | 1 | Published research only | Both Vicuna and Alpaca are derived from the LLaMA foundational model and dataset. |
Fine-tuning Dataset | 2 | Research use only | Both models are constrained by OpenAI’s non-competition clause due to their training with data originating from ChatGPT. |
Reward model | NA | Not Applicable | Neither model underwent Reinforcement Learning from Human Feedback (RLHF) initially, hence there are no reward models for evaluation. It's worth noting that AlpacaFarm, a framework simulating an RLHF process, was released under a non-commercial license, and StableVicuna underwent RLHF fine-tuning on Vicuna. |
Data Processing Code | 4 | Under Apache 2 license | Both projects have shared their code on GitHub (Vicuna, Alpaca). |
Significantly, both projects face dual constraints: initially from LLaMA’s licensing on the model and subsequently from OpenAI due to their fine-tuning data.
2. Collaboration and Open Source in LLM Evolution
In addition to the foundation model Llama and its associated families of fine-tuned LLMs, there are many initiatives that contribute to promoting the openness of foundational models and their fine-tuned ones.
2.1. Foundational Models and Pre-training Datasets
Research in this area highlights the cost-effectiveness of developing instruction-tuned LLMs on top of foundational models through collaborative efforts and reuse. This approach, however, requires genuinely open-source foundational models and pre-training datasets.
2.1.1 EleutherAI
This vision is in line with EleutherAI, a non-profit organization founded in July 2020 by a group of researchers. Driven by the perceived opacity and the challenge of reproducibility in AI, their goal was to create leading open-source language models.
By December 2020, EleutherAI had introduced The Pile, a comprehensive text dataset designed for training models. Subsequently, tech giants such as Microsoft, Meta, and Google used this dataset for training their models. In March 2021, they revealed GPT-Neo, an open-source model under Apache 2.0 license, which was unmatched in size at its launch. EleutherAI’s later projects include the release of GPT-J, a 6 billion parameter model, and GPT-NeoX, a 20 billion parameter model, unveiled in February 2022. Their work demonstrates the viability of high-quality open-source AI models.
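Because the GPT-J weights are published under Apache 2.0 on the Hugging Face Hub, they can be used directly with the transformers library. The following is a minimal sketch, assuming the `transformers` and `torch` packages are installed and enough memory is available for the 6B-parameter weights; the prompt and sampling settings are purely illustrative.

```python
# Minimal sketch: text generation with EleutherAI's GPT-J through Hugging Face transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "EleutherAI/gpt-j-6b"  # Apache 2.0 licensed weights on the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)  # ~24 GB in full precision

inputs = tokenizer("Open-source language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, temperature=0.8)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```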
EleutherAI GPT-J Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Completely open | GPT-J’s model weights are freely accessible, in line with EleutherAI's commitment to open-source AI. EleutherAI GitHub |
Pre-training Dataset | 3 | Open with limitations | GPT-J was trained on The Pile, a large-scale dataset curated by EleutherAI. While mostly open, parts of The Pile may have limitations. The Hugging Face page notes: "Licensing Information: Please refer to the specific license depending on the subset you use" |
Fine-tuning Dataset | NA | Not Applicable | GPT-J is a foundational model and wasn't specifically fine-tuned on additional datasets for its initial release. |
Reward model | NA | Not Applicable | GPT-J did not undergo RLHF, making this category non-applicable. |
Data Processing Code | 4 | Completely open | The code for data processing and model training for GPT-J is openly available, fostering transparency and community involvement. |
2.1.2 Falcon
In March 2023, a research team from the Technology Innovation Institute (TII) in the United Arab Emirates introduced a new open model lineage named Falcon, along with its dataset.
Falcon features two versions: the initial with 40 billion parameters trained on one trillion tokens, and the subsequent with 180 billion parameters trained on 3.5 trillion tokens. The latter is said to rival the performance of models like LLaMA 2, PaLM 2, and GPT-4. TII emphasizes Falcon’s distinctiveness in its training data quality, predominantly sourced from public web crawls (~80%), academic papers, legal documents, news outlets, literature, and social media dialogues. Its licensing, based on the open-source Apache License, allows entities to innovate and commercialize using Falcon 180B, including hosting on proprietary or leased infrastructure. However, it explicitly prohibits hosting providers from exploiting direct access to shared Falcon 180B instances and its refinements, especially through API access. Due to this clause, its license does not fully align with OSI’s open-source criteria.
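Because the RefinedWeb pre-training corpus is published on the Hugging Face Hub under ODC-By, it can be inspected directly. The sketch below streams a few documents with the `datasets` library; the dataset id and the `content` field name are taken from the dataset card and may evolve over time.

```python
# Minimal sketch: streaming a few documents from Falcon's RefinedWeb pre-training corpus.
from datasets import load_dataset

refinedweb = load_dataset("tiiuae/falcon-refinedweb", split="train", streaming=True)
for i, doc in enumerate(refinedweb):
    print(doc.get("content", "")[:200])  # first 200 characters of each web document
    if i == 2:
        break
```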
Falcon Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | Falcon's license is inspired by Apache 2 but restricts hosting uses. |
Pre-training Dataset | 4 | Access and reuse without restriction | The RefinedWeb dataset is distributed under the Open Data Commons Attribution License (ODC-By) and also under the CommonCrawl terms, which are quite open. |
Fine-tuning Dataset | NA | Not Applicable | Falcon is a foundational model and can be fine-tuned on various specific datasets as per use case, not provided by the original creators. |
Reward model | NA | Not Applicable | Falcon did not undergo RLHF in its initial training, hence no reward model for evaluation. |
Data Processing Code | 1 | No code available | General instructions are available here. |
2.1.3 LAION
Highlighting the international scope of this field, consider LAION and BLOOM.
LAION (Large-scale Artificial Intelligence Open Network), a German non-profit established in 2020, is dedicated to advancing open-source models and datasets (primarily under Apache 2 and MIT licenses) to foster open research and the evolution of benevolent AI. Their datasets, encompassing both images and text, have been pivotal in the training of renowned text-to-image models like Stable Diffusion.
2.1.4 BLOOM
BLOOM, boasting 176 billion parameters, is capable of generating text in 46 natural and 13 programming languages. It represents the culmination of a year-long collaboration involving over 1000 researchers from more than 70 countries and 250 institutions, which concluded with a 117-day run on the Jean Zay French supercomputer. Distributed under an OpenRAIL license, BLOOM is not considered fully open-source due to usage constraints, such as prohibitions against harmful intent, discrimination, or interpreting medical advice.
BLOOM Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | BLOOM’s model weights are publicly accessible, reflecting a commitment to open science, but under an OpenRAIL license. BigScience GitHub |
Pre-training Dataset | 3 | Open with limitations | BLOOM’s primary pre-training dataset, while officially under an Apache 2 license, is based on various subsets with potential limitations. |
Fine-tuning Dataset | NA | Not Applicable | BLOOM is a foundational model and wasn't fine-tuned on specific datasets for its initial release. |
Reward model | NA | Not Applicable | BLOOM did not undergo RLHF, hence no reward model for evaluation. |
Data Processing Code | 4 | Completely open | The data processing and training code for BLOOM are openly available, encouraging transparency and community participation. BigScience GitHub |
2.1.5 OpenLLaMA and RedPajama
RedPajama, an initiative led by Together AI, a startup founded in 2022 that advocates AI democratization, playfully references LLaMA and the children’s book “Llama Llama Red Pajama”.
The initiative has expanded to include partners like Ontocord.ai, ETH DS3Lab, Stanford CRFM, Hazy Research, and MILA Québec AI Institute. In April 2023, they released a 1.2 trillion token dataset, mirroring LLaMA’s dataset, for training their models. These models, with parameters ranging from 3 to 7 billion, were released in September, licensed under open-source Apache 2.
The RedPajama dataset was adapted by the OpenLLaMA project at UC Berkeley, creating an open-source LLaMA equivalent without Meta’s restrictions. The model's later version also included data from Falcon and StarCoder. This highlights the importance of open-source models and datasets, enabling free repurposing and innovation.
OpenLLaMA Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Models and weights (in PyTorch and JAX formats) are available under the Apache 2 open-source license. |
Pre-training Dataset | 4 | Access and reuse without restriction | Based on RedPajama, Falcon, and StarCoder datasets. |
Fine-tuning Dataset | NA | Not Applicable | OpenLLaMA is a foundational model. |
Reward model | NA | Not Applicable | OpenLLaMA did not undergo RLHF, hence no reward model for evaluation. |
Data Processing Code | 1 | No complete data processing code available | The Hugging Face page provides some usage examples with transformers. |
2.1.6 MistralAI
MistralAI, a French startup, developed a 7.3 billion parameter LLM named Mistral for various applications. Although MistralAI is committed to open-sourcing its technology under Apache 2.0, the training dataset details for Mistral remain undisclosed. The Mistral Instruct model was fine-tuned using publicly available instruction datasets from the Hugging Face repository, though specifics about the licenses and potential constraints are not detailed. Recently, MistralAI released Mixtral 8x7B, a model based on the sparse mixture of experts (SMoE) architecture, consisting of several specialized expert networks (likely eight, as suggested by its name) that are activated as needed.
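As a concrete illustration of this openness, the Apache 2.0 licensed Mistral 7B Instruct checkpoint can be prompted with transformers' chat-template helper. This is a minimal sketch; the model id, device placement, and sampling settings are illustrative, and loading the 7B weights in full precision requires a capable GPU or quantization.

```python
# Minimal sketch: prompting Mistral 7B Instruct with transformers' chat-template helper.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")  # needs `accelerate`

messages = [{"role": "user", "content": "Summarize what a mixture-of-experts model is."}]
input_ids = tokenizer.apply_chat_template(messages, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=120)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```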
Mistral and Mixtral Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Models and weights are available under the Apache 2 open-source license. |
Pre-training Dataset | 0 | No public information or access | The training dataset for the model is not publicly detailed. |
Fine-tuning Dataset | NA | Not Applicable | Mistral is a foundational model. |
Reward model | NA | Not Applicable | Mistral did not undergo RLHF, hence no reward model for evaluation. |
Data Processing Code | 4 | Complete data processing code available | Instructions and deployment code are available on GitHub. |
These examples illustrate the diverse approaches to openness in foundational models, emphasizing the value of sharing and enabling reuse by others.
2.2. Fine-tuned Models and Instruction Datasets
When pre-trained models and datasets are accessible with minimal restrictions, various entities can repurpose them to develop new foundational models or fine-tuned variants.
2.2.1 Dolly
Drawing inspiration from LLaMA and fine-tuned models like Alpaca and Vicuna, the big data company Databricks introduced Dolly 1.0 in March 2023.
This cost-effective LLM was built upon EleutherAI’s GPT-J, employing the data and training methods of Alpaca. Databricks highlighted that this model was developed for under $30, suggesting that significant advancements in leading-edge models like ChatGPT might be more attributable to specialized instruction-following training data than to larger or more advanced base models. Two weeks later, Databricks launched Dolly 2.0, still based on an EleutherAI model (this time Pythia), but now exclusively fine-tuned with a pristine, human-curated instruction dataset created by Databricks’ staff. They chose to open source the entirety of Dolly 2.0, including the training code, dataset, and model weights, making them suitable for commercial applications. In June 2023, Databricks announced the acquisition of MosaicML, a company that had just released its MPT (MosaicML Pre-trained Transformer) foundational model under the Apache 2.0 license, allowing anyone to train, refine, and deploy their own MPT models.
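The human-curated instruction dataset behind Dolly 2.0, databricks-dolly-15k, is itself published on the Hugging Face Hub under CC-BY-SA. Here is a minimal sketch of inspecting it with the `datasets` library; the field names follow the dataset card.

```python
# Minimal sketch: inspecting Databricks' human-curated instruction dataset.
from datasets import load_dataset

dolly = load_dataset("databricks/databricks-dolly-15k", split="train")
print(len(dolly), "examples")

example = dolly[0]
print("Category:   ", example["category"])
print("Instruction:", example["instruction"])
print("Response:   ", example["response"][:200])
```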
Dolly Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Dolly 2 is based on the EleutherAI foundational model and is released under the MIT license. |
Pre-training Dataset | 3 | Open with limitations | Dolly 2 is built upon the EleutherAI foundational model and inherits the same limitations as its dataset (The Pile). |
Fine-tuning Dataset | 4 | Access and reuse without restriction | The fine-tuning dataset was created by Databricks employees and released under the CC-BY-SA license. |
Reward model | 0 | No public information available | The Reward model of Dolly is not publicly disclosed. |
Data Processing Code | 4 | Access and reuse possible | Code to train and run the model is available on GitHub under the Apache 2 license. |
2.2.2 OpenAssistant
OpenAssistant stands as another example of the potential of open source fine-tuning datasets.
This initiative aims to create an open-source, chat-based assistant proficient in understanding tasks, interacting with third-party systems, and retrieving dynamic information. Led by LAION and international contributors, the project is distinctive in its reliance on crowdsourcing data from human volunteers. The OpenAssistant team has launched numerous crowdsourcing campaigns to gather data for a variety of tasks, such as generating diverse text formats (poems, code, emails, letters) and providing informative responses to queries. Recently, the LAION initiative also released an open-source dataset named OIG (Open Instruction Generalist), comprising 43M instructions and focusing on data augmentation rather than human feedback.
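The crowdsourced conversations were released on the Hugging Face Hub as the oasst1 dataset, where each row is one message of a conversation tree. A minimal sketch of loading it is shown below; the field names ("role", "lang", "text") follow the dataset card.

```python
# Minimal sketch: loading the crowdsourced OpenAssistant conversations (oasst1).
from datasets import load_dataset

oasst1 = load_dataset("OpenAssistant/oasst1", split="train")
msg = oasst1[0]
print(msg["role"], "-", msg["lang"])
print(msg["text"][:200])
```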
2.2.3 BLOOMChat
BLOOMChat, a fine-tuned chat model with 176 billion parameters, demonstrates the reuse of previous models and datasets. It underwent instruction tuning based on the BLOOM foundational model, incorporating datasets like OIG, OpenAssistant, and Dolly.
BLOOMChat Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 3 | Open with limitations | Based on the BLOOM foundational model, it inherits its restrictions (Open RAIL license). |
Pre-training Dataset | 3 | Open with limitations | Based on the BLOOM foundational model and dataset, it inherits potential restrictions. |
Fine-tuning Dataset | 4 | Access and reuse without restriction | The Dolly and LAION fine-tuning datasets are open source. |
Reward model | 0 | No public information available | The Reward model of BLOOMChat is not publicly disclosed. |
Data Processing Code | 3 | Open with limitations | Code to train and run the model is available on GitHub under an Open RAIL type license. |
2.2.4 Zephyr
Zephyr, developed by the Hugging Face H4 project, serves as another exemplary case of reusing an open foundational model. It is a fine-tuned version of the Mistral 7B model, trained on filtered versions of the UltraChat and UltraFeedback datasets. These datasets were generated using ShareGPT and GPT-4, potentially imposing transitive use restrictions on the resulting model.
Zephyr Openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | Zephyr is released under the MIT license, and the Mistral foundational model is under Apache 2. |
Pre-training Dataset | 3 | Open with limitations | Inherits possible restrictions from the Mistral foundational model and dataset. |
Fine-tuning Dataset | 3 | Open with limitations | ShareGPT and GPT-4 were used to produce the fine-tuning datasets, imposing limitations. |
Reward model | 3 | Open with paper and code examples | Zephyr was fine-tuned using Direct Preference Optimization (DPO) rather than RLHF. A paper, along with code and examples for this technique, is available; a minimal sketch follows this table. |
Data Processing Code | 3 | Open with limitations | Example code to train and run the model is available on Hugging Face. |
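To make the DPO step mentioned in the table more tangible, here is an illustrative sketch of the DPO loss on precomputed sequence log-probabilities. It is not the Hugging Face H4 training recipe itself, merely a toy rendition of the objective: widen the policy's preference margin between a chosen and a rejected answer relative to a frozen reference model, with no separate reward model.

```python
# Illustrative sketch of the Direct Preference Optimization (DPO) loss.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    # Log-ratio of policy vs. reference model for each answer.
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    # -log sigmoid(beta * margin): small when the chosen answer is clearly preferred.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy tensors standing in for summed per-token log-probabilities of two answers.
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(float(loss))  # ~0.6 for this toy example
```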
2.2.5 LLM360
LLM360 is an emerging yet intriguing initiative focused on ensuring complete openness of all essential elements required for training models. This includes not only model weights but also checkpoints, training datasets, and source code for data preprocessing and model training. The project is a collaborative effort between the Mohamed bin Zayed University of Artificial Intelligence (MBZUAI) in the UAE and two American AI companies. It has introduced two foundational models: Amber, a 7 billion parameter English LLM, and CrystalCoder, a 7 billion parameter LLM specializing in code and text. Additionally, there are fine-tuned versions of these models, including AmberChat.
AmberChat openness
Component | Score | Level description | Motivation and links |
---|---|---|---|
Model (weights) | 4 | Access and reuse without restriction | AmberChat is released under the Apache 2 open source license. |
Pre-training Dataset | 4 | Access and reuse without restriction | Based on RedPajama v1, Falcon RefinedWeb, and StarCoderData. |
Fine-tuning Dataset | 2 | Research use only | The AmberChat dataset is based on WizardLM evol-instruct V2 and ShareGPT 90K, both derived from ShareGPT. |
Reward model | 0 | No public information available | The reward model is not available. DPO is mentioned as the alignment method for AmberSafe, another fine-tuned Amber model. |
Data Processing Code | 4 | Under Apache 2 license | Data processing and model training code are available under the Apache 2 open source license. |
An assessment of the openness of AmberChat reveals that while the Amber foundational model is quite open, providing unrestricted access to weights, dataset, and code, the fine-tuning process in AmberChat introduces certain limitations to this initial level of openness.
These diverse examples highlight how openness enables global collaboration, allowing entities to repurpose earlier models and datasets in ways typical of the research community. This approach clarifies licensing nuances and fosters external contributions. And it's this spirit of collaborative and community-driven innovation that is championed by a notable player: Hugging Face.
2.3. The Hugging Face Ecosystem
The Hugging Face initiative represents a community-focused endeavor, aimed at advancing and democratizing artificial intelligence through open-source and open science methodologies. Originating from the Franco-American company, Hugging Face, Inc., it's renowned for its Transformers library, which offers open-source implementations of transformer models for text, image, and audio tasks. In addition, it provides libraries dedicated to dataset processing, model evaluation, simulations, and machine learning demonstrations.
Hugging Face has spearheaded two significant scientific projects, BigScience and BigCode, leading to the development of two large language models: BLOOM and StarCoder. Moreover, the initiative curates an LLM leaderboard, allowing users to upload and assess models based on text generation tasks.
However, in my opinion, the truly revolutionary aspect is the Hugging Face Hub. This online collaborative space acts as a hub where enthusiasts can explore, experiment, collaborate, and create machine learning-focused technology. Hosting thousands of models, datasets, and demo applications, all transparently accessible, it has quickly become the go-to platform for open AI, similar to GitHub in its field. The Hub's user-friendly interface encourages effortless reuse and integration, cementing its position as a pivotal element in the open AI landscape.
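The Hub is also fully scriptable. The sketch below uses the `huggingface_hub` library to search model repositories and download a single file; argument names can vary slightly between library versions, so treat it as indicative rather than definitive.

```python
# Minimal sketch: programmatic access to the Hugging Face Hub.
from huggingface_hub import HfApi, hf_hub_download

api = HfApi()
# List a few public model repositories whose name mentions "zephyr".
for model in api.list_models(search="zephyr", limit=5):
    print(model.id)

# Download a single file (here, the model card) from a public repository.
readme_path = hf_hub_download(repo_id="HuggingFaceH4/zephyr-7b-beta", filename="README.md")
print("Model card saved to:", readme_path)
```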
3. On the Compute Side
3.1 Computing Power
Generative AI extends beyond software, a reality clearly understood by entities like Microsoft/OpenAI and Google as Cloud service providers. They benefit from the ongoing effort to enhance LLM performance through increased computing resources.
In September 2023, AWS announced an investment of up to $4 billion in Anthropic, whose models will be integrated into Amazon Bedrock and developed, trained, and deployed using AWS AI chips.
NVIDIA is at the forefront of expanding computing capacity with its GPUs. This commitment is reflected in their CUDA (Compute Unified Device Architecture) parallel computing platform and the associated programming model. Additionally, NVIDIA manages the NeMo conversational AI toolkit, which serves both research purposes and as an enterprise-grade framework for handling LLMs and performing GPU-based computations, whether in the Cloud or on personal computers.
The AI chip recently unveiled by Intel signifies a major advancement in user empowerment over LLM applications. Diverging from Cloud providers' strategies, Intel's aim is to penetrate the market for AI chips suitable for operations outside data centers, thus enabling the deployment of LLMs on personal desktops, for example, through its OpenVINO open-source framework.
This field is characterized by dynamic competition, with companies like AMD and Alibaba making notable inroads into the AI chip market. The development of proprietary AI chips by Cloud Service Providers (CSPs), such as Google's TPU and AWS's Inferentia, as well as investments by startups like SambaNova, Cerebras, and Rain AI (backed by Sam Altman), further intensifies this competition.
3.2 Democratizing AI Computing
The llama.cpp project, initiated by the developer Georgi Gerganov, exemplifies efforts to democratize AI and drive innovation by promoting CPU-compatible models. It is a C/C++ implementation of LLaMA inference, originally designed for use on MacBooks and now also supporting x86 architectures. Employing quantization, which reduces LLMs' size and computational demands by converting model weights from floating-point to lower-precision integer values, the project created the GGML binary format for efficient model storage and loading. Recently, GGML was succeeded by the more versatile and extensible GGUF format.
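The quantization idea itself is simple to illustrate. The toy NumPy sketch below maps float32 weights to 8-bit integers with a single per-tensor scale and then reconstructs them; llama.cpp's GGML/GGUF formats use more refined block-wise schemes (for example 4-bit groups with per-block scales), but the principle is the same.

```python
# Toy illustration of weight quantization: float32 -> int8 plus a scale, and back.
import numpy as np

weights = np.random.randn(6).astype(np.float32)

scale = np.abs(weights).max() / 127.0          # one scale for the whole tensor
q = np.round(weights / scale).astype(np.int8)  # 1 byte per weight instead of 4
reconstructed = q.astype(np.float32) * scale   # approximate weights used at inference time

print("original     :", weights)
print("quantized    :", q)
print("reconstructed:", reconstructed)
print("max abs error:", np.abs(weights - reconstructed).max())
```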
Since its launch in March 2023, llama.cpp has quickly gained traction among researchers and developers for its scalability and compatibility across MacOS, Linux, Windows, and Docker.
The GPT4All project, which builds on llama.cpp, aims to train and deploy LLMs on conventional hardware. This ecosystem democratizes AI by allowing researchers and developers to work on their own machines, circumventing costly cloud computing services and enhancing the accessibility and affordability of LLMs. GPT4All includes a variety of models derived from GPT-J, MPT, and LLaMA, fine-tuned with datasets including ShareGPT conversations, so the licensing limitations discussed earlier for such data still apply. The ecosystem also features a chat desktop application, as previously discussed in a related article.
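GPT4All also ships Python bindings that run a quantized model entirely on CPU. The following is a minimal sketch; the model filename is a placeholder that must correspond to a model available in the GPT4All catalogue (it is downloaded on first use).

```python
# Minimal sketch of the GPT4All Python bindings (`pip install gpt4all`), CPU-only inference.
from gpt4all import GPT4All

model = GPT4All("orca-mini-3b-gguf2-q4_0.gguf")  # placeholder: any model from the catalogue
with model.chat_session():
    reply = model.generate("Explain in one sentence what an open-source LLM is.",
                           max_tokens=80)
    print(reply)
```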
LocalAI is another open-source project designed to facilitate running LLMs locally. Previously introduced in a dedicated article, LocalAI leverages llama.cpp and supports GGML/GGUF formats and Hugging Face models, among others. This tool simplifies deploying an open model on a standard computer or within a container, integrating it into a larger, distributed architecture. The BionicGPT open-source project exemplifies its usage.
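Because LocalAI exposes an OpenAI-compatible REST API, the standard `openai` Python client can simply be pointed at the local instance. The sketch below assumes a LocalAI server listening on localhost with at least one model configured; the endpoint, port, and model name are placeholders that depend on your deployment.

```python
# Minimal sketch: querying a LocalAI instance through its OpenAI-compatible API.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")  # placeholder endpoint
response = client.chat.completions.create(
    model="mistral-7b-instruct",  # placeholder: must match a model configured in LocalAI
    messages=[{"role": "user", "content": "What does it mean for an LLM to be open?"}],
)
print(response.choices[0].message.content)
```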
Conclusion
The evolution of Generative AI mirrors broader technological advancements, fluctuating between open collaboration and proprietary control. The AI landscape, as illustrated by the rise and diversification of LLMs, stands at a crossroads. On one side, tech giants drive towards commercialization and centralization, leveraging immense AI potential and vast computational resources. On the other, a growing movement champions open AI, emphasizing collaboration, transparency, and democratization.
The LLaMA project, despite its limited openness, has inspired numerous others, as evident in names like Alpaca, Vicuna, Dolly, Goat, Dalai (a play on Dalai Lama), RedPajama, and more. We may soon see other animal-themed projects joining this AI menagerie, perhaps flying alongside Falcon?
The range of projects from LLaMA to Falcon, and platforms like Hugging Face, highlight the vitality of the open AI movement. They remind us that while commercial interests are crucial for innovation, the spirit of shared knowledge and collective growth is equally essential. The "Linux moment" of AI beckons a future where openness entails not just access, but also the freedom to innovate, repurpose, and collaborate.
In our view, the centralization of AI computing power tends to lean towards closed-source solutions, whereas its democratization inherently supports open-source alternatives.
As AI increasingly permeates our lives, the choices we make now, whether to centralize or democratize, to close off or open up, will shape not just the future of technology but the essence of our digital society. This moment of reckoning urges us to contemplate the kind of digital world we wish to create and pass on to future generations.