raphiki for Technology at Worldline

Posted on • Updated on • Originally published at blog.worldline.tech

How Open is Generative AI? Part 1

Welcome to this two-part series on Generative AI openness, in which my colleague Luxin Zhang and I explore the history, current landscape, and potential future of open collaboration and proprietary control in the development of Large Language Models (LLMs). In this first part, we delve into why it is important to inspect the openness of each component in the LLM training process, and how any single component can impose limitations on the use or reuse of the resulting model. In the second part, we will explore the potential benefits and drawbacks of sharing Generative AIs openly for the collective advancement of society.

Introduction

Observing the current landscape of generative AI, we are reminded of the shared history of open source software and the Internet. Both transitioned from academic research endeavors to business-centric, mass-market phenomena. Historically, the free software movement was birthed by researchers keen on sharing their work, while hardware vendors like IBM and AT&T realized the monetary potential of software. And indeed this business potential has swiftly been capitalized upon, leading to the creation of proprietary software by various companies, some of which grew immensely powerful. Later on, the open source movement expanded on the free software idea, emphasizing openness and facilitating the commercial reuse of software.


A similar centralization trend occurred with Internet content, increasingly hosted on private platforms. Many of these platforms, which grew into Internet giants, relied heavily on open source software. Free and open source licenses, conceived before the Cloud era, weren’t entirely suited for the new age of software concealed behind SaaS and APIs. In response, the AGPL was designed to extend the GPL’s reach to the Cloud. Recent years have seen numerous projects adjusting their licenses in light of this, sometimes moving away from being considered open source. MongoDB, ElasticSearch, and more recently Terraform serve as prime examples.

Generative AI seems to be grappling with analogous challenges, albeit with some distinct differences. The transition from research to business in the AI realm has been astoundingly rapid, fueled by significant capital. This rapid evolution has granted generative AI immense public visibility, perhaps without adequate time for licensing nuances to mature. Today, in response to the rise of closed, proprietary, and centralized platforms aspiring to dominate AI, researchers and certain companies are championing openness. I term this collective effort “open AI”, distinguishing it from “Open Source AI” due to the varying degrees of openness in the sector. While licenses might not always align with strict open source definitions, the extent of openness also hinges on which components are shared and reused: the pre-training dataset, the foundational model and its weights, the fine-tuning dataset and resulting model, the reward model, and the code.

We concur with the sentiment that AI is experiencing its Linux moment, because what is at stake is not just about having open source models or datasets; it’s about collaboratively constructing an entire ecosystem. Collaboration and open source don’t negate business opportunities; they simply promote a decentralized approach. For this to work, clarity on licenses is paramount, enabling collective construction and achieving what might seem impossible, just like Linux.

1. Defining Openness for an LLM

We propose a framework to evaluate the level of openness of an LLM and the potential restrictions on its use and reuse. Before introducing our evaluation matrix, let's revisit the different components involved in LLM training.

LLM Training

The success of LLMs can be attributed to the combination of training strategies implemented at various stages, the advanced architecture of models, and abundant datasets. To help you understand why thoroughly inspecting the openness of each component is crucial, we offer some explanations of these elements. In summary, the LLM training process consists of three primary steps.

Self-supervised Pre-training

The purpose of the pre-training step is to provide the model with basic language understanding. This is achieved through a self-supervised learning approach in which the model is trained to predict the next word given the preceding text. The dataset for this step is gathered from various sources such as websites, book corpora, articles, and open source code. Typically, the dataset used in this step is quite large, often exceeding several terabytes. Due to its size, manually checking the quality of the data is unfeasible; it is therefore essential to have data processing code that ensures the quality of the dataset and, in turn, the performance of the model.
After completing the pre-training step, the resulting foundational model can be used for subsequent fine-tuning tasks.
Not all released models are passed on to the fine-tuning step: LLaMA 1 and GPT-3, for example, were initially released without fine-tuning.
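The self-supervised setup described above can be sketched in a few lines of Python. This is a deliberately minimal illustration: the whitespace tokenization and tiny context window are simplifications, not what production LLMs actually use.

```python
# Sketch: how self-supervised pre-training examples are constructed.
# No human labels are needed -- each next word in the corpus serves as
# the target for the words that precede it.

def make_pretraining_pairs(text, context_size=3):
    """Slide a window over the token stream, yielding
    (context tokens, next token) training examples."""
    tokens = text.split()  # toy tokenizer; real LLMs use subword tokenizers
    pairs = []
    for i in range(1, len(tokens)):
        context = tokens[max(0, i - context_size):i]
        pairs.append((context, tokens[i]))
    return pairs

corpus = "the model learns to predict the next word"
for context, target in make_pretraining_pairs(corpus):
    print(context, "->", target)
```

The key property to notice is that the training signal comes entirely from the raw text itself, which is why the quality of the pre-training dataset (and of the code that filters it) matters so much.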

Supervised Instruction Fine-tuning (SFT)

The fine-tuning step aims to adapt the foundational model to specific tasks. This step is supervised, meaning that the model is trained on pairs of instructions and responses written by humans. The number of pairs can range from a few thousand to tens or even hundreds of thousands. Although the dataset used in this step is far smaller than the pre-training dataset, it is much more expensive to collect. It is possible to collect synthetic data from other advanced LLMs like GPT-4, but the licenses of these LLM providers might restrict the usage of the synthetic dataset and the fine-tuned model. Many community models like Alpaca, Vicuna, and Dolly are released after this step.
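To make the instruction/response structure concrete, here is a minimal sketch of how an SFT example is flattened into a single training sequence. The prompt template wording is illustrative only; it is not the template of any specific project mentioned above.

```python
# Sketch: the shape of a supervised fine-tuning (SFT) dataset example.
# Each example pairs a human-written instruction with the desired
# response; the pair is flattened into one training sequence using a
# prompt template (template text here is a hypothetical placeholder).

SFT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"

def format_sft_example(instruction, response):
    """Turn one (instruction, response) pair into a training string."""
    return SFT_TEMPLATE.format(instruction=instruction, response=response)

example = format_sft_example(
    "Summarize the water cycle in one sentence.",
    "Water evaporates, condenses into clouds, and falls back as precipitation.",
)
print(example)
```

Because each pair requires human authoring (or a license-encumbered teacher model), assembling even a modest SFT dataset is costly, which is why its openness is worth scoring separately from the pre-training data.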

Reinforcement Learning from Human Feedback (RLHF)


RLHF aims to enhance an SFT model by incorporating human feedback. In this approach, the model is trained to generate responses that receive high scores from human raters. To gather data for this process, humans are asked to evaluate the responses produced by the SFT LLM. However, instead of employing this dataset to train the LLM directly, a reward model (RM) is developed. The RM is essentially a fine-tuned SFT LLM that, rather than predicting words, outputs a scalar score predicting the human rating of a response.
The RM is then used to adjust the LLM in order to align it with human preferences. As shown in the figure from the Llama 2 paper, this step is essential for mitigating the harmfulness of LLMs while improving their usefulness.
Practically all commercial LLMs, such as ChatGPT, GPT-4, Claude, and Bison, benefit from this process.
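The RLHF data flow described above can be caricatured as follows. A real reward model is itself a neural network trained on human preference pairs; the keyword heuristic below is only a stand-in so the selection logic is visible.

```python
# Sketch of the RLHF flow: candidate responses are scored by a reward
# model (RM), and training pushes the LLM towards higher-scoring
# outputs. The scoring function here is a toy stand-in, NOT a real RM.

def toy_reward_model(response):
    """Stand-in for a fine-tuned reward model: returns a scalar
    'human preference' score instead of next-word probabilities."""
    score = 0.0
    if "sorry" not in response.lower():
        score += 0.5  # penalize unhelpful refusals
    score += min(len(response.split()), 20) / 20.0  # reward substance
    return score

def pick_preferred(candidates):
    """RLHF-style selection: keep the response the RM scores highest."""
    return max(candidates, key=toy_reward_model)

candidates = [
    "Sorry, I cannot help with that.",
    "Here is a step-by-step explanation of how the water cycle works.",
]
print(pick_preferred(candidates))
```

In actual RLHF pipelines the RM's scalar score is fed into a reinforcement learning algorithm (commonly PPO) that updates the LLM's weights, rather than merely filtering outputs as this sketch does.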

For us, the main components and aspects of an LLM to check when assessing its openness are the following:

  • The model and its weights
  • The dataset used for pre-training
  • The dataset used for fine-tuning (if not a foundational model)
  • The reward model used for the RLHF phase (if not a foundational model)
  • The data processing code

On each of these aspects we can then evaluate the level of openness with one of the following scores:

  • Score 0 - Closed: No access to any public information, data or asset
  • Score 1 - Published research only: Research paper(s) published, but no further information, data, or assets
  • Score 2 - Restricted access: Access to asset is possible only with special agreement (commercial, research…)
  • Score 3 - Open with limitations: Access and reuse of asset is possible but with certain limitations on usage (ex. Open RAIL license)
  • Score 4 - Totally open: Access and reuse of asset is possible without restriction (ex. open source license)
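The matrix above can be expressed as a small data structure. One possible aggregation, shown below as an assumption of ours rather than a formal rule, is to take the minimum score across applicable components: a single restrictive component caps how freely the whole model can be reused.

```python
# Sketch: the openness matrix as a data structure. The "reuse ceiling"
# (minimum across applicable components) is an illustrative aggregation
# choice, and the example scores mirror the GPT-2 table later in the
# article.

SCORES = {
    0: "Closed",
    1: "Published research only",
    2: "Restricted access",
    3: "Open with limitations",
    4: "Totally open",
}

def reuse_ceiling(component_scores):
    """Lowest score among applicable components (None = not applicable)."""
    applicable = [s for s in component_scores.values() if s is not None]
    return min(applicable)

gpt2 = {
    "model_weights": 4,
    "pretraining_dataset": 1,
    "finetuning_dataset": None,   # foundational model
    "reward_model": None,         # no RLHF
    "data_processing_code": 1,
}
print(SCORES[reuse_ceiling(gpt2)])
```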

Such an evaluation matrix can be used to assess LLMs and also to trace the potential limitations on a model's use or reuse imposed by one or more of its components. Let's use our framework to revisit LLM history and position some of the key players.

2. Early days of LLMs


Before LLMs gained prominence, Natural Language Processing (NLP) depended on rule-based systems and simpler statistical models. These systems necessitated manual feature engineering and often faltered with intricate language tasks.

During the 2010s, word embeddings, such as Word2Vec and GloVe, gained traction. These models depicted words as dense vectors, encapsulating semantic relationships among them. Recurrent Neural Networks (RNNs), and their offshoot, Long Short-Term Memory (LSTM) networks, emerged to manage data sequences, proving useful for text generation and sentiment analysis.

The Attention is All You Need paper in 2017 introduced the attention mechanism, enabling models to prioritize different segments of input data. This innovation birthed the Transformer architecture, the bedrock of contemporary LLMs. In 2018, Google unveiled BERT (Bidirectional Encoder Representations from Transformers). Pre-trained on extensive text data and subsequently fine-tuned for specific tasks, BERT established new standards across multiple NLP tasks.

OpenAI launched the Generative Pre-trained Transformer (GPT) series, with GPT-3 in 2020 highlighting its prowess in generating human-like text and executing tasks without task-specific training. As models like GPT-3 showcased remarkable abilities, issues related to biases, misuse, and environmental consequences surfaced. Companies such as Meta have endeavored to open source or grant access to their models, igniting discussions about the genuine essence of “openness” in AI.

3. Case Studies of some major AI Players

Before delving into the detailed analysis of major AI players, it's important to clarify that the perspectives and opinions expressed in this article are solely those of the authors and do not reflect the official views of Worldline.

3.1. OpenAI and the GPT Series

Reflecting on OpenAI’s journey and the evolution of the GPT series provides insights into the potential consequences and the significance of openness in LLM development.

OpenAI, established in 2015 by tech visionaries like Elon Musk, Sam Altman, Greg Brockman, Ilya Sutskever, John Schulman, and Wojciech Zaremba, aimed to cultivate AI that would be safe, beneficial for humanity, and openly accessible.


In June 2018, OpenAI released Improving Language Understanding by Generative Pre-Training, introducing the GPT model, a transformer-based neural network architecture tailored for language generation. This model, built upon the transformer architecture presented in “Attention Is All You Need”, had 117 million parameters and was trained on a vast text corpus. Though it marked a leap in natural language processing, its capabilities had limitations.

OpenAI’s subsequent iterations of the GPT model saw GPT-2 in February 2019, reaching 1.5 billion parameters —significantly larger and more potent than its forerunner. GPT-2 could generate human-like text and handle diverse language tasks. However, due to potential misuse concerns, OpenAI opted for a phased release of GPT-2.

GPT-1 and GPT-2 openness

| Component | Score | Level description | Motivation and links |
| --- | --- | --- | --- |
| Model (weights) | 4 | Totally open | GPT-1 and GPT-2 model weights and architecture are under an open source license. |
| Pre-training Dataset | 1 | Published research only | The pre-training dataset is not publicly available. However, details about the data and training process are described in the original GPT and GPT-2 papers. |
| Fine-tuning Dataset | NA | Not Applicable | GPT-1 and GPT-2 are foundational models and were not fine-tuned on a specific dataset for their initial releases. |
| Reward model | NA | Not Applicable | GPT-1 and GPT-2 did not undergo Reinforcement Learning from Human Feedback (RLHF) in their initial training, so there is no reward model to evaluate. |
| Data Processing Code | 1 | Published research only | While OpenAI has provided some information about the data processing and training methodology, the exact code used for data processing during GPT-1 and GPT-2’s training has not been released. |

In 2019, Microsoft’s $1 billion investment in OpenAI bolstered the creation of advanced AI models and technologies. This partnership enabled OpenAI to utilize Microsoft’s Azure cloud computing platform for its AI models. Microsoft’s involvement facilitated the joint development of GPT-3. These shifts in OpenAI’s foundational goals prompted Dario Amodei (VP of Research at OpenAI) and 14 other researchers to depart and establish the competing AI startup, Anthropic.

OpenAI unveiled GPT-3 in June 2020, widely acclaimed as a natural language processing milestone. With 175 billion parameters, it stood among the most formidable language models, producing coherent, human-like text and executing a plethora of language tasks with astounding precision.

Then, in March 2023, OpenAI launched GPT-4, reportedly encompassing around 1.76 trillion parameters, which would make it the most expansive language model to date. Trained on a multilingual dataset, it can craft more inventive content, translate with heightened accuracy, and generate code. This release, primarily a closed model, marked a deviation from OpenAI’s original vision of research transparency and openness.

GPT-3 and GPT-4 openness

| Component | Score | Level description | Motivation and links |
| --- | --- | --- | --- |
| Model (weights) | 0 | Closed | As of the available information, the model weights for GPT-3 and GPT-4 are not available to the public. OpenAI provides access to GPT-4 through an API. More information can be found on the OpenAI API platform. |
| Pre-training Dataset | 1 | Published research only | The pre-training dataset is not publicly available. However, details about the data and training process are described in the GPT-3 paper and the GPT-4 Technical Report. |
| Fine-tuning Dataset | NA | Not Applicable | GPT-3 and GPT-4 are foundational models and were not fine-tuned on specific datasets for their initial releases. |
| Reward model | NA | Not Applicable | GPT-3 and GPT-4 did not undergo RLHF in their initial training, so there are no reward models to evaluate. |
| Data Processing Code | 1 | Published research only | While OpenAI has provided some information about the data processing and training methodology, the exact code used for data processing during GPT-3 and GPT-4’s training has not been released. |

ChatGPT is a fine-tuned model based on GPT, further reinforced with human feedback and targeted at chat usage. Launched in November 2022, it enables users to refine a conversation towards a desired result. It is offered as a freemium service, with a free tier giving access to the GPT-3.5-based version; a more advanced GPT-4-based version and priority access to newer features are provided to paid subscribers.

ChatGPT openness

| Component | Score | Level description | Motivation and links |
| --- | --- | --- | --- |
| Model (weights) | 0 | Closed | Access to the ChatGPT model is provided through OpenAI's API, but the weights are not publicly released. |
| Pre-training Dataset | 1 | Published research only | Pre-trained on a large corpus of publicly available text data, but specific details are not provided. |
| Fine-tuning Dataset | 1 | Published research only | Fine-tuned using publicly available instruction datasets, but specific details are not provided. |
| Reward model | 0 | Closed | Uses reinforcement learning with human feedback, but specific details about the reward models are not provided. |
| Data Processing Code | 0 | Closed | Designed to process publicly available data, but specific details about the data processing code are not provided. |

3.2. Google's Transition

Google’s stance on AI research has evolved from an initially open and collaborative orientation to a more proprietary and commercial direction, mirroring OpenAI’s trajectory.

In the nascent stages of AI exploration at Google, the DeepMind division was renowned for its emphasis on foundational research and its unwavering dedication to transparency and openness.

For example, the Transformer paper of 2017 was published by a consortium of Google AI researchers. This methodology swiftly set the benchmark for a myriad of NLP tasks, including machine translation, text summarization, and question-answering. Furthermore, in 2018, Google introduced the BERT LLM, which rapidly ascended to become a cornerstone in the AI domain.

BERT openness

| Component | Score | Level description | Motivation and links |
| --- | --- | --- | --- |
| Model (weights) | 4 | Totally open | BERT model weights and architecture are available for download under an open source license. |
| Pre-training Dataset | 2 | Restricted access | BERT was pre-trained on the BookCorpus and English Wikipedia. While the methodology and details are provided in the BERT paper, the exact dataset used for pre-training is not publicly available. However, similar datasets can be accessed with certain limitations. |
| Fine-tuning Dataset | NA | Not Applicable | BERT is a foundational model and can be fine-tuned on various specific datasets per use-case, which are not provided by the original creators. |
| Reward model | NA | Not Applicable | BERT did not undergo RLHF in its initial training, so there is no reward model to evaluate. |
| Data Processing Code | 4 | Totally open | The code for pre-processing data and training BERT is available on GitHub under an open source license. |

Yet, as Google’s AI endeavors gravitated towards commercialization, there was a discernible pivot towards proprietary, closed-source models, predominantly crafted by the Google Brain team. This transition might help explain the departure of DeepMind co-founder Mustafa Suleyman from Google in January 2022, shortly before the merger of DeepMind and Google Brain.

Part of this shift can be attributed to the imperative to safeguard Google’s intellectual assets and sustain a competitive edge in the dynamic AI landscape. For instance, in February 2023, Google opted to capitalize on its AI breakthroughs, disseminating research papers only post product development. This move, while strategically sound, has elicited concerns from a segment of the research community, apprehensive that a trend towards closed-source models might curb innovation and collaborative spirit in AI.

Google Bard & PaLM 2

In May 2023, Google unveiled Bard, a versatile AI assistant capable of Internet searches, powered by the proprietary LLM PaLM 2 (Pathways Language Model 2). With 540 billion parameters, PaLM 2, trained on an extensive dataset of text and code, offers capabilities like text generation, language translation, crafting diverse creative content, and delivering informative responses. Additionally, Google heralded the advent of Gemini, an upcoming LLM envisioned to surpass Bard in prowess and adaptability. Its multimodal nature would enable it to interpret and generate text, visuals, and other data modalities.

PaLM 2 openness

| Component | Score | Level description | Motivation and links |
| --- | --- | --- | --- |
| Model (weights) | 1 | Published research only | Weights are not publicly available. The only information available is in the PaLM 2 Technical Report. |
| Pre-training Dataset | 1 | Published research only | The pre-training dataset is not publicly available. The only information available is in the PaLM 2 Technical Report. |
| Fine-tuning Dataset | NA | Not Applicable | PaLM 2 is a foundational model and was not fine-tuned on a specific dataset for its initial release. |
| Reward model | NA | Not Applicable | PaLM 2 did not undergo Reinforcement Learning from Human Feedback (RLHF) in its initial training, so there is no reward model to evaluate. |
| Data Processing Code | 0 | Closed | The data processing code is not publicly available. |

It’s evident that both OpenAI and Google have transitioned from open research to a more proprietary, commercial approach. Their shared belief is that LLM progression hinges on computational power and scalability. Could this be because one is a cloud provider and the other is heavily backed by one?

3.3. Meta's Journey Towards Openness

Meta’s trajectory offers a compelling narrative, given that it is not a Cloud provider. Its strategy concerning language models has oscillated between open source and proprietary stances.

In 2019, Meta introduced RoBERTa, an open source language model built upon Google’s BERT framework. RoBERTa, trained on an expansive text dataset, was adept at various linguistic tasks, such as question-answering and text categorization. Its launch was heralded as a pivotal advancement in natural language processing, catalyzing the proliferation of open source AI paradigms.

RoBERTa openness

| Component | Score | Level description | Motivation and links |
| --- | --- | --- | --- |
| Model (weights) | 4 | Totally open | RoBERTa model weights and architecture are available for download under an open source license. |
| Pre-training Dataset | 3 | Open with limitations | RoBERTa was pre-trained on several datasets, including BookCorpus, English Wikipedia, CC-News, OpenWebText, and Stories. While some of these datasets are publicly available, others like BookCorpus are not available due to copyright issues. More details can be found in the RoBERTa paper. |
| Fine-tuning Dataset | NA | Not Applicable | RoBERTa is a foundational model and can be fine-tuned on various specific datasets per use-case, which are not provided by the original creators. |
| Reward model | NA | Not Applicable | RoBERTa did not undergo RLHF in its initial training, so there is no reward model to evaluate. |
| Data Processing Code | 4 | Totally open | The code for pre-processing data and training RoBERTa is available on GitHub under an open source license. |

In 2021, Meta unveiled LLM, a language model surpassing RoBERTa in size and capability. Trained on an extensive text dataset, LLM could produce remarkably coherent and human-like text. Contrary to RoBERTa, LLM wasn’t open sourced; instead, Meta granted access to a select group of partners and clientele via a research license.

This proprietary stance on LLM sparked debates, with critics contending it contravened the ethos of transparency and cooperation in AI. However, Meta justified its choice, echoing OpenAI’s sentiments regarding potential model misuse and intellectual property preservation.

Meta's LLaMA

In a strategic pivot, Meta launched LLaMA (Large Language Model Meta AI) in February 2023. Unlike LLM, LLaMA is an accessible model, allowing anyone to utilize and expand upon it. This model, rooted in the transformer architecture, was trained on a vast text dataset.

Yet, LLaMA’s license hasn’t garnered the Open Source Initiative (OSI) stamp, rendering it non-open source. The license imposes several constraints, such as:

  • Prohibiting users from leveraging LLaMA to enhance other sizable language models, including their own.
  • Mandating users to secure a specialized license from Meta for deploying LLaMA in platforms with over 700 million monthly users.

LLaMA 2 openness

| Component | Score | Level description | Motivation and links |
| --- | --- | --- | --- |
| Model (weights) | 3 | Open with limitations | LLaMA 2 model weights were intended for selective academic and research access, but were inadvertently leaked online. The original source, however, imposes certain restrictions. |
| Pre-training Dataset | 1 | Published research only | The LLaMA 2 model was pre-trained on a new mix of publicly available data, which does not include data from Meta’s products or services, but specific details are not provided in the paper. |
| Fine-tuning Dataset | NA | Not Applicable | LLaMA 2 is a foundational model and can be fine-tuned on various specific datasets per use-case, which are not provided by the original creators. |
| Reward model | NA | Not Applicable | LLaMA 2 did not undergo RLHF in its initial training, so there is no reward model to evaluate. |
| Data Processing Code | 1 | Published research only | Designed to process publicly available data, but specific details about the data processing code are not provided. |

Essentially, these restrictions aim to shield Meta’s intellectual assets and deter rivals from harnessing LLaMA for proprietary language model development. Despite the ensuing debates, LLaMA remains freely accessible for both research and commercial endeavors, with its license ensuring derivative works based on its code remain open. Nevertheless, users must acquaint themselves with the license’s restrictions prior to integrating LLaMA into their offerings. Although LLaMA’s model weights were intended for selective academic and research access, they were inadvertently leaked online.

This availability gave birth to new fine-tuned LLMs rooted in LLaMA, such as Alpaca and Vicuna, and more generally inspired other, more open and collaborative projects. The impact of collaboration and open source on LLM evolution will be covered in the last part of our series on Generative AI Openness.

Conclusion... for now

The evolution of Generative AI mirrors the broader trajectory of technological advancements, oscillating between open collaboration and proprietary control. As we’ve seen with the rise and diversification of LLMs, the AI landscape is at a crossroads. As we conclude part one of our series on Generative AI Openness, we leave you with a question: what does the future hold for LLMs and the broader AI landscape?

Stay tuned for part two, where we will explore the impact of collaboration and open source on the evolution of Generative AI.
