Lucas Matheus

The Role of HPC in Meta Llama 3 Development

Introduction

Like many developers and artificial intelligence enthusiasts, I was excited about Meta's open-source model release. I also became curious about the hardware that made it possible: Meta relied on high-performance computing to train these models and to integrate them with the company's existing social media technologies.

The purpose of this article is to show the importance of high-performance computing in building these models, which are becoming increasingly present in our daily lives. Parts of it are based on notes published on Meta's engineering blog. The company invested in a robust hardware infrastructure to handle massive workloads while delivering reliable results.

Building Blocks

In the quest to develop the next wave of advanced AI, a crucial foundation lies in powerful new computers capable of performing quintillions of operations per second. Meta's AI Research SuperCluster (RSC) helps its researchers build new and better AI models that can learn from trillions of examples; work across hundreds of different languages; seamlessly analyze text, images, and video together; develop new augmented reality tools; and much more.

While Meta has a long history of constructing AI infrastructure, details on the RSC, equipped with 16,000 NVIDIA A100 GPUs, were first shared in 2022. The RSC has propelled Meta's open and responsible AI research forward, facilitating the development of its initial wave of advanced AI models. It has been, and remains, instrumental in the evolution of projects like Llama and Llama 2, alongside the creation of sophisticated AI models for diverse applications, including computer vision, NLP, speech recognition, image generation, and even coding.

High-performance computing infrastructure is a critical component in training such large models, and Meta’s AI research team has been building these high-powered systems for many years. The first generation of this infrastructure, designed in 2017, has 22,000 NVIDIA V100 Tensor Core GPUs in a single cluster that performs 35,000 training jobs a day. Up until now, this infrastructure has set the bar for Meta’s researchers in terms of its performance, reliability, and productivity.

How It Works

Meta's newer AI clusters continue to build upon the successes and lessons learned from RSC. The company's focus remains on constructing end-to-end AI systems with a primary emphasis on enhancing the experience and productivity of researchers and developers. The efficiency of the high-performance network fabrics within these clusters, alongside pivotal storage decisions, combined with the inclusion of 24,576 NVIDIA H100 Tensor Core GPUs in each, enables both clusters to accommodate larger and more complex models than those supported by the RSC. This advancement paves the way for further progress in GenAI product development and AI research.
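
As a rough illustration of that scale, here is a back-of-envelope estimate (my own, not a figure from Meta) of one cluster's theoretical peak compute, assuming roughly 1 PFLOPS of dense BF16 throughput per H100:

```python
# Back-of-envelope estimate of one cluster's aggregate peak compute.
# The per-GPU figure is an approximation of H100 dense BF16 throughput;
# real training efficiency (MFU) is far below theoretical peak.
GPUS_PER_CLUSTER = 24_576
PEAK_BF16_FLOPS_PER_GPU = 0.99e15  # ~0.99 PFLOPS per GPU, approximate

aggregate_peak = GPUS_PER_CLUSTER * PEAK_BF16_FLOPS_PER_GPU
print(f"Theoretical peak: {aggregate_peak / 1e18:.1f} EFLOPS per cluster")
# -> roughly 24 EFLOPS of dense BF16 compute, before any real-world losses
```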


Network

With these objectives in mind, Meta constructed one cluster equipped with a remote direct memory access (RDMA) over converged Ethernet (RoCE) network fabric solution, based on the Arista 7800 with Wedge400 and Minipack2 OCP rack switches. The second cluster is outfitted with an NVIDIA Quantum2 InfiniBand fabric. Both solutions interconnect endpoints at 400 Gbps. By running these two distinct configurations, Meta aims to evaluate the suitability and scalability of each interconnect type for large-scale training, gaining valuable insights that will inform the design and construction of even larger, scaled-up clusters in the future. By carefully co-designing the network, software, and model architectures, Meta has effectively utilized both the RoCE and InfiniBand clusters for large GenAI workloads, including the ongoing training of Llama 3 on the RoCE cluster, without encountering any network bottlenecks.
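
As an illustration of what "training on the RoCE cluster" looks like from the software side, here is a minimal sketch of steering a PyTorch/NCCL job toward an RDMA-capable Ethernet fabric. The interface name, RDMA device name, and GID index below are placeholders, not Meta's actual configuration:

```python
# Minimal sketch: pointing NCCL at a RoCE fabric before initializing a
# distributed PyTorch job. All values below are hypothetical placeholders;
# real clusters set them according to their NIC naming and GID layout.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_SOCKET_IFNAME", "eth0")  # hypothetical RoCE-capable NIC
os.environ.setdefault("NCCL_IB_HCA", "mlx5_0")       # hypothetical RDMA device name
os.environ.setdefault("NCCL_IB_GID_INDEX", "3")      # RoCEv2 GID index, site-specific

# Rank, world size, and rendezvous address are normally injected by the
# launcher (e.g. torchrun); NCCL then runs its collectives over the fabric.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```

The same application code runs unchanged on the InfiniBand cluster, since NCCL handles the transport underneath, which is part of what makes a side-by-side comparison of the two fabrics practical.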

Computing

Both clusters are constructed using Grand Teton, Meta's in-house-designed, open GPU hardware platform, which has been contributed to the Open Compute Project (OCP). Grand Teton builds upon numerous generations of AI systems, integrating power, control, compute, and fabric interfaces into a single chassis to enhance overall performance, signal integrity, and thermal efficiency. It offers rapid scalability and flexibility within a simplified design, facilitating swift deployment across data center fleets and easy maintenance and scaling. Together with other in-house innovations such as Meta's Open Rack power and rack architecture, Grand Teton enables the construction of new clusters tailored for both current and future applications at Meta.

Performance

Aligned with Meta's commitment to constructing large-scale AI clusters, the company prioritizes maximizing performance and ease of use simultaneously, without compromising one for the other. This principle forms the bedrock of Meta's endeavor to develop best-in-class AI models.

As Meta ventures into the forefront of AI system advancements, the most effective approach to gauging scalability lies in the direct construction, optimization, and testing of designs. While simulators offer valuable insights, their utility has its limits. Throughout this design journey, Meta has conducted performance comparisons between small and large clusters to pinpoint potential bottlenecks. The graph below illustrates the AllGather collective performance, represented as normalized bandwidth on a 0-100 scale, during communication among a large number of GPUs at message sizes where roofline performance is anticipated.
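
Meta has not published the benchmark itself, but the kind of measurement described above can be approximated with a small PyTorch/NCCL microbenchmark that times all_gather at a fixed message size and converts the result into a bus-bandwidth figure. The message size, iteration counts, and helper name below are my own choices, and the script is meant to be launched with torchrun:

```python
# Sketch of an all_gather bandwidth microbenchmark (not Meta's internal tool).
import time
import torch
import torch.distributed as dist

def allgather_bus_bandwidth(message_bytes: int, iters: int = 20) -> float:
    world = dist.get_world_size()
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    send = torch.randn(message_bytes // 2, dtype=torch.float16, device=device)
    recv = [torch.empty_like(send) for _ in range(world)]

    for _ in range(5):                       # warm-up iterations
        dist.all_gather(recv, send)
    torch.cuda.synchronize(device)

    start = time.perf_counter()
    for _ in range(iters):
        dist.all_gather(recv, send)
    torch.cuda.synchronize(device)
    elapsed = (time.perf_counter() - start) / iters

    # Bus bandwidth for all-gather: each rank moves (world - 1) / world
    # of the total gathered payload over the fabric.
    return message_bytes * (world - 1) / elapsed / 1e9  # GB/s

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    bw = allgather_bus_bandwidth(256 * 1024 * 1024)      # 256 MiB per rank
    if dist.get_rank() == 0:
        print(f"all_gather bus bandwidth: {bw:.1f} GB/s")
    dist.destroy_process_group()
```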
