<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: MRUGANK MANOJ RAUT</title>
    <description>The latest articles on DEV Community by MRUGANK MANOJ RAUT (@mrugank).</description>
    <link>https://dev.to/mrugank</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1522666%2Fd3181855-8046-47a7-87ea-745c4e8829d2.jpg</url>
      <title>DEV Community: MRUGANK MANOJ RAUT</title>
      <link>https://dev.to/mrugank</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mrugank"/>
    <language>en</language>
    <item>
<title>LLM Performance Optimization Solutions</title>
      <dc:creator>MRUGANK MANOJ RAUT</dc:creator>
      <pubDate>Tue, 28 May 2024 09:29:44 +0000</pubDate>
      <link>https://dev.to/mrugank/llm-performance-optimization-solutions-5c0d</link>
      <guid>https://dev.to/mrugank/llm-performance-optimization-solutions-5c0d</guid>
      <description>&lt;h2&gt;
  
  
  Performance optimization techniques
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz5cthuqkuwc7e48d7q6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frz5cthuqkuwc7e48d7q6.png" alt="." width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
After setting up distributed training, LLM practitioners apply performance &amp;amp; memory optimization techniques. There are three common techniques for this.&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Mixed-Precision Training
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1f7cjec6letajp9n5tm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb1f7cjec6letajp9n5tm.png" alt="." width="800" height="263"&gt;&lt;/a&gt;&lt;br&gt;
This method uses lower-precision arithmetic (for example, FP16 or BF16 alongside FP32) to reduce resource utilization: it lightens the compute workload and lowers memory use. Because of this, we can train larger networks with the same amount of memory.&lt;/p&gt;
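
&lt;p&gt;Here is a minimal sketch of a mixed-precision training step with PyTorch's autocast and gradient scaler (the model, data, and hyperparameters are placeholders, and a CUDA device is assumed):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

# Placeholder model, optimizer, and synthetic data; assumes a CUDA device.
model = torch.nn.Linear(512, 10).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # rescales FP16 gradients to avoid underflow
loader = [(torch.randn(8, 512).cuda(), torch.randint(0, 10, (8,)).cuda())
          for _ in range(4)]

for inputs, targets in loader:
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in lower precision
        loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()     # backward pass on the scaled loss
    scaler.step(optimizer)            # unscales gradients, then steps
    scaler.update()
&lt;/code&gt;&lt;/pre&gt;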

&lt;h3&gt;
  
  
  2. Gradient Checkpointing
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kdc11ypjehxe2p53so9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5kdc11ypjehxe2p53so9.jpg" alt="." width="800" height="515"&gt;&lt;/a&gt;&lt;br&gt;
This technique stores only a subset of intermediate activations and recomputes the rest during the backward pass, trading extra computation for lower memory usage.&lt;/p&gt;
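
&lt;p&gt;A minimal sketch of gradient checkpointing in PyTorch (the toy model below is a placeholder): activations inside each checkpointed segment are dropped after the forward pass and recomputed during backward.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
from torch.utils.checkpoint import checkpoint_sequential

# Placeholder stack of layers; only segment-boundary activations are kept.
model = torch.nn.Sequential(*[torch.nn.Linear(256, 256) for _ in range(8)])
x = torch.randn(4, 256, requires_grad=True)

out = checkpoint_sequential(model, 2, x)  # 8 layers split into 2 segments
out.sum().backward()                      # inner activations recomputed here
&lt;/code&gt;&lt;/pre&gt;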

&lt;h3&gt;
  
  
  3. Operator Fusion
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv3cu2xsn975ziua9ffe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyv3cu2xsn975ziua9ffe.png" alt="a" width="735" height="355"&gt;&lt;/a&gt;&lt;br&gt;
Using this technique, we combine multiple operations into a single kernel, which reduces intermediate memory allocations and memory traffic.&lt;/p&gt;
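
&lt;p&gt;As an illustration, torch.compile can fuse a chain of element-wise operations into a single kernel (the tiny function below is just an example):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

# The multiply, add, and ReLU below can be fused into one kernel,
# avoiding two intermediate tensors.
def scale_shift_relu(x, w, b):
    return torch.relu(x * w + b)

fused = torch.compile(scale_shift_relu)   # compiles and fuses on first call

x, w, b = (torch.randn(1024, 1024) for _ in range(3))
out = fused(x, w, b)
&lt;/code&gt;&lt;/pre&gt;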




&lt;h2&gt;
  
  
  Using Purpose-Built Infrastructure
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. AWS Trainium
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmujw8seu85q01ohygkjd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmujw8seu85q01ohygkjd.png" alt="a" width="768" height="386"&gt;&lt;/a&gt;&lt;br&gt;
It is AWS's second-generation machine-learning accelerator, purpose-built for deep-learning training. It powers Amazon EC2 Trn1 instances.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. AWS Inferentia
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaph4d3pk58ndgcey9c6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcaph4d3pk58ndgcey9c6.png" alt="a" width="374" height="186"&gt;&lt;/a&gt;&lt;br&gt;
It delivers high performance at the lowest cost for deep-learning inference. Amazon EC2 Inf2 instances are built for large-scale generative-AI applications that run models containing billions of parameters.&lt;/p&gt;

&lt;p&gt;LLM practitioners can use the AWS Neuron SDK to run these high-performance workloads on Trainium and Inferentia.&lt;/p&gt;
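
&lt;p&gt;As a rough sketch (assuming a Trn1 or Inf2 instance with the Neuron SDK installed), a PyTorch model can be compiled for these accelerators with torch_neuronx; the toy model here is a placeholder:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch_neuronx  # part of the AWS Neuron SDK

# Placeholder model and example input for tracing.
model = torch.nn.Sequential(torch.nn.Linear(128, 64), torch.nn.ReLU()).eval()
example = torch.randn(1, 128)

neuron_model = torch_neuronx.trace(model, example)  # compile for the Neuron runtime
torch.jit.save(neuron_model, "model_neuron.pt")     # reload later with torch.jit.load
&lt;/code&gt;&lt;/pre&gt;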

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxd3ju5yodiou9qrfcbn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhxd3ju5yodiou9qrfcbn.png" alt="a" width="280" height="280"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Thank You&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>largelanguagemodel</category>
      <category>aws</category>
    </item>
    <item>
      <title>LLM Multi-Machine Training Solutions</title>
      <dc:creator>MRUGANK MANOJ RAUT</dc:creator>
      <pubDate>Tue, 28 May 2024 08:21:00 +0000</pubDate>
      <link>https://dev.to/mrugank/multi-machine-training-solutions-38pp</link>
      <guid>https://dev.to/mrugank/multi-machine-training-solutions-38pp</guid>
      <description>&lt;h2&gt;
  
  
  Scaling LLMs with Distributed Training
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx8wjajucgbr2fb4f1ik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqx8wjajucgbr2fb4f1ik.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
To maximize resource utilization and reduce training cost, practitioners use distributed computing techniques for multi-GPU or multi-machine training. These techniques are known as &lt;strong&gt;distributed data parallelism&lt;/strong&gt; and &lt;strong&gt;distributed model parallelism&lt;/strong&gt;. Both make efficient use of resources and support horizontal scaling, fault tolerance, and parallel processing.&lt;/p&gt;

&lt;h2&gt;
  
  
  Applying Data Parallelism Techniques
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsivvi87nt5mkl9ri9dgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsivvi87nt5mkl9ri9dgf.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
Data parallelism is used when the data does not fit on a single device, say a GPU. With data parallelism, the dataset is sharded across multiple devices, each of which holds a copy of the model. At the start of each step, a mini-batch is split into mutually exclusive slices across all model copies. The copies are then trained in parallel, and model parameters are synchronized across all devices. Collective-communication algorithms and high-performance networking frameworks perform this parameter synchronization.&lt;/p&gt;
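
&lt;p&gt;A minimal data-parallel training step with PyTorch's DistributedDataParallel might look like this (the model, data, and hyperparameters are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Launch with: torchrun --nproc_per_node=N this_script.py
dist.init_process_group("nccl")
rank = dist.get_rank()

model = torch.nn.Linear(512, 10).to(rank)   # every rank holds a full model copy
ddp_model = DDP(model, device_ids=[rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-3)

inputs = torch.randn(8, 512).to(rank)       # stand-in for this rank's data shard
targets = torch.randint(0, 10, (8,)).to(rank)

optimizer.zero_grad()
loss = torch.nn.functional.cross_entropy(ddp_model(inputs), targets)
loss.backward()          # gradients are all-reduced across ranks here
optimizer.step()         # every rank applies the same averaged update
&lt;/code&gt;&lt;/pre&gt;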

&lt;p&gt;Common approaches to data parallelism are as follows:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. AllReduce
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj0zbpddsnqmz3c3frxb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnj0zbpddsnqmz3c3frxb.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
The AllReduce approach relies on direct communication between devices to iteratively exchange model gradients and parameters. It aggregates the data from all devices and redistributes the aggregated result back to each of them.&lt;/p&gt;
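
&lt;p&gt;Seen in isolation, an AllReduce call looks like this: every rank contributes a tensor, and every rank receives the same aggregated result (a small sketch):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch
import torch.distributed as dist

# Launch with: torchrun --nproc_per_node=N this_script.py
dist.init_process_group("gloo")
t = torch.ones(4) * (dist.get_rank() + 1)  # rank 0 holds [1,1,1,1], rank 1 holds [2,2,2,2], ...
dist.all_reduce(t, op=dist.ReduceOp.SUM)   # afterwards t is identical on every rank
&lt;/code&gt;&lt;/pre&gt;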

&lt;h3&gt;
  
  
  2. Parameter Server
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8wjyu7ftgq4ktb77kul.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft8wjyu7ftgq4ktb77kul.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
Local model copies are synchronized through a set of parameter servers, which hold the most up-to-date copy of the model and perform a weight-averaging step. Synchronization can happen at the end of each training step (synchronous) or asynchronously, where model copies pull parameters and push gradients independently. To improve the performance of the parameter-server approach, HPC infrastructure components are used.&lt;/p&gt;
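
&lt;p&gt;The data flow can be illustrated with a toy synchronous parameter server (pure Python, no networking; a real system would use RPC across machines):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import numpy as np

class ParameterServer:
    def __init__(self, dim):
        self.params = np.zeros(dim)   # the most up-to-date copy of the model

    def pull(self):
        return self.params.copy()     # workers fetch current parameters

    def push(self, grads, lr=0.1):
        # Weight-averaging step: average the worker gradients, then update.
        self.params -= lr * np.mean(grads, axis=0)

server = ParameterServer(dim=4)
for step in range(3):
    local = [server.pull() for _ in range(2)]      # two workers pull params
    grads = [np.random.randn(4) for _ in local]    # stand-in local gradients
    server.push(np.stack(grads))                   # synchronous update
&lt;/code&gt;&lt;/pre&gt;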




&lt;h2&gt;
  
  
  Applying Model Parallelism Techniques
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4eu9htmevpzi1nhra1d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft4eu9htmevpzi1nhra1d.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
When the neural network is too big to fit on a single device, say a GPU, model parallelism is the natural solution. It also makes the training process less memory intensive: the model is partitioned across multiple devices so that the combined memory of the training cluster holds the entire model in a memory-efficient fashion.&lt;/p&gt;

&lt;p&gt;Common approaches to model parallelism are as follows:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Pipeline Parallelism
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssxoa1w0r72tly80mwew.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fssxoa1w0r72tly80mwew.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
It partitions the set of model layers across several devices and divides each mini-batch into micro-batches. These micro-batches are scheduled through a pipeline so that forward and backward computations on different devices overlap, reducing device idle time.&lt;/p&gt;
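
&lt;p&gt;A naive single-machine sketch of the partitioning and micro-batching (a real pipeline would place the stages on separate devices so their work overlaps):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

# Toy two-stage split of a model; the stages stand in for separate devices.
stage1 = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
stage2 = torch.nn.Linear(64, 10)

batch = torch.randn(32, 64)
micro_batches = batch.chunk(4)      # split the mini-batch into 4 micro-batches

outputs = []
for mb in micro_batches:
    hidden = stage1(mb)             # would run on device 0
    outputs.append(stage2(hidden))  # would run on device 1, overlapping device 0
result = torch.cat(outputs)
&lt;/code&gt;&lt;/pre&gt;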

&lt;h3&gt;
  
  
  2. Tensor Parallelism
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Few7nlb9rfp5nz2d60hah.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Few7nlb9rfp5nz2d60hah.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
Where pipeline parallelism partitions sets of layers, tensor parallelism splits individual weight tensors across multiple devices. It is required when a single parameter tensor consumes most of the GPU memory. Big models like GPT must be divided and run on many devices at the same time to handle all the calculations.&lt;/p&gt;
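
&lt;p&gt;The core idea can be shown on one machine by sharding a single linear layer's weight matrix in two and combining the partial outputs (a sketch; real implementations place each shard on its own GPU):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import torch

torch.manual_seed(0)
weight = torch.randn(64, 128)       # full weight of one layer, shape (out, in)
x = torch.randn(4, 128)

w0, w1 = weight.chunk(2, dim=0)     # shard the output dimension in two
y0 = x @ w0.t()                     # would be computed on device 0
y1 = x @ w1.t()                     # would be computed on device 1
y = torch.cat([y0, y1], dim=1)      # gather the partial outputs

assert torch.allclose(y, x @ weight.t(), atol=1e-5)  # matches the unsharded layer
&lt;/code&gt;&lt;/pre&gt;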




&lt;p&gt;On AWS, Amazon SageMaker offers data- and model-parallelism libraries. Other options include DeepSpeed by Microsoft and Megatron-LM by NVIDIA.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmm2klsvb4bydbscgtqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkmm2klsvb4bydbscgtqt.png" alt="."&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnzf8svvsm6uf1gn8vgt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frnzf8svvsm6uf1gn8vgt.png" alt="."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thank You&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>llm</category>
      <category>largelanguagemodel</category>
    </item>
    <item>
      <title>Common LLM Practitioner Challenges</title>
      <dc:creator>MRUGANK MANOJ RAUT</dc:creator>
      <pubDate>Tue, 28 May 2024 05:30:06 +0000</pubDate>
      <link>https://dev.to/mrugank/common-llm-practitioner-challenges-18nj</link>
      <guid>https://dev.to/mrugank/common-llm-practitioner-challenges-18nj</guid>
      <description>&lt;p&gt;Model quality depends on the size of the LLM and the data used to train it, but training an LLM is quite challenging. Let's look at some common challenges faced while building such LLMs.&lt;/p&gt;




&lt;h2&gt;
  
  
  1. Training Data Curation
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ibkq2v7vdj242ui4vle.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ibkq2v7vdj242ui4vle.jpg" alt="data curation" width="800" height="370"&gt;&lt;/a&gt;&lt;br&gt;
Transformer-based models are trained on large text datasets drawn from multiple sources. An LLM's quality depends heavily on the selection and curation of its training data. Preparing LLM training data is an active area of research in the industry. Collecting, processing, and cleaning the data requires a lot of resources, but these steps are necessary to ensure the quality of model outputs.&lt;/p&gt;
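
&lt;p&gt;As a deliberately simplistic illustration of the kind of cleaning involved, the sketch below normalizes whitespace, drops very short documents, and removes exact duplicates (real curation pipelines are far more elaborate):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import re

def curate(documents, min_words=20):
    seen, kept = set(), []
    for doc in documents:
        text = re.sub(r"\s+", " ", doc).strip()  # normalize whitespace
        if len(text.split()) &lt; min_words:        # drop low-content documents
            continue
        if text in seen:                         # exact-duplicate filter
            continue
        seen.add(text)
        kept.append(text)
    return kept
&lt;/code&gt;&lt;/pre&gt;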




&lt;h2&gt;
  
  
  2. Need for Large-Scale, High-End Infrastructure
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2t43f7yspzp06114jk9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fz2t43f7yspzp06114jk9.jpg" alt="infra" width="800" height="450"&gt;&lt;/a&gt;&lt;br&gt;
While training LLMs, we must balance factors such as model size, model performance, and computational complexity. Training requires large-scale accelerated computing resources, high-speed networking, and high-end compute instances, and it can take several days to weeks to complete.&lt;br&gt;
The high-end compute instances sit physically close to each other and are sometimes grouped on a single network spine.&lt;br&gt;
To detect and handle failures, GPU health-management software is essential; it also configures distributed storage and multi-node data I/O for the datasets.&lt;/p&gt;




&lt;h2&gt;
  
  
  3. High Training Costs
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqjptwi1ziwsvl4dugi3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faqjptwi1ziwsvl4dugi3.png" alt="cost" width="800" height="340"&gt;&lt;/a&gt;&lt;br&gt;
To train LLMs, organizations need to invest millions to billions of dollars. Only a few organizations are in a position to spend this much on training their own LLMs. Because of this, other teams and organizations look for cost-effective training options or fine-tune pre-trained models.&lt;/p&gt;




&lt;h2&gt;
  
  
  4. Machine Learning Expertise
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizjrf3kg6sgrtoszs3sk.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fizjrf3kg6sgrtoszs3sk.jpeg" alt="ML" width="800" height="488"&gt;&lt;/a&gt;&lt;br&gt;
To optimize the performance of LLMs, practitioners use advanced techniques for distributed training and parallel data processing, and they must also manage the training framework itself. All of this requires deep machine-learning expertise.&lt;/p&gt;




&lt;h2&gt;
  
  
  5. Responsible AI
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ot9hgfitg6mzp81c4nh.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ot9hgfitg6mzp81c4nh.jpg" alt="AI" width="480" height="480"&gt;&lt;/a&gt;&lt;br&gt;
LLMs are complex, and understanding their reasoning is a challenging task. Exploratory research is required to make certain that language models are fair, transparent, and unbiased. Another area of research is creating benchmarks to evaluate and compare model performance across various tasks.&lt;/p&gt;




&lt;p&gt;Interested in how LLMs are trained? Then read the following post!&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag__link"&gt;
  &lt;a href="/mrugank" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__pic"&gt;
      &lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1522666%2Fd3181855-8046-47a7-87ea-745c4e8829d2.jpg" alt="mrugank"&gt;
    &lt;/div&gt;
  &lt;/a&gt;
  &lt;a href="/mrugank/multi-machine-training-solutions-38pp" class="ltag__link__link"&gt;
    &lt;div class="ltag__link__content"&gt;
      &lt;h2&gt;Multi-Machine Training Solutions&lt;/h2&gt;
      &lt;h3&gt;MRUGANK MANOJ RAUT ・ May 28&lt;/h3&gt;
      &lt;div class="ltag__link__taglist"&gt;
        &lt;span class="ltag__link__tag"&gt;#llm&lt;/span&gt;
        &lt;span class="ltag__link__tag"&gt;#largelanguagemodel&lt;/span&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/a&gt;
&lt;/div&gt;
&lt;br&gt;
&lt;em&gt;&lt;strong&gt;Thank You.&lt;/strong&gt;&lt;/em&gt;

</description>
      <category>largelanguagemodel</category>
      <category>llm</category>
    </item>
    <item>
      <title>AWS Core Services - Networking</title>
      <dc:creator>MRUGANK MANOJ RAUT</dc:creator>
      <pubDate>Fri, 24 May 2024 06:17:34 +0000</pubDate>
      <link>https://dev.to/mrugank/aws-core-services-networking-5fn7</link>
      <guid>https://dev.to/mrugank/aws-core-services-networking-5fn7</guid>
      <description>&lt;p&gt;When you run an application in the cloud, you first have to connect your resources to the cloud, and then end users connect to your application. All of this falls under the concept of networking. To understand how networking works on AWS, we have to understand how Amazon VPC works.&lt;/p&gt;

&lt;h2&gt;
  
  
  Amazon VPC
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhx4j4uhq6ybtet0hcwz.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhx4j4uhq6ybtet0hcwz.jpg" alt="Amazon VPC" width="320" height="320"&gt;&lt;/a&gt;&lt;br&gt;
Amazon VPC is a private network space in which you launch your cloud resources to run your application. It provides logical isolation for your application, and you can control both the inbound and outbound traffic of your VPC and the way it connects to other networks.&lt;br&gt;
You can launch more than one VPC from your AWS account and use them for different workloads. You can also configure the way packets travel through the layers of your network.&lt;/p&gt;
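
&lt;p&gt;For example, a VPC with two subnets can be created with boto3 (the CIDR ranges and region below are arbitrary placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")   # the isolated network space
vpc_id = vpc["Vpc"]["VpcId"]

# Carve the VPC's address range into subnets, e.g. public and private tiers.
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.2.0/24")
&lt;/code&gt;&lt;/pre&gt;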

&lt;h2&gt;
  
  
  Amazon Route 53
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hnmqywotn2ssw6b7wpp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6hnmqywotn2ssw6b7wpp.jpg" alt="Route 53" width="320" height="320"&gt;&lt;/a&gt;&lt;br&gt;
Route 53 is a scalable Domain Name System (DNS) service. It has three functions: domain registration, DNS routing, and health checking. A DNS service translates domain names into IP addresses. With Route 53, you can purchase and manage domain names and configure their DNS settings, and it offers multiple routing options.&lt;/p&gt;
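
&lt;p&gt;For instance, pointing a domain name at an IP address is a single record change via boto3 (the hosted-zone ID, domain, and IP below are placeholders):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

route53 = boto3.client("route53")

route53.change_resource_record_sets(
    HostedZoneId="Z0000000EXAMPLE",
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",            # create or update the record
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",               # A record maps a name to an IPv4 address
                "TTL": 300,
                "ResourceRecords": [{"Value": "203.0.113.10"}],
            },
        }]
    },
)
&lt;/code&gt;&lt;/pre&gt;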

&lt;h2&gt;
  
  
  Amazon ELB
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd56l9ii2747eb97ar5g.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd56l9ii2747eb97ar5g.jpeg" alt="ELB" width="225" height="225"&gt;&lt;/a&gt;&lt;br&gt;
Elastic Load Balancing (ELB) automatically distributes incoming network traffic across multiple EC2 instances. It acts as a single point of contact for your application, so users do not need to be aware of how many machines your application is running on.&lt;/p&gt;
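
&lt;p&gt;As a sketch, an Application Load Balancer can be created with boto3 (the subnet IDs below are placeholders; EC2 instances are then registered in a target group behind it):&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;import boto3

elbv2 = boto3.client("elbv2")

lb = elbv2.create_load_balancer(
    Name="my-app-lb",
    Subnets=["subnet-0aaa1111", "subnet-0bbb2222"],  # one per Availability Zone
    Type="application",
)
print(lb["LoadBalancers"][0]["DNSName"])  # the single DNS name clients connect to
&lt;/code&gt;&lt;/pre&gt;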

</description>
      <category>aws</category>
      <category>cloud</category>
      <category>networking</category>
    </item>
  </channel>
</rss>
