Ayush kumar for NodeShift

How to deploy Llama 3.1 405B in the Cloud?

Meta's latest open-source AI model is its biggest yet.

Meta introduced Llama 3.1 405B, a model with 405 billion parameters. According to Meta's published evaluations, it is competitive with leading LLMs across a broad range of benchmarks, with particular strengths in general knowledge, steerability, math, tool use, and multilingual translation. These capabilities make it an attractive model for developers and tech professionals to explore.

The Llama 3.1 models come in three sizes (8B, 70B, and 405B), and Meta reports that each performs strongly against other open-weight models of similar size. In this blog, we will focus on the 405B model. If you want to learn about the 70B model, please read the previous post in our Llama series. The Llama 3.1 research report states that the 405B model delivers benchmark performance comparable to GPT-4.

Llama 3.1 405B: The Powerhouse

✅405 billion parameters
✅Trained on over 15 trillion tokens
✅Rivals top closed-source AI models in capabilities
✅State-of-the-art performance in general knowledge, steerability, math, and tool use
✅Multilingual translation support

The Llama 3.1 AI model is a versatile powerhouse designed to empower AI agents with its large context window of 128K tokens. It supports native tool use and function calling capabilities and excels in math, logic, and reasoning problems. Its range of advanced use cases, from long-form text summarization to multilingual conversational agents and coding assistants, will inspire developers and tech professionals to explore its potential.

Trained using 16,000 Nvidia H100 GPUs and benefiting from cutting-edge training and development techniques, the Llama 3.1 AI model is a force to be reckoned with. Meta claims it's on par with leading proprietary models like OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet, with a few caveats.

Step-by-Step Process to Deploy Llama 3.1 in the Cloud

For the purpose of this tutorial, we will use a GPU-powered Virtual Machine offered by NodeShift; however, you can replicate the same steps with any other cloud provider of your choice.

Step 1: Sign Up and Set Up a NodeShift Cloud Account

Visit the NodeShift Cloud website (https://app.nodeshift.com/) and create an account. Once you've signed up, log into your account.

Follow the account setup process and provide the necessary details and information.

Step 2: Create a GPU Virtual Machine

NodeShift GPUs offer flexible and scalable on-demand resources such as NodeShift Virtual Machines (VMs) equipped with a range of GPUs, from A100s to H100s. These GPU-powered VMs give you more control over the environment, allowing you to adjust the GPU, CPU, RAM, and storage configuration based on your requirements.

Navigate to the menu on the left side and select the GPU VMs option. In the Dashboard, click the Create GPU VM button to create your first deployment.

Step 3: Select a Model, Region, and Storage

In the "GPU VMs" tab, select a GPU Model and Storage according to your needs and the geographical region where you want to launch your model.

For the purpose of this tutorial, we are using 8x A100 SXM4 GPUs to deploy Llama 3.1 405B. After this, select the amount of storage needed to run meta-llama/Meta-Llama-3.1-405B. You will need at least 810 GB of storage.
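
The 810 GB figure is consistent with storing 405 billion parameters at 16 bits (2 bytes) each. Here is a minimal sketch of that arithmetic, assuming decimal gigabytes and counting only the raw weights (tokenizer files, checkpoints, and other overhead are ignored). The function name is our own, and the sizes of quantized downloads you actually see may differ from these estimates.

```python
def weight_size_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate size of the raw weights in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_param // 8
    return total_bytes / 1e9


LLAMA_405B_PARAMS = 405_000_000_000

# Compare a few common precisions; only the 16-bit case is quoted in this post.
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_size_gb(LLAMA_405B_PARAMS, bits):.1f} GB")
```

Leave extra headroom on top of the estimate so that the download and any temporary files fit comfortably.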

Note:

✅Meta-Llama-3.1-8B-Instruct is recommended on 1x NVIDIA A10G or L4 GPUs.
✅Meta-Llama-3.1-70B-Instruct is recommended on 4x NVIDIA A100 or as AWQ/GPTQ quantized on 2x A100s.
✅Meta-Llama-3.1-405B-Instruct-FP8 is recommended on 8x NVIDIA H100 in FP8 or as AWQ/GPTQ quantized on 8x A100s.
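
To see why the recommendations above pair larger models with more GPUs or lower precision, a rough fit check can help. This is only a sketch under stated assumptions: 80 GB of memory per GPU (A100 and H100 cards also ship in other configurations), and an arbitrary 90% usable fraction to leave room for the KV cache and activations. Real requirements depend on context length, batch size, and the serving runtime.

```python
def fits_in_vram(weights_gb: float, num_gpus: int, gb_per_gpu: float,
                 usable_fraction: float = 0.9) -> bool:
    """Return True if the weights fit in the combined GPU memory.

    usable_fraction is an assumed safety margin, not a measured value.
    """
    usable_gb = num_gpus * gb_per_gpu * usable_fraction
    return weights_gb <= usable_gb


# 405B parameters: 810 GB at 16-bit, 405 GB at 8-bit (weights only).
print("16-bit on 8x 80 GB:", fits_in_vram(810, num_gpus=8, gb_per_gpu=80))
print("8-bit on 8x 80 GB:", fits_in_vram(405, num_gpus=8, gb_per_gpu=80))
```

Under these assumptions the 16-bit weights do not fit on eight 80 GB GPUs, while the 8-bit weights do, which matches the direction of the recommendations.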

Step 4: Select Authentication Method

There are two authentication methods available: Password and SSH Key. SSH keys are the more secure option. To create them, head over to our official documentation: (https://docs.nodeshift.com/gpus/create-gpu-deployment)

Step 5: Choose an Image

Next, you will need to choose an image for your VM. We will deploy Llama 3.1 405B on an NVIDIA CUDA Virtual Machine image. CUDA is NVIDIA's proprietary, closed-source parallel computing platform, and it provides the GPU support needed to install and run Llama 3.1 on your GPU VM.

After choosing the image, click the ‘Create’ button, and your VM will be deployed.

Step 6: Virtual Machine Successfully Deployed

You will get visual confirmation that your machine is up and running.

Step 7: Connect to GPUs using SSH

NodeShift GPUs can be connected to and controlled through a terminal using the SSH key provided during GPU creation.

Once your GPU VM deployment is successfully created and has reached the 'RUNNING' status, you can navigate to the page of your GPU Deployment Instance. Then, click the 'Connect' button in the top right corner.

Now open your terminal and paste the proxy SSH details shown after clicking 'Connect'.

Next, if you want to check the GPU details, run the following command:

nvidia-smi
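
If you would rather check the GPUs from a script, nvidia-smi can also print machine-readable output through its --query-gpu and --format options. The sketch below wraps that call in Python; the helper names are our own, and the sample line in the usage comment is an illustrative value, not output captured from this VM.

```python
import subprocess


def parse_gpu_csv(csv_text: str) -> list[tuple[str, int]]:
    """Parse 'name, memory.total' CSV lines into (name, memory in MiB) pairs."""
    gpus = []
    for line in csv_text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, memory_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append((name, int(memory_mib)))
    return gpus


def query_gpus() -> list[tuple[str, int]]:
    """Ask nvidia-smi for each GPU's name and total memory (requires an NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_csv(result.stdout)


# Usage on the GPU VM:
#   for name, memory_mib in query_gpus():
#       print(name, memory_mib, "MiB")
# Parsing alone can be tried anywhere with a hypothetical sample line:
#   parse_gpu_csv("NVIDIA A100-SXM4-80GB, 81920")
```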

Step 8: Install Ollama

After completing all of the above steps, it's time to install Ollama, the tool we will use to download and run Llama 3.1. The model itself is listed on the Ollama website.

Website Link: https://ollama.com/library/llama3.1

We will be running Meta-Llama-3.1-405b, so choose the 405b model from the website.

After this, run the following command in the terminal to start the Ollama installation:

curl -fsSL https://ollama.com/install.sh | sh

Now that the installation is complete, run the command below to see a list of available commands:

ollama

Step 9: Install Llama 3.1 405b Model

To install the Llama 3.1 405b Model, run the following command:

ollama pull llama3.1:405b
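
Once the pull finishes, you may want to confirm from a script that the model is available locally. Ollama exposes a local REST API, by default at http://localhost:11434, and its /api/tags endpoint lists the models that have been downloaded. The following is a minimal sketch assuming the default address and a running Ollama service; the helper names are our own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # Ollama's default address; change it if yours differs


def model_names(tags_json: str) -> list[str]:
    """Extract model names from the JSON body returned by /api/tags."""
    return [model["name"] for model in json.loads(tags_json).get("models", [])]


def is_model_pulled(model: str, base_url: str = OLLAMA_URL) -> bool:
    """Return True if the model appears in the local Ollama model list."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model in model_names(resp.read().decode("utf-8"))


# Usage on the VM, with Ollama running:
#   print(is_model_pulled("llama3.1:405b"))
```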

Step 10: Run Llama 3.1 405b Model

Now, you can run the model in the terminal using the command below and interact with your model:

ollama run llama3.1:405b
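
Besides the interactive terminal session, the same model can be called from code through Ollama's local REST API. Here is a minimal sketch using the /api/generate endpoint with streaming turned off, assuming the default address http://localhost:11434. The helper names and the long timeout are our own choices, since a model of this size can take a while to load and respond.

```python
import json
import urllib.request


def build_generate_request(prompt: str, model: str = "llama3.1:405b",
                           base_url: str = "http://localhost:11434") -> urllib.request.Request:
    """Build a POST request for Ollama's /api/generate endpoint (non-streaming)."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        f"{base_url}/api/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, timeout: float = 600.0) -> str:
    """Send the prompt to the local Ollama server and return the generated text."""
    request = build_generate_request(prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Usage on the VM, with Ollama running and the model pulled:
#   print(generate("Write a Python program to check for an Armstrong number."))
```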

Example: Integrate 1/(1+x²) over the interval [0, 1]
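
For reference when checking the model's answer to this prompt, the integral has a standard closed form, because the antiderivative of 1/(1+x²) is arctan(x):

```latex
\int_{0}^{1} \frac{dx}{1+x^{2}}
  = \Big[\arctan x\Big]_{0}^{1}
  = \arctan 1 - \arctan 0
  = \frac{\pi}{4} \approx 0.7854
```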

Example: Python Program to Check Armstrong Number
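
To help judge the model's answer to this prompt, here is a minimal reference sketch of our own (it is not the model's output). An Armstrong number equals the sum of its digits, each raised to the power of the number of digits.

```python
def is_armstrong(n: int) -> bool:
    """Return True if n equals the sum of its digits raised to the digit count."""
    if n < 0:
        return False
    digits = str(n)
    power = len(digits)
    return n == sum(int(digit) ** power for digit in digits)


# All three-digit Armstrong numbers
print([n for n in range(100, 1000) if is_armstrong(n)])  # prints [153, 370, 371, 407]
```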

Conclusion

Deploying the Llama 3.1 model in the cloud is a straightforward process that unlocks the potential of Meta AI's latest open-weight LLMs. By following the steps outlined, from setting up a NodeShift Cloud account and creating a GPU VM to installing and running the Llama 3.1 model, developers can leverage the model's advanced capabilities for a wide range of applications. The guide ensures that users can efficiently deploy the powerful 405B model to build innovative AI solutions with its extended context length and enhanced reasoning abilities.

For more information about NodeShift:

Website
Docs
LinkedIn
X
Discord
Blogs
