DEV Community

Julien Simon
Julien Simon

Posted on • Originally published at julsimon.Medium on

Deploying Llama3 with Inference Endpoints and AWS Inferentia2

In this video, I walk you through the simple process of deploying a Meta Llama 3 8B model with Hugging Face Inference Endpoints and the AWS Inferentia2 accelerator, first with the Hugging Face UI, then with the Hugging Face API.

I use the latest version of the Hugging Face Text Generation Inference container and show you how to run streaming inference with the OpenAI client library. I also discuss Inferentia2 benchmarks.

Top comments (0)

A Workflow Copilot. Tailored to You.

Pieces.app image

Our desktop app, with its intelligent copilot, streamlines coding by generating snippets, extracting code from screenshots, and accelerating problem-solving.

Read the docs

👋 Kindness is contagious

Please leave a ❤️ or a friendly comment on this post if you found it helpful!

Okay