karthik Balasubramanian

Deploying the DeepSeek R1 Model on AWS EC2 for Scalable AI Solutions

Introduction

DeepSeek R1 is a state-of-the-art machine learning model designed for advanced data analysis, predictive modeling, and real-time inference. This guide walks through self-hosting DeepSeek R1 on an AWS EC2 instance and covers its use cases, advantages, cost optimization strategies, and key considerations for a successful deployment. EC2 is an ideal platform for deploying machine learning models thanks to its scalability, flexibility, and seamless integration with other AWS services.


Use Cases

DeepSeek R1 supports a wide range of applications across industries, including:

  • Fraud Detection: Detect anomalies in real-time financial transactions.
  • Predictive Maintenance: Analyze IoT data to prevent equipment failures.
  • Healthcare Diagnostics: Process large medical datasets for faster, more accurate diagnoses.
  • NLP Applications: Power conversational AI models or chatbots for improved customer interactions.

Deploying on EC2 provides a customizable and controlled environment to host these applications effectively.


Advantages of Deployment on EC2

  • Scalability: Dynamically scale resources based on real-time workloads, reducing costs during low usage periods.
  • Customization: Full control over the operating system, libraries, and model-specific configurations.
  • Performance: Utilize GPU-enabled EC2 instances (e.g., P3 or G5) for faster inference and training.
  • Data Privacy: Enhanced security through encryption, IAM roles, and VPC isolation.
  • Integration: Seamless integration with AWS services like S3, CloudWatch, and Lambda.

Instance Selection Guide for Developers

Recommended Instances:

  • Full Model Loading & Partial Inference:
    • Instance: p4d.24xlarge or higher.
  • Budget-Constrained Deployments:
    • Instance: p3.16xlarge or g5.48xlarge.
  • Lightweight Prototyping:
    • Instance: g4dn.12xlarge (4 NVIDIA T4 GPUs, 192 GB RAM).

My Configuration:

  • Instance Type: g4dn.12xlarge
  • GPU: 4 NVIDIA T4 GPUs (16 GB GDDR6 per GPU).
  • vCPUs: 48
  • Memory: 192 GB RAM
  • Storage: 1.9 TB NVMe SSD
  • Cost: ~$3.91 per hour (on-demand, us-east-1; pricing varies by region, and Spot capacity is often significantly cheaper).
  • Operating System: Ubuntu
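
If you prefer to script instance creation rather than click through the console, here is a minimal AWS CLI sketch. The AMI, key pair, security group, and subnet IDs below are placeholders; substitute values from your own account and region:

# Launch a g4dn.12xlarge instance (all IDs are placeholders)
aws ec2 run-instances \
  --image-id ami-0abcdef1234567890 \
  --instance-type g4dn.12xlarge \
  --key-name my-key-pair \
  --security-group-ids sg-0abcdef1234567890 \
  --subnet-id subnet-0abcdef1234567890 \
  --count 1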

Installing Ollama

To install and set up Ollama, follow these steps:

  1. Install Ollama:

   curl -fsSL https://ollama.com/install.sh | sudo sh

  2. Download the DeepSeek R1 model:

   ollama pull deepseek-r1:7b

  3. Test Ollama:

   ollama run deepseek-r1:7b
   ollama show deepseek-r1:7b
   ollama list

  4. Start the Ollama server:

   ollama serve
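
On Linux, the install script typically registers Ollama as a systemd service, so the API may already be listening on the default port 11434 before you run ollama serve yourself. A quick health check (the /api/tags endpoint lists the models available locally):

# Verify the service and the HTTP API
systemctl status ollama
curl http://localhost:11434/api/tags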


Setting Up the UI for Ollama Model

Using Open WebUI with Docker provides an intuitive interface for interacting with the model.

Step 1: Install Docker

SSH into the instance and run:

sudo apt-get update
sudo apt-get install -y docker.io
sudo systemctl enable --now docker
sudo usermod -aG docker $USER

Log out and back in so the docker group membership takes effect.

Step 2: Pull the Open WebUI Docker Image

docker pull ghcr.io/open-webui/open-webui:main

Step 3: Run the Docker Container

docker run -d -p 3000:8080 \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  --name open-webui \
  --restart always \
  ghcr.io/open-webui/open-webui:main
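
Before opening the browser, it is worth confirming that the container came up cleanly:

# Check that the container is running and inspect its startup logs
docker ps --filter name=open-webui
docker logs open-webui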


Step 4: Access the WebUI

After starting the container, access Open WebUI at:

http://localhost:3000

This URL works from the instance itself. To reach the UI from your own machine, open port 3000 in the instance's security group and browse to http://<instance-public-ip>:3000.
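
Alternatively, you can keep port 3000 closed to the internet and forward it over SSH instead. A minimal sketch, assuming an Ubuntu AMI (default user ubuntu) and a key file named my-key.pem (both placeholders):

# Forward local port 3000 to port 3000 on the instance
ssh -i my-key.pem -L 3000:localhost:3000 ubuntu@<instance-public-ip>

With the tunnel open, http://localhost:3000 in your local browser reaches the WebUI running on the instance.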


Accessing Ollama via REST API

Ollama provides a REST API for managing and interacting with models.

Generate a Response

curl http://localhost:11434/api/generate -d '{
  "model": "deepseek-r1:7b",
  "prompt": "Why is the sky blue?"
}'

Chat with the Model

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ]
}'
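
By default, both endpoints stream the response back as a series of newline-delimited JSON objects. To receive a single JSON object instead, set the API's stream option to false:

curl http://localhost:11434/api/chat -d '{
  "model": "deepseek-r1:7b",
  "messages": [
    { "role": "user", "content": "Why is the sky blue?" }
  ],
  "stream": false
}'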

Monitor GPU usage during inference using the command:

nvidia-smi
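
For a continuously refreshing view while a prompt is being processed, wrap it in watch:

watch -n 1 nvidia-smi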


Next Steps: Optimizations

  • Proxy Implementation: Put a reverse proxy such as nginx in front of Open WebUI for improved performance and scalability (see the sketch after this list).
  • Stable Environment: Automate deployment workflows with tools like Terraform or AWS CloudFormation.
  • Cost Optimization: Leverage Spot Instances or Savings Plans to reduce operating costs.
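
As a starting point for the reverse proxy item above, here is a minimal sketch that writes an nginx site configuration proxying port 80 to Open WebUI. It assumes nginx is installed on the instance; the server_name is a placeholder, and the WebSocket headers matter because Open WebUI streams chat responses over a persistent connection:

# Write a minimal nginx site config for Open WebUI (server_name is a placeholder)
sudo tee /etc/nginx/sites-available/open-webui >/dev/null <<'EOF'
server {
    listen 80;
    server_name your-domain.example.com;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}
EOF

# Enable the site and reload nginx after a config check
sudo ln -s /etc/nginx/sites-available/open-webui /etc/nginx/sites-enabled/
sudo nginx -t && sudo systemctl reload nginx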
