Run AI Inference Workloads on Serverless GPUs with RunPod

In the world of GPU rentals, platforms like Vast.ai and AWS dominate by offering scalable compute resources. However, they share a common limitation of traditional cloud environments: you're billed for as long as the virtual machine is running, regardless of how much it's actually used. But what if your workload is unpredictable? This is where serverless computing comes in, letting you pay only for what you use. With RunPod, you can extend that serverless convenience to GPU-backed workloads, giving you a cost-effective, scalable option for AI and other compute-intensive tasks.



What is RunPod?

RunPod is a cloud service designed to provide GPU-powered Docker containers for rent. It allows developers to tackle compute-intensive workloads, like AI and ML, without needing expensive on-premises hardware.

Services Available

  • Serverless GPU Hosting: Deploy flexible, GPU-backed apps quickly.
  • AI API Endpoints: Offer your AI models as accessible APIs.
  • Dedicated and On-Demand GPU Instances: Full control for larger-scale tasks.

This post focuses on Serverless GPU hosting, which is perfect for scaling applications to handle dynamic user demands.


Prebuilt Containers for Rapid Deployment

RunPod offers a variety of ready-to-use containers, making it simple to get started without building everything from scratch. For example, you can choose containers designed for running inference on custom models, including those from Hugging Face. These preconfigured environments save time and allow you to focus on your application’s logic rather than infrastructure setup.


One highlight is the Serverless vLLM container, which lets you deploy OpenAI-compatible Large Language Model (LLM) endpoints using the efficient vLLM Engine. With just a few clicks, you can launch scalable LLM-powered APIs, perfect for applications requiring natural language processing.
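
Once a vLLM endpoint is live, you can call it with the standard OpenAI Python client by pointing it at the endpoint's OpenAI-compatible route. A rough sketch, treating the endpoint ID, API key, and model name as placeholders you would replace with your own values:

from openai import OpenAI

# Placeholders: substitute your endpoint ID, RunPod API key, and the model you deployed.
client = OpenAI(
    base_url="https://api.runpod.ai/v2/<endpoint_id>/openai/v1",
    api_key="<your_runpod_api_key>",
)

response = client.chat.completions.create(
    model="<huggingface_model_name>",
    messages=[{"role": "user", "content": "Explain serverless GPUs in one sentence."}],
)
print(response.choices[0].message.content)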


For other use cases, you can deploy your own serverless solution with a custom Docker container. Here's how.

Here's how.

Core Serverless Concepts

Before getting started, let’s break down the three essential parts of RunPod Serverless, covered below in the order you’ll work with them (with a Docker build-and-push step in between):

  1. Serverless Handler
  2. Templates
  3. Endpoints

1. Serverless Handler

The Serverless Handler is the heart of any application deployed on RunPod. It processes incoming data and generates a response, acting as the bridge between your code and the RunPod Serverless environment.

Requirements

  • Python Version: Make sure you are using Python 3.10.x, since earlier versions are incompatible.
  • Handler Code: The handler typically lives in a file called serverless_handler.py. Here’s a simple handler that takes a username as input and returns a custom message:
import runpod  

def prepare_response(data):
    """
    Handles the provided input data.
    """
    username = data.get('username', 'User')
    welcome_message = f'Hello there, {username}!'
    return {"welcome_message": welcome_message}

# Main entry point for the serverless function
def main_handler(event):
    """
    Entry point triggered by RunPod.
    """
    return prepare_response(event['input'])

if __name__ == '__main__':
    runpod.serverless.start({'handler': main_handler})
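
For a real inference workload, the pattern is the same, except the model should be loaded once at module import time, outside the handler, so it isn't reloaded on every request. A minimal sketch; load_model() here is a stand-in for your own model-loading code:

import runpod

def load_model():
    # Stand-in for your real model-loading code (e.g. torch or transformers).
    # Loading at module scope means it runs once per worker, not once per request.
    return lambda prompt: prompt.upper()

model = load_model()

def inference_handler(event):
    """
    Runs the (placeholder) model on the prompt from the request payload.
    """
    prompt = event['input'].get('prompt', '')
    return {"output": model(prompt)}

if __name__ == '__main__':
    runpod.serverless.start({'handler': inference_handler})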

Local Testing

Testing your handler locally before deployment is always recommended. Start by creating a file named test_input.json in the same directory as your handler; the RunPod SDK picks it up automatically as the job input when you run the handler locally:

{
    "input": {
        "username": "Jordan"
    }
}

Set up a virtual environment and install the required runpod module:

python3 -m venv venv  
source venv/bin/activate  
pip install runpod  
python3 -u serverless_handler.py  

You should see output similar to the following:

WARN | RUNPOD_WEBHOOK_GET_JOB not set, switching to get_local  
INFO | local_test | Initialized  
WARN | Test results: {"output": {"welcome_message": "Hello there, Jordan!"}}  
INFO | local_test | Execution completed successfully.  

2. Building and Deploying with Docker

After confirming your handler works locally, containerize the application using Docker. Use the following as an example Dockerfile:

FROM python:3.10-slim  

WORKDIR /app  
RUN pip install --no-cache-dir runpod  
COPY serverless_handler.py .  

CMD ["python3", "-u", "serverless_handler.py"]  

Steps to Build and Push

  1. Log in to Docker Hub:

docker login

  2. Build the image and push it to your Docker repository (replace yourname/image:version appropriately):

docker build -t yourname/image:version .
docker push yourname/image:version

  3. For M1/M2 Mac users, specify the platform during the build step:

docker buildx build --push -t yourname/image:version . --platform linux/amd64

3. Templates


Templates define the specifications of your Serverless application. Create one via the RunPod Console:

  • Provide a name for your template.
  • Enter the Docker image name (e.g., yourname/image:version).
  • Specify environment variables, if necessary, in key-value format.
  • Leave Container Registry Credentials blank unless using a private registry.
  • Set container disk size (5GB is generally sufficient).

4. Endpoints

Endpoints make your application accessible via a REST API; an example request is shown after the setup steps below. Go to RunPod Endpoints and create one:

  1. Name the endpoint for easy identification.
  2. Choose the template created earlier.
  3. Set GPU tiers to match your application needs (e.g., choose a tier for high VRAM if required).
  4. Define scaling options:
    • Max Workers: Set between 2–5 for most use cases.
    • Idle Timeout: Default is fine in most cases.
    • Active Workers: Leave this at 0 to keep costs down; active workers stay running (and billed) even when no requests are coming in.
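
Once the endpoint is deployed, RunPod exposes it at https://api.runpod.ai/v2/<endpoint_id>/ with a runsync route (blocks until the job finishes) and a run route (returns a job ID you can poll via /status/<job_id>). A minimal sketch of calling the example handler from earlier, with the endpoint ID and API key as placeholders:

import requests

ENDPOINT_ID = "<your_endpoint_id>"   # placeholder: shown on the endpoint's page in the console
API_KEY = "<your_runpod_api_key>"    # placeholder: create one in the RunPod console

# /runsync waits for the job to complete and returns the handler's output directly.
response = requests.post(
    f"https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={"input": {"username": "Jordan"}},
    timeout=120,
)
print(response.json())  # e.g. {"status": "COMPLETED", "output": {"welcome_message": "Hello there, Jordan!"}, ...}

For longer-running jobs, use the asynchronous run route instead and poll the status endpoint until the job completes.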

Advanced Features

  • FlashBoot: Reduces cold starts to as little as 2 seconds, useful for LLMs.
  • Scale Type: Choose Queue Delay for time-based scaling or Request Count for load-based scaling.
  • Network Volumes: Share files and models across multiple workers (see the sketch below).
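
Network volumes are mounted inside each serverless worker, typically at /runpod-volume (treat the exact mount path as an assumption and confirm it in the console). A common pattern is to point your model cache at the volume so weights are downloaded once and reused by every worker; a minimal sketch using Hugging Face's HF_HOME cache variable:

import os

# Assumption: the attached network volume is mounted at /runpod-volume in the worker.
CACHE_DIR = "/runpod-volume/hf-cache"
os.makedirs(CACHE_DIR, exist_ok=True)

# Set this before importing transformers / huggingface_hub so they read and write
# model files from the shared volume instead of the worker's local disk.
os.environ["HF_HOME"] = CACHE_DIR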

Key GPU and Region Tips

  • GPU Tiers: Select based on application demands (e.g., RTX 3090 for typical workloads, RTX 4090 for high-performance tasks).
  • Deployment Regions: Use regions with high GPU availability to reduce queue delays.

By following these steps, you can deploy powerful, GPU-enabled applications on RunPod’s Serverless platform while enjoying cost-efficient scalability. Explore the RunPod Console to get started today!
