Alex Spinov
Modal Has a Free API: Run GPU Workloads in the Cloud Without DevOps

What is Modal?

Modal is a serverless cloud platform for running Python code. Define a function, decorate it with @app.function(), and it runs in the cloud on whatever hardware you ask for: CPUs, GPUs (A10G, A100, H100), or high-memory machines. No Docker, no Kubernetes, no infrastructure to manage.

Why Modal?

  • $30 free credits/month — enough to run hundreds of GPU tasks
  • Zero infrastructure — no Docker, no K8s, no servers
  • GPU access — A10G, A100, H100 on demand
  • Sub-second cold starts — custom container snapshots
  • Cron jobs — scheduled functions built in
  • Web endpoints — deploy APIs with one decorator

Quick Start

pip install modal
modal setup  # Free account at modal.com
import modal

app = modal.App("my-app")

@app.function()
def hello(name: str) -> str:
    return f"Hello {name} from the cloud!"

@app.local_entrypoint()
def main():
    print(hello.remote("World"))  # Runs in Modal's cloud
modal run app.py  # Deploys and runs instantly

GPU Workloads

import modal

app = modal.App("gpu-inference")

image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.function(gpu="A100", image=image, timeout=600)
def generate_text(prompt: str) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # fp32 weights would not fit comfortably on a 40 GB A100
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=500)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Web Endpoints

@app.function()
@modal.web_endpoint()
def predict(text: str):
    # `model` stands in for whatever model object your app loads
    result = model.predict(text)
    return {"prediction": result, "confidence": 0.95}

# Deploys to: https://your-user--my-app-predict.modal.run

Batch Processing

@app.function(gpu="A10G")
def process_image(image_url: str) -> dict:
    # classify() stands in for your model's inference call
    return {"url": image_url, "labels": classify(image_url)}

@app.local_entrypoint()
def main():
    urls = ["img1.jpg", "img2.jpg", "img3.jpg"] * 100  # 300 images

    # Process all 300 images in parallel across GPUs
    results = list(process_image.map(urls))
    print(f"Processed {len(results)} images")
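Modal's .map() fans the calls out across containers and returns results in input order. As a rough local analogy only (not Modal's implementation), a thread pool gives the same call pattern; classify is replaced here by a stub:

```python
from concurrent.futures import ThreadPoolExecutor

def process_image(image_url: str) -> dict:
    # Stub for the GPU function: a real version would run a classifier
    return {"url": image_url, "labels": ["placeholder"]}

urls = ["img1.jpg", "img2.jpg", "img3.jpg"] * 100  # 300 images

# Fan out across workers; pool.map collects results in input order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_image, urls))

print(f"Processed {len(results)} images")
```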

Scheduled Functions

@app.function(schedule=modal.Period(hours=6))
def check_metrics():
    metrics = fetch_app_metrics()
    if metrics.error_rate > 0.05:
        send_alert(f"Error rate spike: {metrics.error_rate:.1%}")
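fetch_app_metrics and send_alert above are placeholders. A minimal local sketch of the same threshold logic, with invented stubs (the 8% sample error rate is made up):

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float

alerts: list[str] = []

def fetch_app_metrics() -> Metrics:
    # Stub: a real version would query your monitoring backend
    return Metrics(error_rate=0.08)

def send_alert(msg: str) -> None:
    alerts.append(msg)  # stub: would post to Slack, PagerDuty, etc.

def check_metrics():
    metrics = fetch_app_metrics()
    if metrics.error_rate > 0.05:
        send_alert(f"Error rate spike: {metrics.error_rate:.1%}")

check_metrics()
```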

Modal vs Alternatives

| Feature | Modal | AWS Lambda | RunPod | Replicate |
|---|---|---|---|---|
| Free credits | $30/mo | 1M requests | None | Limited |
| GPU | A10G, A100, H100 | None | All | A40, A100 |
| Cold start | <1s (snapshots) | 1-10s | 10-30s | 5-30s |
| Python native | Yes | Handler format | Docker | Cog format |
| Parallelism | .map() built-in | Manual | Manual | Manual |
| Web endpoints | Decorator | API Gateway | Manual | Auto |

Real-World Impact

A video processing startup needed to transcribe 10,000 videos. On a single GPU server: 72 hours. On Modal with 50 parallel A10G GPUs: 1.5 hours. Cost: $45 on Modal vs $200 for a dedicated GPU server running for 3 days. Plus zero time managing infrastructure.
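The speedup in that anecdote is just linear scaling; assuming near-ideal parallelism, the arithmetic checks out:

```python
single_gpu_hours = 72
gpus = 50

# Ideal linear scaling: total compute stays the same, wall-clock time divides
parallel_hours = single_gpu_hours / gpus
gpu_hours_total = gpus * parallel_hours  # still 72 GPU-hours of work

print(parallel_hours)  # 1.44, close to the 1.5 h observed
print(200 - 45)        # $155 saved vs the dedicated server
```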


Running GPU workloads? I help teams optimize ML infrastructure costs. Contact spinov001@gmail.com or explore my data tools on Apify.
