Alex Spinov
Modal Has a Free API: Run GPU Workloads in the Cloud Without DevOps

What is Modal?

Modal is a serverless cloud platform for running Python code. Define a function, decorate it with @app.function(), and it runs in the cloud on whatever hardware you ask for: CPUs, GPUs (A10G, A100, H100), or high-memory machines. No Docker, no Kubernetes, no infrastructure to manage.

Why Modal?

  • $30 free credits/month — enough to run hundreds of GPU tasks
  • Zero infrastructure — no Docker, no K8s, no servers
  • GPU access — A10G, A100, H100 on demand
  • Sub-second cold starts — custom container snapshots
  • Cron jobs — scheduled functions built in
  • Web endpoints — deploy APIs with one decorator

Quick Start

pip install modal
modal setup  # Free account at modal.com
import modal

app = modal.App("my-app")

@app.function()
def hello(name: str) -> str:
    return f"Hello {name} from the cloud!"

@app.local_entrypoint()
def main():
    print(hello.remote("World"))  # Runs in Modal's cloud
modal run app.py  # Deploys and runs instantly

GPU Workloads

import modal

app = modal.App("gpu-inference")

image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.function(gpu="A100", image=image, timeout=600)
def generate_text(prompt: str) -> str:
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,  # fp32 weights would not fit comfortably on a 40 GB A100
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=500)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

Web Endpoints

@app.function()
@modal.web_endpoint()
def predict(text: str):
    # `model` stands in for whatever model object your app loads
    result = model.predict(text)
    return {"prediction": result, "confidence": 0.95}

# Deploys to: https://your-user--my-app-predict.modal.run

Batch Processing

@app.function(gpu="A10G")
def process_image(image_url: str) -> dict:
    # classify() stands in for your model's inference call
    return {"url": image_url, "labels": classify(image_url)}

@app.local_entrypoint()
def main():
    urls = ["img1.jpg", "img2.jpg", "img3.jpg"] * 100  # 300 images

    # Process all 300 images in parallel across GPUs
    results = list(process_image.map(urls))
    print(f"Processed {len(results)} images")
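Modal's .map() fans the calls out across containers and returns results in input order. As a rough local analogy only (not Modal's implementation), a thread pool gives the same call pattern; classify is replaced here by a stub:

```python
from concurrent.futures import ThreadPoolExecutor

def process_image(image_url: str) -> dict:
    # Stub for the GPU function: a real version would run a classifier
    return {"url": image_url, "labels": ["placeholder"]}

urls = ["img1.jpg", "img2.jpg", "img3.jpg"] * 100  # 300 images

# Fan out across workers; pool.map collects results in input order
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(process_image, urls))

print(f"Processed {len(results)} images")
```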

Scheduled Functions

@app.function(schedule=modal.Period(hours=6))
def check_metrics():
    metrics = fetch_app_metrics()
    if metrics.error_rate > 0.05:
        send_alert(f"Error rate spike: {metrics.error_rate:.1%}")
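fetch_app_metrics and send_alert above are placeholders. A minimal local sketch of the same threshold logic, with invented stubs (the 8% sample error rate is made up):

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float

alerts: list[str] = []

def fetch_app_metrics() -> Metrics:
    # Stub: a real version would query your monitoring backend
    return Metrics(error_rate=0.08)

def send_alert(msg: str) -> None:
    alerts.append(msg)  # stub: would post to Slack, PagerDuty, etc.

def check_metrics():
    metrics = fetch_app_metrics()
    if metrics.error_rate > 0.05:
        send_alert(f"Error rate spike: {metrics.error_rate:.1%}")

check_metrics()
```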

Modal vs Alternatives

| Feature | Modal | AWS Lambda | RunPod | Replicate |
|---|---|---|---|---|
| Free credits | $30/mo | 1M requests | None | Limited |
| GPU | A10G, A100, H100 | None | All | A40, A100 |
| Cold start | <1s (snapshots) | 1-10s | 10-30s | 5-30s |
| Python native | Yes | Handler format | Docker | Cog format |
| Parallelism | .map() built-in | Manual | Manual | Manual |
| Web endpoints | Decorator | API Gateway | Manual | Auto |

Real-World Impact

A video processing startup needed to transcribe 10,000 videos. On a single GPU server: 72 hours. On Modal with 50 parallel A10G GPUs: 1.5 hours. Cost: $45 on Modal vs $200 for a dedicated GPU server running for 3 days. Plus zero time managing infrastructure.
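The speedup in that anecdote is just linear scaling; assuming near-ideal parallelism, the arithmetic checks out:

```python
single_gpu_hours = 72
gpus = 50

# Ideal linear scaling: total compute stays the same, wall-clock time divides
parallel_hours = single_gpu_hours / gpus
gpu_hours_total = gpus * parallel_hours  # still 72 GPU-hours of work

print(parallel_hours)  # 1.44, close to the 1.5 h observed
print(200 - 45)        # $155 saved vs the dedicated server
```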


Running GPU workloads? I help teams optimize ML infrastructure costs. Contact spinov001@gmail.com or explore my data tools on Apify.
