What is Modal?
Modal is a serverless cloud platform for running Python code. Define a function, decorate it with @app.function(), and it runs in the cloud on whatever hardware you request — CPUs, GPUs (A10G, A100, H100), or high-memory machines. No Docker, no Kubernetes, no infrastructure to manage.
Why Modal?
- $30 free credits/month — enough to run hundreds of GPU tasks
- Zero infrastructure — no Docker, no K8s, no servers
- GPU access — A10G, A100, H100 on demand
- Sub-second cold starts — custom container snapshots
- Cron jobs — scheduled functions built in
- Web endpoints — deploy APIs with one decorator
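The cold-start claim comes from Modal's container snapshot feature. As a minimal sketch (assuming the memory snapshot option available in recent Modal releases), you opt in per function:

```python
import modal

app = modal.App("snapshot-demo")

# enable_memory_snapshot captures the container's memory state after startup,
# so later cold starts restore the snapshot instead of re-running imports.
@app.function(enable_memory_snapshot=True)
def warm_fn() -> str:
    return "restored from snapshot on cold start"
```

This matters most for functions with heavy imports (torch, transformers), where import time dominates the cold start.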
Quick Start
```shell
pip install modal
modal setup  # Free account at modal.com
```

```python
import modal

app = modal.App("my-app")

@app.function()
def hello(name: str) -> str:
    return f"Hello {name} from the cloud!"

@app.local_entrypoint()
def main():
    print(hello.remote("World"))  # Runs in Modal's cloud
```

```shell
modal run app.py  # Deploys and runs instantly
```
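Note that modal run executes the app once and then tears it down. For the scheduled functions and web endpoints shown later, Modal's CLI also provides a persistent deploy command:

```shell
# One-off execution: builds the image, runs main(), streams logs, then exits.
modal run app.py

# Persistent deployment: the app stays live to serve endpoints and cron jobs.
modal deploy app.py
```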
GPU Workloads
```python
import modal

app = modal.App("gpu-inference")
# accelerate is required for device_map="auto"
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.function(gpu="A100", image=image, timeout=600)
def generate_text(prompt: str) -> str:
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # This model is gated on Hugging Face; you need an approved access token.
    model_id = "meta-llama/Meta-Llama-3-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
    outputs = model.generate(**inputs, max_new_tokens=500)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```
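Reloading an 8B-parameter model on every call is slow. A common pattern (a sketch using Modal's class lifecycle hooks; the class and method names here are illustrative) loads the model once per container with @modal.enter, then serves many requests from the warm container:

```python
import modal

app = modal.App("gpu-inference-cls")
image = modal.Image.debian_slim().pip_install("torch", "transformers", "accelerate")

@app.cls(gpu="A100", image=image)
class Generator:
    @modal.enter()
    def load(self):
        # Runs once per container start, not once per request.
        from transformers import AutoModelForCausalLM, AutoTokenizer
        model_id = "meta-llama/Meta-Llama-3-8B-Instruct"  # gated; needs an HF token
        self.tokenizer = AutoTokenizer.from_pretrained(model_id)
        self.model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    @modal.method()
    def generate(self, prompt: str) -> str:
        inputs = self.tokenizer(prompt, return_tensors="pt").to("cuda")
        outputs = self.model.generate(**inputs, max_new_tokens=500)
        return self.tokenizer.decode(outputs[0], skip_special_tokens=True)
```

Containers stay warm between requests, so only the first call pays the model-loading cost.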
Web Endpoints
```python
@app.function()
@modal.web_endpoint()
def predict(text: str):
    # "model" is assumed to be loaded elsewhere (e.g. at import time);
    # the endpoint itself does not define it.
    result = model.predict(text)
    return {"prediction": result, "confidence": 0.95}

# Deploys to: https://your-user--my-app-predict.modal.run
```
Batch Processing
```python
@app.function(gpu="A10G")
def process_image(image_url: str) -> dict:
    # classify() is a placeholder for your own GPU inference code.
    return {"url": image_url, "labels": classify(image_url)}

@app.local_entrypoint()
def main():
    urls = ["img1.jpg", "img2.jpg", "img3.jpg"] * 100  # 300 images
    # Process all 300 images in parallel across GPUs
    results = list(process_image.map(urls))
    print(f"Processed {len(results)} images")
```
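With 300 inputs, a single bad URL would otherwise abort the whole run. Modal's map accepts a return_exceptions flag for this (a sketch with illustrative failure logic; check the behavior on your Modal version):

```python
import modal

app = modal.App("batch-with-errors")

@app.function(gpu="A10G")
def process_image(image_url: str) -> dict:
    # Illustrative failure: stands in for a fetch or inference error.
    if "bad" in image_url:
        raise ValueError(f"could not fetch {image_url}")
    return {"url": image_url, "labels": []}

@app.local_entrypoint()
def main():
    urls = ["img1.jpg", "bad-url", "img3.jpg"]
    # return_exceptions=True yields Exception objects for failed inputs
    # instead of aborting the entire map call.
    results = list(process_image.map(urls, return_exceptions=True))
    ok = [r for r in results if not isinstance(r, Exception)]
    print(f"{len(ok)} succeeded, {len(results) - len(ok)} failed")
```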
Scheduled Functions
```python
@app.function(schedule=modal.Period(hours=6))
def check_metrics():
    # fetch_app_metrics() and send_alert() are placeholders for your own code.
    metrics = fetch_app_metrics()
    if metrics.error_rate > 0.05:
        send_alert(f"Error rate spike: {metrics.error_rate:.1%}")
```
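modal.Period runs on a fixed interval relative to deployment time. For exact wall-clock scheduling, Modal also offers modal.Cron with standard cron syntax (a minimal sketch; the function name is illustrative):

```python
import modal

app = modal.App("cron-demo")

# Run at minute 0 of hours 0, 6, 12, and 18 UTC.
@app.function(schedule=modal.Cron("0 */6 * * *"))
def on_the_hour():
    print("running on a fixed wall-clock schedule")
```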
Modal vs Alternatives
| Feature | Modal | AWS Lambda | RunPod | Replicate |
|---|---|---|---|---|
| Free credits | $30/mo | 1M requests | None | Limited |
| GPU | A10G, A100, H100 | None | All | A40, A100 |
| Cold start | <1s (snapshots) | 1-10s | 10-30s | 5-30s |
| Python native | Yes | Handler format | Docker | Cog format |
| Parallelism | .map() built-in | Manual | Manual | Manual |
| Web endpoints | Decorator | API Gateway | Manual | Auto |
Real-World Impact
A video processing startup needed to transcribe 10,000 videos. On a single GPU server: 72 hours. On Modal with 50 parallel A10G GPUs: 1.5 hours. Cost: $45 on Modal vs $200 for a dedicated GPU server running for 3 days. Plus zero time managing infrastructure.
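The speedup in that story is close to linear in the number of workers, which a quick back-of-the-envelope check (using only the numbers above) confirms:

```python
# Single-server baseline: 10,000 videos in 72 hours.
videos = 10_000
single_gpu_hours = 72.0
per_video_hours = single_gpu_hours / videos  # about 26 seconds per video

# Fan the same work out over 50 GPUs: wall-clock time divides by the
# worker count (ignoring scheduling overhead).
workers = 50
parallel_hours = single_gpu_hours / workers

print(f"{per_video_hours * 3600:.1f}s per video")
print(f"{parallel_hours:.2f} hours wall clock on {workers} GPUs")
```

72 / 50 = 1.44 hours, which matches the roughly 1.5 hours reported once per-task overhead is included.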
Running GPU workloads? I help teams optimize ML infrastructure costs. Contact spinov001@gmail.com or explore my data tools on Apify.