ONNX Runtime is Microsoft's open-source inference engine for running machine learning models across platforms with hardware acceleration, and its full API is free to use.
## Why ONNX Runtime Matters
Most ML frameworks lock you into one ecosystem: TensorFlow models don't run in PyTorch, and PyTorch models don't run in the browser. ONNX Runtime solves this with a single execution engine for any model exported to the ONNX format.
What you get for free:
- Run models trained in ANY framework (PyTorch, TensorFlow, scikit-learn, XGBoost)
- Hardware acceleration: CPU, GPU (CUDA/ROCm), DirectML, TensorRT, OpenVINO
- Language support: Python, C++, C#, Java, JavaScript, React Native, Objective-C
- Optimized inference that is often 2-10x faster than the native framework, depending on model and hardware
## Quick Start: Python

```python
import onnxruntime as ort
import numpy as np

# Load any ONNX model
session = ort.InferenceSession("model.onnx")

# Check input requirements
input_name = session.get_inputs()[0].name
input_shape = session.get_inputs()[0].shape
print(f"Model expects: {input_name} with shape {input_shape}")

# Run inference on random data shaped like an ImageNet input
test_input = np.random.randn(1, 3, 224, 224).astype(np.float32)
results = session.run(None, {input_name: test_input})
print(f"Output shape: {results[0].shape}")
```
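Random tensors are fine for a smoke test, but real inputs need the same preprocessing the model was trained with. Here is a minimal sketch assuming an ImageNet-style model (the mean/std values and NCHW layout are assumptions about your model, not something read from the ONNX file):

```python
import numpy as np

def preprocess(image_hwc: np.ndarray) -> np.ndarray:
    """Convert an HxWxC uint8 image to a normalized 1x3xHxW float32 tensor.

    Uses ImageNet normalization constants; adjust mean/std for your model.
    """
    mean = np.array([0.485, 0.456, 0.406], dtype=np.float32)
    std = np.array([0.229, 0.224, 0.225], dtype=np.float32)
    x = image_hwc.astype(np.float32) / 255.0  # scale to [0, 1]
    x = (x - mean) / std                      # per-channel normalize
    x = np.transpose(x, (2, 0, 1))            # HWC -> CHW
    return x[np.newaxis, ...]                 # add batch dim -> NCHW

# Example: a fake 224x224 RGB image
fake_image = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
tensor = preprocess(fake_image)
print(tensor.shape, tensor.dtype)  # (1, 3, 224, 224) float32
```

The result can be fed straight into `session.run` in place of `test_input` above.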
## Convert Any Model to ONNX

```python
# PyTorch to ONNX
import torch

model = torch.load("pytorch_model.pt")
model.eval()  # switch off dropout/batch-norm training behavior before export
dummy_input = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model, dummy_input, "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
)
```

```python
# TensorFlow to ONNX
# pip install tf2onnx
# For a SavedModel directory, the supported route is the tf2onnx CLI:
#   python -m tf2onnx.convert --saved-model saved_model_dir --output model.onnx
# For a Keras model, the Python API works directly:
import tensorflow as tf
import tf2onnx

model = tf.keras.models.load_model("keras_model")
spec = (tf.TensorSpec((None, 224, 224, 3), tf.float32, name="input"),)
onnx_model, _ = tf2onnx.convert.from_keras(model, input_signature=spec, output_path="model.onnx")
```
## GPU Acceleration (Automatic Fallback)

```python
# Prefer GPU, fall back to CPU if CUDA isn't available
providers = [
    ("CUDAExecutionProvider", {"device_id": 0}),
    "CPUExecutionProvider",  # fallback
]
session = ort.InferenceSession("model.onnx", providers=providers)

# Check which providers the session actually ended up using
print(session.get_providers())
```
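ONNX Runtime walks the providers list in priority order and keeps only the ones available in your build. The selection logic can be sketched in plain Python (the provider names are real; `pick_providers` itself is an illustration, not ORT's actual code):

```python
def pick_providers(requested, available):
    """Return the requested providers that are actually available, in priority order."""
    chosen = [p for p in requested if (p[0] if isinstance(p, tuple) else p) in available]
    return chosen or ["CPUExecutionProvider"]  # ORT always keeps a CPU fallback

# On a machine without CUDA, the GPU entry is silently dropped:
requested = [("CUDAExecutionProvider", {"device_id": 0}), "CPUExecutionProvider"]
print(pick_providers(requested, {"CPUExecutionProvider"}))  # ['CPUExecutionProvider']
```

This is why the same script runs unchanged on GPU and CPU-only machines.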
## Quantization: Make Models 4x Smaller

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic quantization: no calibration data needed
quantize_dynamic(
    "model.onnx",
    "model_quantized.onnx",
    weight_type=QuantType.QInt8,
)
```

Result: roughly 4x smaller weights and often around 2x faster CPU inference, with minimal accuracy loss for most models.
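The size win comes from storing weights as 8-bit integers instead of 32-bit floats. A minimal numpy sketch of the symmetric int8 scheme (illustrative only; ORT's real quantizer adds per-channel scales, zero points, and operator fusion):

```python
import numpy as np

weights = np.random.randn(256, 256).astype(np.float32)

# Symmetric quantization: map [-max|w|, +max|w|] onto [-127, 127]
scale = np.abs(weights).max() / 127.0
q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
dequantized = q.astype(np.float32) * scale  # what inference actually sees

print(f"Size: {weights.nbytes} -> {q.nbytes} bytes")  # 4x smaller
print(f"Max reconstruction error: {np.abs(weights - dequantized).max():.5f}")
```

Each weight is off by at most half a quantization step, which is why accuracy usually barely moves.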
## Run in the Browser (JavaScript)

```javascript
import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("model.onnx");

// Tensor and feed names must match the exported model ("input"/"output" here)
const input = new ort.Tensor("float32", new Float32Array(1 * 3 * 224 * 224), [1, 3, 224, 224]);
const results = await session.run({ input: input });
console.log("Prediction:", results.output.data);
```
## Performance Comparison

| Runtime | ResNet-50 inference (ms) | Memory (MB) |
|---|---|---|
| PyTorch (CPU) | 45 | 180 |
| TensorFlow (CPU) | 42 | 210 |
| ONNX Runtime (CPU) | 18 | 95 |
| ONNX Runtime (GPU) | 3 | 120 |
| ONNX Runtime (quantized, CPU) | 12 | 45 |
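The speedups fall out of the table as simple ratios (using the table's own numbers, which will vary with hardware and batch size):

```python
baseline_ms = 45  # PyTorch (CPU) latency from the table above
timings = [("ONNX Runtime (CPU)", 18), ("ONNX Runtime (GPU)", 3), ("Quantized", 12)]
speedups = {name: baseline_ms / ms for name, ms in timings}
for name, factor in speedups.items():
    print(f"{name}: {factor:.1f}x faster than PyTorch CPU")
```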
## Useful Links
- GitHub Repository
- ONNX Model Zoo — pre-trained models ready to use
- Documentation
- Hugging Face Optimum — easy ONNX export for transformers
Building AI-powered data pipelines? Check out my developer tools on Apify for ready-made web scrapers, or email spinov001@gmail.com for custom solutions.