In MLOps, models rarely fail because of algorithms alone.
Many failures happen because services cannot communicate correctly.
A production ML platform usually includes:
- training pipelines
- feature stores
- model registry
- inference services
- monitoring systems
The challenge is simple: some components must communicate freely, while others must remain isolated for security and reliability.
This is where networking architecture becomes a core MLOps skill.
Why Networking Matters in MLOps
A training job may need:
- access to GPUs
- model registry
- experiment tracking
- feature retrieval
An inference service may need:
- feature lookup
- model loading
- metrics export
If connectivity is poorly designed:
- pipelines fail
- latency increases
- security risks appear
Four Networking Models in MLOps

1. Internal Service Networking
This is the safest default model.
Services communicate privately using internal DNS names instead of IP addresses.
Example:
mlflow server
kubectl run model-api --image=<model-api-image>
Instead of hardcoding addresses, use service names like:
mlflow-server
feature-store
model-registry
Why this matters:
Container IPs change.
Names remain stable.
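This pattern can be sketched in Python: clients read endpoints from environment variables, falling back to stable DNS service names rather than IPs. The variable names and URIs below are illustrative assumptions, not fixed conventions.

```python
import os

# Illustrative defaults: stable DNS service names, never container IPs.
SERVICE_DEFAULTS = {
    "MLFLOW_TRACKING_URI": "http://mlflow-server:5000",
    "FEATURE_STORE_URI": "http://feature-store:6566",
    "MODEL_REGISTRY_URI": "http://model-registry:8080",
}

def service_uri(var: str) -> str:
    """Prefer an explicit environment override, else the DNS-name default."""
    return os.environ.get(var, SERVICE_DEFAULTS[var])
```

A training job would then call something like `mlflow.set_tracking_uri(service_uri("MLFLOW_TRACKING_URI"))` and keep working across container restarts.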
2. Host-Level Connectivity

Some training workloads need direct host access.
Typical case:
GPU-intensive training.
Example:
docker run --network host training-container
Benefits:
- ultra-low latency
- direct GPU access
- full host networking
Tradeoff: lower isolation. Best used carefully.
3. Fully Isolated Execution

Sometimes zero connectivity is required.
Typical use cases:
- secure model validation
- offline data checks
- security testing
Example:
docker run --network none secure-validation
This prevents access to:
- internet
- APIs
- internal databases
Maximum isolation.
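Inside a fully isolated container, outbound connections should simply fail. A small self-check like the following sketch (host and port are arbitrary probe targets) can confirm that before sensitive validation logic runs:

```python
import socket

def has_network_access(host: str = "example.com", port: int = 80,
                       timeout: float = 2.0) -> bool:
    """Return True if an outbound TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# In a --network none container, this should return False.
```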
4. Custom Subnet Segmentation

Production MLOps often separates environments.
Typical split:
- training environment
- inference environment
- monitoring environment
Example:
docker network create --driver bridge --subnet 172.18.0.0/16 mlops-isolated-network
This reduces accidental cross-communication.
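Subnet plans are easy to get wrong. Python's ipaddress module can verify that environment boundaries stay disjoint; the per-environment ranges below are assumed carve-outs of a private bridge network, shown only as a sketch.

```python
import ipaddress

# Hypothetical per-environment carve-outs of a private bridge network.
SUBNETS = {
    "training": ipaddress.ip_network("172.18.1.0/24"),
    "inference": ipaddress.ip_network("172.18.2.0/24"),
    "monitoring": ipaddress.ip_network("172.18.3.0/24"),
}

def disjoint(subnets) -> bool:
    """True if no two subnets overlap (guards against accidental cross-talk)."""
    nets = list(subnets)
    return not any(
        a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:]
    )
```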
Why DNS Beats Static IPs
Bad approach:
mlflow.set_tracking_uri("http://172.17.0.4:5000")
Problem:
Container IPs change.
Better:
mlflow.set_tracking_uri("http://mlflow-server:5000")
This survives restarts.
This is how resilient service discovery works.
Built-In DNS in Modern Platforms
Docker and Kubernetes automatically resolve service names.
That means:
- feature-store
- model-registry
- monitoring-service
all resolve internally without manual IP management.
Production Rule: Separate by Function
A strong production pattern:
training → controlled internal access
inference → custom subnet
monitoring → isolated observability path
This improves:
- reproducibility
- reliability
- security
- scalability
Practical Mental Model
Use:
- internal service network for normal communication
- host network only for performance-critical workloads
- no network for secure isolated tasks
- custom subnets for production boundaries
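This mental model can be encoded as a small lookup that launch scripts consult when starting containers. The profile and network names are assumptions for illustration:

```python
# Map workload profiles to docker --network arguments (names are illustrative).
NETWORK_BY_PROFILE = {
    "service": "mlops-internal",             # normal service-to-service traffic
    "gpu-training": "host",                  # performance-critical, lower isolation
    "secure-validation": "none",             # zero connectivity
    "production": "mlops-isolated-network",  # custom subnet boundary
}

def docker_network_args(profile: str) -> list[str]:
    """Return the docker run flags for a workload profile."""
    try:
        return ["--network", NETWORK_BY_PROFILE[profile]]
    except KeyError:
        raise ValueError(f"unknown workload profile: {profile!r}")
```

Centralizing the choice in one place keeps teams from hand-typing `--network host` where it was never intended.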
Final Takeaway
A model can be accurate and still fail in production if connectivity is weak.
In MLOps, networking is not infrastructure decoration.
It is system design.