Getting Started with Red Hat OpenShift and NVIDIA Networking
Prerequisites
Before deploying NVIDIA networking solutions on OpenShift, ensure the following prerequisites are met:
- A functioning Red Hat OpenShift Container Platform cluster (version 4.10 or later recommended)
- NVIDIA networking hardware (Mellanox ConnectX or BlueField series) installed in worker nodes
- Node Feature Discovery (NFD) operator installed and configured
- SR-IOV Network Operator installed (if using SR-IOV capabilities)
- GPU Operator installed (if using GPUDirect RDMA)
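A quick sanity check of these prerequisites from the CLI can look like the following sketch (the operator names in the grep pattern are examples; adjust them to the operators actually installed in your cluster):
# Confirm the cluster version meets the 4.10+ recommendation
oc get clusterversion
# List installed operators and filter for NFD, SR-IOV, and the GPU Operator
oc get csv -A | grep -Ei 'nfd|node-feature|sriov|gpu-operator'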
Understanding Networking Technologies
Ethernet vs. InfiniBand
When planning your OpenShift deployment with NVIDIA networking hardware, it's crucial to understand the fundamental differences between Ethernet and InfiniBand technologies:
| Feature | Ethernet | InfiniBand |
| --- | --- | --- |
| Design Purpose | General data movement between systems | High reliability, high bandwidth, low latency for supercomputing clusters |
| Latency Handling | Store-and-forward with MAC address transport | Cut-through approach with 16-bit LID for faster forwarding |
| Network Reliability | No scheduling-based flow control, potential for congestion | End-to-end flow control providing lossless networking |
| Network Mode | MAC addresses with ARP protocol | Built-in software-defined networking with subnet manager |
| OpenShift Compatibility | Native support | Requires special configuration and cannot be used for cluster API traffic |
Important Note: OpenShift installation requires Ethernet connectivity for the cluster API traffic. InfiniBand can only be used as a secondary network for application traffic after the cluster is installed.
Recommended Network Architecture
For deployments requiring both OpenShift functionality and high-speed InfiniBand connectivity, a dual-network architecture is recommended:
- Primary Network (Ethernet): Used for OpenShift cluster API traffic, management, and standard application networking
- Secondary Network (InfiniBand or Ethernet with RDMA): Used for high-performance, low-latency application traffic
This architecture can be implemented using:
- Dual single-port NICs (one Ethernet, one InfiniBand)
- Dual-port NICs with one port configured for Ethernet and one for InfiniBand
- Multiple Ethernet NICs with RDMA capabilities
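Whichever option you choose, it helps to confirm which NVIDIA/Mellanox devices (PCI vendor ID 15b3) each worker node actually exposes. A minimal check, assuming lspci is available on the node image:
# Inspect a worker node's PCI devices from a debug pod
oc debug node/<worker-node-name> -- chroot /host lspci -nn | grep -i 15b3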
NVIDIA Network Operator Installation
The NVIDIA Network Operator is the primary tool for deploying and managing NVIDIA networking components in OpenShift. It can be installed using either the OpenShift web console or the command-line interface.
Prerequisites
Before installing the Network Operator, ensure:
- Node Feature Discovery (NFD) is properly configured
- Worker nodes with NVIDIA networking hardware are labeled with
feature.node.kubernetes.io/pci-15b3.present=true
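You can confirm that NFD has applied this label before proceeding:
# List nodes that NFD has labeled as having a Mellanox (15b3) device
oc get nodes -l feature.node.kubernetes.io/pci-15b3.present=true -o name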
Installation Using OpenShift Web Console
- In the OpenShift Container Platform web console, navigate to Operators > OperatorHub
- Search for "NVIDIA Network Operator"
- Select the operator and click Install
- Follow the on-screen instructions to complete the installation
Installation Using OpenShift CLI
1. Create a namespace for the Network Operator:
oc create namespace nvidia-network-operator
2. Determine the current channel version:
oc get packagemanifest nvidia-network-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
3. Create a subscription file (network-operator-sub.yaml):
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  channel: "v25.4" # Replace with the current channel from step 2
  name: nvidia-network-operator
  source: certified-operators
  sourceNamespace: openshift-marketplace
4. Apply the subscription:
oc create -f network-operator-sub.yaml
5. Switch to the nvidia-network-operator project:
oc project nvidia-network-operator
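Note: when installing into a dedicated namespace, OLM also expects an OperatorGroup in that namespace before the subscription can resolve. A minimal sketch (the group name and target-namespace install mode are assumptions; align them with the operator's supported install modes):
cat <<EOF | oc create -f -
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-network-operator   # assumed name
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
  - nvidia-network-operator
EOF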
Verification
Verify the operator deployment with:
oc get pods -n nvidia-network-operator
A successful deployment will show the controller manager pod with a Running status.
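You can also confirm that the operator's ClusterServiceVersion has reached the Succeeded phase:
oc get csv -n nvidia-network-operator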
RDMA Configuration Options
Remote Direct Memory Access (RDMA) allows computers to directly access each other's memory without involving the CPU or operating system, providing high bandwidth and low latency. NVIDIA offers three configuration methods for RDMA in OpenShift:
1. RDMA Shared Device
The RDMA Shared Device configuration allows multiple pods on a worker node to share the same RDMA device. This method is suitable for development environments or applications where maximum performance is not critical.
Configuration Example:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3
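The ifNames values in the config above (ibs2f0 and ens8f0np0) are node-specific interface names, not fixed identifiers. One way to discover the names on your workers before filling in the config:
# List network interfaces on a worker node
oc debug node/<worker-node-name> -- chroot /host ls /sys/class/net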
Use Cases:
- Development and testing environments
- Applications where multiple pods need RDMA functionality but not maximum performance
- Environments with limited hardware resources
2. RDMA SR-IOV Legacy Device
The SR-IOV (Single Root I/O Virtualization) configuration segments a network device at the hardware layer, creating multiple virtual functions (VFs) that can be assigned to different pods. This provides better isolation and performance compared to the shared device method.
Configuration Example:
# NicClusterPolicy for OFED driver
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
---
# SR-IOV Network Node Policy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
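After the policy is applied, the SR-IOV Network Operator reconfigures the selected nodes (which may reboot them). One way to watch for completion is the node state objects; the syncStatus field is expected to report Succeeded once the VFs have been created:
oc get sriovnetworknodestates -n openshift-sriov-network-operator
oc get sriovnetworknodestate <node-name> -n openshift-sriov-network-operator -o jsonpath='{.status.syncStatus}'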
Use Cases:
- Production environments requiring high performance
- Workloads sensitive to latency and bandwidth
- Applications requiring isolation between network resources
3. RDMA Host Device
The Host Device configuration passes the entire physical network device from the host to a pod. This provides maximum performance but limits the device to a single pod at a time.
Configuration Example:
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }
Use Cases:
- Workloads requiring maximum performance
- Systems where SR-IOV is not supported
- Applications needing features only available in the physical function driver
Deployment Examples
Example 1: Network Operator with Host Device Network
This example demonstrates deploying the Network Operator with the SR-IOV device plugin exposing a single host-device resource pool, plus a HostDeviceNetwork and a test pod:
# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.9.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }
---
# HostDeviceNetwork
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ],
      "log_file": "/var/log/whereabouts.log",
      "log_level": "info"
    }
---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  restartPolicy: OnFailure
  containers:
  - image: <rdma image>
    name: doca-test-ctr
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      requests:
        nvidia.com/hostdev: 1
      limits:
        nvidia.com/hostdev: 1
    command:
    - sh
    - -c
    - sleep inf
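Once the pod is running, a couple of quick checks can confirm the attachment (the ibv_devices call assumes the test image ships the rdma-core utilities):
# Show the secondary-network attachment recorded by Multus
oc get pod hostdev-test-pod -o jsonpath='{.metadata.annotations.k8s\.v1\.cni\.cncf\.io/network-status}'
# List the RDMA devices visible inside the pod
oc exec -it hostdev-test-pod -- ibv_devices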
Example 2: Network Operator with SR-IOV Legacy Mode
This example demonstrates deploying the Network Operator with SR-IOV in legacy mode:
# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
---
# SriovNetworkNodePolicy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens2f0"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy
---
# SriovNetwork
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }
---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: appcntr1
    image: <image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command:
    - sh
    - -c
    - sleep inf
    resources:
      requests:
        openshift.io/sriovlegacy: '1'
      limits:
        openshift.io/sriovlegacy: '1'
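The SriovNetwork resource renders a NetworkAttachmentDefinition in the target namespace. A short check that the definition exists and that the pod received an address on the secondary interface (the ip command assumes iproute2 is present in the image):
oc get network-attachment-definitions -n default
oc exec -it testpod1 -- ip addr show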
Example 3: Network Operator with RDMA Shared Device
This example demonstrates deploying the Network Operator with RDMA Shared Device and MacVlan Network:
# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdmashared",
            "rdmaHcaMax": 1000,
            "selectors": {
              "ifNames": ["enp4s0f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3
---
# MacvlanNetwork
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: enp4s0f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type":"whereabouts","range":"16.0.2.0/24","gateway":"16.0.2.1"}'
---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: test-rdma-shared-1
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  containers:
  - image: myimage
    name: rdma-shared-1
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
    resources:
      limits:
        rdma/rdmashared: 1
      requests:
        rdma/rdmashared: 1
  restartPolicy: OnFailure
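To exercise the RDMA path end to end, one common approach is a perftest bandwidth run between two such pods. The sketch below assumes a second pod named test-rdma-shared-2 attached to the same network, an image that includes the perftest tools, and an RDMA device named mlx5_0; replace the address with the server pod's IP on rdmashared-net:
# Terminal 1: start the ib_write_bw server in the first pod
oc exec -it test-rdma-shared-1 -- ib_write_bw -d mlx5_0 -F
# Terminal 2: run the client from the second pod against the server pod's IP
oc exec -it test-rdma-shared-2 -- ib_write_bw -d mlx5_0 -F <server-pod-ip>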
Best Practices
Network Design Considerations
- Separate Control and Data Planes: Use dedicated control plane nodes for OpenShift deployments with the NVIDIA Network Operator
- Dual Network Architecture: Use Ethernet for cluster API traffic and InfiniBand (or RDMA-capable Ethernet) for high-performance application traffic
- Resource Planning: Plan the number of VFs and the resource allocation carefully, based on workload requirements
- NUMA Alignment: Ensure NUMA alignment between GPUs, NICs, and CPU cores for optimal performance (see the check after this list)
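A quick way to check NUMA placement from the host uses standard sysfs attributes (the interface name and PCI address below are placeholders for your own NIC and GPU):
# NUMA node of a NIC port
oc debug node/<worker-node-name> -- chroot /host cat /sys/class/net/ens8f0np0/device/numa_node
# NUMA node of a GPU or other PCI device
oc debug node/<worker-node-name> -- chroot /host cat /sys/bus/pci/devices/0000:17:00.0/numa_node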
Performance Optimization
- RDMA Configuration Selection:
  - Use RDMA Shared Device for development or when maximum performance is not critical
  - Use SR-IOV Legacy for production workloads requiring high performance with multiple pods
  - Use Host Device when maximum performance is required for a single pod
- MTU Configuration: Configure jumbo frames (MTU 9000) for improved throughput when supported by the network infrastructure (see the patch sketch after this list)
- Hugepages Configuration: For DPDK workloads, configure hugepages following the OpenShift documentation
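For the MTU change, one option is to patch the SriovNetworkNodePolicy from the earlier example (assuming the policy name sriov-legacy-policy and that the switch fabric already carries jumbo frames):
oc patch sriovnetworknodepolicy sriov-legacy-policy -n openshift-sriov-network-operator --type merge -p '{"spec":{"mtu":9000}}'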
Troubleshooting
- Verify Node Feature Discovery:
oc describe node | grep -E 'Roles|pci' | grep -v "control-plane"
Ensure nodes with NVIDIA hardware show the label feature.node.kubernetes.io/pci-15b3.present=true
- Check Network Operator Status:
oc get pods -n nvidia-network-operator
- Verify OFED Driver Installation:
oc get pods -n nvidia-network-operator | grep ofed
oc logs <ofed-driver-pod-name> -n nvidia-network-operator
- Check SR-IOV Resources:
oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml
- Verify Available Resources:
oc get node <node-name> -o json | jq '.status.allocatable'
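- Check NicClusterPolicy Status:
oc get nicclusterpolicy nic-cluster-policy -o yaml
The status section is expected to report a ready state once all requested components (OFED driver and device plugins) have been deployed.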