[Part02] Getting Started with Red Hat OpenShift with NVIDIA

Prerequisites

Before deploying NVIDIA networking solutions on OpenShift, ensure the following prerequisites are met:

  1. A functioning Red Hat OpenShift Container Platform cluster (version 4.10 or later recommended)
  2. NVIDIA networking hardware (Mellanox ConnectX or BlueField series) installed in worker nodes
  3. Node Feature Discovery (NFD) operator installed and configured
  4. SR-IOV Network Operator installed (if using SR-IOV capabilities)
  5. GPU Operator installed (if using GPUDirect RDMA)
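
A quick way to sanity-check most of these prerequisites from the CLI (a hedged sketch; the namespaces assume the defaults suggested when each operator is installed from OperatorHub):

   oc get clusterversion                              # cluster version
   oc get pods -n openshift-nfd                       # Node Feature Discovery
   oc get pods -n openshift-sriov-network-operator    # SR-IOV Network Operator (if used)
   oc get pods -n nvidia-gpu-operator                 # GPU Operator (if using GPUDirect RDMA)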

Understanding Networking Technologies

Ethernet vs. InfiniBand

When planning your OpenShift deployment with NVIDIA networking hardware, it's crucial to understand the fundamental differences between Ethernet and InfiniBand technologies:

| Feature | Ethernet | InfiniBand |
| --- | --- | --- |
| Design Purpose | General data movement between systems | High reliability, high bandwidth, low latency for supercomputing clusters |
| Latency Handling | Store-and-forward with MAC address transport | Cut-through approach with 16-bit LID for faster forwarding |
| Network Reliability | No scheduling-based flow control, potential for congestion | End-to-end flow control providing lossless networking |
| Network Mode | MAC addresses with ARP protocol | Built-in software-defined networking with subnet manager |
| OpenShift Compatibility | Native support | Requires special configuration and cannot be used for cluster API traffic |

Important Note: OpenShift installation requires Ethernet connectivity for the cluster API traffic. InfiniBand can only be used as a secondary network for application traffic after the cluster is installed.

Recommended Network Architecture

For deployments requiring both OpenShift functionality and high-speed InfiniBand connectivity, a dual-network architecture is recommended:

  1. Primary Network (Ethernet): Used for OpenShift cluster API traffic, management, and standard application networking
  2. Secondary Network (InfiniBand or Ethernet with RDMA): Used for high-performance, low-latency application traffic

This architecture can be implemented using:

  • Dual single-port NICs (one Ethernet, one InfiniBand)
  • Dual-port NICs with one port configured for Ethernet and one for InfiniBand
  • Multiple Ethernet NICs with RDMA capabilities
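
To confirm which mode each ConnectX port is currently running, the link layer can be read from sysfs on the worker node (a hedged sketch via oc debug; it assumes the DOCA/OFED or inbox driver is loaded so the devices appear under /sys/class/infiniband):

oc debug node/<node-name>
# inside the debug shell:
chroot /host
for p in /sys/class/infiniband/*/ports/*; do
  echo "$p: $(cat "$p"/link_layer)"   # prints Ethernet or InfiniBand per port
done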

NVIDIA Network Operator Installation

The NVIDIA Network Operator is the primary tool for deploying and managing NVIDIA networking components in OpenShift. It can be installed using either the OpenShift web console or the command-line interface.

Prerequisites

Before installing the Network Operator, ensure:

  1. Node Feature Discovery (NFD) is properly configured
  2. Worker nodes with NVIDIA networking hardware are labeled with feature.node.kubernetes.io/pci-15b3.present=true
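
Nodes carrying NVIDIA (vendor ID 15b3) devices and the expected label can be listed with:

   oc get nodes -l feature.node.kubernetes.io/pci-15b3.present=true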

Installation Using OpenShift Web Console

  1. In the OpenShift Container Platform web console, navigate to Operators > OperatorHub
  2. Search for "NVIDIA Network Operator"
  3. Select the operator and click Install
  4. Follow the on-screen instructions to complete the installation

Installation Using OpenShift CLI

  1. Create a namespace for the Network Operator:
   oc create namespace nvidia-network-operator
  2. Determine the current channel version:
   oc get packagemanifest nvidia-network-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
  3. Create a subscription file (network-operator-sub.yaml):
   apiVersion: operators.coreos.com/v1alpha1
   kind: Subscription
   metadata:
     name: nvidia-network-operator
     namespace: nvidia-network-operator
   spec:
     channel: "v25.4"  # Replace with the current channel from step 2
     name: nvidia-network-operator
     source: certified-operators
     sourceNamespace: openshift-marketplace
  4. Apply the subscription:
   oc create -f network-operator-sub.yaml
  5. Switch to the network-operator project:
   oc project nvidia-network-operator
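
Note: Operator Lifecycle Manager also needs an OperatorGroup in the nvidia-network-operator namespace for the subscription to resolve. If one does not already exist, a minimal sketch looks like this (apply it with oc create -f before or alongside the subscription in step 4):

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
  - nvidia-network-operator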

Verification

Verify the operator deployment with:

oc get pods -n nvidia-network-operator

A successful deployment will show the controller manager pod with a Running status.
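
It can also help to confirm that the operator's ClusterServiceVersion (CSV) has reached the Succeeded phase:

oc get csv -n nvidia-network-operator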

RDMA Configuration Options

Remote Direct Memory Access (RDMA) lets one host read from and write to another host's memory directly, bypassing the remote CPU and operating system, which provides high bandwidth and low latency. NVIDIA offers three configuration methods for RDMA in OpenShift:

1. RDMA Shared Device

The RDMA Shared Device configuration allows multiple pods on a worker node to share the same RDMA device. This method is suitable for development environments or applications where maximum performance is not critical.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3

Use Cases:

  • Development and testing environments
  • Applications where multiple pods need RDMA functionality but not maximum performance
  • Environments with limited hardware resources
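
Once this NicClusterPolicy is applied and the device plugin pods are running, the shared RDMA resources should appear in each selected node's allocatable resources (a hedged check; the names combine the plugin's default rdma/ prefix with the resourceName values from the config above):

oc describe node <node-name> | grep rdma/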

2. RDMA SR-IOV Legacy Device

The SR-IOV (Single Root I/O Virtualization) configuration segments a network device at the hardware layer, creating multiple virtual functions (VFs) that can be assigned to different pods. This provides better isolation and performance compared to the shared device method.

Configuration Example:

# NicClusterPolicy for OFED driver
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30

---
# SR-IOV Network Node Policy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy

Use Cases:

  • Production environments requiring high performance
  • Workloads sensitive to latency and bandwidth
  • Applications requiring isolation between network resources
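
Before scheduling workloads against this policy, it helps to confirm that the SR-IOV operator has finished configuring the VFs and that the resulting resource is advertised (a hedged check; syncStatus is reported per node, and openshift.io/sriovlegacy follows from the resourceName above):

oc get sriovnetworknodestates -n openshift-sriov-network-operator -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.syncStatus}{"\n"}{end}'
oc describe node <node-name> | grep openshift.io/sriovlegacy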

3. RDMA Host Device

The Host Device configuration passes the entire physical network device from the host to a pod. This provides maximum performance but limits the device to a single pod at a time.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }

Use Cases:

  • Workloads requiring maximum performance
  • Systems where SR-IOV is not supported
  • Applications needing features only available in the physical function driver

Deployment Examples

Example 1: Network Operator with Host Device Network

This example demonstrates deploying the Network Operator with the SR-IOV network device plugin exposing the NVIDIA devices as a single resource pool (nvidia.com/hostdev), along with a HostDeviceNetwork and a test pod:

# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.9.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }

---
# HostDeviceNetwork
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ],
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info"
    }

---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  restartPolicy: OnFailure
  containers:
  - image: <rdma image>
    name: doca-test-ctr
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      requests:
        nvidia.com/hostdev: 1
      limits:
        nvidia.com/hostdev: 1
    command:
    - sh
    - -c
    - sleep inf
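
Once hostdev-test-pod is Running, a quick way to confirm the RDMA device was actually handed to the container is to query it from inside the pod (a hedged check; it assumes the chosen RDMA image ships the standard rdma-core utilities):

oc exec -it hostdev-test-pod -- ibv_devinfo
oc exec -it hostdev-test-pod -- ip addr show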

Example 2: Network Operator with SR-IOV Legacy Mode

This example demonstrates deploying the Network Operator with SR-IOV in legacy mode:

# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30

---
# SriovNetworkNodePolicy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens2f0"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy

---
# SriovNetwork
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }

---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: appcntr1
    image: <image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command:
    - sh
    - -c
    - sleep inf
    resources:
      requests:
        openshift.io/sriovlegacy: '1'
      limits:
        openshift.io/sriovlegacy: '1'
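
To exercise a VF end to end, a common pattern is a point-to-point bandwidth test between two such pods with the perftest suite (a hedged sketch; it assumes ib_write_bw is present in the image, that a second pod testpod2 was created from the same spec, and that mlx5_0 is the RDMA device seen inside the pods):

# server side, in testpod1:
oc exec -it testpod1 -- ib_write_bw -d mlx5_0

# client side, in testpod2, pointing at testpod1's secondary-network IP:
oc exec -it testpod2 -- ib_write_bw -d mlx5_0 <testpod1-secondary-ip>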

Example 3: Network Operator with RDMA Shared Device

This example demonstrates deploying the Network Operator with RDMA Shared Device and MacVlan Network:

# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdmashared",
            "rdmaHcaMax": 1000,
            "selectors": {
              "ifNames": ["enp4s0f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3

---
# MacvlanNetwork
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: enp4s0f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type":"whereabouts","range":"16.0.2.0/24","gateway":"16.0.2.1"}'

---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: test-rdma-shared-1
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  containers:
  - image: myimage
    name: rdma-shared-1
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
    resources:
      limits:
        rdma/rdmashared: 1
      requests:
        rdma/rdmashared: 1
  restartPolicy: OnFailure
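
Once the pod is Running, the macvlan interface attached through the rdmashared-net annotation can be inspected from inside the container (a hedged check; the secondary interface added by Multus is typically named net1):

oc exec test-rdma-shared-1 -- ip addr show net1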

Best Practices

Network Design Considerations

  1. Separate Control and Data Planes: Use dedicated control plane nodes for OpenShift deployments that run the NVIDIA Network Operator
  2. Dual Network Architecture: Use Ethernet for cluster API traffic and InfiniBand for high-performance application traffic
  3. Resource Planning: Carefully plan the number of VFs and resource allocation based on workload requirements
  4. NUMA Alignment: Ensure NUMA alignment between GPUs, NICs, and CPU cores for optimal performance
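
For the NUMA alignment point, one way to inspect locality is directly on a worker node via oc debug (a hedged sketch; the interface name is an example, and where the NVIDIA driver utilities are available, nvidia-smi topo -m additionally prints the GPU/NIC affinity matrix):

oc debug node/<node-name>
# inside the debug shell:
chroot /host
lscpu | grep NUMA                               # NUMA nodes and their CPU ranges
cat /sys/class/net/ens8f0np0/device/numa_node   # NUMA node the NIC is attached to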

Performance Optimization

  1. RDMA Configuration Selection:

    • Use RDMA Shared Device for development or when maximum performance is not critical
    • Use SR-IOV Legacy for production workloads requiring high performance with multiple pods
    • Use Host Device when maximum performance is required for a single pod
  2. MTU Configuration: Configure jumbo frames (MTU 9000) for improved throughput when supported by the network infrastructure

  3. HugePages Configuration: For DPDK workloads, configure HugePages following the OpenShift documentation (see the sketch below)
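
For the HugePages point, the pod side looks roughly like the sketch below once HugePages have been provisioned on the nodes (the pod name, 1Gi page size, and quantities are illustrative placeholders; Kubernetes requires HugePages limits to equal requests, and DPDK workloads typically also mount them as an emptyDir with medium: HugePages):

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: app
    image: <image>
    command: ["sh", "-c", "sleep inf"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepages
    resources:
      requests:
        memory: "1Gi"
        hugepages-1Gi: "2Gi"
      limits:
        memory: "1Gi"
        hugepages-1Gi: "2Gi"
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages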

Troubleshooting

  1. Verify Node Feature Discovery:
   oc describe node | grep -E 'Roles|pci' | grep -v "control-plane"

Ensure nodes with NVIDIA hardware show the label feature.node.kubernetes.io/pci-15b3.present=true.

  2. Check Network Operator Status:
   oc get pods -n nvidia-network-operator
  3. Verify OFED Driver Installation:
   oc get pods -n nvidia-network-operator | grep ofed
   oc logs <ofed-driver-pod-name> -n nvidia-network-operator
  4. Check SR-IOV Resources:
   oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml
  5. Verify Available Resources:
   oc get node <node-name> -o json | jq '.status.allocatable'
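  6. Check Secondary Network Attachments: each HostDeviceNetwork, SriovNetwork, or MacvlanNetwork custom resource should produce a corresponding NetworkAttachmentDefinition in its target namespace:
   oc get network-attachment-definitions -A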

References

  1. NVIDIA Network Operator Documentation
  2. Red Hat OpenShift Documentation
  3. NVIDIA DOCA Documentation
