[Part02] Getting Started with Red Hat OpenShift with NVIDIA

Prerequisites

Before deploying NVIDIA networking solutions on OpenShift, ensure the following prerequisites are met:

  1. A functioning Red Hat OpenShift Container Platform cluster (version 4.10 or later recommended)
  2. NVIDIA networking hardware (Mellanox ConnectX or BlueField series) installed in worker nodes
  3. Node Feature Discovery (NFD) operator installed and configured
  4. SR-IOV Network Operator installed (if using SR-IOV capabilities)
  5. GPU Operator installed (if using GPUDirect RDMA)
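
A quick way to sanity-check most of these prerequisites from the CLI (a hedged sketch; the namespaces assume the defaults suggested when each operator is installed from OperatorHub):

   oc get clusterversion                              # cluster version
   oc get pods -n openshift-nfd                       # Node Feature Discovery
   oc get pods -n openshift-sriov-network-operator    # SR-IOV Network Operator (if used)
   oc get pods -n nvidia-gpu-operator                 # GPU Operator (if using GPUDirect RDMA)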

Understanding Networking Technologies

Ethernet vs. InfiniBand

When planning your OpenShift deployment with NVIDIA networking hardware, it's crucial to understand the fundamental differences between Ethernet and InfiniBand technologies:

| Feature | Ethernet | InfiniBand |
| --- | --- | --- |
| Design Purpose | General data movement between systems | High reliability, high bandwidth, low latency for supercomputing clusters |
| Latency Handling | Store-and-forward with MAC address transport | Cut-through approach with 16-bit LID for faster forwarding |
| Network Reliability | No scheduling-based flow control, potential for congestion | End-to-end flow control providing lossless networking |
| Network Mode | MAC addresses with ARP protocol | Built-in software-defined networking with subnet manager |
| OpenShift Compatibility | Native support | Requires special configuration and cannot be used for cluster API traffic |

Important Note: OpenShift installation requires Ethernet connectivity for the cluster API traffic. InfiniBand can only be used as a secondary network for application traffic after the cluster is installed.

Recommended Network Architecture

For deployments requiring both OpenShift functionality and high-speed InfiniBand connectivity, a dual-network architecture is recommended:

  1. Primary Network (Ethernet): Used for OpenShift cluster API traffic, management, and standard application networking
  2. Secondary Network (InfiniBand or Ethernet with RDMA): Used for high-performance, low-latency application traffic

This architecture can be implemented using:

  • Dual single-port NICs (one Ethernet, one InfiniBand)
  • Dual-port NICs with one port configured for Ethernet and one for InfiniBand
  • Multiple Ethernet NICs with RDMA capabilities
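
To confirm which mode each ConnectX port is currently running, the link layer can be read from sysfs on the worker node (a hedged sketch via oc debug; it assumes the DOCA/OFED or inbox driver is loaded so the devices appear under /sys/class/infiniband):

oc debug node/<node-name>
# inside the debug shell:
chroot /host
for p in /sys/class/infiniband/*/ports/*; do
  echo "$p: $(cat "$p"/link_layer)"   # prints Ethernet or InfiniBand per port
done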

NVIDIA Network Operator Installation

The NVIDIA Network Operator is the primary tool for deploying and managing NVIDIA networking components in OpenShift. It can be installed using either the OpenShift web console or the command-line interface.

Prerequisites

Before installing the Network Operator, ensure:

  1. Node Feature Discovery (NFD) is properly configured
  2. Worker nodes with NVIDIA networking hardware are labeled with feature.node.kubernetes.io/pci-15b3.present=true
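
Nodes carrying NVIDIA (vendor ID 15b3) devices and the expected label can be listed with:

   oc get nodes -l feature.node.kubernetes.io/pci-15b3.present=true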

Installation Using OpenShift Web Console

  1. In the OpenShift Container Platform web console, navigate to Operators > OperatorHub
  2. Search for "NVIDIA Network Operator"
  3. Select the operator and click Install
  4. Follow the on-screen instructions to complete the installation

Installation Using OpenShift CLI

  1. Create a namespace for the Network Operator:
   oc create namespace nvidia-network-operator
  2. Determine the current channel version:
   oc get packagemanifest nvidia-network-operator -n openshift-marketplace -o jsonpath='{.status.defaultChannel}'
  3. Create a subscription file (network-operator-sub.yaml):
   apiVersion: operators.coreos.com/v1alpha1
   kind: Subscription
   metadata:
     name: nvidia-network-operator
     namespace: nvidia-network-operator
   spec:
     channel: "v25.4"  # Replace with the current channel from step 2
     name: nvidia-network-operator
     source: certified-operators
     sourceNamespace: openshift-marketplace
  4. Apply the subscription:
   oc create -f network-operator-sub.yaml
  5. Switch to the network-operator project:
   oc project nvidia-network-operator
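
Note: Operator Lifecycle Manager also needs an OperatorGroup in the nvidia-network-operator namespace for the subscription to resolve. If one does not already exist, a minimal sketch looks like this (apply it with oc create -f before or alongside the subscription in step 4):

apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: nvidia-network-operator
  namespace: nvidia-network-operator
spec:
  targetNamespaces:
  - nvidia-network-operator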

Verification

Verify the operator deployment with:

oc get pods -n nvidia-network-operator

A successful deployment will show the controller manager pod with a Running status.
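
It can also help to confirm that the operator's ClusterServiceVersion (CSV) has reached the Succeeded phase:

oc get csv -n nvidia-network-operator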

RDMA Configuration Options

Remote Direct Memory Access (RDMA) lets one host read from and write to another host's memory directly, bypassing the remote CPU and operating system, which provides high bandwidth and low latency. NVIDIA offers three configuration methods for RDMA in OpenShift:

1. RDMA Shared Device

The RDMA Shared Device configuration allows multiple pods on a worker node to share the same RDMA device. This method is suitable for development environments or applications where maximum performance is not critical.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdma_shared_device_ib",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ibs2f0"]
            }
          },
          {
            "resourceName": "rdma_shared_device_eth",
            "rdmaHcaMax": 63,
            "selectors": {
              "ifNames": ["ens8f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3

Use Cases:

  • Development and testing environments
  • Applications where multiple pods need RDMA functionality but not maximum performance
  • Environments with limited hardware resources
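
Once this NicClusterPolicy is applied and the device plugin pods are running, the shared RDMA resources should appear in each selected node's allocatable resources (a hedged check; the names combine the plugin's default rdma/ prefix with the resourceName values from the config above):

oc describe node <node-name> | grep rdma/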

2. RDMA SR-IOV Legacy Device

The SR-IOV (Single Root I/O Virtualization) configuration segments a network device at the hardware layer, creating multiple virtual functions (VFs) that can be assigned to different pods. This provides better isolation and performance compared to the shared device method.

Configuration Example:

# NicClusterPolicy for OFED driver
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30

---
# SR-IOV Network Node Policy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: sriov-legacy-policy
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens8f0np0#0-7"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy

Use Cases:

  • Production environments requiring high performance
  • Workloads sensitive to latency and bandwidth
  • Applications requiring isolation between network resources
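
Before scheduling workloads against this policy, it helps to confirm that the SR-IOV operator has finished configuring the VFs and that the resulting resource is advertised (a hedged check; syncStatus is reported per node, and openshift.io/sriovlegacy follows from the resourceName above):

oc get sriovnetworknodestates -n openshift-sriov-network-operator -o jsonpath='{range .items[*]}{.metadata.name}{": "}{.status.syncStatus}{"\n"}{end}'
oc describe node <node-name> | grep openshift.io/sriovlegacy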

3. RDMA Host Device

The Host Device configuration passes the entire physical network device from the host to a pod. This provides maximum performance but limits the device to a single pod at a time.

Configuration Example:

apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.7.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }

Use Cases:

  • Workloads requiring maximum performance
  • Systems where SR-IOV is not supported
  • Applications needing features only available in the physical function driver

Deployment Examples

Example 1: Network Operator with Host Device Network

This example demonstrates deploying the Network Operator with the SR-IOV network device plugin exposing the NVIDIA devices as a single resource pool (nvidia.com/hostdev), along with a HostDeviceNetwork and a test pod:

# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  sriovDevicePlugin:
    image: sriov-network-device-plugin
    repository: ghcr.io/k8snetworkplumbingwg
    version: v3.9.0
    config: |
      {
        "resourceList": [
          {
            "resourcePrefix": "nvidia.com",
            "resourceName": "hostdev",
            "selectors": {
              "vendors": ["15b3"],
              "isRdma": true
            }
          }
        ]
      }

---
# HostDeviceNetwork
apiVersion: mellanox.com/v1alpha1
kind: HostDeviceNetwork
metadata:
  name: hostdev-net
spec:
  networkNamespace: "default"
  resourceName: "hostdev"
  ipam: |
    {
      "type": "whereabouts",
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "range": "192.168.3.225/28",
      "exclude": [
        "192.168.3.229/30",
        "192.168.3.236/32"
      ],
      "log_file" : "/var/log/whereabouts.log",
      "log_level" : "info"
    }

---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: hostdev-test-pod
  annotations:
    k8s.v1.cni.cncf.io/networks: hostdev-net
spec:
  restartPolicy: OnFailure
  containers:
  - image: <rdma image>
    name: doca-test-ctr
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    resources:
      requests:
        nvidia.com/hostdev: 1
      limits:
        nvidia.com/hostdev: 1
    command:
    - sh
    - -c
    - sleep inf
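
Once hostdev-test-pod is Running, a quick way to confirm the RDMA device was actually handed to the container is to query it from inside the pod (a hedged check; it assumes the chosen RDMA image ships the standard rdma-core utilities):

oc exec -it hostdev-test-pod -- ibv_devinfo
oc exec -it hostdev-test-pod -- ip addr show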

Example 2: Network Operator with SR-IOV Legacy Mode

This example demonstrates deploying the Network Operator with SR-IOV in legacy mode:

# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30

---
# SriovNetworkNodePolicy
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: policy-1
  namespace: openshift-sriov-network-operator
spec:
  deviceType: netdevice
  mtu: 1500
  nicSelector:
    vendor: "15b3"
    pfNames: ["ens2f0"]
  nodeSelector:
    feature.node.kubernetes.io/pci-15b3.present: "true"
  numVfs: 8
  priority: 90
  isRdma: true
  resourceName: sriovlegacy

---
# SriovNetwork
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: sriov-network
  namespace: openshift-sriov-network-operator
spec:
  vlan: 0
  networkNamespace: "default"
  resourceName: "sriovlegacy"
  ipam: |
    {
      "datastore": "kubernetes",
      "kubernetes": {
        "kubeconfig": "/etc/cni/net.d/whereabouts.d/whereabouts.kubeconfig"
      },
      "log_file": "/tmp/whereabouts.log",
      "log_level": "debug",
      "type": "whereabouts",
      "range": "192.168.101.0/24"
    }

---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: testpod1
  annotations:
    k8s.v1.cni.cncf.io/networks: sriov-network
spec:
  containers:
  - name: appcntr1
    image: <image>
    imagePullPolicy: IfNotPresent
    securityContext:
      capabilities:
        add: ["IPC_LOCK"]
    command:
    - sh
    - -c
    - sleep inf
    resources:
      requests:
        openshift.io/sriovlegacy: '1'
      limits:
        openshift.io/sriovlegacy: '1'
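
To exercise a VF end to end, a common pattern is a point-to-point bandwidth test between two such pods with the perftest suite (a hedged sketch; it assumes ib_write_bw is present in the image, that a second pod testpod2 was created from the same spec, and that mlx5_0 is the RDMA device seen inside the pods):

# server side, in testpod1:
oc exec -it testpod1 -- ib_write_bw -d mlx5_0

# client side, in testpod2, pointing at testpod1's secondary-network IP:
oc exec -it testpod2 -- ib_write_bw -d mlx5_0 <testpod1-secondary-ip>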

Example 3: Network Operator with RDMA Shared Device

This example demonstrates deploying the Network Operator with RDMA Shared Device and MacVlan Network:

# NicClusterPolicy
apiVersion: mellanox.com/v1alpha1
kind: NicClusterPolicy
metadata:
  name: nic-cluster-policy
spec:
  ofedDriver:
    image: doca-driver
    repository: nvcr.io/nvidia/mellanox
    version: 25.04-0.6.1.0-2
    startupProbe:
      initialDelaySeconds: 10
      periodSeconds: 20
    livenessProbe:
      initialDelaySeconds: 30
      periodSeconds: 30
    readinessProbe:
      initialDelaySeconds: 10
      periodSeconds: 30
  rdmaSharedDevicePlugin:
    config: |
      {
        "configList": [
          {
            "resourceName": "rdmashared",
            "rdmaHcaMax": 1000,
            "selectors": {
              "ifNames": ["enp4s0f0np0"]
            }
          }
        ]
      }
    image: k8s-rdma-shared-dev-plugin
    repository: nvcr.io/nvidia/cloud-native
    version: v1.5.3

---
# MacvlanNetwork
apiVersion: mellanox.com/v1alpha1
kind: MacvlanNetwork
metadata:
  name: rdmashared-net
spec:
  networkNamespace: default
  master: enp4s0f0np0
  mode: bridge
  mtu: 1500
  ipam: '{"type":"whereabouts","range":"16.0.2.0/24","gateway":"16.0.2.1"}'

---
# Example Pod
apiVersion: v1
kind: Pod
metadata:
  name: test-rdma-shared-1
  annotations:
    k8s.v1.cni.cncf.io/networks: rdmashared-net
spec:
  containers:
  - image: myimage
    name: rdma-shared-1
    securityContext:
      capabilities:
        add:
        - IPC_LOCK
    resources:
      limits:
        rdma/rdmashared: 1
      requests:
        rdma/rdmashared: 1
  restartPolicy: OnFailure
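
Once the pod is Running, the macvlan interface attached through the rdmashared-net annotation can be inspected from inside the container (a hedged check; the secondary interface added by Multus is typically named net1):

oc exec test-rdma-shared-1 -- ip addr show net1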

Best Practices

Network Design Considerations

  1. Separate Control and Data Planes: Use dedicated control plane nodes for OpenShift deployments that run the NVIDIA Network Operator
  2. Dual Network Architecture: Use Ethernet for cluster API traffic and InfiniBand for high-performance application traffic
  3. Resource Planning: Carefully plan the number of VFs and resource allocation based on workload requirements
  4. NUMA Alignment: Ensure NUMA alignment between GPUs, NICs, and CPU cores for optimal performance
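
For the NUMA alignment point, one way to inspect locality is directly on a worker node via oc debug (a hedged sketch; the interface name is an example, and where the NVIDIA driver utilities are available, nvidia-smi topo -m additionally prints the GPU/NIC affinity matrix):

oc debug node/<node-name>
# inside the debug shell:
chroot /host
lscpu | grep NUMA                               # NUMA nodes and their CPU ranges
cat /sys/class/net/ens8f0np0/device/numa_node   # NUMA node the NIC is attached to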

Performance Optimization

  1. RDMA Configuration Selection:

    • Use RDMA Shared Device for development or when maximum performance is not critical
    • Use SR-IOV Legacy for production workloads requiring high performance with multiple pods
    • Use Host Device when maximum performance is required for a single pod
  2. MTU Configuration: Configure jumbo frames (MTU 9000) for improved throughput when supported by the network infrastructure

  3. HugePages Configuration: For DPDK workloads, configure HugePages following the OpenShift documentation (see the sketch below)
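
For the HugePages point, the pod side looks roughly like the sketch below once HugePages have been provisioned on the nodes (the pod name, 1Gi page size, and quantities are illustrative placeholders; Kubernetes requires HugePages limits to equal requests, and DPDK workloads typically also mount them as an emptyDir with medium: HugePages):

apiVersion: v1
kind: Pod
metadata:
  name: hugepages-test-pod
spec:
  restartPolicy: OnFailure
  containers:
  - name: app
    image: <image>
    command: ["sh", "-c", "sleep inf"]
    volumeMounts:
    - mountPath: /dev/hugepages
      name: hugepages
    resources:
      requests:
        memory: "1Gi"
        hugepages-1Gi: "2Gi"
      limits:
        memory: "1Gi"
        hugepages-1Gi: "2Gi"
  volumes:
  - name: hugepages
    emptyDir:
      medium: HugePages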

Troubleshooting

  1. Verify Node Feature Discovery:
   oc describe node | grep -E 'Roles|pci' | grep -v "control-plane"

Ensure nodes with NVIDIA hardware show the label feature.node.kubernetes.io/pci-15b3.present=true.

  2. Check Network Operator Status:
   oc get pods -n nvidia-network-operator
  3. Verify OFED Driver Installation:
   oc get pods -n nvidia-network-operator | grep ofed
   oc logs <ofed-driver-pod-name> -n nvidia-network-operator
  4. Check SR-IOV Resources:
   oc get sriovnetworknodestates -n openshift-sriov-network-operator -o yaml
  5. Verify Available Resources:
   oc get node <node-name> -o json | jq '.status.allocatable'
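  6. Check Secondary Network Attachments: each HostDeviceNetwork, SriovNetwork, or MacvlanNetwork custom resource should produce a corresponding NetworkAttachmentDefinition in its target namespace:
   oc get network-attachment-definitions -A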

References

  1. NVIDIA Network Operator Documentation
  2. Red Hat OpenShift Documentation
  3. NVIDIA DOCA Documentation
