GKE TCPXO Networking Prerequisites
For *-gke-cos-training* recipes, GPUDirect TCPXO enables high-speed inter-node GPU communication on GKE. Without it, NCCL falls back to plain TCP, which reached ~4 GB/s bus bandwidth in the 2-node validation under Performance Reference, versus ~340 GB/s with TCPXO.
Infrastructure Prerequisites
GKE clusters must have multi-NIC networking configured before deploying AICR bundles:
- Multi-NIC networking enabled (8 GPU NICs per a3-megagpu-8g node)
- Network + GKENetworkParamSet CRs configured for GPU NICs (cluster-specific, not managed by AICR)
- nccl-tcpxo-installer DaemonSet on GPU nodes (included in AICR bundle)
- nri-device-injector DaemonSet on GPU nodes (included in AICR bundle)
Important: The GPU node pool must be provisioned with only the 8 GPU NIC networks (gpu-nic-0 through gpu-nic-7). Do not include a gVNIC additional network — it takes a GPU NIC PCI slot (0000:06:00.0), leaving only 7/8 GPUs available for TCPXO.
Workload Pod Configuration (NRI Profile)
The NRI profile mounts the host's /sys and /proc/sys into the TCPXO daemon container, giving it PCI sysfs visibility without hostNetwork. This preserves pod networking (DNS, network policies, service mesh compatibility).
apiVersion: v1
kind: Pod
metadata:
  name: my-workload
  annotations:
    # NRI device injection for tcpxo-daemon GPU access
    devices.gke.io/container.tcpxo-daemon: |
      - path: /dev/nvidia0
      - path: /dev/nvidia1
      - path: /dev/nvidia2
      - path: /dev/nvidia3
      - path: /dev/nvidia4
      - path: /dev/nvidia5
      - path: /dev/nvidia6
      - path: /dev/nvidia7
      - path: /dev/nvidiactl
      - path: /dev/nvidia-uvm
      - path: /dev/dmabuf_import_helper
    # Multi-NIC mapping (network names are cluster-specific)
    networking.gke.io/default-interface: eth0
    networking.gke.io/interfaces: |
      [{"interfaceName":"eth0","network":"default"},
       {"interfaceName":"eth1","network":"gpu-nic0"},
       {"interfaceName":"eth2","network":"gpu-nic1"},
       {"interfaceName":"eth3","network":"gpu-nic2"},
       {"interfaceName":"eth4","network":"gpu-nic3"},
       {"interfaceName":"eth5","network":"gpu-nic4"},
       {"interfaceName":"eth6","network":"gpu-nic5"},
       {"interfaceName":"eth7","network":"gpu-nic6"},
       {"interfaceName":"eth8","network":"gpu-nic7"}]
spec:
  hostNetwork: false
  containers:
    - name: tcpxo-daemon
      image: us-docker.pkg.dev/gce-ai-infra/gpudirect-tcpxo/tcpgpudmarxd-dev:v1.0.20
      securityContext:
        capabilities:
          add: [NET_ADMIN, NET_BIND_SERVICE]
      volumeMounts:
        - name: nvtcpxo-libraries
          mountPath: /usr/local/nvidia
          readOnly: true
        - name: nvtcpxo-sys
          mountPath: /hostsysfs
        - name: nvtcpxo-proc-sys
          mountPath: /hostprocsysfs
      env:
        - name: LD_LIBRARY_PATH
          value: /usr/local/nvidia/lib64
    - name: workload
      # ... your training container
      volumeMounts:
        - name: nvtcpxo-aperture-devices
          mountPath: /dev/aperture_devices
  volumes:
    - name: nvtcpxo-libraries
      hostPath:
        path: /home/kubernetes/bin/nvidia
    - name: nvtcpxo-sys
      hostPath:
        path: /sys
    - name: nvtcpxo-proc-sys
      hostPath:
        path: /proc/sys
    - name: nvtcpxo-aperture-devices
      hostPath:
        path: /dev/aperture_devices

Key properties:
- hostNetwork: false, so workloads get proper pod networking
- privileged: false; tcpxo-daemon uses only NET_ADMIN and NET_BIND_SERVICE
- /sys mounted as /hostsysfs, which provides PCI sysfs visibility for GPU enumeration
- /proc/sys mounted as /hostprocsysfs, which allows kernel network tuning
- NRI annotations inject GPU devices and multi-NIC interfaces
- Requires NRI device injector DaemonSet deployed on GPU nodes
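The two annotation values in the manifest are repetitive and easy to mistype by hand. As a minimal sketch, they could be generated with the Python helpers below. The function names and the `network_prefix` default are illustrative assumptions, not part of AICR; network names are cluster-specific, so adjust the prefix to match your Network objects.

```python
import json


def build_interfaces_annotation(gpu_nic_count=8, network_prefix="gpu-nic"):
    """Return a JSON string for the networking.gke.io/interfaces annotation.

    eth0 stays on the default network; eth1..ethN map to the GPU NIC networks.
    network_prefix is an assumption: network names are cluster-specific.
    """
    interfaces = [{"interfaceName": "eth0", "network": "default"}]
    for i in range(gpu_nic_count):
        interfaces.append(
            {"interfaceName": f"eth{i + 1}", "network": f"{network_prefix}{i}"}
        )
    # Compact separators mirror the style used in the manifest.
    return json.dumps(interfaces, separators=(",", ":"))


def build_device_annotation(gpu_count=8):
    """Return the list body for the devices.gke.io/container.tcpxo-daemon annotation."""
    paths = [f"/dev/nvidia{i}" for i in range(gpu_count)]
    paths += ["/dev/nvidiactl", "/dev/nvidia-uvm", "/dev/dmabuf_import_helper"]
    return "\n".join(f"- path: {p}" for p in paths)


if __name__ == "__main__":
    print(build_device_annotation())
    print(build_interfaces_annotation())
```

Generating the values keeps the interface count and the device list in step with each other if the node shape ever changes.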
See demos/workloads/training/gke-nccl-test-tcpxo.yaml for a complete 2-node NCCL benchmark example.
NCCL Plugin Version Matching
The NCCL test container image must match the cluster's installed TCPXO plugin version. Check with:
kubectl get ds nccl-tcpxo-installer -n kube-system \
  -o jsonpath='{.spec.template.spec.containers[?(@.name=="nccl-tcpxo-installer")].image}'
Update the nccl-plugin-gpudirecttcpx-dev image tag in your workload to match.
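To automate the comparison, the tag can be extracted from both image references and checked for equality. The sketch below is one way to do that in Python; `image_tag` and `tags_match` are hypothetical helper names, and the sketch assumes that matching versions means identical image tags.

```python
def image_tag(image_ref):
    """Return the tag of a container image reference, or None if it has no tag."""
    # Drop a digest suffix such as '@sha256:...' if present.
    name = image_ref.split("@", 1)[0]
    # Inspect only the last path segment so a registry port
    # (for example host:5000/image) is not mistaken for a tag.
    last_segment = name.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        return None
    return last_segment.rsplit(":", 1)[1]


def tags_match(installer_image, workload_image):
    """True when both references carry the same, non-empty tag."""
    installer_tag = image_tag(installer_image)
    return installer_tag is not None and installer_tag == image_tag(workload_image)
```

Pass the output of the `kubectl` command as `installer_image` and the plugin image from your workload manifest as `workload_image`; a `False` result means the workload tag needs updating.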
Troubleshooting
RxDM detects 7/8 GPUs
If RxDM reports Number of GPUs detected 7 is not equal to the actual number of GPUs 8, check the GPU node pool's additional network configuration:
gcloud container node-pools describe <pool-name> \
  --cluster <cluster> --region <region> --project <project> \
  --format="yaml(networkConfig.additionalNodeNetworkConfigs)"
If a gVNIC network appears in the list, it is taking a GPU NIC PCI slot. Remove the gVNIC from the node pool and reprovision the GPU nodes.
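The same check can be scripted. The sketch below assumes the node pool description was fetched with `--format=json` instead of YAML and that the GPU NIC networks are named `gpu-nic-0` through `gpu-nic-7`; both the helper name and the default network set are illustrative, so pass your own set if your cluster uses different names.

```python
import json

# Assumption: GPU NIC networks follow the gpu-nic-0..gpu-nic-7 naming used
# in this guide. Override `expected` if your cluster differs.
DEFAULT_GPU_NETWORKS = frozenset(f"gpu-nic-{i}" for i in range(8))


def check_additional_networks(node_pool_json, expected=DEFAULT_GPU_NETWORKS):
    """Return (extra, missing) network names for a node pool description."""
    pool = json.loads(node_pool_json)
    configs = pool.get("networkConfig", {}).get("additionalNodeNetworkConfigs", [])
    # Keep only the last path segment so full resource URLs and bare
    # network names compare equal.
    attached = {cfg["network"].rsplit("/", 1)[-1] for cfg in configs}
    extra = sorted(attached - expected)
    missing = sorted(expected - attached)
    return extra, missing
```

Any name in `extra` (such as a gVNIC network) is a candidate for removal, and any name in `missing` is a GPU NIC network that the node pool lacks.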
You can also verify the node NIC mapping:
kubectl get node <gpu-node> \
  -o jsonpath='{.metadata.annotations.networking\.gke\.io/nic-info}'
All 8 GPU NIC PCI addresses should be mapped to eth1–eth8. If a gVNIC is present, it typically occupies PCI 0000:06:00.0, displacing the first GPU NIC.
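A short script can confirm that every expected interface appears in the annotation. This sketch assumes the `nic-info` value is a JSON list of objects with `interfaceName` and `pciAddress` keys; verify that against the actual output on your cluster before relying on it. The helper name is hypothetical.

```python
import json


def check_nic_info(nic_info_json, gpu_nic_count=8):
    """Return (mapping, missing) for a node's nic-info annotation.

    mapping: interface name -> PCI address (None if the key is absent).
    missing: expected GPU interfaces (eth1..ethN) not present in the annotation.
    Assumption: the annotation is a JSON list of objects with
    'interfaceName' and 'pciAddress' keys.
    """
    entries = json.loads(nic_info_json)
    mapping = {e["interfaceName"]: e.get("pciAddress") for e in entries}
    expected = [f"eth{i}" for i in range(1, gpu_nic_count + 1)]
    missing = [name for name in expected if name not in mapping]
    return mapping, missing
```

An empty `missing` list only shows that eth1 through eth8 exist; compare the PCI addresses in `mapping` against the GPU NIC slots to rule out a gVNIC occupying one of them.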
RxDM detects 0/8 GPUs
If RxDM reports Number of GPUs detected in the PCI tree 0, the pod is missing the /sys hostPath mount. Ensure /sys is mounted as /hostsysfs in the tcpxo-daemon container. Without it, the container network namespace hides the host PCI sysfs tree entirely.
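Before deploying, the manifest can be checked for the required hostPath mounts. The following is a minimal sketch that assumes the Pod manifest has already been parsed into a Python dict (for example with a YAML library of your choice); `missing_host_mounts` is a hypothetical helper, not an AICR tool.

```python
def missing_host_mounts(pod, container_name="tcpxo-daemon"):
    """Return required 'hostPath -> mountPath' pairs that the container lacks."""
    required = {"/sys": "/hostsysfs", "/proc/sys": "/hostprocsysfs"}
    spec = pod.get("spec", {})
    # Map volume name -> host path for hostPath-backed volumes.
    host_paths = {
        v["name"]: v["hostPath"]["path"]
        for v in spec.get("volumes", [])
        if "hostPath" in v
    }
    container = next(
        (c for c in spec.get("containers", []) if c.get("name") == container_name),
        None,
    )
    mounts = container.get("volumeMounts", []) if container else []
    # Collect the (hostPath, mountPath) pairs actually present in the container.
    present = {
        (host_paths[m["name"]], m["mountPath"])
        for m in mounts
        if m["name"] in host_paths
    }
    return [
        f"{host} -> {target}"
        for host, target in required.items()
        if (host, target) not in present
    ]
```

An empty list means both mounts are in place; a result containing `/sys -> /hostsysfs` corresponds to the 0/8 GPU symptom described above.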
Performance Reference
Validated on GKE 1.35 / a3-megagpu-8g (2 nodes, 16 GPUs):
| Profile | hostNetwork | busBW @ 16 GB | Avg busBW |
|---|---|---|---|
| NRI (recommended) | false | ~340 GB/s | ~100 GB/s |
| Without TCPXO | N/A | ~4 GB/s | ~4 GB/s |
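For readers comparing their own runs against this table, the sketch below shows how bus bandwidth is conventionally derived for an all-reduce in nccl-tests: algorithm bandwidth (bytes moved per second) scaled by 2(n-1)/n for n ranks. It assumes the benchmark is an all-reduce, which this page does not state explicitly, and the function name is illustrative.

```python
def all_reduce_bus_bandwidth(size_bytes, seconds, ranks):
    """Return bus bandwidth in bytes/s for an all-reduce of size_bytes.

    Uses the nccl-tests convention: busBW = algBW * 2 * (ranks - 1) / ranks,
    where algBW = size_bytes / seconds.
    """
    if ranks < 2 or seconds <= 0:
        raise ValueError("need at least 2 ranks and a positive duration")
    alg_bw = size_bytes / seconds
    return alg_bw * 2 * (ranks - 1) / ranks
```

Because the scaling factor depends on the rank count, compare busBW rather than raw throughput when the number of GPUs differs from the 16 used here.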