AKS GPU Setup
Kubernetes Version Requirement
AICR requires Kubernetes 1.34 or later on AKS. This is driven by DRA (Dynamic Resource Allocation), which is included in every AICR recipe.
The core DRA APIs (resource.k8s.io) graduated to GA (stable v1) in Kubernetes 1.34. No AKS-specific feature flag is needed — DRA is enabled out of the box once you're on 1.34+.
# Create a cluster on 1.34
az aks create \
--resource-group <rg> \
--name <cluster> \
--kubernetes-version 1.34 \
--enable-oidc-issuer \
--enable-workload-identity \
--enable-managed-identity \
--generate-ssh-keys
# Upgrade an existing cluster to 1.34
az aks upgrade \
--resource-group <rg> \
--name <cluster> \
--kubernetes-version 1.34
You can verify that DRA is available after the upgrade:
kubectl api-resources --api-group=resource.k8s.io
Expected output includes deviceclasses, resourceclaims, resourceclaimtemplates, and resourceslices.
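The verification step can also be scripted. The following is a minimal sketch, assuming the resource names are piped in one per line in the `name.group` form that `kubectl api-resources -o name` prints; `check_dra_resources` is a hypothetical helper name, not part of kubectl or AICR.

```shell
# check_dra_resources (hypothetical helper): reads resource names on stdin,
# one per line, and fails if any of the four DRA resources is absent.
# The body runs in a subshell, so `exit` only sets the function's status.
check_dra_resources() (
  names=$(cat)
  missing=0
  for r in deviceclasses resourceclaims resourceclaimtemplates resourceslices; do
    # Match the whole line exactly, e.g. "resourceslices.resource.k8s.io".
    if ! printf '%s\n' "$names" | grep -qxF "$r.resource.k8s.io"; then
      echo "missing: $r" >&2
      missing=1
    fi
  done
  exit "$missing"
)

# Example use against a live cluster (requires kubectl access):
#   kubectl api-resources --api-group=resource.k8s.io -o name | check_dra_resources
```

The function exits non-zero and lists each missing resource on stderr, which makes it usable as a gate in a provisioning script.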
Note: AKS does not allow skipping minor versions when upgrading. If your cluster is on 1.32, you must upgrade to 1.33 first, then to 1.34.
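For clusters more than one minor version behind, the sequence of intermediate upgrades can be computed rather than worked out by hand. This is a sketch only: `upgrade_path` is a hypothetical helper, and it assumes both versions share the same major version and are given as `major.minor` (a trailing patch component is ignored).

```shell
# upgrade_path (hypothetical helper): print each minor version to step through,
# from the current version (exclusive) up to the target version (inclusive).
upgrade_path() (
  major=$(printf '%s' "$1" | cut -d. -f1)
  current_minor=$(printf '%s' "$1" | cut -d. -f2)
  target_minor=$(printf '%s' "$2" | cut -d. -f2)
  minor=$((current_minor + 1))
  while [ "$minor" -le "$target_minor" ]; do
    printf '%s.%s\n' "$major" "$minor"
    minor=$((minor + 1))
  done
)

# Example use (requires az and a real cluster; --yes skips the confirmation prompt):
#   for v in $(upgrade_path 1.32 1.34); do
#     az aks upgrade --resource-group <rg> --name <cluster> --kubernetes-version "$v" --yes
#   done
```

When the cluster is already at the target version, the function prints nothing, so the loop body never runs.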
Dynamic Resource Allocation (DRA)
All AICR recipes include the nvidia-dra-driver-gpu component, which advertises GPUs via the Kubernetes DRA API instead of the legacy device plugin. DRA provides structured GPU device advertisement, claim-based allocation, and integration with gang scheduling.
Feature Gate Details
| Kubernetes Version | DRA Status | Feature Gate |
|---|---|---|
| 1.26–1.31 | Alpha | DynamicResourceAllocation, off by default |
| 1.32–1.33 | Beta | DynamicResourceAllocation, off by default (the beta resource.k8s.io API group must also be enabled) |
| 1.34+ | GA / Stable | resource.k8s.io/v1, enabled by default, no feature gate needed |
On AKS 1.34, DRA is GA. You do not need to pass any custom API server flags or register an AKS preview feature.
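A preflight check for the 1.34 threshold might look like the sketch below. `dra_is_ga` is a hypothetical helper; it assumes a version string such as `v1.34.1`, `1.34`, or `1.34+` and compares only the major and minor components.

```shell
# dra_is_ga (hypothetical helper): succeed when the given Kubernetes version
# is 1.34 or later, the release in which resource.k8s.io/v1 became stable.
dra_is_ga() (
  # Keep digits only, so "v1" becomes "1" and "34+" becomes "34".
  major=$(printf '%s' "$1" | cut -d. -f1 | tr -cd '0-9')
  minor=$(printf '%s' "$1" | cut -d. -f2 | tr -cd '0-9')
  [ "$major" -gt 1 ] || { [ "$major" -eq 1 ] && [ "$minor" -ge 34 ]; }
)

# Example use against a live cluster (assumes jq is installed):
#   dra_is_ga "$(kubectl version -o json | jq -r '.serverVersion.gitVersion')" \
#     || echo "Upgrade to Kubernetes 1.34 or later before deploying AICR recipes" >&2
```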
CLI Override
You can control DRA settings when bundling:
# Enable GPU resource advertisement (default)
aicr bundle -r recipe.yaml --set dradriver:gpuResourcesEnabledOverride=true
# Disable DRA GPU allocation (fall back to device plugin)
aicr bundle -r recipe.yaml \
--set dradriver:gpuResourcesEnabledOverride=false \
--set dradriver:resources.gpus.enabled=false
Device Plugin vs DRA (Important)
Both the device plugin and DRA are enabled by default, but only one should be used per node. Running both concurrently causes GPU over-admission: each system advertises every GPU independently, so the scheduler may admit more GPU pods than there are physical GPUs.
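To make the over-admission risk concrete, the sketch below computes a worst-case bound for one node. It is a simplification: it assumes every pod requests exactly one GPU, through either the device plugin or a DRA claim, and `worst_case_admitted` is a hypothetical name used only for illustration.

```shell
# worst_case_admitted (hypothetical helper): upper bound on single-GPU pods
# that could be admitted to one node when both advertisers expose every GPU.
worst_case_admitted() (
  physical=$1
  device_plugin=$physical   # GPUs advertised by the device plugin
  dra=$physical             # the same GPUs advertised again through DRA
  echo $((device_plugin + dra))
)

# An 8-GPU node could admit up to 16 single-GPU pods in the worst case:
#   worst_case_admitted 8
```

Disabling one of the two paths brings the bound back to the physical GPU count.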
For DRA-only (recommended):
aicr bundle -r recipe.yaml --set gpuoperator:devicePlugin.enabled=false
For device-plugin-only (legacy):
aicr bundle -r recipe.yaml \
--set dradriver:gpuResourcesEnabledOverride=false \
--set dradriver:resources.gpus.enabled=false
GPU Driver Setup
AKS GPU nodepools install NVIDIA drivers by default. This conflicts with the GPU Operator, which also installs drivers by default. Use one of the approaches below to avoid the conflict.
Recommended: Let GPU Operator Manage the Driver
Create nodepools with --gpu-driver none so AKS skips its driver installation and the GPU Operator handles it:
az aks nodepool add \
--cluster-name <cluster> \
--resource-group <rg> \
--name gpupool \
--node-vm-size Standard_NC80adis_H100_v5 \
--gpu-driver none \
--node-count 1
No changes to AICR recipes are needed; this is the default configuration.
Alternative: Use the AKS-Managed Driver
If you prefer the AKS-managed driver (e.g., for driver version pinning by AKS), disable the GPU Operator driver:
aicr bundle -r recipe.yaml --set gpuoperator:driver.enabled=false
Or add the following to your values override file:
driver:
  enabled: false