
Component Catalog

AICR recipes are composed of components — the individual software packages that make up a GPU-accelerated Kubernetes runtime. This page lists every component that can appear in a recipe.

Note: A recipe includes only the components that apply to its target environment and workload, so not every component listed here appears in any given recipe.

The source of truth is recipes/registry.yaml. Each entry in the registry defines the component's Helm chart (or Kustomize source), default version, namespace, and node scheduling configuration. If a component is not listed there, it cannot appear in a recipe.
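
Because the registry gates what a recipe can contain, it can be handy to check whether a name is listed before expecting it in a recipe. The following is a minimal sketch, not part of the AICR tooling: the `component_in_registry` helper is hypothetical, and it assumes each registry entry carries a `name:` key holding the component name. That key is an assumption about the file layout, so adjust the pattern to the actual schema.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed (exit 0) if a component name is listed in a registry file.
# ASSUMPTION: entries contain a line such as "- name: cert-manager" or "name: cert-manager".
# The real recipes/registry.yaml schema may differ.
component_in_registry() {
  local name="$1"
  local registry="${2:-recipes/registry.yaml}"
  # Match the whole value so that "kgateway" does not match "kgateway-crds".
  grep -qE -- "^[[:space:]]*(-[[:space:]]+)?name:[[:space:]]*[\"']?${name}[\"']?[[:space:]]*$" "$registry"
}

# Example usage (run from the repository root):
#   component_in_registry kai-scheduler && echo "listed" || echo "not listed"
```

A text match like this is only a quick pre-check; a YAML-aware tool would be more robust if the registry nests names under other keys.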

Components

| Component | Description | Source |
|---|---|---|
| gpu-operator | Manages the GPU driver and runtime lifecycle on Kubernetes nodes. Handles driver installation, container runtime configuration, device plugin, and GPU feature discovery. | NVIDIA GPU Operator |
| network-operator | Manages high-performance networking for GPU workloads. Configures RDMA, SR-IOV, and host networking for multi-node communication. | NVIDIA Network Operator |
| gke-nccl-tcpxo | NCCL TCPxO network plugin for GKE. Provides optimized collective communication for multi-node GPU workloads on Google Kubernetes Engine. GKE-specific. | |
| aws-efa | Device plugin for AWS Elastic Fabric Adapter. Enables low-latency networking on EKS clusters with EFA-capable instances. EKS-specific. | AWS EFA K8s Device Plugin |
| cert-manager | Automates TLS certificate management. Required by several operators for webhook and API server certificates. | cert-manager |
| skyhook-operator | OS-level node tuning and configuration management. Applies kernel parameters, sysctl settings, and system-level optimizations to nodes. | Skyhook |
| skyhook-customizations | Environment-specific node tuning profiles applied via Skyhook. Extends the operator with kernel params, hugepages, and other host-level configurations. | |
| nvsentinel | GPU health monitoring and automated remediation. Detects GPU errors and can cordon or drain affected nodes. | NVSentinel |
| nvidia-dra-driver-gpu | Dynamic Resource Allocation (DRA) driver for GPUs. Advertises GPUs via the Kubernetes resource.k8s.io/v1 API instead of the legacy device plugin. Requires Kubernetes 1.34+ (DRA is GA in 1.34). See AKS GPU Setup for details. CLI alias: dradriver. | NVIDIA DRA Driver |
| kube-prometheus-stack | Cluster monitoring: Prometheus, Grafana, Alertmanager, and node exporters. Provides GPU and cluster metrics collection and dashboards. | kube-prometheus-stack |
| prometheus-adapter | Exposes custom metrics from Prometheus to the Kubernetes metrics API. Enables HPA scaling based on GPU utilization and other custom metrics. | prometheus-adapter |
| aws-ebs-csi-driver | CSI driver for Amazon EBS volumes. Provides persistent storage for workloads on EKS. EKS-specific. | AWS EBS CSI Driver |
| k8s-ephemeral-storage-metrics | Exports ephemeral storage usage metrics per pod. Useful for monitoring scratch space consumption on GPU nodes. | k8s-ephemeral-storage-metrics |
| kai-scheduler | DRA-aware gang scheduler with hierarchical queues and topology-aware placement. Ensures distributed training jobs land on nodes with optimal interconnect topology. | KAI Scheduler |
| dynamo-crds | Custom Resource Definitions for NVIDIA Dynamo inference serving. Installed separately to support CRD lifecycle management. | Dynamo |
| dynamo-platform | NVIDIA Dynamo inference serving platform. Distributed inference with prefix-cache-aware routing and disaggregated prefill/decode. | Dynamo |
| kgateway-crds | Custom Resource Definitions for kgateway (Kubernetes Gateway API implementation). | kgateway |
| kgateway | Kubernetes Gateway API implementation. Provides model-aware ingress routing for inference workloads. | kgateway |
| kubeflow-trainer | Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration. | Kubeflow Trainer |
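
The nvidia-dra-driver-gpu entry requires Kubernetes 1.34 or later. A pre-flight check for that requirement might look like the sketch below. The `k8s_version_at_least` helper is a hypothetical name introduced for illustration and is not part of the AICR CLI; it only compares the major and minor fields of a version string.

```bash
#!/usr/bin/env bash
# Hypothetical helper: succeed if a Kubernetes version string is at least MAJOR.MINOR.
# The minimum defaults to 1.34, the version the DRA driver entry above requires.
k8s_version_at_least() {
  local version="${1#v}"          # drop a leading "v", e.g. v1.34.1 -> 1.34.1
  local min="${2:-1.34}"
  local major minor min_major min_minor
  major="${version%%.*}"          # text before the first dot
  minor="${version#*.}"           # text after the first dot
  minor="${minor%%[!0-9]*}"       # keep leading digits only ("34.1" or "34+" -> "34")
  min_major="${min%%.*}"
  min_minor="${min#*.}"
  min_minor="${min_minor%%[!0-9]*}"
  (( major > min_major || (major == min_major && minor >= min_minor) ))
}

# Example usage against a live cluster (requires kubectl and jq):
#   k8s_version_at_least "$(kubectl version -o json | jq -r '.serverVersion.gitVersion')" \
#     && echo "DRA driver requirement met" \
#     || echo "Kubernetes 1.34+ required for nvidia-dra-driver-gpu"
```

Trimming the minor field to its leading digits lets the check tolerate provider suffixes in the reported version.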

How Components Are Selected

Not every component appears in every recipe. The recipe engine selects components based on the overlay chain for your environment:

  • Base components (cert-manager, kube-prometheus-stack) appear in most recipes.
  • Cloud-specific components (aws-efa, aws-ebs-csi-driver) are added when the recipe's target service matches the component's cloud, for example eks for both of these.
  • Intent-specific components (kubeflow-trainer, dynamo-platform, kai-scheduler) are added based on workload intent.
  • Accelerator/OS-specific tuning (skyhook-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.
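
The layering described in the list above can be pictured with a toy model. This sketch is illustrative only and does not reproduce the recipe engine or its overlay files: the `select_components` function is hypothetical, the `gke` service value and the `inference` intent value are assumptions (only `eks` and `training` appear in this page's example command), and the intent-to-component mapping is inferred from the component descriptions rather than taken from the registry.

```bash
#!/usr/bin/env bash
# Toy model of layered selection. NOT the actual AICR recipe engine.
# ASSUMPTIONS: the "gke" and "inference" values and the mappings below are illustrative.
select_components() {
  local service="$1"
  local intent="$2"

  # Base layer: components that appear in most recipes.
  printf '%s\n' cert-manager kube-prometheus-stack

  # Cloud layer: added only when the service matches.
  case "$service" in
    eks) printf '%s\n' aws-efa aws-ebs-csi-driver ;;
    gke) printf '%s\n' gke-nccl-tcpxo ;;
  esac

  # Intent layer: added based on the workload intent.
  case "$intent" in
    training)  printf '%s\n' kubeflow-trainer ;;
    inference) printf '%s\n' dynamo-platform ;;
  esac
}

# Example usage:
#   select_components eks training
```

The real selection also depends on accelerator and OS, which this model omits; generating a recipe remains the reliable way to see the result.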

To see exactly which components appear in a given recipe, generate one:

```bash
aicr recipe --service eks --accelerator h100 --os ubuntu --intent training -o recipe.yaml
```

The output lists every component with its pinned version and configuration values.

Adding Components

New components are added declaratively in recipes/registry.yaml — no Go code required. See the Contributing Guide and Bundler Development docs for details.

Released under the Apache 2.0 License.