Component Catalog

AICR recipes are composed of components — the individual software packages that make up a GPU-accelerated Kubernetes runtime. This page lists every component that can appear in a recipe.

Note: Recipes include only the components that apply to the target environment. Not every component listed here appears in every recipe.

The source of truth is recipes/registry.yaml. Each entry in the registry defines the component's Helm chart (or Kustomize source), default version, namespace, and node scheduling configuration. If a component is not listed there, it cannot appear in a recipe.
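Because the registry gates what a recipe can contain, a quick check before debugging a missing component is to confirm that its name occurs in the file at all. The helper below is a minimal sketch: `component_in_registry` is a hypothetical name, not part of the AICR tooling, and it only searches for the name as text. It assumes nothing about the registry's YAML schema.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of AICR): report whether a component name
# occurs in the registry file. This is a plain text search, not a YAML parse.
# Caveat: -w matches whole words, and "-" is not a word character, so searching
# for "kgateway" also matches a line that contains "kgateway-crds".
component_in_registry() {
  local name="$1"
  local registry="${2:-recipes/registry.yaml}"  # default path taken from this page
  if grep -qwF -- "$name" "$registry"; then
    echo "listed: $name"
  else
    echo "not listed: $name"
    return 1
  fi
}
```

From the repository root, `component_in_registry cert-manager` would then report whether that name appears in the registry. A match only shows the name is present; it does not validate the entry itself.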

Components

Component	Description	Source
gpu-operator	Manages the GPU driver and runtime lifecycle on Kubernetes nodes. Handles driver installation, container runtime configuration, device plugin, and GPU feature discovery.	NVIDIA GPU Operator
network-operator	Manages high-performance networking for GPU workloads. Configures RDMA, SR-IOV, and host networking for multi-node communication.	NVIDIA Network Operator
gke-nccl-tcpxo	NCCL TCPxO network plugin for GKE. Provides optimized collective communication for multi-node GPU workloads on Google Kubernetes Engine. GKE-specific.	—
aws-efa	Device plugin for AWS Elastic Fabric Adapter. Enables low-latency networking on EKS clusters with EFA-capable instances. EKS-specific.	AWS EFA K8s Device Plugin
cert-manager	Automates TLS certificate management. Required by several operators for webhook and API server certificates.	cert-manager
skyhook-operator	OS-level node tuning and configuration management. Applies kernel parameters, sysctl settings, and system-level optimizations to nodes.	Skyhook
skyhook-customizations	Environment-specific node tuning profiles applied via Skyhook. Extends the operator with kernel params, hugepages, and other host-level configurations.	—
nvsentinel	GPU health monitoring and automated remediation. Detects GPU errors and can cordon or drain affected nodes.	NVSentinel
nvidia-dra-driver-gpu	Dynamic Resource Allocation (DRA) driver for GPUs. Advertises GPUs via the Kubernetes `resource.k8s.io/v1` API instead of the legacy device plugin. Requires Kubernetes 1.34+ (DRA is GA in 1.34). See AKS GPU Setup for details. CLI alias: `dradriver`.	NVIDIA DRA Driver
kube-prometheus-stack	Cluster monitoring: Prometheus, Grafana, Alertmanager, and node exporters. Provides GPU and cluster metrics collection and dashboards.	kube-prometheus-stack
prometheus-adapter	Exposes custom metrics from Prometheus to the Kubernetes metrics API. Enables HPA scaling based on GPU utilization and other custom metrics.	prometheus-adapter
aws-ebs-csi-driver	CSI driver for Amazon EBS volumes. Provides persistent storage for workloads on EKS. EKS-specific.	AWS EBS CSI Driver
k8s-ephemeral-storage-metrics	Exports ephemeral storage usage metrics per pod. Useful for monitoring scratch space consumption on GPU nodes.	k8s-ephemeral-storage-metrics
kai-scheduler	DRA-aware gang scheduler with hierarchical queues and topology-aware placement. Ensures distributed training jobs land on nodes with optimal interconnect topology.	KAI Scheduler
dynamo-crds	Custom Resource Definitions for NVIDIA Dynamo inference serving. Installed separately to support CRD lifecycle management.	Dynamo
dynamo-platform	NVIDIA Dynamo inference serving platform. Distributed inference with prefix-cache-aware routing and disaggregated prefill/decode.	Dynamo
kgateway-crds	Custom Resource Definitions for kgateway (Kubernetes Gateway API implementation).	kgateway
kgateway	Kubernetes Gateway API implementation. Provides model-aware ingress routing for inference workloads.	kgateway
kubeflow-trainer	Kubeflow Training Operator for distributed training jobs (PyTorch, etc.). Manages multi-node training job lifecycle with JobSet integration.	Kubeflow Trainer

How Components Are Selected

Not every component appears in every recipe. The recipe engine selects components based on the overlay chain for your environment:

Base components (cert-manager, kube-prometheus-stack) appear in most recipes.
Cloud-specific components (aws-efa, aws-ebs-csi-driver) are added only when the target service matches; both of these examples are EKS-specific.
Intent-specific components (kubeflow-trainer, dynamo-platform, kai-scheduler) are added based on workload intent.
Accelerator/OS-specific tuning (skyhook-customizations, nvidia-dra-driver-gpu) varies by hardware and OS combination.

To see exactly which components appear in a given recipe, generate one:

bash

aicr recipe --service eks --accelerator h100 --os ubuntu --intent training -o recipe.yaml

The output lists every component with its pinned version and configuration values.
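One way to see how a single input changes the component set is to generate two recipes that differ in only one flag and diff the results. The function below is a sketch under stated assumptions: `diff_recipes_by_intent` is a hypothetical helper name, it assumes `aicr` is on your PATH, and it reuses only the flags shown in the command above. The intent values are supplied by the caller, so it makes no claim about which values the CLI accepts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: generate two recipes that differ only in --intent
# and print a unified diff of the two output files.
diff_recipes_by_intent() {
  local intent_a="$1" intent_b="$2"
  local dir
  dir="$(mktemp -d)"
  # Same flags as the documented example; only --intent varies.
  aicr recipe --service eks --accelerator h100 --os ubuntu --intent "$intent_a" -o "$dir/a.yaml"
  aicr recipe --service eks --accelerator h100 --os ubuntu --intent "$intent_b" -o "$dir/b.yaml"
  # diff exits non-zero when the files differ, which is the expected case here,
  # so that exit status is deliberately ignored.
  diff -u "$dir/a.yaml" "$dir/b.yaml" || true
  rm -rf "$dir"
}
```

For example, `diff_recipes_by_intent training inference` would show the lines that differ between the two recipes, assuming `inference` is an intent your version of the CLI accepts (only `training` is confirmed on this page).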

Adding Components

New components are added declaratively in recipes/registry.yaml — no Go code required. See the Contributing Guide and Bundler Development docs for details.
