Validator Development Guide
Learn how to add new validation checks to AICR.
Overview
AICR uses a container-per-validator model. Each validation check runs as an isolated Kubernetes Job with access to the cluster, a snapshot, and the recipe. Validators are organized into three phases:
| Phase | Purpose | Example |
|---|---|---|
deployment | Verify components are installed and healthy | GPU operator pods running, expected resources present |
performance | Verify system meets performance thresholds | NCCL bandwidth, GPU utilization |
conformance | Verify workload-specific requirements | DRA support, gang scheduling, autoscaling |
Architecture:
- Declarative Catalog: Validators are defined in
recipes/validators/catalog.yaml - Container Contract: Exit code 0 = pass, 1 = fail, 2 = skip
- Evidence via stdout: Check output printed to stdout is captured as CTRF evidence
- Debug via stderr: Structured logs go to stderr and are streamed to the user
- CTRF Reports: Results are aggregated into Common Test Report Format JSON
Quick Start
Adding a new check to an existing validator container requires three steps.
Step 1: Implement the Check Function
Create a new file in the appropriate phase directory (e.g., validators/deployment/):
package main
import (
"fmt"
"log/slog"
"github.com/NVIDIA/aicr/pkg/errors"
"github.com/NVIDIA/aicr/validators"
metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)
func checkMyComponent(ctx *validators.Context) error {
slog.Info("checking my-component health")
pods, err := ctx.Clientset.CoreV1().Pods("my-namespace").List(
ctx.Ctx,
metav1.ListOptions{LabelSelector: "app=my-component"},
)
if err != nil {
return errors.Wrap(errors.ErrCodeInternal, "failed to list pods", err)
}
if len(pods.Items) == 0 {
return errors.New(errors.ErrCodeNotFound, "no my-component pods found")
}
// Evidence to stdout (captured in CTRF report)
fmt.Printf("Found %d my-component pod(s)\n", len(pods.Items))
for _, pod := range pods.Items {
fmt.Printf(" %s: %s\n", pod.Name, pod.Status.Phase)
}
return nil
}Step 2: Register in main.go
Add the check function to the dispatch map in validators/deployment/main.go:
func main() {
validators.Run(map[string]validators.CheckFunc{
"operator-health": checkOperatorHealth,
"expected-resources": checkExpectedResources,
// Add your check here:
"my-component": checkMyComponent,
})
}Step 3: Add Catalog Entry
Add an entry to recipes/validators/catalog.yaml:
validators:
# ... existing entries ...
- name: my-component
phase: deployment
description: "Verify my-component pods are running and healthy"
image: ghcr.io/nvidia/aicr-validators/deployment:latest
timeout: 2m
args: ["my-component"]
env: []The args field must match the key used in the validators.Run() dispatch map.
Container Contract
Every validator container must follow this contract:
Exit Codes
| Code | Meaning | CTRF Status |
|---|---|---|
0 | Check passed | passed |
1 | Check failed | failed |
2 | Check skipped (not applicable) | skipped |
I/O Channels
| Channel | Purpose | Captured By |
|---|---|---|
| stdout | Evidence output (human-readable check results) | CTRF report message field |
| stderr | Debug/progress logs (slog output) | Streamed live to user terminal |
/dev/termination-log | Failure reason (max 4096 bytes) | CTRF report on failure |
Mounted Data
The validator engine mounts snapshot and recipe data as ConfigMaps:
| Path | Content | Environment Override |
|---|---|---|
/data/snapshot/snapshot.yaml | Cluster snapshot | AICR_SNAPSHOT_PATH |
/data/recipe/recipe.yaml | Recipe with constraints | AICR_RECIPE_PATH |
Environment Variables
| Variable | Description |
|---|---|
AICR_NAMESPACE | Validation namespace (fallback if ServiceAccount namespace unavailable) |
AICR_SNAPSHOT_PATH | Override snapshot mount path |
AICR_RECIPE_PATH | Override recipe mount path |
AICR_VALIDATOR_IMAGE_REGISTRY | Override image registry prefix (set by user) |
AICR_NODE_SELECTOR | User-provided node selector override for inner workloads (comma-separated key=value pairs). Set by the --node-selector CLI flag. Use ctx.NodeSelector to access the parsed value. |
AICR_TOLERATIONS | User-provided toleration override for inner workloads (comma-separated key=value:effect entries). Set by the --toleration CLI flag. Use ctx.Tolerations to access the parsed value. |
Context API
The validators.Context struct provides all dependencies a check needs:
type Context struct {
Ctx context.Context // Parent context with timeout
Cancel context.CancelFunc // Release resources (caller must defer)
Clientset kubernetes.Interface // Typed K8s client
RESTConfig *rest.Config // For exec, port-forward, dynamic client
DynamicClient dynamic.Interface // For CRD access
Snapshot *snapshotter.Snapshot // Captured cluster state
Recipe *recipe.RecipeResult // Recipe with validation config
Namespace string // Validation namespace
NodeSelector map[string]string // User-provided node selector override (nil = use defaults)
Tolerations []corev1.Toleration // User-provided toleration override (nil = use defaults)
}LoadContext() builds this from the container environment: reads mounted ConfigMaps, creates in-cluster K8s clients, and sets a timeout from defaults.CheckExecutionTimeout.
Scheduling Overrides
When creating inner workloads (pods, Jobs, TrainJobs), check ctx.NodeSelector and ctx.Tolerations before applying hardcoded platform selectors. If non-nil, these override the default scheduling constraints to support clusters with non-standard GPU node labels or taints.
// Apply scheduling overrides when creating inner workload pods.
nodeSelector := map[string]string{"cloud.google.com/gke-accelerator": "nvidia-h100-mega-80gb"}
if ctx.NodeSelector != nil {
nodeSelector = ctx.NodeSelector // user override replaces platform default
}
tolerations := []corev1.Toleration{{Operator: corev1.TolerationOpExists}}
if ctx.Tolerations != nil {
tolerations = ctx.Tolerations // user override replaces default tolerate-all
}Validators that use nodeName pinning (e.g., nvidia-smi, DRA isolation) bypass the scheduler entirely and should not apply ctx.NodeSelector.
Helper Methods
ctx.Timeout(d) — Create a child context with a specific timeout:
subCtx, cancel := ctx.Timeout(30 * time.Second)
defer cancel()
pods, err := ctx.Clientset.CoreV1().Pods(ns).List(subCtx, opts)Runner Utilities
validators.Run(checks) — Main entry point for validator containers. Handles context loading, check dispatch by os.Args[1], exit codes, and termination log writing.
validators.Skip(reason) — Return from a CheckFunc to indicate the check is not applicable. The runner exits with code 2:
func checkFeatureX(ctx *validators.Context) error {
if ctx.Recipe.Validation == nil {
return validators.Skip("no validation section in recipe")
}
// ... actual check logic ...
return nil
}Catalog Entry Schema
Each entry in recipes/validators/catalog.yaml:
- name: operator-health # Unique identifier, used in Job names
phase: deployment # deployment | performance | conformance
description: "Human-readable" # Shown in CTRF report
image: ghcr.io/.../img:latest # OCI image reference
timeout: 2m # Job activeDeadlineSeconds
args: ["operator-health"] # Container args (check name)
env: # Optional environment variables
- name: MY_VAR
value: "my-value"
resources: # Optional resource requests (omit for defaults)
cpu: "100m"
memory: "128Mi"Image tag resolution (applied by catalog.Load):
:latesttags are replaced with the CLI version (e.g.,:v0.9.5) for release builds- Explicit version tags (e.g.,
:v1.2.3) are never modified AICR_VALIDATOR_IMAGE_REGISTRYoverrides the registry prefix
Code Walkthrough
The operator_health.go check demonstrates the standard pattern:
// validators/deployment/operator_health.go
func checkOperatorHealth(ctx *validators.Context) error {
// 1. Use slog for debug output (goes to stderr, streamed to user)
slog.Info("listing pods", "namespace", gpuOperatorNamespace)
// 2. Use ctx.Clientset for K8s API calls
pods, err := ctx.Clientset.CoreV1().Pods(gpuOperatorNamespace).List(
ctx.Ctx,
metav1.ListOptions{LabelSelector: gpuOperatorLabel},
)
if err != nil {
// 3. Return wrapped errors for failures
return errors.Wrap(errors.ErrCodeInternal, "failed to list pods", err)
}
// 4. Print evidence to stdout (captured in CTRF report)
fmt.Printf("Found %d gpu-operator pod(s):\n", len(pods.Items))
for _, pod := range pods.Items {
fmt.Printf(" %s: %s\n", pod.Name, pod.Status.Phase)
}
// 5. Return nil for pass, non-nil error for fail
if runningCount == 0 {
return errors.New(errors.ErrCodeInternal, "no pods in Running state")
}
return nil
}Key patterns:
slog.*→ stderr → streamed live to userfmt.Printf→ stdout → captured as CTRF evidencereturn nil→ exit 0 → passedreturn errors.*→ exit 1 → failed (message written to termination log)return validators.Skip(reason)→ exit 2 → skipped
Directory Layout
validators/
├── context.go # Shared Context type and LoadContext()
├── runner.go # Run() entry point, exit code handling
├── deployment/ # Deployment phase validators
│ ├── main.go # Check dispatch map
│ ├── Dockerfile # Container image build
│ ├── operator_health.go # Individual check implementation
│ ├── expected_resources.go
│ └── ...
├── performance/ # Performance phase validators
│ ├── main.go
│ ├── Dockerfile
│ └── ...
├── conformance/ # Conformance phase validators
│ ├── main.go
│ ├── Dockerfile
│ └── ...
└── chainsaw/ # Chainsaw test runner utilities
└── ...Each phase directory produces one container image. Multiple checks are compiled into a single binary and selected via the first argument.
Testing
Unit Tests
Use fake K8s clients for isolated testing:
func TestCheckMyComponent(t *testing.T) {
tests := []struct {
name string
pods []corev1.Pod
wantErr bool
}{
{
name: "healthy pods",
pods: []corev1.Pod{
{
ObjectMeta: metav1.ObjectMeta{
Name: "my-pod",
Labels: map[string]string{"app": "my-component"},
},
Status: corev1.PodStatus{Phase: corev1.PodRunning},
},
},
wantErr: false,
},
{
name: "no pods found",
pods: []corev1.Pod{},
wantErr: true,
},
}
for _, tt := range tests {
t.Run(tt.name, func(t *testing.T) {
objects := make([]runtime.Object, len(tt.pods))
for i := range tt.pods {
objects[i] = &tt.pods[i]
}
ctx := &validators.Context{
Ctx: context.TODO(),
Clientset: fake.NewClientset(objects...),
Namespace: "test",
}
err := checkMyComponent(ctx)
if (err != nil) != tt.wantErr {
t.Errorf("error = %v, wantErr %v", err, tt.wantErr)
}
})
}
}Local Testing with Docker
Build and run a validator locally against mounted data:
# Build the validator image
docker build -f validators/deployment/Dockerfile -t my-validator .
# Run with mounted snapshot and recipe
docker run --rm \
-v ./snapshot.yaml:/data/snapshot/snapshot.yaml \
-v ./recipe.yaml:/data/recipe/recipe.yaml \
my-validator my-component
# Check exit code
echo $? # 0=pass, 1=fail, 2=skipNote: K8s API calls will fail locally unless you mount a kubeconfig. For checks that only read snapshot/recipe data, this works without cluster access.
Testing with Custom Images
When developing validators, you can build and push a custom image to test on a live cluster before merging.
Edit the embedded catalog to point at your custom image and rebuild the CLI:
# recipes/validators/catalog.yaml
- name: nccl-all-reduce-bw
phase: performance
image: my-registry.example.com/my-validator:dev # custom image
timeout: 30m
args: ["nccl-all-reduce-bw"]make build
./dist/aicr_*/aicr validate --recipe recipe.yaml --snapshot snapshot.yaml \
--image-pull-secret my-registry-secretThe catalog is embedded in the binary at build time, so a rebuild is required. Revert before pushing:
git checkout -- recipes/validators/catalog.yamlPrivate Registry Authentication
If your image is in a private registry, create an image pull secret in the validation namespace and pass it to the CLI with --image-pull-secret:
# Create the secret (use --dry-run=client | apply for idempotent create-or-update)
kubectl create secret docker-registry my-registry-secret \
--docker-server=nvcr.io \
--docker-username='$oauthtoken' \
--docker-password=$NGC_API_KEY \
-n aicr-validation \
--dry-run=client -o yaml | kubectl apply -f -
# Run validation with the secret
aicr validate \
--recipe recipe.yaml \
--snapshot snapshot.yaml \
--image-pull-secret my-registry-secretThe secret must be of type kubernetes.io/dockerconfigjson and exist in the validation namespace. The --image-pull-secret flag can be repeated for multiple registries.
Checklist
When adding a new upstream check:
- Create
validators/{phase}/my_check.goimplementingCheckFunc - Register in
validators/{phase}/main.godispatch map - Add catalog entry in
recipes/validators/catalog.yaml - Add the check name to the recipe's
validation.{phase}.checks[](or omit to run all) - Write table-driven unit tests with fake K8s clients
- Test locally with
docker runand mounted data - Run
make testwith race detector
See Also
- Validator Extension Guide — External validators via
--data - Validator Catalog Reference — Catalog schema and entries
- Validator V2 ADR — Architecture decision record
- CLI Reference — Validate command flags