Use Cases

This page documents real-world AIOps behavior observed during Kubernaut demo validation. These are not synthetic examples: they capture actual LLM decisions, remediation outcomes, and pipeline behavior from live Kubernetes clusters.

Deep Dives

Multiple Remediation Paths -- How the LLM chose an alternative fix for a GitOps-managed Certificate failure, and why both approaches are valid
Remediation History Feedback -- How the LLM refused to repeat a failed workflow after remediation history revealed the prior attempt's failure, escalating to human review instead
LLM Judgment and the Approval Gate -- How the LLM selected a matching cleanup workflow but warned "no remediation warranted," deferring to a human operator via the approval gate

Demo Scenario Catalog

The following scenarios demonstrate Kubernaut's remediation pipeline across different failure modes, infrastructure patterns, and pipeline behaviors. Each scenario injects a fault, triggers a Prometheus alert, and validates the full remediation lifecycle.

All scenarios are available in the kubernaut-demo-scenarios repository with run scripts, manifests, and step-by-step instructions.
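
The inject, alert, validate flow described above can be sketched as a small polling harness. This is an illustration only and not how the repository's run scripts are implemented: `run_scenario`, `wait_for`, the three callables, the result strings, and the timeout values are all hypothetical names chosen for this sketch.

```python
import time
from typing import Callable


def wait_for(condition: Callable[[], bool], timeout_s: float, poll_s: float,
             clock: Callable[[], float] = time.monotonic,
             sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll `condition` until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if condition():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


def run_scenario(inject_fault: Callable[[], None],
                 alert_is_firing: Callable[[], bool],
                 is_remediated: Callable[[], bool],
                 timeout_s: float = 600.0, poll_s: float = 5.0,
                 clock: Callable[[], float] = time.monotonic,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Hypothetical scenario lifecycle: inject, wait for alert, wait for fix."""
    inject_fault()  # e.g. apply a bad ConfigMap (assumption)
    if not wait_for(alert_is_firing, timeout_s, poll_s, clock, sleep):
        return "alert-never-fired"
    if not wait_for(is_remediated, timeout_s, poll_s, clock, sleep):
        return "not-remediated"
    return "remediated"


# Trivial dry run: both conditions are already true, so nothing sleeps.
result = run_scenario(lambda: None, lambda: True, lambda: True)
```

In a real run the three callables would wrap whatever checks a scenario needs, such as querying the alerting backend or inspecting workload status; those integrations are deliberately left out here.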

Pod Lifecycle

Scenario	Signal	Remediation	Description
crashloop	`KubePodCrashLooping`	`RollbackDeployment`	Bad ConfigMap causes CrashLoopBackOff, rollback restores previous revision
crashloop-helm	`KubePodCrashLooping`	`HelmRollback`	Helm-managed workload crash, Helm rollback to last known-good release
stuck-rollout	`KubeDeploymentRolloutStuck`	`RollbackDeployment`	Bad image tag stalls rollout, rollback restores healthy revision
memory-leak	`ContainerMemoryExhaustionPredicted`	`GracefulRestart`	Proactive: `predict_linear()` detects memory growth before OOM
memory-escalation	`ContainerMemoryHigh`	`IncreaseMemoryLimits`	Memory usage exceeds threshold; escalates to human if limits are already high
slo-burn	`ErrorBudgetBurn`	`RollbackDeployment`	Error budget burn rate exceeds threshold, proactive rollback
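
The memory-leak scenario relies on PromQL's `predict_linear()`, which fits a simple linear regression through the samples in a range and extrapolates it a given number of seconds ahead. The Python sketch below reproduces that idea for illustration. The function name `predict_linear_sketch`, the 512 MiB limit, the sample series, and the two-hour horizon are assumptions for this example, not the values used by the demo's alert rule.

```python
from typing import Sequence, Tuple


def predict_linear_sketch(samples: Sequence[Tuple[float, float]],
                          now: float, horizon_s: float) -> float:
    """Fit a least-squares line to (timestamp, value) samples and
    evaluate it at now + horizon_s."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t
    intercept = mean_v - slope * mean_t
    return intercept + slope * (now + horizon_s)


MIB = 1024 * 1024
limit = 512 * MIB  # assumed container memory limit

# Assumed series: one sample per minute for 10 minutes, growing 1 MiB/min
# from 400 MiB to 410 MiB.
samples = [(60.0 * i, float((400 + i) * MIB)) for i in range(11)]
now = 600.0

# Extrapolate two hours ahead and compare against the limit.
predicted = predict_linear_sketch(samples, now, horizon_s=2 * 3600)
will_exhaust = predicted > limit
```

With these assumed numbers the line reaches roughly 530 MiB two hours out, above the 512 MiB limit, so a proactive alert would fire while current usage is still only 410 MiB.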

GitOps

Scenario	Signal	Remediation	Description
gitops-drift	`KubePodCrashLooping`	`GitRevertCommit`	Bad ConfigMap commit in Gitea, LLM selects git revert over kubectl rollback
cert-failure-gitops	`CertManagerCertNotReady`	`GitRevertCommit` / `FixCertificate`	Broken ClusterIssuer via git; LLM may choose git revert or direct fix (see Multiple Remediation Paths under Deep Dives)
disk-pressure-emptydir	`PredictedDiskPressure`	`AnsiblePVCMigration`	PostgreSQL on emptyDir fills disk; Ansible/AWX runs pg_dump, commits PVC migration to Git, ArgoCD syncs, pg_restore completes migration

Infrastructure

Scenario	Signal	Remediation	Description
cert-failure	`CertManagerCertNotReady`	`FixCertificate`	CA Secret deleted, workflow recreates it to restore certificate issuance
hpa-maxed	`KubeHpaMaxedOut`	`ScaleHPA`	HPA at max replicas under sustained load
resource-contention	`OOMKilled`	`IncreaseMemoryLimits`	Memory contention causes OOM kills across competing workloads
resource-quota-exhaustion	`KubeResourceQuotaExhausted`	`AdjustResourceQuota`	Namespace quota prevents scaling (details)
network-policy-block	`KubePodCrashLooping` / `KubeDeploymentReplicasMismatch`	`FixNetworkPolicy`	Deny-all NetworkPolicy blocks traffic; readiness-based signal self-resolves after remediation
statefulset-pvc-failure	`KubeStatefulSetReplicasMismatch`	`FixStatefulSetPVC`	PVC binding failure prevents StatefulSet pod scheduling

Multi-Node

Scenario	Signal	Remediation	Description
autoscale	`KubePodSchedulingFailed`	`AddNode`	Cluster autoscaling via kubeadm join
pending-taint	`KubePodNotScheduled`	`RemoveTaint`	Node taint prevents pod scheduling
pdb-deadlock	`KubePodDisruptionBudgetAtLimit`	`ResolvePDBDeadlock`	PodDisruptionBudget prevents necessary evictions
node-notready	`KubeNodeNotReady`	`CordonDrain`	Node health failure triggers cordon and workload migration

Advanced Pipeline

Scenario	Signal	Remediation	Description
duplicate-alert-suppression	`KubePodCrashLooping`	`RollbackDeployment`	Validates that duplicate alerts for the same incident are suppressed
concurrent-cross-namespace	`KubePodCrashLooping`	Per-team workflow	Two teams hit the same fault; LLM selects different workflows based on risk labels
orphaned-pvc-no-action	`KubePersistentVolumeClaimOrphaned`	None (NoActionRequired) / `CleanupPVC`	Without cleanup workflow: LLM concludes no action needed. With workflow in catalog: LLM selects it but warns "no remediation warranted," deferring to human approval (see LLM Judgment and the Approval Gate under Deep Dives)
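
One common way to suppress duplicate alerts, as exercised by the duplicate-alert-suppression scenario, is to fingerprint each alert's label set and drop repeats seen within a time window. The sketch below shows that general technique only. It is not taken from Kubernaut's implementation, and the class name `AlertDeduplicator`, the label values, and the 300-second window are assumptions.

```python
import hashlib
from typing import Dict, Mapping


def fingerprint(labels: Mapping[str, str]) -> str:
    """Order-independent hash of an alert's label set."""
    canonical = "\n".join(f"{k}={labels[k]}" for k in sorted(labels))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class AlertDeduplicator:
    """Hypothetical suppressor: one alert per fingerprint per window."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._last_processed: Dict[str, float] = {}

    def should_process(self, labels: Mapping[str, str], now: float) -> bool:
        key = fingerprint(labels)
        last = self._last_processed.get(key)
        if last is not None and now - last < self.window_s:
            return False  # duplicate of an incident already being handled
        self._last_processed[key] = now
        return True


dedup = AlertDeduplicator(window_s=300)  # assumed suppression window
alert = {"alertname": "KubePodCrashLooping", "namespace": "demo", "pod": "api-1"}
same_alert_reordered = {"pod": "api-1", "namespace": "demo",
                        "alertname": "KubePodCrashLooping"}

first = dedup.should_process(alert, now=0.0)                    # new incident
repeat = dedup.should_process(same_alert_reordered, now=60.0)   # inside window
later = dedup.should_process(alert, now=400.0)                  # window elapsed
```

Sorting the labels before hashing keeps the fingerprint stable regardless of the order in which labels arrive, so the reordered copy is still recognized as the same incident.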

Service Mesh

Scenario	Signal	Remediation	Description
mesh-routing-failure	`IstioHighDenyRate` / `IstioRequestsUnauthorized`	`FixServiceMeshRouting`	Istio service mesh AuthorizationPolicy misconfiguration

Unvalidated

These scenarios have scaffolding (manifests, `run.sh`, workflow) but have not been validated end-to-end on any platform. Do not rely on them until they are promoted to a category above.

Scenario	Signal	Remediation	Description	Blocker
memory-limits-gitops-ansible	`ContainerOOMKilling`	`AnsibleMemoryLimitsUpdate`	OOMKill on GitOps-managed deployment; Ansible/AWX updates limits in Git, ArgoCD syncs	Requires ArgoCD + AWX; not tested on Kind or OCP