Use Cases
Real-world AIOps behavior observed during Kubernaut demo validation. These are not synthetic
examples -- they capture actual LLM decisions, remediation outcomes, and pipeline behavior
from live Kubernetes clusters.
Deep Dives
- Multiple Remediation Paths -- How the LLM chose an alternative
fix for a GitOps-managed Certificate failure, and why both approaches are valid
- Remediation History Feedback -- How the LLM refused to
repeat a failed workflow after remediation history revealed the prior attempt's failure,
escalating to human review instead
- LLM Judgment and the Approval Gate -- How the LLM
selected a matching cleanup workflow but warned "no remediation warranted," deferring to
a human operator via the approval gate
Demo Scenario Catalog
The following scenarios demonstrate Kubernaut's remediation pipeline across different failure
modes, infrastructure patterns, and pipeline behaviors. Each scenario injects a fault,
triggers a Prometheus alert, and validates the full remediation lifecycle.
All scenarios are available in the kubernaut-demo-scenarios repository with run scripts, manifests, and step-by-step instructions.
Pod Lifecycle
| Scenario |
Signal |
Remediation |
Description |
| crashloop |
KubePodCrashLooping |
RollbackDeployment |
Bad ConfigMap causes CrashLoopBackOff, rollback restores previous revision |
| crashloop-helm |
KubePodCrashLooping |
HelmRollback |
Helm-managed workload crash, Helm rollback to last known-good release |
| stuck-rollout |
KubeDeploymentRolloutStuck |
RollbackDeployment |
Bad image tag stalls rollout, rollback restores healthy revision |
| memory-leak |
ContainerMemoryExhaustionPredicted |
GracefulRestart |
Proactive: predict_linear() detects memory growth before OOM |
| memory-escalation |
ContainerMemoryHigh |
IncreaseMemoryLimits |
Memory usage exceeds threshold; escalates to human if limits are already high |
| slo-burn |
ErrorBudgetBurn |
RollbackDeployment |
Error budget burn rate exceeds threshold, proactive rollback |
GitOps
| Scenario |
Signal |
Remediation |
Description |
| gitops-drift |
KubePodCrashLooping |
GitRevertCommit |
Bad ConfigMap commit in Gitea, LLM selects git revert over kubectl rollback |
| cert-failure-gitops |
CertManagerCertNotReady |
GitRevertCommit / FixCertificate |
Broken ClusterIssuer via git; LLM may choose git revert or direct fix (details) |
| disk-pressure-emptydir |
PredictedDiskPressure |
AnsiblePVCMigration |
PostgreSQL on emptyDir fills disk; Ansible/AWX runs pg_dump, commits PVC migration to Git, ArgoCD syncs, pg_restore completes migration |
Infrastructure
| Scenario |
Signal |
Remediation |
Description |
| cert-failure |
CertManagerCertNotReady |
FixCertificate |
CA Secret deleted, workflow recreates it to restore certificate issuance |
| hpa-maxed |
KubeHpaMaxedOut |
ScaleHPA |
HPA at max replicas under sustained load |
| resource-contention |
OOMKilled |
IncreaseMemoryLimits |
Memory contention causes OOM kills across competing workloads |
| resource-quota-exhaustion |
KubeResourceQuotaExhausted |
AdjustResourceQuota |
Namespace quota prevents scaling (details) |
| network-policy-block |
KubePodCrashLooping / KubeDeploymentReplicasMismatch |
FixNetworkPolicy |
Deny-all NetworkPolicy blocks traffic; readiness-based signal self-resolves after remediation |
| statefulset-pvc-failure |
KubeStatefulSetReplicasMismatch |
FixStatefulSetPVC |
PVC binding failure prevents StatefulSet pod scheduling |
Multi-Node
| Scenario |
Signal |
Remediation |
Description |
| autoscale |
KubePodSchedulingFailed |
AddNode |
Cluster autoscaling via kubeadm join |
| pending-taint |
KubePodNotScheduled |
RemoveTaint |
Node taint prevents pod scheduling |
| pdb-deadlock |
KubePodDisruptionBudgetAtLimit |
ResolvePDBDeadlock |
PodDisruptionBudget prevents necessary evictions |
| node-notready |
KubeNodeNotReady |
CordonDrain |
Node health failure triggers cordon and workload migration |
Advanced Pipeline
| Scenario |
Signal |
Remediation |
Description |
| duplicate-alert-suppression |
KubePodCrashLooping |
RollbackDeployment |
Validates that duplicate alerts for the same incident are suppressed |
| concurrent-cross-namespace |
KubePodCrashLooping |
Per-team workflow |
Two teams hit the same fault; LLM selects different workflows based on risk labels |
| orphaned-pvc-no-action |
KubePersistentVolumeClaimOrphaned |
None (NoActionRequired) / CleanupPVC |
Without cleanup workflow: LLM concludes no action needed. With workflow in catalog: LLM selects it but warns "no remediation warranted," deferring to human approval (details) |
Service Mesh
| Scenario |
Signal |
Remediation |
Description |
| mesh-routing-failure |
IstioHighDenyRate / IstioRequestsUnauthorized |
FixServiceMeshRouting |
Istio service mesh AuthorizationPolicy misconfiguration |
Unvalidated
These scenarios have scaffolding (manifests, run.sh, workflow) but have not been validated end-to-end on any platform. Do not rely on them until they are promoted to a category above.
| Scenario |
Signal |
Remediation |
Description |
Blocker |
| memory-limits-gitops-ansible |
ContainerOOMKilling |
AnsibleMemoryLimitsUpdate |
OOMKill on GitOps-managed deployment; Ansible/AWX updates limits in Git, ArgoCD syncs |
Requires ArgoCD + AWX; not tested on Kind or OCP |