Building Confidence with Kubernaut
Operators rarely hand full control of cluster remediation to an AI on day one. Kubernaut provides an incremental Trust Ladder — a four-level graduation path that lets teams build confidence at their own pace, starting with full human oversight and progressing toward autonomous remediation as trust grows.
The Trust Ladder
```mermaid
graph LR
    subgraph v14["Available (v1.4)"]
        L1["Level 1<br/><b>Observe</b><br/><small>Global dry-run</small>"]
        L3["Level 3<br/><b>Approve</b><br/><small>RAR gate</small>"]
        L4["Level 4<br/><b>Automate</b><br/><small>Full autonomous</small>"]
    end
    subgraph v15["Planned (v1.5)"]
        L2["Level 2<br/><b>Selective Trust</b><br/><small>Per-workflow dry-run</small>"]
    end
    L1 --> L2 --> L3 --> L4
```
| Level | Name | Human Involvement | Available |
|---|---|---|---|
| 1 | Observe | Operator sees what Kubernaut would do — no execution | v1.4 (global dry-run) |
| 2 | Selective Trust | Trusted workflows execute; new ones stay in dry-run | v1.5 (per-workflow dry-run) |
| 3 | Approve | Rego policy gates remediation via RAR — operator approves/rejects | v1.4 |
| 4 | Automate | Matched workflows execute without human intervention | v1.4 |
Operators typically start at Level 1 (observe) or Level 3 (approve) and graduate individual workflows or entire namespaces to Level 4 as they gain confidence in Kubernaut's decision-making.
Level 3: Approve (Available Now)
At this level, the Rego approval policy determines which remediations require human review. When require_approval evaluates to true, a RemediationApprovalRequest (RAR) is created before execution. The operator reviews Kubernaut's recommendation — the selected workflow, confidence score, root cause analysis, and detected infrastructure labels — and either approves or rejects it.
How it works
- Alert arrives → Signal Processing enriches it → Kubernaut Agent investigates root cause
- Kubernaut Agent selects a workflow with a confidence score
- The Rego approval policy evaluates the selection and creates a RAR if approval is needed
- Operator receives a notification with the full RCA and proposed remediation
- Operator approves (execution proceeds), rejects (remediation stops), or overrides (substitutes workflow parameters via WorkflowOverride)
- If no action is taken within the configured timeout, the RAR expires (default: 15 minutes, configurable via spec.requiredBy)
Configuration
The approval gate is controlled by the Rego policy deployed in the aianalysis-policies ConfigMap. The Helm chart does not ship a default policy — operators must supply one via aianalysis.policies.content or aianalysis.policies.existingConfigMap.
A typical starter policy requires approval for:
- Production namespaces (case-insensitive)
- Sensitive resource kinds (Node, StatefulSet)
- Missing remediation targets (safety net)
- Low-confidence selections (below aianalysis.rego.confidenceThreshold, default 0.8)
Non-production namespaces with a valid remediation target and high confidence auto-approve.
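The starter rules above could be expressed roughly as follows. This is a minimal sketch, not the policy Kubernaut documents. `input.environment` and `input.remediation_target.kind` appear in the Level 4 example on this page, but `input.confidence` and `input.confidence_threshold` are hypothetical field names, because this page does not say how the `aianalysis.rego.confidenceThreshold` value reaches the policy. Check the Rego Policies reference for the real input schema.

```rego
package aianalysis.approval

import rego.v1

# Fail closed: any remediation not explicitly cleared below gets a RAR.
default require_approval := true

# Auto-approve only when every starter condition is met.
require_approval := false if {
    is_non_production
    has_target
    not is_sensitive_kind
    is_confident
}

# Allowlist of non-production environments, matched case-insensitively.
non_production_environments := {"staging", "development", "qa", "test"}

is_non_production if lower(input.environment) in non_production_environments

# A missing target leaves this rule undefined, so the default (approval) applies.
has_target if input.remediation_target.kind != ""

sensitive_kinds := {"Node", "StatefulSet"}

is_sensitive_kind if input.remediation_target.kind in sensitive_kinds

# ASSUMPTION: hypothetical field names; both values are assumed to be numbers.
is_confident if input.confidence >= input.confidence_threshold
```

The sketch uses an allowlist of non-production environments rather than a check for "production", which makes it slightly stricter than the description above: an unrecognized or missing environment falls back to requiring approval.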
To require approval for everything (strictest Level 3), replace the starter policy with:
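A minimal sketch of such a policy, assuming the `aianalysis.approval` package and the `require_approval` rule name used in the Level 4 example on this page. With no rule that sets `require_approval` to false, the default applies to every remediation:

```rego
package aianalysis.approval

import rego.v1

# No rule overrides this default, so every remediation creates a RAR.
default require_approval := true
```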
To adjust the confidence threshold for approval:
```yaml
# Helm values
aianalysis:
  rego:
    confidenceThreshold: "0.9"  # Require approval below 90% confidence (default: 0.8)
```
Operator workflow overrides (v1.4)
When approving a RAR, operators can substitute the AI-selected workflow or adjust its parameters via status.workflowOverride. Override requests are validated by the authwebhook and recorded in the audit trail. See Operator Workflow Overrides for details.
Alignment gate (v1.4)
When shadow-agent alignment is enabled, Kubernaut runs a secondary AI evaluation to verify the primary agent's recommendation. If alignment fails, the pipeline creates a ManualReviewRequired notification and stops execution — even if the Rego policy would have auto-approved. This provides an additional safety layer independent of the trust level.
Graduation signals
You're ready to graduate a workflow to Level 4 when:
- The Effectiveness Monitor consistently scores remediations as successful (Full assessment reason with high weighted scores)
- You've approved the same workflow type multiple times without rejecting
- The workflow targets non-sensitive resources in well-understood namespaces
- Your team is comfortable with the workflow's blast radius
References
- Human Approval — full RAR lifecycle, Rego policy evaluation, and operator actions
- AIAnalysis Approval Policy — ConfigMap reference and default behavior
- Rego Policies — approval policy input fields and rules
Level 4: Automate (Available Now)
At this level, matched workflows execute without human intervention. The operator monitors outcomes via notifications and Effectiveness Monitor dashboards.
How it works
- Alert arrives → Signal Processing → Kubernaut Agent → workflow selected
- Rego policy evaluates and does not require approval (auto-approved)
- Workflow executes immediately
- Operator receives completion/failure notifications
- Effectiveness Monitor verifies the fix worked (health checks, alert resolution, spec hash comparison, metrics)
- Effectiveness scores feed back into future investigations
Configuration
Auto-approval happens when the Rego policy evaluates require_approval to false. With the starter policy described under Level 3, this occurs for:
- Non-production namespaces (staging, development, qa, test)
- Non-sensitive resource kinds (anything other than Node/StatefulSet)
- Valid remediation target present
To allow specific production workflows to run autonomously, customize the Rego policy:
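```rego
package aianalysis.approval

import rego.v1

# Require approval unless a rule below explicitly clears the remediation.
default require_approval := true

# Non-production remediations are auto-approved.
require_approval := false if {
    not is_production
}

# Production remediations are auto-approved only for trusted workflows
# that do not target a sensitive resource kind.
require_approval := false if {
    is_production
    input.remediation_target.kind != "Node"
    input.remediation_target.kind != "StatefulSet"
    input.detected_labels["workflow_name"] in trusted_production_workflows
}

trusted_production_workflows := {
    "crashloop-rollback-v1",
    "restart-pod-v1",
    "increase-memory-limits-v1",
}

# The environment match is case-insensitive.
is_production if {
    lower(input.environment) == "production"
}
```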
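Before deploying a customized policy like the one above, it can be exercised offline with OPA's test runner. The sketch below assumes the policy is saved in the same directory and that the input document has exactly the shape the policy reads (`environment`, `remediation_target.kind`, `detected_labels.workflow_name`); the real input may carry more fields. `unknown-workflow-v1` is a hypothetical name used only to represent a workflow outside the trusted set.

```rego
package aianalysis.approval_test

import rego.v1

import data.aianalysis.approval

# Trusted workflow on a non-sensitive kind in production: auto-approved.
test_trusted_production_workflow_auto_approved if {
    approval.require_approval == false with input as {
        "environment": "production",
        "remediation_target": {"kind": "Deployment"},
        "detected_labels": {"workflow_name": "restart-pod-v1"},
    }
}

# Sensitive kind in production: approval is required even for a trusted workflow.
# The mixed-case environment checks the case-insensitive match.
test_production_node_requires_approval if {
    approval.require_approval == true with input as {
        "environment": "Production",
        "remediation_target": {"kind": "Node"},
        "detected_labels": {"workflow_name": "restart-pod-v1"},
    }
}

# A workflow outside the trusted set falls back to the default.
test_untrusted_production_workflow_requires_approval if {
    approval.require_approval == true with input as {
        "environment": "production",
        "remediation_target": {"kind": "Deployment"},
        "detected_labels": {"workflow_name": "unknown-workflow-v1"},
    }
}

# Non-production is cleared by the first rule.
test_staging_auto_approved if {
    approval.require_approval == false with input as {
        "environment": "staging",
        "remediation_target": {"kind": "Deployment"},
        "detected_labels": {"workflow_name": "unknown-workflow-v1"},
    }
}
```

Running `opa test . -v` in the directory containing both files reports each test by name, which makes it easy to see which rule changed behavior after an edit to the trusted set.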
Monitoring autonomous remediations
At Level 4, monitoring replaces manual review:
| Signal | What to watch | Where |
|---|---|---|
| Effectiveness scores | Consistent Full assessments with high scores | Effectiveness Monitor metrics, audit events |
| Notification volume | Sudden spike may indicate oscillation | Notification metrics (kubernaut_notification_reconciler_active) |
| Circuit breaker state | Channel health degradation | kubernaut_notification_channel_circuit_breaker_state |
| Assessment reasons | Expired, MetricsTimedOut, or Unrecoverable indicate infrastructure issues | kubernaut_effectivenessmonitor_assessments_completed_total |
| ManualReviewRequired | Remediation needs human attention | Notification pipeline (ManualReview NRs) |
Rollback to Level 3
To move a workflow (or all workflows) back to Level 3, update the Rego policy to require approval. Changes to the policy ConfigMap take effect on the next remediation cycle — no pod restart required.
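As a sketch of demoting a single workflow, the Level 4 policy from this page can be redeployed with that workflow removed from `trusted_production_workflows`, so it falls through to `default require_approval := true`. The example assumes `increase-memory-limits-v1` is the workflow being demoted; the rest of the policy is unchanged.

```rego
package aianalysis.approval

import rego.v1

default require_approval := true

require_approval := false if {
    not is_production
}

require_approval := false if {
    is_production
    input.remediation_target.kind != "Node"
    input.remediation_target.kind != "StatefulSet"
    input.detected_labels["workflow_name"] in trusted_production_workflows
}

# "increase-memory-limits-v1" is no longer listed, so production runs of it
# match no auto-approval rule and create a RAR again.
trusted_production_workflows := {
    "crashloop-rollback-v1",
    "restart-pod-v1",
}

is_production if {
    lower(input.environment) == "production"
}
```

To move every workflow back to Level 3, remove both `require_approval := false` rules so that only the default remains.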
References
- Effectiveness Monitoring — how Kubernaut verifies fixes
- Monitoring — Prometheus metrics for all services
- Notification Channels — configuring alerts for autonomous operations
Level 1: Observe (Available, v1.4)
At this level, Kubernaut runs the full pipeline through AI Analysis — investigating root cause and selecting a workflow — but stops before execution. No WorkflowExecution, RemediationApprovalRequest, or EffectivenessAssessment CRDs are created. The RemediationRequest completes with outcome DryRun.
How it works
- Alert arrives → Signal Processing enriches it → Kubernaut Agent investigates root cause
- Kubernaut Agent selects a workflow with a confidence score
- Pipeline stops: the RR completes with outcome DryRun
- A dryRunHoldPeriod is set on the RR to suppress re-triggering for the same signal fingerprint (default: 1 hour)
- Operator reviews the RCA and selected workflow via audit events or notifications
Configuration
Enable dry-run mode in the Remediation Orchestrator config:
```yaml
# remediationorchestrator-config ConfigMap
remediationOrchestrator:
  dryRun: true
  dryRunHoldPeriod: "1h"  # minimum: 5m
```
- dryRun: When true, the pipeline stops after AI Analysis. Default: false.
- dryRunHoldPeriod: Duration to suppress new RR creation for the same signal fingerprint after a dry-run completion. Minimum: 5m. Default: 1h.
Goal: Understand Kubernaut's decision-making without any risk. This is the recommended starting point for new installations.
Level 2: Selective Trust (Planned, v1.5)
Not yet available
Level 2 depends on per-workflow dry-run overrides (kubernaut#116), planned for v1.5. Global dry-run (Level 1) is available in v1.4.
At this level, trusted workflows that have been validated in dry-run mode graduate to real execution, while new or untested workflows continue to complete with outcome DryRun.
Planned capabilities:
- Per-workflow dry-run overrides (disable dry-run for trusted workflows)
- New workflows automatically enter dry-run until explicitly graduated
- Effectiveness Monitor data drives graduation confidence
Goal: Graduate individual workflows as confidence grows, building a library of trusted automations.
Suggestions: Always-On Safety Net (Planned, v1.5)
Not yet available
The Suggestions feature depends on kubernaut#115, planned for v1.5.
When no workflow matches an alert — at any trust level — Kubernaut will suggest step-by-step remediation actions via an LLM-generated Suggestion RAR. This is orthogonal to the trust ladder and operates as a permanent safety net for unknown scenarios.
Planned capabilities:
- LLM-generated remediation steps when no catalog workflow matches
- Natural language investigation via MCP/A2A protocols to refine the suggestion
- Option to convert a validated suggestion into a new registered workflow
Goal: Novel incidents become automated workflows through operator-AI collaboration. The workflow library grows organically.
Recommended Adoption Path
For teams new to Kubernaut, we recommend the following progression:
| Timeline | Phase | What to do |
|---|---|---|
| Week 1 | Observe | Install with dryRun: true (Level 1). Kubernaut investigates and selects workflows but does not execute. Review RCAs and workflow selections via audit events. |
| Week 2–3 | Approve | Disable dry-run, deploy a Rego approval policy (Level 3). Production remediations require human approval; non-production auto-approves based on your policy. |
| Week 3–4 | Validate | Review RAR decisions. Check Effectiveness Monitor scores. Build familiarity with Kubernaut's recommendations. |
| Month 2 | Graduate non-prod | Confirm non-production workflows are consistently effective. Monitor autonomous execution. |
| Month 2–3 | Graduate prod workflows | Customize the Rego policy to auto-approve specific, well-validated production workflows (e.g., crashloop-rollback-v1). |
| Month 3+ | Expand | Add new workflow types in Level 3 (approval). Graduate to Level 4 as confidence grows. |
Future: Agentic Enhancements (v1.5+)
The v1.5 release will introduce agentic integration features that enhance every trust level:
| Feature | Enhancement |
|---|---|
| MCP Interactive Mode (#703) | Operators investigate and review remediations through any MCP-compatible chat interface |
| Kubernaut Console (#713) | Web dashboard with chat UI, live remediation streaming, and workflow management |
| A2A Protocol (#705) | External AI agents can delegate remediation to Kubernaut |
| Natural Language Investigation (#714) | Trigger investigations by describing the problem in plain text |
These features are complementary to the Trust Ladder — they enhance how operators interact at each level (e.g., MCP chat during dry-run review at Level 1, Console dashboards for monitoring at Level 4) without changing the fundamental graduation model.