Building Confidence with Kubernaut

Operators rarely hand full control of cluster remediation to an AI on day one. Kubernaut provides an incremental Trust Ladder — a four-stage graduation path that lets teams build confidence at their own pace, starting with full human oversight and progressing toward autonomous remediation as trust grows.

The Trust Ladder

Level	Name	Description
1	Observe	Operator sees what Kubernaut would do — no execution (global dry-run)
2	Approve	Rego policy gates remediation via RAR — operator approves/rejects/overrides
3	Security & Autonomy	SAR-based tool authorization, per-persona ClusterRoles, interactive MCP sessions, A2A delegation
4	Full Autonomy	Matched workflows execute without human intervention

At every level, operators can connect via Interactive MCP Sessions (v1.5) for real-time investigation and workflow selection.

Operators typically start at Stage 1 (observe) or Stage 2 (approve) and graduate individual workflows or entire namespaces to Stage 4 as they gain confidence in Kubernaut's decision-making.

Stage 2: Approve (Available Now)

At this level, the Rego approval policy determines which remediations require human review. When require_approval evaluates to true, a RemediationApprovalRequest (RAR) is created before execution. The operator reviews Kubernaut's recommendation (the selected workflow, confidence score, root cause analysis, and detected infrastructure labels) and then approves, rejects, or overrides it.

How it works

Alert arrives → Signal Processing enriches it → Kubernaut Agent investigates root cause
Kubernaut Agent selects a workflow with a confidence score
The Rego approval policy evaluates the selection and creates a RAR if approval is needed
Operator receives a notification with the full RCA and proposed remediation
Operator approves (execution proceeds), rejects (remediation stops), or overrides (substitutes the workflow or adjusts its parameters via WorkflowOverride)
If no action is taken within the configured timeout, the RAR expires (default: 15 minutes, configurable via spec.requiredBy)
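The outcomes in the steps above can be modeled in a few lines of Python. This is a minimal sketch, not Kubernaut's implementation: the `ApprovalRequest` class, the `resolve` function, and the outcome strings are hypothetical names chosen for illustration. Only the three operator actions and the 15-minute default expiry come from the text.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class ApprovalRequest:
    """Hypothetical stand-in for a RemediationApprovalRequest (RAR)."""
    created_at: datetime
    timeout: timedelta = timedelta(minutes=15)  # default expiry from the text
    decision: Optional[str] = None  # "approve", "reject", "override", or None


def resolve(rar: ApprovalRequest, now: datetime) -> str:
    """Return the modeled pipeline outcome for a RAR at time `now`."""
    if rar.decision in ("approve", "override"):
        # An override also proceeds, but with operator-supplied workflow details.
        return "execute"
    if rar.decision == "reject":
        return "stopped"
    if now - rar.created_at >= rar.timeout:
        return "expired"
    return "pending"


created = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
# No operator action after 20 minutes: the request has expired.
print(resolve(ApprovalRequest(created), created + timedelta(minutes=20)))
```

The model treats a recorded decision as final and checks expiry only while the request is still undecided. How the real controller handles a decision that arrives after the deadline is not covered here.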

Configuration

The approval gate is controlled by the Rego policy deployed in the aianalysis-policies ConfigMap. The Helm chart does not ship a default policy — operators must supply one via aianalysis.policies.content or aianalysis.policies.existingConfigMap.

A typical starter policy requires approval for:

Production namespaces (case-insensitive)
Sensitive resource kinds (Node, StatefulSet)
Missing remediation targets (safety net)
Low-confidence selections (below aianalysis.rego.confidenceThreshold, default 0.8)

Non-production namespaces with a valid remediation target and high confidence auto-approve.
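To make the starter rules concrete, the following Python function mirrors them as a truth table. It is an illustrative model only: the real gate is a Rego policy, and the function name, parameter names, and the choice to pass the environment as a plain string are assumptions made for this sketch.

```python
from typing import Optional

SENSITIVE_KINDS = {"Node", "StatefulSet"}


def require_approval(
    environment: str,
    target_kind: Optional[str],
    confidence: float,
    threshold: float = 0.8,  # mirrors the confidenceThreshold default
) -> bool:
    """Model of the starter policy: True means a human must review."""
    if environment.lower() == "production":  # case-insensitive match
        return True
    if target_kind is None:  # missing remediation target (safety net)
        return True
    if target_kind in SENSITIVE_KINDS:
        return True
    return confidence < threshold  # low-confidence selections


print(require_approval("staging", "Deployment", 0.92))     # False
print(require_approval("Production", "Deployment", 0.92))  # True
```

A selection exactly at the threshold auto-approves in this model, because the text describes the rule as applying to selections below the threshold.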

To require approval for every remediation (the strictest form of Stage 2), deploy this policy:

package aianalysis.approval

default require_approval := true

To adjust the confidence threshold for approval:

# Helm values
aianalysis:
  rego:
    confidenceThreshold: "0.9"  # Require approval below 90% confidence (default: 0.8)

Operator workflow overrides (v1.4)

When approving a RAR, operators can substitute the AI-selected workflow or adjust its parameters via status.workflowOverride. Override requests are validated by the authwebhook and recorded in the audit trail. See Operator Workflow Overrides for details.

Alignment gate (v1.4)

When shadow-agent alignment is enabled, Kubernaut runs a secondary AI evaluation to verify the primary agent's recommendation. If alignment fails, the pipeline creates a ManualReviewRequired notification and stops execution — even if the Rego policy would have auto-approved. This provides an additional safety layer independent of the trust stage.
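The precedence described above, where a failed alignment check wins over an auto-approving policy, can be expressed as a small decision function. This is a sketch under assumptions: `next_step` and its return values are hypothetical, and the exact point in the pipeline where the alignment check runs is not specified in this page.

```python
def next_step(
    alignment_enabled: bool,
    alignment_passed: bool,
    policy_requires_approval: bool,
) -> str:
    """Model which path a remediation takes after workflow selection."""
    if alignment_enabled and not alignment_passed:
        # Alignment failure stops execution regardless of the Rego result.
        return "manual_review"
    if policy_requires_approval:
        return "await_approval"
    return "execute"


# The policy would auto-approve, but the shadow agent disagrees.
print(next_step(alignment_enabled=True, alignment_passed=False,
                policy_requires_approval=False))  # manual_review
```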

Graduation signals

You're ready to graduate a workflow to Stage 4 when:

The Effectiveness Monitor consistently scores remediations as successful (Full assessment reason with high weighted scores)
You've approved the same workflow type multiple times without rejecting
The workflow targets non-sensitive resources in well-understood namespaces
Your team is comfortable with the workflow's blast radius

References

Human Approval — full RAR lifecycle, Rego policy evaluation, and operator actions
AIAnalysis Approval Policy — ConfigMap reference and default behavior
Rego Policies — approval policy input fields and rules

Stage 4: Full Autonomy (Available Now)

At this level, matched workflows execute without human intervention. The operator monitors outcomes via notifications and Effectiveness Monitor dashboards.

How it works

Alert arrives → Signal Processing → Kubernaut Agent → workflow selected
Rego policy evaluates and does not require approval (auto-approved)
Workflow executes immediately
Operator receives completion/failure notifications
Effectiveness Monitor verifies the fix worked (health checks, alert resolution, spec hash comparison, metrics)
Effectiveness scores feed back into future investigations

Configuration

Auto-approval happens when the Rego policy evaluates require_approval to false. With the typical starter policy described under Stage 2, this requires all of the following:

Non-production namespace (staging, development, qa, test)
Non-sensitive resource kind (anything other than Node or StatefulSet)
Valid remediation target present
Confidence at or above aianalysis.rego.confidenceThreshold (default 0.8)

To allow specific production workflows to run autonomously, customize the Rego policy:

package aianalysis.approval

import rego.v1

default require_approval := true

require_approval := false if {
    not is_production
}

require_approval := false if {
    is_production
    input.remediation_target.kind != "Node"
    input.remediation_target.kind != "StatefulSet"
    input.detected_labels["workflow_name"] in trusted_production_workflows
}

trusted_production_workflows := {
    "crashloop-rollback-v1",
    "restart-pod-v1",
    "increase-memory-limits-v1",
}

is_production if {
    lower(input.environment) == "production"
}
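Before deploying a policy like the one above, it can help to enumerate which inputs it auto-approves. The Python function below is a hand-written mirror of that Rego, offered as a sketch and not a substitute for evaluating the policy with OPA. The dictionary keys follow the `input` fields used in the policy; the function name is an assumption.

```python
TRUSTED_PRODUCTION_WORKFLOWS = {
    "crashloop-rollback-v1",
    "restart-pod-v1",
    "increase-memory-limits-v1",
}


def require_approval(inp: dict) -> bool:
    """Python mirror of the custom policy: True means approval is required."""
    env = inp.get("environment")
    is_production = isinstance(env, str) and env.lower() == "production"
    if not is_production:
        return False  # first rule: anything outside production auto-approves
    kind = (inp.get("remediation_target") or {}).get("kind")
    workflow = (inp.get("detected_labels") or {}).get("workflow_name")
    if (
        kind is not None
        and kind not in {"Node", "StatefulSet"}
        and workflow in TRUSTED_PRODUCTION_WORKFLOWS
    ):
        return False  # second rule: trusted workflow on a non-sensitive kind
    return True  # default


print(require_approval({
    "environment": "production",
    "remediation_target": {"kind": "Deployment"},
    "detected_labels": {"workflow_name": "restart-pod-v1"},
}))  # False
```

One consequence worth checking against your own requirements: as written, the first rule auto-approves every non-production remediation, including those targeting a Node or StatefulSet and those with low confidence. Add those conditions to the first rule if they should still gate non-production.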

Monitoring autonomous remediations

At Stage 4, monitoring replaces manual review:

Signal	What to watch	Where
Effectiveness scores	Consistent `Full` assessments with high scores	Effectiveness Monitor metrics, audit events
Notification volume	Sudden spike may indicate oscillation	Notification metrics (`kubernaut_notification_reconciler_active`)
Circuit breaker state	Channel health degradation	`kubernaut_notification_channel_circuit_breaker_state`
Assessment reasons	`Expired`, `MetricsTimedOut`, or `Unrecoverable` indicate infrastructure issues	`kubernaut_effectivenessmonitor_assessments_completed_total`
ManualReviewRequired	Remediation needs human attention	Notification pipeline (ManualReview NRs)
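As one way to act on the assessment-reason signal, the sketch below computes the share of assessments that ended with an infrastructure-related reason. It assumes the per-reason counts have already been read from the metrics backend into a dictionary. The function name and input shape are hypothetical, and any alerting threshold is left to the operator.

```python
INFRA_REASONS = {"Expired", "MetricsTimedOut", "Unrecoverable"}


def infra_issue_ratio(counts: dict) -> float:
    """Fraction of completed assessments with an infrastructure-related reason."""
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no assessments yet, nothing to flag
    infra = sum(n for reason, n in counts.items() if reason in INFRA_REASONS)
    return infra / total


print(infra_issue_ratio({"Full": 8, "Expired": 1, "MetricsTimedOut": 1}))  # 0.2
```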

Rollback to Stage 2

To move a workflow (or all workflows) back to Stage 2, update the Rego policy to require approval. Changes to the policy ConfigMap take effect on the next remediation cycle, so no pod restart is required.

References

Effectiveness Monitoring — how Kubernaut verifies fixes
Monitoring — Prometheus metrics for all services
Notification Channels — configuring alerts for autonomous operations

Stage 1: Observe (Available, v1.4)

At this level, Kubernaut runs the full pipeline through AI Analysis — investigating root cause and selecting a workflow — but stops before execution. No WorkflowExecution, RemediationApprovalRequest, or EffectivenessAssessment CRDs are created. The RemediationRequest completes with outcome DryRun.

How it works

Alert arrives → Signal Processing enriches it → Kubernaut Agent investigates root cause
Kubernaut Agent selects a workflow with a confidence score
Pipeline stops — the RR completes with outcome DryRun
A dryRunHoldPeriod is set on the RR to suppress re-triggering for the same signal fingerprint (default: 1 hour)
Operator reviews the RCA and selected workflow via audit events or notifications

Configuration

Enable dry-run mode in the Remediation Orchestrator config:

# remediationorchestrator-config ConfigMap
remediationOrchestrator:
  dryRun: true
  dryRunHoldPeriod: "1h"  # minimum: 5m

dryRun — When true, the pipeline stops after AI Analysis. Default: false.
dryRunHoldPeriod — Duration to suppress new RR creation for the same signal fingerprint after a dry-run completion. Minimum: 5m. Default: 1h.
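The hold period behaves like a per-fingerprint cooldown. The class below is a minimal model of that behavior, assuming the hold is measured from the dry-run completion time. `DryRunHold` and its method names are hypothetical and do not correspond to Kubernaut types; the 1h default and 5m minimum come from the settings above.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict

MIN_HOLD = timedelta(minutes=5)  # documented minimum


class DryRunHold:
    """Model of suppressing new RRs for a fingerprint after a dry-run."""

    def __init__(self, hold_period: timedelta = timedelta(hours=1)) -> None:
        if hold_period < MIN_HOLD:
            raise ValueError("dryRunHoldPeriod must be at least 5m")
        self.hold_period = hold_period
        self._completed_at: Dict[str, datetime] = {}

    def record_dry_run(self, fingerprint: str, completed_at: datetime) -> None:
        """Remember when a dry-run for this signal fingerprint completed."""
        self._completed_at[fingerprint] = completed_at

    def allows_new_rr(self, fingerprint: str, now: datetime) -> bool:
        """True if no hold is active for this fingerprint at `now`."""
        last = self._completed_at.get(fingerprint)
        return last is None or now - last >= self.hold_period


hold = DryRunHold()
done = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
hold.record_dry_run("fp-123", done)
print(hold.allows_new_rr("fp-123", done + timedelta(minutes=30)))  # False
print(hold.allows_new_rr("fp-123", done + timedelta(hours=1)))     # True
```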

Goal: Understand Kubernaut's decision-making without any risk. This is the recommended starting point for new installations.

Stage 3: Security & Autonomy (v1.5)

At this level, organizations layer SAR-based tool authorization and interactive MCP sessions on top of the existing approval gate, establishing fine-grained control over who can do what and enabling operator-in-the-loop investigation.

What v1.5 adds

SAR-based tool authorization — Kubernetes-native SubjectAccessReview replaces file-based RBAC. Six per-persona ClusterRoles control which tools each group can invoke. See Security & RBAC: Tool Authorization.
Interactive MCP sessions — Operators connect via MCP for real-time investigation, workflow discovery with LLM-populated parameters, and guided remediation. See Interactive Sessions.
Session takeover security (SEC-TAKEOVER-001) — Identity-aware session management prevents privilege confusion during takeover.

Configuration

SAR authorization is configured via the API Frontend's rbac.personas values. Bind per-persona ClusterRoles to OIDC groups:

apifrontend:
  config:
    rbac:
      sarCacheTTL: 30s
      personas:
        sre: [kubernaut_list_remediations, kubernaut_get_remediation, ...]
        cicd: [kubernaut_list_remediations, kubernaut_get_remediation, kubernaut_watch]

See the Helm values reference for the full persona-to-tool mapping.
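The interaction between a per-persona tool list and a decision cache can be illustrated with a small Python model. This sketch makes several assumptions: it treats sarCacheTTL as the time an authorization decision is reused before it is checked again, it replaces the real SubjectAccessReview call with a static lookup, and `CachedAuthorizer`, `static_check`, and `PERSONA_TOOLS` are hypothetical names. Only the cicd tool list and the 30s value are taken from the example values above.

```python
import time
from typing import Callable, Dict, Tuple

# cicd tool list copied from the example values; other personas omitted.
PERSONA_TOOLS = {
    "cicd": {
        "kubernaut_list_remediations",
        "kubernaut_get_remediation",
        "kubernaut_watch",
    },
}


def static_check(persona: str, tool: str) -> bool:
    """Stand-in for a SubjectAccessReview: allow only listed tools."""
    return tool in PERSONA_TOOLS.get(persona, set())


class CachedAuthorizer:
    """Reuse authorization decisions for `ttl_seconds` before re-checking."""

    def __init__(
        self,
        check: Callable[[str, str], bool],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache: Dict[Tuple[str, str], Tuple[bool, float]] = {}

    def allowed(self, persona: str, tool: str) -> bool:
        key = (persona, tool)
        now = self._clock()
        hit = self._cache.get(key)
        if hit is not None and now - hit[1] < self._ttl:
            return hit[0]  # decision is still fresh
        decision = self._check(persona, tool)
        self._cache[key] = (decision, now)
        return decision


auth = CachedAuthorizer(static_check)
print(auth.allowed("cicd", "kubernaut_watch"))  # True
```

A shorter TTL makes permission changes take effect sooner at the cost of more authorization checks; a longer TTL does the reverse.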

Goal: Establish enterprise-grade security boundaries while enabling operator-AI collaboration.

Suggestions: Always-On Safety Net (Planned, v1.5)

Not yet available

The Suggestions feature depends on kubernaut#115, planned for v1.5.

When no workflow matches an alert — at any trust stage — Kubernaut will suggest step-by-step remediation actions via an LLM-generated Suggestion RAR. This is orthogonal to the trust ladder and operates as a permanent safety net for unknown scenarios.

Planned capabilities:

LLM-generated remediation steps when no catalog workflow matches
Natural language investigation via MCP/A2A protocols to refine the suggestion
Option to convert a validated suggestion into a new registered workflow

Goal: Novel incidents become automated workflows through operator-AI collaboration. The workflow library grows organically.

Recommended Adoption Path

For teams new to Kubernaut, we recommend the following progression:

Timeline	Phase	What to do
Week 1	Observe	Install with `dryRun: true` (Stage 1). Kubernaut investigates and selects workflows but does not execute. Review RCAs and workflow selections via audit events.
Week 2–3	Approve	Disable dry-run and deploy a Rego approval policy (Stage 2). Production remediations require human approval; non-production auto-approves based on your policy.
Week 3–4	Validate	Review RAR decisions. Check Effectiveness Monitor scores. Build familiarity with Kubernaut's recommendations.
Month 2	Graduate non-prod	Confirm non-production workflows are consistently effective. Monitor autonomous execution.
Month 2–3	Graduate prod workflows	Customize the Rego policy to auto-approve specific, well-validated production workflows (e.g., `crashloop-rollback-v1`).
Month 3+	Expand	Add new workflow types at Stage 2 (approval) and graduate them to Stage 4 as confidence grows.

Agentic Enhancements

v1.5 introduced agentic integration features that enhance every trust stage:

Feature	Status	Enhancement
MCP Interactive Mode (#703)	Shipped (v1.5)	Operators investigate and review remediations through any MCP-compatible chat interface
A2A Protocol	Shipped (v1.5)	External AI agents can delegate remediation to Kubernaut via `POST /a2a/invoke`
SAR Tool Authorization (PR #1222)	Shipped (v1.5)	Kubernetes-native per-persona tool authorization with 6 ClusterRoles
Kubernaut Console	Planned	Web dashboard with chat UI, live remediation streaming, and workflow management
Natural Language Investigation	Planned	Trigger investigations by describing the problem in plain text

These features are complementary to the Trust Ladder — they enhance how operators interact at each stage (e.g., MCP chat during dry-run review at Stage 1, Console dashboards for monitoring at Stage 4) without changing the fundamental graduation model.