# Why Kubernaut
## The Problem
When something breaks in a Kubernetes cluster — a pod crashlooping, a certificate expired, resources exhausted — an operator gets paged. They open a terminal, check alerts, read logs, correlate events with metrics, form a hypothesis, and execute a fix. If it doesn't work, they try something else.
This process depends on tribal knowledge, runbooks that drift out of date, and human availability. Mean time to resolution (MTTR) is measured in tens of minutes to hours. The same class of incidents recurs, and the response is manual every time.
Rule-based remediation tools improve this for known, deterministic problems. "If pod restarts exceed 5, delete it." "If memory exceeds 90%, scale up." They're fast, predictable, and easy to audit. But they can only match symptoms to predefined actions — they don't investigate why something is happening.
When the same symptom has multiple root causes, or the right fix depends on context the rule can't see, rule-based tools either pick the wrong action or do nothing.
## How Kubernaut Solves It
Kubernaut turns remediation into a declarative, AI-driven, closed-loop process:
- Detects the signal (Prometheus alert, Kubernetes event)
- Investigates the root cause using an LLM with live `kubectl` access, logs, metrics, and remediation history
- Selects a remediation workflow from a catalog based on the investigation, not a static rule
- Executes the fix via Tekton Pipelines, Kubernetes Jobs, or Ansible (AWX/AAP)
- Verifies the fix worked through health checks, alert resolution, and spec drift detection
- Notifies the team (Slack, console, file) with the full remediation outcome and effectiveness assessment
- Learns — effectiveness scores feed back into future investigations so the LLM avoids repeating what failed before
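The steps above can be sketched as plain control flow. Every callable and field name below is a stand-in for a real component, not Kubernaut's actual interfaces:

```python
def remediation_loop(signal, history, *, investigate, select, execute, verify, notify):
    """One pass of the closed loop: detect -> investigate -> select -> execute
    -> verify -> notify -> learn. All keyword arguments are stand-in callables."""
    diagnosis = investigate(signal, history)   # root cause, informed by past outcomes
    workflow = select(diagnosis)               # catalog choice, not a static rule
    result = execute(workflow)
    effective = verify(signal, result)         # closed loop, not fire-and-forget
    notify(signal, workflow, effective)
    history.append({"workflow": workflow, "effective": effective})  # feeds future runs
    return effective
```

The key structural point is the last two lines before the return: verification and the history append are part of every pass, which is what distinguishes this loop from a fire-and-forget trigger.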
See Architecture Overview for the full pipeline.
## Comparison: Rule-Based, Predictive AI, and Generative AI
The AIOps remediation landscape has three distinct approaches. Kubernaut uses generative AI but is designed to integrate with predictive AI platforms as complementary tools.
| Capability | Rule-Based | Predictive AI (Davis, Watchdog) | Kubernaut (Generative AI) |
|---|---|---|---|
| Trigger | Pattern match on alert name/labels | Statistical anomaly detection, baseline deviation | Same as rule-based — Prometheus alerts, K8s events |
| Root cause analysis | None — assumes symptom = cause | Topology-aware correlation across known dependency graphs | LLM investigates live cluster state, logs, metrics, and history |
| Novel failure handling | Cannot handle — no matching rule | Cannot handle — no historical baseline to correlate against | Reasons about novel situations using Kubernetes semantics and context |
| Remediation selection | Static mapping (if X then Y) | Triggers pre-configured runbooks | AI selects from workflow catalog based on investigation context |
| Context awareness | Alert labels only | Vendor telemetry (traces, metrics, topology) | Full cluster state, GitOps labels, Rego policies, business metadata |
| Verification | Typically fire-and-forget | Monitors recovery metrics | Closed-loop: health checks, alert resolution, spec hash comparison |
| Learning from failure | None — repeats the same action | Adjusts baselines over time | Effectiveness scores feed into future investigations |
| Cold start | None — works immediately | Weeks/months of baseline data required | None — useful from day one |
| Latency | Milliseconds | Seconds (pre-computed models) | 10-30s (LLM investigation) |
| Cost | None | Vendor license | Per-investigation token cost (includes LLM-driven workflow selection in the same session) |
| Auditability | Deterministic, easy to trace | Deterministic, vendor-specific dashboards | Full audit trail with 7-year retention (SOC2-aligned); LLM reasoning is probabilistic |
| Vendor coupling | Low | High — deep integration with vendor telemetry stack | Low — works with any monitoring stack |
Where rule-based tools win: speed, zero token cost, deterministic auditability, and simplicity for well-understood single-action problems. Kubernaut's workflow catalog uses label-based scoring to rank candidates, but the LLM makes the final selection: investigation, enrichment, and workflow selection all happen in one dedicated agent session (Phase 3).
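The label-based ranking mentioned above might look like the following. This is a sketch assuming a simple overlap count between investigation-derived labels and each workflow's labels; the label names and scoring rule are illustrative, not Kubernaut's actual algorithm:

```python
def rank_workflows(context_labels: set[str], catalog: dict[str, set[str]]) -> list[str]:
    """Order candidate workflows by label overlap with the investigation context.
    The final pick is still left to the LLM; this only ranks the candidates."""
    scores = {name: len(labels & context_labels) for name, labels in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical catalog entries with illustrative labels.
catalog = {
    "restart-pod":    {"symptom/crashloop", "kind/pod"},
    "raise-limits":   {"symptom/oomkill", "kind/pod", "cause/misconfigured-limit"},
    "evict-neighbor": {"symptom/oomkill", "kind/node", "cause/noisy-neighbor"},
}
context = {"symptom/oomkill", "kind/pod", "cause/misconfigured-limit"}
assert rank_workflows(context, catalog)[0] == "raise-limits"
```

The point of the two-stage design is that cheap, deterministic scoring narrows the field, while the probabilistic step (the LLM) only decides among already-plausible candidates.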
Where predictive AI fits: anomaly detection and topology-aware correlation for known failure patterns. Rather than competing with generative AI, predictive AI platforms are most valuable as knowledge-based agents that the LLM can query during investigation — confirming hypotheses, providing dependency context, and boosting confidence. See AIOps Remediation Landscape for the full integration architecture.
Where Kubernaut wins: novel or variable failures, multi-path remediation, environments where the same alert can have different root causes, and scenarios where verification and learning matter. When integrated with predictive AI, Kubernaut can cross-validate its root cause analysis against statistical correlations — increasing confidence when they agree, and flagging discrepancies when they don't.
For a detailed comparison against specific products and platforms in the agentic remediation space, see Agentic Remediation Market Comparison.
## When to Use Kubernaut
Good fit:
- Incidents where the root cause varies (e.g., OOMKill could be a memory leak, a misconfigured limit, or a noisy neighbor)
- Environments with many workflow types and the right choice depends on context
- Teams that want closed-loop verification, not fire-and-forget
- Organizations that need remediation history and effectiveness tracking for compliance
Consider simpler tools when:
- The problem is fully deterministic with a single known fix
- Latency under 1 second is critical
- The environment is simple enough that a handful of rules covers all cases
## Safety and Trust
Kubernaut is designed for production. The question operators ask is: "What happens when the LLM is wrong?"
- Human approval gates — `RemediationApprovalRequest` CRDs pause execution until an operator approves, for any workflow that requires it
- OPA/Rego policies — Constrain which remediations are allowed for which resources, namespaces, or conditions
- Blast radius controls — Scope management via `kubernaut.ai/managed=true` labels limits which resources Kubernaut can touch
- Cooldown periods — Prevent rapid re-remediation of the same resource
- Effectiveness verification — After execution, Kubernaut checks whether the fix actually worked before marking it successful
- Escalation — If remediation fails or the LLM isn't confident, Kubernaut escalates to a human with the full investigation context rather than retrying blindly
See Human Approval and Rego Policies for configuration details.
## The Feedback Loop
Most remediation tools operate in open loop: trigger, execute, done. Kubernaut closes the loop.
After every remediation, the effectiveness monitor evaluates whether the fix worked across four dimensions: pod health, alert resolution, metrics improvement, and spec drift detection. The result is an effectiveness score attached to the remediation record.
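A minimal sketch of such a score, assuming equal weights across the four dimensions; the weighting, and the use of a spec hash for drift detection, are assumptions for illustration, not Kubernaut's exact formula:

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Stable hash of a resource spec, used to detect post-remediation drift."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

def effectiveness(pod_healthy: bool, alert_resolved: bool, metrics_improved: bool,
                  spec_before: dict, spec_after: dict) -> float:
    """Score a remediation in [0, 1]: one equal-weight check per dimension."""
    no_drift = spec_hash(spec_before) == spec_hash(spec_after)
    checks = [pod_healthy, alert_resolved, metrics_improved, no_drift]
    return sum(checks) / len(checks)

spec = {"replicas": 3, "image": "app:v2"}
assert effectiveness(True, True, True, spec, dict(spec)) == 1.0   # full success
assert effectiveness(True, False, True, spec, {"replicas": 5, "image": "app:v2"}) == 0.5
```

Whatever the real weighting, the essential property is that the score is attached to the remediation record, so later investigations can see not just *what* was tried but *how well* it worked.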
When the same resource triggers a future alert, HolmesGPT receives the remediation history — including what was tried before and whether it worked. The LLM uses this to avoid repeating failed approaches and to select alternatives.
This means Kubernaut gets better at remediating a specific resource over time, without any manual tuning of rules or weights.
See Remediation History Feedback for a worked example.