# Why Kubernaut
## The Problem
When something breaks in a Kubernetes cluster — a pod crashlooping, a certificate expired, resources exhausted — an operator gets paged. They open a terminal, check alerts, read logs, correlate events with metrics, form a hypothesis, and execute a fix. If it doesn't work, they try something else.
This process depends on tribal knowledge, runbooks that drift out of date, and human availability. Mean time to resolution (MTTR) is measured in tens of minutes to hours. The same class of incidents recurs, and the response is manual every time.
Rule-based remediation tools improve this for known, deterministic problems. "If pod restarts exceed 5, delete it." "If memory exceeds 90%, scale up." They're fast, predictable, and easy to audit. But they can only match symptoms to predefined actions — they don't investigate why something is happening.
When the same symptom has multiple root causes, or the right fix depends on context the rule can't see, rule-based tools either pick the wrong action or do nothing.
## How Kubernaut Solves It
Kubernaut turns remediation into a declarative, AI-driven, closed-loop process:
- Detects the signal (Prometheus alert, Kubernetes event)
- Investigates the root cause using an LLM with live `kubectl` access, logs, metrics, and remediation history
- Selects a remediation workflow from a catalog based on the investigation, not a static rule
- Executes the fix via Tekton Pipelines, Kubernetes Jobs, or Ansible (AWX/AAP)
- Verifies the fix worked through health checks, alert resolution, and spec drift detection
- Notifies the team (Slack, console, file) with the full remediation outcome and effectiveness assessment
- Learns — effectiveness scores feed back into future investigations so the LLM avoids repeating what failed before
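The steps above can be sketched as plain control flow. Every callable and field name below is a stand-in for a real component, not Kubernaut's actual interfaces:

```python
def remediation_loop(signal, history, *, investigate, select, execute, verify, notify):
    """One pass of the closed loop: detect -> investigate -> select -> execute
    -> verify -> notify -> learn. All keyword arguments are stand-in callables."""
    diagnosis = investigate(signal, history)   # root cause, informed by past outcomes
    workflow = select(diagnosis)               # catalog choice, not a static rule
    result = execute(workflow)
    effective = verify(signal, result)         # closed loop, not fire-and-forget
    notify(signal, workflow, effective)
    history.append({"workflow": workflow, "effective": effective})  # feeds future runs
    return effective
```

The key structural point is the last two lines before the return: verification and the history append are part of every pass, which is what distinguishes this loop from a fire-and-forget trigger.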
See Architecture Overview for the full pipeline.
## Comparison: Rule-Based, Predictive AI, and Generative AI
The AIOps remediation landscape has three distinct approaches. Kubernaut uses generative AI but is designed to integrate with predictive AI platforms as complementary tools.
| Capability | Rule-Based | Predictive AI (Davis, Watchdog) | Kubernaut (Generative AI) |
|---|---|---|---|
| Trigger | Pattern match on alert name/labels | Statistical anomaly detection, baseline deviation | Same as rule-based — Prometheus alerts, K8s events |
| Root cause analysis | None — assumes symptom = cause | Topology-aware correlation across known dependency graphs | LLM investigates live cluster state, logs, metrics, and history |
| Novel failure handling | Cannot handle — no matching rule | Cannot handle — no historical baseline to correlate against | Reasons about novel situations using Kubernetes semantics and context |
| Remediation selection | Static mapping (if X then Y) | Triggers pre-configured runbooks | AI selects from workflow catalog based on investigation context |
| Context awareness | Alert labels only | Vendor telemetry (traces, metrics, topology) | Full cluster state, GitOps labels, Rego policies, business metadata |
| Verification | Typically fire-and-forget | Monitors recovery metrics | Closed-loop: health checks, alert resolution, spec hash comparison |
| Learning from failure | None — repeats the same action | Adjusts baselines over time | Effectiveness scores feed into future investigations |
| Cold start | None — works immediately | Weeks/months of baseline data required | None — useful from day one |
| Latency | Milliseconds | Seconds (pre-computed models) | 10-30s (LLM investigation) |
| Cost | None | Vendor license | Per-investigation token cost (includes LLM-driven workflow selection in the same session) |
| Auditability | Deterministic, easy to trace | Deterministic, vendor-specific dashboards | Full audit trail with 7-year retention (SOC2-aligned); LLM reasoning is probabilistic |
| Vendor coupling | Low | High — deep integration with vendor telemetry stack | Low — works with any monitoring stack |
Where rule-based tools win: speed, zero token cost, deterministic auditability, and simplicity for well-understood single-action problems. Kubernaut's workflow catalog uses label-based scoring to rank candidates, but the LLM makes the final selection: investigation, enrichment, and workflow selection all happen in one dedicated agent session (Phase 3).
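The label-based ranking mentioned above might look like the following. This is a sketch assuming a simple overlap count between investigation-derived labels and each workflow's labels; the label names and scoring rule are illustrative, not Kubernaut's actual algorithm:

```python
def rank_workflows(context_labels: set[str], catalog: dict[str, set[str]]) -> list[str]:
    """Order candidate workflows by label overlap with the investigation context.
    The final pick is still left to the LLM; this only ranks the candidates."""
    scores = {name: len(labels & context_labels) for name, labels in catalog.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical catalog entries with illustrative labels.
catalog = {
    "restart-pod":    {"symptom/crashloop", "kind/pod"},
    "raise-limits":   {"symptom/oomkill", "kind/pod", "cause/misconfigured-limit"},
    "evict-neighbor": {"symptom/oomkill", "kind/node", "cause/noisy-neighbor"},
}
context = {"symptom/oomkill", "kind/pod", "cause/misconfigured-limit"}
assert rank_workflows(context, catalog)[0] == "raise-limits"
```

The point of the two-stage design is that cheap, deterministic scoring narrows the field, while the probabilistic step (the LLM) only decides among already-plausible candidates.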
Where predictive AI fits: anomaly detection and topology-aware correlation for known failure patterns. Rather than competing with generative AI, predictive AI platforms are most valuable as knowledge-based agents that the LLM can query during investigation — confirming hypotheses, providing dependency context, and boosting confidence. See AIOps Remediation Landscape for the full integration architecture.
Where Kubernaut wins: novel or variable failures, multi-path remediation, environments where the same alert can have different root causes, and scenarios where verification and learning matter. When integrated with predictive AI, Kubernaut can cross-validate its root cause analysis against statistical correlations — increasing confidence when they agree, and flagging discrepancies when they don't.
For a detailed comparison against specific products and platforms in the agentic remediation space, see Agentic Remediation Market Comparison.
## When to Use Kubernaut
Good fit:
- Incidents where the root cause varies (e.g., OOMKill could be a memory leak, a misconfigured limit, or a noisy neighbor)
- Environments with many workflow types and the right choice depends on context
- Teams that want closed-loop verification, not fire-and-forget
- Organizations that need remediation history and effectiveness tracking for compliance
Consider simpler tools when:
- The problem is fully deterministic with a single known fix
- Latency under 1 second is critical
- The environment is simple enough that a handful of rules covers all cases
## Safety and Trust
Kubernaut is designed for production. The question operators ask is: "What happens when the LLM is wrong?"
- Human approval gates — `RemediationApprovalRequest` CRDs pause execution until an operator approves, for any workflow that requires it
- OPA/Rego policies — Constrain which remediations are allowed for which resources, namespaces, or conditions
- Blast radius controls — Scope management via `kubernaut.ai/managed=true` labels limits which resources Kubernaut can touch
- Cooldown periods — Prevent rapid re-remediation of the same resource
- Effectiveness verification — After execution, Kubernaut checks whether the fix actually worked before marking it successful
- Escalation — If remediation fails or the LLM isn't confident, Kubernaut escalates to a human with the full investigation context rather than retrying blindly
See Human Approval and Rego Policies for configuration details.
## The Feedback Loop
Most remediation tools operate in open loop: trigger, execute, done. Kubernaut closes the loop.
After every remediation, the effectiveness monitor evaluates whether the fix worked across four dimensions: pod health, alert resolution, metrics improvement, and spec drift detection. The result is an effectiveness score attached to the remediation record.
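A minimal sketch of such a score, assuming equal weights across the four dimensions; the weighting, and the use of a spec hash for drift detection, are assumptions for illustration, not Kubernaut's exact formula:

```python
import hashlib
import json

def spec_hash(spec: dict) -> str:
    """Stable hash of a resource spec, used to detect post-remediation drift."""
    return hashlib.sha256(json.dumps(spec, sort_keys=True).encode()).hexdigest()

def effectiveness(pod_healthy: bool, alert_resolved: bool, metrics_improved: bool,
                  spec_before: dict, spec_after: dict) -> float:
    """Score a remediation in [0, 1]: one equal-weight check per dimension."""
    no_drift = spec_hash(spec_before) == spec_hash(spec_after)
    checks = [pod_healthy, alert_resolved, metrics_improved, no_drift]
    return sum(checks) / len(checks)

spec = {"replicas": 3, "image": "app:v2"}
assert effectiveness(True, True, True, spec, dict(spec)) == 1.0   # full success
assert effectiveness(True, False, True, spec, {"replicas": 5, "image": "app:v2"}) == 0.5
```

Whatever the real weighting, the essential property is that the score is attached to the remediation record, so later investigations can see not just *what* was tried but *how well* it worked.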
When the same resource triggers a future alert, HolmesGPT receives the remediation history — including what was tried before and whether it worked. The LLM uses this to avoid repeating failed approaches and to select alternatives.
This means Kubernaut gets better at remediating a specific resource over time, without any manual tuning of rules or weights.
See Remediation History Feedback for a worked example.