Kubernaut
AIOps Platform for Intelligent Kubernetes Remediation
Kubernaut is an open-source AIOps platform that closes the loop from Kubernetes alert to automated remediation — without a human in the middle. When something goes wrong in your cluster (an OOMKill, a CrashLoopBackOff, node pressure), Kubernaut detects the signal, enriches it with context, sends it to an LLM for live root cause investigation, matches a remediation workflow from a searchable catalog, and executes the fix — or escalates to a human with a full RCA when it can't.
Mean time to resolution drops from 60 minutes to under 5 minutes, while humans stay in control through approval gates, configurable confidence thresholds, and audit trails designed for SOC2 alignment.
- Why Kubernaut?
  The problem with manual remediation, how Kubernaut compares to rule-based tools, and when to use it.
- Getting Started
  Install Kubernaut with Helm and run your first automated remediation in under 5 minutes.
- Trust Ladder
  Build confidence incrementally, from approval gates to full autonomous remediation, at your own pace.
- User Guide
  Learn core concepts: signals, workflows, approval gates, effectiveness monitoring, and audit trails.
- Architecture
  Understand the 10-service microservices architecture, CRD communication patterns, and data flows.
- API Reference
  CRD specifications, DataStorage REST API, and Kubernaut Agent API reference.
- What's New in v1.4
  Dry-run mode, Shadow Agent prompt injection defense, operator workflow overrides, and more.
- What's Next
  v1.5 roadmap: interactive sessions, Backstage console, MCP/A2A integration, fleet operations.
- FAQ
  Common questions about LLM support, safety, cost, air-gapped operation, and execution engines.
How It Works
Kubernaut automates the entire incident response lifecycle through a CRD-native pipeline.
CRD: SignalProcessing
AlertManager webhooks and Kubernetes Events are ingested, enriched with Kubernetes context (owner chain, namespace labels, workload metadata), and classified by OPA/Rego policies across multiple dimensions:
- Severity — normalized to a standard scale (critical, high, medium, low).
- Environment — inferred from namespace labels (production, staging, development).
- Priority — P0–P3 based on policy evaluation.
- Signal mode — reactive (active incident) or proactive (predicted issue).
- Business classification — service owner, criticality, SLA requirements.
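To make the classification dimensions concrete, here is a minimal sketch in Python. Kubernaut evaluates these rules as OPA/Rego policies, so this is only a stand-in for the idea. The label key `environment`, the severity aliases, and the priority matrix are all assumptions chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical namespace label key and alias table (not Kubernaut defaults).
ENV_LABEL = "environment"
KNOWN_ENVIRONMENTS = {"production", "staging", "development"}
SEVERITY_ALIASES = {
    "critical": "critical", "page": "critical",
    "error": "high", "high": "high",
    "warning": "medium", "medium": "medium",
    "info": "low", "low": "low",
}

# Assumed priority matrix: production signals rank one step higher.
PRODUCTION_PRIORITY = {"critical": "P0", "high": "P1", "medium": "P2", "low": "P3"}
DEFAULT_PRIORITY = {"critical": "P1", "high": "P2", "medium": "P3", "low": "P3"}


@dataclass(frozen=True)
class Classification:
    severity: str
    environment: str
    priority: str


def classify(raw_severity: str, namespace_labels: dict[str, str]) -> Classification:
    """Normalize severity, infer environment from labels, derive a priority."""
    severity = SEVERITY_ALIASES.get(raw_severity.strip().lower(), "medium")
    environment = namespace_labels.get(ENV_LABEL, "development")
    if environment not in KNOWN_ENVIRONMENTS:
        environment = "development"
    table = PRODUCTION_PRIORITY if environment == "production" else DEFAULT_PRIORITY
    return Classification(severity, environment, table[severity])
```

For example, `classify("page", {"environment": "production"})` yields a critical, production, P0 classification under these assumed tables.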
Each signal is fingerprinted for deduplication at the Gateway before entering the pipeline.
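Fingerprint-based deduplication can be sketched as follows. The fields that go into the fingerprint (alert name, namespace, resource kind and name) and the fixed time window are assumptions for illustration; the Gateway's actual fingerprint inputs are not specified here.

```python
import hashlib


def fingerprint(alert_name: str, namespace: str, kind: str, name: str) -> str:
    """Hash the identifying fields of a signal into a stable hex digest."""
    # The unit separator keeps ("a", "bc") distinct from ("ab", "c").
    key = "\x1f".join((alert_name, namespace, kind, name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class Deduplicator:
    """Admit a fingerprint at most once per window (hypothetical helper)."""

    def __init__(self, window_seconds: float) -> None:
        self.window = window_seconds
        self._admitted_at: dict[str, float] = {}

    def admit(self, fp: str, now: float) -> bool:
        last = self._admitted_at.get(fp)
        if last is not None and now - last < self.window:
            return False  # duplicate inside the window: drop it
        self._admitted_at[fp] = now
        return True
```

Passing `now` explicitly rather than reading the clock inside `admit` keeps the sketch deterministic and easy to test.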
CRD: AIAnalysis
Three-phased pipeline:
- Investigate: The LLM investigates the incident using 36 built-in tools and produces a root cause analysis (RCA).
- Select: Using the RCA and server-side enrichment (historical context, detectable labels), the LLM selects a workflow from the existing user-created RemediationWorkflow catalog.
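The catalog lookup behind the Select step can be pictured as category filtering plus label matching with a score. The sketch below is a deterministic stand-in, not the LLM-driven selection itself. The `Workflow` fields, the scoring rule (fraction of a workflow's labels that match the signal), and the `min_score` cutoff are assumptions.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Workflow:
    name: str
    category: str
    labels: dict[str, str] = field(default_factory=dict)


def match_score(workflow: Workflow, category: str, signal_labels: dict[str, str]) -> float:
    """Return 0.0 on category mismatch, else the fraction of labels matched."""
    if workflow.category != category or not workflow.labels:
        return 0.0
    matched = sum(1 for k, v in workflow.labels.items() if signal_labels.get(k) == v)
    return matched / len(workflow.labels)


def select_workflow(
    catalog: list[Workflow],
    category: str,
    signal_labels: dict[str, str],
    min_score: float = 0.5,
) -> tuple[Optional[Workflow], float]:
    """Pick the best-scoring workflow, or None if nothing clears min_score."""
    scored = [(match_score(w, category, signal_labels), w) for w in catalog]
    best_score, best = max(scored, key=lambda pair: pair[0], default=(0.0, None))
    if best is None or best_score < min_score:
        return None, best_score
    return best, best_score
```

Returning `None` when no candidate clears the cutoff mirrors the escalation path: with no suitable workflow, the incident goes to a human rather than to a poor match.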
CRD: RemediationApprovalRequest
Policy-gated safety checkpoint:
- Auto-approve low-risk actions based on OPA/Rego policies and confidence thresholds.
- Operator notified via Slack, Teams, or PagerDuty for higher-risk remediations.
- Operator overrides allow substituting workflow parameters via the WorkflowOverride CRD, with authwebhook validation and a full audit trail.
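The auto-approval decision can be illustrated with a small gate function. In Kubernaut this logic lives in OPA/Rego policies; the Python below only sketches the shape of the rule. The per-environment thresholds and the risk labels are hypothetical values, not product defaults.

```python
from enum import Enum

# Hypothetical per-environment confidence thresholds.
THRESHOLDS = {"production": 0.90, "staging": 0.80, "development": 0.70}


class Decision(Enum):
    AUTO_APPROVE = "auto-approve"
    NOTIFY_OPERATOR = "notify-operator"


def gate(confidence: float, risk: str, environment: str) -> Decision:
    """Auto-approve only low-risk actions that clear the confidence threshold."""
    # Unknown environments fall back to the strictest threshold.
    threshold = THRESHOLDS.get(environment, max(THRESHOLDS.values()))
    if risk == "low" and confidence >= threshold:
        return Decision.AUTO_APPROVE
    return Decision.NOTIFY_OPERATOR
```

Defaulting to operator notification keeps the gate fail-safe: anything the rule does not explicitly allow is routed to a human.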
CRD: WorkflowExecution
Three execution engines:
- Tekton Pipelines — cloud-native CI/CD pipelines for complex multi-step workflows.
- Kubernetes Jobs — lightweight, single-task remediation actions.
- Ansible (AWX/AAP) — infrastructure-level remediation beyond the cluster boundary.
Each workflow runs under a dedicated ServiceAccount with short-lived TokenRequest authentication, ensuring no standing privileges.
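For readers unfamiliar with the Kubernetes TokenRequest API, the sketch below builds the request body and the API path used to mint a short-lived ServiceAccount token. The audience value, namespace, and ServiceAccount name in the usage note are placeholders, and how Kubernaut itself issues these requests is not shown here.

```python
# Kubernetes rejects TokenRequest expirations shorter than 10 minutes.
MIN_TTL_SECONDS = 600


def token_request_body(audience: str, ttl_seconds: int = MIN_TTL_SECONDS) -> dict:
    """Build a TokenRequest object for the authentication.k8s.io/v1 API."""
    if ttl_seconds < MIN_TTL_SECONDS:
        raise ValueError("expirationSeconds must be at least 600")
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenRequest",
        "spec": {"audiences": [audience], "expirationSeconds": ttl_seconds},
    }


def token_request_path(namespace: str, service_account: str) -> str:
    """Return the API path that accepts a POST of the TokenRequest body."""
    return f"/api/v1/namespaces/{namespace}/serviceaccounts/{service_account}/token"
```

A caller would POST `token_request_body("example-audience")` to `token_request_path("example-ns", "example-sa")`. The returned token expires on its own, so no long-lived secret has to be stored or rotated.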
CRD: EffectivenessAssessment
Post-remediation verification:
- Alert resolution — confirms the original alert has cleared.
- Drift detection — checks for spec changes after the fix.
- Cooldown monitoring — watches for alert recurrence within a configurable window.
- Health scoring — four-dimensional assessment (0–100%) combining alert status, metrics, health, and spec stability.
Outcomes feed back into the Kubernaut Agent so the LLM avoids repeating failed remediations.
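A weighted four-dimensional score might be computed as in the sketch below. The dimension names follow the list above, but the weights are assumptions picked so that they sum to 100; the actual weighting Kubernaut applies is not stated here.

```python
# Hypothetical percentage weights for the four dimensions (sum to 100).
WEIGHTS = {
    "alert_resolution": 40,
    "metrics": 20,
    "health": 20,
    "spec_stability": 20,
}


def health_score(dimensions: dict[str, float]) -> float:
    """Combine per-dimension results in [0.0, 1.0] into a 0-100 score."""
    missing = WEIGHTS.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= dimensions[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return float(sum(WEIGHTS[name] * dimensions[name] for name in WEIGHTS))
```

Under these weights, a remediation that cleared the alert and left the workload healthy, but only half recovered its metrics and drifted from spec, scores 70.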
CRD: NotificationRequest
Multi-channel delivery with full lifecycle tracking:
- Channels: Slack, PagerDuty, Microsoft Teams, console, log, file.
- Routing: Label-based rules with regex matching and fan-out to multiple channels.
- Reliability: Circuit-breaker retry with exponential backoff per channel.
- Audit: Every delivery attempt (success or failure) is recorded with correlation IDs linking back to the originating RemediationRequest.
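Label-based routing with regex matching and fan-out, together with an exponential backoff schedule, can be sketched as follows. The `Route` structure, the fallback channel, and the backoff parameters are assumptions for illustration. The circuit breaker itself is left out; only the delay calculation is shown.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    label: str                 # label key to inspect
    pattern: str               # regex the label value must fully match
    channels: tuple[str, ...]  # channels to notify when the rule matches


def resolve_channels(
    labels: dict[str, str],
    routes: list[Route],
    default: tuple[str, ...] = ("log",),
) -> list[str]:
    """Fan out to every channel of every matching route, without duplicates."""
    selected: list[str] = []
    for route in routes:
        value = labels.get(route.label)
        if value is not None and re.fullmatch(route.pattern, value):
            for channel in route.channels:
                if channel not in selected:
                    selected.append(channel)
    return selected or list(default)


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Delay in seconds before retry number `attempt` (0-based), capped."""
    return min(cap, base * 2 ** attempt)
```

Using `re.fullmatch` rather than `re.search` means a pattern such as `critical|high` cannot accidentally match a longer value that merely contains one of those words.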
Key Capabilities
| Capability | Description |
|---|---|
| Multi-Source Signal Ingestion | Prometheus alerts (reactive and proactive), Kubernetes events, fingerprint-based deduplication at the Gateway, signal mode classification |
| AI-Powered Root Cause Analysis | Kubernaut Agent with LLM providers (Vertex AI, OpenAI, Anthropic, Bedrock, Ollama, and more via LangChainGo), Kubernetes inspection tools, and Prometheus metrics (when enabled) |
| Workflow Catalog | Searchable declarative RemediationWorkflow CRDs with category and label-based matching plus confidence scoring |
| Flexible Execution | Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP) |
| Resource Scope Management | Label-based opt-in (kubernaut.ai/managed=true) controls which resources Kubernaut manages |
| Safety-First Design | Admission webhooks, human approval gates, configurable confidence thresholds, effectiveness tracking |
| SOC2 Alignment | Full audit trails with 7-year retention, CRD reconstruction from audit events, operator attribution |
| Effectiveness Tracking | Four-dimensional assessment (health, alert resolution, metrics, spec drift) with weighted scoring; remediation history feeds into the Kubernaut Agent so the LLM avoids repeating failed remediations |