Kubernaut

AIOps Platform for Intelligent Kubernetes Remediation

Kubernaut is an open-source AIOps platform that closes the loop from Kubernetes alert to automated remediation — without a human in the middle. When something goes wrong in your cluster (an OOMKill, a CrashLoopBackOff, node pressure), Kubernaut detects the signal, enriches it with context, sends it to an LLM for live root cause investigation, matches a remediation workflow from a searchable catalog, and executes the fix — or escalates to a human with a full RCA when it can't.

Mean time to resolution drops from 60 minutes to under 5, while humans stay in control through approval gates, configurable confidence thresholds, and audit trails designed for SOC2 alignment.

Why Kubernaut?

The problem with manual remediation, how Kubernaut compares to rule-based tools, and when to use it.

Why Kubernaut

Getting Started

Install Kubernaut with Helm and run your first automated remediation in under 5 minutes.

Installation

Trust Ladder

Build confidence incrementally, from approval gates to fully autonomous remediation, at your own pace.

Building Confidence

User Guide

Learn core concepts: signals, workflows, approval gates, effectiveness monitoring, and audit trails.

Core Concepts

Architecture

Understand the 10-service microservices architecture, CRD communication patterns, and data flows.

Architecture Overview

API Reference

CRD specifications, DataStorage REST API, and Kubernaut Agent API reference.

API Reference

What's New in v1.4

Dry-run mode, Shadow Agent prompt injection defense, operator workflow overrides, and more.

Release Highlights

What's Next

v1.5 roadmap: interactive sessions, Backstage console, MCP/A2A integration, fleet operations.

Roadmap

FAQ

Common questions about LLM support, safety, cost, air-gapped operation, and execution engines.

FAQ

How It Works

Kubernaut automates the entire incident response lifecycle through a CRD-native pipeline.

Kubernaut Remediation Pipeline: 6 phases

Each phase is described in order below:

1 · Signal Processing
2 · AI Analysis
3 · Approval
4 · Execution
5 · Effectiveness
6 · Notification

CRD: SignalProcessing

AlertManager webhooks and Kubernetes Events are ingested, enriched with Kubernetes context (owner chain, namespace labels, workload metadata), and classified by OPA/Rego policies across multiple dimensions:

Severity — normalized to a standard scale (critical, high, medium, low).
Environment — inferred from namespace labels (production, staging, development).
Priority — P0–P3 based on policy evaluation.
Signal mode — reactive (active incident) or proactive (predicted issue).
Business classification — service owner, criticality, SLA requirements.

Each signal is fingerprinted for deduplication at the Gateway before entering the pipeline.
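
As a rough illustration of fingerprint-based deduplication, the sketch below hashes a few identifying fields of a signal and rejects repeats. The field names (`alertname`, `namespace`, `kind`, `name`) and the `Deduplicator` class are assumptions made for this example; the Gateway's actual fingerprint inputs and storage may differ.

```python
import hashlib
import json


def fingerprint(signal: dict) -> str:
    """Return a stable SHA-256 hash over the fields assumed to identify an incident."""
    # Hypothetical identity fields. Volatile fields such as timestamps are
    # deliberately left out so that repeated firings hash to the same value.
    identity = {
        "alertname": signal.get("alertname", ""),
        "namespace": signal.get("namespace", ""),
        "kind": signal.get("kind", ""),
        "name": signal.get("name", ""),
    }
    canonical = json.dumps(identity, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class Deduplicator:
    """Remembers fingerprints that have already entered the pipeline (in memory only)."""

    def __init__(self) -> None:
        self._seen: set[str] = set()

    def admit(self, signal: dict) -> bool:
        """Return True the first time a fingerprint is seen, False for repeats."""
        fp = fingerprint(signal)
        if fp in self._seen:
            return False
        self._seen.add(fp)
        return True
```

Two firings of the same alert for the same workload produce one fingerprint even if their timestamps differ, so only the first is admitted.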

CRD: AIAnalysis

Three-phased pipeline:

Investigate — The LLM investigates the incident using 36 built-in tools and produces a root cause analysis (RCA).
Select — Using the RCA and server-side enrichment (historical context, detectable labels), the LLM selects a workflow from the existing user-created RemediationWorkflow catalog.

CRD: RemediationApprovalRequest

Policy-gated safety checkpoint:

Auto-approve low-risk actions based on OPA/Rego policies and confidence thresholds.
Operator notified via Slack, Teams, or PagerDuty for higher-risk remediations.
Operator overrides allow substituting workflow parameters via the WorkflowOverride CRD, with authwebhook validation and full audit trail.
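
The gate logic described above can be pictured with a small sketch. Kubernaut evaluates this with OPA/Rego policies; the Python below only stands in for the decision shape. The `Proposal` type, the risk labels, and the `0.8` default threshold are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """A proposed remediation (hypothetical shape, not the CRD schema)."""
    workflow: str
    risk: str          # assumed values: "low", "medium", "high"
    confidence: float  # assumed to be a score between 0.0 and 1.0


def decide(proposal: Proposal, threshold: float = 0.8) -> str:
    """Auto-approve only low-risk proposals that meet the confidence threshold."""
    if proposal.risk == "low" and proposal.confidence >= threshold:
        return "auto-approve"
    # Everything else is routed to an operator for a decision.
    return "require-approval"
```

With these assumed values, a low-risk restart at confidence 0.92 is auto-approved, while the same action at 0.60, or any high-risk action, waits for an operator.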

CRD: WorkflowExecution

Three execution engines:

Tekton Pipelines — cloud-native CI/CD pipelines for complex multi-step workflows.
Kubernetes Jobs — lightweight, single-task remediation actions.
Ansible (AWX/AAP) — infrastructure-level remediation beyond the cluster boundary.

Each workflow runs under a dedicated ServiceAccount with short-lived TokenRequest authentication, ensuring no standing privileges.

CRD: EffectivenessAssessment

Post-remediation verification:

Alert resolution — confirms the original alert has cleared.
Drift detection — checks for spec changes after the fix.
Cooldown monitoring — watches for alert recurrence within a configurable window.
Health scoring — four-dimensional assessment (0–100%) combining alert status, metrics, health, and spec stability.

Outcomes feed back into the Kubernaut Agent so the LLM avoids repeating failed remediations.
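
One way to picture a four-dimensional score is a weighted mean of per-dimension results. The sketch below assumes each dimension is scored between 0.0 and 1.0 and that weights are supplied by the caller; the dimension keys and weights are illustrative, not Kubernaut's actual values.

```python
def health_score(dimensions: dict[str, float], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (0.0 to 1.0) into a weighted percentage."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    missing = weights.keys() - dimensions.keys()
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    # Weighted mean over the dimensions named in `weights`.
    weighted = sum(weights[key] * dimensions[key] for key in weights)
    return round(100 * weighted / total_weight, 1)


# Hypothetical equal weighting of the four dimensions named above.
EQUAL_WEIGHTS = {"alert": 1.0, "metrics": 1.0, "health": 1.0, "spec": 1.0}
```

Under equal weights, a remediation that cleared the alert and restored health but left metrics half-recovered and the spec drifted would score 62.5%.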

CRD: NotificationRequest

Multi-channel delivery with full lifecycle tracking:

Channels: Slack, PagerDuty, Microsoft Teams, console, log, file.
Routing: Label-based rules with regex matching and fan-out to multiple channels.
Reliability: Circuit-breaker retry with exponential backoff per channel.
Audit: Every delivery attempt (success or failure) is recorded with correlation IDs linking back to the originating RemediationRequest.
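
Label-based routing with regex matching and fan-out, plus an exponential backoff schedule, might look roughly like the following. The `Route` type, the full-match semantics, and the backoff parameters are assumptions for this sketch, and the circuit breaker itself is not modeled.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """A hypothetical routing rule: if `label` matches `pattern`, use `channels`."""
    label: str
    pattern: str
    channels: tuple[str, ...]


def resolve_channels(labels: dict[str, str], routes: list[Route]) -> list[str]:
    """Fan out to the channels of every matching route, without duplicates."""
    selected: list[str] = []
    for route in routes:
        value = labels.get(route.label)
        # Assumes the pattern must match the whole label value.
        if value is not None and re.fullmatch(route.pattern, value):
            for channel in route.channels:
                if channel not in selected:
                    selected.append(channel)
    return selected


def backoff_delays(base: float, factor: float, attempts: int, cap: float) -> list[float]:
    """Delay before each retry: base * factor**i, capped at `cap` seconds."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

A signal labeled `severity=critical` and `environment=production` that matches two routes is delivered once to each distinct channel, and each channel retries on its own schedule.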

Key Capabilities

| Capability | Description |
| --- | --- |
| Multi-Source Signal Ingestion | Prometheus alerts (reactive and proactive), Kubernetes events, fingerprint-based deduplication at the Gateway, signal mode classification |
| AI-Powered Root Cause Analysis | Kubernaut Agent with LLM providers (Vertex AI, OpenAI, Anthropic, Bedrock, Ollama, and more via LangChainGo), Kubernetes inspection tools, and Prometheus metrics (when enabled) |
| Workflow Catalog | Searchable declarative `RemediationWorkflow` CRDs with category and label-based matching plus confidence scoring |
| Flexible Execution | Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP) |
| Resource Scope Management | Label-based opt-in (`kubernaut.ai/managed=true`) controls which resources Kubernaut manages |
| Safety-First Design | Admission webhooks, human approval gates, configurable confidence thresholds, effectiveness tracking |
| SOC2 Alignment | Full audit trails with 7-year retention, CRD reconstruction from audit events, operator attribution |
| Effectiveness Tracking | Four-dimensional assessment (health, alert resolution, metrics, spec drift) with weighted scoring; remediation history feeds into the Kubernaut Agent so the LLM avoids repeating failed remediations |
