Kubernaut

AIOps Platform for Intelligent Kubernetes Remediation

Kubernaut is an open-source AIOps platform that closes the loop from Kubernetes alert to automated remediation without requiring a human in the middle. When something goes wrong in your cluster (an OOMKill, a CrashLoopBackOff, node pressure), Kubernaut detects the signal, enriches it with context, sends it to an LLM for live root cause investigation, matches a remediation workflow from a searchable catalog, and executes the fix. When it can't, it escalates to a human with a full root cause analysis (RCA).

Mean time to resolution drops from 60 minutes to under 5 minutes, while humans stay in control through approval gates, configurable confidence thresholds, and audit trails designed for SOC2 alignment.
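To make the idea of approval gates driven by confidence thresholds concrete, here is a minimal sketch in Python. It is not Kubernaut's implementation: the `Decision` values, the `GatePolicy` type, and both threshold numbers are hypothetical, chosen only to illustrate how a confidence score could route a remediation to automatic execution, human approval, or escalation.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_EXECUTE = "auto_execute"          # run the workflow without waiting
    REQUIRE_APPROVAL = "require_approval"  # pause until a human approves
    ESCALATE = "escalate"                  # hand off to a human with the RCA


@dataclass(frozen=True)
class GatePolicy:
    # Hypothetical thresholds for illustration; not Kubernaut defaults.
    auto_execute_at: float = 0.9
    approval_at: float = 0.6


def gate(confidence: float, policy: GatePolicy = GatePolicy()) -> Decision:
    """Map a workflow-match confidence in [0, 1] to a gate decision."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    if confidence >= policy.auto_execute_at:
        return Decision.AUTO_EXECUTE
    if confidence >= policy.approval_at:
        return Decision.REQUIRE_APPROVAL
    return Decision.ESCALATE
```

Raising `auto_execute_at` in this sketch sends more remediations through human approval, which is the kind of trade-off a configurable threshold is meant to expose.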

Why Kubernaut?

The problem with manual remediation, how Kubernaut compares to rule-based tools, and when to use it.

Why Kubernaut

Getting Started

Install Kubernaut with Helm and run your first automated remediation in under 5 minutes.

Installation

User Guide

Learn core concepts: signals, workflows, approval gates, effectiveness monitoring, and audit trails.

Core Concepts

Architecture

Understand the 10-service microservices architecture, CRD communication patterns, and data flows.

Architecture Overview

API Reference

CRD specifications, DataStorage REST API, and HolmesGPT API reference.

API Reference

FAQ

Common questions about LLM support, safety, cost, air-gapped operation, and execution engines.

FAQ

How It Works

Kubernaut automates the entire incident response lifecycle through a six-stage pipeline:

[Figure: Kubernaut Pipeline Overview]

1. Signal Deduplication: Receives alerts from Prometheus AlertManager and Kubernetes Events, validates resource scope, deduplicates and suppresses noise, and creates a RemediationRequest.
2. Signal Categorization: Enriches the signal with Kubernetes context (owner chain, namespace labels, workload details), environment classification, priority assignment, business classification, severity normalization, and signal mode.
3. LLM Investigation: HolmesGPT investigates the incident live using Kubernetes inspection tools (logs, events, resource state, live metrics) and optionally Prometheus, Grafana Loki, and other configured toolsets. It produces a root cause analysis, resolves the target resource's owner chain and remediation history, detects infrastructure labels (GitOps, Helm, service mesh, HPA, PDB), and searches the workflow catalog for a matching remediation.
4. Remediation Execution: Runs the selected remediation via Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP).
5. Effectiveness Assessment: Evaluates whether the fix actually worked via spec hash comparison, health checks, alert resolution, and effectiveness scoring.
6. Multichannel Notification: Notifies the team with the full remediation outcome, including the effectiveness assessment results.
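The first stage combines a resource scope check with deduplication. The sketch below shows one way those two steps could look in Python, under stated assumptions: the `kubernaut.ai/managed` label comes from this page, but the fingerprint fields, the SHA-256 choice, the 300-second window, and the `Deduplicator` class are all hypothetical and are not taken from Kubernaut's source.

```python
import hashlib
import time
from typing import Dict, Mapping, Optional

# Opt-in label named on this page; everything else below is illustrative.
MANAGED_LABEL = "kubernaut.ai/managed"


def is_managed(labels: Mapping[str, str]) -> bool:
    """Scope check: only resources that opted in via the label are handled."""
    return labels.get(MANAGED_LABEL) == "true"


def fingerprint(alert_name: str, namespace: str, kind: str, name: str) -> str:
    """Stable identity for a signal. The field choice is an assumption."""
    key = "\x00".join((alert_name, namespace, kind, name))
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class Deduplicator:
    """Suppress repeats of a fingerprint seen within a sliding time window."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window_seconds = window_seconds
        self._last_seen: Dict[str, float] = {}

    def is_duplicate(self, fp: str, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(fp)
        # Every sighting refreshes the timestamp, so the window slides.
        self._last_seen[fp] = now
        return last is not None and (now - last) < self.window_seconds
```

In this sketch a signal would only lead to a new RemediationRequest when `is_managed` returns true and `is_duplicate` returns false. A real gateway would also need shared state across replicas, which an in-memory dictionary does not provide.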

Key Capabilities

Capability	Description
Multi-Source Signal Ingestion	Prometheus alerts (reactive and proactive), Kubernetes events, fingerprint-based deduplication at the Gateway, signal mode classification
AI-Powered Root Cause Analysis	HolmesGPT with LLM providers (Vertex AI, OpenAI, LiteLLM), Kubernetes inspection tools, and configurable observability toolsets (Prometheus, Grafana Loki/Tempo, and more)
Workflow Catalog	Searchable OCI-containerized workflows with label-based matching and confidence scoring
Flexible Execution	Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP)
Resource Scope Management	Label-based opt-in (`kubernaut.ai/managed=true`) controls which resources Kubernaut manages
Safety-First Design	Admission webhooks, human approval gates, configurable confidence thresholds, effectiveness tracking
SOC2 Alignment	Full audit trails with 7-year retention, CRD reconstruction from audit events, operator attribution
Effectiveness Tracking	Four-dimensional assessment (health, alert resolution, metrics, spec drift) with weighted scoring; remediation history feeds into HolmesGPT so the LLM avoids repeating failed remediations
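The weighted scoring mentioned under Effectiveness Tracking can be illustrated with a short Python sketch. The four dimension names follow the table above, but the weights, the 0-to-1 score scale, and the `effectiveness_score` function are assumptions made for illustration, not Kubernaut's actual formula.

```python
from typing import Mapping

# Hypothetical weights for the four dimensions named in the table.
WEIGHTS: Mapping[str, float] = {
    "health": 0.4,
    "alert_resolution": 0.3,
    "metrics": 0.2,
    "spec_drift": 0.1,
}


def effectiveness_score(
    scores: Mapping[str, float],
    weights: Mapping[str, float] = WEIGHTS,
) -> float:
    """Weighted average of per-dimension scores, each expected in [0, 1]."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dim in weights:
        if not 0.0 <= scores[dim] <= 1.0:
            raise ValueError(f"score for {dim!r} must be in [0, 1]")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalizing by the total keeps the result in [0, 1] even if the
    # weights do not sum to exactly 1.
    return sum(weights[d] * scores[d] for d in weights) / total_weight
```

With these illustrative weights, a remediation that restores health and resolves the alert but only partly recovers metrics and leaves spec drift would score below 1.0. A record of such scores is the kind of history the table describes feeding back into HolmesGPT.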
