Kubernaut

AIOps Platform for Intelligent Kubernetes Remediation

Kubernaut is an open-source AIOps platform that closes the loop from Kubernetes alert to automated remediation — without a human in the middle. When something goes wrong in your cluster (an OOMKill, a CrashLoopBackOff, node pressure), Kubernaut detects the signal, enriches it with context, sends it to an LLM for live root cause investigation, matches a remediation workflow from a searchable catalog, and executes the fix — or escalates to a human with a full RCA when it can't.

Mean time to resolution drops from 60 minutes to under 5, while humans stay in control through approval gates, configurable confidence thresholds, and audit trails designed for SOC2 alignment.


  • Why Kubernaut?


    The problem with manual remediation, how Kubernaut compares to rule-based tools, and when to use it.

    Why Kubernaut

  • Getting Started


    Install Kubernaut with Helm and run your first automated remediation in under 5 minutes.

    Installation

  • User Guide


    Learn core concepts — signals, workflows, approval gates, effectiveness monitoring, and audit trails.

    Core Concepts

  • Architecture


    Understand the 10-service microservices architecture, CRD communication patterns, and data flows.

    Architecture Overview

  • API Reference


    CRD specifications, DataStorage REST API, and HolmesGPT API reference.

    API Reference

  • FAQ


    Common questions about LLM support, safety, cost, air-gapped operation, and execution engines.

    FAQ


How It Works

Kubernaut automates the entire incident response lifecycle through a six-stage pipeline:

[Figure: Kubernaut Pipeline Overview]

  1. Signal Deduplication — Receives alerts from Prometheus AlertManager and Kubernetes Events, validates resource scope, deduplicates and suppresses noise, and creates a RemediationRequest.
  2. Signal Categorization — Enriches the signal with Kubernetes context (owner chain, namespace labels, workload details), environment classification, priority assignment, business classification, severity normalization, and signal mode.
  3. LLM Investigation — HolmesGPT investigates the incident live using Kubernetes inspection tools (logs, events, resource state, live metrics) and optionally Prometheus, Grafana Loki, and other configured toolsets. It produces a root cause analysis, resolves the target resource's owner chain and remediation history, detects infrastructure labels (GitOps, Helm, service mesh, HPA, PDB), and searches the workflow catalog for a matching remediation.
  4. Remediation Execution — Runs the selected remediation via Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP).
  5. Effectiveness Assessment — Evaluates whether the fix actually worked via spec hash comparison, health checks, alert resolution, and effectiveness scoring.
  6. Multichannel Notification — Notifies the team with the full remediation outcome, including the effectiveness assessment results.
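As a rough illustration of stage 1, the sketch below shows one way scope validation and fingerprint-based deduplication could fit together. It is not Kubernaut's actual Gateway code: the `Signal` fields, the `SignalIntake` class, the fingerprint inputs, and the 300-second window are all assumptions made for this example. Only the `kubernaut.ai/managed=true` label comes from the documentation.

```python
import hashlib
from dataclasses import dataclass

MANAGED_LABEL = "kubernaut.ai/managed"  # opt-in label named in the docs


@dataclass(frozen=True)
class Signal:
    """Hypothetical, simplified alert payload."""
    alert_name: str
    namespace: str
    kind: str
    name: str


def is_managed(labels):
    """Return True only when the resource has opted in via the managed label."""
    return labels.get(MANAGED_LABEL) == "true"


def fingerprint(signal):
    """Stable hash over the fields that identify 'the same problem' (assumed set)."""
    key = "|".join([signal.alert_name, signal.namespace, signal.kind, signal.name])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


class SignalIntake:
    """Hypothetical intake: drops out-of-scope signals and suppresses repeats."""

    def __init__(self, window_seconds=300.0):
        self.window_seconds = window_seconds
        self._opened_at = {}  # fingerprint -> time the last request was opened

    def admit(self, signal, labels, now):
        """Return True if this signal should open a new remediation request."""
        if not is_managed(labels):
            return False  # resource is outside the managed scope
        fp = fingerprint(signal)
        opened = self._opened_at.get(fp)
        if opened is not None and now - opened < self.window_seconds:
            return False  # duplicate within the suppression window
        self._opened_at[fp] = now
        return True
```

With this shape, a repeated alert for the same workload is admitted once per window, while an alert for a resource without the opt-in label is never admitted.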

Key Capabilities

| Capability | Description |
|---|---|
| Multi-Source Signal Ingestion | Prometheus alerts (reactive and proactive), Kubernetes events, fingerprint-based deduplication at the Gateway, signal mode classification |
| AI-Powered Root Cause Analysis | HolmesGPT with LLM providers (Vertex AI, OpenAI, LiteLLM), Kubernetes inspection tools, and configurable observability toolsets (Prometheus, Grafana Loki/Tempo, and more) |
| Workflow Catalog | Searchable OCI-containerized workflows with label-based matching and confidence scoring |
| Flexible Execution | Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP) |
| Resource Scope Management | Label-based opt-in (kubernaut.ai/managed=true) controls which resources Kubernaut manages |
| Safety-First Design | Admission webhooks, human approval gates, configurable confidence thresholds, effectiveness tracking |
| SOC2 Alignment | Full audit trails with 7-year retention, CRD reconstruction from audit events, operator attribution |
| Effectiveness Tracking | Four-dimensional assessment (health, alert resolution, metrics, spec drift) with weighted scoring; remediation history feeds into HolmesGPT so the LLM avoids repeating failed remediations |
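To make the idea of weighted scoring across the four assessment dimensions concrete, here is a minimal sketch. The weights, the 0 to 1 score range, and the function name are assumptions chosen for illustration. The documentation names the dimensions but does not specify how Kubernaut weights or combines them.

```python
# Hypothetical weights; the real values and their configurability are not documented here.
DEFAULT_WEIGHTS = {
    "health": 0.4,
    "alert_resolution": 0.3,
    "metrics": 0.2,
    "spec_drift": 0.1,
}


def effectiveness_score(scores, weights=DEFAULT_WEIGHTS):
    """Combine per-dimension scores in [0, 1] into one weighted average."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    for dimension in weights:
        value = scores[dimension]
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{dimension} score {value} is outside [0, 1]")
    total_weight = sum(weights.values())
    # Normalizing by the total weight keeps the result in [0, 1]
    # even if the weights do not sum exactly to 1.
    return sum(weights[d] * scores[d] for d in weights) / total_weight
```

For example, a remediation that restored health and resolved the alert, only half recovered its metrics, and scored 0 on spec drift would score 0.8 under these assumed weights.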