Architecture

Deep-dive documentation for Kubernaut's internal design. Pages are ordered by the natural remediation flow, from signal ingestion through effectiveness assessment.

  • System Overview — Service topology, CRD relationships, and design principles. Introduces the orchestrator pattern, CRDs as the communication backbone, and separation of concerns.
  • Gateway — Signal ingestion, adapters, authentication, scope checking, deduplication, CRD creation. Details how signals enter the system and become RemediationRequest CRDs.
  • Signal Processing — Context enrichment, Rego-based severity/priority/environment classification, and signal mode handling.
  • AI Analysis — HolmesGPT integration, session-based asynchronous handling, and Rego approval gates.
  • Investigation Pipeline — LLM investigation phases, resource context, remediation history, workflow selection, decision outcomes, approval gate. Describes the LLM investigation flow and workflow selection logic.
  • Remediation Routing — Orchestrator routing engine, phase transitions, timeout system, child CRD lifecycle, escalation. Covers orchestration, phase state machine, and escalation behavior.
  • Workflow Selection — How workflows are queried from the catalog, matched by label, and scored for confidence.
  • Workflow Execution — Tekton and Job executors, dependency resolution, cooldown, and deterministic locking semantics.
  • Effectiveness Assessment — Timing model, propagation delays, health scoring. Describes how Kubernaut determines whether a remediation succeeded.
  • Notification Pipeline — Delivery orchestration, routing resolution, retry/circuit breaker, channel implementations. Covers delivery flow, routing, and channel behavior.
  • Async Propagation — The delay model for GitOps and operator propagation, and how Kubernaut handles those delays.
  • Audit Pipeline — Buffered audit store, batching, per-service events, and operator attribution.
  • Data Persistence — PostgreSQL schema, partitioning strategy, retention, and reconstruction across the data lifecycle.