Architecture
Deep-dive documentation for Kubernaut's internal design. Pages are ordered by the natural remediation flow, from signal ingestion through effectiveness assessment.
- System Overview — Service topology, CRD relationships, and design principles. Introduces the orchestrator pattern, CRDs as the communication backbone, and separation of concerns.
- Gateway — Signal ingestion, adapters, authentication, scope checking, deduplication, and CRD creation. Details how signals enter the system and become RemediationRequest CRDs.
- Signal Processing — Context enrichment, Rego-based severity/priority/environment classification, and signal mode handling.
- AI Analysis — HolmesGPT integration, session-based async handling, and Rego approval gates.
- Investigation Pipeline — The LLM investigation flow: investigation phases, resource context, remediation history, workflow selection logic, decision outcomes, and the approval gate.
- Remediation Routing — Orchestrator routing engine, phase state machine and transitions, timeout system, child CRD lifecycle, and escalation behavior.
- Workflow Selection — How workflows are queried from the catalog, matched by label, and scored by confidence for selection.
- Workflow Execution — Tekton and Job executors, dependency resolution, cooldown, and deterministic locking semantics.
- Effectiveness Assessment — Timing model, propagation delays, health scoring. Describes how Kubernaut determines whether a remediation succeeded.
- Notification Pipeline — Delivery flow and orchestration, routing resolution, retry and circuit-breaker handling, and channel implementations and behavior.
- Async Propagation — The GitOps and operator delay model, and how Kubernaut handles those propagation delays.
- Audit Pipeline — The buffered audit store, batching, per-service events, and operator attribution.
- Data Persistence — PostgreSQL schema, partitioning strategy, and the data lifecycle, covering retention and reconstruction.