Architecture Overview

Kubernaut is a microservices platform with 10 services that communicate through Kubernetes Custom Resources (CRDs). This page provides a high-level view of how the services work together.

System Diagram

[Diagram: the Gateway (webhook intake and deduplication) creates a RemediationRequest (RR). The Remediation Orchestrator owns the RR lifecycle and creates child CRDs for each of five phases: (1) Signal Processor (Rego classification), (2) AI Analysis (LLM investigation and selection), (3) Workflow Execution (Tekton / Job / Ansible), (4) Effectiveness (health scoring and drift), (5) Notification (Slack / PagerDuty / Teams). Support services: DataStorage (PostgreSQL + Valkey) and AuthWebhook (RAR override validation).]

The Gateway receives signals (Prometheus alerts, Kubernetes events) and creates RemediationRequest CRDs. The Remediation Orchestrator coordinates the pipeline, creating child CRDs for each phase. Five phase controllers (Signal Processing, AI Analysis, Workflow Execution, Effectiveness Monitor, and Notification) each handle one phase. The DataStorage foundation layer persists audit events, the workflow catalog, and remediation history to PostgreSQL, with Valkey backing the dead-letter queue (DLQ). All services emit audit events to DataStorage over HTTP. AI Analysis delegates to Kubernaut Agent for LLM-driven investigation, and Kubernaut Agent queries DataStorage for the workflow catalog and remediation history.

Remediation Pipeline

The pipeline processes signals through five CRD-native phases:

| Phase | What it does | CRD |
|---|---|---|
| 1. Signal Processing | Ingest alerts (AlertManager, K8s Events), classify severity via OPA/Rego, map to workflow categories | SignalProcessing |
| 2. AI Analysis | Two-invocation LLM pipeline: first invocation investigates with 36 Go tools; second selects workflow from catalog | AIAnalysis |
| 3. Approval | Policy-gated review: auto-approve low-risk, manual review via Slack/Console, operator param override | RemediationApprovalRequest |
| 4. Execution | Run remediation via Tekton Pipelines, Kubernetes Jobs, or Ansible (AWX/AAP) with per-workflow SA | WorkflowExecution |
| 5. Effectiveness | Verify fix via alert resolution, spec drift detection, cooldown monitoring; health score feeds future RCA | EffectivenessAssessment |

For a detailed breakdown of all sub-phases and tools, see Architecture: Investigation Pipeline.

Services

Kubernaut runs 10 services: 6 CRD controllers, 2 stateless HTTP services, 1 admission webhook, and 1 Go API service.

CRD Controllers

Each CRD is owned by a dedicated controller. See System Overview for the complete service topology and CRD ownership model.

Stateless Services

See System Overview for the complete service topology including Gateway, DataStorage, Auth Webhook, and Kubernaut Agent.

Communication Pattern

Inter-service communication in the remediation pipeline uses Kubernetes CRDs, with these HTTP exceptions:

  • All controllers emit audit events to DataStorage
  • Workflow Execution (WFE) queries DataStorage for the workflow catalog
  • The Remediation Orchestrator (RO) queries DataStorage for remediation history
  • AI Analysis (AA) calls Kubernaut Agent for AI investigation
  • The Effectiveness Monitor (EM) queries AlertManager and Prometheus for effectiveness assessment

This architecture provides:

  • Resilience — If a controller restarts, it picks up from the CRD's current state
  • Observability — Every stage is visible as a Kubernetes resource (kubectl get)
  • Auditability — CRD status transitions are tracked; full audit events go to PostgreSQL
  • Scalability — Each controller scales independently
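
The resilience property follows from controllers keeping no state of their own: each reconcile pass reads the resource and decides what to do next. The sketch below illustrates that idea in Python under stated assumptions. The `status.phase` field name, the `reconcile` helper, and the dict-shaped resource are hypothetical, and only the happy path of the RemediationRequest lifecycle is modelled.

```python
# Happy path through the RemediationRequest lifecycle described on this page.
HAPPY_PATH = ["Pending", "Processing", "Analyzing", "Executing", "Verifying", "Completed"]


def reconcile(resource: dict) -> dict:
    """Advance a resource by one phase using only the state stored on it.

    The function keeps no memory between calls, so a restarted controller
    behaves exactly like one that never stopped.
    """
    status = resource.get("status", {})
    phase = status.get("phase", "Pending")  # "status.phase" is an assumed field name
    if phase == "Completed":
        return resource  # nothing left to do
    next_phase = HAPPY_PATH[HAPPY_PATH.index(phase) + 1]
    # Return a new object instead of mutating the input.
    return {**resource, "status": {**status, "phase": next_phase}}


# Simulated restart: a previous process left the resource in "Analyzing".
stored = {"metadata": {"name": "rr-example"}, "status": {"phase": "Analyzing"}}
resumed = reconcile(stored)
print(resumed["status"]["phase"])  # Executing
```

Because the next step is derived from the stored phase alone, it does not matter which process instance performs the call.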

Custom Resources

Kubernaut defines 9 CRD types. Each CRD is owned by a dedicated controller. See System Overview for the complete service topology and CRD ownership model.

Remediation Lifecycle

A RemediationRequest progresses through these phases:

stateDiagram-v2
    [*] --> Pending
    Pending --> Processing: Create SignalProcessing
    Pending --> Blocked: Routing condition
    Processing --> Analyzing: Enrichment complete
    Analyzing --> Completed: No remediation needed
    Analyzing --> AwaitingApproval: Rego policy requires approval
    Analyzing --> Executing: Workflow selected, auto-approved
    Analyzing --> Blocked: Routing condition
    Analyzing --> Failed: AI investigation failed
    AwaitingApproval --> Executing: Human approves
    AwaitingApproval --> Failed: Human rejects
    Executing --> Verifying: Workflow succeeded
    Executing --> Failed: Workflow fails
    Verifying --> Completed: Effectiveness assessed
    Blocked --> Failed: Cooldown expires
    Blocked --> Analyzing: Block cleared
    Blocked --> Pending: Block cleared
    Completed --> [*]
    Failed --> [*]
    TimedOut --> [*]
    Skipped --> [*]
    Cancelled --> [*]
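
The same diagram can be read as a lookup table. The following Python sketch copies the edges drawn above into a dictionary. The names `TRANSITIONS`, `TERMINAL`, `can_transition`, and `is_terminal` are illustrative and not taken from the Kubernaut code base.

```python
# Edges copied from the state diagram; the dict layout is illustrative.
TRANSITIONS = {
    "Pending": {"Processing", "Blocked"},
    "Processing": {"Analyzing"},
    "Analyzing": {"Completed", "AwaitingApproval", "Executing", "Blocked", "Failed"},
    "AwaitingApproval": {"Executing", "Failed"},
    "Executing": {"Verifying", "Failed"},
    "Verifying": {"Completed"},
    "Blocked": {"Failed", "Analyzing", "Pending"},
}

# Phases the diagram connects to the end marker ([*]).
TERMINAL = {"Completed", "Failed", "TimedOut", "Skipped", "Cancelled"}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram draws an edge from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(phase: str) -> bool:
    """Return True if the diagram treats the phase as an end state."""
    return phase in TERMINAL


print(can_transition("Analyzing", "Executing"))  # True
print(can_transition("Completed", "Pending"))    # False
```

The diagram shows TimedOut, Skipped, and Cancelled only as end states and does not draw the edges that lead into them, so the table above has no entries that reach them.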

AI Analysis Outcomes

The Analyzing phase represents the LLM investigation via Kubernaut Agent. The AI produces one of these outcomes:

| Outcome | RR Transition | Description |
|---|---|---|
| No remediation needed | Completed (NoActionRequired) | LLM determines the issue does not require remediation: either the problem self-resolved (e.g., pod recovered) or the condition is benign (e.g., dangling PVC that doesn't warrant action) |
| Workflow selected | Executing or AwaitingApproval | LLM identified root cause and selected a workflow; Rego policy determines if approval is required |
| Investigation inconclusive | Failed (ManualReviewRequired) | LLM could not produce a reliable RCA (low confidence, incomplete analysis) |
| No matching workflow | Failed (ManualReviewRequired) | RCA succeeded but no workflow matches the detected labels |
| Infrastructure failure | Failed | API error, timeout, or max retries exceeded communicating with the LLM |
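
As a rough illustration, the outcome table can be expressed as a routing function. This is a sketch, not the Orchestrator's implementation: the `Outcome` enum and the `route` function are hypothetical, and the Rego policy decision is reduced to a boolean flag.

```python
from enum import Enum
from typing import Optional, Tuple


class Outcome(Enum):
    NO_REMEDIATION_NEEDED = "No remediation needed"
    WORKFLOW_SELECTED = "Workflow selected"
    INVESTIGATION_INCONCLUSIVE = "Investigation inconclusive"
    NO_MATCHING_WORKFLOW = "No matching workflow"
    INFRASTRUCTURE_FAILURE = "Infrastructure failure"


def route(outcome: Outcome, approval_required: bool = False) -> Tuple[str, Optional[str]]:
    """Return (next RR phase, reason) for an AI Analysis outcome."""
    if outcome is Outcome.NO_REMEDIATION_NEEDED:
        return ("Completed", "NoActionRequired")
    if outcome is Outcome.WORKFLOW_SELECTED:
        # The table says a Rego policy decides this; here it is a plain flag.
        return ("AwaitingApproval" if approval_required else "Executing", None)
    if outcome in (Outcome.INVESTIGATION_INCONCLUSIVE, Outcome.NO_MATCHING_WORKFLOW):
        return ("Failed", "ManualReviewRequired")
    # Remaining case: infrastructure failure, which the table lists without a reason.
    return ("Failed", None)


print(route(Outcome.WORKFLOW_SELECTED, approval_required=True))  # ('AwaitingApproval', None)
```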

Blocked Phase

The Blocked phase is non-terminal and covers 6 routing scenarios managed by the Orchestrator (not the LLM). See Core Concepts for all block reasons, cooldowns, and exit conditions.

On successful workflow execution, the Orchestrator creates an EffectivenessAssessment to evaluate whether the fix worked. Once the assessment completes (or times out), the Orchestrator creates a NotificationRequest that includes the remediation outcome and effectiveness results. On failure or escalation, the Orchestrator creates the NotificationRequest directly, without an assessment.
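
The ordering in the paragraph above can be sketched as a small decision function. The function name `next_child_crd` and the state strings (`"succeeded"`, `"completed"`, `"timed_out"`) are assumptions made for illustration; only the two CRD kinds come from this page.

```python
from typing import Optional


def next_child_crd(execution_result: str, assessment_state: Optional[str] = None) -> Optional[str]:
    """Return the CRD kind to create next, or None if nothing is due yet.

    execution_result: "succeeded" or any other value for failure/escalation.
    assessment_state: None if no EffectivenessAssessment exists yet,
                      otherwise its current state.
    """
    if execution_result != "succeeded":
        # Failure or escalation: notify directly, no assessment.
        return "NotificationRequest"
    if assessment_state is None:
        # Success: assess effectiveness before notifying.
        return "EffectivenessAssessment"
    if assessment_state in ("completed", "timed_out"):
        return "NotificationRequest"
    return None  # assessment still running


print(next_child_crd("succeeded"))               # EffectivenessAssessment
print(next_child_crd("succeeded", "completed"))  # NotificationRequest
print(next_child_crd("failed"))                  # NotificationRequest
```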

Data Flow

Every service emits audit events to DataStorage as it processes its CRD. These events capture the full context: what happened, when, why, and who was involved. The long-term record of every remediation lives in PostgreSQL via the audit pipeline, so even if CRDs are removed from the cluster, the complete data is preserved. A RemediationRequest can be reconstructed from audit data at any time.
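
One way to picture reconstruction from audit data is as a fold over time-ordered events. The sketch below assumes a hypothetical event shape (`AuditEvent` with `remediation_id`, `timestamp`, `service`, `phase`, and `detail` fields); the actual DataStorage schema is not described on this page and may differ.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class AuditEvent:
    # Hypothetical fields, chosen for illustration only.
    remediation_id: str
    timestamp: float
    service: str
    phase: str
    detail: str


def reconstruct(events: Iterable[AuditEvent], remediation_id: str) -> dict:
    """Rebuild a summary of one remediation from its audit events."""
    relevant = sorted(
        (e for e in events if e.remediation_id == remediation_id),
        key=lambda e: e.timestamp,
    )
    if not relevant:
        raise LookupError(f"no audit events for {remediation_id}")
    return {
        "name": remediation_id,
        "phase": relevant[-1].phase,  # the latest event determines the current phase
        "history": [(e.service, e.phase, e.detail) for e in relevant],
    }


events = [
    AuditEvent("rr-1", 2.0, "ai-analysis", "Analyzing", "workflow selected"),
    AuditEvent("rr-1", 1.0, "signal-processing", "Processing", "classified"),
    AuditEvent("rr-2", 1.5, "gateway", "Pending", "created"),
]
summary = reconstruct(events, "rr-1")
print(summary["phase"])  # Analyzing
```

Sorting by timestamp before folding means the events can be stored or fetched in any order.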

Next Steps