Architecture Overview

Kubernaut is a microservices platform with 11 services (v1.5+; 10 in v1.4) that communicate through Kubernetes Custom Resources (CRDs). This page provides a high-level view of how the services work together.

System Diagram

The Gateway receives signals (Prometheus alerts, Kubernetes events) and creates RemediationRequest CRDs. The Remediation Orchestrator coordinates the pipeline, creating child CRDs for each phase. The phase controllers (Signal Processing, AI Analysis, Workflow Execution, Effectiveness Monitor, and Notification) each handle one phase. The DataStorage foundation layer persists audit events, the workflow catalog, and remediation history to PostgreSQL (with Valkey for the dead-letter queue, DLQ).

All services emit audit events to DataStorage over HTTP. AI Analysis delegates to Kubernaut Agent for LLM-driven investigation, and Kubernaut Agent queries DataStorage for the workflow catalog and remediation history.

The API Frontend (v1.5+) exposes MCP and A2A protocol endpoints for interactive sessions. InvestigationSession CRDs use deferred materialization: the CRD appears in the cluster only after a RemediationRequest (RR) is successfully created via kubernaut_remediate, so sessions that never produce an RR leave no cluster footprint.

Remediation Pipeline

The pipeline processes signals through six CRD-native phases:

| Phase | What it does | CRD |
| --- | --- | --- |
| 1. Signal Processing | Ingest alerts (AlertManager, K8s Events), classify severity via OPA/Rego, map to workflow categories | `SignalProcessing` |
| 2. AI Analysis | Two-invocation LLM pipeline: first invocation investigates with 36 Go tools; second selects workflow from catalog | `AIAnalysis` |
| 3. Approval | Policy-gated review: auto-approve low-risk, manual review via Slack/Console, operator param override | `RemediationApprovalRequest` |
| 4. Execution | Run remediation via Tekton Pipelines, Kubernetes Jobs, or Ansible (AWX/AAP) with per-workflow SA | `WorkflowExecution` |
| 5. Effectiveness | Verify fix via alert resolution, spec drift detection, cooldown monitoring; health score feeds future RCA | `EffectivenessAssessment` |
| 6. Notification | Deliver outcome notifications to Slack, PagerDuty, Microsoft Teams, console, or file; retry with exponential backoff and circuit breaker | `NotificationRequest` |
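
As a quick reference, the phase-to-CRD mapping in the table can be written down as data. This is only an illustrative sketch: the phase names and CRD kinds are copied from the table, while the `Phase` type and the `phase_for_kind` helper are hypothetical names introduced here and are not part of Kubernaut.

```python
from typing import NamedTuple


class Phase(NamedTuple):
    order: int
    name: str
    crd_kind: str


# Phase names and CRD kinds are taken from the pipeline table.
PIPELINE = [
    Phase(1, "Signal Processing", "SignalProcessing"),
    Phase(2, "AI Analysis", "AIAnalysis"),
    Phase(3, "Approval", "RemediationApprovalRequest"),
    Phase(4, "Execution", "WorkflowExecution"),
    Phase(5, "Effectiveness", "EffectivenessAssessment"),
    Phase(6, "Notification", "NotificationRequest"),
]


def phase_for_kind(kind: str) -> Phase:
    """Return the pipeline phase that a child CRD kind belongs to (hypothetical helper)."""
    for phase in PIPELINE:
        if phase.crd_kind == kind:
            return phase
    raise KeyError(f"{kind!r} is not a pipeline CRD")
```

For example, `phase_for_kind("WorkflowExecution")` returns the Execution phase, and a non-pipeline kind such as `RemediationRequest` raises `KeyError`.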

For a detailed breakdown of all sub-phases and tools, see the Architecture: Investigation Pipeline.

Services

Kubernaut runs 11 services (v1.5+): 6 CRD controllers, 2 stateless HTTP services, 1 admission webhook, 1 Go API service, and the API Frontend.

CRD Controllers

Each CRD is owned by a dedicated controller. See System Overview for the complete service topology and CRD ownership model.

Stateless Services

See System Overview for the complete service topology including Gateway, DataStorage, Auth Webhook, and Kubernaut Agent.

Pipeline Modes (v1.5+)

Kubernaut supports two pipeline modes simultaneously:

| | Autonomous | Interactive |
| --- | --- | --- |
| Trigger | Alert webhook (Prometheus, K8s Event) | Operator starts on demand or joins an autonomous session via MCP through API Frontend |
| Workflow selection | LLM selects automatically | Operator chooses from LLM-populated alternatives |
| Approval | Rego policy + RAR gate | Same Rego policy + RAR gate; identity-aware policies can auto-approve trusted operators |
| Visibility | Post-hoc via kubectl, notifications | Real-time SSE streaming |

Both modes use the same CRDs, audit events, and effectiveness assessments. Operators can start investigations on demand via the API Frontend or join an autonomous investigation mid-flight. See Interactive Sessions for the operator guide.

Communication Pattern

All inter-service communication in the remediation pipeline uses Kubernetes CRDs, with the following HTTP exceptions: all controllers emit audit events to DataStorage; Workflow Execution (WFE) queries DataStorage for the workflow catalog; the Remediation Orchestrator (RO) queries DataStorage for remediation history; AI Analysis (AA) calls Kubernaut Agent for AI investigation; Effectiveness Monitor (EM) queries AlertManager and Prometheus for effectiveness assessment; and the API Frontend dispatches its 14 MCP tools to multiple backends (K8s API, Kubernaut Agent REST/MCP, DataStorage).

This architecture provides:

Resilience — If a controller restarts, it picks up from the CRD's current state
Observability — Every stage is visible as a Kubernetes resource (kubectl get)
Auditability — CRD status transitions are tracked; full audit events go to PostgreSQL
Scalability — Each controller scales independently
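
The resilience property can be illustrated with a toy reconcile loop. The sketch below is a simplified model, not Kubernaut's controller code: the step names in `STEPS`, the `Resource` class, and `run_controller` are all hypothetical. The point it tries to show is that progress is stored on the resource, so a restarted controller needs no in-memory state to continue.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical step names for a single controller; not Kubernaut's real sub-phases.
STEPS = ["Pending", "Enriching", "Classifying", "Completed"]


@dataclass
class Resource:
    """Stand-in for a custom resource: the phase lives in its status."""
    name: str
    phase: str = "Pending"
    history: List[str] = field(default_factory=list)


def reconcile(resource: Resource) -> bool:
    """Advance the resource one step, based only on its stored phase.

    Returns True if another reconcile is needed, False once it is done.
    """
    index = STEPS.index(resource.phase)
    if index == len(STEPS) - 1:
        return False
    resource.phase = STEPS[index + 1]
    resource.history.append(resource.phase)
    return resource.phase != STEPS[-1]


def run_controller(resource: Resource, max_steps: Optional[int] = None) -> Resource:
    """Reconcile until done, or stop early after max_steps to simulate a crash."""
    steps = 0
    while max_steps is None or steps < max_steps:
        steps += 1
        if not reconcile(resource):
            break
    return resource
```

Calling `run_controller(r, max_steps=1)` and then `run_controller(r)` on the same resource mimics a crash followed by a restart: the second call continues from the recorded phase instead of starting over.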

Custom Resources

Kubernaut defines 10 CRD types (v1.5+; 9 in v1.4), all namespaced and served under kubernaut.ai/v1alpha1 (API group kubernaut.ai, version v1alpha1). The six pipeline CRDs are each owned by a dedicated controller. RemediationWorkflow and ActionType are catalog resources managed by the Auth Webhook. RemediationRequest is the top-level orchestration CRD. InvestigationSession (v1.5+) is created and managed by the API Frontend for interactive MCP/A2A sessions. See System Overview for the complete service topology and CRD ownership model.

Remediation Lifecycle

A RemediationRequest progresses through these phases:

stateDiagram-v2
    [*] --> Pending
    Pending --> Processing: Create SignalProcessing
    Pending --> Blocked: Routing condition
    Processing --> Analyzing: Enrichment complete
    Analyzing --> Completed: No remediation needed
    Analyzing --> AwaitingApproval: Rego policy requires approval
    Analyzing --> Executing: Workflow selected, auto-approved
    Analyzing --> Blocked: Routing condition
    Analyzing --> Failed: AI investigation failed
    AwaitingApproval --> Executing: Human approves
    AwaitingApproval --> Failed: Human rejects
    Executing --> Verifying: Workflow succeeded
    Executing --> Failed: Workflow fails
    Verifying --> Completed: Effectiveness assessed
    Blocked --> Failed: Cooldown expires
    Blocked --> Analyzing: Block cleared
    Blocked --> Pending: Block cleared
    Completed --> [*]
    Failed --> [*]
    TimedOut --> [*]
    Skipped --> [*]
    Cancelled --> [*]
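
The diagram's edges can also be expressed as a transition table, which is handy for checking whether a phase sequence is legal. This is a minimal sketch built only from the edges drawn above; the function names are hypothetical. The diagram shows no inbound edges for TimedOut, Skipped, or Cancelled, so the sketch treats them as terminal phases without modeling how they are reached.

```python
# Edges copied from the state diagram; keys are source phases.
TRANSITIONS = {
    "Pending": {"Processing", "Blocked"},
    "Processing": {"Analyzing"},
    "Analyzing": {"Completed", "AwaitingApproval", "Executing", "Blocked", "Failed"},
    "AwaitingApproval": {"Executing", "Failed"},
    "Executing": {"Verifying", "Failed"},
    "Verifying": {"Completed"},
    "Blocked": {"Failed", "Analyzing", "Pending"},
}

# Phases that lead to the end marker in the diagram.
TERMINAL = {"Completed", "Failed", "TimedOut", "Skipped", "Cancelled"}


def can_transition(current: str, target: str) -> bool:
    """Return True if the diagram has an edge from current to target."""
    return target in TRANSITIONS.get(current, set())


def is_terminal(phase: str) -> bool:
    """Return True if the phase is terminal in the diagram."""
    return phase in TERMINAL


def validate_path(path: list) -> bool:
    """Return True if every consecutive pair of phases is an edge in the diagram."""
    return all(can_transition(a, b) for a, b in zip(path, path[1:]))
```

For instance, the approval path `Pending, Processing, Analyzing, AwaitingApproval, Executing, Verifying, Completed` validates, while jumping from `Processing` straight to `Executing` does not.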

AI Analysis Outcomes

The Analyzing phase represents the LLM investigation via Kubernaut Agent. The AI produces one of these outcomes:

| Outcome | RR Transition | Description |
| --- | --- | --- |
| No remediation needed | Completed (NoActionRequired) | LLM determines the issue does not require remediation: either the problem self-resolved (e.g., pod recovered) or the condition is benign (e.g., dangling PVC that doesn't warrant action) |
| Workflow selected | Executing or AwaitingApproval | LLM identified root cause and selected a workflow; Rego policy determines if approval is required |
| Investigation inconclusive | Failed (ManualReviewRequired) | LLM could not produce a reliable RCA (low confidence, incomplete analysis) |
| No matching workflow | Failed (ManualReviewRequired) | RCA succeeded but no workflow matches the detected labels |
| Infrastructure failure | Failed | API error, timeout, or max retries exceeded communicating with the LLM |
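
The outcome table reads naturally as a small decision function. The sketch below is illustrative only: the `Outcome` enum and `next_rr_state` are hypothetical names, and the `approval_required` flag stands in for the Rego policy decision, which is not modeled here. Phase and reason strings are taken from the table.

```python
from enum import Enum
from typing import Optional, Tuple


class Outcome(Enum):
    NO_REMEDIATION_NEEDED = "No remediation needed"
    WORKFLOW_SELECTED = "Workflow selected"
    INVESTIGATION_INCONCLUSIVE = "Investigation inconclusive"
    NO_MATCHING_WORKFLOW = "No matching workflow"
    INFRASTRUCTURE_FAILURE = "Infrastructure failure"


def next_rr_state(outcome: Outcome, approval_required: bool = False) -> Tuple[str, Optional[str]]:
    """Map an AI Analysis outcome to (RR phase, reason), following the table.

    approval_required is an assumed input representing the policy result.
    """
    if outcome is Outcome.NO_REMEDIATION_NEEDED:
        return ("Completed", "NoActionRequired")
    if outcome is Outcome.WORKFLOW_SELECTED:
        return ("AwaitingApproval" if approval_required else "Executing", None)
    if outcome in (Outcome.INVESTIGATION_INCONCLUSIVE, Outcome.NO_MATCHING_WORKFLOW):
        return ("Failed", "ManualReviewRequired")
    # Infrastructure failure: the table lists Failed without a named reason.
    return ("Failed", None)
```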

Blocked Phase

The Blocked phase is non-terminal and covers 6 routing scenarios managed by the Orchestrator (not the LLM). See Core Concepts for all block reasons, cooldowns, and exit conditions.

On successful workflow execution, the Orchestrator creates an EffectivenessAssessment to evaluate whether the fix worked. Once the assessment completes (or times out), it creates a NotificationRequest that includes the remediation outcome and effectiveness results. On failure or escalation, a notification is created directly.

Data Flow

Every service emits audit events to DataStorage as it processes its CRD. These events capture the full context: what happened, when, why, and who was involved. The long-term record of every remediation lives in PostgreSQL via the audit pipeline, so even if CRDs are removed from the cluster, the complete data is preserved. A RemediationRequest can be reconstructed from audit data at any time.
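
To make the reconstruction idea concrete, here is a minimal sketch of replaying audit events to rebuild a RemediationRequest's phase history. The event shape is an assumption made for illustration: the `rr`, `timestamp`, and `phase` keys, the sample data, and the `reconstruct` function are hypothetical and do not reflect Kubernaut's actual audit schema or DataStorage API.

```python
from typing import Dict, Iterable, List


def reconstruct(events: Iterable[Dict[str, str]], rr_name: str) -> Dict[str, object]:
    """Rebuild an RR's latest phase and phase history from audit events.

    Assumes timestamps are ISO-8601 UTC strings in one format, so that
    lexicographic order matches chronological order.
    """
    relevant: List[Dict[str, str]] = sorted(
        (e for e in events if e["rr"] == rr_name),
        key=lambda e: e["timestamp"],
    )
    if not relevant:
        raise LookupError(f"no audit events recorded for {rr_name!r}")
    return {
        "name": rr_name,
        "phase": relevant[-1]["phase"],
        "history": [e["phase"] for e in relevant],
    }


# Hypothetical sample events, deliberately out of order.
EVENTS = [
    {"rr": "rr-1", "timestamp": "2024-01-01T10:02:00Z", "phase": "Analyzing"},
    {"rr": "rr-2", "timestamp": "2024-01-01T10:01:30Z", "phase": "Pending"},
    {"rr": "rr-1", "timestamp": "2024-01-01T10:00:00Z", "phase": "Pending"},
    {"rr": "rr-1", "timestamp": "2024-01-01T10:01:00Z", "phase": "Processing"},
]
```

With this sample data, `reconstruct(EVENTS, "rr-1")` yields the history `Pending, Processing, Analyzing`, even though the resource itself is never consulted.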

Next Steps

Core Concepts — Detailed explanation of each stage
System Overview — Deep-dive architecture documentation
CRD Reference — Complete CRD spec/status definitions