Core Concepts¶

This page explains the key building blocks of Kubernaut: the data model, the services, and how a remediation flows through the system.

The Remediation Pipeline¶

Every remediation in Kubernaut follows the same six-stage pipeline:

graph LR
    SD["Signal<br/>Detection"] --> SP["Signal<br/>Processing"]
    SP --> AA["AI<br/>Analysis"]
    AA --> WE["Workflow<br/>Execution"]
    WE --> EM["Effectiveness<br/>Monitoring"]
    EM --> NF["Notification"]

Each stage is represented by a Custom Resource (CRD) in Kubernetes. The Remediation Orchestrator coordinates the flow by creating child CRDs and watching their status.

Custom Resources¶

RemediationRequest¶

The top-level resource. Created by the Gateway when a signal arrives. Contains:

TargetResource — The Kubernetes resource that triggered the alert (namespace, name, kind)
Signal metadata — Alert name, signal type, labels, annotations, original payload
OverallPhase — Current lifecycle phase (Pending → Processing → Analyzing → AwaitingApproval → Executing → Verifying → Completed/Failed/Blocked/TimedOut/Skipped/Cancelled)

The RemediationRequest is the "parent" — all other CRDs are children created by the Orchestrator.

SignalProcessing¶

Created after a RemediationRequest is accepted. The Signal Processing controller enriches the signal with:

Kubernetes context — Owner chain (Deployment → ReplicaSet → Pod), namespace labels and annotations, workload details (kind, name, labels), and custom labels
Environment classification — Inferred from namespace labels or Rego policies (production, staging, development, test)
Priority assignment — P0–P3 based on Rego policy evaluation or severity-based fallback
Business classification — Business unit, service owner, criticality level, and SLA requirement (when labels are present)
Severity normalization — Maps raw alert severity to a standard scale (critical, high, medium, low, unknown) via Rego policies with a configurable fallback matrix
Signal mode — Reactive (something broke) or proactive (something is predicted to break)
Signal name normalization — Normalizes the signal name for downstream matching while preserving the original for audit

AIAnalysis¶

Created after signal enrichment completes. The AI Analysis controller:

Submits the enriched signal to HolmesGPT (via the HolmesGPT API service) for a three-phase LLM investigation:
- Phase 1: Investigate — Root cause analysis using live cluster data (logs, events, resource state, metrics)
- Phase 2: Enrich — Resolves the target resource's owner chain, computes a spec hash, fetches remediation history (past outcomes and effectiveness scores via DataStorage), and detects infrastructure labels (GitOps, Helm, service mesh, HPA, PDB)
- Phase 3: Workflow Select — The LLM discovers and selects a workflow from the catalog via a three-step protocol (list_available_actions → list_workflows → get_workflow); DataStorage applies label-based ranking but the LLM drives the final selection
Evaluates whether auto-approval is safe via a Rego policy (configurable confidence threshold)

See Investigation Pipeline for the full three-phase architecture.

RemediationApprovalRequest¶

Created when the AI Analysis confidence is below the approval threshold, or when the Rego policy requires human review. A human operator approves or rejects the remediation.

WorkflowExecution¶

Created after approval (auto or human). The Workflow Execution controller:

Resolves the workflow from the catalog (via DataStorage)
Validates dependencies (required Secrets, ConfigMaps)
Runs the remediation via Tekton Pipelines (multi-step), Kubernetes Jobs (single-step), or Ansible (AWX/AAP) (playbook-based)
Injects parameters (namespace, deployment name, etc.)

EffectivenessAssessment¶

Created after the workflow completes successfully. The Effectiveness Monitor evaluates whether the fix actually resolved the issue:

Spec hash comparison — Did the resource spec change as expected?
Health checks — Is the workload healthy now?
Alert resolution — Did the triggering alert stop firing? (via AlertManager)
Metric evaluation — Did the triggering metric recover? (via Prometheus)

NotificationRequest¶

Created after the effectiveness assessment completes (or earlier on escalation/manual review). The notification includes the full remediation outcome and effectiveness results. Delivers via configured channels:

Slack — Rich messages with RCA summary, remediation outcome, and effectiveness score
Console / Log — For development and testing
File — For integration testing

Phases¶

A RemediationRequest progresses through these phases:

Phase	Description
Pending	Created by Gateway, waiting for Orchestrator pickup
Processing	Signal Processing is enriching the signal
Analyzing	AI Analysis is performing RCA and workflow selection
AwaitingApproval	Human approval required (low confidence or policy mandate)
Executing	Workflow is running the remediation
Verifying	Workflow succeeded; effectiveness assessment in progress
Blocked	Routing engine prevents progress; automatically retried after cooldown (see below)
Completed	Remediation finished successfully, or `NoActionRequired` / `ManualReviewRequired` outcome
Failed	Remediation failed at any stage (including human rejection, or HAPI flagged human review with a selected workflow)
TimedOut	Phase timeout expired
Skipped	Remediation skipped (e.g., resource busy)
Cancelled	Remediation cancelled

Blocked Phase¶

The Blocked phase is a transient, non-terminal state. The Orchestrator's routing engine evaluates safety conditions before allowing a remediation to proceed. When a condition is met, the RR enters Blocked with a blockReason, a human-readable blockMessage, and (for time-based blocks) a blockedUntil timestamp. The RR is automatically requeued and will resume once the condition clears.

Block Reason	Trigger	Default Cooldown
`ConsecutiveFailures`	3+ consecutive failures on the same signal fingerprint	1 hour
`DuplicateInProgress`	Another active RR with the same fingerprint is already being remediated	30 s (recheck)
`ResourceBusy`	A WorkflowExecution is already running on the same target resource	30 s (recheck)
`RecentlyRemediated`	The same workflow+target was successfully executed recently	5 minutes
`ExponentialBackoff`	Progressive retry delay after failures (1 min to 10 min)	Adaptive
`UnmanagedResource`	Target namespace or resource lacks `kubernaut.ai/managed=true`	5 s to 5 min (backoff)
`IneffectiveChain`	Consecutive remediations detected as ineffective via audit history	Escalates to manual review

See Troubleshooting for diagnostic steps.

Signal Modes¶

Kubernaut classifies signals into two modes:

Reactive — Responding to an active incident (e.g., KubePodCrashLooping, KubePodOOMKilled)
Proactive — Responding to a predicted issue (e.g., Prometheus predict_linear() alerts for disk pressure, memory exhaustion)

Signal mode determines which prompt variant HolmesGPT uses for the investigation. In reactive mode, the LLM performs root cause analysis of an incident that has already occurred. In proactive mode, the prompt shifts to trend assessment and prevention — the LLM evaluates whether the predicted issue is likely to materialize and recommends preventive action (or concludes no action is needed).

Resource Scope¶

Kubernaut uses a label-based opt-in model. Only namespaces and resources with the kubernaut.ai/managed=true label are eligible for remediation. The Gateway validates this label before creating a RemediationRequest.

# Opt a namespace into Kubernaut management
kubectl label namespace my-app kubernaut.ai/managed=true

Workflow Catalog¶

Remediation workflows are packaged as OCI images containing a workflow-schema.yaml and stored in the DataStorage service as a searchable catalog. Each workflow has:

Identity — Name (metadata.name), version, and structured description (what, whenToUse, whenNotToUse, preconditions) under spec
Action type — Taxonomy type (e.g., RestartPod, RollbackDeployment, IncreaseMemoryLimits)
Labels — Signal name, severity, environment, component, priority (with wildcard and multi-value support)
Parameters — Typed inputs injected at runtime as environment variables (UPPER_SNAKE_CASE)
Execution config — Engine (job or tekton) and OCI bundle reference with digest

During investigation, the LLM selects a workflow through a three-step discovery protocol:

List action types — HolmesGPT calls DataStorage to retrieve available action types (e.g., RestartPod, RollbackDeployment), filtered by the signal's enriched labels (severity, environment, component, priority) and detected infrastructure labels (GitOps, Helm, service mesh)
List workflows for action type — The LLM picks an action type and retrieves matching workflows, which DataStorage returns ordered by label-match scoring (though scores are not exposed to the LLM)
Get workflow details — The LLM selects a specific workflow and retrieves its full parameter schema to fill in values from the root cause analysis

The LLM makes the final selection decision based on workflow descriptions (what, whenToUse, whenNotToUse), detected infrastructure context (e.g., prefer git-based workflows when gitOpsManaged=true), and remediation history (avoid workflows that recently failed on the same target). See Remediation Workflows for the full schema reference.

Next Steps¶

Signals & Alert Routing — How signals enter the system
Remediation Workflows — Writing your own workflows
Human Approval — Understanding the approval flow