Remediation Routing¶

CRD Reference

For the complete RemediationRequest CRD specification, see API Reference: CRDs.

The Remediation Orchestrator is the central coordinator that drives the remediation lifecycle. It watches RemediationRequest CRDs created by the Gateway and routes them through the pipeline by creating child CRDs, monitoring their completion, enforcing timeouts, and evaluating routing conditions at every stage.

CRD Specification¶

RemediationRequest Spec¶

For the complete field specification, see RemediationRequest in the CRD Reference.

RemediationRequest Status¶

For the complete field specification, see RemediationRequest in the CRD Reference.

Phase State Machine¶

stateDiagram-v2
    [*] --> Pending
    Pending --> Blocked : Pre-analysis check fails
    Pending --> Processing : Pre-analysis passes → Create SP
    Processing --> Analyzing : SP completed → Create AIAnalysis
    Processing --> Failed : SP failed
    Processing --> TimedOut : Phase timeout
    Analyzing --> AwaitingApproval : Rego approval required → Create RAR
    Analyzing --> Executing : Auto-approved + post-analysis passes → Create WFE
    Analyzing --> Blocked : Post-analysis check fails
    Analyzing --> Completed : NoActionRequired / ManualReviewRequired
    Analyzing --> Failed : AI analysis failed
    Analyzing --> TimedOut : Phase timeout
    AwaitingApproval --> Executing : Approved → Create WFE
    AwaitingApproval --> Failed : Rejected
    AwaitingApproval --> TimedOut : Timeout
    Executing --> Verifying : WFE succeeded → Create EA
    Executing --> Failed : WFE failed
    Executing --> Skipped : Resource busy
    Executing --> TimedOut : Phase timeout
    Verifying --> Completed : EA completed
    Verifying --> Completed : Verification timeout (VerificationTimedOut)
    Blocked --> Pending : UnmanagedResource re-scoped
    Blocked --> Pending : DuplicateInProgress resolved
    Blocked --> Analyzing : ResourceBusy cleared
    Blocked --> Failed : Time-based block expired
    Failed --> Blocked : Consecutive failure threshold

Phase Reference¶

Phase	Terminal	Description
Pending	No	Awaiting pre-analysis routing checks
Processing	No	SignalProcessing in progress (enrichment, classification)
Analyzing	No	AIAnalysis in progress (RCA, workflow selection)
AwaitingApproval	No	Human approval required (RAR created)
Executing	No	WorkflowExecution running
Verifying	No	EffectivenessAssessment in progress
Blocked	No	Routing condition prevents progress
Completed	Yes	Remediation finished successfully
Failed	Yes	Remediation failed at any stage
TimedOut	Yes	Phase or global timeout exceeded
Skipped	Yes	Execution skipped (resource lock)
Cancelled	Yes	Manually cancelled by operator

Valid Transitions¶

Pending     → Processing, Blocked
Processing  → Analyzing, Failed, TimedOut
Analyzing   → AwaitingApproval, Executing, Completed, Failed, TimedOut, Blocked
AwaitingApproval → Executing, Failed, TimedOut
Executing   → Verifying, Failed, TimedOut, Skipped
Verifying   → Completed (includes VerificationTimedOut)
Blocked     → Failed, Analyzing, Pending
Failed      → Blocked

Source code ValidTransitions map

The ValidTransitions map in the source code omits Pending → Blocked and Analyzing → Blocked. The reconciler performs these transitions via direct status updates (routing engine sets the Blocked phase), bypassing the map check.
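As a rough illustration of the table and note above, a transition map and lookup could look like the following. This is a minimal sketch, not the project's source: the `Phase` type, the `validTransitions` variable, and the lowercase `canTransition` helper are hypothetical stand-ins, and the map deliberately leaves out Pending → Blocked and Analyzing → Blocked to mirror the omission described above.

```go
package main

import "fmt"

// Phase mirrors the RemediationRequest phase names used in this document.
type Phase string

const (
	Pending          Phase = "Pending"
	Processing       Phase = "Processing"
	Analyzing        Phase = "Analyzing"
	AwaitingApproval Phase = "AwaitingApproval"
	Executing        Phase = "Executing"
	Verifying        Phase = "Verifying"
	Blocked          Phase = "Blocked"
	Completed        Phase = "Completed"
	Failed           Phase = "Failed"
	TimedOut         Phase = "TimedOut"
	Skipped          Phase = "Skipped"
	Cancelled        Phase = "Cancelled"
)

// validTransitions encodes the Valid Transitions list, minus the two
// transitions into Blocked that the routing engine sets directly.
var validTransitions = map[Phase][]Phase{
	Pending:          {Processing},
	Processing:       {Analyzing, Failed, TimedOut},
	Analyzing:        {AwaitingApproval, Executing, Completed, Failed, TimedOut},
	AwaitingApproval: {Executing, Failed, TimedOut},
	Executing:        {Verifying, Failed, TimedOut, Skipped},
	Verifying:        {Completed},
	Blocked:          {Failed, Analyzing, Pending},
	Failed:           {Blocked},
}

// canTransition reports whether the map lists from → to.
func canTransition(from, to Phase) bool {
	for _, next := range validTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, Processing)) // listed in the map
	fmt.Println(canTransition(Pending, Blocked))    // not listed; set by the routing engine
}
```

A check like this only guards transitions that go through the map, which is why the direct status updates to Blocked are not rejected.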

Phase Handlers¶

Pending → Processing¶

Run pre-analysis routing checks (see Routing Checkpoints)
If blocked → transition to Blocked with BlockReason, BlockedUntil, RequeueAfter
If clear → create SignalProcessing CRD via spCreator.Create
Set SignalProcessingRef, ProcessingStartTime
Transition to Processing

Processing → Analyzing¶

Fetch SP child CRD
If SP Completed → create AIAnalysis CRD via aiAnalysisCreator.Create
Set AIAnalysisRef, AnalyzingStartTime
Transition to Analyzing
If SP Failed → transition to Failed with SP error details

Analyzing → (multiple outcomes)¶

The AI Analysis produces one of six outcomes:

AA Outcome	RR Action
Normal (workflow selected, auto-approved)	Run post-analysis checks → create WFE → Executing
ApprovalRequired (Rego policy mandate)	Create RemediationApprovalRequest → AwaitingApproval
WorkflowNotNeeded (issue already resolved)	Transition to Completed with `Outcome: NoActionRequired`
ManualReviewRequired (no workflow, KA flagged human review)	Transition to Completed with `Outcome: ManualReviewRequired`
HumanReviewWithWorkflow (KA flagged review + workflow present)	Transition to Failed (see ManualReviewRequired Outcome)
Failed (AI analysis error)	Transition to Failed

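For the normal path, the Orchestrator runs post-analysis routing checks before creating the WFE. If a check blocks, the RR enters Blocked; where it goes when the block clears depends on the block reason (for example, a cleared ResourceBusy block returns it to Analyzing; see Blocked → (re-evaluation)).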
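The outcome table can be read as a single dispatch step. The sketch below is illustrative only: the `aaOutcome` constants reuse the labels from the table (the real status fields may differ), and `decision` and `routeAnalysis` are hypothetical names. The Outcome for the Failed branches is left empty here because it depends on the failure path.

```go
package main

import "fmt"

// aaOutcome uses the labels from the outcome table; the actual
// AIAnalysis status representation is an assumption.
type aaOutcome string

const (
	normal                  aaOutcome = "Normal"
	approvalRequired        aaOutcome = "ApprovalRequired"
	workflowNotNeeded       aaOutcome = "WorkflowNotNeeded"
	manualReviewRequired    aaOutcome = "ManualReviewRequired"
	humanReviewWithWorkflow aaOutcome = "HumanReviewWithWorkflow"
	analysisFailed          aaOutcome = "Failed"
)

// decision describes what the Orchestrator does next for the RR.
type decision struct {
	NextPhase string // RR phase to transition to
	Outcome   string // RR outcome, when the table names one
	Creates   string // child CRD kind to create, if any
}

// routeAnalysis maps an AA outcome to the RR action from the table.
// postAnalysisBlocked is the result of the post-analysis routing checks.
func routeAnalysis(o aaOutcome, postAnalysisBlocked bool) decision {
	switch o {
	case normal:
		if postAnalysisBlocked {
			return decision{NextPhase: "Blocked"}
		}
		return decision{NextPhase: "Executing", Creates: "WorkflowExecution"}
	case approvalRequired:
		return decision{NextPhase: "AwaitingApproval", Creates: "RemediationApprovalRequest"}
	case workflowNotNeeded:
		return decision{NextPhase: "Completed", Outcome: "NoActionRequired"}
	case manualReviewRequired:
		return decision{NextPhase: "Completed", Outcome: "ManualReviewRequired"}
	case humanReviewWithWorkflow, analysisFailed:
		return decision{NextPhase: "Failed"}
	default:
		// Unknown outcomes are treated as failures in this sketch.
		return decision{NextPhase: "Failed"}
	}
}

func main() {
	fmt.Printf("%+v\n", routeAnalysis(normal, false))
	fmt.Printf("%+v\n", routeAnalysis(workflowNotNeeded, false))
}
```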

AwaitingApproval → Executing¶

Fetch RAR child CRD
If Approved → create WFE → transition to Executing
If Rejected or Expired → transition to Failed

Executing → Verifying¶

Delegates to weHandler.HandleStatus which monitors WFE phase
If WFE Completed → create EffectivenessAssessment CRD
Set EffectivenessAssessmentRef, VerificationDeadline
Transition to Verifying
If WFE Failed → transition to Failed (also creates EA for failure tracking)
If WFE Skipped → transition to Skipped

Verifying → Completed¶

Wait for EA to reach a terminal phase or VerificationDeadline to pass
EA completed → transition to Completed
Deadline exceeded → transition to Completed with Outcome: VerificationTimedOut

Blocked → (re-evaluation)¶

The Blocked phase behaves differently based on the block type:

Time-based blocks (ConsecutiveFailures, RecentlyRemediated, ExponentialBackoff): Check BlockedUntil. When expired → transition to Failed (terminal). Future RRs for the same signal can then proceed.
IneffectiveChain: Uses RequeueAfter only (no BlockedUntil). The handleBlockedPhase auto-expiry never fires for this condition — the block remains until a new RR arrives after the requeue interval, at which point it transitions to Failed with RequiresManualReview.
Event-based blocks:
- UnmanagedResource: Target gains kubernaut.ai/managed=true → Pending
- DuplicateInProgress: Original RR reaches terminal → Pending
- ResourceBusy: Blocking WFE completes → Analyzing

The Gateway treats Blocked RRs as active, preventing new RRs for the same fingerprint.
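The re-evaluation rules above can be summarised as a function of the block reason. This is a simplified sketch under stated assumptions: `blockState`, `reevaluate`, and the `eventCleared` flag are hypothetical, and the IneffectiveChain path (which resolves only when a new RR arrives) is not modelled, so it always stays Blocked here.

```go
package main

import (
	"fmt"
	"time"
)

// exampleNow is a fixed clock used only to keep the example deterministic.
var exampleNow = time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)

// blockState holds the fields this sketch needs from a Blocked RR.
type blockState struct {
	Reason       string
	BlockedUntil time.Time // zero for blocks that are not time-based
}

// reevaluate returns the phase the RR should be in after re-evaluation.
// eventCleared says whether the event-based condition has resolved.
func reevaluate(b blockState, now time.Time, eventCleared bool) string {
	switch b.Reason {
	case "ConsecutiveFailures", "RecentlyRemediated", "ExponentialBackoff":
		// Time-based: an expired block ends in terminal Failed.
		if !b.BlockedUntil.IsZero() && !now.Before(b.BlockedUntil) {
			return "Failed"
		}
	case "UnmanagedResource", "DuplicateInProgress":
		if eventCleared {
			return "Pending"
		}
	case "ResourceBusy":
		if eventCleared {
			return "Analyzing"
		}
	}
	// IneffectiveChain and any unresolved block stay Blocked.
	return "Blocked"
}

func main() {
	expired := blockState{Reason: "ConsecutiveFailures", BlockedUntil: exampleNow.Add(-time.Minute)}
	busy := blockState{Reason: "ResourceBusy"}
	fmt.Println(reevaluate(expired, exampleNow, false))
	fmt.Println(reevaluate(busy, exampleNow, true))
}
```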

Block-Reason Notifications (v1.3)¶

When an RR enters the Blocked phase, the Orchestrator creates a NotificationRequest for every block reason. Prior to v1.3, most block reasons were silent (no NR, at most a K8s event).

Block Reason	NR Type	Priority	Previously
`ConsecutiveFailures`	Escalation	High	Silent (K8s Warning event only)
`UnmanagedResource`	Escalation	High	Silent (no NR, no event)
`DuplicateInProgress`	StatusUpdate	Low	Silent (no NR, no event)
`ResourceBusy`	StatusUpdate	Low	Silent (no NR, no event)
`RecentlyRemediated`	StatusUpdate	Low	Silent (K8s Normal event only)
`ExponentialBackoff`	StatusUpdate	Low	Silent (K8s Normal event only)
`IneffectiveChain`	ManualReview	High	Already documented

NR naming: nr-block-<lowercased-reason>-<rr-name> (one NR per block reason per RR).
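For reference, the naming rule above amounts to a one-line helper. The function name `blockNotificationName` and the example RR name are hypothetical; only the `nr-block-<lowercased-reason>-<rr-name>` pattern comes from this page.

```go
package main

import (
	"fmt"
	"strings"
)

// blockNotificationName builds the deterministic NR name for a block reason.
func blockNotificationName(reason, rrName string) string {
	return fmt.Sprintf("nr-block-%s-%s", strings.ToLower(reason), rrName)
}

func main() {
	fmt.Println(blockNotificationName("ConsecutiveFailures", "rr-abc123"))
	// prints "nr-block-consecutivefailures-rr-abc123"
}
```

Because the name is deterministic, creating the same NR twice for one RR and reason collides on the name rather than producing a duplicate.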

Terminal Failure Escalation Notifications (v1.3)¶

Two paths create Escalation NRs for terminal failures:

transitionToFailed: Any failure path reaching terminal Failed without a prior ManualReview or Escalation NR creates an Escalation NR (nr-escalation-<rr-name>). This covers config errors, SP failures, approval timeouts, WFE ref corruption, hash errors, and other previously-silent failure paths.
transitionToFailedTerminal: When a blocked RR's cooldown expires and transitions to terminal Failed, an Escalation NR is created with the block reason in the body.

Both paths enforce a double-NR guard: if a ManualReview or Escalation NR already exists for the RR (e.g., from the WFE failure handler), no duplicate is created. The guard uses Get-before-Create on the deterministic NR name (nr-escalation-<rr-name>).
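The Get-before-Create guard can be sketched with an in-memory map standing in for the Kubernetes API. Everything here is illustrative: `nrStore` and `ensureEscalation` are hypothetical names, and looking up the ManualReview NR by the name `nr-manual-review-<rr-name>` is an assumption about how the guard detects an existing ManualReview NR.

```go
package main

import "fmt"

// nrStore is a hypothetical in-memory stand-in for the API server:
// NR name → NR type.
type nrStore map[string]string

// ensureEscalation creates nr-escalation-<rr-name> unless an Escalation
// or ManualReview NR already exists. It returns true when it created one.
func ensureEscalation(store nrStore, rrName string) bool {
	escalation := "nr-escalation-" + rrName
	manualReview := "nr-manual-review-" + rrName // assumed lookup name

	// Get before Create: any existing NR short-circuits the create.
	if _, exists := store[escalation]; exists {
		return false
	}
	if _, exists := store[manualReview]; exists {
		return false
	}
	store[escalation] = "Escalation"
	return true
}

func main() {
	store := nrStore{}
	fmt.Println(ensureEscalation(store, "rr-abc123")) // first call creates the NR
	fmt.Println(ensureEscalation(store, "rr-abc123")) // second call is a no-op
}
```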

Routing Checkpoints¶

The routing engine evaluates blocking conditions at two points. Checks are evaluated in order; the first blocking condition wins.

Pre-Analysis Checks (Pending → Processing)¶

Order	Check	Block Reason	Requeue After
1	Target not managed (`kubernaut.ai/managed`)	`UnmanagedResource`	5s–5m (exponential)
2	3+ consecutive failures for same fingerprint	`ConsecutiveFailures`	1 hour
3	Active RR with same fingerprint exists	`DuplicateInProgress`	30s
4	`NextAllowedExecution` in the future	`ExponentialBackoff`	Until expiry

Post-Analysis Checks (Analyzing → Executing)¶

Includes all pre-analysis checks plus:

Order	Check	Block Reason	Requeue After
5	Active WFE on same target resource	`ResourceBusy`	30s
6	Same workflow+target executed in last 5m	`RecentlyRemediated`	Remaining cooldown
7	3+ consecutive ineffective remediations	`IneffectiveChain`	4 hours
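The "first blocking condition wins" rule can be sketched as an ordered list of checks. This is an assumption-laden illustration: `routingInput`, `block`, `check`, and `firstBlock` are hypothetical names, the inputs are pre-computed booleans and durations rather than live cluster lookups, and the UnmanagedResource requeue uses only the 5s base rather than the full exponential schedule.

```go
package main

import (
	"fmt"
	"time"
)

// routingInput holds pre-computed facts about the RR and its target.
type routingInput struct {
	Managed             bool
	ConsecutiveFailures int
	DuplicateActive     bool
	BackoffRemaining    time.Duration // time left until NextAllowedExecution
	TargetBusy          bool
	CooldownRemaining   time.Duration // time left in the 5m workflow+target cooldown
	IneffectiveCount    int
}

// block is the result of a blocking check.
type block struct {
	Reason       string
	RequeueAfter time.Duration
}

// check pairs a condition with the block it produces.
type check struct {
	blocked bool
	result  block
}

// firstBlock evaluates checks in table order and returns the first hit.
func firstBlock(in routingInput, postAnalysis bool) (block, bool) {
	checks := []check{
		{!in.Managed, block{"UnmanagedResource", 5 * time.Second}},
		{in.ConsecutiveFailures >= 3, block{"ConsecutiveFailures", time.Hour}},
		{in.DuplicateActive, block{"DuplicateInProgress", 30 * time.Second}},
		{in.BackoffRemaining > 0, block{"ExponentialBackoff", in.BackoffRemaining}},
	}
	if postAnalysis {
		checks = append(checks,
			check{in.TargetBusy, block{"ResourceBusy", 30 * time.Second}},
			check{in.CooldownRemaining > 0, block{"RecentlyRemediated", in.CooldownRemaining}},
			check{in.IneffectiveCount >= 3, block{"IneffectiveChain", 4 * time.Hour}},
		)
	}
	for _, c := range checks {
		if c.blocked {
			return c.result, true
		}
	}
	return block{}, false
}

func main() {
	in := routingInput{Managed: true, ConsecutiveFailures: 3, DuplicateActive: true}
	b, blocked := firstBlock(in, false)
	fmt.Println(blocked, b.Reason, b.RequeueAfter)
}
```

In the usage example both ConsecutiveFailures and DuplicateInProgress would block, and the earlier check in the order is the one reported.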

Routing Configuration¶

Parameter	Default	Description
`ConsecutiveFailureThreshold`	3	Failures before blocking
`ConsecutiveFailureCooldown`	1 hour	How long to block
`RecentlyRemediatedCooldown`	5 minutes	Min interval between same workflow+target
`ExponentialBackoffBase`	60 seconds	Base for backoff calculation
`ExponentialBackoffMax`	10 minutes	Maximum backoff
`ExponentialBackoffMaxExponent`	4	Cap on exponent
`ScopeBackoffBase`	5 seconds	Unmanaged resource recheck base
`ScopeBackoffMax`	5 minutes	Unmanaged resource recheck max
`IneffectiveChainThreshold`	3	Consecutive ineffective before blocking
`RecurrenceCountThreshold`	5	Recurrence count escalation
`IneffectiveTimeWindow`	4 hours	Time window for ineffective chain
`RequeueResourceBusy`	30 seconds	Requeue interval for resource busy
`RequeueGenericError`	5 seconds	Requeue interval for generic errors
`NoActionRequiredDelayHours`	24 hours	Cooldown for `NoActionRequired` and for Completed `ManualReviewRequired` (path A); `0` opts out (see ManualReviewRequired)
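To show how the three ExponentialBackoff parameters might interact, here is a sketch that assumes the common doubling formula `base * 2^min(failures-1, maxExponent)`, capped at the maximum. The formula itself is an assumption; this page only gives the parameter names and defaults. `backoffDelay` and the constants are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// Defaults taken from the routing configuration table.
const (
	backoffBase        = 60 * time.Second
	backoffMax         = 10 * time.Minute
	backoffMaxExponent = 4
)

// backoffDelay returns the assumed delay after a number of consecutive
// failures: the base doubles per failure, the exponent is capped, and
// the result never exceeds backoffMax.
func backoffDelay(failures int) time.Duration {
	if failures < 1 {
		return 0
	}
	exp := failures - 1
	if exp > backoffMaxExponent {
		exp = backoffMaxExponent
	}
	delay := backoffBase << uint(exp) // backoffBase * 2^exp
	if delay > backoffMax {
		delay = backoffMax
	}
	return delay
}

func main() {
	for failures := 1; failures <= 6; failures++ {
		fmt.Printf("failures=%d delay=%s\n", failures, backoffDelay(failures))
	}
}
```

Under this assumed formula the defaults give 1m, 2m, 4m, and 8m for the first four failures; the fifth would be 16m and is clamped to the 10-minute maximum.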

NoActionRequired Suppression¶

When an RR completes with Outcome: NoActionRequired (the LLM determined no remediation was needed), the Orchestrator sets NextAllowedExecution to now + noActionRequiredDelay (default 24h, configurable via routing.noActionRequiredDelayHours; use 0 to opt out) on the completed RR. The Gateway's deduplication logic respects this field on terminal RRs -- any new signal with the same fingerprint is suppressed until the delay expires.

This prevents duplicate RR churn for signals whose underlying condition is unchanged by design (e.g., a DiskPressure alert for a PVC that the LLM correctly identified as not requiring automated action). Without this suppression, the same alert would generate a new RR on every AlertManager re-fire interval, each producing the same NoActionRequired outcome.

Set the delay to 0 to disable the cooldown. After a non-zero delay expires, a new RR can be created if the alert is still firing, allowing the LLM to re-evaluate.
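The suppression window described in this section reduces to two small calculations. The helpers `nextAllowedExecution` and `suppressed` are hypothetical and only illustrate the timing; the real Gateway deduplication also matches on fingerprint, which is omitted here.

```go
package main

import (
	"fmt"
	"time"
)

// completedAt is a fixed example completion time for the finished RR.
var completedAt = time.Date(2025, 1, 1, 12, 0, 0, 0, time.UTC)

// nextAllowedExecution returns completed + delayHours, or the zero time
// when the delay is 0 (cooldown disabled).
func nextAllowedExecution(completed time.Time, delayHours int) time.Time {
	if delayHours <= 0 {
		return time.Time{}
	}
	return completed.Add(time.Duration(delayHours) * time.Hour)
}

// suppressed reports whether a new signal arriving at now falls inside
// the cooldown recorded on the terminal RR.
func suppressed(nextAllowed, now time.Time) bool {
	return !nextAllowed.IsZero() && now.Before(nextAllowed)
}

func main() {
	next := nextAllowedExecution(completedAt, 24)
	fmt.Println(suppressed(next, completedAt.Add(6*time.Hour)))  // inside the 24h window
	fmt.Println(suppressed(next, completedAt.Add(25*time.Hour))) // window has expired
	fmt.Println(suppressed(nextAllowedExecution(completedAt, 0), completedAt)) // delay 0: disabled
}
```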

ManualReviewRequired Outcome¶

ManualReviewRequired can appear in three distinct paths. The phase and whether NextAllowedExecution is set depend on how the RR got there.

(A) Completed path (#550)¶

Phase: Completed
Outcome: ManualReviewRequired
RequiresManualReview: true
Cooldown: Same noActionRequiredDelay as NoActionRequired (default 24h, routing.noActionRequiredDelayHours; 0 to opt out). The Orchestrator sets NextAllowedExecution so the Gateway suppresses duplicate RRs while operators investigate.

Typical when AIAnalysis has NeedsHumanReview=true and SelectedWorkflow=nil (no catalog workflow, or KA requested human review without a selected workflow) — a successfully finished triage with human follow-up, not a pipeline failure. This path does not increment ConsecutiveFailureCount.

The Orchestrator still creates a NotificationRequest for the operator.

(B) Failed path¶

Transition: Outcome is set to ManualReviewRequired, then transitionToFailed moves the RR to Phase: Failed
Cooldown: None. NextAllowedExecution is not set, because the failure path does not apply the noActionRequiredDelay suppression

Triggered when human review is required in a failure context, including workflow rejection (e.g. approval denied), workflow resolution failure, or remediation target missing after execution concerns. These RRs are terminal failures and participate in failure metrics and backoff as designed for the failure phase.

(C) Blocked path (IneffectiveChain)¶

Phase: Blocked
Outcome: ManualReviewRequired (e.g. after ineffective-chain escalation)

The RR is blocked pending operator action; this is not the same as the Completed “triage only” path (A).

Summary¶

(A) Completed: manual review with optional duplicate suppression via noActionRequiredDelay / routing.noActionRequiredDelayHours (0 = off).
(B) Failed: manual review after a real failure; no matching cooldown.
(C) Blocked / IneffectiveChain: Blocked + ManualReviewRequired.

Low confidence WITH a selected workflow (AIAnalysis)

When NeedsHumanReview=true but SelectedWorkflow is present (the LLM selected a workflow but KA flagged the result for human review), the RR often follows a Failed-style path rather than the Completed (A) path. Check the current AA/RO behavior for your release when correlating with metrics.

Timeout System¶

The Orchestrator enforces both global and per-phase timeouts. Per-RR overrides are supported via TimeoutConfig.

Default Timeouts¶

Scope	Default	Description
Global	1 hour	Maximum time from Pending to terminal
Processing	5 minutes	Time for SP enrichment/classification
Analyzing	10 minutes	Time for AI analysis and workflow selection
AwaitingApproval	15 minutes	Time for human approval
Executing	30 minutes	Time for workflow execution
Verifying	30 minutes	Time for effectiveness assessment

Timeout Handling¶

Global timeout: Transitions the RR to TimedOut regardless of current phase, creates a notification
Phase timeout: Transitions to TimedOut with TimeoutPhase set to the phase that timed out
Verification deadline: A soft timeout -- the RR transitions to Completed with Outcome: VerificationTimedOut rather than TimedOut

Timeout defaults are populated on first reconcile via populateTimeoutDefaults. The effective timeout for each phase is resolved by getEffectivePhaseTimeout, which checks the per-RR override first, then falls back to controller defaults.
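The override-then-default resolution can be sketched as follows. This is not the body of getEffectivePhaseTimeout; `effectivePhaseTimeout`, the map-based override shape, and `exampleOverrides` are assumptions made for illustration, with the defaults copied from the table above.

```go
package main

import (
	"fmt"
	"time"
)

// defaultPhaseTimeouts mirrors the Default Timeouts table.
var defaultPhaseTimeouts = map[string]time.Duration{
	"Processing":       5 * time.Minute,
	"Analyzing":        10 * time.Minute,
	"AwaitingApproval": 15 * time.Minute,
	"Executing":        30 * time.Minute,
	"Verifying":        30 * time.Minute,
}

// exampleOverrides stands in for a per-RR TimeoutConfig (assumed shape).
var exampleOverrides = map[string]time.Duration{
	"Executing": 45 * time.Minute,
}

// effectivePhaseTimeout prefers a positive per-RR override and falls
// back to the controller default for the phase.
func effectivePhaseTimeout(phase string, overrides map[string]time.Duration) time.Duration {
	if d, ok := overrides[phase]; ok && d > 0 {
		return d
	}
	return defaultPhaseTimeouts[phase]
}

func main() {
	fmt.Println(effectivePhaseTimeout("Analyzing", nil))              // no override: default
	fmt.Println(effectivePhaseTimeout("Executing", exampleOverrides)) // per-RR override
}
```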

Child CRD Creation¶

The Orchestrator creates child CRDs at specific phase transitions:

Phase Transition	Child CRD	Name Pattern	Description
Pending → Processing	SignalProcessing	`sp-{rr.Name}`	Enrichment and classification
Processing → Analyzing	AIAnalysis	`ai-{rr.Name}`	RCA and workflow selection
Analyzing → AwaitingApproval	RemediationApprovalRequest	`rar-{rr.Name}`	Human approval gate
Analyzing/Approved → Executing	WorkflowExecution	`we-{rr.Name}`	Workflow execution
Executing → Verifying	EffectivenessAssessment	`ea-{rr.Name}`	Post-execution verification
Terminal phases	NotificationRequest	Various	Outcome notification
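The name patterns in the table are simple prefixes on the parent RR name. A minimal sketch, assuming a lookup keyed by kind; `childPrefixes` and `childName` are hypothetical names, and NotificationRequest is left out because its names vary by notification type.

```go
package main

import "fmt"

// childPrefixes maps each child CRD kind to its name prefix.
var childPrefixes = map[string]string{
	"SignalProcessing":           "sp",
	"AIAnalysis":                 "ai",
	"RemediationApprovalRequest": "rar",
	"WorkflowExecution":          "we",
	"EffectivenessAssessment":    "ea",
}

// childName returns the deterministic child name for a parent RR.
func childName(kind, rrName string) (string, error) {
	prefix, ok := childPrefixes[kind]
	if !ok {
		return "", fmt.Errorf("no name pattern for kind %q", kind)
	}
	return prefix + "-" + rrName, nil
}

func main() {
	name, err := childName("WorkflowExecution", "rr-abc123")
	fmt.Println(name, err)
}
```

Deterministic names make child creation idempotent: a retried reconcile that attempts the same create hits an existing object instead of producing a second child.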

Owner References¶

All child CRDs carry a controller owner reference to the parent RR, set via controllerutil.SetControllerReference. This enables cascade deletion -- when the parent RR is deleted, the Kubernetes garbage collector automatically cleans up all children.

EffectivenessAssessment is created with BlockOwnerDeletion: false to allow audit data to persist independently.

Reconciliation¶

The Orchestrator uses a single reconciler that watches all resource types in the pipeline:

RemediationRequest -- The parent resource
SignalProcessing -- Enrichment completion
AIAnalysis -- Analysis completion
RemediationApprovalRequest -- Approval decisions
WorkflowExecution -- Execution completion
EffectivenessAssessment -- Verification results
NotificationRequest -- Notification delivery

Each child CRD status change triggers a reconcile of the parent RR. The reconciler includes idempotency guards -- ObservedGeneration and phase validation via CanTransition -- to prevent duplicate transitions on retry.

Terminal Phase Actions¶

When an RR reaches a terminal phase:

NotificationRequest -- Created for Completed, Failed, and TimedOut outcomes. For Failed, transitionToFailed creates an Escalation NR (nr-escalation-<rr-name>) unless a ManualReview or Escalation NR already exists (double-NR guard)
Duplicate notification -- If DuplicateCount > 0, a bulk notification (nr-bulk-<rr-name>) is created for tracked duplicates
Consecutive failure update -- ConsecutiveFailureCount and NextAllowedExecution are updated for backoff calculation

Escalation Paths¶

Trigger	Escalation	Mechanism
Rego policy requires approval (environment, sensitive kind, confidence)	Human approval	RemediationApprovalRequest CRD
KA flags human review with selected workflow	Notification + Failed	ManualReview NR (`nr-manual-review-<rr-name>`)
WFE execution failure (v1.3)	ManualReview + Failed	ManualReview NR (`reviewSource=WorkflowExecution`, `priority=Critical`) before `transitionToFailed`
Failure at any stage (v1.3)	Escalation NR	`nr-escalation-<rr-name>` (double-NR guard)
No matching workflow	Team notification with RCA	ManualReview NR
Consecutive ineffective remediations	Manual review	`IneffectiveChain` block + ManualReview NR
3+ consecutive failures (v1.3)	Escalation NR + block	`nr-block-consecutivefailures-<rr-name>`
Blocked RR cooldown expiry (v1.3)	Escalation NR	`nr-escalation-<rr-name>` (block reason in body)

Handoff to Workflow Execution¶

When the Orchestrator creates a WorkflowExecution CRD:

The WFE spec includes the selected workflow ID, execution bundle, target resource, parameters, and confidence
The WFE controller picks up the CRD and begins dependency resolution and execution
The Orchestrator monitors WFE status and transitions accordingly

RO creates WFE → WFE validates spec → WFE resolves deps → WFE creates Job/PipelineRun/AWX Job → WFE reports status → RO transitions

Next Steps¶

Gateway -- How signals enter the system
Signal Processing -- Enrichment and classification pipeline
Workflow Execution -- How remediations are executed
AI Analysis -- How AI investigates and selects workflows