Remediation Routing¶
CRD Reference
For the complete RemediationRequest CRD specification, see API Reference: CRDs.
The Remediation Orchestrator is the central coordinator that drives the remediation lifecycle. It watches RemediationRequest CRDs created by the Gateway and routes them through the pipeline by creating child CRDs, monitoring their completion, enforcing timeouts, and evaluating routing conditions at every stage.
CRD Specification¶
RemediationRequest Spec¶
For the complete field specification, see RemediationRequest in the CRD Reference.
RemediationRequest Status¶
For the complete field specification, see RemediationRequest in the CRD Reference.
Phase State Machine¶
stateDiagram-v2
[*] --> Pending
Pending --> Blocked : Pre-analysis check fails
Pending --> Processing : Pre-analysis passes → Create SP
Processing --> Analyzing : SP completed → Create AIAnalysis
Processing --> Failed : SP failed
Processing --> TimedOut : Phase timeout
Analyzing --> AwaitingApproval : Rego approval required → Create RAR
Analyzing --> Executing : Auto-approved + post-analysis passes → Create WFE
Analyzing --> Blocked : Post-analysis check fails
Analyzing --> Completed : NoActionRequired / ManualReviewRequired
Analyzing --> Failed : AI analysis failed
Analyzing --> TimedOut : Phase timeout
AwaitingApproval --> Executing : Approved → Create WFE
AwaitingApproval --> Failed : Rejected
AwaitingApproval --> TimedOut : Timeout
Executing --> Verifying : WFE succeeded → Create EA
Executing --> Failed : WFE failed
Executing --> Skipped : Resource busy
Executing --> TimedOut : Phase timeout
Verifying --> Completed : EA completed
Verifying --> Completed : Verification timeout (VerificationTimedOut)
Blocked --> Pending : UnmanagedResource re-scoped
Blocked --> Pending : DuplicateInProgress resolved
Blocked --> Analyzing : ResourceBusy cleared
Blocked --> Failed : Time-based block expired
Failed --> Blocked : Consecutive failure threshold
Phase Reference¶
| Phase | Terminal | Description |
|---|---|---|
| Pending | No | Awaiting pre-analysis routing checks |
| Processing | No | SignalProcessing in progress (enrichment, classification) |
| Analyzing | No | AIAnalysis in progress (RCA, workflow selection) |
| AwaitingApproval | No | Human approval required (RAR created) |
| Executing | No | WorkflowExecution running |
| Verifying | No | EffectivenessAssessment in progress |
| Blocked | No | Routing condition prevents progress |
| Completed | Yes | Remediation finished successfully |
| Failed | Yes | Remediation failed at any stage |
| TimedOut | Yes | Phase or global timeout exceeded |
| Skipped | Yes | Execution skipped (resource lock) |
| Cancelled | Yes | Manually cancelled by operator |
Valid Transitions¶
Pending → Processing, Blocked
Processing → Analyzing, Failed, TimedOut
Analyzing → AwaitingApproval, Executing, Completed, Failed, TimedOut, Blocked
AwaitingApproval → Executing, Failed, TimedOut
Executing → Verifying, Failed, TimedOut, Skipped
Verifying → Completed (includes VerificationTimedOut)
Blocked → Failed, Analyzing, Pending
Failed → Blocked
Source code ValidTransitions map
The ValidTransitions map in the source code omits Pending → Blocked and Analyzing → Blocked. The reconciler performs these transitions via direct status updates (routing engine sets the Blocked phase), bypassing the map check.
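The transition list above can be expressed as a lookup table. The following is a minimal sketch, not the project's source: the Phase type, the validTransitions variable, and the canTransition helper are hypothetical names chosen for illustration, and the table here includes the two Blocked edges that the note says the real map omits.

```go
package main

import "fmt"

// Phase is a hypothetical string type for RemediationRequest phases.
type Phase string

const (
	Pending          Phase = "Pending"
	Processing       Phase = "Processing"
	Analyzing        Phase = "Analyzing"
	AwaitingApproval Phase = "AwaitingApproval"
	Executing        Phase = "Executing"
	Verifying        Phase = "Verifying"
	Blocked          Phase = "Blocked"
	Completed        Phase = "Completed"
	Failed           Phase = "Failed"
	TimedOut         Phase = "TimedOut"
	Skipped          Phase = "Skipped"
)

// validTransitions mirrors the Valid Transitions list in this section.
// Phases with no entry have no outgoing transitions in that list.
var validTransitions = map[Phase][]Phase{
	Pending:          {Processing, Blocked},
	Processing:       {Analyzing, Failed, TimedOut},
	Analyzing:        {AwaitingApproval, Executing, Completed, Failed, TimedOut, Blocked},
	AwaitingApproval: {Executing, Failed, TimedOut},
	Executing:        {Verifying, Failed, TimedOut, Skipped},
	Verifying:        {Completed},
	Blocked:          {Failed, Analyzing, Pending},
	Failed:           {Blocked},
}

// canTransition reports whether the edge from -> to appears in the table.
func canTransition(from, to Phase) bool {
	for _, p := range validTransitions[from] {
		if p == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(canTransition(Pending, Processing)) // true
	fmt.Println(canTransition(Completed, Pending))  // false
}
```

A table-driven check like this keeps the allowed edges in one place, which is why transitions that bypass it (as the note describes for Blocked) are worth documenting separately.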
Phase Handlers¶
Pending → Processing¶
- Run pre-analysis routing checks (see Routing Checkpoints)
- If blocked → transition to Blocked with BlockReason, BlockedUntil, RequeueAfter
- If clear → create SignalProcessing CRD via spCreator.Create
- Set SignalProcessingRef, ProcessingStartTime
- Transition to Processing
Processing → Analyzing¶
- Fetch SP child CRD
- If SP Completed → create AIAnalysis CRD via aiAnalysisCreator.Create
- Set AIAnalysisRef, AnalyzingStartTime
- Transition to Analyzing
- If SP Failed → transition to Failed with SP error details
Analyzing → (multiple outcomes)¶
The AI Analysis produces one of six outcomes:
| AA Outcome | RR Action |
|---|---|
| Normal (workflow selected, auto-approved) | Run post-analysis checks → create WFE → Executing |
| ApprovalRequired (Rego policy mandate) | Create RemediationApprovalRequest → AwaitingApproval |
| WorkflowNotNeeded (issue already resolved) | Transition to Completed with Outcome: NoActionRequired |
| ManualReviewRequired (no workflow, KA flagged human review) | Transition to Completed with Outcome: ManualReviewRequired |
| HumanReviewWithWorkflow (KA flagged review + workflow present) | Transition to Failed (see ManualReviewRequired Outcome) |
| Failed (AI analysis error) | Transition to Failed |
For the normal path, the Orchestrator runs post-analysis routing checks before creating the WFE. If blocked, the RR enters Blocked (returning to Analyzing when the block clears).
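The outcome table above amounts to a single dispatch on the AA outcome. The sketch below illustrates that mapping only; the aaOutcome type, its constants, and nextPhase are hypothetical names, and the real handler also creates child CRDs and notifications that are not modelled here.

```go
package main

import "fmt"

// aaOutcome is a hypothetical enum for the six AIAnalysis outcomes.
type aaOutcome int

const (
	outcomeNormal aaOutcome = iota
	outcomeApprovalRequired
	outcomeWorkflowNotNeeded
	outcomeManualReviewRequired
	outcomeHumanReviewWithWorkflow
	outcomeFailed
)

// nextPhase returns the RR phase that follows Analyzing for a given outcome.
// postAnalysisBlocked only matters on the normal (auto-approved) path.
func nextPhase(o aaOutcome, postAnalysisBlocked bool) string {
	switch o {
	case outcomeNormal:
		if postAnalysisBlocked {
			return "Blocked"
		}
		return "Executing"
	case outcomeApprovalRequired:
		return "AwaitingApproval"
	case outcomeWorkflowNotNeeded, outcomeManualReviewRequired:
		return "Completed"
	default: // outcomeHumanReviewWithWorkflow and outcomeFailed
		return "Failed"
	}
}

func main() {
	fmt.Println(nextPhase(outcomeNormal, false))
	fmt.Println(nextPhase(outcomeNormal, true))
	fmt.Println(nextPhase(outcomeHumanReviewWithWorkflow, false))
}
```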
AwaitingApproval → Executing¶
- Fetch RAR child CRD
- If Approved → create WFE → transition to Executing
- If Rejected or Expired → transition to Failed
Executing → Verifying¶
- Delegates to weHandler.HandleStatus, which monitors WFE phase
- If WFE Completed → create EffectivenessAssessment CRD
- Set EffectivenessAssessmentRef, VerificationDeadline
- Transition to Verifying
- If WFE Failed → transition to Failed (also creates EA for failure tracking)
- If WFE Skipped → transition to Skipped
Verifying → Completed¶
- Wait for EA to reach a terminal phase or VerificationDeadline to pass
- EA completed → transition to Completed
- Deadline exceeded → transition to Completed with Outcome: VerificationTimedOut
Blocked → (re-evaluation)¶
The Blocked phase behaves differently based on the block type:
- Time-based blocks (ConsecutiveFailures, RecentlyRemediated, ExponentialBackoff): Check BlockedUntil. When expired → transition to Failed (terminal). Future RRs for the same signal can then proceed.
- IneffectiveChain: Uses RequeueAfter only (no BlockedUntil). The handleBlockedPhase auto-expiry never fires for this condition -- the block remains until a new RR arrives after the requeue interval, at which point it transitions to Failed with RequiresManualReview.
- Event-based blocks:
  - UnmanagedResource: Target gains kubernaut.ai/managed=true → Pending
  - DuplicateInProgress: Original RR reaches terminal → Pending
  - ResourceBusy: Blocking WFE completes → Analyzing
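The re-evaluation rules for a Blocked RR can be summarised as a function of the block reason. This is a rough sketch under stated assumptions: blockState, its EventCleared flag, the t0 reference instant, and reevaluate are hypothetical, and the IneffectiveChain case simply stays Blocked because, as described, its exit is driven by a later RR rather than by auto-expiry.

```go
package main

import (
	"fmt"
	"time"
)

// blockState is a hypothetical view of the fields consulted while Blocked.
type blockState struct {
	Reason       string
	BlockedUntil time.Time // zero for event-based blocks and IneffectiveChain
	EventCleared bool      // assumption: true once the blocking condition is gone
}

// t0 is an arbitrary reference instant used by the usage example.
var t0 = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// reevaluate returns the phase the RR should be in after re-checking the block.
func reevaluate(b blockState, now time.Time) string {
	switch b.Reason {
	case "ConsecutiveFailures", "RecentlyRemediated", "ExponentialBackoff":
		// Time-based: expiry of BlockedUntil is terminal.
		if !b.BlockedUntil.IsZero() && !now.Before(b.BlockedUntil) {
			return "Failed"
		}
	case "UnmanagedResource", "DuplicateInProgress":
		if b.EventCleared {
			return "Pending"
		}
	case "ResourceBusy":
		if b.EventCleared {
			return "Analyzing"
		}
	}
	// IneffectiveChain, or nothing has changed yet.
	return "Blocked"
}

func main() {
	b := blockState{Reason: "ConsecutiveFailures", BlockedUntil: t0.Add(time.Hour)}
	fmt.Println(reevaluate(b, t0))                  // still inside the cooldown
	fmt.Println(reevaluate(b, t0.Add(2*time.Hour))) // cooldown expired
}
```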
The Gateway treats Blocked RRs as active, preventing new RRs for the same fingerprint.
Block-Reason Notifications (v1.3)¶
When an RR enters the Blocked phase, the Orchestrator creates a NotificationRequest for every block reason. Prior to v1.3, most block reasons were silent (no NR, at most a K8s event).
| Block Reason | NR Type | Priority | Previously |
|---|---|---|---|
| ConsecutiveFailures | Escalation | High | Silent (K8s Warning event only) |
| UnmanagedResource | Escalation | High | Silent (no NR, no event) |
| DuplicateInProgress | StatusUpdate | Low | Silent (no NR, no event) |
| ResourceBusy | StatusUpdate | Low | Silent (no NR, no event) |
| RecentlyRemediated | StatusUpdate | Low | Silent (K8s Normal event only) |
| ExponentialBackoff | StatusUpdate | Low | Silent (K8s Normal event only) |
| IneffectiveChain | ManualReview | High | Already documented |
NR naming: nr-block-<lowercased-reason>-<rr-name> (one NR per block reason per RR).
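The naming rule and the type/priority table can be captured in a few lines. This is an illustrative sketch: nrSpec, blockNotifications, and blockNRName are hypothetical names, and only the name format and table values come from this page.

```go
package main

import (
	"fmt"
	"strings"
)

// nrSpec holds the notification type and priority for a block reason.
type nrSpec struct {
	Type     string
	Priority string
}

// blockNotifications mirrors the block-reason table in this section.
var blockNotifications = map[string]nrSpec{
	"ConsecutiveFailures": {"Escalation", "High"},
	"UnmanagedResource":   {"Escalation", "High"},
	"DuplicateInProgress": {"StatusUpdate", "Low"},
	"ResourceBusy":        {"StatusUpdate", "Low"},
	"RecentlyRemediated":  {"StatusUpdate", "Low"},
	"ExponentialBackoff":  {"StatusUpdate", "Low"},
	"IneffectiveChain":    {"ManualReview", "High"},
}

// blockNRName builds nr-block-<lowercased-reason>-<rr-name>.
func blockNRName(reason, rrName string) string {
	return "nr-block-" + strings.ToLower(reason) + "-" + rrName
}

func main() {
	reason := "ConsecutiveFailures"
	spec := blockNotifications[reason]
	fmt.Println(blockNRName(reason, "rr-example"), spec.Type, spec.Priority)
}
```

Because the name is derived only from the reason and the RR name, repeated reconciles for the same block resolve to the same NR name.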
Terminal Failure Escalation Notifications (v1.3)¶
Two paths create Escalation NRs for terminal failures:
- transitionToFailed: Any failure path reaching terminal Failed without a prior ManualReview or Escalation NR creates an Escalation NR (nr-escalation-<rr-name>). This covers config errors, SP failures, approval timeouts, WFE ref corruption, hash errors, and other previously-silent failure paths.
- transitionToFailedTerminal: When a blocked RR's cooldown expires and the RR transitions to terminal Failed, an Escalation NR is created with the block reason in the body.
Both paths enforce a double-NR guard: if a ManualReview or Escalation NR already exists for the RR (e.g., from the WFE failure handler), no duplicate is created. The guard uses Get-before-Create on the deterministic NR name (nr-escalation-<rr-name>).
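The double-NR guard is a Get-before-Create check on deterministic names. The sketch below models it against an in-memory map rather than the Kubernetes API; nrStore and ensureEscalationNR are hypothetical, and looking up the ManualReview NR by the name nr-manual-review-<rr-name> is an assumption based on the naming used elsewhere on this page.

```go
package main

import "fmt"

// nrStore is a hypothetical in-memory stand-in for the API server,
// keyed by NotificationRequest name with the NR type as value.
type nrStore map[string]string

// ensureEscalationNR creates nr-escalation-<rr-name> unless an Escalation or
// ManualReview NR already exists for the RR. It reports whether it created one.
func ensureEscalationNR(store nrStore, rrName string) bool {
	escalation := "nr-escalation-" + rrName
	manualReview := "nr-manual-review-" + rrName // assumed name, see lead-in

	// Get before Create: any existing NR short-circuits the creation.
	if _, ok := store[escalation]; ok {
		return false
	}
	if _, ok := store[manualReview]; ok {
		return false
	}
	store[escalation] = "Escalation"
	return true
}

func main() {
	store := nrStore{}
	fmt.Println(ensureEscalationNR(store, "rr-example")) // true: created
	fmt.Println(ensureEscalationNR(store, "rr-example")) // false: already exists
}
```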
Routing Checkpoints¶
The routing engine evaluates blocking conditions at two points. Checks are evaluated in order; the first blocking condition wins.
Pre-Analysis Checks (Pending → Processing)¶
| Order | Check | Block Reason | Requeue After |
|---|---|---|---|
| 1 | Target not managed (kubernaut.ai/managed) | UnmanagedResource | 5s–5m (exponential) |
| 2 | 3+ consecutive failures for same fingerprint | ConsecutiveFailures | 1 hour |
| 3 | Active RR with same fingerprint exists | DuplicateInProgress | 30s |
| 4 | NextAllowedExecution in the future | ExponentialBackoff | Until expiry |
Post-Analysis Checks (Analyzing → Executing)¶
Includes all pre-analysis checks plus:
| Order | Check | Block Reason | Requeue After |
|---|---|---|---|
| 5 | Active WFE on same target resource | ResourceBusy | 30s |
| 6 | Same workflow+target executed in last 5m | RecentlyRemediated | Remaining cooldown |
| 7 | 3+ consecutive ineffective remediations | IneffectiveChain | 4 hours |
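The "first blocking condition wins" rule across the two checkpoints can be sketched as an ordered slice of predicates. All names here (rrFacts, check, preAnalysis, postAnalysis, firstBlock) are hypothetical, the facts are plain booleans and counters instead of live cluster lookups, and the thresholds are the documented defaults hard-coded for brevity.

```go
package main

import "fmt"

// rrFacts is a hypothetical snapshot of what the routing engine would look up.
type rrFacts struct {
	Managed             bool
	ConsecutiveFailures int
	DuplicateActive     bool
	BackoffActive       bool
	TargetBusy          bool
	RecentlyRemediated  bool
	IneffectiveCount    int
}

// check pairs a block reason with the predicate that triggers it.
type check struct {
	Reason  string
	Blocked func(rrFacts) bool
}

// preAnalysis lists checks 1-4 in evaluation order.
var preAnalysis = []check{
	{"UnmanagedResource", func(f rrFacts) bool { return !f.Managed }},
	{"ConsecutiveFailures", func(f rrFacts) bool { return f.ConsecutiveFailures >= 3 }},
	{"DuplicateInProgress", func(f rrFacts) bool { return f.DuplicateActive }},
	{"ExponentialBackoff", func(f rrFacts) bool { return f.BackoffActive }},
}

// postAnalysis is every pre-analysis check followed by checks 5-7.
var postAnalysis = append(append([]check{}, preAnalysis...),
	check{"ResourceBusy", func(f rrFacts) bool { return f.TargetBusy }},
	check{"RecentlyRemediated", func(f rrFacts) bool { return f.RecentlyRemediated }},
	check{"IneffectiveChain", func(f rrFacts) bool { return f.IneffectiveCount >= 3 }},
)

// firstBlock returns the first matching block reason, or "" when clear.
func firstBlock(checks []check, f rrFacts) string {
	for _, c := range checks {
		if c.Blocked(f) {
			return c.Reason
		}
	}
	return ""
}

func main() {
	f := rrFacts{Managed: true, TargetBusy: true}
	fmt.Printf("pre=%q post=%q\n", firstBlock(preAnalysis, f), firstBlock(postAnalysis, f))
}
```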
Routing Configuration¶
| Parameter | Default | Description |
|---|---|---|
| ConsecutiveFailureThreshold | 3 | Failures before blocking |
| ConsecutiveFailureCooldown | 1 hour | How long to block |
| RecentlyRemediatedCooldown | 5 minutes | Min interval between same workflow+target |
| ExponentialBackoffBase | 60 seconds | Base for backoff calculation |
| ExponentialBackoffMax | 10 minutes | Maximum backoff |
| ExponentialBackoffMaxExponent | 4 | Cap on exponent |
| ScopeBackoffBase | 5 seconds | Unmanaged resource recheck base |
| ScopeBackoffMax | 5 minutes | Unmanaged resource recheck max |
| IneffectiveChainThreshold | 3 | Consecutive ineffective before blocking |
| RecurrenceCountThreshold | 5 | Recurrence count escalation |
| IneffectiveTimeWindow | 4 hours | Time window for ineffective chain |
| RequeueResourceBusy | 30 seconds | Requeue interval for resource busy |
| RequeueGenericError | 5 seconds | Requeue interval for generic errors |
| NoActionRequiredDelayHours | 24 hours | Cooldown for NoActionRequired and for Completed ManualReviewRequired (path A); 0 opts out (see ManualReviewRequired) |
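To show how the three ExponentialBackoff parameters could interact, here is a sketch that assumes a conventional doubling scheme: delay = min(Base × 2^min(failures−1, MaxExponent), Max). This page does not state the exact formula, so treat the doubling and the failures−1 offset as assumptions; backoffConfig, defaults, and backoff are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// backoffConfig groups the three documented backoff parameters.
type backoffConfig struct {
	Base        time.Duration
	Max         time.Duration
	MaxExponent int
}

// defaults uses the default values from the Routing Configuration table.
var defaults = backoffConfig{Base: 60 * time.Second, Max: 10 * time.Minute, MaxExponent: 4}

// backoff returns the delay after the given number of consecutive failures,
// under the assumed doubling scheme described in the lead-in.
func backoff(cfg backoffConfig, failures int) time.Duration {
	if failures < 1 {
		return 0
	}
	exp := failures - 1
	if exp > cfg.MaxExponent {
		exp = cfg.MaxExponent // cap the exponent first
	}
	d := cfg.Base << uint(exp) // Base * 2^exp
	if d > cfg.Max {
		d = cfg.Max // then cap the resulting duration
	}
	return d
}

func main() {
	for n := 1; n <= 6; n++ {
		fmt.Println(n, backoff(defaults, n))
	}
}
```

Under these assumptions and the default values, the duration cap (10 minutes) is reached before the exponent cap would matter, since 60s × 2^4 is 16 minutes.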
NoActionRequired Suppression¶
When an RR completes with Outcome: NoActionRequired (the LLM determined no remediation was needed), the Orchestrator sets NextAllowedExecution to now + noActionRequiredDelay (default 24h, configurable via routing.noActionRequiredDelayHours; use 0 to opt out) on the completed RR. The Gateway's deduplication logic respects this field on terminal RRs -- any new signal with the same fingerprint is suppressed until the delay expires.
This prevents duplicate RR churn for signals whose underlying condition is unchanged by design (e.g., a DiskPressure alert for a PVC that the LLM correctly identified as not requiring automated action). Without this suppression, the same alert would generate a new RR on every AlertManager re-fire interval, each producing the same NoActionRequired outcome.
Set the delay to 0 to disable the cooldown. After a non-zero delay expires, a new RR can be created if the alert is still firing, allowing the LLM to re-evaluate.
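The suppression window boils down to one timestamp written by the Orchestrator and one comparison made by the Gateway. The sketch below is illustrative only: nextAllowedExecution, suppressed, and the t0 reference instant are hypothetical, and the real components operate on CRD status fields rather than plain values.

```go
package main

import (
	"fmt"
	"time"
)

// t0 is an arbitrary completion time used by the usage example.
var t0 = time.Date(2024, 1, 1, 12, 0, 0, 0, time.UTC)

// nextAllowedExecution returns completedAt plus the configured delay.
// ok is false when delayHours is 0, which opts out of the cooldown.
func nextAllowedExecution(completedAt time.Time, delayHours int) (next time.Time, ok bool) {
	if delayHours <= 0 {
		return time.Time{}, false
	}
	return completedAt.Add(time.Duration(delayHours) * time.Hour), true
}

// suppressed reports whether a new signal with the same fingerprint
// should be dropped at the given instant.
func suppressed(next time.Time, ok bool, now time.Time) bool {
	return ok && now.Before(next)
}

func main() {
	next, ok := nextAllowedExecution(t0, 24)
	fmt.Println(suppressed(next, ok, t0.Add(time.Hour)))    // inside the 24h window
	fmt.Println(suppressed(next, ok, t0.Add(25*time.Hour))) // after the window
}
```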
ManualReviewRequired Outcome¶
ManualReviewRequired can appear in three distinct paths. The phase and whether NextAllowedExecution is set depend on how the RR got there.
(A) Completed path (#550)¶
- Phase: Completed
- Outcome: ManualReviewRequired
- RequiresManualReview: true
- Cooldown: Same noActionRequiredDelay as NoActionRequired (default 24h, routing.noActionRequiredDelayHours; 0 to opt out). The Orchestrator sets NextAllowedExecution so the Gateway suppresses duplicate RRs while operators investigate.
Typical when AIAnalysis has NeedsHumanReview=true and SelectedWorkflow=nil (no catalog workflow, or KA requested human review without a selected workflow) — a successfully finished triage with human follow-up, not a pipeline failure. This path does not increment ConsecutiveFailureCount.
The Orchestrator still creates a NotificationRequest for the operator.
(B) Failed path¶
- Transition: Outcome: ManualReviewRequired then transitionToFailed → Phase: Failed
- No NextAllowedExecution / cooldown (failure path does not apply the noActionRequiredDelay suppression)
Triggered when human review is required in a failure context, including workflow rejection (e.g. approval denied), workflow resolution failure, or remediation target missing after execution concerns. These RRs are terminal failures and participate in failure metrics and backoff as designed for the failure phase.
(C) Blocked path (IneffectiveChain)¶
- Phase: Blocked
- Outcome: ManualReviewRequired (e.g. after ineffective-chain escalation)
The RR is blocked pending operator action; this is not the same as the Completed “triage only” path (A).
Summary¶
- (A) Completed: manual review with optional duplicate suppression via noActionRequiredDelay / routing.noActionRequiredDelayHours (0 = off).
- (B) Failed: manual review after a real failure; no matching cooldown.
- (C) Blocked / IneffectiveChain: Blocked + ManualReviewRequired.
Low confidence WITH a selected workflow (AIAnalysis)
When NeedsHumanReview=true but SelectedWorkflow is present (the LLM selected a workflow but KA flagged the result for human review), the RR often follows a Failed-style path rather than the Completed (A) path. Check the current AA/RO behavior for your release when correlating with metrics.
Timeout System¶
The Orchestrator enforces both global and per-phase timeouts. Per-RR overrides are supported via TimeoutConfig.
Default Timeouts¶
| Scope | Default | Description |
|---|---|---|
| Global | 1 hour | Maximum time from Pending to terminal |
| Processing | 5 minutes | Time for SP enrichment/classification |
| Analyzing | 10 minutes | Time for AI analysis and workflow selection |
| AwaitingApproval | 15 minutes | Time for human approval |
| Executing | 30 minutes | Time for workflow execution |
| Verifying | 30 minutes | Time for effectiveness assessment |
Timeout Handling¶
- Global timeout: Transitions the RR to TimedOut regardless of current phase, creates a notification
- Phase timeout: Transitions to TimedOut with TimeoutPhase set to the phase that timed out
- Verification deadline: A soft timeout -- the RR transitions to Completed with Outcome: VerificationTimedOut rather than TimedOut
Timeout defaults are populated on first reconcile via populateTimeoutDefaults. The effective timeout for each phase is resolved by getEffectivePhaseTimeout, which checks the per-RR override first, then falls back to controller defaults.
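The override-then-default resolution described above might look like the following. This is a sketch in the spirit of getEffectivePhaseTimeout, not its actual code: the overrides type, controllerDefaults, and effectivePhaseTimeout are hypothetical, and the real TimeoutConfig field layout is not shown on this page.

```go
package main

import (
	"fmt"
	"time"
)

// overrides maps a phase name to a timeout; it stands in for a per-RR TimeoutConfig.
type overrides map[string]time.Duration

// controllerDefaults holds the per-phase defaults from the table above.
var controllerDefaults = overrides{
	"Processing":       5 * time.Minute,
	"Analyzing":        10 * time.Minute,
	"AwaitingApproval": 15 * time.Minute,
	"Executing":        30 * time.Minute,
	"Verifying":        30 * time.Minute,
}

// effectivePhaseTimeout prefers a positive per-RR override and otherwise
// falls back to the controller default for the phase.
func effectivePhaseTimeout(perRR overrides, phase string) time.Duration {
	if d, ok := perRR[phase]; ok && d > 0 {
		return d
	}
	return controllerDefaults[phase]
}

func main() {
	perRR := overrides{"Executing": 45 * time.Minute}
	fmt.Println(effectivePhaseTimeout(perRR, "Executing")) // per-RR override
	fmt.Println(effectivePhaseTimeout(perRR, "Analyzing")) // controller default
}
```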
Child CRD Creation¶
The Orchestrator creates child CRDs at specific phase transitions:
| Phase Transition | Child CRD | Name Pattern | Description |
|---|---|---|---|
| Pending → Processing | SignalProcessing | sp-{rr.Name} | Enrichment and classification |
| Processing → Analyzing | AIAnalysis | ai-{rr.Name} | RCA and workflow selection |
| Analyzing → AwaitingApproval | RemediationApprovalRequest | rar-{rr.Name} | Human approval gate |
| Analyzing/Approved → Executing | WorkflowExecution | we-{rr.Name} | Workflow execution |
| Executing → Verifying | EffectivenessAssessment | ea-{rr.Name} | Post-execution verification |
| Terminal phases | NotificationRequest | Various | Outcome notification |
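The name patterns in the table are simple prefixes on the parent RR name. A small sketch, with childPrefix and childName as hypothetical helpers; NotificationRequest is left out because its names vary by notification type.

```go
package main

import "fmt"

// childPrefix maps a child CRD kind to its documented name prefix.
var childPrefix = map[string]string{
	"SignalProcessing":           "sp-",
	"AIAnalysis":                 "ai-",
	"RemediationApprovalRequest": "rar-",
	"WorkflowExecution":          "we-",
	"EffectivenessAssessment":    "ea-",
}

// childName derives the child name from the parent RR name.
// ok is false for kinds without a fixed pattern.
func childName(kind, rrName string) (name string, ok bool) {
	prefix, ok := childPrefix[kind]
	if !ok {
		return "", false
	}
	return prefix + rrName, true
}

func main() {
	name, _ := childName("WorkflowExecution", "rr-example")
	fmt.Println(name)
}
```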
Owner References¶
All child CRDs have a controllerReference pointing to the parent RR via controllerutil.SetControllerReference. This enables cascade deletion -- when the parent RR is garbage collected, all children are automatically cleaned up.
EffectivenessAssessment is created with BlockOwnerDeletion: false to allow audit data to persist independently.
Reconciliation¶
The Orchestrator uses a single reconciler that watches all resource types in the pipeline:
- RemediationRequest -- The parent resource
- SignalProcessing -- Enrichment completion
- AIAnalysis -- Analysis completion
- RemediationApprovalRequest -- Approval decisions
- WorkflowExecution -- Execution completion
- EffectivenessAssessment -- Verification results
- NotificationRequest -- Notification delivery
Each child CRD status change triggers a reconcile of the parent RR. The reconciler includes idempotency guards -- ObservedGeneration and phase validation via CanTransition -- to prevent duplicate transitions on retry.
Terminal Phase Actions¶
When an RR reaches a terminal phase:
- NotificationRequest -- Created for Completed, Failed, and TimedOut outcomes. For Failed, transitionToFailed creates an Escalation NR (nr-escalation-<rr-name>) unless a ManualReview or Escalation NR already exists (double-NR guard)
- Duplicate notification -- If DuplicateCount > 0, a bulk notification (nr-bulk-<rr-name>) is created for tracked duplicates
- Consecutive failure update -- ConsecutiveFailureCount and NextAllowedExecution are updated for backoff calculation
Escalation Paths¶
| Trigger | Escalation | Mechanism |
|---|---|---|
| Rego policy requires approval (environment, sensitive kind, confidence) | Human approval | RemediationApprovalRequest CRD |
| KA flags human review with selected workflow | Notification + Failed | ManualReview NR (nr-manual-review-<rr-name>) |
| WFE execution failure (v1.3) | ManualReview + Failed | ManualReview NR (reviewSource=WorkflowExecution, priority=Critical) before transitionToFailed |
| Failure at any stage (v1.3) | Escalation NR | nr-escalation-<rr-name> (double-NR guard) |
| No matching workflow | Team notification with RCA | ManualReview NR |
| Consecutive ineffective remediations | Manual review | IneffectiveChain block + ManualReview NR |
| 3+ consecutive failures (v1.3) | Escalation NR + block | nr-block-consecutivefailures-<rr-name> |
| Blocked RR cooldown expiry (v1.3) | Escalation NR | nr-escalation-<rr-name> (block reason in body) |
Handoff to Workflow Execution¶
When the Orchestrator creates a WorkflowExecution CRD:
- The WFE spec includes the selected workflow ID, execution bundle, target resource, parameters, and confidence
- The WFE controller picks up the CRD and begins dependency resolution and execution
- The Orchestrator monitors WFE status and transitions accordingly
RO creates WFE → WFE validates spec → WFE resolves deps → WFE creates Job/PipelineRun/AWX Job → WFE reports status → RO transitions
Next Steps¶
- Gateway -- How signals enter the system
- Signal Processing -- Enrichment and classification pipeline
- Workflow Execution -- How remediations are executed
- AI Analysis -- How AI investigates and selects workflows