Async Propagation

Many Kubernetes environments use GitOps tools (ArgoCD, Flux) or operators that introduce propagation delays between a change being applied and that change taking full effect. Kubernaut accounts for these delays in its effectiveness assessment model.

The Problem

When Kubernaut patches a Deployment:

1. The patch is applied to the Kubernetes API -- immediate
2. ArgoCD detects drift and syncs -- 1-5 minutes
3. The operator reconciles the new state -- seconds to minutes
4. New pods roll out and become ready -- seconds to minutes

If the effectiveness assessment runs immediately after the patch, it may observe an unhealthy state even though the fix is working and simply has not propagated yet. Worse, a spec hash computed immediately would capture the pre-sync state rather than the final state, so the later, fully propagated state would be reported as drift (a false positive).

Target Detection

The Remediation Orchestrator detects whether the remediation target requires propagation delays based on two independent characteristics:

| Characteristic | Detection Method | Source |
|---|---|---|
| GitOps-managed | `PostRCAContext.DetectedLabels.GitOpsManaged=true` (JSON field from HAPI response, not a Kubernetes object label) | HAPI `LabelDetector` during post-RCA analysis |
| Custom Resource (CRD) | Target resource belongs to a non-built-in API group (detected via the `IsBuiltInGroup(group)` allowlist; non-built-in groups are treated as CRDs) | API group check, no operator/controller detection |

These flags are set during the AI Analysis phase and propagated to the EffectivenessAssessment CRD spec.
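The two checks can be illustrated with a minimal Python sketch. It is not Kubernaut's implementation: the allowlist contents, the `gitOpsManaged` JSON key, and the helper names `is_built_in_group`, `TargetFlags`, and `detect_target_flags` are assumptions chosen to mirror the table above.

```python
from dataclasses import dataclass

# Hypothetical allowlist of built-in Kubernetes API groups. The real
# IsBuiltInGroup allowlist may differ. "" is the core group (Pods, ConfigMaps).
BUILT_IN_GROUPS = frozenset({
    "", "apps", "batch", "autoscaling", "policy",
    "networking.k8s.io", "rbac.authorization.k8s.io", "storage.k8s.io",
})


def is_built_in_group(group: str) -> bool:
    """Return True if the API group is on the (assumed) built-in allowlist."""
    return group in BUILT_IN_GROUPS


@dataclass(frozen=True)
class TargetFlags:
    is_gitops: bool
    is_crd: bool


def detect_target_flags(api_version: str, detected_labels: dict) -> TargetFlags:
    """Derive the two independent flags for a remediation target."""
    # "apps/v1" -> group "apps"; "v1" (core API) -> group "".
    group = api_version.rpartition("/")[0]
    return TargetFlags(
        # The JSON key name is an assumption; the source only names the
        # field as DetectedLabels.GitOpsManaged.
        is_gitops=bool(detected_labels.get("gitOpsManaged", False)),
        # Any group that is not on the allowlist is treated as a CRD.
        is_crd=not is_built_in_group(group),
    )
```

Note that the two flags never influence each other: a GitOps-managed Deployment sets only `is_gitops`, while an unmanaged custom resource sets only `is_crd`.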

Delay Model

Kubernaut uses configurable propagation delays that add up according to the detected target characteristics. The Remediation Orchestrator computes the total propagation delay and sets it as `HashComputeDelay` on the EffectivenessAssessment (EA) spec.

| Parameter | Default | Applies When | Configurable Via |
|---|---|---|---|
| `gitOpsSyncDelay` | 3 minutes | Target is GitOps-managed (ArgoCD/Flux) | `remediationorchestrator.config.asyncPropagation.gitOpsSyncDelay` |
| `operatorReconcileDelay` | 1 minute | Target belongs to a non-built-in API group (CRD) | `remediationorchestrator.config.asyncPropagation.operatorReconcileDelay` |
| `stabilizationWindow` | 5 minutes | Always (all targets) | `remediationorchestrator.config.effectivenessAssessment.stabilizationWindow` |

When Delays Apply

The propagation delay is computed from two independent flags (isGitOps, isCRD):

| Target Type | Propagation Delay | Total Wait Before Assessment |
|---|---|---|
| Synchronous target (direct patch, built-in API group) | 0 | `stabilizationWindow` |
| GitOps-managed | `gitOpsSyncDelay` | `gitOpsSyncDelay` + `stabilizationWindow` |
| Custom Resource (non-built-in API group) | `operatorReconcileDelay` | `operatorReconcileDelay` + `stabilizationWindow` |
| GitOps + CRD (both) | `gitOpsSyncDelay` + `operatorReconcileDelay` | `gitOpsSyncDelay` + `operatorReconcileDelay` + `stabilizationWindow` |

The delays are additive: if a target is both GitOps-managed and belongs to a non-built-in API group, both delays apply and are summed. Setting either delay to 0 disables that stage.
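As a worked example of the additive rule, the following Python sketch computes the propagation delay and the total wait from the two flags. The default values come from the Delay Model table; the function names `propagation_delay` and `total_wait` are illustrative and not part of Kubernaut.

```python
from datetime import timedelta

# Defaults taken from the Delay Model table.
GITOPS_SYNC_DELAY = timedelta(minutes=3)
OPERATOR_RECONCILE_DELAY = timedelta(minutes=1)
STABILIZATION_WINDOW = timedelta(minutes=5)


def propagation_delay(
    is_gitops: bool,
    is_crd: bool,
    gitops_sync_delay: timedelta = GITOPS_SYNC_DELAY,
    operator_reconcile_delay: timedelta = OPERATOR_RECONCILE_DELAY,
) -> timedelta:
    """Sum the delays whose flag is set; a delay of 0 disables that stage."""
    delay = timedelta(0)
    if is_gitops:
        delay += gitops_sync_delay
    if is_crd:
        delay += operator_reconcile_delay
    return delay


def total_wait(
    is_gitops: bool,
    is_crd: bool,
    stabilization_window: timedelta = STABILIZATION_WINDOW,
) -> timedelta:
    """Propagation delay plus the stabilization window, which always applies."""
    return propagation_delay(is_gitops, is_crd) + stabilization_window
```

With the defaults, a target that is both GitOps-managed and a custom resource gets a propagation delay of 3m + 1m = 4m and a total wait of 9m, while a directly patched built-in resource waits only the 5m stabilization window.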

Integration with Effectiveness Assessment

The propagation delay directly controls when the Effectiveness Monitor computes the spec hash and begins its assessment:

```mermaid
sequenceDiagram
    participant RO as Orchestrator
    participant EA as EM Controller
    participant K8s as Kubernetes API

    RO->>EA: Create EA CRD (HashComputeDelay=4m)
    Note over EA: Phase: Pending → WaitingForPropagation
    EA->>EA: Requeue after 4 minutes
    Note over EA: ... propagation delay elapses ...
    Note over EA: Phase: WaitingForPropagation → Stabilizing
    EA->>EA: Compute derived timing fields
    EA->>EA: Requeue after StabilizationWindow
    Note over EA: ... stabilization window elapses ...
    Note over EA: Phase: Stabilizing → Assessing
    EA->>K8s: Compute post-remediation spec hash
    EA->>EA: Health, alert, metrics checks
    Note over EA: Phase: Assessing → Completed
```

Hash Compute Deferral (DD-EM-004)

When HashComputeDelay is set:

1. The EM enters `WaitingForPropagation` instead of proceeding directly to `Stabilizing`
2. It requeues with the remaining delay duration
3. Once the delay elapses, it transitions to `Stabilizing` and computes derived timing from the anchor time (creation + `HashComputeDelay`), not from creation time
4. The spec hash is computed after the propagation delay, capturing the fully propagated state
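The deferral logic above can be sketched in Python as a single pure function that, given the current time, returns the phase and how long to requeue. This is a simplification under stated assumptions: the function name `next_step` and its return shape are hypothetical, and the real controller persists phases on the EA status rather than recomputing them.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple


def next_step(
    created_at: datetime,
    hash_compute_delay: timedelta,
    stabilization_window: timedelta,
    now: datetime,
) -> Tuple[str, timedelta]:
    """Return (phase, requeue_after) for an EA at time `now`."""
    # Derived timing is anchored at creation + HashComputeDelay,
    # not at creation time.
    anchor = created_at + hash_compute_delay
    if now < anchor:
        # Requeue with the remaining propagation delay only.
        return "WaitingForPropagation", anchor - now
    stabilized_at = anchor + stabilization_window
    if now < stabilized_at:
        return "Stabilizing", stabilized_at - now
    # The spec hash and the health checks run from here on.
    return "Assessing", timedelta(0)


created = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
phase, wait = next_step(
    created,
    hash_compute_delay=timedelta(minutes=4),
    stabilization_window=timedelta(minutes=5),
    now=created + timedelta(minutes=1),
)
print(phase, wait)
```

One minute after creation the sketch reports `WaitingForPropagation` with three minutes left, which matches the "remaining delay" behaviour in step 2. Anchoring the stabilization window at creation + `HashComputeDelay` means a late reconcile does not shorten the window.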

Validity Window Extension

If the total offset (propagation delay + stabilization + alert check delay) exceeds the configured validity window, the deadline is automatically extended to prevent premature expiry.
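A minimal Python sketch of this rule follows. The source does not say how far the deadline is extended, so the `margin` parameter, its default, and the function name `effective_validity_window` are assumptions made purely for illustration.

```python
from datetime import timedelta


def effective_validity_window(
    validity_window: timedelta,
    propagation_delay: timedelta,
    stabilization_window: timedelta,
    alert_check_delay: timedelta,
    margin: timedelta = timedelta(minutes=1),  # hypothetical headroom
) -> timedelta:
    """Extend the validity window when the total offset would exceed it."""
    total_offset = propagation_delay + stabilization_window + alert_check_delay
    if total_offset > validity_window:
        # Assumed policy: extend to the total offset plus some headroom so
        # the assessment does not expire before it can run.
        return total_offset + margin
    return validity_window
```

For example, with a 10m validity window, a 4m propagation delay, a 5m stabilization window, and a 2m alert check delay, the total offset is 11m, so this sketch would extend the window to 12m. A 30m window would be left unchanged.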

Tuning

For environments with faster or slower propagation:

```yaml
# values.yaml
remediationorchestrator:
  config:
    effectivenessAssessment:
      stabilizationWindow: "10m"    # longer for slow rollouts
    asyncPropagation:
      gitOpsSyncDelay: "5m"         # longer for slow ArgoCD sync
      operatorReconcileDelay: "2m"  # longer for complex operators
```

Tuning Guidelines

| Scenario | Recommendation |
|---|---|
| ArgoCD with polling (3-5m cycle) | `gitOpsSyncDelay: 5m` |
| ArgoCD with webhook (near-instant) | `gitOpsSyncDelay: 30s` |
| Flux with short reconciliation interval | `gitOpsSyncDelay: 2m` |
| Complex operator (e.g., database) | `operatorReconcileDelay: 3m` |
| Simple ConfigMap reload | `operatorReconcileDelay: 30s` |
| Fast local cluster (Kind, Minikube) | Both delays `0` |

Next Steps

- Effectiveness Assessment -- Full assessment model and scoring
- Configuration Reference -- All configurable parameters