Async Propagation¶
Many Kubernetes environments use GitOps tools (ArgoCD, Flux) or operators that introduce propagation delays between a change being applied and that change taking full effect. Kubernaut accounts for these delays in its effectiveness assessment model.
The Problem¶
When Kubernaut patches a Deployment:
- The patch is applied to the Kubernetes API -- immediate
- ArgoCD detects drift and syncs -- 1-5 minutes
- The operator reconciles the new state -- seconds to minutes
- New pods roll out and become ready -- seconds to minutes
If the effectiveness assessment runs immediately after the patch, it may see an unhealthy state even though the fix is working -- it just hasn't propagated yet. Worse, computing a spec hash immediately would capture the pre-sync state rather than the final state, producing a false-positive drift detection.
Target Detection¶
The Remediation Orchestrator detects whether the remediation target requires propagation delays based on two independent characteristics:
| Characteristic | Detection Method | Source |
|---|---|---|
| GitOps-managed | PostRCAContext.DetectedLabels.GitOpsManaged=true (JSON field from HAPI response, not a Kubernetes object label) |
HAPI LabelDetector during post-RCA analysis |
| Custom Resource (CRD) | Target resource belongs to a non-built-in API group (detected via IsBuiltInGroup(group) allowlist — non-built-in groups are treated as CRDs) |
API group check, no operator/controller detection |
These flags are set during the AI Analysis phase and propagated to the EffectivenessAssessment CRD spec.
Delay Model¶
Kubernaut uses configurable propagation delays that are additive based on detected target characteristics. The Orchestrator computes the total propagation delay and sets it as HashComputeDelay on the EA spec.
| Parameter | Default | Applies When | Configurable Via |
|---|---|---|---|
gitOpsSyncDelay |
3 minutes | Target is GitOps-managed (ArgoCD/Flux) | remediationorchestrator.config.asyncPropagation.gitOpsSyncDelay |
operatorReconcileDelay |
1 minute | Target belongs to a non-built-in API group (CRD) | remediationorchestrator.config.asyncPropagation.operatorReconcileDelay |
stabilizationWindow |
5 minutes | Always (all targets) | remediationorchestrator.config.effectivenessAssessment.stabilizationWindow |
When Delays Apply¶
The propagation delay is computed from two independent flags (isGitOps, isCRD):
| Target Type | Propagation Delay | Total Wait Before Assessment |
|---|---|---|
| Sync target (direct patch, built-in API group) | 0 | stabilizationWindow |
| GitOps-managed | gitOpsSyncDelay |
gitOpsSyncDelay + stabilizationWindow |
| Custom Resource (non-built-in API group) | operatorReconcileDelay |
operatorReconcileDelay + stabilizationWindow |
| GitOps + CRD (both) | gitOpsSyncDelay + operatorReconcileDelay |
Both delays + stabilizationWindow |
The delays are additive — if a target is both GitOps-managed and belongs to a non-built-in API group, both delays compound. Setting either delay to 0 disables that stage.
Integration with Effectiveness Assessment¶
The propagation delay directly controls when the Effectiveness Monitor computes the spec hash and begins its assessment:
sequenceDiagram
participant RO as Orchestrator
participant EA as EM Controller
participant K8s as Kubernetes API
RO->>EA: Create EA CRD (HashComputeDelay=4m)
Note over EA: Phase: Pending → WaitingForPropagation
EA->>EA: Requeue after 4 minutes
Note over EA: ... propagation delay elapses ...
Note over EA: Phase: WaitingForPropagation → Stabilizing
EA->>EA: Compute derived timing fields
EA->>EA: Requeue after StabilizationWindow
Note over EA: ... stabilization window elapses ...
Note over EA: Phase: Stabilizing → Assessing
EA->>K8s: Compute post-remediation spec hash
EA->>EA: Health, alert, metrics checks
Note over EA: Phase: Assessing → Completed
Hash Compute Deferral (DD-EM-004)¶
When HashComputeDelay is set:
- The EM enters
WaitingForPropagationinstead of proceeding directly toStabilizing - It requeues with the remaining delay duration
- Once the delay elapses, it transitions to
Stabilizingand computes derived timing from the anchor time (creation + HashComputeDelay), not from creation time - The spec hash is computed after the propagation delay, capturing the fully propagated state
Validity Window Extension¶
If the total offset (propagation delay + stabilization + alert check delay) exceeds the configured validity window, the deadline is automatically extended to prevent premature expiry.
Tuning¶
For environments with faster or slower propagation:
# values.yaml
remediationorchestrator:
config:
effectivenessAssessment:
stabilizationWindow: "10m" # longer for slow rollouts
asyncPropagation:
gitOpsSyncDelay: "5m" # longer for slow ArgoCD sync
operatorReconcileDelay: "2m" # longer for complex operators
Tuning Guidelines¶
| Scenario | Recommendation |
|---|---|
| ArgoCD with polling (3-5m cycle) | gitOpsSyncDelay: 5m |
| ArgoCD with webhook (near-instant) | gitOpsSyncDelay: 30s |
| Flux with short reconciliation interval | gitOpsSyncDelay: 2m |
| Complex operator (e.g., database) | operatorReconcileDelay: 3m |
| Simple ConfigMap reload | operatorReconcileDelay: 30s |
| Fast local cluster (Kind, Minikube) | Both delays 0 |
Next Steps¶
- Effectiveness Assessment -- Full assessment model and scoring
- Configuration Reference -- All configurable parameters