Human Approval¶

Kubernaut supports human-in-the-loop approval gates to ensure that remediations are reviewed before execution when confidence is low or policy requires it.

CRD Reference

For the complete RemediationApprovalRequest CRD specification, see API Reference: CRDs.

When Approval Is Required¶

Approval is determined by a user-replaceable Rego policy evaluated during the AI Analysis phase. Operators control approval behavior by editing the Rego policy in the aianalysis-policies ConfigMap -- the policy is not hardcoded.

Default Policy Behavior¶

The shipped approval.rego makes decisions based on environment, remediation target presence, and sensitive resource kinds:

Production namespaces (kubernaut.ai/environment=production) — always require approval, regardless of confidence
Sensitive resource kinds (Node, StatefulSet) — always require approval, regardless of environment
Non-production namespaces (staging, development, qa, test) — auto-approved when remediation_target is present and the resource kind is not sensitive
Missing remediation target — always requires approval (default-deny safety per ADR-055)

When approval is required, a RemediationApprovalRequest CRD is created and the remediation enters the AwaitingApproval phase.

Custom Policies¶

The Rego policy receives a rich input context (see Rego Policy Evaluation below), including confidence, confidence_threshold, environment, detected_labels, remediation_target, custom_labels, and business_classification. Operators can write policies that use any combination of these inputs -- for example, a confidence-gated policy that auto-approves high-confidence analyses in production.

The default policy defines an is_high_confidence helper but does not use it in its approval rules. This is a building block for custom policies.

Confidence Thresholds¶

Kubernaut uses two confidence thresholds at different stages:

Stage	Threshold	Configurable	Purpose
Investigating (response processor)	0.7 (70%)	Not yet (v1.2, per BR-HAPI-198; see `pkg/aianalysis/handlers/response_processor.go`)	Rejects workflow selections with very low confidence; detects "problem already resolved" scenarios
Analyzing (Rego approval policy)	0.8 (default in Rego)	Yes, via Helm	Passed as `input.confidence_threshold` to the Rego policy

Configuring the Confidence Threshold¶

# values.yaml
aianalysis:
  rego:
    confidenceThreshold: 0.85

When confidenceThreshold is null (default), the Rego policy's built-in default of 0.8 applies.

Default policy does not use confidence for approval

The shipped approval.rego defines is_high_confidence but does not reference it in any approval rule. Setting confidenceThreshold has no effect unless you customize the policy to use is_high_confidence or input.confidence in your approval rules.

The Approval Flow¶

sequenceDiagram
    participant AA as AI Analysis
    participant RO as Orchestrator
    participant RAR as ApprovalRequest CRD
    participant Human as Operator
    participant WE as Workflow Execution

    AA->>RO: Analysis complete (confidence: 0.65)
    RO->>RAR: Create RemediationApprovalRequest
    Note over RO: Phase: AwaitingApproval
    Human->>RAR: kubectl patch (approve/reject)
    alt Approved
        RAR->>RO: Decision: Approved
        RO->>WE: Create WorkflowExecution
        Note over RO: Phase: Executing
    else Rejected
        RAR->>RO: Decision: Rejected
        Note over RO: Phase: Failed
    end

Approving or Rejecting¶

Operators approve or reject via kubectl. The RAR uses a Kubernetes status subresource, so the --subresource=status flag is required for patch commands.

# List pending approvals
kubectl get rar -n kubernaut-system

# Approve
kubectl patch rar <name> -n kubernaut-system \
  --subresource=status --type=merge \
  -p '{"status":{"decision":"Approved","decidedBy":"operator-name","decisionMessage":"RCA looks correct"}}'

# Reject
kubectl patch rar <name> -n kubernaut-system \
  --subresource=status --type=merge \
  -p '{"status":{"decision":"Rejected","decidedBy":"operator-name","decisionMessage":"Wrong root cause identified"}}'

Use --subresource=status

Omitting --subresource=status targets the main resource spec, which is immutable. The decision must be written to the status subresource for the Orchestrator to detect it.

Walkthrough: Reviewing a RemediationApprovalRequest¶

This section walks through a complete approval lifecycle.

1. List Pending RARs¶

$ kubectl get rar -n kubernaut-system
NAMESPACE          NAME                           AIANALYSIS                    CONFIDENCE   DECISION   EXPIRED   REQUIREDBY             AGE
kubernaut-system   rar-rr-b4f71c8b83b7-5e9c6266   ai-rr-b4f71c8b83b7-5e9c6266   0.85                              2026-03-10T01:13:44Z   44s

A blank DECISION column indicates a pending RAR awaiting operator action.

2. Inspect the RAR¶

kubectl get rar rar-rr-b4f71c8b83b7-5e9c6266 -n kubernaut-system -o yaml

apiVersion: kubernaut.ai/v1alpha1
kind: RemediationApprovalRequest
metadata:
  name: rar-rr-b4f71c8b83b7-5e9c6266
  namespace: kubernaut-system
  ownerReferences:
    - apiVersion: kubernaut.ai/v1alpha1
      blockOwnerDeletion: true
      controller: true
      kind: RemediationRequest
      name: rr-b4f71c8b83b7-5e9c6266
      uid: 0b6599e5-d8fa-4fef-8ae6-e118c6977372
spec:
  remediationRequestRef:
    apiVersion: kubernaut.ai/v1alpha1
    kind: RemediationRequest
    name: rr-b4f71c8b83b7-5e9c6266
    namespace: kubernaut-system
    uid: 0b6599e5-d8fa-4fef-8ae6-e118c6977372
  aiAnalysisRef:
    name: ai-rr-b4f71c8b83b7-5e9c6266
  confidence: 0.85
  confidenceLevel: high
  reason: "Production environment with sensitive resource kind - requires manual approval"
  investigationSummary: >-
    StatefulSet replica mismatch caused by missing StorageClass
    'broken-storage-class' preventing PVC provisioning for pod kv-store-2
  recommendedWorkflow:
    workflowId: f2ed351f-ccc5-41d2-bcf5-de999ed55528
    version: "1.0.2"
    executionBundle: quay.io/kubernaut-cicd/test-workflows/fix-statefulset-pvc-job@sha256:d5b0c73ab0fb4093e1e291974013660dd2e5d8e0b2016eaf94d78cf9759be500
    rationale: >-
      The workflow can recreate the missing PVC with proper StorageClass
      configuration to resolve the scheduling issue for kv-store-2
  evidenceCollected:
    - "StorageClass 'broken-storage-class' does not exist in the cluster"
    - "PVC kv-store-data-kv-store-2 is Pending with ProvisioningFailed event"
    - "StatefulSet kv-store has 2/3 ready replicas; pod kv-store-2 unschedulable"
  alternativesConsidered:
    - approach: "Manual PVC creation with existing StorageClass"
      prosCons: "Pros: Immediate fix. Cons: Diverges from volumeClaimTemplate; future scale-ups repeat the issue"
  recommendedActions:
    - action: "Review the recommended workflow and approve if appropriate"
      rationale: "Production environment with sensitive resource kind - requires manual approval"
  whyApprovalRequired: >-
    Production environment with sensitive resource kind - requires manual approval
  policyEvaluation:
    policyName: "aianalysis.approval"
    matchedRules:
      - "is_production"
    decision: "manual_review_required"
  requiredBy: "2026-03-10T01:13:44Z"
status:
  createdAt: "2026-03-10T00:58:44Z"
  timeRemaining: "14m29s"
  conditions:
    - type: ApprovalPending
      status: "True"
      lastTransitionTime: "2026-03-10T00:58:44Z"
      reason: AwaitingDecision
      message: "Awaiting decision, expires 2026-03-10T01:13:44Z"
    - type: ApprovalDecided
      status: "False"
      lastTransitionTime: "2026-03-10T00:58:44Z"
      reason: PendingDecision
      message: "No decision yet"
    - type: ApprovalExpired
      status: "False"
      lastTransitionTime: "2026-03-10T00:58:44Z"
      reason: NotExpired
      message: "Approval has not expired"

3. Review the Key Fields¶

Field	What to Check
`spec.investigationSummary`	Does the root cause analysis make sense?
`spec.recommendedWorkflow`	Is the proposed workflow appropriate for the problem?
`spec.evidenceCollected`	Does the evidence support the diagnosis?
`spec.alternativesConsidered`	Were alternative approaches reasonably dismissed?
`spec.confidence` / `confidenceLevel`	How confident is the system? (low < 0.6 < medium < 0.8 < high)
`spec.whyApprovalRequired`	What policy or condition triggered the approval gate?
`spec.policyEvaluation`	Which Rego policy rules matched?
`spec.recommendedActions`	What is the operator expected to do?
`spec.requiredBy` / `status.timeRemaining`	How much time is left before the RAR expires?

4. Approve or Reject¶

# Approve with rationale
kubectl patch rar rar-rr-b4f71c8b83b7-5e9c6266 -n kubernaut-system \
  --subresource=status --type=merge \
  -p '{"status":{"decision":"Approved","decidedBy":"jgil","decisionMessage":"RCA confirmed: StorageClass issue matches PVC events"}}'

# Or reject
kubectl patch rar rar-rr-b4f71c8b83b7-5e9c6266 -n kubernaut-system \
  --subresource=status --type=merge \
  -p '{"status":{"decision":"Rejected","decidedBy":"jgil","decisionMessage":"Wrong root cause — issue is node affinity, not StorageClass"}}'

Quick approval of the first pending RAR

To approve the first pending RAR without copy-pasting the name:

kubectl patch rar $(kubectl get rar -n kubernaut-system -o name | head -1) \
  -n kubernaut-system --type=merge --subresource=status \
  -p '{"status":{"decision":"Approved","decidedBy":"operator","decisionMessage":"Approved after review"}}'

5. Verify the Decision¶

$ kubectl get rar -n kubernaut-system
NAMESPACE          NAME                           AIANALYSIS                    CONFIDENCE   DECISION   EXPIRED   REQUIREDBY             AGE
kubernaut-system   rar-rr-b4f71c8b83b7-5e9c6266   ai-rr-b4f71c8b83b7-5e9c6266   0.85         Approved             2026-03-10T01:13:44Z   5m

After approval, the Orchestrator transitions the RemediationRequest from AwaitingApproval to Executing and creates a WorkflowExecution CRD.

Expiration¶

If no decision is made before requiredBy, the RAR expires:

status.expired is set to true
status.decision becomes Expired
status.decidedBy is set to system
The Orchestrator transitions the RemediationRequest to Failed

The default expiration window is 15 minutes, configurable per signal hierarchy via Helm values.

Rego Policy Evaluation¶

The approval decision is governed by the aianalysis.approval Rego policy. The policy receives the following input context:

Input Field	Source	Description
`confidence`	AI Analysis result	LLM confidence score (0.0--1.0)
`confidence_threshold`	Helm values (default: 0.8)	Auto-approval threshold
`environment`	Signal Processing classification	`production`, `staging`, `development`, `qa`, `test`
`detected_labels`	HAPI label detection	Infrastructure labels (e.g., `stateful`, `gitOpsManaged`)
`failed_detections`	HAPI label detection	Labels where detection failed
`remediation_target`	AI Analysis RCA result	Target resource `kind` and `name`
`warnings`	HAPI investigation	Non-fatal investigation warnings

The built-in policy behavior:

Production namespaces (kubernaut.ai/environment=production): Always require approval, regardless of confidence
Non-production namespaces: Auto-approved unless critical safety conditions (missing remediation_target)
Missing affected resource: Always requires approval (default-deny safety)

Operators can customize approval behavior by modifying the Rego policy in the aianalysis-policies ConfigMap.

Approval Context¶

When a RemediationApprovalRequest is created, it includes rich context to help the operator decide:

Investigation summary — The LLM's root cause analysis results
Recommended workflow — Which remediation was proposed and why
Confidence score and level — How confident the system is
Why approval is required — The policy reason for requiring human review
Evidence collected — Supporting data from the investigation
Alternatives considered — Other workflows the system evaluated

Audit Trail¶

All approval decisions are captured in the audit trail:

Who approved or rejected (operator identity via admission webhook)
When the decision was made
What reason was provided
The full context at the time of decision

See Audit & Observability for details.

Next Steps¶

Effectiveness Monitoring — How Kubernaut evaluates the outcome
Configuration Reference — Tuning confidence thresholds
Audit & Observability — Compliance and audit trails