Rego Policies¶
Kubernaut uses OPA Rego policies for two critical decision points: Signal Processing classification (severity, priority, environment, custom labels) and AI Analysis approval gates (whether human approval is required before execution).
All policies are deployed as ConfigMaps and can be customized. See SignalProcessing Rego Policies for provisioning details.
Policy Overview¶
| Policy File | Service | Purpose | Hot-Reload |
|---|---|---|---|
| policy.rego | Signal Processing | Unified classification: environment, severity, priority, custom labels (all rules in package signalprocessing) | Yes |
| approval.rego | AI Analysis | Decide if human approval is required for a remediation | Yes |
Complete input field reference
For a full listing of every input.* field available in each policy -- including types, descriptions, and usage examples for writing custom policies -- see the Rego Policy Reference.
Signal Processing Policies¶
Signal Processing uses a single unified policy.rego file under package signalprocessing (ADR-060). This file contains four rule groups that run during signal enrichment, before the signal reaches AI Analysis. Their output directly feeds into workflow discovery -- see Workflow Search and Scoring.
ConfigMap: signalprocessing-policy (single key: policy.rego)
Severity Rules¶
Normalizes the raw alert severity to one of Kubernaut's standard values.
Rule name: severity
Output: string -- one of critical, high, medium, low, unknown
Default mapping:
| Input (case-insensitive) | Output |
|---|---|
| critical, sev1, p0, p1, error | critical |
| high, sev2, p2, warning | high |
| medium, sev3 | medium |
| low, p3 | low |
| Anything else | unknown |
Input: input.signal.severity (the raw severity string from the alert source)
Severity determines workflow discoverability
The severity value produced by these rules feeds into Layer 1 mandatory label filtering in DataStorage. If the policy maps an alert to "unknown" and no workflow declares severity: ["unknown"] or severity: ["*"], no workflows will be found for that signal. Ensure your severity mappings cover every value your alert sources produce.
Customization example: To add support for a PagerDuty P0--P4 scheme, add rules in policy.rego:
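A minimal sketch of such a rule, under stated assumptions: the default mapping above already covers p0 through p3, so only P4 is added here, and mapping it to low is an illustrative choice rather than a Kubernaut default. The sketch also assumes the fallback to unknown is declared with the default keyword, so an extra rule body does not conflict with it.

```rego
# Hypothetical addition to policy.rego (package signalprocessing).
# Assumption: the fallback is written as `default severity := "unknown"`.
# Assumption: P4 should normalize to "low"; pick the value your workflows expect.
severity := "low" if {
    lower(input.signal.severity) == "p4"
}
```

Changing how an already-mapped value such as p3 is classified means editing the existing rule rather than adding a second one: two bodies of a complete rule that produce different values for the same input are an evaluation error in Rego.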
Priority Rules¶
Assigns a priority level (P0--P3) using a composite score from severity and environment. Priority rules can reference the severity and environment rules directly via Rego cross-rule references.
Rule name: priority
Output: {"priority": "P0"|"P1"|"P2"|"P3", "policy_name": "..."}
Scoring matrix:
| Severity | Score | Environment | Score |
|---|---|---|---|
| critical | 3 | production | 3 |
| warning / high | 2 | staging | 2 |
| info | 1 | development / test | 1 |
| Other | 0 | Other | 0 |
Namespace label tier=critical adds +3, tier=high adds +2 (highest wins).
Priority assignment: composite_score = severity_score + env_score
| Composite Score | Priority |
|---|---|
| >= 6 | P0 |
| 5 | P1 |
| 4 | P2 |
| < 4 | P3 |
Example: A critical alert in production scores 3 + 3 = 6, which maps to P0. A warning alert in development scores 2 + 1 = 3, which maps to P3.
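The scoring matrix and thresholds above can be sketched as a small standalone Rego module. This is an illustration, not the shipped policy: the package name, the lookup objects, and the functions composite_score and priority_for are hypothetical, and the namespace tier bonus is left out.

```rego
package priority_sketch

import rego.v1

# Scores copied from the scoring matrix; anything not listed scores 0.
severity_scores := {"critical": 3, "high": 2, "warning": 2, "info": 1}

env_scores := {"production": 3, "staging": 2, "development": 1, "test": 1}

# Hypothetical helper: composite score = severity score + environment score.
composite_score(sev, env) := object.get(severity_scores, sev, 0) + object.get(env_scores, env, 0)

# Hypothetical helper: map a composite score to a priority level.
# The four conditions are mutually exclusive, so the bodies never conflict.
priority_for(total) := "P0" if total >= 6

priority_for(total) := "P1" if total == 5

priority_for(total) := "P2" if total == 4

priority_for(total) := "P3" if total < 4
```

With these definitions, priority_for(composite_score("critical", "production")) evaluates to "P0" and priority_for(composite_score("warning", "development")) evaluates to "P3", matching the worked example.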
Cross-rule referencing: Priority rules can reference environment.environment and severity directly:
priority := {"priority": "P0", "policy_name": "production-critical"} if {
    environment.environment == "production"
    severity == "critical"
}
Environment Rules¶
Classifies the environment from namespace metadata. Used for workflow filtering, approval decisions, and cross-referenced by priority rules.
Rule name: environment
Output: {"environment": string, "source": string}
Resolution order (default policy):
- kubernaut.ai/environment namespace label (if present)
- Namespace name convention: production/prod → production, staging → staging, development/dev → development
- Default: "unknown"
Workload labels for cluster-scoped resources
The Rego input includes input.workload.labels from the target resource. Custom environment rules can use these for cluster-scoped resources (e.g., Nodes) where namespace labels are not available. The default policy does not use workload labels, but operators can extend it to classify environments based on workload metadata.
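As a sketch of such an extension, the rule below classifies a resource as production from a workload label when no namespace environment label is present. The label key example.com/environment and the source value workload-label are assumptions chosen for illustration, not Kubernaut conventions.

```rego
# Hypothetical rule for policy.rego (package signalprocessing).
# Assumption: target resources carry an operator-defined label
# "example.com/environment"; substitute the key your cluster uses.
environment := {"environment": "production", "source": "workload-label"} if {
    not input.namespace.labels["kubernaut.ai/environment"]
    input.workload.labels["example.com/environment"] == "production"
}
```

Because environment is a complete rule, this body must not match the same input as another body that yields a different value, so keep its conditions disjoint from the namespace-name rules.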
Customization: Add rules in policy.rego for your namespace naming conventions:
environment := {"environment": "production", "source": "namespace-name"} if {
    not input.namespace.labels["kubernaut.ai/environment"]
    endswith(input.namespace.name, "-prod")
}
Custom Labels Rules¶
Extracts operator-defined labels from namespace labels with the kubernaut.ai/label- prefix.
Rule name: labels
Output: map of key-value pairs (map[string][]string)
Example: A namespace with the labels kubernaut.ai/label-team=payments and kubernaut.ai/label-tier=gold produces: {"team": ["payments"], "tier": ["gold"]}
These labels feed into Layer 2 scoring at +0.15 per exact match. See Workflow Search and Scoring.
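One way to express this extraction in Rego is sketched below. The shipped labels rule may be written differently; the package name and the prefix constant are hypothetical, and the sketch assumes namespace labels arrive as input.namespace.labels.

```rego
package labels_sketch

import rego.v1

# Prefix that marks operator-defined labels.
prefix := "kubernaut.ai/label-"

# Build a map from the label name (prefix stripped) to a one-element list,
# matching the map[string][]string output shape.
labels[key] := [value] if {
    some full_key, value in input.namespace.labels
    startswith(full_key, prefix)
    key := trim_prefix(full_key, prefix)
}
```

Labels without the prefix are ignored, so only keys such as kubernaut.ai/label-team contribute to the result.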
Signal Mode Configuration¶
Signal mode is configured via YAML (not Rego). A mapping file determines which alert names are treated as proactive vs reactive.
ConfigMap: signalprocessing-proactive-signal-mappings
Default mappings:
proactive_signal_mappings:
  PredictedOOMKill: OOMKilled
  PredictedCPUThrottling: CPUThrottling
  PredictedDiskPressure: DiskPressure
  PredictedNodeNotReady: NodeNotReady
Signal names that match a key in this map are classified as proactive; all others default to reactive. The mapped value is the base signal name used for workflow catalog lookup.
Signal mode determines which prompt variant HolmesGPT uses during investigation (reactive: "Investigate the Incident" vs proactive: "Investigate the Anticipated Incident"). See Investigation Pipeline for details.
Signal mode mappings are not hot-reloaded
Unlike Rego policies, the proactive signal mappings are loaded once at startup. Changes require a pod restart.
AI Analysis Approval Policy¶
The approval policy runs after HAPI returns a successful workflow selection. It determines whether the remediation requires human approval or can proceed automatically.
Package: aianalysis.approval
ConfigMap: approval.rego mounted via charts/kubernaut/files/rego/aianalysis/approval.rego
Input Fields¶
| Field | Source | Description |
|---|---|---|
| input.environment | SP enrichment | production, staging, development, qa, test |
| input.confidence | HAPI SelectedWorkflow.Confidence | LLM confidence score (0.0--1.0) |
| input.confidence_threshold | Helm config (optional) | Overrides default 0.8 |
| input.remediation_target | HAPI RootCauseAnalysis.RemediationTarget | {kind, name, namespace} |
| input.detected_labels | HAPI PostRCAContext.DetectedLabels | Infrastructure characteristics |
| input.failed_detections | HAPI PostRCAContext.DetectedLabels.FailedDetections | Detection errors |
| input.warnings | HAPI investigation warnings | Array of warning strings |
Approval Rules¶
Three mandatory triggers:
- Missing remediation target -- If remediation_target is absent or has an empty kind, approval is always required. Safety net for incomplete RCA.
- Production environment -- All production remediations require human approval, regardless of confidence. Controlled by setting kubernaut.ai/environment=production on the namespace.
- Sensitive resource kinds -- Remediations targeting Node or StatefulSet resources always require approval, regardless of environment. These are high-impact resources where automated remediation carries elevated risk.
Non-production environments (development, staging, qa, test) auto-approve when remediation_target is present and the resource kind is not sensitive.
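These triggers can be checked with Rego unit tests before a policy change is rolled out. The sketch below assumes the decision is exposed as a boolean rule named require_approval in package aianalysis.approval (the name used in the customization examples); the test package name and the resource names are hypothetical.

```rego
package aianalysis.approval_test

import rego.v1

import data.aianalysis.approval

# Production always requires approval, even at high confidence.
test_production_requires_approval if {
    approval.require_approval with input as {
        "environment": "production",
        "confidence": 0.95,
        "remediation_target": {"kind": "Deployment", "name": "api", "namespace": "shop"}
    }
}

# Sensitive kinds require approval outside production too.
test_node_requires_approval_in_staging if {
    approval.require_approval with input as {
        "environment": "staging",
        "confidence": 0.95,
        "remediation_target": {"kind": "Node", "name": "worker-1"}
    }
}

# Non-production, non-sensitive target with a known kind auto-approves.
test_staging_deployment_auto_approves if {
    not approval.require_approval with input as {
        "environment": "staging",
        "confidence": 0.95,
        "remediation_target": {"kind": "Deployment", "name": "api", "namespace": "shop"}
    }
}
```

Running opa test against a directory that contains both approval.rego and this file evaluates every rule whose name starts with test_.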
Confidence Threshold¶
default confidence_threshold := 0.8

confidence_threshold := input.confidence_threshold if {
    input.confidence_threshold
}

is_high_confidence if {
    input.confidence >= confidence_threshold
}
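The default threshold is 0.8 (80%). It can be overridden through the Helm configuration, which supplies the value to the policy as input.confidence_threshold.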
The is_high_confidence helper is defined but not currently used in the approval rules. It is available for operators to add custom rules.
Risk Factors¶
Scored risk factors determine the human-readable reason shown in the RemediationApprovalRequest:
| Score | Condition | Reason |
|---|---|---|
| 90 | Missing remediation target | "Cannot determine remediation target" |
| 80 | Production + sensitive resource (Node, StatefulSet) | "Production environment with sensitive resource kind" |
| 70 | Production environment | "Production environment - requires manual approval" |
The highest-scoring factor becomes the approval reason. Scores affect the reason text only, not the approval decision.
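The selection of the highest-scoring reason can be sketched as follows. This illustrates the idea only and is not the shipped implementation: the package name, the missing_target helper, and the approval_reason rule are hypothetical, and ties are broken alphabetically purely to keep the result deterministic.

```rego
package risk_sketch

import rego.v1

# Hypothetical helper: target is absent or has an empty kind.
missing_target if not input.remediation_target.kind

missing_target if input.remediation_target.kind == ""

risk_factors contains {"score": 90, "reason": "Cannot determine remediation target"} if {
    missing_target
}

risk_factors contains {"score": 70, "reason": "Production environment - requires manual approval"} if {
    input.environment == "production"
}

# Pick the reason attached to the highest score; undefined when no factor applies.
approval_reason := reason if {
    top := max({f.score | some f in risk_factors})
    reasons := sort([f.reason | some f in risk_factors; f.score == top])
    reason := reasons[0]
}
```

For an input of {"environment": "production"} with no remediation target, both factors apply and the reason with score 90 is selected.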
Customization Examples¶
Require approval for StatefulSet remediations in staging:
require_approval if {
    input.environment == "staging"
    input.remediation_target.kind == "StatefulSet"
}

risk_factors contains {"score": 60, "reason": "Staging StatefulSet remediation requires approval"} if {
    input.environment == "staging"
    input.remediation_target.kind == "StatefulSet"
}
Require approval when confidence is below threshold in any environment:
require_approval if {
    not is_high_confidence
}

risk_factors contains {"score": 65, "reason": "Low confidence remediation requires approval"} if {
    not is_high_confidence
}
CRD Safety Policy¶
Automated modifications to CustomResourceDefinitions are high-risk operations: a CRD change cascades to every CR of that type across the cluster. The approval policy can gate CRD remediations so that they require human review.
Require approval for all CRD modifications:
require_approval if {
    input.remediation_target.kind == "CustomResourceDefinition"
}

is_sensitive_resource if {
    input.remediation_target.kind == "CustomResourceDefinition"
}

risk_factors contains {"score": 95, "reason": "CRD modification -- cascades to all CRs of this type"} if {
    input.remediation_target.kind == "CustomResourceDefinition"
}
Elevated risk for GitOps-managed CRDs:
When the LLM detects that the target resource is managed by ArgoCD or Flux, direct modification conflicts with the GitOps reconciliation loop. Combine remediation_target.kind with detected_labels for a stricter gate:
risk_factors contains {"score": 95, "reason": "CRD modification under GitOps management -- requires human approval"} if {
    input.remediation_target.kind == "CustomResourceDefinition"
    input.detected_labels.git_ops_managed == true
}
See the Rego Reference for the full list of detected_labels fields available in approval rules.
Deployment and Update¶
Where Policies Live¶
| Policy | Provisioning |
|---|---|
| SP unified policy (policy.rego) | User-provided via --set-file signalprocessing.policy=... or existingPolicyConfigMap |
| SP proactive signal mappings | User-provided via --set-file signalprocessing.proactiveSignalMappings.content=... or existingConfigMap |
| AA approval | User-provided via --set-file aianalysis.policies.content=... or existingConfigMap |
See SignalProcessing Rego Policies for full provisioning instructions and the example at charts/kubernaut/examples/signalprocessing-policy.rego.
Hot-Reload¶
The SP unified Rego policy and the AA approval policy support hot-reload via fsnotify file watchers:
- Update the policy file
- Update the ConfigMap (via helm upgrade, kubectl apply, or direct edit)
- Kubelet syncs the ConfigMap update to the pod (~60 seconds)
- fsnotify detects the file change and reloads the policy (<1 second)
- The new policy takes effect without pod restart
The reload is validated -- if the new policy has a syntax error, the previous policy is kept and an error is logged. No service interruption occurs.
Single-file reload granularity
Since all SP classification rules share one policy.rego file, any edit triggers a full reload of all rules. Structure your policy with clear section headers to make partial edits manageable. See SignalProcessing Rego Policies for the recommended file structure.
Next Steps¶
- Remediation Workflows -- How policies feed into workflow discovery and scoring
- Investigation Pipeline -- How the approval policy integrates with the investigation outcomes
- Human Approval -- What happens when approval is required
- Configuration Reference -- Other configurable aspects of Kubernaut