Signal Processing¶

CRD Reference

For the complete SignalProcessing CRD specification, see API Reference: CRDs.

The Signal Processing controller transforms raw signals into enriched, classified data ready for AI analysis. It operates as a Kubernetes controller that watches SignalProcessing CRDs created by the Remediation Orchestrator.

CRD Specification¶

For the complete field specification, see SignalProcessing in the CRD Reference.

Condition Types¶

Condition	Phases	Meanings
`EnrichmentComplete`	Enriching → Classifying	K8s context gathered, custom labels evaluated
`ClassificationComplete`	Classifying → Categorizing	Environment, priority, severity, signal mode determined
`CategorizationComplete`	Categorizing → Completed	Business classification assigned
`ProcessingComplete`	Completed	All phases finished successfully
`Ready`	Completed	CRD is ready for consumption by the Orchestrator

Phase State Machine¶

stateDiagram-v2
    [*] --> Pending
    Pending --> Enriching
    Enriching --> Classifying
    Classifying --> Categorizing
    Classifying --> Failed : Severity policy error
    Categorizing --> Completed
    Enriching --> Enriching : Transient error (backoff)
    Classifying --> Classifying : Transient error (backoff)

Phase	Description	Requeue
Pending	CRD just created, set `StartTime`	100ms
Enriching	Gather Kubernetes context and custom labels	100ms
Classifying	Evaluate Rego policies for environment, priority, severity; determine signal mode via YAML lookup	100ms
Categorizing	Business classification from namespace labels + environment mapping	None
Completed	All results stored in status, `Ready=True`	None
Failed	Terminal -- severity policy error or unrecoverable failure	None

Each phase transition emits a Kubernetes event (EventReasonPhaseTransition) and records an audit trace.

Phase 1: Enriching¶

The enrichment phase gathers Kubernetes context about the target resource.

Owner Chain Resolution¶

The owner chain builder (ownerchain.NewBuilder) follows controller: true owner references up to a depth of 5 to find the top-level controlling resource:

Pod → ReplicaSet → Deployment
Pod → StatefulSet
Pod → DaemonSet
Pod → Job → CronJob

On error, a partial chain is returned (graceful degradation).

Context by Resource Kind¶

Target Kind	Context Gathered
Pod	Namespace + Pod details + Node info + Owner chain
Deployment	Namespace + Deployment details
StatefulSet	Namespace + StatefulSet details
DaemonSet	Namespace + DaemonSet details
ReplicaSet	Namespace + ReplicaSet details
Service	Namespace + Service details
Node	Node info only (no namespace)
Unknown	Namespace only

Namespace Context¶

Extracted from the target namespace (cached with a configurable TTL, default 5m):

Namespace name
Namespace labels (environment, team, tier, business-unit, etc.)
Namespace annotations

Custom Labels (Rego)¶

After K8s enrichment, the Rego engine evaluates custom label policies:

Query: data.signalprocessing.customlabels.labels
Input: Kubernetes context + signal metadata
Output: map[string][]string (subdomain → values)
Limits: Max 10 keys, 5 values per key, key length 63, value length 100
Reserved prefixes: kubernaut.ai/ and system/ are stripped (BR-SP-104)
Timeout: 5s

If Rego evaluation fails or yields no results, a fallback reads well-known namespace labels:

Namespace Label	Custom Label Key
`kubernaut.ai/team`	`team`
`kubernaut.ai/tier`	`tier`
`kubernaut.ai/cost-center`	`cost-center`
`kubernaut.ai/region`	`region`

Degraded Mode¶

When the target resource is not found (404 Not Found), the enricher activates degraded mode:

Sets KubernetesContext.DegradedMode = true
Returns partial context (namespace-level only for namespaced resources)
Falls back to signal labels and annotations for workload details
Emits DegradedMode condition reason
Processing continues -- a signal about a deleted resource still gets classified

Operational Details Not Captured¶

Pod conditions, container statuses, events, and resource requests/limits are not captured by Signal Processing. HolmesGPT fetches these on demand via kubectl during the AI investigation phase.

Phase 2: Classifying¶

The classification phase evaluates four classifiers in sequence. A failure in severity classification is fatal -- the CRD transitions to Failed and is not requeued.

1. Environment Classifier (Rego)¶

Query: data.signalprocessing.environment
Input: {namespace: {name, labels}, signal: {severity, type, source, labels}, workload: {kind, name, labels}}
Output: {environment, source}
Values: production, staging, development, test
Source tracking: namespace-labels, rego-inference, or default

2. Priority Engine (Rego)¶

Query: data.signalprocessing.priority
Input: {namespace: {name, labels}, signal: {severity, type, source, labels}, workload: {kind, name, labels}}
Output: {priority, policy_name}
Values: P0, P1, P2, P3
Timeout: 100ms

3. Severity Classifier (Rego)¶

Query: data.signalprocessing.severity
Input: {namespace: {name, labels}, signal: {severity, type, source, labels}, workload: {kind, name, labels}}
Output: Normalized severity string
Values: critical, high, medium, low, unknown
Fatal on failure: A severity policy error transitions the CRD to Failed with RegoEvaluationError

4. Signal Mode Classifier (YAML)¶

Signal mode is determined by a YAML configuration (proactive-signal-mappings.yaml, per BR-SP-106), not a Rego policy:

Input: Signal name
Logic: Lookup in proactive signal mappings. If found → proactive + base name; otherwise → reactive + original name
Output: {SignalMode, SignalName, SourceSignalName}

Mode	Meaning	Examples
Reactive	Active incident (default)	`KubePodCrashLooping`, `KubePodOOMKilled`
Proactive	Predictive alert	`PredictDiskFull`, `MemoryApproaching90Percent`

Signal mode determines which prompt variant HolmesGPT uses during the AI investigation, affecting how the investigation is framed (reactive diagnosis vs. proactive prevention).

Classification Output¶

On success, the status is updated with:

EnvironmentClassification (environment + source + timestamp)
PriorityAssignment (priority + source + policy name + timestamp)
Severity (normalized)
PolicyHash (SHA256 of the Rego policy for audit traceability)
SignalMode and SignalName / SourceSignalName

Phase 3: Categorizing¶

The categorization phase assigns business classification using pure Go logic — no Rego evaluation.

`classifyBusiness`¶

The function reads namespace labels directly:

Namespace Label	Field
`kubernaut.ai/business-unit`	`BusinessUnit`
`kubernaut.ai/team`	`BusinessUnit` (fallback when `business-unit` is absent)
`kubernaut.ai/service-owner`	`ServiceOwner`

If a label is absent, the corresponding field is left as an empty string (not "unknown").

Environment-to-Criticality/SLA Mapping¶

After label extraction, the classifier maps the environment (determined in Phase 2) to criticality and SLA:

Environment	Criticality	SLA
`production`, `prod`	high	gold
`staging`, `stage`	medium	silver
`development`, `dev`	low	bronze
(other)	medium	bronze

On completion, Phase=Completed, CompletionTime is set, ObservedGeneration is updated, and Ready=True.

Error Handling¶

Transient Errors¶

Transient errors trigger exponential backoff with the DD-SHARED-001 pattern. Rego evaluation errors are not transient — they are treated as permanent failures.

Detection: IsTimeout, IsServerTimeout, IsTooManyRequests, IsServiceUnavailable, context.DeadlineExceeded, context.Canceled (Kubernetes API errors only)
Backoff: Base 30s, multiplier 2x, max 5m, ±10% jitter
Formula: BasePeriod × (Multiplier ^ (failures - 1)) ± jitter
Tracking: ConsecutiveFailures incremented, LastFailureTime updated
Recovery: Reset to 0 on success

Permanent Errors¶

Severity policy failure: Transitions directly to Failed with RegoEvaluationError. No requeue.
Kubernetes API errors during enrichment or classification: Network timeouts, API server errors, and context cancellation during K8s API calls are retried (treated as transient). Rego evaluation errors are never retried.

Hot-Reload¶

All Rego policies support hot-reload via FileWatcher (DD-INFRA-001):

Mechanism: fsnotify watches the policy file path
Debounce: 200ms to coalesce rapid ConfigMap mount updates
Reload: Recompile Rego, swap prepared query under a mutex
Hash: SHA256 of new policy stored for audit traceability
Affected classifiers: Environment, Priority, Severity, Custom Labels

Deduplication¶

Deduplication is handled entirely at the Gateway level -- Signal Processing does not perform deduplication. See Gateway: Phase-Based Deduplication.

Data Flow¶

sequenceDiagram
    participant RO as Orchestrator
    participant SP as Signal Processing
    participant K8s as Kubernetes API
    participant OPA as Rego Engine
    participant DS as DataStorage

    RO->>K8s: Create SignalProcessing CRD
    Note over SP: Phase: Pending → Enriching
    SP->>K8s: Get target resource + owner chain
    SP->>K8s: Get namespace (cached)
    SP->>OPA: Evaluate custom labels policy
    Note over SP: Phase: Enriching → Classifying
    SP->>OPA: Environment policy
    SP->>OPA: Priority policy
    SP->>OPA: Severity policy
    SP->>SP: Signal mode (YAML lookup)
    Note over SP: Phase: Classifying → Categorizing
    SP->>SP: Business classification (namespace labels + environment mapping)
    Note over SP: Phase: Categorizing → Completed
    SP->>DS: Audit events
    RO->>RO: Detect SP Completed → create AIAnalysis

Handoff to Remediation Orchestrator¶

When Signal Processing reaches Completed with Ready=True, the Remediation Orchestrator:

Reads the enriched classification from the SP status
Creates an AIAnalysis CRD with the enriched signal data
Transitions the RemediationRequest from Processing to Analyzing

Next Steps¶

Gateway -- How signals enter the system
AI Analysis -- How the enriched signal is analyzed by HolmesGPT
Remediation Routing -- The Orchestrator's state machine
Rego Policies -- Writing and configuring classification policies