Signal Processing

CRD Reference

For the complete SignalProcessing CRD specification, see API Reference: CRDs.

The Signal Processing controller transforms raw signals into enriched, classified data ready for AI analysis. It operates as a Kubernetes controller that watches SignalProcessing CRDs created by the Remediation Orchestrator.

CRD Specification

For the complete field specification, see SignalProcessing in the CRD Reference.

Condition Types

| Condition | Phases | Meaning |
|---|---|---|
| EnrichmentComplete | Enriching → Classifying | K8s context gathered, custom labels evaluated |
| ClassificationComplete | Classifying → Categorizing | Environment, priority, severity, signal mode determined |
| CategorizationComplete | Categorizing → Completed | Business classification assigned |
| ProcessingComplete | Completed | All phases finished successfully |
| Ready | Completed | CRD is ready for consumption by the Orchestrator |

Phase State Machine

stateDiagram-v2
    [*] --> Pending
    Pending --> Enriching
    Enriching --> Classifying
    Classifying --> Categorizing
    Classifying --> Failed : Severity policy error
    Categorizing --> Completed
    Enriching --> Enriching : Transient error (backoff)
    Classifying --> Classifying : Transient error (backoff)

| Phase | Description | Requeue |
|---|---|---|
| Pending | CRD just created, set StartTime | 100ms |
| Enriching | Gather Kubernetes context and custom labels | 100ms |
| Classifying | Evaluate Rego policies for environment, priority, severity; determine signal mode via YAML lookup | 100ms |
| Categorizing | Business classification from namespace labels + environment mapping | None |
| Completed | All results stored in status, Ready=True | None |
| Failed | Terminal -- severity policy error or unrecoverable failure | None |

Each phase transition emits a Kubernetes event (EventReasonPhaseTransition) and records an audit trace.
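
The edges of the state diagram can be written down as a small transition table. The following is a minimal sketch, not the controller's actual code: the `Phase` type, the constant names, and the helper functions are illustrative assumptions. Only the edges drawn in the diagram are encoded.

```go
package main

// Phase is an illustrative type; the string values mirror the phase names above.
type Phase string

const (
	PhasePending      Phase = "Pending"
	PhaseEnriching    Phase = "Enriching"
	PhaseClassifying  Phase = "Classifying"
	PhaseCategorizing Phase = "Categorizing"
	PhaseCompleted    Phase = "Completed"
	PhaseFailed       Phase = "Failed"
)

// allowedTransitions encodes the edges of the state diagram.
// Self-loops model a requeue after a transient error.
var allowedTransitions = map[Phase][]Phase{
	PhasePending:      {PhaseEnriching},
	PhaseEnriching:    {PhaseEnriching, PhaseClassifying},
	PhaseClassifying:  {PhaseClassifying, PhaseCategorizing, PhaseFailed},
	PhaseCategorizing: {PhaseCompleted},
}

// CanTransition reports whether the diagram contains an edge from -> to.
func CanTransition(from, to Phase) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

// IsTerminal reports whether a phase has no outgoing edges (no requeue).
func IsTerminal(p Phase) bool {
	return p == PhaseCompleted || p == PhaseFailed
}
```

A reconciler could call `CanTransition` before writing a new phase to status and skip requeueing once `IsTerminal` returns true.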

Phase 1: Enriching

The enrichment phase gathers Kubernetes context about the target resource.

Owner Chain Resolution

The owner chain builder (ownerchain.NewBuilder) follows controller: true owner references up to a depth of 5 to find the top-level controlling resource:

Pod → ReplicaSet → Deployment
Pod → StatefulSet
Pod → DaemonSet
Pod → Job → CronJob

On error, a partial chain is returned (graceful degradation).
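
The walk described above can be sketched as follows. This is an illustration under stated assumptions, not the `ownerchain` package itself: `OwnerRef`, `Object`, `Getter`, and `ErrNotFound` are simplified, hypothetical stand-ins for the Kubernetes metadata types and client lookups.

```go
package main

import "errors"

// OwnerRef is a simplified, hypothetical stand-in for a Kubernetes owner reference.
type OwnerRef struct {
	Kind, Name string
	Controller bool
}

// Object is a simplified, hypothetical stand-in for a Kubernetes object.
type Object struct {
	Kind, Name string
	Owners     []OwnerRef
}

// Getter is a hypothetical lookup function standing in for an API client.
type Getter func(kind, name string) (*Object, error)

// ErrNotFound is a hypothetical sentinel for a failed lookup.
var ErrNotFound = errors.New("object not found")

const maxOwnerDepth = 5

// BuildOwnerChain follows controller owner references upward from obj,
// stopping at maxOwnerDepth. On a lookup error it returns the chain
// collected so far together with the error (graceful degradation).
func BuildOwnerChain(obj *Object, get Getter) ([]OwnerRef, error) {
	var chain []OwnerRef
	current := obj
	for depth := 0; depth < maxOwnerDepth; depth++ {
		ref, ok := controllerOf(current)
		if !ok {
			return chain, nil // reached the top-level controller
		}
		chain = append(chain, ref)
		next, err := get(ref.Kind, ref.Name)
		if err != nil {
			return chain, err // partial chain
		}
		current = next
	}
	return chain, nil
}

// controllerOf returns the owner reference marked controller: true, if any.
func controllerOf(o *Object) (OwnerRef, bool) {
	for _, r := range o.Owners {
		if r.Controller {
			return r, true
		}
	}
	return OwnerRef{}, false
}
```

For a Pod owned by a ReplicaSet that is owned by a Deployment, the returned chain holds the ReplicaSet followed by the Deployment.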

Context by Resource Kind

| Target Kind | Context Gathered |
|---|---|
| Pod | Namespace + Pod details + Node info + Owner chain |
| Deployment | Namespace + Deployment details |
| StatefulSet | Namespace + StatefulSet details |
| DaemonSet | Namespace + DaemonSet details |
| ReplicaSet | Namespace + ReplicaSet details |
| Service | Namespace + Service details |
| Node | Node info only (no namespace) |
| Unknown | Namespace only |

Namespace Context

Extracted from the target namespace (cached with a configurable TTL, default 5m):

  • Namespace name
  • Namespace labels (environment, team, tier, business-unit, etc.)
  • Namespace annotations

Custom Labels (Rego)

After K8s enrichment, the Rego engine evaluates custom label policies:

  • Query: data.signalprocessing.customlabels.labels
  • Input: Kubernetes context + signal metadata
  • Output: map[string][]string (subdomain → values)
  • Limits: Max 10 keys, 5 values per key, key length 63, value length 100
  • Reserved prefixes: kubernaut.ai/ and system/ are stripped (BR-SP-104)
  • Timeout: 5s
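
The limits listed above could be enforced on the policy output roughly as shown here. This is a sketch under assumptions: the source does not say whether oversized entries are truncated or dropped, so this version drops keys that carry a reserved prefix or exceed the key length, drops over-long values, and keeps the first keys in sorted order so the result is deterministic. The function and constant names are hypothetical.

```go
package main

import (
	"sort"
	"strings"
)

const (
	maxKeys         = 10
	maxValuesPerKey = 5
	maxKeyLen       = 63
	maxValueLen     = 100
)

var reservedPrefixes = []string{"kubernaut.ai/", "system/"}

func hasReservedPrefix(key string) bool {
	for _, p := range reservedPrefixes {
		if strings.HasPrefix(key, p) {
			return true
		}
	}
	return false
}

// SanitizeCustomLabels applies the documented limits to a Rego result.
// Dropping (rather than truncating) offending entries is an assumption.
func SanitizeCustomLabels(in map[string][]string) map[string][]string {
	keys := make([]string, 0, len(in))
	for k := range in {
		if k == "" || len(k) > maxKeyLen || hasReservedPrefix(k) {
			continue
		}
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic choice of which keys survive the cap
	if len(keys) > maxKeys {
		keys = keys[:maxKeys]
	}
	out := make(map[string][]string, len(keys))
	for _, k := range keys {
		vals := make([]string, 0, maxValuesPerKey)
		for _, v := range in[k] {
			if len(vals) == maxValuesPerKey {
				break
			}
			if len(v) <= maxValueLen {
				vals = append(vals, v)
			}
		}
		if len(vals) > 0 {
			out[k] = vals
		}
	}
	return out
}
```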

If Rego evaluation fails or yields no results, a fallback reads well-known namespace labels:

| Namespace Label | Custom Label Key |
|---|---|
| kubernaut.ai/team | team |
| kubernaut.ai/tier | tier |
| kubernaut.ai/cost-center | cost-center |
| kubernaut.ai/region | region |
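
The fallback mapping above amounts to a fixed lookup table. A minimal sketch, assuming each namespace label contributes a single value and empty labels are skipped (the function name is hypothetical):

```go
package main

// fallbackLabelKeys maps well-known namespace labels to custom label keys,
// as listed in the table above.
var fallbackLabelKeys = map[string]string{
	"kubernaut.ai/team":        "team",
	"kubernaut.ai/tier":        "tier",
	"kubernaut.ai/cost-center": "cost-center",
	"kubernaut.ai/region":      "region",
}

// FallbackCustomLabels builds custom labels from namespace labels when
// Rego evaluation fails or yields nothing.
func FallbackCustomLabels(nsLabels map[string]string) map[string][]string {
	out := make(map[string][]string)
	for nsLabel, key := range fallbackLabelKeys {
		if v := nsLabels[nsLabel]; v != "" {
			out[key] = []string{v}
		}
	}
	return out
}
```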

Degraded Mode

When the target resource is not found (404 Not Found), the enricher activates degraded mode:

  • Sets KubernetesContext.DegradedMode = true
  • Returns partial context (namespace-level only for namespaced resources)
  • Falls back to signal labels and annotations for workload details
  • Emits DegradedMode condition reason
  • Processing continues -- a signal about a deleted resource still gets classified
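
The branch into degraded mode can be illustrated as below. This is a sketch only: the struct fields other than `DegradedMode`, the `Signal` type, the `fetch` callback, and the `ErrNotFound` sentinel (standing in for a 404 from the Kubernetes API) are all hypothetical.

```go
package main

import "errors"

// ErrNotFound is a hypothetical stand-in for a 404 Not Found API error.
var ErrNotFound = errors.New("target resource not found")

// KubernetesContext is a reduced, illustrative version of the enrichment result.
type KubernetesContext struct {
	Namespace      string
	WorkloadLabels map[string]string
	DegradedMode   bool
}

// Signal carries the metadata delivered with the incoming signal.
type Signal struct {
	Namespace string
	Labels    map[string]string
}

// Enrich fetches workload labels via fetch. If the target no longer exists,
// it switches to degraded mode and falls back to the signal's own labels,
// so processing can continue. Other errors are returned to the caller.
func Enrich(sig Signal, fetch func() (map[string]string, error)) (KubernetesContext, error) {
	kctx := KubernetesContext{Namespace: sig.Namespace}
	labels, err := fetch()
	switch {
	case err == nil:
		kctx.WorkloadLabels = labels
	case errors.Is(err, ErrNotFound):
		kctx.DegradedMode = true
		kctx.WorkloadLabels = sig.Labels
	default:
		return kctx, err
	}
	return kctx, nil
}
```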

Operational Details Not Captured

Pod conditions, container statuses, events, and resource requests/limits are not captured by Signal Processing. HolmesGPT fetches these on demand via kubectl during the AI investigation phase.

Phase 2: Classifying

The classification phase evaluates four classifiers in sequence. A failure in severity classification is fatal -- the CRD transitions to Failed and is not requeued.

1. Environment Classifier (Rego)

  • Query: data.signalprocessing.environment
  • Input: {namespace: {name, labels}, signal: {severity, type, source, labels}, workload: {kind, name, labels}}
  • Output: {environment, source}
  • Values: production, staging, development, test
  • Source tracking: namespace-labels, rego-inference, or default
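
The input document has the same shape for the environment, priority, and severity queries, so it can be modeled once. The Go types below are a hypothetical sketch of that shape with JSON field names taken from the input description; they are not the controller's actual types.

```go
package main

import "encoding/json"

// NamespaceInput, SignalInput, WorkloadInput, and PolicyInput are
// illustrative types mirroring the documented Rego input shape.
type NamespaceInput struct {
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels"`
}

type SignalInput struct {
	Severity string            `json:"severity"`
	Type     string            `json:"type"`
	Source   string            `json:"source"`
	Labels   map[string]string `json:"labels"`
}

type WorkloadInput struct {
	Kind   string            `json:"kind"`
	Name   string            `json:"name"`
	Labels map[string]string `json:"labels"`
}

type PolicyInput struct {
	Namespace NamespaceInput `json:"namespace"`
	Signal    SignalInput    `json:"signal"`
	Workload  WorkloadInput  `json:"workload"`
}

// EncodeInput renders the input as the JSON document a policy would see.
func EncodeInput(in PolicyInput) (string, error) {
	b, err := json.Marshal(in)
	if err != nil {
		return "", err
	}
	return string(b), nil
}
```

Printing the encoded input is a convenient way to test a policy by hand against a known document.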

2. Priority Engine (Rego)

  • Query: data.signalprocessing.priority
  • Input: {namespace: {name, labels}, signal: {severity, type, source, labels}, workload: {kind, name, labels}}
  • Output: {priority, policy_name}
  • Values: P0, P1, P2, P3
  • Timeout: 100ms

3. Severity Classifier (Rego)

  • Query: data.signalprocessing.severity
  • Input: {namespace: {name, labels}, signal: {severity, type, source, labels}, workload: {kind, name, labels}}
  • Output: Normalized severity string
  • Values: critical, high, medium, low, unknown
  • Fatal on failure: A severity policy error transitions the CRD to Failed with RegoEvaluationError

4. Signal Mode Classifier (YAML)

Signal mode is determined by a YAML configuration (proactive-signal-mappings.yaml, per BR-SP-106), not a Rego policy:

  • Input: Signal name
  • Logic: Lookup in proactive signal mappings. If found → proactive + base name; otherwise → reactive + original name
  • Output: {SignalMode, SignalName, SourceSignalName}

| Mode | Meaning | Examples |
|---|---|---|
| Reactive | Active incident (default) | KubePodCrashLooping, KubePodOOMKilled |
| Proactive | Predictive alert | PredictDiskFull, MemoryApproaching90Percent |
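
The lookup logic is small enough to sketch directly. The result type and function below are illustrative; the mapping is passed in as a plain map because the layout of proactive-signal-mappings.yaml is not shown here. Leaving `SourceSignalName` empty for reactive signals is an assumption.

```go
package main

// SignalModeResult mirrors the documented output fields.
type SignalModeResult struct {
	SignalMode       string
	SignalName       string
	SourceSignalName string
}

// ClassifySignalMode looks the signal name up in the proactive mappings
// (proactive name -> base name). A hit yields proactive mode with the base
// name; a miss yields reactive mode with the original name.
func ClassifySignalMode(name string, proactive map[string]string) SignalModeResult {
	if base, ok := proactive[name]; ok {
		return SignalModeResult{
			SignalMode:       "proactive",
			SignalName:       base,
			SourceSignalName: name,
		}
	}
	return SignalModeResult{SignalMode: "reactive", SignalName: name}
}
```

With a hypothetical mapping entry `PredictDiskFull → DiskFull`, the signal `PredictDiskFull` is classified as proactive with signal name `DiskFull`, while `KubePodCrashLooping` stays reactive under its original name.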

Signal mode determines which prompt variant HolmesGPT uses during the AI investigation, affecting how the investigation is framed (reactive diagnosis vs. proactive prevention).

Classification Output

On success, the status is updated with:

  • EnvironmentClassification (environment + source + timestamp)
  • PriorityAssignment (priority + source + policy name + timestamp)
  • Severity (normalized)
  • PolicyHash (SHA256 of the Rego policy for audit traceability)
  • SignalMode and SignalName / SourceSignalName

Phase 3: Categorizing

The categorization phase assigns business classification using pure Go logic, with no Rego evaluation.

classifyBusiness

The function reads namespace labels directly:

| Namespace Label | Field |
|---|---|
| kubernaut.ai/business-unit | BusinessUnit |
| kubernaut.ai/team | BusinessUnit (fallback when business-unit is absent) |
| kubernaut.ai/service-owner | ServiceOwner |

If a label is absent, the corresponding field is left as an empty string (not "unknown").
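
The label extraction can be sketched as follows. This is not the actual `classifyBusiness` function; the struct and function names are illustrative, and only the label keys and the fallback rule come from the description above.

```go
package main

// BusinessClassification holds the two fields filled from namespace labels.
type BusinessClassification struct {
	BusinessUnit string
	ServiceOwner string
}

// businessFromLabels reads the namespace labels. BusinessUnit falls back to
// the team label only when the business-unit label is absent. Missing labels
// leave the field as an empty string.
func businessFromLabels(labels map[string]string) BusinessClassification {
	bu, ok := labels["kubernaut.ai/business-unit"]
	if !ok {
		bu = labels["kubernaut.ai/team"]
	}
	return BusinessClassification{
		BusinessUnit: bu,
		ServiceOwner: labels["kubernaut.ai/service-owner"],
	}
}
```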

Environment-to-Criticality/SLA Mapping

After label extraction, the classifier maps the environment (determined in Phase 2) to criticality and SLA:

| Environment | Criticality | SLA |
|---|---|---|
| production, prod | high | gold |
| staging, stage | medium | silver |
| development, dev | low | bronze |
| (other) | medium | bronze |
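
The mapping table translates directly into a switch. In this sketch the function name is hypothetical and matching is exact and case-sensitive, which is an assumption the source does not confirm.

```go
package main

// criticalityAndSLA maps the environment from Phase 2 to criticality and SLA,
// following the table above. Unrecognized environments get the default row.
func criticalityAndSLA(env string) (criticality, sla string) {
	switch env {
	case "production", "prod":
		return "high", "gold"
	case "staging", "stage":
		return "medium", "silver"
	case "development", "dev":
		return "low", "bronze"
	default:
		return "medium", "bronze"
	}
}
```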

On completion, Phase=Completed, CompletionTime is set, ObservedGeneration is updated, and Ready=True.

Error Handling

Transient Errors

Transient errors trigger exponential backoff following the DD-SHARED-001 pattern. Rego evaluation errors are not transient; they are treated as permanent failures.

  • Detection: IsTimeout, IsServerTimeout, IsTooManyRequests, IsServiceUnavailable, context.DeadlineExceeded, context.Canceled (Kubernetes API errors only)
  • Backoff: Base 30s, multiplier 2x, max 5m, ±10% jitter
  • Formula: BasePeriod × (Multiplier ^ (failures - 1)) ± jitter
  • Tracking: ConsecutiveFailures incremented, LastFailureTime updated
  • Recovery: Reset to 0 on success
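
The backoff formula can be worked through in code. This is a sketch, not the DD-SHARED-001 implementation: the function names are hypothetical, and applying the 5m cap before the jitter (so a capped delay can still vary by ±10%) is an assumption.

```go
package main

import (
	"math"
	"math/rand"
	"time"
)

const (
	basePeriod     = 30 * time.Second
	multiplier     = 2.0
	maxPeriod      = 5 * time.Minute
	jitterFraction = 0.10
)

// Backoff computes BasePeriod × Multiplier^(failures-1), caps it at
// maxPeriod, then applies jitter. The jitter argument is a value in [-1, 1]
// that is scaled by jitterFraction; passing 0 gives the nominal delay.
func Backoff(failures int, jitter float64) time.Duration {
	if failures < 1 {
		failures = 1
	}
	d := float64(basePeriod) * math.Pow(multiplier, float64(failures-1))
	if d > float64(maxPeriod) {
		d = float64(maxPeriod)
	}
	d += d * jitterFraction * jitter
	return time.Duration(d)
}

// BackoffWithRandomJitter draws the jitter uniformly from [-1, 1).
func BackoffWithRandomJitter(failures int) time.Duration {
	return Backoff(failures, rand.Float64()*2-1)
}
```

Without jitter, consecutive failures 1 through 5 yield 30s, 1m, 2m, 4m, and 5m: the fifth value would be 8m by the formula and is capped at the maximum.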

Permanent Errors

  • Severity policy failure: Transitions directly to Failed with RegoEvaluationError. No requeue.
  • Kubernetes API errors during enrichment or classification are not permanent: network timeouts, API server errors, and context cancellation during K8s API calls are treated as transient and retried. Rego evaluation errors are never retried.

Hot-Reload

All Rego policies support hot-reload via FileWatcher (DD-INFRA-001):

  • Mechanism: fsnotify watches the policy file path
  • Debounce: 200ms to coalesce rapid ConfigMap mount updates
  • Reload: Recompile Rego, swap prepared query under a mutex
  • Hash: SHA256 of new policy stored for audit traceability
  • Affected classifiers: Environment, Priority, Severity, Custom Labels
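
The reload and swap steps can be sketched with the standard library alone. The file watching (fsnotify) and the 200ms debounce are left out here, and `CompiledPolicy`, `CompileFunc`, `PolicyStore`, and `ErrEmptyPolicy` are hypothetical names; `CompiledPolicy` stands in for a prepared Rego query. Keeping the previous policy when compilation fails is an assumption.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"sync"
)

// ErrEmptyPolicy is a hypothetical error a compile function might return.
var ErrEmptyPolicy = errors.New("policy source is empty")

// CompiledPolicy is a hypothetical stand-in for a prepared Rego query.
type CompiledPolicy struct{ Source string }

// CompileFunc is a hypothetical hook that compiles policy source.
type CompileFunc func(src []byte) (*CompiledPolicy, error)

// PolicyStore holds the active policy and its SHA-256 hash.
type PolicyStore struct {
	mu      sync.RWMutex
	policy  *CompiledPolicy
	hash    string
	compile CompileFunc
}

func NewPolicyStore(compile CompileFunc) *PolicyStore {
	return &PolicyStore{compile: compile}
}

// Reload compiles src outside the lock, then swaps policy and hash together
// under the mutex. On a compile error the previous policy stays active.
func (s *PolicyStore) Reload(src []byte) error {
	compiled, err := s.compile(src)
	if err != nil {
		return err
	}
	sum := sha256.Sum256(src)
	s.mu.Lock()
	s.policy = compiled
	s.hash = hex.EncodeToString(sum[:])
	s.mu.Unlock()
	return nil
}

// Current returns the active policy and its hash as a consistent pair.
func (s *PolicyStore) Current() (*CompiledPolicy, string) {
	s.mu.RLock()
	defer s.mu.RUnlock()
	return s.policy, s.hash
}
```

Compiling before taking the lock keeps evaluations from blocking on a slow recompile; readers only ever wait for the pointer swap.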

Deduplication

Deduplication is handled entirely at the Gateway level -- Signal Processing does not perform deduplication. See Gateway: Phase-Based Deduplication.

Data Flow

sequenceDiagram
    participant RO as Orchestrator
    participant SP as Signal Processing
    participant K8s as Kubernetes API
    participant OPA as Rego Engine
    participant DS as DataStorage

    RO->>K8s: Create SignalProcessing CRD
    Note over SP: Phase: Pending → Enriching
    SP->>K8s: Get target resource + owner chain
    SP->>K8s: Get namespace (cached)
    SP->>OPA: Evaluate custom labels policy
    Note over SP: Phase: Enriching → Classifying
    SP->>OPA: Environment policy
    SP->>OPA: Priority policy
    SP->>OPA: Severity policy
    SP->>SP: Signal mode (YAML lookup)
    Note over SP: Phase: Classifying → Categorizing
    SP->>SP: Business classification (namespace labels + environment mapping)
    Note over SP: Phase: Categorizing → Completed
    SP->>DS: Audit events
    RO->>RO: Detect SP Completed → create AIAnalysis

Handoff to Remediation Orchestrator

When Signal Processing reaches Completed with Ready=True, the Remediation Orchestrator:

  1. Reads the enriched classification from the SP status
  2. Creates an AIAnalysis CRD with the enriched signal data
  3. Transitions the RemediationRequest from Processing to Analyzing

Next Steps