Signals & Alert Routing¶

Kubernaut ingests signals from two sources: Prometheus AlertManager and Kubernetes Events. This page explains how signals reach the system, how resource scope is enforced, and how deduplication works.

Signal Sources¶

Prometheus AlertManager¶

The Gateway exposes an AlertManager-compatible webhook endpoint:

POST /api/v1/signals/prometheus

Configure AlertManager to send alerts to this endpoint:

# alertmanager.yml
receivers:
  - name: kubernaut
    webhook_configs:
      - url: http://gateway-service.kubernaut-system.svc:8080/api/v1/signals/prometheus
        send_resolved: false

route:
  receiver: kubernaut
  routes:
    - match:
        severity: critical
      receiver: kubernaut

The Gateway validates each alert, extracts the target resource, checks scope labels, and creates a RemediationRequest CRD.

Kubernetes Events¶

The Gateway accepts Kubernetes events via a webhook endpoint:

POST /api/v1/signals/kubernetes-event

This captures events like BackOff, OOMKilled, FailedScheduling, and Unhealthy without requiring a Prometheus rule for each. To use this source, deploy a Kubernetes Event Exporter (e.g., the Resmo Event Exporter) and point it at the Gateway endpoint. See Event Exporter for setup details.

Event Exporter removed from chart in v1.1

The Event Exporter was previously bundled in the Helm chart. Since v1.1, Kubernetes event forwarding is a user-provided concern managed independently of the Kubernaut installation.

Signal Types¶

After ingestion, the Gateway normalizes all signals to a single type:

Source	Signal Type	Example
AlertManager	`alert`	`KubePodCrashLooping`, `KubePodOOMKilled`
Kubernetes Event	`alert`	`BackOff`, `OOMKilled`, `FailedScheduling`

Both sources produce signals of type alert. Kubernetes events are treated as alerts after normalization.

Resource Scope Management¶

Kubernaut uses a label-based opt-in model to control which resources it manages.

Namespace Scope¶

# Opt in a namespace
kubectl label namespace my-app kubernaut.ai/managed=true

# Opt out
kubectl label namespace my-app kubernaut.ai/managed-

The Gateway checks this label before creating a RemediationRequest. Signals targeting unmanaged namespaces are dropped silently.

Resource-Level Scope¶

Individual resources can also be opted in or out:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  labels:
    kubernaut.ai/managed: "true"

Monitoring Infrastructure Filtering¶

Before extracting the target resource, the Gateway filters labels that refer to monitoring infrastructure rather than application workloads. Prometheus scrape configuration injects service and pod labels that identify the scraping source (e.g., kube-state-metrics, prometheus-node-exporter), not the monitored target. Without filtering, Kubernaut could attempt to remediate monitoring pods when an alert fires.

The filter matches known monitoring naming patterns:

Service labels: Names containing prometheus, kube-state-metrics, alertmanager, grafana, thanos, or exporter; names prefixed with victoria, loki, or jaeger; names ending with -operator
Pod labels: Names containing kube-state-metrics, prometheus-node-exporter, alertmanager-kube-prometheus, or prometheus-kube-prometheus; names prefixed with kube-prometheus-stack-*

Filtered labels are skipped during target resource extraction. The LLM's remediationTarget field provides a safety net for edge cases where the filter is too aggressive.

Fingerprinting¶

Before creating a RemediationRequest, the Gateway computes a signal fingerprint — a SHA256 hash based on the top-level owning resource (e.g., Deployment), not the individual Pod. This ensures that alerts from different Pods in the same Deployment produce the same fingerprint, enabling accurate deduplication.

The Gateway resolves the owner chain using a metadata-only informer cache. Two direct API fallback mechanisms handle race conditions during rollout restarts:

Cache miss (#282) — When a newly created pod isn't in the cache yet, the resolver falls back to a direct API read (apiReader) to fetch the resource and continue the owner chain walk.
Stale cache (#284) — When the cache returns a resource without controller ownerReferences (e.g., a terminating pod whose cached metadata lost its ownerRef), the resolver re-fetches via the direct API to verify. If the fresh metadata includes ownerReferences, the chain continues normally. If the resource genuinely has no owner (standalone Pod), the pod-level fingerprint is accepted. If the resource no longer exists, the signal is dropped.

This trust-but-verify approach prevents duplicate RemediationRequests caused by stale pod-level fingerprints while still supporting legitimate standalone Pods.

Deduplication¶

The Gateway prevents duplicate remediations for the same issue using CRD-based deduplication. Before creating a new RemediationRequest, it checks whether an active (non-terminal) RemediationRequest already exists with the same signal fingerprint.

If a duplicate is detected, the new signal is dropped and logged.

Signal Processing¶

After the Gateway creates a RemediationRequest, the Orchestrator creates a SignalProcessing CRD. The Signal Processing controller enriches the signal through several stages:

Enrichment¶

Owner chain resolution — Traces Pod → ReplicaSet → Deployment (or StatefulSet, DaemonSet, etc.)
Namespace labels — Extracts environment, team, and other namespace-level metadata
Resource context — Current resource status, events, and conditions

Classification¶

Rego policies evaluate the enriched signal to determine:

Severity — critical, high, medium, low, or unknown
Priority — Business impact and urgency
Environment — Production, staging, development
Signal mode — Reactive or proactive

Signal Mode¶

Mode	Meaning	Example Alerts
Reactive	Active incident requiring remediation	`KubePodCrashLooping`, `KubePodOOMKilled`
Proactive	Predicted issue before user impact	`PredictDiskFull`, `PredictMemoryExhaustion` (via `predict_linear()`)

Signal mode determines which prompt variant HolmesGPT uses during investigation, affecting the framing of the analysis (incident response vs. preventive assessment). See Signal Mode Configuration for how to configure which alert names are treated as proactive.

Error Responses¶

The Gateway returns all errors in RFC 7807 Problem Details format (Content-Type: application/problem+json). AlertManager webhook integrations can use the status and type fields to distinguish between validation errors, authentication failures, and transient server issues.

Next Steps¶

Architecture: Gateway -- Signal adapters, deduplication, and ingestion internals
Core Concepts — Understanding the full pipeline
Remediation Workflows — How workflows are matched to signals
Configuration Reference — Gateway and Signal Processing configuration