Gateway

The Gateway is the entry point for all signals into Kubernaut. It accepts alerts from Prometheus AlertManager and Kubernetes Event Exporter, authenticates the source, validates scope, deduplicates against in-flight remediations, and creates RemediationRequest CRDs to initiate the remediation pipeline.

Architecture

flowchart LR
    AM["AlertManager"] -->|"POST /api/v1/signals/prometheus"| GW["Gateway"]
    EE["Event Exporter"] -->|"POST /api/v1/signals/kubernetes-event"| GW
    GW -->|"Auth (SAR)"| K8s["Kubernetes API"]
    GW -->|"Scope check"| K8s
    GW -->|"Create RR"| K8s
    GW -->|"Audit events"| DS["DataStorage"]

Signal Ingestion Endpoints

Ingestion uses port 8080 (HTTPS when inter-service TLS is enabled). Health probes use port 8081 over plain HTTP (GET /healthz for liveness, GET /readyz for readiness). Metrics are served on port 9090 at GET /metrics, also over plain HTTP. See System Overview -- Port model.

Endpoint	Port	Method	Source	Description
`/api/v1/signals/prometheus`	8080	POST	AlertManager	Prometheus AlertManager webhook receiver
`/api/v1/signals/kubernetes-event`	8080	POST	Event Exporter (external)	Kubernetes Event API webhook receiver. Requires a user-deployed Event Exporter -- not included in the chart since v1.1.
`/healthz`	8081	GET	--	Liveness (always 200)
`/readyz`	8081	GET	--	Readiness (K8s API + shutdown flag)
`/metrics`	9090	GET	--	Prometheus metrics

Each signal source uses a dedicated adapter that parses the source-specific payload format into a common NormalizedSignal structure. Adapters are registered at startup via RegisterAdapter().

Expected Signal Labels

AlertManager Label Contract

The Prometheus adapter expects alerts to carry standard Kubernetes labels for resource identification. The Gateway uses these labels to resolve the target resource:

Label	Required	Description
`namespace`	Yes (namespaced resources)	Target resource namespace
`severity`	Recommended	Alert severity (normalized later by the Signal Processing (SP) Rego policy; defaults to `unknown` when absent)
`alertname`	Yes	Alert name (becomes `SignalName`)
One of: `horizontalpodautoscaler`, `poddisruptionbudget`, `persistentvolumeclaim`, `deployment`, `statefulset`, `daemonset`, `replicaset`, `node`, `service`, `job_name`, `cronjob`, `pod`	Yes	Target resource identity

The adapter uses a priority list to select the resource label: HPA > PDB > PVC > Deployment > StatefulSet > DaemonSet > ReplicaSet > Node > Service > Job (job_name) > CronJob > Pod.
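The priority selection described above can be sketched as a simple ordered scan. This is a minimal illustration, not the adapter's actual code: `resourceLabelPriority` and `selectResourceLabel` are hypothetical names, and treating an empty label value as absent is an assumption.

```go
package main

import "fmt"

// resourceLabelPriority mirrors the documented order:
// HPA > PDB > PVC > Deployment > StatefulSet > DaemonSet > ReplicaSet >
// Node > Service > Job (job_name) > CronJob > Pod.
var resourceLabelPriority = []string{
	"horizontalpodautoscaler", "poddisruptionbudget", "persistentvolumeclaim",
	"deployment", "statefulset", "daemonset", "replicaset",
	"node", "service", "job_name", "cronjob", "pod",
}

// selectResourceLabel returns the highest-priority resource label that is
// present with a non-empty value. ok is false when no resource label matches.
func selectResourceLabel(labels map[string]string) (key, value string, ok bool) {
	for _, k := range resourceLabelPriority {
		if v := labels[k]; v != "" {
			return k, v, true
		}
	}
	return "", "", false
}

func main() {
	labels := map[string]string{
		"alertname":  "KubePodCrashLooping",
		"namespace":  "shop",
		"pod":        "checkout-7d9f-abcde",
		"deployment": "checkout",
	}
	// Both "pod" and "deployment" are set; "deployment" ranks higher.
	key, value, ok := selectResourceLabel(labels)
	fmt.Println(key, value, ok)
}
```

In this sketch an alert that carries both `pod` and `deployment` labels resolves to the Deployment, which matches the intent of targeting the higher-level resource.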

Kubernetes Event Exporter Label Contract

The Kubernetes Event adapter expects the Event Exporter to forward events with these fields:

Field	Required	Description
`involvedObject.kind`	Yes	Resource kind
`involvedObject.name`	Yes	Resource name
`involvedObject.namespace`	Recommended	Resource namespace
`reason`	Yes	Event reason (becomes `SignalName`)
`type`	Yes	`Warning` or `Error` (`Normal` events are filtered out)

Signal Adapters

Prometheus Adapter

Parses the AlertManager webhook format (alerts[], commonLabels, commonAnnotations):

Parse -- Takes the first alert from the alerts[] array. Extracts the target resource from labels using a priority list: HPA > PDB > PVC > Deployment > StatefulSet > DaemonSet > ReplicaSet > Node > Service > Job (job_name) > CronJob > Pod. Before extraction, a monitoring metadata filter strips service and pod labels that refer to monitoring infrastructure (kube-state-metrics, prometheus-node-exporter, alertmanager, grafana, etc.) to prevent Kubernaut from targeting its own monitoring stack. Merges alert-level and common labels.
Fingerprint -- Resolves the owner chain (Pod -> ReplicaSet -> Deployment) and computes SHA256(namespace:kind:name) of the top-level owner. The alert name is excluded from the fingerprint (Issue #63) so that different alerts for the same resource deduplicate correctly.
Severity -- Pass-through from labels["severity"]. Signal Processing normalizes it later via Rego policy.
Validate -- Requires non-empty Fingerprint, SignalName, and Severity.

Kubernetes Event Adapter

Parses the Event Exporter format (involvedObject, reason, type, lastTimestamp):

Parse -- Requires reason, involvedObject.kind, and involvedObject.name. Filters out Normal events (only Warning and Error events are processed). Severity is taken from the event type field.
Fingerprint -- Same owner chain resolution and SHA256 computation as Prometheus.
Validate -- Requires non-empty SignalName, Fingerprint, Severity, Resource.Kind, Resource.Name.

Replay Prevention

Both adapters enforce freshness validation to prevent replayed signals:

Prometheus: Checks X-Timestamp header or alerts[].startsAt body field
Kubernetes Events: Checks lastTimestamp or firstTimestamp body field
Tolerance: 5 minutes. Signals older than this are rejected.
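The freshness rule above amounts to comparing a signal's age against the tolerance. A minimal sketch follows; `isFresh`, `checkFreshness`, and `errStaleSignal` are hypothetical names, and how the Gateway treats timestamps in the future is not covered here.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// freshnessTolerance is the documented 5-minute window.
const freshnessTolerance = 5 * time.Minute

// errStaleSignal is a hypothetical error value for rejected signals.
var errStaleSignal = errors.New("signal timestamp is outside the freshness tolerance")

// isFresh reports whether a signal of the given age is within tolerance.
func isFresh(age time.Duration) bool {
	return age <= freshnessTolerance
}

// checkFreshness rejects signals whose timestamp is older than the tolerance.
// signalTime would come from X-Timestamp / startsAt (Prometheus) or
// lastTimestamp / firstTimestamp (Kubernetes events).
func checkFreshness(signalTime, now time.Time) error {
	if !isFresh(now.Sub(signalTime)) {
		return errStaleSignal
	}
	return nil
}

func main() {
	now := time.Now()
	fmt.Println(checkFreshness(now.Add(-2*time.Minute), now))  // within tolerance
	fmt.Println(checkFreshness(now.Add(-10*time.Minute), now)) // stale
}
```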

Normalized Output

Both adapters produce a common NormalizedSignal:

Field	Description
`Fingerprint`	SHA256 of owner chain (deduplication key)
`SignalName`	Alert name or event reason
`Severity`	Raw severity (normalized later by SP)
`Namespace`	Target resource namespace
`Resource`	Target resource (Kind, Name, Namespace)
`Labels`	Merged labels from the source
`Annotations`	Source annotations
`FiringTime`	When the alert started firing
`ReceivedTime`	When the Gateway received it
`SourceType`	Always `"alert"` for all adapters
`Source`	`"prometheus"` or `"kubernetes-events"` (identifies the adapter)
`RawPayload`	Original payload for audit reconstruction

Authentication

Every signal request passes through authentication middleware before reaching the adapter:

sequenceDiagram
    participant Source as Signal Source
    participant GW as Gateway
    participant K8s as Kubernetes API

    Source->>GW: POST /api/v1/signals/prometheus (Bearer token)
    GW->>K8s: TokenReview (validate token)
    K8s-->>GW: User identity
    GW->>K8s: SubjectAccessReview (create services/gateway-service)
    K8s-->>GW: Allowed / Denied
    GW->>GW: Process signal (if allowed)

Extract -- Bearer token from Authorization header
TokenReview -- Validates the token against the Kubernetes API, returns the authenticated user identity
SubjectAccessReview -- Checks if the user has create permission on services/gateway-service in the controller namespace

Status Code	Meaning
401	Missing or invalid token
403	Valid token but insufficient RBAC
500	TokenReview or SAR API error
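The Extract step listed above can be illustrated with a small helper. This is a sketch under assumptions: `extractBearerToken` is a hypothetical name, and matching the `Bearer` scheme case-insensitively is a choice made here rather than documented Gateway behavior. A missing or malformed header would map to the 401 row in the table; the TokenReview and SubjectAccessReview calls are not shown.

```go
package main

import (
	"fmt"
	"strings"
)

// extractBearerToken returns the token from an Authorization header value
// of the form "Bearer <token>". ok is false when the header is missing,
// uses a different scheme, or carries an empty token.
func extractBearerToken(authorization string) (token string, ok bool) {
	const prefix = "Bearer "
	if len(authorization) < len(prefix) ||
		!strings.EqualFold(authorization[:len(prefix)], prefix) {
		return "", false
	}
	token = strings.TrimSpace(authorization[len(prefix):])
	return token, token != ""
}

func main() {
	for _, h := range []string{"Bearer abc.def.ghi", "Basic dXNlcjpwYXNz", ""} {
		token, ok := extractBearerToken(h)
		fmt.Printf("%q -> token=%q ok=%v\n", h, token, ok)
	}
}
```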

Signal sources must have a ServiceAccount with the gateway-signal-source ClusterRole. See Configuration Reference for setup.

Scope Checking

After authentication, the Gateway checks whether the target resource is in scope for remediation:

Resource label -- If the resource has kubernaut.ai/managed=true, it is managed. If false, it is explicitly unmanaged.
Namespace label -- If the resource label is absent, check the namespace for the same label.
Default -- If neither label is present, the resource is unmanaged.

Cluster-scoped resources (Node, PersistentVolume, Namespace) only check the resource label.
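The resolution order above (resource label, then namespace label, then the unmanaged default) can be sketched as follows. `isManaged` is a hypothetical helper, and treating any resource label value other than `true` as unmanaged is an assumption; the source only specifies `true` and `false`.

```go
package main

import "fmt"

const managedLabel = "kubernaut.ai/managed"

// isManaged applies the documented precedence:
//  1. a label on the resource itself decides,
//  2. otherwise the namespace label decides (namespaced resources only),
//  3. otherwise the resource is unmanaged.
func isManaged(resourceLabels, namespaceLabels map[string]string, clusterScoped bool) bool {
	if v, ok := resourceLabels[managedLabel]; ok {
		return v == "true" // assumption: any value other than "true" means unmanaged
	}
	if clusterScoped {
		return false // cluster-scoped resources only check the resource label
	}
	return namespaceLabels[managedLabel] == "true"
}

func main() {
	nsManaged := map[string]string{managedLabel: "true"}
	optedOut := map[string]string{managedLabel: "false"}

	fmt.Println(isManaged(nil, nsManaged, false))      // inherits from namespace
	fmt.Println(isManaged(optedOut, nsManaged, false)) // resource label overrides
	fmt.Println(isManaged(nil, nsManaged, true))       // cluster-scoped: namespace ignored
}
```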

Unmanaged resources are not rejected with an error. The Gateway returns HTTP 200 with status: "rejected" and reason: "unmanaged_resource", along with a kubectl command the operator can use to opt in:

{
  "status": "rejected",
  "reason": "unmanaged_resource",
  "message": "Resource is not managed by Kubernaut. To enable: kubectl label namespace <ns> kubernaut.ai/managed=true"
}

Phase-Based Deduplication

The Gateway prevents duplicate remediations using a phase-based deduplication system backed by the status of existing RemediationRequest (RR) resources in Kubernetes, with no Valkey dependency.

Fingerprint

The deduplication key is the signal fingerprint: SHA256(namespace:kind:name) of the top-level owning resource. The alert name is not part of the fingerprint, so multiple alerts about the same Deployment (e.g., KubePodCrashLooping and KubePodNotReady) are treated as duplicates.
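As an illustration of why different alerts collapse to one key, the fingerprint can be sketched like this. The `owner` type and `fingerprint` function are hypothetical, and hex encoding of the digest is an assumption; the source only specifies SHA256 over `namespace:kind:name`.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// owner identifies the top-level owning resource after owner-chain
// resolution (for example Pod -> ReplicaSet -> Deployment).
type owner struct {
	Namespace, Kind, Name string
}

// fingerprint hashes "namespace:kind:name". The alert name is deliberately
// not an input, so alerts for the same owner share a fingerprint.
func fingerprint(o owner) string {
	sum := sha256.Sum256([]byte(o.Namespace + ":" + o.Kind + ":" + o.Name))
	return hex.EncodeToString(sum[:]) // hex encoding is an assumption
}

func main() {
	// KubePodCrashLooping and KubePodNotReady both resolve to this owner.
	target := owner{Namespace: "shop", Kind: "Deployment", Name: "checkout"}
	crashLooping := fingerprint(target)
	notReady := fingerprint(target)
	fmt.Println(crashLooping == notReady)
}
```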

Phase-Based Logic

RR Phase	Behavior
Non-terminal (Pending, Processing, Analyzing, AwaitingApproval, Executing, Verifying, Blocked)	Signal is deduplicated -- occurrence count is incremented, no new RR created
Terminal (Completed, Failed, TimedOut, Skipped, Cancelled)	A new RR is created for the signal

RRs with status.nextAllowedExecution in the future (exponential backoff) are also treated as non-terminal.
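The decision table above reduces to a small predicate. In this sketch `shouldDeduplicate` is a hypothetical name, `backoffPending` stands for "status.nextAllowedExecution is in the future", and treating an unrecognized phase as non-terminal is an assumption.

```go
package main

import "fmt"

// terminalPhases lists the phases after which a new RR may be created.
var terminalPhases = map[string]bool{
	"Completed": true,
	"Failed":    true,
	"TimedOut":  true,
	"Skipped":   true,
	"Cancelled": true,
}

// shouldDeduplicate reports whether an incoming signal should be folded
// into the existing RR (true) or should result in a new RR (false).
func shouldDeduplicate(phase string, backoffPending bool) bool {
	// A pending backoff keeps the RR "active" even in a terminal phase.
	return backoffPending || !terminalPhases[phase]
}

func main() {
	fmt.Println(shouldDeduplicate("Executing", false)) // in flight
	fmt.Println(shouldDeduplicate("Completed", false)) // finished
	fmt.Println(shouldDeduplicate("Failed", true))     // finished, but backing off
}
```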

Status Updates on Deduplication

When a signal is deduplicated, the Gateway updates the existing RR's status:

Increments OccurrenceCount
Updates LastSeenAt timestamp
Uses retry.RetryOnConflict for concurrent safety

Distributed Locking

In multi-replica deployments, the Gateway uses a Kubernetes Lease lock to prevent race conditions between dedup check and CRD creation. The lock is acquired with up to 10 retry attempts with exponential backoff.
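The bounded retry described above might look like the loop below. This is a sketch, not the Gateway's Lease code: `acquireWithRetry` and `noSleep` are hypothetical, the `try` callback stands in for a single Lease acquisition attempt, and the base delay and doubling factor are assumptions, since the source only states 10 attempts with exponential backoff.

```go
package main

import (
	"fmt"
	"time"
)

// maxLockAttempts is the documented upper bound on acquisition attempts.
const maxLockAttempts = 10

// noSleep is a stand-in for time.Sleep so the example runs instantly.
func noSleep(time.Duration) {}

// acquireWithRetry calls try up to maxLockAttempts times, doubling the
// delay between attempts. It returns the number of attempts made and
// whether the lock was acquired.
func acquireWithRetry(try func() bool, base time.Duration, sleep func(time.Duration)) (attempts int, ok bool) {
	delay := base
	for attempt := 1; attempt <= maxLockAttempts; attempt++ {
		if try() {
			return attempt, true
		}
		if attempt < maxLockAttempts {
			sleep(delay)
			delay *= 2 // assumption: simple doubling, no jitter or cap
		}
	}
	return maxLockAttempts, false
}

func main() {
	calls := 0
	// Hypothetical lock that becomes available on the third attempt.
	tryLock := func() bool { calls++; return calls == 3 }

	// Pass time.Sleep instead of noSleep to actually wait between attempts.
	attempts, ok := acquireWithRetry(tryLock, 50*time.Millisecond, noSleep)
	fmt.Println(attempts, ok)
}
```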

RemediationRequest Creation

When a signal passes all checks (authenticated, in scope, not a duplicate), the Gateway creates a RemediationRequest CRD:

Name Format

rr-{fingerprint-prefix}-{uuid-suffix}

The fingerprint prefix enables field-selector queries for deduplication lookups.

Spec Fields Populated

Field	Source
`SignalFingerprint`	Computed fingerprint
`SignalName`	Adapter-extracted name
`Severity`	Raw severity from source
`TargetResource`	Kind, Name, Namespace
`Labels`, `Annotations`	Merged from source
`FiringTime`, `ReceivedTime`	Timestamps
`ProviderData`	Source-specific metadata
`OriginalPayload`	Raw payload for audit reconstruction

Retry and Circuit Breaker

CRD creation uses retry with exponential backoff:

Retryable: 429, 503, 504, timeouts, network errors
Non-retryable: 400, 403, 409, 422
Circuit breaker: Wraps the Kubernetes client to prevent cascading failures during API server instability
AlreadyExists: Treated as idempotent success
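The classification above can be expressed as a single function over the API response. This sketch uses hypothetical names (`createOutcome`, `classifyCreate`) and models only HTTP status codes; timeouts and network errors, which are also retryable per the list above, carry no status code and are left out. The `alreadyExists` flag stands for the API server reporting an AlreadyExists error.

```go
package main

import "fmt"

// createOutcome describes what the caller should do after a create attempt.
type createOutcome int

const (
	outcomeSuccess createOutcome = iota // done, including idempotent AlreadyExists
	outcomeRetry                        // transient failure, retry with backoff
	outcomeFail                         // permanent failure, do not retry
)

// classifyCreate maps a create response to an outcome.
func classifyCreate(status int, alreadyExists bool) createOutcome {
	switch {
	case alreadyExists:
		return outcomeSuccess // the RR is already there: idempotent success
	case status >= 200 && status < 300:
		return outcomeSuccess
	case status == 429 || status == 503 || status == 504:
		return outcomeRetry
	default:
		return outcomeFail // includes 400, 403, 409, 422
	}
}

func main() {
	fmt.Println(classifyCreate(201, false) == outcomeSuccess)
	fmt.Println(classifyCreate(503, false) == outcomeRetry)
	fmt.Println(classifyCreate(422, false) == outcomeFail)
	fmt.Println(classifyCreate(409, true) == outcomeSuccess)
}
```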

Audit Events

The Gateway emits audit events to DataStorage for every signal processed:

Event Type	When
`gateway.signal.received`	New RR successfully created
`gateway.signal.deduplicated`	Signal matched an existing RR
`gateway.crd.created`	RR CRD creation confirmed
`gateway.crd.failed`	RR CRD creation failed (including retries)

Audit events include the full GatewayAuditPayload with reconstruction fields (labels, annotations, original payload) per BR-AUDIT-005.

End-to-End Processing Flow

flowchart TD
    Req["HTTP Request"] --> Auth["Authentication<br/><small>Token + SAR</small>"]
    Auth -->|401/403| Reject1["Reject"]
    Auth -->|200| Replay["Replay Prevention<br/><small>Freshness check</small>"]
    Replay -->|stale| Reject2["Reject"]
    Replay -->|fresh| Parse["Parse + Validate<br/><small>Adapter-specific</small>"]
    Parse -->|invalid| Reject3["400 Bad Request"]
    Parse -->|valid| Scope["Scope Check<br/><small>kubernaut.ai/managed</small>"]
    Scope -->|unmanaged| Reject4["200 rejected<br/><small>unmanaged_resource</small>"]
    Scope -->|managed| Lock["Acquire Lock<br/><small>K8s Lease</small>"]
    Lock --> Dedup["Dedup Check<br/><small>Phase-based</small>"]
    Dedup -->|duplicate| Update["Update RR Status<br/><small>OccurrenceCount++</small>"]
    Dedup -->|new| Create["Create RR CRD<br/><small>With retry + CB</small>"]
    Update --> Audit["Emit Audit"]
    Create --> Audit
    Audit --> Response["HTTP Response"]

Handoff to Remediation Orchestrator

The Gateway's responsibility ends with CRD creation. The RemediationRequest is picked up by the Remediation Orchestrator, which creates a SignalProcessing CRD to begin enrichment:

Gateway creates RR → RO watches RR → RO creates SP CRD → SP enriches signal

Security Hardening (v1.4)

The v1.4 release addressed 14 security audit findings:

Control	Detail
Request body limit	256 KB via `MaxBytesReader` on all ingestion endpoints
Error responses	Generic RFC 7807 problem details; internal state is never leaked
Header stripping	`X-Auth-Request-User` stripped from inbound requests to prevent impersonation
RBAC	Least-privilege ClusterRole scoped to required verbs only
API timeouts	15-second per-handler timeout on all Kubernetes API calls
Trusted proxy	RealIP middleware with fail-closed behavior (rejects if no trusted proxy configured)
CORS	Restrictive default with no wildcard origins
Prometheus label denylist	`namespace` and Prometheus-reserved labels (`__name__`, `__address__`, etc.) excluded from dynamic kind resolution to prevent misrouting (#1045, #1067)
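As an illustration of the request body limit in the table above, the standard library's `http.MaxBytesReader` can cap what a handler will read. This is a sketch, not the Gateway's handler: `limitedHandler` and `statusForBodySize` are hypothetical, 256 KB is interpreted as 256 * 1024 bytes, and the plain-text error bodies are a simplification of the RFC 7807 responses the Gateway returns.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// maxBodyBytes assumes "256 KB" means 256 * 1024 bytes.
const maxBodyBytes = 256 * 1024

// limitedHandler rejects request bodies larger than maxBodyBytes.
func limitedHandler(w http.ResponseWriter, r *http.Request) {
	r.Body = http.MaxBytesReader(w, r.Body, maxBodyBytes)
	if _, err := io.ReadAll(r.Body); err != nil {
		var tooLarge *http.MaxBytesError
		if errors.As(err, &tooLarge) {
			http.Error(w, "request body too large", http.StatusRequestEntityTooLarge)
			return
		}
		http.Error(w, "bad request", http.StatusBadRequest)
		return
	}
	w.WriteHeader(http.StatusOK)
}

// statusForBodySize sends an n-byte body through the handler in-process
// and returns the resulting status code.
func statusForBodySize(n int) int {
	body := bytes.NewReader(make([]byte, n))
	req := httptest.NewRequest(http.MethodPost, "/api/v1/signals/prometheus", body)
	rec := httptest.NewRecorder()
	limitedHandler(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusForBodySize(maxBodyBytes))     // at the limit
	fmt.Println(statusForBodySize(maxBodyBytes + 1)) // one byte over
}
```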

Next Steps

Signals & Alert Routing -- Signal modes, scope management, and alert routing for operators
Signal Processing -- How the enrichment pipeline classifies signals
Remediation Routing -- The Orchestrator's state machine and routing engine
Configuration: Signal Source Authentication -- Configuring external signal sources