Notification Pipeline¶

Operator guide

For channel setup, Slack configuration, and credential management, see Notification Channels.

CRD Reference

For the complete NotificationRequest CRD specification, see API Reference: CRDs.

The Notification controller delivers outcome notifications through multiple channels. It manages routing resolution, per-channel delivery with retry and circuit breaker logic, credential hot-reload, and audit event emission.

CRD Specification¶

Spec (Immutable)¶

For the complete field specification, see NotificationRequest in the CRD Reference.

Notification Types¶

Type	Description
`escalation`	Escalation to human review
`simple`	Basic status notification
`status-update`	Phase change update
`approval`	Approval request
`manual-review`	Manual review required
`completion`	Remediation outcome (success or failure)

Status¶

For the complete field specification, see NotificationRequest in the CRD Reference.

DeliveryAttempt¶

For the complete field specification, see DeliveryAttempt in the CRD Reference.

Phase State Machine¶

stateDiagram-v2
    [*] --> Pending
    Pending --> Sending : Channels resolved
    Pending --> Failed : No channels resolved
    Sending --> Sent : All channels succeeded
    Sending --> Retrying : Retryable failures, attempts remaining
    Sending --> PartiallySent : Some succeeded, retries exhausted
    Sending --> Failed : All channels failed
    Retrying --> Sent : All channels succeeded
    Retrying --> Retrying : Still retrying (backoff)
    Retrying --> PartiallySent : Some succeeded, retries exhausted
    Retrying --> Failed : All channels failed permanently

Phase	Terminal	Description
Pending	No	CRD created, awaiting processing
Sending	No	Delivering to resolved channels
Retrying	No	One or more channels failed, retrying with backoff
Sent	Yes	All channels delivered successfully
PartiallySent	Yes	At least one channel succeeded, others exhausted
Failed	Yes	All channels failed

Phase Transition Logic¶

Zero resolved channels → Failed (NoChannelsResolved)
All channels succeeded → Sent
All channels exhausted retries:
- At least one succeeded → PartiallySent
- All permanent failures → Failed (AllDeliveriesFailed)
- Otherwise → Failed (MaxRetriesExhausted)
Retryable failures with attempts remaining → Retrying (requeue with backoff)
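
The rules above can be read as a single decision function. The Go sketch below is illustrative only: the channelState type, the nextPhase function, and the way per-channel state is summarized are assumptions made for this example, not the controller's actual types.

```go
package main

import "fmt"

// channelState is a hypothetical per-channel summary; the real controller
// would derive this information from status.DeliveryAttempts.
type channelState struct {
	Succeeded bool
	Permanent bool // last failure was classified as permanent
	Attempts  int
}

// nextPhase applies the transition rules listed above and returns the
// resulting phase together with the failure reason, if any.
func nextPhase(channels []channelState, maxAttempts int) (phase, reason string) {
	if len(channels) == 0 {
		return "Failed", "NoChannelsResolved"
	}
	succeeded, permanent, exhausted := 0, 0, 0
	for _, c := range channels {
		switch {
		case c.Succeeded:
			succeeded++
		case c.Permanent:
			permanent++
		case c.Attempts >= maxAttempts:
			exhausted++
		}
	}
	switch {
	case succeeded == len(channels):
		return "Sent", ""
	case succeeded+permanent+exhausted < len(channels):
		// At least one channel failed retryably and has attempts left.
		return "Retrying", ""
	case succeeded > 0:
		return "PartiallySent", ""
	case permanent == len(channels):
		return "Failed", "AllDeliveriesFailed"
	default:
		return "Failed", "MaxRetriesExhausted"
	}
}

func main() {
	phase, reason := nextPhase([]channelState{
		{Succeeded: true, Attempts: 1},
		{Permanent: true, Attempts: 1},
	}, 5)
	fmt.Printf("phase=%s reason=%q\n", phase, reason)
}
```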

Routing Resolution¶

Before a notification moves from Pending to Sending, the controller resolves which channels to deliver to.

Routing Attributes¶

Attributes are extracted from the notification spec and used for route matching:

Attribute	Source
`type`	`spec.Type`
`severity`	`spec.Severity`
`phase`	`spec.Phase`
`reviewSource`	`spec.ReviewSource`
`priority`	`spec.Priority`
`environment`	`spec.Metadata["environment"]`
`namespace`	`spec.Metadata["namespace"]`
`skip-reason`	`spec.Metadata["skip-reason"]`
`investigation-outcome`	`spec.Metadata["investigation-outcome"]`

Route Matching¶

The routing configuration follows an AlertManager-style tree:

Build routing attributes from the notification spec
Walk the route tree depth-first -- child routes are evaluated before the parent
Each route's Match map must match all specified attributes (AND logic)
First matching route identifies the receiver
The receiver declares which channels to use

Fallback¶

If no route matches, the notification falls back to the console channel. This ensures every notification is delivered somewhere.
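
Route matching and the console fallback can be sketched together. The example assumes AlertManager-like semantics (a route's subtree is considered only if the route itself matches, and a matching child takes precedence over its parent's receiver). The route type, the function names, and the demo data are hypothetical and do not reflect the actual routing.yaml schema.

```go
package main

import "fmt"

// route is a hypothetical node in the routing tree.
type route struct {
	Match    map[string]string // every entry must match (AND logic)
	Receiver string
	Routes   []route // children, evaluated before this route's own receiver
}

// matches reports whether every key in match is present in attrs with the same value.
func matches(match, attrs map[string]string) bool {
	for key, want := range match {
		if got, ok := attrs[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// findReceiver walks the tree depth-first and returns the first matching receiver.
func findReceiver(r route, attrs map[string]string) (string, bool) {
	if !matches(r.Match, attrs) {
		return "", false
	}
	for _, child := range r.Routes {
		if name, ok := findReceiver(child, attrs); ok {
			return name, true
		}
	}
	return r.Receiver, r.Receiver != ""
}

// resolveChannels returns the matched receiver's channels, or the console
// channel when no route matches.
func resolveChannels(root route, receivers map[string][]string, attrs map[string]string) []string {
	if name, ok := findReceiver(root, attrs); ok {
		return receivers[name]
	}
	return []string{"console"}
}

// Demo data: a root without a receiver and one child route.
var demoRoot = route{Routes: []route{{
	Match:    map[string]string{"type": "escalation", "severity": "critical"},
	Receiver: "oncall",
}}}

var demoReceivers = map[string][]string{"oncall": {"slack:oncall"}}

func main() {
	attrs := map[string]string{"type": "escalation", "severity": "critical"}
	fmt.Println(resolveChannels(demoRoot, demoReceivers, attrs))
	fmt.Println(resolveChannels(demoRoot, demoReceivers, map[string]string{"type": "simple"}))
}
```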

Qualified Channel Names¶

For channels that support multiple instances (e.g., multiple Slack webhooks), the routing resolver produces qualified names:

Slack: slack:receiverName or slack:receiverName:index
Other channels: Unqualified names (e.g., console, file, log)

Configuration¶

Routing configuration is loaded from a ConfigMap:

Setting	Default
ConfigMap name	`notification-routing-config`
ConfigMap namespace	`kubernaut-notifications` (or `POD_NAMESPACE`)
ConfigMap key	`routing.yaml`

Delivery Orchestration¶

The delivery orchestrator manages per-channel delivery with deduplication, attempt tracking, and error classification.

Delivery Flow¶

For each resolved channel:

Skip if channel already succeeded (persisted or in-memory)
Skip if channel has a permanent error
Skip if attempt count ≥ MaxAttempts
Circuit breaker pre-check (Slack only) -- if open, emit CircuitBreakerOpen event and skip
Increment in-flight attempt counter
Deliver via singleflight (dedup key: {notificationUID}:{channel})
Decrement in-flight counter
Classify error as permanent or retryable
Record audit event (message.sent or message.failed)
Append DeliveryAttempt to status
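
The skip checks at the start of this flow, and the deduplication key, can be sketched as follows. The channelStatus type and both function names are assumptions for illustration; only the order of the checks and the key format come from the list above.

```go
package main

import "fmt"

// channelStatus is a hypothetical view of what the orchestrator knows about
// one channel before attempting delivery.
type channelStatus struct {
	Succeeded      bool
	PermanentError bool
	Attempts       int  // persisted + in-flight attempts
	BreakerOpen    bool // only meaningful for Slack channels
}

// skipReason returns why delivery should be skipped, or "" if the channel
// should be attempted. The checks run in the order listed above.
func skipReason(s channelStatus, maxAttempts int) string {
	switch {
	case s.Succeeded:
		return "already succeeded"
	case s.PermanentError:
		return "permanent error"
	case s.Attempts >= maxAttempts:
		return "max attempts reached"
	case s.BreakerOpen:
		return "CircuitBreakerOpen"
	default:
		return ""
	}
}

// dedupKey builds the singleflight key {notificationUID}:{channel}.
func dedupKey(notificationUID, channel string) string {
	return notificationUID + ":" + channel
}

func main() {
	fmt.Println(skipReason(channelStatus{Attempts: 5}, 5))
	fmt.Println(dedupKey("uid-123", "slack:oncall"))
}
```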

Singleflight Deduplication¶

Concurrent reconciles for the same notification + channel are deduplicated via singleflight.Group. Only one actual delivery call runs; others share the result.

In-Memory State (DD-NOT-008)¶

The orchestrator tracks delivery state in memory to prevent duplicate deliveries in the window between a delivery attempt and the persistence of its result to status:

In-flight attempts: Incremented before delivery, decremented after
Successful deliveries: Marked in memory immediately on success
Total attempt count = persisted attempts + in-flight attempts
Cleanup: ClearInMemoryState(uid) is called after status is persisted
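
A minimal sketch of such a tracker, assuming a mutex-guarded map per notification UID. Apart from ClearInMemoryState, which the text names, the type and method names here are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// deliveryState is a hypothetical in-memory tracker keyed by notification UID
// and channel name.
type deliveryState struct {
	mu        sync.Mutex
	inFlight  map[string]map[string]int
	succeeded map[string]map[string]bool
}

func newDeliveryState() *deliveryState {
	return &deliveryState{
		inFlight:  map[string]map[string]int{},
		succeeded: map[string]map[string]bool{},
	}
}

// Begin increments the in-flight counter before a delivery starts.
func (s *deliveryState) Begin(uid, channel string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.inFlight[uid] == nil {
		s.inFlight[uid] = map[string]int{}
	}
	s.inFlight[uid][channel]++
}

// End decrements the counter and records a success immediately.
func (s *deliveryState) End(uid, channel string, success bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.inFlight[uid][channel] > 0 {
		s.inFlight[uid][channel]--
	}
	if success {
		if s.succeeded[uid] == nil {
			s.succeeded[uid] = map[string]bool{}
		}
		s.succeeded[uid][channel] = true
	}
}

// TotalAttempts returns persisted attempts plus in-flight attempts.
func (s *deliveryState) TotalAttempts(uid, channel string, persisted int) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return persisted + s.inFlight[uid][channel]
}

// HasSucceeded reports whether a success was recorded in memory.
func (s *deliveryState) HasSucceeded(uid, channel string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.succeeded[uid][channel]
}

// ClearInMemoryState drops all state for a notification once status is persisted.
func (s *deliveryState) ClearInMemoryState(uid string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.inFlight, uid)
	delete(s.succeeded, uid)
}

func main() {
	st := newDeliveryState()
	st.Begin("uid-123", "slack:oncall")
	fmt.Println(st.TotalAttempts("uid-123", "slack:oncall", 2))
	st.End("uid-123", "slack:oncall", true)
	fmt.Println(st.HasSucceeded("uid-123", "slack:oncall"))
}
```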

Counter Logic¶

SuccessfulDeliveries and FailedDeliveries count unique channels, not attempts. When a channel that previously failed later succeeds, it is counted once as a success and no longer as a failure.
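
One way to derive per-channel counters from a list of attempts is shown here. The attempt type and the countChannels function are hypothetical; the sketch only illustrates counting by channel rather than by attempt.

```go
package main

import "fmt"

// attempt is a hypothetical, reduced form of a DeliveryAttempt.
type attempt struct {
	Channel string
	Success bool
}

// countChannels counts each channel once: as successful if any attempt
// succeeded, otherwise as failed.
func countChannels(attempts []attempt) (successful, failed int) {
	succeeded := map[string]bool{}
	for _, a := range attempts {
		succeeded[a.Channel] = succeeded[a.Channel] || a.Success
	}
	for _, ok := range succeeded {
		if ok {
			successful++
		} else {
			failed++
		}
	}
	return successful, failed
}

// demoAttempts: Slack fails twice and then succeeds; file fails once.
var demoAttempts = []attempt{
	{Channel: "slack:oncall", Success: false},
	{Channel: "slack:oncall", Success: false},
	{Channel: "slack:oncall", Success: true},
	{Channel: "file", Success: false},
}

func main() {
	ok, failed := countChannels(demoAttempts)
	fmt.Printf("successful=%d failed=%d\n", ok, failed)
}
```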

Retry Policy¶

Defaults¶

Parameter	Default	Range
`MaxAttempts`	5	1–10
`InitialBackoffSeconds`	30	1–300
`BackoffMultiplier`	2	1–10
`MaxBackoffSeconds`	480	60–3600

Backoff Calculation¶

Uses the shared pkg/shared/backoff library:

Formula: BasePeriod × (Multiplier ^ (attempts - 1)), where BasePeriod is InitialBackoffSeconds and Multiplier is BackoffMultiplier
Jitter: ±10% (BR-NOT-055)
Cap: MaxBackoffSeconds

Example with defaults, before jitter: 30s → 60s → 120s → 240s → 480s
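
The calculation can be reproduced with a few lines of Go. This is not the pkg/shared/backoff API; the function names are made up for the example, and applying the cap before the jitter is an assumption.

```go
package main

import (
	"fmt"
	"math/rand"
)

// backoffSeconds computes initial × multiplier^(attempts-1), capped at maxSeconds.
func backoffSeconds(attempts, initial, multiplier, maxSeconds int) int {
	d := initial
	for i := 1; i < attempts && d < maxSeconds; i++ {
		d *= multiplier
	}
	if d > maxSeconds {
		return maxSeconds
	}
	return d
}

// applyJitter scales seconds by a factor in [0.9, 1.1) for u in [0, 1).
func applyJitter(seconds, u float64) float64 {
	return seconds * (0.9 + 0.2*u)
}

func main() {
	// Defaults from the table above: 30s initial, multiplier 2, 480s cap.
	for attempts := 1; attempts <= 5; attempts++ {
		base := backoffSeconds(attempts, 30, 2, 480)
		jittered := applyJitter(float64(base), rand.Float64())
		fmt.Printf("attempt %d: base=%ds jittered=%.1fs\n", attempts, base, jittered)
	}
}
```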

Backoff Enforcement (NT-BUG-007)¶

When in Retrying phase, the controller computes the next retry time from the last failed attempt. If the current time is before the next retry time, the reconcile is requeued with the remaining backoff duration.
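
The check amounts to a small time comparison. In this sketch, requeueAfter is a hypothetical helper; a zero result means the backoff has elapsed and delivery can proceed immediately.

```go
package main

import (
	"fmt"
	"time"
)

// epoch is a fixed reference time used only for the demo.
var epoch = time.Unix(0, 0)

// requeueAfter returns how long to wait before the next retry, given the
// time of the last failed attempt and the backoff for that attempt.
func requeueAfter(lastFailure, now time.Time, backoff time.Duration) time.Duration {
	remaining := lastFailure.Add(backoff).Sub(now)
	if remaining < 0 {
		return 0
	}
	return remaining
}

func main() {
	now := epoch.Add(20 * time.Second)
	fmt.Println(requeueAfter(epoch, now, 60*time.Second))
}
```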

Error Classification¶

Retryable Errors¶

Wrapped with NewRetryableError(err). Detected via IsRetryableError(err).

Permanent Errors¶

Stored in the DeliveryAttempt.Error field with the prefix "permanent failure:".
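
A sketch of how the two categories could be represented, using the standard errors package. NewRetryableError and IsRetryableError are named in the text, but their bodies here are assumptions, as are the attemptError and isPermanent helpers and the space after the prefix.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// retryableError marks an error as safe to retry (hypothetical implementation).
type retryableError struct{ err error }

func (e *retryableError) Error() string { return e.err.Error() }
func (e *retryableError) Unwrap() error { return e.err }

// NewRetryableError wraps err so that IsRetryableError can detect it.
func NewRetryableError(err error) error { return &retryableError{err: err} }

// IsRetryableError reports whether any error in the chain is retryable.
func IsRetryableError(err error) bool {
	var r *retryableError
	return errors.As(err, &r)
}

const permanentPrefix = "permanent failure: "

// attemptError renders the string stored in DeliveryAttempt.Error.
func attemptError(err error) string {
	if IsRetryableError(err) {
		return err.Error()
	}
	return permanentPrefix + err.Error()
}

// isPermanent recognizes a stored permanent failure by its prefix.
func isPermanent(stored string) bool {
	return strings.HasPrefix(stored, permanentPrefix)
}

// Demo errors.
var (
	errTimeout  = errors.New("request timed out")
	errRejected = errors.New("webhook rejected payload")
)

func main() {
	wrapped := fmt.Errorf("slack delivery: %w", NewRetryableError(errTimeout))
	fmt.Println(IsRetryableError(wrapped))
	fmt.Println(attemptError(errRejected))
}
```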

Slack Error Classification¶

Condition	Classification
TLS / x509 errors	Permanent
HTTP 5xx	Retryable
HTTP 429 (rate limit)	Retryable
HTTP 4xx (other)	Permanent
Network errors (non-TLS)	Retryable
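
The table maps onto a short classification function. The sketch below assumes that TLS failures can be detected with errors.As against the standard x509 error types; how the actual Slack channel detects them is not stated here, and classifySlack and isTLSError are hypothetical names.

```go
package main

import (
	"crypto/x509"
	"errors"
	"fmt"
)

const (
	permanent = "Permanent"
	retryable = "Retryable"
)

// Demo errors: one certificate error and one plain network error.
var (
	errBadCert   error = x509.UnknownAuthorityError{}
	errConnReset       = errors.New("connection reset by peer")
)

// isTLSError reports whether err wraps a certificate verification failure.
func isTLSError(err error) bool {
	var (
		unknownCA x509.UnknownAuthorityError
		invalid   x509.CertificateInvalidError
		hostname  x509.HostnameError
	)
	return errors.As(err, &unknownCA) || errors.As(err, &invalid) || errors.As(err, &hostname)
}

// classifySlack returns the classification for a delivery result, or "" when
// the request succeeded. err is a transport-level error (no HTTP response).
func classifySlack(statusCode int, err error) string {
	if err != nil {
		if isTLSError(err) {
			return permanent
		}
		return retryable
	}
	switch {
	case statusCode == 429 || statusCode >= 500:
		return retryable
	case statusCode >= 400:
		return permanent
	default:
		return ""
	}
}

func main() {
	fmt.Println(classifySlack(503, nil), classifySlack(404, nil), classifySlack(0, errBadCert))
}
```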

Circuit Breaker (Slack)¶

The Slack channel uses a sony/gobreaker circuit breaker to avoid hammering a failing webhook:

Setting	Value
Trip threshold	3 consecutive failures
Open duration	30 seconds
Half-open requests	2 test requests
Reset interval	10 seconds

When the circuit breaker is open, the controller emits a CircuitBreakerOpen Kubernetes event and skips delivery to that channel.
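
To make the trip and open-duration behaviour concrete, here is a deliberately simplified stand-in written against the standard library only. It is not gobreaker and not the controller's code: it is not safe for concurrent use, it closes after a single successful call instead of two half-open test requests, and it omits the reset interval. All names are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// breaker is a simplified consecutive-failure circuit breaker.
type breaker struct {
	tripAfter int           // consecutive failures before opening
	openFor   time.Duration // how long calls are rejected once open
	failures  int
	openedAt  time.Time
	open      bool
}

// newSlackBreaker uses the trip threshold and open duration from the table above.
func newSlackBreaker() *breaker {
	return &breaker{tripAfter: 3, openFor: 30 * time.Second}
}

// Allow reports whether a delivery may be attempted at time now.
func (b *breaker) Allow(now time.Time) bool {
	return !b.open || now.Sub(b.openedAt) >= b.openFor
}

// Record updates the breaker with the outcome of a delivery attempt.
func (b *breaker) Record(now time.Time, success bool) {
	if success {
		b.failures, b.open = 0, false
		return
	}
	b.failures++
	if b.failures >= b.tripAfter {
		b.open, b.openedAt = true, now
	}
}

// start is a fixed reference time used only for the demo.
var start = time.Unix(0, 0)

func main() {
	b := newSlackBreaker()
	for i := 0; i < 3; i++ {
		b.Record(start, false)
	}
	fmt.Println("allowed after 10s:", b.Allow(start.Add(10*time.Second)))
	fmt.Println("allowed after 30s:", b.Allow(start.Add(30*time.Second)))
}
```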

Channel Implementations¶

Channel	Description	Error Behavior
Console	Logs to controller-runtime logger. Format: `[Priority] [Type] Subject\nBody`	Always succeeds
Log	Structured JSON Lines to stdout with notification fields	Always succeeds
File	Writes JSON/YAML to a directory. Filename: `notification-{name}-{timestamp}.{format}`. Atomic write (temp + rename). E2E/debug use (DD-NOT-002)	Directory/write errors → retryable
Slack	Webhook POST with Block Kit formatting. Default timeout: 10s	See error classification above

Channels for email, teams, sms, and webhook are defined in the CRD schema but not yet implemented.

Credential Management¶

The credential resolver reads secrets from a mounted directory (projected volume):

Each file: name = credential name, content = secret value
Hidden files (e.g., ..data symlinks) are skipped
Hot-reload: fsnotify watches the directory; cache is reloaded on file changes
Validation: ValidateRefs(refs) ensures all referenced credentials exist before delivery
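
The directory scan and the reference check can be sketched with the standard library. The function names, signatures, and demo file names are hypothetical (the real ValidateRefs takes only refs), and the fsnotify watcher is left out; on a change event, a resolver built this way would simply call loadCredentials again.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"strings"
)

// loadCredentials reads every non-hidden file in dir into a name -> value map.
func loadCredentials(dir string) (map[string]string, error) {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return nil, err
	}
	creds := make(map[string]string)
	for _, e := range entries {
		// Skip hidden entries such as ..data and any subdirectories.
		if strings.HasPrefix(e.Name(), ".") || e.IsDir() {
			continue
		}
		data, err := os.ReadFile(filepath.Join(dir, e.Name()))
		if err != nil {
			return nil, err
		}
		creds[e.Name()] = string(data)
	}
	return creds, nil
}

// validateRefs returns an error naming every referenced credential that is missing.
func validateRefs(creds map[string]string, refs []string) error {
	var missing []string
	for _, ref := range refs {
		if _, ok := creds[ref]; !ok {
			missing = append(missing, ref)
		}
	}
	if len(missing) > 0 {
		sort.Strings(missing)
		return fmt.Errorf("missing credentials: %s", strings.Join(missing, ", "))
	}
	return nil
}

// writeDemoDir creates a temporary directory with one credential and one hidden file.
func writeDemoDir() (string, error) {
	dir, err := os.MkdirTemp("", "credentials-demo")
	if err != nil {
		return "", err
	}
	files := map[string]string{
		"slack-webhook": "https://hooks.example.invalid/demo",
		".hidden":       "ignored",
	}
	for name, content := range files {
		if err := os.WriteFile(filepath.Join(dir, name), []byte(content), 0o600); err != nil {
			return "", err
		}
	}
	return dir, nil
}

func main() {
	dir, err := writeDemoDir()
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)
	creds, err := loadCredentials(dir)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(creds), validateRefs(creds, []string{"slack-webhook", "pagerduty-key"}))
}
```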

Notification Enrichment¶

Before delivery, the controller enriches notification bodies by resolving workflow UUIDs to human-readable workflow names. This ensures operators see meaningful identifiers (e.g., "RollbackDeployment") instead of opaque UUIDs in notification messages.

Enrichment Flow¶

Extract the workflow UUID from spec.metadata["workflowId"] with fallback to spec.metadata["selectedWorkflow"] (the former is set for completion notifications, the latter for approval notifications)
Call the DataStorage catalog API (GET /api/v1/workflows/{id}) to resolve the UUID to a workflow name
Replace every occurrence of the UUID in spec.Body with the resolved name
Pass the enriched notification to the delivery orchestrator

Graceful Degradation¶

If the workflow UUID is absent from the notification metadata, the DataStorage lookup fails, or the resolved name is empty, the notification is delivered unchanged with the original UUID preserved. Enrichment failures are logged but never block delivery.

Extensibility¶

The enrichment layer uses a WorkflowNameResolver interface, allowing alternative resolution backends (e.g., in-memory cache, external catalog) without changing the delivery pipeline.
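
The enrichment flow and its degradation behaviour might look like this in Go. The WorkflowNameResolver interface is named in the text, but its method set is not, so the single method below is an assumption (a real resolver that calls DataStorage would likely also take a context). staticResolver stands in for an in-memory backend, and the UUID is made up.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// WorkflowNameResolver resolves a workflow UUID to a human-readable name.
// The method signature is assumed for this sketch.
type WorkflowNameResolver interface {
	ResolveWorkflowName(workflowID string) (string, error)
}

// staticResolver is a hypothetical in-memory backend.
type staticResolver map[string]string

func (r staticResolver) ResolveWorkflowName(id string) (string, error) {
	name, ok := r[id]
	if !ok {
		return "", errors.New("workflow not found: " + id)
	}
	return name, nil
}

// enrichBody replaces the workflow UUID in body with its resolved name.
// On any failure it returns body unchanged.
func enrichBody(r WorkflowNameResolver, metadata map[string]string, body string) string {
	id := metadata["workflowId"]
	if id == "" {
		id = metadata["selectedWorkflow"]
	}
	if id == "" {
		return body
	}
	name, err := r.ResolveWorkflowName(id)
	if err != nil || name == "" {
		fmt.Println("enrichment skipped:", err) // logged, never blocks delivery
		return body
	}
	return strings.ReplaceAll(body, id, name)
}

// Demo data with a made-up UUID.
const demoID = "3f2b8c1e-0000-4000-8000-000000000001"

var demoResolver = staticResolver{demoID: "RollbackDeployment"}

func main() {
	meta := map[string]string{"workflowId": demoID}
	fmt.Println(enrichBody(demoResolver, meta, "Workflow "+demoID+" completed"))
}
```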

Audit Events¶

Event Type	When
`notification.message.sent`	Per-channel delivery success
`notification.message.failed`	Per-channel delivery failure
`notification.message.acknowledged`	Notification reaches the Sent phase (all channels delivered)
`notification.message.escalated`	Notification reaches the Failed phase (all channels failed permanently)

Idempotency (NT-BUG-001)¶

Audit events are tracked in a sync.Map with composite keys (message.sent:{channel}, message.failed:{channel}:attempt{N}). Duplicate events are suppressed. Tracking is cleaned up when the CRD is deleted.
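
A minimal sketch of this kind of suppression, using sync.Map.LoadOrStore. The key formats follow the text; scoping each key by notification UID and the auditTracker type itself are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// auditTracker remembers which audit events were already emitted.
type auditTracker struct{ seen sync.Map }

// sentKey builds the composite key message.sent:{channel}.
func sentKey(channel string) string { return "message.sent:" + channel }

// failedKey builds the composite key message.failed:{channel}:attempt{N}.
func failedKey(channel string, attempt int) string {
	return fmt.Sprintf("message.failed:%s:attempt%d", channel, attempt)
}

// shouldEmit returns true only the first time a key is seen for a notification.
func (a *auditTracker) shouldEmit(uid, key string) bool {
	_, loaded := a.seen.LoadOrStore(uid+"/"+key, struct{}{})
	return !loaded
}

// forget removes all tracking for a notification, e.g. when the CRD is deleted.
func (a *auditTracker) forget(uid string) {
	prefix := uid + "/"
	a.seen.Range(func(k, _ any) bool {
		if strings.HasPrefix(k.(string), prefix) {
			a.seen.Delete(k)
		}
		return true
	})
}

func main() {
	var tracker auditTracker
	fmt.Println(tracker.shouldEmit("uid-123", sentKey("console")))
	fmt.Println(tracker.shouldEmit("uid-123", sentKey("console")))
	fmt.Println(failedKey("slack:oncall", 2))
}
```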

Duplicate Reconcile Prevention (NT-BUG-008)¶

The controller skips reconciliation if all of the following are true:

Generation == ObservedGeneration
len(DeliveryAttempts) > 0
Phase is terminal

This prevents redundant processing when controller-runtime re-delivers an event for an already-completed notification.
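
Expressed as a predicate, the three conditions look like this. The notification struct is a reduced, hypothetical stand-in for the NotificationRequest object; the terminal phases are taken from the phase table.

```go
package main

import "fmt"

// notification is a reduced, hypothetical view of a NotificationRequest.
type notification struct {
	Generation         int64
	ObservedGeneration int64
	Phase              string
	DeliveryAttempts   int // len(status.DeliveryAttempts)
}

// isTerminal reports whether a phase is terminal.
func isTerminal(phase string) bool {
	switch phase {
	case "Sent", "PartiallySent", "Failed":
		return true
	}
	return false
}

// shouldSkipReconcile is true only when all three conditions hold.
func shouldSkipReconcile(n notification) bool {
	return n.Generation == n.ObservedGeneration &&
		n.DeliveryAttempts > 0 &&
		isTerminal(n.Phase)
}

func main() {
	done := notification{Generation: 1, ObservedGeneration: 1, Phase: "Sent", DeliveryAttempts: 2}
	fmt.Println(shouldSkipReconcile(done))
}
```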

Next Steps¶

Notification Channels Configuration -- Operator guide for setting up channels and routing
Remediation Routing -- How the Orchestrator creates notifications
Architecture Overview -- System topology