Notification Pipeline¶
Operator guide
For channel setup, Slack configuration, and credential management, see Notification Channels.
CRD Reference
For the complete NotificationRequest CRD specification, see API Reference: CRDs.
The Notification controller delivers outcome notifications through multiple channels. It manages routing resolution, per-channel delivery with retry and circuit breaker logic, credential hot-reload, and audit event emission.
CRD Specification¶
Spec (Immutable)¶
For the complete field specification, see NotificationRequest in the CRD Reference.
Notification Types¶
| Type | Description |
|---|---|
| `escalation` | Escalation to human review |
| `simple` | Basic status notification |
| `status-update` | Phase change update |
| `approval` | Approval request |
| `manual-review` | Manual review required |
| `completion` | Remediation outcome (success or failure) |
Status¶
For the complete field specification, see NotificationRequest in the CRD Reference.
DeliveryAttempt¶
For the complete field specification, see DeliveryAttempt in the CRD Reference.
Phase State Machine¶
stateDiagram-v2
[*] --> Pending
Pending --> Sending : Channels resolved
Pending --> Failed : No channels resolved
Sending --> Sent : All channels succeeded
Sending --> Retrying : Retryable failures, attempts remaining
Sending --> PartiallySent : Some succeeded, retries exhausted
Sending --> Failed : All channels failed
Retrying --> Sent : All channels succeeded
Retrying --> Retrying : Still retrying (backoff)
Retrying --> PartiallySent : Some succeeded, retries exhausted
Retrying --> Failed : All channels failed permanently
| Phase | Terminal | Description |
|---|---|---|
| Pending | No | CRD created, awaiting processing |
| Sending | No | Delivering to resolved channels |
| Retrying | No | One or more channels failed, retrying with backoff |
| Sent | Yes | All channels delivered successfully |
| PartiallySent | Yes | At least one channel succeeded, others exhausted |
| Failed | Yes | All channels failed |
Phase Transition Logic¶
- Zero resolved channels → Failed (`NoChannelsResolved`)
- All channels succeeded → Sent
- All channels exhausted retries:
  - At least one succeeded → PartiallySent
  - All permanent failures → Failed (`AllDeliveriesFailed`)
  - Otherwise → Failed (`MaxRetriesExhausted`)
- Retryable failures with attempts remaining → Retrying (requeue with backoff)
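The rules above can be read as a pure function from per-channel outcomes to the next phase. The following is a minimal sketch, not the controller's code: `channelState`, `nextPhase`, and the way a channel is treated as "finished" (succeeded, permanently failed, or out of attempts) are assumptions made for illustration.

```go
package main

// Phase mirrors the phase names used in the table above.
type Phase string

const (
	PhaseRetrying      Phase = "Retrying"
	PhaseSent          Phase = "Sent"
	PhasePartiallySent Phase = "PartiallySent"
	PhaseFailed        Phase = "Failed"
)

// channelState is a hypothetical per-channel summary, not the real status type.
type channelState struct {
	Succeeded bool // a delivery to this channel succeeded
	Permanent bool // the last failure was classified as permanent
	Attempts  int  // attempts made so far
}

// nextPhase applies the transition rules in order. The second return value
// is the reason, left empty where the document names none.
func nextPhase(channels []channelState, maxAttempts int) (Phase, string) {
	if len(channels) == 0 {
		return PhaseFailed, "NoChannelsResolved"
	}
	succeeded, permanent, exhausted := 0, 0, 0
	for _, c := range channels {
		switch {
		case c.Succeeded:
			succeeded++
		case c.Permanent:
			permanent++
		case c.Attempts >= maxAttempts:
			exhausted++
		}
	}
	if succeeded == len(channels) {
		return PhaseSent, ""
	}
	// Assumption: no channel has anything left to try.
	if succeeded+permanent+exhausted == len(channels) {
		switch {
		case succeeded > 0:
			return PhasePartiallySent, ""
		case permanent == len(channels):
			return PhaseFailed, "AllDeliveriesFailed"
		default:
			return PhaseFailed, "MaxRetriesExhausted"
		}
	}
	// At least one channel has a retryable failure and attempts remaining.
	return PhaseRetrying, ""
}
```

Keeping the decision free of side effects makes each row of the phase table straightforward to unit-test.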
Routing Resolution¶
When a notification enters the Sending phase, the controller resolves which channels to deliver to.
Routing Attributes¶
Attributes are extracted from the notification spec and used for route matching:
| Attribute | Source |
|---|---|
| `type` | `spec.Type` |
| `severity` | `spec.Severity` |
| `phase` | `spec.Phase` |
| `reviewSource` | `spec.ReviewSource` |
| `priority` | `spec.Priority` |
| `environment` | `spec.Metadata["environment"]` |
| `namespace` | `spec.Metadata["namespace"]` |
| `skip-reason` | `spec.Metadata["skip-reason"]` |
| `investigation-outcome` | `spec.Metadata["investigation-outcome"]` |
Route Matching¶
The routing configuration follows an AlertManager-style tree:
- Build routing attributes from the notification spec
- Walk the route tree depth-first -- child routes are evaluated before the parent
- Each route's `Match` map must match all specified attributes (AND logic)
- First matching route identifies the receiver
- The receiver declares which channels to use
Fallback¶
If no route matches, the notification falls back to the console channel. This ensures every notification is delivered somewhere.
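Route matching and the console fallback can be sketched together. The `Route` type, its field names, and the `receivers` map below are hypothetical; the real `routing.yaml` schema may differ. The sketch follows the wording above literally (children are checked before their parent) and does not assume that a non-matching parent prunes its children, which the text does not specify.

```go
package main

// Route is an illustrative node of the routing tree.
type Route struct {
	Match    map[string]string // every listed attribute must match (AND logic)
	Receiver string            // receiver selected when this route matches
	Routes   []*Route          // child routes, evaluated before this route
}

// matches reports whether every attribute in r.Match is present with the same value.
func (r *Route) matches(attrs map[string]string) bool {
	for key, want := range r.Match {
		if got, ok := attrs[key]; !ok || got != want {
			return false
		}
	}
	return true
}

// findReceiver walks the tree depth-first and returns the first matching receiver.
func findReceiver(r *Route, attrs map[string]string) (string, bool) {
	if r == nil {
		return "", false
	}
	for _, child := range r.Routes {
		if name, ok := findReceiver(child, attrs); ok {
			return name, true
		}
	}
	if r.Receiver != "" && r.matches(attrs) {
		return r.Receiver, true
	}
	return "", false
}

// resolveChannels returns the matched receiver's channels, or the console
// channel when no route matches.
func resolveChannels(root *Route, receivers map[string][]string, attrs map[string]string) []string {
	if name, ok := findReceiver(root, attrs); ok {
		return receivers[name]
	}
	return []string{"console"}
}
```

With sibling routes ordered from most to least specific, a critical escalation reaches the narrower receiver first, and a notification that matches nothing still lands on `console`.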
Qualified Channel Names¶
For channels that support multiple instances (e.g., multiple Slack webhooks), the routing resolver produces qualified names:
- Slack: `slack:receiverName` or `slack:receiverName:index`
- Other channels: Unqualified names (e.g., `console`, `file`, `log`)
Configuration¶
Routing configuration is loaded from a ConfigMap:
| Setting | Default |
|---|---|
| ConfigMap name | notification-routing-config |
| ConfigMap namespace | kubernaut-notifications (or POD_NAMESPACE) |
| ConfigMap key | routing.yaml |
Delivery Orchestration¶
The delivery orchestrator manages per-channel delivery with deduplication, attempt tracking, and error classification.
Delivery Flow¶
For each resolved channel:
- Skip if channel already succeeded (persisted or in-memory)
- Skip if channel has a permanent error
- Skip if attempt count ≥ `MaxAttempts`
- Circuit breaker pre-check (Slack only) -- if open, emit `CircuitBreakerOpen` event and skip
- Increment in-flight attempt counter
- Deliver via singleflight (dedup key: `{notificationUID}:{channel}`)
- Decrement in-flight counter
- Classify error as permanent or retryable
- Record audit event (`message.sent` or `message.failed`)
- Append `DeliveryAttempt` to status
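The pre-delivery checks at the start of this flow amount to an ordered guard. A minimal sketch follows, assuming a hypothetical `channelProgress` summary and made-up reason strings; the real orchestrator reads these values from the CRD status and its in-memory state.

```go
package main

// channelProgress is a hypothetical summary of one channel's delivery state.
type channelProgress struct {
	Succeeded      bool // already delivered (persisted or in memory)
	PermanentError bool // a previous attempt failed permanently
	Persisted      int  // attempts recorded in status
	InFlight       int  // attempts started but not yet persisted
}

// skipReason returns why delivery to a channel is skipped, or "" to proceed.
// The checks run in the same order as the flow above.
func skipReason(p channelProgress, maxAttempts int, breakerOpen bool) string {
	switch {
	case p.Succeeded:
		return "already-succeeded"
	case p.PermanentError:
		return "permanent-error"
	case p.Persisted+p.InFlight >= maxAttempts:
		return "max-attempts"
	case breakerOpen: // only meaningful for Slack channels
		return "circuit-breaker-open"
	default:
		return ""
	}
}
```

Counting in-flight attempts toward the limit keeps a concurrent reconcile from exceeding `MaxAttempts` before status is persisted.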
Singleflight Deduplication¶
Concurrent reconciles for the same notification + channel are deduplicated via singleflight.Group. Only one actual delivery call runs; others share the result.
In-Memory State (DD-NOT-008)¶
The orchestrator tracks delivery state in memory to prevent duplicate deliveries between status persistence:
- In-flight attempts: Incremented before delivery, decremented after
- Successful deliveries: Marked in memory immediately on success
- Total attempt count = persisted attempts + in-flight attempts
- Cleanup: `ClearInMemoryState(uid)` is called after status is persisted
Counter Logic¶
SuccessfulDeliveries and FailedDeliveries track unique channels, not attempts. A successful delivery for a channel that previously failed overwrites the failure count for that channel.
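To illustrate counting channels rather than attempts, the sketch below derives both counters from a list of attempts. The `attempt` type and `countUniqueChannels` are hypothetical names used only for this example.

```go
package main

// attempt is a reduced, illustrative view of a DeliveryAttempt.
type attempt struct {
	Channel string
	Success bool
}

// countUniqueChannels returns the number of distinct channels that ended up
// delivered and the number that never succeeded. A success for a channel
// replaces any earlier failure recorded for that same channel.
func countUniqueChannels(attempts []attempt) (succeeded, failed int) {
	delivered := map[string]bool{}
	for _, a := range attempts {
		delivered[a.Channel] = delivered[a.Channel] || a.Success
	}
	for _, ok := range delivered {
		if ok {
			succeeded++
		} else {
			failed++
		}
	}
	return succeeded, failed
}
```

For example, two failed Slack attempts followed by a Slack success, plus one failed file attempt, count as one successful channel and one failed channel.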
Retry Policy¶
Defaults¶
| Parameter | Default | Range |
|---|---|---|
| `MaxAttempts` | 5 | 1–10 |
| `InitialBackoffSeconds` | 30 | 1–300 |
| `BackoffMultiplier` | 2 | 1–10 |
| `MaxBackoffSeconds` | 480 | 60–3600 |
Backoff Calculation¶
Uses the shared pkg/shared/backoff library:
- Formula: `BasePeriod × (Multiplier ^ (attempts - 1))`
- Jitter: ±10% (BR-NOT-055)
- Cap: `MaxBackoffSeconds`
Example with defaults: 30s → 60s → 120s → 240s → 480s
Backoff Enforcement (NT-BUG-007)¶
When in Retrying phase, the controller computes the next retry time from the last failed attempt. If the current time is before the next retry time, the reconcile is requeued with the remaining backoff duration.
Error Classification¶
Retryable Errors¶
Wrapped with NewRetryableError(err). Detected via IsRetryableError(err).
Permanent Errors¶
Stored with the prefix permanent failure: in the DeliveryAttempt.Error field.
Slack Error Classification¶
| Condition | Classification |
|---|---|
| TLS / x509 errors | Permanent |
| HTTP 5xx | Retryable |
| HTTP 429 (rate limit) | Retryable |
| HTTP 4xx (other) | Permanent |
| Network errors (non-TLS) | Retryable |
Circuit Breaker (Slack)¶
The Slack channel uses a Sony gobreaker circuit breaker to prevent hammering a failing webhook:
| Setting | Value |
|---|---|
| Trip threshold | 3 consecutive failures |
| Open duration | 30 seconds |
| Half-open requests | 2 test requests |
| Reset interval | 10 seconds |
When the circuit breaker is open, the controller emits a CircuitBreakerOpen Kubernetes event and skips delivery to that channel.
Channel Implementations¶
| Channel | Description | Error Behavior |
|---|---|---|
| Console | Logs to controller-runtime logger. Format: `[Priority] [Type] Subject\nBody` | Always succeeds |
| Log | Structured JSON Lines to stdout with notification fields | Always succeeds |
| File | Writes JSON/YAML to a directory. Filename: `notification-{name}-{timestamp}.{format}`. Atomic write (temp + rename). E2E/debug use (DD-NOT-002) | Directory/write errors → retryable |
| Slack | Webhook POST with Block Kit formatting. Default timeout: 10s | See error classification above |
Channels for email, teams, sms, and webhook are defined in the CRD schema but not yet implemented.
Credential Management¶
The credential resolver reads secrets from a mounted directory (projected volume):
- Each file: name = credential name, content = secret value
- Hidden files (e.g., `..data` symlinks) are skipped
- Hot-reload: `fsnotify` watches the directory; cache is reloaded on file changes
- Validation: `ValidateRefs(refs)` ensures all referenced credentials exist before delivery
Notification Enrichment¶
Before delivery, the controller enriches notification bodies by resolving workflow UUIDs to human-readable workflow names. This ensures operators see meaningful identifiers (e.g., "RollbackDeployment") instead of opaque UUIDs in notification messages.
Enrichment Flow¶
- Extract the workflow UUID from `spec.metadata["workflowId"]` with fallback to `spec.metadata["selectedWorkflow"]` (the former is set for completion notifications, the latter for approval notifications)
- Call the DataStorage catalog API (`GET /api/v1/workflows/{id}`) to resolve the UUID to a workflow name
- Replace every occurrence of the UUID in `spec.Body` with the resolved name
- Pass the enriched notification to the delivery orchestrator
Graceful Degradation¶
If the workflow UUID is absent from the workflow context, the DataStorage lookup fails, or the resolved name is empty, the notification is delivered unchanged with the original UUID preserved. Enrichment failures are logged but never block delivery.
Extensibility¶
The enrichment layer uses a WorkflowNameResolver interface, allowing alternative resolution backends (e.g., in-memory cache, external catalog) without changing the delivery pipeline.
Audit Events¶
| Event Type | When |
|---|---|
| `notification.message.sent` | Per-channel delivery success |
| `notification.message.failed` | Per-channel delivery failure |
| `notification.message.acknowledged` | All channels in Sent phase |
| `notification.message.escalated` | All channels in Failed phase (permanent) |
Idempotency (NT-BUG-001)¶
Audit events are tracked in a sync.Map with composite keys (message.sent:{channel}, message.failed:{channel}:attempt{N}). Duplicate events are suppressed. Tracking is cleaned up when the CRD is deleted.
Duplicate Reconcile Prevention (NT-BUG-008)¶
The controller skips reconciliation if all of the following are true:
- `Generation == ObservedGeneration`
- `len(DeliveryAttempts) > 0`
- Phase is terminal
This prevents redundant processing when controller-runtime re-delivers an event for an already-completed notification.
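The three conditions combine into a single guard at the top of the reconcile loop. A minimal sketch, using a hypothetical `notificationView` struct in place of the real CRD type:

```go
package main

// notificationView carries only the fields the guard needs.
type notificationView struct {
	Generation         int64
	ObservedGeneration int64
	DeliveryAttempts   int // len(status.DeliveryAttempts) in the real object
	Phase              string
}

// isTerminal reports whether a phase is one of the terminal phases.
func isTerminal(phase string) bool {
	switch phase {
	case "Sent", "PartiallySent", "Failed":
		return true
	}
	return false
}

// shouldSkipReconcile is true only when all three conditions hold.
func shouldSkipReconcile(n notificationView) bool {
	return n.Generation == n.ObservedGeneration &&
		n.DeliveryAttempts > 0 &&
		isTerminal(n.Phase)
}
```

Requiring at least one recorded attempt keeps the guard from skipping a notification whose status was never populated.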
Next Steps¶
- Notification Channels Configuration -- Operator guide for setting up channels and routing
- Remediation Routing -- How the Orchestrator creates notifications
- Architecture Overview -- System topology