Data Persistence¶
Operator guide
For CRD retention, storage lifetime, and use cases, see Data Lifecycle.
Kubernaut uses PostgreSQL as its persistent data store, accessed exclusively through the DataStorage REST API service. Valkey provides a dead-letter queue for audit event resilience. This page covers the database schema, partitioning strategy, indexing, and the RemediationRequest reconstruction pipeline.
Storage Architecture¶
```mermaid
graph TB
    subgraph Services["All Kubernaut Services"]
        S1[Gateway]
        S2[Signal Processing]
        S3[AI Analysis]
        S4[Orchestrator]
        S5[Workflow Execution]
        S6[Notification]
        S7[Effectiveness Monitor]
        S8[Auth Webhook]
    end
    Services -->|REST API| DS[DataStorage Service]
    DS --> PG[(PostgreSQL)]
    DS --> RD[(Valkey<br/><small>DLQ</small>)]
    subgraph Tables["PostgreSQL Tables"]
        AE[audit_events<br/><small>partitioned by month</small>]
        WF[remediation_workflow_catalog]
        AT[action_type_taxonomy]
        AH[action_histories]
        RT[resource_action_traces]
        OP[oscillation_patterns]
        AEM[action_effectiveness_metrics]
        RO_T[retention_operations]
    end
    PG --- Tables
```
Database Schema¶
audit_events¶
The primary audit table, partitioned by month. This is the largest table in the system, storing the complete remediation history.
| Column | Type | Description |
|---|---|---|
| `event_id` | UUID | Primary key (with `event_date`) |
| `event_version` | VARCHAR(10) | Schema version (default: 1.0) |
| `event_timestamp` | TIMESTAMPTZ | When the event occurred |
| `event_date` | DATE | Partition key |
| `event_type` | VARCHAR(100) | Hierarchical type (e.g., `aianalysis.analysis.completed`) |
| `event_category` | VARCHAR(50) | Category (e.g., `signal`, `remediation`) |
| `event_action` | VARCHAR(50) | Action (e.g., `received`, `completed`) |
| `event_outcome` | VARCHAR(20) | `success`, `failure`, or `pending` |
| `actor_type` | VARCHAR(50) | Service or human operator |
| `actor_id` | VARCHAR(255) | Identity of the actor |
| `resource_type` | VARCHAR(100) | Target resource type |
| `resource_id` | VARCHAR(255) | Target resource identifier |
| `correlation_id` | VARCHAR(255) | Links events for one remediation (RR name) |
| `parent_event_id` | UUID | Chain to parent event |
| `parent_event_date` | DATE | Parent event partition key |
| `namespace` | VARCHAR(253) | Kubernetes namespace |
| `cluster_name` | VARCHAR(255) | Cluster identifier |
| `event_data` | JSONB | Service-specific payload |
| `event_hash` | TEXT | SHA256 hash chain for integrity |
| `previous_event_hash` | TEXT | Previous event's hash |
| `severity` | VARCHAR(20) | Signal severity |
| `duration_ms` | BIGINT | Operation duration |
| `error_code` | VARCHAR(50) | Error code (if failure) |
| `error_message` | TEXT | Error description |
| `retention_days` | INTEGER | Default: 2555 (7 years) |
| `is_sensitive` | BOOLEAN | PII flag |
| `legal_hold` | BOOLEAN | Legal hold flag |
| `legal_hold_reason` | TEXT | Reason for hold |
| `legal_hold_placed_by` | VARCHAR(255) | Who placed the hold |
| `legal_hold_placed_at` | TIMESTAMPTZ | When hold was placed |
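The `event_hash` / `previous_event_hash` pair forms a tamper-evident chain: each event's hash covers the previous event's hash, so altering any stored event invalidates every hash after it. A minimal sketch of the idea (the canonical serialization shown here, sorted JSON keys, is an assumption; the actual encoding Kubernaut uses is not specified on this page):

```python
import hashlib
import json

def chain_hash(event: dict, previous_event_hash: str) -> str:
    """Compute a SHA256 hash over the event payload plus its
    predecessor's hash. Sorted-key JSON is an assumed canonical form."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_event_hash + payload).encode()).hexdigest()

# Each row stores previous_event_hash, chaining the audit trail together.
e1 = {"event_type": "gateway.signal.received", "event_outcome": "success"}
e2 = {"event_type": "aianalysis.analysis.completed", "event_outcome": "success"}

h1 = chain_hash(e1, "")   # first event in a chain: empty previous hash
h2 = chain_hash(e2, h1)   # depends on h1, so e1 cannot change undetected
```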
Indexes¶
| Index | Columns | Purpose |
|---|---|---|
| `idx_audit_events_event_timestamp` | `event_timestamp DESC` | Chronological queries |
| `idx_audit_events_correlation_id` | `correlation_id, event_timestamp DESC` | Remediation timeline reconstruction |
| `idx_audit_events_event_type` | `event_type, event_timestamp DESC` | Event type filtering |
| `idx_audit_events_event_data_gin` | `event_data USING GIN` | JSONB payload queries |
| `idx_audit_events_pre_remediation_spec_hash` | `(event_data->>'pre_remediation_spec_hash'), event_timestamp DESC` | Spec hash history lookups |
Partitioning¶
The `audit_events` table uses monthly range partitioning on `event_date`:

- Partitions: `audit_events_2026_03`, `audit_events_2026_04`, ..., `audit_events_2028_12`
- Default partition: `audit_events_default` (catches events outside the defined ranges)

Partitioning provides:

- **Fast queries** -- scoped to relevant months via partition pruning
- **Efficient retention** -- old partitions are dropped without vacuuming the entire table
- **Manageable storage** -- each partition is independently sized and can be backed up separately
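The naming scheme above maps directly onto standard PostgreSQL range-partition DDL. A sketch of how one monthly partition's DDL could be generated (the `partition_ddl` helper is hypothetical; only the `audit_events_YYYY_MM` naming comes from this page):

```python
from datetime import date

def partition_ddl(month: date) -> str:
    """Render CREATE TABLE ... PARTITION OF DDL for one monthly partition
    of audit_events, bounded by the first day of this and the next month."""
    start = month.replace(day=1)
    end = (start.replace(year=start.year + 1, month=1)
           if start.month == 12 else start.replace(month=start.month + 1))
    name = f"audit_events_{start.year}_{start.month:02d}"
    return (
        f"CREATE TABLE {name} PARTITION OF audit_events "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )

ddl = partition_ddl(date(2026, 3, 15))
```

Because the upper bound is exclusive, consecutive partitions tile the calendar with no gaps or overlaps, which is what makes partition pruning and drop-based retention safe.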
remediation_workflow_catalog¶
The workflow catalog table, used for workflow discovery and scoring.
| Column | Type | Description |
|---|---|---|
| `workflow_id` | TEXT | Unique workflow identifier |
| `workflow_name` | VARCHAR | Human-readable name |
| `version` | VARCHAR | Semantic version |
| `is_latest_version` | BOOLEAN | Partial index for discovery queries |
| `action_type` | TEXT | FK to `action_type_taxonomy` |
| `status` | VARCHAR | `active`, `disabled`, `deprecated`, `archived`, or `superseded` |
| `labels` | JSONB | Mandatory labels (severity, component, environment, priority) |
| `custom_labels` | JSONB | Custom labels from workflow schema |
| `detected_labels` | JSONB | Infrastructure-awareness labels |
| `description` | JSONB | Workflow description (what, whenToUse, whenNotToUse) |
| `execution_bundle` | TEXT | OCI image reference |
| `execution_bundle_digest` | TEXT | OCI digest |
| `engine_config` | JSONB | Engine-specific config (e.g., AWX `jobTemplateName`, `inventoryName`); NULL for Tekton/Job |
| `content_hash` | TEXT | SHA256 hash of normalized workflow content for deduplication (DD-EM-002) |
| `schema_data` | JSONB | Full workflow schema |
| `created_at` | TIMESTAMPTZ | Creation timestamp |
| `updated_at` | TIMESTAMPTZ | Last update |
Key indexes:

- GIN indexes on `labels`, `custom_labels`, and `detected_labels` for JSONB containment queries
- Composite index on `(action_type, status, is_latest_version)` for discovery Step 2
- Partial index on `is_latest_version = true` for active workflow queries
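The GIN indexes back PostgreSQL's JSONB containment operator (`@>`), which asks whether every key/value pair in the query document appears in the stored document. For the flat label maps stored here, its behavior can be sketched in plain Python (the example labels are illustrative, not real catalog rows):

```python
def labels_contain(stored: dict, query: dict) -> bool:
    """Mimic JSONB @> containment for flat label maps: every
    key/value pair in the query must be present in stored."""
    return all(stored.get(k) == v for k, v in query.items())

# Roughly what a discovery filter such as
#   labels @> '{"severity": "critical"}'
# evaluates against a catalog row's labels column:
stored = {"severity": "critical", "component": "database", "environment": "prod"}
match = labels_contain(stored, {"severity": "critical"})        # True
miss = labels_contain(stored, {"severity": "warning"})          # False
```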
**Workflow supersession:** Only one workflow version per `(workflow_name, action_type)` pair can be active at a time. When a new version of a workflow is registered (via RemediationWorkflow CRD creation or update), DataStorage marks the previous active entry as superseded and activates the new one. This is enforced by the `GetActiveByWorkflowName` repository method during registration. The AuthWebhook intercepts both CREATE and UPDATE operations on RemediationWorkflow CRDs and forwards them to DataStorage for catalog registration.
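The supersession invariant can be sketched with an in-memory model (the `CatalogEntry` shape and `register` helper are hypothetical; the real logic lives in DataStorage's repository layer):

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    workflow_name: str
    action_type: str
    version: str
    status: str = "active"
    is_latest_version: bool = True

def register(catalog: list, entry: CatalogEntry) -> None:
    """Mark any active entry for the same (workflow_name, action_type)
    pair as superseded, then activate the new version."""
    for existing in catalog:
        if (existing.workflow_name == entry.workflow_name
                and existing.action_type == entry.action_type
                and existing.status == "active"):
            existing.status = "superseded"
            existing.is_latest_version = False
    catalog.append(entry)

catalog: list = []
register(catalog, CatalogEntry("restart-pod", "RestartPod", "1.0.0"))
register(catalog, CatalogEntry("restart-pod", "RestartPod", "1.1.0"))
# Only version 1.1.0 remains active; 1.0.0 is superseded.
```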
action_type_taxonomy¶
The action type registry for workflow categorization.
| Column | Type | Description |
|---|---|---|
| `action_type` | TEXT | Primary key (PascalCase identifier) |
| `description` | JSONB | `{what, whenToUse, whenNotToUse, preconditions}` |
| `status` | VARCHAR | `active` or `disabled` |
| `disabled_at` | TIMESTAMPTZ | When the action type was disabled (NULL if active) |
| `disabled_by` | VARCHAR | Operator identity who disabled it (NULL if active) |
| `created_at` | TIMESTAMPTZ | Creation timestamp |
| `updated_at` | TIMESTAMPTZ | Last update |
The database deploys with a clean schema -- no pre-seeded rows. Action types are registered via `kubectl apply -f` on ActionType CRDs. The AuthWebhook intercepts the admission request and registers each action type in the DataStorage catalog via its REST API. See Workflow Selection: Action Type Taxonomy and Installation: Action Types.
Other Tables¶
| Table | Purpose |
|---|---|
| `action_histories` | Historical action records per resource |
| `resource_action_traces` | Per-resource action tracking for remediation history queries |
| `oscillation_patterns` | Pattern definitions for oscillation detection (repeated fail/fix cycles) |
| `oscillation_detections` | Detected oscillation instances |
| `action_effectiveness_metrics` | Effectiveness scoring per workflow/incident type |
| `retention_operations` | Retention operation tracking and scheduling |
RemediationRequest Reconstruction¶
The DataStorage service can rebuild a complete RemediationRequest from audit events -- even after the CRD has been removed from the cluster.
Endpoint¶
Pipeline¶
```mermaid
graph LR
    Q["1. Query<br/><small>audit events by<br/>correlation_id</small>"]
    P["2. Parse<br/><small>extract CRD fields<br/>from typed payloads</small>"]
    M["3. Map<br/><small>aggregate into<br/>spec/status</small>"]
    B["4. Build<br/><small>produce RR<br/>object</small>"]
    V["5. Validate<br/><small>check completeness<br/>and integrity</small>"]
    Q --> P --> M --> B --> V
```
Query¶
Events are fetched by `correlation_id`, filtered to specific event types:

```sql
SELECT event_id, event_type, event_timestamp, event_outcome,
       resource_type, resource_id, actor_type, actor_id,
       event_data, namespace, cluster_name, duration_ms
FROM audit_events
WHERE correlation_id = $1
  AND event_type IN (
    'gateway.signal.received',
    'aianalysis.analysis.completed',
    'workflowexecution.selection.completed',
    'workflowexecution.execution.started',
    'orchestrator.lifecycle.created'
  )
ORDER BY event_timestamp ASC, event_id ASC
```
Source Event Mapping¶
| Reconstructed Field | Source Event | Payload Field |
|---|---|---|
| `spec.signalName`, `signalType`, `signalLabels` | `gateway.signal.received` | GatewayAuditPayload |
| `spec.originalPayload` | `gateway.signal.received` | GatewayAuditPayload |
| `spec.signalAnnotations` | `gateway.signal.received` | GatewayAuditPayload |
| `status.selectedWorkflowRef` | `workflowexecution.selection.completed` | WorkflowExecutionAuditPayload |
| `status.executionRef` | `workflowexecution.execution.started` | WorkflowExecutionAuditPayload |
| `status.timeoutConfig` | `orchestrator.lifecycle.created` | RemediationOrchestratorAuditPayload |
Events are ordered by timestamp and mapped into typed payloads (`GatewayAuditPayload`, `RemediationOrchestratorAuditPayload`, `AIAnalysisAuditPayload`, `WorkflowExecutionAuditPayload`) to rebuild the RR.
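The "Map" step of the pipeline amounts to folding the ordered event stream into spec/status fields. A simplified sketch (the `event_data` key names such as `signal_name` and `workflow_id` are assumptions for illustration; the real service decodes typed Go payloads):

```python
def map_events(events: list) -> dict:
    """Fold ordered audit events into a partial RemediationRequest,
    following the source-event mapping table above."""
    rr = {"spec": {}, "status": {}}
    for ev in events:  # assumed already ordered by event_timestamp
        data = ev["event_data"]
        if ev["event_type"] == "gateway.signal.received":
            rr["spec"]["signalName"] = data.get("signal_name")
            rr["spec"]["originalPayload"] = data.get("original_payload")
        elif ev["event_type"] == "workflowexecution.selection.completed":
            rr["status"]["selectedWorkflowRef"] = data.get("workflow_id")
        elif ev["event_type"] == "workflowexecution.execution.started":
            rr["status"]["executionRef"] = data.get("execution_ref")
    return rr

events = [
    {"event_type": "gateway.signal.received",
     "event_data": {"signal_name": "HighMemoryUsage", "original_payload": "{}"}},
    {"event_type": "workflowexecution.selection.completed",
     "event_data": {"workflow_id": "wf-123"}},
]
rr = map_events(events)
```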
Limitations¶
- Reconstruction is available for RemediationRequest CRDs only (other CRD types planned)
- `status.error` and `OverallPhase` are not reconstructed from the current event schema
Valkey (DLQ)¶
Valkey serves as a dead-letter queue for audit event resilience: batches that fail to reach PostgreSQL are parked in Valkey streams and retried until they succeed or exhaust their retry budget.
Streams¶
| Stream | Purpose | Max Length |
|---|---|---|
| `audit:dlq:events` | Failed generic audit batches | 10,000 |
| `audit:dlq:notifications` | Failed notification audit events | 10,000 |
| `audit:dead-letter:{type}` | Events that exceeded all retry attempts | 10,000 |
Operations¶
| Operation | Command | Description |
|---|---|---|
| Enqueue | `XADD` | Add failed batch to stream |
| Read | `XREADGROUP` | Consumer group for reliable delivery |
| Acknowledge | `XACK` | Mark message as processed |
| Move to dead letter | `XADD` to dead-letter stream | After max retries |
| Drain | `DrainWithTimeout` | Graceful shutdown flush |
Message Format¶
```json
{
  "type": "audit_event",
  "payload": "...",
  "timestamp": "2026-03-04T12:00:00Z",
  "retry_count": 2,
  "last_error": "connection refused"
}
```
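The `retry_count` field drives the move-to-dead-letter decision described in the operations table. A sketch of that routing logic (the `MAX_RETRIES` value and helper name are assumptions; the actual limit is service configuration):

```python
import json

MAX_RETRIES = 3  # assumed retry budget, not a documented value

def retry_or_dead_letter(raw: str) -> tuple:
    """Increment retry_count on a failed DLQ message and pick its next
    stream: re-enqueue, or dead-letter once retries are exhausted.
    Returns (target_stream, serialized_message)."""
    msg = json.loads(raw)
    msg["retry_count"] = msg.get("retry_count", 0) + 1
    stream = ("audit:dead-letter:" + msg["type"]
              if msg["retry_count"] > MAX_RETRIES
              else "audit:dlq:events")
    return stream, json.dumps(msg)

raw = json.dumps({"type": "audit_event", "payload": "...",
                  "retry_count": 2, "last_error": "connection refused"})
stream, out = retry_or_dead_letter(raw)   # third attempt: still retryable
```

In the real pipeline the returned stream name would be the target of an `XADD`, and the original entry would be acknowledged with `XACK` once routed.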
Data Flow Summary¶
```mermaid
graph TD
    S[Service] -->|StoreAudit| BS[Buffered Store]
    BS -->|batch POST| DS[DataStorage]
    DS -->|INSERT| PG[(PostgreSQL)]
    DS -->|on failure| RD[(Valkey DLQ)]
    RD -->|retry| DS
    DS -->|query| PG
    PG -->|reconstruct| RR[RemediationRequest]
```
Next Steps¶
- Audit Pipeline -- How events reach DataStorage
- Data Lifecycle -- User-facing data lifecycle documentation
- API Reference: DataStorage -- REST API endpoints