Data Persistence¶
Operator guide
For CRD retention, storage lifetime, and use cases, see Data Lifecycle.
Kubernaut uses PostgreSQL as its persistent data store, accessed exclusively through the DataStorage REST API service. Valkey provides a dead-letter queue for audit event resilience. This page covers the database schema, partitioning strategy, indexing, and the RemediationRequest reconstruction pipeline.
Storage Architecture¶
```mermaid
graph TB
    subgraph Services["All Kubernaut Services"]
        S1[Gateway]
        S2[Signal Processing]
        S3[AI Analysis]
        S4[Orchestrator]
        S5[Workflow Execution]
        S6[Notification]
        S7[Effectiveness Monitor]
        S8[Auth Webhook]
    end
    Services -->|REST API| DS[DataStorage Service]
    DS --> PG[(PostgreSQL)]
    DS --> RD[(Valkey<br/><small>DLQ</small>)]
    subgraph Tables["PostgreSQL Tables"]
        AE[audit_events<br/><small>partitioned by month</small>]
        WF[remediation_workflow_catalog]
        AT[action_type_taxonomy]
        AH[action_histories]
        RT[resource_action_traces]
        OP[oscillation_patterns]
        AEM[action_effectiveness_metrics]
        RO_T[retention_operations]
    end
    PG --- Tables
```
Database Schema¶
audit_events¶
The primary audit table, partitioned by month. This is the largest table in the system, storing the complete remediation history.
| Column | Type | Description |
|---|---|---|
| `event_id` | UUID | Primary key (with `event_date`) |
| `event_version` | VARCHAR(10) | Schema version (default: 1.0) |
| `event_timestamp` | TIMESTAMPTZ | When the event occurred |
| `event_date` | DATE | Partition key |
| `event_type` | VARCHAR(100) | Hierarchical type (e.g., `aianalysis.analysis.completed`) |
| `event_category` | VARCHAR(50) | Category (e.g., `signal`, `remediation`) |
| `event_action` | VARCHAR(50) | Action (e.g., `received`, `completed`) |
| `event_outcome` | VARCHAR(20) | `success`, `failure`, or `pending` |
| `actor_type` | VARCHAR(50) | Service or human operator |
| `actor_id` | VARCHAR(255) | Identity of the actor |
| `resource_type` | VARCHAR(100) | Target resource type |
| `resource_id` | VARCHAR(255) | Target resource identifier |
| `correlation_id` | VARCHAR(255) | Links events for one remediation (RR name) |
| `parent_event_id` | UUID | Chain to parent event |
| `parent_event_date` | DATE | Parent event partition key |
| `namespace` | VARCHAR(253) | Kubernetes namespace |
| `cluster_name` | VARCHAR(255) | Cluster identifier |
| `event_data` | JSONB | Service-specific payload |
| `event_hash` | TEXT | SHA256 hash chain for integrity |
| `previous_event_hash` | TEXT | Previous event's hash |
| `severity` | VARCHAR(20) | Signal severity |
| `duration_ms` | BIGINT | Operation duration |
| `error_code` | VARCHAR(50) | Error code (if failure) |
| `error_message` | TEXT | Error description |
| `retention_days` | INTEGER | Default: 2555 (7 years) |
| `is_sensitive` | BOOLEAN | PII flag |
| `legal_hold` | BOOLEAN | Legal hold flag |
| `legal_hold_reason` | TEXT | Reason for hold |
| `legal_hold_placed_by` | VARCHAR(255) | Who placed the hold |
| `legal_hold_placed_at` | TIMESTAMPTZ | When hold was placed |
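The `event_hash` / `previous_event_hash` pair forms a tamper-evident chain: each event's hash covers the previous event's hash, so altering any stored event invalidates every hash after it. A minimal sketch of the idea (the canonical serialization shown here, sorted JSON keys, is an assumption; the actual encoding Kubernaut uses is not specified on this page):

```python
import hashlib
import json

def chain_hash(event: dict, previous_event_hash: str) -> str:
    """Compute a SHA256 hash over the event payload plus its
    predecessor's hash. Sorted-key JSON is an assumed canonical form."""
    payload = json.dumps(event, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((previous_event_hash + payload).encode()).hexdigest()

# Each row stores previous_event_hash, chaining the audit trail together.
e1 = {"event_type": "gateway.signal.received", "event_outcome": "success"}
e2 = {"event_type": "aianalysis.analysis.completed", "event_outcome": "success"}

h1 = chain_hash(e1, "")   # first event in a chain: empty previous hash
h2 = chain_hash(e2, h1)   # depends on h1, so e1 cannot change undetected
```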
Indexes¶
| Index | Columns | Purpose |
|---|---|---|
| `idx_audit_events_event_timestamp` | `event_timestamp DESC` | Chronological queries |
| `idx_audit_events_correlation_id` | `correlation_id, event_timestamp DESC` | Remediation timeline reconstruction |
| `idx_audit_events_event_type` | `event_type, event_timestamp DESC` | Event type filtering |
| `idx_audit_events_event_data_gin` | `event_data USING GIN` | JSONB payload queries |
| `idx_audit_events_pre_remediation_spec_hash` | `(event_data->>'pre_remediation_spec_hash'), event_timestamp DESC` | Spec hash history lookups |
Partitioning¶
The `audit_events` table uses monthly range partitioning on `event_date`:

- Partitions: `audit_events_2026_03`, `audit_events_2026_04`, ..., `audit_events_2028_12`
- Default partition: `audit_events_default` (catches events outside the defined ranges)

Partitioning provides:

- **Fast queries** -- scoped to relevant months via partition pruning
- **Efficient retention** -- old partitions are dropped without vacuuming the entire table
- **Manageable storage** -- each partition is independently sized and can be backed up separately
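The naming scheme above maps directly onto standard PostgreSQL range-partition DDL. A sketch of how one monthly partition's DDL could be generated (the `partition_ddl` helper is hypothetical; only the `audit_events_YYYY_MM` naming comes from this page):

```python
from datetime import date

def partition_ddl(month: date) -> str:
    """Render CREATE TABLE ... PARTITION OF DDL for one monthly partition
    of audit_events, bounded by the first day of this and the next month."""
    start = month.replace(day=1)
    end = (start.replace(year=start.year + 1, month=1)
           if start.month == 12 else start.replace(month=start.month + 1))
    name = f"audit_events_{start.year}_{start.month:02d}"
    return (
        f"CREATE TABLE {name} PARTITION OF audit_events "
        f"FOR VALUES FROM ('{start}') TO ('{end}');"
    )

ddl = partition_ddl(date(2026, 3, 15))
```

Because the upper bound is exclusive, consecutive partitions tile the calendar with no gaps or overlaps, which is what makes partition pruning and drop-based retention safe.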
remediation_workflow_catalog¶
The workflow catalog table, used for workflow discovery and scoring.
| Column | Type | Description |
|---|---|---|
| `workflow_id` | TEXT | Unique workflow identifier |
| `workflow_name` | VARCHAR | Human-readable name |
| `version` | VARCHAR | Semantic version |
| `is_latest_version` | BOOLEAN | Partial index for discovery queries |
| `action_type` | TEXT | FK to `action_type_taxonomy` |
| `status` | VARCHAR | `active`, `disabled`, `deprecated`, `archived`, or `superseded` |
| `labels` | JSONB | Mandatory labels (severity, component, environment, priority) |
| `custom_labels` | JSONB | Custom labels from workflow schema |
| `detected_labels` | JSONB | Infrastructure-awareness labels |
| `description` | JSONB | Workflow description (what, whenToUse, whenNotToUse) |
| `execution_bundle` | TEXT | OCI image reference |
| `execution_bundle_digest` | TEXT | OCI digest |
| `engine_config` | JSONB | Engine-specific config (e.g., AWX `jobTemplateName`, `inventoryName`); NULL for Tekton/Job |
| `content_hash` | TEXT | SHA256 hash of normalized workflow content for deduplication (DD-EM-002) |
| `schema_data` | JSONB | Full workflow schema |
| `created_at` | TIMESTAMPTZ | Creation timestamp |
| `updated_at` | TIMESTAMPTZ | Last update |
Key indexes:

- GIN indexes on `labels`, `custom_labels`, and `detected_labels` for JSONB containment queries
- Composite index on `(action_type, status, is_latest_version)` for discovery Step 2
- Partial index on `is_latest_version = true` for active workflow queries
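The GIN indexes back PostgreSQL's JSONB containment operator (`@>`), which asks whether every key/value pair in the query document appears in the stored document. For the flat label maps stored here, its behavior can be sketched in plain Python (the example labels are illustrative, not real catalog rows):

```python
def labels_contain(stored: dict, query: dict) -> bool:
    """Mimic JSONB @> containment for flat label maps: every
    key/value pair in the query must be present in stored."""
    return all(stored.get(k) == v for k, v in query.items())

# Roughly what a discovery filter such as
#   labels @> '{"severity": "critical"}'
# evaluates against a catalog row's labels column:
stored = {"severity": "critical", "component": "database", "environment": "prod"}
match = labels_contain(stored, {"severity": "critical"})        # True
miss = labels_contain(stored, {"severity": "warning"})          # False
```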
**Workflow supersession:** Only one workflow version per `(workflow_name, action_type)` pair can be active at a time. When a new version of a workflow is registered (via RemediationWorkflow CRD creation or update), DataStorage marks the previous active entry as superseded and activates the new one. This is enforced by the `GetActiveByWorkflowName` repository method during registration. The AuthWebhook intercepts both CREATE and UPDATE operations on RemediationWorkflow CRDs and forwards them to DataStorage for catalog registration.
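The supersession invariant can be sketched with an in-memory model (the `CatalogEntry` shape and `register` helper are hypothetical; the real logic lives in DataStorage's repository layer):

```python
from dataclasses import dataclass

@dataclass
class CatalogEntry:
    workflow_name: str
    action_type: str
    version: str
    status: str = "active"
    is_latest_version: bool = True

def register(catalog: list, entry: CatalogEntry) -> None:
    """Mark any active entry for the same (workflow_name, action_type)
    pair as superseded, then activate the new version."""
    for existing in catalog:
        if (existing.workflow_name == entry.workflow_name
                and existing.action_type == entry.action_type
                and existing.status == "active"):
            existing.status = "superseded"
            existing.is_latest_version = False
    catalog.append(entry)

catalog: list = []
register(catalog, CatalogEntry("restart-pod", "RestartPod", "1.0.0"))
register(catalog, CatalogEntry("restart-pod", "RestartPod", "1.1.0"))
# Only version 1.1.0 remains active; 1.0.0 is superseded.
```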
action_type_taxonomy¶
The action type registry for workflow categorization.
| Column | Type | Description |
|---|---|---|
| `action_type` | TEXT | Primary key (PascalCase identifier) |
| `description` | JSONB | `{what, whenToUse, whenNotToUse, preconditions}` |
| `status` | VARCHAR | `active` or `disabled` |
| `disabled_at` | TIMESTAMPTZ | When the action type was disabled (NULL if active) |
| `disabled_by` | VARCHAR | Operator identity who disabled it (NULL if active) |
| `created_at` | TIMESTAMPTZ | Creation timestamp |
| `updated_at` | TIMESTAMPTZ | Last update |
The database deploys with a clean schema -- no pre-seeded rows. Action types are registered via `kubectl apply -f` on ActionType CRDs. The AuthWebhook intercepts the admission request and registers each action type in the DataStorage catalog via its REST API. See Workflow Selection: Action Type Taxonomy and Installation: Action Types.
Other Tables¶
| Table | Purpose |
|---|---|
| `action_histories` | Historical action records per resource |
| `resource_action_traces` | Per-resource action tracking for remediation history queries |
| `oscillation_patterns` | Pattern definitions for oscillation detection (repeated fail/fix cycles) |
| `oscillation_detections` | Detected oscillation instances |
| `action_effectiveness_metrics` | Effectiveness scoring per workflow/incident type |
| `retention_operations` | Retention operation tracking and scheduling |
RemediationRequest Reconstruction¶
The DataStorage service can rebuild a complete RemediationRequest from audit events -- even after the CRD has been removed from the cluster.
Endpoint¶
Pipeline¶
```mermaid
graph LR
    Q["1. Query<br/><small>audit events by<br/>correlation_id</small>"]
    P["2. Parse<br/><small>extract CRD fields<br/>from typed payloads</small>"]
    M["3. Map<br/><small>aggregate into<br/>spec/status</small>"]
    B["4. Build<br/><small>produce RR<br/>object</small>"]
    V["5. Validate<br/><small>check completeness<br/>and integrity</small>"]
    Q --> P --> M --> B --> V
```
Query¶
Events are fetched by `correlation_id`, filtered to specific event types:

```sql
SELECT event_id, event_type, event_timestamp, event_outcome,
       resource_type, resource_id, actor_type, actor_id,
       event_data, namespace, cluster_name, duration_ms
FROM audit_events
WHERE correlation_id = $1
  AND event_type IN (
    'gateway.signal.received',
    'aianalysis.analysis.completed',
    'workflowexecution.selection.completed',
    'workflowexecution.execution.started',
    'orchestrator.lifecycle.created'
  )
ORDER BY event_timestamp ASC, event_id ASC
```
Source Event Mapping¶
| Reconstructed Field | Source Event | Payload Field |
|---|---|---|
| `spec.signalName`, `signalType`, `signalLabels` | `gateway.signal.received` | GatewayAuditPayload |
| `spec.originalPayload` | `gateway.signal.received` | GatewayAuditPayload |
| `spec.signalAnnotations` | `gateway.signal.received` | GatewayAuditPayload |
| `status.selectedWorkflowRef` | `workflowexecution.selection.completed` | WorkflowExecutionAuditPayload |
| `status.executionRef` | `workflowexecution.execution.started` | WorkflowExecutionAuditPayload |
| `status.timeoutConfig` | `orchestrator.lifecycle.created` | RemediationOrchestratorAuditPayload |
Events are ordered by timestamp and mapped into typed payloads (`GatewayAuditPayload`, `RemediationOrchestratorAuditPayload`, `AIAnalysisAuditPayload`, `WorkflowExecutionAuditPayload`) to rebuild the RR.
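The "Map" step of the pipeline amounts to folding the ordered event stream into spec/status fields. A simplified sketch (the `event_data` key names such as `signal_name` and `workflow_id` are assumptions for illustration; the real service decodes typed Go payloads):

```python
def map_events(events: list) -> dict:
    """Fold ordered audit events into a partial RemediationRequest,
    following the source-event mapping table above."""
    rr = {"spec": {}, "status": {}}
    for ev in events:  # assumed already ordered by event_timestamp
        data = ev["event_data"]
        if ev["event_type"] == "gateway.signal.received":
            rr["spec"]["signalName"] = data.get("signal_name")
            rr["spec"]["originalPayload"] = data.get("original_payload")
        elif ev["event_type"] == "workflowexecution.selection.completed":
            rr["status"]["selectedWorkflowRef"] = data.get("workflow_id")
        elif ev["event_type"] == "workflowexecution.execution.started":
            rr["status"]["executionRef"] = data.get("execution_ref")
    return rr

events = [
    {"event_type": "gateway.signal.received",
     "event_data": {"signal_name": "HighMemoryUsage", "original_payload": "{}"}},
    {"event_type": "workflowexecution.selection.completed",
     "event_data": {"workflow_id": "wf-123"}},
]
rr = map_events(events)
```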
Limitations¶
- Reconstruction is available for RemediationRequest CRDs only (other CRD types planned)
- `status.error` and `OverallPhase` are not reconstructed from the current event schema
Valkey (DLQ)¶
Valkey serves as a dead-letter queue for audit event resilience: batches that fail to reach PostgreSQL are parked in Valkey streams and retried until they succeed or exhaust their retry budget.
Streams¶
| Stream | Purpose | Max Length |
|---|---|---|
| `audit:dlq:events` | Failed generic audit batches | 10,000 |
| `audit:dlq:notifications` | Failed notification audit events | 10,000 |
| `audit:dead-letter:{type}` | Events that exceeded all retry attempts | 10,000 |
Operations¶
| Operation | Command | Description |
|---|---|---|
| Enqueue | `XADD` | Add failed batch to stream |
| Read | `XREADGROUP` | Consumer group for reliable delivery |
| Acknowledge | `XACK` | Mark message as processed |
| Move to dead letter | `XADD` to dead-letter stream | After max retries |
| Drain | `DrainWithTimeout` | Graceful shutdown flush |
Message Format¶
```json
{
  "type": "audit_event",
  "payload": "...",
  "timestamp": "2026-03-04T12:00:00Z",
  "retry_count": 2,
  "last_error": "connection refused"
}
```
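The `retry_count` field drives the move-to-dead-letter decision described in the operations table. A sketch of that routing logic (the `MAX_RETRIES` value and helper name are assumptions; the actual limit is service configuration):

```python
import json

MAX_RETRIES = 3  # assumed retry budget, not a documented value

def retry_or_dead_letter(raw: str) -> tuple:
    """Increment retry_count on a failed DLQ message and pick its next
    stream: re-enqueue, or dead-letter once retries are exhausted.
    Returns (target_stream, serialized_message)."""
    msg = json.loads(raw)
    msg["retry_count"] = msg.get("retry_count", 0) + 1
    stream = ("audit:dead-letter:" + msg["type"]
              if msg["retry_count"] > MAX_RETRIES
              else "audit:dlq:events")
    return stream, json.dumps(msg)

raw = json.dumps({"type": "audit_event", "payload": "...",
                  "retry_count": 2, "last_error": "connection refused"})
stream, out = retry_or_dead_letter(raw)   # third attempt: still retryable
```

In the real pipeline the returned stream name would be the target of an `XADD`, and the original entry would be acknowledged with `XACK` once routed.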
Data Flow Summary¶
```mermaid
graph TD
    S[Service] -->|StoreAudit| BS[Buffered Store]
    BS -->|batch POST| DS[DataStorage]
    DS -->|INSERT| PG[(PostgreSQL)]
    DS -->|on failure| RD[(Valkey DLQ)]
    RD -->|retry| DS
    DS -->|query| PG
    PG -->|reconstruct| RR[RemediationRequest]
```
Next Steps¶
- Audit Pipeline -- How events reach DataStorage
- Data Lifecycle -- User-facing data lifecycle documentation
- API Reference: DataStorage -- REST API endpoints