Data Persistence¶
Operator guide
For CRD retention, storage lifetime, and use cases, see Data Lifecycle.
Kubernaut uses PostgreSQL as its persistent data store, accessed exclusively through the DataStorage REST API service. Valkey provides a dead-letter queue for audit event resilience. This page covers the database schema, partitioning strategy, indexing, and the RemediationRequest reconstruction pipeline.
Storage Architecture¶
graph TB
subgraph Services["All Kubernaut Services"]
S1[Gateway]
S2[Signal Processing]
S3[AI Analysis]
S4[Orchestrator]
S5[Workflow Execution]
S6[Notification]
S7[Effectiveness Monitor]
S8[Auth Webhook]
end
Services -->|REST API| DS[DataStorage Service]
DS --> PG[(PostgreSQL)]
DS --> RD[(Valkey<br/><small>DLQ</small>)]
subgraph Tables["PostgreSQL Tables (17)"]
direction LR
subgraph Core["Core Tables"]
AE[audit_events<br/><small>partitioned</small>]
RR_T[resource_references]
WF[remediation_workflow_catalog]
AT[action_type_taxonomy]
end
subgraph Tracking["Action Tracking"]
AH[action_histories]
RT[resource_action_traces<br/><small>partitioned</small>]
NAU[notification_audit]
end
subgraph Learning["Learning & Effectiveness"]
AA[action_assessments]
ER[effectiveness_results]
ACS[action_confidence_scores]
AO[action_outcomes]
AAlt[action_alternatives]
AEM[action_effectiveness_metrics]
end
subgraph Operations["Operations"]
OP[oscillation_patterns]
OD[oscillation_detections]
ARP[audit_retention_policies]
RO_T[retention_operations]
end
end
subgraph Views["Views (5)"]
V1[effectiveness_trends]
V2[low_confidence_actions]
V3[action_history_summary]
V4[incident_summary_view]
V5[oscillation_detection_summary]
end
PG --- Tables
PG --- Views
Database Schema¶
resource_references¶
Core resource identity table, serving as the FK target for action histories and oscillation detections.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK | Auto-incrementing primary key |
resource_uid |
VARCHAR(36) |
NOT NULL | Kubernetes resource UID |
api_version |
VARCHAR(100) |
NOT NULL | API version (e.g., apps/v1) |
kind |
VARCHAR(100) |
NOT NULL | Resource kind (e.g., Deployment) |
name |
VARCHAR(253) |
NOT NULL | Resource name |
namespace |
VARCHAR(63) |
Kubernetes namespace (NULL for cluster-scoped) | |
created_at |
TIMESTAMPTZ |
Default: now() |
|
deleted_at |
TIMESTAMPTZ |
Soft-delete timestamp | |
last_seen |
TIMESTAMPTZ |
Default: now() |
Indexes: PK, UNIQUE on resource_uid, UNIQUE on (namespace, kind, name), btree on kind, namespace, last_seen.
Referenced by: action_histories.resource_id, oscillation_detections.resource_id (both CASCADE on delete).
audit_events¶
The primary audit table, partitioned by month. This is the largest table in the system, storing the complete remediation history.
| Column | Type | Nullable | Description |
|---|---|---|---|
event_id |
UUID |
PK (with event_date) |
Primary key |
event_version |
VARCHAR(10) |
Schema version (default: 1.0) |
|
event_timestamp |
TIMESTAMPTZ |
When the event occurred | |
event_date |
DATE |
PK | Partition key |
event_type |
VARCHAR(100) |
Hierarchical type (e.g., aianalysis.analysis.completed) |
|
event_category |
VARCHAR(50) |
Category (e.g., signal, remediation) |
|
event_action |
VARCHAR(50) |
Action (e.g., received, completed) |
|
event_outcome |
VARCHAR(20) |
success, failure, pending |
|
actor_type |
VARCHAR(50) |
Service or human operator | |
actor_id |
VARCHAR(255) |
Identity of the actor | |
resource_type |
VARCHAR(100) |
Target resource type | |
resource_id |
VARCHAR(255) |
Target resource identifier | |
correlation_id |
VARCHAR(255) |
Links events for one remediation (RR name) | |
parent_event_id |
UUID |
Chain to parent event | |
parent_event_date |
DATE |
Parent event partition key | |
namespace |
VARCHAR(253) |
Kubernetes namespace | |
cluster_name |
VARCHAR(255) |
Cluster identifier | |
event_data |
JSONB |
Service-specific payload | |
event_hash |
TEXT |
SHA256 hash chain for integrity | |
previous_event_hash |
TEXT |
Previous event's hash | |
severity |
VARCHAR(20) |
Signal severity | |
duration_ms |
BIGINT |
Operation duration | |
error_code |
VARCHAR(50) |
Error code (if failure) | |
error_message |
TEXT |
Error description | |
retention_days |
INTEGER |
Default: 2555 (7 years) | |
is_sensitive |
BOOLEAN |
PII flag | |
legal_hold |
BOOLEAN |
Legal hold flag | |
legal_hold_reason |
TEXT |
Reason for hold | |
legal_hold_placed_by |
VARCHAR(255) |
Who placed the hold | |
legal_hold_placed_at |
TIMESTAMPTZ |
When hold was placed |
Indexes:
| Index | Columns | Purpose |
|---|---|---|
idx_audit_events_event_timestamp |
event_timestamp DESC |
Chronological queries |
idx_audit_events_correlation_id |
correlation_id, event_timestamp DESC |
Remediation timeline reconstruction |
idx_audit_events_event_type |
event_type, event_timestamp DESC |
Event type filtering |
idx_audit_events_event_data_gin |
event_data USING GIN |
JSONB payload queries |
idx_audit_events_pre_remediation_spec_hash |
(event_data->>'pre_remediation_spec_hash'), event_timestamp DESC |
Spec hash history lookups |
idx_audit_events_post_remediation_spec_hash |
(event_data->>'post_remediation_spec_hash'), event_timestamp DESC |
Post-remediation spec hash lookups (migration 004) |
Partitioning: Monthly range on event_date (audit_events_2026_03 through audit_events_2028_12, plus audit_events_default). Provides fast queries via partition pruning, efficient retention via partition drops, and independent sizing/backup per partition.
Trigger: prevent_legal_hold_deletion prevents deletion of rows with legal_hold = true.
remediation_workflow_catalog¶
The workflow catalog table, used for workflow discovery and scoring.
| Column | Type | Nullable | Description |
|---|---|---|---|
workflow_id |
UUID |
PK | Deterministic UUIDv5 from content hash |
workflow_name |
VARCHAR(255) |
NOT NULL | CRD metadata name |
version |
VARCHAR(50) |
NOT NULL | Semantic version |
name |
VARCHAR(255) |
NOT NULL | Human-readable display name |
description |
JSONB |
NOT NULL | {what, whenToUse, whenNotToUse, preconditions} (camelCase keys) |
owner |
VARCHAR(255) |
Workflow owner | |
maintainer |
VARCHAR(255) |
Workflow maintainer | |
content |
TEXT |
NOT NULL | Full workflow content (YAML) |
content_hash |
VARCHAR(64) |
NOT NULL | SHA256 hash of normalized workflow content (DD-EM-002) |
labels |
JSONB |
NOT NULL | Mandatory labels (signalName key for semantic signal matching) |
parameters |
JSONB |
Workflow parameters | |
execution_engine |
VARCHAR(50) |
NOT NULL | Default: tekton. One of tekton, job, ansible |
schema_image |
TEXT |
OCI image pulled at registration to extract /workflow-schema.yaml |
|
schema_digest |
VARCHAR(71) |
SHA256 digest of the schema image | |
execution_bundle |
TEXT |
OCI execution bundle reference (digest-pinned) | |
execution_bundle_digest |
VARCHAR(71) |
SHA256 digest of execution bundle | |
custom_labels |
JSONB |
NOT NULL | Custom labels from workflow schema |
detected_labels |
JSONB |
NOT NULL | Infrastructure-awareness labels |
engine_config |
JSONB |
Engine-specific config (e.g., AWX jobTemplateName, inventoryName). NULL for Tekton/Job |
|
action_type |
TEXT |
NOT NULL | FK to action_type_taxonomy |
status |
VARCHAR(20) |
NOT NULL | active, disabled, deprecated, archived, superseded |
status_reason |
TEXT |
Reason for current status | |
schema_version |
VARCHAR(10) |
NOT NULL | Default: 1.0 |
disabled_at |
TIMESTAMPTZ |
When disabled | |
disabled_by |
VARCHAR(255) |
Who disabled it | |
disabled_reason |
TEXT |
Why disabled | |
is_latest_version |
BOOLEAN |
NOT NULL | Partial index for discovery queries |
previous_version |
VARCHAR(50) |
Previous version reference | |
deprecation_notice |
TEXT |
Deprecation notice text | |
version_notes |
TEXT |
Version release notes | |
change_summary |
TEXT |
Summary of changes from previous version | |
approved_by |
VARCHAR(255) |
Approver identity | |
approved_at |
TIMESTAMPTZ |
Approval timestamp | |
expected_success_rate |
NUMERIC(4,3) |
Expected success rate [0.000–1.000] | |
expected_duration_seconds |
INTEGER |
Expected execution duration | |
actual_success_rate |
NUMERIC(4,3) |
Computed from execution history [0.000–1.000] | |
total_executions |
INTEGER |
Default: 0 | |
successful_executions |
INTEGER |
Default: 0 (constrained ≤ total_executions) |
|
created_at |
TIMESTAMPTZ |
NOT NULL | |
updated_at |
TIMESTAMPTZ |
NOT NULL | Auto-updated via trigger |
created_by |
VARCHAR(255) |
Creator identity | |
updated_by |
VARCHAR(255) |
Last updater identity |
Key indexes: GIN on labels, custom_labels, detected_labels; composite on (action_type, status, is_latest_version) for discovery; partial UNIQUE on (workflow_name, version) WHERE status = 'active'; btree on created_at DESC, status, workflow_name, schema_digest, execution_bundle_digest, actual_success_rate DESC (active only).
Check constraints: status must be one of active/disabled/deprecated/archived/superseded. expected_success_rate, actual_success_rate must be in [0, 1]. successful_executions must be ≤ total_executions.
Workflow supersession: Only one workflow version per (workflow_name, action_type) pair can be active at a time. When a new version is registered (via RemediationWorkflow CRD creation or update), DataStorage marks the previous active entry as superseded and activates the new one, enforced by the GetActiveByWorkflowName repository method.
PK collision recovery (SupersedeAndCreate): Workflow primary keys are deterministic (UUIDv5). A rare re-registration can collide with an existing row. If the colliding row has status = 'Superseded', DataStorage re-activates it with reason reactivated: re-registered via CRD. The implementation uses a SAVEPOINT/ROLLBACK pattern.
action_type_taxonomy¶
The action type registry for workflow categorization.
| Column | Type | Nullable | Description |
|---|---|---|---|
action_type |
TEXT |
PK | PascalCase identifier (e.g., ScaleReplicas, RestartPod) |
description |
JSONB |
NOT NULL | {what, whenToUse, whenNotToUse, preconditions} (camelCase keys) |
status |
TEXT |
NOT NULL | Default: active. Lifecycle: active, disabled, deprecated, archived, superseded |
disabled_at |
TIMESTAMPTZ |
When disabled | |
disabled_by |
TEXT |
Who disabled it | |
created_at |
TIMESTAMP |
NOT NULL | |
updated_at |
TIMESTAMP |
NOT NULL | Auto-updated via trigger |
Referenced by: remediation_workflow_catalog.action_type (FK).
The database deploys with a clean schema — no pre-seeded rows. Action types are registered via kubectl apply -f on ActionType CRDs. The AuthWebhook intercepts the admission request and registers each action type in the DataStorage catalog via its REST API.
resource_action_traces¶
The major partitioned trace table for per-resource action tracking. Stores signal data, AI decision metadata, execution tracking, and effectiveness scoring. Partitioned by month on action_timestamp.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK (with action_timestamp) |
|
action_history_id |
BIGINT |
NOT NULL | FK to action_histories.id |
action_id |
VARCHAR(64) |
NOT NULL | UNIQUE (with action_timestamp) |
correlation_id |
VARCHAR(64) |
Links to RR correlation | |
| Timing | |||
action_timestamp |
TIMESTAMPTZ |
NOT NULL | Partition key |
execution_start_time |
TIMESTAMPTZ |
||
execution_end_time |
TIMESTAMPTZ |
||
execution_duration_ms |
INTEGER |
||
| Signal context | |||
signal_name |
VARCHAR(200) |
NOT NULL | Alert/signal name |
signal_severity |
VARCHAR(20) |
NOT NULL | Signal severity |
signal_labels |
JSONB |
GIN-indexed for containment queries | |
signal_annotations |
JSONB |
||
signal_firing_time |
TIMESTAMPTZ |
||
incident_type |
VARCHAR(100) |
||
alert_name |
VARCHAR(255) |
||
incident_severity |
VARCHAR(20) |
||
| Workflow / execution | |||
workflow_id |
VARCHAR(64) |
||
workflow_version |
VARCHAR(20) |
||
workflow_step_number |
INTEGER |
||
workflow_execution_id |
VARCHAR(64) |
||
execution_status |
VARCHAR(20) |
Default: pending. Values: pending, executing, completed, failed |
|
execution_error |
TEXT |
||
kubernetes_operations |
JSONB |
||
| AI decision | |||
ai_selected_workflow |
BOOLEAN |
Default: false |
|
ai_chained_workflows |
BOOLEAN |
Default: false |
|
ai_manual_escalation |
BOOLEAN |
Default: false |
|
ai_workflow_customization |
JSONB |
||
model_used |
VARCHAR(100) |
NOT NULL | LLM model identifier |
routing_tier |
VARCHAR(20) |
||
model_confidence |
NUMERIC(4,3) |
NOT NULL | [0.000–1.000] |
model_reasoning |
TEXT |
||
alternative_actions |
JSONB |
||
action_type |
VARCHAR(50) |
NOT NULL | |
action_parameters |
JSONB |
GIN-indexed | |
| Resource state | |||
resource_state_before |
JSONB |
GIN-indexed | |
resource_state_after |
JSONB |
||
| Effectiveness | |||
effectiveness_score |
NUMERIC(4,3) |
[0.000–1.000] when assessed | |
effectiveness_criteria |
JSONB |
||
effectiveness_assessed_at |
TIMESTAMPTZ |
||
effectiveness_assessment_method |
VARCHAR(20) |
||
effectiveness_assessment_due |
TIMESTAMPTZ |
||
effectiveness_notes |
TEXT |
||
follow_up_actions |
JSONB |
||
created_at |
TIMESTAMPTZ |
Default: now() |
|
updated_at |
TIMESTAMPTZ |
Auto-updated via trigger |
Key indexes (22 total): composite indexes for workflow success rates, incident type analysis, AI execution mode queries, alert-based lookups, effectiveness analysis, and multi-dimensional success correlation. Partial indexes on execution_status IN ('pending','executing') and effectiveness_score IS NOT NULL for targeted queries.
Triggers: create_assessment_for_action_trace (AFTER UPDATE — auto-creates action_assessments entries), update_resource_action_traces_updated_at (BEFORE UPDATE).
Partitioning: Monthly range on action_timestamp (resource_action_traces_2026_03 through resource_action_traces_2028_12, plus resource_action_traces_default).
action_histories¶
Per-resource action history with configurable retention and compaction.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK | |
resource_id |
BIGINT |
NOT NULL | FK to resource_references.id (UNIQUE, CASCADE) |
max_actions |
INTEGER |
Default: 1000 | |
max_age_days |
INTEGER |
Default: 30 | |
compaction_strategy |
VARCHAR(20) |
Default: pattern-aware |
|
oscillation_window_minutes |
INTEGER |
Default: 120 | |
effectiveness_threshold |
NUMERIC(3,2) |
Default: 0.70 | |
pattern_min_occurrences |
INTEGER |
Default: 3 | |
total_actions |
INTEGER |
Default: 0 | |
last_action_at |
TIMESTAMPTZ |
||
last_analysis_at |
TIMESTAMPTZ |
||
next_analysis_at |
TIMESTAMPTZ |
||
created_at |
TIMESTAMPTZ |
Default: now() |
|
updated_at |
TIMESTAMPTZ |
Auto-updated via trigger |
Referenced by: resource_action_traces.action_history_id, retention_operations.action_history_id (both CASCADE).
notification_audit¶
Notification-specific audit events with delivery tracking.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK | |
remediation_id |
VARCHAR(255) |
NOT NULL | |
notification_id |
VARCHAR(255) |
NOT NULL | UNIQUE |
recipient |
VARCHAR(255) |
NOT NULL | |
channel |
VARCHAR(50) |
NOT NULL | Constrained: email, slack, pagerduty, sms |
message_summary |
TEXT |
NOT NULL | |
status |
VARCHAR(50) |
NOT NULL | Constrained: sent, failed, acknowledged, escalated |
sent_at |
TIMESTAMPTZ |
NOT NULL | |
delivery_status |
TEXT |
||
error_message |
TEXT |
||
escalation_level |
INTEGER |
NOT NULL | Default: 0 |
created_at |
TIMESTAMPTZ |
NOT NULL | |
updated_at |
TIMESTAMPTZ |
NOT NULL |
Indexes: btree on remediation_id, notification_id, channel, status, created_at DESC.
action_assessments¶
Pending effectiveness assessments for completed actions. Auto-created by the create_assessment_for_action_trace trigger.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
UUID |
PK | gen_random_uuid() |
trace_id |
VARCHAR(255) |
NOT NULL | |
action_type |
VARCHAR(100) |
NOT NULL | |
context_hash |
VARCHAR(64) |
NOT NULL | |
alert_name |
VARCHAR(255) |
NOT NULL | |
namespace |
VARCHAR(255) |
NOT NULL | |
resource_name |
VARCHAR(255) |
NOT NULL | |
executed_at |
TIMESTAMPTZ |
NOT NULL | |
scheduled_for |
TIMESTAMPTZ |
NOT NULL | Default: now() + 5 minutes |
status |
VARCHAR(50) |
NOT NULL | Default: pending. Constrained: pending, processing, completed, failed, skipped |
created_at |
TIMESTAMPTZ |
NOT NULL | |
completed_at |
TIMESTAMPTZ |
Indexes: partial index on (status, scheduled_for) WHERE status = 'pending' for scheduler queries.
effectiveness_results¶
Results of effectiveness assessments for learning feedback.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
UUID |
PK | |
trace_id |
VARCHAR(255) |
NOT NULL | UNIQUE |
action_type |
VARCHAR(100) |
NOT NULL | |
overall_score |
FLOAT8 |
NOT NULL | [0.0–1.0] |
alert_resolved |
BOOLEAN |
NOT NULL | |
metric_delta |
JSONB |
||
side_effects |
INTEGER |
Default: 0 | |
confidence |
FLOAT8 |
NOT NULL | [0.0–1.0] |
assessed_at |
TIMESTAMPTZ |
NOT NULL | |
recommended_adjustments |
JSONB |
||
learning_contribution |
FLOAT8 |
NOT NULL | Default: 0.5. [0.0–1.0] |
created_at |
TIMESTAMPTZ |
NOT NULL |
Indexes: btree on action_type, assessed_at, overall_score; composite on (action_type, assessed_at DESC) for learning queries.
action_confidence_scores¶
Dynamic confidence scores that improve through learning.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
UUID |
PK | |
action_type |
VARCHAR(100) |
NOT NULL | |
context_hash |
VARCHAR(64) |
NOT NULL | |
base_confidence |
FLOAT8 |
NOT NULL | [0.0–1.0] |
adjusted_confidence |
FLOAT8 |
NOT NULL | [0.0–1.0] |
adjustment_reason |
TEXT |
||
effectiveness_samples |
INTEGER |
Default: 0 | |
last_updated |
TIMESTAMPTZ |
NOT NULL | |
created_at |
TIMESTAMPTZ |
NOT NULL |
Unique constraint: (action_type, context_hash).
action_outcomes¶
Historical outcomes for ML training.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
UUID |
PK | |
trace_id |
VARCHAR(255) |
NOT NULL | |
action_type |
VARCHAR(100) |
NOT NULL | |
context_hash |
VARCHAR(64) |
NOT NULL | |
success |
BOOLEAN |
NOT NULL | |
alert_resolved |
BOOLEAN |
NOT NULL | |
side_effects |
INTEGER |
Default: 0 | |
effectiveness_score |
FLOAT8 |
NOT NULL | [0.0–1.0] |
execution_time |
BIGINT |
Duration in ms | |
metrics_before |
JSONB |
||
metrics_after |
JSONB |
||
failure_reason |
TEXT |
||
executed_at |
TIMESTAMPTZ |
NOT NULL | |
assessed_at |
TIMESTAMPTZ |
NOT NULL | |
created_at |
TIMESTAMPTZ |
NOT NULL |
Indexes: composite on (action_type, context_hash, executed_at DESC) for learning queries; btree on effectiveness_score, success, executed_at.
action_alternatives¶
Alternative action recommendations for failed patterns.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
UUID |
PK | |
failed_action_type |
VARCHAR(100) |
NOT NULL | |
context_hash |
VARCHAR(64) |
NOT NULL | |
alternative_action_type |
VARCHAR(100) |
NOT NULL | |
success_rate |
FLOAT8 |
NOT NULL | Default: 0.5. [0.0–1.0] |
sample_size |
INTEGER |
NOT NULL | Default: 0 |
last_success_at |
TIMESTAMPTZ |
||
created_at |
TIMESTAMPTZ |
NOT NULL | |
updated_at |
TIMESTAMPTZ |
NOT NULL |
Unique constraint: (failed_action_type, context_hash, alternative_action_type).
action_effectiveness_metrics¶
Aggregated effectiveness metrics by scope, period, and action type.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK | |
scope_type |
VARCHAR(50) |
NOT NULL | e.g., namespace, cluster, action_type |
scope_value |
VARCHAR(200) |
||
metric_period |
VARCHAR(20) |
NOT NULL | e.g., daily, weekly |
period_start |
TIMESTAMPTZ |
NOT NULL | |
period_end |
TIMESTAMPTZ |
NOT NULL | |
action_type |
VARCHAR(50) |
NOT NULL | |
sample_size |
INTEGER |
NOT NULL | |
average_score |
NUMERIC(4,3) |
NOT NULL | |
median_score |
NUMERIC(4,3) |
||
std_deviation |
NUMERIC(4,3) |
||
confidence_interval_lower |
NUMERIC(4,3) |
||
confidence_interval_upper |
NUMERIC(4,3) |
||
trend_direction |
VARCHAR(20) |
||
trend_confidence |
NUMERIC(4,3) |
||
min_sample_size_met |
BOOLEAN |
||
statistical_significance |
NUMERIC(4,3) |
||
created_at |
TIMESTAMPTZ |
Unique constraint: (scope_type, scope_value, metric_period, period_start, action_type).
oscillation_patterns¶
Pattern definitions for oscillation detection (repeated fail/fix cycles).
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK | |
pattern_type |
VARCHAR(50) |
NOT NULL | |
pattern_name |
VARCHAR(200) |
NOT NULL | |
description |
TEXT |
||
min_occurrences |
INTEGER |
NOT NULL | Default: 3 |
time_window_minutes |
INTEGER |
NOT NULL | Default: 120 |
action_sequence |
JSONB |
||
threshold_config |
JSONB |
||
resource_types |
TEXT[] |
||
namespaces |
TEXT[] |
||
label_selectors |
JSONB |
||
prevention_strategy |
VARCHAR(50) |
NOT NULL | |
prevention_parameters |
JSONB |
||
alerting_enabled |
BOOLEAN |
Default: true |
|
alert_severity |
VARCHAR(20) |
Default: warning |
|
alert_channels |
TEXT[] |
||
total_detections |
INTEGER |
Default: 0 | |
prevention_success_rate |
NUMERIC(4,3) |
||
false_positive_rate |
NUMERIC(4,3) |
||
last_detection_at |
TIMESTAMPTZ |
||
active |
BOOLEAN |
Default: true |
|
created_at |
TIMESTAMPTZ |
||
updated_at |
TIMESTAMPTZ |
Auto-updated via trigger |
Referenced by: oscillation_detections.pattern_id (CASCADE).
oscillation_detections¶
Detected oscillation instances.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK | |
pattern_id |
BIGINT |
NOT NULL | FK to oscillation_patterns.id (CASCADE) |
resource_id |
BIGINT |
NOT NULL | FK to resource_references.id (CASCADE) |
detected_at |
TIMESTAMPTZ |
NOT NULL | Default: now() |
confidence |
NUMERIC(4,3) |
NOT NULL | |
action_count |
INTEGER |
NOT NULL | |
time_span_minutes |
INTEGER |
NOT NULL | |
matching_actions |
BIGINT[] |
Array of resource_action_traces.id values |
|
pattern_evidence |
JSONB |
||
prevention_applied |
BOOLEAN |
Default: false |
|
prevention_action |
VARCHAR(50) |
||
prevention_details |
JSONB |
||
prevention_successful |
BOOLEAN |
||
resolved |
BOOLEAN |
Default: false |
|
resolved_at |
TIMESTAMPTZ |
||
resolution_method |
VARCHAR(50) |
||
resolution_notes |
TEXT |
||
created_at |
TIMESTAMPTZ |
Indexes: btree on (pattern_id, resource_id), detected_at; partial on resolved = false for active detections.
audit_retention_policies¶
Retention policy definitions per event category.
| Column | Type | Nullable | Description |
|---|---|---|---|
policy_id |
UUID |
PK | gen_random_uuid() |
event_category |
TEXT |
NOT NULL | UNIQUE |
retention_days |
INTEGER |
NOT NULL | |
legal_hold_override |
BOOLEAN |
Default: false |
|
created_at |
TIMESTAMP |
NOT NULL | |
updated_at |
TIMESTAMP |
NOT NULL |
retention_operations¶
Retention operation tracking and scheduling.
| Column | Type | Nullable | Description |
|---|---|---|---|
id |
BIGSERIAL |
PK | |
action_history_id |
BIGINT |
NOT NULL | FK to action_histories.id (CASCADE) |
operation_type |
VARCHAR(30) |
NOT NULL | |
strategy_used |
VARCHAR(30) |
NOT NULL | |
records_before |
INTEGER |
NOT NULL | |
records_after |
INTEGER |
NOT NULL | |
records_deleted |
INTEGER |
NOT NULL | |
records_archived |
INTEGER |
||
retention_criteria |
JSONB |
||
preserved_criteria |
JSONB |
||
operation_start |
TIMESTAMPTZ |
NOT NULL | |
operation_end |
TIMESTAMPTZ |
||
operation_duration_ms |
INTEGER |
||
operation_status |
VARCHAR(20) |
Default: running |
|
error_message |
TEXT |
||
created_at |
TIMESTAMPTZ |
Views¶
effectiveness_trends¶
Daily aggregation of effectiveness results by action type.
| Column | Type | Source |
|---|---|---|
action_type |
VARCHAR(100) |
effectiveness_results.action_type |
assessment_date |
TIMESTAMPTZ |
date_trunc('day', assessed_at) |
total_assessments |
BIGINT |
count(*) |
avg_effectiveness |
FLOAT8 |
avg(overall_score) |
avg_confidence |
FLOAT8 |
avg(confidence) |
alerts_resolved |
BIGINT |
Count where alert_resolved = true |
resolution_rate |
FLOAT8 |
alerts_resolved / total_assessments |
low_confidence_actions¶
Actions requiring attention due to poor performance (adjusted confidence < 0.5).
| Column | Type | Source |
|---|---|---|
action_type |
VARCHAR(100) |
action_confidence_scores |
context_hash |
VARCHAR(64) |
action_confidence_scores |
adjusted_confidence |
FLOAT8 |
Below 0.5 threshold |
adjustment_reason |
TEXT |
|
effectiveness_samples |
INTEGER |
|
last_updated |
TIMESTAMPTZ |
|
recent_success_rate |
NUMERIC |
From action_outcomes (last 7 days) |
recent_samples |
BIGINT |
From action_outcomes (last 7 days) |
action_history_summary¶
Per-resource action summary joining resource_references, action_histories, and resource_action_traces.
| Column | Type | Source |
|---|---|---|
namespace |
VARCHAR(63) |
resource_references |
kind |
VARCHAR(100) |
resource_references |
name |
VARCHAR(253) |
resource_references |
total_actions |
INTEGER |
action_histories |
last_action_at |
TIMESTAMPTZ |
action_histories |
recent_actions_24h |
BIGINT |
Count from resource_action_traces (last 24h) |
avg_effectiveness_24h |
NUMERIC |
Avg effectiveness_score (last 24h) |
action_types_used |
BIGINT |
Distinct action_type count (last 24h) |
incident_summary_view¶
Incident counts grouped by signal severity (ordered critical → low).
oscillation_detection_summary¶
Oscillation detection statistics grouped by pattern type, including total detections, preventions applied/successful, average confidence, and last detection timestamp.
Stored Functions¶
The database includes procedural functions for automated analysis:
| Function | Purpose |
|---|---|
create_assessment_for_action_trace() |
Trigger: auto-creates action_assessments entries after trace updates |
create_monthly_partitions() |
Generates monthly partitions for audit_events and resource_action_traces |
analyze_action_oscillation() |
Detects oscillation patterns in action histories |
detect_cascading_failures() |
Identifies cascading failure sequences |
detect_ineffective_loops() |
Finds repeated ineffective remediation loops |
detect_resource_thrashing() |
Detects resource thrashing (rapid scale up/down) |
detect_scale_oscillation() |
Detects scale oscillation patterns |
analyze_cascade_effects() |
Analyzes the scope of cascading failures |
store_oscillation_detection() |
Persists a detected oscillation to oscillation_detections |
prevent_legal_hold_deletion() |
Blocks deletion of audit events under legal hold |
audit_event_lock_id() |
Generates advisory lock IDs for audit event deduplication |
get_action_effectiveness() |
Retrieves computed effectiveness scores |
get_action_traces() |
Queries resource_action_traces with filters |
get_recent_actions() |
Returns recent actions for a resource |
get_resource_actions_base() |
Base query for resource action retrieval |
get_resource_id() |
Resolves or creates resource_references entries |
Deterministic catalog IDs (UUIDv5)¶
RemediationWorkflow resources use deterministic UUIDs (UUIDv5 derived from a content hash of the spec). The same workflow specification always yields the same UUID, so workflow catalog rows and cross-references stay stable across PVC wipes and database replays as long as the spec is unchanged. ActionType entries are keyed by their actionType identifier string.
Auth Webhook startup reconciliation¶
On startup, the Auth Webhook runs a Runnable that lists cluster ActionType objects, then RemediationWorkflow objects, and reconciles them with DataStorage through idempotent creates. This repopulates the catalog after storage loss or drift without duplicating rows.
Database migrations¶
Schema changes use an append-only migration chain managed by goose. The strategy is:
- Append-only chain — migrations are never rewritten in place; history stays linear.
- Per-major baselines — each major release can ship a squashed baseline for fresh installs, while upgrades follow the incremental chain from their installed version.
- Minor release squash — development incrementals are typically squashed per minor at release time to keep the chain maintainable.
db-migratemigration job — runs via Helm hook (post-install,post-upgrade) and distinguishes fresh install vs upgrade using thegoose_db_versiontable so the correct migration path applies.
Migrations 002–005 are part of this chain; 004 adds an index on post_remediation_spec_hash in event_data for audit queries, and 005 adds an effectiveness correlation index.
RemediationRequest Reconstruction¶
The DataStorage service can rebuild a complete RemediationRequest from audit events — even after the CRD has been removed from the cluster.
Endpoint¶
Pipeline¶
graph LR
Q["1. Query<br/><small>audit events by<br/>correlation_id</small>"]
P["2. Parse<br/><small>extract CRD fields<br/>from typed payloads</small>"]
M["3. Map<br/><small>aggregate into<br/>spec/status</small>"]
B["4. Build<br/><small>produce RR<br/>object</small>"]
V["5. Validate<br/><small>check completeness<br/>and integrity</small>"]
Q --> P --> M --> B --> V
Query¶
Events are fetched by correlation_id filtered to specific event types:
SELECT event_id, event_type, event_timestamp, event_outcome,
resource_type, resource_id, actor_type, actor_id,
event_data, namespace, cluster_name, duration_ms
FROM audit_events
WHERE correlation_id = $1
AND event_type IN (
'gateway.signal.received',
'aianalysis.analysis.completed',
'workflowexecution.selection.completed',
'workflowexecution.execution.started',
'orchestrator.lifecycle.created'
)
ORDER BY event_timestamp ASC, event_id ASC
Source Event Mapping¶
| Reconstructed Field | Source Event | Payload Field |
|---|---|---|
spec.signalName, signalType, signalLabels |
gateway.signal.received |
GatewayAuditPayload |
spec.originalPayload |
gateway.signal.received |
GatewayAuditPayload |
spec.signalAnnotations |
gateway.signal.received |
GatewayAuditPayload |
status.selectedWorkflowRef |
workflowexecution.selection.completed |
WorkflowExecutionAuditPayload |
status.executionRef |
workflowexecution.execution.started |
WorkflowExecutionAuditPayload |
status.timeoutConfig |
orchestrator.lifecycle.created |
RemediationOrchestratorAuditPayload |
Events are ordered by timestamp and mapped into typed payloads (GatewayAuditPayload, RemediationOrchestratorAuditPayload, AIAnalysisAuditPayload, WorkflowExecutionAuditPayload) to rebuild the RR.
Limitations¶
- Reconstruction is available for RemediationRequest CRDs only (other CRD types planned)
status.errorandOverallPhaseare not reconstructed from the current event schema
Valkey (DLQ)¶
Valkey serves as a dead-letter queue for audit event resilience:
Streams¶
| Stream | Purpose | Max Length |
|---|---|---|
audit:dlq:events |
Failed generic audit batches | 10,000 |
audit:dlq:notifications |
Failed notification audit events | 10,000 |
audit:dead-letter:{type} |
Events that exceeded all retry attempts | 10,000 |
Operations¶
| Operation | Command | Description |
|---|---|---|
| Enqueue | XADD |
Add failed batch to stream |
| Read | XREADGROUP |
Consumer group for reliable delivery |
| Acknowledge | XACK |
Mark message as processed |
| Move to dead letter | XADD to dead-letter stream |
After max retries |
| Drain | DrainWithTimeout |
Graceful shutdown flush |
Message Format¶
{
"type": "audit_event",
"payload": "...",
"timestamp": "2026-03-04T12:00:00Z",
"retry_count": 2,
"last_error": "connection refused"
}
Data Flow Summary¶
graph TD
S[Service] -->|StoreAudit| BS[Buffered Store]
BS -->|batch POST| DS[DataStorage]
DS -->|INSERT| PG[(PostgreSQL)]
DS -->|on failure| RD[(Valkey DLQ)]
RD -->|retry| DS
DS -->|query| PG
PG -->|reconstruct| RR[RemediationRequest]
Next Steps¶
- Audit Pipeline — How events reach DataStorage
- Data Lifecycle — User-facing data lifecycle documentation
- API Reference: DataStorage — REST API endpoints