Audit & Observability
Architecture reference
For the buffered audit store, flush triggers, and DLQ design, see Architecture: Audit Pipeline.
Kubernaut provides a comprehensive audit trail that records every action taken during remediation. This supports SOC2 Type II alignment, incident review, and continuous improvement.
Audit Architecture
Every Kubernaut service emits structured audit events to DataStorage, which persists them in PostgreSQL.
graph LR
subgraph Services
GW[Gateway]
SP[Signal Processing]
AA[AI Analysis]
RO[Orchestrator]
WE[Workflow Execution]
NF[Notification]
EM[Effectiveness Monitor]
AW[Auth Webhook]
end
Services -->|buffered batch| DS[DataStorage]
DS --> PG[(PostgreSQL<br/>audit_events)]
Audit Pipeline Design
- Buffered and batched — Events are queued in-memory and sent in batches to DataStorage, minimizing overhead
- Fire-and-forget — Audit failures never block remediation; events are retried transparently
- Configurable batching — Buffer size, batch size, and flush interval are tunable per service
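The three properties above can be illustrated with a minimal sketch. This is not Kubernaut's actual client: the class name `BufferedAuditClient`, the `send_batch` callback, the default sizes, and the drop-when-full behavior are all assumptions made for illustration. A real service would also call `flush()` from a timer to honor the flush interval.

```python
from collections import deque
from typing import Any, Callable, Deque, Dict, List

Event = Dict[str, Any]


class BufferedAuditClient:
    """Hypothetical in-memory audit buffer (illustration only, not thread-safe)."""

    def __init__(self, send_batch: Callable[[List[Event]], None],
                 buffer_size: int = 1000, batch_size: int = 100) -> None:
        self._send_batch = send_batch      # assumed transport to DataStorage
        self._buffer: Deque[Event] = deque()
        self._buffer_size = buffer_size
        self._batch_size = batch_size
        self.dropped = 0                   # events rejected because the buffer was full

    @property
    def pending(self) -> int:
        return len(self._buffer)

    def emit(self, event: Event) -> bool:
        """Queue an event without doing any I/O, so the caller is never blocked."""
        if len(self._buffer) >= self._buffer_size:
            self.dropped += 1              # assumption: drop rather than block
            return False
        self._buffer.append(event)
        return True

    def flush(self) -> int:
        """Send queued events in batches; return how many were delivered."""
        sent = 0
        while self._buffer:
            size = min(self._batch_size, len(self._buffer))
            batch = [self._buffer[i] for i in range(size)]
            try:
                self._send_batch(batch)
            except Exception:
                # Swallow the failure: the batch stays queued for the next flush.
                break
            for _ in range(size):
                self._buffer.popleft()
            sent += size
        return sent


received: List[List[Event]] = []
client = BufferedAuditClient(received.append, buffer_size=5, batch_size=2)
for n in range(6):
    client.emit({"event_type": "gateway.signal.received", "n": n})
delivered = client.flush()
```

In this sketch a failing `send_batch` never raises into the caller; the events simply remain pending until a later flush succeeds.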
What Gets Audited
Every stage of the remediation lifecycle emits audit events:
| Service | Event Types | Examples |
|---|---|---|
| Gateway | Signal received, scope validated | gateway.signal.received |
| Signal Processing | Enrichment completed, classification results | signalprocessing.enrichment.completed |
| AI Analysis | Investigation submitted, analysis completed/failed, Rego evaluation, approval decision | aianalysis.analysis.completed, aianalysis.rego.evaluation, aianalysis.approval.decision |
| Kubernaut Agent | Enrichment and investigation; LLM completion and max-token handling (v1.3) | aiagent.enrichment.completed, aiagent.enrichment.failed, aiagent.response.complete, truncation_detected |
| Orchestrator | Lifecycle transitions, child CRD creation, routing blocks | orchestrator.lifecycle.created, orchestrator.lifecycle.transitioned, orchestrator.routing.blocked |
| Workflow Execution | Workflow selected, execution started/completed, block clearance | workflowexecution.selection.completed, workflowexecution.execution.started, workflowexecution.block.cleared |
| Notification | Message sent, delivery failure, acknowledgement, escalation | notification.message.sent, notification.message.failed, notification.message.acknowledged, notification.message.escalated |
| Effectiveness Monitor | Component assessments (health, hash, alerts, metrics), scheduling, completion | effectiveness.health.assessed, effectiveness.hash.computed, effectiveness.alert.assessed, effectiveness.metrics.assessed, effectiveness.assessment.scheduled, effectiveness.assessment.completed |
| Auth Webhook | Operator approval decisions, notification actions, timeout modifications, RemediationWorkflow and ActionType CRD lifecycle | webhook.remediationapprovalrequest.decided, webhook.notification.cancelled, webhook.remediationrequest.timeout_modified, remediationworkflow.admitted.create, remediationworkflow.admitted.delete, remediationworkflow.admitted.denied, actiontype.admitted.create, actiontype.admitted.update, actiontype.admitted.delete, actiontype.denied.* |
| DataStorage | Workflow catalog operations, action type taxonomy, workflow discovery | datastorage.workflow.created, datastorage.workflow.updated, datastorage.actiontype.created, datastorage.actiontype.updated, datastorage.actiontype.disabled, datastorage.actiontype.reenabled, datastorage.actiontype.disable_denied, workflow.catalog.actions_listed, workflow.catalog.workflows_listed, workflow.catalog.workflow_retrieved, workflow.catalog.selection_validated |
Audit Event Structure
Each event contains core fields: event_id, event_timestamp, event_type, event_category, event_action, event_outcome, actor_type/actor_id, resource_type/resource_id, correlation_id, namespace, and event_data. See Architecture: Audit Pipeline for the complete event structure and field definitions.
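As a rough illustration of the core fields listed above, the following sketch models an event as a Python dataclass. The field names come from the text; the Python types, the UUID and ISO-8601 defaults, and every example value except the event type are assumptions, so treat the linked architecture page as the authority on the real schema.

```python
import json
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass
class AuditEvent:
    """Illustrative shape only; types and defaults are assumed."""
    event_type: str
    event_category: str
    event_action: str
    event_outcome: str
    actor_type: str
    actor_id: str
    resource_type: str
    resource_id: str
    correlation_id: str
    namespace: str
    event_data: Dict[str, Any] = field(default_factory=dict)
    # Assumed formats: a random UUID and a UTC ISO-8601 timestamp.
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


# All values below except event_type are hypothetical.
example = AuditEvent(
    event_type="gateway.signal.received",
    event_category="gateway",
    event_action="received",
    event_outcome="success",
    actor_type="service",
    actor_id="gateway",
    resource_type="RemediationRequest",
    resource_id="rr-example-001",
    correlation_id="rr-example-001",
    namespace="default",
)
payload = json.loads(example.to_json())
```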
LLM token usage
Token usage is recorded at two levels in the Kubernaut Agent investigation pipeline:
| Event type | Scope | Payload fields |
|---|---|---|
| aiagent.llm.response | Per-turn (each LLM call) | prompt_tokens, completion_tokens, total_tokens |
| aiagent.response.complete | Cumulative (full investigation) | total_prompt_tokens, total_completion_tokens, total_tokens |
Cumulative totals are tracked by an internal TokenAccumulator and emitted on investigation completion.
Token fields are NOT on aianalysis.* events
The aianalysis.analysis.completed event (emitted by the AIAnalysis controller) does not include token fields. Token usage is exclusively on the aiagent.* events from the Kubernaut Agent investigator pipeline. When querying for token cost analysis, filter to aiagent.llm.response (per-turn) or aiagent.response.complete (cumulative).
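A token cost query along these lines might look like the sketch below. It assumes events have already been fetched as dictionaries and that the payload fields sit under `event_data`; both are assumptions, as are the sample numbers. It reads the two levels separately, because adding per-turn and cumulative values together would count the same tokens twice.

```python
from typing import Any, Dict, List, Mapping

Event = Mapping[str, Any]

PER_TURN = "aiagent.llm.response"
CUMULATIVE = "aiagent.response.complete"


def sum_per_turn_tokens(events: List[Event]) -> Dict[str, int]:
    """Add up the per-turn fields across all aiagent.llm.response events."""
    totals = {"prompt_tokens": 0, "completion_tokens": 0, "total_tokens": 0}
    for event in events:
        if event.get("event_type") != PER_TURN:
            continue  # skips aianalysis.* and everything else
        data = event.get("event_data") or {}
        for key in totals:
            totals[key] += int(data.get(key, 0))
    return totals


def cumulative_tokens(events: List[Event]) -> Dict[str, int]:
    """Return the totals carried by the last aiagent.response.complete event."""
    keys = ("total_prompt_tokens", "total_completion_tokens", "total_tokens")
    result: Dict[str, int] = {}
    for event in events:
        if event.get("event_type") == CUMULATIVE:
            data = event.get("event_data") or {}
            result = {k: int(data[k]) for k in keys if k in data}
    return result


# Hypothetical sample for one investigation with two LLM turns.
sample_events: List[Event] = [
    {"event_type": PER_TURN,
     "event_data": {"prompt_tokens": 100, "completion_tokens": 20, "total_tokens": 120}},
    {"event_type": PER_TURN,
     "event_data": {"prompt_tokens": 150, "completion_tokens": 30, "total_tokens": 180}},
    {"event_type": CUMULATIVE,
     "event_data": {"total_prompt_tokens": 250, "total_completion_tokens": 50,
                    "total_tokens": 300}},
    {"event_type": "aianalysis.analysis.completed", "event_data": {}},
]
per_turn = sum_per_turn_tokens(sample_events)
cumulative = cumulative_tokens(sample_events)
```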
finish_reason and truncation (v1.3)
Kubernaut Agent investigation response events can include finish_reason, taken from the provider completion, so you can see whether a response ended with length (max tokens), stop, tool_calls, and so on in your audit store.
A truncation_detected event is emitted when truncation triggers the token escalation path. Its event_data can include escalated_max_tokens: true when a second attempt runs with a higher max-token cap (capped, for example, at 16,384; see Investigation Pipeline: LLM output resilience).
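To find truncated responses in exported audit data, a filter like the following sketch could be used. It assumes that finish_reason is stored under `event_data` and that the truncation event's type ends in `truncation_detected`; neither detail is confirmed by this page, and the sample events are made up.

```python
from typing import Any, List, Mapping

Event = Mapping[str, Any]


def truncated_responses(events: List[Event]) -> List[Event]:
    """Events whose finish_reason is "length", i.e. the max-token cap was hit."""
    return [
        e for e in events
        if (e.get("event_data") or {}).get("finish_reason") == "length"
    ]


def escalated_truncations(events: List[Event]) -> List[Event]:
    """truncation_detected events where a retry ran with a higher cap."""
    return [
        e for e in events
        if str(e.get("event_type", "")).endswith("truncation_detected")
        and (e.get("event_data") or {}).get("escalated_max_tokens") is True
    ]


# Hypothetical sample events.
sample = [
    {"event_type": "aiagent.response.complete", "event_data": {"finish_reason": "stop"}},
    {"event_type": "aiagent.response.complete", "event_data": {"finish_reason": "length"}},
    {"event_type": "truncation_detected", "event_data": {"escalated_max_tokens": True}},
    {"event_type": "truncation_detected", "event_data": {}},
]
truncated = truncated_responses(sample)
escalated = escalated_truncations(sample)
```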
Tool call attribution
For tool-invocation audit events, tool_name is populated with the actual tool name. In v1.1 this field was incorrectly recorded as unknown in some paths; v1.2 records it correctly.
Operator Attribution
The Auth Webhook captures human actions through Kubernetes admission control:
- Approval decisions — Who approved or rejected a RemediationApprovalRequest
- Block clearance — Who cleared a workflow execution block
- Timeout modifications — Who changed a RemediationRequest's timeout configuration
- Notification cancellation — Who deleted a NotificationRequest
- Workflow registration — Who registered, deleted, or was denied a RemediationWorkflow CRD
- Action type management — Who created, updated, deleted, or was denied an ActionType CRD
This ensures that every human action in the system has a recorded identity, timestamp, and context — critical for SOC2 readiness.
Retention
Audit events are stored with a configured retention of 2,555 days (7 years), supporting long-term compliance requirements.
Retention Enforcement
The retention period is recorded per event, but automatic deletion of expired events is not yet implemented, so events currently accumulate indefinitely. Retention enforcement is tracked in kubernaut#485 (v1.3). Customers are responsible for configuring retention policies based on their local regulatory requirements.
The audit_events table is partitioned by month for efficient storage and querying. Individual events can be flagged as is_sensitive for PII handling.
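Because expired events are not deleted automatically, an operator-side job would need to work out expiry itself. The sketch below shows only the date arithmetic; the function names are hypothetical and it assumes timezone-aware event timestamps. Note that 2,555 days is exactly 7 × 365, so a window that spans leap years ends a day or two short of seven calendar years.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 2555  # configured retention from the text (7 x 365)


def retention_expiry(event_timestamp: datetime,
                     retention_days: int = RETENTION_DAYS) -> datetime:
    """Moment after which the event falls outside the retention window."""
    return event_timestamp + timedelta(days=retention_days)


def is_expired(event_timestamp: datetime, now: datetime,
               retention_days: int = RETENTION_DAYS) -> bool:
    return now >= retention_expiry(event_timestamp, retention_days)


recorded = datetime(2024, 1, 1, tzinfo=timezone.utc)
expiry = retention_expiry(recorded)  # 2030-12-30: two leap days fall in the window
```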
Correlation
All audit events for a single remediation share the same correlation_id (the RemediationRequest name). This enables:
- Querying the complete history of a remediation across all services
- Reconstructing the full CRD from audit data (see Data Lifecycle)
- Incident timeline reconstruction for post-mortems
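Timeline reconstruction can be sketched as a filter on correlation_id followed by a sort on event_timestamp. The example assumes events are available as dictionaries with ISO-8601 timestamps that include a UTC offset; the RemediationRequest name and the timestamps are hypothetical.

```python
from datetime import datetime
from typing import Any, List, Mapping

Event = Mapping[str, Any]


def remediation_timeline(events: List[Event], correlation_id: str) -> List[Event]:
    """All events for one remediation, oldest first, across every service."""
    selected = [e for e in events if e.get("correlation_id") == correlation_id]
    return sorted(
        selected,
        key=lambda e: datetime.fromisoformat(e["event_timestamp"]),
    )


# Hypothetical, deliberately out-of-order sample.
sample_events = [
    {"event_type": "workflowexecution.execution.started",
     "correlation_id": "rr-example-001",
     "event_timestamp": "2025-01-01T10:05:00+00:00"},
    {"event_type": "gateway.signal.received",
     "correlation_id": "rr-example-001",
     "event_timestamp": "2025-01-01T10:00:00+00:00"},
    {"event_type": "gateway.signal.received",
     "correlation_id": "rr-example-002",
     "event_timestamp": "2025-01-01T10:01:00+00:00"},
    {"event_type": "aianalysis.analysis.completed",
     "correlation_id": "rr-example-001",
     "event_timestamp": "2025-01-01T10:02:30+00:00"},
]
timeline = [e["event_type"] for e in remediation_timeline(sample_events, "rr-example-001")]
```

Parsing the timestamps rather than comparing raw strings keeps the ordering correct even when events carry different UTC offsets.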
Metrics
All services expose Prometheus metrics on :9090/metrics. Kubernaut exposes ~115 custom metrics across all services covering signal ingestion, classification, orchestration, execution, notification, effectiveness, audit, and LLM usage.
Key metric categories:
- Throughput -- Signals received, remediations completed, notifications delivered
- Latency -- Per-phase processing duration, LLM call latency, delivery duration
- Errors -- Failure rates, retry counts, circuit breaker states
- Audit health -- Buffer utilization, DLQ depth, write latency
- LLM cost -- Token consumption by provider and model
See Monitoring: Prometheus Metrics Reference for the complete per-service metrics inventory with metric names, types, labels, and example PromQL queries for Grafana dashboards.
Next Steps
- Data Lifecycle — CRD retention and reconstruction from audit data
- Monitoring — Prometheus metrics and dashboards
- Architecture: Audit Pipeline — Deep-dive into the audit system design