Audit & Observability

Architecture reference

For the buffered audit store, flush triggers, and dead-letter queue (DLQ) design, see Architecture: Audit Pipeline.

Kubernaut provides a comprehensive audit trail that records every action taken during remediation. This supports SOC2 Type II alignment, incident review, and continuous improvement.

Audit Architecture

Every Kubernaut service emits structured audit events to DataStorage, which persists them in PostgreSQL.

```mermaid
graph LR
    subgraph Services
        GW[Gateway]
        SP[Signal Processing]
        AA[AI Analysis]
        RO[Orchestrator]
        WE[Workflow Execution]
        NF[Notification]
        EM[Effectiveness Monitor]
        AW[Auth Webhook]
    end

    Services -->|buffered batch| DS[DataStorage]
    DS --> PG[(PostgreSQL<br/>audit_events)]
```

Audit Pipeline Design

Buffered and batched — Events are queued in-memory and sent in batches to DataStorage, minimizing overhead
Fire-and-forget — Audit failures never block remediation; events are retried transparently
Configurable batching — Buffer size, batch size, and flush interval are tunable per service
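
The three properties above can be illustrated with a minimal sketch. This is not Kubernaut's actual audit client: the class name `BufferedAuditEmitter`, the `send_batch` callback, and the drop-oldest overflow policy are all assumptions made for illustration, and the time-based flush interval is left out (a timer would simply call `flush()` periodically).

```python
from collections import deque
from typing import Callable, Deque, Dict, List

Event = Dict[str, object]


class BufferedAuditEmitter:
    """Hypothetical in-memory audit buffer; not Kubernaut's real client."""

    def __init__(self, send_batch: Callable[[List[Event]], None],
                 buffer_size: int = 1000, batch_size: int = 100) -> None:
        self._send_batch = send_batch
        self._batch_size = batch_size
        # Assumed overflow policy: a bounded deque drops the oldest event
        # when full, so emit() never blocks the caller.
        self._buffer: Deque[Event] = deque(maxlen=buffer_size)

    def emit(self, event: Event) -> None:
        """Queue an event and flush once a full batch is available."""
        self._buffer.append(event)
        if len(self._buffer) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        """Send up to one batch; on failure, requeue it for a later retry."""
        count = min(self._batch_size, len(self._buffer))
        batch = [self._buffer.popleft() for _ in range(count)]
        if not batch:
            return
        try:
            self._send_batch(batch)
        except Exception:
            # Fire-and-forget: swallow the error and restore the batch
            # at the front of the buffer in its original order.
            self._buffer.extendleft(reversed(batch))
```

A failed send never raises into the caller; the events stay buffered and are retried on the next flush.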

What Gets Audited

Every stage of the remediation lifecycle emits audit events:

| Service | Event Types | Examples |
|---|---|---|
| Gateway | Signal received, scope validated | `gateway.signal.received` |
| Signal Processing | Enrichment completed, classification results | `signalprocessing.enrichment.completed` |
| AI Analysis | Investigation submitted, analysis completed/failed, Rego evaluation, approval decision | `aianalysis.analysis.completed`, `aianalysis.rego.evaluation`, `aianalysis.approval.decision` |
| Kubernaut Agent | Enrichment and investigation; LLM completion and max-token handling (v1.3) | `aiagent.enrichment.completed`, `aiagent.enrichment.failed`, `aiagent.response.complete`, `truncation_detected` |
| Orchestrator | Lifecycle transitions, child CRD creation, routing blocks | `orchestrator.lifecycle.created`, `orchestrator.lifecycle.transitioned`, `orchestrator.routing.blocked` |
| Workflow Execution | Workflow selected, execution started/completed, block clearance | `workflowexecution.selection.completed`, `workflowexecution.execution.started`, `workflowexecution.block.cleared` |
| Notification | Message sent, delivery failure, acknowledgement, escalation | `notification.message.sent`, `notification.message.failed`, `notification.message.acknowledged`, `notification.message.escalated` |
| Effectiveness Monitor | Component assessments (health, hash, alerts, metrics), scheduling, completion | `effectiveness.health.assessed`, `effectiveness.hash.computed`, `effectiveness.alert.assessed`, `effectiveness.metrics.assessed`, `effectiveness.assessment.scheduled`, `effectiveness.assessment.completed` |
| Auth Webhook | Operator approval decisions, notification actions, timeout modifications, RemediationWorkflow and ActionType CRD lifecycle | `webhook.remediationapprovalrequest.decided`, `webhook.notification.cancelled`, `webhook.remediationrequest.timeout_modified`, `remediationworkflow.admitted.create`, `remediationworkflow.admitted.delete`, `remediationworkflow.admitted.denied`, `actiontype.admitted.create`, `actiontype.admitted.update`, `actiontype.admitted.delete`, `actiontype.denied.*` |
| DataStorage | Workflow catalog operations, action type taxonomy, workflow discovery | `datastorage.workflow.created`, `datastorage.workflow.updated`, `datastorage.actiontype.created`, `datastorage.actiontype.updated`, `datastorage.actiontype.disabled`, `datastorage.actiontype.reenabled`, `datastorage.actiontype.disable_denied`, `workflow.catalog.actions_listed`, `workflow.catalog.workflows_listed`, `workflow.catalog.workflow_retrieved`, `workflow.catalog.selection_validated` |
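
Because event types are dot-separated and prefixed by their emitting component, a shell-style pattern is often enough to select one family of events on the client side. The sketch below assumes you already hold a list of event type strings; the helper name `filter_event_types` is hypothetical and not part of any Kubernaut API.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def filter_event_types(event_types: Iterable[str], pattern: str) -> List[str]:
    """Return the event types matching a shell-style pattern, e.g. 'aianalysis.*'."""
    return [t for t in event_types if fnmatchcase(t, pattern)]


# Event type names taken from the table above.
observed = [
    "gateway.signal.received",
    "aianalysis.analysis.completed",
    "aianalysis.rego.evaluation",
    "notification.message.sent",
]

ai_events = filter_event_types(observed, "aianalysis.*")
```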

Audit Event Structure

Each event contains core fields: event_id, event_timestamp, event_type, event_category, event_action, event_outcome, actor_type/actor_id, resource_type/resource_id, correlation_id, namespace, and event_data. See Architecture: Audit Pipeline for the complete event structure and field definitions.
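
As a rough mental model, the core fields can be pictured as a flat record. The field names below come from the list above; the Python types, the defaults, and every example value (including `rr-example`) are assumptions for illustration, not the authoritative schema.

```python
from dataclasses import dataclass, field, fields
from datetime import datetime, timezone
from typing import Any, Dict
import uuid


@dataclass
class AuditEvent:
    """Illustrative shape only; types and defaults are assumed."""
    event_type: str
    event_category: str
    event_action: str
    event_outcome: str
    actor_type: str
    actor_id: str
    resource_type: str
    resource_id: str
    correlation_id: str
    namespace: str
    event_data: Dict[str, Any] = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    event_timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc))


CORE_FIELDS = [f.name for f in fields(AuditEvent)]

# Hypothetical values; real category/action/outcome vocabularies may differ.
example = AuditEvent(
    event_type="gateway.signal.received",
    event_category="gateway",
    event_action="received",
    event_outcome="success",
    actor_type="service",
    actor_id="gateway",
    resource_type="RemediationRequest",
    resource_id="rr-example",
    correlation_id="rr-example",
    namespace="default",
)
```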

LLM token usage

Token usage is recorded at two levels in the Kubernaut Agent investigation pipeline:

| Event type | Scope | Payload fields |
|---|---|---|
| `aiagent.llm.response` | Per-turn (each LLM call) | `prompt_tokens`, `completion_tokens`, `total_tokens` |
| `aiagent.response.complete` | Cumulative (full investigation) | `total_prompt_tokens`, `total_completion_tokens`, `total_tokens` |

Cumulative totals are tracked by an internal TokenAccumulator and emitted on investigation completion.
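
The relationship between the two levels can be sketched as a running sum. The real TokenAccumulator is internal and its interface is not documented here, so the method names `add_turn` and `payload` below are assumptions; only the payload field names come from the table above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class TokenAccumulator:
    """Illustrative stand-in for the internal accumulator (interface assumed)."""
    total_prompt_tokens: int = 0
    total_completion_tokens: int = 0

    def add_turn(self, usage: Dict[str, int]) -> None:
        """Fold one per-turn usage record into the running totals."""
        self.total_prompt_tokens += usage.get("prompt_tokens", 0)
        self.total_completion_tokens += usage.get("completion_tokens", 0)

    def payload(self) -> Dict[str, int]:
        """Build the cumulative fields emitted when the investigation completes."""
        return {
            "total_prompt_tokens": self.total_prompt_tokens,
            "total_completion_tokens": self.total_completion_tokens,
            "total_tokens": self.total_prompt_tokens + self.total_completion_tokens,
        }


acc = TokenAccumulator()
acc.add_turn({"prompt_tokens": 1200, "completion_tokens": 300, "total_tokens": 1500})
acc.add_turn({"prompt_tokens": 1800, "completion_tokens": 450, "total_tokens": 2250})
cumulative = acc.payload()
```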

Token fields are NOT on aianalysis.* events

The aianalysis.analysis.completed event (emitted by the AIAnalysis controller) does not include token fields. Token usage is exclusively on the aiagent.* events from the Kubernaut Agent investigator pipeline. When querying for token cost analysis, filter to aiagent.llm.response (per-turn) or aiagent.response.complete (cumulative).
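
One practical consequence: when totalling tokens per remediation, use only the cumulative event, since adding per-turn events on top would double-count. The sketch below assumes events are available as dictionaries and that the payload fields live under `event_data`; both are assumptions about how you export the data, and `tokens_per_remediation` is a hypothetical helper.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable

CUMULATIVE_EVENT = "aiagent.response.complete"


def tokens_per_remediation(events: Iterable[Dict[str, Any]]) -> Dict[str, int]:
    """Sum total_tokens per correlation_id using cumulative events only."""
    totals: Dict[str, int] = defaultdict(int)
    for event in events:
        if event.get("event_type") != CUMULATIVE_EVENT:
            continue  # skips per-turn aiagent.llm.response and aianalysis.* events
        data = event.get("event_data", {})
        totals[event["correlation_id"]] += data.get("total_tokens", 0)
    return dict(totals)


sample_events = [
    {"event_type": "aiagent.llm.response", "correlation_id": "rr-example",
     "event_data": {"prompt_tokens": 1200, "completion_tokens": 300, "total_tokens": 1500}},
    {"event_type": "aiagent.response.complete", "correlation_id": "rr-example",
     "event_data": {"total_prompt_tokens": 1200, "total_completion_tokens": 300,
                    "total_tokens": 1500}},
    {"event_type": "aianalysis.analysis.completed", "correlation_id": "rr-example",
     "event_data": {}},
]

usage_by_remediation = tokens_per_remediation(sample_events)
```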

`finish_reason` and truncation (v1.3)

Kubernaut Agent investigation response events can include finish_reason, taken from the provider completion. This lets you see in your audit store whether a response ended with length (the max-token limit was reached), stop, tool_calls, or another reason.

A truncation_detected event is emitted when truncation triggers the token escalation path. Its event_data can include escalated_max_tokens: true when a second attempt runs with a higher max-token cap (capped, for example, at 16,384; see Investigation Pipeline: LLM output resilience).
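
The escalation decision can be sketched as follows. The 16,384 ceiling is the example figure from the text; the doubling rule and the function name `escalated_max_tokens` are assumptions, since the actual growth policy is described in Investigation Pipeline: LLM output resilience rather than here.

```python
from typing import Optional

MAX_TOKEN_CEILING = 16_384  # example cap quoted in the text


def escalated_max_tokens(finish_reason: str, current_max_tokens: int,
                         ceiling: int = MAX_TOKEN_CEILING) -> Optional[int]:
    """Return a higher cap for a retry after truncation, or None if no retry applies."""
    if finish_reason != "length":
        return None  # the response was not truncated by the token limit
    if current_max_tokens >= ceiling:
        return None  # already at the ceiling; nothing to escalate to
    # Assumed policy: double the cap, but never exceed the ceiling.
    return min(current_max_tokens * 2, ceiling)
```

For example, a truncated response at a 4,096-token cap would be retried at 8,192 under this assumed policy, while one at 12,000 would be retried at the 16,384 ceiling.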

Tool call attribution

For tool-invocation audit events, tool_name is populated with the actual tool name. In v1.1 this field was incorrectly recorded as unknown in some paths; v1.2 records it correctly.

Operator Attribution

The Auth Webhook captures human actions through Kubernetes admission control:

Approval decisions — Who approved or rejected a RemediationApprovalRequest
Block clearance — Who cleared a workflow execution block
Timeout modifications — Who changed a RemediationRequest's timeout configuration
Notification cancellation — Who deleted a NotificationRequest
Workflow registration — Who registered, deleted, or was denied a RemediationWorkflow CRD
Action type management — Who created, updated, deleted, or was denied an ActionType CRD

This ensures that every human action in the system has a recorded identity, timestamp, and context — critical for SOC2 readiness.

Retention

Audit events are stored with a configured retention of 2,555 days (7 years), supporting long-term compliance requirements.

Retention Enforcement

The retention period is recorded per event, but automatic deletion of expired events is not yet implemented, so events currently accumulate indefinitely. Retention enforcement is tracked in kubernaut#485 (v1.3). Customers are responsible for configuring retention policies based on their local regulatory requirements.

The audit_events table is partitioned by month for efficient storage and querying. Individual events can be flagged as is_sensitive for PII handling.
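
Until enforcement lands, you may want to work out which events fall outside the retention window yourself. The sketch below derives the expiry date from the 2,555-day figure and maps a timestamp to a monthly partition; the `audit_events_YYYY_MM` naming scheme is purely hypothetical, so check the actual schema before relying on it.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 2555  # configured retention from the text (about 7 years)


def retention_expiry(event_timestamp: datetime) -> datetime:
    """Return the moment at which an event leaves the retention window."""
    return event_timestamp + timedelta(days=RETENTION_DAYS)


def is_expired(event_timestamp: datetime, now: datetime) -> bool:
    """True when the event is older than the configured retention."""
    return now >= retention_expiry(event_timestamp)


def monthly_partition_name(event_timestamp: datetime) -> str:
    """Map a timestamp to a month partition (naming scheme is an assumption)."""
    return f"audit_events_{event_timestamp:%Y_%m}"


recorded = datetime(2024, 1, 15, tzinfo=timezone.utc)
expires = retention_expiry(recorded)
partition = monthly_partition_name(recorded)
```

Note that 2,555 days is exactly 7 × 365, so leap days shift the expiry slightly earlier than the seven-year calendar anniversary.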

Correlation

All audit events for a single remediation share the same correlation_id (the RemediationRequest name). This enables:

Querying the complete history of a remediation across all services
Reconstructing the full CRD from audit data (see Data Lifecycle)
Incident timeline reconstruction for post-mortems
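
Timeline reconstruction amounts to filtering on correlation_id and ordering by event_timestamp. The sketch below assumes events have been exported as dictionaries with ISO 8601 timestamp strings; the export format, the helper name `remediation_timeline`, and the `rr-example` identifiers are assumptions.

```python
from datetime import datetime
from typing import Any, Dict, Iterable, List

Event = Dict[str, Any]


def remediation_timeline(events: Iterable[Event], correlation_id: str) -> List[Event]:
    """Return one remediation's events across all services, oldest first."""
    matching = [e for e in events if e.get("correlation_id") == correlation_id]
    return sorted(matching,
                  key=lambda e: datetime.fromisoformat(e["event_timestamp"]))


exported = [
    {"event_type": "orchestrator.lifecycle.created",
     "correlation_id": "rr-example",
     "event_timestamp": "2025-01-01T10:00:05+00:00"},
    {"event_type": "gateway.signal.received",
     "correlation_id": "rr-example",
     "event_timestamp": "2025-01-01T10:00:00+00:00"},
    {"event_type": "gateway.signal.received",
     "correlation_id": "rr-other",
     "event_timestamp": "2025-01-01T10:00:01+00:00"},
]

timeline = [e["event_type"] for e in remediation_timeline(exported, "rr-example")]
```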

Metrics

Every service exposes Prometheus metrics on :9090/metrics. In total, Kubernaut exposes roughly 115 custom metrics covering signal ingestion, classification, orchestration, execution, notification, effectiveness, audit, and LLM usage.

Key metric categories:

Throughput -- Signals received, remediations completed, notifications delivered
Latency -- Per-phase processing duration, LLM call latency, delivery duration
Errors -- Failure rates, retry counts, circuit breaker states
Audit health -- Buffer utilization, DLQ depth, write latency
LLM cost -- Token consumption by provider and model

See Monitoring: Prometheus Metrics Reference for the complete per-service metrics inventory with metric names, types, labels, and example PromQL queries for Grafana dashboards.
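
For quick ad hoc checks outside Grafana, the text returned by a /metrics endpoint can be parsed directly. The sketch below is a deliberately simplified reader for the Prometheus text format (it ignores timestamps and assumes no spaces inside label values), and the two `example_audit_*` metric names are invented placeholders, not real Kubernaut metrics; consult the metrics reference for actual names.

```python
from typing import Dict


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse Prometheus text exposition into {series: value} (simplified)."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


# Hypothetical scrape output; metric names are placeholders.
SAMPLE = """\
# HELP example_audit_buffer_size Events currently buffered.
# TYPE example_audit_buffer_size gauge
example_audit_buffer_size 250
example_audit_buffer_capacity 1000
"""

metrics = parse_metrics(SAMPLE)
buffer_utilization = (metrics["example_audit_buffer_size"]
                      / metrics["example_audit_buffer_capacity"])
```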

Next Steps

Data Lifecycle — CRD retention and reconstruction from audit data
Monitoring — Prometheus metrics and dashboards
Architecture: Audit Pipeline — Deep-dive into the audit system design