Monitoring
All Kubernaut services expose Prometheus-compatible metrics and standard health check endpoints. This page provides a complete metrics reference for building Grafana dashboards and alerting rules.
In v1.3, the Kubernaut Agent metrics were renamed from the legacy holmesgpt_* namespace to aiagent_api_*. Effectiveness Monitor metrics remain stable from v1.2. Notification metrics were refactored internally (DD-METRICS-001: 3-layer → 1-layer) but the metric names and semantics are unchanged.
Health Checks
v1.3+ (three-port components): Gateway, DataStorage, Kubernaut Agent, and the AIAnalysis controller split traffic by port: 8080 (primary API; HTTPS when inter-service TLS is enabled), 8081 (health only -- plain HTTP), and 9090 (/metrics -- plain HTTP). Probes use 8081 with GET /healthz (liveness) and GET /readyz (readiness). /livez is not a registered path (do not use it in probes or docs).
| Service Type | Liveness | Readiness | Port | Notes |
|---|---|---|---|---|
| Go CRD controllers (RO, SP, WFE, NT, EM) | GET /healthz | GET /readyz | 8081 (plain HTTP) | Metrics on 9090 only -- no 8080 API |
| AIAnalysis | GET /healthz | GET /readyz | 8081 (plain HTTP) | Three-port: API 8080, metrics 9090 |
| Gateway | GET /healthz | GET /readyz | 8081 (plain HTTP) | Ingestion API on 8080 |
| DataStorage | GET /healthz | GET /readyz | 8081 (plain HTTP) | Readiness checks PostgreSQL; REST API on 8080 |
| Kubernaut Agent | GET /healthz | GET /readyz | 8081 (plain HTTP) | Readiness includes LLM connectivity |
| Auth Webhook | GET /healthz | GET /readyz | 8081 (plain HTTP) | Service 443 → targetPort 9443 for admission |
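The probe layout in the table can be exercised by hand when debugging a pod. The following is a minimal sketch, not part of Kubernaut: the helper names `probe_urls` and `check` are hypothetical, and it assumes the pod's port 8081 is reachable from where the script runs (for example through a port-forward).

```python
import urllib.error
import urllib.request

HEALTH_PORT = 8081  # health-only port, plain HTTP (per the table above)
PROBE_PATHS = {"liveness": "/healthz", "readiness": "/readyz"}  # /livez is not registered


def probe_urls(host: str) -> dict[str, str]:
    """Build the liveness and readiness URLs for one host."""
    return {name: f"http://{host}:{HEALTH_PORT}{path}" for name, path in PROBE_PATHS.items()}


def check(host: str, timeout: float = 2.0) -> dict[str, bool]:
    """Return, per probe, whether the endpoint answered with a 2xx/3xx status."""
    results = {}
    for name, url in probe_urls(host).items():
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[name] = 200 <= resp.status < 400
        except (urllib.error.URLError, OSError):
            # Covers HTTP errors (e.g. 503 while not ready), refused connections, timeouts.
            results[name] = False
    return results
```

Calling `check("127.0.0.1")` returns a dictionary with one boolean per probe. A service that is live but not ready, for example DataStorage without a PostgreSQL connection, would be expected to report `liveness` as true and `readiness` as false.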
Scrape Configuration
All services expose metrics at :9090/metrics in Prometheus exposition format.
# prometheus.yml
scrape_configs:
  - job_name: kubernaut
    kubernetes_sd_configs:
      - role: pod
        namespaces:
          names: [kubernaut-system]
    relabel_configs:
      # Keep only pods annotated with prometheus.io/scrape: "true"
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      # Point every target at the dedicated metrics port. Building the address
      # from the port annotation would yield "<port>:9090", so use the pod IP.
      - source_labels: [__meta_kubernetes_pod_ip]
        action: replace
        target_label: __address__
        regex: (.+)
        replacement: ${1}:9090
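To confirm that a target serves the metrics listed on this page, the text returned by `:9090/metrics` can be inspected offline. The sketch below is illustrative only: `metric_names` is a hypothetical helper, and the parser handles just the plain `name{labels} value` sample lines of the Prometheus text format.

```python
import re

# Matches "name{labels} value" or "name value"; a trailing timestamp is ignored.
_SAMPLE = re.compile(r"^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)")


def metric_names(exposition: str, prefix: str = "") -> set[str]:
    """Return the sample names in Prometheus text-format output, filtered by prefix."""
    names = set()
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blank lines and HELP/TYPE comments
            continue
        match = _SAMPLE.match(line)
        if match and match.group(1).startswith(prefix):
            names.add(match.group(1))
    return names
```

Passing the saved output of a Gateway pod with `prefix="gateway_"` should list the names from the Gateway Metrics table; histograms appear with their `_bucket`, `_sum`, and `_count` suffixes.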
Gateway Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| gateway_signals_received_total | Counter | source_type, severity | Total signals received by source type and severity |
| gateway_signals_deduplicated_total | Counter | signal_name | Signals deduplicated (duplicate fingerprint) |
| gateway_signals_rejected_total | Counter | reason | Signals rejected by scope filtering |
| gateway_crds_created_total | Counter | source_type, status | RemediationRequest CRDs created |
| gateway_crd_creation_errors_total | Counter | error_type | CRD creation errors |
| gateway_http_request_duration_seconds | Histogram | endpoint, method, status | HTTP request duration |
| gateway_circuit_breaker_state | Gauge | name | Circuit breaker state (0=closed, 1=half-open, 2=open) |
Signal Processing Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| signalprocessing_processing_total | Counter | phase, result | Processing operations by phase and result |
| signalprocessing_processing_duration_seconds | Histogram | phase | Processing duration per phase |
| signalprocessing_enrichment_errors_total | Counter | error_type | Enrichment errors (K8s API issues) |
AI Analysis Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| aianalysis_confidence_score_distribution | Histogram | signal_type | LLM confidence score distribution |
| aianalysis_approval_decisions_total | Counter | decision, environment | Approval decisions (auto-approved, approval-required) |
| aianalysis_failures_total | Counter | reason, sub_reason | Analysis failures by reason |
| aianalysis_rego_evaluations_total | Counter | outcome, degraded | Rego policy evaluations |
| kubernaut_alignment_grounding_total | Counter | result | Shadow agent grounding review verdicts (v1.4) |
| kubernaut_alignment_grounding_duration_seconds | Histogram | -- | Grounding review latency (v1.4) |
| alignmentCircuitBreakerTotal | Counter | -- | Investigations cancelled by alignment circuit breaker (v1.4) |
| aiagent_api_llm_circuit_breaker_state | Gauge | -- | LLM HTTP client circuit breaker state (v1.4) |
| aiagent_api_ds_circuit_breaker_state | Gauge | -- | DataStorage HTTP client circuit breaker state (v1.4) |
Remediation Orchestrator Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| kubernaut_remediationorchestrator_phase_transitions_total | Counter | from_phase, to_phase, namespace | Phase transitions -- core throughput and failure metric |
| kubernaut_remediationorchestrator_reconcile_duration_seconds | Histogram | namespace, phase | Reconciliation duration per phase |
| kubernaut_remediationorchestrator_timeouts_total | Counter | phase, namespace | Remediation timeouts by phase |
| kubernaut_remediationorchestrator_blocked_total | Counter | namespace, reason | RRs blocked by routing engine |
| kubernaut_remediationorchestrator_current_blocked | Gauge | namespace | Currently blocked RRs |
| kubernaut_remediationorchestrator_child_crd_creations_total | Counter | child_type, namespace | Child CRD creations by type |
| kubernaut_remediationorchestrator_no_action_needed_total | Counter | reason, namespace | Remediations where no action was needed |
| kubernaut_remediationorchestrator_duplicates_skipped_total | Counter | skip_reason, namespace | Duplicate remediations skipped |
| kubernaut_remediationorchestrator_approval_decisions_total | Counter | decision, namespace | Human approval throughput |
Workflow Execution Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| workflowexecution_reconciler_total | Counter | outcome | Workflow executions by outcome (success/failure) |
| workflowexecution_reconciler_duration_seconds | Histogram | outcome | Execution duration |
Notification Metrics
DD-METRICS-001: Metrics wiring pattern
In v1.3, the Notification controller's metrics were collapsed from a 3-layer stack (interface → recorder → raw metrics) to a single dependency-injected *Metrics struct, matching the pattern mandated by DD-METRICS-001 for all CRD controllers. The metrics struct is injected into both the reconciler and the delivery orchestrator at startup via NewMetrics(). Test isolation uses NewMetricsWithRegistry(prometheus.NewRegistry()) instead of interface mocking.
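The wiring described in this note can be pictured with a language-neutral sketch. The Python below is only an analogy for the Go pattern, under stated assumptions: `Registry` is a stand-in for a Prometheus registry, the class and method names are hypothetical, and metric labels are left out for brevity. It shows one metrics object shared by both consumers, with a fresh registry giving test isolation without mocks.

```python
class Registry:
    """Stand-in for a Prometheus registry: holds metric values by name."""

    def __init__(self) -> None:
        self.values: dict[str, float] = {}

    def add(self, name: str, delta: float) -> None:
        self.values[name] = self.values.get(name, 0.0) + delta


class Metrics:
    """The single metrics object; labels are omitted to keep the sketch short."""

    def __init__(self, registry: Registry) -> None:
        self._registry = registry

    def notification_started(self) -> None:
        self._registry.add("kubernaut_notification_reconciler_active", 1)

    def delivery_attempted(self) -> None:
        self._registry.add("kubernaut_notification_delivery_attempts_total", 1)


class Reconciler:
    def __init__(self, metrics: Metrics) -> None:
        self.metrics = metrics  # injected at construction, no global lookup

    def reconcile(self) -> None:
        self.metrics.notification_started()


class DeliveryOrchestrator:
    def __init__(self, metrics: Metrics) -> None:
        self.metrics = metrics  # same instance the reconciler receives

    def deliver(self) -> None:
        self.metrics.delivery_attempted()


def wire(registry: Registry) -> tuple[Reconciler, DeliveryOrchestrator]:
    """Build one Metrics object and share it, as startup code would."""
    metrics = Metrics(registry)
    return Reconciler(metrics), DeliveryOrchestrator(metrics)
```

A test calls `wire(Registry())` and asserts on that registry's values directly, so nothing leaks between test cases and no interface has to be mocked.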
| Metric | Type | Labels | Description |
|---|---|---|---|
| kubernaut_notification_reconciler_active | Gauge | phase | Active notification backlog by phase (Pending, Sending, Sent, Retrying, PartiallySent, Failed) |
| kubernaut_notification_delivery_attempts_total | Counter | channel, status | Delivery attempts per channel |
| kubernaut_notification_delivery_duration_seconds | Histogram | channel | Delivery duration per channel |
| kubernaut_notification_delivery_retries_total | Counter | channel | Delivery retries per channel |
| kubernaut_notification_channel_circuit_breaker_state | Gauge | channel | Circuit breaker state (0=closed, 1=open, 2=half-open) |
| kubernaut_notification_channel_health_score | Gauge | channel | Channel health score (0–100) |
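The notification circuit breaker gauge uses a different numeric encoding from the Gateway gauge (there, 1 is half-open and 2 is open), which is easy to miss when building a shared dashboard panel. A small lookup like the following sketch keeps the two apart; the `decode_state` helper is hypothetical and the mappings are copied from the tables on this page.

```python
# Encodings as documented in the Gateway and Notification metric tables; they differ.
STATE_NAMES = {
    "gateway_circuit_breaker_state": {0: "closed", 1: "half-open", 2: "open"},
    "kubernaut_notification_channel_circuit_breaker_state": {0: "closed", 1: "open", 2: "half-open"},
}


def decode_state(metric: str, value: float) -> str:
    """Translate a circuit breaker gauge sample into a state name."""
    return STATE_NAMES[metric].get(int(value), f"unknown({value})")
```

With these mappings, a value of 2 decodes to `open` for the Gateway metric and to `half-open` for the notification metric.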
Effectiveness Monitor Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| kubernaut_effectivenessmonitor_component_scores | Histogram | component | Score distribution (0.0--1.0) per component |
| kubernaut_effectivenessmonitor_component_assessments_total | Counter | component, result | Component assessments (health, hash, alert, metrics) |
| kubernaut_effectivenessmonitor_assessments_completed_total | Counter | reason | Assessments completed (Full, Partial, Expired, and other AssessmentReason values) |
| kubernaut_effectivenessmonitor_validity_expirations_total | Counter | -- | Assessments that expired before completion |
| kubernaut_effectivenessmonitor_external_call_errors_total | Counter | service, operation, error_type | Prometheus/AlertManager call errors |
DataStorage Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| datastorage_write_duration_seconds | Histogram | table | Duration of write operations in seconds |
| datastorage_audit_lag_seconds | Histogram | service | Lag between event occurrence and write |
| datastorage_dlq_warning | Gauge | stream | DLQ at 80% capacity (1 = warning) |
| datastorage_dlq_critical | Gauge | stream | DLQ at 90% capacity (1 = critical) |
Kubernaut Agent Metrics
| Metric | Type | Labels | Description |
|---|---|---|---|
| aiagent_api_investigations_total | Counter | status | Investigation requests by outcome |
| aiagent_api_investigations_duration_seconds | Histogram | -- | End-to-end investigation duration |
| aiagent_api_llm_requests_total | Counter | status | LLM API calls by outcome (success, error) |
| aiagent_api_llm_request_duration_seconds | Histogram | -- | LLM request latency |
| aiagent_api_llm_tokens_total | Counter | type | LLM token consumption; use label type to distinguish prompt vs completion tokens (increments on each completed LLM call) |
The Kubernaut Agent exposes prompt and completion token counts as counters via aiagent_api_llm_tokens_total (see LLM Token Cost Tracking for example PromQL).
Audit Pipeline Metrics
These metrics are shared across all Go services via the buffered audit store:
| Metric | Type | Labels | Description |
|---|---|---|---|
| audit_events_dropped_total | Counter | service | Events dropped due to full buffer -- data loss indicator |
Controller-Runtime Built-in Metrics
All Go CRD controllers also expose standard controller-runtime metrics:
| Metric | Type | Description |
|---|---|---|
| controller_runtime_reconcile_total | Counter | Total reconciliations (controller, result labels) |
| controller_runtime_reconcile_errors_total | Counter | Reconciliation errors |
| controller_runtime_reconcile_time_seconds | Histogram | Reconciliation duration |
| workqueue_depth | Gauge | Current work queue depth |
| workqueue_adds_total | Counter | Work queue additions |
| workqueue_queue_duration_seconds | Histogram | Time items spend in queue |
| workqueue_retries_total | Counter | Work queue retries |
Example PromQL Queries
Remediation Throughput
# Remediations completed per minute, summed across namespaces and source phases
sum(rate(kubernaut_remediationorchestrator_phase_transitions_total{to_phase="Completed"}[5m])) * 60
Failure Rate
# Percentage of remediations that fail
sum(rate(kubernaut_remediationorchestrator_phase_transitions_total{to_phase="Failed"}[1h]))
/
sum(rate(kubernaut_remediationorchestrator_phase_transitions_total{to_phase=~"Completed|Failed|TimedOut"}[1h]))
* 100
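LLM Latency (p99)
# 99th-percentile LLM request latency over the last 5 minutes
histogram_quantile(0.99, sum by (le) (rate(aiagent_api_llm_request_duration_seconds_bucket[5m])))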
Signal Deduplication Rate
sum(rate(gateway_signals_deduplicated_total[5m]))
/
(sum(rate(gateway_signals_received_total[5m])) + sum(rate(gateway_signals_deduplicated_total[5m])))
* 100
Audit Pipeline Health
# Events being dropped -- alert if non-zero drop rate
rate(audit_events_dropped_total[5m]) > 0
# DLQ at warning capacity -- alert if any stream is at 80%
datastorage_dlq_warning > 0
Effectiveness Score Distribution
# Median health score across assessments
histogram_quantile(0.5, rate(kubernaut_effectivenessmonitor_component_scores_bucket{component="health"}[1h]))
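Because `histogram_quantile` works on bucket counters rather than raw scores, the reported median is an estimate interpolated inside one bucket. The sketch below reproduces that linear interpolation for classic histograms with non-negative bounds. It is a simplified illustration, not the Prometheus implementation, and the bucket boundaries in the usage note are hypothetical rather than the Effectiveness Monitor's actual ones.

```python
import math


def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    Assumes non-negative bounds and a final +Inf bucket holding the total count.
    """
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # In the +Inf bucket only the highest finite bound is known.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return math.nan
```

For cumulative buckets `[(0.25, 2), (0.5, 6), (0.75, 9), (1.0, 10), (math.inf, 10)]`, the median rank is 5, which falls three quarters of the way through the 0.25 to 0.5 bucket, giving 0.4375. Wider buckets therefore mean a coarser median.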
Notification Circuit Breaker
# Alert when any channel circuit breaker leaves the closed state (1=open, 2=half-open)
kubernaut_notification_channel_circuit_breaker_state > 0
LLM Token Cost Tracking
# Tokens consumed per hour by type (prompt vs completion)
sum by (type) (increase(aiagent_api_llm_tokens_total[1h]))
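The query above returns token counts, not money. Turning it into a cost figure takes one extra step outside Prometheus, sketched here against the standard Prometheus HTTP API (`/api/v1/query`). The prices and the `type` label values `prompt` and `completion` are assumptions for illustration; check the actual label values on your deployment and substitute your provider's rates.

```python
import json
import urllib.parse
import urllib.request

QUERY = "sum by (type) (increase(aiagent_api_llm_tokens_total[1h]))"

# Hypothetical prices in USD per 1,000 tokens, keyed by assumed label values.
PRICE_PER_1K = {"prompt": 0.003, "completion": 0.015}


def query_url(base_url: str, promql: str = QUERY) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def hourly_cost(response: dict) -> float:
    """Multiply each series' token count by its price and sum the result."""
    cost = 0.0
    for series in response["data"]["result"]:
        token_type = series["metric"].get("type", "")
        tokens = float(series["value"][1])  # value is [timestamp, "number as string"]
        cost += tokens / 1000 * PRICE_PER_1K.get(token_type, 0.0)  # unknown types cost 0
    return cost


def fetch(base_url: str) -> dict:
    """Run QUERY against a Prometheus server and return the decoded JSON body."""
    with urllib.request.urlopen(query_url(base_url), timeout=10) as resp:
        return json.load(resp)
```

`hourly_cost(fetch(base_url))` then yields the estimated spend for the last hour, where `base_url` is the address of your Prometheus server.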
Logging
All services use structured JSON logging with configurable log levels:
{
  "level": "info",
  "ts": "2026-03-04T10:30:00.000Z",
  "msg": "Reconciling RemediationRequest",
  "controller": "remediationorchestrator",
  "name": "rr-b157a3a9e42f-1c2b5576",
  "namespace": "kubernaut-system",
  "phase": "Processing"
}
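Since each log entry is a single JSON object, logs can be filtered by field without regular expressions. The following sketch assumes one JSON object per line, as services normally emit it (the example above is pretty-printed for readability); `filter_logs` is a hypothetical helper name.

```python
import json
from typing import Iterable, Iterator


def filter_logs(lines: Iterable[str], **fields: str) -> Iterator[dict]:
    """Yield parsed log entries whose fields equal all of the given values."""
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not JSON
        if isinstance(entry, dict) and all(entry.get(k) == v for k, v in fields.items()):
            yield entry
```

For example, `filter_logs(sys.stdin, controller="remediationorchestrator", phase="Processing")` keeps only the entries for that controller and phase when log output is piped into the script.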
Diagnostics
The must-gather tool collects comprehensive diagnostics:
kubectl run must-gather \
  --image=quay.io/kubernaut-ai/must-gather:latest \
  --restart=Never \
  -n kubernaut-system \
  -- collect
This gathers CRDs, logs, Tekton resources, DataStorage state, events, and metrics into a single archive for troubleshooting.
Operator Monitoring Configuration
When deploying via the Kubernaut Operator, monitoring integration is controlled by spec.monitoring.enabled (default: true). When enabled, the operator:
- Auto-derives Prometheus and AlertManager URLs from the OCP monitoring stack
- Creates 2 additional ClusterRoles: {namespace}-alertmanager-view and {namespace}-gateway-signal-source
- Binds the Effectiveness Monitor and Kubernaut Agent ServiceAccounts to cluster-monitoring-view for Prometheus query access
- Binds the OCP AlertManager ServiceAccount to gateway-signal-source for signal ingestion
apiVersion: kubernaut.ai/v1alpha1
kind: Kubernaut
spec:
  monitoring:
    enabled: true  # default; set to false to disable monitoring RBAC
Disabling monitoring
Setting spec.monitoring.enabled: false removes the 2 monitoring ClusterRoles and their bindings. The Effectiveness Monitor will not be able to query Prometheus for post-remediation health checks, and AlertManager will not have RBAC to send alerts to the Gateway.
See the Operator CR API Reference for all monitoring-related fields.
Next Steps
- Troubleshooting -- Common issues and resolutions
- Configuration Reference -- Tuning service parameters
- Audit & Observability -- Audit event reference