Monitoring
All Kubernaut services expose Prometheus-compatible metrics and standard health check endpoints. This page provides a complete metrics reference for building Grafana dashboards and alerting rules.
Health Checks
Services expose health endpoints at different paths depending on their framework:
| Service Type |
Liveness |
Readiness |
Notes |
| Go CRD controllers (RO, SP, AA, WFE, NT, EM) |
GET /healthz |
GET /readyz |
controller-runtime defaults |
| Gateway |
GET /health |
GET /ready |
Also supports GET /healthz |
| DataStorage |
GET /health/live |
GET /health/ready |
Nested structure; checks PostgreSQL |
| HolmesGPT API |
GET /health |
GET /ready |
Readiness includes LLM connectivity |
Scrape Configuration
All services expose metrics at :9090/metrics in Prometheus exposition format.
# prometheus.yml
scrape_configs:
- job_name: kubernaut
kubernetes_sd_configs:
- role: pod
namespaces:
names: [kubernaut-system]
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: true
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: (.+)
replacement: ${1}:9090
Gateway Metrics
| Metric |
Type |
Labels |
Description |
gateway_signals_received_total |
Counter |
source_type, severity |
Total signals received by source type and severity |
gateway_signals_deduplicated_total |
Counter |
signal_name |
Signals deduplicated (duplicate fingerprint) |
gateway_signals_rejected_total |
Counter |
reason |
Signals rejected by scope filtering |
gateway_crds_created_total |
Counter |
source_type, status |
RemediationRequest CRDs created |
gateway_crd_creation_errors_total |
Counter |
error_type |
CRD creation errors |
gateway_http_request_duration_seconds |
Histogram |
endpoint, method, status |
HTTP request duration |
gateway_circuit_breaker_state |
Gauge |
name |
Circuit breaker state (0=closed, 1=half-open, 2=open) |
Signal Processing Metrics
| Metric |
Type |
Labels |
Description |
signalprocessing_processing_total |
Counter |
phase, result |
Processing operations by phase and result |
signalprocessing_processing_duration_seconds |
Histogram |
phase |
Processing duration per phase |
signalprocessing_enrichment_errors_total |
Counter |
error_type |
Enrichment errors (K8s API issues) |
AI Analysis Metrics
| Metric |
Type |
Labels |
Description |
aianalysis_confidence_score_distribution |
Histogram |
signal_type |
LLM confidence score distribution |
aianalysis_approval_decisions_total |
Counter |
decision, environment |
Approval decisions (auto-approved, approval-required) |
aianalysis_failures_total |
Counter |
reason, sub_reason |
Analysis failures by reason |
aianalysis_rego_evaluations_total |
Counter |
outcome, degraded |
Rego policy evaluations |
| Metric |
Type |
Labels |
Description |
kubernaut_remediationorchestrator_phase_transitions_total |
Counter |
from_phase, to_phase, namespace |
Phase transitions -- core throughput and failure metric |
kubernaut_remediationorchestrator_reconcile_duration_seconds |
Histogram |
namespace, phase |
Reconciliation duration per phase |
kubernaut_remediationorchestrator_timeouts_total |
Counter |
phase, namespace |
Remediation timeouts by phase |
kubernaut_remediationorchestrator_blocked_total |
Counter |
namespace, reason |
RRs blocked by routing engine |
kubernaut_remediationorchestrator_current_blocked |
Gauge |
namespace |
Currently blocked RRs |
kubernaut_remediationorchestrator_child_crd_creations_total |
Counter |
child_type, namespace |
Child CRD creations by type |
kubernaut_remediationorchestrator_no_action_needed_total |
Counter |
reason, namespace |
Remediations where no action was needed |
kubernaut_remediationorchestrator_duplicates_skipped_total |
Counter |
skip_reason, namespace |
Duplicate remediations skipped |
kubernaut_remediationorchestrator_approval_decisions_total |
Counter |
decision, namespace |
Human approval throughput |
Workflow Execution Metrics
| Metric |
Type |
Labels |
Description |
workflowexecution_reconciler_total |
Counter |
outcome |
Workflow executions by outcome (success/failure) |
workflowexecution_reconciler_duration_seconds |
Histogram |
outcome |
Execution duration |
Notification Metrics
| Metric |
Type |
Labels |
Description |
kubernaut_notification_delivery_attempts_total |
Counter |
channel, status |
Delivery attempts per channel |
kubernaut_notification_delivery_duration_seconds |
Histogram |
channel |
Delivery duration per channel |
kubernaut_notification_delivery_retries_total |
Counter |
channel, reason |
Delivery retries per channel |
kubernaut_notification_channel_circuit_breaker_state |
Gauge |
channel |
Circuit breaker state (0=closed, 1=open, 2=half-open) |
kubernaut_notification_channel_health_score |
Gauge |
channel |
Channel health score (0--100) |
kubernaut_notification_reconciler_active |
Gauge |
phase |
Active notification backlog by phase |
Effectiveness Monitor Metrics
| Metric |
Type |
Labels |
Description |
kubernaut_effectivenessmonitor_component_scores |
Histogram |
component |
Score distribution (0.0--1.0) per component |
kubernaut_effectivenessmonitor_component_assessments_total |
Counter |
component, result |
Component assessments (health, hash, alert, metrics) |
kubernaut_effectivenessmonitor_assessments_completed_total |
Counter |
reason |
Assessments completed (full, partial, expired) |
kubernaut_effectivenessmonitor_validity_expirations_total |
Counter |
-- |
Assessments that expired before completion |
kubernaut_effectivenessmonitor_external_call_errors_total |
Counter |
service, operation, error_type |
Prometheus/AlertManager call errors |
DataStorage Metrics
| Metric |
Type |
Labels |
Description |
datastorage_write_duration_seconds |
Histogram |
table |
Duration of write operations in seconds |
datastorage_audit_lag_seconds |
Histogram |
service |
Lag between event occurrence and write |
datastorage_dlq_warning |
Gauge |
stream |
DLQ at 80% capacity (1 = warning) |
datastorage_dlq_critical |
Gauge |
stream |
DLQ at 90% capacity (1 = critical) |
AI Agent API Metrics
| Metric |
Type |
Labels |
Description |
aiagent_api_investigations_total |
Counter |
status |
Investigation requests by outcome |
aiagent_api_investigations_duration_seconds |
Histogram |
-- |
End-to-end investigation duration |
aiagent_api_llm_calls_total |
Counter |
provider, model, status |
LLM API calls by provider and outcome |
aiagent_api_llm_call_duration_seconds |
Histogram |
provider, model |
LLM call latency |
aiagent_api_llm_token_usage_total |
Counter |
provider, model, type |
Token consumption (prompt, completion) |
Audit Pipeline Metrics
These metrics are shared across all Go services via the buffered audit store:
| Metric |
Type |
Labels |
Description |
audit_events_dropped_total |
Counter |
service |
Events dropped due to full buffer -- data loss indicator |
Controller-Runtime Built-in Metrics
All Go CRD controllers also expose standard controller-runtime metrics:
| Metric |
Type |
Description |
controller_runtime_reconcile_total |
Counter |
Total reconciliations (controller, result labels) |
controller_runtime_reconcile_errors_total |
Counter |
Reconciliation errors |
controller_runtime_reconcile_time_seconds |
Histogram |
Reconciliation duration |
workqueue_depth |
Gauge |
Current work queue depth |
workqueue_adds_total |
Counter |
Work queue additions |
workqueue_queue_duration_seconds |
Histogram |
Time items spend in queue |
workqueue_retries_total |
Counter |
Work queue retries |
Example PromQL Queries
# Remediations completed per minute
rate(kubernaut_remediationorchestrator_phase_transitions_total{to_phase="Completed"}[5m]) * 60
Failure Rate
# Percentage of remediations that fail
sum(rate(kubernaut_remediationorchestrator_phase_transitions_total{to_phase="Failed"}[1h]))
/
sum(rate(kubernaut_remediationorchestrator_phase_transitions_total{to_phase=~"Completed|Failed|TimedOut"}[1h]))
* 100
LLM Latency (p99)
histogram_quantile(0.99, rate(aiagent_api_llm_call_duration_seconds_bucket[5m]))
Signal Deduplication Rate
sum(rate(gateway_signals_deduplicated_total[5m]))
/
(sum(rate(gateway_signals_received_total[5m])) + sum(rate(gateway_signals_deduplicated_total[5m])))
* 100
Audit Pipeline Health
# Events being dropped -- alert if non-zero drop rate
rate(audit_events_dropped_total[5m]) > 0
# DLQ at warning capacity -- alert if any stream is at 80%
datastorage_dlq_warning > 0
Effectiveness Score Distribution
# Median health score across assessments
histogram_quantile(0.5, rate(kubernaut_effectivenessmonitor_component_scores_bucket{component="health"}[1h]))
Notification Circuit Breaker
# Alert when Slack circuit breaker opens
kubernaut_notification_channel_circuit_breaker_state{channel="slack"} > 0
LLM Token Cost Tracking
# Tokens consumed per hour by provider
sum by (provider, type) (increase(aiagent_api_llm_token_usage_total[1h]))
Logging
All services use structured JSON logging with configurable log levels:
{
"level": "info",
"ts": "2026-03-04T10:30:00.000Z",
"msg": "Reconciling RemediationRequest",
"controller": "remediationorchestrator",
"name": "rr-b157a3a9e42f-1c2b5576",
"namespace": "kubernaut-system",
"phase": "Processing"
}
Diagnostics
The must-gather tool collects comprehensive diagnostics:
kubectl run must-gather \
--image=quay.io/kubernaut-ai/must-gather:latest \
--restart=Never \
-n kubernaut-system \
-- collect
This gathers CRDs, logs, Tekton resources, DataStorage state, events, and metrics into a single archive for troubleshooting.
Next Steps