What's Next¶
Kubernaut v1.5 is the next major milestone, focused on agentic architecture and interactive sessions. The features below are in active development.
Interactive Sessions¶
A streaming, conversational MCP interface that lets any consuming agent interact with Kubernaut in real time:
- Real-time streaming — Investigation findings streamed token-by-token via SSE; consuming agents render them natively (chat bubbles, log panels, progress bars).
- Choice presentation — Workflow options returned as structured data (name, description, risk level, parameters) for agents to render in their preferred UX.
- Conversational steering — Multi-turn context maintained across follow-up questions during an investigation.
- Join mid-flight — Attach to an in-progress remediation that was already started autonomously (from an alert) and receive current state plus live updates without restarting the session.
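To illustrate the streaming bullet above, here is a minimal sketch of how a consuming agent might group a Server-Sent Events stream into discrete events before rendering them. The event name `finding` and the sample payloads are assumptions made for illustration; the source does not specify the event schema Kubernaut emits.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass
class SSEEvent:
    event: str  # event type; "message" when the stream names none
    data: str   # payload; multiple data lines are joined with "\n"


def parse_sse(lines: Iterable[str]) -> Iterator[SSEEvent]:
    """Group raw SSE lines into events; a blank line ends an event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # Dispatch only when at least one data line was seen.
            if data:
                yield SSEEvent(event, "\n".join(data))
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        else:
            field, _, value = line.partition(":")
            if value.startswith(" "):
                value = value[1:]  # drop the single optional leading space
            if field == "event":
                event = value
            elif field == "data":
                data.append(value)


# Hypothetical stream: event names and payloads are illustrative only.
SAMPLE_STREAM = [
    "event: finding\n",
    "data: Checking events first\n",
    "\n",
    ": keep-alive\n",
    "event: finding\n",
    "data: 14 OOMKilled events\n",
    "data: in the last 30 minutes\n",
    "\n",
]

events = list(parse_sse(SAMPLE_STREAM))
```

A consuming agent would typically feed each `SSEEvent.data` chunk into its own renderer (chat bubble, log panel, or progress bar) as it arrives, rather than collecting the full list first.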
Joining an in-flight investigation
Operator: What's going on with checkout-service in prod-us-east? I see alerts firing.

Kubernaut: There's already an investigation running on checkout-service. It started 2 minutes ago from a PodCrashLoopBackOff alert. Let me catch you up.

Kubernaut: Checking events first… I see 14 OOMKilled events in the last 30 minutes, all on checkout-service-7b4d9. The memory limit is set to 512Mi.

Kubernaut: Let me look at the metrics… There's a clear spike in container_memory_working_set_bytes: usage jumped to 623Mi right at 03:14 UTC. Something changed at that time.

Kubernaut: Pulling the logs around 03:14… Found it: "cache size exceeded soft limit, disabling eviction". A config reload at 03:14 introduced an unbounded in-memory cache, which is consuming all available memory and triggering the OOM kills.

Kubernaut: So the root cause is the config change, not a traffic spike.

Operator: Makes sense. What can we do about it?

Kubernaut: Searching workflow catalog… Found two workflows that match this scenario:
- Bump memory limit to 768Mi + rolling restart
- Revert ConfigMap to pre-03:14 version

Operator: Go with rollback-config. The cache change wasn't approved.

Kubernaut: Reverting ConfigMap… checkout-config rolled back to revision v42, the last known good version before the cache change.

Kubernaut: Triggering rolling restart… Pods are cycling now. 1/3 ready… 2/3 ready… 3/3 ready. All pods are back up with the old config.

Kubernaut: Rollback complete. I'll keep watching for the next 5 minutes to make sure the OOM kills don't come back.

Kubernaut: All clear: no OOMKilled events in the last 5 minutes and memory is stable at 340Mi. Looks like the rollback did the trick. Closing this one out.
Starting from a natural language signal
Operator: The payment processing in prod-eu-west feels sluggish since this morning. Customers are complaining about timeouts at checkout.

Kubernaut: Got it. Sounds like a latency issue on payment-gateway in prod-eu-west starting this morning. Let me take a look.

Kubernaut: Looking at the metrics… p99 latency jumped from 120ms to 2.4s right at 07:58 UTC. That's a 20x increase, definitely not gradual.

Kubernaut: Checking pod health… All 5 replicas are running with no restarts, so it's not a crash loop. The issue is upstream.

Kubernaut: Pulling logs… I see repeated "connection pool exhausted, waiting for available connection". The app can't reach its database fast enough.

Kubernaut: Checking the database endpoints… There it is: payment-db only has 1 of 3 endpoints ready. Two replicas went down at 07:55 UTC from a node drain. The remaining replica can't handle the full connection load, so the pool saturates and requests time out.

Operator: That makes sense, the infra team had a maintenance window. What can we do to fix it?

Kubernaut: Searching workflow catalog… Two options based on the RCA:
- Restore payment-db to 3 replicas on available nodes
- Uncordon the drained node and reschedule pods

Operator: Go with scale-statefulset.

Kubernaut: Scaling StatefulSet… payment-db target set to 3 replicas. Waiting for the new pods to join the cluster.

Kubernaut: Watching rollout… 1/3 ready… 2/3 ready… 3/3 ready. All database replicas are back.

Kubernaut: Connection pool is recovering. Active connections dropped from 500/500 to 180/500. Latency should normalize shortly. I'll keep monitoring p99 for the next 10 minutes to confirm.
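The "choice presentation" feature describes workflow options as structured data with a name, description, risk level, and parameters. A minimal sketch of what a consuming agent might receive and render is shown here. The field names, the risk vocabulary, the `uncordon-node` name, the parameter keys, and the pairing of `scale-statefulset` with its description are all assumptions for illustration, not the actual Kubernaut schema.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class WorkflowChoice:
    name: str
    description: str
    risk_level: str  # hypothetical vocabulary: "low", "medium", "high"
    parameters: dict[str, Any] = field(default_factory=dict)


def render_choices(choices: list[WorkflowChoice]) -> str:
    """Render choices as a numbered plain-text menu (one possible UX)."""
    lines = []
    for index, choice in enumerate(choices, start=1):
        lines.append(
            f"{index}. {choice.name} [{choice.risk_level}] {choice.description}"
        )
    return "\n".join(lines)


# Illustrative data loosely modeled on the transcript above.
CHOICES = [
    WorkflowChoice(
        name="scale-statefulset",
        description="Restore payment-db to 3 replicas on available nodes",
        risk_level="low",
        parameters={"replicas": 3},
    ),
    WorkflowChoice(
        name="uncordon-node",  # hypothetical name
        description="Uncordon the drained node and reschedule pods",
        risk_level="medium",
    ),
]

menu = render_choices(CHOICES)
```

Because the options arrive as data rather than prose, a chat client could render the same list as buttons while a CLI agent prints the numbered menu.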
Backstage Console¶
A Backstage plugin providing an operator dashboard for investigation management, workflow oversight, and approve/reject/override controls through a web UI.
Conceptual mockups
The following mockups show the planned Backstage console experience. Designs are subject to change.
Fleet Overview¶
Natural language query bar for intent-driven navigation. KPI cards (active investigations, resolved, critical alerts, avg resolution time), cluster health grid, and a filtered alerts table — all driven by the operator's query.
Investigation View¶
Chat-style investigation transcript showing the Kubernaut Agent's live reasoning, tool calls, and root cause analysis. Operators can follow the AI's investigation in real time and intervene when needed.
Workflow Catalog¶
Searchable workflow catalog with natural language filtering, KPI metrics, and a table showing workflow status, action types, match count, and effectiveness scores.
MCP & A2A Integration¶
The API Frontend service acts as the MCP/A2A gateway, exposing 20 MCP tools across four domains:
- Remediation lifecycle: kubernaut_list_remediations, kubernaut_get_remediation, kubernaut_approve, kubernaut_cancel_remediation, kubernaut_watch, kubernaut_submit_signal
- Investigation: kubernaut_start_investigation, kubernaut_poll_investigation, kubernaut_select_workflow, kubernaut_present_decision
- Data & history: kubernaut_list_workflows, kubernaut_get_remediation_history, kubernaut_get_effectiveness, kubernaut_get_audit_trail
- Cluster context: af_list_events, af_get_pods, af_get_workloads, af_resolve_owner, af_check_existing_rr, af_create_rr
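MCP tools are invoked with a JSON-RPC 2.0 `tools/call` request. As a rough sketch, a client could build a call to one of the tools listed above like this. The `cluster` argument is an assumption; the source does not document the input schema of any tool.

```python
import json
from typing import Any


def build_tool_call(request_id: int, tool: str, arguments: dict[str, Any]) -> str:
    """Serialize an MCP tools/call request as a JSON-RPC 2.0 message."""
    payload = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(payload)


# The "cluster" argument name is hypothetical.
request = build_tool_call(
    1, "kubernaut_list_remediations", {"cluster": "prod-us-east"}
)
```

In practice an MCP client library would handle this framing; the sketch only shows what travels over the wire.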
A2A (Agent-to-Agent) protocol support is implemented at the library level with agent card discovery at /.well-known/agent-card.json, ADK executor integration, and InvestigationSession CRDs linking A2A task IDs to remediation context.
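Agent card discovery uses a well-known path on the gateway's origin. A small sketch of how a peer agent might derive the discovery URL from any base URL follows; the host name is a placeholder, not a real endpoint.

```python
from urllib.parse import urljoin

AGENT_CARD_PATH = "/.well-known/agent-card.json"


def agent_card_url(base_url: str) -> str:
    """Resolve the well-known agent card path against the base URL's origin."""
    return urljoin(base_url, AGENT_CARD_PATH)


# Placeholder host for illustration only.
card_url = agent_card_url("https://kubernaut.example.com/a2a/")
```

Because the well-known path is absolute, it replaces whatever path the base URL carried, so the card is always looked up at the origin root.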
Declarative Recipes¶
SREs define reusable agentic workflows as declarative Goose recipes — YAML-based configurations that package instructions, MCP extensions, and parameters into shareable, reproducible agent behaviors. Kubernaut injects them at three pipeline points via the Goose runtime, each calling external MCP tools. Each injection point accepts multiple stacked recipes.
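Actual recipes are YAML files executed by the Goose runtime. The Python sketch here only models the stacking idea: several recipes registered on one injection point run in order and each contributes to a shared context. The `run_stack` helper, the merge-by-update semantics, and the `check_recent_deployments` recipe are assumptions for illustration.

```python
from enum import Enum
from typing import Any, Callable

Context = dict[str, Any]
Recipe = Callable[[Context], Context]


class InjectionPoint(Enum):
    PRE_INVESTIGATION = "pre-investigation"
    PRE_WORKFLOW_SELECTION = "pre-workflow-selection"
    EM_DIRECT_EXECUTION = "em-direct-execution"


def check_maintenance_window(ctx: Context) -> Context:
    # Stand-in for a call to a CMDB MCP server.
    return {"maintenance_window": False}


def check_recent_deployments(ctx: Context) -> Context:
    # Hypothetical second recipe stacked on the same injection point.
    return {"recent_deployments": 0}


def enforce_cost_guardrails(ctx: Context) -> Context:
    # Stand-in for a call to a Cost/Resource MCP server.
    return {"constraints": ["do not select scale-up workflows"]}


STACKS: dict[InjectionPoint, list[Recipe]] = {
    InjectionPoint.PRE_INVESTIGATION: [
        check_maintenance_window,
        check_recent_deployments,
    ],
    InjectionPoint.PRE_WORKFLOW_SELECTION: [enforce_cost_guardrails],
}


def run_stack(point: InjectionPoint, ctx: Context) -> Context:
    """Run every recipe registered for a point, merging outputs in order."""
    merged = dict(ctx)
    for recipe in STACKS.get(point, []):
        merged.update(recipe(merged))
    return merged


result = run_stack(InjectionPoint.PRE_INVESTIGATION, {"resource": "checkout-service"})
```

With update-style merging, a later recipe can override a key set by an earlier one, so stack order matters in this model.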
Injection 1: Pre-Investigation (Kubernaut Agent)¶
Context injected into the LLM prompt before analysis begins.
Example: check-maintenance-window — Calls a CMDB MCP server to check if the resource is in a maintenance window or had recent deployments. The result is injected into the investigation context before the LLM starts. If under maintenance, alerting is skipped and the RCA is annotated as expected downtime.
Injection 2: Pre-Workflow Selection (Kubernaut Agent)¶
Constraints injected to bias workflow choice.
Example: enforce-cost-guardrails — Calls a Cost/Resource MCP for budget utilization and scaling limits for the namespace. Returns constraints such as "do not select scale-up workflows", nudging the LLM toward restart/rollback over resource-intensive remediations.
Injection 3: EM Direct Execution (via Goose)¶
The recipe runs through the Kubernaut Agent endpoint at effectiveness-assessment time.
Example: verify-business-slo — Calls an SLO/Business Metrics MCP to check p95 latency, error rate, and order throughput against SLO budget. Returns a structured pass/fail verdict with business impact data, replacing the default Kubernetes health check with SRE-defined assessment SOPs.
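The structured pass/fail verdict could look something like the following sketch, which compares the three metrics named above against SLO targets. The type names, field names, and threshold values are assumptions; the source does not define the verdict schema.

```python
from dataclasses import dataclass


@dataclass
class SloTargets:
    max_p95_latency_ms: float
    max_error_rate: float
    min_orders_per_min: float


@dataclass
class Verdict:
    passed: bool
    failures: list[str]  # names of the checks that breached their target


def assess(
    p95_latency_ms: float,
    error_rate: float,
    orders_per_min: float,
    targets: SloTargets,
) -> Verdict:
    """Compare observed metrics with SLO targets and collect any breaches."""
    failures = []
    if p95_latency_ms > targets.max_p95_latency_ms:
        failures.append("p95_latency")
    if error_rate > targets.max_error_rate:
        failures.append("error_rate")
    if orders_per_min < targets.min_orders_per_min:
        failures.append("order_throughput")
    return Verdict(passed=not failures, failures=failures)


# Illustrative thresholds only.
TARGETS = SloTargets(max_p95_latency_ms=250, max_error_rate=0.01, min_orders_per_min=800)

healthy = assess(180, 0.002, 950, TARGETS)
degraded = assess(2400, 0.002, 950, TARGETS)
```

Returning the list of failed checks alongside the boolean lets the caller attach business impact detail to a failing verdict instead of reporting only pass or fail.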
Fleet Operations¶
Hub-and-spoke deployment using OCM (Open Cluster Management) — 7 steps from alert to remediation, zero remote footprint.
Remediation flow¶
- Remote Prometheus forwards metrics to Thanos on hub
- Alertmanager fires alert → Kubernaut Engine triggers pipeline
- KE obtains JWT from Keycloak for MCP investigation
- KE calls MCP on target remote cluster for RCA investigation
- KE obtains JWT from Keycloak for remediation execution
- KE dispatches remediation playbook to AWX
- AWX executes fix on target remote cluster via ephemeral SA
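The flow above requests a separate JWT for investigation and for remediation. One plausible way to obtain such tokens is the OAuth 2.0 client credentials grant against Keycloak's OpenID Connect token endpoint. The sketch here only builds the two requests and sends nothing; the grant type, host, realm, client ID, and scope names are all assumptions, since the source does not say how KE authenticates.

```python
from urllib.parse import urlencode


def token_request(
    keycloak_url: str,
    realm: str,
    client_id: str,
    client_secret: str,
    scope: str,
) -> tuple[str, str]:
    """Return the token endpoint URL and the form-encoded request body."""
    url = f"{keycloak_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/token"
    body = urlencode(
        {
            "grant_type": "client_credentials",
            "client_id": client_id,
            "client_secret": client_secret,
            "scope": scope,
        }
    )
    return url, body


# Hypothetical host, realm, client, and scope names.
investigate_url, investigate_body = token_request(
    "https://keycloak.hub.example.com/", "kubernaut", "kubernaut-engine",
    "not-a-real-secret", "mcp-investigate",
)
remediate_url, remediate_body = token_request(
    "https://keycloak.hub.example.com/", "kubernaut", "kubernaut-engine",
    "not-a-real-secret", "awx-remediate",
)
```

Requesting a distinct token per phase keeps the investigation credential from being usable for remediation, which fits the two separate JWT steps in the list.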
Zero persistent credentials
Remediation uses ephemeral ServiceAccounts with OCM-managed lifecycle — no long-lived secrets stored on remote clusters.
Natural Language Signal Intake¶
Accept signals described in plain language — not just structured Prometheus alerts or Kubernetes events. Operators, chat bots, and external agents can trigger investigations by describing symptoms conversationally. Kubernaut resolves the intent (cluster, service, symptom) and opens an investigation automatically. See the "Starting from a natural language signal" example under Interactive Sessions.
Observe Mode (Trust Ladder Level 2)¶
Building on v1.4's global dry-run mode, v1.5 adds operator dashboard visibility through the Backstage console and a guided onboarding path for new clusters.
Subject to change
Features listed here are planned but may change. See the Kubernaut milestones for the latest status.