What's Next

Kubernaut v1.5 is the next major milestone, focused on agentic architecture and interactive sessions. The features below are in active development.

Interactive Sessions

A streaming, conversational MCP interface that lets any consuming agent interact with Kubernaut in real time:

Real-time streaming — Investigation findings streamed token-by-token via SSE; consuming agents render them natively (chat bubbles, log panels, progress bars).
Choice presentation — Workflow options returned as structured data (name, description, risk level, parameters) for agents to render in their preferred UX.
Conversational steering — Multi-turn context maintained across follow-up questions during an investigation.
Join mid-flight — Attach to an in-progress remediation that was already started autonomously (from an alert) and receive current state plus live updates without restarting the session.
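
As a rough illustration of the first two items, a consuming agent might parse the SSE stream and decode workflow options along the following lines. This is a minimal sketch: the event names `token` and `choices`, the JSON field names, and the helper names are assumptions made for the example, not Kubernaut's published wire format.

```python
import json
from dataclasses import dataclass
from typing import Iterable, Iterator, List, Tuple


@dataclass
class WorkflowOption:
    # Fields follow the "choice presentation" item; exact names are assumed.
    name: str
    description: str
    risk_level: str
    parameters: dict


def _value(line: str, field: str) -> str:
    # Per the SSE format, drop the field name, the colon, and one optional space.
    value = line[len(field) + 1:]
    return value[1:] if value.startswith(" ") else value


def parse_sse(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event, data) pairs; a blank line dispatches the pending event."""
    event, data = "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
        elif line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        elif line.startswith("event:"):
            event = _value(line, "event")
        elif line.startswith("data:"):
            data.append(_value(line, "data"))


def consume(lines: Iterable[str]) -> Tuple[str, List[WorkflowOption]]:
    """Accumulate streamed text and decode any structured workflow choices."""
    text, options = [], []
    for event, payload in parse_sse(lines):
        if event == "token":        # hypothetical event name
            text.append(payload)
        elif event == "choices":    # hypothetical event name
            options = [WorkflowOption(**item) for item in json.loads(payload)]
    return "".join(text), options
```

Keeping the options as structured data rather than pre-rendered text is what lets each agent choose its own presentation, whether that is buttons, a table, or a plain numbered list.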

Joining an in-flight investigation

Backstage > Kubernaut > rr-b83e19d4a7f1-5c2d09ae

Operator
What's going on with checkout-service in prod-us-east? I see alerts firing.

Kubernaut live
There's already an investigation running on checkout-service — it started 2 minutes ago from a PodCrashLoopBackOff alert. Let me catch you up.

Checking events first… I see 14 OOMKilled events in the last 30 minutes, all on checkout-service-7b4d9. The memory limit is set to 512Mi.

Let me look at the metrics… There's a clear spike in container_memory_working_set_bytes — usage jumped to 623Mi right at 03:14 UTC. Something changed at that time.

Pulling the logs around 03:14… Found it: "cache size exceeded soft limit, disabling eviction". A config reload at 03:14 introduced an unbounded in-memory cache — it's consuming all available memory and triggering the OOM kills.

So the root cause is the config change, not a traffic spike.

Operator
Makes sense. What can we do about it?

Kubernaut
Searching workflow catalog… Found two workflows that match this scenario:

restart-and-patch-memory
Confidence: 91% · Risk: low
Bump memory limit to 768Mi + rolling restart

rollback-config
Confidence: 85% · Risk: low
Revert ConfigMap to pre-03:14 version

Operator
Go with rollback-config — the cache change wasn't approved.

Kubernaut
Reverting ConfigMap… checkout-config rolled back to revision v42 — that's the last known good version before the cache change.

Triggering rolling restart… Pods are cycling now. 1/3 ready… 2/3 ready… 3/3 ready. All pods are back up with the old config.

Rollback complete. I'll keep watching for the next 5 minutes to make sure the OOM kills don't come back.

Kubernaut resolved
All clear — no OOMKilled events in the last 5 minutes and memory is stable at 340Mi. Looks like the rollback did the trick. Closing this one out.

Starting from a natural language signal

Backstage > Kubernaut > rr-d47f62c1e8a3-91b3f07c

Operator
The payment processing in prod-eu-west feels sluggish since this morning. Customers are complaining about timeouts at checkout.

Kubernaut investigating
Got it — sounds like a latency issue on payment-gateway in prod-eu-west starting this morning. Let me take a look.

Looking at the metrics… p99 latency jumped from 120ms to 2.4s right at 07:58 UTC. That's a 20x increase — definitely not gradual.

Checking pod health… All 5 replicas are running with no restarts, so it's not a crash loop. The issue is upstream.

Pulling logs… I see repeated "connection pool exhausted, waiting for available connection" — the app can't reach its database fast enough.

Checking the database endpoints… There it is — payment-db only has 1 of 3 endpoints ready. Two replicas went down at 07:55 UTC from a node drain. The remaining replica can't handle the full connection load, so the pool saturates and requests time out.

Operator
That makes sense, the infra team had a maintenance window. What can we do to fix it?

Kubernaut
Searching workflow catalog… Two options based on the RCA:

scale-statefulset
Confidence: 93% · Risk: low
Restore payment-db to 3 replicas on available nodes

rollback-node-drain
Confidence: 72% · Risk: medium
Uncordon the drained node and reschedule pods

Operator
Go with scale-statefulset.

Kubernaut
Scaling StatefulSet… payment-db target set to 3 replicas. Waiting for the new pods to join the cluster.

Watching rollout… 1/3 ready… 2/3 ready… 3/3 ready. All database replicas are back.

Connection pool is recovering — active connections dropped from 500/500 to 180/500. Latency should normalize shortly. I'll keep monitoring p99 for the next 10 minutes to confirm.

Backstage Console

A Backstage plugin providing an operator dashboard for investigation management, workflow oversight, and approve/reject/override controls through a web UI.

Conceptual mockups

The following mockups show the planned Backstage console experience. Designs are subject to change.

Fleet Overview

Natural language query bar for intent-driven navigation. KPI cards (active investigations, resolved, critical alerts, avg resolution time), cluster health grid, and a filtered alerts table — all driven by the operator's query.

Investigation View

Chat-style investigation transcript showing the Kubernaut Agent's live reasoning, tool calls, and root cause analysis. Operators can follow the AI's investigation in real time and intervene when needed.

Workflow Catalog

Searchable workflow catalog with natural language filtering, KPI metrics, and a table showing workflow status, action types, match count, and effectiveness scores.

MCP & A2A Integration

The API Frontend service acts as the MCP/A2A gateway, exposing 20 MCP tools across four domains:

Remediation lifecycle — kubernaut_list_remediations, kubernaut_get_remediation, kubernaut_approve, kubernaut_cancel_remediation, kubernaut_watch, kubernaut_submit_signal
Investigation — kubernaut_start_investigation, kubernaut_poll_investigation, kubernaut_select_workflow, kubernaut_present_decision
Data & history — kubernaut_list_workflows, kubernaut_get_remediation_history, kubernaut_get_effectiveness, kubernaut_get_audit_trail
Cluster context — af_list_events, af_get_pods, af_get_workloads, af_resolve_owner, af_check_existing_rr, af_create_rr
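
MCP tools are invoked with a JSON-RPC 2.0 `tools/call` request. The sketch below builds such a request body for one of the tools listed above. The tool name comes from the list; the argument key `name` and the helper `tools_call` are hypothetical, since the tools' input schemas are not described here.

```python
import json
from itertools import count

_request_ids = count(1)  # JSON-RPC requests need an id to correlate responses


def tools_call(tool: str, arguments: dict) -> str:
    """Build the JSON body of an MCP `tools/call` request."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    })


# Hypothetical argument shape; check the tool's published input schema.
body = tools_call("kubernaut_get_remediation", {"name": "rr-b83e19d4a7f1-5c2d09ae"})
```

In practice an MCP client library would build and send this envelope; spelling it out only shows what crosses the wire.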

A2A (Agent-to-Agent) protocol support is implemented at the library level with agent card discovery at /.well-known/agent-card.json, ADK executor integration, and InvestigationSession CRDs linking A2A task IDs to remediation context.
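
A client could locate the agent card roughly as follows. This is a sketch using only the Python standard library; the base URL is a placeholder, and the card's contents are not assumed beyond it being JSON.

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen

AGENT_CARD_PATH = "/.well-known/agent-card.json"


def agent_card_url(base_url: str) -> str:
    """Resolve the well-known path against the host root of base_url."""
    return urljoin(base_url, AGENT_CARD_PATH)


def fetch_agent_card(base_url: str, timeout: float = 5.0) -> dict:
    """Download and decode the agent card (performs a network request)."""
    with urlopen(agent_card_url(base_url), timeout=timeout) as response:
        return json.load(response)
```

Because the path is absolute, `urljoin` discards any path component of the base URL, so the card is always looked up at the host root.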

Declarative Recipes

SREs define reusable agentic workflows as declarative Goose recipes: YAML-based configurations that package instructions, MCP extensions, and parameters into shareable, reproducible agent behaviors. Kubernaut injects them at three points in the pipeline via the Goose runtime, and at each point the recipe calls external MCP tools. Each injection point accepts multiple stacked recipes.

Injection 1: Pre-Investigation (Kubernaut Agent)

Context injected into the LLM prompt before analysis begins.

Example: check-maintenance-window — Calls a CMDB MCP server to check if the resource is in a maintenance window or had recent deployments. The result is injected into the investigation context before the LLM starts. If under maintenance, alerting is skipped and the RCA is annotated as expected downtime.
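
Conceptually, stacked pre-investigation recipes each contribute a context snippet that is placed ahead of the investigation prompt. The sketch below models that idea in plain Python. It is not the Goose recipe format, and the function names, the hard-coded CMDB data, and the prompt layout are all assumptions for illustration.

```python
from typing import Callable, Dict, List

# A recipe is modeled here as a function from the signal to a context snippet.
Recipe = Callable[[Dict[str, str]], str]


def check_maintenance_window(signal: Dict[str, str]) -> str:
    # Stand-in for a call to a CMDB MCP server; the data is hard-coded.
    windows = {"checkout-service": "maintenance window until 04:00 UTC"}
    resource = signal["resource"]
    note = windows.get(resource)
    if note:
        return f"CMDB: {resource} is in a {note}."
    return f"CMDB: no maintenance window found for {resource}."


def build_prompt(base_prompt: str, signal: Dict[str, str], recipes: List[Recipe]) -> str:
    """Run every stacked recipe and prepend its output to the prompt."""
    context = [recipe(signal) for recipe in recipes]
    return "\n".join(["Pre-investigation context:", *context, "", base_prompt])
```

Recipes run in list order, so stacking a second recipe simply adds another line of context before the LLM sees the signal.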

Injection 2: Pre-Workflow Selection (Kubernaut Agent)

Constraints injected to bias workflow choice.

Example: enforce-cost-guardrails — Calls a Cost/Resource MCP for budget utilization and scaling limits for the namespace. Returns constraints such as "do not select scale-up workflows", nudging the LLM toward restart/rollback over resource-intensive remediations.

Injection 3: EM Direct Execution (via Goose)

The recipe runs through the Kubernaut Agent endpoint at effectiveness-assessment time.

Example: verify-business-slo — Calls an SLO/Business Metrics MCP to check p95 latency, error rate, and order throughput against SLO budget. Returns a structured pass/fail verdict with business impact data, replacing the default Kubernetes health check with SRE-defined assessment SOPs.
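
A structured pass/fail verdict of this kind might look like the sketch below. The metric keys, thresholds, and class names are hypothetical; a real recipe would obtain the measurements from the SLO/Business Metrics MCP rather than from a dictionary.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SloTarget:
    max_p95_latency_ms: float
    max_error_rate: float
    min_orders_per_min: float


@dataclass
class Verdict:
    passed: bool
    violations: List[str] = field(default_factory=list)


def assess(metrics: Dict[str, float], target: SloTarget) -> Verdict:
    """Compare measured values with the SLO target and list every breach."""
    violations = []
    if metrics["p95_latency_ms"] > target.max_p95_latency_ms:
        violations.append("p95_latency_ms")
    if metrics["error_rate"] > target.max_error_rate:
        violations.append("error_rate")
    if metrics["orders_per_min"] < target.min_orders_per_min:
        violations.append("orders_per_min")
    return Verdict(passed=not violations, violations=violations)
```

Returning the list of breached metrics alongside the boolean gives the effectiveness assessment something to report beyond pass or fail.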

Fleet Operations

Hub-and-spoke deployment using OCM (Open Cluster Management) — 7 steps from alert to remediation, zero remote footprint.

Remediation flow

1. Remote Prometheus forwards metrics to Thanos on hub
2. Alertmanager fires alert → Kubernaut Engine triggers pipeline
3. KE obtains JWT from Keycloak for MCP investigation
4. KE calls MCP on target remote cluster for RCA investigation
5. KE obtains JWT from Keycloak for remediation execution
6. KE dispatches remediation playbook to AWX
7. AWX executes fix on target remote cluster via ephemeral SA
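
The two JWT steps in this flow could be served by a token request like the one sketched below. It assumes the OAuth 2.0 client-credentials grant against Keycloak's standard OpenID Connect token endpoint; the grant type, realm, and client names are assumptions, since the flow above does not specify them. The function only constructs the request and sends nothing.

```python
from urllib.parse import urlencode
from urllib.request import Request


def token_request(base_url: str, realm: str, client_id: str, client_secret: str) -> Request:
    """Build (but do not send) a client-credentials token request."""
    url = f"{base_url.rstrip('/')}/realms/{realm}/protocol/openid-connect/token"
    body = urlencode({
        "grant_type": "client_credentials",  # assumed grant type
        "client_id": client_id,
        "client_secret": client_secret,
    }).encode("ascii")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Placeholder values; pass the result to urllib.request.urlopen to execute it.
request = token_request("https://keycloak.example.com", "fleet", "kubernaut-engine", "s3cret")
```

Requesting a separate token for investigation and for remediation, as the steps above do, allows each token to carry a narrower scope.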

Zero persistent credentials

Remediation uses ephemeral ServiceAccounts with OCM-managed lifecycle — no long-lived secrets stored on remote clusters.

Natural Language Signal Intake

Accept signals described in plain language — not just structured Prometheus alerts or Kubernetes events. Operators, chat bots, and external agents can trigger investigations by describing symptoms conversationally. Kubernaut resolves the intent (cluster, service, symptom) and opens an investigation automatically. See the "Starting from a natural language signal" example under Interactive Sessions.
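
The resolved intent can be thought of as a small structured record. The sketch below is one hypothetical shape for it, populated with the values from the payment-gateway example; the class and its `missing` helper are illustrative and do not describe how Kubernaut performs the resolution.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ResolvedIntent:
    cluster: Optional[str] = None
    service: Optional[str] = None
    symptom: Optional[str] = None

    def missing(self) -> List[str]:
        """Names of the fields that could not be resolved from the message."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


intent = ResolvedIntent(cluster="prod-eu-west", service="payment-gateway", symptom="latency")
partial = ResolvedIntent(cluster="prod-eu-west")
```

A record with unresolved fields, such as `partial` above, is the natural point for an agent to ask a follow-up question before opening an investigation.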

Observe Mode (Trust Ladder Level 2)

Building on v1.4's global dry-run mode, v1.5 adds operator dashboard visibility through the Backstage console and a guided onboarding path for new clusters.

Subject to change

Features listed here are planned but may change. See the Kubernaut milestones for the latest status.