What's New

This page summarises the notable changes in each Kubernaut release. Kubernaut does not support in-place upgrades — each release is a fresh install. Review the changes below to understand what differs from the version you are currently running.


v1.4

Prompt injection defense — Shadow Agent

Kubernaut v1.4 introduces a fail-closed shadow agent that evaluates every LLM tool output for prompt injection. Two evaluation layers provide defense-in-depth:

  • Per-step scanning with random boundary markers and data exfiltration detection
  • Full-context grounding review at the RCA-to-workflow boundary that detects distributed "boiling frog" injection attacks

Enforcement modes (monitor or enforce) control whether suspicious content is logged or triggers a circuit breaker that cancels the investigation. See Security & RBAC: Shadow Agent for details.
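The per-step idea can be sketched in a few lines. This is an illustrative outline, not Kubernaut's implementation: the function names, the marker format, and the reading of "fail-closed" as "a scanner error counts as suspicious" are all assumptions.

```python
import secrets


def wrap_tool_output(output: str):
    """Wrap untrusted tool output between freshly generated boundary markers.

    Returns (marker, wrapped_text). The marker is random per call, so text
    written before the call cannot contain it except by chance.
    The marker format is hypothetical.
    """
    marker = f"BOUNDARY-{secrets.token_hex(16)}"
    wrapped = f"<<{marker}>>\n{output}\n<</{marker}>>"
    return marker, wrapped


def is_suspicious(scanner, text: str) -> bool:
    """Fail closed (assumed meaning): a scanner error is treated as suspicious."""
    try:
        return bool(scanner(text))
    except Exception:
        return True


def decide(mode: str, suspicious: bool) -> str:
    """Map an enforcement mode and a scan verdict to an action (hypothetical names)."""
    if mode not in ("monitor", "enforce"):
        raise ValueError(f"unknown enforcement mode: {mode!r}")
    if not suspicious:
        return "continue"
    # monitor only logs; enforce trips the circuit breaker and cancels.
    return "log" if mode == "monitor" else "cancel"
```

In this sketch, `decide("monitor", True)` yields `"log"` while `decide("enforce", True)` yields `"cancel"`, mirroring the two modes described above.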

Operator workflow overrides

Operators can now override the AI-selected workflow when approving a RemediationApprovalRequest. The authwebhook validates that the override workflow exists and is active; the orchestrator merges the override with a full audit trail. See Human Approval: Overrides.

PagerDuty and Microsoft Teams notifications

Two new delivery channels join Slack:

  • PagerDuty — Events API v2 delivery with circuit breaker and CredentialRef config pattern
  • Microsoft Teams — Adaptive Card delivery with circuit breaker

All delivery channels now share a generic circuit breaker pattern. See Notification Channels.

NetworkPolicies

Kubernaut now deploys 12 NetworkPolicy templates with a default-deny ingress posture, covering all Kubernaut services. CIDRs are configurable, and each service's policy can be toggled via networkPolicies.<service>.enabled. See Security & RBAC: NetworkPolicies.

Breaking: Kubernaut Agent config restructured

The Kubernaut Agent configuration has three breaking changes:

  1. camelCase migration (#908) — All YAML config fields migrated from snake_case to camelCase
  2. Three-domain layout — Config reorganized into runtime, ai, and integrations top-level domains
  3. Config split (#916) — Config is now split into a static ConfigMap (mounted at startup) and a hot-reloadable ConfigMap (watched at runtime)

See Kubernaut Agent SDK Config for the updated reference.
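When porting an existing config by hand, the key-renaming part of the camelCase migration can be mechanised. The helper below is a rough sketch that assumes the config has already been parsed into nested dicts and lists. The field names in the usage note are made up. It only renames keys; it does not move fields into the runtime, ai, and integrations domains, which still has to follow the updated reference.

```python
def snake_to_camel(name: str) -> str:
    """Convert a snake_case key to camelCase, e.g. max_tokens -> maxTokens."""
    head, *rest = name.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def camelize_keys(node):
    """Recursively rename dict keys; values and list order are left untouched."""
    if isinstance(node, dict):
        return {snake_to_camel(key): camelize_keys(value) for key, value in node.items()}
    if isinstance(node, list):
        return [camelize_keys(item) for item in node]
    return node
```

For example, `camelize_keys({"llm": {"api_key_ref": "x"}})` returns `{"llm": {"apiKeyRef": "x"}}`.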

Parallel tool execution

The investigation pipeline now executes multiple LLM tool calls concurrently when the model returns batched requests. The investigation prompt also instructs the LLM to batch independent tool calls for reduced round-trips.
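The pattern can be illustrated with asyncio. Kubernaut itself is not written in Python, and `run_tool` here is a stand-in that merely sleeps, so treat this as a sketch of the concurrency shape rather than the actual pipeline code.

```python
import asyncio


async def run_tool(name: str, args: dict) -> dict:
    """Hypothetical tool call; the sleep simulates I/O latency."""
    await asyncio.sleep(0.05)
    return {"tool": name, "args": args, "ok": True}


async def run_batch(calls):
    """Run every tool call from one model turn concurrently.

    asyncio.gather returns results in the order the calls were given,
    so each result can be matched back to its request.
    """
    return await asyncio.gather(*(run_tool(name, args) for name, args in calls))


def execute(calls):
    """Synchronous entry point for a batch of (tool_name, arguments) pairs."""
    return asyncio.run(run_batch(calls))
```

With three batched calls, the total wall-clock time is roughly that of the slowest call rather than the sum of all three, which is where the saving in round-trips comes from.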

Platform hardening

  • Inconclusive outcome exponential backoff (#1091) — Inconclusive outcomes trigger exponential backoff (1m → 10m cap) and 3-strikes blocking, preventing a flood of RemediationRequests for persistent alerts
  • SA token refresh (#1055) — Custom token path constructor with 401 cache invalidation for Kubernaut Agent
  • CRD-aware engine registration (#868) — Engine registration validates CRD availability; enters degraded status when required CRDs are missing
  • Session hardening (#1078) — Panic recovery, two-tier TTL eviction, 25-minute wall-clock investigation timeout
  • Gateway security hardening (#673) — 256KB body limits, generic RFC 7807 errors, header stripping, RBAC least-privilege, trusted proxy middleware
  • Unified monitoring config (#463) — Prometheus and AlertManager configuration unified into a single monitoring block
  • Standardized log levels (#875) — Log level configuration standardized across all services
  • Verdict label rename (#1077) — The value of VerdictClean changed from "clean" to "aligned". Breaking: update any Prometheus queries that match the old value
  • Audit event batching fix (#1056) — Audit 401/403 errors reclassified as retryable; token source extracted for shared cache across all callers
  • API version validation gate (#1044) — Detects when the LLM omits api_version for ambiguous Kubernetes Kinds (e.g., Event in both v1 and events.k8s.io/v1), retries with a correction listing all conflicting API groups, and escalates to human review on exhaustion to prevent incorrect RBAC grants
  • CRD TTL enforcement (#265) — Terminal RemediationRequest resources are garbage-collected after 24h (configurable via retention.period), preventing CRD accumulation in high-volume clusters
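To make the inconclusive-outcome backoff in the list above concrete, here is a small sketch of a capped exponential delay. The 1-minute base and 10-minute cap come from the release note; the doubling factor is an assumption, since the growth rate is not stated. The 3-strikes rule is left out because the note does not say how strikes are counted.

```python
from datetime import timedelta

BASE_DELAY = timedelta(minutes=1)   # from the release note
MAX_DELAY = timedelta(minutes=10)   # from the release note
GROWTH_FACTOR = 2                   # assumption: the note does not give the factor


def backoff_delay(consecutive_inconclusive: int) -> timedelta:
    """Delay before the next attempt after N consecutive inconclusive outcomes."""
    if consecutive_inconclusive < 1:
        return timedelta(0)
    # Clamp the exponent so very large counts cannot overflow timedelta.
    exponent = min(consecutive_inconclusive - 1, 10)
    return min(BASE_DELAY * GROWTH_FACTOR ** exponent, MAX_DELAY)
```

Under the doubling assumption the sequence is 1m, 2m, 4m, 8m, and then 10m for every later attempt.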

Dry-run mode

When dryRun is enabled, the pipeline stops after AI analysis — no WorkflowExecution, RAR, or EA CRDs are created. The RemediationRequest completes with outcome DryRun.

Kubernaut Operator

The Kubernaut Operator — introduced in v1.3 — is the recommended deployment method for OpenShift. v1.4 adds:

  • OLM lifecycle management — Install, upgrade, and uninstall via Operator Lifecycle Manager with automatic CRD installation and cleanup
  • Supply chain security — Container images ship with SBOM, Cosign signatures, and SLSA provenance attestations
  • postgresql.sslMode — Configurable SSL mode for PostgreSQL connections (disable, require, verify-ca, verify-full)
  • notification.routing BYO — Bring-your-own routing ConfigMap with hot-reload support
  • runtimeConfigMapName — Separate hot-reloadable ConfigMap for Kubernaut Agent runtime configuration
  • Init image mirroring — RELATED_IMAGE_* environment variables for disconnected/air-gapped installs

See the Operator installation guide for deployment instructions.
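The four postgresql.sslMode values above are also valid values of libpq's sslmode connection parameter. How the Operator hands the setting to the database driver is not described here, so the sketch below simply assumes a libpq-style connection URI. The host, database, and user names are placeholders.

```python
from urllib.parse import quote, urlencode

# The four modes named in the release note.
ALLOWED_SSL_MODES = ("disable", "require", "verify-ca", "verify-full")


def build_dsn(host: str, port: int, database: str, user: str, ssl_mode: str) -> str:
    """Build a libpq-style URI, rejecting modes the release note does not list."""
    if ssl_mode not in ALLOWED_SSL_MODES:
        raise ValueError(f"unsupported sslMode: {ssl_mode!r}")
    query = urlencode({"sslmode": ssl_mode})
    return (
        f"postgresql://{quote(user, safe='')}@{host}:{port}/"
        f"{quote(database, safe='')}?{query}"
    )
```

Validating the mode up front turns a typo such as "verify_full" into an immediate error rather than a connection that silently uses a weaker mode.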

Deprecated: OCP-specific Helm chart

The OCP-specific Helm chart is deprecated (#848). Use the unified kubernaut chart with the Kubernaut Operator for OpenShift deployments.

Removed: Conversation API

Conversational mode for Kubernaut Agent (#592) has been removed from v1.4 and deferred to v1.5 as part of the interactive session model.


v1.3

Kubernaut Agent (formerly HolmesGPT)

The LLM integration component has been renamed from HolmesGPT / HAPI to Kubernaut Agent across all services, Helm values, ConfigMaps, and documentation.

Before (v1.2)                     After (v1.3)
holmesgptApi.* Helm values        kubernautAgent.*
holmesgpt-sdk-config ConfigMap    kubernaut-agent-sdk-config
holmesgpt-config ConfigMap        kubernaut-agent-config

Two-invocation investigation architecture

The investigation pipeline has been redesigned from a single three-phase LLM session (v1.1/v1.2) into two independent LLM invocations:

  1. Invocation 1 — Root Cause Analysis: A full tool-access session that performs live Kubernetes inspection and produces a structured RCA result.
  2. Invocation 2 — Workflow Selection: A separate session with no memory of Invocation 1, receiving only structured context fields. Selects a workflow or reports that none is applicable.

This separation improves reliability and makes each invocation independently testable.
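The hand-off between the two invocations can be pictured as a narrow, typed boundary. In the sketch below both functions are deterministic stand-ins for LLM sessions, and the RCAResult fields and the keyword-matching catalog are invented for illustration. The point is only that the second step receives structured fields and nothing else.

```python
from dataclasses import asdict, dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class RCAResult:
    """Structured output of Invocation 1 (field names are hypothetical)."""
    root_cause: str
    affected_resource: str
    severity: str


def investigate(signal: dict) -> RCAResult:
    """Invocation 1 stand-in: the real session would call Kubernetes tools."""
    return RCAResult(
        root_cause=f"{signal['alert']} on {signal['resource']}",
        affected_resource=signal["resource"],
        severity=signal.get("severity", "unknown"),
    )


def select_workflow(context: dict, catalog: Dict[str, str]) -> Optional[str]:
    """Invocation 2 stand-in: sees only the structured context, no transcript.

    Returns a workflow name, or None when no workflow is applicable.
    """
    for keyword, workflow in catalog.items():
        if keyword in context["root_cause"]:
            return workflow
    return None
```

Because `select_workflow` takes a plain dict, it can be tested with hand-written contexts without running an investigation first.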

mTLS and three-port model

All inter-service communication can now be secured with mutual TLS. API-serving components (Gateway, DataStorage, Kubernaut Agent, AIAnalysis) expose three ports:

  • HTTPS serving port — mTLS-protected API traffic
  • Health port — plaintext liveness/readiness probes
  • Metrics port — plaintext Prometheus scrape target

Certificate rotation is handled automatically when tls.mode: hook is set, or delegated to cert-manager. See Monitoring for port details and probe configuration.

SDK config hot-reload

The Kubernaut Agent SDK config (LLM model, endpoint, API key, toolset settings) now supports hot-reload via fsnotify. Active investigations pin a config snapshot at session start, so in-flight work is unaffected. Provider-level settings (llm.provider, OAuth2 credentials) still require a pod restart.
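Snapshot pinning is easiest to see in miniature. The classes below are hypothetical and stand in for whatever the agent does internally. A file watcher (fsnotify in the real service) would call `reload`, while each investigation copies the config once when it starts.

```python
import copy
import threading


class ConfigStore:
    """Holds the latest config; reload() is what a file watcher would call."""

    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._current = copy.deepcopy(initial)

    def reload(self, new_config: dict) -> None:
        with self._lock:
            self._current = copy.deepcopy(new_config)

    def snapshot(self) -> dict:
        """Return an independent copy so later reloads cannot mutate it."""
        with self._lock:
            return copy.deepcopy(self._current)


class Investigation:
    """Pins the config at session start and never re-reads the store."""

    def __init__(self, store: ConfigStore):
        self.config = store.snapshot()
```

An investigation created before a reload keeps its original settings, and the next one created afterwards picks up the new ones.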

Expanded LLM provider support

The Kubernaut Agent now supports Vertex AI, OpenAI, Anthropic, Bedrock, Ollama, and additional providers via LangChainGo.

Effectiveness Monitor improvements

  • maxConcurrentReconciles for parallel EA processing
  • Configurable connectionTimeout, prometheusLookback, and scrapeInterval
  • Clarified stabilization window semantics (EM-internal vs RO-configured EA.spec)

See Effectiveness for configuration details.

Notification coverage

Block reasons and terminal failure states now produce notifications (BR-ORCH-036), closing gaps where operators were not informed of remediation failures.

Prometheus metric rename

Kubernaut Agent metrics have been renamed from the legacy holmesgpt_* namespace to aiagent_api_*. Update any Prometheus queries, alerting rules, or dashboards that reference the old metric names. See Monitoring for the current metric reference.
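For a first pass over existing rules and dashboards, a prefix rewrite can flag and update candidate queries. This sketch assumes that only the prefix changed and the rest of each metric name stayed the same, which the Monitoring reference should confirm. The metric name in the usage note is made up.

```python
import re

# Match the legacy prefix only at the start of a metric name.
_LEGACY_METRIC = re.compile(r"\bholmesgpt_([A-Za-z0-9_:]+)")


def rename_metrics(expr: str) -> str:
    """Rewrite holmesgpt_* metric names to aiagent_api_* in a PromQL string."""
    return _LEGACY_METRIC.sub(r"aiagent_api_\1", expr)
```

For example, `rename_metrics("rate(holmesgpt_requests_total[5m])")` returns `"rate(aiagent_api_requests_total[5m])"`. Check each rewritten query against the current metric reference before deploying it.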

New notification types

v1.3 introduces additional notification types that may require routing configuration updates:

  • Escalation notifications for trust-ladder escalation events
  • StatusUpdate notifications for transient block conditions
  • ManualReview notifications now split by review-source for finer routing control

Data persistence

Comprehensive schema documentation rewritten from the live v1.3 database, including enrichment tables, metric baselines, and updated entity-relationship diagrams.

Feature enrichments and metrics

New documentation for feature enrichment pipeline stages and the notification metrics design decision (DD-METRICS-001).


v1.2

Per-workflow ServiceAccount and RBAC

Each workflow execution now runs under its own ServiceAccount with a dedicated TokenRequest. This replaces the shared SA model from v1.1 and provides fine-grained RBAC isolation per remediation workflow.

Declarative workflow catalog

The workflow catalog has moved from OCI-containerized workflow bundles to declarative RemediationWorkflow CRDs with category and label-based matching plus confidence scoring.

Effectiveness and notification pipeline

Updated effectiveness assessment configuration, notification routing semantics, and EM config key alignment.

DataStorage, audit, and monitoring

Updated data access patterns, audit event documentation, and monitoring metric names.

Signal Processing and Gateway

Rego policy entrypoint corrections, gateway label contract updates, and investigation tier-1 semantics fixes.


v1.1

Initial documented release of Kubernaut.

  • CRD-based microservices architecture with the full six-stage remediation pipeline
  • Prometheus AlertManager and Kubernetes Event ingestion
  • LLM-powered root cause analysis with Kubernetes inspection tools
  • Remediation execution via Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP)
  • Effectiveness assessment with four-dimensional scoring
  • Human approval gates via RemediationApprovalRequest CRDs
  • Rego-based policy evaluation for signal processing and approval
  • Multichannel notifications (Slack, console, log, file)
  • Full audit trail with 7-year retention and CRD reconstruction
  • ActionType and RemediationWorkflow CRD registration via Auth Webhook
  • Alert decay detection (DD-EM-003)
  • Resource lock persistence with deterministic naming (DD-WE-003)

v1.0

End-of-life. No longer documented or supported.