What's New¶
This page summarises the notable changes in each Kubernaut release. Kubernaut does not support in-place upgrades — each release is a fresh install. Review the changes below to understand what differs from the version you are currently running.
v1.4¶
Prompt injection defense — Shadow Agent¶
Kubernaut v1.4 introduces a fail-closed shadow agent that evaluates every LLM tool output for prompt injection. Two evaluation layers provide defense-in-depth:
- Per-step scanning with random boundary markers and data exfiltration detection
- Full-context grounding review at the RCA-to-workflow boundary that detects distributed "boiling frog" injection attacks
Enforcement modes (monitor or enforce) control whether suspicious content is logged or triggers a circuit breaker that cancels the investigation. See Security & RBAC: Shadow Agent for details.
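The per-step layer and the two enforcement modes can be pictured with a minimal sketch. This is not Kubernaut's implementation: the function names, the marker format, and the `looks_suspicious` heuristic are hypothetical stand-ins, and a real evaluator would be far richer.

```python
import secrets


class InvestigationCancelled(Exception):
    """Raised when enforce mode trips on suspicious tool output."""


def wrap_tool_output(output: str) -> tuple[str, str]:
    # A fresh random marker per step means injected text cannot predict,
    # and therefore cannot forge, the closing boundary.
    marker = f"BOUNDARY-{secrets.token_hex(16)}"
    return marker, f"<{marker}>\n{output}\n</{marker}>"


def looks_suspicious(output: str) -> bool:
    # Placeholder heuristic (assumption), used only to drive the example.
    return "ignore previous instructions" in output.lower()


def evaluate_step(output: str, mode: str, log: list[str]) -> str:
    try:
        suspicious = looks_suspicious(output)
    except Exception:
        suspicious = True  # fail closed: evaluator errors count as suspicious
    if suspicious:
        log.append("suspicious tool output detected")
        if mode == "enforce":
            # Stands in for the circuit breaker that cancels the investigation.
            raise InvestigationCancelled("shadow agent tripped")
    _, wrapped = wrap_tool_output(output)
    return wrapped
```

In `monitor` mode the suspicious output is only logged and still passed on; in `enforce` mode the exception ends the investigation.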
Operator workflow overrides¶
Operators can now override the AI-selected workflow when approving a RemediationApprovalRequest. The authwebhook validates that the override workflow exists and is active; the orchestrator merges the override and records it with a full audit trail. See Human Approval: Overrides.
PagerDuty and Microsoft Teams notifications¶
Two new delivery channels join Slack:
- PagerDuty: Events API v2 delivery with circuit breaker and CredentialRef config pattern
- Microsoft Teams: Adaptive Card delivery with circuit breaker
All delivery channels now share a generic circuit breaker pattern. See Notification Channels.
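For readers unfamiliar with the pattern, here is a generic circuit breaker wrapped around a delivery call. It is a sketch of the general technique, not the Kubernaut code: the class, the `deliver` helper, the failure threshold, and the cool-down are all assumptions.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # assumed default
        self.reset_after = reset_after              # assumed cool-down, seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed: calls pass through
        # Open: calls are rejected until the cool-down has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, success: bool) -> None:
        if success:
            self.failures = 0
            self.opened_at = None
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()  # (re)open the breaker


def deliver(breaker: CircuitBreaker, send, payload) -> bool:
    """Send through any channel callable; returns True on success."""
    if not breaker.allow():
        return False
    try:
        send(payload)
    except Exception:
        breaker.record(False)
        return False
    breaker.record(True)
    return True
```

Because `deliver` only needs a callable, the same breaker logic can sit in front of a Slack, PagerDuty, or Teams sender.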
NetworkPolicies¶
Twelve NetworkPolicy templates with a default-deny ingress posture are deployed for all Kubernaut services. CIDRs are configurable, and each service's policy can be toggled via networkPolicies.<service>.enabled. See Security & RBAC: NetworkPolicies.
Breaking: Kubernaut Agent config restructured¶
The Kubernaut Agent configuration has three breaking changes:
- camelCase migration (#908): All YAML config fields migrated from snake_case to camelCase
- Three-domain layout: Config reorganized into runtime, ai, and integrations top-level domains
- Config split (#916): Static ConfigMap (mounted at startup) and hot-reloadable ConfigMap (watched at runtime)
See Kubernaut Agent SDK Config for the updated reference.
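When porting an existing config by hand, a small helper can take care of the key-casing part of the migration. The sketch below assumes the rename is a mechanical snake_case to camelCase conversion, which may not hold for every field, and it does not move keys into the new runtime, ai, and integrations domains. The key names in the usage example are made up; check the result against the updated reference.

```python
import re


def to_camel(key: str) -> str:
    # Upper-case the character after each underscore, except a leading one.
    return re.sub(r"(?<!^)_([a-zA-Z0-9])", lambda m: m.group(1).upper(), key)


def migrate_keys(node):
    """Recursively convert mapping keys; values are left untouched."""
    if isinstance(node, dict):
        return {to_camel(k): migrate_keys(v) for k, v in node.items()}
    if isinstance(node, list):
        return [migrate_keys(item) for item in node]
    return node
```

For example, `migrate_keys({"llm_settings": {"api_key": "x"}})` returns `{"llmSettings": {"apiKey": "x"}}`.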
Parallel tool execution¶
The investigation pipeline now executes multiple LLM tool calls concurrently when the model returns batched requests. The investigation prompt also instructs the LLM to batch independent tool calls for reduced round-trips.
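The concurrency idea is sketched below with Python's asyncio. It only illustrates running a batch of independent tool calls at once; Kubernaut's pipeline is not written this way, and `run_tool` plus the tool names in the usage note are hypothetical.

```python
import asyncio


async def run_tool(name: str, args: dict) -> dict:
    # Stand-in for a real tool call (assumption); sleeps to simulate I/O.
    await asyncio.sleep(0.01)
    return {"tool": name, "args": args, "ok": True}


async def run_batch(calls: list[tuple[str, dict]]) -> list:
    # gather preserves input order, so results line up with the calls;
    # return_exceptions=True keeps one failing tool from discarding the rest.
    tasks = [run_tool(name, args) for name, args in calls]
    return await asyncio.gather(*tasks, return_exceptions=True)


def execute(calls: list[tuple[str, dict]]) -> list:
    return asyncio.run(run_batch(calls))
```

Calling `execute([("get_pods", {"ns": "shop"}), ("get_events", {"ns": "shop"})])` runs both calls concurrently and returns their results in request order.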
Platform hardening¶
- Inconclusive outcome exponential backoff (#1091): Inconclusive outcomes trigger exponential backoff (1m → 10m cap) and 3-strikes blocking, preventing RR flood for persistent alerts
- SA token refresh (#1055): Custom token path constructor with 401 cache invalidation for Kubernaut Agent
- CRD-aware engine registration (#868): Engine registration validates CRD availability; enters degraded status when required CRDs are missing
- Session hardening (#1078): Panic recovery, two-tier TTL eviction, 25-minute wall-clock investigation timeout
- Gateway security hardening (#673): 256KB body limits, generic RFC 7807 errors, header stripping, RBAC least-privilege, trusted proxy middleware
- Unified monitoring config (#463): Prometheus and AlertManager configuration unified into a single monitoring block
- Standardized log levels (#875): Log level configuration standardized across all services
- Verdict label rename (#1077): VerdictClean changed from "clean" to "aligned". Breaking: update Prometheus queries
- Audit event batching fix (#1056): Audit 401/403 errors reclassified as retryable; token source extracted for shared cache across all callers
- API version validation gate (#1044): Detects when the LLM omits api_version for ambiguous Kubernetes Kinds (e.g., Event in both v1 and events.k8s.io/v1), retries with a correction listing all conflicting API groups, and escalates to human review on exhaustion to prevent incorrect RBAC grants
- CRD TTL enforcement (#265): Terminal RemediationRequest resources are garbage-collected after 24h (configurable via retention.period), preventing CRD accumulation in high-volume clusters
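To make the Inconclusive backoff concrete, here is one way a 1m to 10m schedule could be computed. The starting delay, the cap, and the strike count come from the note above; the doubling factor is an assumption, as is the way strikes are counted relative to retries.

```python
BASE_SECONDS = 60   # 1m starting delay (from the release note)
CAP_SECONDS = 600   # 10m cap (from the release note)
MULTIPLIER = 2      # assumption: the growth factor is not stated
MAX_STRIKES = 3     # "3-strikes" (from the release note)


def backoff_delay(consecutive_inconclusive: int) -> int:
    """Delay in seconds before the next attempt for the same alert."""
    if consecutive_inconclusive < 1:
        return 0
    delay = BASE_SECONDS * MULTIPLIER ** (consecutive_inconclusive - 1)
    return min(delay, CAP_SECONDS)


def is_blocked(strikes: int) -> bool:
    # How strikes accumulate or reset is not specified here; this only
    # shows the threshold check.
    return strikes >= MAX_STRIKES
```

Under these assumptions the delays run 60, 120, 240, 480 seconds and then stay at 600.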
Dry-run mode¶
When dryRun is enabled, the pipeline stops after AI analysis — no WorkflowExecution, RAR, or EA CRDs are created. The RemediationRequest completes with outcome DryRun.
Kubernaut Operator¶
The Kubernaut Operator — introduced in v1.3 — is the recommended deployment method for OpenShift. v1.4 adds:
- OLM lifecycle management: Install, upgrade, and uninstall via Operator Lifecycle Manager with automatic CRD installation and cleanup
- Supply chain security: Container images ship with SBOM, Cosign signatures, and SLSA provenance attestations
- postgresql.sslMode: Configurable SSL mode for PostgreSQL connections (disable, require, verify-ca, verify-full)
- notification.routing BYO: Bring-your-own routing ConfigMap with hot-reload support
- runtimeConfigMapName: Separate hot-reloadable ConfigMap for Kubernaut Agent runtime configuration
- Init image mirroring: RELATED_IMAGE_* environment variables for disconnected/air-gapped installs
See the Operator installation guide for deployment instructions.
Deprecated: OCP-specific Helm chart¶
The OCP-specific Helm chart is deprecated (#848). Use the unified kubernaut chart with the Kubernaut Operator for OpenShift deployments.
Removed: Conversation API¶
Conversational mode for Kubernaut Agent (#592) has been removed from v1.4 and deferred to v1.5 as part of the interactive session model.
v1.3¶
Kubernaut Agent (formerly HolmesGPT)¶
The LLM integration component has been renamed from HolmesGPT / HAPI to Kubernaut Agent across all services, Helm values, ConfigMaps, and documentation.
| Before (v1.2) | After (v1.3) |
|---|---|
| holmesgptApi.* Helm values | kubernautAgent.* |
| holmesgpt-sdk-config ConfigMap | kubernaut-agent-sdk-config |
| holmesgpt-config ConfigMap | kubernaut-agent-config |
Two-invocation investigation architecture¶
The investigation pipeline has been redesigned from a single three-phase LLM session (v1.1/v1.2) into two independent LLM invocations:
- Invocation 1 — Root Cause Analysis: A full tool-access session that performs live Kubernetes inspection and produces a structured RCA result.
- Invocation 2 — Workflow Selection: A separate session with no memory of Invocation 1, receiving only structured context fields. Selects a workflow or reports that none is applicable.
This separation improves reliability and makes each invocation independently testable.
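The boundary between the two invocations can be sketched as follows. The field names on `RCAResult`, the `call_llm` callables, and the prompt strings are illustrative assumptions. The point is only that the second call sees structured fields and never the first call's transcript.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RCAResult:
    # Field names are illustrative assumptions.
    root_cause: str
    affected_resource: str
    severity: str


def run_rca(signal: dict, call_llm) -> RCAResult:
    # Invocation 1: the transcript stays local to this function.
    transcript = [f"investigate: {signal['alert']}"]
    raw = call_llm(transcript)
    return RCAResult(**raw)


def select_workflow(rca: RCAResult, catalog: list[str], call_llm):
    # Invocation 2: a fresh message list built only from structured fields.
    messages = [f"context: {asdict(rca)}", f"catalog: {catalog}"]
    choice = call_llm(messages)
    # Anything outside the catalog is treated as "no applicable workflow".
    return choice if choice in catalog else None
```

Since each function takes its LLM as a parameter, either one can be tested on its own with a stub.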
mTLS and three-port model¶
All inter-service communication can now be secured with mutual TLS. API-serving components (Gateway, DataStorage, Kubernaut Agent, AIAnalysis) expose three ports:
- HTTPS serving port — mTLS-protected API traffic
- Health port — plaintext liveness/readiness probes
- Metrics port — plaintext Prometheus scrape target
Certificate rotation is handled automatically when tls.mode: hook is set, or delegated to cert-manager. See Monitoring for port details and probe configuration.
SDK config hot-reload¶
The Kubernaut Agent SDK config (LLM model, endpoint, API key, toolset settings) now supports hot-reload via fsnotify. Active investigations pin a config snapshot at session start, so in-flight work is unaffected. Provider-level settings (llm.provider, OAuth2 credentials) still require a pod restart.
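Snapshot pinning is easy to picture with a small sketch. The class names and the `model` key are hypothetical, and `reload` is called directly here where Kubernaut reacts to a file-watch event.

```python
import threading
from types import MappingProxyType


class ConfigStore:
    def __init__(self, initial: dict):
        self._lock = threading.Lock()
        self._current = MappingProxyType(dict(initial))

    def reload(self, new_values: dict) -> None:
        # Swap in a new read-only mapping; earlier snapshots are untouched.
        with self._lock:
            self._current = MappingProxyType(dict(new_values))

    def snapshot(self):
        with self._lock:
            return self._current


class Session:
    def __init__(self, store: ConfigStore):
        # Pinned once at session start and never refreshed.
        self.config = store.snapshot()
```

A session created before a reload keeps reading the old values, while the next session picks up the new ones.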
Expanded LLM provider support¶
The Kubernaut Agent now supports Vertex AI, OpenAI, Anthropic, Bedrock, Ollama, and additional providers via LangChainGo.
Effectiveness Monitor improvements¶
- maxConcurrentReconciles for parallel EA processing
- Configurable connectionTimeout, prometheusLookback, and scrapeInterval
- Clarified stabilization window semantics (EM-internal vs RO-configured EA.spec)
See Effectiveness for configuration details.
Notification coverage¶
Block reasons and terminal failure states now produce notifications (BR-ORCH-036), closing gaps where operators were not informed of remediation failures.
Prometheus metric rename¶
Kubernaut Agent metrics have been renamed from the legacy holmesgpt_* namespace to aiagent_api_*. Update any Prometheus queries, alerting rules, or dashboards that reference the old metric names. See Monitoring for the current metric reference.
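For a first pass over existing rules and dashboards, a prefix rewrite like the one below may help. It assumes only the prefix changed and each metric's suffix stayed the same, which should be confirmed against the Monitoring reference. The metric name in the usage note is made up.

```python
import re

OLD_PREFIX = "holmesgpt_"
NEW_PREFIX = "aiagent_api_"

# Match the old prefix only at the start of a metric name.
_METRIC = re.compile(r"\b" + re.escape(OLD_PREFIX) + r"([A-Za-z0-9_:]+)")


def migrate_query(promql: str) -> str:
    """Rewrite legacy metric names inside a PromQL expression."""
    return _METRIC.sub(lambda m: NEW_PREFIX + m.group(1), promql)
```

For instance, `migrate_query('rate(holmesgpt_requests_total[5m])')` yields `'rate(aiagent_api_requests_total[5m])'`.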
New notification types¶
v1.3 introduces additional notification types that may require routing configuration updates:
- Escalation notifications for trust-ladder escalation events
- StatusUpdate notifications for transient block conditions
- ManualReview notifications now split by review-source for finer routing control
Data persistence¶
Comprehensive schema documentation rewritten from the live v1.3 database, including enrichment tables, metric baselines, and updated entity-relationship diagrams.
Feature enrichments and metrics¶
New documentation for feature enrichment pipeline stages and the notification metrics design decision (DD-METRICS-001).
v1.2¶
Per-workflow ServiceAccount and RBAC¶
Each workflow execution now runs under its own ServiceAccount with a dedicated TokenRequest. This replaces the shared SA model from v1.1 and provides fine-grained RBAC isolation per remediation workflow.
Declarative workflow catalog¶
The workflow catalog has moved from OCI-containerized workflow bundles to declarative RemediationWorkflow CRDs with category and label-based matching plus confidence scoring.
Effectiveness and notification pipeline¶
Updated effectiveness assessment configuration, notification routing semantics, and EM config key alignment.
DataStorage, audit, and monitoring¶
Updated data access patterns, audit event documentation, and monitoring metric names.
Signal Processing and Gateway¶
Rego policy entrypoint corrections, gateway label contract updates, and investigation tier-1 semantics fixes.
v1.1¶
Initial documented release of Kubernaut.
- CRD-based microservices architecture with the full six-stage remediation pipeline
- Prometheus AlertManager and Kubernetes Event ingestion
- LLM-powered root cause analysis with Kubernetes inspection tools
- Remediation execution via Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP)
- Effectiveness assessment with four-dimensional scoring
- Human approval gates via RemediationApprovalRequest CRDs
- Rego-based policy evaluation for signal processing and approval
- Multichannel notifications (Slack, console, log, file)
- Full audit trail with 7-year retention and CRD reconstruction
- ActionType and RemediationWorkflow CRD registration via Auth Webhook
- Alert decay detection (DD-EM-003)
- Resource lock persistence with deterministic naming (DD-WE-003)
v1.0¶
End-of-life. No longer documented or supported.