Installation¶
Deployment Methods¶
Kubernaut offers two deployment methods:
| Method | Use Case | Platform |
|---|---|---|
| Kubernaut Operator | Production — full lifecycle management, OLM integration, status reporting | OpenShift 4.18+ |
| Helm Chart | Development, testing, CI — quick setup for evaluation and local development | Any Kubernetes 1.32+ |
Production deployments
The Kubernaut Operator is the only supported production deployment method. The Helm chart does not provide the lifecycle management, status reporting, or OLM integration that production operations require. Use the Helm chart only for development, testing, and CI environments.
Kubernaut Operator (Production)¶
The Kubernaut Operator manages the full lifecycle of the Kubernaut platform on OpenShift: secret validation, database migrations, CRD installation, deployment of all 10 microservices, RBAC, NetworkPolicies, OCP Routes, and status reporting. It is a singleton — one Kubernaut CR named kubernaut per cluster.
Installation¶
The operator is available through OLM (Operator Lifecycle Manager) or direct deployment:
- OperatorHub (recommended): install from the OperatorHub catalog in the OpenShift Console
- Custom CatalogSource: for disconnected or custom environments, create a CatalogSource pointing to the operator index image
For complete installation instructions, see the Kubernaut Operator Installation Guide.
Prerequisites (Operator)¶
| Requirement | Version | Notes |
|---|---|---|
| OpenShift | 4.18+ | OLM and operator-framework support required |
| PostgreSQL | 15+ | BYO — the operator does not deploy a database; provide connection details via spec.postgresql |
| Valkey / Redis | 7+ | BYO — provide connection details via spec.valkey |
| LLM provider | — | Any supported provider with JSON structured output |
Operator image: quay.io/kubernaut-ai/kubernaut-operator:1.4.0 (note: no v prefix, unlike component images which use v1.4.0).
Minimal Kubernaut CR:
apiVersion: kubernaut.ai/v1alpha1
kind: Kubernaut
metadata:
  name: kubernaut
  namespace: kubernaut-system
spec:
  postgresql:
    host: postgres.database.svc.cluster.local
    secretName: kubernaut-postgresql
  valkey:
    host: valkey.cache.svc.cluster.local
    secretName: kubernaut-valkey
  kubernautAgent:
    llm:
      provider: openai
      model: gpt-4o
      credentialsSecretName: kubernaut-llm
Disconnected installs
For air-gapped environments, mirror all component images and set RELATED_IMAGE_* environment variables on the operator Deployment. See the operator installation guide for the full image list.
What the Operator manages¶
- Validates BYO PostgreSQL and Valkey secrets before deployment
- Runs embedded database schema migrations
- Installs and upgrades the 9 Kubernaut workload CRDs
- Deploys all 10 microservices with RBAC, ConfigMaps, PDBs, admission webhooks, and NetworkPolicies
- Configures OCP Routes and service-serving CA TLS
- Reports per-service readiness status on the Kubernaut CR
- Cleans up cluster-scoped RBAC and workflow namespace on CR deletion (workload CRDs are retained by design)
Helm Chart (Development/Testing)¶
This section walks you through installing Kubernaut using the Helm chart for development, testing, and CI environments.
Not for production
The Helm chart is intended for development, testing, and CI. For production deployments, use the Kubernaut Operator.
Prerequisites¶
| Requirement | Version | Notes |
|---|---|---|
| Kubernetes | 1.32+ | selectableFields GA in 1.32; required for CRD field selectors |
| Helm | 3.12+ | |
| StorageClass | dynamic provisioning | For PostgreSQL and Valkey PVCs |
| cert-manager | 1.12+ (optional) | Required when tls.mode=cert-manager. Optional for dev (tls.mode=hook is default). |
LLM provider (required for AI investigation):
- Any supported provider with JSON structured output support (response_format: json_object or equivalent). Kubernaut Agent (KA) enables JSON mode on all LLM requests, so models that do not support it will produce parse failures.
Workflow execution engine (at least one):
- Kubernetes Jobs (built-in, no extra dependency)
- Tekton Pipelines (optional)
- Ansible Automation Platform (AAP) / AWX (optional)
External monitoring (recommended):
- kube-prometheus-stack provides:
  - Alert-based signal ingestion (AlertManager sends alerts to Gateway)
  - Metrics enrichment for effectiveness assessments (Prometheus queries)
  - Alert resolution checks (AlertManager API)
  - Metrics scraping for all Kubernaut services (all pods expose /metrics)
Prometheus and AlertManager integration is disabled by default. To enable effectiveness assessments based on alert resolution and metric queries, set effectivenessmonitor.external.prometheusEnabled=true and effectivenessmonitor.external.alertManagerEnabled=true.
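As a values-file equivalent of the two settings above, a minimal sketch follows. The file name `values.yaml` is an assumption; the keys are the ones named in the preceding paragraph.

```yaml
# values.yaml fragment: turn on both EffectivenessMonitor integrations.
# Both keys are disabled by default according to the paragraph above.
effectivenessmonitor:
  external:
    prometheusEnabled: true     # metric queries for assessment enrichment
    alertManagerEnabled: true   # alert resolution checks
```

Pass the file with `--values values.yaml` at install time.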
Infrastructure Setup¶
Complete these steps before installing the Kubernaut chart.
Storage¶
PostgreSQL and Valkey each require a PersistentVolumeClaim for data persistence:
| Component | PVC Name | Default Size | Values |
|---|---|---|---|
| PostgreSQL | postgresql-data | 10Gi | postgresql.storage.size, postgresql.storage.storageClassName |
| Valkey | valkey-data | 512Mi | valkey.storage.size, valkey.storage.storageClassName |
Both PVCs are annotated with helm.sh/resource-policy: keep so data survives helm uninstall.
If the cluster has no default StorageClass, set storageClassName explicitly:
postgresql:
  storage:
    size: 50Gi
    storageClassName: gp3-encrypted
valkey:
  storage:
    storageClassName: gp3-encrypted
To skip in-chart databases entirely and use external instances, set postgresql.enabled=false and/or valkey.enabled=false and configure postgresql.host/valkey.host values in the Configuration Reference.
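A sketch of the external-database case as a values fragment. The host names are hypothetical placeholders, and only the keys named above are shown; the Configuration Reference may call for additional connection settings.

```yaml
# values.yaml fragment: use external PostgreSQL and Valkey instances.
postgresql:
  enabled: false                    # do not deploy the in-chart PostgreSQL
  host: postgres.example.internal   # hypothetical external host
valkey:
  enabled: false                    # do not deploy the in-chart Valkey
  host: valkey.example.internal     # hypothetical external host
```

The pre-created secrets described under Provision Secrets are presumably still needed so that DataStorage can authenticate against the external instances.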
Prometheus and AlertManager¶
Kubernaut integrates with Prometheus and AlertManager at two levels:
1. EffectivenessMonitor queries -- EM queries Prometheus for metric-based assessment enrichment and AlertManager for alert resolution checks. The expected service endpoints (configurable):
| Service | Default URL | Override |
|---|---|---|
| Prometheus | http://kube-prometheus-stack-prometheus.monitoring.svc:9090 | effectivenessmonitor.external.prometheusUrl |
| AlertManager | http://kube-prometheus-stack-alertmanager.monitoring.svc:9093 | effectivenessmonitor.external.alertManagerUrl |
2. AlertManager sends alerts to Gateway -- The Gateway authenticates every signal ingestion request using Kubernetes TokenReview + SubjectAccessReview (SAR). AlertManager must include a bearer token in its webhook requests. See Signal Source Authentication below for the full configuration.
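If your monitoring stack uses different Service names or namespaces than the defaults listed in item 1, both override keys can be set together. A minimal sketch; the URLs are hypothetical placeholders.

```yaml
# values.yaml fragment: point EffectivenessMonitor at non-default endpoints.
effectivenessmonitor:
  external:
    prometheusUrl: http://prometheus.observability.svc:9090       # hypothetical
    alertManagerUrl: http://alertmanager.observability.svc:9093   # hypothetical
```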
Signal Source Authentication¶
The Gateway authenticates every signal ingestion request using Kubernetes TokenReview + SubjectAccessReview (SAR). Signal sources (e.g., AlertManager) must present a valid ServiceAccount bearer token, and that ServiceAccount must have RBAC permission to submit signals.
The chart provides a gateway-signal-source ClusterRole that grants create on the gateway-service resource. Each entry in gateway.auth.signalSources creates a ClusterRoleBinding binding this role to the specified ServiceAccount.
See Security & RBAC for the full TokenReview + SAR flow, Gateway RBAC details, and the gateway-signal-source ClusterRole definition.
Configuring AlertManager¶
AlertManager must include http_config.bearer_token_file in its webhook receiver so the Gateway can authenticate the request. The Gateway service is gateway-service on port 8080, and the AlertManager adapter path is /api/v1/signals/prometheus.
# alertmanager.yml (standalone)
receivers:
  - name: kubernaut
    webhook_configs:
      - url: "https://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
        send_resolved: true
        http_config:
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
route:
  routes:
    - receiver: kubernaut
      matchers:
        - alertname!=""
      continue: true
For kube-prometheus-stack, configure via Helm values:
# kube-prometheus-stack values.yaml
alertmanager:
  config:
    receivers:
      - name: kubernaut
        webhook_configs:
          - url: "https://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
            send_resolved: true
            http_config:
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    route:
      routes:
        - receiver: kubernaut
          matchers:
            - alertname!=""
          continue: true
Then register AlertManager's ServiceAccount as an authorized signal source in your Kubernaut values:
# kubernaut values.yaml
gateway:
  auth:
    signalSources:
      - name: alertmanager
        serviceAccount: alertmanager-kube-prometheus-stack-alertmanager
        namespace: monitoring
Warning
Without bearer_token_file, AlertManager sends unauthenticated requests and the Gateway rejects them with 401 Unauthorized. Without the signalSources entry, the token is valid but SAR denies access with 403 Forbidden.
Pre-Installation¶
Kubernaut uses 9 Custom Resource Definitions. Helm installs them automatically from the chart's crds/ directory on first install -- no manual step is needed. For reinstalls, see Reinstalling.
1. Create the Namespace¶
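A minimal sketch of this step as a manifest, assuming the `kubernaut-system` namespace that every command on this page targets. Apply it with `kubectl apply -f` before creating the secrets.

```yaml
# namespace.yaml (file name is an assumption)
apiVersion: v1
kind: Namespace
metadata:
  name: kubernaut-system
```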
2. Provision Secrets¶
All required secrets must be pre-created before running helm install. The chart validates the presence of database and cache secrets at template time and fails with a descriptive error if any are missing.
PostgreSQL + DataStorage (consolidated secret)¶
PostgreSQL and DataStorage share a single secret. The db-secrets.yaml key must use the same password as POSTGRES_PASSWORD to avoid authentication mismatches.
PG_PASSWORD=$(openssl rand -base64 24)
kubectl create secret generic postgresql-secret \
--from-literal=POSTGRES_USER=slm_user \
--from-literal=POSTGRES_PASSWORD="$PG_PASSWORD" \
--from-literal=POSTGRES_DB=action_history \
--from-literal=db-secrets.yaml="$(printf 'username: slm_user\npassword: %s' "$PG_PASSWORD")" \
-n kubernaut-system
Valkey¶
kubectl create secret generic valkey-secret \
--from-literal=valkey-secrets.yaml="$(printf 'password: %s' "$(openssl rand -base64 24)")" \
-n kubernaut-system
Required secrets summary¶
| Secret Name | Required Keys | Consumed By |
|---|---|---|
| postgresql-secret | POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB, db-secrets.yaml | PostgreSQL (env vars), DataStorage (file mount), migration hook |
| valkey-secret | valkey-secrets.yaml | DataStorage (file mount) |
| llm-credentials | Provider-specific (see below) | Kubernaut Agent |
To use custom secret names for database/cache secrets, pass --set postgresql.auth.existingSecret=<name> and --set valkey.existingSecret=<name> at install time. For LLM credentials, use --set kubernautAgent.llm.credentialsSecretName=<name>.
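The same overrides can be kept in a values file rather than passed as `--set` flags. A sketch using the three keys above; the secret names are hypothetical.

```yaml
# values.yaml fragment: point the chart at custom secret names.
postgresql:
  auth:
    existingSecret: my-postgresql-secret          # hypothetical name
valkey:
  existingSecret: my-valkey-secret                # hypothetical name
kubernautAgent:
  llm:
    credentialsSecretName: my-llm-credentials     # hypothetical name
```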
LLM credentials (required for AI analysis)¶
kubectl create secret generic llm-credentials \
--from-file=application_default_credentials.json=path/to/service-account-key.json \
-n kubernaut-system
The example above covers Vertex AI: Kubernaut Agent auto-detects application_default_credentials.json in the mounted secret and sets GOOGLE_APPLICATION_CREDENTIALS to the mount path at runtime.
With GCP Workload Identity, the secret can be omitted.
Vertex AI requires an SDK config file
The quickstart --set kubernautAgent.llm.provider=... path only supports OpenAI and Anthropic. Vertex AI requires gcp_project_id and gcp_region, which must be provided via sdkConfigContent or existingSdkConfigMap. See Advanced Configuration and the Vertex AI SDK config example.
| Chart Value | Secret Name | Required Keys |
|---|---|---|
| kubernautAgent.llm.credentialsSecretName | llm-credentials (default) | Provider-specific: OPENAI_API_KEY, AZURE_API_KEY, or application_default_credentials.json (file) |
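For the OpenAI provider used in the install examples on this page, the credentials secret might look like the following sketch. It assumes the agent reads the key named `OPENAI_API_KEY` from the default `llm-credentials` secret, as the table suggests; the value is a placeholder.

```yaml
# llm-credentials.yaml (file name is an assumption)
apiVersion: v1
kind: Secret
metadata:
  name: llm-credentials
  namespace: kubernaut-system
type: Opaque
stringData:
  OPENAI_API_KEY: "<your-api-key>"   # placeholder, replace before applying
```

Avoid committing a file like this to version control with a real key in it.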
Notification credentials (optional, Slack only)¶
kubectl create secret generic slack-webhook \
--from-literal=webhook-url=https://hooks.slack.com/services/T.../B.../... \
-n kubernaut-system
| Chart Value | Secret Name | Required Keys |
|---|---|---|
| notification.slack.secretName | slack-webhook (example) | webhook-url |
Only required when Slack delivery is configured. When using console-only routing (default), no notification secret is needed. For advanced multi-receiver routing, use notification.credentials[] and notification.routing.content instead of the Slack shortcut.
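A sketch of the Slack shortcut as a values fragment, using the one chart value named in the table. Other Slack-related keys may exist in the chart; only this one appears on this page.

```yaml
# values.yaml fragment: reference the Slack webhook secret created above.
notification:
  slack:
    secretName: slack-webhook   # must contain the webhook-url key
```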
Install¶
OCP Helm chart deprecated — use the Kubernaut Operator (v1.4)
The OpenShift-specific Helm chart path is deprecated as of v1.4 (#848). For OpenShift production deployments, use the Kubernaut Operator instead. The Helm chart examples below for OpenShift are provided for development and testing convenience only.
NetworkPolicies (v1.4)
Kubernaut v1.4 deploys NetworkPolicies for all services with a default-deny ingress posture. Your cluster's CNI plugin must support NetworkPolicy enforcement (Calico, Cilium, etc.) — clusters without enforcement silently ignore them. Disable per-service with networkPolicies.<service>.enabled: false. See Security & RBAC: NetworkPolicies for details.
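A sketch of the per-service switch as a values fragment. The `gateway` key is a hypothetical stand-in for `<service>`; check the chart's values for the actual service keys.

```yaml
# values.yaml fragment: disable the NetworkPolicy for a single service.
networkPolicies:
  gateway:            # hypothetical service key, replaces <service>
    enabled: false
```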
The chart is distributed as an OCI artifact. With the namespace and secrets provisioned in Pre-Installation, install using helm install:
Kind / Vanilla Kubernetes¶
helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
--namespace kubernaut-system \
--set kubernautAgent.llm.provider=openai \
--set kubernautAgent.llm.model=gpt-4o
OpenShift (OCP)¶
OpenShift requires additional configuration: cert-manager TLS mode, OCP monitoring endpoints (TLS + service-serving CA), and the Red Hat PostgreSQL image. Download the OCP values overlay from the kubernaut-demo-scenarios repository and layer it on top:
helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
--namespace kubernaut-system \
--values kubernaut-ocp-values.yaml \
--set kubernautAgent.llm.provider=openai \
--set kubernautAgent.llm.model=gpt-4o
The OCP values overlay configures:
| Setting | Value |
|---|---|
| TLS mode | cert-manager with a selfsigned-issuer ClusterIssuer |
| Signal source | alertmanager-main in openshift-monitoring |
| Prometheus URL | https://prometheus-k8s.openshift-monitoring.svc:9091 (TLS) |
| AlertManager URL | https://alertmanager-main.openshift-monitoring.svc:9094 (TLS) |
| PostgreSQL image | registry.redhat.io/rhel10/postgresql-16 |
See the kubernaut-ocp-values.yaml reference file for the full configuration.
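To make the table concrete, the rows whose chart keys appear elsewhere on this page can be written as a values fragment. This is an illustrative sketch, not the contents of kubernaut-ocp-values.yaml; the issuer and PostgreSQL image settings are left out because their key names are not given here.

```yaml
# Illustrative fragment only; use the published overlay for real installs.
tls:
  mode: cert-manager
gateway:
  auth:
    signalSources:
      - name: alertmanager
        serviceAccount: alertmanager-main
        namespace: openshift-monitoring
effectivenessmonitor:
  external:
    prometheusUrl: https://prometheus-k8s.openshift-monitoring.svc:9091
    alertManagerUrl: https://alertmanager-main.openshift-monitoring.svc:9094
```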
Disconnected / air-gapped clusters
If your OCP cluster has no internet access, see the Disconnected Installation Guide for mirroring images and configuring the chart for offline use.
Advanced Configuration¶
For advanced LLM configurations (Vertex AI, local models) or custom Rego policies, use --set-file to inject configuration files:
helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
--namespace kubernaut-system \
--set-file kubernautAgent.sdkConfigContent=my-sdk-config.yaml \
--set-file aianalysis.policies.content=my-approval.rego \
--set-file signalprocessing.policy=my-policy.rego
See the sdk-config.yaml.example for a reference SDK config covering Vertex AI, Anthropic, OpenAI, and local models.
To pin a specific chart version, add --version <version>. Omitting --version pulls the latest release.
Start with minimal toolsets
The default SDK config ships with toolsets: {} (no optional toolsets). This is the recommended starting point — the Kubernetes core toolset is always available and handles most incident types (CrashLoopBackOff, config errors, OOMKilled). Enable additional toolsets like prometheus/metrics only for workloads that require metric-driven investigation. Unused toolsets add ~30% token overhead per investigation. See Toolset Optimization for details.
Quickstart¶
For a complete end-to-end demo environment (Kind cluster, monitoring stack, Kubernaut, infrastructure dependencies, workflow catalog), use the kubernaut-demo-scenarios repository:
git clone https://github.com/jordigilh/kubernaut-demo-scenarios.git
cd kubernaut-demo-scenarios
# Configure your LLM provider
export KUBERNAUT_LLM_PROVIDER=openai
export KUBERNAUT_LLM_MODEL=gpt-4o
# Create the full demo environment (~10 minutes)
./scripts/setup-demo-cluster.sh
The setup script creates a Kind cluster, installs Prometheus/Grafana, deploys Kubernaut, and installs infrastructure dependencies (cert-manager, metrics-server, Istio, blackbox-exporter). Gitea and ArgoCD are installed automatically when running GitOps scenarios. See the demo scenarios README for all options including OCP support and advanced LLM configuration.
Post-Install Verification¶
# All pods should be 1/1 Running (readiness probes confirm service health)
kubectl get pods -n kubernaut-system
# Verify workflow catalog
kubectl get remediationworkflows -A
Post-Installation¶
Action Types and Workflows¶
Kubernaut uses an ActionType taxonomy to organize remediation capabilities. Operators register ActionType CRs that describe what each remediation does, when to use it, and under what preconditions. RemediationWorkflow CRs reference ActionTypes by name.
Register your own ActionType CRs and RemediationWorkflows to build a catalog tailored to your environment. See Authoring Workflows for guidelines and the Action Type reference for registration details.
Resource Scope¶
After installation, Kubernaut only manages namespaces and resources that opt in via labels.
See Signals & Alert Routing for details on scope management.
Reinstalling¶
Kubernaut does not support in-place upgrades. To move to a new version, perform a fresh install. See What's New for changes between releases.
Uninstalling¶
What is retained after uninstall¶
| Resource | Behavior | Manual cleanup |
|---|---|---|
| PostgreSQL PVC (postgresql-data) | Retained (resource-policy: keep) | kubectl delete pvc postgresql-data -n kubernaut-system |
| Valkey PVC (valkey-data) | Retained (resource-policy: keep) | kubectl delete pvc valkey-data -n kubernaut-system |
| CRDs (9 definitions) | Retained (standard Helm behavior) | kubectl delete crd <name>.kubernaut.ai for each CRD |
| CR instances | Retained until CRDs are deleted | Deleted when parent CRD is deleted |
| Hook ClusterRole/CRB | Retained (hook resources not tracked by Helm) | kubectl delete clusterrole kubernaut-hook-role --ignore-not-found and kubectl delete clusterrolebinding kubernaut-hook-rolebinding --ignore-not-found |
| TLS Secret and CA ConfigMap | Deleted by post-delete hook (hook mode) or by cert-manager (cert-manager mode) | -- |
| Cluster-scoped RBAC | Deleted by Helm | -- |
| kubernaut-workflows namespace | Deleted by Helm | May get stuck if it contains active Jobs; see below |
If the kubernaut-workflows namespace gets stuck in Terminating state:
Full cleanup¶
To remove everything including persistent data:
helm uninstall kubernaut -n kubernaut-system
# Remove PVCs retained by resource policy
kubectl delete pvc postgresql-data valkey-data -n kubernaut-system
# Remove hook-created cluster resources (not tracked by Helm)
kubectl delete clusterrole kubernaut-hook-role --ignore-not-found
kubectl delete clusterrolebinding kubernaut-hook-rolebinding --ignore-not-found
# Remove CRDs and all CR instances
kubectl delete crd actiontypes.kubernaut.ai aianalyses.kubernaut.ai \
effectivenessassessments.kubernaut.ai notificationrequests.kubernaut.ai \
remediationapprovalrequests.kubernaut.ai remediationrequests.kubernaut.ai \
remediationworkflows.kubernaut.ai signalprocessings.kubernaut.ai \
workflowexecutions.kubernaut.ai
kubectl delete namespace kubernaut-system
Known Limitations¶
- Single installation per cluster: Cluster-scoped resources (ClusterRoles, ClusterRoleBindings, WebhookConfigurations) use static names. Installing multiple releases in different namespaces will cause conflicts.
- Init container timeouts: The wait-for-postgres init containers in DataStorage and the migration Job have no timeout. If PostgreSQL is unavailable, these containers will block indefinitely.
Next Steps¶
- Quickstart -- Trigger your first automated remediation
- Architecture Overview -- Understand how the services work together
- Configuration Reference -- Tune Kubernaut for your environment
- Rego Policies -- Customize classification and approval policies
- Workflows -- Author and register remediation workflows