Installation¶

Deployment Methods¶

Kubernaut offers two deployment methods:

Method	Use Case	Platform
Kubernaut Operator	Production — full lifecycle management, OLM integration, status reporting	OpenShift 4.18+
Helm Chart	Development, testing, CI — quick setup for evaluation and local development	Any Kubernetes 1.32+

Production deployments

The Kubernaut Operator is the only supported production deployment method. The Helm chart does not provide lifecycle management, status reporting, or OLM integration required for production operations. Use the Helm chart for development, testing, and CI environments only.

Kubernaut Operator (Production)¶

The Kubernaut Operator manages the full lifecycle of the Kubernaut platform on OpenShift: secret validation, database migrations, CRD installation, deployment of all 10 microservices, RBAC, NetworkPolicies, OCP Routes, and status reporting. It is a singleton — one Kubernaut CR named kubernaut per cluster.

Installation¶

The operator is available through OLM (Operator Lifecycle Manager) or direct deployment:

OperatorHub (recommended) — Install from the OperatorHub catalog in the OpenShift Console
Custom CatalogSource — For disconnected or custom environments, create a CatalogSource pointing to the operator index image

For complete installation instructions, see the Kubernaut Operator Installation Guide.

Prerequisites (Operator)¶

Requirement	Version	Notes
OpenShift	4.18+	OLM and operator-framework support required
PostgreSQL	15+	BYO — the operator does not deploy a database; provide connection details via `spec.postgresql`
Valkey / Redis	7+	BYO — provide connection details via `spec.valkey`
LLM provider	—	Any supported provider with JSON structured output

Operator image: quay.io/kubernaut-ai/kubernaut-operator:1.4.0 (note: no v prefix, unlike component images which use v1.4.0).

Minimal Kubernaut CR:

apiVersion: kubernaut.ai/v1alpha1
kind: Kubernaut
metadata:
  name: kubernaut
  namespace: kubernaut-system
spec:
  postgresql:
    host: postgres.database.svc.cluster.local
    secretName: kubernaut-postgresql
  valkey:
    host: valkey.cache.svc.cluster.local
    secretName: kubernaut-valkey
  kubernautAgent:
    llm:
      provider: openai
      model: gpt-4o
      credentialsSecretName: kubernaut-llm

Disconnected installs

For air-gapped environments, mirror all component images and set RELATED_IMAGE_* environment variables on the operator Deployment. See the operator installation guide for the full image list.

What the Operator manages¶

Validates BYO PostgreSQL and Valkey secrets before deployment
Runs embedded database schema migrations
Installs and upgrades the 9 Kubernaut workload CRDs
Deploys all 10 microservices with RBAC, ConfigMaps, PDBs, admission webhooks, and NetworkPolicies
Configures OCP Routes and service-serving CA TLS
Reports per-service readiness status on the Kubernaut CR
Cleans up cluster-scoped RBAC and workflow namespace on CR deletion (workload CRDs are retained by design)

Helm Chart (Development/Testing)¶

This section walks you through installing Kubernaut using the Helm chart for development, testing, and CI environments.

Not for production

The Helm chart is intended for development, testing, and CI. For production deployments, use the Kubernaut Operator.

Prerequisites¶

Requirement	Version	Notes
Kubernetes	1.32+	selectableFields GA in 1.32; required for CRD field selectors
Helm	3.12+	--
StorageClass	dynamic provisioning	For PostgreSQL and Valkey PVCs
cert-manager	1.12+ (optional)	Required when `tls.mode=cert-manager`. Optional for dev (`tls.mode=hook` is default).
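The version minimums in the table above can be checked before installing. This is a minimal sketch, not part of the chart: `version_ge` is a hypothetical helper name, and the commented usage assumes `kubectl` and `helm` are on the PATH and pointed at the target cluster.

```shell
# version_ge HAVE WANT: succeed when dotted version HAVE is at least WANT.
# A leading "v" and any suffix such as "+k3s1" or "-rc.1" are ignored.
version_ge() {
  awk -v have="$1" -v want="$2" 'BEGIN {
    sub(/^v/, "", have); sub(/[^0-9.].*$/, "", have)
    sub(/^v/, "", want); sub(/[^0-9.].*$/, "", want)
    n = split(have, h, "."); m = split(want, w, ".")
    for (i = 1; i <= n || i <= m; i++) {
      a = (i <= n) ? h[i] + 0 : 0   # missing components count as 0
      b = (i <= m) ? w[i] + 0 : 0
      if (a > b) exit 0
      if (a < b) exit 1
    }
    exit 0
  }'
}

# Example usage (commented out because it needs a live cluster):
# k8s=$(kubectl get --raw /version | sed -n 's/.*"gitVersion": *"\([^"]*\)".*/\1/p')
# version_ge "$k8s" 1.32 || echo "Kubernetes $k8s is older than 1.32" >&2
# version_ge "$(helm version --template '{{.Version}}')" 3.12 || echo "Helm is older than 3.12" >&2
```

The comparison is numeric per component, so `3.9` is correctly treated as older than `3.12`.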

LLM provider (required for AI investigation):

Any supported provider with JSON structured output support (response_format: json_object or equivalent). Kubernaut Agent (KA) enables JSON mode on all LLM requests, so models that do not support it will produce parse failures.

Workflow execution engine (at least one):

Kubernetes Jobs (built-in, no extra dependency)
Tekton Pipelines (optional)
Ansible Automation Platform (AAP) / AWX (optional)

External monitoring (recommended):

kube-prometheus-stack provides:
Alert-based signal ingestion (AlertManager sends alerts to Gateway)
Metrics enrichment for effectiveness assessments (Prometheus queries)
Alert resolution checks (AlertManager API)
Metrics scraping for all Kubernaut services (all pods expose /metrics)

Prometheus and AlertManager integration is disabled by default. To enable effectiveness assessments based on alert resolution and metric queries, set effectivenessmonitor.external.prometheusEnabled=true and effectivenessmonitor.external.alertManagerEnabled=true.
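As an illustration, the two switches can be collected into a small helper that prints the matching `--set` flags. This is a sketch under assumptions: `em_monitoring_flags` is a hypothetical name, and the optional URL arguments map to the `effectivenessmonitor.external.prometheusUrl` and `effectivenessmonitor.external.alertManagerUrl` overrides listed under Prometheus and AlertManager.

```shell
# em_monitoring_flags [PROMETHEUS_URL] [ALERTMANAGER_URL]
# Prints one --set flag per line; URL overrides are emitted only when given.
em_monitoring_flags() {
  echo "--set effectivenessmonitor.external.prometheusEnabled=true"
  echo "--set effectivenessmonitor.external.alertManagerEnabled=true"
  if [ -n "${1:-}" ]; then
    echo "--set effectivenessmonitor.external.prometheusUrl=$1"
  fi
  if [ -n "${2:-}" ]; then
    echo "--set effectivenessmonitor.external.alertManagerUrl=$2"
  fi
}

# Example usage (relies on word splitting, so URLs must not contain spaces):
# helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
#   --namespace kubernaut-system $(em_monitoring_flags)
```

Calling the helper without arguments keeps the default service URLs and only turns the integration on.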

Infrastructure Setup¶

Complete these steps before installing the Kubernaut chart.

Storage¶

PostgreSQL and Valkey each require a PersistentVolumeClaim for data persistence:

Component	PVC Name	Default Size	Values
PostgreSQL	`postgresql-data`	`10Gi`	`postgresql.storage.size`, `postgresql.storage.storageClassName`
Valkey	`valkey-data`	`512Mi`	`valkey.storage.size`, `valkey.storage.storageClassName`

Both PVCs are annotated with helm.sh/resource-policy: keep so data survives helm uninstall.

If the cluster has no default StorageClass, set storageClassName explicitly:

postgresql:
  storage:
    size: 50Gi
    storageClassName: gp3-encrypted
valkey:
  storage:
    storageClassName: gp3-encrypted

To skip the in-chart databases entirely and use external instances, set postgresql.enabled=false and/or valkey.enabled=false, then configure the postgresql.host and valkey.host values described in the Configuration Reference.
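A values overlay for external instances could be generated as follows. This is a sketch only: `external_db_values` is a hypothetical helper, the host names in the usage line are placeholders, and the Configuration Reference may require further keys (ports, secret names) that are not shown here.

```shell
# external_db_values PG_HOST VALKEY_HOST: print a values overlay that disables
# the in-chart databases and points Kubernaut at external instances.
external_db_values() {
  cat <<EOF
postgresql:
  enabled: false
  host: $1
valkey:
  enabled: false
  host: $2
EOF
}

# Example usage (host names are placeholders):
# external_db_values pg.example.internal valkey.example.internal > external-db.yaml
# helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
#   --namespace kubernaut-system --values external-db.yaml
```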

Prometheus and AlertManager¶

Kubernaut integrates with Prometheus and AlertManager at two levels:

1. EffectivenessMonitor queries -- EM queries Prometheus for metric-based assessment enrichment and AlertManager for alert resolution checks. The expected service endpoints (configurable):

Service	Default URL	Override
Prometheus	`http://kube-prometheus-stack-prometheus.monitoring.svc:9090`	`effectivenessmonitor.external.prometheusUrl`
AlertManager	`http://kube-prometheus-stack-alertmanager.monitoring.svc:9093`	`effectivenessmonitor.external.alertManagerUrl`

2. AlertManager sends alerts to Gateway -- The Gateway authenticates every signal ingestion request using Kubernetes TokenReview + SubjectAccessReview (SAR). AlertManager must include a bearer token in its webhook requests. See Signal Source Authentication below for the full configuration.

Signal Source Authentication¶

The Gateway authenticates every signal ingestion request using Kubernetes TokenReview + SubjectAccessReview (SAR). Signal sources (e.g., AlertManager) must present a valid ServiceAccount bearer token, and that ServiceAccount must have RBAC permission to submit signals.

The chart provides a gateway-signal-source ClusterRole that grants create on the gateway-service resource. Each entry in gateway.auth.signalSources creates a ClusterRoleBinding that binds this role to the specified ServiceAccount.

See Security & RBAC for the full TokenReview + SAR flow, Gateway RBAC details, and the gateway-signal-source ClusterRole definition.

Configuring AlertManager¶

AlertManager must include http_config.bearer_token_file in its webhook receiver so the Gateway can authenticate the request. The Gateway service is gateway-service on port 8080, and the AlertManager adapter path is /api/v1/signals/prometheus.

# alertmanager.yml (standalone)
receivers:
  - name: kubernaut
    webhook_configs:
      - url: "https://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
        send_resolved: true
        http_config:
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

route:
  routes:
    - receiver: kubernaut
      matchers:
        - alertname!=""
      continue: true

For kube-prometheus-stack, configure via Helm values:

# kube-prometheus-stack values.yaml
alertmanager:
  config:
    receivers:
      - name: kubernaut
        webhook_configs:
          - url: "https://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
            send_resolved: true
            http_config:
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    route:
      routes:
        - receiver: kubernaut
          matchers:
            - alertname!=""
          continue: true

Then register AlertManager's ServiceAccount as an authorized signal source in your Kubernaut values:

# kubernaut values.yaml
gateway:
  auth:
    signalSources:
      - name: alertmanager
        serviceAccount: alertmanager-kube-prometheus-stack-alertmanager
        namespace: monitoring

Warning

Without bearer_token_file, AlertManager sends unauthenticated requests and the Gateway rejects them with 401 Unauthorized. Without the signalSources entry, the token is valid but SAR denies access with 403 Forbidden.
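The two failure modes in the warning can be told apart with a manual probe. The sketch below is an assumption-laden aid, not a supported tool: `explain_gateway_status` and `probe_gateway` are hypothetical names, the probe must run somewhere that can resolve the in-cluster service DNS name, `-k` skips TLS verification for development use, and how the Gateway responds to an empty JSON body once authentication passes is not documented here.

```shell
# explain_gateway_status HTTP_CODE: map a status code to the likely cause
# described in the warning above.
explain_gateway_status() {
  case "$1" in
    401) echo "unauthenticated: check http_config.bearer_token_file" ;;
    403) echo "forbidden: add the ServiceAccount to gateway.auth.signalSources" ;;
    *)   echo "not an authentication or authorization rejection (HTTP $1)" ;;
  esac
}

# probe_gateway SERVICE_ACCOUNT NAMESPACE: request a short-lived token for the
# ServiceAccount, send an empty JSON body, and print the HTTP status code.
probe_gateway() {
  local token
  token=$(kubectl create token "$1" -n "$2") || return 1
  curl -sk -o /dev/null -w '%{http_code}' \
    -H "Authorization: Bearer $token" \
    -H 'Content-Type: application/json' \
    -d '{}' \
    "https://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
}

# Example usage (commented out because it needs cluster access):
# explain_gateway_status "$(probe_gateway alertmanager-kube-prometheus-stack-alertmanager monitoring)"
```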

Pre-Installation¶

Kubernaut uses 9 Custom Resource Definitions. Helm installs them automatically from the chart's crds/ directory on first install -- no manual step is needed. For reinstalls, see Reinstalling.

1. Create the Namespace¶

kubectl create namespace kubernaut-system

2. Provision Secrets¶

All required secrets must be pre-created before running helm install. The chart validates the presence of database and cache secrets at template time and fails with a descriptive error if any are missing.

PostgreSQL + DataStorage (consolidated secret)¶

PostgreSQL and DataStorage share a single secret. The db-secrets.yaml key must use the same password as POSTGRES_PASSWORD to avoid authentication mismatches.

PG_PASSWORD=$(openssl rand -base64 24)
kubectl create secret generic postgresql-secret \
  --from-literal=POSTGRES_USER=slm_user \
  --from-literal=POSTGRES_PASSWORD="$PG_PASSWORD" \
  --from-literal=POSTGRES_DB=action_history \
  --from-literal=db-secrets.yaml="$(printf 'username: slm_user\npassword: %s' "$PG_PASSWORD")" \
  -n kubernaut-system
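To confirm that the two copies of the password agree after the secret is created, a check along these lines may help. It is a sketch: `yaml_password` and `pg_secret_consistent` are hypothetical helper names, and the parsing assumes the simple two-line `db-secrets.yaml` layout produced by the command above.

```shell
# yaml_password: read a small "key: value" document on stdin and print the
# value of its "password" key.
yaml_password() {
  sed -n 's/^password:[[:space:]]*//p'
}

# pg_secret_consistent [SECRET] [NAMESPACE]: succeed when the password inside
# db-secrets.yaml equals POSTGRES_PASSWORD in the same secret.
pg_secret_consistent() {
  local secret=${1:-postgresql-secret}
  local ns=${2:-kubernaut-system}
  local env_pw file_pw
  env_pw=$(kubectl get secret "$secret" -n "$ns" \
    -o 'jsonpath={.data.POSTGRES_PASSWORD}' | base64 --decode) || return 1
  file_pw=$(kubectl get secret "$secret" -n "$ns" \
    -o 'jsonpath={.data.db-secrets\.yaml}' | base64 --decode | yaml_password) || return 1
  [ -n "$env_pw" ] && [ "$env_pw" = "$file_pw" ]
}

# Example usage (commented out because it needs cluster access):
# pg_secret_consistent || echo "password mismatch in postgresql-secret" >&2
```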

Valkey¶

kubectl create secret generic valkey-secret \
  --from-literal=valkey-secrets.yaml="$(printf 'password: %s' "$(openssl rand -base64 24)")" \
  -n kubernaut-system

Required secrets summary¶

Secret Name	Required Keys	Consumed By
`postgresql-secret`	`POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB`, `db-secrets.yaml`	PostgreSQL (env vars), DataStorage (file mount), migration hook
`valkey-secret`	`valkey-secrets.yaml`	DataStorage (file mount)
`llm-credentials`	Provider-specific (see below)	Kubernaut Agent

To use custom secret names for database/cache secrets, pass --set postgresql.auth.existingSecret=<name> and --set valkey.existingSecret=<name> at install time. For LLM credentials, use --set kubernautAgent.llm.credentialsSecretName=<name>.
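Because the install fails when a required secret is absent, it can be convenient to check for all of them first. The following is a minimal sketch; `missing_secrets` is a hypothetical helper, and the secret names in the usage line are the defaults from the table above.

```shell
# missing_secrets NAMESPACE NAME...: print each named secret that does not
# exist in the namespace; succeed only when none are missing.
missing_secrets() {
  local ns=$1 missing=0 name
  shift
  for name in "$@"; do
    if ! kubectl get secret "$name" -n "$ns" >/dev/null 2>&1; then
      echo "$name"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}

# Example usage (commented out because it needs cluster access):
# missing_secrets kubernaut-system postgresql-secret valkey-secret llm-credentials \
#   || echo "create the secrets listed above before running helm install" >&2
```

When custom secret names are used, pass those names instead of the defaults.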

LLM credentials (required for AI analysis)¶

OpenAI / Azure:

kubectl create secret generic llm-credentials \
  --from-literal=OPENAI_API_KEY=sk-... \
  -n kubernaut-system

Google Vertex AI:

kubectl create secret generic llm-credentials \
  --from-file=application_default_credentials.json=path/to/service-account-key.json \
  -n kubernaut-system

Kubernaut Agent auto-detects application_default_credentials.json in the mounted secret and sets GOOGLE_APPLICATION_CREDENTIALS to the mount path at runtime. With GCP Workload Identity the secret can be omitted.

Vertex AI requires an SDK config file

The quickstart --set kubernautAgent.llm.provider=... path only supports OpenAI and Anthropic. Vertex AI requires gcp_project_id and gcp_region, which must be provided via sdkConfigContent or existingSdkConfigMap. See Advanced Configuration and the Vertex AI SDK config example.

Chart Value	Secret Name	Required Keys
`kubernautAgent.llm.credentialsSecretName`	`llm-credentials` (default)	Provider-specific: `OPENAI_API_KEY`, `AZURE_API_KEY`, or `application_default_credentials.json` (file)

Notification credentials (optional, Slack only)¶

kubectl create secret generic slack-webhook \
  --from-literal=webhook-url=https://hooks.slack.com/services/T.../B.../... \
  -n kubernaut-system

Chart Value	Secret Name	Required Keys
`notification.slack.secretName`	`slack-webhook` (example)	`webhook-url`

Only required when Slack delivery is configured. When using console-only routing (default), no notification secret is needed. For advanced multi-receiver routing, use notification.credentials[] and notification.routing.content instead of the Slack shortcut.

Install¶

OCP Helm chart deprecated — use the Kubernaut Operator (v1.4)

The OpenShift-specific Helm chart path is deprecated as of v1.4 (#848). For OpenShift production deployments, use the Kubernaut Operator instead. The Helm chart examples below for OpenShift are provided for development and testing convenience only.

NetworkPolicies (v1.4)

Kubernaut v1.4 deploys NetworkPolicies for all services with a default-deny ingress posture. Your cluster's CNI plugin must support NetworkPolicy enforcement (Calico, Cilium, etc.) — clusters without enforcement silently ignore them. Disable per-service with networkPolicies.<service>.enabled: false. See Security & RBAC: NetworkPolicies for details.

The chart is distributed as an OCI artifact. With the namespace and secrets provisioned in Pre-Installation, install using helm install:

Kind / Vanilla Kubernetes¶

helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
  --namespace kubernaut-system \
  --set kubernautAgent.llm.provider=openai \
  --set kubernautAgent.llm.model=gpt-4o

OpenShift (OCP)¶

OpenShift requires additional configuration: cert-manager TLS mode, OCP monitoring endpoints (TLS + service-serving CA), and the Red Hat PostgreSQL image. Download the OCP values overlay from the kubernaut-demo-scenarios repository and layer it on top:

helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
  --namespace kubernaut-system \
  --values kubernaut-ocp-values.yaml \
  --set kubernautAgent.llm.provider=openai \
  --set kubernautAgent.llm.model=gpt-4o

The OCP values overlay configures:

Setting	Value
TLS mode	`cert-manager` with a `selfsigned-issuer` ClusterIssuer
Signal source	`alertmanager-main` in `openshift-monitoring`
Prometheus URL	`https://prometheus-k8s.openshift-monitoring.svc:9091` (TLS)
AlertManager URL	`https://alertmanager-main.openshift-monitoring.svc:9094` (TLS)
PostgreSQL image	`registry.redhat.io/rhel10/postgresql-16`

See the kubernaut-ocp-values.yaml reference file for the full configuration.

Disconnected / air-gapped clusters

If your OCP cluster has no internet access, see the Disconnected Installation Guide for mirroring images and configuring the chart for offline use.

Advanced Configuration¶

For advanced LLM configurations (Vertex AI, local models) or custom Rego policies, use --set-file to inject configuration files:

helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
  --namespace kubernaut-system \
  --set-file kubernautAgent.sdkConfigContent=my-sdk-config.yaml \
  --set-file aianalysis.policies.content=my-approval.rego \
  --set-file signalprocessing.policy=my-policy.rego

See the sdk-config.yaml.example for a reference SDK config covering Vertex AI, Anthropic, OpenAI, and local models.

To pin a specific chart version, add --version <version>. Omitting --version pulls the latest release.
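For repeatable environments such as CI, pinning can be enforced with a thin wrapper that refuses to run without a version. This is a sketch: `install_kubernaut` is a hypothetical function, and the version you pass must be one that is actually published for the chart.

```shell
# install_kubernaut CHART_VERSION [HELM_ARGS...]: run helm install with an
# explicit chart version so repeated installs resolve to the same release.
install_kubernaut() {
  if [ -z "${1:-}" ]; then
    echo "usage: install_kubernaut CHART_VERSION [HELM_ARGS...]" >&2
    return 2
  fi
  local version=$1
  shift
  helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
    --namespace kubernaut-system \
    --version "$version" \
    "$@"
}

# Example usage (replace <version> with a published chart version):
# install_kubernaut <version> \
#   --set kubernautAgent.llm.provider=openai \
#   --set kubernautAgent.llm.model=gpt-4o
```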

Start with minimal toolsets

The default SDK config ships with toolsets: {} (no optional toolsets). This is the recommended starting point — the Kubernetes core toolset is always available and handles most incident types (CrashLoopBackOff, config errors, OOMKilled). Enable additional toolsets like prometheus/metrics only for workloads that require metric-driven investigation. Unused toolsets add ~30% token overhead per investigation. See Toolset Optimization for details.

Quickstart¶

For a complete end-to-end demo environment (Kind cluster, monitoring stack, Kubernaut, infrastructure dependencies, workflow catalog), use the kubernaut-demo-scenarios repository:

git clone https://github.com/jordigilh/kubernaut-demo-scenarios.git
cd kubernaut-demo-scenarios

# Configure your LLM provider
export KUBERNAUT_LLM_PROVIDER=openai
export KUBERNAUT_LLM_MODEL=gpt-4o

# Create the full demo environment (~10 minutes)
./scripts/setup-demo-cluster.sh

The setup script creates a Kind cluster, installs Prometheus/Grafana, deploys Kubernaut, and installs infrastructure dependencies (cert-manager, metrics-server, Istio, blackbox-exporter). Gitea and ArgoCD are installed automatically when running GitOps scenarios. See the demo scenarios README for all options including OCP support and advanced LLM configuration.

Post-Install Verification¶

# All pods should be 1/1 Running (readiness probes confirm service health)
kubectl get pods -n kubernaut-system

# Verify workflow catalog
kubectl get remediationworkflows -A
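In scripts, the visual check can be replaced with a wait followed by a filter for pods that are still not ready. This is a sketch under assumptions: `not_ready_pods` is a hypothetical helper, the pod names in any sample output are illustrative, and the commented `kubectl wait` line assumes the services run as Deployments.

```shell
# not_ready_pods: read `kubectl get pods --no-headers` output on stdin and
# print pods whose READY column is not N/N, skipping Completed pods
# (for example, finished hook Jobs).
not_ready_pods() {
  awk '$3 != "Completed" { split($2, r, "/"); if (r[1] != r[2]) print $1 }'
}

# Example usage (commented out because it needs cluster access):
# kubectl wait --for=condition=Available deployment --all \
#   -n kubernaut-system --timeout=300s
# kubectl get pods -n kubernaut-system --no-headers | not_ready_pods
```

An empty result from the filter means every running pod reports all of its containers as ready.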

Post-Installation¶

Action Types and Workflows¶

Kubernaut uses an ActionType taxonomy to organize remediation capabilities. Operators register ActionType CRDs that describe what each remediation does, when to use it, and under what preconditions. RemediationWorkflow CRDs reference ActionTypes by name.

Register your own ActionType CRs and RemediationWorkflows to build a catalog tailored to your environment. See Authoring Workflows for guidelines and the Action Type reference for registration details.

Resource Scope¶

After installation, Kubernaut only manages namespaces and resources that opt in via labels:

kubectl label namespace my-app kubernaut.ai/managed=true

See Signals & Alert Routing for details on scope management.
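When several namespaces need to opt in, the label command can be looped. A minimal sketch follows; `manage_namespaces` is a hypothetical helper and the namespace names in the usage line are placeholders.

```shell
# manage_namespaces NAMESPACE...: apply the opt-in label to each namespace.
# --overwrite makes the call safe to repeat on already-labeled namespaces.
manage_namespaces() {
  local ns
  for ns in "$@"; do
    kubectl label namespace "$ns" kubernaut.ai/managed=true --overwrite || return 1
  done
}

# Example usage (commented out because it needs cluster access):
# manage_namespaces my-app my-other-app
# kubectl get namespaces -l kubernaut.ai/managed=true
```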

Reinstalling¶

Kubernaut does not support in-place upgrades. To move to a new version, perform a fresh install. See What's New for changes between releases.

Uninstalling¶

helm uninstall kubernaut -n kubernaut-system

What is retained after uninstall¶

Resource	Behavior	Manual cleanup
PostgreSQL PVC (`postgresql-data`)	Retained (`resource-policy: keep`)	`kubectl delete pvc postgresql-data -n kubernaut-system`
Valkey PVC (`valkey-data`)	Retained (`resource-policy: keep`)	`kubectl delete pvc valkey-data -n kubernaut-system`
CRDs (9 definitions)	Retained (standard Helm behavior)	`kubectl delete crd <name>.kubernaut.ai` for each CRD
CR instances	Retained until CRDs are deleted	Deleted when parent CRD is deleted
Hook ClusterRole/CRB	Retained (hook resources not tracked by Helm)	`kubectl delete clusterrole kubernaut-hook-role --ignore-not-found` and `kubectl delete clusterrolebinding kubernaut-hook-rolebinding --ignore-not-found`
TLS Secret and CA ConfigMap	Deleted by post-delete hook (`hook` mode) or by cert-manager (`cert-manager` mode)	--
Cluster-scoped RBAC	Deleted by Helm	--
`kubernaut-workflows` namespace	Deleted by Helm	May get stuck if it contains active Jobs; see below

If the kubernaut-workflows namespace gets stuck in Terminating state:

kubectl get all -n kubernaut-workflows
kubectl delete jobs --all -n kubernaut-workflows
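`kubectl get all` covers only a subset of resource kinds, so a namespace can stay in Terminating while that command shows nothing. The sketch below lists every remaining namespaced object by iterating over the API resources; `remaining_resources` is a hypothetical helper name.

```shell
# remaining_resources [NAMESPACE]: list every object left in the namespace,
# across all listable namespaced resource kinds.
remaining_resources() {
  local ns=${1:-kubernaut-workflows} kind
  kubectl api-resources --verbs=list --namespaced -o name | while read -r kind; do
    # Kinds that cannot be listed (for example, forbidden ones) are skipped.
    kubectl get "$kind" -n "$ns" --ignore-not-found -o name 2>/dev/null || true
  done
}

# Example usage (commented out because it needs cluster access):
# remaining_resources kubernaut-workflows
```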

Full cleanup¶

To remove everything including persistent data:

helm uninstall kubernaut -n kubernaut-system

# Remove PVCs retained by resource policy
kubectl delete pvc postgresql-data valkey-data -n kubernaut-system

# Remove hook-created cluster resources (not tracked by Helm)
kubectl delete clusterrole kubernaut-hook-role --ignore-not-found
kubectl delete clusterrolebinding kubernaut-hook-rolebinding --ignore-not-found

# Remove CRDs and all CR instances
kubectl delete crd actiontypes.kubernaut.ai aianalyses.kubernaut.ai \
  effectivenessassessments.kubernaut.ai notificationrequests.kubernaut.ai \
  remediationapprovalrequests.kubernaut.ai remediationrequests.kubernaut.ai \
  remediationworkflows.kubernaut.ai signalprocessings.kubernaut.ai \
  workflowexecutions.kubernaut.ai

kubectl delete namespace kubernaut-system
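After a full cleanup, a quick check for leftovers can confirm that nothing in the kubernaut.ai API group remains. This is a sketch; `kubernaut_crds` is a hypothetical helper that filters `kubectl get crd -o name` output.

```shell
# kubernaut_crds: filter `kubectl get crd -o name` output on stdin down to
# CRDs in the kubernaut.ai group. Prints nothing when none remain.
kubernaut_crds() {
  grep '\.kubernaut\.ai$' || true
}

# Example usage (commented out because it needs cluster access):
# kubectl get crd -o name | kubernaut_crds
# kubectl get clusterrole,clusterrolebinding -o name | grep kubernaut
```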

Known Limitations¶

Single installation per cluster: Cluster-scoped resources (ClusterRoles, ClusterRoleBindings, WebhookConfigurations) use static names. Installing multiple releases in different namespaces will cause conflicts.
Init container timeouts: The wait-for-postgres init containers in DataStorage and the migration Job have no timeout. If PostgreSQL is unavailable, these containers will block indefinitely.
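When using an external PostgreSQL instance, one way to avoid a blocked init container is to confirm the database is reachable before installing. This is a sketch, not a chart feature: `wait_until` is a hypothetical helper, the host name is a placeholder, and the usage line assumes the PostgreSQL client tool `pg_isready` is installed locally.

```shell
# wait_until DEADLINE_SECONDS COMMAND...: retry COMMAND every 2 seconds until
# it succeeds or the deadline passes; fail if the deadline is reached.
wait_until() {
  local deadline=$(( $(date +%s) + $1 ))
  shift
  until "$@"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep 2
  done
}

# Example usage (host name is a placeholder):
# wait_until 120 pg_isready -h pg.example.internal -p 5432 \
#   || { echo "PostgreSQL not reachable; aborting install" >&2; exit 1; }
```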

Next Steps¶

Quickstart -- Trigger your first automated remediation
Architecture Overview -- Understand how the services work together
Configuration Reference -- Tune Kubernaut for your environment
Rego Policies -- Customize classification and approval policies
Workflows -- Author and register remediation workflows