Installation

This guide walks you through installing Kubernaut on a Kubernetes cluster using Helm.

Prerequisites

| Requirement | Version | Notes |
|---|---|---|
| Kubernetes | 1.32+ | selectableFields GA in 1.32; required for CRD field selectors |
| Helm | 3.12+ | |
| StorageClass | dynamic provisioning | For PostgreSQL and Valkey PVCs |
| cert-manager | 1.12+ (production) | Required when tls.mode=cert-manager. Optional for dev (tls.mode=hook is default). |

Workflow execution engine (at least one):

  • Kubernetes Jobs (built-in, no extra dependency)
  • Tekton Pipelines (optional)
  • Ansible Automation Platform (AAP) / AWX (optional)

External monitoring (recommended):

  • kube-prometheus-stack provides:
      • Alert-based signal ingestion (AlertManager sends alerts to Gateway)
      • Metrics enrichment for effectiveness assessments (Prometheus queries)
      • Alert resolution checks (AlertManager API)
      • Metrics scraping for all Kubernaut services (all pods expose /metrics)

Prometheus and AlertManager integration is disabled by default. To enable effectiveness assessments based on alert resolution and metric queries, set effectivenessmonitor.external.prometheusEnabled=true and effectivenessmonitor.external.alertManagerEnabled=true.
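
Expressed as a values file instead of --set flags, the two toggles above look like this (a minimal sketch; only the flag names come from this guide):

```yaml
# values override enabling effectiveness assessments
effectivenessmonitor:
  external:
    prometheusEnabled: true
    alertManagerEnabled: true
```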

Infrastructure Setup

Complete these steps before installing the Kubernaut chart.

Storage

PostgreSQL and Valkey each require a PersistentVolumeClaim for data persistence:

| Component | PVC Name | Default Size | Values |
|---|---|---|---|
| PostgreSQL | postgresql-data | 10Gi | postgresql.storage.size, postgresql.storage.storageClassName |
| Valkey | valkey-data | 512Mi | valkey.storage.size, valkey.storage.storageClassName |

Both PVCs are annotated with helm.sh/resource-policy: keep so data survives helm uninstall.

If the cluster has no default StorageClass, set storageClassName explicitly:

postgresql:
  storage:
    size: 50Gi
    storageClassName: gp3-encrypted
valkey:
  storage:
    storageClassName: gp3-encrypted

To skip in-chart databases entirely and use external instances, set postgresql.enabled=false and/or valkey.enabled=false and configure postgresql.host/valkey.host values in the Configuration Reference.
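
As a values-file sketch (hostnames are placeholders; the value paths are the ones quoted above, and the Configuration Reference is authoritative):

```yaml
# Disable in-chart databases and point at external instances
postgresql:
  enabled: false
  host: postgres.example.internal   # placeholder hostname
valkey:
  enabled: false
  host: valkey.example.internal     # placeholder hostname
```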

Prometheus and AlertManager

Kubernaut integrates with Prometheus and AlertManager at two levels:

1. EffectivenessMonitor queries -- EM queries Prometheus for metric-based assessment enrichment and AlertManager for alert resolution checks. The expected service endpoints (configurable):

| Service | Default URL | Override |
|---|---|---|
| Prometheus | http://kube-prometheus-stack-prometheus.monitoring.svc:9090 | effectivenessmonitor.external.prometheusUrl |
| AlertManager | http://kube-prometheus-stack-alertmanager.monitoring.svc:9093 | effectivenessmonitor.external.alertManagerUrl |

2. AlertManager sends alerts to Gateway -- AlertManager must include a bearer token in its webhook requests; the Gateway authenticates each one via TokenReview + SubjectAccessReview (SAR). See Signal Source Authentication below for the full configuration.

Signal Source Authentication

The Gateway authenticates every signal ingestion request using Kubernetes TokenReview + SubjectAccessReview (SAR). Signal sources (e.g., AlertManager) must present a valid ServiceAccount bearer token, and that ServiceAccount must have RBAC permission to submit signals.

The chart provides a gateway-signal-source ClusterRole that grants create on the gateway-service resource. Each entry in gateway.auth.signalSources creates a ClusterRoleBinding binding this role to the specified ServiceAccount.

See Security & RBAC for the full TokenReview + SAR flow, Gateway RBAC details, and the gateway-signal-source ClusterRole definition.

Configuring AlertManager

AlertManager must include http_config.bearer_token_file in its webhook receiver so the Gateway can authenticate the request. The Gateway service is gateway-service on port 8080, and the AlertManager adapter path is /api/v1/signals/prometheus.

# alertmanager.yml (standalone)
receivers:
  - name: kubernaut
    webhook_configs:
      - url: "http://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
        send_resolved: true
        http_config:
          bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token

route:
  routes:
    - receiver: kubernaut
      matchers:
        - alertname!=""
      continue: true

For kube-prometheus-stack, configure via Helm values:

# kube-prometheus-stack values.yaml
alertmanager:
  config:
    receivers:
      - name: kubernaut
        webhook_configs:
          - url: "http://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
            send_resolved: true
            http_config:
              bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
    route:
      routes:
        - receiver: kubernaut
          matchers:
            - alertname!=""
          continue: true

Then register AlertManager's ServiceAccount as an authorized signal source in your Kubernaut values:

# kubernaut values.yaml
gateway:
  auth:
    signalSources:
      - name: alertmanager
        serviceAccount: alertmanager-kube-prometheus-stack-alertmanager
        namespace: monitoring

Warning

Without bearer_token_file, AlertManager sends unauthenticated requests and the Gateway rejects them with 401 Unauthorized. Without the signalSources entry, the token is valid but SAR denies access with 403 Forbidden.

Pre-Installation

Kubernaut uses 9 Custom Resource Definitions. Helm installs them automatically from the chart's crds/ directory on first install -- no manual step is needed. For upgrades, see Upgrading.

1. Create the Namespace

kubectl create namespace kubernaut-system

2. Provision Secrets

All required secrets must be pre-created before running helm install. The chart validates the presence of database and cache secrets at template time and fails with a descriptive error if any are missing.

PostgreSQL + DataStorage (consolidated secret)

PostgreSQL and DataStorage share a single secret. The db-secrets.yaml key must use the same password as POSTGRES_PASSWORD to avoid authentication mismatches.

PG_PASSWORD=$(openssl rand -base64 24)
kubectl create secret generic postgresql-secret \
  --from-literal=POSTGRES_USER=slm_user \
  --from-literal=POSTGRES_PASSWORD="$PG_PASSWORD" \
  --from-literal=POSTGRES_DB=action_history \
  --from-literal=db-secrets.yaml="$(printf 'username: slm_user\npassword: %s' "$PG_PASSWORD")" \
  -n kubernaut-system
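
The same-password requirement can be sanity-checked locally before creating the secret (pure shell, no cluster access; `slm_user` as in the command above):

```shell
# Build the db-secrets.yaml payload and confirm it embeds the same
# password that will be set as POSTGRES_PASSWORD.
PG_PASSWORD=$(openssl rand -base64 24)
DB_SECRETS_YAML=$(printf 'username: slm_user\npassword: %s' "$PG_PASSWORD")

# Extract the password line back out and compare.
EMBEDDED=$(printf '%s\n' "$DB_SECRETS_YAML" | sed -n 's/^password: //p')
[ "$EMBEDDED" = "$PG_PASSWORD" ] && echo "db-secrets.yaml matches POSTGRES_PASSWORD"
```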

Valkey

kubectl create secret generic valkey-secret \
  --from-literal=valkey-secrets.yaml="$(printf 'password: %s' "$(openssl rand -base64 24)")" \
  -n kubernaut-system

Required secrets summary

| Secret Name | Required Keys | Consumed By |
|---|---|---|
| postgresql-secret | POSTGRES_USER, POSTGRES_PASSWORD, POSTGRES_DB, db-secrets.yaml | PostgreSQL (env vars), DataStorage (file mount), migration hook |
| valkey-secret | valkey-secrets.yaml | DataStorage (file mount) |
| llm-credentials | Provider-specific (see below) | HolmesGPT API |

To use custom secret names for database/cache secrets, pass --set postgresql.auth.existingSecret=<name> and --set valkey.existingSecret=<name> at install time. For LLM credentials, use --set holmesgptApi.llm.credentialsSecretName=<name>.
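
The same overrides can live in a values file instead of --set flags (the secret names here are placeholders; the value paths are the ones quoted above):

```yaml
postgresql:
  auth:
    existingSecret: my-postgres-secret   # placeholder name
valkey:
  existingSecret: my-valkey-secret       # placeholder name
holmesgptApi:
  llm:
    credentialsSecretName: my-llm-secret # placeholder name
```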

LLM credentials (required for AI analysis)

# OpenAI (or other API-key providers)
kubectl create secret generic llm-credentials \
  --from-literal=OPENAI_API_KEY=sk-... \
  -n kubernaut-system

# Alternatively, for Vertex AI (GCP service-account key)
kubectl create secret generic llm-credentials \
  --from-file=application_default_credentials.json=path/to/service-account-key.json \
  -n kubernaut-system

The HolmesGPT API (HAPI) auto-detects application_default_credentials.json in the mounted secret and sets GOOGLE_APPLICATION_CREDENTIALS to the mount path at runtime. With GCP Workload Identity the secret can be omitted.

Vertex AI requires an SDK config file

The quickstart --set holmesgptApi.llm.provider=... path only supports OpenAI and Anthropic. Vertex AI requires gcp_project_id and gcp_region, which must be provided via sdkConfigContent or existingSdkConfigMap. See Advanced Configuration and the Vertex AI SDK config example.
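
A minimal sketch of what such an SDK config carries: only gcp_project_id and gcp_region are named by this guide, so the surrounding structure is an assumption and sdk-config.yaml.example should be treated as authoritative.

```yaml
# my-sdk-config.yaml -- hedged sketch; check sdk-config.yaml.example
# for the real schema. Only the two gcp_* keys are named in this guide.
provider: vertex_ai        # assumed key/value
gcp_project_id: my-project # required for Vertex AI
gcp_region: us-central1    # required for Vertex AI
```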

| Chart Value | Secret Name | Required Keys |
|---|---|---|
| holmesgptApi.llm.credentialsSecretName | llm-credentials (default) | Provider-specific: OPENAI_API_KEY, AZURE_API_KEY, or application_default_credentials.json (file) |

Notification credentials (optional, Slack only)

kubectl create secret generic slack-webhook \
  --from-literal=webhook-url=https://hooks.slack.com/services/T.../B.../... \
  -n kubernaut-system

| Chart Value | Secret Name | Required Keys |
|---|---|---|
| notification.slack.secretName | slack-webhook (example) | webhook-url |

Only required when Slack delivery is configured. When using console-only routing (default), no notification secret is needed. For advanced multi-receiver routing, use notification.credentials[] and notification.routing.content instead of the Slack shortcut.
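
Wiring the secret into the chart is a one-key values fragment (the value path and secret name are the ones documented above):

```yaml
notification:
  slack:
    secretName: slack-webhook
```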

Install

The chart is distributed as an OCI artifact. With the namespace and secrets provisioned in Pre-Installation, install using helm install:

Kind / Vanilla Kubernetes

helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
  --namespace kubernaut-system \
  --set holmesgptApi.llm.provider=openai \
  --set holmesgptApi.llm.model=gpt-4o

OpenShift (OCP)

OpenShift requires additional configuration: cert-manager TLS mode, OCP monitoring endpoints (TLS + service-serving CA), and the Red Hat PostgreSQL image. Download the OCP values overlay from the kubernaut-demo-scenarios repository and layer it on top:

helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
  --namespace kubernaut-system \
  --values kubernaut-ocp-values.yaml \
  --set holmesgptApi.llm.provider=openai \
  --set holmesgptApi.llm.model=gpt-4o

The OCP values overlay configures:

| Setting | Value |
|---|---|
| TLS mode | cert-manager with a selfsigned-issuer ClusterIssuer |
| Signal source | alertmanager-main in openshift-monitoring |
| Prometheus URL | https://prometheus-k8s.openshift-monitoring.svc:9091 (TLS) |
| AlertManager URL | https://alertmanager-main.openshift-monitoring.svc:9094 (TLS) |
| PostgreSQL image | registry.redhat.io/rhel10/postgresql-16 |

See the kubernaut-ocp-values.yaml reference file for the full configuration.

Disconnected / air-gapped clusters

If your OCP cluster has no internet access, see the Disconnected Installation Guide for mirroring images and configuring the chart for offline use.

Advanced Configuration

For advanced LLM configurations (Vertex AI, local models) or custom Rego policies, use --set-file to inject configuration files:

helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
  --namespace kubernaut-system \
  --set-file holmesgptApi.sdkConfigContent=my-sdk-config.yaml \
  --set-file aianalysis.policies.content=my-approval.rego \
  --set-file signalprocessing.policy=my-policy.rego

See the sdk-config.yaml.example for a reference SDK config covering Vertex AI, Anthropic, OpenAI, and local models.

To pin a specific chart version, add --version <version>. Omitting --version pulls the latest release.

Production: disable demo fixtures

The chart seeds demo ActionTypes and RemediationWorkflows by default (demoContent.enabled: true) as a convenience path for getting started quickly. For production deployments where you want only your own workflows, add --set demoContent.enabled=false. See Action Types and Workflows (Demo Content) for details.

Start with minimal toolsets

The default SDK config ships with toolsets: {} (no optional toolsets). This is the recommended starting point — the Kubernetes core toolset is always available and handles most incident types (CrashLoopBackOff, config errors, OOMKilled). Enable additional toolsets like prometheus/metrics only for workloads that require metric-driven investigation. Unused toolsets add ~30% token overhead per investigation. See Toolset Optimization for details.

Quickstart

For a complete end-to-end demo environment (Kind cluster, monitoring stack, Kubernaut, infrastructure dependencies, workflow catalog), use the kubernaut-demo-scenarios repository:

git clone https://github.com/jordigilh/kubernaut-demo-scenarios.git
cd kubernaut-demo-scenarios

# Configure your LLM provider
export KUBERNAUT_LLM_PROVIDER=openai
export KUBERNAUT_LLM_MODEL=gpt-4o

# Create the full demo environment (~10 minutes)
./scripts/setup-demo-cluster.sh

The setup script creates a Kind cluster, installs Prometheus/Grafana, deploys Kubernaut, and installs infrastructure dependencies (cert-manager, metrics-server, Istio, blackbox-exporter). Gitea and ArgoCD are installed automatically when running GitOps scenarios. See the demo scenarios README for all options including OCP support and advanced LLM configuration.

Post-Install Verification

# All pods should be 1/1 Running (readiness probes confirm service health)
kubectl get pods -n kubernaut-system

# Verify workflow catalog
kubectl get remediationworkflows -A

Post-Installation

Action Types and Workflows (Demo Content)

When demoContent.enabled: true (the default), the chart seeds demo ActionType definitions and RemediationWorkflows into the catalog as a convenience path for getting started quickly. These are not built-in product features -- they are reusable demo content covering common remediation scenarios (CrashLoopBackOff rollback, OOM memory increase, GitOps revert, etc.). No manual loading is required.

To disable demo content for production, add --set demoContent.enabled=false during install. See the production tip in the Install section.

Custom Remediation Workflows

Each RemediationWorkflow references an ActionType by name. When demoContent.enabled: true (default), demo ActionTypes are available in the catalog. For production deployments with demoContent.enabled=false, register your own ActionType CRs before creating RemediationWorkflows. See Authoring Workflows for guidelines and the Action Type reference for the full list.

Resource Scope

After installation, Kubernaut only manages namespaces and resources that opt in via labels:

kubectl label namespace my-app kubernaut.ai/managed=true

See Signals & Alert Routing for details on scope management.

Upgrading

See the Upgrading Guide for general upgrade procedures, CRD schema change handling, and version-specific migration notes.

Uninstalling

helm uninstall kubernaut -n kubernaut-system

What is retained after uninstall

| Resource | Behavior | Manual cleanup |
|---|---|---|
| PostgreSQL PVC (postgresql-data) | Retained (resource-policy: keep) | kubectl delete pvc postgresql-data -n kubernaut-system |
| Valkey PVC (valkey-data) | Retained (resource-policy: keep) | kubectl delete pvc valkey-data -n kubernaut-system |
| CRDs (9 definitions) | Retained (standard Helm behavior) | kubectl delete crd <name>.kubernaut.ai for each CRD |
| CR instances | Retained until CRDs are deleted | Deleted when parent CRD is deleted |
| Hook ClusterRole/CRB | Retained (hook resources not tracked by Helm) | kubectl delete clusterrole kubernaut-hook-role --ignore-not-found and kubectl delete clusterrolebinding kubernaut-hook-rolebinding --ignore-not-found |
| TLS Secret and CA ConfigMap | Deleted by post-delete hook (hook mode) or by cert-manager (cert-manager mode) | -- |
| Cluster-scoped RBAC | Deleted by Helm | -- |
| kubernaut-workflows namespace | Deleted by Helm | May get stuck if it contains active Jobs; see below |

If the kubernaut-workflows namespace gets stuck in Terminating state:

kubectl get all -n kubernaut-workflows
kubectl delete jobs --all -n kubernaut-workflows

Full cleanup

To remove everything including persistent data:

helm uninstall kubernaut -n kubernaut-system

# Remove PVCs retained by resource policy
kubectl delete pvc postgresql-data valkey-data -n kubernaut-system

# Remove hook-created cluster resources (not tracked by Helm)
kubectl delete clusterrole kubernaut-hook-role --ignore-not-found
kubectl delete clusterrolebinding kubernaut-hook-rolebinding --ignore-not-found

# Remove CRDs and all CR instances
kubectl delete crd actiontypes.kubernaut.ai aianalyses.kubernaut.ai \
  effectivenessassessments.kubernaut.ai notificationrequests.kubernaut.ai \
  remediationapprovalrequests.kubernaut.ai remediationrequests.kubernaut.ai \
  remediationworkflows.kubernaut.ai signalprocessings.kubernaut.ai \
  workflowexecutions.kubernaut.ai

kubectl delete namespace kubernaut-system

Known Limitations

  • Single installation per cluster: Cluster-scoped resources (ClusterRoles, ClusterRoleBindings, WebhookConfigurations) use static names. Installing multiple releases in different namespaces will cause conflicts.
  • Init container timeouts: The wait-for-postgres init containers in DataStorage and the migration Job have no timeout. If PostgreSQL is unavailable, these containers will block indefinitely.

Next Steps