Quickstart¶
This guide walks you through the CrashLoopBackOff remediation demo — Kubernaut's most common scenario. You'll deploy a healthy workload, inject a bad configuration that causes a crash loop, and watch Kubernaut detect, diagnose, and fix the problem automatically.
Prerequisites¶
| Component | Purpose | How to Set Up |
|---|---|---|
| Kind cluster | Local Kubernetes environment | `kind create cluster` (the demo runner creates one automatically) |
| Kubernaut platform | All services running | Helm chart (see below) |
| Prometheus + AlertManager | Fires the `KubePodCrashLooping` alert | kube-prometheus-stack via Helm |
| kube-state-metrics | Exposes `kube_pod_container_status_restarts_total` | Included in kube-prometheus-stack |
| LLM provider | Root cause analysis | API key in an `llm-credentials` Secret |
| Workflow catalog | `crashloop-rollback-v1` workflow registered | Embedded in chart via `demoContent.enabled` (default) |
Automated Setup¶
The demo runner handles all prerequisites automatically:
```shell
git clone https://github.com/jordigilh/kubernaut-demo-scenarios.git
cd kubernaut-demo-scenarios
./scripts/setup-demo-cluster.sh
./scenarios/crashloop/run.sh
```
This script:
- Creates a Kind cluster (if not running)
- Installs kube-prometheus-stack (Prometheus, AlertManager, kube-state-metrics, Grafana)
- Installs Kubernaut via Helm with demo content (ActionTypes + RemediationWorkflows)
- Runs the full scenario
Manual Setup¶
If you prefer to set up each component yourself:
```shell
# 1. Create Kind cluster
kind create cluster

# 2. Install monitoring stack
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
  --namespace monitoring --create-namespace \
  --wait --timeout 5m

# 3. Create namespace and secrets
kubectl create namespace kubernaut-system

PG_PASSWORD=$(openssl rand -base64 24)
kubectl create secret generic postgresql-secret \
  --from-literal=POSTGRES_USER=slm_user \
  --from-literal=POSTGRES_PASSWORD="$PG_PASSWORD" \
  --from-literal=POSTGRES_DB=action_history \
  --from-literal=db-secrets.yaml="$(printf 'username: slm_user\npassword: %s' "$PG_PASSWORD")" \
  -n kubernaut-system

kubectl create secret generic valkey-secret \
  --from-literal=valkey-secrets.yaml="$(printf 'password: %s' "$(openssl rand -base64 24)")" \
  -n kubernaut-system

kubectl create secret generic llm-credentials \
  --from-literal=OPENAI_API_KEY=sk-... \
  -n kubernaut-system
```
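The `db-secrets.yaml` key above embeds a small YAML document generated by `printf`. Expanded with a placeholder password (the real setup uses the random `openssl` value), it looks like this:

```shell
# Placeholder password for illustration only; step 3 above generates a random
# one with openssl. This is the document stored under the db-secrets.yaml key.
PG_PASSWORD="example-password"
printf 'username: slm_user\npassword: %s\n' "$PG_PASSWORD"
# prints:
#   username: slm_user
#   password: example-password
```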
```shell
# 4. Configure AlertManager to forward alerts to the Gateway.
# The Gateway requires authenticated signal sources. Create a webhook receiver:
kubectl apply -f - <<AMCFG
apiVersion: monitoring.coreos.com/v1alpha1
kind: AlertmanagerConfig
metadata:
  name: kubernaut-webhook
  namespace: monitoring
spec:
  route:
    receiver: kubernaut
    matchers:
      - name: kubernaut_managed
        value: "true"
  receivers:
    - name: kubernaut
      webhookConfigs:
        - url: "http://gateway-service.kubernaut-system.svc.cluster.local:8080/api/v1/signals/prometheus"
          httpConfig:
            authorization:
              type: Bearer
              credentials:
                name: alertmanager-kubernaut-token
                key: token
AMCFG
```

The webhook authenticates with a bearer token read from the `alertmanager-kubernaut-token` Secret (key `token`) in the monitoring namespace. If nothing in your setup creates that Secret, mint a token for the AlertManager ServiceAccount (for example with `kubectl create token`) and store it there.
```shell
# 5. Install Kubernaut with signal source authentication
helm install kubernaut oci://quay.io/kubernaut-ai/charts/kubernaut \
  --namespace kubernaut-system \
  --version 1.1.0 \
  --set holmesgptApi.llm.provider=openai \
  --set holmesgptApi.llm.model=gpt-4o \
  --set gateway.auth.signalSources[0].name=alertmanager \
  --set gateway.auth.signalSources[0].namespace=monitoring \
  --set gateway.auth.signalSources[0].serviceAccount=alertmanager-kube-prometheus-stack-alertmanager \
  --wait --timeout 10m
```
What the chart handles automatically
The chart embeds default Rego policies for signal processing and AI analysis approval, and seeds demo ActionTypes and RemediationWorkflows when demoContent.enabled: true (the default).
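If you later install Kubernaut for real workloads, the seeded demo content can be switched off. A values fragment (key name taken from the `demoContent.enabled` default mentioned above):

```yaml
# values.yaml fragment: skip seeding demo ActionTypes and RemediationWorkflows
demoContent:
  enabled: false
```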
For advanced LLM configurations (Vertex AI, Azure, local models), see HolmesGPT SDK Config.
Step 1: Deploy the Healthy Workload¶
Deploy a namespace, a healthy nginx worker, and a Prometheus alerting rule. If you cloned the kubernaut-demo-scenarios repository:
```shell
kubectl apply -f scenarios/crashloop/manifests/namespace.yaml
kubectl apply -f scenarios/crashloop/manifests/configmap.yaml
kubectl apply -f scenarios/crashloop/manifests/deployment.yaml
kubectl apply -f scenarios/crashloop/manifests/prometheus-rule.yaml
```
The namespace is labeled for Kubernaut management:
```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo-crashloop
  labels:
    kubernaut.ai/managed: "true"
    kubernaut.ai/environment: production
    kubernaut.ai/business-unit: engineering
    kubernaut.ai/service-owner: backend-team
    kubernaut.ai/criticality: high
    kubernaut.ai/sla-tier: tier-1
```
The deployment runs an nginx worker with a valid configuration:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: worker
  namespace: demo-crashloop
  labels:
    app: worker
    kubernaut.ai/managed: "true"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: worker
  template:
    metadata:
      labels:
        app: worker
    spec:
      containers:
        - name: worker
          image: nginx:1.27-alpine
          ports:
            - containerPort: 8080
          volumeMounts:
            - name: config
              mountPath: /etc/nginx/nginx.conf
              subPath: nginx.conf
          livenessProbe:
            httpGet:
              path: /healthz
              port: 8080
          readinessProbe:
            httpGet:
              path: /healthz
              port: 8080
      volumes:
        - name: config
          configMap:
            name: worker-config
```
And the Prometheus alerting rule detects crash loops:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernaut-crashloop-rules
  namespace: demo-crashloop
spec:
  groups:
    - name: kubernaut-crashloop
      rules:
        - alert: KubePodCrashLooping
          expr: |
            increase(
              kube_pod_container_status_restarts_total{
                namespace="demo-crashloop",
                container="worker"
              }[3m]
            ) > 3
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: >
              Container {{ $labels.container }} in pod {{ $labels.pod }} is in
              CrashLoopBackOff with {{ $value | humanize }} restarts in the last 3 minutes.
```
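The rule's condition boils down to "more than 3 new restarts in the 3-minute window". A simplified shell sketch with sample counter values (real `increase()` values can be fractional, since Prometheus extrapolates between scrapes):

```shell
# Simplified view of the alert condition (sample counter values).
start=2   # kube_pod_container_status_restarts_total at window start
end=6     # same counter now
increase=$((end - start))
if [ "$increase" -gt 3 ]; then
  echo "KubePodCrashLooping pending: $increase restarts in 3m"
fi
# prints: KubePodCrashLooping pending: 4 restarts in 3m
```

Once the condition has held for the 1-minute `for:` duration, the alert moves from Pending to Firing.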
Step 2: Verify the Healthy State¶
Wait for the deployment to become available, then confirm the pods are running:
```shell
kubectl wait --for=condition=Available deployment/worker \
  -n demo-crashloop --timeout=120s

kubectl get pods -n demo-crashloop
```
You should see:
```
NAME                     READY   STATUS    RESTARTS   AGE
worker-5d4b8c7f9-abc12   1/1     Running   0          30s
worker-5d4b8c7f9-def34   1/1     Running   0          30s
```
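In a script, the same health check can be automated by parsing the pod listing. A minimal sketch, shown here against captured sample output (in the live cluster, pipe `kubectl get pods --no-headers -n demo-crashloop` in instead):

```shell
# Readiness check against captured sample output; replace the sample with
# live output from: kubectl get pods --no-headers -n demo-crashloop
sample='worker-5d4b8c7f9-abc12 1/1 Running 0 30s
worker-5d4b8c7f9-def34 1/1 Running 0 30s'
echo "$sample" | awk '
  { split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") bad++ }
  END { if (bad) { print bad " pod(s) unhealthy"; exit 1 }
        print "all pods healthy" }'
# prints: all pods healthy
```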
Let Prometheus establish a healthy baseline (wait ~20 seconds for at least one scrape cycle).
Step 3: Inject Bad Configuration¶
Now break the deployment by injecting an invalid nginx configuration:
```shell
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: worker-config-bad
  namespace: demo-crashloop
data:
  nginx.conf: |
    http {
      invalid_directive_that_breaks_nginx on;
    }
EOF

kubectl patch deployment worker -n demo-crashloop \
  --type=json \
  -p '[{"op":"replace","path":"/spec/template/spec/volumes/0/configMap/name","value":"worker-config-bad"}]'
```
Step 4: Observe CrashLoopBackOff¶
Watch the pods crash with `kubectl get pods -n demo-crashloop -w`.
You'll see pods cycle through Error → CrashLoopBackOff → Error:
```
NAME                     READY   STATUS             RESTARTS   AGE
worker-7f8a9b3c2-xyz78   0/1     CrashLoopBackOff   3          45s
worker-7f8a9b3c2-uvw56   0/1     Error              2          45s
```
Step 5: Wait for Alert and Pipeline¶
The `KubePodCrashLooping` alert enters Pending once the pods exceed 3 restarts within the 3-minute window, then fires after the 1-minute `for:` hold (expect roughly 2-3 minutes in total). You can check Prometheus directly:
```shell
kubectl port-forward -n monitoring svc/kube-prometheus-stack-prometheus 9090:9090 &
# Open http://localhost:9090/alerts to see the alert firing
```
Once the alert fires, AlertManager sends it to Kubernaut's Gateway webhook. Watch the remediation pipeline progress:
```shell
kubectl get remediationrequests,signalprocessing,aianalysis,workflowexecution,effectivenessassessment \
  -n kubernaut-system -w
```
The pipeline flows through these stages:
| Stage | What's Happening | Duration |
|---|---|---|
| Gateway | Alert received, RemediationRequest created | Instant |
| Signal Processing | Enriches signal with K8s context (owner chain, namespace labels, severity) | ~5s |
| AI Analysis | HolmesGPT investigates via kubectl (pod logs, events, config) | 30-90s |
| Approval | Auto-approved if confidence >= 80%, otherwise awaits human review | Instant or manual |
| Workflow Execution | Runs `kubectl rollout undo deployment/worker` via Job | ~10s |
| Effectiveness Monitor | Confirms pods running, restart count stabilized | 5 min (stabilization window) |
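The approval row reduces to a simple threshold comparison. A shell sketch with sample numbers (in Kubernaut the gate is an embedded Rego policy, not shell):

```shell
# Sketch of the approval gate with a sample confidence value.
confidence=92   # sample AI Analysis confidence (percent)
threshold=80    # auto-approval threshold from the table above
if [ "$confidence" -ge "$threshold" ]; then
  echo "auto-approved"
else
  echo "awaiting human review"
fi
# prints: auto-approved
```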
Step 6: Verify the Fix¶
After the pipeline completes, check that the workload is healthy again with `kubectl get pods -n demo-crashloop`:
```
NAME                     READY   STATUS    RESTARTS   AGE
worker-5d4b8c7f9-new12   1/1     Running   0          60s
worker-5d4b8c7f9-new34   1/1     Running   0          60s
```
The deployment was rolled back to the previous healthy revision; `kubectl rollout history deployment/worker -n demo-crashloop` shows the new rollout entry.
Clean Up¶
When you're done, remove the demo namespace with `kubectl delete namespace demo-crashloop`, or delete the whole Kind cluster with `kind delete cluster`.
What Just Happened?¶
- Kubernaut's Gateway received the `KubePodCrashLooping` alert from AlertManager
- Signal Processing enriched it with Kubernetes context (owner chain: Pod → ReplicaSet → Deployment; namespace labels: production, high criticality)
- AI Analysis submitted the enriched signal to HolmesGPT, which used `kubectl` to inspect the crashing pods, read their logs (`unknown directive "invalid_directive_that_breaks_nginx"`), and diagnosed the root cause as a bad ConfigMap
- The LLM searched the workflow catalog and selected `crashloop-rollback-v1` (RollbackDeployment)
- Workflow Execution ran the rollback Job, which executed `kubectl rollout undo deployment/worker`
- Effectiveness Monitor waited for the stabilization window and confirmed the pods were healthy with no further restarts
- Notification informed the team about the successful remediation
The full audit trail of every step is stored in PostgreSQL for compliance and post-mortem review.
Next Steps¶
- Core Concepts — Understand the data model and lifecycle
- Remediation Workflows — Write your own workflow schemas
- Human Approval — Configure approval gates and confidence thresholds
- Architecture Overview — Dive deeper into the system design
- HolmesGPT SDK Config — Configure LLM providers, toolsets, and token optimization