Configuration Reference¶
Kubernaut is configured via Helm values and per-service ConfigMaps. This page documents the operator-facing configuration surfaces: Helm values, namespace labels, signal sources, LLM providers, and operational tuning.
Namespace and Resource Labels¶
Kubernaut uses kubernaut.ai/* labels on namespaces and resources to control scope, enrichment, and classification. These labels are the primary way operators integrate their workloads with Kubernaut.
Scope Control¶
| Label | Values | Description |
|---|---|---|
| `kubernaut.ai/managed` | `true` / `false` | Opt-in scope control. Only resources in managed namespaces (or with this label) are remediated. |
Resolution order: Resource label > Namespace label > Default (unmanaged)
To enable Kubernaut for a namespace:
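For example, opt a namespace in by applying the label (the namespace name `my-app` is a placeholder):

```shell
# Opt the namespace in; resources within it become eligible for remediation
kubectl label namespace my-app kubernaut.ai/managed=true
```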
Classification Labels¶
| Label | Values | Used By | Purpose |
|---|---|---|---|
| `kubernaut.ai/environment` | `production`, `staging`, `development`, `qa`, `test` | SP `policy.rego` (environment rules), AA approval | Environment classification and approval gates |
| `kubernaut.ai/business-unit` | Any string | SP `policy.rego` (custom labels rules) | Business unit classification (LLM context only) |
| `kubernaut.ai/service-owner` | Any string | SP `policy.rego` (custom labels rules) | Service owner team |
| `kubernaut.ai/criticality` | `critical`, `high`, `medium`, `low` | SP `policy.rego` (custom labels rules) | Business criticality |
| `kubernaut.ai/sla-tier` | `platinum`, `gold`, `silver`, `bronze` | SP `policy.rego` (custom labels rules) | SLA tier |
Custom Labels¶
| Label Pattern | Used By | Purpose |
|---|---|---|
| `kubernaut.ai/label-*` | SP `policy.rego` (custom labels rules) | Arbitrary key-value pairs fed into workflow scoring (+0.15 per exact match, +0.075 per wildcard match) |
The `kubernaut.ai/label-` prefix is stripped by SP before passing to workflow discovery. Example:

```yaml
metadata:
  labels:
    kubernaut.ai/managed: "true"
    kubernaut.ai/environment: production
    kubernaut.ai/business-unit: payments
    kubernaut.ai/criticality: critical
    kubernaut.ai/label-team: checkout
    kubernaut.ai/label-region: us-east-1
```
See Rego Policies for how each label feeds into enrichment, and Workflow Search and Scoring for how labels affect workflow discovery.
Helm Values¶
All values are validated against `values.schema.json`. Run `helm lint` to check your overrides before installing.
Global Settings¶
| Parameter | Description | Default |
|---|---|---|
| `global.image.registry` | Container image registry | `quay.io` |
| `global.image.namespace` | Image namespace/organization | `kubernaut-ai` |
| `global.image.separator` | Path separator (`/` for nested registries, `-` for flat registries like Docker Hub) | `/` |
| `global.image.tag` | Image tag override (defaults to `appVersion`) | `""` |
| `global.image.digest` | Immutable image digest; overrides tag when set (e.g., `sha256:abc...`) | `""` |
| `global.image.pullPolicy` | Image pull policy | `IfNotPresent` |
| `global.imagePullSecrets` | Array of image pull secret names for private registries | `[]` |
| `global.nodeSelector` | Global node selector applied to all pods | `{}` |
| `global.tolerations` | Global tolerations applied to all pods | `[]` |
Image paths are constructed as `{registry}{separator}{namespace}{separator}{service}:{tag}`. For example, with the defaults: `quay.io/kubernaut-ai/gateway:v1.1.0`. For flat registries that don't support nested paths, set `separator: "-"` to produce `myregistry.example.com/kubernaut-ai-gateway:v1.1.0`.
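As a sketch, a values override for a flat registry (the registry hostname is a placeholder):

```yaml
global:
  image:
    registry: myregistry.example.com  # placeholder registry hostname
    separator: "-"                    # flat path: kubernaut-ai-gateway
```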
Gateway¶
| Parameter | Description | Default |
|---|---|---|
| `gateway.replicas` | Number of gateway replicas | `1` |
| `gateway.resources` | CPU/memory requests and limits | See `values.yaml` |
| `gateway.service.type` | Kubernetes Service type | `ClusterIP` |
| `gateway.config.server.maxConcurrentRequests` | Maximum concurrent request processing | `100` |
| `gateway.config.server.readTimeout` | HTTP read timeout | `30s` |
| `gateway.config.server.writeTimeout` | HTTP write timeout | `30s` |
| `gateway.config.deduplication.cooldownPeriod` | Signal deduplication cooldown | `5m` |
| `gateway.auth.signalSources` | External signal sources requiring RBAC | `[]` |
DataStorage¶
| Parameter | Description | Default |
|---|---|---|
| `datastorage.replicas` | Number of datastorage replicas | `1` |
| `datastorage.dbExistingSecret` | Deprecated. Override secret name for DataStorage DB credentials. Leave empty to use the consolidated `postgresql-secret`. Only needed when DataStorage must read from a separate secret (e.g., BYO PostgreSQL with split credentials). | `""` |
| `datastorage.config.database.sslMode` | PostgreSQL SSL mode | `disable` |
| `datastorage.config.database.maxOpenConns` | Maximum open database connections | `100` |
| `datastorage.config.database.maxIdleConns` | Maximum idle database connections | `20` |
| `datastorage.config.database.connMaxLifetime` | Maximum connection lifetime | `1h` |
| `datastorage.resources` | CPU/memory requests and limits | See `values.yaml` |
| `datastorage.service.type` | Kubernetes Service type | `ClusterIP` |
HolmesGPT API (LLM Integration)¶
| Parameter | Description | Default |
|---|---|---|
| `holmesgptApi.replicas` | Number of replicas | `1` |
| `holmesgptApi.llm.credentialsSecretName` | Name of pre-existing Secret with LLM API keys | `llm-credentials` |
| `holmesgptApi.sdkConfigContent` | SDK config YAML content (via `--set-file`). Used to create the `holmesgpt-sdk-config` ConfigMap. | `""` |
| `holmesgptApi.existingSdkConfigMap` | Pre-existing ConfigMap name for SDK config. Takes priority over `sdkConfigContent`. | `""` |
HAPI uses two ConfigMaps: a service config (ports, logging, auth secret references) and an SDK config (LLM settings, toolsets, MCP servers). The SDK config is provided in one of two ways:
- **Inline content (recommended)**: Provide full SDK config content via `--set-file holmesgptApi.sdkConfigContent=my-sdk-config.yaml`. The chart creates the `holmesgpt-sdk-config` ConfigMap from this content.
- **External ConfigMap**: Set `holmesgptApi.existingSdkConfigMap` to reference a pre-existing ConfigMap (takes priority over `sdkConfigContent`).
One of these two options must be provided; the chart will fail at install time if neither is set.
Notification Controller¶
| Parameter | Description | Default |
|---|---|---|
| `notification.replicas` | Number of replicas | `1` |
| `notification.routing.content` | Routing config YAML content (via `--set-file`). Chart creates ConfigMap from this. | `""` |
| `notification.routing.existingConfigMap` | Pre-existing ConfigMap name for routing config. Takes priority over `routing.content`. | `""` |
| `notification.credentials` | Projected volume sources from K8s Secrets | `[]` |
When neither `routing.content` nor `routing.existingConfigMap` is set, the chart generates a default routing config:

- If `notification.slack.secretName` is set, the chart generates a `slack-and-console` catch-all receiver that routes all notification types to both Slack and console.
- If `notification.slack.secretName` is not set, the chart generates a console-only default.
To provide fully custom routing:

```shell
helm install kubernaut charts/kubernaut/ \
  --set-file notification.routing.content=my-routing.yaml \
  ...
```
Add credentials entries to mount the Slack webhook Secret into the notification pod:
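For illustration, a hedged sketch assuming a pre-created Secret named `slack-webhook-secret` (entries follow the Kubernetes projected-volume `sources` format the `credentials` value expects):

```yaml
notification:
  credentials:
    - secret:
        name: slack-webhook-secret  # hypothetical pre-created Secret
```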
Controllers (Common Parameters)¶
All controllers (aianalysis, signalprocessing, remediationorchestrator, workflowexecution, effectivenessmonitor, authwebhook, notification) accept:
| Parameter | Description | Default |
|---|---|---|
| `<controller>.replicas` | Number of replicas | `1` |
| `<controller>.resources` | CPU/memory requests and limits | See `values.yaml` |
| `<controller>.podSecurityContext` | Pod-level security context override | `runAsNonRoot: true` + `seccompProfile: RuntimeDefault` (Tier 1); `seccompProfile: RuntimeDefault` only (Tier 2: postgresql, valkey) |
| `<controller>.containerSecurityContext` | Container-level security context override | `allowPrivilegeEscalation: false`, `capabilities.drop: [ALL]` |
| `<controller>.nodeSelector` | Per-component node selector (overrides global) | `{}` |
| `<controller>.tolerations` | Per-component tolerations (overrides global) | `[]` |
| `<controller>.affinity` | Pod affinity/anti-affinity rules | `{}` |
| `<controller>.topologySpreadConstraints` | Topology spread constraints | `[]` |
| `<controller>.pdb.enabled` | Create a PodDisruptionBudget | `false` |
| `<controller>.pdb.minAvailable` | PDB minimum available pods | -- |
| `<controller>.pdb.maxUnavailable` | PDB maximum unavailable pods | -- |
WorkflowExecution¶
| Parameter | Description | Default |
|---|---|---|
| `workflowexecution.workflowNamespace` | Namespace for Job/PipelineRun execution | `kubernaut-workflows` |
EffectivenessMonitor¶
| Parameter | Description | Default |
|---|---|---|
| `effectivenessmonitor.config.assessment.stabilizationWindow` | Wait time after remediation before assessment | `30s` |
| `effectivenessmonitor.config.assessment.validityWindow` | Time window for assessment validity | `120s` |
| `effectivenessmonitor.external.prometheusUrl` | Prometheus URL | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` |
| `effectivenessmonitor.external.prometheusEnabled` | Enable Prometheus integration | `false` |
| `effectivenessmonitor.external.alertManagerUrl` | AlertManager URL | `http://kube-prometheus-stack-alertmanager.monitoring.svc:9093` |
| `effectivenessmonitor.external.alertManagerEnabled` | Enable AlertManager integration | `false` |
| `effectivenessmonitor.external.tlsCaFile` | Path to PEM CA bundle for HTTPS connections to Prometheus/AlertManager. On OCP with `ocpMonitoringRbac`, set to `/etc/ssl/em/service-ca.crt` (auto-mounted). | `""` |
| `effectivenessmonitor.external.ocpMonitoringRbac` | Create a `cluster-monitoring-view` ClusterRoleBinding and (when `alertManagerEnabled`) a ClusterRole granting `monitoring.coreos.com/alertmanagers/api` access for OCP's kube-rbac-proxy. Also sets the `IS_OPENSHIFT` env var and auto-configures TLS CA trust via a service-CA ConfigMap. | `false` |
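On OCP, the documented keys combine as follows (a sketch using only the values described above):

```yaml
effectivenessmonitor:
  external:
    prometheusEnabled: true
    ocpMonitoringRbac: true                # grants cluster-monitoring-view, sets IS_OPENSHIFT
    tlsCaFile: /etc/ssl/em/service-ca.crt  # auto-mounted service CA
```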
AIAnalysis¶
| Parameter | Description | Default |
|---|---|---|
| `aianalysis.replicas` | Number of replicas | `1` |
| `aianalysis.rego.confidenceThreshold` | Auto-approval confidence threshold (`nil` = use Rego default `0.8`) | `null` |
| `aianalysis.policies.content` | Approval policy Rego content (via `--set-file`). Chart creates ConfigMap. | `""` |
| `aianalysis.policies.existingConfigMap` | Pre-existing ConfigMap name for approval policy. Takes priority over `policies.content`. | `""` |
One of `policies.content` or `policies.existingConfigMap` must be provided; the chart fails at install if neither is set. See AIAnalysis Approval Policy for the full schema and customization guide.
SignalProcessing¶
| Parameter | Description | Default |
|---|---|---|
| `signalprocessing.replicas` | Number of replicas | `1` |
| `signalprocessing.policy` | Unified Rego policy content (via `--set-file`). Chart creates the `signalprocessing-policy` ConfigMap. | `""` |
| `signalprocessing.existingPolicyConfigMap` | Pre-existing ConfigMap name for the unified Rego policy. Takes priority over `policy`. | `""` |
| `signalprocessing.proactiveSignalMappings.content` | Proactive signal mappings YAML (via `--set-file`). Chart creates ConfigMap. | `""` |
| `signalprocessing.proactiveSignalMappings.existingConfigMap` | Pre-existing ConfigMap name for proactive signal mappings. | `""` |
One of `policy` or `existingPolicyConfigMap` must be provided; the chart fails at install if neither is set. The policy file is a single `.rego` file (not a YAML bundle) containing all classification rules under `package signalprocessing`. Proactive signal mappings are optional and injected separately. See SignalProcessing Rego Policies for the policy structure and customization guide.
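Following the same `--set-file` pattern used elsewhere in this chart, passing the policy at install time might look like (the policy filename is a placeholder):

```shell
helm install kubernaut charts/kubernaut/ \
  --set-file signalprocessing.policy=policy.rego \
  ...
```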
PostgreSQL¶
All PostgreSQL credentials must be provided via pre-created Kubernetes Secrets. See Provision Secrets.
| Parameter | Description | Default |
|---|---|---|
| `postgresql.enabled` | Deploy in-chart PostgreSQL | `true` |
| `postgresql.variant` | PostgreSQL distribution variant (`upstream` or `ocp`). `ocp` uses the Red Hat RHEL10 image with `POSTGRESQL_*` env vars and non-root UID 26, compatible with the `restricted-v2` SCC. | `upstream` |
| `postgresql.replicas` | Number of replicas | `1` |
| `postgresql.image` | PostgreSQL container image | `postgres:16-alpine` |
| `postgresql.auth.existingSecret` | Pre-created Secret with `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` keys (required) | `""` |
| `postgresql.auth.username` | Database username (only used when chart creates the DB) | `slm_user` |
| `postgresql.auth.database` | Database name (only used when chart creates the DB) | `action_history` |
| `postgresql.storage.size` | PVC size | `10Gi` |
| `postgresql.storage.storageClassName` | StorageClass (empty = cluster default) | `""` |
To use an external PostgreSQL instance, set `postgresql.enabled=false` and provide the connection details:

| Parameter | Description | Default |
|---|---|---|
| `postgresql.host` | External PostgreSQL hostname (required when `enabled=false`) | `""` |
| `postgresql.port` | External PostgreSQL port | `5432` |
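A hedged sketch of a BYO PostgreSQL override (the hostname is a placeholder; the Secret must be pre-created per Provision Secrets):

```yaml
postgresql:
  enabled: false
  host: db.example.com                # placeholder external hostname
  port: 5432
  auth:
    existingSecret: postgresql-secret  # pre-created Secret with POSTGRES_* keys
```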
Valkey¶
All Valkey credentials must be provided via pre-created Kubernetes Secrets. See Provision Secrets.
| Parameter | Description | Default |
|---|---|---|
| `valkey.enabled` | Deploy in-chart Valkey | `true` |
| `valkey.replicas` | Number of replicas | `1` |
| `valkey.image` | Valkey container image | `valkey/valkey:8-alpine` |
| `valkey.existingSecret` | Pre-created Secret with a `valkey-secrets.yaml` key containing `password: <pass>` (required) | `""` |
| `valkey.storage.size` | PVC size | `512Mi` |
| `valkey.storage.storageClassName` | StorageClass (empty = cluster default) | `""` |
To use an external Valkey instance, set `valkey.enabled=false` and provide:

| Parameter | Description | Default |
|---|---|---|
| `valkey.host` | External Valkey hostname (required when `enabled=false`) | `""` |
| `valkey.port` | External Valkey port | `6379` |
Signal Source Authentication¶
External signal sources need RBAC authorization. Configure via Helm:
```yaml
gateway:
  auth:
    signalSources:
      - name: alertmanager
        serviceAccount: alertmanager-kube-prometheus-stack-alertmanager
        namespace: monitoring
```
Each entry creates a ClusterRoleBinding granting the ServiceAccount permission to submit signals.
See Security & RBAC -- Signal Ingestion for the full TokenReview + SAR authentication flow and RBAC details. See Installation -- Signal Source Authentication for AlertManager configuration examples.
LLM Provider Setup¶
LLM configuration lives in the SDK config file, not in values.yaml. See HolmesGPT SDK Config for the full schema and provider examples.
Quick setup:

1. Copy the example SDK config from the chart.
2. Edit `my-sdk-config.yaml` -- set `llm.provider`, `llm.model`, and any provider-specific fields.
3. Create the API key Secret:

    ```shell
    kubectl create secret generic llm-credentials \
      --namespace kubernaut-system \
      --from-literal=OPENAI_API_KEY="sk-..."
    ```

4. Pass the SDK config during install:

    ```shell
    helm install kubernaut charts/kubernaut/ \
      --set-file holmesgptApi.sdkConfigContent=my-sdk-config.yaml \
      ...
    ```
Temperature Tuning¶
The `temperature` parameter in the SDK config (default `0.7`) controls the LLM's creativity vs. determinism:
- Lower (0.3--0.5): More deterministic workflow selection. Recommended for production environments where consistency is critical.
- Default (0.7): Balanced. Good for most environments.
- Higher (0.8--1.0): More creative investigation. May discover non-obvious root causes but with less consistent workflow selection.
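In the SDK config this is a single key; a sketch for a production-leaning setting (the provider and model values are placeholders, and placing `temperature` alongside `llm.provider`/`llm.model` is an assumption -- see HolmesGPT SDK Config for the authoritative schema):

```yaml
llm:
  provider: openai   # placeholder provider
  model: gpt-4o      # placeholder model
  temperature: 0.4   # more deterministic workflow selection
```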
Remediation Timeouts and Routing¶
The RemediationOrchestrator exposes per-phase timeouts and routing thresholds as values.yaml parameters under remediationorchestrator.config.
Phase Timeouts¶
| Parameter | Default | Description |
|---|---|---|
| `remediationorchestrator.config.timeouts.global` | `1h` | Total remediation timeout |
| `remediationorchestrator.config.timeouts.processing` | `5m` | Signal Processing phase |
| `remediationorchestrator.config.timeouts.analyzing` | `10m` | AI Analysis (HAPI investigation) |
| `remediationorchestrator.config.timeouts.executing` | `30m` | Workflow execution |
| `remediationorchestrator.config.timeouts.verifying` | `30m` | Effectiveness assessment |
Individual RemediationRequest resources can override timeouts via `spec.timeouts`.
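A hedged sketch of a per-resource override (the CRD `apiVersion` and the exact field names under `spec.timeouts` are assumptions mirroring the Helm keys above):

```yaml
apiVersion: kubernaut.ai/v1alpha1  # assumed group/version
kind: RemediationRequest
metadata:
  name: example
spec:
  timeouts:
    global: 2h      # override the 1h default for this remediation
    executing: 45m  # allow a longer workflow execution
```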
Routing Configuration¶
| Parameter | Default | Description |
|---|---|---|
| `remediationorchestrator.config.routing.consecutiveFailureThreshold` | `3` | Block a resource after N consecutive remediation failures |
| `remediationorchestrator.config.routing.consecutiveFailureCooldown` | `1h` | How long to block after hitting the threshold |
| `remediationorchestrator.config.routing.recentlyRemediatedCooldown` | `5m` | Minimum interval between successful remediations for the same resource |
| `remediationorchestrator.config.routing.ineffectiveChainThreshold` | `3` | Consecutive ineffective remediations before escalation |
| `remediationorchestrator.config.routing.recurrenceCountThreshold` | `5` | Safety-net recurrence count |
| `remediationorchestrator.config.routing.ineffectiveTimeWindow` | `4h` | Lookback window for ineffective chain detection |
These settings prevent remediation storms and avoid repeating failed approaches.
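For example, to tolerate more failures before blocking a resource (values only; the keys are as documented above):

```yaml
remediationorchestrator:
  config:
    routing:
      consecutiveFailureThreshold: 5  # block after 5 failures instead of 3
      consecutiveFailureCooldown: 2h  # block for longer once the threshold is hit
```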
Execution Namespace¶
Workflow Jobs and Tekton PipelineRuns execute in a dedicated namespace, separate from the target resource's namespace. This creates a security boundary.
| Parameter | Default | Description |
|---|---|---|
| `workflowexecution.workflowNamespace` | `kubernaut-workflows` | Namespace for workflow execution |
| `workflowexecution.config.execution.cooldownPeriod` | `1m` | Cooldown between executions |
The kubernaut-workflow-runner ServiceAccount has pre-configured RBAC to read and patch resources across namespaces. See Security & RBAC -- Workflow Execution for the full permission list.
Ansible Engine (AWX/AAP)¶
To enable the Ansible execution engine for workflows that run Ansible playbooks via AWX or AAP, configure the workflowexecution.config.ansible block.
1. Create the AWX API token secret¶
Generate an API token in your AWX/AAP instance and store it in a Kubernetes Secret. The secret name is user-chosen -- it just needs to match tokenSecretRef.name in step 3.
```shell
kubectl create secret generic awx-api-token \
  --from-literal=token=<YOUR_AWX_API_TOKEN> \
  -n kubernaut-system
```
Replace `awx-api-token` with your preferred name (e.g. `aap-api-token` for AAP deployments).
2. Grant RBAC for the token secret¶
The workflowexecution-controller ServiceAccount needs permission to read the token secret at startup. The chart does not create this RBAC automatically -- you must create it:
```shell
kubectl create role awx-token-reader \
  --verb=get --resource=secrets --resource-name=awx-api-token \
  -n kubernaut-system

kubectl create rolebinding awx-token-reader \
  --role=awx-token-reader \
  --serviceaccount=kubernaut-system:workflowexecution-controller \
  -n kubernaut-system
```
Replace `awx-api-token` in `--resource-name` with the secret name you chose in step 1.
**Without this RBAC, the ansible executor is silently skipped.** The controller logs "Failed to read AWX token secret, ansible executor not available" and only registers the `job` and `tekton` engines. Any WorkflowExecution with `engine: ansible` will fail with `unsupported execution engine: "ansible"`.
3. Configure Helm values¶
Uncomment the `ansible` block in your values file:

```yaml
workflowexecution:
  config:
    ansible:
      apiURL: "https://awx.example.com"
      insecure: false       # set true to skip TLS verification
      organizationID: 1     # AWX organization ID for credential creation
      tokenSecretRef:
        name: awx-api-token # Secret created in step 1
        key: token          # key within the Secret
        namespace: ""       # empty = release namespace (kubernaut-system)
```
| Parameter | Required | Default | Description |
|---|---|---|---|
| `ansible.apiURL` | Yes | -- | AWX/AAP API base URL |
| `ansible.insecure` | No | `false` | Skip TLS certificate verification |
| `ansible.organizationID` | No | `1` | AWX organization ID for ephemeral credential creation |
| `ansible.tokenSecretRef.name` | Yes | -- | Kubernetes Secret name containing the AWX API token |
| `ansible.tokenSecretRef.key` | No | `token` | Key within the Secret |
| `ansible.tokenSecretRef.namespace` | No | release namespace | Namespace of the token Secret |
4. Verify¶
After installing or upgrading with the ansible config, check the controller logs and confirm the ansible engine is registered.
**Automatic K8s API credentials for playbooks.** The ansible executor automatically injects the WE controller's in-cluster ServiceAccount token as an ephemeral AWX credential on every job launch. Playbooks using `kubernetes.core` modules receive `K8S_AUTH_HOST`, `K8S_AUTH_API_KEY`, and `K8S_AUTH_SSL_CA_CERT` without manual credential configuration. If the in-cluster environment is unavailable, the job proceeds without K8s credentials.
For authoring ansible workflows, see Ansible (AWX/AAP) in Remediation Workflows and Workflow Execution Architecture.
TLS and Certificate Management¶
The Auth Webhook requires TLS certificates for Kubernetes admission webhook communication.
The chart supports three modes for managing TLS certificates used by the admission webhooks, controlled by tls.mode:
Hook Mode (tls.mode: hook) -- Default¶
Self-signed certificates are generated and managed by Helm hooks. No external dependencies required. Suitable for development, testing, and CI environments.
How it works:
- **Pre-install/pre-upgrade** (`tls-cert-gen`): Generates a self-signed CA and server certificate, stored as the `authwebhook-tls` Secret and `authwebhook-ca` ConfigMap.
- **Post-install/post-upgrade** (`tls-cabundle-patch`): Patches the `caBundle` field on the webhook configurations.
- **Post-delete** (`tls-cleanup`): Removes the `authwebhook-tls` Secret and `authwebhook-ca` ConfigMap.
Automatic renewal: On `helm upgrade`, if the certificate expires within 30 days, it is automatically regenerated. Additionally, the AuthWebhook init-container patches the `caBundle` on every pod restart, making the TLS configuration self-healing.
Recovery: If the `authwebhook-ca` ConfigMap is accidentally deleted while `authwebhook-tls` still exists, delete the `authwebhook-tls` Secret and run `helm upgrade` to regenerate both:

```shell
kubectl delete secret authwebhook-tls -n kubernaut-system
helm upgrade kubernaut kubernaut/kubernaut -n kubernaut-system -f my-values.yaml
```
Note: `helm template` output will not show `caBundle` on webhook configurations. This is expected -- the hook injects it at runtime after the webhook resources are created.
cert-manager Mode (tls.mode: cert-manager) -- Production¶
Certificates are managed by cert-manager. Recommended for production environments. cert-manager handles issuance, renewal, and caBundle injection automatically.
Prerequisites:

- Install cert-manager (v1.12+):

    ```shell
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
    kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=120s
    ```

- Create an Issuer or ClusterIssuer. For development with cert-manager, a self-signed issuer works:

    ```yaml
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: selfsigned-issuer
    spec:
      selfSigned: {}
    ```

    For production, use your organization's CA or an ACME issuer (e.g., Let's Encrypt).

- Install the chart with cert-manager mode:

    ```shell
    helm install kubernaut kubernaut/kubernaut \
      --namespace kubernaut-system \
      --set tls.mode=cert-manager \
      --set tls.certManager.issuerRef.name=selfsigned-issuer \
      -f my-values.yaml
    ```
The chart creates a Certificate resource (`authwebhook-cert`) that provisions the `authwebhook-tls` Secret. cert-manager's cainjector automatically writes the `caBundle` into the webhook configurations via the `cert-manager.io/inject-ca-from` annotation.
No TLS hook jobs are created in this mode -- cert-manager handles the full lifecycle including renewal.
Migrating from Hook to cert-manager¶
To switch an existing installation from `tls.mode=hook` to `tls.mode=cert-manager`:

1. Install cert-manager and create an Issuer/ClusterIssuer (see Installation).
2. Upgrade with the new mode. The hook-generated Secret and ConfigMap are replaced by cert-manager-managed resources; the old hook cleanup job removes the previous artifacts.
3. Verify the webhook is serving the new certificate.
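A minimal sketch of the upgrade command (the release name `kubernaut`, values file `my-values.yaml`, and issuer name `selfsigned-issuer` are placeholders):

```shell
helm upgrade kubernaut kubernaut/kubernaut \
  --namespace kubernaut-system \
  --set tls.mode=cert-manager \
  --set tls.certManager.issuerRef.name=selfsigned-issuer \
  -f my-values.yaml
```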
See Troubleshooting if webhook calls fail after migration.
Manual Mode (tls.mode: manual) -- External PKI¶
For environments where TLS certificates are managed externally (service mesh, external PKI, CI pipelines). The chart creates no TLS-related hook Jobs, no Certificate resources, and no caBundle patching.
Operator responsibilities:
- Pre-create the `authwebhook-tls` Secret with `tls.crt` and `tls.key` entries
- Pre-create the `authwebhook-ca` ConfigMap with the CA bundle
- Ensure the `caBundle` field on `ValidatingWebhookConfiguration` resources matches the CA
```shell
helm install kubernaut charts/kubernaut/ \
  --namespace kubernaut-system \
  --set tls.mode=manual \
  -f my-values.yaml
```
This mode is useful when a service mesh (e.g., Istio) handles mTLS between the API server and webhooks, or when certificates are provisioned by an external PKI and injected via a sidecar or init container.
CA Bundle Self-Healing¶
In hook mode, the AuthWebhook deployment includes an init-container that patches the caBundle field on the ValidatingWebhookConfiguration at startup. This makes TLS self-healing across Helm upgrades and interrupted installs -- if the caBundle drifts from the actual CA, the next pod restart corrects it automatically.
Hot-Reload and Graceful Shutdown¶
Understanding which configuration changes take effect live vs which require a restart is critical for operational confidence.
Hot-Reload Support¶
| Configuration | Hot-Reload | Mechanism | Latency |
|---|---|---|---|
| SP unified Rego policy (`policy.rego` -- environment, severity, priority, custom labels) | Yes | fsnotify file watcher | ~60s (kubelet sync) |
| AA approval policy | Yes | fsnotify file watcher | ~60s |
| Notification credentials | Yes | fsnotify file watcher | ~60s |
| Notification routing | Yes | fsnotify file watcher | ~60s |
| HolmesGPT config | Yes | Python watchdog | ~60s |
| Gateway config | No | Restart required | -- |
| DataStorage config | No | Restart required | -- |
| Proactive signal mappings | No | Restart required | -- |
Policies are validated before reload -- if the new policy has a syntax error, the previous policy is kept and an error is logged. No service interruption occurs.
Graceful Shutdown¶
All services implement graceful shutdown to ensure in-flight remediations are not disrupted during rolling updates:
| Service | Shutdown Behavior |
|---|---|
| Gateway | Sets shutdown flag → readiness probe returns 503 → waits 5s for endpoint removal → drains in-flight requests → closes resources |
| DataStorage | Same shutdown sequence as Gateway |
| CRD Controllers (SP, AA, RO, WFE, EM, NT) | controller-runtime built-in signal handling; in-flight reconciles complete |
| HolmesGPT API | Python SIGTERM handler; readiness returns 503; in-flight investigations complete |
This means `helm upgrade` and rolling updates do not disrupt in-flight remediations. The readiness probe change ensures no new traffic reaches the pod during drain.
Next Steps¶
- HolmesGPT SDK Config -- LLM provider, toolsets, and MCP server configuration
- SignalProcessing Rego Policies -- Policy bundle format and customization
- AIAnalysis Approval Policy -- Approval gates and risk factors
- Notification Routing -- Routing schema and Slack setup
- Rego Policies -- Rego language reference for classification policies
- Notification Channels -- Setting up Slack and other channels
- Remediation Workflows -- Authoring and registering workflows
- Installation -- Using these values during deployment