Configuration Reference¶
Kubernaut is configured via Helm values and per-service ConfigMaps. This page documents the operator-facing configuration surfaces: Helm values, namespace labels, signal sources, LLM providers, and operational tuning.
Namespace and Resource Labels¶
Kubernaut uses kubernaut.ai/* labels on namespaces and resources to control scope, enrichment, and classification. These labels are the primary way operators integrate their workloads with Kubernaut.
Scope Control¶
| Label | Values | Description |
|---|---|---|
| `kubernaut.ai/managed` | `true` / `false` | Opt-in scope control. Only resources in managed namespaces (or with this label) are remediated. |
Resolution order: Resource label > Namespace label > Default (unmanaged)
To enable Kubernaut for a namespace:
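For example, opt a namespace in by applying the label (the namespace name `my-app` is a placeholder):

```shell
# Opt the namespace in; resources within it become eligible for remediation
kubectl label namespace my-app kubernaut.ai/managed=true
```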
Classification Labels¶
| Label | Values | Used By | Purpose |
|---|---|---|---|
| `kubernaut.ai/environment` | `production`, `staging`, `development`, `qa`, `test` | SP `policy.rego` (environment rules), AA approval | Environment classification and approval gates |
| `kubernaut.ai/business-unit` | Any string | SP `policy.rego` (custom labels rules) | Business unit classification (LLM context only) |
| `kubernaut.ai/service-owner` | Any string | SP `policy.rego` (custom labels rules) | Service owner team |
| `kubernaut.ai/criticality` | `critical`, `high`, `medium`, `low` | SP `policy.rego` (custom labels rules) | Business criticality |
| `kubernaut.ai/sla-tier` | `platinum`, `gold`, `silver`, `bronze` | SP `policy.rego` (custom labels rules) | SLA tier |
Custom Labels¶
| Label Pattern | Used By | Purpose |
|---|---|---|
| `kubernaut.ai/label-*` | SP `policy.rego` (custom labels rules) | Arbitrary key-value pairs fed into workflow scoring (+0.15 per exact match, +0.075 per wildcard match) |
The `kubernaut.ai/label-` prefix is stripped by SP before passing to workflow discovery. Example:

```yaml
metadata:
  labels:
    kubernaut.ai/managed: "true"
    kubernaut.ai/environment: production
    kubernaut.ai/business-unit: payments
    kubernaut.ai/criticality: critical
    kubernaut.ai/label-team: checkout
    kubernaut.ai/label-region: us-east-1
```
See Rego Policies for how each label feeds into enrichment, and Workflow Search and Scoring for how labels affect workflow discovery.
Helm Values¶
All values are validated against `values.schema.json`. Run `helm lint` to check your overrides before installing.
Global Settings¶
| Parameter | Description | Default |
|---|---|---|
| `global.image.registry` | Container image registry | `quay.io` |
| `global.image.namespace` | Image namespace/organization | `kubernaut-ai` |
| `global.image.separator` | Path separator (`/` for nested registries, `-` for flat registries like Docker Hub) | `/` |
| `global.image.tag` | Image tag override (defaults to `appVersion`) | `""` |
| `global.image.digest` | Immutable image digest; overrides tag when set (e.g., `sha256:abc...`) | `""` |
| `global.image.pullPolicy` | Image pull policy | `IfNotPresent` |
| `global.imagePullSecrets` | Array of image pull secret names for private registries | `[]` |
| `global.nodeSelector` | Global node selector applied to all pods | `{}` |
| `global.tolerations` | Global tolerations applied to all pods | `[]` |
Image paths are constructed as `{registry}{separator}{namespace}{separator}{service}:{tag}`. For example, with the defaults: `quay.io/kubernaut-ai/gateway:v1.1.0`. For flat registries that don't support nested paths, set `separator: "-"` to produce `myregistry.example.com/kubernaut-ai-gateway:v1.1.0`.
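As a sketch, a values override for a flat registry (the registry hostname is a placeholder):

```yaml
global:
  image:
    registry: myregistry.example.com  # placeholder registry hostname
    separator: "-"                    # flat path: kubernaut-ai-gateway
```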
Gateway¶
| Parameter | Description | Default |
|---|---|---|
| `gateway.replicas` | Number of gateway replicas | `1` |
| `gateway.resources` | CPU/memory requests and limits | See `values.yaml` |
| `gateway.service.type` | Kubernetes Service type | `ClusterIP` |
| `gateway.config.server.maxConcurrentRequests` | Maximum concurrent request processing | `100` |
| `gateway.config.server.readTimeout` | HTTP read timeout | `30s` |
| `gateway.config.server.writeTimeout` | HTTP write timeout | `30s` |
| `gateway.config.deduplication.cooldownPeriod` | Signal deduplication cooldown | `5m` |
| `gateway.auth.signalSources` | External signal sources requiring RBAC | `[]` |
DataStorage¶
| Parameter | Description | Default |
|---|---|---|
| `datastorage.replicas` | Number of datastorage replicas | `1` |
| `datastorage.dbExistingSecret` | Deprecated. Override secret name for DataStorage DB credentials. Leave empty to use the consolidated `postgresql-secret`. Only needed when DataStorage must read from a separate secret (e.g., BYO PostgreSQL with split credentials). | `""` |
| `datastorage.config.database.sslMode` | PostgreSQL SSL mode | `disable` |
| `datastorage.config.database.maxOpenConns` | Maximum open database connections | `100` |
| `datastorage.config.database.maxIdleConns` | Maximum idle database connections | `20` |
| `datastorage.config.database.connMaxLifetime` | Maximum connection lifetime | `1h` |
| `datastorage.resources` | CPU/memory requests and limits | See `values.yaml` |
| `datastorage.service.type` | Kubernetes Service type | `ClusterIP` |
HolmesGPT API (LLM Integration)¶
| Parameter | Description | Default |
|---|---|---|
| `holmesgptApi.replicas` | Number of replicas | `1` |
| `holmesgptApi.llm.credentialsSecretName` | Name of pre-existing Secret with LLM API keys | `llm-credentials` |
| `holmesgptApi.sdkConfigContent` | SDK config YAML content (via `--set-file`). Used to create the `holmesgpt-sdk-config` ConfigMap. | `""` |
| `holmesgptApi.existingSdkConfigMap` | Pre-existing ConfigMap name for SDK config. Takes priority over `sdkConfigContent`. | `""` |
HAPI uses two ConfigMaps: a service config (ports, logging, auth secret references) and an SDK config (LLM settings, toolsets, MCP servers). The SDK config is provided in one of two ways:
- **Inline content (recommended)**: Provide full SDK config content via `--set-file holmesgptApi.sdkConfigContent=my-sdk-config.yaml`. The chart creates the `holmesgpt-sdk-config` ConfigMap from this content.
- **External ConfigMap**: Set `holmesgptApi.existingSdkConfigMap` to reference a pre-existing ConfigMap (takes priority over `sdkConfigContent`).
One of these two options must be provided; the chart will fail at install time if neither is set.
Notification Controller¶
| Parameter | Description | Default |
|---|---|---|
| `notification.replicas` | Number of replicas | `1` |
| `notification.routing.content` | Routing config YAML content (via `--set-file`). Chart creates ConfigMap from this. | `""` |
| `notification.routing.existingConfigMap` | Pre-existing ConfigMap name for routing config. Takes priority over `routing.content`. | `""` |
| `notification.credentials` | Projected volume sources from K8s Secrets | `[]` |
When neither `routing.content` nor `routing.existingConfigMap` is set, the chart generates a default routing config:

- If `notification.slack.secretName` is set, the chart generates a `slack-and-console` catch-all receiver that routes all notification types to both Slack and console.
- If `notification.slack.secretName` is not set, the chart generates a console-only default.
To provide fully custom routing:

```shell
helm install kubernaut charts/kubernaut/ \
  --set-file notification.routing.content=my-routing.yaml \
  ...
```
Add credentials entries to mount the Slack webhook Secret into the notification pod:
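For illustration, a hedged sketch assuming a pre-created Secret named `slack-webhook-secret` (entries follow the Kubernetes projected-volume `sources` format the `credentials` value expects):

```yaml
notification:
  credentials:
    - secret:
        name: slack-webhook-secret  # hypothetical pre-created Secret
```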
Controllers (Common Parameters)¶
All controllers (aianalysis, signalprocessing, remediationorchestrator, workflowexecution, effectivenessmonitor, authwebhook, notification) accept:
| Parameter | Description | Default |
|---|---|---|
| `<controller>.replicas` | Number of replicas | `1` |
| `<controller>.resources` | CPU/memory requests and limits | See `values.yaml` |
| `<controller>.podSecurityContext` | Pod-level security context override | `runAsNonRoot: true` + `seccompProfile: RuntimeDefault` (Tier 1); `seccompProfile: RuntimeDefault` only (Tier 2: postgresql, valkey) |
| `<controller>.containerSecurityContext` | Container-level security context override | `allowPrivilegeEscalation: false`, `capabilities.drop: [ALL]` |
| `<controller>.nodeSelector` | Per-component node selector (overrides global) | `{}` |
| `<controller>.tolerations` | Per-component tolerations (overrides global) | `[]` |
| `<controller>.affinity` | Pod affinity/anti-affinity rules | `{}` |
| `<controller>.topologySpreadConstraints` | Topology spread constraints | `[]` |
| `<controller>.pdb.enabled` | Create a PodDisruptionBudget | `false` |
| `<controller>.pdb.minAvailable` | PDB minimum available pods | -- |
| `<controller>.pdb.maxUnavailable` | PDB maximum unavailable pods | -- |
WorkflowExecution¶
| Parameter | Description | Default |
|---|---|---|
| `workflowexecution.workflowNamespace` | Namespace for Job/PipelineRun execution | `kubernaut-workflows` |
EffectivenessMonitor¶
| Parameter | Description | Default |
|---|---|---|
| `effectivenessmonitor.config.assessment.stabilizationWindow` | Wait time after remediation before assessment | `30s` |
| `effectivenessmonitor.config.assessment.validityWindow` | Time window for assessment validity | `120s` |
| `effectivenessmonitor.external.prometheusUrl` | Prometheus URL | `http://kube-prometheus-stack-prometheus.monitoring.svc:9090` |
| `effectivenessmonitor.external.prometheusEnabled` | Enable Prometheus integration | `false` |
| `effectivenessmonitor.external.alertManagerUrl` | AlertManager URL | `http://kube-prometheus-stack-alertmanager.monitoring.svc:9093` |
| `effectivenessmonitor.external.alertManagerEnabled` | Enable AlertManager integration | `false` |
| `effectivenessmonitor.external.tlsCaFile` | Path to PEM CA bundle for HTTPS connections to Prometheus/AlertManager. On OCP with `ocpMonitoringRbac`, set to `/etc/ssl/em/service-ca.crt` (auto-mounted). | `""` |
| `effectivenessmonitor.external.ocpMonitoringRbac` | Create a `cluster-monitoring-view` ClusterRoleBinding and (when `alertManagerEnabled`) a ClusterRole granting `monitoring.coreos.com/alertmanagers/api` access for OCP's kube-rbac-proxy. Also sets the `IS_OPENSHIFT` env var and auto-configures TLS CA trust via a service-CA ConfigMap. | `false` |
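On OCP, the documented keys combine as follows (a sketch using only the values described above):

```yaml
effectivenessmonitor:
  external:
    prometheusEnabled: true
    ocpMonitoringRbac: true                # grants cluster-monitoring-view, sets IS_OPENSHIFT
    tlsCaFile: /etc/ssl/em/service-ca.crt  # auto-mounted service CA
```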
AIAnalysis¶
| Parameter | Description | Default |
|---|---|---|
| `aianalysis.replicas` | Number of replicas | `1` |
| `aianalysis.rego.confidenceThreshold` | Auto-approval confidence threshold (`nil` = use Rego default `0.8`) | `null` |
| `aianalysis.policies.content` | Approval policy Rego content (via `--set-file`). Chart creates ConfigMap. | `""` |
| `aianalysis.policies.existingConfigMap` | Pre-existing ConfigMap name for approval policy. Takes priority over `policies.content`. | `""` |
One of `policies.content` or `policies.existingConfigMap` must be provided; the chart fails at install if neither is set. See AIAnalysis Approval Policy for the full schema and customization guide.
SignalProcessing¶
| Parameter | Description | Default |
|---|---|---|
| `signalprocessing.replicas` | Number of replicas | `1` |
| `signalprocessing.policy` | Unified Rego policy content (via `--set-file`). Chart creates the `signalprocessing-policy` ConfigMap. | `""` |
| `signalprocessing.existingPolicyConfigMap` | Pre-existing ConfigMap name for the unified Rego policy. Takes priority over `policy`. | `""` |
| `signalprocessing.proactiveSignalMappings.content` | Proactive signal mappings YAML (via `--set-file`). Chart creates ConfigMap. | `""` |
| `signalprocessing.proactiveSignalMappings.existingConfigMap` | Pre-existing ConfigMap name for proactive signal mappings. | `""` |
One of `policy` or `existingPolicyConfigMap` must be provided; the chart fails at install if neither is set. The policy file is a single `.rego` file (not a YAML bundle) containing all classification rules under `package signalprocessing`. Proactive signal mappings are optional and injected separately. See SignalProcessing Rego Policies for the policy structure and customization guide.
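Following the same `--set-file` pattern used elsewhere in this chart, passing the policy at install time might look like (the policy filename is a placeholder):

```shell
helm install kubernaut charts/kubernaut/ \
  --set-file signalprocessing.policy=policy.rego \
  ...
```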
PostgreSQL¶
All PostgreSQL credentials must be provided via pre-created Kubernetes Secrets. See Provision Secrets.
| Parameter | Description | Default |
|---|---|---|
| `postgresql.enabled` | Deploy in-chart PostgreSQL | `true` |
| `postgresql.variant` | PostgreSQL distribution variant (`upstream` or `ocp`). `ocp` uses the Red Hat RHEL10 image with `POSTGRESQL_*` env vars and non-root UID 26, compatible with the `restricted-v2` SCC. | `upstream` |
| `postgresql.replicas` | Number of replicas | `1` |
| `postgresql.image` | PostgreSQL container image | `postgres:16-alpine` |
| `postgresql.auth.existingSecret` | Pre-created Secret with `POSTGRES_USER`, `POSTGRES_PASSWORD`, `POSTGRES_DB` keys (required) | `""` |
| `postgresql.auth.username` | Database username (only used when chart creates the DB) | `slm_user` |
| `postgresql.auth.database` | Database name (only used when chart creates the DB) | `action_history` |
| `postgresql.storage.size` | PVC size | `10Gi` |
| `postgresql.storage.storageClassName` | StorageClass (empty = cluster default) | `""` |
To use an external PostgreSQL instance, set `postgresql.enabled=false` and provide the connection details:

| Parameter | Description | Default |
|---|---|---|
| `postgresql.host` | External PostgreSQL hostname (required when `enabled=false`) | `""` |
| `postgresql.port` | External PostgreSQL port | `5432` |
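A hedged sketch of a BYO PostgreSQL override (the hostname is a placeholder; the Secret must be pre-created per Provision Secrets):

```yaml
postgresql:
  enabled: false
  host: db.example.com                # placeholder external hostname
  port: 5432
  auth:
    existingSecret: postgresql-secret  # pre-created Secret with POSTGRES_* keys
```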
Valkey¶
All Valkey credentials must be provided via pre-created Kubernetes Secrets. See Provision Secrets.
| Parameter | Description | Default |
|---|---|---|
| `valkey.enabled` | Deploy in-chart Valkey | `true` |
| `valkey.replicas` | Number of replicas | `1` |
| `valkey.image` | Valkey container image | `valkey/valkey:8-alpine` |
| `valkey.existingSecret` | Pre-created Secret with a `valkey-secrets.yaml` key containing `password: <pass>` (required) | `""` |
| `valkey.storage.size` | PVC size | `512Mi` |
| `valkey.storage.storageClassName` | StorageClass (empty = cluster default) | `""` |
To use an external Valkey instance, set `valkey.enabled=false` and provide:

| Parameter | Description | Default |
|---|---|---|
| `valkey.host` | External Valkey hostname (required when `enabled=false`) | `""` |
| `valkey.port` | External Valkey port | `6379` |
Signal Source Authentication¶
External signal sources need RBAC authorization. Configure via Helm:
```yaml
gateway:
  auth:
    signalSources:
      - name: alertmanager
        serviceAccount: alertmanager-kube-prometheus-stack-alertmanager
        namespace: monitoring
```
Each entry creates a ClusterRoleBinding granting the ServiceAccount permission to submit signals.
See Security & RBAC -- Signal Ingestion for the full TokenReview + SAR authentication flow and RBAC details. See Installation -- Signal Source Authentication for AlertManager configuration examples.
LLM Provider Setup¶
LLM configuration lives in the SDK config file, not in values.yaml. See HolmesGPT SDK Config for the full schema and provider examples.
Quick setup:

1. Copy the example SDK config from the chart.
2. Edit `my-sdk-config.yaml` -- set `llm.provider`, `llm.model`, and any provider-specific fields.
3. Create the API key Secret:

    ```shell
    kubectl create secret generic llm-credentials \
      --namespace kubernaut-system \
      --from-literal=OPENAI_API_KEY="sk-..."
    ```

4. Pass the SDK config during install:

    ```shell
    helm install kubernaut charts/kubernaut/ \
      --set-file holmesgptApi.sdkConfigContent=my-sdk-config.yaml \
      ...
    ```
Temperature Tuning¶
The `temperature` parameter in the SDK config (default `0.7`) controls the LLM's creativity vs. determinism:
- Lower (0.3--0.5): More deterministic workflow selection. Recommended for production environments where consistency is critical.
- Default (0.7): Balanced. Good for most environments.
- Higher (0.8--1.0): More creative investigation. May discover non-obvious root causes but with less consistent workflow selection.
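In the SDK config this is a single key; a sketch for a production-leaning setting (the provider and model values are placeholders, and placing `temperature` alongside `llm.provider`/`llm.model` is an assumption -- see HolmesGPT SDK Config for the authoritative schema):

```yaml
llm:
  provider: openai   # placeholder provider
  model: gpt-4o      # placeholder model
  temperature: 0.4   # more deterministic workflow selection
```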
Remediation Timeouts and Routing¶
The RemediationOrchestrator exposes per-phase timeouts and routing thresholds as values.yaml parameters under remediationorchestrator.config.
Phase Timeouts¶
| Parameter | Default | Description |
|---|---|---|
| `remediationorchestrator.config.timeouts.global` | `1h` | Total remediation timeout |
| `remediationorchestrator.config.timeouts.processing` | `5m` | Signal Processing phase |
| `remediationorchestrator.config.timeouts.analyzing` | `10m` | AI Analysis (HAPI investigation) |
| `remediationorchestrator.config.timeouts.executing` | `30m` | Workflow execution |
| `remediationorchestrator.config.timeouts.verifying` | `30m` | Effectiveness assessment |
Individual RemediationRequest resources can override timeouts via `spec.timeouts`.
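A hedged sketch of a per-resource override (the CRD `apiVersion` and the exact field names under `spec.timeouts` are assumptions mirroring the Helm keys above):

```yaml
apiVersion: kubernaut.ai/v1alpha1  # assumed group/version
kind: RemediationRequest
metadata:
  name: example
spec:
  timeouts:
    global: 2h      # override the 1h default for this remediation
    executing: 45m  # allow a longer workflow execution
```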
Routing Configuration¶
| Parameter | Default | Description |
|---|---|---|
| `remediationorchestrator.config.routing.consecutiveFailureThreshold` | `3` | Block a resource after N consecutive remediation failures |
| `remediationorchestrator.config.routing.consecutiveFailureCooldown` | `1h` | How long to block after hitting the threshold |
| `remediationorchestrator.config.routing.recentlyRemediatedCooldown` | `5m` | Minimum interval between successful remediations for the same resource |
| `remediationorchestrator.config.routing.ineffectiveChainThreshold` | `3` | Consecutive ineffective remediations before escalation |
| `remediationorchestrator.config.routing.recurrenceCountThreshold` | `5` | Safety-net recurrence count |
| `remediationorchestrator.config.routing.ineffectiveTimeWindow` | `4h` | Lookback window for ineffective chain detection |
These settings prevent remediation storms and avoid repeating failed approaches.
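For example, to tolerate more failures before blocking a resource (values only; the keys are as documented above):

```yaml
remediationorchestrator:
  config:
    routing:
      consecutiveFailureThreshold: 5  # block after 5 failures instead of 3
      consecutiveFailureCooldown: 2h  # block for longer once the threshold is hit
```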
Execution Namespace¶
Workflow Jobs and Tekton PipelineRuns execute in a dedicated namespace, separate from the target resource's namespace. This creates a security boundary.
| Parameter | Default | Description |
|---|---|---|
| `workflowexecution.workflowNamespace` | `kubernaut-workflows` | Namespace for workflow execution |
| `workflowexecution.config.execution.cooldownPeriod` | `1m` | Cooldown between executions |
The kubernaut-workflow-runner ServiceAccount has pre-configured RBAC to read and patch resources across namespaces. See Security & RBAC -- Workflow Execution for the full permission list.
Ansible Engine (AWX/AAP)¶
To enable the Ansible execution engine for workflows that run Ansible playbooks via AWX or AAP, configure the workflowexecution.config.ansible block.
1. Create the AWX API token secret¶
Generate an API token in your AWX/AAP instance and store it in a Kubernetes Secret. The secret name is user-chosen -- it just needs to match tokenSecretRef.name in step 3.
```shell
kubectl create secret generic awx-api-token \
  --from-literal=token=<YOUR_AWX_API_TOKEN> \
  -n kubernaut-system
```
Replace `awx-api-token` with your preferred name (e.g. `aap-api-token` for AAP deployments).
2. Grant RBAC for the token secret¶
The workflowexecution-controller ServiceAccount needs permission to read the token secret at startup. The chart does not create this RBAC automatically -- you must create it:
```shell
kubectl create role awx-token-reader \
  --verb=get --resource=secrets --resource-name=awx-api-token \
  -n kubernaut-system

kubectl create rolebinding awx-token-reader \
  --role=awx-token-reader \
  --serviceaccount=kubernaut-system:workflowexecution-controller \
  -n kubernaut-system
```
Replace `awx-api-token` in `--resource-name` with the secret name you chose in step 1.
**Without this RBAC, the ansible executor is silently skipped.** The controller logs "Failed to read AWX token secret, ansible executor not available" and only registers the `job` and `tekton` engines. Any WorkflowExecution with `engine: ansible` will fail with `unsupported execution engine: "ansible"`.
3. Configure Helm values¶
Uncomment the `ansible` block in your values file:

```yaml
workflowexecution:
  config:
    ansible:
      apiURL: "https://awx.example.com"
      insecure: false       # set true to skip TLS verification
      organizationID: 1     # AWX organization ID for credential creation
      tokenSecretRef:
        name: awx-api-token # Secret created in step 1
        key: token          # key within the Secret
        namespace: ""       # empty = release namespace (kubernaut-system)
```
| Parameter | Required | Default | Description |
|---|---|---|---|
| `ansible.apiURL` | Yes | -- | AWX/AAP API base URL |
| `ansible.insecure` | No | `false` | Skip TLS certificate verification |
| `ansible.organizationID` | No | `1` | AWX organization ID for ephemeral credential creation |
| `ansible.tokenSecretRef.name` | Yes | -- | Kubernetes Secret name containing the AWX API token |
| `ansible.tokenSecretRef.key` | No | `token` | Key within the Secret |
| `ansible.tokenSecretRef.namespace` | No | release namespace | Namespace of the token Secret |
4. Verify¶
After installing or upgrading with the ansible config, check the controller logs and confirm the ansible engine is registered.
**Automatic K8s API credentials for playbooks.** The ansible executor automatically injects the WE controller's in-cluster ServiceAccount token as an ephemeral AWX credential on every job launch. Playbooks using `kubernetes.core` modules receive `K8S_AUTH_HOST`, `K8S_AUTH_API_KEY`, and `K8S_AUTH_SSL_CA_CERT` without manual credential configuration. If the in-cluster environment is unavailable, the job proceeds without K8s credentials.
For authoring ansible workflows, see Ansible (AWX/AAP) in Remediation Workflows and Workflow Execution Architecture.
TLS and Certificate Management¶
The Auth Webhook requires TLS certificates for Kubernetes admission webhook communication.
The chart supports three modes for managing TLS certificates used by the admission webhooks, controlled by tls.mode:
Hook Mode (tls.mode: hook) -- Default¶
Self-signed certificates are generated and managed by Helm hooks. No external dependencies required. Suitable for development, testing, and CI environments.
How it works:
- **Pre-install/pre-upgrade** (`tls-cert-gen`): Generates a self-signed CA and server certificate, stored as the `authwebhook-tls` Secret and `authwebhook-ca` ConfigMap.
- **Post-install/post-upgrade** (`tls-cabundle-patch`): Patches the `caBundle` field on the webhook configurations.
- **Post-delete** (`tls-cleanup`): Removes the `authwebhook-tls` Secret and `authwebhook-ca` ConfigMap.
Automatic renewal: On `helm upgrade`, if the certificate expires within 30 days, it is automatically regenerated. Additionally, the AuthWebhook init-container patches the `caBundle` on every pod restart, making the TLS configuration self-healing.
Recovery: If the `authwebhook-ca` ConfigMap is accidentally deleted while `authwebhook-tls` still exists, delete the `authwebhook-tls` Secret and run `helm upgrade` to regenerate both:

```shell
kubectl delete secret authwebhook-tls -n kubernaut-system
helm upgrade kubernaut kubernaut/kubernaut -n kubernaut-system -f my-values.yaml
```
Note: `helm template` output will not show `caBundle` on webhook configurations. This is expected -- the hook injects it at runtime after the webhook resources are created.
cert-manager Mode (tls.mode: cert-manager) -- Production¶
Certificates are managed by cert-manager. Recommended for production environments. cert-manager handles issuance, renewal, and caBundle injection automatically.
Prerequisites:

- Install cert-manager (v1.12+):

    ```shell
    kubectl apply -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.yaml
    kubectl wait --for=condition=Available deployment --all -n cert-manager --timeout=120s
    ```

- Create an Issuer or ClusterIssuer. For development with cert-manager, a self-signed issuer works:

    ```yaml
    apiVersion: cert-manager.io/v1
    kind: ClusterIssuer
    metadata:
      name: selfsigned-issuer
    spec:
      selfSigned: {}
    ```

    For production, use your organization's CA or an ACME issuer (e.g., Let's Encrypt).

- Install the chart with cert-manager mode:

    ```shell
    helm install kubernaut kubernaut/kubernaut \
      --namespace kubernaut-system \
      --set tls.mode=cert-manager \
      --set tls.certManager.issuerRef.name=selfsigned-issuer \
      -f my-values.yaml
    ```
The chart creates a Certificate resource (`authwebhook-cert`) that provisions the `authwebhook-tls` Secret. cert-manager's cainjector automatically writes the `caBundle` into the webhook configurations via the `cert-manager.io/inject-ca-from` annotation.
No TLS hook jobs are created in this mode -- cert-manager handles the full lifecycle including renewal.
Migrating from Hook to cert-manager¶
To switch an existing installation from `tls.mode=hook` to `tls.mode=cert-manager`:

1. Install cert-manager and create an Issuer/ClusterIssuer (see Installation).
2. Upgrade with the new mode. The hook-generated Secret and ConfigMap are replaced by cert-manager-managed resources; the old hook cleanup job removes the previous artifacts.
3. Verify the webhook is serving the new certificate.
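A minimal sketch of the upgrade command (the release name `kubernaut`, values file `my-values.yaml`, and issuer name `selfsigned-issuer` are placeholders):

```shell
helm upgrade kubernaut kubernaut/kubernaut \
  --namespace kubernaut-system \
  --set tls.mode=cert-manager \
  --set tls.certManager.issuerRef.name=selfsigned-issuer \
  -f my-values.yaml
```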
See Troubleshooting if webhook calls fail after migration.
Manual Mode (tls.mode: manual) -- External PKI¶
For environments where TLS certificates are managed externally (service mesh, external PKI, CI pipelines). The chart creates no TLS-related hook Jobs, no Certificate resources, and no caBundle patching.
Operator responsibilities:
- Pre-create the `authwebhook-tls` Secret with `tls.crt` and `tls.key` entries
- Pre-create the `authwebhook-ca` ConfigMap with the CA bundle
- Ensure the `caBundle` field on `ValidatingWebhookConfiguration` resources matches the CA
```shell
helm install kubernaut charts/kubernaut/ \
  --namespace kubernaut-system \
  --set tls.mode=manual \
  -f my-values.yaml
```
This mode is useful when a service mesh (e.g., Istio) handles mTLS between the API server and webhooks, or when certificates are provisioned by an external PKI and injected via a sidecar or init container.
CA Bundle Self-Healing¶
In hook mode, the AuthWebhook deployment includes an init-container that patches the caBundle field on the ValidatingWebhookConfiguration at startup. This makes TLS self-healing across Helm upgrades and interrupted installs -- if the caBundle drifts from the actual CA, the next pod restart corrects it automatically.
Hot-Reload and Graceful Shutdown¶
Understanding which configuration changes take effect live vs which require a restart is critical for operational confidence.
Hot-Reload Support¶
| Configuration | Hot-Reload | Mechanism | Latency |
|---|---|---|---|
| SP unified Rego policy (`policy.rego` -- environment, severity, priority, custom labels) | Yes | fsnotify file watcher | ~60s (kubelet sync) |
| AA approval policy | Yes | fsnotify file watcher | ~60s |
| Notification credentials | Yes | fsnotify file watcher | ~60s |
| Notification routing | Yes | fsnotify file watcher | ~60s |
| HolmesGPT config | Yes | Python watchdog | ~60s |
| Gateway config | No | Restart required | -- |
| DataStorage config | No | Restart required | -- |
| Proactive signal mappings | No | Restart required | -- |
Policies are validated before reload -- if the new policy has a syntax error, the previous policy is kept and an error is logged. No service interruption occurs.
Graceful Shutdown¶
All services implement graceful shutdown to ensure in-flight remediations are not disrupted during rolling updates:
| Service | Shutdown Behavior |
|---|---|
| Gateway | Sets shutdown flag → readiness probe returns 503 → waits 5s for endpoint removal → drains in-flight requests → closes resources |
| DataStorage | Same shutdown sequence as Gateway |
| CRD Controllers (SP, AA, RO, WFE, EM, NT) | controller-runtime built-in signal handling; in-flight reconciles complete |
| HolmesGPT API | Python SIGTERM handler; readiness returns 503; in-flight investigations complete |
This means `helm upgrade` and rolling updates do not disrupt in-flight remediations. The readiness probe change ensures no new traffic reaches the pod during drain.
Next Steps¶
- HolmesGPT SDK Config -- LLM provider, toolsets, and MCP server configuration
- SignalProcessing Rego Policies -- Policy bundle format and customization
- AIAnalysis Approval Policy -- Approval gates and risk factors
- Notification Routing -- Routing schema and Slack setup
- Rego Policies -- Rego language reference for classification policies
- Notification Channels -- Setting up Slack and other channels
- Remediation Workflows -- Authoring and registering workflows
- Installation -- Using these values during deployment