Kubernaut Agent SDK Config¶
The Kubernaut Agent reads its LLM configuration from an SDK config ConfigMap. This page documents the schema, provisioning methods, and provider-specific examples.
v1.4: breaking YAML changes for the Kubernaut Agent
CamelCase migration (ADR-030). Every field in KA-facing YAML configs now uses camelCase. Older snake_case keys (api_key, timeout_seconds, mcp_servers, prometheus_url, and similar) must be renamed — existing ConfigMaps fail validation until updated.
Three top-level domains. Configuration is reorganized under runtime, ai, and integrations:
- runtime: operational/process settings (server and related knobs are nested here).
- ai: LLM/provider options (for example, llm blocks live under ai).
- integrations: external surfaces (tools / toolsets and mcp_servers equivalents are nested here).
Two ConfigMaps. KA consumes a static ConfigMap mounted at pod startup (bootstrap and fields that cannot change safely at runtime) and a separate hot-reloadable ConfigMap watched at runtime. Edits to AI model, tooling, MCP, and other supported fields on the reloadable bundle take effect without restarting the pod (subject to watcher sync latency — see Hot-Reload).
Before vs after (illustrative)¶
Before (< v1.4, flat layout + snake_case):
llm:
  provider: openai
  model: gpt-4o
  timeout_seconds: 120
  max_retries: 3
toolsets:
  prometheus/metrics:
    enabled: true
    config:
      prometheus_url: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"
mcp_servers: {}
After (v1.4, three domains + camelCase):
runtime:
  server: {}
ai:
  llm:
    provider: openai
    model: gpt-4o
    timeoutSeconds: 120
    maxRetries: 3
integrations:
  tools:
    prometheus/metrics:
      enabled: true
      config:
        prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"
  mcpServers: {}
Regenerate manifests from values.schema.json and the canonical chart examples when upgrading Helm releases — do not partially rename keys.
Overview¶
| Property | Value |
|---|---|
| Historical ConfigMap (< v1.4) | kubernaut-agent-sdk-config with key sdk-config.yaml mounted under /etc/kubernaut-agent/sdk/ |
| v1.4+ manifests | Helm renders paired volumes: a static ConfigMap (startup) plus a hot-reloadable ConfigMap (runtime watcher); exact metadata keys and directories are defined alongside values.schema.json in the shipped chart templates — align upgrades with release examples instead of renaming keys ad hoc |
| Required | Yes — chart fails at install when LLM / SDK prerequisites are missing |
Re-read the v1.4 upgrade warning at the top of this page before touching live manifests.
Provisioning¶
Three options are available, with the following precedence: existingSdkConfigMap > sdkConfigContent > llm.provider + llm.model.
Option A: Quickstart (recommended for getting started)¶
Set the provider and model directly in Helm values. The chart generates a minimal SDK config ConfigMap automatically.
helm install kubernaut charts/kubernaut/ \
--set kubernautAgent.llm.provider=openai \
--set kubernautAgent.llm.model=gpt-4o \
...
Supported quickstart providers: openai, anthropic (any provider needing only an API key). For Vertex AI, Azure, or advanced setups, use Option B or C.
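The same quickstart settings can be kept in a values file instead of repeated --set flags. This is a minimal sketch: the key paths come from the command above, while the file name values-quickstart.yaml is only an illustration.

```yaml
# values-quickstart.yaml (hypothetical file name)
# Equivalent to the two --set flags shown above.
kubernautAgent:
  llm:
    provider: openai   # quickstart supports providers that need only an API key
    model: gpt-4o
```

Pass it to Helm with -f values-quickstart.yaml alongside any other values you already use.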
Option B: Inline content¶
Provide the full SDK config file via --set-file. The chart creates the ConfigMap from this content.
helm install kubernaut charts/kubernaut/ \
--set-file kubernautAgent.sdkConfigContent=my-sdk-config.yaml \
...
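Because --set-file simply loads the file's text into the kubernautAgent.sdkConfigContent value, the same content can be embedded in a values file as a block scalar. The sketch below assumes the v1.4 three-domain layout; the provider, model, and timeout values are illustrative.

```yaml
# values.yaml fragment: inline SDK config as a multi-line string
kubernautAgent:
  sdkConfigContent: |
    ai:
      llm:
        provider: openai      # illustrative values
        model: gpt-4o
        timeoutSeconds: 120
    integrations:
      toolsets: {}
      mcpServers: {}
```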
Option C: Pre-existing ConfigMap¶
Create the ConfigMap yourself and reference it by name. The chart skips creating kubernaut-agent-sdk-config and mounts your ConfigMap instead.
kubectl create configmap my-sdk-config \
--from-file=sdk-config.yaml=my-sdk-config.yaml \
-n kubernaut-system
helm install kubernaut charts/kubernaut/ \
--set kubernautAgent.existingSdkConfigMap=my-sdk-config \
...
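If you manage manifests declaratively, the kubectl create configmap step can be written as a manifest instead. This is a sketch under two assumptions: the data key is sdk-config.yaml as in the command above, and the embedded config body is only an illustrative v1.4-style example. Check the shipped chart templates for the exact keys expected by your release.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-sdk-config
  namespace: kubernaut-system
data:
  # Key name matches --from-file=sdk-config.yaml=... in the command above.
  sdk-config.yaml: |
    ai:
      llm:
        provider: openai    # illustrative values
        model: gpt-4o
```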
Schema Reference¶
v1.4 camelCase / 3-domain structure
The schema below reflects the v1.4 structure. If you are migrating from v1.3 or earlier, see the breaking YAML changes warning at the top of this page.
runtime:
  server: {}                  # Internal server settings (generally left at defaults)
  maxTurns: 40                # Max LLM tool-call turns per investigation (v1.4: increased from 15)
ai:
  llm:
    provider: ""              # Required. One of: openai, ollama, azure, vertex,
                              # vertexAi, anthropic, bedrock, huggingface, mistral
    model: ""                 # Required. e.g., "gpt-4o", "gemini-2.5-pro"
    endpoint: ""              # Server origin without /v1 (required for ollama, azure, mistral)
    apiKey: ""                # Provider API key
    azureApiVersion: ""       # Azure-specific
    vertexProject: ""         # Vertex-specific (vertex and vertexAi)
    vertexLocation: ""        # Vertex-specific
    bedrockRegion: ""         # Bedrock-specific
    structuredOutput: false   # Reserved; KA always enables JSON mode internally (see note below)
    temperature: 0.7          # Creativity vs determinism (0.0--1.0)
    maxRetries: 3             # LLM call retry count
    timeoutSeconds: 120       # Per-call timeout
    customHeaders:            # Optional custom HTTP headers
      - name: "X-Custom"
        value: "..."
    oauth2:                   # Optional OAuth2 client credentials
      enabled: false
      tokenUrl: ""            # Must use https:// when enabled
      clientId: ""
      clientSecret: ""
      scopes: ["scope1"]
integrations:
  toolsets: {}                # Optional: data source toolsets
    # prometheus/metrics:
    #   enabled: true
    #   config:
    #     prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"
  mcpServers: {}              # Optional: Model Context Protocol servers
Operator: BYO runtime ConfigMap
When deploying via the Kubernaut Operator, set spec.kubernautAgent.llm.runtimeConfigMapName to point to a ConfigMap you manage. The operator mounts it as the hot-reloadable config, so changes take effect without pod restart. The static ConfigMap is managed by the operator and should not be edited directly.
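As a rough illustration, the relevant part of the operator custom resource might look like the fragment below. Only the field path spec.kubernautAgent.llm.runtimeConfigMapName is taken from the text above; the resource's apiVersion and kind are omitted because they are not stated here, and the ConfigMap name is hypothetical.

```yaml
# Fragment of the operator-managed custom resource (apiVersion/kind omitted)
spec:
  kubernautAgent:
    llm:
      # Name of a ConfigMap you create and manage yourself (hypothetical name).
      # The operator mounts it as the hot-reloadable config.
      runtimeConfigMapName: my-ka-runtime-config
```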
Supported Providers¶
| Config ai.llm.provider | Backend | Implementation |
|---|---|---|
| openai | OpenAI or OpenAI-compatible API | LangChainGo llms/openai |
| ollama | Ollama | LangChainGo llms/ollama |
| azure | Azure OpenAI | LangChainGo llms/openai (Azure API type) |
| vertex | Google Vertex AI (Gemini models) | LangChainGo llms/googleai/vertex |
| vertex_ai | Claude on Google Vertex AI | Anthropic Go SDK (not LangChainGo) |
| anthropic | Anthropic API (direct) | LangChainGo llms/anthropic |
| bedrock | Amazon Bedrock | LangChainGo llms/bedrock |
| huggingface | Hugging Face | LangChainGo llms/huggingface |
| mistral | Mistral | LangChainGo llms/mistral |
Vertex AI provider distinction
vertex = Gemini models on Vertex AI. vertex_ai = Anthropic Claude models on Vertex AI. These use separate code paths and different SDKs.
Mandatory JSON structured output
KA internally sets JSONMode: true on every LLM request. This is not configurable: the structuredOutput field (structured_output before v1.4) is reserved and has no effect at runtime. Your LLM provider and model must support JSON-mode responses (equivalent to response_format: {"type": "json_object"} in the OpenAI API). All listed providers support this natively. For self-hosted or air-gapped deployments using Ollama or OpenAI-compatible servers, ensure the model supports JSON mode (most instruction-tuned models do).
Toolset Optimization¶
Each enabled toolset injects its full tool schema into every LLM context turn. When a toolset is enabled but never called during an investigation, those schema tokens are pure overhead — they consume budget and can bias the LLM toward irrelevant investigation paths.
Empirical testing shows that loading a single unused toolset (Prometheus, 123 tools) can add ~30% token overhead and increase LLM latency by ~15%, with no benefit to the investigation outcome. See the two-phase toolkit selection discussion for detailed measurements.
Recommendation: Enable only the toolsets needed for your workload. The Kubernetes core toolset (kubectl commands and logs) is always available — it cannot be disabled.
Incident-Type to Toolset Mapping¶
| Incident Type | Recommended Toolsets | Notes |
|---|---|---|
| Config errors, CrashLoopBackOff, OOMKilled | (core only — no extra toolsets) | kubectl access to pods, events, and logs is sufficient |
| SLO burn-rate alerts, latency spikes | prometheus/metrics | Requires Prometheus for metric queries |
| Cloud resource issues | Relevant cloud provider toolset | Add only the provider you use |
Example: Minimal SDK Config (No Optional Toolsets)¶
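A minimal sketch of a config with no optional toolsets, assuming the v1.4 layout from the schema reference. The provider and model values are illustrative.

```yaml
ai:
  llm:
    provider: openai   # illustrative values
    model: gpt-4o
integrations:
  toolsets: {}         # no optional toolsets; the Kubernetes core toolset stays available
  mcpServers: {}
```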
Enable Prometheus only when investigating metric-driven alerts:
integrations:
  toolsets:
    prometheus/metrics:
      enabled: true
      config:
        prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"
Automatic toolset selection
Two-phase toolkit selection automatically loads only the toolsets relevant to each investigation.
Provider Examples¶
OpenAI¶
Secret: kubectl create secret generic llm-credentials --from-literal=OPENAI_API_KEY=sk-...
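A matching SDK config fragment for OpenAI might look like the following. It is a sketch that reuses the values from the v1.4 "after" example at the top of this page, with the API key left to the Secret rather than set inline.

```yaml
ai:
  llm:
    provider: openai
    model: gpt-4o
    timeoutSeconds: 120
    maxRetries: 3
```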
OpenAI-Compatible (vLLM, LocalAI, TGI)¶
Set the endpoint to the server origin without /v1 — the agent appends /v1 automatically.
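For example, a config pointing at a self-hosted OpenAI-compatible server could be sketched as below. The service URL and model name are hypothetical placeholders; the provider stays openai because the providers table maps OpenAI-compatible APIs to that value.

```yaml
ai:
  llm:
    provider: openai
    model: my-served-model                         # hypothetical: the model name your server exposes
    endpoint: http://vllm.inference.svc:8000       # hypothetical URL; origin only, no trailing /v1
    timeoutSeconds: 120
```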
Azure OpenAI¶
ai:
  llm:
    provider: azure
    model: gpt-4o
    endpoint: https://my-resource.openai.azure.com/
    azureApiVersion: "2024-02-15-preview"
    timeoutSeconds: 120
Secret: kubectl create secret generic llm-credentials --from-literal=AZURE_API_KEY=...
Google Vertex AI (Gemini)¶
ai:
  llm:
    provider: vertex
    model: gemini-2.5-pro
    vertexProject: my-project-id
    vertexLocation: us-central1
    timeoutSeconds: 180
Secret: kubectl create secret generic llm-credentials --from-file=application_default_credentials.json=service-account-key.json -n kubernaut-system
The agent auto-detects application_default_credentials.json in the mounted secret and sets GOOGLE_APPLICATION_CREDENTIALS at runtime. GCP Workload Identity is also supported — the secret can be omitted when authentication is handled by the node metadata service.
Claude on Vertex AI¶
ai:
  llm:
    provider: vertex_ai
    model: claude-sonnet-4-20250514
    vertexProject: my-project-id
    vertexLocation: us-east5
    timeoutSeconds: 180
Uses the Anthropic Go SDK directly (not LangChainGo). Requires Vertex AI Model Garden access.
Anthropic (Direct)¶
Secret: kubectl create secret generic llm-credentials --from-literal=ANTHROPIC_API_KEY=...
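A corresponding config fragment for the direct Anthropic API could look like this. It is a sketch assuming the v1.4 layout; the model ID is borrowed from the Claude on Vertex AI example and the timeout is illustrative.

```yaml
ai:
  llm:
    provider: anthropic
    model: claude-sonnet-4-20250514   # illustrative; use the model ID you have access to
    timeoutSeconds: 180
```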
Ollama (Local / Air-Gapped)¶
Recommended for disconnected/air-gapped environments. See Disconnected Installation for setup guidance.
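An Ollama config needs an endpoint, as noted in the schema reference. The fragment below is a sketch: the in-cluster service URL and the model tag are hypothetical, and 11434 is Ollama's default port.

```yaml
ai:
  llm:
    provider: ollama
    model: llama3.1:8b                        # hypothetical tag; must support JSON mode
    endpoint: http://ollama.ollama.svc:11434  # hypothetical service; server origin without /v1
    timeoutSeconds: 180
```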
Secrets Pairing¶
LLM API credentials are stored in a separate Kubernetes Secret (default name: llm-credentials). The chart mounts this Secret into the Kubernaut Agent pod alongside the SDK config. The Secret name is configured via kubernautAgent.llm.credentialsSecretName.
The Secret is marked optional: true — the agent starts without it but all LLM calls fail until it is created.
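For declarative setups, the kubectl create secret commands shown in the provider examples can be expressed as a manifest. This sketch uses the default Secret name and the OpenAI key name from this page; the namespace is assumed to be kubernaut-system, and the key value remains a placeholder.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: llm-credentials          # default; override via kubernautAgent.llm.credentialsSecretName
  namespace: kubernaut-system    # assumed namespace
type: Opaque
stringData:
  OPENAI_API_KEY: "sk-..."       # placeholder; never commit a real key
```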
Temperature Tuning¶
- 0.3--0.5: More deterministic. Recommended for production.
- 0.7 (default): Balanced.
- 0.8--1.0: More creative. May discover non-obvious root causes but less consistent.
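As a small illustration, a production-leaning setting from the first range above would be written as follows, assuming the v1.4 key path ai.llm.temperature.

```yaml
ai:
  llm:
    temperature: 0.3   # lower end of the recommended production range (0.3--0.5)
```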
Hot-Reload¶
From v1.4 onward the agent watches only the hot-reloadable ConfigMap bundle; startup-only YAML remains on the static ConfigMap. For prior releases this page described a single mounted SDK bundle. Treat AI, tool, and MCP fields as residing on the watched volume unless your chart splits them explicitly.
Reloadable changes are detected via an fsnotify file watcher once the kubelet has synced the updated ConfigMap into the pod (~60s sync delay). No pod restart is required for most fields on that bundle.
Restart-required fields (changes are rejected with a warning log):
- ai.llm.provider
- ai.llm.oauth2.tokenUrl, ai.llm.oauth2.clientId, ai.llm.oauth2.clientSecret
Hot-reloadable fields: model, endpoint, apiKey, azureApiVersion, vertexProject, vertexLocation, bedrockRegion, temperature, maxRetries, timeoutSeconds, customHeaders, oauth2.scopes
Active investigations are pinned to the client/model snapshot at start — reload only affects new investigations.
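The annotated fragment below sketches how an edit to the hot-reloadable bundle would be treated, based on the field lists above. The specific values are illustrative, not recommendations.

```yaml
ai:
  llm:
    provider: openai       # restart-required: a changed value is rejected with a warning log
    model: gpt-4o          # hot-reloadable: applies to investigations started after the reload
    temperature: 0.3       # hot-reloadable
    timeoutSeconds: 180    # hot-reloadable
    maxRetries: 5          # hot-reloadable
```

Investigations already in progress keep the client/model snapshot they started with, so a model change is only observable on the next investigation.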
Reference File¶
A complete example is available in the chart: charts/kubernaut/examples/sdk-config.yaml