Kubernaut Agent SDK Config¶

The Kubernaut Agent reads its LLM configuration from an SDK config ConfigMap. This page documents the schema, provisioning methods, and provider-specific examples.

v1.4: breaking YAML changes for the Kubernaut Agent

CamelCase migration (ADR-030). Every field in KA-facing YAML configs now uses camelCase. Older snake_case keys (api_key, timeout_seconds, mcp_servers, prometheus_url, and similar) must be renamed — existing ConfigMaps fail validation until updated.

Three top-level domains. Configuration is reorganized under runtime, ai, and integrations:

runtime — operational/process settings (server and related knobs are nested here).
ai — LLM/provider options (for example llm blocks live under ai).
integrations — external surfaces (tools / toolsets and mcp_servers equivalents are nested here).

Two ConfigMaps. KA consumes a static ConfigMap mounted at pod startup (bootstrap and fields that cannot change safely at runtime) and a separate hot-reloadable ConfigMap watched at runtime. Edits to AI model, tooling, MCP, and other supported fields on the reloadable bundle take effect without restarting the pod (subject to watcher sync latency — see Hot-Reload).

Before vs after (illustrative)¶

Before (< v1.4, flat layout + snake_case):

llm:
  provider: openai
  model: gpt-4o
  timeout_seconds: 120
  max_retries: 3

toolsets:
  prometheus/metrics:
    enabled: true
    config:
      prometheus_url: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

mcp_servers: {}

After (v1.4, three domains + camelCase):

runtime:
  server: {}

ai:
  llm:
    provider: openai
    model: gpt-4o
    timeoutSeconds: 120
    maxRetries: 3

integrations:
  tools:
    prometheus/metrics:
      enabled: true
      config:
        prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"
  mcpServers: {}

Regenerate manifests from values.schema.json and the canonical chart examples when upgrading Helm releases — do not partially rename keys.
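
Before regenerating, it can help to audit an existing config for leftover snake_case keys. The following is a minimal sketch, not part of the chart: `find_snake_case_keys` and `snake_to_camel` are hypothetical helpers, and the config is assumed to be already parsed into a Python dict (for example with PyYAML's `yaml.safe_load`, which is not in the standard library).

```python
def snake_to_camel(key):
    """Convert a snake_case key to camelCase (timeout_seconds -> timeoutSeconds)."""
    head, *rest = key.split("_")
    return head + "".join(part[:1].upper() + part[1:] for part in rest)


def find_snake_case_keys(node, path=""):
    """Yield (dotted_path, suggested_name) for every mapping key containing '_'."""
    if isinstance(node, dict):
        for key, value in node.items():
            here = f"{path}.{key}" if path else str(key)
            if isinstance(key, str) and "_" in key:
                yield here, snake_to_camel(key)
            yield from find_snake_case_keys(value, here)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_snake_case_keys(item, f"{path}[{index}]")


# The pre-v1.4 example from this page, as a parsed dict.
legacy = {
    "llm": {"provider": "openai", "model": "gpt-4o",
            "timeout_seconds": 120, "max_retries": 3},
    "toolsets": {"prometheus/metrics": {"enabled": True, "config": {
        "prometheus_url": "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"}}},
    "mcp_servers": {},
}

for dotted_path, suggestion in find_snake_case_keys(legacy):
    print(f"{dotted_path} -> {suggestion}")
```

The sketch only reports keys; it does not move them under runtime, ai, or integrations, and it flags every key containing an underscore, so user-chosen names (such as a toolset or MCP server name) need manual review.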

Overview¶

Property	Value
Historical ConfigMap (< v1.4)	`kubernaut-agent-sdk-config` with key `sdk-config.yaml` mounted under `/etc/kubernaut-agent/sdk/`
v1.4+ manifests	Helm renders paired volumes: a static ConfigMap (startup) plus a hot-reloadable ConfigMap (runtime watcher); exact metadata keys and directories are defined alongside `values.schema.json` in the shipped chart templates — align upgrades with release examples instead of renaming keys ad hoc
Required	Yes — chart fails at install when LLM / SDK prerequisites are missing

Re-read the v1.4 upgrade warning at the top of this page before touching live manifests.

Provisioning¶

Three options are available, with the following precedence: existingSdkConfigMap > sdkConfigContent > llm.provider + llm.model.
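
The precedence rule can be sketched as a small lookup. This is an illustration only, not the chart's template logic: `resolve_sdk_config_source` is a hypothetical name, and treating empty strings as unset is an assumption.

```python
def resolve_sdk_config_source(agent_values):
    """Return which provisioning option wins for a `kubernautAgent` values block."""
    # Highest precedence: a ConfigMap the user created (Option C).
    if agent_values.get("existingSdkConfigMap"):
        return "existingSdkConfigMap"
    # Next: full inline content passed with --set-file (Option B).
    if agent_values.get("sdkConfigContent"):
        return "sdkConfigContent"
    # Lowest: provider + model quickstart (Option A).
    llm = agent_values.get("llm") or {}
    if llm.get("provider") and llm.get("model"):
        return "llm.provider + llm.model"
    raise ValueError("no SDK config source set; one of the three options is required")


values = {
    "existingSdkConfigMap": "my-sdk-config",
    "llm": {"provider": "openai", "model": "gpt-4o"},
}
print(resolve_sdk_config_source(values))
```

With both Option C and the quickstart values set, as above, the pre-existing ConfigMap takes precedence and the quickstart values are not used to generate a config.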

Option A: Quickstart (recommended for getting started)¶

Set the provider and model directly in Helm values. The chart generates a minimal SDK config ConfigMap automatically.

helm install kubernaut charts/kubernaut/ \
  --set kubernautAgent.llm.provider=openai \
  --set kubernautAgent.llm.model=gpt-4o \
  ...

Supported quickstart providers: openai, anthropic (any provider needing only an API key). For Vertex AI, Azure, or advanced setups, use Option B or C.

Option B: Inline content¶

Provide the full SDK config file via --set-file. The chart creates the ConfigMap from this content.

helm install kubernaut charts/kubernaut/ \
  --set-file kubernautAgent.sdkConfigContent=my-sdk-config.yaml \
  ...

Option C: Pre-existing ConfigMap¶

Create the ConfigMap yourself and reference it by name. The chart skips creating kubernaut-agent-sdk-config and mounts your ConfigMap instead.

kubectl create configmap my-sdk-config \
  --from-file=sdk-config.yaml=my-sdk-config.yaml \
  -n kubernaut-system

helm install kubernaut charts/kubernaut/ \
  --set kubernautAgent.existingSdkConfigMap=my-sdk-config \
  ...

Schema Reference¶

v1.4 camelCase / 3-domain structure

The schema below reflects the v1.4 structure. If you are migrating from v1.3 or earlier, see the breaking YAML changes warning at the top of this page.

runtime:
  server: {}                  # Internal server settings (generally left at defaults)
  maxTurns: 40                # Max LLM tool-call turns per investigation (v1.4: increased from 15)

ai:
  llm:
    provider: ""              # Required. One of: openai, ollama, azure, vertex,
                              #   vertexAi, anthropic, bedrock, huggingface, mistral
    model: ""                 # Required. e.g., "gpt-4o", "gemini-2.5-pro"
    endpoint: ""              # Server origin without /v1 (required for ollama, azure, mistral)
    apiKey: ""                # Provider API key
    azureApiVersion: ""       # Azure-specific
    vertexProject: ""         # Vertex-specific (vertex and vertexAi)
    vertexLocation: ""        # Vertex-specific
    bedrockRegion: ""         # Bedrock-specific
    structuredOutput: false   # Reserved; KA always enables JSON mode internally (see note below)
    temperature: 0.7          # Creativity vs determinism (0.0 to 1.0)
    maxRetries: 3             # LLM call retry count
    timeoutSeconds: 120       # Per-call timeout
    customHeaders:            # Optional custom HTTP headers
      - name: "X-Custom"
        value: "..."
    oauth2:                   # Optional OAuth2 client credentials
      enabled: false
      tokenUrl: ""            # Must use https:// when enabled
      clientId: ""
      clientSecret: ""
      scopes: ["scope1"]

integrations:
  toolsets: {}              # Optional: data source toolsets
    # prometheus/metrics:
    #   enabled: true
    #   config:
    #     prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

  mcpServers: {}            # Optional: Model Context Protocol servers

Operator: BYO runtime ConfigMap

When deploying via the Kubernaut Operator, set spec.kubernautAgent.llm.runtimeConfigMapName to point to a ConfigMap you manage. The operator mounts it as the hot-reloadable config, so changes take effect without a pod restart. The static ConfigMap is managed by the operator and should not be edited directly.
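
When you manage the ConfigMap yourself, a quick pre-flight check against the constraints stated in the schema comments can catch mistakes before they reach the cluster. The sketch below is an assumption-laden illustration, not the agent's validator: `validate_llm_config` is a hypothetical helper, it covers only the rules written in the comments above, and it expects the YAML to be parsed into a dict already.

```python
def validate_llm_config(config):
    """Return a list of problems found in the `ai.llm` block (empty list = none found)."""
    llm = (config.get("ai") or {}).get("llm") or {}
    problems = []

    # provider and model are marked "Required" in the schema.
    for field in ("provider", "model"):
        if not llm.get(field):
            problems.append(f"ai.llm.{field} is required")

    # endpoint is required for ollama, azure, and mistral, and is given without /v1.
    endpoint = llm.get("endpoint") or ""
    if llm.get("provider") in ("ollama", "azure", "mistral") and not endpoint:
        problems.append("ai.llm.endpoint is required for ollama, azure, and mistral")
    if endpoint.rstrip("/").endswith("/v1"):
        problems.append("ai.llm.endpoint should be the server origin without /v1")

    # temperature is documented as a 0.0 to 1.0 value (default 0.7).
    temperature = llm.get("temperature", 0.7)
    if not 0.0 <= temperature <= 1.0:
        problems.append("ai.llm.temperature should be within 0.0 to 1.0")

    # tokenUrl must use https:// when OAuth2 is enabled.
    oauth2 = llm.get("oauth2") or {}
    if oauth2.get("enabled") and not str(oauth2.get("tokenUrl", "")).startswith("https://"):
        problems.append("ai.llm.oauth2.tokenUrl must use https:// when enabled")

    return problems


candidate = {"ai": {"llm": {
    "provider": "azure",
    "model": "gpt-4o",
    "oauth2": {"enabled": True, "tokenUrl": "http://idp.example/token"},
}}}
for problem in validate_llm_config(candidate):
    print(problem)
```

The `idp.example` URL is a placeholder. Rules that this page does not state (for example, which providers need which credentials) are deliberately left out.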

Supported Providers¶

Config `ai.llm.provider`	Backend	Implementation
`openai`	OpenAI or OpenAI-compatible API	LangChainGo `llms/openai`
`ollama`	Ollama	LangChainGo `llms/ollama`
`azure`	Azure OpenAI	LangChainGo `llms/openai` (Azure API type)
`vertex`	Google Vertex AI (Gemini models)	LangChainGo `llms/googleai/vertex`
`vertex_ai`	Claude on Google Vertex AI	Anthropic Go SDK (not LangChainGo)
`anthropic`	Anthropic API (direct)	LangChainGo `llms/anthropic`
`bedrock`	Amazon Bedrock	LangChainGo `llms/bedrock`
`huggingface`	Hugging Face	LangChainGo `llms/huggingface`
`mistral`	Mistral	LangChainGo `llms/mistral`

Vertex AI provider distinction

vertex = Gemini models on Vertex AI. vertex_ai = Anthropic Claude models on Vertex AI. These use separate code paths and different SDKs.

Mandatory JSON structured output

KA internally sets JSONMode: true on every LLM request. This is not configurable: the structuredOutput field in the config is reserved and has no effect at runtime. Your LLM provider and model must support JSON-mode responses (equivalent to response_format: {"type": "json_object"} in the OpenAI API). All listed providers support this natively. For self-hosted or air-gapped deployments using Ollama or OpenAI-compatible servers, ensure the model supports JSON mode (most instruction-tuned models do).
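
When evaluating a self-hosted model, one rough smoke test is to check that a raw reply parses as a single JSON object with no surrounding prose or Markdown fences. The helper below is a hypothetical sketch using only the Python standard library; it does not reproduce how KA itself parses responses, and how you obtain the reply text from your server is left out.

```python
import json


def is_json_object(reply):
    """True when the reply parses as one JSON object with no surrounding text."""
    try:
        return isinstance(json.loads(reply), dict)
    except json.JSONDecodeError:
        return False


print(is_json_object('{"rootCause": "OOMKilled"}'))         # bare JSON object
print(is_json_object('Here you go: {"rootCause": "..."}'))  # prose before the JSON
```

The `rootCause` key is made up for illustration. A model that wraps its JSON in explanatory text or code fences fails this check, which is usually a sign that JSON mode is not being honored.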

Toolset Optimization¶

Each enabled toolset injects its full tool schema into every LLM context turn. When a toolset is enabled but never called during an investigation, those schema tokens are pure overhead — they consume budget and can bias the LLM toward irrelevant investigation paths.

Empirical testing shows that loading a single unused toolset (Prometheus, 123 tools) can add ~30% token overhead and increase LLM latency by ~15%, with no benefit to the investigation outcome. See the two-phase toolkit selection discussion for detailed measurements.

Recommendation: Enable only the toolsets needed for your workload. The Kubernetes core toolset (kubectl commands and logs) is always available — it cannot be disabled.

Incident-Type to Toolset Mapping¶

Incident Type	Recommended Toolsets	Notes
Config errors, CrashLoopBackOff, OOMKilled	(core only — no extra toolsets)	`kubectl` access to pods, events, and logs is sufficient
SLO burn-rate alerts, latency spikes	`prometheus/metrics`	Requires Prometheus for metric queries
Cloud resource issues	Relevant cloud provider toolset	Add only the provider you use

Example: Minimal SDK Config (No Optional Toolsets)¶

ai:
  llm:
    provider: openai
    model: gpt-4o
    temperature: 0.5

integrations:
  toolsets: {}

Enable Prometheus only when investigating metric-driven alerts:

integrations:
  toolsets:
    prometheus/metrics:
      enabled: true
      config:
        prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

Automatic toolset selection

Two-phase toolkit selection automatically loads only the toolsets relevant to each investigation.

Provider Examples¶

OpenAI¶

ai:
  llm:
    provider: openai
    model: gpt-4o
    temperature: 0.7
    timeoutSeconds: 120

Secret: kubectl create secret generic llm-credentials --from-literal=OPENAI_API_KEY=sk-...

OpenAI-Compatible (vLLM, LocalAI, TGI)¶

ai:
  llm:
    provider: openai
    model: gpt-4o
    endpoint: http://vllm.internal.svc:8000

Set the endpoint to the server origin without /v1 — the agent appends /v1 automatically.
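
Because many OpenAI-compatible servers advertise their base URL with /v1 included, it is easy to paste the wrong form. A small normalizer such as the hypothetical `to_server_origin` below can be applied before writing the value into the config; it is a sketch, not something the agent or chart provides.

```python
def to_server_origin(endpoint):
    """Strip trailing slashes and a trailing /v1 so the value is a bare server origin."""
    trimmed = endpoint.strip().rstrip("/")
    if trimmed.endswith("/v1"):
        trimmed = trimmed[: -len("/v1")]
    return trimmed.rstrip("/")


print(to_server_origin("http://vllm.internal.svc:8000/v1/"))
```

Only a final /v1 path segment is removed; any other path prefix is left untouched.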

Azure OpenAI¶

ai:
  llm:
    provider: azure
    model: gpt-4o
    endpoint: https://my-resource.openai.azure.com/
    azureApiVersion: "2024-02-15-preview"
    timeoutSeconds: 120

Secret: kubectl create secret generic llm-credentials --from-literal=AZURE_API_KEY=...

Google Vertex AI (Gemini)¶

ai:
  llm:
    provider: vertex
    model: gemini-2.5-pro
    vertexProject: my-project-id
    vertexLocation: us-central1
    timeoutSeconds: 180

Secret: kubectl create secret generic llm-credentials --from-file=application_default_credentials.json=service-account-key.json -n kubernaut-system

The agent auto-detects application_default_credentials.json in the mounted secret and sets GOOGLE_APPLICATION_CREDENTIALS at runtime. GCP Workload Identity is also supported — the secret can be omitted when authentication is handled by the node metadata service.

Claude on Vertex AI¶

ai:
  llm:
    provider: vertex_ai
    model: claude-sonnet-4-20250514
    vertexProject: my-project-id
    vertexLocation: us-east5
    timeoutSeconds: 180

Uses the Anthropic Go SDK directly (not LangChainGo). Requires Vertex AI Model Garden access.

Anthropic (Direct)¶

ai:
  llm:
    provider: anthropic
    model: claude-sonnet-4-20250514
    timeoutSeconds: 180

Secret: kubectl create secret generic llm-credentials --from-literal=ANTHROPIC_API_KEY=...

Ollama (Local / Air-Gapped)¶

ai:
  llm:
    provider: ollama
    model: llama3
    endpoint: http://ollama.internal.svc:11434

Recommended for disconnected/air-gapped environments. See Disconnected Installation for setup guidance.

Secrets Pairing¶

LLM API credentials are stored in a separate Kubernetes Secret (default name: llm-credentials). The chart mounts this Secret into the Kubernaut Agent pod alongside the SDK config. The Secret name is configured via kubernautAgent.llm.credentialsSecretName.

The Secret is marked optional: true — the agent starts without it but all LLM calls fail until it is created.

Temperature Tuning¶

0.3 to 0.5: More deterministic. Recommended for production.
0.7 (default): Balanced.
0.8 to 1.0: More creative. May discover non-obvious root causes, but results are less consistent.

Hot-Reload¶

From v1.4 onward, the agent watches only the hot-reloadable ConfigMap; startup-only YAML stays on the static ConfigMap. Earlier versions of this page described a single mounted SDK bundle; treat AI, tooling, and MCP fields as residing on the watched volume unless your chart splits them explicitly.

Reloadable changes are detected by an fsnotify file watcher. Allow for the kubelet's ConfigMap sync delay (about 60 s) before an edit reaches the pod. No pod restart is required for most fields on that bundle.

Restart-required fields (changes are rejected with a warning log):

ai.llm.provider
ai.llm.oauth2.tokenUrl, ai.llm.oauth2.clientId, ai.llm.oauth2.clientSecret

Hot-reloadable fields (all under ai.llm): model, endpoint, apiKey, azureApiVersion, vertexProject, vertexLocation, bedrockRegion, temperature, maxRetries, timeoutSeconds, customHeaders, oauth2.scopes

Active investigations are pinned to the client/model snapshot at start — reload only affects new investigations.
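
To preview whether a planned edit needs a restart, the two field lists above can be applied to a diff of the old and new `ai.llm` blocks. The following is a sketch under stated assumptions: `classify_llm_changes` and `flatten` are hypothetical helpers, both inputs are already-parsed dicts, and fields that appear in neither list are reported as "unlisted" instead of being guessed.

```python
RESTART_REQUIRED = {"provider", "oauth2.tokenUrl", "oauth2.clientId", "oauth2.clientSecret"}
HOT_RELOADABLE = {
    "model", "endpoint", "apiKey", "azureApiVersion", "vertexProject",
    "vertexLocation", "bedrockRegion", "temperature", "maxRetries",
    "timeoutSeconds", "customHeaders", "oauth2.scopes",
}


def flatten(node, prefix=""):
    """Flatten nested mappings into dotted keys, e.g. {"oauth2.scopes": [...]}."""
    flat = {}
    for key, value in node.items():
        dotted = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            flat.update(flatten(value, dotted))
        else:
            flat[dotted] = value
    return flat


def classify_llm_changes(old_llm, new_llm):
    """Group changed `ai.llm` fields by how this page says they are applied."""
    old_flat, new_flat = flatten(old_llm), flatten(new_llm)
    result = {"restart": [], "hot": [], "unlisted": []}
    for field in sorted(set(old_flat) | set(new_flat)):
        if old_flat.get(field) == new_flat.get(field):
            continue  # unchanged
        if field in RESTART_REQUIRED:
            result["restart"].append(field)
        elif field in HOT_RELOADABLE:
            result["hot"].append(field)
        else:
            result["unlisted"].append(field)
    return result


old = {"provider": "openai", "model": "gpt-4o", "temperature": 0.7}
new = {"provider": "openai", "model": "gpt-4o", "temperature": 0.5, "timeoutSeconds": 180}
print(classify_llm_changes(old, new))
```

An edit that produces a non-empty "restart" list is one the agent would reject with a warning log, so plan a rollout for it instead of relying on hot-reload.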

Reference File¶

A complete example is available in the chart: charts/kubernaut/examples/sdk-config.yaml