Kubernaut Agent SDK Config

The Kubernaut Agent reads its LLM configuration from an SDK config ConfigMap. This page documents the schema, provisioning methods, and provider-specific examples.

v1.4: breaking YAML changes for the Kubernaut Agent

CamelCase migration (ADR-030). Every field in KA-facing YAML configs now uses camelCase. Older snake_case keys (api_key, timeout_seconds, mcp_servers, prometheus_url, and similar) must be renamed — existing ConfigMaps fail validation until updated.

Three top-level domains. Configuration is reorganized under runtime, ai, and integrations:

  • runtime: operational/process settings (server and related knobs are nested here).
  • ai: LLM/provider options (for example, llm blocks live under ai).
  • integrations: external surfaces (tools / toolsets and mcpServers, the renamed mcp_servers, are nested here).

Two ConfigMaps. KA consumes a static ConfigMap mounted at pod startup (bootstrap and fields that cannot change safely at runtime) and a separate hot-reloadable ConfigMap watched at runtime. Edits to AI model, tooling, MCP, and other supported fields on the reloadable bundle take effect without restarting the pod (subject to watcher sync latency — see Hot-Reload).

Before vs after (illustrative)

Before (< v1.4, flat layout + snake_case):

llm:
  provider: openai
  model: gpt-4o
  timeout_seconds: 120
  max_retries: 3

toolsets:
  prometheus/metrics:
    enabled: true
    config:
      prometheus_url: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

mcp_servers: {}

After (v1.4, three domains + camelCase):

runtime:
  server: {}

ai:
  llm:
    provider: openai
    model: gpt-4o
    timeoutSeconds: 120
    maxRetries: 3

integrations:
  tools:
    prometheus/metrics:
      enabled: true
      config:
        prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"
  mcpServers: {}

Regenerate manifests from values.schema.json and the canonical chart examples when upgrading Helm releases — do not partially rename keys.

Overview

| Property | Value |
| --- | --- |
| Historical ConfigMap (< v1.4) | kubernaut-agent-sdk-config with key sdk-config.yaml, mounted under /etc/kubernaut-agent/sdk/ |
| v1.4+ manifests | Helm renders paired volumes: a static ConfigMap (startup) plus a hot-reloadable ConfigMap (runtime watcher). Exact metadata keys and directories are defined alongside values.schema.json in the shipped chart templates; align upgrades with release examples instead of renaming keys ad hoc |
| Required | Yes. The chart fails at install when LLM / SDK prerequisites are missing |

Re-read the v1.4 upgrade warning at the top of this page before touching live manifests.

Provisioning

Three options are available, with the following precedence: existingSdkConfigMap > sdkConfigContent > llm.provider + llm.model.

Option A: Quickstart

Set the provider and model directly in Helm values. The chart generates a minimal SDK config ConfigMap automatically.

helm install kubernaut charts/kubernaut/ \
  --set kubernautAgent.llm.provider=openai \
  --set kubernautAgent.llm.model=gpt-4o \
  ...

Supported quickstart providers: openai, anthropic (any provider needing only an API key). For Vertex AI, Azure, or advanced setups, use Option B or C.
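
The same two settings can also be kept in a values file rather than passed as --set flags. The sketch below uses only the key paths from the command above; the file name values-quickstart.yaml is an assumption for illustration.

```yaml
# values-quickstart.yaml (hypothetical file name)
# Equivalent to --set kubernautAgent.llm.provider=openai
#           and --set kubernautAgent.llm.model=gpt-4o
kubernautAgent:
  llm:
    provider: openai
    model: gpt-4o
```

Pass it with -f values-quickstart.yaml in place of the two --set flags; any other flags your install needs stay unchanged.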

Option B: Inline content

Provide the full SDK config file via --set-file. The chart creates the ConfigMap from this content.

helm install kubernaut charts/kubernaut/ \
  --set-file kubernautAgent.sdkConfigContent=my-sdk-config.yaml \
  ...

Option C: Pre-existing ConfigMap

Create the ConfigMap yourself and reference it by name. The chart skips creating kubernaut-agent-sdk-config and mounts your ConfigMap instead.

kubectl create configmap my-sdk-config \
  --from-file=sdk-config.yaml=my-sdk-config.yaml \
  -n kubernaut-system

helm install kubernaut charts/kubernaut/ \
  --set kubernautAgent.existingSdkConfigMap=my-sdk-config \
  ...
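
If you manage manifests declaratively, the kubectl create configmap command above corresponds roughly to the manifest below. The name, namespace, and data key are taken from that command; the embedded config is only a minimal v1.4-style placeholder, not a recommended configuration.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: my-sdk-config
  namespace: kubernaut-system
data:
  # Key name matches --from-file=sdk-config.yaml=... in the command above.
  sdk-config.yaml: |
    ai:
      llm:
        provider: openai
        model: gpt-4o
```

Apply it with kubectl apply -f before installing the chart, then reference it through kubernautAgent.existingSdkConfigMap as shown above.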

Schema Reference

v1.4 camelCase / 3-domain structure

The schema below reflects the v1.4 structure. If you are migrating from v1.3 or earlier, see the breaking YAML changes warning at the top of this page.

runtime:
  server: {}                  # Internal server settings (generally left at defaults)
  maxTurns: 40                # Max LLM tool-call turns per investigation (v1.4: increased from 15)

ai:
  llm:
    provider: ""              # Required. One of: openai, ollama, azure, vertex,
                              #   vertexAi, anthropic, bedrock, huggingface, mistral
    model: ""                 # Required. e.g., "gpt-4o", "gemini-2.5-pro"
    endpoint: ""              # Server origin without /v1 (required for ollama, azure, mistral)
    apiKey: ""                # Provider API key
    azureApiVersion: ""       # Azure-specific
    vertexProject: ""         # Vertex-specific (vertex and vertexAi)
    vertexLocation: ""        # Vertex-specific
    bedrockRegion: ""         # Bedrock-specific
    structuredOutput: false   # Reserved; KA always enables JSON mode internally (see note below)
    temperature: 0.7          # Creativity vs determinism (0.0--1.0)
    maxRetries: 3             # LLM call retry count
    timeoutSeconds: 120       # Per-call timeout
    customHeaders:            # Optional custom HTTP headers
      - name: "X-Custom"
        value: "..."
    oauth2:                   # Optional OAuth2 client credentials
      enabled: false
      tokenUrl: ""            # Must use https:// when enabled
      clientId: ""
      clientSecret: ""
      scopes: ["scope1"]

integrations:
  toolsets: {}              # Optional: data source toolsets
    # prometheus/metrics:
    #   enabled: true
    #   config:
    #     prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

  mcpServers: {}            # Optional: Model Context Protocol servers

Operator: BYO runtime ConfigMap

When deploying via the Kubernaut Operator, set spec.kubernautAgent.llm.runtimeConfigMapName to point to a ConfigMap you manage. The operator mounts it as the hot-reloadable config, so changes take effect without pod restart. The static ConfigMap is managed by the operator and should not be edited directly.
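
A minimal sketch of the relevant fragment of the operator's custom resource follows. Only the spec.kubernautAgent.llm.runtimeConfigMapName path comes from this page; the ConfigMap name is hypothetical, and apiVersion, kind, and metadata are omitted because they are not documented here.

```yaml
# Fragment only: apiVersion, kind, and metadata are intentionally omitted.
spec:
  kubernautAgent:
    llm:
      runtimeConfigMapName: my-runtime-config   # hypothetical name of a ConfigMap you manage
```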

Supported Providers

| Config ai.llm.provider | Backend | Implementation |
| --- | --- | --- |
| openai | OpenAI or OpenAI-compatible API | LangChainGo llms/openai |
| ollama | Ollama | LangChainGo llms/ollama |
| azure | Azure OpenAI | LangChainGo llms/openai (Azure API type) |
| vertex | Google Vertex AI (Gemini models) | LangChainGo llms/googleai/vertex |
| vertex_ai | Claude on Google Vertex AI | Anthropic Go SDK (not LangChainGo) |
| anthropic | Anthropic API (direct) | LangChainGo llms/anthropic |
| bedrock | Amazon Bedrock | LangChainGo llms/bedrock |
| huggingface | Hugging Face | LangChainGo llms/huggingface |
| mistral | Mistral | LangChainGo llms/mistral |

Vertex AI provider distinction

vertex = Gemini models on Vertex AI. vertex_ai = Anthropic Claude models on Vertex AI. These use separate code paths and different SDKs.

Mandatory JSON structured output

KA internally sets JSONMode: true on every LLM request. This is not configurable: the structuredOutput field in the config is reserved and has no effect at runtime. Your LLM provider/model must support schema-constrained JSON responses (equivalent to response_format: {"type": "json_object"} in the OpenAI API). All listed providers support this natively. For self-hosted or air-gapped deployments using Ollama or OpenAI-compatible servers, ensure the model supports JSON mode (most instruction-tuned models do).

Toolset Optimization

Each enabled toolset injects its full tool schema into every LLM context turn. When a toolset is enabled but never called during an investigation, those schema tokens are pure overhead — they consume budget and can bias the LLM toward irrelevant investigation paths.

Empirical testing shows that loading a single unused toolset (Prometheus, 123 tools) can add ~30% token overhead and increase LLM latency by ~15%, with no benefit to the investigation outcome. See the two-phase toolkit selection discussion for detailed measurements.

Recommendation: Enable only the toolsets needed for your workload. The Kubernetes core toolset (kubectl commands and logs) is always available — it cannot be disabled.

Incident-Type to Toolset Mapping

| Incident Type | Recommended Toolsets | Notes |
| --- | --- | --- |
| Config errors, CrashLoopBackOff, OOMKilled | (core only, no extra toolsets) | kubectl access to pods, events, and logs is sufficient |
| SLO burn-rate alerts, latency spikes | prometheus/metrics | Requires Prometheus for metric queries |
| Cloud resource issues | Relevant cloud provider toolset | Add only the provider you use |

Example: Minimal SDK Config (No Optional Toolsets)

ai:
  llm:
    provider: openai
    model: gpt-4o
    temperature: 0.5

integrations:
  toolsets: {}

Enable Prometheus only when investigating metric-driven alerts:

integrations:
  toolsets:
    prometheus/metrics:
      enabled: true
      config:
        prometheusUrl: "http://kube-prometheus-stack-prometheus.monitoring.svc:9090"

Automatic toolset selection

Two-phase toolkit selection automatically loads only the toolsets relevant to each investigation.

Provider Examples

OpenAI

ai:
  llm:
    provider: openai
    model: gpt-4o
    temperature: 0.7
    timeoutSeconds: 120

Secret: kubectl create secret generic llm-credentials --from-literal=OPENAI_API_KEY=sk-...

OpenAI-Compatible (vLLM, LocalAI, TGI)

ai:
  llm:
    provider: openai
    model: gpt-4o
    endpoint: http://vllm.internal.svc:8000

Set the endpoint to the server origin without /v1 — the agent appends /v1 automatically.

Azure OpenAI

ai:
  llm:
    provider: azure
    model: gpt-4o
    endpoint: https://my-resource.openai.azure.com/
    azureApiVersion: "2024-02-15-preview"
    timeoutSeconds: 120

Secret: kubectl create secret generic llm-credentials --from-literal=AZURE_API_KEY=...

Google Vertex AI (Gemini)

ai:
  llm:
    provider: vertex
    model: gemini-2.5-pro
    vertexProject: my-project-id
    vertexLocation: us-central1
    timeoutSeconds: 180

Secret: kubectl create secret generic llm-credentials --from-file=application_default_credentials.json=service-account-key.json -n kubernaut-system

The agent auto-detects application_default_credentials.json in the mounted secret and sets GOOGLE_APPLICATION_CREDENTIALS at runtime. GCP Workload Identity is also supported — the secret can be omitted when authentication is handled by the node metadata service.

Claude on Vertex AI

ai:
  llm:
    provider: vertex_ai
    model: claude-sonnet-4-20250514
    vertexProject: my-project-id
    vertexLocation: us-east5
    timeoutSeconds: 180

Uses the Anthropic Go SDK directly (not LangChainGo). Requires Vertex AI Model Garden access.

Anthropic (Direct)

ai:
  llm:
    provider: anthropic
    model: claude-sonnet-4-20250514
    timeoutSeconds: 180

Secret: kubectl create secret generic llm-credentials --from-literal=ANTHROPIC_API_KEY=...

Ollama (Local / Air-Gapped)

ai:
  llm:
    provider: ollama
    model: llama3
    endpoint: http://ollama.internal.svc:11434

Recommended for disconnected/air-gapped environments. See Disconnected Installation for setup guidance.

Secrets Pairing

LLM API credentials are stored in a separate Kubernetes Secret (default name: llm-credentials). The chart mounts this Secret into the Kubernaut Agent pod alongside the SDK config. The Secret name is configured via kubernautAgent.llm.credentialsSecretName.

The Secret is marked optional: true — the agent starts without it but all LLM calls fail until it is created.
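
To point the chart at a differently named Secret, the value above can be set in a Helm values file. This is a sketch using only the kubernautAgent.llm.credentialsSecretName key named in this section; the Secret name shown is hypothetical.

```yaml
kubernautAgent:
  llm:
    # Hypothetical name; when unset, the chart uses the default llm-credentials.
    credentialsSecretName: my-llm-credentials
```

The Secret itself is still created separately, for example with one of the kubectl create secret commands in the provider examples.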

Temperature Tuning

  • 0.3-0.5: More deterministic. Recommended for production.
  • 0.7 (default): Balanced.
  • 0.8-1.0: More creative. May discover non-obvious root causes but less consistent.

Hot-Reload

From v1.4 onward, the agent watches only the hot-reloadable ConfigMap; startup-only settings stay on the static ConfigMap. Earlier versions of this page described a single mounted SDK bundle. Treat AI, tool, and MCP fields as residing on the watched volume unless your chart splits them explicitly.

Changes are detected by an fsnotify file watcher once the kubelet has synced the updated ConfigMap into the pod, which can take about 60 seconds. No pod restart is required for most fields on that bundle.

Restart-required fields (changes are rejected with a warning log):

  • ai.llm.provider
  • ai.llm.oauth2.tokenUrl, ai.llm.oauth2.clientId, ai.llm.oauth2.clientSecret

Hot-reloadable fields (all under ai.llm): model, endpoint, apiKey, azureApiVersion, vertexProject, vertexLocation, bedrockRegion, temperature, maxRetries, timeoutSeconds, customHeaders, oauth2.scopes

Active investigations are pinned to the client/model snapshot at start — reload only affects new investigations.
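
As a sketch of how these rules play out, suppose the hot-reloadable config is edited as below. The specific values are illustrative assumptions; the per-field behavior noted in the comments follows the lists above.

```yaml
# Hot-reloadable ConfigMap content (v1.4 layout) after an edit.
ai:
  llm:
    provider: openai       # restart-required: a changed value here is rejected with a warning log
    model: gpt-4o-mini     # hot-reloadable: was gpt-4o
    temperature: 0.3       # hot-reloadable: was 0.7
    timeoutSeconds: 180    # hot-reloadable: was 120
```

Once the watcher picks up the change, new investigations use the updated model, temperature, and timeout, while investigations already in flight keep the snapshot they started with.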

Reference File

A complete example is available in the chart: charts/kubernaut/examples/sdk-config.yaml