# Authoring Workflows and Action Types
This guide explains how to design workflow schemas and action types so that Kubernaut's LLM selects the right workflow for each incident. It covers the 3-step discovery protocol, description engineering, customLabels-based differentiation, and common pitfalls.
Read Remediation Workflows first for schema syntax, registration, and lifecycle. This page focuses on the design decisions that affect selection.
## How Workflow Selection Works
Kubernaut selects workflows through a 3-step discovery protocol (DD-HAPI-017). Understanding each step -- and where your authoring choices matter -- is essential.
### Step 1: Action Type Selection

The LLM calls `list_available_actions` and receives every active action type with its description. It picks the action type whose `whenToUse` best matches the root cause it identified during investigation.
What matters here:
- The action type `description.what` and `description.whenToUse`
- The LLM's understanding of the root cause
What does NOT matter here:
- CustomLabels (zero influence)
- DetectedLabels (zero influence)
- Workflow-level descriptions (not visible yet)
### Step 2: Workflow Ranking

The LLM calls `list_workflows` for the chosen action type. DataStorage returns all active, latest-version workflows under that action type, ordered by `final_score`.
The DataStorage scoring algorithm combines detected label boosts, custom label boosts, and penalties. See Workflow Search and Scoring for the complete formula and boost values.
What matters here:
- DetectedLabels (highest impact on ranking)
- CustomLabels (operator-intent boost)
- Mandatory label filters (severity, environment, component, priority) -- workflows that don't match are excluded entirely
### Step 3: Workflow Selection

The LLM receives the ranked list and picks the workflow whose `description.whenToUse` best fits the incident. It also considers remediation history (what worked or failed before on this resource).
What matters here:
- The workflow `description.whenToUse` and `description.whenNotToUse`
- The DataStorage ranking (the LLM tends to prefer higher-ranked workflows when descriptions are similar)
- Remediation history
### Key Insight
Steps 2 and 3 work together: DataStorage provides the ordering via scoring, and the LLM makes the final decision via descriptions. For reliable selection, both the ranking (via labels) and the description guidance must align.
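The interplay between ranking and description-driven selection can be sketched in Python. This is purely illustrative: the data shapes, the `final_score` values, and the substring check standing in for the LLM's judgment are simplifications, not the real HAPI or DataStorage interfaces.

```python
# Illustrative sketch of Steps 2 and 3 (NOT the real Kubernaut APIs).
# Step 3 is an LLM judgment; here it is stubbed as a substring match
# to show where ranking and descriptions each matter.

def step2_rank(workflows):
    """DataStorage: order active workflows by final_score, highest first."""
    return sorted(workflows, key=lambda w: w["final_score"], reverse=True)

def step3_select(ranked, incident_context):
    """LLM stub: prefer the highest-ranked workflow whose whenToUse fits;
    fall back to the top-ranked one if nothing matches."""
    for wf in ranked:
        if incident_context in wf["whenToUse"]:
            return wf
    return ranked[0]

workflows = [
    {"name": "increase-memory-limits-gitops",
     "whenToUse": "deployment is managed by a GitOps tool", "final_score": 0.515},
    {"name": "increase-memory-limits",
     "whenToUse": "deployment is NOT managed by a GitOps tool", "final_score": 0.50},
]
ranked = step2_rank(workflows)
# The incident context fits the lower-ranked workflow's description better,
# so the description wins over the ranking -- as described above.
chosen = step3_select(ranked, "NOT managed by a GitOps tool")
print(chosen["name"])  # increase-memory-limits
```

Note how the lower-ranked workflow is still chosen when its description is the better semantic fit: when ranking and descriptions disagree, descriptions win.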
## Planning Your Workflow Catalog

### When to Group Workflows Under the Same Action Type
If two workflows solve the same category of problem but differ in how they solve it, they should share the same action type. CustomLabels and descriptions then differentiate them.
| Scenario | Same Action Type? | Why |
|---|---|---|
| Direct kubectl patch vs GitOps commit for memory limits | Yes (`IncreaseMemoryLimits`) | Same intent, different execution strategy |
| Fast restart vs safe rollback for CrashLoopBackOff | Depends | If risk tolerance is the differentiator, consider same type with customLabels. If the actions are fundamentally different (restart vs rollback), use separate types |
| Rollback a Deployment vs rollback a Helm release | No | Different resources, different rollback mechanisms |
| Scale replicas vs increase CPU limits | No | Different remediation categories |
**CustomLabels cannot steer across action types**
If two workflows are under different action types, customLabels have no effect on selection between them. The LLM picks the action type first (Step 1), then workflows within that type are ranked (Step 2). CustomLabels only influence Step 2.
### When to Create a New Action Type

Create a new action type when:

- The remediation is a fundamentally different category (e.g., "scale" vs "restart" vs "rollback")
- No existing action type's `whenToUse` covers the scenario
- The LLM would be confused choosing between this action and an existing one under the same type
## Writing Effective Descriptions
Descriptions are the primary mechanism the LLM uses for selection. Poorly written descriptions are the #1 cause of incorrect workflow selection.
### Action Type Descriptions
Action type descriptions should describe the category of action, not specific conditions or environments.
Good:
```yaml
spec:
  name: IncreaseMemoryLimits
  description:
    what: "Increase memory resource limits on containers that are being OOMKilled"
    whenToUse: "When containers are being OOMKilled because the memory limit is too low and the correct new limit can be determined"
    whenNotToUse: "When the OOMKill is caused by a memory leak -- increasing limits only delays the inevitable"
    preconditions: "The deployment exists and defines explicit memory limits"
```
Bad:
```yaml
spec:
  name: IncreaseMemoryLimitsGitOps
  description:
    what: "Increase memory limits via GitOps commit for ArgoCD-managed deployments"
    # Too specific -- this describes a workflow variant, not an action category
```
The bad example bakes environment-specific conditions into the action type, preventing other workflows (e.g., direct kubectl patch) from sharing the same type.
### Workflow Descriptions
Workflow descriptions should reference the specific conditions under which this variant is preferred, including explicit references to customLabels and detectedLabels.
Good -- two workflows under IncreaseMemoryLimits:
```yaml
# Workflow 1: Direct patch (metadata.name: increase-memory-limits)
spec:
  description:
    what: "Increases memory limits by patching the deployment directly via kubectl"
    whenToUse: "When containers are being OOMKilled and the deployment is NOT managed by a GitOps tool. Suitable for environments where direct patching is acceptable."
    whenNotToUse: "When the deployment is managed by ArgoCD or Flux -- direct patching will cause drift"
---
# Workflow 2: GitOps commit via Ansible (metadata.name: increase-memory-limits-gitops)
spec:
  description:
    what: "Increases memory limits by updating the deployment YAML in the source Git repository and letting the GitOps controller reconcile"
    whenToUse: "When containers are being OOMKilled and the deployment is managed by a GitOps tool (ArgoCD or Flux). The new memory value must be higher than the current limit."
    whenNotToUse: "When the environment is not GitOps-managed. When the OOMKill is caused by a memory leak."
```
The LLM reads both descriptions and, combined with the DataStorage ranking (which boosts the GitOps workflow when `gitOpsManaged: "true"` is detected), reliably picks the right one.
### Description Engineering Checklist

- [ ] Action type `whenToUse` describes the category (what problem does this solve?)
- [ ] Workflow `whenToUse` describes the variant (under what conditions is this variant preferred?)
- [ ] Workflow `whenNotToUse` explicitly excludes scenarios where the other variant should be chosen
- [ ] If customLabels differentiate workflows, the `whenToUse` references the condition (e.g., "when risk tolerance is high")
- [ ] Descriptions don't overlap semantically -- the LLM must be able to distinguish them
## Using CustomLabels for Condition-Based Selection
CustomLabels are operator-defined key-value pairs that influence DataStorage scoring. They're the mechanism for steering selection based on organizational or operational conditions that aren't captured by infrastructure detection.
### How CustomLabels Flow Through the System

```mermaid
flowchart LR
    NS["Namespace label<br/><small>kubernaut.ai/label-team=payments</small>"] --> Rego["policy.rego labels rules<br/><small>extracts team=payments</small>"]
    Rego --> SP["SignalProcessing<br/><small>CustomLabels field</small>"]
    SP --> DS["DataStorage scoring<br/><small>custom label boost</small>"]
    DS --> LLM["LLM sees ranked list"]
```
1. Namespace labels: The operator labels namespaces with `kubernaut.ai/label-{key}={value}`
2. Rego policy: The `labels` rules in `policy.rego` extract labels with the `kubernaut.ai/label-` prefix
3. Signal Processing: Stores them in the `CustomLabels` field on the SP CRD
4. DataStorage: During `list_workflows`, matches SP's custom labels against each workflow's `customLabels` and boosts the score
5. LLM: Sees the ranked list and makes the final selection, guided by descriptions
### Declaring CustomLabels on Workflow Schemas

```yaml
spec:
  customLabels:
    risk_tolerance: "high"   # exact match only
    team: "payments"         # matches this specific value
    region: "*"              # wildcard -- matches any value
```
- Exact match: The workflow's value must equal the incident's value.
- Wildcard (`"*"`): The workflow matches any non-empty value for that key (half credit).

CustomLabels are `map[string]string` on the CRD -- each key maps to a single string value. Internally, DataStorage wraps these into arrays for JSONB storage and scoring.
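The match semantics can be sketched as a small credit function. This is illustrative only: the exact/wildcard/half-credit rules come from this page, but the real boost weights live in DataStorage (see Workflow Search and Scoring):

```python
def custom_label_credit(workflow_value: str, incident_value: str) -> float:
    """Illustrative matching rule: full credit for an exact match,
    half credit for a wildcard ("*") against any non-empty value,
    no credit otherwise."""
    if workflow_value == "*":
        return 0.5 if incident_value else 0.0
    return 1.0 if workflow_value == incident_value else 0.0

assert custom_label_credit("high", "high") == 1.0    # exact match
assert custom_label_credit("*", "eu-west-1") == 0.5  # wildcard: half credit
assert custom_label_credit("high", "low") == 0.0     # no match
```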
### Labeling Namespaces

```shell
kubectl label namespace payments-prod kubernaut.ai/label-team=payments
kubectl label namespace payments-prod kubernaut.ai/label-risk_tolerance=high
```
The default `labels` rules in `policy.rego` extract all `kubernaut.ai/label-*` labels automatically:

```rego
package signalprocessing

import rego.v1

labels[key] := value if {
    some k, v in input.kubernetes.namespace.labels
    startswith(k, "kubernaut.ai/label-")
    key := trim_prefix(k, "kubernaut.ai/label-")
    value := v
}
```
### Custom Rego for Non-Standard Labels

If your labels don't follow the `kubernaut.ai/label-` convention, add custom rules to the `labels` section of your `policy.rego`:

```rego
package signalprocessing

import rego.v1

labels := result if {
    rt := input.namespace.labels["company.io/risk-tolerance"]
    rt != ""
    result := {"risk_tolerance": [rt]}
}
```
Add these rules to the custom labels section of your unified `policy.rego` file in the `signalprocessing-policy` ConfigMap. Signal Processing hot-reloads the policy on ConfigMap updates.
### Scoring Impact
Custom label matches add to the raw score (before normalization to 0-1). See Workflow Search and Scoring for the exact boost values.
This is a tiebreaker/ordering influence, not an override. It won't overcome a strong semantic mismatch in descriptions -- if the LLM strongly prefers a lower-ranked workflow based on its whenToUse, it will still pick it.
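As a rough illustration of the tiebreaker effect, the following sketch uses the base score and per-match boost figures that appear in the worked example later in this guide; these numbers are illustrative, and the real formula and weights are documented in Workflow Search and Scoring:

```python
BASE_SCORE = 0.50           # illustrative normalized base score
CUSTOM_LABEL_BOOST = 0.015  # per-match boost figure from the worked example

def final_score(workflow_custom_labels: dict, incident_custom_labels: dict) -> float:
    """Add a small boost per matching custom label -- an ordering nudge,
    not an override of the LLM's description-based decision."""
    matches = sum(1 for key, value in workflow_custom_labels.items()
                  if incident_custom_labels.get(key) == value)
    return BASE_SCORE + matches * CUSTOM_LABEL_BOOST

incident = {"risk_tolerance": "high"}
print(round(final_score({"risk_tolerance": "high"}, incident), 3))  # 0.515
print(round(final_score({"risk_tolerance": "low"}, incident), 3))   # 0.5
```

The gap between 0.515 and 0.50 is enough to reorder two otherwise-equal workflows, but small enough that a strong description mismatch still dominates.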
## Standard Resource Parameters

Every workflow receives a set of standard `TARGET_RESOURCE_*` parameters that identify the Kubernetes resource selected for remediation. HAPI (HolmesGPT API) derives these from the K8s-verified `root_owner` during investigation and injects them into the selected workflow's parameters before the AIAnalysis completes -- workflow authors do not need to populate them manually.
| Parameter | Type | Description |
|---|---|---|
| `TARGET_RESOURCE_NAME` | string | Name of the root managing resource (e.g., `my-app`) |
| `TARGET_RESOURCE_KIND` | string | Kind of the root managing resource (e.g., `Deployment`, `StatefulSet`, `Node`) |
| `TARGET_RESOURCE_NAMESPACE` | string | Namespace of the root managing resource. Omitted for cluster-scoped resources (e.g., Nodes) |
### Declaring Standard Parameters in Workflow Schemas
Workflows that operate on the target resource should declare these as required parameters in their schema. HAPI validates that all required parameters in the workflow schema are satisfied during its workflow response validation step.
```yaml
parameters:
  - name: TARGET_RESOURCE_NAME
    type: string
    required: true
    description: "Name of the root managing resource (auto-injected)"
  - name: TARGET_RESOURCE_KIND
    type: string
    required: true
    description: "Kind of the root managing resource (auto-injected)"
  - name: TARGET_RESOURCE_NAMESPACE
    type: string
    required: true
    description: "Namespace of the root managing resource (auto-injected)"
```
### Cluster-Scoped Resources

For cluster-scoped resources (e.g., Nodes, PersistentVolumes), `TARGET_RESOURCE_NAMESPACE` is not injected to prevent parameter validation failures. Workflows that handle both namespaced and cluster-scoped resources should declare `TARGET_RESOURCE_NAMESPACE` as optional (`required: false`).
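A sketch of that optional declaration, following the same parameter schema shown above (the field names mirror the required-parameter examples; adjust to your actual workflow):

```yaml
parameters:
  - name: TARGET_RESOURCE_NAMESPACE
    type: string
    required: false  # optional: HAPI does not inject this for cluster-scoped resources
    description: "Namespace of the root managing resource (auto-injected when the target is namespaced)"
```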
See the worked example below for a complete workflow schema that declares these parameters.
## Worked Example: Risk-Based CrashLoopBackOff Remediation
This example demonstrates two workflows for the same problem (CrashLoopBackOff), differentiated by risk tolerance, from Rego policy through workflow schema to successful selection.
### Scenario

- Team Alpha (namespace `alpha-prod`): Risk tolerance is `high`. They prefer fast restarts to minimize downtime.
- Team Beta (namespace `beta-prod`): Risk tolerance is `low`. They prefer safe rollbacks even if slower.
Both namespaces experience CrashLoopBackOff events. The same GracefulRestart action type should serve both, but with different workflows selected based on team preference.
### Step 1: Label the Namespaces

```shell
kubectl label namespace alpha-prod kubernaut.ai/label-risk_tolerance=high
kubectl label namespace beta-prod kubernaut.ai/label-risk_tolerance=low
```
### Step 2: Verify the Rego Policy

The default `labels` rules in `policy.rego` extract `risk_tolerance` automatically (it has the `kubernaut.ai/label-` prefix). No custom Rego needed.
### Step 3: Create the Action Type
Both workflows share the same action type:
```yaml
apiVersion: kubernaut.ai/v1alpha1
kind: ActionType
metadata:
  name: graceful-restart
spec:
  name: GracefulRestart
  description:
    what: "Perform a graceful rolling restart to reset runtime state"
    whenToUse: "When pods are in a degraded state (CrashLoopBackOff, high restart count) but the deployment spec is correct"
    whenNotToUse: "When the issue is caused by a bad image or config change -- a restart won't help"
    preconditions: "The deployment exists and has at least one ready replica"
```
### Step 4: Create the Workflows
Workflow A -- Fast restart (high risk tolerance):
```yaml
apiVersion: kubernaut.ai/v1alpha1
kind: RemediationWorkflow
metadata:
  name: restart-pods-v1
spec:
  version: "1.0.0"
  description:
    what: "Restarts all pods in the deployment immediately via kubectl delete"
    whenToUse: "When fast recovery is preferred over safety. Best for teams with high risk tolerance where minimizing downtime is the priority, even at the cost of brief unavailability during restart."
    whenNotToUse: "When the team has low risk tolerance or the service handles financial transactions"
    preconditions: "Deployment exists with at least one pod"
  actionType: GracefulRestart
  labels:
    severity: [critical, high]
    environment: ["*"]
    component: deployment
    priority: "*"
  customLabels:
    risk_tolerance: "high"
  execution:
    engine: job
    bundle: registry.example.com/workflows/restart-pods@sha256:abc123...
  parameters:
    - name: TARGET_RESOURCE_NAME
      type: string
      required: true
      description: "Name of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_KIND
      type: string
      required: true
      description: "Kind of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_NAMESPACE
      type: string
      required: true
      description: "Namespace of the root managing resource (HAPI-injected)"
    - name: TARGET_DEPLOYMENT
      type: string
      required: true
      description: "Name of the deployment to restart"
```
Workflow B -- Safe rollback (low risk tolerance):
```yaml
apiVersion: kubernaut.ai/v1alpha1
kind: RemediationWorkflow
metadata:
  name: crashloop-rollback-v1
spec:
  version: "1.0.0"
  description:
    what: "Rolls back the deployment to the previous stable revision"
    whenToUse: "When safe recovery is preferred. Best for teams with low risk tolerance where ensuring a known-good state is more important than speed."
    whenNotToUse: "When the team has high risk tolerance and prefers faster restart over rollback"
    preconditions: "Deployment exists with at least one previous revision"
  actionType: GracefulRestart
  labels:
    severity: [critical, high]
    environment: ["*"]
    component: deployment
    priority: "*"
  customLabels:
    risk_tolerance: "low"
  execution:
    engine: job
    bundle: registry.example.com/workflows/crashloop-rollback@sha256:def456...
  parameters:
    - name: TARGET_RESOURCE_NAME
      type: string
      required: true
      description: "Name of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_KIND
      type: string
      required: true
      description: "Kind of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_NAMESPACE
      type: string
      required: true
      description: "Namespace of the root managing resource (HAPI-injected)"
    - name: TARGET_DEPLOYMENT
      type: string
      required: true
      description: "Name of the deployment to roll back"
```
### Step 5: What Happens at Runtime
Incident in `alpha-prod` (`risk_tolerance=high`):

1. Step 1: The LLM picks `GracefulRestart` based on the CrashLoopBackOff root cause
2. Step 2: DataStorage scores both workflows:
    - `restart-pods-v1`: base 0.50 + customLabel match (`risk_tolerance: high` == `high`) = 0.515
    - `crashloop-rollback-v1`: base 0.50 + no match (`risk_tolerance: low` != `high`) = 0.50
3. Step 3: The LLM sees `restart-pods-v1` ranked first, reads its `whenToUse` ("high risk tolerance"), confirms it fits. Selected.
Incident in `beta-prod` (`risk_tolerance=low`):

1. Step 1: The LLM picks `GracefulRestart` (same action type)
2. Step 2: DataStorage scores:
    - `crashloop-rollback-v1`: base 0.50 + customLabel match = 0.515
    - `restart-pods-v1`: base 0.50 + no match = 0.50
3. Step 3: The LLM sees `crashloop-rollback-v1` ranked first, reads its `whenToUse` ("low risk tolerance"), confirms it fits. Selected.
### Why This Works
The ranking and the descriptions reinforce each other:
- DataStorage puts the correct workflow first via the customLabel score boost
- The LLM confirms the choice by reading the `whenToUse` description, which explicitly references risk tolerance
- If the descriptions were generic (no mention of risk tolerance), the LLM would have no basis to differentiate and might ignore the ranking
## Troubleshooting

### The LLM selects the wrong workflow
Symptom: The correct workflow exists but the LLM consistently picks a different one.
Diagnostic steps:
1. Check the action type: Are both workflows under the same action type? If they're under different action types, customLabels can't differentiate them.

2. Check DataStorage ranking: Query the DataStorage API directly to see how workflows are scored:

    ```shell
    curl -s "http://data-storage:8080/api/v1/workflows/actions/GracefulRestart?severity=critical&environment=production&component=deployment&priority=P1" | jq '.[] | {name: .name, score: .confidence}'
    ```

    If the wrong workflow is ranked higher, check label matching.

3. Check customLabels on the SP CRD: Verify that Signal Processing extracted the expected custom labels, e.g., by inspecting the SP CRD's `status.customLabels`.

4. Check namespace labels: Verify the source namespace has the expected labels, e.g., with `kubectl get namespace <name> --show-labels`.

5. Check workflow customLabels: Verify the workflow schema declares the matching `customLabels` keys and values.
### No workflows found for the action type
Symptom: The LLM reports no workflows available after selecting an action type.
Causes:
- Mandatory label mismatch: The workflow's `severity`, `environment`, `component`, or `priority` don't match the incident. Check that labels include the incident's values or use `"*"` wildcards.
- Workflow not active: The workflow might be `disabled` or `superseded`. Check: `kubectl get remediationworkflow <name> -o jsonpath='{.status.catalogStatus}'`
- Not latest version: If a newer version was registered, the old one has `is_latest_version = false` and is excluded.
### CustomLabels have no effect
Symptom: Both workflows have the same DataStorage score despite different customLabels.
Causes:
- Rego policy not extracting labels: Check that the `labels` rules in `policy.rego` output the expected keys. Test with `opa eval` or check the SP CRD's `status.customLabels`.
- Namespace missing labels: The namespace must have `kubernaut.ai/label-{key}={value}` labels for the default Rego policy to extract them.
- Workflow not declaring customLabels: The workflow schema must have a `customLabels` section. Without it, there's nothing to match against.
- Key mismatch: The Rego output key must exactly match the workflow's customLabel key (e.g., `risk_tolerance` in both).
### The LLM ignores the DataStorage ranking
Symptom: The higher-ranked workflow is not selected.
This is expected behavior in some cases. The LLM makes the final decision based on descriptions and context. If the lower-ranked workflow's `whenToUse` is a much better semantic fit, the LLM will prefer it.
Fix: Ensure descriptions reinforce the ranking. If customLabels differentiate workflows, the `whenToUse` text should reference the same condition (e.g., "for teams with high risk tolerance"). When ranking and descriptions align, the LLM consistently follows the ranking.
## Summary
| Authoring Decision | Affects Step | Impact |
|---|---|---|
| Action type `whenToUse` | Step 1 (action type selection) | Determines which action category the LLM picks |
| Mandatory labels (severity, environment, component, priority) | Step 2 (filtering) | Excludes workflows that don't match -- they never reach the LLM |
| DetectedLabels | Step 2 (scoring) | Highest-weight infrastructure boost |
| CustomLabels | Step 2 (scoring) | Operator-intent boost |
| Workflow `whenToUse` / `whenNotToUse` | Step 3 (LLM selection) | The LLM's primary decision input -- must reinforce the ranking |
| Remediation history | Step 3 (LLM context) | The LLM avoids repeating failed approaches |