Workflow Selection
Workflow selection is the process of finding the best remediation workflow for an incident. It uses a three-step discovery protocol (DD-HAPI-017) where HolmesGPT queries DataStorage, which applies mandatory filtering and semantic scoring before the LLM makes the final selection decision.
CRD Reference
For the complete CRD specifications, see RemediationWorkflow and ActionType in the API Reference.
Three-Step Discovery Protocol
sequenceDiagram
participant LLM as LLM
participant HAPI as HolmesGPT API
participant DS as DataStorage
LLM->>HAPI: list_available_actions()
HAPI->>DS: GET /api/v1/workflows/actions
DS-->>HAPI: Action types with workflow counts
HAPI-->>LLM: Available action types
LLM->>HAPI: list_workflows(action_type)
HAPI->>DS: GET /api/v1/workflows/actions/{action_type}
DS->>DS: Layer 1 filter + Layer 2 scoring
DS-->>HAPI: Scored candidates (scores stripped)
HAPI-->>LLM: Workflow summaries
LLM->>HAPI: get_workflow(workflow_id)
HAPI->>DS: GET /api/v1/workflows/{workflow_id}
DS-->>HAPI: Full workflow schema
HAPI-->>LLM: Full schema for evaluation
Step 1: List Action Types
The LLM calls list_available_actions() to discover what types of remediations are available. DataStorage returns action types from the action_type_taxonomy table, filtered to only include types that have at least one active workflow matching the signal context.
SELECT t.action_type, t.description, COUNT(w.workflow_id) AS workflow_count
FROM action_type_taxonomy t
INNER JOIN remediation_workflow_catalog w ON w.action_type = t.action_type
WHERE w.status = 'active' AND w.is_latest_version = true
AND t.status = 'active'
AND [context filters]
GROUP BY t.action_type, t.description
ORDER BY t.action_type
Each action type includes a structured description (what, whenToUse, whenNotToUse, preconditions) that the LLM uses to choose the appropriate action category based on the root cause analysis.
Step 2: List Workflows by Action Type
The LLM calls list_workflows(action_type) to get candidate workflows. DataStorage applies Layer 1 mandatory filtering and Layer 2 semantic scoring, returning results ordered by score. Scores are stripped before reaching the LLM -- they are used only for ordering.
Step 3: Get Full Workflow Schema
The LLM calls get_workflow(workflow_id) to retrieve the full schema for detailed evaluation. A context filter security gate ensures the workflow still matches the signal context.
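The three steps can be sketched end to end against an in-memory stand-in for DataStorage. This is an illustration only, not the HAPI implementation: the `FakeDataStorage` class, the sample catalog, and the `pick_*` callbacks that stand in for the LLM's choices are hypothetical, and the context filters and the Step 3 security gate are omitted for brevity. Only the three tool names come from the protocol above.

```python
class FakeDataStorage:
    """Hypothetical in-memory stand-in for DataStorage (illustration only)."""

    def __init__(self, workflows: list[dict]):
        self.workflows = workflows

    def _active(self) -> list[dict]:
        return [w for w in self.workflows if w["status"] == "active"]

    def list_available_actions(self) -> list[dict]:
        # Step 1: action types with at least one active workflow, plus counts.
        counts: dict[str, int] = {}
        for w in self._active():
            counts[w["action_type"]] = counts.get(w["action_type"], 0) + 1
        return [{"action_type": a, "workflow_count": n}
                for a, n in sorted(counts.items())]

    def list_workflows(self, action_type: str) -> list[dict]:
        # Step 2: order by score (ties broken by workflow_id), then strip the score.
        candidates = [w for w in self._active() if w["action_type"] == action_type]
        candidates.sort(key=lambda w: (-w["final_score"], w["workflow_id"]))
        return [{"workflow_id": w["workflow_id"], "description": w["description"]}
                for w in candidates]

    def get_workflow(self, workflow_id: str) -> dict:
        # Step 3: full record for the chosen workflow; the score stays internal.
        for w in self._active():
            if w["workflow_id"] == workflow_id:
                return {k: v for k, v in w.items() if k != "final_score"}
        raise KeyError(workflow_id)


def discover(ds: FakeDataStorage, pick_action, pick_workflow) -> dict:
    """Run the three calls in order; the pick_* callbacks play the LLM's role."""
    action_type = pick_action(ds.list_available_actions())
    workflow_id = pick_workflow(ds.list_workflows(action_type))
    return ds.get_workflow(workflow_id)


# Made-up sample catalog.
CATALOG = [
    {"workflow_id": "wf-restart-a", "action_type": "RestartPod", "status": "active",
     "description": "Restart a crashing pod", "final_score": 0.60},
    {"workflow_id": "wf-restart-b", "action_type": "RestartPod", "status": "active",
     "description": "Restart a pod managed by GitOps", "final_score": 0.70},
    {"workflow_id": "wf-scale-a", "action_type": "ScaleReplicas", "status": "disabled",
     "description": "Scale out a deployment", "final_score": 0.90},
]

ds = FakeDataStorage(CATALOG)
chosen = discover(
    ds,
    pick_action=lambda actions: actions[0]["action_type"],
    pick_workflow=lambda summaries: summaries[0]["workflow_id"],
)
```

In this sketch the disabled `ScaleReplicas` workflow never appears in Step 1, and the Step 2 summaries arrive already ordered, so a caller that simply takes the first summary gets the highest-scoring candidate without ever seeing a score.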
Signal Context Propagation
HAPI propagates signal context from the investigation session to all DataStorage queries:
| Parameter | Source | Purpose |
|---|---|---|
| severity | SP classification | Mandatory filter |
| component | RCA target resource kind | Mandatory filter |
| environment | SP classification | Mandatory filter |
| priority | SP classification | Mandatory filter |
| custom_labels | SP custom labels | Layer 2 scoring boost |
| detected_labels | HAPI LabelDetector (post-RCA) | Layer 2 scoring boost + penalty |
| remediation_id | Parent RR name | Audit correlation |
Detected labels are computed by HAPI for the RCA target resource (ADR-056). Labels with failedDetections entries are stripped before propagation to DataStorage.
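As a rough sketch of the propagation step, the snippet below turns a signal context into a query string. Several things here are assumptions rather than documented behavior: that the parameters travel as query-string keys with the names from the table, that label maps are JSON-encoded, and that `failedDetections` is a list of field names stored alongside the detected labels. The helper names are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode


def strip_failed_detections(detected: dict) -> dict:
    """Drop every label named in failedDetections, plus the bookkeeping key itself.

    Assumes failedDetections is a list of field names (hypothetical shape).
    """
    failed = set(detected.get("failedDetections", []))
    return {k: v for k, v in detected.items()
            if k != "failedDetections" and k not in failed}


def build_query(context: dict) -> str:
    """Encode the signal context as a query string (parameter encoding is assumed)."""
    params = {
        "severity": context["severity"],
        "component": context["component"],
        "environment": context["environment"],
        "priority": context["priority"],
        "remediation_id": context["remediation_id"],
    }
    if context.get("custom_labels"):
        params["custom_labels"] = json.dumps(context["custom_labels"], sort_keys=True)
    detected = strip_failed_detections(context.get("detected_labels", {}))
    if detected:
        params["detected_labels"] = json.dumps(detected, sort_keys=True)
    return urlencode(params)


# Made-up example context.
context = {
    "severity": "critical",
    "component": "deployment",
    "environment": "production",
    "priority": "P0",
    "remediation_id": "rr-example",
    "custom_labels": {"team": "payments"},
    "detected_labels": {
        "gitOpsManaged": True,
        "pdbProtected": False,
        "failedDetections": ["pdbProtected"],
    },
}
decoded = parse_qs(build_query(context))
```

Here `pdbProtected` is dropped because its detection failed, so DataStorage never scores against an unreliable value.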
Layer 1: Mandatory Filtering
Every workflow declares four mandatory labels. DataStorage applies these as hard filters -- a workflow must match all four to be a candidate.
| Label | Type | Values | SQL Pattern |
|---|---|---|---|
| severity | string[] | critical, high, medium, low, "*" | labels->'severity' ? $val OR labels->'severity' ? '*' |
| component | string | pod, deployment, node, "*" | LOWER(labels->>'component') = LOWER($val) OR labels->>'component' = '*' |
| environment | string[] | production, staging, development, test, "*" | labels->'environment' ? $val OR labels->'environment' ? '*' |
| priority | string | P0, P1, P2, P3, "*" | JSONB scalar or array containment |
Labels support:
- Exact match -- component: deployment
- Wildcard -- component: "*" (matches any value)
- Multi-value -- severity: [critical, high] (matches if any value overlaps)
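The matching rules can be restated in Python for readers who find the SQL patterns terse. This is a sketch of the semantics only; the real filter runs as SQL inside DataStorage, and the function names are hypothetical. Following the table, only `component` is compared case-insensitively.

```python
MANDATORY_LABELS = ("severity", "component", "environment", "priority")


def label_matches(declared, value: str, case_insensitive: bool = False) -> bool:
    """Return True if a workflow's declared label admits the signal's value."""
    if declared is None:
        return False  # mandatory labels must be declared
    # Treat a scalar as a one-element list so scalars and arrays share one path.
    declared_values = declared if isinstance(declared, list) else [declared]
    if "*" in declared_values:
        return True  # wildcard matches any value
    if case_insensitive:
        return value.lower() in [d.lower() for d in declared_values]
    return value in declared_values


def passes_layer1(workflow_labels: dict, signal: dict) -> bool:
    """A workflow is a candidate only if all four mandatory labels match."""
    return all(
        label_matches(
            workflow_labels.get(name),
            signal[name],
            case_insensitive=(name == "component"),
        )
        for name in MANDATORY_LABELS
    )


# Made-up workflow labels and signal.
workflow_labels = {
    "severity": ["critical", "high"],
    "component": "deployment",
    "environment": ["production"],
    "priority": "*",
}
signal = {
    "severity": "high",
    "component": "Deployment",
    "environment": "production",
    "priority": "P1",
}
is_candidate = passes_layer1(workflow_labels, signal)
```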
Detected Label Filters
Detected labels use inclusive filtering (Issue #197) -- workflows that don't declare a detected label are included, not excluded:
- Boolean labels (e.g., gitOpsManaged): detected_labels->>'field' = 'true' OR detected_labels->>'field' IS NULL
- String labels (e.g., gitOpsTool): detected_labels->>'field' = $val OR detected_labels->>'field' = '*' OR detected_labels->>'field' IS NULL
This ensures newly registered workflows are not excluded simply because they haven't declared detected labels yet.
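A minimal Python sketch of the inclusive behavior follows. It assumes the filter is applied only for labels the signal carries as `true` or as a string value; the source does not say what happens for `false` or absent signal labels, so those are treated here as "no filter". The function name is hypothetical.

```python
def passes_detected_filters(workflow_detected: dict, signal_detected: dict) -> bool:
    """Inclusive filtering: a label the workflow does not declare never excludes it."""
    for field, wanted in signal_detected.items():
        declared = workflow_detected.get(field)
        if declared is None:
            continue  # IS NULL branch: undeclared labels are included
        if wanted is True:
            # Boolean pattern: the workflow must declare true.
            if declared is not True:
                return False
        elif isinstance(wanted, str):
            # String pattern: exact value or wildcard.
            if declared not in (wanted, "*"):
                return False
    return True


signal_detected = {"gitOpsManaged": True, "gitOpsTool": "argocd"}

undeclared_ok = passes_detected_filters({}, signal_detected)
wildcard_ok = passes_detected_filters({"gitOpsTool": "*"}, signal_detected)
wrong_tool = passes_detected_filters({"gitOpsTool": "flux"}, signal_detected)
```

A workflow with an empty detected-label map passes, which is the property that keeps newly registered workflows discoverable.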
Why Business Classification Is Not a Layer 1 Filter
Business classification labels (businessUnit, serviceOwner, criticality, SLARequirement) are not part of mandatory filtering. This is by design:
- Optional labels -- Business labels are derived from namespace labels that may not be configured. Making them mandatory would exclude workflows for uncategorized namespaces.
- Deployment flexibility -- Early-stage deployments may not have business classification set up. Workflows should still be discoverable.
- Broad applicability -- Most remediation workflows (restart, scale, rollback) are not business-unit-specific. A RestartPod workflow works the same regardless of business unit.
- Rego policy connection -- Signal Processing Rego policies determine severity, environment, and priority, which directly feed into Layer 1. Business classification feeds into Layer 2 scoring as a refinement signal, not a hard gate.
signalName is not a matching label
signalName is optional metadata in the workflow schema (DD-WORKFLOW-016). It is not used for filtering or matching -- only for the final result ordering tiebreaker (ORDER BY final_score DESC, workflow_id ASC). The LLM selects workflows by actionType, not by signalName.
Layer 2: Semantic Scoring
DataStorage computes a final_score for each candidate to order results (DD-WORKFLOW-004 v1.5). Scores are used only for ordering and are stripped before reaching the LLM.
Scoring Formula
- Base score: 5.0 out of 10.0 (normalized to 0.50)
- Range: 0.0 -- 1.0 (clamped via LEAST(..., 1.0))
- Ordering: ORDER BY final_score DESC, workflow_id ASC
Detected Label Weights
| Label | Boost (exact match) | Penalty (mismatch) |
|---|---|---|
| gitOpsManaged | +0.10 | -0.10 |
| gitOpsTool | +0.10 | -0.10 |
| pdbProtected | +0.05 | -- |
| serviceMesh | +0.05 | -- |
| networkIsolated | +0.03 | -- |
| helmManaged | +0.02 | -- |
| stateful | +0.02 | -- |
| hpaEnabled | +0.02 | -- |

Maximum boost: +0.39 (all labels exact match)

Maximum penalty: -0.20 (GitOps mismatch only)
Wildcard Weighting
- Exact match: Full weight (e.g., gitOpsManaged=true matches true → +0.10)
- Workflow declares wildcard ("*"): Half weight (e.g., +0.05)
- Query has wildcard ("*"): Half weight
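The detected-label arithmetic can be worked through in a few lines of Python. This is a sketch of the numbers above, not the SQL that DataStorage runs. It assumes that a "mismatch" means both sides declare a value and the values differ, and that a label missing on either side contributes nothing. With a base of 0.50 and a maximum penalty of 0.20, these weights alone cannot push the score below 0.30, which is consistent with only an upper clamp being applied.

```python
BASE_SCORE = 0.50

# label -> (boost on exact match, penalty on mismatch), copied from the weights table
WEIGHTS = {
    "gitOpsManaged": (0.10, 0.10),
    "gitOpsTool": (0.10, 0.10),
    "pdbProtected": (0.05, 0.0),
    "serviceMesh": (0.05, 0.0),
    "networkIsolated": (0.03, 0.0),
    "helmManaged": (0.02, 0.0),
    "stateful": (0.02, 0.0),
    "hpaEnabled": (0.02, 0.0),
}


def detected_label_score(workflow_detected: dict, signal_detected: dict) -> float:
    """Base score plus boosts minus penalties, clamped to at most 1.0."""
    score = BASE_SCORE
    for label, (boost, penalty) in WEIGHTS.items():
        declared = workflow_detected.get(label)
        wanted = signal_detected.get(label)
        if declared is None or wanted is None:
            continue  # assumed: nothing to compare, so no boost and no penalty
        if declared == "*" or wanted == "*":
            score += boost / 2  # wildcard on either side: half weight
        elif declared == wanted:
            score += boost  # exact match: full weight
        else:
            score -= penalty  # mismatch: only the GitOps labels carry a penalty
    # Rounding is only to keep float noise out of the example's results.
    return round(min(score, 1.0), 4)


# Made-up signal where every label is detected.
everything = {
    "gitOpsManaged": True, "gitOpsTool": "argocd", "pdbProtected": True,
    "serviceMesh": "istio", "networkIsolated": True, "helmManaged": True,
    "stateful": True, "hpaEnabled": True,
}
best_case = detected_label_score(everything, everything)
worst_case = detected_label_score(
    {"gitOpsManaged": False, "gitOpsTool": "flux"},
    {"gitOpsManaged": True, "gitOpsTool": "argocd"},
)
wildcard_case = detected_label_score({"gitOpsTool": "*"}, {"gitOpsTool": "argocd"})
```

The best case reproduces the stated maximum boost (0.50 + 0.39), and the worst case reproduces the maximum penalty (0.50 - 0.20).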
Custom Label Boost
Custom labels from Signal Processing's Rego policy output contribute additional scoring (DD-WORKFLOW-004 v1.5):
- Exact match: +0.15 per key
- Wildcard match: +0.075 per key
- SQL: custom_labels->'key' @> 'value'::jsonb
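A small sketch of the per-key arithmetic, under two assumptions not stated in the source: a workflow's custom label may be a scalar or a list (mirroring the containment operator in the SQL), and a wildcard match means the workflow declares "*" for that key. The function name is hypothetical.

```python
EXACT_BOOST = 0.15
WILDCARD_BOOST = 0.075


def custom_label_boost(workflow_custom: dict, signal_custom: dict) -> float:
    """Sum the per-key boost for every signal custom label the workflow declares."""
    boost = 0.0
    for key, wanted in signal_custom.items():
        declared = workflow_custom.get(key)
        if declared is None:
            continue  # workflow says nothing about this key
        declared_values = declared if isinstance(declared, list) else [declared]
        if wanted in declared_values:
            boost += EXACT_BOOST  # exact match
        elif "*" in declared_values:
            boost += WILDCARD_BOOST  # wildcard match
    # Rounding is only to keep float noise out of the example's results.
    return round(boost, 4)


# Made-up labels: one exact match, one wildcard match, one undeclared key.
example_boost = custom_label_boost(
    {"team": ["payments", "billing"], "tier": "*"},
    {"team": "payments", "tier": "gold", "region": "eu"},
)
```

Because the boost is per key, a workflow matching several custom labels can reach the 1.0 ceiling quickly, which is where the `LEAST(..., 1.0)` clamp matters.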
Score Connection to SP Rego Policies
The mandatory filter labels (severity, environment, priority) are determined by Signal Processing's Rego classification policies. This creates a direct connection:
- SP Severity Rego policy → determines which severity values filter workflows in Layer 1
- SP Environment Rego policy → determines which environment value is used for filtering
- SP Priority Engine → determines the priority level for filtering
- SP Custom Labels Rego policy → produces labels that boost scores in Layer 2
Accurate Rego policy configuration is critical for workflow discovery -- incorrect severity or environment classification can exclude the correct workflow entirely.
LLM Selection
The LLM makes the final selection decision based on information available in the workflow schema:
- Action type description -- what, whenToUse, whenNotToUse, and preconditions
- Workflow description -- The workflow's own what and whenToUse fields
- Detected infrastructure context -- Prefer git-based workflows when gitOpsManaged=true, respect PDB constraints when pdbProtected=true
- Remediation history -- Avoid workflows that recently failed on the same target (formatted warnings in the prompt)
- Root cause alignment -- How well the action type and parameters match the RCA
Action Type Taxonomy
Action types provide a stable vocabulary for categorizing remediation actions. They are provisioned as Kubernetes CRDs (ActionType) and synced to the action_type_taxonomy PostgreSQL table by the Auth Webhook admission handler.
Demo Action Types
When demoContent.enabled: true (the default), the chart seeds 24 demo action types:
ScaleReplicas, RestartPod, IncreaseCPULimits, IncreaseMemoryLimits, RollbackDeployment, DrainNode, CordonNode, RestartDeployment, CleanupNode, DeletePod, GitRevertCommit, ProvisionNode, GracefulRestart, CleanupPVC, RemoveTaint, PatchHPA, RelaxPDB, ProactiveRollback, CordonDrainNode, FixCertificate, HelmRollback, FixAuthorizationPolicy, FixStatefulSetPVC, FixNetworkPolicy
User-Extensible
Action types are fully configurable. Operators register custom types by applying an ActionType CRD:
apiVersion: kubernaut.ai/v1alpha1
kind: ActionType
metadata:
  name: rotate-certificates
spec:
  name: RotateCertificates
  description:
    what: "Rotates TLS certificates for services"
    whenToUse: "When certificate expiry is approaching or certificates are invalid"
    whenNotToUse: "When the issue is DNS resolution, not certificate validity"
    preconditions: "cert-manager must be installed and the Certificate resource must exist"
The Auth Webhook intercepts the CREATE, registers the action type in the DataStorage taxonomy, captures the operator identity for audit attribution, and updates the CRD status. Deleting the CRD disables the action type in the catalog (soft delete).
The description fields are presented to the LLM during Step 1 of the discovery protocol, so accurate, unambiguous descriptions are essential. Follow these guidelines:
- what -- One sentence describing the action
- whenToUse -- Specific conditions that warrant this action
- whenNotToUse -- Conditions where this action would be wrong (prevents misselection)
- preconditions -- What must be true for the action to succeed
Lifecycle
Action types support active and disabled states. Disabled action types and their associated workflows are excluded from the discovery protocol (t.status = 'active' filter). Re-applying a previously deleted ActionType CRD re-enables the existing taxonomy entry.
Validation
Workflow creation validates that the declared action_type exists and is active in the taxonomy. Unknown or disabled action types are rejected.
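The lifecycle and validation rules can be modeled with a plain dictionary from action type name to status. This is a toy model for illustration: the function names and the `TaxonomyError` exception are hypothetical, and the real state lives in the action_type_taxonomy table.

```python
class TaxonomyError(ValueError):
    """Raised when a workflow references an unknown or disabled action type."""


def apply_action_type(taxonomy: dict, name: str) -> None:
    # Applying the CRD creates the entry, or re-enables an existing disabled one.
    taxonomy[name] = "active"


def delete_action_type(taxonomy: dict, name: str) -> None:
    # Soft delete: the entry stays in the taxonomy but is marked disabled.
    if name in taxonomy:
        taxonomy[name] = "disabled"


def validate_workflow_action_type(taxonomy: dict, action_type: str) -> None:
    """Reject workflow creation unless the action type exists and is active."""
    status = taxonomy.get(action_type)
    if status is None:
        raise TaxonomyError(f"unknown action type: {action_type}")
    if status != "active":
        raise TaxonomyError(f"action type is {status}: {action_type}")


taxonomy: dict = {}
apply_action_type(taxonomy, "RotateCertificates")
validate_workflow_action_type(taxonomy, "RotateCertificates")  # accepted

delete_action_type(taxonomy, "RotateCertificates")
status_after_delete = taxonomy["RotateCertificates"]

apply_action_type(taxonomy, "RotateCertificates")
status_after_reapply = taxonomy["RotateCertificates"]
```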
Confidence Thresholds
After selection, the confidence score determines the next step:
| Threshold | Action |
|---|---|
| >= 0.7 | Workflow selection accepted (investigation threshold) |
| >= 0.8 | Auto-approved for execution (approval threshold, configurable via Rego) |
| < 0.7 | Low-confidence; escalated to human review or no workflow selected |
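Read as a decision function, the table might look like the sketch below. The function and the returned outcome names are hypothetical, and the defaults simply mirror the table. The middle band (at least 0.7 but below 0.8) is interpreted here as accepted but not auto-approved, which is an inference from the two thresholds rather than something the table states outright.

```python
def next_step(
    confidence: float,
    investigation_threshold: float = 0.7,
    approval_threshold: float = 0.8,
) -> str:
    """Map a confidence score to an outcome (outcome names are illustrative)."""
    if confidence >= approval_threshold:
        return "auto_approved"
    if confidence >= investigation_threshold:
        # Selection is accepted, but execution is not approved automatically.
        return "accepted"
    return "low_confidence"


outcomes = {c: next_step(c) for c in (0.95, 0.8, 0.75, 0.7, 0.69)}
```

Checking the higher threshold first matters: a score of 0.85 satisfies both `>=` rows of the table, and the stricter one should win.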
API Endpoints
| Endpoint | Method | Purpose |
|---|---|---|
| GET /api/v1/workflows/actions | GET | Step 1: List action types with counts |
| GET /api/v1/workflows/actions/{action_type} | GET | Step 2: Scored candidates for action type |
| GET /api/v1/workflows/{workflow_id} | GET | Step 3: Full schema with security gate |
| GET /api/v1/workflows | GET | Catalog listing (no scoring) |
| POST /api/v1/workflows | POST | Create from OCI schema |
| PATCH /api/v1/workflows/{id}/disable | PATCH | Disable workflow |
| PATCH /api/v1/workflows/{id}/enable | PATCH | Enable workflow |
| PATCH /api/v1/workflows/{id}/deprecate | PATCH | Deprecate workflow |
Next Steps
- Workflow Execution -- How selected workflows are run
- Investigation Pipeline -- The HAPI investigation and selection process
- Remediation Workflows -- Writing workflow schemas
- Signal Processing -- How classification feeds into workflow filtering