Workflow Selection¶

Workflow selection is the process of finding the best remediation workflow for an incident. It uses a three-step discovery protocol (DD-HAPI-017) where HolmesGPT queries DataStorage, which applies mandatory filtering and semantic scoring before the LLM makes the final selection decision.

CRD Reference

For the complete CRD specifications, see RemediationWorkflow and ActionType in the API Reference.

Three-Step Discovery Protocol¶

sequenceDiagram
    participant LLM as LLM
    participant HAPI as HolmesGPT API
    participant DS as DataStorage

    LLM->>HAPI: list_available_actions()
    HAPI->>DS: GET /api/v1/workflows/actions
    DS-->>HAPI: Action types with workflow counts
    HAPI-->>LLM: Available action types

    LLM->>HAPI: list_workflows(action_type)
    HAPI->>DS: GET /api/v1/workflows/actions/{action_type}
    DS->>DS: Layer 1 filter + Layer 2 scoring
    DS-->>HAPI: Scored candidates (scores stripped)
    HAPI-->>LLM: Workflow summaries

    LLM->>HAPI: get_workflow(workflow_id)
    HAPI->>DS: GET /api/v1/workflows/{workflow_id}
    DS-->>HAPI: Full workflow schema
    HAPI-->>LLM: Full schema for evaluation

Step 1: List Action Types¶

The LLM calls list_available_actions() to discover what types of remediations are available. DataStorage returns action types from the action_type_taxonomy table, filtered to only include types that have at least one active workflow matching the signal context.

SELECT t.action_type, t.description, COUNT(w.workflow_id) AS workflow_count
FROM action_type_taxonomy t
INNER JOIN remediation_workflow_catalog w ON w.action_type = t.action_type
WHERE w.status = 'active' AND w.is_latest_version = true
  AND t.status = 'active'
  AND [context filters]
GROUP BY t.action_type, t.description
ORDER BY t.action_type

Each action type includes a structured description (what, whenToUse, whenNotToUse, preconditions) that the LLM uses to choose the appropriate action category based on the root cause analysis.

Step 2: List Workflows by Action Type¶

The LLM calls list_workflows(action_type) to get candidate workflows. DataStorage applies Layer 1 mandatory filtering and Layer 2 semantic scoring, returning results ordered by score. Scores are stripped before reaching the LLM -- they are used only for ordering.

Step 3: Get Full Workflow Schema¶

The LLM calls get_workflow(workflow_id) to retrieve the full schema for detailed evaluation. A context filter security gate ensures the workflow still matches the signal context.

Signal Context Propagation¶

HAPI propagates signal context from the investigation session to all DataStorage queries:

Parameter	Source	Purpose
`severity`	SP classification	Mandatory filter
`component`	RCA target resource kind	Mandatory filter
`environment`	SP classification	Mandatory filter
`priority`	SP classification	Mandatory filter
`custom_labels`	SP custom labels	Layer 2 scoring boost
`detected_labels`	HAPI `LabelDetector` (post-RCA)	Layer 2 scoring boost + penalty
`remediation_id`	Parent RR name	Audit correlation

Detected labels are computed by HAPI for the RCA target resource (ADR-056). Labels with failedDetections entries are stripped before propagation to DataStorage.

Layer 1: Mandatory Filtering¶

Every workflow declares four mandatory labels. DataStorage applies these as hard filters -- a workflow must match all four to be a candidate.

Label	Type	Values	SQL Pattern
`severity`	`string[]`	`critical`, `high`, `medium`, `low`, `"*"`	`labels->'severity' ? $val OR labels->'severity' ? '*'`
`component`	`string`	`pod`, `deployment`, `node`, `"*"`	`LOWER(labels->>'component') = LOWER($val) OR labels->>'component' = '*'`
`environment`	`string[]`	`production`, `staging`, `development`, `test`, `"*"`	`labels->'environment' ? $val OR labels->'environment' ? '*'`
`priority`	`string`	`P0`, `P1`, `P2`, `P3`, `"*"`	JSONB scalar or array containment

Labels support:

Exact match -- component: deployment
Wildcard -- component: "*" (matches any value)
Multi-value -- severity: [critical, high] (matches if any value overlaps)

Detected Label Filters¶

Detected labels use inclusive filtering (Issue #197) -- workflows that don't declare a detected label are included, not excluded:

Boolean labels (e.g., gitOpsManaged): detected_labels->>'field' = 'true' OR detected_labels->>'field' IS NULL
String labels (e.g., gitOpsTool): detected_labels->>'field' = $val OR detected_labels->>'field' = '*' OR detected_labels->>'field' IS NULL

This ensures newly registered workflows are not excluded simply because they haven't declared detected labels yet.

Why Business Classification Is Not a Layer 1 Filter¶

Business classification labels (businessUnit, serviceOwner, criticality, SLARequirement) are not part of mandatory filtering. This is by design:

Optional labels -- Business labels are derived from namespace labels that may not be configured. Making them mandatory would exclude workflows for uncategorized namespaces.
Deployment flexibility -- Early-stage deployments may not have business classification set up. Workflows should still be discoverable.
Broad applicability -- Most remediation workflows (restart, scale, rollback) are not business-unit-specific. A RestartPod workflow works the same regardless of business unit.
Rego policy connection -- Signal Processing Rego policies determine severity, environment, and priority, which directly feed into Layer 1. Business classification feeds into Layer 2 scoring as a refinement signal, not a hard gate.

signalName is not a matching label

signalName is optional metadata in the workflow schema (DD-WORKFLOW-016). It is not used for filtering or matching — only for the final result ordering tiebreaker (ORDER BY final_score DESC, workflow_id ASC). The LLM selects workflows by actionType, not by signalName.

Layer 2: Semantic Scoring¶

DataStorage computes a final_score for each candidate to order results (DD-WORKFLOW-004 v1.5). Scores are used only for ordering and are stripped before reaching the LLM.

Scoring Formula¶

final_score = LEAST((5.0 + detected_label_boost + custom_label_boost - label_penalty) / 10.0, 1.0)

Base score: 5.0 out of 10.0 (normalized to 0.50)
Range: 0.0 -- 1.0 (clamped via LEAST(..., 1.0))
Ordering: ORDER BY final_score DESC, workflow_id ASC

Detected Label Weights¶

Label	Boost (exact match)	Penalty (mismatch)
`gitOpsManaged`	+0.10	-0.10
`gitOpsTool`	+0.10	-0.10
`pdbProtected`	+0.05	--
`serviceMesh`	+0.05	--
`networkIsolated`	+0.03	--
`helmManaged`	+0.02	--
`stateful`	+0.02	--
`hpaEnabled`	+0.02	--

Maximum boost: +0.39 (all labels exact match) Maximum penalty: -0.20 (GitOps mismatch only)

Wildcard Weighting¶

Exact match: Full weight (e.g., gitOpsManaged=true matches true → +0.10)
Workflow declares wildcard ("*"): Half weight (e.g., +0.05)
Query has wildcard ("*"): Half weight

Custom Label Boost¶

Custom labels from Signal Processing's Rego policy output contribute additional scoring (DD-WORKFLOW-004 v1.5):

Exact match: +0.15 per key
Wildcard match: +0.075 per key
SQL: custom_labels->'key' @> 'value'::jsonb

Score Connection to SP Rego Policies¶

The mandatory filter labels (severity, environment, priority) are determined by Signal Processing's Rego classification policies. This creates a direct connection:

SP Severity Rego policy → determines which severity values filter workflows in Layer 1
SP Environment Rego policy → determines which environment value is used for filtering
SP Priority Engine → determines the priority level for filtering
SP Custom Labels Rego policy → produces labels that boost scores in Layer 2

Accurate Rego policy configuration is critical for workflow discovery -- incorrect severity or environment classification can exclude the correct workflow entirely.

LLM Selection¶

The LLM makes the final selection decision based on information available in the workflow schema:

Action type description -- what, whenToUse, whenNotToUse, and preconditions
Workflow description -- The workflow's own what and whenToUse fields
Detected infrastructure context -- Prefer git-based workflows when gitOpsManaged=true, respect PDB constraints when pdbProtected=true
Remediation history -- Avoid workflows that recently failed on the same target (formatted warnings in the prompt)
Root cause alignment -- How well the action type and parameters match the RCA

Action Type Taxonomy¶

Action types provide a stable vocabulary for categorizing remediation actions. They are provisioned as Kubernetes CRDs (ActionType) and synced to the action_type_taxonomy PostgreSQL table by the Auth Webhook admission handler.

Demo Action Types¶

When demoContent.enabled: true (the default), the chart seeds 24 demo action types:

ScaleReplicas, RestartPod, IncreaseCPULimits, IncreaseMemoryLimits, RollbackDeployment, DrainNode, CordonNode, RestartDeployment, CleanupNode, DeletePod, GitRevertCommit, ProvisionNode, GracefulRestart, CleanupPVC, RemoveTaint, PatchHPA, RelaxPDB, ProactiveRollback, CordonDrainNode, FixCertificate, HelmRollback, FixAuthorizationPolicy, FixStatefulSetPVC, FixNetworkPolicy

User-Extensible¶

Action types are fully configurable. Operators register custom types by applying an ActionType CRD:

apiVersion: kubernaut.ai/v1alpha1
kind: ActionType
metadata:
  name: rotate-certificates
spec:
  name: RotateCertificates
  description:
    what: "Rotates TLS certificates for services"
    whenToUse: "When certificate expiry is approaching or certificates are invalid"
    whenNotToUse: "When the issue is DNS resolution, not certificate validity"
    preconditions: "cert-manager must be installed and the Certificate resource must exist"

The Auth Webhook intercepts the CREATE, registers the action type in the DataStorage taxonomy, captures the operator identity for audit attribution, and updates the CRD status. Deleting the CRD disables the action type in the catalog (soft delete).

The description fields are presented to the LLM during Step 1 of the discovery protocol, so accurate, unambiguous descriptions are essential. Follow these guidelines:

what -- One sentence describing the action
whenToUse -- Specific conditions that warrant this action
whenNotToUse -- Conditions where this action would be wrong (prevents misselection)
preconditions -- What must be true for the action to succeed

Lifecycle¶

Action types support active and disabled states. Disabled action types and their associated workflows are excluded from the discovery protocol (t.status = 'active' filter). Re-applying a previously deleted ActionType CRD re-enables the existing taxonomy entry.

Validation¶

Workflow creation validates that the declared action_type exists and is active in the taxonomy. Unknown or disabled action types are rejected.

Confidence Thresholds¶

After selection, the confidence score determines the next step:

Threshold	Action
>= 0.7	Workflow selection accepted (investigation threshold)
>= 0.8	Auto-approved for execution (approval threshold, configurable via Rego)
< 0.7	Low-confidence; escalated to human review or no workflow selected

API Endpoints¶

Endpoint	Method	Purpose
`GET /api/v1/workflows/actions`	GET	Step 1: List action types with counts
`GET /api/v1/workflows/actions/{action_type}`	GET	Step 2: Scored candidates for action type
`GET /api/v1/workflows/{workflow_id}`	GET	Step 3: Full schema with security gate
`GET /api/v1/workflows`	GET	Catalog listing (no scoring)
`POST /api/v1/workflows`	POST	Create from OCI schema
`PATCH /api/v1/workflows/{id}/disable`	PATCH	Disable workflow
`PATCH /api/v1/workflows/{id}/enable`	PATCH	Enable workflow
`PATCH /api/v1/workflows/{id}/deprecate`	PATCH	Deprecate workflow

Next Steps¶

Workflow Execution -- How selected workflows are run
Investigation Pipeline -- The HAPI investigation and selection process
Remediation Workflows -- Writing workflow schemas
Signal Processing -- How classification feeds into workflow filtering