Remediation Workflows¶

Kubernaut remediates issues by running workflows: containerized actions or Ansible playbooks that fix known problems. Workflows are registered as RemediationWorkflow CRDs, synced to a searchable catalog by the Auth Webhook, and matched to incidents by the LLM based on labels, infrastructure context, and remediation history.

This page covers everything you need to author, build, register, and manage workflows.

Registration Model¶

Workflows are registered by applying a RemediationWorkflow CRD. The Auth Webhook intercepts the admission request, registers the workflow in the DataStorage catalog, captures the operator identity for audit attribution, and computes a content hash for deduplication.

flowchart LR
    Op["Operator<br/><small>kubectl apply</small>"] --> AW["Auth Webhook<br/><small>admission</small>"]
    AW --> DS["DataStorage<br/><small>Catalog</small>"]
    Bundle["Execution Bundle<br/><small>OCI image / Git repo</small>"] --> WFE["WFE<br/><small>Job / Tekton / Ansible</small>"]
    DS -.->|"bundle ref"| WFE

Component	Contents	Purpose
RemediationWorkflow CRD	Workflow schema (version, description, labels, parameters, execution config)	Registered in DataStorage catalog for discovery and LLM selection
Execution bundle	The container or playbook that runs the remediation	Referenced in the CRD; pulled by the Workflow Execution (WFE) controller at execution time

The CRD approach replaces the previous OCI schema image model. Workflow schemas are now native Kubernetes resources, enabling kubectl management, GitOps workflows, and admission webhook integration for audit attribution.

Create Your First Workflow¶

This tutorial walks through creating a workflow that restarts a deployment.

Step 1: Write the Schema¶

Create restart-deployment.yaml:

apiVersion: kubernaut.ai/v1alpha1
kind: RemediationWorkflow
metadata:
  name: restart-deployment-v1
spec:
  version: "1.0.0"
  description:
    what: "Performs a rolling restart of a deployment to clear corrupted runtime state"
    whenToUse: "When pods are in a degraded state but the deployment spec is correct"
    whenNotToUse: "When the issue is caused by a bad image or config change"
    preconditions: "The deployment exists and has at least one ready replica"
  maintainers:
    - name: "Platform Team"
      email: "platform@example.com"

  actionType: RestartDeployment

  labels:
    severity: [critical, high, medium]
    environment: [production, staging, development, "*"]
    component: deployment
    priority: "*"

  detectedLabels:
    helmManaged: "true"

  execution:
    engine: job
    bundle: registry.example.com/workflows/restart-deployment@sha256:abc123...

  parameters:
    - name: TARGET_RESOURCE_NAME
      type: string
      required: true
      description: "Name of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_KIND
      type: string
      required: true
      description: "Kind of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_NAMESPACE
      type: string
      required: true
      description: "Namespace of the root managing resource (HAPI-injected)"
    - name: TARGET_DEPLOYMENT
      type: string
      required: true
      description: "Name of the deployment to restart"

  dependencies:
    secrets: []
    configMaps: []

Step 2: Write the Remediation Script¶

Create remediate.sh:

#!/bin/bash
set -euo pipefail

echo "Validating deployment exists..."
kubectl get deployment "$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" || {
  echo "ERROR: Deployment not found"
  exit 1
}

echo "Performing rolling restart..."
kubectl rollout restart deployment/"$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE"

echo "Waiting for rollout to complete..."
kubectl rollout status deployment/"$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" --timeout=120s

echo "Verifying deployment health..."
READY=$(kubectl get deployment "$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" -o jsonpath='{.status.readyReplicas}')
READY=${READY:-0}  # readyReplicas is omitted from status when no replica is ready
DESIRED=$(kubectl get deployment "$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" -o jsonpath='{.spec.replicas}')

if [ "$READY" = "$DESIRED" ]; then
  echo "SUCCESS: All $READY/$DESIRED replicas ready"
else
  echo "WARNING: Only $READY/$DESIRED replicas ready"
  exit 1
fi

Step 3: Build the Execution Bundle¶

Create Dockerfile.exec:

FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
RUN microdnf install -y tar gzip && microdnf clean all
COPY --from=bitnami/kubectl:latest /opt/bitnami/kubectl/bin/kubectl /usr/local/bin/kubectl
COPY remediate.sh /scripts/remediate.sh
RUN chmod +x /scripts/remediate.sh
USER 1001
ENTRYPOINT ["/scripts/remediate.sh"]

Build and push:

docker build -f Dockerfile.exec -t registry.example.com/workflows/restart-deployment:v1.0.0 .
docker push registry.example.com/workflows/restart-deployment:v1.0.0

Note the image digest from the push output -- update the execution.bundle field in the CRD with the digest-pinned reference.
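
One way to capture the digest without copying it by hand is to parse it out of the saved push output. The sketch below is illustrative only: extract_digest and pin_reference are hypothetical helper names, push.log is an assumed file name, and the parsing assumes the push output contains a line of the form `digest: sha256:<64 hex characters>`.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the last sha256 digest found in text read from stdin.
extract_digest() {
  grep -oE 'sha256:[0-9a-f]{64}' | tail -n 1
}

# Hypothetical helper: join a repository and a digest into a digest-pinned reference.
pin_reference() {
  local repo="$1" digest="$2"
  printf '%s@%s\n' "$repo" "$digest"
}

# Usage against a real registry (not executed here):
#   docker push registry.example.com/workflows/restart-deployment:v1.0.0 | tee push.log
#   pin_reference registry.example.com/workflows/restart-deployment "$(extract_digest < push.log)"
```

The printed reference is the value to paste into execution.bundle.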

Step 4: Register the Workflow¶

kubectl apply -f restart-deployment.yaml

The Auth Webhook intercepts the CREATE request, registers the workflow in the DataStorage catalog, captures the operator identity for audit attribution, and updates the CRD status with the assigned workflowId and catalogStatus.

Verify registration:

kubectl get remediationworkflow restart-deployment-v1 -o wide
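
Because the webhook writes workflowId and catalogStatus into the CRD status, a script can poll for the expected status instead of reading the table output. This is a rough sketch: wait_until is a hypothetical helper, and both the jsonpath `.status.catalogStatus` and the value `active` are assumptions about how the status is serialized, so check them against the CRD Reference.

```bash
#!/usr/bin/env bash
# wait_until EXPECTED TRIES CMD...
# Re-runs CMD until its output equals EXPECTED, up to TRIES attempts.
wait_until() {
  local expected="$1" tries="$2"
  shift 2
  local i out
  for ((i = 1; i <= tries; i++)); do
    out="$("$@" 2>/dev/null || true)"
    if [[ "$out" == "$expected" ]]; then
      return 0
    fi
    if (( i < tries )); then
      sleep "${WAIT_INTERVAL:-2}"  # pause between attempts (seconds)
    fi
  done
  return 1
}

# Usage against a cluster (not executed here; field path and value are assumptions):
#   wait_until active 30 kubectl get remediationworkflow restart-deployment-v1 \
#     -o jsonpath='{.status.catalogStatus}'
```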

Schema Reference¶

For the complete field specification, see RemediationWorkflow and ActionType in the CRD Reference.

Labels¶

Mandatory labels control when a workflow is eligible during discovery:

Label	Type	Required	Description
`severity`	string[]	Yes	Severity levels: `critical`, `high`, `medium`, `low`
`environment`	string[]	Yes	Environments: `production`, `staging`, `development`, `test`, or `"*"`
`component`	string	Yes	Resource kind: `pod`, `deployment`, `node`, or `"*"`
`priority`	string	Yes	Priority: `P0`, `P1`, `P2`, `P3`, or `"*"`
`signalName`	string	No	Optional metadata for workflow authors. Not used for matching -- the LLM selects by `actionType`

Labels support:

Exact match -- component: deployment
Wildcard -- component: "*" (matches any value)
Multi-value -- severity: [critical, high] (matches either)
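
The three matching modes can be sketched as a small shell function. This only illustrates the semantics listed above, not the DataStorage implementation; label_matches is a hypothetical name, and the comparison is case-sensitive for simplicity.

```bash
#!/usr/bin/env bash
# label_matches QUERY WORKFLOW_VALUE...
# Succeeds when any workflow value equals the query value or is the "*" wildcard.
label_matches() {
  local query="$1"
  shift
  local value
  for value in "$@"; do
    if [[ "$value" == "*" || "$value" == "$query" ]]; then
      return 0
    fi
  done
  return 1
}

# Examples (quote "*" so the shell does not expand it):
#   label_matches deployment deployment    # exact match
#   label_matches pod "*"                  # wildcard
#   label_matches high critical high       # multi-value
```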

Labels determine discoverability

Workflows that don't match the mandatory label filters are excluded entirely -- they never reach the LLM. A misconfigured severity or environment can silently hide a workflow from the candidate set. See Workflow Search and Scoring for details.

Detected Labels¶

Optional infrastructure-awareness labels that influence scoring and help the LLM select the right workflow for the target environment:

detectedLabels:
  gitOpsManaged: "true"
  gitOpsTool: "argocd"      # argocd | flux | "*"
  helmManaged: "true"
  pdbProtected: "true"
  hpaEnabled: "true"
  stateful: "true"
  networkIsolated: "true"
  serviceMesh: "istio"       # istio | linkerd | "*"

Label	Type	Valid Values
`gitOpsManaged`	boolean	`"true"` only
`gitOpsTool`	string	`argocd`, `flux`, `"*"`
`pdbProtected`	boolean	`"true"` only
`hpaEnabled`	boolean	`"true"` only
`stateful`	boolean	`"true"` only
`helmManaged`	boolean	`"true"` only
`networkIsolated`	boolean	`"true"` only
`serviceMesh`	string	`istio`, `linkerd`, `"*"`

Workflows that declare detected labels earn scoring boosts when the target resource matches -- see Workflow Search and Scoring.

Bundle Digest Format¶

For job and tekton engines, the execution.bundle field must use a digest-pinned OCI reference:

registry.example.com/repo/image@sha256:<64 hex characters>

Must contain @ (tag-only references are rejected)
Must use sha256: algorithm
Digest must be exactly 64 hex characters
An optional tag before @ is allowed: image:v1.0.0@sha256:abc123...
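
A pre-flight check for these rules might look like the following sketch. The function name is_digest_pinned is hypothetical, and restricting the digest to lowercase hex is an assumption; the webhook remains the authority on what is accepted.

```bash
#!/usr/bin/env bash
# Succeeds when REF has the form <anything without "@">@sha256:<64 lowercase hex chars>.
# An optional tag before "@" (image:v1.0.0@sha256:...) passes, since ":" is allowed there.
is_digest_pinned() {
  local ref="$1"
  local pattern='^[^@[:space:]]+@sha256:[0-9a-f]{64}$'
  [[ "$ref" =~ $pattern ]]
}

# Example:
#   is_digest_pinned "$BUNDLE_REF" || echo "bundle must be digest-pinned" >&2
```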

For the ansible engine, the execution.bundle field is a Git repository URL, and execution.bundleDigest is the Git commit SHA.

Dependencies¶

Workflows can declare Secrets and ConfigMaps that must exist in the execution namespace:

dependencies:
  secrets:
    - name: registry-credentials
  configMaps:
    - name: app-config

The Workflow Execution controller validates that these resources exist before creating the execution resource. How they are delivered to the workflow depends on the engine:

Kubernetes Jobs:

Secrets mounted at /run/kubernaut/secrets/<name> (read-only)
ConfigMaps mounted at /run/kubernaut/configmaps/<name> (read-only)
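
Inside a Job, a remediation script could read a dependency value from these mounts roughly as follows. The sketch assumes each key appears as a file under the mount directory (the usual Kubernetes volume layout); read_dependency and the KUBERNAUT_MOUNT_ROOT override are hypothetical and exist only to make the helper testable outside a cluster.

```bash
#!/usr/bin/env bash
# read_dependency KIND NAME KEY
# KIND is "secrets" or "configmaps"; prints the content of the key file.
read_dependency() {
  local kind="$1" name="$2" key="$3"
  local root="${KUBERNAUT_MOUNT_ROOT:-/run/kubernaut}"  # override is for local testing only
  local path="$root/$kind/$name/$key"
  if [[ ! -r "$path" ]]; then
    echo "ERROR: dependency file not readable: $path" >&2
    return 1
  fi
  cat "$path"
}

# Example inside remediate.sh (names taken from the dependencies snippet above;
# the "password" key is an assumed example):
#   REGISTRY_PASSWORD=$(read_dependency secrets registry-credentials password)
```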

Tekton Pipelines:

Secrets bound as secret-<name> workspaces
ConfigMaps bound as configmap-<name> workspaces

Ansible (AWX/AAP):

For each Secret in dependencies.secrets, the executor reads the Kubernetes Secret, dynamically creates an AWX credential type with an env injector, creates an ephemeral AWX credential, and attaches it to the job launch. AWX injects the values as environment variables into the Execution Environment:

```
KUBERNAUT_SECRET_{SECRET_NAME}_{KEY}
```

For example, a Secret named gitea-repo-creds with keys username and password becomes:

- KUBERNAUT_SECRET_GITEA_REPO_CREDS_USERNAME
- KUBERNAUT_SECRET_GITEA_REPO_CREDS_PASSWORD

Ephemeral credentials are automatically deleted after the AWX job completes or is cleaned up. The Kubernetes Secret remains the single source of truth -- if it changes, the next execution picks up the new values.

For each ConfigMap in dependencies.configMaps, the executor reads the Kubernetes ConfigMap and merges its data into AWX extra_vars with a standardized prefix:

```
KUBERNAUT_CONFIGMAP_{CONFIGMAP_NAME}_{KEY}
```

For example, a ConfigMap named app-settings with keys timeout and log-level becomes extra_vars:

- KUBERNAUT_CONFIGMAP_APP_SETTINGS_TIMEOUT
- KUBERNAUT_CONFIGMAP_APP_SETTINGS_LOG_LEVEL

ConfigMap data is non-sensitive, so it uses AWX extra_vars (not credentials). The playbook accesses these as standard Ansible variables.
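
To predict the variable name a playbook should look up, the naming pattern can be reproduced with a short helper. The transformation (uppercase, hyphens to underscores) is inferred from the two examples above rather than taken from the executor source, and kubernaut_var_name is a hypothetical name; other characters such as dots are left untouched here because the page does not say how they are handled.

```bash
#!/usr/bin/env bash
# kubernaut_var_name KIND NAME KEY
# KIND is SECRET or CONFIGMAP; prints KUBERNAUT_<KIND>_<NAME>_<KEY> in upper snake case.
kubernaut_var_name() {
  local kind="$1" name="$2" key="$3"
  local raw="KUBERNAUT_${kind}_${name}_${key}"
  raw="${raw//-/_}"                                  # hyphens become underscores
  printf '%s\n' "$raw" | tr '[:lower:]' '[:upper:]'  # uppercase everything
}

# Examples:
#   kubernaut_var_name SECRET gitea-repo-creds username
#   kubernaut_var_name CONFIGMAP app-settings log-level
```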

Security: Use no_log: true for sensitive Ansible tasks

When writing Ansible playbooks that handle secrets (credentials, tokens, passwords), always set no_log: true on tasks that read or use sensitive values. This prevents AWX from recording secret data in job output logs:

- name: Read Git credentials from AWX credential env vars
  ansible.builtin.set_fact:
    git_username: "{{ lookup('env', 'KUBERNAUT_SECRET_GITEA_REPO_CREDS_USERNAME') }}"
    git_password: "{{ lookup('env', 'KUBERNAUT_SECRET_GITEA_REPO_CREDS_PASSWORD') }}"
  no_log: true

Tasks to protect include: reading credentials from environment variables, building authenticated URLs, cloning repositories with embedded credentials, and any task that passes secrets as arguments.

Parameters¶

Parameters use UPPER_SNAKE_CASE names and are injected as environment variables. For the complete parameter schema, see RemediationWorkflow in the CRD Reference.

The description field is shown to the LLM during get_workflow, so it should be clear enough for the LLM to populate the parameter from its investigation findings.

Canonical target resource parameters¶

Every workflow schema must declare the following three parameters, each with required: true:

Parameter	Description
`TARGET_RESOURCE_NAME`	Name of the root managing resource
`TARGET_RESOURCE_KIND`	Kind of the root managing resource (e.g., `Deployment`)
`TARGET_RESOURCE_NAMESPACE`	Namespace of the root managing resource

These are HAPI-injected -- HAPI derives them from the K8s-verified root_owner (resolved via the Pod → ReplicaSet → Deployment owner chain) and injects them into selected_workflow.parameters before the AIAnalysis completes. The LLM never sees or populates these fields (they are stripped from the schema before the LLM receives it).

If HAPI cannot determine the root_owner (e.g., the resource context tools were never called), the investigation is flagged rca_incomplete with needs_human_review=true.

Additionally, the WFE controller injects TARGET_RESOURCE (composite format namespace/kind/name) from wfe.Spec.TargetResource into every Job and Tekton PipelineRun as a system variable.
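
A script that prefers the composite variable can split it into its parts. The following is a minimal sketch assuming all three segments are non-empty and separated by "/"; parse_target_resource is a hypothetical helper, and how cluster-scoped resources are encoded is not covered here.

```bash
#!/usr/bin/env bash
# parse_target_resource VALUE
# Splits "namespace/kind/name" and prints the parts separated by spaces.
parse_target_resource() {
  local namespace kind name
  IFS=/ read -r namespace kind name <<< "$1"
  if [[ -z "$namespace" || -z "$kind" || -z "$name" ]]; then
    echo "ERROR: expected namespace/kind/name, got: $1" >&2
    return 1
  fi
  printf '%s %s %s\n' "$namespace" "$kind" "$name"
}

# Example inside a Job:
#   read -r NS KIND NAME < <(parse_target_resource "$TARGET_RESOURCE")
```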

Workflows may also declare additional operational parameters (e.g., TARGET_DEPLOYMENT, GRACE_PERIOD_SECONDS) that the LLM populates from its investigation findings.

Ansible auto-injected variables¶

For ansible executions, the executor also auto-injects remediation context variables into AWX extra_vars (BR-WE-015 TR-6):

Variable	Source	Purpose
`WFE_NAME`	WorkflowExecution CRD name	Query WFE status, parameters, or execution metadata via the Kubernetes API
`WFE_NAMESPACE`	WorkflowExecution CRD namespace	Namespace of the WFE
`RR_NAME`	`wfe.Spec.RemediationRequestRef.Name`	Reference the parent RemediationRequest in commit messages, logs, and audit annotations -- no Kubernetes API lookup needed
`RR_NAMESPACE`	`wfe.Spec.RemediationRequestRef.Namespace`	Namespace of the parent RR

These variables are injected by the executor and must not be declared as parameters in the workflow schema. RR_NAME is the most commonly used -- for example, the GitOps memory limits playbook includes the RR name in its Git commit message to link the code change back to the remediation event.

Execution Engines¶

Kubernetes Jobs¶

Single-step remediations run as Kubernetes Jobs in the kubernaut-workflows namespace:

execution:
  engine: job
  bundle: registry.example.com/workflows/restart-deployment@sha256:abc123...

The Workflow Execution controller creates a Job with:

Environment variables -- All parameters (including the three canonical TARGET_RESOURCE_* params) injected as env vars, plus the system-injected TARGET_RESOURCE
Dependency mounts -- Secrets at /run/kubernaut/secrets/<name>, ConfigMaps at /run/kubernaut/configmaps/<name>
ServiceAccount -- kubernaut-workflow-runner (pre-configured RBAC)
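
Since every parameter arrives as an environment variable, a remediation script can fail fast when one is missing. The helper below is a sketch; require_env is a hypothetical name, not part of Kubernaut.

```bash
#!/usr/bin/env bash
# require_env VAR...
# Reports every named variable that is unset or empty; fails if any is missing.
require_env() {
  local missing=0 var
  for var in "$@"; do
    if [[ -z "${!var:-}" ]]; then
      echo "ERROR: required environment variable $var is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example at the top of remediate.sh:
#   require_env TARGET_RESOURCE_NAME TARGET_RESOURCE_KIND \
#     TARGET_RESOURCE_NAMESPACE TARGET_DEPLOYMENT || exit 1
```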

Tekton Pipelines¶

Multi-step remediations use Tekton Pipelines:

execution:
  engine: tekton
  bundle: registry.example.com/tekton-bundles/oom-recovery:v1.0.0@sha256:abc123...

The bundle must contain a Tekton Pipeline named workflow. The controller creates a PipelineRun with:

Tekton bundle resolver -- The bundle is referenced via resolver: bundles with the digest-pinned image
Parameters -- All parameters (including the three canonical TARGET_RESOURCE_* params) injected as Tekton params, plus the system-injected TARGET_RESOURCE
Dependency workspaces -- Secrets as secret-<name> workspace bindings, ConfigMaps as configmap-<name> workspace bindings

Tekton provides step ordering, retries, and artifact passing between steps.

Ansible (AWX/AAP)¶

Workflows that run Ansible playbooks via AWX or Ansible Automation Platform (AAP) use the ansible engine:

apiVersion: kubernaut.ai/v1alpha1
kind: RemediationWorkflow
metadata:
  name: ansible-fix-config
spec:
  version: "1.0.0"
  description:
    what: "Fixes application configuration drift using Ansible"
    whenToUse: "When config drift is detected and the correct state is in a Git repository"
  actionType: FixConfiguration
  labels:
    severity: [high, medium]
    environment: [production, staging]
    component: deployment
    priority: "*"
  execution:
    engine: ansible
    bundle: https://github.com/org/remediation-playbooks.git
    bundleDigest: b7e6a135be2019f995cb4875dbc0116dfda39d21
    engineConfig:
      playbookPath: "playbooks/fix-config.yml"
      jobTemplateName: "kubernaut-fix-config"
  parameters:
    - name: TARGET_RESOURCE_NAME
      type: string
      required: true
      description: "Name of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_KIND
      type: string
      required: true
      description: "Kind of the root managing resource (HAPI-injected)"
    - name: TARGET_RESOURCE_NAMESPACE
      type: string
      required: true
      description: "Namespace of the root managing resource (HAPI-injected)"

The engineConfig fields for Ansible:

Field	Required	Description
`playbookPath`	Yes	Path to the playbook within the Git repository
`jobTemplateName`	Yes	AWX/AAP Job Template name to launch
`inventoryName`	No	AWX/AAP inventory to use

The Workflow Execution controller launches the AWX job template, passes parameters as extra variables, and monitors the job status until completion.

Dependencies in Ansible workflows:

Secrets (dependencies.secrets): Injected as environment variables via ephemeral AWX credentials (KUBERNAUT_SECRET_{NAME}_{KEY}). Use lookup('env', ...) in your playbook.
ConfigMaps (dependencies.configMaps): Merged into AWX extra_vars (KUBERNAUT_CONFIGMAP_{NAME}_{KEY}). Access as standard Ansible variables.

Automatic K8s API credentials:

The executor automatically injects the Workflow Execution controller's in-cluster ServiceAccount token as an ephemeral AWX credential on every job launch. Playbooks using kubernetes.core modules receive K8S_AUTH_HOST, K8S_AUTH_API_KEY, and K8S_AUTH_SSL_CA_CERT environment variables without manual credential configuration. The credential is ephemeral and deleted after the job completes. If the in-cluster environment is unavailable, the job proceeds without K8s credentials.

Action Type Taxonomy¶

Action types form the vocabulary the LLM uses to reason about remediation. Each action type has a structured description (what, whenToUse, whenNotToUse, preconditions) that the LLM reads during the list_available_actions step.

Demo Action Types¶

When demoContent.enabled: true (the default), the chart seeds the following action types:

Action Type	What It Does
`ScaleReplicas`	Horizontally scale a workload by adjusting the replica count
`RestartPod`	Kill and recreate one or more pods
`IncreaseCPULimits`	Increase CPU resource limits on containers
`IncreaseMemoryLimits`	Increase memory resource limits on containers
`RollbackDeployment`	Revert a deployment to its previous stable revision
`DrainNode`	Drain and cordon a Kubernetes node, evicting all pods
`CordonNode`	Cordon a node to prevent new pod scheduling
`RestartDeployment`	Perform a rolling restart of all pods in a workload
`CleanupNode`	Reclaim disk space on a node by purging temporary files
`DeletePod`	Delete specific pods without waiting for graceful termination
`GitRevertCommit`	Revert a bad commit in a Git repository managed by GitOps
`ProvisionNode`	Request provisioning of a new Kubernetes node
`GracefulRestart`	Perform a graceful rolling restart to reset runtime state
`CleanupPVC`	Remove old or unnecessary files from a PVC
`RemoveTaint`	Remove a taint from a Kubernetes node
`PatchHPA`	Patch an HPA to increase maxReplicas or adjust thresholds
`RelaxPDB`	Temporarily relax a PDB to unblock a pending node drain
`ProactiveRollback`	Proactively roll back based on predictive SLO burn rate analysis
`CordonDrainNode`	Cordon a node, then drain existing pods to other nodes
`FixCertificate`	Recreate a missing or corrupted CA Secret for cert-manager
`HelmRollback`	Roll back a Helm release to its previous healthy revision
`FixAuthorizationPolicy`	Remove or fix a Linkerd AuthorizationPolicy blocking traffic
`FixStatefulSetPVC`	Recreate a missing PVC for a StatefulSet and restart the stuck pod
`FixNetworkPolicy`	Remove a deny-all NetworkPolicy blocking legitimate ingress
`MigrateEmptyDirToPVC`	Migrate a stateful workload from ephemeral emptyDir storage to a persistent volume claim

Registering Custom Action Types¶

The taxonomy is user-extensible. Operators register custom action types by applying an ActionType CRD:

apiVersion: kubernaut.ai/v1alpha1
kind: ActionType
metadata:
  name: restart-sidecar
spec:
  name: RestartSidecar
  description:
    what: "Restart only the sidecar container without affecting the main application"
    whenToUse: "When a service mesh sidecar is in a degraded state but the main container is healthy"
    whenNotToUse: "When the main application container is also failing"
    preconditions: "The pod has a sidecar container identified by the service mesh annotation"

kubectl apply -f restart-sidecar-actiontype.yaml

The Auth Webhook intercepts the CREATE, registers the action type in the DataStorage taxonomy, and captures the operator identity for audit attribution. Deleting the CRD disables the action type (soft delete). Re-applying a previously deleted CRD re-enables the existing entry.

Action type descriptions directly affect LLM behavior

The LLM reads what, whenToUse, whenNotToUse, and preconditions verbatim during workflow discovery. Poorly written or overlapping descriptions degrade workflow selection quality:

Write clear, unambiguous descriptions so the LLM can distinguish between action types
Avoid semantic collisions -- action types that overlap in meaning (e.g., RestartPod vs RecyclePod) will confuse the LLM
Use PascalCase naming consistent with existing demo types
Action types are intentionally stable -- they should not change frequently during a deployment's lifecycle

Workflow Lifecycle¶

Workflows have five lifecycle states:

State	Description	Discoverable
`active`	Available for selection	Yes (if `is_latest_version`)
`disabled`	Temporarily unavailable (CRD deleted, or manually disabled)	No
`superseded`	Replaced by a new registration with different content for the same `metadata.name` + `version`	No
`deprecated`	Marked for removal, still usable	No
`archived`	Permanently removed from catalog	No

State transitions via kubectl:

# Register (active)
kubectl apply -f my-workflow.yaml

# Disable (deleting the CRD disables the catalog entry)
kubectl delete remediationworkflow my-workflow

# Re-enable (re-applying a previously deleted CRD re-enables it)
kubectl apply -f my-workflow.yaml

State transitions via the DataStorage API (for advanced lifecycle management):

curl -X PATCH http://data-storage:8080/api/v1/workflows/{workflow_id}/disable
curl -X PATCH http://data-storage:8080/api/v1/workflows/{workflow_id}/enable
curl -X PATCH http://data-storage:8080/api/v1/workflows/{workflow_id}/deprecate

Content Integrity and Supersede¶

When a RemediationWorkflow CRD is applied with the same metadata.name + version as an existing active or disabled workflow, the Auth Webhook computes a content hash of the incoming schema and compares it to the existing entry:

Existing State	Content Hash	Result
`active`	Same	Idempotent return (no DB writes)
`active`	Different	Old workflow marked `superseded`, new one created as `active`
`disabled`	Same	Re-enabled
`disabled`	Different	New workflow created as `active`
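
The table reads as a four-way decision on the existing state and whether the hashes match. The sketch below restates it as a shell function for illustration; registration_outcome and its output strings are hypothetical, and the hashes are treated as opaque strings because the hashing algorithm is not described here.

```bash
#!/usr/bin/env bash
# registration_outcome STATE EXISTING_HASH INCOMING_HASH
# STATE is the existing entry's state: active or disabled.
registration_outcome() {
  local state="$1" existing_hash="$2" incoming_hash="$3"
  local cmp="different"
  if [[ "$existing_hash" == "$incoming_hash" ]]; then
    cmp="same"
  fi
  case "$state:$cmp" in
    active:same)        echo "idempotent" ;;  # no DB writes
    active:different)   echo "supersede" ;;   # old marked superseded, new created as active
    disabled:same)      echo "re-enable" ;;
    disabled:different) echo "create" ;;      # new workflow created as active
    *)                  echo "unknown"; return 1 ;;
  esac
}
```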

Version Management¶

When a new version of a workflow is registered (same metadata.name, different version), the previous version's is_latest_version flag is set to false. Only workflows with status = 'active' AND is_latest_version = true are discoverable.

This means you can register a new version and the old one is automatically excluded from discovery without needing to disable it.

Workflow Search and Scoring¶

Understanding how DataStorage filters and scores workflows is critical for authoring effective schemas. Your label and detected label choices directly affect whether the LLM ever sees your workflow.

Layer 1: Mandatory Label Filtering (WHERE clause)¶

Before scoring, DataStorage filters candidates using the mandatory labels from the schema. Workflows that fail any filter are excluded entirely -- they never reach the LLM.

Filter	Matching Rule
Severity	JSONB array `?` operator: workflow's `severity` array must contain the query value, or contain `"*"`
Component	Case-insensitive comparison: Kubernetes Kind is PascalCase (e.g., `Deployment`), workflow labels store lowercase (e.g., `deployment`)
Environment	JSONB array `?` operator with `"*"` wildcard fallback
Priority	Handles both scalar (`"P1"`) and array (`["P0","P1"]`) values with `"*"` wildcard

Additionally, only workflows with status active and is_latest_version = true pass.

Layer 2: Semantic Scoring (ORDER BY)¶

Surviving candidates are scored by infrastructure label overlap. The formula:

final_score = LEAST((5.0 + detected_boost + custom_boost - penalty) / 10.0, 1.0)

The base score is 0.5 (5.0/10.0). Boosts increase it, penalties decrease it. The LEAST clamp ensures the score never exceeds 1.0 when many labels match.

Detected label boost weights:

Label	Exact Match	Workflow Wildcard (`"*"`)	Query Wildcard
`gitOpsManaged`	+0.10	--	--
`gitOpsTool`	+0.10	+0.05	+0.05
`pdbProtected`	+0.05	--	--
`serviceMesh`	+0.05	+0.025	+0.025
`networkIsolated`	+0.03	--	--
`helmManaged`	+0.02	--	--
`stateful`	+0.02	--	--
`hpaEnabled`	+0.02	--	--

Maximum possible boost: 0.39 (all labels match exactly).

Penalty rules (high-impact only):

Condition	Penalty
Target IS GitOps-managed but workflow doesn't declare `gitOpsManaged`	-0.10
Target uses a specific GitOps tool but workflow declares a different one	-0.10

Maximum possible penalty: 0.20.

Custom labels: +0.15 per exact match, +0.075 per wildcard match.
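
To experiment with the formula, it can be evaluated with awk. This sketch takes the formula and the weights exactly as written on this page (boost and penalty totals are added inside the parentheses before dividing by 10.0); final_score is a hypothetical helper, and the real query may differ in detail.

```bash
#!/usr/bin/env bash
# final_score DETECTED_BOOST CUSTOM_BOOST PENALTY
# Evaluates LEAST((5.0 + detected_boost + custom_boost - penalty) / 10.0, 1.0).
final_score() {
  awk -v d="$1" -v c="$2" -v p="$3" 'BEGIN {
    s = (5.0 + d + c - p) / 10.0
    if (s > 1.0) s = 1.0   # LEAST(..., 1.0) clamp
    printf "%.3f\n", s
  }'
}

# Examples:
#   final_score 0 0 0        # base score, no boosts or penalties
#   final_score 0.20 0 0     # gitOpsManaged and gitOpsTool both match exactly
#   final_score 0 0 0.10     # one GitOps penalty applies
```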

What This Means for Workflow Authors¶

Detected labels strongly influence ranking. A GitOps-aware workflow (gitOpsManaged: "true", gitOpsTool: "argocd") will consistently outrank a generic one when the target is ArgoCD-managed.
Setting "*" wildcards earns half credit. Useful for broadly applicable workflows that work with any GitOps tool or service mesh.
Not declaring a detected label means "no requirement" -- no boost, no penalty (except for GitOps, which applies a penalty when the target IS GitOps-managed).
Custom labels provide fine-grained differentiation for organization-specific matching (e.g., team: payments).

Connection to Signal Processing Rego Policies¶

The Signal Processing (SP) Rego policies determine the values that feed into discovery. All classification rules live in a single policy.rego file under package signalprocessing:

The severity and priority rules produce values for Layer 1 filtering -- a misconfigured rule can silently exclude correct workflows
The environment rules produce the environment value for Layer 1 filtering
The labels rules (kubernaut.ai/label-*) produce values for Layer 2 scoring at +0.15 per match

Why business classification is not used for discovery

Workflows are reusable across organizational boundaries -- a RollbackDeployment works for any team, business unit, or SLA tier. Mandatory labels describe the technical remediation context (severity, resource type, environment). Business classification describes who owns the resource, which is orthogonal to what fix is needed. Operators who want organizational matching can use custom labels (e.g., kubernaut.ai/label-team=payments on the namespace + customLabels: {team: ["payments"]} in the schema).

Scoring Is Internal Ordering, Not Selection¶

The final_score determines the order in which workflows are presented to the LLM, but the LLM makes the final selection based on descriptions, remediation history, and context. A workflow ranked #2 by score can still be selected if its description better matches the root cause.

Demo Workflows¶

When demoContent.enabled: true (the default), the chart seeds the following demo workflows:

Workflow	Action Type
`crashloop-rollback-v1`	RollbackDeployment
`crashloop-rollback-risk-v1`	RollbackDeployment
`restart-pod-v1`	RestartPod
`rollback-deployment-v1`	RollbackDeployment
`increase-memory-limits-v1`	IncreaseMemoryLimits
`increase-memory-limits-gitops-v1`	IncreaseMemoryLimits
`graceful-restart-v1`	GracefulRestart
`git-revert-v2`	GitRevertCommit
`provision-node-v1`	ProvisionNode
`proactive-rollback-v1`	ProactiveRollback
`patch-hpa-v1`	PatchHPA
`relax-pdb-v1`	RelaxPDB
`remove-taint-v1`	RemoveTaint
`cleanup-pvc-v1`	CleanupPVC
`cordon-drain-v1`	CordonDrainNode
`fix-certificate-v1`	FixCertificate
`fix-certificate-gitops-v1`	GitRevertCommit
`helm-rollback-v1`	HelmRollback
`fix-authz-policy-v1`	FixAuthorizationPolicy
`fix-statefulset-pvc-v1`	FixStatefulSetPVC
`fix-network-policy-v1`	FixNetworkPolicy
`migrate-emptydir-to-pvc-gitops-v1`	MigrateEmptyDirToPVC

These are starting points. Operators supplement them with custom workflows using custom or existing action types.

CRUD Operations¶

RemediationWorkflow¶

Create:

kubectl apply -f my-workflow.yaml

Read:

kubectl get remediationworkflows
kubectl get remediationworkflow my-workflow -o yaml

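Update (spec fields are immutable after creation, so a change is registered as a new catalog entry rather than an in-place edit):

Re-apply the CRD with the same metadata.name + version but different content. The old workflow is marked superseded and a new one is created as active. To release a new version instead, change version; see Version Management.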

Delete (disables in catalog):

kubectl delete remediationworkflow my-workflow

ActionType¶

Create:

kubectl apply -f my-actiontype.yaml

Read:

kubectl get actiontypes
kubectl get actiontype my-actiontype -o yaml

Update (description field is mutable):

Edit the CRD and re-apply:

kubectl apply -f my-actiontype.yaml

Delete (disables in taxonomy):

kubectl delete actiontype my-actiontype

Re-applying a previously deleted ActionType CRD re-enables it with the previous workflow associations intact.

Next Steps¶

Investigation Pipeline -- How the LLM discovers and selects workflows
Human Approval -- When workflows require approval before execution
Effectiveness Monitoring -- How outcomes are evaluated
Architecture: Workflow Execution -- Deep-dive into the execution engine