Remediation Workflows

Kubernaut remediates issues by running workflows -- containerized actions that fix known problems. Workflows are registered as RemediationWorkflow CRDs, synced to a searchable catalog by the Auth Webhook, and matched to incidents by the LLM based on labels, infrastructure context, and remediation history.

This page covers everything you need to author, build, register, and manage workflows.

Registration Model

Workflows are registered by applying a RemediationWorkflow CRD. The Auth Webhook intercepts the admission request, registers the workflow in the DataStorage catalog, captures the operator identity for audit attribution, and computes a content hash for deduplication.

flowchart LR
    Op["Operator<br/><small>kubectl apply</small>"] --> AW["Auth Webhook<br/><small>admission</small>"]
    AW --> DS["DataStorage<br/><small>Catalog</small>"]
    Bundle["Execution Bundle<br/><small>OCI image / Git repo</small>"] --> WFE["WFE<br/><small>Job / Tekton / Ansible</small>"]
    DS -.->|"bundle ref"| WFE

| Component | Contents | Purpose |
|---|---|---|
| RemediationWorkflow CRD | Workflow schema (version, description, labels, parameters, execution config) | Registered in DataStorage catalog for discovery and LLM selection |
| Execution bundle | The container or playbook that runs the remediation | Referenced in the CRD; pulled by WFE at execution time |

The CRD approach replaces the previous OCI schema image model. Workflow schemas are now native Kubernetes resources, enabling kubectl management, GitOps workflows, and admission webhook integration for audit attribution.

Namespace placement

  • RemediationWorkflow and ActionType CRDs must be applied in kubernaut-system (or your configured platform namespace).
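  • The per-workflow ServiceAccount goes in kubernaut-workflows (the execution namespace). The ClusterRole and ClusterRoleBinding that grant it permissions are cluster-scoped, so they carry no namespace; the binding references the ServiceAccount in kubernaut-workflows.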

All production workflows in the kubernaut-demo-scenarios repository follow this convention.

Create Your First Workflow

This tutorial walks through creating a workflow that restarts a deployment.

Step 1: Write the Schema

Create restart-deployment.yaml. Production workflows ship as multi-document YAML: a per-workflow ServiceAccount with scoped RBAC, followed by the RemediationWorkflow CRD.

# --- Per-workflow ServiceAccount and RBAC (in the execution namespace) ---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: restart-deployment-v1-runner
  namespace: kubernaut-workflows
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubernaut:workflow:restart-deployment-v1
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
  - apiGroups: ["apps"]
    resources: ["deployments/status", "replicasets"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubernaut:workflow:restart-deployment-v1
subjects:
  - kind: ServiceAccount
    name: restart-deployment-v1-runner
    namespace: kubernaut-workflows
roleRef:
  kind: ClusterRole
  name: kubernaut:workflow:restart-deployment-v1
  apiGroup: rbac.authorization.k8s.io
---
# --- RemediationWorkflow CRD (in the platform namespace) ---
apiVersion: kubernaut.ai/v1alpha1
kind: RemediationWorkflow
metadata:
  name: restart-deployment-v1
  namespace: kubernaut-system
spec:
  version: "1.0.0"
  description:
    what: "Performs a rolling restart of a deployment to clear corrupted runtime state"
    whenToUse: "When pods are in a degraded state but the deployment spec is correct"
    whenNotToUse: "When the issue is caused by a bad image or config change"
    preconditions: "The deployment exists and has at least one ready replica"
  maintainers:
    - name: "Platform Team"
      email: "platform@example.com"

  actionType: RestartDeployment

  labels:
    severity: [critical, high, medium]
    environment: [production, staging, development, "*"]
    component: [deployment]
    priority: "*"

  detectedLabels:
    helmManaged: "true"

  execution:
    engine: job
    bundle: registry.example.com/workflows/restart-deployment@sha256:abc123...
    serviceAccountName: restart-deployment-v1-runner

  parameters:
    - name: TARGET_RESOURCE_NAME
      type: string
      required: true
      description: "Name of the root managing resource (KA-injected)"
    - name: TARGET_RESOURCE_KIND
      type: string
      required: true
      description: "Kind of the root managing resource (KA-injected)"
    - name: TARGET_RESOURCE_NAMESPACE
      type: string
      required: true
      description: "Namespace of the root managing resource (KA-injected)"
    - name: TARGET_DEPLOYMENT
      type: string
      required: true
      description: "Name of the deployment to restart"

  dependencies:
    secrets: []
    configMaps: []

Per-workflow ServiceAccounts

Each workflow should declare its own ServiceAccount with least-privilege RBAC scoped to the resources it needs. The ServiceAccount goes in the execution namespace (kubernaut-workflows); its ClusterRole and ClusterRoleBinding are cluster-scoped and reference that ServiceAccount. The RemediationWorkflow CRD itself goes in the platform namespace (kubernaut-system). See the kubernaut-demo-scenarios repository for complete examples.

Step 2: Write the Remediation Script

Create remediate.sh:

#!/bin/bash
set -euo pipefail

echo "Validating deployment exists..."
kubectl get deployment "$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" || {
  echo "ERROR: Deployment not found"
  exit 1
}

echo "Performing rolling restart..."
kubectl rollout restart deployment/"$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE"

echo "Waiting for rollout to complete..."
kubectl rollout status deployment/"$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" --timeout=120s

echo "Verifying deployment health..."
# .status.readyReplicas is omitted when no replica is ready, so default to 0.
READY=$(kubectl get deployment "$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" -o jsonpath='{.status.readyReplicas}')
READY="${READY:-0}"
DESIRED=$(kubectl get deployment "$TARGET_DEPLOYMENT" -n "$TARGET_RESOURCE_NAMESPACE" -o jsonpath='{.spec.replicas}')

if [ "$READY" = "$DESIRED" ]; then
  echo "SUCCESS: All $READY/$DESIRED replicas ready"
else
  echo "WARNING: Only $READY/$DESIRED replicas ready"
  exit 1
fi

Step 3: Build the Execution Bundle

Create Containerfile.exec:

FROM registry.access.redhat.com/ubi9/ubi-minimal:latest
RUN microdnf install -y tar gzip && microdnf clean all
COPY --from=bitnami/kubectl:latest /opt/bitnami/kubectl/bin/kubectl /usr/local/bin/kubectl
COPY remediate.sh /scripts/remediate.sh
RUN chmod +x /scripts/remediate.sh
USER 1001
ENTRYPOINT ["/scripts/remediate.sh"]

Build and push:

podman build -f Containerfile.exec -t registry.example.com/workflows/restart-deployment:v1.0.0 .
podman push registry.example.com/workflows/restart-deployment:v1.0.0

Note the image digest from the push output -- update the execution.bundle field in the CRD with the digest-pinned reference.

Step 4: Register the Workflow

kubectl apply -f restart-deployment.yaml

The Auth Webhook intercepts the CREATE request, registers the workflow in the DataStorage catalog, captures the operator identity for audit attribution, and updates the CRD status with the assigned workflowId and catalogStatus.

Verify registration:

kubectl get remediationworkflow restart-deployment-v1 -n kubernaut-system -o wide

Create Your First Ansible Workflow

This tutorial walks through creating an Ansible-based workflow that fixes application configuration drift via a GitOps commit. It mirrors the Job tutorial above but uses AWX/AAP as the execution engine.

Step 1: Set Up the AWX Job Template

Create a Job Template in your AWX/AAP instance:

  • Name: kubernaut-fix-config-drift (this becomes engineConfig.jobTemplateName)
  • Project: Point to a Git repository containing your playbooks (AWX syncs the repo via its SCM credential)
  • Playbook: Select the playbook path (e.g., playbooks/fix-config-drift.yml)
  • Execution Environment: Use an EE that includes the kubernetes.core and community.general collections

Step 2: Write the Playbook

Create playbooks/fix-config-drift.yml in your playbook repository:

---
- name: Fix configuration drift via GitOps commit
  hosts: localhost
  connection: local
  gather_facts: false

  tasks:
    - name: Read Git credentials from AWX credential env vars
      ansible.builtin.set_fact:
        git_username: "{{ lookup('env', 'KUBERNAUT_SECRET_GIT_REPO_CREDS_USERNAME') }}"
        git_password: "{{ lookup('env', 'KUBERNAUT_SECRET_GIT_REPO_CREDS_PASSWORD') }}"
      no_log: true

    - name: Clone application config repository
      ansible.builtin.git:
        repo: "https://{{ git_username }}:{{ git_password }}@github.com/org/app-config.git"
        dest: /tmp/app-config
        version: main
      no_log: true

    - name: Apply corrected configuration
      ansible.builtin.template:
        src: templates/deployment-config.yml.j2
        dest: "/tmp/app-config/{{ TARGET_RESOURCE_NAMESPACE }}/{{ TARGET_RESOURCE_NAME }}/config.yml"

    - name: Commit and push fix
      ansible.builtin.shell: |
        cd /tmp/app-config
        git add -A
        git commit -m "fix({{ TARGET_RESOURCE_NAMESPACE }}): correct config drift for {{ TARGET_RESOURCE_NAME }} [RR: {{ RR_NAME }}]"
        git push origin main
      no_log: true

Key points:

  • Secrets are accessed via lookup('env', 'KUBERNAUT_SECRET_...') -- always use no_log: true on tasks that handle credentials
  • RR_NAME is auto-injected by the executor and used here to link the Git commit back to the remediation event
  • kubernetes.core modules authenticate automatically via the injected K8S_AUTH_* credentials

Step 3: Write the Schema

Create fix-config-drift.yaml with the SA + RBAC + CRD multi-doc pattern:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: fix-config-drift-v1-runner
  namespace: kubernaut-workflows
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: kubernaut:workflow:fix-config-drift-v1
rules:
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["configmaps"]
    verbs: ["get", "list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kubernaut:workflow:fix-config-drift-v1
subjects:
  - kind: ServiceAccount
    name: fix-config-drift-v1-runner
    namespace: kubernaut-workflows
roleRef:
  kind: ClusterRole
  name: kubernaut:workflow:fix-config-drift-v1
  apiGroup: rbac.authorization.k8s.io
---
apiVersion: kubernaut.ai/v1alpha1
kind: RemediationWorkflow
metadata:
  name: fix-config-drift-v1
  namespace: kubernaut-system
spec:
  version: "1.0.0"
  description:
    what: "Fixes application configuration drift by committing the correct state to the source Git repository"
    whenToUse: "When config drift is detected and the correct state is defined in a GitOps-managed repository"
    whenNotToUse: "When the drift is intentional or the application is not GitOps-managed"
    preconditions: "The deployment exists, the config repository is accessible, and Git credentials are available"
  maintainers:
    - name: "Platform Team"
      email: "platform@example.com"

  actionType: FixConfiguration

  labels:
    severity: [high, medium]
    environment: [production, staging]
    component: [deployment]
    priority: "*"

  detectedLabels:
    gitOpsManaged: "true"

  execution:
    engine: ansible
    bundle: https://github.com/org/remediation-playbooks.git
    bundleDigest: b7e6a135be2019f995cb4875dbc0116dfda39d21
    serviceAccountName: fix-config-drift-v1-runner
    engineConfig:
      playbookPath: "playbooks/fix-config-drift.yml"
      jobTemplateName: "kubernaut-fix-config-drift"

  parameters:
    - name: TARGET_RESOURCE_NAME
      type: string
      required: true
      description: "Name of the root managing resource (KA-injected)"
    - name: TARGET_RESOURCE_KIND
      type: string
      required: true
      description: "Kind of the root managing resource (KA-injected)"
    - name: TARGET_RESOURCE_NAMESPACE
      type: string
      required: true
      description: "Namespace of the root managing resource (KA-injected)"

  dependencies:
    secrets:
      - name: git-repo-creds
    configMaps: []

bundleDigest is the Git commit SHA

Job and Tekton workflows pin the bundle with an OCI image digest. Ansible workflows instead pin the exact playbook version with a Git commit SHA. Update bundleDigest whenever you push playbook changes so that Kubernaut runs the version you registered.

Step 4: Create the Secret

The workflow declares git-repo-creds as a dependency. Create it in the execution namespace:

kubectl create secret generic git-repo-creds \
  -n kubernaut-workflows \
  --from-literal=username=kubernaut-bot \
  --from-literal=password=ghp_xxxxxxxxxxxx

The executor reads this Secret, creates an ephemeral AWX credential, and injects the values as KUBERNAUT_SECRET_GIT_REPO_CREDS_USERNAME and KUBERNAUT_SECRET_GIT_REPO_CREDS_PASSWORD environment variables into the Execution Environment. The ephemeral credential is deleted after the AWX job completes.

Step 5: Register the Workflow

kubectl apply -f fix-config-drift.yaml

Verify registration:

kubectl get remediationworkflow fix-config-drift-v1 -n kubernaut-system -o wide

The CATALOGSTATUS column should show Active once the Auth Webhook completes registration. If it shows Pending, the webhook is still processing. If it shows Invalid, check the workflow spec for errors (e.g., unreachable bundle URL).

Schema Reference

For the complete field specification, see RemediationWorkflow and ActionType in the CRD Reference.

Labels

Mandatory labels control when a workflow is eligible during discovery:

| Label | Type | Required | Description |
|---|---|---|---|
| severity | string[] | Yes | Severity levels: critical, high, medium, low (array, minItems: 1) |
| environment | string[] | Yes | Environments: production, staging, development, test, or "*" (array, minItems: 1) |
| component | string[] | Yes | Resource kind(s): pod, deployment, node, or "*" (array, minItems: 1) |
| priority | string | Yes | Priority: P0, P1, P2, P3, or "*" (single value) |

Labels support:

  • Exact match -- component: [deployment]
  • Wildcard -- component: ["*"] (matches any value)
  • Multi-value -- severity: [critical, high] (matches either)

Labels determine discoverability

Workflows that don't match the mandatory label filters are excluded entirely -- they never reach the LLM. A misconfigured severity or environment can silently hide a workflow from the candidate set. See Workflow Search and Scoring for details.

Workflow display name

FormatWorkflowDisplay(actionType, workflowName) returns ActionType:WorkflowName for user-visible strings. The runtime looks up a friendly label through DataStorage via ResolveWorkflowDisplay (catalog lookup by workflow identity). If the resolver is nil or DataStorage returns no row, the fallback is the raw workflow UUID (no ActionType: prefix).

Detected Labels

Optional infrastructure-awareness labels that influence scoring and help the LLM select the right workflow for the target environment:

detectedLabels:
  gitOpsManaged: "true"
  gitOpsTool: "argocd"      # argocd | flux | "*"
  helmManaged: "true"
  pdbProtected: "true"
  hpaEnabled: "true"
  stateful: "true"
  networkIsolated: "true"
  serviceMesh: "istio"       # istio | linkerd | "*"

| Label | Type | Valid Values |
|---|---|---|
| gitOpsManaged | boolean | "true" only |
| gitOpsTool | string | argocd, flux, "*" |
| pdbProtected | boolean | "true" only |
| hpaEnabled | boolean | "true" only |
| stateful | boolean | "true" only |
| helmManaged | boolean | "true" only |
| networkIsolated | boolean | "true" only |
| serviceMesh | string | istio, linkerd, "*" |

Workflows that declare detected labels earn scoring boosts when the target resource matches -- see Workflow Search and Scoring.

Bundle Digest Format

For job and tekton engines, the execution.bundle field must use a digest-pinned OCI reference:

registry.example.com/repo/image@sha256:<64 hex characters>

  • Must contain @ (tag-only references are rejected)
  • Must use sha256: algorithm
  • Digest must be exactly 64 hex characters
  • An optional tag before @ is allowed: image:v1.0.0@sha256:abc123...

For the ansible engine, the execution.bundle field is a Git repository URL, and execution.bundleDigest is the Git commit SHA.
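The OCI reference rules above can be checked locally before a CRD is applied. The following is a minimal sketch, not the Auth Webhook's actual validator: the helper name is_digest_pinned is hypothetical, and restricting the digest to lowercase hex is an assumption.

```python
import re

# Hypothetical helper mirroring the documented rules; the real webhook
# validation may be stricter (for example, about registry or tag syntax).
_DIGEST_PINNED = re.compile(
    r"[^@\s]+"       # registry/repo/image, with an optional :tag
    r"@sha256:"      # '@' is mandatory and the algorithm must be sha256
    r"[0-9a-f]{64}"  # exactly 64 hex characters (lowercase is an assumption)
)


def is_digest_pinned(bundle: str) -> bool:
    """Return True if bundle has the form <image>[:tag]@sha256:<64 hex>."""
    return _DIGEST_PINNED.fullmatch(bundle) is not None


if __name__ == "__main__":
    digest = "0123456789abcdef" * 4  # placeholder 64-character digest
    print(is_digest_pinned(f"registry.example.com/repo/image@sha256:{digest}"))
    print(is_digest_pinned("registry.example.com/repo/image:v1.0.0"))
```

Running a check like this in CI catches tag-only references before the registration is rejected at admission time.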

Dependencies

Workflows can declare Secrets and ConfigMaps that must exist in the execution namespace:

dependencies:
  secrets:
    - name: registry-credentials
  configMaps:
    - name: app-config

The Workflow Execution controller validates that these resources exist before creating the execution resource. How they are delivered to the workflow depends on the engine:

Kubernetes Jobs:

  • Secrets mounted at /run/kubernaut/secrets/<name> (read-only)
  • ConfigMaps mounted at /run/kubernaut/configmaps/<name> (read-only)

Tekton Pipelines:

  • Secrets bound as secret-<name> workspaces
  • ConfigMaps bound as configmap-<name> workspaces

Ansible (AWX/AAP):

  • For each Secret in dependencies.secrets, the executor reads the Kubernetes Secret, dynamically creates an AWX credential type with an env injector, creates an ephemeral AWX credential, and attaches it to the job launch. AWX injects the values as environment variables into the Execution Environment:

    KUBERNAUT_SECRET_{SECRET_NAME}_{KEY}
    

    For example, a Secret named gitea-repo-creds with keys username and password becomes:

    • KUBERNAUT_SECRET_GITEA_REPO_CREDS_USERNAME
    • KUBERNAUT_SECRET_GITEA_REPO_CREDS_PASSWORD

    Ephemeral credentials are automatically deleted after the AWX job completes or is cleaned up. The Kubernetes Secret remains the single source of truth -- if it changes, the next execution picks up the new values.

  • For each ConfigMap in dependencies.configMaps, the executor reads the Kubernetes ConfigMap and merges its data into AWX extra_vars with a standardized prefix:

    KUBERNAUT_CONFIGMAP_{CONFIGMAP_NAME}_{KEY}
    

    For example, a ConfigMap named app-settings with keys timeout and log-level becomes extra_vars:

    • KUBERNAUT_CONFIGMAP_APP_SETTINGS_TIMEOUT
    • KUBERNAUT_CONFIGMAP_APP_SETTINGS_LOG_LEVEL

    ConfigMap data is non-sensitive, so it uses AWX extra_vars (not credentials). The playbook accesses these as standard Ansible variables.
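The naming convention for Ansible dependencies can be sketched as follows, which is handy when working out which variable a playbook should read. The function names are hypothetical, and the normalization rule (upper-case, every non-alphanumeric character replaced by an underscore) is an assumption inferred from the two examples above rather than a documented contract.

```python
import re


def _normalize(part: str) -> str:
    # Assumption: every character outside [A-Za-z0-9] becomes "_" and
    # letters are upper-cased, which reproduces the documented examples.
    return re.sub(r"[^A-Za-z0-9]", "_", part).upper()


def secret_env_var(secret_name: str, key: str) -> str:
    """Environment variable name for one key of a dependency Secret."""
    return f"KUBERNAUT_SECRET_{_normalize(secret_name)}_{_normalize(key)}"


def configmap_extra_var(configmap_name: str, key: str) -> str:
    """extra_vars name for one key of a dependency ConfigMap."""
    return f"KUBERNAUT_CONFIGMAP_{_normalize(configmap_name)}_{_normalize(key)}"


if __name__ == "__main__":
    print(secret_env_var("gitea-repo-creds", "username"))
    print(configmap_extra_var("app-settings", "log-level"))
```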

Security: Use no_log: true for sensitive Ansible tasks

When writing Ansible playbooks that handle secrets (credentials, tokens, passwords), always set no_log: true on tasks that read or use sensitive values. This prevents AWX from recording secret data in job output logs:

- name: Read Git credentials from AWX credential env vars
  ansible.builtin.set_fact:
    git_username: "{{ lookup('env', 'KUBERNAUT_SECRET_GITEA_REPO_CREDS_USERNAME') }}"
    git_password: "{{ lookup('env', 'KUBERNAUT_SECRET_GITEA_REPO_CREDS_PASSWORD') }}"
  no_log: true

Tasks to protect include: reading credentials from environment variables, building authenticated URLs, cloning repositories with embedded credentials, and any task that passes secrets as arguments.

Parameters

Parameters use UPPER_SNAKE_CASE names and are injected as environment variables.

| Field | Type | Description |
|---|---|---|
| name | string | Parameter name in UPPER_SNAKE_CASE |
| type | string | One of: string, integer, boolean, float, array |
| required | boolean | Whether the parameter must be provided |
| description | string | Shown to the LLM during get_workflow -- write it clearly enough for the LLM to populate the value from its investigation findings |
| default | JSON | Default value (type must match type field) |
| enum | string[] | Allowed values (validated at execution time) |
| pattern | string | Regex validation for string parameters |
| minimum | number | Minimum value for integer/float parameters |
| maximum | number | Maximum value for integer/float parameters |
| dependsOn | string[] | Names of other parameters this one depends on |

For the complete parameter schema, see RemediationWorkflow in the CRD Reference.
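To illustrate how the constraint fields interact, here is a rough sketch of a checker for a single parameter. It is not Kubernaut's validation code: the function name validate_parameter is hypothetical, and treating pattern as a full-string match is an assumption.

```python
import re

# UPPER_SNAKE_CASE: upper-case words separated by single underscores.
_NAME = re.compile(r"[A-Z][A-Z0-9]*(?:_[A-Z0-9]+)*")


def validate_parameter(spec: dict, value=None) -> list:
    """Return a list of problems; an empty list means the value is acceptable."""
    name = spec["name"]
    problems = []
    if not _NAME.fullmatch(name):
        problems.append(f"{name}: name is not UPPER_SNAKE_CASE")
    if value is None:
        value = spec.get("default")  # fall back to the declared default
    if value is None:
        if spec.get("required"):
            problems.append(f"{name}: required but not provided")
        return problems
    if "enum" in spec and str(value) not in spec["enum"]:
        problems.append(f"{name}: {value!r} is not one of {spec['enum']}")
    if "pattern" in spec and isinstance(value, str):
        # Assumption: the pattern must match the whole value.
        if not re.fullmatch(spec["pattern"], value):
            problems.append(f"{name}: {value!r} does not match {spec['pattern']}")
    if isinstance(value, (int, float)) and not isinstance(value, bool):
        if "minimum" in spec and value < spec["minimum"]:
            problems.append(f"{name}: {value} is below minimum {spec['minimum']}")
        if "maximum" in spec and value > spec["maximum"]:
            problems.append(f"{name}: {value} is above maximum {spec['maximum']}")
    return problems


if __name__ == "__main__":
    grace = {"name": "GRACE_PERIOD_SECONDS", "type": "integer",
             "required": True, "minimum": 0, "maximum": 300}
    print(validate_parameter(grace, 30))
    print(validate_parameter(grace, 900))
```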

Canonical target resource parameters

Every workflow schema must declare these three parameters as required: true:

| Parameter | Description |
|---|---|
| TARGET_RESOURCE_NAME | Name of the root managing resource |
| TARGET_RESOURCE_KIND | Kind of the root managing resource (e.g., Deployment) |
| TARGET_RESOURCE_NAMESPACE | Namespace of the root managing resource |

These are KA-injected -- Kubernaut Agent derives them from the K8s-verified root_owner (resolved via the Pod → ReplicaSet → Deployment owner chain) and injects them into selected_workflow.parameters before the AIAnalysis completes. The LLM never sees or populates these fields (they are stripped from the schema before the LLM receives it).

If Kubernaut Agent cannot determine the root_owner (e.g., the resource context tools were never called), the investigation is flagged rca_incomplete with needs_human_review=true.

Additionally, the WFE controller injects TARGET_RESOURCE (composite format namespace/kind/name) from wfe.Spec.TargetResource into every Job and Tekton PipelineRun as a system variable.
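A workflow that prefers the composite TARGET_RESOURCE value over the three separate variables has to split it itself. A minimal sketch, assuming the value always has exactly three non-empty, slash-separated parts (how cluster-scoped resources are encoded is not covered here); parse_target_resource is a hypothetical helper name:

```python
from typing import NamedTuple


class TargetResource(NamedTuple):
    namespace: str
    kind: str
    name: str


def parse_target_resource(value: str) -> TargetResource:
    """Split a namespace/kind/name string into its three components."""
    parts = value.split("/")
    # Assumption: exactly three non-empty segments are always present.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected namespace/kind/name, got {value!r}")
    return TargetResource(*parts)


if __name__ == "__main__":
    target = parse_target_resource("payments/Deployment/api")
    print(target.namespace, target.kind, target.name)
```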

Workflows may also declare additional operational parameters (e.g., TARGET_DEPLOYMENT, GRACE_PERIOD_SECONDS) that the LLM populates from its investigation findings.

Ansible auto-injected variables

For ansible executions, the executor also auto-injects remediation context variables into AWX extra_vars (BR-WE-015 TR-6):

| Variable | Source | Purpose |
|---|---|---|
| WFE_NAME | WorkflowExecution CRD name | Query WFE status, parameters, or execution metadata via the Kubernetes API |
| WFE_NAMESPACE | WorkflowExecution CRD namespace | Namespace of the WFE |
| RR_NAME | wfe.Spec.RemediationRequestRef.Name | Reference the parent RemediationRequest in commit messages, logs, and audit annotations -- no Kubernetes API lookup needed |
| RR_NAMESPACE | wfe.Spec.RemediationRequestRef.Namespace | Namespace of the parent RR |

These variables are injected by the executor and must not be declared as parameters in the workflow schema. RR_NAME is the most commonly used -- for example, the GitOps memory limits playbook includes the RR name in its Git commit message to link the code change back to the remediation event.

Execution Engines

Kubernetes Jobs

Single-step remediations run as Kubernetes Jobs in the kubernaut-workflows namespace:

execution:
  engine: job
  bundle: registry.example.com/workflows/restart-deployment@sha256:abc123...

The Workflow Execution controller creates a Job with:

  • Environment variables -- All parameters (including the three canonical TARGET_RESOURCE_* params) injected as env vars, plus the system-injected TARGET_RESOURCE
  • Dependency mounts -- Secrets at /run/kubernaut/secrets/<name>, ConfigMaps at /run/kubernaut/configmaps/<name>
  • ServiceAccount -- Per-workflow SA from execution.serviceAccountName (recommended). Falls back to the execution namespace default SA if omitted.

Tekton Pipelines

Multi-step remediations use Tekton Pipelines:

execution:
  engine: tekton
  bundle: registry.example.com/tekton-bundles/oom-recovery:v1.0.0@sha256:abc123...

The bundle must contain a Tekton Pipeline named workflow. The controller creates a PipelineRun with:

  • Tekton bundle resolver -- The bundle is referenced via resolver: bundles with the digest-pinned image
  • Parameters -- All parameters (including the three canonical TARGET_RESOURCE_* params) injected as Tekton params, plus the system-injected TARGET_RESOURCE
  • Dependency workspaces -- Secrets as secret-<name> workspace bindings, ConfigMaps as configmap-<name> workspace bindings

Tekton provides step ordering, retries, and artifact passing between steps.

Ansible (AWX/AAP)

Workflows that run Ansible playbooks via AWX or Ansible Automation Platform (AAP) use the ansible engine:

apiVersion: kubernaut.ai/v1alpha1
kind: RemediationWorkflow
metadata:
  name: ansible-fix-config
  namespace: kubernaut-system
spec:
  version: "1.0.0"
  description:
    what: "Fixes application configuration drift using Ansible"
    whenToUse: "When config drift is detected and the correct state is in a Git repository"
  actionType: FixConfiguration
  labels:
    severity: [high, medium]
    environment: [production, staging]
    component: [deployment]
    priority: "*"
  execution:
    engine: ansible
    bundle: https://github.com/org/remediation-playbooks.git
    bundleDigest: b7e6a135be2019f995cb4875dbc0116dfda39d21
    serviceAccountName: ansible-fix-config-runner
    engineConfig:
      playbookPath: "playbooks/fix-config.yml"
      jobTemplateName: "kubernaut-fix-config"
  parameters:
    - name: TARGET_RESOURCE_NAME
      type: string
      required: true
      description: "Name of the root managing resource (KA-injected)"
    - name: TARGET_RESOURCE_KIND
      type: string
      required: true
      description: "Kind of the root managing resource (KA-injected)"
    - name: TARGET_RESOURCE_NAMESPACE
      type: string
      required: true
      description: "Namespace of the root managing resource (KA-injected)"

The engineConfig fields for Ansible:

| Field | Required | Description |
|---|---|---|
| playbookPath | Yes | Path to the playbook within the Git repository |
| jobTemplateName | Yes | AWX/AAP Job Template name to launch |
| inventoryName | No | AWX/AAP inventory to use |

The Workflow Execution controller launches the AWX job template, passes parameters as extra variables, and monitors the job status until completion.

Dependencies in Ansible workflows:

  • Secrets (dependencies.secrets): Injected as environment variables via ephemeral AWX credentials (KUBERNAUT_SECRET_{NAME}_{KEY}). Use lookup('env', ...) in your playbook.
  • ConfigMaps (dependencies.configMaps): Merged into AWX extra_vars (KUBERNAUT_CONFIGMAP_{NAME}_{KEY}). Access as standard Ansible variables.

Automatic K8s API credentials:

The executor automatically injects the WFE controller's in-cluster ServiceAccount token as an ephemeral AWX credential on every job launch. Playbooks using kubernetes.core modules receive K8S_AUTH_HOST, K8S_AUTH_API_KEY, and K8S_AUTH_SSL_CA_CERT environment variables without manual credential configuration. The credential is deleted after the job completes. If the in-cluster environment is unavailable, the job proceeds without K8s credentials.

For a complete end-to-end walkthrough, see Create Your First Ansible Workflow.

Action Type Taxonomy

Action types form the vocabulary the LLM uses to reason about remediation. Each action type has a structured description (what, whenToUse, whenNotToUse, preconditions) that the LLM reads during the list_available_actions step.

Registering Action Types

The taxonomy is user-extensible. Operators register custom action types by applying an ActionType CRD:

apiVersion: kubernaut.ai/v1alpha1
kind: ActionType
metadata:
  name: restart-sidecar
  namespace: kubernaut-system
spec:
  name: RestartSidecar
  description:
    what: "Restart only the sidecar container without affecting the main application"
    whenToUse: "When a service mesh sidecar is in a degraded state but the main container is healthy"
    whenNotToUse: "When the main application container is also failing"
    preconditions: "The pod has a sidecar container identified by the service mesh annotation"

Save the manifest as restart-sidecar-actiontype.yaml and apply it:

kubectl apply -f restart-sidecar-actiontype.yaml

The Auth Webhook intercepts the CREATE, registers the action type in the DataStorage taxonomy, and captures the operator identity for audit attribution. Deleting the CRD disables the action type (soft delete). Re-applying a previously deleted CRD re-enables the existing entry.

whenNotToUse and preconditions are optional but strongly recommended

The whenNotToUse and preconditions fields are optional on both ActionType and RemediationWorkflow descriptions. However, providing them significantly improves LLM selection quality by giving the model explicit exclusion criteria and validation requirements.

Action type descriptions directly affect LLM behavior

The LLM reads what, whenToUse, whenNotToUse, and preconditions verbatim during workflow discovery. Poorly written or overlapping descriptions degrade workflow selection quality:

  • Write clear, unambiguous descriptions so the LLM can distinguish between action types
  • Avoid semantic collisions -- action types that overlap in meaning (e.g., RestartPod vs RecyclePod) will confuse the LLM
  • Use PascalCase naming consistent with existing demo types
  • Action types are intentionally stable -- they should not change frequently during a deployment's lifecycle

Workflow Lifecycle

Workflows have seven lifecycle states:

| State | Description | Discoverable |
|---|---|---|
| Pending | Initial state before Auth Webhook registration completes | No |
| Active | Available for selection | Yes (if is_latest_version) |
| Invalid | Registration failed (e.g., execution bundle image not found in registry) | No |
| Disabled | Temporarily unavailable (CRD deleted, or manually disabled) | No |
| Superseded | Replaced by a new registration with different content for the same metadata.name + version | No |
| Deprecated | Marked for removal, still usable | No |
| Archived | Permanently removed from catalog | No |

State transitions via kubectl:

# Register (active)
kubectl apply -f my-workflow.yaml

# Disable (deleting the CRD disables the catalog entry)
kubectl delete remediationworkflow my-workflow -n kubernaut-system

# Re-enable (re-applying a previously deleted CRD re-enables it)
kubectl apply -f my-workflow.yaml

State transitions via the DataStorage API (for advanced lifecycle management):

curl -X PATCH https://data-storage:8080/api/v1/workflows/{workflow_id}/disable
curl -X PATCH https://data-storage:8080/api/v1/workflows/{workflow_id}/enable
curl -X PATCH https://data-storage:8080/api/v1/workflows/{workflow_id}/deprecate

Content Integrity and Supersede

When a RemediationWorkflow CRD is applied with the same metadata.name + version as an existing active or disabled catalog entry, the Auth Webhook computes a content hash of the incoming schema and compares it to the existing entry:

| Existing State | Content Hash | Result |
|---|---|---|
| active | Same | Idempotent return (no DB writes) |
| active | Different | Old workflow marked superseded, new one created as active |
| disabled | Same | Re-enabled |
| disabled | Different | New workflow created as active |
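The decision table above can be read as a small function. The sketch below is illustrative only: the hashing scheme (SHA-256 over canonical JSON) is an assumption, since the actual content hash algorithm is not specified here, and the function names and outcome strings are hypothetical labels for the four table rows.

```python
import hashlib
import json


def content_hash(spec: dict) -> str:
    # Assumption: hash a canonical JSON encoding so key order does not matter.
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def registration_outcome(existing_state: str, existing_hash: str, incoming_hash: str) -> str:
    """Map (existing state, hash comparison) to the documented result."""
    same = existing_hash == incoming_hash
    if existing_state == "active":
        return "idempotent" if same else "supersede_and_create_active"
    if existing_state == "disabled":
        return "re_enable" if same else "create_active"
    raise ValueError(f"state {existing_state!r} is not covered by the table")


if __name__ == "__main__":
    old = content_hash({"version": "1.0.0", "actionType": "RestartDeployment"})
    new = content_hash({"version": "1.0.0", "actionType": "RollbackDeployment"})
    print(registration_outcome("active", old, old))
    print(registration_outcome("active", old, new))
```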

Version Management

When a new version of a workflow is registered (same metadata.name, different version), the previous version's is_latest_version flag is set to false. Only workflows with status = 'active' AND is_latest_version = true are discoverable.

This means you can register a new version and the old one is automatically excluded from discovery without needing to disable it.
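A toy in-memory model makes the effect of the is_latest_version flag concrete. This is a sketch under simplifying assumptions: the catalog is a plain list of dicts, the most recently registered version is always treated as the latest, and the helper names are hypothetical.

```python
def register_version(catalog: list, name: str, version: str) -> None:
    """Add a new version and clear is_latest_version on older ones."""
    for wf in catalog:
        if wf["name"] == name:
            wf["is_latest_version"] = False  # status is left untouched
    catalog.append({
        "name": name,
        "version": version,
        "status": "active",
        "is_latest_version": True,
    })


def discoverable(catalog: list) -> list:
    """Only active workflows flagged as the latest version are returned."""
    return [wf for wf in catalog
            if wf["status"] == "active" and wf["is_latest_version"]]


if __name__ == "__main__":
    catalog = []
    register_version(catalog, "restart-deployment-v1", "1.0.0")
    register_version(catalog, "restart-deployment-v1", "1.1.0")
    print([wf["version"] for wf in discoverable(catalog)])
```

The older entry keeps status active in this model; it drops out of discovery only because its flag is cleared.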

Workflow Search and Scoring

Understanding how DataStorage filters and scores workflows is critical for authoring effective schemas. Your label and detected label choices directly affect whether the LLM ever sees your workflow.

Layer 1: Mandatory Label Filtering (WHERE clause)

Before scoring, DataStorage filters candidates using the mandatory labels from the schema. Workflows that fail any filter are excluded entirely -- they never reach the LLM.

| Filter | Matching Rule |
|---|---|
| Severity | JSONB array ? operator: workflow's severity array must contain the query value, or contain "*" |
| Component | Case-insensitive comparison: Kubernetes Kind is PascalCase (e.g., Deployment), workflow labels store lowercase (e.g., deployment) |
| Environment | JSONB array ? operator with "*" wildcard fallback |
| Priority | String match (single value) with "*" wildcard |

Additionally, only active + is_latest_version = true workflows pass.
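The filter rules can be approximated in a few lines, which helps when debugging why a workflow never shows up as a candidate. This is a sketch of the matching semantics only, not the SQL that DataStorage runs; the function names are hypothetical, and applying the "*" wildcard to component in the same way as the other array labels is an assumption.

```python
def _array_matches(workflow_values: list, query_value: str) -> bool:
    # Mirrors "array contains the query value, or contains '*'".
    return query_value in workflow_values or "*" in workflow_values


def passes_layer1(labels: dict, severity: str, environment: str,
                  component: str, priority: str) -> bool:
    """Return True if a workflow's mandatory labels admit this signal."""
    components = [c.lower() for c in labels["component"]]
    return (
        _array_matches(labels["severity"], severity)
        and _array_matches(labels["environment"], environment)
        # Kubernetes Kind is PascalCase; workflow labels are lowercase.
        and _array_matches(components, component.lower())
        and labels["priority"] in (priority, "*")
    )


if __name__ == "__main__":
    labels = {
        "severity": ["critical", "high", "medium"],
        "environment": ["production", "staging"],
        "component": ["deployment"],
        "priority": "*",
    }
    print(passes_layer1(labels, "high", "production", "Deployment", "P1"))
    print(passes_layer1(labels, "low", "production", "Deployment", "P1"))
```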

Layer 2: Semantic Scoring (ORDER BY)

Surviving candidates are scored by infrastructure label overlap. The formula:

final_score = LEAST((5.0 + detected_boost + custom_boost - penalty) / 10.0, 1.0)

The base score is 0.5 (5.0/10.0). Boosts increase it, penalties decrease it. The LEAST clamp ensures the score never exceeds 1.0 when many labels match.

Detected label boost weights:

| Label | Exact Match | Workflow Wildcard ("*") | Query Wildcard |
|---|---|---|---|
| gitOpsManaged | +0.10 | -- | -- |
| gitOpsTool | +0.10 | +0.05 | +0.05 |
| pdbProtected | +0.05 | -- | -- |
| serviceMesh | +0.05 | +0.025 | +0.025 |
| networkIsolated | +0.03 | -- | -- |
| helmManaged | +0.02 | -- | -- |
| stateful | +0.02 | -- | -- |
| hpaEnabled | +0.02 | -- | -- |

Maximum possible boost: 0.39 (all labels match exactly).

Penalty rules (high-impact only):

| Condition | Penalty |
|---|---|
| Target IS GitOps-managed but workflow doesn't declare gitOpsManaged | -0.10 |
| Target uses a specific GitOps tool but workflow declares a different one | -0.10 |

Maximum possible penalty: 0.20.
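
The two penalty conditions can be expressed as a short function. This is a sketch under assumptions: labels are string-to-string dictionaries, "doesn't declare" means the key is absent, and a "*" on either side is not treated as a tool mismatch. The function name is hypothetical.

```python
def gitops_penalty(workflow_labels: dict[str, str], target_labels: dict[str, str]) -> float:
    """Return the total penalty (as a positive number) for GitOps mismatches."""
    penalty = 0.0

    # Target is GitOps-managed, workflow says nothing about GitOps.
    if target_labels.get("gitOpsManaged") == "true" and "gitOpsManaged" not in workflow_labels:
        penalty += 0.10

    # Both sides name a specific tool and the tools differ.
    declared_tool = workflow_labels.get("gitOpsTool")
    detected_tool = target_labels.get("gitOpsTool")
    if (
        declared_tool is not None
        and detected_tool is not None
        and "*" not in (declared_tool, detected_tool)
        and declared_tool != detected_tool
    ):
        penalty += 0.10

    return round(penalty, 2)


target = {"gitOpsManaged": "true", "gitOpsTool": "argocd"}
generic_penalty = gitops_penalty({}, target)
wrong_tool_penalty = gitops_penalty({"gitOpsTool": "flux"}, target)
aware_penalty = gitops_penalty({"gitOpsManaged": "true", "gitOpsTool": "argocd"}, target)
```

A generic workflow takes one penalty against an ArgoCD-managed target, a workflow that declares only a different tool takes both, and a matching GitOps-aware workflow takes none.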

Custom labels: +0.15 per exact match, +0.075 per wildcard match.
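
A minimal sketch of the custom-label arithmetic follows. It assumes the workflow declares each custom label as a list of accepted values (the shape used by customLabels in the schema) and that the target carries one value per key; the function name and the "tier" key are hypothetical.

```python
def custom_boost(workflow_custom: dict[str, list[str]], target_custom: dict[str, str]) -> float:
    """+0.15 per exact match, +0.075 per wildcard match."""
    total = 0.0
    for key, value in target_custom.items():
        accepted = workflow_custom.get(key, [])
        if value in accepted:
            total += 0.15
        elif "*" in accepted:
            total += 0.075
    return round(total, 3)


target_custom = {"team": "payments", "tier": "gold"}
boost = custom_boost({"team": ["payments"], "tier": ["*"]}, target_custom)
```

Here the team label matches exactly and the tier label matches through the wildcard, so the boost is 0.15 + 0.075.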

What This Means for Workflow Authors

  • Detected labels have the highest impact on ranking. A GitOps-aware workflow (gitOpsManaged: "true", gitOpsTool: "argocd") will consistently outrank a generic one when the target is ArgoCD-managed.
  • Setting a "*" wildcard earns half credit on the labels that support it (gitOpsTool and serviceMesh). Useful for broadly applicable workflows that work with any GitOps tool or service mesh.
  • Not declaring a detected label means "no requirement" -- no boost, no penalty (except for GitOps, which applies a penalty when the target IS GitOps-managed).
  • Custom labels provide fine-grained differentiation for organization-specific matching (e.g., team: payments).

Connection to Signal Processing Rego Policies

The Signal Processing (SP) Rego policies determine the values that feed into discovery. All classification rules live in a single policy.rego file under package signalprocessing:

  • The severity and priority rules produce values for Layer 1 filtering -- a misconfigured rule can silently exclude correct workflows
  • The environment rules produce the environment value for Layer 1 filtering
  • The labels rules (kubernaut.ai/label-*) produce values for Layer 2 scoring at +0.15 per match

Why business classification is not used for discovery

Workflows are reusable across organizational boundaries -- a RollbackDeployment works for any team, business unit, or SLA tier. Mandatory labels describe the technical remediation context (severity, resource type, environment). Business classification describes who owns the resource, which is orthogonal to what fix is needed. Operators who want organizational matching can use custom labels (e.g., kubernaut.ai/label-team=payments on the namespace + customLabels: {team: ["payments"]} in the schema).

Scoring Is Internal Ordering, Not Selection

The final_score determines the order in which workflows are presented to the LLM, but the LLM makes the final selection based on descriptions, remediation history, and context. A workflow ranked #2 by score can still be selected if its description better matches the root cause.

Example Workflows

The kubernaut-demo-scenarios repository contains a library of reference workflows covering common remediation patterns (CrashLoopBackOff rollback, OOM memory increase, GitOps revert, node drain, certificate repair, etc.). These are ready-to-use starting points that operators can adapt to their environment.

kubernaut-demo-scenarios/deploy/
├── action-types/          # One YAML per ActionType CRD
│   ├── crashloop-rollback.yaml
│   └── increase-memory-limits.yaml
└── remediation-workflows/
    └── <scenario>/
        ├── <scenario>.yaml      # Multi-doc YAML (SA + RBAC + CRD)
        ├── Containerfile.exec   # Container build file (job-engine workflows)
        └── remediate.sh         # Remediation script (job-engine workflows)

Operators register workflows by applying RemediationWorkflow CRDs; see Authoring Workflows for guidelines.

CRUD Operations

RemediationWorkflow

Create:

kubectl apply -f my-workflow.yaml

Read:

kubectl get remediationworkflows
kubectl get remediationworkflow my-workflow -o yaml

Update (spec fields are immutable after creation; apply a new version instead):

Re-apply the CRD with the same metadata.name + version but different content. The old workflow is marked superseded and a new one is created.

Delete (disables in catalog):

kubectl delete remediationworkflow my-workflow

ActionType

Create:

kubectl apply -f my-actiontype.yaml

Read:

kubectl get actiontypes
kubectl get actiontype my-actiontype -o yaml

Update (description field is mutable):

Edit the CRD and re-apply:

kubectl apply -f my-actiontype.yaml

Delete (disables in taxonomy):

kubectl delete actiontype my-actiontype

Re-applying a previously deleted ActionType CRD re-enables it with the previous workflow associations intact.

Next Steps