Workflow Execution¶
CRD Reference
For the complete WorkflowExecution CRD specification, see API Reference: CRDs.
The Workflow Execution controller runs remediation workflows via Kubernetes Jobs, Tekton Pipelines, or Ansible (AWX/AAP). It manages spec validation, dependency resolution, cooldown enforcement, deterministic locking, and failure reporting.
CRD Specification¶
Spec (Immutable)¶
For the complete field specification, see WorkflowExecution in the CRD Reference.
Status¶
For the complete field specification, see WorkflowExecution in the CRD Reference.
Failure Categories¶
| Reason | Description |
|---|---|
| OOMKilled | Container killed by OOM |
| DeadlineExceeded | Execution timeout |
| Forbidden | RBAC error during execution |
| ResourceExhausted | Cluster resources unavailable |
| ConfigurationError | Spec validation or dependency failure |
| ImagePullBackOff | Bundle image pull failure |
| TaskFailed | Tekton task or Job step failure |
| Unknown | Unclassified failure |
FailureDetails¶
For the complete field specification, see WorkflowExecution in the CRD Reference.
Phase State Machine¶
stateDiagram-v2
[*] --> Pending
Pending --> Running : Spec valid + engine resolved + cooldown clear + deps resolved + exec created
Pending --> Failed : Validation / dependency / cooldown failure
Running --> Completed : Job/PipelineRun succeeded
Running --> Failed : Job/PipelineRun failed
| Phase | Terminal | Description |
|---|---|---|
| Pending | No | Spec validation, engine resolution, cooldown check, dependency resolution, execution creation |
| Running | No | Job or PipelineRun is active, polled every 10 seconds |
| Completed | Yes | Execution succeeded |
| Failed | Yes | Execution failed (pre-execution or runtime) |
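The state diagram and phase table can be captured as a small transition map. The following is a minimal sketch, not the controller's code: the `Phase` type and the `CanTransition` and `IsTerminal` helpers are hypothetical names introduced for illustration.

```go
package main

import "fmt"

// Phase mirrors the four WFE phases listed in the table above.
type Phase string

const (
	PhasePending   Phase = "Pending"
	PhaseRunning   Phase = "Running"
	PhaseCompleted Phase = "Completed"
	PhaseFailed    Phase = "Failed"
)

// allowedTransitions encodes the edges of the state diagram.
// Phases without an entry have no outgoing edges.
var allowedTransitions = map[Phase][]Phase{
	PhasePending: {PhaseRunning, PhaseFailed},
	PhaseRunning: {PhaseCompleted, PhaseFailed},
}

// IsTerminal reports whether the phase has no outgoing edges.
func IsTerminal(p Phase) bool {
	return len(allowedTransitions[p]) == 0
}

// CanTransition reports whether the diagram contains an edge from -> to.
func CanTransition(from, to Phase) bool {
	for _, next := range allowedTransitions[from] {
		if next == to {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(CanTransition(PhasePending, PhaseRunning))   // true
	fmt.Println(CanTransition(PhasePending, PhaseCompleted)) // false
	fmt.Println(IsTerminal(PhaseFailed))                     // true
}
```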
Pending Phase¶
The Pending phase performs several checks before creating an execution resource:
1. Spec Validation¶
Validates required fields:
- ExecutionBundle is non-empty
- TargetResource matches the expected format (namespace/kind/name or kind/name)
Failure → MarkFailed with ConfigurationError.
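As an illustration of the two checks above, here is a minimal sketch. The `validateSpec` helper and its error messages are hypothetical; the only assumption about the format is that TargetResource has two or three non-empty, slash-separated segments, and the real controller may apply stricter rules.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// validateSpec is a hypothetical helper that applies the two documented checks.
// A non-nil error would correspond to MarkFailed with ConfigurationError.
func validateSpec(executionBundle, targetResource string) error {
	if executionBundle == "" {
		return errors.New("executionBundle must not be empty")
	}
	// Accept namespace/kind/name (3 segments) or kind/name (2 segments).
	parts := strings.Split(targetResource, "/")
	if len(parts) != 2 && len(parts) != 3 {
		return fmt.Errorf("targetResource %q must be namespace/kind/name or kind/name", targetResource)
	}
	for _, p := range parts {
		if p == "" {
			return fmt.Errorf("targetResource %q has an empty segment", targetResource)
		}
	}
	return nil
}

func main() {
	// The bundle reference below is a placeholder value.
	fmt.Println(validateSpec("registry.example/bundle:v1", "default/Deployment/web"))
	fmt.Println(validateSpec("registry.example/bundle:v1", "Node/worker-1"))
	fmt.Println(validateSpec("registry.example/bundle:v1", "web"))
}
```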
2. Engine Resolution¶
resolveExecutionEngine queries the workflow catalog in DataStorage to determine the execution engine (tekton, job, or ansible). This runs immediately after validation and before the cooldown check.
Failure → MarkFailed with ConfigurationError.
3. Cooldown Check¶
Before creating a new execution, the controller checks for recently completed WFEs on the same target resource:
- Lists WFEs using a field index on spec.targetResource
- If a Completed or Failed WFE exists with CompletionTime within the cooldown window → block
- Returns the remaining cooldown time for requeue
Default cooldown: 1 minute (configurable via workflowexecution.config.execution.cooldownPeriod). Prevents rapid re-execution of the same workflow on the same target.
Audit: workflowexecution.selection.completed¶
Emitted after validation + engine resolution + cooldown check pass, before dependency resolution and execution creation.
4. Dependency Resolution¶
Fetches workflow dependencies from DataStorage and validates them in the execution namespace:
- Query DataStorage via WorkflowQuerier.GetWorkflowDependencies(ctx, workflowID) for declared Secrets and ConfigMaps
- Validate via DependencyValidator.ValidateDependencies that each declared dependency exists in the execution namespace
- Failure modes:
  - DataStorage fetch failure → non-fatal, continue without dependency data
  - Dependency validation failure → MarkFailed with ConfigurationError
5. Execution Creation¶
Creates a Kubernetes Job, Tekton PipelineRun, or AWX Job using the engine resolved in Step 2:
- The executor registry dispatches to the appropriate engine
- AlreadyExists handling (Job/Tekton only): If the resource already exists and belongs to this WFE, adopt it (idempotent). If it belongs to another WFE, mark as Failed (race condition).
Audit: workflowexecution.execution.started¶
Emitted after execution resource creation succeeds.
Running Phase¶
The Running phase polls the executor status every 10 seconds:
- Call exec.GetStatus(ctx, wfe, namespace)
- If Completed → MarkCompleted with CompletionTime and Duration
- If Failed → MarkFailed with FailureReason, FailureDetails, and WasExecutionFailure=true
- If still running → requeue after 10s
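One pass of the polling loop above can be sketched as a pure decision function. The `ExecStatus` and `Outcome` types and the `reconcileRunning` name are hypothetical; in a real controller the requeue delay would typically be returned through the reconciler's result.

```go
package main

import (
	"fmt"
	"time"
)

// ExecStatus is an assumed, simplified view of what the executor reports.
type ExecStatus string

const (
	ExecRunning   ExecStatus = "Running"
	ExecCompleted ExecStatus = "Completed"
	ExecFailed    ExecStatus = "Failed"
)

// pollInterval matches the documented 10-second polling period.
const pollInterval = 10 * time.Second

// Outcome describes what a single reconcile pass decides.
type Outcome struct {
	Phase               string
	WasExecutionFailure bool
	RequeueAfter        time.Duration
}

// reconcileRunning maps the executor status to the next WFE phase.
func reconcileRunning(status ExecStatus) Outcome {
	switch status {
	case ExecCompleted:
		return Outcome{Phase: "Completed"}
	case ExecFailed:
		// Runtime failures are flagged as execution failures.
		return Outcome{Phase: "Failed", WasExecutionFailure: true}
	default:
		// Still running: check again after the poll interval.
		return Outcome{Phase: "Running", RequeueAfter: pollInterval}
	}
}

func main() {
	for _, s := range []ExecStatus{ExecRunning, ExecCompleted, ExecFailed} {
		fmt.Printf("%-9s -> %+v\n", s, reconcileRunning(s))
	}
}
```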
Terminal Phase (Cooldown and Cleanup)¶
After reaching Completed or Failed, the controller does not immediately clean up:
- Wait for cooldown (default 1m) after CompletionTime
- Cleanup -- exec.Cleanup(ctx, wfe, namespace):
  - Job/Tekton: Deletes the Job or PipelineRun
  - Ansible: Deletes ephemeral AWX credentials via cleanupEphemeralCredentials and cancels the AWX job if still running
- Emit LockReleased Kubernetes event
The cooldown period serves two purposes:
- Prevents immediate re-execution of the same workflow on the same target
- Allows the Orchestrator to read execution results before the resource is deleted
Execution Engines¶
Kubernetes Jobs¶
For single-step remediations:
sequenceDiagram
participant WE as WE Controller
participant DS as DataStorage
participant K8s as Kubernetes API
WE->>DS: Query workflow dependencies
WE->>WE: Validate dependencies
WE->>K8s: Create Job in execution namespace
K8s-->>WE: Job status (Running → Succeeded/Failed)
WE->>WE: Update WFE status
Tekton Pipelines¶
For multi-step remediations with step ordering, retries, and artifact passing:
sequenceDiagram
participant WE as WE Controller
participant DS as DataStorage
participant Tekton as Tekton API
WE->>DS: Query workflow dependencies
WE->>WE: Validate dependencies
WE->>Tekton: Create PipelineRun
Tekton-->>WE: PipelineRun status
WE->>WE: Update WFE status
Ansible (AWX/AAP)¶
For remediations that use Ansible playbooks managed via AWX or Ansible Automation Platform (BR-WE-015):
sequenceDiagram
participant WE as WE Controller
participant DS as DataStorage
participant K8s as Kubernetes API
participant AWX as AWX/AAP
WE->>DS: Query workflow dependencies (Secrets, ConfigMaps)
WE->>K8s: Read dependency Secrets and ConfigMaps
WE->>AWX: Create ephemeral credentials (from Secrets)
WE->>AWX: Launch Job Template (extra_vars + credentials)
AWX-->>WE: Job status (pending → running → successful/failed)
WE->>AWX: Delete ephemeral credentials (cleanup)
WE->>WE: Update WFE status
The Ansible executor:
- Resolves the Job Template by name via the AWX REST API (engineConfig.jobTemplateName)
- Builds extra_vars from workflow parameters (with automatic type coercion for integers, booleans, floats, and JSON) plus four auto-injected context variables:

  | Variable | Source | Purpose |
  |---|---|---|
  | WFE_NAME | wfe.Name | WorkflowExecution identity for audit/logging |
  | WFE_NAMESPACE | wfe.Namespace | WorkflowExecution namespace |
  | RR_NAME | wfe.Spec.RemediationRequestRef.Name | Parent RemediationRequest identity |
  | RR_NAMESPACE | wfe.Spec.RemediationRequestRef.Namespace | Parent RemediationRequest namespace |

- Injects dependency ConfigMaps as extra_vars with a KUBERNAUT_CONFIGMAP_{NAME}_{KEY} prefix (non-sensitive data)
- Injects dependency Secrets as ephemeral AWX credentials with KUBERNAUT_SECRET_{NAME}_{KEY} environment variables (sensitive data, never in extra_vars)
- Injects K8s API credentials -- reads the controller's in-cluster ServiceAccount token and creates an ephemeral AWX credential that injects K8S_AUTH_HOST, K8S_AUTH_API_KEY, and K8S_AUTH_SSL_CA_CERT into the Execution Environment. Playbooks using kubernetes.core modules authenticate automatically. If the in-cluster environment is unavailable, the job proceeds without K8s credentials.
- Launches the AWX Job with the combined extra_vars and credential IDs. When ephemeral credentials are present, the executor also fetches the job template's pre-configured credentials and merges them (deduplicated, template-first ordering) so AWX receives the full union.
- Polls job status via GET /api/v2/jobs/{id}/, mapping AWX states (pending, waiting, running, successful, failed, error, canceled) to WFE phases
- Cleans up ephemeral credentials (including K8s credentials) after execution completes (credential IDs are persisted in status.ephemeralCredentialIDs via the status subresource)
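The type coercion applied to workflow parameters before they become extra_vars might look like the sketch below. The `coerceParam` helper, the order in which types are tried, and the restriction of JSON to objects and arrays are all assumptions; the source only states that integers, booleans, floats, and JSON are coerced.

```go
package main

import (
	"encoding/json"
	"fmt"
	"math"
	"strconv"
	"strings"
)

// coerceParam converts a string parameter into a typed value for extra_vars.
// Assumed order: integer, boolean, float, JSON object/array, else plain string.
func coerceParam(raw string) interface{} {
	if i, err := strconv.ParseInt(raw, 10, 64); err == nil {
		return i
	}
	// Only the literal words are treated as booleans in this sketch.
	if raw == "true" || raw == "false" {
		return raw == "true"
	}
	// Skip NaN and infinities so values like "inf" stay strings.
	if f, err := strconv.ParseFloat(raw, 64); err == nil && !math.IsNaN(f) && !math.IsInf(f, 0) {
		return f
	}
	if strings.HasPrefix(raw, "{") || strings.HasPrefix(raw, "[") {
		var v interface{}
		if err := json.Unmarshal([]byte(raw), &v); err == nil {
			return v
		}
	}
	return raw
}

func main() {
	for _, raw := range []string{"3", "true", "2.5", `{"replicas": 2}`, "web"} {
		fmt.Printf("%q -> %T\n", raw, coerceParam(raw))
	}
}
```

Coercing before launch lets a playbook receive a number or mapping directly instead of parsing strings itself.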
The credential lifecycle ensures Kubernetes Secret data is never persisted in AWX extra_vars (which are logged). Instead, each Secret gets a dynamic AWX credential type with env injectors, and an ephemeral credential is created per execution and deleted on cleanup.
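The credential merge described for the launch step (deduplicated, template-first ordering) can be sketched as follows. `mergeCredentialIDs` is a hypothetical helper, and representing credential IDs as plain integers is an assumption.

```go
package main

import "fmt"

// mergeCredentialIDs returns the template's credential IDs first, followed by
// the ephemeral IDs, with duplicates dropped (first occurrence wins).
func mergeCredentialIDs(template, ephemeral []int) []int {
	seen := make(map[int]struct{}, len(template)+len(ephemeral))
	merged := make([]int, 0, len(template)+len(ephemeral))
	for _, ids := range [][]int{template, ephemeral} {
		for _, id := range ids {
			if _, dup := seen[id]; dup {
				continue
			}
			seen[id] = struct{}{}
			merged = append(merged, id)
		}
	}
	return merged
}

func main() {
	templateIDs := []int{7, 3}
	ephemeralIDs := []int{42, 3, 43}
	fmt.Println(mergeCredentialIDs(templateIDs, ephemeralIDs)) // [7 3 42 43]
}
```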
Deterministic Locking (DD-WE-003)¶
To prevent concurrent execution on the same target resource, the controller uses deterministic naming.
The same target resource always produces the same execution resource name. If two WFEs attempt to run on the same target:
- The first one creates the resource successfully
- The second receives AlreadyExists → the controller checks:
  - Ownership check: Does the existing resource have a kubernaut.ai/workflow-execution label matching another WFE? If so, it fails with a race condition error (concurrent lock held by another WFE).
  - Completed check: Is the existing resource in a terminal state (completed or failed)? If so, it is cleaned up and creation is retried (stale lock from a previous execution).
  - Running check: If the resource is still running and owned by another WFE, the current WFE waits.
This pre-execution cleanup resolves the stale lock problem where a completed Job from a previous WFE would permanently block new executions on the same target.
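The exact naming scheme is not shown here, so the following is only an illustration of the property that matters: the name is a pure function of the target resource. The `executionName` helper, the `wfe-` prefix, and the truncated SHA-256 hash are all assumptions, not the scheme defined by DD-WE-003.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// executionName derives a stable resource name from the target resource.
// Hypothetical scheme: "wfe-" plus the first 16 hex characters of a SHA-256
// hash, which keeps the name short and valid as a Kubernetes object name.
func executionName(targetResource string) string {
	sum := sha256.Sum256([]byte(targetResource))
	return "wfe-" + hex.EncodeToString(sum[:])[:16]
}

func main() {
	a := executionName("default/Deployment/web")
	b := executionName("default/Deployment/web")
	c := executionName("default/Deployment/api")
	fmt.Println(a == b, a == c) // true false
}
```

Because both WFEs compute the same name, the API server's uniqueness constraint acts as the lock: the second create call fails with AlreadyExists.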
Ownership-Verified Cleanup¶
During cooldown cleanup, both JobExecutor and TektonExecutor verify the kubernaut.ai/workflow-execution label matches the WFE name before deleting execution resources. This prevents WFE1's cooldown cleanup from destroying WFE2's newly created Job or PipelineRun when they share deterministic names.
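The ownership guard amounts to a label comparison before deletion. A minimal sketch, where `safeToDelete` is a hypothetical helper and only the label key is taken from the text above:

```go
package main

import "fmt"

// ownerLabel is the label key named in the documentation.
const ownerLabel = "kubernaut.ai/workflow-execution"

// safeToDelete reports whether the execution resource is labeled as belonging
// to the given WFE. A missing label is treated as "not owned".
func safeToDelete(resourceLabels map[string]string, wfeName string) bool {
	owner, ok := resourceLabels[ownerLabel]
	return ok && owner == wfeName
}

func main() {
	// A Job that WFE2 created under the shared deterministic name.
	jobLabels := map[string]string{ownerLabel: "wfe-2"}

	fmt.Println(safeToDelete(jobLabels, "wfe-1")) // false: WFE1 must not delete it
	fmt.Println(safeToDelete(jobLabels, "wfe-2")) // true
}
```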
Engine Configuration Resolution¶
When a WFE spec omits engineConfig, the controller resolves it from the workflow catalog in DataStorage. This prevents nil-pointer panics when workflow registration did not include engine-specific configuration.
Execution Namespace and RBAC¶
All Jobs and PipelineRuns execute in the dedicated kubernaut-workflows namespace. They share a common ServiceAccount (kubernaut-workflow-runner) managed by the controller. See Security & RBAC -- Workflow Execution for the full list of permissions granted to this ServiceAccount. Per-workflow scoped RBAC is planned for v1.2.
Parameter Injection¶
The executor injects system variables and passes through all parameters from the workflow selection:
Kubernetes Jobs and Tekton Pipelines¶
| Variable | Source |
|---|---|
| TARGET_RESOURCE | wfe.Spec.TargetResource (system-injected by WFE controller) |
| TARGET_RESOURCE_NAME | wfe.Spec.Parameters (HAPI-injected from K8s root_owner) |
| TARGET_RESOURCE_KIND | wfe.Spec.Parameters (HAPI-injected from K8s root_owner) |
| TARGET_RESOURCE_NAMESPACE | wfe.Spec.Parameters (HAPI-injected from K8s root_owner) |
| Custom parameters | All remaining entries from wfe.Spec.Parameters (LLM-populated) |
Custom parameters use UPPER_SNAKE_CASE names and are injected as environment variables (Jobs) or Tekton params (PipelineRuns).
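For Jobs, the injection described above can be sketched as building an environment list. The `EnvVar` struct is a local stand-in for the Kubernetes type, `buildJobEnv` is a hypothetical helper, and two behaviors are assumptions: keys are sorted for stable output, and a parameter named TARGET_RESOURCE is ignored so the system value always wins.

```go
package main

import (
	"fmt"
	"sort"
)

// EnvVar is a local stand-in for a container environment variable.
type EnvVar struct {
	Name  string
	Value string
}

// buildJobEnv puts the system-injected TARGET_RESOURCE first and then passes
// through every entry of the parameters map in sorted key order.
func buildJobEnv(targetResource string, params map[string]string) []EnvVar {
	env := []EnvVar{{Name: "TARGET_RESOURCE", Value: targetResource}}

	keys := make([]string, 0, len(params))
	for k := range params {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		if k == "TARGET_RESOURCE" {
			continue // assumption: the system-injected value takes precedence
		}
		env = append(env, EnvVar{Name: k, Value: params[k]})
	}
	return env
}

func main() {
	env := buildJobEnv("default/Deployment/web", map[string]string{
		"TARGET_RESOURCE_NAME":      "web",
		"TARGET_RESOURCE_KIND":      "Deployment",
		"TARGET_RESOURCE_NAMESPACE": "default",
		"REPLICAS":                  "3", // example custom parameter
	})
	for _, e := range env {
		fmt.Printf("%s=%s\n", e.Name, e.Value)
	}
}
```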
Ansible (AWX/AAP)¶
| Variable | Source |
|---|---|
| TARGET_RESOURCE_NAME | wfe.Spec.Parameters (HAPI-injected from K8s root_owner) |
| TARGET_RESOURCE_KIND | wfe.Spec.Parameters (HAPI-injected from K8s root_owner) |
| TARGET_RESOURCE_NAMESPACE | wfe.Spec.Parameters (HAPI-injected from K8s root_owner) |
| WFE_NAME | wfe.Name (auto-injected) |
| WFE_NAMESPACE | wfe.Namespace (auto-injected) |
| RR_NAME | wfe.Spec.RemediationRequestRef.Name (auto-injected) |
| RR_NAMESPACE | wfe.Spec.RemediationRequestRef.Namespace (auto-injected) |
| KUBERNAUT_CONFIGMAP_{NAME}_{KEY} | Dependency ConfigMap data (auto-injected) |
| KUBERNAUT_SECRET_{NAME}_{KEY} | Dependency Secret data (via ephemeral AWX credentials) |
| K8S_AUTH_HOST | WE controller in-cluster SA (ephemeral AWX credential) |
| K8S_AUTH_API_KEY | WE controller in-cluster SA (ephemeral AWX credential) |
| K8S_AUTH_SSL_CA_CERT | WE controller in-cluster SA (ephemeral AWX credential) |
| Custom parameters | All remaining entries from wfe.Spec.Parameters (type-coerced into extra_vars) |
Handoff¶
The WFE controller reports status back to the Orchestrator through the CRD status:
WFE Completed → RO creates EffectivenessAssessment → Verifying phase
WFE Failed → RO creates EA (for tracking) + NotificationRequest → Failed phase
For Ansible executions, the handoff is identical -- the AWX job status is mapped to the same WFE phases (Completed/Failed), so the Orchestrator does not need to distinguish between execution engines.
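A plausible shape for that status mapping is sketched below. The grouping is an assumption inferred from the listed AWX states (in-flight states keep the WFE in Running, `successful` completes it, and `failed`, `error`, and `canceled` fail it); `phaseForAWXStatus` is a hypothetical name.

```go
package main

import "fmt"

// phaseForAWXStatus maps an AWX job status to a WFE phase.
// The second return value is false for statuses this sketch does not know.
func phaseForAWXStatus(status string) (phase string, known bool) {
	switch status {
	case "pending", "waiting", "running":
		return "Running", true
	case "successful":
		return "Completed", true
	case "failed", "error", "canceled":
		return "Failed", true
	default:
		return "", false
	}
}

func main() {
	for _, s := range []string{"pending", "waiting", "running", "successful", "failed", "error", "canceled"} {
		phase, _ := phaseForAWXStatus(s)
		fmt.Printf("%-10s -> %s\n", s, phase)
	}
}
```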
Next Steps¶
- Effectiveness Assessment -- Post-execution health evaluation
- Remediation Workflows -- Writing workflow schemas
- Remediation Routing -- How the Orchestrator manages the lifecycle