# AI Analysis
The AI Analysis service performs root cause investigation using an LLM (via HolmesGPT) and decides whether the selected workflow should be auto-approved or require human review.
> **CRD Reference:** For the complete `AIAnalysis` CRD specification, see API Reference: CRDs.
## Architecture

```mermaid
graph TB
    AA[AI Analysis<br/>Controller] -->|session submit| HAPI[HolmesGPT API]
    AA -->|session poll| HAPI
    AA -->|session result| HAPI
    HAPI -->|LLM call| LLM[LLM Provider<br/><small>Vertex AI / OpenAI</small>]
    HAPI -->|workflow query| DS[DataStorage]
    AA -->|Rego eval| REGO[Approval Policy]
    AA -->|audit| DS
```
## Session-Based Async Pattern
The AI Analysis controller communicates with HolmesGPT using a session-based asynchronous pattern (BR-AA-HAPI-064):
### Flow

1. **Submit** — `POST /api/v1/incident/analyze` → `202 Accepted` + `session_id`
2. **Poll** — `GET /api/v1/incident/session/{session_id}` → status (`pending`, `investigating`, `completed`, `failed`)
3. **Result** — `GET /api/v1/incident/session/{session_id}/result` → full analysis
```mermaid
sequenceDiagram
    participant AA as AI Analysis Controller
    participant HAPI as HolmesGPT API
    participant LLM as LLM Provider
    AA->>HAPI: POST /api/v1/incident/analyze
    HAPI-->>AA: 202 {session_id}
    Note over AA: Phase: Investigating
    HAPI->>LLM: Run investigation (kubectl access)
    LLM-->>HAPI: Analysis result
    AA->>HAPI: GET /session/{id}
    HAPI-->>AA: {status: "completed"}
    AA->>HAPI: GET /session/{id}/result
    HAPI-->>AA: IncidentResponse
    Note over AA: Phase: Analyzing
```
This pattern avoids long HTTP timeouts and allows the controller to use Kubernetes-native requeue mechanisms (`RequeueAfter`) while the LLM investigation runs. The controller polls at a constant 15-second interval (configurable from 1s to 5m via the `holmesgpt.sessionPollInterval` YAML config field).
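The submit → poll → result lifecycle can be sketched as a small client loop. This is an illustration, not the controller's actual Go implementation: the `http` transport object, its `(status_code, body)` return shape, and the `incident` payload are all assumptions; only the endpoint paths and status values come from the documented API.

```python
import time

def run_analysis_session(http, incident, poll_interval=15, timeout=300):
    """Drive the submit -> poll -> result session lifecycle.

    `http` is any object with get/post methods returning
    (status_code, json_body) tuples -- injected so the flow can be
    exercised without a live HolmesGPT API.
    """
    # Submit: expect 202 Accepted with a session_id
    code, body = http.post("/api/v1/incident/analyze", incident)
    if code != 202:
        raise RuntimeError(f"submit failed: {code}")
    session_id = body["session_id"]

    # Poll until the session leaves pending/investigating
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        _, body = http.get(f"/api/v1/incident/session/{session_id}")
        if body["status"] == "completed":
            break
        if body["status"] == "failed":
            raise RuntimeError("investigation failed")
        time.sleep(poll_interval)
    else:
        raise TimeoutError("session did not complete in time")

    # Fetch the full analysis result
    _, result = http.get(f"/api/v1/incident/session/{session_id}/result")
    return result
```

In the real controller each poll iteration is a separate reconcile triggered by `RequeueAfter` rather than an in-process `sleep`.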
### Session Recovery

If the HolmesGPT API restarts and returns `404` for a session, the controller regenerates the session (up to 5 attempts, per BR-AA-HAPI-064.5/064.6).
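A minimal sketch of that recovery path, under the same illustrative `http` transport assumption as above (the function and attempt-counter names are hypothetical; only the 404 trigger and the 5-attempt cap come from the text):

```python
MAX_SESSION_REGENERATIONS = 5  # cap per BR-AA-HAPI-064.5/064.6

def poll_with_recovery(http, incident, session_id, regenerations=0):
    """Poll a session; on 404 (HAPI restarted and lost its state),
    resubmit the incident to obtain a fresh session_id."""
    code, body = http.get(f"/api/v1/incident/session/{session_id}")
    if code == 404:
        if regenerations >= MAX_SESSION_REGENERATIONS:
            raise RuntimeError("session lost too many times; giving up")
        # Regenerate: resubmit the original incident for a new session
        _, created = http.post("/api/v1/incident/analyze", incident)
        return poll_with_recovery(http, incident, created["session_id"],
                                  regenerations + 1)
    return session_id, body["status"]
```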
## Timeout Configuration

The Orchestrator passes per-analysis timeout configuration via the `AIAnalysis` CRD spec:

| Field | Default | Description |
|---|---|---|
| `investigatingTimeout` | Inherited from RR | Maximum time in the Investigating phase |
| `analyzingTimeout` | Inherited from RR | Maximum time in the Analyzing phase |

If either timeout expires, the `AIAnalysis` transitions to `Failed`.
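The expiry check itself is simple; a sketch (function and argument names are illustrative, not the controller's actual code):

```python
from datetime import datetime, timedelta, timezone

def phase_timed_out(phase_started_at, timeout):
    """True once the current phase has exceeded its configured timeout,
    at which point the controller moves the AIAnalysis to Failed."""
    return datetime.now(timezone.utc) - phase_started_at >= timeout

# Example: a phase that started 10 minutes ago
started = datetime.now(timezone.utc) - timedelta(minutes=10)
phase_timed_out(started, timedelta(minutes=5))   # budget exceeded -> Failed
phase_timed_out(started, timedelta(minutes=15))  # still within budget
```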
## Phases

| Phase | Description |
|---|---|
| `Pending` | CRD created by Orchestrator |
| `Investigating` | Session submitted to HolmesGPT, polling for completion |
| `Analyzing` | Results received, evaluating Rego approval policy |
| `Completed` | Analysis and approval decision recorded |
| `Failed` | Investigation or analysis failed |
## HolmesGPT Investigation

HolmesGPT is a Python FastAPI service that orchestrates LLM-driven investigation with live Kubernetes access and configurable observability toolsets. During investigation, it:

1. **Reads the enriched signal** — alert details, target resource, namespace context
2. **Investigates using K8s tools** — inspects pod logs, events, resource state, and live metrics via `kubectl`; optionally queries Prometheus, Grafana Loki/Tempo, and other configured toolsets
3. **Produces a root cause analysis** — structured explanation of what went wrong
4. **Resolves the target resource** — calls `get_namespaced_resource_context` (or `get_cluster_resource_context` for cluster-scoped resources) to resolve the owner chain, compute a spec hash, fetch remediation history (past outcomes and effectiveness scores via internal DataStorage lookup), and detect infrastructure labels (GitOps, Helm, service mesh, HPA, PDB)
5. **Discovers workflows via DataStorage** — the LLM uses a three-step protocol: `list_available_actions` → `list_workflows` → `get_workflow`. Signal context and detected labels are auto-injected as filters; DataStorage orders results by label-match scoring (scores are not exposed to the LLM)
6. **LLM selects a workflow** — based on workflow descriptions (`what`, `whenToUse`, `whenNotToUse`), detected infrastructure context, and remediation history
7. **Returns the `actionable` flag** — indicates whether the investigation identified a concrete remediation action; propagated to the `AIAnalysis` CRD status and used downstream for audit and decision filtering
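The three-step discovery protocol can be sketched as follows. The tool names come from the text; the `tools` interface, its argument names, and the candidate-record shape are assumptions for illustration (in the real system the LLM issues these as tool calls, not direct function calls):

```python
def discover_workflow(tools, signal_context, detected_labels):
    """Three-step workflow discovery against DataStorage.

    Signal context and detected labels are auto-injected as filters;
    DataStorage returns candidates pre-ordered by label-match score
    (the scores themselves are not exposed to the LLM).
    """
    # Step 1: what action types are available for this signal?
    actions = tools.list_available_actions(signal=signal_context)

    # Step 2: candidate workflows, filtered and ordered by DataStorage
    candidates = tools.list_workflows(actions=actions, labels=detected_labels)
    if not candidates:
        return None  # nothing actionable in the catalog

    # Step 3: fetch the full definition of the chosen candidate
    # (the LLM picks based on what / whenToUse / whenNotToUse)
    return tools.get_workflow(candidates[0]["id"])
```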
## Response Processing
When the controller receives the analysis result, it applies two confidence thresholds:
### Investigation Threshold (0.7)

Applied in the response processor during the Investigating phase:

- **Confidence >= 0.7 with no workflow** — treated as "problem already resolved" (no remediation needed)
- **Confidence < 0.7 with a selected workflow** — workflow selection rejected as low-confidence
### Problem Self-Resolved Bypass (#301)

When HolmesGPT reports `investigation_outcome=resolved`, it appends a "Problem self-resolved" warning to the response. The response processor detects this signal and bypasses the substantive RCA check: even if the LLM produced a root cause analysis with contributing factors, the RCA is treated as documenting a transient condition (e.g., a pod that recovered on its own) rather than an active problem.

Without this bypass, a resolved incident with a detailed RCA would be incorrectly escalated to human review (because `hasSubstantiveRCA` would return true, preventing the `WorkflowNotNeeded` completion path). The fix ensures that HAPI's authoritative "resolved" signal takes priority over the RCA content check.
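Putting the threshold and the bypass together, the decision logic looks roughly like this. The outcome strings and the fall-through `"proceed"` case are illustrative names, not the controller's actual identifiers; the ordering (resolved signal checked first) is the point of #301:

```python
INVESTIGATION_THRESHOLD = 0.7

def classify_result(confidence, workflow, outcome):
    """Simplified response-processor decision for an investigation result.

    The authoritative `resolved` outcome from HAPI wins even when a
    detailed RCA would otherwise look like an active problem (#301).
    """
    if outcome == "resolved":
        return "workflow_not_needed"        # self-resolved bypass
    if workflow is None and confidence >= INVESTIGATION_THRESHOLD:
        return "workflow_not_needed"        # problem already resolved
    if workflow is not None and confidence < INVESTIGATION_THRESHOLD:
        return "rejected_low_confidence"    # don't trust the selection
    return "proceed"                        # continue to Analyzing
```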
### Approval Gate (Rego policy, operator-provided)

The Analyzing handler evaluates a Rego policy to determine whether the remediation requires human approval:

- **Query** — `data.aianalysis.approval`
- **Input** — full analysis context (see below)
- **Output** — `require_approval` (boolean) and `reason` (string)

When no policy is mounted, the controller auto-approves all remediations. Operators provide their own `approval.rego` to enforce custom gates.
The default shipped policy (bundled for reference) gates on:

- **Production** — always requires approval
- **Non-production** — auto-approved when `remediation_target` is present
- **Missing `remediation_target`** — always requires approval (default-deny per ADR-055)
- **Sensitive resource kinds** — requires approval for Deployments, StatefulSets, DaemonSets in production

The policy receives `confidence`, `confidence_threshold`, `detected_labels` (snake_case keys: `"stateful"`, `"pdb_protected"`, `"hpa_enabled"`), `failed_detections`, `custom_labels`, and `business_classification`. Operators can write custom policies that use any combination of these inputs — for example, confidence-gated approval for production.
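As a sketch of the `input` document handed to the Rego query, assembled on the controller side: the field names follow the list above, but the `analysis` accessor shape and the `environment` key (implied by the production gating, not listed explicitly) are assumptions.

```python
def build_policy_input(analysis):
    """Assemble the `input` document for `data.aianalysis.approval`.

    `analysis` is an illustrative dict of the analysis context; field
    names mirror the documented policy inputs.
    """
    return {
        "environment": analysis.get("environment"),
        "remediation_target": analysis.get("remediation_target"),
        "confidence": analysis["confidence"],
        "confidence_threshold": analysis.get("confidence_threshold", 0.8),
        # snake_case detection keys, e.g. "stateful", "pdb_protected", "hpa_enabled"
        "detected_labels": analysis.get("detected_labels", {}),
        "failed_detections": analysis.get("failed_detections", []),
        "custom_labels": analysis.get("custom_labels", {}),
        "business_classification": analysis.get("business_classification"),
    }
```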
The confidence threshold is configurable via Helm (`aianalysis.rego.confidenceThreshold`, default 0.8) and passed as `input.confidence_threshold`.
See Human Approval for the full approval flow and policy customization details.
## Next Steps

- **Investigation Pipeline** — deep-dive into the LLM investigation phases, resource context, remediation history, decision outcomes, and approval gate
- **Remediation Routing** — how the Orchestrator routes the result
- **Workflow Selection** — catalog query and scoring details
- **Human Approval** — the approval flow