# HolmesGPT API
HolmesGPT is a Python FastAPI service that wraps LLM calls with live Kubernetes access for root cause analysis. The AI Analysis controller communicates with it using a session-based asynchronous pattern.
> **OpenAPI Spec:** The full OpenAPI 3.1.0 specification is available at `holmesgpt-api/api/openapi.json` in the main repository. The Go client (`pkg/holmesgpt/client/`) uses the generated ogen client for all endpoints, including session management (DD-HAPI-003).
## Base URL

Internal services use the short form `http://holmesgpt-api:8080` when communicating within the same namespace.
## Session-Based Async Pattern
The API uses a submit-poll-result pattern to handle long-running LLM investigations:
```mermaid
sequenceDiagram
    participant Client as AI Analysis Controller
    participant HAPI as HolmesGPT API
    participant LLM as LLM Provider
    Client->>HAPI: POST /api/v1/incident/analyze
    HAPI-->>Client: 202 {session_id}
    HAPI->>LLM: Run investigation
    Note over HAPI,LLM: kubectl access, log analysis
    Client->>HAPI: GET /api/v1/incident/session/{id}
    HAPI-->>Client: {status: "investigating"}
    LLM-->>HAPI: Analysis complete
    Client->>HAPI: GET /api/v1/incident/session/{id}
    HAPI-->>Client: {status: "completed"}
    Client->>HAPI: GET /api/v1/incident/session/{id}/result
    HAPI-->>Client: IncidentResponse
```
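The submit-poll-result flow above can be sketched client-side. A minimal Python sketch against an in-memory stand-in for the API (the paths and the `session_id`/`status` field names come from the diagram; the class and helper names are illustrative, not the real Go client):

```python
import time

class FakeHolmesGPT:
    """In-memory stand-in for the HolmesGPT API (illustration only)."""
    def __init__(self, polls_until_done=2):
        self._polls = 0
        self._polls_until_done = polls_until_done

    def submit(self, incident):        # POST /api/v1/incident/analyze -> 202
        return {"session_id": "abc123"}

    def status(self, session_id):      # GET /api/v1/incident/session/{id}
        self._polls += 1
        done = self._polls >= self._polls_until_done
        return {"status": "completed" if done else "investigating"}

    def result(self, session_id):      # GET /api/v1/incident/session/{id}/result
        return {"root_cause": "OOMKilled: memory limit too low", "confidence": 0.9}

def investigate(api, incident, poll_interval=0.01, timeout=5.0):
    """Submit, poll until terminal status, then fetch the result."""
    session_id = api.submit(incident)["session_id"]
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = api.status(session_id)["status"]
        if status == "completed":
            return api.result(session_id)
        if status == "failed":
            raise RuntimeError("investigation failed")
        time.sleep(poll_interval)   # real code would back off between polls
    raise TimeoutError("session did not complete in time")

result = investigate(FakeHolmesGPT(), {"signal": "PodOOMKilled"})
assert result["confidence"] == 0.9
```

The poll interval, timeout, and failure handling are deployment choices; the doc only fixes the three-endpoint shape of the exchange.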
## Endpoints

### Incident Analysis
#### Submit Investigation

`POST /api/v1/incident/analyze`

Starts an asynchronous investigation session.

**Request:** `IncidentRequest` (enriched signal data, target resource, analysis parameters)

**Response:** `202 Accepted` with the new `session_id`
#### Poll Session Status

`GET /api/v1/incident/session/{session_id}`

Returns the current status of an investigation session.

**Response:** `200 OK`

Session statuses: `pending`, `investigating`, `completed`, `failed`

**Response:** `404 Not Found` if the session does not exist (e.g., after a pod restart). The AI Analysis controller handles this by regenerating the session (up to 5 attempts per BR-AA-HAPI-064.5/064.6).
#### Get Session Result

`GET /api/v1/incident/session/{session_id}/result`

Returns the analysis result once the session is complete.

**Response:** `200 OK` with an `IncidentResponse` containing the RCA, selected workflow, confidence score, and actionable flag
Key response fields:

| Field | Type | Description |
|---|---|---|
| `root_cause` | string | Natural language root cause explanation |
| `confidence` | float | Investigation confidence (0.0 to 1.0) |
| `investigation_outcome` | string | Outcome classification (e.g., `resolved`, `workflow_selected`) |
| `selected_workflow` | object | Workflow recommendation (name, action type, parameters) |
| `actionable` | boolean | Whether the investigation identified a concrete remediation action |
| `remediation_target` | object | Target resource (kind, name, namespace); constructed by the AA controller from HAPI's `root_owner` tool result, not a direct HAPI response field |
| `detected_labels` | object | Infrastructure labels detected during investigation |
**Response:** `409 Conflict` if the session is not yet complete
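The fields in the table above can be modeled client-side. A minimal sketch, with field names taken from the table and Python types assumed (`remediation_target` is omitted because it is assembled by the controller, not returned by HAPI):

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class IncidentResult:
    """Client-side view of an IncidentResponse (types are assumptions)."""
    root_cause: str
    confidence: float                        # 0.0 to 1.0
    investigation_outcome: str               # e.g. "resolved", "workflow_selected"
    actionable: bool
    selected_workflow: Optional[dict[str, Any]] = None
    detected_labels: dict[str, str] = field(default_factory=dict)

payload = {
    "root_cause": "OOMKilled: memory limit below working set",
    "confidence": 0.85,
    "investigation_outcome": "workflow_selected",
    "actionable": True,
    "selected_workflow": {"name": "increase-memory-limit"},
    "detected_labels": {"runtime": "containerd"},
}
result = IncidentResult(**payload)
assert result.actionable and result.confidence >= 0.8
```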
### Health

| Method | Path | Description |
|---|---|---|
| GET | `/health` | Liveness probe |
| GET | `/ready` | Readiness probe (checks SDK, context API, Prometheus client) |
| GET | `/config` | Configuration snapshot (dev mode only) |
| GET | `/metrics` | Prometheus metrics |
## Error Responses

All error responses (4xx, 5xx) use RFC 7807 Problem Details format with `Content-Type: application/problem+json`. See the error type catalog for the full list of error types.
Example (session not ready):

```json
{
  "type": "https://kubernaut.ai/problems/conflict",
  "title": "Conflict",
  "detail": "Session is still investigating, result not yet available",
  "status": 409
}
```
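Because every error body follows the same RFC 7807 shape, clients can decode them uniformly. A minimal sketch (the `parse_problem` helper is hypothetical; `type`, `title`, `status`, and `detail` are the standard RFC 7807 members):

```python
import json

PROBLEM_CONTENT_TYPE = "application/problem+json"

def parse_problem(content_type: str, body: bytes) -> dict:
    """Decode an RFC 7807 problem document, with a fallback for non-problem bodies."""
    if not content_type.startswith(PROBLEM_CONTENT_TYPE):
        return {"type": None, "title": "Unknown error", "status": None,
                "detail": body.decode(errors="replace")}
    doc = json.loads(body)
    return {k: doc.get(k) for k in ("type", "title", "status", "detail")}

body = (b'{"type": "https://kubernaut.ai/problems/conflict", "title": "Conflict", '
        b'"detail": "Session is still investigating, result not yet available", '
        b'"status": 409}')
problem = parse_problem("application/problem+json", body)
assert problem["status"] == 409 and problem["title"] == "Conflict"
```

A 409 from the result endpoint is expected during normal operation (the session is simply not done), so a client would treat it as "keep polling" rather than as a failure.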
## Session Management
- Sessions are stored in-memory in the HolmesGPT API pod
- If the pod restarts, sessions are lost — the AI Analysis controller handles this by regenerating sessions (up to 5 attempts)
- Session results are available until the pod restarts or the session is garbage-collected
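The regeneration behavior above can be sketched as a retry loop. The 5-attempt limit comes from BR-AA-HAPI-064.5/064.6; the API and exception names here are illustrative, not the controller's real types:

```python
class SessionLost(Exception):
    """Raised when a poll returns 404 (e.g., the HolmesGPT API pod restarted)."""

def poll_with_regeneration(api, incident, max_attempts=5):
    """Re-submit the investigation whenever the session vanishes, up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        session_id = api.submit(incident)["session_id"]
        try:
            return api.wait_for_result(session_id)   # polls until completed
        except SessionLost:
            if attempt == max_attempts:
                raise
            # In-memory session is gone; start a fresh one and try again.
            continue

class FlakyAPI:
    """Fake API whose first session is lost mid-poll (pod-restart simulation)."""
    def __init__(self):
        self.submissions = 0
    def submit(self, incident):
        self.submissions += 1
        return {"session_id": f"s{self.submissions}"}
    def wait_for_result(self, session_id):
        if session_id == "s1":
            raise SessionLost()          # 404: in-memory session gone
        return {"root_cause": "node pressure", "confidence": 0.7}

api = FlakyAPI()
result = poll_with_regeneration(api, {"signal": "NodeNotReady"})
assert api.submissions == 2 and result["confidence"] == 0.7
```

Note that regeneration restarts the investigation from scratch; any partial work in the lost session is discarded.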
## LLM Providers
HolmesGPT uses LiteLLM under the hood, supporting any compatible provider:
| Provider | Configuration |
|---|---|
| OpenAI | `provider: openai`, `model: gpt-4o` |
| Vertex AI | `provider: vertex_ai`, `model: gemini-2.5-pro`, `gcp_project_id`, `gcp_region` |
| Azure OpenAI | `provider: azure`, `model: gpt-4o`, `endpoint` |
| Any LiteLLM provider | See the LiteLLM documentation |
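Putting the Vertex AI row into a config fragment might look like the sketch below. Only the key names come from the table; the `llm:` nesting and the placeholder values are assumptions, so check the Configuration Reference for the actual file layout.

```yaml
llm:
  provider: vertex_ai
  model: gemini-2.5-pro
  gcp_project_id: my-project     # placeholder
  gcp_region: us-central1        # placeholder
```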
## Next Steps
- AI Analysis Architecture — How the controller uses this API
- DataStorage API — Audit and workflow APIs
- Configuration Reference — LLM provider settings