Enterprise SRE teams waste 60–90 minutes per incident manually correlating logs, deployment history, and service topology before identifying a root cause. OpsPulse Sentinel eliminates that triage window entirely. OpsPulse Sentinel is an autonomous Root Cause Analysis agent built for production SRE workflows. It ingests raw telemetry, deployment history, and cluster architecture — then runs a structured multi-stage pipeline to deliver a confidence-scored RCA report with a ready-to-run remediation command in under 60 seconds. The agent uses a dual-model Gemini pipeline. Stage 1: Gemini Flash-Lite acts as a high-speed anomaly filter, stripping INFO noise and returning only ERRORs, WARNs, and suspicious patterns. Stage 2: filtered anomalies are combined with ChromaDB vector memory of past incidents, a live cluster manifest, and the deployment log. Gemini Pro reasons across all four sources using temporal causation logic — if a deployment at time T precedes errors at T+N, it is surfaced as primary causal evidence. Output is a Pydantic-validated JSON RCA: root cause, affected services, confidence score, evidence chain, and a specific remediation command. A policy guardrail blocks execution below 0.75 confidence and escalates to human review — the agent never acts on incomplete reasoning. The dual-model routing isn't just cost optimization. Flash-Lite's low latency makes real-time anomaly filtering practical. Pro's reasoning depth handles deployment context, telemetry, historical memory, and cluster topology in one coherent pass. The human approval gate before execution is a deliberate design decision. Autonomous reasoning with human-controlled execution is the right model for production systems — and means junior engineers can act on incidents that would previously require a senior SRE. Stack: Python · Streamlit · Gemini API (Flash-Lite + Pro) · ChromaDB · Pydantic
Category tags: