DAGIntel-Multi-Agent Airflow Incident Investigator

Created by team One Forge on May 09, 2026
AI Agents & Agentic Workflows (Best Track for Beginners)

DAGIntel transforms how on-call engineers debug Airflow pipeline failures. When a data pipeline crashes at 2am, engineers typically spend 2-4 hours reading stack traces, Googling errors, and piecing together fixes from Slack history and runbooks. DAGIntel automates this entire process using three CrewAI agents: 1. **Log Analyzer** (Senior SRE): Parses raw Airflow logs into structured JSON with error classification and severity 2. **Root Cause Detective** (Principal Data Engineer): Applies 12 years of Large-scale debugging expertise to identify true root causes vs. symptoms, with 95% confidence scoring 3. **Fix Suggester** (Staff Engineer): Generates production-ready runbooks with Kubernetes YAML, SQL queries, Prometheus alerts, and step-by-step remediation The system demonstrates practical AI agent collaboration where domain expertise (encoded in agent backstories and prompts) produces actionable deliverables on-call teams can execute immediately. Built on AMD MI300X GPU infrastructure using Qwen, CrewAI orchestration, and Streamlit UI. Real impact: Incident resolution time drops from 4 hours to 90 seconds, junior engineers get senior-level insights, and every investigation auto-generates polished documentation for post-mortems.

Category tags: