
Modern reasoning-enhanced LLM agents hallucinate tool names that do not exist in their registry, a failure mode documented across GPT, Claude, Llama, and Qwen models. The 2026 literature measures Llama-3.1-8B at a 62–99% phantom-tool rate, depending on task complexity. No commercial product detected this before Sentinel.

Sentinel wraps any autonomous agent with a three-layer detection cascade. Layer 1 is an in-memory registry hash that passes real tools through in under 1 millisecond at zero cost. Layer 2 uses Gemini text-embedding-001 with three fusion heuristics to catch near-miss phantoms: Levenshtein name distance (F1), argument-key Jaccard overlap (F2), and the top-1 vs top-2 similarity gap (F3). Matches scoring above 0.85 are treated as confident and trigger AUTO_CORRECT, which injects the real tool name back into the agent's next turn. Ambiguous cases (0.60–0.85) escalate to Layer 3, a Gemini 2.5 Flash semantic verifier that produces structured JSON judgments.

The system is deployed and measurable. It is live at https://sentinel.66-245-207-218.nip.io, hosted on Vultr Milan behind Caddy with Let's Encrypt TLS. Registry: 20 Claude Code and MCP tools. Latency benchmark: p50 = 0.179 ms over 2,000 calls, roughly 56 times under the 10 ms budget. Phantom detection F1 score (the classification metric, not heuristic F1 above) = 1.000 on a 55-example corpus. A real-time dashboard streams every intercept decision via SSE.

Three distribution surfaces ship in v0.1.0: a Claude Code PreToolUse hook (one line in settings.json), a public REST endpoint at /detect for any agent platform, and the live dashboard. Built on four papers published on arXiv in 2025 and 2026 (arXiv:2510.22977, 2601.05214, 2602.08082, 2605.09252). Apache 2.0. 141 tests passing.
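
The cascade routing described above can be illustrated with a minimal Python sketch. It is not Sentinel's implementation. The registry contents, the 0.6/0.4 fusion weights, the MIN_GAP value, the BLOCK action for scores below 0.60, and the function names are all assumptions made for illustration. The embedding similarity from Layer 2 is replaced here by plain string similarity so the example runs with the standard library only. Only the 0.85 and 0.60 thresholds and the AUTO_CORRECT action come from the description.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Hypothetical registry: tool name -> expected argument keys.
REGISTRY = {
    "read_file": {"path"},
    "write_file": {"path", "content"},
    "run_shell": {"command", "timeout"},
}
AUTO_CORRECT_ABOVE = 0.85  # from the description
ESCALATE_FROM = 0.60       # from the description
MIN_GAP = 0.05             # assumed value for the top-1 vs top-2 gap check


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def name_similarity(a: str, b: str) -> float:
    """F1 heuristic: edit distance normalised to a similarity in [0, 1]."""
    return 1.0 - levenshtein(a, b) / (max(len(a), len(b)) or 1)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """F2 heuristic: overlap of argument keys."""
    return 1.0 if not a and not b else len(a & b) / len(a | b)


@dataclass
class Decision:
    action: str            # PASS, AUTO_CORRECT, ESCALATE, or BLOCK (BLOCK is assumed)
    tool: Optional[str]
    score: float


def detect(name: str, arg_keys: Set[str]) -> Decision:
    # Layer 1: exact registry hit, no further work.
    if name in REGISTRY:
        return Decision("PASS", name, 1.0)
    # Layer 2: score every registered tool (assumed weighted fusion).
    scored = sorted(
        ((0.6 * name_similarity(name, t) + 0.4 * jaccard(arg_keys, keys), t)
         for t, keys in REGISTRY.items()),
        reverse=True,
    )
    (best, tool), (second, _) = scored[0], scored[1]
    gap = best - second  # F3 heuristic: how clearly the top candidate wins
    if best > AUTO_CORRECT_ABOVE and gap >= MIN_GAP:
        return Decision("AUTO_CORRECT", tool, best)
    if best >= ESCALATE_FROM:
        # Layer 3 (an LLM verifier in the description) would be called here.
        return Decision("ESCALATE", tool, best)
    return Decision("BLOCK", None, best)
```

For example, `detect("read_fil", {"path"})` is a one-character near miss with matching argument keys, so it is corrected to `read_file`, while `detect("readfile", {"path", "encoding"})` scores in the ambiguous band and would be handed to the Layer 3 verifier.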
19 May 2026