
Crucible runs six architecturally distinct open-source LLMs in parallel on a single AMD MI300X GPU, each playing a different adversarial reviewer persona: Skeptic (Qwen2.5-7B), Red Teamer (Hermes-3-Llama-8B), Ethics Auditor (Falcon3-7B), Market Critic (Phi-3.5-mini), Devil's Advocate (Yi-1.5-9B), and Pragmatist (InternLM2.5-7B). Their critiques are synthesised into a structured report scored by cross-architecture agreement. The problem: most "AI judges AI" systems run one model with different prompts. Agreement becomes a correlated signal — one prior repeated. Crucible replaces this with genuine architectural diversity. When Qwen, Phi, Falcon, Hermes-Llama, InternLM, and Yi independently flag the same risk, that's real cross-validation — backed by recent research showing 2 diverse models can match 16 homogeneous ones. Architecture: LiteLLM proxy on port 8000 routes to six vLLM containers (ports 8001–8006), each at --gpu-memory-utilization 0.13, fitting all six in 192GB VRAM (~85% utilized). The pipeline runs Round 1 (six parallel critiques via asyncio.gather), Round 2 (cross-debate where each persona reads the others), and a two-pass synthesiser emitting structured JSON with verbatim evidence per finding. Trust mechanisms in every report: evidence citations per reviewer, self-disclosure metadata (run_id, schema version, per-persona model, distinct-model count), prompt-injection defense (SECURITY NOTICE prefix + <<USER_INPUT>> markers), input length cap, and OWASP LLM Top-10 categorisation with regex validation. Evaluation on a 6-input corpus (SaaS, healthcare AI, fintech, insecure code, prompt-injection, Romanian-language proposal): multi-model surfaces 8 OWASP-tagged findings vs 3 for single-model. Both modes hit 100% schema adherence. Median runtime: 58s multi vs 31s single. Why MI300X: six concurrent unquantised vLLM instances fit in 192GB VRAM at FP16. Same workload would need 4–6 consumer GPUs or quantisation on an A100-80GB that degrades smaller models.
10 May 2026