GuardianRail is a representation-aware action firewall for open-weight customer-operations agents. The project shows a regulated support agent running on Gemma 3 12B IT with Gemma Scope 2 sparse-autoencoder features monitored during inference on AMD MI300X. Instead of only checking the final text response, GuardianRail displays the operation the agent is about to take, reads internal safety-relevant features, gates that proposed action, and logs the evidence behind every allow, block, or escalation. In the demo, a normal support request is allowed, a prompt-injection attempt is blocked before a restricted action can run, and a social-engineering request is escalated to human review. The Streamlit interface shows live safety signals, feature thresholds, the Action Firewall decision, policy-layer clamp/boost interventions, GPU usage, and a SQLite audit trail. The goal is not to claim jailbreaks are solved; it is to make open-weight agent safety observable, tunable, and auditable for teams deploying agents on their own infrastructure.
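To make the gate-then-log flow concrete, here is a minimal sketch of how a proposed action might be checked against feature thresholds and recorded in a SQLite audit trail. It is not GuardianRail's actual code: the feature names, threshold values, action names, table schema, and the functions `gate_action` and `log_decision` are all assumptions made for illustration, and the activations are hard-coded rather than read from a model.

```python
import json
import sqlite3
from dataclasses import dataclass

# Hypothetical feature names and thresholds; values are illustrative only.
BLOCK_THRESHOLDS = {"prompt_injection": 0.8}
ESCALATE_THRESHOLDS = {"social_engineering": 0.6}


@dataclass
class Decision:
    action: str
    verdict: str    # "allow", "block", or "escalate"
    evidence: dict  # features that crossed a threshold


def gate_action(action, activations):
    """Pick a verdict for a proposed action; block takes priority over escalate."""
    blocked = {name: value for name, value in activations.items()
               if value >= BLOCK_THRESHOLDS.get(name, float("inf"))}
    if blocked:
        return Decision(action, "block", blocked)
    escalated = {name: value for name, value in activations.items()
                 if value >= ESCALATE_THRESHOLDS.get(name, float("inf"))}
    if escalated:
        return Decision(action, "escalate", escalated)
    return Decision(action, "allow", {})


def log_decision(conn, decision):
    """Append the verdict and its evidence to an audit table (assumed schema)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit "
        "(action TEXT, verdict TEXT, evidence TEXT)"
    )
    conn.execute(
        "INSERT INTO audit VALUES (?, ?, ?)",
        (decision.action, decision.verdict, json.dumps(decision.evidence)),
    )
    conn.commit()


# Three made-up requests mirroring the demo scenarios described above.
conn = sqlite3.connect(":memory:")
requests = [
    ("lookup_order", {"prompt_injection": 0.05, "social_engineering": 0.10}),
    ("issue_refund", {"prompt_injection": 0.93, "social_engineering": 0.20}),
    ("change_email", {"prompt_injection": 0.10, "social_engineering": 0.71}),
]
for action, activations in requests:
    decision = gate_action(action, activations)
    log_decision(conn, decision)
    print(decision.verdict, decision.action)
```

Storing the triggering features alongside each verdict is what would let a reviewer later see why an action was blocked or escalated, which is the auditability the description emphasizes.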
10 May 2026