birddog puts a leash on AI scraping agents. One context manager wraps an agent that hits the web and gives you:

1. Domain allowlist with wildcard subdomains: everything outside it is denied
2. Per-domain rate caps via token bucket, one bucket per host
3. JSONL audit log with one line per fetch (url, status, bytes, ms)
4. Bright Data Web Unlocker proxy as an opt-in flag
5. Streamlit dashboard showing per-host bytes, denials, and p50 latency

LLM agents don't know what a sane scraping cadence looks like. They'll hammer a site, follow links into spammy subdomains, and burn through a Bright Data quota in a single run. birddog stops that.

Use it as a context manager:

    from birddog import Birddog  # assumed import path

    bd = Birddog(
        allowed_domains={"docs.brightdata.com", "*.example.com"},
        per_domain_qps=1.0,
        audit_path="runs/scrape.jsonl",
        bright_data={"host": "brd.superproxy.io:33335", "username": "...", "password": "..."},
    )

    with bd.session("research-bot") as s:
        r = s.fetch("https://docs.brightdata.com/api")
        s.fetch("https://evil.example/exfil")  # raises DomainDeniedError; the denial is logged

The audit log is JSONL: one event per fetch, including domain_denied and rate_limited events. The bundled Streamlit dashboard reads the log and shows total fetches, denials, bytes, and a per-host breakdown with p50 latency.

Built for research bots, price trackers, and RAG ingest jobs that hit live sites. Pairs with agentleash for USD budget caps on the same agent.

Includes a Jupyter notebook walkthrough and two runnable example scripts. 10 tests, MIT license.
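To make the allowlist and per-host token bucket concrete, here is a minimal standalone sketch of how such a gate could work. It is not birddog's actual code: `host_allowed`, `TokenBucket`, and `Gate` are hypothetical names, and it assumes that `*.example.com` matches subdomains but not the bare `example.com`, and that each bucket holds at most one token. Only the `domain_denied` and `rate_limited` event names come from the description above.

```python
import time
from urllib.parse import urlsplit


def host_allowed(host: str, allowed: set[str]) -> bool:
    """Return True if host matches an exact entry or a '*.suffix' wildcard."""
    host = host.lower().rstrip(".")
    for pattern in allowed:
        pattern = pattern.lower()
        if pattern.startswith("*."):
            suffix = pattern[1:]  # "*.example.com" -> ".example.com"
            # Assumption: wildcards cover subdomains only, not the bare domain.
            if host.endswith(suffix) and len(host) > len(suffix):
                return True
        elif host == pattern:
            return True
    return False


class TokenBucket:
    """Refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate: float, capacity: float = 1.0, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.clock = clock
        self.updated = clock()

    def try_acquire(self) -> bool:
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


class Gate:
    """Hypothetical gate: allowlist check first, then one bucket per host."""

    def __init__(self, allowed, qps: float, clock=time.monotonic):
        self.allowed = set(allowed)
        self.qps = qps
        self.clock = clock
        self.buckets = {}

    def check(self, url: str) -> str:
        host = (urlsplit(url).hostname or "").lower()
        if not host_allowed(host, self.allowed):
            return "domain_denied"
        if host not in self.buckets:
            self.buckets[host] = TokenBucket(self.qps, clock=self.clock)
        if not self.buckets[host].try_acquire():
            return "rate_limited"
        return "ok"
```

Under these assumptions, a gate built with `qps=1.0` lets one request per host through each second; a second request to the same host within that second comes back as `rate_limited`, while a different allowed host draws from its own bucket.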