Bright Data Datasets

Bright Data Datasets is a marketplace of pre-collected, validated datasets sourced from over 100 popular websites. Teams that need structured data for AI training, market research, or business intelligence can purchase or subscribe to a dataset and receive clean, structured records without building or maintaining any scraping infrastructure.

General
Developer	Bright Data
Type	Ready-made Data Marketplace
Sources	100+ popular websites and platforms
Documentation	docs.brightdata.com/datasets
Product Page	brightdata.com/products/datasets

Core Features

100+ platform datasets: pre-collected data from Amazon, LinkedIn, Instagram, TikTok, YouTube, Reddit, Glassdoor, and dozens of other sources.
Clean and validated records: data is structured, deduplicated, and validated before delivery, reducing processing overhead.
Multiple delivery formats: JSON, CSV, and other formats available depending on the dataset.
Scheduled refresh: subscribe to datasets that update on a set schedule (daily, weekly, or custom) to keep data current.
Instant download: purchase a snapshot and download immediately, with no wait for scraping to complete.
Custom datasets: request a custom dataset from a specific source if it is not already in the marketplace.
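As a rough illustration of the "multiple delivery formats" point, the sketch below parses the same record from JSON and from CSV and coerces a numeric field so both look alike downstream. The field names (`title`, `price`) and the sample data are hypothetical and do not reflect an actual Bright Data schema; only the Python standard library is used.

```python
import csv
import io
import json


def load_records(text, fmt):
    """Parse snapshot text into a list of dicts. fmt is 'json' or 'csv'."""
    if fmt == "json":
        data = json.loads(text)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of records")
        return data
    if fmt == "csv":
        return list(csv.DictReader(io.StringIO(text)))
    raise ValueError(f"unsupported format: {fmt}")


def normalize(record):
    """Coerce the hypothetical 'price' field to float; CSV yields strings."""
    out = dict(record)
    out["price"] = float(out["price"])
    return out


# Hypothetical sample data: field names are illustrative, not a real schema.
json_text = '[{"title": "Widget", "price": 9.99}]'
csv_text = "title,price\nWidget,9.99\n"

from_json = [normalize(r) for r in load_records(json_text, "json")]
from_csv = [normalize(r) for r in load_records(csv_text, "csv")]
```

CSV carries every value as a string, so a small normalization step like this is usually needed before records from different formats can be compared or joined.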

Common Dataset Categories

E-commerce product listings and pricing (Amazon, eBay, Shopify stores)
Social media profiles and posts (LinkedIn, Instagram, TikTok, Reddit)
Review and rating data (Glassdoor, Yelp, Trustpilot, Google Maps)
Real estate listings (Zillow, Realtor.com)
Job postings (LinkedIn, Indeed, Glassdoor)
Video and content metadata (YouTube, podcasts)

Tools and Resources

Dataset Marketplace: browse available datasets, preview schema, and purchase or subscribe.
Scrapers Overview: documentation on how datasets are built and maintained.
Python SDK: access 100+ datasets programmatically via API.
Custom Dataset Request: submit a request for a dataset not yet in the marketplace.

Ecosystem and Integrations

Datasets integrate with AI training pipelines, vector databases, and data warehouses via standard formats.
Available alongside Bright Data's scraping APIs for teams that need both pre-built and custom data collection.
The Python SDK and JavaScript SDK expose dataset access programmatically for automated ingestion.
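One way to picture automated ingestion into a warehouse is sketched below, with an in-memory SQLite database standing in for the warehouse. The table name, field names, and records are hypothetical, and no Bright Data SDK call is shown; the records are assumed to have been downloaded already. The point is only that a keyed replace lets a scheduled refresh overwrite older rows instead of duplicating them.

```python
import sqlite3

# Hypothetical records; the field names are illustrative, not a real dataset schema.
records = [
    {"product_id": "p1", "title": "Widget", "price": 9.99},
    {"product_id": "p2", "title": "Gadget", "price": 24.50},
    {"product_id": "p1", "title": "Widget", "price": 8.99},  # later refresh of p1
]

conn = sqlite3.connect(":memory:")  # stand-in for a real warehouse connection
conn.execute(
    "CREATE TABLE products (product_id TEXT PRIMARY KEY, title TEXT, price REAL)"
)

# INSERT OR REPLACE overwrites the existing row when the primary key repeats,
# so the later refresh of p1 wins.
conn.executemany(
    "INSERT OR REPLACE INTO products (product_id, title, price) "
    "VALUES (:product_id, :title, :price)",
    records,
)
conn.commit()

rows = conn.execute(
    "SELECT product_id, price FROM products ORDER BY product_id"
).fetchall()
```

A production warehouse would use its own bulk-load or merge mechanism; the keyed-overwrite idea carries over.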

Browse available datasets and preview schemas at brightdata.com/products/datasets.

Bright Data Datasets AI technology Hackathon projects

Discover innovative solutions crafted with Bright Data Datasets AI technology, developed by our community members during our engaging hackathons.

Neuroloom — Family Care Command Center

THE PROBLEM
Over 53 million unpaid family caregivers coordinate aging parent care across WhatsApp, sticky notes, and scattered PDFs. Medication changes get lost during shift handoffs, and emergencies leave families scrambling for critical information.

THE SOLUTION
Neuroloom is a Family Care Command Center: a multi-agent AI platform giving every caregiver in a "care circle" one live workspace. A Conductor agent routes tasks through nine specialized agents: MedGuard (medication extraction), Schedule Keeper (reminders), Document Vault (care document indexing), Handoff (shift briefings), Check-in Companion (daily wellness), Emergency Pack (PIN-protected shareable care packet), Family Sync (task coordination), and Trend Analyst (pattern detection). Four care modes tailor workflows: Post-Hospital, Dementia, Chronic Care, and Long-Distance.

KEY FEATURES
• Live Agent Feed: WebSocket stream of agent activity in real time
• Care Knowledge Graph: interactive visualization of meds, events, documents, and handoffs
• Senior View: large-text accessible interface for care recipients
• Emergency Pack: one-tap shareable packet for EMTs and hospital staff

AMD + GEMMA
Sensitive care data routes to Gemma on AMD GPUs first via our OpenAI-compatible inference service (ROCm + vLLM on AMD Developer Cloud). When AMD is unavailable, the system falls back to Gemma on Fireworks AI. A live dashboard badge confirms "Gemma on AMD" status.

MARKET
Family caregiving is a massive unpaid labor market. Neuroloom targets the coordination gap between hospital discharge and daily home care, when families are most overwhelmed and most willing to adopt tools.

Stack: Next.js, FastAPI, PostgreSQL, Redis, Docker. Fully containerized. MIT licensed.

Disclaimer: Care coordination tool only, not medical advice.

Data Center Ops

Data Center Ops is a real-time, on-device AI assistant that helps data center technicians inspect and assemble server racks correctly the first time. Over 1,000 racks are built every week: dense meshes of cables, trays, and ports assembled at volume by labor crews cross-referencing paper guides. Roughly 80% are installed incorrectly on the first attempt. DC-Ops turns any Snapdragon phone into a smart inspection tool: point the camera at a rack and it instantly identifies 16 classes of components (compute trays, network ports, LEDs, cables, fans, drive bays, power shelves, DPUs and more), drawing live overlays on the camera feed. Everything runs entirely on-device using PyTorch and ExecuTorch, compiled to the Qualcomm QNN Hexagon Tensor Processor (HTP) NPU on the Samsung Galaxy S25 Ultra (Snapdragon 8 Elite, SM8750). This matters because data centers that manage private data have limited cloud availability, yet technicians need instant feedback. By running inference directly on the NPU with INT8 quantization, DC-Ops delivers low-latency, power-efficient detection with zero data ever leaving the device. We trained on 2,036 human-labeled images across four datasets, bootstrapped with a Bright Data web-scraping + auto-labeling pipeline (Grounding DINO + SAM). Models and dataset are published openly on Hugging Face.
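For readers unfamiliar with INT8 quantization, the sketch below shows the basic idea in plain Python: floats are mapped to 8-bit integers with a single scale factor and mapped back with a small rounding error. This is a generic symmetric per-tensor scheme chosen for illustration, with made-up weights; it is not the project's actual ExecuTorch/QNN pipeline, whose quantization settings are not described here.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization of floats to the int8 range [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Map int8 values back to approximate floats."""
    return [q * scale for q in quantized]


# Made-up weights for illustration only.
weights = [0.5, -1.0, 0.25, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# Each restored value differs from the original by at most half a step.
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

Storing one byte per value instead of four is what makes this attractive on a phone NPU, at the cost of the bounded rounding error computed above.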

RevDev

CaliSignal is an AI SDR and GTM intelligence platform built for California startups. It uses Bright Data live web signals to discover high-fit accounts, detect buying windows, score ICP match, and identify the right stakeholders before competitors reach them. The platform turns fragmented public signals like funding news, hiring activity, job posts, reviews, and web changes into prioritized pipeline, account briefs, outreach drafts, alerts, and action workflows. Instead of making SDRs manually research accounts, CaliSignal gives teams a live revenue cockpit that shows who to contact, why now, what to say, and what action to take next.

VanTage - Due diligence, on a timeline

Private equity associates spend roughly 40 hours per target on preliminary due diligence—a full week lost to browser tabs, public filings, news archives, and litigation records, manually stitched into something an investment committee will trust. Most of that week isn't analysis; it's gathering. Vantage does it in 40 seconds. Point it at a company and it pulls from 12 distinct web sources at once, assembling them into a live knowledge graph: a connected map of the target's people, financials, legal exposure, customers, suppliers, and reputational signals. Relationships that normally take days to cross-reference appear instantly—the board member sitting on a competitor's audit committee, the lawsuit filed quietly three states away, the executive churn that began before the numbers softened. Red flags don't wait to be found. Vantage automatically surfaces litigation spikes, leadership departures, restatements, and regulatory actions, ranked and explained. A 90-day time slider lets you drag through recent history and watch the target's profile change—because knowing when something shifted is often more revealing than knowing that it did. Every claim is cited back to its source, so partners can audit it and committees can rely on it. Every memo lands IC-ready. This defensibility is powered by Bright Data, whose reliable, large-scale, structured web access makes a trustworthy knowledge graph possible where brittle scrapers and stale databases fail. We target middle-market PE associates—the highest willingness-to-pay segment in B2B software, where seats command $500–$2,000 per month. Their time is billed against nine-figure decisions, and a single avoided bad deal or faster close justifies the spend many times over. Vantage turns the most tedious week in the deal process into a 40-second starting point.

ROGUE: Open-web LLM Threat Intelligence Agent

A new way to jailbreak AI appears on Reddit, X, or arXiv almost every day. By the time a quarterly red-team catches it, it has already worked on a production chatbot. ROGUE closes that gap: the red-team that never sleeps.

ROGUE is an autonomous red-team agent. It continuously harvests new LLM attacks from 19 live open-web sources: Reddit/X jailbreak communities, arXiv, GitHub (the Pliny umbrella), HuggingFace, MITRE ATLAS, OWASP, and vendor safety blogs. It then reproduces each against YOUR deployment: your system prompt, your declared tools, your target model, scored together. Not a bare model. Not a frozen test bank. Your actual setup, against today's attacks.

It's the only project here using Bright Data MCP on BOTH sides. As a consumer, the discovery agent reasons over Bright Data's MCP tools (Web Scraper, SERP, Web Unlocker, Scraping Browser) to reach sources that block bots. As a producer, ROGUE exposes its own MCP server. Try it now: the dashboard has one-click "Add to Cursor / VS Code" buttons, and the hosted endpoint (rogue-api-mr5w.onrender.com/mcp) needs zero setup. Connect it and ask, from your own IDE, "what new attacks broke our support bot in the last 24 hours?", live, during judging.

The numbers are real, not a demo fixture. One live sweep: 8,321 breach trials across 6 deployment configs, a 16.5× vulnerability spread between weakest and strongest model. A separate judge scores every trial (REFUSED / EVADED / PARTIAL / FULL) and is calibrated against blind human labels (98% breach-axis agreement) and validated on WildGuardTest and StrongREJECT, not "trust the AI." Bright Data spend: $0.15 per detected breach. Publication-to-breach: ~2 minutes.

It also red-teams multimodally, rendering text attacks as images and audio, because a jailbreak refused as text often succeeds as a picture of that text. Built solo in 6 days. Prior: GPTFuzz Grand Prize (Yonsei, 2024) and adversarial-ML research at AIM Intelligence.
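The headline metrics above (per-config breach rate, the spread between weakest and strongest config, and judge-versus-human agreement on the breach axis) could be computed along the following lines. This is a sketch under stated assumptions, not ROGUE's code: the verdicts are made up, the function names are hypothetical, and treating PARTIAL and FULL as breaches while REFUSED and EVADED are not is an assumption about how the four labels map to the breach axis.

```python
from collections import Counter

# Assumption: which verdicts count as a breach. The source does not define this.
BREACH_LABELS = {"PARTIAL", "FULL"}


def breach_rate(verdicts):
    """Fraction of trials whose verdict counts as a breach."""
    counts = Counter(verdicts)
    breaches = sum(counts[label] for label in BREACH_LABELS)
    return breaches / len(verdicts)


def vulnerability_spread(rates):
    """Ratio of the highest to the lowest non-zero breach rate."""
    nonzero = [r for r in rates.values() if r > 0]
    return max(nonzero) / min(nonzero)


def breach_agreement(judge, human):
    """Share of trials where judge and human agree on breach vs. no breach."""
    matches = sum(
        (j in BREACH_LABELS) == (h in BREACH_LABELS) for j, h in zip(judge, human)
    )
    return matches / len(judge)


# Made-up verdicts for two hypothetical deployment configs.
trials = {
    "config_a": ["REFUSED", "FULL", "PARTIAL", "FULL"],
    "config_b": ["REFUSED", "REFUSED", "EVADED", "FULL"],
}
rates = {name: breach_rate(v) for name, v in trials.items()}
spread = vulnerability_spread(rates)
agreement = breach_agreement(
    ["FULL", "REFUSED", "PARTIAL", "EVADED"],
    ["FULL", "REFUSED", "REFUSED", "EVADED"],
)
```

With this toy data the spread is 3.0 and the agreement is 0.75; the figures reported by the project come from its own trials and labels.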

🦷 Dental Surgical Simulator

The simulator presents a stylized, professional interface featuring a 3D viewport of a human dental arch. Users select from various surgical missions, each requiring a specific sequence of surgical steps. Success depends on selecting the correct surgical instrument and interacting with the precise target tooth at the right time.

Procedural Anatomy: Anatomically accurate tooth crowns and root clusters generated via math-based geometry.
Multi-Mode Visualization:
  X-Ray Mode: Toggles transparency of soft tissues (gums, palate) to reveal bone and roots.
  Wireframe Mode: Provides structural visualization of dental anatomy.
  View Switching: Toggle between Full Arch, Upper Jaw, and Lower Jaw views.
Dynamic Theming: Seamless transition between a professional high-contrast Dark Mode and a clean Light Mode.
Interactive HUD: Real-time feedback through toasts, tooltips, and a crosshair for precise targeting.
Advanced Missions: Includes complex procedures like Dental Implants, Root Canals, and Bone Grafting.