What I built

Gemma Miner is a command-line agent that turns a one-sentence brief, such as "build me a dataset of every CNIL sanction since 2011 with sector, fine amount, and the GDPR articles violated", into a typed Parquet dataset with a codebook and charts, in minutes.

It runs locally on Gemma 4 31B through Ollama, so your data never leaves your laptop. You can also point it at hosted Gemini Flash if you want speed. Same code, same agent.

Install:

    uv tool install gemma-miner
    gemma-miner

That's it.

PyPI: https://pypi.org/project/gemma-miner/
Repo: https://github.com/moncifem/gemma-miner

Why I built it

I kept hitting the same wall. The world has a lot of text: regulatory filings, court decisions, clinical trials, news, PDFs sitting on government websites. We have foundation models that can read all of it. What we don't have is the table you need to actually study any of it.

You can't run statistics on a paragraph. RAG won't tell you whether CNIL fines have grown 5× since 2014, or whether China has overtaken the US in clinical AI trials. Those questions need a typed table: columns with real types, rows with the same shape, nulls where the source was silent. It's the kind of dataset every social scientist or epidemiologist has been building by hand in Excel for a hundred years.

That's the gap I wanted to close.
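
To make "typed table" concrete, here is a minimal Python sketch using only the standard library. The `SanctionRow` fields and the `mean_fine_by_year` helper are hypothetical names modeled on the example brief, not Gemma Miner's actual schema or API, and the values are placeholders rather than real CNIL decisions. The point is only that once every row has the same shape and an explicit null, a question like "how did fines change by year?" becomes a few lines of code.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import Optional, Tuple


@dataclass(frozen=True)
class SanctionRow:
    # Hypothetical columns based on the example brief; not the tool's real schema.
    decision_date: date
    sector: str
    fine_eur: Optional[float]       # None when the source does not state an amount
    gdpr_articles: Tuple[str, ...]  # article numbers kept as strings, e.g. ("5", "32")


def mean_fine_by_year(rows):
    """Average the known fines per year, skipping rows where the fine is null."""
    by_year = {}
    for row in rows:
        if row.fine_eur is not None:
            by_year.setdefault(row.decision_date.year, []).append(row.fine_eur)
    return {year: mean(fines) for year, fines in sorted(by_year.items())}


# Placeholder values for illustration only; these are not real CNIL decisions.
rows = [
    SanctionRow(date(2019, 3, 1), "retail", 1000.0, ("5",)),
    SanctionRow(date(2019, 9, 1), "health", 3000.0, ("32",)),
    SanctionRow(date(2022, 6, 1), "telecom", None, ("5", "32")),
    SanctionRow(date(2022, 11, 1), "retail", 10000.0, ("6",)),
]

print(mean_fine_by_year(rows))
```

The row with a missing fine stays in the table but is left out of the average, which is the behavior you want from a null: the source was silent, so the statistic should not pretend otherwise.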
Category tags: