SI Copilot is a powerful, AI-driven tool that simplifies the process of creating high-quality datasets tailored to your unique requirements. Our model allows you to use your own data to design and build a dataset optimized for training and fine-tuning your large language model (LLM). Whether you're focusing on specialized topics, industry-specific information, or user-specific insights, SI Copilot provides an intuitive, hands-free approach to data generation, leveraging AI to refine and structure your input data. This empowers businesses, researchers, and developers to maximize the performance of their LLMs with relevant, high-quality training data customized for optimal results.
"I personally struggled a lot with gathering datasets for fine-tuning LLMs, and this project is a great use case for google/gemma-2-27b-it. Great job, team! It's really convenient that it automatically identifies which synthetic data to generate based on existing data. My suggestions for next steps would be: 1) Add processing of languages that are less popular among LLMs (e.g. Japanese). Fine-tuning for these languages is a big pain, and to create those types of datasets you need a model that can process and generate text in the language. The same goes for programming languages. 2) Make dataset schema creation more flexible and straightforward. It's really hard to categorise dataset schemas for all sorts of use cases, so I'd suggest coming up with a more generic way of setting the schema up."
Aleksei Naumov
Lead AI Product Engineer
"Very nice idea. Synthetic data generation is a very exciting topic for the ML community. I would recommend checking OS libraries such as distilabel as there might be some overlap with the project. As a suggestion, it would be great to clarify a bit what kind of input file (e.g. which requirements for csvs) would be very useful from the UI perspective. This project can be very good for synthetic datasets for niche use cases. Cool work!"
Omar Sanseviero
Gemma Lead