
DC-Ops is a research project that teaches a 7B language model to operate a datacenter through physics-grounded reinforcement learning. The agent reads a text-rendered NOC dashboard and issues natural-language operator commands, much as a real shift engineer would. The simulation is built on Meta's OpenEnv framework and combines an RC lumped-capacitance thermal network with a quadratic-loss UPS model (calibrated against APC WP-108), a state-machine diesel generator, and an automatic transfer switch. Six scenarios across three difficulty tiers cover thermal events (CRAC failures, cascades) and power events (UPS alarms, utility loss), with safety thresholds taken from ASHRAE TC 9.9. The environment is hosted at https://huggingface.co/spaces/Melikshah/dc_ops_env as a Hugging Face Space. Training runs in three phases. First, a reasoning teacher (DeepSeek-R1-Distill-Qwen-32B) plays 934 episodes via vLLM, narrating its decisions in a three-block think/reasoning/command format. Second, Qwen2.5-7B-Instruct is QLoRA fine-tuned with Unsloth on the cleaned, student-format data. Third, TRL's GRPOTrainer trains the model against the live simulation, scored by four reward functions: format compliance, a physics-based environment reward, scenario-aware command quality, and an anti-looping penalty. Several design choices shape the reward. Softplus barriers give smooth gradients near safety limits. The format reward is asymmetric, which prevents GRPO advantage collapse when the SFT model already produces clean output. A continuous proxy-health score plus a 4-step wait probe replaces sparse resolved/crashed signals with dense, action-attributable feedback. Mid-game prompts drawn from per-scenario warmup sequences expose the agent to substantively different simulator states. The whole stack runs on a single AMD Instinct MI300X under ROCm 7.2. GRPO takes about 4 hours and raises the composite reward by 188% over SFT. On the hardest cooling-cascade scenario, mean per-step reward rises from 0.004 with base Qwen2.5-7B-Instruct to 0.043 with the GRPO model. The code, dataset, and final QLoRA adapter are all open.
10 May 2026
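
The softplus barrier mentioned in the DC-Ops description can be illustrated with a minimal sketch. This is not the project's actual reward code: the function names, the 27 °C limit, and the sharpness value are assumptions chosen for illustration. The idea shown is that a softplus term approximates a hard hinge penalty max(0, T - limit) while keeping a nonzero gradient just below the limit.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def barrier_penalty(temp_c: float, limit_c: float = 27.0, sharpness: float = 2.0) -> float:
    """Smooth penalty that grows as temp_c approaches and exceeds limit_c.

    limit_c and sharpness are hypothetical defaults, not values from DC-Ops.
    Far below the limit the penalty is close to 0; far above it the penalty
    is close to -(temp_c - limit_c), like a hinge, but without the kink.
    """
    return -softplus(sharpness * (temp_c - limit_c)) / sharpness


if __name__ == "__main__":
    for t in (20.0, 26.0, 27.0, 28.0, 32.0):
        print(f"{t:5.1f} C -> penalty {barrier_penalty(t):8.4f}")
```

A larger sharpness makes the curve hug the hinge more tightly, at the cost of a steeper gradient change around the limit.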

Small and medium-sized (SME) fashion brands and startups often lack a good platform to serve their user base. Suppose a brand has a good website to showcase its products. Even so, busy customers rarely have time to browse page after page to find what they want, and conventional keyword-based search and filtering do not meet their needs well. That is why we are introducing ShopGenie. ShopGenie uses semantic search to help users find the product they want in the shortest possible time. If you can think of a product, just write a short description and we will return the most relevant products. If you like a shoe, take a photo and give it to us, and we will find matching products. If you have a photo but also want some changes, give us the photo along with your requirements, and we will find what you need. We also have a recommendation model that suggests products based on a user's favourites list. Finally, a sentiment analysis feature categorizes user feedback on a particular product. The second part of our project empowers the SME itself. A company that does not properly understand its users' requirements is likely to fail at some stage. With our platform, you can analyze user behavior from clicks through to feedback. For example, you might see that users aged 18-24 use your platform the most, and then manufacture more products for this group. Or suppose some products get clicks but are rarely added to favourites lists: you can try reducing the price or changing some features of those products. If a product gets no clicks or receives negative feedback, you can stop its production or make changes to it. Proper analysis helps you improve your website and your production effectively. For the prototype, we built this large dataset: https://huggingface.co/datasets/Melikshah/products
11 Oct 2024
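
The semantic search described for ShopGenie can be sketched as ranking products by cosine similarity between embedding vectors. This is only an illustration under stated assumptions: the three-dimensional vectors and product names below are made up, and a real system would obtain embeddings from a text or image embedding model rather than hard-coding them.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vec, catalog, k=2):
    """Return the names of the k catalog items most similar to query_vec."""
    ranked = sorted(
        catalog.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy embeddings; real ones would come from an embedding model.
CATALOG = {
    "red running shoe": [0.9, 0.1, 0.8],
    "blue denim jacket": [0.1, 0.9, 0.2],
    "white sneaker": [0.8, 0.2, 0.6],
}

if __name__ == "__main__":
    query = [1.0, 0.0, 0.7]  # stands in for an embedded description such as "sporty shoe"
    print(top_k(query, CATALOG))
```

Because text descriptions and photos can be embedded into the same vector space by a multimodal model, the same ranking step could serve both the description-based and the photo-based queries.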