By commonly cited estimates, ML teams spend 60–80% of their time on data cleaning, not modeling. Many existing AI cleaning tools demand API keys, send your data to third-party servers, charge per row, and produce non-deterministic output that breaks reproducibility. For regulated industries (healthcare, finance, defense), that simply isn't an option.

CleanML Vision solves this with a different philosophy: AI-BUILT, AI-FREE. IBM Bob accelerated development over 48 hours, but the shipped product uses ZERO external AI APIs at runtime. It runs 100% locally on your laptop. Your data never leaves your machine.

The tool handles both modalities ML engineers actually use, plus the preparation steps that follow cleaning:

TABULAR (CSV): 59+ deterministic operations across 8 families: missing-value handling, outlier detection (IQR, Z-score, Isolation Forest, DBSCAN), categorical encoding (one-hot, label, frequency, target), scaling (Standard/MinMax/Robust/Log), text cleaning with stopword removal, datetime feature extraction, fuzzy label normalization ("Male"/"M"/"male" -> "Male"), and cross-field validation rules.

IMAGE DATASETS: Upload a ZIP, profile every image (dimensions, formats, channels, integrity), detect duplicates via perceptual hashing, flag blurry shots via Laplacian variance, run augmentation pipelines, and export as ZIP, NumPy arrays, or PyTorch tensors.

ML PREP: Feature engineering with a formula builder (BMI = weight/(height/100)²), stratified train/test split with ZIP download, class balancing via SMOTE, dimensionality reduction with PCA/VarianceThreshold/SelectKBest, and multi-CSV joins.

The killer feature: every operation generates equivalent pandas/sklearn code. Users download a complete Jupyter notebook that reproduces the entire pipeline on any future dataset, with no need for CleanML Vision afterwards.

Stats: 21 HTTP endpoints, 111 passing tests, 81% code coverage, 1 GB upload limit.

Stack: Python · Flask · pandas · scikit-learn · imbalanced-learn · Pillow · OpenCV · imagehash · Plotly.js · vanilla JS (no framework).
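To make "deterministic" concrete, here is a minimal sketch of two of the tabular operations named above: fuzzy label normalization and IQR outlier flagging. It is not CleanML Vision's code. It uses only the Python standard library so it runs anywhere, and the function names, the alias table, and the 0.8 similarity cutoff are assumptions chosen for illustration. The same input always yields the same output, which is the property the description relies on.

```python
from difflib import get_close_matches
from statistics import quantiles

# Hypothetical canonical labels and short aliases (assumptions for this sketch).
CANONICAL = ["Male", "Female"]
ALIASES = {"m": "Male", "f": "Female"}


def normalize_label(raw, canonical=CANONICAL, aliases=ALIASES, cutoff=0.8):
    """Map a messy label to its canonical form; return it unchanged if no match."""
    key = raw.strip().lower()
    if key in aliases:  # explicit short forms such as "M"
        return aliases[key]
    lookup = {c.lower(): c for c in canonical}
    if key in lookup:  # case-insensitive exact match
        return lookup[key]
    # Fuzzy fallback for near-misses such as "femal".
    match = get_close_matches(key, list(lookup), n=1, cutoff=cutoff)
    return lookup[match[0]] if match else raw


def iqr_outlier_flags(values, k=1.5):
    """Return one boolean per value: True if it lies outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # statistics.quantiles defaults to the "exclusive" method; pandas interpolates
    # differently by default, so quartiles can differ slightly on small samples.
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v < low or v > high for v in values]


labels = ["Male", "M", "male", " femal ", "unknown"]
cleaned = [normalize_label(x) for x in labels]

ages = [10, 12, 11, 13, 12, 11, 95]
flags = iqr_outlier_flags(ages)
```

In this sketch, labels that match nothing (such as "unknown") are left unchanged rather than guessed, and only the last value in `ages` is flagged. A real pipeline would presumably express the same steps in pandas so they can be exported to a notebook.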
Category tags: