
Robot training for VLA models is expensive and hard. What if first-person video of a human hand could be used to train robots instead? Why is this helpful? Anyone with a Mac or an iPhone could start collecting training data, get paid for it, and robotics data collection would accelerate.

Here's the idea: real-time camera data is run through MediaPipe, SAM 3, YOLO, and a VLM, so the egocentric footage gets enriched with accurate masks from SAM 3, reasoning from the VLM, and hand-joint data from MediaPipe. All of this is super useful for training robots cheaply and quickly: from a lack of data, you get to an abundance of data fast. It's especially valuable for specialized tasks that only certain people can do, often in remote places.

What the pipeline does (rough sketches of several pieces follow the list):

- Object tracking with persistent IDs across frames
- Zero-shot state classification via SigLIP → 200x faster than a VLM for open/closed/ajar labels
- Navigation state classification via VLM (doors, drawers, handles → open/closed/ajar/blocked)
- Temporal diff → the VLM compares consecutive frames to detect state transitions
- Navigation timeline → a per-object state timeline with colored bars and transition events
- Hand-object interactions via MediaPipe
- Ground truth export → structured JSON with per-frame annotations
- Accuracy evaluation → compare predictions against manual labels
- Live perception → real-time webcam inference with auto-recording and post-analysis
- H.264 video export → browser-playable annotated videos with in-app preview
- Per-frame timing → inference latency breakdown per model stage

For the VLM I tried Gemini 3 Flash, Cosmos, and Gemma.
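Persistent IDs are the glue for everything downstream: a state timeline only makes sense if frame N's door is the same door as frame N+1's. A minimal sketch, assuming the Ultralytics tracking API; the checkpoint name and video path are placeholders:

```python
# Sketch: persistent object IDs across frames via Ultralytics' built-in tracker.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # placeholder checkpoint

# persist=True carries tracker state across calls, so IDs survive frame to frame
for result in model.track(source="egocentric_clip.mp4", persist=True, stream=True):
    if result.boxes.id is None:  # tracker may skip ID assignment on some frames
        continue
    for box, track_id in zip(result.boxes.xyxy, result.boxes.id):
        print(int(track_id), [round(v) for v in box.tolist()])  # same ID = same object
```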
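The SigLIP classifier is what makes per-frame state labels cheap: one image-text similarity pass over a small object crop instead of a full VLM generation, which is where the ~200x speedup comes from. A minimal sketch with Hugging Face transformers; the checkpoint and prompt phrasing are my assumptions:

```python
# Sketch: zero-shot open/closed/ajar classification with SigLIP.
import torch
from PIL import Image
from transformers import AutoModel, AutoProcessor

processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
model = AutoModel.from_pretrained("google/siglip-base-patch16-224")

labels = ["a door that is open", "a door that is closed", "a door that is ajar"]
crop = Image.open("door_crop.jpg")  # per-object crop from the tracker

# SigLIP was trained with padded text and a sigmoid loss, hence the two quirks below
inputs = processor(text=labels, images=crop, padding="max_length", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # shape (1, len(labels))
probs = torch.sigmoid(logits)                  # sigmoid, not softmax
print(labels[probs.argmax().item()], probs.max().item())
```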
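For the temporal diff, the VLM is only asked about transitions, not per-frame states. A sketch of the prompt-and-parse step, where `ask_vlm` is a hypothetical stand-in for whichever client is wired in (Gemini 3 Flash, Cosmos, or Gemma):

```python
STATES = ("open", "closed", "ajar", "blocked")

def detect_transition(prev_crop, curr_crop, ask_vlm):
    """Return (before, after) if the VLM reports a state change, else None.

    ask_vlm(images, prompt) -> str is a hypothetical client wrapper; swap in
    the actual Gemini 3 Flash / Cosmos / Gemma call here.
    """
    prompt = (
        "These are two consecutive views of the same object. "
        "Did its state change? Reply 'before -> after' using only "
        f"{', '.join(STATES)}, or 'none' if unchanged."
    )
    answer = ask_vlm([prev_crop, curr_crop], prompt).strip().lower()
    if "->" not in answer:
        return None  # covers 'none' and any unparseable reply
    before, after = (s.strip() for s in answer.split("->", 1))
    return (before, after) if before in STATES and after in STATES else None
```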
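Hand-object interaction detection reduces to geometry once MediaPipe has the landmarks: if a fingertip falls inside a tracked object's box, count it as a touch. A minimal sketch using MediaPipe's 21-landmark hand model:

```python
# Sketch: flag a hand-object interaction when a fingertip enters an object bbox.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
FINGERTIPS = (4, 8, 12, 16, 20)  # thumb, index, middle, ring, pinky tips

def touches(frame_bgr, bbox_xyxy):
    """True if any detected fingertip falls inside bbox_xyxy (pixel coords)."""
    h, w = frame_bgr.shape[:2]
    x1, y1, x2, y2 = bbox_xyxy
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))
    for hand in result.multi_hand_landmarks or []:       # None when no hands
        for tip in FINGERTIPS:
            px, py = hand.landmark[tip].x * w, hand.landmark[tip].y * h
            if x1 <= px <= x2 and y1 <= py <= y2:
                return True
    return False
```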
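The ground truth export is just the fused per-frame record serialized to JSON. The field names below are my guess at a sensible schema, not the exact export format:

```python
import json

# Hypothetical per-frame record: detections, masks, states, hands, interactions.
annotation = {
    "frame_index": 1042,
    "timestamp_s": 34.73,
    "objects": [
        {
            "track_id": 7,
            "label": "door",
            "bbox_xyxy": [412, 80, 890, 1020],
            "state": "ajar",       # from SigLIP / VLM
            "mask_rle": "...",     # SAM 3 mask, run-length encoded (placeholder)
        }
    ],
    "hands": [
        {"handedness": "Right", "landmarks": [[0.41, 0.62, 0.01]]}  # normalized x/y/z
    ],
    "interactions": [{"hand": "Right", "track_id": 7}],
}

with open("ground_truth.json", "w") as f:
    json.dump({"frames": [annotation]}, f, indent=2)
```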
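For browser-playable output, H.264 with the yuv420p pixel format is the safe combination for `<video>` tags. A sketch that assumes the annotated frames were dumped as numbered PNGs:

```python
import subprocess

# Encode annotated frames to a browser-playable MP4 with ffmpeg.
subprocess.run(
    [
        "ffmpeg", "-y",
        "-framerate", "30",
        "-i", "annotated/%06d.png",
        "-c:v", "libx264",
        "-pix_fmt", "yuv420p",   # required by most browsers and QuickTime
        "annotated.mp4",
    ],
    check=True,
)
```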
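Per-frame timing is a tiny context manager wrapped around each model stage; the stage names below are illustrative:

```python
import time
from contextlib import contextmanager

timings: dict[str, float] = {}

@contextmanager
def stage(name):
    """Record wall-clock latency of one model stage for the current frame."""
    t0 = time.perf_counter()
    yield
    timings[name] = (time.perf_counter() - t0) * 1000  # milliseconds

# Per frame:
with stage("yolo"):
    ...  # detection
with stage("sam3"):
    ...  # segmentation
with stage("siglip"):
    ...  # state classification
print(timings)  # e.g. {'yolo': 12.4, 'sam3': 38.1, 'siglip': 6.2}
```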
15 Feb 2026