
The race to train foundation Vision-Language-Action (VLA) models for embodied AI relies heavily on teleoperation and egocentric video. The most cost-effective data pipeline films humans performing physical tasks and maps those movements to robots. This approach has an expensive flaw: human hands frequently drift out of the headset’s field of view (FoV). Because models need continuous visual tracking of the hands to learn from a trajectory, millions of dollars are currently wasted on "blind data": trajectories that are useless for training because the hands are not visible.

InSight addresses this multi-million-dollar data leak at the source. We have developed a lightweight, on-device computer vision pipeline optimized for fast edge inference on spatial computing hardware. By deploying the MediaPipe Hand Detection model via Qualcomm's AI Hub, our system runs entirely locally, with no cloud round-trip latency. The model continuously tracks hand presence, hand coordinates, and visibility margins within the frame in real time.
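To make the idea of a visibility margin concrete, here is a minimal Python sketch of how a per-frame check might look. It is not InSight's actual implementation: the `HandBox` type, the `visibility_margin` and `classify` functions, the normalized 0..1 coordinate convention, and the 0.05 warning threshold are all assumptions chosen for illustration, and the detector itself is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandBox:
    """Hypothetical hand bounding box in normalized frame coordinates (0..1)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def visibility_margin(box: HandBox) -> float:
    """Distance from the box to the nearest frame edge.

    A negative value means the box extends past the frame boundary.
    """
    return min(box.x_min, box.y_min, 1.0 - box.x_max, 1.0 - box.y_max)


def classify(box: Optional[HandBox], warn_margin: float = 0.05) -> str:
    """Label one frame; warn_margin is an assumed, tunable threshold."""
    if box is None:
        # The detector reported no hand in this frame.
        return "out_of_view"
    if visibility_margin(box) < warn_margin:
        # The hand is close to, or crossing, a frame edge.
        return "near_edge"
    return "visible"


if __name__ == "__main__":
    frames = [
        HandBox(0.40, 0.40, 0.60, 0.60),  # centered hand
        HandBox(0.90, 0.40, 0.98, 0.60),  # drifting toward the right edge
        None,                             # no detection
    ]
    for frame in frames:
        print(classify(frame))
```

A "near_edge" label is the useful one in this sketch: it could trigger a prompt to the operator before the hand leaves the frame, rather than discovering the gap after the recording is finished.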
28 Jun 2026