The Multimodal-driven Computer Interface (MMCI) is a framework that enables multimodal models to operate a computer. Mimicking human input methods, MMCI processes visual and auditory cues, interprets on-screen content, and generates mouse and keyboard actions to achieve a specified objective. To refine its mouse click predictions, it combines advanced computer vision techniques with grid drawing, and it adapts to user preferences to improve the overall user experience. The framework aims to redefine human-computer interaction by offering a natural, intuitive way to control a computer through speech, text, and gestures rather than through traditional input methods alone. MMCI has the potential to improve accessibility, productivity, and entertainment.
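
The description does not say how the drawn grid is used to refine click predictions, so the following is only a minimal sketch of one plausible approach: the model is shown a screenshot overlaid with labeled cells, answers with a cell label, and that label is converted to pixel coordinates for the click. The `Grid` class, the `parse_label` and `label_to_click` helpers, and the letter-plus-number labeling scheme (for example `C4`) are all hypothetical names introduced for illustration, not part of MMCI itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grid:
    """Hypothetical uniform grid laid over a screenshot (an assumption, not MMCI's API)."""
    width: int   # screenshot width in pixels
    height: int  # screenshot height in pixels
    cols: int    # number of grid columns
    rows: int    # number of grid rows

    def cell_center(self, col: int, row: int) -> tuple[int, int]:
        """Return the pixel at the center of the zero-indexed cell (col, row)."""
        if not (0 <= col < self.cols and 0 <= row < self.rows):
            raise ValueError(f"cell ({col}, {row}) is outside the grid")
        cell_w = self.width / self.cols
        cell_h = self.height / self.rows
        return int((col + 0.5) * cell_w), int((row + 0.5) * cell_h)


def parse_label(label: str) -> tuple[int, int]:
    """Parse a label such as 'C4' into a zero-indexed (col, row) pair.

    Assumed scheme: one letter A-Z for the column, then a 1-based row number.
    """
    label = label.strip().upper()
    if len(label) < 2 or not ("A" <= label[0] <= "Z") or not label[1:].isdecimal():
        raise ValueError(f"unrecognized grid label: {label!r}")
    return ord(label[0]) - ord("A"), int(label[1:]) - 1


def label_to_click(grid: Grid, label: str) -> tuple[int, int]:
    """Convert a model-predicted cell label into click coordinates."""
    col, row = parse_label(label)
    return grid.cell_center(col, row)
```

With a 1920x1080 screenshot split into 16 columns and 9 rows, each cell is 120 pixels square, so `label_to_click(Grid(1920, 1080, 16, 9), "C4")` returns `(300, 420)`. Under this scheme the model only has to choose among a small set of labeled cells instead of predicting raw pixel coordinates, and a finer grid could be drawn inside the chosen cell if more precision is needed.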