Our project tackles the challenge of retrieving relevant information paired with matching images, addressing the limitations of traditional text-only retrieval methods. We propose a Multi-Modal RAG application that combines text and visual data to enhance user understanding and trust. By grounding textual responses in visual evidence, the application reduces the risk of hallucinations from the language model. Currently, our solution ingests static PDFs and YouTube videos, allowing users to ask questions and receive accurate answers accompanied by the relevant images or video frames. We are exploring dynamic multi-modal vector databases to handle user data more effectively, leveraging on-device GPUs for privacy and security. This direction leads toward a "Personalized Knowledge Model" (PKM) built on knowledge graphs, without exposing user data to cloud services. The project includes comprehensive Colab tutorials on GitHub so that others can reproduce and build on these pipelines. Future enhancements include dynamic multi-modal RAG, on-device privacy, knowledge-graph construction without LLMs, and broader open-source collaboration.
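As a concrete illustration of the retrieval step, the sketch below shows how a shared text/image embedding space can return both a supporting text chunk and a matching video frame for a query. It is a minimal sketch assuming the sentence-transformers library with a CLIP model; the model name, file paths, and corpus contents are hypothetical placeholders, not the project's actual implementation.

```python
# Minimal multi-modal retrieval sketch (illustrative only).
# Assumes: sentence-transformers with a CLIP checkpoint; example file
# names and text chunks below are placeholders.

from PIL import Image
from sentence_transformers import SentenceTransformer, util

# CLIP maps text and images into the same vector space, so one query
# embedding can be compared against both modalities.
model = SentenceTransformer("clip-ViT-B-32")

# Example corpus: text chunks extracted from a PDF and frames sampled
# from a YouTube video (hypothetical paths).
text_chunks = [
    "Figure 3 shows the attention weights across encoder layers.",
    "The training loss converges after roughly 10 epochs.",
]
frame_paths = ["frames/lecture_0012.jpg", "frames/lecture_0345.jpg"]

text_embs = model.encode(text_chunks, convert_to_tensor=True)
image_embs = model.encode(
    [Image.open(p) for p in frame_paths], convert_to_tensor=True
)

def retrieve(query: str, top_k: int = 1):
    """Return the best-matching text chunk and video frame for a query."""
    q = model.encode(query, convert_to_tensor=True)
    text_hit = util.semantic_search(q, text_embs, top_k=top_k)[0][0]
    image_hit = util.semantic_search(q, image_embs, top_k=top_k)[0][0]
    return (
        text_chunks[text_hit["corpus_id"]],
        frame_paths[image_hit["corpus_id"]],
    )

text, frame = retrieve("How does the training loss behave?")
print("Supporting text:", text)
print("Supporting frame:", frame)
```

In the full application, the text and image lookups would typically be linked so that the returned frame comes from the same document region or timestamp as the retrieved text; the independent top-1 searches above are kept separate only to keep the sketch short.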