Aria is the flagship Mixture-of-Experts (MoE) model developed by Rhymes AI, designed to handle multimodal inputs such as text, images, and video. This open-source model focuses on efficiency and high performance. During inference, Aria activates only 3.9 billion of its 25.3 billion total parameters, making it one of the fastest multimodal AI systems available today. Aria processes diverse data formats seamlessly, leveraging its 64K-token multimodal context window to deliver comprehensive insights. This allows it to handle long-form content with speed and precision, such as captioning a 256-frame video in just 10 seconds.
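The sparse activation behind those numbers comes from MoE routing: a small gating network scores the experts for each token, and only the top-k experts actually run, so most parameters sit idle on any given forward pass. Below is a minimal, self-contained sketch of that idea in Python. The dimensions, expert count, and top-k value are toy assumptions for illustration, not Aria's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

HIDDEN = 16     # toy hidden size; Aria's real dimensions are far larger
N_EXPERTS = 8   # toy expert count, an illustrative assumption
TOP_K = 2       # experts activated per token

# Each "expert" is reduced to a single weight matrix for clarity.
experts = [rng.standard_normal((HIDDEN, HIDDEN)) * 0.1 for _ in range(N_EXPERTS)]
gate_w = rng.standard_normal((HIDDEN, N_EXPERTS)) * 0.1  # router weights

def moe_layer(x):
    """Route token vector x to its top-k experts; only those experts run."""
    logits = x @ gate_w
    top = np.argsort(logits)[-TOP_K:]   # indices of the k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()            # softmax over the selected experts only
    # Only TOP_K of the N_EXPERTS weight matrices are touched for this token.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(HIDDEN)
out = moe_layer(token)
active_fraction = TOP_K / N_EXPERTS  # share of expert parameters used per token
print(out.shape, active_fraction)
```

In this toy setup only a quarter of the expert parameters are exercised per token; Aria's 3.9B-of-25.3B ratio is the same principle at production scale.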
Aria’s efficiency and performance set it apart from other leading AI models. It outperforms Pixtral 12B and Llama 3.2-11B on several benchmarks, including MMMU and MathVista. On complex tasks such as long video understanding it surpasses GPT-4o, and in parsing lengthy documents it outshines Gemini 1.5 Flash. These results position Aria as a clear leader in tackling complex multimodal challenges.
Designed to foster collaboration and customization, Aria’s Apache 2.0 license ensures full transparency. Developers and researchers have full access to the model’s open weights, code, and demos. This openness encourages innovation, empowering the community to fine-tune and optimize Aria for diverse use cases, such as healthcare, content creation, AI research, and customer service.
Key Features of Aria:
• Multimodal Native: Seamlessly processes text, images, and videos within a unified model.
• Lightning-Fast Video Processing: Captures and captions 256-frame videos in just 10 seconds.
• Open-Source Model: Fully available for developers to modify, customize, and extend.
• Apache 2.0 License: Grants full access to weights, code, and demos.