Here we will track the latest AI Multimodal Models, including Multimodal Foundation Model, LLM, Agent, Audio, Image, Video, Music and 3D content.
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-06 | MINT-1T | Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. | arXiv | |
2024-06 | OmniTokenizer | A Joint Image-Video Tokenizer for Visual Generation. | arXiv | Website |
2024-06 | ml-4m | A framework for training any-to-any multimodal foundation models. | arXiv | Website |
2024-06 | VideoLLaMA 2 | Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. | arXiv | Hugging Face |
2024-05 | ManyICL | Many-Shot In-Context Learning in Multimodal Foundation Models. | arXiv | |
2024-05 | Contrastive ALignment (CAL) | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. | arXiv | |
2024-05 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization. | arXiv | Hugging Face |
2024-05 | CogVLM2 | GPT4V-level open-source multi-modal model based on Llama3-8B. | Hugging Face | |
2024-05 | Chameleon | Mixed-Modal Early-Fusion Foundation Models. | arXiv | |
2024-05 | Lumina-T2X | Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
2024-05 | MiniCPM-Llama3-V 2.5 | MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. | Hugging Face | |
2024-05 | Gemini | Build with state-of-the-art generative models and tools to make AI helpful for everyone. | API | |
2024-05 | GPT-4o | GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction: it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. | API | |
2023-12 | Tokenize Anything | Tokenize Anything via Prompting. | arXiv | Hugging Face |
2023-11 | ShareGPT4V | Improving Large Multi-Modal Models with Better Captions. | arXiv | Hugging Face |
2023-07 | Emu | Emu: Generative Multimodal Models from BAAI. | arXiv | Hugging Face |
2023-05 | ImageBind | One Embedding Space To Bind Them All. | arXiv | Website |
2022-11 | EVA | EVA: Visual Representation Fantasies from BAAI. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-04 | Llama 3 | Meta Llama 3 is the next generation of our state-of-the-art open source large language model. | Hugging Face | |
2024-03 | Claude 3 | Talk with Claude, an AI assistant from Anthropic. | API | |
2023-09 | Baichuan 2 | A series of large language models developed by Baichuan Intelligent Technology. | Hugging Face | |
2023-07 | GPT-4 | GPT-4 is OpenAI's most advanced system, producing safer and more useful responses. | API |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-06 | Mixture of Agents (MoA) | Mixture-of-Agents Enhances Large Language Model Capabilities. | arXiv | |
2024-06 | Buffer of Thoughts | Thought-Augmented Reasoning with Large Language Models. | arXiv | |
2024-06 | Translation Agent | Agentic translation using reflection workflow. | ||
2024-06 | Atomic Agents | The Atomic Agents framework is designed to be modular, extensible, and easy to use. | ||
2024-05 | Pipecat | Open Source framework for voice and multimodal conversational AI. |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-05 | ChatTTS | ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | ||
2023-06 | StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-05 | TeleSpeech-ASR | A large speech model for multi-dialect ASR. | Hugging Face | |
2022-12 | Whisper | Whisper is a general-purpose speech recognition model. | arXiv | API |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-06 | SEE-2-SOUND | Zero-Shot Spatial Environment-to-Spatial Sound. | arXiv | |
2024-05 | Make-An-Audio 3 | Transforming Text into Audio via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-06 | Depth Anything V2 | Depth Anything V2. | arXiv | Hugging Face |
2024-06 | AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation. | arXiv | |
2024-06 | MimicBrush | Zero-shot Image Editing with Reference Imitation. | arXiv | Hugging Face |
2024-06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. | arXiv | Hugging Face |
2024-05 | Omost | Omost is a project to convert LLM's coding capability to image generation (or more accurately, image composing) capability. | Hugging Face | |
2024-05 | Hunyuan-DiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. | arXiv | Hugging Face |
2023-10 | DALL·E 3 | DALL·E is an AI system that can create realistic images and art from a description in natural language. | API |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-05 | Video-MME | The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. | ||
2024-05 | MotionLLM | Understanding Human Behaviors from Human Motions and Videos. | arXiv | |
2024-05 | Vidu | Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. | arXiv | |
2024-02 | Sora | Sora is an AI model that can create realistic and imaginative scenes from text instructions. | Technical Report | |
2023-11 | Pika | Pika is the idea-to-video platform that sets your creativity in motion. | ||
2023-03 | Runway | Runway is an applied AI research company shaping the next era of art, entertainment and human creativity. |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-04 | Udio | Udio - AI Music Generator | Website | |
2023-12 | Suno | Suno is building a future where anyone can make great music. | Website |
Date | Source | Description | Paper | Model |
---|---|---|---|---|
2024-06 | Unique3D | High-Quality and Efficient 3D Mesh Generation from a Single Image. | arXiv | Hugging Face |
2024-06 | DreamGaussian4D | Generative 4D Gaussian Splatting. | arXiv | Hugging Face |
2024-03 | GaussianCube | A Structured and Explicit Radiance Representation for 3D Generative Modeling. | arXiv | Hugging Face |
2024-03 | TripoSR | Fast 3D Object Reconstruction from a Single Image. | arXiv | Hugging Face |