Here we will track the latest AI Multimodal Models, including Multimodal Foundation Models, LLM, Agent, Audio, Image, Video, Music and 3D content. 🔥

Yuan-ManX/ai-multimodal-timeline


AI Multimodal Timeline

ComfyUI

Here we will track the latest AI Multimodal Models, including Multimodal Foundation Models, LLM, Agent, Audio, Image, Video, Music and 3D content. 🔥

Table of Contents

Project List

Multimodal Model

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | MINT-1T | Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens. | arXiv | |
| 2024-06 | OmniTokenizer | A Joint Image-Video Tokenizer for Visual Generation. | arXiv | Website |
| 2024-06 | ml-4m | A framework for training any-to-any multimodal foundation models. | arXiv | Website |
| 2024-06 | VideoLLaMA 2 | Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs. | arXiv | Hugging Face |
| 2024-05 | ManyICL | Many-Shot In-Context Learning in Multimodal Foundation Models. | arXiv | |
| 2024-05 | Contrastive ALignment (CAL) | Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment. | arXiv | |
| 2024-05 | Groma | Grounded Multimodal Large Language Model with Localized Visual Tokenization. | arXiv | Hugging Face |
| 2024-05 | CogVLM2 | GPT4V-level open-source multi-modal model based on Llama3-8B. | | Hugging Face |
| 2024-05 | Chameleon | Mixed-Modal Early-Fusion Foundation Models. | arXiv | |
| 2024-05 | Lumina-T2X | Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |
| 2024-05 | MiniCPM-Llama3-V 2.5 | MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters. | | Hugging Face |
| 2024-05 | Gemini | Build with state-of-the-art generative models and tools to make AI helpful for everyone. | | API |
| 2024-05 | GPT-4o | GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction. It accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs. | | API |
| 2023-12 | Tokenize Anything | Tokenize Anything via Prompting. | arXiv | Hugging Face |
| 2023-11 | ShareGPT4V | Improving Large Multi-Modal Models with Better Captions. | arXiv | Hugging Face |
| 2023-07 | Emu | Emu: Generative Multimodal Models from BAAI. | arXiv | Hugging Face |
| 2023-05 | ImageBind | One Embedding Space To Bind Them All. | arXiv | Website |
| 2022-11 | EVA | EVA: Visual Representation Fantasies from BAAI. | arXiv | Hugging Face |

^ Back to Contents ^

LLM

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-04 | Llama 3 | Meta Llama 3 is the next generation of our state-of-the-art open source large language model. | | Hugging Face |
| 2024-03 | Claude 3 | Talk with Claude, an AI assistant from Anthropic. | | API |
| 2023-09 | Baichuan 2 | A series of large language models developed by Baichuan Intelligent Technology. | | Hugging Face |
| 2023-07 | GPT-4 | GPT-4 is OpenAI's most advanced system, producing safer and more useful responses. | | API |

^ Back to Contents ^

Agent

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Mixture of Agents (MoA) | Mixture-of-Agents Enhances Large Language Model Capabilities. | arXiv | |
| 2024-06 | Buffer of Thoughts | Thought-Augmented Reasoning with Large Language Models. | arXiv | |
| 2024-06 | Translation Agent | Agentic translation using reflection workflow. | | |
| 2024-06 | Atomic Agents | The Atomic Agents framework is designed to be modular, extensible, and easy to use. | | |
| 2024-05 | Pipecat | Open Source framework for voice and multimodal conversational AI. | | |

^ Back to Contents ^

Audio

Audio/Text-to-Speech

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | ChatTTS | ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. | | |
| 2023-06 | StyleTTS 2 | Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models. | arXiv | Hugging Face |

Audio/Automatic Speech Recognition

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | TeleSpeech-ASR | Large speech model-super multi-dialect ASR. | | Hugging Face |
| 2022-12 | Whisper | Whisper is a general-purpose speech recognition model. | arXiv | API |

Audio/Audio Generation

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | SEE-2-SOUND | Zero-Shot Spatial Environment-to-Spatial Sound. | arXiv | |
| 2024-05 | Make-An-Audio 3 | Transforming Text into Audio via Flow-based Large Diffusion Transformers. | arXiv | Hugging Face |

^ Back to Contents ^

Image

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Depth Anything V2 | Depth Anything V2. | arXiv | Hugging Face |
| 2024-06 | AutoStudio | Crafting Consistent Subjects in Multi-turn Interactive Image Generation. | arXiv | |
| 2024-06 | MimicBrush | Zero-shot Image Editing with Reference Imitation. | arXiv | Hugging Face |
| 2024-06 | LlamaGen | Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation. | arXiv | Hugging Face |
| 2024-05 | Omost | Omost is a project to convert LLM's coding capability to image generation (or more accurately, image composing) capability. | | Hugging Face |
| 2024-05 | Hunyuan-DiT | A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding. | arXiv | Hugging Face |
| 2023-10 | DALL·E 3 | DALL·E is an AI system that can create realistic images and art from a description in natural language. | | API |

^ Back to Contents ^

Video

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-05 | Video-MME | The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis. | | |
| 2024-05 | MotionLLM | Understanding Human Behaviors from Human Motions and Videos. | arXiv | |
| 2024-05 | Vidu | Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models. | arXiv | |
| 2024-02 | Sora | Sora is an AI model that can create realistic and imaginative scenes from text instructions. | Technical Report | |
| 2023-11 | Pika | Pika is the idea-to-video platform that sets your creativity in motion. | | |
| 2023-03 | Runway | Runway is an applied AI research company shaping the next era of art, entertainment and human creativity. | | |

^ Back to Contents ^

Music

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-04 | Udio | Udio - AI Music Generator | | Website |
| 2023-12 | Suno | Suno is building a future where anyone can make great music. | | Website |

^ Back to Contents ^

3D

| Date | Source | Description | Paper | Model |
| --- | --- | --- | --- | --- |
| 2024-06 | Unique3D | High-Quality and Efficient 3D Mesh Generation from a Single Image. | arXiv | Hugging Face |
| 2024-06 | DreamGaussian4D | Generative 4D Gaussian Splatting. | arXiv | Hugging Face |
| 2024-03 | GaussianCube | A Structured and Explicit Radiance Representation for 3D Generative Modeling. | arXiv | Hugging Face |
| 2024-03 | TripoSR | Fast 3D Object Reconstruction from a Single Image. | arXiv | Hugging Face |

^ Back to Contents ^
