AI Multimodal Timeline

This list tracks the latest AI multimodal models, covering multimodal foundation models, LLMs, agents, and audio, image, video, music, and 3D content. 🔥

Project List

Multimodal Model

Date	Source	Description	Paper	Model
2024-06	MINT-1T	Scaling Open-Source Multimodal Data by 10x: A Multimodal Dataset with One Trillion Tokens.	arXiv
2024-06	OmniTokenizer	A Joint Image-Video Tokenizer for Visual Generation.	arXiv	Website
2024-06	ml-4m	A framework for training any-to-any multimodal foundation models.	arXiv	Website
2024-06	VideoLLaMA 2	Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs.	arXiv	Hugging Face
2024-05	ManyICL	Many-Shot In-Context Learning in Multimodal Foundation Models.	arXiv
2024-05	Contrastive ALignment (CAL)	Seeing the Image: Prioritizing Visual Correlation by Contrastive Alignment.	arXiv
2024-05	Groma	Grounded Multimodal Large Language Model with Localized Visual Tokenization.	arXiv	Hugging Face
2024-05	CogVLM2	GPT4V-level open-source multi-modal model based on Llama3-8B.		Hugging Face
2024-05	Chameleon	Mixed-Modal Early-Fusion Foundation Models.	arXiv
2024-05	Lumina-T2X	Transforming Text into Any Modality, Resolution, and Duration via Flow-based Large Diffusion Transformers.	arXiv	Hugging Face
2024-05	MiniCPM-Llama3-V 2.5	MiniCPM-Llama3-V 2.5 is the latest model in the MiniCPM-V series. The model is built on SigLip-400M and Llama3-8B-Instruct with a total of 8B parameters.		Hugging Face
2024-05	Gemini	Build with state-of-the-art generative models and tools to make AI helpful for everyone.		API
2024-05	GPT-4o	GPT-4o (“o” for “omni”) is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs.		API
2023-12	Tokenize Anything	Tokenize Anything via Prompting.	arXiv	Hugging Face
2023-11	ShareGPT4V	Improving Large Multi-Modal Models with Better Captions.	arXiv	Hugging Face
2023-07	Emu	Emu: Generative Multimodal Models from BAAI.	arXiv	Hugging Face
2023-05	ImageBind	One Embedding Space To Bind Them All.	arXiv	Website
2022-11	EVA	EVA: Visual Representation Fantasies from BAAI.	arXiv	Hugging Face

^ Back to Contents ^

LLM

Date	Source	Description	Model
2024-04	Llama 3	Meta Llama 3 is the next generation of our state-of-the-art open source large language model.	Hugging Face
2024-03	Claude 3	Talk with Claude, an AI assistant from Anthropic.	API
2023-09	Baichuan 2	A series of large language models developed by Baichuan Intelligent Technology.	Hugging Face
2023-07	GPT-4	GPT-4 is OpenAI’s most advanced system, producing safer and more useful responses.	API

^ Back to Contents ^

Agent

Date	Source	Description	Paper
2024-06	Mixture of Agents (MoA)	Mixture-of-Agents Enhances Large Language Model Capabilities.	arXiv
2024-06	Buffer of Thoughts	Thought-Augmented Reasoning with Large Language Models.	arXiv
2024-06	Translation Agent	Agentic translation using reflection workflow.
2024-06	Atomic Agents	The Atomic Agents framework is designed to be modular, extensible, and easy to use.
2024-05	Pipecat	Open Source framework for voice and multimodal conversational AI.

^ Back to Contents ^

Audio

Audio/Text-to-Speech

Date	Source	Description	Paper	Model
2024-05	ChatTTS	ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants.
2023-06	StyleTTS 2	Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models.	arXiv	Hugging Face

Audio/Automatic Speech Recognition

Date	Source	Description	Paper	Model
2024-05	TeleSpeech-ASR	A large speech model for super multi-dialect ASR.		Hugging Face
2022-12	Whisper	Whisper is a general-purpose speech recognition model.	arXiv	API

Audio/Audio Generation

Date	Source	Description	Paper	Model
2024-06	SEE-2-SOUND	Zero-Shot Spatial Environment-to-Spatial Sound.	arXiv
2024-05	Make-An-Audio 3	Transforming Text into Audio via Flow-based Large Diffusion Transformers.	arXiv	Hugging Face

^ Back to Contents ^

Image

Date	Source	Description	Paper	Model
2024-06	Depth Anything V2	Depth Anything V2.	arXiv	Hugging Face
2024-06	AutoStudio	Crafting Consistent Subjects in Multi-turn Interactive Image Generation.	arXiv
2024-06	MimicBrush	Zero-shot Image Editing with Reference Imitation.	arXiv	Hugging Face
2024-06	LlamaGen	Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation.	arXiv	Hugging Face
2024-05	Omost	Omost is a project to convert LLM's coding capability to image generation (or more accurately, image composing) capability.		Hugging Face
2024-05	Hunyuan-DiT	A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding.	arXiv	Hugging Face
2023-10	DALL·E 3	DALL·E is an AI system that can create realistic images and art from a description in natural language.		API

^ Back to Contents ^

Video

Date	Source	Description	Paper
2024-05	Video-MME	The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis.
2024-05	MotionLLM	Understanding Human Behaviors from Human Motions and Videos.	arXiv
2024-05	Vidu	Vidu: a Highly Consistent, Dynamic and Skilled Text-to-Video Generator with Diffusion Models.	arXiv
2024-02	Sora	Sora is an AI model that can create realistic and imaginative scenes from text instructions.	Technical Report
2023-11	Pika	Pika is the idea-to-video platform that sets your creativity in motion.
2023-03	Runway	Runway is an applied AI research company shaping the next era of art, entertainment and human creativity.

^ Back to Contents ^

Music

Date	Source	Description	Paper	Model
2024-04	Udio	Udio - AI Music Generator		Website
2023-12	Suno	Suno is building a future where anyone can make great music.		Website

^ Back to Contents ^

3D

Date	Source	Description	Paper	Model
2024-06	Unique3D	High-Quality and Efficient 3D Mesh Generation from a Single Image.	arXiv	Hugging Face
2024-06	DreamGaussian4D	Generative 4D Gaussian Splatting.	arXiv	Hugging Face
2024-03	GaussianCube	A Structured and Explicit Radiance Representation for 3D Generative Modeling.	arXiv	Hugging Face
2024-03	TripoSR	Fast 3D Object Reconstruction from a Single Image.	arXiv	Hugging Face

^ Back to Contents ^

Yuan-ManX/ai-multimodal-timeline