The majority of Large Language Models (and a few related, smaller models) summarized in a table, from the original Transformer to ChatGPT and beyond.
The list is long and still may not be exhaustive. If you think another model is worth adding, or you notice any incorrect information, let me know. A minimal code sketch contrasting the two most common pre-training objectives (autoregressive vs. masked language modeling) follows the table.
model | year | paper | model type / objective | short info | parameters | training corpora |
---|---|---|---|---|---|---|
- | 2015 | Dai & Le (Google) | autoregressive or autoencoder RNN (LSTM) | idea of pre-training domain-specific language models to be later fine-tuned | ? | IMDB, DBPedia, 20 Newsgroups |
Transformer | 2017 | Vaswani et al. (Google) | seq2seq for machine translation | original Transformer architecture | up to 213M | WMT 2014 (translation dataset) |
ULMFiT | 2018 | Howard & Ruder (fast.ai) | autoregressive RNN (AWD-LSTM) | idea of pre-training general-domain language models to be later fine-tuned | ? | Wikitext-103 |
ELMo | 2018 | Peters et al. (Allen Institute for AI) | bidirectional RNN LM (LSTM) | embeddings from LM added as input to other task-specific models | 94M | 1B Word LM Benchmark |
GPT | 2018 | Radford et al. (OpenAI) | autoregressive | first LLM using the Transformer model (decoder-only) | 117M | BooksCorpus |
BERT (weights) | 2018 | Devlin et al. (Google) | masked LM + next sentence prediction | idea of masked language modeling (bidirectional encoder) | 110M/340M | BooksCorpus + Wikipedia |
Transformer-XL | 2019 | Dai et al. (CMU + Google) | autoregressive | learning dependency beyond fixed-length context (processing segments) | up to ~0.8B | Wikitext-103, 1B Word LM Benchmark |
XLM | 2019 | Lample & Conneau (Facebook) | autoregressive or masked LM | cross-lingual language models | 570M | Wikipedia, MultiUN, OPUS |
GPT-2 (weights) | 2019 | Radford et al. (OpenAI) | autoregressive | first model to surpass 1B parameters | up to 1.5B | WebText (OpenAI internal, 40GB) |
ERNIE | 2019 | Zhang et al. (Tsinghua University) | masked LM + denoising autoencoder | text encoder + knowledge graph | 114M | Wikipedia + Wikidata |
XLNet (weights) | 2019 | Yang et al. (CMU + Google) | permutation LM | idea of permutation language modeling | 340M | BooksCorpus + Wikipedia + Giga5 + ClueWeb + CommonCrawl |
RoBERTa (weights) | 2019 | Liu et al. (Facebook) | masked LM | modifications to BERT after ablation study | 355M | BooksCorpus + Wikipedia + CC-News + OpenWebText + Stories, 160 GB |
Megatron-LM | 2019 | Shoeybi et al. (NVIDIA) | autoregressive or MLM | even larger multi-billion parameter models based on GPT/BERT | 8.3B | Wikipedia + CC-Stories + RealNews + OpenWebText |
ALBERT (weights) | 2019 | Lan et al. (Google) | masked LM + sentence order prediction | reduced #params by embedding decomposition + cross-layer param sharing | up to 235M | same as BERT |
DistilBERT (weights) | 2019 | Sanh et al. (Hugging Face) | masked LM + knowledge distillation | obtained from BERT via knowledge distillation (teacher-student); next sentence prediction is dropped | 66M | same as BERT |
T5 (weights) | 2019 | Raffel et al. (Google) | seq2seq | encoder-decoder pre-trained with unsupervised denoising objective, fine-tuned with multi-task objective (tasks formulated as text-to-text) | up to 11B | C4 (Colossal Clean Crawled Corpus), 750GB (stage 1); supervised datasets (stage 2) |
BART (weights) | 2019 | Lewis et al. (Facebook) | seq2seq | pre-trained as a denoising autoencoder: to restore the corrupted input | BERT+10% | same as RoBERTa |
XLM-RoBERTa (weights) | 2019 | Conneau et al. (Facebook) | masked LM | multi-lingual model pre-trained on texts in 100 languages | 550M | CommonCrawl in 100 languages |
Meena | 2020 | Adiwardana et al. (Google) | seq2seq (for dialogue) | multi-turn chatbot trained to minimize perplexity of the next token | 2.6B | public domain social media conversations |
Turing NLG | 2020 | only blogpost (Microsoft) | autoregressive | a language model scaled up to 17B parameters | 17B | "same type of data that Megatron-LM models were trained on" |
ELECTRA (weights) | 2020 | Clark et al. (Stanford + Google) | replaced token detection | GAN-like pre-training; generator corrupts the input, discriminator detects corrupted tokens | same as BERT | same as BERT, for largest model: same as XLNet |
GPT-3 (API) | 2020 | Brown et al. (OpenAI) | autoregressive | very similar to GPT-2, but much larger (largest model at that time) | 175B | CommonCrawl + extended WebText + Books + Wikipedia |
DeBERTa (weights) | 2020 | He et al. (Microsoft) | masked LM | BERT with disentangled attention (word content and position separated) + enhanced mask decoder | up to 1.5B | Wikipedia + BooksCorpus + OpenWebText + Stories |
mT5 (weights) | 2020 | Xue et al. (Google) | seq2seq | multilingual T5 for 101 languages | up to 13B | CommonCrawl in 101 languages (mC4) |
Switch Transformer | 2021 | Fedus et al. (Google) | seq2seq (Mixture of Experts) | sparsely-activated model / MoE - parameters (part of the model to be used) depend on the input data | 1.6T (MoE) | same as in T5 and mT5 |
GLM (weights) | 2021 | Du et al. (Tsinghua University) | autoregressive blank infilling | idea of autoregressive blank infilling | up to 10B | same as BERT |
GPT-Neo (weights) | 2021 | - (EleutherAI) | autoregressive | replication of the GPT-3 architecture (with far fewer parameters) | 2.7B | The Pile |
GPT-J (weights) | 2021 | - (EleutherAI) | autoregressive | replication of the GPT-3 architecture (with far fewer parameters); very similar to GPT-Neo | 6B | The Pile |
Jurassic-1 (API) | 2021 | Lieber et al. (AI21 Labs) | autoregressive | GPT-3-like with an "optimized" depth-to-width ratio (shallower but wider) and a larger vocabulary | 178B | attempt to replicate GPT-3 data using publicly available data |
FLAN | 2021 | Wei et al. (Google) | autoregressive | 137B LaMDA-PT model fine-tuned on instructions | 137B | a mixture of 62 NLU and NLG tasks (see paper for details) |
T0 (weights) | 2021 | Sanh et al. (Hugging Face) | seq2seq | T5 model fine-tuned on a large mixture of supervised tasks with a unified prompt format | 11B | P3 (Public Pool of Prompts) |
Megatron-Turing NLG | 2021 | Smith et al. (Microsoft + NVIDIA) | autoregressive | largest model at that time, 3x larger than GPT-3 | 530B | a subset of The Pile + CommonCrawl + RealNews + CC-Stories |
RETRO | 2022 | Borgeaud et al. (DeepMind) | seq2seq (+ retrieval) | input is split into chunks; for each chunk, nearest neighbor entries are retrieved from DB to improve modeling | up to 7B | multilingual MassiveText (see Gopher paper) |
GLaM | 2022 | Du et al. (Google) | autoregressive (Mixture of Experts) | another MoE model, this time autoregressive, with over a trillion parameters | 1.2T (MoE) | a mixture of webpages, conversations, forums, books, news |
Gopher | 2022 | Rae et al. (DeepMind) | autoregressive | a family of language models (up to 280B) plus analysis of effect of model scaling | up to 280B | MassiveText (MassiveWeb + C4 + Books + News + Wiki + GitHub) |
LaMDA | 2022 | Thoppilan et al. (Google) | autoregressive (for dialogue) | pre-trained on public dialogues and web documents, fine-tuned for safety and factual correctness (knowledge retrieval from external tools) | 137B | publicly available dialogues and web documents (details in paper) |
ST-MoE | 2022 | Zoph et al. (Google) | seq2seq (Mixture of Experts) | stable training of a large-scale sparse (Mixture of Experts) language model | 269B (MoE) | mix of C4 corpus and dataset used for GLaM |
InstructGPT (API) | 2022 | Ouyang et al. (OpenAI) | autoregressive | GPT-3 model trained to follow instructions using Reinforcement Learning with Human Feedback (RLHF) | 175B | human demonstrations of desired model behavior for prompts (manually written + collected via OpenAI API) |
Chinchilla | 2022 | Hoffmann et al. (DeepMind) | autoregressive | compute-optimal training; 4x smaller than Gopher but trained on 4x more data, beats larger models on many downstream tasks | 70B | MassiveText (a different subset distribution than in Gopher) |
PaLM | 2022 | Chowdhery et al. (Google) | autoregressive | largest dense model at that time, efficiently trained using the Google Pathways system | 540B | based on datasets used in GLaM and LaMDA |
Anthropic assistant | 2022 | Bai et al. (Anthropic) | autoregressive (for dialogue) | dialogue agent based on a language model trained with RLHF to be helpful and harmless | up to 52B | The Pile |
GPT-NeoX (weights) | 2022 | Black et al. (EleutherAI) | autoregressive | largest publicly available dense autoregressive model at that time | 20B | The Pile |
OPT (weights) | 2022 | Zhang et al. (Meta) | autoregressive | a family of language models (up to 175B) that (apart from the largest one) have publicly available weights | up to 175B | dataset from RoBERTa + The Pile + Reddit |
YaLM (weights) | 2022 | only repository (Yandex) | autoregressive | bilingual GPT-like model for English and Russian | 100B | The Pile + a large collection of Russian texts |
Atlas | 2022 | Izacard et al. (Meta) | seq2seq (+ retrieval) | T5 language model + retrieval from a corpus of documents (joint pretraining) | up to 11B | Wikipedia, CommonCrawl |
Sparrow | 2022 | Glaese et al. (DeepMind) | autoregressive (for dialogue) | dialogue agent based on Chinchilla LM trained with RLHF to be helpful and harmless, able to retrieve information from external source | 70B | dialogue data collected by interaction with human annotators |
GLM-130B (weights) | 2022 | Zeng et al. (Tsinghua University) | autoregressive blank infilling | open bilingual 130B model for English and Chinese | 130B | The Pile (English) + Chinese corpora |
Flan-T5 (weights) & Flan-PaLM | 2022 | Chung et al. (Google) | seq2seq / autoregressive | T5 and PaLM models fine-tuned with instructions (Flan-T5 weights released in several sizes) | up to 540B | a mixture of 1836 finetuning tasks from 4 sources (details in paper) |
BLOOM (weights) | 2022 | Le Scao et al. (BigScience) | autoregressive | a 176B parameter model resulting from the BigScience collaboration (trained for 3.5 months in the first half of the year) | 176B | ROOTS dataset (mix of natural and programming languages) |
BLOOMZ (weights) | 2022 | Muennighoff et al. (BigScience) | autoregressive | BLOOM finetuned on instructions | 176B | xP3 |
Galactica (weights) | 2022 | Taylor et al. (Meta) | autoregressive | a model trained on a corpus of scientific knowledge, performing strongly in knowledge-intensive scientific tasks | up to 120B | papers, textbooks, encyclopedias, code, knowledge bases etc. |
ChatGPT (API) | 2022 | only blogpost for now (OpenAI) | autoregressive (for dialogue) | a model trained in a similar way as InstructGPT, using RLHF, in a dialogue/chat framework | ? | human demonstrations of desired model behavior for prompts (see InstructGPT) |
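
Most entries above use one of two pre-training objectives: autoregressive language modeling (GPT-style next-token prediction) or masked language modeling (BERT-style). The snippet below is a minimal sketch contrasting the two at inference time; it assumes the Hugging Face `transformers` library and the publicly released `gpt2` and `bert-base-uncased` checkpoints, and it is only an illustration, not the training setup of any model in the table.

```python
# Minimal sketch (not from any of the papers above): autoregressive vs. masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM

# Autoregressive LM (GPT family): predict the next token from the left context only.
ar_tok = AutoTokenizer.from_pretrained("gpt2")
ar_lm = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = ar_tok("The Transformer architecture was introduced in", return_tensors="pt")
with torch.no_grad():
    next_logits = ar_lm(**inputs).logits[0, -1]  # logits for the next position
print("next token:", ar_tok.decode(next_logits.argmax().item()))

# Masked LM (BERT family): predict a masked token from both left and right context.
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
text = f"The Transformer was introduced by Vaswani et al. in {mlm_tok.mask_token}."
inputs = mlm_tok(text, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits
mask_pos = (inputs["input_ids"][0] == mlm_tok.mask_token_id).nonzero(as_tuple=True)[0]
print("filled mask:", mlm_tok.decode(logits[0, mask_pos].argmax(dim=-1)))
```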