The majority of Large Language Models (and a few related, smaller models) summarized in a table, from the original Transformer to ChatGPT and beyond.
The list is long and still may not be exhaustive. If you think another model is worth adding, or you notice any incorrect information, let me know. A minimal code sketch contrasting the two most common pre-training objectives (autoregressive vs. masked language modeling) follows the table.
model | year | paper | model type / objective | short info | parameters | training corpora |
---|---|---|---|---|---|---|
- | 2015 | Dai & Le (Google) | autoregressive or autoencoder RNN (LSTM) | idea of pre-training domain-specific language models to be later fine-tuned | ? | IMDB, DBPedia, 20 Newsgroups |
Transformer | 2017 | Vaswani et al. (Google) | seq2seq for machine translation | original Transformer architecture | up to 213M | WMT 2014 (translation dataset) |
ULMFiT | 2018 | Howard & Ruder (fast.ai) | autoregressive RNN (AWD-LSTM) | idea of pre-training general-domain language models to be later fine-tuned | ? | Wikitext-103 |
ELMo | 2018 | Peters et al. (Allen Institute for AI) | bidirectional RNN LM (LSTM) | embeddings from LM added as input to other task-specific models | 94M | 1B Word LM Benchmark |
GPT | 2018 | Radford et al. (OpenAI) | autoregressive | first LLM using the Transformer model (decoder-only) | 117M | BooksCorpus |
BERT (weights) | 2018 | Devlin et al. (Google) | masked LM + next sentence prediction | idea of masked language modeling (bidirectional encoder) | 110M/340M | BooksCorpus + Wikipedia |
Transformer-XL | 2019 | Dai et al. (CMU + Google) | autoregressive | learning dependency beyond fixed-length context (processing segments) | up to ~0.8B | Wikitext-103, 1B Word LM Benchmark |
XLM | 2019 | Lample & Conneau (Facebook) | autoregressive or masked LM | cross-lingual language models | 570M | Wikipedia, MultiUN, OPUS |
GPT-2 (weights) | 2019 | Radford et al. (OpenAI) | autoregressive | first model to surpass 1B parameters | up to 1.5B | WebText (OpenAI internal, 40GB) |
ERNIE | 2019 | Zhang et al. (Tsinghua University) | masked LM + denoising autoencoder | text encoder + knowledge graph | 114M | Wikipedia + Wikidata |
XLNet (weights) | 2019 | Yang et al. (CMU + Google) | permutation LM | idea of permutation language modeling | 340M | BooksCorpus + Wikipedia + Giga5 + ClueWeb + CommonCrawl |
RoBERTa (weights) | 2019 | Liu et al. (Facebook) | masked LM | modifications to BERT after ablation study | 355M | BooksCorpus + Wikipedia + CC-News + OpenWebText + Stories, 160 GB |
Megatron-LM | 2019 | Shoeybi et al. (NVIDIA) | autoregressive or MLM | even larger multi-billion parameter models based on GPT/BERT | 8.3B | Wikipedia + CC-Stories + RealNews + OpenWebText |
ALBERT (weights) | 2019 | Lan et al. (Google) | masked LM + sentence order prediction | reduced #params by embedding decomposition + cross-layer param sharing | up to 235M | same as BERT |
DistilBERT (weights) | 2019 | Sanh et al. (Hugging Face) | masked LM + knowledge distillation | obtained from BERT via knowledge distillation (teacher-student); next sentence prediction is dropped | 66M | same as BERT |
T5 (weights) | 2019 | Raffel et al. (Google) | seq2seq | encoder-decoder pre-trained with unsupervised denoising objective, fine-tuned with multi-task objective (tasks formulated as text-to-text) | up to 11B | C4 (Colossal Clean Crawled Corpus), 750GB (stage 1); supervised datasets (stage 2) |
BART (weights) | 2019 | Lewis et al. (Facebook) | seq2seq | pre-trained as a denoising autoencoder: to restore the corrupted input | BERT+10% | same as RoBERTa |
XLM-RoBERTa (weights) | 2019 | Conneau et al. (Facebook) | masked LM | multi-lingual model pre-trained on texts in 100 languages | 550M | CommonCrawl in 100 languages |
Meena | 2020 | Adiwardana et al. (Google) | seq2seq (for dialogue) | multi-turn chatbot trained to minimize perplexity of the next token | 2.6B | public domain social media conversations |
Turing NLG | 2020 | only blogpost (Microsoft) | autoregressive | a language model scaled up to 17B parameters | 17B | "same type of data that Megatron-LM models were trained on" |
ELECTRA (weights) | 2020 | Clark et al. (Stanford + Google) | replaced token detection | GAN-like pre-training; generator corrupts the input, discriminator detects corrupted tokens | same as BERT | same as BERT, for largest model: same as XLNet |
GPT-3 (API) | 2020 | Brown et al. (OpenAI) | autoregressive | very similar to GPT-2, but much larger (largest model at that time) | 175B | CommonCrawl + extended WebText + Books + Wikipedia |
DeBERTa (weights) | 2020 | He et al. (Microsoft) | masked LM | BERT with disentangled attention (word content and position separated) + enhanced mask decoder | up to 1.5B | Wikipedia + BooksCorpus + OpenWebText + Stories |
mT5 (weights) | 2020 | Xue et al. (Google) | seq2seq | multilingual T5 for 101 languages | up to 13B | CommonCrawl in 101 languages (mC4) |
Switch Transformer | 2021 | Fedus et al. (Google) | seq2seq (Mixture of Experts) | sparsely-activated model / MoE - parameters (part of the model to be used) depend on the input data | 1.6T (MoE) | same as in T5 and mT5 |
GLM (weights) | 2021 | Du et al. (Tsinghua University) | autoregressive blank infilling | idea of autoregressive blank infilling | up to 10B | same as BERT |
GPT-Neo (weights) | 2021 | - (EleutherAI) | autoregressive | replication of the GPT-3 architecture (with far fewer parameters) | 2.7B | The Pile |
GPT-J (weights) | 2021 | - (EleutherAI) | autoregressive | replication of the GPT-3 architecture (with far fewer parameters); very similar to GPT-Neo | 6B | The Pile |
Jurassic-1 (API) | 2021 | Lieber et al. (AI21 Labs) | autoregressive | GPT-3-like with an "optimized" depth-to-width ratio (shallower but wider) and a larger vocabulary | 178B | attempt to replicate GPT-3 data using publicly available data |
FLAN | 2021 | Wei et al. (Google) | autoregressive | 137B LaMDA-PT model fine-tuned on instructions | 137B | a mixture of 62 NLU and NLG tasks (see paper for details) |
T0 (weights) | 2021 | Sanh et al. (Hugging Face) | seq2seq | T5 model fine-tuned on a large mixture of supervised tasks with a unified prompt format | 11B | P3 (Public Pool of Prompts) |
Megatron-Turing NLG | 2021 | Smith et al. (Microsoft + NVIDIA) | autoregressive | largest model at that time, 3x larger than GPT-3 | 530B | a subset of The Pile + CommonCrawl + RealNews + CC-Stories |
RETRO | 2022 | Borgeaud et al. (DeepMind) | seq2seq (+ retrieval) | input is split into chunks; for each chunk, nearest neighbor entries are retrieved from DB to improve modeling | up to 7B | multilingual MassiveText (see Gopher paper) |
GLaM | 2022 | Du et al. (Google) | autoregressive (Mixture of Experts) | another MoE model, this time autoregressive, with over a trillion parameters | 1.2T (MoE) | a mixture of webpages, conversations, forums, books, news |
Gopher | 2022 | Rae et al. (DeepMind) | autoregressive | a family of language models (up to 280B) plus analysis of effect of model scaling | up to 280B | MassiveText (MassiveWeb + C4 + Books + News + Wiki + GitHub) |
LaMDA | 2022 | Thoppilan et al. (Google) | autoregressive (for dialogue) | pre-trained on public dialogues and web documents, fine-tuned for safety and factual correctness (knowledge retrieval from external tools) | 137B | publicly available dialogues and web documents (details in paper) |
ST-MoE | 2022 | Zoph et al. (Google) | seq2seq (Mixture of Experts) | stable training of a large-scale sparse (Mixture of Experts) language model | 269B (MoE) | mix of C4 corpus and dataset used for GLaM |
InstructGPT (API) | 2022 | Ouyang et al. (OpenAI) | autoregressive | GPT-3 model trained to follow instructions using Reinforcement Learning with Human Feedback (RLHF) | 175B | human demonstrations of desired model behavior for prompts (manually written + collected via OpenAI API) |
Chinchilla | 2022 | Hoffmann et al. (DeepMind) | autoregressive | compute-optimal training; 4x smaller than Gopher but trained on 4x more data, beats larger models on many downstream tasks | 70B | MassiveText (a different subset distribution than in Gopher) |
PaLM | 2022 | Chowdhery et al. (Google) | autoregressive | largest dense model at that time, efficiently trained using the Google Pathways system | 540B | based on datasets used in GLaM and LaMDA |
Anthropic assistant | 2022 | Bai et al. (Anthropic) | autoregressive (for dialogue) | dialogue agent based on a language model trained with RLHF to be helpful and harmless | up to 52B | The Pile |
GPT-NeoX (weights) | 2022 | Black et al. (EleutherAI) | autoregressive | largest publicly available dense autoregressive model at that time | 20B | The Pile |
OPT (weights) | 2022 | Zhang et al. (Meta) | autoregressive | a family of language models (up to 175B) that (apart from the largest one) have publicly available weights | up to 175B | dataset from RoBERTa + The Pile + Reddit |
YaLM (weights) | 2022 | only repository (Yandex) | autoregressive | bilingual GPT-like model for English and Russian | 100B | The Pile + a large collection of Russian texts |
Atlas | 2022 | Izacard et al. (Meta) | seq2seq (+ retrieval) | T5 language model + retrieval from a corpus of documents (joint pretraining) | up to 11B | Wikipedia, CommonCrawl |
Sparrow | 2022 | Glaese et al. (DeepMind) | autoregressive (for dialogue) | dialogue agent based on Chinchilla LM trained with RLHF to be helpful and harmless, able to retrieve information from external source | 70B | dialogue data collected by interaction with human annotators |
GLM-130B (weights) | 2022 | Zeng et al. (Tsinghua University) | autoregressive blank infilling | open bilingual 130B model for English and Chinese | 130B | The Pile (English) + Chinese corpora |
Flan-T5 (weights) & Flan-PaLM | 2022 | Chung et al. (Google) | seq2seq / autoregressive | T5 and PaLM models fine-tuned with instructions (Flan-T5 weights released in several sizes) | up to 540B | a mixture of 1836 finetuning tasks from 4 sources (details in paper) |
BLOOM (weights) | 2022 | Le Scao et al. (BigScience) | autoregressive | a 176B parameter model resulting from the BigScience collaboration (trained for 3.5 months in the first half of the year) | 176B | ROOTS dataset (mix of natural and programming languages) |
BLOOMZ (weights) | 2022 | Muennighoff et al. (BigScience) | autoregressive | BLOOM finetuned on instructions | 176B | xP3 |
Galactica (weights) | 2022 | Taylor et al. (Meta) | autoregressive | a model trained on a corpus of scientific knowledge, performing strongly in knowledge-intensive scientific tasks | up to 120B | papers, textbooks, encyclopedias, code, knowledge bases etc. |
ChatGPT (API) | 2022 | only blogpost for now (OpenAI) | autoregressive (for dialogue) | a model trained in a similar way as InstructGPT, using RLHF, in a dialogue/chat framework | ? | human demonstrations of desired model behavior for prompts (see InstructGPT) |
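
Most entries above use one of two pre-training objectives: autoregressive language modeling (GPT-style next-token prediction) or masked language modeling (BERT-style). The snippet below is a minimal sketch contrasting the two at inference time; it assumes the Hugging Face `transformers` library and the publicly released `gpt2` and `bert-base-uncased` checkpoints, and it is only an illustration, not the training setup of any model in the table.

```python
# Minimal sketch (not from any of the papers above): autoregressive vs. masked LM.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoModelForMaskedLM

# Autoregressive LM (GPT family): predict the next token from the left context only.
ar_tok = AutoTokenizer.from_pretrained("gpt2")
ar_lm = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = ar_tok("The Transformer architecture was introduced in", return_tensors="pt")
with torch.no_grad():
    next_logits = ar_lm(**inputs).logits[0, -1]  # logits for the next position
print("next token:", ar_tok.decode(next_logits.argmax().item()))

# Masked LM (BERT family): predict a masked token from both left and right context.
mlm_tok = AutoTokenizer.from_pretrained("bert-base-uncased")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")
text = f"The Transformer was introduced by Vaswani et al. in {mlm_tok.mask_token}."
inputs = mlm_tok(text, return_tensors="pt")
with torch.no_grad():
    logits = mlm(**inputs).logits
mask_pos = (inputs["input_ids"][0] == mlm_tok.mask_token_id).nonzero(as_tuple=True)[0]
print("filled mask:", mlm_tok.decode(logits[0, mask_pos].argmax(dim=-1)))
```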