# Large Language Model catalog

The majority of Large Language Models (and not only LLMs) summarized in a table, from the original Transformer to ChatGPT and beyond.

The list is long and still may not be exhaustive. If you think any other model is worth adding or you notice any incorrect information, let me know.

| model | year | paper | model type / objective | short info | parameters | training corpora |
| --- | --- | --- | --- | --- | --- | --- |
| - | 2015 | Dai & Le (Google) | autoregressive or autoencoder RNN (LSTM) | idea of pre-training domain-specific language models to be later fine-tuned | ? | IMDB, DBPedia, 20 Newsgroups |
| Transformer | 2017 | Vaswani et al. (Google) | seq2seq for machine translation | original Transformer architecture | up to 213M | WMT 2014 (translation dataset) |
| ULMFiT | 2018 | Howard & Ruder (fast.ai) | autoregressive RNN (AWD-LSTM) | idea of pre-training general-domain language models to be later fine-tuned | ? | Wikitext-103 |
| ELMo | 2018 | Peters et al. (Allen Institute for AI) | bidirectional RNN LM (LSTM) | embeddings from the LM added as input to other task-specific models | 94M | 1B Word LM Benchmark |
| GPT | 2018 | Radford et al. (OpenAI) | autoregressive | first LLM using the Transformer model (decoder-only) | 117M | BooksCorpus |
| BERT (weights) | 2018 | Devlin et al. (Google) | masked LM + next sentence prediction | idea of masked language modeling (bidirectional encoder) | 110M/340M | BooksCorpus + Wikipedia |
| Transformer-XL | 2019 | Dai et al. (CMU + Google) | autoregressive | learning dependency beyond a fixed-length context (processing segments) | up to ~0.8B | Wikitext-103, 1B Word LM Benchmark |
| XLM | 2019 | Lample & Conneau (Facebook) | autoregressive or masked LM | cross-lingual language models | 570M | Wikipedia, MultiUN, OPUS |
| GPT-2 (weights) | 2019 | Radford et al. (OpenAI) | autoregressive | first model to surpass 1B parameters | up to 1.5B | WebText (OpenAI internal, 40GB) |
| ERNIE | 2019 | Zhang et al. (Tsinghua University) | masked LM + denoising autoencoder | text encoder + knowledge graph | 114M | Wikipedia + Wikidata |
| XLNet (weights) | 2019 | Yang et al. (CMU + Google) | permutation LM | idea of permutation language modeling | 340M | BooksCorpus + Wikipedia + Giga5 + ClueWeb + CommonCrawl |
| RoBERTa (weights) | 2019 | Liu et al. (Facebook) | masked LM | modifications to BERT after an ablation study | 355M | BooksCorpus + Wikipedia + CC-News + OpenWebText + Stories (160 GB) |
| Megatron-LM | 2019 | Shoeybi et al. (NVIDIA) | autoregressive or masked LM | even larger multi-billion-parameter models based on GPT/BERT | 8.3B | Wikipedia + CC-Stories + RealNews + OpenWebText |
| ALBERT (weights) | 2019 | Lan et al. (Google) | masked LM + sentence order prediction | reduced parameter count via embedding decomposition + cross-layer parameter sharing | up to 235M | same as BERT |
| DistilBERT (weights) | 2019 | Sanh et al. (Hugging Face) | masked LM + next sentence prediction | obtained from BERT via knowledge distillation (teacher-student) | 66M | same as BERT |
| T5 (weights) | 2019 | Raffel et al. (Google) | seq2seq | encoder-decoder pre-trained with an unsupervised denoising objective, fine-tuned with a multi-task objective (tasks formulated as text-to-text) | up to 11B | C4 (Colossal Clean Crawled Corpus), 750GB (stage 1); supervised datasets (stage 2) |
| BART (weights) | 2019 | Lewis et al. (Facebook) | seq2seq | pre-trained as a denoising autoencoder that restores corrupted input | BERT+10% | same as RoBERTa |
| XLM-RoBERTa (weights) | 2019 | Conneau et al. (Facebook) | masked LM | multilingual model pre-trained on texts in 100 languages | 550M | CommonCrawl in 100 languages |
| Meena | 2020 | Adiwardana et al. (Google) | seq2seq (for dialogue) | multi-turn chatbot trained to minimize perplexity of the next token | 2.6B | public domain social media conversations |
| Turing NLG | 2020 | blog post only (Microsoft) | autoregressive | a language model scaled up to 17B parameters | 17B | "same type of data that Megatron-LM models were trained on" |
| ELECTRA (weights) | 2020 | Clark et al. (Stanford + Google) | replaced token detection | GAN-like pre-training: a generator corrupts the input, a discriminator detects the corrupted tokens | same as BERT | same as BERT; for the largest model: same as XLNet |
| GPT-3 (API) | 2020 | Brown et al. (OpenAI) | autoregressive | very similar to GPT-2, but larger (175B params; largest at that time) | 175B | CommonCrawl + extended WebText + Books + Wikipedia |
| DeBERTa (weights) | 2020 | He et al. (Microsoft) | masked LM | BERT with disentangled attention (word content and position separated) + enhanced mask decoder | up to 1.5B | Wikipedia + BooksCorpus + OpenWebText + Stories |
| mT5 (weights) | 2020 | Xue et al. (Google) | seq2seq | multilingual T5 for 101 languages | up to 11B | CommonCrawl in 101 languages (mC4) |
| Switch Transformer | 2021 | Fedus et al. (Google) | seq2seq (Mixture of Experts) | sparsely activated model (MoE): the parameters used (the part of the model that is active) depend on the input data | 1.6T (MoE) | same as in T5 and mT5 |
| GLM (weights) | 2021 | Du et al. (Tsinghua University) | autoregressive blank infilling | idea of autoregressive blank infilling | up to 10B | same as BERT |
| GPT-Neo (weights) | 2021 | - (EleutherAI) | autoregressive | replication of the GPT-3 architecture (with far fewer parameters) | 2.7B | The Pile |
| GPT-J (weights) | 2021 | - (EleutherAI) | autoregressive | replication of the GPT-3 architecture (with far fewer parameters); very similar to GPT-Neo | 6B | The Pile |
| Jurassic-1 (API) | 2021 | Lieber et al. (AI21 Labs) | autoregressive | GPT-3-like, with an "optimized" depth-to-width ratio (shallower but wider) and a larger vocabulary | 178B | attempt to replicate the GPT-3 training data using publicly available data |
| FLAN | 2021 | Wei et al. (Google) | autoregressive | 137B LaMDA-PT model fine-tuned on instructions | 137B | a mixture of 62 NLU and NLG tasks (see paper for details) |
| T0 (weights) | 2021 | Sanh et al. (Hugging Face) | seq2seq | T5 model fine-tuned on a large mixture of supervised tasks with a unified prompt format | 11B | P3 (Public Pool of Prompts) |
| Megatron-Turing NLG | 2021 | Smith et al. (Microsoft + NVIDIA) | autoregressive | largest model at that time, 3x larger than GPT-3 | 530B | a subset of The Pile + CommonCrawl + RealNews + CC-Stories |
| RETRO | 2022 | Borgeaud et al. (DeepMind) | seq2seq (+ retrieval) | input is split into chunks; for each chunk, nearest-neighbor entries are retrieved from a database to improve modeling | up to 7B | multilingual MassiveText (see the Gopher paper) |
| GLaM | 2022 | Du et al. (Google) | autoregressive (Mixture of Experts) | another MoE model, this time autoregressive, with over a trillion parameters | 1.2T (MoE) | a mixture of webpages, conversations, forums, books, news |
| Gopher | 2022 | Rae et al. (DeepMind) | autoregressive | a family of language models (up to 280B) plus an analysis of the effect of model scaling | up to 280B | MassiveText (MassiveWeb + C4 + Books + News + Wiki + GitHub) |
| LaMDA | 2022 | Thoppilan et al. (Google) | autoregressive (for dialogue) | pre-trained on public dialogues and web documents, fine-tuned for safety and factual correctness (knowledge retrieval from external tools) | 137B | publicly available dialogues and web documents (details in paper) |
| ST-MoE | 2022 | Zoph et al. (Google) | seq2seq (Mixture of Experts) | stable training of a large-scale sparse (Mixture of Experts) language model | 269B (MoE) | a mix of the C4 corpus and the dataset used for GLaM |
| InstructGPT (API) | 2022 | Ouyang et al. (OpenAI) | autoregressive | GPT-3 model trained to follow instructions using Reinforcement Learning from Human Feedback (RLHF) | 175B | human demonstrations of desired model behavior for prompts (manually written + collected via the OpenAI API) |
| Chinchilla | 2022 | Hoffmann et al. (DeepMind) | autoregressive | compute-optimal training: 4x smaller than Gopher but trained on 4x more data; beats larger models on many downstream tasks | 70B | MassiveText (a different subset distribution than in Gopher) |
| PaLM | 2022 | Chowdhery et al. (Google) | autoregressive | largest model to date, efficiently trained using the Google Pathways system | 540B | based on the datasets used in GLaM and LaMDA |
| Anthropic assistant | 2022 | Bai et al. (Anthropic) | autoregressive (for dialogue) | dialogue agent based on a language model trained with RLHF to be helpful and harmless | up to 52B | The Pile |
| GPT-NeoX (weights) | 2022 | Black et al. (EleutherAI) | autoregressive | largest publicly available dense autoregressive model at that time | 20B | The Pile |
| OPT (weights) | 2022 | Zhang et al. (Meta) | autoregressive | a family of language models (up to 175B) that, apart from the largest one, have publicly available weights | up to 175B | dataset from RoBERTa + The Pile + Reddit |
| YaLM (weights) | 2022 | repository only (Yandex) | autoregressive | bilingual GPT-like model for English and Russian | 100B | The Pile + a large collection of Russian texts |
| Atlas | 2022 | Izacard et al. (Meta) | seq2seq (+ retrieval) | T5 language model + retrieval from a corpus of documents (joint pre-training) | up to 11B | Wikipedia, CommonCrawl |
| Sparrow | 2022 | Glaese et al. (DeepMind) | autoregressive (for dialogue) | dialogue agent based on the Chinchilla LM, trained with RLHF to be helpful and harmless, able to retrieve information from external sources | 70B | dialogue data collected through interaction with human annotators |
| GLM-130B (weights) | 2022 | Zeng et al. (Tsinghua University) | autoregressive blank infilling | open bilingual 130B model for English and Chinese | 130B | The Pile, LAMBADA |
| Flan-T5 (weights) & Flan-PaLM | 2022 | Chung et al. (Google) | seq2seq / autoregressive | T5 and PaLM models fine-tuned with instructions (Flan-T5 weights released in several sizes) | up to 540B | a mixture of 1,836 fine-tuning tasks from 4 sources (details in paper) |
| BLOOM (weights) | 2022 | Le Scao et al. (BigScience) | autoregressive | a 176B-parameter model resulting from the BigScience collaboration (trained for 3.5 months in the first half of the year) | 176B | ROOTS dataset (a mix of natural and programming languages) |
| BLOOMZ (weights) | 2022 | Muennighoff et al. (BigScience) | autoregressive | BLOOM fine-tuned on instructions | 176B | xP3 |
| Galactica (weights) | 2022 | Taylor et al. (Meta) | autoregressive | a model trained on a corpus of scientific knowledge, performing strongly on knowledge-intensive scientific tasks | up to 120B | papers, textbooks, encyclopedias, code, knowledge bases, etc. |
| ChatGPT (API) | 2022 | blog post only for now (OpenAI) | autoregressive (for dialogue) | a model trained in a similar way to InstructGPT, using RLHF, in a dialogue/chat framework | ? | human demonstrations of desired model behavior for prompts (see InstructGPT) |
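
Many of the entries marked with "(weights)" have checkpoints published on the Hugging Face Hub and can be loaded with the `transformers` library. A minimal sketch, assuming `transformers` and `torch` are installed, using the openly released GPT-2 checkpoint (Hub identifier `gpt2`) as an example:

```python
# Minimal sketch: load one of the open-weight models from the table (GPT-2 here)
# and generate a short continuation. Assumes `pip install transformers torch`;
# "gpt2" refers to the smallest (124M-parameter) GPT-2 checkpoint on the Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

# Encode a prompt, generate up to 20 new tokens greedily, and decode the result.
inputs = tokenizer("The Transformer architecture was introduced in", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20, do_sample=False)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Masked-LM entries such as BERT or RoBERTa can be loaded the same way via `AutoModelForMaskedLM`, and seq2seq entries such as T5 or BART via `AutoModelForSeq2SeqLM`.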
