dair-ai/ML-Papers-Explained

ML Papers Explained

Explanations of key concepts in ML

Language Models

Paper Date Description
Transformer June 2017 An encoder-decoder model that introduced the multi-head attention mechanism for the language translation task.
Elmo February 2018 Deep contextualized word representations that capture both intricate aspects of word usage and contextual variations across language contexts.
GPT June 2018 A Decoder only transformer which is autoregressively pretrained and then finetuned for specific downstream tasks using task-aware input transformations.
BERT October 2018 Introduced pre-training for Encoder Transformers. Uses a unified architecture across different tasks.
Transformer XL January 2019 Extends the original Transformer model to handle longer sequences of text by introducing recurrence into the self-attention mechanism.
GPT 2 February 2019 Demonstrates that language models begin to learn various language processing tasks without any explicit supervision.
Sparse Transformer April 2019 Introduced sparse factorizations of the attention matrix to reduce the time and memory consumption to O(n√n) in the sequence length n.
UniLM May 2019 Utilizes a shared Transformer network and specific self-attention masks to excel in both language understanding and generation tasks.
XLNet June 2019 Extension of the Transformer-XL, pre-trained using a new method that combines ideas from AR and AE objectives.
RoBERTa July 2019 Built upon BERT, by carefully optimizing hyperparameters and training data size to improve performance on various language tasks.
Sentence BERT August 2019 A modification of BERT that uses siamese and triplet network structures to derive sentence embeddings that can be compared using cosine-similarity.
Tiny BERT September 2019 Uses attention transfer, and task specific distillation for distilling BERT.
ALBERT September 2019 Presents certain parameter reduction techniques to lower memory consumption and increase the training speed of BERT.
Distil BERT October 2019 Distills BERT on very large batches leveraging gradient accumulation, using dynamic masking and without the next sentence prediction objective.
T5 October 2019 A unified encoder-decoder framework that converts all text-based language problems into a text-to-text format.
BART October 2019 An Encoder-Decoder pretrained to reconstruct the original text from corrupted versions of it.
UniLMv2 February 2020 Utilizes a pseudo-masked language model (PMLM) for both autoencoding and partially autoregressive language modeling tasks, significantly advancing the capabilities of language models in diverse NLP tasks.
FastBERT April 2020 A speed-tunable encoder with adaptive inference time having branches at each transformer output to enable early outputs.
MobileBERT April 2020 Compressed and faster version of BERT, featuring bottleneck structures, optimized attention mechanisms, and knowledge transfer.
Longformer April 2020 Introduces a linearly scalable attention mechanism, allowing it to handle texts of extended length.
GPT 3 May 2020 Demonstrates that scaling up language models greatly improves task-agnostic, few-shot performance.
DeBERTa June 2020 Enhances BERT and RoBERTa through disentangled attention mechanisms, an enhanced mask decoder, and virtual adversarial training.
DeBERTa v2 June 2020 Enhanced version of DeBERTa featuring a new vocabulary, nGiE integration, optimized attention mechanisms, additional model sizes, and improved tokenization.
T5 v1.1 July 2020 An enhanced version of the original T5 model, featuring improvements such as GEGLU activation, no dropout in pre-training, exclusive pre-training on C4, no parameter sharing between embedding and classifier layers.
mT5 October 2020 A multilingual variant of T5 based on T5 v1.1, pre-trained on a new Common Crawl-based dataset covering 101 languages (mC4).
Codex July 2021 A GPT language model finetuned on publicly available code from GitHub.
FLAN September 2021 An instruction-tuned language model developed through finetuning on various NLP datasets described by natural language instructions.
T0 October 2021 An encoder-decoder model fine-tuned on a multitask mixture covering a wide variety of tasks, attaining strong zero-shot performance on several standard datasets.
WebGPT December 2021 A fine-tuned GPT-3 model utilizing text-based web browsing, trained via imitation learning and human feedback, enhancing its ability to answer long-form questions with factual accuracy.
Gopher December 2021 Provides a comprehensive analysis of the performance of various Transformer models across different scales up to 280B on 152 tasks.
LaMDA January 2022 Transformer based models specialized for dialog, which are pre-trained on public dialog data and web text.
Instruct GPT March 2022 Fine-tuned GPT using supervised learning (instruction tuning) and reinforcement learning from human feedback to align with user intent.
CodeGen March 2022 An LLM trained for program synthesis using input-output examples and natural language descriptions.
Chinchilla March 2022 Investigated the optimal model size and number of tokens for training a transformer LLM within a given compute budget (Scaling Laws).
PaLM April 2022 A 540B-parameter, densely activated Transformer trained using Pathways (an ML system that enables highly efficient training across multiple TPU Pods).
GPT-NeoX-20B April 2022 An autoregressive LLM trained on the Pile, and the largest dense model that had publicly available weights at the time of submission.
OPT May 2022 A suite of decoder-only pre-trained transformers with parameter counts ranging from 125M to 175B. OPT-175B is comparable to GPT-3.
Flan T5, Flan PaLM October 2022 Explores instruction fine tuning with a particular focus on scaling the number of tasks, scaling the model size, and fine tuning on chain-of-thought data.
BLOOM November 2022 A 176B-parameter open-access decoder-only transformer, collaboratively developed by hundreds of researchers, aiming to democratize LLM technology.
BLOOMZ, mT0 November 2022 Applies Multitask prompted fine tuning to the pretrained multilingual models on English tasks with English prompts to attain task generalization to non-English languages that appear only in the pretraining corpus.
Galactica November 2022 An LLM trained on scientific data, thus specializing in scientific knowledge.
ChatGPT November 2022 An interactive model designed to engage in conversations, built on top of GPT 3.5.
Self Instruct December 2022 A framework for improving the instruction-following capabilities of pretrained language models by bootstrapping off their own generations.
LLaMA February 2023 A collection of foundation LLMs by Meta ranging from 7B to 65B parameters, trained using publicly available datasets exclusively.
Alpaca March 2023 A fine-tuned LLaMA 7B model, trained on instruction-following demonstrations generated in the style of self-instruct using text-davinci-003.
GPT 4 March 2023 A multimodal transformer model pre-trained to predict the next token in a document, which can accept image and text inputs and produce text outputs.
Vicuna March 2023 A 13B LLaMA chatbot fine tuned on user-shared conversations collected from ShareGPT, capable of generating more detailed and well-structured answers compared to Alpaca.
BloombergGPT March 2023 A 50B language model trained on general purpose and domain specific data to support a wide range of tasks within the financial industry.
Pythia April 2023 A suite of 16 LLMs all trained on public data seen in the exact same order and ranging in size from 70M to 12B parameters.
WizardLM April 2023 Introduces Evol-Instruct, a method to generate large amounts of instruction data with varying levels of complexity using an LLM instead of humans, and uses that data to fine tune a LLaMA model.
CodeGen2 May 2023 Proposes an approach to make the training of LLMs for program synthesis more efficient by unifying key components of model architectures, learning methods, infill sampling, and data distributions.
PaLM 2 May 2023 Successor of PaLM, trained on a mixture of different pre-training objectives in order to understand different aspects of language.
LIMA May 2023 A LLaMA model fine-tuned on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling.
Falcon June 2023 An Open Source LLM trained on properly filtered and deduplicated web data alone.
Phi-1 June 2023 An LLM for code, trained using textbook-quality data from the web and synthetically generated textbooks and exercises with GPT-3.5.
WizardCoder June 2023 Enhances the performance of the open-source Code LLM, StarCoder, through the application of Code Evol-Instruct.
LLaMA 2 July 2023 Successor of LLaMA. LLaMA 2-Chat is optimized for dialogue use cases.
Humpback August 2023 LLaMA finetuned using instruction backtranslation.
Code LLaMA August 2023 LLaMA 2 based LLM for code.
WizardMath August 2023 Proposes Reinforcement Learning from Evol-Instruct Feedback (RLEIF) method, applied to Llama-2 to enhance the mathematical reasoning abilities.
LLaMA 2 Long September 2023 A series of long context LLMs that support effective context windows of up to 32,768 tokens.
Phi-1.5 September 2023 Follows the phi-1 approach, focusing this time on common sense reasoning in natural language.
Mistral 7B October 2023 Leverages grouped-query attention for faster inference, coupled with sliding window attention to effectively handle sequences of arbitrary length with a reduced inference cost.
Llemma October 2023 An LLM for mathematics, formed by continued pretraining of Code Llama on a mixture of scientific papers, web data containing mathematics, and mathematical code.
CodeFusion October 2023 A diffusion code generation model that iteratively refines entire programs based on encoded natural language, overcoming the limitation of auto-regressive models in code generation by allowing reconsideration of earlier tokens.
Zephyr 7B October 2023 Utilizes dDPO and AI Feedback (AIF) preference data to achieve superior intent alignment in chat-based language modeling.
Phi-2 December 2023 A 2.7B model, developed to explore whether emergent abilities achieved by large-scale language models can also be achieved at a smaller scale using strategic choices for training, such as data selection.
TinyLlama January 2024 A 1.1B language model built upon the architecture and tokenizer of Llama 2, pre-trained on around 1 trillion tokens for approximately 3 epochs, leveraging FlashAttention and Grouped Query Attention, to achieve better computational efficiency.
Mixtral 8x7B January 2024 A Sparse Mixture of Experts language model trained with multilingual data using a context size of 32k tokens.
H2O Danube 1.8B January 2024 A language model trained on 1T tokens following the core principles of LLaMA 2 and Mistral, leveraging and refining various techniques for pre-training large language models.
OLMo February 2024 A state-of-the-art, truly open language model and framework that includes training data, code, and tools for building, studying, and advancing language models.
Gemma February 2024 A family of 2B and 7B, state-of-the-art language models based on Google's Gemini models, offering advancements in language understanding, reasoning, and safety.
Aya 101 February 2024 A massively multilingual generative language model that follows instructions in 101 languages, trained by finetuning mT5.
Hawk, Griffin February 2024 Introduces Real Gated Linear Recurrent Unit Layer that forms the core of the new recurrent block, replacing Multi-Query Attention for better efficiency and scalability.
WRAP March 2024 Uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles to jointly pre-train LLMs on real and synthetic rephrases.
DBRX March 2024 A 132B open, general-purpose fine grained Sparse MoE LLM surpassing GPT-3.5 and competitive with Gemini 1.0 Pro.
CodeGemma April 2024 Open code models based on Gemma models by further training on over 500 billion tokens of primarily code.
RecurrentGemma April 2024 Based on Griffin, uses a combination of linear recurrences and local attention instead of global attention to model long sequences efficiently.
Rho-1 April 2024 Introduces Selective Language Modelling that optimizes the loss only on tokens that align with a desired distribution, utilizing a reference model to score and select tokens.
Phi-3 April 2024 A series of language models trained on heavily filtered web and synthetic data set, achieving performance comparable to much larger models like Mixtral 8x7B and GPT-3.5.
Open ELM April 2024 A fully open language model designed to enhance accuracy while using fewer parameters and pre-training tokens. Utilizes a layer-wise scaling strategy to allocate smaller dimensions in early layers, expanding in later layers.
H2O Danube2 1.8B April 2024 An updated version of the original H2O-Danube model, with improvements including removal of sliding window attention, changes to the tokenizer, and adjustments to the training data, resulting in significant performance enhancements.
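
The Longformer and Mistral 7B rows above describe attention whose cost grows linearly with sequence length. A minimal sketch of that idea, assuming a causal sliding window in which each token attends to itself and the previous `window - 1` tokens (Longformer's window is symmetric and adds global tokens, which this sketch leaves out). The function names are illustrative and do not come from either paper's code.

```python
def sliding_window_mask(seq_len, window):
    """Return a seq_len x seq_len boolean mask.

    mask[i][j] is True when token i may attend to token j: j must not be
    in the future, and must be at most `window - 1` positions back.
    """
    return [
        [0 <= i - j < window for j in range(seq_len)]
        for i in range(seq_len)
    ]


def attended_pairs(mask):
    """Count the query/key pairs that would actually be computed."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    for n in (8, 16, 32):
        full = n * (n + 1) // 2  # pairs under plain causal attention
        local = attended_pairs(sliding_window_mask(n, window=4))
        print(f"n={n}: full causal={full}, sliding window={local}")
```

With a fixed window, doubling `n` roughly doubles the number of attended pairs, whereas the full causal count roughly quadruples.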

Multi Modal Language Models

Paper Date Description
Flamingo April 2022 Visual Language Models enabling seamless handling of interleaved visual and textual data, and facilitating few-shot learning on large-scale web corpora.
LLaVA 1 April 2023 A large multimodal model connecting CLIP and Vicuna trained end-to-end on instruction-following data generated through GPT-4 from image-text pairs.
GPT-4V September 2023 A multimodal model that combines text and vision capabilities, allowing users to instruct it to analyze image inputs.
LLaVA 1.5 October 2023 An enhanced version of the LLaVA model that incorporates a CLIP-ViT-L-336px with an MLP projection and academic-task-oriented VQA data to set new benchmarks in large multimodal models (LMM) research.
Gemini 1.0 December 2023 A family of highly capable multi-modal models, trained jointly across image, audio, video, and text data for the purpose of building a model with strong generalist capabilities across modalities.
MoE-LLaVA January 2024 A MoE-based sparse LVLM framework that activates only the top-k experts through routers during deployment, maintaining computational efficiency while achieving comparable performance to larger models.
LLaVA 1.6 January 2024 An improved version of LLaVA 1.5 with enhanced reasoning, OCR, and world knowledge capabilities, featuring increased image resolution.
Gemini 1.5 Pro February 2024 A highly compute-efficient multimodal mixture-of-experts model that excels in long-context retrieval tasks and understanding across text, video, and audio modalities.
MM1 March 2024 A multimodal LLM that combines a ViT-H image encoder with 378x378px resolution, pretrained on a data mix of image-text documents and text-only documents, scaled up to 3B, 7B, and 30B parameters for enhanced performance across various tasks.

Language Models for Retrieval

Paper Date Description
Dense Passage Retriever April 2020 Shows that retrieval can be practically implemented using dense representations alone, where embeddings are learned from a small number of questions and passages by a simple dual encoder framework.
ColBERT April 2020 Introduces a late interaction architecture that adapts deep LMs (in particular, BERT) for efficient retrieval.
ColBERTv2 December 2021 Couples an aggressive residual compression mechanism with a denoised supervision strategy to simultaneously improve the quality and space footprint of late interaction.
E5 December 2022 A family of text embeddings trained in a contrastive manner with weak supervision signals from a curated large-scale text pair dataset CCPairs.
E5 Mistral 7B December 2023 Leverages proprietary LLMs to generate diverse synthetic data to fine tune open-source decoder-only LLMs for hundreds of thousands of text embedding tasks.
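
The late interaction scoring mentioned in the ColBERT rows can be sketched as follows: every query token keeps its own embedding, is matched to its most similar document token, and the per-token maxima are summed. This is a toy illustration that assumes the token embeddings are already computed and unit-length; the vectors and function names are made up for the example.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def late_interaction_score(query_vecs, doc_vecs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]              # two query token embeddings
    doc_a = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # matches both query tokens
    doc_b = [[1.0, 0.0], [0.7, 0.1]]              # matches only the first
    print(late_interaction_score(query, doc_a))
    print(late_interaction_score(query, doc_b))
```

Because document token embeddings do not depend on the query, they can be computed and indexed offline, which is what makes this style of scoring practical for retrieval.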

Representation Learning

Paper Date Description
CLIP February 2021 A vision system that learns image representations from raw text-image pairs through pre-training, enabling zero-shot transfer to various downstream tasks.
Matryoshka Representation Learning May 2022 Encodes information at different granularities and allows a flexible representation that can adapt to multiple downstream tasks with varying computational resources using a single embedding.
Nomic Embed Text v1 February 2024 A 137M parameter, open-source English text embedding model with an 8192 context length that outperforms OpenAI's models on both short and long-context tasks.
Nomic Embed Text v1.5 February 2024 An advanced text embedding model that utilizes Matryoshka Representation Learning to offer flexible embedding sizes with minimal performance trade-offs.
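
The flexible embedding sizes described in the Matryoshka rows come down to using a prefix of the full vector. A minimal sketch, assuming the model was trained so that leading dimensions are meaningful on their own; the helper names and toy vectors are hypothetical.

```python
import math


def truncate_and_normalize(embedding, dim):
    """Keep the first `dim` coordinates and rescale them to unit length."""
    head = embedding[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero prefix")
    return [x / norm for x in head]


def cosine(u, v):
    """Cosine similarity of two unit-length vectors (a plain dot product)."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    a = [0.6, 0.8, 0.0, 0.0]
    b = [0.8, 0.6, 0.0, 0.0]
    for dim in (4, 2):
        sim = cosine(truncate_and_normalize(a, dim), truncate_and_normalize(b, dim))
        print(dim, round(sim, 4))
```

Shorter prefixes trade some accuracy for smaller indexes and faster comparisons; how much accuracy is lost depends on the model and is not shown by this toy example.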

Compression, Pruning, Quantization

Paper Date Description
LLMLingua October 2023 A novel coarse-to-fine prompt compression method, incorporating a budget controller, an iterative token-level compression algorithm, and distribution alignment, achieving up to 20x compression with minimal performance loss.
LongLLMLingua October 2023 A novel approach for prompt compression to enhance performance in long context scenarios using question-aware compression and document reordering.

Vision Models

Paper Date Description
Vision Transformer October 2020 Images are segmented into patches, which are treated as tokens, and a sequence of linear embeddings of these patches is input to a Transformer.
DeiT December 2020 A convolution-free vision transformer that uses a teacher-student strategy with attention-based distillation tokens.
Swin Transformer March 2021 A hierarchical vision transformer that uses shifted windows to address the challenges of adapting the transformer model to computer vision.
BEiT June 2021 Utilizes a masked image modeling task inspired by BERT, involving image patches and visual tokens to pretrain vision Transformers.
MobileViT October 2021 A lightweight vision transformer designed for mobile devices, effectively combining the strengths of CNNs and ViTs.
Masked AutoEncoder November 2021 An encoder-decoder architecture that reconstructs input images by masking random patches and leveraging a high proportion of masking for self-supervision.
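
The patch tokenization described in the Vision Transformer row can be sketched in a few lines. This example assumes a single-channel image stored as a list of rows and stops before the learned linear projection; `image_to_patches` is an illustrative name.

```python
def image_to_patches(image, patch):
    """Split an H x W image into flattened patch x patch tokens, row-major."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


if __name__ == "__main__":
    # A 4 x 4 image whose pixel values are 0..15.
    image = [[4 * r + c for c in range(4)] for r in range(4)]
    for token in image_to_patches(image, patch=2):
        print(token)
```

Each flattened patch would then be multiplied by a learned projection matrix and given a position embedding before entering the Transformer.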

Convolutional Neural Networks

Paper Date Description
Lenet December 1998 Introduced Convolutions.
Alex Net September 2012 Introduced ReLU activation and Dropout to CNNs. Winner ILSVRC 2012.
VGG September 2014 Used a large number of small filters in each layer to learn complex features. Achieved SOTA in ILSVRC 2014.
Inception Net September 2014 Introduced Inception Modules consisting of multiple parallel convolutional layers, designed to recognize different features at multiple scales.
Inception Net v2 / Inception Net v3 December 2015 Design Optimizations of the Inception Modules which improved performance and accuracy.
Res Net December 2015 Introduced residual connections, which are shortcuts that bypass one or more layers in the network. Winner ILSVRC 2015.
Inception Net v4 / Inception ResNet February 2016 Hybrid approach combining Inception Net and ResNet.
Dense Net August 2016 Each layer receives input from all the previous layers, creating a dense network of connections between the layers, allowing the network to learn more diverse features.
Xception October 2016 Based on InceptionV3 but uses depthwise separable convolutions instead of Inception modules.
Res Next November 2016 Built over ResNet, introduces the concept of grouped convolutions, where the filters in a convolutional layer are divided into multiple groups.
Mobile Net V1 April 2017 Uses depthwise separable convolutions to reduce the number of parameters and computation required.
Mobile Net V2 January 2018 Built upon the MobileNetv1 architecture, uses inverted residuals and linear bottlenecks.
Mobile Net V3 May 2019 Uses AutoML to find the best possible neural network architecture for a given problem.
Efficient Net May 2019 Uses a compound scaling method to scale the network's depth, width, and resolution to achieve a high accuracy with a relatively low computational cost.
NF Net February 2021 An improved class of Normalizer-Free ResNets that train without batch normalization, offer faster training times, and introduce an adaptive gradient clipping technique to overcome instabilities associated with deep ResNets.
Conv Mixer January 2022 Processes image patches using standard convolutions for mixing spatial and channel dimensions.
ConvNeXt January 2022 A pure ConvNet model, evolved from standard ResNet design, that competes well with Transformers in accuracy and scalability.
ConvNeXt V2 January 2023 Incorporates a fully convolutional MAE framework and a Global Response Normalization (GRN) layer, boosting performance across multiple benchmarks.
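
The parameter savings from the depthwise separable convolutions mentioned in the Xception and Mobile Net V1 rows follow from simple counting. A small sketch, ignoring bias terms; the layer sizes in the usage example are arbitrary.

```python
def standard_conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution (no bias)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution plus a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


if __name__ == "__main__":
    k, c_in, c_out = 3, 64, 128
    std = standard_conv_params(k, c_in, c_out)
    sep = depthwise_separable_params(k, c_in, c_out)
    print(std, sep, round(std / sep, 2))  # 73728 8768 8.41
```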

Object Detection

Paper Date Description
SSD December 2015 Discretizes bounding box outputs over a span of various scales and aspect ratios per feature map.
Feature Pyramid Network December 2016 Leverages the inherent multi-scale hierarchy of deep convolutional networks to efficiently construct feature pyramids.
Focal Loss August 2017 Addresses class imbalance in dense object detectors by down-weighting the loss assigned to well-classified examples.
DETR May 2020 A novel object detection model that treats object detection as a set prediction problem, eliminating the need for hand-designed components.
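
The down-weighting described in the Focal Loss row can be written for a single binary prediction as below. The defaults `gamma=2.0` and `alpha=0.25` are assumptions chosen for illustration, and the clamp on `p_t` is a numerical safeguard added here rather than part of the definition.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one example.

    p: predicted probability of the positive class, y: label in {0, 1}.
    """
    p_t = p if y == 1 else 1.0 - p              # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    p_t = min(max(p_t, 1e-12), 1.0)             # avoid log(0)
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    easy = focal_loss(0.9, 1)  # confidently correct prediction
    hard = focal_loss(0.1, 1)  # confidently wrong prediction
    print(easy, hard)
```

Setting `gamma=0` removes the modulating factor and leaves alpha-weighted cross-entropy, which is a quick way to check an implementation.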

Region-based Convolutional Neural Networks

Paper Date Description
RCNN November 2013 Uses selective search for region proposals, CNNs for feature extraction, SVM for classification followed by box offset regression.
Fast RCNN April 2015 Processes entire image through CNN, employs RoI Pooling to extract feature vectors from ROIs, followed by classification and BBox regression.
Faster RCNN June 2015 A region proposal network (RPN) and a Fast R-CNN detector collaboratively predict object regions by sharing convolutional features.
Mask RCNN March 2017 Extends Faster R-CNN to solve instance segmentation tasks, by adding a branch for predicting an object mask in parallel with the existing branch.
Cascade RCNN December 2017 Proposes a multi-stage approach where detectors are trained with progressively higher IoU thresholds, improving selectivity against false positives.
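
The IoU thresholds in the Cascade RCNN row refer to intersection over union between a predicted box and a ground-truth box. A minimal sketch, assuming boxes are `(x1, y1, x2, y2)` tuples with `x1 < x2` and `y1 < y2`; the stage thresholds in the usage example are illustrative values.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def positives_per_stage(iou_value, thresholds):
    """For each stage threshold, report whether the box counts as a positive."""
    return [iou_value >= t for t in thresholds]


if __name__ == "__main__":
    print(iou((0, 0, 2, 2), (1, 1, 3, 3)))             # 1 / 7
    print(positives_per_stage(0.65, (0.5, 0.6, 0.7)))  # [True, True, False]
```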

Document AI

Paper Date Description
Table Net January 2020 An end-to-end deep learning model designed for both table detection and structure recognition.
Donut November 2021 An OCR-free Encoder-Decoder Transformer model. The encoder takes in images, decoder takes in prompts & encoded images to generate the required text.
DiT March 2022 An Image Transformer pre-trained (self-supervised) on document images.
UDoP December 2022 Integrates text, image, and layout information through a Vision-Text-Layout Transformer, enabling unified representation.
DocLLM January 2024 A lightweight extension to traditional LLMs that focuses on reasoning over visual documents, by incorporating textual semantics and spatial layout without expensive image encoders.

Layout Transformers

Paper Date Description
Layout LM December 2019 Utilises BERT as the backbone, adds two new input embeddings: 2-D position embedding and image embedding (Only for downstream tasks).
LamBERT February 2020 Utilises RoBERTa as the backbone and adds Layout embeddings along with relative bias.
Layout LM v2 December 2020 Uses a multi-modal Transformer model to integrate text, layout, and image in the pre-training stage and learn end-to-end cross-modal interaction.
Structural LM May 2021 Utilises BERT as the backbone and feeds text, 1D and (2D cell level) embeddings to the transformer model.
Doc Former June 2021 Encoder-only transformer with a CNN backbone for visual feature extraction, combines text, vision, and spatial features through a multi-modal self-attention layer.
LiLT February 2022 Introduced Bi-directional attention complementation mechanism (BiACM) to accomplish the cross-modal interaction of text and layout.
Layout LM V3 April 2022 A unified text-image multimodal Transformer to learn cross-modal representations, which takes as input the concatenation of text embeddings and image embeddings.
ERNIE Layout October 2022 Reorganizes tokens using layout information, combines text and visual embeddings, utilizes multi-modal transformers with spatial aware disentangled attention.

Generative Adversarial Networks

Paper Date Description
Generative Adversarial Networks June 2014 Introduces a framework where a generative model and a discriminative model are trained simultaneously in a minimax game.
Conditional Generative Adversarial Networks November 2014 A method for training GANs that enables generation based on specific conditions, by feeding them to both the generator and discriminator networks.
Deep Convolutional Generative Adversarial Networks November 2015 Demonstrates the ability of CNNs for unsupervised learning using specifically designed architectural constraints.
Improved GAN June 2016 Presents a variety of new architectural features and training procedures that can be applied to the generative adversarial networks (GANs) framework.
Wasserstein Generative Adversarial Networks January 2017 An alternative GAN training algorithm that enhances learning stability and mitigates issues like mode collapse.
Cycle GAN March 2017 An approach for learning to translate an image from a source domain X to a target domain Y in the absence of paired examples by leveraging adversarial losses and cycle consistency constraints, using two GANs.
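
For reference, the minimax game mentioned in the first row of this table is usually written as the following value function, where D is the discriminator, G the generator, p_data the data distribution, and p_z the noise prior (the symbols are the standard ones, not taken from this README):

```latex
\min_G \max_D V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\big[\log D(x)\big]
  + \mathbb{E}_{z \sim p_z(z)}\big[\log\big(1 - D(G(z))\big)\big]
```

The discriminator tries to increase V by scoring real samples high and generated samples low, while the generator tries to decrease it by producing samples the discriminator scores high.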

Tabular Deep Learning

Paper Date Description
Entity Embeddings April 2016 Maps categorical variables into continuous vector spaces through neural network learning, revealing intrinsic properties.
Wide and Deep Learning June 2016 Combines memorization of specific patterns with generalization of similarities.
Deep and Cross Network August 2017 Combines a novel cross network with deep neural networks (DNNs) to efficiently learn feature interactions without manual feature engineering.
Tab Transformer December 2020 Employs multi-head attention-based Transformer layers to convert categorical feature embeddings into robust contextual embeddings.
Tabular ResNet June 2021 An MLP with skip connections.
Feature Tokenizer Transformer June 2021 Transforms all features (categorical and numerical) to embeddings and applies a stack of Transformer layers to the embeddings.

Miscellaneous

Paper Date Description
ColD Fusion December 2022 A method enabling the benefits of multitask learning through distributed computation without data sharing and improving model performance.
Are Emergent Abilities of Large Language Models a Mirage? April 2023 This paper presents an alternative explanation for emergent abilities, i.e. emergent abilities are created by the researcher’s choice of metrics, not fundamental changes in model family behaviour on specific tasks with scale.
Scaling Data-Constrained Language Models May 2023 This study investigates scaling language models in data-constrained regimes.
An In-depth Look at Gemini's Language Abilities December 2023 A third-party, objective comparison of the abilities of the OpenAI GPT and Google Gemini models with reproducible code and fully transparent results.
Dolma January 2024 An open corpus of three trillion tokens designed to support language model pretraining research.
Aya Dataset February 2024 A human-curated instruction-following dataset that spans 65 languages, created to bridge the language gap in datasets for natural language processing.
DSPy October 2023 A programming model that abstracts LM pipelines as text transformation graphs, i.e. imperative computation graphs where LMs are invoked through declarative modules, optimizing their use through a structured framework of signatures, modules, and teleprompters to automate and enhance text transformation tasks.

Literature Reviewed

Reading Lists


Reach out to Ritvik or Elvis if you have any questions.

If you are interested in contributing, feel free to open a PR.

Join our Discord
