yuzhimanhua/Awesome-Scientific-Language-Models

Awesome Scientific Language Models

License: MIT

A curated list of pre-trained language models in scientific domains (e.g., mathematics, physics, chemistry, biology, medicine, materials science, and geoscience), covering different model sizes (from <100M to 70B parameters) and modalities (e.g., language, vision, graph, molecule, protein, genome, and climate time series). The repository will be continuously updated.

NOTE 1: To avoid ambiguity, when we talk about the number of parameters in a model, "Base" refers to 110M (i.e., BERT-Base), and "Large" refers to 340M (i.e., BERT-Large). Other numbers will be written explicitly.
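As an illustration of the convention in NOTE 1, the size labels used in the `[Model (...)]` links could be converted to parameter counts roughly as follows. This is a minimal sketch, not part of the repository: the function name `parse_model_size` and the two lookup tables are hypothetical, and the code assumes labels take the form `Base`, `Large`, or a number followed by `M` or `B`, optionally followed by a comma and a backbone name.

```python
import re

# Convention from NOTE 1: "Base" = 110M (BERT-Base), "Large" = 340M (BERT-Large).
NAMED_SIZES = {"base": 110_000_000, "large": 340_000_000}
UNIT_MULTIPLIERS = {"M": 1_000_000, "B": 1_000_000_000}


def parse_model_size(label: str) -> int:
    """Return the parameter count for a label such as 'Base', '406M', or '6.7B, Galactica'.

    Hypothetical helper for illustration only.
    """
    # Keep only the size part; a backbone name may follow after a comma.
    size = label.split(",")[0].strip()
    if size.lower() in NAMED_SIZES:
        return NAMED_SIZES[size.lower()]
    # Explicit sizes are a number followed by a unit, e.g. "12M" or "1.2B".
    match = re.fullmatch(r"(\d+(?:\.\d+)?)([MB])", size)
    if match is None:
        raise ValueError(f"unrecognized size label: {label!r}")
    number, unit = match.groups()
    return round(float(number) * UNIT_MULTIPLIERS[unit])
```

For example, `parse_model_size("Base")` and `parse_model_size("Base, BERT")` both resolve to 110,000,000 parameters, while `parse_model_size("7B")` resolves to 7,000,000,000.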

NOTE 2: In each subsection, papers are sorted chronologically. If a paper has a preprint version (e.g., on arXiv or bioRxiv), its publication date is the date it appeared on the preprint service. Otherwise, its publication date is that of the conference proceedings or journal.
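The ordering rule in NOTE 2 can be sketched in a few lines. This explains why an entry with a later venue year may appear before one with an earlier venue year. The `Paper` class, its field names, and `sort_chronologically` are hypothetical and exist only for illustration; the repository itself is maintained by hand.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Paper:
    """Hypothetical record for one list entry."""
    name: str
    venue_date: date                      # conference proceedings or journal date
    preprint_date: Optional[date] = None  # arXiv/bioRxiv date, if a preprint exists

    @property
    def publication_date(self) -> date:
        # NOTE 2: the preprint date takes precedence when a preprint exists.
        return self.preprint_date or self.venue_date


def sort_chronologically(papers: List[Paper]) -> List[Paper]:
    """Return papers ordered by publication date as defined in NOTE 2."""
    return sorted(papers, key=lambda p: p.publication_date)
```

With made-up entries, a paper published in a journal in 2020 but posted as a preprint in January 2019 would be listed before a paper that appeared only in mid-2019 conference proceedings.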

NOTE 3: We appreciate contributions. If you have any suggested papers, feel free to reach out to [email protected] or submit a pull request. For format consistency, we will include a paper after (1) it has a version with author names AND (2) its GitHub and/or Hugging Face links are available.

General

Language

Language + Graph

  • (SPECTER) SPECTER: Document-level Representation Learning using Citation-informed Transformers ACL 2020
    [Paper] [GitHub] [Model (Base)]

  • (OAG-BERT) OAG-BERT: Towards a Unified Backbone Language Model for Academic Knowledge Services KDD 2022
    [Paper] [GitHub]

  • (ASPIRE) Multi-Vector Models with Textual Guidance for Fine-Grained Scientific Document Similarity NAACL 2022
    [Paper] [GitHub] [Model (Base)]

  • (SciNCL) Neighborhood Contrastive Learning for Scientific Document Representations with Citation Embeddings EMNLP 2022
    [Paper] [GitHub] [Model (Base)]

  • (SPECTER 2.0) SciRepEval: A Multi-Format Benchmark for Scientific Document Representations EMNLP 2023
    [Paper] [GitHub] [Model (113M)]

  • (SciPatton) Patton: Language Model Pretraining on Text-Rich Networks ACL 2023
    [Paper] [GitHub]

  • (SciMult) Pre-training Multi-task Contrastive Learning Models for Scientific Literature Understanding EMNLP 2023 Findings
    [Paper] [GitHub] [Model (138M)]

Mathematics

Language

Language + Vision

  • (Inter-GPS) Inter-GPS: Interpretable Geometry Problem Solving with Formal Language and Symbolic Reasoning ACL 2021
    [Paper] [GitHub]

  • (Geoformer) UniGeo: Unifying Geometry Logical Reasoning via Reformulating Mathematical Expression EMNLP 2022
    [Paper] [GitHub]

  • (SCA-GPS) A Symbolic Character-Aware Model for Solving Geometry Problems ACM MM 2023
    [Paper] [GitHub]

  • (UniMath-Flan-T5) UniMath: A Foundational and Multimodal Mathematical Reasoner EMNLP 2023
    [Paper] [GitHub]

  • (G-LLaVA) G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model arXiv 2023
    [Paper] [GitHub]

Other Modalities (Table)

  • (TAPAS) TAPAS: Weakly Supervised Table Parsing via Pre-training ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (TaBERT) TaBERT: Learning Contextual Representations for Natural Language Utterances and Structured Tables ACL 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (GraPPa) GraPPa: Grammar-Augmented Pre-training for Table Semantic Parsing ICLR 2021
    [Paper] [GitHub] [Model (355M)]

  • (TUTA) TUTA: Tree-based Transformers for Generally Structured Table Pre-training KDD 2021
    [Paper] [GitHub]

  • (RCI) Capturing Row and Column Semantics in Transformer Based Question Answering over Tables NAACL 2021
    [Paper] [GitHub] [Model (12M)]

  • (TABBIE) TABBIE: Pretrained Representations of Tabular Data NAACL 2021
    [Paper] [GitHub]

  • (TAPEX) TAPEX: Table Pre-training via Learning a Neural SQL Executor ICLR 2022
    [Paper] [GitHub] [Model (140M)] [Model (406M)]

  • (FORTAP) FORTAP: Using Formulas for Numerical-Reasoning-Aware Table Pretraining ACL 2022
    [Paper] [GitHub]

  • (OmniTab) OmniTab: Pretraining with Natural and Synthetic Data for Few-shot Table-based Question Answering NAACL 2022
    [Paper] [GitHub] [Model (406M)]

  • (ReasTAP) ReasTAP: Injecting Table Reasoning Skills During Pre-training via Synthetic Reasoning Examples EMNLP 2022
    [Paper] [GitHub] [Model (406M)]

  • (Table-GPT) Table-GPT: Table-tuned GPT for Diverse Table Tasks arXiv 2023
    [Paper]

  • (TableLlama) TableLlama: Towards Open Large Generalist Models for Tables NAACL 2024
    [Paper] [GitHub] [Model (7B)]

  • (TableLLM) TableLLM: Enabling Tabular Data Manipulation by LLMs in Real Office Usage Scenarios arXiv 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

Physics

Language

  • (astroBERT) Building astroBERT, a Language Model for Astronomy & Astrophysics arXiv 2021
    [Paper] [Model (Base)]

  • (AstroLLaMA) AstroLLaMA: Towards Specialized Foundation Models in Astronomy AACL 2023 Workshop
    [Paper] [Model (7B)]

  • (AstroLLaMA-Chat) AstroLLaMA-Chat: Scaling AstroLLaMA with Conversational and Diverse Datasets Research Notes of the AAS 2024
    [Paper] [Model (7B)]

Chemistry and Materials Science

Language

  • (ChemBERT) Automated Chemical Reaction Extraction from Scientific Literature Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatSciBERT) MatSciBERT: A Materials Domain Language Model for Text Mining and Information Extraction npj Computational Materials 2022
    [Paper] [GitHub] [Model (Base)]

  • (MatBERT) Quantifying the Advantage of Domain-Specific Pre-training on Named Entity Recognition Tasks in Materials Science Patterns 2022
    [Paper] [GitHub]

  • (BatteryBERT) BatteryBERT: A Pretrained Language Model for Battery Database Enhancement Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub] [Model (Base)]

  • (MaterialsBERT) A General-Purpose Material Property Data Extraction Pipeline from Large Polymer Corpora using Natural Language Processing npj Computational Materials 2023
    [Paper] [Model (Base)]

  • (CatBERTa) Catalyst Property Prediction with CatBERTa: Unveiling Feature Exploration Strategies through Large Language Models ACS Catalysis 2023
    [Paper] [GitHub]

  • (LLM-Prop) LLM-Prop: Predicting Physical and Electronic Properties of Crystalline Solids from Their Text Descriptions arXiv 2023
    [Paper] [GitHub]

  • (ChemDFM) ChemDFM: Dialogue Foundation Model for Chemistry arXiv 2024
    [Paper] [GitHub] [Model (13B)]

  • (CrystalLLM) Fine-Tuned Language Models Generate Stable Inorganic Materials as Text ICLR 2024
    [Paper] [GitHub]

  • (ChemLLM) ChemLLM: A Chemical Large Language Model arXiv 2024
    [Paper] [Model (7B)]

  • (LlaSMol) LlaSMol: Advancing Large Language Models for Chemistry with a Large-Scale, Comprehensive, High-Quality Instruction Tuning Dataset arXiv 2024
    [Paper] [GitHub] [Model (6.7B, Galactica)] [Model (7B, LLaMA-2)] [Model (7B, Mistral)]

Language + Graph

  • (Text2Mol) Text2Mol: Cross-Modal Molecule Retrieval with Natural Language Queries EMNLP 2021
    [Paper] [GitHub]

  • (KV-PLM) A Deep-learning System Bridging Molecule Structure and Biomedical Text with Comprehension Comparable to Human Professionals Nature Communications 2022
    [Paper] [GitHub] [Model (Base)]

  • (MolT5) Translation between Molecules and Natural Language EMNLP 2022
    [Paper] [GitHub] [Model (60M)] [Model (220M)] [Model (770M)]

  • (MoMu) A Molecular Multimodal Foundation Model Associating Molecule Graphs with Natural Language arXiv 2022
    [Paper] [GitHub]

  • (MoleculeSTM) Multi-modal Molecule Structure-text Model for Text-based Retrieval and Editing Nature Machine Intelligence 2023
    [Paper] [GitHub]

  • (Text+Chem T5) Unifying Molecular and Textual Representations via Multi-task Language Modelling ICML 2023
    [Paper] [GitHub] [Model (60M)] [Model (220M)]

  • (GIMLET) GIMLET: A Unified Graph-Text Model for Instruction-based Molecule Zero-Shot Learning NeurIPS 2023
    [Paper] [GitHub] [Model (60M)]

  • (MolFM) MolFM: A Multimodal Molecular Foundation Model arXiv 2023
    [Paper] [GitHub]

  • (MolCA) MolCA: Molecular Graph-Language Modeling with Cross-Modal Projector and Uni-Modal Adapter EMNLP 2023
    [Paper] [GitHub]

  • (InstructMol) InstructMol: Multi-Modal Integration for Building a Versatile and Reliable Molecular Assistant in Drug Discovery arXiv 2023
    [Paper] [GitHub]

  • (3D-MoLM) Towards 3D Molecule-Text Interpretation in Language Models ICLR 2024
    [Paper] [GitHub]

Language + Vision

  • (GIT-Mol) GIT-Mol: A Multi-modal Large Language Model for Molecular Science with Graph, Image, and Text Computers in Biology and Medicine 2024
    [Paper] [GitHub]

Other Modalities (Molecule)

  • (SMILES-BERT) SMILES-BERT: Large Scale Unsupervised Pre-training for Molecular Property Prediction ACM BCB 2019
    [Paper] [GitHub]

  • (MAT) Molecule Attention Transformer arXiv 2020
    [Paper] [GitHub]

  • (ChemBERTa) ChemBERTa: Large-Scale Self-Supervised Pretraining for Molecular Property Prediction arXiv 2020
    [Paper] [GitHub] [Model (125M)]

  • (MolBERT) Molecular Representation Learning with Language Models and Domain-Relevant Auxiliary Tasks arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (rxnfp) Mapping the Space of Chemical Reactions using Attention-based Neural Networks Nature Machine Intelligence 2021
    [Paper] [GitHub] [Model (Base)]

  • (RXNMapper) Extraction of Organic Chemistry Grammar from Unsupervised Learning of Chemical Reactions Science Advances 2021
    [Paper] [GitHub]

  • (MoLFormer) Large-Scale Chemical Language Representations Capture Molecular Structure and Properties Nature Machine Intelligence 2022
    [Paper] [GitHub] [Model (47M)]

  • (Chemformer) Chemformer: A Pre-trained Transformer for Computational Chemistry Machine Learning: Science and Technology 2022
    [Paper] [GitHub] [Model (45M)] [Model (230M)]

  • (R-MAT) Relative Molecule Self-Attention Transformer Journal of Cheminformatics 2024
    [Paper] [GitHub]

  • (MolGPT) MolGPT: Molecular Generation using a Transformer-Decoder Model Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (T5Chem) Unified Deep Learning Model for Multitask Reaction Predictions with Explanation Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (ChemGPT) Neural Scaling of Deep Chemical Models Nature Machine Intelligence 2023
    [Paper] [Model (4.7M)] [Model (19M)] [Model (1.2B)]

  • (TransPolymer) TransPolymer: A Transformer-based Language Model for Polymer Property Predictions npj Computational Materials 2023
    [Paper] [GitHub]

  • (polyBERT) polyBERT: A Chemical Language Model to Enable Fully Machine-Driven Ultrafast Polymer Informatics Nature Communications 2023
    [Paper] [GitHub] [Model (86M)]

  • (MFBERT) Large-Scale Distributed Training of Transformers for Chemical Fingerprinting Journal of Chemical Information and Modeling 2022
    [Paper] [GitHub]

  • (SPMM) Bidirectional Generation of Structure and Properties Through a Single Molecular Foundation Model Nature Communications 2024
    [Paper] [GitHub]

  • (BARTSmiles) BARTSmiles: Generative Masked Language Models for Molecular Representations arXiv 2022
    [Paper] [GitHub] [Model (406M)]

  • (MolGen) Domain-Agnostic Molecular Generation with Self-feedback ICLR 2024
    [Paper] [GitHub] [Model (406M, BART)] [Model (7B, LLaMA)]

  • (SELFormer) SELFormer: Molecular Representation Learning via SELFIES Language Models Machine Learning: Science and Technology 2023
    [Paper] [GitHub] [Model (58M)] [Model (87M)]

  • (PolyNC) PolyNC: A Natural and Chemical Language Model for the Prediction of Unified Polymer Properties Chemical Science 2024
    [Paper] [GitHub] [Model (220M)]

Biology and Medicine

Acknowledgment: We referred to Wang et al.'s survey paper Pre-trained Language Models in Biomedical Domain: A Systematic Survey when writing some parts of this section.

Language

  • (BioBERT) BioBERT: A Pre-trained Biomedical Language Representation Model for Biomedical Text Mining Bioinformatics 2020
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (BioELMo) Probing Biomedical Embeddings from Language Models NAACL 2019 Workshop
    [Paper] [GitHub] [Model (93M)]

  • (ClinicalBERT, Alsentzer et al.) Publicly Available Clinical BERT Embeddings NAACL 2019 Workshop
    [Paper] [GitHub] [Model (Base)]

  • (ClinicalBERT, Huang et al.) ClinicalBERT: Modeling Clinical Notes and Predicting Hospital Readmission arXiv 2019
    [Paper] [GitHub] [Model (Base)]

  • (BlueBERT, f.k.a. NCBI-BERT) Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets ACL 2019 Workshop
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (BEHRT) BEHRT: Transformer for Electronic Health Records Scientific Reports 2020
    [Paper] [GitHub]

  • (EhrBERT) Fine-Tuning Bidirectional Encoder Representations from Transformers (BERT)–Based Models on Large-Scale Electronic Health Record Notes: An Empirical Study JMIR Medical Informatics 2019
    [Paper] [GitHub]

  • (Clinical XLNet) Clinical XLNet: Modeling Sequential Clinical Notes and Predicting Prolonged Mechanical Ventilation EMNLP 2020 Workshop
    [Paper] [GitHub]

  • (ouBioBERT) Pre-training Technique to Localize Medical BERT and Enhance Biomedical BERT arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (COVID-Twitter-BERT) COVID-Twitter-BERT: A Natural Language Processing Model to Analyse COVID-19 Content on Twitter Frontiers in Artificial Intelligence 2023
    [Paper] [GitHub] [Model (Large)]

  • (Med-BERT) Med-BERT: Pretrained Contextualized Embeddings on Large-Scale Structured Electronic Health Records for Disease Prediction npj Digital Medicine 2021
    [Paper] [GitHub]

  • (Bio-ELECTRA) On the Effectiveness of Small, Discriminatively Pre-trained Language Representation Models for Biomedical Text Mining EMNLP 2020 Workshop
    [Paper] [GitHub] [Model (Base)]

  • (BiomedBERT, f.k.a. PubMedBERT) Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing ACM Transactions on Computing for Healthcare 2021
    [Paper] [Model (Base)] [Model (Large)]

  • (MCBERT) Conceptualized Representation Learning for Chinese Biomedical Text Mining arXiv 2020
    [Paper] [GitHub] [Model (Base)]

  • (BRLTM) Bidirectional Representation Learning from Transformers using Multimodal Electronic Health Record Data to Predict Depression JBHI 2021
    [Paper] [GitHub]

  • (BioRedditBERT) COMETA: A Corpus for Medical Entity Linking in the Social Media EMNLP 2020
    [Paper] [GitHub] [Model (Base)]

  • (BioMegatron) BioMegatron: Larger Biomedical Domain Language Model EMNLP 2020
    [Paper] [GitHub] [Model (345M)]

  • (SapBERT) Self-Alignment Pretraining for Biomedical Entity Representations NAACL 2021
    [Paper] [GitHub] [Model (Base)]

  • (ClinicalTransformer) Clinical Concept Extraction using Transformers JAMIA 2020
    [Paper] [GitHub] [Model (Base, BERT)] [Model (125M, RoBERTa)] [Model (12M, ALBERT)] [Model (Base, ELECTRA)] [Model (117M, XLNet)] [Model (149M, Longformer)] [Model (86M, DeBERTa)]

  • (BioRoBERTa) Pretrained Language Models for Biomedical and Clinical Tasks: Understanding and Extending the State-of-the-Art EMNLP 2020 Workshop
    [Paper] [GitHub] [Model (125M)] [Model (355M)]

  • (RAD-BERT) Highly Accurate Classification of Chest Radiographic Reports using a Deep Learning Natural Language Model Pre-trained on 3.8 Million Text Reports Bioinformatics 2020
    [Paper] [GitHub]

  • (BioMedBERT) BioMedBERT: A Pre-trained Biomedical Language Model for QA and IR COLING 2020
    [Paper] [GitHub]

  • (LBERT) LBERT: Lexically Aware Transformer-based Bidirectional Encoder Representation Model for Learning Universal Bio-Entity Relations Bioinformatics 2021
    [Paper] [GitHub]

  • (ELECTRAMed) ELECTRAMed: A New Pre-trained Language Representation Model for Biomedical NLP arXiv 2021
    [Paper] [GitHub] [Model (Base)]

  • (SciFive) SciFive: A Text-to-Text Transformer Model for Biomedical Literature arXiv 2021
    [Paper] [GitHub] [Model (220M)] [Model (770M)]

  • (BioALBERT) Benchmarking for Biomedical Natural Language Processing Tasks with a Domain Specific ALBERT BMC Bioinformatics 2022
    [Paper] [GitHub] [Model (12M)] [Model (18M)]

  • (Clinical-Longformer) Clinical-Longformer and Clinical-BigBird: Transformers for Long Clinical Sequences arXiv 2021
    [Paper] [GitHub] [Model (149M, Longformer)] [Model (Base, BigBird)]

  • (BioBART) BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model ACL 2022 Workshop
    [Paper] [GitHub] [Model (140M)] [Model (406M)]

  • (BioGPT) BioGPT: Generative Pre-trained Transformer for Biomedical Text Generation and Mining Briefings in Bioinformatics 2022
    [Paper] [GitHub] [Model (355M)] [Model (1.5B)]

  • (Med-PaLM) Large Language Models Encode Clinical Knowledge Nature 2023
    [Paper]

  • (ChatDoctor) ChatDoctor: A Medical Chat Model Fine-Tuned on a Large Language Model Meta-AI (LLaMA) using Medical Domain Knowledge Cureus 2023
    [Paper] [GitHub]

  • (DoctorGLM) DoctorGLM: Fine-tuning your Chinese Doctor is not a Herculean Task arXiv 2023
    [Paper] [GitHub]

  • (BenTsao, f.k.a. HuaTuo) HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge arXiv 2023
    [Paper] [GitHub]

  • (MedAlpaca) MedAlpaca - An Open-Source Collection of Medical Conversational AI Models and Training Data arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (PMC-LLaMA) PMC-LLaMA: Towards Building Open-source Language Models for Medicine arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (Med-PaLM 2) Towards Expert-Level Medical Question Answering with Large Language Models arXiv 2023
    [Paper]

  • (GatorTronGPT) A Study of Generative Large Language Model for Medical Research and Healthcare arXiv 2023
    [Paper] [GitHub] [Model (345M)]

  • (HuatuoGPT) HuatuoGPT, towards Taming Language Model to Be a Doctor EMNLP 2023 Findings
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (MedCPT) MedCPT: Contrastive Pre-trained Transformers with Large-scale PubMed Search Logs for Zero-shot Biomedical Information Retrieval Bioinformatics 2023
    [Paper] [GitHub] [Model (Base)]

  • (DISC-MedLLM) DISC-MedLLM: Bridging General Large Language Models and Real-World Medical Consultation arXiv 2023
    [Paper] [GitHub] [Model (13B)]

  • (DRG-LLaMA) DRG-LLaMA: Tuning LLaMA Model to Predict Diagnosis-related Group for Hospitalized Patients npj Digital Medicine 2024
    [Paper] [GitHub]

  • (BioT5) BioT5: Enriching Cross-modal Integration in Biology with Chemical Knowledge and Natural Language Associations EMNLP 2023
    [Paper] [GitHub] [Model (220M)]

  • (HuatuoGPT-II) HuatuoGPT-II, One-stage Training for Medical Adaption of LLMs arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (13B)] [Model (34B)]

  • (MEDITRON) MEDITRON-70B: Scaling Medical Pretraining for Large Language Models arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (70B)]

  • (PLLaMa) PLLaMa: An Open-source Large Language Model for Plant Science arXiv 2024
    [Paper] [GitHub] [Model (7B)] [Model (13B)]

  • (BioMistral) BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains arXiv 2024
    [Paper] [Model (7B)]

  • (BioMedLM, f.k.a. PubMedGPT) BioMedLM: a Domain-Specific Large Language Model for Biomedical Text arXiv 2024
    [Paper] [GitHub] [Model (2.7B)]

  • (BMRetriever) BMRetriever: Tuning Large Language Models as Better Biomedical Text Retrievers arXiv 2024
    [Paper] [GitHub] [Model (410M)] [Model (1B)] [Model (2B)] [Model (7B)]

Language + Graph

  • (G-BERT) Pre-training of Graph Augmented Transformers for Medication Recommendation IJCAI 2019
    [Paper] [GitHub]

  • (CODER) CODER: Knowledge Infused Cross-Lingual Medical Term Embedding for Term Normalization JBI 2022
    [Paper] [GitHub] [Model (Base)]

  • (KeBioLM) Improving Biomedical Pretrained Language Models with Knowledge NAACL 2021 Workshop
    [Paper] [GitHub] [Model (155M)]

  • (MoP) Mixture-of-Partitions: Infusing Large Biomedical Knowledge Graphs into BERT EMNLP 2021
    [Paper] [GitHub]

  • (BioLinkBERT) LinkBERT: Pretraining Language Models with Document Links ACL 2022
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (DRAGON) Deep Bidirectional Language-Knowledge Graph Pretraining NeurIPS 2022
    [Paper] [GitHub] [Model (360M)]

Language + Vision

  • (ConVIRT) Contrastive Learning of Medical Visual Representations from Paired Images and Text MLHC 2022
    [Paper] [GitHub]

  • (MedViLL) Multi-modal Understanding and Generation for Medical Images and Text via Vision-Language Pre-training JBHI 2022
    [Paper] [GitHub]

  • (GLoRIA) GLoRIA: A Multimodal Global-Local Representation Learning Framework for Label-efficient Medical Image Recognition ICCV 2021
    [Paper] [GitHub]

  • (LoVT) Joint Learning of Localized Representations from Medical Images and Reports ECCV 2022
    [Paper] [GitHub]

  • (CvT2DistilGPT2) Improving Chest X-Ray Report Generation by Leveraging Warm Starting Artificial Intelligence in Medicine 2023
    [Paper] [GitHub]

  • (BioViL) Making the Most of Text Semantics to Improve Biomedical Vision-Language Processing ECCV 2022
    [Paper] [GitHub]

  • (LViT) LViT: Language meets Vision Transformer in Medical Image Segmentation TMI 2022
    [Paper] [GitHub]

  • (M3AE) Multi-Modal Masked Autoencoders for Medical Vision-and-Language Pre-training MICCAI 2022
    [Paper] [GitHub]

  • (ARL) Align, Reason and Learn: Enhancing Medical Vision-and-Language Pre-training with Knowledge ACM MM 2022
    [Paper] [GitHub]

  • (CheXzero) Expert-Level Detection of Pathologies from Unannotated Chest X-ray Images via Self-Supervised Learning Nature Biomedical Engineering 2022
    [Paper] [GitHub]

  • (MGCA) Multi-Granularity Cross-modal Alignment for Generalized Medical Visual Representation Learning NeurIPS 2022
    [Paper] [GitHub]

  • (MedCLIP) MedCLIP: Contrastive Learning from Unpaired Medical Images and Text EMNLP 2022
    [Paper] [GitHub]

  • (BioViL-T) Learning to Exploit Temporal Structure for Biomedical Vision-Language Processing CVPR 2023
    [Paper] [GitHub] [Model]

  • (BiomedCLIP) BiomedCLIP: A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs arXiv 2023
    [Paper] [Model]

  • (RGRG) Interactive and Explainable Region-guided Radiology Report Generation CVPR 2023
    [Paper] [GitHub]

  • (LLaVA-Med) LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day NeurIPS 2023
    [Paper] [GitHub]

  • (MONET) Transparent Medical Image AI via an Image–Text Foundation Model Grounded in Medical Literature Nature Medicine 2024
    [Paper] [GitHub]

  • (Med-PaLM M) Towards Generalist Biomedical AI NEJM AI 2024
    [Paper] [GitHub]

  • (BioCLIP) BioCLIP: A Vision Foundation Model for the Tree of Life arXiv 2023
    [Paper] [GitHub] [Model]

Other Modalities (Protein)

Other Modalities (DNA)

Other Modalities (RNA)

  • (RNABERT) Informative RNA-base Embedding for Functional RNA Structural Alignment and Clustering by Deep Representation Learning NAR Genomics and Bioinformatics 2022
    [Paper] [GitHub]

  • (RNA-FM) Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions arXiv 2022
    [Paper] [GitHub]

  • (RNA-MSM) Multiple Sequence-Alignment-based RNA Language Model and its Application to Structural Inference Nucleic Acids Research 2024
    [Paper] [GitHub]

Other Modalities (Multiomics)

  • (scBERT) scBERT as a Large-scale Pretrained Deep Language Model for Cell Type Annotation of Single-cell RNA-seq Data Nature Machine Intelligence 2022
    [Paper] [GitHub]

  • (scGPT) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI Nature Methods 2024
    [Paper] [GitHub]

  • (scFoundation) Large Scale Foundation Model on Single-cell Transcriptomics bioRxiv 2023
    [Paper] [GitHub] [Model (100M)]

  • (Geneformer) Transfer Learning Enables Predictions in Network Biology Nature 2023
    [Paper] [Model (10M)] [Model (40M)]

  • (CellLM) Large-Scale Cell Representation Learning via Divide-and-Conquer Contrastive Learning arXiv 2023
    [Paper] [GitHub]

  • (BioMedGPT) BioMedGPT: Open Multimodal Generative Pre-trained Transformer for BioMedicine arXiv 2023
    [Paper] [GitHub] [Model (7B)] [Model (10B)]

  • (CellPLM) CellPLM: Pre-training of Cell Language Model Beyond Single Cells ICLR 2024
    [Paper] [GitHub] [Model (82M)]

Geography, Geology, and Environmental Science

Language

  • (ClimateBERT) ClimateBERT: A Pretrained Language Model for Climate-Related Text arXiv 2021
    [Paper] [GitHub] [Model (82M)]

  • (SpaBERT) SpaBERT: A Pretrained Language Model from Geographic Data for Geo-Entity Representation EMNLP 2022 Findings
    [Paper] [GitHub] [Model (Base)] [Model (Large)]

  • (MGeo) MGeo: Multi-Modal Geographic Pre-training Method SIGIR 2023
    [Paper] [GitHub]

  • (K2) K2: A Foundation Language Model for Geoscience Knowledge Understanding and Utilization WSDM 2024
    [Paper] [GitHub] [Model (7B)]

  • (OceanGPT) OceanGPT: A Large Language Model for Ocean Science Tasks arXiv 2023
    [Paper] [GitHub] [Model (7B)]

  • (ClimateBERT-NetZero) ClimateBERT-NetZero: Detecting and Assessing Net Zero and Reduction Targets EMNLP 2023
    [Paper] [Model (82M)]

  • (GeoLM) GeoLM: Empowering Language Models for Geospatially Grounded Language Understanding EMNLP 2023
    [Paper] [GitHub]

  • (GeoGalactica) GeoGalactica: A Scientific Large Language Model in Geoscience arXiv 2024
    [Paper] [GitHub] [Model (30B)]

Language + Graph

  • (ERNIE-GeoL) ERNIE-GeoL: A Geography-and-Language Pre-trained Model and its Applications in Baidu Maps KDD 2022
    [Paper]

  • (PK-Chat) PK-Chat: Pointer Network Guided Knowledge Driven Generative Dialogue Model arXiv 2023
    [Paper] [GitHub]

Language + Vision

  • (UrbanCLIP) UrbanCLIP: Learning Text-Enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web WWW 2024
    [Paper] [GitHub]

Other Modalities (Climate Time Series)

  • (FourCastNet) FourCastNet: A Global Data-driven High-resolution Weather Model using Adaptive Fourier Neural Operators arXiv 2022
    [Paper] [GitHub]

  • (Pangu-Weather) Accurate Medium-Range Global Weather Forecasting with 3D Neural Networks Nature 2023
    [Paper] [GitHub]

  • (ClimaX) ClimaX: A Foundation Model for Weather and Climate ICML 2023
    [Paper] [GitHub]

  • (FengWu) FengWu: Pushing the Skillful Global Medium-Range Weather Forecast beyond 10 Days Lead arXiv 2023
    [Paper] [GitHub]

  • (W-MAE) W-MAE: Pre-trained Weather Model with Masked Autoencoder for Multi-Variable Weather Forecasting arXiv 2023
    [Paper] [GitHub]

  • (FuXi) FuXi: A Cascade Machine Learning Forecasting System for 15-day Global Weather Forecast npj Climate and Atmospheric Science 2023
    [Paper] [GitHub]
