marialymperaiou/knowledge-enhanced-multimodal-learning

Knowledge-enhanced multimodal learning

This repository lists papers on knowledge-enhanced multimodal learning, inspired by Awesome Vision-and-Language.

Content

Surveys

Datasets

Models

Knowledge-enhanced VQA

  1. Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries https://arxiv.org/abs/1507.05670
  2. Image Captioning and Visual Question Answering Based on Attributes and External Knowledge https://ieeexplore.ieee.org/document/7934440
  3. Explicit Knowledge-based Reasoning for Visual Question Answering https://arxiv.org/abs/1511.02570
  4. FVQA: Fact-based Visual Question Answering https://arxiv.org/abs/1606.05433
  5. Straight to the Facts: Learning Knowledge Base Retrieval for Factual Visual Question Answering https://arxiv.org/abs/1809.01124
  6. Out of the Box: Reasoning with Graph Convolution Nets for Factual Visual Question Answering https://arxiv.org/abs/1811.00538
  7. KVQA: Knowledge-Aware Visual Question Answering https://ojs.aaai.org/index.php/AAAI/article/view/4915
  8. From Strings to Things: Knowledge-Enabled VQA Model That Can Read and Reason https://ieeexplore.ieee.org/abstract/document/9010987
  9. Cross-modal Knowledge Reasoning for Knowledge-based Visual Question Answering https://arxiv.org/abs/2009.00145
  10. Mucko: Multi-Layer Cross-Modal Knowledge Reasoning for Fact-based Visual Question Answering https://arxiv.org/abs/2006.09073
  11. Boosting Visual Question Answering with Context-aware Knowledge Aggregation https://dl.acm.org/doi/pdf/10.1145/3394171.3413943
  12. Zero-shot Visual Question Answering using Knowledge Graph https://arxiv.org/abs/2107.05348
  13. Towards Knowledge-Augmented Visual Question Answering https://aclanthology.org/2020.coling-main.169/
  14. An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA https://arxiv.org/abs/2109.05014
  15. Image Captioning for Effective Use of Language Models in Knowledge-Based Visual Question Answering https://arxiv.org/abs/2109.08029
  16. A Dataset and Baselines for Visual Question Answering on Art https://arxiv.org/abs/2008.12520
  17. Knowledge is Power: Hierarchical-Knowledge Embedded Meta-Learning for Visual Reasoning in Artistic Domains https://dl.acm.org/doi/pdf/10.1145/3447548.3467285
  18. ConceptBert: Concept-Aware Representation for Visual Question Answering https://aclanthology.org/2020.findings-emnlp.44/
  19. Weakly-Supervised Visual-Retriever-Reader for Knowledge-based Question Answering https://aclanthology.org/2021.emnlp-main.517.pdf
  20. KRISP: Integrating Implicit and Symbolic Knowledge for Open-Domain Knowledge-Based VQA https://arxiv.org/abs/2012.11014
  21. EKTVQA: Generalized use of External Knowledge to empower Scene Text in Text-VQA https://arxiv.org/abs/2108.09717
  22. Multi-Modal Answer Validation for Knowledge-Based VQA https://arxiv.org/abs/2103.12248
  23. Passage Retrieval for Outside-Knowledge Visual Question Answering https://arxiv.org/abs/2105.03938
  24. Improving and Diagnosing Knowledge-Based Visual Question Answering via Entity Enhanced Knowledge Injection https://arxiv.org/abs/2112.06888
  25. Select, Substitute, Search: A New Benchmark for Knowledge-Augmented Visual Question Answering https://arxiv.org/abs/2103.05568

Knowledge-enhanced visual commonsense reasoning

  1. KVL-BERT: Knowledge Enhanced Visual-and-Linguistic BERT for Visual Commonsense Reasoning https://arxiv.org/abs/2012.07000
  2. Vision–Language–Knowledge Co-Embedding for Visual Commonsense Reasoning https://www.mdpi.com/1424-8220/21/9/2911
  3. Multi-Level Knowledge Injecting for Visual Commonsense Reasoning https://ieeexplore.ieee.org/abstract/document/9083951

Knowledge-enhanced visual reasoning

  1. Explainable High-order Visual Question Reasoning: A New Benchmark and Knowledge-routed Network https://arxiv.org/abs/1909.10128
  2. Explainable and Explicit Visual Reasoning over Scene Graphs https://arxiv.org/abs/1812.01855

Knowledge-enhanced image captioning

  1. Improving Image Captioning by Leveraging Knowledge Graphs https://arxiv.org/abs/1901.08942
  2. Relational Reasoning using Prior Knowledge for Visual Captioning https://arxiv.org/abs/1906.01290
  3. Image Captioning with Internal and External Knowledge https://dl.acm.org/doi/pdf/10.1145/3340531.3411948
  4. Integrating Image Captioning with Rule-based Entity Masking https://arxiv.org/abs/2007.11690
  5. Joint Commonsense and Relation Reasoning for Image and Video Captioning https://ojs.aaai.org/index.php/AAAI/article/view/6731
  6. Auto-Encoding Scene Graphs for Image Captioning https://arxiv.org/abs/1812.02378
  7. Injecting Prior Knowledge into Image Caption Generation https://arxiv.org/abs/1911.10082
  8. Boosting Entity-aware Image Captioning with Multi-modal Knowledge Graph https://arxiv.org/abs/2107.11970
  9. KM-BART: Knowledge Enhanced Multimodal BART for Visual Commonsense Generation https://arxiv.org/abs/2101.00419
  10. Unified Vision-Language Pre-Training for Image Captioning and VQA https://arxiv.org/abs/1909.11059

Knowledge-enhanced visual storytelling

  1. Knowledgeable Storyteller: A Commonsense-Driven Generative Model for Visual Storytelling https://www.ijcai.org/proceedings/2019/744
  2. Knowledge-Enriched Visual Storytelling https://arxiv.org/abs/1912.01496
  3. Imagine, Reason and Write: Visual Storytelling with Graph Knowledge and Relational Reasoning https://ojs.aaai.org/index.php/AAAI/article/view/16410
  4. Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling https://arxiv.org/abs/2102.02963

Knowledge-enhanced image generation from text

  1. KG-GAN: Knowledge-Guided Generative Adversarial Networks https://arxiv.org/abs/1905.12261
  2. Integrating Visuospatial, Linguistic and Commonsense Structure into Story Visualization https://arxiv.org/abs/2110.10834
  3. StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation https://arxiv.org/abs/2209.06192

Knowledge-enhanced visual dialog

  1. Reasoning with Multi-Structure Commonsense Knowledge in Visual Dialog https://arxiv.org/abs/2204.04680

Multi-task models

  1. Natural Language Rationales with Full-Stack Visual Reasoning: From Pixels to Semantic Frames to Commonsense Graphs https://arxiv.org/abs/2010.07526
  2. Grounded Situation Recognition https://arxiv.org/abs/2003.12058
  3. Reasoning over Vision and Language: Exploring the Benefits of Supplemental Knowledge https://arxiv.org/abs/2101.06013
  4. KB-VLP: Knowledge Based Vision and Language Pretraining https://www.microsoft.com/en-us/research/uploads/prod/2021/10/kb_vlp_ICML2021.pdf
  5. Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks https://arxiv.org/abs/2004.06165
  6. ERNIE-ViL: Knowledge Enhanced Vision-Language Representations Through Scene Graph https://arxiv.org/abs/2006.16934
  7. ROSITA: Enhancing Vision-and-Language Semantic Alignments via Cross- and Intra-modal Knowledge Integration https://arxiv.org/abs/2108.07073