amzn/informative-diverse-hard-negative-sampling

InDi: Informative and Diverse Sampling for Dense Retrieval

InDi is an extension of the popular Tevatron package (commit) that adds a novel procedure for selecting negative samples. Inspired by ideas from the Active Learning field, it looks for samples on which the model is uncertain while ensuring high diversity among them.
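
To make the idea of combining informativeness with diversity concrete, here is a generic greedy sketch in the spirit of active-learning selection. It is not InDi's actual procedure. The function name, the `(doc_id, uncertainty, embedding)` tuple format, and the `diversity_weight` parameter are all assumptions made for illustration.

```python
import math


def select_informative_diverse(candidates, k, diversity_weight=1.0):
    """Greedily pick up to k candidates, trading off uncertainty and diversity.

    candidates: list of (doc_id, uncertainty, embedding) tuples.
    This input format is hypothetical and not taken from the InDi code.
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def gain(item):
            _, uncertainty, embedding = item
            if not selected:
                # Nothing chosen yet: rank by uncertainty alone.
                return uncertainty
            # Distance to the closest already-selected candidate.
            nearest = min(math.dist(embedding, s[2]) for s in selected)
            return uncertainty + diversity_weight * nearest

        best = max(remaining, key=gain)
        selected.append(best)
        remaining.remove(best)
    return [doc_id for doc_id, _, _ in selected]
```

With `diversity_weight=0` the selection reduces to taking the most uncertain candidates. Larger values favor candidates that lie far from those already chosen.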

Instructions

Prerequisite

To run InDi, you need the MS-MARCO corpus, a trained baseline Tevatron model (called S1), and vector embeddings for all documents in the corpus. To obtain these, follow the instructions at https://github.com/texttron/tevatron/blob/main/examples/coCondenser-marco/README.md, up to and including the search step (https://github.com/texttron/tevatron/blob/main/examples/coCondenser-marco/README.md#search). This process generates the following output files, which must appear in the resources directory:

  • corpus/*.json - the tokenized MS-MARCO corpus.
  • encoding/*.pt - the vector embeddings of the MS-MARCO corpus, generated by the S1 model.
  • qrels.train.tsv - the QRels file of the training dataset.
  • scores/*.parquet - dual-encoder scores (and, optionally, cross-encoder scores) for the top-200 documents retrieved by model S1. The file contains the following columns: qid, docid, de_score, ce_score (optional).
  • train.query.txt - the queries in the training dataset.
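
Before launching a run, it can help to confirm that the files listed above are in place. The following is a minimal sketch using only the standard library. The helper name `missing_resources` and its `resources_dir` argument are hypothetical and not part of InDi; only the glob patterns come from the list above.

```python
from pathlib import Path

# Patterns copied from the list of expected output files.
EXPECTED_RESOURCES = [
    "corpus/*.json",
    "encoding/*.pt",
    "qrels.train.tsv",
    "scores/*.parquet",
    "train.query.txt",
]


def missing_resources(resources_dir):
    """Return the expected patterns that match no file under resources_dir."""
    root = Path(resources_dir)
    return [pattern for pattern in EXPECTED_RESOURCES if not any(root.glob(pattern))]
```

An empty return value means every pattern matched at least one file. Pass whatever path your setup uses as the resources directory.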

For evaluation, download the QRels file from https://microsoft.github.io/msmarco/.

Running

To execute InDi, run:

python -m active_learning.main_marco

Citation

If you find InDi helpful, please consider citing our paper.

@article{cohen2024indi,
  title={InDi: Informative and diverse sampling for dense retrieval},
  author={Cohen, Nachshon and Indelman, Hedda Cohen and Fairstein, Yaron and Kushilevitz, Guy},
  journal={ECIR},
  year={2024}
}