InDi: Informative and Diverse Sampling for Dense Retrieval

InDi is an extension of the popular Tevatron package (commit) that adds a novel procedure for selecting negative samples. Specifically, it draws on ideas from the Active Learning field to find samples about which the model is uncertain while ensuring high diversity.

Instructions

Prerequisite

To run InDi, you need the MS-MARCO corpus, a trained baseline Tevatron model (called S1), and vector embeddings for all documents in the corpus. To obtain them, follow the instructions at https://github.com/texttron/tevatron/blob/main/examples/coCondenser-marco/README.md (up to and including https://github.com/texttron/tevatron/blob/main/examples/coCondenser-marco/README.md#search). This process generates the following output files, which must be placed in the resources directory:

corpus/*.json - the tokenized MS-MARCO corpus.
encoding/*.pt - the vector embeddings of the MS-MARCO corpus generated by the S1 model.
qrels.train.tsv - the QRels file of the training dataset.
scores/*.parquet - the dual-encoder score (and, optionally, the cross-encoder score) for the top-200 documents retrieved by model S1. The file contains the following columns: qid, docid, de_score, ce_score (optional).
train.query.txt - the queries in the training dataset.
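
Before launching a run, it can help to confirm that the files listed above are in place. The following is a minimal sketch of such a check and is not part of InDi or Tevatron: the helper names missing_resources and missing_score_columns are hypothetical, and the sketch assumes the resources directory is laid out exactly as in the list above.

```python
from pathlib import Path

# Glob patterns copied from the file list above (relative to the resources directory).
EXPECTED_PATTERNS = [
    "corpus/*.json",
    "encoding/*.pt",
    "qrels.train.tsv",
    "scores/*.parquet",
    "train.query.txt",
]

# Columns the scores files are described as containing; ce_score is optional.
REQUIRED_SCORE_COLUMNS = {"qid", "docid", "de_score"}


def missing_resources(resources_dir):
    """Return the expected patterns that match no file under resources_dir."""
    root = Path(resources_dir)
    return [p for p in EXPECTED_PATTERNS if not any(root.glob(p))]


def missing_score_columns(columns):
    """Return the required score columns absent from an iterable of column names."""
    return sorted(REQUIRED_SCORE_COLUMNS - set(columns))


if __name__ == "__main__":
    # "resources" is the directory name used in this README; adjust the path if needed.
    missing = missing_resources("resources")
    if missing:
        print("Missing prerequisite files:", ", ".join(missing))
    else:
        print("All prerequisite files found.")
```

The column check takes a plain list of names, so it can be fed the column list of a scores file loaded with whichever parquet reader is available in your environment.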

For evaluation, the QRels file must be downloaded from https://microsoft.github.io/msmarco/.

Running

To execute InDi, run:

python -m active_learning.main_marco

Citation

If you find InDi helpful, please consider citing our paper.

@article{cohen2024indi,
  title={InDi: Informative and diverse sampling for dense retrieval},
  author={Cohen, Nachshon and Indelman, Hedda Cohen and Fairstein, Yaron and Kushilevitz, Guy},
  journal={ECIR},
  year={2024}
}
