- Linux
- Python 3.7
- PyTorch 1.10.0+cu111
Clone this repo.
git clone https://github.com/THUDM/OAG-AQA.git
cd OAG-AQA
Install the dependencies:
pip install -r requirements.txt
python -m spacy download en_core_web_sm
The raw dataset can be downloaded from BaiduPan with password v2bb, Aliyun or DropBox.
The processed data can be downloaded from Aliyun.
Unzip the processed data and put the files into the data/kddcup/dpr directory.
Note: In train_with_hn.json, negative_ctxs are randomly sampled from candidate papers, and hard_negative_ctxs are randomly sampled from the references of positive samples. References of positive samples are sampled from [DBLP Citation Dataset].
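To sanity-check the downloaded training file, it can help to count how many contexts of each kind it contains. The following is a minimal sketch, not part of the repository: it assumes the file holds a JSON list of records, and the `positive_ctxs` key is an assumption based on the usual DPR data format (only `negative_ctxs` and `hard_negative_ctxs` are named above). The helper names are hypothetical.

```python
import json
from pathlib import Path


def summarize_contexts(records):
    """Count context entries per field across DPR-style training records."""
    # "positive_ctxs" is assumed from the usual DPR format; the other two
    # field names are the ones mentioned in the note above.
    fields = ("positive_ctxs", "negative_ctxs", "hard_negative_ctxs")
    totals = {name: 0 for name in fields}
    for record in records:
        for name in fields:
            # Missing fields are treated as empty lists.
            totals[name] += len(record.get(name, []))
    return totals


def load_records(path):
    """Load a JSON file that is assumed to hold a list of records."""
    with Path(path).open(encoding="utf-8") as handle:
        return json.load(handle)


if __name__ == "__main__":
    # Hypothetical location, following the data directory named above.
    path = Path("data/kddcup/dpr/train_with_hn.json")
    if path.exists():
        print(summarize_contexts(load_records(path)))
```

If a field reports zero entries, the file may use different key names than assumed here.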
Run Baseline for KDD Cup 2024
We provide a baseline method, DPR.
cd $project_path
export CUDA_VISIBLE_DEVICES='?' # specify which GPU(s) to be used
export PYTHONPATH="`pwd`:$PYTHONPATH"
Configure the following paths before training (absolute paths are recommended; the same applies below).
- dpr_stackex_qa in conf/ctx_sources/default_sources.yaml ==> candidate_papers.tsv contains the descriptions of candidate papers and is provided in the processed data files.
- stackex_qa_train and stackex_qa_valid in conf/datasets/encoder_train_default.yaml ==> train_with_hn.json and dev.json are the processed training and validation data provided in the processed data files.
- pretrained_model_cfg and pretrained_file in conf/encoder/hf_bert.yaml ==> Download the bert-base-uncased model from [Aliyun].
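Since a wrong path only surfaces once training starts, a quick existence check on the configured paths may save a failed run. The sketch below is an illustration under assumptions: `find_missing` is a hypothetical helper, and the example paths are placeholders that should be replaced with the absolute paths written into the YAML files.

```python
from pathlib import Path


def find_missing(paths):
    """Return the entries of `paths` that do not exist on disk."""
    return [str(p) for p in paths if not Path(p).exists()]


if __name__ == "__main__":
    # Hypothetical example values; replace them with the absolute paths
    # you wrote into the YAML files listed above.
    configured = [
        "data/kddcup/dpr/candidate_papers.tsv",
        "data/kddcup/dpr/train_with_hn.json",
        "data/kddcup/dpr/dev.json",
    ]
    for missing in find_missing(configured):
        print(f"missing: {missing}")
```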
bash train_dpr.sh
Configure the following paths before generating paper embeddings.
- model_file and out_file in conf/gen_embs.yaml ==> model_file is the pre-trained DPR checkpoint. You can use the checkpoint from the previous step or the provided checkpoint [Aliyun Download].
python generate_dense_embeddings.py
Configure the following paths before retrieval and evaluation.
- stackex_qa_test in conf/datasets/retriever_default.yaml ==> qa_valid_dpr.tsv is the processed validation data provided in the processed data files.
- model_dir and epoch in dense_retriever.sh ==> model_dir is the saved model path and epoch is the epoch selected for evaluation.
bash dense_retriever.sh
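For intuition about what dense retrieval computes: DPR-style retrievers typically score each candidate by the inner product between the question embedding and the candidate embedding, then keep the top-scoring candidates. The sketch below illustrates that idea in plain Python with toy vectors. It is not the repository's implementation, and `retrieve`, `top_k`, and the toy data are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(question_vec, paper_vecs, top_k=20):
    """Rank papers by inner product with the question, highest first.

    `paper_vecs` maps a paper id to its embedding (a list of floats).
    Returns a list of (paper_id, score) pairs of length at most `top_k`.
    """
    scored = [(pid, dot(question_vec, vec)) for pid, vec in paper_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    # Toy two-dimensional embeddings, for illustration only.
    papers = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.5, 0.5]}
    for pid, score in retrieve([1.0, 0.2], papers, top_k=2):
        print(pid, round(score, 3))
```

A real pipeline would use the embeddings written by generate_dense_embeddings.py and a vectorized or indexed search rather than a Python loop.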
The output files for validation submission are in the same directory as model_dir.
We evaluated the checkpoint at epoch 29; its MAP value on the validation set is 0.16909.
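For readers unfamiliar with the metric, the sketch below shows the standard definition of mean average precision (MAP) over ranked retrieval results. It is an assumption that the official evaluation follows this exact definition; it may, for example, truncate the ranking at a fixed depth. The function names are hypothetical, and each ranked list is assumed to contain no duplicate ids.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant:
            hits += 1
            # Precision at the rank where a relevant item is found.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of per-question average precision.

    `rankings` maps a question id to its ranked list of paper ids;
    `ground_truth` maps a question id to its relevant paper ids.
    Questions without a ranking contribute an average precision of 0.
    """
    if not ground_truth:
        return 0.0
    total = sum(
        average_precision(rankings.get(qid, []), relevant)
        for qid, relevant in ground_truth.items()
    )
    return total / len(ground_truth)
```

For instance, a ranking `["a", "b", "c"]` with relevant papers `a` and `c` has average precision (1/1 + 2/3) / 2 = 5/6.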
If you find this dataset useful in your research, please cite the following papers:
@inproceedings{tam2023parameter,
title={Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers},
author={Weng Tam and Xiao Liu and Kaixuan Ji and Lilong Xue and Jiahua Liu and Tao Li and Yuxiao Dong and Jie Tang},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
pages={13117--13130},
year={2023}
}
@article{zhang2024oag,
title={OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining},
author={Fanjin Zhang and Shijie Shi and Yifan Zhu and Bo Chen and Yukuo Cen and Jifan Yu and Yelin Chen and Lulu Wang and Qingfei Zhao and Yuqing Cheng and Tianyi Han and Yuwei An and Dan Zhang and Weng Lam Tam and Kun Cao and Yunhe Pang and Xinyu Guan and Huihui Yuan and Jian Song and Xiaoyan Li and Yuxiao Dong and Jie Tang},
journal={arXiv preprint arXiv:2402.15810},
year={2024}
}