OAG-AQA

Prerequisites

  • Linux
  • Python 3.7
  • PyTorch 1.10.0+cu111

Getting Started

Installation

Clone this repo.

git clone https://github.com/THUDM/OAG-AQA.git
cd OAG-AQA

Install the dependencies:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

OAG-QA Dataset

The raw dataset can be downloaded from BaiduPan (password: v2bb), Aliyun, or DropBox. The processed data can be downloaded from Aliyun. Unzip the processed data and put the files into the data/kddcup/dpr directory.

Note: In train_with_hn.json, negative_ctxs are randomly sampled from candidate papers, and hard_negative_ctxs are randomly sampled from the references of positive samples. References of positive samples are sampled from [DBLP Citation Dataset].
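To get a feel for the training file, the following sketch counts the contexts attached to each record. It assumes train_with_hn.json follows the standard DPR training layout (a JSON list of records, each with positive_ctxs, negative_ctxs, and hard_negative_ctxs lists). Only the two negative field names are confirmed by the note above, and summarize_records and summarize_file are hypothetical helpers, not part of this repo.

```python
import json
from collections import Counter

# Field names assumed from the standard DPR training format.
CTX_FIELDS = ("positive_ctxs", "negative_ctxs", "hard_negative_ctxs")


def summarize_records(records):
    """Return the total number of contexts per field across all records."""
    totals = Counter()
    for record in records:
        for field in CTX_FIELDS:
            # A missing field is treated as an empty list.
            totals[field] += len(record.get(field, []))
    return dict(totals)


def summarize_file(path):
    """Load a JSON list of records from `path` and summarize it."""
    with open(path, encoding="utf-8") as f:
        return summarize_records(json.load(f))


if __name__ == "__main__":
    # Toy record for illustration only; real records come from train_with_hn.json.
    demo = [
        {
            "question": "example question",
            "positive_ctxs": [{"title": "paper A", "text": "abstract A"}],
            "negative_ctxs": [{"title": "paper B", "text": "abstract B"}],
            "hard_negative_ctxs": [],
        }
    ]
    print(summarize_records(demo))
```

Calling summarize_file("data/kddcup/dpr/train_with_hn.json") on the real file should work if the layout assumption holds.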

Run Baseline for KDD Cup 2024

We provide DPR (Dense Passage Retrieval) as a baseline method.

cd $project_path
export CUDA_VISIBLE_DEVICES='?'  # specify which GPU(s) to be used
export PYTHONPATH="`pwd`:$PYTHONPATH"

Training DPR

Configure the following paths before training. Absolute paths are recommended, here and in the steps below.

  • dpr_stackex_qa in conf/ctx_sources/default_sources.yaml
    ==> candidate_papers.tsv contains the descriptions of candidate papers and is provided in the processed data files.
  • stackex_qa_train and stackex_qa_valid in conf/datasets/encoder_train_default.yaml
    ==> train_with_hn.json and dev.json are the processed training and validation data provided in the processed data files.
  • pretrained_model_cfg and pretrained_file in conf/encoder/hf_bert.yaml
    ==> Download the bert-base-uncased model from [Aliyun].
bash train_dpr.sh
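
Misconfigured paths are an easy mistake to make at this step. A minimal pre-flight check is sketched below, assuming you copy the paths you set in the YAML files into a dictionary by hand. The missing_paths helper and the example paths are hypothetical and not part of this repo.

```python
import os


def missing_paths(paths):
    """Return the labels of configured paths that do not exist on disk."""
    return [label for label, path in paths.items() if not os.path.exists(path)]


if __name__ == "__main__":
    # Placeholder paths: replace them with the absolute paths from your configs.
    configured = {
        "dpr_stackex_qa": "/abs/path/to/data/kddcup/dpr/candidate_papers.tsv",
        "stackex_qa_train": "/abs/path/to/data/kddcup/dpr/train_with_hn.json",
        "stackex_qa_valid": "/abs/path/to/data/kddcup/dpr/dev.json",
    }
    for label in missing_paths(configured):
        print(f"missing: {label}")
```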

Generating Paper Embeddings

Configure the following paths before generating paper embeddings.

  • model_file and out_file in conf/gen_embs.yaml
    ==> model_file is the trained DPR checkpoint. You can use the checkpoint from the previous step or the provided checkpoint [Aliyun Download].
python generate_dense_embeddings.py

Retrieval and Evaluation

Configure the following paths before retrieval and evaluation.

  • stackex_qa_test in conf/datasets/retriever_default.yaml
    ==> qa_valid_dpr.tsv is the processed validation data provided in the processed data files.
  • model_dir and epoch in dense_retriever.sh
    ==> model_dir is the path where the model was saved, and epoch is the epoch selected for evaluation.
bash dense_retriever.sh
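
Conceptually, dense retrieval scores each candidate paper by the similarity between its embedding and the question embedding, then returns the top-scoring papers. The sketch below illustrates this with plain Python lists and a dot-product score, as in the original DPR formulation. The top_k helper, the toy vectors, and the paper IDs are hypothetical, and the actual script in this repo may use an index library and differ in detail.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(question_vec, paper_vecs, k=2):
    """Return the IDs of the k papers whose embeddings score highest."""
    scored = [(dot(question_vec, vec), pid) for pid, vec in paper_vecs.items()]
    # Sort by score only, highest first; ties keep insertion order.
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pid for _, pid in scored[:k]]


if __name__ == "__main__":
    # Toy 2-dimensional embeddings for illustration only.
    papers = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
    print(top_k([1.0, 0.2], papers, k=2))
```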

Output

The output files for validation submission are written to the same directory as model_dir. We evaluated the checkpoint at epoch 29; its MAP on the validation set is 0.16909.
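
For reference, mean average precision (MAP) averages, over all questions, the average precision of each ranked list. The sketch below uses the standard definition and assumes each ranked list contains no duplicate IDs. The competition's official scorer is not shown here and may differ (for example, in how it truncates rankings), and the function names and toy IDs are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant IDs."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, pid in enumerate(ranked_ids, start=1):
        if pid in relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(rankings, relevance):
    """Mean of average precision over all questions in `rankings`."""
    if not rankings:
        return 0.0
    total = sum(
        average_precision(ranked, relevance.get(qid, []))
        for qid, ranked in rankings.items()
    )
    return total / len(rankings)


if __name__ == "__main__":
    # Toy rankings and relevance labels for illustration only.
    rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y"]}
    relevance = {"q1": ["a", "c"], "q2": ["y"]}
    print(mean_average_precision(rankings, relevance))
```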

Citation

If you find this dataset useful in your research, please cite the following papers:

@inproceedings{tam2023parameter,
  title={Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers},
  author={Weng Tam and Xiao Liu and Kaixuan Ji and Lilong Xue and Jiahua Liu and Tao Li and Yuxiao Dong and Jie Tang},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
  pages={13117--13130},
  year={2023}
}

@article{zhang2024oag,
    title={OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining},
    author={Fanjin Zhang and Shijie Shi and Yifan Zhu and Bo Chen and Yukuo Cen and Jifan Yu and Yelin Chen and Lulu Wang and Qingfei Zhao and Yuqing Cheng and Tianyi Han and Yuwei An and Dan Zhang and Weng Lam Tam and Kun Cao and Yunhe Pang and Xinyu Guan and Huihui Yuan and Jian Song and Xiaoyan Li and Yuxiao Dong and Jie Tang},
    journal={arXiv preprint arXiv:2402.15810},
    year={2024}
}