OAG-AQA

Prerequisites

Linux
Python 3.7
PyTorch 1.10.0+cu111

Getting Started

Installation

Clone this repo.

git clone https://github.com/THUDM/OAG-AQA.git
cd OAG-AQA

Install the dependencies:

pip install -r requirements.txt
python -m spacy download en_core_web_sm

OAG-QA Dataset

The raw dataset can be downloaded from BaiduPan (password: v2bb), Aliyun, or DropBox. The processed data can be downloaded from Aliyun. Unzip the processed data and put the files into the data/kddcup/dpr directory.
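Before moving on, it can help to confirm that the unzipped files landed where the later steps expect them. The following is a minimal sketch, not part of the repository: the helper name `missing_files` is hypothetical, and the file list is assembled from the file names mentioned later in this README, under the assumption that they all sit directly in data/kddcup/dpr.

```python
from pathlib import Path

# File names taken from the configuration steps of this README. That all of
# them live directly in one directory is an assumption of this sketch.
EXPECTED_FILES = [
    "candidate_papers.tsv",
    "train_with_hn.json",
    "dev.json",
    "qa_valid_dpr.tsv",
]


def missing_files(data_dir, names=EXPECTED_FILES):
    """Return the names from `names` that are not regular files in `data_dir`."""
    root = Path(data_dir)
    return [name for name in names if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("data/kddcup/dpr")
    print("missing files:", missing if missing else "none")
```

An empty result means every expected file was found; any listed name points to a file that still needs to be unzipped or moved.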

Note: In train_with_hn.json, negative_ctxs are randomly sampled from the candidate papers, and hard_negative_ctxs are randomly sampled from the references of positive samples. The references of positive samples are drawn from the [DBLP Citation Dataset].
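To see how many negatives and hard negatives the training file actually contains, the records can be tallied per key. This is a minimal sketch that assumes train_with_hn.json follows the standard DPR training layout, a JSON list of records; the positive_ctxs key and the top-level list are assumptions not confirmed by this README, and the function names are hypothetical.

```python
import json

# positive_ctxs is assumed from the standard DPR format; the other two keys
# are the ones named in the note above.
CTX_KEYS = ("positive_ctxs", "negative_ctxs", "hard_negative_ctxs")


def context_counts(records):
    """Sum the number of contexts per key over all records (missing keys count as 0)."""
    totals = {key: 0 for key in CTX_KEYS}
    for record in records:
        for key in CTX_KEYS:
            totals[key] += len(record.get(key, []))
    return totals


def load_records(path):
    """Load a JSON file that is assumed to hold a list of training records."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

For example, `context_counts(load_records("data/kddcup/dpr/train_with_hn.json"))` would return one total per key if the file matches the assumed layout.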

Run Baseline for KDD Cup 2024

We provide a baseline method, DPR (Dense Passage Retrieval).

cd $project_path
export CUDA_VISIBLE_DEVICES='?'  # replace ? with the GPU id(s) to use, e.g. 0 or 0,1
export PYTHONPATH="`pwd`:$PYTHONPATH"

Training DPR

Configure the following paths before training. Absolute paths are recommended, here and in the steps below.

dpr_stackex_qa in conf/ctx_sources/default_sources.yaml.
==> candidate_papers.tsv contains the descriptions of the candidate papers and is provided in the processed data files.
stackex_qa_train and stackex_qa_valid in conf/datasets/encoder_train_default.yaml.
==> train_with_hn.json and dev.json are the processed training and validation data provided in the processed data files.
pretrained_model_cfg and pretrained_file in conf/encoder/hf_bert.yaml.
==> Download the bert-base-uncased model from [Aliyun].

bash train_dpr.sh

Generating Paper Embeddings

Configure the following paths before generating paper embeddings.

model_file and out_file in conf/gen_embs.yaml.
==> model_file is the pre-trained DPR checkpoint. You can use the checkpoint from the previous step or the provided checkpoint [Aliyun Download].

python generate_dense_embeddings.py

Retrieval and Evaluation

Configure the following paths before retrieval and evaluation.

stackex_qa_test in conf/datasets/retriever_default.yaml.
==> qa_valid_dpr.tsv is the processed validation data provided in the processed data files.
model_dir and epoch in dense_retriever.sh.
==> model_dir is the path of the saved model and epoch is the epoch selected for evaluation.

bash dense_retriever.sh
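Conceptually, the retrieval step ranks candidate papers by the similarity between a question embedding and each paper embedding; DPR as originally described uses the inner product for this. The following is an illustrative sketch of that idea only, not the code in dense_retriever.py, which may rely on an index library. The function names, the dictionary input, and the default k=20 are assumptions made for the example.

```python
import heapq


def inner_product(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def top_k(question_vec, paper_vecs, k=20):
    """Return the ids of the k highest-scoring papers, best first.

    paper_vecs maps a paper id to its embedding vector.
    """
    scored = ((inner_product(question_vec, vec), pid) for pid, vec in paper_vecs.items())
    return [pid for _, pid in heapq.nlargest(k, scored)]
```

A brute-force scan like this is only practical for small candidate sets; it is meant to show the ranking criterion rather than an efficient implementation.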

Output

The output files for validation submission are in the same directory as model_dir. We evaluated the checkpoint at epoch 29; its MAP on the validation set is 0.16909.
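For readers unfamiliar with the metric, MAP (mean average precision) averages, over all questions, the precision measured at each rank where a relevant paper appears. The sketch below implements the textbook definition; the exact variant used for the competition score (for example, a rank cutoff) is not stated in this README, so treat this as an approximation. The function names are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked list against a set of relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    seen = set()
    for rank, pid in enumerate(ranked_ids, start=1):
        # Count each relevant id once, even if it is repeated in the ranking.
        if pid in relevant and pid not in seen:
            hits += 1
            precision_sum += hits / rank
        seen.add(pid)
    return precision_sum / len(relevant)


def mean_average_precision(rankings, ground_truth):
    """Mean of average_precision over paired rankings and relevant-id collections."""
    if not rankings:
        return 0.0
    pairs = zip(rankings, ground_truth)
    return sum(average_precision(r, g) for r, g in pairs) / len(rankings)
```

Relevant papers that never appear in a ranking still count in the denominator, so a retriever is penalized for missing them.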

Citation

If you find this dataset useful in your research, please cite the following papers:

@inproceedings{tam2023parameter,
  title={Parameter-Efficient Prompt Tuning Makes Generalized and Calibrated Neural Text Retrievers},
  author={Weng Tam and Xiao Liu and Kaixuan Ji and Lilong Xue and Jiahua Liu and Tao Li and Yuxiao Dong and Jie Tang},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2023},
  pages={13117--13130},
  year={2023}
}

@article{zhang2024oag,
    title={OAG-Bench: A Human-Curated Benchmark for Academic Graph Mining},
    author={Fanjin Zhang and Shijie Shi and Yifan Zhu and Bo Chen and Yukuo Cen and Jifan Yu and Yelin Chen and Lulu Wang and Qingfei Zhao and Yuqing Cheng and Tianyi Han and Yuwei An and Dan Zhang and Weng Lam Tam and Kun Cao and Yunhe Pang and Xinyu Guan and Huihui Yuan and Jian Song and Xiaoyan Li and Yuxiao Dong and Jie Tang},
    journal={arXiv preprint arXiv:2402.15810},
    year={2024}
}
