Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds

This repository contains the source code for Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds.

Links

Installation
Quick Start
Data
Running on New Datasets
Citation

Installation

The code is written in C and Python 3.6. The Python dependencies are listed in requirements.txt. You can install them as follows:

pip3 install -r requirements.txt

Quick Start

To reproduce the results in our paper, first download the datasets. Three datasets are used in our paper: SciDocs, Amazon, and Twitter. Once you unzip the downloaded file (i.e., SeeTopic.zip), you will see five folders: scidocs/, amazon/, and twitter/ are the dataset folders of SciDocs, Amazon, and Twitter, respectively; bert-base-uncased/ and biobert-v1.1/ contain the pre-trained BERT and BioBERT models downloaded from Hugging Face.

Put the five folders under the main directory (./), then run the following script.

./seetopic.sh

The topic mining result will be in {dataset}/keywords_seetopic.txt. For example, if you are using the SciDocs dataset, the result will be in scidocs/keywords_seetopic.txt.

To evaluate the result (using automatic evaluation metrics), you need to run the following script.

./evaluation.sh

PMI, NPMI, LCP, and Diversity scores will be printed out.

NOTE: There is an error in our original evaluation code, which halves all NPMI and LCP scores reported in our paper. (In other words, if the NPMI or LCP score reported in our paper is 0.1, then the true value should be 0.2.) We have corrected this error in our current evaluation code.
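
For readers unfamiliar with the metrics, the following is a minimal sketch of document-level PMI and NPMI for a single pair of terms, using the textbook definitions. It is not taken from evaluation.py, which may differ in smoothing, in how term pairs are averaged, and in how edge cases are handled. The function name pmi_npmi and the whitespace tokenization are assumptions made for illustration.

```python
import math

def pmi_npmi(docs, w1, w2):
    """Document-level PMI and NPMI of two terms (textbook definitions).

    docs: list of documents, each a whitespace-tokenized string.
    Returns (pmi, npmi), or (None, None) if either term never occurs.
    """
    n = len(docs)
    token_sets = [set(d.split()) for d in docs]
    c1 = sum(w1 in s for s in token_sets)               # documents containing w1
    c2 = sum(w2 in s for s in token_sets)               # documents containing w2
    c12 = sum(w1 in s and w2 in s for s in token_sets)  # documents containing both
    if c1 == 0 or c2 == 0:
        return None, None
    if c12 == 0:
        # Assumption: the usual convention for terms that never co-occur.
        return float("-inf"), -1.0
    p1, p2, p12 = c1 / n, c2 / n, c12 / n
    pmi = math.log(p12 / (p1 * p2))
    if p12 == 1.0:
        # Both terms appear in every document, so -log(p12) is 0.
        # Assumption: report NPMI as 1.0 in this degenerate case.
        return pmi, 1.0
    return pmi, pmi / -math.log(p12)

# Toy corpus; in this repository the counts would come from {dataset}_test.txt.
toy_docs = ["a b", "a b", "c d", "c e"]
print(pmi_npmi(toy_docs, "a", "b"))
```

NPMI divides PMI by -log P(w1, w2), which bounds the score to [-1, 1] and makes it less sensitive to rare terms than raw PMI.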

Data

Three datasets are used in our paper. For each dataset, we use 60% of the documents to perform topic mining and the remaining 40% for automatic evaluation (i.e., calculating PMI, NPMI, and LCP scores). In each dataset folder, you can see three files. We use scidocs/ as an example for explanation.

(1) scidocs/scidocs.txt contains the 60% of the documents used for topic mining. Each line is a document.

(2) scidocs/scidocs_test.txt contains the remaining 40% of the documents, used for automatic evaluation. Each line is a document.

(3) scidocs/keywords_0.txt contains the seeds used in topic mining. Each line is a seed, written as index:seed. For example:

0:cardiovascular_diseases
1:chronic_kidney_disease
2:chronic_respiratory_diseases
3:diabetes_mellitus
4:digestive_diseases
5:hiv/aids
6:hepatitis_a/b/c/e
7:mental_disorders
8:musculoskeletal_disorders
9:neoplasms_(cancer)
10:neurological_disorders
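
If you need to read a seed file in your own scripts, a minimal sketch is given below. It assumes the index:seed layout shown in the example above; parse_seeds is a hypothetical helper, not a function from this repository.

```python
def parse_seeds(lines):
    """Parse lines of the form 'index:seed' into a dict {index: seed}."""
    seeds = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Split on the first colon only, so the seed itself may contain
        # other punctuation such as '/' or parentheses.
        index, seed = line.split(":", 1)
        seeds[int(index)] = seed
    return seeds

example = ["0:cardiovascular_diseases", "5:hiv/aids", "9:neoplasms_(cancer)"]
seeds = parse_seeds(example)
for index in sorted(seeds):
    print(index, seeds[index])
```

To read from disk, pass an open file handle instead of a list, e.g., parse_seeds(open("scidocs/keywords_0.txt", encoding="utf-8")).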

Running on New Datasets

If you have a new dataset, please take the following steps to run our code on your dataset.

NOTE: By default, our code uses BERT-base-uncased as the pre-trained language model. Please make sure your input corpus is already lowercased before running our code. Alternatively, you can use a cased model (e.g., BERT-base-cased) by specifying it here.

(1) Prepare the input files. You need a corpus ({dataset}/{dataset}.txt) to perform topic mining and a set of seeds (see {dataset}/keywords_0.txt). If you would like to calculate the PMI, NPMI, and LCP scores, you also need a test corpus ({dataset}/{dataset}_test.txt) to count the (co-)occurrences of top-ranked terms.

(2) You can use any tool to preprocess your corpus (e.g., phrase chunking, lowercasing). If you would like to follow our practice, please refer to the CatE preprocessing step, which uses AutoPhrase.

(3) You can use any BERT-based pre-trained language model that you think is more suitable for your seeds and corpus (e.g., BERT-base-cased, SciBERT, ChemBERT).

(4) Run ./seetopic.sh. Before running it, make sure you have changed the dataset name and the language model folder.
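
As an illustration of step (1), the sketch below writes the three input files for a new dataset. It is a hedged example and not part of this repository: prepare_dataset and its parameters are hypothetical, the 60/40 split simply mirrors the split described in the Data section, and joining multi-word seeds with underscores is inferred from the seed example there. Phrase chunking (e.g., with AutoPhrase) is not covered.

```python
import random
from pathlib import Path

def prepare_dataset(docs, seeds, dataset, root=".", train_frac=0.6, rng_seed=0):
    """Write {dataset}.txt, {dataset}_test.txt and keywords_0.txt under root/dataset/.

    Returns the number of documents written to the mining and test files.
    """
    out_dir = Path(root) / dataset
    out_dir.mkdir(parents=True, exist_ok=True)

    # One document per line, lowercased for an uncased language model.
    docs = [" ".join(d.lower().split()) for d in docs]
    docs = [d for d in docs if d]
    random.Random(rng_seed).shuffle(docs)

    # Assumption: reuse the 60/40 mining/evaluation split from the paper.
    cut = int(len(docs) * train_frac)
    (out_dir / f"{dataset}.txt").write_text(
        "\n".join(docs[:cut]) + "\n", encoding="utf-8")
    (out_dir / f"{dataset}_test.txt").write_text(
        "\n".join(docs[cut:]) + "\n", encoding="utf-8")

    # Seeds are written as index:seed, with spaces replaced by underscores.
    seed_lines = [f"{i}:{s.lower().replace(' ', '_')}" for i, s in enumerate(seeds)]
    (out_dir / "keywords_0.txt").write_text(
        "\n".join(seed_lines) + "\n", encoding="utf-8")
    return cut, len(docs) - cut
```

For example, prepare_dataset(my_docs, ["heart disease", "hiv/aids"], "mydata") creates mydata/mydata.txt, mydata/mydata_test.txt, and mydata/keywords_0.txt, where my_docs is your own list of document strings.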

Citation

If you find this repository useful, please cite the following paper:

@inproceedings{zhang2022seed,
  title={Seed-Guided Topic Discovery with Out-of-Vocabulary Seeds},
  author={Zhang, Yu and Meng, Yu and Wang, Xuan and Wang, Sheng and Han, Jiawei},
  booktitle={NAACL'22},
  pages={279--290},
  year={2022}
}
