GitHub - PreferredAI/topic-metrics: Your preferred tool for meddling with topics

Preferred Topic Metrics

Large-Scale Correlation Analysis of Automated Metrics for Topic Models, ACL'23

This is the accompanying code that made it possible to mine and evaluate millions of topic representations. For larger corpora, it is probably more efficient to compute the counts once and reuse them.

Most of the codebase was refactored and lightly tested on Python 3.10 (in theory it should work on >=3.6). Some functions were benchmarked for speed on an AMD EPYC 7502 @ 2.50GHz using large Wikipedia graphs:

2 minutes to calculate 40K Wikipedia NPMI graphs from count graphs (see tutorial)
80 topics/s evaluated on NPMI when lazily loading count graphs (great for evaluating a few topics)
30 seconds to load 40K Wikipedia count graphs
300 topics/s evaluated when count graphs are pre-loaded (see tutorial)
7-8 hours to count Wikipedia in sliding windows (1B+ tokens total, 5M documents)

More details can be found in the docstrings.
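To make the count-graph-to-NPMI step concrete, here is a minimal standard-library sketch of how NPMI can be derived from window counts and averaged over a topic. It is not the package's API: the names `npmi` and `topic_npmi`, the dictionary-based count layout, and the convention of returning -1 for pairs that never co-occur are all assumptions made for illustration.

```python
import math
from itertools import combinations


def npmi(count_i, count_j, count_ij, total_windows):
    """NPMI of one word pair from window counts (hypothetical helper)."""
    if count_ij == 0:
        return -1.0  # assumed convention for pairs that never co-occur
    p_i = count_i / total_windows
    p_j = count_j / total_windows
    p_ij = count_ij / total_windows
    if p_ij == 1.0:
        return 1.0  # both words appear in every window; avoids division by zero
    pmi = math.log(p_ij / (p_i * p_j))
    return pmi / -math.log(p_ij)


def topic_npmi(topic, single_counts, joint_counts, total_windows):
    """Mean NPMI over all unordered word pairs of a topic (hypothetical helper)."""
    scores = []
    for a, b in combinations(sorted(topic), 2):
        # joint_counts is assumed to be keyed by alphabetically ordered pairs
        c_ab = joint_counts.get((a, b), 0)
        scores.append(npmi(single_counts[a], single_counts[b], c_ab, total_windows))
    return sum(scores) / len(scores)


# Toy counts, not taken from any real corpus
single_counts = {"a": 10, "b": 10, "c": 50}
joint_counts = {("a", "b"): 10}
score = topic_npmi(["a", "b", "c"], single_counts, joint_counts, total_windows=100)
```

Because the score depends only on counts, computing the counts once and reusing them across many topics is what makes large-scale evaluation cheap.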

Goals

Hackable: hopefully readable and extendable for your own use cases.

Lightweight: only numpy and tqdm dependencies.

Speed: some attempts at computational efficiency.

Features

Topic evaluations
Creating count statistics from a corpus
Mining topic representations from corpora

To-do

Some convenience functions

To install

pip install git+https://github.com/PreferredAI/topic-metrics.git

Recommendations

We recommend setting a low window size (e.g. 10) and minimum frequency (e.g. 0) for large corpora.
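As an illustration of what the window size and minimum frequency control, the following is a rough sketch of boolean sliding-window counting. It is not the package's implementation: `count_windows` and its parameters are hypothetical, and treating the minimum frequency as a threshold on the number of windows a word appears in is an assumption.

```python
from collections import Counter
from itertools import combinations


def count_windows(docs, window_size=10, min_freq=0):
    """Count, per word and per word pair, the sliding windows they occur in."""
    single = Counter()
    joint = Counter()
    total = 0
    for tokens in docs:
        if not tokens:
            continue  # skip empty documents
        # A document shorter than the window is treated as one window
        n_windows = max(1, len(tokens) - window_size + 1)
        for start in range(n_windows):
            window = set(tokens[start:start + window_size])
            total += 1
            single.update(window)
            # Pairs are stored in alphabetical order so (a, b) == (b, a)
            joint.update(combinations(sorted(window), 2))
    # Assumed semantics: drop words seen in fewer than min_freq windows
    keep = {w for w, c in single.items() if c >= min_freq}
    single = Counter({w: c for w, c in single.items() if w in keep})
    joint = Counter({p: c for p, c in joint.items() if p[0] in keep and p[1] in keep})
    return single, joint, total


docs = [["a", "b", "c"], ["a", "b"]]
single, joint, total = count_windows(docs, window_size=2, min_freq=2)
```

In this sketch the number of pairs generated per window grows quadratically with the window size, which is one reason a small window keeps counting tractable on a large corpus.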

Releasable Resources

Dropbox Link

count_graph indices map to the alphabetically sorted vocabulary, while vocab_count maps are sorted by vocabulary count.

Example from Wiki's vocab-index:

...
'addison': 724,
'addition': 725,
'additional': 726,
'additionally': 727,
'additions': 728,
...
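The two orderings described above can be sketched as follows. This is only an illustration: `build_vocab_maps` is a hypothetical helper, counting raw token occurrences is an assumption, and so is the descending order used for the count map, since the sort direction of the released files is not stated here.

```python
from collections import Counter


def build_vocab_maps(docs):
    """Build an alphabetical word-to-index map and a count-sorted word-to-count map."""
    counts = Counter(token for tokens in docs for token in tokens)
    # Index follows alphabetical order, as in the vocab-index excerpt
    vocab_index = {word: i for i, word in enumerate(sorted(counts))}
    # Assumed: most frequent first, ties broken alphabetically
    vocab_count = dict(sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])))
    return vocab_index, vocab_count


docs = [["b", "a", "b"], ["c", "b", "a"]]
vocab_index, vocab_count = build_vocab_maps(docs)
```

When looking up a word's row or column in a count graph, use the alphabetical index, not the word's position in the count-sorted map.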

Anthology Link

If you found the resources helpful, we'd appreciate a citation!

@inproceedings{lim-lauw-2023-large,
    title = "Large-Scale Correlation Analysis of Automated Metrics for Topic Models",
    author = "Lim, Jia Peng  and
      Lauw, Hady",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.776",
    pages = "13874--13898",
    abstract = "Automated coherence metrics constitute an important and popular way to evaluate topic models. Previous works present a mixed picture of their presumed correlation with human judgement. In this paper, we conduct a large-scale correlation analysis of coherence metrics. We propose a novel sampling approach to mine topics for the purpose of metric evaluation, and conduct the analysis via three large corpora showing that certain automated coherence metrics are correlated. Moreover, we extend the analysis to measure topical differences between corpora. Lastly, we examine the reliability of human judgement by conducting an extensive user study, which is designed as an amalgamation of different proxy tasks to derive a finer insight into the human decision-making processes. Our findings reveal some correlation between automated coherence metrics and human judgement, especially for generic corpora.",
}

