ezosa/M3L-topic-model

Multimodal and multilingual topic model with pretrained embeddings

Code for our COLING 2022 paper "Multilingual and Multimodal Topic Modelling with Pretrained Embeddings"

Abstract

We present M3L-Contrast, a novel multimodal multilingual (M3L) neural topic model for comparable data that maps multilingual texts and images into a shared topic space using a contrastive objective. As a multilingual topic model, it produces aligned language-specific topics, and as a multimodal model, it infers textual representations of semantic concepts in images. We also show that our model performs almost as well on unaligned embeddings as it does on aligned embeddings.

Our proposed topic model is:

  • multilingual
  • multimodal (image-text)
  • multimodal and multilingual (M3L)

Our model is based on the Contextualized Topic Model (Bianchi et al., 2021).

We use the PyTorch Metric Learning library for the InfoNCE/NT-Xent loss.
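
For illustration, here is a minimal sketch (not the repository's actual training code) of how the NT-Xent loss from PyTorch Metric Learning can pull paired text and image embeddings together in a shared space; the batch size, embedding dimensionality, and temperature below are illustrative assumptions.

    import torch
    from pytorch_metric_learning.losses import NTXentLoss

    loss_fn = NTXentLoss(temperature=0.07)  # temperature value is an assumption

    # Stand-ins for projected text and image representations of the same
    # batch of documents; row i of each tensor describes document i.
    batch_size, dim = 8, 256
    text_emb = torch.randn(batch_size, dim, requires_grad=True)
    image_emb = torch.randn(batch_size, dim, requires_grad=True)

    # Giving the text and image views of document i the same label marks
    # them as a positive pair; all other rows in the batch act as negatives.
    labels = torch.arange(batch_size)
    embeddings = torch.cat([text_emb, image_emb], dim=0)
    loss = loss_fn(embeddings, torch.cat([labels, labels], dim=0))
    loss.backward()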

Model architecture

Dataset

  • Aligned articles from the Wikipedia Comparable Corpora
  • Images from the WIT dataset
  • We will release the article titles and image URLs for the train and test sets (soon!)

Talks and slides

  • Slides and video from my talk at the Helsinki Language Technology seminar

Trained models

We have shared some of the models we trained:

  • M3L topic model trained with CLIP embeddings for texts and images
  • M3L topic model trained with multilingual SBERT for text and CLIP for images
  • M3L topic model trained with monolingual SBERT models for the English and German texts and CLIP for images
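
These models take pretrained embeddings as input. Below is a minimal sketch of producing such embeddings with the sentence-transformers library; the checkpoint names ("clip-ViT-B-32", "paraphrase-multilingual-mpnet-base-v2") and the image path are assumptions and not necessarily the exact checkpoints used in the paper.

    from PIL import Image
    from sentence_transformers import SentenceTransformer

    # CLIP encoder for images (and English text); checkpoint name is an assumption
    clip = SentenceTransformer("clip-ViT-B-32")
    # Multilingual SBERT encoder for text; checkpoint name is an assumption
    sbert = SentenceTransformer("paraphrase-multilingual-mpnet-base-v2")

    text_emb = sbert.encode(["An example sentence.", "Ein Beispielsatz."])
    image_emb = clip.encode(Image.open("example.jpg"))  # hypothetical image file

    print(text_emb.shape, image_emb.shape)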

Citation

@inproceedings{zosa-pivovarova-2022-multilingual,
    title = "Multilingual and Multimodal Topic Modelling with Pretrained Embeddings",
    author = "Zosa, Elaine  and  Pivovarova, Lidia",
    booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
    month = oct,
    year = "2022",
    address = "Gyeongju, Republic of Korea",
    publisher = "International Committee on Computational Linguistics",
    url = "https://aclanthology.org/2022.coling-1.355",
    pages = "4037--4048",
}
