hindi-nli-data

A repository containing the details of a natural language inference dataset in Hindi developed by

hindi-nli-data is the first recasted dataset for natural language inference in Hindi. Evaluating the learning capabilities of deep learning models in Natural Language Processing has always been challenging, and the task of Natural Language Inference (NLI) has been a touchstone for measuring their performance. However, there is a complete absence of labeled NLI datasets for a low-resource language like Hindi. To address this, we performed automated recasting of three existing Hindi text classification datasets related to affective content analysis into Natural Language Inference datasets. This resulted in three NLI datasets with 43K, 17K, and 203K premise-hypothesis pairs. The dataset, along with its details, is shared in this repo.

Dataset Overview

Three different affective content datasets in Hindi were recast. Two of them are in the domain of sentiment analysis: the Product Review (PR) dataset and the Movie Review (MR) dataset, both developed by Akhtar et al. The third is the largest emotion analysis dataset in Hindi, BHAAV (BH), developed by Kumar et al.

The data is shared as TSV files over here.
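
Since the data ships as tab-separated files, a minimal loader might look like the sketch below. It is only an illustration: the function name `read_tsv_rows` is hypothetical, and because the column layout of the files is not described here, the sketch returns raw rows rather than assuming specific column names. Quote handling is disabled so that quotation marks inside Hindi sentences are kept as literal text.

```python
import csv
from pathlib import Path


def read_tsv_rows(path):
    """Read a tab-separated file and return its non-empty rows as lists of strings.

    Hypothetical helper: the column order is NOT assumed here, so inspect
    the first few rows of a file before relying on any position.
    """
    with Path(path).open(encoding="utf-8", newline="") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
        reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return [row for row in reader if row]


# Usage (the file name is a placeholder, not a name taken from the archive):
# rows = read_tsv_rows("some-file.tsv")
# print(rows[:3])
```

Printing the first few rows is a quick way to confirm which columns hold the premise, the hypothesis, and the label before doing anything else with the data.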

Recasting Process
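
The general idea of recasting a classification dataset into NLI pairs can be sketched as follows. This is a minimal illustration under stated assumptions, not the exact procedure used for this corpus: the hypothesis template, the label names (`entailed`, `not-entailed`), and the names `NLIPair` and `recast_example` are all hypothetical. Each labeled sentence becomes the premise, and one hypothesis is generated per candidate class; the hypothesis that matches the gold class is marked as entailed and the others as not entailed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NLIPair:
    premise: str
    hypothesis: str
    label: str  # "entailed" or "not-entailed" (hypothetical label names)


def recast_example(text, gold_label, label_set,
                   template="This text expresses {label} sentiment."):
    """Turn one classified sentence into one NLI pair per candidate class.

    The template is a placeholder for illustration only; it is not the
    template used to build the corpus.
    """
    if gold_label not in label_set:
        raise ValueError(f"unknown gold label: {gold_label!r}")
    pairs = []
    for candidate in label_set:
        hypothesis = template.format(label=candidate)
        # Only the hypothesis naming the gold class follows from the premise.
        label = "entailed" if candidate == gold_label else "not-entailed"
        pairs.append(NLIPair(text, hypothesis, label))
    return pairs


# Usage with made-up inputs:
# for pair in recast_example("The battery lasts all day.", "positive",
#                            ["positive", "negative", "neutral"]):
#     print(pair)
```

One consequence of this scheme is that a source dataset with k classes yields k premise-hypothesis pairs per sentence, which is one way a recast dataset can end up much larger than its source.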

Samples from Original Sources

Recasted Samples

Distribution of Train and Test samples in the Recasted Dataset

Accuracies of different Sentence Embeddings on Textual Entailment Task

Terms of Use

This corpus can be used freely for research purposes.
The paper listed below provides details of the creation and use of the corpus. If you use the corpus, please cite the paper.
If interested in commercial use of the corpus, send email to [email protected].
If you use the corpus in a product or application, then please credit the authors and Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi appropriately. Also, if you send us an email, we will be thrilled to know about how you have used the corpus.
Multimodal Digital Media Analysis Lab - Indraprastha Institute of Information Technology, New Delhi, India disclaims any responsibility for the use of the corpus and does not provide technical support. However, the contact listed above will be happy to respond to queries and clarifications.
Rather than redistributing the corpus, please direct interested parties to this page.

Please feel free to send us an email:

with feedback regarding the corpus.
with information on how you have used the corpus.
if interested in having us analyze your data for natural language inference.
if interested in a collaborative research project.

References

Paper under review. Please check back soon.



vgupta123/hindi-nli-data
