Code implementation and datasets for the ACL 2023 paper "NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist" (ACL Anthology: https://aclanthology.org/2023.acl-long.69)


Prerequisites

conda create -n nlgeval_env python=3.7
conda activate nlgeval_env
conda install cudatoolkit=10.1 -c pytorch -n nlgeval_env

pip install -r requirements.txt
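
If you want to verify that the environment can see the GPU before running anything heavier, here is a minimal check (assuming requirements.txt installs a CUDA 10.1-compatible PyTorch build, as the cudatoolkit pin above suggests):

    # check_env.py -- quick sanity check for the nlgeval_env environment
    import torch

    print("torch version :", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())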

Quick Start

1. Structuring Data

The datasets provided in ~/data already include the scores from the human evaluations and the automatic metrics used in this study (including the human-aligned metrics). The field descriptions below document each file; a short loading sketch follows each task's field list.

Text Summarization

  • SummEval (Fabbri et al., 2021)
    Source : The source text before it is summarized by the systems
    Decoded : Systems' generated outputs
    Ref-n : Ground truth human references (11 references are provided)
    Model-ID : Identifier of the summarization system; see the Appendix of the paper or the original paper for more details
    Coherence : Coherence rating by human evaluators (scale 1-5)
    Consistency : Consistency rating by human evaluators (scale 1-5)
    Fluency : Fluency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU-n : BLEU score for the given output
    ROUGE-n : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)

  • Newsroom (Grusky et al., 2018)
    This dataset is not accompanied by ground-truth references, so for reference-based and nearly reference-free metrics we use the source (ArticleText) as the reference.
    ArticleID : The unique ID of the article
    ArticleText : The source text before it is summarized by the systems
    SystemSummary : Systems' generated outputs
    ArticleTitle : Title of the article
    System : NLG system that performed the summarization task; see the Appendix of the paper or the original paper for more details
    CoherenceRating : Coherence rating by human evaluators (scale 1-5)
    InformativenessRating : Informativeness rating by human evaluators (scale 1-5)
    FluencyRating : Fluency rating by human evaluators (scale 1-5)
    RelevanceRating : Relevance rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU : BLEU score for the given output
    ROUGE : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
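
As a minimal sketch of how to load and inspect one of the provided files with pandas (the CSV file name below is hypothetical; point it at the actual file under ~/data):

    # inspect_summeval.py -- loading sketch; "summeval.csv" is a hypothetical file name
    import pandas as pd

    df = pd.read_csv("data/summeval.csv")
    print(df.columns.tolist())   # should match the field list above (Coherence, Consistency, ...)
    print(df[["Model-ID", "Coherence", "Consistency", "Fluency", "Relevance"]].head())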

Dialogue Response Generation

  • USR-Topical Chat (Mehri and Eskenazi, 2020)
    Fact : The factual context of the article
    Context : The preceding conversation as the context for responses
    Response : Responses from the systems or a human
    Annotators : Identifier of the annotator who provided the corresponding human ratings
    Model : NLG system that generated the response; see the Appendix of the paper or the original paper for more details
    Understandable : Understandable rating by human evaluators (binary scale 0/1, 0=not understandable, 1=understandable)
    Natural : Naturalness rating by human evaluators (scale 1-3, 1=not natural, 2=somewhat/moderate, 3=good)
    MaintainsContext : Rating by human evaluators for maintaining context (scale 1-3)
    Engaging : Engagingness rating by human evaluators (scale 1-3)
    UsesKnowledge : Rating by human evaluators of whether the response uses the given knowledge (binary scale 0/1)
    Overall : Overall rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU : BLEU score for the given output
    ROUGE : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Engagingness, Groundedness)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Understandability, Naturalness, Coherence, Engagingness, Groundedness, Overall)

  • USR Persona Chat (Mehri and Eskenazi, 2020)
    Fact : The persona facts serving as context for the responses
    Context : The preceding conversation as the context for responses
    Response : Responses from the systems or a human
    Annotators : Identifier of the annotator who provided the corresponding human ratings
    Model : NLG system that generated the response; see the Appendix of the paper or the original paper for more details
    Understandable : Understandable rating by human evaluators (binary scale 0/1, 0=not understandable, 1=understandable)
    Natural : Naturalness rating by human evaluators (scale 1-3, 1=not natural, 2=neutral/moderate, 3=good)
    MaintainsContext : Rating by human evaluators for maintaining context (scale 1-3)
    Engaging : Engagingness rating by human evaluators (scale 1-3)
    UsesKnowledge : Rating by human evaluators of whether the response uses the given persona facts (binary scale 0/1)
    Overall : Overall rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU : BLEU score for the given output
    ROUGE : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Engagingness, Groundedness)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Understandability, Naturalness, Coherence, Engagingness, Groundedness, Overall)
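
Since the USR data has one row per annotator, a typical first step is to average the human ratings per system. A minimal sketch, assuming the Topical Chat annotations are a CSV under ~/data (the file name is hypothetical; the UniEval column may be split per aspect):

    # aggregate_usr.py -- aggregation sketch; "usr_topical_chat.csv" is a hypothetical file name
    import pandas as pd

    df = pd.read_csv("data/usr_topical_chat.csv")

    # mean Overall human rating per system, averaged over annotators and responses
    per_system = df.groupby("Model")["Overall"].mean().sort_values(ascending=False)
    print(per_system)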

Controlled Generation

  • UBER-PPLM (Dathathri et al., 2020)
    This is an open-ended generation task (no ground-truth references).
    Prefix : A one- or two-word prompt at the beginning of the sentence that cues the language model to continue it into a full sentence or text
    Text : Systems' generated outputs
    Domain : Topic category as a control attribute
    Annotator : The annotator for the corresponding human ratings
    Model : NLG system used as the text generator; see the Appendix of the paper or the original paper for more details
    Pairtxt : Model pair given to the annotators
    Fluency : Fluency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (binary scale 0/1)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)

  • CTRL (Keskar et al., 2019)
    This is an open-ended generation task (no ground-truth references).
    Prefix : A one- or two-word prompt at the beginning of the sentence that cues the language model to continue it into a full sentence or text
    Text : Systems' generated outputs
    Domain : Topic category as a control attribute
    Annotator : The annotator for the corresponding human ratings
    Model : NLG system used as the text generator; see the Appendix of the paper or the original paper for more details
    Pairtxt : Model pair given to the annotators
    Fluency : Fluency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (binary scale 0/1)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)

  • CTRL-Eval (Ke et al., 2022)
    This is an open-ended generation task (no ground-truth references).
    Prefix : A one- or two-word prompt at the beginning of the sentence that cues the language model to continue it into a full sentence or text
    Text : Systems' generated outputs
    Attribute : Topic category as a control attribute
    Coherence : Coherence rating by human evaluators (scale 1-5)
    Consistency : Consistency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (binary scale 0/1)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
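
For the controlled generation data, a useful summary is the fraction of outputs judged on-topic per system and control attribute. A minimal sketch on the UBER-PPLM file (the file name is hypothetical; columns as documented above):

    # controlled_gen_summary.py -- sketch; "uber_pplm.csv" is a hypothetical file name
    import pandas as pd

    df = pd.read_csv("data/uber_pplm.csv")

    # binary Relevance averaged per system (Model) and control attribute (Domain)
    on_topic = df.groupby(["Model", "Domain"])["Relevance"].mean().unstack()
    print(on_topic.round(3))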

2. Human-aligned Metrics

We consider three metrics under this category: CTC, CtrlEval, and UniEval. Before computing evaluation scores for the system outputs of the datasets above, the Python implementations of these metrics need to be installed.

The datasets provided in ~/data already include these scores. However, if you would like to run the automatic metrics on your own datasets, the example scripts below show how. Before running them, do not forget to modify the conda environment name in each script.

Automatic metric, benchmark, and bash script:
  • Perplexity, BLEU, ROUGE, BERTScore (Text Summarization) : scripts/run_autom_newsroom.sh
  • Perplexity, BLEU, ROUGE, BERTScore (Controlled Generation) : scripts/run_autom_uber.sh
  • UniEval (Text Summarization) : scripts/run_unieval_summ.sh
  • UniEval (Dialogue Generation) : scripts/run_unieval_tc.sh
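
The entries above are plain bash scripts, e.g. bash scripts/run_autom_newsroom.sh. If you would rather score your own outputs directly from Python, here is a minimal sketch for one of the task-agnostic metrics using the bert-score package (the candidate and reference strings are placeholders):

    # score_own_outputs.py -- BERTScore sketch; requires `pip install bert-score`
    from bert_score import score

    candidates = ["a system-generated summary"]           # placeholder system outputs
    references = ["the ground-truth reference summary"]   # placeholder references

    P, R, F1 = score(candidates, references, lang="en", verbose=False)
    print(f"BERTScore F1: {F1.mean().item():.4f}")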

3. Transfer Experiment

notebooks/Plot Transfer Correlation.ipynb
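
The notebook plots transfer correlations; the underlying step is presumably correlating a metric's scores with human ratings on a task the metric was not tuned for. A minimal sketch of that step with scipy (file and column names are assumptions based on the field lists above):

    # transfer_corr.py -- correlation sketch; file and column names are assumptions
    import pandas as pd
    from scipy.stats import kendalltau

    df = pd.read_csv("data/usr_topical_chat.csv")
    tau, _ = kendalltau(df["Overall"], df["UniEval"])
    print(f"Kendall tau, UniEval vs. human Overall: {tau:.3f}")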

4. Aspect Evaluation

notebooks/Quality-Eval.ipynb
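
A sketch of an aspect-level check, measuring how strongly each metric tracks each human-rated aspect (file and metric column names are assumptions; adjust them to the actual names in the data):

    # aspect_eval.py -- per-aspect correlation sketch; names are assumptions
    import pandas as pd
    from scipy.stats import kendalltau

    df = pd.read_csv("data/summeval.csv")
    aspects = ["Coherence", "Consistency", "Fluency", "Relevance"]  # human aspects documented above
    metrics = ["Perplexity", "BERTScore", "UniEval"]                # adjust to the actual column names

    for aspect in aspects:
        for metric in metrics:
            tau, _ = kendalltau(df[aspect], df[metric])
            print(f"{metric:>10} vs {aspect:<12} Kendall tau = {tau: .3f}")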

5. System Evaluation

notebooks/System-Eval.ipynb
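
A sketch of a system-level comparison: aggregate per-output scores to one score per system and compare the rankings induced by a human aspect and by a metric (file and column names are assumptions):

    # system_eval.py -- system-level ranking sketch; names are assumptions
    import pandas as pd

    df = pd.read_csv("data/summeval.csv")
    per_system = df.groupby("Model-ID")[["Relevance", "BERTScore"]].mean()

    # rank systems separately by human Relevance and by BERTScore, then compare
    ranks = per_system.rank(ascending=False)
    print(ranks.sort_values("Relevance"))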

6. Pairwise Comparison

notebooks/Pairwise_System_Ranking.ipynb
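
A sketch of a pairwise comparison between two systems: match their outputs on the shared prompt and count how often one is preferred under a given score (the file and system names below are hypothetical; use values from the Model column):

    # pairwise_ranking.py -- pairwise win-rate sketch; file and system names are hypothetical
    import pandas as pd

    df = pd.read_csv("data/uber_pplm.csv")
    a, b = "system_A", "system_B"   # hypothetical; replace with actual Model values

    pairs = df[df["Model"] == a].merge(df[df["Model"] == b], on="Prefix", suffixes=("_a", "_b"))
    win_rate = (pairs["Fluency_a"] > pairs["Fluency_b"]).mean()
    print(f"{a} preferred over {b} on Fluency in {win_rate:.1%} of matched pairs")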

Computing Infrastructure

  • GPU: ASUS Turbo GeForce GTX 1080 Ti (11 GB RAM, 3584 CUDA cores, compute capability 6.1); CPU: Intel Xeon Broadwell-EP 2683 v4 @ 2.1 GHz (64 hyperthreads, RAM: 1024 GB).
  • OS: Ubuntu 16.04.7 LTS (GNU/Linux 4.4.0-138-generic x86_64)

Citation

@inproceedings{nimah-etal-2023-nlg,
    title = "{NLG} Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist",
    author = "Nimah, Iftitahu  and
      Fang, Meng  and
      Menkovski, Vlado  and
      Pechenizkiy, Mykola",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.69",
    doi = "10.18653/v1/2023.acl-long.69",
    pages = "1240--1266",
    abstract = "In this study, we analyze automatic evaluation metrics for Natural Language Generation (NLG), specifically task-agnostic metrics and human-aligned metrics. Task-agnostic metrics, such as Perplexity, BLEU, BERTScore, are cost-effective and highly adaptable to diverse NLG tasks, yet they have a weak correlation with human. Human-aligned metrics (CTC, CtrlEval, UniEval) improves correlation level by incorporating desirable human-like qualities as training objective. However, their effectiveness at discerning system-level performance and quality of system outputs remain unclear. We present metric preference checklist as a framework to assess the effectiveness of automatic metrics in three NLG tasks: Text Summarization, Dialogue Response Generation, and Controlled Generation. Our proposed framework provides access: (i) for verifying whether automatic metrics are faithful to human preference, regardless of their correlation level to human; and (ii) for inspecting the strengths and limitations of NLG systems via pairwise evaluation. We show that automatic metrics provide a better guidance than human on discriminating system-level performance in Text Summarization and Controlled Generation tasks. We also show that multi-aspect human-aligned metric (UniEval) is not necessarily dominant over single-aspect human-aligned metrics (CTC, CtrlEval) and task-agnostic metrics (BLEU, BERTScore), particularly in Controlled Generation tasks.",
}

Issues and pull requests are welcome.
