Code implementation and datasets for the ACL 2023 paper "NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist" (ACL Anthology: https://aclanthology.org/2023.acl-long.69)


Prerequisites

conda create -n nlgeval_env python=3.7
conda activate nlgeval_env
conda install cudatoolkit=10.1 -c pytorch -n nlgeval_env

pip install -r requirements.txt
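
If you want to verify that the environment can see the GPU before running anything heavier, here is a minimal check (assuming requirements.txt installs a CUDA 10.1-compatible PyTorch build, as the cudatoolkit pin above suggests):

    # check_env.py -- quick sanity check for the nlgeval_env environment
    import torch

    print("torch version :", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())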

Quick Start

1. Structuring Data

The datasets provided in ~/data already include the scores from the human evaluations and the automatic metrics used in this study (including the human-aligned metrics). The field descriptions below document each file; a short loading sketch follows each task's field list.

Text Summarization

  • SummEval (Fabbri et al., 2021)
    Source : The source text before it is summarized by the systems
    Decoded : Systems' generated outputs
    Ref-n : Ground truth human references (11 references are provided)
    Model-ID : Identifier of the summarization system; see the Appendix of the paper or the original paper for more details
    Coherence : Coherence rating by human evaluators (scale 1-5)
    Consistency : Consistency rating by human evaluators (scale 1-5)
    Fluency : Fluency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU-n : BLEU score for the given output
    ROUGE-n : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)

  • Newsroom (Grusky et al., 2018)
    This dataset is not accompanied by ground-truth references, so for reference-based and nearly reference-free metrics we use the source (ArticleText) as the reference.
    ArticleID : The unique ID of the article
    ArticleText : The source text before it is summarized by the systems
    SystemSummary : Systems' generated outputs
    ArticleTitle : Title of the article
    System : NLG system that performed the summarization task; see the Appendix of the paper or the original paper for more details
    CoherenceRating : Coherence rating by human evaluators (scale 1-5)
    InformativenessRating : Informativeness rating by human evaluators (scale 1-5)
    FluencyRating : Fluency rating by human evaluators (scale 1-5)
    RelevanceRating : Relevance rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU : BLEU score for the given output
    ROUGE : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
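
As a minimal sketch of how to load and inspect one of the provided files with pandas (the CSV file name below is hypothetical; point it at the actual file under ~/data):

    # inspect_summeval.py -- loading sketch; "summeval.csv" is a hypothetical file name
    import pandas as pd

    df = pd.read_csv("data/summeval.csv")
    print(df.columns.tolist())   # should match the field list above (Coherence, Consistency, ...)
    print(df[["Model-ID", "Coherence", "Consistency", "Fluency", "Relevance"]].head())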

Dialogue Response Generation

  • USR-Topical Chat (Mehri and Eskenazi, 2020)
    Fact : The factual context of the article
    Context : The preceding conversation as the context for responses
    Response : Responses from the systems or a human
    Annotators : Identifier of the annotator who provided the corresponding human ratings
    Model : NLG system that generated the response; see the Appendix of the paper or the original paper for more details
    Understandable : Understandable rating by human evaluators (binary scale 0/1, 0=not understandable, 1=understandable)
    Natural : Naturalness rating by human evaluators (scale 1-3, 1=not natural, 2=somewhat/moderate, 3=good)
    MaintainsContext : Rating by human evaluators for maintaining context (scale 1-3)
    Engaging : Engagingness rating by human evaluators (scale 1-3)
    UsesKnowledge : Rating by human evaluators of whether the response uses the given knowledge (binary scale 0/1)
    Overall : Overall rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU : BLEU score for the given output
    ROUGE : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Engagingness, Groundedness)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Understandability, Naturalness, Coherence, Engagingness, Groundedness, Overall)

  • USR Persona Chat (Mehri and Eskenazi, 2020)
    Fact : The persona facts serving as context for the responses
    Context : The preceding conversation as the context for responses
    Response : Responses from the systems or a human
    Annotators : Identifier of the annotator who provided the corresponding human ratings
    Model : NLG system that generated the response; see the Appendix of the paper or the original paper for more details
    Understandable : Understandable rating by human evaluators (binary scale 0/1, 0=not understandable, 1=understandable)
    Natural : Naturalness rating by human evaluators (scale 1-3, 1=not natural, 2=neutral/moderate, 3=good)
    MaintainsContext : Rating by human evaluators for maintaining context (scale 1-3)
    Engaging : Engagingness rating by human evaluators (scale 1-3)
    UsesKnowledge : Rating by human evaluators of whether the response uses the given persona facts (binary scale 0/1)
    Overall : Overall rating by human evaluators (scale 1-5)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BLEU : BLEU score for the given output
    ROUGE : ROUGE score for the given output
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Engagingness, Groundedness)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Understandability, Naturalness, Coherence, Engagingness, Groundedness, Overall)
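
Since the USR data has one row per annotator, a typical first step is to average the human ratings per system. A minimal sketch, assuming the Topical Chat annotations are a CSV under ~/data (the file name is hypothetical; the UniEval column may be split per aspect):

    # aggregate_usr.py -- aggregation sketch; "usr_topical_chat.csv" is a hypothetical file name
    import pandas as pd

    df = pd.read_csv("data/usr_topical_chat.csv")

    # mean Overall human rating per system, averaged over annotators and responses
    per_system = df.groupby("Model")["Overall"].mean().sort_values(ascending=False)
    print(per_system)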

Controlled Generation

  • UBER-PPLM (Dathathri et al., 2020)
    This is an open-ended generation task (no ground-truth references).
    Prefix : A one- or two-word prompt at the beginning of the sentence that cues the language model to continue it into a full sentence or text
    Text : Systems' generated outputs
    Domain : Topic category as a control attribute
    Annotator : The annotator for the corresponding human ratings
    Model : NLG system used as the text generator; see the Appendix of the paper or the original paper for more details
    Pairtxt : Model pair given to the annotators
    Fluency : Fluency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (binary scale 0/1)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)

  • CTRL (Keskar et al., 2019)
    This is an open-ended generation task (no ground-truth references).
    Prefix : A one- or two-word prompt at the beginning of the sentence that cues the language model to continue it into a full sentence or text
    Text : Systems' generated outputs
    Domain : Topic category as a control attribute
    Annotator : The annotator for the corresponding human ratings
    Model : NLG system used as the text generator; see the Appendix of the paper or the original paper for more details
    Pairtxt : Model pair given to the annotators
    Fluency : Fluency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (binary scale 0/1)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)

  • CTRL-Eval (Ke et al., 2022)
    This is an open-ended generation task (no ground-truth references).
    Prefix : A one- or two-word prompt at the beginning of the sentence that cues the language model to continue it into a full sentence or text
    Text : Systems' generated outputs
    Attribute : Topic category as a control attribute
    Coherence : Coherence rating by human evaluators (scale 1-5)
    Consistency : Consistency rating by human evaluators (scale 1-5)
    Relevance : Relevance rating by human evaluators (binary scale 0/1)
    Perplexity : Perplexity score for the given output (based on pretrained Language Model)
    BERTScore : BERTscore for the given output (Precision, Recall, F1)
    CTC : CTC scores (Method: Embedding-based, Discriminative, Regression; Aspect: Consistency, Relevance)
    CtrlEval : CtrlEval scores (Aspect: Coherence, Consistency, Relevance)
    UniEval : UniEval scores (Aspect: Coherence, Consistency, Fluency, Relevance, Overall)
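
For the controlled generation data, a useful summary is the fraction of outputs judged on-topic per system and control attribute. A minimal sketch on the UBER-PPLM file (the file name is hypothetical; columns as documented above):

    # controlled_gen_summary.py -- sketch; "uber_pplm.csv" is a hypothetical file name
    import pandas as pd

    df = pd.read_csv("data/uber_pplm.csv")

    # binary Relevance averaged per system (Model) and control attribute (Domain)
    on_topic = df.groupby(["Model", "Domain"])["Relevance"].mean().unstack()
    print(on_topic.round(3))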

2. Human-aligned Metrics

We consider three metrics under this category: CTC, CtrlEval, and UniEval. Before computing evaluation scores for the system outputs of the datasets above, the Python implementations of these metrics need to be installed.

The datasets provided in ~/data already include these scores. However, if you would like to run the automatic metrics on your own datasets, the example scripts below show how. Before running them, do not forget to modify the conda environment name in each script.

Automatic metric, benchmark, and bash script:
  • Perplexity, BLEU, ROUGE, BERTScore (Text Summarization) : scripts/run_autom_newsroom.sh
  • Perplexity, BLEU, ROUGE, BERTScore (Controlled Generation) : scripts/run_autom_uber.sh
  • UniEval (Text Summarization) : scripts/run_unieval_summ.sh
  • UniEval (Dialogue Generation) : scripts/run_unieval_tc.sh
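
The entries above are plain bash scripts, e.g. bash scripts/run_autom_newsroom.sh. If you would rather score your own outputs directly from Python, here is a minimal sketch for one of the task-agnostic metrics using the bert-score package (the candidate and reference strings are placeholders):

    # score_own_outputs.py -- BERTScore sketch; requires `pip install bert-score`
    from bert_score import score

    candidates = ["a system-generated summary"]           # placeholder system outputs
    references = ["the ground-truth reference summary"]   # placeholder references

    P, R, F1 = score(candidates, references, lang="en", verbose=False)
    print(f"BERTScore F1: {F1.mean().item():.4f}")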

3. Transfer Experiment

notebooks/Plot Transfer Correlation.ipynb
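
The notebook plots transfer correlations; the underlying step is presumably correlating a metric's scores with human ratings on a task the metric was not tuned for. A minimal sketch of that step with scipy (file and column names are assumptions based on the field lists above):

    # transfer_corr.py -- correlation sketch; file and column names are assumptions
    import pandas as pd
    from scipy.stats import kendalltau

    df = pd.read_csv("data/usr_topical_chat.csv")
    tau, _ = kendalltau(df["Overall"], df["UniEval"])
    print(f"Kendall tau, UniEval vs. human Overall: {tau:.3f}")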

4. Aspect Evaluation

notebooks/Quality-Eval.ipynb
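
A sketch of an aspect-level check, measuring how strongly each metric tracks each human-rated aspect (file and metric column names are assumptions; adjust them to the actual names in the data):

    # aspect_eval.py -- per-aspect correlation sketch; names are assumptions
    import pandas as pd
    from scipy.stats import kendalltau

    df = pd.read_csv("data/summeval.csv")
    aspects = ["Coherence", "Consistency", "Fluency", "Relevance"]  # human aspects documented above
    metrics = ["Perplexity", "BERTScore", "UniEval"]                # adjust to the actual column names

    for aspect in aspects:
        for metric in metrics:
            tau, _ = kendalltau(df[aspect], df[metric])
            print(f"{metric:>10} vs {aspect:<12} Kendall tau = {tau: .3f}")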

5. System Evaluation

notebooks/System-Eval.ipynb
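
A sketch of a system-level comparison: aggregate per-output scores to one score per system and compare the rankings induced by a human aspect and by a metric (file and column names are assumptions):

    # system_eval.py -- system-level ranking sketch; names are assumptions
    import pandas as pd

    df = pd.read_csv("data/summeval.csv")
    per_system = df.groupby("Model-ID")[["Relevance", "BERTScore"]].mean()

    # rank systems separately by human Relevance and by BERTScore, then compare
    ranks = per_system.rank(ascending=False)
    print(ranks.sort_values("Relevance"))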

6. Pairwise Comparison

notebooks/Pairwise_System_Ranking.ipynb
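
A sketch of a pairwise comparison between two systems: match their outputs on the shared prompt and count how often one is preferred under a given score (the file and system names below are hypothetical; use values from the Model column):

    # pairwise_ranking.py -- pairwise win-rate sketch; file and system names are hypothetical
    import pandas as pd

    df = pd.read_csv("data/uber_pplm.csv")
    a, b = "system_A", "system_B"   # hypothetical; replace with actual Model values

    pairs = df[df["Model"] == a].merge(df[df["Model"] == b], on="Prefix", suffixes=("_a", "_b"))
    win_rate = (pairs["Fluency_a"] > pairs["Fluency_b"]).mean()
    print(f"{a} preferred over {b} on Fluency in {win_rate:.1%} of matched pairs")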

Computing Infrastructure

  • GPU: ASUS Turbo GeForce GTX 1080 Ti (11 GB RAM, 3584 CUDA cores, compute capability 6.1); CPU: Intel Xeon Broadwell-EP 2683 v4 @ 2.1 GHz (64 hyperthreads, RAM: 1024 GB).
  • OS: Ubuntu 16.04.7 LTS (GNU/Linux 4.4.0-138-generic x86_64)

Citation

@inproceedings{nimah-etal-2023-nlg,
    title = "{NLG} Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist",
    author = "Nimah, Iftitahu  and
      Fang, Meng  and
      Menkovski, Vlado  and
      Pechenizkiy, Mykola",
    editor = "Rogers, Anna  and
      Boyd-Graber, Jordan  and
      Okazaki, Naoaki",
    booktitle = "Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.acl-long.69",
    doi = "10.18653/v1/2023.acl-long.69",
    pages = "1240--1266",
    abstract = "In this study, we analyze automatic evaluation metrics for Natural Language Generation (NLG), specifically task-agnostic metrics and human-aligned metrics. Task-agnostic metrics, such as Perplexity, BLEU, BERTScore, are cost-effective and highly adaptable to diverse NLG tasks, yet they have a weak correlation with human. Human-aligned metrics (CTC, CtrlEval, UniEval) improves correlation level by incorporating desirable human-like qualities as training objective. However, their effectiveness at discerning system-level performance and quality of system outputs remain unclear. We present metric preference checklist as a framework to assess the effectiveness of automatic metrics in three NLG tasks: Text Summarization, Dialogue Response Generation, and Controlled Generation. Our proposed framework provides access: (i) for verifying whether automatic metrics are faithful to human preference, regardless of their correlation level to human; and (ii) for inspecting the strengths and limitations of NLG systems via pairwise evaluation. We show that automatic metrics provide a better guidance than human on discriminating system-level performance in Text Summarization and Controlled Generation tasks. We also show that multi-aspect human-aligned metric (UniEval) is not necessarily dominant over single-aspect human-aligned metrics (CTC, CtrlEval) and task-agnostic metrics (BLEU, BERTScore), particularly in Controlled Generation tasks.",
}

Issues and pull requests are welcome.
