Using Supert for Single-Document Summarization: A Case Study on SummEval

Supert was originally designed to rate multi-document summaries, but it can also rate single-document summaries. In this directory we use Supert to evaluate summaries from the SummEval dataset. The summaries are generated from CNN/DailyMail documents and are rated by human experts for consistency, fluency, coherence and relevance.

Step 1: Prepare SummEval Data

Follow the instructions here to prepare the data. This generates multiple folders, but we only need pairs/model_annotations.aligned.paired.jsonl. Move that jsonl file to this directory. It covers 100 topics, each with 23 summaries and their human ratings.
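Before running Supert, it can help to check that the jsonl file loads and is grouped the way you expect. The following is a minimal sketch, not the code used in this directory: `load_summeval` is a hypothetical helper, and the `id` field name is an assumption about the record layout that you should verify against your copy of the file.

```python
import json
from collections import defaultdict


def load_summeval(path):
    """Read a JSON Lines file and group its records by document id.

    Assumption: every record carries an "id" field naming its source
    document. Adjust the key if your file uses a different one.
    """
    by_doc = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            record = json.loads(line)
            by_doc[record["id"]].append(record)
    return dict(by_doc)
```

Calling `load_summeval("model_annotations.aligned.paired.jsonl")` and printing the number of keys and the length of each list is a quick way to confirm the topic and summary counts.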

Step 2: Generate Supert scores

Simply run the following command:


It uses Supert to rate all summaries in the SummEval dataset and computes the correlation between the Supert scores and the human ratings. By default, the first 10 sentences of each document are used to build the pseudo reference, but you can change the strategy at line 101 in

supert_scores = generate_supert_scores(summ_eval_data, 'top10') 
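To illustrate what a strategy string such as 'top10' or 'top200' means, here is a rough sketch of the "first N sentences" idea. It is not Supert's implementation: `split_sentences` and `build_pseudo_reference` are hypothetical names, and the regex-based sentence splitter is a naive stand-in for a proper tokenizer.

```python
import re


def split_sentences(text):
    """Naive splitter (assumption): break after '.', '!' or '?' plus whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def build_pseudo_reference(document, strategy="top10"):
    """Return the first N sentences of a document for a 'topN' strategy string."""
    match = re.fullmatch(r"top(\d+)", strategy)
    if match is None:
        raise ValueError(f"unsupported strategy: {strategy!r}")
    n = int(match.group(1))
    # Slicing is safe when the document has fewer than N sentences.
    return split_sentences(document)[:n]
```

With a large N such as 200, a typical news article has fewer sentences than the limit, so the whole document ends up in the pseudo reference.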

The generated Supert scores are also saved to a pickle file to facilitate reuse.
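The reuse pattern can be sketched as a small compute-or-load helper. This is an illustration under assumptions: `load_or_compute` and the cache path are hypothetical, and the script in this directory may name and structure its pickle file differently.

```python
import os
import pickle


def load_or_compute(cache_path, compute_fn):
    """Return cached results if the pickle exists, else compute and cache them."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as f:
            return pickle.load(f)
    result = compute_fn()
    with open(cache_path, "wb") as f:
        pickle.dump(result, f)
    return result
```

Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.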


The Pearson correlations between Supert scores and human ratings are shown below.

| | Coherence | Fluency | Relevance | Consistency |
|---|---|---|---|---|
| Supert, top10 | .220 | .300 | .309 | .391 |
| Supert, top200 | .221 | .287 | .311 | .389 |
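For readers who want to check a correlation by hand, the Pearson coefficient between two equal-length score lists can be computed as in the sketch below. The `pearson` helper is hypothetical and written in plain Python. How the script in this directory aggregates scores before correlating them (per topic or over all summaries) is not specified here, so treat this only as the underlying formula.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one sequence is constant
    return cov / math.sqrt(var_x * var_y)
```

Passing a list of Supert scores and the matching list of human ratings for one quality dimension gives a value between -1 and 1.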