
How to evaluate BLEU score on LM1B? #6

Open
jzhang38 opened this issue Dec 15, 2022 · 6 comments

Comments

@jzhang38

Dear authors,

I understand that you plan to release your code in January, but could you share more details on how you evaluate the BLEU score and PPL on the LM1B dataset? I am also working on diffusion models for text and may cite your paper. Thanks!

@Hzfinfdu
Owner

Hi,

We computed BLEU using the entire test set as references and reported the BLEU score averaged over the generated sentences. We sampled 1K sentences each for evaluating BLEU and Self-BLEU (S-BLEU).
For PPL, the ELBO on the test set is an upper bound on the token-wise NLL, so we first convert this bound to a per-word NLL and then exponentiate it to obtain the per-word PPL.
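A minimal sketch of this BLEU/Self-BLEU protocol with NLTK, in case it helps. Here `generated` and `test_refs` are hypothetical names for the 1K samples and the test sentences as token lists, and the tokenization and smoothing choices are illustrative assumptions, not necessarily what we used:

```python
# Illustrative sketch only: `generated` is the list of 1K sampled sentences and
# `test_refs` is the LM1B test set, both as lists of token lists. The exact
# tokenization and smoothing used in the paper may differ.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

smooth = SmoothingFunction().method1

def avg_bleu(generated, test_refs):
    """Score each generated sentence against ALL test references, then average."""
    scores = [sentence_bleu(test_refs, hyp, smoothing_function=smooth)
              for hyp in generated]
    return sum(scores) / len(scores)

def self_bleu(generated):
    """Score each sample against the other samples (O(n^2), slow for large n)."""
    scores = [sentence_bleu(generated[:i] + generated[i + 1:], hyp,
                            smoothing_function=smooth)
              for i, hyp in enumerate(generated)]
    return sum(scores) / len(scores)
```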
Hope this helps!

@yujianll

yujianll commented Jan 5, 2023

@Hzfinfdu Thanks for the great work!
I have a follow-up question. When you say per-word NLL, do you mean calculating $\mathcal{L}_{vlb}$ in Eq. 3 for each token? Do you sum the NLL over all tokens in the sequence and use that as the NLL for the sequence?
Also, I noticed that in Fig. 4 the validation ELBO is around 110 after training, while the test set PPL is around 60~70. I wonder why these two values differ so much.

@Hzfinfdu
Owner

Hzfinfdu commented Jan 6, 2023

@yujianll Hi,

  1. Yes, we sum the NLL over all tokens in the sequence to get the sequence-level NLL.
  2. The validation ELBO is around 110, and the average number of words per sequence in the test set is around 26, so the per-word NLL is around 110 / 26 ≈ 4.23. The test PPL is exp(4.23) ≈ 68.7, which lands in the 60~70 range.
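A quick numeric check of the arithmetic in point 2:

```python
import math

elbo_per_seq = 110.0   # validation ELBO per sequence (upper bound on NLL, in nats)
words_per_seq = 26.0   # average number of words per test sequence

nll_per_word = elbo_per_seq / words_per_seq  # ≈ 4.23
ppl = math.exp(nll_per_word)                 # ≈ 68.7, i.e. in the 60~70 range
```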

@yujianll

yujianll commented Jan 6, 2023

@Hzfinfdu Thanks for the reply!
I have another low-level question. When you calculate the NLL on the test set, do you sum over all T diffusion steps, or do you sample a subset of timesteps? If you sample, how many timesteps do you use?

@Hzfinfdu
Owner

Hzfinfdu commented Jan 6, 2023

@yujianll Hi,

We trained DiffusionBERT with 512 diffusion steps and used DDIM sampling to uniformly subsample 128 steps on the test set, for both the NLL calculation and generation.
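In case it's useful, a minimal sketch of what uniform 512→128 respacing could look like. The `linspace` indexing here is an assumption for illustration, not necessarily the repo's exact scheme:

```python
import numpy as np

T_TRAIN, T_EVAL = 512, 128  # training steps vs. DDIM evaluation steps

# Uniformly respace the 512 training timesteps down to 128 evaluation
# timesteps (a stride of roughly 4); the same subsequence would be used
# for both the NLL computation and generation.
eval_timesteps = np.linspace(0, T_TRAIN - 1, T_EVAL).round().astype(int)
assert len(set(eval_timesteps.tolist())) == T_EVAL  # no duplicate steps
```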

Hope this helps!

@yujianll

yujianll commented Jan 6, 2023

Thanks, this helps a lot!
