
Inquiry on some details of the method. #8

Open · leekum2018 opened this issue Dec 20, 2022 · 7 comments

@leekum2018 commented Dec 20, 2022

As said in the second paragraph of Section 4.3, "We attribute the superior performance of DiffusionBERT to its onetime sampling of all tokens". I wonder about the meaning of "onetime sampling of all tokens": does it mean generating all the tokens in a sentence at once? If so, it seems to conflict with the demonstration in Table 1. Thank you!

@Hzfinfdu (Owner)

Hi,

Yes, we generate all tokens in one diffusion step. We use DDIM sampling to predict $x_0$ and then obtain $x_{t-1}$ from the forward process. The demonstration in Table 1 shows the input to BERT at time step $t-1$.

Besides, the corresponding predicted $x_0$ consists of less informative tokens when $t$ is large and gradually gains semantic meaning as $t$ approaches 0. That is also the motivation for our spindle noise schedule.
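For concreteness, here is a minimal sketch of one reverse step under an absorbing ([MASK]) forward process, following the predict-$x_0$-then-renoise pattern described above. This is an illustration, not the repository's actual code; `model`, `alpha_bar`, and `mask_id` are placeholder names, and the schedule here is a single scalar per step rather than the paper's per-token spindle schedule.

```python
import torch

def reverse_step(model, x_t, t, alpha_bar, mask_id):
    """One reverse step: predict all of x_0 at once, then draw x_{t-1}
    by re-applying the forward (masking) process to the prediction.

    alpha_bar[s] = probability that a token is still unmasked at step s.
    """
    # A BERT-style model predicts a distribution over x_0 at every position.
    logits = model(x_t)                    # (batch, seq_len, vocab)
    x0_hat = logits.argmax(dim=-1)         # point estimate of x_0

    # Forward process to step t-1: each position stays unmasked with
    # probability alpha_bar[t-1], otherwise it is set back to [MASK].
    keep = torch.rand(x0_hat.shape) < alpha_bar[t - 1]
    mask = torch.full_like(x0_hat, mask_id)
    return torch.where(keep, x0_hat, mask)
```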

Hope this helps. If you have more questions, please feel free to contact me.

@leekum2018 (Author) commented Dec 21, 2022

Thank you for your reply! I have a further question. According to your reply, does it mean that you model $p_{\theta}(x_{t-1}|x_t)$ as
$$p_{\theta}(x_{t-1}|x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t|\widetilde{x}_0)\, \widetilde{p}_{\theta}(\widetilde{x}_0|x_t)$$
And is the term $\widetilde{p}(\widetilde{x}_{0}|x_t)$ the output of BERT? Thank you!

@Hzfinfdu (Owner)

Yes, that's right. DDIM sampling helps to trade off speed and generation quality. And predicting $x_0$ directly is closer to the MLM training objective.
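As a toy illustration of this parameterization (a sketch with made-up numbers, not the repository's code), one can compute the sum over $\widetilde{x}_0$ for a single token under an absorbing forward process and then normalize over $x_{t-1}$, which is exactly what the $\propto$ denotes:

```python
import numpy as np

# Toy absorbing-state diffusion for ONE token position.
# Vocabulary: ids 0..V-1 are real tokens, id V is [MASK].
V, MASK = 4, 4
alpha = 0.8  # hypothetical per-step keep probability

# One-step transition matrix Q[i, j] = q(x_t = j | x_{t-1} = i):
# a real token survives with prob alpha, is masked with prob 1 - alpha;
# [MASK] is absorbing.
Q = np.zeros((V + 1, V + 1))
Q[:V, :V] = alpha * np.eye(V)
Q[:V, MASK] = 1 - alpha
Q[MASK, MASK] = 1.0

Q_bar = np.linalg.matrix_power(Q, 2)  # q(x_{t-1} | x_0) for t - 1 = 2 steps

p_x0 = np.array([0.1, 0.2, 0.3, 0.4, 0.0])  # BERT's p~(x0 | x_t), hypothetical
x_t = MASK                                  # the current token is masked

# p(x_{t-1} | x_t) ∝ sum_{x0} q(x_{t-1}, x_t | x0) * p~(x0 | x_t), where the
# Markov property gives q(x_{t-1}, x_t | x0) = q(x_t | x_{t-1}) q(x_{t-1} | x0).
unnorm = np.array([
    Q[x_prev, x_t] * sum(Q_bar[x0, x_prev] * p_x0[x0] for x0 in range(V + 1))
    for x_prev in range(V + 1)
])
p_prev = unnorm / unnorm.sum()  # the ∝ is resolved by normalizing over x_{t-1}
print(p_prev)
```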

@leekum2018 (Author) commented Dec 27, 2022

Hi, I have another question. In Eq. 9, how is $H(x_{0}^{i})$ computed? In other words, what is the distribution of $x_{0}^{i}$ used to calculate $H(x_{0}^{i})$? I ask because I have a hard time understanding why the following equation holds.
(screenshot of Eq. 9 from the paper)
Thank you!

@Hzfinfdu (Owner) commented Dec 27, 2022

Hi,

In fact, $H(x_0^i)$ can be calculated in many ways. We calculate the entropy of each token by the negative logarithm of its frequency in the tokenized training corpus.

Since a masked token loses all of its information, the information the i-th token is expected to retain at step $t$ is $\overline{\alpha}_t^iH(\textbf{x}_0^i)$, where $\overline{\alpha}_t^i$ is the probability that the token is still unmasked at $t$. We get Eq. 9 by taking the sum over the sequence.
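A minimal sketch of that entropy estimate (hypothetical names; the repository may implement it differently):

```python
import math
from collections import Counter

def token_entropies(tokenized_corpus):
    """H(x_0^i) per token, estimated as the negative log of the token's
    relative frequency in the tokenized training corpus."""
    counts = Counter(tok for seq in tokenized_corpus for tok in seq)
    total = sum(counts.values())
    return {tok: -math.log(c / total) for tok, c in counts.items()}

# Rare tokens carry more information than frequent ones:
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
H = token_entropies(corpus)
assert H["the"] < H["cat"]  # "the" is more frequent, hence less informative
```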

Hope this helps.

@leekum2018 (Author)

For the following formula from Structured Denoising Diffusion Models in Discrete State-Spaces, why is the LHS proportional to the RHS? Could you please give me some hints? I have a hard time deriving this.
$$p_{\theta}(x_{t-1}|x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t|\widetilde{x}_0)\, \widetilde{p}_{\theta}(\widetilde{x}_0|x_t)$$

@Siddharth-Shrivastava7

Hi @leekum2018,

you can refer to this: https://openreview.net/forum?id=h7-XixPCAL&noteId=xm7onR_Sg0L
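One way to see the key step (my own sketch, not necessarily the argument in the linked note): by the Markov property of the forward process,

$$q(x_{t-1}, x_t|\widetilde{x}_0) = q(x_t|x_{t-1}, \widetilde{x}_0)\, q(x_{t-1}|\widetilde{x}_0) = q(x_t|x_{t-1})\, q(x_{t-1}|\widetilde{x}_0),$$

so the parameterization becomes

$$p_{\theta}(x_{t-1}|x_t) \propto q(x_t|x_{t-1}) \sum_{\widetilde{x}_0} q(x_{t-1}|\widetilde{x}_0)\, \widetilde{p}_{\theta}(\widetilde{x}_0|x_t),$$

where every factor is computable from the forward transition matrices, and the $\propto$ simply means the result is normalized over $x_{t-1}$ so that it sums to one.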

Hope it helps!
