
Inquiry on some details of the method. #8

Open · leekum2018 opened this issue Dec 20, 2022 · 7 comments

@leekum2018 commented Dec 20, 2022

As said in the second paragraph of Section 4.3, "We attribute the superior performance of DiffusionBERT to its onetime sampling of all tokens". I wonder about the meaning of "onetime sampling of all tokens": does it mean generating all the tokens in a sentence at once? If so, it seems to conflict with the demonstration in Table 1. Thank you!

@Hzfinfdu (Owner)

Hi,

Yes, we generate all tokens in one diffusion step. We use DDIM sampling to predict $x_0$ and then obtain $x_{t-1}$ from the forward process. The demonstration in Table 1 shows the input to BERT at time step $t-1$.

Besides, the corresponding predicted $x_0$ consists of less informative tokens when $t$ is large and gradually gains semantic meaning as $t$ approaches 0. That is also the motivation for our spindle noise schedule.
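For concreteness, here is a minimal sketch of one reverse step under an absorbing ([MASK]) forward process, following the predict-$x_0$-then-renoise pattern described above. This is an illustration, not the repository's actual code; `model`, `alpha_bar`, and `mask_id` are placeholder names, and the schedule here is a single scalar per step rather than the paper's per-token spindle schedule.

```python
import torch

def reverse_step(model, x_t, t, alpha_bar, mask_id):
    """One reverse step: predict all of x_0 at once, then draw x_{t-1}
    by re-applying the forward (masking) process to the prediction.

    alpha_bar[s] = probability that a token is still unmasked at step s.
    """
    # A BERT-style model predicts a distribution over x_0 at every position.
    logits = model(x_t)                    # (batch, seq_len, vocab)
    x0_hat = logits.argmax(dim=-1)         # point estimate of x_0

    # Forward process to step t-1: each position stays unmasked with
    # probability alpha_bar[t-1], otherwise it is set back to [MASK].
    keep = torch.rand(x0_hat.shape) < alpha_bar[t - 1]
    mask = torch.full_like(x0_hat, mask_id)
    return torch.where(keep, x0_hat, mask)
```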

Hope this helps. If you have more questions, please feel free to contact me.

@leekum2018 (Author) commented Dec 21, 2022

Thank you for your reply! I have a further question. According to your reply, does it mean that you model $p_{\theta}(x_{t-1}|x_t)$ as
$$p_{\theta}(x_{t-1}|x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t|\widetilde{x}_0)\, \widetilde{p}_{\theta}(\widetilde{x}_0|x_t)$$
And is the term $\widetilde{p}(\widetilde{x}_{0}|x_t)$ the output of BERT? Thank you!

@Hzfinfdu (Owner)

Yes, that's right. DDIM sampling helps to trade off speed and generation quality. And predicting $x_0$ directly is closer to the MLM training objective.
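As a toy illustration of this parameterization (a sketch with made-up numbers, not the repository's code), one can compute the sum over $\widetilde{x}_0$ for a single token under an absorbing forward process and then normalize over $x_{t-1}$, which is exactly what the $\propto$ denotes:

```python
import numpy as np

# Toy absorbing-state diffusion for ONE token position.
# Vocabulary: ids 0..V-1 are real tokens, id V is [MASK].
V, MASK = 4, 4
alpha = 0.8  # hypothetical per-step keep probability

# One-step transition matrix Q[i, j] = q(x_t = j | x_{t-1} = i):
# a real token survives with prob alpha, is masked with prob 1 - alpha;
# [MASK] is absorbing.
Q = np.zeros((V + 1, V + 1))
Q[:V, :V] = alpha * np.eye(V)
Q[:V, MASK] = 1 - alpha
Q[MASK, MASK] = 1.0

Q_bar = np.linalg.matrix_power(Q, 2)  # q(x_{t-1} | x_0) for t - 1 = 2 steps

p_x0 = np.array([0.1, 0.2, 0.3, 0.4, 0.0])  # BERT's p~(x0 | x_t), hypothetical
x_t = MASK                                  # the current token is masked

# p(x_{t-1} | x_t) ∝ sum_{x0} q(x_{t-1}, x_t | x0) * p~(x0 | x_t), where the
# Markov property gives q(x_{t-1}, x_t | x0) = q(x_t | x_{t-1}) q(x_{t-1} | x0).
unnorm = np.array([
    Q[x_prev, x_t] * sum(Q_bar[x0, x_prev] * p_x0[x0] for x0 in range(V + 1))
    for x_prev in range(V + 1)
])
p_prev = unnorm / unnorm.sum()  # the ∝ is resolved by normalizing over x_{t-1}
print(p_prev)
```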

@leekum2018 (Author) commented Dec 27, 2022

Hi, I have another question. In Eq. 9, how is $H(x_{0}^{i})$ computed? In other words, what is the distribution of $x_{0}^{i}$ used to calculate $H(x_{0}^{i})$? I ask because I have a hard time understanding why the following equation holds.
(screenshot of Eq. 9 from the paper)
Thank you!

@Hzfinfdu (Owner) commented Dec 27, 2022

Hi,

In fact, $H(x_0^i)$ can be calculated in many ways. We calculate the entropy of each token by the negative logarithm of its frequency in the tokenized training corpus.

Since a masked token loses all of its information, the information the i-th token is expected to retain at step $t$ is $\overline{\alpha}_t^iH(\textbf{x}_0^i)$, where $\overline{\alpha}_t^i$ is the probability that the token is still unmasked at $t$. We get Eq. 9 by taking the sum over the sequence.
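A minimal sketch of that entropy estimate (hypothetical names; the repository may implement it differently):

```python
import math
from collections import Counter

def token_entropies(tokenized_corpus):
    """H(x_0^i) per token, estimated as the negative log of the token's
    relative frequency in the tokenized training corpus."""
    counts = Counter(tok for seq in tokenized_corpus for tok in seq)
    total = sum(counts.values())
    return {tok: -math.log(c / total) for tok, c in counts.items()}

# Rare tokens carry more information than frequent ones:
corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "end"]]
H = token_entropies(corpus)
assert H["the"] < H["cat"]  # "the" is more frequent, hence less informative
```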

Hope this helps.

@leekum2018 (Author)

For the following formula from Structured Denoising Diffusion Models in Discrete State-Spaces, why is the LHS proportional to the RHS? Could you please give me some hints? I have a hard time deriving this.
$$p_{\theta}(x_{t-1}|x_t) \propto \sum_{\widetilde{x}_0} q(x_{t-1}, x_t|\widetilde{x}_0)\, \widetilde{p}_{\theta}(\widetilde{x}_0|x_t)$$

@Siddharth-Shrivastava7

Hi @leekum2018,

you can refer to this: https://openreview.net/forum?id=h7-XixPCAL&noteId=xm7onR_Sg0L
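One way to see the key step (my own sketch, not necessarily the argument in the linked note): by the Markov property of the forward process,

$$q(x_{t-1}, x_t|\widetilde{x}_0) = q(x_t|x_{t-1}, \widetilde{x}_0)\, q(x_{t-1}|\widetilde{x}_0) = q(x_t|x_{t-1})\, q(x_{t-1}|\widetilde{x}_0),$$

so the parameterization becomes

$$p_{\theta}(x_{t-1}|x_t) \propto q(x_t|x_{t-1}) \sum_{\widetilde{x}_0} q(x_{t-1}|\widetilde{x}_0)\, \widetilde{p}_{\theta}(\widetilde{x}_0|x_t),$$

where every factor is computable from the forward transition matrices, and the $\propto$ simply means the result is normalized over $x_{t-1}$ so that it sums to one.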

Hope it helps!
