spot bug in SGDW implementation (weight decay part) #454

Leiay · 2022-08-08T21:04:41Z

Hi,

I was using the SGDW implementation in this repo, and I wonder if anything is wrong with this line:

pytorch-optimizer/torch_optimizer/sgdw.py, line 121 in 910b414:

p.data.add_(weight_decay, alpha=-group['lr'])

Let the weight decay be $\lambda$ and the learning rate be $\mu_t$. If I understand it correctly, this line of code applies weight decay as
$$\theta_t \leftarrow \tilde{\theta}_t - \lambda \mu_t$$
where (following the notation in the paper)

$$\tilde{\theta}_t \leftarrow \theta_{t-1} - m_t$$

But it should be

$$ \begin{aligned} \theta_{t-1} &\leftarrow \theta_{t-1} \cdot (1 - \lambda \mu_t) \\ \theta_t &\leftarrow \theta_{t-1} - m_t \end{aligned} $$

as in the paper.

This results in poor training performance compared to SGD with the same optimization hyperparameters.
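To make the difference concrete, here is a minimal sketch in plain Python (no PyTorch required). The helper names `constant_shift_decay` and `decoupled_decay` and the sample values are hypothetical, chosen only for illustration. They are not taken from the repo, and the momentum step is left out so that only the weight decay term is compared.

```python
def constant_shift_decay(theta, lr, weight_decay):
    # Hypothetical helper mirroring p.data.add_(weight_decay, alpha=-lr):
    # subtracts the same constant lr * weight_decay from every weight,
    # regardless of the weight's value.
    return [t - lr * weight_decay for t in theta]


def decoupled_decay(theta, lr, weight_decay):
    # Hypothetical helper for the update described above:
    # theta <- theta * (1 - lr * weight_decay), which shrinks each weight
    # toward zero in proportion to its magnitude.
    return [t * (1.0 - lr * weight_decay) for t in theta]


# Illustrative values only.
theta = [2.0, -2.0, 0.0]
lr, weight_decay = 0.1, 0.5

shifted = constant_shift_decay(theta, lr, weight_decay)
decayed = decoupled_decay(theta, lr, weight_decay)

print("constant shift:", shifted)
print("decoupled decay:", decayed)
```

With these values, the constant shift moves the negative weight further from zero and pushes the zero weight to a nonzero value, while the multiplicative form shrinks both nonzero weights toward zero and leaves the zero weight unchanged. In the optimizer, the multiplicative form could presumably be written as `p.data.mul_(1 - group['lr'] * weight_decay)`, assuming nothing else in `sgdw.py` depends on the current form.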

Thanks!

Regards, Liu

Leiay mentioned this issue Aug 8, 2022

fix sgdw weight decay bug #455

Open
