[WIP] Add nonlinearity between Down and Up weights, expose alpha hyperparameter #111

Draft
wants to merge 15 commits into base: develop

Conversation

brian6091
Collaborator

Not strictly a LoRA change, but quite easy to fit into the current code. The idea is that squeezing an element-wise nonlinearity in between the Down and Up transformations can in some cases increase your performance at essentially no cost.

I also exposed the alpha parameter (keeping the convention from the original LoRA paper that scale=alpha/rank), which can also get you some extra performance for little effort.
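
To make that concrete, here is a minimal sketch of what such a layer could look like, with the nonlinearity sandwiched between the Down and Up projections and the alpha/rank scaling; the class and argument names are illustrative, not the PR's actual diff:

```python
import torch.nn as nn

class LoraLinearSketch(nn.Module):
    # Illustrative only: frozen base linear plus a low-rank adapter with an
    # optional element-wise nonlinearity between the Down and Up projections.
    def __init__(self, in_features, out_features, r=4, alpha=4.0, nonlin=None):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)      # frozen base weight
        self.lora_down = nn.Linear(in_features, r, bias=False)  # Down: d_in -> r
        self.lora_up = nn.Linear(r, out_features, bias=False)   # Up:   r -> d_out
        self.nonlin = nonlin if nonlin is not None else nn.Identity()
        # LoRA-paper convention: scale = alpha / rank.
        # Defaults alpha=4, r=4 give scale = 1.0, i.e. the previous behaviour.
        self.scale = alpha / r

        nn.init.normal_(self.lora_down.weight, std=1.0 / r)
        nn.init.zeros_(self.lora_up.weight)

    def forward(self, x):
        return self.linear(x) + self.scale * self.lora_up(self.nonlin(self.lora_down(x)))
```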

Ideas from this fantastic paper:

He, Junxian, et al. "Towards a unified view of parameter-efficient transfer learning." arXiv preprint arXiv:2110.04366 (2021).
https://arxiv.org/pdf/2110.04366.pdf

Any thoughts before I start testing more thoroughly?

brian6091 marked this pull request as draft January 5, 2023 14:38
@cloneofsimo
Owner

Nice work as always!!

Keep tune_lora_scale as before to minimize breakage.
@cloneofsimo
Owner

Ooooo so alpha = 4 is the default? This is going to make my head hurt 🫠
So in the paper they had high alpha values, but it seemed to me that was equivalent to scaling the initialization + increasing the learning rate, so I didn't bother...

@cloneofsimo
Owner

What if we set alpha = 1 as default, and scale up lora instead?

@brian6091
Collaborator Author

brian6091 commented Jan 5, 2023

So I set alpha=4 as the default so that everything produces scale=1.0 by default (which is what you had before). Hence my last addition of 'tune_lora_alpha' and leaving 'tune_lora_scale' alone. This should be transparent to most users, as they won't see alpha unless they want to...

I agree that alpha and scale are redundant, and I honestly have no idea why it was defined that way in the paper.
We could remove one, but both alpha and scale are used in different places in the code. If you think it's useful, I could go through and change everything to 'scale'?

@cloneofsimo
Owner

Hmm, I want to talk about this a bit more deeply. It's just that I'm not really understanding why we need to introduce a scale term here. I think the previous implementation, plus applying the scale at initialization, would be enough.

My reasoning: suppose we train a LoRA with scale = 16.
That would mean the end user of that LoRA has to know it was trained with 16, and that 16 is the baseline. So scale = 16 is the "full model", and scale < 16 would mean interpolation.
So the end user needs to know another parameter, in addition to the LoRA itself, to get things done quickly: the scale parameter.

But instead, since

$$ W + sAB^T = W + 1 \cdot (sA)B^T, $$

we could use the scale parameter to make A larger at initialization (and use a higher lr). Then the end user would just know that 1 is the "default full weight" and would need something in the 0–1 range to get a decent result, which is less confusing.
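
A quick numerical check of that identity (toy shapes, not repo code):

```python
import torch

# Scaling the product by s is the same as folding s into A up front.
d, r, s = 8, 4, 16.0
W = torch.randn(d, d)
A = torch.randn(d, r)  # Down weights
B = torch.randn(d, r)  # Up weights

assert torch.allclose(W + s * (A @ B.T), W + 1.0 * ((s * A) @ B.T))
```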

@cloneofsimo
Owner

But that's just my thought, and I think this will change a lot of code all at once (the README and some of your analyses as well), so I'm happy to discuss in more depth.

@brian6091
Collaborator Author

brian6091 commented Jan 6, 2023

You raise great points that I completely agree with. Using both alpha and scale is not a good idea, and even the LoRA authors ignore the distinction after defining it. So I think it's best to revert to just using scale.

Now the question is whether to expose scale to users. As you said, changing the scale can effectively be compensated for by changing the learning rate or the variance of the weight initialization. He et al. casually mention that changing scale can have an impact, and they actually hand-tune it. This may become more relevant if the element-wise nonlinearity is incorporated, since in that case scale is no longer equivalent to just changing the variance of the weight initialization.

So here's what I suggest:

  • Keep scale but not alpha. I'll also clean up any use of alpha and convert it to scale (this is just a parameter naming change, e.g., in tune_lora_scale and weight_apply_lora).
  • Expose scale, but set the default to 1, which is what it was fixed to before. This leaves it open for advanced users to play with.
  • Modify the metadata storage to include the scale value and the name of the nonlinearity (a sketch follows this list).
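
For the third point, a sketch of what that saved metadata could look like; the key names here are hypothetical, not the repo's current schema:

```python
# Hypothetical metadata saved alongside the LoRA weights.
lora_metadata = {
    "rank": 4,
    "scale": 1.0,            # default, matching the previously fixed value
    "nonlinearity": "gelu",  # name of the element-wise nonlinearity, or "none"
}
```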

cloneofsimo changed the base branch from master to develop January 6, 2023 13:42
@cloneofsimo
Owner

I think whatever works is fine, as long as it's backward compatible. Overall, I just want to keep the notions of "full training == $\alpha = 1$" and "base model == $\alpha = 0$" intact.

@cloneofsimo
Owner

Basically, I've built many test scripts around that notion, and a lot of other implementations are based on the same idea, so I think it would surprise a lot of people if alpha were an unknown argument and you suddenly needed a large scale value to get things to work.
Also, I'm kinda OK with the way it is, because LoRA is already too good at fine-tuning; it's just that we need more regularization methods to work on.

@cloneofsimo
Owner

cloneofsimo commented Jan 6, 2023

So maybe keeping alpha intact, setting the default scale = 1, and exposing it to advanced users might work?

  • during saving, fold scale into A?

$$ W' = W + s AB^T = W + (sA) B^T $$

So save $sA , B$ after training?
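
A minimal sketch of that save-time fold, assuming the Down and Up weights are plain tensors (the helper name is hypothetical):

```python
import torch

def fold_scale_into_down(A: torch.Tensor, B: torch.Tensor, scale: float):
    # Since W + s * (A @ B.T) == W + (s * A) @ B.T, saving (s * A, B) lets
    # consumers keep using scale = 1 without knowing the training-time scale.
    return scale * A, B
```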

@brian6091
Collaborator Author

> So maybe keeping alpha intact, setting the default scale = 1, and exposing it to advanced users might work?
>
> • during saving, fold scale into A?
>
> $$ W' = W + sAB^T = W + (sA)B^T $$
>
> So save $sA, B$ after training?

Yeah, that sounds good.
