
mlp_only_layers is more flexible than decoder_sparse_step #30552

Merged: 11 commits merged into huggingface:main on May 10, 2024

Conversation

@eigen2017 (Contributor) commented Apr 29, 2024

Before this PR, the config field decoder_sparse_step decided which layers have experts, i.e. whether each layer uses Qwen2MoeSparseMoeBlock or Qwen2MoeMLP.
However, that selection policy is not flexible enough. For example, with decoder_sparse_step = 2, layers 2, 4, 6, 8, ... use Qwen2MoeSparseMoeBlock while layers 1, 3, 5, 7, ... use Qwen2MoeMLP.

With this PR, the layer-index list "mlp_only_layers" can assign Qwen2MoeSparseMoeBlock or Qwen2MoeMLP to any individual layer; for example, only layer 12 uses Qwen2MoeMLP while all other layers use Qwen2MoeSparseMoeBlock.
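
Roughly, the per-layer choice introduced here can be sketched as follows (an illustrative sketch of the intended rule, not the exact merged code; config stands for a Qwen2MoeConfig-like object):

def uses_sparse_moe_block(layer_idx: int, config) -> bool:
    # Layers listed in mlp_only_layers always fall back to the dense Qwen2MoeMLP.
    if layer_idx in config.mlp_only_layers:
        return False
    # All other layers follow the original decoder_sparse_step rule.
    return config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0

So with mlp_only_layers=[12] and the default decoder_sparse_step of 1, only layer index 12 uses Qwen2MoeMLP.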

Support for "mlp_only_layers" matters a great deal on memory-constrained GPUs such as the V100-16G.
In my tests, Qwen1.5-MoE needed only a little more HBM than was available before hitting a CUDA OOM; after setting just layer 12 to use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock, the model loaded successfully (two V100s, vLLM inference).
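
For reference, a minimal loading sketch (it assumes a transformers build that includes this PR and uses the public checkpoint name from my tests; not the only way to do it):

from transformers import AutoConfig, AutoModelForCausalLM

# Override the new field at load time; layer index 12 falls back to the dense Qwen2MoeMLP.
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat", mlp_only_layers=[12])
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat", config=config, torch_dtype="auto"
)
# Note: the dense MLP weights for layer 12 are not in the original checkpoint, so they
# are newly initialized; finetuning afterwards is advisable (see the mitigations below).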

It's true that switching some layers to Qwen2MoeMLP drops their expert weights and causes some loss of accuracy, but there are mitigations:

  1. Run inference on the finetuning data and find the layer whose Qwen2MoeSparseMoeBlock expert outputs are closest to zero, then set that layer to Qwen2MoeMLP (see the sketch after this list).
  2. Set mlp_only_layers first, then finetune.
  3. In my tests, switching a middle layer such as layer 12 to Qwen2MoeMLP barely affects inference accuracy.
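
For the first mitigation, a rough sketch of how one might rank layers by the magnitude of their MoE-block outputs using forward hooks (hypothetical helper code; it assumes the usual model.model.layers[i].mlp module layout of the Qwen2-MoE implementation):

import torch

def rank_layers_by_expert_output(model, dataloader, device="cuda"):
    # Accumulate the mean absolute output of each layer's MoE block over the dataset.
    sums, counts, handles = {}, {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            # The sparse MoE block may return (hidden_states, router_logits).
            hidden = output[0] if isinstance(output, tuple) else output
            sums[idx] = sums.get(idx, 0.0) + hidden.detach().abs().mean().item()
            counts[idx] = counts.get(idx, 0) + 1
        return hook

    for idx, layer in enumerate(model.model.layers):
        handles.append(layer.mlp.register_forward_hook(make_hook(idx)))

    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            model(**{k: v.to(device) for k, v in batch.items()})

    for handle in handles:
        handle.remove()
    # Layers with the smallest average output are the best candidates for mlp_only_layers.
    return sorted(sums, key=lambda idx: sums[idx] / counts[idx])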

This PR does not change the model design; it only makes the existing decoder_sparse_step field more flexible.
Setting just one, two, or three layers to Qwen2MoeMLP is enough to fit the model into limited HBM while causing minimal damage to the model.

@amyeroberts (Collaborator)

cc @ArthurZucker

@eigen2017 (Contributor, Author)

@amyeroberts Hi,
1. Do I need to continuously merge in the latest changes?
2. Do I need to fix all the CI errors?

@amyeroberts (Collaborator) commented Apr 30, 2024

Hi @eigen2017,

  1. You shouldn't need to continuously update from main. You will want your branch to be from a recent commit on main, and obviously update if there's conflicts on files between this branch and main. Providing the PR is short-lived, the occasional rebase should be enough.
  2. Yes, if they're related to this PR. Looking at the errors at the moment, it looks like these are just connection errors from the hub, in which case there's nothing for you to address. I'll re-run the tests now.

From the commit history, it looks like you might have rebased and not force pushed the branch. When rebasing, it's necessary to force push to properly update the remote, as rebasing is effectively re-writing history.

@eigen2017 (Contributor, Author)

Hi @amyeroberts, thanks for your guidance.
I force-pushed the branch and fixed some workflow CI errors.
See the "Files changed" tab; it's the minimum modification.

@eigen2017 (Contributor, Author)

@amyeroberts @ArthurZucker
All CI checks now pass; please review my code and merge. Thanks again to @amyeroberts for the instructions.
To reiterate, this PR does not affect any existing logic when mlp_only_layers keeps its default empty list,
and mlp_only_layers is a useful config field for fitting Qwen MoE models onto memory-constrained GPUs.

@amyeroberts (Collaborator) left a comment

Hi @eigen2017, thanks for iterating such that the tests are passing and for opening a PR!

I've added general review comments.

Just so you know, we don't normally accept changes like this to the models: adding new config arguments for features which were not present in the original architecture, in particular if there isn't an associated issue or feature request.

I'll let @ArthurZucker decide if this is something that would be wanted for this model.

@@ -434,6 +434,7 @@
"QDQBertConfig",
"QDQBertModel",
"QuestionAnsweringPipeline",
"Qwen2MoeConfig",

This shouldn't be added here - needing this is an indication that the docstring isn't correctly formatted

Comment on lines 869 to 877
isUseQwen2MoeSparseMoeBlock = True
if layer_idx in config.mlp_only_layers:
    isUseQwen2MoeSparseMoeBlock = False
elif config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0:
    isUseQwen2MoeSparseMoeBlock = True
else:
    isUseQwen2MoeSparseMoeBlock = False

if isUseQwen2MoeSparseMoeBlock:

var name here isn't pythonic and there's a useless else

Suggested change (replace the block above with a single condition):
if not (layer_idx in config.mlp_only_layers) and (config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0):

@@ -95,6 +95,11 @@ class Qwen2MoeConfig(PretrainedConfig):
allow the model to output the auxiliary loss, including load balancing loss and router z-loss.
router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
The aux loss factor for the total loss.
mlp_only_layers ([`int`], *optional*, defaults to []):

Suggested change:
mlp_only_layers (`int`, *optional*, defaults to `[]`):

Comment on lines 99 to 102
Indicate which layers use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock
integers in list is layer index, from 0 to 23 if we have 24 layers
when mlp_only_layers is empty, decoder_sparse_step decides Qwen2MoeMLP or Qwen2MoeSparseMoeBlock
when mlp_only_layers is not empty, decoder_sparse_step becomes invalid

This should be reworded - it doesn't parse
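
For example, a possible rewording (an illustrative sketch only, not necessarily the wording that was eventually merged) could read:

    mlp_only_layers (`List[int]`, *optional*, defaults to `[]`):
        Indicate which layers use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock.
        The list contains layer indices, from 0 to num_hidden_layers - 1.
        Layers listed here always use Qwen2MoeMLP, regardless of decoder_sparse_step;
        the remaining layers follow the decoder_sparse_step rule.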

@eigen2017 (Contributor, Author)

@amyeroberts Hi, many thanks for all the reviews!
I will modify my code to match the suggestions.
I still hold that mlp_only_layers is not a new feature; it only makes decoder_sparse_step more flexible.
Consider why the Qwen MoE model exposes decoder_sparse_step to users at all: it means the original architecture already supports dropping specified layers' experts during finetuning and inference. There are other creative scenarios too: task 1 could finetune and infer with only some layers' experts while task 2 uses other layers, improving multi-task results; different data could finetune different layers' experts to improve generalization; or, as in my scenario, some accuracy can be traded for HBM. Yes, I can use decoder_sparse_step to cut experts, but the smallest cut it allows is 12 out of 24 layers, after which the model becomes much weaker and hard to finetune or run inference with.
Whatever scenario decoder_sparse_step is used in, mlp_only_layers can in principle do at least as well, and it supports more scenarios.
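
To make the granularity gap concrete, here is a quick illustrative sketch for a 24-layer model (the numbers and helper name are for illustration only, mirroring the selection rule discussed in the review above):

num_layers = 24
num_experts = 60  # any positive value works; only "num_experts > 0" matters for the rule

def sparse_layer_indices(decoder_sparse_step, mlp_only_layers=()):
    # Layers that keep their Qwen2MoeSparseMoeBlock under the combined rule.
    return [
        idx for idx in range(num_layers)
        if idx not in mlp_only_layers
        and num_experts > 0
        and (idx + 1) % decoder_sparse_step == 0
    ]

print(len(sparse_layer_indices(decoder_sparse_step=2)))
# 12 sparse layers kept -> the other 12 layers lose their experts
print(len(sparse_layer_indices(decoder_sparse_step=1, mlp_only_layers=[12])))
# 23 sparse layers kept -> only layer 12 loses its experts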

I saw many related feature requests and issues while researching this; someone even asked vLLM to support CPU-offload inference the way DeepSpeed does. Incidentally, DeepSpeed recently merged my code; I have 10 years of experience in AI research, and I recently got the Hugging Face framework working on Ascend NPUs (Huawei GPUs).

Please give this PR and the conversation some deep thought before deciding whether to merge; I sincerely want to contribute to the great Hugging Face project. Thanks again. ^_^

@eigen2017 (Contributor, Author)

@amyeroberts Hi, the code has been modified according to your suggestions, thanks! The CI errors are cleared again.
Please help get @ArthurZucker's attention, thanks.
Qwen MoE was also first committed by @bozheng-hit, so @bozheng-hit, if you see this, please reply and share your opinion.

@eigen2017 (Contributor, Author)

As far as I know this model was built by Alibaba, so confirmation from any Alibaba members is welcome too.

@eigen2017 (Contributor, Author) commented May 4, 2024

@huybery Hi! As far as I can tell you are a Qwen team member; please help check this PR and share your opinion, thanks.

@eigen2017 (Contributor, Author)

@ArthurZucker Hi, please give this a review, thanks.

@ArthurZucker (Collaborator) left a comment

I actually think this is a lot simpler than the steps we used. Could be added for all MoE models, but this one is the only one that needs it for now!

@eigen2017 (Contributor, Author)

I actually think this is a lot simpler than the steps we used. Could be added for all MoE models, but this one is the only one that needs it for now!

Many thanks for this confirmation!
I'll commit changes right away to address your review.

@eigen2017 (Contributor, Author)

@ArthurZucker @amyeroberts All of your review comments have been addressed, and CI has passed; please help merge this PR.
Thank you again, and thanks to the great Hugging Face team! ^_^

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eigen2017 (Contributor, Author)

@ArthurZucker Hi, the code has been updated per your review suggestions.

@eigen2017 (Contributor, Author) commented May 10, 2024

@ArthurZucker @amyeroberts Hi, I have finished my scenario test and the report is below; I think it's helpful to record it here.

| metric | no model | glm finetune | moe finetune | moe finetune, then cut exp | moe cut exp, then finetune |
|--------|----------|--------------|--------------|----------------------------|----------------------------|
| acc    | 48.15%   | 53.69%       | 54.66%       | 50.38%                     | 51.91%                     |
| gen    | NA       | 50 tok/s     | 85 tok/s     | 105 tok/s                  | 105 tok/s                  |
| weight | NA       | 12GB         | 27GB         | 22GB                       | 22GB                       |
| gpu    | NA       | 1            | 4            | 2                          | 2                          |

overview:

| concept | explanation |
|---------|-------------|
| task | use a glm or moe model to summarize and rewrite long user queries, then pass them to an FAQ system, hoping for better FAQ accuracy |
| data set | non-public, for a specific scenario |
| infer tech | vLLM, which needs extra HBM for KV blocks; that's why the original 27GB moe cannot load on 2 GPUs |
| no model | pass the long user query directly to the FAQ system |
| glm | chatglm3-6b |
| glm finetune | glm LoRA finetune, infer, then pass the generated query to the FAQ system |
| moe | Qwen1.5-MoE-A2.7B-Chat |
| moe finetune | moe LoRA finetune, infer, then pass the generated query to the FAQ system |
| moe finetune, then cut exp | moe LoRA finetune, cut experts (set mlp_only_layers to [12,14,16,18]), infer, then pass the generated query to the FAQ system |
| moe cut exp, then finetune | moe cut experts (set mlp_only_layers to [12,14,16,18]), then LoRA finetune, infer, then pass the generated query to the FAQ system |

concepts:

| concept | explanation |
|---------|-------------|
| acc | the final FAQ accuracy |
| gen | token generation speed when serving a single request (the shortened form of one user query) |
| weight | minimal HBM requirement for one instance of the model |
| gpu | minimal number of NVIDIA V100-16G cards |

conclusions:

  1. moe has better accuracy and generation speed than glm but needs more HBM; it trades HBM space for speed. Yes, 2 GPUs can hold 2 instances of glm, but that only doubles throughput; for a single request, moe is about twice as fast as glm.
  2. Cutting experts causes an accuracy drop, but lowers the GPU requirement and raises generation speed.
  3. Cutting experts before finetuning gives better accuracy than cutting them after finetuning.

In the table above, "cut exp" means cutting 4 layers: [12,14,16,18]. I also tried vLLM's --enforce-eager option, which disables the CUDA graph feature to trade generation speed for a lower HBM requirement; with it, I could cut only one layer's experts (layer 12). Generation speed dropped to 60 tok/s and accuracy rose to 53.43% on 2 GPUs.

This is only my scenario for mlp_only_layers; I think this flexible config field enables many other uses.
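
For anyone reproducing the setup, one simple way to apply the cut to a local checkpoint is to edit its config.json before loading; a sketch, with the path as a placeholder:

import json, pathlib

ckpt_dir = pathlib.Path("/path/to/Qwen1.5-MoE-A2.7B-Chat")  # local checkpoint directory (placeholder)
cfg_file = ckpt_dir / "config.json"

cfg = json.loads(cfg_file.read_text())
cfg["mlp_only_layers"] = [12, 14, 16, 18]  # the four layers cut in the report above
cfg_file.write_text(json.dumps(cfg, indent=2))
# The listed layers now fall back to Qwen2MoeMLP at load time; their dense weights are
# newly initialized, so a LoRA finetune afterwards is recommended (see conclusion 3).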

@ArthurZucker (Collaborator) left a comment

Thanks for iterating!

@ArthurZucker merged commit 1c52cb7 into huggingface:main on May 10, 2024. 20 checks passed.
@eigen2017 (Contributor, Author)

Thanks for iterating!

It's my honor and pleasure!

@ArthurZucker (Collaborator)

🤗
