
mlp_only_layers is more flexible than decoder_sparse_step #30552

Merged: 11 commits merged into huggingface:main on May 10, 2024

Conversation

@eigen2017 (Contributor) commented Apr 29, 2024

Before this PR, the config field decoder_sparse_step decided which layers have experts, i.e. whether each layer uses Qwen2MoeSparseMoeBlock or Qwen2MoeMLP.
However, that selection policy is not flexible enough. For example, with decoder_sparse_step = 2, layers 2, 4, 6, 8, ... use Qwen2MoeSparseMoeBlock while layers 1, 3, 5, 7, ... use Qwen2MoeMLP.

With this PR, the layer-index list "mlp_only_layers" can assign Qwen2MoeSparseMoeBlock or Qwen2MoeMLP to any individual layer; for example, only layer 12 uses Qwen2MoeMLP while all other layers use Qwen2MoeSparseMoeBlock.
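
Roughly, the per-layer choice introduced here can be sketched as follows (an illustrative sketch of the intended rule, not the exact merged code; config stands for a Qwen2MoeConfig-like object):

def uses_sparse_moe_block(layer_idx: int, config) -> bool:
    # Layers listed in mlp_only_layers always fall back to the dense Qwen2MoeMLP.
    if layer_idx in config.mlp_only_layers:
        return False
    # All other layers follow the original decoder_sparse_step rule.
    return config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0

So with mlp_only_layers=[12] and the default decoder_sparse_step of 1, only layer index 12 uses Qwen2MoeMLP.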

Support for "mlp_only_layers" matters a great deal on memory-constrained GPUs such as the V100-16G.
In my tests, Qwen1.5-MoE needed only a little more HBM than was available before hitting a CUDA OOM; after setting just layer 12 to use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock, the model loaded successfully (two V100s, vLLM inference).
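
For reference, a minimal loading sketch (it assumes a transformers build that includes this PR and uses the public checkpoint name from my tests; not the only way to do it):

from transformers import AutoConfig, AutoModelForCausalLM

# Override the new field at load time; layer index 12 falls back to the dense Qwen2MoeMLP.
config = AutoConfig.from_pretrained("Qwen/Qwen1.5-MoE-A2.7B-Chat", mlp_only_layers=[12])
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen1.5-MoE-A2.7B-Chat", config=config, torch_dtype="auto"
)
# Note: the dense MLP weights for layer 12 are not in the original checkpoint, so they
# are newly initialized; finetuning afterwards is advisable (see the mitigations below).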

It's true that switching some layers to Qwen2MoeMLP drops their expert weights and causes some loss of accuracy, but there are mitigations:

  1. Run inference on the finetuning data and find the layer whose Qwen2MoeSparseMoeBlock expert outputs are closest to zero, then set that layer to Qwen2MoeMLP (see the sketch after this list).
  2. Set mlp_only_layers first, then finetune.
  3. In my tests, switching a middle layer such as layer 12 to Qwen2MoeMLP barely affects inference accuracy.
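
For the first mitigation, a rough sketch of how one might rank layers by the magnitude of their MoE-block outputs using forward hooks (hypothetical helper code; it assumes the usual model.model.layers[i].mlp module layout of the Qwen2-MoE implementation):

import torch

def rank_layers_by_expert_output(model, dataloader, device="cuda"):
    # Accumulate the mean absolute output of each layer's MoE block over the dataset.
    sums, counts, handles = {}, {}, []

    def make_hook(idx):
        def hook(module, inputs, output):
            # The sparse MoE block may return (hidden_states, router_logits).
            hidden = output[0] if isinstance(output, tuple) else output
            sums[idx] = sums.get(idx, 0.0) + hidden.detach().abs().mean().item()
            counts[idx] = counts.get(idx, 0) + 1
        return hook

    for idx, layer in enumerate(model.model.layers):
        handles.append(layer.mlp.register_forward_hook(make_hook(idx)))

    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            model(**{k: v.to(device) for k, v in batch.items()})

    for handle in handles:
        handle.remove()
    # Layers with the smallest average output are the best candidates for mlp_only_layers.
    return sorted(sums, key=lambda idx: sums[idx] / counts[idx])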

This PR does not change the model design; it only makes the existing decoder_sparse_step field more flexible.
Setting just one, two, or three layers to Qwen2MoeMLP is enough to fit the model into limited HBM while causing minimal damage to the model.

@amyeroberts (Collaborator)

cc @ArthurZucker

@eigen2017 (Contributor, Author)

@amyeroberts Hi,
1. Do I need to continuously merge in the latest changes?
2. Do I need to fix all the CI errors?

@amyeroberts (Collaborator) commented Apr 30, 2024

Hi @eigen2017,

  1. You shouldn't need to continuously update from main. You will want your branch to be from a recent commit on main, and obviously update if there's conflicts on files between this branch and main. Providing the PR is short-lived, the occasional rebase should be enough.
  2. Yes, if they're related to this PR. Looking at the errors at the moment, it looks like these are just connection errors from the hub, in which case there's nothing for you to address. I'll re-run the tests now.

From the commit history, it looks like you might have rebased and not force pushed the branch. When rebasing, it's necessary to force push to properly update the remote, as rebasing is effectively re-writing history.

@eigen2017 (Contributor, Author)

Hi @amyeroberts, thanks for your guidance.
I force-pushed the branch and fixed some workflow CI errors.
See the "Files changed" tab; it's the minimum modification.

@eigen2017 (Contributor, Author)

@amyeroberts @ArthurZucker
All CI checks now pass; please review my code and merge. Thanks again to @amyeroberts for the instructions.
To reiterate, this PR does not affect any existing logic when mlp_only_layers keeps its default empty list,
and mlp_only_layers is a useful config field for fitting Qwen MoE models onto memory-constrained GPUs.

@amyeroberts (Collaborator) left a comment

Hi @eigen2017, thanks for iterating such that the tests are passing and for opening a PR!

I've added general review comments.

Just so you know, we don't normally accept changes like this to the models: adding new config arguments for features which were not present in the original architecture, in particular if there isn't an associated issue or feature request.

I'll let @ArthurZucker decide if this is something that would be wanted for this model.

@@ -434,6 +434,7 @@
"QDQBertConfig",
"QDQBertModel",
"QuestionAnsweringPipeline",
"Qwen2MoeConfig",

This shouldn't be added here - needing this is an indication that the docstring isn't correctly formatted

Comment on lines 869 to 877
isUseQwen2MoeSparseMoeBlock = True
if layer_idx in config.mlp_only_layers:
    isUseQwen2MoeSparseMoeBlock = False
elif config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0:
    isUseQwen2MoeSparseMoeBlock = True
else:
    isUseQwen2MoeSparseMoeBlock = False

if isUseQwen2MoeSparseMoeBlock:

var name here isn't pythonic and there's a useless else

Suggested change (replace the block above with a single condition):
if not (layer_idx in config.mlp_only_layers) and (config.num_experts > 0 and (layer_idx + 1) % config.decoder_sparse_step == 0):

@@ -95,6 +95,11 @@ class Qwen2MoeConfig(PretrainedConfig):
allow the model to output the auxiliary loss, including load balancing loss and router z-loss.
router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
The aux loss factor for the total loss.
mlp_only_layers ([`int`], *optional*, defaults to []):

Suggested change:
mlp_only_layers (`int`, *optional*, defaults to `[]`):

Comment on lines 99 to 102
Indicate which layers use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock
integers in list is layer index, from 0 to 23 if we have 24 layers
when mlp_only_layers is empty, decoder_sparse_step decides Qwen2MoeMLP or Qwen2MoeSparseMoeBlock
when mlp_only_layers is not empty, decoder_sparse_step becomes invalid

This should be reworded - it doesn't parse
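
For example, a possible rewording (an illustrative sketch only, not necessarily the wording that was eventually merged) could read:

    mlp_only_layers (`List[int]`, *optional*, defaults to `[]`):
        Indicate which layers use Qwen2MoeMLP rather than Qwen2MoeSparseMoeBlock.
        The list contains layer indices, from 0 to num_hidden_layers - 1.
        Layers listed here always use Qwen2MoeMLP, regardless of decoder_sparse_step;
        the remaining layers follow the decoder_sparse_step rule.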

@eigen2017 (Contributor, Author)

@amyeroberts Hi, many thanks for all the reviews!
I will modify my code to match the suggestions.
I still hold that mlp_only_layers is not a new feature; it only makes decoder_sparse_step more flexible.
Consider why the Qwen MoE model exposes decoder_sparse_step to users at all: it means the original architecture already supports dropping specified layers' experts during finetuning and inference. There are other creative scenarios too: task 1 could finetune and infer with only some layers' experts while task 2 uses other layers, improving multi-task results; different data could finetune different layers' experts to improve generalization; or, as in my scenario, some accuracy can be traded for HBM. Yes, I can use decoder_sparse_step to cut experts, but the smallest cut it allows is 12 out of 24 layers, after which the model becomes much weaker and hard to finetune or run inference with.
Whatever scenario decoder_sparse_step is used in, mlp_only_layers can in principle do at least as well, and it supports more scenarios.
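
To make the granularity gap concrete, here is a quick illustrative sketch for a 24-layer model (the numbers and helper name are for illustration only, mirroring the selection rule discussed in the review above):

num_layers = 24
num_experts = 60  # any positive value works; only "num_experts > 0" matters for the rule

def sparse_layer_indices(decoder_sparse_step, mlp_only_layers=()):
    # Layers that keep their Qwen2MoeSparseMoeBlock under the combined rule.
    return [
        idx for idx in range(num_layers)
        if idx not in mlp_only_layers
        and num_experts > 0
        and (idx + 1) % decoder_sparse_step == 0
    ]

print(len(sparse_layer_indices(decoder_sparse_step=2)))
# 12 sparse layers kept -> the other 12 layers lose their experts
print(len(sparse_layer_indices(decoder_sparse_step=1, mlp_only_layers=[12])))
# 23 sparse layers kept -> only layer 12 loses its experts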

I saw many related feature requests and issues while researching this; someone even asked vLLM to support CPU-offload inference the way DeepSpeed does. Incidentally, DeepSpeed recently merged my code; I have 10 years of experience in AI research, and I recently got the Hugging Face framework working on Ascend NPUs (Huawei GPUs).

Please give this PR and the conversation some deep thought before deciding whether to merge; I sincerely want to contribute to the great Hugging Face project. Thanks again. ^_^

@eigen2017 (Contributor, Author)

@amyeroberts Hi, the code has been modified according to your suggestions, thanks! The CI errors are cleared again.
Please help get @ArthurZucker's attention, thanks.
Qwen MoE was also first committed by @bozheng-hit, so @bozheng-hit, if you see this, please reply and share your opinion.

@eigen2017 (Contributor, Author)

As far as I know this model was built by Alibaba, so confirmation from any Alibaba members is welcome too.

@eigen2017 (Contributor, Author) commented May 4, 2024

@huybery Hi! As far as I can tell you are a Qwen team member; please help check this PR and share your opinion, thanks.

@eigen2017 (Contributor, Author)

@ArthurZucker Hi, please give this a review, thanks.

@ArthurZucker (Collaborator) left a comment

I actually think this is a lot simpler than the steps we used. Could be added for all MoE models, but this one is the only one that needs it for now!

@eigen2017 (Contributor, Author)

I actually think this is a lot simpler than the steps we used. Could be added for all MoE models, but this one is the only one that needs it for now!

Many thanks for this confirmation!
I'll commit changes right away to address your review.

@eigen2017 (Contributor, Author)

@ArthurZucker @amyeroberts All of your review comments have been addressed, and CI has passed; please help merge this PR.
Thank you again, and thanks to the great Hugging Face team! ^_^

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@eigen2017 (Contributor, Author)

@ArthurZucker Hi, the code has been updated per your review suggestions.

@eigen2017 (Contributor, Author) commented May 10, 2024

@ArthurZucker @amyeroberts Hi, I have finished my scenario test and the report is below; I think it's helpful to record it here.

| metric | no model | glm finetune | moe finetune | moe finetune, then cut exp | moe cut exp, then finetune |
|--------|----------|--------------|--------------|----------------------------|----------------------------|
| acc    | 48.15%   | 53.69%       | 54.66%       | 50.38%                     | 51.91%                     |
| gen    | NA       | 50 tok/s     | 85 tok/s     | 105 tok/s                  | 105 tok/s                  |
| weight | NA       | 12GB         | 27GB         | 22GB                       | 22GB                       |
| gpu    | NA       | 1            | 4            | 2                          | 2                          |

overview:

| concept | explanation |
|---------|-------------|
| task | use a glm or moe model to summarize and rewrite long user queries, then pass them to an FAQ system, hoping for better FAQ accuracy |
| data set | non-public, for a specific scenario |
| infer tech | vLLM, which needs extra HBM for KV blocks; that's why the original 27GB moe cannot load on 2 GPUs |
| no model | pass the long user query directly to the FAQ system |
| glm | chatglm3-6b |
| glm finetune | glm LoRA finetune, infer, then pass the generated query to the FAQ system |
| moe | Qwen1.5-MoE-A2.7B-Chat |
| moe finetune | moe LoRA finetune, infer, then pass the generated query to the FAQ system |
| moe finetune, then cut exp | moe LoRA finetune, cut experts (set mlp_only_layers to [12,14,16,18]), infer, then pass the generated query to the FAQ system |
| moe cut exp, then finetune | moe cut experts (set mlp_only_layers to [12,14,16,18]), then LoRA finetune, infer, then pass the generated query to the FAQ system |

concepts:

| concept | explanation |
|---------|-------------|
| acc | the final FAQ accuracy |
| gen | token generation speed when serving a single request (the shortened form of one user query) |
| weight | minimal HBM requirement for one instance of the model |
| gpu | minimal number of NVIDIA V100-16G cards |

conclusions:

  1. moe has better accuracy and generation speed than glm but needs more HBM; it trades HBM space for speed. Yes, 2 GPUs can hold 2 instances of glm, but that only doubles throughput; for a single request, moe is about twice as fast as glm.
  2. Cutting experts causes an accuracy drop, but lowers the GPU requirement and raises generation speed.
  3. Cutting experts before finetuning gives better accuracy than cutting them after finetuning.

In the table above, "cut exp" means cutting 4 layers: [12,14,16,18]. I also tried vLLM's --enforce-eager option, which disables the CUDA graph feature to trade generation speed for a lower HBM requirement; with it, I could cut only one layer's experts (layer 12). Generation speed dropped to 60 tok/s and accuracy rose to 53.43% on 2 GPUs.

This is only my scenario for mlp_only_layers; I think this flexible config field enables many other uses.
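
For anyone reproducing the setup, one simple way to apply the cut to a local checkpoint is to edit its config.json before loading; a sketch, with the path as a placeholder:

import json, pathlib

ckpt_dir = pathlib.Path("/path/to/Qwen1.5-MoE-A2.7B-Chat")  # local checkpoint directory (placeholder)
cfg_file = ckpt_dir / "config.json"

cfg = json.loads(cfg_file.read_text())
cfg["mlp_only_layers"] = [12, 14, 16, 18]  # the four layers cut in the report above
cfg_file.write_text(json.dumps(cfg, indent=2))
# The listed layers now fall back to Qwen2MoeMLP at load time; their dense weights are
# newly initialized, so a LoRA finetune afterwards is recommended (see conclusion 3).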

@ArthurZucker (Collaborator) left a comment

Thanks for iterating!

@ArthurZucker merged commit 1c52cb7 into huggingface:main on May 10, 2024. 20 checks passed.
@eigen2017 (Contributor, Author)

Thanks for iterating!

It's my honor and pleasure!

@ArthurZucker (Collaborator)

🤗
