[PEFT on Gaudi2C] speed of Full-parameter Finetuning is almost equal to that of LoRA #952

Closed
intelyoungway opened this issue May 6, 2024 · 6 comments

Comments

@intelyoungway

intelyoungway commented May 6, 2024

Feature request

  1. [Model] chinese-alpaca-2-7b
  2. [Hardware] Gaudi2C
  3. [Method] LoRA and FineTuning
  4. [Related codes] examples/language_modeling
  5. [Test Cmdlines]:
  • LoRA:
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 \
run_lora_clm.py \
--model_name_or_path /workspace/chinese-alpaca-2-7b  \
--deepspeed llama2_ds_zero3_config.json \
--dataset_name tatsu-lab/alpaca  \
--bf16 True \
--output_dir /workspace/lora_out \
--overwrite_output_dir \
--num_train_epochs 2 \
--max_seq_len 2048 \
--per_device_train_batch_size 10 \
--per_device_eval_batch_size 10  \
--gradient_checkpointing  \
--evaluation_strategy epoch \
--eval_delay 2 \
--save_strategy no \
--learning_rate 0.0018 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--dataset_concatenation \
--attn_softmax_bf16 True \
--do_train  --do_eval \
--use_habana \
--use_lazy_mode \
--pipelining_fwd_bwd \
--throughput_warmup_steps 3 \
--lora_rank 4 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--validation_split_percentage 4 \
--use_flash_attention True
  • FineTuning (a script the customer modified from run_lora_clm.py to do full-parameter fine-tuning; see the attached tmp_finetune.zip and the sketch below):
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 \
tmp_finetune.py \
--model_name_or_path /workspace/chinese-alpaca-2-7b \
--deepspeed llama2_ds_zero3_config.json \
--dataset_name tatsu-lab/alpaca \
--bf16 True \
--output_dir /workspace/lora_out --overwrite_output_dir \
--num_train_epochs 2 --max_seq_len 2048 --per_device_train_batch_size 10 \
--per_device_eval_batch_size 10 --gradient_checkpointing --evaluation_strategy epoch \
--eval_delay 2 --save_strategy no --learning_rate 0.0018 \
--warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 \
--dataset_concatenation --attn_softmax_bf16 True --do_train \
--use_habana \
--use_lazy_mode \
--pipelining_fwd_bwd \
--throughput_warmup_steps 3 \
--lora_rank 4 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--validation_split_percentage 4 \
--use_flash_attention True

tmp_finetune.zip
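
For context, the code-level difference between the two runs should essentially come down to whether the base model is wrapped with a LoRA adapter before training. Below is a minimal sketch of the two setups, assuming the standard transformers/peft APIs that run_lora_clm.py builds on; the exact contents of tmp_finetune.py may differ:

# Minimal sketch (assumed APIs: transformers + peft); for illustration only,
# not the exact code in run_lora_clm.py or the attached tmp_finetune.py.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("/workspace/chinese-alpaca-2-7b")

# Full-parameter fine-tuning would pass `model` to the Trainer as-is,
# so all ~7B weights receive gradients and optimizer states.

# LoRA fine-tuning instead freezes the base weights and trains only small
# rank-4 adapters injected into the attention projections (matching the
# --lora_rank / --lora_target_modules flags above).
lora_config = LoraConfig(
    r=4,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports well under 1% trainable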

Motivation

The customer found that full-parameter fine-tuning runs at about 14 train samples per second, which is very close to LoRA's roughly 16 train samples per second.
Please see the details in the feature request above and check whether there is any way to optimize LoRA for better performance.

@LeoZhao-Intel

Can you attach the training logs to ease analysis?

@intelyoungway
Author

Sure, I am asking the customer for feedback.

@yafshar
Contributor

yafshar commented Jun 18, 2024

@intelyoungway, the attached script is also doing LoRA fine-tuning. Would you clarify what the exact issue/request is?

@intelyoungway
Author

The customer said they modified the original LoRA script to do full fine-tuning (see the attached files).
The issue is that the training speed of LoRA and of their modified fine-tuning script is almost the same, which is strange because LoRA should be significantly faster in theory.
So the resolution is simple:
(1) If the attached file is a correct full fine-tuning implementation, please provide an optimized LoRA script that is significantly faster than full fine-tuning.
(2) If it is incorrect, please confirm that the customer's modified script is not a correct implementation of full fine-tuning; I will pass that message to the customer and close the ticket.
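
One quick way to settle (1) vs (2) would be to log the trainable-parameter count inside tmp_finetune.py once the model (and any LoRA wrapping) has been built. If it still reports only a few million trainable parameters, the modified script is still applying LoRA rather than doing full fine-tuning. A small sketch of such a check (the helper name is illustrative, not taken from the attached script):

def report_trainable(model):
    # Count how many parameters actually receive gradients. A full-parameter
    # run of a 7B model should report ~100% trainable; a rank-4 LoRA run
    # should report only a tiny fraction of the total.
    # Note: with DeepSpeed ZeRO-3 the weights may already be partitioned,
    # so run this check before engine initialization for accurate counts.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100 * trainable / total:.2f}%)")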

@yafshar
Contributor

yafshar commented Jun 20, 2024

@intelyoungway, thanks for the comment. From what you said, the goal is to compare full-parameter fine-tuning of the model with LoRA fine-tuning.

As per the original LoRA paper from Microsoft (https://arxiv.org/abs/2106.09685), it is theoretically expected that full-parameter and LoRA fine-tuning will not yield the same performance, especially when low ranks are used in LoRA. The disparity in the number of trainable parameters is a key factor here.
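
As a rough back-of-the-envelope count (assuming the standard LLaMA-2-7B shape of 32 decoder layers, hidden size 4096, and 4096x4096 q/k/v/o projections), rank-4 LoRA on those four modules trains only a few million parameters out of roughly 7B:

# Approximate LoRA parameter count for the settings used above
# (assumed LLaMA-2-7B dimensions; adjust if the model differs).
layers, hidden, rank, modules_per_layer = 32, 4096, 4, 4
lora_params = layers * modules_per_layer * 2 * rank * hidden  # A (r x d) plus B (d x r)
print(lora_params)          # ~4.2 million trainable parameters
print(lora_params / 6.7e9)  # ~0.06% of the ~6.7B base parameters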

From the attached script, I see you are using the same run_lora_clm.py script with some minor modifications for both full-parameter and LoRA fine-tuning. If the performance is the same, the script might have an issue. It would help if you used run_clm.py for full-parameter fine-tuning and run_lora_clm.py for LoRA fine-tuning.

For me or anyone else to be able to help, I need more details, especially log files, number of parameters, etc.

@intelyoungway
Author

Thanks for the explanation. I think this fulfills the need. The issue can be closed now.
