[PEFT on Gaudi2C] speed of Full-parameter Finetuning is almost equal to that of LoRA #952

Closed
intelyoungway opened this issue May 6, 2024 · 6 comments

Comments

@intelyoungway

intelyoungway commented May 6, 2024

Feature request

  1. [Model] chinese-alpaca-2-7b
  2. [Hardware] Gaudi2C
  3. [Method] LoRA and FineTuning
  4. [Related codes] examples/language_modeling
  5. [Test Cmdlines]:
  • LoRA:
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 \
run_lora_clm.py \
--model_name_or_path /workspace/chinese-alpaca-2-7b  \
--deepspeed llama2_ds_zero3_config.json \
--dataset_name tatsu-lab/alpaca  \
--bf16 True \
--output_dir /workspace/lora_out \
--overwrite_output_dir \
--num_train_epochs 2 \
--max_seq_len 2048 \
--per_device_train_batch_size 10 \
--per_device_eval_batch_size 10  \
--gradient_checkpointing  \
--evaluation_strategy epoch \
--eval_delay 2 \
--save_strategy no \
--learning_rate 0.0018 \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--dataset_concatenation \
--attn_softmax_bf16 True \
--do_train  --do_eval \
--use_habana \
--use_lazy_mode \
--pipelining_fwd_bwd \
--throughput_warmup_steps 3 \
--lora_rank 4 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--validation_split_percentage 4 \
--use_flash_attention True
  • FineTuning (a script the customer modified from run_lora_clm.py to do full-parameter fine-tuning; see the attached tmp_finetune.zip and the sketch below):
python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 \
tmp_finetune.py \
--model_name_or_path /workspace/chinese-alpaca-2-7b \
--deepspeed llama2_ds_zero3_config.json \
--dataset_name tatsu-lab/alpaca \
--bf16 True \
--output_dir /workspace/lora_out --overwrite_output_dir \
--num_train_epochs 2 --max_seq_len 2048 --per_device_train_batch_size 10 \
--per_device_eval_batch_size 10 --gradient_checkpointing --evaluation_strategy epoch \
--eval_delay 2 --save_strategy no --learning_rate 0.0018 \
--warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 \
--dataset_concatenation --attn_softmax_bf16 True --do_train \
--use_habana \
--use_lazy_mode \
--pipelining_fwd_bwd \
--throughput_warmup_steps 3 \
--lora_rank 4 \
--lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" \
--validation_split_percentage 4 \
--use_flash_attention True

tmp_finetune.zip
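
For context, the code-level difference between the two runs should essentially come down to whether the base model is wrapped with a LoRA adapter before training. Below is a minimal sketch of the two setups, assuming the standard transformers/peft APIs that run_lora_clm.py builds on; the exact contents of tmp_finetune.py may differ:

# Minimal sketch (assumed APIs: transformers + peft); for illustration only,
# not the exact code in run_lora_clm.py or the attached tmp_finetune.py.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("/workspace/chinese-alpaca-2-7b")

# Full-parameter fine-tuning would pass `model` to the Trainer as-is,
# so all ~7B weights receive gradients and optimizer states.

# LoRA fine-tuning instead freezes the base weights and trains only small
# rank-4 adapters injected into the attention projections (matching the
# --lora_rank / --lora_target_modules flags above).
lora_config = LoraConfig(
    r=4,
    target_modules=["q_proj", "v_proj", "k_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports well under 1% trainable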

Motivation

The customer found that full-parameter fine-tuning runs at about 14 train samples per second, which is very close to LoRA's roughly 16 train samples per second.
Please see the details in the feature request above and check whether there is any way to optimize LoRA for better performance.

@LeoZhao-Intel

Can you attach the training logs to ease analysis?

@intelyoungway
Author

Sure, I am asking the customer for feedback.

@yafshar
Contributor

yafshar commented Jun 18, 2024

@intelyoungway, the attached script is also doing LoRA fine-tuning. Would you clarify what the exact issue/request is?

@intelyoungway
Author

The customer said they modified the original LoRA script to do full fine-tuning (see the attached files).
The issue is that the training speed of LoRA and of their modified fine-tuning script is almost the same, which is strange because LoRA should be significantly faster in theory.
So the resolution is simple:
(1) If the attached file is a correct full fine-tuning implementation, please provide an optimized LoRA script that is significantly faster than full fine-tuning.
(2) If it is incorrect, please confirm that the customer's modified script is not a correct implementation of full fine-tuning; I will pass that message to the customer and close the ticket.
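
One quick way to settle (1) vs (2) would be to log the trainable-parameter count inside tmp_finetune.py once the model (and any LoRA wrapping) has been built. If it still reports only a few million trainable parameters, the modified script is still applying LoRA rather than doing full fine-tuning. A small sketch of such a check (the helper name is illustrative, not taken from the attached script):

def report_trainable(model):
    # Count how many parameters actually receive gradients. A full-parameter
    # run of a 7B model should report ~100% trainable; a rank-4 LoRA run
    # should report only a tiny fraction of the total.
    # Note: with DeepSpeed ZeRO-3 the weights may already be partitioned,
    # so run this check before engine initialization for accurate counts.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable:,} / total: {total:,} "
          f"({100 * trainable / total:.2f}%)")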

@yafshar
Contributor

yafshar commented Jun 20, 2024

@intelyoungway, thanks for the comment. From what you said, the goal is to compare full-parameter fine-tuning of the model with LoRA fine-tuning.

As per the original LoRA paper from Microsoft (https://arxiv.org/abs/2106.09685), it is theoretically expected that full-parameter and LoRA fine-tuning will not yield the same performance, especially when low ranks are used in LoRA. The disparity in the number of trainable parameters is a key factor here.
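
As a rough back-of-the-envelope count (assuming the standard LLaMA-2-7B shape of 32 decoder layers, hidden size 4096, and 4096x4096 q/k/v/o projections), rank-4 LoRA on those four modules trains only a few million parameters out of roughly 7B:

# Approximate LoRA parameter count for the settings used above
# (assumed LLaMA-2-7B dimensions; adjust if the model differs).
layers, hidden, rank, modules_per_layer = 32, 4096, 4, 4
lora_params = layers * modules_per_layer * 2 * rank * hidden  # A (r x d) plus B (d x r)
print(lora_params)          # ~4.2 million trainable parameters
print(lora_params / 6.7e9)  # ~0.06% of the ~6.7B base parameters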

From the attached script, I see you are using the same run_lora_clm.py script with some minor modifications for both full-parameter and LoRA fine-tuning. If the performance is the same, the script might have an issue. It would help if you used run_clm.py for full-parameter fine-tuning and run_lora_clm.py for LoRA fine-tuning.

For me or anyone else to be able to help, I need more details, especially log files, number of parameters, etc.

@intelyoungway
Author

Thanks for the explanation. I think this fulfills the need. The issue can be closed now.
