
Am I using the code incorrectly? help me #6

Pang-dachu opened this issue Jan 28, 2024 · 14 comments

@Pang-dachu

Pang-dachu commented Jan 28, 2024

I used this code and trained with Korean ko-snil data.

adapter_config.json, adapter_model.safetensors, special_tokens_map.json, tokenizer_config.json, tokenizer.json, tokenizer.model

5 files were saved.

I configured accelerate as shown below.
I applied lora.json as it was published.

CUDA_VISIBLE_DEVICES="1" accelerate launch \
    --config_file ds_zero2_0125.yaml \
    peft_lora_embedding_semantic_search.py \
    --dataset_name similarity_Kodataset \
    --max_length 512 \
    --model_name_or_path "/home/embedding_kim/[Model]/e5-mistral-7b-instruct" \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --learning_rate 0.00005 \
    --weight_decay 0.01 \
    --num_train_epochs 4 \
    --max_train_steps 2048 \
    --gradient_accumulation_steps 512 \
    --lr_scheduler_type cosine \
    --num_warmup_steps 128 \
    --output_dir trained_ko_model_0125 \
    --with_tracking \
    --report_to "wandb" \
    --use_peft

I also saw the training loss change, but when I evaluated the model with my own code against the STS benchmark, the scores for the model published by [intfloat] and the model I fine-tuned were identical to every decimal place.

I would like to ask whether your code cannot fine-tune the model published by [intfloat], or whether I am missing something and need to apply an additional step after training for the results to be reflected.

(e.g. is there an additional step to merge the generated adapter into the model?)

@kamalkraj
Owner

The code can be used to fine-tune the published intfloat model or the original Mistral model.

To merge the LoRA adapter, see:
https://discuss.huggingface.co/t/help-with-merging-lora-weights-back-into-base-model/40968/4
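
For later readers, here is a minimal sketch of that merge step. It assumes the adapter was trained with this repo's script, so the base model is loaded with the MistralForSequenceEmbedding class defined in peft_lora_embedding_semantic_search.py; "path-to-lora" and the output directory are placeholders.

# Sketch only: merge a trained LoRA adapter back into the base weights.
import torch
from peft import PeftModel
from peft_lora_embedding_semantic_search import MistralForSequenceEmbedding

base = MistralForSequenceEmbedding.from_pretrained(
    "intfloat/e5-mistral-7b-instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "path-to-lora")

# Fold the LoRA deltas into the base weights and drop the adapter wrappers.
merged = model.merge_and_unload()
merged.save_pretrained("e5-mistral-merged")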

@Pang-dachu
Author

Thank you very much for providing the fine-tuning code.

However, I have been trying to fine-tune using your code for a few days now, and the generated LoRA adapter is not having any effect on the model.

I have tried fine-tuning both 1) intfloat's model and 2) the base model of intfloat's model, but applying the LoRA adapter made no difference.

===
The data used for training is a Korean dataset.
The benchmark I use for evaluation is MTEB STS17 (ko-ko).

Am I doing something wrong in my approach or thinking?

@bjelkenhed

Thank you for providing this code; however, I get the same results described by Pang-dachu above. The trained LoRA adapters do not seem to have any effect on the output of the model. I have tried merge_and_unload before saving the model and loading it again, but the result is always exactly the same as with the e5-mistral-7b-instruct base model.

@Rinatum

Rinatum commented Feb 8, 2024

@Pang-dachu @bjelkenhed I got the same problem for my custom dataset

The reason was that the accelerate config provided by @kamalkraj is not suitable for my machine. I have only just started working with accelerate, so I don't know which fields in the config are wrong, but there is a universal solution. Just run:

accelerate launch --mixed_precision="fp16" peft_lora_embedding_semantic_search.py ...

It will use the default accelerate parameters for your machine.

Also, here is a self-check:

# Print the sum of each LoRA parameter; all-zero lora_B sums mean the adapter is not learning.
lora_params = {n: p for n, p in model.named_parameters() if "lora" in n}
for n, p in lora_params.items():
    accelerator.print(n, p.sum())

Paste this into your training loop to check that your lora_B parameters are not zero.

I hope it helps you.

@Pang-dachu
Author

@Rinatum

Quite a few LoRA_B layers have zero values.
How would you recommend approaching and solving this problem?

  • p.s.:
    I noticed it today as well, and it seems that during training the lr goes to zero when the LoRA_B layers go to zero.

@Rinatum

Rinatum commented Feb 8, 2024

@Pang-dachu

  • try to train with the default accelerate parameters (one machine, one GPU, fp16)
  • be careful when you load the trained LoRA

THIS ONE DOESN'T WORK:

model = AutoModel.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")

THIS ONE WORKS:

# MistralForSequenceEmbedding is the model class defined in peft_lora_embedding_semantic_search.py
model = MistralForSequenceEmbedding.from_pretrained('intfloat/e5-mistral-7b-instruct')
model = PeftModel.from_pretrained(model, "path-to-lora")
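
A quick way to confirm which case you are in, sketched here with "path-to-lora" as a placeholder for your adapter directory: sum the lora_B tensors after loading. If the adapter keys did not match the model (the AutoModel case above), lora_B keeps its all-zero initialization.

# Sketch: check that the loaded adapter actually carries trained weights.
import torch
from peft import PeftModel
from peft_lora_embedding_semantic_search import MistralForSequenceEmbedding

base = MistralForSequenceEmbedding.from_pretrained(
    "intfloat/e5-mistral-7b-instruct", torch_dtype=torch.bfloat16
)
model = PeftModel.from_pretrained(base, "path-to-lora")

for name, param in model.named_parameters():
    if "lora_B" in name:
        # Non-zero sums: the trained adapter weights were matched and loaded.
        # All-zero sums: the keys did not match, so lora_B kept its zero init.
        print(name, param.float().abs().sum().item())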

@Pang-dachu
Author

Pang-dachu commented Feb 8, 2024

@Rinatum

  • I currently have accelerate configured like this (DeepSpeed not used):
NCCL_P2P_LEVEL=NVL CUDA_VISIBLE_DEVICES="0" accelerate launch \
    --mixed_precision="bf16" \
    peft_lora_embedding_semantic_search.py \
    --dataset_name custom_data_path \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --model_name_or_path local_model_path \
    --output_dir output_dir \
    --use_peft
  • I loaded the model with the MistralForSequenceEmbedding class, which is defined in the peft_lora_embedding_semantic_search code.
    (The only difference is that it is loaded in bf16.)

  • I tried training as you suggested, but I still got 0 for LoRA_B (counts below; see the sketch after them).

Total number of LoRA layers (A+B): 448
LoRA_A count: 224
LoRA_B count: 224
LoRA_A zero weight count: 0
LoRA_B zero weight count: 224
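
For reference, a small sketch of how counts like these can be produced, assuming model is the PEFT-wrapped model inside the training script:

# Sketch: count LoRA_A / LoRA_B tensors and how many of them are entirely zero.
lora_a = {n: p for n, p in model.named_parameters() if "lora_A" in n}
lora_b = {n: p for n, p in model.named_parameters() if "lora_B" in n}

print("Total number of LoRA layers (A+B):", len(lora_a) + len(lora_b))
print("LoRA_A count:", len(lora_a))
print("LoRA_B count:", len(lora_b))
print("LoRA_A zero weight count:", sum(int(p.abs().sum() == 0) for p in lora_a.values()))
print("LoRA_B zero weight count:", sum(int(p.abs().sum() == 0) for p in lora_b.values()))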

@Pang-dachu
Author

@Rinatum

I think I saw a glimpse of hope, but I need to verify it.

I'll try again and report back with the results.

@Rinatum

Rinatum commented Feb 8, 2024

@Pang-dachu could you check lora_B weights during training?

@Pang-dachu
Author

@Rinatum

I've been playing around with this for about two weeks now, so I don't remember exactly what the conditions were.

I probably just used the code provided in this GitHub repo.
I think I logged the tensors of the LoRA layers during training.

There were cases where all the tensor values in the LoRA_B layers were 0, and I believe the lr value suddenly dropped to 0 at the same time.
(Rather than looking at all the tensors, it would probably be better to apply sum() as you suggested.)

===
For now, as you suggested:

  • Using a single GPU
  • Not using DeepSpeed
  • Merging the LoRA adapter using the Mistral embedding class

I've noticed a performance change in the benchmarks on my custom data when using the above conditions.

However, it takes a long time using a single GPU, so my goal is to apply multi-GPU or DeepSpeed, since the training and data size are small. (It's not easy, but...)

p.s.: I've been struggling with this for about 2 weeks now, and I'm so grateful for this glimmer of hope.

@bjelkenhed

Thank you @Rinatum for all your suggestions. I am now trying something similar to @Pang-dachu, without DeepSpeed and using MistralForSequenceEmbedding when loading the model, and it looks promising so far. The results differ from the base model e5-mistral-7b-instruct, at least for the first time. I am using bitsandbytes QLoRA now instead, and that seems to work fine.

Will have confirmed results tomorrow.
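
For anyone trying the same route, a rough sketch of what loading the base model in 4-bit for QLoRA-style training could look like. The quantization settings, LoRA hyperparameters, and target_modules below are illustrative assumptions, not the exact configuration used here.

# Sketch: load the base model in 4-bit with bitsandbytes before attaching LoRA.
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from peft_lora_embedding_semantic_search import MistralForSequenceEmbedding

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = MistralForSequenceEmbedding.from_pretrained(
    "intfloat/e5-mistral-7b-instruct", quantization_config=bnb_config
)
model = prepare_model_for_kbit_training(model)

# Illustrative LoRA settings; target_modules chosen for Mistral attention projections.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05, bias="none",
    target_modules=["q_proj", "v_proj"],
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()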

@Pang-dachu
Author

Current progress:

  • Using accelerate
  • Merging the LoRA adapter using the Mistral embedding class
  • Applying ZeRO-2 for multi-GPU training
  • Loading the model in bfloat16 from training through the LoRA merge

I load the trained model with the Mistral embedding class and call merge_and_unload, and I also use the Mistral embedding class for the merged model.
I checked that the merged model can additionally be loaded with AutoModel (see the sketch below).
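
A minimal sketch of that last check, assuming the merged weights were saved to a placeholder directory merged_model_dir; the last-token pooling here is only a rough stand-in for whatever the evaluation code actually uses.

# Sketch: confirm the merged model loads as a plain Transformers model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-mistral-7b-instruct")
model = AutoModel.from_pretrained("merged_model_dir", torch_dtype=torch.bfloat16)

inputs = tokenizer("한국어 문장 예시입니다.", return_tensors="pt")  # any Korean sentence works here
with torch.no_grad():
    outputs = model(**inputs)

# Last-token hidden state as a rough embedding sanity check.
embedding = outputs.last_hidden_state[:, -1]
print(embedding.shape)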

Applying ZeRO-3 failed: the LoRA_B layers become 0 when it is used.

@Rinatum
I've been struggling for almost 3 weeks and this has solved a huge problem for me, thank you so much.

@bjelkenhed
I'm going to test it out in my environment in a few different situations this week.
Can you share any successes or anything else unusual?

@Rinatum

Rinatum commented Feb 13, 2024

@Pang-dachu @bjelkenhed

So nice! I also recommend deleting the standard model-saving hooks and accelerator.save_state.

Use this instead:

# accelerator.save_state(output_dir)
# Save only the (LoRA) adapter weights via the unwrapped model instead of the full accelerator state.
unwrapped_model = accelerator.unwrap_model(model)
unwrapped_model.save_pretrained(
    output_dir,
    is_main_process=accelerator.is_main_process,
    save_function=accelerator.save,
)

This saves only the LoRA weights, and those weights will not be zero.
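
One way to double-check what actually landed on disk, as a sketch assuming the adapter was written to a placeholder output_dir as adapter_model.safetensors (the file name from the list at the top of this issue):

# Sketch: inspect the saved adapter file directly and confirm lora_B is non-zero.
from safetensors.torch import load_file

state = load_file("output_dir/adapter_model.safetensors")
for name, tensor in state.items():
    if "lora_B" in name:
        print(name, tensor.float().abs().sum().item())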

  • Also, I figured out that multi-GPU training with DeepSpeed depends on your GPU card.

I have an A100, so I can use bf16, but you can try different options for your case.

By the way, I can conclude that the main cause of the zero LoRA weights is using AutoModel.from_pretrained.

In this case, the only correct option is to use the original model class exactly (MistralForSequenceEmbedding).

@bjelkenhed

Hi, here are some updates from me.

Without DeepSpeed ZeRO-3 it works much better and no LoRA layers end up as zeros. That makes the training work as expected, and the results differ from the base model as expected. I have H100s with 80 GB and have used bitsandbytes with 4-bit so far, but I will try ZeRO-2 without bitsandbytes as well. If you would like to share your ZeRO-2 config @Pang-dachu, it would be appreciated.

So far I only have approx. 10,000 examples in my training set, and so far the evaluation results are not better than the base model e5-mistral-7b-instruct in a hit-rate evaluation with an evaluation set resembling the MS MARCO format. What batch size do you use, and how large are your datasets? I am currently using a smaller batch size, but I don't know what would be best considering the size of the dataset.
