About RepLLaMA #103

Open
sunxiaojie99 opened this issue Jan 11, 2024 · 19 comments

Comments

sunxiaojie99 commented Jan 11, 2024

Hi~ I am trying to reproduce the results of RepLLaMA. I have an A800 GPU, and training RepLLaMA from scratch with your code looks like it will take around 80 hours. Is this normal? If possible, I would also like to know the time cost when you trained RepLLaMA (LoRA) on the MS MARCO passage and document datasets. Thank you very much. @MXueguang

MXueguang commented Jan 11, 2024

Hi Xiaojie,
I trained RepLLaMA (passage) on 16 V100 32G GPUs, which took around 1 day, so I think 80 hours on a single A800 GPU is a reasonable time.
On MS MARCO document, with the max input length set to 2048, it takes about 3 days on 16 GPUs.

sunxiaojie99 commented Jan 12, 2024

Hi Xueguang, @MXueguang

Thank you very much for sharing your code. However, when I tested it on a small MS MARCO passage test corpus (the first 100 passages), I encountered an issue: after encoding, the embeddings of some passages turned out to be NaN. Have you experienced this problem?

The part of your code that I modified is the xformers attention call, attn_output = xops.memory_efficient_attention(...). I made these changes for two reasons: 1) xformers was not functioning correctly in my environment, and I would like to know why you replace the forward function and whether this step is necessary; 2) the attention_mask input of the custom forward function does not seem to be used in the subsequent code. Does this mean that the padding positions still receive attention?

Please forgive my limited experience in this area. Your insights would be greatly appreciated.

Here are the changes I made:

# Original code
        attn_weights = None
        attn_output = xops.memory_efficient_attention(
            query_states.transpose(1, 2), key_states.transpose(1, 2), value_states.transpose(1, 2),
            attn_bias=xops.LowerTriangularMask()
        ).reshape(bsz, q_len, self.hidden_size)

Modified to:

        # Scale queries for dot-product attention
        query_states = query_states / (self.head_dim ** 0.5)

        # Dot-product attention:
        # [bsz, num_heads, q_len, head_dim] x [bsz, num_heads, head_dim, kv_len]
        attn_scores = torch.matmul(query_states, key_states.transpose(-2, -1))

        # Apply the causal (lower-triangular) mask; only square score matrices
        # (i.e. a full-sequence forward pass) need it
        if attn_scores.size(-2) == attn_scores.size(-1):
            mask = torch.tril(torch.ones_like(attn_scores.float())).type_as(attn_scores)
            attn_scores = attn_scores.masked_fill(mask == 0, float('-inf'))

        # Apply the padding attention mask
        if attention_mask is not None:
            attn_scores = attn_scores + attention_mask

        attn_probs = torch.softmax(attn_scores, dim=-1)
        attn_output = torch.matmul(attn_probs, value_states)

        attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, self.hidden_size)
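
For completeness, I am also considering a variant of the same masking/softmax step that avoids the -inf fill and upcasts the softmax to float32, roughly following what I understand the stock Hugging Face LLaMA attention does. This is only a sketch on my side (same tensor names and shapes as above), not something I have verified:

        # Same scores -> mask -> softmax step as above, but (1) use the dtype's finite
        # minimum instead of -inf for masked positions, and (2) run the softmax in
        # float32 before casting back, which is usually friendlier to fp16.
        min_value = torch.finfo(attn_scores.dtype).min

        if attn_scores.size(-2) == attn_scores.size(-1):
            causal = torch.tril(torch.ones_like(attn_scores)).bool()
            attn_scores = attn_scores.masked_fill(~causal, min_value)

        if attention_mask is not None:
            attn_scores = attn_scores + attention_mask
            attn_scores = torch.clamp(attn_scores, min=min_value)

        attn_probs = torch.softmax(attn_scores.float(), dim=-1).to(value_states.dtype)
        attn_output = torch.matmul(attn_probs, value_states)
        attn_output = attn_output.transpose(1, 2).reshape(bsz, q_len, self.hidden_size)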

MXueguang commented Jan 12, 2024

My transformers version is 4.31.0; I think later versions have some issues here.
It is OK to remove the flash attention replacement and use the default LLaMA class.
I'll update the code to fit the latest transformers, and I am working on a refactor here: https://github.com/texttron/tevatron/tree/refactor

BTW, the RepLLaMA code in Tevatron is a re-implementation, and due to limited resources I didn't get a chance to run very detailed tests. Feel free to let me know about any issues you find.

sunxiaojie99 commented:

OK~ So I only need to comment out the call to replace_with_xformers_attention() in train.py? I will run it again to check whether everything is normal. Thank you!

MXueguang commented Jan 12, 2024

So I only need to comment out the call to replace_with_xformers_attention() in train.py?

Yes, in both train.py and encode.py.

sunxiaojie99 commented:

Hi Xueguang, I think I've found the cause of the NaN embeddings: the problem occurs when we use fp16 during encoding, but everything seems fine when we switch to fp32. By the way, could I ask you to provide the training data (or the CoCondenser hard negatives) for MS MARCO passage/doc used in your paper 'Fine-Tuning LLaMA for Multi-Stage Text Retrieval'?

MXueguang commented:

It's a bit weird that fp16 doesn't work... the model was fine-tuned with fp16. I'll take a look.

I created training data for RepLLaMA in Tevatron format; it can be downloaded here:
https://www.dropbox.com/scl/fi/pkm1mtgfobae9kuesp7dr/train-tevatron.jsonl?rlkey=2thutc4zkozr9jp4zbbrz5rvi&dl=0
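
To sanity-check the download, something like this should show the structure of one example (I am writing the field names from memory, so double-check against the actual file):

import json

# print the top-level fields and a rough size summary of the first training example
with open("train-tevatron.jsonl") as f:
    example = json.loads(f.readline())

print(sorted(example.keys()))
print(example.get("query"))
print(len(example.get("positive_passages", [])), "positives,",
      len(example.get("negative_passages", [])), "negatives")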

MXueguang commented:

Hi @sunxiaojie99, are you getting a similar training log to #104?

sunxiaojie99 commented:

Hi @sunxiaojie99, are you getting a similar training log to #104?

I just completed the test on the small corpus. I will run the entire process later and then confirm this.

sunxiaojie99 commented:

It's a bit weird that fp16 doesn't work... the model was fine-tuned with fp16. I'll take a look.

I created training data for RepLLaMA in Tevatron format; it can be downloaded here: https://www.dropbox.com/scl/fi/pkm1mtgfobae9kuesp7dr/train-tevatron.jsonl?rlkey=2thutc4zkozr9jp4zbbrz5rvi&dl=0

Thanks for sharing! Does this JSONL file contain both the MS MARCO passage and document datasets?
By the way, bf16 is actually used during fine-tuning, and when I encode with bf16 the NaN issue doesn't appear either. So I guess the fine-tuning process will run smoothly.
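
For what it's worth, I suspect this is just the smaller dynamic range of fp16; a quick sanity check on my side (not specific to the model):

import torch

# bf16 keeps float32's exponent range, while fp16 tops out around 6.5e4,
# so intermediate values that are fine in bf16/fp32 can overflow to inf
# in fp16 and then turn into NaN in later ops.
print(torch.finfo(torch.float16).max)   # 65504.0
print(torch.finfo(torch.bfloat16).max)  # ~3.39e38

x = torch.tensor([300.0])
print((x * x).to(torch.float16))        # tensor([inf], dtype=torch.float16)
print((x * x).to(torch.bfloat16))       # still representable in bf16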

MXueguang commented:

I trained RepLLaMA on V100 GPUs, which only support fp16. When I added the implementation to Tevatron I was working on A6000s, so bf16 also works. But the released model was trained with fp16. I'll take a look at the NaN issue next week.

The data in the above link is the training data for passage ranking.
The document data is bigger; I'll upload it later.

sunxiaojie99 commented:

I trained RepLLaMA on V100 GPUs, which only support fp16. When I added the implementation to Tevatron I was working on A6000s, so bf16 also works. But the released model was trained with fp16. I'll take a look at the NaN issue next week.

The data in the above link is the training data for passage ranking. The document data is bigger; I'll upload it later.

Okay, I sincerely appreciate your help! Please let me know when the document data is ready.

sunxiaojie99 commented:

Hi Xueguang,

Sorry to bother you again. I have completed the training process for RepLLaMA. However, it seems that encoding the MS MARCO passage corpus requires at least 300 hours, and I've noticed that Tevatron doesn't support multi-GPU encoding. Could you tell me how long the encoding process took for you? Also, is the document data ready? Haha.

MXueguang commented:

Hi Xiaojie,

300 hours on a single GPU is reasonable.
Tevatron doesn't support multi-GPU encoding, but an efficient approach is to encode the corpus shard by shard and run the shards in parallel.
An example is below.

mkdir beir_embedding_scifact
for s in 0 1 2 3;
do
CUDA_VISIBLE_DEVICES=$s python encode.py \
  --output_dir=temp \
  --model_name_or_path castorini/repllama-v1-7b-lora-passage \
  --tokenizer_name meta-llama/Llama-2-7b-hf \
  --fp16 \
  --per_device_eval_batch_size 16 \
  --p_max_len 512 \
  --dataset_name Tevatron/beir-corpus:scifact \
  --encoded_save_path beir_embedding_scifact/corpus_scifact.${s}.pkl \
  --encode_num_shard 4 \
  --encode_shard_index ${s} &
done
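
After the shards finish, you can merge and search them yourself. A rough sketch below; I am assuming each .pkl holds an (embeddings, ids) tuple the way the encoder saves them, and the query embedding path is just a placeholder for a separate query encoding run:

import glob
import pickle

import faiss
import numpy as np

def load_reps(pattern):
    """Load encoded representations from one or more encoded .pkl shards."""
    embeddings, ids = [], []
    for path in sorted(glob.glob(pattern)):
        with open(path, "rb") as f:
            shard_embeddings, shard_ids = pickle.load(f)
        embeddings.append(np.asarray(shard_embeddings, dtype=np.float32))
        ids.extend(shard_ids)
    return np.concatenate(embeddings, axis=0), ids

corpus_embeddings, corpus_ids = load_reps("beir_embedding_scifact/corpus_scifact.*.pkl")
query_embeddings, query_ids = load_reps("beir_embedding_scifact/queries_scifact.pkl")  # placeholder path

# flat inner-product index over the merged corpus shards
index = faiss.IndexFlatIP(corpus_embeddings.shape[1])
index.add(corpus_embeddings)

scores, indices = index.search(query_embeddings, 100)
for qid, hit_rows, hit_scores in zip(query_ids, indices, scores):
    top3 = [(corpus_ids[i], float(s)) for i, s in zip(hit_rows[:3], hit_scores[:3])]
    print(qid, top3)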

Oops... thanks for the reminder. Uploading the document data now.

MXueguang commented:

Hi Xiaojie, the processed training data for document ranking is big and hard to upload.
Below is a slim version, with a processed corpus and training data, but it needs a conversion step into the Tevatron format:
https://www.dropbox.com/scl/fi/rbxa9u0dusa4g3fh8sn9j/repllama-doc-slim-corpus.jsonl?rlkey=8ddybs8xt8lq723hks0y2uhku&dl=0
https://www.dropbox.com/scl/fi/sz3oqve6tln2hird03cxv/repllama-doc-slim-train.jsonl?rlkey=t1kjx1wdxky4zjo3zglo6yxzq&dl=0
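
The conversion is mostly a matter of joining the training file back with the corpus to restore document contents. A rough sketch of what I mean (the field names here are placeholders, so map them to whatever the two files actually use):

import json

# load the slim corpus into memory: docid -> {"docid", "title", "text", ...}
corpus = {}
with open("repllama-doc-slim-corpus.jsonl") as f:
    for line in f:
        doc = json.loads(line)
        corpus[doc["docid"]] = doc

# attach document contents to each training example and write Tevatron-style jsonl
with open("repllama-doc-slim-train.jsonl") as fin, \
     open("msmarco-doc-train-tevatron.jsonl", "w") as fout:
    for line in fin:
        example = json.loads(line)
        converted = {
            "query_id": example["query_id"],
            "query": example["query"],
            "positive_passages": [corpus[d] for d in example["positives"] if d in corpus],
            "negative_passages": [corpus[d] for d in example["negatives"] if d in corpus],
        }
        fout.write(json.dumps(converted) + "\n")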

sunxiaojie99 commented:

Hi Xiaojie, the processed training data for document ranking is big and hard to upload. Below is a slim version, with a processed corpus and training data, but it needs a conversion step into the Tevatron format: https://www.dropbox.com/scl/fi/rbxa9u0dusa4g3fh8sn9j/repllama-doc-slim-corpus.jsonl?rlkey=8ddybs8xt8lq723hks0y2uhku&dl=0 https://www.dropbox.com/scl/fi/sz3oqve6tln2hird03cxv/repllama-doc-slim-train.jsonl?rlkey=t1kjx1wdxky4zjo3zglo6yxzq&dl=0

Ok, thanks! Actually, I think I only need the CoCondenser-MaxP hard negatives for the document ranking data to reliably reproduce the results of the paper. By the way, is the slim version obtained by sampling a smaller proportion?

MXueguang commented:

The hard negatives should be the top-100 BM25 and top-100 CoCondenser results, but document contents are not saved in the training data, to save space.

sunxiaojie99 commented:

The hard negatives should be the top-100 BM25 and top-100 CoCondenser results, but document contents are not saved in the training data, to save space.

Okay~ Would it be convenient to share the other parameters, such as the size of p?

MXueguang commented:

Hi @sunxiaojie99, sorry I missed your latest comment.
What do you mean by the size of p? The truncation size? For MS MARCO document, we split each document into segments of 10 sentences, with a sliding window of 5 sentences.
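
Roughly, the segmentation looks like this (a simplified sketch of the idea rather than the exact preprocessing script; the sentence splitter here is just an example):

import nltk  # needs the punkt tokenizer: nltk.download("punkt")

def segment_document(text, window=10, stride=5):
    """Split a document into overlapping passages of `window` sentences,
    advancing `stride` sentences between passages."""
    sentences = nltk.sent_tokenize(text)
    passages = []
    for start in range(0, max(len(sentences) - window + stride, 1), stride):
        chunk = sentences[start:start + window]
        if chunk:
            passages.append(" ".join(chunk))
    return passages

# e.g. a 23-sentence document yields passages covering sentences
# 0-9, 5-14, 10-19, and 15-22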
