
multi gpu use? #59

Open
ganeshkrishnan1 opened this issue Mar 9, 2024 · 18 comments

@ganeshkrishnan1
Contributor

I am running out of memory on a Tesla T4. I have 4 of them, though, and I usually use Accelerate for a multi-GPU setup. How can I use them for AnglE semantic similarity?

@SeanLee97
Owner

Do you use it for training or inference?

@ganeshkrishnan1
Contributor Author

I used it for training. It looks like the script does use multiple GPUs, but it runs out of memory due to the high batch size. I will close this ticket.

@ganeshkrishnan1
Contributor Author

I ran this with a lower batch size, and I can see that the trainer never uses more than 1 GPU.

@ganeshkrishnan1
Contributor Author

I used the example provided and also tried Accelerate, but both of them fail to use more than 1 GPU. Any suggestions?

@SeanLee97
Owner

Hi @ganeshkrishnan1, could you provide the training script?

@SeanLee97
Owner

SeanLee97 commented Mar 19, 2024

Here is one example that can be run successfully on multiple GPUs:

CUDA_VISIBLE_DEVICES=0,1 WANDB_MODE=disabled torchrun --nproc_per_node=2 --master_port=2345 train_cli.py \
--model_name_or_path mixedbread-ai/mxbai-embed-large-v1 \
--train_name_or_path ./snli_5k.jsonl --save_dir mxbai-snli-ckpts \
--w1 0. --w2 20.0 --w3 1.0 --angle_tau 20.0 --learning_rate 3e-6 --maxlen 64 \
--pooling_strategy cls \
--epochs 1 \
--batch_size 32 \
--logging_steps 100 \
--warmup_steps 200 \
--save_steps 1000 --seed 42 --gradient_accumulation_steps 2 --fp16 1 --torch_dtype 'float32'

train_cli.py is from: https://github.com/SeanLee97/AnglE/blob/main/angle_emb/train_cli.py
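
For the 4x T4 setup mentioned above, the same command should scale by listing all four devices and raising --nproc_per_node to match (a sketch only; not verified on that exact hardware):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled torchrun --nproc_per_node=4 --master_port=2345 train_cli.py ... (same arguments as above)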

data format:

$ head -3 snli_5k.jsonl

{"text": "A person on a horse jumps over a broken down airplane.", "positive": "A person is outdoors, on a horse.", "negative": "A person is at a diner, ordering an omelette."}
{"text": "Children smiling and waving at camera", "positive": "There are children present", "negative": "The kids are frowning"}
{"text": "A boy is jumping on skateboard in the middle of a red bridge.", "positive": "The boy does a skateboarding trick.", "negative": "The boy skates down the sidewalk."}

@ganeshkrishnan1
Contributor Author

This is my Python code.
I experimented with Accelerate, then torch.distributed, and also added .to(device).
I will try your method and see if it works out with 4 GPUs.

from torch import optim
import torch
from sentence_transformers import InputExample, losses, SentenceTransformer, models
from datasets import load_dataset, Dataset, DatasetDict
from angle_emb import AnglE, AngleDataTokenizer
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig


fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
device = accelerator.device

train_json_file = './brand_search_term_con_new.json_9.json'  # JSON file for training data
full_dataset = load_dataset('json', data_files=train_json_file, split='train')
desired_test_size = 5000
# Calculate the training set size
train_size = len(full_dataset) - desired_test_size
# Split the dataset into training and evaluation sets
split_datasets = full_dataset.train_test_split(test_size=desired_test_size, train_size=train_size)
dataset_dict = DatasetDict({
    'train': split_datasets['train'],
    'test': split_datasets['test']
})

# 1. load pretrained model
angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', max_length=512, pooling_strategy='cls', device_map='auto').to(device)
# 2. transform data
train_ds = dataset_dict['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=16)
test_ds = dataset_dict['test'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=16)
# angle, train_ds, test_ds = accelerator.prepare(angle, train_ds, test_ds)
angle.to(device)
# 3. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=test_ds,
    output_dir='trainedmodel/aihello-model',
    batch_size=8,
    epochs=2,
    learning_rate=2e-5,
    save_steps=5000,
    eval_steps=5000,
    warmup_steps=100,
    gradient_accumulation_steps=4,
    loss_kwargs={
        'w1': 1.0,
        'w2': 1.0,
        'w3': 1.0,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 1.0
    },
    fp16=True,
    logging_steps=100
)
corrcoef, accuracy = angle.evaluate(test_ds, device=angle.device)
print('corrcoef:', corrcoef)

@ganeshkrishnan1
Contributor Author

The shell script worked, and I got the checkpoint with multiple GPUs as well.

The Python code didn't use multiple GPUs, though.

@SeanLee97
Owner

SeanLee97 commented Mar 20, 2024

I haven't tried multi-GPU in Python code; I just rely on the multi-GPU support provided by the Transformers Trainer.

BTW, here are some tips to improve the model:

  1. if your dataset is DataFormats.A: {"text1": "", "text2": "", "label": float or int}, it is better to slightly increase the weight w1.

  2. if your dataset is DataFormats.B: {"text": "", "positive": "", "negative": ""}, the suggested parameters are w1=0, w2=20, w3=1.0, angle_tau=20.0 (see the sketch below).
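
A minimal sketch of how those DataFormats.B parameters would map onto the fit() call from the Python script above (the other arguments are just the ones already used there; the loss_kwargs keys mirror the --w1/--w2/--w3/--angle_tau flags in the earlier torchrun example):

angle.fit(
    train_ds=train_ds,
    valid_ds=test_ds,
    output_dir='trainedmodel/aihello-model',
    batch_size=8,
    epochs=2,
    learning_rate=2e-5,
    # suggested DataFormats.B settings from tip 2 above
    loss_kwargs={
        'w1': 0.0,
        'w2': 20.0,
        'w3': 1.0,
        'angle_tau': 20.0,
    },
    fp16=True,
)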

@ganeshkrishnan1
Contributor Author

Thanks for the tip about the weights. I am using DataFormats.C, e.g.:

{"text": "Cool Spot 11x11 Pop-Up Instant Gazebo Tent with Mosquito Netting Outdoor Canopy Shelter with 121 Square Feet of Shade by COOS BAY (Beige)", "positive": "outdoor tent canopy"}

Should I use the same parameters as for B?

@SeanLee97
Owner

DataFormats.C is okay. However, DataFormats.B is recommended since it can improve performance more significantly.

BTW, here are the tips; we will include them in the next version.

(screenshot of the tips table)

@ganeshkrishnan1
Contributor Author

Negatives are very hard to generate from unlabelled text for DataFormats.B. We have "product title" -> "search term" as a positive pair, but there is no correct way to generate negatives.

As you mentioned, the performance of DataFormats.C when training on a sample was not as good as I wanted it to be. I am running the trainer on our whole dataset of 200M records and will report back on performance (~15 days).

@SeanLee97
Owner


For such large datasets, it is better to use a small learning_rate such as 1e-6 and to specify --fixed_teacher_name_or_path to alleviate catastrophic forgetting.
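
For example, building on the earlier torchrun command (a sketch only; the dataset path, save_dir, and the choice of the base model as its own fixed teacher are placeholders):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled torchrun --nproc_per_node=4 --master_port=2345 train_cli.py \
--model_name_or_path mixedbread-ai/mxbai-embed-large-v1 \
--fixed_teacher_name_or_path mixedbread-ai/mxbai-embed-large-v1 \
--train_name_or_path ./train.jsonl --save_dir large-run-ckpts \
--w1 0. --w2 20.0 --w3 1.0 --angle_tau 20.0 --learning_rate 1e-6 --maxlen 64 \
--pooling_strategy cls --epochs 1 --batch_size 32 \
--logging_steps 100 --warmup_steps 200 --save_steps 1000 --seed 42 \
--gradient_accumulation_steps 2 --fp16 1 --torch_dtype 'float32'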

@ganeshkrishnan1
Contributor Author

I don't mind catastrophic forgetting. I could even train from scratch with the amount of data we have. The learning rate is currently set to 3e-6. It took 8 hours for the dataset to load, so I think I will let this training run and then re-run with the smaller learning rate you mentioned.

Your models don't seem to be compatible with KeyBERT (https://github.com/MaartenGr/keyBERT), so that's one more challenge for me.

@SeanLee97
Owner

I found that KeyBERT works with sentence-transformers. Maybe you can add a feature to make it support angle_emb.
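
One possible route is a thin KeyBERT backend around angle_emb (a rough, untested sketch; it assumes KeyBERT's keybert.backend.BaseEmbedder interface and AnglE's encode(..., to_numpy=True) method):

from keybert import KeyBERT
from keybert.backend import BaseEmbedder
from angle_emb import AnglE

class AngleEmbedder(BaseEmbedder):
    """Hypothetical adapter that lets KeyBERT call an AnglE model."""
    def __init__(self, model_name='SeanLee97/angle-bert-base-uncased-nli-en-v1'):
        super().__init__()
        self.angle = AnglE.from_pretrained(model_name, pooling_strategy='cls')

    def embed(self, documents, verbose=False):
        # KeyBERT expects an (n_docs, dim) numpy array back
        return self.angle.encode(list(documents), to_numpy=True)

kw_model = KeyBERT(model=AngleEmbedder())
print(kw_model.extract_keywords("Cool Spot 11x11 Pop-Up Instant Gazebo Tent with Mosquito Netting"))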

@ganeshkrishnan1
Contributor Author

I will ask someone from our team to look into it. Right now it's easier for me to use this for generating vectors and to train a different sentence-transformers model for generating keywords from documents: two different use cases.

@ganeshkrishnan1
Contributor Author

BTW, can my team member reach out to you by email to get some support for adding angle_emb support to sentence-transformers?

@SeanLee97
Owner


Sure! thanks!

BTW, I am working on exporting to the sentence-transformers (ST) format so that AnglE-trained models can be used in ST.
