
multi gpu use? #59

Open
ganeshkrishnan1 opened this issue Mar 9, 2024 · 18 comments

@ganeshkrishnan1
Contributor

I am running out of memory on a Tesla T4. I have 4 of them, though, and I usually use Accelerate for a multi-GPU setup. How can I use them for AnglE semantic similarity?

@SeanLee97
Owner

Do you use it for training or inference?

@ganeshkrishnan1
Contributor Author

I used it for training. It looks like the script does use multiple GPUs, but it runs out of memory due to the high batch size. I will close this ticket.

@ganeshkrishnan1
Contributor Author

I ran this with a lower batch size, and I can see that the trainer never uses more than 1 GPU.

@ganeshkrishnan1
Contributor Author

I used the example provided and also tried Accelerate, but both of them fail to use more than 1 GPU. Any suggestions?

@SeanLee97
Owner

Hi @ganeshkrishnan1, could you provide the training script?

@SeanLee97
Owner

SeanLee97 commented Mar 19, 2024

Here is one example that can be run successfully on multiple GPUs:

CUDA_VISIBLE_DEVICES=0,1 WANDB_MODE=disabled torchrun --nproc_per_node=2 --master_port=2345 train_cli.py \
--model_name_or_path mixedbread-ai/mxbai-embed-large-v1 \
--train_name_or_path ./snli_5k.jsonl --save_dir mxbai-snli-ckpts \
--w1 0. --w2 20.0 --w3 1.0 --angle_tau 20.0 --learning_rate 3e-6 --maxlen 64 \
--pooling_strategy cls \
--epochs 1 \
--batch_size 32 \
--logging_steps 100 \
--warmup_steps 200 \
--save_steps 1000 --seed 42 --gradient_accumulation_steps 2 --fp16 1 --torch_dtype 'float32'

train_cli.py is from: https://github.com/SeanLee97/AnglE/blob/main/angle_emb/train_cli.py
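
For the 4x T4 setup mentioned above, the same command should scale by listing all four devices and raising --nproc_per_node to match (a sketch only; not verified on that exact hardware):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled torchrun --nproc_per_node=4 --master_port=2345 train_cli.py ... (same arguments as above)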

data format:

$ head -3 snli_5k.jsonl

{"text": "A person on a horse jumps over a broken down airplane.", "positive": "A person is outdoors, on a horse.", "negative": "A person is at a diner, ordering an omelette."}
{"text": "Children smiling and waving at camera", "positive": "There are children present", "negative": "The kids are frowning"}
{"text": "A boy is jumping on skateboard in the middle of a red bridge.", "positive": "The boy does a skateboarding trick.", "negative": "The boy skates down the sidewalk."}

@ganeshkrishnan1
Contributor Author

This is my Python code.
I experimented with Accelerate, then torch.distributed, and also added .to(device).
I will try your method and see if it works out with 4 GPUs.

from torch import optim
import torch
from sentence_transformers import InputExample, losses, SentenceTransformer, models
from datasets import load_dataset, Dataset, DatasetDict
from angle_emb import AnglE, AngleDataTokenizer
from accelerate import FullyShardedDataParallelPlugin, Accelerator
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, StateDictType
from torch.distributed.fsdp.fully_sharded_data_parallel import FullOptimStateDictConfig, FullStateDictConfig


fsdp_plugin = FullyShardedDataParallelPlugin(
    state_dict_config=FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
    optim_state_dict_config=FullOptimStateDictConfig(offload_to_cpu=True, rank0_only=False),
)

accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
device = accelerator.device

train_json_file = './brand_search_term_con_new.json_9.json'  # JSON file for training data
full_dataset = load_dataset('json', data_files=train_json_file, split='train')
desired_test_size = 5000
# Calculate the training set size
train_size = len(full_dataset) - desired_test_size
# Split the dataset into training and evaluation sets
split_datasets = full_dataset.train_test_split(test_size=desired_test_size, train_size=train_size)
dataset_dict = DatasetDict({
    'train': split_datasets['train'],
    'test': split_datasets['test']
})

# 1. load pretrained model
angle = AnglE.from_pretrained('SeanLee97/angle-bert-base-uncased-nli-en-v1', max_length=512, pooling_strategy='cls', device_map='auto').to(device)
# 2. transform data
train_ds = dataset_dict['train'].shuffle().map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=16)
test_ds = dataset_dict['test'].map(AngleDataTokenizer(angle.tokenizer, angle.max_length), num_proc=16)
# angle, train_ds, test_ds = accelerator.prepare(angle, train_ds, test_ds)
angle.to(device)
# 3. fit
angle.fit(
    train_ds=train_ds,
    valid_ds=test_ds,
    output_dir='trainedmodel/aihello-model',
    batch_size=8,
    epochs=2,
    learning_rate=2e-5,
    save_steps=5000,
    eval_steps=5000,
    warmup_steps=100,
    gradient_accumulation_steps=4,
    loss_kwargs={
        'w1': 1.0,
        'w2': 1.0,
        'w3': 1.0,
        'cosine_tau': 20,
        'ibn_tau': 20,
        'angle_tau': 1.0
    },
    fp16=True,
    logging_steps=100
)
corrcoef, accuracy = angle.evaluate(test_ds, device=angle.device)
print('corrcoef:', corrcoef)

@ganeshkrishnan1
Contributor Author

The shell script worked, and I got the checkpoint with multiple GPUs as well.

The Python code didn't use multiple GPUs, though.

@SeanLee97
Owner

SeanLee97 commented Mar 20, 2024

I haven't tried multi-GPU in Python code; I just rely on the multi-GPU support provided by the Transformers Trainer.

BTW, here are some tips to improve the model:

  1. if your dataset is DataFormats.A: {"text1": "", "text2": "", "label": float or int}, it is better to slightly increase the weight w1.

  2. if your dataset is DataFormats.B: {"text": "", "positive": "", "negative": ""}, the suggested parameters are w1=0, w2=20, w3=1.0, angle_tau=20.0 (see the sketch below).
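
A minimal sketch of how those DataFormats.B parameters would map onto the fit() call from the Python script above (the other arguments are just the ones already used there; the loss_kwargs keys mirror the --w1/--w2/--w3/--angle_tau flags in the earlier torchrun example):

angle.fit(
    train_ds=train_ds,
    valid_ds=test_ds,
    output_dir='trainedmodel/aihello-model',
    batch_size=8,
    epochs=2,
    learning_rate=2e-5,
    # suggested DataFormats.B settings from tip 2 above
    loss_kwargs={
        'w1': 0.0,
        'w2': 20.0,
        'w3': 1.0,
        'angle_tau': 20.0,
    },
    fp16=True,
)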

@ganeshkrishnan1
Contributor Author

Thanks for the tip about the weights. I am using DataFormats.C, e.g.:

{"text": "Cool Spot 11x11 Pop-Up Instant Gazebo Tent with Mosquito Netting Outdoor Canopy Shelter with 121 Square Feet of Shade by COOS BAY (Beige)", "positive": "outdoor tent canopy"}

Should I use the same parameters as for B?

@SeanLee97
Owner

DataFormats.C is okay. However, DataFormats.B is recommended since it can improve performance more significantly.

BTW, here are the tips; we will include them in the next version.

(screenshot of the tips table)

@ganeshkrishnan1
Contributor Author

Negatives are very hard to generate from unlabelled text for DataFormats.B. We have "product title" -> "search term" as a positive pair, but there is no correct way to generate negatives.

As you mentioned, the performance of DataFormats.C when training on a sample was not as good as I wanted it to be. I am running the trainer on our whole dataset of 200M records and will report back on performance (~15 days).

@SeanLee97
Owner


For such large datasets, it is better to use a small learning_rate such as 1e-6 and to specify --fixed_teacher_name_or_path to alleviate catastrophic forgetting.
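
For example, building on the earlier torchrun command (a sketch only; the dataset path, save_dir, and the choice of the base model as its own fixed teacher are placeholders):

CUDA_VISIBLE_DEVICES=0,1,2,3 WANDB_MODE=disabled torchrun --nproc_per_node=4 --master_port=2345 train_cli.py \
--model_name_or_path mixedbread-ai/mxbai-embed-large-v1 \
--fixed_teacher_name_or_path mixedbread-ai/mxbai-embed-large-v1 \
--train_name_or_path ./train.jsonl --save_dir large-run-ckpts \
--w1 0. --w2 20.0 --w3 1.0 --angle_tau 20.0 --learning_rate 1e-6 --maxlen 64 \
--pooling_strategy cls --epochs 1 --batch_size 32 \
--logging_steps 100 --warmup_steps 200 --save_steps 1000 --seed 42 \
--gradient_accumulation_steps 2 --fp16 1 --torch_dtype 'float32'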

@ganeshkrishnan1
Contributor Author

I don't mind catastrophic forgetting. I could even train from scratch with the amount of data we have. The learning rate is currently set to 3e-6. It took 8 hours for the dataset to load, so I think I will let this training run and then re-run with the smaller learning rate you mentioned.

Your models don't seem to be compatible with KeyBERT (https://github.com/MaartenGr/keyBERT), so that's one more challenge for me.

@SeanLee97
Owner

I found that KeyBERT works with sentence-transformers. Maybe you can add a feature to make it support angle_emb.
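
One possible route is a thin KeyBERT backend around angle_emb (a rough, untested sketch; it assumes KeyBERT's keybert.backend.BaseEmbedder interface and AnglE's encode(..., to_numpy=True) method):

from keybert import KeyBERT
from keybert.backend import BaseEmbedder
from angle_emb import AnglE

class AngleEmbedder(BaseEmbedder):
    """Hypothetical adapter that lets KeyBERT call an AnglE model."""
    def __init__(self, model_name='SeanLee97/angle-bert-base-uncased-nli-en-v1'):
        super().__init__()
        self.angle = AnglE.from_pretrained(model_name, pooling_strategy='cls')

    def embed(self, documents, verbose=False):
        # KeyBERT expects an (n_docs, dim) numpy array back
        return self.angle.encode(list(documents), to_numpy=True)

kw_model = KeyBERT(model=AngleEmbedder())
print(kw_model.extract_keywords("Cool Spot 11x11 Pop-Up Instant Gazebo Tent with Mosquito Netting"))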

@ganeshkrishnan1
Contributor Author

I will ask someone from our team to look into it. Right now it's easier for me to use this for generating vectors and to train a different sentence-transformers model for generating keywords from documents: two different use cases.

@ganeshkrishnan1
Contributor Author

BTW, can my team member reach out to you by email to get some support for adding angle_emb support to sentence-transformers?

@SeanLee97
Owner


Sure! thanks!

BTW, I am working on exporting to the sentence-transformers (ST) format so that AnglE-trained models can be used in ST.
