eval_loss showing Nan but train_loss decreases and goes to NaN after couple of steps while fine tuning gemma model with additional vocab #1715

sidtandon2014 · 2024-05-07T17:28:49Z

System Info

I am trying to fine-tune the Gemma 7B model in 4-bit with an extended vocabulary, using the configuration below, but I am getting NaN for both the train and eval loss. The train loss first decreases for a couple of steps and then turns to NaN.

import os

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization with bfloat16 compute
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id + MODEL_NAME,
    quantization_config=bnb_config,
    token=os.environ["HF_TOKEN"],
    device_map={"": device_string},
    use_cache=False,
)

# LoRA on the attention projections plus the embedding and output layers
lora_config = LoraConfig(
    r=LORA_RANK,
    target_modules=["q_proj", "v_proj", "embed_tokens", "lm_head"],
    task_type="CAUSAL_LM",
    lora_alpha=LORA_ALPHA,
    lora_dropout=LORA_DROPOUT,
    bias="none",
)

model = get_peft_model(model, lora_config)

To update the vocab, I extended the SentencePiece model instead of using the add_tokens method (FYI: add_tokens degrades token quality):
huggingface/tokenizers#627 (comment)
https://github.com/google/sentencepiece/blob/master/python/add_new_vocab.ipynb

In addition, for training I set the embedding values of all new tokens to 0:

emb_dim = model.model.embed_tokens.weight.shape
with torch.no_grad():
    model.model.embed_tokens.weight[-NEW_TOKENS:] = torch.zeros((NEW_TOKENS, emb_dim[1]))
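The zero-initialization step above can be checked in isolation. The following is a minimal sketch on a toy `nn.Embedding`, not on the actual Gemma model: the table size, the embedding dimension, and the value of `NEW_TOKENS` are all assumptions chosen for illustration. It only confirms that the last rows end up as zeros and that the pre-existing rows are left untouched.

```python
import torch
from torch import nn

NEW_TOKENS = 3  # hypothetical number of added tokens

# Toy stand-in for model.model.embed_tokens (sizes are illustrative only)
embedding = nn.Embedding(num_embeddings=10, embedding_dim=4)
original_rows = embedding.weight[:-NEW_TOKENS].detach().clone()

# Zero the rows that belong to the new tokens without tracking gradients
with torch.no_grad():
    embedding.weight[-NEW_TOKENS:].zero_()

new_rows_are_zero = bool((embedding.weight[-NEW_TOKENS:] == 0).all())
old_rows_unchanged = torch.equal(
    embedding.weight[:-NEW_TOKENS].detach(), original_rows
)
```

Running the same two checks against the real embedding matrix right before training would show whether the rows being zeroed are in fact the ones for the new tokens.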

Additional properties:

args = TrainingArguments(
    per_device_train_batch_size=BATCH_SIZE,
    per_device_eval_batch_size=BATCH_SIZE,
    gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS,
    save_steps=200,
    save_total_limit=20,
    save_strategy="steps",
    evaluation_strategy="steps",
    eval_steps=200,
    logging_steps=200,

    warmup_steps=2,
    num_train_epochs=EPOCHS,
    # max_steps=2,
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    weight_decay=0.001,
    max_grad_norm=1.0,
    fp16=False,
    bf16=True,
    logging_strategy="steps",
    output_dir=output_dir,
    optim="paged_adamw_8bit",
    seed=42,

    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    # accelerator_config={"split_batches": True},
    report_to=None,
)

Who can help?

@BenjaminBossan

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder
My own task or dataset (give details below)

Reproduction

Task: Translate Sanskrit to English
Dataset: "rahular/itihasa"

Loss snapshot: {'eval_loss': nan, 'eval_runtime': 708.8687, 'eval_samples_per_second': 13.125, 'eval_steps_per_second': 1.641, 'epoch': 0.15}
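To pin down when the loss first becomes NaN, the logged entries can be scanned programmatically. The sketch below assumes the logs are available as a list of dicts shaped like the snapshot above (for instance `trainer.state.log_history`). The helper name `first_nan_entry` and the first sample entry are hypothetical; only the second entry mirrors the reported snapshot.

```python
import math


def first_nan_entry(log_history, key="eval_loss"):
    """Return (index, entry) for the first entry whose `key` is NaN, else None."""
    for index, entry in enumerate(log_history):
        value = entry.get(key)
        if isinstance(value, float) and math.isnan(value):
            return index, entry
    return None


# Hypothetical log history; the second entry mirrors the snapshot above.
log_history = [
    {"loss": 2.31, "epoch": 0.07},
    {"eval_loss": float("nan"), "eval_runtime": 708.8687, "epoch": 0.15},
]

result = first_nan_entry(log_history)
```

Calling the helper with `key="loss"` would do the same for the training loss, which helps tell whether train or eval loss breaks first.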

Expected behavior

Validation loss should not be NaN


BenjaminBossan · 2024-05-08T08:56:18Z

Can you try to run this additional snippet:

model = get_peft_model(...)
# convert all peft parameters to float32
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.float()
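For readers who want to see what this cast does, here is a toy sketch on a small `nn.Sequential` rather than the PEFT model. The two-layer module and the frozen first layer are assumptions standing in for frozen base weights and trainable adapter weights. It shows that only parameters with `requires_grad=True` are upcast to float32, while frozen ones keep their original dtype.

```python
import torch
from torch import nn

# Toy model in bfloat16; the first layer plays the role of frozen base weights
model = nn.Sequential(nn.Linear(4, 4), nn.Linear(4, 2)).to(torch.bfloat16)
for param in model[0].parameters():
    param.requires_grad = False

# Same loop as in the snippet above: upcast only the trainable parameters
for param in model.parameters():
    if param.requires_grad:
        param.data = param.data.float()

trainable_dtypes = {p.dtype for p in model.parameters() if p.requires_grad}
frozen_dtypes = {p.dtype for p in model.parameters() if not p.requires_grad}
```

On the real model, printing these two sets after `get_peft_model(...)` would confirm whether the trainable parameters were in reduced precision before the cast.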

sidtandon2014 changed the title ~~eval_loss showing Nan but train_loss is decreasing while fine tuning gemma model with additional vocab~~ eval_loss showing Nan but train_loss decreases and goes to NaN after couple of steps while fine tuning gemma model with additional vocab May 7, 2024
