
Mixture-of-Depths Finetune, IndexError: too many indices for tensor of dimension 0 #3662

Open · 1 task done
AlexYoung757 opened this issue May 9, 2024 · 1 comment
Labels: pending (This problem is yet to be addressed.)

Comments

AlexYoung757 commented May 9, 2024

Reminder

  • I have read the README and searched the existing issues.

Reproduction

MASTER_PORT=$(shuf -n 1 -i 10000-65535)
DEEPSPEED_PATH=../config/ds_config_sft_z2_offload.json
MODEL_PATH=/your path/Meta-Llama-3-8B
OUTPUT_PATH=../output/llama3-8b-mod-sft
LOG_PATH=../logs/result_mod_sft.log

nohup deepspeed --num_gpus=4 --master_port $MASTER_PORT ../src/train_bash.py \
    --deepspeed $DEEPSPEED_PATH \
    --stage sft \
    --do_train \
    --model_name_or_path $MODEL_PATH \
    --dataset_dir ../data \
    --dataset oaast_sft_zh \
    --template llama3 \
    --finetuning_type lora \
    --lora_target q_proj,v_proj \
    --mixture_of_depths convert \
    --output_dir $OUTPUT_PATH \
    --overwrite_cache \
    --overwrite_output_dir \
    --cutoff_len 1024 \
    --preprocessing_num_workers 16 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --optim paged_adamw_8bit \
    --lr_scheduler_type cosine \
    --logging_steps 10 \
    --warmup_steps 20 \
    --save_steps 100 \
    --eval_steps 100 \
    --evaluation_strategy steps \
    --load_best_model_at_end \
    --learning_rate 5e-5 \
    --num_train_epochs 3.0 \
    --max_samples 3000 \
    --val_size 0.1 \
    --plot_loss \
    --fp16 \
    --flash_attn fa2 \
    > $LOG_PATH 2>&1 &

Expected behavior

I get the error "IndexError: too many indices for tensor of dimension 0". What could be the reason for this?

System Info

  • Platform: Linux-5.15.0-86-generic-x86_64-with-glibc2.17
  • Python version: 3.8.10
  • Transformers: 4.40.2
  • Huggingface_hub version: 0.21.3
  • Safetensors version: 0.4.2
  • Accelerate version: 0.28.0
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.1.2+cu121 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Others

  File "../src/train_bash.py", line 14, in <module>
    main()
  File "../src/train_bash.py", line 5, in main
    run_exp()
  File "/data2/yangzl/project/yd-llm-summary/yd-llama-factory-main/src/llmtuner/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/data2/yangzl/project/yd-llm-summary/yd-llama-factory-main/src/llmtuner/train/sft/workflow.py", line 78, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/data1/envs/baichuan/lib/python3.8/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 1852, in forward
    loss = self.module(*inputs, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/peft/peft_model.py", line 1129, in forward
    return self.base_model(
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/peft/tuners/tuners_utils.py", line 161, in forward
    return self.model.forward(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 1211, in forward
    outputs = self.model(
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 1007, in forward
    layer_outputs = self._gradient_checkpointing_func(
  File "/data2/yangzl/project/yd-llm-summary/yd-llama-factory-main/src/llmtuner/model/utils/checkpointing.py", line 47, in custom_gradient_checkpointing_func
    return gradient_checkpointing_func(func, *args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/_compile.py", line 24, in inner
    return torch._dynamo.disable(fn, recursive)(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/_dynamo/eval_frame.py", line 328, in _fn
    return fn(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/_dynamo/external_utils.py", line 17, in inner
    return fn(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 451, in checkpoint
    return CheckpointFunction.apply(function, preserve, *args)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/autograd/function.py", line 539, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/utils/checkpoint.py", line 230, in forward
    outputs = run_function(*args)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/data1/envs/baichuan/lib/python3.8/site-packages/MoD/MoD.py", line 60, in forward
    current_causal_mask = current_causal_mask[current_selected_mask][:, current_selected_mask].unsqueeze(0).unsqueeze(0) #first if for the one second is for the bs
IndexError: too many indices for tensor of dimension 0
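
For what it's worth, the failing line in MoD.py slices the causal mask with a boolean token-selection mask, which only works if a dense 2D mask slice actually reaches it; "too many indices for tensor of dimension 0" means the mask arrived as a 0-dim (scalar) tensor instead. One plausible path is FlashAttention-2 (`--flash_attn fa2` in the command above), since that branch does not materialize the dense 4D causal mask, so rerunning without fa2 would be a useful data point. A minimal sketch of the failure mode (the shapes and the fa2 connection are assumptions, not a confirmed diagnosis):

```python
import torch

# The indexing pattern from MoD.py, exercised in isolation.
seq_len = 6
selected = torch.tensor([1, 1, 0, 1, 0, 1], dtype=torch.bool)  # router keeps 4 of 6 tokens

# With the expected dense 2D causal-mask slice, the gather works:
causal_mask = torch.tril(torch.ones(seq_len, seq_len))
out = causal_mask[selected][:, selected].unsqueeze(0).unsqueeze(0)
print(out.shape)  # torch.Size([1, 1, 4, 4])

# If no dense mask was built and a scalar (0-dim) tensor reaches the
# same line, boolean indexing raises exactly the reported error:
scalar_mask = torch.tensor(0.0)
try:
    scalar_mask[selected][:, selected]
except IndexError as err:
    print(err)  # too many indices for tensor of dimension 0
```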

@AlexYoung757 changed the title from "mod finetune, IndexError: too many indices for tensor of dimension 0" to "Mixture-of-Depths Finetune, IndexError: too many indices for tensor of dimension 0" on May 10, 2024
@hiyouga added the "pending (This problem is yet to be addressed.)" label on May 11, 2024
@Zkli-hub commented

Did you solve the problem? I also ran into problems when running MoD fine-tuning. I directly ran the official example script: CUDA_VISIBLE_DEVICES=0 llamafactory-cli train examples/extras/mod/llama3_full_sft.yaml
Traceback (most recent call last):
  File "/home/yifei/anaconda3/envs/zkli/bin/llamafactory-cli", line 8, in <module>
    sys.exit(main())
  File "/home/yifei/zkli/MoD/LLaMA-Factory/src/llamafactory/cli.py", line 65, in main
    run_exp()
  File "/home/yifei/zkli/MoD/LLaMA-Factory/src/llamafactory/train/tuner.py", line 33, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "/home/yifei/zkli/MoD/LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 73, in run_sft
    train_result = trainer.train(resume_from_checkpoint=training_args.resume_from_checkpoint)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 1211, in forward
    outputs = self.model(
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/transformers/models/llama/modeling_llama.py", line 974, in forward
    inputs_embeds = self.embed_tokens(input_ids)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/modules/sparse.py", line 163, in forward
    return F.embedding(
  File "/home/yifei/anaconda3/envs/zkli/lib/python3.12/site-packages/torch/nn/functional.py", line 2264, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: CUDA error: device-side assert triggered
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.

0%| | 0/366 [00:00<?, ?it/s]

This seems to be a tensor-shape error, because it raised a warning like the one in the attached screenshot (not rendered here). Did you meet a similar problem before?
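
Not a confirmed fix, but a device-side assert raised from `torch.embedding` is typically an out-of-range token id, i.e. an id >= the embedding table size (for example, if the converted MoD checkpoint and the tokenizer disagree about the vocabulary). Rerunning with `CUDA_LAUNCH_BLOCKING=1` pins the assert to the exact op, and the sketch below checks the same condition on CPU, where it surfaces as a readable IndexError (the model path is a placeholder):

```python
import torch
from transformers import AutoConfig, AutoTokenizer

model_path = "meta-llama/Meta-Llama-3-8B"  # placeholder; use your local path

tokenizer = AutoTokenizer.from_pretrained(model_path)
config = AutoConfig.from_pretrained(model_path)

# 1) Every id the tokenizer emits must be < the embedding table size.
ids = tokenizer("a short probe sentence", return_tensors="pt").input_ids
print(ids.max().item(), "<", config.vocab_size)

# 2) On CPU the same failure is a synchronous, readable error instead
# of an asynchronous CUDA assert:
emb = torch.nn.Embedding(config.vocab_size, 8)
try:
    emb(torch.tensor([[config.vocab_size]]))  # smallest out-of-range id
except IndexError as err:
    print(err)  # index out of range in self
```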
