
Please help take a look: after finetuning, running cli_demo.py hits a dimension mismatch; RuntimeError: The size of tensor a (12288) must match the size of tensor b (25165824) at non-singleton dimension 0 #339

Open
New-start-man opened this issue Jan 25, 2024 · 2 comments

Comments

@New-start-man

Finetuning completed successfully, so why does this error still occur when loading the model?
python cli_demo.py --from_pretrained "checkpoints/finetune-visualglm-6b-01-24-20-02/" --quant 4
[2024-01-25 10:40:45,884] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/anaconda3/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so
/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: /home/anaconda3/envs/pytorch did not contain ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] as expected! Searching further paths...
warn(msg)
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /home/anaconda3/envs/pytorch/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda118.so...
[2024-01-25 10:40:53,953] [INFO] building FineTuneVisualGLMModel model ...
[2024-01-25 10:40:53,955] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-01-25 10:40:53,955] [INFO] [RANK 0] You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE=1. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.
[2024-01-25 10:40:53,956] [INFO] [RANK 0] You are using model-only mode.
For torch.distributed users or loading model parallel models, set environment variables RANK, WORLD_SIZE and LOCAL_RANK.
/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/init.py:405: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
[2024-01-25 10:41:01,778] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-01-25 10:41:02,129] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-01-25 10:41:02,481] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
[2024-01-25 10:41:30,816] [INFO] [RANK 0] > number of parameters on model parallel rank 0: 7802848768
[2024-01-25 10:41:33,995] [INFO] [RANK 0] global rank 0 is loading checkpoint checkpoints/finetune-visualglm-6b-01-24-20-02/300/mp_rank_00_model_states.pt
Traceback (most recent call last):
  File "/home/PycharmProjects/VisualGLM-6B/cli_demo.py", line 103, in <module>
    main()
  File "/home/PycharmProjects/VisualGLM-6B/cli_demo.py", line 30, in main
    model, model_args = AutoModel.from_pretrained(
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/model/base_model.py", line 338, in from_pretrained
    return cls.from_pretrained_base(name, args=args, home_path=home_path, url=url, prefix=prefix, build_only=build_only, overwrite_args=overwrite_args, **kwargs)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/model/base_model.py", line 332, in from_pretrained_base
    load_checkpoint(model, args, load_path=model_path, prefix=prefix)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/training/model_io.py", line 273, in load_checkpoint
    missing_keys, unexpected_keys = module.load_state_dict(sd['module'], strict=False)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2027, in load_state_dict
    load(self, state_dict)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2015, in load
    load(child, child_state_dict, child_prefix)
  [Previous line repeated 3 more times]
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/torch/nn/modules/module.py", line 2009, in load
    module._load_from_state_dict(
  File "/home/anaconda3/envs/pytorch/lib/python3.10/site-packages/sat/model/finetune/lora2.py", line 49, in _load_from_state_dict
    self.weight.data.copy_(state_dict[prefix+'weight'])
RuntimeError: The size of tensor a (12288) must match the size of tensor b (25165824) at non-singleton dimension 0
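One possible reading of the numbers (an assumption, not confirmed in this thread): a 12288 × 4096 projection weight holds 50,331,648 values, and packing two 4-bit values per byte yields exactly 25,165,824 bytes, which matches "tensor b". That pattern would arise if the checkpoint holds a 4-bit-packed weight while the model being loaded expects an unpacked one (i.e. the `--quant` setting at save time differed from the one at load time). A minimal sketch of the arithmetic, plus a defensive shape check of the kind that would turn the unconditional `copy_` in sat's lora2.py into a readable error (the shape `12288 × 4096` and the helper `safe_copy` are hypothetical, for illustration only):

```python
import torch

# Plausible projection-weight shape in a 6B GLM block (assumed, not from the log).
rows, cols = 12288, 4096
full_params = rows * cols          # 50_331_648 float weights
packed_4bit = full_params // 2     # two 4-bit values per byte

# This matches "tensor b (25165824)" in the RuntimeError above.
assert packed_4bit == 25_165_824

def safe_copy(dst: torch.Tensor, src: torch.Tensor, name: str = "weight") -> None:
    """Copy src into dst, but fail with a readable message on shape mismatch."""
    if dst.shape != src.shape:
        raise ValueError(
            f"{name}: model expects {tuple(dst.shape)} but checkpoint has "
            f"{tuple(src.shape)} - was the checkpoint saved with a different "
            f"--quant setting than the one used at load time?"
        )
    dst.data.copy_(src)
```

If this reading is right, the practical check is to load the finetuned checkpoint with the same quantization flag it was trained/saved with (e.g. try omitting `--quant 4`) and only quantize afterwards.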

@xiongxiaochu
Could you share the model files you used for training? On my Mac I can't install triton, so I can't download them via the sat code, and the Tsinghua open-source model has problems...

@New-start-man
Author

> Could you share the model files you used for training? On my Mac I can't install triton, so I can't download them via the sat code, and the Tsinghua open-source model has problems...

Hello, how would I send you the trained model? There doesn't seem to be a good channel for sharing the finetuned model files.
