
On Linux, QLoRA fine-tuning fails: all four subprocesses are killed ("Killing subprocess") on four 8 GB GPUs. How do I resolve the out-of-memory error? #336

Open
abandonnnnn opened this issue Jan 10, 2024 · 0 comments

Comments


abandonnnnn commented Jan 10, 2024

`(newvisglm) zzz@zzz:~/yz/AllVscodes/VisualGLM-6B-main$ bash finetune/finetune_visualglm_qlora.sh
NCCL_DEBUG=info NCCL_IB_DISABLE=0 NCCL_NET_GDR_LEVEL=2 deepspeed --master_port 16666 --include localhost:0,1,2,3 --hostfile hostfile_single finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 1 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
[2024-01-10 21:42:21,222] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please run

python -m bitsandbytes

and submit this information together with your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

bin /home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: Found duplicate ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] files: {PosixPath('/home/zzz/anaconda3/envs/newvisglm/lib/libcudart.so'), PosixPath('/home/zzz/anaconda3/envs/newvisglm/lib/libcudart.so.11.0')}.. We'll flip a coin and try one of these, in order to fail forward.
Either way, this might cause trouble in the future:
If you get CUDA error: invalid device function errors, the above might be the cause and the solution is to make sure only one ['libcudart.so', 'libcudart.so.11.0', 'libcudart.so.12.0'] in the paths that we search based on your env.
warn(msg)
CUDA SETUP: CUDA runtime path found: /home/zzz/anaconda3/envs/newvisglm/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 6.1
CUDA SETUP: Detected CUDA version 113
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/cuda_setup/main.py:149: UserWarning: WARNING: Compute capability < 7.5 detected! Only slow 8-bit matmul is supported for your GPU!
warn(msg)
CUDA SETUP: Loading binary /home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda113_nocublaslt.so...
[2024-01-10 21:43:44,628] [WARNING] [runner.py:202:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2024-01-10 21:43:44,628] [INFO] [runner.py:571:main] cmd = /home/zzz/anaconda3/envs/newvisglm/bin/python -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgM119 --master_addr=127.0.0.1 --master_port=16666 --enable_each_rank_log=None finetune_visualglm.py --experiment-name finetune-visualglm-6b --model-parallel-size 1 --mode finetune --train-iters 300 --resume-dataloader --max_source_length 64 --max_target_length 256 --lora_rank 10 --layer_range 0 14 --pre_seq_len 4 --train-data ./fewshot-data/dataset.json --valid-data ./fewshot-data/dataset.json --distributed-backend nccl --lr-decay-style cosine --warmup .02 --checkpoint-activations --save-interval 300 --eval-interval 10000 --save ./checkpoints --split 1 --eval-iters 10 --eval-batch-size 8 --zero-stage 1 --lr 0.0001 --batch-size 1 --gradient-accumulation-steps 4 --skip-init --fp16 --use_qlora
[2024-01-10 21:43:46,326] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[... identical bitsandbytes BUG REPORT / CUDA SETUP output repeated by the relaunched process; same as above ...]
[2024-01-10 21:43:48,984] [INFO] [launch.py:138:main] 0 NCCL_IB_DISABLE=0
[2024-01-10 21:43:48,984] [INFO] [launch.py:138:main] 0 NCCL_DEBUG=info
[2024-01-10 21:43:48,984] [INFO] [launch.py:138:main] 0 NCCL_NET_GDR_LEVEL=2
[2024-01-10 21:43:48,984] [INFO] [launch.py:145:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3]}
[2024-01-10 21:43:48,984] [INFO] [launch.py:151:main] nnodes=1, num_local_procs=4, node_rank=0
[2024-01-10 21:43:48,984] [INFO] [launch.py:162:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3]})
[2024-01-10 21:43:48,984] [INFO] [launch.py:163:main] dist_world_size=4
[2024-01-10 21:43:48,984] [INFO] [launch.py:165:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3
[2024-01-10 21:43:50,865] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-10 21:43:50,916] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-10 21:43:50,938] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)
[2024-01-10 21:43:50,944] [INFO] [real_accelerator.py:161:get_accelerator] Setting ds_accelerator to cuda (auto detect)

[... identical bitsandbytes BUG REPORT / CUDA SETUP output repeated, interleaved, by each of the 4 worker ranks; same as above ...]
[2024-01-10 21:44:24,179] [INFO] using world size: 4 and model-parallel size: 1
[2024-01-10 21:44:24,179] [INFO] > padded vocab (size: 100) with 28 dummy tokens (new size: 128)
[2024-01-10 21:44:25,235] [INFO] [RANK 0] > initializing model parallel with size 1
[2024-01-10 21:44:25,276] [INFO] [RANK 0] You didn't pass in LOCAL_WORLD_SIZE environment variable. We use the guessed LOCAL_WORLD_SIZE=4. If this is wrong, please pass the LOCAL_WORLD_SIZE manually.
[2024-01-10 21:44:25,285] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,285] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,286] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,287] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,288] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,288] [INFO] [checkpointing.py:1045:_configure_using_config_file] {'partition_activations': False, 'contiguous_memory_optimization': False, 'cpu_checkpointing': False, 'number_checkpoints': None, 'synchronize_checkpoint_boundary': False, 'profile': False}
[2024-01-10 21:44:25,288] [INFO] [checkpointing.py:227:model_parallel_cuda_manual_seed] > initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234
[2024-01-10 21:44:25,289] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,295] [INFO] [comm.py:637:init_distributed] cdb=None
[2024-01-10 21:44:25,296] [WARNING] [config_utils.py:69:_process_deprecated_field] Config parameter cpu_offload is deprecated use offload_optimizer instead
[2024-01-10 21:44:25,422] [INFO] [RANK 0] building FineTuneVisualGLMModel model ...
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
/home/zzz/anaconda3/envs/newvisglm/lib/python3.10/site-packages/torch/nn/init.py:403: UserWarning: Initializing zero-element tensors is a no-op
warnings.warn("Initializing zero-element tensors is a no-op")
[2024-01-10 21:44:44,798] [INFO] [RANK 0] replacing layer 0 attention with lora
[2024-01-10 21:44:45,735] [INFO] [RANK 0] replacing layer 14 attention with lora
[2024-01-10 21:44:46,718] [INFO] [RANK 0] replacing chatglm linear layer with 4bit
[2024-01-10 21:48:30,236] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432905
[2024-01-10 21:48:31,608] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432906
[2024-01-10 21:48:32,635] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432907
[2024-01-10 21:48:32,636] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2432908
[2024-01-10 21:48:33,659] [ERROR] [launch.py:321:sigkill_handler] ['/home/zzz/anaconda3/envs/newvisglm/bin/python', '-u', 'finetune_visualglm.py', '--local_rank=3', '--experiment-name', 'finetune-visualglm-6b', '--model-parallel-size', '1', '--mode', 'finetune', '--train-iters', '300', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--layer_range', '0', '14', '--pre_seq_len', '4', '--train-data', './fewshot-data/dataset.json', '--valid-data', './fewshot-data/dataset.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '300', '--eval-interval', '10000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '8', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '1', '--gradient-accumulation-steps', '4', '--skip-init', '--fp16', '--use_qlora'] exits with return code = -9
(newvisglm) zzz@zzz:~/yz/AllVscodes/VisualGLM-6B-main$ `
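For context, a rough back-of-envelope estimate of the memory each process needs just for model weights, assuming roughly 6.2 billion parameters for ChatGLM-6B (the exact count and load path are assumptions; the model is typically materialized in fp16 before the "replacing chatglm linear layer with 4bit" step quantizes it):

```python
# Rough memory estimate per training process (assumption: ~6.2e9 parameters).
PARAMS = 6.2e9

def gib(nbytes):
    """Convert a byte count to GiB."""
    return nbytes / 2**30

fp16_bytes = PARAMS * 2    # fp16 = 2 bytes/param, before 4-bit replacement
int4_bytes = PARAMS * 0.5  # 4-bit = 0.5 bytes/param, after quantization

print(f"fp16 weights:  {gib(fp16_bytes):.1f} GiB per process")
print(f"4-bit weights: {gib(int4_bytes):.1f} GiB per process")
print(f"4 processes holding fp16 at once: {gib(4 * fp16_bytes):.1f} GiB")
```

Even the quantized weights (~2.9 GiB) leave little of an 8 GiB card for activations and optimizer state, and if all four ranks hold fp16 copies during loading, that alone is roughly 46 GiB, which can exhaust host RAM before training even starts.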

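One diagnostic worth running: the log ends with `exits with return code = -9` (SIGKILL) but shows no CUDA out-of-memory traceback, which is consistent with the Linux kernel's OOM killer terminating the processes because *host* RAM ran out while all 4 ranks were loading the model. A sketch of how to confirm this (assuming a typical Linux setup with `dmesg` readable and `nvidia-smi` installed):

```shell
# SIGKILL (-9) with no CUDA OOM traceback suggests the kernel OOM killer.
# If it fired, the kernel log records which process it killed and why:
dmesg -T 2>/dev/null | grep -iE 'out of memory|oom-kill|killed process' || true

# While relaunching the job, watch which resource fills first:
free -h                                                             # host RAM
nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv || true  # GPU RAM
```

If the kernel log confirms an OOM kill, likely mitigations are reducing the number of ranks (e.g. `--include localhost:0,1`), adding swap, or enabling DeepSpeed CPU/NVMe offload; if GPU memory is the bottleneck instead, lowering `--eval-batch-size` and `--max_target_length` is the usual first step.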