Question: How much GPU memory is needed to 4-bit quantize the Qwen 72B model? I used 4 * A40 (4 * 48GB) and hit OOM when quantization reached 46%. #3663

camposs1979 · 2024-05-09T09:07:32Z

Reminder

I have read the README and searched the existing issues.

Reproduction

CUDA_VISIBLE_DEVICES=0,1,2,3 python3.10 export_model.py \
    --model_name_or_path /hy-tmp/models/Qwen1.5-72B-Chat-sft \
    --export_quantization_bit 4 \
    --export_quantization_dataset ../data/c4_demo.json \
    --template qwen \
    --export_dir ../../models/Qwen1.5-72B-Chat-sft-INT4 \
    --export_size 2 \
    --export_device cpu \
    --export_legacy_format False
......
Quantizing model.layers blocks : 46%|███████████████████████████████████████▎ | 37/80 [38:12<44:24, 61.96s/it]
Traceback (most recent call last):
  File "/hy-tmp/LLaMA-Factory-main/src/export_model.py", line 8, in <module>
    main()
  File "/hy-tmp/LLaMA-Factory-main/src/export_model.py", line 4, in main
    export_model()
  File "/hy-tmp/LLaMA-Factory-main/src/llmtuner/train/tuner.py", line 57, in export_model
    model = load_model(tokenizer, model_args, finetuning_args) # must after fixing tokenizer to resize vocab
  File "/hy-tmp/LLaMA-Factory-main/src/llmtuner/model/loader.py", line 128, in load_model
    model = AutoModelForCausalLM.from_pretrained(**init_kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/auto/auto_factory.py", line 563, in from_pretrained
    return model_class.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 3592, in from_pretrained
    hf_quantizer.postprocess_model(model)
  File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/base.py", line 195, in postprocess_model
    return self._process_model_after_weight_loading(model, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/quantizers/quantizer_gptq.py", line 85, in _process_model_after_weight_loading
    self.optimum_quantizer.quantize_model(model, self.quantization_config.tokenizer)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/optimum/gptq/quantizer.py", line 506, in quantize_model
    scale, zero, g_idx = gptq[name].fasterquant(
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/quantization/gptq.py", line 117, in fasterquant
    H = torch.cholesky_inverse(H)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.25 GiB. GPU 1 has a total capacty of 47.33 GiB of which 943.44 MiB is free. Process 2716737 has 46.40 GiB memory in use. Of the allocated memory 44.18 GiB is allocated by PyTorch, and 1.88 GiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
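
Note: the error message above suggests trying `max_split_size_mb` through `PYTORCH_CUDA_ALLOC_CONF`. A minimal sketch of how that variable could be set is below. The helper name `with_alloc_conf` and the value 128 are illustrative assumptions, not tuned settings, and nothing in this report shows that the setting is enough to avoid the OOM.

```python
import os


def with_alloc_conf(env, max_split_size_mb=128):
    """Return a copy of `env` with PYTORCH_CUDA_ALLOC_CONF set.

    `with_alloc_conf` is a hypothetical helper and 128 is an arbitrary
    illustrative value, not a recommendation.
    """
    new_env = dict(env)
    new_env["PYTORCH_CUDA_ALLOC_CONF"] = f"max_split_size_mb:{max_split_size_mb}"
    return new_env


# Build the environment for a child process; the variable has to be in place
# before PyTorch initializes its CUDA allocator.
env = with_alloc_conf(os.environ)
print(env["PYTORCH_CUDA_ALLOC_CONF"])
```

The resulting `env` could be passed to `subprocess.run(..., env=env)` when launching the export command, or the same variable could be exported in the shell before running it.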

Expected behavior

I would like to know how much GPU memory is actually needed to quantize the 72B Qwen1.5 model.

System Info

(base) root@I19c2837ff800901ccf:/# python3.10 -m pip list
Package Version

accelerate 0.28.0
addict 2.4.0
aiofiles 23.2.1
aiohttp 3.9.3
aiosignal 1.3.1
aliyun-python-sdk-core 2.15.0
aliyun-python-sdk-kms 2.16.2
altair 5.2.0
annotated-types 0.6.0
anyio 4.3.0
async-timeout 4.0.3
attrs 23.2.0
auto_gptq 0.7.1
bitsandbytes 0.43.0
certifi 2019.11.28
cffi 1.16.0
chardet 3.0.4
charset-normalizer 3.3.2
click 8.1.7
cloudpickle 3.0.0
cmake 3.29.2
coloredlogs 15.0.1
contourpy 1.2.0
crcmod 1.7
cryptography 42.0.5
cupy-cuda12x 12.1.0
cycler 0.12.1
datasets 2.18.0
dbus-python 1.2.16
deepspeed 0.14.0
dill 0.3.8
diskcache 5.6.3
distro 1.4.0
distro-info 0.23ubuntu1
docstring_parser 0.16
einops 0.7.0
exceptiongroup 1.2.0
fastapi 0.110.0
fastrlock 0.8.2
ffmpy 0.3.2
filelock 3.13.3
fire 0.6.0
fonttools 4.50.0
frozenlist 1.4.1
fsspec 2024.2.0
galore-torch 1.0
gast 0.5.4
gekko 1.0.7
gradio 4.10.0
gradio_client 0.7.3
h11 0.14.0
hjson 3.1.0
httpcore 1.0.4
httptools 0.6.1
httpx 0.27.0
huggingface-hub 0.22.0
humanfriendly 10.0
idna 2.8
importlib_metadata 7.1.0
importlib_resources 6.4.0
interegular 0.3.3
Jinja2 3.1.3
jmespath 0.10.0
joblib 1.3.2
jsonschema 4.21.1
jsonschema-specifications 2023.12.1
kiwisolver 1.4.5
lark 1.1.9
llvmlite 0.42.0
markdown-it-py 3.0.0
MarkupSafe 2.1.5
matplotlib 3.8.3
mdurl 0.1.2
modelscope 1.13.3
mpmath 1.3.0
msgpack 1.0.8
multidict 6.0.5
multiprocess 0.70.16
nest-asyncio 1.6.0
networkx 3.2.1
ninja 1.11.1.1
numba 0.59.1
numpy 1.26.4
nvidia-cublas-cu12 12.1.3.1
nvidia-cuda-cupti-cu12 12.1.105
nvidia-cuda-nvrtc-cu12 12.1.105
nvidia-cuda-runtime-cu12 12.1.105
nvidia-cudnn-cu12 8.9.2.26
nvidia-cufft-cu12 11.0.2.54
nvidia-curand-cu12 10.3.2.106
nvidia-cusolver-cu12 11.4.5.107
nvidia-cusparse-cu12 12.1.0.106
nvidia-nccl-cu12 2.18.1
nvidia-nvjitlink-cu12 12.4.99
nvidia-nvtx-cu12 12.1.105
optimum 1.16.0
orjson 3.9.15
oss2 2.18.4
outlines 0.0.34
packaging 24.0
pandas 2.2.1
peft 0.10.0
pillow 10.2.0
pip 24.0
platformdirs 4.2.0
prometheus_client 0.20.0
protobuf 5.26.0
psutil 5.9.8
py-cpuinfo 9.0.0
pyarrow 15.0.2
pyarrow-hotfix 0.6
pycparser 2.21
pycryptodome 3.20.0
pydantic 2.6.4
pydantic_core 2.16.3
pydub 0.25.1
Pygments 2.17.2
PyGObject 3.36.0
pynvml 11.5.0
pyparsing 3.1.2
python-apt 2.0.1+ubuntu0.20.4.1
python-dateutil 2.9.0.post0
python-dotenv 1.0.1
python-multipart 0.0.9
pytz 2024.1
PyYAML 6.0.1
ray 2.10.0
referencing 0.34.0
regex 2023.12.25
requests 2.31.0
requests-unixsocket 0.2.0
rich 13.7.1
rouge 1.0.1
rpds-py 0.18.0
safetensors 0.4.2
scipy 1.12.0
semantic-version 2.10.0
sentencepiece 0.2.0
setuptools 69.2.0
shellingham 1.5.4
shtab 1.7.1
simplejson 3.19.2
six 1.14.0
sniffio 1.3.1
sortedcontainers 2.4.0
sse-starlette 2.0.0
ssh-import-id 5.10
starlette 0.36.3
sympy 1.12
termcolor 2.4.0
tiktoken 0.6.0
tokenizers 0.15.2
tomli 2.0.1
tomlkit 0.12.0
toolz 0.12.1
torch 2.1.2
tqdm 4.66.2
transformers 4.39.1
triton 2.1.0
trl 0.8.1
typer 0.12.3
typing_extensions 4.10.0
tyro 0.7.3
tzdata 2024.1
unattended-upgrades 0.1
urllib3 2.2.1
uvicorn 0.29.0
uvloop 0.19.0
vllm 0.4.0
watchfiles 0.21.0
websockets 11.0.3
wheel 0.34.2
xformers 0.0.23.post1
xxhash 3.4.1
yapf 0.40.2
yarl 1.9.4
zipp 3.18.1

Others

No response

camposs1979 · 2024-05-09T23:15:42Z

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
+---------------------------------------------------------------------------------------+

hiyouga · 2024-05-11T16:12:22Z

I suggest using GPUs with more memory.
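
For a rough sense of scale, here is a back-of-the-envelope sketch, not a measured requirement. It assumes a nominal 72e9 parameters held in fp16, an fp32 GPTQ Hessian, and an MLP intermediate size of 24576 for Qwen1.5-72B; none of these figures is stated in this thread. Under those assumptions the largest per-layer Hessian comes to 2.25 GiB, which matches the allocation that failed in `torch.cholesky_inverse`.

```python
GIB = 1024 ** 3

# Assumptions (not stated in the thread): nominal parameter count, fp16
# weights, fp32 Hessian, and the MLP intermediate size of Qwen1.5-72B.
NUM_PARAMS = 72e9
BYTES_FP16 = 2
BYTES_FP32 = 4
INTERMEDIATE_SIZE = 24576


def weights_gib(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GiB."""
    return num_params * bytes_per_param / GIB


def hessian_gib(in_features, bytes_per_elem=BYTES_FP32):
    """Size of one square (in_features x in_features) Hessian, in GiB."""
    return in_features * in_features * bytes_per_elem / GIB


fp16_weights = weights_gib(NUM_PARAMS, BYTES_FP16)
total_gpu = 4 * 47.33  # per-GPU capacity (GiB) reported in the error message
hessian = hessian_gib(INTERMEDIATE_SIZE)

print(f"fp16 weights:     {fp16_weights:.1f} GiB")
print(f"4 x A40 capacity: {total_gpu:.1f} GiB")
print(f"largest Hessian:  {hessian:.2f} GiB")
```

The estimate ignores calibration activations, uneven placement of layers across devices, and allocator overhead. In the report, GPU 1 alone was nearly full when the allocation failed, so the per-device peak appears to matter more than the aggregate capacity.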

hiyouga added the solved label ("This problem has been already solved.") May 11, 2024

hiyouga closed this as completed May 11, 2024
