Why does inference hang after LoRA fine-tuning the chatGLM3-6b-128k model? #3902
Comments
To address the issue of your model inference getting stuck, especially with longer prompts, start by checking whether the run is genuinely hung or just generating very slowly, and by watching memory use during generation. Any mitigation should be adjusted to your specific setup and requirements. If the problem persists, profiling the memory usage during inference might help identify bottlenecks or areas of excessive memory use.
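As a first diagnostic step, it helps to distinguish a truly hung process from one that is merely generating very slowly. A minimal sketch using only the standard library (the function name `run_with_timeout` and the timeout value are illustrative, not part of any framework):

```python
import threading

def run_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) in a worker thread.

    Returns (finished, result): finished is False if the call did not
    complete within timeout_s seconds, which suggests the generation
    loop is stuck (or extremely slow) rather than just long-running.
    """
    outcome = {}

    def worker():
        outcome["value"] = fn(*args, **kwargs)

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    return ("value" in outcome, outcome.get("value"))
```

Wrapping the model's generate call this way with, say, a 60-second budget turns an open-ended wait into a clear signal: if it times out even for modest prompts, the next step is to profile GPU memory during the run.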
My guess is the GPU is not actually being used; check that inference runs on the GPU. Also, this model itself seems to have problems, and I would not recommend fine-tuning it.
The GPU is in use and VRAM is occupied. What problems does this model have? Could you elaborate?
问题描述 / Problem Description
用简洁明了的语言描述这个问题 / Describe the problem in a clear and concise manner.
After fine-tuning the 128k model with LoRA, inference hangs. What could be the cause?
复现问题的步骤 / Steps to Reproduce
LoRA fine-tuning was done with the official fine-tuning script:
1. Fine-tuning: python finetune_hf.py data/xdd/ THUDM/chatglm3-6b-128k configs/lora.yaml
This completed successfully with no errors.
2. Merging the model: python merge_model.py output/checkpoint-10000 THUDM/chatglm3-6b-128k-n2
3. Inference: at first, inference raised "AttributeError: can't set attribute 'eos_token'". Deleting the eos_token, pad_token, and unk_token entries from tokenizer_config.json fixed that, and loading and inference then proceeded normally. For fairly short prompts (under about 20 Chinese characters), inference usually works, but beyond that length the process appears to hang: half an hour of inference produces no result, while GPU memory usage is nearly maxed out (24 GB) and sometimes fluctuates.
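The tokenizer_config.json workaround in step 3 can be scripted instead of edited by hand. A minimal sketch, assuming the merged model directory from step 2 (the helper name `strip_special_token_keys` is mine, not from the ChatGLM repo):

```python
import json

# Hypothetical path: adjust to wherever the merged model was written.
CONFIG_PATH = "THUDM/chatglm3-6b-128k-n2/tokenizer_config.json"

# Keys whose presence triggered
# "AttributeError: can't set attribute 'eos_token'" on load.
BAD_KEYS = ("eos_token", "pad_token", "unk_token")

def strip_special_token_keys(config: dict) -> dict:
    """Return a copy of the tokenizer config without the offending keys."""
    return {k: v for k, v in config.items() if k not in BAD_KEYS}

if __name__ == "__main__":
    with open(CONFIG_PATH, encoding="utf-8") as f:
        config = json.load(f)
    with open(CONFIG_PATH, "w", encoding="utf-8") as f:
        json.dump(strip_special_token_keys(config), f,
                  ensure_ascii=False, indent=2)
```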
预期的结果 / Expected Result
Normally a response should come back in about ten seconds, with inference complete.
实际结果 / Actual Result
It hangs for more than half an hour with no result, and GPU memory usage keeps changing: at its highest all 32 GB are used, at its lowest around 14 GB.
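One possible (unconfirmed for this case) contributor to widely fluctuating GPU memory is fragmentation in PyTorch's caching allocator, which can be tuned via the PYTORCH_CUDA_ALLOC_CONF environment variable. Setting it before launching inference is a low-risk experiment; the 128 MiB split size below is an arbitrary starting point, not a recommendation from the ChatGLM docs:

```shell
# Cap the size of cached memory splits to reduce fragmentation.
# Must be set before starting the Python process that runs inference.
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
```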
环境信息 / Environment Information
CUDA 11.7, NVIDIA T4, PyTorch 1.11.0+cu113