
Why does inference hang after LoRA fine-tuning the chatGLM3-6b-128k model? #3902

Closed
dazzlingCn opened this issue Apr 26, 2024 · 3 comments

@dazzlingCn

Problem Description
After LoRA fine-tuning the 128k model, inference hangs. What could be going on?

Steps to Reproduce
LoRA fine-tuning was done with the official fine-tuning script:
1. Fine-tuning: python finetune_hf.py data/xdd/ THUDM/chatglm3-6b-128k configs/lora.yaml — this completed without errors.
2. Merging the model: python merge_model.py output/checkpoint-10000 THUDM/chatglm3-6b-128k-n2
3. Inference: it first raised "AttributeError: can't set attribute 'eos_token'"; deleting eos_token, pad_token, and unk_token from tokenizer_config.json fixed that, and loading and inference then proceeded normally. For short prompts (under about 20 Chinese characters) inference usually works, but beyond that it appears to hang: no result after half an hour, with GPU memory essentially maxed out (24 GB) and the usage sometimes fluctuating. (A minimal inference sketch is included below for reference.)
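For reference, a minimal sketch of the inference step, assuming the merged checkpoint is loaded with the usual ChatGLM3 pattern; the model path and prompt are placeholders and the script actually used may differ. Streaming the output and capping max_new_tokens makes it easier to tell "slow" apart from "stuck":

```python
import torch
from transformers import AutoModel, AutoTokenizer, TextStreamer

# Hypothetical path: the merged checkpoint produced in step 2.
model_path = "THUDM/chatglm3-6b-128k-n2"

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModel.from_pretrained(model_path, trust_remote_code=True).half().cuda()
model = model.eval()

prompt = "请介绍一下你自己。"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Stream tokens as they are produced and cap max_new_tokens so a runaway
# generation still terminates instead of appearing to hang indefinitely.
streamer = TextStreamer(tokenizer, skip_prompt=True)
with torch.no_grad():
    model.generate(**inputs, max_new_tokens=256, do_sample=False, streamer=streamer)
```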

Expected Result
Normally a response should come back in about ten seconds and inference should complete.

Actual Result
It hangs for more than half an hour with no result, and GPU memory usage keeps changing: at the high end all 32 GB are used, at the low end around 14 GB.

Environment Information
CUDA 11.7, T4 GPU, PyTorch version 1.11.0+cu113

@dazzlingCn added the bug label Apr 26, 2024

dosubot bot commented Apr 26, 2024

To address the issue of your model inference getting stuck, especially with longer prompts, consider implementing the following optimizations and strategies:

  1. PagedAttention: This method reduces memory waste by dividing the KV cache into blocks for efficient retrieval during attention computation. It's particularly effective in managing the large memory footprint associated with KV caches.

  2. Efficient Memory Sharing with PagedAttention: By sharing blocks among sequences through a block table, this approach can significantly reduce memory overhead, leading to improved throughput.

  3. Optimize CUDA Kernels and Implement Continuous Batching: Focus on CUDA kernel optimizations and use continuous batching to improve GPU utilization by dynamically inserting new sequences into the batch based on resource availability.

  4. FlashAttention for Attention Mechanism Optimization: FlashAttention optimizes the attention mechanism by introducing tiling and recomputation strategies, reducing the need for large intermediate attention matrices and speeding up the computation.

  5. Dynamic and Inflight Batching for Scheduling Optimization: Use dynamic batching and inflight batching to refine scheduling and improve GPU resource utilization by addressing "bubbles" during inference.

  6. Memory Management in Conversation History: Ensure the conversation history buffer does not exceed a predefined maximum token limit to prevent memory overflow issues.

  7. GPU Memory Usage Management: Manage GPU memory usage through parameters such as gpu_memory_utilization, swap_space, max_num_batched_tokens, and max_num_seqs to control memory allocation during inference (a hypothetical example follows this list).
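As a concrete illustration of point 7, here is a hypothetical vLLM setup showing where those parameters plug in. The values are placeholders rather than tuned recommendations, and this assumes vLLM supports the checkpoint in question:

```python
from vllm import LLM, SamplingParams

# Hypothetical configuration; values below are placeholders, not tuned numbers.
llm = LLM(
    model="THUDM/chatglm3-6b-128k",   # or the path to the merged checkpoint
    trust_remote_code=True,
    gpu_memory_utilization=0.85,      # fraction of GPU memory vLLM may reserve
    swap_space=4,                     # GiB of CPU swap space for preempted sequences
    max_num_batched_tokens=8192,      # cap on tokens processed per scheduling step
    max_num_seqs=8,                   # cap on sequences running concurrently
    max_model_len=8192,               # keep context well below 128k on a 24 GB card
)

outputs = llm.generate(
    ["请介绍一下你自己。"],             # placeholder prompt
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```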

These strategies are derived from practices in managing large model inference processes and should be adjusted based on your specific setup and requirements. If the problem persists, profiling the memory usage during inference might help identify any bottlenecks or areas of excessive memory use.
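For example, a simple GPU-memory probe around a single generation call can show whether the peak allocation is what exhausts the card. This sketch assumes the `model` and `inputs` objects from the inference sketch in the issue description above:

```python
import torch

# Reset peak statistics, run one generation, then report allocated and peak memory.
torch.cuda.reset_peak_memory_stats()
before_gib = torch.cuda.memory_allocated() / 1024**3

with torch.no_grad():
    model.generate(**inputs, max_new_tokens=256, do_sample=False)

after_gib = torch.cuda.memory_allocated() / 1024**3
peak_gib = torch.cuda.max_memory_allocated() / 1024**3
print(f"allocated: {before_gib:.1f} -> {after_gib:.1f} GiB, peak: {peak_gib:.1f} GiB")
```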


@zRzRzRzRzRzRzR self-assigned this May 2, 2024
@zRzRzRzRzRzRzR
Collaborator

My guess is that the GPU is not actually being used; check that inference is running on the GPU. Also, this model is likely problematic; we don't recommend fine-tuning it.

@dazzlingCn
Author

My guess is that the GPU is not actually being used; check that inference is running on the GPU. Also, this model is likely problematic; we don't recommend fine-tuning it.

The GPU is being used and GPU memory is occupied. What exactly is wrong with this model? Could you elaborate?
