
[BUG/Help] chatglm2 first-token latency grows rapidly, multiplying with input length #664

Open
woaipichuli opened this issue Feb 8, 2024 · 0 comments


Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Verified with both chatglm2-6b and chatglm2-6b-int4: first-token latency grows rapidly, in step with input length. Going from an input length of 512 to 2048 tokens, first-token latency increases from 500 ms to 1.8 s.

Expected Behavior

The input (prefill) phase should be processed in parallel across tokens, so why does latency grow this noticeably?
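A likely explanation (my own sketch, not from this thread): prefill does run in parallel across tokens, but the *total* compute still grows with prompt length. For a standard transformer with hidden size d, the per-layer cost is roughly O(n·d²) for the linear projections plus O(n²·d) for attention, so going from n=512 to n=2048 the compute grows slightly more than 4x. Assuming ChatGLM2-6B's published shape (hidden size 4096, 28 layers; both values are assumptions here, and the FLOP constants are coarse approximations):

```python
def prefill_flops(n, d=4096, layers=28):
    """Very rough prefill FLOPs for a standard transformer.
    linear: QKV/output/MLP projections, ~8*n*d^2 per layer (approximation);
    attention: QK^T and scores@V, ~2*n^2*d per layer."""
    linear = 8 * n * d * d
    attention = 2 * n * n * d
    return layers * (linear + attention)

ratio = prefill_flops(2048) / prefill_flops(512)
print(round(ratio, 2))  # prints 4.36: compute grows a bit faster than the 4x token count
```

Under these assumptions, the observed 500 ms → 1.8 s (3.6x) for a 4x longer prompt is roughly what compute scaling predicts: parallelism removes the sequential dependency between prompt tokens, but not the total work.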

Steps To Reproduce

import torch
from transformers import AutoTokenizer, AutoModel
from peft import PeftModel

tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path, trust_remote_code=True)
# `revision=True` removed: `revision` expects a branch/tag/commit string, not a boolean
base_model = AutoModel.from_pretrained(base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, peft_model_id, torch_dtype=torch.float16)

text = "test text"  # renamed from `str` to avoid shadowing the built-in
pt_data = tokenizer(text, return_tensors="pt", padding=True).to('cuda')

# max_length = prompt length + 1 generates exactly one new token,
# which isolates the first-token (prefill) latency
gen_kwargs = {"max_length": pt_data["input_ids"].shape[-1] + 1,
              "num_beams": 1,
              "do_sample": False,  # greedy decoding; top_p/temperature=0 removed, they have no effect here
              "logits_processor": logits_processor}

outputs = model.generate(**pt_data, **gen_kwargs)
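To make the reported numbers reproducible, a minimal generic timing helper (a sketch in pure Python; the wrapped callable and run counts are my own choices, not from the report):

```python
import time

def median_latency(fn, n_runs=5, warmup=1):
    """Median wall-clock seconds of fn() over n_runs, after warmup calls.
    Warmup absorbs one-off costs (CUDA kernel selection, cache allocation)."""
    for _ in range(warmup):
        fn()
    times = []
    for _ in range(n_runs):
        t0 = time.perf_counter()
        fn()
        times.append(time.perf_counter() - t0)
    return sorted(times)[len(times) // 2]
```

For example, `median_latency(lambda: model.generate(**pt_data, **gen_kwargs))` with the kwargs above; when timing CUDA work directly, call `torch.cuda.synchronize()` inside the callable, since kernel launches are asynchronous.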

Environment

Verified on both V100 and T4 GPUs.

Anything else?

No response
