
[BUG/Help] First-token latency scales with input length, growing near-linearly #1456

woaipichuli opened this issue Feb 8, 2024 · 0 comments
Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Measured on GPU, first-token latency grows noticeably with input length, roughly doubling as the length doubles: going from 512 to 2048 input tokens, first-token latency rises nearly 4x, from about 500 ms to 1.8 s.
The input (prompt) portion should be computed in parallel, so why does the latency grow this much?
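
Below is a minimal benchmark sketch for reproducing this measurement, assuming a ChatGLM-style checkpoint loaded via AutoModel; the model id, the synthetic-prompt construction, and the exact token counts are illustrative assumptions, not taken from the report:

import time
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "THUDM/chatglm-6b"  # assumed checkpoint; the report does not name one
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModel.from_pretrained(model_name, trust_remote_code=True).half().cuda().eval()

for n_tokens in (512, 1024, 2048):
    # Build a synthetic prompt and truncate it to at most n_tokens tokens.
    prompt = "测试" * n_tokens
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True,
                       max_length=n_tokens).to("cuda")
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    # max_new_tokens=1 isolates prefill plus the first decoded token.
    model.generate(**inputs, max_new_tokens=1, do_sample=False)
    torch.cuda.synchronize()
    dt = (time.perf_counter() - t0) * 1000
    print(f"{inputs['input_ids'].shape[-1]} tokens -> {dt:.0f} ms to first token")

Note that parallel prefill does not imply constant cost: total compute still grows with prompt length (linearly in the MLP layers, quadratically in attention), so near-linear growth in first-token latency on compute-bound GPUs is plausible.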

Expected Behavior

No response

Steps To Reproduce

import torch
from peft import PeftModel
from transformers import AutoModel, AutoTokenizer, LogitsProcessorList

tokenizer = AutoTokenizer.from_pretrained(base_model_name_or_path, trust_remote_code=True)
# revision=True in the original is invalid (revision expects a git ref string), so it is dropped here
base_model = AutoModel.from_pretrained(base_model_name_or_path, trust_remote_code=True)
model = PeftModel.from_pretrained(base_model, peft_model_id, torch_dtype=torch.float16)
model.cuda()

text = "测试文本"  # "test text"; renamed from `str`, which shadows the builtin
pt_data = tokenizer(text, return_tensors="pt", padding=True).to("cuda")
logits_processor = LogitsProcessorList()  # undefined in the original snippet; empty list as a stand-in
# max_length = prompt length + 1 generates exactly one new token,
# so the measured time is the first-token (prefill) latency
gen_kwargs = {"max_length": pt_data["input_ids"].shape[-1] + 1, "num_beams": 1,
              "do_sample": False, "top_p": 0.8, "temperature": 0,
              "logits_processor": logits_processor}
outputs = model.generate(**pt_data, **gen_kwargs)
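
To make the latency numbers reproducible, a timing wrapper such as the following could be placed around the generate call; the synchronize/perf_counter pattern is an assumption, since the report does not show its timing code:

import time

torch.cuda.synchronize()  # flush pending GPU work before starting the clock
t0 = time.perf_counter()
outputs = model.generate(**pt_data, **gen_kwargs)
torch.cuda.synchronize()  # wait for generate's kernels to finish
print(f"first-token latency: {(time.perf_counter() - t0) * 1000:.0f} ms")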

Environment

The issue was verified on two GPUs: V100 and T4.

Anything else?

No response
