
[Bug]: [date 5.9] tke regression: tpcc 1000warehouse 1000threads connection timeout #15960

Open

heni02 opened this issue May 10, 2024 · 15 comments
Assignees: reusee
Labels: kind/bug Something isn't working · severity/s0 Extreme impact: Cause the application to break down and seriously affect the use
Milestone: 1.2.0

heni02 (Contributor) commented May 10, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

main

Commit ID

1aed8d9

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job:
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9019459807/job/24803217074
[WeCom screenshot]

The tke environment did not restart.
[WeCom screenshot]

mo log:
https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22fzn%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240509%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221715283409000%22,%22to%22:%221715283420000%22%7D%7D%7D&schemaVersion=1&orgId=1

profile:
timeout_profile.tar.gz (attached)

Expected Behavior

No response

Steps to Reproduce

Run the tke nightly-regression tpcc test with 1000 warehouses and 1000 threads (a sketch of an equivalent BenchmarkSQL-style configuration follows below).
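mo-tpcc follows BenchmarkSQL conventions, so a load of this shape is normally described by a properties file roughly like the sketch below. The key names are standard BenchmarkSQL properties; the host, credentials, and run length are placeholders, since the actual config used by the nightly job is not attached to this issue.

```
# Sketch of a BenchmarkSQL-style props file for this load (values illustrative,
# not the nightly job's real config).
conn=jdbc:mysql://<mo-host>:6001/tpcc
user=<user>
password=<password>
warehouses=1000
terminals=1000
runMins=60
limitTxnsPerMin=0
```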

Additional information

No response

heni02 added the kind/bug, needs-triage, and severity/s0 labels on May 10, 2024
heni02 added this to the 1.2.0 milestone on May 10, 2024
aressu1985 (Contributor) commented:
Update on 5.11
commit id: b5c2eaa (1.2-dev)

The same connection timeout also occurs in the tpcc 100 warehouse 1000 terminals test.

[screenshot]

job link:
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9035308966/job/24831847809

sukki37 assigned volgariver6 and unassigned matrix-meow on May 11, 2024
sukki37 (Contributor) commented May 11, 2024

volgariver6 assigned daviszhen and unassigned volgariver6 on May 11, 2024
volgariver6 (Contributor) commented:

@daviszhen will help to fix it.

daviszhen (Contributor) commented:

It was fixed.

sukki37 assigned reusee and unassigned daviszhen on May 14, 2024
reusee (Contributor) commented May 15, 2024

Looking at the metrics: at 17:09:00 the p99 apply-latency was 23.9s while a single apply took 4.66ms, which means there was a very large number of logtails.
Between 17:08:00 and 17:08:30 the apply queue reached 4.81k and 8.22k.
By 17:09:00 the apply queue had dropped to 4, meaning the logtails had been fully consumed. Draining 8.22k entries in 30 seconds averages roughly 3ms per logtail, which is within the normal range.
So the question is why there was such a sudden burst of logtails.

One observation: logtail collect duration started fluctuating at 17:01:30. It spiked at 17:01:30, then fell into a trough starting at 17:03:30, and that trough lasted until 17:10:00. This window lines up exactly with the logtail count spike on the CN described above.
One hypothesis is that the DN accumulated a large number of logtails during this period and then sent them in one burst, producing a long queue on the CN side. What can be said with certainty is that the DN's anomalous fluctuation preceded the CN's.

At 17:03:30 the cumulative logtail consume time on the CN side reached 8.55s, and the apply queue at that moment was 1.83k, which again averages roughly 4ms per logtail, so the consumption rate was normal. The open question is still the sudden burst of logtails.
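To make the drain-rate arithmetic explicit, here is a minimal sketch (Go, matching the codebase); the constants are simply the queue depths and time windows quoted from the metrics above, not values read from the system:

```go
package main

import "fmt"

func main() {
	// 17:08:30 -> 17:09:00: the apply queue drained from ~8.22k
	// entries down to ~4 in roughly 30 seconds.
	const drained = 8220.0       // logtail entries consumed in the window
	const windowMs = 30 * 1000.0 // drain window, in milliseconds
	fmt.Printf("avg apply: %.2f ms/logtail\n", windowMs/drained) // ~3.65 ms

	// 17:03:30: cumulative consume time of 8.55s against an apply
	// queue that was 1.83k entries deep at that moment.
	fmt.Printf("avg apply: %.2f ms/logtail\n", 8550.0/1830.0) // ~4.67 ms
}
```

Both figures land in the same few-millisecond band, which is consistent with the conclusion that per-logtail consumption stayed healthy and the real anomaly was the size of the burst itself.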

aressu1985 (Contributor) commented:

Long-running goroutine file:

Uploading routine_018f93c1-1e9f-76b1-b13e-b52300b44cd7.txt…
