
[Bug]: [date 5.9] tke regression: tpcc 1000warehouse 1000threads connection timeout #15960

Open

heni02 opened this issue May 10, 2024 · 15 comments
Assignees: reusee
Labels: kind/bug Something isn't working · severity/s0 Extreme impact: Cause the application to break down and seriously affect the use
Milestone: 1.2.0

heni02 (Contributor) commented May 10, 2024

Is there an existing issue for the same bug?

  • I have checked the existing issues.

Branch Name

main

Commit ID

1aed8d9

Other Environment Information

- Hardware parameters:
- OS type:
- Others:

Actual Behavior

job:
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9019459807/job/24803217074
[WeCom screenshot]

The tke environment did not restart.
[WeCom screenshot]

mo log:
https://grafana.ci.matrixorigin.cn/explore?panes=%7B%22fzn%22:%7B%22datasource%22:%22loki%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bnamespace%3D%5C%22mo-nightly-regression-20240509%5C%22%7D%20%7C%3D%20%60%60%22,%22queryType%22:%22range%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22loki%22%7D,%22editorMode%22:%22builder%22%7D%5D,%22range%22:%7B%22from%22:%221715283409000%22,%22to%22:%221715283420000%22%7D%7D%7D&schemaVersion=1&orgId=1

profile:
timeout_profile.tar.gz (attached)

Expected Behavior

No response

Steps to Reproduce

Run the tke nightly-regression tpcc test with 1000 warehouses and 1000 threads (a sketch of an equivalent BenchmarkSQL-style configuration follows below).
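mo-tpcc follows BenchmarkSQL conventions, so a load of this shape is normally described by a properties file roughly like the sketch below. The key names are standard BenchmarkSQL properties; the host, credentials, and run length are placeholders, since the actual config used by the nightly job is not attached to this issue.

```
# Sketch of a BenchmarkSQL-style props file for this load (values illustrative,
# not the nightly job's real config).
conn=jdbc:mysql://<mo-host>:6001/tpcc
user=<user>
password=<password>
warehouses=1000
terminals=1000
runMins=60
limitTxnsPerMin=0
```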

Additional information

No response

heni02 added the kind/bug, needs-triage, and severity/s0 labels on May 10, 2024
heni02 added this to the 1.2.0 milestone on May 10, 2024
aressu1985 (Contributor) commented:
Update on 5.11
commit id: b5c2eaa (1.2-dev)

The same connection timeout also occurs in the tpcc 100 warehouse 1000 terminals test.

[screenshot]

job link:
https://github.com/matrixorigin/mo-nightly-regression/actions/runs/9035308966/job/24831847809

sukki37 assigned volgariver6 and unassigned matrix-meow on May 11, 2024
sukki37 (Contributor) commented May 11, 2024

volgariver6 assigned daviszhen and unassigned volgariver6 on May 11, 2024
volgariver6 (Contributor) commented:

@daviszhen will help to fix it.

daviszhen (Contributor) commented:

It was fixed.

sukki37 assigned reusee and unassigned daviszhen on May 14, 2024
reusee (Contributor) commented May 15, 2024

Looking at the metrics: at 17:09:00 the p99 apply-latency was 23.9s while a single apply took 4.66ms, which means there was a very large number of logtails.
Between 17:08:00 and 17:08:30 the apply queue reached 4.81k and 8.22k.
By 17:09:00 the apply queue had dropped to 4, meaning the logtails had been fully consumed. Draining 8.22k entries in 30 seconds averages roughly 3ms per logtail, which is within the normal range.
So the question is why there was such a sudden burst of logtails.

One observation: logtail collect duration started fluctuating at 17:01:30. It spiked at 17:01:30, then fell into a trough starting at 17:03:30, and that trough lasted until 17:10:00. This window lines up exactly with the logtail count spike on the CN described above.
One hypothesis is that the DN accumulated a large number of logtails during this period and then sent them in one burst, producing a long queue on the CN side. What can be said with certainty is that the DN's anomalous fluctuation preceded the CN's.

At 17:03:30 the cumulative logtail consume time on the CN side reached 8.55s, and the apply queue at that moment was 1.83k, which again averages roughly 4ms per logtail, so the consumption rate was normal. The open question is still the sudden burst of logtails.
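To make the drain-rate arithmetic explicit, here is a minimal sketch (Go, matching the codebase); the constants are simply the queue depths and time windows quoted from the metrics above, not values read from the system:

```go
package main

import "fmt"

func main() {
	// 17:08:30 -> 17:09:00: the apply queue drained from ~8.22k
	// entries down to ~4 in roughly 30 seconds.
	const drained = 8220.0       // logtail entries consumed in the window
	const windowMs = 30 * 1000.0 // drain window, in milliseconds
	fmt.Printf("avg apply: %.2f ms/logtail\n", windowMs/drained) // ~3.65 ms

	// 17:03:30: cumulative consume time of 8.55s against an apply
	// queue that was 1.83k entries deep at that moment.
	fmt.Printf("avg apply: %.2f ms/logtail\n", 8550.0/1830.0) // ~4.67 ms
}
```

Both figures land in the same few-millisecond band, which is consistent with the conclusion that per-logtail consumption stayed healthy and the real anomaly was the size of the burst itself.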

aressu1985 (Contributor) commented:

Long-running goroutine file:

Uploading routine_018f93c1-1e9f-76b1-b13e-b52300b44cd7.txt…
