Shared-data cluster: Query Profile metric accuracy problems #45669
Comments
The instance count is 6, and each pipeline also has its own degree of parallelism: the parallelism is 4, so there are 4 * 6 IO tasks in total.

The instance number is not the smallest concurrency granularity in StarRocks; every instance uses multiple threads from the IO thread pool to execute IO tasks.
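For reference, here is a minimal sketch of how one can inspect and pin these two knobs from a MySQL client. This assumes pipeline_dop is the session variable controlling per-instance pipeline parallelism, as in recent StarRocks releases:

```sql
-- Show the current pipeline degree of parallelism for this session.
-- A value of 0 means StarRocks chooses it adaptively.
SHOW VARIABLES LIKE 'pipeline_dop';

-- Pin it explicitly, e.g. to 4; a fragment with 6 instances would then
-- schedule on the order of 6 * 4 = 24 concurrent IO tasks.
SET pipeline_dop = 4;
```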
Got it! So, according to the profile results below, can we surmise that there is a data skew problem, whether or not it hits the block cache?

A demo that hits the cache:
- IOTimeLocalDisk: 355.869ms
- __MAX_OF_IOTimeLocalDisk: 25s710ms
- __MIN_OF_IOTimeLocalDisk: 51.257us
- IOTimeRemote: 0ns

A demo that misses the cache:
- IOTimeLocalDisk: 6.265ms
- __MAX_OF_IOTimeLocalDisk: 70.878ms
- __MIN_OF_IOTimeLocalDisk: 0ns
- IOTimeRemote: 4s430ms
- __MAX_OF_IOTimeRemote: 55s242ms
- __MIN_OF_IOTimeRemote: 0ns

FYI, we enabled the Block Cache by setting starlet_use_star_cache=true.
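For anyone trying to reproduce this, a minimal sketch of how such per-query profiles can be collected (enable_profile is a session variable and last_query_id() a built-in function in StarRocks; starlet_use_star_cache, by contrast, is set in the CN/BE config file of a shared-data cluster, not per session):

```sql
-- Enable runtime profile collection for queries in this session.
SET enable_profile = true;

-- Run the slow query here, then grab its query ID so the profile
-- can be looked up in the FE web UI (or via SHOW PROFILELIST on
-- versions that support it).
SELECT last_query_id();
```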
Background
We tried to use StarRocks to replace Elasticsearch for building our logging platform. During testing, we found that query latency was very high, so we used the query profile to locate the problem.
Problem description
With this query profile result, it looks like there is a data skew problem, but something seems wrong with the metric data.
According to the StarRocks documentation, InstanceNum means the number of all FragmentInstances for this Fragment, so this Fragment has 6 FragmentInstances. But the IOTime metrics above do not seem correct: the max IOTime divided by 6 is still much larger than the average IOTime.
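To make that concrete with the cache-hit numbers above: if the average were taken over 6 instances, it could not be smaller than the maximum divided by 6, i.e. at least 25s710ms / 6 ≈ 4.3s, yet the reported average IOTimeLocalDisk is 355.869ms. Even averaging over all 6 * 4 = 24 IO tasks would put the floor around 1.07s. So the averaging denominator must be far larger than the instance count, which is consistent with the earlier answer that IO tasks, not instances, are the unit of concurrency here.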
StarRocks version