Significant number of Clickhouse exceptions #4931

byronwall · 2024-04-29T12:37:46Z

Bug description

The signoz-clickhouse container is filled with tons of logs related to failed background tasks. Most errors appear to include the same language (example below). The main problem for me is that these logs or exceptions appear to take significant disk space. If it's possible to reduce their disk impact, I wouldn't care.

Important part of exceptions is probably: Code: 117. DB::Exception: Invalid flag for negative store:
[...] in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286)

Expected behavior

Exceptions should probably not happen as frequently. If they do, they should not fill the disk?

How to reproduce

Nothing in particular, I am just running the docker containers.
I started in version 0.35 or so and have been migrating as updates come out.

Version information

Signoz version: 0.44, started earlier though
Browser version: Chrome latest (issue is backend/docker though)
Your OS and version: Ubuntu
Your CPU Architecture(ARM/Intel): x86_64

Additional context

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

I added line breaks so you don't have to scroll this:

2024.04.29 12:30:55.663013 [ 634 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task 
{2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_9019_9082_3}: Code: 117. DB::Exception: Invalid flag for negative store: 
(while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_9019_9057_2/ in 
table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 
with max_rows_to_read = 286): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying 
this message, always include the lines below):


0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c800f1b in /usr/bin/clickhouse
1. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000721a243 in /usr/bin/clickhouse
2. DB::DDSketchDenseLogarithmic::deserialize(DB::ReadBuffer&) @ 0x000000000d5260e0 in /usr/bin/clickhouse
3. DB::SerializationAggregateFunction::deserializeBinaryBulk(DB::IColumn&, DB::ReadBuffer&, unsigned long, double) const
 @ 0x00000000108b58e9 in /usr/bin/clickhouse
4. DB::ISerialization::deserializeBinaryBulkWithMultipleStreams(COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, DB::ISerialization::DeserializeBinaryBulkSettings&, std::shared_ptr<DB::ISerialization::DeserializeBinaryBulkState>&, std::unordered_map<String, COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::hash<String>, std::equal_to<String>, std::allocator<std::pair<String const, COW<DB::IColumn>::immutable_ptr<DB::IColumn>>>>*) const @ 0x00000000108b01d9 in /usr/bin/clickhouse
5. DB::MergeTreeReaderCompact::readRows(unsigned long, unsigned long, bool, unsigned long, std::vector<COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::immutable_ptr<DB::IColumn>>>&) @ 0x000000001250a6b9 in /usr/bin/clickhouse
6. DB::MergeTreeSequentialSource::generate() @ 0x000000001251ca4d in /usr/bin/clickhouse
7. DB::ISource::tryGenerate() @ 0x000000001297acf5 in /usr/bin/clickhouse
8. DB::ISource::work() @ 0x000000001297a743 in /usr/bin/clickhouse
9. DB::ExecutionThreadContext::executeTask() @ 0x000000001299371a in /usr/bin/clickhouse
10. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000001298a170 in /usr/bin/clickhouse
11. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x0000000012989928 in /usr/bin/clickhouse
12. DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x0000000012998017 in /usr/bin/clickhouse
13. DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x00000000129981d3 in /usr/bin/clickhouse
14. DB::MergeTask::ExecuteAndFinalizeHorizontalPart::executeImpl() @ 0x000000001233b6f2 in /usr/bin/clickhouse
15. DB::MergeTask::ExecuteAndFinalizeHorizontalPart::execute() @ 0x000000001233b64b in /usr/bin/clickhouse
16. DB::MergeTask::execute() @ 0x0000000012340d99 in /usr/bin/clickhouse
17. DB::MergePlainMergeTreeTask::executeStep() @ 0x0000000012723517 in /usr/bin/clickhouse
18. DB::MergeTreeBackgroundExecutor<DB::DynamicRuntimeQueue>::threadFunction() @ 0x00000000123532c4 in /usr/bin/clickhouse
19. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0x000000000c8eb0c1 in /usr/bin/clickhouse
20. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000c8ee8fa in /usr/bin/clickhouse
21. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000c8ed6fe in /usr/bin/clickhouse
22. ? @ 0x00007f6c29757609
23. ? @ 0x00007f6c2967c353

Thank you for your bug report – we love squashing them!

The text was updated successfully, but these errors were encountered:

welcome · 2024-04-29T12:37:51Z

Thanks for opening this issue. A team member should give feedback soon. In the meantime, feel free to check out the contributing guidelines.

srikanthccv · 2024-04-29T13:45:47Z

This shouldn't be happening unless some data is corrupted. Can you share the number of records in the signoz_metrics.exp_hist table? Do you have negative latencies reported by your application for some reason?

byronwall · 2024-04-29T16:29:06Z

I ran some queries to hopefully help you out. I do not know if my application is returning negative latencies. The query below maybe says "no" if I understand the result. Having said that, I am not intentionally trying to return a negative latency. I am mostly using SigNoz to understand spans visually and review exceptions, so I don't spend any time looking at latency.

Let me know if you want me to run any more queries on the database. The ones below were guesses based on your question and ChatGPT.

docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) FROM signoz_metrics.exp_hist"
97501

Command:

docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) AS total_records, SUM(count) AS total_count, MIN(min) AS minimum_latency, MAX(max) AS maximum_latency, MIN(max) AS minimum_of_max, MAX(min) AS maximum_of_min FROM signoz_metrics.exp_hist WHERE metric_name = 'signoz_latency'"

Results:

total_records	total_count	minimum_latency	maximum_latency	minimum_of_max	maximum_of_min
97501	7388390	0	54821048.4397	0	24613273.5689

srikanthccv · 2024-04-29T16:59:19Z

Thanks. I am not sure what could be the reason. This exception is thrown when the serialized data is incorrect and can't be deserialized. From the logs containing the part, does it show any day other than 20240322?

srikanthccv · 2024-04-29T17:01:26Z

Did start seeing them after after any specific update? What is your ClickHouse version?

srikanthccv · 2024-04-29T17:15:33Z

There is no change in the binary format. I suspect it more likely some data corruption issue or HW issue.

byronwall · 2024-04-29T17:47:04Z

With respect to dates, it appears that most dates are impacted:

docker logs signoz-clickhouse 2>&1 | grep 'negative store' | grep -oE '([0-9]{8})_' | awk -F "_" '{print $1}' | sort -u

Gives: 20240227, 20240228, 20240305, 20240306, 20240308, 20240311, 20240313, 20240314, 20240315, 20240318, 20240319, 20240320, 20240321, 20240322, 20240325, 20240327, 20240329, 20240404, 20240419, 20240422

My ClickHouse version:

docker exec signoz-clickhouse clickhouse-client --version
ClickHouse client version 24.1.2.5 (official build).

Regarding version where it may have happened, my guess it that it was sometime after upgrading to 0.38.2? That's just a guess based on the updates I went through. SigNoz was definitely OK for some duration and then these logs started filling up after those updates. I had to provisioning a bigger drive at some point and now that bigger one is filling up so I need to do something different.

For a sample of the full logs:

docker logs signoz-clickhouse 2>&1 | grep 'negative store' | head
2024.04.29 17:27:14.230807 [ 617 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.299663 [ 620 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8715_8787_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8715_8743_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 274): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.300471 [ 634 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_9019_9082_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_9019_9057_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301010 [ 622 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240308_3932_4014_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240308_3932_3974_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 210): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301457 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240227_110_170_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240227_146_150_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 33): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.302400 [ 627 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240306_2855_2897_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240306_2855_2873_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 182): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.307170 [ 624 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8885_8925_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8885_8905_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 203): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.383201 [ 629 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240305_2341_2395_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240305_2358_2376_3/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 55): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.285638 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.385360 [ 618 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240319_7342_7369_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240319_7342_7355_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 83): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):

byronwall · 2024-04-29T17:48:27Z

Assuming some sort of corruption, is there a simple way to clear out the bad data or other ClickHouse issues? I am not concerned if I lose data. I am more concerned about the disk filling up again and losing the server.

If possible, it'd be nice to maintain the SigNoz settings, but I could rebuild those if that's needed.

Thanks for the quick replies and feedback. I really appreciate the support.

srikanthccv · 2024-04-30T01:36:08Z

It would help if we could find a way to reproduce this. If you are not using the signoz_latency metric, you can delete the data from this table and set this setting to false

signoz/deploy/docker/clickhouse-setup/otel-collector-config.yaml

Line 104 in 5069166

enable_exp_histogram: true

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Significant number of Clickhouse exceptions #4931

Significant number of Clickhouse exceptions #4931

byronwall commented Apr 29, 2024 •

edited

welcome bot commented Apr 29, 2024

srikanthccv commented Apr 29, 2024 •

edited

byronwall commented Apr 29, 2024 •

edited

srikanthccv commented Apr 29, 2024

srikanthccv commented Apr 29, 2024

srikanthccv commented Apr 29, 2024

byronwall commented Apr 29, 2024 •

edited

byronwall commented Apr 29, 2024

srikanthccv commented Apr 30, 2024

Significant number of Clickhouse exceptions #4931

Significant number of Clickhouse exceptions #4931

Comments

byronwall commented Apr 29, 2024 • edited

Bug description

Expected behavior

How to reproduce

Version information

Additional context

Thank you for your bug report – we love squashing them!

welcome bot commented Apr 29, 2024

srikanthccv commented Apr 29, 2024 • edited

byronwall commented Apr 29, 2024 • edited

srikanthccv commented Apr 29, 2024

srikanthccv commented Apr 29, 2024

srikanthccv commented Apr 29, 2024

byronwall commented Apr 29, 2024 • edited

byronwall commented Apr 29, 2024

srikanthccv commented Apr 30, 2024

byronwall commented Apr 29, 2024 •

edited

srikanthccv commented Apr 29, 2024 •

edited

byronwall commented Apr 29, 2024 •

edited

byronwall commented Apr 29, 2024 •

edited