Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Significant number of Clickhouse exceptions #4931

Open
byronwall opened this issue Apr 29, 2024 · 9 comments
Open

Significant number of Clickhouse exceptions #4931

byronwall opened this issue Apr 29, 2024 · 9 comments

Comments

@byronwall
Copy link

byronwall commented Apr 29, 2024

Bug description

The signoz-clickhouse container is filled with tons of logs related to failed background tasks. Most errors appear to include the same language (example below). The main problem for me is that these logs or exceptions appear to take significant disk space. If it's possible to reduce their disk impact, I wouldn't care.

Important part of exceptions is probably: Code: 117. DB::Exception: Invalid flag for negative store:
[...] in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286)

Expected behavior

Exceptions should probably not happen as frequently. If they do, they should not fill the disk?

How to reproduce

  1. Nothing in particular, I am just running the docker containers.
  2. I started in version 0.35 or so and have been migrating as updates come out.

Version information

  • Signoz version: 0.44, started earlier though
  • Browser version: Chrome latest (issue is backend/docker though)
  • Your OS and version: Ubuntu
  • Your CPU Architecture(ARM/Intel): x86_64

Additional context

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

I added line breaks so you don't have to scroll this:

2024.04.29 12:30:55.663013 [ 634 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task 
{2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_9019_9082_3}: Code: 117. DB::Exception: Invalid flag for negative store: 
(while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_9019_9057_2/ in 
table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 
with max_rows_to_read = 286): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying 
this message, always include the lines below):


0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000c800f1b in /usr/bin/clickhouse
1. DB::Exception::Exception<>(int, FormatStringHelperImpl<>) @ 0x000000000721a243 in /usr/bin/clickhouse
2. DB::DDSketchDenseLogarithmic::deserialize(DB::ReadBuffer&) @ 0x000000000d5260e0 in /usr/bin/clickhouse
3. DB::SerializationAggregateFunction::deserializeBinaryBulk(DB::IColumn&, DB::ReadBuffer&, unsigned long, double) const
 @ 0x00000000108b58e9 in /usr/bin/clickhouse
4. DB::ISerialization::deserializeBinaryBulkWithMultipleStreams(COW<DB::IColumn>::immutable_ptr<DB::IColumn>&, unsigned long, DB::ISerialization::DeserializeBinaryBulkSettings&, std::shared_ptr<DB::ISerialization::DeserializeBinaryBulkState>&, std::unordered_map<String, COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::hash<String>, std::equal_to<String>, std::allocator<std::pair<String const, COW<DB::IColumn>::immutable_ptr<DB::IColumn>>>>*) const @ 0x00000000108b01d9 in /usr/bin/clickhouse
5. DB::MergeTreeReaderCompact::readRows(unsigned long, unsigned long, bool, unsigned long, std::vector<COW<DB::IColumn>::immutable_ptr<DB::IColumn>, std::allocator<COW<DB::IColumn>::immutable_ptr<DB::IColumn>>>&) @ 0x000000001250a6b9 in /usr/bin/clickhouse
6. DB::MergeTreeSequentialSource::generate() @ 0x000000001251ca4d in /usr/bin/clickhouse
7. DB::ISource::tryGenerate() @ 0x000000001297acf5 in /usr/bin/clickhouse
8. DB::ISource::work() @ 0x000000001297a743 in /usr/bin/clickhouse
9. DB::ExecutionThreadContext::executeTask() @ 0x000000001299371a in /usr/bin/clickhouse
10. DB::PipelineExecutor::executeStepImpl(unsigned long, std::atomic<bool>*) @ 0x000000001298a170 in /usr/bin/clickhouse
11. DB::PipelineExecutor::executeStep(std::atomic<bool>*) @ 0x0000000012989928 in /usr/bin/clickhouse
12. DB::PullingPipelineExecutor::pull(DB::Chunk&) @ 0x0000000012998017 in /usr/bin/clickhouse
13. DB::PullingPipelineExecutor::pull(DB::Block&) @ 0x00000000129981d3 in /usr/bin/clickhouse
14. DB::MergeTask::ExecuteAndFinalizeHorizontalPart::executeImpl() @ 0x000000001233b6f2 in /usr/bin/clickhouse
15. DB::MergeTask::ExecuteAndFinalizeHorizontalPart::execute() @ 0x000000001233b64b in /usr/bin/clickhouse
16. DB::MergeTask::execute() @ 0x0000000012340d99 in /usr/bin/clickhouse
17. DB::MergePlainMergeTreeTask::executeStep() @ 0x0000000012723517 in /usr/bin/clickhouse
18. DB::MergeTreeBackgroundExecutor<DB::DynamicRuntimeQueue>::threadFunction() @ 0x00000000123532c4 in /usr/bin/clickhouse
19. ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::worker(std::__list_iterator<ThreadFromGlobalPoolImpl<false>, void*>) @ 0x000000000c8eb0c1 in /usr/bin/clickhouse
20. void std::__function::__policy_invoker<void ()>::__call_impl<std::__function::__default_alloc_func<ThreadFromGlobalPoolImpl<false>::ThreadFromGlobalPoolImpl<void ThreadPoolImpl<ThreadFromGlobalPoolImpl<false>>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>(void&&)::'lambda'(), void ()>>(std::__function::__policy_storage const*) @ 0x000000000c8ee8fa in /usr/bin/clickhouse
21. void* std::__thread_proxy[abi:v15000]<std::tuple<std::unique_ptr<std::__thread_struct, std::default_delete<std::__thread_struct>>, void ThreadPoolImpl<std::thread>::scheduleImpl<void>(std::function<void ()>, Priority, std::optional<unsigned long>, bool)::'lambda0'()>>(void*) @ 0x000000000c8ed6fe in /usr/bin/clickhouse
22. ? @ 0x00007f6c29757609
23. ? @ 0x00007f6c2967c353

Thank you for your bug report – we love squashing them!

Copy link

welcome bot commented Apr 29, 2024

Thanks for opening this issue. A team member should give feedback soon. In the meantime, feel free to check out the contributing guidelines.

@srikanthccv
Copy link
Member

srikanthccv commented Apr 29, 2024

This shouldn't be happening unless some data is corrupted. Can you share the number of records in the signoz_metrics.exp_hist table? Do you have negative latencies reported by your application for some reason?

@byronwall
Copy link
Author

byronwall commented Apr 29, 2024

I ran some queries to hopefully help you out. I do not know if my application is returning negative latencies. The query below maybe says "no" if I understand the result. Having said that, I am not intentionally trying to return a negative latency. I am mostly using SigNoz to understand spans visually and review exceptions, so I don't spend any time looking at latency.

Let me know if you want me to run any more queries on the database. The ones below were guesses based on your question and ChatGPT.

docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) FROM signoz_metrics.exp_hist"
97501

Command:

docker exec -it signoz-clickhouse clickhouse-client --query="SELECT COUNT(*) AS total_records, SUM(count) AS total_count, MIN(min) AS minimum_latency, MAX(max) AS maximum_latency, MIN(max) AS minimum_of_max, MAX(min) AS maximum_of_min FROM signoz_metrics.exp_hist WHERE metric_name = 'signoz_latency'"

Results:

total_records total_count minimum_latency maximum_latency minimum_of_max maximum_of_min
97501 7388390 0 54821048.4397 0 24613273.5689

@srikanthccv
Copy link
Member

Thanks. I am not sure what could be the reason. This exception is thrown when the serialized data is incorrect and can't be deserialized. From the logs containing the part, does it show any day other than 20240322?

@srikanthccv
Copy link
Member

Did start seeing them after after any specific update? What is your ClickHouse version?

@srikanthccv
Copy link
Member

There is no change in the binary format. I suspect it more likely some data corruption issue or HW issue.

@byronwall
Copy link
Author

byronwall commented Apr 29, 2024

With respect to dates, it appears that most dates are impacted:

docker logs signoz-clickhouse 2>&1 | grep 'negative store' | grep -oE '([0-9]{8})_' | awk -F "_" '{print $1}' | sort -u

Gives: 20240227, 20240228, 20240305, 20240306, 20240308, 20240311, 20240313, 20240314, 20240315, 20240318, 20240319, 20240320, 20240321, 20240322, 20240325, 20240327, 20240329, 20240404, 20240419, 20240422

My ClickHouse version:

docker exec signoz-clickhouse clickhouse-client --version
ClickHouse client version 24.1.2.5 (official build).

Regarding version where it may have happened, my guess it that it was sometime after upgrading to 0.38.2? That's just a guess based on the updates I went through. SigNoz was definitely OK for some duration and then these logs started filling up after those updates. I had to provisioning a bigger drive at some point and now that bigger one is filling up so I need to do something different.

image

For a sample of the full logs:

docker logs signoz-clickhouse 2>&1 | grep 'negative store' | head
2024.04.29 17:27:14.230807 [ 617 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.299663 [ 620 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8715_8787_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8715_8743_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 274): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.300471 [ 634 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_9019_9082_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_9019_9057_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 286): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301010 [ 622 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240308_3932_4014_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240308_3932_3974_2/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 210): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.301457 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240227_110_170_3}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240227_146_150_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 33): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.302400 [ 627 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240306_2855_2897_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240306_2855_2873_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 182): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.307170 [ 624 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240322_8885_8925_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240322_8885_8905_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 203): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:14.383201 [ 629 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240305_2341_2395_4}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240305_2358_2376_3/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 55): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.285638 [ 626 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240321_8209_8241_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240321_8209_8225_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 129): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):
2024.04.29 17:27:16.385360 [ 618 ] {} <Error> MergeTreeBackgroundExecutor: Exception while executing background task {2b326032-b962-42d9-ab8c-bf5431fcecfa::20240319_7342_7369_2}: Code: 117. DB::Exception: Invalid flag for negative store: (while reading from part /var/lib/clickhouse/store/2b3/2b326032-b962-42d9-ab8c-bf5431fcecfa/20240319_7342_7355_1/ in table signoz_metrics.exp_hist (2b326032-b962-42d9-ab8c-bf5431fcecfa) located on disk default of type local, from mark 0 with max_rows_to_read = 83): While executing MergeTreeSequentialSource. (INCORRECT_DATA), Stack trace (when copying this message, always include the lines below):

@byronwall
Copy link
Author

Assuming some sort of corruption, is there a simple way to clear out the bad data or other ClickHouse issues? I am not concerned if I lose data. I am more concerned about the disk filling up again and losing the server.

If possible, it'd be nice to maintain the SigNoz settings, but I could rebuild those if that's needed.

Thanks for the quick replies and feedback. I really appreciate the support.

@srikanthccv
Copy link
Member

It would help if we could find a way to reproduce this. If you are not using the signoz_latency metric, you can delete the data from this table and set this setting to false

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants