Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The use of JSON struct in clickhouse results in high storage space consumption #481

Open
shenqidebaozi opened this issue Mar 20, 2024 · 15 comments
Labels
enhancement New feature or request help wanted Extra attention is needed needs testing

Comments

@shenqidebaozi
Copy link

shenqidebaozi commented Mar 20, 2024

For the same 10 million traces, qryn requires 18GB of storage, while uptrace only requires 4GB, which seems to be due to the inability to optimize the payload use of JSON.

`CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
oid String DEFAULT '0',
trace_id String,
span_id String,
parent_id String,
name String,
timestamp_ns Int64 CODEC(DoubleDelta),
duration_ns Int64,
service_name String,
payload_type Int8,
payload String,
tags Array(Tuple(String, String))
) Engine=Null`,

@akvlad
Copy link
Collaborator

akvlad commented Mar 20, 2024

@shenqidebaozi do you have any performance comparison of the traces search procedure?

@shenqidebaozi
Copy link
Author

@akvlad Currently, in the research of different products, there is no comparison of trace search performance. But I think storage costs are also important.

@akvlad
Copy link
Collaborator

akvlad commented Mar 20, 2024

@shenqidebaozi according to your opinion. How many GB of HDD can be a completely equal replacement of 1 CPU core?
In terms of price.

@shenqidebaozi
Copy link
Author

This is a good question, I don't know how to measure it.
But for me, these projects can run with the same configuration, so I hope qryn can take up less space. I briefly reviewed the code and found that Payload seems to have not been actually used.

@lmangani
Copy link
Collaborator

If there's anything really unused it can be avoided but I'm not sure that's the case. Compression and codec choices might also play a vital role and should be carefully reviewed.

@lmangani lmangani added enhancement New feature or request help wanted Extra attention is needed needs testing labels Mar 20, 2024
@gaby
Copy link

gaby commented Mar 21, 2024

This could be because Uptrace uses zstd compression by default with Clickhouse. Does qryn allow specifying compression?

qryn seems to be using zstd in only 3-4 fields, which explains the difference in size. Having an option to allow using zstd whenever possible, would reduce disk usage substantially.

It would be useful to have an ENV for specifying compression algorithm and Level in qryn. For example the default zstd level is 1 compared to 3 when using zstd cli.

https://clickhouse.com/docs/en/sql-reference/statements/create/table#:~:text=By%20default%2C%20ClickHouse%20applies%20lz4,and%20zstd%20in%20ClickHouse%20Cloud.

@gaby
Copy link

gaby commented Mar 21, 2024

This query shows that uptrace lets you configure compression type/level and that gets appended to the clickhouse schema.

https://github.com/search?q=repo%3Auptrace%2Fuptrace%20ch_schema&type=code

@shenqidebaozi
Copy link
Author

select
    sum(rows) as row,
    formatReadableSize(sum(data_uncompressed_bytes)) as ysq,
    formatReadableSize(sum(data_compressed_bytes)) as ysh,
    round(sum(data_compressed_bytes) / sum(data_uncompressed_bytes) * 100, 0) ys_rate
from system.parts;
image image

@shenqidebaozi
Copy link
Author

shenqidebaozi commented Mar 21, 2024

`CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
oid String DEFAULT '0',
trace_id String,
span_id String,
parent_id String,
name String,
timestamp_ns Int64 CODEC(DoubleDelta),
duration_ns Int64,
service_name String,
payload_type Int8,
payload String,
tags Array(Tuple(String, String))
) Engine=Null`,

What is the specific purpose of the payload field? It saved the original trace information, but I don't seem to have found the usage of this field. In addition, this field has added an additional JSON serialization in both qryn and qryn otel-collector, which incurs additional performance overhead.

https://github.com/metrico/otel-collector/blob/a7399e20d0915ceae5986de50178e6c24e09d635/exporter/qrynexporter/traces.go#L320

this.payload = JSON.stringify(obj)

For the second question, should we define payload as ClickHouse Nested so that JSON marshal into a string is not necessary. Also, may there be better compression effects?

@gaby
Copy link

gaby commented Mar 21, 2024

Ths only downsize of using Nested is that it makes the field more strict than a string.

Also worth mentioning fields like: service_name, parent_id, payload_type, payload, tags are all highly redundant and would benefit from compression.

@lmangani
Copy link
Collaborator

@gaby we absolutely want compression choices to be as open as possible for experimenting. We could work on a set of ALTER statements we can use to experiment with.

@gaby
Copy link

gaby commented Mar 21, 2024

@lmangani That would be a good starting point, or updating the CREATE TABLE and testing with a big data set to see the difference in size/performance. Compression will add ltency and reduce throughput thus why it should be configurable.

@shenqidebaozi
Copy link
Author

shenqidebaozi commented Mar 22, 2024

Ths only downsize of using Nested is that it makes the field more strict than a string.

Also worth mentioning fields like: service_name, parent_id, payload_type, payload, tags are all highly redundant and would benefit from compression.

@gaby can also reduce JSON marshal、unmarshal once,this is helpful for bulk write and query

@gaby
Copy link

gaby commented Apr 10, 2024

According to ChatGPT the same Create Table SQL would look like:

CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} ( 
     oid String DEFAULT '0' CODEC(ZSTD), 
     trace_id String CODEC(ZSTD), 
     span_id String CODEC(ZSTD), 
     parent_id String CODEC(ZSTD), 
     name String CODEC(ZSTD), 
     timestamp_ns Int64 CODEC(DoubleDelta, ZSTD), 
     duration_ns Int64 CODEC(ZSTD), 
     service_name String CODEC(ZSTD), 
     payload_type Int8 CODEC(ZSTD), 
     payload String CODEC(ZSTD), 
     tags Array(Tuple(String, String)) CODEC(ZSTD) 
) Engine=Null

When asked for adding levels based on field type it producss the following:

CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} ( 
     oid String DEFAULT '0' CODEC(ZSTD(1)), 
     trace_id String CODEC(ZSTD(3)),  -- Likely to benefit from more compression
     span_id String CODEC(ZSTD(3)),   -- Likely to benefit from more compression
     parent_id String CODEC(ZSTD(1)), 
     name String CODEC(ZSTD(1)), 
     timestamp_ns Int64 CODEC(DoubleDelta, ZSTD(3)),  -- Larger data size, benefits from more compression
     duration_ns Int64 CODEC(ZSTD(3)),  -- Larger data size, benefits from more compression
     service_name String CODEC(ZSTD(1)), 
     payload_type Int8 CODEC(ZSTD(1)), 
     payload String CODEC(ZSTD(3)),  -- Assuming payloads can be large/structured, they might benefit more
     tags Array(Tuple(String, String)) CODEC(ZSTD(1)) 
) Engine=Null

@lmangani
Copy link
Collaborator

@gaby let us know how this plays out and if it produces a visible effect we can most definitely implement options to trigger it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request help wanted Extra attention is needed needs testing
Projects
None yet
Development

No branches or pull requests

4 participants