The use of JSON struct in clickhouse results in high storage space consumption #481

shenqidebaozi · 2024-03-20T02:55:18Z

For the same 10 million traces, qryn needs 18 GB of storage while Uptrace needs only 4 GB. This seems to be because the JSON stored in the payload column cannot be optimized.

qryn/lib/db/maintain/scripts.js

Lines 158 to 170 in b4cda9e

 `CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
  oid String DEFAULT '0',
  trace_id String,
  span_id String,
  parent_id String,
  name String,
  timestamp_ns Int64 CODEC(DoubleDelta),
  duration_ns Int64,
  service_name String,
  payload_type Int8,
  payload String,
  tags Array(Tuple(String, String))
  ) Engine=Null`,

akvlad · 2024-03-20T08:28:36Z

@shenqidebaozi do you have any performance comparison of the traces search procedure?

shenqidebaozi · 2024-03-20T08:40:03Z

@akvlad I am still evaluating different products and have not compared trace search performance yet. But I think storage cost is important too.

akvlad · 2024-03-20T08:42:46Z

@shenqidebaozi in your opinion, how many GB of HDD are an equal replacement for 1 CPU core, in terms of price?

shenqidebaozi · 2024-03-20T14:27:48Z

That is a good question; I don't know how to measure it.
For me, though, these projects all run on the same configuration, so I hope qryn can take up less space. I briefly reviewed the code, and the payload column does not seem to be actually used.

lmangani · 2024-03-20T14:36:02Z

If there's anything really unused it can be avoided but I'm not sure that's the case. Compression and codec choices might also play a vital role and should be carefully reviewed.

gaby · 2024-03-21T00:49:35Z

This could be because Uptrace uses zstd compression by default with ClickHouse. Does qryn allow specifying compression?

qryn seems to use zstd on only 3-4 fields, which may explain the difference in size. An option to use zstd wherever possible would reduce disk usage substantially.

It would be useful to have an ENV variable for specifying the compression algorithm and level in qryn. For example, ClickHouse's default zstd level is 1, whereas the zstd CLI defaults to 3.

https://clickhouse.com/docs/en/sql-reference/statements/create/table#:~:text=By%20default%2C%20ClickHouse%20applies%20lz4,and%20zstd%20in%20ClickHouse%20Cloud.
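A minimal sketch of how such a setting could be turned into a column codec clause. The variable names `TRACES_CODEC` and `TRACES_CODEC_LEVEL` are hypothetical and are not existing qryn options; the level ranges follow the ClickHouse documentation for ZSTD (1 to 22) and LZ4HC (1 to 12).

```javascript
// Hypothetical settings, for illustration only:
//   TRACES_CODEC        one of NONE, LZ4, LZ4HC, ZSTD
//   TRACES_CODEC_LEVEL  optional integer level for LZ4HC or ZSTD
const MAX_LEVEL = new Map([['ZSTD', 22], ['LZ4HC', 12]])
const NO_LEVEL = new Set(['NONE', 'LZ4'])

/**
 * Build a ClickHouse CODEC clause from environment-style settings.
 * Returns '' when nothing is configured, so the server default applies.
 */
function codecClause (env = process.env) {
  const name = (env.TRACES_CODEC || '').toUpperCase()
  if (!name) return ''
  if (!MAX_LEVEL.has(name) && !NO_LEVEL.has(name)) {
    throw new Error(`Unsupported codec: ${name}`)
  }
  const level = env.TRACES_CODEC_LEVEL
  // Codecs without a level, or no level given: emit the bare codec name.
  if (NO_LEVEL.has(name) || level === undefined || level === '') {
    return `CODEC(${name})`
  }
  const n = Number(level)
  if (!Number.isInteger(n) || n < 1 || n > MAX_LEVEL.get(name)) {
    throw new Error(`Invalid level for ${name}: ${level}`)
  }
  return `CODEC(${name}(${n}))`
}

// Example: append the clause to a column definition.
const payloadColumn = `payload String ${codecClause({ TRACES_CODEC: 'zstd', TRACES_CODEC_LEVEL: '3' })}`
console.log(payloadColumn) // payload String CODEC(ZSTD(3))
```

Validating the codec name and level before inlining them keeps malformed values out of the generated DDL.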

gaby · 2024-03-21T02:00:35Z

This code search shows that Uptrace lets you configure the compression type and level, which are then appended to the ClickHouse schema.

https://github.com/search?q=repo%3Auptrace%2Fuptrace%20ch_schema&type=code

shenqidebaozi · 2024-03-21T03:03:14Z

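-- Total rows plus uncompressed and compressed size across all active parts
select
    sum(rows) as row,
    formatReadableSize(sum(data_uncompressed_bytes)) as ysq,  -- size before compression
    formatReadableSize(sum(data_compressed_bytes)) as ysh,    -- size after compression
    round(sum(data_compressed_bytes) / sum(data_uncompressed_bytes) * 100, 0) ys_rate  -- compressed size as % of uncompressed
from system.parts
where active;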
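The query above reports totals for the whole server. To see which columns dominate, the same figures can be read per column from `system.columns`. The sketch below only builds the SQL text; the database and table names passed in the example (`qryn`, `tempo_traces`) are placeholders and should be replaced with the table that actually stores the traces.

```javascript
/**
 * Build a query listing the codec and the compressed and uncompressed
 * size of each column of one table, largest column first.
 */
function columnSizeQuery (database, table) {
  // Accept only plain identifiers, since the values are inlined into SQL.
  const ident = /^[A-Za-z_][A-Za-z0-9_]*$/
  if (!ident.test(database) || !ident.test(table)) {
    throw new Error('database and table must be plain identifiers')
  }
  return [
    'SELECT',
    '    name,',
    '    compression_codec,',
    '    formatReadableSize(data_uncompressed_bytes) AS uncompressed,',
    '    formatReadableSize(data_compressed_bytes) AS compressed,',
    '    round(data_compressed_bytes / data_uncompressed_bytes * 100, 0) AS ratio_pct',
    'FROM system.columns',
    `WHERE database = '${database}' AND table = '${table}'`,
    'ORDER BY data_compressed_bytes DESC'
  ].join('\n')
}

// Placeholder names, for illustration only.
console.log(columnSizeQuery('qryn', 'tempo_traces'))
```

Running the generated statement before and after a codec change would show whether payload really accounts for most of the difference.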

shenqidebaozi · 2024-03-21T03:15:20Z

qryn/lib/db/maintain/scripts.js

Lines 158 to 170 in b4cda9e

 `CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} (
  oid String DEFAULT '0',
  trace_id String,
  span_id String,
  parent_id String,
  name String,
  timestamp_ns Int64 CODEC(DoubleDelta),
  duration_ns Int64,
  service_name String,
  payload_type Int8,
  payload String,
  tags Array(Tuple(String, String))
  ) Engine=Null`,

What is the specific purpose of the payload field? It stores the original trace data, but I could not find where this field is used. In addition, filling it requires an extra JSON serialization in both qryn and the qryn otel-collector, which adds performance overhead.

https://github.com/metrico/otel-collector/blob/a7399e20d0915ceae5986de50178e6c24e09d635/exporter/qrynexporter/traces.go#L320

qryn/lib/db/otlp.js

Line 18 in b4cda9e

this.payload = JSON.stringify(obj)

A second question: should we define payload as a ClickHouse Nested column, so that marshaling JSON into a string is no longer necessary? It might also compress better.
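For context, a ClickHouse Nested column is written as parallel arrays of equal length, one per sub-column. The sketch below flattens a span's attributes into that shape instead of a single JSON string. The column name `attrs` and its `key`/`value` sub-columns are assumptions made for illustration; they are not part of the qryn schema.

```javascript
/**
 * Flatten an attributes object into the parallel arrays expected by a
 * hypothetical column `attrs Nested(key String, value String)`.
 */
function toNestedColumns (attributes, prefix = 'attrs') {
  const keys = []
  const values = []
  for (const [key, value] of Object.entries(attributes)) {
    keys.push(key)
    // Both sub-columns are String here, so non-string values still
    // have to be serialized individually.
    values.push(typeof value === 'string' ? value : JSON.stringify(value))
  }
  return { [`${prefix}.key`]: keys, [`${prefix}.value`]: values }
}

console.log(toNestedColumns({ 'http.method': 'GET', 'http.status_code': 200 }))
```

Every value has to fit the declared sub-column type, so a fully typed layout would need one sub-column per value type rather than the single String used here.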

gaby · 2024-03-21T11:54:38Z

The only downside of using Nested is that it makes the field stricter than a string.

It is also worth mentioning that fields like service_name, parent_id, payload_type, payload, and tags are all highly redundant and would benefit from compression.

lmangani · 2024-03-21T11:58:06Z

@gaby we absolutely want compression choices to be as open as possible for experimenting. We could work on a set of ALTER statements we can use to experiment with.
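One possible shape for such a set of ALTER statements, as a sketch only. Note that `traces_input` uses `Engine=Null`, which discards the rows written to it, so a codec only affects disk usage on the MergeTree table that actually stores the spans. The table name `tempo_traces` below is a placeholder, and the column types are copied from the `traces_input` definition quoted in this thread; check the real types with `DESCRIBE TABLE` first, because `MODIFY COLUMN` with a different type would convert the data.

```javascript
// Columns named in this thread as highly redundant. Types are taken from
// traces_input and may not match the storage table (assumption).
const CANDIDATE_COLUMNS = {
  service_name: 'String',
  parent_id: 'String',
  payload_type: 'Int8',
  payload: 'String'
}

/**
 * Generate one ALTER statement per column that sets the given codec,
 * for example 'ZSTD(3)'.
 */
function alterCodecStatements (table, columns, codec) {
  return Object.entries(columns).map(([name, type]) =>
    `ALTER TABLE ${table} MODIFY COLUMN ${name} ${type} CODEC(${codec})`
  )
}

// Placeholder table name, for illustration only.
for (const statement of alterCodecStatements('tempo_traces', CANDIDATE_COLUMNS, 'ZSTD(3)')) {
  console.log(statement + ';')
}
```

As far as I know, a codec change applies to newly written parts, while existing parts are recompressed only when they are merged, so a size comparison should be made on freshly inserted data or after forcing a merge.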

gaby · 2024-03-21T13:25:42Z

@lmangani That would be a good starting point. Alternatively, we could update the CREATE TABLE and test with a big data set to see the difference in size and performance. Compression will add latency and reduce throughput, which is why it should be configurable.

shenqidebaozi · 2024-03-22T11:55:23Z

@gaby Using Nested would also save one JSON marshal/unmarshal round trip, which helps both bulk writes and queries.

gaby · 2024-04-10T04:44:03Z

According to ChatGPT, the same CREATE TABLE statement with ZSTD applied to every column would look like:

CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} ( 
     oid String DEFAULT '0' CODEC(ZSTD), 
     trace_id String CODEC(ZSTD), 
     span_id String CODEC(ZSTD), 
     parent_id String CODEC(ZSTD), 
     name String CODEC(ZSTD), 
     timestamp_ns Int64 CODEC(DoubleDelta, ZSTD), 
     duration_ns Int64 CODEC(ZSTD), 
     service_name String CODEC(ZSTD), 
     payload_type Int8 CODEC(ZSTD), 
     payload String CODEC(ZSTD), 
     tags Array(Tuple(String, String)) CODEC(ZSTD) 
) Engine=Null

When asked to add levels based on field type, it produces the following:

CREATE TABLE IF NOT EXISTS {{DB}}.traces_input {{{OnCluster}}} ( 
     oid String DEFAULT '0' CODEC(ZSTD(1)), 
     trace_id String CODEC(ZSTD(3)),  -- Likely to benefit from more compression
     span_id String CODEC(ZSTD(3)),   -- Likely to benefit from more compression
     parent_id String CODEC(ZSTD(1)), 
     name String CODEC(ZSTD(1)), 
     timestamp_ns Int64 CODEC(DoubleDelta, ZSTD(3)),  -- Larger data size, benefits from more compression
     duration_ns Int64 CODEC(ZSTD(3)),  -- Larger data size, benefits from more compression
     service_name String CODEC(ZSTD(1)), 
     payload_type Int8 CODEC(ZSTD(1)), 
     payload String CODEC(ZSTD(3)),  -- Assuming payloads can be large/structured, they might benefit more
     tags Array(Tuple(String, String)) CODEC(ZSTD(1)) 
) Engine=Null

lmangani · 2024-04-11T11:10:43Z

@gaby let us know how this plays out and if it produces a visible effect we can most definitely implement options to trigger it

lmangani added the enhancement, help wanted, and needs testing labels · Mar 20, 2024
