feat: influx inspect export parquet #25047

alespour · 2024-06-07T16:02:57Z

Closes https://github.com/influxdata/edge/issues/672

This PR extends the influx_inspect export command with Parquet output. The code in the cmd/influx_inspect/export/parquet folder is the Parquet exporter and related code ported from idpe (command snapshot store tsm export), reduced to the minimal subset required for the export.

New influx_inspect options:

-measurement - selects the measurement to be exported (required)
-parquet - selects Parquet output instead of line protocol
-chunk-size - size, in bytes, to partition Parquet files (default 100000000)
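To illustrate what partitioning by -chunk-size means, here is a minimal Go sketch. It is not the exporter's code: the partition and fileName helpers are hypothetical, the per-row byte sizes are assumed to be known up front, and how the real exporter measures size is not stated in this PR. The file name pattern follows the table-00000.parquet style of the output files.

```go
package main

import "fmt"

// partition groups row sizes (in bytes) into chunks whose total size stays
// within chunkSize where possible. A single row larger than chunkSize ends up
// in a chunk of its own. Hypothetical helper, for illustration only.
func partition(rowSizes []int, chunkSize int) [][]int {
	var chunks [][]int
	var current []int
	used := 0
	for _, size := range rowSizes {
		// Start a new chunk when adding this row would exceed the budget.
		if len(current) > 0 && used+size > chunkSize {
			chunks = append(chunks, current)
			current = nil
			used = 0
		}
		current = append(current, size)
		used += size
	}
	if len(current) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}

// fileName formats an output file name in the table-NNNNN.parquet style.
func fileName(index int) string {
	return fmt.Sprintf("table-%05d.parquet", index)
}

func main() {
	// Tiny numbers instead of the 100000000-byte default, to keep the demo readable.
	for i, chunk := range partition([]int{40, 40, 40, 90}, 100) {
		fmt.Println(fileName(i), chunk)
	}
}
```

With a budget of 100 bytes, the four rows above land in three files. A smaller -chunk-size therefore yields more, smaller files for the same data.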

Output file(s) are created in the folder specified via the existing -out option. The limitations are:

-database, -retention and -measurement must be specified
only TSM files are exported (not WAL files, unlike when exporting to line protocol); WAL export can easily be implemented if requested
export to Parquet file(s) is done per TSM file. The reading code apparently does not sort the TSM files by the time of the data they contain, so the output files are not sorted either. For example, table-00001.parquet may contain older data than table-00000.parquet. This seems irrelevant for import.
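The first limitation amounts to a check on the option combination. A minimal sketch of such a check follows, assuming a hypothetical exportOptions struct that holds the parsed flags. The names, the error messages, and the positive -chunk-size check are assumptions, not the code in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// exportOptions is a hypothetical stand-in for the parsed command-line flags.
type exportOptions struct {
	Database    string
	Retention   string
	Measurement string
	Parquet     bool
	ChunkSize   int
}

// validate returns an error when Parquet output is requested without the
// options listed as required above. Line protocol export is left unchecked.
func (o exportOptions) validate() error {
	if !o.Parquet {
		return nil
	}
	if o.Database == "" || o.Retention == "" || o.Measurement == "" {
		return errors.New("-database, -retention and -measurement are required with -parquet")
	}
	// Assumption: a non-positive chunk size is rejected.
	if o.ChunkSize <= 0 {
		return errors.New("-chunk-size must be positive")
	}
	return nil
}

func main() {
	// -measurement is missing here, so validation fails.
	opts := exportOptions{
		Database:  "benchmark_db",
		Retention: "autogen",
		Parquet:   true,
		ChunkSize: 100000000,
	}
	fmt.Println(opts.validate())
}
```

Running the check before any TSM file is opened lets the command fail fast, without leaving partial output in the -out folder.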

I've read the contributing section of the project README.

alespour · 2024-06-12T09:40:25Z

Export example cmd:

influx_inspect export -datadir /var/lib/influxdb/data/ -waldir /var/lib/influxdb/wal/ -out /bigdata/export/ -database benchmark_db -retention autogen -measurement cpu -parquet

Import via telegraf:

[[inputs.file]]
   files = ["/bigdata/export/table-*.parquet"]
   name_override = "cpu"
   data_format = "parquet"
   tag_columns = ["datacenter","hostname","os","rack","region","service","team"]
   timestamp_column = "time"
   timestamp_format = "unix_ns"

telegraf --once

powersj · 2024-06-12T14:03:39Z

@davidby-influx could we get Stuart's review on this PR? While not urgent, it would be nice to keep up the momentum on this.

@alespour I have two comments:

In the README I'd rather see some examples of running with this new option + the required params
Is there a reason you are using arrow v14 and not v16? I assume that is copied over as well?

alespour · 2024-06-12T14:08:19Z

@powersj Yes, arrow v14 is used in the v2 exporter and it was just copied. I'll update the dep to v16 and add some examples of running the tool with Parquet output.

alespour added 6 commits June 7, 2024 12:46

feat(client): add initial support for exporting to Parquet

0c65d78

style: import order

733e1bb

fix: unused input parameter

3b5896c

test: add influx_inspect test

92f08e2

style: go fmt

2544c28

fix: extend Parquet options values checks

f04fc04

alespour marked this pull request as ready for review June 11, 2024 08:34

alespour marked this pull request as draft June 12, 2024 09:04

alespour marked this pull request as ready for review June 12, 2024 09:41

alespour added 2 commits June 13, 2024 09:41

chore: update arrow to v16

031aae7

docs: update with new influx_inspect options and sample command

50f4511
