CSV Benchmark

Yasuhiro Yamada edited this page Mar 1, 2023 · 4 revisions

Benchmark for processing a CSV file

Replace every character in the 2nd field with "@" in a CSV file of approximately 120 MiB.

Summary

Tool         1st     2nd     3rd     Average
teip + tr    3.253s  3.462s  3.447s  3.387s
awk          5.143s  4.946s  4.792s  4.960s
teip + awk   5.099s  4.987s  6.069s  5.385s
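
The Average column is the arithmetic mean of the three "real" times listed under Result. A minimal sketch to reproduce it follows; `average` is a hypothetical helper name and is not part of the original benchmark.

```shell
# Hypothetical helper: average the given wall-clock times (in seconds)
# and print the mean with three decimals, matching the table's format.
average() {
  printf '%s\n' "$@" | awk '{ sum += $1 } END { printf "%.3fs\n", sum / NR }'
}

average 3.253 3.462 3.447   # teip + tr  -> 3.387s
average 5.143 4.946 4.792   # awk        -> 4.960s
average 5.099 4.987 6.069   # teip + awk -> 5.385s
```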

Note that teip parses CSV in accordance with RFC 4180, whereas AWK with FS=, splits on every comma and does not handle quoted fields.
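
To illustrate why this matters, the sketch below feeds AWK a record whose 2nd field is quoted and contains a comma, which is valid under RFC 4180. The sample record and the function names `count_fields` and `mask_second_field` are made up for illustration; the AWK program in `mask_second_field` is the one used in this benchmark.

```shell
# Hypothetical sample record: the quoted 2nd field contains a comma.
record='1,"Tokyo, Japan",3'

# Count fields the way AWK sees them when splitting on bare commas.
count_fields() {
  printf '%s\n' "$1" | awk -F, '{ print NF }'
}

# Apply the benchmark's AWK command to a single record.
mask_second_field() {
  printf '%s\n' "$1" | awk '{gsub(".","@",$2);print}' FS=, OFS=,
}

count_fields "$record"        # prints 4, although the record has 3 CSV fields
mask_second_field "$record"   # prints 1,@@@@@@, Japan",3
```

Only the part of the quoted field before the embedded comma is masked. The md5sum check under Prerequisites shows the three commands agree on the test file; the same agreement should not be assumed for data with embedded commas.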

Prerequisites

  • Platform: AWS t3.medium (vCPU x 2, Memory 4 GiB)
  • Storage: EBS volume gp2 / 200 GiB (600 IOPS)

Prepare the test file, then confirm that all three commands produce identical output:

$ wget https://github.com/greymd/test_files/raw/v1.0.0/xsv/1000000_Sales_Records.csv.gz
$ zcat 1000000_Sales_Records.csv.gz | awk '{print}' > test.csv # Filtered by AWK to add trailing newline
$ du -hs test.csv
120M    test.csv
$ ./target/release/teip --csv -f 2 -- tr '[:print:]' '@' < test.csv > teip_result.csv
$ ./target/release/teip --csv -f 2 -- awk '{gsub(".", "@");print}' < test.csv > teip_awk_result.csv
$ awk '{gsub(".","@",$2);print}' FS=, OFS=, < test.csv > awk_result.csv
$ md5sum *_result.csv
4328c75307064d3bfc3743a24c83513b  awk_result.csv
4328c75307064d3bfc3743a24c83513b  teip_result.csv
4328c75307064d3bfc3743a24c83513b  teip_awk_result.csv

Result

$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time ./target/release/teip --csv -f 2 -- tr '[:print:]' '@' < test.csv > /dev/null

real    0m3.253s
user    0m3.407s
sys     0m0.146s
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time ./target/release/teip --csv -f 2 -- tr '[:print:]' '@' < test.csv > /dev/null

real    0m3.462s
user    0m3.624s
sys     0m0.152s
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time ./target/release/teip --csv -f 2 -- tr '[:print:]' '@' < test.csv > /dev/null

real    0m3.447s
user    0m3.680s
sys     0m0.121s
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time awk '{gsub(".","@",$2);print}' FS=, OFS=, < test.csv > /dev/null

real    0m5.143s
user    0m5.021s
sys     0m0.072s
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time awk '{gsub(".","@",$2);print}' FS=, OFS=, < test.csv > /dev/null

real    0m4.946s
user    0m4.815s
sys     0m0.093s
$ sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
$ time awk '{gsub(".","@",$2);print}' FS=, OFS=, < test.csv > /dev/null

real    0m4.792s
user    0m4.683s
sys     0m0.091s
$ time ./target/release/teip --csv -f 2 -- awk '{gsub(".", "@");print}' < test.csv > /dev/null

real    0m5.099s
user    0m4.840s
sys     0m0.187s
$ time ./target/release/teip --csv -f 2 -- awk '{gsub(".", "@");print}' < test.csv > /dev/null

real    0m4.987s
user    0m4.863s
sys     0m0.121s
$ time ./target/release/teip --csv -f 2 -- awk '{gsub(".", "@");print}' < test.csv > /dev/null

real    0m6.069s
user    0m5.819s
sys     0m0.071s
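
The Summary table reports the "real" values above in plain seconds. As a small sketch, the `0m3.253s` format printed by `time` can be converted like this; `to_seconds` is a hypothetical helper name, and it assumes the `<minutes>m<seconds>s` layout shown in the transcripts.

```shell
# Hypothetical helper: convert a `time` value such as 0m3.253s to seconds.
to_seconds() {
  # Split on "m", strip the trailing "s", then add minutes * 60.
  printf '%s\n' "$1" | awk -F'm' '{ sub(/s$/, "", $2); printf "%.3f\n", $1 * 60 + $2 }'
}

to_seconds 0m3.253s   # prints 3.253
to_seconds 1m2.500s   # prints 62.500
```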