[path-] fix undercounted progress for multibyte chars #2323

midichef · 2024-02-19T07:20:00Z

For text files encoded with more than one byte per character, FileProgress undercounts loading progress.

To demonstrate, you can use a UTF-32 file, where every character takes 4 bytes:

seq 1000001 | iconv -t UTF-32 >! progress.utf32.tsv
vd --encoding=utf-32 progress.utf32.tsv

The progress only goes up to 25%, not 100%.

That's because read() progress is counting the characters, but the goal is measured in bytes. The file is around 7 million characters long, but when encoded in UTF-32, it is 28 million bytes, so even at the end, 7 million/28 million becomes 25%.
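The arithmetic can be checked with a few lines of Python. This is only an illustration of the character/byte mismatch, not code from the PR; `seq_text` and `final_progress` are hypothetical helpers, with `seq_text` standing in for the `seq` command.

```python
def seq_text(n):
    """Return the text that `seq n` would print: 1..n, one number per line."""
    return "".join(f"{i}\n" for i in range(1, n + 1))


def final_progress(text, encoding):
    """Fraction reached at end of load if characters are counted against a byte total."""
    chars_read = len(text)                      # what read() progress was counting
    total_bytes = len(text.encode(encoding))    # what the goal is measured in
    return chars_read / total_bytes


text = seq_text(1000001)
print(len(text), len(text.encode("utf-32")))    # characters vs. bytes
print(f"{final_progress(text, 'utf-32'):.0%}")  # close to 25% for UTF-32
```

For an ASCII-only file in a single-byte encoding the same ratio is 1.0, which is why the undercount only shows up with multibyte encodings.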

This PR changes FileProgress to track progress as bytes.
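One way to track progress in bytes is to count at the binary layer, underneath the text decoder. The sketch below is not the FileProgress code from this PR; `ByteCountingReader` and `open_text_with_byte_progress` are hypothetical names, and it assumes the file is opened in binary mode and decoded with `io.TextIOWrapper`.

```python
import io


class ByteCountingReader(io.RawIOBase):
    """Hypothetical wrapper that counts the bytes pulled from a binary stream."""

    def __init__(self, raw):
        super().__init__()
        self.raw = raw
        self.bytes_read = 0

    def readable(self):
        return True

    def readinto(self, buf):
        n = self.raw.readinto(buf)
        if n:
            self.bytes_read += n
        return n


def open_text_with_byte_progress(raw, encoding):
    """Return (text stream, counter); counter.bytes_read advances as text is read."""
    counter = ByteCountingReader(raw)
    text = io.TextIOWrapper(io.BufferedReader(counter), encoding=encoding)
    return text, counter


data = "".join(f"{i}\n" for i in range(1, 10001)).encode("utf-32")
fp, counter = open_text_with_byte_progress(io.BytesIO(data), "utf-32")

rows = 0
for line in fp:
    rows += 1
    progress = counter.bytes_read / len(data)  # bytes against bytes, so never above 1.0
```

Because of buffering, the byte count runs slightly ahead of the rows actually delivered, but it ends at exactly the file size regardless of encoding.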

visidata/path.py

midichef · 2024-03-04T08:03:49Z

ecc8628 estimates character length and uses that to estimate progress. It fixes the progress for UTF-16/UTF-32 as discussed. It also fixes it for UTF-8. For example, with a sample UTF-8 dataset of mostly Thai characters, loading progress was undercounted by a factor of 2.5: it would max out at 40% instead of 100%.

The progress estimator samples characters every so often to estimate the average bytes per character. Right now it samples more densely early in the file, because it needs enough samples to come up with a decent bytes-per-character estimate soon after loading starts.
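A minimal sketch of such an estimator, assuming it is fed one decoded line at a time. This is not the code in ecc8628: the class name `BytesPerCharEstimator`, the doubling sampling interval, and the `max_interval` parameter are all assumptions made for illustration.

```python
class BytesPerCharEstimator:
    """Hypothetical estimator of average encoded bytes per character.

    Samples densely at first, then doubles the gap between samples
    up to max_interval, so the per-line cost stays small on large files.
    """

    def __init__(self, encoding, max_interval=1024):
        self.encoding = encoding
        self.max_interval = max_interval
        # Codecs like 'utf-16' and 'utf-32' prepend a BOM on every encode() call;
        # measure it once so it can be subtracted from each sample.
        self.bom_len = len("".encode(encoding))
        self.interval = 1
        self.countdown = 0          # lines to skip before the next sample
        self.samples = 0
        self.sampled_chars = 0
        self.sampled_bytes = 0

    def observe(self, line):
        if self.countdown > 0:
            self.countdown -= 1
            return
        self.samples += 1
        self.sampled_chars += len(line)
        self.sampled_bytes += len(line.encode(self.encoding)) - self.bom_len
        self.interval = min(self.interval * 2, self.max_interval)
        self.countdown = self.interval - 1

    @property
    def bytes_per_char(self):
        if not self.sampled_chars:
            return 1.0              # no data yet: assume one byte per character
        return self.sampled_bytes / self.sampled_chars


est = BytesPerCharEstimator("utf-8")
chars_read = 0
for line in ["ก,ข,ค\n"] * 1000:     # three 3-byte Thai characters plus three ASCII
    est.observe(line)
    chars_read += len(line)

estimated_bytes = chars_read * est.bytes_per_char  # compare this to the file size
```

Multiplying the running character count by the estimate gives a byte position that can be compared against the file size in bytes.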

It doesn't slow down the code much. For a UTF-8 tsv file with 10 million short lines (seq 10000000), it's within 1% of the running time of the develop branch. For a 2.7GB UTF-32 tsv file with 6 million rows, this change increases mean loading time by approximately 4%. That's higher than I expected. I would like to bring it down to below 1%. I will look into it more, but not soon. So this is ready for review.

midichef · 2024-03-04T08:05:13Z

A related bug in v3.0.2: progress on compressed text files was overcounted. It would pass 100% and go to 200-600%.
338b250 fixes it.
To reproduce: seq 10000000 |gzip -c >! repro_gz_progress.gz; vd repro_gz_progress.gz
The progress meter will pass 100% around the 5,000,000th row.
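The comment does not spell out the cause, so the following is a sketch under an assumption: that the overcount came from comparing decompressed bytes against the compressed file size. It is not the change made in 338b250. It contrasts that naive ratio with one based on the position in the underlying compressed stream.

```python
import gzip
import io

plain = "".join(f"{i}\n" for i in range(1, 100001)).encode()
compressed = gzip.compress(plain)

raw = io.BytesIO(compressed)            # stands in for the .gz file on disk
gz = gzip.GzipFile(fileobj=raw)

decompressed_seen = 0
worst_naive = 0.0                       # decompressed bytes / compressed size
worst_raw = 0.0                         # compressed position / compressed size
while chunk := gz.read(64 * 1024):
    decompressed_seen += len(chunk)
    worst_naive = max(worst_naive, decompressed_seen / len(compressed))
    worst_raw = max(worst_raw, raw.tell() / len(compressed))

print(f"naive peak: {worst_naive:.0%}, compressed-position peak: {worst_raw:.0%}")
```

The naive ratio exceeds 100% by roughly the compression ratio, while the position in the compressed stream cannot pass the compressed size.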

saulpw · 2024-05-16T21:33:10Z

Thanks for doing the work on this, @midichef! I put together #2407 which might be a simpler way to do a few things. I'm not super-excited about some things in my PR, but between the two let's find the common ground.

and let's take 338b250 regardless.

anjakefala · 2024-05-18T05:54:02Z

Hi @midichef! Could you review the most recent commit? We removed the batching to simplify the code a bit.

saulpw reviewed Feb 19, 2024

visidata/path.py (outdated)

saulpw force-pushed the develop branch from 33cdd38 to df09830 on February 23, 2024 04:40

[path-] fix progress overcount for compressed files

338b250

midichef force-pushed the progress_undercount branch from 2a45fd5 to 612fea8 on March 4, 2024 02:57

[path-] fix undercounted progress for multibyte chars

ecc8628

midichef force-pushed the progress_undercount branch from 612fea8 to ecc8628 on March 4, 2024 03:30

saulpw added a commit that referenced this pull request May 16, 2024

[path-] fix undercounted progress for multibyte chars #2323

943c3a2

Co-authored-by: @midichef

saulpw added a commit that referenced this pull request May 16, 2024

[path-] fix undercounted progress for multibyte chars #2323

c4c49e4

Co-authored-by: @midichef

saulpw mentioned this pull request May 16, 2024

[path-] fix undercounted progress for multibyte chars #2407

Closed

anjakefala pushed a commit that referenced this pull request May 18, 2024

[path-] fix undercounted progress for multibyte chars #2323

5cefaa9

Co-authored-by: @midichef

anjakefala pushed a commit that referenced this pull request May 18, 2024

[path-] fix undercounted progress for multibyte chars #2323

28b16a8

Co-authored-by: @midichef

[path-] do not batch calculation of progress

6676721
