Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[csv-] use newline='' for opening files #2374

Draft
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

midichef
Copy link
Contributor

@midichef midichef commented Apr 7, 2024

The csv loader should use newline=''. According to the csv library documentation:

"If newline='' is not specified, newlines embedded inside quoted fields will not be interpreted correctly
[...]
It should always be safe to specify newline='', since the csv module does its own (universal) newline handling.

newline='' is already used for saving CSV files (#1362), added by 7a5e834.

An example file that demonstrates the difference:
newlines.csv
In vd newlines.csv, the third row, second column, loads as "\n" (on my Ubuntu system where newline is "\n". On Windows it would load as "\r\n", I think.). After the patch it loads as "\r\n".

I'm making this PR a draft, pending others' input, since I'm not well-versed in parsing CSV files.

Is there a canonical test suite of CSV files for parsing? I'd prefer to test this change before it's deployed.

The csv library documentation says
"If newline='' is not specified, newlines embedded inside quoted fields
will not be interpreted correctly [...] . It should always be safe
to specify newline='', since the csv module does its own (universal)
newline handling."
https://docs.python.org/3/library/csv.html#id4
@saulpw
Copy link
Owner

saulpw commented May 18, 2024

Okay, so I made a small change to open_text_source to pass through the kwargs. So this PR can be changed to simply the first newline='' change (which seems reasonable to me).

We have sample_data/errors.csv which should cover most of the weirdnesses found in csv-spectrum and csvkit/examples. But it doesn't cover encodings or alternate line-endings; we'd probably need separate files for those.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants