Issue with Parsing NDJSON File in DuckDB: Unexpected Quotation Marks #12188

Max0u · 2024-05-22T13:01:20Z

Description

We are trying to parse a complex NDJSON file with DuckDB. The schema can vary from line to line, so we set maximum_depth to 1 to keep those differences from blocking the parse. With that setting, however, DuckDB appears to add quotation marks (") around the entire field.

Steps to Reproduce

Parse a complex NDJSON file with varying schema lines using DuckDB.
Set the maximum_depth parameter to 1.

Expected Behavior

The fields should be parsed correctly without adding extra quotation marks.

Actual Behavior

DuckDB adds quotation marks around the entire field.

Questions

Is this the correct behavior for DuckDB when parsing an NDJSON file with maximum_depth set to 1?
If not, is there a way to prevent DuckDB from adding these quotation marks?
Any insights or suggestions would be greatly appreciated!

Additional Information

NDJSON file example:

{"field1": "value1", "field2": {"subfield1": "subvalue1"}}
{"field1": "value2", "field2": {"subfield2": "subvalue2"}}

Code snippet:

SELECT * FROM read_ndjson('path_to_file.ndjson', maximum_depth=1);

Output using pqrs tool:

field2: ""{"subfield1": "subvalue1"}""

Thank you in advance for your help!
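
The setup described in the question can be modeled outside DuckDB to see why a shallow parse tolerates lines with different nested shapes. The sketch below is only a rough model written with Python's standard `json` module. `parse_one_level` is a hypothetical helper, not a DuckDB function, and it assumes every top-level value is kept as JSON text. This thread does not establish how DuckDB itself types each column when maximum_depth is 1.

```python
import json

# The two sample lines from the question; field2 has a different shape on each.
NDJSON = """\
{"field1": "value1", "field2": {"subfield1": "subvalue1"}}
{"field1": "value2", "field2": {"subfield2": "subvalue2"}}
"""


def parse_one_level(line):
    """Split one JSON object into its top-level keys, keeping each value as JSON text.

    Hypothetical helper for illustration only; an assumption, not DuckDB's implementation.
    """
    record = json.loads(line)
    # Re-serialize each value instead of descending into it, so the differing
    # nested shapes never have to be reconciled into a single schema.
    return {key: json.dumps(value) for key, value in record.items()}


rows = [parse_one_level(line) for line in NDJSON.splitlines() if line.strip()]

for row in rows:
    print(row)
```

In this model both rows share the same two columns even though the contents of field2 differ, and a plain string such as value1 is stored with its JSON quotes because it is kept as JSON text.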

Answered by lnkuiper

May 23, 2024

Tishj · 2024-05-22T13:13:07Z

Collaborator

I think you might be looking for this: https://duckdb.org/docs/extensions/json#json-extraction-functions

2 replies

Max0u May 22, 2024
Author

Sorry, I may have misspoken. If you look at the screenshot below, field 1 has quotation marks (").

But since it is already a string, inspecting it with pqrs shows redundant quotation marks (").

Is this normal behavior?

Tishj May 22, 2024
Collaborator

That might be a bug, actually. Perhaps @lnkuiper can shed some light on this behavior.

lnkuiper · 2024-05-23T09:29:43Z

Collaborator

Hi @Max0u, I think the issue is that we're not annotating the strings going into the Parquet file as being the Parquet JSON type. Therefore, the type is interpreted by pqrs as a VARCHAR, and surrounded by double quotes.

If we add a cast like so:

duckdb -c "COPY (SELECT * FROM read_ndjson('path_to_file.ndjson', maximum_depth=1)) TO 'my.parquet'";
duckdb --jsonlines -c "SELECT field1::JSON field1, field2::JSON field2 FROM 'my.parquet'";

We get proper JSON output without the double quotes:

{"field1":"value1","field2":{"subfield1":"subvalue1"}}
{"field1":"value2","field2":{"subfield2":"subvalue2"}}
