-
DescriptionWe are trying to parse a complex NDJSON file using DuckDB. The schema of the NDJSON file can vary between lines. To handle this without blocking, we have set the maximum_depth to 1. However, we are encountering an issue where DuckDB is adding quotation marks (") around the entire field. Steps to ReproduceParse a complex NDJSON file with varying schema lines using DuckDB. Expected BehaviorThe fields should be parsed correctly without adding extra quotation marks. Actual BehaviorDuckDB adds quotation marks around the entire field. Questions
Additional InformationNDJSON file example: {"field1": "value1", "field2": {"subfield1": "subvalue1"}}
{"field1": "value2", "field2": {"subfield2": "subvalue2"}} Code snippet: SELECT * FROM read_ndjson('path_to_file.ndjson', maximum_depth=1); Output using
Thank you in advance for your help! |
Beta Was this translation helpful? Give feedback.
Replies: 2 comments 2 replies
-
I think you might be looking for this: https://duckdb.org/docs/extensions/json#json-extraction-functions |
Beta Was this translation helpful? Give feedback.
-
Hi @Max0u, I think the issue is that we're not annotating the strings going into the Parquet file as being the Parquet JSON type. Therefore, the type is interpreted by If we add a cast like so: duckdb -c "COPY (SELECT * FROM read_ndjson('path_to_file.ndjson', maximum_depth=1)) TO 'my.parquet'";
duckdb --jsonlines -c "SELECT field1::JSON field1, field2::JSON field2 FROM 'my.parquet'"; We get proper JSON output without the double quotes: {"field1":"value1","field2":{"subfield1":"subvalue1"}}
{"field1":"value2","field2":{"subfield2":"subvalue2"}} |
Beta Was this translation helpful? Give feedback.
Hi @Max0u, I think the issue is that we're not annotating the strings going into the Parquet file as being the Parquet JSON type. Therefore, the type is interpreted by
pqrs
as aVARCHAR
, and surrounded by double quotes.If we add a cast like so:
We get proper JSON output without the double quotes: