[SPARK-48241][SQL] CSV parsing failure with char/varchar type columns #46537

liujiayi771 · 2024-05-11T07:51:48Z

What changes were proposed in this pull request?

Selecting from a CSV table that contains char or varchar columns fails with the following error:

spark-sql (default)> show create table test_csv;
CREATE TABLE default.test_csv (
  id INT,
  name CHAR(10))
USING csv

java.lang.IllegalArgumentException: requirement failed: requiredSchema (struct<id:int,name:string>) should be the subset of dataSchema (struct<id:int,name:string>).
    at scala.Predef$.require(Predef.scala:281)
    at org.apache.spark.sql.catalyst.csv.UnivocityParser.<init>(UnivocityParser.scala:56)
    at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat.$anonfun$buildReader$2(CSVFileFormat.scala:127)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:155)
    at org.apache.spark.sql.execution.datasources.FileFormat$$anon$1.apply(FileFormat.scala:140)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:231)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:293)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:125)

Why are the changes needed?

Spark converts char and varchar types to StringType in CharVarcharUtils.replaceCharVarcharWithStringInSchema and records the original type under the __CHAR_VARCHAR_TYPE_STRING key in the field metadata.

The error occurs because the StringType columns in the dataSchema and requiredSchema passed to UnivocityParser are inconsistent: the field in dataSchema carries metadata, while the metadata of the same field in requiredSchema is empty. The printed schemas in the error message look identical because the printed form omits metadata. We need to retain the metadata when resolving the schema.
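The mismatch can be reproduced with the public schema types alone. The following is a minimal sketch, not the parser's actual code: it assumes Spark SQL is on the classpath, that the subset check compares fields by equality (which includes metadata), and it uses "char(10)" as an illustrative metadata value. The object name and the `isSubset` helper are hypothetical.

```scala
import org.apache.spark.sql.types._

object CharMetadataMismatchSketch {
  // Metadata recording the original CHAR(10) type on the string field.
  // The key is taken from the PR description; the value is illustrative.
  val charMeta: Metadata =
    new MetadataBuilder().putString("__CHAR_VARCHAR_TYPE_STRING", "char(10)").build()

  // dataSchema keeps the metadata on `name`.
  val dataSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType, nullable = true, metadata = charMeta)))

  // requiredSchema, as resolved before the fix, has empty metadata on `name`.
  val requiredSchema: StructType = StructType(Seq(
    StructField("id", IntegerType),
    StructField("name", StringType)))

  // Hypothetical helper: a field-by-field subset check.
  // StructField is a case class, so equality also compares metadata.
  def isSubset(required: StructType, data: StructType): Boolean =
    required.toSet.subsetOf(data.toSet)

  // Copy the metadata from the matching dataSchema field onto each required field.
  def withDataMetadata(required: StructType, data: StructType): StructType =
    StructType(required.map(f => f.copy(metadata = data(f.name).metadata)))

  def main(args: Array[String]): Unit = {
    // Both schemas print the same way because catalogString omits metadata.
    println(dataSchema.catalogString == requiredSchema.catalogString)
    // The subset check still fails because the `name` fields differ in metadata.
    println(isSubset(requiredSchema, dataSchema))
    // Once the metadata is carried over, the check passes.
    println(isSubset(withDataMetadata(requiredSchema, dataSchema), dataSchema))
  }
}
```

Under these assumptions the three lines printed should be true, false, true, which matches the symptom: an error message showing two schemas that look the same.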

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added a new test case to CSVSuite.

Was this patch authored or co-authored using generative AI tooling?

No.

liujiayi771 · 2024-05-11T07:53:26Z

Hi @ulysses-you, could you help review?

ulysses-you · 2024-05-11T08:54:21Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala

- case a: AttributeReference => a
+ case a: AttributeReference =>
+ // Keep the metadata in given schema.
+ a.copy(metadata = field.metadata)(exprId = a.exprId, qualifier = a.qualifier)


a.withMetadata(field.metadata)
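For readers comparing the two forms, here is a small sketch of why the shorter call can stand in for the explicit copy. It assumes Spark Catalyst is on the classpath and that `withMetadata` on an `AttributeReference` keeps the expression ID and qualifier, as the suggestion implies; the object name and the sample metadata value are hypothetical.

```scala
import org.apache.spark.sql.catalyst.expressions.AttributeReference
import org.apache.spark.sql.types.{Metadata, MetadataBuilder, StringType}

object WithMetadataSketch {
  // Illustrative metadata; the key is the one named in the PR description.
  val meta: Metadata =
    new MetadataBuilder().putString("__CHAR_VARCHAR_TYPE_STRING", "char(10)").build()

  // An attribute with empty metadata and a freshly generated exprId.
  val a: AttributeReference = AttributeReference("name", StringType)()

  // Form used in the first revision of the patch.
  val viaCopy: AttributeReference =
    a.copy(metadata = meta)(exprId = a.exprId, qualifier = a.qualifier)

  // Form suggested in review.
  val viaWithMetadata: AttributeReference = a.withMetadata(meta)

  def main(args: Array[String]): Unit = {
    // Compare the parts that matter for schema resolution.
    println(viaCopy.metadata == viaWithMetadata.metadata)
    println(viaCopy.exprId == viaWithMetadata.exprId)
    println(viaCopy.qualifier == viaWithMetadata.qualifier)
  }
}
```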

ulysses-you

lgtm if tests pass, cc @yaooqinn @cloud-fan

cloud-fan

good catch!

cloud-fan · 2024-05-13T14:41:42Z

thanks, merging to master/~~3.5~~!

cloud-fan · 2024-05-13T14:43:40Z

it has conflicts with 3.5, can you create a new backport PR?

liujiayi771 · 2024-05-14T02:39:57Z

> it has conflicts with 3.5, can you create a new backport PR?

Created a backport PR in #46565.

SPARK-48241: CSV parsing failure with char/varchar type columns

d3378a2

github-actions bot added the SQL label May 11, 2024

ulysses-you reviewed May 11, 2024

ulysses-you approved these changes May 11, 2024

Address comments

204a4ab

cloud-fan approved these changes May 13, 2024

cloud-fan closed this in b14abb3 May 13, 2024
