
byte and short types in spark no longer auto coerce to int32 #10225

Closed
jkolash opened this issue Apr 25, 2024 · 7 comments · Fixed by #10349
Labels: bug (Something isn't working)

Comments

@jkolash
Contributor

jkolash commented Apr 25, 2024

Apache Iceberg version

1.5.0

Query engine

Spark

Please describe the bug 🐞

The removal of the code below

private static PrimitiveWriter<?> ints(DataType type, ColumnDescriptor desc) {
  if (type instanceof ByteType) {
    return ParquetValueWriters.tinyints(desc);
  } else if (type instanceof ShortType) {
    return ParquetValueWriters.shorts(desc);
  }
  return ParquetValueWriters.ints(desc);
}

in this PR, https://github.com/apache/iceberg/pull/9440/files, broke this auto-coercion.

Is there a reason for removing the auto-coercion of byte and short to int? Before, on Iceberg 1.4.x, we were able to materialize this into Iceberg just fine, but on Iceberg 1.5.x it doesn't work.
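For context, the tinyints/shorts dispatch above presumably existed so that Spark's boxed Byte and Short values get widened to int before reaching the Parquet INT32 column writer. A minimal illustrative sketch of that coercion step (the class below is hypothetical, not Iceberg's actual writer):

    import java.util.function.IntConsumer;

    // Hypothetical sketch: widen a boxed Byte to int before handing it to
    // an INT32 column writer. Iceberg's real writers live in
    // org.apache.iceberg.parquet.ParquetValueWriters; this only shows the
    // coercion step that the removed dispatch provided.
    class TinyIntWriterSketch {
      private final IntConsumer int32Column; // stands in for the column writer

      TinyIntWriterSketch(IntConsumer int32Column) {
        this.int32Column = int32Column;
      }

      void write(Byte value) {
        // Byte.intValue() widens safely; a generic writer that instead
        // casts the boxed value with (Integer) fails on a java.lang.Byte.
        int32Column.accept(value.intValue());
      }
    }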

@jkolash jkolash added the bug Something isn't working label Apr 25, 2024
@Fokko
Contributor

Fokko commented Apr 25, 2024

Hey @jkolash, thanks for reporting this. The behavior should stay the same due to the logic here:

https://github.com/apache/iceberg/pull/9440/files#diff-8ac59cbdbcc60cc0c558051dfe8dcf9ffeb4c66379e48c49867a93ee43e27528R224-R236

What's the error that you're seeing? This will help me to reproduce the issue on my end and see if we can come up with a fix.

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

@Fokko Thanks for the quick response. I will try to write up a code snippet reproducing the issue.

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

    val df = spark.sql("""select inline(array(from_json('{"b":82}', 'struct<b:byte>')))""")
    df.show()

+---+
|  b|
+---+
| 82|
+---+

    df.writeTo("staging.iceberg_table_3")
      .using("iceberg")
      .createOrReplace()

using this Spark config:

        conf.set("spark.sql.catalog.staging", "org.apache.iceberg.spark.SparkCatalog")
            .set("spark.sql.catalog.staging.type", "hadoop")
            .set("spark.sql.catalog.staging.warehouse", "/tmp/random_directory");

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

If/when there is a PR, I can test it on my side, where I have exhaustive type testing. The error I get is:

java.lang.ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Integer (java.lang.Byte and java.lang.Integer are in module java.base of loader 'bootstrap')
	at org.apache.iceberg.parquet.ColumnWriter$2.write(ColumnWriter.java:39)
	at org.apache.iceberg.parquet.ParquetValueWriters$PrimitiveWriter.write(ParquetValueWriters.java:131)
	at org.apache.iceberg.parquet.ParquetValueWriters$OptionWriter.write(ParquetValueWriters.java:375)
	at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:608)

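That stack trace is consistent with a writer unboxing its input via an Integer cast while Spark supplies a boxed Byte for the ByteType column; in Java, casting a boxed wrapper to a sibling wrapper type always fails at runtime. A standalone sketch of the failure mode:

    public class CastDemo {
      public static void main(String[] args) {
        // What the writer receives for a Spark ByteType column value:
        Object value = Byte.valueOf((byte) 82);

        // Unboxing through (Integer) throws ClassCastException, because a
        // java.lang.Byte is not a java.lang.Integer.
        int unboxed = (Integer) value;
        System.out.println(unboxed);
      }
    }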

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

Hmm, I think this may be related to the Spark version we are using: I tested on spark-3.4.1 and didn't see the issue, but I do see it on our 3.4.2.

jkolash added a commit to jkolash/iceberg that referenced this issue Apr 25, 2024
jkolash added a commit to jkolash/iceberg that referenced this issue Apr 26, 2024
@jkolash
Contributor Author

jkolash commented Apr 26, 2024

OK, this reproduces via the GitHub Actions build on my public fork:
https://github.com/jkolash/iceberg/actions/runs/8842101257/job/24280206652

TestDataFrameWriterV2 > testByte FAILED
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 7) (localhost executor driver): java.lang.ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Integer (java.lang.Byte and java.lang.Integer are in module java.base of loader 'bootstrap')
    	at org.apache.iceberg.parquet.ColumnWriter$2.write(ColumnWriter.java:39)
    	at org.apache.iceberg.parquet.ParquetValueWriters$PrimitiveWriter.write(ParquetValueWriters.java:131)
    	at org.apache.iceberg.parquet.ParquetValueWriters$OptionWriter.write(ParquetValueWriters.java:356)
    	at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:589)

@jkolash
Contributor Author

jkolash commented May 1, 2024

Just wanted to make sure you were aware that reproducing this is pretty simple:

Author: jkolash <[email protected]>
Date:   Thu Apr 25 19:23:22 2024 -0400

    Failing test for issue #10225
    
    https://github.com/apache/iceberg/issues/10225

diff --git a/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java b/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java
index 76b138ced..9193154ce 100644
--- a/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java
+++ b/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java
@@ -177,6 +177,17 @@ public class TestDataFrameWriterV2 extends SparkTestBaseWithCatalog {
         sql("select * from %s order by id", tableName));
   }
 
+  @Test
+  public void testByte() {
+    SparkSession sparkSession = spark.cloneSession();
+    Dataset<Row> dataset =
+        sparkSession.sql("select inline(array(from_json('{\"b\": 3}', 'struct<b:byte>')))");
+
+    dataset.show();
+
+    dataset.writeTo(tableName).createOrReplace();
+  }
+
   @Test
   public void testWriteWithCaseSensitiveOption() throws NoSuchTableException, ParseException {
     SparkSession sparkSession = spark.cloneSession();

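For reference, it should be possible to run just this test locally with something like ./gradlew :iceberg-spark:iceberg-spark-3.4_2.12:test --tests org.apache.iceberg.spark.source.TestDataFrameWriterV2 (module path assumed from Iceberg's Gradle layout).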