
byte and short types in spark no longer auto coerce to int32 #10225

Closed
jkolash opened this issue Apr 25, 2024 · 7 comments · Fixed by #10349
Labels: bug (Something isn't working)

Comments

@jkolash
Contributor

jkolash commented Apr 25, 2024

Apache Iceberg version

1.5.0

Query engine

Spark

Please describe the bug 🐞

The removal of the code below

private static PrimitiveWriter<?> ints(DataType type, ColumnDescriptor desc) {
  if (type instanceof ByteType) {
    return ParquetValueWriters.tinyints(desc);
  } else if (type instanceof ShortType) {
    return ParquetValueWriters.shorts(desc);
  }
  return ParquetValueWriters.ints(desc);
}

in this PR, https://github.com/apache/iceberg/pull/9440/files, broke this auto-coercion.

Is there a reason for removing the auto-coercion of byte and short to int? Before, on Iceberg 1.4.x, we were able to materialize this into Iceberg just fine, but on Iceberg 1.5.x it doesn't work.
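For context, the tinyints/shorts dispatch above presumably existed so that Spark's boxed Byte and Short values get widened to int before reaching the Parquet INT32 column writer. A minimal illustrative sketch of that coercion step (the class below is hypothetical, not Iceberg's actual writer):

    import java.util.function.IntConsumer;

    // Hypothetical sketch: widen a boxed Byte to int before handing it to
    // an INT32 column writer. Iceberg's real writers live in
    // org.apache.iceberg.parquet.ParquetValueWriters; this only shows the
    // coercion step that the removed dispatch provided.
    class TinyIntWriterSketch {
      private final IntConsumer int32Column; // stands in for the column writer

      TinyIntWriterSketch(IntConsumer int32Column) {
        this.int32Column = int32Column;
      }

      void write(Byte value) {
        // Byte.intValue() widens safely; a generic writer that instead
        // casts the boxed value with (Integer) fails on a java.lang.Byte.
        int32Column.accept(value.intValue());
      }
    }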

@jkolash jkolash added the bug Something isn't working label Apr 25, 2024
@Fokko
Contributor

Fokko commented Apr 25, 2024

Hey @jkolash, thanks for reporting this. The behavior should stay the same due to the logic here:

https://github.com/apache/iceberg/pull/9440/files#diff-8ac59cbdbcc60cc0c558051dfe8dcf9ffeb4c66379e48c49867a93ee43e27528R224-R236

What's the error that you're seeing? This will help me to reproduce the issue on my end and see if we can come up with a fix.

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

@Fokko Thanks for the quick response. I will try to write up a code snippet reproducing the issue.

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

    val df = spark.sql("""select inline(array(from_json('{"b":82}', 'struct<b:byte>')))""")
    df.show()

+---+
|  b|
+---+
| 82|
+---+

    df.writeTo("staging.iceberg_table_3")
      .using("iceberg")
      .createOrReplace()

using this Spark config:

        conf.set("spark.sql.catalog.staging", "org.apache.iceberg.spark.SparkCatalog")
            .set("spark.sql.catalog.staging.type", "hadoop")
            .set("spark.sql.catalog.staging.warehouse", "/tmp/random_directory");

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

If/when there is a PR, I can test it on my side, where I have exhaustive type testing. The error I get is:

java.lang.ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Integer (java.lang.Byte and java.lang.Integer are in module java.base of loader 'bootstrap')
	at org.apache.iceberg.parquet.ColumnWriter$2.write(ColumnWriter.java:39)
	at org.apache.iceberg.parquet.ParquetValueWriters$PrimitiveWriter.write(ParquetValueWriters.java:131)
	at org.apache.iceberg.parquet.ParquetValueWriters$OptionWriter.write(ParquetValueWriters.java:375)
	at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:608)

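That stack trace is consistent with a writer unboxing its input via an Integer cast while Spark supplies a boxed Byte for the ByteType column; in Java, casting a boxed wrapper to a sibling wrapper type always fails at runtime. A standalone sketch of the failure mode:

    public class CastDemo {
      public static void main(String[] args) {
        // What the writer receives for a Spark ByteType column value:
        Object value = Byte.valueOf((byte) 82);

        // Unboxing through (Integer) throws ClassCastException, because a
        // java.lang.Byte is not a java.lang.Integer.
        int unboxed = (Integer) value;
        System.out.println(unboxed);
      }
    }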

@jkolash
Contributor Author

jkolash commented Apr 25, 2024

Hmm, I think this may be related to the Spark version we are using: I tested on spark-3.4.1 and didn't see the issue, but I do see it on our 3.4.2.

jkolash added a commit to jkolash/iceberg that referenced this issue Apr 25, 2024
jkolash added a commit to jkolash/iceberg that referenced this issue Apr 26, 2024
@jkolash
Contributor Author

jkolash commented Apr 26, 2024

OK, this reproduces via the GitHub Actions build on my public fork:
https://github.com/jkolash/iceberg/actions/runs/8842101257/job/24280206652

TestDataFrameWriterV2 > testByte FAILED
    org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 7) (localhost executor driver): java.lang.ClassCastException: class java.lang.Byte cannot be cast to class java.lang.Integer (java.lang.Byte and java.lang.Integer are in module java.base of loader 'bootstrap')
    	at org.apache.iceberg.parquet.ColumnWriter$2.write(ColumnWriter.java:39)
    	at org.apache.iceberg.parquet.ParquetValueWriters$PrimitiveWriter.write(ParquetValueWriters.java:131)
    	at org.apache.iceberg.parquet.ParquetValueWriters$OptionWriter.write(ParquetValueWriters.java:356)
    	at org.apache.iceberg.parquet.ParquetValueWriters$StructWriter.write(ParquetValueWriters.java:589)

@jkolash
Contributor Author

jkolash commented May 1, 2024

Just wanted to make sure you were aware that reproducing this is pretty simple:

Author: jkolash <[email protected]>
Date:   Thu Apr 25 19:23:22 2024 -0400

    Failing test for issue #10225
    
    https://github.com/apache/iceberg/issues/10225

diff --git a/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java b/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java
index 76b138ced..9193154ce 100644
--- a/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java
+++ b/spark/v3.4/spark/src/test/java/org/apache/iceberg/spark/source/TestDataFrameWriterV2.java
@@ -177,6 +177,17 @@ public class TestDataFrameWriterV2 extends SparkTestBaseWithCatalog {
         sql("select * from %s order by id", tableName));
   }
 
+  @Test
+  public void testByte() {
+    SparkSession sparkSession = spark.cloneSession();
+    Dataset<Row> dataset =
+        sparkSession.sql("select inline(array(from_json('{\"b\": 3}', 'struct<b:byte>')))");
+
+    dataset.show();
+
+    dataset.writeTo(tableName).createOrReplace();
+  }
+
   @Test
   public void testWriteWithCaseSensitiveOption() throws NoSuchTableException, ParseException {
     SparkSession sparkSession = spark.cloneSession();

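For reference, it should be possible to run just this test locally with something like ./gradlew :iceberg-spark:iceberg-spark-3.4_2.12:test --tests org.apache.iceberg.spark.source.TestDataFrameWriterV2 (module path assumed from Iceberg's Gradle layout).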