
[SUPPORT] Error executing Merge Or Read #11199

Open
jai20242 opened this issue May 13, 2024 · 3 comments
Labels
writer-core Issues relating to core transactions/write actions

Comments

@jai20242

Hello.

I am developing a project on Hudi 0.14.1 with my own custom payload class. I have a problem with Merge On Read because the data is not consistent. With Copy On Write it works well, but with Merge On Read Hudi calls getInsertValue before combineAndGetUpdateValue even when the row already exists. That is a problem for me because in getInsertValue I replace the value @#=BDP_N=#@ with null (when the row does not exist, that column should be null). Then, in combineAndGetUpdateValue, I start from the previous row and overwrite only the modified columns; if an incoming column is null, I keep the previous value.
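
The real CustomOverwriteWithLatestAvroPayload is not shown here; the following is only a minimal Scala sketch of the behaviour described above (class body, constructors and sentinel handling are illustrative assumptions), to make the semantics concrete:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, IndexedRecord}
import org.apache.hudi.avro.HoodieAvroUtils
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
import org.apache.hudi.common.util.{Option => HOption}
import scala.collection.JavaConverters._

// Illustrative sketch only, not the real class from this issue.
class CustomOverwriteWithLatestAvroPayload(record: GenericRecord, orderingVal: Comparable[_])
  extends OverwriteWithLatestAvroPayload(record, orderingVal) {

  def this(record: HOption[GenericRecord]) =
    this(if (record.isPresent) record.get else null, Integer.valueOf(0))

  private val Sentinel = "@#=BDP_N=#@"

  // Insert path: the row does not exist yet, so the sentinel simply becomes null.
  override def getInsertValue(schema: Schema): HOption[IndexedRecord] = {
    if (recordBytes.length == 0) return HOption.empty()
    val incoming = HoodieAvroUtils.bytesToAvro(recordBytes, schema)
    schema.getFields.asScala.foreach { f =>
      val v = incoming.get(f.pos())
      if (v != null && v.toString == Sentinel) incoming.put(f.pos(), null)
    }
    HOption.of(incoming: IndexedRecord)
  }

  // Update path: a null incoming column means "not modified" (keep the stored value),
  // while the sentinel means "explicitly set this column to null".
  override def combineAndGetUpdateValue(currentValue: IndexedRecord, schema: Schema): HOption[IndexedRecord] = {
    if (recordBytes.length == 0) return HOption.empty()
    val incoming = HoodieAvroUtils.bytesToAvro(recordBytes, schema)
    schema.getFields.asScala.foreach { f =>
      val v = incoming.get(f.pos())
      if (v == null) incoming.put(f.pos(), currentValue.get(f.pos()))
      else if (v.toString == Sentinel) incoming.put(f.pos(), null)
    }
    HOption.of(incoming: IndexedRecord)
  }
}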

Let me explain with an example:

1) Insert a first row

spark.createDataFrame(Seq(("1",2L,"value1","value2","2",false))).toDF("key", "sort","field1","field2","partition","_hoodie_is_deleted").
write.format("hudi").
option(OPERATION_OPT_KEY, "upsert").
option(CDC_ENABLED.key(), "true").
option(TABLE_NAME, "pruebasc").
option("hoodie.datasource.write.payload.class","CustomOverwriteWithLatestAvroPayload").
option("hoodie.avro.schema.validate","false").
option("hoodie.datasource.write.recordkey.field","key").
option("hoodie.datasource.write.precombine.field","sort").
option("hoodie.datasource.write.new.columns.nullable", "true").
option("hoodie.datasource.write.reconcile.schema","true").
option("hoodie.metadata.enable","false").
option("hoodie.index.type","SIMPLE").
option("hoodie.datasource.write.table.type","MERGE_ON_READ").
option("hoodie.compact.inline","true").
mode(Overwrite).
save("/tmp/pruebasc")
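
The merged row can then be checked with a plain snapshot read (a sketch; the default snapshot query type is assumed):

spark.read.format("hudi").
load("/tmp/pruebasc").
select("key", "sort", "field1", "field2", "partition", "_hoodie_is_deleted").
show(false)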

I have the following row:
key -> 1
sort -> 2
field1 -> value1
field2 -> value2
partition -> 2
_hoodie_is_deleted -> false

2) I update the row. The custom payload replaces the value @#=BDP_N=#@ with null, and the columns that are not modified keep their previous value

spark.createDataFrame(Seq(("1",3L,"@#=BDP_N=#@","value3","2",false))).toDF("key", "sort","field1","field2","partition","_hoodie_is_deleted").withColumn("field1", col("field1").cast("String")).
write.format("hudi").
option(OPERATION_OPT_KEY, "upsert").
option(CDC_ENABLED.key(), "true").
option(TABLE_NAME, "pruebasc").
option("hoodie.datasource.write.payload.class","CustomOverwriteWithLatestAvroPayload").
option("hoodie.avro.schema.validate","false").
option("hoodie.datasource.write.recordkey.field","key").
option("hoodie.datasource.write.precombine.field","sort").
option("hoodie.datasource.write.new.columns.nullable", "true"). //Las columnas nuevas insertadas las pone a null
option("hoodie.datasource.write.reconcile.schema","true"). //coge el esquema de lo insertado y no del nuevo registro. Importante que todo sea nulo menos la PK
option("hoodie.metadata.enable","false").
option("hoodie.index.type","SIMPLE").
option("hoodie.datasource.write.table.type","MERGE_ON_READ").
option("hoodie.compact.inline","true").
mode(Append).
save("/tmp/pruebasc")

After the update I get the following row:
key -> 1
sort -> 3
field1 -> value1
field2 -> value3
partition -> 2
_hoodie_is_deleted -> false

But the correct row should be:
key -> 1
sort -> 3
field1 -> null
field2 -> value3
partition -> 2
_hoodie_is_deleted -> false
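
One way to narrow this down (a sketch, not from the original report): compare the snapshot view with the read-optimized view of the MOR table via the standard hoodie.datasource.query.type read option. If the two views disagree on field1, the wrong value is being produced on the log-merge path.

spark.read.format("hudi").
option("hoodie.datasource.query.type", "snapshot").
load("/tmp/pruebasc").
select("key", "sort", "field1", "field2").show(false)

spark.read.format("hudi").
option("hoodie.datasource.query.type", "read_optimized").
load("/tmp/pruebasc").
select("key", "sort", "field1", "field2").show(false)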

I have checked my logs and the problem is that with Merge On Read, Hudi calls getInsertValue before combineAndGetUpdateValue even when the key already exists.

Regards

@danny0405
Contributor

There is no guarantee that #getInsertValue is invoked after #combineAndGetUpdateValue; actually we need the value returned by #getInsertValue to hand over to #combineAndGetUpdateValue.

@jai20242
Author

jai20242 commented May 14, 2024

And why does it only happen with Merge On Read? I have also tested version 1.0.0-beta and it doesn't happen there (it works well, but we can't use a beta version in production, and it works worse than the 0.x versions). With Copy On Write, and with version 1.0.0-beta, updates only call combineAndGetUpdateValue.
I don't understand why a method meant for inserting data needs to be called before an update; it means the two methods are not isolated.

@danny0405
Contributor

And why does it only happen with Merge On Read

My guess is that we use #getInsertValue to deserialize the payload from the log files. Maybe I'm misremembering some details; since you already test it locally, you can trace the invocation chain.
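
A minimal tracing sketch for that (assuming slf4j is on the classpath; the class name and logging are illustrative, not from this issue): temporarily point hoodie.datasource.write.payload.class at a wrapper that logs a stack trace on every call, then inspect the executor logs to see which code path invokes which method.

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericRecord, IndexedRecord}
import org.apache.hudi.common.model.OverwriteWithLatestAvroPayload
import org.apache.hudi.common.util.{Option => HOption}
import org.slf4j.LoggerFactory

// Hypothetical tracing wrapper: logs a stack trace on every call so the
// Merge On Read invocation chain can be inspected in the executor logs.
class TracingPayload(record: GenericRecord, orderingVal: Comparable[_])
  extends OverwriteWithLatestAvroPayload(record, orderingVal) {

  def this(record: HOption[GenericRecord]) =
    this(if (record.isPresent) record.get else null, Integer.valueOf(0))

  private val log = LoggerFactory.getLogger(classOf[TracingPayload])

  override def getInsertValue(schema: Schema): HOption[IndexedRecord] = {
    log.warn("getInsertValue", new Exception("call site"))
    super.getInsertValue(schema)
  }

  override def combineAndGetUpdateValue(current: IndexedRecord, schema: Schema): HOption[IndexedRecord] = {
    log.warn("combineAndGetUpdateValue", new Exception("call site"))
    super.combineAndGetUpdateValue(current, schema)
  }
}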

@codope added the writer-core (Issues relating to core transactions/write actions) label on May 31, 2024