ParquetDataset met coredump when data contain DELTA_BINARY_PACKED encoding #597

fuhailin · 2022-12-19T09:34:51Z

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): r1.15.5-deeprec2210-25-ga27850bf1de 1.15.5
Python version: Python 3.6.9
Bazel version (if compiling from source): 0.26.1
GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA/cuDNN version: cuda:11.7.0-cudnn8
GPU model and memory: NVIDIA TITAN V 12288MiB

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

I use Apache Iceberg to generate Parquet files. When the files are compressed with zstd, ParquetDataset crashes with a core dump while reading int64 columns. I noticed that DeepRec uses Arrow 5.0, but as far as I can tell Arrow only supports the DELTA_BINARY_PACKED encoding from version 7.0 onward, so I think we need to upgrade the Arrow version. I don't expect the upgrade to break compatibility for other users.
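
One way to confirm which columns actually use DELTA_BINARY_PACKED is to read only the Parquet footer with pyarrow and list the encodings recorded for each column chunk. The following is a minimal sketch and not part of the original reproduction: it assumes a separately installed pyarrow, and the helper names `summarize_encodings`, `columns_using`, and `inspect_file` are illustrative.

```python
from collections import defaultdict


def summarize_encodings(metadata):
    """Map each column path to the set of (compression, encoding) pairs used.

    `metadata` is expected to behave like pyarrow.parquet.FileMetaData.
    """
    summary = defaultdict(set)
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            # A single column chunk can list several encodings
            # (for example one for data pages, one for levels).
            for encoding in chunk.encodings:
                summary[chunk.path_in_schema].add((chunk.compression, encoding))
    return dict(summary)


def columns_using(summary, encoding="DELTA_BINARY_PACKED"):
    """Return the sorted column paths whose chunks use the given encoding."""
    return sorted(
        name
        for name, pairs in summary.items()
        if any(enc == encoding for _, enc in pairs)
    )


def inspect_file(path):
    """Read only the footer of a Parquet file and summarize its encodings."""
    # Imported lazily; assumes pyarrow is installed in the environment.
    import pyarrow.parquet as pq

    return summarize_encodings(pq.read_metadata(path))
```

For example, `columns_using(inspect_file("part.zstd.parquet"))` would list the columns in the attached file that were written with DELTA_BINARY_PACKED, which can then be compared against the int64 fields passed to ParquetDataset.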

Describe the expected behavior
ParquetDataset should read files that use the DELTA_BINARY_PACKED encoding without crashing.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
part.zstd.parquet: https://drive.google.com/file/d/1CoumvsuL47trnFi4Bn6haRIsgTy9frSE/view?usp=share_link
part.gz.parquet: https://drive.google.com/file/d/1V_cOrjIVTVZ5y7Q4KbHa085ay6GeaZH-/view?usp=share_link

import os

import tensorflow as tf
from tensorflow.python.data.experimental.ops.dataframe import DataFrame
from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
from tensorflow.python.data.ops import dataset_ops



def make_initializable_iterator(ds):
    r"""Wrapper of make_initializable_iterator."""
    if hasattr(dataset_ops, "make_initializable_iterator"):
        return dataset_ops.make_initializable_iterator(ds)
    return ds.make_initializable_iterator()


def parquet_map(record):
    label = record.pop("label")
    return record, label


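# zstd-compressed file generated by Apache Iceberg; this is the one that crashes.
filename = "part.zstd.parquet"
# filename = "part.gz.parquet"  # gzip-compressed file, for comparison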

# Read from a parquet file.
ds = ParquetDataset(
    filename,
    batch_size=4,
    fields=[
        DataFrame.Field("f_2672", tf.int64),
        DataFrame.Field("f_2671", tf.int64, ragged_rank=0),
        DataFrame.Field("f_2673", tf.int64, ragged_rank=0),
        DataFrame.Field("f_5196", tf.float32, ragged_rank=0),
        DataFrame.Field("f_8436", tf.float32, ragged_rank=0),
        DataFrame.Field("label", tf.int32),
    ],
    num_parallel_reads=8,
).map(parquet_map)
ds = ds.prefetch(4)

iterator = make_initializable_iterator(ds)
features, labels = iterator.get_next()

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)

with tf.Session(config=sess_config) as sess:
    sess.run(iterator.initializer)
    # Fetch a single batch.
    for _ in range(1):
        feature, label = sess.run([features, labels])
        print(feature)
        print("Label: ")
        print(label)

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.


liutongxuan · 2023-05-26T14:33:36Z

@JackMoriarty

liutongxuan assigned fuhailin Dec 19, 2022
