ParquetDataset met coredump when data contain DELTA_BINARY_PACKED encoding #597

Open
fuhailin opened this issue Dec 19, 2022 · 1 comment
@fuhailin
Collaborator

Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): r1.15.5-deeprec2210-25-ga27850bf1de 1.15.5
  • Python version: Python 3.6.9
  • Bazel version (if compiling from source): 0.26.1
  • GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
  • CUDA/cuDNN version: cuda:11.7.0-cudnn8
  • GPU model and memory: NVIDIA TITAN V 12288MiB

You can collect some of this information using our environment capture script.
You can also obtain the TensorFlow version with:
  1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
  2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"

Describe the current behavior

I use Apache Iceberg to generate Parquet files. When the files are compressed with zstd, ParquetDataset crashes with a core dump while reading int64 columns. I noticed that DeepRec uses Arrow 5.0, but Arrow only supports the DELTA_BINARY_PACKED encoding from version 7.0 onward, so I think we need to upgrade the Arrow version. That should not affect compatibility for other users.
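
To check which columns in a file are actually written with this encoding, the Parquet footer metadata can be inspected. The following is a minimal sketch, not part of DeepRec: the helper names `columns_using_encoding` and `inspect_file` are made up for illustration, and it assumes pyarrow is installed in a separate environment from the one that crashes.

```python
def columns_using_encoding(metadata, encoding="DELTA_BINARY_PACKED"):
    """Return the sorted column paths that list `encoding` in any row group.

    `metadata` is expected to behave like pyarrow.parquet.FileMetaData.
    """
    hits = set()
    for rg_index in range(metadata.num_row_groups):
        row_group = metadata.row_group(rg_index)
        for col_index in range(row_group.num_columns):
            chunk = row_group.column(col_index)
            # `encodings` is a tuple of encoding names used by this column chunk.
            if encoding in chunk.encodings:
                hits.add(chunk.path_in_schema)
    return sorted(hits)


def inspect_file(path):
    """Hypothetical convenience wrapper: read only the footer of `path`."""
    # Imported lazily so the helper above can be used without pyarrow.
    import pyarrow.parquet as pq

    return columns_using_encoding(pq.read_metadata(path))
```

Calling `inspect_file("part.zstd.parquet")` on the attached file should list the affected columns, if the encoding is indeed the cause. Only the footer is read, so this does not go through the data-page decoding path.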

Describe the expected behavior
ParquetDataset should read files that use the DELTA_BINARY_PACKED encoding without crashing.

Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
part.zstd.parquet: https://drive.google.com/file/d/1CoumvsuL47trnFi4Bn6haRIsgTy9frSE/view?usp=share_link
part.gz.parquet: https://drive.google.com/file/d/1V_cOrjIVTVZ5y7Q4KbHa085ay6GeaZH-/view?usp=share_link

import os

import tensorflow as tf
from tensorflow.python.data.experimental.ops.dataframe import DataFrame
from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
from tensorflow.python.data.ops import dataset_ops



def make_initializable_iterator(ds):
    r"""Wrapper of make_initializable_iterator."""
    if hasattr(dataset_ops, "make_initializable_iterator"):
        return dataset_ops.make_initializable_iterator(ds)
    return ds.make_initializable_iterator()


def parquet_map(record):
    label = record.pop("label")
    return record, label


filename = "part.zstd.parquet"
# filename = "part.gz.parquet"

# Read from a parquet file.
ds = ParquetDataset(
    filename,
    batch_size=4,
    fields=[
        DataFrame.Field("f_2672", tf.int64),
        DataFrame.Field("f_2671", tf.int64, ragged_rank=0),
        DataFrame.Field("f_2673", tf.int64, ragged_rank=0),
        DataFrame.Field("f_5196", tf.float32, ragged_rank=0),
        DataFrame.Field("f_8436", tf.float32, ragged_rank=0),
        DataFrame.Field("label", tf.int32),
    ],
    num_parallel_reads=8,
).map(parquet_map)
ds = ds.prefetch(4)

iterator = make_initializable_iterator(ds)
features, labels = iterator.get_next()

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)

with tf.Session(config=sess_config) as sess:
    sess.run(iterator.initializer)
    for i in range(1):
        feature, label = sess.run([features, labels])
        print(feature)
        print("Label: ")
        print(label)

Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
image

@liutongxuan
Member

@JackMoriarty
