Please make sure that this is a bug. As per our GitHub Policy, we only address code/doc bugs, performance issues, feature requests and build/installation issues on GitHub. tag:bug_template
System information
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes
OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 18.04
Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: None
TensorFlow installed from (source or binary): source
TensorFlow version (use command below): r1.15.5-deeprec2210-25-ga27850bf1de 1.15.5
Python version: Python 3.6.9
Bazel version (if compiling from source): 0.26.1
GCC/Compiler version (if compiling from source): gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
CUDA/cuDNN version: cuda:11.7.0-cudnn8
GPU model and memory: NVIDIA TITAN V 12288MiB
You can collect some of this information using our environment capture script
You can also obtain the TensorFlow version with: 1. TF 1.0: python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)" 2. TF 2.0: python -c "import tensorflow as tf; print(tf.version.GIT_VERSION, tf.version.VERSION)"
Describe the current behavior
I use Apache Iceberg to generate Parquet files. When the Parquet files are compressed with zstd, ParquetDataset crashes while reading int64 data. I noticed that DeepRec uses Arrow 5.0, but Arrow only supports the DELTA_BINARY_PACKED encoding starting from version 7.0, so I think we need to upgrade the Arrow version; this should not affect compatibility for other users.
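For context, DELTA_BINARY_PACKED represents an int64 column chunk as a first value followed by the deltas between consecutive values, which is why an Arrow build without this decoder cannot read such columns at all. A minimal pure-Python sketch of the core idea (the real format additionally groups deltas into blocks, subtracts a per-block minimum delta, and bit-packs the residuals):

```python
def delta_encode(values):
    # Store the first value, then successive differences. This is the
    # core idea behind Parquet's DELTA_BINARY_PACKED; the on-disk
    # format also bit-packs the deltas in blocks.
    return [values[0]] + [b - a for a, b in zip(values, values[1:])]


def delta_decode(encoded):
    # Rebuild the original values by a running sum over the deltas.
    out = [encoded[0]]
    for d in encoded[1:]:
        out.append(out[-1] + d)
    return out


print(delta_encode([5, 7, 6, 10]))   # -> [5, 2, -1, 4]
print(delta_decode([5, 2, -1, 4]))   # -> [5, 7, 6, 10]
```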
Describe the expected behavior
Works as expected with DELTA_BINARY_PACKED encoding.
```python
import os
import tensorflow as tf
from tensorflow.python.data.experimental.ops.dataframe import DataFrame
from tensorflow.python.data.experimental.ops.parquet_dataset_ops import ParquetDataset
from tensorflow.python.data.ops import dataset_ops


def make_initializable_iterator(ds):
    r"""Wrapper of make_initializable_iterator."""
    if hasattr(dataset_ops, "make_initializable_iterator"):
        return dataset_ops.make_initializable_iterator(ds)
    return ds.make_initializable_iterator()


def parquet_map(record):
    label = record.pop("label")
    return record, label


filename = "part.zstd.parquet"
# filename = "part.gz.parquet"

# Read from a parquet file.
ds = ParquetDataset(
    filename,
    batch_size=4,
    fields=[
        DataFrame.Field("f_2672", tf.int64),
        DataFrame.Field("f_2671", tf.int64, ragged_rank=0),
        DataFrame.Field("f_2673", tf.int64, ragged_rank=0),
        DataFrame.Field("f_5196", tf.float32, ragged_rank=0),
        DataFrame.Field("f_8436", tf.float32, ragged_rank=0),
        DataFrame.Field("label", tf.int32),
    ],
    num_parallel_reads=8,
).map(parquet_map)
ds = ds.prefetch(4)

iterator = make_initializable_iterator(ds)
features, labels = iterator.get_next()

sess_config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=False)
with tf.Session(config=sess_config) as sess:
    sess.run(iterator.initializer)
    for i in range(1):
        feature, label = sess.run([features, labels])
        print(feature)
        print("Label: ")
        print(label)
```
Other info / logs
Include any logs or source code that would be helpful to diagnose the problem. If including tracebacks, please include the full traceback. Large logs and files should be attached.
Code to reproduce the issue
Provide a reproducible test case that is the bare minimum necessary to generate the problem.
part.zstd.parquet: https://drive.google.com/file/d/1CoumvsuL47trnFi4Bn6haRIsgTy9frSE/view?usp=share_link
part.gz.parquet: https://drive.google.com/file/d/1V_cOrjIVTVZ5y7Q4KbHa085ay6GeaZH-/view?usp=share_link