Beam AnalyzeAndTransformDataset runs expensive transformation _InstanceDictInputToTFXIOInput Twice #296

michaelwsherman · 2023-02-16T07:39:48Z

AnalyzeAndTransformDataset should not run _InstanceDictInputToTF twice.

AnalyzeAndTransformDataset runs AnalyzeDataset and TransformDataset back-to-back. AnalyzeDataset runs _InstanceDictInputToTFXIOInput and TransformDataset also runs _InstanceDictInputToTFXIOInput.

But when running AnalyzeAndTransformDataset, the _InstanceDictInputToTFXIOInput call in TransformDataset is unnecessary, since it was already run in AnalyzeDataset.

The _InstanceDictInputToTFXIOInput transformation is expensive, and this redundant call meaningfully increases runtime and cost.
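To make the shape of the problem concrete, here is a toy, stdlib-only model of it. This is not tf.Transform code: `CountingConverter`, `analyze`, `transform`, and the two `analyze_and_transform_*` functions are hypothetical stand-ins, and the converter only counts how often an "expensive" row-dict conversion runs. It is a sketch of the requested behavior (convert once, reuse for both phases), assuming the converted data can be shared between the two phases.

```python
from typing import Dict, List

Row = Dict[str, float]
Batch = Dict[str, List[float]]


class CountingConverter:
    """Hypothetical stand-in for an expensive row-dict to batch conversion."""

    def __init__(self) -> None:
        self.calls = 0

    def __call__(self, rows: List[Row]) -> Batch:
        # Count every invocation so the redundant work is visible.
        self.calls += 1
        batch: Batch = {}
        for row in rows:
            for key, value in row.items():
                batch.setdefault(key, []).append(value)
        return batch


def analyze(batch: Batch) -> Dict[str, float]:
    # Toy "analysis" phase: compute the mean of column "x".
    return {"x_mean": sum(batch["x"]) / len(batch["x"])}


def transform(batch: Batch, stats: Dict[str, float]) -> List[float]:
    # Toy "transform" phase: center column "x" using the analysis result.
    return [value - stats["x_mean"] for value in batch["x"]]


def analyze_and_transform_naive(rows: List[Row], convert: CountingConverter) -> List[float]:
    # Each phase converts the raw rows itself, so the conversion runs twice.
    stats = analyze(convert(rows))
    return transform(convert(rows), stats)


def analyze_and_transform_shared(rows: List[Row], convert: CountingConverter) -> List[float]:
    # Convert once and feed the same converted batch to both phases.
    batch = convert(rows)
    stats = analyze(batch)
    return transform(batch, stats)
```

Both variants return the same result for the same rows; only the number of conversion calls differs (two versus one).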

zoyahav · 2023-02-16T09:10:33Z

Those who care about performance should be using the optimized TFXIO input path:
https://www.tensorflow.org/tfx/transform/get_started#the_tfxio_format
https://www.tensorflow.org/tfx/transform/get_started#pre-canned_pcollection_sources_tfxio
At this point, instance-dict is mostly a test/experimentation-only input format that should be used only on small datasets.

michaelwsherman · 2023-02-16T23:01:41Z

@kardiff18 @klmilam

kardiff18 · 2023-03-03T20:56:53Z

Hi @zoyahav, thank you for the resources. The ask here is actually specific to BigQuery.

IIUC, there is no tfxio precanned input path for BigQuery sources. It seems like it exists for CSV, but there is no equivalent for BigQuery, unless a user writes the PyArrow RecordBatch conversion code themselves.

Are there any plans to create a tfx_bsl.public.tfxio.BeamRecordBigQueryTFXIO or similar? This would help a lot of our use cases.
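For context on what "writing the conversion code themselves" involves, the core step is pivoting row dicts (the shape BigQuery rows typically arrive in from Beam) into columnar batches. Below is a minimal stdlib-only sketch of that pivot. The function name `rows_to_column_batches` and its parameters are hypothetical, and a plain dict of lists stands in for a `pyarrow.RecordBatch`. It does not address the column types or schema that the TFXIO path would expect.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator, List, Optional, Sequence

Row = Dict[str, object]
ColumnBatch = Dict[str, List[Optional[object]]]


def rows_to_column_batches(
    rows: Iterable[Row], columns: Sequence[str], batch_size: int
) -> Iterator[ColumnBatch]:
    """Group row dicts into columnar batches (hypothetical helper).

    Each yielded batch maps a column name to a list of values, one per row.
    A missing key becomes None so all columns in a batch have equal length.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    iterator = iter(rows)
    while True:
        # Take up to batch_size rows; stop once the input is exhausted.
        chunk = list(islice(iterator, batch_size))
        if not chunk:
            return
        yield {name: [row.get(name) for row in chunk] for name in columns}
```

In a real pipeline, each yielded dict would still need to be turned into an Arrow RecordBatch with a schema matching the one given to tf.Transform. That part is left out here because it depends on the feature types involved.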
