Beam AnalyzeAndTransformDataset runs expensive transformation _InstanceDictInputToTFXIOInput Twice #296

Open
michaelwsherman opened this issue Feb 16, 2023 · 3 comments

Comments

@michaelwsherman

AnalyzeAndTransformDataset should not run _InstanceDictInputToTFXIOInput twice.

AnalyzeAndTransformDataset runs AnalyzeDataset and TransformDataset back-to-back. AnalyzeDataset runs _InstanceDictInputToTFXIOInput and TransformDataset also runs _InstanceDictInputToTFXIOInput.

But when running AnalyzeAndTransformDataset, the _InstanceDictInputToTFXIOInput call in TransformDataset is unnecessary, since it was already run in AnalyzeDataset.

The _InstanceDictInputToTFXIOInput transformation is expensive, and this redundant call meaningfully increases runtime and cost.
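The redundancy described above can be illustrated with a toy sketch in plain Python. It is not the tft_beam implementation: every function here is a hypothetical stand-in, and the counter only shows how running the two phases back-to-back repeats the conversion, while a combined path could convert once and reuse the result.

```python
# Illustrative stand-ins only; none of this is the real tft_beam code.
calls = {"convert": 0}


def convert_instance_dicts(rows):
    """Hypothetical stand-in for the expensive instance-dict conversion."""
    calls["convert"] += 1
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def analyze(columns):
    """Toy analysis phase: compute the mean of column "x"."""
    return sum(columns["x"]) / len(columns["x"])


def transform(columns, x_mean):
    """Toy transform phase: center column "x" on the analyzed mean."""
    return [x - x_mean for x in columns["x"]]


def analyze_then_transform(rows):
    """Back-to-back phases, each converting the input itself."""
    x_mean = analyze(convert_instance_dicts(rows))
    return transform(convert_instance_dicts(rows), x_mean)


def analyze_and_transform_shared(rows):
    """Combined path that converts once and reuses the result."""
    columns = convert_instance_dicts(rows)
    return transform(columns, analyze(columns))


rows = [{"x": 1.0}, {"x": 3.0}]

analyze_then_transform(rows)
print(calls["convert"])  # 2: one conversion per phase

calls["convert"] = 0
analyze_and_transform_shared(rows)
print(calls["convert"])  # 1: conversion shared by both phases
```

Both paths return the same values; only the number of conversions differs.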

@zoyahav
Member

zoyahav commented Feb 16, 2023

Those who care about performance should be using the optimized TFXIO input path:
https://www.tensorflow.org/tfx/transform/get_started#the_tfxio_format
https://www.tensorflow.org/tfx/transform/get_started#pre-canned_pcollection_sources_tfxio
At this point, instance-dict is mostly a test/experimentation-only input format that should be used only on small datasets.

@michaelwsherman
Author

@kardiff18 @klmilam

@kardiff18

Hi @zoyahav, thank you for the resources. The ask here is actually specific to BigQuery.

IIUC, there is no pre-canned TFXIO input path for BigQuery sources. One exists for CSV, but there is no equivalent for BigQuery unless a user writes the PyArrow RecordBatch conversion code themselves.

Are there any plans to create a tfx_bsl.public.tfxio.BeamRecordBigQueryTFXIO or similar? This would help a lot of our use cases.
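For context on what hand-writing that conversion involves, here is a minimal sketch of its row-to-column step, using only the standard library. The helper name `rows_to_columns` is hypothetical, and the layout (each present value wrapped in a one-element list, missing values as None) is an assumption about what a TFXIO-style RecordBatch would expect, not something confirmed in this thread. Building the actual Arrow batch from these columns, for example with pyarrow's `RecordBatch.from_pydict`, would be a further step that is not shown or verified here.

```python
def rows_to_columns(rows, column_names):
    """Transpose row dicts into per-column lists.

    Assumed layout: each present value becomes a one-element list and
    each missing or null value becomes None.
    """
    columns = {name: [] for name in column_names}
    for row in rows:
        for name in column_names:
            value = row.get(name)
            columns[name].append(None if value is None else [value])
    return columns


# Hypothetical rows shaped like dicts read from a BigQuery table.
rows = [
    {"user_id": 1, "country": "US"},
    {"user_id": None, "country": "DE"},
    {"country": "FR"},
]
columns = rows_to_columns(rows, ["user_id", "country"])
print(columns)
```

The column names are passed explicitly so that rows with missing keys still produce columns of equal length.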
