Beam AnalyzeAndTransformDataset runs expensive transformation _InstanceDictInputToTFXIOInput Twice #296
Comments
Those who care about performance should be using the optimized TFXIO input path - |
Hi @zoyahav thank you for the resources. The ask here is actually specific to BigQuery. IIUC, there is no tfxio precanned input path for BigQuery sources. It seems like it exists for CSV, but there is no equivalent for BigQuery, unless a user writes the PyArrow RecordBatch conversion code themselves. Are there any plans to create a |
`AnalyzeAndTransformDataset` should not run `_InstanceDictInputToTF` twice.
`AnalyzeAndTransformDataset` runs `AnalyzeDataset` and `TransformDataset` back-to-back. `AnalyzeDataset` runs `_InstanceDictInputToTFXIOInput`, and `TransformDataset` also runs `_InstanceDictInputToTFXIOInput`.
But when running `AnalyzeAndTransformDataset`, the `_InstanceDictInputToTFXIOInput` call in `TransformDataset` is unnecessary, since it was already run in `AnalyzeDataset`.
The `_InstanceDictInputToTFXIOInput` transformation is expensive, and this redundant call meaningfully increases runtime and cost.
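
To make the redundancy concrete, here is a toy model in plain Python. It does not use `tensorflow_transform` or Beam, and it is not the library's actual code. `CountingConverter`, `analyze`, `transform`, and both `analyze_and_transform_*` functions are hypothetical stand-ins that only mirror the call pattern described in this issue: one expensive conversion step that either runs once per phase or runs once and is shared.

```python
from typing import Dict, List

Row = Dict[str, float]
Columns = Dict[str, List[float]]


class CountingConverter:
    """Hypothetical stand-in for an expensive row-dict to columnar conversion."""

    def __init__(self) -> None:
        self.calls = 0  # how many times the expensive step ran

    def __call__(self, rows: List[Row]) -> Columns:
        self.calls += 1
        columns: Columns = {}
        for row in rows:
            for name, value in row.items():
                columns.setdefault(name, []).append(value)
        return columns


def analyze(columns: Columns) -> Dict[str, float]:
    """Toy analysis phase: compute the mean of each column."""
    return {name: sum(vals) / len(vals) for name, vals in columns.items()}


def transform(columns: Columns, means: Dict[str, float]) -> Columns:
    """Toy transform phase: center each column on its mean."""
    return {name: [v - means[name] for v in vals] for name, vals in columns.items()}


def analyze_and_transform_naive(rows: List[Row], convert: CountingConverter) -> Columns:
    # Each phase converts the raw rows on its own, so the conversion runs twice.
    means = analyze(convert(rows))
    return transform(convert(rows), means)


def analyze_and_transform_shared(rows: List[Row], convert: CountingConverter) -> Columns:
    # Convert once and hand the same converted data to both phases.
    columns = convert(rows)
    means = analyze(columns)
    return transform(columns, means)
```

Both variants return the same result; only the number of conversion calls differs, and that count is the cost this issue is about. Whether the real pipeline can share the converted input this way depends on `tft_beam` internals that this sketch does not model.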