[Bug]: Python expansion with multiple SqlTransforms is extremely slow #31227
Comments
The workaround looks good for the Python Direct Runner. @tvalentyn
cc: @chamikaramj
The downloaded jars should be cached. Probably this caching doesn't work for your environment? You also have the option of manually specifying the jar [1] or manually starting up an expansion service [2]. [1]
@chamikaramj The cache hit will never be detected for the downloaded JARs because of this line:
It always evaluates to False. A worse problem, though, is that, as mentioned above, the ResolveArtifacts() call itself is slow.
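To illustrate why the cache check can never succeed: a guard of the form `cond and False` is dead code, so the branch it protects is unreachable. A self-contained sketch of the pattern (a simplified stand-in, not the actual artifact_service.py source):

```python
downloads = 0

def fetch(path, cache):
    global downloads
    # `path in cache and False` always evaluates to False, so the
    # cached-copy branch is unreachable and every call falls through
    # to the "download" path.
    if path in cache and False:
        return cache[path]
    downloads += 1  # simulate the expensive jar download
    cache[path] = b"jar-bytes"
    return cache[path]

cache = {}
fetch("beam-sql.jar", cache)
fetch("beam-sql.jar", cache)
print(downloads)  # 2: the second call re-downloads despite the warm cache
```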
@robertwb can you check this?
Hi, any news? I have also encountered this exact same issue.
Expansion service jar is cached elsewhere when starting up the expansion service and served to the Python side using the proto definition at beam/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto (line 1437 in fed6489).
What happened?
When building a Pipeline with multiple SqlTransforms from Beam Python, the expansion that happens in SqlTransforms is currently (Beam 2.55.0) extremely inefficient.
This inefficiency has multiple sources: the downloaded JARs are never served from cache, and each artifact-resolution call is slow. The latter dominates execution time.
For example, running a pipeline with 31 trivial SQL transforms from a 4 vCPU, 2 core, 16 GB memory machine (a standard Dataflow workbench setup) takes 200 seconds to execute. (See example below.)
We found a somewhat dirty workaround that speeds things up by skipping
SqlTransform._resolve_artifacts()
altogether when working from inside Jupyter. This brings execution time down from 200 s to 22 s.
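The workaround amounts to monkeypatching the resolution step into a no-op. A minimal, self-contained sketch of the pattern, using a stand-in class since the exact signature of `SqlTransform._resolve_artifacts` may differ between Beam versions:

```python
class FakeSqlTransform:
    """Stand-in for Beam's SqlTransform; the real class is not imported here."""

    def _resolve_artifacts(self, artifacts):
        # Pretend this is the slow ResolveArtifacts gRPC round-trip.
        raise RuntimeError("slow artifact resolution")

# The workaround: replace the method with an identity function so the
# already-available artifacts are passed through without any RPC.
FakeSqlTransform._resolve_artifacts = lambda self, artifacts: artifacts

t = FakeSqlTransform()
resolved = t._resolve_artifacts(["beam-sql.jar"])
print(resolved)  # ['beam-sql.jar'] — returned untouched, no slow path taken
```

Patching a private method like this is brittle across Beam releases, which is why the issue describes it as a dirty workaround rather than a fix.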
I suspect these inefficiencies also contribute to beam_sql being extremely slow even for trivial queries.
apache_beam/runners/portability/artifact_service.py contains a code snippet that might be one of the culprits for this inefficiency (note the `and False`).

In addition, once the ExpansionService is cached it only takes 100–200 ms to perform the actual SQL expansion, but the
ArtifactRetrievalService.ResolveArtifacts()
call takes 1.5 s per SQL query even without downloading the actual files. This dominates the expansion time, which in turn dominates the overall time of launching and running a pipeline.

So the hotspot call sequence is something like:
SqlTransform.expand()
ExternalTransform.expand()
ArtifactRetrievalService.ResolveArtifacts()
The times may not sound like much, but latency is bad enough to ruin the Jupyter REPL experience when combining Python + SQL.
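Per-call latencies like the 1.5 s figure above can be confirmed with simple wall-clock timing around the suspect call. A generic sketch using a stand-in function (Beam's actual gRPC stubs are not imported here; the sleep merely represents the observed round-trip):

```python
import time

def resolve_artifacts_stub():
    # Stand-in for ArtifactRetrievalService.ResolveArtifacts(); sleeps
    # briefly to represent the per-query round-trip described above.
    time.sleep(0.01)

calls = 5
start = time.monotonic()
for _ in range(calls):
    resolve_artifacts_stub()
per_call = (time.monotonic() - start) / calls
print(f"{per_call:.3f} s per call")
```

With 31 SQL transforms each paying roughly 1.5 s here, artifact resolution alone accounts for around 46 s of the observed 200 s run.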
Code to repro and demonstrate the workaround.
Issue Priority
Priority: 2 (default / most bugs should be filed as P2)
Issue Components