[Bug]: Python expansion with multiple SqlTransforms is extremely slow #31227

Open · 1 of 16 tasks

gergely-g opened this issue May 9, 2024 · 7 comments

gergely-g commented May 9, 2024

What happened?

When building a Pipeline with multiple SqlTransforms from Beam Python, the expansion that happens in SqlTransforms is currently (Beam 2.55.0) extremely inefficient.

This inefficiency has multiple sources.

  1. By default, a new BeamJarExpansionService() is started for each SqlTransform.
  2. The ResolveArtifacts call will unconditionally download the 300MB beam-sdks-java-extensions-sql-expansion-service-2.55.0.jar to a temporary directory for each step (!).

The latter dominates execution time.
For example, running Beam on a 4 vCPU (2 core), 16 GB memory machine (a standard Dataflow Workbench setup), a Pipeline with 31 trivial SQL transforms takes 200 seconds to execute.
(See example below.)

We found a somewhat dirty workaround that speeds things up by skipping SqlTransform._resolve_artifacts() altogether when working from inside Jupyter.

This brings the execution time down from 200 s to 22 s.

I suspect these inefficiencies also contribute to beam_sql being extremely slow even for trivial queries.

apache_beam/runners/portability/artifact_service.py contains this code snippet that might be one of the culprits for this inefficiency (note the trailing `and False`, which makes the cache check dead code):

    if os.path.exists(
        payload.path) and payload.sha256 and payload.sha256 == sha256(
            payload.path) and False:
      return artifact
    else:
      return store_artifact(artifact, service, dest_dir)
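For reference, here is a minimal self-contained sketch of what the check presumably should do once the dead `and False` is removed. The helpers `file_sha256` and `maybe_store_artifact` are simplified stand-ins for Beam's actual `sha256()` and `store_artifact()`, not the real implementations:

```python
import hashlib
import os
import shutil


def file_sha256(path):
    """Hex SHA-256 digest of a file, analogous to Beam's sha256() helper."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def maybe_store_artifact(path, expected_sha256, dest_dir):
    """Return the existing path on a verified cache hit; otherwise copy
    the artifact into dest_dir (a stand-in for store_artifact)."""
    # Without the stray `and False`, a matching hash short-circuits the copy.
    if os.path.exists(path) and expected_sha256 \
            and expected_sha256 == file_sha256(path):
        return path
    dest = os.path.join(dest_dir, os.path.basename(path))
    shutil.copy(path, dest)
    return dest
```

With the condition restored, an artifact whose on-disk hash matches the payload's hash is returned as-is instead of being re-downloaded or re-copied.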

In addition, once the ExpansionService is cached, the actual SQL expansion takes only 100-200 ms, but the ArtifactRetrievalService.ResolveArtifacts() call takes 1.5 s per SQL query even without downloading the actual files. This dominates the expansion time, which in turn dominates the overall time to launch and run a pipeline.

So the hotspot call sequence is something like:

  • SqlTransform.expand()
  • ExternalTransform.expand()
  • ArtifactRetrievalService.ResolveArtifacts()

The times may not sound like much, but the latency is bad enough to ruin the Jupyter REPL experience when combining Python + SQL.
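To confirm that breakdown in your own environment, you can wrap the suspect methods in a timing shim before constructing the pipeline. This is a generic sketch; the `instrument` helper is hypothetical, and the commented-out Beam names are the targets from the call sequence above (their exact import paths are an assumption):

```python
import functools
import time


def instrument(cls, method_name, log=print):
    """Monkey-patch cls.method_name to report wall-clock duration per call."""
    original = getattr(cls, method_name)

    @functools.wraps(original)
    def timed(*args, **kwargs):
        start = time.monotonic()
        try:
            return original(*args, **kwargs)
        finally:
            log(f"{cls.__name__}.{method_name} took "
                f"{time.monotonic() - start:.3f}s")

    setattr(cls, method_name, timed)


# Usage against Beam (assumed import path, patch before building the pipeline):
# from apache_beam.runners.portability.artifact_service import (
#     ArtifactRetrievalService)
# instrument(ArtifactRetrievalService, 'ResolveArtifacts')
```

Each patched call then prints its duration, making it easy to see whether artifact resolution really accounts for the ~1.5 s per transform.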

Code to reproduce the issue and demonstrate the workaround:

import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform
from apache_beam.transforms.external import BeamJarExpansionService
import time

# By default this pipeline of 31 SQL transforms takes 200s to execute.
# With the short-circuiting below execution time is reduced to 22s.

class FasterSqlTransform(SqlTransform):
    def _resolve_artifacts(self, components, service, dest):
        # Short circuit unnecessary call that results in the downloading of the 300MB
        # beam-sdks-java-extensions-sql-expansion-service-2.55.0.jar for each transform.
        return components

start_time = time.time()

# FasterSqlTransform with naive but shared BeamJarExpansionService
with beam.Pipeline() as p:
    with BeamJarExpansionService(
            ':sdks:java:extensions:sql:expansion-service:shadowJar'
    ) as shared_expansion_service:
        sql_result = (
            p
            | "Begin SQL" >> FasterSqlTransform(
                "SELECT 'hello' AS message, 1 AS counter",
                expansion_service=shared_expansion_service))
        for i in range(30):
            sql_result = (
                sql_result
                | f"SQL {i}" >> FasterSqlTransform(
                    "SELECT message, counter + 1 AS counter FROM PCOLLECTION",
                    expansion_service=shared_expansion_service))
        sql_result | beam.LogElements()

print(f"Pipeline took {time.time() - start_time} s to execute")

Issue Priority

Priority: 2 (default / most bugs should be filed as P2)

Issue Components

  • Component: Python SDK
  • Component: Java SDK
  • Component: Go SDK
  • Component: Typescript SDK
  • Component: IO connector
  • Component: Beam YAML
  • Component: Beam examples
  • Component: Beam playground
  • Component: Beam katas
  • Component: Website
  • Component: Spark Runner
  • Component: Flink Runner
  • Component: Samza Runner
  • Component: Twister2 Runner
  • Component: Hazelcast Jet Runner
  • Component: Google Cloud Dataflow Runner
liferoad (Collaborator) commented May 9, 2024

The workaround looks good for the Python Direct Runner. @tvalentyn

tvalentyn (Contributor) commented

cc: @chamikaramj

chamikaramj (Contributor) commented

> The ResolveArtifacts call will unconditionally download the 300MB beam-sdks-java-extensions-sql-expansion-service-2.55.0.jar to a temporary directory for each step

The downloaded jars should be cached. Perhaps this caching doesn't work in your environment?

You also have the option of manually specifying the jar [1] or manually starting up an expansion service [2].

[1] --beam_services="{\":sdks:java:extensions:sql:expansion-service:shadowJar\": \"$EXPANSION_SERVICE_JAR\"}"
[2] https://beam.apache.org/documentation/sdks/python-multi-language-pipelines/#choose-an-expansion-service
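As a sketch of those two options in practice (the port number is arbitrary, and `$EXPANSION_SERVICE_JAR` must point at a locally available copy of the jar):

```shell
# Option [2]: start the SQL expansion service once, outside the pipeline,
# so every SqlTransform reuses it instead of spawning its own:
java -jar beam-sdks-java-extensions-sql-expansion-service-2.55.0.jar 8097

# Then point each transform at it from Python:
#   SqlTransform("SELECT ...", expansion_service='localhost:8097')

# Option [1]: tell Beam to use a pre-downloaded jar instead of fetching it:
python pipeline.py \
  --beam_services='{":sdks:java:extensions:sql:expansion-service:shadowJar": "'"$EXPANSION_SERVICE_JAR"'"}'
```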

gergely-g (Author) commented

@chamikaramj The cache hit will never be detected for the downloaded JARs because of the `and False` condition in the artifact_service.py snippet quoted above: it always evaluates to False.

A worse problem, though, is that, as mentioned above, the ArtifactRetrievalService.ResolveArtifacts() call takes 1.5 s per SQL query even without downloading the actual files.

liferoad (Collaborator) commented

@robertwb can you check this?

wollowizard commented

Hi, any news? I have encountered this exact same issue.

chamikaramj (Contributor) commented

The `and False` part does seem like a bug, but I don't think that code path actually gets hit, since the Java expansion response serves Beam artifacts as DEFERRED artifacts that are retrieved from the locally available expansion service (so the URN is DEFERRED, not FILE).

https://github.com/apache/beam/blob/fed6489124000b3f222dc444136009ab22e4846e/sdks/java/core/src/main/java/org/apache/beam/sdk/util/construction/Environments.java#L452C31-L452C48

The expansion service jar is cached elsewhere when the expansion service starts up and is served to the Python side via the ArtifactRetrievalService.ResolveArtifacts() API. That might be what adds the O(seconds) per-query delay you are observing, unfortunately.

    cached_jar = os.path.join(cache_dir, os.path.basename(url))
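That `cached_jar` line keys the cache by the jar's basename. A self-contained sketch of that lookup pattern (simplified; `download` is a caller-supplied stand-in for the actual fetch, not Beam's downloader):

```python
import os


def get_or_download_jar(url, cache_dir, download):
    """Reuse a previously downloaded jar if one with the same basename
    already exists in cache_dir; otherwise fetch it exactly once."""
    os.makedirs(cache_dir, exist_ok=True)
    # Cache key is the basename of the URL, mirroring the line quoted above.
    cached_jar = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(cached_jar):
        download(url, cached_jar)
    return cached_jar
```

When this cache works, the 300 MB jar is fetched once per machine rather than once per transform.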


No branches or pull requests

6 participants