test(pyspark): validate pyspark-specific tests for streaming #8945
base: main
Conversation
Branch updated: aa81cd4 → 1c5680b → b8894d0
# E.g.,
# cursor.query.writeStream.format("memory").queryName("table_name").start()
#
# This in-memory table might conflict with those defined
Can we use arbitrary names from the name generator? The chance of clashing is never 0, but it's pretty low.
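A rough sketch of that suggestion, assuming a name generator such as ibis.util.gen_name is acceptable here (a uuid suffix would work just as well):

from ibis import util

# Sketch only: give the in-memory sink a collision-resistant query name
# instead of a hard-coded "table_name".
query_name = util.gen_name("memory_sink")
cursor.query.writeStream.format("memory").queryName(query_name).start()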
import glob

files = glob.glob(f"{dir_}/*.parquet")
df_list = [pd.read_parquet(f) for f in files]
I think you can pass the files list directly to pd.read_parquet, or just the directory itself.
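A minimal sketch of both options; whether pd.read_parquet accepts a list of paths depends on the pandas/pyarrow versions in use, so treat that variant as an assumption:

import pandas as pd

# Option 1: point pandas/pyarrow at the directory and let it pick up every
# parquet file underneath.
df = pd.read_parquet(dir_)

# Option 2: pass the collected file list directly (recent pyarrow engine).
df = pd.read_parquet(files, engine="pyarrow")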
.option("path", dir_) | ||
.trigger(availableNow=True) | ||
.start() | ||
.awaitTermination() |
Are you assuming here that the job will always finish? What if it's a continuous query?
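For reference, a small sketch of the distinction (df stands in for the streaming DataFrame being written; this is an illustration, not the PR's code): availableNow drains whatever input exists at start time and then stops, so awaitTermination() returns, while a periodic or continuous trigger never terminates on its own.

# Bounded run: the query stops once the currently available data is processed.
(
    df.writeStream.format("parquet")
    .option("path", dir_)
    .trigger(availableNow=True)
    .start()
    .awaitTermination()
)

# A continuous query would block forever here; awaitTermination takes an
# optional timeout in seconds and returns False if the query is still running.
# query.awaitTermination(timeout=60)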
def _fetch_from_cursor(self, cursor, schema):
    df = cursor.query.toPandas()  # blocks until finished
    if cursor.query.isStreaming:
        df = self._execute_stream(cursor.query)
I don't think we actually support converting the results back into a pandas df in the Flink backend right now... Also, these outputs can get arbitrarily large because the query runs on continuous data. I get that the intention is to maintain the same interface/UX across streaming and batch jobs, but I wonder if this makes sense.
@@ -22,22 +25,41 @@ def set_pyspark_database(con, database):
 class TestConf(BackendTest):
     deps = ("pyspark",)

-    def _load_data(self, **_: Any) -> None:
+    def _load_data_helper(self, for_streaming: bool = False):
just streaming?
# cases, you can re-enable schema inference by setting
# spark.sql.streaming.schemaInference to true."
# Ref: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#schema-inference-and-partition-of-streaming-dataframesdatasets
s.sql("set spark.sql.streaming.schemaInference=true")
You can pass this as a config to the session builder
ibis/ibis/backends/pyspark/tests/conftest.py
Lines 169 to 191 in b8894d0
config = (
    SparkSession.builder.appName("ibis_testing")
    .master("local[1]")
    .config("spark.cores.max", 1)
    .config("spark.default.parallelism", 1)
    .config("spark.driver.extraJavaOptions", "-Duser.timezone=GMT")
    .config("spark.dynamicAllocation.enabled", False)
    .config("spark.executor.extraJavaOptions", "-Duser.timezone=GMT")
    .config("spark.executor.heartbeatInterval", "3600s")
    .config("spark.executor.instances", 1)
    .config("spark.network.timeout", "4200s")
    .config("spark.rdd.compress", False)
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    .config("spark.shuffle.compress", False)
    .config("spark.shuffle.spill.compress", False)
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
    .config("spark.sql.session.timeZone", "UTC")
    .config("spark.sql.shuffle.partitions", 1)
    .config("spark.storage.blockManagerSlaveTimeoutMs", "4200s")
    .config("spark.ui.enabled", False)
    .config("spark.ui.showConsoleProgress", False)
    .config("spark.sql.execution.arrow.pyspark.enabled", False)
)
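A sketch of what that could look like, folding the flag into the existing builder chain (abbreviated; the other .config(...) calls from the snippet above stay unchanged):

from pyspark.sql import SparkSession

config = (
    SparkSession.builder.appName("ibis_testing")
    .master("local[1]")
    # ... existing .config(...) calls elided ...
    .config("spark.sql.streaming.schemaInference", True)
)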
# Note: The same session can be used for both batch and streaming
# jobs in Spark. Streaming is made explicit on the source
# dataframes. This is why, we do not really need a separate
# `TestConf` class for streaming, but only need to create
# streaming counterparts for the test tables. Still added this
# class to keep the testing uniform with Flink. This class is used
# in `con_streaming()` fixture (streaming counterpart of `con()`)
# to create a new spark session and load the `***_streaming`
# tables for testing. However, either `con()` or
# `con_streaming()` can be used to execute any batch/streaming
# job. This is why, we set `autouse=True` for `con_streaming()`
# to create the streaming tables, and then rely solely on `con()`
# to operate on those tables in the tests.
It makes sense to try to keep the testing uniform, but I also wonder if it's cleaner/easier to just reuse the same test con, given this difference in behavior... It feels a little unnecessary. Maybe others have some thoughts.
Description of changes
Aims to address #8888.