Hi!
I'm trying to run catboost-spark (ai.catboost:catboost-spark_3.5_2.12:1.2.3) on AWS EMR Serverless (EMR 7.0.0), but calling the fit function fails with the following error:
cb_model = algo.fit(train_pool)
File "/tmp/spark-752df6af-d30d-400f-b5b2-a7705527b9fb/userFiles-41e7b55e-142c-4325-9a0d-88923b710594/ai.catboost_catboost-spark_3.5_2.12-1.2.3.jar/catboost_spark/core.py", line 5362, in fit
File "/tmp/spark-752df6af-d30d-400f-b5b2-a7705527b9fb/userFiles-41e7b55e-142c-4325-9a0d-88923b710594/ai.catboost_catboost-spark_3.5_2.12-1.2.3.jar/catboost_spark/core.py", line 5359, in _fit_with_eval
File "/tmp/spark-752df6af-d30d-400f-b5b2-a7705527b9fb/userFiles-41e7b55e-142c-4325-9a0d-88923b710594/ai.catboost_catboost-spark_3.5_2.12-1.2.3.jar/catboost_spark/core.py", line 5316, in _fit_with_eval
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in call
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o817.fit.
: java.util.concurrent.ExecutionException: Error while executing master
at ai.catboost.spark.impl.Helpers$.checkOneFutureAndWaitForOther(Helpers.scala:33)
at ai.catboost.spark.impl.Helpers$.waitForTwoFutures(Helpers.scala:59)
at ai.catboost.spark.CatBoostPredictorTrait.$anonfun$fit$12(CatBoostPredictor.scala:260)
at scala.util.control.Breaks.breakable(Breaks.scala:42)
at ai.catboost.spark.CatBoostPredictorTrait.fit(CatBoostPredictor.scala:230)
at ai.catboost.spark.CatBoostPredictorTrait.fit$(CatBoostPredictor.scala:125)
at ai.catboost.spark.CatBoostClassifier.fit(CatBoostClassifier.scala:372)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: ai.catboost.CatBoostError: CatBoost Master process failed: exited with code 134
at ai.catboost.spark.impl.CatBoostMasterWrapper.trainCallback(Master.scala:206)
at ai.catboost.spark.CatBoostPredictorTrait.$anonfun$fit$13(CatBoostPredictor.scala:234)
at ai.catboost.spark.CatBoostPredictorTrait.$anonfun$fit$13$adapted(CatBoostPredictor.scala:234)
at ai.catboost.spark.TrainingDriver.run(TrainingDriver.scala:271)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
... 1 more
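A note on the exit code in the trace above: on Linux, an exit status above 128 conventionally means the process was terminated by a signal (status minus 128), so 134 would correspond to signal 6, SIGABRT. This is a minimal sketch for decoding such codes; the helper name `describe_exit_code` is made up for illustration and is not part of CatBoost or Spark.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a process exit status using the shell's 128 + N convention."""
    if code > 128:
        try:
            # Map the signal number back to its symbolic name on this platform.
            sig = signal.Signals(code - 128)
            return f"terminated by signal {sig.name} ({sig.value})"
        except ValueError:
            return f"exited with status {code} (no matching signal)"
    return f"exited with status {code}"


print(describe_exit_code(134))
```

An abort like this usually means the native process hit a fatal condition and called `abort()`, which fits a master process giving up after a failed connection, although the trace alone does not confirm that.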
Checking the error logs, I'm getting a network connection refused error:
The information in the logs is correct: you seem to be having network connectivity issues between the Spark executors in your cluster. Are there any TCP connection restrictions (a firewall, for example) between hosts in the cluster?
I can't be sure of that, since it is a serverless solution, but our current network settings allow TCP connections. Additionally, other Spark transformations on the data work just fine.
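One way to narrow this down is to test raw TCP reachability between hosts directly. The sketch below uses only the Python standard library. The helper name `can_connect` is hypothetical, and the host and port you probe are placeholders you would have to take from your own logs, since the thread does not state which address was refused.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


# Example (placeholder values, replace with the address from your logs):
# print(can_connect("10.0.0.12", 12345))
```

To check executor-to-executor or executor-to-driver connectivity rather than connectivity from your own machine, the same function could be called inside a Spark task (for example within `mapPartitions`), assuming the target host and port are known.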
How can I fix this issue?
catboost version: 1.2.3, Spark 3.5.0, Scala 2.12.17
Operating System: Linux x86_64
CPU:
GPU: not using gpu