[spark] ai.catboost.CatBoostError: CatBoost Master process failed: exited with code 134 #2611

Open
Gctucci opened this issue Mar 13, 2024 · 2 comments

Gctucci commented Mar 13, 2024

Hi!
I'm trying to run the catboost-spark algorithm (ai.catboost:catboost-spark_3.5_2.12:1.2.3) on an AWS EMR Serverless cluster (EMR 7.0.0), but when I call the fit function I get the following error:

cb_model = algo.fit(train_pool)
File "/tmp/spark-752df6af-d30d-400f-b5b2-a7705527b9fb/userFiles-41e7b55e-142c-4325-9a0d-88923b710594/ai.catboost_catboost-spark_3.5_2.12-1.2.3.jar/catboost_spark/core.py", line 5362, in fit
File "/tmp/spark-752df6af-d30d-400f-b5b2-a7705527b9fb/userFiles-41e7b55e-142c-4325-9a0d-88923b710594/ai.catboost_catboost-spark_3.5_2.12-1.2.3.jar/catboost_spark/core.py", line 5359, in _fit_with_eval
File "/tmp/spark-752df6af-d30d-400f-b5b2-a7705527b9fb/userFiles-41e7b55e-142c-4325-9a0d-88923b710594/ai.catboost_catboost-spark_3.5_2.12-1.2.3.jar/catboost_spark/core.py", line 5316, in _fit_with_eval
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/java_gateway.py", line 1322, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/errors/exceptions/captured.py", line 179, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.9.7-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o817.fit.
: java.util.concurrent.ExecutionException: Error while executing master
at ai.catboost.spark.impl.Helpers$.checkOneFutureAndWaitForOther(Helpers.scala:33)
at ai.catboost.spark.impl.Helpers$.waitForTwoFutures(Helpers.scala:59)
at ai.catboost.spark.CatBoostPredictorTrait.$anonfun$fit$12(CatBoostPredictor.scala:260)
at scala.util.control.Breaks.breakable(Breaks.scala:42)
at ai.catboost.spark.CatBoostPredictorTrait.fit(CatBoostPredictor.scala:230)
at ai.catboost.spark.CatBoostPredictorTrait.fit$(CatBoostPredictor.scala:125)
at ai.catboost.spark.CatBoostClassifier.fit(CatBoostClassifier.scala:372)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:568)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:374)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
at java.base/java.lang.Thread.run(Thread.java:840)
Caused by: ai.catboost.CatBoostError: CatBoost Master process failed: exited with code 134
at ai.catboost.spark.impl.CatBoostMasterWrapper.trainCallback(Master.scala:206)
at ai.catboost.spark.CatBoostPredictorTrait.$anonfun$fit$13(CatBoostPredictor.scala:234)
at ai.catboost.spark.CatBoostPredictorTrait.$anonfun$fit$13$adapted(CatBoostPredictor.scala:234)
at ai.catboost.spark.TrainingDriver.run(TrainingDriver.scala:271)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:539)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
... 1 more
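
A note on the exit code: assuming the usual convention that a child process killed by signal N is reported as 128 + N, code 134 corresponds to signal 6. The helper below is a hypothetical illustration of that arithmetic and is not part of CatBoost or Spark.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret an exit code under the common 128 + N signal convention."""
    if code > 128:
        try:
            # Signals() raises ValueError if no signal has this number.
            return f"terminated by {signal.Signals(code - 128).name}"
        except ValueError:
            return f"exit code {code} (no matching signal)"
    return f"exit code {code}"


print(describe_exit_code(134))
```

On Linux, signal 6 is SIGABRT, which a native process raises when it calls abort(), for example after a failed internal check.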

The error logs show a "Connection refused" network error:

[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: 4286cfe8-9528905d-7d648cd8-2ddc0d60 tcp2://[fd00:0:0:0:0:0:0:2]:39313/matrixnet init@35737 retries rest: 4
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: d201367d-e96c7bd1-c6bb96d0-58b5da64 tcp2://[fd00:0:0:0:0:0:0:2]:33007/matrixnet init@35737 retries rest: 5
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: 4286cfe8-9528905d-7d648cd8-2ddc0d60 tcp2://[fd00:0:0:0:0:0:0:2]:39313/matrixnet init@35737 retries rest: 3
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: d201367d-e96c7bd1-c6bb96d0-58b5da64 tcp2://[fd00:0:0:0:0:0:0:2]:33007/matrixnet init@35737 retries rest: 4
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: 4286cfe8-9528905d-7d648cd8-2ddc0d60 tcp2://[fd00:0:0:0:0:0:0:2]:39313/matrixnet init@35737 retries rest: 2
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: d201367d-e96c7bd1-c6bb96d0-58b5da64 tcp2://[fd00:0:0:0:0:0:0:2]:33007/matrixnet init@35737 retries rest: 3
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: d201367d-e96c7bd1-c6bb96d0-58b5da64 tcp2://[fd00:0:0:0:0:0:0:2]:33007/matrixnet init@35737 retries rest: 2
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: 4286cfe8-9528905d-7d648cd8-2ddc0d60 tcp2://[fd00:0:0:0:0:0:0:2]:39313/matrixnet init@35737 retries rest: 1
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: d201367d-e96c7bd1-c6bb96d0-58b5da64 tcp2://[fd00:0:0:0:0:0:0:2]:33007/matrixnet init@35737 retries rest: 1
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: 4286cfe8-9528905d-7d648cd8-2ddc0d60 tcp2://[fd00:0:0:0:0:0:0:2]:39313/matrixnet init@35737 retries rest: 0
[CatBoost Master] VERIFY failed (2024-03-13T13:52:36.141547Z): got unexpected network error, no retries rest
[CatBoost Master] /src/catboost/library/cpp/par/par_network.cpp:185
[CatBoost Master] MultiClientThreadLoopFunction(): requirement false failed
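
To see which endpoints the master could not reach, the repeated log lines can be reduced to a short summary. This is a minimal sketch: the regular expression and the function name `refused_endpoints` are assumptions based only on the lines shown above, not on any documented CatBoost log format. The sample text is two lines copied from the log.

```python
import ipaddress
import re
from collections import Counter

# Two lines copied verbatim from the log above.
LOG = """\
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: 4286cfe8-9528905d-7d648cd8-2ddc0d60 tcp2://[fd00:0:0:0:0:0:0:2]:39313/matrixnet init@35737 retries rest: 4
[CatBoost Master] ERROR: 2024-03-13 13:52:36.141 +0000 par_network.cpp:180 query error: 0 Connection refused info: reqId: d201367d-e96c7bd1-c6bb96d0-58b5da64 tcp2://[fd00:0:0:0:0:0:0:2]:33007/matrixnet init@35737 retries rest: 5
"""

# Matches the bracketed IPv6 host and the port in "tcp2://[host]:port/".
ENDPOINT = re.compile(r"tcp2://\[([0-9a-fA-F:]+)\]:(\d+)/")


def refused_endpoints(log_text: str) -> Counter:
    """Count 'Connection refused' lines per (host, port) endpoint."""
    counts = Counter()
    for line in log_text.splitlines():
        if "Connection refused" not in line:
            continue
        match = ENDPOINT.search(line)
        if match:
            # Normalise the address, e.g. fd00:0:0:0:0:0:0:2 -> fd00::2.
            host = ipaddress.ip_address(match.group(1))
            counts[(str(host), int(match.group(2)))] += 1
    return counts


for (host, port), count in refused_endpoints(LOG).items():
    addr = ipaddress.ip_address(host)
    print(f"{host} port {port}: {count} refusal(s), "
          f"IPv{addr.version}, private={addr.is_private}")
```

Every refusal in the log targets the same IPv6 unique local address (fd00::2) on two different ports, so it may be worth checking whether that address is actually reachable from the host running the master.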

How can I fix this issue?

catboost version: 1.2.3, Spark 3.5.0, Scala 2.12.17
Operating System: Linux x86_64
CPU:
GPU: not using gpu

@andrey-khropov
Member

The information in the logs is accurate: you appear to have a network connectivity problem between the Spark executors in your cluster. Are there any TCP connection restrictions (a firewall, for example) between the hosts in the cluster?
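
One way to test for such restrictions is a plain TCP connection attempt from one host to another. The sketch below uses only the Python standard library; the function name `can_connect` is hypothetical, and the host and port to probe would have to come from your own logs.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to (host, port) can be opened."""
    try:
        # create_connection resolves the host and handles IPv4 and IPv6.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

For example, `can_connect("fd00::2", 39313)` would probe one of the endpoints from the log. The worker ports appear to change between runs, so a probe is only meaningful while a worker is actually listening; a False result after the job has ended says nothing about the firewall.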

Gctucci (Author) commented Mar 13, 2024

I can't be sure of that, since this is a serverless solution, but our current network settings there allow TCP connections. Also, other Spark transformations on the data are working just fine.
