All PY4J 'sendCommand' blocked on JDK's createSocket default configuration #525

Open
Leniox opened this issue Jul 13, 2023 · 0 comments

Leniox commented Jul 13, 2023

The version of Py4J, Python, and Java you are using (e.g., 0.10.1, 3.5.1, 8)
py4j=0.10.9.7=pyhd8ed1ab_0

The OS you are using (Windows 7, OSX Yosemite, Ubuntu 16.04)

  • Linux-5.15.0-1039-aws-x86_64-with-debian-bullseye-sid
  • version='#44~20.04.1-Ubuntu SMP Thu Jun 22 12:21:12 UTC 2023'

A snippet of code that can reproduce the problem
I have not been able to reproduce this reliably. However, I think it would be acceptable to let clients address the symptoms of this problem with configurable timeouts.

Problem Description

We had an incident recently where we noticed all Py4J threads (64) were blocked. Here is an example of a blocked thread:

{threadName154} (blockedCount: 5, daemon: true, lockOwnerId: 356, threadState: WAITING, waitedCount: 4, waitingOn: <475476020> (a java.util.concurrent.locks.ReentrantLock$FairSync))
at jdk.internal.misc.Unsafe.park(Unsafe.java:-2)
at java.util.concurrent.locks.LockSupport.park(LockSupport.java:211)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:715)
at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:938)
at java.util.concurrent.locks.ReentrantLock$Sync.lock(ReentrantLock.java:153)
at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:322)
at py4j.PythonClient.giveBackConnection(PythonClient.java:239)
at py4j.CallbackClient.sendCommand(CallbackClient.java:406)
at py4j.CallbackClient.sendCommand(CallbackClient.java:356)

This thread is blocked on threadName356. This is the thread-trace:

{threadName356} (blockedCount: 0, daemon: true, isNative: true, threadState: RUNNABLE, waitedCount: 1)
at sun.nio.ch.Net.connect0(Net.java:-2)
at sun.nio.ch.Net.connect(Net.java:579)
at sun.nio.ch.Net.connect(Net.java:568)
at sun.nio.ch.NioSocketImpl.connect(NioSocketImpl.java:588)
at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:327)
at java.net.Socket.connect(Socket.java:633)
at java.net.Socket.connect(Socket.java:583)
at java.net.Socket.<init>(Socket.java:507)
at java.net.Socket.<init>(Socket.java:319)
at javax.net.DefaultSocketFactory.createSocket(SocketFactory.java:277)
at py4j.PythonClient.startClientSocket(PythonClient.java:192)
at py4j.PythonClient.getConnection(PythonClient.java:213)
at py4j.CallbackClient.getConnectionLock(CallbackClient.java:250)
at py4j.CallbackClient.sendCommand(CallbackClient.java:377)
at py4j.CallbackClient.sendCommand(CallbackClient.java:356)
at py4j.reflection.PythonProxyHandler.invoke(PythonProxyHandler.java:106)

It appears that, for some reason, threads were blocked on an initial command from Java to Python while opening a socket connection. Specifically, the blocking call is the socket creation in py4j.PythonClient.startClientSocket (PythonClient.java:192 in the trace above).

Py4J does not specify a connect timeout, so we fall back to the JDK's default behavior, which is to block indefinitely.
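To illustrate the difference, here is a minimal sketch of opening a socket with an explicit connect timeout using only JDK calls. `TimedConnect` and `connectTimeoutMillis` are hypothetical names for illustration and are not part of Py4J. The sketch assumes the factory supports the no-argument `createSocket()`, which the default factory does.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.Socket;
import javax.net.SocketFactory;

// Hypothetical helper for illustration only; not part of Py4J.
final class TimedConnect {
    private TimedConnect() {}

    // Opens a socket but bounds the connect phase. A timeout of 0 means
    // "wait indefinitely", which matches the behavior described in this issue.
    static Socket open(SocketFactory factory, InetAddress address, int port,
                       int connectTimeoutMillis) throws IOException {
        // createSocket() with no arguments returns an unconnected socket,
        // unlike createSocket(address, port), which connects immediately
        // with no timeout.
        Socket socket = factory.createSocket();
        try {
            // Throws java.net.SocketTimeoutException if the timeout expires.
            socket.connect(new InetSocketAddress(address, port), connectTimeoutMillis);
            return socket;
        } catch (IOException | RuntimeException e) {
            socket.close(); // do not leak the half-open socket
            throw e;
        }
    }
}
```

With something like this, a connect that hangs would surface as a `SocketTimeoutException` in the calling thread instead of holding the connection lock forever.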

It is unclear to me why the OS blocked on this, and I cannot repro this. Still, this resulted in a non-trivial incident on our side! Therefore, I wanted to propose the following changes:

Potential Solutions

  • Solution 1: Add a way to specify a socket connection timeout down to the JDK level.
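  • Solution 2: Add a timeout for the connection lock that the waiting threads are parked on (the ReentrantLock acquired in py4j.PythonClient.giveBackConnection in the first trace). The intention is that the waiting threads would be released, they would attempt to send a command to a socket that is not connected, and we'd fail loudly.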
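
As a rough sketch of what Solution 2 could look like, the JDK's `ReentrantLock.tryLock(long, TimeUnit)` bounds the wait and honors the fairness setting (the trace shows a `FairSync` lock). `TimedLock` and `lockOrFail` are hypothetical names, not Py4J code, and where such a call would go inside Py4J is left open.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.locks.ReentrantLock;

// Hypothetical helper for illustration only; not part of Py4J.
final class TimedLock {
    private TimedLock() {}

    // Acquires the lock or fails loudly once the timeout elapses,
    // instead of parking the thread forever as lock() does.
    static void lockOrFail(ReentrantLock lock, long timeout, TimeUnit unit)
            throws InterruptedException, TimeoutException {
        if (!lock.tryLock(timeout, unit)) {
            throw new TimeoutException(
                "could not acquire connection lock within " + timeout + " " + unit);
        }
    }
}
```

A caller would pair `lockOrFail` with `lock.unlock()` in a `finally` block, exactly as with a plain `lock()` call.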