
Intermittent stall of S3 PUT request for about 17 minutes #3110

Open
1 task done
gudladona opened this issue May 13, 2024 · 2 comments
Assignees
Labels
bug This issue is a bug. third-party This issue is related to third-party libraries or applications.

Comments


gudladona commented May 13, 2024

Upcoming End-of-Support

  • I acknowledge the upcoming end-of-support for AWS SDK for Java v1 was announced, and migration to AWS SDK for Java v2 is recommended.

Describe the bug

Hello,

We have an interesting problem that happens intermittently in our environment and causes an S3 PUT (an HTTP PUT operation) to stall for between 17 and 19 minutes. Let me try to describe this in detail.

First off, environment details: we are running OSS Spark and Hadoop on EKS with Karpenter.

JDK version: 11.0.19
Spark Version: 3.4.1
Hadoop Version: 3.3.4
EKS Version: 1.26
Hudi Version: 0.14.x
OS: Verified on both Bottlerocket & AL2

Issue Details:

Occasionally, we notice that a Spark stage and a few of its tasks stall for about 17 minutes; the delay is consistent whenever it happens. We have traced this to a stalled socket write during a close() inside the AWS SDK, which uses the Apache HTTP Client. When a TLS connection goes bad, we expect the underlying socket to be terminated eagerly so the request can be retried, but that does not happen. Instead, the socket is left open until the OS terminates it. This appears to be due to the socket SO_LINGER option, which is disabled (-1) by default in the JDK. Setting SO_LINGER to 0 causes bad connections to be torn down immediately, but neither the AWS SDK nor the Apache HTTP Client sets this option to alter the JDK's default linger behavior.
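For reference, a minimal sketch of the SO_LINGER semantics discussed above (the endpoint is a placeholder): with the JDK default the option is disabled, so close() returns and the OS keeps retransmitting any unacknowledged data until its own timeouts expire; enabling it with a timeout of 0 makes close() abort the connection immediately.

```java
import java.net.Socket;

public class LingerExample {
    public static void main(String[] args) throws Exception {
        // Placeholder endpoint; any TCP peer will do for illustration.
        try (Socket socket = new Socket("example.com", 443)) {
            // JDK default: SO_LINGER disabled, getSoLinger() returns -1.
            // close() then hands any unacknowledged data to the OS, which keeps
            // retransmitting until its own timeouts fire.
            System.out.println("default linger = " + socket.getSoLinger());

            // Abortive close: discard unsent data and send an RST on close(),
            // so a connection stuck in a write fails fast instead of lingering.
            socket.setSoLinger(true, 0);
        }
    }
}
```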

Attached are DEBUG-level logs (with slightly different errors) for the AWS SDK, Hadoop S3A, and the Apache HTTP Client from when the issue is encountered.

After further investigation we found this JDK bug: https://bugs.openjdk.org/browse/JDK-8241239. It perfectly describes and reproduces the issue we are having.

We tried forking the AWS SDK, adding a LINGER option defaulting to 0 here and applying it to the SSL socket options here. That did not fix the issue, which could be due to how this JDK version treats the socket options.

Expected Behavior

The socket file descriptor should be closed non-gracefully ("prematurely"), forcing the write to terminate immediately.

Current Behavior

close() blocks until the OS forces the socket closed at the transport layer, and only then does the socket write fail.

Reproduction Steps

As mentioned in https://bugs.openjdk.org/browse/JDK-8241239:

  1. Establish a connection between two hosts/VMs and have the client side perform sizable writes (enough to fill up the socket buffers); the server just reads and discards.
  2. Introduce a null route on either side (or otherwise prevent transmission of TCP ACKs from the server to the client) to force the client to attempt retransmits.
  3. Wait until the client is stuck in a write() (check stack dumps), then call close() on the client-side socket (a client-side sketch of these steps follows).
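A minimal client-side sketch of those steps, assuming a hypothetical discard server and an externally added blackhole route (neither is part of this program):

```java
import java.io.OutputStream;
import java.net.Socket;

public class StalledCloseRepro {
    public static void main(String[] args) throws Exception {
        // Hypothetical discard server: it only reads and throws the data away.
        Socket socket = new Socket("server.example", 9000);
        OutputStream out = socket.getOutputStream();
        byte[] chunk = new byte[1 << 20]; // 1 MiB writes to fill the socket buffers

        Thread writer = new Thread(() -> {
            try {
                while (true) {
                    out.write(chunk); // wedges once buffers are full and ACKs stop arriving
                }
            } catch (Exception e) {
                System.out.println("write terminated: " + e);
            }
        });
        writer.start();

        // Step 2 happens outside this program: add a null route on either host
        // (e.g. `ip route add blackhole <peer-ip>`) so ACKs no longer reach the client.
        Thread.sleep(60_000);

        // Step 3: with SO_LINGER at its default, this close() does not unblock the
        // writer promptly; the write only fails once the OS gives up retransmitting
        // (the ~17-minute stall described in this issue).
        socket.close();
    }
}
```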

Possible Solution

Assuming the JDK implements SO_LINGER correctly, it would be good to allow users to set this option in ClientConfiguration.java, from where it would be passed into the Apache HTTP Client settings (see the sketch below).
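As an illustration of where such a knob would land, here is a hedged sketch of setting SO_LINGER on a standalone Apache HttpClient 4.x instance via SocketConfig; a ClientConfiguration-level option in the AWS SDK would have to be passed through to its Apache connection manager in a similar way. The values used here are examples, not current SDK defaults.

```java
import org.apache.http.config.SocketConfig;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;

public class LingerHttpClient {
    public static void main(String[] args) throws Exception {
        // SO_LINGER = 0 requests an abortive close: close() sends an RST instead of
        // waiting for the OS to drain/retransmit unacknowledged data.
        SocketConfig socketConfig = SocketConfig.custom()
                .setSoLinger(0)
                .setSoTimeout(50_000) // example read timeout, not an SDK default
                .build();

        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultSocketConfig(socketConfig)
                .build()) {
            // A ClientConfiguration-level option would ultimately have to reach
            // the connection manager the same way this SocketConfig does.
            System.out.println("client configured: " + client);
        }
    }
}
```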

Additional Information/Context

No response

AWS Java SDK version used

1.12.26

JDK version used

OpenJDK Runtime Environment Temurin-11.0.19+7

Operating System and version

BottleRocket & AL2

Logs & Other Attachments

outlier-task-17-min
aws+sdk+httpclient+debug.log

@gudladona gudladona added bug This issue is a bug. needs-triage This issue or PR still needs to be triaged. labels May 13, 2024
@bhoradc bhoradc self-assigned this May 14, 2024
@bhoradc bhoradc added third-party This issue is related to third-party libraries or applications. and removed needs-triage This issue or PR still needs to be triaged. labels May 14, 2024

bhoradc commented May 14, 2024

Hi @gudladona,

Thank you for reporting the issue. I see that you have also opened a Support ticket for the same issue.

We will continue to provide support through the ticket you have opened. I will mark this as Closing-soon, to avoid duplicate efforts. Kindly send your questions through the Support case. Thanks.

Regards,
Chaitanya

@bhoradc bhoradc added the closing-soon This issue will close in 2 days unless further comments are made. label May 14, 2024
@hgudladona

OK, thanks for the response. Can you kindly post your initial assessment if you have one already?

@github-actions github-actions bot removed the closing-soon This issue will close in 2 days unless further comments are made. label May 15, 2024