Wrong version of aws-java-sdk-bundle in sagemaker-spark 1.4.5 #149
System Information
Describe the problem
I just spent 3 days trying to fix this, but to no avail. My setup on an AWS notebook instance:
jars:
aws-java-sdk-bundle-1.11.901.jar
aws-java-sdk-core-1.12.262.jar
aws-java-sdk-kms-1.12.262.jar
aws-java-sdk-s3-1.12.262.jar
aws-java-sdk-sagemaker-1.12.262.jar
aws-java-sdk-sagemakerruntime-1.12.262.jar
aws-java-sdk-sts-1.12.262.jar
hadoop-aws-3.3.1.jar
sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar
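For context, this is roughly how a notebook session can be pointed at these jars; the jar directory and the use of spark.jars here are assumptions, since the actual setup may rely on spark-defaults.conf or a preconfigured kernel instead:

from pyspark.sql import SparkSession

# Hypothetical location of the jars on the notebook instance; adjust to the real path.
jar_dir = "/home/ec2-user/SageMaker/jars"
jars = ",".join([
    f"{jar_dir}/aws-java-sdk-bundle-1.11.901.jar",
    f"{jar_dir}/hadoop-aws-3.3.1.jar",
    f"{jar_dir}/sagemaker-spark_2.12-spark_3.3.0-1.4.5.jar",
    # ... plus the remaining aws-java-sdk-* jars listed above
])

spark = (
    SparkSession.builder
    .appName("s3a-debug")
    .config("spark.jars", jars)  # make the listed jars visible to the driver and executors
    .getOrCreate()
)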
Problem:
This is caused by a bug in the httpclient jar that pyspark depends on: with virtual-hosted-style addressing, a bucket name containing dots produces a hostname like comp.data.sci.data.tst.s3.amazonaws.com, which the wildcard certificate *.s3.amazonaws.com cannot match (a wildcard only covers a single DNS label). The bug is reported here: https://issues.apache.org/jira/browse/HADOOP-18159?page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel&focusedCommentId=17554677#comment-17554677
Based on the suggested workarounds in the article above, I tried 4 things:
upgrading aws-java-sdk-bundle to version 1.12.262 like the other jars → didn't work
changing httpclient to version 4.5.10 → didn't work
configuring aws-java-sdk to disable SSL certificate checking (SSLPeerUnverifiedException on S3 actions aws-sdk-java-v2#1786) → didn't work with "-Dcom.amazonaws.sdk.disableCertChecking=true" (a sketch of passing this flag follows below)
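For reference, a sketch of how the cert-checking flag can be passed to the driver and executor JVMs from a notebook session; whether this matches the exact mechanism used on the instance is an assumption:

from pyspark.sql import SparkSession

# Workaround attempt: pass the SDK system property to driver and executors.
# (Did not resolve the error in this case.)
spark = (
    SparkSession.builder
    .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.sdk.disableCertChecking=true")
    .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.sdk.disableCertChecking=true")
    .getOrCreate()
)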
Minimal repro / logs
22/08/30 11:00:22 WARN FileStreamSink: Assume no metadata directory. Error while looking for metadata directory in the path: s3a://comp.data.sci.data.tst/some/folder/export_date=20220822. org.apache.hadoop.fs.s3a.AWSClientIOException: getFileStatus on s3a://comp.data.sci.data.tst/some/folder/export_date=20220822: com.amazonaws.SdkClientException: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]: Unable to execute HTTP request: Certificate for <comp.data.sci.data.tst.s3.amazonaws.com> doesn't match any of the subject alternative names: [*.s3.amazonaws.com, s3.amazonaws.com]
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:208)
at org.apache.hadoop.fs.s3a.S3AUtils.translateException(S3AUtils.java:170)
at org.apache.hadoop.fs.s3a.S3AFileSystem.s3GetFileStatus(S3AFileSystem.java:3351)
at org.apache.hadoop.fs.s3a.S3AFileSystem.innerGetFileStatus(S3AFileSystem.java:3185)
at org.apache.hadoop.fs.s3a.S3AFileSystem.isDirectory(S3AFileSystem.java:4277)
at org.apache.spark.sql.execution.streaming.FileStreamSink$.hasMetadata(FileStreamSink.scala:54)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:370)
at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:228)
at org.apache.spark.sql.DataFrameReader.$anonfun$load$2(DataFrameReader.scala:210)
at scala.Option.getOrElse(Option.scala:189)
Works:
df = spark.read.parquet("s3a://aws-bucket-with-dashes/file_0_1_0.snappy.parquet")
Doesn't work:
df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")
It's not possible to rename the bucket because many data consumers depend on it.
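One more knob that is sometimes suggested for bucket names with dots is S3A path-style access, which keeps the bucket name out of the TLS hostname; this is only a sketch and has not been verified against this setup:

from pyspark.sql import SparkSession

# Untested sketch: force path-style requests (s3.amazonaws.com/<bucket>/...)
# so the dotted bucket name is not used as a virtual-hosted hostname.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

df = spark.read.parquet("s3a://aws.bucket.with.dots/file_0_1_0.snappy.parquet")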