
wr.s3.to_deltalake throwing TypeError about storage_options #2378

Open · leodido opened this issue Jul 3, 2023 · 14 comments
Labels: bug (Something isn't working)

@leodido commented Jul 3, 2023

Describe the bug

Calling wr.s3.to_deltalake() throws the following error:

self._table = RawDeltaTable(
                  ^^^^^^^^^^^^^^
TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'

How to Reproduce

wr.s3.to_deltalake(
    df=data,
    path="s3://bucket/delta",
    index=False,
    partition_cols=["a", "b"],
    overwrite_schema=True,
    s3_additional_kwargs={
        "AWS_ACCESS_KEY_ID": "...",
        "AWS_SECRET_ACCESS_KEY": "...",
        "AWS_REGION": "eu-west-1",
    },
    s3_allow_unsafe_rename=True,
)

Expected behavior

I'd expect awswrangler to connect to S3 and write the delta table.


OS

Mac

Python version

3.11.4

AWS SDK for pandas version

3.2.1


@leodido added the bug label Jul 3, 2023
@jaidisido (Contributor)

The s3_additional_kwargs argument is for passing S3-specific arguments like ServerSideEncryption, not your AWS credentials. The boto3 session is used to extract the credentials and the region, so as long as that is correctly configured and passed, it should be enough:

boto3_session = boto3.Session(region_name="eu-west-1")
wr.s3.to_deltalake(path=path, df=df, boto3_session=boto3_session, s3_additional_kwargs={"ServerSideEncryption": "AES256"})

@leodido (Author) commented Jul 5, 2023

I did that...

boto3_session = boto3.Session(region_name="eu-west-1")  # Yes, the region is correct
wrangler.s3.to_deltalake(
    df=data,
    path="s3://mybucket/delta",  # Yes, the bucket exists
    index=False,
    partition_cols=["a", "b"],
    overwrite_schema=True,
    boto3_session=boto3_session,
    s3_allow_unsafe_rename=True,
)

But I keep getting the same error:

TypeError: argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'

I enabled logging (INFO level) with:

logging.basicConfig(level=logging.INFO, format="[%(name)s][%(funcName)s] %(message)s")
logging.getLogger("awswrangler").setLevel(logging.INFO)

And I see this in the logs just before the error above:

[botocore.credentials][load] Found credentials in environment variables.
on 0: {'AWS_ACCESS_KEY_ID': 'REDACT', 'AWS_SECRET_ACCESS_KEY': 'REDACT', 'AWS_SESSION_TOKEN': None, 'PROFILE_NAME': 'default', 'AWS_REGION': 'eu-west-1', 'AWS_S3_ALLOW_UNSAFE_RENAME': 'TRUE'}

Any further suggestions on how to fix this?

Here's the error trace:
Traceback (most recent call last):
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/v2p.py", line 304, in <module>
    process(chunk)
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/v2p.py", line 255, in process
    wrangler.s3.to_deltalake(
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/_utils.py", line 122, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/annotations.py", line 44, in inner
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/awswrangler/s3/_write_deltalake.py", line 104, in to_deltalake
    deltalake.write_deltalake(
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 147, in write_deltalake
    table, table_uri = try_get_table_and_table_uri(table_or_uri, storage_options)
                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 392, in try_get_table_and_table_uri
    table = try_get_deltatable(table_or_uri, storage_options)
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/writer.py", line 405, in try_get_deltatable
    return DeltaTable(table_uri, storage_options=storage_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/leodido/Workspace/github.com/REDACTED/v2p/lib/python3.11/site-packages/deltalake/table.py", line 122, in __init__
    self._table = RawDeltaTable(

@jaidisido (Contributor)

Hmm, strange. I am unable to replicate this error on my local machine:

boto3_session = boto3.Session(region_name="us-east-1")
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [True, False, True], "par0": ["foo", "foo", "bar"], "par1": [1, 2, 2]})
wr.s3.to_deltalake(
    path=path,
    df=df,
    index=False,
    boto3_session=boto3_session,
    partition_cols=["par0", "par1"],
    overwrite_schema=True,
    s3_allow_unsafe_rename=True,
)
df2 = wr.s3.read_deltalake(path=path, columns=["c0"], partitions=[("par0", "=", "foo"), ("par1", "=", "1")])
assert df2.shape == (1, 1)

works fine.

Could you share your pip freeze? I imagine you are on deltalake 0.9.0?

Also, please try making the call directly with the deltalake library, which is pretty much what we do under the hood. If it fails there too, it might be worth opening an issue in delta-rs directly.
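
For reference, a minimal sketch of such a direct call (assuming deltalake 0.9/0.10; the bucket path and credential values are placeholders, and the storage_options keys are the ones delta-rs accepts for S3):

import deltalake
import pandas as pd

df = pd.DataFrame({"c0": [1, 2, 3], "par0": ["foo", "foo", "bar"]})

# Every value in storage_options must be a string; a None value here is
# exactly what triggers the "'NoneType' object cannot be converted to
# 'PyString'" error.
storage_options = {
    "AWS_REGION": "eu-west-1",
    "AWS_ACCESS_KEY_ID": "...",      # placeholder
    "AWS_SECRET_ACCESS_KEY": "...",  # placeholder
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

deltalake.write_deltalake(
    "s3://bucket/delta",
    df,
    partition_by=["par0"],
    mode="overwrite",
    storage_options=storage_options,
)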

@jaidisido changed the title from "we.s3.to_deltalake throwing TypeError about storage_options" to "wr.s3.to_deltalake throwing TypeError about storage_options" Jul 6, 2023
@leandro-ferreira-farm

I receive the same error here:

wr.s3.to_deltalake(
    df=df_delta,
    path=s3_path,
    mode="overwrite",
    partition_cols=partition_cols,
    index=False,
    overwrite_schema=True,
    s3_allow_unsafe_rename=True,
)

argument 'storage_options': 'NoneType' object cannot be converted to 'PyString'

@leandro-ferreira-farm

Please reopen this ticket. This issue still happens even in version 3.3.0.

@leandro-ferreira-farm

I saw that the error happens when we use a boto3 Session. I don't know if it's a delta-rs issue or an awswrangler issue.

@leandro-ferreira-farm

I'm using Poetry as a dependency manager, and my pyproject.toml is:

[tool.poetry.dependencies]
python = "3.10.11"
awswrangler = "3.3.0"
boto3 = "1.27.1"
pyarrow = "12.0.1"
duckdb = "0.8.1"
pandas = "2.0.3"
deltalake = "0.10.0"
jsonschema = "4.18.0"
requests = "2.31.0"
pyyaml = "6.0.1"
ipykernel = "6.24.0"
pyspark = "3.4.0"
delta-spark = "2.4.0"
sagemaker = "2.72"
findspark = "2.0.1"
msal = "1.22.0"
great-expectations = "0.17.7"
hvac = "1.1.1"

@luis-fnogueira

I kindly ask to reopen this issue; I am facing the very same problem. I'm using version 3.4.1.

@kukushking reopened this Nov 13, 2023
@ZulqarnainB commented Dec 20, 2023

I am facing this issue with wr.s3.read_deltalake as follows:

df = wr.s3.read_deltalake(path=label_path, columns=[label_field], boto3_session=session)

@neverlink commented Jan 28, 2024

This issue can occur when no region is associated with the profile in ~/.aws/config. Running aws configure and providing a default region fixes this.

to_deltalake() pulls a None value from the boto3_session, which then can't be cast to a PyString, as the exception shows.

@leodido In your case, it could be the AWS_SESSION_TOKEN missing instead, considering the log you posted:

[botocore.credentials][load] Found credentials in environment variables.
on 0: {'AWS_ACCESS_KEY_ID': 'REDACT', 'AWS_SECRET_ACCESS_KEY': 'REDACT', 'AWS_SESSION_TOKEN': None, 'PROFILE_NAME': 'default', 'AWS_REGION': 'eu-west-1', 'AWS_S3_ALLOW_UNSAFE_RENAME': 'TRUE'}
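
A quick way to check which of these values is None before the call (a minimal diagnostic sketch using standard boto3 APIs):

import boto3

session = boto3.Session()
creds = session.get_credentials().get_frozen_credentials()

# Any None among these ends up as a non-string entry in storage_options
# and triggers the PyString conversion error.
print("region:", session.region_name)
print("access key set:", creds.access_key is not None)
print("secret key set:", creds.secret_key is not None)
print("session token set:", creds.token is not None)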

@stuart-powell commented Mar 6, 2024

I had this problem too. It looks to me like AWS_SESSION_TOKEN being None clashes with the constructor expecting a dictionary of strings. (Perhaps the underlying Rust code does not handle None for a string?)

I tried this workaround in the __init__ method of the DeltaTable class, prior to the creation of the RawDeltaTable:

# replace None with empty string
if "AWS_SESSION_TOKEN" in storage_options and storage_options["AWS_SESSION_TOKEN"] is None:
    storage_options["AWS_SESSION_TOKEN"] = ""

It stopped the type error, and having a blank value in AWS_SESSION_TOKEN did not cause a problem: the write completed without error.
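
A non-invasive variant of the same idea, for anyone calling deltalake directly rather than patching its source, is to drop non-string values before passing storage_options. A sketch, with placeholder paths and credentials:

import deltalake
import pandas as pd

df = pd.DataFrame({"c0": [1, 2, 3]})

raw_options = {
    "AWS_REGION": "ap-southeast-2",
    "AWS_ACCESS_KEY_ID": "...",      # placeholder
    "AWS_SECRET_ACCESS_KEY": "...",  # placeholder
    "AWS_SESSION_TOKEN": None,       # the problematic value
    "AWS_S3_ALLOW_UNSAFE_RENAME": "true",
}

# Keep only string values so nothing non-string reaches the Rust layer.
storage_options = {k: v for k, v in raw_options.items() if isinstance(v, str)}

deltalake.write_deltalake("s3://bucket/delta", df, storage_options=storage_options)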

@vavaan commented Mar 6, 2024

I've had the same error in an AWS environment, while locally everything was working fine. I fixed it by adding this before calling to_deltalake:

boto3.setup_default_session(region_name='us-east-1')

@stuart-powell

Thanks, vavaan, for your update. That didn't work for me. Before calling to_deltalake I currently have:

boto3_session = boto3.Session(region_name="ap-southeast-2")

I tried changing this to use boto3.setup_default_session(...), but it was the same either way: when the call is made to set up the RawDeltaTable, AWS_SESSION_TOKEN is still set to None and the conversion to PyString fails. In my program this is my first use of the S3 session, so maybe if you've already done something with the session before trying to write, the token will have been set to a non-None value and it works. But it doesn't work for me even if I've created the session as shown above.

Thanks again for your response.

@tposlins commented Apr 22, 2024

To fix the issue, I just set the unused AWS_SESSION_TOKEN to an empty string so it isn't passed as None when calling to_deltalake:

boto3_session = boto3.Session(region_name="us-east-1")
df = pd.DataFrame({"c0": [1, 2, 3], "c1": [True, False, True], "par0": ["foo", "foo", "bar"], "par1": [1, 2, 2]})
wr.s3.to_deltalake(
    path=path,
    df=df,
    index=False,
    boto3_session=boto3_session,
    partition_cols=["par0", "par1"],
    overwrite_schema=True,
    s3_allow_unsafe_rename=True,
    s3_additional_kwargs={"AWS_SESSION_TOKEN": ""},
)
