
RF: Drop boto for boto3 #7575

Open: wants to merge 23 commits into master
Conversation

@effigies (Contributor) commented Apr 6, 2024

This PR transitions from the deeply defunct boto package to boto3. We refrain from using the high-level API, as it makes it difficult to customize the authentication method.

This PR preserves the old behavior of using datalad-managed credentials, falling back to anonymous access. A future effort could use the boto3 API to give users the option of pulling credentials from AWS sources instead of manually re-entering them, but this is specifically avoided here. This is the main difference between this and #7340.

I've added an anonymous fetch test using a small file from OpenNeuro, which at least exercises most of the functionality. To test the authentication, we will need #7574.

Revives #7340.
Closes #5597, #7587.

Still relevant inclusions from original PR text:

The boto package is no longer maintained (the last release, v2.49.0, is from July 2018) and has been replaced by boto3. This PR removes the boto dependency in favour of boto3 and makes all necessary code adjustments.

The most significant changes include:

  • Custom retry logic for S3 connections is removed in favour of boto3's built-in standard retry mode (see boto3 docs: Retries).

  • Instead of the old boto bucket and key objects, the functions in the downloader module will now operate primarily on a boto3 S3 client object ("low-level interface to AWS whose methods map close to 1:1 with service APIs"). Although boto3 has a high-level resource object for a Bucket, its methods differ from the old boto bucket, and I found working with a client much more suitable for the kind of operations done by the downloaders code (setting credentials, asking about headers, etc.).

See commit messages for details on specific decisions - [@mslw] tried to be verbose.

Notes

Note: The downloader/authenticator code based on boto seems to be built around "authenticating to a bucket". The authenticator's behaviour was maintained for compatibility (it still checks bucket access permissions, as the old code did). But the transition to a boto3 client object makes it apparent that authenticating to a bucket is not really what matters if all we want is to download an object. Authentication is done to S3, and then object permissions are decisive for downloading. For example, a public file (public read permission) can be in a bucket that does not block public access but does not grant public permission to list objects either (an easy and reasonable setup in S3), in which case non-owner bucket interaction will end in a 403 although the object could be retrieved.

mslw and others added 20 commits April 6, 2024 14:39
This commit replaces the usage of boto (which has been deprecated)
with boto3 in the s3 downloaders module.

Instead of boto bucket and key objects, the functions in this module
will now operate primarily on a boto3 client object ("low-level
interface to AWS whose methods map close to 1:1 with service
APIs"). Although boto3 has a high-level resource object for a Bucket,
it behaves differently than the old boto bucket, and I found working
with a client much more suitable for the kind of operations done by
the downloaders code (setting credentials, asking about headers,
etc.).

The commit also allows boto3 to use credentials coming from its own
configuration mechanism (`AWS_*` environment variables,
`~/.aws/credentials`, `~/.aws/config`, `/etc/boto.cfg` and `~/.boto`)
allowing AWS users to rely on their existing configuration. The
preference order is DataLad credentials, boto credentials, anonymous.

Some unused code pieces (now incompatible with the rest of the file)
are left over, and some new TODOs for completing the transition are
introduced. These will be addressed by subsequent commits. But the
current code works for simple (authenticated & anonymous) download
scenarios already.

Note: The S3Authenticator.authenticate docstring said that it
authenticates to the specified bucket. This behaviour was maintained
for compatibility - we effectively check bucket permissions, as the
old code did. But the transition to boto3 client object makes it
apparent that authenticating to a bucket is not really what we care
about if all we want is to download an object. Authentication is done
to S3, and then object permissions are decisive for downloading. For
example, there can be a public file (public read permission) in a
bucket that doesn't block public access, but does not grant public
permission to list objects (easy and reasonable setup in S3) - in
which case non-owner bucket interaction will end in 403 although the
object could be retrieved.
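As a rough sketch of that preference order (not the PR's actual code; datalad_credential is a hypothetical stand-in for whatever DataLad's credential machinery returns):

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    def make_s3_client(datalad_credential=None):
        # 1. DataLad-managed credentials win if present
        if datalad_credential is not None:
            return boto3.client(
                's3',
                aws_access_key_id=datalad_credential['key_id'],
                aws_secret_access_key=datalad_credential['secret_id'],
            )
        # 2. Otherwise let boto3's own chain look for credentials
        #    (AWS_* env vars, ~/.aws/credentials, ~/.aws/config, ...)
        session = boto3.Session()
        if session.get_credentials() is not None:
            return session.client('s3')
        # 3. Finally fall back to anonymous (unsigned) access
        return boto3.client('s3', config=Config(signature_version=UNSIGNED))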
Methods in the downloader framework need headers as an input
argument. These are now taken from the output of boto3's
head_object(). Assembling the headers is delegated to a function
replacing old get_key_headers(). Type coercion is improved to match
the types obtained by working with old boto's bucket.get_key().

Also, we now work only with the "top-level", documented keys of the head_object() response. I also see that there are "HTTPHeaders" under "ResponseMetadata", which may be interesting, but I decided not to use them as all key information was available at the "top level" anyway.

Note: head_object() response includes "LastModified" (datetime), which
removes the necessity of converting from rfc2822 or iso8601
formats (HTTPHeaders still contain the rfc2822 date string). I'm
preserving here a comment from the removed original code: "boto would
return time string the way amazon returns which returns it in two
different ones depending on how key information was obtained
(bucket vs object -- mslw), github.com/boto/boto/issues/466"
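A minimal sketch of that assembly, assuming only boto3's documented head_object() response keys (client, bucket and key are placeholders; the PR's actual helper and type coercions may differ):

    # head_object() returns the documented top-level response keys
    response = client.head_object(Bucket=bucket, Key=key)
    headers = {
        'Content-Length': response['ContentLength'],  # already an int
        'Content-Type': response.get('ContentType'),
        'Last-Modified': response['LastModified'],    # already a datetime
        'ETag': response.get('ETag'),
    }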
The _get_key / _get_key_via_get approach was used to gain additional
insight from the get response if the first (head) attempt fails. I am
wary of the fact that fixes usually accumulate in the code for good
reasons, but I feel that these can be retired without massive
consequences.

Admittedly, we currently only perform head requests (head_object in
this context), which have no response body and therefore return just a
generic 404 Not Found or 403 Forbidden code if things go wrong. But I
think this is workable already. However, if the get query is deemed
necessary, the commit can be reverted, and the logic rebuilt based on
get_object, which might be more convenient to work with.
This was introduced as part of the boto to boto3 rewrite, but if the
code comment about NDA still holds true (I did not verify), it is
redundant. We simply won't add VersionId to download parameters unless
it was contained in the URL, no unsetting necessary. Original comment
introduced in 34ba22a is left for posterity.
Boto3 session does not have a "host" argument; instead it can take a
region (like eu-west-1) and it will figure out the correct endpoint by
itself. If no region is specified, boto3 will use its default, most
likely us-east-1.

Specifying host mattered a lot for old boto: many regions enforce the
use of sigv4 request authentication, which in turn required an
explicit value for host / region. These things are done by default
now. IMHO the most convenient way to pick a region, for users who care
about their requests going through a specific region (like eu-west-1),
would be through ~/.aws/config; but the possibility to set it in the
authenticator remains.

Note: although boto3 figures out the aws endpoint on its own, it is
also possible and easy to set client's endpoint_url in the same way as
the region. There are other possible uses for the 'host' argument that
aren't covered by setting the region (non-aws endpoints, e.g. fake-s3
on localhost) - but if I understand correctly, the authenticator
aspires to be an aws s3 authenticator, so I leave that out.
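For illustration, both options are plain client arguments in boto3 (the region and endpoint values here are just examples):

    import boto3

    # Region: boto3 derives the correct AWS endpoint from it
    client = boto3.client('s3', region_name='eu-west-1')

    # endpoint_url: covers non-AWS endpoints, e.g. a local fake-s3
    client = boto3.client('s3', endpoint_url='http://localhost:4569')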
This restores a workaround for bucket names containing uppercase
letters or dots, replacing old boto's OrdinaryCallingFormat; see
stackoverflow.com/a/19089045/1265472, github.com/boto/boto3/issues/334
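Judging by the linked issue, the boto3 counterpart is presumably path-style addressing, which keeps the bucket name out of the TLS hostname; a sketch:

    import boto3
    from botocore.config import Config

    # Path-style requests (https://s3.amazonaws.com/bucket/key) sidestep
    # certificate mismatches for bucket names containing dots
    client = boto3.client('s3', config=Config(s3={'addressing_style': 'path'}))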
Boto3 itself is capable of retrying calls to AWS. This commit removes
the downloader's usage of our custom try_multiple_dec_s3 and instructs
boto3 to use its standard retry mode (the default is legacy). The
standard mode defaults to a maximum of 3 retry attempts with
exponential backoff.

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html

As it stands, the mode setting will override user aws config (if any
is present at all), but I think it's fair. Alternatively we could do
the same thing as for credentials, i.e. create a session first and see
if it picks up any retry mode. Possible modes are: legacy, standard,
and adaptive (the latter is described as experimental).
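Configuring the mode is a small Config tweak (values per the linked retries guide):

    import boto3
    from botocore.config import Config

    # Standard retry mode: up to 3 attempts with exponential backoff
    client = boto3.client(
        's3',
        config=Config(retries={'mode': 'standard', 'max_attempts': 3}),
    )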
Boto3 invokes the callback "intermittently", and it seems it does a
good job deciding how often to update.
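For reference, a hedged sketch of hooking a progress callback into a boto3 download (client, bucket, key and progress_bar are placeholders):

    # Callback receives the number of bytes transferred since the last
    # call; boto3 decides how often to invoke it
    with open('target.dat', 'wb') as f:
        client.download_fileobj(
            bucket, key, f,
            Callback=lambda nbytes: progress_bar.update(nbytes),
        )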
This removes the last traces of old boto from s3 downloaders.
It seems that most of support/s3.py documents the process of creating
test buckets - these functions are not imported elsewhere. Usage of
some functions had been removed when transitioning s3 downloader to
boto3. Judging by git grep, now only get_versioned_url and
add_version_to_url are used elsewhere (the former in local/addurls.py,
the latter in tests).

Functions that aren't used locally (to generate test buckets) or
imported elsewhere are now removed. Generation of test buckets is
migrated to boto3, and based on high-level resources (which are
different from the low-level clients used in the downloaders code, but
more fit for purpose here).

In manual testing, I had to rename the buckets (otherwise, I got 403),
but after doing so I was able to run all the bucket-generating code
under my aws account.

The get_versioned_url function is yet to be reworked and isn't
functional as of this commit - but because it has a lot of logic, the
task is left for a separate commit.
The get_versioned_url function is now updated to boto3. Matching the
S3 downloader, it now uses a boto3 s3 client object for S3
interactions. URL pattern matching is improved to allow (optional)
region codes in the urls. Boto3 also spares us from dealing with
delete markers: they are included in the response, but separately from
versions.
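That separation comes straight from the API: list_object_versions returns versions and delete markers under different keys (sketch with placeholder names):

    response = client.list_object_versions(Bucket=bucket, Prefix=key)
    versions = response.get('Versions', [])             # real object versions
    delete_markers = response.get('DeleteMarkers', [])  # kept apart by boto3
    # only `versions` needs consulting when resolving a versioned URL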
Boto package is no longer maintained (last release is from July 2018,
v.2.49.0) and has been replaced by Boto3. This change of dependencies
follows a series of changes to migrate our code from Boto to Boto3.

This commit also adjusts external_versions to match.
There are two places where AWS can respond with 403 access denied,
causing botocore.exceptions.ClientError:

1. When the S3Authenticator calls head_bucket (note that this would be
a false positive for buckets that have public objects but don't allow
listing; however it follows the logic of "authenticating to a bucket"
that was present in boto2 code).

2. When the S3Downloader calls head_object

These ClientErrors are now caught and raised as DataLad's
AccessDeniedError.

IMHO this leads to less informative error messages, but
AccessDeniedError seems to be expected in other places of the
codebase. The ClientError message would say, e.g., "An error
occurred (403) when calling the HeadObject operation: Forbidden",
indicating whether it came from HeadBucket or HeadObject. To preserve
the error message, at least in debug output, I am using the original
error as the message for AccessDeniedError.

What is certain is that raising AccessDeniedError was necessary to
make test_deny_access pass when "doesn't matter" (used by the test as
a download url) was replaced with an url to an actual S3 bucket or
object that returned 403. The test as it is still fails, because boto3
now tries to validate the url format and raises a different error (in
any case, not equivalent to access denied), which needs to be
addressed in the test itself, with next commit.
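A sketch of the conversion described above (assuming DataLad's AccessDeniedError lives in datalad.support.exceptions; the committed code may inspect the error differently):

    from botocore.exceptions import ClientError
    from datalad.support.exceptions import AccessDeniedError

    try:
        client.head_object(Bucket=bucket, Key=key)
    except ClientError as e:
        # HEAD responses carry no body, so the error code is just '403'/'404'
        if e.response['Error']['Code'] == '403':
            # reuse the original boto3 message so debug output stays useful
            raise AccessDeniedError(str(e))
        raise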
Although test_deny_access patched downloader._download() to raise
AccessDeniedError, the patch was never triggered!

downloader.download() calls downloader.access(), which first calls
downloader.establish_session() and only then, if successful,
downloader._download(). However, with url="doesn't matter", in boto2
code it was within downloader.establish_session() that authentication
was attempted, a DownloadError was (eventually) produced, and the
assertion passed.

After updating to boto3, downloader.establish_session() would raise a
different error, "Invalid bucket name".

Because _download() was patched and the url was "doesn't matter", it
seems to me that the intention of the test was to check error
conversion coming from a download attempt, not a connection attempt
with a malformed url. Hence, I'm also patching
downloader.establish_session to do nothing and return False
(indicating things went well and the session is not reused).

However, an ideal test should try downloading e.g. a non-public object
from a public bucket. I understand that we don't have those in the
existing datalad-* test buckets. I made one ad-hoc, used its url, and
the test passed with patching removed. However, I don't plan to keep
test buckets owned by my account perpetually, so not using those.

In summary - for migration to boto3 I tried keeping the test close to
what I believe it was, even though it has more to do with
BaseDownloader than S3Downloader behaviour. TBF, I'd also be happy to
retire this test.
@mslw mslw added the semver-minor Increment the minor version when merged label Apr 6, 2024
codecov bot commented Apr 6, 2024

Codecov Report

Attention: Patch coverage is 48.64865% with 76 lines in your changes missing coverage. Please review.

Project coverage is 91.33%. Comparing base (35c5492) to head (9a8dd62).
Report is 31 commits behind head on maint.

Files Patch % Lines
datalad/support/s3.py 28.35% 48 Missing ⚠️
datalad/downloaders/s3.py 60.00% 28 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            maint    #7575      +/-   ##
==========================================
+ Coverage   91.21%   91.33%   +0.11%     
==========================================
  Files         325      325              
  Lines       43442    43345      -97     
  Branches     5785        0    -5785     
==========================================
- Hits        39626    39588      -38     
+ Misses       3801     3757      -44     
+ Partials       15        0      -15     



@yarikoptic (Member) commented:

Thank you @effigies and sorry for the silence -- somehow this awesome PR went behind my consciousness radar.

To test the authentication, we will need #7574.

why? we didn't have any credentials provided there either, so should probably be the same.

…'s logger

This should bring back running the tests in
downloaders/tests/test_s3.py, which were completely skipped on CI and
thus code coverage was low.

Tested in the boto3 codebase that the logger is indeed 'boto3' and not just 'boto':

    ❯ git grep getLogger
    boto3/__init__.py:    logger = logging.getLogger(name)
    boto3/__init__.py:logging.getLogger('boto3').addHandler(NullHandler())
    boto3/dynamodb/table.py:logger = logging.getLogger(__name__)
    boto3/resources/action.py:logger = logging.getLogger(__name__)
    boto3/resources/base.py:logger = logging.getLogger(__name__)
    boto3/resources/collection.py:logger = logging.getLogger(__name__)
    boto3/resources/factory.py:logger = logging.getLogger(__name__)
    boto3/resources/model.py:logger = logging.getLogger(__name__)
    boto3/s3/transfer.py:logger = logging.getLogger(__name__)
    docs/source/guide/error-handling.rst:    logger = logging.getLogger()
    tests/integration/test_s3.py:LOG = logging.getLogger('boto3.tests.integration')
    ❯ git describe
    1.34.99
@effigies (Contributor, Author) commented May 6, 2024

My understanding in Dusseldorf was that these tests, which require credentials, had been running on Travis. I wrote new tests that don't require credentials, and was noting that the authentication won't be tested without credentials.

@yarikoptic (Member) left a review comment:

Initial comments/questions; just started to dig. Had to recall how to get tests to run with credentials ;) it should be

DATALAD_TESTS_CREDENTIALS=system python -m pytest -s -v datalad/downloaders/tests/test_s3.py

if you already have credentials known to the system, e.g. from invoking datalad ls 's3://datalad-test0-versioned/'.

running them gives:

FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt-version2-version1] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=V4Dqhu0QTEtxmvoNkCHGrjVZVomR1Ryo-version2-version1] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=null-version1-version2] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad.test1/version1.txt-version1-nothing] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 9 differs from originally announced 1

So it seems like somewhere the size of a key is obtained incorrectly.

I have pushed 2 commits

  • to make downloaders/tests/test_s3.py not skip (so we should get some code covered now)
  • to remove a test which kinda no longer applies (it tested removed interfaces), but we might want to bring it back in some fashion

@@ -60,32 +52,46 @@ class S3Authenticator(Authenticator):
     allows_anonymous = True
     DEFAULT_CREDENTIAL_TYPE = 'aws-s3'

-    def __init__(self, *args, host=None, **kwargs):
+    def __init__(self, *args, host=None, region=None, **kwargs):
@yarikoptic (Member) commented:

What was the overall motivation to remove host?

I am not aware of any use case, but I could foresee people potentially using it against some non-AWS S3 servers (minio etc.), which might work...? Meanwhile, I formulated that idea for testing:

It seems that host can be supported via endpoint_url https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-privatelink.html and https://www.stackhero.io/en/services/MinIO/documentations/Getting-started/Connect-to-MinIO-from-Python even shows an example of specifying it together with the region (maybe the implementation somehow handles it internally)? WDYT?

@effigies (Contributor, Author) replied:

This was removed by @mslw in the previous PR. My guess is because host did not have an obvious counterpart. I think it would be sensible to restore it if endpoint_url is the equivalent concept (modulo some shim?).

@yarikoptic (Member) replied:

Gotcha... insofar it seems that host was just an FQDN (e.g. s3.us-east-2.amazonaws.com), whereas endpoint_url should be a URL and thus have the protocol (https) added... So I will reincarnate host, extend it to start with the protocol (https) if it doesn't have :// in it, and provide it as endpoint_url. Sounds ok'ish?
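Something like this minimal shim would do (hypothetical helper name):

    def host_to_endpoint_url(host):
        # old-style `host` was a bare FQDN; endpoint_url needs a scheme
        if '://' not in host:
            host = 'https://' + host
        return host

    # 's3.us-east-2.amazonaws.com' -> 'https://s3.us-east-2.amazonaws.com'
    client = boto3.client('s3', endpoint_url=host_to_endpoint_url(host))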

@effigies (Contributor, Author) replied:

Seems fine to me.

@yarikoptic (Member) commented:

I think those particular size mismatches are due to the removed

        if size:
-            headers['Range'] = 'bytes=0-%d' % (size - 1)

where we allowed downloading a specific leading portion of the file... smth to get back

…ing of the file

Had to decompose s3_download_kwargs since VersionId sometimes goes along with
Bucket and Key, but sometimes needs to go as ExtraArgs in the new boto3 interface.

Also, apparently there is no Range support for download_fileobj, so as a really
dirty workaround we will just read into memory and save into a file. But getting
the key that way does not engage a Callback -- so fixed that too.

I think such use cases of getting size specified should not be frequent
-- at least with "download-url" we do get to download_fileobj and get a
Callback engaged.

Altogether this seems to address current test cases.
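A hedged sketch of that workaround (placeholder names; the committed code may differ in details):

    params = {'Bucket': bucket, 'Key': key}
    if version_id:
        params['VersionId'] = version_id

    if size is not None:
        # download_fileobj has no Range support, so fetch the leading
        # portion with get_object and buffer it in memory
        response = client.get_object(Range='bytes=0-%d' % (size - 1), **params)
        fileobj.write(response['Body'].read())
    else:
        extra = {'VersionId': version_id} if version_id else None
        client.download_fileobj(bucket, key, fileobj,
                                ExtraArgs=extra, Callback=callback)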
@yarikoptic (Member) commented:

I pushed some changes to make it seemingly work for all test cases we have. But trying it on an HCP case led to nothing downloaded, with the progress bar showing GBps speed ;)
❯ datalad get -J5 .
2024-05-06 23:33:47,125 [WARNING] DueCredit internal failure while running <function _get_active_due at 0x7f5884a4dd00>: ModuleNotFoundError("No module named 'distutils'"). Please report to developers at https://github.com/duecredit/duecredit/issues (utils.py:194)
[ERROR  ] Failed to import duecredit due to Both inactive and active collectors should be provided. Got active=None, inactive=InactiveDueCreditCollector() 
[WARNING] Requested extension 'next' is not available 
[WARNING] Requested extension 'dataverse' is not available 
get(error): 100206.164k_fs_LR.wb.spec (file) [not available; (Note that these git remotes have annex-ignore set: origin)]                                                                                                         
get(error): 100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.ArealDistortion_MSMSulc.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.RefMyelinMap.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.MyelinMap_BC.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.SmoothedMyelinMap.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.BA.164k_fs_LR.label.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.curvature.164k_fs_LR.shape.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.ArealDistortion_FS.164k_fs_LR.shape.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
  [11837 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (error: 11847)
datalad get -J5 .  48.36s user 4.85s system 251% cpu 21.124 total
❯ pwd
/home/yoh/datalad/hcp-openaccess/HCP1200/100206/MNINonLinear

So, something to troubleshoot more, and probably also a reason for us to bring back the host specification which was present originally.

@yarikoptic yarikoptic changed the base branch from maint to master May 7, 2024 13:30
@yarikoptic (Member) commented:

  • changed base to master since this is semver-minor (maint is really for bug fixes).
  • restarted the brew test -- spurious connectivity issue.
  • we can try to resolve the host issue to make it "complete" and then try a little more "by hand", I guess.

Labels
semver-minor Increment the minor version when merged

Successfully merging this pull request may close these issues:

downloaders/s3: migrate from boto (2.x) to boto3

3 participants