RF: Drop boto for boto3 #7575
base: master
Conversation
This commit replaces the usage of boto (which has been deprecated) with boto3 in the s3 downloaders module. Instead of boto bucket and key objects, the functions in this module now operate primarily on a boto3 client object (a "low-level interface to AWS whose methods map close to 1:1 with service APIs"). Although boto3 has a high-level resource object for a Bucket, it behaves differently from the old boto bucket, and I found working with a client much more suitable for the kind of operations done by the downloaders code (setting credentials, asking about headers, etc.).

The commit also allows boto3 to use credentials coming from its own configuration mechanism (`AWS_*` environment variables, `~/.aws/credentials`, `~/.aws/config`, `/etc/boto.cfg` and `~/.boto`), letting AWS users rely on their existing configuration. The preference order is: DataLad credentials, boto credentials, anonymous.

Some unused code pieces (now incompatible with the rest of the file) are left over, and some new TODOs for completing the transition are introduced. These will be addressed by subsequent commits, but the current code already works for simple (authenticated and anonymous) download scenarios.

Note: the S3Authenticator.authenticate docstring said that it authenticates to the specified bucket. This behaviour was maintained for compatibility: we effectively check bucket permissions, as the old code did. But the transition to the boto3 client object makes it apparent that authenticating to a bucket is not really what we care about if all we want is to download an object. Authentication is done to S3, and then object permissions are decisive for downloading. For example, a bucket that doesn't block public access can contain a public file (public read permission) while not granting public permission to list objects (an easy and reasonable setup in S3); in that case non-owner bucket interaction will end in 403 although the object could be retrieved.
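The credential preference order described above can be sketched roughly as follows. This is a hypothetical helper, not the PR's actual code: `datalad_credential` stands in for whatever the DataLad credential store returns, and the `"UNSIGNED"` placeholder marks where real code would pass a `botocore` config with `signature_version=botocore.UNSIGNED` for anonymous access.

```python
def choose_client_kwargs(datalad_credential, boto_has_credentials):
    """Pick keyword arguments for boto3.client('s3', ...) (sketch).

    Preference order: DataLad credentials > boto3's own credential
    chain > anonymous access.
    """
    if datalad_credential is not None:
        # Explicit keys override anything boto3 would find on its own.
        return {
            "aws_access_key_id": datalad_credential["key_id"],
            "aws_secret_access_key": datalad_credential["secret_id"],
        }
    if boto_has_credentials:
        # Pass nothing: boto3 resolves AWS_* env vars,
        # ~/.aws/credentials, ~/.aws/config, etc. by itself.
        return {}
    # Anonymous: real code would pass
    # config=botocore.client.Config(signature_version=botocore.UNSIGNED)
    return {"config": "UNSIGNED"}  # placeholder, not a real boto3 value
```

A session created without explicit keys can first be probed for credentials (e.g. `boto3.Session().get_credentials()`) to decide between the second and third branch.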
Methods in the downloader framework need headers as an input argument. These are now taken from the output of boto3's head_object(). Assembling the headers is delegated to a function replacing the old get_key_headers(). Type coercion is improved to match the types obtained by working with old boto's bucket.get_key(). Also, we now work only with the "top-level", documented keys of the head_object() response. There are also "HTTPHeaders" under "ResponseMetadata", which may be interesting, but I decided not to use them, as all key information was available at the top level anyway.

Note: the head_object() response includes "LastModified" (a datetime), which removes the need to convert from rfc2822 or iso8601 formats (HTTPHeaders still contain the rfc2822 date string). I'm preserving here a comment from the removed original code: "boto would return time string the way amazon returns which returns it in two different ones depending on how key information was obtained (bucket vs object -- mslw), github.com/boto/boto/issues/466"
The _get_key / _get_key_via_get approach was used to gain additional insight from the get response if the first (head) attempt fails. I am wary of the fact that fixes usually accumulate in the code for good reasons, but I feel that these can be retired without massive consequences. Admittedly, we currently only perform head requests (head_object in this context), which have no response body and therefore return just a generic 404 Not Found or 403 Forbidden code if things go wrong. But I think this is workable already. However, if the get query is deemed necessary, the commit can be reverted, and the logic rebuilt based on get_object, which might be more convenient to work with.
This was introduced as part of the boto to boto3 rewrite, but if the code comment about NDA still holds true (I did not verify), it is redundant. We simply won't add VersionId to download parameters unless it was contained in the URL, no unsetting necessary. Original comment introduced in 34ba22a is left for posterity.
A boto3 session does not have a "host" argument; instead it can take a region (like eu-west-1) and will figure out the correct endpoint by itself. If no region is specified, boto3 will use its default, most likely us-east-1. Specifying host mattered a lot for old boto: many regions enforce use of sigv4 request authentication, which in turn required an explicit value for host / region. These things are handled by default now. IMHO the most convenient way to pick a region, for users who care about their requests going through a specific region (like eu-west-1), would be through ~/.aws/config; but the possibility to set it in the authenticator remains.

Note: although boto3 figures out the AWS endpoint on its own, it is also possible and easy to set the client's endpoint_url in the same way as the region. There are other possible uses for the 'host' argument that aren't covered by setting the region (non-AWS endpoints, e.g. fake-s3 on localhost), but if I understand correctly, the authenticator aspires to be an AWS S3 authenticator, so I leave that out.
This restores a workaround for bucket names containing uppercase or dots, and is a replacement of old boto OrdinaryCallingFormat; see stackoverflow.com/a/19089045/1265472, github.com/boto/boto3/issues/334
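In boto3 terms, the old OrdinaryCallingFormat corresponds to path-style addressing, which can be requested via the client config. Whether the commit sets it exactly this way is an assumption; the `addressing_style` option itself is a documented botocore setting.

```python
# Path-style addressing replaces old boto's OrdinaryCallingFormat: the
# bucket name goes into the URL path instead of the TLS hostname, which
# avoids certificate mismatches for bucket names containing dots (and
# sidesteps virtual-host issues for otherwise unusual names).
s3_addressing = {"addressing_style": "path"}

# In real code this dict would be passed to the client, e.g.:
# from botocore.config import Config
# client = boto3.client("s3", config=Config(s3=s3_addressing))
```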
Boto3 itself is capable of retrying calls to AWS. This commit removes downloader's usage of our custom try_multiple_dec_s3 and instructs boto3 to use its standard retry mode (default is legacy). The standard mode will do a default of 3 maximum retry attempts with an exponential backoff time. https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html As it stands, the mode setting will override user aws config (if any is present at all), but I think it's fair. Alternatively we could do the same thing as for credentials, i.e. create a session first and see if it picks up any retry mode. Possible modes are: legacy, standard, and adaptive (the latter is described as experimental).
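The retry setting described above amounts to a small configuration fragment. The values below reflect the linked boto3 retries guide; whether the commit hardcodes `max_attempts` as well is an assumption.

```python
# "standard" retry mode: a default of 3 maximum attempts with
# exponential backoff, as opposed to the "legacy" default mode.
retry_config = {
    "max_attempts": 3,    # default for "standard" mode
    "mode": "standard",   # alternatives: "legacy" (default), "adaptive" (experimental)
}

# In real code this dict would be passed to the client, e.g.:
# from botocore.config import Config
# client = boto3.client("s3", config=Config(retries=retry_config))
```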
Boto3 invokes the callback "intermittently", and it seems it does a good job deciding how often to update.
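The callback contract is simple: boto3 calls the Callback with the number of bytes transferred since the last invocation. A minimal sketch of such a callback (the class and its reporting hook are hypothetical, not the PR's actual progress code):

```python
class DownloadProgress:
    """Accumulate bytes reported by boto3's transfer Callback (sketch)."""

    def __init__(self, total_size, report=print):
        self.total_size = total_size
        self.transferred = 0
        self._report = report

    def __call__(self, bytes_since_last_call):
        # boto3 invokes this intermittently during download_fileobj()
        self.transferred += bytes_since_last_call
        self._report(f"{self.transferred}/{self.total_size} bytes")

# Illustrative use (not executed here):
#   client.download_fileobj(Bucket=bucket, Key=key, Fileobj=f,
#                           Callback=DownloadProgress(size))
```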
This removes the last traces of old boto from s3 downloaders.
It seems that most of support/s3.py documents the process of creating test buckets; these functions are not imported elsewhere. Usage of some functions had already been removed when transitioning the s3 downloader to boto3. Judging by git grep, only get_versioned_url and add_version_to_url are now used elsewhere (the former in local/addurls.py, the latter in tests). Functions that are neither used locally (to generate test buckets) nor imported elsewhere are now removed. Generation of test buckets is migrated to boto3 and based on high-level resources (which are different from the low-level clients used in the downloaders code, but a better fit for this purpose). In manual testing, I had to rename the buckets (otherwise I got 403), but after doing so I was able to run all the bucket-generating code under my AWS account. The get_versioned_url function is yet to be reworked and isn't functional as of this commit; because it has a lot of logic, the task is left for a separate commit.
The get_versioned_url function is now updated to boto3. Matching the S3 downloader, it uses a boto3 s3 client object for S3 interactions. URL pattern matching is improved to allow (optional) region codes in the URLs. Boto3 also spares us from dealing with delete markers: they are included in the response, but separately from versions.
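In a list_object_versions() response, real versions and delete markers arrive under separate keys ("Versions" and "DeleteMarkers"), so picking the latest object version needs no marker filtering. A sketch operating on the documented response shape; the sample data below is made up.

```python
def latest_version_id(response, key):
    """Return the VersionId of the latest real (non-delete-marker)
    version of `key` in a list_object_versions() response (sketch)."""
    for version in response.get("Versions", []):
        # delete markers live under response["DeleteMarkers"]
        # and never appear in "Versions"
        if version["Key"] == key and version["IsLatest"]:
            return version["VersionId"]
    return None

# made-up sample response illustrating the separation
sample = {
    "Versions": [
        {"Key": "file.txt", "VersionId": "v2", "IsLatest": True},
        {"Key": "file.txt", "VersionId": "v1", "IsLatest": False},
    ],
    "DeleteMarkers": [
        {"Key": "gone.txt", "VersionId": "m1", "IsLatest": True},
    ],
}
```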
The boto package is no longer maintained (the last release is from July 2018, v2.49.0) and has been replaced by boto3. This change of dependencies follows a series of changes to migrate our code from boto to boto3. This commit also adjusts external_versions to match.
There are two places where AWS can respond with 403 access denied, causing botocore.exceptions.ClientError:

1. When the S3Authenticator calls head_bucket (note that this would be a false positive for buckets that have public objects but don't allow listing; however, it follows the logic of "authenticating to a bucket" that was present in the boto2 code).
2. When the S3Downloader calls head_object.

These ClientErrors are now caught and raised as DataLad's AccessDeniedError. IMHO this leads to less informative error messages, but AccessDeniedError seems to be expected in other places of the codebase. The ClientError message would say, e.g., "An error occurred (403) when calling the HeadObject operation: Forbidden", indicating whether it came from HeadBucket or HeadObject. To preserve the error message, at least in debug output, I am using the original error as the message for AccessDeniedError.

What is certain is that raising AccessDeniedError was necessary to make test_deny_access pass when "doesn't matter" (used by the test as a download URL) was replaced with a URL to an actual S3 bucket or object that returned 403. The test as it stands still fails, because boto3 now tries to validate the URL format and raises a different error (in any case, not equivalent to access denied), which needs to be addressed in the test itself, in the next commit.
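The translation could look roughly like this. A sketch only: AccessDeniedError is DataLad's exception, and the stand-in ClientError below merely mimics the shape of botocore.exceptions.ClientError so the example stays self-contained.

```python
class ClientError(Exception):
    """Stand-in mimicking botocore.exceptions.ClientError."""

    def __init__(self, response, operation_name):
        self.response = response
        self.operation_name = operation_name
        super().__init__(
            f"An error occurred ({response['Error']['Code']}) "
            f"when calling the {operation_name} operation"
        )


class AccessDeniedError(Exception):
    """Stand-in for DataLad's AccessDeniedError."""


def head_object_checked(head_object, bucket, key):
    """Call head_object and convert a 403 ClientError into
    AccessDeniedError, keeping the original message for debug output."""
    try:
        return head_object(Bucket=bucket, Key=key)
    except ClientError as e:
        if e.response["Error"]["Code"] == "403":
            # preserve the informative botocore message
            raise AccessDeniedError(str(e)) from e
        raise
```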
Although test_deny_access patched downloader._download() to raise AccessDeniedError, the patch was never triggered! downloader.download() calls downloader.access(), which first calls downloader.establish_session() and only then, if successful, downloader._download(). However, with url="doesn't matter", in the boto2 code it was within downloader.establish_session() that authentication was attempted, a DownloadError was (eventually) produced, and the assertion passed. After updating to boto3, downloader.establish_session() would raise a different error, "Invalid bucket name".

Because _download() was patched and the URL was "doesn't matter", it seems to me that the intention of the test was to check error conversion coming from a download attempt, not a connection attempt with a malformed URL. Hence, I'm also patching downloader.establish_session to do nothing and return False (indicating things went well and the session is not reused). However, an ideal test would try downloading e.g. a non-public object from a public bucket. I understand that we don't have those in the existing datalad-* test buckets. I made one ad hoc, used its URL, and the test passed with the patching removed. However, I don't plan to keep test buckets owned by my account perpetually, so I am not using those.

In summary: for the migration to boto3 I tried keeping the test close to what I believe it was, even though it has more to do with BaseDownloader than S3Downloader behaviour. TBF, I'd also be happy to retire this test.
Codecov Report
Attention: Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff             @@
##            maint    #7575      +/-   ##
==========================================
+ Coverage   91.21%   91.33%   +0.11%
==========================================
  Files         325      325
  Lines       43442    43345      -97
  Branches     5785        0    -5785
==========================================
- Hits        39626    39588      -38
+ Misses       3801     3757      -44
+ Partials       15        0      -15
```

☔ View full report in Codecov by Sentry.
@mslw @yarikoptic It looks like datalad-crawler depends on the boto bucket API:
…'s logger

This should bring running of the tests in downloaders/tests/test_s3.py, which were completely skipped on CI and thus code coverage was low. Tested that the logger is indeed boto3 and not just boto in boto3's codebase:

```
❯ git grep getLogger
boto3/__init__.py: logger = logging.getLogger(name)
boto3/__init__.py:logging.getLogger('boto3').addHandler(NullHandler())
boto3/dynamodb/table.py:logger = logging.getLogger(__name__)
boto3/resources/action.py:logger = logging.getLogger(__name__)
boto3/resources/base.py:logger = logging.getLogger(__name__)
boto3/resources/collection.py:logger = logging.getLogger(__name__)
boto3/resources/factory.py:logger = logging.getLogger(__name__)
boto3/resources/model.py:logger = logging.getLogger(__name__)
boto3/s3/transfer.py:logger = logging.getLogger(__name__)
docs/source/guide/error-handling.rst: logger = logging.getLogger()
tests/integration/test_s3.py:LOG = logging.getLogger('boto3.tests.integration')
❯ git describe
1.34.99
```
My understanding in Dusseldorf was that these tests, which require credentials, had been running on Travis. I wrote new tests that don't require credentials, and was noting that the authentication won't be tested without credentials.
Initial comments/questions; just started to dig, had to recall how to get tests to run with credentials ;) it should be

```
DATALAD_TESTS_CREDENTIALS=system python -m pytest -s -v datalad/downloaders/tests/test_s3.py
```

if credentials are already known to the system, e.g. after invoking datalad ls 's3://datalad-test0-versioned/'. Running them gives:
```
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt-version2-version1] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=V4Dqhu0QTEtxmvoNkCHGrjVZVomR1Ryo-version2-version1] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=null-version1-version2] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad.test1/version1.txt-version1-nothing] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 9 differs from originally announced 1
```
so seems like somewhere getting the size of a key is incorrect.
I have pushed 2 commits:
- to make downloaders/tests/test_s3.py not skip (so we should get some code covered now)
- to remove a test which kinda no longer applies (it tested removed interfaces), but we might want to bring it back in some fashion
```diff
@@ -60,32 +52,46 @@ class S3Authenticator(Authenticator):
     allows_anonymous = True
     DEFAULT_CREDENTIAL_TYPE = 'aws-s3'

-    def __init__(self, *args, host=None, **kwargs):
+    def __init__(self, *args, host=None, region=None, **kwargs):
```
What was the overall motivation to remove host? I am not aware of any use case, but I could foresee people potentially using it against some non-AWS S3 servers (minio etc.), which might work...? Meanwhile I formulated that idea for testing: it seems that host can be supported via endpoint_url, see https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-privatelink.html, and https://www.stackhero.io/en/services/MinIO/documentations/Getting-started/Connect-to-MinIO-from-Python even shows an example of specifying it together with the region (maybe the implementation somehow handles it internally)? WDYT?
This was removed by @mslw in the previous PR. My guess is because host did not have an obvious counterpart. I think it would be sensible to restore it if endpoint_url is the equivalent concept (modulo some shim?).
Gotcha... insofar it seems that host was just an FQDN (e.g. s3.us-east-2.amazonaws.com), whereas endpoint_url should be a URL and thus have a protocol (https) added... So I will reincarnate host and extend it to be prefixed with a protocol (https) if it doesn't have :// in it, and provide it as endpoint_url. Sounds ok'ish?
Seems fine to me.
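The shim sketched in this thread could be as simple as the following (a hypothetical helper illustrating the idea; https is prepended only when no scheme is present):

```python
def host_to_endpoint_url(host):
    """Turn a bare FQDN 'host' (old boto style) into an endpoint_url
    acceptable to boto3, prefixing https:// when no scheme is given."""
    if "://" not in host:
        return "https://" + host
    return host

# Illustrative use (not executed here):
#   client = boto3.client("s3", endpoint_url=host_to_endpoint_url(host))
```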
I think those particular size mismatches are due to the removed handling where we allowed for download of a specific leading portion of the file... smth to get back
…ing of the file

Had to decompose s3_download_kwargs, since VersionId sometimes goes along with Bucket and Key, but sometimes needs to go as ExtraKwargs in the new boto3 interface. Also, apparently there is no Range support for download_fileobj, so as a really dirty workaround we just read into memory and save into a file. But getting the key does not get a Callback, so fixed that too. I think such use cases with a size specified should not be frequent; at least with "download-url" we do get to download_fileobj and get a Callback engaged. Altogether this seems to address the current test cases.
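Since download_fileobj() has no Range support, the workaround described above amounts to a plain get_object() call with a Range header, reading the body into memory. Only the small header-building helper below is exercised; the client call is illustrative.

```python
def range_header(size):
    """Build an HTTP Range header value for the leading `size` bytes.

    HTTP byte ranges are inclusive, so the leading 8 bytes are
    "bytes=0-7".
    """
    return f"bytes=0-{size - 1}"

# Illustrative use (not executed here):
#   resp = client.get_object(Bucket=bucket, Key=key,
#                            Range=range_header(size))
#   fileobj.write(resp["Body"].read())  # dirty: whole range held in memory
```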
I pushed some changes to make it seemingly work for all test cases we have, but trying it on an HCP case led to nothing downloaded while the progress bar showed GBps speed ;)

```
❯ datalad get -J5 .
2024-05-06 23:33:47,125 [WARNING] DueCredit internal failure while running <function _get_active_due at 0x7f5884a4dd00>: ModuleNotFoundError("No module named 'distutils'"). Please report to developers at https://github.com/duecredit/duecredit/issues (utils.py:194)
[ERROR ] Failed to import duecredit due to Both inactive and active collectors should be provided. Got active=None, inactive=InactiveDueCreditCollector()
[WARNING] Requested extension 'next' is not available
[WARNING] Requested extension 'dataverse' is not available
get(error): 100206.164k_fs_LR.wb.spec (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.ArealDistortion_MSMSulc.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.RefMyelinMap.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.MyelinMap_BC.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.SmoothedMyelinMap.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.BA.164k_fs_LR.label.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.curvature.164k_fs_LR.shape.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.ArealDistortion_FS.164k_fs_LR.shape.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
[11837 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (error: 11847)
datalad get -J5 .  48.36s user 4.85s system 251% cpu 21.124 total
❯ pwd
/home/yoh/datalad/hcp-openaccess/HCP1200/100206/MNINonLinear
```
so something to troubleshoot more and probably also for us to return
This PR transitions from the deeply defunct boto package to boto3. We refrain from using the high-level API, as it makes it difficult to customize the authentication method.

This PR preserves the old behavior of using datalad-managed credentials, falling back to anonymous access. A future effort could use the boto3 API to give users the option of pulling credentials from AWS sources instead of manually re-entering them, but this is specifically avoided here. This is the main difference between this PR and #7340.
I've added an anonymous fetch test using a small file from OpenNeuro, which at least exercises most of the functionality. To test the authentication, we will need #7574.
Revives #7340.
Closes #5597, #7587.
Still relevant inclusions from original PR text: