
RF: Drop boto for boto3 #7575

Open: wants to merge 23 commits into master
Conversation

@effigies (Contributor) commented Apr 6, 2024

This PR transitions from the deeply defunct boto package to boto3. We refrain from using the high-level API, as it makes it difficult to customize the authentication method.

This PR preserves the old behavior of using datalad-managed credentials, falling back to anonymous access. A future effort could use the boto3 API to give users the option of pulling credentials from AWS sources instead of manually re-entering them, but this is specifically avoided here. This is the main difference between this and #7340.

I've added an anonymous fetch test using a small file from OpenNeuro, which at least exercises most of the functionality. To test the authentication, we will need #7574.

Revives #7340.
Closes #5597, #7587.

Still relevant inclusions from original PR text:

The boto package is no longer maintained (the last release, v2.49.0, is from July 2018) and has been replaced by boto3. This PR removes the boto dependency in favour of boto3 and makes all necessary code adjustments.

The most significant changes include:

  • Custom retry logic for S3 connections is removed in favour of boto3's built-in standard retry mode (see boto3 docs: Retries).

  • Instead of the old boto bucket and key objects, the functions in the downloader module will now operate primarily on a boto3 S3 client object ("low-level interface to AWS whose methods map close to 1:1 with service APIs"). Although boto3 has a high-level resource object for a Bucket, its methods differ from the old boto bucket, and I found working with a client much more suitable for the kind of operations done by the downloaders code (setting credentials, asking about headers, etc.).

See commit messages for details on specific decisions - [@mslw] tried to be verbose.

Notes

Note: The downloader/authenticator code based on boto seems to be built around "authenticating to a bucket". The authenticator's behaviour was maintained for compatibility (it still checks bucket access permissions, as the old code did). But the transition to a boto3 client object makes it apparent that authenticating to a bucket is not really what matters if all we want is to download an object. Authentication is done to S3, and then object permissions are decisive for downloading. For example, a public file (public read permission) can be in a bucket that does not block public access but does not grant public permission to list objects either (an easy and reasonable setup in S3), in which case non-owner bucket interaction will end in a 403 although the object could be retrieved.

mslw and others added 20 commits April 6, 2024 14:39
This commit replaces the usage of boto (which has been deprecated)
with boto3 in the s3 downloaders module.

Instead of boto bucket and key objects, the functions in this module
will now operate primarily on a boto3 client object ("low-level
interface to AWS whose methods map close to 1:1 with service
APIs"). Although boto3 has a high-level resource object for a Bucket,
it behaves differently than the old boto bucket, and I found working
with a client much more suitable for the kind of operations done by
the downloaders code (setting credentials, asking about headers,
etc.).

The commit also allows boto3 to use credentials coming from its own
configuration mechanism (`AWS_*` environment variables,
`~/.aws/credentials`, `~/.aws/config`, `/etc/boto.cfg` and `~/.boto`)
allowing AWS users to rely on their existing configuration. The
preference order is DataLad credentials, boto credentials, anonymous.

Some unused code pieces (now incompatible with the rest of the file)
are left over, and some new TODOs for completing the transition are
introduced. These will be addressed by subsequent commits. But the
current code works for simple (authenticated & anonymous) download
scenarios already.

Note: The S3Authenticator.authenticate docstring said that it
authenticates to the specified bucket. This behaviour was maintained
for compatibility - we effectively check bucket permissions, as the
old code did. But the transition to boto3 client object makes it
apparent that authenticating to a bucket is not really what we care
about if all we want is to download an object. Authentication is done
to S3, and then object permissions are decisive for downloading. For
example, there can be a public file (public read permission) in a
bucket that doesn't block public access, but does not grant public
permission to list objects (easy and reasonable setup in S3) - in
which case non-owner bucket interaction will end in 403 although the
object could be retrieved.
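As a rough sketch of that preference order (not the PR's actual code; datalad_credential is a hypothetical stand-in for whatever DataLad's credential machinery returns):

    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    def make_s3_client(datalad_credential=None):
        # 1. DataLad-managed credentials win if present
        if datalad_credential is not None:
            return boto3.client(
                's3',
                aws_access_key_id=datalad_credential['key_id'],
                aws_secret_access_key=datalad_credential['secret_id'],
            )
        # 2. Otherwise let boto3's own chain look for credentials
        #    (AWS_* env vars, ~/.aws/credentials, ~/.aws/config, ...)
        session = boto3.Session()
        if session.get_credentials() is not None:
            return session.client('s3')
        # 3. Finally fall back to anonymous (unsigned) access
        return boto3.client('s3', config=Config(signature_version=UNSIGNED))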
Methods in the downloader framework need headers as an input
argument. These are now taken from the output of boto3's
head_object(). Assembling the headers is delegated to a function
replacing old get_key_headers(). Type coercion is improved to match
the types obtained by working with old boto's bucket.get_key().

Also, we now work only with the "top-level", documented keys of the head_object() response. I also see that there are "HTTPHeaders" under "ResponseMetadata", which may be interesting, but I decided not to use them as all key information was available at the "top level" anyway.

Note: head_object() response includes "LastModified" (datetime), which
removes the necessity of converting from rfc2822 or iso8601
formats (HTTPHeaders still contain the rfc2822 date string). I'm
preserving here a comment from the removed original code: "boto would
return time string the way amazon returns which returns it in two
different ones depending on how key information was obtained
(bucket vs object -- mslw), github.com/boto/boto/issues/466"
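A minimal sketch of that assembly, assuming only boto3's documented head_object() response keys (client, bucket and key are placeholders; the PR's actual helper and type coercions may differ):

    # head_object() returns the documented top-level response keys
    response = client.head_object(Bucket=bucket, Key=key)
    headers = {
        'Content-Length': response['ContentLength'],  # already an int
        'Content-Type': response.get('ContentType'),
        'Last-Modified': response['LastModified'],    # already a datetime
        'ETag': response.get('ETag'),
    }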
The _get_key / _get_key_via_get approach was used to gain additional
insight from the get response if the first (head) attempt fails. I am
wary of the fact that fixes usually accumulate in the code for good
reasons, but I feel that these can be retired without massive
consequences.

Admittedly, we currently only perform head requests (head_object in
this context), which have no response body and therefore return just a
generic 404 Not Found or 403 Forbidden code if things go wrong. But I
think this is workable already. However, if the get query is deemed
necessary, the commit can be reverted, and the logic rebuilt based on
get_object, which might be more convenient to work with.
This was introduced as part of the boto to boto3 rewrite, but if the
code comment about NDA still holds true (I did not verify), it is
redundant. We simply won't add VersionId to download parameters unless
it was contained in the URL, no unsetting necessary. Original comment
introduced in 34ba22a is left for posterity.
Boto3 session does not have a "host" argument; instead it can take a
region (like eu-west-1) and it will figure out the correct endpoint by
itself. If no region is specified, boto3 will use its default, most
likely us-east-1.

Specifying host mattered a lot for old boto: many regions enforce the
use of sigv4 request authentication, which in turn required an
explicit value for host / region. These things are done by default
now. IMHO the most convenient way to pick a region, for users who care
about their requests going through a specific region (like eu-west-1),
would be through ~/.aws/config; but the possibility to set it in the
authenticator remains.

Note: although boto3 figures out the aws endpoint on its own, it is
also possible and easy to set client's endpoint_url in the same way as
the region. There are other possible uses for the 'host' argument that
aren't covered by setting the region (non-aws endpoints, e.g. fake-s3
on localhost) - but if I understand correctly, the authenticator
aspires to be an aws s3 authenticator, so I leave that out.
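For illustration, both options are plain client arguments in boto3 (the region and endpoint values here are just examples):

    import boto3

    # Region: boto3 derives the correct AWS endpoint from it
    client = boto3.client('s3', region_name='eu-west-1')

    # endpoint_url: covers non-AWS endpoints, e.g. a local fake-s3
    client = boto3.client('s3', endpoint_url='http://localhost:4569')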
This restores a workaround for bucket names containing uppercase
letters or dots, replacing old boto's OrdinaryCallingFormat; see
stackoverflow.com/a/19089045/1265472, github.com/boto/boto3/issues/334
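Judging by the linked issue, the boto3 counterpart is presumably path-style addressing, which keeps the bucket name out of the TLS hostname; a sketch:

    import boto3
    from botocore.config import Config

    # Path-style requests (https://s3.amazonaws.com/bucket/key) sidestep
    # certificate mismatches for bucket names containing dots
    client = boto3.client('s3', config=Config(s3={'addressing_style': 'path'}))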
Boto3 itself is capable of retrying calls to AWS. This commit removes
the downloader's usage of our custom try_multiple_dec_s3 and instructs
boto3 to use its standard retry mode (the default is legacy). The
standard mode defaults to a maximum of 3 retry attempts with
exponential backoff.

https://boto3.amazonaws.com/v1/documentation/api/latest/guide/retries.html

As it stands, the mode setting will override user aws config (if any
is present at all), but I think it's fair. Alternatively we could do
the same thing as for credentials, i.e. create a session first and see
if it picks up any retry mode. Possible modes are: legacy, standard,
and adaptive (the latter is described as experimental).
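Configuring the mode is a small Config tweak (values per the linked retries guide):

    import boto3
    from botocore.config import Config

    # Standard retry mode: up to 3 attempts with exponential backoff
    client = boto3.client(
        's3',
        config=Config(retries={'mode': 'standard', 'max_attempts': 3}),
    )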
Boto3 invokes the callback "intermittently", and it seems it does a
good job deciding how often to update.
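For reference, a hedged sketch of hooking a progress callback into a boto3 download (client, bucket, key and progress_bar are placeholders):

    # Callback receives the number of bytes transferred since the last
    # call; boto3 decides how often to invoke it
    with open('target.dat', 'wb') as f:
        client.download_fileobj(
            bucket, key, f,
            Callback=lambda nbytes: progress_bar.update(nbytes),
        )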
This removes the last traces of old boto from s3 downloaders.
It seems that most of support/s3.py documents the process of creating
test buckets - these functions are not imported elsewhere. Usage of
some functions had been removed when transitioning s3 downloader to
boto3. Judging by git grep, now only get_versioned_url and
add_version_to_url are used elsewhere (the former in local/addurls.py,
the latter in tests).

Functions that aren't used locally (to generate test buckets) or
imported elsewhere are now removed. Generation of test buckets is
migrated to boto3, and based on high-level resources (which are
different from the low-level clients used in the downloaders code, but
more fit for purpose here).

In manual testing, I had to rename the buckets (otherwise, I got 403),
but after doing so I was able to run all the bucket-generating code
under my aws account.

The get_versioned_url function is yet to be reworked and isn't
functional as of this commit - but because it has a lot of logic, the
task is left for a separate commit.
The get_versioned_url function is now updated to boto3. Matching the
S3 downloader, it now uses a boto3 s3 client object for S3
interactions. URL pattern matching is improved to allow (optional)
region codes in the urls. Boto3 also spares us from dealing with
delete markers: they are included in the response, but separately from
versions.
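That separation comes straight from the API: list_object_versions returns versions and delete markers under different keys (sketch with placeholder names):

    response = client.list_object_versions(Bucket=bucket, Prefix=key)
    versions = response.get('Versions', [])             # real object versions
    delete_markers = response.get('DeleteMarkers', [])  # kept apart by boto3
    # only `versions` needs consulting when resolving a versioned URL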
Boto package is no longer maintained (last release is from July 2018,
v.2.49.0) and has been replaced by Boto3. This change of dependencies
follows a series of changes to migrate our code from Boto to Boto3.

This commit also adjusts external_versions to match.
There are two places where AWS can respond with 403 access denied,
causing botocore.exceptions.ClientError:

1. When the S3Authenticator calls head_bucket (note that this would be
a false positive for buckets that have public objects but don't allow
listing; however it follows the logic of "authenticating to a bucket"
that was present in boto2 code).

2. When the S3Downloader calls head_object

These ClientErrors are now caught and raised as DataLad's
AccessDeniedError.

IMHO this leads to less informative error messages, but
AccessDeniedError seems to be expected in other places of the
codebase. The ClientError message would say, e.g., "An error
occurred (403) when calling the HeadObject operation: Forbidden",
indicating whether it came from HeadBucket or HeadObject. To preserve
the error message, at least in debug output, I am using the original
error as the message for AccessDeniedError.

What is certain is that raising AccessDeniedError was necessary to
make test_deny_access pass when "doesn't matter" (used by the test as
a download url) was replaced with an url to an actual S3 bucket or
object that returned 403. The test as it is still fails, because boto3
now tries to validate the url format and raises a different error (in
any case, not equivalent to access denied), which needs to be
addressed in the test itself, with next commit.
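A sketch of the conversion described above (assuming DataLad's AccessDeniedError lives in datalad.support.exceptions; the committed code may inspect the error differently):

    from botocore.exceptions import ClientError
    from datalad.support.exceptions import AccessDeniedError

    try:
        client.head_object(Bucket=bucket, Key=key)
    except ClientError as e:
        # HEAD responses carry no body, so the error code is just '403'/'404'
        if e.response['Error']['Code'] == '403':
            # reuse the original boto3 message so debug output stays useful
            raise AccessDeniedError(str(e))
        raise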
Although test_deny_access patched downloader._download() to raise
AccessDeniedError, the patch was never triggered!

downloader.download() calls downloader.access(), which first calls
downloader.establish_session() and only then, if successful,
downloader._download(). However, with url="doesn't matter", in boto2
code it was within downloader.establish_session() that authentication
was attempted, a DownloadError was (eventually) produced, and the
assertion passed.

After updating to boto3, downloader.establish_session() would raise a
different error, "Invalid bucket name".

Because _download() was patched and the url was "doesn't matter", it
seems to me that the intention of the test was to check error
conversion coming from a download attempt, not a connection attempt
with a malformed url. Hence, I'm also patching
downloader.establish_session to do nothing and return False
(indicating things went well and the session is not reused).

However, an ideal test should try downloading e.g. a non-public object
from a public bucket. I understand that we don't have those in the
existing datalad-* test buckets. I made one ad-hoc, used its url, and
the test passed with patching removed. However, I don't plan to keep
test buckets owned by my account perpetually, so not using those.

In summary - for migration to boto3 I tried keeping the test close to
what I believe it was, even though it has more to do with
BaseDownloader than S3Downloader behaviour. TBF, I'd also be happy to
retire this test.
@mslw mslw added the semver-minor Increment the minor version when merged label Apr 6, 2024
codecov bot commented Apr 6, 2024

Codecov Report

Attention: Patch coverage is 48.64865% with 76 lines in your changes missing coverage. Please review.

Project coverage is 91.33%. Comparing base (35c5492) to head (9a8dd62).
Report is 31 commits behind head on maint.

Files Patch % Lines
datalad/support/s3.py 28.35% 48 Missing ⚠️
datalad/downloaders/s3.py 60.00% 28 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##            maint    #7575      +/-   ##
==========================================
+ Coverage   91.21%   91.33%   +0.11%     
==========================================
  Files         325      325              
  Lines       43442    43345      -97     
  Branches     5785        0    -5785     
==========================================
- Hits        39626    39588      -38     
+ Misses       3801     3757      -44     
+ Partials       15        0      -15     



@yarikoptic (Member) commented:

Thank you @effigies and sorry for the silence -- somehow this awesome PR went behind my consciousness radar.

To test the authentication, we will need #7574.

why? we didn't have any credentials provided there either, so should probably be the same.

…'s logger

This should bring back running the tests in
downloaders/tests/test_s3.py, which were completely skipped on CI and
thus code coverage was low.

Tested in the boto3 codebase that the logger is indeed 'boto3' and not just 'boto':

    ❯ git grep getLogger
    boto3/__init__.py:    logger = logging.getLogger(name)
    boto3/__init__.py:logging.getLogger('boto3').addHandler(NullHandler())
    boto3/dynamodb/table.py:logger = logging.getLogger(__name__)
    boto3/resources/action.py:logger = logging.getLogger(__name__)
    boto3/resources/base.py:logger = logging.getLogger(__name__)
    boto3/resources/collection.py:logger = logging.getLogger(__name__)
    boto3/resources/factory.py:logger = logging.getLogger(__name__)
    boto3/resources/model.py:logger = logging.getLogger(__name__)
    boto3/s3/transfer.py:logger = logging.getLogger(__name__)
    docs/source/guide/error-handling.rst:    logger = logging.getLogger()
    tests/integration/test_s3.py:LOG = logging.getLogger('boto3.tests.integration')
    ❯ git describe
    1.34.99
@effigies (Contributor, Author) commented May 6, 2024

My understanding in Dusseldorf was that these tests, which require credentials, had been running on Travis. I wrote new tests that don't require credentials, and was noting that the authentication won't be tested without credentials.

@yarikoptic (Member) left a review comment:

Initial comments/questions; just started to dig. Had to recall how to get tests to run with credentials ;) it should be

DATALAD_TESTS_CREDENTIALS=system python -m pytest -s -v datalad/downloaders/tests/test_s3.py

if you already have credentials known to the system, e.g. from invoking datalad ls 's3://datalad-test0-versioned/'.

running them gives:

FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt-version2-version1] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=V4Dqhu0QTEtxmvoNkCHGrjVZVomR1Ryo-version2-version1] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad-test0-versioned/2versions-nonversioned1.txt?versionId=null-version1-version2] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 8 differs from originally announced 1
FAILED datalad/downloaders/tests/test_s3.py::test_s3_download_basic[s3://datalad.test1/version1.txt-version1-nothing] - datalad.support.exceptions.UnaccountedDownloadError: Downloaded size 9 differs from originally announced 1

So it seems like somewhere the size of a key is obtained incorrectly.

I have pushed 2 commits

  • to make downloaders/tests/test_s3.py not skip (so we should get some code covered now)
  • to remove a test which kinda no longer applies (it tested removed interfaces), but we might want to bring it back in some fashion

@@ -60,32 +52,46 @@ class S3Authenticator(Authenticator):
     allows_anonymous = True
     DEFAULT_CREDENTIAL_TYPE = 'aws-s3'

-    def __init__(self, *args, host=None, **kwargs):
+    def __init__(self, *args, host=None, region=None, **kwargs):
@yarikoptic (Member) commented:

What was the overall motivation to remove host?

I am not aware of any use case, but I could foresee people potentially using it against some non-AWS S3 servers (minio etc.), which might work...? Meanwhile, I formulated that idea for testing:

It seems that host can be supported via endpoint_url https://boto3.amazonaws.com/v1/documentation/api/latest/guide/s3-example-privatelink.html and https://www.stackhero.io/en/services/MinIO/documentations/Getting-started/Connect-to-MinIO-from-Python even shows an example of specifying it together with the region (maybe the implementation somehow handles it internally)? WDYT?

@effigies (Contributor, Author) replied:

This was removed by @mslw in the previous PR. My guess is because host did not have an obvious counterpart. I think it would be sensible to restore it if endpoint_url is the equivalent concept (modulo some shim?).

@yarikoptic (Member) replied:

Gotcha... insofar it seems that host was just an FQDN (e.g. s3.us-east-2.amazonaws.com), whereas endpoint_url should be a URL and thus have the protocol (https) added... So I will reincarnate host, extend it to start with the protocol (https) if it doesn't have :// in it, and provide it as endpoint_url. Sounds ok'ish?
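Something like this minimal shim would do (hypothetical helper name):

    def host_to_endpoint_url(host):
        # old-style `host` was a bare FQDN; endpoint_url needs a scheme
        if '://' not in host:
            host = 'https://' + host
        return host

    # 's3.us-east-2.amazonaws.com' -> 'https://s3.us-east-2.amazonaws.com'
    client = boto3.client('s3', endpoint_url=host_to_endpoint_url(host))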

@effigies (Contributor, Author) replied:

Seems fine to me.

@yarikoptic (Member) commented:

I think those particular size mismatches are due to the removed

        if size:
-            headers['Range'] = 'bytes=0-%d' % (size - 1)

where we allowed downloading a specific leading portion of the file... smth to get back

…ing of the file

Had to decompose s3_download_kwargs since VersionId sometimes goes along with
Bucket and Key, but sometimes needs to go as ExtraArgs in the new boto3 interface.

Also, apparently there is no Range support for download_fileobj, so as a really
dirty workaround we will just read into memory and save into a file. But getting
the key that way does not engage a Callback -- so fixed that too.

I think such use cases of getting size specified should not be frequent
-- at least with "download-url" we do get to download_fileobj and get a
Callback engaged.

Altogether this seems to address current test cases.
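A hedged sketch of that workaround (placeholder names; the committed code may differ in details):

    params = {'Bucket': bucket, 'Key': key}
    if version_id:
        params['VersionId'] = version_id

    if size is not None:
        # download_fileobj has no Range support, so fetch the leading
        # portion with get_object and buffer it in memory
        response = client.get_object(Range='bytes=0-%d' % (size - 1), **params)
        fileobj.write(response['Body'].read())
    else:
        extra = {'VersionId': version_id} if version_id else None
        client.download_fileobj(bucket, key, fileobj,
                                ExtraArgs=extra, Callback=callback)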
@yarikoptic (Member) commented:

I pushed some changes to make it seemingly work for all test cases we have. But trying it on an HCP case led to nothing downloaded, with the progress bar showing GBps speed ;)
❯ datalad get -J5 .
2024-05-06 23:33:47,125 [WARNING] DueCredit internal failure while running <function _get_active_due at 0x7f5884a4dd00>: ModuleNotFoundError("No module named 'distutils'"). Please report to developers at https://github.com/duecredit/duecredit/issues (utils.py:194)
[ERROR  ] Failed to import duecredit due to Both inactive and active collectors should be provided. Got active=None, inactive=InactiveDueCreditCollector() 
[WARNING] Requested extension 'next' is not available 
[WARNING] Requested extension 'dataverse' is not available 
get(error): 100206.164k_fs_LR.wb.spec (file) [not available; (Note that these git remotes have annex-ignore set: origin)]                                                                                                         
get(error): 100206.ArealDistortion_MSMAll.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.ArealDistortion_FS.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.ArealDistortion_MSMSulc.164k_fs_LR.dscalar.nii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.RefMyelinMap.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.MyelinMap_BC.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.SmoothedMyelinMap.164k_fs_LR.func.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.BA.164k_fs_LR.label.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.curvature.164k_fs_LR.shape.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
get(error): 100206.L.ArealDistortion_FS.164k_fs_LR.shape.gii (file) [not available; (Note that these git remotes have annex-ignore set: origin)]
  [11837 similar messages have been suppressed; disable with datalad.ui.suppress-similar-results=off]
action summary:
  get (error: 11847)
datalad get -J5 .  48.36s user 4.85s system 251% cpu 21.124 total
❯ pwd
/home/yoh/datalad/hcp-openaccess/HCP1200/100206/MNINonLinear

So, something to troubleshoot more, and probably also a reason for us to bring back the host specification which was present originally.

@yarikoptic yarikoptic changed the base branch from maint to master May 7, 2024 13:30
@yarikoptic (Member) commented:

  • changed base to master since this is semver-minor (maint is really for bug fixes).
  • restarted the brew test -- spurious connectivity issue.
  • we can try to resolve the host issue to make it "complete" and then try a little more "by hand", I guess.

Labels
semver-minor Increment the minor version when merged

Successfully merging this pull request may close these issues:

downloaders/s3: migrate from boto (2.x) to boto3

3 participants