Replace awscli v1 with boto3 during environment bootstrapping #1778

Open · wants to merge 2 commits into base: master
20 changes: 16 additions & 4 deletions metaflow/metaflow_environment.py
@@ -89,10 +89,22 @@ def _get_download_code_package_cmd(self, code_package_url, datastore_type):
         It should work silently if everything goes well.
         """
         if datastore_type == "s3":
+            from .plugins.aws.aws_utils import parse_s3_full_path
+
+            bucket, s3_object = parse_s3_full_path(code_package_url)
+            # Explicitly _don't_ use awscli here, because of version incompatibilities
+            # between awscli v1 (installed via pypi) and awscli v2 (installed _not_
+            # through pypi).
+            # Instead, just use boto3 which we already have installed.
             return (
-                '%s -m awscli ${METAFLOW_S3_ENDPOINT_URL:+--endpoint-url=\\"${METAFLOW_S3_ENDPOINT_URL}\\"} '
-                + "s3 cp %s job.tar >/dev/null"
-            ) % (self._python(), code_package_url)
+                '%s -c "import boto3; import os; '
+                "boto3.client('s3', endpoint_url=os.getenv('METAFLOW_S3_ENDPOINT_URL'))"
+                ".download_file('%s', '%s', 'job.tar')\""
+            ) % (
+                self._python(),
+                bucket,
+                s3_object,
+            )
         elif datastore_type == "azure":
             from .plugins.azure.azure_utils import parse_azure_full_path
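For reference, the command string assembled in the `"s3"` branch can be sketched as a standalone function (the function name and sample inputs below are hypothetical; only the string template comes from the diff):

```python
def build_s3_download_cmd(python_bin, bucket, s3_object):
    # Builds a single `python -c` shell command that downloads the code
    # package with boto3 instead of shelling out to awscli. boto3 only runs
    # on the remote side, once the command is executed there.
    return (
        '%s -c "import boto3; import os; '
        "boto3.client('s3', endpoint_url=os.getenv('METAFLOW_S3_ENDPOINT_URL'))"
        ".download_file('%s', '%s', 'job.tar')\""
    ) % (python_bin, bucket, s3_object)


cmd = build_s3_download_cmd("python3", "my-bucket", "flows/pkg.tgz")
```

Because `endpoint_url` is read via `os.getenv`, it is simply `None` when `METAFLOW_S3_ENDPOINT_URL` is unset, and boto3 falls back to the default AWS S3 endpoint.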

@@ -121,7 +133,7 @@ def _get_download_code_package_cmd(self, code_package_url, datastore_type):
     def _get_install_dependencies_cmd(self, datastore_type):
         cmds = ["%s -m pip install requests -qqq" % self._python()]
         if datastore_type == "s3":
-            cmds.append("%s -m pip install awscli boto3 -qqq" % self._python())
+            cmds.append("%s -m pip install boto3 -qqq" % self._python())
         elif datastore_type == "azure":
             cmds.append(
                 "%s -m pip install azure-identity azure-storage-blob simple-azure-blob-downloader -qqq"
16 changes: 16 additions & 0 deletions metaflow/plugins/aws/aws_utils.py
@@ -4,6 +4,22 @@
 from metaflow.exception import MetaflowException


+def parse_s3_full_path(s3_uri):
Collaborator:
@trhodeos is there any specific reason to create a utility function? Maybe we can have this snippet live right next to L:91 in metaflow_environment.py; it would help make the code much more readable (avoiding an extra jump to a different file).

Contributor (author):
This mirrors what we do for Azure and GCS. I was just following convention here; happy to pull this inline if you think it's better.

+    from urllib.parse import urlparse
+
+    # <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
+    scheme, netloc, path, _, _, _ = urlparse(s3_uri)
+    assert scheme == "s3"
Collaborator:
Within the Metaflow code base, we have elected not to use assert statements. Also, when an assert fails, the user is left no better off in their understanding of how to fix the issue.

Contributor (author):
This is copied from parse_gcs_full_path. Happy to remove it if we don't want it!

+    assert netloc is not None
+
+    bucket = netloc
+    path = path.lstrip("/").rstrip("/")
Collaborator:
Is the rstrip needed?

Contributor (author):
I don't think we need it for this case; I'm just keeping this in line with the other parse functions. We can probably remove it if we inline this.

+    if path == "":
+        path = None
Collaborator:
We don't expect this to be None.

Contributor (author):
Yep, removing it makes sense for this use case. I sound like a broken record at this point, but I just copied this wholesale from parse_gcs_full_path.


+    return bucket, path


 def get_ec2_instance_metadata():
     """
     Fetches the EC2 instance metadata through AWS instance metadata service
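Putting the new helper together: below is a runnable sketch of `parse_s3_full_path` as it appears in the diff, with the reviewers' feedback folded in (explicit errors instead of asserts; a plain `ValueError` stands in for Metaflow's `MetaflowException` so the sketch stays self-contained):

```python
from urllib.parse import urlparse


def parse_s3_full_path(s3_uri):
    # <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    scheme, netloc, path, _, _, _ = urlparse(s3_uri)
    if scheme != "s3":
        # Tell the user what went wrong instead of a bare assert.
        raise ValueError("Expected an s3:// URI, got scheme %r in %r" % (scheme, s3_uri))
    if not netloc:
        raise ValueError("Missing bucket name in S3 URI %r" % s3_uri)

    bucket = netloc
    path = path.lstrip("/").rstrip("/")
    if path == "":
        path = None

    return bucket, path
```

For example, `parse_s3_full_path("s3://my-bucket/flows/pkg.tgz")` returns `("my-bucket", "flows/pkg.tgz")`, which the new download command then passes to `boto3`'s `download_file(bucket, key, filename)`.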