Speed up mast.Observations.get_cloud_uris()
#2145
base: main
Changes from all commits
8710d25
39d7e4e
8d87d84
0a30aa2
e4cf69a
42a4642
493e467
b865ea0
f1e5852
1e9750b
4c53a19
Original file line number | Diff line number | Diff line change | ||
---|---|---|---|---|
|
@@ -109,30 +109,14 @@ | |||
found in the cloud, None is returned. | ||||
""" | ||||
|
||||
s3_client = self.boto3.client('s3', config=self.config) | ||||
|
||||
path = utils.mast_relative_path(data_product["dataURI"]) | ||||
if path is None: | ||||
raise InvalidQueryError("Malformed data uri {}".format(data_product['dataURI'])) | ||||
uri_list = self.get_cloud_uri_list(data_product, include_bucket=include_bucket, full_url=full_url) | ||||
|
||||
if 'galex' in path: | ||||
path = path.lstrip("/mast/") | ||||
# Making sure we got at least 1 URI from the query above. | ||||
if uri_list[0] is None: | ||||
warnings.warn("Unable to locate file {}.".format(data_product), NoResultsWarning) | ||||
else: | ||||
path = path.lstrip("/") | ||||
|
||||
try: | ||||
s3_client.head_object(Bucket=self.pubdata_bucket, Key=path) | ||||
if include_bucket: | ||||
path = "s3://{}/{}".format(self.pubdata_bucket, path) | ||||
elif full_url: | ||||
path = "http://s3.amazonaws.com/{}/{}".format(self.pubdata_bucket, path) | ||||
return path | ||||
except self.botocore.exceptions.ClientError as e: | ||||
if e.response['Error']['Code'] != "404": | ||||
raise | ||||
|
||||
warnings.warn("Unable to locate file {}.".format(data_product['productFilename']), NoResultsWarning) | ||||
return None | ||||
# Output from ``get_cloud_uri_list`` is always a list even when it's only 1 URI | ||||
return uri_list[0] | ||||
|
||||
def get_cloud_uri_list(self, data_products, include_bucket=True, full_url=False): | ||||
""" | ||||
|
@@ -156,8 +140,33 @@ | |||
List of URIs generated from the data products, list may contain entries that are None | ||||
if data_products includes products not found in the cloud. | ||||
""" | ||||
s3_client = self.boto3.client('s3', config=self.config) | ||||
|
||||
return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products] | ||||
paths = utils.mast_relative_path(data_products["dataURI"]) | ||||
This will break GALEX data retrieval from the cloud, unfortunately. See how I tweaked this output in a recent PR, following the new GALEX availability on the cloud: astroquery/astroquery/mast/cloud.py, Line 119 in 2b3d954
Your two options would be to either:
Re this suggestion: in the cloud WG we're working on generalizing these methods and moving them out of mast. So a tweak at the place of usage, or a generically usable kwarg, would be preferable to a hard-wired conditional on mast-specific strings.
What I did was move the |
||||
if isinstance(paths, str): # Handle the case where only one product was requested | ||||
paths = [paths] | ||||
|
||||
uri_list = [] | ||||
for path in paths: | ||||
if path is None: | ||||
uri_list.append(None) | ||||
bsipocz marked this conversation as resolved.
|
||||
else: | ||||
try: | ||||
# Use `head_object` to verify that the product is available on S3 (not all products are) | ||||
s3_client.head_object(Bucket=self.pubdata_bucket, Key=path) | ||||
if include_bucket: | ||||
s3_path = "s3://{}/{}".format(self.pubdata_bucket, path) | ||||
uri_list.append(s3_path) | ||||
elif full_url: | ||||
path = "http://s3.amazonaws.com/{}/{}".format(self.pubdata_bucket, path) | ||||
uri_list.append(path) | ||||
except self.botocore.exceptions.ClientError as e: | ||||
if e.response['Error']['Code'] != "404": | ||||
raise | ||||
warnings.warn("Unable to locate file {}.".format(path), NoResultsWarning) | ||||
uri_list.append(None) | ||||
|
||||
return uri_list | ||||
|
||||
def download_file(self, data_product, local_path, cache=True): | ||||
""" | ||||
|
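The GALEX discussion in this diff turns on how the leading prefix is removed from the looked-up path. Below is a minimal sketch of the reviewer's suggestion of a generic keyword argument instead of a hard-wired conditional. The helper `strip_prefix` and its `prefix` parameter are hypothetical names, not part of astroquery, and the paths are made up for illustration. The last lines also show why `str.lstrip` is fragile for this job: it strips a set of characters, not a literal prefix.

```python
def strip_prefix(path, prefix="/"):
    """Return ``path`` without ``prefix`` if it starts with it.

    Hypothetical helper for illustration; not part of astroquery.
    """
    if prefix and path.startswith(prefix):
        return path[len(prefix):]
    return path


# Made-up paths, for illustration only.
galex_path = "/mast/galex/example/file.fits"
other_path = "/hst/example/file.fits"

# The caller decides which prefix to drop, so the helper itself
# contains no mission-specific strings.
galex_key = strip_prefix(galex_path, prefix="/mast/")
other_key = strip_prefix(other_path)

# Pitfall: lstrip("/mast/") removes any leading run of the characters
# '/', 'm', 'a', 's', 't', so it can eat into the next path component.
pitfall = "/mast/stars/file.fits".lstrip("/mast/")

print(galex_key)  # galex/example/file.fits
print(other_key)  # hst/example/file.fits
print(pitfall)    # rs/file.fits
```

On Python 3.9 and later, `str.removeprefix` does the same job as this helper. Unlike `path.lstrip("/")`, the sketch removes only a single leading slash.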
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -155,22 +155,52 @@ | |
|
||
def mast_relative_path(mast_uri): | ||
""" | ||
Given a MAST dataURI, return the associated relative path. | ||
Given one or more MAST dataURI(s), return the associated relative path(s). | ||
|
||
Parameters | ||
---------- | ||
mast_uri : str | ||
The MAST uri. | ||
mast_uri : str, list of str | ||
The MAST uri(s). | ||
|
||
Returns | ||
------- | ||
response : str | ||
The associated relative path. | ||
response : str, list of str | ||
The associated relative path(s). | ||
""" | ||
|
||
response = _simple_request("https://mast.stsci.edu/api/v0.1/path_lookup/", | ||
{"uri": mast_uri}) | ||
result = response.json() | ||
uri_result = result.get(mast_uri) | ||
|
||
return uri_result["path"] | ||
if isinstance(mast_uri, str): | ||
uri_list = [("uri", mast_uri)] | ||
else: # mast_uri parameter is a list | ||
uri_list = [("uri", uri) for uri in mast_uri] | ||
|
||
# Split the list into chunks of 50 URIs; this is necessary | ||
# to avoid "414 Client Error: Request-URI Too Large". | ||
uri_list_chunks = list(_split_list_into_chunks(uri_list, chunk_size=50)) | ||
|
||
result = [] | ||
for chunk in uri_list_chunks: | ||
response = _simple_request("https://mast.stsci.edu/api/v0.1/path_lookup/", | ||
{"uri": chunk}) | ||
json_response = response.json() | ||
|
||
for uri in chunk: | ||
# Chunk is a list of tuples where the tuple is | ||
# ("uri", "/path/to/product") | ||
# so we index for path (index=1) | ||
path = json_response.get(uri[1])["path"] | ||
I don't see what's expected to be in |
||
if 'galex' in path: | ||
path = path.lstrip("/mast/") | ||
else: | ||
path = path.lstrip("/") | ||
result.append(path) | ||
|
||
# If the input was a single URI string, we return a single string | ||
if isinstance(mast_uri, str): | ||
return result[0] | ||
# Else, return a list of paths | ||
return result | ||
|
||
|
||
def _split_list_into_chunks(input_list, chunk_size): | ||
"""Helper function for `mast_relative_path`.""" | ||
for idx in range(0, len(input_list), chunk_size): | ||
yield input_list[idx:idx + chunk_size] |
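The chunking in the `utils.py` diff above can be tried in isolation. This is a sketch under assumptions: `split_into_chunks` mirrors `_split_list_into_chunks`, the URIs are made-up placeholders, and `urllib.parse.urlencode` is used only to illustrate how a query string grows with the number of URIs. It does not claim to reproduce how `_simple_request` encodes its parameters.

```python
from urllib.parse import urlencode


def split_into_chunks(items, chunk_size):
    """Yield successive slices of ``items`` with at most ``chunk_size`` elements."""
    for idx in range(0, len(items), chunk_size):
        yield items[idx:idx + chunk_size]


# Made-up URIs, for illustration only.
uris = ["mast:EXAMPLE/product_{}.fits".format(i) for i in range(120)]

chunks = list(split_into_chunks(uris, chunk_size=50))
chunk_sizes = [len(chunk) for chunk in chunks]

# One query string per chunk: each request carries many "uri" parameters,
# but its length stays bounded by the chunk size.
query_lengths = [len(urlencode([("uri", uri) for uri in chunk])) for chunk in chunks]

print(chunk_sizes)  # [50, 50, 20]
print(query_lengths)
```

With 120 URIs and a chunk size of 50, three requests replace 120 single-URI requests, while each request URL stays short enough to avoid the "414 Request-URI Too Large" error mentioned in the diff.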
Why only the first element? What is expected to be included in the rest?

The output of `get_cloud_uri_list` is always a list, even when it contains only 1 URI, so the sole element gets indexed out before being returned. I just committed a comment above this line for clarity.

It's indeed a good way to handle one-element lists. I wonder what happens when there is more than one element: would the rest just be ignored?

There wouldn't be a case where a request for a single cloud URI would return more than one, because of L171 in `util.py` of this branch, but I can raise an AssertionError here to ensure that there is only 1 element in the list. What do you think?

I'm really going into bikeshedding now, but if that's the case, why does it have to be a list to begin with?
Also, L173 is never reached?
Maybe just a comment about this would be enough, no need for the assertion.

Sorry, I'm just revisiting this branch after a while. What assertion are you referring to? I added a comment above L118 in this branch explaining why the zero-indexing has to happen. It's because `uri_list` is declared as a list in `get_cloud_uri_list`, which `get_cloud_uri` wraps. Everything is wrapped around `get_cloud_uri_list` because that method calls `utils.mast_relative_path`, which has the new chunking functionality that Geert implemented. That speeds up the call by making 1 request for multiple URIs rather than 1 request per URI.
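The wrapping pattern discussed in this thread can be sketched without any network access. The names `lookup_many` and `lookup_one` and the `available` set are hypothetical stand-ins for `get_cloud_uri_list`, `get_cloud_uri`, and the bucket contents. The sketch only illustrates why the single-item wrapper indexes element 0; it is not the astroquery implementation.

```python
def lookup_many(keys, available):
    """Batch lookup that always returns a list, one entry per key.

    Keys missing from ``available`` map to None. Hypothetical stand-in
    for a batch method such as ``get_cloud_uri_list``.
    """
    if isinstance(keys, str):  # accept a single key for convenience
        keys = [keys]
    return [key if key in available else None for key in keys]


def lookup_one(key, available):
    """Single-item wrapper around the batch function."""
    results = lookup_many([key], available)
    # One input key yields exactly one output entry, so index 0 is
    # the whole result and nothing is discarded.
    return results[0]


available = {"a.fits", "b.fits"}  # made-up contents

print(lookup_many(["a.fits", "missing.fits"], available))  # ['a.fits', None]
print(lookup_one("b.fits", available))                     # b.fits
print(lookup_one("missing.fits", available))               # None
```

Keeping the batch function as the single code path means any speed-up there, such as chunked requests, benefits the single-item call as well.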