Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up mast.Observations.get_cloud_uris() #2145

Open
wants to merge 11 commits into
base: main
Choose a base branch
from
6 changes: 6 additions & 0 deletions CHANGES.rst
Original file line number Diff line number Diff line change
Expand Up @@ -242,6 +242,12 @@ xmatch
- Minor internal change to use VOTable as the response format that include
units, too. [#1375]

mast
^^^^

- Increased the speed of ``Observations.get_cloud_uris`` by obtaining multiple
URIs from MAST at once. [#2145]


Infrastructure, Utility and Other Changes and Additions
-------------------------------------------------------
Expand Down
55 changes: 32 additions & 23 deletions astroquery/mast/cloud.py
Original file line number Diff line number Diff line change
Expand Up @@ -109,30 +109,14 @@
found in the cloud, None is returned.
"""

s3_client = self.boto3.client('s3', config=self.config)

path = utils.mast_relative_path(data_product["dataURI"])
if path is None:
raise InvalidQueryError("Malformed data uri {}".format(data_product['dataURI']))
uri_list = self.get_cloud_uri_list(data_product, include_bucket=include_bucket, full_url=full_url)

Check warning on line 112 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L112

Added line #L112 was not covered by tests

if 'galex' in path:
path = path.lstrip("/mast/")
# Making sure we got at least 1 URI from the query above.
if uri_list[0] == None:
warnings.warn("Unable to locate file {}.".format(data_product), NoResultsWarning)

Check warning on line 116 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L115-L116

Added lines #L115 - L116 were not covered by tests
else:
path = path.lstrip("/")

try:
s3_client.head_object(Bucket=self.pubdata_bucket, Key=path)
if include_bucket:
path = "s3://{}/{}".format(self.pubdata_bucket, path)
elif full_url:
path = "http://s3.amazonaws.com/{}/{}".format(self.pubdata_bucket, path)
return path
except self.botocore.exceptions.ClientError as e:
if e.response['Error']['Code'] != "404":
raise

warnings.warn("Unable to locate file {}.".format(data_product['productFilename']), NoResultsWarning)
return None
# Output from ``get_cloud_uri_list`` is always a list even when it's only 1 URI
return uri_list[0]

Check warning on line 119 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L119

Added line #L119 was not covered by tests
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why only the first element, what is expected to be included in the rest?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output for get_cloud_uri_list is always a list, even when the output returns only 1 URI, so the sole element gets indexed out before being returned. I just committed some commentary above this line for clarity.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's indeed a good way to handle the one element lists. I wonder what happens when there are more than one, would the rest just be ignored?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There wouldn't be a case where a request for a single cloud URI would return more than one because of L171 in util.py of this branch, but I can raise an AssertionError here to ensure that there is only 1 element in the list, what do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really going into bikeshedding now, but if that's the case, why does it have to be a list to begin with?

Also, L173 is never reached?

Maybe just a comment about this would be enough, no need for the assertion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, just revisiting this branch in a while, what assertion are you referring to? I added a comment above L118 in this branch explaining why the zero-indexing has to happen. It's because uri_list is declared as a list in get_cloud_uri_list which get_cloud_uri wraps around. Everything is wrapped around get_cloud_uri_list because that method calls utils.mast_relative_path which has the new chunking functionality that Geert implemented, which speeds up the call by making 1 request for multiple URIs rather than 1 request per URI.


def get_cloud_uri_list(self, data_products, include_bucket=True, full_url=False):
"""
Expand All @@ -156,8 +140,33 @@
List of URIs generated from the data products, list way contain entries that are None
if data_products includes products not found in the cloud.
"""
s3_client = self.boto3.client('s3', config=self.config)

Check warning on line 143 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L143

Added line #L143 was not covered by tests

return [self.get_cloud_uri(product, include_bucket, full_url) for product in data_products]
paths = utils.mast_relative_path(data_products["dataURI"])
Copy link
Contributor

@jaymedina jaymedina Mar 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will break for GALEX data retrieval from the cloud, unfortunately. See here for how I tweaked this output in a recent PR following the new GALEX availability on the cloud:

if 'galex' in path:

Your two options would be to either:

  1. Make a tweak similar to this in get_cloud_uri_list
  2. Make the tweak in mast_relative_path - this is probably the cleanest option because then we can remove my tweak from cloud.py.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re this suggestion: in the cloud WG we're working on generalizing these methods and move them out from mast. So a tweak at the place of usage, or a generically usable kwarg would be preferable rather than a hard-wired conditional on mast specific strings.

Copy link
Contributor

@jaymedina jaymedina Jul 11, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I did was move the path.strip() logic out of cloud.py and into util.py's mast_relative_path function. In other words, I essentially went with option 2. listed above. And because this is a MAST-specific function in utils, it makes sense for MAST-specific tweaks to take place in there. Let me know if that works for you!

if isinstance(paths, str): # Handle the case where only one product was requested
paths = [paths]

Check warning on line 147 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L145-L147

Added lines #L145 - L147 were not covered by tests

uri_list = []
for path in paths:
if path is None:
uri_list.append(None)

Check warning on line 152 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L149-L152

Added lines #L149 - L152 were not covered by tests
bsipocz marked this conversation as resolved.
Show resolved Hide resolved
else:
try:

Check warning on line 154 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L154

Added line #L154 was not covered by tests
# Use `head_object` to verify that the product is available on S3 (not all products are)
s3_client.head_object(Bucket=self.pubdata_bucket, Key=path)
if include_bucket:
s3_path = "s3://{}/{}".format(self.pubdata_bucket, path)
uri_list.append(s3_path)
elif full_url:
path = "http://s3.amazonaws.com/{}/{}".format(self._pubdata_bucket, path)
uri_list.append(path)
except self.botocore.exceptions.ClientError as e:
if e.response['Error']['Code'] != "404":
raise
warnings.warn("Unable to locate file {}.".format(path), NoResultsWarning)
uri_list.append(None)

Check warning on line 167 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L156-L167

Added lines #L156 - L167 were not covered by tests

return uri_list

Check warning on line 169 in astroquery/mast/cloud.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/cloud.py#L169

Added line #L169 was not covered by tests

def download_file(self, data_product, local_path, cache=True):
"""
Expand Down
10 changes: 6 additions & 4 deletions astroquery/mast/tests/test_mast_remote.py
Original file line number Diff line number Diff line change
Expand Up @@ -403,14 +403,15 @@ def test_get_cloud_uri(self):

assert len(uri) > 0, f'Product for OBSID {test_obs_id} was not found in the cloud.'

def test_get_cloud_uris(self):
@pytest.mark.parametrize("test_obs_id", ["25568122", "31411"])
def test_get_cloud_uris(self, test_obs_id):
pytest.importorskip("boto3")
test_obs_id = '25568122'

# get a product list
products = mast.Observations.get_product_list(test_obs_id)[24:]
index = 24 if test_obs_id == "25568122" else 0
products = mast.Observations.get_product_list(test_obs_id)[index:]

assert len(products) > 0, (f'No products found for OBSID {test_obs_id}. '
assert len(products) > 0, (f'No products found for OBSID {test_obs_id}.'
'Unable to move forward with getting URIs from the cloud.')

# enable access to public AWS S3 bucket
Expand All @@ -427,6 +428,7 @@ def test_get_cloud_uris(self):

# query functions
def test_catalogs_query_region_async(self):

responses = mast.Catalogs.query_region_async("158.47924 -7.30962", catalog="Galex")
assert isinstance(responses, list)

Expand Down
54 changes: 42 additions & 12 deletions astroquery/mast/utils.py
Original file line number Diff line number Diff line change
Expand Up @@ -155,22 +155,52 @@

def mast_relative_path(mast_uri):
"""
Given a MAST dataURI, return the associated relative path.
Given one or more MAST dataURI(s), return the associated relative path(s).

Parameters
----------
mast_uri : str
The MAST uri.
mast_uri : str, list of str
The MAST uri(s).

Returns
-------
response : str
The associated relative path.
response : str, list of str
The associated relative path(s).
"""

response = _simple_request("https://mast.stsci.edu/api/v0.1/path_lookup/",
{"uri": mast_uri})
result = response.json()
uri_result = result.get(mast_uri)

return uri_result["path"]
if isinstance(mast_uri, str):
uri_list = [("uri", mast_uri)]
else: # mast_uri parameter is a list
uri_list = [("uri", uri) for uri in mast_uri]

Check warning on line 174 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L173-L174

Added lines #L173 - L174 were not covered by tests
# Split the list into chunks of 50 URIs; this is necessary
# to avoid "414 Client Error: Request-URI Too Large".

Check warning on line 176 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L176

Added line #L176 was not covered by tests
uri_list_chunks = list(_split_list_into_chunks(uri_list, chunk_size=50))

result = []
for chunk in uri_list_chunks:

Check warning on line 180 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L180

Added line #L180 was not covered by tests
response = _simple_request("https://mast.stsci.edu/api/v0.1/path_lookup/",
{"uri": chunk})
json_response = response.json()

Check warning on line 184 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L182-L184

Added lines #L182 - L184 were not covered by tests
for uri in chunk:
# Chunk is a list of tuples where the tuple is

Check warning on line 186 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L186

Added line #L186 was not covered by tests
# ("uri", "/path/to/product")
# so we index for path (index=1)

Check warning on line 188 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L188

Added line #L188 was not covered by tests
path = json_response.get(uri[1])["path"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see what's expected to be in uri here, could you explain it a bit, so I see why the second element?

if 'galex' in path:
path = path.lstrip("/mast/")
else:
path = path.lstrip("/")
result.append(path)

Check warning on line 194 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L192-L194

Added lines #L192 - L194 were not covered by tests

# If the input was a single URI string, we return a single string
if isinstance(mast_uri, str):

Check warning on line 197 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L196-L197

Added lines #L196 - L197 were not covered by tests
return result[0]
# Else, return a list of paths
return result

Check warning on line 201 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L200-L201

Added lines #L200 - L201 were not covered by tests

def _split_list_into_chunks(input_list, chunk_size):

Check warning on line 203 in astroquery/mast/utils.py

View check run for this annotation

Codecov / codecov/patch

astroquery/mast/utils.py#L203

Added line #L203 was not covered by tests
"""Helper function for `mast_relative_path`."""
for idx in range(0, len(input_list), chunk_size):
yield input_list[idx:idx + chunk_size]