Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stack Contains NaT in Time Field #235

Open
J6767 opened this issue Dec 12, 2023 · 5 comments
Open

Stack Contains NaT in Time Field #235

J6767 opened this issue Dec 12, 2023 · 5 comments

Comments

@J6767
Copy link

J6767 commented Dec 12, 2023

I am having a problem with missing time values in my stack. Reproduce as follows:

from pystac_client import Client
import stackstac

def pystac_query(max_items, time_range, lon, lat, max_cloud):
    client = Client.open("https://earth-search.aws.element84.com/v1")

    query = client.search(
        max_items=max_items,
        collections=['sentinel-2-l1c'],
        query={"eo:cloud_cover": {"lt": max_cloud}},
        datetime = time_range,
        intersects=dict(type="Point", coordinates=[lon, lat]),
    )
    items = query.item_collection()
    return query, items

lat = 31.77877
lon = 5.99571
time_range =  '2022-08-04/2022-09-06'
query, items = pystac_query(99, time_range, lon, lat, 99)

The query contains 28 items.

for i in range(len(items)):
    t = items[i].properties["datetime"]

All of the items have times. All times are 27 char strings, apart from "2022-09-03T10:32:05Z" - is this the source of the bug?

However, when I stack them:

stack = stackstac.stack(items, epsg=4326)

...one of the items has its time replaced with a NaT.

stack.time()

array(['2022-08-04T10:32:05.634000000', '2022-08-04T10:32:08.099000000',
       '2022-08-06T10:22:16.627000000', '2022-08-06T10:22:18.455000000',
       '2022-08-09T10:32:12.888000000', '2022-08-09T10:32:15.351000000',
       '2022-08-11T10:22:08.959000000', '2022-08-11T10:22:10.785000000',
       '2022-08-14T10:32:04.138000000', '2022-08-14T10:32:06.617000000',
       '2022-08-16T10:22:17.889000000', '2022-08-16T10:22:19.720000000',
       '2022-08-19T10:32:13.767000000', '2022-08-19T10:32:16.217000000',
       '2022-08-21T10:22:06.588000000', '2022-08-21T10:22:08.412000000',
       '2022-08-24T10:32:01.411000000', '2022-08-24T10:32:03.910000000',
       '2022-08-26T10:22:20.430000000', '2022-08-26T10:22:21.453000000',
       '2022-08-29T10:32:13.577000000', '2022-08-29T10:32:16.024000000',
       '2022-08-31T10:22:06.328000000', '2022-08-31T10:22:08.153000000',
       '2022-09-03T10:32:02.517000000',                           'NaT',
       '2022-09-05T10:22:19.490000000', '2022-09-05T10:22:20.519000000'],
      dtype='datetime64[ns]')

This is causing an error when I try to crop the stack to my AOI. What is a suitable workaround for this? Many thanks.

@clausmichele
Copy link

You can drop them before trying to crop to the AOI like this:

https://github.com/Open-EO/openeo-processes-dask/blob/8d9e3fb009676a39c3ce5009bc1c09909e5bbfb3/openeo_processes_dask/process_implementations/cubes/_filter.py#L82

@gjoseph92
Copy link
Owner

@clausmichele is right; a workaround for now is to drop the missing item. Or you could manually fix the input datetime properties to all be in the same format.

However, I consider this a stackstac bug. The problem indeed is with that one datetime near the end that doesn't have fractional seconds:

 '2022-09-03T10:32:02.517000Z',
 '2022-09-03T10:32:05Z',
 '2022-09-05T10:22:19.490000Z',
 '2022-09-05T10:22:20.519000Z']

The STAC spec just says that datetimes should be formatted according to RFC 3339, section 5.6. We can see there that time-secfrac is an optional part of partial-time, so this is still a valid datetime and should be parsed correctly.

I found that the issue seems to be related to the errors="coerce" argument to pd.to_datetime here, along with infer_datetime_format:

times = pd.to_datetime(
[item["properties"]["datetime"] for item in items],
infer_datetime_format=True,
errors="coerce",
)

Weirdly, with pandas 1.3.5, using errors='raise' doesn't raise any error, but seems to parse correctly:

>>> ts = sorted(item.properties["datetime"] for item in items)

>>> pd.to_datetime(ts, infer_datetime_format=True, errors='raise')
DatetimeIndex(['2022-08-04 10:32:05.634000+00:00',
               ...
               '2022-09-03 10:32:02.517000+00:00',
                      '2022-09-03 10:32:05+00:00',
               '2022-09-05 10:22:19.490000+00:00',
               '2022-09-05 10:22:20.519000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

>>> pd.to_datetime(ts, infer_datetime_format=True, errors='coerce')
DatetimeIndex(['2022-08-04 10:32:05.634000', '2022-08-04 10:32:08.099000',
               '2022-08-06 10:22:16.627000', '2022-08-06 10:22:18.455000',
               '2022-08-09 10:32:12.888000', '2022-08-09 10:32:15.351000',
               '2022-08-11 10:22:08.959000', '2022-08-11 10:22:10.785000',
               '2022-08-14 10:32:04.138000', '2022-08-14 10:32:06.617000',
               '2022-08-16 10:22:17.889000', '2022-08-16 10:22:19.720000',
               '2022-08-19 10:32:13.767000', '2022-08-19 10:32:16.217000',
               '2022-08-21 10:22:06.588000', '2022-08-21 10:22:08.412000',
               '2022-08-24 10:32:01.411000', '2022-08-24 10:32:03.910000',
               '2022-08-26 10:22:20.430000', '2022-08-26 10:22:21.453000',
               '2022-08-29 10:32:13.577000', '2022-08-29 10:32:16.024000',
               '2022-08-31 10:22:06.328000', '2022-08-31 10:22:08.153000',
               '2022-09-03 10:32:02.517000',                        'NaT',
               '2022-09-05 10:22:19.490000', '2022-09-05 10:22:20.519000'],
              dtype='datetime64[ns]', freq=None)

However, after switching to pandas 2, we do get a helpful and informative error:

>>> pd.__version__
2.0.3

>>> In [7]: pd.to_datetime(ts, errors='raise')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-7-0ac50347e0b9> in <cell line: 1>()
----> 1 pd.to_datetime(ts, errors='raise')
...
ValueError: time data "2022-09-03T10:32:05Z" doesn't match format "%Y-%m-%dT%H:%M:%S.%f%z", at position 25. You might want to try:
    - passing `format` if your strings have a consistent format;
    - passing `format='ISO8601'` if your strings are all ISO8601 but not necessarily in exactly the same format;
    - passing `format='mixed'`, and the format will be inferred for each element individually. You might want to use `dayfirst` alongside this.

This makes sense: there are multiple datetime formats present, so infer_datetime_format (which became True by default in pandas 2) rightfully complains about this. The fact that it didn't raise an error in older versions was probably a bug.

Given that the STAC spec indicates all datetimes should be a subset of ISO 8601, format='ISO8601' seems like just what we want:

>>> pd.to_datetime(ts, errors='raise', format='ISO8601')
DatetimeIndex(['2022-08-04 10:32:05.634000+00:00',
               ...
               '2022-08-31 10:22:08.153000+00:00',
               '2022-09-03 10:32:02.517000+00:00',
                      '2022-09-03 10:32:05+00:00',
               '2022-09-05 10:22:19.490000+00:00',
               '2022-09-05 10:22:20.519000+00:00'],
              dtype='datetime64[ns, UTC]', freq=None)

format='ISO8601' is only available in pandas 2, so that's a compelling reason to require pandas 2, xref #213. (pandas-dev/pandas#41047 is also fixed in newer versions of pandas, allowing us to remove some old code for #2).

I think I'll probably fix this by using format='ISO8601' and requiring pandas 2.

@J6767
Copy link
Author

J6767 commented Dec 18, 2023

Thanks guys, that worked great. stack = stack.where(~np.isnat(stack.time), drop=True)

@J6767 J6767 closed this as completed Dec 18, 2023
@gjoseph92
Copy link
Owner

Glad that works for you, but dropping data doesn't seem like an ideal workaround to me, so I'll reopen this until the underlying problem is fixed.

@gjoseph92 gjoseph92 reopened this Dec 19, 2023
fkroeber added a commit to fkroeber/stackstac that referenced this issue May 17, 2024
@ZZMitch
Copy link

ZZMitch commented May 27, 2024

To provide a bit more context to this bug, I have been using stackstac to document HLS (Landsat 8/9 + Sentinel-2) clear observation availability over Canada and came across these NaT time-steps.

My "fix" has been to drop the NaT time-steps in a similar manner as demonstrated here, and I have documenting how prevalent they are across the HLS archive. Originally, I thought this was a metadata issue within the HLS archive, and only recently realized this was a stackstac bug, otherwise I would have fixed the missing dates, rather than removing them.

For HLS, this impacts Sentinel-2 more than Landsat. Across the couple years of imagery I have documented so far (2018 - 2020), it impacts <0.1% of Landsat images and 0.2 - 1% of Sentinel-2 images depending on the year. It seems to be more prevalent in 2018 for S30 (1%), dropping to 0.2% by 2020.

Most of the time, the NaT time-steps are a couple images here and there (e.g., <5 in a year looking at all available imagery). However, I have noticed a few cases where almost every time-step in a year is NaT.

Here is a basic reproducible example of this case:

lon, lat = -91.4888563, 48.8885808 # All but first S30 will be NaT datetime
#lon, lat = -89.4888563, 48.8885808 # No NaT datetime
 
start = '2019-01-01' # In 2020 or starting on 01-03, no issue
end = '2019-12-31'

import os
import pystac_client as pc

gdalEnv = ss.DEFAULT_GDAL_ENV.updated(dict(
    GDAL_DISABLE_READDIR_ON_OPEN = 'TRUE',
    GDAL_HTTP_COOKIEFILE = os.path.expanduser('~/cookies.txt'),
    GDAL_HTTP_COOKIEJAR = os.path.expanduser('~/cookies.txt'),
    GDAL_HTTP_MAX_RETRY = 10,
    GDAL_HTTP_RETRY_DELAY = 15))

catalog = pc.Client.open('https://cmr.earthdata.nasa.gov/stac/LPCLOUD')

itemsS30 = catalog.search(intersects = dict(type = 'Point', coordinates = [lon, lat]), datetime = f'{start}/{end}', collections = ['HLSS30.v2.0'], limit = 100).item_collection()
#itemsS30 # Can see datetime values in all images

import stackstac as ss

s30 = ss.stack(itemsS30, assets = ['Fmask'], epsg = 32615, resolution = 30, gdal_env = gdalEnv)
#s30 # Will see #NaT date-times here

s30.time.values # All but first are NaT

In this case, the first datetime in the stack is '2019-01-02T17:10:02.000000000'. It appears that since this date time does not have fractional seconds, and it is the first datetime in the stack, all the other datetimes are considered invalid and become NaT. Obviously a much more problematic result than having just one time here and there becoming NaT! These situations are pretty rare (I have noticed it only 3-4 times across the few years of work I have done over Canada so far) - but wanted to mention here now that I know it is a stackstac issue rather than an HLS metadata issue.

When doing NRT work or phenology, for-example, every image counts! So good to document and hopefully fix these edge cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants