download_and_extract failure #34

cifkao · 2020-12-27T09:12:17Z

RemoteDataset(download_and_extract=True) fails if:

- the download directory doesn't exist, or
- the dataset is already extracted and the file permissions don't allow overwriting some of the files.

In the first case, we can simply create the directory (it's only going to be used for this one dataset anyway). In the second case, we can skip extraction if the .muspy.success file exists. Or (to make sure the data is not corrupt) we could just check the files that exist before trying to overwrite them.

salu133445 · 2020-12-27T20:37:05Z

I am not in favor of automatically creating the directory if it does not exist, as typos can be quite common. For example, a typo like a/x/c/d/e can be hard to revert. An interactive prompt would be preferable in this case.

For the second case, the current implementation always downloads, extracts, and overwrites the data. We could add an overwrite argument to prevent repeating this by looking for .muspy.success.

In some cases, the download or the extraction process might be interrupted, so we might need to handle this as well. For example, if the size of the existing archive is incorrect, we should remove it and download it again. Also, if some of the files are already extracted, we could skip them, which can be done by modifying datasets.utils.extract_archive to use TarFile.getnames with some existence checks.
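A minimal sketch of the "skip what is already extracted" idea, using only the standard tarfile module. The helper name extract_missing is hypothetical and this is not MusPy's actual extract_archive. It uses TarFile.getmembers rather than getnames so each member can be passed straight to extract.

```python
import tarfile
from pathlib import Path


def extract_missing(archive_path, root):
    """Extract only the tar members that do not yet exist under root.

    Hypothetical helper for illustration; returns the names it extracted.
    """
    root = Path(root)
    extracted = []
    with tarfile.open(archive_path) as archive:
        for member in archive.getmembers():
            target = root / member.name
            if target.exists():
                # Already on disk (possibly read-only), so leave it alone.
                continue
            archive.extract(member, path=root)
            extracted.append(member.name)
    return extracted
```

Note that an existence check alone cannot tell a complete file from one that was only partially written when extraction was interrupted, so this only avoids the overwrite, not the corruption case.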

cifkao · 2020-12-27T20:46:14Z

> I am not in favor of automatically creating the directory if it does not exist, as typos can be quite common. For example, a typo like a/x/c/d/e can be hard to revert. An interactive prompt would be preferable in this case.

But mkdir(parents=False, exist_ok=True) prevents typos like a/x/c/d/e. To me, it's reasonable to require the parent directory to exist, but not root itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).
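To illustrate the mkdir behaviour described here, a small sketch with pathlib (the function name ensure_root is hypothetical): the dataset directory itself is created if missing, but a missing parent still raises, so a mistyped nested path does not silently create a tree of directories.

```python
from pathlib import Path


def ensure_root(root):
    """Create the dataset directory itself, but never its parents.

    Hypothetical helper for illustration.
    """
    root = Path(root)
    # parents=False: raises FileNotFoundError if the parent is missing,
    #                so a typo like a/x/c/d/e fails instead of being created.
    # exist_ok=True: no error if root already exists as a directory.
    root.mkdir(parents=False, exist_ok=True)
    return root
```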

cifkao · 2020-12-27T20:50:21Z

And the issue I was having with my dataset is that the archive contained some read-only files. So creating the dataset a second time resulted in a PermissionError. Also, I can imagine a dataset to exist in a shared location which is not writable.

cifkao · 2020-12-27T20:52:35Z

> In some cases, the download or the extraction process might be interrupted, so we might need to handle this as well.

If the extraction process is interrupted, .muspy.success should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.
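A sketch of why the marker file covers the interrupted case, assuming the marker is written only after extraction returns without error. The names extract_once and SUCCESS_MARKER are hypothetical; only the file name .muspy.success comes from this thread.

```python
from pathlib import Path

SUCCESS_MARKER = ".muspy.success"  # file name as mentioned in this thread


def extract_once(root, extract):
    """Run extract(root) unless the success marker already exists.

    Hypothetical helper; returns True if extraction actually ran.
    """
    root = Path(root)
    marker = root / SUCCESS_MARKER
    if marker.is_file():
        # Nothing is written here, so a read-only dataset location works.
        return False
    extract(root)
    # Only reached if extract() returned without raising, so an
    # interrupted extraction never leaves a marker behind.
    marker.touch()
    return True
```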

salu133445 · 2020-12-27T21:03:55Z

> But mkdir(parents=False, exist_ok=True) prevents typos like a/x/c/d/e. To me, it's reasonable to require the parent directory to exist, but not root itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).

Yup. Sounds good.

salu133445 · 2020-12-27T21:09:02Z

> If the extraction process is interrupted, .muspy.success should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.

That's right. Totally forgot that.

So what's not done yet is to have a boolean argument overwrite in Dataset.__init__ that is passed to datasets.utils.extract_archive to control whether existing files should be skipped or overwritten.

- Add argument `overwrite` to several functions and methods
- Add argument `verbose` to several functions and methods
- Support sha256 hash check in `datasets.utils.download_url`
- Support xz files in `datasets.utils.extract_archives`

salu133445 · 2021-01-03T15:36:57Z

datasets.utils.download_url now has an overwrite argument, but it's actually quite tricky to add it to datasets.utils.extract_archive.

salu133445 · 2021-01-03T15:44:18Z

Having download_and_extract check for the existence of .muspy.success is not intuitive given its name. Perhaps what would be more handy is an argument like make_sure_exists or download_and_extract="auto" that automatically checks whether the files exist and downloads/extracts them if necessary.

cifkao · 2021-01-03T16:51:42Z

As for download: I think the old behaviour (i.e. what happens now with overwrite=False) was reasonable. What is there to gain by re-downloading the archive if it already exists and the checksum is correct?

Basically, there should be a way (preferably the default one) to create a dataset that:

- doesn't fail unnecessarily (e.g. because of permissions of files that already exist)
- doesn't perform unnecessary long-running tasks (downloading and extracting things that have already been downloaded and extracted)

And I think that checking that an archive is correctly extracted can be an example of an unnecessary long-running task. Especially if the dataset is already converted to the MusPy format and you end up using the converted version, which you also do not check.
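As a rough sketch of the download half of this, assuming the expected SHA-256 digest of the archive is known: the archive is fetched again only if it is missing or its checksum does not match. The names sha256_of and needs_download are hypothetical and not part of MusPy.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_download(archive_path, expected_sha256):
    """Return True only if the archive is missing or its checksum is wrong.

    Hypothetical helper for illustration.
    """
    archive_path = Path(archive_path)
    if not archive_path.is_file():
        return True
    return sha256_of(archive_path) != expected_sha256.lower()
```

Hashing still reads the whole archive once, but that is usually much cheaper than downloading it again.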

salu133445 added the enhancement New feature or request label Dec 27, 2020

salu133445 added a commit that referenced this issue Jan 3, 2021

Create root if not exists (#34)

f92f315

salu133445 added a commit that referenced this issue Jan 15, 2021

Make overwrite default to False (#34)

be4bd7a

salu133445 added a commit that referenced this issue Jan 15, 2021

Check existence before downloading/extracting(#34)

2973343
