Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

download_and_extract failure #34

Open
cifkao opened this issue Dec 27, 2020 · 9 comments
Open

download_and_extract failure #34

cifkao opened this issue Dec 27, 2020 · 9 comments
Labels
enhancement New feature or request

Comments

@cifkao
Copy link
Contributor

cifkao commented Dec 27, 2020

RemoteDataset(download_and_extract=True) fails if:

  • the download directory doesn't exist or
  • the dataset is already extracted and the file permissions don't allow overwriting some of the files.

In the first case, we can simply create the directory (it's only going to be used for this one dataset anyway). In the second case, we can skip extraction if the .muspy.success file exists. Or (to make sure the data is not corrupt) we could just check the files that exist before trying to overwrite them.

@salu133445
Copy link
Owner

I am not in favor of the idea of automatically creating the directory if it does not exist as typo can be quite common. For example, a typo like a/x/c/d/e can be hard to revert. Interactive prompts would be preferable in this case.

For the second case, the current implementation always download, extract and overwrite the data. We could add a overwrite argument to prevent repeating this by looking for .muspy.success.

In some case, we might interrupt the downloading or the extraction process, so we might need to handle this as well. For example, if the existing achieve size is incorrect, then we should remove it and redownload it. Also, if part of the files are already extracted, we could skip them, which can be done by modifying datasets.utils.extract_archive using TarFile.getnames with some existence checks.

@salu133445 salu133445 added the enhancement New feature or request label Dec 27, 2020
@cifkao
Copy link
Contributor Author

cifkao commented Dec 27, 2020

I am not in favor of the idea of automatically creating the directory if it does not exist as typo can be quite common. For example, a typo like a/x/c/d/e can be hard to revert. Interactive prompts would be preferable in this case.

But mkdir(parents=False, exist_ok=True) prevents typos like a/x/c/d/e. To me, it's reasonable to require the parent directory to exist, but not root itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).

@cifkao
Copy link
Contributor Author

cifkao commented Dec 27, 2020

And the issue I was having with my dataset is that the archive contained some read-only files. So creating the dataset a second time resulted in a PermissionError. Also, I can imagine a dataset to exist in a shared location which is not writable.

@cifkao
Copy link
Contributor Author

cifkao commented Dec 27, 2020

In some case, we might interrupt the downloading or the extraction process, so we might need to handle this as well.

If the extraction process is interrupted, .muspy.success should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.

@salu133445
Copy link
Owner

But mkdir(parents=False, exist_ok=True) prevents typos like a/x/c/d/e. To me, it's reasonable to require the parent directory to exist, but not root itself, as it is a directory dedicated to the dataset (which is not expected to exist yet).

Yup. Sounds good.

@salu133445
Copy link
Owner

If the extraction process is interrupted, .muspy.success should not exist, right? And as far as I understand, you are already checking the archive checksum if it exists.

That's right. Totally forgot that.

So what's not done yet is to have a boolean argument overwrite in Dataset.__init__ that passed to datasets.utils.extract_archive to control whether existing files should be skipped or overwritten.

salu133445 added a commit that referenced this issue Jan 3, 2021
salu133445 added a commit that referenced this issue Jan 3, 2021
- Add argument `overwrite` to several functions and methods
- Add argument `verbose` to several functions and methods
- Support sha256 hash check in `datasets.utils.download_url`
- Support xz files in `datasets.utils.extract_archives`
@salu133445
Copy link
Owner

datasets.utils.download_url now has an overwrite argument, but it's actually quite tricky to add it to datasets.utils.extract_archive.

@salu133445
Copy link
Owner

Checking the existence of .muspy.success for download_and_extract is not intuitive enough by its name. Perhaps what would be more handy is an argument like make_sure_exists or download_and_extract="auto" that automatically checks the files if exists and downloads/extracts them if necessary.

@cifkao
Copy link
Contributor Author

cifkao commented Jan 3, 2021

As for download: I think the old behaviour (i.e. what happens now with overwrite=False) was reasonable. What is there to gain by re-downloading the archive if it already exists and the checksum is correct?

Basically, there should be a way (preferably the default one) to create a dataset that:

  1. doesn't fail unnecessarily (e.g. because of permissions of files that already exist)
  2. doesn't perform unnecessary long-running tasks (downloading and extracting things that have already been downloaded and extracted)

And I think that checking that an archive is correctly extracted can be an example of an unnecessary long-running task. Especially if the dataset is already converted to the MusPy format and you end up using the converted version, which you also do not check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants