Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check the header “Content-Disposition” to determine the filename to store #1118

Open
tats-u opened this issue Jan 22, 2018 · 30 comments · May be fixed by #2153
Open

Check the header “Content-Disposition” to determine the filename to store #1118

tats-u opened this issue Jan 22, 2018 · 30 comments · May be fixed by #2153

Comments

@tats-u
Copy link

tats-u commented Jan 22, 2018

For example, the link http://audeering.com/download/1318/ (linked from http://audeering.com/technology/opensmile/#download) provides a Gzip-compressed tarball, but aria2 stored the tarball as the name index.html by the command aria2c -x4 http://audeering.com/download/1318/.
This link returns the header Content-Disposition as attachment; filename="opensmile-2.3.0.tar.gz";, so aria2 must interpret it and store the tarball as the name opensmile-2.3.0.tar.gz.

@Baneeishaque
Copy link

Baneeishaque commented Jan 23, 2018

+1

@tatsuhiro-t
Copy link
Collaborator

aria2 supports content-disposition header field, and the value which server sent is syntactically incorrect. The trailing ";" must not be there. The server is broken.

@kwkam
Copy link
Contributor

kwkam commented Feb 12, 2018

A simple workaround for this:

diff --git a/src/util.cc b/src/util.cc
index be73a6e4..ffa4236b 100644
--- a/src/util.cc
+++ b/src/util.cc
@@ -1555,6 +1555,12 @@ ssize_t parse_content_disposition(char* dest, size_t destlen,
   case CD_AFTER_VALUE:
   case CD_TOKEN:
     return destlen - dlen;
+  case CD_BEFORE_DISPOSITION_PARM_NAME:
+    if ((flags & CD_FILENAME_FOUND) == 0 &&
+        (flags & CD_EXT_FILENAME_FOUND) == 0) {
+      return -1;
+    }
+    return destlen - dlen;
   case CD_VALUE_CHARS:
     if (charset == CD_ENC_UTF8 && dfa_state != UTF8_ACCEPT) {
       return -1;

@Baneeishaque
Copy link

Is the workaround ok?

@ahmedtds
Copy link

ahmedtds commented Oct 8, 2019

well server broken or not it's something happens in the wild and should be handled. wget can recognize all content-disposition replies pretty good but aria2 fails many times

@eddiezato
Copy link

_ttps://drscdn.500px.org/photo/1022280281/q%3D80_m%3D1000/v2?sig=3234f9ae309e3e6a7a31f95ace6754e45c5171a8be579de3ae30a464f239beae

content-disposition: filename=stock-photo-1022280281.jpg

aria2c v1.35 doesn't use the remote name when saving the file, curl does.

@shadowzoom
Copy link

@eddiezato @tats-u @tatsuhiro-t @ahmedtds @Baneeishaque @kwkam
Does aria still can't autorename file from content-disposition?

@tats-u
Copy link
Author

tats-u commented May 16, 2021

@tatsuhiro-t
I have been quiet about your opinion, but I found an evidence that trailing semicolons should be permitted.
https://datatracker.ietf.org/doc/html/rfc2183#section-3

One of examples of RFC2183 is:

Content-Disposition: attachment; filename=genome.jpeg;

This shows that trailing semicolons are officially allowed.

It is unfortunate that the above links are dead, though.

@popugasik
Copy link

@tats-u
Looks like according to RFC 6266 trailing semicolon must be omitted.

RFC 2616 defines the Content-Disposition response header field, but
points out that it is not part of the HTTP/1.1 Standard.

@nikicat
Copy link

nikicat commented Mar 2, 2023

I can't download models from https://huggingface.co due to this bug 😞

@shirooo39
Copy link

shirooo39 commented Mar 19, 2023

I can't download models from https://huggingface.co due to this bug disappointed

you can just wget directly from huggingface. change the /blob/ into /resolve/

@nikicat
Copy link

nikicat commented Mar 19, 2023

you can just wget directly from huggingface. change the /blob/ into /resolve/

Surely I can download using wget or curl, but I want to use aria2c for multipart download.

@tats-u
Copy link
Author

tats-u commented Mar 20, 2023

There's no reason to reject a response from a server just because of its trivial and easily-recoverable syntax "error".
As you know servers SHALL generate correct response though.

@neggles
Copy link

neggles commented Jul 16, 2023

@popugasik said:

@tats-u Looks like according to RFC 6266 trailing semicolon must be omitted.

It says quite the opposite, actually. The grammar is defined as:

     content-disposition = "Content-Disposition" ":"
                            disposition-type *( ";" disposition-parm )

     disposition-type    = "inline" | "attachment" | disp-ext-type
                         ; case-insensitive
     disp-ext-type       = token

     disposition-parm    = filename-parm | disp-ext-parm

Specifically, the content-disposition field's final element is *( ";" disposition-parm ).
Checking the grammar rules in RFC 2616 section 2.1:

   (rule1 rule2)
      Elements enclosed in parentheses are treated as a single element.
      Thus, "(elem (foo | bar) elem)" allows the token sequences "elem
      foo elem" and "elem bar elem".

   *rule
      The character "*" preceding an element indicates repetition. The
      full form is "<n>*<m>element" indicating at least <n> and at most
      <m> occurrences of element. Default values are 0 and infinity so
      that "*(element)" allows any number, including zero; "1*element"
      requires at least one; and "1*2element" allows one or two.

we see that *( ";" disposition-parm ) translates to "Zero or more instances of disposition-parm OR ;".

Absolutely nothing prohibits having a trailing semicolon, it's explicitly allowed. In fact you could have a hundred of them and still be compliant.

Even if RFC6266 did forbid a trailing semicolon - which it doesn't - rejecting the header makes aria2 functionally unusable on a huge percentage of websites, and is a horrible user experience.

HuggingFace is serving these files from AWS S3 via AWS CloudFront. CloudFront/S3 are responsible for these headers. A significant portion of websites serve large files from S3 via CloudFront. Rejecting this RFC-spec-compliant header breaks all downloads from these sites.

It also (at last check) breaks downloads from recent nginx versions using the default out-of-the-box header configuration.

curl gets this right. wget gets this right. python requests gets this right.

Yet this issue has been open for over five years and aria2 is still getting it wrong...


n.b. on top of the above, section 4.3 of RFC 6266 states:

   o  Recipients SHOULD strip or replace character sequences that are
      known to cause confusion both in user interfaces and in filenames,
      such as control characters and leading and trailing whitespace.

I'd argue semicolons fall into that category, though that's probably beside the point.

@fenopa
Copy link

fenopa commented Jul 19, 2023

Aria2 seems to be kind of dead anyway, is there an active aria2 fork out there that doesn't do this kind of nonsense? If not, maybe it's time to fork it now.

@fenopa
Copy link

fenopa commented Aug 15, 2023

nobody willing to start an active fork and fix issues like this?

@fenopa
Copy link

fenopa commented Nov 3, 2023

@tatsuhiro-t you still resist to fix this?

Do you still think curl is wrong, wget is wrong, python is wrong, Amazon is wrong, nginx and several other most popular web servers are wrong but only you are correct?

@fenopa
Copy link

fenopa commented Nov 14, 2023

@tatsuhiro-t while you're active nowadays, can you finally solve this issue before a new version?

@fenopa fenopa mentioned this issue Nov 14, 2023
@jyxjjj
Copy link

jyxjjj commented Nov 24, 2023

@tatsuhiro-t you still resist to fix this?

Do you still think curl is wrong, wget is wrong, python is wrong, Amazon is wrong, nginx and several other most popular web servers are wrong but only you are correct?

Agree, Even it is wrong, Others supported as they can.

@mgrinzPlayer
Copy link

@popugasik said:
(...) It says quite the opposite, actually. The grammar is defined as:

     content-disposition = "Content-Disposition" ":"
                            disposition-type *( ";" disposition-parm )

(...)
we see that *( ";" disposition-parm ) translates to "Zero or more instances of disposition-parm OR ;".

It translates to zero or more instances of ";"disposition-parm, there's no "or"

Anyway, I agree aria2 should handle this situation. Maybe something like this at the beginning of ssize_t parse_content_disposition function (with the hint about wrong syntax):

  if (in[len-1]==';'){
    A2_LOG_NOTICE(fmt("Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing"));
    --len;
  }

Output:

>aria2c-x86_64 "https://huggingface.co/microsoft/Orca-2-13b/resolve/main/pytorch_model-00001-of-00006.bin"

11/25 17:57:26 [NOTICE] Downloading 1 item(s)

11/25 17:57:26 [NOTICE] CUID#7 - Redirecting to https://cdn-lfs-us-1.huggingface.co/repos/a2/6b/a26b589f0cd574f46956be719d27409ed2fde99088cd84b2a8e0403d0de10ab4/48b50d9d6ef77522b210c62ec214dd01295fcf88a3160acce32cae23d033ce1d?response-content-disposition=attachment%3B+filename*%3DUTF-8%27%27pytorch_model-00001-of-00006.bin%3B+filename%3D%22pytorch_model-00001-of-00006.bin%22%3B&response-content-type=application%2Foctet-stream&Expires=1701189679&Policy=eyJTdGF0ZW1lbnQiOlt7IkNvbmRpdGlvbiI6eyJEYXRlTGVzc1RoYW4iOnsiQVdTOkVwb2NoVGltZSI6MTcwMTE4OTY3OX19LCJSZXNvdXJjZSI6Imh0dHBzOi8vY2RuLWxmcy11cy0xLmh1Z2dpbmdmYWNlLmNvL3JlcG9zL2EyLzZiL2EyNmI1ODlmMGNkNTc0ZjQ2OTU2YmU3MTlkMjc0MDllZDJmZGU5OTA4OGNkODRiMmE4ZTA0MDNkMGRlMTBhYjQvNDhiNTBkOWQ2ZWY3NzUyMmIyMTBjNjJlYzIxNGRkMDEyOTVmY2Y4OGEzMTYwYWNjZTMyY2FlMjNkMDMzY2UxZD9yZXNwb25zZS1jb250ZW50LWRpc3Bvc2l0aW9uPSomcmVzcG9uc2UtY29udGVudC10eXBlPSoifV19&Signature=U4omCd1sHHmXoPvqWRCLAUwgfaUsF1gVWJtHDAhnMkgWDT-5NzYycBsVNCLVARFy8k%7ER6RIyWC-0ylvrGb0jW5BWwCIoi7Y007TlHzNX382JVtdWrtSc8yZPVWVwrWiBjrVm5piIT4487rheKUQ8bXxbCnChuBD5ZJIKQ%7E4oSvdVEHgwuhhLfGz9MiaLHC%7E-W85FdVQX3JPBWBgu71yWtOz1qSjNCNRUHGHWTwPcredMfT7jGbdkFiHqfTj4mQFQUXK9-XNRMWiksNvQGrcmB9%7EVf7uisIngiGwqsHwlpX4OmkiyUyiGXlJ%7E9szHI7-6JXi0EXVj2gMqhMbYXhlW8g__&Key-Pair-Id=KCD77M1F0VK2B

11/25 17:57:27 [NOTICE] Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing

11/25 17:57:27 [NOTICE] Allocating disk space. Use --file-allocation=none to disable it. See --file-allocation option in man page for more details.
 *** Download Progress Summary as of Sat Nov 25 17:57:37 2023 ***
====================================================================================================================================================
[#3aa206 0B/9.2GiB(0%) CN:1 DL:0B]
FILE: F:/pytorch_model-00001-of-00006.bin

@fenopa
Copy link

fenopa commented Nov 27, 2023

nobody willing to start an active fork and fix issues like this?

you really want this to be fixed huh? isn't this a fix? just compile it afterwards

I don't know how to compile it but I'm not the only one who needs it, as you can see it affects a lot of people, so a fix in the release would be better.

@fenopa
Copy link

fenopa commented Nov 27, 2023

Anyway, I agree aria2 should handle this situation. Maybe something like this at the beginning of ssize_t parse_content_disposition function (with the hint about wrong syntax):

  if (in[len-1]==';'){
    A2_LOG_NOTICE(fmt("Content-Disposition header field (RFC1806) minor corruption (trailing ';') - fixing"));
    --len;
  }

I don't think it needs to be logged. It's not something the user needs/wants to know and considering all the other major software and web server behaviour, it's most likely not right to call it a "corruption that was fixed"

@mgrinzPlayer
Copy link

mgrinzPlayer commented Nov 28, 2023

@fenopa, similar situation is with 7-Zip. You can find discussions there. Basically, in many cases, Igor refused to open/handle files that doesn't follow specification. Similar to aria2 situation was with WordPress plugin BackWPup generated tar files - GNU tar (and other programs) silently ignored the error, Igor decided to still report error but also unpack all files ( link ).

Here my workflow script file. And here is github-actions release.

@pawamoy
Copy link

pawamoy commented Nov 28, 2023

Maintainers did not explicitly close this issue or stated they would not accept a pull request, so I suppose anyone could send a PR.

@fenopa
Copy link

fenopa commented Nov 28, 2023

@mgrinzPlayer thanks a lot for that build.

@fenopa
Copy link

fenopa commented Dec 8, 2023

@reschke can you put an end to this? who is right?

@reschke
Copy link

reschke commented Dec 8, 2023

The grammar does not allow trailing semicolons.

But that doesn't necessarily mean it would be a problem to allow them.

In any case, I would report site bugs when I'd encounter those.

@neggles
Copy link

neggles commented Dec 10, 2023

The grammar does not allow trailing semicolons.

But that doesn't necessarily mean it would be a problem to allow them.

In any case, I would report site bugs when I'd encounter those.

You and @mgrinzPlayer are right, I misread the grammar (think my annoyed-at-the-time brain inserted a | that wasn't there), but that doesn't change the fact that CloudFront injects a trailing semicolon, therefore a significant chunk of the internet does, and nothing any of us can do will change that.

W.R.T. 7-Zip, Igor's persistent refusal to accept the reality of formats that people are using and want to use has caused just as much strife (if not more) and no doubt had a negative effect on uptake/usage of 7-Zip. It's also not entirely comparable, since in the 7-Zip case we're talking about new, relatively-unknown formats that are not supported by much else in the way of software save for the existing 7-Zip forks.

While in this case, we're talking about something that every other library/client comparable to aria2 handles gracefully, that's widely deployed and accepted across the internet as a whole.

I'd be happy to send a PR similar to the above if I had any confidence that it'd be accepted and not just closed with "fix your noncompliant server" (since as discussed above "fixing cloudfront" is entirely beyond what any of us can do) [Edit:] Eh, money where my mouth is. Will open a draft PR using something like what's suggested above shortly...

@mgrinzPlayer
Copy link

If cloudfront doesn't follow simple "Content-Disposition" ":" disposition-type *( ";" disposition-parm ) ... what other things they messed.

@0wwafa
Copy link

0wwafa commented May 21, 2024

I can't download models from https://huggingface.co due to this bug 😞

same for me... so frustrated I use this now: https://github.com/Zibri/HFdl

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.