Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IRI output #323

Open
ghost opened this issue Aug 20, 2023 · 8 comments
Open

IRI output #323

ghost opened this issue Aug 20, 2023 · 8 comments

Comments

@ghost
Copy link

ghost commented Aug 20, 2023

If I run this command, I get encoded output:

> xh -h https://wiktionary.org/wiki/Ῥόδος
HTTP/2.0 301 Moved Permanently
location: "https://www.wiktionary.org/wiki/\xe1\xbf\xac\xcf\x8c\xce\xb4\xce\xbf\xcf\x82"

but with this Go code:

package main

import (
   "net/http"
   "os"
)

func main() {
   req, _ := http.NewRequest("HEAD", os.Args[1], nil)
   res, _ := new(http.Transport).RoundTrip(req)
   println(res.Header.Get("location"))
}

I get expected output:

> go run http.go https://wiktionary.org/wiki/Ῥόδος
https://www.wiktionary.org/wiki/Ῥόδος

https://wikipedia.org/wiki/Internationalized_Resource_Identifier

@blyxxyz
Copy link
Collaborator

blyxxyz commented Aug 20, 2023

This one's tricky. Historically header values were defined to be ISO-8859-1. But the latest recommendation is this:

A recipient SHOULD treat other octets in field content (obs-text) as opaque data.

And that's how the library we use handles it. It only allows string conversion if it's pure-ASCII.
And otherwise we print it in the form you see there.

HTTPie decodes it as ISO-8859-1, which is wrong in this case:

$ http -h https://wiktionary.org/wiki/Ῥόδος
HTTP/1.1 301 Moved Permanently
[...]
location: https://www.wiktionary.org/wiki/ῬÏδοÏ

And that's also how the dev tools of Firefox and Chromium handle it:
image

So I see a few options:

  • Change nothing, pedantically follow the RFC.
  • Try to decode as UTF-8 (with std::str::from_utf8(value.as_bytes())). Other bytes are somewhat unlikely to be valid UTF-8, so this won't have that many false negatives, hopefully.
  • Try to decode as ISO-8859-1, to match the browser tools and HTTPie.

I'm torn between all three, I don't know what's best.

(The current behavior deserves a mention on #322.)

@ghost
Copy link
Author

ghost commented Aug 20, 2023

since a header line is expected to end in a newline, to me that puts header values in the category of "text data", rather than "binary data". I think it would ultimately be a net positive to at least try to decode the values as UTF-8. if I visit the original link in a browser:

https://wiktionary.org/wiki/Ῥόδος

the values are not escaped in the address bar. also especially if Greek was someones native language, they would probably prefer to hand write the URL without encoding, and view it that way as well. so ideally to me this behavior should persist throughout the stack. however as was keenly pointed out, its not required by the spec, so I will respect if the decision is "do nothing".

@blyxxyz
Copy link
Collaborator

blyxxyz commented Aug 20, 2023

We know it's text, but we don't know the text's encoding (maybe ISO-8859-1, maybe UTF-8, maybe even some other ASCII superset), and it can be dangerous to guess.

It kind of pains me but I now think that decoding as ISO-8859-1 is the way to go. If you send an HTTP request from a browser then this is the result:

> let r = await fetch('https://wiktionary.org/wiki/Ῥόδος', { redirect:'manual' })
> r.headers.get('location')
'https://www.wiktionary.org/wiki/ῬÏ\x8CδοÏ\x82'

At my day job I use xh to test APIs for use in the browser. It can be a big headache if something looks fine on one side and falls apart on the other. It's best not to sweep the brokenness under the rug.

It would still be nice to somehow separately display the common-sense URL that the redirect points to. But I don't want to mislead people into thinking they can reliably extract it from the header.

@ghost
Copy link
Author

ghost commented Aug 20, 2023

ISO-8859-1 is hands down the worst option of the three. the original representation is lost, and the hex values aren't returned either, making that part of the URL essentially meaningless garbage to a human, and only useful to a computer. its a lose lose situation. please reconsider that opinion.

@blyxxyz
Copy link
Collaborator

blyxxyz commented Aug 20, 2023

The one benefit is that you know what to expect if you try to handle the header programmatically. IMO that's very important, maybe important enough to outweigh everything else.

But the current output also hints that something is wrong, so maybe we don't need to go that far.

@ghost
Copy link
Author

ghost commented Aug 20, 2023

not sure if it helps, but I also have my own encoding that might work here:

https://github.com/1268/strconv/blob/main/encode.go

its similar to percent-encoding, but will only encode instances of the escape character (%) or "binary data" defined (by me) as control characters, excluding whitespace. it has the nice property that it can be decoded by a normal percent-encoding function.

@barkoder

This comment was marked as off-topic.

@ghost

This comment was marked as off-topic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants