Backslash to Forward slash correction #199

Open
acrois opened this issue Sep 11, 2021 · 2 comments

Comments

@acrois

acrois commented Sep 11, 2021

Browsing to a URL that contains a backslash, such as https:// possibility.com/Clearing/Images\brnbagbk.gif (GitHub encodes these URLs, so I had to put a space after "https://"), resolves properly in browsers (tested in the latest Google Chrome and Firefox).
The browser automatically replaces the backslashes with forward slashes when it processes the URL.

I am not sure of the implications for storing the response in WARC, or for tools that later need to retrieve the correct URL (the way a browser would correct the request). I can imagine a process where a URL containing a backslash that returns a 404 is retried with the backslashes replaced by forward slashes.
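
To make the retry idea concrete, here is a rough sketch of it. This is not grab-site or wpull code: `fetch_with_backslash_retry` and the `fetch_status` callable are hypothetical names made up for illustration, and the sketch only handles a literal backslash in the URL string (a client that percent-encodes it as `%5C` would need extra handling).

```python
from typing import Callable, Tuple


def fetch_with_backslash_retry(
    url: str, fetch_status: Callable[[str], int]
) -> Tuple[str, int]:
    """Fetch `url`; on a 404, retry once with backslashes turned into slashes.

    `fetch_status` is a hypothetical callable that requests a URL and
    returns its HTTP status code. Returns the URL that was finally
    requested together with its status code.
    """
    status = fetch_status(url)
    if status == 404 and "\\" in url:
        # Mimic what browsers do up front: treat "\" as "/".
        corrected = url.replace("\\", "/")
        return corrected, fetch_status(corrected)
    return url, status
```

With this approach both requests could end up in the WARC (the 404 for the original URL and the response for the corrected one), so whether that is desirable is still an open question.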

From the dashboard, when viewed in Google Chrome:
(two screenshots)

What is the best way to handle this in grab-site?

@TheTechRobo
Contributor

I might send a PR, but I don't really understand the codebase that well. 🤕

@ivan
Contributor

ivan commented Oct 25, 2021

Browsers convert backslashes to forward slashes when they parse hrefs. They also do other odd things there, such as trimming leading and trailing whitespace.

I believe the right place to implement this would be in the href parsing in ludios_wpull.
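
As a rough illustration of the kind of normalization meant here, below is a minimal standalone sketch. It does not use or reflect the actual ludios_wpull API; `normalize_href` and `SPECIAL_SCHEMES` are hypothetical names. It assumes that backslashes should only be rewritten before the query or fragment, and only for relative hrefs or for schemes such as http and https, which is a simplification of what browsers do.

```python
import re

# Assumption: only these schemes get the backslash treatment.
SPECIAL_SCHEMES = {"http", "https", "ftp", "ws", "wss", "file"}

_SCHEME_RE = re.compile(r"([A-Za-z][A-Za-z0-9+.-]*):")


def normalize_href(href: str) -> str:
    """Approximate browser href cleanup (hypothetical helper)."""
    # Trim leading and trailing ASCII whitespace.
    href = href.strip(" \t\n\r\f")

    # Leave hrefs with an explicit non-special scheme (e.g. mailto:) alone.
    scheme = _SCHEME_RE.match(href)
    if scheme and scheme.group(1).lower() not in SPECIAL_SCHEMES:
        return href

    # Rewrite backslashes only before the first "?" or "#".
    cut = re.search(r"[?#]", href)
    end = cut.start() if cut else len(href)
    return href[:end].replace("\\", "/") + href[end:]


print(normalize_href("  https://example.com/Clearing/Images\\brnbagbk.gif "))
```

Doing this at href-parsing time would mean the corrected URL is the one that gets queued and written to the WARC, so no retry would be needed.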
