Some websites block archival process. #561

Open
luigifcruz opened this issue Apr 17, 2024 · 9 comments
Labels
bug Something isn't working

Comments

@luigifcruz

Some websites are inaccessible to the archival process. Not sure if a solution is feasible.

luigifcruz added the bug label Apr 17, 2024
@daniel31x13
Member

This is because some websites prevent automated requests from accessing their content.
There's an upcoming file upload feature which lets users take the screenshot manually.

@luigifcruz
Author

Cool, file uploading is a very welcome feature and will mostly fix this problem.

I'm still a bit curious about how the websites are identifying this as an automated request. It can't be IP flagging because this is using my home IP address. Perhaps User-Agent matching?
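To illustrate the User-Agent theory, here is a minimal sketch of the kind of check a site could run on incoming requests. It is not taken from any particular site: the function name `looksAutomated` and the marker list are hypothetical. The only external fact it relies on is that headless Chromium has historically advertised itself with a `HeadlessChrome` token in its User-Agent string.

```typescript
// Hypothetical server-side filter; real bot detection usually combines many signals.
// Tokens that automated browsers have been known to expose (illustrative list).
const AUTOMATION_MARKERS: RegExp[] = [/HeadlessChrome/i];

// Returns true when the User-Agent header suggests an automated client.
function looksAutomated(userAgent: string | undefined): boolean {
  // Assumption: a missing or empty User-Agent is treated as suspicious.
  if (!userAgent) {
    return true;
  }
  return AUTOMATION_MARKERS.some((marker) => marker.test(userAgent));
}

// Example inputs: a headless-style string and a regular desktop-style string.
const headlessUa =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/123.0.0.0 Safari/537.36";
const desktopUa =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";

console.log(looksAutomated(headlessUa)); // true
console.log(looksAutomated(desktopUa)); // false
```

A check like this is independent of the client's IP address, which would be consistent with requests from a home connection still being blocked. Sites may also inspect browser-side properties, so changing the User-Agent alone is not guaranteed to help.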

@daniel31x13
Member

Oh, if that's the case you can add the following to the env file, it should fix most of the links:

AUTOSCROLL_TIMEOUT=1000 # Amount in seconds (default 30)
NEXT_PUBLIC_MAX_FILE_SIZE=1000 # Amount in MB (default 30)
IGNORE_UNAUTHORIZED_CA=true
IGNORE_HTTPS_ERRORS=true
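As a rough sketch of how flags like these could be turned into archiver settings: Playwright's `browser.newContext()` accepts an `ignoreHTTPSErrors` option, and a boolean env variable maps onto it naturally. Whether Linkwarden wires the variables exactly this way is an assumption; `settingsFromEnv` and `ArchiveSettings` are hypothetical names, and the fallback of 30 seconds simply mirrors the default mentioned in the comment above.

```typescript
// Hypothetical mapping from env variables to archiver settings.
type Env = Record<string, string | undefined>;

interface ArchiveSettings {
  // Could be passed to Playwright as browser.newContext({ ignoreHTTPSErrors }).
  ignoreHTTPSErrors: boolean;
  // How long to keep auto-scrolling a page, in seconds.
  autoscrollTimeoutSeconds: number;
}

function settingsFromEnv(env: Env): ArchiveSettings {
  const timeout = Number(env["AUTOSCROLL_TIMEOUT"]);
  return {
    // Only the literal string "true" enables the flag.
    ignoreHTTPSErrors: env["IGNORE_HTTPS_ERRORS"] === "true",
    // Fall back to 30 seconds when the value is missing or not a positive number.
    autoscrollTimeoutSeconds: isFinite(timeout) && timeout > 0 ? timeout : 30,
  };
}

const example = settingsFromEnv({
  AUTOSCROLL_TIMEOUT: "1000",
  IGNORE_HTTPS_ERRORS: "true",
});
console.log(example);
```

Note that settings of this kind relax certificate checks and timeouts; they do not change how the browser identifies itself to a site.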

@daniel31x13
Member

Also, for the third link, I just merged a hotfix in v2.5.2. Please note that you'll need to add the following to the env file as well: IGNORE_URL_SIZE_LIMIT=true.

@luigifcruz
Author

The patch for the third link does work: the URL is added to the system. But it also breaks the archival of every URL. I noticed a new error popping up:

Linkwarden    | [1] Processing link https://hackaday.com/2024/04/14/hackaday-links-april-14-2024/ for user 1
Linkwarden    | [1] Something went wrong while retrieving the file size.

@luigifcruz
Author

Regarding the other variables, they unfortunately didn't help in preventing the detection of an automated request. 😕

@daniel31x13
Member

> But it also breaks the archival of every URL.

Forgot to add a single line, fixed in v2.5.3.

@luigifcruz
Author

Thanks! I'll test this soon.

Regarding bot detection circumvention: it looks like you are using vanilla Playwright, which makes it very easy for a website to detect an automated request. I tried the Puppeteer Stealth plugin example for Playwright and was able to circumvent bot detection on 2 of the 3 links. Looking at the archival code, very little modification would be necessary to support plugins.
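A minimal sketch of what plugin support could look like, assuming the `playwright-extra` wrapper, whose `chromium` export exposes a `use()` method for registering plugins such as `puppeteer-extra-plugin-stealth`. The helper `withPlugins` and the `PluginHost` type are hypothetical and only illustrate the idea; they are not taken from the Linkwarden code. The helper is written against a structural type so it compiles without either package installed.

```typescript
// Structural stand-in for a launcher that accepts plugins.
// Assumption: in practice this would be `chromium` from "playwright-extra".
interface PluginHost {
  use(plugin: unknown): unknown;
}

// Hypothetical helper: register every plugin before the browser is launched.
function withPlugins<T extends PluginHost>(launcher: T, plugins: unknown[]): T {
  for (const plugin of plugins) {
    launcher.use(plugin);
  }
  return launcher;
}

// Intended wiring (not executed here; it requires the two npm packages):
//
//   import { chromium } from "playwright-extra";
//   import StealthPlugin from "puppeteer-extra-plugin-stealth";
//
//   const browser = await withPlugins(chromium, [StealthPlugin()]).launch();
//   const page = await browser.newPage();
//   await page.goto(url);
```

Because the wrapper keeps the regular Playwright launch and page API, the rest of an archival routine would not need to change, which fits the observation that only a small modification is needed. As noted above, the stealth plugin helped on 2 of the 3 links, so it should be treated as a mitigation rather than a complete fix.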

@daniel31x13
Member

That’s great news!
If you want, make a PR for this and I’ll merge it.
Otherwise I’ll get to it soon.
