Some websites block archival process. #561
Comments
This is because some websites prevent automated requests from accessing their content.
Cool, file uploading is a very welcome feature and will mostly fix this problem. I'm still a bit curious about how the websites identify this as an automated request. It can't be IP flagging, because this is using my home IP address. Perhaps User-Agent matching?
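For illustration, User-Agent matching on the website's side could be as simple as the sketch below. It assumes the site looks for product tokens such as `HeadlessChrome`, which headless Chromium has historically put in its default User-Agent. The marker list and the `looksAutomated` helper are hypothetical, not taken from any particular site or from this project.

```typescript
// Hypothetical marker list: substrings that automated browsers have been
// known to expose in their default User-Agent header.
const AUTOMATION_MARKERS: readonly string[] = ["HeadlessChrome", "PhantomJS"];

// Returns true when the User-Agent header suggests an automated client.
function looksAutomated(userAgent: string | undefined): boolean {
  // A missing or empty User-Agent is treated as suspicious in this sketch.
  if (!userAgent) return true;
  const ua = userAgent.toLowerCase();
  return AUTOMATION_MARKERS.some((marker) => ua.includes(marker.toLowerCase()));
}
```

A check like this is independent of the client's IP address, which would be consistent with the block also happening from a home connection.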
Oh, if that's the case you can add the following to the env file, it should fix most of the links:
|
Also, for the third link, I just merged a hotfix in v2.5.2. Please note that you'll need to add the following to the env file as well:
The patch for the third link indeed works: the URL is added to the system. But it also breaks the archival of every URL. I noticed a new error popping up:
|
Regarding the other variables, they unfortunately didn't help in preventing the detection of an automated request. 😕
I forgot to add a single line; fixed in v2.5.3.
Thanks! I'll test this soon. Regarding the bot detection circumvention, it looks like you are using vanilla Playwright, which makes it very easy for a website to detect an automated request. I tried the Puppeteer Stealth plugin example for Playwright and was able to circumvent bot detection in 2 of the 3 links. Looking at the archival code, very little modification would be necessary to support plugins.
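As a rough idea of what a stealth plugin changes, the sketch below masks two well-known automation signals: the `HeadlessChrome` token in the User-Agent and the `navigator.webdriver` flag. It works on a plain object, not a live browser, and the `BrowserFingerprint` type and `maskAutomation` helper are hypothetical names for illustration. Real plugins patch the running page and cover many more signals.

```typescript
// Hypothetical shape holding only the two properties this sketch touches.
interface BrowserFingerprint {
  userAgent: string;  // value of navigator.userAgent
  webdriver: boolean; // value of navigator.webdriver
}

// Returns a copy of the fingerprint with the two automation signals masked.
function maskAutomation(fp: BrowserFingerprint): BrowserFingerprint {
  return {
    // Rewrite the headless product token so the UA reads like regular Chrome.
    userAgent: fp.userAgent.replace("HeadlessChrome/", "Chrome/"),
    // Browsers under automation report navigator.webdriver as true.
    webdriver: false,
  };
}

const masked = maskAutomation({
  userAgent: "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/120.0.0.0 Safari/537.36",
  webdriver: true,
});
console.log(masked.userAgent, masked.webdriver);
```

Sites that check other properties would still detect the browser, which may be why one of the three links remained blocked.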
That’s great news!
Some websites are inaccessible to the archival process. Not sure if a solution is feasible.