Some websites block archival process. #561

luigifcruz · 2024-04-17T20:34:41Z

Some websites are inaccessible to the archival process. Not sure if a solution is feasible.

https://www.digikey.com/en/products/detail/analog-devices-inc./DC1524A-A/4890577
https://www.trulia.com/building/caspian-260-brooklyn-basin-way-oakland-ca-94606-2710623398
https://www.apartments.com/caspian-oakland-ca/pp0w7es (this one never gets added, it just hangs)
https://indico.fnal.gov/event/46955/contributions/204594/attachments/138539/173497/GPU_Direct_IO_with_HDF5.pdf (can't download the PDF)

daniel31x13 · 2024-04-17T20:44:57Z

This is because some websites prevent automated requests from accessing their content.
An upcoming file upload feature will let users take the screenshot manually and upload it.

luigifcruz · 2024-04-17T20:53:50Z

Cool, file uploading is a very welcome feature and will mostly fix this problem.

I'm still a bit curious about how the websites are identifying this as an automated request. It can't be IP flagging because this is using my home IP address. Perhaps User-Agent matching?
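
For illustration, here is a minimal sketch of what User-Agent matching could look like on the server side. It is not taken from any of the sites listed above: the function name `looksAutomated` and the marker list are hypothetical. The one detail it relies on is that headless Chrome's default User-Agent contains the token `HeadlessChrome`. Real bot detection usually combines many more signals (for example the `navigator.webdriver` flag checked from page JavaScript), so a check like this is only one piece of the picture.

```typescript
// Hypothetical heuristic: flag a request as automated based only on its
// User-Agent header. The marker list is an assumption for illustration.
const AUTOMATION_MARKERS: string[] = [
  "HeadlessChrome", // default token in headless Chrome's User-Agent
  "PhantomJS",
  "python-requests",
  "curl/",
];

function looksAutomated(userAgent: string): boolean {
  // A missing or empty User-Agent is treated as suspicious.
  if (userAgent.trim() === "") {
    return true;
  }
  // Case-sensitive substring match against the known markers.
  return AUTOMATION_MARKERS.some((marker) => userAgent.includes(marker));
}

// Example inputs: a headless Chrome User-Agent and a regular desktop one.
const headlessUA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/123.0.0.0 Safari/537.36";
const desktopUA =
  "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/123.0.0.0 Safari/537.36";

console.log(looksAutomated(headlessUA));
console.log(looksAutomated(desktopUA));
```

If a site used only a check like this, overriding the User-Agent would be enough to get past it; sites that still block the request are likely looking at other signals as well.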

daniel31x13 · 2024-04-17T21:03:16Z

Oh, if that's the case you can add the following to the env file, it should fix most of the links:

AUTOSCROLL_TIMEOUT=1000 # Amount in seconds (default 30)
NEXT_PUBLIC_MAX_FILE_SIZE=1000 # Amount in MB (default 30)
IGNORE_UNAUTHORIZED_CA=true
IGNORE_HTTPS_ERRORS=true

daniel31x13 · 2024-04-17T22:24:37Z

Also, for the third link, I just merged a hotfix in v2.5.2. Note that you'll also need to add the following to the env file: IGNORE_URL_SIZE_LIMIT=true.

luigifcruz · 2024-04-18T03:34:43Z

The patch for the third link does work: the URL is now added to the system. But it also breaks the archival of every URL. I noticed a new error popping up:

Linkwarden    | [1] Processing link https://hackaday.com/2024/04/14/hackaday-links-april-14-2024/ for user 1
Linkwarden    | [1] Something went wrong while retrieving the file size.

luigifcruz · 2024-04-18T03:41:39Z

Regarding the other variables, they unfortunately didn't help in preventing the detection of an automated request. 😕

daniel31x13 · 2024-04-18T10:20:11Z

> But it also breaks the archival of every URL.

Forgot to add a single line, fixed in v2.5.3.

luigifcruz · 2024-04-18T21:08:08Z

Thanks! I'll test this soon.

Regarding bot detection circumvention: it looks like you are using vanilla Playwright, which makes it very easy for a website to detect an automated request. I tried the example from the Puppeteer Stealth plugin for Playwright and was able to circumvent bot detection on 2 of the 3 links. Looking at the archival code, very little modification would be necessary to support plugins.

daniel31x13 · 2024-04-18T21:14:37Z

That’s great news!
If you want, make a PR for this and I’ll merge it.
Otherwise I’ll get to it soon.

luigifcruz added the bug label Apr 17, 2024

daniel31x13 closed this as completed Apr 17, 2024

daniel31x13 reopened this Apr 17, 2024
