Read and cache robots.txt files for each host using thread-local storage #302

Open
ephphatha wants to merge 1 commit into main from robots-txt-local
Conversation

ephphatha

Would be better to share between threads to minimise requests to the same host, but that'll require synchronising read() calls and object initialisation.

This also doesn't respect the request-rate or crawl-delay directives as that requires communicating the last access time per host across the pool. That still needs to be done to properly address #48

This should be merged after #300 to allow hosts to set a policy for this tool. urllib.robotparser expects a simple useragent string and splits on the first /. The current useragent will cause it to only apply the catch-all policy or a policy intended for browsers (User-Agent: Mozilla).
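In outline, the approach looks like the sketch below. This is illustrative rather than the exact diff: thread_context.robots_txt_cache matches the name used in the change, but the function name and surrounding structure here are placeholders.

import threading
import urllib.parse
import urllib.robotparser

thread_context = threading.local()


def is_allowed(user_agent, url):
    """Check url against its host's robots.txt, caching one parser per host per thread."""
    if not hasattr(thread_context, "robots_txt_cache"):
        thread_context.robots_txt_cache = {}
    robots_txt_cache = thread_context.robots_txt_cache

    # "/robots.txt" is root-relative, so urljoin keeps only the scheme/host/port of url.
    robots_url = urllib.parse.urljoin(url, "/robots.txt")
    parser = robots_txt_cache.get(robots_url)
    if parser is None:
        parser = urllib.robotparser.RobotFileParser(robots_url)
        parser.read()  # the stdlib read() has no timeout; see the discussion below
        robots_txt_cache[robots_url] = parser
    return parser.can_fetch(user_agent, url)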

rom1504 (Owner) commented Apr 25, 2023

Pretty short code, looks good!

Could you:

  • Have a look at why the CI is failing (see how to run tests locally in the readme)
  • Run a small benchmark to make sure it doesn't make the tool much slower (there's an example file in the test folder)

ephphatha force-pushed the robots-txt-local branch 5 times, most recently from bc26f88 to b1cf7b7 on April 25, 2023 at 13:22
ephphatha (Author)

Benchmarking is not really feasible from my connection given the tests connect to real hosts. If you have a synthetic test case I can run that.

I did see a few things that could be improved while experimenting: urllib.robotparser doesn't use a timeout in read(), which caused it to wait longer than the 10-second timeout used for downloading images. I also added some handling around responses to align more closely with RFC 9309, which should help when a single host is referenced multiple times in a thread.

I've fixed the lint issues, and it looks like the tests run in approximately the same time as on main.
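For reference, the timeout change amounts to something like the following (an illustrative sketch, not the exact diff; the class name is a placeholder):

import urllib.request
import urllib.robotparser


class TimeoutRobotFileParser(urllib.robotparser.RobotFileParser):
    def read(self, timeout=10):
        # Same parsing as the stdlib read(), but bounded by a socket timeout so a
        # slow host can't stall the worker longer than an image download would.
        with urllib.request.urlopen(self.url, timeout=timeout) as response:
            self.parse(response.read().decode("utf-8").splitlines())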

rom1504 (Owner) commented Apr 25, 2023

So I tried to run this, and I found:

  • it's only moderately slower than before (50%)
  • there seem to be some issues; it's showing a higher false-positive rate than expected.

Some examples of urls that are considered blocked by robots.txt but shouldn't be:

I see your code is trying to handle these cases properly already, but it seems something is not working.

Could you add some tests to make sure we can handle those properly? (You can follow the example of the server serving synthetic examples in the other tests.)

raw = f.read()
self.parse(raw.decode("utf-8").splitlines())
except urllib.error.HTTPError as err:
if err.code in (401, 403) or 500 <= err.code <= 599:
rom1504 (Owner)

According to section 2.3.1.3 of https://datatracker.ietf.org/doc/rfc9309/, 401 and 403 are considered "unavailable" and hence should be treated as allow all.

ephphatha (Author)

This was based on the CPython implementation of read(), which treats authentication errors as unreachable statuses. I forgot to comment that behaviour, so I've added a comment now.

self.allow_all = True
except Exception: # pylint: disable=broad-except
# treat other errors as meaning the server is unreachable (timeout, ssl error, dns failure)
self.disallow_all = True
rom1504 (Owner)

This should be measured, and probably not cached; it is not clear that one failed request means the server is fully down.

ephphatha (Author)

Most of the errors that lead to this block are network errors (name resolution failures, timeouts). There are definitely improvements that could be made for network errors, but for this initial implementation I've just gone with the required behaviour (marked MUST) described in section 2.3.1.4.

Another option would be to let these exceptions propagate up to be caught by the download_image try/except block; this would skip requesting the current image and retry fetching robots.txt the next time the host is encountered in the thread.
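To show how the quoted fragments fit together, here is roughly the shape of the whole handler (a simplified sketch rather than the exact diff; the function name is a placeholder):

import urllib.error
import urllib.request
import urllib.robotparser


def read_robots_txt(parser, timeout=10):
    """Populate parser state: 401/403/5xx and network errors end up as
    disallow-all, other HTTP errors as allow-all."""
    try:
        with urllib.request.urlopen(parser.url, timeout=timeout) as f:
            raw = f.read()
            parser.parse(raw.decode("utf-8").splitlines())
    except urllib.error.HTTPError as err:
        if err.code in (401, 403) or 500 <= err.code <= 599:
            # Follows CPython's robotparser, which treats auth errors like server
            # errors; RFC 9309 2.3.1.3 would instead call 401/403 "unavailable".
            parser.disallow_all = True
        else:
            # robots.txt is "unavailable" (other 4xx): crawlers may access anything.
            parser.allow_all = True
    except Exception:  # pylint: disable=broad-except
        # Treat other errors (timeout, ssl error, dns failure) as the server being
        # unreachable: assume complete disallow, as required by 2.3.1.4.
        parser.disallow_all = True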

"""Returns True if the given user agent is allowed to fetch this url based on the hosts robots.txt"""
robots_txt_cache = thread_context.robots_txt_cache

robots_url = urllib.parse.urljoin(url, "/robots.txt")
rom1504 (Owner)

This does not seem correct. Per https://datatracker.ietf.org/doc/rfc9309/ section 2.3:

The rules MUST be accessible in a file named "/robots.txt" (all lowercase) in the top-level path of the service.

url is not the top-level path of the service; it would be necessary to keep only the domain name.

ephphatha (Author)

https://tio.run/##TcgxDoMwDAXQnVOgTCBV9tCN2wRqkaAk/nJciZ4@7djxPXw8aXuOkSvUfH5bKXknROsyTbDcfPk/@uHS3JaQ3NE3ZrljRRE6tHLXKvwSASN6ogtneMyBTXf1Tn57WNcxvg

"/robots.txt" is a root-relative URL. urljoin with a root relative URL as the second argument uses the host, scheme, and port from the first argument to make a complete URL.

rom1504 (Owner) commented Apr 25, 2023

Ok yeah, I see. Fetching robots.txt at the root plus correctly handling 401 and 403 seems to find some true positives.
I'll check more another day.

ephphatha (Author)

Regarding "Some examples of urls that are considered blocked by robots.txt but shouldn't be":

http://kimkelly.smugmug.com/robots.txt has a default policy of Disallow: /. If robots.txt was unavailable it would be treated as allow all.

I've already added a robots.txt response to the test server with a default policy of Disallow: /disallowed. Do you specifically want cases where the server responds with different statuses to requests for robots.txt?

@@ -22,5 +22,10 @@ async def get():
return "hi"


@app.get("/robots.txt")
Contributor

Won't this change break test coverage for X-Robots-Tag header parsing by causing the downloader to fail early? Maybe make a different, third mount point that's disallowed by robots.txt and not by HTTP headers, and add a separate test case using that?

ephphatha (Author)

It would have, but that test didn't actually check whether the disallowed paths/files are not downloaded, or at what point in the process they were disallowed. The FastAPI server was returning a JSON response for robots.txt, causing it to be ignored (the content type was meant to be text/plain).

The test case still doesn't validate that the disallow method was the correct one, but it will now at least fail if one isn't working.
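For illustration, a robots.txt route that serves that policy with the expected text/plain content type could look like the following in FastAPI (a sketch; the actual route in the test server may differ in its details):

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()


@app.get("/robots.txt", response_class=PlainTextResponse)
async def robots_txt():
    # PlainTextResponse sets Content-Type: text/plain, so robots parsers don't
    # silently ignore the body the way they would a JSON response.
    return "User-agent: *\nDisallow: /disallowed\n"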

rom1504 (Owner) commented May 8, 2023

Hi,
Just to give some updates.
I did not have time recently to work on this.

If you want to get this merged sooner, these tasks are necessary:

  • Benchmark the speed on some relevant dataset and on representative hardware, for example by downloading 20 shards of laion400m.
  • Check the stats on new filtering due to robots.txt and whether each new group of errors is valid.
  • Provide an option (turned off by default) to disable robots.txt filtering, in order to let users gracefully transition to using robots.txt by running experiments with and without it (to avoid making all datasets instantly non-reproducible).

Of course you have no obligation to work on any of that.

I'll get around to it eventually. Just wanted to let you know what blocks merging in my opinion.

rom1504 added this to Needs triage in PR Triage on May 28, 2023
rom1504 moved this from Needs triage to Waiting for user input in PR Triage on Jan 9, 2024