Read and cache robots.txt files for each host using thread-local storage #302

Open
ephphatha wants to merge 1 commit into main from robots-txt-local
Conversation

ephphatha

Would be better to share between threads to minimise requests to the same host, but that'll require synchronising read() calls and object initialisation.

This also doesn't respect the request-rate or crawl-delay directives as that requires communicating the last access time per host across the pool. That still needs to be done to properly address #48

This should be merged after #300 to allow hosts to set a policy for this tool. urllib.robotparser expects a simple useragent string and splits on the first /. The current useragent will cause it to only apply the catch-all policy or a policy intended for browsers (User-Agent: Mozilla).
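In outline, the approach looks like the sketch below. This is illustrative rather than the exact diff: thread_context.robots_txt_cache matches the name used in the change, but the function name and surrounding structure here are placeholders.

import threading
import urllib.parse
import urllib.robotparser

thread_context = threading.local()


def is_allowed(user_agent, url):
    """Check url against its host's robots.txt, caching one parser per host per thread."""
    if not hasattr(thread_context, "robots_txt_cache"):
        thread_context.robots_txt_cache = {}
    robots_txt_cache = thread_context.robots_txt_cache

    # "/robots.txt" is root-relative, so urljoin keeps only the scheme/host/port of url.
    robots_url = urllib.parse.urljoin(url, "/robots.txt")
    parser = robots_txt_cache.get(robots_url)
    if parser is None:
        parser = urllib.robotparser.RobotFileParser(robots_url)
        parser.read()  # the stdlib read() has no timeout; see the discussion below
        robots_txt_cache[robots_url] = parser
    return parser.can_fetch(user_agent, url)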

rom1504 (Owner) commented Apr 25, 2023

Pretty short code, looks good!

Could you:

  • Have a look at why the CI is failing (see how to run tests locally in the readme)
  • Run a small benchmark to make sure it doesn't make the tool much slower (there's an example file in the test folder)

ephphatha force-pushed the robots-txt-local branch 5 times, most recently from bc26f88 to b1cf7b7 on April 25, 2023 at 13:22
ephphatha (Author)

Benchmarking is not really feasible from my connection given the tests connect to real hosts. If you have a synthetic test case I can run that.

I did see a few things that could be improved while experimenting: urllib.robotparser doesn't use a timeout in read(), which caused it to wait longer than the 10-second timeout used for downloading images. I also added some handling around responses to align more closely with RFC 9309, which should help when a single host is referenced multiple times in a thread.

I've fixed the lint issues, and it looks like the tests run in approximately the same time as on main.
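For reference, the timeout change amounts to something like the following (an illustrative sketch, not the exact diff; the class name is a placeholder):

import urllib.request
import urllib.robotparser


class TimeoutRobotFileParser(urllib.robotparser.RobotFileParser):
    def read(self, timeout=10):
        # Same parsing as the stdlib read(), but bounded by a socket timeout so a
        # slow host can't stall the worker longer than an image download would.
        with urllib.request.urlopen(self.url, timeout=timeout) as response:
            self.parse(response.read().decode("utf-8").splitlines())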

rom1504 (Owner) commented Apr 25, 2023

So I tried to run this, and I found:

  • it's only moderately slower than before (50%)
  • there seem to be some issues; it's showing a higher false-positive rate than expected.

Some examples of urls that are considered blocked by robots.txt but shouldn't be:

I see your code is trying to handle these cases properly already, but it seems something is not working.

Could you add some tests to make sure we can handle those properly? (You can follow the example of the server serving synthetic examples in the other tests.)

raw = f.read()
self.parse(raw.decode("utf-8").splitlines())
except urllib.error.HTTPError as err:
if err.code in (401, 403) or 500 <= err.code <= 599:
rom1504 (Owner)

According to section 2.3.1.3 of https://datatracker.ietf.org/doc/rfc9309/, 401 and 403 are considered "unavailable" and hence should be treated as allow all.

ephphatha (Author)

This was based on the CPython implementation of read(), which treats authentication errors as unreachable statuses. I forgot to comment that behaviour, so I've added a comment now.

self.allow_all = True
except Exception: # pylint: disable=broad-except
# treat other errors as meaning the server is unreachable (timeout, ssl error, dns failure)
self.disallow_all = True
rom1504 (Owner)

This should be measured, and probably not cached; it is not clear that one failed request means the server is fully down.

ephphatha (Author)

Most of the errors that lead to this block are network errors (name resolution failures, timeouts). There are definitely improvements that could be made for network errors, but for this initial implementation I've just gone with the required behaviour (marked MUST) described in section 2.3.1.4.

Another option would be to let these exceptions propagate up to be caught by the download_image try/except block; this would skip requesting the current image and retry fetching robots.txt the next time the host is encountered in the thread.
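To show how the quoted fragments fit together, here is roughly the shape of the whole handler (a simplified sketch rather than the exact diff; the function name is a placeholder):

import urllib.error
import urllib.request
import urllib.robotparser


def read_robots_txt(parser, timeout=10):
    """Populate parser state: 401/403/5xx and network errors end up as
    disallow-all, other HTTP errors as allow-all."""
    try:
        with urllib.request.urlopen(parser.url, timeout=timeout) as f:
            raw = f.read()
            parser.parse(raw.decode("utf-8").splitlines())
    except urllib.error.HTTPError as err:
        if err.code in (401, 403) or 500 <= err.code <= 599:
            # Follows CPython's robotparser, which treats auth errors like server
            # errors; RFC 9309 2.3.1.3 would instead call 401/403 "unavailable".
            parser.disallow_all = True
        else:
            # robots.txt is "unavailable" (other 4xx): crawlers may access anything.
            parser.allow_all = True
    except Exception:  # pylint: disable=broad-except
        # Treat other errors (timeout, ssl error, dns failure) as the server being
        # unreachable: assume complete disallow, as required by 2.3.1.4.
        parser.disallow_all = True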

"""Returns True if the given user agent is allowed to fetch this url based on the hosts robots.txt"""
robots_txt_cache = thread_context.robots_txt_cache

robots_url = urllib.parse.urljoin(url, "/robots.txt")
rom1504 (Owner)

This does not seem correct. Per https://datatracker.ietf.org/doc/rfc9309/ section 2.3:

The rules MUST be accessible in a file named "/robots.txt" (all lowercase) in the top-level path of the service.

url is not the top-level path of the service; it would be necessary to keep only the domain name.

ephphatha (Author)

https://tio.run/##TcgxDoMwDAXQnVOgTCBV9tCN2wRqkaAk/nJciZ4@7djxPXw8aXuOkSvUfH5bKXknROsyTbDcfPk/@uHS3JaQ3NE3ZrljRRE6tHLXKvwSASN6ogtneMyBTXf1Tn57WNcxvg

"/robots.txt" is a root-relative URL. urljoin with a root relative URL as the second argument uses the host, scheme, and port from the first argument to make a complete URL.

rom1504 (Owner) commented Apr 25, 2023

Ok yeah, I see. Fetching robots.txt at the root plus correctly handling 401 and 403 seems to find some true positives.
I'll check more another day.

ephphatha (Author)

Regarding "Some examples of urls that are considered blocked by robots.txt but shouldn't be":

http://kimkelly.smugmug.com/robots.txt has a default policy of Disallow: /. If robots.txt was unavailable it would be treated as allow all.

I've already added a robots.txt response to the test server with a default policy of Disallow: /disallowed. Do you specifically want cases where the server responds with different statuses to requests for robots.txt?

@@ -22,5 +22,10 @@ async def get():
return "hi"


@app.get("/robots.txt")
Contributor

Won't this change break test coverage for X-Robots-Tag header parsing by causing the downloader to fail early? Maybe make a different, third mount point that's disallowed by robots.txt and not by HTTP headers, and add a separate test case using that?

ephphatha (Author)

It would have, but that test didn't actually check whether the disallowed paths/files are not downloaded, or at what point in the process they were disallowed. The FastAPI server was returning a JSON response for robots.txt, causing it to be ignored (the content type was meant to be text/plain).

The test case still doesn't validate that the disallow method was the correct one, but it will now at least fail if one isn't working.
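For illustration, a robots.txt route that serves that policy with the expected text/plain content type could look like the following in FastAPI (a sketch; the actual route in the test server may differ in its details):

from fastapi import FastAPI
from fastapi.responses import PlainTextResponse

app = FastAPI()


@app.get("/robots.txt", response_class=PlainTextResponse)
async def robots_txt():
    # PlainTextResponse sets Content-Type: text/plain, so robots parsers don't
    # silently ignore the body the way they would a JSON response.
    return "User-agent: *\nDisallow: /disallowed\n"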

rom1504 (Owner) commented May 8, 2023

Hi,
Just to give some updates.
I did not have time recently to work on this.

If you want to get this merged sooner, these tasks are necessary:

  • Benchmark the speed on some relevant dataset and on representative hardware, for example by downloading 20 shards of laion400m.
  • Check the stats on new filtering due to robots.txt and whether each new group of errors is valid.
  • Provide an option (turned off by default) to disable robots.txt filtering, in order to let users gracefully transition to using robots.txt by running experiments with and without it (to avoid making all datasets instantly non-reproducible).

Of course you have no obligation to work on any of that.

I'll get around to it eventually. Just wanted to let you know what blocks merging in my opinion.

rom1504 added this to Needs triage in PR Triage on May 28, 2023
rom1504 moved this from Needs triage to Waiting for user input in PR Triage on Jan 9, 2024