Reduce memory footprint of multiple logically same RobotsData acquired from different inputs #5
I think step #1, as you identified, is to do a crawl and gather some data on the frequency of these rules. My guess is your intuition is right and we'll find that most rules are, in fact, very similar. What those rules are will also dictate the solution. If in 90% of the cases the rules are a simple allow/disallow all, then just taking care of that would give you a major win. Beyond that, there are all kinds of compression tricks that you could apply towards building a compact representation of the rules. But once again, gather data before making any of these design decisions. :)
Preliminary research based on 36,000 robots.txt files from random web sites in the Alexa top 1M produced the following unsatisfying results: 11.2% of robots.txt files allow all, 1.5% disallow all. A few observations (e.g. 20% of hosts are served by WordPress) lead to the conclusion that the analyzed data is a biased, non-representative subset of the whole Web. I will conduct further research involving more hosts later.
Currently, FromResponseBytes returns a singleton RobotsData: allow-all for a 404 status code and disallow-all for 401/403 status codes. For any other input, a unique RobotsData is created even though it could share a subset of its rules, or all of them, with an existing one. Sharing all rules is equivalent to having another singleton RobotsData.
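For context, here is a minimal sketch of the behaviour described above, assuming illustrative names (only RobotsData and FromResponseBytes appear in the library; everything else here is hypothetical):

```go
package robotstxt

// Sketch: well-known HTTP statuses map to shared singleton values,
// everything else gets a freshly parsed, unshared RobotsData.

type RobotsData struct {
	// parsed rules omitted for brevity
	allowAll    bool
	disallowAll bool
}

var (
	allowAll    = &RobotsData{allowAll: true}    // shared by every 404 response
	disallowAll = &RobotsData{disallowAll: true} // shared by every 401/403 response
)

func fromResponseBytesSketch(statusCode int, body []byte) (*RobotsData, error) {
	switch {
	case statusCode == 404:
		return allowAll, nil
	case statusCode == 401 || statusCode == 403:
		return disallowAll, nil
	default:
		// Every other input currently allocates a brand-new RobotsData,
		// even when the parsed rules are logically identical to an
		// existing one — this is the overhead the issue targets.
		return parse(body)
	}
}

func parse(body []byte) (*RobotsData, error) {
	// placeholder for the real parser
	return &RobotsData{}, nil
}
```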
Plan:
A wild guess is that most robots.txt files (say 95%) fall into either "allow all" or "disallow all".
If the hypothesis is confirmed, some normalization technique could be applied to reduce the memory footprint and improve the cache locality of real-world web crawlers using the robotstxt.go library.
Possible normalizations:
- Post-process parsed rules into a canonical form, so that logically identical inputs yield identical rule sets.
- Export a unique (shared) RobotsData value per distinct rule set, so callers can cache and compare by identity.

These two techniques do not even conflict: post-processing parsed rules seems a worthy optimisation anyway, and exporting unique values could further allow caching of non-trivial but still popular rules.
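A rough sketch of how the two ideas could combine, assuming a hypothetical process-wide interning table keyed by a canonical form of the parsed rules (none of these names are part of the library's API):

```go
package robotstxt

import (
	"crypto/sha256"
	"sort"
	"strings"
	"sync"
)

// Hypothetical rule and rule-set types; the real parsed representation
// in robotstxt.go will differ.
type rule struct {
	agent string
	path  string
	allow bool
}

type robotsData struct {
	rules []rule
}

var (
	internMu    sync.Mutex
	internTable = map[[32]byte]*robotsData{}
)

// canonicalKey normalises the rules (sorted order) so that ordering
// differences in the source file do not produce distinct keys.
func canonicalKey(rules []rule) [32]byte {
	sorted := append([]rule(nil), rules...)
	sort.Slice(sorted, func(i, j int) bool {
		if sorted[i].agent != sorted[j].agent {
			return sorted[i].agent < sorted[j].agent
		}
		return sorted[i].path < sorted[j].path
	})
	var b strings.Builder
	for _, r := range sorted {
		b.WriteString(r.agent)
		b.WriteByte(0)
		b.WriteString(r.path)
		b.WriteByte(0)
		if r.allow {
			b.WriteByte('1')
		} else {
			b.WriteByte('0')
		}
	}
	return sha256.Sum256([]byte(b.String()))
}

// intern returns one shared value per logically identical rule set, so
// popular rule sets (allow-all, common CMS defaults, …) collapse to a
// single allocation. A real implementation would need to bound or evict
// this table.
func intern(rules []rule) *robotsData {
	key := canonicalKey(rules)
	internMu.Lock()
	defer internMu.Unlock()
	if rd, ok := internTable[key]; ok {
		return rd
	}
	rd := &robotsData{rules: rules}
	internTable[key] = rd
	return rd
}
```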
Even in the unlikely event that the distribution of rule sets is closer to uniform, the distribution of individual rules must exhibit large spikes around agent=* and url=/. For that case, the library can return singleton popular rules. Now that I think of it, maintaining a few extremely popular individual Rule singletons could be a worthy optimisation on its own. TODO: benchmark.
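A minimal sketch of such per-rule singletons, assuming a hypothetical Rule struct and constructor (not the library's actual types):

```go
package robotstxt

// Sketch: instead of allocating a fresh Rule for every "User-agent: *" /
// "Disallow: /" (or "Allow: /") pair, the parser hands back pointers to
// package-level singletons for the most common cases.

type Rule struct {
	Agent string
	Path  string
	Allow bool
}

var (
	disallowRootAnyAgent = &Rule{Agent: "*", Path: "/", Allow: false}
	allowRootAnyAgent    = &Rule{Agent: "*", Path: "/", Allow: true}
)

func newRule(agent, path string, allow bool) *Rule {
	if agent == "*" && path == "/" {
		if allow {
			return allowRootAnyAgent
		}
		return disallowRootAnyAgent
	}
	return &Rule{Agent: agent, Path: path, Allow: allow}
}
```

Whether the reduced allocations outweigh the extra branch in the parser is exactly what the benchmark in the TODO above should answer.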