This repository has been archived by the owner on Jan 13, 2023. It is now read-only.

Scrape Everything From Github #164

Open
wants to merge 1 commit into
base: master


wdahlenburg

Pull Request for #163

Made use of the ListAll method to implement the ability to scrape all repositories. A binary search was implemented to determine the upper limit of repositories. This was then split evenly between threads so that large numbers of repositories could be pulled concurrently.
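A minimal sketch of the approach described above, not the code from this pull request. It assumes a hypothetical probe, `hasReposAfter(since)`, that stands in for one listing call with a `since` repository ID and reports whether anything came back. The names `findUpperLimit`, `splitRanges`, and `idRange` are illustrative only, and the probe is simulated here so the example runs offline.

```go
package main

import (
	"fmt"
	"sync"
)

// idRange covers repository IDs greater than Start and up to End.
type idRange struct{ Start, End int64 }

// findUpperLimit returns the smallest ID for which hasReposAfter reports
// false. It assumes the probe is true for every ID below the highest
// repository ID and false from that ID onwards.
func findUpperLimit(hasReposAfter func(since int64) bool) int64 {
	if !hasReposAfter(0) {
		return 0
	}
	// Double the upper bound until nothing is listed beyond it.
	lo, hi := int64(0), int64(1)
	for hasReposAfter(hi) {
		lo = hi
		hi *= 2
	}
	// Invariant: hasReposAfter(lo) is true, hasReposAfter(hi) is false.
	for hi-lo > 1 {
		mid := lo + (hi-lo)/2
		if hasReposAfter(mid) {
			lo = mid
		} else {
			hi = mid
		}
	}
	return hi
}

// splitRanges divides the IDs up to limit into at most `workers`
// contiguous ranges whose sizes differ by no more than one.
func splitRanges(limit int64, workers int) []idRange {
	if limit <= 0 {
		return nil
	}
	if workers < 1 {
		workers = 1
	}
	if int64(workers) > limit {
		workers = int(limit)
	}
	size, extra := limit/int64(workers), limit%int64(workers)
	ranges := make([]idRange, 0, workers)
	start := int64(0)
	for i := 0; i < workers; i++ {
		end := start + size
		if int64(i) < extra {
			end++ // spread the remainder over the first ranges
		}
		ranges = append(ranges, idRange{Start: start, End: end})
		start = end
	}
	return ranges
}

func main() {
	// Simulated probe: pretend the highest repository ID is 1000.
	const highestID = 1000
	probe := func(since int64) bool { return since < highestID }

	limit := findUpperLimit(probe)
	var wg sync.WaitGroup
	for _, r := range splitRanges(limit, 4) {
		wg.Add(1)
		go func(r idRange) {
			defer wg.Done()
			// A real worker would page through the listing here.
			fmt.Printf("worker scanning IDs in (%d, %d]\n", r.Start, r.End)
		}(r)
	}
	wg.Wait()
}
```

The doubling step is there because no upper bound is known in advance. Once a bound is found, the search needs only a logarithmic number of probes.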

Currently only the master branch is pulled from repositories. This helps prevent path explosion, but could be improved in the future.

Closing issues
closes #163

@mattyjones

mattyjones commented May 19, 2020

@wdahlenburg By this do you mean crawling through all of github.com? How are you getting around rate limiting and performance issues? I agree that this is cool, but without some guardrails or a way to save and continue a search, this is a rough process. I have looked at this type of functionality before and used it at length in GHE. Threaded, I hit the rate limit in under a minute; unthreaded, it took ~8 hours to go through 60k repos, not including any additional network latency.

@wdahlenburg
Author

@mattyjones Yes, this will technically crawl through all of github.com, but rate-limit handling isn't built into it, so once you hit the rate limit it will fail. I found differences between github.com and an enterprise version of GitHub (github.company.com), where the latter did not have any rate limit.

The rate limiting would be a little tricky to implement due to concurrency, but overall worthwhile. It realistically should be added as a separate feature and then this can be committed afterwards. I would need to add support to check if the rate limit exists.

This code should work fine as-is if you are running it on an enterprise instance w/o rate limiting. It definitely does take a lot of time and CPU.

Some potential options for the file size and partial results:

  • You could save each repo's data in a database to break up the huge results

  • A progress file could be created that stores the start:end ranges along with the last repository ID completed. A signal could be sent to gitrob to trigger the save and exit gracefully. A CLI option could be added to resume from a save file.
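The progress-file option could look roughly like the sketch below. Nothing here exists in gitrob: the `progress` and `rangeProgress` types, the JSON layout, and the file name are all assumptions made for illustration. Each range records its start, its end, and the last repository ID completed; an interrupt stops the loop so the state can be saved, and `remaining` computes where a later run would pick up.

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"os"
	"os/signal"
	"path/filepath"
)

// rangeProgress is one worker's ID range plus the last ID it finished.
// LastDone equals Start until the worker completes its first repository.
type rangeProgress struct {
	Start    int64 `json:"start"`
	End      int64 `json:"end"`
	LastDone int64 `json:"last_done"`
}

// progress is the hypothetical on-disk save format.
type progress struct {
	Ranges []rangeProgress `json:"ranges"`
}

func encodeProgress(p progress) ([]byte, error) {
	return json.MarshalIndent(p, "", "  ")
}

func decodeProgress(data []byte) (progress, error) {
	var p progress
	err := json.Unmarshal(data, &p)
	return p, err
}

// saveProgress writes to a temporary file and renames it into place, so an
// interrupted save does not leave a half-written progress file behind.
func saveProgress(path string, p progress) error {
	data, err := encodeProgress(p)
	if err != nil {
		return err
	}
	tmp := path + ".tmp"
	if err := os.WriteFile(tmp, data, 0o600); err != nil {
		return err
	}
	return os.Rename(tmp, path)
}

// loadProgress reads a file written by saveProgress.
func loadProgress(path string) (progress, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return progress{}, err
	}
	return decodeProgress(data)
}

// remaining returns the unfinished part of every range, each starting
// right after the last completed ID. Finished ranges are dropped.
func (p progress) remaining() []rangeProgress {
	var out []rangeProgress
	for _, r := range p.Ranges {
		next := r.Start
		if r.LastDone > next {
			next = r.LastDone
		}
		if next < r.End {
			out = append(out, rangeProgress{Start: next, End: r.End, LastDone: next})
		}
	}
	return out
}

func main() {
	// An interrupt (Ctrl-C) cancels ctx, which ends the loop below early.
	ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
	defer stop()

	path := filepath.Join(os.TempDir(), "gitrob-progress.json") // hypothetical name
	p := progress{Ranges: []rangeProgress{{Start: 0, End: 5, LastDone: 0}}}
	r := &p.Ranges[0]
	for id := r.Start + 1; id <= r.End && ctx.Err() == nil; id++ {
		// A real scan of repository `id` would happen here.
		r.LastDone = id
	}
	if err := saveProgress(path, p); err != nil {
		fmt.Fprintln(os.Stderr, "save failed:", err)
		os.Exit(1)
	}
	loaded, err := loadProgress(path)
	if err != nil {
		fmt.Fprintln(os.Stderr, "load failed:", err)
		os.Exit(1)
	}
	fmt.Println("ranges left to scan:", len(loaded.remaining()))
}
```

A resume option would then only need to call `loadProgress` and hand the result of `remaining` to the workers in place of a fresh split.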

@mattyjones

@wdahlenburg I can confirm that Enterprise at least has the option to rate limit. The key here could be to check the response, and if a rate limit is hit, sleep for a little while and then try again. This will take a while, but that is something I was toying with. I also have code to dump stuff into SQLite for later querying; I may implement that here. I certainly agree that rate limiting should be a second feature as well. I will play with this for a little while and then merge it if all goes well.
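The sleep-and-retry idea could be sketched as follows, using only the standard library instead of whatever client gitrob uses. It assumes the server signals a rate limit with status 403 or 429 together with the `Retry-After`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset` headers that github.com documents; an Enterprise instance may be configured differently. `rateLimitWait` and `getWithRetry` are hypothetical names, not existing gitrob functions.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"time"
)

// rateLimitWait reports whether a response looks rate limited and, if so,
// how long to wait before retrying. get is typically resp.Header.Get and
// nowUnix is the current time in seconds since the Unix epoch.
func rateLimitWait(status int, get func(string) string, nowUnix int64) (time.Duration, bool) {
	if status != http.StatusForbidden && status != http.StatusTooManyRequests {
		return 0, false
	}
	// Retry-After, when present, carries a number of seconds to wait.
	if secs, err := strconv.ParseInt(get("Retry-After"), 10, 64); err == nil && secs >= 0 {
		return time.Duration(secs) * time.Second, true
	}
	// Otherwise wait for the reset time once the quota is used up.
	if get("X-RateLimit-Remaining") == "0" {
		if reset, err := strconv.ParseInt(get("X-RateLimit-Reset"), 10, 64); err == nil {
			if reset > nowUnix {
				return time.Duration(reset-nowUnix) * time.Second, true
			}
			return 0, true
		}
	}
	return 0, false // for example, a plain permission error
}

// getWithRetry issues a GET and, while the response is rate limited,
// sleeps and tries again, giving up after maxAttempts. The sleep function
// is a parameter so callers and tests can swap in something other than
// time.Sleep.
func getWithRetry(c *http.Client, url string, maxAttempts int, sleep func(time.Duration)) (*http.Response, error) {
	for attempt := 1; ; attempt++ {
		resp, err := c.Get(url)
		if err != nil {
			return nil, err
		}
		wait, limited := rateLimitWait(resp.StatusCode, resp.Header.Get, time.Now().Unix())
		if !limited {
			return resp, nil
		}
		resp.Body.Close()
		if attempt >= maxAttempts {
			return nil, fmt.Errorf("still rate limited after %d attempts", attempt)
		}
		sleep(wait + time.Second) // small margin past the reported time
	}
}

func main() {
	// Offline demonstration with hand-built headers; no request is sent.
	headers := http.Header{}
	headers.Set("X-RateLimit-Remaining", "0")
	headers.Set("X-RateLimit-Reset", "1600")
	wait, limited := rateLimitWait(http.StatusForbidden, headers.Get, 1000)
	fmt.Println("rate limited:", limited, "wait:", wait)
}
```

With several goroutines sharing one token, each of them would hit the limit at about the same moment, so a real implementation would probably want the workers to share a single pause instead of sleeping independently.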

@mattyjones

@wdahlenburg Thanks for the work on this. I have now merged it in and will be playing with it over the next few days to ensure all is light and bright. I also need to write tests for all this code before I start to muck with it.
