Scrape Everything From Github #164

wdahlenburg · 2018-10-13T18:21:43Z

Pull Request for #163

Made use of the ListAll method to implement the ability to scrape all repositories. A binary search was implemented to determine the upper limit of repositories. This was then split evenly between threads so that large numbers of repositories could be pulled concurrently.
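As a rough illustration of the approach described above (not the code in this PR, which is Go), the upper-limit search and the even split might look like the following Python sketch. The `has_repos_after` probe is a hypothetical stand-in for a `ListAll`-style call that returns repositories with an ID greater than a given value. The sketch assumes that probe is monotonic: true up to the highest repository ID and false afterwards.

```python
from typing import Callable, List, Tuple


def find_max_repo_id(has_repos_after: Callable[[int], bool]) -> int:
    """Return the highest repository ID, given a monotonic probe.

    has_repos_after(n) must be True while some repository has an ID > n.
    """
    if not has_repos_after(0):
        return 0  # no repositories at all

    # Grow an upper bound exponentially until the probe comes back empty.
    hi = 1
    while has_repos_after(hi):
        hi *= 2
    lo = hi // 2

    # Invariant: has_repos_after(lo) is True and has_repos_after(hi) is False.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if has_repos_after(mid):
            lo = mid
        else:
            hi = mid
    return hi


def split_ranges(max_id: int, workers: int) -> List[Tuple[int, int]]:
    """Split IDs 1..max_id into contiguous, inclusive (start, end) ranges."""
    if max_id <= 0 or workers <= 0:
        return []
    workers = min(workers, max_id)  # never create empty ranges
    base, extra = divmod(max_id, workers)
    ranges = []
    start = 1
    for i in range(workers):
        # The first `extra` workers take one additional ID each.
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size - 1))
        start += size
    return ranges
```

Each worker would then page through its own range. Repository IDs are sparse, so ranges of equal width do not guarantee equal numbers of repositories per worker.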

Currently only the master branch is pulled from repositories. This helps prevent path explosion, but could be improved in the future.

Closing issues
closes #163

mattyjones · 2020-05-19T16:09:35Z

@wdahlenburg By this do you mean crawling through all of github.com? How are you getting around rate limiting and performance issues? I agree this is cool, but without some guardrails or a way to save and continue a search, this is a rough process. I have looked at this type of functionality before and used it at length in GHE: threaded, I hit the rate limit in under a minute, and non-threaded, it took ~8 hours to go through 60k repos, not including any additional network latency.

wdahlenburg · 2020-05-19T16:57:23Z

@mattyjones Yes, this will technically crawl through all of github.com, but rate limiting isn't built into it. After you hit the rate limit, it will fail. I found differences between github.com and an enterprise version of GitHub (github.company.com), where the latter did not have any rate limit.

The rate limiting would be a little tricky to implement due to concurrency, but overall worthwhile. It realistically should be added as a separate feature and then this can be committed afterwards. I would need to add support to check if the rate limit exists.

This code should work fine as-is if you are running it on an enterprise instance w/o rate limiting. It definitely does take a lot of time and CPU.

Some potential options for the file size and partial results:

You could save each repo's data in a database to break up the huge results.
A progress file could be created that stores the start:end ranges along with the last repository id completed. A signal could be sent to gitrob to trigger the save and exit gracefully. A cli option could be added to resume from a save file.
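A minimal sketch of the progress-file idea, in Python for brevity: the JSON layout, the field names (`start`, `end`, `last_done`), and both function names are assumptions made for illustration, not an existing gitrob format or API. A signal handler could call `save_progress` before exiting, and a resume option could feed the output of `load_remaining` back into the workers.

```python
import json
import os
import tempfile
from typing import Dict, List, Tuple


def save_progress(path: str, ranges: List[Dict[str, int]]) -> None:
    """Write the per-worker ranges and their last completed repository ID.

    Each entry looks like {"start": 1, "end": 100, "last_done": 42}.
    Use last_done = start - 1 for a range that has not begun.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file first so an interrupted save never leaves
    # a half-written progress file behind.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump({"ranges": ranges}, handle)
    os.replace(tmp_path, path)


def load_remaining(path: str) -> List[Tuple[int, int]]:
    """Return the inclusive (start, end) ranges that still need scanning."""
    with open(path) as handle:
        state = json.load(handle)
    remaining = []
    for entry in state["ranges"]:
        next_id = max(entry["start"], entry["last_done"] + 1)
        if next_id <= entry["end"]:  # skip ranges that already finished
            remaining.append((next_id, entry["end"]))
    return remaining
```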

mattyjones · 2020-05-19T17:43:48Z

@wdahlenburg I can confirm that Enterprise at least has the option to rate limit. The key here could be to check each request, and if a rate limit is hit, sleep for a little while and then try again. This will take a while, but it is something I was toying with. I also have code to dump stuff into SQLite for later querying; I may implement that here. I certainly agree that rate limiting should be a second feature as well. I will play with this for a little while and then merge it if all goes well.
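The sleep-and-retry idea could be sketched roughly as below, in Python for brevity. `RateLimited` and `with_rate_limit_retry` are hypothetical names, and the sketch assumes the API client can report how long to wait when a limit is hit. It wraps a single call, so coordinating the wait across concurrent workers is left out.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by a fetch function when a rate limit is hit."""

    def __init__(self, retry_after: float):
        super().__init__(f"rate limited, retry after {retry_after}s")
        self.retry_after = retry_after


def with_rate_limit_retry(
    fetch: Callable[[], T],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), sleeping and retrying whenever it reports a rate limit."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except RateLimited as err:
            if attempt == max_attempts:
                raise  # give up and surface the error to the caller
            sleep(err.retry_after)
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable only so the wait can be replaced in tests; in normal use the default `time.sleep` applies.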

mattyjones · 2020-05-20T01:56:35Z

@wdahlenburg Thanks for the work on this. I have now merged it in and will be playing with it over the next few days to ensure all is light and bright. I also need to write tests for all this code before I start to muck with it.

Scrape Everything From Github

bb91ed7
