github-issue-crawler

This repository contains a script that crawls the GitHub v3 API, seeking out all public repositories, issues, and comments, and storing them in a MongoDB database.
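As an illustration of how such a crawl can walk through the public repository listing, the v3 API paginates results and advertises the next page in a Link response header. The sketch below shows one way to pull out that next-page URL. The function name parseNextLink is hypothetical, and this is not necessarily how index.js does it.

```javascript
// Extract the URL tagged rel="next" from a Link response header.
// Returns null when the header is missing or has no next page.
function parseNextLink(linkHeader) {
  if (!linkHeader) return null;
  // Each entry looks like: <https://...>; rel="next"
  const match = linkHeader.match(/<([^>]+)>\s*;\s*rel="next"/);
  return match ? match[1] : null;
}
```

A crawler could keep requesting the returned URL until the function yields null, saving the last URL it fetched so that a restart can resume from that point.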

Usage

$ node index.js

The standard API limit is about 5000 requests an hour, so a crawl takes a while. Crawling the whole of GitHub with this script would take years, due to the API limitations described below.
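To make the "years" estimate concrete, here is a back-of-the-envelope calculation. It assumes one request per repository and a repository count of 100 million, which is an illustrative figure only and not a measured one. Both the function name and the constant are hypothetical.

```javascript
// Rough crawl-time estimate: one request per repository at a fixed hourly limit.
function estimateCrawlYears(repoCount, requestsPerHour) {
  const hours = repoCount / requestsPerHour;
  return hours / (24 * 365);
}

// Assumed repository count, for illustration only.
const ASSUMED_REPO_COUNT = 100000000;

// 100,000,000 / 5000 = 20,000 hours, or roughly 2.3 years.
console.log(estimateCrawlYears(ASSUMED_REPO_COUNT, 5000).toFixed(1) + ' years');
```

Any additional requests for issues and comments would push the figure higher.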

To be done

Port to the v4 (GraphQL) API

Limitations

While this script is designed to be restartable, it is not highly performant, because of the structure of the API itself. The GitHub issue API endpoints require at least one query per repository to determine whether it has any issues. Because a very high proportion of repositories are empty forks, about 98% of all endpoint requests return no data.
Handling of rate limiting is especially naive, but functional for now. A better approach will be needed for a GraphQL revision.
Configuration is hard-wired, and it should not be.
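As a sketch of what less naive rate-limit handling might look like, the v3 API reports the remaining quota and the reset time in the x-ratelimit-remaining and x-ratelimit-reset response headers. The helper below computes how long to pause before the next request. The names rateLimitDelayMs and sleep are hypothetical, and the headers argument is assumed to be a plain object with lowercase keys, which depends on the HTTP client in use.

```javascript
// Decide how long to pause before the next request, based on the
// x-ratelimit-remaining and x-ratelimit-reset response headers.
// `headers` is assumed to be a plain object with lowercase header names;
// `nowMs` is the current time in milliseconds since the epoch.
function rateLimitDelayMs(headers, nowMs) {
  const remaining = Number(headers['x-ratelimit-remaining']);
  const resetSeconds = Number(headers['x-ratelimit-reset']);
  // Missing or malformed headers: do not wait.
  if (!Number.isFinite(remaining) || !Number.isFinite(resetSeconds)) return 0;
  // Quota still available: no need to wait.
  if (remaining > 0) return 0;
  // The reset header is a UTC epoch time in seconds.
  return Math.max(0, resetSeconds * 1000 - nowMs);
}

// Small promise-based sleep helper.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));
```

Inside an async crawl loop, calling await sleep(rateLimitDelayMs(headers, Date.now())) after each response would pause only once the quota is exhausted, and only until the reported reset time.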

morungos/github-issue-crawler
