
The operational documentation for GHCrawler is still in development and will evolve here. Feel free to contribute changes, additions, or ideas.

GHCrawler is like a normal web crawler, but instead of web pages it traverses REST APIs, in particular the GitHub REST API. At its core is a relatively simple loop with the following four steps repeated over and over concurrently (a sketch of the loop follows the list).

  1. Get request -- Requests are read off a set of prioritized queues. Each request has a type, a URL to fetch, and a policy that describes whether and how to process the API response as well as which links to follow.
  2. Fetch document -- Given the request URL, the corresponding response document is fetched either from the crawler's cache or from the origin (e.g., GitHub).
  3. Process document -- The fetched document is analyzed and updated, and any interesting links to other APIs are queued for further traversal and processing.
  4. Save document -- The processed document is stored for future reference, either as cache content or as input for generating insights.
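
To make the loop concrete, here is a minimal sketch in TypeScript. All of the names (`CrawlRequest`, `TraversalPolicy`, `QueueSet`, `Fetcher`, `Processor`, `DocStore`, `crawlOnce`) are hypothetical and chosen for illustration; they are not the crawler's actual provider contracts, but they mirror the four steps above.

```typescript
// Hypothetical shapes mirroring the loop described above; not GHCrawler's real API.
interface CrawlRequest {
  type: string;            // e.g., 'org', 'repo', 'issues'
  url: string;             // the API URL to fetch
  policy: TraversalPolicy; // whether/how to process and which links to follow
}

interface TraversalPolicy {
  shouldProcess(request: CrawlRequest): boolean;
  shouldTraverse(link: CrawlRequest): boolean;
}

interface QueueSet {
  pop(): Promise<CrawlRequest | null>;            // read from prioritized queues
  push(requests: CrawlRequest[]): Promise<void>;  // queue newly discovered links
}

interface Fetcher {
  // Return the response document from the cache or the origin (e.g., GitHub)
  fetch(request: CrawlRequest): Promise<unknown>;
}

interface Processor {
  // Analyze/update the document and report any links worth traversing
  process(request: CrawlRequest, document: unknown): { document: unknown; links: CrawlRequest[] };
}

interface DocStore {
  upsert(document: unknown): Promise<void>;
}

// One iteration of the loop; the real crawler runs many of these concurrently.
async function crawlOnce(queues: QueueSet, fetcher: Fetcher, processor: Processor, store: DocStore): Promise<void> {
  const request = await queues.pop();                       // 1. Get request
  if (!request) return;

  const fetched = await fetcher.fetch(request);             // 2. Fetch document

  const { document, links } = request.policy.shouldProcess(request)
    ? processor.process(request, fetched)                   // 3. Process document
    : { document: fetched, links: [] as CrawlRequest[] };
  await queues.push(links.filter(link => request.policy.shouldTraverse(link)));

  await store.upsert(document);                             // 4. Save document
}
```

Each of the four steps maps onto a piece of infrastructure that can be swapped out, which is where providers come in.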

Each of the infrastructure pieces involved in the processing loop (e.g., queuing, fetching, processing, doc storage, ...) is supplied by a provider. The crawler is highly configurable and can reasonably run in many different configurations.
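
As a rough illustration of the provider idea, the sketch below (reusing the hypothetical interfaces from the previous sketch) selects queue and storage implementations from a registry based on configuration. The provider names and the in-memory implementations are purely illustrative; the crawler's actual providers and configuration keys are covered on the Configuration page.

```typescript
// Illustrative only: simple in-memory providers plus registries keyed by name.
class InMemoryQueueSet implements QueueSet {
  private items: CrawlRequest[] = [];
  async pop() { return this.items.shift() ?? null; }
  async push(requests: CrawlRequest[]) { this.items.push(...requests); }
}

class InMemoryDocStore implements DocStore {
  private docs: unknown[] = [];
  async upsert(document: unknown) { this.docs.push(document); }
}

// Configuration picks which entry of each registry to use.
const queueProviders: Record<string, () => QueueSet> = {
  memory: () => new InMemoryQueueSet(),
  // entries for hosted queue services would register here
};

const storeProviders: Record<string, () => DocStore> = {
  memory: () => new InMemoryDocStore(),
  // entries for durable document stores would register here
};

function createProviders(config: { queue: string; store: string }) {
  return {
    queues: queueProviders[config.queue](),
    store: storeProviders[config.store](),
  };
}

// Example: run entirely in memory
const { queues, store } = createProviders({ queue: 'memory', store: 'memory' });
```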

The documentation here dives into these elements, explaining them and guiding you as to how to configure and run the system.

  • Running the crawler -- Details of how to operate the crawler

  • Architecture -- An overview of how the crawler is structured and how it works

  • Configuration -- A detailed guide to configuring the crawler for different infrastructure and operational setups

  • Operator's guide -- Tutorial-style guide for running the crawler