This repository has been archived by the owner on Nov 16, 2023. It is now read-only.

Limit fetched data to only some collections #112

Open
stuartlangridge opened this issue Sep 6, 2017 · 3 comments

Comments

@stuartlangridge
Contributor

stuartlangridge commented Sep 6, 2017

It would be useful to fetch only some data from GitHub. For example, if I only need issues, pull_requests, and repos, I'd like to skip fetching commits and issue_comments to reduce how often I need to hit GitHub. Is this possible? The docs on policies seem to suggest that it might be doable, but I don't think I understand well enough how to do it; perhaps a doc clarification? (Or alternatively saying "you can't" would be OK here too.)

@jeffmcaffer
Contributor

You can, using the visitor map concept. Essentially, you create an object that has every node and edge you want to traverse. Take a look at the end of visitorMap.js. There you will see the full list of everything the crawler can currently traverse.

For your case, you would create a scenario (e.g., minimal) and then add a map for what to do at each entity you care about. Then you can use that scenario when queuing a request to traverse. Something like the following only allows the crawler to traverse repos, users, issues and pull requests. The key to remember is that the _type is used by the crawler to find the map for a given entity it encounters, and the properties of the map identify the edges out of the entity that the crawler is allowed to traverse. The value of a property in a map is the map to use when you get to the end of that edge. (Drawing it out helps...)

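// Meant to be added inside visitorMap.js, which is assumed to already
// define self, neighbors, collection and mapList.

const user = {
  _type: 'user'
};

// Declared before repo and pull_request because both reference it;
// a const cannot be read before its declaration.
const issue = {
  _type: 'issue',
  user: self,
  repo: self,
  assignee: self,
  closed_by: self
};

const repo = {
  _type: 'repo',
  owner: self,
  issues: collection(issue)
};

const pull_request = {
  _type: 'pull_request',
  user: self,
  merged_by: self,
  assignee: self,
  head: self,
  base: self,
  issue: issue
};

// The scenario that ties the entity maps together.
const minimal = {
  self: self,
  neighbors: neighbors,
  repo: repo,
  issue: issue,
  pull_request: pull_request
};

// Register the scenario under the name "minimal".
mapList.minimal = minimal;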

From there you can reference the minimal scenario when queuing a request. Check out the policy spec doc.

Fully get that this is not as easy as one might hope. Ideally the set of maps is something that comes from a configuration file or some such. We could totally do that and a PR to that effect would certainly be welcomed.
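
As a rough illustration of the lookup rule described above (property names are edges, and each property value is the map to use at the far end of that edge), here is a toy model. It is not ghcrawler's implementation: `self`, `collection`, and `resolve` below are hypothetical stand-ins written only for this sketch, and the real definitions in visitorMap.js may differ.

```javascript
// Toy model only: NOT ghcrawler's code. All names here are hypothetical
// stand-ins used to illustrate how edges in a visitor map can be followed.
const self = {}; // marker: process the entity itself, follow no further edges
const collection = (type) => ({ _type: 'collection', collection_type: type });

const issue = { _type: 'issue', user: self, repo: self };
const repo = { _type: 'repo', owner: self, issues: collection(issue) };

// Follow a list of edge names from a starting map. Returns the map found at
// the end, or undefined if any edge is absent (traversal not allowed).
function resolve(map, edges) {
  let current = map;
  for (const edge of edges) {
    // Nothing can be followed out of a missing map or the `self` marker.
    if (current === undefined || current === self) return undefined;
    // Stepping into a collection means stepping into its element map.
    if (current._type === 'collection') current = current.collection_type;
    current = current[edge];
  }
  return current;
}
```

In this model, `resolve(repo, ['issues', 'user'])` ends at the `self` marker, while `resolve(repo, ['commits'])` is `undefined` because the repo map has no `commits` edge, which is how leaving an edge out of a map keeps it from being traversed.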

@stuartlangridge
Contributor Author

stuartlangridge commented Mar 7, 2018

A followup on this issue; I've been looking into this with the intention of using new visitor maps. Once I've worked it out, I'll put together a thing that allows supplying a JSON configuration file for a custom visitor map or similar, which will be nice. However, I don't fully understand it and so I have questions. Let's imagine that I plan to use a custom map as follows, by dropping this code into visitorMap.js.

const sil_issue = {_type: "issue", user:self, repo:self, closed_by:self, assignee: self}
const sil_repo = {_type: 'repo', owner: self, organization: self, issues: collection(issue)}
const sil = {self: self, issue:sil_issue, repo:sil_repo};
mapList.sil = sil;

How do I correctly queue a request to use that map?

From the dashboard (or via the API) I can queue a request by passing an object with type, url, and policy keys. So, if I want to queue a particular repository (say, Microsoft/ghcrawler!), I pass {"type": "repo", "url": "https://api.github.com/repos/Microsoft/ghcrawler", "policy": "???"}, but I'm very unsure what to put in the policy key. I would think it'd be something like default:sil/repo, to use the default policyName, my custom visitorMap, and a repo fetch, but that fetches just the repo details into the Mongo repo collection and doesn't fetch any issues at all. I would have thought that the crawler would follow my edge to the sil_issue object and fetch all the associated issues as well, but it doesn't. Is this because I've got the policy spec wrong, or because I'm specifying the visitorMap wrong, or something else? Happy to hear any guidance you may have here.
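
For reference, the request object described above could be assembled like this. This is only a sketch: `buildRequest` is a hypothetical helper, not part of ghcrawler, and the policy string is passed in as a parameter because the correct value is exactly what is in question here. The example call uses the value I tried (default:sil/repo), which did not cause issues to be fetched.

```javascript
// Hypothetical helper (not a ghcrawler API): builds the request object with
// the type, url, and policy keys described above.
function buildRequest(type, url, policy) {
  if (!type || !url) throw new Error('type and url are required');
  const request = { type, url };
  // The policy key is only set when a value is supplied.
  if (policy !== undefined) request.policy = policy;
  return request;
}

// The policy value below is the one tried in this comment; it is not
// confirmed to be the right spec for a custom visitor map.
const request = buildRequest(
  'repo',
  'https://api.github.com/repos/Microsoft/ghcrawler',
  'default:sil/repo'
);
console.log(JSON.stringify(request));
```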

@jeffmcaffer
Contributor

Phew, I'm going to have to dig into the code on this one. The feature (being able to spec maps) is there but has seen relatively little use. I'm pretty sure it's possible but will need to look carefully. You are on the right path (no pun intended) but there is likely a subtlety to the way the map path is spec'd in the request.
