Crawl error causes skipDuplicates to be set to false #345

Open
villesau opened this issue May 2, 2020 · 1 comment


villesau commented May 2, 2020

It seems that when the site errors out, self.options.skipDuplicates is set to false but never set back to true in the current version. As a result, duplicate URLs end up being crawled.
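
To illustrate the effect, here is a minimal sketch using a hypothetical stand-in class (MockCrawler and its retry method are assumptions for illustration, not the library's real implementation). It only mimics the pattern described above: the retry path clears the flag and nothing restores it.

```javascript
// Hypothetical stand-in for the crawler; NOT the library's actual code.
class MockCrawler {
  constructor() {
    this.options = { skipDuplicates: true };
    this.seen = new Set();
    this.crawled = [];
  }

  // Queue a URL unless duplicate skipping is on and the URL was already seen.
  queue(uri) {
    if (this.options.skipDuplicates && this.seen.has(uri)) return;
    this.seen.add(uri);
    this.crawled.push(uri);
  }

  // Mimics the reported retry path: the flag is cleared and never restored.
  retry(uri) {
    this.options.skipDuplicates = false;
    this.queue(uri);
  }
}

const buggy = new MockCrawler();
buggy.queue("http://example.com/a");
buggy.queue("http://example.com/a"); // skipped as a duplicate
buggy.retry("http://example.com/b"); // simulated retry after an error
buggy.queue("http://example.com/a"); // no longer skipped, the flag is still false
```

After the simulated retry, every later duplicate is accepted because the flag stays false for the rest of the run.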

A better solution could be something like:

if (options.retries) {
  setTimeout(function() {
    options.retries--;
    // Disable duplicate checking only while re-queueing the retry, then restore it.
    const skipDuplicates = self.options.skipDuplicates;
    self.options.skipDuplicates = false;
    self.queue(options);
    self.options.skipDuplicates = skipDuplicates;
    options.release();
  }, options.retryTimeout);
} else {
  options.callback(error, {options: options}, options.release);
}

If the .queue method can throw, then something like:

if (options.retries) {
  setTimeout(function() {
    options.retries--;
    // Remember the current setting so it is restored even if queue() throws.
    const skipDuplicates = self.options.skipDuplicates;
    try {
      self.options.skipDuplicates = false;
      self.queue(options);
    } finally {
      self.options.skipDuplicates = skipDuplicates;
    }
    options.release();
  }, options.retryTimeout);
} else {
  options.callback(error, {options: options}, options.release);
}
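
The save-and-restore pattern above could also be factored into a small helper. This is only a sketch: withOption is a hypothetical name, not part of the library, and it assumes the wrapped call reads the option synchronously (if the option is read later, in an asynchronous step, the restore would happen too early).

```javascript
// Hypothetical helper: temporarily override one option and always restore it.
function withOption(options, key, value, fn) {
  const previous = options[key];
  options[key] = value;
  try {
    return fn();
  } finally {
    // Runs whether fn() returns normally or throws.
    options[key] = previous;
  }
}

const crawlerOptions = { skipDuplicates: true };

// Normal case: the flag is false only while the callback runs.
let flagDuringCall;
withOption(crawlerOptions, "skipDuplicates", false, () => {
  flagDuringCall = crawlerOptions.skipDuplicates;
});

// Throwing case: the flag is still restored before the error propagates.
let threw = false;
try {
  withOption(crawlerOptions, "skipDuplicates", false, () => {
    throw new Error("simulated queue failure");
  });
} catch (err) {
  threw = true;
}
```

In the retry branch this would wrap the self.queue(options) call, so the original value of skipDuplicates comes back in both the normal and the throwing case.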

mike442144 (Collaborator) commented

Thanks. This problem has already been found and fixed in the master branch, and the fix will be included in the next patch. If you're interested in the solution, feel free to check out the source code.
