TFIDF content matching should check inter-scrape #97

Open
PaulMcInnis opened this issue Sep 12, 2020 · 2 comments

Description

Currently we remove duplicates everywhere, but duplicates by description (TFIDF) are only removed between the masterlist and the scrape data.

We should allow the masterlist to be content-matched against itself.
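
As a rough illustration of what a self-match could look like, here is a minimal sketch using only the standard library. It is not the project's implementation: the `description` field name, the `0.75` threshold, and the helper names `tfidf_vectors`, `cosine`, and `self_duplicates` are all assumptions made for the example, and the whitespace tokenizer is deliberately naive.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF so terms found in every document keep a nonzero weight.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def self_duplicates(masterlist, threshold=0.75):
    """Return (kept_key, duplicate_key) pairs found within a single dict."""
    keys = list(masterlist)
    # "description" is a hypothetical field name for the job text.
    vectors = tfidf_vectors([masterlist[k]["description"] for k in keys])
    removed = set()
    pairs = []
    for i, j in combinations(range(len(keys)), 2):
        # Rows already flagged as duplicates are not compared again.
        if keys[i] in removed or keys[j] in removed:
            continue
        if cosine(vectors[i], vectors[j]) >= threshold:
            removed.add(keys[j])
            pairs.append((keys[i], keys[j]))
    return pairs


# Two rows with the same description but different keys, plus an unrelated row.
masterlist = {
    "a1": {"description": "Senior Python developer building data pipelines"},
    "b2": {"description": "Senior Python developer building data pipelines"},
    "c3": {"description": "Registered nurse for night shift in a hospital ward"},
}
duplicate_pairs = self_duplicates(masterlist)
```

The pairwise loop is quadratic in the number of rows, which is probably fine for a sketch but may need a vectorized similarity matrix for a large masterlist.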

Steps to Reproduce

  1. scrape some jobs to .pkl
  2. copy-paste a row a few times, only changing the key_id
  3. run again with --no-scrape

Expected behavior

We should be running TFIDF matching within the scrape data and within the master CSV, not only between the two.

Actual behavior

Only duplicates in the incoming dict are identified, based on the master CSV.

Environment

  • Build: 3.0.0

PaulMcInnis commented Sep 12, 2020

Thinking about this... could we simply identify matches in the backwards direction after we perform the matching in the forwards direction? (Currently it's incoming job vs master dict, but we want master job vs master dict.)
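
One possible reading of the forwards-then-backwards idea, sketched under assumptions: `forward_pass`, `backward_pass`, and `token_overlap` are hypothetical names, the dicts map a key to plain description text, and the Jaccard word overlap is only a stand-in for the real TFIDF similarity so the example stays self-contained.

```python
def token_overlap(a, b):
    """Jaccard overlap of word sets; a stand-in for TFIDF cosine similarity."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def forward_pass(incoming, master, similarity, threshold=0.75):
    """Incoming job vs master dict: keys of incoming jobs that match a master job."""
    return [
        key for key, text in incoming.items()
        if any(similarity(text, m_text) >= threshold for m_text in master.values())
    ]


def backward_pass(master, similarity, threshold=0.75):
    """Master job vs master dict: keys of master jobs that match an earlier kept job."""
    kept, duplicates = {}, []
    for key, text in master.items():
        if any(similarity(text, k_text) >= threshold for k_text in kept.values()):
            duplicates.append(key)
        else:
            # The first occurrence of each description is the one that stays.
            kept[key] = text
    return duplicates


master = {
    "m1": "python developer data pipelines",
    "m2": "python developer data pipelines",  # copy of m1 with a new key
    "m3": "night shift nurse",
}
incoming = {
    "n1": "night shift nurse",
    "n2": "forklift operator warehouse",
}

incoming_duplicates = forward_pass(incoming, master, token_overlap)
master_duplicates = backward_pass(master, token_overlap)
```

Running the backward pass after the forward pass means incoming jobs are still filtered against the full master dict before any master rows are dropped; whether that ordering matters in practice is an open question.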

@PaulMcInnis

Accidentally tapped close...

@PaulMcInnis PaulMcInnis reopened this Sep 12, 2020
@PaulMcInnis PaulMcInnis added this to To do in New Features Sep 13, 2020
@PaulMcInnis PaulMcInnis removed this from To do in New Features Sep 13, 2020
@PaulMcInnis PaulMcInnis added this to Low priority in Bug Triage Sep 13, 2020
@PaulMcInnis PaulMcInnis added this to the 4.0 milestone Nov 25, 2021