TFIDF content matching should check inter-scrape #97
Comments
Thinking about this... could we simply identify matches in the backward direction after we perform the matching in the forward direction? (Currently it's incoming job vs. master dict, but we want master job vs. master dict.)
Accidentally tapped close...
Description
Currently we remove duplicates everywhere, but we only remove duplicates by description (TFIDF) between the masterlist and the scrape data.
We should also allow the masterlist to perform a content match against itself.
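A minimal sketch of what a self-match could look like, assuming jobs are held as a dict of ID to description text. The function names, the `threshold` value, and the whitespace tokenizer are all hypothetical and are not taken from the project's code; it uses a hand-rolled, stdlib-only TF-IDF so it runs on its own, whereas the real implementation may use a library vectorizer.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # Raw term count times a smoothed inverse document frequency.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_self_duplicates(jobs, threshold=0.75):
    """Compare every job in one collection against every other job in it.

    jobs: dict mapping job ID -> description (hypothetical structure).
    Returns a list of (kept_id, duplicate_id) pairs, where the earlier
    entry in the dict is treated as the one to keep.
    """
    ids = list(jobs)
    vectors = tfidf_vectors([jobs[job_id] for job_id in ids])
    duplicates = []
    # combinations() visits each unordered pair once, so a job is never
    # compared with itself and no pair is reported twice.
    for (i, vec_a), (j, vec_b) in combinations(enumerate(vectors), 2):
        if cosine(vec_a, vec_b) >= threshold:
            duplicates.append((ids[i], ids[j]))
    return duplicates
```

The same function could be pointed at either the masterlist or a fresh scrape, since it only needs a single collection. Note that the pairwise loop is quadratic in the number of jobs, which may matter for a large masterlist.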
Steps to Reproduce
Expected behavior
TFIDF matching should run within the scrape data and within the master CSV, not only between the two.
Actual behavior
Only duplicates in the incoming dict are identified, based on the master CSV.
Environment