TFIDF content matching should check inter-scrape #97

PaulMcInnis · 2020-09-12T02:25:35Z

Description

Currently we remove duplicates everywhere, but duplicates by description (TFIDF) are only removed between the masterlist and all scrape data.

We should allow the masterlist to perform a content match against itself.

Steps to Reproduce

1. Scrape some jobs to .pkl
2. Copy-paste a row a few times, only changing the key_id
3. Run again with --no-scrape

Expected behavior

We should be running TFIDF both within the scrape data (inter-scrape) and within the master CSV (inter-master).

Actual behavior

Only duplicates in the incoming dict are identified, based on the master CSV.

Environment

Build: 3.0.0

PaulMcInnis · 2020-09-12T19:32:50Z

Thinking about this... could we simply identify matches in the backwards direction after we perform the matching in the forwards direction? (Currently it's incoming job vs master dict, but we also want master job vs master dict.)
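
A rough sketch of that two-pass idea, assuming jobs are held as dicts mapping key_id to description text. Everything here is illustrative rather than the project's actual code: the function names, the 0.75 threshold, and the hand-rolled TF-IDF (standard library only) are assumptions made so the example runs on its own.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict of term -> weight) per text."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    # Document frequency: the number of texts each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF, so a term found in every text keeps a non-zero weight.
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def find_duplicates(master, incoming, threshold=0.75):
    """Return (duplicate_key, kept_key) pairs from both matching passes.

    master and incoming are hypothetical dicts of key_id -> description.
    """
    master_keys, incoming_keys = list(master), list(incoming)
    texts = [master[k] for k in master_keys] + [incoming[k] for k in incoming_keys]
    # Fit on all descriptions at once so both passes share the same weights.
    vectors = tfidf_vectors(texts)
    master_vecs = vectors[:len(master_keys)]
    incoming_vecs = vectors[len(master_keys):]

    duplicates = []

    # Forward pass (current behaviour): incoming job vs master dict.
    for key, vec in zip(incoming_keys, incoming_vecs):
        for m_key, m_vec in zip(master_keys, master_vecs):
            if cosine(vec, m_vec) >= threshold:
                duplicates.append((key, m_key))
                break

    # Backward pass (proposed): master job vs master dict, keeping the
    # earliest entry of each matching pair.
    dropped = set()
    for (i, key_a), (j, key_b) in combinations(enumerate(master_keys), 2):
        if key_a in dropped or key_b in dropped:
            continue
        if cosine(master_vecs[i], master_vecs[j]) >= threshold:
            duplicates.append((key_b, key_a))
            dropped.add(key_b)

    return duplicates
```

With the reproduction steps above, a master row copy-pasted under a new key_id would be reported by the backward pass, while the forward pass keeps flagging incoming jobs as it does today. The same pairwise loop could be pointed at the incoming dict to cover the inter-scrape case.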

PaulMcInnis · 2020-09-12T19:33:20Z

Accidentally tapped close...

PaulMcInnis added bug help wanted labels Sep 12, 2020

PaulMcInnis closed this as completed Sep 12, 2020

PaulMcInnis reopened this Sep 12, 2020

PaulMcInnis added this to To do in New Features Sep 13, 2020

PaulMcInnis removed this from To do in New Features Sep 13, 2020

PaulMcInnis added this to Low priority in Bug Triage Sep 13, 2020

PaulMcInnis added this to the 4.0 milestone Nov 25, 2021
