Replies: 2 comments
-
500 million is a pretty big linkage. In general, it's not theoretically valid to train on a sample of 1M, but you might find it gives you reasonably decent results anyway. You want to make sure that your sample of 1M has a decent number of matches in it, so rather than taking a completely random sample, you might want to take a sample that's more likely to include matches. For further advice on scaling see:
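To make that concrete, here's a rough sketch of what that could look like with the Spark backend: sample whole blocking-key groups (so duplicates of the same entity tend to land in the sample together), train on the ~1M sample, then re-use the trained model against the full table. Column names, bucket counts, rules and thresholds are placeholders, and some method names differ slightly between Splink versions, so treat this as a shape rather than a recipe:

```python
from pyspark.sql import functions as F
from splink.spark.linker import SparkLinker
import splink.spark.comparison_library as cl

# df_full: your 500M-row Spark DataFrame.
# Sample whole blocking-key groups rather than individual rows, so that
# duplicates of the same entity tend to land in the sample together
# ("postcode" is just a placeholder blocking key).
sample_df = (
    df_full
    .withColumn("_bucket", F.abs(F.hash("postcode")) % 500)
    .filter(F.col("_bucket") == 0)   # keeps ~1/500 of the keys, roughly 1M rows
    .drop("_bucket")
)

# Minimal placeholder settings; in practice re-use the settings from your POC
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.postcode = r.postcode"],
    "comparisons": [
        cl.exact_match("dob"),
        cl.levenshtein_at_thresholds("surname", 2),
    ],
}

# Estimate the model parameters on the sample only
linker = SparkLinker(sample_df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e7)
linker.estimate_parameters_using_expectation_maximisation("l.surname = r.surname")
linker.save_model_to_json("trained_model.json", overwrite=True)

# Re-use the trained parameters to score the full 500M rows
linker_full = SparkLinker(df_full, settings_dict="trained_model.json")
predictions = linker_full.predict(threshold_match_probability=0.9)
predictions_spark = predictions.as_spark_dataframe()
```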
-
Hi, I'm not sure I understand the first part of your answer.
-
Hi, I'm testing Splink for a single use case: deduping a dataset of 500M rows that is updated daily.
The pipelines run in Databricks using heavy Spark jobs, and I'm concerned about the scale.
I did a POC with Spark on 1M records and it looks good once I configured blocking rules around the data skew, but I'm wondering whether the same setup will work on 500M. A rough sketch of the POC config is below.
Do I really need to train the model on the whole dataset, or can I train on just 1M rows and predict on the entire 500M?
Also, are there any quick-win configuration tips for this kind of scale?
Much appreciated
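For context, the POC config has roughly this shape (column names, blocking rules and comparisons are simplified placeholders, not the real ones):

```python
from splink.spark.linker import SparkLinker
import splink.spark.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    # Several narrow blocking rules instead of one rule on a heavily skewed
    # column, to keep the number of candidate pairs manageable
    "blocking_rules_to_generate_predictions": [
        "l.postcode = r.postcode and l.surname = r.surname",
        "l.dob = r.dob and l.first_name = r.first_name",
    ],
    "comparisons": [
        cl.exact_match("first_name"),
        cl.levenshtein_at_thresholds("surname", 2),
        cl.exact_match("dob"),
    ],
}

# df_1m: the 1M-row Spark DataFrame used in the POC
linker = SparkLinker(df_1m, settings)

# Sanity-check how many candidate pairs a blocking rule generates before scaling up
n_pairs = linker.count_num_comparisons_from_blocking_rule(
    "l.postcode = r.postcode and l.surname = r.surname"
)
```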