Replies: 2 comments
-
500 million is a pretty big linkage. In general, it's not theoretically valid to train on a sample of 1M, but you might find it gives you reasonably decent results anyway. You want to make sure that your sample of 1M has a decent number of matches in it, so rather than taking a completely random sample, you might want to take a sample that's more likely to include matches. For further advice on scaling see:
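To make that concrete, here's a rough sketch of what that could look like with the Spark backend: sample whole blocking-key groups (so duplicates of the same entity tend to land in the sample together), train on the ~1M sample, then re-use the trained model against the full table. Column names, bucket counts, rules and thresholds are placeholders, and some method names differ slightly between Splink versions, so treat this as a shape rather than a recipe:

```python
from pyspark.sql import functions as F
from splink.spark.linker import SparkLinker
import splink.spark.comparison_library as cl

# df_full: your 500M-row Spark DataFrame.
# Sample whole blocking-key groups rather than individual rows, so that
# duplicates of the same entity tend to land in the sample together
# ("postcode" is just a placeholder blocking key).
sample_df = (
    df_full
    .withColumn("_bucket", F.abs(F.hash("postcode")) % 500)
    .filter(F.col("_bucket") == 0)   # keeps ~1/500 of the keys, roughly 1M rows
    .drop("_bucket")
)

# Minimal placeholder settings; in practice re-use the settings from your POC
settings = {
    "link_type": "dedupe_only",
    "blocking_rules_to_generate_predictions": ["l.postcode = r.postcode"],
    "comparisons": [
        cl.exact_match("dob"),
        cl.levenshtein_at_thresholds("surname", 2),
    ],
}

# Estimate the model parameters on the sample only
linker = SparkLinker(sample_df, settings)
linker.estimate_u_using_random_sampling(max_pairs=1e7)
linker.estimate_parameters_using_expectation_maximisation("l.surname = r.surname")
linker.save_model_to_json("trained_model.json", overwrite=True)

# Re-use the trained parameters to score the full 500M rows
linker_full = SparkLinker(df_full, settings_dict="trained_model.json")
predictions = linker_full.predict(threshold_match_probability=0.9)
predictions_spark = predictions.as_spark_dataframe()
```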
-
Hi, I'm not sure I understand the first part of your answer.
-
Hi, I'm testing Splink for a single use case: deduping a dataset of 500M rows that is updated daily.
The pipelines run in Databricks using heavy Spark jobs, and I'm concerned about the scale.
I did a POC with Spark on 1M records and it looks good once I configured blocking rules around the data skew, but I'm wondering whether the same setup will work on 500M. A rough sketch of the POC config is below.
Do I really need to train the model on the whole dataset, or can I train on just 1M rows and predict on the entire 500M?
Also, are there any quick-win configuration tips for this kind of scale?
Much appreciated
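For context, the POC config has roughly this shape (column names, blocking rules and comparisons are simplified placeholders, not the real ones):

```python
from splink.spark.linker import SparkLinker
import splink.spark.comparison_library as cl

settings = {
    "link_type": "dedupe_only",
    # Several narrow blocking rules instead of one rule on a heavily skewed
    # column, to keep the number of candidate pairs manageable
    "blocking_rules_to_generate_predictions": [
        "l.postcode = r.postcode and l.surname = r.surname",
        "l.dob = r.dob and l.first_name = r.first_name",
    ],
    "comparisons": [
        cl.exact_match("first_name"),
        cl.levenshtein_at_thresholds("surname", 2),
        cl.exact_match("dob"),
    ],
}

# df_1m: the 1M-row Spark DataFrame used in the POC
linker = SparkLinker(df_1m, settings)

# Sanity-check how many candidate pairs a blocking rule generates before scaling up
n_pairs = linker.count_num_comparisons_from_blocking_rule(
    "l.postcode = r.postcode and l.surname = r.surname"
)
```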