Length mismatch at #202
When trying to slice candidates (pd.MultiIndex) and perform comparer.compute(), sometimes among my partitions there will be
I have checked that the shapes are all correct, so the problem seems to lie in the computation itself. I assume that the additional 12 rows are misplaced columns (10 comparison columns + 2 index columns).
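One way to test that assumption is to look at the column count of every partition before concatenating. `pd.concat` takes the union of columns by default, so a single partition with extra columns would be enough to make a later `final_result.columns = [...]` assignment fail with a length mismatch. The following is a minimal sketch, assuming each partition exposes a pandas-style `.shape`; the helper name `find_odd_partitions` is hypothetical, and the stand-in objects only mimic DataFrame shapes so that the sketch runs without real data:

```python
from types import SimpleNamespace


def find_odd_partitions(partitions, expected_columns):
    """Return (position, shape) for each partition whose column count is unexpected."""
    odd = []
    for position, part in enumerate(partitions):
        n_rows, n_cols = part.shape  # pandas DataFrames expose .shape the same way
        if n_cols != expected_columns:
            odd.append((position, (n_rows, n_cols)))
    return odd


# Stand-ins that only carry a .shape (hypothetical sizes, not real results).
parts = [
    SimpleNamespace(shape=(1000, 10)),
    SimpleNamespace(shape=(1000, 22)),  # hypothetical partition with extra columns
    SimpleNamespace(shape=(500, 10)),
]
print(find_odd_partitions(parts, expected_columns=10))
```

Passing the list of computed partition DataFrames instead of `parts` would point at the partitions worth inspecting by hand.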
`import os
import pandas as pd
import recordlinkage
from recordlinkage import Compare
# Perform the actual comparisons and store the result
print("Performing comparisons...")

def compute_and_save_partition(comparer, candidates, df1, df2, start, end, partition_path):
    ...  # body not included in the post

def parallel_compute_and_save(comparer, candidates, df1, df2, output_dir, partition=1000000):
    result = []
    total_candidates = len(candidates)
    for i in range(0, total_candidates, partition):
        end = min(i + partition, total_candidates)
        partition_path = os.path.join(output_dir, f'partition_{i}_{end}.parquet')
        result.append(compute_and_save_partition(comparer, candidates, df1, df2, i, end, partition_path))
    return result

# Set up the output directory
name = "your_dataset_name"  # Replace with your dataset name
output_dir = f"../Output/temp/{name}_compare"
os.makedirs(output_dir, exist_ok=True)

# Replace _1861[cols_to_compare] and _1851[cols_to_compare] with your dataframes and columns to compare
final_result = parallel_compute_and_save(comparer, candidates, _1861[cols_to_compare], _1851[cols_to_compare], output_dir)

# Update the columns of the result dataframe
final_result = pd.concat(final_result)
final_result.columns = ['pname', 'oname', 'sname', 'pname_soundex', 'sname_soundex', 'pname_metaphone', 'sname_metaphone', 'address', 'sname_pop_metaphone', 'dateofbirth']
final_result = final_result.reset_index(drop=True)
`
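The slicing scheme used in the loop above can be isolated and checked on its own. This is a minimal sketch, assuming the partitions are meant to be contiguous and non-overlapping; `partition_bounds` is a hypothetical helper, not part of recordlinkage, and the totals are made-up numbers:

```python
def partition_bounds(total, size):
    """Yield (start, end) pairs that cover range(total) in chunks of at most `size`."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, total, size):
        yield start, min(start + size, total)


# Hypothetical candidate count, chosen so the last partition is shorter.
bounds = list(partition_bounds(total=2_500_000, size=1_000_000))
covered = sum(end - start for start, end in bounds)
assert covered == 2_500_000  # the slice lengths add up to the total
print(bounds)
```

With a real `pd.MultiIndex`, `candidates[start:end]` would take the matching slice. Comparing `len(candidates[start:end])` with the number of rows returned by `comparer.compute()` for that slice is one way to see where a mismatch first appears.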