Recordlinkage, ValueError: index of DataFrame is not unique #157

Open
lsun907 opened this issue Feb 14, 2021 · 3 comments

lsun907 commented Feb 14, 2021

Hi
I am linking two datasets, each of which has a unique ID column as its identifier. After reading the two datasets into pandas DataFrames, I set those IDs as the indexes so that, after classification, I can tell which records from each dataset matched. But with the IDs set as indexes, I get an error in the blocking step:

ValueError: index of DataFrame is not unique

I am sure neither ID column has duplicates. Here is the relevant code. Can you please help me figure out what the problem is?


import pandas as pd
import recordlinkage
firm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\firmname.csv", index_col='ID_EMPLOYER', encoding='latin-1')
ccm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\comphist.csv", index_col='ID_HCONM', encoding='latin-1')
indexer = recordlinkage.Index()
indexer.block(left_on='EMPLOYER_STATE', right_on='HSTATE')
candidates = indexer.index(firm_name, ccm_name)


Then I got this error message:
ValueError: index of DataFrame is not unique

Can anyone help please?

lsun907 commented Feb 14, 2021

By the way, the ID in each dataset is a sequence of numbers from 1 to N (the total number of observations in the dataset)
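One way to double-check that claim is to ask pandas directly whether the index is unique and, if not, which labels repeat. This is a minimal sketch on a made-up toy frame standing in for firm_name; the helper name report_index_duplicates and the sample values are hypothetical, and the real CSV files are not reproduced here.

```python
import pandas as pd


def report_index_duplicates(df: pd.DataFrame) -> pd.DataFrame:
    """Return the rows whose index label occurs more than once."""
    # keep=False flags every occurrence of a repeated label,
    # not only the second and later ones.
    return df[df.index.duplicated(keep=False)]


# Toy stand-in for firm_name (values invented for illustration).
toy = pd.DataFrame(
    {"EMPLOYER_STATE": ["CA", "NY", "TX"]},
    index=pd.Index([1, 2, 2], name="ID_EMPLOYER"),
)

# False here means the index would fail a uniqueness check.
print(toy.index.is_unique)

# Show the offending rows so they can be inspected in the source file.
print(report_index_duplicates(toy))
```

Running the same two checks on firm_name and ccm_name right after read_csv would show whether the ID columns really are free of repeats. Missing IDs are worth a look too, since several missing values in the index column also count as duplicates of each other.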

titipata commented Mar 4, 2021

@lsun907 I had a similar issue and solved it by reindexing my DataFrame, either with df.index = np.arange(len(df)) or with data_df.reset_index(col_level=1, drop=True, inplace=True). Someone might have a better solution to this.
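Since the original goal was to trace matches back to the IDs, a possible variant of this workaround keeps the IDs as ordinary columns and looks them up from the positional index afterwards. The sketch below uses only pandas. It assumes the files are read without index_col, and the toy values and the hand-written pairs MultiIndex are hypothetical stand-ins for the candidate pairs that the indexing step would return.

```python
import pandas as pd

# Toy stand-ins for the two datasets, with IDs kept as regular columns.
# Note the repeated ID 20, which would break a uniqueness check on the index.
firm = pd.DataFrame(
    {"ID_EMPLOYER": [10, 20, 20], "EMPLOYER_STATE": ["CA", "NY", "NY"]}
)
ccm = pd.DataFrame({"ID_HCONM": [7, 8], "HSTATE": ["NY", "CA"]})

# A positional index 0..N-1 is unique by construction.
firm = firm.reset_index(drop=True)
ccm = ccm.reset_index(drop=True)

# Hypothetical candidate pairs: (row label in firm, row label in ccm).
pairs = pd.MultiIndex.from_tuples([(0, 1), (1, 0), (2, 0)])

# Translate the positional labels back into the original IDs.
matched_ids = pd.DataFrame(
    {
        "ID_EMPLOYER": firm.loc[pairs.get_level_values(0), "ID_EMPLOYER"].to_numpy(),
        "ID_HCONM": ccm.loc[pairs.get_level_values(1), "ID_HCONM"].to_numpy(),
    }
)
print(matched_ids)
```

This keeps the linkage step working on a guaranteed-unique index while the ID columns remain available for reporting which records matched.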

@ethan-huffington

df.index = np.arange(len(df)) worked for me. Thanks!
