Recordlinkage, ValueError: index of DataFrame is not unique #157

lsun907 · 2021-02-14T21:20:18Z

Hi,
I am linking two datasets. Both contain unique IDs as identifiers. After reading the two datasets into pandas DataFrames, I set those IDs as the indexes so that, after classification, I can tell which records from each dataset matched. But after setting those IDs as indexes, I get an error in the blocking step:

ValueError: index of DataFrame is not unique

I am sure neither ID column has duplicates. Here is the relevant code. Can you please help me figure out what the problem is?

import pandas as pd
import recordlinkage
firm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\firmname.csv", index_col='ID_EMPLOYER', encoding='latin-1')
ccm_name = pd.read_csv(r"C:\Users\XXX\Dropbox\YYY\comphist.csv", index_col='ID_HCONM', encoding='latin-1')
indexer = recordlinkage.Index()
indexer.block(left_on='EMPLOYER_STATE', right_on='HSTATE')
candidates = indexer.index(firm_name, ccm_name)

Then I got this error message:
ValueError: index of DataFrame is not unique

Can anyone help please?

lsun907 · 2021-02-14T21:22:04Z

By the way, the ID in each dataset is a sequence of numbers from 1 to N (the total number of observations in the dataset)
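
Since the error message complains about a non-unique index, one way to double-check the claim above is to ask pandas directly before the blocking step. The snippet below is only a sketch: the small DataFrame is a made-up stand-in for firm_name (the real data comes from firmname.csv), with a duplicate ID planted on purpose.

```python
import pandas as pd

# Hypothetical stand-in for firm_name; the duplicate ID 2 is deliberate.
firm_name = pd.DataFrame(
    {"EMPLOYER_STATE": ["CA", "NY", "CA"]},
    index=pd.Index([1, 2, 2], name="ID_EMPLOYER"),
)

# True only if every index label occurs exactly once.
is_unique = firm_name.index.is_unique

# Index labels that occur more than once (each listed a single time).
dupes = firm_name.index[firm_name.index.duplicated(keep=False)].unique()

print("index unique:", is_unique)
print("duplicated IDs:", list(dupes))
```

If is_unique comes back False for either DataFrame, the listed IDs are the rows to inspect in the CSV.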

titipata · 2021-03-04T12:39:10Z

@lsun907 I had a similar issue. I solved it by reindexing my dataframe with df.index = np.arange(len(df)), or by calling data_df.reset_index(col_level=1, drop=True, inplace=True). Someone might have a better solution to this.
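
A possible variation on this workaround, sketched under the assumption that the original IDs are still needed to identify matches afterwards: calling reset_index() without drop=True gives a fresh 0..N-1 index and keeps the old ID as an ordinary column. The DataFrame and the row positions below are hypothetical examples, not data from the issue.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for a DataFrame whose ID index contains a duplicate.
df = pd.DataFrame(
    {"EMPLOYER_STATE": ["CA", "NY", "CA"]},
    index=pd.Index([1, 2, 2], name="ID_EMPLOYER"),
)

# Option 1 (as suggested above): overwrite the index; the old IDs are discarded.
overwritten = df.copy()
overwritten.index = np.arange(len(overwritten))

# Option 2: move the old index into a column and get a fresh RangeIndex.
fixed = df.reset_index()

print(fixed.index.is_unique)

# After linking, positions in the new index can be mapped back to the IDs.
# The positions [0, 2] are made up for illustration.
original_ids = fixed.loc[[0, 2], "ID_EMPLOYER"].tolist()
print(original_ids)
```

With option 2, the unique index can be passed to the indexer while the ID_EMPLOYER column remains available for looking up which records matched.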

ethan-huffington · 2022-03-24T00:18:00Z

df.index = np.arange(len(df)) worked for me. Thanks!
