representr

Create representative records after record linkage for use in downstream tasks. Multiple methods for creating the records are provided, including two methods based on the posterior distribution of the linkage resulting from a Bayesian analysis.

Installation

devtools::install_github("cleanzr/representr", build_vignettes = TRUE)

Citation

This package implements the methods introduced in the following paper:

Andee Kaplan, Brenda Betancourt & Rebecca C. Steorts (2022) A Practical Approach to Proper Inference with Linked Data, The American Statistician, DOI: 10.1080/00031305.2022.2041482

Background

Record linkage (also called entity resolution or de-duplication) joins multiple databases and removes duplicate entities. Beyond removing duplicates, many researchers are interested in performing inference, prediction, or other post-linkage analysis on the linked data (e.g., regression or capture-recapture), which we call the downstream task. Depending on the downstream task, one may wish to find the most representative record before performing the post-linkage analysis. For example, when records linked to the same entity disagree on the value of a feature used in the downstream task, which value should be used? This is where representr comes in.
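To make the question above concrete, here is a minimal sketch in base R of two simple ways to pick one record per linked entity. It does not use the representr API and is not necessarily how the package implements its methods. The toy data, the cluster vector, and the helper names random_prototype and composite_prototype are all hypothetical and exist only for illustration.

```r
# Toy data: five records. `cluster` is a hypothetical linkage estimate;
# records that share a cluster id are treated as the same entity.
records <- data.frame(
  name   = c("Ann", "Anne", "Bob", "Bob", "Cat"),
  income = c(50, 54, 40, 40, 70),
  stringsAsFactors = FALSE
)
cluster <- c(1, 1, 2, 2, 3)

# Strategy 1 (hypothetical helper): keep one record chosen uniformly
# at random from each cluster.
random_prototype <- function(data, cluster) {
  groups <- split(seq_len(nrow(data)), cluster)
  keep <- vapply(
    groups,
    function(idx) idx[sample.int(length(idx), 1)],
    integer(1)
  )
  data[keep, , drop = FALSE]
}

# Strategy 2 (hypothetical helper): build a composite record per cluster,
# using the mean for numeric columns and the most frequent value otherwise.
# Ties go to the first value in sorted order.
composite_prototype <- function(data, cluster) {
  groups <- split(data, cluster)
  rows <- lapply(groups, function(g) {
    as.data.frame(
      lapply(g, function(col) {
        if (is.numeric(col)) mean(col) else names(which.max(table(col)))
      }),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, rows)
}

set.seed(1)
random_prototype(records, cluster)
composite_prototype(records, cluster)
```

Both helpers return one row per entity, but they can give different answers to "which value should be used": the first keeps an observed record as-is, while the second may produce a record that never appeared in the data.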

Main functions

The two main functions in representr are represent and pp_weights, which perform pointwise and fully Bayesian prototyping, respectively. Additionally, the function emp_kl aids in the evaluation of prototyping methods by estimating an empirical KL divergence. To read more about the specific prototyping functions available, see the help pages.

help(representr)

For more extensive documentation of the use of this package, please see the vignette.

vignette("representr")

Acknowledgments

This work was partially supported by the National Science Foundation through NSF-1652431 and NSF-1534412 and the Laboratory for Analytic Sciences at NC State University.
