UKBBcleanR: Prepare electronic medical record data from the UK Biobank for time-to-event analyses

License: MIT

Date repository last updated: January 26, 2023

Overview

The UKBBcleanR package contains an R function that prepares time-to-event data from raw UK Biobank electronic medical record data. The prepared data are intended for analyses of cancer outcomes, but the function could be modified for other health outcomes. This package is not available on CRAN.

Installation

To install the development version from GitHub:

devtools::install_github("machiela-lab/UKBBcleanR")

Available function(s)

Function | Description
tte | Prepares time-to-event data from raw UK Biobank electronic medical record data.

The repository also includes the resources and code to create the project hex sticker.

Authors

  • Alexander Depaulis - Integrative Tumor Epidemiology Branch (ITEB), Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NCI), National Institutes of Health (NIH), Rockville, Maryland (MD), USA - GitHub

  • Derek W. Brown - ITEB, DCEG, NCI, NIH, Rockville, MD, USA (original) - GitHub - ORCID

  • Aubrey K. Hubbard - ITEB, DCEG, NCI, NIH, Rockville, MD, USA - ORCID

See also the list of contributors who participated in this package, including:

  • Ian D. Buller - Social & Scientific Systems, Inc., a division of DLH Corporation, Silver Spring, Maryland (current) - Occupational and Environmental Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Rockville, Maryland (original) - GitHub - ORCID

  • Mitchell J. Machiela - ITEB, DCEG, NCI, NIH, Rockville, MD, USA - GitHub - ORCID

Getting Started

The tte function requires several raw UK Biobank variables to run correctly. A detailed list of the required variables is provided in the README_required_variables.txt file.

Data can be loaded into the tte function in two ways:

  • The user can use setwd() to set the working directory to the folder where the individual data sets are stored.

    • NOTE: These individual data sets must contain the specific variables and have names which match the README_required_variables.txt file. Example data is available within the package.
  • The user can generate a single data set containing all the variables of interest. This data set can then be loaded into the tte function using the combined_data argument. Example data is available within the package.
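
For the second option, one way to assemble a single data set is to merge the individual data sets on the participant identifier. The sketch below is only an illustration in base R: the three toy data frames, their column names (eid, date_of_assessment, cancer_icd10, date_of_death), and the helper name combine_by_id are hypothetical stand-ins, not the variables the package requires. The real variable names are listed in README_required_variables.txt.

```r
# Toy stand-ins for individual data sets; every name here is hypothetical.
baseline <- data.frame(
  eid = c(1, 2, 3),
  date_of_assessment = as.Date(c("2008-01-15", "2009-06-01", "2010-03-20"))
)
registry <- data.frame(
  eid = c(1, 3),
  cancer_icd10 = c("C911", "C900")
)
death <- data.frame(
  eid = c(2),
  date_of_death = as.Date("2015-11-30")
)

# Merge a list of data frames on a shared ID column. all = TRUE keeps every
# participant, filling NA where a data set has no record for them.
combine_by_id <- function(data_list, id = "eid") {
  Reduce(function(x, y) merge(x, y, by = id, all = TRUE), data_list)
}

combined <- combine_by_id(list(baseline, registry, death))
str(combined)
```

If the merged result holds the required variables under the expected names, it could then be passed to tte through the combined_data argument.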

Usage

# ------------------ #
# Necessary packages #
# ------------------ #

library(UKBBcleanR)

# -------- #
# Settings #
# -------- #

##### Input UKBBcleanR sample data

 # Use combined data set
 testdata <- as.data.frame(combined_data)
 
 # Set ICD-10 outcome of interest
 cancer_outcome <- c("C911") 
 
 # Set prevalent cancers to identify in data cleaning
 prevalent_cancers <- c("D37", "D38", "D39", "D40", "D41", "D42",
                        "D43", "D44", "D45", "D46", "D47", "D48") 
 
 # Set incident cancers to identify in data cleaning
 incident_cancers <- c("C900") 
 
# ------- #
# Run tte #
# ------- #

# Run without removing prevalent cancers from analysis
test1 <- tte(combined_data = testdata, 
             cancer_of_interest_ICD10 = cancer_outcome,
             prevalent_cancer_list = prevalent_cancers, 
             prevalent_C_cancers = TRUE, 
             incident_cancer_list = incident_cancers, 
             remove_prevalent_cancer = FALSE, 
             remove_self_reported_cancer = FALSE)
            
table(test1$case_control_cancer_ignore)  # tte outcome ignoring other incident cancers
table(test1$case_control_cancer_control) # tte outcome controlling for other incident cancers


# Run with removing prevalent cancers from analysis
test2 <- tte(combined_data = testdata, 
             cancer_of_interest_ICD10 = cancer_outcome,
             prevalent_cancer_list = prevalent_cancers, 
             prevalent_C_cancers = TRUE, 
             incident_cancer_list = incident_cancers, 
             remove_prevalent_cancer = TRUE, 
             remove_self_reported_cancer = TRUE)
table(test2$case_control_cancer_ignore)  # tte outcome ignoring other incident cancers
table(test2$case_control_cancer_control) # tte outcome controlling for other incident cancers
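
The ICD-10 vectors in the settings above are typed out by hand, which becomes error-prone for longer ranges. As a minimal sketch in base R, a contiguous range such as D37 to D48 can be generated with paste0(), and a simple pattern check can catch typos before the vectors are passed to tte. The helper name looks_like_icd10 and its pattern (one capital letter followed by two or three digits, with no decimal point, matching the style of the codes above) are assumptions made for illustration; the package is not known to provide or enforce such a check.

```r
# Generate D37 through D48 instead of typing each code.
prevalent_cancers <- paste0("D", 37:48)

# Hypothetical helper: TRUE for strings shaped like the codes used above,
# e.g. "D37" or "C911" (capital letter + 2-3 digits, no decimal point).
looks_like_icd10 <- function(codes) {
  grepl("^[A-Z][0-9]{2,3}$", codes)
}

cancer_outcome   <- c("C911")
incident_cancers <- c("C900")

# Stop early if any code does not match the assumed shape.
all_codes <- c(cancer_outcome, prevalent_cancers, incident_cancers)
stopifnot(all(looks_like_icd10(all_codes)))
```

A code written with a decimal point, such as "C91.1", would fail this check, which is a convenient reminder to match the undotted style used in the example.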

Vignette

We provide a vignette with a practical walk-through of the provided example data.

Funding

This package was developed while the first author was a participant in the 2022 National Institutes of Health Summer Internship Program in Biomedical Research, the second author was a postdoctoral fellow supported by the Cancer Prevention Fellowship Program at the National Cancer Institute (NCI), and the third author was a postdoctoral fellow in the NCI Division of Cancer Epidemiology and Genetics.

Acknowledgments

When citing this package for publication, please use the citation returned by:

citation("UKBBcleanR")
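
For reference managers, the same citation can be rendered as a BibTeX entry. The sketch below uses only the utils functions citation() and toBibtex(); the helper name bibtex_for is hypothetical, and the call for UKBBcleanR is guarded because it only works once the package is installed.

```r
# Hypothetical helper: return the BibTeX entry for an installed package.
bibtex_for <- function(pkg) {
  utils::toBibtex(utils::citation(pkg))
}

# Only query UKBBcleanR if it is installed; otherwise say so.
if (requireNamespace("UKBBcleanR", quietly = TRUE)) {
  print(bibtex_for("UKBBcleanR"))
} else {
  message("UKBBcleanR is not installed; install it before requesting its citation.")
}
```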

Questions? Feedback?

For questions about the package, please contact the maintainer, Dr. Derek Brown, or submit a new issue.