GEMS Data Generation

POC: Xiaodan Xu, Ph.D. ([email protected])

Latest update: 02/27/2024

Theme A: Geographic boundary

a1. collecting micro-geotype boundary

step 1: collecting and cleaning spatial crosswalk file for LODES8 (using 2020 census boundary)

input: downloaded state-by-state crosswalk from LEHD website (the R API doesn't work) https://lehd.ces.census.gov/data/

process: Clean up raw crosswalk files by state and create single lookup table for all states

output:spatial_boundary/CleanData/cleaned_lodes8_crosswalk.csv

step 2: Collecting Census 2020 Tiger line/boundary for census tract, county and CBSA

code:0_clean_boundaries.R

Input: queried 2020 Census boundary from API in R; Cleaned census crosswalk file downloaded here: https://www2.census.gov/geo/docs/maps-data/data/rel2020/tract/tab20_tract20_tract10_natl.txt

process:

load census tract geometry and land/water area (for microtypes)
load and combine county and CBSA boundary
load and clean spatial crosswalks (2020-2010 crosswalk, county-cbsa-tract crosswalk)

output:

spatial_boundary/CleanData/combined_tracts_{year}.geojson and csv
spatial_boundary/CleanData/combined_county_{year}.geojson and csv
spatial_boundary/CleanData/combined_geotype_unit_{year}.geojson and csv
spatial_boundary/CleanData/cleaned_lodes8_crosswalk_with_ID.csv

step 3: Collecting Zip code, city, county and census tract crosswalk

code:0_clean_boundaries.R

Input: queried HUD-USPS ZIP Code Crosswalk from API in Python;

process:

load HUD-USPS ZIP Code Crosswalk
writing output if success or return the error message if fail

output:

spatial_boundary/CleanData/ZIP_COUNTY_LOOKUP_2023.csv

Theme B: Demograhic characteristics

b1. Collecting ACS data at census tract level

code: 1_ACS_compile_tracts.R

Input: query 2021 ACS 5-Year estimates from API

process: collecting tract-level demographic variables for persons, households and housing units from ACS 5-year estimates using tidycensus (with ACS API)

output:

Demography/CleanData/acs_data_tracts_{date}.csv

Theme C: Collect demand related attributes

c1. clean and processing LEHD LODES8 data

step 1: collecting LEHD LODES 8 data

code:0_clean_lehd_2017.R

Input: query latest LEHD LODES8 data from API for all states at tract level (2021 for most states, with AK, AR, MS only has older data available)

process:

load and clean LODES8 Workplace Area Characteristics (WAC) at census tract level using LEHDR (R package for accessing LEHD data)
load and clean LODES8 Origin-Destination (OD) data at census tract level using LEHDR, include auxiliary data for cross-state commute

output:

Demand/CleanData/wac_tract_{year}.csv
Demand/CleanData/OD/*

step 2: calculate commute distance (using great circle distance between OD centroids)

code:generate_od_distance.py

Input: Demand/CleanData/OD/* and spatial_boundary/CleanData/combined_tracts_{year}.csv (2021 for most states, with AK, AR, MS only has older data available)

process:

Calculate euclidean distance between each OD pair
For intrazonal OD, using 1/3* sqrt(area) as a proxy for distance

output:

Demand/CleanData/OD_distance/*

Theme D: Collect land use attributes

D1. land use characteristics from NLCD data

step 1: collecting and cleaning NLCD data Code:process_NLCD_data.R

Input:

CONUS: Land_use/RawData/US_2020_nlcd_shapefiles_24mar2023/NLCD
Alaska: Land_use/RawData/emiss_shp2017/NLCD
Hawaii: Land_use/RawData/hi_hawaii_2010_ccap_hr_land_cover20150120

Processes:

For CONUS and AK, assign centroids of grid cell to each census tract, and aggregate areas by land use types in each census tract
For HI, generate grid cell (polygons) from pixels in raster image (island by island), and follow the similar process as CONUS and AK

Output:

Land_use/CleanData/tract_level_land_use_no_ak.csv
Land_use/CleanData/tract_level_land_use_ak.csv
Land_use/CleanData/HI_NLCD_2010_{island_names}.csv

step 2: Compile and impute NLCD data

Code:compile_nlcd_data.py

Input:

Land_use/CleanData/tract_level_land_use_no_ak.csv
Land_use/CleanData/tract_level_land_use_ak.csv
Land_use/CleanData/HI_NLCD_2010_{island_names}.csv
spatial_boundary/CleanData/combined_tracts_{year}.geojson

Processes:

Unify land use type names, e.g., map different types of imperious developed land to variable 'imperious developed'
Combine land use data for all states
Calculate fractions of imperious land and developed open space by census tract, and impute missing value using values from nearest census tracts

Output: Land_use/CleanData/processed_NLCD_data.csv (for all land use types, such as forest, agriculture, developed) Land_use/CleanData/imputed_NLCD_data_dev_only.csv

D2. uban area definition from U.S.Census Bureau

Code:1_clean_urban_areas_definitions.R

Input:

Spatial crosswalk between urban area and census block: spatial_boundary/RawData/2020_UA_BLOCKS.txt'

Processes:

Merge census urban area (UA) boundary with all census tract
Generate urban/rural indicators based on merge results (1 - if within UA, 0 - otherwise)
Assign urban area definition used in V1 typology for validation (majority of the classification results are the same)

Output: spatial_boundary/CleanData/urban_divisions_2021.csv

Theme E: Network generation

e1. processing OSMNX data at census tract level

code: generate_OSMNX_metrics.py

Input: queried 2023 OSM metrics from API in R and spatial boundary from step a1 above

load network statistics from OSMNX
if no statistics found, fill in tract-level attributes with NA

output:

Network/CleanData/OSMNX/*

Theme F: Spatial clustering

F1. Develop socio-economic microtype

step 1: compile demand attributes at census tract level

code: demand_variable_generation.py

Input: compile inputs from various themes, including:

load spatial boundary from Theme A: spatial_boundary/CleanData/combined_tracts_{year}.geojson and csv
load population data from Theme B: Demography/CleanData/acs_data_tracts_{date}.csv
load lehd wac data from Theme C: Demand/CleanData/wac_tract_{year}.csv AND Demand/CleanData/OD_distance/*
Load NLCD land use data from Theme D: Land_use/CleanData/imputed_NLCD_data_dev_only.csv
Load urban area definition from Theme D: spatial_boundary/CleanData/urban_divisions_2021.csv

Processes:

Load all data sources needed for socio-economic clusters
Filter out census tracts with only water surface (no land)
Generate tract-level variables using 2020 census boundary and count number of missing values for each variable generated

output:

Demand/CleanData/microtype_inputs_demand_V2.csv

step 2: Develop and validate socio-economic microtype

code: demand_microtype_cluster.R

dependency:initialization.R and functions.R

Input:

Demand/CleanData/microtype_inputs_demand.csv
spatial_boundary/CleanData/cleaned_lodes8_crosswalk_with_ID.csv

Processes:

load data, performing cleaning, scaling, imputation and then split by rural/urban boundary
spatial cluster for rural and urban respectively
validate spatial clusters through visualization

Output:

Demand/Results/microtypes_inputs_demand_scaled.csv
Demand/Results/clustering_outputs_with_raw_data.csv

F2. Develop geotype

step 1: compile attributes at CBSA/county level code: geotype_variable_generation.py

Input: compile inputs from various themes, including:

load spatial boundary from Theme A: spatial_boundary/CleanData/combined_geotype_unit_{year}.geojson and csv AND spatial_boundary/CleanData/combined_tracts_{year}.geojson and csv AND 'spatial_boundary/CleanData/cleaned_lodes8_crosswalk_with_ID.csv'
load lehd wac data from Theme C: Demand/CleanData/wac_tract_{year}.csv AND Demand/CleanData/OD_distance/*
Load NLCD land use data from Theme D: Land_use/CleanData/processed_NLCD_data.csv
Load compiled network attributes: Network/CleanData/network_microtype_metrics.csv
Load socio-economic typology results: Demand/Results/clustering_outputs_with_raw_data.csv

Processes:

Load all data sources needed for geotype clusters
Generate variables using 2020 census boundary and generate plots for each metric for checking

output:

Demand/CleanData/geotype_inputs.csv

Theme G: Accessibility and mode availability

G1. processing bike density at census tract level

code: process_bike_station.py

Input:

load BTS NTAD data: Network/RawData/BTS/Locations of Docked Bikeshare Stations by System and Year_20240306.geojson
load spatial boundary from Theme A: spatial_boundary/CleanData/combined_tracts_{year}.geojson and csv
load population data from Theme B: Demography/CleanData/acs_data_tracts_{date}.csv

Processes:

load bike station shapefile from NTAD and select type of analysis (1-using 2010 boundary for mode choice; 2- use 2020 boundary for GEMS input)
intersect bike station with census tract boundary and calculate density

output:

Network/CleanData/bike_availability_{year}.csv

G2. processing transit density at census tract level

code: process_transit_networks.py

Input:

load BTS NTAD data: Network/RawData/NTAD/National_Transit_Map_Routes.geojson
load spatial boundary from Theme A: spatial_boundary/CleanData/combined_tracts_{year}.geojson and csv

Processes:

load transit shapefile from NTAD and select type of analysis (1-using 2010 boundary for mode choice; 2- use 2020 boundary for GEMS input)
intersect rail route with census tract boundary and calculate availability metrics
calculate distance between tract centroid to nearest rail line (the rail service only cross small numbers of trucks but is accessible to nearby riders directly or through connecting mode. This distance is used as supporting information on aproximity to rail service from each tract.)

output:

Network/CleanData/transit_availability_with_dist_{year}.csv

Theme H: User cost

H1. processing transit fare at census tract level

code: clean_transit_fare.py

Input:

load APTA transit fare data: Cost/RawData/BTS/2017-APTA-Fare-Database.xlsx # can transfer to use another year of input data as long as APTA data format stay the same
load spatial crosswalk from Theme A: spatial_boundary/CleanData/ZIP_COUNTY_LOOKUP_2023.csv

Processes:

Select fare from bus and rail service
assign transit fare to census tract using spatial crosswalk

output:

'Cost/CleanData/transit_fare_by_tract_{year}.csv'

H2. processing parking and ridehailing cost at census tract level

code: clean_parking_and_tnc_cost.py

Input:

load Parkopedia data (after cleaning some typos in city name): Cost/RawData/parkopedia_cleaned_name.csv # created from 'parkopedia.xlsx' in the same directory
load Uber fare data (after cleaning some typos in city name): Cost/RawData/Uber_fare_cleaned_name.csv # created from 'Uber_fare.csv' in the same directory
load spatial crosswalk from Theme A: spatial_boundary/CleanData/ZIP_COUNTY_LOOKUP_2023.csv

Processes:

Calculate hourly parking rate for each city
assign parking and ridehail cost to census tract using spatial crosswalk

output:

'Cost/CleanData/parking_tract_{year}.csv'
'Cost/CleanData/uber_fare_tract_{year}.csv'

Theme I: System cost

H1. processing transit system cost at county level

code: 0_clean_transit_costs.R

Input:

load 2018 NTD data: Cost/RawData/NTD/* # can transfer to use another year of input data as long as NTD data format stay the same
load spatial crosswalk from Theme A: spatial_boundary/CleanData/ZIP_COUNTY_LOOKUP_2023.csv

Processes:

Select agency, fleet, expanse and operation data from bus and rail service
assign transit attributes to county using spatial crosswalk

output:

'Cost/CleanData/transit_system_cost.csv'

H2. processing highway system cost at county level

code: 0_clean_road_network_costs.R

Input:

load HERS data: Cost/RawData/G_01_AppA_H_TypUrbCapcCostsPerLM_A-8_2018-09-28+.xlsx
load urban area definition: spatial_boundary/CleanData/urban_divisions_2021.csv
load processed network data: Network/CleanData/network_microtype_metrics_2.csv

Processes:

Assign cost groups to both HERS data and processed network data (at tract-level)
Calculate weighted highway system cost per tract using lane mile fraction by functional class and cost group at tract level

output:

'Cost/CleanData/highway_cost_per_tract.csv'

Name		Name	Last commit message	Last commit date
Latest commit History 89 Commits
accessibility		accessibility
cost		cost
demand		demand
demographic		demographic
geography		geography
network		network
other		other
spatial_cluster		spatial_cluster
.gitignore		.gitignore
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

GEMS Data Generation

POC: Xiaodan Xu, Ph.D. ([email protected])

Latest update: 02/27/2024

Theme A: Geographic boundary

a1. collecting micro-geotype boundary

Theme B: Demograhic characteristics

b1. Collecting ACS data at census tract level

Theme C: Collect demand related attributes

c1. clean and processing LEHD LODES8 data

Theme D: Collect land use attributes

D1. land use characteristics from NLCD data

D2. uban area definition from U.S.Census Bureau

Theme E: Network generation

e1. processing OSMNX data at census tract level

Theme F: Spatial clustering

F1. Develop socio-economic microtype

F2. Develop geotype

Theme G: Accessibility and mode availability

G1. processing bike density at census tract level

G2. processing transit density at census tract level

Theme H: User cost

H1. processing transit fare at census tract level

H2. processing parking and ridehailing cost at census tract level

Theme I: System cost

H1. processing transit system cost at county level

H2. processing highway system cost at county level

About

Releases

Packages

Contributors 2

Languages

LBNL-UCB-STI/GEMS-data

Folders and files

Latest commit

History

Repository files navigation

GEMS Data Generation

POC: Xiaodan Xu, Ph.D. ([email protected])

Latest update: 02/27/2024

Theme A: Geographic boundary

a1. collecting micro-geotype boundary

Theme B: Demograhic characteristics

b1. Collecting ACS data at census tract level

Theme C: Collect demand related attributes

c1. clean and processing LEHD LODES8 data

Theme D: Collect land use attributes

D1. land use characteristics from NLCD data

D2. uban area definition from U.S.Census Bureau

Theme E: Network generation

e1. processing OSMNX data at census tract level

Theme F: Spatial clustering

F1. Develop socio-economic microtype

F2. Develop geotype

Theme G: Accessibility and mode availability

G1. processing bike density at census tract level

G2. processing transit density at census tract level

Theme H: User cost

H1. processing transit fare at census tract level

H2. processing parking and ridehailing cost at census tract level

Theme I: System cost

H1. processing transit system cost at county level

H2. processing highway system cost at county level

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages