This project contains a Dagster repository that extracts, cleans, and loads PUDL's usage metrics into a data warehouse for analysis.
This is the project structure generated by the Dagster CLI:
| Name | Description |
| --- | --- |
| `README.md` | A description and guide for this code repository |
| `workspace.yaml` | A file that specifies the location of the user code for Dagit and the Dagster CLI |
| `src/usage_metrics/` | A Python directory that contains code for your Dagster repository |
| `usage_metrics_tests/` | A Python directory that contains tests for `usage_metrics` |
| `setup.py` | A build script with Python package dependencies for this code repository |
We use the conda package manager to specify and update our development environment. We recommend using miniconda rather than the large pre-defined collection of scientific packages bundled together in the Anaconda Python distribution. You may also want to consider using mamba – a faster drop-in replacement for conda written in C++.
```
conda update conda
conda env create --name pudl-usage-metrics --file environment.yml
conda activate pudl-usage-metrics
```
The ETL uses ipinfo to geocode IP addresses. You need to obtain an ipinfo API token and store it in the `IPINFO_TOKEN` environment variable.
Dagster stores run logs and caches in the directory specified by the `DAGSTER_HOME` environment variable. The `usage_metrics/dagster_home/dagster.yaml` file contains configuration for the Dagster instance. Note: the `usage_metrics/dagster_home/storage` directory could grow to a couple of GBs because all op outputs for every run are stored there. You can read more about the `DAGSTER_HOME` directory in the Dagster docs.
To set these environment variables, run these commands:
```
conda activate pudl-usage-metrics
conda env config vars set IPINFO_TOKEN="{your_api_key_here}"
conda env config vars set DAGSTER_HOME="$(pwd)/dagster_home/"
conda activate pudl-usage-metrics
```
Ask the project admin of the `catalyst-cooperative-pudl` project to add your email to the `pudl-usage-metrics-etl` group to acquire adequate permissions to run the ETL locally. Once you have been added to the group, run:

```
gcloud auth application-default login
```

in your terminal. This command will prompt you to log in to your Gmail account. Once completed, your Google credentials will be available in your environment.
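Once authenticated, Google client libraries pick up these application-default credentials automatically; here is a minimal sketch, assuming the `google-auth` package is installed (this is illustrative, not code from this repo):

```python
# Illustrative only: application-default credentials are discovered from the
# environment configured by `gcloud auth application-default login`.
import google.auth

credentials, project_id = google.auth.default()
print(f"Authenticated with default credentials for project: {project_id}")
```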
Git hooks let you automatically run scripts at various points as you manage your source code. “Pre-commit” hook scripts are run when you try to make a new commit. These scripts can review your code and identify bugs, formatting errors, bad coding habits, and other issues before the code gets checked in. This gives you the opportunity to fix those issues before publishing them.
To make sure they run before you commit any code, enable the pre-commit hook scripts with this command:

```
pre-commit install
```

The scripts that run are configured in the `.pre-commit-config.yaml` file.
Now the environment is all set up and we can start up Dagster!
In one terminal window, start the dagster-daemon by running these commands:

```
conda activate pudl-usage-metrics
dagster-daemon run
```
The dagster-daemon is a long-running service required for schedules, sensors, and run queueing. The usage metrics ETL requires the daemon because the data is processed in partitions: Dagster kicks off an individual run for each partition, and the runs are sent to a queue managed by the dagster-daemon.
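For context, a partitioned Dagster job looks roughly like the sketch below; the op, graph, and config names are hypothetical, not this repo's actual code:

```python
# Hypothetical sketch of a daily-partitioned Dagster job; the real jobs in
# this repo may partition differently.
from dagster import daily_partitioned_config, graph, op

@op(config_schema={"date": str})
def extract(context):
    context.log.info(f"Processing partition {context.op_config['date']}")

@graph
def example_graph():
    extract()

@daily_partitioned_config(start_date="2022-01-01")
def daily_config(start, _end):
    # Each partition's run is configured with the date it should process.
    return {"ops": {"extract": {"config": {"date": start.strftime("%Y-%m-%d")}}}}

example_job = example_graph.to_job(config=daily_config)
```

Launching a backfill then enqueues one run of `example_job` per date, and the daemon works through the queue.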
In another terminal window, start the Dagit UI by running these commands:

```
conda activate pudl-usage-metrics
dagit
```
This will launch Dagit at http://localhost:3000/. If you have another service running on port 3000, you can change the port by running:

```
dagit -p {another_cool_port}
```
Dagit allows you to kick off backfills and run partitions with specific configuration.
There is a module in the `usage_metrics/jobs` subpackage for each data source (e.g., Datasette logs, GitHub metrics). Each job module contains one graph of ops that extracts, transforms, and loads the data. Two jobs are created for each graph: one job loads data to a local SQLite database for development, and the other loads data to a Google Cloud SQL Postgres database for a Preset dashboard to access, as sketched below.
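Here is a rough illustration of that one-graph, two-jobs pattern; the op, resource, and job names are hypothetical, not this repo's actual code:

```python
# Hypothetical sketch of the one-graph, two-jobs pattern; resource
# implementations are stand-ins, not the repo's real database managers.
from dagster import graph, op, resource

@resource
def sqlite_manager(_init_context):
    return "sqlite"  # stand-in for the real SQLite-backed manager

@resource
def postgres_manager(_init_context):
    return "postgres"  # stand-in for the real PostgresManager

@op
def extract():
    return [{"path": "/data", "count": 1}]  # placeholder rows

@op
def transform(raw):
    return raw  # placeholder cleaning step

@op(required_resource_keys={"database_manager"})
def load(context, clean):
    # The real op would write `clean` via the configured database manager.
    context.log.info(f"Loading {len(clean)} rows to {context.resources.database_manager}")

@graph
def datasette_logs():
    load(transform(extract()))

# One job per target database, built from the same graph:
datasette_logs_local = datasette_logs.to_job(
    name="datasette_logs_local", resource_defs={"database_manager": sqlite_manager}
)
datasette_logs_gcp = datasette_logs.to_job(
    name="datasette_logs_gcp", resource_defs={"database_manager": postgres_manager}
)
```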
You can run the ETL via the Dagit UI or the Dagster CLI.
To run a complete backfill for a job, run:

```
dagster job backfill --all {YOUR_JOB_NAME}
```
To run a complete backfill from the Dagit UI, go to the job's partitions tab, then click the "Launch Backfill" button in the upper left corner of the window. This should bring up a new window with a list of partitions. Click "Select All" and then click the "Submit" button. This will submit a run for each partition. You can follow the runs on the "Runs" tab.
Jobs in the `local_usage_metrics` Dagster repository create a SQLite database called `usage_metrics.db` in the `usage_metrics/data/` directory. A primary key constraint error will be thrown if you rerun the ETL for a partition (a sketch of why follows below). If you want to recreate the entire database, just delete the SQLite database and rerun the ETL.
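The likely mechanism: each run appends its partition's rows, so rerunning a partition re-inserts rows whose keys already exist. A rough sketch, with made-up table and column names:

```python
# Hedged sketch: appending the same partition twice violates the table's
# primary key. Table and column names here are hypothetical.
import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///usage_metrics/data/usage_metrics.db")
partition_rows = pd.DataFrame({"log_id": [1, 2], "path": ["/a", "/b"]})

partition_rows.to_sql("example_logs", engine, if_exists="append", index=False)
# If log_id is the primary key, appending the same partition again raises an
# IntegrityError instead of silently duplicating rows.
```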
Jobs in the `gcp_usage_metrics` Dagster repository append new partitions to tables in a Cloud SQL Postgres database. A primary key constraint error will be thrown if you rerun the ETL for a partition. The `load-metrics` GitHub action is responsible for updating the database with new partitioned data.
If a new column is added or data is processed in a new way, you'll have to delete the table in the database and rerun a complete backfill. Note: the Preset dashboard will be unavailable during the complete backfill.
To run jobs in the `gcp_usage_metrics` repo, you need to whitelist your IP address for the database:

```
gcloud sql instances patch pudl-usage-metrics-db --authorized-networks={YOUR_IP_ADDRESS}
```
Then add the connection details as environment variables to your conda environment:
```
conda activate pudl-usage-metrics
conda env config vars set POSTGRES_IP={PUDL_USAGE_METRICS_DB_IP}
conda env config vars set POSTGRES_USER={PUDL_USAGE_METRICS_DB_USER}
conda env config vars set POSTGRES_PASSWORD={PUDL_USAGE_METRICS_DB_PASSWORD}
conda env config vars set POSTGRES_DB={PUDL_USAGE_METRICS_DB_DB}
conda env config vars set POSTGRES_PORT={PUDL_USAGE_METRICS_DB_PORT}
conda activate pudl-usage-metrics
```
You can find the connection details in the
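These variables presumably feed the Postgres connection; here is a minimal sketch of how such an engine might be assembled (the actual manager in this repo may build its connection differently):

```python
# Hedged sketch: build a SQLAlchemy engine from the environment variables set
# above. The real manager in this repo may construct its connection differently.
import os

import sqlalchemy as sa

url = sa.engine.URL.create(
    drivername="postgresql",
    username=os.environ["POSTGRES_USER"],
    password=os.environ["POSTGRES_PASSWORD"],
    host=os.environ["POSTGRES_IP"],
    port=int(os.environ["POSTGRES_PORT"]),
    database=os.environ["POSTGRES_DB"],
)
engine = sa.create_engine(url)
```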
The ETL uses ipinfo for geocoding the user IP addresses, which provides 50k free API requests a month. The `usage_metrics.helpers.geocode_ip()` function uses joblib to cache API calls so we don't call the API multiple times for a single IP address. The first time you run the ETL, no API calls will be cached, so the `geocode_ips()` op will take a while to complete.
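The caching presumably looks something like this sketch, assuming the `ipinfo` and `joblib` packages (the real helper may differ):

```python
# Hedged sketch of joblib-cached IP geocoding; not the repo's actual helper.
import os

import ipinfo
from joblib import Memory

# On-disk cache so repeated runs reuse earlier API responses.
memory = Memory("geocoder_cache", verbose=0)

@memory.cache
def geocode_ip(ip_address: str) -> dict:
    """Hit the ipinfo API only on a cache miss for this IP address."""
    handler = ipinfo.getHandler(os.environ["IPINFO_TOKEN"])
    return handler.getDetails(ip_address).all
```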
To add a new data source to the Dagster repo, add new modules to the `usage_metrics/jobs/` and `usage_metrics/ops/` directories and create jobs that use the SQLite and Postgres managers. Then, create a new Dagster repository in the repository module that contains the dataset's jobs (see the sketch below). Once the dataset has been tested locally, run a complete backfill for the job that uses the `PostgresManager` to populate the Cloud SQL database.
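A new repository definition might look roughly like this, reusing the example job names from the sketch earlier in this README (the real module and job names will differ):

```python
# Hypothetical sketch: a repository that collects one data source's jobs.
# `datasette_logs_local` and `datasette_logs_gcp` are the example jobs from
# the earlier sketch, not this repo's actual code.
from dagster import repository

@repository
def datasette_logs_usage_metrics():
    return [datasette_logs_local, datasette_logs_gcp]
```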