Fix bugs in allocate_gen_fuel #3690

Open
wants to merge 2 commits into
base: main
from

Conversation

grgmiller
Collaborator

@grgmiller grgmiller commented Jun 22, 2024

Overview

This PR fixes several issues that we identified in the analysis.allocate_gen_fuel module and had already fixed in our fork of pudl for OGE. We are now trying to get rid of our dependency on the pudl code, so we want to migrate all of our changes over to pudl so that we can use the output table from this module directly.

This is part of work we are tracking in OGE here: singularity-energy/open-grid-emissions#369

First, this addresses an issue where some retiring generators were incorrectly identified and being dropped: singularity-energy#1

This PR fixes a bug in the generation and fuel allocation process where data for report months after the reported retirement date was dropped for generators that retire mid-year. For example, both generators at plant 50937 have a retirement date of "2022-09-01", and the previous behavior was to drop all data after September, even though this plant continued to report fuel consumption after September. The fix keeps all report dates through the end of the year in which the generator retires, so this data is no longer dropped.
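The difference between the two behaviors can be sketched as follows. This is a minimal illustration and not the actual pudl implementation: the helper names, the generator ID, and the fuel values are made up, and the column names other than generator_retirement_date are assumptions.

```python
import pandas as pd

# Made-up monthly records for one generator that retires on 2022-09-01.
# The generator ID and fuel values are hypothetical.
records = pd.DataFrame(
    {
        "plant_id_eia": [50937] * 4,
        "generator_id": ["GEN1"] * 4,
        "report_date": pd.to_datetime(
            ["2022-08-01", "2022-09-01", "2022-10-01", "2022-12-01"]
        ),
        "generator_retirement_date": pd.to_datetime(["2022-09-01"] * 4),
        "fuel_consumed_mmbtu": [10.0, 8.0, 3.0, 1.0],
    }
)


def drop_after_retirement_date(df):
    """Assumed old behavior: drop report dates after the retirement date."""
    not_retired = df["generator_retirement_date"].isna()
    before_retirement = df["report_date"] <= df["generator_retirement_date"]
    return df[not_retired | before_retirement]


def keep_through_retirement_year(df):
    """Sketch of the fix: keep report dates through the end of the retirement year."""
    not_retired = df["generator_retirement_date"].isna()
    same_or_earlier_year = (
        df["report_date"].dt.year <= df["generator_retirement_date"].dt.year
    )
    return df[not_retired | same_or_earlier_year]


old_kept = drop_after_retirement_date(records)
new_kept = keep_through_retirement_year(records)
print(len(old_kept), len(new_kept))
```

With the date-based filter, the October and December rows are lost along with the fuel they report. The year-based filter keeps all four rows.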

Second, this addresses issues with duplicate generators, as described in this PR: singularity-energy#3

When running the pudl.analysis.allocate_gen_fuel pipeline for 2016 and 2017, we were getting a TypeError at group_duplicate_keys(), because this function was trying to groupby().sum() non-numeric columns like generator_retirement_date.
group_duplicate_keys() will only work if we drop any datetime and boolean columns before calling it, and we would also need to consider carefully whether any of the frac columns should be summed. This PR does not touch this function, however; it fixes the issue upstream instead.
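To illustrate what "dropping datetime and boolean columns before summing" would look like, here is a rough sketch. It is not the code of group_duplicate_keys(): the function name sum_duplicate_keys, the key columns, and the net_generation_mwh and is_proposed columns are all hypothetical.

```python
import pandas as pd


def sum_duplicate_keys(df, key_cols):
    """Hypothetical helper: sum only the numeric, non-key columns per key."""
    # select_dtypes(include="number") leaves out datetime and boolean columns.
    numeric_cols = df.select_dtypes(include="number").columns
    value_cols = [col for col in numeric_cols if col not in key_cols]
    return df.groupby(key_cols, as_index=False)[value_cols].sum()


# Two identical rows for the same key (values are made up).
dupes = pd.DataFrame(
    {
        "plant_id_eia": [56846, 56846],
        "generator_id": ["GTG1", "GTG1"],
        "generator_retirement_date": pd.to_datetime(["2017-06-01", "2017-06-01"]),
        "is_proposed": [True, True],
        "net_generation_mwh": [10.0, 10.0],
    }
)

summed = sum_duplicate_keys(dupes, ["plant_id_eia", "generator_id"])
print(summed)
```

Note that summing two identical rows doubles net_generation_mwh to 20.0, which is one reason removing the duplicates upstream is the safer fix when the rows are true duplicates.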
We were only running into this issue with group_duplicate_keys() because there were duplicate keys in the dataframe, so this PR addresses the root cause: where the duplicate keys were being introduced in the first place.
When creating the gen_assoc table with associate_generator_tables(), one of the steps is remove_inactive_generators(), which removes certain generators by building six different dataframes of generators based on their operating status: existing, retiring_generators, retired_plants, proposed_generators, proposed_plants, and unassociated_plants. These six dataframes are then concatenated together. Previously, our assumption was that these six dataframes would be non-overlapping. However, it turns out that this is not always the case.
For example, in 2016, plant 56846 generator GTG1 ended up in both proposed_generators and proposed_plants, which caused it to be duplicated.
We fix this by simply adding .drop_duplicates() after these six dataframes are concatenated together. This fixes the issue that we were experiencing in 2016 and 2017.
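The overlap and the fix can be sketched with toy data. This is only an illustration under assumptions: it uses three of the six dataframes, the key columns and the operational_status column are assumed, and the "existing" row is made up. It is not the body of remove_inactive_generators().

```python
import pandas as pd

# Assumed key for a generator record; the real key columns may differ.
KEY_COLS = ["plant_id_eia", "generator_id", "report_date"]

# Made-up "existing" generator.
existing = pd.DataFrame(
    {
        "plant_id_eia": [1],
        "generator_id": ["A"],
        "report_date": pd.to_datetime(["2016-01-01"]),
        "operational_status": ["existing"],
    }
)
# The same generator is selected by two different filters.
proposed_generators = pd.DataFrame(
    {
        "plant_id_eia": [56846],
        "generator_id": ["GTG1"],
        "report_date": pd.to_datetime(["2016-01-01"]),
        "operational_status": ["proposed"],
    }
)
proposed_plants = proposed_generators.copy()

combined = pd.concat(
    [existing, proposed_generators, proposed_plants], ignore_index=True
)
n_dupes_before = int(combined.duplicated(subset=KEY_COLS).sum())

# The fix: drop fully identical rows after concatenating.
deduped = combined.drop_duplicates().reset_index(drop=True)
n_dupes_after = int(deduped.duplicated(subset=KEY_COLS).sum())

print(n_dupes_before, n_dupes_after)
```

Because drop_duplicates() without a subset only removes rows that are identical in every column, rows that share a key but differ in some other column would still remain, so a check like duplicated(subset=KEY_COLS) is a useful sanity test.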
For now, we will leave group_duplicate_keys() alone even though it does not work. It effectively acts as an error check: if there are ever any duplicate keys, it will raise a TypeError like the one we saw for 2016 and 2017.

Testing

We have successfully run this code by importing pudl and running it in the OGE pipeline. However, our testing so far has been with an older release of pudl (v2023.12.01).

I am on a Windows machine, and there are not great instructions for getting the pudl dev environment set up on Windows. This is a pretty small code change, so I'm hoping that someone who already has the dev environment set up may be able to help test this.

To-do list

  1. If updating analyses or data processing functions: make sure to update or write data validation tests (e.g., test_minmax_rows())
  2. Update the release notes: reference the PR and related issues.
  3. Ensure docs build, unit & integration tests, and test coverage pass locally with make pytest-coverage (otherwise the merge queue may reject your PR)
  4. Review the PR yourself and call out any questions or issues you have
  5. For minor ETL changes or data additions, once make pytest-coverage passes, make sure you have a fresh full PUDL DB downloaded locally, materialize new/changed assets and all their downstream assets and run relevant data validation tests using pytest and --live-dbs.
  6. For significant ETL, data coverage or analysis changes, once make pytest-coverage passes, ensure the full ETL runs locally and run data validation tests using make pytest-validate (a ~10 hour run). If you can't run this locally, run the build-deploy-pudl GitHub Action (or ask someone with permissions to). Then, check the logs on the #pudl-deployments Slack channel or gs://builds.catalyst.coop.

@grgmiller grgmiller requested a review from cmgosnell June 23, 2024 00:11
@grgmiller
Collaborator Author

@cmgosnell not sure who would be best to review this so added you for now.

@zaneselvans zaneselvans added analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. community labels Jun 25, 2024
Labels
analysis Data analysis tasks that involve actually using PUDL to figure things out, like calculating MCOE. community
Projects
Status: In progress
Development

Successfully merging this pull request may close these issues.

None yet

2 participants