Bug 509 #2333

Open
wants to merge 40 commits into
base: main

knordback
Collaborator

This is a proposed way of dealing with fields that are in the input data and that we're no longer dropping in the eia923 transform() functions, but that aren't in the corresponding SQLite tables. Basically, I'm just stripping out the columns after the harvesting step.

In ToT code, the fields are dropped so they never appear in the DataFrames. If we stop dropping them and leave the corresponding columns in, populating the SQLite DB crashes.

In this code, the relevant field I'm dealing with is utility_name_eia, which comes in as operator_name. This is in the generation_fuel_eia923 table, but gets duplicated in generation_fuel_nuclear_eia923. So it needs to be removed from both.

This approach feels very special-case-y. But maybe that's the nature of the beast. If the general approach seems okay and other currently-dropped fields need to be handled in the same way, then I'd probably at least pull the new code out into a separate function.

@codecov

codecov bot commented Feb 27, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (62eee76) 88.4% compared to head (eaa7615) 88.4%.

Additional details and impacted files
@@          Coverage Diff          @@
##             dev   #2333   +/-   ##
=====================================
  Coverage   88.4%   88.4%           
=====================================
  Files         87      87           
  Lines      10139   10139           
=====================================
  Hits        8971    8971           
  Misses      1168    1168           
Impacted Files Coverage Δ
src/pudl/transform/eia923.py 85.2% <ø> (ø)


Comment on lines 1176 to 1182
eia_transformed_dfs["generation_fuel_eia923"] = eia_transformed_dfs[
    "generation_fuel_eia923"
].drop(columns=["utility_name_eia"])
eia_transformed_dfs["generation_fuel_nuclear_eia923"] = eia_transformed_dfs[
    "generation_fuel_nuclear_eia923"
].drop(columns=["utility_name_eia"])

Member

I'm a little surprised that this is necessary. I thought that by default the harvesting process dropped any column that was harvested from the table it was found in (unless we specifically tell it not to rip the column out). But I could be wrong about this; @cmgosnell would probably know better.

One generic way to deal with leftover columns that need to be harvested but shouldn't stick around is to use the pudl.metadata.classes.Resource.format_df() method, as we do in pudl.transform.classes.AbstractTableTransformer.enforce_schema().

In fact, copying enforce_schema() almost verbatim into a new function might be the way to go. I think the only big difference is you'd need to look up the appropriate resource using the table name that is the key in that dictionary of dataframes, rather than having self.table_id to work with.

Eventually the plan is to refactor the EIA transforms to use the same class design that we've applied to FERC 1, but for now just iterating over all the finished dataframes at the very end of the EIA transformation process would avoid the need to special-case which columns should be present or not, and would also run some other basic checks for problems that would otherwise cause the DB to complain (no null PKs, unique PKs, etc).
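
A minimal sketch of the generic approach described above, assuming a plain dict of column lists stands in for the real resource metadata. `TABLE_COLUMNS`, `keep_schema_columns()`, and the example column names are hypothetical and used only for illustration; an actual implementation would presumably look up each table's resource and rely on Resource.format_df() rather than on a hand-written mapping.

```python
import pandas as pd

# Hypothetical stand-in for the table schemas. In PUDL the expected columns
# would come from the resource metadata, not from a hand-written dict.
TABLE_COLUMNS = {
    "generation_fuel_eia923": ["plant_id_eia", "report_date", "net_generation_mwh"],
}


def keep_schema_columns(dfs, table_columns):
    """Return copies of the dataframes restricted to their expected columns."""
    cleaned = {}
    for table_name, df in dfs.items():
        expected = table_columns[table_name]
        missing = [col for col in expected if col not in df.columns]
        if missing:
            raise ValueError(f"{table_name} is missing columns: {missing}")
        # Selecting by list drops leftovers such as utility_name_eia.
        cleaned[table_name] = df.loc[:, expected].copy()
    return cleaned


dfs = {
    "generation_fuel_eia923": pd.DataFrame(
        {
            "plant_id_eia": [1, 2],
            "report_date": ["2020-01-01", "2020-01-01"],
            "net_generation_mwh": [10.0, 20.0],
            "utility_name_eia": ["A", "B"],  # harvested, not in the table schema
        }
    )
}
cleaned = keep_schema_columns(dfs, TABLE_COLUMNS)
```

Iterating over every finished dataframe this way means no table needs its own hard-coded list of columns to drop.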

@zaneselvans added labels Feb 27, 2023: data-cleaning (Tasks related to cleaning & regularizing data during ETL.), harvest (Normalization of poorly normalized inputs and reconciliation of internal inconsistencies)
@zaneselvans changed the base branch from main to dev February 28, 2023 15:58
Comment on lines 1176 to 1182
for cat in eia_transformed_dfs:
    resource = (
        pudl.metadata.classes.Package.from_resource_ids().
        get_resource(cat)
    )
    eia_transformed_dfs[cat] = resource.enforce_schema(
        eia_transformed_dfs[cat])
Member

Is there any reason not to wait to do this until the very end (after the balancing authority fix below)?

Is there any reason not to call enforce_schema() on the entity tables too?

Collaborator Author

Yes, I had already moved the code to the very end of the function.

And I added code to do the same thing for the entity tables. That doesn't seem to cause any distress.

Member

Wow that's great! I'm slightly surprised. Do you want to try running tox -e nuke overnight and see if it stays happy?

Collaborator Author

Will do

Collaborator Author

Seems not successful:
nuke: exit -9 (641.56 seconds) /home/knordback/Kurt/pudl> bash -c 'coverage run --append src/pudl/cli.py --logfile tox-nuke.log --clobber src/pudl/package_data/settings/etl_full.yml' pid=793582
.pkg: _exit> python /home/knordback/.conda/envs/pudl-dev/lib/python3.10/site-packages/pyproject_api/_backend.py True setuptools.build_meta
nuke: FAIL code -9 (7080.40=setup[98.70]+cmd[0.12,0.54,0.48,0.39,0.41,0.42,0.39,0.53,0.24,6.33,6.21,1.24,0.03,1.72,1.23,161.77,33.89,25.29,3980.21,0.01,2118.71,641.56] seconds)
evaluation failed :( (7081.04 seconds)

Collaborator Author

All of those tests run successfully. So does tox with no arguments (which I thought was the commit criterion). So it seems to be something specific to the "nuke".

I am indeed getting copious output, but it's not (to my eye) indicating what the problem is. As mentioned above, I get some warnings but nothing obviously connected to this.

I'll be away next week. I'll look into this more when I get back.

Collaborator Author

Hi Zane. I'm back, for a little while, and have some time to work on this again. I can continue un-dropping fields, but I'm not sure how worried to be about not being able to run tox -e nuke successfully. As noted, the other tests run fine for me. Output is attached in case you want to take a look.
tox-out.txt

Member

Have you tried running the full ETL with the pudl_etl command line tool? The place that it's failing in your logs is the most memory intensive part of the ETL so my guess is it's running out of memory and crashing? I thought you'd been able to do the full ETL previously? How much memory + swap do you have? I think it may take up to 25GB as it is now.

tox with no arguments is what gets run in CI on GitHub. It only processes 1 year of data. For changes like the one you're making that will affect all years of data we try and run the full ETL locally before it gets merged into dev.

tox -e nuke is a blunt instrument that not only runs the full ETL, but also all the unit + integration tests on all the data + data validation tests. You probably want to be trying to run the full ETL, but only for the EIA data, to check whether everything is working. The devtools/eia-etl-debug.ipynb notebook is the easiest / most efficient way to do that right now, since you don't have to run the extract step over and over again. It'll also avoid this (FERC 1) related memory intensive step.

Collaborator Author

I have 16GB RAM + 15GB swap. I thought I could run the full ETL, but apparently I can't do that either. I was able to run devtools/eia-etl-debug.ipynb (first time using Jupyter -- it's kinda slick).

Member

Jupyter is great for working with data. Lots more human-friendly tools for seeing what's going on in there! Definitely worth getting familiar with.

@zaneselvans added labels Mar 16, 2023: eia923 (Anything having to do with EIA Form 923), eia860 (Anything having to do with EIA Form 860)

Member

@zaneselvans left a comment

Other than removing the null df.drop() I think this looks great!

Do you want to go ahead and merge dev in and try to run the full ETL + data validations to see if anything is broken?

"net_generation_mwh_year_to_date",
"early_release",
],
[],
Member

We should just remove this .drop() altogether.

@zaneselvans marked this pull request as ready for review May 25, 2023 01:14
@zaneselvans
Member

If you've got the full DB already populated locally, you can run the full data tests and validations in parallel in two different windows with these commands from the main repo directory:

pytest --live-dbs test/validate
pytest --live-dbs --etl-settings src/pudl/package_data/settings/etl_full.yml test/integration

@knordback
Collaborator Author

knordback commented May 26, 2023

With the current code I get this:

===================================================================================================== short test summary info ======================================================================================================
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-bf_eia923-1427692-1427692-119611] - ValueError: bf_eia923: found 1428112 rows, expected 1427692. Off by 0.029%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-frc_eia923-597000-244415-24065] - ValueError: frc_eia923: found 608494 rows, expected 597000. Off by 1.925%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-gf_eia923-2687345-2687345-230149] - ValueError: gf_eia923: found 2690175 rows, expected 2687345. Off by 0.105%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-pu_eia860-184745-184745-184745] - ValueError: pu_eia860: found 184831 rows, expected 184745. Off by 0.047%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_raw-utils_eia860-119388-119388-119388] - ValueError: utils_eia860: found 121464 rows, expected 119388. Off by 1.739%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-bf_eia923-1427692-1427692-119611] - ValueError: bf_eia923: found 119646 rows, expected 119611. Off by 0.029%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-frc_eia923-597000-244415-24065] - ValueError: frc_eia923: found 24285 rows, expected 24065. Off by 0.914%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 428497 rows, expected 432570. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-gf_eia923-2687345-2687345-230149] - ValueError: gf_eia923: found 230344 rows, expected 230149. Off by 0.085%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-pu_eia860-184745-184745-184745] - ValueError: pu_eia860: found 184831 rows, expected 184745. Off by 0.047%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_annual-utils_eia860-119388-119388-119388] - ValueError: utils_eia860: found 121464 rows, expected 119388. Off by 1.739%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 30342 rows, expected 30340. Off by 0.007%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 46410 rows, expected 46408. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-fuel_cost-555119-46408] - ValueError: fuel_cost: found 46410 rows, expected 46408. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 428497 rows, expected 432570. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_annual-mcoe-5171881-432602] - ValueError: mcoe: found 428529 rows, expected 432602. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-bf_eia923-1427692-1427692-119611] - ValueError: bf_eia923: found 1428112 rows, expected 1427692. Off by 0.029%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-frc_eia923-597000-244415-24065] - ValueError: frc_eia923: found 246368 rows, expected 244415. Off by 0.799%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gen_eia923-None-5171497-432570] - ValueError: gen_eia923: found 5122777 rows, expected 5171497. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-gf_eia923-2687345-2687345-230149] - ValueError: gf_eia923: found 2690175 rows, expected 2687345. Off by 0.105%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-pu_eia860-184745-184745-184745] - ValueError: pu_eia860: found 184831 rows, expected 184745. Off by 0.047%, allowed margin of 0.000%
FAILED test/validate/eia_test.py::test_minmax_rows[eia_monthly-utils_eia860-119388-119388-119388] - ValueError: utils_eia860: found 121464 rows, expected 119388. Off by 1.739%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_unit-362381-30340] - ValueError: hr_by_unit: found 362405 rows, expected 362381. Off by 0.007%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-hr_by_gen-555119-46408] - ValueError: hr_by_gen: found 555143 rows, expected 555119. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-fuel_cost-555119-46408] - ValueError: fuel_cost: found 555143 rows, expected 555119. Off by 0.004%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-capacity_factor-5171497-432570] - ValueError: capacity_factor: found 5122777 rows, expected 5171497. Off by -0.942%, allowed margin of 0.000%
FAILED test/validate/mcoe_test.py::test_minmax_rows_mcoe[eia_monthly-mcoe-5171881-432602] - ValueError: mcoe: found 5123161 rows, expected 5171881. Off by -0.942%, allowed margin of 0.000%
================================================================================ 27 failed, 35 passed, 6 skipped, 47 warnings in 2606.97s (0:43:26) ================================================================================
minmax_rows: exit 1 (2625.05 seconds) /home/knordback/Kurt/pudl> pytest --color=yes --live-dbs test/validate/epacamd_eia_test.py::test_minmax_rows test/validate/ferc1_test.py::test_minmax_rows test/validate/eia_test.py::test_minmax_rows test/validate/mcoe_test.py::test_minmax_rows_mcoe pid=3047649

Does this make sense? Is there an easy way to look at the diffs?
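
The "Off by" percentages in the summary above can be reproduced with simple arithmetic. The sketch below assumes the figure is (found - expected) / expected; that formula is inferred from the numbers in the log, not taken from the PUDL test code, and `pct_off()` is a hypothetical helper name.

```python
def pct_off(found: int, expected: int) -> float:
    """Relative difference between the observed and expected row counts."""
    return (found - expected) / expected


# (table, found, expected) triples copied from the failures listed above.
cases = [
    ("bf_eia923", 1428112, 1427692),
    ("frc_eia923", 608494, 597000),
    ("gen_eia923 (annual)", 428497, 432570),
]

for table, found, expected in cases:
    extra_rows = found - expected
    print(f"{table}: {extra_rows:+d} rows, off by {pct_off(found, expected):.3%}")
```

The absolute row differences are often easier to compare across tools than the percentages.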

@zaneselvans
Member

Most of these changes are small increases in the number of rows, which seems pretty reasonable. Changes that stand out as maybe funny:

  • Nearly 2% increase in the number of records in the fuel receipts and costs table (frc_eia923)
  • Nearly 2% increase in the number of utility records (utils_eia860)
  • Loss of almost 1% of the records in the generation table (gen_eia923) and the tables that depend on it directly.

@cmgosnell do you have any thoughts on why these tables would have such big changes just because we're not dropping "extra" columns before harvesting? The utilities make some sense, but the FRC and generation seem surprising.

To diff the tables, there's sqldiff but it looks like that may only be a Windows utility. I think @rousik or @zschira were doing some DB table diffing, maybe with another utility?

I would probably read the two tables into pandas and set the index to the primary key and see what records exist in one table but not the other by taking the difference between the two indexes. But that won't work well on the FRC table since it has no natural primary key.
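
For a table with no natural primary key, one possible workaround is to compare how many times each distinct row occurs on each side. This is only a sketch on toy data: `diff_without_pk()` and the column names are hypothetical, and it assumes both dataframes have identical columns with hashable values.

```python
import pandas as pd


def diff_without_pk(left, right):
    """Return distinct rows whose number of occurrences differs between frames."""
    cols = list(left.columns)
    # Count occurrences of each distinct row on each side (NaN keys are kept).
    n_left = left.groupby(cols, dropna=False).size().reset_index(name="n_left")
    n_right = right.groupby(cols, dropna=False).size().reset_index(name="n_right")
    counts = n_left.merge(n_right, how="outer", on=cols)
    # Rows absent from one side get a count of zero there.
    counts = counts.fillna({"n_left": 0, "n_right": 0}).astype(
        {"n_left": int, "n_right": int}
    )
    return counts[counts["n_left"] != counts["n_right"]].reset_index(drop=True)


left = pd.DataFrame({"plant_id": [1, 1, 2], "fuel": ["coal", "coal", "gas"]})
right = pd.DataFrame({"plant_id": [1, 2, 3], "fuel": ["coal", "gas", "oil"]})
changed = diff_without_pk(left, right)
```

Because duplicates are counted rather than matched pairwise, a row that appears twice on one side and once on the other still shows up in the result.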

@zaneselvans
Member

@knordback it looks like @rousik has been doing a bunch of work diffing SQL tables and this might be another good test of his little toolkit too.

@knordback
Collaborator Author

> @knordback it looks like @rousik has been doing a bunch of work diffing SQL tables and this might be another good test of his little toolkit too.

Okay, that would be good. I get a lot of output just running sqldiff, and I don't know how to extract the essential difference.

@rousik
Collaborator

rousik commented Jun 2, 2023 via email

Do you have the outputs on gcs, or can you tell me how to repro the etl run in question? I could test my scripts against this case.

@knordback
Collaborator Author

I want to compare the pudl.sqlite produced on dev against the one produced on the bug-509 branch. (I just merged dev into the branch, so the diffs should correspond only to the branch changes.) I also have both locally and could put them on Google. Is there a particular place this kind of thing should go?

@rousik
Collaborator

rousik commented Jun 2, 2023

All right, I have run the bug-509 branch vs dev and here's the output diff report:

RowCount(pudl.sqlite/denorm_boiler_fuel_eia923): 108
RowCount(pudl.sqlite/denorm_boiler_fuel_monthly_eia923): 108
RowCount(pudl.sqlite/denorm_boiler_fuel_yearly_eia923): 9
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_eia923): 1390
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_monthly_eia923): 248
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_yearly_eia923): 32
RowCount(pudl.sqlite/denorm_generation_eia923): 24
RowCount(pudl.sqlite/denorm_generation_fuel_combined_eia923): 1300
RowCount(pudl.sqlite/denorm_generation_fuel_combined_monthly_eia923): 1300
RowCount(pudl.sqlite/denorm_generation_fuel_combined_yearly_eia923): 117
RowCount(pudl.sqlite/denorm_generation_monthly_eia923): 24
RowCount(pudl.sqlite/denorm_generation_yearly_eia923): 2
RowCount(pudl.sqlite/denorm_plants_utilities_eia): 65
RowCount(pudl.sqlite/denorm_utilities_eia): 7
RowCount(pudl.sqlite/utilities_eia860): 7
RowCount(pudl.sqlite/utilities_entity_eia): 1

A positive number means that many rows were added in the bug-509 branch (the right side of the comparison).

The relevant piece of code you can use to fetch the differing rows would be something along the lines of:

import pandas as pd
from sqlalchemy import create_engine

def read_table_as_df(db_path, table_name):
    # db_path is passed to create_engine, so it must be a SQLAlchemy URL
    con = create_engine(db_path)
    # Read the table in chunks, then concatenate them into one dataframe
    return pd.concat([
        df for df in pd.read_sql_table(table_name, con, chunksize=100_000)
    ])

df_left = read_table_as_df(left_db, table)
df_right = read_table_as_df(right_db, table)

df_merge = df_left.merge(df_right, how="outer", indicator=True)
# Then, df_merge["_merge"] has values "left_only", "right_only" or "both" depending on where it occurs
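
As a small illustration of how the `_merge` indicator can be used to pull out the differing rows, here is a sketch on toy in-memory dataframes. The column names and values are made up; with real data the two frames would come from the two databases.

```python
import pandas as pd

# Toy stand-ins for the same table read from two different databases.
df_left = pd.DataFrame(
    {"utility_id_eia": [1, 2], "utility_name_eia": ["A", "B"]}
)
df_right = pd.DataFrame(
    {"utility_id_eia": [2, 3], "utility_name_eia": ["B", "C"]}
)

# With no "on" argument, merge joins on all columns the two frames share.
df_merge = df_left.merge(df_right, how="outer", indicator=True)

# Rows that appear on only one side of the comparison.
left_only = df_merge[df_merge["_merge"] == "left_only"].drop(columns="_merge")
right_only = df_merge[df_merge["_merge"] == "right_only"].drop(columns="_merge")
```

Note that a row whose values changed between the two databases shows up twice here, once as left_only and once as right_only.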

@knordback
Collaborator Author

Ah, this is super helpful. More rows in bug-509 is what we would expect. Which version of the ETL did you run to get this?

@rousik
Collaborator

rousik commented Jun 5, 2023

Latest dev available at the time and latest commit of bug-509. The rousik-output-diff branch has a tool, "output_diff", that you can point at two directories to produce the report. I'll be working on automating this.

@zaneselvans
Member

Huh, those differences in the number of rows seem pretty different from what @knordback got when he ran the minmax_rows tests. I wonder why?

@knordback
Collaborator Author

It depends on the ETL version used to create the pudl.sqlite, right? So I'm running modified and reference versions of etl_full_no_cems and will try @rousik 's tool on the output.

@zaneselvans added this to the 2023 Spring milestone Jun 5, 2023
@zaneselvans
Member

If you've got two different DBs side by side and you want to see differences between the outputs they generate, you can do something like the following in a Jupyter Notebook (here looking at the plant_in_service_ferc1 output table, which should have some differences between your DB and the last successful nightly build output from dev):

import pandas as pd
import sqlalchemy as sa

from pudl.metadata.classes import Resource
from pudl.output.pudltabl import PudlTabl

pk_cols = Resource.from_id("plant_in_service_ferc1").schema.primary_key

# Use the real paths to your 2 DBs obvs:
left_db_url = "sqlite:///" + "/Users/zane/code/catalyst/pudl-work/output/pudl.sqlite"
left_engine = sa.create_engine(left_db_url)
left_pudl_out = PudlTabl(left_engine)
left_df = left_pudl_out.plant_in_service_ferc1().set_index(pk_cols)

# Use the real paths to your 2 DBs obvs:
right_db_url = "sqlite:///" + "/Users/zane/code/catalyst/pudl-work/output/pudl.sqlite"
right_engine = sa.create_engine(right_db_url)
right_pudl_out = PudlTabl(right_engine)
right_df = right_pudl_out.plant_in_service_ferc1().set_index(pk_cols)

left_only_index = left_df.index.difference(right_df.index)
right_only_index = right_df.index.difference(left_df.index)

left_only_df = left_df.loc[left_only_index]
right_only_df = right_df.loc[right_only_index]
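
The index-difference approach above finds records that exist on only one side. A possible extension, sketched here on toy data with made-up column names, is to also look at records present in both databases whose values differ, using pandas' DataFrame.compare(). This assumes both dataframes have the same columns and unique primary keys.

```python
import pandas as pd

pk_cols = ["utility_id_eia", "report_year"]  # hypothetical primary key

left_df = pd.DataFrame(
    {"utility_id_eia": [1, 2, 3], "report_year": [2020] * 3, "value": [10.0, 20.0, 30.0]}
).set_index(pk_cols)
right_df = pd.DataFrame(
    {"utility_id_eia": [2, 3, 4], "report_year": [2020] * 3, "value": [20.0, 35.0, 40.0]}
).set_index(pk_cols)

# Records that exist on only one side.
left_only_index = left_df.index.difference(right_df.index)
right_only_index = right_df.index.difference(left_df.index)

# Records present on both sides, aligned in the same order for comparison.
common_index = left_df.index.intersection(right_df.index)
changed = left_df.loc[common_index].compare(right_df.loc[common_index])
```

In the result, `changed` holds one row per differing record, with "self" (left) and "other" (right) values side by side for each column that differs.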

@knordback
Collaborator Author

knordback commented Jun 26, 2023

Remarkably, I finally got back to this. I ran the bug-509 code and compared output against dev. To compare, I first ran Jan's output_diff tool to find pudl.sqlite tables that differ. It produced the following:

2023-06-25 17:37:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/advanced_metering_infrastructure_eia861.
2023-06-25 17:37:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/alembic_version.
2023-06-25 17:37:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/averaging_periods_eia.
2023-06-25 17:37:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balance_sheet_assets_ferc1.
2023-06-25 17:37:06 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balance_sheet_liabilities_ferc1.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balancing_authorities_eia.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balancing_authority_assn_eia861.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/balancing_authority_eia861.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_cooling_assn_eia860.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_emissions_control_equipment_assn_eia860.
2023-06-25 17:37:07 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_fuel_eia923.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_generator_assn_eia860.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_generator_assn_types_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_stack_flue_assn_eia860.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_status_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boiler_types_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boilers_eia860.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/boilers_entity_eia.
2023-06-25 17:37:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/cash_flow_ferc1.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/coalmine_eia923.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/coalmine_types_eia.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/contract_types_eia.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/data_maturities.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/datasources.
2023-06-25 17:37:12 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_hourly_pa_ferc714.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_response_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_response_water_heater_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_side_management_ee_dr_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_side_management_misc_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/demand_side_management_sales_eia861.
2023-06-25 17:37:49 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_balance_sheet_assets_ferc1.
2023-06-25 17:37:50 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_balance_sheet_liabilities_ferc1.
2023-06-25 17:37:51 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boiler_fuel_eia923.
RowCount(pudl.sqlite/denorm_boiler_fuel_eia923): 420
2023-06-25 17:37:54 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boiler_fuel_monthly_eia923.
RowCount(pudl.sqlite/denorm_boiler_fuel_monthly_eia923): 420
2023-06-25 17:37:57 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boiler_fuel_yearly_eia923.
RowCount(pudl.sqlite/denorm_boiler_fuel_yearly_eia923): 35
2023-06-25 17:37:58 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_boilers_eia.
2023-06-25 17:37:58 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_cash_flow_ferc1.
2023-06-25 17:37:59 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_depreciation_amortization_summary_ferc1.
2023-06-25 17:38:00 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_energy_dispositions_ferc1.
2023-06-25 17:38:00 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_energy_sources_ferc1.
2023-06-25 17:38:00 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_operating_expenses_ferc1.
2023-06-25 17:38:02 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_operating_revenues_ferc1.
2023-06-25 17:38:03 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_plant_depreciation_changes_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electric_plant_depreciation_functional_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_electricity_sales_by_rate_schedule_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_emissions_control_equipment_eia860.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_by_plant_ferc1.
2023-06-25 17:38:04 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_ferc1.
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_receipts_costs_eia923.
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_eia923): 11494
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_receipts_costs_monthly_eia923.
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_monthly_eia923): 1953
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_fuel_receipts_costs_yearly_eia923.
RowCount(pudl.sqlite/denorm_fuel_receipts_costs_yearly_eia923): 220
2023-06-25 17:38:05 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_eia923.
RowCount(pudl.sqlite/denorm_generation_eia923): 264
2023-06-25 17:38:06 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_fuel_combined_eia923.
RowCount(pudl.sqlite/denorm_generation_fuel_combined_eia923): 2794
2023-06-25 17:38:10 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_fuel_combined_monthly_eia923.
RowCount(pudl.sqlite/denorm_generation_fuel_combined_monthly_eia923): 2794
2023-06-25 17:38:13 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_fuel_combined_yearly_eia923.
RowCount(pudl.sqlite/denorm_generation_fuel_combined_yearly_eia923): 192
2023-06-25 17:38:13 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_monthly_eia923.
RowCount(pudl.sqlite/denorm_generation_monthly_eia923): 264
2023-06-25 17:38:15 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generation_yearly_eia923.
RowCount(pudl.sqlite/denorm_generation_yearly_eia923): 22
2023-06-25 17:38:15 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_generators_eia.
2023-06-25 17:38:15 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_income_statement_ferc1.
2023-06-25 17:38:17 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_other_regulatory_liabilities_ferc1.
2023-06-25 17:38:17 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_ownership_eia860.
2023-06-25 17:38:17 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plant_in_service_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_all_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_eia.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_hydro_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_pumped_storage_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_small_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_steam_ferc1.
2023-06-25 17:38:19 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_utilities_eia.
RowCount(pudl.sqlite/denorm_plants_utilities_eia): 83
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_plants_utilities_ferc1.
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_purchased_power_ferc1.
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_retained_earnings_ferc1.
2023-06-25 17:38:20 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_transmission_statistics_ferc1.
2023-06-25 17:38:21 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_utilities_eia.
RowCount(pudl.sqlite/denorm_utilities_eia): 2076
2023-06-25 17:38:21 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/denorm_utility_plant_summary_ferc1.
2023-06-25 17:38:22 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/depreciation_amortization_summary_ferc1.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distributed_generation_fuel_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distributed_generation_misc_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distributed_generation_tech_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/distribution_systems_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/dynamic_pricing_eia861.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_energy_dispositions_ferc1.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_energy_sources_ferc1.
2023-06-25 17:38:23 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_operating_expenses_ferc1.
2023-06-25 17:38:25 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_operating_revenues_ferc1.
2023-06-25 17:38:25 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_plant_depreciation_changes_ferc1.
2023-06-25 17:38:26 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electric_plant_depreciation_functional_ferc1.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/electricity_sales_by_rate_schedule_ferc1.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/emissions_control_equipment_eia860.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/energy_efficiency_eia861.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/energy_sources_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/environmental_equipment_manufacturers_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/epacamd_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/epacamd_eia_subplant_ids.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/ferc_accounts.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/firing_types_eia.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_ferc1.
2023-06-25 17:38:27 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_receipts_costs_aggs_eia.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_receipts_costs_eia923.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_transportation_modes_eia.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/fuel_types_aer_eia.
2023-06-25 17:38:28 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generation_eia923.
2023-06-25 17:38:29 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generation_fuel_eia923.
2023-06-25 17:38:35 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generation_fuel_nuclear_eia923.
2023-06-25 17:38:35 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generators_eia860.
2023-06-25 17:38:36 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/generators_entity_eia.
2023-06-25 17:38:36 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/green_pricing_eia861.
2023-06-25 17:38:36 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/income_statement_ferc1.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/mercury_compliance_strategies_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/mergers_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/momentary_interruptions_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/net_metering_customer_fuel_class_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/net_metering_misc_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/non_net_metering_customer_fuel_class_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/non_net_metering_misc_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/nox_compliance_strategies_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/nox_control_status_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/nox_units_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/operational_data_misc_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/operational_data_revenue_eia861.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/operational_status_eia.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/other_regulatory_liabilities_ferc1.
2023-06-25 17:38:38 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/ownership_eia860.
2023-06-25 17:38:39 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/particulate_compliance_strategies_eia.
2023-06-25 17:38:39 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/particulate_units_eia.
2023-06-25 17:38:39 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plant_in_service_ferc1.
2023-06-25 17:38:40 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_eia.
2023-06-25 17:38:40 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_eia860.
2023-06-25 17:38:40 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_entity_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_hydro_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_pudl.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_pumped_storage_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_small_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/plants_steam_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/political_subdivisions.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/power_purchase_types_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/prime_movers_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/purchased_power_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/regulations_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/reliability_eia861.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/reporting_frequencies_eia.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/respondent_id_ferc714.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/retained_earnings_ferc1.
2023-06-25 17:38:41 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/sales_eia861.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/sector_consolidated_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/service_territory_eia861.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/so2_compliance_strategies_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/so2_units_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/steam_plant_types_eia.
2023-06-25 17:38:43 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/transmission_statistics_ferc1.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_eia.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_eia860.
RowCount(pudl.sqlite/utilities_eia860): 2076
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_entity_eia.
RowCount(pudl.sqlite/utilities_entity_eia): 1
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_ferc1.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_ferc1_dbf.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_ferc1_xbrl.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utilities_pudl.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_assn_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_data_misc_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_data_nerc_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_data_rto_eia861.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_plant_assn.
2023-06-25 17:38:44 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/utility_plant_summary_ferc1.
2023-06-25 17:38:45 [    INFO] catalystcoop.pudl.output_diff.cli:331 Analyzing rows of pudl.sqlite/wet_dry_bottom_eia.

(It also found differences in ferc2.sqlite, but I think that those occur because resource constraints on my machine meant that not everything could run.)

Anyhow, in the output above I chose table differences that I felt pretty confident I could identify in PudlTabl. I started with denorm_plants_utilities_eia, which appears in PudlTabl and is apparently accessed via pu_eia860(), and put it into the Jupyter code above, which gives the output below. First, the right-versus-left difference in rows is 83, the same number output_diff gave, so that much is a good sign. But it means there are rows in the bug-509 output that don't appear in dev, which is not what I would expect. Second, some of the rows have changed. E.g., the first row of left_only_df corresponds to the second row of right_only_df, with the same plant IDs and plant name, but the utility ID changes. Since utility_id_eia is one of the fields that's no longer dropped, it makes sense that this would change. But I would have expected it to take on some null value in dev code, which doesn't seem to be the case. Moreover, the utility_name_eia changes (from American Electric Power Co Inc to Public Service Co of Oklahoma). A quick search indicates that the former is correct: https://en.wikipedia.org/wiki/Oklaunion_Power_Plant

So this is all worrisome, and it appears something may be going wrong. It would be good to get a quick check of this to see whether it is better than it appears to me, or whether I need to investigate more deeply.
image

@zaneselvans
Member

I'm not sure this is necessarily concerning. You're comparing the left_only and right_only dataframes which must, by definition, not match in some way. I don't think that the ordering is necessarily deterministic, unless you're sorting by index / value at some point.

I think it makes sense to start with one of the "base" tables rather than derived tables that are several steps downstream, since the tables that are the immediate outputs from the harvesting are going to be the first places that the consequences of the new process appear, and everything downstream will be a result of those changes.

The outputs of the harvesting process are the entity and annual tables for plants, generators, boilers, and utilities:

  • plants_entity_eia
  • plants_eia860
  • generators_entity_eia
  • generators_eia860
  • boilers_entity_eia
  • boilers_eia860
  • utilities_entity_eia
  • utilities_eia860

You might also want to pull the data directly from the database(s) rather than using PudlTabl just to make sure that there's nothing in that software layer that's causing a discrepancy.
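As a rough illustration of bypassing PudlTabl, the sketch below asks SQLite itself which rows of a table exist in one database but not the other, using only the standard library. The function name, the `other` alias, and the paths in the usage note are hypothetical and not part of PUDL. It assumes the table has the same columns in both files, and it sorts the result by the key columns so the output order is deterministic.

```python
import sqlite3


def rows_only_in_first(first_db, second_db, table, order_by):
    """Return rows of `table` that are in first_db but not in second_db.

    `order_by` is a comma-separated list of column names (for example the
    table's primary key) used to give the result a stable order.
    """
    conn = sqlite3.connect(first_db)
    try:
        # Attach the second database so both copies of the table can be
        # referenced in a single query.
        conn.execute("ATTACH DATABASE ? AS other", (second_db,))
        # EXCEPT compares whole rows and treats NULLs as equal to each
        # other, so rows that match apart from NULLs are not reported.
        query = (
            f'SELECT * FROM main."{table}" '
            f'EXCEPT SELECT * FROM other."{table}" '
            f"ORDER BY {order_by}"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()
```

Calling it as `rows_only_in_first("bug-509/pudl.sqlite", "dev/pudl.sqlite", "utilities_eia860", "utility_id_eia, report_date")` (hypothetical paths) would list the rows that only the branch produces; swapping the first two arguments gives the rows that only dev produces.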

@knordback
Collaborator Author

Here's a selection of the differences I'm seeing. This is not comprehensive, as some of the tables have many differences and I haven't looked at all of them.

  • plants_entity_eia:
    • mostly changes in plants_entity_eia: many from <NA> to something that looks valid; others seem like just different versions of the name
  • plants_eia860:
    • utility_id_eia value 15143 replaced with <NA>
    • nerc_region value SERC replaced with WECC
    • in general, lots and lots of changes to nerc_region, some of which seem like improvements, some of which seem like regressions
  • generators_entity_eia:
    • no differences
  • generators_eia860:
    • seems to all be changes in utility_id_eia value from <NA> to something credible-looking
  • boilers_entity_eia:
    • no differences
  • boilers_eia860:
    • no differences
  • utilities_entity_eia:
    • one additional entry in the table, that looks generally credible
    • also lots of changes to utility_name_eia
  • utilities_eia860:
    • 2076 additional entries, all with what look like reasonable utility_id_eia and report_date values, and all other entries <NA>

The following files show the differences. (I should have named them better: *-left.txt come from dev; *-right.txt come from the bug-509 branch.) They are generated with code that loads the two pudl.sqlite files, extracts the referenced tables into two DataFrames, diffs them series by series, and writes the differences out as text. The diffs are viewable by comparing foo-left.txt and foo-right.txt using a visual differencer (tkdiff, meld, etc.)
utilities_eia860-right.txt
utilities_eia860-left.txt
utilities_entity_eia-right.txt
utilities_entity_eia-left.txt
generators_eia860-right.txt
generators_eia860-left.txt
plants_entity_eia-right.txt
plants_entity_eia-left.txt
plants_eia860-right.txt
plants_eia860-left.txt
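The script that produced these files is not shown in the thread. As a loose, standard-library-only approximation of a column-by-column comparison (it does not use DataFrames, and the function names are hypothetical), the sketch below matches rows on a set of key columns and reports every column whose value differs between the two databases. It assumes the key columns are non-null and uniquely identify a row, and it ignores rows that exist on only one side.

```python
import sqlite3


def load_keyed_rows(db_path, table, key_cols):
    """Map each row's key tuple to a {column: value} dict."""
    conn = sqlite3.connect(db_path)
    conn.row_factory = sqlite3.Row
    try:
        cursor = conn.execute(f'SELECT * FROM "{table}"')
        # Convert rows to plain dicts before the connection is closed.
        return {tuple(row[k] for k in key_cols): dict(row) for row in cursor}
    finally:
        conn.close()


def column_changes(left_db, right_db, table, key_cols):
    """Describe per-column value changes for keys present in both files."""
    left = load_keyed_rows(left_db, table, key_cols)
    right = load_keyed_rows(right_db, table, key_cols)
    lines = []
    # Sorting the shared keys keeps the report order stable between runs.
    for key in sorted(left.keys() & right.keys()):
        for column, old in left[key].items():
            new = right[key].get(column)
            if old != new:  # NULL on both sides (None == None) is not a change
                lines.append(f"{key} {column}: {old!r} -> {new!r}")
    return lines
```

For a table such as plants_eia860 keyed on `("plant_id_eia", "report_date")`, each returned line names the key, the column, and the old and new values, which makes it easier to see at a glance whether a given column mostly moves from null to a value or the other way around.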

@bendnorman bendnorman added the requires-debug Things that have been worked on but hit an issue that requires debugging. label Jul 24, 2023
Base automatically changed from dev to main January 5, 2024 04:14
Labels
community data-cleaning Tasks related to cleaning & regularizing data during ETL. eia860 Anything having to do with EIA Form 860 eia923 Anything having to do with EIA Form 923 harvest Normalization of poorly normalized inputs and reconciliation of internal inconsistencies inframundo requires-debug Things that have been worked on but hit an issue that requires debugging.
Projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

Retain all harvestable fields during EIA transforms
5 participants