Create an asset factory to generate FERC1 output tables #3557

hfireborn · 2024-04-11T20:22:13Z

draft of asset factories for #3147.

Based on the tutorial in https://dagster.io/blog/python-factory-patterns, creates 3 assets using a factory:

out_ferc1__yearly_purchased_power_and_exchanges_sched326
out_ferc1__yearly_plant_in_service_sched204
out_ferc1__yearly_balance_sheet_assets_sched110

all of which just merge and then organize columns.

a few questions:

Is this the type of solution you were looking for?
Double-checking: is it OK to omit the organize_cols parts of those functions?
We are currently omitting the pd.DataFrame input type specification in these assets. Is this OK?

For more information, see https://pre-commit.ci

zaneselvans

This is generally the shape of the solution we're looking for, but see my inline comments in the PR.

It looks like you don't have the pre-commit hooks installed locally, or the linters set up in VS Code, as the pre-commit checks and unit tests are failing on the PR, so you'll want to get that set up. The documentation build is also failing. It looks like you've got an undefined variable.

You'll definitely need to try running the ETL via Dagster locally to know whether your changes are working, and debug issues. You'll probably need to run it many many times, but you can just run the portion of the DAG that you're working on, in this case the FERC 1 related parts.

zaneselvans · 2024-04-12T03:27:36Z

src/pudl/output/ferc1.py

+ # pudl.helpers.organize_cols,
+ # [
+ # "report_year",
+ # "utility_id_ferc1",
+ # "utility_id_pudl",
+ # "utility_name_ferc1",
+ # "seller_name",
+ # "record_id",
+ # ],
+ # )


Yes, the organize_cols calls can and should be removed from all of the asset definitions in this module, since the database schema is now what determines the ordering of the columns.

@zaneselvans If all of the helpers.organize_cols can be removed, is it necessary to make an asset factory for this? It looks to me like all of the functions are essentially doing the same thing then; they take two DataFrames as inputs, merge them on something hard-coded and the same across all the functions, and then return.

Is there a reason why we can't just simplify them all into one function and change where the functions are called?

We do need an asset factory that pumps out the function definitions, because those @asset decorated function definitions and their inputs and outputs are what define the dependency graph in Dagster. It might help to read some background on Dagster's software defined assets or some of their examples, and also to try running the ETL through the Dagster UI locally if you haven't already. It looks like they've got some videos too.

Of course each of the assets (functions) defined by the factory will be very similar (which is why we want to use a factory) but it's the additional information from the decorator or the arguments and return value that stitch the functions and the data flowing through them together.
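
To make the point about arguments concrete, here is a minimal toy sketch in plain Python. It is not Dagster's implementation: the `toy_asset` decorator and the `REGISTRY` dict are hypothetical stand-ins that only mimic the idea of reading upstream dependencies off a function's parameter names.

```python
import inspect

# Hypothetical stand-in for an asset registry; Dagster's real machinery differs.
REGISTRY: dict[str, list[str]] = {}


def toy_asset(func):
    """Record the function's parameter names as its upstream dependencies."""
    REGISTRY[func.__name__] = list(inspect.signature(func).parameters)
    return func


@toy_asset
def core_table():
    return [{"utility_id_ferc1": 1}]


@toy_asset
def out_table(core_table):
    # The parameter name is what ties this function to core_table.
    return core_table


@toy_asset
def disconnected_table():
    # No parameters, so nothing upstream is recorded, even though the body
    # reaches for core_table directly.
    return core_table()
```

In this toy model `REGISTRY["out_table"]` lists `core_table` as a dependency, while `REGISTRY["disconnected_table"]` is empty. That mirrors why an inner asset function with no arguments gives the orchestrator nothing to build a graph from.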

zaneselvans · 2024-04-12T03:30:24Z

src/pudl/output/ferc1.py

+# draft asset_factory
+def generate_asset_factory(spec):
+ @asset(io_manager_key="pudl_io_manager", compute_kind="Python")
+ def _asset() -> pd.DataFrame:


The function which is used to define the asset needs to have arguments, because that's how Dagster infers the dependencies between assets. For more complex cases they can also be defined with arguments in the asset decorator, but I don't think that's necessary here. You can't just add them inside the inner function definition.

Search for "_asset_factory" in the code base to find other examples of how we're using this pattern.

zaneselvans · 2024-04-12T03:31:43Z

src/pudl/output/ferc1.py

+ {
+ "name": "out_ferc1__yearly_purchased_power_and_exchanges_sched326",
+ "df1": "core_ferc1__yearly_purchased_power_and_exchanges_sched326",
+ "df2": "core_pudl__assn_ferc1_pudl_utilities",


Note that in all cases, df2 is the same. So it doesn't need to be part of these inputs. It can be hard-coded as one of the input assets that are used by the asset definition (I think it's used in all of the FERC 1 output assets)

zaneselvans · 2024-04-12T03:33:36Z

src/pudl/output/ferc1.py

+## example/draft factory pattern
+specs = [
+ {
+ "name": "out_ferc1__yearly_purchased_power_and_exchanges_sched326",


Note that in all cases, there's an easy way to construct the input (core) and output asset names, given just a single string, (purchased_power_and_exchanges_sched326 in this case) -- they all have the same prefix structure. So you only need to store that string.
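
As a rough sketch of that naming idea: the helper below is hypothetical (it is not part of PUDL) and assumes the stored string keeps the `yearly_` part of the name, so that the prefixes are exactly `core_ferc1__` and `out_ferc1__`.

```python
def ferc1_asset_names(table_name: str) -> tuple[str, str]:
    """Return the (core, out) asset names for a prefix-free table name.

    Hypothetical helper for illustration only.
    """
    return f"core_ferc1__{table_name}", f"out_ferc1__{table_name}"


core_name, out_name = ferc1_asset_names(
    "yearly_purchased_power_and_exchanges_sched326"
)
```

With a helper like this, the `name` and `df1` entries in the spec collapse into one stored string per table.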

zaneselvans · 2024-04-12T03:34:50Z

src/pudl/output/ferc1.py

+ "name": "out_ferc1__yearly_purchased_power_and_exchanges_sched326",
+ "df1": "core_ferc1__yearly_purchased_power_and_exchanges_sched326",
+ "df2": "core_pudl__assn_ferc1_pudl_utilities",
+ "mg": "utility_id_ferc1",


Similarly, they all seem to get merged on the same column, so this doesn't need to be part of the specification -- it can just be hard-coded inside the inner factory function definition (unless there are some tables that need a different merge key)
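
If some table did turn out to need a different key, one option is a factory argument with a default. The sketch below is a plain-Python stand-in, not the PR's code: it uses lists of dicts instead of pandas DataFrames, `merge_factory` is a hypothetical name, and the sample rows are made up.

```python
from typing import Callable

Rows = list[dict]


def merge_factory(merge_key: str = "utility_id_ferc1") -> Callable[[Rows, Rows], Rows]:
    """Build an inner-join function on merge_key (hypothetical helper)."""

    def _merge(left: Rows, right: Rows) -> Rows:
        # Index the right-hand rows by key, then keep left rows with a match.
        lookup: dict = {}
        for row in right:
            lookup.setdefault(row[merge_key], []).append(row)
        return [
            {**l_row, **r_row}
            for l_row in left
            for r_row in lookup.get(l_row[merge_key], [])
        ]

    return _merge


# Made-up sample rows using column names that appear in this thread.
core = [
    {"utility_id_ferc1": 1, "record_id": "a"},
    {"utility_id_ferc1": 2, "record_id": "b"},
]
utilities = [{"utility_id_ferc1": 1, "utility_id_pudl": 10}]

merged = merge_factory()(core, utilities)
```

Callers that are happy with the common key never mention it, and an exceptional table could pass its own key without changing the shared specification.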

hfireborn · 2024-04-24T18:58:03Z

We've created an asset factory that appeared to work locally on dagster. When trying to push our code, we ran into some errors getting our codebase up to date that we are currently resolving.

hfireborn · 2024-04-24T19:18:01Z

We've created an asset factory that appeared to work locally on dagster. When trying to push our code, we ran into some errors getting our codebase up to date that we are currently resolving.

@zaneselvans For some reason we are failing unit tests when we try to commit that seem unrelated to the code in output/ferc1.py we were working with.

=========================== short test summary info ============================
FAILED test/unit/workspace/datastore_test.py::TestDatapackageDescriptor::test_get_partition_filters - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestDatapackageDescriptor::test_get_resource_path - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestDatapackageDescriptor::test_get_resources_filtering - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestDatapackageDescriptor::test_json_string_representation - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestDatapackageDescriptor::test_modernize_zenodo_legacy_api_url - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_doi_format_is_correct - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_doi_of_prod_epacems_matches - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_get_descriptor_http_calls - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_get_known_datasets - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_get_resource - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_get_resource_with_invalid_checksum - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_get_resource_with_nonexistent_resource_fails - TypeError: Package.metadata_validate() missing 1 required positional argume...
FAILED test/unit/workspace/datastore_test.py::TestZenodoFetcher::test_get_unknown_dataset - TypeError: Package.metadata_validate() missing 1 required positional argume...
= 13 failed, 1367 passed, 1 skipped, 3 deselected, 9 xfailed, 44 warnings in 32.77s =

conda-lock...........................................(no files to check)Skipped

We thought it was a package update issue, guessing that we were pulling code from the remote but not updating the packages that accompany them, so we tried:

make conda-clean
make conda-lock.yml
make install-pudl

but we are still receiving these errors. Not sure where to go from here, any help is appreciated!

zaneselvans · 2024-04-25T14:45:12Z

Those unit test errors look like they result from having an older version of the frictionless package installed in your environment. We recently updated from v4 to v5.

I would remove any local changes to the conda lockfiles, remove your existing pudl-dev environment, make sure your local branch has all the most recent changes from main upstream, and rebuild the conda environment, which should look something like:

git checkout -- environments
mamba deactivate
mamba env remove -n pudl-dev
git pull
make install-pudl
mamba activate pudl-dev

Given the changes you're making, you shouldn't need to run make conda-clean or make conda-lock.yml, which will update/change the locked dependencies.
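
After rebuilding, a quick way to confirm which frictionless version the active environment actually has is to ask the standard library. This is only a sketch: `installed_major_version` is a hypothetical helper, and the v5 expectation is taken from the comment above rather than from the lockfile.

```python
from importlib import metadata
from typing import Optional


def installed_major_version(package: str) -> Optional[int]:
    """Return the installed major version of package, or None if absent."""
    try:
        return int(metadata.version(package).split(".")[0])
    except metadata.PackageNotFoundError:
        return None


major = installed_major_version("frictionless")
if major is None:
    print("frictionless is not installed in this environment")
elif major < 5:
    print(f"frictionless v{major} found; this thread suggests v5 is expected")
else:
    print(f"frictionless v{major} found")
```

Run it with the pudl-dev environment activated, so that it inspects the same interpreter the unit tests use.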

zaneselvans · 2024-04-25T14:47:26Z

src/pudl/output/ferc1.py

+specs = [
+ {"name": "ferc1__yearly_purchased_power_and_exchanges_sched326"},
+ {"name": "ferc1__yearly_plant_in_service_sched204"},
+ {"name": "ferc1__yearly_balance_sheet_assets_sched110"},
+]


I don't think this needs to be stored in a separate variable or a dictionary. The list of AssetsDefinitions can be built using a list comprehension that iterates over a list of strings (the base table name).

Please remove the specs variable and just list the table names, with no prefix, in a list comprehension that generates the assets.

zaneselvans · 2024-04-25T15:02:47Z

src/pudl/output/ferc1.py

+def generate_asset_factory(spec) -> AssetsDefinition:
+ var_name = "core_" + spec["name"]
+ core_ = globals()[var_name]
+
+ @asset(
+ name=f"_out_{get_core_ferc1_asset_description(spec["name"])}",
+ io_manager_key="pudl_io_manager",
+ compute_kind="Python",
+ )
+ def _asset(
+ core_: pd.DataFrame,
+ core_pudl__assn_ferc1_pudl_utilities: pd.DataFrame,
+ ) -> pd.DataFrame:
+ """Generate a dataframe for {} asset specification.""".format(spec["name"])
+ return_df = core_.merge(
+ core_pudl__assn_ferc1_pudl_utilities, on="utility_id_ferc1"
+ )
+ return return_df
+
+ return _asset


A more specific name would be helpful, and I think this can be as simple as just taking a string as input:

Suggested change

-def generate_asset_factory(spec) -> AssetsDefinition:
-    var_name = "core_" + spec["name"]
-    core_ = globals()[var_name]
-
-    @asset(
-        name=f"_out_{get_core_ferc1_asset_description(spec["name"])}",
-        io_manager_key="pudl_io_manager",
-        compute_kind="Python",
-    )
-    def _asset(
-        core_: pd.DataFrame,
-        core_pudl__assn_ferc1_pudl_utilities: pd.DataFrame,
-    ) -> pd.DataFrame:
-        """Generate a dataframe for {} asset specification.""".format(spec["name"])
-        return_df = core_.merge(
-            core_pudl__assn_ferc1_pudl_utilities, on="utility_id_ferc1"
-        )
-        return return_df
-
-    return _asset
+def ferc1_output_asset_factory(table_name: str) -> AssetsDefinition:
+    @asset(
+        name=f"out_ferc1__{table_name}",
+        ins={
+            "core_pudl__assn_ferc1_pudl_utilities": AssetIn(),
+            f"core_ferc1__{table_name}": AssetIn(),
+        },
+        io_manager_key="pudl_io_manager",
+        compute_kind="Python",
+    )
+    def _ferc1_output_asset(**kwargs) -> pd.DataFrame:
+        f"""Generate a dataframe for out_ferc1__{table_name} asset specification."""
+        # Both inputs arrive through kwargs, keyed by the names given in ins.
+        return kwargs[f"core_ferc1__{table_name}"].merge(
+            kwargs["core_pudl__assn_ferc1_pudl_utilities"], on="utility_id_ferc1"
+        )
+
+    return _ferc1_output_asset

zaneselvans · 2024-04-25T15:06:12Z

src/pudl/output/ferc1.py

+def create_generated_assets() -> list[AssetsDefinition]:
+ """Create a list of generated FERC Form 1 assets.
+
+ Returns:
+ A list of :class:`AssetsDefinitions` where each asset is an generated FERC Form 1
+ table.
+ """
+ return [generate_asset_factory(**kwargs) for kwargs in specs]
+
+
+exploded_ferc1_assets = create_generated_assets()


I don't think a whole separate function is needed here, we can generate the AssetsDefinitions for dagster to pick up on module import with something like:

ferc1_output_assets = [
    ferc1_output_asset_factory(table_name)
    for table_name in [
        "yearly_purchased_power_and_exchanges_sched326",
        "yearly_plant_in_service_sched204",
        "yearly_balance_sheet_assets_sched110",
    ]
]
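
One reason calling a factory inside the comprehension works well is Python's closure behavior: each factory call binds its own `table_name`, whereas closures written directly in a comprehension share the loop variable. The sketch below is a toy illustration in plain Python; `name_reporter_factory` is a hypothetical stand-in for the asset factory and returns a string rather than an AssetsDefinition.

```python
from typing import Callable


def name_reporter_factory(table_name: str) -> Callable[[], str]:
    """Return a function bound to its own table_name (toy stand-in for an asset)."""

    def _report() -> str:
        return f"out_ferc1__{table_name}"

    return _report


table_names = [
    "yearly_purchased_power_and_exchanges_sched326",
    "yearly_plant_in_service_sched204",
    "yearly_balance_sheet_assets_sched110",
]

# Each call to the factory gets its own table_name binding.
from_factory = [name_reporter_factory(name) for name in table_names]

# Closures defined directly in the comprehension all share the loop variable,
# so every one of them sees its final value.
from_lambdas = [lambda: f"out_ferc1__{name}" for name in table_names]

factory_results = [f() for f in from_factory]
lambda_results = [f() for f in from_lambdas]
```

Here `factory_results` contains three distinct names, while every entry of `lambda_results` is the name built from the last table in the list.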

hfireborn · 2024-04-25T20:05:40Z

@zaneselvans our most recent commit, from right before our meeting, includes further changes and fixes the merge conflicts, if you are able to take a look at that.

zaneselvans · 2024-04-25T22:16:30Z

src/pudl/output/ferc1.py

+specs = [
+ {"name": "ferc1__yearly_purchased_power_and_exchanges_sched326"},
+ {"name": "ferc1__yearly_plant_in_service_sched204"},
+ {"name": "ferc1__yearly_balance_sheet_assets_sched110"},
+]


Please remove the specs variable and just list the table names, with no prefix, in a list comprehension that generates the assets.

zaneselvans · 2024-04-25T22:18:52Z

src/pudl/output/ferc1.py

+ def _asset(
+ **kwargs: dict[str, pd.DataFrame],
+ ) -> pd.DataFrame:
+ """Generate a dataframe for {} asset specification.""".format(spec["name"])


Please use f-string formatting here.
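
One caveat worth checking, offered as a sketch rather than a verdict on the request above: in Python, an f-string in the docstring position is evaluated as an ordinary expression and is not stored in `__doc__` (the same is true of a `"...".format(...)` expression). If the description needs to vary per table, assigning `__doc__` after the definition is one option. The function names below are hypothetical, and how Dagster consumes the docstring in this setup is not verified here.

```python
def with_fstring(table_name: str):
    def _asset():
        f"""Generate a dataframe for out_ferc1__{table_name}."""
        return None

    return _asset


def with_assigned_doc(table_name: str):
    def _asset():
        return None

    # Assigning __doc__ after definition is one way to get a dynamic docstring.
    _asset.__doc__ = f"Generate a dataframe for out_ferc1__{table_name}."
    return _asset


no_doc = with_fstring("yearly_plant_in_service_sched204")
has_doc = with_assigned_doc("yearly_plant_in_service_sched204")
```

In this sketch `no_doc.__doc__` is `None`, while `has_doc.__doc__` holds the formatted text.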

zaneselvans · 2024-04-25T22:19:59Z

src/pudl/output/ferc1.py

- )
- return out_ferc1__yearly_balance_sheet_assets_sched110
+
+def create_generated_assets() -> list[AssetsDefinition]:


Please remove this function and use a list comprehension that iterates over the table names with no prefix to call the asset factory function.

zaneselvans · 2024-04-25T22:22:49Z

src/pudl/output/ferc1.py

- "record_id",
- ],
+# draft asset_factory
+def generate_asset_factory(spec) -> AssetsDefinition:


Please rename to ferc1_output_asset_factory and have the factory depend on a string, which is the table name with no prefix, rather than spec.

hfireborn and others added 2 commits April 11, 2024 16:11

catalyst-cooperative#3147 draft1

d499764

[pre-commit.ci] auto fixes from pre-commit.com hooks

a5e46e9

zaneselvans added the community, ferc1, and dagster labels Apr 12, 2024

zaneselvans linked an issue Apr 12, 2024 that may be closed by this pull request

Consolidate ferc1 outputs using Dagster asset factories #3147

Open

zaneselvans requested changes Apr 12, 2024

zaneselvans changed the title ~~#3147 draft1~~ Create an asset factory to generate FERC1 output tables Apr 16, 2024

hfireborn and others added 5 commits April 21, 2024 18:25

cleaned up with linters and ran from container for precommit hooks

bfc3502

testing precommit hooks

879e59e

[pre-commit.ci] auto fixes from pre-commit.com hooks

8189132

Merge branch 'main' into main

8275fd4

updated draft of asset factory solution

c902417

zaneselvans requested changes Apr 25, 2024

fixing merge conflicts

b4799ea

Merge branch 'main' into main

6828f5a

zaneselvans requested changes Apr 25, 2024
