FERC 714: transform of hourly demand table (dbf +xbrl) (#3842)
* first very wip draft of transforming the hourly 714 table

* early processing of datetimes and initial cleaning of timezone codes

* lil function suffix cleanup

* group the table-specific transforms into staticmethods of a table transform class

* yay add the hour into the csv report_date early so i'm not oopsies losing all the report_dates plus lots of documentation

* lil extra doc clean

* Map FERC 714 XBRL and CSV IDs (#3849)

* Add respondent ID csv

* Add notes columns to CSV

* Preliminary fixes to the 714 data source page

* integrate the respondent_id_ferc714 map into transforms

* Add notes on CSV-XBRL ID linkage to docs

* Write preliminary transform class and function for XBRL and CSV core_ferc714__yearly_planning_area_demand_forecast table

* wip first round of respondent table transforming

* Combine XBRL and CSV tables

* Add forecast to forecast column names

* Add migration file for new forecast cols

* finish eia_code mapping and wrap up transforms

* update docs

* update docs again lol spaces

* fix forcast -> forecast typo and add to run() docstring

* convert :meth: to :func:

* lower expected forecast year range

* fix docs typo

* Use split/apply/combine for deduping and update assertion

* responding to pr comments mostly doc updates

* Add new years to Ferc714CheckSpec

* update docs

* first pass of adding respondent id tables

* add alembic migration for the glue tables

* remove the lil post process step

* Light edits

* release notes and metadata updates

* Add table description for annual forecast table and fix indentation errors

* update docs and metadata, plus stop trying to impute midnight jan 1st 2024

* update the validation test expectations for the analysis downstream stuff

* update the settinggggsss omigosh plus restrict the imputations based on the years processed

* add module-level design notes

* add more color to the fast test 12 assertion

* remove the lil context thing that is no longer necessary

---------

Co-authored-by: E. Belfer <37471869+e-belfer@users.noreply.github.com>
Co-authored-by: Austen Sharpe <austensharpe@gmail.com>
Co-authored-by: e-belfer <ella.belfer@catalyst.coop>
Co-authored-by: Austen Sharpe <49878195+aesharpe@users.noreply.github.com>
5 people committed Sep 25, 2024
1 parent ee74680 commit b291160
Showing 22 changed files with 1,769 additions and 560 deletions.
10 changes: 10 additions & 0 deletions docs/release_notes.rst
@@ -6,6 +6,16 @@ PUDL Release Notes
v2024.X.x (2024-XX-XX)
---------------------------------------------------------------------------------------

New Data Coverage
^^^^^^^^^^^^^^^^^

FERC Form 714
~~~~~~~~~~~~~
* Integrate the 2021-2023 years of the FERC Form 714 data. FERC updated its
  reporting format for 2021 from CSV files to XBRL files. This update integrates
  the two raw data sources and extends the data coverage through 2023. See
  :issue:`3809` and :pr:`3842`.

Schema Changes
^^^^^^^^^^^^^^
* Added :ref:`out_eia__yearly_assn_plant_parts_plant_gen` table. This table associates
76 changes: 57 additions & 19 deletions docs/templates/ferc714_child.rst.jinja
@@ -1,6 +1,9 @@
{% extends "data_source_parent.rst.jinja" %}

{% block background %}
FERC Form 714, otherwise known as the Annual Electric Balancing Authority Area and
Planning Area Report, collects data and provides insights about balancing authority
area and planning area operations.

{% endblock %}

@@ -13,28 +16,21 @@
{% block availability %}
The data we've integrated from FERC Form 714 includes:

* Hourly electricity demand by utility or balancing authority.
* Annual demand forecast.
* A table identifying the form respondents including their EIA utility or balancing
authority ID, which allows us to link the FERC-714 data to other information
reported in :doc:`eia860` and :doc:`eia861`.

We have not yet had the opportunity to work with the most recent FERC-714 data (2021 and
later), which is now being published using the new XBRL format.

The hourly demand data for 2006-2020 is about 15 million records. There are about 200
respondents that show up in the respondents table.

With the EIA IDs we can link the hourly electricity demand to a particular geographic
region at the county level because utilities and balancing authorities report their
service territories in :ref:`core_eia861__yearly_service_territory`. From that
information we estimate historical hourly electricity demand by state.

Plant operators reported in :ref:`core_eia860__scd_plants` and generator ownership
information reported in :ref:`core_eia860__scd_ownership` are linked to
:ref:`core_eia860__scd_utilities` and :ref:`core_eia861__yearly_balancing_authority` and
can therefore be linked to the :ref:`core_ferc714__respondent_id` table.

{% endblock %}

@@ -56,32 +52,44 @@ formats:
* **2021-present**: Standardized electronic filing using the XBRL (eXtensible Business
Reporting Language) dialect of XML.

We only plan to integrate the data from the standardized electronic reporting era
(2006+) since the format of the earlier data varies for each reporting balancing authority
and utility, and would be very labor intensive to parse and reconcile.

{% endblock %}

{% block notable_irregularities %}

Timezone errors
---------------

The original hourly electricity demand time series is plagued with timezone and
daylight saving time vs. standard time irregularities, which we have done our best
to clean up. The timestamps in the clean data are all in UTC, with a timezone code
stored in a separate column, so that the times can be easily localized or
converted. It's certainly not perfect, but it's much better than the original data
and it's easy to work with!
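For example, the UTC timestamps can be localized with pandas using the stored
timezone code. This is only an illustrative sketch: the column names below are
hypothetical, not the exact PUDL schema.

```python
import pandas as pd

# Toy frame mimicking the cleaned hourly demand table: UTC timestamps
# plus a timezone code column (column names are illustrative).
df = pd.DataFrame({
    "datetime_utc": pd.to_datetime(
        ["2020-07-01 04:00", "2020-07-01 05:00"], utc=True
    ),
    "timezone": ["America/New_York", "America/New_York"],
    "demand_mwh": [950.0, 910.0],
})

# Localize each UTC timestamp to its respondent's timezone.
df["datetime_local"] = df.apply(
    lambda row: row["datetime_utc"].tz_convert(row["timezone"]), axis=1
)
print(df["datetime_local"].iloc[0])  # 2020-07-01 00:00:00-04:00
```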

Sign errors
-----------

Not all respondents use the same sign convention for reporting "demand." The vast
majority consider demand / load that they serve to be a positive number, and so we've
standardized the data to use that convention.
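A minimal sketch of that sign standardization, assuming a long-format table with
illustrative column names (this is not the actual PUDL transform code):

```python
import pandas as pd

# Toy demand table where respondent 2 reports served load as negative.
demand = pd.DataFrame({
    "respondent_id": [1, 1, 2, 2],
    "demand_mwh": [100.0, 120.0, -95.0, -110.0],
})

# Fraction of each respondent's nonzero values that are negative.
frac_neg = demand.groupby("respondent_id")["demand_mwh"].transform(
    lambda s: (s[s != 0] < 0).mean()
)

# Flip respondents that predominantly report negative demand, so that
# served load is positive for everyone.
demand.loc[frac_neg > 0.5, "demand_mwh"] *= -1
```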

Reporting gaps
--------------

There are a lot of reporting gaps, especially for smaller respondents. Sometimes these
are brief, and sometimes they are entire years. There are also a number of outliers and
suspicious values (e.g. a long series of identical consecutive values). We've
built tools to clean up these outliers in
:mod:`pudl.analysis.timeseries_cleaning`.
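One simple way to flag a long run of identical consecutive values, as a toy
stand-in for (not a copy of) the checks in
:mod:`pudl.analysis.timeseries_cleaning`:

```python
import pandas as pd

# Toy hourly demand series containing a suspicious flat stretch.
s = pd.Series([500.0, 510.0, 505.0, 505.0, 505.0, 505.0, 520.0])

# Label each run of consecutive identical values, then compute the
# length of the run each value belongs to.
run_id = s.ne(s.shift()).cumsum()
run_len = s.groupby(run_id).transform("size")

# Flag values in runs longer than a threshold (3 hours here).
suspicious = run_len > 3
```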

Respondent-to-balancing-authority inconsistencies
-------------------------------------------------

Because utilities and balancing authorities occasionally change their service
territories or merge, the demand reported by any individual "respondent" may correspond
to wildly different consumers in different years. To make it at least somewhat possible
to compare the reported data across time, we've also compiled historical service
territory maps for the respondents based on data reported in :doc:`eia861`. However,
@@ -93,4 +101,34 @@ be found in :mod:`pudl.analysis.service_territory` and :mod:`pudl.analysis.spati
The :mod:`pudl.analysis.state_demand` script brings together all of the above to
estimate historical hourly electricity demand by state for 2006-2020.

Combining XBRL and CSV data
---------------------------

The format of the company identifiers (CIDs) used in the CSV data (2006-2020) and
the XBRL data (2021+) differs. To link respondents across the two data formats, we
manually map the IDs from both datasets and record the result as a
``respondent_id_ferc714`` map in :mod:`pudl.package_data.glue.respondent_id_ferc714.csv`.

This CSV builds on the `migrated data
<https://www.ferc.gov/filing-forms/eforms-refresh/migrated-data-downloads>`__ provided
by FERC during the transition from CSV to XBRL data, which notes that:

Companies that did not have a CID prior to the migration have been assigned a CID that
begins with R, i.e., a temporary RID. These RIDs will be replaced in future with the
accurate CIDs and new datasets will be published.

The file names of the migrated data (which correspond to CSV IDs) and the
respondent CIDs in the migrated files provide the basis for ID mapping. Though
CIDs are intended to be static, some of the CIDs in the migrated data weren't
found in the actual XBRL data, and some respondents reported data under different
CIDs. To ensure accurate record matching, we manually reviewed the CIDs for each
respondent, matching based on name and location. Some quirks to note:

* Respondents are matched 1:1 between the CSV and XBRL data. Unmatched respondents
  mostly result from mergers, splits, acquisitions, and companies that no longer
  exist.
* Some CIDs assigned during the migration process do not appear in the data. Given the
intention by FERC to make these CIDs permanent, they are still included in the mapping
CSV in case these respondents re-appear. All temporary IDs (beginning with R) were
removed.
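A sketch of how such a map links the two eras, using the
``respondent_id_ferc714_csv`` and ``respondent_id_ferc714_xbrl`` columns from the
glue tables (the IDs and demand values below are made up for illustration):

```python
import pandas as pd

# Toy version of the manually compiled respondent ID map.
id_map = pd.DataFrame({
    "respondent_id_ferc714": [1, 2],
    "respondent_id_ferc714_csv": [101, 102],
    "respondent_id_ferc714_xbrl": ["C000123", "C000456"],
})

# Records from the two reporting eras, each keyed on its native ID.
csv_records = pd.DataFrame({
    "respondent_id_ferc714_csv": [101, 102],
    "demand_mwh": [1.0, 2.0],
})
xbrl_records = pd.DataFrame({
    "respondent_id_ferc714_xbrl": ["C000123"],
    "demand_mwh": [3.0],
})

# Attach the PUDL-assigned ID to each era's records, then concatenate
# so both eras share one stable respondent key.
csv_linked = csv_records.merge(id_map, on="respondent_id_ferc714_csv", how="left")
xbrl_linked = xbrl_records.merge(id_map, on="respondent_id_ferc714_xbrl", how="left")
combined = pd.concat([csv_linked, xbrl_linked], ignore_index=True)
```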

{% endblock %}
@@ -0,0 +1,91 @@
"""Add my cool lil respondent id glue tables and other 714 xbrl updates
Revision ID: 8fffc1d0399a
Revises: a93bdb8d4fbd
Create Date: 2024-09-24 09:28:45.862748
"""
from alembic import op
import sqlalchemy as sa


# revision identifiers, used by Alembic.
revision = '8fffc1d0399a'
down_revision = 'a93bdb8d4fbd'
branch_labels = None
depends_on = None


def upgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
op.create_table('core_pudl__assn_ferc714_pudl_respondents',
sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
sa.PrimaryKeyConstraint('respondent_id_ferc714', name=op.f('pk_core_pudl__assn_ferc714_pudl_respondents'))
)
op.create_table('core_pudl__assn_ferc714_csv_pudl_respondents',
sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
sa.Column('respondent_id_ferc714_csv', sa.Integer(), nullable=False, comment='FERC Form 714 respondent ID from CSV reported data - published from years: 2006-2020. This ID is linked to the newer years of reported XBRL data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as respondent_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'),
sa.ForeignKeyConstraint(['respondent_id_ferc714'], ['core_pudl__assn_ferc714_pudl_respondents.respondent_id_ferc714'], name=op.f('fk_core_pudl__assn_ferc714_csv_pudl_respondents_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents')),
sa.PrimaryKeyConstraint('respondent_id_ferc714', 'respondent_id_ferc714_csv', name=op.f('pk_core_pudl__assn_ferc714_csv_pudl_respondents'))
)
op.create_table('core_pudl__assn_ferc714_xbrl_pudl_respondents',
sa.Column('respondent_id_ferc714', sa.Integer(), nullable=False, comment='PUDL-assigned ID identifying a respondent to FERC Form 714. This ID associates natively reported respondent IDs from the original CSV and XBRL data sources.'),
sa.Column('respondent_id_ferc714_xbrl', sa.Text(), nullable=False, comment='FERC Form 714 respondent ID from XBRL reported data - published from years: 2021-present. This ID is linked to the older years of reported CSV data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as entity_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'),
sa.ForeignKeyConstraint(['respondent_id_ferc714'], ['core_pudl__assn_ferc714_pudl_respondents.respondent_id_ferc714'], name=op.f('fk_core_pudl__assn_ferc714_xbrl_pudl_respondents_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents')),
sa.PrimaryKeyConstraint('respondent_id_ferc714', 'respondent_id_ferc714_xbrl', name=op.f('pk_core_pudl__assn_ferc714_xbrl_pudl_respondents'))
)
with op.batch_alter_table('core_ferc714__respondent_id', schema=None) as batch_op:
batch_op.add_column(sa.Column('respondent_id_ferc714_csv', sa.Integer(), nullable=True, comment='FERC Form 714 respondent ID from CSV reported data - published from years: 2006-2020. This ID is linked to the newer years of reported XBRL data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as respondent_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'))
batch_op.add_column(sa.Column('respondent_id_ferc714_xbrl', sa.Text(), nullable=True, comment='FERC Form 714 respondent ID from XBRL reported data - published from years: 2021-present. This ID is linked to the older years of reported CSV data through the PUDL-assigned respondent_id_ferc714 ID. This ID was originally reported as entity_id. Note that this ID does not correspond to FERC respondent IDs from other forms.'))
batch_op.create_foreign_key(batch_op.f('fk_core_ferc714__respondent_id_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

with op.batch_alter_table('core_ferc714__yearly_planning_area_demand_forecast', schema=None) as batch_op:
batch_op.add_column(sa.Column('summer_peak_demand_forecast_mw', sa.Float(), nullable=True, comment='The maximum forecasted hourly summer load (for the months of June through September).'))
batch_op.add_column(sa.Column('winter_peak_demand_forecast_mw', sa.Float(), nullable=True, comment='The maximum forecasted hourly winter load (for the months of January through March).'))
batch_op.add_column(sa.Column('net_demand_forecast_mwh', sa.Float(), nullable=True, comment='Net forecasted electricity demand for the specific period in megawatt-hours (MWh).'))
batch_op.drop_constraint('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
batch_op.create_foreign_key(batch_op.f('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])
batch_op.drop_column('summer_peak_demand_mw')
batch_op.drop_column('net_demand_mwh')
batch_op.drop_column('winter_peak_demand_mw')

with op.batch_alter_table('out_ferc714__respondents_with_fips', schema=None) as batch_op:
batch_op.drop_constraint('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
batch_op.create_foreign_key(batch_op.f('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

with op.batch_alter_table('out_ferc714__summarized_demand', schema=None) as batch_op:
batch_op.drop_constraint('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_ferc714__respondent_id', type_='foreignkey')
batch_op.create_foreign_key(batch_op.f('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), 'core_pudl__assn_ferc714_pudl_respondents', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

# ### end Alembic commands ###


def downgrade() -> None:
# ### commands auto generated by Alembic - please adjust! ###
with op.batch_alter_table('out_ferc714__summarized_demand', schema=None) as batch_op:
batch_op.drop_constraint(batch_op.f('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
batch_op.create_foreign_key('fk_out_ferc714__summarized_demand_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

with op.batch_alter_table('out_ferc714__respondents_with_fips', schema=None) as batch_op:
batch_op.drop_constraint(batch_op.f('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
batch_op.create_foreign_key('fk_out_ferc714__respondents_with_fips_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])

with op.batch_alter_table('core_ferc714__yearly_planning_area_demand_forecast', schema=None) as batch_op:
batch_op.add_column(sa.Column('winter_peak_demand_mw', sa.FLOAT(), nullable=True))
batch_op.add_column(sa.Column('net_demand_mwh', sa.FLOAT(), nullable=True))
batch_op.add_column(sa.Column('summer_peak_demand_mw', sa.FLOAT(), nullable=True))
batch_op.drop_constraint(batch_op.f('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
batch_op.create_foreign_key('fk_core_ferc714__yearly_planning_area_demand_forecast_respondent_id_ferc714_core_ferc714__respondent_id', 'core_ferc714__respondent_id', ['respondent_id_ferc714'], ['respondent_id_ferc714'])
batch_op.drop_column('net_demand_forecast_mwh')
batch_op.drop_column('winter_peak_demand_forecast_mw')
batch_op.drop_column('summer_peak_demand_forecast_mw')

with op.batch_alter_table('core_ferc714__respondent_id', schema=None) as batch_op:
batch_op.drop_constraint(batch_op.f('fk_core_ferc714__respondent_id_respondent_id_ferc714_core_pudl__assn_ferc714_pudl_respondents'), type_='foreignkey')
batch_op.drop_column('respondent_id_ferc714_xbrl')
batch_op.drop_column('respondent_id_ferc714_csv')

op.drop_table('core_pudl__assn_ferc714_xbrl_pudl_respondents')
op.drop_table('core_pudl__assn_ferc714_csv_pudl_respondents')
op.drop_table('core_pudl__assn_ferc714_pudl_respondents')
# ### end Alembic commands ###
40 changes: 29 additions & 11 deletions src/pudl/analysis/state_demand.py
@@ -293,9 +293,9 @@ def load_hourly_demand_matrix_ferc714(
matrix = out_ferc714__hourly_planning_area_demand.pivot(
index="datetime", columns="respondent_id_ferc714", values="demand_mwh"
)
# List timezone by year for each respondent by the datetime
out_ferc714__hourly_planning_area_demand["year"] = (
out_ferc714__hourly_planning_area_demand["datetime"].dt.year
)
utc_offset = out_ferc714__hourly_planning_area_demand.groupby(
["respondent_id_ferc714", "year"], as_index=False
@@ -378,7 +378,9 @@ def filter_ferc714_hourly_demand_matrix(
return df


def impute_ferc714_hourly_demand_matrix(
df: pd.DataFrame, years: list[int]
) -> pd.DataFrame:
"""Impute null values in FERC 714 hourly demand matrix.
Imputation is performed separately for each year,
@@ -390,17 +392,28 @@ def impute_ferc714_hourly_demand_matrix
Args:
df: FERC 714 hourly demand matrix,
as described in :func:`load_ferc714_hourly_demand_matrix`.
years: List of years to impute.
Returns:
Copy of `df` with imputed values.
"""
results = []
# sort here and then don't sort in the groupby so we can process
# the newer years of data first. This is so we can see early if
# new data causes any failures.
df = df.sort_index(ascending=False)
for year, gdf in df.groupby(df.index.year, sort=False):
# remove records outside the working years: some respondents report one
# record at midnight on January 1st of the next year
# (report_date.dt.year + 1), and since this function imputes one year
# at a time, a year containing only that single record would fail.
if year in years:
logger.info(f"Imputing year {year}")
keep = df.columns[~gdf.isnull().all()]
tsi = pudl.analysis.timeseries_cleaning.Timeseries(gdf[keep])
result = tsi.to_dataframe(tsi.impute(method="tnn"), copy=False)
results.append(result)
return pd.concat(results)
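The split/apply/combine pattern above can be sketched on toy data with a
stand-in imputer. Linear interpolation below is only an illustrative
substitute for the actual ``tnn``-based imputation in
``pudl.analysis.timeseries_cleaning``:

```python
import numpy as np
import pandas as pd

# Toy hourly matrix spanning a year boundary, with gaps.
idx = pd.date_range("2020-12-31 22:00", periods=6, freq="h")
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, 4.0, np.nan, 6.0]}, index=idx)

# Split by calendar year, impute each year separately, then recombine.
results = []
for year, gdf in df.groupby(df.index.year):
    # Linear interpolation stands in for the real tnn-based imputer.
    results.append(gdf.interpolate())
imputed = pd.concat(results).sort_index()
```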


@@ -474,8 +487,12 @@ def _out_ferc714__hourly_demand_matrix(
return df


@asset(
compute_kind="NumPy",
required_resource_keys={"dataset_settings"},
)
def _out_ferc714__hourly_imputed_demand(
context,
_out_ferc714__hourly_demand_matrix: pd.DataFrame,
_out_ferc714__utc_offset: pd.DataFrame,
) -> pd.DataFrame:
@@ -492,7 +509,8 @@ def _out_ferc714__hourly_imputed_demand(
Returns:
df: DataFrame with imputed FERC714 hourly demand.
"""
years = context.resources.dataset_settings.ferc714.years
df = impute_ferc714_hourly_demand_matrix(_out_ferc714__hourly_demand_matrix, years)
df = melt_ferc714_hourly_demand_matrix(df, _out_ferc714__utc_offset)
return df

