Set up data tests in PLUTO #1141

Open · sf-dcp wants to merge 6 commits into main from sf-pluto-dbt

Conversation

@sf-dcp (Contributor) commented Sep 18, 2024

Fixes #962. See the issue for test framework details.

This PR establishes a dbt framework for running data tests after the PLUTO build, keeping in mind a future migration of the entire PLUTO pipeline to dbt.
I also implemented the first data test, which I've been working on with Amanda; it helps us identify DOF PTS condo bbls that likely have an incorrect residential unit count.

Data Test description

  • As mentioned above, this test helps us identify DOF condo records that likely have an incorrect residential unit count. The test consists of two parts:

    • First, we check the DOF semi-raw data (the pluto_rpad_geo table) to find prime bbls where multiple condo units have res units > 1. Why? Because each residential unit in a condo building should correspond to exactly one condo unit bbl. (A minimal SQL sketch of this check appears after this list.)

    For example, a prime bbl with the following records would be flagged:

    | primebbl | bbl (unit bbl) | coop_units (res units) |
    | -------- | -------------- | ---------------------- |
    | 75123    | 1              | 3                      |
    | 75123    | 2              | 3                      |
    | 75123    | 3              | 1                      |
    • Second, we cross-check these prime bbls against the corrections file pluto_input_research.csv, the Developments database (DevDB), and a file of bbls that should be ignored during the check.
      • Why is there a file of bbls that should be ignored during the check? Because DevDB isn't perfect: it only contains building construction records going back to roughly 2010. So, for example, if a lot has two buildings built in 1980 and only one of them was renovated after ~2010, DevDB will only show residential units for that one building.
  • test logic tree:

```mermaid
flowchart TD

check_name["condominium has multiple units with coop_units > 1?"]
corrected["Has res unit total been manually corrected?"]
in_ignore_file["Is in the 'ignore' csv file?"]
devdb["Is res unit total aligns with DevDB?"]
over_50["Is res unit count diff >= 50?"]
fix{"Manually research bbl. Needs correction?"}
add_to_corrections["Add to corrections & report to DOF"]
add_to_ignore_file["Add to 'ignore' csv file"]

ignore{"Ignore ✅"}

check_name --> |Yes| corrected
check_name --> |No| ignore

corrected --> |Yes| ignore
corrected --> |No| in_ignore_file

in_ignore_file --> |Yes| ignore
in_ignore_file --> |No| devdb

devdb --> |Yes| ignore
devdb --> |No| over_50

over_50 --> |Yes| fix
over_50 --> |No| ignore

fix --> |Yes| add_to_corrections
fix --> |No| add_to_ignore_file
```
  • dbt lineage graph:

    [image: dbt lineage graph]
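
To make the first part of the check concrete, here is a minimal SQL sketch (not the actual implementation in this PR; it assumes pluto_rpad_geo exposes the primebbl, bbl, and coop_units columns shown above and is already limited to condo records):

```sql
-- Hypothetical sketch of part 1: flag prime bbls where more than one condo
-- unit bbl claims multiple residential units. The real logic lives in the
-- refactored models under models/qaqc/ and in
-- tests/assert_condo_bbl_unit_count_research_required.sql.
select primebbl
from {{ ref('pluto_rpad_geo') }}
group by primebbl
having count(*) > 1                                  -- condo with multiple unit bbls
   and count(*) filter (where coop_units > 1) > 1    -- more than one unit bbl reports > 1 res unit
```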

Commits

  1. Initiate a dbt project inside the PLUTO directory.

  2. Add the DevDB dataset to the PLUTO recipes; we need DevDB for the data test.

  3. Create a custom dbt test for residential units in the tests/assert_condo_bbl_unit_count_research_required.sql file. In dbt, you can create your own custom "singular" test, which is essentially a SQL query stored in the tests directory. This is slightly different from the macros we've created before in that it is not reusable across different tables. (A simplified sketch of a singular test follows this list.)

    • The test description and our next steps in case of failure are documented in the config object of the test file. At the moment, dbt doesn't have a description property for tests, but it should be available in the next dbt release (they recently implemented this feature here).
    • This test is refactored in commit 4, so it may not be necessary to check out this commit; I kept it just in case.
  4. Refactor the test by creating intermediate tables. The reasons for doing so are: 1) to make the test more readable and 2) to use these intermediate tables for creating reports to send to DOF.

    • I created a models/qaqc/ directory intended for test/QAQC intermediate tables that are NOT part of the build, and for QAQC reports. As we migrate the PLUTO build, we can use the reports subdirectory for QAQC app tables. I also added yaml files describing the various tables in the qaqc directory. This directory contains some of the refactored tables used in the test.
    • Simplify the test itself given the refactored tables.
    • Add a seed file that will contain condo bbls to be ignored during the test. This seed file is filled out during the manual research step.
  5. Run dbt test as a separate step in the PLUTO build GHA.
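
For context on commits 3 and 4: a singular test is simply a SQL file under tests/ whose query returns the failing rows. Below is a heavily simplified, partly hypothetical sketch of that shape; only the seed-exclusion pattern is taken from the diff, and the rest is a stand-in for the real logic built from the models/qaqc/ tables:

```sql
-- Any .sql file under tests/ is a singular test: dbt runs the query and the
-- test fails if it returns one or more rows.
{{ config(severity='error') }}  -- severity value is illustrative, not taken from this PR

with uncorrected_primebbl_offenders as (
    -- stand-in: the real test derives these from the refactored
    -- models/qaqc/ intermediate tables
    select primebbl
    from {{ ref('qaqc_int__active_condo_bbl_unitsres_corrections') }}
)

select *
from uncorrected_primebbl_offenders
where primebbl not in (
    -- exclude bbls already reviewed and added to the "ignore" seed
    select bbl::decimal::bigint::text
    from {{ ref('ignored_bbls_for_unit_count_test') }}
)
```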

Test Documentation in dbt

  • As I previously mentioned, dbt is in the process of implementing a description property for tests, i.e. we will be able to document generic and singular tests in a yaml file. The test descriptions will then be rendered on the dbt docs pages here:
    [images: dbt docs page screenshots]
  • For now, we can document tests in a test's config.meta property, which can be a dict with any key-value pairs. This property is available for both generic and singular tests. (See the sketch after this list.)

  • Where can test results and test properties be accessed? In the dbt-generated artifacts: manifest.json and run_results.json.
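
For example, documenting a singular test via config.meta might look roughly like this (key names and values are illustrative, not the ones used in this PR):

```sql
-- At the top of a singular test file. meta is a free-form dict, so the test's
-- description and follow-up steps can live here until dbt ships a proper
-- description property for tests.
{{
    config(
        meta={
            "description": "Condo prime bbls whose unit-level res unit counts look wrong.",
            "next_steps": "Manually research the bbl; add to corrections or to the ignore seed."
        }
    )
}}

select 1 as placeholder where false  -- stand-in for the actual test query
```

The meta values then appear on the test node in manifest.json, and each run's pass/fail outcome lands in run_results.json.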

@sf-dcp force-pushed the sf-pluto-dbt branch 3 times, most recently from 870ee35 to e103768 on September 25, 2024
@sf-dcp force-pushed the sf-pluto-dbt branch 7 times, most recently from 99155d5 to 046a5e5 on September 27, 2024
Contributor

✂️ ? Plus pluto_rpad_geo

Contributor Author

we can't remove them because dbt needs these models for the defined models/tests. If I remove these, dbt fails

Contributor

For which tests specifically?

@sf-dcp (Contributor Author) Oct 7, 2024

Here are the models & 1 test using these tables:

pluto_rpad_geo:

  • models/qaqc/intermediate/qaqc_int__active_condo_bbl_unitsres_corrections.sql
  • models/qaqc/reports/qaqc_reports__dof_incorrect_condo_res_units.sql
  • tests/assert_condo_bbl_unit_count_research_required.sql

dof_pts_propmaster:

  • models/qaqc/reports/qaqc_reports__dof_pts_incorrect_condo_res_units.sql

Contributor

Ah okay - I think these can be defined as sources and referenced that way

@damonmcc (Member) Oct 7, 2024

@fvankrieken since those tables (pluto_rpad_geo, dof_pts_propmaster) are created during the build (via dbt or not), we thought it'd be better to consider them models rather than sources

seems like the only tables that should be considered sources are those listed in a recipe.yml

@fvankrieken (Contributor) Oct 7, 2024

But they're literally sources for the dbt code, no? We could have two categories of sources, since dbt lets you define different source groups - "recipe_sources" and "build_sources" or something like that to keep them distinct. It just gives us room to be explicit that these tables exist outside of dbt. I think that's less confusing than having fake models (which also violate our dbt naming conventions) and kind of tricking dbt into thinking they're part of its build.

Member

using different source groups would work to keep them distinct, although that's usually used to distinguish where data was extracted from (e.g. salesforce, stripe) [docs]

since dbt's description of sources is that they're "data loaded into your warehouse by your Extract and Load tools", I wouldn't say that these tables are sources. all tables are sources for dbt code, but not all tables are sources

@damonmcc (Member) Oct 7, 2024

maybe something like:

```yaml
sources:
  - name: recipe_sources
    ...

  - name: pluto_build
    schema: "{{ env_var('BUILD_ENGINE_SCHEMA') }}"
    tables:
      - name: pluto_rpad_geo
        ...

would mean no empty sql files?

@sf-dcp this would also mean changing things like ref('pluto_rpad_geo') to source('pluto_build', 'pluto_rpad_geo')
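
A minimal before/after sketch of that change inside a model or test (illustrative only):

```sql
-- before: pluto_rpad_geo treated as a dbt model
select * from {{ ref('pluto_rpad_geo') }}

-- after: pluto_rpad_geo treated as a table in the pluto_build source group
select * from {{ source('pluto_build', 'pluto_rpad_geo') }}
```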

Contributor

That is what I'm suggesting

```sql
SELECT *
FROM uncorrected_primebbl_offenders
WHERE primebbl NOT IN (
    SELECT bbl::decimal::bigint::text
```
Contributor

Can we get rid of the triple cast? Just read in as text to begin with?

Contributor Author

sooooo I don't think we can cast straight to text because this seed model is a csv (ignored_bbls_for_unit_count_test). The risk of casting to text directly is a bbl coming out as something like 123.00
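
A quick illustration of that risk, with a made-up bbl and assuming the seed column is typed as numeric:

```sql
-- a numeric value with scale keeps its trailing zeros when cast straight to
-- text, so it would never match the text primebbl values
select (1001230045.00)::text                   as direct_cast,  -- '1001230045.00'
       (1001230045.00)::decimal::bigint::text  as triple_cast;  -- '1001230045'
```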

Contributor

Looks like it's relatively straightforward (see the code snippet in seeds/properties.yml)

@fvankrieken (Contributor)

This is very cool. Do you have any examples of runs, and the output json files from running tests?

@alexrichey (Contributor) left a comment

This looks great! I'd leave the actual stamp for Damon and Finn, who know more about DBT norms and PLUTO than I do. Also curious about the outputs, e.g. how many offenders this finds.

@sf-dcp (Contributor Author) commented Oct 7, 2024

> This looks great! I'd leave the actual stamp for Damon and Finn, who know more about DBT norms and PLUTO than I do. Also curious about the outputs, e.g. how many offenders this finds.

@alexrichey, thank you! Will provide output files asap. When working on PLUTO 24v3, ~26 bbl records failed the check. Out of the 26, about half of the condo bbls needed a manual correction.

Development

Successfully merging this pull request may close these issues.

Product Test Framework: checks implementation and storage of their results
4 participants