Set up data tests in PLUTO #1141

Open · sf-dcp wants to merge 6 commits into main from sf-pluto-dbt

Conversation

@sf-dcp (Contributor) commented Sep 18, 2024

Fixes #962. See the issue for test framework details.

This PR establishes a dbt framework for running data tests after the PLUTO build, keeping in mind a future migration of the entire PLUTO pipeline to dbt.
I also implemented the first data test, which I've been working on with Amanda; it helps us identify DOF PTS condo bbls that likely have an incorrect residential unit count.

Data Test description

  • As mentioned above, this test helps us identify DOF condo records that likely have an incorrect residential unit count. The test consists of two parts:

    • First, we check the DOF semi-raw data (the pluto_rpad_geo table) to find prime bbls where multiple condo units have res units > 1. Why? Because each residential unit in a condo building should correspond to exactly one condo unit bbl. (A minimal SQL sketch of this check appears after this list.)

    For example, a prime bbl with the following records would be flagged:

    | primebbl | bbl (unit bbl) | coop_units (res units) |
    | -------- | -------------- | ---------------------- |
    | 75123    | 1              | 3                      |
    | 75123    | 2              | 3                      |
    | 75123    | 3              | 1                      |
    • Second, we cross-check these prime bbls against the corrections file pluto_input_research.csv, the Developments database (DevDB), and a file of bbls that should be ignored during the check.
      • Why is there a file of bbls that should be ignored during the check? Because DevDB isn't perfect: it only contains building construction records going back to roughly 2010. So, for example, if a lot has two buildings built in 1980 and only one of them was renovated after ~2010, DevDB will only show residential units for that one building.
  • test logic tree:

```mermaid
flowchart TD

check_name["condominium has multiple units with coop_units > 1?"]
corrected["Has res unit total been manually corrected?"]
in_ignore_file["Is in the 'ignore' csv file?"]
devdb["Is res unit total aligns with DevDB?"]
over_50["Is res unit count diff >= 50?"]
fix{"Manually research bbl. Needs correction?"}
add_to_corrections["Add to corrections & report to DOF"]
add_to_ignore_file["Add to 'ignore' csv file"]

ignore{"Ignore ✅"}

check_name --> |Yes| corrected
check_name --> |No| ignore

corrected --> |Yes| ignore
corrected --> |No| in_ignore_file

in_ignore_file --> |Yes| ignore
in_ignore_file --> |No| devdb

devdb --> |Yes| ignore
devdb --> |No| over_50

over_50 --> |Yes| fix
over_50 --> |No| ignore

fix --> |Yes| add_to_corrections
fix --> |No| add_to_ignore_file
```
  • dbt lineage graph:

    [image: dbt lineage graph]
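
To make the first part of the check concrete, here is a minimal SQL sketch (not the actual implementation in this PR; it assumes pluto_rpad_geo exposes the primebbl, bbl, and coop_units columns shown above and is already limited to condo records):

```sql
-- Hypothetical sketch of part 1: flag prime bbls where more than one condo
-- unit bbl claims multiple residential units. The real logic lives in the
-- refactored models under models/qaqc/ and in
-- tests/assert_condo_bbl_unit_count_research_required.sql.
select primebbl
from {{ ref('pluto_rpad_geo') }}
group by primebbl
having count(*) > 1                                  -- condo with multiple unit bbls
   and count(*) filter (where coop_units > 1) > 1    -- more than one unit bbl reports > 1 res unit
```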

Commits

  1. Initiate a dbt project inside the PLUTO directory.

  2. Add the DevDB dataset to the PLUTO recipes; we need DevDB for the data test.

  3. Create a custom dbt test for residential units in the tests/assert_condo_bbl_unit_count_research_required.sql file. In dbt, you can create your own custom "singular" test, which is essentially a SQL query stored in the tests directory. This is slightly different from the macros we've created before in that it is not reusable across different tables. (A simplified sketch of a singular test follows this list.)

    • The test description and our next steps in case of failure are documented in the config object of the test file. At the moment, dbt doesn't have a description property for tests, but it should be available in the next dbt release (they recently implemented this feature here).
    • This test is refactored in commit 4, so it may not be necessary to check out this commit; I kept it just in case.
  4. Refactor the test by creating intermediate tables. The reasons for doing so are: 1) to make the test more readable and 2) to use these intermediate tables for creating reports to send to DOF.

    • I created a models/qaqc/ directory intended for test/QAQC intermediate tables that are NOT part of the build, and for QAQC reports. As we migrate the PLUTO build, we can use the reports subdirectory for QAQC app tables. I also added yaml files describing the various tables in the qaqc directory. This directory contains some of the refactored tables used in the test.
    • Simplify the test itself given the refactored tables.
    • Add a seed file that will contain condo bbls to be ignored during the test. This seed file is filled out during the manual research step.
  5. Run dbt test as a separate step in the PLUTO build GHA.
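
For context on commits 3 and 4: a singular test is simply a SQL file under tests/ whose query returns the failing rows. Below is a heavily simplified, partly hypothetical sketch of that shape; only the seed-exclusion pattern is taken from the diff, and the rest is a stand-in for the real logic built from the models/qaqc/ tables:

```sql
-- Any .sql file under tests/ is a singular test: dbt runs the query and the
-- test fails if it returns one or more rows.
{{ config(severity='error') }}  -- severity value is illustrative, not taken from this PR

with uncorrected_primebbl_offenders as (
    -- stand-in: the real test derives these from the refactored
    -- models/qaqc/ intermediate tables
    select primebbl
    from {{ ref('qaqc_int__active_condo_bbl_unitsres_corrections') }}
)

select *
from uncorrected_primebbl_offenders
where primebbl not in (
    -- exclude bbls already reviewed and added to the "ignore" seed
    select bbl::decimal::bigint::text
    from {{ ref('ignored_bbls_for_unit_count_test') }}
)
```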

Test Documentation in dbt

  • As I previously mentioned, dbt is in the process of implementing a description property for tests, i.e. we will be able to document generic and singular tests in a yaml file. The test descriptions will then be rendered on the dbt docs pages here:
    [images: dbt docs page screenshots]
  • For now, we can document tests in a test's config.meta property, which can be a dict with any key-value pairs. This property is available for both generic and singular tests. (See the sketch after this list.)

  • Where can test results and test properties be accessed? In the dbt-generated artifacts: manifest.json and run_results.json.
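
For example, documenting a singular test via config.meta might look roughly like this (key names and values are illustrative, not the ones used in this PR):

```sql
-- At the top of a singular test file. meta is a free-form dict, so the test's
-- description and follow-up steps can live here until dbt ships a proper
-- description property for tests.
{{
    config(
        meta={
            "description": "Condo prime bbls whose unit-level res unit counts look wrong.",
            "next_steps": "Manually research the bbl; add to corrections or to the ignore seed."
        }
    )
}}

select 1 as placeholder where false  -- stand-in for the actual test query
```

The meta values then appear on the test node in manifest.json, and each run's pass/fail outcome lands in run_results.json.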

@sf-dcp force-pushed the sf-pluto-dbt branch 3 times, most recently from 870ee35 to e103768 on September 25, 2024
@sf-dcp force-pushed the sf-pluto-dbt branch 7 times, most recently from 99155d5 to 046a5e5 on September 27, 2024
Contributor

✂️ ? Plus pluto_rpad_geo

Contributor Author

we can't remove them because dbt needs these models for the defined models/tests. If I remove these, dbt fails

Contributor

For which tests specifically?

@sf-dcp (Contributor Author) Oct 7, 2024

Here are the models & 1 test using these tables:

pluto_rpad_geo:

  • models/qaqc/intermediate/qaqc_int__active_condo_bbl_unitsres_corrections.sql
  • models/qaqc/reports/qaqc_reports__dof_incorrect_condo_res_units.sql
  • tests/assert_condo_bbl_unit_count_research_required.sql

dof_pts_propmaster:

  • models/qaqc/reports/qaqc_reports__dof_pts_incorrect_condo_res_units.sql

Contributor

Ah okay - I think these can be defined as sources and referenced that way

@damonmcc (Member) Oct 7, 2024

@fvankrieken since those tables (pluto_rpad_geo, dof_pts_propmaster) are created during the build (via dbt or not), we thought it'd be better to consider them models rather than sources

seems like the only tables that should be considered sources are those listed in a recipe.yml

@fvankrieken (Contributor) Oct 7, 2024

But they're literally sources for the dbt code, no? We could have two categories of sources, since dbt lets you define different source groups - "recipe_sources" and "build_sources" or something like that to keep them distinct. It just gives us room to be explicit that these tables exist outside of dbt. I think that's less confusing than having fake models (which also violate our dbt naming conventions) and kind of tricking dbt into thinking they're part of its build.

Member

using different source groups would work to keep them distinct, although that's usually used to distinguish where data was extracted from (e.g. salesforce, stripe) [docs]

since dbt's description of sources is that they're "data loaded into your warehouse by your Extract and Load tools", I wouldn't say that these tables are sources. all tables are sources for dbt code, but not all tables are sources

@damonmcc (Member) Oct 7, 2024

maybe something like:

```yaml
sources:
  - name: recipe_sources
    ...

  - name: pluto_build
    schema: "{{ env_var('BUILD_ENGINE_SCHEMA') }}"
    tables:
      - name: pluto_rpad_geo
        ...

would mean no empty sql files?

@sf-dcp this would also mean changing things like ref('pluto_rpad_geo') to source('pluto_build', 'pluto_rpad_geo')
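
A minimal before/after sketch of that change inside a model or test (illustrative only):

```sql
-- before: pluto_rpad_geo treated as a dbt model
select * from {{ ref('pluto_rpad_geo') }}

-- after: pluto_rpad_geo treated as a table in the pluto_build source group
select * from {{ source('pluto_build', 'pluto_rpad_geo') }}
```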

Contributor

That is what I'm suggesting

```sql
SELECT *
FROM uncorrected_primebbl_offenders
WHERE primebbl NOT IN (
    SELECT bbl::decimal::bigint::text
```
Contributor

Can we get rid of the triple cast? Just read in as text to begin with?

Contributor Author

sooooo I don't think we can cast straight to text because this seed model is a csv (ignored_bbls_for_unit_count_test). The risk of casting to text directly is a bbl coming out as something like 123.00
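
A quick illustration of that risk, with a made-up bbl and assuming the seed column is typed as numeric:

```sql
-- a numeric value with scale keeps its trailing zeros when cast straight to
-- text, so it would never match the text primebbl values
select (1001230045.00)::text                   as direct_cast,  -- '1001230045.00'
       (1001230045.00)::decimal::bigint::text  as triple_cast;  -- '1001230045'
```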

Contributor

Looks like it's relatively straightforward (see the code snippet in seeds/properties.yml)

@fvankrieken (Contributor)

This is very cool. Do you have any examples of runs, and the output json files from running tests?

@alexrichey (Contributor) left a comment

This looks great! I'd leave the actual stamp for Damon and Finn, who know more about DBT norms and PLUTO than I do. Also curious about the outputs, e.g. how many offenders this finds.

@sf-dcp (Contributor Author) commented Oct 7, 2024

> This looks great! I'd leave the actual stamp for Damon and Finn, who know more about DBT norms and PLUTO than I do. Also curious about the outputs, e.g. how many offenders this finds.

@alexrichey, thank you! Will provide output files asap. When working on PLUTO 24v3, ~26 bbl records failed the check. Out of the 26, about half of the condo bbls needed a manual correction.

Development

Successfully merging this pull request may close these issues.

Product Test Framework: checks implementation and storage of their results
4 participants