Set up data tests in PLUTO #1141
base: main
Conversation
* These intermediate tables can also be used to populate reports to share with DOF
✂️ ? Plus pluto_rpad_geo
we can't remove them because dbt needs these models for the defined models/tests. If I remove these, dbt fails
For which tests specifically?
Here are the models & 1 test using these tables:

pluto_rpad_geo:
* models/qaqc/intermediate/qaqc_int__active_condo_bbl_unitsres_corrections.sql
* models/qaqc/reports/qaqc_reports__dof_incorrect_condo_res_units.sql
* tests/assert_condo_bbl_unit_count_research_required.sql

dof_pts_propmaster:
* models/qaqc/reports/qaqc_reports__dof_pts_incorrect_condo_res_units.sql
Ah okay - I think these can be defined as sources and referenced that way
@fvankrieken since those tables (pluto_rpad_geo, dof_pts_propmaster) are created during the build (via dbt or not), we thought it'd be better to consider them models than to consider them sources. Seems like the only tables that should be considered sources are those listed in a recipe.yml
But they're literally sources for the dbt code, no? We could have two categories of sources, since dbt lets you define different source groups - "recipe_sources" and "build_sources" or something like that to keep them distinct. It just gives room to be explicit that these tables exist outside of dbt. I think that's less confusing than having fake models (that also violate our dbt naming conventions) and kind of tricking dbt into thinking that they're part of its build.
using different source groups would work to keep them distinct, although that's usually used to distinguish where data was extracted from (e.g. salesforce, stripe) [docs]

since dbt's description of sources is that they're "data loaded into your warehouse by your Extract and Load tools", I wouldn't say these tables are sources. All tables are sources for dbt code, but not all tables are dbt sources
maybe something like:

```yaml
sources:
  - name: recipe_sources
    ...
  - name: pluto_build
    schema: "{{ env_var('BUILD_ENGINE_SCHEMA') }}"
    tables:
      - name: pluto_rpad_geo
        ...
```

would mean no empty sql files?
@sf-dcp this would also mean changing things like ref('pluto_rpad_geo') to source('pluto_build', 'pluto_rpad_geo')
That is what I'm suggesting
```sql
SELECT *
FROM uncorrected_primebbl_offenders
WHERE primebbl NOT IN (
    SELECT bbl::decimal::bigint::text
```
Can we get rid of the triple cast? Just read in as text to begin with?
sooooo I don't think we can cast straight to text because this seed model is a csv, ignored_bbls_for_unit_count_test. The risk of casting to text directly is a bbl ending up as something like 123.00
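To make the failure mode concrete, here is a minimal sketch (not project code; the bbl value is made up) of why a numeric-parsed bbl can't be cast straight to text, and what the decimal → bigint → text chain fixes:

```python
from decimal import Decimal

# A bbl read from a csv as a numeric value keeps its decimal scale,
# so a naive cast to text preserves the trailing zeros.
raw = Decimal("1000010001.00")  # hypothetical bbl parsed as a numeric

print(str(raw))       # '1000010001.00' - would never match a text bbl
print(str(int(raw)))  # '1000010001'    - what bbl::decimal::bigint::text yields
```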
Looks like it's relatively straightforward (see the code snippet in seeds/properties.yml)
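For reference, dbt seeds can declare column types in the seed's properties file. A generic sketch of that config (not the actual snippet from this repo's seeds/properties.yml; the seed name is taken from this thread and the shape is dbt's documented column_types option):

```yaml
seeds:
  - name: ignored_bbls_for_unit_count_test
    config:
      column_types:
        bbl: text
```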
This is very cool. Do you have any examples of runs, and the output json files from running tests?
This looks great! I'd leave the actual stamp for Damon and Finn, who know more about DBT norms and PLUTO than I do. Also curious about the outputs, e.g. how many offenders this finds.
@alexrichey, thank you! Will provide output files asap. When working on PLUTO 24v3, ~26 bbl records failed the check. Out of the 26, about half of the condo bbls needed a manual correction.
Fixes #962. See the issue for test framework details.
This PR establishes a dbt framework for running data tests post PLUTO build, keeping in mind a future migration of the entire PLUTO pipeline to dbt. I also implemented the first data test that I've been working on with Amanda; this test helps us identify DOF PTS condo bbls that likely have an incorrect residential unit count.
Data Test description
As mentioned above, this test helps us identify DOF condo records that likely have an incorrect residential unit number. The test consists of 2 parts:

1. A check on the raw DOF data (the pluto_rpad_geo table) to find prime bbls where multiple-unit condos have res unit > 1. Why? Because each residential unit in a condo building should have exactly one condo. For example, a prime bbl would be flagged if it had records like the following (table omitted).
2. A filter that removes flagged bbls already accounted for via pluto_input_research.csv, the Developments database (DevDB), and a file of bbls that should be ignored during the check.

Test logic tree (diagram omitted).

dbt lineage graph (diagram omitted).

Commits
* Initiate the dbt project inside of the PLUTO directory.
* Add the DevDB dataset to PLUTO recipes. We need DevDB for the data test.
* Create a custom dbt test for residential units in the tests/assert_condo_bbl_unit_count_research_required.sql file. In dbt, you can create your own custom, "singular" test, which is pretty much a sql query stored in a tests directory. This is slightly different from the macros we've created before in that it's not reusable for different tables. dbt doesn't have a description property for tests, but it should be available in the next dbt release (they recently implemented this feature here).
* Refactor the test by creating intermediate tables. The reasons to do so are: 1) make the test more readable and 2) use these intermediate tables for creating reports to send to DOF.
* Create a models/qaqc/ directory that is intended for test/qaqc intermediate tables that are NOT a part of the build and for qaqc reports. As we migrate the PLUTO build, we can use the reports subdirectory for QAQC app tables. I also added yaml files to describe various tables in the qaqc dir. This directory contains some of the refactored tables used in the test.
* Run the dbt test as a separate step in the PLUTO build GHA.

Test Documentation in dbt
As I previously mentioned, dbt is in the process of implementing a description property for tests: i.e., we will be able to document generic and singular tests in a yaml file, and the test descriptions will then be parsed into the dbt docs pages (screenshot omitted).

For now, we can document tests in a test's config.meta property, which can be a dict with any key-value pairs. This property is available for both generic and singular tests.

Where can test results and test properties be accessed? In dbt-generated artifacts: manifest.json & run_results.json.
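To make the artifact part concrete, here is a minimal sketch (not project code; the unique_id values and the failure count are illustrative) of pulling test outcomes out of a run_results.json artifact:

```python
import json

# The inline document mirrors the artifact's shape: a "results" list whose
# entries carry "unique_id" and "status". A real run would load the file
# from the target/ directory instead of an inline string.
run_results = json.loads("""
{
  "results": [
    {"unique_id": "test.pluto.assert_condo_bbl_unit_count_research_required",
     "status": "fail", "failures": 26},
    {"unique_id": "model.pluto.qaqc_int__active_condo_bbl_unitsres_corrections",
     "status": "success", "failures": null}
  ]
}
""")

# Collect every test/model that did not succeed.
failed = [r["unique_id"] for r in run_results["results"] if r["status"] == "fail"]
print(failed)  # -> ['test.pluto.assert_condo_bbl_unit_count_research_required']
```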