-
Notifications
You must be signed in to change notification settings - Fork 1.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Jinja Sampling to Stable Static Parser #3970
Conversation
# look for false positive configs | ||
for k in config._config_call_dict: | ||
if k not in config._config_call_dict: | ||
result += ["82_stable_false_positive_config_value"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, a different error code! really good. I was way overcomplicating this in my head
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I try to make things easy for you when I can 🙂
db5c188
to
ca69e84
Compare
Small thing, while we're hanging out in the model parser's We include that value when we fire off the total # of files "eligible" for static parsing, as the denominator when calculating the efficacy of our current grammar for one project. |
@jtcohen6, It's definitely not the plan to remove that line. It looks like it's still there on develop and in this branch. So I suspect it's just the PR view that makes it look like it disappeared. |
@nathaniel-may That's I observed this because I parsed a project with |
My mistake. Good catch though! I opened PR#3981. |
ca69e84
to
a5f2c1a
Compare
rebased so PR#3981 is reflected here now. |
automatic commit by git-black, original commits: b501f43
Description
Infrequently sample jinja parsing to compare against the new static parser. Without this, we'll have no data to compare the parity of jinja rendering and the stable static parser for the 1.0 version. This sampling is separate from the sampling pathway for the experimental parser, and both can be done at the same time. Since we expect a huge number of people to upgrade to 1.0, even though we have done our due diligence and expect the static parser to match parity with jinja rendering, this tracking information will allow us to validate our prior work while the feature is turned on by default for users.
The trade off here is that adding a second sampling path adds another bit of complexity to the
render_update
method and jinja rendering is often two orders of magnitude greater than the static parser making sampling extremely expensive. Right now the plan is to sample 1/5000 model files instead of 1/100 but this sample rate is flexible. A lower rate gives us less feedback and impacts users less, and a higher rate gives us more feedback sooner at a performance cost to users.This is not critical to ship in an early 1.0 pre-release.
Reviewers
First and foremost, do we even need this? The biggest drawback is that sampling jinja rendering is expensive and would add additional variation to project load times. The benefit of course is that we have any sort of warning that we have deviated from the existing jinja semantics. That being said, unless the project source is available to us we won't really have the ability to reproduce. However, if there are no problems after many samples, we can have great confidence that it works as designed across the wide variety of users that will inevitably upgrade to the big 1.0. Personally, If you're leaning towards "we don't need this" I'm leaning towards "what if we sample REALLY infrequently."
In order to jinja render separately, I deep copy both the
node
andconfig
objects and pass them to the super class'srender_update
method in the event we're taking a jinja sample. What do you think about this approach? I'm trying to avoid mutating the objects I'm working on currently for correctness, but I'm also not using the jinja rendering for the run when I have it so that subsequent user runs of dbt are deterministic and not dependent on the sampling rng. These deep copies only happen during a sample, so they won't be a huge impact to overall performance, but they do make taking the sample more expensive.There's no great way to test the tracking logic, so please take a good look at
_get_stable_sample_result
.Checklist
CHANGELOG.md
and added information about my change to the "dbt next" section.