
Add Jinja Sampling to Stable Static Parser #3970

Merged: 1 commit into develop from sampling2, Oct 6, 2021

Conversation

@nathaniel-may (Contributor) commented Sep 29, 2021

Description

Infrequently sample jinja parsing to compare against the new static parser. Without this, we'll have no data on the parity between jinja rendering and the stable static parser for the 1.0 release. This sampling is separate from the sampling pathway for the experimental parser, and both can happen at the same time. Since we expect a huge number of people to upgrade to 1.0, this tracking information will let us validate our prior work while the feature is turned on by default, even though we have done our due diligence and expect the static parser to reach parity with jinja rendering.

The trade-off here is that adding a second sampling path adds another bit of complexity to the render_update method, and jinja rendering is often two orders of magnitude slower than the static parser, which makes sampling extremely expensive. Right now the plan is to sample 1 in 5000 model files instead of 1 in 100, but this sample rate is flexible: a lower rate gives us less feedback and impacts users less, while a higher rate gives us more feedback sooner at a performance cost to users.
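As a rough illustration of the shape of that gate, here is a minimal sketch of a 1-in-5000 check. The constant and function names, and the random.randint mechanism, are illustrative assumptions rather than the actual implementation:

```python
import random

# Illustrative only: the real rate and wiring live in render_update.
STABLE_SAMPLE_RATE = 5000  # roughly 1 in 5000 model files

def should_take_stable_sample() -> bool:
    """Return True for roughly 1 in every STABLE_SAMPLE_RATE calls."""
    return random.randint(1, STABLE_SAMPLE_RATE) == STABLE_SAMPLE_RATE
```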

This is not critical to ship in an early 1.0 pre-release.

Reviewers

First and foremost, do we even need this? The biggest drawback is that sampling jinja rendering is expensive and would add additional variation to project load times. The benefit, of course, is that we get some warning if we have deviated from the existing jinja semantics. That said, unless the project source is available to us, we won't really have the ability to reproduce any discrepancy. However, if there are no problems after many samples, we can have great confidence that it works as designed across the wide variety of users that will inevitably upgrade to the big 1.0. Personally, if you're leaning towards "we don't need this," I'm leaning towards "what if we sample REALLY infrequently?"

In order to jinja render separately, I deep copy both the node and config objects and pass them to the super class's render_update method when we're taking a jinja sample. What do you think about this approach? I'm trying to avoid mutating the objects I'm currently working on, for correctness, and I'm also not using the jinja rendering result for the run even when I have it, so that subsequent user runs of dbt are deterministic and not dependent on the sampling RNG. These deep copies only happen during a sample, so they won't have a huge impact on overall performance, but they do make taking the sample more expensive.
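To make that shape concrete, here is a minimal sketch (not the PR's actual code): `jinja_render_update` stands in for the super class's render_update, and `compare` stands in for something like _get_stable_sample_result.

```python
from copy import deepcopy
from typing import Any, Callable, List

def sample_jinja_render(
    node: Any,
    config: Any,
    jinja_render_update: Callable[[Any, Any], None],
    compare: Callable[[Any, Any, Any, Any], List[str]],
) -> List[str]:
    # Deep-copy so the node/config used for the actual run are never
    # mutated by the (expensive) jinja sample path.
    sample_node = deepcopy(node)
    sample_config = deepcopy(config)
    # The full jinja render happens only on the copies.
    jinja_render_update(sample_node, sample_config)
    # Compare the jinja result against the static parser's result and
    # return discrepancy codes for tracking; the copies are then discarded.
    return compare(sample_node, sample_config, node, config)
```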

There's no great way to test the tracking logic, so please take a good look at _get_stable_sample_result.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

```python
# look for false positive configs: config calls the static parser picked up
# that the jinja-rendered sample did not
for k in config._config_call_dict:
    if k not in sample_config._config_call_dict:
        result += ["82_stable_false_positive_config_value"]
```
Contributor:
ah, a different error code! really good. I was way overcomplicating this in my head

Contributor (Author):
I try to make things easy for you when I can 🙂

Base automatically changed from static-parser-by-default to develop September 29, 2021 20:01
@jtcohen6 (Contributor):

Small thing, while we're hanging out in the model parser's render_update: I just realized that this line got lost in the shuffle (or did we remove it intentionally?):

https://github.com/dbt-labs/dbt/blob/733f35f5e133327e8e54477e94166ec76eec926e/core/dbt/parser/models.py#L32

We include that value, the total # of files "eligible" for static parsing, when we fire off tracking; it serves as the denominator when calculating the efficacy of our current grammar for one project.

@nathaniel-may (Contributor, Author):

@jtcohen6, it's definitely not the plan to remove that line. It looks like it's still there on develop and in this branch, so I suspect it's just the PR view that makes it look like it disappeared.

@jtcohen6 (Contributor):

@nathaniel-may That's static_analysis_parsed_path_count (numerator), which is distinct from static_analysis_path_count (denominator). I think static_analysis_path_count was incidentally removed in 3944705#diff-dc357e8548e098dd9f293a19d206d90baf4796d1f2e05e237342334fd6657356L32.

I observed this because I parsed a project with develop, inspected target/perf_info.json, and noticed that static_analysis_path_count is 0, at the same time that static_analysis_parsed_path_count is >0 and is_static_analysis_enabled is True.

@nathaniel-may (Contributor, Author):

My mistake. Good catch though! I opened PR#3981.

@nathaniel-may (Contributor, Author):

Rebased, so PR#3981 is reflected here now.

@nathaniel-may nathaniel-may marked this pull request as ready for review October 1, 2021 17:11
@nathaniel-may nathaniel-may merged commit b501f43 into develop Oct 6, 2021
@nathaniel-may nathaniel-may deleted the sampling2 branch October 6, 2021 13:48
iknox-fa pushed a commit that referenced this pull request Feb 8, 2022
automatic commit by git-black, original commits:
  b501f43