Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[CT-677] [Bug] Downstream models do not run correctly after NetworkX 2.8.1 #5286

Closed
1 task done
iknox-fa opened this issue May 20, 2022 · 8 comments · Fixed by #5326
Closed
1 task done

[CT-677] [Bug] Downstream models do not run correctly after NetworkX 2.8.1 #5286

iknox-fa opened this issue May 20, 2022 · 8 comments · Fixed by #5326
Assignees
Labels
bug Something isn't working

Comments

@iknox-fa
Copy link
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Current Behavior

Any 1.0.x version of dbt installed after NetworkX 2.8.1 was released does not correctly run downstream models when using selection syntax. For some reason this does not seem to effect 1.1

Expected Behavior

All downstream models are run

Steps To Reproduce

Install dbt- @ v1.0.x
Create a project with a model_foo and downstream model_bar
Run dbt with selection dbt run --select +model_foo+
Observe that only model_foo is run

Relevant log output

No response

Environment

- OS: MacOS
- Python: 3.10
- dbt: 1.0.6

What database are you using dbt with?

postgres

Additional Context

No response

@iknox-fa iknox-fa added bug Something isn't working triage labels May 20, 2022
@github-actions github-actions bot changed the title [Bug] Downstream models do not run correctly after NetworkX 2.8.1 [CT-677] [Bug] Downstream models do not run correctly after NetworkX 2.8.1 May 20, 2022
@barberscott
Copy link

https://github.com/networkx/networkx/milestone/21?closed=1 -- looks like 2.8.2 of networkx is incoming...

@gshank gshank removed the triage label May 20, 2022
@barberscott
Copy link

2.8.1 broke, but 2.8.2 did not fix whatever is now wrong with the graph stuff and newtorkx.

@iknox-fa iknox-fa self-assigned this Jun 1, 2022
@iknox-fa
Copy link
Contributor Author

iknox-fa commented Jun 1, 2022

After quite a few hours of investigation, this is proving a VERY difficult bug to track down.
Here's a synopsis of the work so far.

We're doing this wrong

The issue can be traced back to our use of nx.single_source_shortest_path_length in dbt-core <1.1.x here and here.

Specifically, it's used to return a list of a given node's descendants, and by reversing the graph first, it's ancestors. However, that isn't really what that function is supposed to do! Using the DAG-specific nx.ancestors and nx.descendants works correctly in all versions and should have probably been what we did in the first place. The only downside to this approach is that nx.desc / nx.ances don't offer a max depth option but it's a simple enough thing to implement.

There's still a bug, even when we do it the "right way"

In researching this it was discovered that doing any kind of iteration over self.graph immediately before applying nx.single_source_shortest_path_length seems to "fix" it. This explains why dbt-core>=1.1 works correctly because we changed the method used to reverse the graph to one that actually iterates over it. This makes absolutely no sense since iterating over the graph should not mutate it in any way!!!

I suspect that either we are constructing the initial graph in such a way that causes problems unless it's iterated over OR there's a bug in networkx somewhere very very very not obvious. Unfortunately, I wasn't able to definitively pin down the source of this bug despite several days worth of effort.

@iknox-fa
Copy link
Contributor Author

iknox-fa commented Jun 1, 2022

@jtcohen6 , @leahwicz Should we create another ticket to further investigate root cause here? I'm low-key concerned about not being able to pin down the exact root cause.... very low-key.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 2, 2022

@iknox-fa My inclination here is to move forward with the refactor that you've identified, after several days of detailed investigation. This is critical dbt functionality. It feels most important that we are confident in the approach we're taking going forward—specifically, that it is closely aligned with intended behavior of networkx.

We've now pinned dbt-core v1.0 to use networkx<2.8.1. dbt-core v1.1 is still allowing networkx<3 (since this error didn't occur there). Since we don't feel like we fully understand exactly why the extra iteration step added in v1.1 is working, let's either:

@leahwicz
Copy link
Contributor

leahwicz commented Jun 3, 2022

@jtcohen6 I like pinning 1.1 and not backporting since it's a change that introduces risk. We can add that to the RC2 that we are going to push out soon

@jtcohen6
Copy link
Contributor

jtcohen6 commented Jun 6, 2022

@leahwicz Agree with pinning as the safer bet for v1.1. Let's make sure that gets in for RC2.

@drewbanin
Copy link
Contributor

I spent some time debugging this issue this morning. Two videos for you:

First, A walkthrough of using git bisect to find the first commit which breaks the graph traversal behavior in dbt Core

https://www.loom.com/share/084a91690e6744738063ad78c7b5e79e

The first commit in Networkx which causes the errant behavior is: networkx/networkx@f50fc70

Second, A deep dive with a debugger to figure out the root cause of the regression. I feel like I'm 90% of the way to understanding the culprit, but I'm missing the specific place where (surprise!) a cache is not being invalidated/updated

https://www.loom.com/share/a67fa29caaac436997e278742d61c031

Here's the code that I'm using to repro this issue: https://gist.github.com/drewbanin/60f3b6a5d19d8af694dc6830c8c986b3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants