Develop Pre-dissolved Basins Layer for Faster Delineations #9

Open
2 of 4 tasks
ptomasula opened this issue Jul 23, 2024 · 5 comments

@ptomasula
Contributor

ptomasula commented Jul 23, 2024

Summary

During the development of dissolve logic under #7, it was discovered that dissolving moderately large watersheds is resource-intensive and slow. In a test on decent hardware, combining ~850 basins took ~12 seconds (this is the dissolve operation alone and excludes overhead such as loading the file and subsetting).

One way to mitigate this performance issue would be to develop a layer of pre-dissolved polygons that represent chunks of upstream watershed. When a dissolve operation needs to be performed on a large watershed, these pre-dissolved polygons can be substituted for the n polygons that they represent. This will drastically reduce the total number of polygons in the dissolve for larger watersheds, thereby improving performance.
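
The substitution idea can be sketched in a few lines of Python. This is only an illustration under assumptions: it supposes each pre-dissolved polygon records the set of basin IDs it covers and that those sets do not overlap. The names `substitute_predissolved`, `upstream_ids`, and `groups` are hypothetical and are not taken from this project's code.

```python
def substitute_predissolved(upstream_ids, groups):
    """Swap member basins for their pre-dissolved polygon where possible.

    upstream_ids: set of basin IDs that make up the watershed.
    groups: dict mapping a pre-dissolved polygon ID to the set of basin
        IDs it represents (assumed here not to overlap one another).
    Returns (group IDs to use, basin IDs that still need dissolving).
    """
    remaining = set(upstream_ids)
    used = []
    for group_id, members in groups.items():
        # Only substitute when the whole chunk lies inside the watershed.
        if members and members <= remaining:
            used.append(group_id)
            remaining -= members
    return used, remaining


upstream = set(range(1, 11))  # basins 1..10 form the watershed
groups = {"A": {1, 2, 3, 4}, "B": {5, 6, 7}, "C": {10, 11, 12}}
used, remaining = substitute_predissolved(upstream, groups)
print(used, sorted(remaining))
```

In this toy case the final dissolve would operate on 5 polygons (2 pre-dissolved chunks plus 3 individual basins) instead of 10. Chunk "C" is skipped because it extends outside the watershed.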

Closure Criteria

  • Logic is developed for pre-dissolving polygons into meaningful chunks.
  • Optimization for chunking has been tested and implemented.
  • Additional pre-dissolved basin layers have been developed and exported to compressed geoparquet.
  • Layers have been uploaded to sharing location and are available for development of dissolve logic.
@ptomasula ptomasula added the enhancement New feature or request label Jul 23, 2024
@ptomasula ptomasula self-assigned this Jul 23, 2024
@ptomasula
Contributor Author

ptomasula commented Jul 23, 2024

I performed some initial testing to determine an appropriate upstream subshed threshold for performing a pre-dissolve. The goal is to identify a value (n) such that reaches with n or fewer upstream subsheds are pre-dissolved.

Threshold in this case refers to a count of subshed polygons upstream of a reach; reaches whose count is less than or equal to the threshold are pre-dissolved.
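
To make the threshold definition concrete, here is a minimal sketch on a toy reach network. It rests on several assumptions that are not confirmed in this thread: the network is a tree described by a `downstream_of` mapping, a reach's count includes its own subshed, single-subshed chunks are skipped, and only the most downstream qualifying reach on each branch becomes a pre-dissolve node. The function names are hypothetical.

```python
def upstream_counts(downstream_of):
    """Count subsheds at or upstream of each reach (the reach itself included).

    downstream_of: dict mapping reach ID -> ID of the next reach downstream,
        or None for the root (outlet).
    """
    counts = {reach: 0 for reach in downstream_of}
    for reach in downstream_of:
        # Each reach adds one to itself and to every reach downstream of it.
        node = reach
        while node is not None:
            counts[node] += 1
            node = downstream_of[node]
    return counts


def predissolve_nodes(downstream_of, threshold):
    """Pick reaches whose count is <= threshold while the next reach
    downstream exceeds it (the largest chunks that still qualify)."""
    counts = upstream_counts(downstream_of)
    nodes = []
    for reach, down in downstream_of.items():
        # Assumption: a single subshed is not worth pre-dissolving.
        if counts[reach] < 2 or counts[reach] > threshold:
            continue
        if down is None or counts[down] > threshold:
            nodes.append(reach)
    return sorted(nodes)


# Toy network: reach 1 is the root; 2 and 3 drain into it, and so on.
downstream_of = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 4}
print(predissolve_nodes(downstream_of, threshold=2))
print(predissolve_nodes(downstream_of, threshold=7))
```

With a threshold of 2 the sketch selects reaches 3 and 4. With a threshold of 7, which equals the size of the whole toy watershed, the root itself is the only node selected, so the entire watershed collapses into one pre-dissolved polygon.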

One important note for interpreting results is that testing was conducted at the root of the watershed. For results with larger thresholds and smaller watersheds, the pre-dissolve logic would have resulted in an entire watershed being pre-dissolved. This makes the results for those scenarios look extremely performant, but the reality is different. Had we tested at just one reach upstream from the root, the pre-dissolve logic would not have been used at all, and the results would be closer to the control scenario.

| Threshold | 857 subsheds | 855 subsheds | 689 subsheds | 443 subsheds | 263 subsheds | 159 subsheds | 131 subsheds | 93 subsheds | 57 subsheds | 37 subsheds |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| No pre-dissolve (control) | 11.1982 | 10.7074 | 8.6145 | 4.9172 | 2.8233 | 1.2009 | 1.1691 | 1.0206 | 0.4646 | 0.3533 |
| 5 | 9.3891 | 9.7361 | 7.4810 | 4.6031 | 2.5932 | 1.2613 | 1.2640 | 0.9707 | 0.4871 | 0.3199 |
| 10 | 8.0661 | 8.2562 | 6.5070 | 3.8708 | 2.1506 | 0.9305 | 0.9579 | 0.7909 | 0.4117 | 0.2047 |
| 25 | 5.5936 | 5.4039 | 4.5478 | 2.7599 | 1.3417 | 0.6659 | 0.6535 | 0.4152 | 0.2134 | 0.1485 |
| 50 | 4.2615 | 5.2853 | 3.9101 | 2.6000 | 0.9594 | 0.6750 | 0.5557 | 0.2917 | 0.0716 | 0.0057 |
| 100 | 3.2576 | 4.1618 | 2.2441 | 2.2099 | 0.7508 | 0.4158 | 0.2937 | 0.0043 | 0.0354 | 0.0033 |
| 200 | 2.8408 | 2.7946 | 1.5041 | 2.0656 | 0.5336 | 0.0529 | 0.1380 | 0.0036 | 0.0378 | 0.0062 |
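
To put the two largest cases above in relative terms, the ratio of control time to pre-dissolved time can be computed directly. This is only arithmetic on the timings reported above; the `timings` dict and `speedup` helper are illustrative names, not project code.

```python
# Dissolve timings for the two largest watersheds, copied from the results above.
timings = {
    857: {"control": 11.1982, 100: 3.2576, 200: 2.8408},
    855: {"control": 10.7074, 100: 4.1618, 200: 2.7946},
}


def speedup(subsheds, threshold):
    """Ratio of the control run time to the run time at a given threshold."""
    row = timings[subsheds]
    return row["control"] / row[threshold]


for subsheds in timings:
    for threshold in (100, 200):
        ratio = speedup(subsheds, threshold)
        print(f"{subsheds} subsheds, threshold {threshold}: {ratio:.1f}x")
```

For the 857-subshed watershed this works out to roughly 3.4x at a threshold of 100 and roughly 3.9x at a threshold of 200.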

@ptomasula
Contributor Author

From these initial results I think a threshold of around 100 seems like a reasonable setting, though I should also run this again with a 200 threshold. Looking at the control (no pre-dissolve) case, it seems like 263 subsheds gives a reasonable runtime and 443 starts to get into too-slow territory. I think with any threshold higher than 200 we run the risk of having reaches upstream of a pre-dissolve node with subshed counts large enough to impact delineation performance.

@ptomasula
Contributor Author

I added in results for a threshold of 200. The performance gain does seem worth bumping the processing threshold up to 200, noting that the baseline performance for 200 polygons (without pre-dissolve) is reasonable.

@aufdenkampe
Member

@ptomasula, Thanks for sharing all these results! They're great to see.
How are you selecting which subsheds to pre-dissolve? Also, I'm wondering how pre-simplifying the pre-dissolved boundaries might further speed things up. My thinking is that fine resolution boundaries are no longer necessary once you move to larger watersheds.
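
As a rough illustration of what pre-simplifying a boundary means, here is a pure-Python sketch of the Douglas-Peucker algorithm, a common line-simplification method. It is not necessarily what this project would use, and the `boundary` coordinates and `tolerance` value are made up for the example.

```python
import math


def _point_segment_distance(p, a, b):
    """Distance from point p to the line segment a-b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    # Project p onto the segment and clamp to its end points.
    t = ((px - ax) * dx + (py - ay) * dy) / (dx * dx + dy * dy)
    t = max(0.0, min(1.0, t))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))


def simplify_line(points, tolerance):
    """Douglas-Peucker: drop vertices within tolerance of the simplified line."""
    if len(points) < 3:
        return list(points)
    first, last = points[0], points[-1]
    index, max_dist = 0, 0.0
    for i in range(1, len(points) - 1):
        dist = _point_segment_distance(points[i], first, last)
        if dist > max_dist:
            index, max_dist = i, dist
    if max_dist <= tolerance:
        return [first, last]
    # Keep the farthest vertex and simplify each side separately.
    left = simplify_line(points[: index + 1], tolerance)
    right = simplify_line(points[index:], tolerance)
    return left[:-1] + right


# Hypothetical open boundary with small wiggles along its bottom edge.
boundary = [(0, 0), (1, 0.05), (2, -0.04), (3, 0.02), (4, 0), (4, 3), (0, 3)]
simplified = simplify_line(boundary, tolerance=0.1)
print(len(boundary), "->", len(simplified), "vertices")
```

One caveat worth keeping in mind: simplifying neighboring polygons independently can open gaps or overlaps along shared edges, so it may be safer to simplify only after a chunk has been dissolved.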

ptomasula added a commit that referenced this issue Jul 31, 2024
Related issue #9
This adds a function to leverage the MNSI information and group upstream basins into meaningful groups that can be pre-dissolved. Pre-dissolving will allow for fewer total polygons in the final dissolve.
aufdenkampe added a commit that referenced this issue Sep 11, 2024
@ptomasula, I figured out, fixed, and tested the issue with the batch pipeline. The short story is that we used `compute_dissolve_groups()` at the wrong point in the workflow. I also added a few other fixes, such as for dtypes and adding ELEMENT_COUNT back to the output fields.
@aufdenkampe
Member

@ptomasula, I figured out, fixed, and tested the issue with the batch pipeline in 3d441b5 (see commit notes).

Try running it through our files on our modeling computer!
