Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: estimate query cost in Postgres #12130

Merged
merged 3 commits into from
Dec 19, 2020

Conversation

betodealmeida
Copy link
Member

SUMMARY

Currently the estimate query cost feature is only enabled for Presto, making it hard to test the functionality and iterate on SQL Lab without breaking it.

This PR adds a simple query cost estimator to Postgres.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Screenshot_2020-12-18 Superset

TEST PLAN

See screenshot.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@@ -132,7 +132,8 @@ class PrestoEngineSpec(BaseEngineSpec): # pylint: disable=too-many-public-metho
}

@classmethod
def get_allow_cost_estimate(cls, version: Optional[str] = None) -> bool:
def get_allow_cost_estimate(cls, extra: Dict[str, Any]) -> bool:
version = extra.get("version")
return version is not None and StrictVersion(version) >= StrictVersion("0.319")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one liner:
return extra.get("version") and StrictVersion(version) >= StrictVersion("0.319")

@hughhhh hughhhh self-requested a review December 18, 2020 20:41
@codecov-io
Copy link

codecov-io commented Dec 18, 2020

Codecov Report

Merging #12130 (7fa541e) into master (2a23744) will decrease coverage by 2.59%.
The diff coverage is 50.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #12130      +/-   ##
==========================================
- Coverage   66.22%   63.62%   -2.60%     
==========================================
  Files         972      481     -491     
  Lines       48098    29708   -18390     
  Branches     4752        0    -4752     
==========================================
- Hits        31852    18903   -12949     
+ Misses      16108    10805    -5303     
+ Partials      138        0     -138     
Flag Coverage Δ
cypress ?
javascript ?
python 63.62% <50.00%> (-0.58%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/db_engine_specs/base.py 86.08% <35.71%> (-1.52%) ⬇️
superset/commands/importers/v1/__init__.py 94.73% <50.00%> (-1.21%) ⬇️
superset/db_engine_specs/presto.py 70.56% <50.00%> (-11.85%) ⬇️
superset/db_engine_specs/postgres.py 85.96% <52.94%> (-14.04%) ⬇️
superset/config.py 90.54% <100.00%> (+0.03%) ⬆️
superset/databases/schemas.py 100.00% <100.00%> (ø)
superset/models/core.py 88.04% <100.00%> (-0.85%) ⬇️
superset/db_engines/hive.py 0.00% <0.00%> (-85.72%) ⬇️
superset/db_engine_specs/hive.py 54.61% <0.00%> (-29.62%) ⬇️
... and 497 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2a23744...7fa541e. Read the comment docs.

@ktmud
Copy link
Member

ktmud commented Dec 18, 2020

What's the unit of this cost? Seconds?

@betodealmeida
Copy link
Member Author

betodealmeida commented Dec 18, 2020

What's the unit of this cost? Seconds?

The costs are measured in arbitrary units determined by the planner's cost parameters (see Section 18.7.2). Traditional practice is to measure the costs in units of disk page fetches; that is, seq_page_cost is conventionally set to 1.0 and the other cost parameters are set relative to that. The examples in this section are run with the default cost parameters. [reference]

It's a relative value, so not super useful. But it's possible to collect stats about queries, and define a custom QUERY_COST_FORMATTERS_BY_ENGINE that maps from the relative values to something else. For example, you could run EXPLAIN offline every day in all recent queries, and use the results to built a histogram, so that the formatter says something like "this query is at the top 3% percentile of all queries".

We had this running at Lyft for Presto, and we would show wall time, dollar cost and carbon footprint this way.

See #8470.

@ktmud
Copy link
Member

ktmud commented Dec 18, 2020

Thanks for the explanation! In that case, do you think it's worth adding the link to Postgres doc somewhere in the modal. Or at least a one-liner to explain how is this calculated? E.g.

Query cost is estimated by `EXPLAIN` statement. Refer to the database docs for details.

@betodealmeida
Copy link
Member Author

betodealmeida commented Dec 18, 2020

@ktmud unfortunately the modal is not DB-specific, so we can't add that message to it.

I also don't expect people to use it without an admin setting up a custom formatter that translates the number into a meaningful value. I added an example to config.py showing to set it up so that the numbers are more tangible.

Here's a screenshot with that config in place:

Screen Shot 2020-12-18 at 3 15 37 PM

@betodealmeida betodealmeida merged commit 877b153 into apache:master Dec 19, 2020
@ktmud
Copy link
Member

ktmud commented Dec 19, 2020

@betodealmeida Thanks for the explanation. This makes sense.

@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.0.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels preset-io size/L 🚢 1.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants