Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] Derived metric can result in null if one of the components is null #764

Closed
2 tasks done
siljamardla opened this issue Sep 6, 2023 · 3 comments
Closed
2 tasks done
Labels
bug Something isn't working triage Tasks that need to be triaged

Comments

@siljamardla
Copy link

siljamardla commented Sep 6, 2023

Is this a new bug in metricflow?

  • I believe this is a new bug in metricflow
  • I have searched the existing issues, and I could not find an existing issue for this

Current Behavior

There is a derived metric:
derived_metric = metric1 + metric2
After some debugging I noticed that if metric2 is null then derived_metric is also null. Which makes sense technically, but in case of this particular metric does not at all make sense business wise. The example I'm working with is something like
total_price = base_price + extra_fees
and obviously I don't want the extra fees being null make the total price null.

Probably related #759

Expected Behavior

A derived metric that adds two or more metrics should not result in null if one of the components is null.

Steps To Reproduce

Generating sample data:

select 
      1 as test_order_id
    , 10 as price
    , 'base' as price_type
    , date('2023-08-01') as order_created_at
union select
      1 as test_order_id
    , 5 as price
    , 'additional' as price_type
    , date('2023-08-01') as order_created_at
union select 
      1 as test_order_id
    , 5 as price
    , 'delivery' as price_type
    , date('2023-08-01') as order_created_at

union select 
      2 as test_order_id
    , 10 as price
    , 'base' as price_type
    , date('2023-08-02') as order_created_at
union select
      2 as test_order_id
    , 5 as price
    , 'additional' as price_type
    , date('2023-08-02') as order_created_at
union select 
      2 as test_order_id
    , 0 as price
    , 'delivery' as price_type
    , date('2023-08-02') as order_created_at

union select 
      3 as test_order_id
    , 10 as price
    , 'base' as price_type
    , date('2023-08-03') as order_created_at
union select
      3 as test_order_id
    , 5 as price
    , 'additional' as price_type
    , date('2023-08-03') as order_created_at
test_order_id price price_type order_created_at
1 5 additional 02.08.2023
1 10 base 01.08.2023
1 5 delivery 03.08.2023
2 5 additional 02.08.2023
2 10 base 01.08.2023
2 0 delivery 03.08.2023
3 5 additional 02.08.2023
3 10 base 01.08.2023

Notice that test_order_id = 3 does not have a row for delivery price.

Semantic model and metrics:

semantic_models:
  - name: order_data
    defaults:
      agg_time_dimension: order_created_at
    model: ref('test_data')
    # Entities. These usually corespond to keys in the table.
    entities:
      - name: test_order_id
        type: primary
    # Measures. These are the aggregations on the columns in the table.
    measures:
      - name: price
        expr: price
        agg: sum
    # Dimensions. Either categorical or time.
    dimensions:
      - name: order_created_at
        type: time
        type_params:
          time_granularity: day
      - name: price_type
        type: categorical


metrics:
  - name: metric_base_price
    label: metric_base_price
    type: simple
    type_params:
      measure: price
    filter: |
      {{ Dimension('test_order_id__price_type') }} = 'base'

  - name: metric_additional_price
    label: metric_additional_price
    type: simple
    type_params:
      measure: price
    filter: |
      {{ Dimension('test_order_id__price_type') }} = 'additional'

  - name: metric_delivery_price
    label: metric_delivery_price
    type: simple
    type_params:
      measure: price
    filter: |
      {{ Dimension('test_order_id__price_type') }} = 'delivery'

  - name: total_price
    label: total_price
    type: derived
    type_params:
          expr: metric_base_price + metric_additional_price + metric_delivery_price
          metrics:
            - name: metric_base_price
            - name: metric_additional_price
            - name: metric_delivery_price

Query without group by:

mf query --metrics metric_base_price,metric_additional_price,metric_delivery_price,total_price

Returns the expected output:

|   metric_base_price |   metric_additional_price |   metric_delivery_price |   total_price |
|--------------------:|--------------------------:|------------------------:|--------------:|
|               30.00 |                     15.00 |                    5.00 |         50.00 |

Query with group by:

mf query --metrics metric_base_price,metric_additional_price,metric_delivery_price,total_price --group-by test_order_id__order_created_at

Returns total_price = null for 2023-08-03

| test_order_id__order_created_at__day   |   metric_base_price |   metric_additional_price |   metric_delivery_price |   total_price |
|:---------------------------------------|--------------------:|--------------------------:|------------------------:|--------------:|
| 2023-08-03                             |               10.00 |                      5.00 |                         |               |
| 2023-08-01                             |               10.00 |                      5.00 |                    5.00 |         20.00 |
| 2023-08-02                             |               10.00 |                      5.00 |                    0.00 |         15.00 |

although it should be 15.

I have reviewed the compiled SQL (only for total_price, less code to review).

mf query --metrics total_price --group-by test_order_id__order_created_at --explain

This is the original code:

SELECT
  test_order_id__order_created_at__day
  , metric_base_price + metric_additional_price + metric_delivery_price AS total_price
FROM (
  SELECT
    COALESCE(subq_6.test_order_id__order_created_at__day, subq_13.test_order_id__order_created_at__day, subq_20.test_order_id__order_created_at__day) AS test_order_id__order_created_at__day
    , subq_6.metric_base_price AS metric_base_price
    , subq_13.metric_additional_price AS metric_additional_price
    , subq_20.metric_delivery_price AS metric_delivery_price
  FROM (
    SELECT
      test_order_id__order_created_at__day
      , SUM(price) AS metric_base_price
    FROM (
      SELECT
        order_created_at AS test_order_id__order_created_at__day
        , price_type AS test_order_id__price_type
        , price
      FROM `schema_name`.`test_data` order_data_src_4
    ) subq_2
    WHERE test_order_id__price_type = 'base'
    GROUP BY
      test_order_id__order_created_at__day
  ) subq_6
  INNER JOIN (
    SELECT
      test_order_id__order_created_at__day
      , SUM(price) AS metric_additional_price
    FROM (
      SELECT
        order_created_at AS test_order_id__order_created_at__day
        , price_type AS test_order_id__price_type
        , price
      FROM `schema_name`.`test_data` order_data_src_4
    ) subq_9
    WHERE test_order_id__price_type = 'additional'
    GROUP BY
      test_order_id__order_created_at__day
  ) subq_13
  ON
    (
      subq_6.test_order_id__order_created_at__day = subq_13.test_order_id__order_created_at__day
    ) OR (
      (
        subq_6.test_order_id__order_created_at__day IS NULL
      ) AND (
        subq_13.test_order_id__order_created_at__day IS NULL
      )
    )
  INNER JOIN (
    SELECT
      test_order_id__order_created_at__day
      , SUM(price) AS metric_delivery_price
    FROM (
      SELECT
        order_created_at AS test_order_id__order_created_at__day
        , price_type AS test_order_id__price_type
        , price
      FROM `schema_name`.`test_data` order_data_src_4
    ) subq_16
    WHERE test_order_id__price_type = 'delivery'
    GROUP BY
      test_order_id__order_created_at__day
  ) subq_20
  ON
    (
      subq_6.test_order_id__order_created_at__day = subq_20.test_order_id__order_created_at__day
    ) OR (
      (
        subq_6.test_order_id__order_created_at__day IS NULL
      ) AND (
        subq_20.test_order_id__order_created_at__day IS NULL
      )
    )
) subq_21

I have a modified version that would output the expected result.

SELECT
  test_order_id__order_created_at__day
  ,   COALESCE(metric_base_price,0) 
    + COALESCE(metric_additional_price,0) 
    + COALESCE(metric_delivery_price,0) 
    AS total_price
FROM (
  SELECT
    COALESCE(subq_6.test_order_id__order_created_at__day, subq_13.test_order_id__order_created_at__day, subq_20.test_order_id__order_created_at__day) AS test_order_id__order_created_at__day
    , subq_6.metric_base_price AS metric_base_price
    , subq_13.metric_additional_price AS metric_additional_price
    , subq_20.metric_delivery_price AS metric_delivery_price
  FROM (
    SELECT
      test_order_id__order_created_at__day
      , SUM(price) AS metric_base_price
    FROM (
      SELECT
        order_created_at AS test_order_id__order_created_at__day
        , price_type AS test_order_id__price_type
        , price
      FROM `schema_name`.`test_data` order_data_src_4
    ) subq_2
    WHERE test_order_id__price_type = 'base'
    GROUP BY
      test_order_id__order_created_at__day
  ) subq_6
  FULL OUTER JOIN (
    SELECT
      test_order_id__order_created_at__day
      , SUM(price) AS metric_additional_price
    FROM (
      SELECT
        order_created_at AS test_order_id__order_created_at__day
        , price_type AS test_order_id__price_type
        , price
      FROM `schema_name`.`test_data` order_data_src_4
    ) subq_9
    WHERE test_order_id__price_type = 'additional'
    GROUP BY
      test_order_id__order_created_at__day
  ) subq_13
  ON
    (
      subq_6.test_order_id__order_created_at__day = subq_13.test_order_id__order_created_at__day
    ) OR (
      (
        subq_6.test_order_id__order_created_at__day IS NULL
      ) AND (
        subq_13.test_order_id__order_created_at__day IS NULL
      )
    )
  FULL OUTER JOIN (
    SELECT
      test_order_id__order_created_at__day
      , SUM(price) AS metric_delivery_price
    FROM (
      SELECT
        order_created_at AS test_order_id__order_created_at__day
        , price_type AS test_order_id__price_type
        , price
      FROM `schema_name`.`test_data` order_data_src_4
    ) subq_16
    WHERE test_order_id__price_type = 'delivery'
    GROUP BY
      test_order_id__order_created_at__day
  ) subq_20
  ON
    (
      subq_6.test_order_id__order_created_at__day = subq_20.test_order_id__order_created_at__day
    ) OR (
      (
        subq_6.test_order_id__order_created_at__day IS NULL
      ) AND (
        subq_20.test_order_id__order_created_at__day IS NULL
      )
    )
) subq_21

I have made two changes:

  • On line 3 I have added a COALESCE into the total_price calculation
  • On line 28 and 53 there is now a FULL OUTER JOIN instead of the original INNER JOIN

I am not 100% sure that this would be a good solution for all types of derived metrics, but for sum type metrics this seems like a good approach.

Relevant log output

No response

Environment

- OS: Mac
- Python: 3.11
- dbt: 1.6.1
- metricflow: 0.201.0

Which database are you using?

No response

Additional Context

Databricks

@siljamardla siljamardla added bug Something isn't working triage Tasks that need to be triaged labels Sep 6, 2023
@tlento
Copy link
Contributor

tlento commented Sep 7, 2023

Thank you for providing all of the context behind the issue as you've experienced it, it's very helpful! I agree zero-substitution would be the correct thing to do in most of these cases.

The main issue we face with defaulting to zero-substitution, even for cases like this, is there are scenarios where a NULL indicates a data error and zero-substitution would mask it with a return value that might be within a range where the final metric appears to be correct even though it is actually wrong. These scenarios are not always detectable from semantic analysis of the expr value itself (which we don't do anyway), and so we're kind of stuck with returning NULL by default in order to communicate to end consumers that something is, indeed, broken.

Of course our proposed solution in #759 adds configuration overhead for the many cases where returning 0 is more correct than returning NULL, and it puts people in an awkward place where they're getting results that are clearly wrong from some queries but not others until they add the relevant configuration parameter.

Such annoying configuration quirks are things we can and should address. To that end, I think there is a feature request in here which we should consider once #759 is available, which is to find a way to simplify these more complex derived metric configurations for exprs that are basically just adding things together. The idea is something similar to RATIO, but for other common operations. That metric type (or types) could conceivably allow for a default override for zero-substitution, making it much more parsimonious than tagging every input metric separately.

If this all seems reasonable I'll be happy to update the description here - with the excellent context you've provided as an example of where having to opt in to zero substitution on every input metric gets painful and perhaps a bit silly - as a feature request for better ways to configure derived metric expressions for common operations, including support for more natural zero-substitution behavior.

Please let us know what you think!

@siljamardla
Copy link
Author

So you are proposing a generic feature to define null behaviour + dedicated derived metric types for known operations where the null behaviour has already been configured appropriately. Sounds good to me.
What could possibly be the timeline for either?

Feel free to update the issue, but it would be nice if the original content did not get completely lost :)

@tlento
Copy link
Contributor

tlento commented Nov 15, 2023

So you are proposing a generic feature to define null behaviour + dedicated derived metric types for known operations where the null behaviour has already been configured appropriately.

The first of these (defining null behavior as specified in #759) has been released with MetricFlow 0.203.0.

Just to preserve history here I'm going to close this issue and open a new backlog feature request for the follow-up of making it easier to define metric types that do zero-coalescing when it makes sense.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working triage Tasks that need to be triaged
Projects
None yet
Development

No branches or pull requests

2 participants