
Lift + shift for cross-db macros #597

Merged
merged 15 commits into from
Jun 17, 2022

Conversation

dbeatty10
Contributor

@dbeatty10 dbeatty10 commented May 25, 2022

Continues #594

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change
  • behind-the-scenes refactor

Description & motivation

See:

To do before merge

  • Reset dev-requirements.txt back to original
  • Explicitly document the definition of dbt_utils.current_timestamp() before and after (for each database)
    • Determine if the definitions are logically equivalent or not
  • Determine how to handle tests decorated with @pytest.mark

current_timestamp()

Each database adapter already had a pre-existing definition of current_timestamp(), often expressed differently than the dbt-utils version. I worry that this could lead to a breaking change, so we should determine whether or not the definitions are logically equivalent.

We have some options for how to handle this:

  1. Keep the existing definition within each database adapter and discard the dbt-utils definition
  2. Overwrite the existing definition within each database adapter in favor of the dbt-utils definition
  3. Keep the existing definition within each database adapter and re-institute the dbt-utils definition back into the macros/cross_db_utils folder.

Comparisons

See below for a comparison between the existing adapter definition vs. the dbt-utils definition.

Default

Adapter:

core/dbt/include/global_project/macros/adapters/freshness.sql

{% macro default__current_timestamp() -%}
  {{ exceptions.raise_not_implemented(
    'current_timestamp macro not implemented for adapter '+adapter.type()) }}
{%- endmacro %}

dbt-utils:

cross_db_utils/current_timestamp.sql

{% macro default__current_timestamp() %}
    current_timestamp::{{dbt_utils.type_timestamp()}}
{% endmacro %}

Postgres

Adapter:

plugins/postgres/dbt/include/postgres/macros/adapters.sql

{% macro postgres__current_timestamp() -%}
  now()
{%- endmacro %}

dbt-utils:

cross_db_utils/current_timestamp.sql

{% macro default__current_timestamp() %}
    current_timestamp::{{dbt_utils.type_timestamp()}}
{% endmacro %}

Redshift

Adapter:

dbt/include/redshift/macros/adapters.sql

{% macro redshift__current_timestamp() -%}
  getdate()
{%- endmacro %}

dbt-utils:

cross_db_utils/current_timestamp.sql

{% macro redshift__current_timestamp() %}
    getdate()
{% endmacro %}

Snowflake

Adapter:

dbt/include/snowflake/macros/adapters.sql

{% macro snowflake__current_timestamp() -%}
  convert_timezone('UTC', current_timestamp())
{%- endmacro %}

dbt-utils:

cross_db_utils/current_timestamp.sql

{% macro default__current_timestamp() %}
    current_timestamp::{{dbt_utils.type_timestamp()}}
{% endmacro %}

BigQuery

Adapter:

dbt/include/bigquery/macros/adapters.sql

{% macro bigquery__current_timestamp() -%}
  CURRENT_TIMESTAMP()
{%- endmacro %}

dbt-utils:

cross_db_utils/current_timestamp.sql

{% macro bigquery__current_timestamp() %}
    current_timestamp
{% endmacro %}

@dbeatty10
Contributor Author

dbeatty10 commented May 31, 2022

Update 2022-08-29: See comment here for a contrarian analysis. Original post continues below.

Added a functional test to compare the competing definitions for current_timestamp() from the adapter vs. dbt-utils. Passed across all 4 adapters:

[screenshot: functional test results passing across all 4 adapters]

See below for the actual SQL utilized in each test.

Postgres

select
    now() as actual,
    
    current_timestamp::
    timestamp without time zone

 as expected

Redshift

select
    getdate() as actual,
    
    getdate()
 as expected

Snowflake

select
    convert_timezone('UTC', current_timestamp()) as actual,
    
    current_timestamp::
    timestamp_ntz

 as expected

BigQuery

select
    CURRENT_TIMESTAMP() as actual,
    
    current_timestamp
 as expected
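The pass/fail logic of that functional test can be sketched in Python (a toy model, not the actual test code): two current_timestamp() implementations are treated as equivalent when the instants they return fall within a small tolerance of each other.

```python
from datetime import datetime, timedelta, timezone

def logically_equivalent(actual: datetime, expected: datetime,
                         tolerance: timedelta = timedelta(seconds=5)) -> bool:
    """Toy equivalence check: two implementations 'agree' if they return
    (nearly) the same instant. Naive values are assumed to be UTC here,
    mirroring a session pinned to UTC (an assumption for illustration)."""
    def as_utc(ts: datetime) -> datetime:
        return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)
    return abs(as_utc(actual) - as_utc(expected)) <= tolerance

t0 = datetime(2022, 5, 31, 12, 0, 0, tzinfo=timezone.utc)
print(logically_equivalent(t0, t0 + timedelta(seconds=2)))   # True
print(logically_equivalent(t0, t0 + timedelta(minutes=10)))  # False
```

Note a check like this only shows agreement at one instant; it can't prove the two expressions share the same data-type semantics.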

@jtcohen6
Contributor

jtcohen6 commented Jun 1, 2022

@dbeatty10 Amazing work thus far!

As discussed yesterday, let's split out the current_timestamp work into its own separate effort: dbt-labs/dbt-core#5521. That way, the unresolved questions there don't have to block the bulk of this work, even if it means we need another follow-on PR in each plugin.

One realization I had with regard to backwards compatibility, which I don't think I was clear on in #594: This approach will break, I believe, anyone who is currently using a "shim" package (e.g. spark-utils, tsql-utils), and dispatch config that redirects macros from the dbt_utils namespace to prefer the "shim" package namespace. That is, if I have this in my project, as recommended:

# dbt_project.yml
dispatch:
  - macro_namespace: dbt_utils
    search_order: ['spark_utils', 'dbt_utils']

Because dbt_utils.dateadd is now being dispatched into the dbt namespace, rather than the dbt_utils one, it will stop preferring spark_utils.spark__dateadd. That's ameliorable in one of three ways:

  • We also add spark__dateadd to dbt-spark, which we should do (Initialize lift + shift for cross-db macros dbt-spark#359), but can't necessarily rely on all adapter maintainers to be complete with this
  • We tell users to switch the macro_namespace in that dispatch config from dbt_utils to dbt — which works, but we should really aim to avoid any required changes to user project code when upgrading minor versions
  • We add slightly more boilerplate code in this PR (blech), to preserve dbt_utils as one step along the dispatch train. So replacing this:
{% macro any_value(expression) -%}
    {{ return(adapter.dispatch('any_value', 'dbt') (expression)) }}
{% endmacro %}

With this:

{% macro any_value(expression) -%}
    {{ return(adapter.dispatch('any_value', 'dbt_utils') (expression)) }}
{% endmacro %}

{% macro default__any_value(expression) -%}
    {{ return(adapter.dispatch('any_value', 'dbt') (expression)) }}
{% endmacro %}

And maybe a note at the top of every file, "This is here for backwards compatibility only"?
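The breakage described above can be modeled with a toy resolver (illustrative only, not dbt internals): dispatch searches each package in the configured order for `<prefix>__<macro>`, so changing the dispatch target from `dbt_utils` to `dbt` silently stops consulting the shim package.

```python
# Toy model of dbt's dispatch search order (not dbt internals).
MACROS = {
    "spark_utils": {"spark__dateadd": "spark_utils implementation"},
    "dbt_utils": {"default__dateadd": "dbt_utils implementation"},
    "dbt": {"default__dateadd": "dbt-core implementation"},
}
# User config: redirect the dbt_utils namespace to prefer the shim package.
DISPATCH = {"dbt_utils": ["spark_utils", "dbt_utils"]}

def dispatch(macro_namespace: str, name: str, adapter: str) -> str:
    search_order = DISPATCH.get(macro_namespace, [macro_namespace])
    for package in search_order:
        for prefix in (adapter, "default"):
            impl = MACROS.get(package, {}).get(f"{prefix}__{name}")
            if impl:
                return impl
    raise LookupError(name)

# Dispatching into 'dbt_utils' honors the shim package:
print(dispatch("dbt_utils", "dateadd", adapter="spark"))  # spark_utils implementation
# Dispatching into 'dbt' bypasses the user's dispatch config entirely:
print(dispatch("dbt", "dateadd", adapter="spark"))        # dbt-core implementation
```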

@dbeatty10
Contributor Author

Sounds great @jtcohen6 -- I will split out the current_timestamp work as you described.

In terms of backwards compatibility, I'm most attracted to any solution that preserves full backwards compatibility without users or shim maintainers needing to do anything. So I'll take a swing at implementing the approach you described in the 3rd option for amelioration.

If we want, we can also add a deprecation warning so we have the option to fully remove this boilerplate in a future major release.

@dbeatty10
Contributor Author

dbeatty10 commented Jun 8, 2022

@jtcohen6 could I get your help brainstorming what could be going wrong here?

All functional tests are working. But the dbt-utils OG integration tests are not working (for Redshift specifically).

Overview of failure on CircleCI

After setting up the dispatch train, all adapters are passing the OG integration tests... except for Redshift ☹️

Here's the error:

Steps to reproduce

  1. Clone this repo
  2. Checkout this branch
  3. ./run_test.sh redshift data_test_sequential_timestamps

Working hypothesis

Maybe a combination of nested multiple dispatch + non-static parsing + where resource replacement is overwriting or misplacing dbt-core's dateadd / default__dateadd Jinja node (with the get_where_subquery / default__get_where_subquery node)?

Possibly unrelated context:

  • test_dateadd within the OG dbt-utils integration tests is a generic test that automatically has a subquery inserted that contains a where clause.
  • The static parser is unable to parse cross_db_utils/test_dateadd.sql, so it falls back to using Jinja rendering instead.

What is different about dbt-redshift

The Redshift adapter is trying to explicitly reference the default implementation of dateadd that is within dbt-core (to avoid inheriting postgres__dateadd):

{#-- redshift should use default instead of postgres --#}
{% macro redshift__dateadd(datepart, interval, from_date_or_timestamp) %}
    {{ return(dbt.default__dateadd(datepart, interval, from_date_or_timestamp)) }}
{% endmacro %}

Troubleshooting assets

Standard out
$ ./run_test.sh redshift data_test_sequential_timestamps

13:54:40  Running with dbt=1.2.0-a1
13:54:40  Installing ../
13:54:40    Installed from <local @ ../>
13:54:42  Running with dbt=1.2.0-a1

dbt.exceptions.CompilationException: Compilation Error in test dbt_utils_recency_test_recency_day__today__1 (models/generic_tests/schema.yml)
  'dict object' has no attribute 'default__dateadd'
  
  > in macro default__dateadd (macros/cross_db_utils/dateadd.sql)
  > called by macro dateadd (macros/cross_db_utils/dateadd.sql)
  > called by macro default__test_recency (macros/generic_tests/recency.sql)
  > called by macro test_recency (macros/generic_tests/recency.sql)
  > called by macro redshift__dateadd (macros/utils/dateadd.sql)
  > called by test dbt_utils_recency_test_recency_day__today__1 (models/generic_tests/schema.yml)
Jinja state at error time

It is looking for a key named default__dateadd but the only keys within the dictionary are get_where_subquery and default__get_where_subquery.

The error is thrown from here.

Here are the values:

  • self._undefined_name
    • default__dateadd
  • self._undefined_obj
    • {'get_where_subquery': <dbt.clients.jinja.MacroGenerator object at 0x1131b3c10>, 'default__get_where_subquery': <dbt.clients.jinja.MacroGenerator object at 0x1131b3c40>}
Logs

To gain better visibility within the call stack (namely source repo and dispatched macro name), I added logging to the *dateadd macros like this:

{% macro dateadd(datepart, interval, from_date_or_timestamp) %}
  {{ log("dbt-utils dateadd", True) }}
  {{ return(adapter.dispatch('dateadd', 'dbt_utils')(datepart, interval, from_date_or_timestamp)) }}
{% endmacro %}

integration_tests/logs/dbt.log

============================== 2022-06-08 13:54:42.372447 | f8fbaae3-e054-4e64-a1b3-3617c5cd9b78 ==============================
13:54:42.372487 [info ] [MainThread]: Running with dbt=1.2.0-a1
13:54:42.372905 [debug] [MainThread]: running dbt with arguments {'write_json': True, 'use_colors': True, 'printer_width': 80, 'version_check': True, 'partial_parse': True, 'static_parser': True, 'profiles_dir': '/Users/dbeatty/.dbt', 'send_anonymous_usage_stats': False, 'event_buffer_size': 100000, 'quiet': False, 'no_print': False, 'target': 'redshift', 'full_refresh': True, 'show': False, 'which': 'seed', 'rpc_method': 'seed', 'indirect_selection': 'eager'}
13:54:42.373043 [debug] [MainThread]: Tracking: do not track
13:54:42.485033 [debug] [MainThread]: Partial parsing enabled: 0 files deleted, 0 files added, 2 files changed.
13:54:42.486362 [debug] [MainThread]: Partial parsing: updated file: dbt_redshift://macros/utils/dateadd.sql
13:54:42.486534 [debug] [MainThread]: Partial parsing: updated file: dbt://macros/utils/dateadd.sql
13:54:42.486696 [debug] [MainThread]: Parsing macros/utils/dateadd.sql
13:54:42.487674 [debug] [MainThread]: Parsing macros/utils/last_day.sql
13:54:42.488781 [debug] [MainThread]: Parsing macros/utils/dateadd.sql
13:54:42.490770 [debug] [MainThread]: Parsing macros/utils/last_day.sql
13:54:42.492173 [debug] [MainThread]: Parsing macros/cross_db_utils/dateadd.sql
13:54:42.493575 [debug] [MainThread]: Parsing macros/sql/date_spine.sql
13:54:42.496896 [debug] [MainThread]: Parsing macros/materializations/insert_by_period_materialization.sql
13:54:42.517900 [debug] [MainThread]: Parsing macros/generic_tests/recency.sql
13:54:42.519263 [debug] [MainThread]: Parsing macros/generic_tests/sequential_values.sql
13:54:42.552471 [debug] [MainThread]: 1603: static parser failed on datetime/test_date_spine.sql
13:54:42.572842 [info ] [MainThread]: DEBUG: dbt-utils dateadd
13:54:42.574659 [info ] [MainThread]: DEBUG: dbt-utils default__dateadd
13:54:42.576546 [info ] [MainThread]: DEBUG: dbt-redshift redshift__dateadd
13:54:42.578815 [info ] [MainThread]: DEBUG: dbt-core default__dateadd
13:54:42.579271 [info ] [MainThread]: 
13:54:42.587075 [debug] [MainThread]: 1602: parser fallback to jinja rendering on datetime/test_date_spine.sql
13:54:42.587971 [debug] [MainThread]: 1603: static parser failed on generic_tests/test_recency.sql
13:54:42.596275 [debug] [MainThread]: 1602: parser fallback to jinja rendering on generic_tests/test_recency.sql
13:54:42.597153 [debug] [MainThread]: 1603: static parser failed on cross_db_utils/test_dateadd.sql
13:54:42.600558 [info ] [MainThread]: DEBUG: dbt-utils dateadd
13:54:42.600850 [info ] [MainThread]: DEBUG: dbt-utils default__dateadd
13:54:42.601129 [info ] [MainThread]: DEBUG: dbt-redshift redshift__dateadd
13:54:42.601382 [info ] [MainThread]: DEBUG: dbt-core default__dateadd
13:54:42.601598 [info ] [MainThread]: 
13:54:42.603638 [info ] [MainThread]: DEBUG: dbt-utils dateadd
13:54:42.603985 [info ] [MainThread]: DEBUG: dbt-utils default__dateadd
13:54:42.604262 [info ] [MainThread]: DEBUG: dbt-redshift redshift__dateadd
13:54:42.604571 [info ] [MainThread]: DEBUG: dbt-core default__dateadd
13:54:42.606112 [info ] [MainThread]: 
13:54:42.607200 [info ] [MainThread]: DEBUG: dbt-utils dateadd
13:54:42.607989 [info ] [MainThread]: DEBUG: dbt-utils default__dateadd
13:54:42.608548 [info ] [MainThread]: DEBUG: dbt-redshift redshift__dateadd
13:54:42.608998 [info ] [MainThread]: DEBUG: dbt-core default__dateadd
13:54:42.609837 [info ] [MainThread]: 
13:54:42.610576 [info ] [MainThread]: DEBUG: dbt-utils dateadd
13:54:42.610927 [info ] [MainThread]: DEBUG: dbt-utils default__dateadd
13:54:42.611534 [info ] [MainThread]: DEBUG: dbt-redshift redshift__dateadd
13:54:42.611986 [info ] [MainThread]: DEBUG: dbt-core default__dateadd
13:54:42.612277 [info ] [MainThread]: 
13:54:42.613853 [debug] [MainThread]: 1602: parser fallback to jinja rendering on cross_db_utils/test_dateadd.sql
13:54:42.661481 [info ] [MainThread]: DEBUG: dbt-utils dateadd
13:54:42.661810 [info ] [MainThread]: DEBUG: dbt-utils default__dateadd
13:54:42.662073 [info ] [MainThread]: DEBUG: dbt-redshift redshift__dateadd
Stack trace

I commented out this section within core/dbt/main.py to see the stack trace.

Traceback (most recent call last):
  File "/dbt-core/core/dbt/clients/jinja.py", line 296, in exception_handler
    yield
  File "/dbt-core/core/dbt/clients/jinja.py", line 251, in call_macro
    return macro(*args, **kwargs)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 675, in __call__
    return self._invoke(arguments, autoescape)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 28, in macro
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 460, in call
    if not __self.is_safe_callable(__obj):
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 360, in is_safe_callable
    getattr(obj, "unsafe_callable", False) or getattr(obj, "alters_data", False)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 755, in __getattr__
    return self._fail_with_undefined_error()
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 749, in _fail_with_undefined_error
    raise self._undefined_exception(self._undefined_message)
jinja2.exceptions.UndefinedError: 'dict object' has no attribute 'default__dateadd'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/dbt-utils/venv/bin/dbt", line 7, in <module>
    exec(compile(f.read(), __file__, 'exec'))
  File "/dbt-core/core/scripts/dbt", line 7, in <module>
    dbt.main.main(sys.argv[1:])
  File "/dbt-core/core/dbt/main.py", line 129, in main
    results, succeeded = handle_and_check(args)
  File "/dbt-core/core/dbt/main.py", line 191, in handle_and_check
    task, res = run_from_args(parsed)
  File "/dbt-core/core/dbt/main.py", line 238, in run_from_args
    results = task.run()
  File "/dbt-core/core/dbt/task/runnable.py", line 451, in run
    self._runtime_initialize()
  File "/dbt-core/core/dbt/task/runnable.py", line 159, in _runtime_initialize
    super()._runtime_initialize()
  File "/dbt-core/core/dbt/task/runnable.py", line 92, in _runtime_initialize
    self.load_manifest()
  File "/dbt-core/core/dbt/task/runnable.py", line 81, in load_manifest
    self.manifest = ManifestLoader.get_full_manifest(self.config)
  File "/dbt-core/core/dbt/parser/manifest.py", line 218, in get_full_manifest
    manifest = loader.load()
  File "/dbt-core/core/dbt/parser/manifest.py", line 365, in load
    self.parse_project(
  File "/dbt-core/core/dbt/parser/manifest.py", line 467, in parse_project
    parser.parse_file(block, dct=dct)
  File "/dbt-core/core/dbt/parser/schemas.py", line 504, in parse_file
    self.parse_tests(test_block)
  File "/dbt-core/core/dbt/parser/schemas.py", line 483, in parse_tests
    self.parse_test(block, test, None)
  File "/dbt-core/core/dbt/parser/schemas.py", line 476, in parse_test
    self.parse_node(block)
  File "/dbt-core/core/dbt/parser/schemas.py", line 417, in parse_node
    node = self._parse_generic_test(
  File "/dbt-core/core/dbt/parser/schemas.py", line 345, in _parse_generic_test
    self.render_test_update(node, config, builder, schema_file_id)
  File "/dbt-core/core/dbt/parser/schemas.py", line 404, in render_test_update
    get_rendered(node.raw_sql, context, node, capture_macros=True)
  File "/dbt-core/core/dbt/clients/jinja.py", line 574, in get_rendered
    return render_template(template, ctx, node)
  File "/dbt-core/core/dbt/clients/jinja.py", line 529, in render_template
    return template.render(ctx)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/environment.py", line 1090, in render
    self.environment.handle_exception()
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/environment.py", line 832, in handle_exception
    reraise(*rewrite_traceback_stack(source=source))
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/_compat.py", line 28, in reraise
    raise value.with_traceback(tb)
  File "<template>", line 1, in top-level template code
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 324, in __call__
    return self.call_macro(*args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 251, in call_macro
    return macro(*args, **kwargs)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 2, in template
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 324, in __call__
    return self.call_macro(*args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 251, in call_macro
    return macro(*args, **kwargs)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 3, in template
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 324, in __call__
    return self.call_macro(*args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 251, in call_macro
    return macro(*args, **kwargs)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 3, in template
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 324, in __call__
    return self.call_macro(*args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 251, in call_macro
    return macro(*args, **kwargs)
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/runtime.py", line 679, in _invoke
    rv = self._func(*arguments)
  File "<template>", line 3, in template
  File "/dbt-utils/venv/lib/python3.8/site-packages/jinja2/sandbox.py", line 462, in call
    return __context.call(__obj, *args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 324, in __call__
    return self.call_macro(*args, **kwargs)
  File "/dbt-core/core/dbt/clients/jinja.py", line 253, in call_macro
    return e.value
  File "/.pyenv/versions/3.8.13/lib/python3.8/contextlib.py", line 131, in __exit__
    self.gen.throw(type, value, traceback)
  File "/dbt-core/core/dbt/clients/jinja.py", line 298, in exception_handler
    raise_compiler_error(str(e), self.macro)
  File "/dbt-core/core/dbt/exceptions.py", line 445, in raise_compiler_error
    raise CompilationException(msg, node)

@jtcohen6
Contributor

jtcohen6 commented Jun 9, 2022

@dbeatty10 Oy, this gets deep quickly. I stuck a few debuggers in a few places, and came up with strong evidence that this is indeed a circular dependency. Some printing from the end of this method:

dbt_redshift.redshift__dateadd: MacroDependsOn(macros=['macro.dbt_utils.default__dateadd'])
dbt_postgres.postgres__dateadd: MacroDependsOn(macros=[])
dbt.dateadd: MacroDependsOn(macros=['macro.dbt_redshift.redshift__dateadd'])
dbt.default__dateadd: MacroDependsOn(macros=[])
dbt_utils.dateadd: MacroDependsOn(macros=['macro.dbt_utils.default__dateadd'])
dbt_utils.default__dateadd: MacroDependsOn(macros=['macro.dbt_utils.dateadd', 'macro.dbt_redshift.redshift__dateadd'])

dbt_redshift.redshift__dateadd → macro.dbt_utils.default__dateadd → macro.dbt_redshift.redshift__dateadd

That first bit doesn't look quite right, but everything else does. I'd expect dbt_redshift.redshift__dateadd to depend on macro.dbt.default__dateadd, not macro.dbt_utils.default__dateadd!
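As a sanity check on those edges, a throwaway script (not part of the PR) can walk each macro's dependencies and test whether it can reach itself:

```python
from collections import deque

# Dependency edges copied from the printout above (the 'macro.' prefix dropped).
DEPENDS_ON = {
    "dbt_redshift.redshift__dateadd": ["dbt_utils.default__dateadd"],
    "dbt_postgres.postgres__dateadd": [],
    "dbt.dateadd": ["dbt_redshift.redshift__dateadd"],
    "dbt.default__dateadd": [],
    "dbt_utils.dateadd": ["dbt_utils.default__dateadd"],
    "dbt_utils.default__dateadd": ["dbt_utils.dateadd",
                                   "dbt_redshift.redshift__dateadd"],
}

def in_cycle(graph: dict, node: str) -> bool:
    """True if `node` is reachable from its own dependencies (i.e. a cycle)."""
    queue, seen = deque(graph.get(node, [])), set()
    while queue:
        current = queue.popleft()
        if current == node:
            return True
        if current in seen:
            continue
        seen.add(current)
        queue.extend(graph.get(current, []))
    return False

print(in_cycle(DEPENDS_ON, "dbt_redshift.redshift__dateadd"))  # True
print(in_cycle(DEPENDS_ON, "dbt.default__dateadd"))            # False
```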

Well... I think this is because we have special macro resolution rules for "internal" macros (dbt.*), since they're more eagerly overridden by end users who want to reimplement internal functionality: within get_macro, this search order is defined here (prefer dbt_utils > dbt). That explains why we only see this when we mix dispatch to an internal and non-internal package... maybe?

Ultimately, instead of returning dbt.default__dateadd, this line returns dbt_utils.default__dateadd.

We could go very deep on this logic... or we could just duplicate the dateadd logic in dbt-redshift:

{#-- redshift should use default instead of postgres --#}
{% macro redshift__dateadd(datepart, interval, from_date_or_timestamp) %}
    dateadd(
        {{ datepart }},
        {{ interval }},
        {{ from_date_or_timestamp }}
        )
{% endmacro %}

That "just works."

@dbeatty10
Contributor Author

Thank you for the explanation and examples, @jtcohen6!

As an aside

Do you interpret this as an issue? i.e., is it worth creating a ticket in dbt-core that describes this (undesirable?) behavior when mixing dispatch to an internal and non-internal package?

My next actions

As you suggested, I will copy-paste all the relevant logic into the Redshift adapter. Then all the tests should pass, and we'll be safe to merge this batch of pull requests.

@jtcohen6
Contributor

Do you interpret this as an issue? i.e., is it worth creating a ticket in dbt-core that describes this (undesirable?) behavior when mixing dispatch to an internal and non-internal package?

It's a very niche edge case, but it all adds up to some unexpected behavior. I think an issue for it is totally justified. That could be me or you; let's take it as a TODO.

@dbeatty10 dbeatty10 marked this pull request as ready for review June 17, 2022 13:54
Contributor

@jtcohen6 jtcohen6 left a comment


🚢

@dbeatty10 dbeatty10 merged commit ae6e1b4 into main Jun 17, 2022
@dbeatty10
Contributor Author

dbeatty10 commented Aug 30, 2022

Comparison, revisited

In this comment, I showed the results of a functional test comparing the competing definitions for current_timestamp() from the adapter vs. dbt-utils. At the time, that testing didn't find any differences when comparing values.

The method used to obtain this result always made me feel uneasy -- just because it didn't find any differences doesn't guarantee that there weren't any differences! 😅

Since @colin-rogers-dbt is digging into an implementation for dbt-labs/dbt-core#5521, I revisited this comparison and took a different approach.

Approach

  1. Get the rendered SQL for both adapter vs. dbt-utils for each adapter
  2. Manually look up the vendor-specific documentation for relevant functions to determine the final data type
  3. Translate the final data type to its semantic data type

Semantic data types

Snowflake has 3 different timestamp data types that are (generally*) representative of the observed behaviors across different databases: TIMESTAMP_TZ, TIMESTAMP_NTZ, and TIMESTAMP_LTZ.

At the risk of further confusion (and introducing a 15th competing standard), I will use the following nomenclature instead of the Snowflake-specific counterparts:

  • TIMESTAMP_TZ → "offset timestamp" (offset-included absolute point in time)
  • TIMESTAMP_NTZ → "civil timestamp" (non-absolute point in time; "wallclock" time)
  • TIMESTAMP_LTZ → "session timestamp" (non-offset absolute point in time)
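As an analogy only (this says nothing about how any warehouse actually stores these types), the three categories map loosely onto Python datetime values:

```python
from datetime import datetime, timezone, timedelta

instant = datetime(2022, 8, 30, 12, 0, tzinfo=timezone.utc)  # one absolute point in time

# "Offset timestamp": the value carries its own UTC offset.
offset_ts = instant.astimezone(timezone(timedelta(hours=-6)))

# "Civil timestamp": wallclock time with no offset; not an absolute instant.
civil_ts = instant.replace(tzinfo=None)

# "Session timestamp": an absolute instant; display zone is a session setting
# (the +02:00 zone here is an arbitrary example).
session_view = instant.astimezone(timezone(timedelta(hours=2)))

print(offset_ts.isoformat())    # 2022-08-30T06:00:00-06:00
print(civil_ts.isoformat())     # 2022-08-30T12:00:00
print(session_view == instant)  # True: same instant, different rendering
```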

Results

[screenshot: table of rendered SQL and resulting timestamp data type per adapter, for both implementations]

Takeaways

  • ❌ There appear to be data type mismatches between dbt-utils and dbt-core+adapters for Snowflake and Postgres
    • ✅ Meanwhile, Redshift, BigQuery, and Spark appear to have matching data types between implementations
  • ⚠️ The implementations across dbt-utils use a mix of 2 timestamp data types
    • Snowflake, Postgres, and Redshift all use civil timestamps, whereas BigQuery* and Spark use session timestamps
  • ⚠️ The implementations across dbt-core+adapters also use a mix of all 3 timestamp data types
    • Redshift uses civil timestamps, Snowflake uses offset timestamps, and Postgres, BigQuery*, and Spark use session timestamps

The best way to verify these results intra-database would be to materialize tables with the results and then query the data types.

Being able to automatically discover the semantics of different data types across databases (based on behaviors) will require a bit more ingenuity.


* From the best I can tell, there are 3 different alternatives for how to think about the TIMESTAMP data type in BigQuery:

  1. "session timestamp" where UTC is the session default, but it can be changed (👈 maybe this is the correct way to think about it? This seems to lend credence in this direction.)
  2. "offset timestamp" where the offset is always UTC (what I've chosen here, somewhat arbitrarily)
  3. A unique fourth timestamp data type

Would love to be shown the light by anyone that knows for certain.

Regardless of which of the three is most accurate, it doesn't affect the takeaways above.

Edit: session timestamp seems the most likely for BigQuery since it "represents an absolute point in time, independent of any time zone or convention such as Daylight Savings Time". That implies that it doesn't store any UTC offset the way that offset timestamps do. Civil timestamps don't represent absolute points in time, so it isn't that either. Of the 3 types outlined above, that leaves only session timestamp remaining, which is the non-offset option for absolute point in time in the ontology above.

Updated the categorization of BigQuery in the "Results" section above accordingly.
