Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Spike] Model versions parsing #7204

Closed
wants to merge 7 commits into from
Closed

[Spike] Model versions parsing #7204

wants to merge 7 commits into from

Conversation

MichelleArk
Copy link
Contributor

@MichelleArk MichelleArk commented Mar 21, 2023

resolves #6866

Description

Implements most of the spec for versions described from the discussion: #6736 (comment)

Example spec:

models:
  - name: dim_customers
    latest_version: 3
    config:
      contract: true
      materialized: table
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: float
      - name: country_name
        description: Where this customer lives
        data_type: string
    versions:
      - name: 2
      - name: 1
        config:
          alias: dim_customers
          deprecation_date: '2022-01-01'
        columns:
          - include: * 
            exclude: customer_id
          - name: customer_id
            data_type: int
      - name: 3
        defined_in: my_new_customers_definition # matches SQL or python file name
        columns:
          - include: *
            exclude: 
              - country_name 

# Manifest representation
# model.project_name.dim_customers.v1
# * name: dim_customers
# * version: 1
# * path: dim_customers_v1.sql
# * alias: dim_customers

# model.project_name.dim_customers.v2
# * name: dim_customers
# * version: 2
# * path: dim_customers_v2.sql
# * alias: dim_customers_v2

# model.project_name.dim_customers.v3
# * name: dim_customers
# * version: 3
# * path: my_new_customers_definition.sql
# * alias: dim_customers_v3

@cla-bot cla-bot bot added the cla:yes label Mar 21, 2023
@github-actions
Copy link
Contributor

Thank you for your pull request! We could not find a changelog entry for this change. For details on how to document a change, see the contributing guide.

@MichelleArk MichelleArk changed the title versions parsing - first pass. columns + refs don't work Model versions parsing Mar 27, 2023
def _repack_args(self, name: str, package: Optional[str], version: Optional[str]) -> List[str]:
repacked_args = [name] if package is None else [package, name]
if version:
repacked_args.append(f"version:{version}")
Copy link
Contributor Author

@MichelleArk MichelleArk Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opted to repack the version arg as a formatted string to keep the structure of model.refs as List[List[str]] (result of this method is appended to model.refs here.

Leads to some not-so-fun string parsing when refs are processed.

Could also add it as a dictionary, which would hold all kwargs (if we ever add more) going forward. This would be a breaking change in the manifest (model.refs becomes List[List[Union[str, Dict[str,str]]]], but would be more straightforward (somewhat - will need to do some type checking) for consumers to parse when processing refs.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My first reaction is to go with the dictionary. I think consumers are going to have to either do the same not-so-fun string parsing or checking for a dictionary anyway. I'm not sure how often external consumers would be processing the refs anyway; I think mostly they rely on us providing the references in the depends_on. @jtcohen6 might have a better idea about that than me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's go with the dictionary. IMO:

  • consumers of the manifest should use depends_on.nodes, rather than refs, in almost all cases
  • the list-of-lists always felt a bit janky

This may require some changes in some of the weirder packages out there, but in a way that should make their lives easier, ultimately

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good to go with a dictionary as well!

What I'd originally thought up in terms of representation in the manifest.json for model.refs was a List[List[Union[str, Dict[str,str]]]]. For example,

"refs": ["package_name", "model_name", {"version":"3", ...}]

But what might be best is actually just a List[Dict[str, str]]. For example,

"refs": {
"package": "package_name",  # what should we actually call this key..?
"model": "model_name",
"version": "version_name"
}

That would enable us to simplify how we parse ref here as well.

Copy link
Contributor

@jtcohen6 jtcohen6 Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what should we actually call this key..?

We're pretty fast & loose with the package/project distinction. I think this is called package_name in the manifest currently:

"model.test.dim_customers": {
    "resource_type": "model",
    "name": "dim_customers",
    "package_name": "test",
    ...
}

And... it will continue to mean either/both things, since we'll support resolving references to a model in another package (status quo) and another project (via cross-project ref)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for List[Dict[str, str]] too. It's always a bit messy to have two different types. We're not allowing versioned sources, right? Sources currently List[List[str]] but always have two strings in the list, whereas refs only have one. Eventually it might be nice to switch all of those things to dictionaries.

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're not allowing versioned sources, right?

right - agree it'd be nice to switch them all to dictionaries. I can make an issue for sources, perhaps we can tackle it sa follow-on tech debt work.


# refables are actually unique, so the Dict[PackageName, UniqueID] will
# only ever have exactly one value, but doing 3 dict lookups instead of 1
# is not a big deal at all and retains consistency
def __init__(self, manifest: "Manifest"):
self.storage: Dict[str, Dict[PackageName, UniqueID]] = {}
self.storage: Dict[str, Dict[PackageName, List[UniqueID]]] = {}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref lookups now need to handle 3 access patterns (package handling remains unchanged):

  1. Lookup versioned model by name and version => lookup specified version
  2. Lookup versioned model by name without version => get the latest version
  3. Lookup unversioned model by name only => get only 'version'

For example, a manifest with the following nodes:

"model.jaffle_shop.dim_unversioned": {
  "name": "dim_unversioned",
  "version": null, 
  "is_latest_version": null
},
"model.jaffle_shop.dim_customers.v1": {
  "name": "dim_customers",
  "version": "1", 
  "is_latest_version": false
},
"model.jaffle_shop.dim_customers.v2": {
  "name": "dim_customers",
  "version": "2", 
  "is_latest_version": true
}

should find unique_ids for the following refs via RefableLookup.get_unique_id:

ref("dim_unversioned") => "model.jaffle_shop.dim_unversioned"
ref("dim_customers", version="1") => "model.jaffle_shop.dim_customers.v1"
ref("dim_customers") => "model.jaffle_shop.dim_customers.v2"

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I considered a few options for the storage structure here:

  • Dict[str, Dict[PackageName, List[UniqueID]]]
    • "latest" unique_id stored at the front of list
    • first element of UniqueID list always returned, unless a version is specified - in which case iterating through the list is necessary.
    • implemented in this draft
{
"dim_unversioned": {
    "jaffle_shop": ["model.jaffle_shop.dim_unversioned"]
  }, "dim_customers": {
    "jaffle_shop": ["model.jaffle_shop.dim_customers.v2", "model.jaffle_shop.dim_customers.v1"]
  }
}
  • Dict[str, Dict[PackageName, Dict[VersionName, UniqueID]]]
    • doesn't handle unversioned or "latest" versioned lookups models nicely
{ 
"dim_unversioned": {
  "jaffle_shop": {
    "??": "model.jaffle_shop.dim_unversioned"
  },
},
"dim_customers": {
  "jaffle_shop": {
      "2": "model.jaffle_shop.dim_customers.v2", 
      "1": "model.jaffle_shop.dim_customers.v1"
    }
  }
}
  • Dict[str, Dict[PackageName, UniqueID]]
    • doesn't handle "latest" versioned lookups models nicely
{ 
"dim_unversioned": {
  "jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
}

Open to feedback and suggestions here!

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the third option, we could store an additional key:value pair for each versioned model that maps the model name to its latest unique id. For example,

{ 
"dim_unversioned": {
  "jaffle_shop": model.jaffle_shop.dim_unversioned"
},
"dim_customers.v1": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v1"
},
"dim_customers.v2": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v2"
},
"dim_customers": {
  "jaffle_shop": "model.jaffle_shop.dim_customers.v2"
}
}

This has the benefits of:

  • keeping the storage data structure + lookup logic consistent across different lookup implementations
  • mapping closely with the access patterns => easier to reason about + constant-time lookup for all access patterns

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that the ref_lookup code is going to have to handle external references eventually too, right? Not yet sure how that would impact these design choices.

I am a bit wary of having "latest" only implied by the order of versions in the first option. That feels a bit fragile to me. It doesn't feel like checking the versions would be that much overhead. And it's always going to be numerical order, so the highest integer is always latest, right?

Copy link
Contributor Author

@MichelleArk MichelleArk Mar 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am a bit wary of having "latest" only implied by the order of versions in the first option. That feels a bit fragile to me.

Same here. I'm leaning to the structure proposed above for those reasons.

And it's always going to be numerical order, so the highest integer is always latest, right?

This is something I haven't put in much polish on for the spike yet but - but we'll be accepting strings for the version identifier, as well as a model-level latest_version. If the latest_version is provided - then the latest is that value, regardless of what the highest version is (to support the concept of prereleases).

If latest_version is not provided, dbt will compute a default latest version by attempting to order the version identifiers numerically if possible (casting to int/float), and fallback to alphanumeric ordering. This is to support a variety of versioning conventions out of the box, even if our recommendation is to use major versioning only.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think the latest proposed structure is appealing.

access=versioned_model_node.access,
version=unparsed_version.name,
is_latest_version=latest_version == unparsed_version.name,
)
Copy link
Contributor Author

@MichelleArk MichelleArk Mar 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some of these fields need a closer inspection - but the gist of it is that the versioned model node should inherit properties and configs from the base node patch (where the versions block is configured), extending or overriding them as appropriate.

is_latest_version=latest_version == unparsed_version.name,
)
versioned_model_node.patch(versioned_model_patch)
self.patch_node_config(versioned_model_node, versioned_model_patch)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

alias gets updated here

versioned_model_node.patch(versioned_model_patch)
self.patch_node_config(versioned_model_node, versioned_model_patch)
self.manifest.rebuild_ref_lookup() # TODO: is this necessary?
return
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't handle parsing any tests defined on versions within the versions block yet

config=unparsed_version.config,
access=versioned_model_node.access,
version=unparsed_version.name,
is_latest_version=latest_version == unparsed_version.name,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

opted to have an is_latest_version: bool attribute on the ModelNode instead of the raw latest_version: str attribute from the 'canonical' node patch as ref lookups need to know whether a given node is the latest version (logic here)

Could alternatively keep latest_version: str on the ModelNode and add an is_latest_version property if we'd like to keep the ModelNode mapping more closely to the patch.


# patch versioned nodes
versions = block.target.versions
if versions:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the meat and potatoes of the change. Currently it's in NonSourceParser, which is subclassed by TestablePatchParser for parsing models, seeds, and snapshots. Given the increasing number of model-specific patching that's emerged as part of the 'Models as APIs' work (access, constraints, now versions) - I'm thinking this could be split out into a new subclass for parsing models patches independently of seeds and snapshots. Perhaps simply ModelPatchParser. What do you think @gshank, cc @peterallenwebb?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've never been enthusiastic about having all that code in common. I understand why it was that way to start with, since originally there were no differences, but it provides friction when we need and want to handle differences. I'd be in favor of having them all subclasses, even if some of them were almost duplicates to start with.

)
versioned_model_node.patch(versioned_model_patch)
self.patch_node_config(versioned_model_node, versioned_model_patch)
self.manifest.rebuild_ref_lookup() # TODO: is this necessary?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wouldn't think you would need to rebuild the whole thing. There is a self.manifest.ref_lookup.add_node(...) method if you need to add the versioned nodes. It's not clear to me whether that would be necessary or not...

@MichelleArk MichelleArk changed the title Model versions parsing [Spike] Model versions parsing Mar 30, 2023
@MichelleArk
Copy link
Contributor Author

Got answers to how we'll implement RefLookup storage to handle versioned look ups, as well as more clarity in the spec from this spike. Closing now in favor of the new implementation ticket: #7263

@MichelleArk MichelleArk closed this Apr 3, 2023
@MichelleArk MichelleArk mentioned this pull request Apr 12, 2023
10 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-2035] [Spike] Parsing: model ‘version’ attribute
3 participants