Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

✨ restructure deps command with package-lock.yml #6735

Closed
wants to merge 13 commits into from

Conversation

justbldwn
Copy link
Contributor

@justbldwn justbldwn commented Jan 25, 2023

resolves #6643

Description

this PR restructures the dbt deps command, aiming to add a few new key features:

  • create a new package-lock.yml file that will keep record of all packages installed during the last dbt deps run
  • use newly created package-lock.yml as the default file to install packages from, which helps reduce install time
  • adds 2 new dbt deps subcommands: dbt deps lock and dbt deps add
  • refresh dbt_packages/ directory each time dbt deps is run (no more persisting old packages left in the directory)

the below will provide an overview of the changes, what they accomplish, and some a lot of detail on implementation and examples. sorry in advance for the length! 😅


dbt deps lock

  1. the new lock command will take the packages in packages.yml and list them and their transitive dependencies in a new package-lock.yml file.
  2. package versions should be resolved for each dependency, else, error thrown.
  3. package-lock.yml will get written to the project_root alongside packages.yml.

example command:

dbt deps lock

dbt deps add

  1. allows a user to add a package to their packages.yml file with configurable version and location of a given package.
  2. added a --dry-run flag to the dbt deps add sub-command that, when False, will re-create the lock file after a new package is added to the packages.yml file. (this is setup as False by default in argparse, a user can use the flag and set to True to not persist changes down to the package-lock.yml
  3. overwrite package info in packages.yml if a package name already exists (either explicit or fuzzy name matching)
  4. added a quality check to ensure a version is passed in whenever the source != "local"

example command:

# add package from hub (--source arg defaults to "hub")
dbt deps add --package dbt-labs/dbt_utils --version 1.0.0

# add package from hub with semantic version
dbt deps add --package dbt-labs/snowplow --version ">=0.7.0,<0.8.0"

# add package from git
dbt deps add --package https://github.com/fivetran/dbt_amplitude --version v0.3.0 --source git 

# add package from local (--version not required for local)
dbt deps add --package /opt/dbt/redshift --source local

# add package to packages.yml WITHOUT updating package-lock.yml
dbt deps add --package dbt-labs/dbt_utils --version 1.0.0 --dry-run True

dbt deps 🚩(this is now changed as of 2/10/2023, changes noted in strikethrough) 🚩

1. largely remains unchanged in order to stay backward compatible

  1. given the recent changes to migrate away from argparse -> click, i had to make changes to the dbt deps command. when working with click, i couldn't get dbt deps to be a @cli.group(), while also allowing it to run as an isolated command. i tried using invoke_without_command=True in the @cli.group() decorator, but still didn't get it to work. so for now, i changed dbt deps to be an isolated @cli.group() and added a new @deps.command("install") to handle the old dbt deps functionality (this mimics how dbt docs and dbt source are setup.

key takeaway: dbt deps has been changed to dbt deps install and should be treated as a breaking change. happy to discuss this further!

  1. when a user runs dbt deps dbt deps install, it will check immediately to see if a package-lock.yml file exists in the project root directory. if not, it will create one based on the packages.yml in the project root directory.
  2. package-lock.yml now becomes the default for installing packages.

example command:

# install packages
dbt deps install

# shows help page for dbt deps sub-commands
dbt deps

Examples:

example packages.yml:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - git: "https://github.com/google/fhir-dbt-analytics"
    revision: 386e8430a4b9735466c90c3313ae67c31dbef58f
  - git: "https://github.com/elementary-data/dbt-data-reliability"
    revision: 0.6.11

example package-lock.yml

packages:
- package: dbt-labs/dbt_utils
  version: 0.9.2
- git: https://github.com/google/fhir-dbt-analytics
  revision: 386e8430a4b9735466c90c3313ae67c31dbef58f
- git: https://github.com/elementary-data/dbt-data-reliability
  revision: 0.6.11
- package: calogica/dbt_expectations
  version: 0.7.0
- package: yu-iskw/dbt_unittest
  version: 0.3.1
- package: calogica/dbt_date
  version: 0.6.3

⚠️ Some advice/clarification needed:

first, i believe the breaking change from dbt deps to dbt deps install should warrant discussion on this PR. notes are in the above section titled dbt deps

additionally, there were 2 notable strange yaml formatting issues i ran into with this when outputting data to yaml files:

number 1:
using yaml.dump() or yaml.safe_dump() when writing package/lock results formatted slightly differently than what's in a packages.yml and advertised on the dbt website. i tried indent formatting but nothing I tried seemed to be able to replicate like what's above in the packages.yml vs. package-lock.yml examples.

number 2:
another strange one that i couldn't figure out with yaml formatting was how to get the semantic versioning to display like it's advertised on the dbt website. for example:

online:

packages:
  - package: dbt-labs/dbt_utils
    version: 0.9.2
  - package: dbt-labs/snowplow
    version: [">=0.7.0", "<0.8.0"]

dbt deps add

packages:
- package: dbt-labs/dbt_utils
  version: 0.9.2
- package: dbt-labs/snowplow
  version:
  - '>=0.7.0'
  - <0.8.0

if i use the default_flow_style=None argument in yaml.safe_dump(), we can fix how lists are displayed in the yaml file, but it then messes up how dicts are displayed....

packages:
- {package: dbt-labs/dbt_utils, version: 0.9.2}
- package: dbt-labs/snowplow
  version: ['>=0.7.0', <0.8.0]

functionally, neither of these should impact programmatic reading of the yaml data, but it does make things a touch different than what's in the examples online/the expectations you hope for with file formatting.

Conclusion:

I think that covers nearly everything! please don't hesitate to ask me any questions on this. i'm really looking forward to your review and working on any changes that may be needed. thank you!

Checklist

@cla-bot cla-bot bot added the cla:yes label Jan 25, 2023
@justbldwn justbldwn changed the title ✨ adding installed_packages.json functionality ✨ restructure deps command with package-lock.yml Feb 10, 2023
@justbldwn justbldwn marked this pull request as ready for review February 10, 2023 17:14
@justbldwn justbldwn requested a review from a team as a code owner February 10, 2023 17:14
@justbldwn justbldwn requested a review from a team February 10, 2023 17:14
@justbldwn justbldwn requested review from a team as code owners February 10, 2023 17:14
@@ -289,7 +289,13 @@ def debug(ctx, **kwargs):


# dbt deps
@cli.command("deps")
@cli.group()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured out why invoke_without_command=True doesn't work for this case. Everything in click works, but in Flags we did some not ideal lookup here that made the assumption that a function name should always match the last part of input command. I will create a follow up ticket for that but you will find that if you change function name deps_install and deps_lock below to just install and lock, you should be able to run without any issue. One thing that I didn't check is whether the params at group level is correctly represented in Flags.

Copy link
Contributor

@ChenyuLInx ChenyuLInx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @justbldwn on this amazing PR! And sorry for getting back to you late! This makes a lot of sense to me, left some comments questions to see what you think of them!

@@ -94,13 +94,14 @@ def _load_yaml(path):
return load_yaml_text(contents)


def package_data_from_root(project_root):
package_filepath = resolve_path_from_base("packages.yml", project_root)
def package_data_from_root(project_root, package_file_name="packages.yml"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking at the update and how this is being used it seems like just more of a function to load a yaml file vs anything related to package data. Maybe consider not have the default value for package_file_name and just rename the function completely?

class DepsTask(BaseTask):
def __init__(self, args: Any, project: Project):
move_to_nearest_project_dir(project.project_root)
super().__init__(args=args, config=None, project=project)
self.cli_vars = args.vars

if not system.path_exists(f"{self.project.project_root}/package-lock.yml"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If ja lock file already exists, I updated my packages.yml, and run dbt deps, do we update the lock yaml somehow?

@@ -55,39 +96,149 @@ def track_package_install(
)

def run(self) -> None:
if system.path_exists(self.project.packages_install_path):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think this could cause any surprise behavior that's different from how deps works today?

return

with downloads_directory():
resolved_deps = resolve_packages(packages, self.project, self.cli_vars)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on where those packages lives, this might actually pull down a git repo right? Just to understand what's happening, we will pull it down, and remove it in the function above that has comment on it, and pull it down again when install? Did we always pull down git repos twice before?

fire_event(
DepsLockUpdating(lock_filepath=f"{self.project.project_root}/package-lock.yml")
)
LockTask(self.args, self.project).run()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Make sure I understand correctly, both deps install(when no lock file added before) and deps add will run LockTask and update lock yaml. Have you considered if it is not dry run deps add would just also install the package? I am wondering do you think it would be more natural to have deps add just install the newly added package. CC @dbeatty10

@ChenyuLInx
Copy link
Contributor

In terms of the yaml write I found this answer works.

@ChenyuLInx
Copy link
Contributor

@justbldwn are you still interested in working on this feature? Otherwise we can help bring it across the finish line.

@justbldwn
Copy link
Contributor Author

justbldwn commented Aug 1, 2023

@justbldwn are you still interested in working on this feature? Otherwise we can help bring it across the finish line.

@ChenyuLInx so sorry for the delay in responding! i may be able to help in the coming weeks, if you have someone that takes over and you need my assistance, too, i'm happy to help where possible.

@ChenyuLInx
Copy link
Contributor

@justbldwn no worries at all! Thank you so much for the contribution! Deps has been something that didn't got much improvement over the years and we are excited to see changes to it.

I am just back to pick this up and working on getting it into dbt-core. I just merged the main branch into your branch and pushed it to branch deps_lock_by_justbldwn. I will likely working on getting everything ready to go on that branch over the next few days. Depending on how you would like to collaborate and your schedule I am sure we can figure something out. Either merge that branch to your's or just open another PR with that branch. Let me know if there's a way you would prefer to do this.

@dbeatty10
Copy link
Contributor

This PR was continued in #8408 and ultimately merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[CT-1845] [Feature] Write a file of installed versions during dbt deps
3 participants