Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parse information returned by list_relations_without_caching macro to speed up catalog generation #160

Merged
merged 10 commits into from
Apr 17, 2021

Conversation

franloza
Copy link

@franloza franloza commented Apr 13, 2021

Resolves #93

Description

Use regular expressions to parse the information retrieved by show table extended in [schema] like '*'. In this way, it's no longer necessary to run show tblproperties and describe extended on every single relation.

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@cla-bot cla-bot bot added the cla:yes label Apr 13, 2021
@franloza franloza changed the title WIP: Parse information returned by show table extended to speed up catalog generation Parse information returned by list_relations_without_caching macro to speed up catalog generation Apr 14, 2021
@franloza
Copy link
Author

@jtcohen6 I think this is ready for review 👍

Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@franloza This is incredible! Thank you so much for the contribution.

The changes worked wonderfully for me locally, and docs generate was so much quicker than all those pesky describe extended. I have one ask... if you wanted to take this a step further, and include stats as well... if not no worries! You've done a lot as it is :)

Comment on lines +64 to +66
INFORMATION_COLUMNS_REGEX = re.compile(
r"\|-- (.*): (.*) \(nullable = (.*)\b", re.MULTILINE)
INFORMATION_OWNER_REGEX = re.compile(r"^Owner: (.*)$", re.MULTILINE)
Copy link
Contributor

@jtcohen6 jtcohen6 Apr 14, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we're here... any chance I could convince you to regex for stats as well? :)

INFORMATION_STATISTICS_REGEX = re.compile('^Statistics: (.*)$", re.MULTILINE)

From some (anecdotal) testing, delta tables include a line like Statistics: 1109049927 bytes, whereas (e.g.) parquet tables include a line like Statistics: 1109049927 bytes, 14093476 rows.

The SparkColumn object takes a table_stats argument; we'd just need to adjust convert_table_stats to handle the delta table case, which is missing the rows bit after the comma split.

Alternatively, if you wanted to do it all in regex, I guess you could:

INFORMATION_STATISTICS_REGEX = re.compile('^Statistics: ([0-9]+) bytes(, ([0-9]+) rows)?', re.MULTILINE)
bytes, _, rows = re.findall(self.INFORMATION_OWNER_REGEX, relation.information)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jtcohen6 This is a great idea! This is useful information and I think it's worth trying to parse it. I'll give it a try and add an extra test for parquet tables. I'll let you know when I have this implemented. (Switching back to WIP)

@franloza franloza changed the title Parse information returned by list_relations_without_caching macro to speed up catalog generation WIP: Parse information returned by list_relations_without_caching macro to speed up catalog generation Apr 15, 2021
@franloza
Copy link
Author

@jtcohen6 I'm getting an authentication error in CircleCI:

Build-agent version 1.0.55761-e64d5704 (2021-04-13T14:06:42+0000)
System information:
 Server Version: 19.03.13
 Storage Driver: overlay2
  Backing Filesystem: xfs
 Cgroup Driver: cgroupfs
 Kernel Version: 4.15.0-1092-aws
 Operating System: Ubuntu 18.04.5 LTS
 OSType: linux
 Architecture: x86_64

Starting container fishtownanalytics/test-container:10
Warning: No authentication provided, using CircleCI credentials for pulls from Docker Hub.
  image cache not found on this host, downloading fishtownanalytics/test-container:10

Error response from daemon: unauthorized: authentication required

Would you mind to rerun the job after the problem has been solved? I don't have permission for that.

@jtcohen6
Copy link
Contributor

@franloza Just re-ran! I think DockerHub was down momentarily, causing the error above. Now looks like a linting error.

@franloza franloza changed the title WIP: Parse information returned by list_relations_without_caching macro to speed up catalog generation Parse information returned by list_relations_without_caching macro to speed up catalog generation Apr 15, 2021
@franloza
Copy link
Author

@jtcohen6 Thanks for re-running!

we'd just need to adjust convert_table_stats to handle the delta table case, which is missing the rows bit after the comma split.

I didn't have to modify this function to use it in the new parsing method. I don't know if I missed something but I included a couple of unit tests to verify that this function was working as expected.

This PR is ready for a re-review 😃

@franloza franloza requested a review from jtcohen6 April 15, 2021 21:34
Copy link
Contributor

@jtcohen6 jtcohen6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for this contribution @franloza! LGTM :)

@jtcohen6 jtcohen6 merged commit 27409c4 into dbt-labs:master Apr 17, 2021
@franloza franloza deleted the feature/93-catalog-generation branch April 19, 2021 06:39
@mmeasic
Copy link

mmeasic commented May 4, 2021

This is great work @franloza, thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

More efficient catalog generation
3 participants