
More efficient catalog generation #93

Closed
jtcohen6 opened this issue May 29, 2020 · 5 comments · Fixed by #160
Labels
enhancement New feature or request good_first_issue Good for newcomers

Comments

@jtcohen6
Contributor

Picks up from #49

Background

Currently, _get_one_catalog runs show tblproperties and describe extended on every single relation, one by one. As a result, dbt docs generate can take several minutes to run.

All the same information is available at a schema level by running show table extended in [schema] like '*', which dbt-spark knows as the LIST_RELATIONS_MACRO_NAME.
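
For illustration, a minimal sketch of the difference; `run_sql`, the relation list, and the function names are placeholders, not dbt-spark's actual internals:

```python
# Hypothetical sketch contrasting the two approaches; `run_sql` and the
# function names are placeholders, not dbt-spark's actual API.

def catalog_per_relation(run_sql, relations):
    # Current behavior: two statements per relation, i.e. 2 * N round trips.
    results = []
    for relation in relations:
        results.append(run_sql(f"show tblproperties {relation}"))
        results.append(run_sql(f"describe extended {relation}"))
    return results


def catalog_per_schema(run_sql, schema):
    # Proposed behavior: one statement per schema, parsed client-side.
    return run_sql(f"show table extended in {schema} like '*'")
```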

Challenge

The tricky part is the formatting: the result of show table extended... is structured, but strange:

Database: my_db
Table: view_model
Owner: root
Created Time: Wed Jan 29 01:58:46 UTC 2020
Last Access: Thu Jan 01 00:00:00 UTC 1970
Created By: Spark 2.4.4
Type: VIEW
View Text: select * from my_db.seed
View Default Database: default
View Query Output Columns: [id, first_name, last_name, email, gender, ip_address]
Table Properties: [transient_lastDdlTime=1580263126, view.query.out.col.3=email, view.query.out.col.0=id, view.query.out.numCols=6, view.query.out.col.4=gender, view.default.database=default, view.query.out.col.1=first_name, view.query.out.col.5=ip_address, view.query.out.col.2=last_name]
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [serialization.format=1]
Schema: root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ip_address: string (nullable = true)

As @aaronsteers helpfully found, we could use this regex string to parse out all the info we need from the result above:

\|-- (.*): (.*) \(nullable = (.*)\b

This will require changes to _get_columns_for_catalog, and possibly to parse_describe_extended as well.
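
For reference, a minimal sketch (plain Python, not dbt-spark's actual implementation) of how that regex could pull column names, types, and nullability out of the `Schema:` section shown above:

```python
import re

# Regex suggested above, applied to the `Schema:` section of
# `show table extended in [schema] like '*'` output.
SCHEMA_LINE_RE = re.compile(r"\|-- (.*): (.*) \(nullable = (.*)\b")

raw = """Schema: root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- email: string (nullable = true)
"""

columns = [
    {"name": name, "type": dtype, "nullable": nullable == "true"}
    for name, dtype, nullable in SCHEMA_LINE_RE.findall(raw)
]
print(columns)
# [{'name': 'id', 'type': 'long', 'nullable': True}, ...]
```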

Alternatives

  • Open a PR against Apache Spark to add an information schema :)

Who will this benefit?

Faster docs generation for everybody!

@jtcohen6 jtcohen6 added enhancement New feature or request good_first_issue Good for newcomers labels May 29, 2020
@jtcohen6
Contributor Author

This would also resolve the fact that Statistics are currently missing in docs for Delta tables. They're returned by describe detail instead of describe extended (docs); regardless, they do appear in the result of show table extended like '*'.

@felippecaso

felippecaso commented Mar 4, 2021

Adding to the discussion here, show table extended can be very costly depending on the account's Spark architecture.

The instance I'm currently working with has a single data_lake schema containing 1000+ tables; even though my dbt project uses only 4 sources from that schema, the command pulls metadata for every table.

A simple dbt docs generate takes 1 hour to run 😰

From what I understood, the challenge here would be to change dbt-spark's behavior to query describe extended {{ table }} instead of show table extended, is that right?

@jtcohen6
Contributor Author

From what I understood, the challenge here would be to change dbt-spark's behavior to query describe extended {{ table }} instead of show table extended, is that right?

@felippecaso Ah no, just the reverse: We want to run just one show table extended in [schema] like '*' and use that to populate information about all catalog objects, rather than needing to run one describe extended for every object.

I don't know if there's anything we can do in the case where show table extended in [schema] like '*' takes an hour to run! It's pretty standard database behavior to ask for that kind of metadata. Is there anything obvious we're missing? Any chance this is faster with a bigger cluster?

@felippecaso

Hey @jtcohen6,
There was indeed a leap in performance once I understood how threading works in dbt executions, haha.

Still, my point is that show table extended in <schema> can process many more tables than it needs to. For instance, our database currently has those 1000+ tables in a single data_lake schema, and no more than 50 of them are consumed by our dbt project.

Those 50 are the only tables dbt needs metadata for, but show table extended in <schema> pulls from all 1000+, spending unnecessary time on execution. Does that make sense?

As always, I may also be missing something =)

@jtcohen6
Contributor Author

jtcohen6 commented Apr 14, 2021

@felippecaso I don't think you're missing anything, and you've got a really good point. It's slow for dbt to run a separate describe table query for every single object it needs to know about (50 times in your case). It's also slow to run show table extended in <schema> like '*' against schemas with lots of objects in them (1000 objects, of which you only need 50). It's sort of a lose-lose.

I can imagine a future state of dbt, where it's able to marry the relation cache built at the start of the run with the catalog built at the end of the run, and actually update the former over the course of each model execution to produce the latter. In that future state, running describe table once after each model is created makes a ton of sense. In the current state of the world, however, dbt builds the relation cache + catalog separately, and each in one fell swoop.

For my part, on other databases / warehouses / query engines, dbt expects to be able to run metadata queries by limiting its scope to the database/schema of interest. It's pretty reasonable to expect that returning metadata queries on a schema with 1000 objects should actually be fairly quick... so I'm inclined to see this as a limitation of Apache Spark, more than anything.

Is this behavior we need to make configurable? Could we offer an optional/configurable wildcard, in case (e.g.) all the tables in your source schema share a common prefix/suffix? Are there other secret SparkSQL approaches I don't know about (yet)?
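
To make the wildcard idea concrete, here's a hypothetical sketch; `wildcard` is an imagined config value (it doesn't exist in dbt-spark today), and the function is illustrative only:

```python
# Hypothetical sketch of the "configurable wildcard" idea above.
# `wildcard` is an imagined config value, not an existing dbt-spark setting.
def list_relations_sql(schema: str, wildcard: str = "*") -> str:
    # If all source tables share a prefix like 'dbt_', only matching tables
    # would be scanned instead of every object in the schema.
    return f"show table extended in {schema} like '{wildcard}'"

print(list_relations_sql("data_lake", "dbt_*"))
# show table extended in data_lake like 'dbt_*'
```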
