
More efficient catalog generation #93

Closed
jtcohen6 opened this issue May 29, 2020 · 5 comments · Fixed by #160
Labels
enhancement New feature or request good_first_issue Good for newcomers

Comments

@jtcohen6
Contributor

Picks up from #49

Background

Currently, _get_one_catalog runs show tblproperties and describe extended on every single relation, one by one. As a result, dbt docs generate can take several minutes to run.

All the same information is available at a schema level by running show table extended in [schema] like '*', which dbt-spark knows as the LIST_RELATIONS_MACRO_NAME.
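
For illustration, a minimal sketch of the difference; `run_sql`, the relation list, and the function names are placeholders, not dbt-spark's actual internals:

```python
# Hypothetical sketch contrasting the two approaches; `run_sql` and the
# function names are placeholders, not dbt-spark's actual API.

def catalog_per_relation(run_sql, relations):
    # Current behavior: two statements per relation, i.e. 2 * N round trips.
    results = []
    for relation in relations:
        results.append(run_sql(f"show tblproperties {relation}"))
        results.append(run_sql(f"describe extended {relation}"))
    return results


def catalog_per_schema(run_sql, schema):
    # Proposed behavior: one statement per schema, parsed client-side.
    return run_sql(f"show table extended in {schema} like '*'")
```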

Challenge

The tricky part is the formatting: the result of show table extended... is structured, but strange:

Database: my_db
Table: view_model
Owner: root
Created Time: Wed Jan 29 01:58:46 UTC 2020
Last Access: Thu Jan 01 00:00:00 UTC 1970
Created By: Spark 2.4.4
Type: VIEW
View Text: select * from my_db.seed
View Default Database: default
View Query Output Columns: [id, first_name, last_name, email, gender, ip_address]
Table Properties: [transient_lastDdlTime=1580263126, view.query.out.col.3=email, view.query.out.col.0=id, view.query.out.numCols=6, view.query.out.col.4=gender, view.default.database=default, view.query.out.col.1=first_name, view.query.out.col.5=ip_address, view.query.out.col.2=last_name]
Serde Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
Storage Properties: [serialization.format=1]
Schema: root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- last_name: string (nullable = true)
 |-- email: string (nullable = true)
 |-- gender: string (nullable = true)
 |-- ip_address: string (nullable = true)

As @aaronsteers helpfully found, we could use this regex string to parse out all the info we need from the result above:

\|-- (.*): (.*) \(nullable = (.*)\b

This will require changes to _get_columns_for_catalog, and possibly to parse_describe_extended as well.
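
For reference, a minimal sketch (plain Python, not dbt-spark's actual implementation) of how that regex could pull column names, types, and nullability out of the `Schema:` section shown above:

```python
import re

# Regex suggested above, applied to the `Schema:` section of
# `show table extended in [schema] like '*'` output.
SCHEMA_LINE_RE = re.compile(r"\|-- (.*): (.*) \(nullable = (.*)\b")

raw = """Schema: root
 |-- id: long (nullable = true)
 |-- first_name: string (nullable = true)
 |-- email: string (nullable = true)
"""

columns = [
    {"name": name, "type": dtype, "nullable": nullable == "true"}
    for name, dtype, nullable in SCHEMA_LINE_RE.findall(raw)
]
print(columns)
# [{'name': 'id', 'type': 'long', 'nullable': True}, ...]
```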

Alternatives

  • Open a PR against Apache Spark to add an information schema :)

Who will this benefit?

Faster docs generation for everybody!

@jtcohen6 jtcohen6 added enhancement New feature or request good_first_issue Good for newcomers labels May 29, 2020
@jtcohen6
Contributor Author

This would also resolve the fact that Statistics are currently missing in docs for Delta tables. They're returned by describe detail instead of describe extended (docs); regardless, they do appear in the result of show table extended like '*'.

@felippecaso

felippecaso commented Mar 4, 2021

Adding to the discussion here, show table extended can be very costly depending on the account's Spark architecture.

The instance I'm currently working with has a single data_lake schema containing 1000+ tables; even though my dbt project uses only 4 sources from that schema, the command pulls metadata for every table.

A simple dbt docs generate takes 1 hour to run 😰

From what I understood, the challenge here would be to change dbt-spark's behavior to query describe extended {{ table }} instead of show table extended, is that right?

@jtcohen6
Contributor Author

From what I understood, the challenge here would be to change dbt-spark's behavior to query describe extended {{ table }} instead of show table extended, is that right?

@felippecaso Ah no, just the reverse: We want to run just one show table extended in [schema] like '*' and use that to populate information about all catalog objects, rather than needing to run one describe extended for every object.

I don't know if there's anything we can do in the case where show table extended in [schema] like '*' takes an hour to run! It's pretty standard database behavior to ask for that kind of metadata. Is there anything obvious we're missing? Any chance this is faster with a bigger cluster?

@felippecaso

Hey @jtcohen6,
There was indeed a leap in performance once I understood how threading works in dbt executions, haha.

Still, my point is that show table extended in <schema> can process many more tables than it needs to. For instance, our database currently has those 1000+ tables in a single data_lake schema, and no more than 50 of them are consumed by our dbt project.

Those 50 are the only tables dbt needs metadata for, but show table extended in <schema> pulls from all 1000+, spending unnecessary time on execution. Does that make sense?

As always, I may also be missing something =)

@jtcohen6
Contributor Author

jtcohen6 commented Apr 14, 2021

@felippecaso I don't think you're missing anything, and you've got a really good point. It's slow for dbt to run a separate describe table query for every single object it needs to know about (50 times in your case). It's also slow to run show table extended in <schema> like '*' against schemas with lots of objects in them (1000 objects, of which you only need 50). It's sort of a lose-lose.

I can imagine a future state of dbt, where it's able to marry the relation cache built at the start of the run with the catalog built at the end of the run, and actually update the former over the course of each model execution to produce the latter. In that future state, running describe table once after each model is created makes a ton of sense. In the current state of the world, however, dbt builds the relation cache + catalog separately, and each in one fell swoop.

For my part, on other databases / warehouses / query engines, dbt expects to be able to run metadata queries by limiting its scope to the database/schema of interest. It's pretty reasonable to expect that returning metadata queries on a schema with 1000 objects should actually be fairly quick... so I'm inclined to see this as a limitation of Apache Spark, more than anything.

Is this behavior we need to make configurable? Could we offer an optional/configurable wildcard, in case (e.g.) all the tables in your source schema share a common prefix/suffix? Are there other secret SparkSQL approaches I don't know about (yet)?
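
To make the wildcard idea concrete, here's a hypothetical sketch; `wildcard` is an imagined config value (it doesn't exist in dbt-spark today), and the function is illustrative only:

```python
# Hypothetical sketch of the "configurable wildcard" idea above.
# `wildcard` is an imagined config value, not an existing dbt-spark setting.
def list_relations_sql(schema: str, wildcard: str = "*") -> str:
    # If all source tables share a prefix like 'dbt_', only matching tables
    # would be scanned instead of every object in the schema.
    return f"show table extended in {schema} like '{wildcard}'"

print(list_relations_sql("data_lake", "dbt_*"))
# show table extended in data_lake like 'dbt_*'
```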
