Evaluate open source data catalog options for integration into this platform #35

MattTriano · 2023-01-03T03:35:30Z

A data catalog should provide:

Core metadata documentation (table name, table description, table grain, source, etc),
Table schema documentation (column names, descriptions, data types, etc),
Lineage information,
Usage information,
Access control information,
Search functionality, and
the ability for users to enrich data with tags and further information.

dbt's built-in doc server does include most of that functionality (even access control, apparently https://www.getdbt.com/blog/teaching-dbt-about-grants/), but it doesn't allow users to edit things through the portal, and I think it's intended more as a dev tool than a production option.

There are two options I want to evaluate:

DataHub
- ~7k stars, project started in 2016, main open source option.
- features
OpenMetadata
- ~1.8k stars, project started in Aug 2021, growing faster than DataHub and has a slightly more active community.
- features

I've looked at Amundsen, but its community is about 5% as active as OpenMetadata's community, and I don't think it will keep up.

MattTriano · 2023-01-16T03:38:36Z

Per the feature-set comparisons on awesome-data-catalogs, it looks like my assessment about this product space was accurate; DataHub and OpenMetadata are the most feature-rich and developed options, but there's one more comparably feature-rich project: OpenDataDiscovery. That project only has 680 stars at the moment, is a month older than OpenMetadata, and it is growing much less rapidly than OpenMetadata or DataHub.

MattTriano · 2023-01-16T04:26:48Z

Looks like I'll have to upgrade docker-compose to at least v2.0.0 to use OpenMetadata, which will involve updating makefile recipes to use docker compose instead of docker-compose. Per the compatibility docs, it looks like some commands I don't use have been removed.
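
For the makefile update, a minimal sketch of the rewrite, assuming the recipes only call the v1 binary as a standalone `docker-compose` token. The function name `migrate_recipe_text` and the example recipe are hypothetical, not taken from this repo. The pattern deliberately leaves file names like `docker-compose.yml` alone.

```python
import re

# Match the standalone v1 command `docker-compose`, but not file names such as
# `docker-compose.yml` or longer tokens such as `my-docker-compose`.
V1_COMMAND = re.compile(r"(?<![\w.-])docker-compose(?![\w.-])")


def migrate_recipe_text(makefile_text: str) -> str:
    """Return makefile_text with v1 `docker-compose` calls rewritten to `docker compose`."""
    return V1_COMMAND.sub("docker compose", makefile_text)


# Hypothetical recipe, only for illustration.
example = "up:\n\tdocker-compose -f docker-compose.yml up -d\n"
migrated = migrate_recipe_text(example)
print(migrated)
```

This only handles the command rename; any recipe relying on a command removed in v2 would still need a manual look.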

MattTriano · 2023-01-17T01:52:53Z

Misc notes

DataHub Metadata Enrichment
After ingesting metadata into DataHub, you can enrich metadata through the UI:

Describe a dataset, or even add descriptions of columns,
Set the owner(s) of the dataset,
Add tags for the dataset,
Add a glossary of terms,
Add a domain for the dataset

Shift Left enrichment

Enrich at source (e.g., via comments in SQL table definitions, meta blocks in dbt schema.yml files, description fields in LookML dimension/metric definitions, etc)

Transform Enrichment

Useful when there are patterns in the source data (e.g. common terms, field names, or concepts),
example: any time there's a column whose name matches some regex, apply a given tag.
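
To make the regex idea concrete, here is a rough sketch in plain Python. It does not use DataHub's transformer API; the rule table `TAG_RULES`, the tag names, and the function names are all hypothetical and only illustrate the pattern-to-tag mapping.

```python
import re
from typing import Dict, Iterable, List

# Hypothetical rules: regex applied to the lowercased column name -> tag to apply.
TAG_RULES: Dict[str, str] = {
    r"(^|_)email($|_)": "pii",
    r"_id$": "identifier",
    r"(^|_)(lat|lon|latitude|longitude)($|_)": "geospatial",
}


def tags_for_column(column_name: str) -> List[str]:
    """Return every tag whose pattern matches the column name, in rule order."""
    name = column_name.lower()
    return [tag for pattern, tag in TAG_RULES.items() if re.search(pattern, name)]


def tag_columns(column_names: Iterable[str]) -> Dict[str, List[str]]:
    """Map each column name to the list of tags its name qualifies for."""
    return {name: tags_for_column(name) for name in column_names}


print(tag_columns(["customer_email", "parcel_id", "Latitude", "description"]))
```

In a real deployment the same rules would presumably live in the catalog's own transformer config rather than in a standalone script.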

CSV: Bulk Enrichment Import

If you have a google doc or something defining ownership and definitions, you can ingest that
(unclear how much config you have to do to parse the sheet)
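
As a rough idea of what the parsing side might involve, here is a sketch that reads a sheet exported to CSV into a per-dataset structure. The column layout (`dataset`, `column`, `description`, `owner`), the sample rows, and `load_enrichment` are assumptions for illustration; DataHub's actual CSV format may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sheet export. A blank `column` cell means the row describes the dataset itself.
SHEET_EXPORT = """dataset,column,description,owner
parcel_sales,,One row per recorded sale,data-team
parcel_sales,sale_price,Sale price in USD,
"""


def load_enrichment(csv_text: str) -> dict:
    """Group descriptions and owners from the CSV text by dataset name."""
    enrichment = defaultdict(lambda: {"description": None, "owners": [], "columns": {}})
    for row in csv.DictReader(io.StringIO(csv_text)):
        entry = enrichment[row["dataset"]]
        if row["column"]:
            # Column-level row: record the column description.
            entry["columns"][row["column"]] = row["description"]
        else:
            # Dataset-level row: record the dataset description.
            entry["description"] = row["description"]
        if row["owner"]:
            entry["owners"].append(row["owner"])
    return dict(enrichment)


parsed = load_enrichment(SHEET_EXPORT)
print(parsed)
```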

API Enrichment

For programmatic metadata (e.g., outputs from CI/CD processes)

DataHub UI

The first approach described above, where you add info through the UI

bbrewington · 2023-01-17T02:09:45Z

did you consider Amundsen? https://github.com/amundsen-io/amundsen - not entirely sure if it's considered a data catalog, but just a callout it has 3,700 stars

MattTriano · 2023-01-17T05:30:42Z

@bbrewington I gave it a brief look but the relatively modest amount of activity on the amundsen repo put it below DataHub and OpenMetadata on my list of things to check. I will confess, I couldn't get a great sense of the feature-sets of either of those tools from their websites and decided I'd just spin up test deployments for both and scan through the features. Here's my test setup of OpenMetadata and I'll probably spin up a DataHub test run tomorrow.

Have you used it? If so, what did you think of it? I checked through your repos and commits to see if you were a contributor but I didn't check too far. By the way, it looks like we've been looking at a lot of the same docs and projects over the past few months, and I like the commit msgs on your dbt-BQ-info_schema repo.

bbrewington · 2023-01-18T14:44:25Z

@MattTriano haha sounds like the clickbait hooked you in (having some fun with that one) - here's link for reference: https://github.com/bbrewington/dbt-bigquery-information-schema

TBH I'm still pretty new to Metadata tools...actually the above linked repo might be a good use case to try some of these out. I assumed Amundsen was best in class, but now will consider the 3 against each other

MattTriano mentioned this issue Jan 19, 2023

Integrate OpenMetadata with the platform #49

Open

4 tasks
