Evaluate open source data catalog options for integration into this platform #35

Open
MattTriano opened this issue Jan 3, 2023 · 6 comments

@MattTriano
Owner

A data catalog should provide:

  • Documentation of core metadata (table name, table description, table grain, source, etc.),
  • Documentation of table schemas (column names, descriptions, data types, etc.),
  • Lineage information,
  • Usage information,
  • Access control information,
  • Search functionality, and
  • the ability for users to enrich data with tags and further information.

dbt's built-in doc server does include most of that functionality (even access control, apparently https://www.getdbt.com/blog/teaching-dbt-about-grants/), but it doesn't allow users to edit things through the portal, and I think it's intended more as a dev tool than a production option.

There are two options I want to evaluate:

  1. DataHub
    • ~7k stars, project started in 2016, main open source option.
    • features
  2. OpenMetadata
    • ~1.8k stars, project started in Aug 2021, growing faster than DataHub and has a slightly more active community.
    • features

I've looked at Amundsen, but its community is about 5% as active as OpenMetadata's community, and I don't think it will keep up.

@MattTriano
Owner Author

MattTriano commented Jan 16, 2023

Per the feature-set comparisons on awesome-data-catalogs, it looks like my assessment of this product space was accurate: DataHub and OpenMetadata are the most feature-rich and developed options, but there's one more comparably feature-rich project, OpenDataDiscovery. That project has only 680 stars at the moment, is a month older than OpenMetadata, and is growing much less rapidly than OpenMetadata or DataHub.

@MattTriano
Owner Author

Looks like I'll have to upgrade docker-compose to at least v2.0.0 to use OpenMetadata, which will involve updating makefile recipes to use docker compose instead of docker-compose. Per the compatibility docs, it looks like some commands I don't use have been removed.
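
As a rough sketch of that makefile update (not something taken from the Docker or OpenMetadata docs), a small Python helper could rewrite `docker-compose` invocations to `docker compose` while leaving filenames like `docker-compose.yml` alone. The function name `migrate_compose_calls` and the sample recipe line are hypothetical.

```python
import re

# Match the standalone `docker-compose` command, but skip filenames such as
# docker-compose.yml (followed by ".<char>") and names like my-docker-compose.
_V1_COMMAND = re.compile(r"(?<![\w-])docker-compose(?![\w-]|\.\w)")


def migrate_compose_calls(makefile_text: str) -> str:
    """Replace v1-style `docker-compose` calls with v2-style `docker compose`."""
    return _V1_COMMAND.sub("docker compose", makefile_text)


if __name__ == "__main__":
    # Hypothetical recipe line; real recipes would be read from the makefile.
    recipe = "\tdocker-compose -f docker-compose.yml up -d\n"
    print(migrate_compose_calls(recipe), end="")
```

It would still be worth diffing the output by hand, since removed or renamed v1 commands won't be caught by a plain text substitution.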

@MattTriano
Owner Author

Misc notes

DataHub Metadata Enrichment
After ingesting metadata into DataHub, you can enrich metadata through the UI:

  • Describe a data set, or even add descriptions of columns,
  • Set the owner(s) of the data set,
  • Add tags for the data set,
  • Add a glossary of terms,
  • Add a domain for the data set.

Shift Left enrichment

Enrich at the source (e.g., via comments in SQL table definitions, meta blocks in dbt schema.yml files, description fields in LookML dimension/metric definitions, etc.)

Transform Enrichment

Useful when there are patterns in the source data (e.g., common terms, field names, or concepts).
Example: any time there's a column whose name matches some regex, apply a given tag.
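
To make the regex idea concrete, here's a minimal sketch in plain Python. It is not DataHub's transformer API; the rule list, tag names, and the `tag_columns` function are all hypothetical and only illustrate the "column name matches a pattern, so apply a tag" logic.

```python
import re
from typing import Dict, Iterable, List, Pattern, Tuple

# Hypothetical rules: (compiled column-name pattern, tag to apply).
TAG_RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"(^|_)email($|_)", re.IGNORECASE), "PII"),
    (re.compile(r"_id$", re.IGNORECASE), "identifier"),
]


def tag_columns(
    columns: Iterable[str],
    rules: List[Tuple[Pattern[str], str]] = TAG_RULES,
) -> Dict[str, List[str]]:
    """Map each column name to the tags of every rule whose pattern matches it."""
    return {
        column: [tag for pattern, tag in rules if pattern.search(column)]
        for column in columns
    }


if __name__ == "__main__":
    print(tag_columns(["user_email", "order_id", "amount"]))
```

Columns that match no rule get an empty tag list, which makes it easy to see what still needs manual enrichment.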

CSV: Bulk Enrichment Import

If you have a Google doc or something similar defining ownership and definitions, you can ingest that
(it's unclear how much config is needed to parse the sheet).
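
For a sense of what the parsing side might involve, here's a small sketch using only Python's standard library. The column names (`dataset`, `owner`, `description`) are assumptions about how such a sheet might be laid out, not DataHub's expected CSV format, and `load_enrichment` is a hypothetical helper.

```python
import csv
import io
from typing import Dict


def load_enrichment(csv_text: str) -> Dict[str, Dict[str, str]]:
    """Parse an exported sheet into {dataset: {"owner": ..., "description": ...}}."""
    reader = csv.DictReader(io.StringIO(csv_text))
    enrichment: Dict[str, Dict[str, str]] = {}
    for row in reader:
        # Assumed header names; adjust to match the real sheet.
        dataset = row["dataset"].strip()
        enrichment[dataset] = {
            "owner": row["owner"].strip(),
            "description": row["description"].strip(),
        }
    return enrichment


if __name__ == "__main__":
    sample = "dataset,owner,description\norders,alice,One row per order\n"
    print(load_enrichment(sample))
```

The resulting dict would still need to be mapped onto whatever format the catalog's ingestion actually expects.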

API Enrichment

For programmatic metadata (e.g., outputs from CI/CD processes)

DataHub UI

The initial option described above, where you add info through the UI.

@bbrewington

did you consider Amundsen? https://github.com/amundsen-io/amundsen - not entirely sure if it's considered a data catalog, but just a callout it has 3,700 stars

@MattTriano
Owner Author

MattTriano commented Jan 17, 2023

@bbrewington I gave it a brief look but the relatively modest amount of activity on the amundsen repo put it below DataHub and OpenMetadata on my list of things to check. I will confess, I couldn't get a great sense of the feature-sets of either of those tools from their websites and decided I'd just spin up test deployments for both and scan through the features. Here's my test setup of OpenMetadata and I'll probably spin up a DataHub test run tomorrow.

Have you used it? If so, what did you think of it? I checked through your repos and commits to see if you were a contributor, but I didn't check too far. By the way, it looks like we've been looking at a lot of the same docs and projects over the past few months, and I like the commit msgs on your dbt-BQ-info_schema repo.

@bbrewington

@MattTriano haha sounds like the clickbait hooked you in (having some fun with that one) - here's link for reference: https://github.com/bbrewington/dbt-bigquery-information-schema

TBH I'm still pretty new to Metadata tools...actually the above linked repo might be a good use case to try some of these out. I assumed Amundsen was best in class, but now will consider the 3 against each other
