MetadataHandler

Anthony Virtuoso edited this page Nov 16, 2019 · 2 revisions

As part of forming a query execution plan that includes a federated data source, Athena needs a way to obtain key metadata from your source. More precisely, Athena needs a way to obtain:

  1. A list of schemas (also known as databases).
  2. A list of tables in a given schema.
  3. Table definitions (e.g. column names and column types).
  4. The partitions that should be queried for a given schema, table, and predicate.
  5. How to split up and parallelize reads of a partition.

The Athena Query Federation SDK provides MetadataHandler, an abstract class that you can extend to implement the above functionality through the following functions:

  1. doListSchemas(...) - lists available schemas.
  2. doListTables(...) - lists available tables in a schema.
  3. doGetTable(...) - gets the definition of a table.
  4. doGetTableLayout(...) - provides partition information and optionally performs partition pruning.
  5. doGetSplits(...) - tells Athena how it can split up and parallelize reads of a partition.
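
To make the division of labor between these five functions concrete, here is a minimal, self-contained Java sketch. It deliberately does not use the SDK's real classes: SketchMetadataHandler, InMemoryMetadataHandler, the sales.orders table, the year partitions, and the simplified signatures (plain lists and maps instead of the SDK's request and response objects) are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the SDK's abstract MetadataHandler.
// The real class has different signatures; only the method names follow this page.
abstract class SketchMetadataHandler {
    abstract List<String> doListSchemas();
    abstract List<String> doListTables(String schema);
    abstract Map<String, String> doGetTable(String schema, String table);
    // 'year' plays the role of the predicate; null means "no predicate".
    abstract List<String> doGetTableLayout(String schema, String table, Integer year);
    abstract List<String> doGetSplits(String partition);
}

// Hypothetical handler backed by hard-coded metadata for one table, sales.orders.
class InMemoryMetadataHandler extends SketchMetadataHandler {
    private static final int SPLITS_PER_PARTITION = 2;
    private static final int[] YEARS = {2017, 2018, 2019};

    @Override
    List<String> doListSchemas() {
        return List.of("sales");
    }

    @Override
    List<String> doListTables(String schema) {
        return "sales".equals(schema) ? List.of("orders") : List.of();
    }

    @Override
    Map<String, String> doGetTable(String schema, String table) {
        if (!"sales".equals(schema) || !"orders".equals(table)) {
            throw new IllegalArgumentException("Unknown table: " + schema + "." + table);
        }
        // Column name -> column type, in declaration order.
        Map<String, String> columns = new LinkedHashMap<>();
        columns.put("order_id", "bigint");
        columns.put("amount", "double");
        columns.put("year", "int");
        return columns;
    }

    @Override
    List<String> doGetTableLayout(String schema, String table, Integer year) {
        doGetTable(schema, table); // rejects unknown tables
        List<String> partitions = new ArrayList<>();
        for (int y : YEARS) {
            // Partition pruning: skip partitions the predicate rules out.
            if (year == null || year == y) {
                partitions.add("year=" + y);
            }
        }
        return partitions;
    }

    @Override
    List<String> doGetSplits(String partition) {
        // Each split is a unit of work that could be read in parallel.
        List<String> splits = new ArrayList<>();
        for (int i = 0; i < SPLITS_PER_PARTITION; i++) {
            splits.add(partition + "/split-" + i);
        }
        return splits;
    }
}

class MetadataHandlerSketchDemo {
    public static void main(String[] args) {
        SketchMetadataHandler handler = new InMemoryMetadataHandler();
        System.out.println(handler.doListSchemas());
        System.out.println(handler.doListTables("sales"));
        System.out.println(handler.doGetTable("sales", "orders"));
        for (String partition : handler.doGetTableLayout("sales", "orders", 2019)) {
            System.out.println(handler.doGetSplits(partition));
        }
    }
}
```

In this sketch the calls run in the order the list suggests: schemas, then tables, then a table definition, then the pruned partitions, then the splits for each surviving partition. A real handler would answer the same questions from your source's own metadata rather than from hard-coded values.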

The SDK also provides GlueMetadataHandler, a partial implementation of these methods that uses the AWS Glue Data Catalog for metadata. It can jump-start your MetadataHandler if your source lacks its own metadata store. The athena-redis connector is one example: it uses the AWS Glue Data Catalog because Redis lacks a traditional metastore that could tell Athena how to interpret your Redis keys, prefixes, and zsets as tables and columns.

Advanced Usage

In most cases you will deploy a MetadataHandler and a RecordHandler together in the same Lambda function by using a CompositeHandler. There are, however, some cases where you may want to deploy them independently. Athena supports this, and it is most often done for one of the following reasons:

  1. You have a centralized source of metadata for all your data sources (e.g. a single source of truth) that is in its own VPC.
  2. Your data sources are in separate VPCs that do not contain the metadata source.
  3. Your metadata operations and data reads require different scale or different languages in their Lambda functions.
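
The difference between the combined and independent deployments can be sketched in plain Java. This is not the SDK's CompositeHandler: SketchCompositeHandler, SketchRequest, SketchMetadataRequest, SketchRecordRequest, and the string responses are hypothetical names and simplifications. The sketch only assumes the general idea that a composite entry point inspects each request and routes it to either the metadata logic or the record-reading logic.

```java
import java.util.function.Function;

// Hypothetical request types; the SDK's real request classes are richer.
interface SketchRequest {}

final class SketchMetadataRequest implements SketchRequest {
    final String operation;
    SketchMetadataRequest(String operation) { this.operation = operation; }
}

final class SketchRecordRequest implements SketchRequest {
    final String split;
    SketchRecordRequest(String split) { this.split = split; }
}

// Hypothetical composite: one entry point that routes by request type.
final class SketchCompositeHandler {
    private final Function<SketchMetadataRequest, String> metadataHandler;
    private final Function<SketchRecordRequest, String> recordHandler;

    SketchCompositeHandler(Function<SketchMetadataRequest, String> metadataHandler,
                           Function<SketchRecordRequest, String> recordHandler) {
        this.metadataHandler = metadataHandler;
        this.recordHandler = recordHandler;
    }

    String handle(SketchRequest request) {
        if (request instanceof SketchMetadataRequest) {
            return metadataHandler.apply((SketchMetadataRequest) request);
        }
        if (request instanceof SketchRecordRequest) {
            return recordHandler.apply((SketchRecordRequest) request);
        }
        throw new IllegalArgumentException("Unsupported request type: " + request);
    }
}

class CompositeHandlerSketchDemo {
    public static void main(String[] args) {
        Function<SketchMetadataRequest, String> metadata = r -> "metadata:" + r.operation;
        Function<SketchRecordRequest, String> records = r -> "records:" + r.split;

        // Combined deployment: a single function serves both kinds of request.
        SketchCompositeHandler combined = new SketchCompositeHandler(metadata, records);
        System.out.println(combined.handle(new SketchMetadataRequest("doListSchemas")));
        System.out.println(combined.handle(new SketchRecordRequest("year=2019/split-0")));

        // Independent deployment: each handler is its own entry point and
        // could live in a different function, VPC, or even language.
        System.out.println(metadata.apply(new SketchMetadataRequest("doGetTable")));
        System.out.println(records.apply(new SketchRecordRequest("year=2019/split-1")));
    }
}
```

Because the two handlers in this sketch share no state, separating them is only a packaging decision, which is what makes the independent deployments listed above practical.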