Feature/05 configure fileloader and cdc props (#14)
* fixed a painful bloated dependencies issue

* wired up table creation -- next is simple table property management

* STUPID - I didn't have .github in the project root

* specified python version better for gh action

* made pipenv just a little more flexible

* made gh actions less strict

* remember to pipenv install dev deps

* added envs to gh action workflow

* fixed bad var ref for ci

* fixed state issues in unit testing

* refactor to s3_monitoring_uri syntax, split table creation into loader tables and cdc mirror tables

* small tweak to config file

* added s3 uri indicator
randypitcherii committed Dec 11, 2023
1 parent b4c1966 commit 8dce317
Showing 9 changed files with 303 additions and 178 deletions.
@@ -6,9 +6,20 @@ on:
       - main
 
 jobs:
-  test:
+  build_and_test:
+    name: Build and test cdc bootstrapper locally (no sls deploy)
     runs-on: ubuntu-latest
 
+
+    env:
+      S3_BUCKET_NAME: ${{ vars.S3_BUCKET_NAME }}
+      S3_BUCKET_PATH: ${{ vars.S3_BUCKET_PATH }}
+
+      TABULAR_CREDENTIAL: ${{ secrets.TABULAR_CREDENTIAL }}
+      TABULAR_TARGET_WAREHOUSE: ${{ vars.TABULAR_TARGET_WAREHOUSE }}
+      TABULAR_CATALOG_URI: ${{ vars.TABULAR_CATALOG_URI }}
+
+
     steps:
       - uses: actions/checkout@v2
 
@@ -21,7 +32,7 @@ jobs:
         run: |
           pip install pipenv
           cd tabular-cdc-bootstrapper
-          pipenv install
+          pipenv install --dev
       - name: Run tests
         run: |
8 changes: 6 additions & 2 deletions README.md
@@ -32,8 +32,12 @@ pipenv install
 - [configure serverless for your AWS account.](https://www.serverless.com/framework/docs/providers/aws/guide/credentials)
 - update serverless.yml with your specific configs, including tabular credentials. You may also provide a `.env` file if you prefer. Place the file alongside the `serverless.yml` file in the same directory 💪
 ```.env
-S3_BUCKET_NAME=randy-pitcher-workspace--aws
-S3_BUCKET_PATH=cdc-bootstrap
+S3_BUCKET_TO_MONITOR=randy-pitcher-workspace--aws
+S3_PATH_TO_MONITOR=cdc-bootstrap
+TABULAR_TARGET_WAREHOUSE=enterprise_data_warehouse
+TABULAR_CREDENTIAL=t-1234:123123123 # needs permission to create database in a warehouse and list all existing objects in a warehouse
+TABULAR_CATALOG_URI=https://api.tabular.io/ws
 ```
 - activate the python virtual environment with `pipenv shell`
 - deploy with `npx sls deploy`. NOTE: if you want to just run `sls deploy`, install serverless globally with npm (`npm install -g serverless`)
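
The handler in this commit reads these settings with `os.environ[...]`, so a missing key only surfaces as a `KeyError` when the module is imported. A minimal sketch of checking for them up front, assuming a hypothetical helper `missing_settings` and tuple `REQUIRED_SETTINGS` that are not part of this repository:

```python
from typing import List, Mapping

# Setting names copied from the .env example in the README hunk above.
# The tuple and the helper below are illustrative, not repository code.
REQUIRED_SETTINGS = (
    "S3_BUCKET_TO_MONITOR",
    "S3_PATH_TO_MONITOR",
    "TABULAR_TARGET_WAREHOUSE",
    "TABULAR_CREDENTIAL",
    "TABULAR_CATALOG_URI",
)


def missing_settings(env: Mapping[str, str]) -> List[str]:
    """Return the required setting names that are absent or empty in env."""
    return [name for name in REQUIRED_SETTINGS if not env.get(name)]
```

Calling `missing_settings(os.environ)` before `npx sls deploy` would list any names still unset, in the order shown above.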
5 changes: 4 additions & 1 deletion tabular-cdc-bootstrapper/.gitignore
@@ -23,4 +23,7 @@ node_modules
 .requirements.zip
 
 # shhh, secrets. Or maybe just configs, who knows 🤷
-.env
+.env
+
+# ignore sample files
+*.parquet
5 changes: 1 addition & 4 deletions tabular-cdc-bootstrapper/Pipfile
@@ -4,14 +4,11 @@ verify_ssl = true
name = "pypi"

[packages]
requests = "*"
pytest = "*"
pyiceberg = {extras = ["s3fs"], version = "*"}
pyarrow = "*"
boto3 = "*"

[dev-packages]
pytest = "*"

[requires]
python_version = "3.9"
python_full_version = "3.9.6"
126 changes: 54 additions & 72 deletions tabular-cdc-bootstrapper/Pipfile.lock

Some generated files are not rendered by default.

47 changes: 28 additions & 19 deletions tabular-cdc-bootstrapper/cdc_bootstrap_handler.py
@@ -10,18 +10,17 @@
 except ImportError:
     pass
 
-
 import tabular
 
-
-# Tabular ENVs
+# Tabular connectivity
 TABULAR_CREDENTIAL = os.environ['TABULAR_CREDENTIAL']
 TABULAR_CATALOG_URI = os.environ['TABULAR_CATALOG_URI']
 TABULAR_TARGET_WAREHOUSE = os.environ['TABULAR_TARGET_WAREHOUSE']
 
-# S3 Monitoring ENVs
-S3_BUCKET_NAME = os.environ['S3_BUCKET_NAME']
-S3_BUCKET_PATH = os.environ.get('S3_BUCKET_PATH', '') # monitor the whole bucket if no path provided
+# S3 Monitoring
+S3_BUCKET_TO_MONITOR = os.environ['S3_BUCKET_TO_MONITOR']
+S3_PATH_TO_MONITOR = os.environ['S3_PATH_TO_MONITOR']
+S3_MONITORING_URI = f's3://{S3_BUCKET_TO_MONITOR}/{S3_PATH_TO_MONITOR}'
 
 
 # Set up logging
@@ -37,8 +36,7 @@ def handle_new_file(event, context):
     logger.info(f"""Processing new bootstrap event...
     TABULAR_CATALOG_URI: {TABULAR_CATALOG_URI}
     TABULAR_TARGET_WAREHOUSE: {TABULAR_TARGET_WAREHOUSE}
-    S3_BUCKET_NAME: {S3_BUCKET_NAME}
-    S3_BUCKET_PATH: {S3_BUCKET_PATH}
+    S3_MONITORING_URI: {S3_MONITORING_URI}
     Object Key: {object_key}
@@ -55,24 +53,35 @@ def handle_new_file(event, context):
             'warehouse': TABULAR_TARGET_WAREHOUSE
         }
 
-        tabular.bootstrap_from_file(object_key, S3_BUCKET_PATH, catalog_properties)
+        if tabular.bootstrap_from_file(object_key, S3_MONITORING_URI, catalog_properties):
+            msg = 'Table successfully bootstrapped ✅'
+            logger.info(msg=msg)
+            return {
+                'statusCode': 200,
+                'body': json.dumps(msg)
+            }
+
+        else:
+            msg = 'Nothing to do 🌞'
+            logger.info(msg=msg)
+            return {
+                'statusCode': 200,
+                'body': json.dumps(msg)
+            }
 
 
     except Exception as e:
         error_message = str(e)
         error_type = type(e).__name__
         stack_info = traceback.format_exc()
 
-        return {
-            'statusCode': 500,
-            'errorType': error_type,
-            'errorMessage': error_message,
-            'stackTrace': stack_info,
+        resp = {
+            'statusCode': 500,
+            'errorType': error_type,
+            'errorMessage': error_message,
+            'stackTrace': stack_info,
         }
 
+        logger.error(f'\nFailed to bootstrap 🔴\n{resp}')
 
-    else:
-        return {
-            'statusCode': 200,
-            'body': json.dumps('Looks good to me 🌞')
-        }
+        return resp
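
The control flow this hunk introduces can be sketched in isolation. The snippet below is an approximation under stated assumptions, not the repository's code: `respond` and `build_monitoring_uri` are hypothetical names, and the `bootstrap` callable stands in for `tabular.bootstrap_from_file`, which is assumed to return a truthy value only when a table was actually created.

```python
import json
import traceback
from typing import Callable


def build_monitoring_uri(bucket: str, path: str) -> str:
    """Mirror the handler's f-string: s3://<bucket>/<path>."""
    return f"s3://{bucket}/{path}"


def respond(bootstrap: Callable[[], bool]) -> dict:
    """Run bootstrap() and shape a Lambda-style response dict."""
    try:
        # Hypothetical stand-in for tabular.bootstrap_from_file(...)
        if bootstrap():
            msg = "Table successfully bootstrapped"
        else:
            msg = "Nothing to do"
        return {"statusCode": 200, "body": json.dumps(msg)}
    except Exception as e:
        # Same fields the handler's except block collects.
        return {
            "statusCode": 500,
            "errorType": type(e).__name__,
            "errorMessage": str(e),
            "stackTrace": traceback.format_exc(),
        }
```

Both non-error paths return status 200; only the message distinguishes "bootstrapped" from "nothing to do". Note also that with this f-string an empty path yields a URI with a trailing slash (`s3://bucket/`).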
15 changes: 13 additions & 2 deletions tabular-cdc-bootstrapper/serverless.yml
@@ -5,14 +5,15 @@ service: tabular-cdc-bootstrapper
 frameworkVersion: "3"
 
 custom:
-  s3_bucket_name: ${env:S3_BUCKET_NAME, 'you_can_hardcode_here'}
-  s3_bucket_path: ${env:S3_BUCKET_PATH, 'you_can_hardcode_here'}
+  s3_bucket_name: ${env:S3_BUCKET_TO_MONITOR, 'you_can_hardcode_here'}
+  s3_bucket_path: ${env:S3_PATH_TO_MONITOR, 'you_can_hardcode_here'}
 
   pythonRequirements:
     dockerizePip: true
     zip: true
     noDeploy:
       - pytest
+      - boto3
 
 provider:
   name: aws
@@ -37,9 +38,19 @@ functions:
           event: s3:ObjectCreated:*
           rules:
             - prefix: ${self:custom.s3_bucket_path}
+            - suffix: .parquet
           existing: true
           forceDeploy: true
 
+package:
+  patterns:
+    - '!test*.py'
+    - '!*.parquet'
+    - '!.requirements.zip'
+    - '!.pytest_cache'
+    - '!.github'
+    - '!node_modules'
+
 plugins:
   - serverless-python-requirements
   - serverless-dotenv-plugin
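
With the added `suffix` rule, only object keys under the monitored prefix that end in `.parquet` should trigger the function. A rough local equivalent for unit tests, assuming plain string matching on the object key; `matches_rules` is a hypothetical helper, not something this commit defines:

```python
def matches_rules(object_key: str, prefix: str, suffix: str = ".parquet") -> bool:
    """Approximate the S3 event filter: key starts with prefix and ends with suffix."""
    return object_key.startswith(prefix) and object_key.endswith(suffix)
```

A check like this could let tests assert which sample keys would reach `handle_new_file` without deploying the stack.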