
Operational Analytics with Atlas and Redshift

Introduction

The modern business world demands expedited decision-making, improved customer experience, and increased productivity. Gone are the days when business intelligence relied only on historical data processed in batches. The order of the day is operational analytics, which measures the real-time operations of the business alongside its historical data trends.

Why Operational Analytics?

First and foremost, data volumes and varieties are increasing exponentially, and traditional data warehouses must evolve constantly to keep pace. Much of today's data is no longer tabular in nature; it arrives as JSON documents, social logs, images, videos, and time-series data.

Legacy relational database models are increasingly becoming showstoppers for application development and advanced analytics. With legacy datastores, performance becomes a major bottleneck as data grows into terabytes and petabytes.

So the need of the hour is a cohesive data model that handles both day-to-day operational data and its history.

Thus the coexistence of Atlas and Redshift emerges as a perfect fit for this business need.

Integration Framework

Data can be migrated from MongoDB Atlas to Redshift in a two-step approach:

Step 1: One-Time Load

MongoDB Atlas provides a direct connector for Apache Spark. Using the Spark connector, the existing data is migrated from MongoDB Atlas to Redshift in a single one-time load.

Step 2: Real-Time Data Sync

With the help of MongoDB Atlas Triggers, data changes in the database can be continuously written to an S3 bucket. From the S3 bucket, the data can be loaded into Redshift either through scheduled AWS Glue jobs or accessed as external tables.

This demonstration provides a step-by-step approach for each of these scenarios.

Prerequisites:

a) Good understanding of AWS Redshift, Glue, and S3 services
b) Good understanding of MongoDB Atlas and Application Services
c) VPC and network settings already set up as per your security standards
d) Redshift database
e) S3 bucket to store the JSON files
f) MongoDB Atlas cluster (a free-tier cluster is sufficient)

One-Time Load

Architecture diagram

Step-by-Step Instructions

Use the PySpark job template to move the data from Atlas to Redshift; a minimal sketch of such a job is shown below.
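The following is an illustrative sketch only, assuming the MongoDB Spark Connector (10.x) and the Redshift JDBC driver are available on the Spark classpath. The connection strings, database, collection, table, and credential values are placeholders, not values from this repo.

```python
# Minimal one-time load sketch: MongoDB Atlas -> Redshift via Spark.
# All URIs, names, and credentials are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("atlas-to-redshift-one-time-load")
    .config("spark.mongodb.read.connection.uri",
            "mongodb+srv://<user>:<password>@<atlas-cluster>/")
    .getOrCreate()
)

# Read the source collection from MongoDB Atlas
df = (
    spark.read.format("mongodb")
    .option("database", "sample_db")
    .option("collection", "customer_activity")
    .load()
)

# Write the data into Redshift over JDBC (append to an existing table).
# For large volumes, staging to S3 and using COPY is usually preferred.
(
    df.write.format("jdbc")
    .option("url", "jdbc:redshift://<redshift-endpoint>:5439/dev")
    .option("driver", "com.amazon.redshift.jdbc42.Driver")
    .option("dbtable", "customer_activity")
    .option("user", "<redshift-user>")
    .option("password", "<redshift-password>")
    .mode("append")
    .save()
)
```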

Real-Time Data Sync

The Change Data Capture capability of MongoDB Atlas is used to capture data in real time. Using Atlas Application Services (Triggers) together with Data Federation's $out stage to write to S3, the change data is continuously captured into the S3 bucket.

Architecture diagram

With Glue:

With Redshift Spectrum (External Table):

Step-by-Step Instructions for Setting Up the Glue Job

  1. The data from MongoDB Atlas can be continuously written to the S3 bucket using Data Federation and MongoDB Atlas Triggers. Please refer to the link for step-by-step instructions on capturing the data to S3.

For any further reference, please follow the MongoDB documentation link. A minimal sketch of the Data Federation $out pipeline is shown below.
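In the actual setup, the continuous export is performed by an Atlas Trigger (a JavaScript function in Atlas App Services) that runs Data Federation's $out stage. The Python sketch below only illustrates the shape of that $out-to-S3 pipeline when run against a federated database connection; the connection string, virtual database, collection, bucket, and region are placeholders.

```python
# Illustrative sketch of Data Federation's $out-to-S3 stage via pymongo.
# Requires an Atlas Data Federation (federated database) connection; all
# names below are placeholders for this demo.
from pymongo import MongoClient

client = MongoClient("mongodb://<federated-database-connection-string>")
coll = client["VirtualDatabase0"]["customer_activity"]

pipeline = [
    # Optionally filter to the changed documents captured by the trigger
    {"$match": {}},
    # Data Federation's $out stage writes the results to the S3 bucket
    {"$out": {
        "s3": {
            "bucket": "<your-s3-bucket>",
            "region": "us-east-1",
            "filename": "customer_activity/",
            "format": {"name": "json", "maxFileSize": "100MiB"},
        }
    }},
]

coll.aggregate(pipeline)
```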

  2. Create an AWS Glue job to move the data from the S3 bucket to AWS Redshift.

a. Create the Glue connection for the Redshift database.

Navigate to the AWS Glue console and open the "Data Catalog" menu on the left panel. Select "Connections" and click "Add Connection". Enter the parameters, taking guidance from the attached screenshots.

b. Create the Glue connection for the S3 bucket.

Test that both connections are working.

c. Create the crawler to populate the database and tables in the AWS Glue Data Catalog from S3. Navigate to the "Crawlers" menu on the left-side panel and click "Add Crawler". Add all the required information for the crawler, taking guidance from the attached screenshots.

Once the crawler is created successfully, run it and ensure it completes successfully.

Click the "Tables" menu on the left side and verify that the required tables were created.

d. Now we are all set to create the job that maps and populates the Redshift tables. Click "Jobs (legacy)" in the left-side menu and click "Add job". Fill in all the information required to create the job, taking reference from the attached screenshots.

Select the option to create a new table in the destination. Select the Redshift connection and the database name you already created in Redshift (as part of the prerequisites).

Validate the mapping and alter it as required.

The conversion script is generated automatically and can be edited. Change the Redshift database table name to customer_activity. A rough sketch of such a script is shown below.
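For orientation only, here is a sketch of what the edited, auto-generated Glue script might look like. The catalog database, table, connection names, and column mappings are assumptions for this demo and must be replaced with the values created in the previous steps.

```python
# Sketch of a Glue ETL job: S3 (crawled catalog table) -> Redshift.
# Database, table, connection, and mapping names are placeholders.
import sys
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled S3 data from the Glue Data Catalog
source = glueContext.create_dynamic_frame.from_catalog(
    database="atlas_s3_db", table_name="customer_activity_json")

# Map the JSON fields to the Redshift column types (adjust to your schema)
mapped = ApplyMapping.apply(frame=source, mappings=[
    ("_id", "string", "id", "string"),
    ("custid", "string", "custid", "string"),
    ("activity", "string", "activity", "string"),
])

# Write to the Redshift table through the Glue connection created earlier
glueContext.write_dynamic_frame.from_jdbc_conf(
    frame=mapped,
    catalog_connection="redshift-connection",
    connection_options={"dbtable": "customer_activity", "database": "dev"},
    redshift_tmp_dir="s3://<your-temp-bucket>/glue-temp/")

job.commit()
```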

Run the job and ensure its successful completion. Use the generated logs and error logs for debugging, if required.

Verify that the data is populated successfully in the Redshift table.

Step-by-Step Instructions for Setting Up Redshift Spectrum - External Table

Redshift Spectrum hosts the S3 bucket data as an external table. The reference and steps to create the external table are provided in the following link; a sketch of the DDL appears after the reference links below.

Redshift Spectrum - external table

Reference link for querying JSON-structured data in Redshift.
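As a hedged illustration, the external schema and table could be created with DDL similar to the following, submitted here through the Redshift Data API. The cluster identifier, IAM role, Glue catalog database, bucket path, and columns are placeholders, not values from this repo.

```python
# Sketch: create a Spectrum external schema/table over the S3 JSON files
# using the Redshift Data API. All identifiers are placeholders.
import boto3

client = boto3.client("redshift-data")

ddl = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS atlas_spectrum
FROM DATA CATALOG DATABASE 'atlas_s3_db'
IAM_ROLE 'arn:aws:iam::<account-id>:role/<spectrum-role>'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

CREATE EXTERNAL TABLE atlas_spectrum.customer_activity (
    id VARCHAR(64),
    custid VARCHAR(64),
    activity VARCHAR(256)
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://<your-s3-bucket>/customer_activity/';
"""

# The Data API executes one statement per call, so submit them separately
for statement in ddl.strip().split(";"):
    if statement.strip():
        client.execute_statement(
            ClusterIdentifier="<redshift-cluster-id>",
            Database="dev",
            DbUser="<redshift-user>",
            Sql=statement,
        )
```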

Analytical Services using Redshift ML

The data populated from MongoDB Atlas, either through AWS Glue or as external tables in Redshift, can be used to train models. Redshift ML enables Redshift to use Amazon SageMaker models directly to train and infer results.

Please refer to the link for training a model and inferring results from it; a minimal sketch of a CREATE MODEL statement is shown below.
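As an illustration only, a Redshift ML model is created with a CREATE MODEL statement similar to the sketch below, submitted here through the Redshift Data API. The model name, training query, target column, function name, IAM role, and S3 bucket are hypothetical and should be adapted to your data.

```python
# Sketch: train a Redshift ML (SageMaker-backed) model on the loaded data.
# All names, columns, and ARNs are placeholders for this demo.
import boto3

client = boto3.client("redshift-data")

create_model_sql = """
CREATE MODEL customer_churn_model
FROM (SELECT custid, activity, churn FROM customer_activity)
TARGET churn
FUNCTION predict_customer_churn
IAM_ROLE 'arn:aws:iam::<account-id>:role/<redshift-ml-role>'
SETTINGS (S3_BUCKET '<your-s3-bucket>');
"""

client.execute_statement(
    ClusterIdentifier="<redshift-cluster-id>",
    Database="dev",
    DbUser="<redshift-user>",
    Sql=create_model_sql,
)

# Once training completes, the model can be used for inference in SQL, e.g.:
# SELECT custid, predict_customer_churn(custid, activity) FROM customer_activity;
```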

Business Use Cases:

Industries such as retail, banking & finance, and manufacturing have a strong demand for operational analytics across a variety of real-time use cases.

A few are captured below

Summary:

With the synergy created by having Atlas for operational efficiency and Redshift for data-warehousing excellence, these "Operational Analytics" use cases can be delivered quickly. The solution can be extended to cover AI/ML needs using Amazon SageMaker.

We hope you are able to set up the integration successfully. For any further questions, please reach out to partners@mongodb.com.
