aws-samples/amazon-sagemaker-knn-search

Using Amazon Elasticsearch K-Nearest Neighbor (KNN) and Amazon SageMaker to power search and recommendation

Purpose of this repository

This repository provides technical guidance on building a search and recommendation engine, aimed at data scientists working on similar challenges.

We’ll start by training a custom language model using Amazon SageMaker and its built-in Object2Vec algorithm to generate a textual embedding for each catalog product. Then we will show how to visualize the embeddings using the TensorFlow Embedding Projector.
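As a rough illustration of the visualization step, the Embedding Projector can load a tab-separated file of vectors plus a tab-separated metadata file. The sketch below writes both. The function name `write_projector_tsv` and the metadata columns are hypothetical and are not taken from the notebooks.

```python
import csv
import io


def write_projector_tsv(embeddings, metadata, vectors_out, metadata_out, columns):
    """Write embeddings and per-row metadata as TSV for the Embedding Projector.

    embeddings: list of equal-length lists of floats, one per product.
    metadata:   list of dicts, one per product, in the same order.
    columns:    metadata keys to export (hypothetical column names).
    """
    if len(embeddings) != len(metadata):
        raise ValueError("embeddings and metadata must have the same length")

    vec_writer = csv.writer(vectors_out, delimiter="\t", lineterminator="\n")
    meta_writer = csv.writer(metadata_out, delimiter="\t", lineterminator="\n")

    # A multi-column metadata file needs a header row; the vectors file has none.
    meta_writer.writerow(columns)
    for vector, row in zip(embeddings, metadata):
        vec_writer.writerow(vector)
        meta_writer.writerow([row[c] for c in columns])


# Toy data for illustration only.
vectors_buf, metadata_buf = io.StringIO(), io.StringIO()
write_projector_tsv(
    embeddings=[[0.1, 0.2], [0.3, 0.4]],
    metadata=[
        {"product_id": "p1", "title": "red shoes"},
        {"product_id": "p2", "title": "blue hat"},
    ],
    vectors_out=vectors_buf,
    metadata_out=metadata_buf,
    columns=["product_id", "title"],
)
```

In practice you would pass open file handles instead of `io.StringIO` buffers and upload the two files through the projector's load dialog.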

In the second part of the blog, we will set up an Elasticsearch index using Amazon Elasticsearch Service (AES) and populate it with these embeddings in addition to the catalog data (product id, category, title, etc.).
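To make the index setup more concrete, here is a minimal sketch of an index body that stores an embedding next to catalog fields, assuming the k-NN plugin's `knn_vector` field type and an Elasticsearch 7.x style mapping. The field names (`embedding`, `product_id`, `category`, `title`) and the dimension are illustrative assumptions; the actual mapping used in the notebooks may differ.

```python
import json


def build_knn_index_body(dimension, vector_field="embedding"):
    """Return an index body with k-NN enabled and one vector field.

    All field names here are hypothetical placeholders.
    """
    return {
        # Enables the k-NN plugin for this index.
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # The embedding produced by the trained model.
                vector_field: {"type": "knn_vector", "dimension": dimension},
                # Catalog data stored alongside the embedding.
                "product_id": {"type": "keyword"},
                "category": {"type": "keyword"},
                "title": {"type": "text"},
            }
        },
    }


index_body = build_knn_index_body(dimension=4)
print(json.dumps(index_body, indent=2))
```

The resulting dictionary is what you would pass as the request body when creating the index; `dimension` must match the length of the vectors returned by the model.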

Finally, we will conclude with a mechanism that combines keyword and KNN search to perform search and recommendation queries.
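One possible way to combine the two kinds of search is a `bool` query that pairs a `knn` clause with a keyword filter. The sketch below only builds the request body. The function name, the field names, and the choice of a `match` filter are assumptions for illustration and are not necessarily the mechanism used in the notebooks.

```python
def build_hybrid_query(text, query_vector, k=10,
                       text_field="title", vector_field="embedding"):
    """Build a search body combining a k-NN clause with a keyword filter.

    text_field and vector_field are hypothetical index field names.
    """
    return {
        "size": k,
        "query": {
            "bool": {
                # Nearest-neighbour search on the embedding field.
                "must": [
                    {"knn": {vector_field: {"vector": list(query_vector), "k": k}}}
                ],
                # Keyword constraint on a text field.
                "filter": [{"match": {text_field: text}}],
            }
        },
    }


query_body = build_hybrid_query("running shoes", [0.1, 0.2, 0.3, 0.4], k=5)
```

Depending on the plugin version, the filter may be applied after the nearest neighbours have been retrieved, so a query like this can return fewer than `k` hits.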

Folder structure

.
├── src
├──── search_utils #helper functions
├──── preprocessing_main.py #Entry point for the initial processing job
├──── glove_embeddings_main.py #Entry point for the second processing job (GloVe related)
├──── requirements.txt #Python libraries
├── notebooks #Step by step notebooks
├── build_and_push.sh #Script used to build the Docker image and push it to Amazon Elastic Container Registry
├── Dockerfile
├── elasticsearch.yml #A CloudFormation template used to create the Amazon Elasticsearch Service cluster
├── THIRD-PARTY-LICENSES.txt
├── LICENSE
└── README.md

How to

  1. Create an Amazon S3 bucket. It is used throughout the notebooks to store files generated by the examples. By default, the notebooks use the default SageMaker bucket in your AWS account.

  2. [Optional] Deploy the Elasticsearch cluster. You can skip this step if you only want to run the first three of the four notebooks.

    2.1. Deploy the CloudFormation stack using the "elasticsearch.yml" template at the root level of this repository. This template deploys an Elasticsearch cluster using Amazon Elasticsearch Service. Note that this cluster setup is for demo purposes only; for production workloads, refer to the documentation and follow security best practices.

    2.2. If you want a more customised setup, also refer to the following documentation: https://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-createupdatedomains.html

  3. Create a SageMaker notebook instance. Please observe the following:

    3.1. In addition to the managed policy "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess", the execution role must be granted permission to read from and write to the S3 bucket created in step 1.

    3.2. If you put the notebook instance inside a Virtual Private Cloud (VPC), make sure that the VPC allows access to the public PyPI repository and aws-samples/ repositories.

    3.3. Attach the current Git repository to the Amazon SageMaker notebook instance.

    3.4. Set the "Volume Size" to at least 20 GB and the instance type to "ml.t2.xlarge".

  4. Once the notebook is running, you can get started from the "notebooks" folder:

    4.1. "01_Create_dataset.ipynb": In this notebook we will download the necessary dataset and upload the clean data back to the Amazon S3 bucket.

    4.2. "02_Launch_training.ipynb": In this notebook we will process the clean data and train the Object2Vec algorithm using Amazon SageMaker.

    4.3. "03_Inference.ipynb": In this notebook we will deploy an Amazon SageMaker endpoint and perform predictions using the model trained in the previous step.

    4.4. "04_Load_and_query_elasticsearch.ipynb": In this notebook we will populate the Elasticsearch index and perform real-time queries.

  5. Once you're done, make sure you remove any resources you no longer need (Amazon SageMaker endpoints, Amazon Elasticsearch Service clusters, etc.).
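For step 3.1, the additional S3 permission could look roughly like the policy document built below. This is a minimal sketch under stated assumptions: the function name `build_s3_rw_policy` and the bucket name are hypothetical, and the set of actions is a common minimum for reading and writing objects rather than a policy taken from this repository.

```python
import json


def build_s3_rw_policy(bucket_name):
    """Return an IAM policy document granting read/write on one bucket.

    bucket_name is a placeholder for the bucket created in step 1.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket_name}",
            },
            {
                # Reading and writing apply to the objects in the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            },
        ],
    }


policy = build_s3_rw_policy("my-example-bucket")  # hypothetical bucket name
policy_json = json.dumps(policy, indent=2)
print(policy_json)
```

The printed JSON can be attached to the notebook's execution role as an inline policy, after replacing the bucket name with your own.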

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
