Update headings and punctuation in sycamore page (opensearch-project#8301)

* Update sycamore.md

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

* Update _tools/sycamore.md

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>

---------

Signed-off-by: kolchfa-aws <105444904+kolchfa-aws@users.noreply.github.com>
Signed-off-by: Noah Staveley <noah.staveley@intel.com>
kolchfa-aws authored and noahstaveley committed Sep 23, 2024
1 parent 18b4233 commit 66354a4
Showing 1 changed file with 4 additions and 4 deletions.
8 changes: 4 additions & 4 deletions _tools/sycamore.md
@@ -11,23 +11,23 @@ has_children: false

To get started, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).

- # Sycamore ETL pipeline structure
+ ## Sycamore ETL pipeline structure

A Sycamore extract, transform, load (ETL) pipeline applies a series of transformations to a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets), which is a collection of documents and their constituent elements (for example, tables, blocks of text, or headers). At the end of the pipeline, the DocSet is loaded into OpenSearch vector and keyword indexes.

A typical pipeline for preparing unstructured data for vector or hybrid search in OpenSearch consists of the following steps:

* Read documents into a [DocSet](https://sycamore.readthedocs.io/en/stable/sycamore/get_started/concepts.html#docsets).
* [Partition documents](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/partition.html) into structured JSON elements.
- * Extract metadata, filter, and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html).
+ * Extract metadata and filter and clean data using [transforms](https://sycamore.readthedocs.io/en/stable/sycamore/APIs/docset.html).
* Create [chunks](https://sycamore.readthedocs.io/en/stable/sycamore/transforms/merge.html) from groups of elements.
* Embed the chunks using the model of your choice.
* [Load](https://sycamore.readthedocs.io/en/stable/sycamore/connectors/opensearch.html) the embeddings, metadata, and text into OpenSearch vector and keyword indexes.

For an example pipeline that uses this workflow, see [this notebook](https://github.com/aryn-ai/sycamore/blob/main/notebooks/opensearch_docs_etl.ipynb).
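The staged flow described above (partition, transform, chunk, embed, load) can be sketched in plain Python. Note that this is a conceptual illustration only, not the Sycamore API: every function below is a hypothetical stand-in, and the hash-based `embed` is a toy substitute for a real embedding model.

```python
import hashlib

# Hypothetical stand-ins for the pipeline stages described above.
# Plain Python illustrating the data flow, not the Sycamore API.

def partition(document: str) -> list[dict]:
    """Split a raw document into structured elements (here: paragraphs)."""
    return [{"type": "text", "text": p.strip()}
            for p in document.split("\n\n") if p.strip()]

def clean(elements: list[dict]) -> list[dict]:
    """Filter and clean elements (drop very short fragments)."""
    return [e for e in elements if len(e["text"]) > 10]

def merge_chunks(elements: list[dict], max_len: int = 200) -> list[str]:
    """Greedily merge consecutive elements into chunks under max_len characters."""
    chunks, current = [], ""
    for e in elements:
        if current and len(current) + len(e["text"]) + 1 > max_len:
            chunks.append(current)
            current = ""
        current = (current + " " + e["text"]).strip()
    if current:
        chunks.append(current)
    return chunks

def embed(chunk: str, dims: int = 4) -> list[float]:
    """Toy embedding: hash bytes scaled to [0, 1] (stand-in for a real model)."""
    digest = hashlib.sha256(chunk.encode()).digest()
    return [b / 255 for b in digest[:dims]]

def load(chunks: list[str]) -> list[dict]:
    """Build the documents that would be bulk-indexed into OpenSearch."""
    return [{"text": c, "embedding": embed(c)} for c in chunks]

doc = "Sycamore builds ETL pipelines.\n\nIt partitions documents into elements.\n\nok"
index_docs = load(merge_chunks(clean(partition(doc))))
print(len(index_docs), sorted(index_docs[0]))  # one chunk with text + embedding fields
```

In the real library, each stage would instead be a DocSet transform, and the final step would write to OpenSearch vector and keyword indexes.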


- # Install Sycamore
+ ## Install Sycamore

We recommend installing the Sycamore library using `pip`. The connector for OpenSearch can be specified and installed using extras. For example:
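As a setup sketch, the extras syntax can be passed to `pip` as shown below; quoting the requirement is an assumption-free precaution for shells such as zsh that otherwise try to glob the square brackets.

```shell
# Install Sycamore with the OpenSearch connector and local inference support.
# Quotes prevent shells like zsh from expanding the [...] extras as a glob.
pip install 'sycamore-ai[opensearch,local-inference]'
```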

@@ -45,4 +45,4 @@ pip install sycamore-ai[opensearch,local-inference]

## Next steps

- For more information, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).
+ For more information, visit the [Sycamore documentation](https://sycamore.readthedocs.io/en/stable/sycamore/get_started.html).
