Merge branch 'main' into maint/processor-auto-generation-campaign
vagimeli committed Jul 17, 2024
2 parents 29fc181 + e3ee238 commit ae6ec35
Showing 10 changed files with 64 additions and 38 deletions.
5 changes: 4 additions & 1 deletion .github/workflows/pr_checklist.yml
@@ -1,9 +1,12 @@
name: PR Checklist

on:
-  pull_request:
+  pull_request_target:
    types: [opened]

+permissions:
+  pull-requests: write

jobs:
  add-checklist:
    runs-on: ubuntu-latest
1 change: 1 addition & 0 deletions .gitignore
@@ -4,4 +4,5 @@ _site
.DS_Store
Gemfile.lock
.idea
+*.iml
.jekyll-cache
1 change: 1 addition & 0 deletions _about/version-history.md
@@ -30,6 +30,7 @@ OpenSearch version | Release highlights | Release date
[2.0.1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.1.md) | Includes bug fixes and maintenance updates for Alerting and Anomaly Detection. | 16 June 2022
[2.0.0](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0.md) | Includes document-level monitors for alerting, OpenSearch Notifications plugins, and Geo Map Tiles in OpenSearch Dashboards. Also adds support for Lucene 9 and bug fixes for all OpenSearch plugins. For a full list of release highlights, see the Release Notes. | 26 May 2022
[2.0.0-rc1](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-2.0.0-rc1.md) | The Release Candidate for 2.0.0. This version allows you to preview the upcoming 2.0.0 release before the GA release. The preview release adds document-level alerting, support for Lucene 9, and the ability to use term lookup queries in document level security. | 03 May 2022
+[1.3.18](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.18.md) | Includes maintenance updates for OpenSearch security. | 16 July 2024
[1.3.17](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.17.md) | Includes maintenance updates for OpenSearch security and OpenSearch Dashboards security. | 06 June 2024
[1.3.16](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.16.md) | Includes bug fixes and maintenance updates for OpenSearch security, index management, performance analyzer, and reporting. | 23 April 2024
[1.3.15](https://github.com/opensearch-project/opensearch-build/blob/main/release-notes/opensearch-release-notes-1.3.15.md) | Includes bug fixes and maintenance updates for cross-cluster replication, SQL, OpenSearch Dashboards reporting, and alerting. | 05 March 2024
2 changes: 1 addition & 1 deletion _api-reference/nodes-apis/nodes-stats.md
@@ -44,7 +44,7 @@ thread_pool | Statistics about each thread pool for the node.
fs | File system statistics, such as read/write statistics, data path, and free disk space.
transport | Transport layer statistics about send/receive in cluster communication.
http | Statistics about the HTTP layer.
-breaker | Statistics about the field data circuit breakers.
+breakers | Statistics about the field data circuit breakers.
script | Statistics about scripts, such as compilations and cache evictions.
discovery | Statistics about cluster states.
ingest | Statistics about ingest pipelines.
19 changes: 17 additions & 2 deletions _automating-configurations/api/create-workflow.md
@@ -20,9 +20,9 @@ You can include placeholder expressions in the value of workflow step fields. Fo

Once a workflow is created, provide its `workflow_id` to other APIs.

-The `POST` method creates a new workflow. The `PUT` method updates an existing workflow.
+The `POST` method creates a new workflow. The `PUT` method updates an existing workflow. You can specify the `update_fields` parameter to update specific fields.

-You can only update a workflow if it has not yet been provisioned.
+You can only update a complete workflow if it has not yet been provisioned.
{: .note}

## Path and HTTP methods
@@ -58,11 +58,26 @@ POST /_plugins/_flow_framework/workflow?validation=none
```
{% include copy-curl.html %}

+You cannot update a full workflow once it has been provisioned, but you can update fields other than the `workflows` field, such as `name` and `description`:

+```json
+PUT /_plugins/_flow_framework/workflow/<workflow_id>?update_fields=true
+{
+  "name": "new-template-name",
+  "description": "A new description for the existing template"
+}
+```
+{% include copy-curl.html %}

+You cannot specify both the `provision` and `update_fields` parameters at the same time.
+{: .note}

The following table lists the available query parameters. All query parameters are optional. User-provided parameters are only allowed if the `provision` parameter is set to `true`.

| Parameter | Data type | Description |
| :--- | :--- | :--- |
| `provision` | Boolean | Whether to provision the workflow as part of the request. Default is `false`. |
+| `update_fields` | Boolean | Whether to update only the fields included in the request body. Default is `false`. |
| `validation` | String | Whether to validate the workflow. Valid values are `all` (validate the template) and `none` (do not validate the template). Default is `all`. |
| User-provided substitution expressions | String | Parameters matching substitution expressions in the template. Only allowed if `provision` is set to `true`. Optional. If `provision` is set to `false`, you can pass these parameters in the [Provision Workflow API query parameters]({{site.url}}{{site.baseurl}}/automating-configurations/api/provision-workflow/#query-parameters). |
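
Because a full template update is only allowed before provisioning, it can help to confirm the workflow's state before sending a `PUT` request. The following is a minimal sketch that uses the workflow status endpoint; the workflow ID is a placeholder:

```json
GET /_plugins/_flow_framework/workflow/<workflow_id>/_status
```

The response reports the workflow's provisioning state, which you can check before attempting to update the full template.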

25 changes: 12 additions & 13 deletions _ingest-pipelines/processors/split.md
@@ -26,19 +26,18 @@ The following is the syntax for the `split` processor:

The following table lists the required and optional parameters for the `split` processor.

-Parameter | Required/Optional | Description |
-|-----------|-----------|-----------|
-`field` | Required | The field containing the string to be split.
-`separator` | Required | The delimiter used to split the string. This can be a regular expression pattern.
-`preserve_field` | Optional | If set to `true`, preserves empty trailing fields (for example, `''`) in the resulting array. If set to `false`, empty trailing fields are removed from the resulting array. Default is `false`.
-`target_field` | Optional | The field where the array of substrings is stored. If not specified, then the field is updated in-place.
-`ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified
-field. If set to `true`, then the processor ignores missing values in the field and leaves the `target_field` unchanged. Default is `false`.
-`description` | Optional | A brief description of the processor.
-`if` | Optional | A condition for running the processor.
-`ignore_failure` | Optional | Specifies whether the processor continues execution even if it encounters an error. If set to `true`, then failures are ignored. Default is `false`.
-`on_failure` | Optional | A list of processors to run if the processor fails.
-`tag` | Optional | An identifier tag for the processor. Useful for debugging in order to distinguish between processors of the same type.
+Parameter | Required/Optional | Description
+:--- | :--- | :---
+`field` | Required | The field containing the string to be split.
+`separator` | Required | The delimiter used to split the string. This can be a regular expression pattern.
+`preserve_field` | Optional | If set to `true`, preserves empty trailing fields (for example, `''`) in the resulting array. If set to `false`, empty trailing fields are removed from the resulting array. Default is `false`.
+`target_field` | Optional | The field where the array of substrings is stored. If not specified, then the field is updated in-place.
+`ignore_missing` | Optional | Specifies whether the processor should ignore documents that do not contain the specified field. If set to `true`, then the processor ignores missing values in the field and leaves the `target_field` unchanged. Default is `false`.
+`description` | Optional | A brief description of the processor.
+`if` | Optional | A condition for running the processor.
+`ignore_failure` | Optional | Specifies whether the processor continues execution even if it encounters an error. If set to `true`, then failures are ignored. Default is `false`.
+`on_failure` | Optional | A list of processors to run if the processor fails.
+`tag` | Optional | An identifier tag for the processor. Useful for debugging in order to distinguish between processors of the same type.
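
To show how these parameters fit together, the following is a minimal sketch of an ingest pipeline that uses the `split` processor. The pipeline name and field names are hypothetical:

```json
PUT _ingest/pipeline/split_tags_pipeline
{
  "description": "Splits a comma-separated tags string into an array",
  "processors": [
    {
      "split": {
        "field": "tags",
        "separator": ",",
        "target_field": "tag_list",
        "ignore_missing": true
      }
    }
  ]
}
```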

## Using the processor

6 changes: 1 addition & 5 deletions _query-dsl/compound/hybrid.md
@@ -12,11 +12,7 @@ You can use a hybrid query to combine relevance scores from multiple queries int

## Example

-Before using a `hybrid` query, you must set up a machine learning (ML) model, ingest documents, and configure a search pipeline with a [`normalization-processor`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/).

-To learn how to set up an ML model, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).

-Once you set up an ML model, learn how to use the `hybrid` query by following the steps in [Using hybrid search]({{site.url}}{{site.baseurl}}/search-plugins/hybrid-search/#using-hybrid-search).
+Learn how to use the `hybrid` query by following the steps in [Using hybrid search]({{site.url}}{{site.baseurl}}/search-plugins/hybrid-search/#using-hybrid-search).

For a comprehensive example, follow the [Neural search tutorial]({{site.url}}{{site.baseurl}}/ml-commons-plugin/semantic-search#tutorial).
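
As a rough illustration of the query shape, the following is a minimal sketch of a `hybrid` query that combines a keyword clause with a neural clause. The index name, search pipeline name, field names, and model ID are placeholders:

```json
GET /my-nlp-index/_search?search_pipeline=nlp-search-pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "match": {
            "text": {
              "query": "wild west"
            }
          }
        },
        {
          "neural": {
            "passage_embedding": {
              "query_text": "wild west",
              "model_id": "<model_id>",
              "k": 5
            }
          }
        }
      ]
    }
  }
}
```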

2 changes: 1 addition & 1 deletion _search-plugins/hybrid-search.md
@@ -12,7 +12,7 @@ Introduced 2.11
Hybrid search combines keyword and neural search to improve search relevance. To implement hybrid search, you need to set up a [search pipeline]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/index/) that runs at search time. The search pipeline you'll configure intercepts search results at an intermediate stage and applies the [`normalization_processor`]({{site.url}}{{site.baseurl}}/search-plugins/search-pipelines/normalization-processor/) to them. The `normalization_processor` normalizes and combines the document scores from multiple query clauses, rescoring the documents according to the chosen normalization and combination techniques.

**PREREQUISITE**<br>
-Before using hybrid search, you must set up a text embedding model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).
+To follow this example, you must set up a text embedding model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model). If you have already generated text embeddings, ingest the embeddings into an index and skip to [Step 4](#step-4-configure-a-search-pipeline).
{: .note}
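
The search pipeline configured in Step 4 is built around the `normalization-processor`. A minimal sketch of such a pipeline definition, with an illustrative pipeline name and technique choices, looks roughly like the following:

```json
PUT /_search/pipeline/nlp-search-pipeline
{
  "description": "Post-processor for hybrid search",
  "phase_results_processors": [
    {
      "normalization-processor": {
        "normalization": {
          "technique": "min_max"
        },
        "combination": {
          "technique": "arithmetic_mean"
        }
      }
    }
  ]
}
```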

## Using hybrid search
37 changes: 24 additions & 13 deletions _search-plugins/knn/settings.md
@@ -12,17 +12,28 @@ The k-NN plugin adds several new cluster settings. To learn more about static an

## Cluster settings

+The following table lists all available cluster-level k-NN settings. For more information about cluster settings, see [Configuring OpenSearch]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/#updating-cluster-settings-using-the-api) and [Updating cluster settings using the API]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index/#updating-cluster-settings-using-the-api).

+Setting | Static/Dynamic | Default | Description
+:--- | :--- | :--- | :---
+`knn.plugin.enabled`| Dynamic | `true` | Enables or disables the k-NN plugin.
+`knn.algo_param.index_thread_qty` | Dynamic | `1` | The number of threads used for native library index creation. Keeping this value low reduces the CPU impact of the k-NN plugin but also reduces indexing performance.
+`knn.cache.item.expiry.enabled` | Dynamic | `false` | Whether to remove native library indexes that have not been accessed for a certain duration from memory.
+`knn.cache.item.expiry.minutes` | Dynamic | `3h` | If enabled, the amount of idle time before a native library index is removed from memory.
+`knn.circuit_breaker.unset.percentage` | Dynamic | `75` | The native memory usage threshold for the circuit breaker. Memory usage must be lower than this percentage of `knn.memory.circuit_breaker.limit` in order for `knn.circuit_breaker.triggered` to remain `false`.
+`knn.circuit_breaker.triggered` | Dynamic | `false` | True when memory usage exceeds the `knn.circuit_breaker.unset.percentage` value.
+`knn.memory.circuit_breaker.limit` | Dynamic | `50%` | The native memory limit for native library indexes. At the default value, if a machine has 100 GB of memory and the JVM uses 32 GB, then the k-NN plugin uses 50% of the remaining 68 GB (34 GB). If memory usage exceeds this value, then the plugin removes the native library indexes used least recently.
+`knn.memory.circuit_breaker.enabled` | Dynamic | `true` | Whether to enable the k-NN memory circuit breaker.
+`knn.model.index.number_of_shards`| Dynamic | `1` | The number of shards to use for the model system index, which is the OpenSearch index that stores the models used for approximate nearest neighbor (ANN) search.
+`knn.model.index.number_of_replicas`| Dynamic | `1` | The number of replica shards to use for the model system index. Generally, in a multi-node cluster, this value should be at least 1 in order to increase stability.
+`knn.model.cache.size.limit` | Dynamic | `10%` | The model cache limit cannot exceed 25% of the JVM heap.
+`knn.faiss.avx2.disabled` | Static | `false` | A static setting that specifies whether to disable the SIMD-based `libopensearchknn_faiss_avx2.so` library and load the non-optimized `libopensearchknn_faiss.so` library for the Faiss engine on machines with x64 architecture. For more information, see [SIMD optimization for the Faiss engine]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#simd-optimization-for-the-faiss-engine).
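
Dynamic settings in this table can be changed at runtime through the cluster settings API. The following is a brief sketch; the setting value shown is only an example:

```json
PUT /_cluster/settings
{
  "persistent": {
    "knn.algo_param.index_thread_qty": 2
  }
}
```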

+## Index settings

+The following table lists all available index-level k-NN settings. All settings are static. For information about updating static index-level settings, see [Updating a static index setting]({{site.url}}{{site.baseurl}}/install-and-configure/configuring-opensearch/index-settings/#updating-a-static-index-setting).

Setting | Default | Description
-:--- | :--- | :---
-`knn.algo_param.index_thread_qty` | 1 | The number of threads used for native library index creation. Keeping this value low reduces the CPU impact of the k-NN plugin, but also reduces indexing performance.
-`knn.cache.item.expiry.enabled` | false | Whether to remove native library indexes that have not been accessed for a certain duration from memory.
-`knn.cache.item.expiry.minutes` | 3h | If enabled, the idle time before removing a native library index from memory.
-`knn.circuit_breaker.unset.percentage` | 75% | The native memory usage threshold for the circuit breaker. Memory usage must be below this percentage of `knn.memory.circuit_breaker.limit` for `knn.circuit_breaker.triggered` to remain false.
-`knn.circuit_breaker.triggered` | false | True when memory usage exceeds the `knn.circuit_breaker.unset.percentage` value.
-`knn.memory.circuit_breaker.limit` | 50% | The native memory limit for native library indexes. At the default value, if a machine has 100 GB of memory and the JVM uses 32 GB, the k-NN plugin uses 50% of the remaining 68 GB (34 GB). If memory usage exceeds this value, k-NN removes the least recently used native library indexes.
-`knn.memory.circuit_breaker.enabled` | true | Whether to enable the k-NN memory circuit breaker.
-`knn.plugin.enabled`| true | Enables or disables the k-NN plugin.
-`knn.model.index.number_of_shards`| 1 | The number of shards to use for the model system index, the OpenSearch index that stores the models used for Approximate Nearest Neighbor (ANN) search.
-`knn.model.index.number_of_replicas`| 1 | The number of replica shards to use for the model system index. Generally, in a multi-node cluster, this should be at least 1 to increase stability.
-`knn.advanced.filtered_exact_search_threshold`| null | The threshold value for the filtered IDs that is used to switch to exact search during filtered ANN search. If the number of filtered IDs in a segment is less than this setting's value, exact search will be performed on the filtered IDs.
-`knn.faiss.avx2.disabled` | False | A static setting that specifies whether to disable the SIMD-based `libopensearchknn_faiss_avx2.so` library and load the non-optimized `libopensearchknn_faiss.so` library for the Faiss engine on machines with x64 architecture. For more information, see [SIMD optimization for the Faiss engine]({{site.url}}{{site.baseurl}}/search-plugins/knn/knn-index/#simd-optimization-for-the-faiss-engine).
+:--- | :--- | :---
+`index.knn.advanced.filtered_exact_search_threshold`| `null` | The filtered ID threshold value used to switch to exact search during filtered ANN search. If the number of filtered IDs in a segment is lower than this setting's value, then exact search will be performed on the filtered IDs.
+`index.knn.algo_param.ef_search` | `100` | `ef` (or `efSearch`) represents the size of the dynamic list for the nearest neighbors used during a search. Higher `ef` values lead to a more accurate but slower search. `ef` cannot be set to a value lower than the number of queried nearest neighbors, `k`. `ef` can take any value between `k` and the size of the dataset.
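
As an illustration, static index-level settings such as these are typically supplied when the index is created. The following is a minimal sketch; the index name is a placeholder, and `index.knn` is assumed to be enabled for the index:

```json
PUT /my-knn-index
{
  "settings": {
    "index.knn": true,
    "index.knn.algo_param.ef_search": 100
  }
}
```
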
4 changes: 2 additions & 2 deletions _search-plugins/neural-sparse-search.md
@@ -16,8 +16,8 @@ Introduced 2.11

When selecting a model, choose one of the following options:

-- Use a sparse encoding model at both ingestion time and search time (high performance, relatively high latency).
-- Use a sparse encoding model at ingestion time and a tokenizer at search time for relatively low performance and low latency. The tokenism doesn't conduct model inference, so you can deploy and invoke a tokenizer using the ML Commons Model API for a more consistent experience.
+- Use a sparse encoding model at both ingestion time and search time for better search relevance at the expense of relatively high latency.
+- Use a sparse encoding model at ingestion time and a tokenizer at search time for lower search latency at the expense of relatively lower search relevance. Tokenization doesn't involve model inference, so you can deploy and invoke a tokenizer using the ML Commons Model API for a more streamlined experience.
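
With either option, the search request itself uses the `neural_sparse` query. The following is a minimal sketch, assuming an index whose `passage_embedding` field was populated by a sparse encoding ingest pipeline; the index name, field name, and model (or tokenizer) ID are placeholders:

```json
GET /my-nlp-index/_search
{
  "query": {
    "neural_sparse": {
      "passage_embedding": {
        "query_text": "wild west",
        "model_id": "<model_or_tokenizer_id>"
      }
    }
  }
}
```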

**PREREQUISITE**<br>
Before using neural sparse search, make sure to set up a [pretrained sparse embedding model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/pretrained-models/#sparse-encoding-models) or your own sparse embedding model. For more information, see [Choosing a model]({{site.url}}{{site.baseurl}}/ml-commons-plugin/integrating-ml-models/#choosing-a-model).
