Adds adaptive allocations feature description to conceptual docs (#2763…

…) (#2764) Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
elastic · Aug 1, 2024 · 8d40df4
1 parent 53ff4d6
commit 8d40df4

Showing 4 changed files with 46 additions and 2 deletions.
diff --git a/docs/en/stack/ml/nlp/ml-nlp-deploy-models.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-deploy-models.asciidoc
@@ -186,7 +186,10 @@ NOTE: Since eland uses APIs to deploy the models, you cannot see the models in
 When you deploy the model, its allocations are distributed across available {ml} 
 nodes. Model allocations are independent units of work for NLP tasks. To 
 influence model performance, you can configure the number of allocations and the 
-number of threads used by each allocation of your deployment.
+number of threads used by each allocation of your deployment. Alternatively, you
+can enable <<nlp-model-adaptive-allocations>> to automatically create and remove
+model allocations based on the current workload of the model (you still need to 
+manually set the number of threads).
 
 IMPORTANT: If your deployed trained model has only one allocation, it's likely 
 that you will experience downtime in the service your trained model performs. 
@@ -211,7 +214,16 @@ You can view the allocation status in {kib} or by using the
 {ref}/get-trained-models-stats.html[get trained model stats API]. If you want to
 change the number of allocations, you can use the
 {ref}/update-trained-model-deployment.html[update trained model stats API] after
-the allocation status is `started`.
+the allocation status is `started`. You can also enable
+<<nlp-model-adaptive-allocations>> to automatically create and remove model
+allocations based on the current workload of the model.
+
+[discrete]
+[[nlp-model-adaptive-allocations]]
+=== Adaptive allocations
+
+include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
+
 
 [discrete]
 [[infer-request-queues]]

diff --git a/docs/en/stack/ml/nlp/ml-nlp-e5.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-e5.asciidoc
@@ -269,6 +269,12 @@ Once it's uploaded to {es}, the model will have the ID specified by
 underscores `__`.
 --
 
+[discrete]
+[[e5-adaptive-allocations]]
+== Adaptive allocations
+
+include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
+
 
 [discrete]
 [[terms-of-use-e5]]

diff --git a/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -433,6 +433,13 @@ per document to ingest.
 To learn more about ELSER performance, refer to the <<elser-benchmarks>>.
 
 
+[discrete]
+[[elser-adaptive-allocations]]
+== Adaptive allocations
+
+include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]
+
+
 [discrete]
 [[further-readings]]
 == Further reading

diff --git a/docs/en/stack/ml/nlp/ml-nlp-shared.asciidoc b/docs/en/stack/ml/nlp/ml-nlp-shared.asciidoc
@@ -1,3 +1,22 @@
+tag::ml-nlp-adaptive-allocations[]
+The number of threads and the number of allocations that you set manually for a model stay constant, even when the available resources are not fully used or when the load on the model requires more resources.
+Instead of setting the number of allocations manually, you can enable adaptive allocations to set the number of allocations based on the load on the model. This can help you manage performance and cost more easily.
+When adaptive allocations are enabled, the number of allocations of the model is adjusted automatically as the load changes.
+When the load is high, a new model allocation is automatically created.
+When the load is low, a model allocation is automatically removed.
+
+You can enable adaptive allocations by using:
+
+* the create inference endpoint API for {ref}/infer-service-elser.html[ELSER] or for {ref}/infer-service-elasticsearch.html[E5 and models uploaded through Eland], when these models are used as {infer} services.
+* the {ref}/start-trained-model-deployment.html[start trained model deployment] or {ref}/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on {ml} nodes.
+
+If the new allocations fit on the current {ml} nodes, they are immediately started.
+If creating new model allocations requires more resource capacity and {ml} autoscaling is enabled, your {ml} node is scaled up to provide enough resources for the new allocation.
+The number of model allocations cannot be scaled down to less than 1.
+It also cannot be scaled up to more than 32 allocations unless you explicitly set a higher maximum number of allocations.
+Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
+end::ml-nlp-adaptive-allocations[]
+
 tag::nlp-eland-clone-docker-build[]
 You can use the {eland-docs}[Eland client] to install the {nlp} model. Use the prebuilt  
 Docker image to run the Eland install model commands. Pull the latest image with:
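
As a rough illustration of the adaptive allocations text added in `ml-nlp-shared.asciidoc`, the Python sketch below models the stated bounds (never fewer than 1 allocation, never more than 32 unless a higher maximum is set explicitly) and builds a request body that a deployment call might send. The helper names `clamp_allocations` and `adaptive_allocations_body` are hypothetical, and the field names in the body are assumptions that should be checked against the start trained model deployment API reference linked in the diff. The number of threads is still passed manually, as the added text notes.

```python
from typing import Any, Dict, Optional

# Bounds taken from the added documentation text.
MIN_ALLOCATIONS = 1
DEFAULT_MAX_ALLOCATIONS = 32


def clamp_allocations(wanted: int, max_allocations: Optional[int] = None) -> int:
    """Return the allocation count permitted for a load-driven target.

    `wanted` stands in for whatever count the current load calls for; how
    that target is derived from the load is not described in the source.
    """
    upper = DEFAULT_MAX_ALLOCATIONS if max_allocations is None else max_allocations
    if upper < MIN_ALLOCATIONS:
        raise ValueError("max_allocations must be at least 1")
    return max(MIN_ALLOCATIONS, min(wanted, upper))


def adaptive_allocations_body(
    threads_per_allocation: int,
    min_allocations: Optional[int] = None,
    max_allocations: Optional[int] = None,
) -> Dict[str, Any]:
    """Build a request body that enables adaptive allocations.

    Field names are assumptions for illustration, not confirmed by the diff.
    """
    settings: Dict[str, Any] = {"enabled": True}
    if min_allocations is not None:
        settings["min_number_of_allocations"] = min_allocations
    if max_allocations is not None:
        settings["max_number_of_allocations"] = max_allocations
    return {
        # Threads are not adapted automatically; they are set by hand.
        "threads_per_allocation": threads_per_allocation,
        "adaptive_allocations": settings,
    }
```

For example, `clamp_allocations(40)` stays at the default ceiling, while `clamp_allocations(40, max_allocations=64)` is allowed to reach 40 because the maximum was raised explicitly.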