Commit
Adds adaptive allocations feature description to conceptual docs (#2763
…) (#2764)

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
mergify[bot] and szabosteve authored Aug 1, 2024
1 parent 53ff4d6 commit 8d40df4
Showing 4 changed files with 46 additions and 2 deletions.
16 changes: 14 additions & 2 deletions docs/en/stack/ml/nlp/ml-nlp-deploy-models.asciidoc
@@ -186,7 +186,10 @@ NOTE: Since eland uses APIs to deploy the models, you cannot see the models in
When you deploy the model, its allocations are distributed across available {ml}
nodes. Model allocations are independent units of work for NLP tasks. To
influence model performance, you can configure the number of allocations and the
number of threads used by each allocation of your deployment.
number of threads used by each allocation of your deployment. Alternatively, you
can enable <<nlp-model-adaptive-allocations>> to automatically create and remove
model allocations based on the current workload of the model (you still need to
manually set the number of threads).

IMPORTANT: If your deployed trained model has only one allocation, it's likely
that you will experience downtime in the service your trained model performs.
@@ -211,7 +214,16 @@ You can view the allocation status in {kib} or by using the
{ref}/get-trained-models-stats.html[get trained model stats API]. If you want to
change the number of allocations, you can use the
{ref}/update-trained-model-deployment.html[update trained model deployment API] after
the allocation status is `started`.
the allocation status is `started`. You can also enable
<<nlp-model-adaptive-allocations>> to automatically create and remove model
allocations based on the current workload of the model.

[discrete]
[[nlp-model-adaptive-allocations]]
=== Adaptive allocations

include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]


[discrete]
[[infer-request-queues]]
6 changes: 6 additions & 0 deletions docs/en/stack/ml/nlp/ml-nlp-e5.asciidoc
@@ -269,6 +269,12 @@ Once it's uploaded to {es}, the model will have the ID specified by
underscores `__`.
--

[discrete]
[[e5-adaptive-allocations]]
== Adaptive allocations

include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]


[discrete]
[[terms-of-use-e5]]
7 changes: 7 additions & 0 deletions docs/en/stack/ml/nlp/ml-nlp-elser.asciidoc
@@ -433,6 +433,13 @@ per document to ingest.
To learn more about ELSER performance, refer to the <<elser-benchmarks>>.


[discrete]
[[elser-adaptive-allocations]]
== Adaptive allocations

include::ml-nlp-shared.asciidoc[tag=ml-nlp-adaptive-allocations]


[discrete]
[[further-readings]]
== Further reading
19 changes: 19 additions & 0 deletions docs/en/stack/ml/nlp/ml-nlp-shared.asciidoc
@@ -1,3 +1,22 @@
tag::ml-nlp-adaptive-allocations[]
The numbers of threads and allocations that you set manually for a model remain constant, even when the available resources are not fully used or when the load on the model requires more resources.
Instead of setting the number of allocations manually, you can enable adaptive allocations, which sets the number of allocations automatically based on the current load on the model. This can help you manage performance and cost more easily.
When the load is high, a new model allocation is automatically created.
When the load is low, a model allocation is automatically removed.

You can enable adaptive allocations by using:

* the Create inference endpoint API for {ref}/infer-service-elser.html[ELSER], {ref}/infer-service-elasticsearch.html[E5 and models uploaded through Eland] that are used as {infer} services.
* the {ref}/start-trained-model-deployment.html[start trained model deployment] or {ref}/update-trained-model-deployment.html[update trained model deployment] APIs for trained models that are deployed on {ml} nodes.

If the new allocations fit on the current {ml} nodes, they are started immediately.
If creating new model allocations requires more resource capacity and {ml} autoscaling is enabled, your {ml} node is scaled up to provide enough resources for the new allocation.
The number of model allocations cannot be scaled down below 1, and it cannot be scaled up above 32 unless you explicitly set a higher maximum number of allocations.
Adaptive allocations must be set up independently for each deployment and {infer} endpoint.
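
As a rough illustration of the second option in the list above, the following Python sketch builds the request path and JSON body for enabling adaptive allocations on an existing deployment through the update trained model deployment API. The `adaptive_allocations` object and its `enabled`, `min_number_of_allocations`, and `max_number_of_allocations` fields are assumed from the API reference rather than taken from this text. The deployment ID `my-elser-deployment` and the helper name are hypothetical. The sketch only prepares the request; sending it is left to the HTTP client of your choice.

```python
import json


def build_adaptive_allocations_update(deployment_id, min_allocations=1, max_allocations=32):
    """Return (method, path, body) for an update trained model deployment request.

    Hypothetical helper: the body field names are assumed from the API
    reference. The defaults mirror the bounds described in the text
    (no fewer than 1 allocation, no more than 32 unless set higher).
    """
    if min_allocations < 1:
        raise ValueError("min_allocations must be at least 1")
    if max_allocations < min_allocations:
        raise ValueError("max_allocations must be >= min_allocations")

    body = {
        "adaptive_allocations": {
            "enabled": True,
            "min_number_of_allocations": min_allocations,
            "max_number_of_allocations": max_allocations,
        }
    }
    path = f"/_ml/trained_models/{deployment_id}/deployment/_update"
    return "POST", path, body


# Example: let the deployment scale between 1 and 8 allocations.
method, path, body = build_adaptive_allocations_update("my-elser-deployment", 1, 8)
print(method, path)
print(json.dumps(body, indent=2))
```

Because adaptive allocations are configured per deployment and per {infer} endpoint, a request like this would need to be repeated for each deployment that should scale automatically.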
end::ml-nlp-adaptive-allocations[]

tag::nlp-eland-clone-docker-build[]
You can use the {eland-docs}[Eland client] to install the {nlp} model. Use the prebuilt
Docker image to run the Eland install model commands. Pull the latest image with:
