
Commit

Adds anomaly detection FAQ items to the Troubleshooting page (#2714) (#2735)

Co-authored-by: István Zoltán Szabó <istvan.szabo@elastic.co>
mergify[bot] and szabosteve authored Jun 17, 2024
1 parent 3e461d8 commit 649dc48
Showing 1 changed file with 172 additions and 4 deletions.
176 changes: 172 additions & 4 deletions docs/en/stack/ml/anomaly-detection/ml-ad-troubleshooting.asciidoc
@@ -1,15 +1,183 @@
[role="xpack"]
[[ml-ad-troubleshooting]]
= Troubleshooting {ml} {anomaly-detect}
= Troubleshooting {ml} {anomaly-detect} and frequently asked questions
++++
<titleabbrev>Troubleshooting</titleabbrev>
<titleabbrev>Troubleshooting and FAQ</titleabbrev>
++++

Use the information in this section to troubleshoot common problems and find
answers to frequently asked questions.


[discrete]
[[ml-ad-restart-failed-jobs]]
== Restart failed {anomaly-jobs}
== How to restart failed {anomaly-jobs}

include::ml-restart-failed-jobs.asciidoc[]

[discrete]
[[faq-methods]]
== What {ml} methods are used for {anomaly-detect}?

For detailed information, refer to the paper https://www.ijmlc.org/papers/398-LC018.pdf[Anomaly Detection in Application Performance Monitoring Data] by Thomas Veasey and Stephen Dodson, as well as our webinars on https://www.elastic.co/elasticon/conf/2018/sf/the-math-behind-elastic-machine-learning[The Math behind Elastic Machine Learning] and
https://www.elastic.co/elasticon/conf/2017/sf/machine-learning-and-statistical-methods-for-time-series-analysis[Machine Learning and Statistical Methods for Time Series Analysis].

Further papers cited in the C++ code:

* http://arxiv.org/pdf/1109.2378.pdf[Modern hierarchical, agglomerative clustering algorithms]
* https://www.cs.umd.edu/~mount/Projects/KMeans/pami02.pdf[An Efficient k-Means Clustering Algorithm: Analysis and Implementation]
* http://www.stat.columbia.edu/~madigan/PAPERS/techno.pdf[Large-Scale Bayesian Logistic Regression for Text Categorization]
* https://www.cs.cmu.edu/~dpelleg/download/xmeans.pdf[X-means: Extending K-means with Efficient Estimation of the Number of Clusters]


[discrete]
[[faq-features]]
== What are the input features used by the model?

All input features are specified by the user, for example, using
https://www.elastic.co/guide/en/machine-learning/current/ml-functions.html[diverse statistical functions]
like count or mean over the data of interest.
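
For illustration only, the following sketch shows how such functions are
declared as detectors when an {anomaly-job} is created. The job name, the
`bytes` field, and the `@timestamp` field are hypothetical placeholders:

[source,console]
----
PUT _ml/anomaly_detectors/example-web-logs
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [
      { "function": "count", "detector_description": "Event rate per bucket" },
      { "function": "mean", "field_name": "bytes", "detector_description": "Mean bytes per bucket" }
    ]
  },
  "data_description": {
    "time_field": "@timestamp"
  }
}
----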


[discrete]
[[faq-data]]
== Does the data used by the model only include customers' data?

Yes. Only the data specified in the {anomaly-job} configuration are used for
detection.


[discrete]
[[faq-output-score]]
== What does the model output score represent? How is it generated and calibrated?

The ensemble model generates a probability value, which is then mapped to an
anomaly severity score between 0 and 100. The lower the probability of observed
data, the higher the severity score. Refer to this
<<ml-ad-explain,advanced concept doc>> for details. Calibration (also called
normalization) happens on two levels:

. Within the same metric/partition, the scores are re-normalized “back in time”
within the window specified by the `renormalization_window_days` parameter.
This is the reason, for example, that both `record_score` and
`initial_record_score` exist.
. Over multiple partitions, scores are renormalized as described in
https://www.elastic.co/blog/changes-to-elastic-machine-learning-anomaly-scoring-in-6-5[this blog post].
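
For example, the following sketch (with a hypothetical job name) retrieves the
highest-scoring records; each record in the response carries both
`record_score` and `initial_record_score`, so you can compare the renormalized
score with the score that was assigned when the bucket was first processed:

[source,console]
----
GET _ml/anomaly_detectors/example-web-logs/results/records
{
  "sort": "record_score",
  "desc": true,
  "record_score": 75
}
----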


[discrete]
[[faq-model-update]]
== Is the model static or updated periodically?

It's an online model and updated continuously. Old parts of the model are pruned
out based on the parameter `model_prune_window` (usually 30 days).
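
As a sketch, assuming your version allows updating this setting on an existing
job (it can also be set when the job is created), the pruning window of a
hypothetical job could be adjusted like this:

[source,console]
----
POST _ml/anomaly_detectors/example-web-logs/_update
{
  "model_prune_window": "30d"
}
----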


[discrete]
[[faq-model-performance]]
== Is the performance of the model monitored?

There is a set of benchmarks to monitor the performance of the {anomaly-detect}
algorithms and to ensure no regression occurs as the methods are continuously
developed and refined. These benchmarks are called "data scenarios" and consist
of three components:

* a dataset (stored as an {es} snapshot),
* a {ml} config ({anomaly-detect}, {dfanalysis}, {transform}, or {infer}),
* an arbitrary set of static assertions (bucket counts, anomaly scores, accuracy
values, and so on).

Performance metrics are collected from every scenario run and persisted in an
Elastic Cloud cluster. This information is used to track performance over time
and across builds, mainly to detect regressions in result quality and compute
time.

On the customer side, the situation is different: there is no conventional way
to monitor the model performance because the model is unsupervised. Usually,
operationalizing the model output includes one or several of the following
steps:

* Creating alerts for influencers, buckets, or records based on a certain
anomaly score.
* Using the forecasting feature to predict the development of the metric of
interest in the future (see the example after this list).
* Using one or a combination of multiple {anomaly-jobs} to identify the
significant anomaly influencers.
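
For the forecasting step, a minimal sketch (with a hypothetical job name) calls
the forecast API for a chosen duration:

[source,console]
----
POST _ml/anomaly_detectors/example-web-logs/_forecast
{
  "duration": "7d",
  "expires_in": "14d"
}
----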


[discrete]
[[faq-model-accuracy]]
== How to measure the accuracy of the unsupervised {ml} model?

For each record in a given time series, {anomaly-detect} models provide an
anomaly severity score, 95% confidence intervals, and an actual value. This data
is stored in an index and can be retrieved using the get records API. With this
information, you can use standard measures to assess prediction accuracy,
interval calibration, and so on. {es} aggregations can be used to compute these
statistics.
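
As a rough sketch, assuming the shared results index, a hypothetical job name,
and single-valued `actual` and `typical` fields, a runtime field can turn the
stored predictions into an error metric such as the mean absolute error:

[source,console]
----
GET .ml-anomalies-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "job_id": "example-web-logs" } },
        { "term": { "result_type": "record" } }
      ]
    }
  },
  "runtime_mappings": {
    "abs_error": {
      "type": "double",
      "script": "if (doc['actual'].size() > 0 && doc['typical'].size() > 0) { emit(Math.abs(doc['actual'].value - doc['typical'].value)) }"
    }
  },
  "aggs": {
    "mean_absolute_error": { "avg": { "field": "abs_error" } }
  }
}
----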

The purpose of {anomaly-detect} is to achieve the best ranking of periods where
an anomaly happened. A practical way to evaluate this is to keep track of real
incidents and see how well they correlate with the predictions of
{anomaly-detect}.


[discrete]
[[faq-model-drift]]
== Can the {anomaly-detect} model experience model drift?

Elasticsearch's {anomaly-detect} model continuously learns and adapts to changes
in the time series. These changes can take the form of slow drifts as well as
sudden jumps. Therefore, we take great care to manage the adaptation to changing
data characteristics. There is always a fine trade-off between fitting anomalous
periods (over-fitting) and not learning new normal behavior. The following are
the main approaches Elastic uses to manage this trade-off:

* Learning the optimal decay rate based on measuring the bias in the forecast
and the moments of the error distribution.
* Allowing continuous small drifts in periodic patterns. This is achieved by
continuously minimizing the mean prediction error over the last iteration with
respect to a small bounded time shift.
* Testing whether the time series has undergone a sudden change when the
predictions are significantly wrong over a long period of time. Hypothesis
testing is used to detect different types of changes, such as scaling of
values, shifting of values, and large time shifts in periodic patterns such as
daylight saving time.
* Running continuous hypothesis tests on time windows of various lengths to test
for significant evidence of new or changed periodic patterns, and updating the
model if the null hypothesis of unchanged features is rejected.
* Accumulating error statistics on calendar days and continuously testing whether
predictive calendar features need to be added to or removed from the model.


[discrete]
[[faq-minimum-data]]
== What is the minimum amount of data for an {anomaly-job}?

Elastic {ml} needs a minimum amount of data to be able to build an effective
model for {anomaly-detect}.

* For sampled metrics such as `mean`, `min`, `max`,
and `median`, the minimum data amount is either eight non-empty bucket spans or
two hours, whichever is greater.
* For all other non-zero/null metrics and count-based quantities, it's four
non-empty bucket spans or two hours, whichever is greater.
* For the `count` and `sum` functions, empty buckets matter and therefore the
requirement is the same as for sampled metrics: eight bucket spans or two hours,
whichever is greater.
* For the `rare` function, it's typically around 20 bucket spans. It can be faster
for population models, but that depends on how many entities in the population
interact per bucket.
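
To make the arithmetic concrete: with a 10-minute bucket span, a `mean` detector
needs at least two hours of data (twelve buckets), because eight non-empty
buckets would cover only 80 minutes; with a 30-minute bucket span, eight
non-empty buckets (four hours) are needed, because that already exceeds the
two-hour floor.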

Rules of thumb:

* more than three weeks for periodic data or a few hundred buckets for
non-periodic data
* at least as much data as you want to forecast


[discrete]
[[faq-data-integrity]]
== Are there any checks or processes to ensure data integrity?

include::ml-restart-failed-jobs.asciidoc[]
The Elastic {ml} algorithms are programmed to work with missing and noisy data
and use denoising and data imputation techniques based on the learned
statistical properties.
