
[ML] Implement new rules design #31110

Conversation

@dimitris-athanasiou (Contributor) commented Jun 5, 2018

Rules allow users to supply a detector with domain
knowledge that can improve the quality of the results.
The model detects statistically anomalous results but it
has no knowledge of the meaning of the values being modelled.

For example, a detector that performs a population analysis
over IP addresses could benefit from a list of IP addresses
that the user knows to be safe. Then anomalous results for
those IP addresses will not be created and will not affect
the quantiles either.

Another example would be a detector looking for anomalies
in the median value of CPU utilization. A user might want
to inform the detector that any results where the actual
value is less than 5 are not interesting.

This commit introduces a `custom_rules` field to the `Detector`.
A detector may have multiple rules, which are combined with `or`.

A rule has 3 fields: `actions`, `scope` and `conditions`.

Actions is a list of what should happen when the rule applies.
The current options include `skip_result` and `skip_model_update`.
The default value for `actions` is the `skip_result` action.

Scope is optional and allows applying filters on any of the
partition/over/by fields. When not defined, the rule applies to
all series. The `filter_id` needs to be specified and must match the id
of the filter to be used. Optionally, `filter_type` can be specified
as either `include` (default) or `exclude`. When set to `include`,
the rule applies to entities that are in the filter. When set to
`exclude`, the rule only applies to entities not in the filter.

There may be zero or more conditions. A condition requires `applies_to`,
`operator` and `value` to be specified. The `applies_to` value can be
`actual`, `typical` or `diff_from_typical`, and it specifies
the numerical value to which the condition applies. The `operator`
(`lt`, `lte`, `gt`, `gte`) and `value` complete the definition.
Conditions are combined with `and` and allow specifying numerical
criteria for when a rule applies.

A rule must have either a scope or one or more conditions. Finally,
a rule with both scope and conditions applies only when all of them apply.
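
To make the combination semantics concrete, here is a small, self-contained sketch in plain Java. The class names, the `main` example and the interpretation of `diff_from_typical` (assumed here to be the absolute difference between actual and typical) are illustrative only; this is not the code added by this PR.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.DoublePredicate;

/**
 * Illustrative sketch of the combination semantics described above; these are
 * NOT the classes introduced by this PR. Rules are OR'ed together, conditions
 * within a rule are AND'ed, and a rule with both scope and conditions applies
 * only when all of them match.
 */
public class RulesSemanticsSketch {

    enum FilterType { INCLUDE, EXCLUDE }

    /** A filter reference: the scoped field's value must (or must not) be in the filter. */
    record FilterRef(Set<String> items, FilterType filterType) {
        boolean matches(String entityValue) {
            boolean inFilter = items.contains(entityValue);
            return filterType == FilterType.INCLUDE ? inFilter : !inFilter;
        }
    }

    /** A numerical condition on actual, typical or diff_from_typical. */
    record Condition(String appliesTo, DoublePredicate check) {
        boolean matches(double actual, double typical) {
            double value = switch (appliesTo) {
                case "actual" -> actual;
                case "typical" -> typical;
                case "diff_from_typical" -> Math.abs(actual - typical); // assumed: absolute difference
                default -> throw new IllegalArgumentException("unknown applies_to: " + appliesTo);
            };
            return check.test(value);
        }
    }

    /** scope maps a partition/over/by field name to a filter reference. */
    record Rule(Map<String, FilterRef> scope, List<Condition> conditions) {
        boolean applies(Map<String, String> entityValues, double actual, double typical) {
            boolean scopeMatches = scope.entrySet().stream()
                    .allMatch(e -> e.getValue().matches(entityValues.getOrDefault(e.getKey(), "")));
            boolean conditionsMatch = conditions.stream().allMatch(c -> c.matches(actual, typical));
            return scopeMatches && conditionsMatch; // scope AND conditions within a single rule
        }
    }

    /** Rules are combined with OR: the actions fire if any rule applies. */
    static boolean anyRuleApplies(List<Rule> rules, Map<String, String> entityValues,
                                  double actual, double typical) {
        return rules.stream().anyMatch(r -> r.applies(entityValues, actual, typical));
    }

    public static void main(String[] args) {
        // "Skip results for safe IPs, or whenever the actual value is below 5."
        Rule safeIps = new Rule(
                Map.of("client_ip", new FilterRef(Set.of("10.0.0.1", "10.0.0.2"), FilterType.INCLUDE)),
                List.of());
        Rule lowActual = new Rule(Map.of(), List.of(new Condition("actual", v -> v < 5.0)));

        List<Rule> rules = List.of(safeIps, lowActual);
        System.out.println(anyRuleApplies(rules, Map.of("client_ip", "10.0.0.1"), 100.0, 2.0)); // true
        System.out.println(anyRuleApplies(rules, Map.of("client_ip", "10.9.9.9"), 3.0, 50.0));  // true
        System.out.println(anyRuleApplies(rules, Map.of("client_ip", "10.9.9.9"), 100.0, 2.0)); // false
    }
}
```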

@elasticmachine (Collaborator) commented:

Pinging @elastic/ml-core

@@ -700,6 +704,18 @@ void setMaxMachineMemoryPercent(int maxMachineMemoryPercent) {
}
}

// Visible for testing
static void checkJobWithRulesRequiresMinVersionOnAllNodes(String jobId, ClusterState clusterState) {
Contributor Author:

@droberts195 Actually, I think this is not enough. In the case of a job relocating, validate will not be called, right? So if at that point there's a node running on a version earlier than 6.4, the job may be assigned to it. I think I need to add a check in the node selection code as well.

Contributor:

Yes, if we go down the road of banning use of functionality until the entire cluster is on a particular version, then we need to do it when the job is created, not when it's opened.

If we can't do this for some reason and need to do a check when opening a job, then it needs to be in the allocation decision, so we might as well use our existing functionality that takes job version into account when allocating jobs to nodes.

Contributor Author:

The problem with checking when the job is created is that we would then also need to ban earlier nodes from joining the cluster, which feels excessive for this feature. I'll switch to rejecting incompatible nodes from taking on jobs with rules.
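
For context, a minimal sketch of what a minimum-version check over all cluster nodes might look like is below. It is illustrative only and assumes the simplest possible policy (every node must be on 6.4.0 or later); the actual check, and its eventual home in the node selection code, may differ.

```java
// Illustrative sketch, not the PR's implementation: reject jobs that use rules
// while any node in the cluster is on a version that cannot understand them.
static void checkJobWithRulesRequiresMinVersionOnAllNodes(String jobId, ClusterState clusterState) {
    for (DiscoveryNode node : clusterState.nodes()) {
        if (node.getVersion().before(Version.V_6_4_0)) {
            throw new ElasticsearchException("Cannot use custom rules for job [" + jobId
                    + "] because node [" + node.getName() + "] is on version ["
                    + node.getVersion() + "]; all nodes must be on version 6.4.0 or higher");
        }
    }
}
```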

public static final ConstructingObjectParser<FilterRef, Void> CONFIG_PARSER =
new ConstructingObjectParser<>(FILTER_REF_FIELD.getPreferredName(), false,
a -> new FilterRef((String) a[0], (FilterType) a[1]));
public static final Map<MlParserType, ConstructingObjectParser<FilterRef, Void>> PARSERS = new EnumMap<>(MlParserType.class);
Member:

Do you need this map if the individual parsers are public static?

Contributor Author:

This follows the same pattern we've been using in other classes to ensure unknown fields are ignored while parsing from cluster state, while still being strict when parsing from a user request.
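
For readers unfamiliar with the pattern, it looks roughly like the sketch below: a lenient twin of `CONFIG_PARSER` plus a map from parser type to parser. The lenient parser and the `MlParserType` constants shown here illustrate the pattern rather than reproduce the exact code in this PR.

```java
// Sketch of the lenient/strict parser pattern (illustrative, not necessarily
// the exact code): `true` tells the parser to ignore unknown fields, which is
// what we want when reading from cluster state after a rolling upgrade, while
// the strict CONFIG_PARSER above is used for user requests.
public static final ConstructingObjectParser<FilterRef, Void> METADATA_PARSER =
        new ConstructingObjectParser<>(FILTER_REF_FIELD.getPreferredName(), true,
                a -> new FilterRef((String) a[0], (FilterType) a[1]));

static {
    // Callers pick the right parser for their context via this map.
    PARSERS.put(MlParserType.METADATA, METADATA_PARSER);
    PARSERS.put(MlParserType.CONFIG, CONFIG_PARSER);
}
```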

return Objects.hash(filterId, filterType);
}

public String getFilterId() {
Member:

and public FilterType getFilterType()...

parser.declareField(Builder::setConditionsConnective, p -> {
if (p.currentToken() == XContentParser.Token.VALUE_STRING) {
return Connective.fromString(p.text());
parser.declareObject(Builder::setScope, (p, c) -> {
Member:

It would be neater to create a parser for RuleScope and use that here

Contributor Author:

I couldn't figure out how to do that given that it has no named fields. Any ideas?

Contributor Author:

I have at least extracted the code into a parser method in RuleScope. Looks cleaner now.
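
For reference, one way to write such a parser method over an object with arbitrary keys is sketched below. The method name, the `Map` return type and the reuse of `FilterRef.CONFIG_PARSER` are assumptions for illustration; the extracted method in `RuleScope` may well look different.

```java
// Illustrative sketch: assuming the parser is positioned on the scope object's
// START_OBJECT, walk it token by token, treating each key as a partition/over/by
// field name and each value as a filter reference.
static Map<String, FilterRef> parseScope(XContentParser p) throws IOException {
    Map<String, FilterRef> scope = new HashMap<>();
    XContentParser.Token token;
    while ((token = p.nextToken()) != XContentParser.Token.END_OBJECT) {
        if (token == XContentParser.Token.FIELD_NAME) {
            String fieldName = p.currentName();
            p.nextToken(); // advance to the filter reference object
            scope.put(fieldName, FilterRef.CONFIG_PARSER.parse(p, null));
        }
    }
    return scope;
}
```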

for (RuleAction action : actions) {
action.writeTo(out);
}
if (out.getVersion().onOrAfter(Version.V_6_4_0)) {
Contributor:

There's a lack of symmetry here. The stream input constructor doesn't check the version, so either this shouldn't check it either, or both should. Not checking would be reasonable if the protection of old versions is done at a higher level.

Contributor Author:

This prevents a rule from being written out to a node that doesn't know how to read it. When it comes to reading there is no issue, as it would only read from nodes that are >= 6.4. Am I missing something?

@droberts195 (Contributor) commented Jun 6, 2018:

OK, on the input side we're relying on the rules list in 6.3 and below being an empty list. We discussed that and said it was fine because nobody should have used a rule before 6.4, and if they did it was unsupported and hence acceptable for the transport to break.

On the output side I think a change is necessary though. The way it is at the moment, if you have a list of 2 rules in 6.4, that will be serialised as a list of 2 objects, each with no fields. Instead I think it needs to be serialised as an empty list, because 6.3 won't understand the objects with no fields. (Let me know if that's wrong.)

If my thinking is correct then this version check can be removed but instead we need one in Detector.java around this line:

out.writeList(rules);

If the remote node is below version 6.4 it needs to write an empty list instead of rules.
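
For illustration, the suggested guard around that line might look roughly like the following sketch (not the actual change):

```java
// Sketch of the suggested change in Detector#writeTo - not the committed fix.
// Nodes before 6.4.0 cannot read rule objects, so send them an empty list.
if (out.getVersion().onOrAfter(Version.V_6_4_0)) {
    out.writeList(rules);
} else {
    out.writeList(Collections.emptyList());
}
```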

Contributor Author:

You're absolutely right. Will push a fix for this shortly.

case DIFF_FROM_TYPICAL:
return true;
case TIME:
default:
Contributor:

Should default throw an exception rather than return false? If a future enum value is added we don't know what would be most appropriate to return here.


case TYPICAL:
case DIFF_FROM_TYPICAL:
return true;
case TIME:
Contributor:

Now that default is throwing an exception, does TIME need a separate `return false`?

Contributor Author:

Good catch. Also realised I'm missing tests here. I'll add them in together with the fix.
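
Putting the two comments together, the resulting check presumably ends up along the lines of the sketch below (the surrounding method and the `ACTUAL` case are assumed; this is not the committed code):

```java
// Sketch: known-but-inapplicable values return false explicitly, while truly
// unexpected values fail loudly instead of being silently treated as false.
switch (appliesTo) {
    case ACTUAL:
    case TYPICAL:
    case DIFF_FROM_TYPICAL:
        return true;
    case TIME:
        return false;
    default:
        throw new IllegalStateException("Unexpected applies_to value [" + appliesTo + "]");
}
```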

@dimitris-athanasiou (Contributor Author) commented:
retest this please

@dimitris-athanasiou dimitris-athanasiou merged commit 5c77ebe into elastic:master Jun 13, 2018
@dimitris-athanasiou dimitris-athanasiou deleted the implement-new-rules-design branch June 13, 2018 10:20
dnhatn added a commit that referenced this pull request Jun 14, 2018
* master:
  Remove RestGetAllAliasesAction (#31308)
  Temporary fix for broken build
  Reenable Checkstyle's unused import rule (#31270)
  Remove remaining unused imports before merging #31270
  Fix non-REST doc snippet
  [DOC] Extend SQL docs
  Immediately flush channel after writing to buffer (#31301)
  [DOCS] Shortens ML API intros
  Use quotes in the call invocation (#31249)
  move security ingest processors to a sub ingest directory (#31306)
  Add 5.6.11 version constant.
  Fix version detection.
  SQL: Whitelist SQL utility class for better scripting (#30681)
  [Docs] All Rollup docs experimental, agg limitations, clarify DeleteJob (#31299)
  CCS: don't proxy requests for already connected node (#31273)
  Mute ScriptedMetricAggregatorTests testSelfReferencingAggStateAfterMap
  [test] opensuse packaging turn up debug logging
  Add unreleased version 6.3.1
  Removes experimental tag from scripted_metric aggregation (#31298)
  [Rollup] Metric config parser must use builder so validation runs (#31159)
  [ML] Check licence when datafeeds use cross cluster search  (#31247)
  Add notion of internal index settings (#31286)
  Test: Remove broken yml test feature (#31255)
  REST hl client: cluster health to default to cluster level (#31268)
  [ML] Update test thresholds to account for changes to memory control (#31289)
  Log warnings when cluster state publication failed to some nodes (#31233)
  Fix AntFixture waiting condition (#31272)
  Ignore numeric shard count if waiting for ALL (#31265)
  [ML] Implement new rules design (#31110)
  index_prefixes back-compat should test 6.3 (#30951)
  Core: Remove plain execute method on TransportAction (#30998)
  Update checkstyle to 8.10.1 (#31269)
  Set analyzer version in PreBuiltAnalyzerProviderFactory (#31202)
  Modify pipelining handlers to require full requests (#31280)
  Revert upgrade to Netty 4.1.25.Final (#31282)
  Use armored input stream for reading public key (#31229)
  Fix Netty 4 Server Transport tests. Again.
  REST hl client: adjust wait_for_active_shards param in cluster health (#31266)
  REST high-level Client: remove deprecated API methods (#31200)
  [DOCS] Mark SQL feature as experimental
  [DOCS] Updates machine learning custom URL screenshots (#31222)
  Fix naming conventions check for XPackTestCase
  Fix security Netty 4 transport tests
  Fix race in clear scroll (#31259)
  [DOCS] Clarify audit index settings when remote indexing (#30923)
  Delete typos in SAML docs (#31199)
  REST high-level client: add Cluster Health API (#29331)
  [ML][TEST] Mute tests using rules (#31204)
  Support RequestedAuthnContext (#31238)
  SyncedFlushResponse to implement ToXContentObject (#31155)
  Add Get Aliases API to the high-level REST client (#28799)
  Remove some line length supressions (#31209)
  Validate xContentType in PutWatchRequest. (#31088)
  [INGEST] Interrupt the current thread if evaluation grok expressions take too long (#31024)
  Suppress extras FS on caching directory tests
  Revert "[DOCS] Added 6.3 info & updated the upgrade table. (#30940)"
  Revert "Fix snippets in upgrade docs"
  Fix snippets in upgrade docs
  [DOCS] Added 6.3 info & updated the upgrade table. (#30940)
  LLClient: Support host selection (#30523)
  Upgrade to Netty 4.1.25.Final (#31232)
  Enable custom credentials for core REST tests (#31235)
  Move ESIndexLevelReplicationTestCase to test framework (#31243)
  Encapsulate Translog in Engine (#31220)
  HLRest: Add get index templates API (#31161)
  Remove all unused imports and fix CRLF (#31207)
  [Tests] Fix self-referencing tests
  [TEST] Fix testRecoveryAfterPrimaryPromotion
  [Docs] Remove mention pattern files in Grok processor (#31170)
  Use stronger write-once semantics for Azure repository (#30437)
  Don't swallow exceptions on replication (#31179)
  Limit the number of concurrent requests per node (#31206)
  Call ensureNoSelfReferences() on _agg state variable after scripted metric agg script executions (#31044)
  Move java version checker back to its own jar (#30708)
  [test] add fix for rare virtualbox error (#31212)
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this pull request Jun 15, 2018
dimitris-athanasiou added a commit that referenced this pull request Jun 15, 2018
dnhatn added a commit that referenced this pull request Jun 15, 2018
* 6.x:
  Upgrade to Lucene-7.4.0-snapshot-518d303506 (#31360)
  [ML] Implement new rules design (#31110) (#31294)
  Remove RestGetAllAliasesAction (#31308)
  CCS: don't proxy requests for already connected node (#31273)
  Rankeval: Fold template test project into main module (#31203)
  [Docs] Remove reference to repository-s3 plugin creating an S3 bucket (#31359)
  More detailed tracing when writing metadata (#31319)
  Add details section for dcg ranking metric (#31177)