[ML] Consider unifying datafeed and job configuration #34231

davidkyle · 2018-10-02T15:00:20Z

The original design envisioned the datafeed as a general-purpose tool that could be used with different types of jobs, not just anomaly detection. As the code has evolved, the datafeed has come to look more like a single-purpose tool dedicated to feeding data to anomaly detector jobs (query delay, aggregations, writing to autodetect), and it is not easily adaptable to future use cases or job types. It was also imagined that a single datafeed could feed multiple jobs, but aggregations reduce the data volume enough that we have not needed this. And because the ideal aggregation interval is a function of bucket span, it is not always appropriate to feed the same data to multiple jobs with different bucket spans. To some extent, multi-bucket anomalies have mitigated this requirement.

The change to move configuration out of the cluster state (#32905) has shown that the current arrangement is vulnerable to inconsistencies, because the datafeed and job are defined in separate documents that can change independently. Given that a datafeed is tightly coupled to its job, its configuration could be defined inside the job itself, which is how the UI already presents it. This would simplify the code, as only one document needs to be read, and would ensure consistency. It needn't break the REST API, as the datafeeds can be extracted from the jobs without the client having any knowledge of where they came from.
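To make the idea concrete, here is a minimal sketch of how a datafeed could be extracted from a unified job document so that the existing datafeed endpoints keep returning the same shape. The document layout, the `datafeed_config` key, and the helper names `extract_datafeed` and `extract_job` are assumptions for illustration only, not the actual index mapping or Elasticsearch code.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical unified document: the datafeed is stored inside the job config.
# All field names and values here are illustrative assumptions.
UNIFIED_JOB: Dict[str, Any] = {
    "job_id": "web-traffic",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{"function": "count"}],
    },
    "data_description": {"time_field": "@timestamp"},
    "datafeed_config": {
        "datafeed_id": "datafeed-web-traffic",
        "indices": ["web-logs-*"],
        "query": {"match_all": {}},
        "query_delay": "60s",
    },
}


def extract_datafeed(job_doc: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Return a standalone datafeed view of a unified job document.

    Returns None when the job has no embedded datafeed.
    """
    embedded = job_doc.get("datafeed_config")
    if embedded is None:
        return None
    datafeed = deepcopy(embedded)
    # The job reference is derived from the enclosing document, so the
    # two can never disagree about which job the datafeed belongs to.
    datafeed["job_id"] = job_doc["job_id"]
    return datafeed


def extract_job(job_doc: Dict[str, Any]) -> Dict[str, Any]:
    """Return the job view with the embedded datafeed stripped out."""
    job = deepcopy(job_doc)
    job.pop("datafeed_config", None)
    return job
```

Both views are derived from a single read of one document, which is the consistency property argued for above: a client calling a datafeed endpoint would not need to know the datafeed is stored inside the job.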

I'm not advocating making the change today, but if the burden of maintaining separate configs for datafeeds and anomaly detector jobs grows, the refactor should be made.

elasticmachine · 2018-10-02T15:00:24Z

Pinging @elastic/ml-core

droberts195 · 2018-10-02T15:08:51Z

Another thing is that even if datafeeds were made generic enough that they could be reused for some future type of job, that wouldn't preclude storing the datafeeds for anomaly detector jobs inside the anomaly detector job config.

droberts195 · 2019-01-28T14:36:16Z

Since #37349, jobs and datafeeds are even more tightly coupled together. The fact that we have to do these checks to make jobs and datafeeds work together highlights that it was probably the wrong decision in the first place to separate them. However, the work to combine them now would be huge. It would be a similar project to the one that moved ML configs from cluster state to an index, so it would have to be done around the 7.last -> 8.0 timeframe and would take 3-4 person-months of effort to do in a way that was backwards compatible for end users. These changes would also have more impact on the UI than the ML config migration project, so the total time taken would be even greater. I'm not sure that having separate jobs and datafeeds causes enough pain to justify all this complex rework.

droberts195 · 2021-08-03T09:20:24Z

We have started working on this. #74265 was the first step.

davidkyle added :ml Machine learning team-discuss labels Oct 2, 2018

droberts195 mentioned this issue Feb 6, 2019

ML: update set_upgrade_mode, add logging #38372

Merged

droberts195 removed the team-discuss label Aug 3, 2021

droberts195 self-assigned this Aug 3, 2021
