[ML] Consider unifying datafeed and job configuration #34231

Open
davidkyle opened this issue Oct 2, 2018 · 4 comments
Labels: :ml Machine learning

Comments

@davidkyle
Member

The original design envisioned the datafeed as a general-purpose tool that could be used with different types of jobs, not just anomaly detection. As the code has evolved, the datafeed has come to look more like a single-purpose tool dedicated to feeding data to anomaly detector jobs (query delay, aggregations, writing to autodetect), and it is not easily adaptable to future use cases or job types. It was also imagined that a single datafeed could feed multiple jobs, but aggregations reduce the data volume enough that we have not needed this. And because the ideal aggregation interval is a function of bucket span, it is not always appropriate to feed the same data to multiple jobs with different bucket spans. To some extent, multi-bucket anomalies have mitigated the need for multiple jobs at different bucket spans.
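To illustrate why one aggregated datafeed does not suit jobs with different bucket spans, here is a minimal sketch. It assumes that a job's bucket span must be a whole multiple of the datafeed's aggregation interval; that constraint and the helper names `interval_fits` and `shared_interval` are assumptions made for illustration, not something stated in this issue.

```python
from math import gcd
from typing import List


def interval_fits(bucket_span_s: int, agg_interval_s: int) -> bool:
    """Return True if the bucket span is a whole multiple of the aggregation interval."""
    return agg_interval_s > 0 and bucket_span_s % agg_interval_s == 0


def shared_interval(bucket_spans_s: List[int]) -> int:
    """Return the largest interval (in seconds) that divides every bucket span."""
    result = 0
    for span in bucket_spans_s:
        result = gcd(result, span)  # gcd(0, x) == x, so the first span seeds the result
    return result


# A 15m job and a 20m job sharing one datafeed would need a 5m aggregation
# interval, which reduces the data less than either job would need alone.
print(shared_interval([900, 1200]))  # 300
```

Under that assumption, the shared interval shrinks as the bucket spans diverge, so a shared datafeed gives up much of the data reduction that makes aggregations worthwhile.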

The change to move configuration out of the cluster state (#32905) has shown that the current arrangement is vulnerable to inconsistencies, because the datafeed and job are defined in separate documents that can change independently. Given that a datafeed is tightly coupled to its job, the datafeed configuration could be defined inside the job itself, which is how the UI already presents it. This would simplify the code, as only one document needs to be read, and would ensure consistency. It needn't break the REST API, as the datafeeds can be extracted from the jobs without the client having any knowledge of where they came from.
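As a rough sketch of what this could look like, the snippet below models a single job document that embeds its datafeed, plus a helper that extracts a standalone datafeed view for API responses. The `datafeed_config` key, the example field values, and the `extract_datafeed` helper are hypothetical and shown only to make the idea concrete; they are not the actual storage format.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical unified job document: the datafeed lives inside the job,
# so there is one document to read and nothing to drift out of sync.
job_doc: Dict[str, Any] = {
    "job_id": "web-traffic",
    "analysis_config": {
        "bucket_span": "15m",
        "detectors": [{"function": "count"}],
    },
    "data_description": {"time_field": "@timestamp"},
    "datafeed_config": {  # assumed key name, for illustration only
        "datafeed_id": "datafeed-web-traffic",
        "indices": ["web-logs-*"],
        "query": {"match_all": {}},
        "query_delay": "60s",
    },
}


def extract_datafeed(job: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Build a standalone datafeed view from a job document, or None if it has no datafeed."""
    embedded = job.get("datafeed_config")
    if embedded is None:
        return None
    datafeed = deepcopy(embedded)      # never mutate the stored job document
    datafeed["job_id"] = job["job_id"]  # the job link is derived, not stored twice
    return datafeed


print(extract_datafeed(job_doc)["job_id"])  # web-traffic
```

Because the job reference is derived at read time, a client calling a datafeed endpoint would see the same shape as before, without knowing the datafeed is stored inside the job.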

I'm not advocating making the change today, but if the burden of maintaining separate configs for datafeeds and anomaly detector jobs grows, the refactor should be made.

davidkyle added the :ml Machine learning and team-discuss labels Oct 2, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@droberts195
Contributor

Another point is that even if datafeeds were made generic enough to be reused for some future type of job, that wouldn't preclude storing the datafeeds for anomaly detector jobs inside the anomaly detector job config.

@droberts195
Contributor

Since #37349, jobs and datafeeds are even more tightly coupled. The fact that we have to do these checks to make jobs and datafeeds work together highlights that separating them was probably the wrong decision in the first place. However, the work to combine them now would be huge. It would be a similar project to the one that moved ML configs from cluster state to an index, so it would have to be done around the 7.last -> 8.0 timeframe and would take 3-4 person-months of effort to do in a way that was backwards compatible for end users. These changes would also have more impact on the UI than the ML config migration project did, so the total time taken would be even greater. I'm not sure that having separate jobs and datafeeds causes enough pain to justify all this complex rework.

@droberts195
Contributor

We have started working on this. #74265 was the first step.

droberts195 self-assigned this Aug 3, 2021