[ML] bucket_count is inaccurate when there are gaps in the data #30080

Closed
elasticmachine opened this issue Aug 25, 2017 · 0 comments · Fixed by #30294
Labels
>bug :ml Machine learning

Comments

@elasticmachine
Collaborator

Original comment by @davidkyle:

Open a job, send some data, and close the job. Then reopen the job and send data timestamped a week later than the previous batch. Autodetect will create empty bucket results for the intervening period, but DataCounts::bucket_count will not reflect that.

The testMlBasicMultiNodeIT::testMiniFarequoteReopen test does exactly this, but it was asserting that bucket_count == 2 rather than bucket_count == 7 days' worth of buckets. bucket_count should equal the number of buckets written by autodetect, with the caveat that old results are sometimes pruned.
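
To make the "7 days of buckets" arithmetic concrete, here is a minimal sketch of how many buckets are touched between two record timestamps. The class name, the method name, and the one-hour bucket span are assumptions chosen for illustration; the issue does not state the job's bucket span, and whether autodetect finalizes the last partial bucket is not specified here either.

```java
// Hypothetical helper, not part of Elasticsearch: counts every bucket whose
// interval is touched between the first and the last record timestamp.
final class ExpectedBucketCount {

    static long bucketsSpanned(long firstEpochSec, long lastEpochSec, long bucketSpanSec) {
        if (bucketSpanSec <= 0) {
            throw new IllegalArgumentException("bucketSpanSec must be positive");
        }
        if (lastEpochSec < firstEpochSec) {
            throw new IllegalArgumentException("lastEpochSec must not precede firstEpochSec");
        }
        // floorDiv maps a timestamp to the index of the bucket that contains it.
        long firstBucket = Math.floorDiv(firstEpochSec, bucketSpanSec);
        long lastBucket = Math.floorDiv(lastEpochSec, bucketSpanSec);
        return lastBucket - firstBucket + 1;
    }

    public static void main(String[] args) {
        long bucketSpanSec = 3600L;          // assumed 1h bucket span
        long first = 0L;                     // first batch
        long last = 7L * 24 * 3600;          // second batch, one week later
        System.out.println(bucketsSpanned(first, last, bucketSpanSec)); // prints 169
    }
}
```

Under these assumptions a one-week gap spans 169 buckets (the first bucket, 167 empty ones, and the bucket holding the later record), which is far from the 2 that the test was asserting.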

@elasticmachine elasticmachine added :ml Machine learning >bug labels Apr 25, 2018
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue Apr 30, 2018
This commit refactors the DataStreamDiagnostics class
achieving the following advantages:

- simpler code; by encapsulating the moving bucket histogram
into its own class
- better performance; by using an array to store the buckets
instead of a map
- explicit handling of gap buckets; in preparation for fixing elastic#30080
dimitris-athanasiou added a commit that referenced this issue May 1, 2018
This commit refactors the DataStreamDiagnostics class
achieving the following advantages:

- simpler code; by encapsulating the moving bucket histogram
into its own class
- better performance; by using an array to store the buckets
instead of a map
- explicit handling of gap buckets; in preparation for fixing #30080
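
As a rough illustration of the idea in this commit message (a moving bucket histogram backed by an array rather than a map, with gap buckets handled explicitly), the following is a minimal sketch. It is not the actual DataStreamDiagnostics code: the class `MovingBucketHistogram`, its methods, and the window size are all hypothetical.

```java
// Hypothetical sketch of an array-backed moving bucket histogram.
// Buckets that slide out of the window with a zero count are tallied as empty.
final class MovingBucketHistogram {
    private final long bucketSpanMs;
    private final long[] counts;      // circular buffer, one slot per bucket in the window
    private long oldestBucket;        // absolute index of the oldest bucket in the window
    private boolean initialized;
    private long emptyBucketCount;
    private long flushedBucketCount;

    MovingBucketHistogram(long bucketSpanMs, int windowSize) {
        if (bucketSpanMs <= 0 || windowSize <= 0) {
            throw new IllegalArgumentException("bucketSpanMs and windowSize must be positive");
        }
        this.bucketSpanMs = bucketSpanMs;
        this.counts = new long[windowSize];
    }

    void add(long timestampMs) {
        long bucket = Math.floorDiv(timestampMs, bucketSpanMs);
        if (!initialized) {
            oldestBucket = bucket;
            initialized = true;
        }
        if (bucket < oldestBucket) {
            return; // older than the window; ignored in this sketch
        }
        // Slide the window forward until the record's bucket fits. Every gap
        // bucket passes through flushOldest() and is counted explicitly.
        while (bucket >= oldestBucket + counts.length) {
            flushOldest();
        }
        counts[slot(bucket)]++;
    }

    private void flushOldest() {
        int s = slot(oldestBucket);
        if (counts[s] == 0) {
            emptyBucketCount++;
        }
        counts[s] = 0;                // free the slot for reuse
        flushedBucketCount++;
        oldestBucket++;
    }

    private int slot(long bucket) {
        return (int) Math.floorMod(bucket, (long) counts.length);
    }

    long emptyBucketCount() { return emptyBucketCount; }

    long flushedBucketCount() { return flushedBucketCount; }
}
```

With a 1000 ms bucket span and a window of 3, adding records at 0, 500, and 5000 ms flushes buckets 0, 1, and 2, of which buckets 1 and 2 are counted as empty. The array gives constant-time slot lookup; the trade-off in this sketch is that a very long gap is walked one bucket at a time.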
dimitris-athanasiou added a commit to dimitris-athanasiou/elasticsearch that referenced this issue May 3, 2018
This commit fixes an issue with the data diagnostics where
empty buckets are not reported even though they should be. Once
a job is reopened, the diagnostics do not get initialized from
the current data counts (especially the latest record timestamp).
The result is that if the data that is sent has a time gap compared
to the previous data, that gap is not accounted for in the empty bucket
count.

This commit fixes that by initializing the diagnostics with the current
data counts.

Closes elastic#30080
dimitris-athanasiou added a commit that referenced this issue May 3, 2018
This commit fixes an issue with the data diagnostics where
empty buckets are not reported even though they should be. Once
a job is reopened, the diagnostics do not get initialized from
the current data counts (especially the latest record timestamp).
The result is that if the data that is sent has a time gap compared
to the previous data, that gap is not accounted for in the empty bucket
count.

This commit fixes that by initializing the diagnostics with the current
data counts.

Closes #30080
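
The effect described in this commit message can be sketched as follows: diagnostics that start from scratch on reopen cannot see the gap, while diagnostics seeded with the previous latest record timestamp can. This is an illustrative sketch only; `ReopenedJobDiagnostics` and its members are hypothetical names, not the classes touched by the commit, and the one-hour bucket span in the usage note is an assumption.

```java
// Hypothetical sketch: seed gap detection with the latest record timestamp
// persisted from the previous session (null means "start from scratch").
final class ReopenedJobDiagnostics {
    private final long bucketSpanMs;
    private long latestBucket;
    private boolean hasLatest;
    private long emptyBucketCount;

    ReopenedJobDiagnostics(long bucketSpanMs, Long previousLatestRecordTimeMs) {
        if (bucketSpanMs <= 0) {
            throw new IllegalArgumentException("bucketSpanMs must be positive");
        }
        this.bucketSpanMs = bucketSpanMs;
        if (previousLatestRecordTimeMs != null) {
            // Initialize from the existing data counts so a gap since the
            // last session is visible to the first record after reopen.
            this.latestBucket = Math.floorDiv(previousLatestRecordTimeMs, bucketSpanMs);
            this.hasLatest = true;
        }
    }

    void onRecord(long timestampMs) {
        long bucket = Math.floorDiv(timestampMs, bucketSpanMs);
        if (hasLatest && bucket > latestBucket + 1) {
            // Every bucket strictly between the two is empty.
            emptyBucketCount += bucket - latestBucket - 1;
        }
        if (!hasLatest || bucket > latestBucket) {
            latestBucket = bucket;
            hasLatest = true;
        }
    }

    long emptyBucketCount() { return emptyBucketCount; }
}
```

For example, with a 3,600,000 ms bucket span, a previous latest record at time 0, and a new record one week later (604,800,000 ms), the seeded instance reports 167 empty buckets, whereas an instance constructed with `null` reports 0. The sketch ignores out-of-order records that might later fill a gap bucket.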