
Aggregations return different counts when invoked twice in a row #5021

Closed
thanodnl opened this issue Feb 5, 2014 · 3 comments

thanodnl commented Feb 5, 2014

Hi,

A couple of days ago I started a thread on the mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/c_xLCPOpvjc) about this issue, but it has received few responses.

The problem has existed in the aggregations API since version 1.0.0.RC1, and I have confirmed that it also occurs in 1.0.0.RC2.

The problem is that a terms aggregation on an index split into multiple shards (10 in my case) returns inconsistent numbers: the second invocation returns different numbers than the first. You cannot show these numbers to users, because when they reload the analytics they see completely different numbers than before, even though the data has not changed.

I created a test suite as a gist so you can reproduce the problem yourself. It is hosted at: https://gist.github.com/thanodnl/8803745.

Since it contains data files, it does not render well in GitHub's web interface. It is best to clone the gist by running: $ git clone https://gist.github.com/8803745.git

cd into the newly created directory and run $ ./aggsbug.load.sh to load the test set into your local cluster. This can take a couple of minutes, since it loads ~1M documents. I tried to reproduce the issue with a smaller set, but then it did not appear.

Once the data is loaded, you can run a self-contained test with $ ./aggsbug.test.sh. It calls the same aggregation twice, stores both outputs, and prints the diff between them.

If you have reproduced the bug, the output of the test should look something like:

$ ./aggsbug.test.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1088  100   950  100   138    192     27  0:00:05  0:00:04  0:00:01   206
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1086  100   948  100   138   2867    417 --:--:-- --:--:-- --:--:--  2872
diff in 2 aggs calls:
2c2
<   "took" : 4918,

---
>   "took" : 325,
18c18
<         "doc_count" : 3599

---
>         "doc_count" : 3228
21c21
<         "doc_count" : 2517

---
>         "doc_count" : 2254
24c24
<         "doc_count" : 2207

---
>         "doc_count" : 2007
27c27
<         "doc_count" : 2207

---
>         "doc_count" : 1971
30c30
<         "doc_count" : 1660

---
>         "doc_count" : 1478
33c33
<         "doc_count" : 1534

---
>         "doc_count" : 1401
36c36
<         "doc_count" : 1468

---
>         "doc_count" : 1330
39c39
<         "doc_count" : 1079

---
>         "doc_count" : 952

When run against 1.0.0.Beta2, the output is as expected:

$ ./aggsbug.test.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1087  100   949  100   138    208     30  0:00:04  0:00:04 --:--:--   208
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1086  100   948  100   138   1525    222 --:--:-- --:--:-- --:--:--  1526
diff in 2 aggs calls:
2c2
<   "took" : 4525,

---
>   "took" : 611,

As you can see, the aggregation buckets do not appear in the diff: the only difference between the two runs is the time it took to compute the result.
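The check that aggsbug.test.sh performs with diff can also be expressed programmatically: strip the one field that legitimately differs between identical runs (`took`) and compare the bucket counts directly. A minimal sketch, assuming the response shape shown in the diffs above (the helper names are mine, not from the gist):

```python
# Compare two responses to the same terms aggregation. "took" is the
# only field expected to differ between back-to-back identical runs,
# so any other difference indicates the inconsistency reported here.
def diff_buckets(resp_a, resp_b, agg_name="a"):
    """Return (key, count_a, count_b) triples whose counts disagree."""
    counts_a = {b["key"]: b["doc_count"]
                for b in resp_a["aggregations"][agg_name]["buckets"]}
    counts_b = {b["key"]: b["doc_count"]
                for b in resp_b["aggregations"][agg_name]["buckets"]}
    return [(key, counts_a.get(key), counts_b.get(key))
            for key in sorted(counts_a.keys() | counts_b.keys())
            if counts_a.get(key) != counts_b.get(key)]

def responses_consistent(resp_a, resp_b, agg_name="a"):
    """True when the two responses agree on every bucket count."""
    return not diff_buckets(resp_a, resp_b, agg_name)
```

On a healthy cluster, `responses_consistent` should always return True for two identical requests issued in a row; on an affected 1.0.0.RC1/RC2 cluster it returns False.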

@jpountz jpountz self-assigned this Feb 5, 2014

jpountz commented Feb 5, 2014

Thanks for reporting this issue, this looks like a bad bug indeed. I'll look into it.


jprante commented Feb 5, 2014

Here is some more information on what I found.

ES 1.0.0.RC2
Mac OS X 10.8.5
Darwin Jorg-Prantes-MacBook-Pro.local 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b128)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
G1GC enabled

ES 1.0.0.RC2
RHEL 6.3
Linux zephyros 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b128)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
G1GC enabled

On Mac, counts may change between the first and subsequent runs; on the first run, the counts are lower than on subsequent runs.

On Linux, the effect is more subtle: counts do not change between runs, but different shard counts lead to deviating entries in the lower buckets.

Here are two Linux examples using Nils' data set. The first uses 10 shards, the second 5 shards; the lowest three buckets differ.

shards=10

{
  "took" : 143,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 1060387,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "a" : {
      "buckets" : [ {
        "key" : "totaltrafficbos",
        "doc_count" : 3599
      }, {
        "key" : "mai93thm",
        "doc_count" : 2517
      }, {
        "key" : "mai90thm",
        "doc_count" : 2207
      }, {
        "key" : "mai95thm",
        "doc_count" : 2207
      }, {
        "key" : "totaltrafficnyc",
        "doc_count" : 1660
      }, {
        "key" : "confessions",
        "doc_count" : 1534
      }, {
        "key" : "incidentreports",
        "doc_count" : 1468
      }, {
        "key" : "nji80thm",
        "doc_count" : 1071
      }, {
        "key" : "pai76thm",
        "doc_count" : 1039
      }, {
        "key" : "txi35thm",
        "doc_count" : 357
      } ]
    }
  }
}

shards=5

{
  "took" : 302,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1060387,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "a" : {
      "buckets" : [ {
        "key" : "totaltrafficbos",
        "doc_count" : 3599
      }, {
        "key" : "mai93thm",
        "doc_count" : 2517
      }, {
        "key" : "mai90thm",
        "doc_count" : 2207
      }, {
        "key" : "mai95thm",
        "doc_count" : 2207
      }, {
        "key" : "totaltrafficnyc",
        "doc_count" : 1660
      }, {
        "key" : "confessions",
        "doc_count" : 1534
      }, {
        "key" : "incidentreports",
        "doc_count" : 1468
      }, {
        "key" : "nji80thm",
        "doc_count" : 1180
      }, {
        "key" : "pai76thm",
        "doc_count" : 936
      }, {
        "key" : "nji78thm",
        "doc_count" : 422
      } ]
    }
  }
}
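The shard-count sensitivity in the lower buckets is consistent with how a distributed terms aggregation is computed: each shard reports only its local top terms, and the coordinating node sums those partial counts, so terms near the cutoff can lose counts depending on how documents are spread over shards. A toy simulation of that merge step, purely illustrative (the routing scheme and parameters below are made up, not Elasticsearch internals):

```python
from collections import Counter

def simulated_terms_agg(docs, n_shards, shard_size, size):
    """docs: iterable of (doc_id, term) pairs. Each shard counts its
    own documents, reports only its local top `shard_size` terms, and
    the coordinator sums the partial counts and keeps the top `size`."""
    shards = [Counter() for _ in range(n_shards)]
    for doc_id, term in docs:
        shards[doc_id % n_shards][term] += 1  # stand-in for hash routing
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)
```

Because each shard truncates to `shard_size` before merging, a term's merged count can never exceed its true count, and the loss shows up in the tail buckets; raising `shard_size` (or using a single shard) recovers the exact counts. This explains shard-dependent tails, but not the run-to-run differences on the same index, which is the actual bug here.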


jprante commented Feb 5, 2014

I just learned that it is already known that bucket counts differ depending on the number of shards, also for facets: #1305.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Feb 6, 2014
The byte[] array that was used to store the term was owned by the BytesRefHash
which is used to compute counts. However, the BytesRefHash is released at some
point and its content may be recycled.

MockPageCacheRecycler has been improved to expose this issue (putting random
content into the arrays upon release).

Number of documents/terms have been increased in RandomTests to make sure page
recycling occurs.

Close elastic#5021
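The commit message above describes a use-after-release bug: the aggregator held a reference into a byte[] page owned by the BytesRefHash, and once that page was released back to the recycler, its contents could be overwritten while the term bytes were still needed. The failure mode, and the copy-before-release discipline that avoids it, can be sketched with a toy recycler (hypothetical, not Elasticsearch's actual PageCacheRecycler API; the real fix is in Java inside Elasticsearch):

```python
class ToyRecycler:
    """Toy page recycler. Like the improved MockPageCacheRecycler, it
    scribbles over a page's contents on release, so any stale
    reference into a released page becomes immediately visible."""
    def __init__(self):
        self._pool = []

    def obtain(self, size=16):
        return self._pool.pop() if self._pool else bytearray(size)

    def release(self, page):
        for i in range(len(page)):
            page[i] = 0xAA  # poison released pages
        self._pool.append(page)


recycler = ToyRecycler()

# Buggy pattern: keep an alias into the page instead of copying.
page = recycler.obtain()
page[0:4] = b"term"
stale_ref = page                 # alias, not a copy -- the bug
recycler.release(page)           # page content is now recycled
term_corrupted = bytes(stale_ref[0:4]) != b"term"

# Fixed pattern: copy the bytes out while the page is still owned.
page = recycler.obtain()
page[0:4] = b"term"
safe_copy = bytes(page[0:4])     # deep copy survives recycling
recycler.release(page)
term_survives = safe_copy == b"term"
```

The poisoning on release mirrors what the commit does to MockPageCacheRecycler (putting random content into arrays upon release), which is exactly what turns this kind of silent corruption into a reproducible test failure.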
@jpountz jpountz closed this as completed in e1c1120 Feb 7, 2014
jpountz added a commit that referenced this issue Feb 7, 2014
jpountz added a commit that referenced this issue Feb 7, 2014
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015