
Aggregations return different counts when invoked twice in a row #5021

Closed
thanodnl opened this issue Feb 5, 2014 · 3 comments

thanodnl commented Feb 5, 2014

Hi,

A couple of days ago I started a thread on the mailing list (https://groups.google.com/forum/?fromgroups=#!topic/elasticsearch/c_xLCPOpvjc) about this issue, but it has received few responses.

The problem has existed in the aggregations API since version 1.0.0.RC1, and I have confirmed that it also occurs in 1.0.0.RC2.

The problem is that a terms aggregation on an index split into multiple shards (10 in my case) returns inconsistent numbers: the second invocation returns different numbers than the first. You cannot show these numbers to users, because when they reload the analytics they see completely different numbers than before, even though the data has not changed.

I created a test suite as a gist so you can reproduce the problem yourself. It is hosted at: https://gist.github.com/thanodnl/8803745.

Since it contains data files, it does not render well in GitHub's web interface. It is best to clone the gist by running: $ git clone https://gist.github.com/8803745.git

cd into the newly created directory and run $ ./aggsbug.load.sh to load the test set into your local cluster. This can take a couple of minutes, since it loads ~1M documents. I tried to reproduce the issue with a smaller set, but then it did not appear.

Once the data is loaded, you can run a self-contained test with $ ./aggsbug.test.sh. It calls the same aggregation twice, stores both outputs, and prints the diff between them.

If you have reproduced the bug, the output of the test should look something like:

$ ./aggsbug.test.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1088  100   950  100   138    192     27  0:00:05  0:00:04  0:00:01   206
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1086  100   948  100   138   2867    417 --:--:-- --:--:-- --:--:--  2872
diff in 2 aggs calls:
2c2
<   "took" : 4918,

---
>   "took" : 325,
18c18
<         "doc_count" : 3599

---
>         "doc_count" : 3228
21c21
<         "doc_count" : 2517

---
>         "doc_count" : 2254
24c24
<         "doc_count" : 2207

---
>         "doc_count" : 2007
27c27
<         "doc_count" : 2207

---
>         "doc_count" : 1971
30c30
<         "doc_count" : 1660

---
>         "doc_count" : 1478
33c33
<         "doc_count" : 1534

---
>         "doc_count" : 1401
36c36
<         "doc_count" : 1468

---
>         "doc_count" : 1330
39c39
<         "doc_count" : 1079

---
>         "doc_count" : 952

When run against 1.0.0.Beta2, the output is as expected:

$ ./aggsbug.test.sh
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1087  100   949  100   138    208     30  0:00:04  0:00:04 --:--:--   208
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1086  100   948  100   138   1525    222 --:--:-- --:--:-- --:--:--  1526
diff in 2 aggs calls:
2c2
<   "took" : 4525,

---
>   "took" : 611,

As you can see, the aggregation buckets do not appear in the diff: the only difference between the two runs is the time it took to compute the result.
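The check that aggsbug.test.sh performs with diff can also be expressed programmatically: strip the one field that legitimately differs between identical runs (`took`) and compare the bucket counts directly. A minimal sketch, assuming the response shape shown in the diffs above (the helper names are mine, not from the gist):

```python
# Compare two responses to the same terms aggregation. "took" is the
# only field expected to differ between back-to-back identical runs,
# so any other difference indicates the inconsistency reported here.
def diff_buckets(resp_a, resp_b, agg_name="a"):
    """Return (key, count_a, count_b) triples whose counts disagree."""
    counts_a = {b["key"]: b["doc_count"]
                for b in resp_a["aggregations"][agg_name]["buckets"]}
    counts_b = {b["key"]: b["doc_count"]
                for b in resp_b["aggregations"][agg_name]["buckets"]}
    return [(key, counts_a.get(key), counts_b.get(key))
            for key in sorted(counts_a.keys() | counts_b.keys())
            if counts_a.get(key) != counts_b.get(key)]

def responses_consistent(resp_a, resp_b, agg_name="a"):
    """True when the two responses agree on every bucket count."""
    return not diff_buckets(resp_a, resp_b, agg_name)
```

On a healthy cluster, `responses_consistent` should always return True for two identical requests issued in a row; on an affected 1.0.0.RC1/RC2 cluster it returns False.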

@jpountz jpountz self-assigned this Feb 5, 2014

jpountz commented Feb 5, 2014

Thanks for reporting this issue, this looks like a bad bug indeed. I'll look into it.


jprante commented Feb 5, 2014

Here is some more information on what I found.

ES 1.0.0.RC2
Mac OS X 10.8.5
Darwin Jorg-Prantes-MacBook-Pro.local 12.5.0 Darwin Kernel Version 12.5.0: Sun Sep 29 13:33:47 PDT 2013; root:xnu-2050.48.12~1/RELEASE_X86_64 x86_64
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b128)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
G1GC enabled

ES 1.0.0.RC2
RHEL 6.3
Linux zephyros 2.6.32-279.el6.x86_64 #1 SMP Wed Jun 13 18:24:36 EDT 2012 x86_64 x86_64 x86_64 GNU/Linux
java version "1.8.0"
Java(TM) SE Runtime Environment (build 1.8.0-b128)
Java HotSpot(TM) 64-Bit Server VM (build 25.0-b69, mixed mode)
G1GC enabled

On Mac, counts may change between the first and subsequent runs; on the first run, the counts are lower than on subsequent runs.

On Linux, the effect is more subtle: counts do not change between runs, but different shard counts lead to deviating entries in the lower buckets.

Here are two Linux examples using Nils' data set. The first uses 10 shards, the second 5 shards; the lowest three buckets differ.

shards=10

{
  "took" : 143,
  "timed_out" : false,
  "_shards" : {
    "total" : 10,
    "successful" : 10,
    "failed" : 0
  },
  "hits" : {
    "total" : 1060387,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "a" : {
      "buckets" : [ {
        "key" : "totaltrafficbos",
        "doc_count" : 3599
      }, {
        "key" : "mai93thm",
        "doc_count" : 2517
      }, {
        "key" : "mai90thm",
        "doc_count" : 2207
      }, {
        "key" : "mai95thm",
        "doc_count" : 2207
      }, {
        "key" : "totaltrafficnyc",
        "doc_count" : 1660
      }, {
        "key" : "confessions",
        "doc_count" : 1534
      }, {
        "key" : "incidentreports",
        "doc_count" : 1468
      }, {
        "key" : "nji80thm",
        "doc_count" : 1071
      }, {
        "key" : "pai76thm",
        "doc_count" : 1039
      }, {
        "key" : "txi35thm",
        "doc_count" : 357
      } ]
    }
  }
}

shards=5

{
  "took" : 302,
  "timed_out" : false,
  "_shards" : {
    "total" : 5,
    "successful" : 5,
    "failed" : 0
  },
  "hits" : {
    "total" : 1060387,
    "max_score" : 0.0,
    "hits" : [ ]
  },
  "aggregations" : {
    "a" : {
      "buckets" : [ {
        "key" : "totaltrafficbos",
        "doc_count" : 3599
      }, {
        "key" : "mai93thm",
        "doc_count" : 2517
      }, {
        "key" : "mai90thm",
        "doc_count" : 2207
      }, {
        "key" : "mai95thm",
        "doc_count" : 2207
      }, {
        "key" : "totaltrafficnyc",
        "doc_count" : 1660
      }, {
        "key" : "confessions",
        "doc_count" : 1534
      }, {
        "key" : "incidentreports",
        "doc_count" : 1468
      }, {
        "key" : "nji80thm",
        "doc_count" : 1180
      }, {
        "key" : "pai76thm",
        "doc_count" : 936
      }, {
        "key" : "nji78thm",
        "doc_count" : 422
      } ]
    }
  }
}
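The shard-count sensitivity in the lower buckets is consistent with how a distributed terms aggregation is computed: each shard reports only its local top terms, and the coordinating node sums those partial counts, so terms near the cutoff can lose counts depending on how documents are spread over shards. A toy simulation of that merge step, purely illustrative (the routing scheme and parameters below are made up, not Elasticsearch internals):

```python
from collections import Counter

def simulated_terms_agg(docs, n_shards, shard_size, size):
    """docs: iterable of (doc_id, term) pairs. Each shard counts its
    own documents, reports only its local top `shard_size` terms, and
    the coordinator sums the partial counts and keeps the top `size`."""
    shards = [Counter() for _ in range(n_shards)]
    for doc_id, term in docs:
        shards[doc_id % n_shards][term] += 1  # stand-in for hash routing
    merged = Counter()
    for shard in shards:
        for term, count in shard.most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)
```

Because each shard truncates to `shard_size` before merging, a term's merged count can never exceed its true count, and the loss shows up in the tail buckets; raising `shard_size` (or using a single shard) recovers the exact counts. This explains shard-dependent tails, but not the run-to-run differences on the same index, which is the actual bug here.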


jprante commented Feb 5, 2014

I just learned that it is already known that bucket counts differ depending on the number of shards, also for facets: #1305.

jpountz added a commit to jpountz/elasticsearch that referenced this issue Feb 6, 2014
The byte[] array that was used to store the term was owned by the BytesRefHash
which is used to compute counts. However, the BytesRefHash is released at some
point and its content may be recycled.

MockPageCacheRecycler has been improved to expose this issue (putting random
content into the arrays upon release).

Number of documents/terms have been increased in RandomTests to make sure page
recycling occurs.

Close elastic#5021
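The commit message above describes a use-after-release bug: the aggregator held a reference into a byte[] page owned by the BytesRefHash, and once that page was released back to the recycler, its contents could be overwritten while the term bytes were still needed. The failure mode, and the copy-before-release discipline that avoids it, can be sketched with a toy recycler (hypothetical, not Elasticsearch's actual PageCacheRecycler API; the real fix is in Java inside Elasticsearch):

```python
class ToyRecycler:
    """Toy page recycler. Like the improved MockPageCacheRecycler, it
    scribbles over a page's contents on release, so any stale
    reference into a released page becomes immediately visible."""
    def __init__(self):
        self._pool = []

    def obtain(self, size=16):
        return self._pool.pop() if self._pool else bytearray(size)

    def release(self, page):
        for i in range(len(page)):
            page[i] = 0xAA  # poison released pages
        self._pool.append(page)


recycler = ToyRecycler()

# Buggy pattern: keep an alias into the page instead of copying.
page = recycler.obtain()
page[0:4] = b"term"
stale_ref = page                 # alias, not a copy -- the bug
recycler.release(page)           # page content is now recycled
term_corrupted = bytes(stale_ref[0:4]) != b"term"

# Fixed pattern: copy the bytes out while the page is still owned.
page = recycler.obtain()
page[0:4] = b"term"
safe_copy = bytes(page[0:4])     # deep copy survives recycling
recycler.release(page)
term_survives = safe_copy == b"term"
```

The poisoning on release mirrors what the commit does to MockPageCacheRecycler (putting random content into arrays upon release), which is exactly what turns this kind of silent corruption into a reproducible test failure.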
@jpountz jpountz closed this as completed in e1c1120 Feb 7, 2014
jpountz added a commit that referenced this issue Feb 7, 2014
jpountz added a commit that referenced this issue Feb 7, 2014
mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015