Inconsistent facet counts #1832

Closed
stuartloxton opened this issue Apr 2, 2012 · 7 comments

Comments

@stuartloxton

I've got an issue with facet counts that I've managed to simplify into a re-creatable example (attached).

When searching an index I get a facet count of 12 for 'fabric'. However, when I then restrict the query to that term, the count increases to 13. I'm not sure how this is possible: adding a must clause can only narrow the result set, so facet counts should only ever decrease.

Attached is the script, which should recreate the issue in a test_bug index. It inserts 118 documents and runs a query where the "fabric" facet comes out at 12.

Query looks like:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_operator": "AND",
            "fields": [
              "_all"
            ],
            "query": "double"
          }
        }
      ]
    }
  },
  "facets": {
    "styles": {
      "terms": {
        "field": "product_styles"
      }
    }
  },
  "size": 1
}

Facet comes out as:

"facets" : {
    "styles" : {
      "_type" : "terms",
      "missing" : 11,
      "total" : 243,
      "other" : 141,
      "terms" : [ {
        "term" : "beds",
        "count" : 19
      }, {
        "term" : "fabric",
        "count" : 12
      }, {
        "term" : "products",
        "count" : 11
      }, {
        "term" : "new",
        "count" : 11
      }, {
        "term" : "luxury",
        "count" : 10
      }, {
        "term" : "furniture",
        "count" : 10
      }, {
        "term" : "upholstered",
        "count" : 8
      }, {
        "term" : "kingsize",
        "count" : 7
      }, {
        "term" : "double",
        "count" : 7
      }, {
        "term" : "bedroom",
        "count" : 7
      } ]
    }
}

When I add a term clause on product_styles for 'fabric', like this:

{
  "query": {
    "bool": {
      "must": [
        {
          "query_string": {
            "default_operator": "AND",
            "fields": [
              "_all"
            ],
            "query": "double"
          }
        },
        {
          "term": {
            "product_styles": "fabric"
          }
        }
      ]
    }
  },
  "facets": {
    "styles": {
      "terms": {
        "field": "product_styles"
      }
    }
  },
  "size": 1
}

the fabric facet then increases to 13:

"facets" : {
    "styles" : {
      "_type" : "terms",
      "missing" : 0,
      "total" : 186,
      "other" : 97,
      "terms" : [ {
        "term" : "fabric",
        "count" : 13
      }, {
        "term" : "beds",
        "count" : 12
      }, {
        "term" : "upholstered",
        "count" : 9
      }, {
        "term" : "luxury",
        "count" : 9
      }, {
        "term" : "products",
        "count" : 8
      }, {
        "term" : "new",
        "count" : 8
      }, {
        "term" : "kingsize",
        "count" : 8
      }, {
        "term" : "double",
        "count" : 8
      }, {
        "term" : "furniture",
        "count" : 7
      }, {
        "term" : "bedroom",
        "count" : 7
      } ]
    }
}

Hopefully I've explained everything and you can also recreate this. I'm using elasticsearch-0.19.1 on Mac OS X Lion.

Script for recreating data is at https://gist.github.com/2283964

@lukas-vlcek
Contributor

Hi, did you try running your example on a single shard? I assume your example ran on an index with the default number of shards, which is more than one. Maybe you are hitting an issue similar to this one: #667

@stuartloxton
Author

I've just run it against an index with 1 shard and the numbers are correct. However, is there a way to make this work with multiple shards? Is this the expected behaviour, or is it still a bug?

@lukas-vlcek
Contributor

It is not a bug per se. It is a performance trade-off in the distributed calculation. You can try increasing the size to minimize its effect. AFAIK Shay has a plan to implement some improvements that would allow more accurate results.

@stuartloxton
Author

I've just tried changing the size on the query to various numbers between 1 and the size of the dataset, and it always returns 12. For an index with roughly 150 documents and 10 properties on each one, is there a reason NOT to use just 1 shard?

@lukas-vlcek
Contributor

I was referring to the size parameter of the terms facet, not the size of the query.
http://www.elasticsearch.org/guide/reference/api/search/facets/terms-facet.html

What is actually happening is this: when ES calculates the top terms for a facet, each shard computes its own top 'size' terms [or maybe the top ('size' x number_of_shards); I don't remember off the top of my head, but it doesn't matter much]. All these per-shard 'top' sets are then collected by a single node (the one that started the query) and aggregated into the final top 'size' result. This strategy does not always produce correct global counts; the outcome also depends on the nature of your data and how it is distributed among the shards.
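
To make the effect concrete, here is a small Python simulation of the merge described above. It is only a sketch of the idea, not Elasticsearch code: it assumes each shard returns exactly its top `size` terms, and the two shards and their documents are made up for illustration.

```python
from collections import Counter

def shard_top_terms(docs, size):
    """Count terms in one shard's documents and keep only the top `size`."""
    counts = Counter(term for doc in docs for term in doc)
    return counts.most_common(size)

def merged_facet(shards, size):
    """Merge the per-shard top lists, as the coordinating node would."""
    merged = Counter()
    for docs in shards:
        for term, count in shard_top_terms(docs, size):
            merged[term] += count
    return dict(merged.most_common(size))

# Hypothetical data: each document is its list of product_styles terms.
shards = [
    [["fabric", "beds"], ["fabric", "beds"], ["fabric"]],
    [["beds", "luxury"], ["beds", "luxury"], ["fabric"]],
]

# Unfiltered: on the second shard 'fabric' is not in the top 2,
# so its one document there never reaches the merge step.
unfiltered = merged_facet(shards, size=2)

# Restricted to documents containing 'fabric': the term is now in
# every shard's top list, so all of its documents are counted.
only_fabric = [[doc for doc in docs if "fabric" in doc] for docs in shards]
filtered = merged_facet(only_fabric, size=2)

print(unfiltered["fabric"], filtered["fabric"])
```

In this toy data the 'fabric' count is lower without the restriction than with it, which is the same shape as the 12 versus 13 in the report. Raising `size` until every shard reports 'fabric' makes the two numbers agree.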

If you have only 150 documents then using one shard will help. Another option would be to set the terms facet 'size' to 150.
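
For reference, this is roughly where that parameter goes. The snippet just builds the search body from the original report as a Python dict and prints it as JSON; the only change is the added "size" inside the terms facet. The value 150 is the suggestion from this thread, not a recommended default.

```python
import json

# Search body from the original report, with one change: the terms facet
# now carries its own "size". The top-level "size" only limits how many
# hits are returned and has no effect on the facet.
body = {
    "query": {
        "bool": {
            "must": [
                {
                    "query_string": {
                        "default_operator": "AND",
                        "fields": ["_all"],
                        "query": "double",
                    }
                }
            ]
        }
    },
    "facets": {
        "styles": {
            "terms": {"field": "product_styles", "size": 150}
        }
    },
    "size": 1,
}

print(json.dumps(body, indent=2))
```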

@apatrida
Contributor

Solr provides accuracy here with a 2nd call to the shards to fetch counts for terms that were found on shard A but are so far missing from B, C, ... Z, and vice versa.
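
The two-pass idea can be sketched in Python as follows. This is an illustration of the general technique only, not Solr's actual implementation; the function names and the sample shards are hypothetical.

```python
from collections import Counter

def shard_counts(docs):
    """Full term counts for one shard's documents."""
    return Counter(term for doc in docs for term in doc)

def refined_facet(shards, size):
    per_shard = [shard_counts(docs) for docs in shards]

    # Pass 1: every shard reports only its own top `size` terms.
    first_pass = [dict(counts.most_common(size)) for counts in per_shard]
    candidates = sorted(set().union(*first_pass))

    # Pass 2: for each candidate a shard did not report, ask that shard
    # for the exact count (0 if the term does not occur there).
    totals = Counter()
    for counts, reported in zip(per_shard, first_pass):
        for term in candidates:
            if term in reported:
                totals[term] += reported[term]
            else:
                totals[term] += counts[term]  # the second call to this shard
    return dict(totals.most_common(size))

# Hypothetical data: each document is its list of terms.
shards = [
    [["fabric", "beds"], ["fabric", "beds"], ["fabric"]],
    [["beds", "luxury"], ["beds", "luxury"], ["fabric"]],
]

result = refined_facet(shards, size=2)
print(result)
```

With the second pass, every term that made it into some shard's top list gets an exact count. A term that is outside the top list on every shard is still never considered, so the ranking itself can remain approximate.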

@spinscale
Contributor

Looks like a duplicate of #1305
