[CI] Language analyzer docs failure #30557

Closed
romseygeek opened this issue May 14, 2018 · 5 comments · Fixed by #30722
Assignee: cbuescher
Labels: >docs, :Search Relevance/Analysis, Team:Search Relevance, >test-failure, v6.3.0

Comments

@romseygeek (Contributor)

These both reproduce:

REPRODUCE WITH: ./gradlew :docs:integTestRunner \
  -Dtests.seed=33404A59123B3635 \
  -Dtests.class=org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT \
  -Dtests.method="test {yaml=reference/analysis/analyzers/lang-analyzer/line_1146}" \
  -Dtests.security.manager=true \
  -Dtests.locale=pt \
  -Dtests.timezone=Australia/Sydney

REPRODUCE WITH: ./gradlew :docs:integTestRunner \
  -Dtests.seed=33404A59123B3635 \
  -Dtests.class=org.elasticsearch.smoketest.DocsClientYamlTestSuiteIT \
  -Dtests.method="test {yaml=reference/analysis/analyzers/lang-analyzer/line_373}" \
  -Dtests.security.manager=true \
  -Dtests.locale=pt \
  -Dtests.timezone=Australia/Sydney
romseygeek added the >docs, :Search Relevance/Analysis, and v6.3.0 labels on May 14, 2018
@elasticmachine (Collaborator)

Pinging @elastic/es-search-aggs

romseygeek added the >test-failure label on May 14, 2018
@cbuescher (Member)

This looks very much related to the changes made in #29535. Other seeds seem to run fine. I'm digging into the details, but if @nik9000 has any ideas on how to debug this efficiently, I'd appreciate a hint.

cbuescher self-assigned this on May 18, 2018
@cbuescher (Member)

The first failure is related to the Italian analyzers:

Failure at [reference/analysis/analyzers/lang-analyzer:1144]: text differs: italian was [𐒁𐒌𐒥𐒔] but rebuilt_italian was [d'e]. In utf8 those are
   > [f0 90 92 81 f0 90 92 8c f0 90 92 a5 f0 90 92 94] and
   > [64 27 65]

The second is the same input token, but relates to the Catalan analyzer:

Failure at [reference/analysis/analyzers/lang-analyzer:352]: text differs: catalan was [𐒁𐒌𐒥𐒔] but rebuilt_catalan was [d'e]. In utf8 those are
   > [f0 90 92 81 f0 90 92 8c f0 90 92 a5 f0 90 92 94] and
   > [64 27 65]

@cbuescher (Member)

It's a bit tricky to debug this since the context is missing from the error, but I think I managed to isolate the part where the two analyzer outputs begin to differ. I can reproduce this in Kibana; I'm not sure if the copy/paste preserves all "hidden" characters that the test string contains, but anyway:

PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
                "c", "l", "all", "dall", "dell",
                "nell", "sull", "coll", "pell",
                "gl", "agl", "dagl", "degl", "negl",
                "sugl", "un", "m", "t", "s", "v", "d"
          ]
        },
        "italian_stop": {
          "type":       "stop",
          "stopwords":  "_italian_" 
        },
        "italian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["esempio"] 
        },
        "italian_stemmer": {
          "type":       "stemmer",
          "language":   "light_italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer":  "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}

POST /test/_analyze
{
  "analyzer" : "italian",
  "text" : "𐅣 D'e* ᧴᧱᧡, ﹢﹪ 𐒁𐒌𐒥𐒔 ջ՘՘զԲ԰ԽՙԻՕԴՇֈ 𐱍𐰕𐰬𐰪𐰯𐰕𐰉𐰜𐰁𐰫𐱉𐰅𐰾 ꩪꩤ꩹ꩤꩼ ᜦᜦᜫᜦ᜹, 𐡏𐡚𐡗𐡖𐡂𐡞𐡒𐡑      𐡞𐡌𐡗𐡄𐡁𐡓, ᇷᄒᇐᇽ취ᄸ"
}

POST /italian_example/_analyze
{
  "analyzer" : "rebuilt_italian",
  "text" : "𐅣 D'e* ᧴᧱᧡, ﹢﹪ 𐒁𐒌𐒥𐒔 ջ՘՘զԲ԰ԽՙԻՕԴՇֈ 𐱍𐰕𐰬𐰪𐰯𐰕𐰉𐰜𐰁𐰫𐱉𐰅𐰾 ꩪꩤ꩹ꩤꩼ ᜦᜦᜫᜦ᜹, 𐡏𐡚𐡗𐡖𐡂𐡞𐡒𐡑      𐡞𐡌𐡗𐡄𐡁𐡓, ᇷᄒᇐᇽ취ᄸ"
}

The first analyzes to:

{
  "tokens": [
    {
      "token": "𐅣",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "𐒁𐒌𐒥𐒔",
      "start_offset": 16,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "ջ",
      "start_offset": 25,
      "end_offset": 26,
      "type": "<ALPHANUM>",
      "position": 3
    }, [...]

The second analyzes to:

{
  "tokens": [
    {
      "token": "𐅣",
      "start_offset": 0,
      "end_offset": 2,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "d'e",
      "start_offset": 3,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "𐒁𐒌𐒥𐒔",
      "start_offset": 16,
      "end_offset": 24,
      "type": "<ALPHANUM>",
      "position": 2
    }, [...]

So the original italian analyzer seems to swallow one more token. This part is surrounded by many characters that seem to get dropped during analysis, which also makes this hard to debug.
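
As an aside (not something used in this thread, just a suggestion): the `_analyze` API accepts an `explain` flag, and when the tokenizer and filter chain are spelled out inline rather than referenced as a named analyzer, the response's `detail` section lists the token stream after each filter. That can help pinpoint the step where the two outputs diverge. A minimal sketch against the `italian_example` index defined above:

POST /italian_example/_analyze
{
  "tokenizer": "standard",
  "filter": [
    "italian_elision",
    "lowercase",
    "italian_stop",
    "italian_keywords",
    "italian_stemmer"
  ],
  "text": "D'e",
  "explain": true
}

Each entry in `detail.tokenfilters` shows the tokens emitted after that filter, so it should be visible whether `D'e` passes through `italian_elision` unchanged.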

@jimczi (Contributor)

jimczi commented May 18, 2018

I think it's caused by the elision filter, which is case insensitive in the built-in analyzer but not in the rebuilt one. Adding `"articles_case": true` to all the elision filters of the rebuilt analyzers seems to solve the issue (this is already done for the french_rebuilt).
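
For illustration, here is a sketch of that change applied to the `italian_elision` filter from the reproduction above. This is not the exact patch from #30722, the index name is hypothetical, and the definition is trimmed to the parts needed to show the elision behavior. With `articles_case` set to `true`, elision matching becomes case insensitive, so `D'e` is elided the same way as `d'e`:

// "italian_example_fixed" is just an illustrative index name, not from the actual fix
PUT /italian_example_fixed
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles_case": true,
          "articles": [
            "c", "l", "all", "dall", "dell",
            "nell", "sull", "coll", "pell",
            "gl", "agl", "dagl", "degl", "negl",
            "sugl", "un", "m", "t", "s", "v", "d"
          ]
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer": "standard",
          "filter": [
            "italian_elision",
            "lowercase"
          ]
        }
      }
    }
  }
}

Running the earlier `_analyze` request against this index should then no longer emit a `d'e` token, matching the built-in `italian` analyzer.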

jimczi added a commit to jimczi/elasticsearch that referenced this issue May 18, 2018
This commit fixes a docs failure on language analyzers when compared to the built-in analyzers.
The `elision` filters used by the rebuilt language analyzers should be case insensitive to match
the definition of the prebuilt analyzers.

Closes elastic#30557
jimczi added three more commits that referenced this issue May 22, 2018, each with the same message.
javanna added the Team:Search Relevance label on Jul 16, 2024