Replies: 2 comments 11 replies
-
Thanks for reporting. As noted in my email in our previous conversation, I'm going to need a minimal reproduction that showcases the issue. You can attach it here as a zip file. It's likely fixable with a slightly different configuration, but I need to know exactly what you're doing in order to help you.
-
For the record, here is a JS implementation for an arbitrary number of stages using recursion:

```js
function fine_split(inputs, separators) {
  if (separators.length === 0) {
    return inputs
  }
  const [separator, ...remaining_separators] = separators
  return inputs.flatMap(tok => {
    let tokens = tok.split(separator)
    // Only keep the original token if this stage actually split it.
    const add_back_original_token = tokens.length > 1
    tokens = fine_split(tokens, remaining_separators)
    if (add_back_original_token) {
      tokens.push(tok)
    }
    return tokens.filter(tok => tok.length > 0)
  })
}

function tokenize(input) {
  const inputs = Array.isArray(input) ? input : [input]
  // `extract` is defined elsewhere in the tokenizer.
  let tokens = inputs.flatMap(inp => extract(inp))
  // `hierarchical_separators` is a list of regexes, coarsest first.
  const [primary_separator, ...fine_separators] = hierarchical_separators
  tokens = tokens.flatMap(tok => tok.split(primary_separator))
  tokens = fine_split(tokens, fine_separators)
  return tokens
}
```
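A self-contained sketch of how this behaves end to end. The separator list here (whitespace, then `::`, then lowercase-to-uppercase case changes) is an assumed example, and `extract` is stubbed out as identity for the demo:

```js
// Demo of the recursive splitter. Assumptions: `extract` is stubbed as
// identity, and `hierarchical_separators` is a made-up example list.
function fine_split(inputs, separators) {
  if (separators.length === 0) return inputs
  const [separator, ...remaining_separators] = separators
  return inputs.flatMap(tok => {
    let tokens = tok.split(separator)
    const add_back_original_token = tokens.length > 1
    tokens = fine_split(tokens, remaining_separators)
    if (add_back_original_token) tokens.push(tok)
    return tokens.filter(t => t.length > 0)
  })
}

const hierarchical_separators = [/\s+/, /::/, /(?<=[a-z])(?=[A-Z])/]
const extract = s => [s] // identity stub

function tokenize(input) {
  const inputs = Array.isArray(input) ? input : [input]
  let tokens = inputs.flatMap(inp => extract(inp))
  const [primary_separator, ...fine_separators] = hierarchical_separators
  tokens = tokens.flatMap(tok => tok.split(primary_separator))
  return fine_split(tokens, fine_separators)
}

console.log(tokenize("MyClass::doThing now"))
// → ["My", "Class", "MyClass", "do", "Thing", "doThing", "MyClass::doThing", "now"]
```

Note that every intermediate form ("MyClass", "doThing", "MyClass::doThing") ends up in the token list alongside the finest parts, which is exactly the indexing behavior discussed above.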
-
We are experiencing not-so-great search results. I believe the root cause is that compound words like "CamelCase" are not in the search index, since they are split by the tokenizer separator, in this example into "Camel" and "Case".
I want to propose keeping both the subdivided word and the original word in the index.
This could be achieved by tokenizing first into "words", i.e. splitting on whitespace and punctuation but not case changes, and adding those words to the token list. Afterwards, each word can be "subdivided" with another regex that does split on case changes; if any subdivision happened, the parts are added to the token list too.
On the query side, we would apply the "word tokenization", but not necessarily the "subdivision tokenization".
I'm using case change splitting as an example here, but in our case, it's also important to move the splitting on "::" to the "subdivision tokenization" stage to make C++ API docs much more searchable.
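A minimal sketch of what I have in mind. All regexes and function names here are hypothetical, not the project's actual configuration:

```js
// Hypothetical two-stage tokenization sketch.
const word_separator = /\s+/
const subdividers = [/::/, /(?<=[a-z])(?=[A-Z])/] // "::" and case changes

// Stage 2: subdivide a word; keep the original if anything was split off.
function subdivide(word) {
  let parts = [word]
  for (const re of subdividers) {
    parts = parts.flatMap(p => p.split(re))
  }
  parts = parts.filter(p => p.length > 0)
  return parts.length > 1 ? [...parts, word] : parts
}

// Index side: word tokenization + subdivision.
function indexTokens(text) {
  return text.split(word_separator).filter(w => w).flatMap(subdivide)
}

// Query side: word tokenization only, so queries like "CamelCase" or
// "std::sort" still match the original tokens kept in the index.
function queryTokens(text) {
  return text.split(word_separator).filter(w => w)
}

console.log(indexTokens("std::sort uses CamelCase"))
// → ["std", "sort", "std::sort", "uses", "Camel", "Case", "CamelCase"]
```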
Please let me know what you think! Unfortunately I'm not a TypeScript dev or I'd open a PR myself.