Replies: 2 comments 11 replies
-
Thanks for reporting. As noted in my email in our previous conversation, I'm going to need a minimal reproduction that showcases the issue. You can attach it here as a zip file. It's likely fixable with a slightly different configuration, but I need to know exactly what you're doing in order to help you.
-
For the record, here is a JS implementation for an arbitrary number of stages using recursion:

```js
function fine_split(inputs, separators) {
  if (separators.length === 0) {
    return inputs
  }
  const [separator, ...remaining_separators] = separators
  return inputs.flatMap(tok => {
    let tokens = tok.split(separator)
    // Only keep the original token if this stage actually split it.
    const add_back_original_token = tokens.length > 1
    tokens = fine_split(tokens, remaining_separators)
    if (add_back_original_token) {
      tokens.push(tok)
    }
    return tokens.filter(tok => tok.length > 0)
  })
}

function tokenize(input) {
  const inputs = Array.isArray(input) ? input : [input]
  // `extract` is defined elsewhere in the tokenizer.
  let tokens = inputs.flatMap(inp => extract(inp))
  // `hierarchical_separators` is a list of regexes, coarsest first.
  const [primary_separator, ...fine_separators] = hierarchical_separators
  tokens = tokens.flatMap(tok => tok.split(primary_separator))
  tokens = fine_split(tokens, fine_separators)
  return tokens
}
```
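A self-contained sketch of how this behaves end to end. The separator list here (whitespace, then `::`, then lowercase-to-uppercase case changes) is an assumed example, and `extract` is stubbed out as identity for the demo:

```js
// Demo of the recursive splitter. Assumptions: `extract` is stubbed as
// identity, and `hierarchical_separators` is a made-up example list.
function fine_split(inputs, separators) {
  if (separators.length === 0) return inputs
  const [separator, ...remaining_separators] = separators
  return inputs.flatMap(tok => {
    let tokens = tok.split(separator)
    const add_back_original_token = tokens.length > 1
    tokens = fine_split(tokens, remaining_separators)
    if (add_back_original_token) tokens.push(tok)
    return tokens.filter(t => t.length > 0)
  })
}

const hierarchical_separators = [/\s+/, /::/, /(?<=[a-z])(?=[A-Z])/]
const extract = s => [s] // identity stub

function tokenize(input) {
  const inputs = Array.isArray(input) ? input : [input]
  let tokens = inputs.flatMap(inp => extract(inp))
  const [primary_separator, ...fine_separators] = hierarchical_separators
  tokens = tokens.flatMap(tok => tok.split(primary_separator))
  return fine_split(tokens, fine_separators)
}

console.log(tokenize("MyClass::doThing now"))
// → ["My", "Class", "MyClass", "do", "Thing", "doThing", "MyClass::doThing", "now"]
```

Note that every intermediate form ("MyClass", "doThing", "MyClass::doThing") ends up in the token list alongside the finest parts, which is exactly the indexing behavior discussed above.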
-
We are experiencing not-so-great search results. I believe the root cause is that compound words like "CamelCase" are not in the search index, since they are split by the tokenizer separator, in this example into "Camel" and "Case".
I want to propose keeping both the subdivided word and the original word in the index.
This could be achieved by tokenizing first into "words", i.e. splitting on whitespace and punctuation but not case changes, and adding those words to the token list. Afterwards, each word can be "subdivided" with another regex that does split on case changes; if any subdivision happened, the parts are added to the token list too.
On the query side, we would apply the "word tokenization", but not necessarily the "subdivision tokenization".
I'm using case change splitting as an example here, but in our case, it's also important to move the splitting on "::" to the "subdivision tokenization" stage to make C++ API docs much more searchable.
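A minimal sketch of what I have in mind. All regexes and function names here are hypothetical, not the project's actual configuration:

```js
// Hypothetical two-stage tokenization sketch.
const word_separator = /\s+/
const subdividers = [/::/, /(?<=[a-z])(?=[A-Z])/] // "::" and case changes

// Stage 2: subdivide a word; keep the original if anything was split off.
function subdivide(word) {
  let parts = [word]
  for (const re of subdividers) {
    parts = parts.flatMap(p => p.split(re))
  }
  parts = parts.filter(p => p.length > 0)
  return parts.length > 1 ? [...parts, word] : parts
}

// Index side: word tokenization + subdivision.
function indexTokens(text) {
  return text.split(word_separator).filter(w => w).flatMap(subdivide)
}

// Query side: word tokenization only, so queries like "CamelCase" or
// "std::sort" still match the original tokens kept in the index.
function queryTokens(text) {
  return text.split(word_separator).filter(w => w)
}

console.log(indexTokens("std::sort uses CamelCase"))
// → ["std", "sort", "std::sort", "uses", "Camel", "Case", "CamelCase"]
```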
Please let me know what you think! Unfortunately I'm not a TypeScript dev or I'd open a PR myself.