add keywords to docs for better search results #8712
Comments
3 of the 4 documents that rank above it don't contain adapter-static a single time. It must be tokenizing it into "adapter" and "static". Perhaps we can either remove |
I think there's another bug as well:
The problem is that sub-sections rank lower than main pages. https://kit.svelte.dev/docs/adapter-static#usage - because it has a It should probably be grouping followed by ranking. I.e. we group by the page and then rank based on the highest ranking sub-section or something like that. |
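The "grouping followed by ranking" idea above could be sketched roughly as follows. This is only an illustration: the `Hit` shape, the `score` field, and the `groupThenRank` helper are hypothetical and are not taken from the site's actual search code.

```typescript
// Hypothetical shape of a search hit: one entry per docs sub-section.
interface Hit {
  href: string;  // e.g. "/docs/adapter-static#usage"
  score: number; // higher is better; assumed to come from the search library
}

interface PageGroup {
  page: string; // href without the "#fragment"
  best: number; // score of the best sub-section on this page
  hits: Hit[];
}

// Group hits by page first, then order the groups by their best sub-section.
function groupThenRank(hits: Hit[]): PageGroup[] {
  const groups = new Map<string, PageGroup>();
  for (const hit of hits) {
    const page = hit.href.replace(/#.*$/, '');
    let group = groups.get(page);
    if (group === undefined) {
      group = { page: page, best: hit.score, hits: [] };
      groups.set(page, group);
    }
    group.hits.push(hit);
    if (hit.score > group.best) group.best = hit.score;
  }
  const result: PageGroup[] = [];
  groups.forEach((group) => {
    // Within a page, show the strongest sub-section first.
    group.hits.sort((a, b) => b.score - a.score);
    result.push(group);
  });
  return result.sort((a, b) => b.best - a.best);
}

const ranked = groupThenRank([
  { href: '/docs/adapters', score: 0.6 },
  { href: '/docs/adapter-static#usage', score: 0.9 },
  { href: '/docs/adapter-static', score: 0.4 },
]);
console.log(ranked.map((group) => group.page));
```

With this ordering, a page whose main entry scores poorly can still rank first if one of its sub-sections is a strong match.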
Might it be best to implement 3rd party search? Algolia is free for open-source, and does a great job of indexing and ranking... https://www.algolia.com/for-open-source/ Edit: Oh, not quite free: 200,000 search requests per month - still, maybe worth budgeting for. |
I've been meaning to write a blog post about this, but there's a variety of reasons we don't want to use third party search tools:
|
The only bullet point I'd comment on before you write this blog post is:
Flexsearch is incredibly hard to customize relative to Algolia, Elastic, or just about any index I've used in the past. I've spent the morning trying and simply can't understand how Flexsearch's scoring works. I've filed a few issues in the Flexsearch repo asking for more details and hope to come back to this after getting some more details about how to tweak Flexsearch. In the meantime, I've sent a PR which just does some housekeeping on our side: #8727 |
Right, but we could swap out flexsearch for something else if we needed to. Hell, we could write our own! |
A very well reasoned response. Personally I'd put results relevance above all of those points. I've had some success in the past with Typesense, IIRC it has a rational approach to ranking and relevance. Might be worth a peek: https://typesense.org/docs/guide/ranking-and-relevance.html Flexsearch has a list of other libraries, benchmarked: |
Typesense looks really cool @oodavid. Could we try implementing it? I'd like to participate. |
|
There are a lot of options, we should put our focus on the problem we wanna resolve and look at which one is the best for doing it. The current problem seems to be the priorities. |
I'm open to alternatives as I don't particularly like flexsearch, but it'd be nice to find one that allows us to keep the functionality that we have today. In particular today you can see results as you type and many of the tools mentioned above don't appear to support that. The search we use today also does not require any extra infrastructure. I'm not sure if any of the tools mentioned are great fits, but would love if someone can find one that fits the bill.
|
I am not biased towards lunrjs but I have been working with it currently and I think it checks off all your requirements.
@benmccann I didn't get the "search based off prefix" part. Can you please explain? (If possible, with a small example.) |
What I mean by search off a prefix is this... Imagine that you're typing "adapter". When you start and you type "a" it will show all words beginning with "a", when you get to "ad" it will show all words starting with "ad", and so on. You can see how the search auto-completes in realtime on kit.svelte.dev as you do this. |
Correct me if I am wrong but are you perhaps talking about auto-complete? Would this be something we are looking for? |
It's a bit different than autocomplete. It's not completing your queries. Rather it's doing searches based on partial query strings. E.g. to take the "adapter" example from earlier, the way it works is by indexing "a", "ad", "ada", "adap", "adapt", "adapte", "adapter". This takes a lot more memory, but provides the experience you see today on kit.svelte.dev. |
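A toy version of this prefix indexing might look like the sketch below. It is not how flexsearch implements it internally; `prefixes`, `buildPrefixIndex`, and `lookup` are hypothetical helpers that only illustrate why every partial query string gets its own index entry (and why memory use grows).

```typescript
// Every leading substring of a word: "ada" -> ["a", "ad", "ada"].
function prefixes(word: string): string[] {
  const out: string[] = [];
  for (let i = 1; i <= word.length; i++) out.push(word.slice(0, i));
  return out;
}

interface Doc {
  id: string;
  text: string;
}

// Toy inverted index: prefix -> ids of documents containing a word that
// starts with that prefix.
function buildPrefixIndex(docs: Doc[]): Map<string, Set<string>> {
  const index = new Map<string, Set<string>>();
  for (const doc of docs) {
    const words = doc.text.toLowerCase().split(/[^a-z0-9]+/).filter((w) => w.length > 0);
    for (const word of words) {
      for (const prefix of prefixes(word)) {
        let ids = index.get(prefix);
        if (ids === undefined) {
          ids = new Set<string>();
          index.set(prefix, ids);
        }
        ids.add(doc.id);
      }
    }
  }
  return index;
}

// A partial query is a plain key lookup, so results can update on each keystroke.
function lookup(index: Map<string, Set<string>>, query: string): string[] {
  const ids = index.get(query.toLowerCase());
  const out: string[] = [];
  if (ids !== undefined) ids.forEach((id) => out.push(id));
  return out.sort();
}

const prefixIndex = buildPrefixIndex([
  { id: '/docs/adapter-static', text: 'adapter static' },
  { id: '/docs/advanced-routing', text: 'advanced routing' },
]);
console.log(lookup(prefixIndex, 'ad'));   // both pages
console.log(lookup(prefixIndex, 'adap')); // only /docs/adapter-static
```

The word "adapter" alone adds seven keys to the index, which is the memory cost mentioned above.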
I see. Then perhaps this is what we are looking for? I think it can reproduce the functionality you are describing. |
Ah, yes! Thanks for the pointer. Lunrjs may indeed work then! I'd be happy to review any attempt to switch out flexsearch for lunrjs if anyone wants to take a stab at it. |
I can try to make a prototype. Can anyone guide me through the steps to set up the docs code locally? |
You'll need to have pnpm installed, then...
...and you should be off to the races! |
One thing I'll note is that the web worker that powers our current search — which includes all of By contrast, lunr by itself weighs 99kb. Probably not a dealbreaker but something to be conscious of. |
I suspect you want to keep the search locally on the client, but if you're looking for an alternative to algolia there's meilisearch: https://docs.meilisearch.com - though 11kb minified+zipped |
lunr is only 29k minified, so it's not too bad. The thing that I just noticed that gives me more hesitation is that it appears to basically be abandoned. It hasn't been updated since 2020, it still uses Travis CI, there's a number of unreviewed PRs, etc. It'd be nice if we could find something that's a bit better maintained |
https://github.com/lucaong/minisearch looks like a promising option. It'd probably be better to try it than lunr |
Thanks for your suggestion, I will try to use this. |
I was trying out minisearch and elasticlunr yesterday, @Hetarth02 you can continue from those branches if it saves you some setup time. You'll need to be using Chrome for this btw since Firefox doesn't yet support module workers. |
@gtm-nayan Thanks for your help. By the way, did you get any noticeable results from using minisearch? Also, if you want, we can coordinate with each other and work on this topic together. |
Minisearch gives out a lot more results than our current setup but I think that's due to the |
There's an up-and-coming in-memory search engine called Lyra, built from the ground up to be performant for full-text search. The project seems quite intuitive and the people behind it are constantly improving it. It might be worth giving it a shot for the docs 🤔 |
Here's a playground of sorts for lyra, now called orama, https://stackblitz.com/edit/stackblitz-starters-eraanr?file=index.mjs run would be great if folks could help with the evaluation, i.e. compare the results it gives for something you searched recently against the current setup on kit.svelte.dev and share the findings here |
I just tried it out: ❯ node index.mjs "ssr"
ssr
[
'/docs/single-page-apps#prerendering-individual-pages',
'/docs/types#public-types-server',
'/docs/page-options#prerender-prerender-and-ssr',
'/docs/routing#layout-layout-server-js',
'/docs/page-options#csr',
'/docs/routing#page-page-svelte',
'/docs/types#public-types-ssrmanifest',
'/docs/state-management#using-stores-with-context',
'/docs/routing#layout-layout-js',
'/docs/load#universal-vs-server-when-does-which-load-function-run'
] In the current docs the first result is |
Amazing! Thank you @gtm-nayan!
Hmm. That's funny. I just tried the command you shared and that page was the second result. Perhaps @gtm-nayan made some improvements. I also tested against the string "adapter-static", which was the original one filed here, and it returned the "Static Site Generation" section first as expected. One that could be better is "assets". I was hoping to see the asset handling page returned higher. Turning on stemming helped quite a bit and boosting the breadcrumbs helped some as well as shown below. I think we may be able to do still better by splitting the breadcrumb into fields like
Orama is vastly better than flexsearch from an API perspective. I love how easy it is to boost a field, which I couldn't figure out with flexsearch if I recall.
Another thing I noticed is that we may want to do something to put the migration guide and possibly the types towards the end of the search results. I know we at least discussed that with the current doc search, but can't remember if we implemented it or not.
A small thing I noticed independent of which library we use is that we divide most of the articles into very small chunks, but then leave the config page as a single chunk. That could be worth tweaking.
Finally, something I just noticed in their docs is that they have a grouping functionality. I know we do some grouping on the results after they're returned, so it might be interesting to see if this feature would be useful to us: https://docs.oramasearch.com/usage/search/grouping
Overall, I'd love to switch to Orama. It seems way easier to use, so if we need to make any tweaks I'm a lot more confident we'll be able to do that. Also, my questions in the flexsearch repo have gone unanswered and flexsearch has no commits this year whereas orama seems much more actively developed. |
I'm a bit skeptical about Orama based on this askorama/orama#76 (comment) Unless this has been resolved somehow? |
That doesn't seem like a deal breaker to me. While it'd be nice if it would take into account whether words are found consecutively, I'm not sure how often that would result in different search rankings and there are other ways in which Orama's search is better. I would expect that being able to have the scoring take into account whether a term is found in the first heading, second heading, or content would have a larger effect on search quality and Orama beats flexsearch there. I actually think the most interesting part of that post is that it lists several other options that I'm not sure we've investigated yet. The ones it lists as potentially performing better than Orama in that one particular benchmark (which is not terribly representative of actual usage) are bulksearch, jsii, wade, and js-search, so it might be worth checking those out as well. |
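To make the per-field scoring idea concrete, here is a minimal sketch written without any search library. It is not Orama's API or scoring formula; the `Section` fields, the `weights` values, and the helper names are all assumptions chosen for illustration.

```typescript
// Hypothetical indexed unit: one docs section split into separate fields.
interface Section {
  href: string;
  h1: string;      // first heading (page title)
  h2: string;      // second heading (section title)
  content: string; // body text
}

// Assumed weights: a match in a heading counts for more than one in the body.
const weights = { h1: 3, h2: 2, content: 1 };

// Count exact, case-insensitive token matches of `term` in `text`.
function countMatches(text: string, term: string): number {
  const target = term.toLowerCase();
  const tokens = text.toLowerCase().split(/[^a-z0-9]+/);
  let count = 0;
  for (const token of tokens) {
    if (token === target) count++;
  }
  return count;
}

// Weighted sum of matches across the three fields.
function scoreSection(section: Section, term: string): number {
  return (
    weights.h1 * countMatches(section.h1, term) +
    weights.h2 * countMatches(section.h2, term) +
    weights.content * countMatches(section.content, term)
  );
}

const sections: Section[] = [
  {
    href: '/docs/single-page-apps#usage',
    h1: 'Single-page apps',
    h2: 'Usage',
    content: 'Disable ssr in the root layout.',
  },
  {
    href: '/docs/page-options#ssr',
    h1: 'Page options',
    h2: 'ssr',
    content: 'Controls whether pages are rendered on the server.',
  },
];

const byScore = sections
  .slice()
  .sort((a, b) => scoreSection(b, 'ssr') - scoreSection(a, 'ssr'));
console.log(byScore.map((section) => section.href));
```

Both sections mention the term exactly once, so an unweighted count would tie them; the heading weight is what moves the `ssr` section to the top.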
Size would be a factor as well, the current search implementation on the kit site, including flexsearch and the components is about 24kB minified, just the playground I linked above is about 55.7 kB after minification. |
jsii says it's not maintained.
I still think Orama is going to give us the best results, especially since the test in askorama/orama#76 (comment) is so unlike our use case. It was searching over all of Harry Potter, so there are going to be tons more matches for any query, and it doesn't have headings that can be used for per-field boosting.
Orama is larger, though it also uses a lot less memory. I don't think we're likely to find a single library that wins across all metrics. |
I filed an issue with a suggestion for making Orama a bit smaller: askorama/orama#418 I think we could also mitigate it on our end by loading the search functionality in |
Hi @benmccann, Orama author here. Thanks a lot for considering Orama! Other frameworks and libraries are migrating to Orama for their documentation (think of Fastify, Platformatic, and more). The benchmark posted above is not representative of the current status of Orama performance-wise, and it's based on an older version (when it was pre-1.0.0 and it was called Lyra). We're now over v1.0.0 with stable APIs and significant performance improvements. It can work 100% client-side, so you own and manage your data. I hear your concerns about the bundle size, and I'd love to take this as an opportunity to optimize it, starting from your use case. You have all of my and my company's support for this. I'll continue the conversation in the Orama repo on your issue askorama/orama#418 🙂 |
a bit late to the party, but i'll drop this here: https://github.com/leeoniya/uFuzzy#a-biased-appraisal-of-similar-work i'd be interested in benchmarking Orama vs Lyra. lyra didn't come out very fast in my tests, though they're biased towards partial substring matches. Flexsearch is for sure the performance king if you can spare the ram for a giant index. |
Sharing an update here from askorama/orama#418:
|
@gtm-nayan your stackblitz isn't working for me anymore. Is there a way to go back to a working version of it? |
Whoops, forgot to revert the changes after checking the bundle size. Fixed now. |
Describe the problem
To use @tcc-sejohnson's example: if you search for adapter-static in the docs, the page you're probably looking for (this one) is the fifth result:
Describe the proposed solution
I think the easiest and most reliable solution would be to add keywords frontmatter to the relevant markdown files, so that if you match one of them (or a keyword starts with your search term) that document is treated as higher priority than all others.
We could indicate the keyword in the UI somehow but I don't think it's necessary.
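The keyword priority proposed above could work roughly like this sketch. The `DocEntry` shape, the `keywords` array (assumed to be parsed from frontmatter elsewhere), and both helper functions are hypothetical; the existing `score` is assumed to come from whatever search library is in use.

```typescript
// Hypothetical document entry; `keywords` would be read from frontmatter.
interface DocEntry {
  href: string;
  keywords: string[];
  score: number; // relevance from the underlying search library (assumed)
}

// True if any keyword equals the query or starts with it.
function matchesKeyword(doc: DocEntry, query: string): boolean {
  const q = query.trim().toLowerCase();
  if (q.length === 0) return false;
  for (const keyword of doc.keywords) {
    if (keyword.toLowerCase().indexOf(q) === 0) return true;
  }
  return false;
}

// Keyword matches always come first; ties fall back to the library's score.
function rankWithKeywords(docs: DocEntry[], query: string): DocEntry[] {
  return docs.slice().sort((a, b) => {
    const aKeyword = matchesKeyword(a, query) ? 1 : 0;
    const bKeyword = matchesKeyword(b, query) ? 1 : 0;
    if (aKeyword !== bKeyword) return bKeyword - aKeyword;
    return b.score - a.score;
  });
}

const entries: DocEntry[] = [
  { href: '/docs/adapters', keywords: [], score: 0.9 },
  { href: '/docs/adapter-static', keywords: ['adapter-static'], score: 0.3 },
];
console.log(rankWithKeywords(entries, 'adapter-st').map((doc) => doc.href));
```

Because the keyword check runs before the score comparison, a page with a matching keyword outranks higher-scoring pages even while the query is still being typed.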
Alternatives considered
No response
Importance
nice to have
Additional Information
No response