
add keywords to docs for better search results #8712

Open
Rich-Harris opened this issue Jan 25, 2023 · 42 comments
Labels: documentation (Improvements or additions to documentation)

@Rich-Harris
Member

Describe the problem

To use @tcc-sejohnson's example: if you search for adapter-static in the docs, the page you're probably looking for — this one — is the fifth result:

[screenshot: docs search results for "adapter-static", with the adapter-static page as the fifth result]

Describe the proposed solution

I think the easiest and most reliable solution would be to add keywords frontmatter to the relevant markdown files, so that if you match one of them (or a keyword starts with your search term) that document is treated as higher priority than all others.

We could indicate the keyword in the UI somehow but I don't think it's necessary.
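As a rough sketch of what "treated as higher priority" could mean (illustrative only; `results`, the `keywords` field and the function name are made up, not actual site code):

// Illustrative sketch: documents whose `keywords` frontmatter matches the query
// by prefix (which covers exact matches too) sort ahead of everything else.
function prioritize_by_keyword(results, query) {
  const q = query.toLowerCase();
  const matches = (doc) => (doc.keywords ?? []).some((k) => k.toLowerCase().startsWith(q));
  // Array.prototype.sort is stable, so ties keep their existing relative order
  return results.sort((a, b) => Number(matches(b)) - Number(matches(a)));
}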

Alternatives considered

No response

Importance

nice to have

Additional Information

No response

@Rich-Harris added the documentation label on Jan 25, 2023
@benmccann
Member

3 of the 4 documents that rank above it don't contain adapter-static a single time. It must be tokenizing it into "adapter" and "static". Perhaps we can either remove - as a delimiter character or special-case the adapter names so that they're treated as a single word.
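For illustration, a tokenizer along those lines might look like this (a sketch only; how it gets wired into flexsearch, e.g. via a custom tokenize/encode option, depends on the version in use):

// Split on whitespace and punctuation but keep `-` inside words,
// so "adapter-static" survives as a single token.
function tokenize(content) {
  return content.toLowerCase().split(/[^a-z0-9-]+/).filter(Boolean);
}

tokenize('Using adapter-static for prerendering');
// [ 'using', 'adapter-static', 'for', 'prerendering' ]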

@benmccann
Member

I think there's another bug as well:

return a.block.breadcrumbs.length - b.block.breadcrumbs.length || a.rank - b.rank;

The problem is that sub-sections rank lower than main pages.

https://kit.svelte.dev/docs/adapter-static#usage - because it has a #, it is automatically pushed to the bottom
https://kit.svelte.dev/docs/configuration - despite not even containing the text adapter-static, it jumps to the top because there is no #

It should probably be grouping followed by ranking, i.e. we group by the page and then rank based on the highest-ranking sub-section, or something like that.
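Something like the following, perhaps (a rough sketch only, reusing the result shape from the comparator above, where a lower rank sorts first; the assumption that the first breadcrumb identifies the page is mine):

// Keep each page's best-ranked block, then order pages by that best block.
function group_then_rank(results) {
  const best_per_page = new Map();
  for (const result of results) {
    const page = result.block.breadcrumbs[0];
    const best = best_per_page.get(page);
    if (!best || result.rank < best.rank) best_per_page.set(page, result);
  }
  return [...best_per_page.values()].sort((a, b) => a.rank - b.rank);
}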

@oodavid

oodavid commented Jan 25, 2023

Might it be best to implement 3rd party search? Algolia is free for open-source, and does a great job of indexing and ranking...

https://www.algolia.com/for-open-source/
https://www.algolia.com/doc/tools/crawler/getting-started/overview/

Edit: Oh, not quite free: 200,000 search requests per month - still, maybe worth budgeting for.

@Rich-Harris
Member Author

I've been meaning to write a blog post about this, but there's a variety of reasons we don't want to use third party search tools:

  • We value your privacy. While we don't have any particular animus or distrust towards Algolia, we don't want to be in the position of trusting any third party to handle your data responsibly, and that includes your search history
  • We don't want to cede control over the UI or the search results. While it's arguably true that Algolia will have better out-of-the-box results than our homegrown setup (which uses flexsearch), we have the ability to improve it and tailor it as we see fit, which we'd lose if we had something generic
  • Search should work without JavaScript, especially when the framework in question preaches progressive enhancement. Ours does (https://kit.svelte.dev/search?q=hello); every other framework's doesn't, because they use Algolia
  • If you use Algolia, every keystroke results in a network request. Depending on where you are relative to Algolia's servers, that will result in latency; regardless of where you are, it will result in unnecessary data usage
  • By extension, search ceases to work if you lose connectivity. The SvelteKit docs don't currently work fully offline, but it's a medium term goal
  • It takes time to index a site. With our approach, search is 100% up to date for every deploy, even preview deploys. That's not true for any site that uses Algolia

@benmccann
Member

The only bullet point I'd comment on before you write this blog post is:

We don't want to cede control over the UI or the search results. While it's arguably true that Algolia will have better out-of-the-box results than our homegrown setup (which uses flexsearch), we have the ability to improve it and tailor it as we see fit, which we'd lose if we had something generic

Flexsearch is incredibly hard to customize relative to Algolia, Elastic, or just about any index I've used in the past. I've spent the morning trying and simply can't understand how Flexsearch's scoring works. I've filed a few issues in the Flexsearch repo asking for more details and hope to come back to this after learning more about how to tweak Flexsearch.

In the meantime, I've sent a PR which just does some housekeeping on our side: #8727

@Rich-Harris
Member Author

Right, but we could swap out flexsearch for something else if we needed to. Hell, we could write our own!

@oodavid

oodavid commented Jan 25, 2023

A very well-reasoned response. Personally I'd put result relevance above all of those points.

I've had some success in the past with Typesense; IIRC it has a rational approach to ranking and relevance. Might be worth a peek:

https://typesense.org/docs/guide/ranking-and-relevance.html

Flexsearch has a list of other libraries, benchmarked:

https://nextapps-de.github.io/flexsearch/bench/

@Rich-Harris added this to the soon milestone on Jan 27, 2023
@enBonnet
Contributor

enBonnet commented Feb 4, 2023

Typesense looks really cool @oodavid.

Could we try implementing it? I'd like to participate.

@Hetarth02

Right, but we could swap out flexsearch for something else if we needed to. Hell, we could write our own!

LunrJs is also good and flexible enough, with good documentation. Other alternatives might be stork.js and fuse.js.

@enBonnet
Contributor

enBonnet commented Feb 5, 2023

LunrJs is also good and flexible enough, with good documentation. Other alternatives might be stork.js and fuse.js.

There are a lot of options; we should focus on the problem we want to solve and look at which one is best for it.

The current problem seems to be how results are prioritized.

@benmccann
Member

benmccann commented Feb 6, 2023

I'm open to alternatives as I don't particularly like flexsearch, but it'd be nice to find one that allows us to keep the functionality we have today. In particular, today you can see results as you type, and many of the tools mentioned above don't appear to support that. The search we use today also does not require any extra infrastructure. I'm not sure if any of the tools mentioned are great fits, but I would love it if someone can find one that fits the bill.

  • typesense - can do prefix-based search. appears to require running a search server. is it going to require extra infrastructure or can we run it as a serverless function on vercel? I see a next.js example but it uses typesense's cloud. It won't work offline in any case
  • lunr - uses bm25. unclear if you can search based off a prefix
  • stork search - has markdown and frontmatter support. appears to index substrings, but not sure you can do prefixes. can boost titles, but otherwise not sure you can control ranking. wasm. isn't on npm. wants to manage the DOM by default, so you need to use the advanced search method to build your own interface
  • fuse.js - has weighted search. unclear if you can search based off a prefix

@Hetarth02

Hetarth02 commented Feb 7, 2023

I am not biased towards lunrjs, but I have been working with it recently and I think it checks off all your requirements.

  • See search results as you type

You can build the indexes once on the initial search on the client side (not a good idea), or you can load a pre-built index file during search. For example, see the Julia docs: they build the indexes on the client side for the initial search.

  • Does not require any extra infrastructure

One can use GitHub's CI/CD to pre-build the index file on every push. (I am actually working on this issue in Documenter.jl.)

@benmccann I didn't get the "search based off prefix" part; can you please explain? (If possible, with a small example.)

@benmccann
Member

What I mean by search off a prefix is this... Imagine that you're typing "adapter". When you start and you type "a" it will show all words beginning with "a", when you get to "ad" it will show all words starting with "ad", and so on. You can see how the search auto-completes in realtime on kit.svelte.dev as you do this.

@Hetarth02

Hetarth02 commented Feb 7, 2023

What I mean by search off a prefix is this... Imagine that you're typing "adapter". When you start and you type "a" it will show all words beginning with "a", when you get to "ad" it will show all words starting with "ad", and so on. You can see how the search auto-completes in realtime on kit.svelte.dev as you do this.

Correct me if I am wrong, but are you perhaps talking about auto-complete?

Would this be something we are looking for?

Autocomplete library by Algolia

@benmccann
Member

It's a bit different than autocomplete. It's not completing your queries. Rather it's doing searches based on partial query strings. E.g. to take the "adapter" example from earlier, the way it works is by indexing "a", "ad", "ada", "adap", "adapt", "adapte", "adapter". This takes a lot more memory, but provides the experience you see today on kit.svelte.dev.
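In other words, something along these lines (a toy illustration, not the actual implementation):

// Every prefix of a word points at the documents containing it,
// trading memory for instant results on partial queries.
const prefix_index = {};

function index_word(word, doc_id) {
  for (let i = 1; i <= word.length; i++) {
    const prefix = word.slice(0, i);
    (prefix_index[prefix] ??= new Set()).add(doc_id);
  }
}

index_word('adapter', '/docs/adapter-static');
console.log([...prefix_index['adap']]); // [ '/docs/adapter-static' ]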

@Hetarth02

Hetarth02 commented Feb 7, 2023

It's a bit different than autocomplete. It's not completing your queries. Rather it's doing searches based on partial query strings. E.g. to take the "adapter" example from earlier, the way it works is by indexing "a", "ad", "ada", "adap", "adapt", "adapte", "adapter". This takes a lot more memory, but provides the experience you see today on kit.svelte.dev.

I see. Then perhaps this is what we are looking for:

Wildcards Lunrjs

I think this can reproduce the same functionality you are talking about.
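For example, a trailing wildcard could emulate search-as-you-type (a sketch with a made-up document, not the real docs index):

import lunr from 'lunr';

// lunr splits on whitespace and hyphens by default, so "adapter-static"
// is indexed as the tokens "adapter" and "static".
const idx = lunr(function () {
  this.ref('slug');
  this.field('title');
  this.field('content');

  this.add({
    slug: '/docs/adapter-static',
    title: 'adapter-static',
    content: 'Static site generation'
  });
});

// matches any indexed word that starts with the typed prefix
console.log(idx.search('adap*'));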

@benmccann
Member

Ah, yes! Thanks for the pointer. Lunrjs may indeed work then!

I'd be happy to review any attempt to switch out flexsearch for lunrjs if anyone wants to take a stab at it.

@Hetarth02

Hetarth02 commented Feb 7, 2023

Ah, yes! Thanks for the pointer. Lunrjs may indeed work then!

I'd be happy to review any attempt to switch out flexsearch for lunrjs if anyone wants to take a stab at it.

I can try to make a prototype. Can anyone guide me through the steps to set up the docs code locally?

@benmccann @enBonnet

@Rich-Harris
Member Author

Can anyone guide me through the steps to set up the docs code locally?

You'll need to have pnpm installed, then...

git clone git@github.com:sveltejs/kit
cd kit
pnpm install
cd sites/kit.svelte.dev
pnpm dev

...and you should be off to the races!

@Rich-Harris
Member Author

One thing I'll note is that the web worker that powers our current search — which includes all of flexsearch plus our logic that sits around it — is 18kb of unminified code (though it probably should be minified, not sure why it isn't).

By contrast, lunr by itself weighs 99kb. Probably not a dealbreaker but something to be conscious of.

@kevmodrome

I suspect you want to keep the search locally on the client, but if you're looking for an alternative to algolia there's meilisearch: https://docs.meilisearch.com - though 11kb minified+zipped

@benmccann
Member

lunr is only 29k minified, so it's not too bad. The thing that I just noticed that gives me more hesitation is that it appears to be basically abandoned: it hasn't been updated since 2020, it still uses Travis CI, there are a number of unreviewed PRs, etc. It'd be nice if we could find something that's a bit better maintained.

@benmccann
Member

https://github.com/lucaong/minisearch looks like a promising option. It'd probably be better to try it than lunr.

@Hetarth02

https://github.com/lucaong/minisearch looks like a promising option. It'd probably be better to try it than lunr.

Thanks for your suggestion; I will try to use it.

@gtm-nayan
Contributor

I was trying out minisearch and elasticlunr yesterday. @Hetarth02, you can continue from those branches if it saves you some setup time.

You'll need to be using Chrome for this, btw, since Firefox doesn't yet support module workers.

@Hetarth02

Hetarth02 commented Feb 17, 2023

@gtm-nayan Thanks for your help. By the way, did you get any noticeable results from using minisearch? Also, if you want, we can coordinate with each other and work on this together.

@gtm-nayan
Contributor

gtm-nayan commented Feb 17, 2023

Minisearch gives out a lot more results than our current setup, but I think that's due to the combineWith setting: changing it to "AND" reduces the number of results, but there's no way to do that on a per-field basis. Minisearch did improve the query originally mentioned in this issue, i.e. searching for adapter-static leads to the static site generation page. I haven't seen any glaring problems yet, but I still have to test other common queries.
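For reference, the setting in question looks roughly like this (a sketch; the id field, indexed fields and document shape here are hypothetical, not the playground code):

import MiniSearch from 'minisearch';

const mini = new MiniSearch({
  idField: 'href',                // hypothetical: use the section URL as the document id
  fields: ['title', 'content'],   // fields to index
  storeFields: ['href', 'title'], // fields returned with each result
  searchOptions: {
    prefix: true,      // match partial words as you type
    combineWith: 'AND' // require every query term to match (global, not per-field)
  }
});

mini.addAll(blocks); // `blocks` stands in for the indexed doc sections
const results = mini.search('adapter static');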

@boian-ivanov

There's an up-and-coming in-memory search engine, built from the ground up to be performant for full-text search, called Lyra. The project seems quite intuitive and the people behind it are constantly improving it. It might be worth giving it a shot for the docs 🤔

@gtm-nayan
Contributor

gtm-nayan commented Jun 7, 2023

Here's a playground of sorts for lyra, now called orama, https://stackblitz.com/edit/stackblitz-starters-eraanr?file=index.mjs

run node index.mjs "query goes here"

It would be great if folks could help with the evaluation, i.e. compare the results it gives for something you searched recently against the current setup on kit.svelte.dev and share the findings here.

@karimfromjordan
Contributor

I just tried it out:

❯ node index.mjs "ssr"
ssr
[
  '/docs/single-page-apps#prerendering-individual-pages',
  '/docs/types#public-types-server',
  '/docs/page-options#prerender-prerender-and-ssr',
  '/docs/routing#layout-layout-server-js',
  '/docs/page-options#csr',
  '/docs/routing#page-page-svelte',
  '/docs/types#public-types-ssrmanifest',
  '/docs/state-management#using-stores-with-context',
  '/docs/routing#layout-layout-js',
  '/docs/load#universal-vs-server-when-does-which-load-function-run'
]

In the current docs the first result is /docs/page-options#ssr which doesn't seem to be included here in the search results.

@benmccann
Member

Here's a playground of sorts for lyra, now called orama, https://stackblitz.com/edit/stackblitz-starters-eraanr?file=index.mjs

Amazing! Thank you @gtm-nayan!

I just tried it out:
node index.mjs "ssr"
In the current docs the first result is /docs/page-options#ssr which doesn't seem to be included here in the search results.

Hmm, that's funny. I just tried the command you shared and that page was the second result. Perhaps @gtm-nayan made some improvements.

I also tested against the string "adapter-static", which was the original one filed here, and it returned the "Static Site Generation" section first, as expected.

One that could be better is "assets". I was hoping to see the asset handling page returned higher. Turning on stemming helped quite a bit, and boosting the breadcrumbs helped some as well, as shown below. I think we may be able to do still better by splitting the breadcrumb into fields like h1, h2, h3, so that we can give a higher boost to larger headings. Right now we can't do that, which makes it really hard to get the asset handling page ranked first, since other chunks have the term "assets" in their lower headings.

// imports assumed to come from the @orama/orama package (Orama v1 API)
import { create, insertMultiple, search } from '@orama/orama';

const index = await create({
  schema: {
    breadcrumbs: 'string[]',
    content: 'string',
  },
  components: {
    tokenizer: { language: 'english', stemming: true },
  },
});

// `blocks` are the pre-chunked doc sections, `query` is the search term
await insertMultiple(index, blocks);

const results = await search(index, {
  term: query,
  boost: {
    breadcrumbs: 2,
  },
});
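For what it's worth, the h1/h2/h3 split could look something like this as a standalone variant of the snippet above (same Orama calls; the field names and boost values are just a sketch):

const index = await create({
  schema: {
    h1: 'string',
    h2: 'string',
    h3: 'string',
    content: 'string',
  },
  components: {
    tokenizer: { language: 'english', stemming: true },
  },
});

// map each block's breadcrumbs onto heading fields (assumes at most three levels)
await insertMultiple(
  index,
  blocks.map(({ breadcrumbs, content }) => ({
    h1: breadcrumbs[0] ?? '',
    h2: breadcrumbs[1] ?? '',
    h3: breadcrumbs[2] ?? '',
    content,
  }))
);

const results = await search(index, {
  term: query,
  boost: { h1: 4, h2: 2, h3: 1.5 },
});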

Orama is vastly better than flexsearch from an API perspective. I love how easy it is to boost a field, which I couldn't figure out with flexsearch, if I recall correctly.

Another thing I noticed is that we may want to do something to put the migration guide and possibly the types towards the end of the search results. I know we at least discussed that with the current doc search, but can't remember if we implemented it or not.

A small thing I noticed independent of which library we use is that we divide most of the articles into very small chunks, but then leave the config page as a single chunk. That could be worth tweaking.

Finally, something I just noticed in their docs is that they have grouping functionality. I know we do some grouping on the results after they're returned, so it might be interesting to see if this feature would be useful to us: https://docs.oramasearch.com/usage/search/grouping

Overall, I'd love to switch to Orama. It seems way easier to use, so if we need to make any tweaks I'm a lot more confident we'll be able to do that. Also, my questions in the flexsearch repo have gone unanswered and flexsearch has no commits this year whereas orama seems much more actively developed.

@PuruVJ
Contributor

PuruVJ commented Jun 12, 2023

I'm a bit skeptical about Orama based on this askorama/orama#76 (comment)

Unless this has been resolved somehow?

@benmccann
Member

That doesn't seem like a deal breaker to me. While it'd be nice if it took into account whether words are found consecutively, I'm not sure how often that would result in different search rankings, and there are other ways in which Orama's search is better. I would expect that having the scoring take into account whether a term is found in the first heading, second heading, or content would have a larger effect on search quality, and Orama beats flexsearch there.

I actually think the most interesting part of that post is that it lists several other options that I'm not sure we've investigated yet. The ones it lists as potentially performing better than Orama in that one particular benchmark (which is not terribly representative of actual usage) are bulksearch, jsii, wade, and js-search, so it might be worth checking those out as well.

@gtm-nayan
Contributor

Size would be a factor as well: the current search implementation on the kit site, including flexsearch and the components, is about 24 kB minified; just the playground I linked above is about 55.7 kB after minification.

@benmccann
Member

  • jsii says it's not maintained
  • js-search can't do per-field boosting
  • wade doesn't appear to do stemming, substring matching, or per-field boosting
  • bulksearch doesn't appear to do stemming or per-field boosting

I still think Orama is going to give us the best results, especially since the test in askorama/orama#76 (comment) is so unlike our use case: it was searching over all of Harry Potter, so there are going to be tons more matches for any query, and it doesn't have headings that can be used for per-field boosting.

Orama is larger, though it also uses a lot less memory. I don't think we're likely to find a single library that wins across all metrics.

@benmccann
Member

I filed an issue with a suggestion for making Orama a bit smaller: askorama/orama#418

I think we could also mitigate it on our end by loading the search functionality in onMount so that we don't block the page load.
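Roughly like this, for example (the module path and init function are made-up names; the point is just the dynamic import inside onMount):

import { onMount } from 'svelte';

// inside the search component: defer loading the search code until after
// the page has mounted, so it doesn't block the initial load
onMount(async () => {
  const { init_search } = await import('./search.js'); // hypothetical module
  await init_search();
});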

@micheleriva

micheleriva commented Jun 12, 2023

Hi @benmccann, Orama author here. Thanks a lot for considering Orama! Other frameworks and libraries are migrating to Orama for their documentation (think of Fastify, Platformatic, and more).

The benchmark posted above is not representative of the current status of Orama performance-wise, and it's based on an older version (when it was pre-1.0.0 and it was called Lyra). We're now over v1.0.0 with stable APIs and significant performance improvements.

It can work 100% client-side, so you own and manage your data.

I hear your concerns about the bundle size, and I'd love to take this as an opportunity to optimize it, starting from your use case.

You have all of my and my company's support for this. I'll continue the conversation in the Orama repo on your issue askorama/orama#418 🙂

@leeoniya

a bit late to the party, but i'll drop this here:

https://github.com/leeoniya/uFuzzy#a-biased-appraisal-of-similar-work

i'd be interested in benchmarking Orama vs Lyra. lyra didn't come out very fast in my tests, though they're biased towards partial substring matches. Flexsearch is for sure the performance king if you can spare the ram for a giant index.

@benmccann
Member

Sharing an update here from askorama/orama#418:

We just published Orama v1.0.7 and went from 20kb to 13kb gzipped. If you import the search function only, it will cost around 4.88kb gzipped (it was around 11kb yesterday).

@benmccann
Member

@gtm-nayan your stackblitz isn't working for me anymore. Is there a way to go back to a working version of it?

@gtm-nayan
Contributor

Whoops, forgot to revert the changes after checking the bundle size. Fixed now.

@PuruVJ
Contributor

PuruVJ commented Jun 29, 2023

sveltejs/site-kit#162
