Add custom weighting functionality #13

Closed
bglw opened this issue May 31, 2022 · 10 comments
Labels
Available on prerelease, improvement (Not a bug), Pagefind CLI (The CLI responsible for indexing content), Pagefind Search (The web-facing search js/wasm API)

@bglw
Contributor

bglw commented May 31, 2022

This could involve:

  • Weighting certain HTML elements above others (e.g. h1 > p)
  • Providing custom "boosts" to certain pages for certain terms
@bglw added the Pagefind CLI, Pagefind Search, and improvement labels Jul 26, 2022
@bglw
Contributor Author

bglw commented Aug 23, 2022

Some evolved thoughts: This could be handled at the tagging level, to achieve something like:

<h1 data-pagefind-weight="2">More important text</h1>
<p>Normal text</p>

@simonwiles

Even just the ability to weight the title higher than the body text would be a very welcome addition!

@bglw added this to the v1.0.0 milestone Mar 1, 2023
@bglw
Contributor Author

bglw commented May 17, 2023

(Self note: related: #304)

@bglw
Contributor Author

bglw commented May 26, 2023

Hello to anyone watching 👋

I'm working toward a larger 1.0 release, but to anyone itching for this feature right now, you can use the latest 1.0.0-alpha.5 release to get this 🎉. The data-pagefind-weight attribute has been implemented, and the h1 to h6 headings are automatically given a higher weight.

Documentation for this feature here: https://unreleased.pagefind.app/docs/indexing/#ranking-content-higher-with-weights
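
As a rough illustration of how an explicit data-pagefind-weight might interact with the automatic heading weights, here is a minimal sketch. It is not Pagefind's internal code. The default heading values (h1 = 7 down to h6 = 2, all other text = 1) are an assumption recalled from the weighting docs and should be checked there, and effectiveWeight and the plain-object element shape are hypothetical names used only for this example.

```javascript
// Illustrative only: NOT Pagefind's internal implementation.
// ASSUMPTION: default heading weights as recalled from the Pagefind weighting docs.
const ASSUMED_DEFAULT_WEIGHTS = { h1: 7, h2: 6, h3: 5, h4: 4, h5: 3, h6: 2 };

// `element` is a hypothetical plain object such as:
//   { tag: "h1", attrs: { "data-pagefind-weight": "2" } }
function effectiveWeight(element) {
  const explicit = element.attrs?.["data-pagefind-weight"];
  if (explicit !== undefined) {
    const parsed = Number.parseFloat(explicit);
    // An explicit, parseable weight takes priority over the tag default.
    if (Number.isFinite(parsed)) return parsed;
  }
  // Fall back to the assumed heading default, or 1 for all other text.
  return ASSUMED_DEFAULT_WEIGHTS[element.tag.toLowerCase()] ?? 1;
}
```

Under these assumptions, an h1 tagged with data-pagefind-weight="2" would resolve to 2 rather than the heading default, so an explicit weight can lower a heading's rank as well as raise it.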

Run npx pagefind@alpha, or download the relevant binary from the release page.

I'll update this thread again on a 1.0 release 🙂

@fekete-robert

Thanks, this seems to be a useful addition, I think it'll solve some of my problems :D

@bglw
Contributor Author

bglw commented Sep 13, 2023

Hey @simonwiles & @fekete-robert ! 👋

Good news — this has landed in Pagefind v1.0.0!

See the full release notes here: https://github.com/CloudCannon/pagefind/releases/tag/v1.0.0 💙

And the specific docs on weighting: https://pagefind.app/docs/weighting/

@bglw closed this as completed Sep 13, 2023
@fekete-robert

Thanks, that's great news! Now I'll just have to find the time to update our site to use it :)

@tenmozes

Could you please explain how it works? I paste the exact h3 title into the search box, but I don't see it in the results.

Version 1.03, using pagefind-ui.

You can try it here: https://new.docs.victoriametrics.com/
For example, search for How to export time series (without quotes; with quotes everything is fine).

@bglw
Contributor Author

bglw commented Sep 27, 2023

Hey @tenmozes 👋

There are a few signals that are all combined to rank the results, and I can give some insight into the ranking here. Large single-page docs are a weak point right now for ranking.

You can see some of the underlying ranking data if we run a search with the JS API on your site, using the term export time series to trim the data down a little:

const pagefind = await import("/pagefind/pagefind.js");
const search = await pagefind.search("export time series");
// Load the full data for every result (each r.data() call returns a promise)
const results = await Promise.all(search.results.map(r => r.data()));

There's a lot of data so I won't paste it here, but you can dig into results[0], which is /guides/understand-your-setup-size/, and results[5], which is /. Within each you can see all of the weighted_locations used to rank the result.
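
To make that data easier to skim, a small helper along these lines could condense each result. This is a sketch under assumptions: it expects each weighted_locations entry to look like { weight, balanced_score, location }, which matches my reading of the Pagefind API docs but is worth confirming against your own output, and summarizeWeightedLocations is a hypothetical name, not part of Pagefind.

```javascript
// Hypothetical helper, not part of the Pagefind API.
// ASSUMPTION: `data` is the object returned by result.data(), and each entry of
// data.weighted_locations has the shape { weight, balanced_score, location }.
function summarizeWeightedLocations(data) {
  const locations = data.weighted_locations ?? [];

  // Count how many hits landed at each weight (e.g. body text vs. headings).
  const byWeight = {};
  for (const { weight } of locations) {
    byWeight[weight] = (byWeight[weight] ?? 0) + 1;
  }

  // Keep the three strongest hits, sorted by balanced_score (highest first).
  const top = [...locations]
    .sort((a, b) => b.balanced_score - a.balanced_score)
    .slice(0, 3);

  return { url: data.url, hits: locations.length, byWeight, top };
}
```

With the results array from the snippet above, results.slice(0, 6).map(summarizeWeightedLocations) would give a compact side-by-side view of the first six pages.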

The biggest thing hurting you here is that Pagefind ranks pages higher when they have a higher frequency of matching words. The Understand Your Setup Size page has quite a lot of hits for Time Series, including one in an h3, and importantly has a higher rate of those words in the page. That page is only 1100 words with a lot of hits, while the homepage is 20x the size with 10x the matching words, so it has proportionally fewer hits.
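
The proportion argument can be checked with quick arithmetic. The sketch below is not Pagefind's ranking formula, only hits divided by page length. The hit counts (40 and 400) are invented for illustration, the homepage size is taken as 20 x 1100 words, and the comparison reads the comment above as "20x the size with 10x the hits".

```javascript
// Illustrative arithmetic only: NOT Pagefind's actual ranking formula.
// hitDensity is a hypothetical helper: matching words per word on the page.
function hitDensity(hits, wordCount) {
  return wordCount > 0 ? hits / wordCount : 0;
}

// ASSUMPTION: hit counts are made up; only the 1100-word page size and the
// 20x size / 10x hits ratios come from the discussion above.
const smallPage = { words: 1100, hits: 40 };
const homepage = { words: 1100 * 20, hits: 40 * 10 };

const smallDensity = hitDensity(smallPage.hits, smallPage.words);
const homeDensity = hitDensity(homepage.hits, homepage.words);

// The smaller page ends up about twice as dense in matching words.
const densityRatio = smallDensity / homeDensity;
```

Even with ten times as many raw hits, the homepage has about half the match density of the shorter page, which is consistent with it ranking lower on a frequency-based signal.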

All of that is combined with some other signals to rank the results, which ultimately causes that homepage to be further down the list.

There's a separate issue where Pagefind is less smart about ranking the sub results within a page, which is why you see the homepage as the sixth result without the relevant heading shown. Ranking the sub results within a page needs to be smarter, as it doesn't currently take the weightings into account. I'll open an issue to get that work underway; it just didn't make it into the 1.0 cut.

@bglw
Contributor Author

bglw commented Sep 27, 2023

Issue #437 discusses related topics, so that would also be a good place to continue any discussion! 🙂


4 participants