Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make author name matching case insensitive #9390

Open
scottbarnes opened this issue Jun 5, 2024 · 2 comments
Open

Make author name matching case insensitive #9390

scottbarnes opened this issue Jun 5, 2024 · 2 comments
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]

Comments

@scottbarnes
Copy link
Collaborator

scottbarnes commented Jun 5, 2024

Related: #9003, internetarchive/infogami#221

Problem

A clear and concise description of what you want to happen

On import, author name matching should be case insensitive.

Additional Context

internetarchive/infogami#217 changed ~ to use ILIKE rather than LIKE, and the Open Library code in #9003 relied upon this to perform case insensitive author name matching on import.

However, the Infogami ILIKE change caused performance issues and is slated to be reverted in internetarchive/infogami#221, with ~ doing a LIKE operation and ~i doing an ILIKE operation.

Once internetarchive/infogami#221 is merged, author name resolution will be case sensitive again. However, we can't simply update the Open Library code in openlibrary/catalog/add_book/load_book.py to use ~i, because of the performance issues associated with the ILIKE query, so we'll need to investigate further (perhaps using EXPLAIN can help us see more about the query.

Proposal & Constraints

What is the proposed solution / implementation?

None yet -- this will take more investigation to figure out why ILIKE was such significant performance issues.

Leads

Related files

Stakeholders

Note: Before making a new branch or updating an existing one, please ensure your branch is up to date.

@scottbarnes scottbarnes added Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed] Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] Needs: Lead Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] labels Jun 5, 2024
@tfmorris
Copy link
Contributor

tfmorris commented Jun 5, 2024

Doesn't SOLR already do this? Is there more context available about why this needs to be done in PostgreSQL in this particular use case?

A few general comments:

  • name matching should be done on normalized names which are not only case folded, but also diacritic folded, and Unicode composition normalized
  • some of these operations are, ideally, locale specific
  • if you have to do it in PostgreSQL, a trigram index may help performance https://stackoverflow.com/questions/20336665/lower-like-vs-ilike
  • but pre-computing a separate column with a normalized version of the name might be better

@cdrini
Copy link
Collaborator

cdrini commented Jun 5, 2024

Solr might be what we have to do considering the performance issues with ILIKE. Note solr has a caveat of being 1 minute behind live edits. In the past when solr has been used to dedupe imports, it caused edge cases where it caused dupes with related books being imported in quick succession, so we'd always need a postgres backup check of some sort. The postgres ILIKE was hence a mandatory and simple change that would result in a large improvement in new authors being created. The plan was to add the solr checking as an improvement at some point in the future. But we might have to re-evaluate that strategy as mentioned above.

Oh sweet thanks for that trigram index find! When we investigate we'll see what it's currently using.

@mekarpeles mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Lead labels Jun 10, 2024
@scottbarnes scottbarnes added Priority: 3 Issues that we can consider at our leisure. [managed] Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] and removed Needs: Triage This issue needs triage. The team needs to decide who should own it, what to do, by when. [managed] labels Jun 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Lead: @scottbarnes Issues overseen by Scott (Community Imports) Module: Import Issues related to the configuration or use of importbot and other bulk import systems. [managed] Needs: Investigation This issue/PR needs a root-cause analysis to determine a solution. [managed] Priority: 3 Issues that we can consider at our leisure. [managed] Type: Feature Request Issue describes a feature or enhancement we'd like to implement. [managed]
Projects
None yet
Development

No branches or pull requests

4 participants