Make author name matching case insensitive #9390

scottbarnes · 2024-06-05T15:29:19Z

Related: #9003, internetarchive/infogami#221

Problem

A clear and concise description of what you want to happen

On import, author name matching should be case insensitive.

Additional Context

internetarchive/infogami#217 changed ~ to use ILIKE rather than LIKE, and the Open Library code in #9003 relied upon this to perform case insensitive author name matching on import.

However, the Infogami ILIKE change caused performance issues and is slated to be reverted in internetarchive/infogami#221, with ~ doing a LIKE operation and ~i doing an ILIKE operation.

Once internetarchive/infogami#221 is merged, author name resolution will be case sensitive again. However, we can't simply update the Open Library code in openlibrary/catalog/add_book/load_book.py to use ~i, because of the performance issues associated with the ILIKE query, so we'll need to investigate further (perhaps EXPLAIN can help us see more about the query).
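To make the ~ versus ~i distinction concrete, here is a minimal in-memory sketch of SQL LIKE and ILIKE semantics. It is only a model of the matching behavior, not Infogami's or PostgreSQL's implementation; the helper names (`like_to_regex`, `like`, `ilike`) are made up for illustration, and escape characters are deliberately not handled.

```python
import re


def like_to_regex(pattern: str, case_insensitive: bool) -> re.Pattern:
    """Translate a SQL LIKE pattern into a compiled regex.

    Simplified model: `%` matches any run of characters, `_` matches exactly
    one character, everything else is literal. Backslash escapes are ignored.
    """
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    flags = re.DOTALL | (re.IGNORECASE if case_insensitive else 0)
    return re.compile("".join(parts), flags)


def like(value: str, pattern: str) -> bool:
    """Case-sensitive match, as with LIKE (the proposed `~`)."""
    return like_to_regex(pattern, case_insensitive=False).fullmatch(value) is not None


def ilike(value: str, pattern: str) -> bool:
    """Case-insensitive match, as with ILIKE (the proposed `~i`)."""
    return like_to_regex(pattern, case_insensitive=True).fullmatch(value) is not None
```

With this model, `like("Mark Twain", "mark twain")` is false while `ilike("Mark Twain", "mark twain")` is true, which is the behavior import would lose once ~ goes back to LIKE.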

Proposal & Constraints

What is the proposed solution / implementation?

None yet -- this will take more investigation to figure out why ILIKE caused such significant performance issues.

Leads

Related files

Stakeholders

Note: Before making a new branch or updating an existing one, please ensure your branch is up to date.


tfmorris · 2024-06-05T17:04:21Z

Doesn't SOLR already do this? Is there more context available about why this needs to be done in PostgreSQL in this particular use case?

A few general comments:

- name matching should be done on normalized names which are not only case folded, but also diacritic folded, and Unicode composition normalized
- some of these operations are, ideally, locale specific
- if you have to do it in PostgreSQL, a trigram index may help performance https://stackoverflow.com/questions/20336665/lower-like-vs-ilike
- but pre-computing a separate column with a normalized version of the name might be better
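As a rough illustration of the normalization described in the list above, the following sketch folds case and diacritics using only Python's standard library. The function name `normalize_name` is hypothetical, the choice of NFKD (rather than NFD) is an assumption, and nothing here is locale specific, so it should be read as a starting point rather than what Open Library does.

```python
import unicodedata


def normalize_name(name: str) -> str:
    """Return a comparison key for an author name (illustrative only).

    Steps:
      1. NFKD-decompose so accented letters split into base + combining mark.
      2. Drop the combining marks (diacritic folding).
      3. Case-fold, which is more aggressive than lower() (e.g. "ß" -> "ss").
      4. Recompose with NFC so the key has one canonical form.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    without_marks = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    )
    folded = without_marks.casefold()
    return unicodedata.normalize("NFC", folded)
```

A key like this could be what gets stored in the pre-computed column, so that lookups become plain equality comparisons on the normalized value instead of pattern matches.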

cdrini · 2024-06-05T17:14:42Z

Solr might be what we have to use, considering the performance issues with ILIKE. Note that Solr has the caveat of being 1 minute behind live edits. In the past, when Solr was used to dedupe imports, that lag led to edge cases where related books imported in quick succession produced dupes, so we'd always need a Postgres backup check of some sort. The Postgres ILIKE was hence a mandatory and simple change that would result in a large improvement in new authors being created. The plan was to add the Solr checking as an improvement at some point in the future. But we might have to re-evaluate that strategy as mentioned above.

Oh sweet, thanks for that trigram index find! When we investigate we'll see what it's currently using.
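The Solr-lag edge case in the comment above can be sketched with two dictionaries: one standing in for a search index that lags behind edits, and one standing in for the live database. Everything here (`find_author`, `import_author`, the `author-N` identifiers) is hypothetical and only shows why a check against the live store is still needed after the index lookup.

```python
from typing import Dict, Optional


def find_author(
    name: str, stale_index: Dict[str, str], live_store: Dict[str, str]
) -> Optional[str]:
    """Look in the (possibly stale) index first, then fall back to the live store."""
    key = name.casefold()
    if key in stale_index:
        return stale_index[key]
    # The index may not have caught up with a very recent import yet,
    # so the live store is checked before declaring the author new.
    return live_store.get(key)


def import_author(
    name: str, stale_index: Dict[str, str], live_store: Dict[str, str]
) -> str:
    """Return an existing author id, or create one in the live store only."""
    existing = find_author(name, stale_index, live_store)
    if existing is not None:
        return existing
    new_id = f"author-{len(live_store) + 1}"  # made-up id scheme
    live_store[name.casefold()] = new_id      # the index is NOT updated here
    return new_id
```

Importing "Mark Twain" and then "MARK TWAIN" in quick succession, with the index still empty, yields the same id both times because of the fallback; without it, the second import would create a duplicate.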

scottbarnes mentioned this issue Jun 5, 2024

Author name resolution unit tests should be case sensitive #9391

Open

tfmorris mentioned this issue Jun 5, 2024

Weird issue template text #9394

Closed

mekarpeles added Lead: @scottbarnes Issues overseen by Scott (Community Imports) and removed Needs: Lead labels Jun 10, 2024
