This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Search is case-sensitive in non-English languages #3116

Open
rkfg opened this issue Apr 17, 2018 · 10 comments
Labels
A-I18n A-Message-Search Searching messages O-Frequent Affects or can be seen by most users regularly or impacts most users' first experience S-Minor Blocks non-critical functionality, workarounds exist. T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. Z-Help-Wanted We know exactly how to fix this issue, and would be grateful for any contribution

Comments

@rkfg
Contributor

rkfg commented Apr 17, 2018

Description

Message search is case-sensitive for text in languages other than English, which makes it hard to find what you are looking for.

Steps to reproduce

  • In a non-English room, search for the same word in lower case and in upper case.
  • The two searches return quite different results.
  • Searching for an English word in upper and lower case returns the same results.

Also, the FTS language is hardcoded as English here, so stemming is not supported for other languages. I propose lowercasing the event text before inserting it and doing the same when querying.
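The proposal above can be sketched as follows. This is a minimal illustration, not Synapse's actual code: `normalize` and `ToyIndex` are hypothetical names, and the in-memory dictionary only stands in for the real search index. The point is that the same Unicode-aware lowercasing is applied both when text is indexed and when it is queried, so the result does not depend on the database locale.

```python
from collections import defaultdict


def normalize(text):
    # Hypothetical helper: Python's str.lower() is Unicode-aware,
    # so it handles Cyrillic and other non-ASCII letters.
    return text.lower()


class ToyIndex:
    """In-memory stand-in for a search index (assumption, not Synapse's storage)."""

    def __init__(self):
        self._postings = defaultdict(set)

    def add(self, event_id, body):
        # Lowercase the event text before inserting it.
        for word in normalize(body).split():
            self._postings[word].add(event_id)

    def search(self, query):
        # Lowercase the query the same way; every word must match,
        # similar in spirit to `a & b` in to_tsquery.
        words = normalize(query).split()
        if not words:
            return set()
        result = set(self._postings.get(words[0], set()))
        for word in words[1:]:
            result &= self._postings.get(word, set())
        return result


index = ToyIndex()
index.add("$event1", "Можно забрать завтра")
index.add("$event2", "можно позже")

# Both spellings of the query find the same event.
print(index.search("МОЖНО забрать"))
print(index.search("можно забрать"))
```

Whether `str.lower()` or the more aggressive `str.casefold()` is the better choice would depend on the languages involved; either way, the index and the query have to use the same function.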

Version information

  • Homeserver: own homeserver
  • Version: 0.27.4
  • Install method: package manager
  • Platform: Debian 9.4 on a VDS
@neilisfragile neilisfragile added z-p2 (Deprecated Label) z-minor (Deprecated Label) Z-Help-Wanted We know exactly how to fix this issue, and would be grateful for any contribution labels Apr 20, 2018
@areisp

areisp commented May 25, 2019

Quite a few articles suggest creating the PostgreSQL database with LC_COLLATE and LC_CTYPE set to C, like here.
I don't know whether there is a reason for that (performance, maybe?), but recreating the database (and restoring the dump into it) with a region-specific locale in LC_COLLATE and LC_CTYPE, which is ru_RU.UTF-8 in my case, fixes the problem. New text gets indexed just fine.
For previous conversations, I used a somewhat questionable workaround:

UPDATE event_search SET vector = array_to_tsvector(lower(tsvector_to_array(vector)::text)::text[]) ;

I'm not sure whether that was a proper or safe way to do it.

@MurzNN

MurzNN commented Jan 15, 2020

@areisp thanks for describing the workaround. Can it also fix issue element-hq/element-web#7247 with user search? On which table would we need to run the proposed UPDATE for that?

@ptman
Contributor

ptman commented Jun 26, 2020

#6696 (comment)

@ilmari
Contributor

ilmari commented Jun 26, 2020

The issue is that the C database locale doesn't know how to case-fold non-ASCII letters. Using C.UTF-8 would avoid the index corruption issues when upgrading libc, while still allowing case-insensitive searching for non-ASCII letters.
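As a rough analogy for the point above (an illustration in Python, not PostgreSQL's actual code path), `bytes.lower()` only maps ASCII `A-Z` to `a-z` and leaves every other byte alone, which resembles case folding under a locale that knows only ASCII. The function names below are made up for this sketch.

```python
def ascii_only_lower(text):
    # bytes.lower() touches only ASCII letters; the UTF-8 bytes of
    # non-ASCII characters pass through unchanged. Used here as a
    # rough model of lowercasing under the C locale (an assumption).
    return text.encode("utf-8").lower().decode("utf-8")


def unicode_lower(text):
    # str.lower() applies Unicode case mappings, as a UTF-8-aware
    # locale would be expected to.
    return text.lower()


for word in ("HELLO", "МОЖНО"):
    print(word, ascii_only_lower(word), unicode_lower(word))
```

Under the ASCII-only model, the English word is lowercased but the Russian one is not, so upper-case and lower-case spellings of the Russian word are treated as different terms.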

@532910

532910 commented Jun 27, 2020

C.UTF-8 fixes this issue, but adds another:

C.UTF-8:

synapse=# select event_id from event_search where vector @@ to_tsquery('можно & забрать');
              event_id              
------------------------------------
 $1566.................:matrix.org
 $1546..................:matrix.org
(2 rows)

synapse=# select event_id from event_search where vector @@ to_tsquery('МОЖНО & забрать');
              event_id              
------------------------------------
 $1566.................:matrix.org
 $1546..................:matrix.org
(2 rows)

C:

synapse=# select event_id from event_search where vector @@ to_tsquery('можно & забрать');
              event_id
------------------------------------
 $1566.................:matrix.org
 $1546..................:matrix.org
(2 rows)

synapse=# select event_id from event_search where vector @@ to_tsquery('МОЖНО & забрать');
                   event_id
----------------------------------------------
 $1574..............:homeserver.tld
 $1574..............:homeserver.tld
 $1574..............:homeserver.tld
 $1574..............:homeserver.tld
 $1576...............:homeserver.tld
 $1576...............:homeserver.tld
 $1591..............:homeserver.tld
 $Zgt9.......................................
(8 rows)

@MurzNN

MurzNN commented Jun 27, 2020

Maybe you need to rebuild the indexes after changing the locale?

@532910

532910 commented Jun 27, 2020

Sure, I've issued REINDEX database synapse; and restarted the psql client.

@532910

532910 commented Jun 27, 2020

Full log:
https://clbin.com/k2J64

@MurzNN

MurzNN commented Dec 16, 2020

I found a PR, https://github.com/matrix-org/synapse/pull/6268/files, that forces strings to lowercase to solve a similar problem with case-sensitive search.
So, if we can't use an LC_COLLATE and LC_CTYPE other than C, maybe we can add similar behavior: convert all strings to lowercase before adding them to the search index, and do the same with the search phrase?

@MadLittleMods MadLittleMods added the A-Message-Search Searching messages label Nov 10, 2021
@DMRobertson DMRobertson added the T-Enhancement New features, changes in functionality, improvements in performance, or user-facing enhancements. label Nov 18, 2021
@grinapo

grinapo commented Apr 24, 2022

Side note: as far as I can see, Synapse will reject the workarounds that try to use a non-C database locale, so this will come up as a problem again.

The current state of searching is a hack; it should be possible to use a proper full-text search backend. I'm not sure it would be very hard to develop one independently (using the db and redirecting the API calls), but right now I'm busy with many other things. :-(


12 participants