Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Search for URLs in messages doesn't return any results #3024

Open
WGH- opened this issue Mar 25, 2018 · 16 comments
Open

Search for URLs in messages doesn't return any results #3024

WGH- opened this issue Mar 25, 2018 · 16 comments
Labels
A-Message-Search Searching messages O-Occasional Affects or can be seen by some users regularly or most users rarely S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Help-Wanted We know exactly how to fix this issue, and would be grateful for any contribution

Comments

@WGH-
Copy link
Contributor

WGH- commented Mar 25, 2018

Description

Message text search doesn't work with URL. Even if you search for URL exactly as it was in a message, search will not return any results.

Steps to reproduce

The search results will not include the message with the link, which is unexpected.

Version information

  • Homeserver: self-hosted
  • Version: matrix-synapse==0.26.0
  • Install method: pip
  • Platform: Linux, PostgreSQL 9.6
@WGH-
Copy link
Contributor Author

WGH- commented Mar 25, 2018

Plain SQL query returns the message like a charm:

synapse=> select event_id, content from events where content like '%https://www.youtube.com/watch?v=oHg5SJYRHA0%';
            event_id            |                                  content                                  
--------------------------------+---------------------------------------------------------------------------
 $15219863901020QNrdM:torlan.ru | {"body":"https://www.youtube.com/watch?v=oHg5SJYRHA0","msgtype":"m.text"}
(1 row)

@neilisfragile neilisfragile added z-bug (Deprecated Label) Z-Help-Wanted We know exactly how to fix this issue, and would be grateful for any contribution z-p2 (Deprecated Label) labels Mar 27, 2018
@skylord123
Copy link

Just ran into this same issue. Tried searching a room for a specific link and it wasn't finding anything but I found it when searching manually.

@werner291
Copy link
Contributor

@werner291
Copy link
Contributor

werner291 commented Sep 27, 2019

Managed to reproduce in Riot on my homeserver. Will try upgrading server to see if the issue persists. (Though shouldn't be Riot-specific since the request itself seems to be failing.)

@werner291
Copy link
Contributor

Also occurs on latest develop version when running live.

@werner291
Copy link
Contributor

This bit looks suspicious: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L398

ts_search appears to be doing some kind of stopword removal and such. I can imagine URI's might get mangled by that https://www.postgresql.org/docs/9.5/textsearch-controls.html

@werner291
Copy link
Contributor

And the tests were passing since they are run against SQLite, which uses a somewhat more straightforward approach.

@werner291
Copy link
Contributor

werner291 commented Sep 27, 2019

SELECT vector FROM event_search WHERE vector @@ to_tsquery('english', 'www.youtube.com');

Curiously, this does seem to produce a couple results, suggesting to_tsquery isn't the roadbloack I thought it might be.

@werner291
Copy link
Contributor

werner291 commented Sep 28, 2019

Yep: PostgreSQL is rather iffy when it comes to handling URI's.

You end up with queries like WHERE vector @@ to_tsquery('english', 'youtube.com:*'); returning 0 results, whereas WHERE vector @@ to_tsquery('english', 'www.youtube.com:*') suddenly finds the event you're looking for, but then WHERE vector @@ to_tsquery('english', 'youtube:*') suddenly doesn't work again.

This is made worse by the fact that Synapse mangles a query string like "www.youtube.com" into something like www:* & youtube:* & com:* which prevents us from hitting that one case where we get a useful result.

@werner291
Copy link
Contributor

@werner291
Copy link
Contributor

How about we pre-process the search index like this?

SELECT translate('https://www.youtube.com/watch?v=dQw4w9WgXcQ', './:?=', ' ');

This results in a breakdown like this: https www youtube com watch vdQw4w9WgXcQ which should be easier to parse for the search engine.

SELECT to_tsvector('english', translate('https://www.youtube.com/watch?v=dQw4w9WgXcQ', './:?=', '  '));
 >  'com':4 'https':1 'vdqw4w9wgxcq':6 'watch':5 'www':2 'youtub':3

Much nicer parsing result.

@werner291
Copy link
Contributor

Removing special characters would happen around here then, probably: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L326

Note that we're essentially already doing that here: https://github.com/matrix-org/synapse/blob/master/synapse/storage/search.py#L695

werner291 added a commit to werner291/synapse that referenced this issue Sep 28, 2019
…3024.

Signed-off-by: Werner Kroneman <werner@wernerkroneman.nl>
werner291 added a commit to werner291/synapse that referenced this issue Sep 28, 2019
… DB, should improve search performance with regards to matrix-org#3024.

Added regression test as well, which passes.

Signed-off-by: Werner Kroneman <werner@wernerkroneman.nl>
@werner291
Copy link
Contributor

How odd... When entering the URI in Riot search it doesn't seem to always work, yet it seems mostly reliable when running the tests. Yes, I'm running with SYNAPSE_POSTGRES=1

@werner291
Copy link
Contributor

werner291 commented Sep 28, 2019

🤦‍♂️ I forgot how federation worked, was running the search against a non-upgraded server.

On a more positive note, URI search seems to work very well!

@werner291
Copy link
Contributor

werner291 commented Sep 28, 2019

Note that the improved search will only work on messages newly inserted into the search index. (Old messages will behave the same as before.)

It will also only work on PostgreSQL. SQlite seemed to already be sorta working before so i didn't touch it.

@babolivier
Copy link
Contributor

I've just had a quick look and it's still an issue. For what it's worth by looking at the search vector and testing locally a bit it looks like dropping the https:// bit from the search query allows the URL to be searched:

     key      |                                        vector                                        
--------------+--------------------------------------------------------------------------------------
 content.body | '/watch?v=ohg5sjyrha0':3 'www.youtube.com':2 'www.youtube.com/watch?v=ohg5sjyrha0':1

@babolivier babolivier added S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. labels Jul 21, 2022
@squahtx squahtx added O-Occasional Affects or can be seen by some users regularly or most users rarely and removed z-bug (Deprecated Label) z-p2 (Deprecated Label) labels Oct 5, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A-Message-Search Searching messages O-Occasional Affects or can be seen by some users regularly or most users rarely S-Tolerable Minor significance, cosmetic issues, low or no impact to users. T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues. Z-Help-Wanted We know exactly how to fix this issue, and would be grateful for any contribution
Projects
None yet
Development

No branches or pull requests

7 participants