Use of GIN index on YugabyteDB has worse performance than that of PostgreSQL #13554

Open
yugabyte-ci opened this issue Aug 9, 2022 · 2 comments
Labels: area/ysql, jira-originated, kind/enhancement, priority/medium

Comments

@yugabyte-ci (Contributor)

Jira Link: DB-3134

@mrajcevic01 (Contributor)

To recreate:

-- table

CREATE TABLE account (
    id UUID PRIMARY KEY,
    create_time TIMESTAMP,
    last_update TIMESTAMP,
    data JSONB NOT NULL,
    version INTEGER NOT NULL DEFAULT 0
);

-- function: flattens the searchable JSON fields into a single text value for trigram indexing

CREATE OR REPLACE function account_fun(data jsonb) RETURNS text IMMUTABLE AS $$
DECLARE
    _return text;
BEGIN
    _return := concat(
	    data #>> array['id'] ,
	    data #>> array['openId'] ,
	    data #>> array['shortRef'] ,
	    data #>> array['virtualAccountNumber'] ,
	    data #>> array['accountDetails', 'businessDetails', 'businessName'] ,
	    data #>> array['accountDetails', 'businessDetails', 'businessNameEnglish'] ,
	    data #>> array['accountDetails', 'businessDetails', 'businessRegistrationNumber'] ,
	    data #>> array['accountDetails', 'businessDetails', 'contactName'] ,
	    data #>> array['clientInternalInfo', 'accountManager'] ,
	    data #>> array['clientInternalInfo', 'relationshipManager'] ,
	    data #>> array['externalInfo', 'nickName'] ,
	    data #>> array['primaryContact', 'firstName'] ,
	    data #>> array['primaryContact', 'lastName'] ,
	    data #>> array['primaryContact', 'phoneNumber'] ,
	    ''
    );
    RETURN _return ;
END;
$$ LANGUAGE plpgsql;
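
Since the original dataset is not included in this issue, here is a hypothetical way to populate the table; the row count and JSON shape below are assumptions and should be adjusted to approximate the real data.

-- hypothetical sample data (illustrative only; not part of the original report)
INSERT INTO account (id, create_time, last_update, data)
SELECT md5(i::text)::uuid,
       now(),
       now(),
       jsonb_build_object(
           'id', i::text,
           'primaryContact', jsonb_build_object(
               'firstName', 'user' || i,
               'lastName', 'example',
               'phoneNumber', '555-' || i))
FROM generate_series(1, 100000) AS i;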

-- GIN index (gin_trgm_ops requires the pg_trgm extension)
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX CONCURRENTLY idx_gin_account_fun ON account USING gin (account_fun(data) gin_trgm_ops);

--query

explain (verbose, analyse)
SELECT * FROM account WHERE 
(account_fun(data) ILIKE '%marko.rajcevic+testapp%') limit 1000;

-- output: PostgreSQL with LIMIT 1000

Limit  (cost=442.91..567.47 rows=92 width=54) (actual time=8.773..12.693 rows=8 loops=1)
  Output: id, create_time, last_update, data, version
  ->  Bitmap Heap Scan on public.account  (cost=442.91..567.47 rows=92 width=54) (actual time=8.772..12.689 rows=8 loops=1)
        Output: id, create_time, last_update, data, version
        Recheck Cond: (account_fun(account.data) ~~* '%marko.rajcevic+testapp%'::text)
        Heap Blocks: exact=8
        ->  Bitmap Index Scan on idx_gin_account_fun  (cost=0.00..442.89 rows=92 width=0) (actual time=8.125..8.125 rows=8 loops=1)
              Index Cond: (account_fun(account.data) ~~* '%marko.rajcevic+testapp%'::text)
Planning Time: 1.598 ms
Execution Time: 12.736 ms

-- output: YugabyteDB with LIMIT 1000

Limit  (cost=96.00..104.01 rows=1000 width=68) (actual time=2666.896..28444.558 rows=8 loops=1)
  Output: id, create_time, last_update, data, version
  ->  Index Scan using idx_gin_account_fun on public.account  (cost=96.00..104.01 rows=1000 width=68) (actual time=2666.894..28444.545 rows=8 loops=1)
        Output: id, create_time, last_update, data, version
        Index Cond: (account_fun(account.data) ~~* '%marko.rajcevic+testapp%'::text)
        Rows Removed by Index Recheck: 217580
Planning Time: 0.222 ms
Execution Time: 28445.058 ms
Peak Memory Usage: 18014398509476482 kB

@jasonyb (Contributor) commented Aug 16, 2022

This is currently a common issue with ybgin and pg_trgm, particularly because "feat: support all scans" (issue #7850) is not yet supported. For pg_trgm LIKE '%foo%' queries using gin, we want gin keys that match trigram 1, trigram 2, ..., and trigram x. Since it is possible to fetch data using only one of the trigrams and then filter on the client side using the other trigrams, there is one "required" scan entry and the rest are "additional". The way postgres chooses the "required" scan entry appears to be deterministic, but I haven't looked into it too much. I think it is influenced by some stats that YB unfortunately does not have, so that may be why a bad required entry is chosen. Until the code is improved, the user-side workaround is to force usage of a better required entry.
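
For illustration (not part of the original comment), pg_trgm's show_trgm lists the trigrams a string decomposes into; the GIN keys searched for the ILIKE pattern are drawn from these:

SELECT show_trgm('marko.rajcevic+testapp');
-- non-alphanumeric characters act as word separators, so this yields padded trigrams
-- such as "  m", "  r", "  t" as well as interior ones such as "ajc" and "jce"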

Running with set yb_debug_log_docdb_requests = on; on an empty table, the postgres logs show:

I0815 16:43:15.771070 3735963 pg_session.cc:213] Applying operation: { READ active: 1 read_time: { read: <invalid> local_limit: <invalid> global_limit: <invalid> in_txn_limit: <invalid> serial_no: 0 } request: client: YQL_CLIENT_PGSQL stmt_id: 35889248 schema_version: 0 targets { column_id: 11 } column_refs { ids: 11 } is_forward_scan: 1 is_aggregate: 0 limit: 1024 return_paging_state: 1 range_column_values { value { int32_value: 2105458 } } table_id: "000033e8000030008000000000004062" upper_bound { key: "48802020727E7E21" is_inclusive: 1 } col_refs { column_id: 11 attno: -101 } partition_key: "48802020720021" }

The filter key is 2105458 = 0x202072. pg_trgm uses trigrams; for ASCII, each trigram packs into three bytes, one per character. So 0x20, 0x20, 0x72 are the three characters, corresponding to space, space, r. This is very unselective, so performance is bound to be bad.
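
As a quick check (again, not from the original comment), the packed key can be decoded in SQL using the byte layout described above:

SELECT chr((2105458 >> 16) & 255),
       chr((2105458 >> 8) & 255),
       chr(2105458 & 255);
-- returns ' ', ' ', 'r': the padded trigram "  r" from the word "rajcevic"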

We can test how to modify the query so that a more selective filter is pushed down to the gin index:

explain (verbose, analyse)
SELECT * FROM account WHERE
(account_fun(data) ILIKE '%ajc%') limit 1000;

This gives

I0815 16:44:29.654284 3735963 pg_session.cc:213] Applying operation: { READ active: 1 read_time: { read: <invalid> local_limit: <invalid> global_limit: <invalid> in_txn_limit: <invalid> serial_no: 0 } request: client: YQL_CLIENT_PGSQL stmt_id: 35889248 schema_version: 0 targets { column_id: 11 } column_refs { ids: 11 } is_forward_scan: 1 is_aggregate: 0 limit: 1024 return_paging_state: 1 range_column_values { value { int32_value: 6384227 } } table_id: "000033e8000030008000000000004062" upper_bound { key: "4880616A637E7E21" is_inclusive: 1 } col_refs { column_id: 11 attno: -101 } partition_key: "4880616A630021" }

with filter 6384227 = 0x616A63, corresponding to ajc, which is a lot more selective.

This query is incomplete since it also matches texts like "fooajcfoo", so we need to add the whole filter back:

explain (verbose, analyse)
WITH w AS (
SELECT * FROM account WHERE 
(account_fun(data) ILIKE '%ajc%')
) SELECT * FROM w WHERE (account_fun(data) ILIKE '%marko.rajcevic+testapp%') limit 1000;

Please try this and see the results. I don't have the dataset to test myself.
