optimize queries with IN list to execute in parallel when reasonable #375

kmuthukk · 2018-07-10T01:02:46Z

Allow multiple keys to be looked up in parallel.

For many simple queries, where the client wants to lookup multiple keys in one call to YugaByte, we should allow those sub-queries/key lookups to happen in parallel.

Consider an example like:

CREATE TABLE T (key text PRIMARY KEY, value text);

INSERT INTO T (key, value) VALUES ('k1', 'v1');
INSERT INTO T (key, value) VALUES ('k2', 'v2');
INSERT INTO T (key, value) VALUES ('k3', 'v3');
INSERT INTO T (key, value) VALUES ('k4', 'v3');
INSERT INTO T (key, value) VALUES ('k5', 'v3');
INSERT INTO T (key, value) VALUES ('k6', 'v3');

SELECT *
FROM   T
WHERE key IN ('k1', 'k5');

Currently, for each key (k1 and k5) the lookup is done sequentially. But we should allow for these lookups across potentially different tablets (and potentially across different servers) to be done in parallel [Note: in some cases, such as if the IN list has many keys, and there is a LIMIT clause that's much smaller, we may not want to fire off all the lookups in parallel.]

The text was updated successfully, but these errors were encountered:

ddorian · 2018-07-10T11:03:19Z

Isn't it better to do this on the client (the multi server part)?
Pretty sure many libraries have helper methods (python one does) to group keys by hash.
While querying parallell across tables still makes sense.
Or keeping both versions for the client one being the most efficient.

kmuthukk · 2018-07-10T15:10:56Z

hi @ddorian - yes, if the grouping is done at client layer that's most optimal. Since it avoids extra hop for all keys in the common case. However, both are nice to have because if a client were to send a request having keys on multiple servers, the server should still do the right thing.

…onable Summary: For multi-partition selects, if all primary key columns are fixed with equality or IN conditions we can estimate the maximum number of returned rows based on the number of options for each IN condition. Then, if the max number of returned rows is smaller than both the page size and limit, we optimize to run all internal queries (to each partition) in parallel -- since we know we won't be doing extra work and need to truncate the result after. Test Plan: TestSelect.java, existing IN tests Reviewers: pritam.damania, robert Reviewed By: pritam.damania, robert Subscribers: kannan, yql Differential Revision: https://phabricator.dev.yugabyte.com/D5137

m-iancu · 2018-07-23T09:19:55Z

Fixed in 00a585d.

kmuthukk added the kind/enhancement This is an enhancement of an existing feature label Jul 10, 2018

kmuthukk assigned m-iancu Jul 10, 2018

m-iancu closed this as completed Jul 23, 2018

ryan-ally mentioned this issue Dec 21, 2023

[Snyk] Fix for 14 vulnerabilities ryan-ally/yugabyte-db#230

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

optimize queries with IN list to execute in parallel when reasonable #375

optimize queries with IN list to execute in parallel when reasonable #375

kmuthukk commented Jul 10, 2018

ddorian commented Jul 10, 2018

kmuthukk commented Jul 10, 2018

m-iancu commented Jul 23, 2018

optimize queries with IN list to execute in parallel when reasonable #375

optimize queries with IN list to execute in parallel when reasonable #375

Comments

kmuthukk commented Jul 10, 2018

ddorian commented Jul 10, 2018

kmuthukk commented Jul 10, 2018

m-iancu commented Jul 23, 2018