Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize queries with IN list to execute in parallel when reasonable #375

Closed
kmuthukk opened this issue Jul 10, 2018 · 3 comments
Closed
Assignees
Labels
kind/enhancement This is an enhancement of an existing feature

Comments

@kmuthukk
Copy link
Collaborator

Allow multiple keys to be looked up in parallel.

For many simple queries, where the client wants to lookup multiple keys in one call to YugaByte, we should allow those sub-queries/key lookups to happen in parallel.

Consider an example like:

CREATE TABLE T (key text PRIMARY KEY, value text);

INSERT INTO T (key, value) VALUES ('k1', 'v1');
INSERT INTO T (key, value) VALUES ('k2', 'v2');
INSERT INTO T (key, value) VALUES ('k3', 'v3');
INSERT INTO T (key, value) VALUES ('k4', 'v3');
INSERT INTO T (key, value) VALUES ('k5', 'v3');
INSERT INTO T (key, value) VALUES ('k6', 'v3');

SELECT *
FROM   T
WHERE key IN ('k1', 'k5');

Currently, for each key (k1 and k5) the lookup is done sequentially. But we should allow for these lookups across potentially different tablets (and potentially across different servers) to be done in parallel [Note: in some cases, such as if the IN list has many keys, and there is a LIMIT clause that's much smaller, we may not want to fire off all the lookups in parallel.]

@kmuthukk kmuthukk added the kind/enhancement This is an enhancement of an existing feature label Jul 10, 2018
@ddorian
Copy link
Contributor

ddorian commented Jul 10, 2018

Isn't it better to do this on the client (the multi server part)?
Pretty sure many libraries have helper methods (python one does) to group keys by hash.
While querying parallell across tables still makes sense.
Or keeping both versions for the client one being the most efficient.

@kmuthukk
Copy link
Collaborator Author

hi @ddorian - yes, if the grouping is done at client layer that's most optimal. Since it avoids extra hop for all keys in the common case. However, both are nice to have because if a client were to send a request having keys on multiple servers, the server should still do the right thing.

yugabyte-ci pushed a commit that referenced this issue Jul 20, 2018
…onable

Summary:
For multi-partition selects, if all primary key columns are fixed with equality or IN conditions
we can estimate the maximum number of returned rows based on the number of options for each IN
condition.
Then, if the max number of returned rows is smaller than both the page size and limit, we optimize
to run all internal queries (to each partition) in parallel -- since we know we won't be doing extra
work and need to truncate the result after.

Test Plan: TestSelect.java, existing IN tests

Reviewers: pritam.damania, robert

Reviewed By: pritam.damania, robert

Subscribers: kannan, yql

Differential Revision: https://phabricator.dev.yugabyte.com/D5137
@m-iancu
Copy link
Contributor

m-iancu commented Jul 23, 2018

Fixed in 00a585d.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/enhancement This is an enhancement of an existing feature
Projects
None yet
Development

No branches or pull requests

3 participants