Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

semi join performance issue #10226

Open
mxlxm opened this issue Apr 22, 2019 · 5 comments
Open

semi join performance issue #10226

mxlxm opened this issue Apr 22, 2019 · 5 comments
Assignees

Comments

@mxlxm
Copy link
Contributor

mxlxm commented Apr 22, 2019

Bug Report

Please answer these questions before submitting your issue. Thanks!

  1. What did you do?
    create table t1 (id bigint, ctime timestamp);
    generate 10w rows data
    select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27');

  2. What did you expect to see?
    response quickly as tidb-v2.1.4

  3. What did you see instead?
    very slow since plan/executor: make semi joins null and empty aware (#9051) #9449 merged

  4. What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?
    tidb-v2.1.8

@zz-jason
Copy link
Member

To guarantee correctness, we moved the equal condition eq(est.t1.id, test.t1.id) to be evaluated after the data is joined in PR #9449. So the execution plan for this query at present is:

TiDB(root@127.0.0.1:test) > desc select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27');
+------------------------+----------+------+-----------------------------------------------------------------------------+
| id                     | count    | task | operator info                                                               |
+------------------------+----------+------+-----------------------------------------------------------------------------+
| HashLeftJoin_9         | 8000.00  | root | anti semi join, inner:TableReader_14, other cond:eq(test.t1.id, test.t1.id) |
| ├─TableReader_11       | 10000.00 | root | data:TableScan_10                                                           |
| │ └─TableScan_10       | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
| └─TableReader_14       | 3323.33  | root | data:Selection_13                                                           |
|   └─Selection_13       | 3323.33  | cop  | lt(test.t1.ctime, 2019-04-22 18:09:27.000000)                               |
|     └─TableScan_12     | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
+------------------------+----------+------+-----------------------------------------------------------------------------+
6 rows in set (0.01 sec)

As you can see, it performs a cartesian product first and then performs the filter eq(test.t1.id, test.t1.id).

To avoid this transformation, you can add a not null property for the column id. That is to say, you can create the table in the following way:

create table t1 (id bigint not null, ctime timestamp);

Then the query execution plan is:

TiDB(root@127.0.0.1:test) > desc select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27');
+------------------------+----------+------+--------------------------------------------------------------------------+
| id                     | count    | task | operator info                                                            |
+------------------------+----------+------+--------------------------------------------------------------------------+
| HashLeftJoin_9         | 8000.00  | root | anti semi join, inner:TableReader_14, equal:[eq(test.t1.id, test.t1.id)] |
| ├─TableReader_11       | 10000.00 | root | data:TableScan_10                                                        |
| │ └─TableScan_10       | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo              |
| └─TableReader_14       | 3323.33  | root | data:Selection_13                                                        |
|   └─Selection_13       | 3323.33  | cop  | lt(test.t1.ctime, 2019-04-22 18:09:27.000000)                            |
|     └─TableScan_12     | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo              |
+------------------------+----------+------+--------------------------------------------------------------------------+
6 rows in set (0.00 sec)

Because id is guaranteed to be not null. This time the equal condition is moved to be the hash key of hash join. Thus the cartesian product is avoided.

In the future, we are planning to derive not null property from filters and scalar functions, thus you can simply add a is not null filter to avoid the cartesian product in the future. But for now, this optimization is not performed, so the following query still can not avoid the cartesian product:

TiDB(root@127.0.0.1:test) > desc select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27' and id is not null);
+------------------------+----------+------+-----------------------------------------------------------------------------+
| id                     | count    | task | operator info                                                               |
+------------------------+----------+------+-----------------------------------------------------------------------------+
| HashLeftJoin_9         | 8000.00  | root | anti semi join, inner:TableReader_14, other cond:eq(test.t1.id, test.t1.id) |
| ├─TableReader_11       | 10000.00 | root | data:TableScan_10                                                           |
| │ └─TableScan_10       | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
| └─TableReader_14       | 3320.01  | root | data:Selection_13                                                           |
|   └─Selection_13       | 3320.01  | cop  | lt(test.t1.ctime, 2019-04-22 18:09:27.000000), not(isnull(test.t1.id))      |
|     └─TableScan_12     | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
+------------------------+----------+------+-----------------------------------------------------------------------------+
6 rows in set (0.00 sec)

@mxlxm
Copy link
Contributor Author

mxlxm commented Apr 23, 2019

To guarantee correctness.

this behavior is different from MySQL, NOT IN (subquery) requires additional checks for NULL values to guarantee correctness even with MySQL.

due to historical reasons, NULL represents a business status, so we can't just modify this column.

now we use NOT EXISTS instead of NOT IN(with null values checks) to avoid the cartesian product.

in production, we have 1000w+ records, with v2.1.4- similar query cost 20s, after v2.1.5+ same query cost 4h+.

@eurekaka
Copy link
Contributor

this behavior is different from MySQL, NOT IN (subquery) requires additional checks for NULL values to guarantee correctness even with MySQL.

Could you please illustrate more about this comment?

now we use NOT EXISTS instead of NOT IN(with null values checks) to avoid the cartesian product.

NOT EXISTS should be different with NOT IN semantically regarding NULL values of nested table, they may return different results sometimes.

@mxlxm
Copy link
Contributor Author

mxlxm commented Apr 24, 2019

@eurekaka
Copy link
Contributor

That's interesting.

this behavior is different from MySQL

To make it clear, actually the SQL "behavior" of TiDB is compatible with MySQL now, but the implementation is different. In MySQL,

It takes a single index scan to find out if there are NULL values in t_inner and return if they are

Another interesting thing is that we can learn from the blog to optimize LEFT JOIN / IS NULL to NOT EXISTS in TiDB to improve join execution performance.

@eurekaka eurekaka self-assigned this Jun 17, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants