semi join performance issue #10226

mxlxm · 2019-04-22T10:22:50Z

Bug Report

Please answer these questions before submitting your issue. Thanks!

What did you do?
create table t1 (id bigint, ctime timestamp);
generate 10w rows data
select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27');
What did you expect to see?
response quickly as tidb-v2.1.4
What did you see instead?
very slow since plan/executor: make semi joins null and empty aware (#9051) #9449 merged
What version of TiDB are you using (tidb-server -V or run select tidb_version(); on TiDB)?
tidb-v2.1.8

The text was updated successfully, but these errors were encountered:

zz-jason · 2019-04-23T11:34:04Z

To guarantee correctness, we moved the equal condition eq(est.t1.id, test.t1.id) to be evaluated after the data is joined in PR #9449. So the execution plan for this query at present is:

TiDB(root@127.0.0.1:test) > desc select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27');
+------------------------+----------+------+-----------------------------------------------------------------------------+
| id                     | count    | task | operator info                                                               |
+------------------------+----------+------+-----------------------------------------------------------------------------+
| HashLeftJoin_9         | 8000.00  | root | anti semi join, inner:TableReader_14, other cond:eq(test.t1.id, test.t1.id) |
| ├─TableReader_11       | 10000.00 | root | data:TableScan_10                                                           |
| │ └─TableScan_10       | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
| └─TableReader_14       | 3323.33  | root | data:Selection_13                                                           |
|   └─Selection_13       | 3323.33  | cop  | lt(test.t1.ctime, 2019-04-22 18:09:27.000000)                               |
|     └─TableScan_12     | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
+------------------------+----------+------+-----------------------------------------------------------------------------+
6 rows in set (0.01 sec)

As you can see, it performs a cartesian product first and then performs the filter eq(test.t1.id, test.t1.id).

To avoid this transformation, you can add a not null property for the column id. That is to say, you can create the table in the following way:

create table t1 (id bigint not null, ctime timestamp);

Then the query execution plan is:

TiDB(root@127.0.0.1:test) > desc select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27');
+------------------------+----------+------+--------------------------------------------------------------------------+
| id                     | count    | task | operator info                                                            |
+------------------------+----------+------+--------------------------------------------------------------------------+
| HashLeftJoin_9         | 8000.00  | root | anti semi join, inner:TableReader_14, equal:[eq(test.t1.id, test.t1.id)] |
| ├─TableReader_11       | 10000.00 | root | data:TableScan_10                                                        |
| │ └─TableScan_10       | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo              |
| └─TableReader_14       | 3323.33  | root | data:Selection_13                                                        |
|   └─Selection_13       | 3323.33  | cop  | lt(test.t1.ctime, 2019-04-22 18:09:27.000000)                            |
|     └─TableScan_12     | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo              |
+------------------------+----------+------+--------------------------------------------------------------------------+
6 rows in set (0.00 sec)

Because id is guaranteed to be not null. This time the equal condition is moved to be the hash key of hash join. Thus the cartesian product is avoided.

In the future, we are planning to derive not null property from filters and scalar functions, thus you can simply add a is not null filter to avoid the cartesian product in the future. But for now, this optimization is not performed, so the following query still can not avoid the cartesian product:

TiDB(root@127.0.0.1:test) > desc select * from t1 where id not in (select id from t1 where ctime < '2019-04-22 18:09:27' and id is not null);
+------------------------+----------+------+-----------------------------------------------------------------------------+
| id                     | count    | task | operator info                                                               |
+------------------------+----------+------+-----------------------------------------------------------------------------+
| HashLeftJoin_9         | 8000.00  | root | anti semi join, inner:TableReader_14, other cond:eq(test.t1.id, test.t1.id) |
| ├─TableReader_11       | 10000.00 | root | data:TableScan_10                                                           |
| │ └─TableScan_10       | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
| └─TableReader_14       | 3320.01  | root | data:Selection_13                                                           |
|   └─Selection_13       | 3320.01  | cop  | lt(test.t1.ctime, 2019-04-22 18:09:27.000000), not(isnull(test.t1.id))      |
|     └─TableScan_12     | 10000.00 | cop  | table:t1, range:[-inf,+inf], keep order:false, stats:pseudo                 |
+------------------------+----------+------+-----------------------------------------------------------------------------+
6 rows in set (0.00 sec)

mxlxm · 2019-04-23T14:24:59Z

To guarantee correctness.

this behavior is different from MySQL, NOT IN (subquery) requires additional checks for NULL values to guarantee correctness even with MySQL.

due to historical reasons, NULL represents a business status, so we can't just modify this column.

now we use NOT EXISTS instead of NOT IN(with null values checks) to avoid the cartesian product.

in production, we have 1000w+ records, with v2.1.4- similar query cost 20s, after v2.1.5+ same query cost 4h+.

eurekaka · 2019-04-24T06:57:58Z

this behavior is different from MySQL, NOT IN (subquery) requires additional checks for NULL values to guarantee correctness even with MySQL.

Could you please illustrate more about this comment?

now we use NOT EXISTS instead of NOT IN(with null values checks) to avoid the cartesian product.

NOT EXISTS should be different with NOT IN semantically regarding NULL values of nested table, they may return different results sometimes.

mxlxm · 2019-04-24T07:46:00Z

take a look at this:
https://explainextended.com/2010/05/27/left-join-is-null-vs-not-in-vs-not-exists-nullable-columns/

eurekaka · 2019-04-24T08:56:00Z

That's interesting.

this behavior is different from MySQL

To make it clear, actually the SQL "behavior" of TiDB is compatible with MySQL now, but the implementation is different. In MySQL,

It takes a single index scan to find out if there are NULL values in t_inner and return if they are

Another interesting thing is that we can learn from the blog to optimize LEFT JOIN / IS NULL to NOT EXISTS in TiDB to improve join execution performance.

eurekaka self-assigned this Jun 17, 2019

zz-jason added sig/planner SIG: Planner type/performance labels Mar 10, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

semi join performance issue #10226

semi join performance issue #10226

mxlxm commented Apr 22, 2019

zz-jason commented Apr 23, 2019

mxlxm commented Apr 23, 2019 •

edited

Loading

eurekaka commented Apr 24, 2019

mxlxm commented Apr 24, 2019

eurekaka commented Apr 24, 2019

semi join performance issue #10226

semi join performance issue #10226

Comments

mxlxm commented Apr 22, 2019

Bug Report

zz-jason commented Apr 23, 2019

mxlxm commented Apr 23, 2019 • edited Loading

eurekaka commented Apr 24, 2019

mxlxm commented Apr 24, 2019

eurekaka commented Apr 24, 2019

mxlxm commented Apr 23, 2019 •

edited

Loading