Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP testing enum as part of PK #277

Merged
merged 6 commits into from
Oct 26, 2016
Merged

WIP testing enum as part of PK #277

merged 6 commits into from
Oct 26, 2016

Conversation

shlomi-noach
Copy link
Contributor

reference: #273

enum values are intentionally not alphabetically monotonic

cc @ggunson

reference: #273

enum values are intentionally not alphabetically monotonic
@shlomi-noach
Copy link
Contributor Author

At this time the test fails, suggesting we suffer from the enum problem:

$ ./localtests/test.sh enum-pk
Building
Testing: enum-pk......ERROR enum-pk: checksum mismatch
---
1       11      red
1       11      green
2       13      green
3       17      orange
4       11      red
4       11      green
5       13      green
6       17      orange
7       11      red
7       11      green
8       13      green
9       17      orange
10      11      red
10      11      green
11      13      green
12      17      orange
13      11      red
13      11      green
14      13      green
15      17      orange
16      11      red
16      11      green
17      13      green
18      17      orange
19      11      red
19      11      green
20      13      green
21      17      orange
---
1       11      red
2       13      green
3       17      orange
4       11      red
4       11      green
5       13      green
6       17      orange
7       11      red
7       11      green
8       13      green
9       17      orange
10      11      red
10      11      green
11      13      green
12      17      orange
13      11      red
13      11      green
14      13      green
15      17      orange
16      11      red
16      11      green
17      13      green
18      17      orange
19      11      red
19      11      green
20      13      green
21      17      orange
+ FAIL

Note the very first two rows.

@shlomi-noach
Copy link
Contributor Author

Fixed, via casting enum value to string as in concat(e).
This abuses the datatype and reduces chance of using the index. That is, consider e enum(...) that is part of primary key:

  • if e is first column in PK, we completely negate use of PK in row iteration. Bad migration performance. Even terrible.
  • if e is 2nd or 3rd column in PK, we're reducing use of PK in row iteration. Likely bad migration performance.

@shlomi-noach
Copy link
Contributor Author

DO NOT MERGE at this time

@shlomi-noach shlomi-noach changed the title testing enum as part of PK WIP testing enum as part of PK Oct 20, 2016
@shlomi-noach
Copy link
Contributor Author

all tests pass at this time

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Oct 24, 2016

sent to testing in prod

@druud62
Copy link

druud62 commented Oct 19, 2017

Or use 0+enum_col.

@shlomi-noach
Copy link
Contributor Author

Or use 0+enum_col.

Sorry, can you please explain?

@druud62
Copy link

druud62 commented Oct 19, 2017

I meant it like this:

CREATE TABLE test (
  tmain enum ('X','A','0','Test','42') NOT NULL
, tsub enum ('9','2','M','S') NOT NULL
, tname varchar(24)
, PRIMARY KEY (tmain,tsub)
, UNIQUE (tname)
);
Query OK, 0 rows affected (0.02 sec)

INSERT INTO test (tmain,tsub,tname) 
VALUES
  ('0','M','0-M')
, ('X','M','X-M')
, ('Test','9','Test-9');
Query OK, 3 rows affected (0.01 sec)
Records: 3  Duplicates: 0  Warnings: 0

SELECT * FROM test;
+-------+------+--------+
| tmain | tsub | tname  |
+-------+------+--------+
| 0     | M    | 0-M    |
| Test  | 9    | Test-9 |
| X     | M    | X-M    |
+-------+------+--------+
3 rows in set (0.00 sec)

SELECT * FROM test FORCE INDEX (PRIMARY);
+-------+------+--------+
| tmain | tsub | tname  |
+-------+------+--------+
| X     | M    | X-M    |
| 0     | M    | 0-M    |
| Test  | 9    | Test-9 |
+-------+------+--------+
3 rows in set (0.00 sec)

SELECT 0+tmain, 0+tsub, tname FROM test;
+---------+--------+--------+
| 0+tmain | 0+tsub | tname  |
+---------+--------+--------+
|       3 |      3 | 0-M    |
|       4 |      1 | Test-9 |
|       1 |      3 | X-M    |
+---------+--------+--------+
3 rows in set (0.00 sec)

SELECT 0+tmain, 0+tsub, tname FROM test FORCE INDEX (PRIMARY);
+---------+--------+--------+
| 0+tmain | 0+tsub | tname  |
+---------+--------+--------+
|       1 |      3 | X-M    |
|       3 |      3 | 0-M    |
|       4 |      1 | Test-9 |
+---------+--------+--------+
3 rows in set (0.00 sec)

@shlomi-noach
Copy link
Contributor Author

Sorry, I still don't follow: is this a suggestion to the user on how to use an enum in a numeric way? If so, wonderful, but this isn't the right place. This place is for commenting on the changes made to the code to support enum as part of PRIMARY KEY. gh-ost should (hopefully) support arbitrary use cases.
I'd suggest turning your comment into a blog post.

@druud62
Copy link

druud62 commented Oct 24, 2017

I presumed that the numeric enum-casting is needed for writing the new table in perfect PK-order.

@shlomi-noach
Copy link
Contributor Author

@druud62 got it. The problem this actually reverse. We need to intentionally convert the enum to string, not to int. The numeric value is implicit -- no need to 0+val, but the correct handling is using the textual value. This is unfortunate because it leads to degraded performance.

See more on https://bugs.launchpad.net/percona-toolkit/+bug/1613915/comments/1

@druud62
Copy link

druud62 commented Oct 25, 2017

In the nibbler of our internal table-clone tool, we do something like this:

CREATE TABLE `ruud_t1` (
  `e_grp` enum('E','B','U','G') NOT NULL,
  `e_id` int(10) unsigned NOT NULL DEFAULT '1',
  PRIMARY KEY (`e_grp`,`e_id`),
  KEY `e_id` (`e_id`)
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

INSERT INTO ruud_t1 (e_grp) VALUES ('E'),('B'),('U'),('G');

# then run this several times, until you have a zillion rows
INSERT INTO ruud_t1 SELECT e_grp, e_id+(SELECT MAX(e_id) FROM ruud_t1) FROM ruud_t1;

SELECT * FROM ruud_t1 FORCE INDEX (PRIMARY) 
WHERE (e_grp,e_id) >= (2,(SELECT MAX(e_id)-2 FROM ruud_t1)) 
LIMIT 0,6
;
+-------+--------+
| e_grp | e_id   |
+-------+--------+
| B     | 524286 |
| B     | 524287 |
| B     | 524288 |
| U     |      1 |
| U     |      2 |
| U     |      3 |
+-------+--------+
6 rows in set (0.47 sec)

@shlomi-noach
Copy link
Contributor Author

shlomi-noach commented Oct 25, 2017

I see. One difference between your implementation and gh-ost's is that you are able to do one time full table scan, hence use FORCE INDEX(PRIMARY).

gh-ost takes it upon itself to iterate N rows at a time. To do that, it needs to compute the first tuple and last tuple according to the migration key (which is typically the PRIMARY KEY but not necessarily), and than take many steps, each time computing the range-end of the next N rows.

Or use 0+enum_col.

At the end of the day, whether we continue doing this as implemented by this PR (concat(enum_col)) or by your suggestion (0+enum_col), we lose index optimization since we run a function on the column value.

There may be a lead in your previous comment that's worth investigating. I'd need a clear time to switch context into this to give it more thought.

@druud62
Copy link

druud62 commented Oct 25, 2017

The "FORCE INDEX (PRIMARY)" is mainly there to retrieve the right rows in the right order.
An "ORDER BY 1,2" would achieve the same.

Yes, with enums in the PK this approach needs a table scan. We store the starting-PK of each chunk, and use that list to process the chunks. No need to store the last tuple too.

We also tried this https://dev.mysql.com/doc/refman/5.7/en/handler.html
for nibbling, but the profit was zero.

The real PK of a table like this, would in many cases be an id++, with either a uniq (e_id, e_grp), or with a uniq (e_grp,e_id) and a key (e_id).
(tables with a secondary key, generally shouldn't have a multi-column PK)
So the example is quite academic.

@ggunson
Copy link
Contributor

ggunson commented Oct 26, 2017

The "FORCE INDEX (PRIMARY)" is mainly there to retrieve the right rows in the right order.
An "ORDER BY 1,2" would achieve the same.

gh-ost doesn't require a primary key (it can use a unique key), and FORCE INDEX() in that case wouldn't guarantee an ordering on that index ascending, because it's still possible for the mysql optimizer to ignore that directive. It might decide to ignore the FORCE INDEX(PRIMARY) too, though I'm not sure. And MySQL docs say to specify the ORDER BY because it can't promise any particular/default ordering of results otherwise (even with GROUP BY after MySQL 8).

@druud62
Copy link

druud62 commented Oct 27, 2017

I don't remember ever having seen it ignore a FORCE INDEX(). Idem IGNORE INDEX().

It does ignore 'USE INDEX()' hints, which generally works out good, providing that the table statistics aren't too old.

(Makes me remember that I have to re-instate the ANALYZE LOCAL TABLE cronjob, that ran locally on all our mysql/maria/etc. boxes, as we recently found more tables with old table stats. The job touches each table once or twice per week, which really helps.)

@ggunson
Copy link
Contributor

ggunson commented Oct 27, 2017

From the docs:

The FORCE INDEX hint acts like USE INDEX (index_list), with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the named indexes to find rows in the table.

Less of an issue here, but I would still be explicit re: ordering in SQL statements, including for the human readability aspect.

I presumed that the numeric enum-casting is needed for writing the new table in perfect PK-order.

I'm not sure how concerned we are about that, @druud62 -- though I don't recall talking to @shlomi-noach about it. We're copying rows while also parsing and playing real time writes to the table, so it wouldn't be in perfect PK order. And if there's a desire for parallel copy that wouldn't be in perfect order either.

Unless there's a problem with the current implementation, I think we should close this and work on fixes that have more impact or performance improvement.

@zmoazeni
Copy link
Contributor

writing the new table in perfect PK-order

I'm a bit naive on this, but what does this mean? What are the pros/cons in terms of performance or data consistency/reliablity where that applies?

As far as I understand, it doesn't matter what order data is written, MySQL will still cluster the data on disk based on the choice and value of the Primary Key (something I've had to change on large tables in order to cluster data more effectively for reads). Also known as a clustered index.

@ggunson
Copy link
Contributor

ggunson commented Oct 27, 2017

@zmoazeni https://blog.jcole.us/2014/10/02/visualizing-the-impact-of-ordered-vs-random-index-insertion-in-innodb/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants