Build dataset using COPY instead of multi-row inserts for TPROC-C #301

JelteF · 2021-12-17T10:47:31Z

NOTE: this PR includes #292, to make it more readable and to make merging both
of them easier in the future. Adding comments to the diff is probably easier on
this equivalent PR, which uses a different base branch on the Citus repo: citusdata#3

The fastest way to bulk insert data in Postgres is by using COPY. This PR
changes the dataset build to use COPY for TPROC-C. I tried building a dataset for
1000 warehouses using 100 vusers. Without COPY this took 100 minutes; with
COPY it took only 42 minutes. So it reduced the time it takes to build the
dataset by ~58%.

Docs on the usage of COPY in the tcl Postgres library can be found here:
http://pgtclng.sourceforge.net/pgtcldocs/pgtcl-example-copy.html
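
For readers unfamiliar with COPY, here is a minimal sketch of what a COPY payload looks like in PostgreSQL's default text format: one line per row, tab-separated fields, and `\N` for NULL. It is written in Python purely for illustration and is not the Tcl code in this PR; the table name `demo_item` and the helper names are hypothetical.

```python
import io


def copy_escape(value):
    """Render one field in PostgreSQL's COPY text format."""
    if value is None:
        return r"\N"  # default NULL marker in text format
    text = str(value)
    # Escape the backslash first so the escapes added below are not doubled.
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_copy_text(rows):
    """Serialize rows as tab-delimited lines, one row per line."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(copy_escape(v) for v in row))
        buf.write("\n")
    return buf.getvalue()


# Hypothetical table and sample rows, for illustration only.
COPY_STATEMENT = "COPY demo_item FROM STDIN"
rows = [(1, "widget", 9.99), (2, None, 0.5), (3, "tab\there", 1.0)]
payload = rows_to_copy_text(rows)
print(COPY_STATEMENT)
print(payload, end="")
```

The whole batch travels as one data stream after a single statement, rather than as many parsed INSERT statements. How the payload is actually sent depends on the client library; for Tcl, the pgtcl documentation linked above covers that part.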

NOTE: A similar improvement could probably be made for Postgres on TPROC-H.

The tcl files in this repo did not seem to adhere to any common indentation style, which made it quite hard for me to understand what the code was doing. To resolve this I used the reformat.tcl script that is provided on the tcl wiki: https://wiki.tcl-lang.org/page/Reformatting+Tcl+code+indentation

I replaced the original reformat proc from the first message with the reformat2 proc that aplsimple contributed in the last message on the thread. I made one final change to this reformat2 script, so that it does not exclude comments from indentation.

I think this makes the code much more readable. I understand that this is a big PR, but it only changes whitespace. If you look at the diff with whitespace excluded, there is only one change: the file encoding changes from latin1 to utf-8.

sm-shaw · 2022-01-11T10:26:47Z

I have tested this change on the same system, with the changes for #295 already in place.
The schema build is for TPROC-C with 800 warehouses (800WH) and 64 virtual users (64VUs).
For #295 alone the results were:
postgres orig = 12 min 40 secs / postgres new = 8 min 54 secs = 30% improvement
With #301 and #295 together, the schema build took 5 min 57 secs = 53% improvement on the original build time.
PR is recommended for approval.
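
The percentages quoted in this thread can be checked with a few lines of arithmetic. The sketch below is Python for illustration only; the helper names are hypothetical, and the inputs are the build times reported in the comments above.

```python
def to_seconds(minutes, seconds=0):
    """Convert a 'min / secs' timing to seconds."""
    return minutes * 60 + seconds


def reduction_pct(before, after):
    """Percentage by which the build time dropped, relative to 'before'."""
    return (before - after) / before * 100


# 800 warehouses, 64 virtual users
orig = to_seconds(12, 40)       # original build
with_295 = to_seconds(8, 54)    # with #295 only
with_both = to_seconds(5, 57)   # with #295 and #301

print(round(reduction_pct(orig, with_295)))   # 30
print(round(reduction_pct(orig, with_both)))  # 53

# 1000 warehouses, 100 virtual users, from the PR description
print(round(reduction_pct(to_seconds(100), to_seconds(42))))  # 58
```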

sm-shaw · 2022-01-12T13:32:37Z

Merging Pull Request as voted on by TPC-OSS subcommittee on 11th Jan 2022.

JelteF force-pushed the copy-support branch from 44cb5e7 to 5791ba1 · January 7, 2022 09:23

JelteF changed the title ~~Build dataset using COPY instead of multi-row inserts~~ Build dataset using COPY instead of multi-row inserts for TPROC-C Jan 7, 2022

JelteF force-pushed the copy-support branch from 5791ba1 to 4bf90a0 · January 7, 2022 15:26

sm-shaw added a commit that referenced this pull request Jan 12, 2022

Merge branch 'citusdata-copy-support'

48736d9

Resolve Merge Conflicts for #301

sm-shaw merged commit e6e4f3e into TPC-Council:master Jan 12, 2022
