Build dataset using COPY instead of multi-row inserts for TPROC-C #301

Merged: 2 commits, Jan 12, 2022

JelteF (Contributor) commented Dec 17, 2021

NOTE: this PR includes #292, to make it more readable and to make merging
both of them easier in the future. Adding comments to the diff is probably easier
on this equivalent PR with a different base branch on the Citus repo: citusdata#3

The fastest way to bulk insert data in Postgres is to use COPY. This PR
changes the TPROC-C dataset build to use COPY instead of multi-row inserts.
I tried building a dataset for 1000 warehouses using 100 vusers. Without COPY
this took 100 minutes; with COPY it took only 42 minutes. So it reduced the
time it takes to build the dataset by ~58%.
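
To illustrate what changes on the wire: with COPY, rows are streamed as plain delimited text rather than embedded as values in SQL statements. The sketch below is in Python, not the Tcl used by this PR, and the helper names (`copy_escape`, `rows_to_copy_text`) and the sample rows are made up for illustration. It only shows how rows might be rendered in the default COPY text format (tab-separated fields, `\N` for NULL, backslash escapes); it is not the build code itself.

```python
import io


def copy_escape(value):
    """Render one field in PostgreSQL's default COPY text format."""
    if value is None:
        return r"\N"  # default NULL marker in COPY text format
    text = str(value)
    # Escape the backslash first so later escapes are not doubled.
    return (
        text.replace("\\", "\\\\")
        .replace("\t", "\\t")
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def rows_to_copy_text(rows):
    """Join rows into one payload: tab-separated fields, one row per line."""
    buf = io.StringIO()
    for row in rows:
        buf.write("\t".join(copy_escape(v) for v in row))
        buf.write("\n")
    return buf.getvalue()


# Hypothetical sample rows, not taken from the TPROC-C schema.
sample_rows = [(1, "W_ONE", None), (2, "tab\there", 0.1)]
copy_payload = rows_to_copy_text(sample_rows)
print(copy_payload, end="")
```

A client would send a payload like this after issuing a `COPY ... FROM STDIN` command, so the server receives one data stream instead of many INSERT statements.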

Docs on the usage of COPY in the tcl Postgres library can be found here:
http://pgtclng.sourceforge.net/pgtcldocs/pgtcl-example-copy.html

NOTE: A similar improvement could probably be made for Postgres on TPROC-H.

The tcl files in this repo did not seem to adhere to any common
indentation style. This made it quite hard for me to understand what the
code was doing. To resolve this I used the reformat.tcl script that is
provided on the tcl wiki:
https://wiki.tcl-lang.org/page/Reformatting+Tcl+code+indentation

I replaced the original reformat proc that was provided in the first
message, with the reformat2 proc that was contributed by aplsimple in
the last message on the thread. I did make one final change to this
reformat2 script, such that it does not exclude comments from indentation.

I think this makes the code much more readable.

I understand that this is a big PR, but it only changes whitespace. If
you look at the diff with whitespace changes excluded, there's only one change:
the file encoding changes from latin1 to utf-8.

JelteF changed the title from "Build dataset using COPY instead of multi-row inserts" to "Build dataset using COPY instead of multi-row inserts for TPROC-C" on Jan 7, 2022

sm-shaw (Contributor) commented Jan 11, 2022

I have tested this change on the same system, with the changes for #295 already in place.
The schema build was TPROC-C with 800 warehouses and 64 virtual users.
With #295 alone: original Postgres build 12 min 40 secs, new build 8 min 54 secs, a 30% improvement.
With #301 and #295 together: 5 min 57 secs, a 53% improvement on the original build time.
PR is recommended for approval.
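
For reference, the percentages quoted in this thread can be reproduced from the reported timings. This is a small Python sketch; the helper names (`to_seconds`, `reduction_pct`) are made up for illustration, and the only inputs are the build times stated in the comments above.

```python
def to_seconds(minutes, seconds=0):
    """Convert a 'min / secs' timing into seconds."""
    return minutes * 60 + seconds


def reduction_pct(before_s, after_s):
    """Percentage reduction in build time relative to the original time."""
    return 100.0 * (before_s - after_s) / before_s


# PR description: 1000 warehouses, 100 vusers, 100 min without COPY vs 42 min with COPY.
pr_description = reduction_pct(to_seconds(100), to_seconds(42))

# Review test: 800 warehouses, 64 VUs, original build 12 min 40 secs.
original = to_seconds(12, 40)
with_295 = reduction_pct(original, to_seconds(8, 54))
with_295_and_301 = reduction_pct(original, to_seconds(5, 57))

print(round(pr_description), round(with_295), round(with_295_and_301))
```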

sm-shaw added a commit that referenced this pull request on Jan 12, 2022: "Resolve Merge Conflicts for #301"
sm-shaw merged commit e6e4f3e into TPC-Council:master on Jan 12, 2022

sm-shaw (Contributor) commented Jan 12, 2022

Merging Pull Request as voted on by TPC-OSS subcommittee on 11th Jan 2022.
