embulk-output-redshift copy should be single command #224

Open
mjalkio opened this issue Jan 4, 2018 · 4 comments

Comments

@mjalkio
Contributor

mjalkio commented Jan 4, 2018

We are trying to utilize #207 (thank you for that feature!), but are hitting some issues with how Embulk copies S3 files into Redshift. Right now I'm using this feature on a table that uses mode: replace.

RedshiftCopyBatchInsert copies one file at a time, issuing a separate COPY for each. As a result, Redshift sorts only the contents of the first file. The rows from all other files are left unsorted, which means a VACUUM command needs to be run after the table is created.

Instead, all the files in S3 should be copied as a single command. This will allow Redshift to keep the entire table sorted while using mode: replace, and will also allow the Redshift cluster to process the files in parallel. See: https://docs.aws.amazon.com/redshift/latest/dg/t_Loading-data-from-S3.html
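For illustration, here is a minimal sketch of what a single-command load could look like, assuming the staged files are listed in a Redshift manifest. The bucket, table name, IAM role, and helper functions are hypothetical and are not part of embulk-output-redshift; format options such as delimiter or compression are left out.

```python
import json


def build_manifest(s3_urls):
    """Return manifest JSON text listing every staged file for one COPY."""
    entries = [{"url": url, "mandatory": True} for url in s3_urls]
    return json.dumps({"entries": entries}, indent=2)


def build_copy_sql(table, manifest_url, iam_role):
    """Return one COPY statement that loads all files named in the manifest."""
    return (
        f"COPY {table}\n"
        f"FROM '{manifest_url}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        "MANIFEST;"
    )


if __name__ == "__main__":
    # Hypothetical staged files written by the output tasks.
    staged = [
        "s3://example-bucket/embulk/part-0000.gz",
        "s3://example-bucket/embulk/part-0001.gz",
    ]
    print(build_manifest(staged))
    print(build_copy_sql(
        "my_table",
        "s3://example-bucket/embulk/load.manifest",
        "arn:aws:iam::123456789012:role/example-role",
    ))
```

The manifest would be uploaded to S3 once all files are staged, and the single COPY would then be run against it.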

@hito4t
Contributor

hito4t commented Jan 10, 2018

@mjalkio
Thank you for the information!
Copying all files at once may be slower than copying files one after another.
So we should add a new property that lets users choose between the two behaviors.

@mjalkio
Contributor Author

mjalkio commented Jan 10, 2018

A single COPY command within Redshift shouldn't be slower than copying the files one after another; it's what the docs recommend. We still want to create multiple files in S3, but they should be ingested through a single COPY command as described here.

@hito4t
Contributor

hito4t commented Jan 10, 2018

embulk-output-redshift copies each file as soon as it is ready.
That is, it starts copying files before all of the files have been prepared.
By the time the last file is prepared, embulk-output-redshift might already have copied the other n-1 files.
In that case, copying only the last file will be faster than copying all files at once.
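The trade-off can be made concrete with a toy timing model. This is only a sketch under strong assumptions: copies run serially, each per-file COPY takes a fixed time, and a single COPY of all files takes a fixed (shorter than the sum) time because Redshift loads the files in parallel. The function names and all numbers are hypothetical and are not measurements.

```python
def incremental_finish(ready_times, copy_time_per_file):
    """Finish time when each file is copied as soon as it is ready (serially)."""
    t = 0.0
    for ready in sorted(ready_times):
        # A copy starts once the file is ready and the previous copy is done.
        t = max(t, ready) + copy_time_per_file
    return t


def single_copy_finish(ready_times, copy_time_all_files):
    """Finish time when one COPY runs after the last file is ready."""
    return max(ready_times) + copy_time_all_files


if __name__ == "__main__":
    per_file, all_at_once = 15, 25  # hypothetical seconds

    # One task finishes much later than the others.
    skewed = [10, 20, 30, 100]
    print(incremental_finish(skewed, per_file), single_copy_finish(skewed, all_at_once))

    # All tasks finish at nearly the same time.
    even = [98, 99, 100, 100]
    print(incremental_finish(even, per_file), single_copy_finish(even, all_at_once))
```

Under these assumptions, copying incrementally wins when one file is ready much later than the rest, while a single COPY wins when all files become ready at nearly the same time.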

@mjalkio
Contributor Author

mjalkio commented Jan 10, 2018

Ah, okay. I guess I'm not familiar with other Embulk user workloads. It definitely isn't the case for us that our embulk-output-redshift threads finish at significantly different times. If that's the case for others, an option would help us a lot.
