Use parallel ES indexing #185
Results for SNAFU CI Test
@rsevilla87 good idea, but does it scale? If you were doing this with 30 nodes instead of 1, would py_es_bulk be able to back off if ES is overloaded? I think so (@acalhounRH what do you think?), but has anyone tried it?
@portante can you please review this? I distinctly remember you finding a boatload of problems with parallel indexing and suggesting that we stick with serial indexing.
Client-side indexing is hard to scale. Unless you control all the clients, you cannot control the level of parallelism each one uses, and an Elasticsearch instance can end up swamped. If
I have already added parallel_bulk indexing with PR #173.
This one enables parallelism optionally, defaulting to false. Do you want me to wait for #173 or move forward with this one?
I would prefer to wait for #173, if you don't mind.
@acalhounRH in that case, can you please update your PR to support enabling parallelism optionally, but defaulting to serial given @portante's comments above?
This will require a change in both ripsaw and snafu: ripsaw to set the environment variable, and snafu to check that variable to switch between parallel and stream indexing.
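A minimal sketch of what the snafu side of that switch could look like, assuming the `elasticsearch` Python client's `helpers.streaming_bulk` and `helpers.parallel_bulk`. The variable name `ES_PARALLEL_INDEXING` and the function names are hypothetical, since the thread does not name them, and this is not the code from either PR. The default stays serial, as requested above.

```python
import os

# Hypothetical variable name; the thread does not say what ripsaw would set.
PARALLEL_ENV_VAR = "ES_PARALLEL_INDEXING"


def parallel_indexing_enabled(environ=None):
    """Return True only when the variable is explicitly set to a truthy value."""
    environ = os.environ if environ is None else environ
    value = environ.get(PARALLEL_ENV_VAR, "false")
    return value.strip().lower() in ("1", "true", "yes")


def index_documents(es, actions, environ=None):
    """Index `actions` in parallel if enabled, otherwise as a serial stream.

    Returns a (successes, failures) tuple.
    """
    # Imported lazily so the toggle above can be exercised without the client installed.
    from elasticsearch import helpers

    if parallel_indexing_enabled(environ):
        # Opt-in path: several worker threads send bulk requests concurrently.
        results = helpers.parallel_bulk(
            es, actions, thread_count=4, raise_on_error=False
        )
    else:
        # Default path: one bulk request at a time; rejected (HTTP 429) chunks
        # are retried with exponential backoff.
        results = helpers.streaming_bulk(
            es, actions, max_retries=3, initial_backoff=2, raise_on_error=False
        )

    successes = failures = 0
    for ok, _item in results:
        if ok:
            successes += 1
        else:
            failures += 1
    return successes, failures
```

With this shape, ripsaw would only need to export the variable in the benchmark pod's environment; leaving it unset keeps the serial behaviour. Note that the retry and backoff arguments shown belong to `streaming_bulk`; the thread numbers and retry counts are illustrative values, not tuned ones.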
In a simple 10-minute test I got:
and with parallel