SoS performance dealing with a large number of files #874
I can show my numbers here, but what are your numbers without ... and with ...?
Okay, here are my stats: ...
I wonder what your take is on this. I did not know ...
This basically shows high overhead in the dispatching of SoS tasks. The tasks were dispatched in parallel by a few threads, but that caused a nasty bug. I suppose some optimization can be done in this area.
I see ... I guess it might be particularly frustrating to people who work on methods development, who often send lots of jobs out to test different models on the same dataset. I'm in a collaboration on an RNA project that tries to use CNN-based classifiers, which require lots of runs to tune, and we want to keep track of all the outcomes ... Anyway, a perhaps easier optimization is to automatically decide the trunk size when it is not specified? It seems to save a lot of time when properly configured, but a difference in behavior that relies on a not-so-obvious option is not as elegant as optimizing it automatically.
The pause between the completion of tasks and the starting of the next batch of tasks is for checking the status of tasks. Basically, after the tasks are completed, ...
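Not from the SoS source, just a minimal sketch of the kind of status-polling loop described here, to show how a fixed polling interval turns into a visible pause between batches (`check_status` is a hypothetical placeholder):

```python
import time

def check_status(task_id):
    """Hypothetical placeholder: a real implementation would read a task
    status file or query the task engine."""
    return 'completed'

def wait_for_tasks(task_ids, poll_interval=2.0):
    """Poll task status until every task has finished.

    With a fixed poll_interval, there is an up-to-poll_interval pause
    between one batch finishing and the next batch being submitted.
    """
    pending = set(task_ids)
    while pending:
        done = {t for t in pending
                if check_status(t) in ('completed', 'failed', 'aborted')}
        pending -= done
        if pending:
            time.sleep(poll_interval)  # the ~2 second pause observed in practice

wait_for_tasks(['t1', 't2', 't3'])
```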
Ah, that's consistent with my feeling about the 2 sec pause. Thanks for clarifying! I guess it is completely acceptable to have remote tasks pause every 2 sec. From my observation, people who work on methods development tend to use local hosts more, because they often try lots of stuff with very light input data (vs. only a few heavy jobs with large data on a cluster for real data analysis). For local tasks it would be great if we could optimize!
I tried to adjust the code (e.g. reduce waiting time, shortcut status checking) but did not see any obvious improvement. Basically, the separation of task execution and monitoring comes at a cost that is hard to reduce. Unless we use a more efficient communication method (a message queue such as Redis for RQ, and Redis and others for Celery) to exchange information, I doubt we can do much about it. If you have to use ...
In this method, the tasks would be executed inside SoS. They would not have task IDs and all the status machinery, but they would be executed in parallel.
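To illustrate the message-queue alternative mentioned above (Redis as used by RQ/Celery), here is a rough, hypothetical sketch using redis-py; the channel name and payload format are made up, and this is not how SoS communicates today:

```python
import json
import redis  # third-party: pip install redis

r = redis.Redis()

def report_status(task_id, status):
    """Worker side: push a status change instead of writing a file
    that has to be polled every couple of seconds."""
    r.publish('task_status', json.dumps({'id': task_id, 'status': status}))

def wait_for(task_ids):
    """Submitter side: block on incoming messages instead of sleeping
    and re-checking each task."""
    pending = set(task_ids)
    sub = r.pubsub()
    sub.subscribe('task_status')
    for msg in sub.listen():
        if msg['type'] != 'message':
            continue
        event = json.loads(msg['data'])
        if event['status'] in ('completed', 'failed'):
            pending.discard(event['id'])
        if not pending:
            break
```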
I see ... so what's the drawback of having to use the old task code on ...?
Basically, tasks are executed and monitored outside of SoS with their own signatures. That is to say, ...
An ...
Technically speaking, internally-executed tasks could also have external task files and could also maintain signatures, so 1, 3, 4, and 6 could hold. So the only differences would be 2 and 5, because ...
Ha, I was about to post a similar comment stating that, in this context, it'd be more practical (and less puzzling + frustrating) for beginners, or for people who work solely on desktops, to be able to run the old task model, if maintaining 2 mechanisms is not too bad.
Not quite sure about the "if" part. The old internal behavior can co-exist with the task model well because it does not have anything external (no task file, no signature), so what you gain is only a way to run part of the step in parallel. If we implement task files and signatures, things can get complicated for cases such as when the same task is already running externally or when the task is killed externally; basically, we will have to check the status of tasks, which can again be inefficient.
Sorry, I think this is where my major confusion is. Since this seems to be the most obvious practical reason against not using task files and signatures, I'd like to clarify: suppose we have an outcome-oriented style workflow that properly builds a full DAG, do I still only get part of the steps running in parallel on ...? A practical use case would be prototyping on localhost with ...
Another possibility, as I pointed out earlier, is to use an automatic chunk_size. I understand it will make the workflow less portable, but at least on a single machine the automatic task chunk_size should be consistent, so we'll be set on one (or to some, the only) computational environment.
We are talking about two levels of parallelism: the DAG level and the step level. At the step level, currently only tasks can be run in parallel, as external tasks. At the DAG level, independent paths will be executed in parallel, and the steps will wait for the completion of tasks, regardless of whether they are external or internal. However, one of the motivations for the current implementation is that SoS currently uses workers for both steps and tasks, so ...
Automatic trunk size sounds like a good idea. It basically assumes that tasks (from the same step) will take roughly the same time to complete, groups the tasks, and sends groups of tasks to a small number of workers. The main benefit is that it checks the status of a much smaller number of tasks, and tasks within each group can be executed sequentially without status checks in between. The only problems are error handling (#772) and checking the status of individual tasks.
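As a rough illustration of what an automatic trunk size could look like (the heuristic and names are assumptions, not SoS code):

```python
import math

def auto_trunk_size(n_tasks, n_workers, groups_per_worker=4):
    """Pick a trunk size so each worker gets a few groups of tasks.

    Assumes tasks from the same step take roughly the same time, so the
    scheduler only has to track n_workers * groups_per_worker groups
    instead of every individual task.
    """
    n_groups = min(n_tasks, n_workers * groups_per_worker)
    return max(1, math.ceil(n_tasks / n_groups))

# e.g. 5000 tasks on 20 workers -> trunk size 63 (80 groups to track)
print(auto_trunk_size(5000, 20))
```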
Ahh I see. #772 looks like a very practical issue that can be quite inconvenient in research mode ... so the idea of auto trunking is now less appealing to me, unless there is a way out of that.
A way out is actually not that hard. Since the main problem was the frequent checking of task status to submit a new task, we could potentially send groups of tasks to the executor and let it run them one by one. In this way we know the number of concurrent jobs (groups), and the executor can execute the tasks more efficiently.
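A minimal sketch of that idea using Python's concurrent.futures, assuming each task is a picklable callable; only the groups are tracked for completion, and tasks within a group run sequentially:

```python
from concurrent.futures import ProcessPoolExecutor

def run_group(tasks):
    # Tasks within a group run one by one, with no status polling in between.
    return [task() for task in tasks]

def run_in_groups(tasks, n_workers=4, trunk_size=50):
    """Send groups of tasks to the executor; only the groups (not the
    individual tasks) need to be tracked for completion."""
    groups = [tasks[i:i + trunk_size] for i in range(0, len(tasks), trunk_size)]
    results = []
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        for group_result in pool.map(run_group, groups):
            results.extend(group_result)
    return results
```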
Oh, so it is auto chunking, but happening at a different level? It sounds like an efficient solution!
The patch makes use of ... My numbers (-J4, without ...) went from ... to ... with the patch.
Great! I have just checked, but the improvement on my end seems less significant than on yours. It reduced from the numbers above (...) to ... So there is certainly an improvement, though still not as fast as with ...
Does it mean the checks are still ...?
Right now, we ...
Thanks for the clarification. I see that it is the best we can do with the new task model without trunks. My concern is: don't we want to set an upper limit? I believe a command line can't be arbitrarily long, and I worry what happens if there are, say, 5000 tasks at one step.
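For what it's worth, a hedged sketch of guarding against the command-line length limit by batching task IDs; `os.sysconf('SC_ARG_MAX')` is POSIX-only, and the `sos execute` hand-off is only an assumption about how the batches would be submitted:

```python
import os
import subprocess

def submit(cmd):
    # Placeholder for handing a batch of tasks to the task engine.
    subprocess.run(cmd, check=True)

def submit_in_batches(task_ids, base_cmd=('sos', 'execute')):
    """Split a long list of task IDs into batches that stay within the
    platform's maximum command-line length (ARG_MAX on POSIX)."""
    try:
        arg_max = os.sysconf('SC_ARG_MAX')
    except (ValueError, OSError, AttributeError):
        arg_max = 131072  # conservative fallback
    budget = arg_max // 2  # leave headroom for the environment block
    batch, used = [], 0
    for tid in task_ids:
        if batch and used + len(tid) + 1 > budget:
            submit(list(base_cmd) + batch)
            batch, used = [], 0
        batch.append(tid)
        used += len(tid) + 1
    if batch:
        submit(list(base_cmd) + batch)
```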
In my case ... then ... because my system cannot actually afford 20 concurrent tasks, so there is a lot of competition.
It is a good thing that we have a reasonable default for ... So this fix is in fact an improvement to the current task model, whether it be ...
Do we still need the inline option? It is only useful for large numbers of tiny tasks ...
Agreed. It may also depend on how people organize their projects, but I see value in it. This will be true for methods projects where people want to try a lot of settings. For me, even in real data analysis related to RNA-expression mapping / feature-selection type studies, where I first break the data into > 30K genes and then apply all types of analysis with various methods, a large number of tiny tasks is created; and the work is often restricted to my 40-thread desktop so that I can get quick feedback.
However, this will add an additional parameter to the ...
It can be an option to ...; then the entire input groups will be executed in parallel. This will not change the definition of a task (as external) and will help the general execution of input groups.
I see, then for such a computing environment we'll just drop ...
Well, at least this is what I expect for ...
Yes, because ...
Well, it sounds like a corner case that could bother us if not properly handled (do you foresee this happening a lot?) ... but other than this, I'd say I'm all in for ...
With ...
Great! But I'm having this issue: ...

It hangs ... I think it has something to do with the premature stop I created earlier with ...
Removing ...
It worked. I actually tried to remove something from ...; only the last command worked! Now performance: ...
This is very satisfying. So, other than the hang issue, and without reading the patch in detail: what if I then put in ...?
Tasks are not allowed in concurrent input groups. |
Yes, I understand; I see now that an exception will be raised. It is perhaps too much to ask, but when switching between local and remote execution users will have to significantly modify their scripts. I wonder if it makes sense for us to sweep this under the rug for them.
Do you mean ignoring ...?
What I meant was that ... will submit tasks sequentially, which will then be executed in parallel, and ... would make no difference because it only speeds up task submission (on paper).
I understand and agree with your point. I was thinking of making switching back/forth a bit easier.
I'd say maybe we only raise an exception in this case? i.e. ...
I would also not complain if we keep the current behavior. After all, I was the one who insisted on making them separate cases from the start :)
It is not only about ...
This is currently not allowed because of complications in logic and implementation (the concurrent input groups, now in separate processes, would need to be clever enough to handle tasks). Since ...
I think this is reasonable. BTW, this new ...
Sorry, I just want to double-check the defaults here: we have ...
-j is for execution inside SoS (DAG and input groups); -J is for execution outside of SoS (task queue). Currently -J has a default of N = nCPU/2, while -j has a default of 4, which was set more or less arbitrarily. I suppose it makes sense to also set -j = N. -j and -J are more or less independent, and I have not had a case with both internal and external parallelism on the same machine (I always use a cluster).
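Just to restate those defaults concretely (the -j = N part is the suggestion above, not current behavior):

```python
import os

n_cpu = os.cpu_count() or 1

# -J: workers for external tasks (task queue), default N = nCPU / 2
default_J = max(1, n_cpu // 2)

# -j: workers inside SoS (DAG and input groups); currently a fixed 4,
# with the suggestion to also use N = nCPU / 2
current_default_j = 4
proposed_default_j = default_J

print(default_J, current_default_j, proposed_default_j)
```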
Indeed, I think this justifies setting ...
This bothers me when I do some very simple simulations:
If you run this script, you'll see it halts for a second or two at the end of every batch of completed jobs. I can understand that things like signature checks are going on. As a result, a simple simulation that takes < 10 sec as a for loop can take as much as > 700 sec with SoS -- the overhead takes far longer than the actual computation. I remember it used to be 10 sec vs > 100 sec before last summer. Now I guess as signature checking has become more strict and careful about race conditions, the whole process is a lot slower. Is there still room for optimization?
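A back-of-the-envelope sketch of the overhead implied by those timings; the task count is only a placeholder, since the script itself is not reproduced here:

```python
def per_task_overhead(total_sos_sec=700, plain_loop_sec=10, n_tasks=500):
    """Rough estimate of the dispatch + signature overhead per task,
    given the whole-run timings reported above.  n_tasks=500 is only a
    placeholder; the original script is not reproduced here."""
    return (total_sos_sec - plain_loop_sec) / n_tasks

# With these assumptions, each task carries well over a second of overhead,
# which dwarfs the tiny simulation it actually performs.
print(f"~{per_task_overhead():.2f} s overhead per task")
```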