-
Notifications
You must be signed in to change notification settings - Fork 1.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multiprocessing for thumbnails. #19
base: master
Are you sure you want to change the base?
Conversation
To speed up thumbnail generation on multicore machines, this modifys the thumb_pdf.py script to use python's multiprocessing library.
I had someone submit a similar commit before and it broke everything because intermediate temp files overwrote each other chaotically. I see in your code that you are creating multiple intermediate directories |
I just finished comparing the generated images using the two methods. All but three files were binary identical, and those three were visually identical. I'm guessing there's some weird non-determinism (or bug) in one of the graphic programs. I'll look into that more later. Also, during my run, I canceled the runs midway, and was able to successfully resume later. CTRL-C behavior is not quite as nice as I would like (spits up a fairly long stack trace) but it does work. I would have liked to have made it simply
But apparently there's been a long outstanding bug on this behavior that prevents this from working as expected. Python's multiprocessing story is not as clean as I would like. http://bugs.python.org/issue8296 So we're stuck with polling. :( Note, you might want to pair down the timeout (or I can do that in this request); I increased it because I was testing high job levels initially, and those would cause the job to take longer than the timeout. At the moment, it's at 20 minutes; which probably means that I never hit a file that caused an infinite loop. Do you have one pdfs that causes the inifinite looping lying around? |
Oh, and yes, I had earlier verified that the thumbnails were loading correctly running the server locally. Sorry, I didn't think to compare the two methods directly until later. |
Hi Andrej; is there anything that I can do to modify this commit to make it more acceptable and trustworthy for you? I feel that this change is valuable to people bootstrapping their initial arxivs viewers. |
Hi @cwgreene I appreciate the PR but I just recently went through a scarring experience of merging someone else's PR and it broke arxiv-sanity and required me to revert breaking changes for an hour. I would categorize this PR as a luxury/exotic feature. I understand it may speed up thumbnail generation for people who are initializing their libraries by a decent constant factor, but computing these thumbnails is a single-time compute that I don't think bottlenecks anyone too seriously in practice. I'll opt to keep the simplicity of the code in this case, but I'm happy to keep this PR around for anyone who wants the feature. |
To speed up thumbnail generation on multicore machines, this modifies
the thumb_pdf.py script to use python's multiprocessing library.