Multiprocessing for thumbnails. #19

cwgreene · 2016-02-28T04:54:15Z

To speed up thumbnail generation on multicore machines, this modifies
the thumb_pdf.py script to use python's multiprocessing library.

karpathy · 2016-02-28T08:35:20Z

I had someone submit a similar commit before and it broke everything because intermediate temp files overwrote each other chaotically. I see in your code that you are creating multiple intermediate directories tmp-%d to address this issue? I am being overly cautious here - I assume you tried this and it works?
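For readers following along, the per-process scratch directory idea mentioned above can be sketched roughly as follows. This is not the code from the PR: the helper name `with_private_scratch` and the `work` callback are hypothetical, and the `tmp-%d` prefix is filled with the process ID here purely as an assumption about how the directories might be kept apart.

```python
import os
import shutil
import tempfile


def with_private_scratch(work, base_dir=None):
    """Run work(scratch_dir) inside a directory that no other process shares."""
    # mkdtemp creates a uniquely named directory; the PID prefix (an assumption)
    # just makes it easy to see which process owns which directory.
    scratch = tempfile.mkdtemp(prefix="tmp-%d-" % os.getpid(), dir=base_dir)
    try:
        return work(scratch)
    finally:
        # Clean up the intermediate files even if the work raised.
        shutil.rmtree(scratch, ignore_errors=True)
```

Each worker would write its intermediate files only under the directory it is handed, so two workers never touch the same temp file.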

cwgreene · 2016-02-28T10:08:03Z

I just finished comparing the images generated by the two methods. All but three files were byte-for-byte identical, and those three were visually identical. I'm guessing there's some weird non-determinism (or a bug) in one of the graphics programs. I'll look into that more later.
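A comparison of this kind could be scripted along the following lines. This is only a sketch of one way to do it, not necessarily how the check above was run; the function name `differing_files` and the directory names in the usage note are hypothetical.

```python
import filecmp
import os


def differing_files(dir_a, dir_b):
    """Return the names present in both directories whose contents differ."""
    common = sorted(set(os.listdir(dir_a)) & set(os.listdir(dir_b)))
    # shallow=False compares file contents instead of trusting os.stat() data.
    return [name for name in common
            if not filecmp.cmp(os.path.join(dir_a, name),
                               os.path.join(dir_b, name),
                               shallow=False)]
```

Calling `differing_files("thumbs_serial", "thumbs_parallel")` (both directory names are placeholders) would list the thumbnails that need a visual check.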

Also, during my run, I canceled the job midway and was able to resume successfully later. CTRL-C behavior is not quite as nice as I would like (it spits out a fairly long stack trace), but it does work. I would have liked to make it simply:

pool.close()
pool.join()

But apparently there's a long-standing bug in this behavior that prevents it from working as expected. Python's multiprocessing story is not as clean as I would like.

http://bugs.python.org/issue8296

So we're stuck with polling. :(
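For anyone unfamiliar with the workaround, the polling pattern looks roughly like the sketch below. It is not the exact code in this PR: `make_thumbnail`, `poll_until_ready`, and `run_pool` are hypothetical names, the half-second poll interval is an arbitrary choice, and the 1200-second default just mirrors the 20-minute timeout discussed in this thread.

```python
import multiprocessing
import time


def make_thumbnail(pdf_name):
    # Hypothetical stand-in for the real per-PDF conversion work.
    return pdf_name.replace(".pdf", ".jpg")


def poll_until_ready(async_result, poll_seconds=0.5, timeout_seconds=1200):
    """Wait in short slices so the parent process can still react to CTRL-C."""
    deadline = time.time() + timeout_seconds
    while not async_result.ready():
        if time.time() >= deadline:
            raise multiprocessing.TimeoutError("workers exceeded the timeout")
        # A wait with a timeout returns control to this loop regularly.
        async_result.wait(poll_seconds)
    return async_result.get()


def run_pool(func, items, processes=2, timeout_seconds=1200):
    pool = multiprocessing.Pool(processes)
    try:
        results = poll_until_ready(pool.map_async(func, items),
                                   timeout_seconds=timeout_seconds)
        pool.close()
        return results
    except BaseException:
        # Covers KeyboardInterrupt as well as the timeout raised above.
        pool.terminate()
        raise
    finally:
        pool.join()


if __name__ == "__main__":
    print(run_pool(make_thumbnail, ["a.pdf", "b.pdf"]))
```

The point of the loop is that the parent never blocks indefinitely in a single call, which is the situation the linked bug report describes.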

Note, you might want to pare down the timeout (or I can do that in this request). I increased it because I was initially testing with high job counts, and those would cause a job to take longer than the timeout. At the moment it's at 20 minutes, which probably means that I never hit a file that caused an infinite loop.

Do you have one of the PDFs that causes the infinite looping lying around?

cwgreene · 2016-02-28T10:11:27Z

Oh, and yes, I had earlier verified that the thumbnails were loading correctly running the server locally. Sorry, I didn't think to compare the two methods directly until later.

cwgreene · 2016-03-26T17:50:23Z

Hi Andrej; is there anything that I can do to modify this commit to make it more acceptable and trustworthy for you?

I feel that this change is valuable to people bootstrapping their initial arxiv viewers.

karpathy · 2016-03-26T22:05:50Z

Hi @cwgreene I appreciate the PR but I just recently went through a scarring experience of merging someone else's PR and it broke arxiv-sanity and required me to revert breaking changes for an hour. I would categorize this PR as a luxury/exotic feature. I understand it may speed up thumbnail generation for people who are initializing their libraries by a decent constant factor, but computing these thumbnails is a single-time compute that I don't think bottlenecks anyone too seriously in practice. I'll opt to keep the simplicity of the code in this case, but I'm happy to keep this PR around for anyone who wants the feature.

Multiprocessing for thumbnails.

c93ce99

To speed up thumbnail generation on multicore machines, this modifies the thumb_pdf.py script to use python's multiprocessing library.
