Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multiprocessing for thumbnails. #19

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

cwgreene
Copy link
Contributor

To speed up thumbnail generation on multicore machines, this modifies
the thumb_pdf.py script to use python's multiprocessing library.

To speed up thumbnail generation on multicore machines, this modifys
the thumb_pdf.py script to use python's multiprocessing library.
@karpathy
Copy link
Owner

I had someone submit a similar commit before and it broke everything because intermediate temp files overwrote each other chaotically. I see in your code that you are creating multiple intermediate directories tmp-%d to address this issue? I am being overly cautious here - I assume you tried this and it works?

@cwgreene
Copy link
Contributor Author

I just finished comparing the generated images using the two methods. All but three files were binary identical, and those three were visually identical. I'm guessing there's some weird non-determinism (or bug) in one of the graphic programs. I'll look into that more later.

Also, during my run, I canceled the runs midway, and was able to successfully resume later. CTRL-C behavior is not quite as nice as I would like (spits up a fairly long stack trace) but it does work. I would have liked to have made it simply

pool.close()
pool.join()

But apparently there's been a long outstanding bug on this behavior that prevents this from working as expected. Python's multiprocessing story is not as clean as I would like.

http://bugs.python.org/issue8296

So we're stuck with polling. :(

Note, you might want to pair down the timeout (or I can do that in this request); I increased it because I was testing high job levels initially, and those would cause the job to take longer than the timeout. At the moment, it's at 20 minutes; which probably means that I never hit a file that caused an infinite loop.

Do you have one pdfs that causes the inifinite looping lying around?

@cwgreene
Copy link
Contributor Author

Oh, and yes, I had earlier verified that the thumbnails were loading correctly running the server locally. Sorry, I didn't think to compare the two methods directly until later.

@cwgreene
Copy link
Contributor Author

Hi Andrej; is there anything that I can do to modify this commit to make it more acceptable and trustworthy for you?

I feel that this change is valuable to people bootstrapping their initial arxivs viewers.

@karpathy
Copy link
Owner

Hi @cwgreene I appreciate the PR but I just recently went through a scarring experience of merging someone else's PR and it broke arxiv-sanity and required me to revert breaking changes for an hour. I would categorize this PR as a luxury/exotic feature. I understand it may speed up thumbnail generation for people who are initializing their libraries by a decent constant factor, but computing these thumbnails is a single-time compute that I don't think bottlenecks anyone too seriously in practice. I'll opt to keep the simplicity of the code in this case, but I'm happy to keep this PR around for anyone who wants the feature.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants