12x faster next_requests()/spider_idle() logic #74
Comments
Hi, thanks for this code, but after changing next_requests to this, my spider still goes idle and gets no requests at all from the redis start URLs list. It keeps getting (0 pages/min) for hours and eventually the spider crashes. Please help me figure out how to solve this, thanks.
@77cc33 Excellent idea! Thanks.
@sandeepsingh you need to be sure that you also add a global sleep function (Deferred and reactor come from Twisted):

```python
from twisted.internet import reactor
from twisted.internet.defer import Deferred

def sleep(secs):
    # Return a Deferred that fires after `secs` seconds, without blocking the reactor.
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d
```
@77cc33 I'm curious how you were able to get to 10588 pages/min. Any pointers or hints on how to configure scrapy / scrapy-redis to maximize pages per minute while also keeping politeness?
Those were tests against localhost, so you're right, it's a little synthetic. In your case, when you have tons of different domains, just split your domain list into chunks and start a new scrapy spider process for each chunk. In scrapy the most CPU-consuming task is usually page parsing, so to get better speed you need to spread that work across CPU cores. With scrapy-redis you just start more identical spiders in parallel and you'll get N * 1800 pages/min, where N is the number of processes started. Usually it's good when N equals the number of CPU cores in your server, or less.
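A minimal sketch of that setup, launching one identical spider process per CPU core (the spider name "myspider" is a placeholder, and one process per core is just the rule of thumb above):

```python
# Minimal launcher sketch: start N identical scrapy-redis spider
# processes, one per CPU core ("myspider" is a placeholder name).
import multiprocessing
import subprocess

if __name__ == "__main__":
    n = multiprocessing.cpu_count()
    procs = [subprocess.Popen(["scrapy", "crawl", "myspider"]) for _ in range(n)]
    for p in procs:
        p.wait()
```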
Got it, thanks for the answer and tips!
`NameError: name 'Deferred' is not defined`. Where is Deferred() defined?
Hello,

Here is a much faster way to fetch URLs from Redis, as it doesn't wait for the spider to go idle after each batch.

Here are some benchmarks. First, let's crawl links directly from a file with this simple spider:
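A minimal sketch of such a file-driven baseline spider (the spider name and the "urls.txt" file name are assumptions):

```python
# Minimal sketch of a file-driven baseline spider
# (the "urls.txt" file name is an assumption).
import scrapy

class FileBenchSpider(scrapy.Spider):
    name = "file_bench"

    def start_requests(self):
        # Read one URL per line from a plain text file.
        with open("urls.txt") as f:
            for line in f:
                url = line.strip()
                if url:
                    yield scrapy.Request(url, callback=self.parse)

    def parse(self, response):
        # No parsing work: the benchmark measures raw fetch throughput.
        pass
```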
I got these results on my machine:
Now let's run a simple Redis spider with the same URLs imported into Redis, with REDIS_START_URLS_BATCH_SIZE = 16:
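A minimal sketch of such a stock scrapy-redis spider (the redis_key value is an assumption):

```python
# Minimal sketch of the stock scrapy-redis benchmark spider
# (the redis_key value is an assumption).
from scrapy_redis.spiders import RedisSpider

class RedisBenchSpider(RedisSpider):
    name = "redis_bench"
    redis_key = "redis_bench:start_urls"

    custom_settings = {
        # Pull start URLs from Redis in batches of 16, as in this benchmark.
        "REDIS_START_URLS_BATCH_SIZE": 16,
    }

    def parse(self, response):
        pass
```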
Here are the bench results:
Now let's test my updated next_requests() function:
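A sketch of the idea, reconstructed from this thread rather than being the author's exact code: a long-running inlineCallbacks loop keeps the engine topped up continuously instead of fetching a batch only on the spider_idle signal. The top-up condition and the poll interval are assumptions.

```python
# Sketch (reconstructed, not the original code) of a next_requests() that
# feeds URLs continuously instead of waiting for spider_idle after each batch.
from twisted.internet import reactor
from twisted.internet.defer import Deferred, inlineCallbacks

def sleep(secs):
    # Non-blocking sleep for use inside inlineCallbacks coroutines.
    d = Deferred()
    reactor.callLater(secs, d.callback, None)
    return d

class FastRedisMixin:
    @inlineCallbacks
    def next_requests(self):
        max_in_flight = self.settings.getint("CONCURRENT_REQUESTS")
        while self.crawler.crawling:
            slot = self.crawler.engine.slot
            # Top up whenever queued + in-flight work drops below the target;
            # the exact condition is an assumption (see the notes below on
            # dropping '+ self.redis_batch_size' and the len() call).
            if len(slot.inprogress) + len(slot.scheduler) < max_in_flight + self.redis_batch_size:
                found = 0
                while found < self.redis_batch_size:
                    data = self.server.lpop(self.redis_key)
                    if not data:
                        break
                    req = self.make_request_from_data(data)
                    if req:
                        # engine.crawl(request, spider) is the Scrapy API of
                        # that era, as used by scrapy-redis itself.
                        self.crawler.engine.crawl(req, spider=self)
                        found += 1
            # Short poll that just yields control back to the reactor.
            yield sleep(0.01)
```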
I also want to add that we could probably skip the '+ self.redis_batch_size' part of the condition and top up based on the concurrency target alone; this way we would not use any resources in the Scrapy scheduler, as all URLs would go into the in-progress queue from the start. But I didn't check the code inside Scrapy enough to be sure about that. We could also make the code shorter here by dropping the len() call on the scheduler entirely, as that operation is actually somewhat time-consuming: it's not a simple variable len() lookup. (See the sketch of these variants below.)
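A hypothetical illustration of the condition variants discussed above (the names and exact forms are assumptions, not the original snippets):

```python
# Hypothetical illustration of the condition variants discussed above
# (names and exact forms are assumptions, not the original snippets).

def needs_more_full(inprogress_len, scheduler_len, max_in_flight, batch_size):
    # Full check, as in the sketch earlier: allow up to one extra batch
    # to sit in the scheduler as a backlog.
    return inprogress_len + scheduler_len < max_in_flight + batch_size

def needs_more_no_backlog(inprogress_len, scheduler_len, max_in_flight):
    # Without '+ batch_size': the scheduler holds no backlog, so every
    # fetched URL effectively goes straight into the in-progress queue.
    return inprogress_len + scheduler_len < max_in_flight

def needs_more_cheap(inprogress_len, max_in_flight):
    # Without the scheduler length at all: with the scrapy-redis scheduler,
    # len() is a Redis round trip (LLEN/ZCARD), not a cheap in-memory
    # count, so skipping it avoids one network call per loop iteration.
    return inprogress_len < max_in_flight
```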
Hope this code, or at least its idea, makes it into master.