Consider replacing handle_http_error with spider middleware #533

Open
jpmckinney opened this issue Oct 27, 2020 · 3 comments
Labels
framework-spiders Relating to common spider functionality
Milestone

Comments

@jpmckinney
Member

jpmckinney commented Oct 27, 2020

We set HTTPERROR_ALLOW_ALL = True. If we had left it as False (the default), HttpErrorMiddleware would raise an HttpError exception for non-2xx responses. HttpError subclasses IgnoreRequest, a special exception class that Scrapy ignores. The middleware also implements process_spider_exception, which handles that exception by logging and counting the HTTP errors.

Assuming we can write a new spider middleware to handle the HttpError exception first, we can have it return FileError items instead. That way, we can remove all the @handle_http_error decorators.
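A minimal sketch of what such a middleware might look like, under several assumptions: the class name is hypothetical, a plain dict with made-up keys stands in for the project's FileError item, and a stand-in HttpError is defined when Scrapy is not installed so the snippet runs on its own. The middleware would also need to be ordered in SPIDER_MIDDLEWARES so that its process_spider_exception runs before HttpErrorMiddleware's; that ordering is not shown here.

```python
from types import SimpleNamespace

try:
    from scrapy.spidermiddlewares.httperror import HttpError
except ImportError:
    # Stand-in so the sketch runs without Scrapy installed. It mirrors the
    # real exception only in that it carries the failed response.
    class HttpError(Exception):
        def __init__(self, response, *args):
            self.response = response
            super().__init__(*args)


class HttpErrorToFileErrorMiddleware:
    """Hypothetical spider middleware: turn an HttpError into an error item."""

    def process_spider_exception(self, response, exception, spider):
        if not isinstance(exception, HttpError):
            # Returning None lets other middlewares handle the exception.
            return None
        # Assumption: a dict with these keys stands in for a FileError item.
        return [{"url": response.url, "errors": {"http_code": response.status}}]


# Usage with a fake response object (a real one would come from Scrapy).
_response = SimpleNamespace(url="https://example.com/data.json", status=404)
_middleware = HttpErrorToFileErrorMiddleware()
demo_items = _middleware.process_spider_exception(
    _response, HttpError(_response), spider=None
)
demo_other = _middleware.process_spider_exception(
    _response, ValueError("unrelated"), spider=None
)
```

With something like this in place, callbacks would only ever see successful responses, which is what would make the decorators redundant.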

Some spiders handle HTTP errors in special ways. For those spiders, we can set the handle_httpstatus_list spider attribute, as documented for HttpErrorMiddleware, so that the listed status codes still reach the callback. These are the spiders that use:

  • is_http_success
  • response.status
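As a rough illustration of the handle_httpstatus_list approach: the spider below lists the status codes it wants to receive and checks response.status in its callback. The spider name, the status codes, and the returned dicts are all hypothetical, and the class is a plain class here (it would subclass scrapy.Spider in a real project) so the snippet runs without Scrapy.

```python
from types import SimpleNamespace


class TokenExampleSpider:  # would subclass scrapy.Spider in a real project
    name = "token_example"  # hypothetical spider name

    # With HTTPERROR_ALLOW_ALL = False, responses with these (assumed) status
    # codes are still passed to the callback instead of raising HttpError.
    handle_httpstatus_list = [401, 429]

    def parse_token(self, response):
        # The callback decides for itself what a non-2xx status means.
        if 200 <= response.status < 300:
            return {"token_ok": True}
        return {"token_ok": False, "http_code": response.status}


# Usage with fake response objects.
_spider = TokenExampleSpider()
demo_ok = _spider.parse_token(SimpleNamespace(status=200))
demo_denied = _spider.parse_token(SimpleNamespace(status=401))
```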
@jpmckinney jpmckinney added the framework-spiders Relating to common spider functionality label Oct 27, 2020
@jpmckinney
Member Author

jpmckinney commented Jan 31, 2021

Since we want to use handle_http_error on some request callbacks but not on others, I think it's simplest to leave it as a decorator. For example, the Paraguay spiders use handle_http_error for data requests, but handle errors manually for access token requests.

@jpmckinney
Member Author

Actually, never mind: we can use a request meta attribute to enable or disable the proposed middleware per request, for cases like the Paraguay spiders.
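A sketch of how the per-request opt-out check might look. The meta key name is an assumption (the issue does not name one), and only the check itself is shown, not the full middleware.

```python
from types import SimpleNamespace

# Hypothetical meta key: a request sets it to True to opt out.
SKIP_KEY = "skip_http_error_middleware"


def middleware_enabled(response):
    """Return True unless the originating request opted out via its meta."""
    # In Scrapy, response.meta is a shortcut to response.request.meta.
    return not response.meta.get(SKIP_KEY, False)


# Usage with fake responses: a data request and an access token request.
_data_response = SimpleNamespace(status=500, meta={})
_token_response = SimpleNamespace(status=401, meta={SKIP_KEY: True})
demo_data_enabled = middleware_enabled(_data_response)
demo_token_enabled = middleware_enabled(_token_response)
```

An access token request would then be created with meta={SKIP_KEY: True}. Scrapy's own handle_httpstatus_all meta key already lets every response for a given request through to its callback, so a custom key may turn out to be unnecessary.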

@jpmckinney jpmckinney reopened this Jan 31, 2021
@yolile yolile added this to the Priority milestone Mar 3, 2021
@jpmckinney
Member Author

Another – maybe more appropriate – option is to use request errbacks with HTTPERROR_ALLOW_ALL = False (the default): https://docs.scrapy.org/en/latest/topics/request-response.html?highlight=exceptions#using-errbacks-to-catch-exceptions-in-request-processing


2 participants