Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revisit error handling for getting batch job logs #546

Open
bossie opened this issue Oct 23, 2023 · 2 comments
Open

revisit error handling for getting batch job logs #546

bossie opened this issue Oct 23, 2023 · 2 comments
Labels

Comments

@bossie
Copy link
Collaborator

bossie commented Oct 23, 2023

This ERROR log has some peculiarities/bugs:

log_collection_failed_error

  1. the (ref: ...) part is how the openeo-python-client typically renders the id (= request ID) in an error response (where e.g. OpenEO editor shows it in a dialog box IIRC): here it is already rendered in the back-end as part of the message:
  2. the request ID is empty, for some reason (= defaults to "no-request");
  3. the underlying error (a connection timeout towards ES) is returned in-band: it becomes a log entry with the exception rendered in the message, an empty (log) id in a 200 OK response.

3 is a consequence of choosing to stream the log entries from ES to the client: a 200 OK is already sent to the client and then log entries are appended to the response body. Maybe this is a case of premature optimization on my part and simply collecting the logs in memory before returning them to the client is better. The underlying error then becomes an actual 504 HTTP response and the client can retry the request, should he want to.

@bossie bossie added the bug label Oct 23, 2023
@soxofaan
Copy link
Member

Found some answers/input while doing some digging for https://github.com/eu-cdse/openeo-cdse-infra/issues/29:

the ref: ... is explicitly added in the exception thrown from the ES query loop:

message = (
"Temporary failure while retrieving logs: ConnectionTimeout. "
+ f"Please try again and report this error if it persists. (ref: {req_id})"
)
raise OpenEOApiException(status_code=504, message=message) from exc

the id=... is indeed part of the default rendering of OpenEOApiException but that is defined in python-driver, not python client:
https://github.com/Open-EO/openeo-python-driver/blob/3f49f2777959a4845afbb891ef697d68e1bcfdeb/openeo_driver/errors.py#L83-L84

the request ID is indeed unset (rendered as "no-request") because of this:
the get_job_logs view handler builds the response in a streaming fashion (response body is yielded in chunks):
https://github.com/Open-EO/openeo-python-driver/blob/3f49f2777959a4845afbb891ef697d68e1bcfdeb/openeo_driver/views.py#L1432-L1453
because of WSGI implementation details the flask request context is closed before the first response chunk is produced.
This practically means that the code that tries to get the request id

req_id = FlaskRequestCorrelationIdLogging.get_request_id()

is executed when the request context is already gone, resulting in an unset request id. Conceptually, the code looks like it is inside the request context, but in practice that is not the case.

soxofaan added a commit to Open-EO/openeo-python-driver that referenced this issue Oct 26, 2023
@soxofaan
Copy link
Member

Did a first attempt in Open-EO/openeo-python-driver@e546e3d to get the actual request id in both the error logs in ES and the "Log collection for job ... failed" message that the end users gets

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants