phase out ZkJobRegistry #632

Closed
bossie opened this issue Jan 4, 2024 · 15 comments · Fixed by Open-EO/openeo-python-driver#244, #635, #638, #639 or #660

Comments

@bossie
Collaborator

bossie commented Jan 4, 2024

Context: ZK on CDSE crashes due to OOM.

@bossie bossie self-assigned this Jan 4, 2024
@soxofaan
Member

soxofaan commented Jan 4, 2024

FYI: related to (and maybe even duplicate of) #498

@bossie
Collaborator Author

bossie commented Jan 4, 2024

async_task is only aware of ZkJobRegistry.

In the case of SHub batch processes that's just fine because those weren't/aren't available on CDSE SHub endpoint https://sh.dataspace.copernicus.eu and won't be considered:

supports_batch_processes = (endpoint.startswith("https://services.sentinel-hub.com") or
                            endpoint.startswith("https://services-uswest2.sentinel-hub.com"))
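
A minimal sketch of the same check, written as a standalone function. The function and constant names are hypothetical; only the two endpoint prefixes come from the snippet above. It relies on `str.startswith` accepting a tuple of prefixes.

```python
# Hypothetical constant: the two SHub deployments quoted above.
BATCH_PROCESS_ENDPOINTS = (
    "https://services.sentinel-hub.com",
    "https://services-uswest2.sentinel-hub.com",
)


def supports_batch_processes(endpoint: str) -> bool:
    """Return True if the endpoint starts with one of the known prefixes."""
    # str.startswith accepts a tuple and matches any of its items.
    return endpoint.startswith(BATCH_PROCESS_ENDPOINTS)
```

With this check the CDSE endpoint `https://sh.dataspace.copernicus.eu` evaluates to `False`, which matches the statement that SHub batch processes are not considered there.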

load_stac of partial job results only works on Terrascope anyway so this changes nothing.

@soxofaan
Member

soxofaan commented Jan 4, 2024

FYI: I've also been tagging some implementation parts with todo notes referencing #498

$ grep '#498' openeogeotrellis/ -r -n
openeogeotrellis/job_registry.py:797:        # TODO #236/#498 For now: compare job metadata between Zk and EJR
openeogeotrellis/job_registry.py:910:        # TODO #236/#498: error if both sources failed?
openeogeotrellis/job_registry.py:923:        # TODO #236/#498 Need to have EJR implementation for this? This is only necessary for ZK cleaner script anyway.
openeogeotrellis/job_tracker_v2.py:395:            # TODO: #236/#498 also/instead get jobs_to_track from EJR?
openeogeotrellis/backend.py:334:            # TODO #236/#498 avoid this fallback and just make sure it is always set when necessary
openeogeotrellis/backend.py:1640:            zk_job_registry_factory=ZkJobRegistry,  # TODO #236/#498 allow to disable this with config?
openeogeotrellis/backend.py:1984:            # TODO #498 eliminate ZK code path, or at least encapsulate this logic better

@bossie
Collaborator Author

bossie commented Jan 5, 2024

Heh. Persisting usage in ES (#488) doesn't work. From the YARN job_tracker logs:

In context "job_tracker job_metadata.status='finished' from YarnStatusGetter": caught EjrHttpError('EJR API error: 400 'Bad Request' on PATCH \'https://jobregistry.openeo.vito.be/jobs/j-240105907a9640a79375dbcdabe66505\': {"statusCode":400,"message":["property costs should not exist"],"error":"Bad Request"}')

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/util/logging.py", line 371, in just_log_exceptions
    yield
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_tracker_v2.py", line 512, in _sync_job_status
    self._elastic_job_registry.set_usage(job_id, job_costs, dict(usage))
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 465, in set_usage
    return self._update(
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 442, in _update
    return self._do_request("PATCH", f"/jobs/{job_id}", json=data)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 288, in _do_request
    raise EjrHttpError.from_response(response=response)
openeo_driver.jobregistry.EjrHttpError: EJR API error: 400 'Bad Request' on `PATCH 'https://jobregistry.openeo.vito.be/jobs/j-240105907a9640a79375dbcdabe66505'`: {"statusCode":400,"message":["property costs should not exist"],"error":"Bad Request"}

Was a bit hard to find because there's no job_id in the log entry.
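
One way to make such entries easier to find is to put the job ID in the log message itself. The following is only a sketch: `log_exceptions_for_job` is a hypothetical helper, not the `just_log_exceptions` utility from `openeo_driver`, and the message format is an assumption.

```python
import contextlib
import logging


@contextlib.contextmanager
def log_exceptions_for_job(logger: logging.Logger, job_id: str):
    """Log (and swallow) any exception, tagging the entry with the job ID."""
    try:
        yield
    except Exception as e:
        # The job ID goes in the message and in `extra`, so it can be
        # found both by text search and by structured log filters.
        logger.exception(
            f"Failed status sync for job_id={job_id}: {e!r}",
            extra={"job_id": job_id},
        )
```

Usage would look like `with log_exceptions_for_job(log, job_id): ...` around the registry update.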

@soxofaan
Member

soxofaan commented Jan 5, 2024

so the "costs" field must be supported in EJR (https://github.com/Open-EO/openeo-job-registry-elastic-api)
this is something @JanssenBrm or his team has to add

@bossie
Collaborator Author

bossie commented Jan 5, 2024

@soxofaan was the property result_metadata (actually: results_metadata) mentioned in Open-EO/openeo-job-registry-elastic-api#2 (comment) ever used?

Because this seems to assume that all properties are top level:

def ejr_job_info_to_metadata(job_info: JobDict) -> BatchJobMetadata:
    """Convert job info dict (from JobRegistryInterface) to BatchJobMetadata"""
    # TODO: eliminate zk_job_info_to_metadata/ejr_job_info_to_metadata duplication?

    def map_safe(prop: str, f):
        value = job_info.get(prop)
        return f(value) if value else None

    return BatchJobMetadata(
        id=job_info["job_id"],
        status=job_info["status"],
        created=map_safe("created", rfc3339.parse_datetime),
        process=job_info.get("process"),
        job_options=job_info.get("job_options"),
        title=job_info.get("title"),
        description=job_info.get("description"),
        updated=map_safe("updated", rfc3339.parse_datetime),
        started=map_safe("started", rfc3339.parse_datetime),
        finished=map_safe("finished", rfc3339.parse_datetime),
        memory_time_megabyte=map_safe("memory_time_megabyte_seconds", lambda seconds: timedelta(seconds=seconds)),
        cpu_time=map_safe("cpu_time_seconds", lambda seconds: timedelta(seconds=seconds)),
        geometry=job_info.get("geometry"),
        bbox=job_info.get("bbox"),
        start_datetime=map_safe("start_datetime", rfc3339.parse_datetime),
        end_datetime=map_safe("end_datetime", rfc3339.parse_datetime),
        instruments=job_info.get("instruments"),
        epsg=job_info.get("epsg"),
        links=job_info.get("links"),
        usage=job_info.get("usage"),
        costs=job_info.get("costs"),
        proj_shape=job_info.get("proj:shape"),
        proj_bbox=job_info.get("proj:bbox"),
    )
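
If properties such as `bbox` or `geometry` end up nested under `results_metadata` instead of at the top level, the lookups above would need to check there first. A minimal sketch of such a lookup follows; `get_results_metadata_prop` is a hypothetical helper, and the fallback to the top level is an assumption, not the driver's actual behavior.

```python
from typing import Any, Optional


def get_results_metadata_prop(job_info: dict, prop: str, default: Optional[Any] = None) -> Any:
    """Look up `prop` under job_info["results_metadata"], else at top level."""
    # `or {}` also covers the case where the key exists but holds None.
    nested = job_info.get("results_metadata") or {}
    if prop in nested:
        return nested[prop]
    return job_info.get(prop, default)
```

For example, `get_results_metadata_prop(job_info, "proj:bbox")` would work for both the flat and the nested layout.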

@soxofaan
Member

soxofaan commented Jan 5, 2024

no indeed, I don't think results_metadata is already being used (writing nor reading)

@JanssenBrm

Is it possible to create an issue on https://github.com/Open-EO/openeo-job-registry-elastic-api with clear instructions on what should be done?

@bossie bossie linked a pull request Jan 9, 2024 that will close this issue
@bossie bossie changed the title from "phase out ZKJobRegistry" to "phase out ZkJobRegistry" Jan 9, 2024
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 9, 2024
bossie added a commit that referenced this issue Jan 9, 2024
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 9, 2024
bossie added a commit that referenced this issue Jan 9, 2024
bossie added a commit that referenced this issue Jan 9, 2024
ERROR    openeogeotrellis.job_tracker_v2:job_tracker_v2.py:443 Failed status sync for job_id=job-3: unexpected AssertionError:
Traceback (most recent call last):
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/job_tracker_v2.py", line 434, in update_statuses
    self._sync_job_status(
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/job_tracker_v2.py", line 562, in _sync_job_status
    double_job_registry.set_status(
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/job_registry.py", line 821, in set_status
    self.elastic_job_registry.set_status(job_id=job_id, status=status, started=started, finished=finished)
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/job_registry.py", line 632, in set_status
    self._update(
  File "/home/bossie/PycharmProjects/openeo/openeo-geopyspark-driver/openeogeotrellis/job_registry.py", line 619, in _update
    assert job_id in self.db
AssertionError
bossie added a commit that referenced this issue Jan 10, 2024
bossie added a commit that referenced this issue Jan 17, 2024
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
openeogeotrellis/backend.py:338: in __init__
    elastic_job_registry = get_elastic_job_registry(requests_session)
openeogeotrellis/backend.py:1636: in get_elastic_job_registry
    job_registry = ElasticJobRegistry(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <openeo_driver.jobregistry.ElasticJobRegistry object at 0x7f3ea225f460>
api_url = None, backend_id = 'unknown'

    def __init__(
        self,
        api_url: str,
        backend_id: Optional[str] = None,
        *,
        session: Optional[requests.Session] = None,
        _debug_show_curl: bool = False,
    ):
        if api_url is None:
>           raise ValueError(api_url)
E           ValueError: None
bossie added a commit that referenced this issue Jan 17, 2024
openeogeotrellis/backend.py:338: in __init__
    elastic_job_registry = get_elastic_job_registry(requests_session)
openeogeotrellis/backend.py:1636: in get_elastic_job_registry
    job_registry = ElasticJobRegistry(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <openeo_driver.jobregistry.ElasticJobRegistry object at 0x7f61ea3d6fd0>
api_url = None, backend_id = 'unknown'

    def __init__(
        self,
        api_url: str,
        backend_id: Optional[str] = None,
        *,
        session: Optional[requests.Session] = None,
        _debug_show_curl: bool = False,
    ):
        if api_url is None:
>           raise ValueError(api_url)
E           ValueError: None
bossie added a commit that referenced this issue Jan 18, 2024
This reverts commit 4b4fe78.

Batch job result metadata is broken, getting job results still seems slow.

"extent": {
  "spatial": {
    "bbox": [
      null
    ]
  },
  "temporal": {
    "interval": [
      [
        null,
        null
      ]
    ]
  }
}
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 18, 2024
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 18, 2024
Open-EO/openeo-geopyspark-driver#632

Co-authored-by: Stefaan Lippens <soxofaan@users.noreply.github.com>
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 18, 2024
bossie added a commit to Open-EO/openeo-python-driver that referenced this issue Jan 18, 2024
bossie added a commit that referenced this issue Jan 18, 2024
bossie added a commit that referenced this issue Jan 19, 2024
bossie added a commit that referenced this issue Jan 19, 2024
EJR update job_id='j-24011972cfb04efcb15e28d0b230a23b' data={'costs': 5.0, 'usage': {'cpu': {'value': 4033, 'unit': 'cpu-seconds'}, 'memory': {'value': 9359879, 'unit': 'mb-seconds'}, 'input_pixel': {'value': 0.0625, 'unit': 'mega-pixel'}}, 'results_metadata': {'geometry': {'type': 'Polygon', 'coordinates': [[[4.825919, 51.259766], [4.825919, 51.307638], [4.859629, 51.307638], [4.859629, 51.259766], [4.825919, 51.259766]]]}, 'bbox': [4.825919, 51.259766, 4.859629, 51.307638], 'area': {'value': 12526090.219329834, 'unit': 'square meter'}, 'start_datetime': '2017-11-01T00:00:00Z', 'end_datetime': '2017-11-01T00:00:00Z', 'links': [{'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 
'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}, {'href': 'urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'rel': 'derived_from', 'title': 'Derived from urn:eop:VITO:PROBAV_S10_TOC_333M_COG_V2:PROBAV_S10_TOC_X18Y02_20171101_333M_V201', 'type': 'application/json'}], 'proj:bbox': [4.8257395, 51.2594548, 4.8614451, 51.3088932], 'proj:shape': [13, 36], 'assets': {'openEO.tif': {'href': '/data/projects/OpenEO/j-24011972cfb04efcb15e28d0b230a23b/openEO.tif', 'type': 'image/tiff; application=geotiff', 'roles': ['data'], 'bands': [], 'nodata': nan, 'datetime': None, 'raster:bands': [{'name': '1', 'statistics': {'minimum': 3.0, 'maximum': 5.0, 'mean': 4.2350427350427, 'stddev': 0.97198503728382, 'valid_percent': 100.0}}]}}, 'epsg': 4326, 'instruments': ['PROBA-V'], 'processing:facility': 'VITO - SPARK', 'processing:software': 'openeo-geotrellis-0.23.0a1', 'unique_process_ids': ['reduce_dimension', 'load_ml_model', 'save_result', 'load_collection', 'predict_random_forest', 'mean'], 'providers': [{'name': 'VITO', 'description': 'This data was processed on an openEO backend maintained by VITO.', 'roles': 
['processor'], 'processing:facility': 'openEO Geotrellis backend', 'processing:software': {'Geotrellis backend': '0.23.0a1'}, 'processing:expression': [{'format': 'openeo', 'expression': {'loadmlmodel1': {'process_id': 'load_ml_model', 'arguments': {'id': 'j-240119d8e2234977beeb86f307291756'}}, 'loadcollection1': {'process_id': 'load_collection', 'arguments': {'bands': ['NDVI'], 'id': 'PROBAV_L3_S10_TOC_333M', 'spatial_extent': {'west': 4.825919, 'east': 4.859629, 'south': 51.259766, 'north': 51.307638}, 'temporal_extent': ['2017-11-01', '2017-11-01']}}, 'reducedimension1': {'process_id': 'reduce_dimension', 'arguments': {'data': {'from_node': 'loadcollection1'}, 'dimension': 't', 'reducer': {'process_graph': {'mean1': {'process_id': 'mean', 'arguments': {'data': {'from_parameter': 'data'}}, 'result': True}}}}}, 'reducedimension2': {'process_id': 'reduce_dimension', 'arguments': {'context': {'from_node': 'loadmlmodel1'}, 'data': {'from_node': 'reducedimension1'}, 'dimension': 'bands', 'reducer': {'process_graph': {'predictrandomforest1': {'process_id': 'predict_random_forest', 'arguments': {'data': {'from_parameter': 'data'}, 'model': {'from_parameter': 'context'}}, 'result': True}}}}}, 'saveresult1': {'process_id': 'save_result', 'arguments': {'data': {'from_node': 'reducedimension2'}, 'format': 'GTiff', 'options': {}}, 'result': True}}}]}], 'usage': {'input_pixel': {'value': 0.0625, 'unit': 'mega-pixel'}}}}

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/requests/models.py", line 511, in prepare_body
    body = complexjson.dumps(json, allow_nan=False)
  File "/usr/lib64/python3.8/json/__init__.py", line 234, in dumps
    return cls(
  File "/usr/lib64/python3.8/json/encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/lib64/python3.8/json/encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
ValueError: Out of range float values are not JSON compliant

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_tracker_v2.py", line 430, in update_statuses
    self._sync_job_status(
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_tracker_v2.py", line 553, in _sync_job_status
    double_job_registry.set_results_metadata(job_id, user_id, costs=job_costs, usage=dict(total_usage),
  File "/opt/venv/lib64/python3.8/site-packages/openeogeotrellis/job_registry.py", line 964, in set_results_metadata
    self.elastic_job_registry.set_results_metadata(job_id=job_id, costs=costs, usage=usage,
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 553, in set_results_metadata
    return self._update(job_id=job_id, data={
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 460, in _update
    return self._do_request("PATCH", f"/jobs/{job_id}", json=data)
  File "/opt/venv/lib64/python3.8/site-packages/openeo_driver/jobregistry.py", line 297, in _do_request
    response = self._session.request(
  File "/opt/venv/lib64/python3.8/site-packages/requests/sessions.py", line 575, in request
    prep = self.prepare_request(req)
  File "/opt/venv/lib64/python3.8/site-packages/requests/sessions.py", line 486, in prepare_request
    p.prepare(
  File "/opt/venv/lib64/python3.8/site-packages/requests/models.py", line 371, in prepare
    self.prepare_body(data, files, json)
  File "/opt/venv/lib64/python3.8/site-packages/requests/models.py", line 513, in prepare_body
    raise InvalidJSONError(ve, request=self)
requests.exceptions.InvalidJSONError: Out of range float values are not JSON compliant
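
The failure above comes from the `'nodata': nan` entry in the asset metadata: `requests` serializes the body with `allow_nan=False`, so any NaN or infinite float raises. A minimal sketch of one possible workaround is to replace non-finite floats with `None` before sending. The helper name `replace_non_finite` is hypothetical, and mapping NaN to `null` is an assumption about what the registry should store.

```python
import json
import math
from typing import Any


def replace_non_finite(value: Any) -> Any:
    """Recursively replace NaN/Infinity floats with None (JSON null)."""
    if isinstance(value, float) and not math.isfinite(value):
        return None
    if isinstance(value, dict):
        return {k: replace_non_finite(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        # Tuples become lists, which is what JSON would produce anyway.
        return [replace_non_finite(v) for v in value]
    return value


data = {"assets": {"openEO.tif": {"nodata": float("nan"), "roles": ["data"]}}}
# Strict serialization now succeeds, with "nodata" written as null.
body = json.dumps(replace_non_finite(data), allow_nan=False)
```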
@bossie
Collaborator Author

bossie commented Jan 22, 2024

Disabled ZkJobRegistry on CDSE dev/staging; seems to work after a force_redeploy.

bossie pushed a commit to Open-EO/openeo-geopyspark-integrationtests that referenced this issue Jan 22, 2024
@bossie
Collaborator Author

bossie commented Jan 22, 2024

seems to work

*sometimes.

Not production ready because of Open-EO/openeo-job-registry-elastic-api#32.

@bossie
Collaborator Author

bossie commented Jan 23, 2024

ES mapping of dependencies and results_metadata should be changed (to flattened?) to avoid a mapping explosion.
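
As an illustration only, a mapping along these lines could look like the fragment below, using Elasticsearch's `flattened` field type, which indexes a whole JSON object as a single field instead of creating one mapping entry per nested key. The surrounding structure is an assumption; the actual EJR index mapping is not shown in this thread.

```python
# Hypothetical mapping fragment; only the two field names come from the comment above.
ejr_mapping_fragment = {
    "properties": {
        "dependencies": {"type": "flattened"},
        "results_metadata": {"type": "flattened"},
    }
}
```

The trade-off is that values inside a `flattened` field are indexed as keywords, so numeric or date range queries on nested properties no longer behave as they would on typed fields.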

This should also fix integration tests test_advanced_cloud_masking_diy and test_load_collection_references_correct_batch_process_id.

@bossie
Collaborator Author

bossie commented Jan 29, 2024

For reference: Open-EO/openeo-job-registry-elastic-api#33

bossie added a commit that referenced this issue Jan 29, 2024
bossie added a commit that referenced this issue Jan 29, 2024
@bossie bossie linked a pull request Jan 29, 2024 that will close this issue
bossie added a commit that referenced this issue Jan 29, 2024
bossie added a commit that referenced this issue Jan 29, 2024
@bossie
Collaborator Author

bossie commented Jan 30, 2024

Rolled back on CDSE dev/staging because of Open-EO/openeo-job-registry-elastic-api#33 (#660).
