Hard time limit (300s) exceeded for task.steve_jobs.process_message #2512

Open
majamassarini opened this issue Aug 30, 2024 · 4 comments · Fixed by #2555 or packit/ogr#862
Assignees
majamassarini

Labels
area/general: Related to whole service, not a specific part/integration.
blocked: We are blocked!
complexity/single-task: Regular task, should be done within days.
kind/internal: Doesn't affect users directly, may be e.g. infrastructure, DB related.

Comments

@majamassarini
Member

majamassarini commented Aug 30, 2024

I realized that starting from the 2nd of June 2024 we have had this exception raised quite often, 5-10 occurrences per day.

It makes no sense for the process_message function.

Unless it is somehow related to the communication with the pushgateway.

If we can't find the reason for the slowness, we should at least increase the hard_time_limit for this task again.
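
For reference, a minimal sketch (not the actual packit-service code) of how the hard and soft time limits can be raised for a single Celery task; the decorator options are standard Celery, but the app setup, broker URL and values below are purely illustrative:

from celery import Celery

# Illustrative app and broker URL, not the real deployment configuration.
app = Celery("packit-service", broker="redis://localhost:6379/0")

# time_limit is the hard limit (the worker kills the task), soft_time_limit
# raises SoftTimeLimitExceeded inside the task so it can clean up first.
@app.task(name="task.steve_jobs.process_message", soft_time_limit=600, time_limit=900)
def process_message(event: dict):
    return event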

See also #2519 for follow-ups.

@majamassarini majamassarini added complexity/single-task Regular task, should be done within days. area/general Related to whole service, not a specific part/integration. kind/internal Doesn't affect users directly, may be e.g. infrastructure, DB related. labels Aug 30, 2024
@majamassarini majamassarini self-assigned this Sep 25, 2024
@majamassarini
Member Author

Since I am not able to comment here (the comment is too long), I am linking a public gist with a summary of my findings.

@mfocko
Member

mfocko commented Oct 1, 2024

Can't comment on gist 👀 thx GitHub :D anyways

Hard time limit (900s)

"2024-09-26T10:51:51.546708534+00:00 stderr F [2024-09-26 10:51:51,546: INFO/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Cleaning up the mess.","2024-09-26T10:51:51.546+0000"
"2024-09-26T10:36:52.820075409+00:00 stderr F [2024-09-26 10:36:52,820: DEBUG/MainProcess] task.run_copr_build_end_handler[d47aa906-d996-4745-9ac1-378a6b0f51f1] Setting Github status check 'success' for check 'rpm-build:fedora-40-aarch64:fedora-rawhide': RPMs were built successfully.","2024-09-26T10:36:52.820+0000"

I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the cleanup and then the timeout exception. OTOH the time difference between these two messages is ~15 minutes 👀

def clean(self):
    """clean up the mess once we're done"""
    logger.info("Cleaning up the mess.")

into

if not getenv("KUBERNETES_SERVICE_HOST"):
    logger.debug("This is not a kubernetes pod, won't clean.")
    return
logger.debug("Removing contents of the PV.")

Hard time limit (300s)

Redis connection problems have already been resolved after switching from Redict to Valkey

Just to clarify, I've adjusted the timeout; the issue itself is not related to either Redict or Valkey, the short-running pods leak memory from the concurrent threads :/

Ideally this should only be a temporary solution, because the workers (whether the cause is Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (idle for a long time) connections, so we don't run out.
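
Not part of the original discussion, just an illustrative sketch of the client-side knobs Celery exposes for its Redis/Valkey broker connections; whether any of them would actually help with the leak described above is an open question, and the values are made up:

# Celery configuration module (illustrative values only).
broker_pool_limit = 10  # cap the number of broker connections kept open per worker
broker_transport_options = {
    "max_connections": 20,     # upper bound for the Redis connection pool
    "socket_timeout": 120,     # give up on a stuck socket instead of hanging forever
    "socket_keepalive": True,  # let the OS detect half-dead connections
}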


By this point, from what I see in your investigation, there are 3 occurrences where I see a GitHub-related action before a gap and then a timeout right away. There is also one more that is related to TF, which could point to a network issue.

All in all, a timeout on the requests sounds good, but we should be able to retry (not sure if we could, by any chance, end up spamming the external services like that somehow) :/

SLO1 related problems

From the last log 👀 how many times do we need to parse the comment? It is repeated a lot IMO.

@majamassarini
Member Author

majamassarini commented Oct 2, 2024

I wouldn't be surprised if it got stuck on the GitHub status check, but it's still weird that we see the cleanup and then the timeout exception. OTOH the time difference between these two messages is ~15 minutes 👀

Good point, I didn't see the time difference between the status check action and the cleaning. I think an exception is thrown there and we are not logging it. I will add some code to log it.
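
A minimal illustration of what such logging could look like (not the actual change; the handler and function names below are placeholders), so that an exception raised between the status check and the cleanup leaves a traceback in the logs:

import logging

logger = logging.getLogger(__name__)

def run_and_clean(handler):
    try:
        handler.run()  # e.g. the step that sets the GitHub status check
    except Exception:
        # Without this, a failure here would only surface much later as the
        # hard time limit being exceeded, with no traceback in the logs.
        logger.exception("Handler %s failed before cleanup.", type(handler).__name__)
        raise
    finally:
        handler.clean()  # the step that logs "Cleaning up the mess."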

Hard time limit (300s)

Redis connection problems have already been resolved after switching from Redict to Valkey

Just to clarify, I've adjusted the timeout; the issue itself is not related to either Redict or Valkey, the short-running pods leak memory from the concurrent threads :/

Ideally this should only be a temporary solution, because the workers (whether the cause is Celery, gevent, or Celery × gevent) have the issue anyway; the only difference is that Valkey currently cleans up “dead” (idle for a long time) connections, so we don't run out.

I just noticed that after the switch to Valkey the Hard time limit (300s)... exceptions are 10 times less frequent on average. So if we cannot find the root cause, I will keep it ;)

By this point, from what I see in your investigation, there are 3 occurrences where I see a GitHub-related action before a gap and then a timeout right away. There is also one more that is related to TF, which could point to a network issue.

All in all, a timeout on the requests sounds good, but we should be able to retry (not sure if we could, by any chance, end up spamming the external services like that somehow) :/

max_retries is already set to 5 for the HTTPAdapter class, so we should be good to go.
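
For reference, a minimal sketch of how max_retries on HTTPAdapter combines with a per-request timeout; the values and the session setup here are illustrative, the real configuration lives in ogr/packit-service:

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# Retry transient failures up to 5 times with exponential backoff.
retries = Retry(total=5, backoff_factor=1, status_forcelist=[500, 502, 503, 504])
session.mount("https://", HTTPAdapter(max_retries=retries))

# max_retries alone does not bound how long a single hanging connection can
# block the task; only an explicit timeout does that.
response = session.get("https://api.github.com", timeout=(5, 30))
print(response.status_code)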

SLO1 related problems

From the last log 👀 how many times do we need to parse the comment? It is repeated a lot IMO.

I agree, there are two different kinds of problems:

  • We really have too much work to do to match the labels, and it takes us a lot of time.
  • We don't give feedback in the meantime, so we are definitely too slow in giving feedback to the user.
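
On the repeated parsing specifically, one possible and purely illustrative way to avoid parsing the same comment over and over would be to cache the parsed result; the function below is a placeholder, not the actual packit-service parsing code:

from functools import lru_cache

@lru_cache(maxsize=256)
def parse_packit_commands(body: str) -> tuple[str, ...]:
    # Placeholder parser: collect the /packit commands from a comment body.
    # Caching on the raw body means repeated lookups for the same comment
    # don't redo the work within a worker process.
    return tuple(line.strip() for line in body.splitlines() if line.strip().startswith("/packit"))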

softwarefactory-project-zuul bot added a commit that referenced this issue Oct 3, 2024
Fixes related to packit service slowness

Should partially fix #2512

Reviewed-by: Matej Focko
Reviewed-by: Laura Barcziová
@majamassarini majamassarini reopened this Oct 3, 2024
@majamassarini
Member Author

We agreed to keep this open for a while and see later what is fixed and what is not, and to open new follow-up issues if needed.

@majamassarini majamassarini added the blocked We are blocked! label Oct 3, 2024