
fix stats #17

Merged
dtrifiro merged 1 commit into opendatahub-io:ibm_main on May 8, 2024

Conversation


@dtrifiro dtrifiro commented May 8, 2024

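Due to changes introduced in vllm-project#2764, `Stats` no longer exposes `num_waiting`, so the TGIS metrics logger (`vllm/tgis_utils/metrics.py`) raises an `AttributeError` and the engine crashes with: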

ERROR 05-08 08:45:19 async_llm_engine.py:43] Engine background task failed
ERROR 05-08 08:45:19 async_llm_engine.py:43] Traceback (most recent call last):
ERROR 05-08 08:45:19 async_llm_engine.py:43]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 05-08 08:45:19 async_llm_engine.py:43]     task.result()
ERROR 05-08 08:45:19 async_llm_engine.py:43]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 05-08 08:45:19 async_llm_engine.py:43]     has_requests_in_progress = await asyncio.wait_for(
ERROR 05-08 08:45:19 async_llm_engine.py:43]                                ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 08:45:19 async_llm_engine.py:43]   File "/usr/lib64/python3.11/asyncio/tasks.py", line 489, in wait_for
ERROR 05-08 08:45:19 async_llm_engine.py:43]     return fut.result()
ERROR 05-08 08:45:19 async_llm_engine.py:43]            ^^^^^^^^^^^^
ERROR 05-08 08:45:19 async_llm_engine.py:43]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 05-08 08:45:19 async_llm_engine.py:43]     request_outputs = await self.engine.step_async()
ERROR 05-08 08:45:19 async_llm_engine.py:43]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 08:45:19 async_llm_engine.py:43]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 231, in step_async
ERROR 05-08 08:45:19 async_llm_engine.py:43]     self.do_log_stats(scheduler_outputs, output)
ERROR 05-08 08:45:19 async_llm_engine.py:43]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 615, in do_log_stats
ERROR 05-08 08:45:19 async_llm_engine.py:43]     self.stat_logger.log(
ERROR 05-08 08:45:19 async_llm_engine.py:43]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/tgis_utils/metrics.py", line 119, in log
ERROR 05-08 08:45:19 async_llm_engine.py:43]     self.tgi_queue_size.set(stats.num_waiting + stats.num_swapped)
ERROR 05-08 08:45:19 async_llm_engine.py:43]                             ^^^^^^^^^^^^^^^^^
ERROR 05-08 08:45:19 async_llm_engine.py:43] AttributeError: 'Stats' object has no attribute 'num_waiting'
Exception in callback functools.partial(<function _raise_exception_on_finish at 0x7f978f163f60>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f978382cb50>>)
handle: <Handle functools.partial(<function _raise_exception_on_finish at 0x7f978f163f60>, error_callback=<bound method AsyncLLMEngine._error_callback of <vllm.engine.async_llm_engine.AsyncLLMEngine object at 0x7f978382cb50>>)>
Traceback (most recent call last):
  File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
    task.result()
  File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
    has_requests_in_progress = await asyncio.wait_for(
                               ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib64/python3.11/asyncio/tasks.py", line 489, in wait_for
    return fut.result()
           ^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
    request_outputs = await self.engine.step_async()
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 231, in step_async
    self.do_log_stats(scheduler_outputs, output)
  File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 615, in do_log_stats
    self.stat_logger.log(
  File "/opt/vllm/lib64/python3.11/site-packages/vllm/tgis_utils/metrics.py", line 119, in log
    self.tgi_queue_size.set(stats.num_waiting + stats.num_swapped)
                            ^^^^^^^^^^^^^^^^^
AttributeError: 'Stats' object has no attribute 'num_waiting'
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
  File "uvloop/cbhandles.pyx", line 63, in uvloop.loop.Handle._run
  File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 45, in _raise_exception_on_finish
    raise AsyncEngineDeadError(
vllm.engine.async_llm_engine.AsyncEngineDeadError: Task finished unexpectedly. This should never happen! Please open an issue on Github. See stack trace above for the actual cause.
ERROR 05-08 08:45:19 grpc_server.py:65] Generate failed
ERROR 05-08 08:45:19 grpc_server.py:65] Traceback (most recent call last):
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/entrypoints/grpc/grpc_server.py", line 82, in func_with_log
ERROR 05-08 08:45:19 grpc_server.py:65]     return await func(*args, **kwargs)
ERROR 05-08 08:45:19 grpc_server.py:65]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/entrypoints/grpc/grpc_server.py", line 155, in Generate
ERROR 05-08 08:45:19 grpc_server.py:65]     async for i, res in result_generator:
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/utils.py", line 239, in consumer
ERROR 05-08 08:45:19 grpc_server.py:65]     raise e
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/utils.py", line 232, in consumer
ERROR 05-08 08:45:19 grpc_server.py:65]     raise item
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/utils.py", line 216, in producer
ERROR 05-08 08:45:19 grpc_server.py:65]     async for item in iterator:
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 666, in generate
ERROR 05-08 08:45:19 grpc_server.py:65]     raise e
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 660, in generate
ERROR 05-08 08:45:19 grpc_server.py:65]     async for request_output in stream:
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 77, in __anext__
ERROR 05-08 08:45:19 grpc_server.py:65]     raise result
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 38, in _raise_exception_on_finish
ERROR 05-08 08:45:19 grpc_server.py:65]     task.result()
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 501, in run_engine_loop
ERROR 05-08 08:45:19 grpc_server.py:65]     has_requests_in_progress = await asyncio.wait_for(
ERROR 05-08 08:45:19 grpc_server.py:65]                                ^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/usr/lib64/python3.11/asyncio/tasks.py", line 489, in wait_for
ERROR 05-08 08:45:19 grpc_server.py:65]     return fut.result()
ERROR 05-08 08:45:19 grpc_server.py:65]            ^^^^^^^^^^^^
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 475, in engine_step
ERROR 05-08 08:45:19 grpc_server.py:65]     request_outputs = await self.engine.step_async()
ERROR 05-08 08:45:19 grpc_server.py:65]                       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/async_llm_engine.py", line 231, in step_async
ERROR 05-08 08:45:19 grpc_server.py:65]     self.do_log_stats(scheduler_outputs, output)
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/engine/llm_engine.py", line 615, in do_log_stats
ERROR 05-08 08:45:19 grpc_server.py:65]     self.stat_logger.log(
ERROR 05-08 08:45:19 grpc_server.py:65]   File "/opt/vllm/lib64/python3.11/site-packages/vllm/tgis_utils/metrics.py", line 119, in log
ERROR 05-08 08:45:19 grpc_server.py:65]     self.tgi_queue_size.set(stats.num_waiting + stats.num_swapped)
ERROR 05-08 08:45:19 grpc_server.py:65]                             ^^^^^^^^^^^^^^^^^
ERROR 05-08 08:45:19 grpc_server.py:65] AttributeError: 'Stats' object has no attribute 'num_waiting'
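
The traceback points at a single attribute lookup in the metrics logger: `stats.num_waiting` no longer exists on `Stats`. The diff of this PR is not shown here, so the following is only a minimal sketch of one way such a lookup could be made tolerant of a rename, not the fix that was merged. The `Stats` class below is a hypothetical stand-in, and the `_sys`-suffixed field names are an assumption about the renamed attributes; `read_stat` and `queue_size` are illustrative helpers, not vLLM APIs.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Hypothetical stand-in for the engine's Stats object.

    The `_sys` suffixes are an assumption about the renamed fields,
    not something confirmed by this page.
    """
    num_waiting_sys: int = 0
    num_swapped_sys: int = 0


def read_stat(stats, *names, default=0):
    """Return the first attribute in `names` that exists on `stats`.

    Falls back to `default` if none of the names is present, so a
    renamed field degrades to a default value instead of raising
    AttributeError.
    """
    for name in names:
        if hasattr(stats, name):
            return getattr(stats, name)
    return default


def queue_size(stats):
    """Waiting plus swapped requests, accepting old or new field names."""
    waiting = read_stat(stats, "num_waiting", "num_waiting_sys")
    swapped = read_stat(stats, "num_swapped", "num_swapped_sys")
    return waiting + swapped
```

With a helper like this, the failing line from the traceback could become `self.tgi_queue_size.set(queue_size(stats))`. Silently defaulting to 0 keeps the engine loop alive but can hide a broken metric, so logging a warning when no name matches may be preferable in practice.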

openshift-ci bot commented May 8, 2024

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@dtrifiro

This comment was marked as outdated.


@z103cb z103cb left a comment

/lgtm
/ok-to-test

@openshift-ci openshift-ci bot added lgtm and removed lgtm labels May 8, 2024

z103cb commented May 8, 2024

/lgtm

@openshift-ci openshift-ci bot added lgtm and removed lgtm labels May 8, 2024

z103cb commented May 8, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label May 8, 2024

z103cb commented May 8, 2024

/ok-to-test

openshift-ci bot commented May 8, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dtrifiro, z103cb

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@dtrifiro dtrifiro marked this pull request as ready for review May 8, 2024 13:20
@openshift-ci openshift-ci bot requested review from rpancham and z103cb May 8, 2024 13:20

z103cb commented May 8, 2024

/lgtm

@openshift-ci openshift-ci bot added the lgtm label May 8, 2024
@dtrifiro dtrifiro enabled auto-merge (rebase) May 8, 2024 13:24
@dtrifiro dtrifiro merged commit 6100f4b into opendatahub-io:ibm_main May 8, 2024
2 of 3 checks passed
njhill pushed a commit to IBM/vllm that referenced this pull request May 8, 2024
Cherry-pick of fix commit 6100f4b from ODH:
opendatahub-io/vllm#17

---------

Signed-off-by: Travis Johnson <tsjohnso@us.ibm.com>
Co-authored-by: Daniele Trifirò <dtrifiro@redhat.com>
@dtrifiro dtrifiro mentioned this pull request May 15, 2024
dtrifiro pushed a commit that referenced this pull request Jul 26, 2024
These Dockerfile changes:
- Update the release stage to work with the recently refactored
`requirements-common.txt` / `requirements-cuda.txt` split
- Fixup the kernel compilation in the `build` stage to correctly pick up
cuda
- Install the kernels from this docker build rather than pulling a
precompiled wheel. We can swap that back once a new wheel is available
with the correct pytorch version + updated interfaces

---------

Signed-off-by: Nick Hill <nickhill@us.ibm.com>
Signed-off-by: Joe Runde <Joseph.Runde@ibm.com>
Co-authored-by: Joe Runde <Joseph.Runde@ibm.com>
3 participants