
Add more Prometheus metrics #2764

Merged
merged 50 commits into vllm-project:main on Apr 28, 2024
Conversation

@ronensc (Contributor) commented Feb 5, 2024:

This PR adds the Prometheus metrics defined in #2650

@ronensc changed the title from "Title: Add more Prometheus metrics" to "Add more Prometheus metrics" on Feb 5, 2024
vllm/engine/metrics.py (outdated review thread, resolved):
vllm:request_max_tokens -> vllm:request_params_max_tokens
vllm:request_n -> vllm:request_params_n
@ronensc (Contributor Author) commented Feb 12, 2024:

@rib-2, I highly value your opinion. Would you please review my pull request?

@ronensc (Contributor Author) commented Mar 18, 2024:

@simon-mo Could you please review this PR?

Comment on lines 67 to 78
self.histogram_request_prompt_tokens = Histogram(
name="vllm:request_prompt_tokens",
documentation="Number of prefill tokens processed.",
labelnames=labelnames,
buckets=build_1_2_5_buckets(max_model_len),
)
self.histogram_request_generation_tokens = Histogram(
name="vllm:request_generation_tokens",
documentation="Number of generation tokens processed.",
labelnames=labelnames,
buckets=build_1_2_5_buckets(max_model_len),
)
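For context, build_1_2_5_buckets produces a coarse, roughly log-spaced bucket ladder capped at max_model_len. A minimal sketch of such a helper, inferred from the usage above rather than copied from the PR, could look like this:

    from typing import List

    def build_1_2_5_buckets(max_value: int) -> List[int]:
        # Produce 1, 2, 5, 10, 20, 50, ... and stop once the series exceeds
        # max_value, yielding coarse upper bounds for histogram buckets.
        mantissa_lst = [1, 2, 5]
        exponent = 0
        buckets: List[int] = []
        while True:
            for m in mantissa_lst:
                value = m * 10**exponent
                if value <= max_value:
                    buckets.append(value)
                else:
                    return buckets
            exponent += 1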
Collaborator:
These two could be constructed from vllm:prompt_tokens_total and vllm:generation_tokens_total using a Binary operation transform in Grafana.

It wouldn't be exactly the same, but it would avoid additional overhead in the server. For example, if you calculate it in Grafana (and your scrape interval is 1 minute), you'd get a histogram of how many tokens get processed/generated per minute rather than per request.

Contributor Author:

Thanks for your feedback!
You are right. But wouldn't it also be beneficial to have histograms depicting the distribution of prompt length and generation length?

Collaborator:

Since these metrics don't actually introduce any overhead (the data from vllm:x_tokens_total is reused), these two are probably fine. It would be interesting to know how large the prompts users are providing actually are.

Contributor Author:

Exactly! I suggest deprecating the two vllm:x_tokens_total metrics, as their data will be included in the Histogram metrics this PR adds.

Collaborator:

I think we should keep these metrics, because a developer may not want to have to aggregate histogram data in order to get the same effect as vllm:x_tokens_total.

Contributor Author:

Prometheus histograms have a nice feature: in addition to the bucket counters, they include two extra counters suffixed with _sum and _count.

_count is incremented by 1 on every observe, and _sum is incremented by the value of the observation.

Therefore, vllm:prompt_tokens_total is equivalent to vllm:request_prompt_tokens_sum, and vllm:generation_tokens_total is equivalent to vllm:request_generation_tokens_sum.

Source:
https://www.robustperception.io/how-does-a-prometheus-histogram-work/
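To make this concrete, here is a minimal sketch of the _sum/_count behaviour with prometheus_client, using a hypothetical demo metric name rather than the vLLM metric itself:

    # Hedged sketch: observe three prompt lengths and inspect the exposition.
    from prometheus_client import Histogram, generate_latest

    demo_prompt_tokens = Histogram(
        name="demo_request_prompt_tokens",
        documentation="Number of prefill tokens per request (demo).",
        buckets=[1, 2, 5, 10, 20, 50, 100],
    )

    for prompt_len in (4, 17, 93):
        demo_prompt_tokens.observe(prompt_len)

    # The output contains demo_request_prompt_tokens_count 3.0 and
    # demo_request_prompt_tokens_sum 114.0, so the _sum series carries the
    # same information a separate *_tokens_total counter would.
    print(generate_latest().decode())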

Collaborator @hmellor commented Mar 28, 2024:

Oh I see, thanks for explaining. In that case you could move the vllm:x_tokens_total metrics into the # Legacy metrics section.

Although I think there might be some objection to changing metrics that people are already using in dashboards.

cc @simon-mo @Yard1 @robertgshaw2-neuralmagic (not sure who to ping for metrics related things, so please tell me if I should stop)

Collaborator:

From my point of view, it's fine to duplicate metrics for backward-compatibility reasons.

Contributor Author:

Sure, I'll relocate these metrics to the legacy section. Perhaps in the future, when we're able to make breaking changes, we can consider removing them.

Comment on lines 63 to 66
self.counter_request_success = Counter(
name="vllm:request_success",
documentation="Count of successfully processed requests.",
labelnames=labelnames)
Collaborator:

This isn't just counting successful responses, it's counting all finish reasons. If you could find an elegant way to implement the counters we lost when switching from aioprometheus to prometheus_client, that would be great!

Contributor Author:

A quick option to add HTTP-related metrics would be to use prometheus-fastapi-instrumentator.

This involves installing the package:
pip install prometheus-fastapi-instrumentator

Then adding the following two lines after the app creation:

app = fastapi.FastAPI(lifespan=lifespan)

from prometheus_fastapi_instrumentator import Instrumentator
Instrumentator().instrument(app).expose(app)

This will add the following metrics:

| Metric Name | Type | Description |
| --- | --- | --- |
| http_requests_total | counter | Total number of requests by method, status, and handler. |
| http_request_size_bytes | summary | Content length of incoming requests by handler. Only the value of the header is respected; otherwise ignored. |
| http_response_size_bytes | summary | Content length of outgoing responses by handler. Only the value of the header is respected; otherwise ignored. |
| http_request_duration_highr_seconds | histogram | Latency with many buckets but no API-specific labels. Made for more accurate percentile calculations. |
| http_request_duration_seconds | histogram | Latency with only a few buckets by handler. Made to be used only if aggregation by handler is important. |

Should I add it to the PR?

Collaborator:

I like this solution; it saves us from reinventing the wheel in vLLM.

Contributor Author:

done :)

Collaborator @hmellor commented Mar 28, 2024:

Nice. If we're going to be using prometheus-fastapi-instrumentator, then this implementation should be removed.

Contributor Author:

Sorry, I'm not sure I'm following. Which implementation are you referring to that should be removed?
IIUC, prometheus-fastapi-instrumentator simply adds a middleware to FastAPI to collect the metrics specified in the table above. It uses prometheus_client under the hood, and adding other vLLM-related metrics should still be done with prometheus_client.

Collaborator:

The code highlighted in the original comment

        self.counter_request_success = Counter(
            name="vllm:request_success",
            documentation="Count of successfully processed requests.",
            labelnames=labelnames)

can be removed if we are getting these metrics from prometheus-fastapi-instrumentator instead.

Contributor Author:

Thank you for clarifying. I believe that vllm:request_success remains valuable. It includes a finished_reason label, which allows for counting requests based on their finished reason — either stop if the sequence ends with an EOS token, or length if the sequence length reaches either scheduler_config.max_model_len or sampling_params.max_tokens. I'm open to adjusting its name and description to make it more indicative.
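As an illustration, a minimal sketch of such a labelled counter; the exact label set here is an assumption rather than the PR's code:

    # Hedged sketch: a counter like vllm:request_success whose finished_reason
    # label distinguishes "stop" (EOS token) from "length" (max_model_len or
    # sampling_params.max_tokens reached).
    from prometheus_client import Counter

    counter_request_success = Counter(
        name="vllm:request_success",
        documentation="Count of successfully processed requests.",
        labelnames=["model_name", "finished_reason"],
    )

    # Incremented once per finished request, e.g.:
    counter_request_success.labels(
        model_name="facebook/opt-125m", finished_reason="stop").inc()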

Collaborator:

What do you think of the idea of renaming this to something like vllm:request_info and including n and best_of as labels too? This way we log a single metric from which the user can construct many different visualisations on Grafana by utilising the label filters?

Contributor Author:

While the idea of combining these metrics may seem appealing at first glance, I believe they should be kept separate for the following reasons:

  1. The metrics have different types: vllm:request_success is a Counter, while vllm:request_params_best_of and vllm:request_params_n are Histograms.

  2. Aggregating different labels lacks semantic meaning.

  3. Although merging n and best_of into the same histograms might make sense in this case, as they would share the same buckets, we may encounter scenarios where we need to introduce another metric with different bucket requirements.

  4. This situation differs from the Info metric type, where data is encoded in the label values.
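To illustrate the distinction in point 4, a small sketch contrasting an Info metric with a per-request Histogram; the names follow the discussion, but the values shown are illustrative assumptions:

    from prometheus_client import Histogram, Info

    # Info: constant per-process data encoded in label values, like cache_config.
    cache_config_info = Info("vllm:cache_config", "Cache configuration (demo values).")
    cache_config_info.info({"block_size": "16", "gpu_memory_utilization": "0.9"})

    # Histogram: per-request observations with their own bucket layout.
    histogram_request_n = Histogram(
        name="vllm:request_params_n",
        documentation="Histogram of the n request parameter.",
        buckets=[1, 2, 5, 10, 20],
    )
    histogram_request_n.observe(3)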

Comment on lines 100 to 111
self.histogram_max_tokens = Histogram(
name="vllm:request_params_max_tokens",
documentation="Histogram of the max_tokens request parameter.",
labelnames=labelnames,
buckets=build_1_2_5_buckets(max_model_len),
)
self.histogram_request_n = Histogram(
name="vllm:request_params_n",
documentation="Histogram of the n request parameter.",
labelnames=labelnames,
buckets=[1, 2, 5, 10, 20],
)
Collaborator:

Maybe these should go in an Info, like the cache_config?

Contributor Author:

IIUC, Info is intended to collect constant metrics such as configuration values. vllm:request_params_max_tokens and vllm:request_params_n are intended to collect the values of the max_tokens and n arguments, which may vary from request to request.
For instance, for the following request, max_tokens=7 and n=3 would be collected.

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "facebook/opt-125m",
        "prompt": "San Francisco is a",
        "max_tokens": 7,
        "best_of": 5,
        "n": 3,
        "use_beam_search": "true",
        "temperature": 0
    }'

Collaborator:

Good point.

That said, I'm not convinced it's a useful datapoint to collect (I am open to being convinced, though!). What do we, as the vLLM provider, learn from the token limit the user sets?

Also, in the vast majority of use cases a user will statically set their sampling parameters in their application and then never touch them again.

Contributor Author:

I see your point. Having a histogram of the token limit set by the user might indeed be redundant, particularly considering we have a histogram of the actual number of generated tokens per request (vllm:request_generation_tokens_sum). I'll remove it.

Regarding n and best_of, I realized I overlooked adding a metric for collecting best_of. Just to clarify, best_of determines the width of the beam search, while n specifies how many "top beams" to return (n <= best_of). AFAIU, larger values of best_of can significantly impact the engine, as the batch will be dominated by sequences from a few requests.

It might be insightful to compare the histograms of both n and best_of. A significant deviation between them could suggest that the engine is processing a substantial number of tokens that users aren't actually consuming.

Please let me know your thoughts on this.
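To make the comparison concrete, a hedged sketch of the best_of counterpart; the label set is an assumption, and the buckets mirror vllm:request_params_n so the two distributions are directly comparable:

    from prometheus_client import Histogram

    histogram_request_params_best_of = Histogram(
        name="vllm:request_params_best_of",
        documentation="Histogram of the best_of request parameter.",
        labelnames=["model_name"],
        buckets=[1, 2, 5, 10, 20],
    )

    # Observed once per finished request. If this histogram sits well above
    # the n histogram, many generated beams are never returned to users.
    histogram_request_params_best_of.labels(
        model_name="facebook/opt-125m").observe(5)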

Collaborator @hmellor commented Mar 28, 2024:

Knowing the number of tokens "wasted" in beam search could be interesting for a vLLM developer: if beam search is being used extensively and many tokens are being "wasted", it signals that we should optimise beam search if we can.

@simon-mo @Yard1 @robertgshaw2-neuralmagic what do you think about this? (not sure who to ping for metrics related things, so please tell me if I should stop)

Comment on lines 55 to 61
# Add prometheus asgi middleware to route /metrics requests
metrics_app = make_asgi_app()
app.mount("/metrics", metrics_app)


Collaborator:

This is how the metrics defined in vllm/engine/metrics.py are exposed. It can't be removed.

Contributor Author:

I replaced it with the expose() method of prometheus-fastapi-instrumentator, which also exposes a /metrics endpoint:
https://github.com/vllm-project/vllm/pull/2764/files#diff-38318677b76349044192bf70161371c88fb2818b85279d8fc7f2c041d83a9544R48-R49

I noticed it also solves the /metrics/ redirection issue.
Which of the two exposure methods should we use?
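For reference, a minimal sketch of the expose() wiring described above; this is the alternative being discussed, not the approach the PR ultimately kept:

    import fastapi
    from prometheus_fastapi_instrumentator import Instrumentator

    app = fastapi.FastAPI()

    # instrument() adds the HTTP middleware metrics; expose() serves /metrics
    # from the default prometheus_client registry, so metrics registered via
    # prometheus_client (including vLLM's) appear there as well.
    Instrumentator().instrument(app).expose(app)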

Collaborator:

"I replaced it with the expose() method of prometheus-fastapi-instrumentator, which also exposes a /metrics endpoint."

While this does expose a /metrics endpoint, none of the vLLM metrics will be in it because they come from make_asgi_app(), right?

Have you confirmed that /metrics still contains vLLM metrics with this code removed?

Contributor Author @ronensc commented Mar 29, 2024:

Yes, I've verified that both approaches expose all metrics. The only discrepancy I've noticed is that expose() from prometheus-fastapi-instrumentator exposes metrics on /metrics, whereas make_asgi_app() exposes them on /metrics/. However, I'll revert to using the make_asgi_app() approach. I find the other method somewhat hacky, as it involves the prometheus-fastapi-instrumentator middleware handling the metrics endpoint, which could look odd if multiple middlewares are in use.

@hmellor mentioned this pull request on Mar 28, 2024
ronensc and others added 3 commits April 20, 2024 13:30
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
@ronensc (Contributor Author) commented Apr 20, 2024:

  1. I've incorporated most of the changes from Fixes + More Metrics ronensc/vllm#1 into this PR.
  2. I'm not sure whether the assumption in maybe_get_last_latency() and maybe_set_first_token_time() is correct.
    Both methods are called after self.output_processor.process_outputs() (and specifically after seq_group.update_num_computed_tokens()). By that point, when the first generation token is ready, it has already been added to the sequence, so the state of seq_group has changed from PREFILL to DECODE and get_num_uncomputed_tokens() == 1.
    For maybe_set_first_token_time(), I suggest using the condition self.get_seqs()[0].get_output_len() == 1 to determine when the first token has been generated (see the sketch after this comment).
    As for maybe_get_last_latency(), I suggest using the condition self.is_prefill() to check whether chunked prefill is still ongoing.
  3. I modified num_generation_tokens_iter += 1 to num_generation_tokens_iter += seq_group.num_unfinished_seqs() to accommodate requests with more than one sequence (such as beam search and parallel sampling).
  4. I'll postpone adding the additional metrics until we get the current set of metrics right.

Note: I applied the changes from #4150 locally to aid in debugging.
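A hedged sketch of the suggestion in item 2; the attribute and helper names here are assumptions based on this discussion, not the actual vLLM implementation:

    class SequenceGroup:
        ...

        def maybe_set_first_token_time(self, time: float) -> None:
            # By the time this runs, the first generation token has already
            # been appended, so an output length of 1 marks the first-token
            # event regardless of the PREFILL/DECODE state transition.
            if (self.metrics.first_token_time is None
                    and self.get_seqs()[0].get_output_len() == 1):
                self.metrics.first_token_time = time

        def maybe_get_last_latency(self, now: float):
            # Only report a decode-iteration latency once prefill (including
            # chunked prefill) has finished.
            if not self.is_prefill():
                latency = now - self.metrics.last_token_time
                self.metrics.last_token_time = now
                return latency
            return None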

@robertgshaw2-neuralmagic (Sponsor Collaborator):
@ronensc is this ready for review?

@ronensc (Contributor Author) commented Apr 22, 2024:

In the current state of the PR, some of the metrics are still inaccurate with chunked_prefill.
Could we merge this PR up to commit 5ded719, i.e. before my attempts to solve the chunked_prefill issue, and tackle that problem in a follow-up PR? What do you think?
Also, just a heads-up: I'll be less available in the coming days.

Collaborator @hmellor left a review comment:

From my perspective this LGTM.

Once @robertgshaw2-neuralmagic also approves, I'd say this is good to merge.

@rkooo567 self-assigned this Apr 22, 2024
@njhill (Member) commented Apr 25, 2024:

Huge thanks for all the work on this and reviews @ronensc @robertgshaw2-neuralmagic @hmellor

@robertgshaw2-neuralmagic (Sponsor Collaborator):
I'm just thinking through this, though:

"I'm not sure whether the assumption in maybe_get_last_latency() and maybe_set_first_token_time() is correct.
Both methods are called after self.output_processor.process_outputs() (and specifically after seq_group.update_num_computed_tokens()). By that point, when the first generation token is ready, it has already been added to the sequence, so the state of seq_group has changed from PREFILL to DECODE and get_num_uncomputed_tokens() == 1.
For maybe_set_first_token_time(), I suggest using the condition self.get_seqs()[0].get_output_len() == 1 to determine when the first token has been generated.
As for maybe_get_last_latency(), I suggest using the condition self.is_prefill() to check whether chunked prefill is still ongoing."

Will merge this weekend.

@robertgshaw2-neuralmagic (Sponsor Collaborator) commented Apr 28, 2024:

@simon-mo @njhill I had to make a couple of changes for correctness due to some subtlety with chunked_prefill.

Mind giving a brief stamp?

@simon-mo disabled auto-merge April 28, 2024 22:59
@simon-mo merged commit bf480c5 into vllm-project:main Apr 28, 2024
46 of 48 checks passed
robertgshaw2-neuralmagic added a commit to neuralmagic/nm-vllm that referenced this pull request May 6, 2024
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
z103cb pushed a commit to z103cb/opendatahub_vllm that referenced this pull request May 7, 2024
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>
Temirulan pushed a commit to Temirulan/vllm-whisper that referenced this pull request Sep 6, 2024
Co-authored-by: Robert Shaw <114415538+robertgshaw2-neuralmagic@users.noreply.github.com>
Co-authored-by: Robert Shaw <rshaw@neuralmagic.com>