
Refactor Prometheus and Add Request Level Metrics #2316

Merged: 52 commits merged into vllm-project:main on Jan 31, 2024

Conversation

@robertgshaw2-neuralmagic (Sponsor Collaborator) commented Jan 1, 2024

Summary

This PR does three things:

A) Addresses the open feature request (#1870) by refactoring and extending the initial metrics implementation (#1890) to:

  • Handle more complex metrics, such as request-level latencies (as requested)
  • Build interfaces to make it easier to add new metrics in the future

B) Creates an end-to-end example showing how to monitor vLLM with Prometheus and Grafana

C) Updates the existing metric implementations to follow Prometheus best practices, namely:

  • Metric Naming (Prom Docs) -> vllm:num_requests_running should be vllm_num_requests_running_total
  • Base Units (Prom Docs) -> times should be s rather than ms
  • Calculating averages with Counters + PromQL rate() rather than pre-computed Gauges (Prom Docs) -> vllm:avg_generation_throughput_toks_per_sec should be a Counter called vllm_generation_tokens_total, with PromQL rate(vllm_generation_tokens_total[5s]) used to calculate tokens/second during dashboarding.

A) Implementation

Created / updated the following classes:

  • SequenceGroup: added a last_token_time variable and get_last_latency / get_e2e_latency methods, which enable us to capture request-level latencies when logging is enabled.

  • LLMEngine: added a PrometheusLogger and logic to create Stats, making a cleaner interface between the LLMEngine and logging-related functionality.

    • At the last step of _process_model_outputs, we call LLMEngine._get_stats to generate Stats that are passed to PrometheusLogger.log.
  • PrometheusLogger: holds a list of PrometheusMetrics and passes the Stats generated by the LLMEngine to each.

    • PrometheusMetric: holds an aioprometheus collector (Counter, Gauge, or Histogram) and a function that extracts the appropriate data from Stats.

Within this framework, I created a registry of PrometheusMetrics (a rough sketch of the design follows this list):

  • Re-implemented the existing metrics (modulo updating to conform to Prometheus best practices)
  • Implemented new request-level latency timing metrics (TTFT, Inter-Token, and E2E Latency)
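
For concreteness, here is a rough sketch of the framework described above. It is illustrative only: the fields on Stats, the bucket boundaries, and the exact metric names are assumptions, and the real implementation differs in detail.

    # Illustrative sketch only -- not the exact code in this PR.
    from dataclasses import dataclass, field
    from typing import Callable, Dict, List, Union

    from aioprometheus import Counter, Gauge, Histogram

    @dataclass
    class Stats:
        """Snapshot of engine state built by LLMEngine._get_stats each step (fields assumed)."""
        num_running: int = 0
        num_waiting: int = 0
        gpu_cache_usage: float = 0.0
        num_generation_tokens: int = 0
        time_to_first_token: List[float] = field(default_factory=list)

    class PrometheusMetric:
        """Pairs an aioprometheus collector with a function that extracts its value(s) from Stats."""
        def __init__(self, metric: Union[Counter, Gauge, Histogram],
                     fn: Callable[[Stats], Union[int, float, List[float]]]) -> None:
            self.metric = metric
            self.fn = fn

        def log(self, stats: Stats, labels: Dict[str, str]) -> None:
            value = self.fn(stats)
            if isinstance(self.metric, Gauge):
                self.metric.set(labels, value)
            elif isinstance(self.metric, Counter):
                self.metric.add(labels, value)
            else:  # Histogram: observe every sample collected during this step
                for v in value:
                    self.metric.observe(labels, v)

    class PrometheusLogger:
        """Holds the registry of PrometheusMetrics and fans each Stats snapshot out to them."""
        def __init__(self, labels: Dict[str, str]) -> None:
            self.labels = labels
            self.metrics = [
                PrometheusMetric(Gauge("vllm_num_requests_running", "Requests currently running."),
                                 lambda s: s.num_running),
                PrometheusMetric(Counter("vllm_generation_tokens_total", "Generation tokens processed."),
                                 lambda s: s.num_generation_tokens),
                PrometheusMetric(Histogram("vllm_time_to_first_token_seconds", "TTFT in seconds.",
                                           buckets=[0.01, 0.05, 0.1, 0.5, 1.0, 2.5, 5.0, 10.0]),
                                 lambda s: s.time_to_first_token),
            ]

        def log(self, stats: Stats) -> None:
            for m in self.metrics:
                m.log(stats, self.labels)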

Currently supported metrics include:

  • counter_prompt_tokens --> used with rate() to calculate prompt token throughput
  • counter_generation_tokens --> used with rate() to calculate generation token throughput
  • gauge_scheduler_running
  • gauge_scheduler_swapped
  • gauge_scheduler_waiting
  • gauge_gpu_cache_usage
  • gauge_cpu_cache_usage
  • histogram_time_to_first_token --> exposes counters needed to calculate avg ttft, P50, P90, P95, P99
  • histogram_inter_token_latency --> exposes counters needed to calculate avg itl, P50, P90, P95, P99
  • histogram_e2e_request_latency --> exposes counters needed to calculate e2e request latency, P50, P90, P95, P99

See the Example section below for a dashboard that shows how these exposed metrics can be monitored.
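
For example, Grafana panels can be built from PromQL along these lines (the metric names here are placeholders; substitute whichever names the final version actually exposes):

  • Generation throughput (tokens/s): rate(vllm_generation_tokens_total[30s])
  • P99 time to first token: histogram_quantile(0.99, sum by (le) (rate(<ttft_histogram>_bucket[5m])))
  • Mean inter-token latency: rate(<itl_histogram>_sum[5m]) / rate(<itl_histogram>_count[5m])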


B) Example

See examples/production_monitoring for an end-to-end example. I included a Grafana dashboard configuration which shows how these metrics should be monitored.


C) Best Practices

I recognize these changes have breaking impacts on the metrics exposed to users.

Key changes include:

  • Updated the names of EACH metric (e.g. vllm:num_requests_swapped --> vllm_num_requests_swapped_total)
  • Updated average token throughput gauges (vllm:avg_prompt_throughput_toks_per_s / vllm:avg_generation_throughput_toks_per_s) to be total tokens processed counters (vllm_prompt_tokens_total / vllm_generation_tokens_total)
    • On the dashboard, we can calculate average tokens/sec with rate(vllm_prompt_tokens_total[30s])

My sense is that this is a very new feature, so I'm not sure how much user impact there is. However, I think the changes I am suggesting are justified. I am happy to revert these if requested.
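
To make the gauge-to-counter change concrete, here is an illustrative sketch (the metric name and call site are assumptions rather than the exact code in this PR):

    # Illustrative only: count tokens monotonically instead of exporting a pre-computed average.
    from aioprometheus import Counter

    counter_generation_tokens = Counter(
        "vllm_generation_tokens_total",
        "Total number of generation tokens processed.")

    def record_engine_step(num_new_tokens: int, labels: dict) -> None:
        # Called once per engine step; the average throughput is derived at query
        # time on the dashboard, e.g. rate(vllm_generation_tokens_total[30s]).
        counter_generation_tokens.add(labels, num_new_tokens)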


Overhead

I used the benchmarking scripts to test performance with and without the logger on an L4 GPU. The added latency is very minor.

  • benchmark_serving.py client:
    python3 benchmark_serving.py --backend vllm --tokenizer mistralai/Mistral-7B-v0.1 --dataset /home/robertgshaw/vllm/benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json --request-rate 1.0 --num-prompts 200

  • Launch with system logging on:
    python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-requests

    Total time: 265.46 s
    Throughput: 0.75 requests/s
    Average latency: 18.12 s
    Average latency per token: 0.04 s
    Average latency per output token: 0.08 s

  • Launch with system logging off:
    python3 -m vllm.entrypoints.api_server --model mistralai/Mistral-7B-v0.1 --max-model-len 4096 --swap-space 16 --disable-log-stats --disable-log-requests

    Total time: 265.26 s
    Throughput: 0.75 requests/s
    Average latency: 18.06 s
    Average latency per token: 0.04 s
    Average latency per output token: 0.08 s

Next Steps

Next steps to finalize the PR are:

  • Running benchmarks on a more powerful system than an L4

Questions

Are there any other things I need to do?

@robertgshaw2-neuralmagic marked this pull request as ready for review on January 5, 2024 03:18

@robertgshaw2-neuralmagic (Sponsor Collaborator, Author) commented:

@simon-mo this addresses #1870. Any feedback would be appreciated.

Additionally, this is my first time contributing to vLLM. I tried to get up to speed on the whole repo/submission process/QA process, but let me know if I am missing anything. Thanks.

@simon-mo (Collaborator) commented Jan 5, 2024

Hi @rib-2, thank you for your contribution. This PR is definitely in the right direction; a few things to start:

  • We recommend the OpenAI-compatible server as the only server; the internal api_server is only to be used for demo and benchmark purposes. Can you limit the change to the OpenAI-compatible server?
  • For the change in abstracting metrics capturing, can you make sure to keep the existing logging message intact? It is heavily used as one of the main sources of diagnostic information.
  • Regarding metrics definition, I find the fn in Counter, Gauge, and Histogram hard to interpret from a code readability and maintainability point of view. Is there a simpler way to achieve something like this?

@robertgshaw2-neuralmagic (Sponsor Collaborator, Author) commented Jan 5, 2024

@simon-mo Sounds good. Thanks for the feedback.

  • Will remove it from api_server. I just added that so I could test the overhead with the existing benchmarking scripts.
  • I will try to refactor the fn piece. I originally had each metric as a new class that inherits from CounterMetric (etc.) and it got a bit verbose, but I will think through it again (see the illustrative sketch below).
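
For illustration, the two shapes under discussion look roughly like this (hypothetical code, not the implementation in this PR; it reuses the PrometheusMetric and Stats names from the sketch in the PR description, and GaugeMetric is a hypothetical base class):

    # Option A: generic wrapper + extraction function (terse, but the lambda hides intent).
    metric = PrometheusMetric(
        Gauge("vllm_num_requests_running", "Requests currently running."),
        fn=lambda stats: stats.num_running)

    # Option B: one subclass per metric (clearer per metric, but verbose at scale).
    class NumRequestsRunning(GaugeMetric):  # GaugeMetric is a hypothetical base class
        name = "vllm_num_requests_running"
        documentation = "Requests currently running."

        def extract(self, stats: Stats) -> int:
            return stats.num_running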

For the existing logging message --> are you referring to the logger output on the command line?

@simon-mo (Collaborator) commented Jan 5, 2024

For the existing logging message --> are you referring to the logger to the command line?

Yes!

Co-authored-by: Simon Mo <simon.mo@hey.com>

@robertgshaw2-neuralmagic (Sponsor Collaborator, Author) commented:

In the Prometheus docs it appears that they do support FastAPI in a very similar way to aioprometheus, so it looks like it should just work.

I read the code, and the Prometheus Python Client should also work. We added the metrics to vLLM originally, and we just used aioprometheus because we were using it in another async code-base. Should be ok to switch, but I would do it in another PR.

I like the idea of doing it in another PR
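
For reference, a minimal sketch of what the switch could look like with the official Prometheus Python client and FastAPI (illustrative only; this PR keeps aioprometheus, and the endpoint and metric names are placeholders):

    # Sketch: exposing prometheus_client metrics from a FastAPI app.
    from fastapi import FastAPI
    from prometheus_client import Counter, make_asgi_app

    app = FastAPI()

    # Mount the default registry at /metrics as an ASGI sub-application.
    app.mount("/metrics", make_asgi_app())

    generation_tokens = Counter(
        "vllm_generation_tokens_total",
        "Total number of generation tokens processed.")

    @app.post("/generate")
    async def generate() -> dict:
        # ... run the engine, then record how many tokens were produced ...
        generation_tokens.inc(42)  # placeholder value
        return {"ok": True}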

@robertgshaw2-neuralmagic (Sponsor Collaborator, Author) commented:

The only outstanding item, I think, is the time_per_output_token calculation. I left some thoughts, @simon-mo; I believe the way we are doing it now works well and calculates correctly. Let me know if my comments make sense.

  • Fixed aesthetic issues
  • Fixed the metric names so that they don't break existing usage (@NikolaBorisov's request)

@simon-mo requesting re-review

@NikolaBorisov (Contributor) left a comment:

@simon-mo I think this is good, and should be merged

@simon-mo (Collaborator) left a comment:

Thank you for the great work here. And thanks @NikolaBorisov for the review.

@simon-mo merged commit 93b38be into vllm-project:main on Jan 31, 2024 (17 checks passed).