Add support for prometheus metrics #1662

Closed

Conversation

ichernev
Contributor

This is very useful for monitoring, especially if you run more than one instance.

@ichernev
Contributor Author

@WoosukKwon @zhuohan123 can I get a review, please?

Also, should I add the aioprometheus dependency (and drop the presence checks), or leave it as is?

@simon-mo
Collaborator

@ichernev thank you for your PR!

This looks reasonable to me. Adding aioprometheus should be fine; it has few dependencies. I'm waiting on @Yard1 to confirm that the list of metrics to track is reasonable and whether there's anything else to add.

@ichernev
Contributor Author

ichernev commented Nov 15, 2023

One other thing: currently the code adds metrics only for the OpenAI entrypoint. If I also add support in the regular entrypoint, we should probably just put the aioprometheus package in requirements.txt and drop the try/except from the imports and uses. (obsolete, see next comment)

I'm open to suggestions :)
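
For context, the "presence check" referred to above is a guarded import of aioprometheus, along the lines of the sketch below (the same pattern the diff further down removes once the dependency becomes mandatory). The helper function name is hypothetical and used only for illustration.

# Minimal sketch of the optional-import "presence check" under discussion:
# the gauge is only defined and updated when aioprometheus is installed.
try:
    from aioprometheus import Gauge
    _prometheus_available = True
except ImportError:
    _prometheus_available = False

if _prometheus_available:
    gauge_scheduler_running = Gauge("vllm:scheduler_running",
                                    "Num requests running")


def maybe_record_running(num_running: int) -> None:
    # Hypothetical helper for illustration: record the value only when
    # aioprometheus is available, otherwise silently skip.
    if _prometheus_available:
        gauge_scheduler_running.set({}, num_running)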

Add /metrics for openai endpoint with the metrics that were already
logged.
@ichernev
Contributor Author

In light of this comment: #1663 (comment), I figured it doesn't make sense to include a Prometheus /metrics endpoint for the regular (non-OpenAI) server. I also went ahead and added the aioprometheus package to the OpenAI Docker recipe, just like the other OpenAI-specific deps.

@simon-mo simon-mo mentioned this pull request Nov 30, 2023
@simon-mo
Collaborator

@ichernev Can you enable this PR to be edited by maintainers? https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests

I made some improvements to this PR and I think it's in a good state to merge:

  • separate the metrics definitions into a standalone file
  • add documentation
  • add default labels.

Here are the detailed changes; you can also directly apply the diff.

diff --git a/docs/source/index.rst b/docs/source/index.rst
index eb98aa6..c3a1d8f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -66,6 +66,7 @@ Documentation
    serving/run_on_sky
    serving/deploying_with_triton
    serving/deploying_with_docker
+   serving/metrics
 
 .. toctree::
    :maxdepth: 1
diff --git a/docs/source/serving/metrics.rst b/docs/source/serving/metrics.rst
new file mode 100644
index 0000000..15e57bd
--- /dev/null
+++ b/docs/source/serving/metrics.rst
@@ -0,0 +1,13 @@
+Production Metrics
+==================
+
+vLLM exposes a number of metrics that can be used to monitor the health of the
+system. These metrics are exposed via the `/metrics` endpoint on the vLLM
+OpenAI compatible API server.
+
+The following metrics are exposed:
+
+.. literalinclude:: ../../../vllm/engine/metrics.py
+    :language: python
+    :start-after: begin-metrics-definitions
+    :end-before: end-metrics-definitions
diff --git a/requirements.txt b/requirements.txt
index fa9eb63..a593324 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -11,3 +11,4 @@ xformers >= 0.0.22.post7  # Required for CUDA 12.1.
 fastapi
 uvicorn[standard]
 pydantic == 1.10.13  # Required for OpenAI server.
+aioprometheus[starlette]
diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index dc910e1..6bf3622 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -8,6 +8,7 @@ from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
 from vllm.core.scheduler import Scheduler, SchedulerOutputs
 from vllm.engine.arg_utils import EngineArgs
 from vllm.engine.ray_utils import RayWorker, initialize_cluster, ray
+from vllm.engine.metrics import record_metrics
 from vllm.logger import init_logger
 from vllm.outputs import RequestOutput
 from vllm.sampling_params import SamplingParams
@@ -18,12 +19,6 @@ from vllm.transformers_utils.tokenizer import (detokenize_incrementally,
                                                get_tokenizer)
 from vllm.utils import Counter
 
-try:
-    from aioprometheus import Gauge
-    _prometheus_available = True
-except ImportError:
-    _prometheus_available = False
-
 if ray:
     from ray.air.util.torch_dist import init_torch_dist_process_group
     from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
@@ -34,19 +29,6 @@ if TYPE_CHECKING:
 logger = init_logger(__name__)
 
 _LOGGING_INTERVAL_SEC = 5
-if _prometheus_available:
-    gauge_avg_prompt_throughput = Gauge("vllm:avg_prompt_throughput",
-                                        "Avg prefill throughput")
-    gauge_avg_generation_throughput = Gauge("vllm:avg_generation_throughput",
-                                            "Avg prefill throughput")
-    gauge_scheduler_running = Gauge("vllm:scheduler_running",
-                                    "Num requests running")
-    gauge_scheduler_swapped = Gauge("vllm:scheduler_swapped",
-                                    "Num requests swapped")
-    gauge_scheduler_waiting = Gauge("vllm:scheduler_waiting",
-                                    "Num requests waiting")
-    gauge_gpu_cache_usage = Gauge("vllm:gpu_cache_usage", "GPU KV-cache usage")
-    gauge_cpu_cache_usage = Gauge("vllm:cpu_cache_usage", "CPU KV-cache usage")
 
 
 class LLMEngine:
@@ -601,7 +583,7 @@ class LLMEngine:
             self.num_generation_tokens.append((now, num_batched_tokens))
 
         should_log = now - self.last_logging_time >= _LOGGING_INTERVAL_SEC
-        if not (should_log or _prometheus_available):
+        if not should_log:
             return
 
         # Discard the old stats.
@@ -640,26 +622,26 @@ class LLMEngine:
         else:
             cpu_cache_usage = 0.0
 
-        if _prometheus_available:
-            gauge_avg_prompt_throughput.set({}, avg_prompt_throughput)
-            gauge_avg_generation_throughput.set({}, avg_generation_throughput)
-            gauge_scheduler_running.set({}, len(self.scheduler.running))
-            gauge_scheduler_swapped.set({}, len(self.scheduler.swapped))
-            gauge_scheduler_waiting.set({}, len(self.scheduler.waiting))
-            gauge_gpu_cache_usage.set({}, gpu_cache_usage)
-            gauge_cpu_cache_usage.set({}, cpu_cache_usage)
-
-        if should_log:
-            logger.info("Avg prompt throughput: "
-                        f"{avg_prompt_throughput:.1f} tokens/s, "
-                        "Avg generation throughput: "
-                        f"{avg_generation_throughput:.1f} tokens/s, "
-                        f"Running: {len(self.scheduler.running)} reqs, "
-                        f"Swapped: {len(self.scheduler.swapped)} reqs, "
-                        f"Pending: {len(self.scheduler.waiting)} reqs, "
-                        f"GPU KV cache usage: {gpu_cache_usage * 100:.1f}%, "
-                        f"CPU KV cache usage: {cpu_cache_usage * 100:.1f}%,")
-            self.last_logging_time = now
+        record_metrics(
+            avg_prompt_throughput=avg_prompt_throughput,
+            avg_generation_throughput=avg_generation_throughput,
+            scheduler_running=len(self.scheduler.running),
+            scheduler_swapped=len(self.scheduler.swapped),
+            scheduler_waiting=len(self.scheduler.waiting),
+            gpu_cache_usage=gpu_cache_usage,
+            cpu_cache_usage=cpu_cache_usage,
+        )
+
+        logger.info("Avg prompt throughput: "
+                    f"{avg_prompt_throughput:.1f} tokens/s, "
+                    "Avg generation throughput: "
+                    f"{avg_generation_throughput:.1f} tokens/s, "
+                    f"Running: {len(self.scheduler.running)} reqs, "
+                    f"Swapped: {len(self.scheduler.swapped)} reqs, "
+                    f"Pending: {len(self.scheduler.waiting)} reqs, "
+                    f"GPU KV cache usage: {gpu_cache_usage * 100:.1f}%, "
+                    f"CPU KV cache usage: {cpu_cache_usage * 100:.1f}%,")
+        self.last_logging_time = now
 
     def _decode_sequence(self, seq: Sequence, prms: SamplingParams) -> None:
         """Decodes the new token for a sequence."""
diff --git a/vllm/engine/metrics.py b/vllm/engine/metrics.py
new file mode 100644
index 0000000..66050a1
--- /dev/null
+++ b/vllm/engine/metrics.py
@@ -0,0 +1,51 @@
+from aioprometheus import Gauge
+
+# The begin-* and end-* markers here are used by the documentation generator
+# to extract the metrics definitions.
+
+# begin-metrics-definitions
+gauge_avg_prompt_throughput = Gauge("vllm:avg_prompt_throughput_toks_per_s",
+                                    "Average prefill throughput in tokens/s.")
+gauge_avg_generation_throughput = Gauge(
+    "vllm:avg_generation_throughput_toks_per_s",
+    "Average generation throughput in tokens/s.")
+
+gauge_scheduler_running = Gauge(
+    "vllm:num_requests_running",
+    "Number of requests that is currently running for inference.")
+gauge_scheduler_swapped = Gauge("vllm:num_requests_swapped",
+                                "Number requests swapped to CPU.")
+gauge_scheduler_waiting = Gauge("vllm:num_requests_waiting",
+                                "Number of requests waiting to be processed.")
+
+gauge_gpu_cache_usage = Gauge(
+    "vllm:gpu_cache_usage_perc",
+    "GPU KV-cache usage. 1 means 100 percent usage.")
+gauge_cpu_cache_usage = Gauge(
+    "vllm:cpu_cache_usage_perc",
+    "CPU KV-cache usage. 1 means 100 percent usage.")
+# end-metrics-definitions
+
+labels = {}
+
+
+def add_global_metrics_labels(**kwargs):
+    labels.update(kwargs)
+
+
+def record_metrics(
+    avg_prompt_throughput,
+    avg_generation_throughput,
+    scheduler_running,
+    scheduler_swapped,
+    scheduler_waiting,
+    gpu_cache_usage,
+    cpu_cache_usage,
+):
+    gauge_avg_prompt_throughput.set(labels, avg_prompt_throughput)
+    gauge_avg_generation_throughput.set(labels, avg_generation_throughput)
+    gauge_scheduler_running.set(labels, scheduler_running)
+    gauge_scheduler_swapped.set(labels, scheduler_swapped)
+    gauge_scheduler_waiting.set(labels, scheduler_waiting)
+    gauge_gpu_cache_usage.set(labels, gpu_cache_usage)
+    gauge_cpu_cache_usage.set(labels, cpu_cache_usage)
diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py
index 76046e1..a0b74b9 100644
--- a/vllm/entrypoints/openai/api_server.py
+++ b/vllm/entrypoints/openai/api_server.py
@@ -18,6 +18,7 @@ from packaging import version
 
 from vllm.engine.arg_utils import AsyncEngineArgs
 from vllm.engine.async_llm_engine import AsyncLLMEngine
+from vllm.engine.metrics import add_global_metrics_labels
 from vllm.entrypoints.openai.protocol import (
     CompletionRequest, CompletionResponse, CompletionResponseChoice,
     CompletionResponseStreamChoice, CompletionStreamResponse,
@@ -39,12 +40,8 @@ try:
 except ImportError:
     _fastchat_available = False
 
-try:
-    from aioprometheus import MetricsMiddleware
-    from aioprometheus.asgi.starlette import metrics
-    _prometheus_available = True
-except ImportError:
-    _prometheus_available = False
+from aioprometheus import MetricsMiddleware
+from aioprometheus.asgi.starlette import metrics
 
 TIMEOUT_KEEP_ALIVE = 5  # seconds
 
@@ -53,9 +50,8 @@ served_model = None
 app = fastapi.FastAPI()
 engine = None
 
-if _prometheus_available:
-    app.add_middleware(MetricsMiddleware)
-    app.add_route("/metrics", metrics)
+app.add_middleware(MetricsMiddleware)  # Trace HTTP server metrics
+app.add_route("/metrics", metrics)  # Exposes HTTP metrics
 
 
 def create_error_response(status_code: HTTPStatus,
@@ -640,6 +636,9 @@ if __name__ == "__main__":
                               tokenizer_mode=engine_args.tokenizer_mode,
                               trust_remote_code=engine_args.trust_remote_code)
 
+    # Register labels for metrics
+    add_global_metrics_labels(model_name=engine_args.model, )
+
     uvicorn.run(app,
                 host=args.host,
                 port=args.port,
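
A quick way to sanity-check the new endpoint after applying this diff (a minimal sketch; it assumes the OpenAI-compatible server is already running locally on the default port 8000 and that the engine has gone through at least one logging interval so the gauges have values):

# Fetch the Prometheus text exposition from the new /metrics route and check
# that the gauges defined in vllm/engine/metrics.py show up.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    body = resp.read().decode("utf-8")

for name in (
    "vllm:avg_prompt_throughput_toks_per_s",
    "vllm:num_requests_running",
    "vllm:gpu_cache_usage_perc",
):
    print(name, "present" if name in body else "missing")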

@simon-mo simon-mo self-assigned this Nov 30, 2023
@simon-mo
Collaborator

simon-mo commented Dec 2, 2023

In the interest of getting this out for the release, I made a copy of this PR with my changes here: #1890

@simon-mo simon-mo closed this Dec 2, 2023