Add support for prometheus metrics #1662

Closed

Conversation

ichernev
Contributor

This is very useful for monitoring, especially if you run more than one instance.

@ichernev
Contributor Author

@WoosukKwon @zhuohan123 can I get a review, please?

Also, should I add the aioprometheus dependency (and drop the presence checks), or leave it as is?

@simon-mo
Collaborator

@ichernev thank you for your PR!

This looks reasonable to me. Adding aioprometheus should be fine; it has few dependencies. I'm waiting on @Yard1 to confirm that the list of metrics to track is reasonable and whether there's anything else to add.

@ichernev
Contributor Author

ichernev commented Nov 15, 2023

One other thing: currently the code adds metrics only for the OpenAI entrypoint. If I also add support in the regular entrypoint, we should probably just put the aioprometheus package in requirements.txt and drop the try/except from the imports and uses. (obsolete, see next comment)

I'm open to suggestions :)
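
For context, the "presence check" referred to above is a guarded import of aioprometheus, along the lines of the sketch below (the same pattern the diff further down removes once the dependency becomes mandatory). The helper function name is hypothetical and used only for illustration.

# Minimal sketch of the optional-import "presence check" under discussion:
# the gauge is only defined and updated when aioprometheus is installed.
try:
    from aioprometheus import Gauge
    _prometheus_available = True
except ImportError:
    _prometheus_available = False

if _prometheus_available:
    gauge_scheduler_running = Gauge("vllm:scheduler_running",
                                    "Num requests running")


def maybe_record_running(num_running: int) -> None:
    # Hypothetical helper for illustration: record the value only when
    # aioprometheus is available, otherwise silently skip.
    if _prometheus_available:
        gauge_scheduler_running.set({}, num_running)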

Add /metrics for openai endpoint with the metrics that were already
logged.
@ichernev
Contributor Author

In light of this comment: #1663 (comment), I figured it doesn't make sense to include a Prometheus /metrics endpoint for the regular (non-OpenAI) server. I also went ahead and added the aioprometheus package to the OpenAI Docker recipe, just like the other OpenAI-specific deps.

@simon-mo simon-mo mentioned this pull request Nov 30, 2023
@simon-mo
Collaborator

@ichernev Can you enable this PR to be edited by maintainers? https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/allowing-changes-to-a-pull-request-branch-created-from-a-fork#enabling-repository-maintainer-permissions-on-existing-pull-requests

I made some improvements to this PR and I think it's in a good state to merge:

  • separate the metrics definitions into a standalone file
  • add documentation
  • add default labels.

Here are the detailed changes; you can also directly apply the diff.

diff --git a/docs/source/index.rst b/docs/source/index.rst
index eb98aa6..c3a1d8f 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -66,6 +66,7 @@ Documentation
    serving/run_on_sky
    serving/deploying_with_triton
    serving/deploying_with_docker
+   serving/metrics
 
 .. toctree::
    :maxdepth: 1
diff --git a/docs/source/serving/metrics.rst b/docs/source/serving/metrics.rst
new file mode 100644
index 0000000..15e57bd
--- /dev/null
+++ b/docs/source/serving/metrics.rst
@@ -0,0 +1,13 @@
+Production Metrics
+==================
+
+vLLM exposes a number of metrics that can be used to monitor the health of the
+system. These metrics are exposed via the `/metrics` endpoint on the vLLM
+OpenAI compatible API server.
+
+The following metrics are exposed:
+
+.. literalinclude:: ../../../vllm/engine/metrics.py
+    :language: python
+    :start-after: begin-metrics-definitions
+    :end-before: end-metrics-definitions
diff --git a/requirements.txt b/requirements.txt
index fa9eb63..a593324 100644
--- a/requirements.txt
+++ b/requirements.txt
@@ -11,3 +11,4 @@ xformers >= 0.0.22.post7  # Required for CUDA 12.1.
 fastapi
 uvicorn[standard]
 pydantic == 1.10.13  # Required for OpenAI server.
+aioprometheus[starlette]
diff --git a/vllm/engine/llm_engine.py b/vllm/engine/llm_engine.py
index dc910e1..6bf3622 100644
--- a/vllm/engine/llm_engine.py
+++ b/vllm/engine/llm_engine.py
@@ -8,6 +8,7 @@ from vllm.config import (CacheConfig, ModelConfig, ParallelConfig,
 from vllm.core.scheduler import Scheduler, SchedulerOutputs
 from vllm.engine.arg_utils import EngineArgs
 from vllm.engine.ray_utils import RayWorker, initialize_cluster, ray
+from vllm.engine.metrics import record_metrics
 from vllm.logger import init_logger
 from vllm.outputs import RequestOutput
 from vllm.sampling_params import SamplingParams
@@ -18,12 +19,6 @@ from vllm.transformers_utils.tokenizer import (detokenize_incrementally,
                                                get_tokenizer)
 from vllm.utils import Counter
 
-try:
-    from aioprometheus import Gauge
-    _prometheus_available = True
-except ImportError:
-    _prometheus_available = False
-
 if ray:
     from ray.air.util.torch_dist import init_torch_dist_process_group
     from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
@@ -34,19 +29,6 @@ if TYPE_CHECKING:
 logger = init_logger(__name__)
 
 _LOGGING_INTERVAL_SEC = 5
-if _prometheus_available:
-    gauge_avg_prompt_throughput = Gauge("vllm:avg_prompt_throughput",
-                                        "Avg prefill throughput")
-    gauge_avg_generation_throughput = Gauge("vllm:avg_generation_throughput",
-                                            "Avg prefill throughput")
-    gauge_scheduler_running = Gauge("vllm:scheduler_running",
-                                    "Num requests running")
-    gauge_scheduler_swapped = Gauge("vllm:scheduler_swapped",
-                                    "Num requests swapped")
-    gauge_scheduler_waiting = Gauge("vllm:scheduler_waiting",
-                                    "Num requests waiting")
-    gauge_gpu_cache_usage = Gauge("vllm:gpu_cache_usage", "GPU KV-cache usage")
-    gauge_cpu_cache_usage = Gauge("vllm:cpu_cache_usage", "CPU KV-cache usage")
 
 
 class LLMEngine:
@@ -601,7 +583,7 @@ class LLMEngine:
             self.num_generation_tokens.append((now, num_batched_tokens))
 
         should_log = now - self.last_logging_time >= _LOGGING_INTERVAL_SEC
-        if not (should_log or _prometheus_available):
+        if not should_log:
             return
 
         # Discard the old stats.
@@ -640,26 +622,26 @@ class LLMEngine:
         else:
             cpu_cache_usage = 0.0
 
-        if _prometheus_available:
-            gauge_avg_prompt_throughput.set({}, avg_prompt_throughput)
-            gauge_avg_generation_throughput.set({}, avg_generation_throughput)
-            gauge_scheduler_running.set({}, len(self.scheduler.running))
-            gauge_scheduler_swapped.set({}, len(self.scheduler.swapped))
-            gauge_scheduler_waiting.set({}, len(self.scheduler.waiting))
-            gauge_gpu_cache_usage.set({}, gpu_cache_usage)
-            gauge_cpu_cache_usage.set({}, cpu_cache_usage)
-
-        if should_log:
-            logger.info("Avg prompt throughput: "
-                        f"{avg_prompt_throughput:.1f} tokens/s, "
-                        "Avg generation throughput: "
-                        f"{avg_generation_throughput:.1f} tokens/s, "
-                        f"Running: {len(self.scheduler.running)} reqs, "
-                        f"Swapped: {len(self.scheduler.swapped)} reqs, "
-                        f"Pending: {len(self.scheduler.waiting)} reqs, "
-                        f"GPU KV cache usage: {gpu_cache_usage * 100:.1f}%, "
-                        f"CPU KV cache usage: {cpu_cache_usage * 100:.1f}%,")
-            self.last_logging_time = now
+        record_metrics(
+            avg_prompt_throughput=avg_prompt_throughput,
+            avg_generation_throughput=avg_generation_throughput,
+            scheduler_running=len(self.scheduler.running),
+            scheduler_swapped=len(self.scheduler.swapped),
+            scheduler_waiting=len(self.scheduler.waiting),
+            gpu_cache_usage=gpu_cache_usage,
+            cpu_cache_usage=cpu_cache_usage,
+        )
+
+        logger.info("Avg prompt throughput: "
+                    f"{avg_prompt_throughput:.1f} tokens/s, "
+                    "Avg generation throughput: "
+                    f"{avg_generation_throughput:.1f} tokens/s, "
+                    f"Running: {len(self.scheduler.running)} reqs, "
+                    f"Swapped: {len(self.scheduler.swapped)} reqs, "
+                    f"Pending: {len(self.scheduler.waiting)} reqs, "
+                    f"GPU KV cache usage: {gpu_cache_usage * 100:.1f}%, "
+                    f"CPU KV cache usage: {cpu_cache_usage * 100:.1f}%,")
+        self.last_logging_time = now
 
     def _decode_sequence(self, seq: Sequence, prms: SamplingParams) -> None:
         """Decodes the new token for a sequence."""
diff --git a/vllm/engine/metrics.py b/vllm/engine/metrics.py
new file mode 100644
index 0000000..66050a1
--- /dev/null
+++ b/vllm/engine/metrics.py
@@ -0,0 +1,51 @@
+from aioprometheus import Gauge
+
+# The begin-* and end-* markers here are used by the documentation generator
+# to extract the metrics definitions.
+
+# begin-metrics-definitions
+gauge_avg_prompt_throughput = Gauge("vllm:avg_prompt_throughput_toks_per_s",
+                                    "Average prefill throughput in tokens/s.")
+gauge_avg_generation_throughput = Gauge(
+    "vllm:avg_generation_throughput_toks_per_s",
+    "Average generation throughput in tokens/s.")
+
+gauge_scheduler_running = Gauge(
+    "vllm:num_requests_running",
+    "Number of requests that is currently running for inference.")
+gauge_scheduler_swapped = Gauge("vllm:num_requests_swapped",
+                                "Number requests swapped to CPU.")
+gauge_scheduler_waiting = Gauge("vllm:num_requests_waiting",
+                                "Number of requests waiting to be processed.")
+
+gauge_gpu_cache_usage = Gauge(
+    "vllm:gpu_cache_usage_perc",
+    "GPU KV-cache usage. 1 means 100 percent usage.")
+gauge_cpu_cache_usage = Gauge(
+    "vllm:cpu_cache_usage_perc",
+    "CPU KV-cache usage. 1 means 100 percent usage.")
+# end-metrics-definitions
+
+labels = {}
+
+
+def add_global_metrics_labels(**kwargs):
+    labels.update(kwargs)
+
+
+def record_metrics(
+    avg_prompt_throughput,
+    avg_generation_throughput,
+    scheduler_running,
+    scheduler_swapped,
+    scheduler_waiting,
+    gpu_cache_usage,
+    cpu_cache_usage,
+):
+    gauge_avg_prompt_throughput.set(labels, avg_prompt_throughput)
+    gauge_avg_generation_throughput.set(labels, avg_generation_throughput)
+    gauge_scheduler_running.set(labels, scheduler_running)
+    gauge_scheduler_swapped.set(labels, scheduler_swapped)
+    gauge_scheduler_waiting.set(labels, scheduler_waiting)
+    gauge_gpu_cache_usage.set(labels, gpu_cache_usage)
+    gauge_cpu_cache_usage.set(labels, cpu_cache_usage)
diff --git a/vllm/entrypoints/openai/api_server.py b/vllm/entrypoints/openai/api_server.py
index 76046e1..a0b74b9 100644
--- a/vllm/entrypoints/openai/api_server.py
+++ b/vllm/entrypoints/openai/api_server.py
@@ -18,6 +18,7 @@ from packaging import version
 
 from vllm.engine.arg_utils import AsyncEngineArgs
 from vllm.engine.async_llm_engine import AsyncLLMEngine
+from vllm.engine.metrics import add_global_metrics_labels
 from vllm.entrypoints.openai.protocol import (
     CompletionRequest, CompletionResponse, CompletionResponseChoice,
     CompletionResponseStreamChoice, CompletionStreamResponse,
@@ -39,12 +40,8 @@ try:
 except ImportError:
     _fastchat_available = False
 
-try:
-    from aioprometheus import MetricsMiddleware
-    from aioprometheus.asgi.starlette import metrics
-    _prometheus_available = True
-except ImportError:
-    _prometheus_available = False
+from aioprometheus import MetricsMiddleware
+from aioprometheus.asgi.starlette import metrics
 
 TIMEOUT_KEEP_ALIVE = 5  # seconds
 
@@ -53,9 +50,8 @@ served_model = None
 app = fastapi.FastAPI()
 engine = None
 
-if _prometheus_available:
-    app.add_middleware(MetricsMiddleware)
-    app.add_route("/metrics", metrics)
+app.add_middleware(MetricsMiddleware)  # Trace HTTP server metrics
+app.add_route("/metrics", metrics)  # Exposes HTTP metrics
 
 
 def create_error_response(status_code: HTTPStatus,
@@ -640,6 +636,9 @@ if __name__ == "__main__":
                               tokenizer_mode=engine_args.tokenizer_mode,
                               trust_remote_code=engine_args.trust_remote_code)
 
+    # Register labels for metrics
+    add_global_metrics_labels(model_name=engine_args.model, )
+
     uvicorn.run(app,
                 host=args.host,
                 port=args.port,
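
A quick way to sanity-check the new endpoint after applying this diff (a minimal sketch; it assumes the OpenAI-compatible server is already running locally on the default port 8000 and that the engine has gone through at least one logging interval so the gauges have values):

# Fetch the Prometheus text exposition from the new /metrics route and check
# that the gauges defined in vllm/engine/metrics.py show up.
import urllib.request

with urllib.request.urlopen("http://localhost:8000/metrics") as resp:
    body = resp.read().decode("utf-8")

for name in (
    "vllm:avg_prompt_throughput_toks_per_s",
    "vllm:num_requests_running",
    "vllm:gpu_cache_usage_perc",
):
    print(name, "present" if name in body else "missing")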

@simon-mo simon-mo self-assigned this Nov 30, 2023
@simon-mo
Collaborator

simon-mo commented Dec 2, 2023

In the interest of getting this out for the release, I made a copy of this PR with my changes here: #1890

@simon-mo simon-mo closed this Dec 2, 2023