
Add latency metrics #1870

Closed
simon-mo opened this issue Nov 30, 2023 · 11 comments
Labels
good first issue (Good for newcomers), help wanted (Extra attention is needed)

Comments

@simon-mo
Collaborator

After #1662 (initial metrics support) and #1756 (refactoring the chat endpoint), it will become practical to include latency metrics that are important for production (courtesy of @Yard1):

  • histogram of time to first token, and gauge of the mean, in ms
  • histogram of inter-token latency, and gauge of the mean, in ms
  • histogram of end-to-end time per request, and gauge of the mean, in ms
  • gauge of mean tokens per second per request; we currently only track prefill and generation throughput, not per-request throughput

A natural place to do this would be the LLM engine or the chat completion API, whichever is less intrusive.
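
For concreteness, here is a minimal sketch of what these four metrics could look like using prometheus_client. The metric names, label set, and bucket boundaries below are invented for illustration and are not vLLM's actual definitions; the "gauge of the mean" variants are omitted since they would need running aggregates:

from prometheus_client import Gauge, Histogram

# Illustrative only: names, labels, and buckets are made up for this sketch.
LABELS = ["model_name"]

time_to_first_token_ms = Histogram(
    "request_time_to_first_token_ms",
    "Time to first token per request (ms).",
    labelnames=LABELS,
    buckets=[10, 25, 50, 100, 250, 500, 1000, 2500, 5000],
)
inter_token_latency_ms = Histogram(
    "request_inter_token_latency_ms",
    "Latency between consecutive output tokens (ms).",
    labelnames=LABELS,
    buckets=[1, 5, 10, 25, 50, 100, 250, 500],
)
e2e_latency_ms = Histogram(
    "request_e2e_latency_ms",
    "End-to-end latency per request (ms).",
    labelnames=LABELS,
    buckets=[100, 250, 500, 1000, 2500, 5000, 10000, 30000],
)
tokens_per_second = Gauge(
    "request_tokens_per_second",
    "Mean output tokens per second per request.",
    labelnames=LABELS,
)

def record_finished_request(model_name, ttft_ms, inter_token_ms, e2e_ms, num_output_tokens):
    """Record latency metrics for one finished request (all times in ms)."""
    time_to_first_token_ms.labels(model_name).observe(ttft_ms)
    for itl in inter_token_ms:
        inter_token_latency_ms.labels(model_name).observe(itl)
    e2e_latency_ms.labels(model_name).observe(e2e_ms)
    if e2e_ms > 0:
        tokens_per_second.labels(model_name).set(num_output_tokens / (e2e_ms / 1000.0))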

@simon-mo added the help wanted and good first issue labels on Nov 30, 2023
@Yard1
Collaborator

Yard1 commented Dec 2, 2023

I would suggest placing them in the engine - it will be more generic that way.
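
As a rough illustration of engine-side collection (this sketch does not reflect vLLM's actual engine internals or data structures), the raw timings could be captured per request as output tokens are produced and then fed into the metrics above:

import time

class RequestTiming:
    """Illustrative per-request timing state; not vLLM's actual classes."""

    def __init__(self) -> None:
        self.arrival = time.monotonic()
        self.first_token_time: float | None = None
        self.last_token_time: float | None = None
        self.inter_token_latencies_ms: list[float] = []

    def on_output_token(self) -> None:
        """Call once per generated token, e.g. from the engine's step loop."""
        now = time.monotonic()
        if self.first_token_time is None:
            self.first_token_time = now  # TTFT = first_token_time - arrival
        elif self.last_token_time is not None:
            self.inter_token_latencies_ms.append((now - self.last_token_time) * 1000.0)
        self.last_token_time = now

    def time_to_first_token_ms(self) -> float | None:
        if self.first_token_time is None:
            return None
        return (self.first_token_time - self.arrival) * 1000.0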

@robertgshaw2-neuralmagic
Sponsor Collaborator

I am working on a PR for this

@robertgshaw2-neuralmagic
Sponsor Collaborator

first draft #2316

@hmellor closed this as completed on Mar 28, 2024
@hmellor reopened this on Mar 28, 2024
@hmellor
Collaborator

hmellor commented Mar 28, 2024

#2764 looks to add a request-level histogram of token throughput.

@grandiose-pizza
Contributor

Hi,

@Yard1 @robertgshaw2-neuralmagic
I want to use the metrics. I have exposed an API using api_server.py.

When I request http://localhost:8075/metrics/, I get the following output instead of the values described in the Metrics class. How can I see those metrics?

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 6290.0
python_gc_objects_collected_total{generation="1"} 8336.0
python_gc_objects_collected_total{generation="2"} 4726.0
# HELP python_gc_objects_uncollectable_total Uncollectable objects found during GC
# TYPE python_gc_objects_uncollectable_total counter
python_gc_objects_uncollectable_total{generation="0"} 0.0
python_gc_objects_uncollectable_total{generation="1"} 0.0
python_gc_objects_uncollectable_total{generation="2"} 0.0
# HELP python_gc_collections_total Number of times this generation was collected
# TYPE python_gc_collections_total counter
python_gc_collections_total{generation="0"} 826.0
python_gc_collections_total{generation="1"} 75.0
python_gc_collections_total{generation="2"} 6.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="10",patchlevel="12",version="3.10.12"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.098353664e+010
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 7.31774976e+08
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.71188972784e+09
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 18.27
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 44.0
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1.048576e+06

@hmellor
Collaborator

hmellor commented Mar 31, 2024

You need to make a request in order for the metrics to be populated.
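
For example, after sending a chat/completions request, the vLLM-specific series (they carry a vllm prefix) should start appearing; assuming the port from your earlier comment, something like:

curl -s http://localhost:8075/metrics | grep vllm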

@grandiose-pizza
Contributor

grandiose-pizza commented Mar 31, 2024

I am making a request with a curl command and then monitoring the /metrics endpoint.

But I can't see the metrics shown in this screenshot.

I think I may need to add something to api_server.py to point it to metrics.py, but I am unsure what.

@hmellor
Collaborator

hmellor commented Mar 31, 2024

Are you curling either /v1/chat/completions or /v1/completions?

@grandiose-pizza
Contributor

grandiose-pizza commented Mar 31, 2024

Are you curling either /v1/chat/completions or /v1/completions?

Yes. /v1/chat/completions

curl -X 'POST' \
  'http://localhost:8075/v1/chat/completions' \
  -H 'accept: application/json' \
  -H 'Content-Type: application/json' \
  -H "Authorization: Bearer <API_KEY>" \
  -d '{
  "messages": [
    {
      "role": "user",
      "content": "Write an essay on plastic use"
    }
  ],
  "model": "jais-30b",
  "stream": true
}'

@hmellor
Collaborator

hmellor commented Mar 31, 2024

This debugging isn't really relevant to this thread; I'm going to move further discussion to #2850, where it is.

@HarryWu99
Contributor

@hmellor Hello, it seems that the discussion has moved elsewhere. Can this issue be closed?
