Releases: triton-inference-server/vllm_backend

Release 2.49.0 corresponding to NGC container 24.08

30 Aug 18:37
98947a7

What's Changed

  • refactor: Remove explicit calls to the garbage collector by @kthui in #55
  • perf: Check for cancellation on response thread by @kthui in #54
  • feat: Add vLLM counter metrics access through Triton by @yinggeh in #53
  • feat: Report histogram metrics to Triton metrics server by @yinggeh in #58
  • feat: Report more histogram metrics by @yinggeh in #61
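
The metrics work above (#53, #58, #61) surfaces vLLM engine statistics on Triton's Prometheus metrics endpoint. A minimal sketch of pulling them, assuming a server running locally with the default metrics port 8002; the vllm:-prefixed metric names are illustrative of vLLM's naming convention and may differ by version.

```python
# Minimal sketch: scrape Triton's Prometheus metrics endpoint and print the
# vLLM samples it now exposes. Assumes tritonserver is running locally with
# the default metrics port (8002); metric names are version-dependent.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # assumed default endpoint

with urllib.request.urlopen(METRICS_URL) as resp:
    text = resp.read().decode("utf-8")

# Counters (e.g. vllm:prompt_tokens_total) and histograms
# (e.g. vllm:time_to_first_token_seconds) share the "vllm:" prefix.
for line in text.splitlines():
    if "vllm:" in line:
        print(line)
```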

Full Changelog: v24.07...v24.08

Release 2.48.0 corresponding to NGC container 24.07

05 Aug 20:38
128abc3

What's Changed

  • Removed explicit mode for multi-LoRA by @oandreeva-nv in #45
  • test: Limit multi-GPU tests to use Ray as the distributed_executor_backend by @oandreeva-nv in #47
  • perf: Improve vLLM backend performance by using a separate thread for responses by @Tabrizian in #46
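
For the multi-GPU case touched by #47, the executor choice rides along in the vLLM engine arguments the backend reads from model.json. A hedged sketch under that assumption; the model id and repository path are placeholders, and the field names follow vLLM's engine arguments.

```python
# Hedged sketch: write a model.json that selects Ray as vLLM's distributed
# executor for tensor-parallel serving. Field names follow vLLM engine
# arguments; the model id and repository path are placeholders.
import json

engine_args = {
    "model": "facebook/opt-125m",           # placeholder model id
    "tensor_parallel_size": 2,              # shard across two GPUs
    "distributed_executor_backend": "ray",  # pin Ray, as the tests now do
}

with open("model_repository/vllm_model/1/model.json", "w") as f:
    json.dump(engine_args, f, indent=2)
```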

Full Changelog: v24.06...v24.07

Release 2.47.0 corresponding to NGC container 24.06

23 Jul 19:27
18a96e3
What's Changed

  • fix: Enhance checks around KIND_GPU and tensor parallelism (#42)

Co-authored-by: Olga Andreeva <124622579+oandreeva-nv@users.noreply.github.com>
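
To illustrate the invariant behind the #42 fix (this is a sketch, not the backend's actual code): a KIND_GPU instance group binds each model instance to a single GPU, which conflicts with a vLLM engine sharded across GPUs, so tensor parallelism calls for KIND_MODEL.

```python
# Illustrative sketch of the check's intent, not the backend's code:
# KIND_GPU pins one GPU per instance, so it cannot host a tensor-parallel
# vLLM engine that spans multiple GPUs.
def validate_instance_group(kind: str, tensor_parallel_size: int) -> None:
    if kind == "KIND_GPU" and tensor_parallel_size > 1:
        raise ValueError(
            "KIND_GPU gives each instance a single GPU; use KIND_MODEL "
            "when tensor_parallel_size > 1."
        )

validate_instance_group("KIND_MODEL", tensor_parallel_size=2)  # OK
validate_instance_group("KIND_GPU", tensor_parallel_size=1)    # OK
```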