# Llama 2: Next generation of Meta's Language Model


TorchServe supports serving Llama 2 in a number of ways. The examples covered in this document range from serving Llama 2 behind a chat app for someone new to TorchServe, to using micro-batching and streaming responses with Llama 2 as an advanced TorchServe user.

## 🦙💬 Llama 2 Chatbot

This example shows how to deploy a Llama 2 chat app using TorchServe. We use Streamlit to create the app.

This example uses llama-cpp-python.
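
The sketch below shows the basic llama-cpp-python generation call that a handler would wrap; the GGUF model path and generation parameters are illustrative placeholders, not the exact values used by this example.

```python
from llama_cpp import Llama

# Load a quantized Llama 2 chat model in GGUF format (path is a placeholder).
llm = Llama(model_path="llama-2-7b-chat.Q4_K_M.gguf", n_ctx=2048)

# Run a single completion; max_tokens caps the response length.
output = llm(
    "Q: What is TorchServe? A:",
    max_tokens=64,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```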

You can run this example on your laptop to understand how to use TorchServe, how to scale TorchServe backend workers up and down, and how changing `batch_size` affects inference time.
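
As a rough sketch (assuming TorchServe's management API on its default port 8081 and a placeholder registered model name `llamacpp`), workers can be scaled and inspected like this:

```python
import requests

MANAGEMENT_URL = "http://localhost:8081"  # TorchServe management API
MODEL_NAME = "llamacpp"                   # placeholder registered model name

# Scale the backend up to 2 workers and block until they are ready.
resp = requests.put(
    f"{MANAGEMENT_URL}/models/{MODEL_NAME}",
    params={"min_worker": 2, "synchronous": "true"},
)
print(resp.status_code, resp.text)

# Inspect the current worker status for the model.
print(requests.get(f"{MANAGEMENT_URL}/models/{MODEL_NAME}").json())
```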

*Figure: Chatbot Architecture*

## Llama 2 with HuggingFace

This example shows how to serve the Llama 2 70B model with limited resources using HuggingFace. It shows the following optimizations:

1. HuggingFace Accelerate. This option can be activated with `low_cpu_mem_usage=True`.
2. Quantization from bitsandbytes using `load_in_8bit=True`.

The model is first created on the meta device (with empty weights) and the state dict is then loaded into it (shard by shard in the case of a sharded checkpoint).
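
A minimal sketch of these two options as `from_pretrained` arguments is shown below; the model id is illustrative, access to Llama 2 weights requires accepting Meta's license, and bitsandbytes must be installed.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-70b-hf"  # illustrative; weights are gated behind Meta's license

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,  # Accelerate: build on the meta device, then load weights shard by shard
    load_in_8bit=True,       # bitsandbytes 8-bit quantization
    device_map="auto",       # let Accelerate place the shards across available devices (assumption)
)

inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```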

## Llama 2 on Inferentia

This example shows how to serve the Llama 2 model on AWS Inferentia2 for text completion with micro batching and streaming response support.

Inferentia2 uses the Neuron SDK, which is built on top of the PyTorch XLA stack. For large model inference, the transformers-neuronx package is used; it takes care of model partitioning and running inference.
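
A rough sketch of the transformers-neuronx flow is shown below; the checkpoint path, `tp_degree`, and sampling parameters are illustrative, and the checkpoint is assumed to have already been saved in the split format that transformers-neuronx expects.

```python
from transformers import AutoTokenizer
from transformers_neuronx.llama.model import LlamaForSampling

# Load a pre-split Llama 2 checkpoint, shard it across NeuronCores (tp_degree),
# and compile it for Inferentia2. Path and tp_degree are illustrative.
neuron_model = LlamaForSampling.from_pretrained(
    "./llama-2-13b-split", batch_size=1, tp_degree=24, amp="f16"
)
neuron_model.to_neuron()

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-13b-hf")
input_ids = tokenizer("Hello, my name is", return_tensors="pt").input_ids

# Autoregressive sampling on the Neuron device.
generated = neuron_model.sample(input_ids, sequence_length=256, top_k=50)
print(tokenizer.decode(generated[0]))
```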

*Figure: Inferentia2 Software Stack*