diff --git a/README.md b/README.md
index 8ee1126..7adda9b 100644
--- a/README.md
+++ b/README.md
@@ -10,6 +10,7 @@ Python bindings for the Transformer models implemented in C/C++ using [GGML](htt
     - [Hugging Face Hub](#hugging-face-hub)
     - [LangChain](#langchain)
     - [GPU](#gpu)
+    - [GPTQ](#gptq)
   - [Documentation](#documentation)
   - [License](#license)

@@ -107,7 +108,7 @@ It is integrated into LangChain. See [LangChain docs](https://python.langchain.c

 ### GPU

-> **Note:** Currently only LLaMA and Falcon models have GPU support.
+> **Note:** Currently only LLaMA, MPT and Falcon models have GPU support.

 To run some of the model layers on GPU, set the `gpu_layers` parameter:

@@ -154,30 +155,52 @@ To enable Metal support, install the `ctransformers` package using:
 CT_METAL=1 pip install ctransformers --no-binary ctransformers
 ```

+### GPTQ
+
+> **Note:** This is an experimental feature and only LLaMA models are supported using [ExLlama](https://github.com/turboderp/exllama).
+
+Install additional dependencies using:
+
+```sh
+pip install ctransformers[gptq]
+```
+
+Load a GPTQ model using:
+
+```py
+llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
+```
+
+[Run in Google Colab](https://colab.research.google.com/drive/1SzHslJ4CiycMOgrppqecj4VYCWFnyrN0)
+
+> If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.
+
+It can also be used with LangChain. Low-level APIs are not fully supported.
+
 ## Documentation

 ### Config

-| Parameter | Type | Description | Default |
-| :------------------- | :---------- | :-------------------------------------------------------- | :------ |
-| `top_k` | `int` | The top-k value to use for sampling. | `40` |
-| `top_p` | `float` | The top-p value to use for sampling. | `0.95` |
-| `temperature` | `float` | The temperature to use for sampling. | `0.8` |
-| `repetition_penalty` | `float` | The repetition penalty to use for sampling. | `1.1` |
-| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty. | `64` |
-| `seed` | `int` | The seed value to use for sampling tokens. | `-1` |
-| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | `256` |
-| `stop` | `List[str]` | A list of sequences to stop generation when encountered. | `None` |
-| `stream` | `bool` | Whether to stream the generated text. | `False` |
-| `reset` | `bool` | Whether to reset the model state before generating text. | `True` |
-| `batch_size` | `int` | The batch size to use for evaluating tokens. | `8` |
-| `threads` | `int` | The number of threads to use for evaluating tokens. | `-1` |
-| `context_length` | `int` | The maximum context length to use. | `-1` |
-| `gpu_layers` | `int` | The number of layers to run on GPU. | `0` |
-
-> **Note:** Currently only LLaMA, MPT, Falcon models support the `context_length` parameter and only LLaMA, Falcon models support the `gpu_layers` parameter.
+| Parameter | Type | Description | Default |
+| :------------------- | :---------- | :-------------------------------------------------------------- | :------ |
+| `top_k` | `int` | The top-k value to use for sampling. | `40` |
+| `top_p` | `float` | The top-p value to use for sampling. | `0.95` |
+| `temperature` | `float` | The temperature to use for sampling. | `0.8` |
+| `repetition_penalty` | `float` | The repetition penalty to use for sampling. | `1.1` |
+| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty. | `64` |
+| `seed` | `int` | The seed value to use for sampling tokens. | `-1` |
+| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | `256` |
+| `stop` | `List[str]` | A list of sequences to stop generation when encountered. | `None` |
+| `stream` | `bool` | Whether to stream the generated text. | `False` |
+| `reset` | `bool` | Whether to reset the model state before generating text. | `True` |
+| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt. | `8` |
+| `threads` | `int` | The number of threads to use for evaluating tokens. | `-1` |
+| `context_length` | `int` | The maximum context length to use. | `-1` |
+| `gpu_layers` | `int` | The number of layers to run on GPU. | `0` |
+
+> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` and `gpu_layers` parameters.

 ### class `AutoModelForCausalLM`

@@ -318,7 +341,7 @@ Computes embeddings for a text or list of tokens.
 **Args:**

 - `input`: The input text or list of tokens to get embeddings for.
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
 - `threads`: The number of threads to use for evaluating tokens. Default: `-1`

 **Returns:**
@@ -341,7 +364,7 @@ Evaluates a list of tokens.
 **Args:**

 - `tokens`: The list of tokens to evaluate.
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
 - `threads`: The number of threads to use for evaluating tokens. Default: `-1`

 ---
@@ -374,7 +397,7 @@ Generates new tokens from a list of tokens.
 - `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
 - `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
 - `seed`: The seed value to use for sampling tokens. Default: `-1`
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
 - `threads`: The number of threads to use for evaluating tokens. Default: `-1`
 - `reset`: Whether to reset the model state before generating text. Default: `True`

@@ -488,7 +511,7 @@ Generates text from a prompt.
 - `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
 - `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
 - `seed`: The seed value to use for sampling tokens. Default: `-1`
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
 - `threads`: The number of threads to use for evaluating tokens. Default: `-1`
 - `stop`: A list of sequences to stop generation when encountered. Default: `None`
 - `stream`: Whether to stream the generated text. Default: `False`
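The README hunks above describe `gpu_layers`, `context_length`, `batch_size` and `threads` only through the parameter table and the argument lists. As a rough usage sketch (not part of the patch; the model id, `model_type` and the concrete values below are placeholder assumptions), the pieces fit together like this:

```py
from ctransformers import AutoModelForCausalLM

# Placeholder model id; any compatible GGML model repo or local file works.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",  # example model id (assumption)
    model_type="llama",
    gpu_layers=50,        # layers to offload to GPU (LLaMA, MPT, Falcon only)
    context_length=2048,  # maximum context length (LLaMA, MPT, Falcon only)
)

# batch_size controls how many prompt tokens are evaluated at once;
# threads controls how many CPU threads are used for evaluation.
text = llm(
    "AI is going to",
    max_new_tokens=256,
    batch_size=8,
    threads=-1,
    stream=False,
)
print(text)
```

Per the notes in the table, `gpu_layers` and `context_length` currently only apply to LLaMA, MPT and Falcon models.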
diff --git a/ctransformers/llm.py b/ctransformers/llm.py
index ae15973..09c2221 100644
--- a/ctransformers/llm.py
+++ b/ctransformers/llm.py
@@ -68,7 +68,7 @@ class Config:
     stop="A list of sequences to stop generation when encountered.",
     stream="Whether to stream the generated text.",
     reset="Whether to reset the model state before generating text.",
-    batch_size="The batch size to use for evaluating tokens.",
+    batch_size="The batch size to use for evaluating tokens in a single prompt.",
     threads="The number of threads to use for evaluating tokens.",
     context_length="The maximum context length to use.",
     gpu_layers="The number of layers to run on GPU.",
diff --git a/scripts/docs.py b/scripts/docs.py
index fa9abbe..61ac2e2 100755
--- a/scripts/docs.py
+++ b/scripts/docs.py
@@ -29,7 +29,7 @@ default = getattr(Config, param)
     docs += f"| `{param}` | `{type_}` | {description} | `{default}` |\n"

 docs += """
-> **Note:** Currently only LLaMA, MPT, Falcon models support the `context_length` parameter and only LLaMA, Falcon models support the `gpu_layers` parameter.
+> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` and `gpu_layers` parameters.
 """

 # Class Docs
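For context on the `scripts/docs.py` hunk: only the row-formatting line and the trailing note are visible in the diff. Below is a simplified, self-contained sketch of how a Config table like the one above could be generated; the `Config` dataclass and `descriptions` dict here are hypothetical stand-ins, while the real script reads both from the `Config` class in `ctransformers/llm.py`.

```py
from dataclasses import dataclass, fields


# Hypothetical, trimmed-down stand-in for ctransformers.llm.Config.
@dataclass
class Config:
    batch_size: int = 8
    threads: int = -1
    context_length: int = -1
    gpu_layers: int = 0


# Stand-in for the per-field docstrings kept next to Config in llm.py.
descriptions = {
    "batch_size": "The batch size to use for evaluating tokens in a single prompt.",
    "threads": "The number of threads to use for evaluating tokens.",
    "context_length": "The maximum context length to use.",
    "gpu_layers": "The number of layers to run on GPU.",
}

docs = "| Parameter | Type | Description | Default |\n"
docs += "| :-- | :-- | :-- | :-- |\n"
for field in fields(Config):
    param = field.name
    type_ = field.type.__name__ if hasattr(field.type, "__name__") else field.type
    description = descriptions[param]
    default = getattr(Config, param)
    # Same row format as the line visible in the scripts/docs.py hunk.
    docs += f"| `{param}` | `{type_}` | {description} | `{default}` |\n"

docs += """
> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` and `gpu_layers` parameters.
"""
print(docs)
```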