diff --git a/README.md b/README.md
index 8ee1126..7adda9b 100644
--- a/README.md
+++ b/README.md
@@ -10,6 +10,7 @@ Python bindings for the Transformer models implemented in C/C++ using [GGML](htt
- [Hugging Face Hub](#hugging-face-hub)
- [LangChain](#langchain)
- [GPU](#gpu)
+ - [GPTQ](#gptq)
- [Documentation](#documentation)
- [License](#license)
@@ -107,7 +108,7 @@ It is integrated into LangChain. See [LangChain docs](https://python.langchain.c
### GPU
-> **Note:** Currently only LLaMA and Falcon models have GPU support.
+> **Note:** Currently only LLaMA, MPT and Falcon models have GPU support.
To run some of the model layers on GPU, set the `gpu_layers` parameter:
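For example, assuming a LLaMA-family GGML model such as `TheBloke/Llama-2-7B-GGML`, something like the following offloads 50 layers to the GPU:

```py
from ctransformers import AutoModelForCausalLM

# Layers beyond `gpu_layers` stay on the CPU.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7B-GGML",
    model_type="llama",
    gpu_layers=50,
)
```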
@@ -154,30 +155,52 @@ To enable Metal support, install the `ctransformers` package using:
CT_METAL=1 pip install ctransformers --no-binary ctransformers
```
+### GPTQ
+
+> **Note:** This is an experimental feature and only LLaMA models are supported, using [ExLlama](https://github.com/turboderp/exllama).
+
+Install additional dependencies using:
+
+```sh
+pip install ctransformers[gptq]
+```
+
+Load a GPTQ model using:
+
+```py
+llm = AutoModelForCausalLM.from_pretrained("TheBloke/Llama-2-7B-GPTQ")
+```
+
+[Run in Google Colab](https://colab.research.google.com/drive/1SzHslJ4CiycMOgrppqecj4VYCWFnyrN0)
+
+> If the model name or path doesn't contain the word `gptq`, specify `model_type="gptq"`.
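+
+For example, with a local path whose name gives no hint of the format (the path below is only an illustration):
+
+```py
+# Hypothetical local directory; the name doesn't contain "gptq", so the type is passed explicitly.
+llm = AutoModelForCausalLM.from_pretrained("models/llama-2-7b-quantized", model_type="gptq")
+```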
+
+GPTQ models can also be used with LangChain. The low-level APIs are not fully supported for GPTQ models.
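+
+A minimal LangChain sketch, assuming the `CTransformers` LLM wrapper from LangChain (which forwards `model` and `model_type` to this library):
+
+```py
+from langchain.llms import CTransformers
+
+llm = CTransformers(model="TheBloke/Llama-2-7B-GPTQ", model_type="gptq")
+print(llm("AI is going to"))
+```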
+
## Documentation
### Config
-| Parameter | Type | Description | Default |
-| :------------------- | :---------- | :------------------------------------------------------- | :------ |
-| `top_k` | `int` | The top-k value to use for sampling. | `40` |
-| `top_p` | `float` | The top-p value to use for sampling. | `0.95` |
-| `temperature` | `float` | The temperature to use for sampling. | `0.8` |
-| `repetition_penalty` | `float` | The repetition penalty to use for sampling. | `1.1` |
-| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty. | `64` |
-| `seed` | `int` | The seed value to use for sampling tokens. | `-1` |
-| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | `256` |
-| `stop` | `List[str]` | A list of sequences to stop generation when encountered. | `None` |
-| `stream` | `bool` | Whether to stream the generated text. | `False` |
-| `reset` | `bool` | Whether to reset the model state before generating text. | `True` |
-| `batch_size` | `int` | The batch size to use for evaluating tokens. | `8` |
-| `threads` | `int` | The number of threads to use for evaluating tokens. | `-1` |
-| `context_length` | `int` | The maximum context length to use. | `-1` |
-| `gpu_layers` | `int` | The number of layers to run on GPU. | `0` |
-
-> **Note:** Currently only LLaMA, MPT, Falcon models support the `context_length` parameter and only LLaMA, Falcon models support the `gpu_layers` parameter.
+| Parameter | Type | Description | Default |
+| :------------------- | :---------- | :-------------------------------------------------------------- | :------ |
+| `top_k` | `int` | The top-k value to use for sampling. | `40` |
+| `top_p` | `float` | The top-p value to use for sampling. | `0.95` |
+| `temperature` | `float` | The temperature to use for sampling. | `0.8` |
+| `repetition_penalty` | `float` | The repetition penalty to use for sampling. | `1.1` |
+| `last_n_tokens` | `int` | The number of last tokens to use for repetition penalty. | `64` |
+| `seed` | `int` | The seed value to use for sampling tokens. | `-1` |
+| `max_new_tokens` | `int` | The maximum number of new tokens to generate. | `256` |
+| `stop` | `List[str]` | A list of sequences to stop generation when encountered. | `None` |
+| `stream` | `bool` | Whether to stream the generated text. | `False` |
+| `reset` | `bool` | Whether to reset the model state before generating text. | `True` |
+| `batch_size` | `int` | The batch size to use for evaluating tokens in a single prompt. | `8` |
+| `threads` | `int` | The number of threads to use for evaluating tokens. | `-1` |
+| `context_length` | `int` | The maximum context length to use. | `-1` |
+| `gpu_layers` | `int` | The number of layers to run on GPU. | `0` |
+
+> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` and `gpu_layers` parameters.
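+
+As a rough illustration, these options can be passed either when loading the model or per generation call (the model name below is just an example of a LLaMA-family GGML model):
+
+```py
+from ctransformers import AutoModelForCausalLM
+
+llm = AutoModelForCausalLM.from_pretrained(
+    "TheBloke/Llama-2-7B-GGML",
+    model_type="llama",
+    context_length=2048,  # maximum context length
+    gpu_layers=0,         # keep all layers on the CPU
+)
+
+print(llm("AI is going to", max_new_tokens=64, temperature=0.7, top_p=0.9))
+```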
### class `AutoModelForCausalLM`
@@ -318,7 +341,7 @@ Computes embeddings for a text or list of tokens.
**Args:**
- `input`: The input text or list of tokens to get embeddings for.
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
**Returns:**
@@ -341,7 +364,7 @@ Evaluates a list of tokens.
**Args:**
- `tokens`: The list of tokens to evaluate.
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
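
A minimal usage sketch (the prompt and batch size are arbitrary; `tokenize` is the library's own method for producing the token list):

```py
tokens = llm.tokenize("AI is going to")
llm.eval(tokens, batch_size=8)
```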
---
@@ -374,7 +397,7 @@ Generates new tokens from a list of tokens.
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `reset`: Whether to reset the model state before generating text. Default: `True`
@@ -488,7 +511,7 @@ Generates text from a prompt.
- `repetition_penalty`: The repetition penalty to use for sampling. Default: `1.1`
- `last_n_tokens`: The number of last tokens to use for repetition penalty. Default: `64`
- `seed`: The seed value to use for sampling tokens. Default: `-1`
-- `batch_size`: The batch size to use for evaluating tokens. Default: `8`
+- `batch_size`: The batch size to use for evaluating tokens in a single prompt. Default: `8`
- `threads`: The number of threads to use for evaluating tokens. Default: `-1`
- `stop`: A list of sequences to stop generation when encountered. Default: `None`
- `stream`: Whether to stream the generated text. Default: `False`
diff --git a/ctransformers/llm.py b/ctransformers/llm.py
index ae15973..09c2221 100644
--- a/ctransformers/llm.py
+++ b/ctransformers/llm.py
@@ -68,7 +68,7 @@ class Config:
stop="A list of sequences to stop generation when encountered.",
stream="Whether to stream the generated text.",
reset="Whether to reset the model state before generating text.",
- batch_size="The batch size to use for evaluating tokens.",
+ batch_size="The batch size to use for evaluating tokens in a single prompt.",
threads="The number of threads to use for evaluating tokens.",
context_length="The maximum context length to use.",
gpu_layers="The number of layers to run on GPU.",
diff --git a/scripts/docs.py b/scripts/docs.py
index fa9abbe..61ac2e2 100755
--- a/scripts/docs.py
+++ b/scripts/docs.py
@@ -29,7 +29,7 @@
default = getattr(Config, param)
docs += f"| `{param}` | `{type_}` | {description} | `{default}` |\n"
docs += """
-> **Note:** Currently only LLaMA, MPT, Falcon models support the `context_length` parameter and only LLaMA, Falcon models support the `gpu_layers` parameter.
+> **Note:** Currently only LLaMA, MPT and Falcon models support the `context_length` and `gpu_layers` parameters.
"""
# Class Docs