Document generate HTTP endpoint #6412

Merged (5 commits, Oct 12, 2023)
67 changes: 53 additions & 14 deletions docs/protocol/extension_generate.md

# Generate Extension

**Note:** The Generate Extension is *provisional* and likely to change in future versions.

This document describes Triton's generate extension. The generate
extension provides a simple text-oriented endpoint schema for interacting with
large language models (LLMs). The generate endpoint is specific to the
HTTP/REST frontend.

## HTTP/REST

In all JSON schemas shown in this document, `$number`, `$string`, `$boolean`,
`$object` and `$array` refer to the fundamental JSON types. #optional
indicates an optional JSON field.

POST v2/models/${MODEL_NAME}[/versions/${MODEL_VERSION}]/generate_stream
### generate vs. generate_stream

Both URLs expect the same request JSON object, and generate the same response
JSON object. However, `generate` returns exactly 1 response JSON object, while
`generate_stream` may return multiple responses based on the inference
results. `generate_stream` returns the responses as
[Server-Sent Events](https://html.spec.whatwg.org/multipage/server-sent-events.html#server-sent-events)
(SSE), where each response will be a "data" chunk in the HTTP response body.
Note also that an error may occur during inference after the HTTP response
code has already been set by the first SSE response, which can result in
receiving an [error object](#generate-response-json-error-object) while the
status code shows success (200). Therefore, the user must always check whether
an error object is received when generating responses through
`generate_stream`.

### Generate Request JSON Object

The generate request object, identified as *$generate_request*, is
required in the HTTP body of the POST request. The model name and
(optionally) version must be available in the URL. If a version is not
provided, the server may choose a version based on its own policies or
return an error.

$generate_request =
$string, $number, or $boolean.

$parameter = $string : $string | $number | $boolean

Parameters are model-specific. The user should check the model
specification to set the parameters.

#### Example Request

Below is an example of sending a generate request with the additional model parameters `stream` and `temperature`.

```
POST /v2/models/mymodel/generate HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>

{
  "text_input": "client input",
  "parameters": {
    "stream": false,
    "temperature": 0
  }
}
```
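The same request can be built programmatically. The sketch below only
constructs and serializes the payload; the URL, model name, and the
commented-out `urllib` call are illustrative assumptions, not part of the
protocol.

```python
import json

# Hypothetical endpoint for illustration; substitute your own server
# address and model name.
url = "http://localhost:8000/v2/models/mymodel/generate"

# The $generate_request object: "text_input" plus model-specific
# "parameters" whose accepted names and values depend on the model.
payload = {
    "text_input": "client input",
    "parameters": {
        "stream": False,
        "temperature": 0,
    },
}

body = json.dumps(payload)

# With a server running, the request could be sent like this:
# import urllib.request
# req = urllib.request.Request(
#     url, data=body.encode("utf-8"),
#     headers={"Content-Type": "application/json"})
# print(urllib.request.urlopen(req).read())
```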

### Generate Response JSON Object

A successful generate request is indicated by a 200 HTTP status code.
the HTTP body.

* "text_output" : The output of the inference.

#### Example Response

```
200
{
  "text_output" : "model output"
}
```

### Generate Response JSON Error Object

A failed generate request must be indicated by an HTTP error status
"error": <error message string>
}

* "error" : The descriptive message for the error.

#### Example Error

```
400
{
  "error" : "error message"
}
```
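Putting the success and error cases together, a client can handle a
non-streaming `generate` response as sketched below. The helper name
`parse_generate_response` is illustrative; the behavior follows the
schemas above (200 with `$generate_response`, error status with a
`$generate_error` carrying an "error" field).

```python
import json

def parse_generate_response(status_code, body):
    """Return the model's text output on success; raise on an error.

    A 200 status carries a $generate_response object with "text_output";
    an error status carries a $generate_error object whose "error" field
    holds the descriptive message.
    """
    obj = json.loads(body)
    if status_code == 200 and "error" not in obj:
        return obj["text_output"]
    raise RuntimeError(obj.get("error", f"HTTP {status_code}"))
```

Note that for `generate_stream` this per-object error check must be
applied to every SSE chunk, since the status code alone is not reliable
there.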