The term _generative pre-training_ represents the unsupervised pre-training of the generative model.<d-footnote>They used a multi-layer Transformer decoder to produce an output distribution over target tokens.</d-footnote> Given an unsupervised corpus of tokens $\mathcal{U} = \{u_1,\dots,u_n\}$, they use a standard language modelling objective to maximize the following likelihood:
{: .text-justify}

$$
L_1(\mathcal{U})=\sum_i\log P(u_i|u_{i-k},\dots,u_{i-1};\Theta)
$$

where $k$ is the size of the context window, and the conditional probability $P$ is modelled using a neural network with parameters $\Theta$ trained using stochastic gradient descent. **Intuitively, we train the Transformer-based model to predict the next token within the $k$-context window using unlabeled text from which we also extract the latent features $h$.**
{: .text-justify}
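To make this concrete, below is a minimal PyTorch sketch of (the negative of) $L_1$. The `model` name is an assumption standing in for any autoregressive Transformer decoder that maps token ids to logits over the vocabulary; a real implementation would score all positions in parallel under a causal attention mask rather than looping.
{: .text-justify}

```python
import torch
import torch.nn.functional as F

def lm_loss(model, tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Average -log P(u_i | u_{i-k}, ..., u_{i-1}); minimizing it maximizes L1."""
    losses = []
    for i in range(1, tokens.size(0)):
        context = tokens[max(0, i - k):i]      # at most the k previous tokens
        logits = model(context.unsqueeze(0))   # assumed shape: (1, len(context), vocab_size)
        log_probs = F.log_softmax(logits[0, -1], dim=-1)
        losses.append(-log_probs[tokens[i]])   # -log P(u_i | context)
    return torch.stack(losses).mean()
```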

### Supervised fine-tuning

After training the model with the objective function above, they adapt the parameters to the supervised target task, a step referred to as _supervised fine-tuning_. Assume a labelled dataset $\mathcal{C}$, where each instance consists of a sequence of input tokens, $x^1,\dots, x^m$, along with a label $y$. The inputs are passed through the pre-trained model to obtain the final transformer block's activation $h_l^m$, which is then fed into an added linear output layer with parameters $W_y$ to predict $y$:
{: .text-justify}

$$
P(y|x^1,\dots,x^m)=\text{softmax}(h_l^m W_y)
$$

This gives the following objective to maximize:
{: .text-justify}

$$
L_2(\mathcal{C})=\sum_{(x,y)}\log P(y|x^1,\dots,x^m)
$$
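A minimal sketch of this head, assuming `h_last` holds a batch of final-block activations $h_l^m$ and `W_y` is the added output matrix (both names and shapes are illustrative):
{: .text-justify}

```python
import torch
import torch.nn.functional as F

def finetune_loss(h_last: torch.Tensor, W_y: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Negative L2 over a batch: -sum of log P(y | x^1, ..., x^m)."""
    logits = h_last @ W_y  # (batch, hidden) @ (hidden, n_classes)
    return F.cross_entropy(logits, y, reduction="sum")  # applies log-softmax internally
```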

They additionally found that including language modelling as an auxiliary objective during fine-tuning helped learning by (a) improving the generalization of the supervised model, and (b) accelerating convergence. Specifically, we optimize the following objective (with weight $\lambda$): $L_3(\mathcal{C})=L_2(\mathcal{C})+\lambda \cdot L_1(\mathcal{C})$.
{: .text-justify}
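In code, reusing the two sketches above, the combined objective is just a weighted sum (the weight value here is illustrative, not the paper's):
{: .text-justify}

```python
def combined_loss(h_last, W_y, y, model, tokens, k, lam=0.5):
    # Negative of L3 = L2 + lambda * L1 (both sketches above return
    # negative log-likelihoods, so minimizing this maximizes L3).
    return finetune_loss(h_last, W_y, y) + lam * lm_loss(model, tokens, k)
```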

Some tasks, like question answering or textual entailment, have structured inputs, such as ordered sentence pairs or triplets of document, question, and answer. These differ from the contiguous text sequences the pre-trained model consumes, so applying GPT to them requires some modification. This motivates the **_input transformations_**, which convert structured inputs into ordered sequences the model can process and allow us to avoid making extensive changes to the architecture across tasks. A brief description of these input transformations is shown in Figure 1 (credit to the [paper](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf)).
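As a rough illustration of these transformations, following the paper's Figure 1 (the token strings below are placeholders for the learned start, delimiter and extract embeddings, not the paper's actual ids):
{: .text-justify}

```python
START, DELIM, EXTRACT = "<s>", "<$>", "<e>"  # illustrative placeholders

def entailment_input(premise: str, hypothesis: str) -> str:
    # Concatenate the ordered sentence pair into one contiguous sequence.
    return f"{START} {premise} {DELIM} {hypothesis} {EXTRACT}"

def similarity_inputs(a: str, b: str) -> list[str]:
    # No inherent ordering, so both orderings are processed and their
    # final representations combined before the output layer.
    return [f"{START} {a} {DELIM} {b} {EXTRACT}",
            f"{START} {b} {DELIM} {a} {EXTRACT}"]

def qa_inputs(document: str, question: str, answers: list[str]) -> list[str]:
    # One sequence per candidate answer; each is scored independently
    # and the scores are normalized over the candidates.
    return [f"{START} {document} {question} {DELIM} {answer} {EXTRACT}"
            for answer in answers]
```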

> GPT-3 was applied with tasks and **_few-shot_** demonstrations specified purely via text interaction with the model, with no gradient updates. Fine-tuning, by contrast, updates the weights of a pre-trained model by training on a supervised dataset specific to the desired task, and typically requires thousands to hundreds of thousands of labelled examples. Its main disadvantages are the need for a new large dataset for every task, the potential for poor out-of-distribution generalization, and the potential to exploit spurious features of the training data, which can result in an unfair comparison with human performance.
{: .text-justify}

### Few-shot learning

_Few-shot_ refers to the setting where the model is given a few demonstrations of the task at inference time as conditioning, but no weight updates are allowed. _Few-shot learning_ involves learning from a broad distribution of tasks (in this case implicit in the pre-training data) and then rapidly adapting to a new task. The primary goal in traditional few-shot frameworks is to learn a similarity function that measures the similarity between the classes in the support and query sets.
{: .text-justify}

Figure 2.1 <d-cite key="GPT-3"></d-cite> illustrates the different settings. In a typical dataset, an example has a context and a desired completion (for example, an English sentence and its French translation); few-shot works by giving $K$ examples of context and completion, and then one final example of context, with the model expected to provide the completion.
{: .text-justify}
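A sketch of how such a $K$-shot prompt might be assembled as plain text (the `=>` formatting is an illustrative choice; the translation pairs echo the English-to-French example above):
{: .text-justify}

```python
def build_few_shot_prompt(demos: list[tuple[str, str]], query: str) -> str:
    # K demonstrations of "context => completion", then one final context;
    # the frozen model is expected to continue with the completion.
    lines = [f"{context} => {completion}" for context, completion in demos]
    lines.append(f"{query} =>")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    demos=[("sea otter", "loutre de mer"), ("cheese", "fromage")],  # K = 2
    query="peppermint",
)  # fed to the model as conditioning text, with no weight updates
```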

The main advantages of few-shot are a major reduction in the need for task-specific data and a reduced potential to learn an overly narrow distribution from a large but narrow fine-tuning dataset. The main disadvantage is that results from this method have so far been much worse than state-of-the-art fine-tuned models. Also, a small amount of task-specific data is still required.
{: .text-justify}

**GPT-1** employs the idea of unsupervised learning to train word representations on large amounts of unlabeled data, consisting of terabytes of information, and then integrates supervised fine-tuning to improve performance on a wide range of NLP tasks. However, it has drawbacks, including (1) compute requirements (an expensive pre-training step), (2) the limits and bias of learning about the world through text, and (3) still-brittle generalization.
{: .text-justify}

**GPT-2** is a larger model with 1.5 billion parameters following the details of GPT-1 (117 million parameters) with a few modifications<d-footnote>These include pre-normalization, modified initialization, a vocabulary expanded to 50,257, a larger context size of 1024 tokens and a larger batch size of 512.</d-footnote>. This larger size allows it to capture more complex language patterns and relationships. In short, GPT-2 is a direct scale-up of GPT-1, with more than $10\times$ the parameters and trained on more than $10\times$ the amount of data.
{: .text-justify}
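The scale-up can be summarized in a hypothetical config sketch, recording only the figures quoted above (fields left as `None` are not given in the text):
{: .text-justify}

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    n_params: str
    vocab_size: int | None = None
    context_size: int | None = None  # in tokens
    batch_size: int | None = None

gpt1 = GPTConfig(n_params="117M")
gpt2 = GPTConfig(n_params="1.5B", vocab_size=50_257,
                 context_size=1024, batch_size=512)
```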

**GPT-3** uses a variety of techniques to improve performance, including: