PEFT
huyi / August 2023
Paper_1: Parameter-Efficient Fine-Tuning without Introducing New Latency
1. Introduction
What are previous techniques regarding FT?
- full finetuning
- parameter-efficient finetuning
- sparse finetuning: tune a small subset of existing parameters without introducing new parameters:
- BitFit: only tune biases (not scalable ?)
- Diff Pruning
- FISH Mask (learns and updates task-specific masked parameters with a specified sparsity ratio; the task-specific masks make these methods unsuitable for the federated learning setting)
- infused finetuning: introduces new parameters to PLMs, and only updates these parameters during training
- adapter fine-tuning inserts adapters to each layer of the PLM (inference latency)
- prefix tuning (inference latency; my impression is that this effectively reduces the available input length)
- prompt tuning (inference latency; my impression is that this effectively reduces the available input length)
What does this paper introduce?
- propose two methods: PaFi and HiWi
- PaFi: sparse FT, generate task-agnostic masks, mask generation doesn’t require any training on any data
- HiWi: infused FT method that applies the adapters directly to pre-trained weights instead of hidden activations; after training, the adapter is discarded –> no inference latency
2. Preliminaries
introduces sparse FT and adapter FT
Sparse Fine-Tuning
formulates a task-specific finetuning as a two-phase learning problem:
- first phase
determine which of the pre-trained parameters $\theta^{(0)}$ can be modified by generating a sparse mask $m\in \{0,1\}^{|\theta^{(0)}|}$, where 1 denotes a trainable parameter
- BitFit heuristically specifies the bias terms as trainable
- Diff Pruning and LT-SFT fully fine-tune the PLM on the downstream task to obtain $\theta^{(1)}$; the $k$ parameters with the greatest absolute difference $|\theta^{(0)} - \theta^{(1)}|$ are selected for updating in the next phase
- FISH Mask uses the gradient (Fisher) information of $\theta^{(0)}$ to learn the mask
- second phase: the PLM is finetuned and only the masked parameters are updated whereas the others are frozen: \(\theta^{(2)} = \argmin\limits_\theta \mathcal{L}(\mathcal{D}; m\odot\theta)\)
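A minimal sketch of the second phase, assuming plain SGD and masks stored as a name → 0/1-tensor dict (not the paper's code):

```python
import torch

def sparse_finetune_step(model, masks, loss, lr=1e-4):
    """One plain-SGD step in which only entries with mask == 1 are updated.

    `masks` maps parameter name -> 0/1 tensor of the same shape, produced in
    the first phase (e.g. by BitFit, Diff Pruning, FISH Mask, or PaFi below).
    """
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            mask = masks.get(name)
            if p.grad is None or mask is None:   # parameters without a mask stay frozen
                continue
            p.grad.mul_(mask)                    # zero the gradients of frozen entries
            p.add_(p.grad, alpha=-lr)            # SGD update on the trainable subset
    model.zero_grad()
```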
Adapter Fine-Tuning
insert one or multiple small MLP modules into each layer of the PLM, with $W_{down}\in\mathbb{R}^{d\times r}$, $W_{up}\in\mathbb{R}^{r\times d}$, $r\ll d$
\[h\leftarrow h + f(hW_{down})W_{up}\]
Adapter finetuning inserts the module in between or after the original modules (i.e., it inserts new layers).
In contrast: LoRA: $h\leftarrow h W + hW_{down}W_{up}$
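A minimal sketch contrasting the two forms above; the bottleneck size, ReLU, and initialization are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Sequential adapter: h <- h + f(h W_down) W_up (extra module at inference)."""
    def __init__(self, d, r):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)
        self.up = nn.Linear(r, d, bias=False)
        nn.init.zeros_(self.up.weight)                 # start as an identity mapping

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

class LoRALinear(nn.Module):
    """LoRA: h <- h W + h W_down W_up (no nonlinearity, so the delta can be merged)."""
    def __init__(self, d, r):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d, d), requires_grad=False)  # frozen W
        nn.init.xavier_uniform_(self.weight)
        self.down = nn.Parameter(torch.randn(d, r) * 0.01)
        self.up = nn.Parameter(torch.zeros(r, d))

    def forward(self, h):
        return h @ self.weight + (h @ self.down) @ self.up

    def merge(self):
        """After training, fold W_down W_up into W -> no extra inference latency."""
        with torch.no_grad():
            self.weight += self.down @ self.up
```

Because LoRA has no nonlinearity, `merge()` can fold $W_{down}W_{up}$ into the frozen weight after training; the sequential adapter cannot be folded (see Key Challenges below).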
A Unified Framework
\[\hat{\phi} = \argmin\limits_{\phi}\mathcal{L}(\mathcal{D}; z\odot \phi)\]
Key Challenges
Limitations of sparse finetuning:
- the generation of the sparse mask is task-specific, whereas adapter FT always has fixed positions for its adapters
- One needs some tricks to generate these masks, which may require more computation than normal full finetuning for the same number of iterations
Limitations of adapter finetuning:
- typically introduce additional inference latency
- LoRA: one cannot apply a nonlinear function in the LoRA adapter, since \(hW + f(hW_{down})W_{up}\neq h(W+f(W_{down}W_{up}))\)
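A quick numerical check of this non-mergeability claim (dimensions, seed, and ReLU are arbitrary choices):

```python
import torch

torch.manual_seed(0)
d, r = 8, 2
h = torch.randn(1, d)
W = torch.randn(d, d)
W_down, W_up = torch.randn(d, r), torch.randn(r, d)
f = torch.relu

lhs = h @ W + f(h @ W_down) @ W_up        # nonlinear adapter applied to the activation
rhs = h @ (W + f(W_down @ W_up))          # attempt to fold it into the weight
print(torch.allclose(lhs, rhs))           # False: the nonlinear adapter cannot be merged
```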
3. Methodologies
two methods: one for sparse FT that generates a universal mask for various tasks without any training, one for adapter finetuning that has no inference latency and requires even less storage than BitFit
3.1 Task-Agnostic Mask Generation
Insight: we should fix the parameters important for pre-training and only tune the unimportant ones
options for choosing trainable parameters:
- train on pre-training data and choose the parameters with the smallest Fisher information: \(\hat{F}_{\theta} = \frac{1}{N}\sum\limits_{i=1}^{N}\big(\nabla_{\theta}\log p(y_i\mid x_i)\big)^2\)
- dataless (magnitude-based): choose the pre-trained parameters with the smallest absolute magnitude as the unimportant ones
PaFi: Pruning-and-Finetuning $\rightarrow$ magnitude-based
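A minimal sketch of the dataless option; restricting the mask to weight matrices and the 0.5% sparsity default are my assumptions, not necessarily the paper's exact recipe:

```python
import torch

def pafi_mask(model, sparsity=0.005):
    """Task-agnostic mask: the `sparsity` fraction of pre-trained weights with the
    smallest absolute magnitude is marked trainable (1); everything else is frozen (0)."""
    mags = {n: p.detach().abs() for n, p in model.named_parameters() if p.dim() > 1}
    flat = torch.cat([m.flatten() for m in mags.values()])
    k = max(1, int(sparsity * flat.numel()))
    threshold = torch.kthvalue(flat, k).values          # global magnitude cut-off
    return {n: (m <= threshold).float() for n, m in mags.items()}
```

The returned dict can be plugged into the `sparse_finetune_step` sketch above.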
3.2 Adapter for Pre-trained Parameters
- the harder the task, the more parameters need to be added: the number of added parameters correlates with the complexity of the downstream task. For example, one can add 0.5% of the PLM parameters to achieve the same performance as full fine-tuning on the GLUE benchmark, while 4% is required for machine translation
- hence the inference latency for complex tasks is large
- proposed method: \(W\leftarrow W + f(WW_{down})W_{up}\)
if we neglect the nonlinear function, $h$ becomes $hW(I+W_{down}W_{up})$
- LoRA: $\Delta W = W_{down}W_{up}$
- HiWi: $\Delta W = WW_{down}W_{up}$
Insight: \(rank(WW_{down}W_{up})\leq \min(rank(W), rank(W_{down}W_{up})) = rank(W_{down}W_{up})\), where the min equals $rank(W_{down}W_{up})$ since $r\ll d$
HiWi: Hides (throws away) the Weights from adapters
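A minimal sketch of the idea for one linear layer; the class name, ReLU, and initialization are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiWiLinear(nn.Module):
    """Adapter applied to the pre-trained weight itself: W <- W + f(W W_down) W_up.
    After training, `merge()` bakes the delta into W and the adapter is thrown away,
    so inference runs a plain linear layer with zero extra latency."""
    def __init__(self, linear: nn.Linear, r: int):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad = False          # the pre-trained W is frozen
        d_out, d_in = linear.weight.shape
        self.down = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.up = nn.Parameter(torch.zeros(r, d_in))

    def effective_weight(self):
        W = self.linear.weight                            # (d_out, d_in)
        return W + torch.relu(W @ self.down) @ self.up    # Delta W = f(W W_down) W_up

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.linear.bias)

    def merge(self):
        with torch.no_grad():
            self.linear.weight.copy_(self.effective_weight())
        return self.linear                                # adapter parameters are discarded
```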
4. Experimental Setup
5. Results and Discussion
Paper_2: INTRINSIC DIMENSIONALITY EXPLAINS THE EFFECTIVENESS OF LANGUAGE MODEL FINE-TUNING
1. Introduction
QUESTION: Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples?
Angle of attack: intrinsic dimensionality
What does this paper propose?
- empirically show that standard pre-trained models can learn a large set of NLP tasks with very few parameters and that the process of pre-training itself implicitly minimizes the intrinsic dimension of later tuning for different NLP tasks
- the number of parameters is strongly inversely correlated with the intrinsic dimensionality. Why inversely correlated? –> interpret pre-training as providing a framework that learns how to compress the average NLP task
- generalization bound
2. Intrinsic Dimensionality of Finetuning
background
\(\theta^D = \theta_{0}^D + P(\theta^d)\) where $P:\mathbb{R}^d\rightarrow\mathbb{R}^D$.
P in Li et al (2018):
- random linear dense projection $\theta^d W$
- random linear sparse projection $\theta^dW_{sparse}$
- fastfood transform: \(\theta^D =\theta_0^D + \theta^d M,\quad M = HG\Pi HB\), where $H$ is a Hadamard matrix, $G$ a random diagonal matrix with standard normal entries, $\Pi$ a random permutation matrix, and $B$ a random diagonal matrix with equal-probability $\pm 1$ entries
structure-aware intrinsic dimension (SAID)
\(\theta_i^D = \theta_{0,i}^D + \lambda_i P(\theta^{(d-m)})\)
with $m$ layers, $\theta^d$ becomes $[\theta^{(d-m)},\lambda]$, where $\lambda\in\mathbb{R}^m$ contains one scaling factor per layer
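A minimal sketch of the reparameterization with a dense random projection; the paper uses the Fastfood transform instead for memory efficiency, and `torch.func.functional_call` plus per-parameter projection blocks are implementation assumptions:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class IntrinsicDimWrapper(nn.Module):
    """theta^D = theta_0^D + P(theta^d): only the d-dimensional vector theta^d is trained."""
    def __init__(self, model: nn.Module, d: int):
        super().__init__()
        self.model = model
        self.theta_d = nn.Parameter(torch.zeros(d))       # low-dimensional trainable vector
        self.theta_0 = {n: p.detach().clone() for n, p in model.named_parameters()}
        # one fixed random projection block per parameter (dense here, Fastfood in the paper)
        self.proj = {n: torch.randn(d, p.numel()) / d ** 0.5
                     for n, p in model.named_parameters()}
        for p in model.parameters():
            p.requires_grad = False

    def forward(self, *args, **kwargs):
        params = {n: self.theta_0[n] + (self.theta_d @ self.proj[n]).view_as(p)
                  for n, p in self.model.named_parameters()}
        return functional_call(self.model, params, args, kwargs)
```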
3. Intrinsic Dimensionality of Common NLP Tasks
3.1 sentence prediction
3.2 analysis
- incredibly low dimensionality of viable solutions
- RoBERTa consistently outperforms BERT across various subspace dimensions d while having more parameters
- adding a notion of structure is beneficial
4. Intrinsic Dimension, Pre-training, and Generalization Gap
One interpretation of the intrinsic parameter vector is that it encodes the task at hand with respect to the original pre-trained representations
$d$ as the minimal description length of the task within the framework dictated by the pre-trained representations
–> pretraining is implicitly lowering the intrinsic dimensionality of NLP tasks
4.1 Pre-Training Intrinsic Dimension Trajectory
the intrinsic dimensionality of RoBERTa-Base monotonically decreases as we continue pre-training –>
- large scale training of Masked Language Models (MLM) learns generic and distributed enough representations of language to facilitate downstream learning of highly compressed task representations
4.2 Parameter Count and Intrinsic Dimension
- the more parameters the model has, the fewer dimensions $d$ we need to represent a task
- Within the same window of number of parameters, pre-training methodology becomes essential. For example, in the regime of $10^8$ parameters, the RoBERTa method of pre-training dominates similar sized pre-training methods.
4.3 Generalization Bounds Through Intrinsic Dimension
- hypothesis: generalization improves as the intrinsic dimension decreases
the different data points here are the pre-training update steps from 4.1
4.3.1 generalization bound
- the generalization bound is independent of the underlying parameter count $D$ of the pre-trained model
- but depends on the intrinsic dimension $d$, i.e., how well the downstream task can be compressed –> larger models compress better, so larger pre-trained models generalize better