PEFT
huyi / August 2023
Paper_1: Parameter-Efficient Fine-Tuning without Introducing New Latency
1. Introduction
What are previous techniques regarding FT?
- full finetuning
- parameter-efficient finetuning
- sparse finetuning: tune a small subset of existing parameters without introducing new parameters:
- BitFit: only tune biases (not scalable ?)
- Diff Pruning
- FISH Mask (learns and updates task-specific masked parameters with a specified sparsity ratio; the task-specific masks make these methods unsuitable for the federated learning setting)
- infused finetuning: introduces new parameters to PLMs, and only updates these parameters during training
- adapter fine-tuning inserts adapters to each layer of the PLM (inference latency)
- prefix tuning (inference latency; my impression is that this effectively reduces the available input length)
- prompt tuning (inference latency; my impression is that this effectively reduces the available input length)
What does this paper introduce?
- propose two methods: PaFi and HiWi
- PaFi: sparse FT, generate task-agnostic masks, mask generation doesn’t require any training on any data
- HiWi: infused FT method that applies the adapters directly to pre-trained weights instead of hidden activations; after training, the adapter is discarded –> no inference latency
2. Preliminaries
introduces sparse FT and adapter FT
Sparse Fine-Tuning
formulates a task-specific finetuning as a two-phase learning problem:
- first phase
determine which of the pre-trained parameters $\theta^{(0)}$ can be modified by generating a sparse mask $m\in \{0,1\}^{|\theta^{(0)}|}$, where 1 denotes a trainable parameter
- BitFit heuristically specifies the bias terms as trainable
- Diff Pruning and LT-SFT fully fine-tune the PLM on the downstream task to obtain $\theta^{(1)}$; the $k$ parameters with the greatest absolute difference $|\theta^{(0)} - \theta^{(1)}|$ are selected for updating in the next phase
- FISH Mask uses the gradient (Fisher) information of $\theta^{(0)}$ to learn the mask
- second phase: the PLM is finetuned and only the masked parameters are updated whereas the others are frozen: \(\theta^{(2)} = \argmin\limits_\theta \mathcal{L}(\mathcal{D}; m\odot\theta)\)
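A minimal sketch of the second phase, assuming plain SGD and masks stored as a name → 0/1-tensor dict (not the paper's code):

```python
import torch

def sparse_finetune_step(model, masks, loss, lr=1e-4):
    """One plain-SGD step in which only entries with mask == 1 are updated.

    `masks` maps parameter name -> 0/1 tensor of the same shape, produced in
    the first phase (e.g. by BitFit, Diff Pruning, FISH Mask, or PaFi below).
    """
    loss.backward()
    with torch.no_grad():
        for name, p in model.named_parameters():
            mask = masks.get(name)
            if p.grad is None or mask is None:   # parameters without a mask stay frozen
                continue
            p.grad.mul_(mask)                    # zero the gradients of frozen entries
            p.add_(p.grad, alpha=-lr)            # SGD update on the trainable subset
    model.zero_grad()
```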
Adapter Fine-Tuning
insert one or multiple small MLP modules into each layer of the PLM, with $W_{down}\in\mathbb{R}^{d\times r}$, $W_{up}\in\mathbb{R}^{r\times d}$, $r\ll d$
\[h\leftarrow h + f(hW_{down})W_{up}\]
Adapter finetuning inserts the module in between or after the original modules (i.e., it inserts new layers).
In contrast: LoRA: $h\leftarrow h W + hW_{down}W_{up}$
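A minimal sketch contrasting the two forms above; the bottleneck size, ReLU, and initialization are illustrative assumptions:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Sequential adapter: h <- h + f(h W_down) W_up (extra module at inference)."""
    def __init__(self, d, r):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)
        self.up = nn.Linear(r, d, bias=False)
        nn.init.zeros_(self.up.weight)                 # start as an identity mapping

    def forward(self, h):
        return h + self.up(torch.relu(self.down(h)))

class LoRALinear(nn.Module):
    """LoRA: h <- h W + h W_down W_up (no nonlinearity, so the delta can be merged)."""
    def __init__(self, d, r):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(d, d), requires_grad=False)  # frozen W
        nn.init.xavier_uniform_(self.weight)
        self.down = nn.Parameter(torch.randn(d, r) * 0.01)
        self.up = nn.Parameter(torch.zeros(r, d))

    def forward(self, h):
        return h @ self.weight + (h @ self.down) @ self.up

    def merge(self):
        """After training, fold W_down W_up into W -> no extra inference latency."""
        with torch.no_grad():
            self.weight += self.down @ self.up
```

Because LoRA has no nonlinearity, `merge()` can fold $W_{down}W_{up}$ into the frozen weight after training; the sequential adapter cannot be folded (see Key Challenges below).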
A Unified Framework
\[\hat{\phi} = \argmin\limits_{\phi}\mathcal{L}(\mathcal{D}; z\odot \phi)\]
Key Challenges
Limitations of sparse finetuning:
- the generation of the sparse mask is task-specific, whereas adapter FT always has fixed positions for its adapters
- One needs some tricks to generate these masks, which may require more computation than normal full finetuning for the same number of iterations
Limitations of adapter finetuning:
- typically introduce additional inference latency
- LoRA: one cannot apply a nonlinear function in the LoRA adapter, since \(hW + f(hW_{down})W_{up}\neq h(W+f(W_{down}W_{up}))\)
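A quick numerical check of this non-mergeability claim (dimensions, seed, and ReLU are arbitrary choices):

```python
import torch

torch.manual_seed(0)
d, r = 8, 2
h = torch.randn(1, d)
W = torch.randn(d, d)
W_down, W_up = torch.randn(d, r), torch.randn(r, d)
f = torch.relu

lhs = h @ W + f(h @ W_down) @ W_up        # nonlinear adapter applied to the activation
rhs = h @ (W + f(W_down @ W_up))          # attempt to fold it into the weight
print(torch.allclose(lhs, rhs))           # False: the nonlinear adapter cannot be merged
```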
3. Methodologies
two methods: one for sparse FT that generates a universal mask for various tasks without any training, one for adapter finetuning that has no inference latency and requires even less storage than BitFit
3.1 Task-Agnostic Mask Generation
Insight: we should fix the parameters important for pre-training and only tune the unimportant ones
options for choosing trainable parameters:
- train on pre-training data and choose the parameters with the smallest Fisher information: \(\hat{F}_{\theta} = \frac{1}{N}\sum\limits_{i=1}^{N}\big(\nabla_{\theta}\log p(y_i\mid x_i)\big)^2\)
- dataless (magnitude-based): choose the pre-trained parameters with the smallest absolute magnitude as the unimportant ones
PaFi: Pruning-and-Finetuning $\rightarrow$ magnitude-based
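A minimal sketch of the dataless option; restricting the mask to weight matrices and the 0.5% sparsity default are my assumptions, not necessarily the paper's exact recipe:

```python
import torch

def pafi_mask(model, sparsity=0.005):
    """Task-agnostic mask: the `sparsity` fraction of pre-trained weights with the
    smallest absolute magnitude is marked trainable (1); everything else is frozen (0)."""
    mags = {n: p.detach().abs() for n, p in model.named_parameters() if p.dim() > 1}
    flat = torch.cat([m.flatten() for m in mags.values()])
    k = max(1, int(sparsity * flat.numel()))
    threshold = torch.kthvalue(flat, k).values          # global magnitude cut-off
    return {n: (m <= threshold).float() for n, m in mags.items()}
```

The returned dict can be plugged into the `sparse_finetune_step` sketch above.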
3.2 Adapter for Pre-trained Parameters
- the harder the task, the more parameters need to be added: the number of added parameters correlates with the complexity of the downstream task. For example, one can add 0.5% of the PLM parameters to achieve the same performance as full fine-tuning on the GLUE benchmark, while 4% is required for machine translation
- hence the inference latency for complex tasks is large
- proposed method: \(W\leftarrow W + f(WW_{down})W_{up}\)
if we neglect the nonlinear function, $h$ becomes $hW(I+W_{down}W_{up})$
- LoRA: $\Delta W = W_{down}W_{up}$
- HiWi: $\Delta W = WW_{down}W_{up}$
Insight: \(rank(WW_{down}W_{up})\leq \min(rank(W), rank(W_{down}W_{up})) = rank(W_{down}W_{up})\), where the min equals $rank(W_{down}W_{up})$ since $r\ll d$
HiWi: Hides (throws away) the Weights from adapters
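A minimal sketch of the idea for one linear layer; the class name, ReLU, and initialization are my assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HiWiLinear(nn.Module):
    """Adapter applied to the pre-trained weight itself: W <- W + f(W W_down) W_up.
    After training, `merge()` bakes the delta into W and the adapter is thrown away,
    so inference runs a plain linear layer with zero extra latency."""
    def __init__(self, linear: nn.Linear, r: int):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad = False          # the pre-trained W is frozen
        d_out, d_in = linear.weight.shape
        self.down = nn.Parameter(torch.randn(d_in, r) * 0.01)
        self.up = nn.Parameter(torch.zeros(r, d_in))

    def effective_weight(self):
        W = self.linear.weight                            # (d_out, d_in)
        return W + torch.relu(W @ self.down) @ self.up    # Delta W = f(W W_down) W_up

    def forward(self, x):
        return F.linear(x, self.effective_weight(), self.linear.bias)

    def merge(self):
        with torch.no_grad():
            self.linear.weight.copy_(self.effective_weight())
        return self.linear                                # adapter parameters are discarded
```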
4. Experimental Setup
5. Results and Discussion
Paper_2: INTRINSIC DIMENSIONALITY EXPLAINS THE EFFECTIVENESS OF LANGUAGE MODEL FINE-TUNING
1. Introduction
QUESTION: Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples?
Angle of attack: intrinsic dimensionality
What does this paper propose?
- empirically show that standard pre-trained models can learn a large set of NLP tasks with very few parameters and that the process of pre-training itself implicitly minimizes the intrinsic dimension of later tuning for different NLP tasks
- the number of parameters is strongly inversely correlated with the intrinsic dimensionality. Why inversely correlated? –> interpret pre-training as providing a framework that learns how to compress the average NLP task
- generalization bound
2. Intrinsic Dimensionality of Finetuning
background
\(\theta^D = \theta_{0}^D + P(\theta^d)\) where $P:\mathbb{R}^d\rightarrow\mathbb{R}^D$.
P in Li et al (2018):
- random linear dense projection $\theta^d W$
- random linear sparse projection $\theta^dW_{sparse}$
- fastfood transform: \(\theta^D =\theta_0^D + \theta^d M,\quad M = HG\Pi HB\), where $H$ is a Hadamard matrix, $G$ a random diagonal matrix with standard normal entries, $\Pi$ a random permutation matrix, and $B$ a random diagonal matrix with equal-probability $\pm 1$ entries
structure-aware intrinsic dimension (SAID)
\(\theta_i^D = \theta_{0,i}^D + \lambda_i P(\theta^{(d-m)})\)
with $m$ layers, $\theta^d$ becomes $[\theta^{(d-m)},\lambda]$, where $\lambda\in\mathbb{R}^m$ contains one scaling factor per layer
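A minimal sketch of the reparameterization with a dense random projection; the paper uses the Fastfood transform instead for memory efficiency, and `torch.func.functional_call` plus per-parameter projection blocks are implementation assumptions:

```python
import torch
import torch.nn as nn
from torch.func import functional_call

class IntrinsicDimWrapper(nn.Module):
    """theta^D = theta_0^D + P(theta^d): only the d-dimensional vector theta^d is trained."""
    def __init__(self, model: nn.Module, d: int):
        super().__init__()
        self.model = model
        self.theta_d = nn.Parameter(torch.zeros(d))       # low-dimensional trainable vector
        self.theta_0 = {n: p.detach().clone() for n, p in model.named_parameters()}
        # one fixed random projection block per parameter (dense here, Fastfood in the paper)
        self.proj = {n: torch.randn(d, p.numel()) / d ** 0.5
                     for n, p in model.named_parameters()}
        for p in model.parameters():
            p.requires_grad = False

    def forward(self, *args, **kwargs):
        params = {n: self.theta_0[n] + (self.theta_d @ self.proj[n]).view_as(p)
                  for n, p in self.model.named_parameters()}
        return functional_call(self.model, params, args, kwargs)
```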
3. Intrinsic Dimensionality of Common NLP Tasks
3.1 sentence prediction
3.2 analysis
- incredibly low dimensionality of viable solutions
- RoBERTa consistently outperforms BERT across various subspace dimensions d while having more parameters
- adding a notion of structure is beneficial
4. Intrinsic Dimension, Pre-training, and Generalization Gap
One interpretation of the intrinsic parameter vector is that it encodes the task at hand with respect to the original pre-trained representations
$d$ as the minimal description length of the task within the framework dictated by the pre-trained representations
–> pretraining is implicitly lowering the intrinsic dimensionality of NLP tasks
4.1 Pre-Training Intrinsic Dimension Trajectory
the intrinsic dimensionality of RoBERTa-Base monotonically decreases as we continue pre-training –>
- large scale training of Masked Language Models (MLM) learns generic and distributed enough representations of language to facilitate downstream learning of highly compressed task representations
4.2 Parameter Count and Intrinsic Dimension
- the more parameters the model has, the fewer dimensions $d$ we need to represent a task
- Within the same window of number of parameters, pre-training methodology becomes essential. For example, in the regime of $10^8$ parameters, the RoBERTa method of pre-training dominates similar sized pre-training methods.
4.3 Generalization Bounds Through Intrinsic Dimension
- hypothesis: generalization improves as the intrinsic dimension decreases
the different data points here are the pre-training update steps from 4.1
4.3.1 generalization bound
- the generalization bound is independent of the underlying parameter count $D$ of the pre-trained model
- but depends on the intrinsic dimension $d$, i.e., how well the downstream task can be compressed –> larger models compress better, so larger pre-trained models generalize better