
PEFT

Paper_1: Parameter-Efficient Fine-Tuning without Introducing New Latency

1. Introduction

What are previous techniques regarding FT?

What does this paper introduce?

2. Preliminaries

introduces sparse FT and adapter FT

Sparse Fine-Tuning

formulates task-specific finetuning as a two-phase learning problem: first generate a sparse mask that selects which pre-trained parameters are trainable, then finetune only the selected parameters

Adapter Fine-Tuning

insert one or more small MLP modules into each layer of the PLM, with $W_{down}\in\mathbb{R}^{d\times r}$, $W_{up}\in\mathbb{R}^{r\times d}$, $r\ll d$

\[h\leftarrow h + f(hW_{down})W_{up}\]

Adapter finetuning inserts the module in between or after the original modules (i.e., it adds new layers to the network).

In contrast: LoRA: $h\leftarrow h W + hW_{down}W_{up}$
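To make the two update rules concrete, here is a minimal PyTorch sketch (my own, not the paper's code) of a bottleneck adapter and a LoRA layer; the sizes `d`, `r` and the ReLU activation are illustrative choices.

```python
import torch
import torch.nn as nn

d, r = 768, 16  # illustrative hidden size and bottleneck size, r << d

class Adapter(nn.Module):
    """Bottleneck adapter: h <- h + f(h W_down) W_up, inserted after an existing sublayer."""
    def __init__(self, d, r):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)  # W_down
        self.up = nn.Linear(r, d, bias=False)    # W_up
        self.act = nn.ReLU()                     # the nonlinearity f

    def forward(self, h):
        return h + self.up(self.act(self.down(h)))

class LoRALinear(nn.Module):
    """LoRA: h <- h W + h W_down W_up; no nonlinearity, so W_down W_up can be merged into W."""
    def __init__(self, d, r):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(d, d) * 0.02)  # frozen pre-trained W
        self.weight.requires_grad = False
        self.down = nn.Parameter(torch.randn(d, r) * 0.02)    # W_down
        self.up = nn.Parameter(torch.zeros(r, d))              # W_up, initialized to zero

    def forward(self, h):
        return h @ self.weight + (h @ self.down) @ self.up

h = torch.randn(2, d)
print(Adapter(d, r)(h).shape, LoRALinear(d, r)(h).shape)  # both torch.Size([2, 768])
```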

A Unified Framework

\[\hat{\phi} = \argmin\limits_{\phi}\mathcal{L}(\mathcal{D}; z\odot \phi)\]
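As a loose illustration of this masked view (my own sketch, not the paper's formulation), one can treat $z$ as a fixed binary mask over the parameters and zero out the update of every non-selected entry, so that only $z\odot\phi$ is effectively trained; the 1% random mask below is a placeholder.

```python
import torch

def masked_sgd_step(params, masks, lr=1e-3):
    """One manual SGD step that updates only the entries where the binary mask is 1."""
    with torch.no_grad():
        for p, z in zip(params, masks):
            if p.grad is not None:
                p -= lr * z * p.grad

w = torch.randn(10, 10, requires_grad=True)
z = (torch.rand_like(w) < 0.01).float()  # placeholder mask: roughly 1% of entries trainable
loss = (w ** 2).sum()                    # stand-in for a downstream loss
loss.backward()
masked_sgd_step([w], [z])
```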

Key Challenges

Limitations of sparse finetuning:

  1. the generation of the sparse mask is task-specific, whereas adapter FT always has fixed positions for the adapters
  2. generating these masks requires extra tricks, which may cost more computation than normal full finetuning for the same number of iterations

Limitations of adapter finetuning:

  1. typically introduce additional inference latency
  2. LoRA: one cannot apply a nonlinear function inside the LoRA adapter, since \(hW + f(hW_{down})W_{up}\neq h(W+f(W_{down}W_{up}))\), i.e., the nonlinear variant can no longer be merged into $W$
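A quick numerical check of point 2 (my own sketch): the linear LoRA update folds exactly into $W$, while inserting a nonlinearity breaks the factorization, so the adapter could no longer be merged at inference time.

```python
import torch

torch.manual_seed(0)
d, r = 8, 2
h = torch.randn(3, d)
W = torch.randn(d, d)
W_down, W_up = torch.randn(d, r), torch.randn(r, d)

# Linear LoRA: identical to the merged weight, hence no extra inference latency.
lora = h @ W + (h @ W_down) @ W_up
merged = h @ (W + W_down @ W_up)
print(torch.allclose(lora, merged, atol=1e-5))          # True

# With a nonlinearity f, the naive merged form no longer matches.
nonlinear = h @ W + torch.relu(h @ W_down) @ W_up
merged_nl = h @ (W + torch.relu(W_down @ W_up))
print(torch.allclose(nonlinear, merged_nl, atol=1e-5))  # False in general
```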

3. Methodologies

The paper proposes two methods: one for sparse FT that generates a universal mask for various tasks without any training, and one for adapter FT that has no inference latency and requires even less storage than BitFit.

3.1 Task-Agnostic Mask Generation

Insight: we should fix the parameters important for pre-training and only tune the unimportant ones

options for choosing trainable parameters:

  1. train on pre-training data and choose the parameters with the least Fisher information, estimated by the empirical diagonal Fisher \(\hat{F}_{\theta} = \frac{1}{N}\sum\limits_{i=1}^{N}\big(\nabla_{\theta}\log p(y_i\mid x_i;\theta)\big)^2\) (see the sketch after this list)
  2. dataless (magnitude-based): choose the pre-trained parameters with the smallest absolute magnitude as the unimportant, and hence trainable, ones
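For option 1, the sketch below (a hypothetical helper, not the paper's code) estimates a per-parameter importance score from the diagonal empirical Fisher by accumulating squared gradients of the log-likelihood; batch gradients are used as a rough stand-in for per-example gradients.

```python
import torch

def empirical_fisher(model, data_loader, n_batches):
    """Diagonal empirical Fisher: average of squared log-likelihood gradients per parameter."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    seen = 0
    for inputs, targets in data_loader:
        model.zero_grad()
        log_probs = torch.log_softmax(model(inputs), dim=-1)
        torch.nn.functional.nll_loss(log_probs, targets).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2   # batch-level approximation
        seen += 1
        if seen >= n_batches:
            break
    return {n: f / max(seen, 1) for n, f in fisher.items()}

# toy usage with a linear classifier and synthetic data
model = torch.nn.Linear(20, 3)
data = [(torch.randn(8, 20), torch.randint(0, 3, (8,))) for _ in range(4)]
scores = empirical_fisher(model, data, n_batches=4)
```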

PaFi: Pruning-and-Finetuning $\rightarrow$ magnitude-based
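A minimal sketch of the magnitude-based option (my reading of it, not necessarily PaFi's exact procedure): mark the pre-trained parameters with the smallest absolute values as trainable and freeze the rest. Because the mask depends only on the pre-trained weights, the same mask can be reused for every downstream task.

```python
import torch

def magnitude_mask(model, trainable_ratio=0.005):
    """Return {name: binary mask} marking the smallest-magnitude entries as trainable."""
    all_mags = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = max(1, int(trainable_ratio * all_mags.numel()))
    threshold = all_mags.kthvalue(k).values   # global magnitude threshold
    return {n: (p.detach().abs() <= threshold).float()
            for n, p in model.named_parameters()}

# toy usage: roughly 5% of a small linear layer's parameters become trainable
masks = magnitude_mask(torch.nn.Linear(20, 3), trainable_ratio=0.05)
```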

3.2 Adapter for Pre-trained Parameters


HiWi: Hides (throws away) the Weights of the adapters after training, so nothing extra remains at inference
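Based on the section title (adapters applied to pre-trained parameters) and the no-latency claim, here is a hedged sketch of the idea as I read it: the adapter transforms a pre-trained parameter tensor (a bias here) instead of hidden states, the result is written back into the model after training, and the adapter itself is discarded, so inference sees only the original architecture.

```python
import torch
import torch.nn as nn

class ParamAdapter(nn.Module):
    """Sketch: b' = b + f(b W_down) W_up applied to a parameter vector, not to hidden states."""
    def __init__(self, d, r):
        super().__init__()
        self.down = nn.Linear(d, r, bias=False)
        self.up = nn.Linear(r, d, bias=False)
        self.act = nn.ReLU()

    def forward(self, b):
        return b + self.up(self.act(self.down(b)))

d, r = 768, 16
b = torch.randn(d)            # stand-in for a frozen pre-trained bias
adapter = ParamAdapter(d, r)
b_new = adapter(b)            # computed once after training; b_new replaces b,
print(b_new.shape)            # and the adapter is thrown away before inference
```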

4. Experimental Setup

result_1

result_2

5. Result and Discussion

Paper_2: INTRINSIC DIMENSIONALITY EXPLAINS THE EFFECTIVENESS OF LANGUAGE MODEL FINE-TUNING

1. Introduction

QUESTION: Why can we use relatively vanilla gradient descent algorithms (e.g., without strong regularization) to tune a model with hundreds of millions of parameters on datasets with only hundreds or thousands of labeled examples?

Angle of attack: intrinsic dimensionality

What does this paper propose?

  1. empirically show that standard pre-trained models can learn a large set of NLP tasks with very few parameters and that the process of pre-training itself implicitly minimizes the intrinsic dimension of later tuning for different NLP tasks
  2. the number of model parameters is strongly inversely correlated with intrinsic dimensionality (why inverse? –> interpret pre-training as providing a framework that learns how to compress the average NLP task)
  3. generalization bound

2. Intrinsic Dimensionality of Finetuning

background

\(\theta^D = \theta_{0}^D + P(\theta^d)\) where $P:\mathbb{R}^d\rightarrow\mathbb{R}^D$.

P in Li et al (2018):

  1. random linear dense projection $\theta^d W$
  2. random linear sparse projection $\theta^dW_{sparse}$
  3. fastfood transform \(\theta^D =\theta_0^D + \theta^d M, M = HG\Pi HB\)

    $H$: Hadamard matrix; $G$: random diagonal matrix with standard normal entries; $B$: random diagonal matrix with equal-probability $\pm 1$ entries; $\Pi$: random permutation matrix
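A small sketch of option 1 (the dense random projection), assuming nothing beyond the formula above: all $D$ original parameters stay frozen, only the $d$-dimensional vector $\theta^d$ is trained, and $P(\theta^d)=\theta^d W$ is added back onto the flattened parameter vector. The fastfood transform plays the same role while avoiding the $d\times D$ matrix.

```python
import torch

torch.manual_seed(0)
D, d = 10_000, 50                              # illustrative sizes
theta0 = torch.randn(D)                        # frozen pre-trained parameters, flattened
W = torch.randn(d, D) / d ** 0.5               # fixed random dense projection
theta_d = torch.zeros(d, requires_grad=True)   # the only trainable parameters

def full_params():
    return theta0 + theta_d @ W                # theta^D = theta_0^D + P(theta^d)

target = torch.randn(D)                        # toy objective standing in for a downstream loss
opt = torch.optim.SGD([theta_d], lr=0.1)
for _ in range(100):
    opt.zero_grad()
    ((full_params() - target) ** 2).mean().backward()
    opt.step()
```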

structure-aware intrinsic dimension (SAID)

\(\theta_i^D = \theta_{0,i}^D + \lambda_i P(\theta^{(d-m)})\)

with $m$ layers, $\theta^d$ becomes $[\theta^{d-m},\lambda]$, where $\lambda\in\mathbb{R}^m$ scales the projected update of each layer
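Extending the sketch above to the structure-aware variant: the last $m$ entries of $\theta^d$ act as per-layer scales $\lambda_i$ on the shared projected offset. The equal-sized hypothetical layers below are only for illustration.

```python
import torch

torch.manual_seed(0)
m, layer_size = 4, 2_500                    # four hypothetical layers of equal size
D, d = m * layer_size, 50
theta0 = torch.randn(D)
W = torch.randn(d - m, D) / (d - m) ** 0.5
theta_d = torch.cat([torch.zeros(d - m), torch.ones(m)]).requires_grad_(True)  # [theta^{d-m}, lambda]

def full_params():
    base, lam = theta_d[: d - m], theta_d[d - m:]           # split the intrinsic vector
    offset = (base @ W).view(m, layer_size)                 # shared projected offset, per layer
    return theta0 + (lam.unsqueeze(1) * offset).flatten()   # scale layer i by lambda_i

print(full_params().shape)  # torch.Size([10000])
```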

3. Intrinsic Dimensionality of Common NLP Tasks

3.1 sentence prediction

measures $d_{90}$, the smallest $d$ that reaches 90% of the full-finetuning performance, for sentence prediction tasks (reported as a table in the paper)
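A tiny sketch of how such a $d_{90}$ could be located, assuming a hypothetical `eval_at(d)` hook that finetunes in a $d$-dimensional subspace and returns the resulting metric (the sweep grid and the placeholder curve are made up):

```python
def d90(eval_at, full_score, candidates=(10, 100, 1_000, 10_000, 100_000)):
    """Smallest subspace dimension whose score reaches 90% of full finetuning."""
    for d in sorted(candidates):
        if eval_at(d) >= 0.9 * full_score:
            return d
    return None

fake_eval = lambda d: min(1.0, d / 5_000)   # placeholder accuracy curve
print(d90(fake_eval, full_score=0.9))       # 10000 with this fake curve
```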

3.2 analysis

  1. incredibly low dimensionality of viable solutions
  2. RoBERTa consistently outperforms BERT across various subspace dimensions d while having more parameters
  3. adding a notion of structure is beneficial

4. Intrinsic Dimension, Pre-training, and Generalization Gap

One interpretation of the intrinsic parameter vector is that it encodes the task at hand with respect to the original pre-trained representations

$d$ as the minimal description length of the task within the framework dictated by the pre-trained representations

–> pretraining is implicitly lowering the intrinsic dimensionality of NLP tasks

4.1 Pre-Training Intrinsic Dimension Trajectory


the intrinsic dimensionality of RoBERTa-Base monotonically decreases as we continue pre-training –> pre-training itself implicitly minimizes the intrinsic dimension of downstream NLP tasks
