Browsing through ICML 2023 orals
huyi / July 2023
[toc]
Paper_1: Specializing Smaller Language Models towards Multi-Step Reasoning
Abstract
- task:
- distill multi-step reasoning ability from GPT-3.5 to smaller models
- model specialization: specialize small models for specific tasks
- insights:
- generic vs. specialized
- there is a complex balance/tradeoff among a language model's multi-dimensional abilities
- by paying the price of decreased generic ability, we can clearly lift the scaling curve of sub-10B models on specialized multi-step math reasoning
- how to generalize better?
- data format
- the starting model checkpoint
- a new model selection method
1. Introduction
Motivation
- the community is eager to know to what extent CoT reasoning abilities can be further improved in smaller models
Solution
- concentrate the limited capacity of small models (<10B) on specific tasks
- previous work focuses on generic tasks
Results
for small FlanT5 models (250M, 760M, and 3B):
- $\downarrow$ generic performance on the BigBench Hard suite
- $\uparrow$ specialized math reasoning performance
3. Specializing Multi-Step Reasoning
- seed dataset: GSM8K
- math tests: MultiArith, ASDiv, SVAMP
- generic reasoning ability: BigBench Hard (BBH)
- base models: T5 (vanilla), FlanT5 (instruction-tuned)
- code-davinci-002 is used to generate the distillation/specialization data
Distillation from Code-Davinci-002
- use code-davinci-002 to generate training set: 40 new CoT solutions
- four format settings
- in-context answer only (few-shot standard)
- in-context CoT (few-shot CoT)
- zero-shot answer only (zero-shot standard)
- zero-shot CoT (zero-shot CoT)
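The four data-format settings can be illustrated with hypothetical prompt templates (the question and exemplars below are made up for illustration; only the "Let's think step by step" trigger is the standard zero-shot CoT phrasing):

```python
# Hypothetical prompt templates illustrating the four data-format settings.
QUESTION = "If Tom has 3 apples and buys 2 more, how many does he have?"

# A made-up few-shot exemplar, answer-only vs. with a chain-of-thought rationale.
EXEMPLAR_AO = "Q: What is 2 + 2?\nA: The answer is 4.\n\n"
EXEMPLAR_COT = ("Q: What is 2 + 2?\n"
                "A: 2 plus 2 equals 4. The answer is 4.\n\n")

prompts = {
    # In-context settings prepend exemplars; zero-shot settings do not.
    "in-context answer-only": EXEMPLAR_AO + f"Q: {QUESTION}\nA:",
    "in-context CoT":         EXEMPLAR_COT + f"Q: {QUESTION}\nA:",
    "zero-shot answer-only":  f"Q: {QUESTION}\nA: The answer is",
    # The "Let's think step by step" trigger from zero-shot CoT prompting.
    "zero-shot CoT":          f"Q: {QUESTION}\nA: Let's think step by step.",
}
```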
- two types of distillation approaches:
    - sample matching: train the student model on data generated by the teacher; here, directly optimize the student's likelihood on the data generated by code-davinci-002
    - distribution matching (the paper's choice: faster convergence and better performance): minimize the KL divergence between the student's output distribution (in this case, the per-step autoregressive distribution) and the teacher's
        - the OpenAI API only grants access to the 5 most probable tokens at each decoding step, so the probabilities of tokens outside the top 5 are set to zero
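The distribution-matching objective under the top-5 truncation can be sketched as follows (a minimal illustration, not the paper's exact implementation; the data layout, with one `{token_id: probability}` dict per decoding step, is an assumption):

```python
import math

def distribution_matching_loss(student_logprobs, teacher_top5):
    """Per-step KL(teacher || student), with the teacher distribution
    truncated to its 5 most probable tokens and renormalized.

    student_logprobs: list of dicts {token_id: log p_student(token)}, one per step
    teacher_top5:     list of dicts {token_id: p_teacher(token)}, one per step,
                      containing only the teacher's 5 most probable tokens
    """
    total = 0.0
    for step_teacher, step_student in zip(teacher_top5, student_logprobs):
        # Renormalize the truncated teacher distribution so it sums to 1.
        z = sum(step_teacher.values())
        for tok, p in step_teacher.items():
            p_norm = p / z
            # Tokens outside the top 5 carry zero teacher mass, so they
            # contribute nothing to the KL sum and can be skipped entirely.
            total += p_norm * (math.log(p_norm) - step_student[tok])
    return total / len(teacher_top5)
```

A student whose log-probabilities exactly match the renormalized teacher distribution yields zero loss; any mismatch yields a positive value.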
Aligning tokenizers by dynamic programming
- problem: when matching the two distributions, the GPT tokenizer and the T5 tokenizer produce misaligned token sequences
- solution: dynamic programming that minimizes the total cost of editing one token sequence into the other
- future challenge: aligning sequences produced by different tokenizers is a generic problem in contemporary NLP
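The dynamic program can be sketched as a standard edit-distance recurrence over the two token sequences (a minimal sketch assuming unit costs for insertion, deletion, and substitution; the paper's actual cost function may differ):

```python
def align_tokens(a, b):
    """Minimum total cost of editing token sequence a into b, via the
    classic edit-distance dynamic program (unit costs assumed)."""
    n, m = len(a), len(b)
    # dp[i][j] = min cost to edit a[:i] into b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # delete a[i-1]
                dp[i][j - 1] + 1,        # insert b[j-1]
                dp[i - 1][j - 1] + sub,  # match or substitute
            )
    return dp[n][m]
```

For example, the same word split as `["play", "ing"]` by one tokenizer and `["pla", "ying"]` by another has edit cost 2 (two substitutions); a traceback over `dp` would recover which token positions correspond.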
4. Experiments
main idea: tradeoff between math reasoning ability and generic ability
4.1 Overall Performance Tradeoff
- all specialized models suffer a performance drop on BigBench Hard: they lose all CoT prompting ability on BBH, plus a large portion of their answer-only (AO) prompting performance
4.2 Scaling Behavior of Smaller Models’ CoT ability
- GPT family (ada, babbage, curie, code-davinci-002) and their specialized version
- the GPT family shows a phase-change ("emergence") curve that is almost flat at small scale
- raw T5
- Specialized T5 exhibits a log-linear scaling curve
- the instruction-tuned FlanT5 of different scales and their specialized versions
- Specialization lifts up FlanT5's log-linear curve
–> directly training on CoT data can also lift the flat scaling curve of the raw T5 checkpoints (Fig. 2B) to a log-linear one
–> this further suggests that chain-of-thought may not be an emergent ability (one marked by a flat-then-phase-change curve); smaller models can follow a log-linear curve just like large models