Browsing through ICML 2023 orals
huyi / July 2023
[toc]
Paper_1: Specializing Smaller Language Models towards Multi-Step Reasoning
Abstract
- task:
- distill multi-step reasoning ability from GPT-3.5 to smaller models
- model specialization: specialize small models for specific tasks
- insights:
- generic vs. specialized
- there is a complex balance/tradeoff among a language model's multi-dimensional abilities
- by paying the price of decreased generic ability, we can clearly lift the scaling curve of sub-10B models on specialized multi-step math reasoning
- how to generalize better?
- data format
- the starting model checkpoint
- a new model selection method
1. Introduction
Motivation
- the community is eager to know to what extent CoT reasoning abilities can be further improved in smaller models
Solution
- concentrate the limited capacity of small models (<10B) on specific tasks
- previous work focuses on generic tasks
Results
for small FlanT5 models (250M, 760M, and 3B):
- $\downarrow$ generic performance on the BigBench Hard suite
- $\uparrow$ specialized math reasoning performance
3. Specializing Multi-Step Reasoning
- seed dataset: GSM8K
- math tests: MultiArith, ASDiv, SVAMP
- generic reasoning ability: BigBench Hard (BBH)
- base models: T5 (vanilla), FlanT5 (instruction-tuned)
- code-davinci-002 is used to generate the distillation/specialization data
Distillation from Code-Davinci-002
- use code-davinci-002 to generate training set: 40 new CoT solutions
- four format settings
- in-context answer only (few-shot standard)
- in-context CoT (few-shot CoT)
- zero-shot answer only (zero-shot standard)
- zero-shot CoT (zero-shot CoT)
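The four data-format settings can be illustrated with hypothetical prompt templates (the question and exemplars below are made up for illustration; only the "Let's think step by step" trigger is the standard zero-shot CoT phrasing):

```python
# Hypothetical prompt templates illustrating the four data-format settings.
QUESTION = "If Tom has 3 apples and buys 2 more, how many does he have?"

# A made-up few-shot exemplar, answer-only vs. with a chain-of-thought rationale.
EXEMPLAR_AO = "Q: What is 2 + 2?\nA: The answer is 4.\n\n"
EXEMPLAR_COT = ("Q: What is 2 + 2?\n"
                "A: 2 plus 2 equals 4. The answer is 4.\n\n")

prompts = {
    # In-context settings prepend exemplars; zero-shot settings do not.
    "in-context answer-only": EXEMPLAR_AO + f"Q: {QUESTION}\nA:",
    "in-context CoT":         EXEMPLAR_COT + f"Q: {QUESTION}\nA:",
    "zero-shot answer-only":  f"Q: {QUESTION}\nA: The answer is",
    # The "Let's think step by step" trigger from zero-shot CoT prompting.
    "zero-shot CoT":          f"Q: {QUESTION}\nA: Let's think step by step.",
}
```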
- two types of distillation approaches:
    - sample matching: train the student model on data generated by the teacher; here, directly optimize the student's likelihood on the data generated by code-davinci-002
    - distribution matching (the paper's choice: faster convergence and better performance): minimize the KL divergence between the student's output distribution (in this case, the per-step autoregressive distribution) and the teacher's
        - the OpenAI API only grants access to the 5 most probable tokens at each decoding step, so the probabilities of tokens outside the top 5 are set to zero
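The distribution-matching objective under the top-5 truncation can be sketched as follows (a minimal illustration, not the paper's exact implementation; the data layout, with one `{token_id: probability}` dict per decoding step, is an assumption):

```python
import math

def distribution_matching_loss(student_logprobs, teacher_top5):
    """Per-step KL(teacher || student), with the teacher distribution
    truncated to its 5 most probable tokens and renormalized.

    student_logprobs: list of dicts {token_id: log p_student(token)}, one per step
    teacher_top5:     list of dicts {token_id: p_teacher(token)}, one per step,
                      containing only the teacher's 5 most probable tokens
    """
    total = 0.0
    for step_teacher, step_student in zip(teacher_top5, student_logprobs):
        # Renormalize the truncated teacher distribution so it sums to 1.
        z = sum(step_teacher.values())
        for tok, p in step_teacher.items():
            p_norm = p / z
            # Tokens outside the top 5 carry zero teacher mass, so they
            # contribute nothing to the KL sum and can be skipped entirely.
            total += p_norm * (math.log(p_norm) - step_student[tok])
    return total / len(teacher_top5)
```

A student whose log-probabilities exactly match the renormalized teacher distribution yields zero loss; any mismatch yields a positive value.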
Aligning tokenizers by dynamic programming
- problem: when matching the two distributions, the GPT tokenizer and the T5 tokenizer produce misaligned token sequences
- solution: dynamic programming that minimizes the total cost of editing one token sequence into the other
- future challenge: aligning sequences produced by different tokenizers is a generic problem in contemporary NLP
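The dynamic program can be sketched as a standard edit-distance recurrence over the two token sequences (a minimal sketch assuming unit costs for insertion, deletion, and substitution; the paper's actual cost function may differ):

```python
def align_tokens(a, b):
    """Minimum total cost of editing token sequence a into b, via the
    classic edit-distance dynamic program (unit costs assumed)."""
    n, m = len(a), len(b)
    # dp[i][j] = min cost to edit a[:i] into b[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i          # delete all of a[:i]
    for j in range(m + 1):
        dp[0][j] = j          # insert all of b[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # delete a[i-1]
                dp[i][j - 1] + 1,        # insert b[j-1]
                dp[i - 1][j - 1] + sub,  # match or substitute
            )
    return dp[n][m]
```

For example, the same word split as `["play", "ing"]` by one tokenizer and `["pla", "ying"]` by another has edit cost 2 (two substitutions); a traceback over `dp` would recover which token positions correspond.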
4. Experiments
main idea: tradeoff between math reasoning ability and generic ability
4.1 Overall Performance Tradeoff
- all specialized models suffer a performance drop on BigBench Hard: they lose all CoT prompting ability on BBH, plus a large portion of their answer-only (AO) prompting performance
4.2 Scaling Behavior of Smaller Models’ CoT ability
- GPT family (ada, babbage, curie, code-davinci-002) and their specialized version
- the GPT family shows a phase-change ("emergence") curve that is almost flat at small scale
- raw T5
- Specialized T5 exhibits a log-linear scaling curve
- the instruction-tuned FlanT5 of different scales and their specialized versions
- Specialization lifts up FlanT5's log-linear curve
–> directly training on CoT data can also lift the flat scaling curve of the raw T5 checkpoints (Fig. 2B) to a log-linear one
–> this further suggests that chain-of-thought may not be an emergent ability (one marked by a flat-then-phase-change curve); smaller models can follow a log-linear curve just like large models