
Browsing through ICML 2023 oral papers

[toc]

Paper_1: Specializing Smaller Language Models towards Multi-Step Reasoning

Abstract

1. Introduction

Motivation

Results

For small FlanT5 models (250M, 760M, and 3B): specialization trades generic ability for substantially better multi-step math reasoning.

3. Specializing Multi-Step Reasoning

Distillation from Code-Davinci-002

Aligning tokenizers by dynamic programming

Problem: when matching the two output distributions for distillation, the GPT tokenizer and the T5 tokenizer produce misaligned token sequences.

Solution: align the two token sequences with dynamic programming, minimizing the total cost of editing one sequence into the other (see the sketch below).

Future challenge: aligning sequences produced by different tokenizers is a generic problem in contemporary NLP.
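A minimal sketch (not the authors' implementation) of what such an alignment could look like: edit-distance dynamic programming over two token-string sequences. The function name `align_tokens`, the unit costs, and the example tokens are all illustrative assumptions.

```python
def align_tokens(src, tgt, sub_cost=1, gap_cost=1):
    """Return the minimum edit cost and an alignment between two token lists.

    The alignment is a list of (i, j) index pairs; None marks an
    insertion/deletion on one side.
    """
    n, m = len(src), len(tgt)
    # dp[i][j] = minimum cost to align src[:i] with tgt[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dp[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = dp[i - 1][j - 1] + (0 if src[i - 1] == tgt[j - 1] else sub_cost)
            dp[i][j] = min(match, dp[i - 1][j] + gap_cost, dp[i][j - 1] + gap_cost)
    # Backtrace to recover which positions were matched.
    align, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
            0 if src[i - 1] == tgt[j - 1] else sub_cost
        ):
            align.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + gap_cost:
            align.append((i - 1, None))
            i -= 1
        else:
            align.append((None, j - 1))
            j -= 1
    return dp[n][m], align[::-1]

# Example: the same text split differently by two hypothetical tokenizers.
gpt_tokens = ["The", " answer", " is", " 42"]
t5_tokens = ["The", " answer", " is", " 4", "2"]
cost, alignment = align_tokens(gpt_tokens, t5_tokens)
print(cost, alignment)
```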

4. Experiments

Main idea: the trade-off between math reasoning ability and generic ability.

4.1 Overall Performance Tradeoff

4.2 Scaling Behavior of Smaller Models’ CoT ability

  1. GPT family (ada, babbage, curie, code-davinci-002) and their specialized versions
    • the GPT curve shows a phase change: it stays almost flat at small scales
  2. raw T5 checkpoints and their specialized versions
    • the specialized T5 models exhibit a log-linear scaling curve (while raw T5 stays flat)
  3. the instruction-tuned FlanT5 models of different scales and their specialized versions
    • specialization lifts the FlanT5 log-linear curve upward

→ Training directly on CoT data also lifts the flat scaling curve of the raw T5 checkpoints (Fig. 2B) into a log-linear one.

→ This further suggests that chain-of-thought reasoning may not be an emergent ability marked by a flat-then-phase-change curve; after specialization, smaller models follow a log-linear scaling curve just like large models.
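To make "log-linear scaling curve" concrete, here is a purely illustrative sketch that fits accuracy as a linear function of log10(model size). All numbers are placeholders for illustration only, not results from the paper.

```python
import numpy as np

# Hypothetical model sizes (parameters) and CoT accuracies, for illustration only.
sizes = np.array([250e6, 760e6, 3e9])
acc_specialized = np.array([0.10, 0.15, 0.22])  # placeholder values
acc_raw = np.array([0.02, 0.02, 0.03])          # placeholder flat curve

# Fit accuracy ≈ a * log10(size) + b for the specialized models.
a, b = np.polyfit(np.log10(sizes), acc_specialized, deg=1)
print(f"log-linear fit: accuracy ≈ {a:.3f} * log10(params) + {b:.3f}")
```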

4.3 Specialization Process and Generalization Behaviors

The dynamics of model specialization

In-distribution and out-of-distribution tradeoffs

4.4 Further Analysis of Design Choices

Paper_2: A Watermark for Large Language Models
