Paper List:

Physics of Language Models: Part 1, Learning Hierarchical Language Structures

The reasoning mechanism of LLMs has long been a hotly debated topic. Prior work mostly studied simple symbolic tasks such as copying and selection; this paper studies a harder task: context-free grammars (CFGs).

In the end, the following insights emerge:

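A Simple Case: attention heads can perform bracket matching (Can transformers learn to solve problems recursively?)
Knowledge storage: the transformer's MLP layers store knowledge in key-value form (ROME, MEMIT, and related work)
Induction Heads:
- Some transformer heads may capture more abstract features, rather than only token-level matching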
  
  They “hypothesized” that induction heads may exist to “match and copy more abstract and sophisticated linguistic features, rather than precise tokens”, yet they acknowledge that they “don’t have a strong framework for mechanistically understanding” this.
Reverse engineering of logical reasoning: prior work identified attention heads with different functions

Most notably, they explained how GPT2 predicts the next token “Mary” given prefix “When Mary and John went to the store, John gave a drink to […]” This requires some logical reasoning by selecting (not naively copying) what is the right name.

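Models are already very strong, and what we actually care about is often very hard reasoning tasks that may have no clean underlying algorithm, unlike the previously studied tasks such as copying, selection, and sorting. Can we then propose a setting for studying more complex tasks?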

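Synthetic CFGs: the authors construct a set of fairly complex CFG rules.
- In this setting, deciding whether a sentence conforms to the grammar is hard and requires ideas such as dynamic programming (DP).
- The grammars are guaranteed to be locally ambiguous, so the model cannot easily rely on shortcuts.
- The syntax tree can be extended to be very deep.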
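To make the DP point above concrete, here is a minimal sketch of sampling from a CFG and checking membership with the CYK algorithm. The grammar `RULES` is a hypothetical toy in Chomsky normal form, not one of the paper's grammars (which are deeper and use longer rules), and the helper names `sample` and `cyk_accepts` are made up for illustration.

```python
import random

# Hypothetical toy grammar in Chomsky normal form (NOT the paper's grammar).
# Each nonterminal maps to its right-hand sides: either a pair of
# nonterminals or a single terminal symbol.
RULES = {
    "S": [("A", "B"), ("B", "A")],
    "A": [("B", "B"), ("1",)],
    "B": [("A", "A"), ("2",)],
}


def sample(symbol="S", depth=4, rules=RULES, rng=random):
    """Sample a list of terminal tokens derived from `symbol`."""
    if symbol not in rules:  # terminal symbol: emit it as-is
        return [symbol]
    options = rules[symbol]
    terminal_options = [rhs for rhs in options if len(rhs) == 1]
    # Once the depth budget is used up, force a terminal rule when one exists
    # so that the recursion is guaranteed to stop.
    if depth <= 0 and terminal_options:
        options = terminal_options
    rhs = rng.choice(options)
    out = []
    for child in rhs:
        out.extend(sample(child, depth - 1, rules, rng))
    return out


def cyk_accepts(tokens, rules=RULES, start="S"):
    """Return True if `tokens` can be derived from `start` (CYK, O(n^3))."""
    n = len(tokens)
    if n == 0:
        return False
    # table[i][l] = set of nonterminals that derive tokens[i:i+l]
    table = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        for lhs, rhss in rules.items():
            if (tok,) in rhss:
                table[i][1].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split]
                right = table[i + split][length - split]
                for lhs, rhss in rules.items():
                    for rhs in rhss:
                        if len(rhs) == 2 and rhs[0] in left and rhs[1] in right:
                            table[i][length].add(lhs)
    return start in table[0][n]
```

Even in this toy, the same terminal `1` can come from `A` directly or from inside an expansion of `B`, so a single token does not reveal which nonterminal produced it; this is a small-scale version of the local ambiguity described above. For example, `cyk_accepts(list("111"))` holds while `cyk_accepts(list("11"))` does not.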

GPT-style architectures can learn CFGs.
Rotary or relative attention is essential, especially for more complex CFGs.
Some explanations are given via attention patterns and hidden states.
evidence

The authors generate a large corpus of grammatical strings from the CFGs above and pretrain a decoder-only transformer on it. Each terminal symbol is treated as a separate token.

The experiments use GPT2-small (12 layers, 12 heads, 768 dimensions) with the following positional encodings (PE):

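$\text{GPT}_{rel}$: uses relative positional encoding; the relative position embedding is concatenated to the hidden state for the attention computation
$\text{GPT}_{rot}$: uses RoPE
$\text{GPT}_{pos}$: replaces the attention matrix directly with an $A_{i,j}$ that depends only on the relative position of $i$ and $j$; this $A_{i,j}=f(i,j)$ is still trainable
$\text{GPT}_{uni}$: uses a fixed attention matrix; the $h$-th head takes a uniform average over the previous $2^h-1$ tokens (? what does this mean)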
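One possible reading of the $\text{GPT}_{uni}$ description, as a minimal sketch: each head has a fixed, untrained causal attention matrix, and row $i$ puts equal weight on a window of $2^h-1$ positions ending at $i$. The function name `uniform_attention` is hypothetical, and several details are assumptions of this sketch rather than something the notes confirm: heads are indexed from 1, the window includes the current token, and the window is truncated at the start of the sequence.

```python
def uniform_attention(seq_len, head):
    """Fixed causal attention matrix for one head (illustrative only).

    Assumptions (not confirmed by the source): `head` is 1-indexed, the
    window of 2**head - 1 positions includes the current token, and the
    window is truncated at the beginning of the sequence.
    """
    window = 2 ** head - 1
    rows = []
    for i in range(seq_len):
        start = max(0, i - window + 1)
        width = i - start + 1
        row = [0.0] * seq_len
        for j in range(start, i + 1):
            row[j] = 1.0 / width  # equal weight inside the window
        rows.append(row)
    return rows
```

Under these assumptions, head 1 looks only at the current token, head 2 averages over 3 tokens, head 3 over 7, and so on, so different heads aggregate context at geometrically growing scales without any learned attention weights.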

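All of the above models, except the original GPT (which uses absolute positional encoding), can learn the synthetic CFGs: given any prefix, they generate completion strings that meet the following requirements: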

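<font color=#FF6384>💭: How is the model's generalization defined here? In particular, can we tell which level of the hierarchy's rules the model has learned? Are the test sequences guaranteed not to have appeared in the pretraining corpus?</font>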

The experimental results are as follows: fig

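The left panel shows the test accuracy of the different GPT variants on CFGs of varying difficulty, where cut0 means a prompt sequence length of 0 and cut50 means a prompt sequence length of 50.
The middle panel shows the model's generation diversity; the authors argue that this diversity indicates the model did not merely memorize a subset of the CFG during pretraining.
The right panel shows the KL divergence from the true CFG distribution.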

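<font color=#FF6384>💭: Likewise, the current evidence on whether the model relies on memorization does not feel sufficient. One could run intervention experiments, e.g. remove a certain pattern from the pretraining corpus and see whether the model can still learn it.</font>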

To be updated...

Last updated: 2025.6.2