Language models can explain neurons in language models

Source

https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html

Introduction

This paper applies automation to the problem of scaling an interpretability technique to all the neurons in a large language model. Our hope is that building on this approach of automating interpretability will enable us to comprehensively audit the safety of models before deployment.

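The goal is to scale automated interpretability to every neuron in a large language model, so that a model can be safety-audited before deployment.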

Our technique seeks to explain what patterns in text cause a neuron to activate. It consists of three steps:

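Explain: given a neuron's activation on every token of a text, use GPT-4 to explain the neuron's activations.

Simulate: use GPT-4, conditioned on the explanation, to simulate the activations.

Score: score the explanation by comparing the simulated activations with the real ones.

This makes it possible to automatically measure a quantitative notion of interpretability,

called here the explanation score.

The score measures how well a language model can compress and reconstruct neuron activations using natural language. Because the framework is quantitative, it can track progress toward making neural network computations understandable to humans.

Why, though? What is special about this method, and why is it related to interpretability? What are the activations here? Are they the important words?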

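Three ways to improve the score:

Iterating on explanations: ask the model to propose counterexamples, then revise the explanation based on them.

Using a better model to produce the explanations.

Using a better model to simulate the activations from the explanation.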

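Even so, both the human-written and the GPT-4 explanations still score poorly in absolute terms. Looking at the neurons themselves, the typical neuron appears polysemantic (it responds to several unrelated things), which suggests changing what is being explained.

I do not follow why the poor scores lead to looking at the neurons. How was the polysemanticity observed? And why does polysemanticity mean that what is being explained should change?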

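In preliminary experiments, two things were tried:

Changing the model architecture: replacing the activation function gave better scores.

Finding more explainable directions: a preliminary investigation of optimizing the score directly showed that linear combinations of neurons with good explanations can be found.

I did not understand this part either.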

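The method was applied to all MLP neurons in GPT-2 XL. More than 1,000 neurons have explanations scoring at least 0.8, meaning that, according to GPT-4, the explanations account for most of each neuron's top-activating behavior. The explanations were used to build new user interfaces for understanding models, for example to quickly see which neurons activate on a particular dataset example and what those neurons do.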

Methods

Setting

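The method involves several models:

subject model: the model being studied.

explainer model: the model that proposes hypotheses about the subject model's behavior.

simulator model: the model that makes predictions based on those hypotheses.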

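The paper starts from the simplest case:

relating properties of the input text to activation values.

For the intermediate activations, the paper looks specifically at neurons in the MLP layers. "Activation" here means the MLP post-activation value a = f(W_in · x + b), where f is the nonlinear activation function (GELU in the case of GPT-2).

A hypothesized explanation for a neuron is the claim that the neuron activates on tokens with a certain property, where the property may include the previous tokens as context.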
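To make "MLP post-activation value" concrete, here is a minimal sketch in plain Python. It assumes the usual MLP input projection followed by GELU; the function names, the toy weights, and the use of the exact erf form of GELU are illustrative assumptions (implementations often use a tanh approximation instead), not code from the paper.

```python
import math


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def mlp_post_activations(x, w_in, b_in):
    """Return one post-activation value per MLP neuron for a single token.

    x    : input vector for the token (length d_model)
    w_in : one weight row per neuron (n_neurons x d_model)
    b_in : one bias per neuron (length n_neurons)
    """
    return [
        gelu(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_in, b_in)
    ]


# Toy example: 2-dimensional input, 3 neurons (all numbers are made up).
x = [1.0, -2.0]
w_in = [[0.5, 0.0], [0.0, 1.0], [1.0, 1.0]]
b_in = [0.0, 0.0, 1.0]
acts = mlp_post_activations(x, w_in, b_in)
```

Each entry of `acts` is the quantity that an explanation tries to describe for one neuron on one token. Running the same computation over every token of a text gives the per-token activation record used in the steps that follow.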

Overall algorithm

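Concretely, the algorithm is:

Explain: generate an explanation of the neuron's behavior by showing the explainer model (token, activation) pairs from the neuron's responses to text excerpts.

Simulate: use the explanation from the previous step to simulate the neuron's activations.

Score: score the explanation by how well the simulated activations match the real ones.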

Step 1: Generate explanations of the neuron's behavior

Give the model a prompt with few-shot examples,

and ask it for an explanation of what the neuron is doing.

Prompt:

We're studying neurons in a neural network. Each neuron looks for some particular thing in a short document. Look at the parts of the document the neuron activates for and summarize in a single sentence what the neuron is looking for. Don't list examples of words.

The activation format is token<tab>activation. Activation values range from 0 to 10. A neuron finding what it's looking for is represented by a non-zero activation value. The higher the activation value, the stronger the match.

examples:

Neuron 1
Activations:
<start>
the	0
sense	0
of	0
together	3
ness	7
in	0
our	0
town	1
is	0
strong	0
.	0
<end>
<start>
[prompt truncated …]
<end>

Same activations, but with all zeros filtered out:
<start>
together	3
ness	7
town	1
<end>
<start>
[prompt truncated …]
<end>

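Every neuron has activations on a huge number of tokens; surely running through all of them would blow up?

Small takeaways:

Activation values are presented on a scale from 0 to 10.

After the full list of (token, activation) pairs, the prompt repeats the list with the zero-activation entries removed, which helps the model focus on the relevant tokens.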
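The two views of an activation record described above (the full list, then the list with zeros filtered out) can be sketched as follows. The scaling rule used here (divide by a maximum activation, clip negatives to 0, round to an integer) is an assumption for illustration, as are the function names and the raw activation values.

```python
def normalize_activations(activations, max_activation):
    """Map raw activations to integers in 0..10; negatives are clipped to 0."""
    if max_activation <= 0:
        return [0 for _ in activations]
    return [
        min(10, max(0, round(10 * a / max_activation)))
        for a in activations
    ]


def format_record(tokens, activations, drop_zeros=False):
    """Render one excerpt in the token<tab>activation format of the prompt."""
    lines = ["<start>"]
    for token, activation in zip(tokens, activations):
        if drop_zeros and activation == 0:
            continue  # the second view keeps only non-zero activations
        lines.append(f"{token}\t{activation}")
    lines.append("<end>")
    return "\n".join(lines)


tokens = ["the", "sense", "of", "together", "ness", "in",
          "our", "town", "is", "strong", "."]
# Made-up raw activations; 3.0 stands in for the neuron's maximum activation.
raw = [0.0, -0.4, 0.0, 0.9, 2.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.0]
scaled = normalize_activations(raw, max_activation=3.0)

full_view = format_record(tokens, scaled)
filtered_view = format_record(tokens, scaled, drop_zeros=True)
```

With these toy numbers, `filtered_view` contains only `together`, `ness`, and `town`, matching the filtered example in the prompt. Only a handful of excerpts per neuron would be formatted this way, not every token the neuron has ever seen.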

Step 2: Simulate the neuron's behavior using the explanations

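This section aims to answer the following question:

Supposing the proposed explanation accurately and completely captures the neuron's behavior, how would the neuron activate on each token of a particular sequence?

The chosen approach: use the simulator model, conditioned on the explanation, to simulate the subject model's neuron activation for each token.

Honestly, I did not understand any of this.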

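In practice:

As before, a prompt is given, and the simulator model is asked to output an integer from 0 to 10 for each token of the subject model's input. For each position, the expected value is computed from the probabilities assigned to the numbers 0 to 10. The resulting simulated activation therefore lies in the range [0, 10].

Why does it say "each position"? The only way I can read it is that there are several subject models producing several sequences?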
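The expected-value step can be sketched as below, reading "position" as the index of a token within one text sequence (so there is one simulated value per token, not one per model). The probability lists are made up, and `expected_activation` is a hypothetical helper name; how the probabilities are read out of the simulator model is not shown.

```python
def expected_activation(probs):
    """Expected value of the predicted integer, given probs[k] = P(value == k).

    probs must have 11 entries, one for each integer 0..10.
    """
    if len(probs) != 11:
        raise ValueError("expected one probability for each integer 0..10")
    total = sum(probs)
    if total <= 0:
        raise ValueError("probabilities must have positive total mass")
    # Renormalize in case the 11 probabilities do not sum exactly to 1.
    return sum(k * p for k, p in enumerate(probs)) / total


# One probability list per token position in a two-token sequence (made up).
per_position_probs = [
    [1.0] + [0.0] * 10,                                  # certain the value is 0
    [0.5, 0, 0, 0.25, 0, 0, 0, 0.25, 0, 0, 0],           # mass on 0, 3 and 7
]
simulated = [expected_activation(p) for p in per_position_probs]
```

Here `simulated` holds one value in [0, 10] per token position. Taking the expectation instead of the single most likely integer gives a continuous value, which is why the result is a range rather than a set of 11 integers.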

Step 3: Score the explanations by comparing the simulated and actual neuron behavior

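Compute two lists of values: the activations simulated from the explanation across several text excerpts, and the real neuron's actual activations on the same excerpts.

s indicates the simulated activation given the explanation

a indicates the true activation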
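The notes above do not record how the two lists are compared. One natural choice, assumed here, is the Pearson correlation between simulated and true activations, which does not care that the two lists live on different scales (0 to 10 versus the neuron's raw scale). The function name and the numbers are illustrative.

```python
import math


def pearson_correlation(simulated, actual):
    """Pearson correlation between simulated and true activations."""
    n = len(simulated)
    if n != len(actual) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_s = sum(simulated) / n
    mean_a = sum(actual) / n
    cov = sum((s - mean_s) * (a - mean_a) for s, a in zip(simulated, actual))
    var_s = sum((s - mean_s) ** 2 for s in simulated)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_s == 0 or var_a == 0:
        return 0.0  # a constant list carries no information about the other
    return cov / math.sqrt(var_s * var_a)


# Made-up values for six tokens: simulated is on the 0-10 scale,
# actual is on the neuron's own raw scale.
simulated = [0.0, 0.0, 3.0, 7.0, 0.0, 1.0]
actual = [0.0, 0.1, 0.9, 2.1, 0.0, 0.3]
score = pearson_correlation(simulated, actual)
```

A score near 1 means the explanation lets the simulator reproduce where and how strongly the neuron fires; a score near 0 means the simulation is uninformative.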

VALIDATING AGAINST ABLATION SCORING

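Another approach is to replace the neuron's real activation with the simulated activation and check whether the model's output is preserved.

To measure the size of the change, compute the Jensen-Shannon divergence between the output logprobs of the perturbed and the original model, averaged across all tokens.

As a baseline, a second perturbation is run in which the neuron's activation is replaced by its mean across all tokens.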
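The divergence computation can be sketched as follows. The code assumes the output distributions have already been obtained from three runs of the model (original, neuron replaced by its simulated activation, neuron replaced by its mean activation); the toy distributions and all names are made up for illustration, and running the model itself is not shown.

```python
import math


def js_divergence(logprobs_p, logprobs_q):
    """Jensen-Shannon divergence (in nats) between two distributions,
    each given as a list of log-probabilities over the same vocabulary."""
    p = [math.exp(lp) for lp in logprobs_p]
    q = [math.exp(lp) for lp in logprobs_q]
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def mean_js_divergence(original, perturbed):
    """Average the per-token JS divergence over all token positions."""
    per_token = [js_divergence(o, p) for o, p in zip(original, perturbed)]
    return sum(per_token) / len(per_token)


def to_logprobs(rows):
    return [[math.log(p) for p in row] for row in rows]


# Toy output distributions: 2 token positions, vocabulary of 3 tokens.
original = to_logprobs([[0.70, 0.20, 0.10], [0.10, 0.80, 0.10]])
with_simulated = to_logprobs([[0.68, 0.21, 0.11], [0.12, 0.78, 0.10]])
with_mean = to_logprobs([[0.40, 0.30, 0.30], [0.30, 0.40, 0.30]])

change_simulated = mean_js_divergence(original, with_simulated)
change_mean = mean_js_divergence(original, with_mean)
```

If the explanation is good, `change_simulated` should be clearly smaller than `change_mean`: swapping in the simulated activation should disturb the output less than swapping in the uninformative mean.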