Pruning GLU Layers in LLMs: Cut Costs and Speed Up in 30 Minutes

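Hello to everyone in the cross-border business community! Today we're getting into some hands-on technical material: a step-by-step guide to slimming down large language models (LLMs) so they stay smart while running faster and using fewer resources. That is a real cost and efficiency win, and it matters a lot for anyone deploying AI applications in cross-border business.

You've probably heard that LLMs keep getting more capable, but they also keep getting bigger, and their compute requirements rise with them. Think of a luxury sports car: superb performance, alarming fuel consumption. In cross-border work, efficiency and cost control are everything, so making these large models small and lean has become a pressing problem.

Quantization and pruning are the two most common techniques for slimming a model. Quantization lowers the numerical precision used to represent the data, while pruning removes the less important parts of the model outright. Pruning tends to have the more visible effect, but it also demands more skill and a better understanding of the model: knowing where to cut, and how, without doing serious damage is a discipline of its own.

Today we'll focus on an efficient structured width-pruning method aimed specifically at MLP layers built on gated linear units (GLU). Many mainstream models, including Llama 3.2, Gemma, Mistral, and Qwen, rely heavily on this structure. With this method you can shrink a model substantially while keeping its output coherent, and even retain solid accuracy on a task like BoolQ. Follow along and we'll turn the theory into practice!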

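What is pruning, and how does it affect the model?

As mentioned above, pruning means removing the parts of a model that contribute least to its final output. By selectively removing these non-critical components, the goal is a leaner model with fewer parameters and lower compute requirements whose core capabilities stay largely intact.

The central challenge is deciding where to cut. Different parts of a model do different jobs and matter to different degrees. To make this concrete, let's look at the internal structure of Llama 3.2-1B.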

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (up_proj): Linear(in_features=2048, out_features=8192, bias=False)
          (down_proj): Linear(in_features=8192, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)

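The structure shows three main blocks: the embedding layers, the self-attention mechanism, and the multilayer perceptron (MLP) layers. Before choosing a pruning target, we need to know how much of the model each block accounts for and what cutting into it might do.

A closer look at the parameter distribution

Let's do the math and count the parameters in each block:

Embedding and output layers (embed_tokens, lm_head): each has $128256 \times 2048 \approx 263$ million parameters, so the two together hold about $525$ million.
Self-attention (self_attn): 16 layers, each containing four projection sublayers. Each layer has $2048 \times (2048 + 512 + 512 + 2048) \approx 10.5$ million parameters, or about $168$ million across all 16 layers.
MLP layers (mlp): also 16 layers. Because of the GLU structure, each contains gate_proj, up_proj, and down_proj, for $2048 \times 8192 + 2048 \times 8192 + 8192 \times 2048 \approx 50$ million parameters per layer, or about $805$ million across all 16 layers.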
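To double-check the figures above, the counts can be reproduced with a few lines of Python. This is a minimal sketch: the dimensions are read off the printed architecture, the constant and function names are made up for illustration, and normalization weights are ignored. It counts embed_tokens and lm_head as two separate matrices, as the list above does; if a checkpoint shares those weights, the stored total would be smaller.

```python
# Dimensions read off the Llama 3.2-1B printout above.
VOCAB, HIDDEN, KV_DIM, INTERMEDIATE, LAYERS = 128256, 2048, 512, 8192, 16

def count_params(intermediate=INTERMEDIATE):
    """Return weight counts per block (no biases, norm layers ignored)."""
    embed = 2 * VOCAB * HIDDEN                                    # embed_tokens + lm_head
    attn = LAYERS * HIDDEN * (HIDDEN + KV_DIM + KV_DIM + HIDDEN)  # q, k, v, o projections
    mlp = LAYERS * 3 * HIDDEN * intermediate                      # gate, up, down projections
    return {"embeddings": embed, "attention": attn, "mlp": mlp}

counts = count_params()
total = sum(counts.values())
for name, n in counts.items():
    print(f"{name:>10}: {n / 1e6:7.1f}M  ({n / total:.1%} of the counted weights)")
```

Under these assumptions the MLP blocks come out at a little over half of the counted weights, which is why they are the natural place to look for savings.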

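Impact analysis: where to cut, and how?

By this count the MLP layers hold more than 50% of the parameters, which makes them the obvious pruning target. Before cutting, though, we need to understand how each part affects the model's behavior:

Embedding layers: These turn the input into vector representations the model can work with. Prune them and the model may no longer "understand" certain words, or may grasp word meanings less well. Unless you want a dedicated model that handles only a specific domain's vocabulary (finance or medicine, say), it's generally best to leave them alone.
Attention mechanism: This is what lets the model focus on the relevant information. It computes importance scores between the words of the input sequence, helping the model capture context. Pruning it hurts tasks that need a broad understanding of context, such as summarization and translation, and makes generated text much less coherent.
MLP layers: These work alongside attention, expanding and then contracting the data to help the model capture more complex patterns. Pruning them limits how well the model generalizes to unseen data or to tasks it wasn't trained on. Put simply, the model becomes more likely to go off track when it meets a new problem.

Once the pruning target is settled, the next decision is between width pruning (removing individual neurons) and depth pruning (removing entire layers).

As you can see, pruning a model is no simple one-size-fits-all cut; it involves a good number of trade-offs and decisions. You have to consider not only what the pruned model can do, but also how well it can be trained further. After all, a base model usually has to be fine-tuned before it serves a specific business scenario well.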

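Inside the gated linear unit (GLU)

The gated linear unit (GLU) architecture is very common in modern large language models such as Llama. At its core, GLU introduces an element-wise gating mechanism that lets the model selectively filter and control the flow of information. The architecture is built from layers that work together, such as the gate_proj, up_proj, and down_proj we saw earlier, to expand and contract the data. This lets the model handle more complex patterns while staying efficient.

Because the layers inside a GLU block are tightly coupled, pruning them takes extra care. If you remove a neuron from gate_proj, the corresponding neuron in its partner up_proj must go too, and the input size of down_proj has to be adjusted to match. The key point: when you compute neuron importance to decide what stays and what goes, you must evaluate each gate_proj/up_proj pair of neurons together. Upset the balance between these layers and performance degrades badly, or the model collapses entirely, even if only a small fraction of neurons was pruned.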
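The coupling can be seen on a toy example. The sketch below is plain Python with nested lists standing in for weight matrices (no PyTorch), and all names and values are invented for illustration. It assumes the block computes down_proj(act_fn(gate_proj(x)) * up_proj(x)) with SiLU as act_fn, matching the layers in the printout. Neuron 1 is given all-zero up_proj weights so that removing it consistently from all three matrices leaves the output unchanged, while removing different neurons from gate_proj and up_proj does not.

```python
import math

def silu(x):
    return x / (1.0 + math.exp(-x))

def glu_mlp(x, gate_w, up_w, down_w):
    """down(silu(gate(x)) * up(x)); gate_w/up_w are [intermediate][hidden],
    down_w is [hidden][intermediate], i.e. the nn.Linear weight layout."""
    gate = [sum(w * xi for w, xi in zip(row, x)) for row in gate_w]
    up = [sum(w * xi for w, xi in zip(row, x)) for row in up_w]
    hidden = [silu(g) * u for g, u in zip(gate, up)]
    return [sum(w * h for w, h in zip(row, hidden)) for row in down_w]

def prune(gate_w, up_w, down_w, keep):
    """Keep the same neuron indices in all three matrices."""
    return ([gate_w[i] for i in keep],
            [up_w[i] for i in keep],
            [[row[i] for i in keep] for row in down_w])

x = [1.0, 2.0]
gate_w = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.8]]
up_w = [[1.0, 0.2], [0.0, 0.0], [0.4, -0.6]]   # neuron 1 contributes nothing
down_w = [[0.3, 9.0, -0.5], [1.1, -4.0, 0.2]]

full = glu_mlp(x, gate_w, up_w, down_w)
consistent = glu_mlp(x, *prune(gate_w, up_w, down_w, keep=[0, 2]))

# Mismatched cut: gate_proj and down_proj drop neuron 1, but up_proj drops neuron 2.
g, _, d = prune(gate_w, up_w, down_w, keep=[0, 2])
mismatched = glu_mlp(x, g, [up_w[0], up_w[1]], d)

print("full:      ", full)
print("consistent:", consistent)
print("mismatched:", mismatched)
```

In a real model no neuron contributes exactly zero, so even a consistent cut changes the output somewhat; the point of the toy is only that a mismatched cut pairs gate and up activations that were never trained together.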

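Hands-on: pruning a Llama 3.2 model (GLU)

Next, we'll walk through the pruning process using a Llama model. The same code has also been tested successfully on Gemma and Qwen models. The complete code is in my GitHub repository; here we show only the code for the core pruning logic and leave out some helper functions. The notebook also includes code for evaluating the model and uploading it to the Hugging Face Hub, if you're interested.

Before you start pruning, a small piece of advice from my own hard-won experience: run a simple prompt through the original model first and save the result, for example "Paris is the capital of". After pruning you can then compare quickly and see whether the new model still generates coherent text. I can tell you plainly that on my first attempt I did not respect the GLU structure, and the text the model produced was gibberish; the problem was obvious at a glance.

Here is the original model's output next to the output from my first, careless attempt:

Base model: "Paris is the capital of France and one of the most visited cities in the world. It is a city of art, culture, fashion, and gastronomy. The city has a rich history and is home to many famous landmarks, including the E."
First attempt (only 20% pruned): "Paris is the capital of of France. This is the the the the main the area of. This is the the the the the the the the the the the the the the the the the the the city of the the France of the of the of the of."

See? The first attempt clearly went wrong. A check like this may look trivial, but this kind of empirical sanity check can save you hours of debugging.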

Implementation steps

We'll start with the function that computes neuron importance, which determines which neurons stay and which are removed.

import torch
from torch import nn

def compute_neuron_pair_importance(gate_weight, up_weight):
    """
    Compute neuron pair importance scores (Maximum Absolute Weight).
    Args:
        - gate_weight: Weight matrix from the gate_proj layer.
        - up_weight: Weight matrix from the up_proj layer.
    Returns:
        - importance_scores: Importance scores for each neuron pair.
    """
    # Per neuron (row): largest weight plus the magnitude of the smallest weight.
    gate_max_abs = torch.max(gate_weight, dim=1).values + torch.abs(torch.min(gate_weight, dim=1).values)
    up_max_abs = torch.max(up_weight, dim=1).values + torch.abs(torch.min(up_weight, dim=1).values)
    importance_scores = gate_max_abs + up_max_abs
    return importance_scores

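This function takes the weights of the gate_proj and up_proj layers. As noted earlier, they work as a pair, so neuron importance has to be computed for the two together. The logic is simple: for each neuron, take its largest weight and add the absolute value of its smallest weight. Both positive and negative values count because, in theory, the more extreme a weight is (in either direction), the more it affects the model's output. Special thanks to Mariusz Kurman, a contributor from Poland, who suggested adding the minimum term; the function worked without it, but results improved once it was included.

The function computes this value separately for each of the two layers, then returns their sum as the combined importance score.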
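To see what the score looks like on concrete numbers, here is a small worked example. It is a plain-Python mirror of the same formula, written with nested lists so it runs without PyTorch; the function name and the weight values are invented for illustration.

```python
def neuron_pair_importance(gate_rows, up_rows):
    """Per neuron: (max + |min|) of its gate row plus (max + |min|) of its up row."""
    def spread(row):
        return max(row) + abs(min(row))
    return [spread(g) + spread(u) for g, u in zip(gate_rows, up_rows)]

# Two neurons, three input features each (one row per neuron).
gate = [[0.1, -0.2, 0.05],   # neuron 0: small weights
        [1.5, -2.0, 0.3]]    # neuron 1: large weights
up = [[0.0, 0.1, -0.1],
      [0.7, -0.4, 0.2]]

scores = neuron_pair_importance(gate, up)
print(scores)  # neuron 1 scores far higher than neuron 0
```

Neuron 0 scores roughly (0.1 + 0.2) + (0.1 + 0.1) = 0.5 and neuron 1 roughly (1.5 + 2.0) + (0.7 + 0.4) = 4.6, so neuron 0 would be the first candidate for removal.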

The next function creates the new, smaller layers that will be grafted into the model in place of the originals.

#Prunes a specific percentage of neurons from the MLP (feed forward layers).
def prune_neuron_pairs(mlp, prune_percent):
    """
    Reduces the dimensions of the **gate_proj**, **up_proj**, **down_proj** layers removing the least important neurons.
    Args:
        - mlp: MLP block to prune.
        - prune_percent: Fraction of neurons to prune (e.g. 0.2 for 20%).
    Returns:
        - new_gate_proj, new_up_proj, new_down_proj: New pruned layers.
        - k: New intermediate size.
    """
    # Extract the weights from the MLP layers
    # these weights are used to calculate each neuron's
    # importance score in the next step.
    gate_weight = mlp.gate_proj.weight.data.float()
    up_weight = mlp.up_proj.weight.data.float()

    #Compute importance scores. Neurons with higher importance scores
    # are considered more important and less likely to be pruned.
    importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)

    #Store the original number of neurons in the intermediate layer.
    original_intermediate_size = gate_weight.size(0)

    #Computes the number of neurons to prune.
    num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)

    #Calculate the number of neurons to keep. The new intermediate size.
    k = original_intermediate_size - num_neuron_pairs_to_prune

    #Just check that there is no big error calculating k. We can't prune all the neurons.
    if k <= 0:
        raise ValueError(f"Invalid number of neuron pairs to keep: {k}. Adjust the prune_percent.")

    #Select the neurons to keep, by obtaining the indices to keep.
    _, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
    indices_to_keep = indices_to_keep.sort().values

    #create the new layers on the same device as the original weights
    device = mlp.gate_proj.weight.device
    new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
    new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
    new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)

    #copy weights to the new layers.
    new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
    new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
    new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

    #return new layers and intermediate size.
    return new_gate_proj, new_up_proj, new_down_proj, k

This function is a bit more involved. It takes an MLP block and a pruning ratio, and calls compute_neuron_pair_importance to decide which neurons to keep. Let's go through it step by step:

# Extract the weights from the MLP layers
# these weights are used to calculate each neuron's
# importance score in the next step.
gate_weight = mlp.gate_proj.weight.data.float()
up_weight = mlp.up_proj.weight.data.float()

These two lines fetch the weights of the current block's gate_proj and up_proj layers.

importance_scores = compute_neuron_pair_importance(gate_weight, up_weight)

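We now have a tensor holding the importance score computed for each neuron pair. These scores tell us which neurons contribute more to the final output and should be kept.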

#Store the original number of neurons in the intermediate layer.
original_intermediate_size = gate_weight.size(0)

#Computes the number of neurons to prune.
num_neuron_pairs_to_prune = min(int(prune_percent * original_intermediate_size), original_intermediate_size - 1)

#Calculate the number of neurons to keep. The new intermediate size.
k = original_intermediate_size - num_neuron_pairs_to_prune

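Here, from the pruning percentage passed in and the layer's original size, we compute the number of neurons to keep, which is the new intermediate size k. Since gate_proj and up_proj have the same size, reading it from one of them is enough.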
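The arithmetic is easy to check in isolation. The helper below simply mirrors the two lines above as a standalone function (its name is made up for this sketch) and applies it to the 8192-wide intermediate layer of Llama 3.2-1B at the pruning ratios discussed in this article.

```python
def kept_neurons(intermediate_size, prune_percent):
    """Same arithmetic as in prune_neuron_pairs: returns (pruned, kept)."""
    pruned = min(int(prune_percent * intermediate_size), intermediate_size - 1)
    return pruned, intermediate_size - pruned

for pct in (0.2, 0.4, 0.6):
    pruned, kept = kept_neurons(8192, pct)
    print(f"prune {pct:.0%}: remove {pruned} neuron pairs, keep {kept}")
```

Note that prune_percent is a fraction between 0 and 1, and that the min(...) cap means at least one neuron always survives, even when 1.0 is passed.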

#Select the neurons to keep, by obtaining the indices to keep.
_, indices_to_keep = torch.topk(importance_scores, k, largest=True, sorted=True)
indices_to_keep = indices_to_keep.sort().values

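These two lines are crucial. We use PyTorch's torch.topk to find the k neurons with the highest importance scores. torch.topk returns them ordered by score from highest to lowest, so we then sort the indices into ascending order, which keeps the surviving neurons in their original relative positions. That is exactly what we need.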
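The select-then-sort step can be illustrated without PyTorch. The sketch below does the same thing with Python's built-in sorted on a short list of made-up scores; the function name is hypothetical and the code is only meant to show the ordering, not to replace torch.topk.

```python
def indices_to_keep(scores, k):
    """Indices of the k highest scores, returned in ascending index order."""
    # Rank indices by score, highest first, and take the top k.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    # Re-sort by position so the kept neurons stay in their original order.
    return sorted(top)

scores = [0.9, 0.1, 0.7, 0.3, 0.8]
keep = indices_to_keep(scores, 3)
print(keep)
```

Ranked by score, the top three indices are 0, 4, 2; after the second sort they become 0, 2, 4, so the rows copied into the new layers appear in the same order as in the original weight matrices.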

With the selected indices in hand, we can create the new layers.

#create the new layers
new_gate_proj = nn.Linear(mlp.gate_proj.in_features, k, bias=False).to(device)
new_up_proj = nn.Linear(mlp.up_proj.in_features, k, bias=False).to(device)
new_down_proj = nn.Linear(k, mlp.down_proj.out_features, bias=False).to(device)

#copy weights to the new layers.
new_gate_proj.weight.data = mlp.gate_proj.weight.data[indices_to_keep, :]
new_up_proj.weight.data = mlp.up_proj.weight.data[indices_to_keep, :]
new_down_proj.weight.data = mlp.down_proj.weight.data[:, indices_to_keep]

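First, three new nn.Linear layers are created, sized according to k. new_gate_proj and new_up_proj keep their input size but have their output size reduced to k. new_down_proj is the opposite: its input size becomes k and its output size stays the same. These new layers start out with freshly initialized weights rather than the trained ones, so the last lines copy over the weights of the selected neurons from the original layers, ensuring that only the "elite" neurons are kept.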

#return new layers and intermediate size.
return new_gate_proj, new_up_proj, new_down_proj, k

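Finally, the function returns the three new layers along with the new intermediate size k.

Now let's see how to iterate over all the model's layers and build the modified model.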

#Iterates through the model layers and applies pruning.
def update_model(model, prune_percent):
    """
    It modifies each mlp layer present in model, to retain only the most important neurons.
    Creating new smaller versions of each layer pruned.
    Args:
        - model: Model to prune.
        - prune_percent: Percentage of neurons to prune.
    Returns:
        - model: New pruned model.
    """
    new_intermediate_size = None
    #loop for each model layer.
    for idx, layer in enumerate(model.model.layers):
        #Since each layer is a LlamaDecoderLayer it contains multiple components:
        # attention, MLP and layer norms. We're targeting the MLP component
        # by accessing layer.mlp.
        mlp = layer.mlp
        #Call prune_neuron_pairs with the MLP block and receive the pruned layers.
        new_gate_proj, new_up_proj, new_down_proj, new_size = prune_neuron_pairs(mlp, prune_percent)

        #Replace the original layers with the pruned layers.
        mlp.gate_proj = new_gate_proj
        mlp.up_proj = new_up_proj
        mlp.down_proj = new_down_proj

        #new_intermediate_size only needs to be set once
        if new_intermediate_size is None:
            new_intermediate_size = new_size

    #Update the model config file.
    model.config.intermediate_size = new_intermediate_size
    return model

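This function is quite straightforward. It takes the model and the pruning percentage as input, iterates over every layer in the model, and pulls out its mlp block. It then calls the prune_neuron_pairs function defined above and replaces the block's original gate_proj, up_proj, and down_proj layers with the new ones it returns.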

#Call prune_neuron_pairs with the MLP block and receive the pruned layers.
new_gate_proj, new_up_proj, new_down_proj, new_size = prune_neuron_pairs(mlp, prune_percent)

#Replace the original layers with the pruned layers.
mlp.gate_proj = new_gate_proj
mlp.up_proj = new_up_proj
mlp.down_proj = new_down_proj

Finally, it updates a key field in the model configuration, intermediate_size, with the value held in new_intermediate_size.

#Update the model config file.
model.config.intermediate_size = new_intermediate_size

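Do not skip this step! If the configuration is not updated to match, the saved model will not work properly, whether on Hugging Face or in a local deployment. Many libraries, including Hugging Face Transformers, rely on model.config to interpret the model's architecture. If the config doesn't match the actual structure, fine-tuning or inference through those libraries may fail.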
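A quick consistency check before saving can catch a stale config. The sketch below is one possible way to do it, not part of the original notebook: find_config_mismatches is a hypothetical helper that only reads model.config.intermediate_size and each layer.mlp.gate_proj.out_features, and the fake_model stand-in exists purely so the example runs without downloading a model.

```python
from types import SimpleNamespace

def find_config_mismatches(model):
    """Return (layer_index, actual_size) for every MLP whose gate_proj width
    differs from model.config.intermediate_size."""
    expected = model.config.intermediate_size
    return [(idx, layer.mlp.gate_proj.out_features)
            for idx, layer in enumerate(model.model.layers)
            if layer.mlp.gate_proj.out_features != expected]

def fake_model(config_size, layer_sizes):
    """Hypothetical stand-in exposing only the attributes the check reads."""
    layers = [SimpleNamespace(mlp=SimpleNamespace(
                  gate_proj=SimpleNamespace(out_features=size)))
              for size in layer_sizes]
    return SimpleNamespace(config=SimpleNamespace(intermediate_size=config_size),
                           model=SimpleNamespace(layers=layers))

stale = fake_model(8192, [6554, 6554])    # layers pruned, config forgotten
synced = fake_model(6554, [6554, 6554])   # config updated as in update_model

print(find_config_mismatches(stale))
print(find_config_mismatches(synced))
```

With a real pruned model, an empty list from this check would indicate that the config and the layer shapes agree.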

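Results: how well did it work?

Using this code I built several slimmed-down models, all of which are publicly available on the Hugging Face Hub. They include:

Three models based on Llama-3.2-1b, with 20%, 40%, and 60% of the neurons in their MLP layers pruned.
One model based on Gemma-2-2B, pruned by 40%.

You can download these models to use directly, or dig into how their architecture changed and compare them with the originals.

Let's look at exactly how the architecture of Llama3.2-1b changes after 20% pruning.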

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(128256, 2048)
    (layers): ModuleList(
      (0-15): 16 x LlamaDecoderLayer(
        (self_attn): LlamaSdpaAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (k_proj): Linear(in_features=2048, out_features=512, bias=False)
          (v_proj): Linear(in_features=2048, out_features=512, bias=False)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): LlamaRotaryEmbedding()
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (up_proj): Linear(in_features=2048, out_features=6554, bias=False)
          (down_proj): Linear(in_features=6554, out_features=2048, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
        (post_attention_layernorm): LlamaRMSNorm((2048,), eps=1e-05)
      )
    )
    (norm): LlamaRMSNorm((2048,), eps=1e-05)
    (rotary_emb): LlamaRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=128256, bias=False)
)

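As you can see, the model structure is unchanged apart from the size of the intermediate layer in the MLP blocks. The output features of gate_proj and up_proj dropped from 8192 to 6554, and the input features of down_proj were adjusted to match. That is exactly what the code does: it resizes these layers precisely while aiming to preserve the model's key capabilities. Checking the arithmetic, 20% of 8192 is about 1638, and 8192 minus 1638 is 6554, so the pruning ratio is accurate.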
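The same arithmetic also tells us how many MLP weights the 20% cut removes. This is a small sketch using the dimensions from the two printouts; the constant and function names are invented for illustration.

```python
HIDDEN, LAYERS = 2048, 16

def mlp_params(intermediate):
    """Weights in gate_proj, up_proj and down_proj across all decoder layers."""
    return LAYERS * 3 * HIDDEN * intermediate

before = mlp_params(8192)
after = mlp_params(6554)
saved = before - after
print(f"MLP weights: {before:,} -> {after:,} ({saved / before:.1%} fewer)")
```

That works out to roughly 161 million fewer MLP weights, with the attention and embedding weights left untouched.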

Now let's see how the pruned model does with our earlier test prompt:

Paris is the capital of France. It is also one of the most beautiful cities in the world. There is so much to see and do in Paris that it is impossible to cover it all in one day. However, there are some things you

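The response is not identical to the original model's, but it is clearly coherent. This suggests the model has kept much of its capability and, more importantly, that a follow-up step such as knowledge distillation or fine-tuning could well recover the lost performance or even improve on it.

Beyond this empirical check, the models were also run through some common benchmarks. Let's look at how different degrees of pruning affect performance.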
[Figure: benchmark results (BoolQ, Lambada) for the models at different pruning levels]

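The chart shows that pruning's impact is rather asymmetric. Performance on BoolQ barely drops: even the model with 40% of its MLP neurons pruned loses only about 2%. On Lambada, however, the decline is dramatic, with accuracy falling by more than 50%. This suggests the model keeps most of its comprehension ability but struggles on tasks that call for more open-ended generation, such as predicting the next word.

BoolQ is a simple task: the model is given a passage and a yes/no question about it. It mainly measures how well the model understands relationships within the text. Lambada, by contrast, asks the model to guess the last word of a passage, a complex language-modeling task that tests the model's overall ability.

These results fit our understanding of what MLP layers do: they do affect the model's ability to generalize and its grasp of complex language patterns.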

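Conclusion

The pruning process was a success. With this GLU-aware approach we can substantially reduce a model's size and resource consumption while keeping most of its core capabilities.

It's worth stressing that these results were obtained right after pruning, without any capability-recovery step such as knowledge distillation or fine-tuning. Pruned models normally go through such steps to improve performance further.

Looking ahead

Pruning still offers plenty of directions to explore. One is the more direct approach of depth pruning, which removes the layers that contribute least to model performance. Another important direction is applying knowledge distillation to these pruned models and evaluating whether their ability to learn new tasks recovers to near the base model's level, especially on the benchmarks where performance dropped sharply.

Smaller, more efficient models are hugely attractive for cross-border businesses: they make it possible to deploy LLM capabilities without heavy infrastructure investment. This work provides a solid foundation for future research and makes these powerful models easier to obtain and deploy.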

References

Martra, P. (2024). Exploring GLU Expansion Ratios: Structured Pruning in Llama-3.2 Models. https://doi.org/10.31219/osf.io/qgxea


Source: 新媒网 https://nmedialink.com/posts/llm-glu-pruning-30min-boost-efficiency.html