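Tune LLM Decoding Strategies Fast: Double Your Cross-Border Content Efficiency in 30 Minutes!

Hello, cross-border e-commerce practitioners!

When people talk about large language models (LLMs), the first things that come to mind are usually model architectures, data processing at scale, and optimization algorithms. Today we want to look at an often-overlooked player that shapes every piece of generated text: the text generation strategy, also known as the decoding strategy.

As 新媒网跨境 sees it, if the model itself is like a cross-border seller who has already picked products and stocked the shelves, then the decoding strategy is how those products get recommended to customers, right down to telling a brand story that moves people. It determines how the model "speaks" and how good the result is.

In this article we walk through the mainstream text generation strategies: greedy search, beam search, and sampling with top-k and nucleus sampling. By the end you should understand how each one works and how to tune the key parameters, namely temperature, num_beams, top_k, and top_p, so the model serves your cross-border business better. 新媒网 has also prepared the hands-on code from this article so you can try it yourself.

1. Background: How Does an LLM Actually "Speak"?

Let's start with an example. If we feed the text "I have a dream" to a GPT-2 model and ask it to generate the next five tokens (words or subwords), it may return "I have a dream of being a doctor."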

from transformers import GPT2LMHeadModel, GPT2Tokenizer
import torch

device = 'cuda' if torch.cuda.is_available() else 'cpu'
model = GPT2LMHeadModel.from_pretrained('gpt2').to(device)
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model.eval()

text = "I have a dream"
input_ids = tokenizer.encode(text, return_tensors='pt').to(device)
outputs = model.generate(input_ids, max_length=len(input_ids.squeeze())+5)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

Generated text: I have a dream of being a doctor.

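Does it look as if GPT-2 simply "wrote" that sentence? Not quite. A common misconception is that LLMs produce text directly. What they actually do is assign a score to every token (word or subword) in their vocabulary. These scores are called logits.

In short, the process looks like this:
Image by author.

First, the tokenizer (byte-pair encoding in this case) converts the text "I have a dream" into a sequence of token IDs. GPT-2 then takes these IDs as input and predicts the next most likely token.

Finally, the model outputs logits, and a softmax function turns them into probabilities, which are easier to interpret. For example, the model might assign a 17% probability to "of" being the next token. The output is essentially a ranked list of candidate next tokens. In mathematical terms, the conditional probability of "of" given "I have a dream" is 17%.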
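To make the logits-to-probabilities step concrete, here is a minimal sketch in plain Python. The four candidate tokens and their logits are made-up numbers chosen for illustration, not real GPT-2 output, and the helper name `softmax` is our own.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    # Subtracting the max keeps exp() numerically stable without changing the result.
    m = max(logits.values())
    exps = {tok: math.exp(score - m) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

# Hypothetical logits for four candidate next tokens (illustrative values only).
toy_logits = {" of": 2.0, " that": 1.2, ".": 0.7, " job": 0.1}
probs = softmax(toy_logits)

# Print the candidates from most to least likely.
for tok, p in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{tok!r}: {p:.1%}")
```

The ranking of the tokens is the same before and after the softmax; the function only rescales the scores so they can be read as probabilities.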

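Autoregressive models like GPT predict the next token in a sequence based on all the tokens that came before it. At every step, GPT-2 computes this conditional probability for each of the 50,257 tokens in its vocabulary.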
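In standard notation, this autoregressive idea is the chain rule of probability. The symbols below are our own shorthand (w_1 to w_t are the tokens of a sequence w); taking logarithms turns the product into a sum, which is the form the sequence scores later in the article rely on.

```latex
P(w) = P(w_1, \dots, w_t) = \prod_{i=1}^{t} P(w_i \mid w_1, \dots, w_{i-1})
\qquad\Longrightarrow\qquad
\log P(w) = \sum_{i=1}^{t} \log P(w_i \mid w_1, \dots, w_{i-1})
```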

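That raises the question: given these probabilities, how do we get the model to produce the text we want? This is where text generation strategies come in.

2. Greedy Search: The Straightforward Choice

Greedy search is the simplest and most direct generation strategy. Its core idea fits in one sentence: at every step, pick the token with the highest probability as the next token. In cross-border terms, it is like always choosing the product with the best margin right now, the one that looks "safest", without thinking about the long-term impact.

Back to the "I have a dream" example:

Step 1: input "I have a dream" → most likely token: "of"
Step 2: input "I have a dream of" → most likely token: "being"
Step 3: input "I have a dream of being" → most likely token: "a"
Step 4: input "I have a dream of being a" → most likely token: "doctor"
Step 5: input "I have a dream of being a doctor" → most likely token: "."

This approach is intuitive and fast because it does not need to track multiple sequences, but it has a serious weakness: it is short-sighted. It only optimizes each step locally and ignores the overall effect. A token may have the highest probability right now and still steer the sentence in a worse direction, so the final text ends up lower in quality.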
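The short-sightedness is easy to reproduce on a tiny example. The sketch below uses a hypothetical two-step "model" (a lookup table of invented probabilities, not GPT-2) in which the locally best first token leads to a worse sentence overall; `TOY_MODEL`, `greedy_decode`, and `all_sequences` are names made up for this illustration.

```python
# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.3, "y": 0.3, "z": 0.4},
    ("B",): {"x": 0.9, "y": 0.1},
}

def greedy_decode(model, steps):
    """Pick the single most likely token at every step."""
    prefix, prob = (), 1.0
    for _ in range(steps):
        dist = model[prefix]
        token = max(dist, key=dist.get)
        prob *= dist[token]
        prefix += (token,)
    return prefix, prob

def all_sequences(model, steps, prefix=(), prob=1.0):
    """Enumerate every possible sequence with its total probability."""
    if steps == 0:
        yield prefix, prob
        return
    for token, p in model[prefix].items():
        yield from all_sequences(model, steps - 1, prefix + (token,), prob * p)

greedy_seq, greedy_prob = greedy_decode(TOY_MODEL, 2)
best_seq, best_prob = max(all_sequences(TOY_MODEL, 2), key=lambda sp: sp[1])

print(f"greedy: {greedy_seq} with probability {greedy_prob:.2f}")
print(f"best:   {best_seq} with probability {best_prob:.2f}")
```

Greedy commits to "A" because 0.6 beats 0.4, yet the sequence that starts with "B" ends up more probable overall. Exhaustive enumeration only works on a toy like this; with a real vocabulary the number of sequences explodes.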

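Next, let's see greedy search in action with code and a diagram. We select the ID with the highest score, compute its log probability (logs make the arithmetic easier), and add it to the generation tree. We repeat this five times.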

import matplotlib.pyplot as plt
import networkx as nx
import numpy as np
import time

def get_log_prob(logits, token_id):
    # Compute the softmax of the logits
    probabilities = torch.nn.functional.softmax(logits, dim=-1)
    log_probabilities = torch.log(probabilities)
    # Get the log probability of the token
    token_log_probability = log_probabilities[token_id].item()
    return token_log_probability

def greedy_search(input_ids, node, length=5):
    if length == 0:
        return input_ids
    outputs = model(input_ids)
    predictions = outputs.logits
    # Get the predicted next sub-word (greedy search: take the argmax)
    logits = predictions[0, -1, :]
    token_id = torch.argmax(logits).unsqueeze(0)
    # Compute the score of the predicted token
    token_score = get_log_prob(logits, token_id)
    # Add the predicted token to the list of input ids
    new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0)], dim=-1)

    # Add node and edge to graph
    next_token = tokenizer.decode(token_id, skip_special_tokens=True)
    current_node = list(graph.successors(node))[0]
    graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100
    graph.nodes[current_node]['token'] = next_token + f"_{length}"

    # Recursive call
    input_ids = greedy_search(new_input_ids, current_node, length-1)
    return input_ids

# Parameters
length = 5
beams = 1

# Create a balanced tree with height 'length'
graph = nx.balanced_tree(1, length, create_using=nx.DiGraph())

# Add 'tokenscore' and 'token' attributes to each node
for node in graph.nodes:
    graph.nodes[node]['tokenscore'] = 100
    graph.nodes[node]['token'] = text

# Start generating text
output_ids = greedy_search(input_ids, 0, length=length)
output = tokenizer.decode(output_ids.squeeze().tolist(), skip_special_tokens=True)
print(f"Generated text: {output}")

Generated text: I have a dream of being a doctor.

Our greedy search produces the same text as the transformers library: "I have a dream of being a doctor." Now let's plot the generation process to make it easier to see:

import matplotlib.pyplot as plt
import networkx as nx
import matplotlib.colors as mcolors
from matplotlib.colors import LinearSegmentedColormap

def plot_graph(graph, length, beams, score):
    fig, ax = plt.subplots(figsize=(3+1.2*beams**length, max(5, 2+length)), dpi=300, facecolor='white')

    # Create positions for each node
    pos = nx.nx_agraph.graphviz_layout(graph, prog="dot")

    # Normalize the colors along the range of token scores
    if score == 'token':
        scores = [data['tokenscore'] for _, data in graph.nodes(data=True) if data['token'] is not None]
    elif score == 'sequence':
        scores = [data['sequencescore'] for _, data in graph.nodes(data=True) if data['token'] is not None]

    vmin = min(scores)
    vmax = max(scores)
    norm = mcolors.Normalize(vmin=vmin, vmax=vmax)
    cmap = LinearSegmentedColormap.from_list('rg', ["r", "y", "g"], N=256)

    # Draw the nodes
    nx.draw_networkx_nodes(graph, pos, node_size=2000, node_shape='o', alpha=1, linewidths=4, node_color=scores, cmap=cmap)

    # Draw the edges
    nx.draw_networkx_edges(graph, pos)

    # Draw the labels
    if score == 'token':
        labels = {node: data['token'].split('_')[0] + f"\n{data['tokenscore']:.2f}%" for node, data in graph.nodes(data=True) if data['token'] is not None}
    elif score == 'sequence':
        labels = {node: data['token'].split('_')[0] + f"\n{data['sequencescore']:.2f}" for node, data in graph.nodes(data=True) if data['token'] is not None}
    nx.draw_networkx_labels(graph, pos, labels=labels, font_size=10)
    plt.box(False)

    # Add a colorbar
    sm = plt.cm.ScalarMappable(cmap=cmap, norm=norm)
    sm.set_array([])
    if score == 'token':
        fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label='Token probability (%)')
    elif score == 'sequence':
        fig.colorbar(sm, ax=ax, orientation='vertical', pad=0, label='Sequence score')
    plt.show()

# Plot graph
plot_graph(graph, length, 1.5, 'token')

Image by author.

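In this graph, the top node is the input text (with a probability of 100%), and the nodes below it are the generated tokens. Although every token in the sequence was the most likely one at the time it was chosen, "being" and "doctor" received relatively low probabilities of 9.68% and 2.86%. This suggests that "of", the first token chosen, may not have been the best choice, because it led to some fairly unlikely follow-up tokens.

In the next section, we look at how beam search addresses this problem and helps generate better text.

3. Beam Search: Planning Ahead

Unlike greedy search, which only looks at the next most likely token, beam search keeps the N most likely candidates at each step, where N is the number of beams (num_beams). This repeats until the maximum length or an end-of-sequence token is reached. At that point, the sequence (the "beam") with the highest overall score is chosen as the output.

As 新媒网跨境 understands it, beam search is like cross-border market analysis done properly: instead of betting on a single bestseller, you track several promising products at once, weigh their performance and potential, and then commit to the best overall option.

We can adapt the previous function to consider the N most likely tokens instead of just one. Here we maintain a sequence score, log P(w), which is the cumulative sum of the log probabilities of every token in the beam. Because each additional token adds a negative log probability, the raw sum favors shorter sequences, so we normalize the score by the sequence length (the normalization factor can be adjusted to suit your needs).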
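Before the GPT-2 version, here is a minimal sketch of beam search on a hypothetical lookup-table "model" with invented probabilities. Note one deliberate difference: this sketch prunes to the best num_beams prefixes at every step, as standard beam search does, whereas the article's function expands the full tree so it can be plotted. The score is normalized by the number of generated tokens only, which is also an assumption of this sketch.

```python
import math

# Hypothetical next-token probabilities, keyed by the tokens generated so far.
TOY_MODEL = {
    (): {"A": 0.6, "B": 0.4},
    ("A",): {"x": 0.3, "y": 0.3, "z": 0.4},
    ("B",): {"x": 0.9, "y": 0.1},
}

def beam_search_toy(model, steps, num_beams):
    """Keep the num_beams best partial sequences at every step."""
    beams = [((), 0.0)]  # (prefix, cumulative log probability)
    for _ in range(steps):
        candidates = []
        for prefix, cum_logp in beams:
            for token, p in model[prefix].items():
                candidates.append((prefix + (token,), cum_logp + math.log(p)))
        # Prune: keep only the highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    # Length-normalized sequence score: average log probability per token.
    return [(seq, cum / len(seq)) for seq, cum in beams]

one_beam = beam_search_toy(TOY_MODEL, steps=2, num_beams=1)
two_beams = beam_search_toy(TOY_MODEL, steps=2, num_beams=2)
print("num_beams=1:", one_beam[0])
print("num_beams=2:", two_beams[0])
```

With a single beam the procedure reduces to greedy search and ends on ("A", "z"). With two beams the prefix "B" survives the first step, and the more probable sequence ("B", "x") wins.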

Once again, we generate five additional tokens to complete "I have a dream".

from tqdm.notebook import tqdm

def greedy_sampling(logits, beams):
    return torch.topk(logits, beams).indices

def beam_search(input_ids, node, bar, length, beams, sampling, temperature=0.1):
    if length == 0:
        return None
    outputs = model(input_ids)
    predictions = outputs.logits
    # Get the predicted next sub-word (here we use top-k search)
    logits = predictions[0, -1, :]

    if sampling == 'greedy':
        top_token_ids = greedy_sampling(logits, beams)
    elif sampling == 'top_k':
        top_token_ids = top_k_sampling(logits, temperature, 20, beams)
    elif sampling == 'nucleus':
        top_token_ids = nucleus_sampling(logits, temperature, 0.5, beams)

    for j, token_id in enumerate(top_token_ids):
        bar.update(1)
        # Compute the score of the predicted token
        token_score = get_log_prob(logits, token_id)
        cumulative_score = graph.nodes[node]['cumscore'] + token_score

        # Add the predicted token to the list of input ids
        new_input_ids = torch.cat([input_ids, token_id.unsqueeze(0).unsqueeze(0)], dim=-1)

        # Add node and edge to graph
        token = tokenizer.decode(token_id, skip_special_tokens=True)
        current_node = list(graph.successors(node))[j]
        graph.nodes[current_node]['tokenscore'] = np.exp(token_score) * 100
        graph.nodes[current_node]['cumscore'] = cumulative_score
        graph.nodes[current_node]['sequencescore'] = 1/(len(new_input_ids.squeeze())) * cumulative_score
        graph.nodes[current_node]['token'] = token + f"_{length}_{j}"

        # Recursive call (pass the temperature through instead of hardcoding 1)
        beam_search(new_input_ids, current_node, bar, length-1, beams, sampling, temperature)

# Parameters
length = 5
beams = 2

# Create a balanced tree with height 'length' and branching factor 'k'
graph = nx.balanced_tree(beams, length, create_using=nx.DiGraph())
bar = tqdm(total=len(graph.nodes))

# Add 'tokenscore', 'cumscore', and 'token' attributes to each node
for node in graph.nodes:
    graph.nodes[node]['tokenscore'] = 100
    graph.nodes[node]['cumscore'] = 0
    graph.nodes[node]['sequencescore'] = 0
    graph.nodes[node]['token'] = text

# Start generating text
beam_search(input_ids, 0, bar, length, beams, 'greedy', 1)

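This code fills a tree with 63 nodes (the root plus 62 scored tokens) and beams^length = 2^5 = 32 possible sequences. All of this information is stored in the graph. The next step is to extract the best sequence. First, we find the leaf node with the highest sequence score. Then we find the shortest path from the root to that leaf. Each node along this path holds one token of the best sequence.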

def get_best_sequence(G):
    # Create a list of leaf nodes
    leaf_nodes = [node for node in G.nodes() if G.out_degree(node)==0]

    # Get the leaf node with the highest sequencescore
    max_score_node = None
    max_score = float('-inf')
    for node in leaf_nodes:
        if G.nodes[node]['sequencescore'] > max_score:
            max_score = G.nodes[node]['sequencescore']
            max_score_node = node

    # Retrieve the path of nodes from the root node to this leaf node in a list
    path = nx.shortest_path(G, source=0, target=max_score_node)

    # Return the string of token attributes of this sequence
    sequence = "".join([G.nodes[node]['token'].split('_')[0] for node in path])
    return sequence, max_score

sequence, max_score = get_best_sequence(graph)
print(f"Generated text: {sequence}")

Generated text: I have a dream. I have a dream

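The best sequence appears to be "I have a dream. I have a dream", a kind of repetition that is fairly common with GPT-2, even if it is a bit surprising. To verify it, let's plot the graph again. This time we show the sequence score of each node, that is, the score of the sequence up to that node. If get_best_sequence() is correct, the "dream" node in "I have a dream. I have a dream" should have the highest score among all leaf nodes.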

# Plot graph
plot_graph(graph, length, beams, 'sequence')

Image by author.

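Indeed, the "dream" token has the highest sequence score at -0.69. Interestingly, we can also see the score of the greedy sequence "I have a dream of being a doctor.", which is -1.16. As expected, greedy search led to a suboptimal result. To be honest, though, the new result is not especially compelling either.

To generate more varied text, we introduce two sampling algorithms next: top-k sampling and nucleus sampling.

4. Top-k Sampling: Giving the Model Some "Imagination"

Top-k sampling injects randomness and creativity into generation. Instead of always taking the single best token as greedy search does, it uses the probability distribution produced by the language model to pick a token at random from the k most likely options.

Here is an example from marketing copy. Suppose the model predicts four tokens A, B, C, and D with probabilities of 30%, 15%, 5%, and 1%. With k = 3, token D is discarded, and the remaining probabilities are renormalized so they sum to 100%: the model picks A 60% of the time, B 30% of the time, and C 10% of the time. This keeps the choice among the most likely tokens while adding randomness, so the generated text is less rigid.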
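The arithmetic in this example can be checked with a few lines of plain Python. This is a minimal sketch of the filter-and-renormalize idea, using the four probabilities from the example; the helper name `top_k_filter` and the fixed random seed are our own choices.

```python
import random

def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize their probabilities."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

probs = {"A": 0.30, "B": 0.15, "C": 0.05, "D": 0.01}
filtered = top_k_filter(probs, k=3)
print({tok: round(p, 2) for tok, p in filtered.items()})
# {'A': 0.6, 'B': 0.3, 'C': 0.1}

# Sample one token from the filtered distribution.
rng = random.Random(0)  # fixed seed so the sketch is repeatable
choice = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
print(choice)
```

Token D can never be drawn, while A, B, and C keep their relative proportions (30 : 15 : 5).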

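Another way to control randomness is the temperature. The temperature T is a positive parameter, usually set between 0 and 1, that divides the model's raw scores before the softmax is applied. Values below 1 make the most likely tokens stand out even more, while values above 1 flatten the distribution.

$$\mathrm{softmax}(x_i) = \frac{e^{x_i / T}}{\sum_j e^{x_j / T}}$$

The figure below shows the effect of temperature on the probabilities generated from the input scores [1.5, -1.8, 0.9, -3.2]. Three different temperature values are plotted to highlight the differences.
Image by author.

A temperature of 1.0 is equivalent to a plain softmax with no temperature at all. A very low temperature (0.1) changes the distribution significantly, concentrating almost all the probability on the top-scoring token. This is widely used in text generation to control how "creative" the output is: lower temperatures give more predictable text, higher temperatures give more diverse text. It is like the temperature of a cup of coffee: too hot and you cannot drink it, too cold and it loses its flavor, so you tune it until it is just right.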
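If the figure is not available, the same effect can be reproduced numerically. The sketch below applies a temperature-scaled softmax to the scores [1.5, -1.8, 0.9, -3.2] in plain Python. The temperatures 1.0 and 0.1 come from the text; 0.5 is an assumed middle value added for comparison.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply a softmax."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [1.5, -1.8, 0.9, -3.2]
results = {}
for t in (1.0, 0.5, 0.1):
    results[t] = softmax_with_temperature(logits, t)
    print(f"T={t}: {[round(p, 3) for p in results[t]]}")
```

As the temperature drops, the share of the highest-scoring entry (1.5) grows and the others shrink toward zero, which is why low temperatures behave almost like greedy search.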

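Now let's implement top-k sampling. We can use it by passing the "top_k" argument to beam_search(). To make it easier to follow, we also plot the probability distribution for top_k = 20.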

def plot_prob_distribution(probabilities, next_tokens, sampling, potential_nb, total_nb=50):
    # Get top k tokens
    top_k_prob, top_k_indices = torch.topk(probabilities, total_nb)
    top_k_tokens = [tokenizer.decode([idx]) for idx in top_k_indices.tolist()]

    # Get next tokens and their probabilities
    next_tokens_list = [tokenizer.decode([idx]) for idx in next_tokens.tolist()]
    next_token_prob = probabilities[next_tokens].tolist()

    # Create figure
    plt.figure(figsize=(0.4*total_nb, 5), dpi=300, facecolor='white')
    plt.rc('axes', axisbelow=True)
    plt.grid(axis='y', linestyle='-', alpha=0.5)

    if potential_nb < total_nb:
        plt.axvline(x=potential_nb-0.5, ls=':', color='grey', label='Sampled tokens')

    plt.bar(top_k_tokens, top_k_prob.tolist(), color='blue')
    plt.bar(next_tokens_list, next_token_prob, color='red', label='Selected tokens')
    plt.xticks(rotation=45, ha='right', va='top')
    plt.gca().spines['top'].set_visible(False)
    plt.gca().spines['right'].set_visible(False)

    if sampling == 'top_k':
        plt.title('Probability distribution of predicted tokens with top-k sampling')
    elif sampling == 'nucleus':
        plt.title('Probability distribution of predicted tokens with nucleus sampling')

    plt.legend()
    plt.savefig(f'{sampling}_{time.time()}.png', dpi=300)
    plt.close()

def top_k_sampling(logits, temperature, top_k, beams, plot=True):
    assert top_k >= 1
    assert beams <= top_k

    indices_to_remove = logits < torch.topk(logits, top_k)[0][..., -1, None]
    new_logits = torch.clone(logits)
    new_logits[indices_to_remove] = float('-inf')

    # Convert logits to probabilities
    probabilities = torch.nn.functional.softmax(new_logits / temperature, dim=-1)

    # Sample n tokens from the resulting distribution
    next_tokens = torch.multinomial(probabilities, beams)

    # Plot distribution
    if plot:
        total_prob = torch.nn.functional.softmax(logits / temperature, dim=-1)
        plot_prob_distribution(total_prob, next_tokens, 'top_k', top_k)

    return next_tokens

# Start generating text
beam_search(input_ids, 0, bar, length, beams, 'top_k', 1)

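These plots show how top-k sampling works: every token that can be selected sits to the left of the vertical line. The most likely tokens are selected (shown in red) most of the time, but less likely tokens can be chosen too. This gives an interesting trade-off: the sequence becomes less predictable, yet it can sound more natural.

Now let's see what text it generated.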

sequence, max_score = get_best_sequence(graph)
print(f"Generated text: {sequence}")

Generated text: I have a dream job and I want to

Top-k sampling found a new sequence: "I have a dream job and I want to", which sounds much more natural and coherent than "I have a dream. I have a dream". That is real progress. Let's see how the tree differs this time.

# Plot graph
plot_graph(graph, length, beams, 'sequence')

Image by author.

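The node choices differ noticeably from the previous run and are more varied. The sequence score of this new result is lower (-1.01 versus -0.69 before), but keep in mind that a higher score does not always mean more realistic or more meaningful text.

With top-k sampling covered, we turn to another popular sampling technique: nucleus sampling.

5. Nucleus Sampling: A Flexible "Core" of Candidates

Nucleus sampling, also known as top-p sampling, takes a different approach from top-k sampling. Rather than keeping a fixed number of the most likely tokens, it sets a cumulative probability threshold p. Starting from the most likely token, the model adds tokens one by one until their cumulative probability exceeds p. The resulting set is the "nucleus", and the model samples the next token from it.

Unlike top-k sampling, the number of tokens in the nucleus is dynamic and can change at every step. This variability often yields more diverse and creative output, which is why nucleus sampling is popular for text generation.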
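The dynamic size of the nucleus is easiest to see on two contrasting distributions. The sketch below uses invented probabilities and follows the common definition in which the token that crosses the threshold is included; the article's own implementation further down makes slightly different boundary choices. The helper name `nucleus` is hypothetical.

```python
def nucleus(probs, p):
    """Smallest set of top tokens whose cumulative probability reaches p."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append(tok)
        cumulative += prob
        if cumulative >= p:
            break
    return kept

# Hypothetical next-token distributions (illustrative values only).
confident = {"of": 0.80, "that": 0.10, ".": 0.06, "job": 0.04}
uncertain = {"of": 0.30, "that": 0.28, ".": 0.22, "job": 0.20}

print(nucleus(confident, 0.5))  # ['of']
print(nucleus(uncertain, 0.5))  # ['of', 'that']
```

When the model is confident, a single token already covers p and sampling is effectively greedy. When the model is unsure, more tokens enter the nucleus and the output becomes more varied, with no need to pick a fixed k in advance.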

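To use nucleus sampling, we pass the "nucleus" argument to beam_search(). In this example, p is set to 0.5. To keep things simple, we enforce a minimum number of tokens equal to the number of beams. We also keep the tokens whose cumulative probability is below p rather than above it. Implementations differ in such details, but the core idea of nucleus sampling stays the same.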

def nucleus_sampling(logits, temperature, p, beams, plot=True):
    assert p > 0
    assert p <= 1

    # Sort the probabilities in descending order and compute cumulative probabilities
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    probabilities = torch.nn.functional.softmax(sorted_logits / temperature, dim=-1)
    cumulative_probabilities = torch.cumsum(probabilities, dim=-1)

    # Create a mask for probabilities that are in the top-p
    mask = cumulative_probabilities < p

    # If no more than `beams` tokens have cumulative probability < p, keep the top `beams` tokens instead
    if mask.sum() > beams:
        top_p_index_to_keep = torch.where(mask)[0][-1].detach().cpu().tolist()
    else:
        top_p_index_to_keep = beams

    # Only keep top-p tokens. sorted_indices holds vocabulary IDs, so the mask
    # is applied to a copy of the unsorted logits rather than to sorted_logits.
    indices_to_remove = sorted_indices[top_p_index_to_keep:]
    new_logits = torch.clone(logits)
    new_logits[indices_to_remove] = float('-inf')

    # Sample n tokens from the resulting distribution
    probabilities = torch.nn.functional.softmax(new_logits / temperature, dim=-1)
    next_tokens = torch.multinomial(probabilities, beams)

    # Plot distribution
    if plot:
        total_prob = torch.nn.functional.softmax(logits / temperature, dim=-1)
        plot_prob_distribution(total_prob, next_tokens, 'nucleus', top_p_index_to_keep)

    return next_tokens

# Start generating text
beam_search(input_ids, 0, bar, length, beams, 'nucleus', 1)

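In these plots, the number of tokens in the nucleus (to the left of the vertical line) fluctuates a lot. The resulting probability distributions vary widely too, so the model can pick tokens that are not always the most likely ones. This opens the door to unique and varied sequences.

Now let's look at the text it generated.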

sequence, max_score = get_best_sequence(graph)
print(f"Generated text: {sequence}")

Generated text: I have a dream. I'm going to

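Nucleus sampling produced the sequence "I have a dream. I'm going to", a notable improvement in semantic coherence over the result obtained with greedy selection. To compare the decision paths, let's plot the tree generated by nucleus sampling.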

# Plot graph
plot_graph(graph, length, beams, 'sequence')

Image by author.

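As with top-k sampling, this tree is very different from the one produced with greedy selection and shows more variety.

Top-k and nucleus sampling each bring their own strengths to text generation, improving the diversity and creativity of the output. Which method to choose (greedy search included) depends on the specific requirements and constraints of your project.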
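In day-to-day work you rarely need the hand-written functions above, because the `generate` method of the transformers library exposes the same knobs as keyword arguments. The sketch below collects one possible set of presets. The parameter names (`do_sample`, `num_beams`, `top_k`, `top_p`, `temperature`, `early_stopping`, `max_new_tokens`, `pad_token_id`) are real `generate` arguments, but the values are illustrative starting points we picked, not tuned recommendations, and `DECODING_PRESETS` and `generate_with` are hypothetical names.

```python
# Keyword arguments for Hugging Face `model.generate`.
# The values are illustrative starting points, not tuned recommendations.
DECODING_PRESETS = {
    "greedy": dict(do_sample=False, num_beams=1),
    "beam": dict(do_sample=False, num_beams=5, early_stopping=True),
    "top_k": dict(do_sample=True, top_k=50, temperature=0.7),
    # top_k=0 switches off top-k filtering so that only top_p applies.
    "nucleus": dict(do_sample=True, top_k=0, top_p=0.9, temperature=0.7),
}

def generate_with(preset, prompt="I have a dream", max_new_tokens=5):
    """Generate a continuation of `prompt` using one of the presets above."""
    # Imported lazily so the presets can be inspected without the library installed.
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()
    input_ids = tokenizer.encode(prompt, return_tensors="pt")
    outputs = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no dedicated pad token
        **DECODING_PRESETS[preset],
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example (downloads GPT-2 on first use):
# print(generate_with("nucleus"))
```

A reasonable way to use this is to run the same prompt through each preset and compare the outputs side by side: the two sampling presets give different text on every call, while greedy and beam search are deterministic.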

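6. Summary and Outlook

Cross-border friends, we have gone from plain greedy search, to more far-sighted beam search, to top-k and nucleus sampling, which add creativity and surprise. Hopefully you now have a deeper understanding of how LLMs generate text.

These strategies are not abstract theory. They are practical tools for improving content marketing and customer communication. Whether you are generating eye-catching product descriptions, writing personalized marketing emails, or drafting customer service replies, choosing the right decoding strategy helps the model produce content that is closer to what users need and more competitive.

Every method has its own strengths and limitations, and the key is to apply them flexibly. No single parameter combination works everywhere; only by experimenting and tuning will you find the generation settings that best fit your business.

Mastering these skills will help you stand out in cross-border competition. 新媒网跨境 predicts that AI-generated content will become even more widespread, and learning to steer these tools toward high-quality, high-value text will be required homework for everyone in cross-border trade. We hope this article gives you some ideas. Let's keep exploring cross-border AI together!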

Source: 新媒网 https://nmedialink.com/posts/llm-gen-strat-tune-30min-x2-xborder-eff.html