PyTorch NLP in Practice: Chapter 3, Supervised Learning Training and a Sentiment Classification Case Study

2025-06-02 23:16:39 Author: 薛曦旖Francesca

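This article is based on Chapter 3 of "NLP with PyTorch" and explains how supervised learning is applied in natural language processing. Starting from the basic concepts, we build a complete text classification model step by step and work through two representative case studies to help readers master the core skills.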

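I. Supervised Learning Fundamentals

Supervised learning is one of the most common paradigms in machine learning. Its core idea is to build a predictive model from labeled training data. In NLP, supervised learning is widely used for tasks such as text classification, sentiment analysis, and named entity recognition.

This chapter focuses on the following key components:

Model architecture (e.g. the perceptron)
Activation functions (Sigmoid, ReLU, etc.)
Loss functions (cross-entropy, MSE, etc.)
Optimization algorithms (e.g. Adam)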

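II. The Perceptron Model and Activation Functions in Detail

The perceptron is the simplest neural network model. The chapter provides a PyTorch implementation example: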

import torch
import torch.nn as nn

class Perceptron(nn.Module):
    def __init__(self, input_dim):
        super(Perceptron, self).__init__()
        self.fc = nn.Linear(input_dim, 1)
    
    def forward(self, x):
        return torch.sigmoid(self.fc(x))
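
As a quick check of how the perceptron is used, the sketch below runs one forward pass on a random batch. The input size of 3 and the batch size of 4 are arbitrary assumptions chosen for illustration; the class is repeated so the snippet runs on its own.

```python
import torch
import torch.nn as nn

class Perceptron(nn.Module):
    """Same single-layer perceptron as above, repeated so this snippet is self-contained."""
    def __init__(self, input_dim):
        super().__init__()
        self.fc = nn.Linear(input_dim, 1)

    def forward(self, x):
        return torch.sigmoid(self.fc(x))

torch.manual_seed(0)               # fixed seed so the run is repeatable
model = Perceptron(input_dim=3)    # 3 input features is an arbitrary choice
batch = torch.randn(4, 3)          # 4 samples, 3 features each
probs = model(batch)               # shape (4, 1): one probability per sample
print(probs.shape)
```

Each row of the output is a value between 0 and 1 that can be read as the predicted probability of the positive class.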

Common activation function implementations

Sigmoid function: squashes the output into the interval (0, 1)

def sigmoid_activation(z):
    return 1/(1+torch.exp(-z))

ReLU function: mitigates the vanishing gradient problem, because its gradient is 1 for every positive input

def relu_activation(z):
    return torch.max(z, torch.zeros_like(z))
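
The two hand-written activations above can be compared with PyTorch's built-in `torch.sigmoid` and `torch.relu`. This is only a sanity check on a few arbitrary sample values, not a claim about how the book tests them.

```python
import torch

def sigmoid_activation(z):
    return 1 / (1 + torch.exp(-z))

def relu_activation(z):
    return torch.max(z, torch.zeros_like(z))

z = torch.tensor([-2.0, 0.0, 3.0])   # arbitrary sample inputs

# allclose tolerates tiny floating-point differences between the two formulas
sig_ok = torch.allclose(sigmoid_activation(z), torch.sigmoid(z))
relu_ok = torch.equal(relu_activation(z), torch.relu(z))
print(sig_ok, relu_ok)
```

In real models the built-in functions are usually preferable, since they are tested for numerical edge cases.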

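Softmax function: commonly used for multi-class classification

def softmax(z):
    # keepdim=True keeps shape (N, 1) so each row is normalized; the max shift avoids overflow
    exp_z = torch.exp(z - z.max(dim=1, keepdim=True).values)
    return exp_z / exp_z.sum(dim=1, keepdim=True)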
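
For a batch of logits with shape (N, C), softmax has to normalize each row separately, which requires the row sums to keep shape (N, 1). The sketch below uses a hypothetical helper named `stable_softmax` and compares it with `torch.softmax`. The second row holds deliberately large values to show why the maximum is subtracted first.

```python
import torch

def stable_softmax(z):
    # Hypothetical helper name. Subtracting the row max leaves the result
    # unchanged but keeps exp() from overflowing on large logits.
    shifted = z - z.max(dim=1, keepdim=True).values
    exp_z = torch.exp(shifted)
    # keepdim=True keeps shape (N, 1), so the division broadcasts per row.
    return exp_z / exp_z.sum(dim=1, keepdim=True)

logits = torch.tensor([[1.0, 2.0, 3.0],
                       [1000.0, 1000.0, 1000.0]])  # exp(1000) overflows float32
probs = stable_softmax(logits)
matches_builtin = torch.allclose(probs, torch.softmax(logits, dim=1))
print(probs.sum(dim=1), matches_builtin)
```

Every row of `probs` sums to 1, and the result agrees with the built-in on both rows.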

III. Loss Functions: Comparison and Use Cases

Mean squared error (MSE): suited to regression tasks

mse_loss = nn.MSELoss()

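Cross-entropy loss: the first choice for classification tasks (nn.CrossEntropyLoss takes raw logits and integer class indices)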

ce_loss = nn.CrossEntropyLoss()

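Binary cross-entropy: specific to binary classification (nn.BCELoss takes probabilities in [0, 1], such as a Sigmoid output)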

bce_loss = nn.BCELoss()
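
The three losses expect differently shaped inputs, which is a common source of errors. The sketch below calls each one on small hand-made tensors; all the numbers are illustrative assumptions.

```python
import torch
import torch.nn as nn

mse_loss = nn.MSELoss()
ce_loss = nn.CrossEntropyLoss()
bce_loss = nn.BCELoss()

# Regression: predictions and targets are floats of the same shape.
mse = mse_loss(torch.tensor([2.5, 0.0]), torch.tensor([3.0, 0.0]))

# Multi-class: raw logits of shape (N, C) and integer class indices of shape (N,).
logits = torch.tensor([[2.0, 0.5, 0.1],
                       [0.2, 0.3, 3.0]])
ce = ce_loss(logits, torch.tensor([0, 2]))

# Binary: probabilities in [0, 1] (here after a sigmoid) and float targets.
probs = torch.sigmoid(torch.tensor([1.2, -0.7]))
bce = bce_loss(probs, torch.tensor([1.0, 0.0]))

print(mse.item(), ce.item(), bce.item())
```

Note that `nn.CrossEntropyLoss` applies log-softmax internally, so the model should not apply softmax before it, whereas `nn.BCELoss` needs the sigmoid to have been applied already.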

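IV. Case Study 1: Binary Classification on Synthetic Data

We first generate a simple two-dimensional synthetic dataset to demonstrate how the perceptron learns a decision boundary:

Data generation: use make_classification from sklearn
Model training: set hyperparameters such as the learning rate and the number of iterations
Result visualization: plot the decision boundary and the classification results

This case study helps readers build intuition for how a model learns patterns from data.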
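
The data generation and training steps listed above can be sketched as follows. The sample count, class separation, learning rate, and iteration count are all assumptions picked for illustration rather than the chapter's settings, and the plotting step is omitted to keep the example short.

```python
import torch
import torch.nn as nn
from sklearn.datasets import make_classification

# Two informative features, no redundant ones, one cluster per class.
X, y = make_classification(n_samples=200, n_features=2, n_informative=2,
                           n_redundant=0, n_clusters_per_class=1,
                           class_sep=2.0, random_state=0)
X = torch.tensor(X, dtype=torch.float32)
y = torch.tensor(y, dtype=torch.float32).unsqueeze(1)   # (200, 1) to match the model output

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(2, 1), nn.Sigmoid())    # a perceptron: linear layer + sigmoid
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)

losses = []
for epoch in range(200):             # full-batch training on the whole dataset
    optimizer.zero_grad()
    loss = criterion(model(X), y)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

with torch.no_grad():
    accuracy = ((model(X) > 0.5).float() == y).float().mean().item()
print(f"first loss {losses[0]:.3f}, last loss {losses[-1]:.3f}, train accuracy {accuracy:.2f}")
```

The learned decision boundary is the line where the linear layer's output is zero, that is, where the predicted probability equals 0.5.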

V. Case Study 2: Sentiment Analysis of Yelp Reviews

This case study walks through a complete NLP project workflow:

1. Data preprocessing

import re

# Example data-cleaning code
def clean_text(text):
    text = text.lower()                    # normalize case
    text = re.sub(r"i'm", "i am", text)    # expand a common contraction
    text = re.sub(r"\r", "", text)         # drop carriage returns
    return text
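
To make the effect of the cleaning function concrete, the sketch below applies it to one made-up review string. The function is repeated so the snippet runs on its own.

```python
import re

def clean_text(text):
    text = text.lower()
    text = re.sub(r"i'm", "i am", text)
    text = re.sub(r"\r", "", text)
    return text

raw = "I'm SO happy with this place!\r\n"   # made-up review text
cleaned = clean_text(raw)
print(repr(cleaned))  # 'i am so happy with this place!\n'
```

Punctuation stays attached to the neighbouring word here, so a later whitespace split would produce the token `place!` rather than `place`.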

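Two dataset options are provided, a "lite" version and a "full" version, to suit different hardware environments.

2. Building the vocabulary (Vocabulary)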

class Vocabulary:
    def __init__(self):
        self.token2idx = {}
        self.idx2token = {}
        
    def add_token(self, token):
        if token not in self.token2idx:
            idx = len(self.token2idx)
            self.token2idx[token] = idx
            self.idx2token[idx] = token
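
One way to populate the vocabulary is to count tokens across the corpus and keep only the frequent ones. The helper `build_vocabulary` and its `min_count` parameter below are hypothetical names introduced for this sketch; the three review strings are made up.

```python
from collections import Counter

class Vocabulary:
    def __init__(self):
        self.token2idx = {}
        self.idx2token = {}

    def add_token(self, token):
        if token not in self.token2idx:
            idx = len(self.token2idx)
            self.token2idx[token] = idx
            self.idx2token[idx] = token

def build_vocabulary(texts, min_count=2):
    """Hypothetical helper: keep tokens that occur at least `min_count` times."""
    counts = Counter(token for text in texts for token in text.split())
    vocab = Vocabulary()
    for token, count in counts.items():   # Counter keeps first-seen order
        if count >= min_count:
            vocab.add_token(token)
    return vocab

reviews = ["great food great service", "terrible food", "great place"]
vocab = build_vocabulary(reviews)
print(vocab.token2idx)  # {'great': 0, 'food': 1}
```

A frequency cutoff like this keeps the vocabulary, and therefore the input vector, small, at the cost of ignoring rare words.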

3. Text vectorization (Vectorizer)

Convert text into numeric vectors that the model can process:

class Vectorizer:
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary
        
    def vectorize(self, text):
        # Vocabulary defines no __len__, so take the size from its token2idx mapping
        one_hot = torch.zeros(len(self.vocabulary.token2idx))
        for token in text.split():
            if token in self.vocabulary.token2idx:
                one_hot[self.vocabulary.token2idx[token]] = 1
        return one_hot
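
The sketch below shows the vectorizer on a four-token vocabulary. The tokens and the sample sentence are made up, and both classes are repeated so the snippet is self-contained.

```python
import torch

class Vocabulary:
    def __init__(self):
        self.token2idx = {}
        self.idx2token = {}

    def add_token(self, token):
        if token not in self.token2idx:
            idx = len(self.token2idx)
            self.token2idx[token] = idx
            self.idx2token[idx] = token

class Vectorizer:
    def __init__(self, vocabulary):
        self.vocabulary = vocabulary

    def vectorize(self, text):
        # Sized from token2idx so the sketch does not rely on Vocabulary.__len__.
        one_hot = torch.zeros(len(self.vocabulary.token2idx))
        for token in text.split():
            if token in self.vocabulary.token2idx:
                one_hot[self.vocabulary.token2idx[token]] = 1
        return one_hot

vocab = Vocabulary()
for token in ["good", "bad", "food", "service"]:
    vocab.add_token(token)

vector = Vectorizer(vocab).vectorize("good food good prices")
print(vector)  # tensor([1., 0., 1., 0.])
```

The vector records only whether a token is present: the repeated "good" still yields 1, and the out-of-vocabulary word "prices" is ignored.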

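4. Model training and evaluation

The complete training workflow includes:

Preparing the data loaders
Initializing the model
Setting up the loss function and optimizer
The training loop
Evaluation on the validation set

# Training loop example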
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(batch['features'])
        loss = criterion(outputs, batch['label'])
        loss.backward()
        optimizer.step()
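
The loop above relies on names defined elsewhere. A self-contained sketch of the five workflow steps follows; `ToyReviewDataset` is a hypothetical stand-in that produces random multi-hot vectors instead of real Yelp data, and the model, batch size, learning rate, and epoch count are assumptions.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class ToyReviewDataset(Dataset):
    """Hypothetical stand-in for vectorized reviews: random multi-hot features."""
    def __init__(self, n_samples, n_features, seed):
        g = torch.Generator().manual_seed(seed)
        self.features = (torch.rand(n_samples, n_features, generator=g) > 0.5).float()
        # The label is 1 when feature 0 is present, so there is a signal to learn.
        self.labels = self.features[:, 0].clone()

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, i):
        return {"features": self.features[i], "label": self.labels[i]}

torch.manual_seed(0)
# 1. Data loaders (the default collate function batches dict items key by key)
train_loader = DataLoader(ToyReviewDataset(256, 20, seed=1), batch_size=32, shuffle=True)
val_loader = DataLoader(ToyReviewDataset(64, 20, seed=2), batch_size=32)

# 2-3. Model, loss function, and optimizer
model = nn.Sequential(nn.Linear(20, 1), nn.Sigmoid())
criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.05)
num_epochs = 20

# 4. Training loop
for epoch in range(num_epochs):
    model.train()
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(batch["features"]).squeeze(1)   # (N, 1) -> (N,) to match the labels
        loss = criterion(outputs, batch["label"])
        loss.backward()
        optimizer.step()

# 5. Validation: no gradient tracking, threshold the probabilities at 0.5
model.eval()
correct = total = 0
with torch.no_grad():
    for batch in val_loader:
        preds = (model(batch["features"]).squeeze(1) > 0.5).float()
        correct += (preds == batch["label"]).sum().item()
        total += len(preds)
val_accuracy = correct / total
print(f"validation accuracy: {val_accuracy:.2f}")
```

The `squeeze(1)` matters with `nn.BCELoss`: the model emits shape (N, 1) while the labels have shape (N,), and the loss requires the two shapes to match.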

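5. Result analysis and model interpretation

Analyze the weights the model has learned to identify the words that most influence the classification result: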

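import numpy as np

# Get the most important features: the n tokens with the largest positive weights,
# returned in ascending order (strongest last)
def get_important_features(model, vocabulary, n=10):
    weights = model.fc.weight.detach().cpu().numpy().flatten()
    indices = np.argsort(weights)[-n:]
    return [(vocabulary.idx2token[i], weights[i]) for i in indices]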
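
For a sentiment model, the most negative weights are as informative as the most positive ones. The sketch below uses a hypothetical `TinyClassifier` and `top_features` helper with hand-set weights, so the expected ranking is known in advance. The five tokens and their weights are made up.

```python
import numpy as np
import torch
import torch.nn as nn

class TinyClassifier(nn.Module):
    """Hypothetical model with a single linear layer named `fc`."""
    def __init__(self, vocab_size):
        super().__init__()
        self.fc = nn.Linear(vocab_size, 1)

def top_features(model, idx2token, n=2):
    weights = model.fc.weight.detach().cpu().numpy().flatten()
    order = np.argsort(weights)                      # indices sorted ascending by weight
    positive = [(idx2token[int(i)], float(weights[i])) for i in order[::-1][:n]]
    negative = [(idx2token[int(i)], float(weights[i])) for i in order[:n]]
    return positive, negative

idx2token = {0: "great", 1: "awful", 2: "food", 3: "delicious", 4: "rude"}
model = TinyClassifier(vocab_size=5)
with torch.no_grad():   # hand-set weights instead of training
    model.fc.weight.copy_(torch.tensor([[2.0, -3.0, 0.1, 1.5, -1.0]]))

positive, negative = top_features(model, idx2token)
print(positive)  # [('great', 2.0), ('delicious', 1.5)]
print(negative)  # [('awful', -3.0), ('rude', -1.0)]
```

Tokens with weights near zero, such as "food" here, contribute little to the prediction in either direction.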