
AI-For-Beginners Language Modeling: CBOW and Skip-gram in Practice

2026-02-04 05:15:14 | Author: 钟日瑜

Introduction: Why Do We Need Word Embeddings?

In traditional natural language processing, text is usually represented with bag-of-words or TF-IDF features. These methods are simple and effective, but they suffer from two major problems:

  1. Curse of dimensionality: the vocabulary can easily reach tens or even hundreds of thousands of words, so the feature vectors become extremely high-dimensional
  2. Loss of semantics: one-hot encodings cannot express any semantic similarity between words

Word embeddings address both problems by mapping words into a low-dimensional, dense vector space. Word2Vec, the classic word-embedding algorithm, comes in two core architectures: CBOW (Continuous Bag-of-Words) and Skip-gram.
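As a quick illustration of the second point, here is a minimal sketch (toy vectors invented for illustration, not from any trained model) showing that distinct one-hot vectors always have cosine similarity 0, while dense embeddings can place related words close together:

import torch
import torch.nn.functional as F

# one-hot vectors for "cat" and "dog" in a 5-word toy vocabulary
cat_onehot = torch.tensor([1., 0., 0., 0., 0.])
dog_onehot = torch.tensor([0., 1., 0., 0., 0.])
print(F.cosine_similarity(cat_onehot, dog_onehot, dim=0))  # tensor(0.) -- no notion of similarity

# hypothetical 3-dimensional dense embeddings (made-up values)
cat_vec = torch.tensor([0.9, 0.1, 0.3])
dog_vec = torch.tensor([0.8, 0.2, 0.4])
print(F.cosine_similarity(cat_vec, dog_vec, dim=0))        # close to 1 -- similar words, similar vectors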

CBOW vs. Skip-gram: How They Work

CBOW (Continuous Bag-of-Words)

CBOW predicts the center word from its surrounding context words; the training objective is to maximize the conditional probability of the center word given the context.
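Written out (a standard formulation, not quoted from the course), the objective averaged over a corpus of $T$ words with window size $c$ is:

$$\frac{1}{T}\sum_{t=1}^{T}\log p\left(w_t \mid w_{t-c},\ldots,w_{t-1},w_{t+1},\ldots,w_{t+c}\right)$$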

flowchart TD
    A[Input layer: context words] --> B[Embedding layer: vector lookup]
    B --> C[Hidden layer: average or sum of vectors]
    C --> D[Output layer: softmax classification]
    D --> E[Predicted center word]

Skip-gram

Skip-gram is the mirror image of CBOW: it predicts the surrounding context words from the center word, and the training objective is to maximize the conditional probability of the context words given the center word.
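The corresponding Skip-gram objective sums the log-probability of every context word given the center word:

$$\frac{1}{T}\sum_{t=1}^{T}\sum_{-c \le j \le c,\; j \ne 0}\log p\left(w_{t+j} \mid w_t\right)$$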

flowchart TD
    A[Input layer: center word] --> B[Embedding layer: vector lookup]
    B --> C[Hidden layer: shared weight matrix]
    C --> D[Output layer: softmax classification]
    D --> E[Predicted context words]

Comparison at a Glance

| Property | CBOW | Skip-gram |
| --- | --- | --- |
| Training speed | Faster | Slower |
| Rare-word handling | Average | Strong |
| Semantics captured | Averaged, whole-context | Fine-grained |
| Typical use | Large corpora | Small corpora |

Environment Setup

First, make sure the required libraries are installed, then import them:

# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torchtext  # the Vocab usage below follows the legacy torchtext API (< 0.10)
from torch.utils.data import DataLoader, Dataset

# data handling
import collections
import random
import numpy as np

Implementing CBOW in Detail

1. Preprocessing and Vocabulary Construction

def build_vocabulary(text_corpus, vocab_size=5000):
    """Build the vocabulary from a corpus (legacy torchtext Vocab API)"""
    tokenizer = torchtext.data.utils.get_tokenizer('basic_english')
    counter = collections.Counter()
    
    for text in text_corpus:
        tokens = tokenizer(text)
        counter.update(tokens)
    
    # keep only the vocab_size most frequent words
    vocab = torchtext.vocab.Vocab(
        collections.Counter(dict(counter.most_common(vocab_size))), 
        min_freq=1
    )
    return vocab, tokenizer

def text_to_indices(text, vocab, tokenizer):
    """Convert a piece of text into a sequence of vocabulary indices"""
    tokens = tokenizer(text)
    return [vocab[token] for token in tokens if token in vocab.stoi]
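A quick usage sketch with a made-up two-sentence corpus (purely illustrative; the exact indices depend on frequency order and on the special tokens the legacy Vocab adds):

toy_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
toy_vocab, toy_tokenizer = build_vocabulary(toy_corpus, vocab_size=20)
print(len(toy_vocab))                                            # vocabulary size, including special tokens
print(text_to_indices("the cat sat", toy_vocab, toy_tokenizer))  # e.g. [2, 6, 3]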

2. Generating CBOW Training Pairs

def generate_cbow_pairs(sentence_indices, window_size=2):
    """Generate (context, center) training pairs for CBOW"""
    pairs = []
    n = len(sentence_indices)
    
    for center_pos in range(n):
        # context window boundaries
        context_start = max(0, center_pos - window_size)
        context_end = min(n, center_pos + window_size + 1)
        
        # exclude the center word itself
        context_indices = [
            sentence_indices[i] 
            for i in range(context_start, context_end) 
            if i != center_pos
        ]
        
        if context_indices:
            pairs.append((context_indices, sentence_indices[center_pos]))
    
    return pairs
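For example, a short index sequence with window_size=2 expands into the following pairs (a runnable sketch to inspect the output):

pairs = generate_cbow_pairs([10, 11, 12, 13, 14], window_size=2)
print(pairs[:3])
# [([11, 12], 10), ([10, 12, 13], 11), ([10, 11, 13, 14], 12)]
# note: contexts near the sentence boundary contain fewer than 2 * window_size words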

3. CBOW Model Architecture

class CBOWModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(CBOWModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear = nn.Linear(embedding_dim, vocab_size)
        
    def forward(self, context_words):
        # look up the context embeddings and average them
        embedded = self.embedding(context_words)     # [batch_size, context_size, embedding_dim]
        embedded_mean = torch.mean(embedded, dim=1)  # [batch_size, embedding_dim]
        
        # predict the center word
        output = self.linear(embedded_mean)          # [batch_size, vocab_size]
        return output
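A quick shape check with made-up sizes:

toy_cbow = CBOWModel(vocab_size=100, embedding_dim=8)
dummy_context = torch.randint(0, 100, (3, 4))  # batch of 3 contexts, 4 context words each
print(toy_cbow(dummy_context).shape)           # torch.Size([3, 100]): one score per vocabulary word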

4. Training Loop

def train_cbow(model, train_data, vocab_size, epochs=10, learning_rate=0.01):
    """Train the CBOW model on (context, center) pairs, one pair at a time"""
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)
    
    for epoch in range(epochs):
        total_loss = 0
        for context_words, center_word in train_data:
            optimizer.zero_grad()
            
            # the pairs store plain Python ints, so wrap them in tensors with a batch dimension of 1
            context_tensor = torch.tensor(context_words).unsqueeze(0)  # [1, context_size]
            center_tensor = torch.tensor([center_word])                # [1]
            
            # forward pass
            output = model(context_tensor)
            loss = criterion(output, center_tensor)
            
            # backward pass
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        print(f'Epoch {epoch+1}, Loss: {total_loss/len(train_data):.4f}')

Implementing Skip-gram

1. Generating Skip-gram Training Pairs

def generate_skipgram_pairs(sentence_indices, window_size=2):
    """Generate (center, context) training pairs for Skip-gram"""
    pairs = []
    n = len(sentence_indices)
    
    for center_pos in range(n):
        center_word = sentence_indices[center_pos]
        
        # context window boundaries
        context_start = max(0, center_pos - window_size)
        context_end = min(n, center_pos + window_size + 1)
        
        for context_pos in range(context_start, context_end):
            if context_pos != center_pos:
                context_word = sentence_indices[context_pos]
                pairs.append((center_word, context_word))
    
    return pairs
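The same toy sequence now yields one (center, context) pair per context word rather than one pair per center word:

pairs = generate_skipgram_pairs([10, 11, 12, 13, 14], window_size=2)
print(pairs[:4])
# [(10, 11), (10, 12), (11, 10), (11, 12)]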

2. Skip-gram Model Architecture

class SkipGramModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim):
        super(SkipGramModel, self).__init__()
        self.center_embedding = nn.Embedding(vocab_size, embedding_dim)
        self.context_embedding = nn.Embedding(vocab_size, embedding_dim)
        
    def forward(self, center_words, context_words):
        # embed the center and context words separately
        center_embedded = self.center_embedding(center_words)     # [batch_size, embedding_dim]
        context_embedded = self.context_embedding(context_words)  # [batch_size, embedding_dim]
        
        # similarity scores between every center word and every context word in the batch
        scores = torch.matmul(center_embedded, context_embedded.t())  # [batch_size, batch_size]
        return scores
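One simple way to train this version (a sketch that treats the rest of the batch as "in-batch" negatives; the original Word2Vec recipe instead uses explicit negative sampling, shown next) is to take the diagonal of the score matrix as the positive entries and apply cross-entropy:

sg_model = SkipGramModel(vocab_size=100, embedding_dim=8)
centers = torch.randint(0, 100, (16,))   # hypothetical batch of center-word indices
contexts = torch.randint(0, 100, (16,))  # the matching context-word indices
scores = sg_model(centers, contexts)     # [16, 16]
targets = torch.arange(16)               # the i-th context belongs to the i-th center
loss = nn.CrossEntropyLoss()(scores, targets)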

3. Training with Negative Sampling

class NegativeSamplingLoss(nn.Module):
    def __init__(self):
        super(NegativeSamplingLoss, self).__init__()
        
    def forward(self, center_vectors, context_vectors, negative_vectors):
        # score of the true (center, context) pairs
        positive_score = torch.sum(center_vectors * context_vectors, dim=1)
        positive_loss = -F.logsigmoid(positive_score).mean()
        
        # scores of the sampled negative pairs
        negative_score = torch.sum(center_vectors.unsqueeze(1) * negative_vectors, dim=2)
        negative_loss = -F.logsigmoid(-negative_score).mean()
        
        return positive_loss + negative_loss
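A usage sketch that wires the loss to SkipGramModel's two embedding tables (negatives are drawn uniformly here just for illustration; the frequency-weighted sampler in the optimization section below is the usual choice):

sg_model = SkipGramModel(vocab_size=100, embedding_dim=8)
loss_fn = NegativeSamplingLoss()

centers = torch.randint(0, 100, (16,))
contexts = torch.randint(0, 100, (16,))
negatives = torch.randint(0, 100, (16, 5))             # 5 negative samples per positive pair

center_vecs = sg_model.center_embedding(centers)        # [16, 8]
context_vecs = sg_model.context_embedding(contexts)     # [16, 8]
negative_vecs = sg_model.context_embedding(negatives)   # [16, 5, 8]

loss = loss_fn(center_vecs, context_vecs, negative_vecs)
loss.backward()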

Hands-on Example: Word Embeddings on AG News

Data Loading and Preprocessing

def load_ag_news_dataset(sample_size=10000):
    """Load a sample of the AG_NEWS dataset"""
    train_dataset, test_dataset = torchtext.datasets.AG_NEWS(root='./data')
    texts = []
    
    for i, (label, text) in enumerate(train_dataset):
        if i >= sample_size:
            break
        texts.append(text)
    
    return texts

# load the data and build the vocabulary
news_texts = load_ag_news_dataset(10000)
vocab, tokenizer = build_vocabulary(news_texts, vocab_size=5000)

# generate the CBOW training data
all_cbow_pairs = []
for text in news_texts:
    indices = text_to_indices(text, vocab, tokenizer)
    cbow_pairs = generate_cbow_pairs(indices, window_size=2)
    all_cbow_pairs.extend(cbow_pairs)

Model Training and Evaluation

# initialize the model
embedding_dim = 100
vocab_size = len(vocab)
cbow_model = CBOWModel(vocab_size, embedding_dim)

# train the CBOW model
train_cbow(cbow_model, all_cbow_pairs, vocab_size, epochs=20, learning_rate=0.01)

# the trained word vectors are the rows of the embedding weight matrix
word_vectors = cbow_model.embedding.weight.data

Querying Semantic Similarity

def find_similar_words(query_word, word_vectors, vocab, top_n=5):
    """Find the words most similar to the query word"""
    if query_word not in vocab.stoi:
        return f"'{query_word}' not in vocabulary"
    
    query_idx = vocab.stoi[query_word]
    query_vector = word_vectors[query_idx]
    
    # cosine similarity against every word vector
    similarities = torch.nn.functional.cosine_similarity(
        word_vectors, query_vector.unsqueeze(0), dim=1
    )
    
    # take the most similar words, skipping the query word itself
    top_indices = similarities.argsort(descending=True)[1:top_n+1]
    similar_words = [vocab.itos[idx] for idx in top_indices]
    
    return similar_words

# test semantic similarity
print("Words similar to 'microsoft':", find_similar_words('microsoft', word_vectors, vocab))
print("Words similar to 'basketball':", find_similar_words('basketball', word_vectors, vocab))

Performance Optimization Tips

1. Speeding Up Training with Negative Sampling

def negative_sampling(vocab, num_negatives=5):
    """Draw negative samples from the smoothed unigram distribution"""
    word_freq = np.array([vocab.freqs[word] for word in vocab.itos])
    word_probs = word_freq ** 0.75  # 0.75 smoothing exponent, as in the original Word2Vec paper
    word_probs /= word_probs.sum()
    
    negative_samples = np.random.choice(
        len(vocab), size=num_negatives, p=word_probs, replace=False
    )
    return torch.tensor(negative_samples)

2. Hierarchical Softmax

class HierarchicalSoftmax(nn.Module):
    """Hierarchical softmax (skeleton): reduces the per-step output cost from O(V) to O(log V)"""
    def __init__(self, vocab_size, embedding_dim):
        super(HierarchicalSoftmax, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        # build the Huffman tree and the per-node binary classifiers here...

3. Batched Training

class CBOWDataset(Dataset):
    """Custom Dataset wrapping the (context, center) pairs"""
    def __init__(self, pairs):
        self.pairs = pairs
        
    def __len__(self):
        return len(self.pairs)
    
    def __getitem__(self, idx):
        context, center = self.pairs[idx]
        return torch.tensor(context), torch.tensor(center)

# batched training with a DataLoader
# (contexts near sentence boundaries are shorter, so a custom collate_fn is needed -- see the sketch below)
dataset = CBOWDataset(all_cbow_pairs)
dataloader = DataLoader(dataset, batch_size=64, shuffle=True)
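Because contexts at sentence boundaries have fewer than 2 * window_size words, the default collation fails on ragged batches. A minimal padding collate function (a sketch that assumes index 0 can serve as padding; strictly, the averaging in CBOWModel should also mask the padded positions):

def cbow_collate(batch):
    """Pad variable-length contexts in a batch to a common length"""
    contexts, centers = zip(*batch)
    max_len = max(len(c) for c in contexts)
    padded = torch.zeros(len(contexts), max_len, dtype=torch.long)  # 0 assumed to be a padding index
    for i, c in enumerate(contexts):
        padded[i, :len(c)] = c
    return padded, torch.stack(centers)

dataloader = DataLoader(dataset, batch_size=64, shuffle=True, collate_fn=cbow_collate)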

Practical Application Scenarios

1. Strengthening Text Classification

class TextClassifierWithEmbeddings(nn.Module):
    """Text classifier that can be initialized with pretrained word embeddings"""
    def __init__(self, vocab_size, embedding_dim, num_classes, pretrained_embeddings=None):
        super(TextClassifierWithEmbeddings, self).__init__()
        
        if pretrained_embeddings is not None:
            self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings)
        else:
            self.embedding = nn.Embedding(vocab_size, embedding_dim)
        
        self.lstm = nn.LSTM(embedding_dim, 128, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(256, num_classes)  # 256 = 2 * 128 (bidirectional)
    
    def forward(self, x):
        embedded = self.embedding(x)
        lstm_out, _ = self.lstm(embedded)
        pooled = torch.mean(lstm_out, dim=1)
        return self.classifier(pooled)
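For example, the classifier could be initialized with the vectors trained above (AG News has 4 topic classes; note that from_pretrained freezes the embeddings by default, so pass freeze=False inside the model if they should be fine-tuned):

classifier = TextClassifierWithEmbeddings(
    vocab_size=len(vocab),
    embedding_dim=100,
    num_classes=4,                       # the 4 AG News topic classes
    pretrained_embeddings=word_vectors,  # embedding matrix trained by the CBOW model above
)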

2. Semantic Matching for Recommendation

def semantic_similarity(item1_text, item2_text, word_vectors, vocab):
    """Similarity of two texts, computed as the cosine of their averaged word vectors"""
    def text_to_vector(text):
        tokens = tokenizer(text)  # reuses the tokenizer built earlier
        indices = [vocab[token] for token in tokens if token in vocab.stoi]
        if not indices:
            return None
        vectors = word_vectors[indices]
        return torch.mean(vectors, dim=0)
    
    vec1 = text_to_vector(item1_text)
    vec2 = text_to_vector(item2_text)
    
    if vec1 is None or vec2 is None:
        return 0.0
    
    return F.cosine_similarity(vec1.unsqueeze(0), vec2.unsqueeze(0)).item()

Common Problems and Solutions

1. Running Out of Memory

# Sparse updates: create the embedding with nn.Embedding(..., sparse=True) and give
# SparseAdam only the embedding parameters -- it does not support dense gradients
optimizer = optim.SparseAdam(model.embedding.parameters(), lr=0.001)

# Gradient accumulation (criterion and optimizer defined as in train_cbow)
def train_with_gradient_accumulation(model, dataloader, accumulation_steps=4):
    optimizer.zero_grad()
    
    for i, (context, center) in enumerate(dataloader):
        output = model(context)
        loss = criterion(output, center)
        loss = loss / accumulation_steps  # scale so the accumulated gradient matches one large batch
        loss.backward()
        
        if (i + 1) % accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()

2. Unstable Training

# gradient clipping (call between loss.backward() and optimizer.step())
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# learning-rate scheduling: call scheduler.step(epoch_loss) once per epoch
scheduler = optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='min', factor=0.5, patience=2
)

3. Handling Rare Words

# Subword embeddings (e.g. via the fastText library)
import fasttext

# or a character-level CNN
class CharCNNEmbedding(nn.Module):
    def __init__(self, char_vocab_size, char_embedding_dim, word_embedding_dim):
        super(CharCNNEmbedding, self).__init__()
        self.char_embedding = nn.Embedding(char_vocab_size, char_embedding_dim)
        self.conv = nn.Conv1d(char_embedding_dim, word_embedding_dim, kernel_size=3)
        
    def forward(self, char_indices):
        embedded = self.char_embedding(char_indices)      # [batch, chars, char_dim]
        conv_out = self.conv(embedded.permute(0, 2, 1))   # convolve over the character axis
        return torch.max(conv_out, dim=2)[0]              # max-pool over positions

Summary and Outlook

In this tutorial we worked through the principles and implementation of the two Word2Vec architectures, CBOW and Skip-gram. Simple as they are, these methods laid a solid foundation for modern NLP.

Key takeaways:

  1. CBOW trains quickly: predicting the center word from its context is computationally efficient
  2. Skip-gram handles rare words better: predicting the context from the center word is friendlier to infrequent words
  3. Negative sampling greatly speeds up training: sampling a handful of negatives avoids a full softmax over the vocabulary
  4. Word embeddings are an NLP foundation: they provide high-quality semantic representations for downstream tasks

Where to go next:

  • Explore improved algorithms such as GloVe and fastText
  • Try pretrained language models such as BERT and GPT
  • Build customized embeddings for your own application domains

Word-embedding techniques continue to evolve, but CBOW and Skip-gram remain classic methods and an important basis for understanding modern NLP. Mastering these fundamentals will give you a solid footing for the more complex language models that follow.
