Text Mining and NLP Basics: From Tokenization to Zipf's Law Analysis

2025-06-04 18:45:59 Author: 尤峻淳Whitney

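Introduction

In data science, text is one of the most common and most challenging kinds of data. Using the NLTK toolkit, this article walks through the basic techniques of text mining and natural language processing (NLP): tokenization, word frequency analysis, and Zipf's law.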

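Environment Setup

Before starting any text analysis, we need to set up the environment:

Install the required Python libraries:
- NumPy: efficient numerical computing
- NLTK: the Natural Language Toolkit
- Tkinter: GUI support (part of the Python standard library, although some Linux distributions package it separately)
Download the NLTK data packages, which include corpora, part-of-speech taggers, chunkers, and other resources:

import nltk
nltk.download('all')  # Download every NLTK data resource (a large download)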

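Text Preprocessing Basics

1. Sentence Segmentation

Splitting a paragraph into sentences is the first step in text processing. A period does not always end a sentence, and NLTK's sent_tokenize function is designed to cope with abbreviations, decimal numbers, and similar cases: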

example = '''Good bagels cost $2.88 in N.Y.C. Hey Prof. Ipeirotis, please buy me two of them.
    
    Thanks.
    
    PS: You have a Ph.D. you can handle this, right?'''

print(nltk.sent_tokenize(example))
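To see why a dedicated sentence splitter helps, here is a minimal baseline that splits at every sentence-ending mark followed by whitespace. The helper name naive_sentence_split is hypothetical and uses only the standard library. It is a point of comparison, not a description of how sent_tokenize works internally.

```python
import re

def naive_sentence_split(text):
    """Split after '.', '!' or '?' whenever whitespace follows (deliberately naive)."""
    parts = re.split(r'(?<=[.!?])\s+', text.strip())
    return [p for p in parts if p]

sample = "Good bagels cost $2.88 in N.Y.C. Hey Prof. Ipeirotis, please buy me two of them."
pieces = naive_sentence_split(sample)

for piece in pieces:
    print(repr(piece))
# The sample has two sentences, but the baseline returns three pieces,
# because it also breaks after the abbreviation "Prof."
```

The decimal point in $2.88 survives only because no whitespace follows it. Abbreviations such as "Prof." are what defeat the simple rule.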

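2. Word Tokenization

Word tokenization is considerably more involved than splitting on whitespace, because it has to deal with abbreviations, currency symbols, and punctuation attached to words: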

import string

for sentence in nltk.sent_tokenize(example):
    tokens = nltk.word_tokenize(sentence)
    # Drop punctuation tokens and lowercase the rest (numbers such as 2.88 are kept)
    words = [w.lower() for w in tokens if w not in string.punctuation]
    print("Processed tokens:", words)
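The gap between whitespace splitting and real tokenization can be illustrated with the standard library alone. The sketch below is an assumption-laden toy: TOKEN_PATTERN is a hypothetical regular expression written for this one sentence, and its output will not match what nltk.word_tokenize produces.

```python
import re

sentence = "Hey Prof. Ipeirotis, please buy me two bagels for $2.88."

# Whitespace splitting leaves punctuation glued to the neighbouring word.
naive = sentence.split()

# Hypothetical pattern, tried left to right:
#   1. a price or number such as $2.88
#   2. a word, optionally with internal periods or apostrophes
#   3. any other single non-space character (punctuation)
TOKEN_PATTERN = re.compile(r"\$?\d+(?:\.\d+)?|\w+(?:[.']\w+)*|\S")
tokens = TOKEN_PATTERN.findall(sentence)

print(naive)
print(tokens)
```

In the whitespace version, "Ipeirotis," and "$2.88." carry trailing punctuation, so they would be counted as different words from "Ipeirotis" and "$2.88". The regex version separates them, at the cost of also detaching the period from "Prof.".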

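Word Frequency Analysis and Zipf's Law

1. Basic Word Frequency Counts

Taking Darwin's On the Origin of Species as an example, we can run a word frequency analysis: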

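# Read the whole book; the with-block closes the file afterwards
with open('/data/origin-of-species.txt', 'r') as f:
    content = f.read()
tokens = nltk.word_tokenize(content)
fdist = nltk.FreqDist(tokens)

print("Total tokens:", len(tokens))
print("Distinct tokens:", len(fdist))
# Lookups are case-sensitive: 'Species' is counted separately from 'species'
print("Occurrences of 'species':", fdist["species"])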
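FreqDist is built on Python's collections.Counter, so the same three quantities can be reproduced on a toy token list without NLTK. This is a minimal sketch. The sentence in toy_tokens is made up for illustration and is not taken from the data file.

```python
from collections import Counter

toy_tokens = ("the origin of species by means of natural selection "
              "or the preservation of favoured races").split()

counts = Counter(toy_tokens)

total = len(toy_tokens)      # every token, repeats included
distinct = len(counts)       # size of the vocabulary

print("Total tokens:", total)                        # 15
print("Distinct tokens:", distinct)                  # 12
print("Occurrences of 'of':", counts["of"])          # 3
print("Occurrences of 'darwin':", counts["darwin"])  # 0, missing keys count as zero
```

Because the counts are keyed by the exact token string, lowercasing the tokens before counting merges variants such as "Species" and "species".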

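2. Visualizing Zipf's Law

Zipf's law states that in natural-language text, word frequency follows a power law in rank: the frequency of the r-th most frequent word is roughly proportional to 1/r: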

# Plot the counts of the 100 most frequent tokens (requires matplotlib)
fdist.plot(100, cumulative=False)
# Plot the running total of those counts
fdist.plot(100, cumulative=True)
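A power law becomes a straight line on log-log axes, so one rough check of Zipf's law is to fit a line to log(frequency) against log(rank) and look at the slope. The sketch below does this with ordinary least squares. The function name zipf_slope and the synthetic counts are assumptions made for illustration, and a slope close to -1 indicates the classic Zipf shape.

```python
import math
from collections import Counter

def zipf_slope(counts):
    """Least-squares slope of log(frequency) against log(rank).

    Expects at least two distinct words.
    """
    freqs = sorted(counts.values(), reverse=True)
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Synthetic counts that follow frequency = 1200 / rank exactly.
ideal = Counter({f"w{rank}": 1200 // rank for rank in (1, 2, 3, 4, 5, 6)})
slope = zipf_slope(ideal)
print(round(slope, 6))  # -1.0 for this idealized data
```

Since the function only needs .values(), it should also accept the fdist object built earlier. Real text will give a slope near, but not exactly, -1.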

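The analysis shows:

The 100 most frequent words account for more than 50% of the text
2,666 words (34.7% of the distinct words) occur only once (these are called hapaxes, or hapax legomena)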
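Both findings are simple ratios over the frequency table. As a minimal sketch, the two hypothetical helpers below compute them for any Counter-like object. The toy counts are invented so the arithmetic is easy to check by hand.

```python
from collections import Counter

def top_n_coverage(counts, n):
    """Share of all tokens covered by the n most frequent words."""
    total = sum(counts.values())
    covered = sum(count for _, count in counts.most_common(n))
    return covered / total

def hapax_share(counts):
    """Share of distinct words that occur exactly once."""
    hapaxes = [word for word, count in counts.items() if count == 1]
    return len(hapaxes) / len(counts)

toy = Counter("a a a a b b b c c d e f".split())

print(top_n_coverage(toy, 2))  # 7 of 12 tokens, about 0.583
print(hapax_share(toy))        # 3 of 6 distinct words, 0.5
```

NLTK's FreqDist also offers hapaxes() and N() for the list of once-only words and the total token count, so the same ratios can be read directly off fdist.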

3. Stop Word Handling

Stop words (such as "the" and "and") usually carry little topical information and can be filtered out:

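from nltk.corpus import stopwords

# Use a separate name so the imported stopwords corpus reader is not overwritten
stop_words = set(stopwords.words('english'))
stop_words.update(['one', 'may', 'would'])  # Extend the stop word list

def get_most_frequent_words(text, top):
    # Lowercase, keep alphabetic tokens only, and drop stop words
    content = [w.lower() for w in text
               if w.lower() not in stop_words and w.isalpha()]
    return nltk.FreqDist(content).most_common(top)

print("Most frequent words after stop word removal:", get_most_frequent_words(tokens, 10))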

Text Distribution Analysis

1. Dispersion Plot

A dispersion plot shows where chosen keywords occur across the length of a text:

text = nltk.Text(tokens)
text.dispersion_plot(["species", "natural", "selection", "evolution"])
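The data behind such a plot is just the list of token positions at which each keyword occurs. Here is a minimal sketch of that lookup. The helper name word_offsets and the toy sentence are assumptions for illustration, and matching is exact and case-sensitive.

```python
def word_offsets(tokens, targets):
    """Map each target word to the token positions where it occurs."""
    offsets = {target: [] for target in targets}
    for position, token in enumerate(tokens):
        if token in offsets:
            offsets[token].append(position)
    return offsets

toy_tokens = ("Natural selection acts on species ; species vary , "
              "and selection preserves variation").split()

result = word_offsets(toy_tokens, ["species", "selection", "evolution"])
print(result)
# {'species': [4, 6], 'selection': [1, 10], 'evolution': []}
```

Plotting each list of positions on its own horizontal row gives the dispersion picture: dense clusters mark passages where a word is discussed heavily, and an empty row means the word never appears.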

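2. Practice Exercises

NLTK ships with several classic texts that can be used for practice. They become available as text1, text2, and so on after running from nltk.book import *:

text1: Moby Dick
text2: Sense and Sensibility
text3: The Book of Genesis
text4: US presidential inaugural addresses
text5: Chat corpus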

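Key Concepts Summary

Frequency distribution: the basic method for counting how often each word occurs in a text
Tokenization: splitting text into meaningful units, which is more involved than splitting on whitespace
Zipf's law: describes the power-law shape of word frequency distributions in natural language
Stop word filtering: removing words that are frequent but carry little information
Dispersion analysis: examining where words occur across the length of a text

Mastering these basic techniques lays a solid foundation for more advanced text mining and natural language processing tasks.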
