Logistic Regression for Sarcasm Detection in Text
最编程
2024-08-13 10:46:23
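I. Experiment Objectives and Requirements
1) Objectives
- Understand the basic principles of training a classification model;
- Learn the workflow for interpreting and improving a model;
- Become familiar with the logistic regression approach to sarcasm detection in text.
2) Requirements
- Write the source program for the assigned task;
- Analyze in advance the problems that may come up during the hands-on session, and decide on debugging steps and test methods;
- Feed in a reasonable amount of test data and analyze the results;
- After the session, write a careful lab report that analyzes and summarizes the problems encountered.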
II. Experiment Environment (Tools, Configuration, etc.)
- Hardware: one computer;
- Software: macOS; the experiment was developed in Jupyter Notebook.
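III. Experiment Content (Plan, Steps, Design Approach, etc.)
1) Plan
- Follow the Lanqiao Cloud Course (蓝桥云课) tutorial to complete this experiment;
- Extract features with weighted term frequencies (TF-IDF), then classify them with a logistic regression model to detect sarcastic content;
- Build training and test sets from the existing data, do some simple visualization, then train a classifier and improve it, aiming for reasonably high accuracy.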
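As a rough illustration of the weighting idea in the plan above, the sketch below runs scikit-learn's TfidfVectorizer on three made-up sentences. The sentences and variable names are assumptions for illustration only and are not part of the experiment data. A word that occurs in every document ("idea") ends up with a lower weight than a word that occurs in only one ("yeah").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three hypothetical documents; "idea" appears in all of them.
docs = ["yeah right great idea", "great idea", "thanks for the idea"]

vec = TfidfVectorizer()
X = vec.fit_transform(docs)          # sparse matrix: documents x vocabulary
vocab = vec.vocabulary_              # maps each term to its column index

row0 = X[0].toarray().ravel()        # TF-IDF weights of the first document
for term in ["yeah", "great", "idea"]:
    print(term, round(float(row0[vocab[term]]), 3))
```

The rarer a term is across documents, the larger its weight within a document, which is what lets the classifier focus on distinctive words.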
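2) Steps
- Define the model;
- Process and load the data;
- Train the model;
- Visualize the training process;
- Test, and revise repeatedly.
The steps above are simply my personal habits when working on deep learning projects or experiments. In any particular project, getting the best model usually takes many rounds of trial and revision along the way.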
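3) Design Approach
- The model should be highly configurable, so that parameters and the model itself are easy to change for repeated experiments;
- The code should be well organized, so its structure is clear at a glance;
- The code should be well documented, so that other people can understand it.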
IV. Results
- Load the corpus and preview it:
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head()
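Figure 1: Dataset preview
- Inspect the column types and non-null counts. The comment column has fewer non-null entries than the other columns, which means it contains missing values; those rows are simply dropped: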
train_df.info()
Figure 2: Output of train_df.info()
train_df.dropna(subset=['comment'], inplace=True)
- Print the label counts to check whether the classes are balanced:
train_df['label'].value_counts()
Figure 3: Label counts
- Visualize the comment-length distributions of sarcastic and normal text:
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(
    np.log1p).hist(label='normal', alpha=.5)
train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(
    np.log1p).hist(label='sarcastic', alpha=.5)
plt.legend()
Figure 4: Comment lengths of sarcastic and normal text
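The histograms plot np.log1p of the comment length rather than the raw length, presumably so that a few very long comments do not stretch the x-axis. The sketch below shows the effect on a handful of hypothetical lengths (the values are made up for illustration):

```python
import numpy as np

# Hypothetical comment lengths spanning several orders of magnitude.
lengths = np.array([0, 9, 99, 9999])

# log1p(x) = log(1 + x); it is defined at 0, unlike a plain log.
log_lengths = np.log1p(lengths)
print(log_lengths)
```

Because log1p(0) is 0, empty comments do not cause errors, and lengths that differ by a factor of 100 end up only a few units apart.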
- Use groupby to rank subreddits by their number of sarcastic comments:
sub_df = train_df.groupby('subreddit')['label'].agg(['size', 'mean', 'sum'])
sub_df.sort_values(by='sum', ascending=False).head(10)
Figure 5: Subreddits ranked by number of sarcastic comments
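To make the three aggregated columns concrete, here is a small sketch on a made-up table (the toy data is an assumption, not taken from the dataset). With 0/1 labels, size is the number of comments, sum is the number of sarcastic comments, and mean is the sarcastic share.

```python
import pandas as pd

# Hypothetical miniature version of the training table.
toy = pd.DataFrame({
    "subreddit": ["a", "a", "b", "b", "b"],
    "label":     [1,   0,   1,   1,   0],
})

stats = toy.groupby("subreddit")["label"].agg(["size", "mean", "sum"])
ranked = stats.sort_values(by="sum", ascending=False)
print(ranked)
```

Sorting by sum ranks boards by absolute count, while sorting by mean (after filtering on size) ranks them by proportion, which is how the later steps use the same table.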
- Build word clouds with the wordcloud module:
from wordcloud import WordCloud, STOPWORDS
wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS,
                      max_words=200, max_font_size=100,
                      random_state=17, width=800, height=400)
# Note: str() of a long Series is pandas' abbreviated display text,
# so each cloud is drawn from only a sample of the comments.
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 1, 'comment']))
plt.imshow(wordcloud)
Figure 6: Word cloud 1 (sarcastic comments)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 0, 'comment']))
plt.imshow(wordcloud)
Figure 7: Word cloud 2 (normal comments)
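- Show the 10 subreddits with the highest share of sarcastic comments among those with more than 1000 comments:
sub_df[sub_df['size'] > 1000].sort_values(by='mean', ascending=False).head(10)
Figure 8: Top 10 subreddits by sarcastic share
- Show the 10 users with the highest share of sarcastic comments among those who posted more than 300 comments:
sub_df = train_df.groupby('author')['label'].agg(['size', 'mean', 'sum'])
sub_df[sub_df['size'] > 300].sort_values(by='mean', ascending=False).head(10)
Figure 9: Top 10 users by sarcastic share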
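- Train the sarcasm classification model and evaluate its accuracy on the held-out validation set:
# Split the data first (same call as in the source appendix)
train_texts, valid_texts, y_train, y_valid = train_test_split(
    train_df['comment'], train_df['label'], random_state=17)
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs', random_state=17, verbose=1)
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), ('logit', logit)])
tfidf_logit_pipeline.fit(train_texts, y_train)
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
accuracy_score(y_valid, valid_pred)
Figure 10: Prediction result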
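- Define a function, plot_confusion_matrix, that draws the confusion matrix (the full code is in the source appendix):
Figure 11: The confusion matrix function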
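For readers unfamiliar with the layout of a confusion matrix, the following minimal sketch computes one on made-up labels (the label lists are assumptions for illustration). scikit-learn puts true classes on the rows and predicted classes on the columns; the plotting function in the appendix transposes the matrix so that predictions are on the rows instead.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (1 = sarcastic, 0 = normal).
y_true = [0, 0, 0, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)   # rows: true class, columns: predicted class
cm_t = cm.T                             # rows: predicted class, as in the appendix plot

print(cm)
print(cm_t)
```

In cm, the off-diagonal entry in the first row counts normal comments that were flagged as sarcastic, and the off-diagonal entry in the second row counts sarcastic comments that were missed.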
- Use eli5 to show the weights the classifier assigns to text features when making a prediction:
import eli5
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                  vec=tfidf_logit_pipeline.named_steps['tf_idf'])
Figure 12: Weights of the text features
- Next, add a subreddit feature to improve the model. Split it the same way, and be sure to use the same random_state so that the rows stay aligned with the comment split above:
subreddits = train_df['subreddit']
train_subreddits, valid_subreddits = train_test_split(
    subreddits, random_state=17)
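The alignment claim can be checked on toy data. In the sketch below (all lists are made-up placeholders), two separate train_test_split calls with the same random_state, applied to inputs of the same length, select the same rows in the same order.

```python
from sklearn.model_selection import train_test_split

# Hypothetical parallel columns: item i of each list belongs to the same row.
comments = [f"c{i}" for i in range(20)]
subs = [f"s{i}" for i in range(20)]
labels = [i % 2 for i in range(20)]

tr_c, va_c, tr_y, va_y = train_test_split(comments, labels, random_state=17)
tr_s, va_s = train_test_split(subs, random_state=17)

# Strip the letter prefix and compare the row numbers picked by each call.
train_aligned = [c[1:] for c in tr_c] == [s[1:] for s in tr_s]
valid_aligned = [c[1:] for c in va_c] == [s[1:] for s in va_s]
print(train_aligned, valid_aligned)
```

An alternative that avoids relying on matching seeds is to pass all columns to a single train_test_split call, which returns a train/validation pair for each of them.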
- Then, again using TF-IDF, build two TfidfVectorizer instances to extract features from comment and subreddits respectively:
tf_idf_texts = TfidfVectorizer(
    ngram_range=(1, 2), max_features=50000, min_df=2)
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))
- Extract the features with the two TfidfVectorizer instances:
X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)
X_train_texts.shape, X_valid_texts.shape
X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)
X_train_subreddits.shape, X_valid_subreddits.shape
Figure 13: Feature extraction
- Concatenate the extracted features:
from scipy.sparse import hstack
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])
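As a small illustration of what hstack does here, the sketch below joins two tiny sparse matrices column-wise (the matrices are made up; in the experiment they are the comment and subreddit TF-IDF matrices). The row count must match, and the column counts add up.

```python
from scipy.sparse import csr_matrix, hstack

# Hypothetical feature blocks for the same two rows.
text_feats = csr_matrix([[1, 0], [0, 2]])   # 2 rows x 2 text features
sub_feats = csr_matrix([[3], [0]])          # 2 rows x 1 subreddit feature

combined = hstack([text_feats, sub_feats]).tocsr()
print(combined.shape)
print(combined.toarray())
```

Because each row of the result is the text features followed by the subreddit features of the same comment, both blocks must come from splits that are aligned row by row.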
- Fit logistic regression again on the combined features and predict:
logit.fit(X_train, y_train)
valid_pred = logit.predict(X_valid)
accuracy_score(y_valid, valid_pred)
Figure 14: Accuracy after the improvement
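Problems encountered and solutions
- Problem: the final result fell short of my expectations; the accuracy was too low.
- Solution: re-implemented logistic regression, lowered the target value of the cost function, and set the learning parameter to 0.1, which improved accuracy to some extent. The code is as follows: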
import numpy as np
from matplotlib import pyplot as plt


class LogisticRegression:
    def __init__(self, alpha=0.3):
        self.coef_ = 0.0
        self.intercept = 0.
        self.theta = None
        self.alpha = alpha          # learning rate
        self.cost_list = []
        # Number of iterations performed
        self.iter_count = 0

    def fit(self, x, y, threshold):
        if x.shape[0] != y.shape[0]:
            raise ValueError('x and y must contain the same number of samples')
        self.m = x.shape[0]
        self.n_feature = x.shape[1]
        self.theta = np.zeros(x[0].size)
        # Normalize the data
        self.y = y
        self.x = self.normalize_(x)
        self.gradient_descent(threshold)
        self.coef_ = self.theta

    def gradient_descent(self, threshold):
        cost = 100000.0
        # threshold = 0.1
        # Trade time for accuracy: iterate until the cost is small enough
        while abs(cost) > threshold:
            self.theta = self.theta - self.alpha * self.partialDerivative()
            self.intercept = self.intercept - self.alpha * self.iteratedFunctionForIntersect()
            cost = -self.Jfunction()
            # Keep the history for later visualization
            self.cost_list.append(cost)
            self.iter_count += 1

    # Min-max normalization: rescale every feature to the [0, 1] interval
    def normalize_(self, x):
        offset = np.zeros(self.n_feature)
        scalar = np.ones(self.n_feature)
        for feature_idx in range(0, self.n_feature):
            col = x[:, feature_idx]
            col_min = col.min()
            col_max = col.max()
            if col_min != col_max:
                scalar[feature_idx] = 1.0 / (col_max - col_min)
            elif col_max != 0:
                # Constant column; guard against dividing by zero
                scalar[feature_idx] = 1.0 / col_max
            offset[feature_idx] = col_min
        x = (x - offset) * scalar
        return x

    # Activation function
    def sigmoid(self, z):
        e_part = np.exp(-z)
        return 1 / (1 + e_part)

    def hypotheticFun(self, x):
        z = np.dot(self.theta, x) + self.intercept
        return self.sigmoid(z)

    def error_dist(self, x, y):
        return self.hypotheticFun(x) - y

    # Mean log-likelihood; the cost is its negative (see gradient_descent)
    def Jfunction(self):
        total = 0
        for i in range(0, self.m):
            h = self.hypotheticFun(self.x[i])
            total += self.y[i] * np.log(h) + (1 - self.y[i]) * np.log(1 - h)
        return 1 / self.m * total

    # Partial derivatives of the cost with respect to theta
    def partialDerivative(self):
        h = np.zeros(self.m)
        for i in range(0, self.m):
            h[i] = self.hypotheticFun(self.x[i])
        dist = h - self.y
        # np.mat was removed in NumPy 2.0; a plain matrix product
        # yields the gradient with shape (n_feature,)
        result = dist @ self.x / self.m
        return result

    # Partial derivative of the cost with respect to the intercept
    def iteratedFunctionForIntersect(self):
        total = 0
        for i in range(0, self.m):
            err = self.error_dist(self.x[i], self.y[i])
            total += err
        return 1 / self.m * total

    def predict(self, x):
        # Normalize the data (note: uses the min/max of the data passed in,
        # not the values seen during fit)
        x = self.normalize_(x)
        y_pred = []
        for element in x:
            y_pred.append(self.hypotheticFun(element))
        y_pred = np.array(y_pred)
        return np.array(y_pred >= 0.5, dtype='float')

    def plantCostDec(self):
        # Plot the cost recorded at every iteration
        plt.plot(range(0, self.iter_count), self.cost_list, color="red", label="costFunNum")
        plt.legend()
        plt.show()
Figure 15: Accuracy comparison between models
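The update rule in the class above is ordinary batch gradient descent on the cross-entropy cost. The sketch below is a vectorized variant on a made-up one-feature dataset, offered only as an illustration: the function name fit_logistic, the max_iter safety cap, and the clipping constant are assumptions that do not appear in the original code, and normalization is left out for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, alpha=0.3, threshold=0.1, max_iter=10_000):
    """Batch gradient descent; stops when the cost drops to `threshold`
    or after `max_iter` steps (the cap is an added safeguard)."""
    m, n = X.shape
    theta = np.zeros(n)
    intercept = 0.0
    costs = []
    for _ in range(max_iter):
        err = sigmoid(X @ theta + intercept) - y      # h - y for every sample
        theta -= alpha * (X.T @ err) / m              # gradient w.r.t. theta
        intercept -= alpha * err.mean()               # gradient w.r.t. intercept
        h = np.clip(sigmoid(X @ theta + intercept), 1e-12, 1 - 1e-12)
        cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
        costs.append(cost)
        if cost <= threshold:
            break
    return theta, intercept, costs

# Hypothetical, linearly separable toy data.
X = np.array([[0.0], [0.2], [0.8], [1.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

theta, intercept, costs = fit_logistic(X, y)
pred = (sigmoid(X @ theta + intercept) >= 0.5).astype(float)
print(pred, len(costs))
```

Replacing the per-sample Python loops with matrix products gives the same gradients with far fewer interpreter steps, and the iteration cap avoids an endless loop when the cost can never reach the threshold.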
V. Source Code Appendix
import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head()
train_df.info()
train_df.dropna(subset=['comment'], inplace=True)
train_texts, valid_texts, y_train, y_valid = \
    train_test_split(train_df['comment'], train_df['label'], random_state=17)
train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(
    np.log1p).hist(label='sarcastic', alpha=.5)
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(
    np.log1p).hist(label='normal', alpha=.5)
plt.legend()
!pip install wordcloud  # install the required module
from wordcloud import WordCloud, STOPWORDS
wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS,
                      max_words=200, max_font_size=100,
                      random_state=17, width=800, height=400)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 1, 'comment']))
plt.imshow(wordcloud)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 0, 'comment']))
plt.imshow(wordcloud)
sub_df = train_df.groupby('subreddit')['label'].agg(['size', 'mean', 'sum'])
sub_df.sort_values(by='sum', ascending=False).head(10)
sub_df[sub_df['size'] > 1000].sort_values(by='mean', ascending=False).head(10)
sub_df = train_df.groupby('author')['label'].agg(['size', 'mean', 'sum'])
sub_df[sub_df['size'] > 300].sort_values(by='mean', ascending=False).head(10)
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
                           random_state=17, verbose=1)
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),
                                 ('logit', logit)])
tfidf_logit_pipeline.fit(train_texts, y_train)
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
accuracy_score(y_valid, valid_pred)
def plot_confusion_matrix(actual, predicted, classes,
                          normalize=False,
                          title='Confusion matrix', figsize=(7, 7),
                          cmap=plt.cm.Blues, path_to_save_fig=None):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    from sklearn.metrics import confusion_matrix
    # Transpose so that rows are predicted labels and columns are true labels
    cm = confusion_matrix(actual, predicted).T
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')
    if path_to_save_fig:
        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')
plot_confusion_matrix(y_valid, valid_pred,
                      tfidf_logit_pipeline.named_steps['logit'].classes_, figsize=(8, 8))
!pip install eli5  # install the required module
import eli5
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                  vec=tfidf_logit_pipeline.named_steps['tf_idf'])
subreddits = train_df['subreddit']
train_subreddits, valid_subreddits = train_test_split(
    subreddits, random_state=17)
tf_idf_texts = TfidfVectorizer(
    ngram_range=(1, 2), max_features=50000, min_df=2)
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))
X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)
X_train_texts.shape, X_valid_texts.shape
X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)
X_train_subreddits.shape, X_valid_subreddits.shape
from scipy.sparse import hstack
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])
X_train.shape, X_valid.shape
logit.fit(X_train, y_train)
valid_pred = logit.predict(X_valid)
accuracy_score(y_valid, valid_pred)