
Logistic Regression for Sarcasm Detection


I. Experiment Objectives and Requirements

1) Objectives

  • Understand the basic principles of training a classification model;
  • Master the workflow of model interpretation and model improvement;
  • Become familiar with the logistic regression approach to sarcasm detection.

2) Requirements

  • Write the source program according to the assignment;
  • Analyze in advance the problems that may arise during the lab, and decide on debugging steps and testing methods;
  • Feed in a reasonable amount of test data and analyze the results;
  • After the lab, write up a careful report analyzing and summarizing the problems encountered.

II. Experiment Environment (Tools, Configuration, etc.)

  • Hardware: one computer;
  • Software: macOS; the experiment was developed in Jupyter Notebook.

III. Experiment Content (Plan, Steps, Design)

1) Plan

  • Follow along with the Lanqiao (蓝桥云课) online course to complete this experiment;
  • Use word frequencies with TF-IDF weighting, then classify with a logistic regression model to detect sarcastic content (see the sketch after this list);
  • Build training and test sets from the existing data, do some simple visualization, then train the classification model and improve it, aiming for high accuracy.
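
For intuition, here is a minimal sketch of this approach on a toy corpus (the sentences, labels, and default parameters are made up for illustration; the actual data and settings appear in Section IV):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# hypothetical toy corpus and labels, for illustration only
texts = ["oh great, another monday", "the weather is nice today",
         "sure, because that always works", "I enjoyed the movie"]
labels = [1, 0, 1, 0]  # 1 = sarcastic, 0 = normal

# TF-IDF turns word frequencies into weighted features;
# logistic regression then classifies on those weights
pipe = Pipeline([('tf_idf', TfidfVectorizer()),
                 ('logit', LogisticRegression())])
pipe.fit(texts, labels)
print(pipe.predict(["what a wonderful surprise"]))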

2) Steps

  • Define the model;
  • Process and load the data;
  • Train the model;
  • Visualize the training process;
  • Test, and revise repeatedly. The steps above are just personal habits from my own deep learning projects and experiments; for any given project, reaching the best model usually takes many rounds of trial and revision.

3) Design

  • The model should be highly configurable, making it easy to change parameters, swap models, and experiment repeatedly;
  • The code should be well organized and easy to follow at a glance;
  • The code should be well documented so that others can understand it.

IV. Experiment Results

  1. Load the corpus and preview it:
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head()

Figure 1: Dataset preview


  2. Inspect the dataset's column info. The comment column has fewer entries than the other columns, which means there are missing values; drop those rows directly:
train_df.info()

Figure 2: Dataset info

train_df.dropna(subset=['comment'], inplace=True)

  3. Print the label counts to check whether the classes are balanced:
train_df['label'].value_counts()

Figure 3: Label distribution


  4. Visualize the lengths of sarcastic and normal comments:
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(
    np.log1p).hist(label='normal', alpha=.5)
train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(
    np.log1p).hist(label='sarcastic', alpha=.5)
plt.legend()

Figure 4: Lengths of sarcastic and normal comments


  5. Use groupby to rank subreddits by their number of sarcastic comments:
sub_df = train_df.groupby('subreddit')['label'].agg([np.size, np.mean, np.sum])
sub_df.sort_values(by='sum', ascending=False).head(10)

Figure 5: Subreddits ranked by number of sarcastic comments


  6. Build word clouds with the wordcloud module:
from wordcloud import WordCloud, STOPWORDS
wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS,
                      max_words=200, max_font_size=100,
                      random_state=17, width=800, height=400)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 1, 'comment']))
plt.imshow(wordcloud)

Figure 6: Word cloud 1 (sarcastic comments)

plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 0, 'comment']))
plt.imshow(wordcloud)

Figure 7: Word cloud 2 (normal comments)


  7. Among subreddits with more than 1000 comments, list the top 10 by proportion of sarcastic comments:
sub_df[sub_df['size'] > 1000].sort_values(by='mean', ascending=False).head(10)

Figure 8: Subreddits with the highest sarcasm ratio


  8. List the 10 users with the highest proportion of sarcastic comments among those with more than 300 comments in total:
sub_df = train_df.groupby('author')['label'].agg([np.size, np.mean, np.sum])
sub_df[sub_df['size'] > 300].sort_values(by='mean', ascending=False).head(10)

Figure 9: Users with the highest sarcasm ratio
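
Before training, the comments and labels are split into training and validation sets; this split (taken from the full source in Section V) defines the train_texts, valid_texts, y_train, and y_valid used below:

train_texts, valid_texts, y_train, y_valid = \
    train_test_split(train_df['comment'], train_df['label'], random_state=17)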


  9. Train the sarcasm classification model and evaluate its accuracy on the validation set:
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
                           random_state=17, verbose=1)
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf), ('logit', logit)])
tfidf_logit_pipeline.fit(train_texts, y_train)
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
accuracy_score(y_valid, valid_pred)

Figure 10: Prediction results

  10. Define a plot_confusion_matrix helper for drawing confusion matrices (the full definition is in the source listing in Section V):

Figure 11: The confusion matrix helper
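
As in the full listing in Section V, the helper is then applied to the validation predictions:

plot_confusion_matrix(y_valid, valid_pred,
                      tfidf_logit_pipeline.named_steps['logit'].classes_,
                      figsize=(8, 8))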

  11. Use eli5 to display the feature weights the classifier relies on when making its predictions:
import eli5
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                  vec=tfidf_logit_pipeline.named_steps['tf_idf'])

Figure 12: Text feature weights
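
show_weights renders an HTML table, which suits Jupyter Notebook; outside a notebook, a plain-text report can be produced instead (a sketch using eli5's text formatter):

# same explanation as show_weights, but rendered as plain text
print(eli5.format_as_text(eli5.explain_weights(
    tfidf_logit_pipeline.named_steps['logit'],
    vec=tfidf_logit_pipeline.named_steps['tf_idf'])))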


  12. Next, add a subreddit feature to improve the model. Split it the same way; be sure to use the same random_state so the rows stay aligned with the comment split above:
subreddits = train_df['subreddit']
train_subreddits, valid_subreddits = train_test_split(
    subreddits, random_state=17)

  13. Likewise, use tf-idf to build two TfidfVectorizers for extracting features from comment and subreddit respectively:
tf_idf_texts = TfidfVectorizer(
    ngram_range=(1, 2), max_features=50000, min_df=2)
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))

  14. Use the TfidfVectorizers to extract the features:
X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)
X_train_texts.shape, X_valid_texts.shape
X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)
X_train_subreddits.shape, X_valid_subreddits.shape

Figure 13: Feature extraction


  15. Stack the extracted features together:
from scipy.sparse import hstack
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])
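
A quick shape check (also present in the full source in Section V) confirms that the matrices were concatenated horizontally:

X_train.shape, X_valid.shape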

  16. Continue with logistic regression for modeling and prediction:
logit.fit(X_train, y_train)
valid_pred = logit.predict(X_valid)
accuracy_score(y_valid, valid_pred)

Figure 14: Accuracy after improvement

Problems Encountered and Solutions

  • Problem: the final result did not match expectations; the accuracy was too low.
  • Solution: reimplemented Logistic Regression, lowered the target value of the cost function, and set the learning rate to 0.1, which improved the accuracy to some extent.
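
For reference, the quantity the gradient descent drives below the chosen threshold is the standard logistic-regression cross-entropy cost (a textbook formula; the Jfunction below computes its negative, the mean log-likelihood):

J(\theta, b) = -\frac{1}{m} \sum_{i=1}^{m} \Big[ y^{(i)} \log h(x^{(i)}) + \big(1 - y^{(i)}\big) \log\big(1 - h(x^{(i)})\big) \Big],
\qquad h(x) = \frac{1}{1 + e^{-(\theta^\top x + b)}}

The reimplementation follows: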
class LogisticRegression:
    def __init__(self, alpha=0.3):
        self.coef_ = 0.0
        self.intercept = 0.
        self.theta = None
        self.alpha = alpha        # learning rate
        self.cost_list = []       # cost history, for later visualization
        self.iter_count = 0       # number of gradient-descent iterations

    def fit(self, x, y, threshold):
        if x.shape[0] != y.shape[0]:
            raise ValueError('x and y must have the same number of samples')
        self.m = x.shape[0]
        self.n_feature = x.shape[1]
        self.theta = np.zeros(self.n_feature)
        self.y = y
        # normalize the data, remembering offset and scale for predict()
        self.x = self.normalize_(x, fit=True)
        self.gradient_descent(threshold)
        self.coef_ = self.theta

    def gradient_descent(self, threshold):
        cost = 100000.0
        # trade time for accuracy: keep iterating until the cost is small enough
        while abs(cost) > threshold:
            self.theta = self.theta - self.alpha * self.partialDerivative()
            self.intercept = self.intercept - self.alpha * self.iteratedFunctionForIntersect()
            cost = -self.Jfunction()
            # record the cost for later visualization
            self.cost_list.append(cost)
            self.iter_count += 1

    # min-max normalization: scale every feature into the [0, 1] interval
    def normalize_(self, x, fit=False):
        if fit:
            self.offset_ = np.zeros(self.n_feature)
            self.scale_ = np.ones(self.n_feature)
            for feature_idx in range(self.n_feature):
                col = x[:, feature_idx]
                col_min, col_max = col.min(), col.max()
                if col_min != col_max:
                    self.scale_[feature_idx] = 1.0 / (col_max - col_min)
                else:
                    self.scale_[feature_idx] = 1.0 / col_max
                self.offset_[feature_idx] = col_min
        return (x - self.offset_) * self.scale_

    # activation function
    def sigmoid(self, z):
        return 1 / (1 + np.exp(-z))

    def hypotheticFun(self, x):
        z = np.dot(self.theta, x) + self.intercept
        return self.sigmoid(z)

    def error_dist(self, x, y):
        return self.hypotheticFun(x) - y

    # mean log-likelihood; its negative is the cross-entropy cost
    def Jfunction(self):
        total = 0
        for i in range(self.m):
            h = self.hypotheticFun(self.x[i])
            total += self.y[i] * np.log(h) + (1 - self.y[i]) * np.log(1 - h)
        return total / self.m

    # partial derivatives of the cost with respect to theta, for gradient descent
    def partialDerivative(self):
        h = np.zeros(self.m)
        for i in range(self.m):
            h[i] = self.hypotheticFun(self.x[i])
        dist = h - self.y
        return self.x.T.dot(dist) / self.m

    # mean residual, used to update the intercept
    def iteratedFunctionForIntersect(self):
        total = 0
        for i in range(self.m):
            total += self.error_dist(self.x[i], self.y[i])
        return total / self.m

    def predict(self, x):
        # normalize with the offset and scale learned in fit
        x = self.normalize_(x)
        y_pred = np.array([self.hypotheticFun(element) for element in x])
        return np.array(y_pred >= 0.5, dtype='float')

    def plantCostDec(self):
        # plot how the cost decreases over the iterations
        plt.plot(range(self.iter_count), self.cost_list,
                 color="red", label="costFunNum")
        plt.legend()
        plt.show()
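
A minimal usage sketch of this class (the toy feature matrix, labels, and threshold are invented for illustration; the class works on dense NumPy arrays, so a sparse TF-IDF matrix would first need to be densified or reduced):

import numpy as np

# hypothetical, linearly separable toy data: 4 samples, 2 features
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0]])
y = np.array([0., 0., 1., 1.])

# note: this custom class shadows sklearn's LogisticRegression if both are imported
clf = LogisticRegression(alpha=0.1)  # learning rate 0.1, as in the report
clf.fit(X, y, threshold=0.2)         # iterate until |cost| < 0.2
print(clf.predict(X))                # -> [0. 0. 1. 1.] once converged
clf.plantCostDec()                   # plot the cost curve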

Figure 15: Model accuracy comparison

V. Appendix: Full Source Code

import os
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sns
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')
train_df = pd.read_csv('train-balanced-sarcasm.csv')
train_df.head()
train_df.info()
train_df.dropna(subset=['comment'], inplace=True)
train_texts, valid_texts, y_train, y_valid = \
  train_test_split(train_df['comment'], train_df['label'], random_state=17)
train_df.loc[train_df['label'] == 1, 'comment'].str.len().apply(
  np.log1p).hist(label='sarcastic', alpha=.5)
train_df.loc[train_df['label'] == 0, 'comment'].str.len().apply(
  np.log1p).hist(label='normal', alpha=.5)
plt.legend()
!pip install wordcloud  # install the required module
from wordcloud import WordCloud, STOPWORDS
wordcloud = WordCloud(background_color='black', stopwords=STOPWORDS,
                    max_words=200, max_font_size=100,
                    random_state=17, width=800, height=400)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 1, 'comment']))
plt.imshow(wordcloud)
plt.figure(figsize=(16, 12))
wordcloud.generate(str(train_df.loc[train_df['label'] == 0, 'comment']))
plt.imshow(wordcloud)
sub_df = train_df.groupby('subreddit')['label'].agg([np.size, np.mean, np.sum])
sub_df.sort_values(by='sum', ascending=False).head(10)
sub_df[sub_df['size'] > 1000].sort_values(by='mean', ascending=False).head(10)
sub_df = train_df.groupby('author')['label'].agg([np.size, np.mean, np.sum])
sub_df[sub_df['size'] > 300].sort_values(by='mean', ascending=False).head(10)
tf_idf = TfidfVectorizer(ngram_range=(1, 2), max_features=50000, min_df=2)
logit = LogisticRegression(C=1, n_jobs=4, solver='lbfgs',
                         random_state=17, verbose=1)
tfidf_logit_pipeline = Pipeline([('tf_idf', tf_idf),
                               ('logit', logit)])
tfidf_logit_pipeline.fit(train_texts, y_train)
valid_pred = tfidf_logit_pipeline.predict(valid_texts)
accuracy_score(y_valid, valid_pred)
def plot_confusion_matrix(actual, predicted, classes,
                          normalize=False,
                          title='Confusion matrix', figsize=(7, 7),
                          cmap=plt.cm.Blues, path_to_save_fig=None):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    import itertools
    from sklearn.metrics import confusion_matrix
    cm = confusion_matrix(actual, predicted).T
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]

    plt.figure(figsize=figsize)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=90)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('Predicted label')
    plt.xlabel('True label')

    if path_to_save_fig:
        plt.savefig(path_to_save_fig, dpi=300, bbox_inches='tight')

plot_confusion_matrix(y_valid, valid_pred,
                      tfidf_logit_pipeline.named_steps['logit'].classes_, figsize=(8, 8))

!pip install eli5  # install the required module
import eli5
eli5.show_weights(estimator=tfidf_logit_pipeline.named_steps['logit'],
                vec=tfidf_logit_pipeline.named_steps['tf_idf'])

subreddits = train_df['subreddit']
train_subreddits, valid_subreddits = train_test_split(
  subreddits, random_state=17)

tf_idf_texts = TfidfVectorizer(
  ngram_range=(1, 2), max_features=50000, min_df=2)
tf_idf_subreddits = TfidfVectorizer(ngram_range=(1, 1))

X_train_texts = tf_idf_texts.fit_transform(train_texts)
X_valid_texts = tf_idf_texts.transform(valid_texts)
X_train_texts.shape, X_valid_texts.shape
X_train_subreddits = tf_idf_subreddits.fit_transform(train_subreddits)
X_valid_subreddits = tf_idf_subreddits.transform(valid_subreddits)
X_train_subreddits.shape, X_valid_subreddits.shape
from scipy.sparse import hstack
X_train = hstack([X_train_texts, X_train_subreddits])
X_valid = hstack([X_valid_texts, X_valid_subreddits])
X_train.shape, X_valid.shape
logit.fit(X_train, y_train)
valid_pred = logit.predict(X_valid)
accuracy_score(y_valid, valid_pred)