
Research on Machine Learning-Based Automatic Segmentation and Annotation Algorithms and Corpora for Ancient Chinese


Machine Learning-based Segmentation, Tagging and Corpus Building for Ancient Chinese

Abstract

In recent years, deep learning has spread into almost every corner of research and everyday life. This thesis studies its application to natural language processing (NLP), in particular to the processing of ancient Chinese. The aim is to use computers to help scholars of ancient Chinese with specialised and tedious tasks such as dating texts, sentence segmentation, word segmentation and part-of-speech tagging. Sentence segmentation and word segmentation do not arise in English NLP and are specific to Chinese, and sentence segmentation of unpunctuated text is specific to ancient Chinese. Automating these tasks raises the efficiency of language workers, reduces errors introduced by subjective human judgement, and frees researchers from heavy groundwork so that they can devote more energy to later studies of textual transmission and interpretation.
This thesis uses long short-term memory (LSTM) networks as the backbone and designs different input and output structures to build concrete models for the different ancient-Chinese NLP tasks. The training data are ancient Chinese corpora publicly available on the Internet, part of which (Old Chinese texts) we annotated by hand. The models built here can date ancient Chinese text and perform sentence segmentation, word segmentation and part-of-speech tagging. The main work and contributions are as follows:
(1) A dating model for ancient texts built on a bi-directional LSTM (Bi-LSTM). Each character of a text is converted into a high-dimensional vector, and all the vectors of the text are fed into the network to model the non-linear relations among them; the model finally outputs an era label for the passage. Experiments show that the Bi-LSTM model performs the dating task well, with an accuracy above 80%. It provides an efficient and accurate way to date ancient texts and saves researchers time at this stage of their work.
(2) A sentence segmentation model, addressing the lack of punctuation in the originals of many ancient Chinese books. A deep neural network is trained on a large amount of already punctuated ancient Chinese text, so that the model automatically learns the segmentation conventions of a given period and genre. In the subsequent digitisation of ancient Chinese literature, sentence segmentation can then be handed over to the computer, reducing the workload of researchers.
(3) An integrated model for automatic word segmentation and part-of-speech tagging. Since no public ancient Chinese corpus with word segmentation and part-of-speech tags is available, a small data set was produced by manual annotation, stored in a database and used as the training set. Experiments show that the proposed model handles ancient Chinese segmentation and tagging well, and the database can be expanded further by combining model output with manual correction.
Taking the Bi-LSTM network as the main structure, the thesis establishes a series of models for different tasks on ancient Chinese text. Experiments show that, even on the limited corpora currently available, the proposed models already perform well and can be applied to the construction of larger corpora as auxiliary tools that help researchers annotate texts. The newly produced corpora can in turn be used to retrain the models and raise their accuracy, so that corpus and model improve each other and together advance the digitisation of ancient Chinese and the construction of large ancient Chinese corpora.
Keywords: ancient Chinese, natural language processing, dating, sentence segmentation, word segmentation, part-of-speech tagging
Table of Contents

Acknowledgements
Abstract (Chinese)
Abstract (English)
1 Introduction
1.1 Research Background and Significance
1.2 Research Content
1.3 Thesis Organization
2 Literature Review
2.1 Dating Methods for Ancient Texts
2.2 Sentence Segmentation Methods for Ancient Texts
2.3 Word Segmentation Methods for Ancient Texts
2.4 Overview of Part-of-Speech Tagging
2.5 Chapter Summary
3 Dating Model for Ancient Texts
3.1 Data Sources and Preprocessing
3.2 Model Structure
3.3 Experiments
3.4 Chapter Summary
4 Sentence Segmentation Model for Ancient Chinese
4.1 Data Sources and Preprocessing
4.2 Model Construction
4.3 Experiments and Results
4.4 Chapter Summary
5 Word Segmentation and Tagging System for Ancient Chinese and Corpus Database Construction
5.1 Data Sources and Preprocessing
5.2 Evaluation Metrics for Classification Models
5.3 Model Architecture
5.4 Experiments and Performance Analysis
5.5 Part-of-Speech Tagging
5.6 Chapter Summary
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
References
Research Content
The aim of this work is to apply mature deep-learning-based natural language processing techniques to build a series of models for ancient Chinese, covering automatic dating, sentence segmentation, and word segmentation with part-of-speech tagging, so that part of the tedious labour of scholars of ancient Chinese can be handed over to machines and the digitisation of ancient Chinese can be accelerated. Starting from this goal, the research covers the following aspects:
(1) To date ancient books, a dating model is built with a bi-directional long short-term memory (Bi-LSTM) network as its backbone. Texts of known date collected from the Internet are organised into the training set. A word2vec model converts each character of a text into a high-dimensional vector, and the character vectors of the whole text are fed into the network to model the non-linear relations among them; the model finally outputs an era label for the passage. Experiments show that the Bi-LSTM model performs the dating task well, with an accuracy above 80%; it provides an efficient and accurate way to date ancient texts and saves researchers time in this step. (A minimal sketch of the character-embedding step is given after this list.)
(2) Because the originals of many ancient Chinese books contain no punctuation, a sentence segmentation model is proposed. A deep neural network is trained on a large amount of already punctuated ancient Chinese text, so that the model automatically learns the segmentation conventions of a given period and genre; given an unpunctuated character sequence, the machine then inserts the sentence breaks automatically.
(3) For ancient Chinese word segmentation and part-of-speech tagging, the training data must first be obtained: the task requires text that is already segmented and tagged, but no public ancient Chinese corpus with word segmentation and part-of-speech tags exists. We therefore manually annotated part of the corpus to obtain a small data set and ran preliminary experiments with it, which verify that the proposed joint segmentation and tagging model handles the task well.
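As a rough illustration of the character-vector step mentioned in item (1): the sketch below is not code from the thesis; it assumes the gensim library (4.x API), and the corpus file name and hyperparameter values are illustrative only.

# Sketch (assumption): train 64-dimensional character vectors with gensim's word2vec.
from gensim.models import Word2Vec

with open('ancient_corpus.txt', encoding='utf-8') as f:
    # Treat every character as a token: each line becomes a list of characters.
    char_sentences = [list(line.strip()) for line in f if line.strip()]

w2v = Word2Vec(
    sentences=char_sentences,
    vector_size=64,   # matches the 64-dimensional character embeddings used later
    window=5,
    min_count=1,
    sg=1,             # skip-gram
)
w2v.save('char_w2v.model')
vector = w2v.wv['天']  # the 64-dimensional vector of a single character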
Thesis Organization
The thesis is organised as follows:
Chapter 1 is the introduction. It first briefly explains the background and significance of the research, then reviews the main problems faced and the work carried out, summarising them systematically so that the reader understands the nature of the problems and the corresponding solutions, and finally sketches the overall structure of the thesis.
Chapter 2 introduces and summarises related work in detail. Besides the methods that scholars have traditionally used to judge the date of ancient Chinese works, it surveys the application of natural language processing to dating ancient texts; for sentence segmentation of ancient books it introduces several commonly used traditional methods; it also reviews the state of the art in word segmentation and part-of-speech tagging and analyses the strengths and weaknesses of the common methods and algorithms.
Chapter 3, given the limited amount of ancient Chinese data available, chooses the bi-directional LSTM as the backbone of the dating model and presents the model's overall structure, then explains layer by layer how each part is composed and what it does. The chapter applies the Bi-LSTM, for the first time, to the dating of ancient Chinese texts; two groups of experiments analyse the model's performance and briefly show that, within the same period, texts within a single book and across different books are related yet mutually independent.
Chapter 4 deals with the fact that ancient Chinese books carry no punctuation. Each character of an input passage of one or more sentences is labelled to mark the positions where punctuation should appear (a rough sketch of such a labelling scheme follows). The chapter first introduces the data sources and the overall structure of the model, then its implementation, and finally analyses experiments on real data, which show that the model's accuracy is high enough for it to serve as an aid that ancient Chinese researchers can consult when punctuating texts.
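For illustration only: the label set and helper function below are our own and are not taken from the thesis; they merely show how punctuated training text can be turned into per-character labels marking where a break follows.

# Hypothetical label scheme: 'S' = a sentence break follows this character,
# 'O' = no break follows this character.
BREAKS = set(u',。!?、:;')

def make_break_labels(punctuated_text):
    """Return (characters, labels) for one punctuated passage."""
    chars, labels = [], []
    for i, ch in enumerate(punctuated_text):
        if ch in BREAKS:
            continue  # punctuation marks supply the labels, they are not input characters
        chars.append(ch)
        nxt = punctuated_text[i + 1] if i + 1 < len(punctuated_text) else ''
        labels.append('S' if nxt in BREAKS else 'O')
    return chars, labels

# chars  -> ['学', '而', '时', '习', '之', '不', '亦', '说', '乎']
# labels -> ['O', 'O', 'O', 'O', 'S', 'O', 'O', 'O', 'S']
chars, labels = make_break_labels(u'学而时习之,不亦说乎。')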
Because of the special nature of its task, Chapter 5 first describes the data sources and preprocessing of the model's training set and introduces several metrics for evaluating classification models. It then presents the main content of the chapter: an integrated system for ancient Chinese word segmentation and part-of-speech tagging based on the bi-directional LSTM. For this system the chapter proposes an encoding that merges the two kinds of labels into a single output, so that the model emits word-segmentation and part-of-speech tags at the same time (a sketch of such an encoding is given below). To evaluate the system, the small manually labelled data set is used for two groups of experiments, one on word segmentation and one on part-of-speech tagging; they show that the integrated system performs well on both tasks, and its accuracy should rise further once larger and more accurate data sets become available. Finally, the integrated model is used to build a small Old Chinese corpus, which is managed through a website built for that purpose.
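The excerpt does not give the exact tag inventory, so the sketch below only illustrates one way such a combined encoding could look; the part-of-speech abbreviations are made up for the example.

# Hypothetical combined tags: a position-in-word tag (B/M/E/S) joined with a POS tag,
# e.g. 'B-n' = first character of a noun, 'S-v' = a single-character verb.
def encode_word(word, pos):
    """Return one combined tag per character of a segmented, POS-tagged word."""
    if len(word) == 1:
        return ['S-' + pos]
    return ['B-' + pos] + ['M-' + pos] * (len(word) - 2) + ['E-' + pos]

# '天下/n' -> ['B-n', 'E-n'];  '学/v' -> ['S-v']
tags = encode_word(u'天下', 'n') + encode_word(u'学', 'v')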
Chapter 6 summarises and analyses the work and describes the directions and plans for further research.
Reposted from: http://www.biyezuopin.vip/onews.asp?id=16559

# Load and preprocess the training data
import os
import pickle
import re
import time

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from tqdm import tqdm


#   https://github.com/yongyehuang/Tensorflow-Tutorial/blob/master/Tutorial_6%20-%20Bi-directional%20LSTM%20for%20sequence%20labeling%20(Chinese%20segmentation).ipynb  #
with open('cleaned_train800-1160.txt', 'rb') as inp:
    texts = inp.read().decode('utf-16')
sentences = texts.split('\r\n')  # split into lines


# Strip irregular content (e.g. stray quotation marks at the start of a line);
# quotes that are themselves tagged as single-character words ("/s) are kept.
def clean(s):
    if u'“/s' not in s:  # quotation marks inside a tagged sentence should not be removed
        return s.replace(u'“ ', '')
    elif u'”/s' not in s:
        return s.replace(u'” ', '')
    elif u'‘/s' not in s:
        return s.replace(u'‘', '')
    elif u'’/s' not in s:
        return s.replace(u'’', '')
    else:
        return s

texts = u''.join(map(clean, sentences))  # join all the cleaned lines back into one string

# print('Length of texts is %d' % len(texts))
# print('Example of texts: \n', texts[:300])

sentence = re.split(u'[,。!?、‘’“”:()—《》]', texts)
# print('Sentences number:', len(sentence))
# print('Sentence Example:\n', sentence[2])

# ######################## add a segmentation tag to every character ########################
# Tag scheme: /s single-character word, /b word begin, /m word middle, /e word end.
sentences = []
for sent in sentence:  # 'sentence' is the list of clause strings produced by re.split above
    words = sent.split()
    tagged = []
    for word in words:
        if len(word) == 1:
            tagged.append(word + '/s  ')
        else:
            tagged.append(word[0] + '/b  '
                          + ''.join(ch + '/m  ' for ch in word[1:-1])
                          + word[-1] + '/e  ')
    sentences.append(u''.join(tagged))

########################
def get_Xy(sentence):
    """Turn a tagged sentence into ([word1, w2, ..., wn], [tag1, t2, ..., tn])."""
    words_tags = re.findall('(.)/(.)', sentence)
    if words_tags:
        words_tags = np.asarray(words_tags)
        words = words_tags[:, 0]
        tags = words_tags[:, 1]
        return words, tags  # the characters and their tags become data / label
    return None
datas = list()
labels = list()
# print('Start creating words and tags data ...')
for sentence in tqdm(iter(sentences)):
    result = get_Xy(sentence)
    if result:
        datas.append(result[0])
        labels.append(result[1])
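# Sketch only (an assumption, not part of the original listing): the pickle file
# loaded below is expected to hold padded id sequences plus the vocabularies, in
# the same order in which they are unpickled.  One plausible way to produce it
# from the `datas` / `labels` built above (max_len = 32, matching the
# configuration further down) is:
import itertools

max_len = 32
char_set = sorted(set(itertools.chain.from_iterable(datas)))
tag_set = sorted(set(itertools.chain.from_iterable(labels)))
word2id = {c: i + 1 for i, c in enumerate(char_set)}  # id 0 is reserved for padding
id2word = {i: c for c, i in word2id.items()}
tag2id = {t: i + 1 for i, t in enumerate(tag_set)}
id2tag = {i: t for t, i in tag2id.items()}

def to_ids(seq, table):
    ids = [table[x] for x in seq][:max_len]
    return ids + [0] * (max_len - len(ids))  # truncate / right-pad to max_len

X = np.asarray([to_ids(d, word2id) for d in datas])
y = np.asarray([to_ids(l, tag2id) for l in labels])

os.makedirs('data', exist_ok=True)
with open('data/data.pkl', 'wb') as outp:
    for obj in (X, y, word2id, id2word, tag2id, id2tag):
        pickle.dump(obj, outp)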


with open('data/data.pkl', 'rb') as inp:
    X = pickle.load(inp)
    y = pickle.load(inp)
    word2id = pickle.load(inp)
    id2word = pickle.load(inp)
    tag2id = pickle.load(inp)
    id2tag = pickle.load(inp)

# Split into training / validation / test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,  test_size=0.2, random_state=42)
# print('X_train.shape={}, y_train.shape={}; \nX_valid.shape={}, y_valid.shape={};\nX_test.shape={}, y_test.shape={}'.format(
#     X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_test.shape, y_test.shape))


# 3. Build the data generator
class BatchGenerator(object):
    """ Construct a Data generator. The input X, y should be ndarray or list like type.

    Example:
        Data_train = BatchGenerator(X=X_train_all, y=y_train_all, shuffle=False)
        Data_test = BatchGenerator(X=X_test_all, y=y_test_all, shuffle=False)
        X = Data_train.X
        y = Data_train.y
        or:
        X_batch, y_batch = Data_train.next_batch(batch_size)
     """

    def __init__(self, X, y, shuffle=False):
        if type(X) != np.ndarray:
            X = np.asarray(X)
        if type(y) != np.ndarray:
            y = np.asarray(y)
        self._X = X
        self._y = y
        self._epochs_completed = 0
        self._index_in_epoch = 0
        self._number_examples = self._X.shape[0]
        self._shuffle = shuffle
        if self._shuffle:
            new_index = np.random.permutation(self._number_examples)
            self._X = self._X[new_index]
            self._y = self._y[new_index]

    @property
    def X(self):
        return self._X

    @property
    def y(self):
        return self._y

    @property
    def num_examples(self):
        return self._number_examples

    @property
    def epochs_completed(self):
        return self._epochs_completed

    def next_batch(self, batch_size):
        """ Return the next 'batch_size' examples from this data set."""
        start = self._index_in_epoch
        self._index_in_epoch += batch_size
        if self._index_in_epoch > self._number_examples:
            # finished epoch
            self._epochs_completed += 1
            # Shuffle the data
            if self._shuffle:
                new_index = np.random.permutation(self._number_examples)
                self._X = self._X[new_index]
                self._y = self._y[new_index]
            start = 0
            self._index_in_epoch = batch_size
            assert batch_size <= self._number_examples
        end = self._index_in_epoch
        return self._X[start:end], self._y[start:end]


# print('Creating the data generator ...')
data_train = BatchGenerator(X_train, y_train, shuffle=True)
data_valid = BatchGenerator(X_valid, y_valid, shuffle=False)
data_test = BatchGenerator(X_test, y_test, shuffle=False)
# print('Finished creating the data generator.')

import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
from tensorflow.contrib import rnn
import numpy as np
import time
'''
For Chinese word segmentation.
'''
# ##################### configuration ######################
decay = 0.85
max_epoch = 5
max_max_epoch = 10
timestep_size = max_len = 32  # sentence length (number of characters per sample)
vocab_size = 7010  # number of distinct characters in the data + 1 (id 0 is padding), found during preprocessing
input_size = embedding_size = 64  # character-embedding dimension
class_num = 5  # number of tag classes (the four position tags plus padding)
hidden_size = 128  # number of hidden units per LSTM direction
layer_num = 2  # number of Bi-LSTM layers
max_grad_norm = 5.0  # gradients with a larger global norm are clipped to this value

lr = tf.placeholder(tf.float32, [])  # learning rate, fed in at run time
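# ----------------------------------------------------------------------------
# The listing breaks off here, before the network itself is defined.  The code
# below is only a rough sketch of how a Bi-LSTM tagger could be assembled from
# the hyperparameters above; it follows the general shape of the tutorial
# linked at the top of this script, not the thesis's actual implementation,
# and uses a single Bi-LSTM layer for brevity.
X_inputs = tf.placeholder(tf.int32, [None, timestep_size], name='X_inputs')
y_inputs = tf.placeholder(tf.int32, [None, timestep_size], name='y_inputs')

embedding = tf.get_variable('embedding', [vocab_size, embedding_size], dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, X_inputs)  # [batch, time, embedding]

cell_fw = rnn.LSTMCell(hidden_size)
cell_bw = rnn.LSTMCell(hidden_size)
(out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(cell_fw, cell_bw, inputs,
                                                      dtype=tf.float32)
outputs = tf.concat([out_fw, out_bw], axis=-1)  # [batch, time, 2 * hidden]
outputs = tf.reshape(outputs, [-1, hidden_size * 2])

softmax_w = tf.get_variable('softmax_w', [hidden_size * 2, class_num])
softmax_b = tf.get_variable('softmax_b', [class_num])
logits = tf.matmul(outputs, softmax_w) + softmax_b  # one prediction per character

loss = tf.reduce_mean(tf.nn.sparse_softmax_cross_entropy_with_logits(
    labels=tf.reshape(y_inputs, [-1]), logits=logits))
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars), max_grad_norm)
train_op = tf.train.AdamOptimizer(lr).apply_gradients(zip(grads, tvars))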
						
