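Abstract
(1) A long short-term memory (LSTM) neural network is used as the backbone of the dating model for ancient texts. In the dating model, each character in the text is converted into a high-dimensional vector, and all vectors of the text are then fed into the model, which analyzes the nonlinear relationships among them. Finally, the model outputs an era label for the passage. Experimental results show that the model built on a bi-directional long short-term memory (Bi-LSTM) network performs the dating task well, with accuracy above 80%. The dating model provides an efficient and accurate method for dating ancient texts, saving researchers time in the dating process.
Keywords: ancient Chinese, natural language processing, dating, sentence segmentation, word segmentation, part-of-speech tagging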
Machine Learning-based Segmentation, Tagging
and Corpus Building for Ancient Chinese
In recent years, deep learning has spread into many areas of research and everyday life. This paper studies the application of deep learning to natural language processing (NLP), and in particular to ancient Chinese. Its aim is to let computers assist researchers of ancient Chinese with specialized and tedious tasks such as dating, sentence segmentation (punctuation restoration), word segmentation, and part-of-speech tagging. Word segmentation is characteristic of Chinese NLP in general, while sentence segmentation is specific to ancient Chinese. Handling these tasks by computer improves the efficiency of language researchers and reduces subjective human error, freeing them from heavy groundwork so that they can devote more energy to other aspects of research.
In this paper, we use long short-term memory (LSTM) neural networks as the backbone and design different input and output structures to build task-specific models for ancient Chinese NLP. The training set consists of ancient Chinese corpora publicly available on the Internet, part of which we annotated manually. The models designed in this paper can perform dating, sentence segmentation, word segmentation, and part-of-speech tagging of ancient Chinese text. The main work and contributions of this paper are as follows:
(1) A bi-directional long short-term memory (Bi-LSTM) network is used as the backbone of the ancient text dating model. In the dating model, each character in the text is converted into a high-dimensional vector, and all vectors of the text are then fed into the model, which analyzes the nonlinear relationships among them. Finally, the model outputs an era label for the passage. Experiments show that the Bi-LSTM model performs the dating task well, with prediction accuracy above 80%. The model provides an efficient and accurate method for dating ancient Chinese texts, saving researchers time in the dating process.
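The data flow described in (1) can be illustrated with a small NumPy sketch: character ids are looked up in an embedding table, the vectors are read by one LSTM from left to right and by another from right to left, and the two final states are mapped to a probability for each era class. This is only a sketch under stated assumptions, not the model trained in this paper: the function names, the layer sizes, the use of the final hidden states, and the random untrained weights are all hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def lstm_final_state(vectors, W, b):
    """Run one LSTM direction over vectors (T, D); return the last hidden state."""
    hidden = b.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in vectors:
        z = W @ np.concatenate([x, h]) + b            # all four gates at once
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # update the cell state
        h = sigmoid(o) * np.tanh(c)                   # new hidden state
    return h


def init_params(vocab_size, embed_dim, hidden, n_eras, seed=0):
    """Random, untrained weights; every size here is illustrative."""
    rng = np.random.default_rng(seed)
    return {
        "embedding": rng.normal(0.0, 0.1, (vocab_size, embed_dim)),
        "W_fw": rng.normal(0.0, 0.1, (4 * hidden, embed_dim + hidden)),
        "b_fw": np.zeros(4 * hidden),
        "W_bw": rng.normal(0.0, 0.1, (4 * hidden, embed_dim + hidden)),
        "b_bw": np.zeros(4 * hidden),
        "W_out": rng.normal(0.0, 0.1, (n_eras, 2 * hidden)),
        "b_out": np.zeros(n_eras),
    }


def predict_era_probs(char_ids, p):
    """Map a sequence of character ids to one probability per era class."""
    vectors = p["embedding"][np.asarray(char_ids)]                # (T, embed_dim)
    h_fw = lstm_final_state(vectors, p["W_fw"], p["b_fw"])        # left to right
    h_bw = lstm_final_state(vectors[::-1], p["W_bw"], p["b_bw"])  # right to left
    logits = p["W_out"] @ np.concatenate([h_fw, h_bw]) + p["b_out"]
    exp = np.exp(logits - logits.max())                           # stable softmax
    return exp / exp.sum()


params = init_params(vocab_size=50, embed_dim=8, hidden=16, n_eras=4)
probs = predict_era_probs([3, 17, 42, 5], params)
print(probs.shape, int(probs.argmax()))
```

With trained weights, the index of the largest probability would be taken as the predicted era label for the passage.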
(2) Because the original editions of some ancient Chinese books lack punctuation, this paper proposes a sentence segmentation model. A deep neural network learns from a large number of ancient Chinese texts that have already been punctuated, so that the model automatically learns the sentence-breaking rules of a given period and genre. When digitizing ancient Chinese literature, sentence segmentation can therefore be handed over to the computer, reducing the workload of ancient Chinese researchers.
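One way to turn already-punctuated text into training data for such a model is to strip the punctuation and record, for every remaining character, whether a break followed it. The sketch below assumes this binary labeling scheme; the punctuation set and the names `to_break_labels` and `restore_breaks` are hypothetical and not taken from this paper.

```python
# Assumed set of break marks; the actual set used in the paper is not stated here.
PUNCTUATION = set(",。!?;:、")


def to_break_labels(punctuated):
    """Strip punctuation and label each character: 1 if a break follows it, else 0."""
    chars, labels = [], []
    for ch in punctuated:
        if ch in PUNCTUATION:
            if labels:              # a mark at the very start has nothing to attach to
                labels[-1] = 1
        else:
            chars.append(ch)
            labels.append(0)
    return chars, labels


def restore_breaks(chars, labels, mark="/"):
    """Re-insert a break mark after every character labeled 1."""
    return "".join(ch + (mark if label else "") for ch, label in zip(chars, labels))


chars, labels = to_break_labels("学而时习之,不亦说乎。")
print(chars)
print(labels)
print(restore_breaks(chars, labels))
```

A sequence labeling model trained on such pairs predicts the label sequence for unpunctuated text, and `restore_breaks` turns the prediction back into segmented text.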
(3) An integrated model for automatic word segmentation and part-of-speech tagging is proposed. Since there is no public ancient Chinese corpus annotated with word segmentation and part-of-speech tags, this paper builds a small data set by manual annotation and stores it in a database as the training set used to validate the proposed model. Experiments show that the proposed model performs word segmentation and tagging of ancient Chinese well. The database can be further expanded through model labeling followed by manual correction.
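A common way to integrate the two tasks, which is not necessarily the scheme used in this paper, is to give every character a joint label that combines its position in the word (s = single, b = begin, m = middle, e = end) with the part of speech of that word. The sketch below illustrates this encoding; the function name `joint_tags` and the part-of-speech abbreviations are hypothetical.

```python
def joint_tags(words_with_pos):
    """[(word, pos), ...] -> [(char, 'position-pos'), ...], one entry per character."""
    tagged = []
    for word, pos in words_with_pos:
        if len(word) == 1:
            positions = ["s"]
        else:
            positions = ["b"] + ["m"] * (len(word) - 2) + ["e"]
        tagged.extend((ch, position + "-" + pos) for ch, position in zip(word, positions))
    return tagged


# Hypothetical manual annotation: "n" = noun, "v" = verb.
print(joint_tags([("孔子", "n"), ("曰", "v")]))
```

With such labels a single sequence labeling model outputs segmentation and part of speech in one pass, at the cost of a larger label set.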
Based on the Bi-LSTM network, this paper establishes a series of models for different tasks on ancient Chinese texts. Experiments show that the proposed models perform well on the limited ancient Chinese corpus currently available. The models can serve as auxiliary tools that help annotators label text when larger corpora are built. The newly generated corpus can in turn be used to retrain the models and improve their accuracy, so that corpus and models reinforce each other, advancing the digitization of ancient Chinese and the construction of a large ancient Chinese corpus.
Key Words: Ancient Chinese, Natural language processing, Dating, Sentence segmentation, Word segmentation, Part-of-speech tagging
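Contents
Acknowledgements
Abstract (Chinese)
Abstract (English)
1 Introduction
1.1 Research Background and Significance
1.2 Research Content
1.3 Organization of the Thesis
2 Literature Review
2.1 Dating Methods for Ancient Texts
2.2 Sentence Segmentation Methods for Ancient Texts
2.3 Word Segmentation Methods for Ancient Texts
2.4 Overview of Part-of-Speech Tagging
2.5 Chapter Summary
3 Ancient Text Dating Model
3.1 Data Sources and Preprocessing
3.2 Model Structure
3.3 Experiments
3.4 Chapter Summary
4 Ancient Chinese Sentence Segmentation Model
4.1 Data Sources and Preprocessing
4.2 Model Construction
4.3 Experiments and Results
4.4 Chapter Summary
5 Ancient Chinese Word Segmentation, Tagging System, and Database Construction
5.1 Data Sources and Preprocessing
5.2 Evaluation Metrics for Classification Models
5.3 Model Architecture
5.4 Experiments and Performance Analysis
5.5 Part-of-Speech Tagging
5.6 Chapter Summary
6 Conclusion and Future Work
6.1 Conclusion
6.2 Future Work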
# Import data
import os
import pickle
import re
import time

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from tqdm import tqdm

#   https://github.com/yongyehuang/Tensorflow-Tutorial/blob/master/Tutorial_6%20-%20Bi-directional%20LSTM%20for%20sequence%20labeling%20(Chinese%20segmentation).ipynb  #
with open('cleaned_train800-1160.txt', 'rb') as inp:
    texts = inp.read().decode('utf-16')
sentences = texts.split('\r\n')  # split on line breaks

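# Remove irregular content (e.g. at the start of each line)
def clean(s):
    if u'“/s' not in s:  # quotation marks in the middle of a sentence should not be removed
        return s.replace(u'“ ', '')
    elif u'”/s' not in s:
        return s.replace(u'” ', '')
    elif u'‘/s' not in s:
        return s.replace(u'‘', '')
    elif u'’/s' not in s:
        return s.replace(u'’', '')
    return s  # all four tagged quotation marks are present: leave the line unchanged

texts = u''.join(map(clean, sentences))  # concatenate all the words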

# print('Length of texts is %d' % len(texts))
# print('Example of texts: \n', texts[:300])

sentence = re.split(u'[,。!?、‘’“”:()—《》]', texts)
# print('Sentences number:', len(sentence))
# print('Sentence Example:\n', sentence[2])

# f = open('E:\\pyCode\\Bi-directional_LSTM\\a.txt','w')
def tag_word(word):
    """Tag each character: s = single-character word, b = begin, m = middle, e = end."""
    if len(word) == 1:
        return word + '/s  '
    middle = ''.join(ch + '/m  ' for ch in word[1:-1])
    return word[0] + '/b  ' + middle + word[-1] + '/e  '

for sentenc in sentence:  # add a tag to every character
    # Assumption: `a` (undefined in the original) is the list of
    # whitespace-separated words in the current sentence.
    a = sentenc.split()
    # The original spelled out word lengths 1 to 8 by hand; tag_word covers any length.
    a = [tag_word(word) for word in a]
    # f.write(sentences+'\n')
    # print(sentences)

def get_Xy(sentence):
    """Split a tagged sentence into ([word1, w2, ..., wn], [tag1, t2, ..., tn])."""
    words_tags = re.findall('(.)/(.)', sentence)
    if words_tags:
        words_tags = np.asarray(words_tags)
        words = words_tags[:, 0]
        tags = words_tags[:, 1]
        return words, tags  # characters and tags are stored separately as data / label
    return None
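The s/b/m/e tags extracted by get_Xy encode word boundaries, so a predicted tag sequence can be converted back into words by closing a word at every `s` or `e` tag. The sketch below shows this decoding step; the function name `tags_to_words` is hypothetical and the step is not part of the original listing.

```python
def tags_to_words(chars, tags):
    """Rebuild words from per-character s/b/m/e tags."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("s", "e"):   # a single-character word or the end of a word
            words.append(current)
            current = ""
    if current:                 # tolerate a sequence that ends in the middle of a word
        words.append(current)
    return words


print(tags_to_words("学而时习之", ["s", "s", "s", "b", "e"]))
```

The tag sequence in the usage line is made up for illustration and is not a claim about the correct segmentation of that phrase.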
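datas = list()
labels = list()
# print('Start creating words and tags data ...')
for sentence in tqdm(iter(sentences)):
    result = get_Xy(sentence)
    if result:
        # Assumed completion of the truncated loop body: collect characters and tags.
        datas.append(result[0])
        labels.append(result[1])
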
import pickle

import numpy as np  # required by BatchGenerator below
from sklearn.model_selection import train_test_split

with open('data/data.pkl', 'rb') as inp:
    X = pickle.load(inp)
    y = pickle.load(inp)
    word2id = pickle.load(inp)
    id2word = pickle.load(inp)
    tag2id = pickle.load(inp)
    id2tag = pickle.load(inp)

# Split into test / training / validation sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_valid, y_train, y_valid = train_test_split(X_train, y_train,  test_size=0.2, random_state=42)
# print('X_train.shape={}, y_train.shape={}; \nX_valid.shape={}, y_valid.shape={};\nX_test.shape={}, y_test.shape={}'.format(
#     X_train.shape, y_train.shape, X_valid.shape, y_valid.shape, X_test.shape, y_test.shape))
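Because the second split takes 20% of the data left over after the first, the overall proportions are roughly 64% training, 16% validation, and 20% test. The arithmetic can be checked with a small sketch; it assumes that each held-out count is rounded up, and the function name `split_sizes` is hypothetical.

```python
import math


def split_sizes(n_samples, test_size=0.2, valid_size=0.2):
    """Sizes of (train, valid, test) after two successive hold-out splits."""
    n_test = math.ceil(test_size * n_samples)   # first split: hold out the test set
    n_rest = n_samples - n_test
    n_valid = math.ceil(valid_size * n_rest)    # second split: hold out validation data
    n_train = n_rest - n_valid
    return n_train, n_valid, n_test


print(split_sizes(1000))
```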

# ** 3. Build the data generator
class BatchGenerator(object):
    """Construct a data generator. The inputs X, y should be ndarray or list-like.

    Example:
        Data_train = BatchGenerator(X=X_train_all, y=y_train_all, shuffle=False)
        Data_test = BatchGenerator(X=X_test_all, y=y_test_all, shuffle=False)
        X = Data_train.X
        y = Data_train.y
        X_batch, y_batch = Data_train.next_batch(batch_size)
    """

    def __init__(self, X, y, shuffle=False):
        if type(X) != np.ndarray:
            X = np.asarray(X)
        if type(y) != np.ndarray:
            y = np.asarray(y)
        self._X = X
        self._y = y
        self._epochs_completed = 0
        self._index_in_epoch = 0
        self._number_examples = self._X.shape[0]
        self._shuffle = shuffle
        if self._shuffle:
            new_index = np.random.permutation(self._number_examples)
            self._X = self._X[new_index]
            self._y = self._y[new_index]

    @property
    def X(self):
        return self._X

    @property
    def y(self):
        return self._y

    @property
    def num_examples(self):
        return self._number_examples

    @property
    def epochs_completed(self):
        return self._epochs_completed

    def next_batch(self, batch_size):
        """ Return the next 'batch_size' examples from this data set."""
        start = self._index_in_epoch
        self._index_in_epoch += batch_size
        if self._index_in_epoch > self._number_examples:
            # finished epoch
            self._epochs_completed += 1
            # Shuffle the data
            if self._shuffle:
                new_index = np.random.permutation(self._number_examples)
                self._X = self._X[new_index]
                self._y = self._y[new_index]
            start = 0
            self._index_in_epoch = batch_size
            assert batch_size <= self._number_examples
        end = self._index_in_epoch
        return self._X[start:end], self._y[start:end]

# print('Creating the data generator ...')
data_train = BatchGenerator(X_train, y_train, shuffle=True)
data_valid = BatchGenerator(X_valid, y_valid, shuffle=False)
data_test = BatchGenerator(X_test, y_test, shuffle=False)
# print('Finished creating the data generator.')

# For Chinese word segmentation.
import time

import numpy as np
import tensorflow as tf
from tensorflow.contrib import rnn  # tf.contrib exists only in TensorFlow 1.x

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
sess = tf.Session(config=config)
# ##################### config ######################
decay = 0.85
max_epoch = 5
max_max_epoch = 10
timestep_size = max_len = 32  # sentence length
vocab_size = 7010  # number of distinct characters in the samples + 1 (padding 0), obtained during data processing
input_size = embedding_size = 64  # character-embedding dimension
class_num = 5
hidden_size = 128  # number of hidden units
layer_num = 2  # number of Bi-LSTM layers
max_grad_norm = 5.0  # maximum gradient norm (larger gradients are clipped)
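The fixed `max_len` of 32 and the reserved padding id 0 imply that every sequence of character ids is brought to the same length before it is batched. A minimal sketch of that step follows; it assumes that longer sequences are truncated and shorter ones are padded on the right, and the function name `pad_ids` is hypothetical.

```python
def pad_ids(ids, max_len=32, pad_id=0):
    """Truncate or right-pad a sequence of character ids to exactly max_len items."""
    ids = list(ids)[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


print(pad_ids([5, 9, 2], max_len=6))
print(len(pad_ids(range(1, 50))))
```

Because id 0 never stands for a real character, the padded positions can later be masked out when the loss or accuracy is computed.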

lr = tf.placeholder

