Chinese Word Segmentation -- Extraction and Word2Vec Model Construction

1. Preface

The corpus used in this article is the Fudan Chinese text classification corpus, which contains 20 categories. The full code and resources are available for download from the original post.

2. File reading

First, we read a single text document, keep only the Chinese characters, segment it with jieba, and remove stopwords. The following code includes two functions:

  1. extract_words_one_file() processes a single file and returns its Chinese word segments.
  2. Once the single-document function works, we want to traverse all text files under a folder and process each of them. extract_words_folder() implements this and saves the data in the required format.

It is recommended to look at the directory structure of the dataset before reading this part of the code, as it makes the code easier to follow.

import re,os,jieba
import numpy as np
import pandas as pd
import jieba.analyse

# Extract the Chinese word segments from a text file, grouped either by sentence or as a whole article
def extract_words_one_file(filepath,all=True):
    # `all` controls how the segments are returned:
    # True  - all segments of the article are returned in one flat list
    # False - each sentence's segments are kept in their own list, and the article is a list of those lists

    def open_file(file_text):
        with open(file_text,'r',errors='ignore',encoding='utf-8') as fp:
            content=fp.readlines()
        return content

    # Keep only Chinese characters
    def remove(text):
        remove_chars=r'[^\u4e00-\u9fa5]'
        return re.sub(remove_chars,'',text)

    # Load the stopword list
    def open_stop(file_stop):
        stopwords=[line.strip() for line in open(file_stop,'r',encoding='utf-8-sig').readlines()]
        return stopwords

    # Word segmentation using jieba
    def seg_sentence(sentence):
        sentence_seged=jieba.cut(sentence.strip())
        stopwords=open_stop('data/stopwords/hit_stopwords.txt')
        outstr=''
        for word in sentence_seged:
            if word not in stopwords:
                outstr+=word
                outstr+=' '
        # strip() removes the trailing space
        return outstr.strip()

    # Open the file to load
    inputs=open_file(filepath)

    # Get the word segments of each line
    words_in_sentence=[]
    for line in inputs:
        line_delete=remove(line)
        # The return value is a string
        line_seg=seg_sentence(line_delete)
        words_in_sentence.append(line_seg)

    print('words_in_sentence_1:',words_in_sentence)
    words_in_sentence=[x for x in words_in_sentence if x!='']
    print('words_in_sentence_2:', words_in_sentence)

    # Split on whitespace to get the individual segments
    alltokens=[]
    chinesewords_sentence=[]
    for i in range(len(words_in_sentence)):
        # \s matches any whitespace character
        word=re.split(r'\s',words_in_sentence[i])
        alltokens.append(word)

    print('alltokens:',alltokens)

    # Remove empty strings from each sentence's segment list
    for element in alltokens:
        element=[x for x in element if x!='']
        chinesewords_sentence.append(element)

    print('chinesewords_sentence:',chinesewords_sentence)

    # Flatten into a single list containing all segments of the article
    chinesewords_article=[i for k in chinesewords_sentence for i in k]

    print('chinesewords_article:',chinesewords_article)

    if all==True:
        return chinesewords_article
    else:
        return chinesewords_sentence

# Batch-process the files in all subfolders of a folder, extracting the Chinese word segments by sentence or by full article
def extract_words_folder(path, all=True):
    # `all` controls how the segments are saved:
    # True  - all segments of each article are stored in a single flat list
    # False - each sentence's segments are kept in their own list, grouped by article

    # e.g. path='data/fudan-utf8/train'
    files=os.listdir(path)
    features=[]

    # Traverse the category folders
    for i in range(len(files)):
        dirs=os.listdir(path+'/'+files[i])
        # Traverse all files in the subfolder and extract segments by sentence or by full article
        for f in dirs:
            if all==True:
                word_single_text=extract_words_one_file(path+'/'+files[i]+'/'+f,all=True)
                word_with_label = [word_single_text, files[i], f]  # Save the category and file name together
                features.append(word_with_label)
            else:
                word_single_text = extract_words_one_file(path + "/" + files[i] + "/" + f, all=False)
                features.append(word_single_text)

    if all == True:
        return pd.DataFrame(features, columns=['Words', 'Category', 'File'])  # Store the segments, category and file name in a DataFrame
    else:
        return features


article_features = extract_words_folder(path='data/fudan-utf8/train',all = True)
#Save dataframe as csv
article_features.to_csv("article_features_train_raw.csv",encoding='utf_8',index=False)

#Save sentence segmentation in text format
sent_features = extract_words_folder(path='data/fudan-utf8/train',all = False)
with open("word_sentence_train.txt", "w", encoding='utf-8') as f:
    f.write(str(sent_features))
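
As a quick sanity check (not part of the original run), extract_words_one_file() can also be called on a single document first; the file path below is only a placeholder and should point at a real file in the corpus.

# Minimal sanity check -- the path is hypothetical, substitute any real file from the training set
sample_words = extract_words_one_file('data/fudan-utf8/train/C3-Art/sample.txt', all=True)
print(len(sample_words), sample_words[:10])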

3. Data preprocessing

To keep the later experiments consistent, I did not save all the data together and split it into training and test sets afterwards. Instead, I split it manually beforehand: I created a test folder, extracted the training and test data separately, and stored them in two CSV files. During the split I also noticed that the number of samples varies greatly across categories -- some have thousands of documents while others have only around a hundred. I therefore kept only the nine largest categories and removed the remaining eleven, because categories with so few samples make it hard for the model to produce unbiased judgments compared with the much larger ones.
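
A quick way to see this imbalance, assuming the raw training CSV produced in Section 2, is to count documents per category:

import pandas as pd

# Count documents per category in the raw training data (largest first)
raw = pd.read_csv('data/article_features_train_raw.csv')
print(raw['Category'].value_counts())

The full preprocessing code follows.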

import pandas as pd
import numpy as np

# Preprocess the raw data and save the cleaned data to the corresponding files
train=pd.read_csv('data/article_features_train_raw.csv')
test=pd.read_csv('data/article_features_test_raw.csv')

# Some categories have too few samples to train on, so they are deleted and only the 9 largest categories are kept
# 'Environment' is misspelled as 'Enviornment' in the train folder; replace it with the correct spelling to match the test set
train.Category.replace('C31-Enviornment','C31-Environment',inplace=True)
keep_categories = ['C3-Art', 'C11-Space', 'C19-Computer', 'C31-Environment', 'C32-Agriculture',
                   'C34-Economy', 'C38-Politics', 'C39-Sports', 'C7-History']
train = train[train['Category'].isin(keep_categories)]
test = test[test['Category'].isin(keep_categories)]

#Define label replacement dictionary
label2category = {0: 'C11-Space', 1: 'C19-Computer', 2: 'C3-Art', 3: 'C31-Environment', 4: 'C32-Agriculture',
                  5: 'C34-Economy', 6:'C38-Politics',7:'C39-Sports',8:'C7-History'}
category2label = dict(zip(label2category.values(), label2category.keys()))

train['label'] = train.Category.replace(category2label)
test['label'] = test.Category.replace(category2label)
train = train.reset_index(drop=True)
test = test.reset_index(drop=True)

#Save new data
train.to_csv("article_features_train.csv",encoding='utf_8_sig',index=False)
test.to_csv("article_features_test.csv",encoding='utf_8_sig',index=False)

4. Word2Vec model

After the word segments have been extracted, we need to represent them in a way that turns unstructured text, which a computer cannot process directly, into computable structured information. One-hot encoding is one such method, and its principle is easy to understand: generate a vector of zeros whose length equals the size of the vocabulary, and set a word's position to 1 if the article or sentence contains that word, otherwise leave it at 0. However, this representation becomes far too sparse for large-scale text and cannot express the relationships between words, so it is only used in relatively simple applications.
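
As a toy illustration of this idea (made-up vocabulary and sentence, not part of the pipeline):

# Toy one-hot / bag-of-words vector: 1 if the vocabulary word appears in the sentence, else 0
vocab = ['中国', '经济', '发展', '体育', '比赛']
sentence = ['中国', '经济', '发展']
one_hot = [1 if word in sentence else 0 for word in vocab]
print(one_hot)  # [1, 1, 1, 0, 0] -- already sparse, and says nothing about how words relate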

The text representation adopted in this article is Word2Vec, a word-embedding method. Put simply, each word segment is mapped to a multi-dimensional vector learned during training. Word2Vec has two training modes: CBOW, which predicts the current word from its context, and skip-gram, which predicts the context from the current word.
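
In gensim the two modes are selected with the sg parameter (sg=0 for CBOW, sg=1 for skip-gram); the training code in the following sections uses sg=1. A minimal sketch on toy sentences with hypothetical parameters:

from gensim.models import Word2Vec

toy_sentences = [['中国', '经济', '发展'], ['体育', '比赛', '结果']]
cbow = Word2Vec(toy_sentences, sg=0, vector_size=50, window=3, min_count=1)       # CBOW
skip_gram = Word2Vec(toy_sentences, sg=1, vector_size=50, window=3, min_count=1)  # skip-gram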

4.1 Data import

The gensim version used is 4.1.2.

import gensim
#Import data
with open("word_sentence_train.txt", "r", encoding='utf-8') as f:  # Open the sentence-level segment file saved in Section 2
    word_sentence = f.read()  # Read the whole file as a string

sent_feature = eval(word_sentence)

# We only select sentences with more than 3 word segments for Word2Vec model training
sent_words = [i for k in sent_feature for i in k if len(i)>3]
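
To make the nested structure clear, here is a toy illustration with made-up segments: sent_feature is a list of articles, each article is a list of sentences, and each sentence is a list of segments; the comprehension above flattens it to sentence level while dropping short sentences.

# Toy illustration of the flattening step (made-up segments)
toy_feature = [[['中国', '经济', '发展', '迅速'], ['你好']],      # article 1: two sentences
               [['体育', '比赛', '结果', '今天', '公布']]]         # article 2: one sentence
print([s for article in toy_feature for s in article if len(s) > 3])
# -> [['中国', '经济', '发展', '迅速'], ['体育', '比赛', '结果', '今天', '公布']]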

4.2 Training a model using only the extracted word segments

In Section 2 we saved the word segments to a text file in sentence form. Now we can train on them directly to obtain the model.

model = gensim.models.Word2Vec(sent_words, sg=1, vector_size=100, window=3, epochs=5,
                               min_count=3, negative=3, sample=0.001, hs=1)
#Model saving
model.wv.save_word2vec_format('./word2vec_model.txt', binary=False)

4.3 Training a model on top of existing word vectors

# Build the model on top of third-party pre-trained word vectors
import numpy as np

w2v_model=gensim.models.Word2Vec(vector_size=300,window=3,sg=1,min_count=3)
w2v_model.build_vocab(sent_words)
# Then load the third-party pre-trained vectors
third_model=gensim.models.KeyedVectors.load_word2vec_format('data/fudan-utf8/sgns.merge.word',binary=False)
# Add the pre-trained vocabulary, then merge the vectors via intersect_word2vec_format()
# In gensim 4.x, .vocab is replaced by key_to_index and the merge method lives on model.wv
w2v_model.build_vocab([list(third_model.key_to_index.keys())], update=True)
# In some gensim 4.x versions a full-length lockf array is needed before merging with lockf=1.0
w2v_model.wv.vectors_lockf = np.ones(len(w2v_model.wv), dtype=np.float32)
w2v_model.wv.intersect_word2vec_format('data/fudan-utf8/sgns.merge.word', binary=False, lockf=1.0)
# total_examples should reflect the real training corpus, not the vocabulary-update call above
w2v_model.train(sent_words, total_examples=len(sent_words), epochs=5)
print("Model training finished.")

w2v_model.wv.save_word2vec_format('./word2vec_ensemble.txt', binary=False)
print("Model saved.")
