Tianchi beginner-level NLP competition practice: Task 4 - Text Classification Based on Deep Learning (1): FastText

Task 4: Text classification based on deep learning (1) - FastText

Unlike traditional machine learning, deep learning not only performs feature extraction but can also handle the classification itself.

Learning objectives

  • Learn the use and basic principles of FastText
  • Learn to use validation sets for parameter tuning

Text representation (Part 2-1)

Defects of existing text representation methods

The text representation methods covered earlier (One-Hot, Bag of Words, N-gram, TF-IDF) all have problems to some degree: the resulting vectors are very high-dimensional and require a long training time, and they are purely count-based statistics that ignore the relationships between words.

Unlike these representations, deep learning can also be used for text representation, mapping words into a low-dimensional space. Typical examples are FastText, Word2Vec and BERT.

FastText

FastText is a typical deep-learning word-vector representation method. The idea is very simple: words are mapped into a dense space through an Embedding layer, the vectors of all words in a sentence are then averaged in that Embedding space, and the resulting document vector is used for classification.

FastText is therefore a three-layer neural network, consisting of an input layer, a hidden layer and an output layer.

The original post includes a figure of the FastText network structure implemented with Keras.
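
Since the figure is not reproduced here, the following is a minimal sketch of the same structure, assuming TensorFlow/Keras is installed; the vocabulary size, sequence length and embedding dimension below are illustrative values, not taken from the original post:

from tensorflow.keras import Input
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, GlobalAveragePooling1D, Dense

VOCAB_SIZE = 7000   # illustrative vocabulary size
EMBED_DIM = 100     # illustrative embedding dimension
MAX_LEN = 400       # illustrative (padded) sequence length
NUM_CLASSES = 14    # the competition labels cover 14 news categories

model = Sequential([
    Input(shape=(MAX_LEN,)),                       # padded sequences of word ids
    Embedding(VOCAB_SIZE, EMBED_DIM),              # input layer: map each word id to a dense vector
    GlobalAveragePooling1D(),                      # hidden layer: average the word vectors of the sentence
    Dense(NUM_CLASSES, activation='softmax'),      # output layer: classify the averaged document vector
])
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()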

FastText has advantages over TF-IDF for text classification tasks:

  • FastText classifies using a document vector obtained by superimposing (averaging) the Embeddings of its words, which groups similar sentences into the same category
  • The embedding space FastText learns is relatively low-dimensional, so the model can be trained quickly

Reference papers: Bag of Tricks for Efficient Text Classification

Text classification based on FastText

Install FastText

FastText trains quickly on a CPU. The recommended version is the official open-source release: https://github.com/facebookresearch/fastText/tree/master/python

It is suggested to download the fasttext wheel directly from the site linked in the original post, then change into the download directory and install it with the following commands:

cd File path
pip install fasttext-0.9.2-cp38-cp38-win_amd64.whl
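
As an optional sanity check (a minimal snippet, not specific to this competition), importing the package should succeed after installation:

import fasttext
print(fasttext.train_supervised)  # a successful import prints the function object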

Data loading, preprocessing and partitioning

The operation here is the same as before and will not be repeated.

#Data loading and preprocessing
import pandas as pd
import joblib
data_file = 'train_set.csv'
rawdata = pd.read_csv(data_file, sep='\t', encoding='UTF-8')
# Remove the tokens presumed to be punctuation (3750, 900, 648) with a regular expression;
# word boundaries avoid deleting digits inside other token ids
import re
rawdata['words'] = rawdata['text'].apply(lambda x: re.sub(r'\b(3750|900|648)\b', "", x))
del rawdata['text']

#Data division
from sklearn.model_selection import train_test_split
import joblib
rawdata.reset_index(inplace=True, drop=True)

X = list(rawdata.index)
y = rawdata['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, stratify=y)  # stratify=y: stratified sampling, split according to the proportion of each class
test_data = {'X_test': X_test, 'y_test': y_test}
joblib.dump(test_data, 'test_index.pkl')
train_data = {'X_train': X_train, 'y_train': y_train}
joblib.dump(train_data, 'train_index.pkl')
# On later runs the saved indices can be reloaded instead of re-splitting:
# test_data = joblib.load('test_index.pkl')
# train_data = joblib.load('train_index.pkl')

train_x=rawdata.loc[train_data['X_train']]
train_y=rawdata.loc[train_data['X_train']]['label'].values
test_x=rawdata.loc[test_data['X_test']]
test_y=rawdata.loc[test_data['X_test']]['label'].values
# Read and preprocess the test set (test_a) in the same way
test_data_file = 'test_a.csv'
f = pd.read_csv(test_data_file, sep='\t', encoding='UTF-8')
test_a_data = f['text'].apply(lambda x: re.sub(r'\b(3750|900|648)\b', "", x))  # named test_a_data to avoid shadowing the test_data index dict above
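
An optional quick check that the stratified split keeps the class proportions, using only the variables defined above:

print(train_x.shape, test_x.shape)
print(train_x['label'].value_counts(normalize=True).round(3))
print(pd.Series(test_y).value_counts(normalize=True).round(3))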

FastText training and prediction

import pandas as pd
from sklearn.metrics import f1_score
# Convert to the format required by FastText
train_x['label_ft'] = '__label__' + train_x['label'].astype(str)

train_x[['words','label_ft']].to_csv('fasttext_need_train.csv', index=None, header=None, sep='\t')

import fasttext

model = fasttext.train_supervised('fasttext_need_train.csv', lr=1.0, wordNgrams=2, verbose=2, minCount=1, epoch=25, loss="hs")

val_pred = [model.predict(x)[0][0].split('__')[-1] for x in test_x['words']]  # predict returns ([labels], [probs]); keep the top label and strip the '__label__' prefix

print(f1_score(test_y.astype(str), val_pred, average='macro'))

# Validation set score: 0.9198590651705543
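
If the score looks reasonable, the model can be saved and used on test_a to produce a submission file. A sketch with illustrative file names, reusing the preprocessed test_a text (test_a_data) from the loading step; the exact submission format should be checked against the competition page:

model.save_model('fasttext_baseline.bin')  # persist the trained model for later reuse
test_pred = [model.predict(x)[0][0].split('__')[-1] for x in test_a_data]  # same decoding as for the validation set
pd.DataFrame({'label': test_pred}).to_csv('fasttext_submit.csv', index=False)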

FastText + cross validation + grid search parameters

This part mainly follows the blog post FastText parameter: GridSearch+CV (see the reference material).

# Arrange and combine the values of various parameters
def get_gridsearch_params(param_grid):
    params_combination = [dict()]  # Used to store all possible parameter combinations
    for k, v_list in param_grid.items():
        tmp = [{k: v} for v in v_list]
        n = len(params_combination)
        # params_combination = params_combination*len(tmp)  # Shallow copy, problem
        copy_params = [copy.deepcopy(params_combination) for _ in range(len(tmp))] 
        params_combination = sum(copy_params, [])
        _ = [params_combination[i*n+k].update(tmp[i]) for k in range(n) for i in range(len(tmp))]
    return params_combination
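
# A quick illustration of what get_gridsearch_params returns (illustrative grid):
#   get_gridsearch_params({'lr': [0.5, 1.0], 'epoch': [25, 50]})
#   -> [{'lr': 0.5, 'epoch': 25}, {'lr': 1.0, 'epoch': 25},
#       {'lr': 0.5, 'epoch': 50}, {'lr': 1.0, 'epoch': 50}]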

# Use k-fold cross validation to compute the average score for one parameter combination
# Inputs: the training data frame, the parameter combination, the KFold object for cross validation, the scoring metric, and the number of classes
def get_KFold_scores(df, params, kf, metric, n_classes):
    metric_score = 0.0

    for train_idx, val_idx in kf.split(df['words'],df['label']):
        df_train = df.iloc[train_idx]
        df_val = df.iloc[val_idx]

        tmpdir = tempfile.mkdtemp()  # fastText reads its training data from a file, so write each fold's training split to a temporary directory
        tmp_train_file = tmpdir + '/train.txt'
        df_train[['words', 'label_ft']].to_csv(tmp_train_file, sep='\t', index=False, header=None, encoding='UTF-8')  # only the text and the '__label__'-prefixed label, no header

        fast_model = fasttext.train_supervised(tmp_train_file, label='__label__', thread=3, **params)  # train with the current parameter combination; in the official fasttext>=0.9 API the prefix argument is named 'label'
        
        # Evaluate on the validation fold with the trained model
        predicted = fast_model.predict(df_val['words'].tolist())  # returns ([[label, ...], ...], [[prob, ...], ...])
        y_val_pred = [int(label[0].replace('__label__', '')) for label in predicted[0]]  # strip the prefix; labels can have more than one digit
        y_val = df_val['label'].values

        score = get_metrics(y_val, y_val_pred, n_classes)[metric]
        metric_score += score  # accumulate the fold scores to average over the whole cross-validation
        shutil.rmtree(tmpdir, ignore_errors=True) #Delete temporary training data file

    print('average:', metric_score / kf.n_splits)
    return metric_score / kf.n_splits

# Grid search + cross validation
# Inputs: the training data frame, the parameter grid to search, the metric used to pick the best score, and the number of cross-validation folds
def my_gridsearch_cv(df, param_grid, metrics, kfold=10):
    n_classes = len(np.unique(df['label']))
    print('n_classes', n_classes)

    #kf = KFold(n_splits=kfold)  # k-fold cross validation 
    skf = StratifiedKFold(n_splits=kfold,shuffle=True,random_state=1) #k-fold stratified sampling cross validation

    params_combination = get_gridsearch_params(param_grid) # Get various permutations and combinations of parameters

    best_score = 0.0
    best_params = dict()
    for params in params_combination:
        avg_score = get_KFold_scores(df, params, skf, metrics, n_classes)
        if avg_score > best_score:
            best_score = avg_score
            best_params = copy.deepcopy(params)

    return best_score, best_params

import fasttext
from sklearn.model_selection import KFold, StratifiedKFold
import numpy as np
import pandas as pd
import copy
import tempfile
import shutil
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
# Compute classification evaluation metrics
def get_metrics(y_true, y_pred, n_classes):
    metrics = {}

    if n_classes == 2:
        # Binary classification
        metrics['precision'] = precision_score(y_true, y_pred, pos_label=1)
        metrics['recall'] = recall_score(y_true, y_pred, pos_label=1)
        metrics['f1'] = f1_score(y_true, y_pred, pos_label=1)
    else:  # Multi-class classification
        average = 'macro'
        metrics[average+'_precision'] = precision_score(y_true, y_pred, average=average)
        metrics[average+'_recall'] = recall_score(y_true, y_pred, average=average)
        metrics[average+'_f1'] = f1_score(y_true, y_pred, average=average)
    

    metrics['accuracy'] = accuracy_score(y_true, y_pred)
    metrics['confusion_matrix'] = confusion_matrix(y_true, y_pred)
    metrics['classification_report'] = classification_report(y_true, y_pred)
    
    return metrics
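
# Example usage of get_metrics (illustrative numbers):
#   get_metrics([0, 1, 1, 2], [0, 1, 2, 2], n_classes=3)['macro_f1']
#   -> macro-averaged F1 over the three classes (about 0.78 here)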

#DATA_PATH = '../data/'

# Parameters to tune
tuned_parameters = {
    'lr': [1.0, 0.85, 0.5],
    'epoch': [30,50],
    'dim': [ 200],
    'wordNgrams': [2, 3],
}
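
# Note: this grid has 3 * 2 * 1 * 2 = 12 parameter combinations; with the 10-fold
# cross-validation below, that is 120 fastText training runs, so the search can take a while.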

# Combine the three functions defined above and run the grid search with cross-validation

if __name__ == '__main__':
    #filepath = DATA_PATH + 'fast/augmented/js_pd_tagged_train.txt'
    #df = pd.read_csv(filepath, encoding='UTF-8', sep='\t', header=None, index_col=False, usecols=[0, 1])
    print(train_x.head())
    print(train_x.shape)  
    best_score, best_params = my_gridsearch_cv(train_x, tuned_parameters, 'accuracy', kfold=10)
    print('best_score', best_score)
    print('best_params', best_params)
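
After the search finishes, a natural follow-up (not shown in the original post) is to retrain fastText on the full training split with best_params and re-check the macro F1 on the held-out validation split from earlier; a sketch reusing the variables defined above:

best_model = fasttext.train_supervised('fasttext_need_train.csv', thread=3, **best_params)  # retrain with the best parameters found
val_pred = [best_model.predict(x)[0][0].split('__')[-1] for x in test_x['words']]
print(f1_score(test_y.astype(str), val_pred, average='macro'))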

Reference material

Datawhale beginner-level NLP event - Task 4: Text classification based on deep learning (1) - fastText

FastText parameter: GridSearch+CV
