NLTK-006: Text classification (gender identification)

Classification is the task of choosing the correct class label for a given input. In the basic classification task, each input is considered in isolation from all other inputs, and the set of labels is defined in advance. Here are some examples of classification tasks:

  • Determine whether an email is spam.
  • From a fixed list of topic areas, such as 'sports', 'technology' and 'politics', decide what the topic of a news story is.

There are many interesting variants of the basic classification task. For example, in multi-class classification, each instance may be assigned multiple labels; in open-class classification, the set of labels is not defined in advance; and in sequence classification, a list of inputs is classified jointly.

Supervised classification

A classifier is called supervised if it is built from training corpora containing the correct label for each input. The framework is shown in the diagram below:

(a) During training, a feature extractor converts each input value into a feature set, which captures the basic information about the input that should be used to classify it. Pairs of feature sets and labels are fed into the machine learning algorithm to generate a model. (b) During prediction, the same feature extractor converts unseen inputs into feature sets, and these feature sets are fed into the model to generate predicted labels.
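This two-phase framework can be sketched as a toy Python program (a minimal illustration of the data flow only, not NLTK's actual machinery; the tiny spam dataset and the `extract_features` helper are invented for this example):

```python
# Toy sketch of the supervised classification framework
# (a) Training: feature extractor -> (feature set, label) pairs -> learner -> model
# (b) Prediction: same feature extractor -> feature set -> model -> predicted label

def extract_features(text):
    # Deliberately simple extractor: does the text contain any digits?
    return {'has_digits': any(ch.isdigit() for ch in text)}

# (a) Training phase (the two labeled examples are invented)
training_data = [('win $100 now', 'spam'), ('meeting tomorrow', 'ham')]
train_set = [(extract_features(text), label) for (text, label) in training_data]

# A trivial stand-in "learner": memorize the label seen for each feature set
model = {tuple(sorted(feats.items())): label for (feats, label) in train_set}

# (b) Prediction phase: unseen input goes through the same extractor
def classify(text):
    feats = extract_features(text)
    return model.get(tuple(sorted(feats.items())), 'unknown')

print(classify('claim your 500 dollars'))   # digits present -> 'spam'
```

A real learner generalizes instead of memorizing, but the shape of the pipeline is the same.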

Gender identification:

The Names Corpus that ships with NLTK contains about 8,000 names classified by gender.

import nltk

names = nltk.corpus.names
print(names.fileids())
male_names = names.words('male.txt')
female_names = names.words('female.txt')
print([w for w in male_names if w in female_names])   # names that appear in both lists

Output: ['Abbey', 'Abbie', 'Abby', 'Addie', 'Willi', 'Willie', 'Willy', 'Winnie', 'Winny', 'Wynn']

Generally speaking, names ending with the letter a are female. So let's extract the last letter of each name with name[-1] and build a conditional frequency distribution:

cfd = nltk.ConditionalFreqDist(
    (fileid, name[-1])
    for fileid in names.fileids()
    for name in names.words(fileid))
cfd.plot()

The output is a plot of the conditional frequency distribution:

As can be seen from the figure, most names ending in a, e and i are female, while names ending in k, o, r, s and t are more likely to be male. Names ending in h and l are roughly evenly split between the genders.
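If no plotting backend is available, a `ConditionalFreqDist` can also be printed as a text table with its `tabulate` method. A minimal sketch on invented miniature data (the real call would use the `cfd` built above, with `samples` restricted to a few letters to keep the table readable):

```python
import nltk

# Invented miniature data standing in for the (fileid, final letter) pairs
pairs = [('female.txt', 'a'), ('female.txt', 'a'), ('female.txt', 'e'),
         ('male.txt', 'k'), ('male.txt', 'r'), ('male.txt', 'a')]
cfd = nltk.ConditionalFreqDist(pairs)

# Plain-text table of final-letter counts per file (no matplotlib needed)
cfd.tabulate(samples=['a', 'e', 'k', 'r'])
```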

Next, let's build a classifier to model these differences more precisely.

The first step in creating a classifier is deciding which features of the input are relevant and how to encode them. In this example, we start by just looking at the final letter of a given name. The following feature extractor function builds a dictionary containing relevant information about a given name:

def gender_features(word):
    return {'last_letter': word[-1]}

print(gender_features('Shark'))    # {'last_letter': 'k'}

The dictionary returned by this function is called a feature set and maps feature names to their values. Feature names are strings; feature values are of simple types, such as Booleans, numbers, and strings.

Now that we have a feature extractor, we need to prepare a list of examples and their corresponding class labels:

from nltk.corpus import names
import random
names = ([(name,'male') for name in names.words('male.txt')]
         +[(name,'female') for name in names.words('female.txt')])
random.shuffle(names)			# shuffle has no return value

Next, we use the feature extractor to process the name data and divide the resulting list of feature sets into a training set and a test set. The training set is used to train a new naive Bayes classifier.

featuresets = [(gender_features(n), g) for (n, g) in names]
train_set, test_set = featuresets[500:], featuresets[:500]

Let's test the classifier on names that did not appear in its training data:

classifier = nltk.NaiveBayesClassifier.train(train_set)    # train a naive Bayes classifier
print(classifier.classify(gender_features('Neo')))         # classify an unseen name
print(classifier.classify(gender_features('Trinity')))

Then we can use a large amount of data to systematically evaluate this classifier:

print(nltk.classify.accuracy(classifier, test_set))     # evaluate on the test set
# accuracy: for a given test set, the ratio of the number of samples the classifier labels correctly to the total number of samples

The output result is about 0.77
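The accuracy figure is easy to reproduce by hand, which makes the definition concrete: the number of correct predictions divided by the total number of samples. A small sketch on invented gold labels and predictions:

```python
# Accuracy = correct predictions / total samples.
# The gold labels and predictions below are invented for illustration:
gold        = ['male', 'female', 'female', 'male', 'female']
predictions = ['male', 'female', 'male',   'male', 'male']

correct = sum(1 for g, p in zip(gold, predictions) if g == p)
accuracy = correct / len(gold)
print(accuracy)   # 3 of 5 correct -> 0.6
```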

Finally, we can check the classifier to determine which features are most effective for distinguishing the gender of names.

classifier.show_most_informative_features(10)    # prints the list itself; no print() needed (it returns None)

The output shows that, in this training set, names ending in 'a' are about 34 times more likely to be female than male, while names ending in 'k' are about 30 times more likely to be male. These ratios are known as likelihood ratios and are useful for comparing different feature-outcome relationships.
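Besides the ranked feature list, `NaiveBayesClassifier.prob_classify` exposes the probability assigned to each label for a single input, which helps when inspecting borderline names. A sketch using a tiny invented training set rather than the full corpus:

```python
import nltk

def gender_features(word):
    return {'last_letter': word[-1]}

# Tiny invented training set, just enough to fit a working classifier
train_set = [(gender_features(n), g) for (n, g) in
             [('Anna', 'female'), ('Julia', 'female'), ('Maria', 'female'),
              ('Mark', 'male'), ('Jack', 'male'), ('Erik', 'male')]]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Probability distribution over labels for one unseen name
dist = classifier.prob_classify(gender_features('Rebecca'))
for label in dist.samples():
    print(label, round(dist.prob(label), 3))
```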

P.S.: we could also modify the gender_features() function to provide the classifier with the length of the name, its first letter, and any other features that seem useful, then train the classifier with these new features and test its accuracy.
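Such an extended extractor might look like the sketch below (the feature names and the particular choice of features are assumptions, not part of the original example; whether they actually help would have to be checked against a test set):

```python
def gender_features3(name):
    # Hypothetical extended extractor: final letter plus first letter and length
    return {'last_letter': name[-1].lower(),
            'first_letter': name[0].lower(),
            'length': len(name)}

print(gender_features3('Trinity'))
# {'last_letter': 'y', 'first_letter': 't', 'length': 7}
```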

Moreover, when working with a large corpus, building a single list containing the features of every instance would use a lot of memory. In that case, use the function nltk.classify.apply_features, which returns an object that behaves like a list but does not store all the feature sets in memory:

from nltk.classify import apply_features
train_set = apply_features(gender_features, names[500:])
test_set = apply_features(gender_features, names[:500])

Choosing the right features:

Selecting relevant features and deciding how to encode them for a learning method can have an enormous impact on the method's ability to extract a good model. Much of the interesting work in building a classifier is deciding which features might be relevant and how best to represent them. Although decent performance is often achieved with fairly simple and obvious feature sets, there are usually significant gains from carefully constructed features based on a thorough understanding of the task at hand.

In practice, feature extractors are built through a process of trial and error, guided by intuitions about what information is relevant to the problem. You typically start with all the features you can think of, then check which ones are actually useful.

For example, here is an overfitting gender feature extractor based on the example above. (Overfitting means the feature set contains so many specific features that the model fits the relatively small Names Corpus too closely.)

def gender_features2(name):
    features = {}
    features['firstletter'] = name[0].lower()
    features['lastletter'] = name[-1].lower()
    for letter in 'abcdefghijklmnopqrstuvwxyz':
        features['count(%s)' % letter] = name.lower().count(letter)
        features['has(%s)' % letter] = (letter in name.lower())
    return features

print(gender_features2('john'))

Output: {'firstletter': 'j', 'lastletter': 'n', 'count(a)': 0,..., 'has(z)': False}

When a feature set overfits, it causes real problems on small training sets. Let's test it on the earlier data:

featuresets2 = [(gender_features2(n), g) for (n, g) in names]
train_set, test_set = featuresets2[500:], featuresets2[:500]
classifier = nltk.NaiveBayesClassifier.train(train_set)    # train a naive Bayes classifier
print(nltk.classify.accuracy(classifier, test_set))        # evaluate on the test set

The output is just over 0.7: the accuracy of this system is clearly lower than that of the earlier classifier, which looked only at the final letter.

Error analysis on a development set: once an initial feature set has been chosen, a very productive way to refine it is error analysis. First, we select a development set containing the corpus data used for creating the model. The development set is then subdivided into a training set and a dev-test set.

train_names = names[1500:]
devtest_names = names[500:1500]
test_names = names[:500]

The training set is used to train the model, the dev-test set is used for error analysis, and the test set is reserved for the final evaluation of the system. Having divided the corpus into appropriate datasets, we train a model on the training set and run it on the dev-test set.

train_set = [(gender_features(n),g) for (n,g) in train_names]
devtest_set = [(gender_features(n),g) for (n,g) in devtest_names]
test_set = [(gender_features(n),g) for (n,g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier,devtest_set))

The accuracy of the output is 0.75

Using the dev-test set, we can generate a list of the errors the classifier makes when predicting name genders. We can then inspect the individual cases where it predicted the wrong label and adjust the feature set accordingly.

errors = []
for (name, tag) in devtest_names:
    guess = classifier.classify(gender_features(name))
    if guess != tag:
        errors.append((tag, guess, name))
print(errors)

There are about 200 errors in the output.

for (tag, guess, name) in sorted(errors):
    print('correct=%-8s guess=%-8s name=%-30s' % (tag, guess, name))

Output:

correct=female guess=male name=Ag
correct=female guess=male name=Alis
correct=female guess=male name=Allsun
correct=female guess=male name=Amargo
correct=female guess=male name=Annabell
correct=female guess=male name=Ardeen
...

Scanning this list shows that although names ending in n are usually male, two-letter suffixes such as en and un tend to belong to female names. A single final letter misses this, so we should refine the feature extractor to look at the last two letters.
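Rather than eyeballing the sorted list, the frequent suffixes among the errors can be tallied automatically. A sketch with a few invented error tuples standing in for the real `errors` list built above:

```python
from collections import Counter

# A few invented (correct, guess, name) tuples standing in for `errors`
errors = [('female', 'male', 'Allsun'), ('female', 'male', 'Ardeen'),
          ('female', 'male', 'Annabell'), ('male', 'female', 'Aldrich')]

# Tally the two-letter suffixes of the misclassified names
suffix_counts = Counter(name[-2:].lower() for (tag, guess, name) in errors)
print(suffix_counts.most_common(3))
```

The most common suffixes in this tally are natural candidates for new features.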

def gender_features(word):
    return {'suffix1': word[-1],
            'suffix2': word[-2:]}    # last two letters, e.g. 'en', 'un'

Retest:

train_set = [(gender_features(n),g) for (n,g) in train_names]
devtest_set = [(gender_features(n),g) for (n,g) in devtest_names]
test_set = [(gender_features(n),g) for (n,g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier,devtest_set))

Sure enough, the accuracy rises to about 0.77, a little higher than before. Keep in mind, though, that error analysis must be repeated with care: after enough rounds of tuning against the same dev-test set, the classifier starts to reflect idiosyncrasies of that set rather than general patterns, and a fresh dev-test set is needed.
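Once the feature set has stabilized, the last step would be a single evaluation on the held-out test set, which was never consulted during development. A sketch of that final check, using a tiny invented train/test split in place of the real corpus:

```python
import nltk

def gender_features(word):
    # Two-suffix extractor in the spirit of the refined version above
    return {'suffix1': word[-1], 'suffix2': word[-2:]}

# Invented miniature splits standing in for train_names / test_names
train_names = [('Anna', 'female'), ('Ellen', 'female'), ('Karen', 'female'),
               ('Mark', 'male'), ('John', 'male'), ('Erik', 'male')]
test_names = [('Maria', 'female'), ('Jack', 'male')]

train_set = [(gender_features(n), g) for (n, g) in train_names]
test_set = [(gender_features(n), g) for (n, g) in test_names]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# One final score on data never touched during development
print(nltk.classify.accuracy(classifier, test_set))
```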

Posted on Mon, 22 Nov 2021 04:37:13 -0500 by pelegk2