NLTK-007: Classifying text (document sentiment classification)

We have already seen several examples of corpora in which documents are labelled by category. Using these corpora, we can build classifiers that automatically assign appropriate category labels to new documents. First, we construct a list of documents, each paired with its category. For this example, I use the movie review corpus in NLTK and classify each review as positive or negative.

import random
from nltk.corpus import movie_reviews

# Pair each review's word list with its category ('pos' or 'neg'),
# then shuffle so the train/test split is not ordered by category.
documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
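
As a quick sanity check (an illustrative snippet; which review comes first depends on the shuffle), we can look at how many labelled documents we have and what one entry looks like:

print(len(documents))                         # number of (word list, category) pairs
print(documents[0][1], documents[0][0][:10])  # category and first ten tokens of one review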

Next, we define a feature extractor so that the classifier knows which aspects of the data to pay attention to. For document topic classification, we can define one feature per word, indicating whether the document contains that word.

To limit the number of features the classifier has to process, we first build a list of the 2,000 most frequent words in the whole corpus, and then define a feature extractor that simply checks whether each of these words appears in a given document.

import nltk

all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
# Take the 2000 most frequent words as the feature vocabulary.
word_features = [word for (word, count) in all_words.most_common(2000)]

def document_features(document):
    document_words = set(document)  # set membership tests are much faster than list lookups
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features
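
To see what the extractor produces, we can run it on one of the shuffled documents and peek at a few of its 2,000 boolean features (illustrative only; which values are True depends on the document):

sample_features = document_features(documents[0][0])
print(list(sample_features.items())[:5])   # a handful of the contains(...) features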

Now that we have defined our feature extractor, we can use it to train a classifier to label new movie reviews. To check how reliable the resulting classifier is, we compute its accuracy on the test set. Then we use show_most_informative_features() to find out which features the classifier found most informative.

Train and test a classifier for document classification:

featuresets = [(document_features(d), c) for (d, c) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]  # hold out 100 documents for testing
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
classifier.show_most_informative_features(5)  # prints its own output; no need to wrap in print()

Output: accuracy 0.86 (neg = negative, pos = positive)

We can see, for example, that in this corpus a review mentioning "justin" is about nine times more likely to be negative than positive.

We previously built a regular-expression tagger that chooses part-of-speech tags by looking at the internal make-up of a word, but its rules had to be written by hand. We can instead train a classifier to work out which suffixes carry the most information.
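
For reference, a hand-written tagger of that kind might look like the minimal sketch below; the patterns here are illustrative assumptions, not the exact rules used earlier:

import nltk
from nltk.corpus import brown

# A few hand-written suffix rules, tried in order; the last pattern is a catch-all.
patterns = [
    (r'.*ing$', 'VBG'),   # gerunds
    (r'.*ed$', 'VBD'),    # simple past
    (r'.*es$', 'VBZ'),    # 3rd person singular present
    (r'.*s$', 'NNS'),     # plural nouns
    (r'.*', 'NN'),        # default: noun
]
regexp_tagger = nltk.RegexpTagger(patterns)
print(regexp_tagger.tag(brown.sents()[3]))

Writing and ordering such rules by hand quickly becomes tedious. To train a classifier instead, let's first find the most common suffixes: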

import nltk
from nltk.corpus import brown

# Count how often each final 1-, 2-, and 3-letter suffix occurs in the Brown corpus.
suffix_fdist = nltk.FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1

# Keep the 100 most frequent suffixes.
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
print(common_suffixes)

Output: ['e', 'he', 'the', 'n', 'on', 'ton', 'y', 'ty', 'nty', 'urt', 'dge', 'od']

Next, we define a feature extraction function that checks a given word for these suffixes.

def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        # One boolean feature per common suffix: does the word end with it?
        features['endswith(%s)' % suffix] = word.lower().endswith(suffix)
    return features
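
As a quick illustration (which features fire depends on what ended up in common_suffixes), we can see which suffix features are True for a sample word:

sample = pos_features('cats')
print([name for (name, value) in sample.items() if value])   # the suffix features that match 'cats'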

A feature extraction function behaves like a pair of tinted glasses: it highlights some of the properties (colours) in our data and makes it impossible to see others. When deciding how to label inputs, the classifier relies entirely on the properties that are highlighted. In this case, the classifier will make its decisions based only on which common suffixes (if any) a given word has.

Now we have defined our own feature extractor, which can be used to train a new decision tree classifier.

tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(word), tag) for (word, tag) in tagged_words]

# Hold out 10% of the data for testing.
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]

classifier = nltk.DecisionTreeClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
print(classifier.classify(pos_features('cats')))
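
Because this is a decision tree, we can also inspect the rules it learned. A minimal sketch, assuming the classifier trained above: print the top of the tree as pseudocode, truncated to a few levels.

print(classifier.pseudocode(depth=4))   # show the learned suffix checks as nested if/elif rules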

Exploring context

Contextual features often provide powerful clues about the correct tag. For example, when tagging the word "fly", knowing that the previous word is "a" tells us it is a noun rather than a verb, while a preceding "to" tells us it is a verb. So let's build a part-of-speech classifier whose feature detector examines the context in which a word appears in order to decide which part-of-speech tag to assign; in particular, it uses the previous word as a feature.

def pos_features(sentence, i):
    # Suffix features for the word itself, plus the previous word as context.
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = ""
    else:
        features["prev-word"] = sentence[i-1]
    return features

print(brown.sents()[0])

Output: [u'The', u'Fulton', u'County', u'Grand', ..., u'place', u'.']

pos_features(brown.sents()[0], 8)

Output: {'suffix(3)': u'ion', 'prev-word': u'an', 'suffix(2)': u'on', 'suffix(1)': u'n'}

tagged_sents = brown.tagged_sents(categories='news')
featuresets = []
for tagged_sent in tagged_sents:
    untagged_sent = nltk.tag.untag(tagged_sent)
    for i, (word, tag) in enumerate(tagged_sent):
        # Build context features from the untagged sentence; the gold tag is the label.
        featuresets.append((pos_features(untagged_sent, i), tag))

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

Output: 0.77

Using context features improves the performance of our part-of-speech tagger. For example, the classifier can learn that a word is likely to be a noun if it comes immediately after "large" or "gubernatorial".
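
To put the context classifier to work on a new sentence, we can feed it each position in turn (a minimal sketch, assuming the contextual classifier and pos_features defined above are still in scope):

sentence = ['The', 'cat', 'sat', 'on', 'the', 'mat', '.']
tags = [classifier.classify(pos_features(sentence, i)) for i in range(len(sentence))]
print(list(zip(sentence, tags)))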
