NLTK-008: Text classification (further examples of supervised classification)

Sentence segmentation:

Sentence segmentation can be regarded as a classification task for punctuation: whenever we encounter a symbol that could end a sentence, such as a period or a question mark, we must decide whether it terminates the current sentence.

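# First, obtain some data that has already been segmented into sentences,
# then convert it into a form suitable for feature extraction.
import nltk
sents = nltk.corpus.treebank_raw.sents()
tokens = []          # all tokens from all sentences, merged into one list
boundaries = set()   # indexes of the tokens that end a sentence
offset = 0
for sent in sents:
    tokens.extend(sent)
    offset += len(sent)
    boundaries.add(offset-1)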

Here, tokens is a single list that merges the tokens of the individual sentences, and boundaries is a set containing the indexes of all sentence-boundary tokens. Next, we specify the features of the data that will be used to decide whether a punctuation mark indicates a sentence boundary:

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}
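To see what the token list, the boundary set and the extracted features look like, here is a minimal sketch on hand-made data. The two toy sentences are an assumption used only for illustration; they stand in for the treebank_raw corpus so the example runs without downloading anything.

```python
# Hypothetical toy sentences standing in for nltk.corpus.treebank_raw.sents()
toy_sents = [['Mr', '.', 'Smith', 'left', '.'], ['He', 'was', 'late', '.']]

tokens = []
boundaries = set()
offset = 0
for sent in toy_sents:
    tokens.extend(sent)          # merge all sentences into one token list
    offset += len(sent)
    boundaries.add(offset - 1)   # index of the last token of each sentence

def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

print(sorted(boundaries))         # [4, 8]
print(punct_features(tokens, 1))  # the period after 'Mr' (not a boundary)
print(punct_features(tokens, 4))  # the period after 'left' (a boundary)
```

Both periods are followed by a capitalized word, so in this toy case only the 'prevword' feature separates the abbreviation from the real sentence end.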

Based on this feature extractor, we can create a list of labeled feature sets by selecting all the punctuation tokens and marking whether each one is a sentence boundary:

featuresets = [(punct_features(tokens, i), (i in boundaries))
               for i in range(1, len(tokens)-1)
               if tokens[i] in '.?!']

Using these feature sets, train and evaluate a punctuation classifier:

size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))
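The slicing used for the split can be checked on placeholder data. This is only a sketch: the 25 dummy feature sets below are hypothetical and carry no real features. It shows that the first 10% of the list becomes the test set and the rest becomes the training set, with no shuffling.

```python
# Hypothetical placeholder data: 25 (features, label) pairs
featuresets = [({'id': n}, n % 2 == 0) for n in range(25)]

size = int(len(featuresets) * 0.1)   # int(2.5) == 2
train_set, test_set = featuresets[size:], featuresets[:size]

print(len(train_set), len(test_set))  # 23 2
print(test_set[0][0])                 # {'id': 0}, the first item goes to the test set
```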

Use this classifier to perform sentence segmentation:

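def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # punct_features reads words[i-1] and words[i+1], so skip the first and last token
        if (0 < i < len(words) - 1 and word in '.?!'
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])   # close the current sentence
            start = i+1
    if start < len(words):
        sents.append(words[start:])          # remaining tokens form the last sentence
    return sents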
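The segmentation loop can be tried without a trained model by substituting a rule-based stand-in for the classifier. In this sketch, RuleClassifier and its small abbreviation list are hypothetical and are not part of NLTK. They only mimic the classify() call so that the loop logic is visible. The index guard is also an addition, needed because the feature extractor looks one token to each side.

```python
def punct_features(tokens, i):
    return {'next-word-capitalized': tokens[i+1][0].isupper(),
            'prevword': tokens[i-1].lower(),
            'punct': tokens[i],
            'prev-word-is-one-char': len(tokens[i-1]) == 1}

class RuleClassifier:
    """Hypothetical stand-in for the trained NaiveBayesClassifier."""
    ABBREVIATIONS = {'mr', 'mrs', 'dr'}

    def classify(self, features):
        # Boundary if the next word is capitalized and the previous
        # word is not a known abbreviation.
        return (features['next-word-capitalized']
                and features['prevword'] not in self.ABBREVIATIONS)

classifier = RuleClassifier()

def segment_sentences(words):
    start = 0
    sents = []
    for i, word in enumerate(words):
        # Skip the first and last token: the features need both neighbours.
        if (0 < i < len(words) - 1 and word in '.?!'
                and classifier.classify(punct_features(words, i))):
            sents.append(words[start:i+1])
            start = i + 1
    if start < len(words):
        sents.append(words[start:])   # whatever is left is the final sentence
    return sents

words = ['Mr', '.', 'Smith', 'left', '.', 'He', 'was', 'late', '.']
result = segment_sentences(words)
for sent in result:
    print(sent)
```

The period after 'Mr' is rejected by the stand-in, so the input is split into two sentences rather than three.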

Identifying dialogue act types

When processing dialogue, it can be useful to think of an utterance as a type of action performed by the speaker. This interpretation is most straightforward for performative statements such as "I forgive you" or "I bet you can't climb that mountain." But greetings, questions, answers, assertions and clarifications can all be regarded as types of speech-based actions. Recognizing the dialogue acts that underlie the utterances in a dialogue is an important first step in understanding the conversation.

The NPS Chat Corpus contains more than 10,000 posts from instant messaging sessions. Each post has been labeled with one of 15 dialogue act types.

posts = nltk.corpus.nps_chat.xml_posts()[:10000]

Define a simple feature extractor to check what words the post contains:

def dialogue_act_features(post):
    features = {}
    for word in nltk.word_tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features
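To see the shape of the resulting feature dictionary, here is a small sketch. As an assumption made so that it runs without the NLTK tokenizer models, it swaps nltk.word_tokenize for plain str.split through a hypothetical tokenize parameter. The real tokenizer would also split off punctuation.

```python
def dialogue_act_features(post, tokenize=str.split):
    # tokenize is a hypothetical parameter; the extractor in this post
    # calls nltk.word_tokenize directly.
    features = {}
    for word in tokenize(post):
        features['contains(%s)' % word.lower()] = True
    return features

example = dialogue_act_features('Hi everyone hi')
print(example)  # {'contains(hi)': True, 'contains(everyone)': True}
```

Because the words are lowercased and stored as dictionary keys, repeated words collapse into a single feature.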

Then extract the features for each post, construct the training and test data, and train a new classifier:

featuresets = [(dialogue_act_features(post.text), post.get('class'))
               for post in posts]
size = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[size:], featuresets[:size]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print(nltk.classify.accuracy(classifier, test_set))

Recognizing textual entailment (RTE)

Recognizing textual entailment (RTE) is the task of determining whether a given piece of text T entails another text, called the hypothesis.

In our RTE feature detector, we let words (i.e. word types) serve as proxies for information. Our features count the degree of word overlap, and the degree to which there are words in the hypothesis but not in the text (captured by the hyp_extra() method). Not all words are equally important: named entity mentions, such as the names of people, organizations and places, are likely to be more significant, which motivates us to extract distinct information for words and nes (named entities). In addition, some high-frequency function words are filtered out as "stop words".

Building the feature extractor:

def rte_features(rtepair):
    extractor = nltk.RTEFeatureExtractor(rtepair)
    features = {}
    features['word_overlap'] = len(extractor.overlap('word'))
    features['word_hyp_extra'] = len(extractor.hyp_extra('word'))
    features['ne_overlap'] = len(extractor.overlap('ne'))
    features['ne_hyp_extra'] = len(extractor.hyp_extra('ne'))
    return features
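The idea behind the overlap and hyp_extra counts can be sketched with plain sets. This is a simplified illustration and not NLTK's RTEFeatureExtractor: the function name toy_rte_features, the tiny stop list and the whitespace tokenization are all assumptions, and named entities are left out.

```python
# Hypothetical, very small stop list used only for this sketch
STOPWORDS = {'a', 'the', 'is', 'of', 'in', 'to', 'was'}

def toy_rte_features(text, hypothesis):
    # Lowercased word types with stop words removed
    text_words = {w.lower() for w in text.split()} - STOPWORDS
    hyp_words = {w.lower() for w in hypothesis.split()} - STOPWORDS
    return {
        # words shared by the hypothesis and the text
        'word_overlap': len(hyp_words & text_words),
        # words in the hypothesis that the text does not contain
        'word_hyp_extra': len(hyp_words - text_words),
    }

features = toy_rte_features('The cat sat in the garden',
                            'The cat was in the kitchen')
print(features)  # {'word_overlap': 1, 'word_hyp_extra': 1}
```

Here 'cat' is the only shared content word, and 'kitchen' appears only in the hypothesis, which hints that the text may not support it.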

To illustrate the content of these features, we can examine some attributes of one text/hypothesis pair from the corpus:

rtepair = nltk.corpus.rte.pairs(['rte3_dev.xml'])[33]
extractor = nltk.RTEFeatureExtractor(rtepair)
print(extractor.overlap('word'))
print(extractor.overlap('ne'))
print(extractor.hyp_extra('word'))

Posted on Mon, 22 Nov 2021 13:29:27 -0500 by timolein