[NLP] Two methods of text keyword extraction: TF-IDF and TextRank

Background

A couple of days ago I saw TextRank used for keyword extraction in the paper Chinese Poetry Generation with Planning based Neural Network. While reading it, I was reminded that besides TextRank, TF-IDF is also commonly used for keyword extraction.

Which algorithm to use depends on the business scenario and on the characteristics of the algorithm itself. What does keyword extraction do? It automatically extracts meaningful words or phrases from a given text, where "meaningful" is ultimately defined by the characteristics of the chosen algorithm.

Note: both schemes are unsupervised. Keyword extraction can of course also be cast as a supervised classification problem, but supervised methods are not discussed here.

TF-IDF

1. Basic theory

TF-IDF (Term Frequency - Inverse Document Frequency) is a common weighting technique in information retrieval and data mining. TF is term frequency; IDF is inverse document frequency. Intuitively, the more often a word appears in an article and the less often it appears across all documents, the better it represents that article.

Term frequency (TF) is the number of times a given word appears in a document (in practice the unit can also be a sentence; this needs to be adapted flexibly to the task). The count is usually normalized by dividing it by the total number of words in the document: the same word tends to occur more often in a long document than in a short one regardless of its importance, and the normalization prevents a bias toward long documents. In the literature, the term frequency $tf_{dt}$ of term $t$ in a document (or sentence) $d$ is written as:

$$tf_{dt} = \frac{m_{dt}}{M_d}$$

where $m_{dt}$ is the number of occurrences of term $t$ in $d$, and $M_d$ is the total number of words in $d$.
Texts also contain many uninformative words, such as the Chinese function words "的", "是" and "在", which are among the most common of all. These are called stop words: they are plentiful and contribute little to representing the text. This is where the inverse document frequency $idf$ comes in; it reflects how important a term is across the document collection:

$$idf_t = \log\left( \frac{N}{n_t + 0.1} \right)$$

where $N$ is the total number of documents and $n_t$ is the number of documents containing term $t$ (0.1 is added to keep the denominator from being 0). The physical meaning is easy to picture: if every document contains term $t$, then $n_t = N$ and $idf_t$ comes out close to 0, which says the feature is not very discriminative.

So the final TF-IDF weight of a term $t$ in a document $d$ is:

$$w_{dt} = tf_{dt} \times idf_t$$
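To make the formulas concrete, here is a toy sketch with made-up tokens (my own illustration, not jieba's implementation):

import math

# toy corpus: each document is a list of tokens (assume already segmented)
docs = [
    ["apple", "banana", "apple"],
    ["banana", "cherry"],
    ["apple", "cherry", "cherry", "date"],
]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)            # tf_dt = m_dt / M_d
    n_t = sum(1 for d in docs if term in d)    # documents containing the term
    idf = math.log(len(docs) / (n_t + 0.1))    # idf_t = log(N / (n_t + 0.1))
    return tf * idf

print(tf_idf("apple", docs[0], docs))   # frequent in doc 0, appears in 2 of 3 docs
print(tf_idf("date", docs[2], docs))    # rare term, so its idf is higher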

2. Code interpretation

Now let's look at how TF-IDF is implemented in the jieba segmentation library (the source is walked through in the Code Interpretation section below). scikit-learn also provides a TF-IDF implementation, but it encapsulates model training quite heavily, which hides the program design and algorithmic craft around the algorithm that we want to study here.

TextRank

1. Introduction to the PageRank principle

Understanding PageRank makes it easier to understand the basic principle of TextRank. PageRank was originally used to measure the importance of web pages. The whole web can be regarded as a directed graph whose nodes are web pages: if page A links to page B, there is a directed edge from A to B. Two basic ideas underlie it:

Link count: the more other pages link to a page, the more important that page is.
Link quality: the more important the pages linking to a page are, the more important that page is.

The scoring formula for a node is as follows:
$$S(V_i) = (1-d) + d \sum_{j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$

where:

  • $S(V_i)$: the importance score of page $V_i$, i.e. its PageRank (PR) value
  • $In(V_i)$: the set of all pages that link to $V_i$
  • $Out(V_j)$: the set of pages that $V_j$ links to; $|Out(V_j)|$ is the number of such pages
  • $d$: a damping factor, generally set to 0.85, introduced so that pages with no inbound links still receive a score

Through repeated iteration, the PR values converge to stable values. When the damping factor approaches 1, the number of iterations required rises sharply and the ranking becomes unstable.

How are the initial scores determined? The algorithm initializes every page's score to 1 and then iterates until each page's score converges; the converged score is the page's final score. If the scores do not converge, a maximum iteration count can be set, and the scores at the point the computation stops are taken as final.
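Here is a bare-bones sketch of that iteration on a made-up four-page graph (an illustration, not a production implementation):

# links[p] lists the pages that p links to (a toy graph)
links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
d = 0.85
scores = {page: 1.0 for page in links}    # initialize all scores to 1

for _ in range(50):                       # iterate until (approximately) converged
    new_scores = {}
    for page in links:
        # pages that link to `page`
        incoming = [p for p, outs in links.items() if page in outs]
        rank_sum = sum(scores[p] / len(links[p]) for p in incoming)
        new_scores[page] = (1 - d) + d * rank_sum
    scores = new_scores

print(sorted(scores.items(), key=lambda kv: -kv[1]))

Note that page D, which nothing links to, still ends up with the baseline score (1 - d) = 0.15, which is exactly what the damping term is for.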

2. Introduction to the TextRank principle

For keyword extraction, the TextRank algorithm follows the same idea as PageRank. The differences are that TextRank takes words as nodes and links nodes that co-occur, and that PageRank's edges are directed while TextRank's are undirected (equivalently, bidirectional).

What is co-occurrence? After segmenting the text and removing stop words (or filtering by part of speech), set a window of length $k$, meaning at most $k$ words appear in it. Slide the window over the text; an undirected edge can be drawn between any two words that appear in the same window.
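As a small sketch of this windowing (toy tokens, window $k=3$; my own illustration):

from collections import defaultdict

tokens = ["systems", "linear", "constraints", "natural", "numbers"]
k = 3                                      # window size: at most k words co-occur
cooccur = defaultdict(int)

for i, w in enumerate(tokens):
    for j in range(i + 1, min(i + k, len(tokens))):
        pair = tuple(sorted((w, tokens[j])))   # undirected edge, so order-independent
        cooccur[pair] += 1

print(dict(cooccur))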

The example in the original paper extracts keywords from the following text:

Compatibility of systems of linear constraints over the set of natural numbers.
Criteria of compatibility of a system of linear Diophantine equations, strict
inequations, and nonstrict inequations are considered. Upper bounds for
components of a minimal set of solutions and algorithms of construction of
minimal generating sets of solutions for all types of systems are given.
These criteria and the corresponding algorithms for constructing a minimal
supporting set of solutions can be used in solving all the considered types
systems and systems of mixed types.

Constructing the co-occurrence graph over this text yields the word graph shown in the original paper, and the top-ranked words become the keywords (both the graph figure and the keyword list are omitted here).

Overall, the effect is decent. The TextRank score is computed as follows:

$$WS(V_i) = (1-d) + d \times \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$

Compared with PageRank, this formula only adds a weight term $w_{ji}$, which lets the edges between different pairs of nodes carry different degrees of importance.

The TextRank keyword-extraction procedure is as follows:

  1. Split the given text $T$ into complete sentences: $T=[S_1, S_2, \cdots, S_m]$
  2. For each sentence $S_i \in T$, perform word segmentation and part-of-speech tagging, filter out stop words, and keep only words of the specified parts of speech (e.g. nouns, verbs, adjectives): $S_i=[t_{i,1}, t_{i,2}, \cdots, t_{i,n}]$, where each $t_{i,j}$ is a retained candidate keyword
  3. Build the candidate keyword graph $G=(V,E)$, where the node set $V$ consists of the candidate keywords from the previous step, and an edge is drawn between two nodes only when their words co-occur within a window of length $k$, i.e. at most $k$ words co-occur
  4. Iteratively propagate the node weights according to the formula above until convergence
  5. Sort the nodes by weight in descending order and take the top $T$ words as candidate keywords
  6. Mark the top-$T$ words from the previous step in the original text; if any of them are adjacent, merge them into a multi-word keyword

The last step above amounts to post-processing the TextRank result; the positions of candidate keywords in the text can also be taken into account, since words near the beginning and end of a document are generally more important.

Key-phrase extraction builds on keyword extraction: simply put, if several extracted keywords are adjacent in the original text, they together form an extracted key phrase.
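A minimal sketch of this merging step (the function name and inputs are illustrative; joining with spaces suits English tokens, while Chinese keywords would be concatenated directly):

def merge_key_phrases(tokens, keywords):
    """Combine extracted keywords that are adjacent in the token sequence."""
    phrases, current = [], []
    for tok in tokens:
        if tok in keywords:
            current.append(tok)
        else:
            if current:
                phrases.append(" ".join(current))
                current = []
    if current:
        phrases.append(" ".join(current))
    return phrases

tokens = "minimal generating sets of solutions".split()
print(merge_key_phrases(tokens, {"minimal", "generating", "sets"}))
# -> ['minimal generating sets']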

If a summary is to be generated, each sentence of the text is treated as a node, and if two sentences are similar, an undirected weighted edge is added between their nodes. There are many ways to measure the similarity of two sentences.
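One common choice is the overlap measure used in the original TextRank paper; a minimal sketch:

import math

def sentence_similarity(s1, s2):
    # overlap measure from the TextRank paper:
    # |shared words| / (log|S1| + log|S2|)
    denom = math.log(len(s1)) + math.log(len(s2))
    if denom == 0:
        return 0.0
    return len(set(s1) & set(s2)) / denom

s1 = ["textrank", "builds", "a", "graph", "of", "sentences"]
s2 = ["each", "sentence", "is", "a", "node", "in", "the", "graph"]
print(sentence_similarity(s1, s2))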

Algorithm usage analysis

For the TF-IDF algorithm: when extracting keywords from documents inside an existing corpus, the corpus itself can be used to compute each word's weight and obtain each document's keywords. When extracting keywords from a new text, the quality of the result depends on the existing corpus that supplies the idf values.

For TextRank: if the text is long, keywords can be extracted from it directly, with no external corpus required. If the text is short, such as a single sentence, computing word importance from corpus data (as TF-IDF does) may work better.

In other words, text length also needs to be considered when choosing between the two algorithms.

Code interpretation

1. Code implementation

jieba is an indispensable tool in Chinese text processing, and it implements both the TF-IDF and TextRank algorithms. Here we look at how they are implemented and how they are used. In the Python package the two implementations live at jieba/analyse/tfidf.py and jieba/analyse/textrank.py.

Walking through the source code:

1.1 TF-IDF

First, a keyword extraction class is implemented:

class KeywordExtractor(object):

    STOP_WORDS = set((
        "the", "of", "is", "and", "to", "in", "that", "we", "for", "an", "are",
        "by", "be", "as", "on", "with", "can", "if", "from", "which", "you", "it",
        "this", "then", "at", "have", "all", "not", "one", "has", "or", "that"
    ))

    def set_stop_words(self, stop_words_path):
        # load a user-supplied stop word file, one word per line
        abs_path = _get_abs_path(stop_words_path)  # jieba helper that resolves the path
        if not os.path.isfile(abs_path):
            raise Exception("jieba: file does not exist: " + abs_path)
        content = open(abs_path, 'rb').read().decode('utf-8')
        for line in content.splitlines():
            self.stop_words.add(line)

    def extract_tags(self, *args, **kwargs):
        raise NotImplementedError
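In normal use you rarely instantiate this class yourself; the stop word list can be swapped through jieba's module-level helper jieba.analyse.set_stop_words, for example (the file path here is illustrative):

import jieba.analyse

# stop_words.txt: one stop word per line (path is an assumption for illustration)
jieba.analyse.set_stop_words("stop_words.txt")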

The core code is as follows:

class TFIDF(KeywordExtractor):

    def __init__(self, idf_path=None):
        self.tokenizer = jieba.dt
        self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.idf_loader = IDFLoader(idf_path or DEFAULT_IDF)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def set_idf_path(self, idf_path):
        new_abs_path = _get_abs_path(idf_path)
        if not os.path.isfile(new_abs_path):
            raise Exception("jieba: file does not exist: " + new_abs_path)
        self.idf_loader.set_new_path(new_abs_path)
        self.idf_freq, self.median_idf = self.idf_loader.get_idf()

    def extract_tags(self, sentence, topK=20, withWeight=False, allowPOS=(), withFlag=False):
        """
        Extract keywords from sentence using TF-IDF algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v','nr'].
                        if the POS of w is not in this list,it will be filtered.
            - withFlag: only work with allowPOS is not empty.
                        if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        if allowPOS:
            allowPOS = frozenset(allowPOS)
            words = self.postokenizer.cut(sentence)
        else:
            words = self.tokenizer.cut(sentence)
        freq = {}
        for w in words:
            if allowPOS:
                if w.flag not in allowPOS:
                    continue
                elif not withFlag:
                    w = w.word
            wc = w.word if allowPOS and withFlag else w
            # drop single-character tokens and stop words
            if len(wc.strip()) < 2 or wc.lower() in self.stop_words:
                continue
            freq[w] = freq.get(w, 0.0) + 1.0
        total = sum(freq.values())
        for k in freq:
            kw = k.word if allowPOS and withFlag else k
            # tf (count/total) times idf; words absent from the idf table fall back to the median idf
            freq[k] *= self.idf_freq.get(kw, self.median_idf) / total

        if withWeight:
            tags = sorted(freq.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(freq, key=freq.__getitem__, reverse=True)
        if topK:
            return tags[:topK]
        else:
            return tags

jieba ships with precomputed idf values for roughly 270,000 words, so you can directly compute the TF-IDF of each word in the current sentence or document and read off the keywords. If you need idf values computed from your own corpus, it is better to use a more specialized library such as scikit-learn.
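A minimal scikit-learn sketch of corpus-based TF-IDF looks like this (note scikit-learn's idf formula differs slightly from the one derived above, and get_feature_names_out requires scikit-learn >= 1.0):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the quick brown fox is quick",
    "a lazy dog sleeps all day",
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)          # documents x vocabulary matrix

row = tfidf[1].toarray().ravel()                  # weights for the second document
terms = np.array(vectorizer.get_feature_names_out())
top = row.argsort()[::-1][:3]
print(list(zip(terms[top], row[top])))            # top 3 terms by TF-IDF weight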

1.2 TextRank

Besides inheriting from KeywordExtractor, the TextRank implementation also defines a class for the undirected weighted graph:

class UndirectWeightedGraph:
    d = 0.85

    def __init__(self):
        self.graph = defaultdict(list)

    def addEdge(self, start, end, weight):
        # use a tuple (start, end, weight) instead of a Edge object
        self.graph[start].append((start, end, weight))
        self.graph[end].append((end, start, weight))

    def rank(self):
        ws = defaultdict(float)
        outSum = defaultdict(float)

        wsdef = 1.0 / (len(self.graph) or 1.0)   # initial score 1/N; `or 1.0` guards an empty graph
        for n, out in self.graph.items():
            ws[n] = wsdef
            outSum[n] = sum((e[2] for e in out), 0.0)

        # iterate in sorted key order so results are deterministic across runs
        sorted_keys = sorted(self.graph.keys())
        for x in xrange(10):  # fixed 10 iterations; xrange/itervalues come from jieba's _compat module
            for n in sorted_keys:
                s = 0
                for e in self.graph[n]:
                    s += e[2] / outSum[e[1]] * ws[e[1]]
                ws[n] = (1 - self.d) + self.d * s

        # float_info[0] is the largest float and float_info[3] the smallest positive normal,
        # so the first comparisons below always update min_rank and max_rank
        (min_rank, max_rank) = (sys.float_info[0], sys.float_info[3])

        for w in itervalues(ws):
            if w < min_rank:
                min_rank = w
            if w > max_rank:
                max_rank = w

        for n, w in ws.items():
            # min-max style normalization of the weights (note: not multiplied by 100)
            ws[n] = (w - min_rank / 10.0) / (max_rank - min_rank / 10.0)

        return ws

The core code is as follows:

class TextRank(KeywordExtractor):

    def __init__(self):
        self.tokenizer = self.postokenizer = jieba.posseg.dt
        self.stop_words = self.STOP_WORDS.copy()
        self.pos_filt = frozenset(('ns', 'n', 'vn', 'v'))
        self.span = 5

    def pairfilter(self, wp):
        return (wp.flag in self.pos_filt and len(wp.word.strip()) >= 2
                and wp.word.lower() not in self.stop_words)

    def textrank(self, sentence, topK=20, withWeight=False, allowPOS=('ns', 'n', 'vn', 'v'), withFlag=False):
        """
        Extract keywords from sentence using TextRank algorithm.
        Parameter:
            - topK: return how many top keywords. `None` for all possible words.
            - withWeight: if True, return a list of (word, weight);
                          if False, return a list of words.
            - allowPOS: the allowed POS list eg. ['ns', 'n', 'vn', 'v'].
                        if the POS of w is not in this list, it will be filtered.
            - withFlag: if True, return a list of pair(word, weight) like posseg.cut
                        if False, return a list of words
        """
        self.pos_filt = frozenset(allowPOS)
        g = UndirectWeightedGraph()
        cm = defaultdict(int)   # co-occurrence counts keyed by word pair
        words = tuple(self.tokenizer.cut(sentence))
        for i, wp in enumerate(words):
            if self.pairfilter(wp):
                # look ahead within the co-occurrence window of self.span words
                for j in xrange(i + 1, i + self.span):
                    if j >= len(words):
                        break
                    if not self.pairfilter(words[j]):
                        continue
                    if allowPOS and withFlag:
                        cm[(wp, words[j])] += 1
                    else:
                        cm[(wp.word, words[j].word)] += 1

        for terms, w in cm.items():
            g.addEdge(terms[0], terms[1], w)
        nodes_rank = g.rank()
        if withWeight:
            tags = sorted(nodes_rank.items(), key=itemgetter(1), reverse=True)
        else:
            tags = sorted(nodes_rank, key=nodes_rank.__getitem__, reverse=True)

        if topK:
            return tags[:topK]
        else:
            return tags

    extract_tags = textrank

This implementation likewise computes the topK keywords for the current document or sentence only, without using any external corpus. So for both TF-IDF and TextRank, if the business logic is complex you can either use scikit-learn or implement the algorithm yourself; even then, the code above is well worth referencing, as the implementation is quite clean.
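If you stay with jieba but want corpus-specific idf values, the set_idf_path method shown in the TFIDF class is also exposed at module level; a sketch (idf.txt is an assumed file, one "word idf_value" pair per line computed from your own corpus):

import jieba.analyse

# idf.txt: one "word idf_value" pair per line (path is an assumption for illustration)
jieba.analyse.set_idf_path("idf.txt")
res = jieba.analyse.extract_tags(text, topK=5)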

2. Usage examples

The test sample is a passage from a previous article of mine:

Some time ago I reviewed and studied seq2seq models based on RNN+Attention and on CNN+Attention ([NLP] seq2seq from shallow to deep, based on RNN and CNN), so now I want to find some cases to practice on.
The most common use of seq2seq is translation. There are English-to-French and English-to-German examples online, but honestly, how well can we really evaluate those? Perhaps there is no public corpus; frankly, I don't have one, ha ha. So I went looking on GitHub.
Besides machine translation, seq2seq has other interesting application scenarios. For example, when we call Haidilao to book a table, the female voice that answers is actually a robot taking the reservation, and it sounds smart; that uses seq2seq, though it also involves speech processing. Another is the Huawei team's seq2seq-based model that automatically replies to microblog posts; comparing different models yields a series of interesting results, as shown in the figure below, where the columns after post are the replies produced by four different models.

2.1 TF-IDF

jieba uses TF-IDF for keyword extraction by default:

import jieba.analyse

text = '''Some time ago I reviewed and studied seq2seq models based on RNN+Attention and on CNN+Attention ([NLP] seq2seq from shallow to deep, based on RNN and CNN), so now I want to find some cases to practice on.
The most common use of seq2seq is translation. There are English-to-French and English-to-German examples online, but honestly, how well can we really evaluate those? Perhaps there is no public corpus; frankly, I don't have one, ha ha. So I went looking on GitHub.
Besides machine translation, seq2seq has other interesting application scenarios. For example, when we call Haidilao to book a table, the female voice that answers is actually a robot taking the reservation, and it sounds smart; that uses seq2seq, though it also involves speech processing. Another is the Huawei team's seq2seq-based model that automatically replies to microblog posts; comparing different models yields a series of interesting results, as shown in the figure below, where the columns after post are the replies produced by four different models.
'''
res = jieba.analyse.extract_tags(text, topK=5)
print(res)

The results (output screenshot omitted) are still not entirely satisfactory.

2.2 TextRank

The interface for extracting keywords with TextRank in jieba is similar to the TF-IDF one:

res = jieba.analyse.textrank(text, topK=5)
print(res)


The results here (output omitted) look somewhat worse than TF-IDF's, although the keyword "model" is extracted. Then again, TF-IDF benefits from its precomputed idf values, and the topic of the test text is not very focused.
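Both interfaces accept the parameters shown in the source above; for example, to return weights and restrict parts of speech:

res = jieba.analyse.textrank(text, topK=5, withWeight=True,
                             allowPOS=('ns', 'n', 'vn', 'v'))
for word, weight in res:
    print(word, round(weight, 4))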

Summary

In general, TF-IDF and TextRank are not difficult to understand or implement (don't ask me why the TextRank iteration converges). For concrete scenarios and business needs, though, they have to be combined with a relevant corpus. If the business is complex, you can use packages with better interfaces, such as scikit-learn's TF-IDF, or implement TextRank yourself. Still, jieba's TF-IDF remains perfectly workable for keyword extraction: it ships with precomputed idf values covering some 270,000 words, and the overall effect is acceptable.

Reference

  1. Baidu Encyclopedia: TF-IDF
  2. Zhang Wei, Shi Qian, He Xiao, Wang Chen, Li Hexiang, Li Jiran. Research on an improved TF-IDF algorithm in text classification [J]. Information Technology and Network Security, 2021, 40(07): 72-76+83
  3. The PageRank Citation Ranking: Bringing Order to the Web
  4. TextRank: Bringing Order into Texts
  5. TextRank - Keyword Extraction
  6. PageRank and TextRank algorithms
