Analysis of unsupervised keyphrase generation, blog 11: tfidf.py

2021SC@SDUSC

In the last blog, we completed the analysis of utils.py in the project. In this blog, we will analyze the tfidf.py file in pke. First, we analyze the calculation method of the TF-IDF score in combination with the paper, then walk through a usage example, and finally analyze the TF-IDF source code.

1. Calculation method of TF-IDF

We know that the keyphrase generation problem essentially ranks a set of candidate phrases, so we need a method to score them. Both lexical and semantic similarity matter for this ranking; therefore, this project combines the two similarities to obtain silver labels.

The first is embedding similarity. Techniques such as Word2Vec and Doc2Vec can encode the extracted keyphrases and the documents into the same vector space, so semantic similarity can be measured by cosine distance in that space. In this paper, a Doc2Vec model trained on the English Wikipedia dataset encodes the documents and candidate keyphrases in the corpus into 300-dimensional vectors. Denoting the embedding of a document $d$ as $\mathbf{e}_d$ and the embedding of a candidate phrase $p$ as $\mathbf{e}_p$, the semantic similarity is calculated as:

$$\mathrm{sim}_{sem}(p, d) = \cos(\mathbf{e}_p, \mathbf{e}_d) = \frac{\mathbf{e}_p \cdot \mathbf{e}_d}{\lVert \mathbf{e}_p \rVert \, \lVert \mathbf{e}_d \rVert}$$
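To make the cosine step concrete, here is a minimal sketch. semantic_similarity is a hypothetical helper, and the random vectors merely stand in for real Doc2Vec embeddings:

import numpy as np

def semantic_similarity(doc_vec, phrase_vec):
    # Cosine similarity between a document vector and a phrase vector
    return np.dot(doc_vec, phrase_vec) / (
        np.linalg.norm(doc_vec) * np.linalg.norm(phrase_vec))

# Random 300-dimensional vectors stand in for real Doc2Vec embeddings
rng = np.random.default_rng(0)
e_d = rng.normal(size=300)   # document embedding
e_p = rng.normal(size=300)   # candidate phrase embedding
print(semantic_similarity(e_d, e_p))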

TF-IDF measures lexical similarity. For a corpus $C$ and a document $d \in C$ containing $|d|$ words, let $\#(p, d)$ denote the number of occurrences of phrase $p$ in $d$, and let $\mathrm{DF}(p)$ denote the number of documents in $C$ in which $p$ appears. The similarity at the lexical level can then be expressed as:

$$\mathrm{sim}_{lex}(p, d) = \mathrm{TFIDF}(p, d) = \frac{\#(p, d)}{|d|} \cdot \log \frac{|C|}{\mathrm{DF}(p)}$$
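A small sketch of this formula, with a hypothetical lexical_similarity helper and toy document-frequency counts, purely for illustration:

import math

def lexical_similarity(phrase, doc_tokens, df, num_docs):
    # TF: occurrences of the phrase in the document, normalized by length
    tf = doc_tokens.count(phrase) / len(doc_tokens)
    # IDF: log of total document count over documents containing the phrase
    idf = math.log(num_docs / df.get(phrase, 1))
    return tf * idf

doc = "deep learning models for keyphrase generation use deep features".split()
df = {'deep': 40}   # 'deep' appears in 40 of the 100 corpus documents
print(round(lexical_similarity('deep', doc, df, num_docs=100), 4))  # 0.2036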

When documents are long, the TF-IDF score is stable, while Doc2Vec is reliable for encoding short and moderately long documents. Combining semantic and lexical similarity with a geometric mean lets both signals be taken into account. The final score is calculated as:

$$\mathrm{score}(p, d) = \sqrt{\mathrm{sim}_{sem}(p, d) \cdot \mathrm{sim}_{lex}(p, d)}$$
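The combination itself is a one-liner; the two scores below are hypothetical values for a single candidate phrase:

import math

sim_sem = 0.62   # cosine similarity from the Doc2Vec embeddings
sim_lex = 0.31   # lexical (TF-IDF) similarity, rescaled to a comparable range
final_score = math.sqrt(sim_sem * sim_lex)   # geometric mean of the two
print(round(final_score, 3))   # 0.438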

2. TF-IDF usage example in pke

An example of keyphrase extraction with TF-IDF scoring is given below, followed by an analysis of the relevant code.

import string
import pke

# Create a TfIdf extractor
extractor = pke.unsupervised.TfIdf()

# Load document content
extractor.load_document(input='path/to/input',
                        language='en',
                        normalization=None)

# Select {1-3} syntax without punctuation as a candidate
extractor.candidate_selection(n=3, stoplist=list(string.punctuation))

# Use 'TF' x 'IDF' to weight candidates
df = pke.load_document_frequency_file(input_file='path/to/df.tsv.gz')
extractor.candidate_weighting(df=df)

# Take the top 10 candidate phrases as key phrases
keyphrases = extractor.get_n_best(n=10)

First, the pke package is imported. Then a TfIdf object from pke.unsupervised is created as the extractor, and its load_document method is called to load the document. stoplist is assigned a list of punctuation marks, so candidate keyphrases may not contain punctuation, and the n-gram size is set to 3. Next, the document frequency file is loaded from its input path, and the result is passed as the df parameter of candidate_weighting. Finally, the get_n_best method returns the top 10 candidate keyphrases.
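For completeness, pke also provides a helper to build the df.tsv.gz file from a document collection. The sketch below follows the pke documentation; exact parameter names may vary across pke versions:

import string
from pke import compute_document_frequency

# Build the document frequency file used above from a collection of documents
compute_document_frequency(input_dir='/path/to/collection/of/documents/',
                           output_file='path/to/df.tsv.gz',
                           extension='xml',            # input file extension
                           language='en',              # language of the documents
                           normalization='stemming',   # word normalization
                           stoplist=list(string.punctuation))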

3. Analysis of the tfidf.py source code

The structure of tfidf.py is as follows. There are two main functions, candidate_selection and candidate_weighting: the former extracts the candidate keyphrases, and the latter ranks them with TF-IDF.

First, the related packages are imported.

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import math
import string
import logging # Log module

from pke.base import LoadFile
from pke.utils import load_document_frequency_file # Load document frequency file

Let's first look at the candidate_selection function (both functions below are methods of the TfIdf class, which inherits from LoadFile):

def candidate_selection(self, n=3, stoplist=None, **kwargs):
    # Select phrases with 1- to n-gram features as keyphrase candidates
    self.ngram_selection(n=n)

    # If no stoplist is given, initialize it as the list of punctuation marks
    if stoplist is None:
        stoplist = list(string.punctuation)

    # Filter candidate key phrases containing punctuation
    self.candidate_filtering(stoplist=stoplist)

  About the parameters of this function:

  • n: expected to be an int, the n of the n-gram; the default value is 3
  • stoplist: expected to be a list used to filter candidate keyphrases; the default is None, in which case words that are punctuation marks from string.punctuation are not allowed
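To illustrate what the two calls above accomplish together, here is a hypothetical, simplified re-implementation of the selection logic (the real pke code works on sentence offsets and candidate objects rather than a flat token list):

import string

def select_candidates(tokens, n=3, stoplist=None):
    # Collect every 1..n-gram that contains no stoplisted token
    if stoplist is None:
        stoplist = list(string.punctuation)
    candidates = set()
    for size in range(1, n + 1):
        for i in range(len(tokens) - size + 1):
            ngram = tokens[i:i + size]
            if not any(tok in stoplist for tok in ngram):
                candidates.add(' '.join(ngram))
    return candidates

# '.' is filtered out, so 'baseline .' never becomes a candidate
print(sorted(select_candidates("tf idf is a strong baseline .".split(), n=2)))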

Now look at the candidate_weighting function:

def candidate_weighting(self, df=None):
    # Compute the scores of candidate keyphrases using document frequencies

    # If no document frequency count is provided, the default document frequency count is initialized
    if df is None:
        logging.warning('LoadFile._df_counts is hard coded to {}'.format(self._df_counts))
        df = load_document_frequency_file(self._df_counts, delimiter='\t')

    # Initialize the number of documents as --NB_DOC-- + 1 (for the current document)
    N = 1 + df.get('--NB_DOC--', 0)

    # Loop over the candidates
    for k, v in self.candidates.items():
        # Get candidate document frequency
        candidate_df = 1 + df.get(k, 0)
        # Calculate idf score
        idf = math.log(N / candidate_df, 2)
        # Weight = term frequency (number of surface forms) x idf
        self.weights[k] = len(v.surface_forms) * idf

Description of the parameter df: df is a dictionary of document frequencies; the total number of documents in the corpus is stored under the "--NB_DOC--" key.
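A worked example of the scoring loop above, with hypothetical document-frequency counts in the same format as df.tsv.gz:

import math

df = {'--NB_DOC--': 100, 'neural network': 10}

N = 1 + df.get('--NB_DOC--', 0)                  # 101 documents (corpus + current)
candidate_df = 1 + df.get('neural network', 0)   # 11
idf = math.log(N / candidate_df, 2)              # log2(101/11)
print(round(idf, 3))                             # 3.199
# A candidate with 4 surface forms in the document scores 4 * idf
print(round(4 * idf, 1))                         # 12.8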
