Decision tree algorithm

8. Feature extraction

1 Definition

Feature extraction converts arbitrary data (such as text or images) into numerical features that can be used for machine learning.

Note: converting data into feature values helps the computer understand the data better.

  • Categories of feature extraction:
    • Dictionary feature extraction (feature discretization)
    • Text feature extraction
    • Image feature extraction (to be introduced with deep learning)
2 Feature extraction API
sklearn.feature_extraction

3 Dictionary feature extraction

Purpose: convert dictionary data into feature values

  • sklearn.feature_extraction.DictVectorizer(sparse=True,...)
    • DictVectorizer.fit_transform(X)
      • X: a dictionary, or an iterator of dictionaries
      • Returns a sparse matrix
    • DictVectorizer.get_feature_names() returns the feature names (renamed get_feature_names_out() in scikit-learn 1.0; the old name was removed in 1.2)
3.1 Usage
from sklearn.feature_extraction import DictVectorizer

def dict_demo():
    """
    Feature extraction from dictionary-type data
    :return: None
    """
    data = [{'city': 'Beijing', 'temperature': 100},
            {'city': 'Shanghai', 'temperature': 60},
            {'city': 'Shenzhen', 'temperature': 30}]
    # 1. Instantiate a transformer
    transfer = DictVectorizer(sparse=False)
    # 2. Call fit_transform
    data = transfer.fit_transform(data)
    print("Results returned:\n", data)
    # Print the feature names
    # (on scikit-learn >= 1.2, call get_feature_names_out() instead)
    print("Feature Name:\n", transfer.get_feature_names())

    return None

Note the result when the parameter sparse=False is omitted:

Results returned:
   (0, 1)    1.0
  (0, 3)    100.0
  (1, 0)    1.0
  (1, 3)    60.0
  (2, 2)    1.0
  (2, 3)    30.0
 Feature Name:
 ['city=Shanghai', 'city=Beijing', 'city=Shenzhen', 'temperature']

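The sparse representation is hard to read, so pass sparse=False to get a dense array instead. (The column order shown here appears to come from a run on the original Chinese city names; DictVectorizer sorts feature names, so with the English names used in the code the city columns would come out as Beijing, Shanghai, Shenzhen.)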

Results returned:
 [[   0.    1.    0.  100.]
 [   1.    0.    0.   60.]
 [   0.    0.    1.   30.]]
Feature Name:
 ['city=Shanghai', 'city=Beijing', 'city=Shenzhen', 'temperature']
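
The two outputs carry the same information: the sparse form lists only the non-zero entries as (row, column) value, and the dense form fills in the zeros. A minimal pure-Python sketch of that relationship, using the six entries printed above (the helper name to_dense is hypothetical and is not part of scikit-learn):

```python
def to_dense(entries, n_rows, n_cols):
    """Expand {(row, col): value} entries into a dense list-of-lists matrix."""
    dense = [[0.0] * n_cols for _ in range(n_rows)]
    for (row, col), value in entries.items():
        dense[row][col] = value
    return dense

# The non-zero entries from the sparse output above
sparse_entries = {
    (0, 1): 1.0, (0, 3): 100.0,
    (1, 0): 1.0, (1, 3): 60.0,
    (2, 2): 1.0, (2, 3): 30.0,
}

dense = to_dense(sparse_entries, n_rows=3, n_cols=4)
for row in dense:
    print(row)
```

With scikit-learn itself, the sparse result can be converted with its toarray() method. The sparse form saves memory when most entries are zero, which is typical after one-hot encoding.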

This data-processing technique is called one-hot encoding.

3.2. Features that carry category information are one-hot encoded
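
To make the idea concrete, here is a rough pure-Python sketch of one-hot encoding for dictionary records. It only imitates what DictVectorizer does for string and numeric values; the function name one_hot_encode is an assumption made for illustration. Because the English feature names are sorted alphabetically here, the column order differs from the output shown earlier.

```python
def one_hot_encode(records):
    """One-hot encode string values; pass numeric values through unchanged."""
    # String values become "key=value" columns, numeric values keep their key
    names = set()
    for record in records:
        for key, value in record.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    column = {name: i for i, name in enumerate(names)}

    rows = []
    for record in records:
        row = [0.0] * len(names)
        for key, value in record.items():
            if isinstance(value, str):
                row[column[f"{key}={value}"]] = 1.0   # mark the category present
            else:
                row[column[key]] = float(value)       # copy the number
        rows.append(row)
    return names, rows

data = [{'city': 'Beijing', 'temperature': 100},
        {'city': 'Shanghai', 'temperature': 60},
        {'city': 'Shenzhen', 'temperature': 30}]
names, rows = one_hot_encode(data)
print(names)
for row in rows:
    print(row)
```

Each city gets its own 0/1 column, so no artificial ordering between categories is introduced, while the numeric temperature is kept as a single column.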

4. Text feature extraction

Purpose: convert text data into feature values

  • sklearn.feature_extraction.text.CountVectorizer(stop_words=[])
    • Return word frequency matrix
    • CountVectorizer.fit_transform(X)
      • X: text, or an iterable of text strings
      • Return value: a sparse matrix
    • CountVectorizer.get_feature_names() return value: list of words
  • sklearn.feature_extraction.text.TfidfVectorizer
4.1 Usage
from sklearn.feature_extraction.text import CountVectorizer

def text_count_demo():
    """
    Feature extraction from text with CountVectorizer
    :return: None
    """
    data = ["life is short,i like like python", "life is too long,i dislike python"]
    # 1. Instantiate a transformer
    # transfer = CountVectorizer(sparse=False)  # Note: CountVectorizer has no sparse parameter
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data = transfer.fit_transform(data)
    print("Results of text feature extraction:\n", data.toarray())
    # (on scikit-learn >= 1.2, call get_feature_names_out() instead)
    print("Return characteristic Name:\n", transfer.get_feature_names())

    return None

Return result:

Results of text feature extraction:
 [[0 1 1 2 0 1 1 0]
 [1 1 1 0 1 1 0 1]]
Return characteristic Name:
 ['dislike', 'is', 'life', 'like', 'long', 'python', 'short', 'too']
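
The counting above can be approximated in plain Python. This is only a sketch of the default behaviour (lowercasing, and a token pattern that keeps words of two or more characters, which is why the single letter "i" does not appear among the features); the function name count_words is hypothetical.

```python
import re
from collections import Counter

def count_words(docs):
    """Build a sorted vocabulary and a per-document word-count matrix."""
    # Same idea as CountVectorizer's default token pattern: 2+ word characters
    token_pattern = re.compile(r"(?u)\b\w\w+\b")
    tokenized = [token_pattern.findall(doc.lower()) for doc in docs]
    vocab = sorted({token for tokens in tokenized for token in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[word] for word in vocab])
    return vocab, matrix

docs = ["life is short,i like like python", "life is too long,i dislike python"]
vocab, matrix = count_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row is one document and each column is one vocabulary word, so "like" gets the count 2 in the first document.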

This only covers feature extraction from English text. What about Chinese text? Chinese is written without spaces between words, so it has to be segmented into words first.

5. jieba word segmentation

  • jieba.cut()
    • Returns a generator of words

The jieba library needs to be installed first:

pip3 install jieba

Convert the following three passages into feature values:

Today is cruel, tomorrow is more cruel, and the day after tomorrow is beautiful,
but most of them will die tomorrow night, so everyone should not give up today.

We see light coming from distant galaxies millions of years ago,
so when we see the universe, we are looking at its past.

If you only know something in one way, you won't really know it.
The secret of knowing the true meaning of things depends on how we relate them to what we know.
  • Analysis
    • Prepare sentences and use jieba.cut to segment words
    • Instantiate CountVectorizer
    • Join the segmentation result into a string and use it as the input to fit_transform
from sklearn.feature_extraction.text import CountVectorizer
import jieba

def cut_word(text):
    """
    Chinese word segmentation: insert spaces between words, e.g.
    "I love Beijing Tiananmen" written without spaces --> "I love Beijing Tiananmen"
    :param text:
    :return: text
    """
    # Segment the Chinese string with jieba and join the words with spaces
    text = " ".join(list(jieba.cut(text)))

    return text

def text_chinese_count_demo2():
    """
    Feature extraction from Chinese text
    :return: None
    """
    data = ["One is that today is cruel, tomorrow is crueler, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know."]
    # Convert the raw data into segmented, space-separated form
    text_list = []
    for sent in data:
        text_list.append(cut_word(sent))
    print(text_list)

    # 1. Instantiate a transformer
    # transfer = CountVectorizer(sparse=False)
    transfer = CountVectorizer()
    # 2. Call fit_transform
    data = transfer.fit_transform(text_list)
    print("Results of text feature extraction:\n", data.toarray())
    print("Return characteristic Name:\n", transfer.get_feature_names())

Return result:

Building prefix dict from the default dictionary ...
Dumping model to file cache /var/folders/mz/tzf2l3sx4rgg6qpglfb035_r0000gn/T/jieba.cache
Loading model cost 1.032 seconds.
['One is that today is cruel, tomorrow is crueler, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.', 'We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.', 'If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know.']
Prefix dict has been built succesfully.
Results of text feature extraction:
 [[2 0 1 0 0 0 2 0 0 0 0 0 1 0 1 0 0 0 0 1 1 0 2 0 1 0 2 1 0 0 0 1 1 0 0 1 0]
 [0 0 0 1 0 0 0 1 1 1 0 0 0 0 0 0 0 1 3 0 0 0 0 1 0 0 0 0 2 0 0 0 0 0 1 0 1]
 [1 1 0 0 4 3 0 0 0 0 1 1 0 1 0 1 1 0 1 0 0 1 0 0 0 1 0 0 0 2 1 0 0 1 0 0 0]]
Return characteristic Name:
 ['one kind', 'Can't', 'Do not', 'before', 'understand', 'Thing', 'Today', 'Just in', 'Millions of years', 'Issue', 'Depending on', 'only need', 'The day after tomorrow', 'Meaning', 'Gross', 'How', 'If', 'universe', 'We', 'therefore', 'give up', 'mode', 'Tomorrow', 'Galaxy', 'Night', 'Certain sample', 'cruel', 'each', 'notice', 'real', 'Secret', 'Absolutely', 'fine', 'contact', 'Past times', 'still', 'such']

But if we use such word features for classification, what problems will arise?

How should we handle a word or phrase that appears frequently across many articles?

6. TF-IDF text feature extraction

  • The main idea of TF-IDF: if a word or phrase appears frequently in one article but rarely in other articles, it is considered to discriminate well between categories and is suitable for classification.
  • Role of TF-IDF: to evaluate how important a word is to one document in a document set or corpus.
6.1 Formula
  • Term frequency (tf) is the frequency with which a given word appears in the document
  • Inverse document frequency (idf) measures how generally important a word is. The idf of a word is obtained by dividing the total number of documents by the number of documents containing the word, then taking the base-10 logarithm of the quotient
  • The TF-IDF score of a word in a document is the product tf × idf
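
A small worked example may help. The sketch below follows the definitions above, with the TF-IDF score taken as the product of tf and idf. All numbers are hypothetical and chosen only for illustration: a 100-word article in which a word occurs 5 times, and a corpus of 10,000,000 documents of which 10,000 contain that word.

```python
import math

def tf(word_count, total_words):
    """Term frequency: share of the document's words that are this word."""
    return word_count / total_words

def idf(n_docs, n_docs_with_word):
    """Inverse document frequency with a base-10 logarithm, as defined above."""
    return math.log10(n_docs / n_docs_with_word)

# Hypothetical numbers, not taken from any real corpus
term_frequency = tf(5, 100)                      # 5 occurrences in 100 words
inverse_doc_frequency = idf(10_000_000, 10_000)  # word appears in 10,000 documents
score = term_frequency * inverse_doc_frequency

print(term_frequency, inverse_doc_frequency, score)
```

Here tf is 0.05, idf is 3, and the score is about 0.15. A word that appears in every document would get an idf of 0 under this definition, which is exactly how TF-IDF suppresses words that are common everywhere.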
6.2 Example
from sklearn.feature_extraction.text import TfidfVectorizer
import jieba

def cut_word(text):
    """
    Chinese word segmentation: insert spaces between words, e.g.
    "I love Beijing Tiananmen" written without spaces --> "I love Beijing Tiananmen"
    :param text:
    :return: text
    """
    # Segment the Chinese string with jieba and join the words with spaces
    text = " ".join(list(jieba.cut(text)))

    return text

def text_chinese_tfidf_demo():
    """
    Feature extraction from Chinese text with TfidfVectorizer
    :return: None
    """
    data = ["One is that today is cruel, tomorrow is crueler, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.",
            "We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.",
            "If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know."]
    # Convert the raw data into segmented, space-separated form
    text_list = []
    for sent in data:
        text_list.append(cut_word(sent))
    print(text_list)

    # 1. Instantiate a transformer
    # transfer = CountVectorizer(sparse=False)
    transfer = TfidfVectorizer(stop_words=["one kind", "Can't", "Do not"])
    # 2. Call fit_transform
    data = transfer.fit_transform(text_list)
    print("Results of text feature extraction:\n", data.toarray())
    print("Return characteristic Name:\n", transfer.get_feature_names())

    return None

Return result:

Building prefix dict from the default dictionary ...
Loading model from cache /var/folders/mz/tzf2l3sx4rgg6qpglfb035_r0000gn/T/jieba.cache
Loading model cost 0.856 seconds.
Prefix dict has been built succesfully.
['One is that today is cruel, tomorrow is crueler, and the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today.', 'We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past.', 'If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know.']
Results of text feature extraction:
 [[ 0.          0.          0.          0.43643578  0.          0.          0.
   0.          0.          0.21821789  0.          0.21821789  0.          0.
   0.          0.          0.21821789  0.21821789  0.          0.43643578
   0.          0.21821789  0.          0.43643578  0.21821789  0.          0.
   0.          0.21821789  0.21821789  0.          0.          0.21821789
   0.        ]
 [ 0.2410822   0.          0.          0.          0.2410822   0.2410822
   0.2410822   0.          0.          0.          0.          0.          0.
   0.          0.2410822   0.55004769  0.          0.          0.          0.
   0.2410822   0.          0.          0.          0.          0.48216441
   0.          0.          0.          0.          0.          0.2410822
   0.          0.2410822 ]
 [ 0.          0.644003    0.48300225  0.          0.          0.          0.
   0.16100075  0.16100075  0.          0.16100075  0.          0.16100075
   0.16100075  0.          0.12244522  0.          0.          0.16100075
   0.          0.          0.          0.16100075  0.          0.          0.
   0.3220015   0.16100075  0.          0.          0.16100075  0.          0.
   0.        ]]
Return characteristic Name:
 ['before', 'understand', 'Thing', 'Today', 'Just in', 'Millions of years', 'Issue', 'Depending on', 'only need', 'The day after tomorrow', 'Meaning', 'Gross', 'How', 'If', 'universe', 'We', 'therefore', 'give up', 'mode', 'Tomorrow', 'Galaxy', 'Night', 'Certain sample', 'cruel', 'each', 'notice', 'real', 'Secret', 'Absolutely', 'fine', 'contact', 'Past times', 'still', 'such']
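
The values in this matrix do not follow the base-10 definition exactly. Assuming TfidfVectorizer's default settings (smooth_idf=True and norm='l2'), scikit-learn uses the raw count as tf, a smoothed natural-log idf of ln((1 + n) / (1 + df)) + 1, and then scales each row to unit length. The sketch below recomputes part of the second row under that assumption; the function name sklearn_style_tfidf is hypothetical, and the counts are read off the CountVectorizer output shown earlier.

```python
import math

def sklearn_style_tfidf(counts, doc_freq, n_docs):
    """TF-IDF for one document: count * smoothed idf, then L2-normalised."""
    weights = {}
    for word, count in counts.items():
        idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
        weights[word] = count * idf
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {word: w / norm for word, w in weights.items()}

# Word counts of the second document, taken from the count matrix
counts = {'before': 1, 'Just in': 1, 'Millions of years': 1, 'Issue': 1,
          'universe': 1, 'We': 3, 'Galaxy': 1, 'notice': 2,
          'Past times': 1, 'such': 1}
# Only 'We' also occurs in another document (the third one)
doc_freq = {word: 1 for word in counts}
doc_freq['We'] = 2

tfidf = sklearn_style_tfidf(counts, doc_freq, n_docs=3)
print(round(tfidf['We'], 4), round(tfidf['notice'], 4), round(tfidf['universe'], 4))
```

The three printed values should agree, up to rounding, with 0.55004769, 0.48216441 and 0.2410822 in the second row above. 'We' occurs three times but is weighted less per occurrence than 'notice', because it also appears in another document.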
6.3 Importance of TF-IDF

TF-IDF is a data-preprocessing step used in the early stage of classifying articles with machine learning classification algorithms.

Tags: Python Spark

Posted on Mon, 13 Jan 2020 02:18:22 -0500 by aquilla