Machine learning - overview, feature extraction of data (notes)

1. The relationship among artificial intelligence, machine learning and deep learning

What machine learning can do.

Recommended books for learning
Learning objectives

2. What is machine learning

Machine learning automatically analyzes data to obtain rules (models), and then uses those rules to make predictions on unknown data.



The rules are learned from historical data. But what format does this historical data take?

3. Structure of data set

1. Available data sets

2. Data set structure
The feature values are the area, location, floor and orientation of the house; the target value is the house price.
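
As a rough illustration of this structure, the sketch below stores a few house records and separates the feature values from the target value. The records and the field names (area, location, floor, orientation, price) are made-up values for illustration, not data from these notes.

```python
# Hypothetical house records: four feature values plus the target value (price).
houses = [
    {"area": 80, "location": "center", "floor": 9, "orientation": "south", "price": 300},
    {"area": 100, "location": "suburb", "floor": 3, "orientation": "north", "price": 260},
    {"area": 120, "location": "center", "floor": 15, "orientation": "east", "price": 450},
]

feature_names = ["area", "location", "floor", "orientation"]
target_name = "price"

# Each row of X holds the feature values of one sample; y holds the target values.
X = [[house[name] for name in feature_names] for house in houses]
y = [house[target_name] for house in houses]

print(X[0])  # feature values of the first sample
print(y)     # target values of all samples
```

Note that location and orientation are strings, which is why a feature extraction step is needed before a model can use them.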

Two packages are used for feature processing:
1. pandas: a convenient tool for reading data and doing basic processing;
2. sklearn: provides powerful interfaces for feature processing.

4. Feature engineering of data

1. What is feature engineering

Feature engineering is the process of transforming raw data into features that better represent the underlying problem for the predictive model, so as to improve prediction accuracy on unknown data.

2. Significance of feature engineering

It directly affects the prediction results.

3. Introduction to the scikit-learn library

5. Feature extraction of data

Converting data into feature values helps the computer better understand the data.
Most features in the real world are not continuous variables: categories, text, images and so on. To work with such non-continuous features we need to express them mathematically, which is what feature extraction does. sklearn.feature_extraction provides many feature extraction methods.

1. Dictionary feature extraction

We use city and temperature as dictionary data to extract features.

sklearn.feature_extraction.DictVectorizer(sparse = True)

Converts a list of mappings (dictionaries) to a NumPy array or a scipy.sparse matrix. The sparse parameter controls whether a scipy.sparse matrix is returned; it is on by default.

code:

from sklearn.feature_extraction import DictVectorizer
dic = DictVectorizer(sparse=True)  # sparse=True (the default) returns a scipy.sparse matrix
instances = [{'city': 'Beijing', 'temperature': 100}, {'city': 'Shanghai', 'temperature': 60}, {'city': 'Shenzhen', 'temperature': 30}]
X = dic.fit_transform(instances)
print('fit_transform result:')
print(X)
print('inverse_transform result:')
print(dic.inverse_transform(X))

result:
Because the DictVectorizer parameter sparse=True is set, X is a sparse matrix. inverse_transform converts X back to a list of dictionaries; the only difference from the input is that each categorical entry becomes a key such as city=Beijing with a numeric value.

If sparse is set to False:
result:
At this point, the fit_transform function returns a numpy array, that is, X is a numpy array. Comparing the two results shows how they relate: the sparse matrix stores only the non-zero elements of the numpy array together with their positions, such as 1 in row 0, column 0, 100 in row 0, column 3, and 30 in row 2, column 3 (we usually convert to a numpy array; the sparse form is used less often).

From this numpy array, we can see the principle of dictionary feature extraction: categorical (string) values are converted into numbers the machine can work with (one column per category, holding 0 or 1), while numeric values are left unchanged.
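
The principle can be sketched in plain Python. This is only an illustration of the idea under simplifying assumptions (string values become one 0/1 column per category, numbers are copied as they are, columns are sorted by name); it is not scikit-learn's actual implementation, and the helper name dict_vectorize is made up.

```python
def dict_vectorize(instances):
    """Toy version of dictionary feature extraction (hypothetical helper)."""
    # Build the column names: "key=value" for strings, plain key for numbers.
    names = set()
    for inst in instances:
        for key, value in inst.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    # Fill a dense matrix row by row.
    dense = [[0] * len(names) for _ in instances]
    for row, inst in enumerate(instances):
        for key, value in inst.items():
            if isinstance(value, str):
                dense[row][index[f"{key}={value}"]] = 1
            else:
                dense[row][index[key]] = value
    return names, dense


instances = [{'city': 'Beijing', 'temperature': 100},
             {'city': 'Shanghai', 'temperature': 60},
             {'city': 'Shenzhen', 'temperature': 30}]
names, dense = dict_vectorize(instances)
print(names)
for row in dense:
    print(row)

# The sparse view keeps only the non-zero entries and their (row, column) positions.
sparse = {(r, c): v for r, row in enumerate(dense) for c, v in enumerate(row) if v != 0}
print(sparse)
```

The last dictionary mirrors what the sparse output conveys: six non-zero values and where they sit in the dense array.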

2. Text feature extraction

Featurizing text data

Text feature extraction is used in many areas, such as document classification, spam classification and news classification. Text is represented either by whether words occur or by the weight (importance) of words.

(1) Occurrence of words in documents
A value of 1 indicates that the word from the vocabulary appears in the document, and a value of 0 indicates that it does not.

sklearn.feature_extraction.text.CountVectorizer()

Converts a collection of text documents to a count matrix (a scipy.sparse matrix).
code:

English text

from sklearn.feature_extraction.text import CountVectorizer
content = ["life is short,i like python","life is too long,i dislike python"]
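vectorizer = CountVectorizer()
res = vectorizer.fit_transform(content)
# scikit-learn >= 1.0; older versions use get_feature_names(), which was removed in 1.2
print(vectorizer.get_feature_names_out())
res = res.toarray()  # Convert the sparse matrix to a numpy array
print(res)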

result:

If you add another "is" to the first string in the list, so that it contains "is" twice, the result becomes:


You can see that the count for "is" in row 0 of the array changes from 1 to 2. Each row of the array holds the number of times each word occurs in the corresponding string (single-letter words are ignored). get_feature_names (get_feature_names_out in scikit-learn 1.0 and later) lists all the words that appear in content (again excluding single letters).
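
The counting behaviour described above can be imitated in a few lines of plain Python. The sketch assumes CountVectorizer's default settings, which lowercase the text and keep only tokens of two or more word characters; the helper name count_words is made up, and this is an illustration rather than the library's code.

```python
import re
from collections import Counter

def count_words(documents):
    """Toy count vectorizer (hypothetical helper): returns vocabulary and count rows."""
    # Tokens are runs of at least two word characters, so "i" is dropped.
    tokenized = [re.findall(r"\b\w\w+\b", doc.lower()) for doc in documents]
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[word] for word in vocabulary])
    return vocabulary, rows

content = ["life is short,i like python", "life is too long,i dislike python"]
vocabulary, rows = count_words(content)
print(vocabulary)
print(rows)

# Adding a second "is" to the first string raises that count from 1 to 2.
_, rows_twice = count_words(["life is is short,i like python", content[1]])
print(rows_twice[0])
```

Each column corresponds to one vocabulary word in sorted order, so the column for "is" is the second one.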

Chinese text
What happens when the strings in content are Chinese?
code:

from sklearn.feature_extraction.text import CountVectorizer
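# In the original post these strings are Chinese sentences written without spaces (shown here in translation)
content = ["Life is short, I like it python","Life is too long, I don't like it python"]
vectorizer = CountVectorizer()
res = vectorizer.fit_transform(content)
print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()
res = res.toarray()
print(res)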

result:

It can be seen that Chinese text is not split into words the way English text is, because English sentences have spaces between words. Now suppose the Chinese text is also separated by spaces:

from sklearn.feature_extraction.text import CountVectorizer
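# In the original post these are the same Chinese sentences with spaces inserted between words (shown here in translation)
content = ["Life is short, I like it python","Life is too long, I don't like it python"]
vectorizer = CountVectorizer()
res = vectorizer.fit_transform(content)
print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()
res = res.toarray()
print(res)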

result:

['like python', 'Too long', 'Very short', 'I don't', 'life']
[[1 0 1 0 1]
 [1 1 0 1 1]]

It can be seen that text separated by spaces is split correctly, and single Chinese characters are not counted.

To extract features from Chinese text, we therefore need to segment it into words first. For that we need a package: jieba.

import jieba
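a = jieba.cut("I'm a good programmer")  # double quotes, because the text contains an apostrophe
print(a)  # jieba.cut returns a generator, so this prints the generator object
for i in a:
    print(i)  # iterating yields the segmented words one by one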

result:

code

import jieba
def cut_word(str1, str2, str3):
    """Segment three strings into word lists with jieba."""
    # jieba.cut returns a generator, so convert each result to a list
    c1 = list(jieba.cut(str1))
    c2 = list(jieba.cut(str2))
    c3 = list(jieba.cut(str3))
    return c1, c2, c3


cont1 = "Today is cruel, tomorrow is more cruel, the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today."
cont2 = "We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past."
cont3 = "If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know."


c1,c2,c3 = cut_word(cont1,cont2,cont3)
print(c1)
print(c2)
print(c3)

result:

['today', 'very', 'cruel', ',', 'tomorrow', 'more', 'cruel', ',', 'the day after tomorrow', 'very', 'fine', ',', 'but', 'absolutely', 'gross', 'yes', 'die', 'stay', 'tomorrow', 'night', ',', 'therefore', 'each', 'people', 'No', 'give up', 'today', '. ']
['We', 'notice', 'Of', 'from', 'very', 'far', 'Galaxy', 'come', 'Of', 'It's just', 'Millions of years', 'before', 'issue', 'Of', ',', 'such', 'When', 'We', 'notice', 'universe', 'Time', ',', 'We', 'yes', 'stay', 'see', 'it', 'Of', 'past times', '. ']
['If', 'only need', 'one kind', 'mode', 'understand', 'Something', 'thing', ',', 'you', 'Just', 'can't', 'real', 'understand', 'it', '. ', 'understand', 'thing', 'real', 'meaning', 'Of', 'secret', 'Depending on', 'how', 'take', 'his', 'And', 'We', 'place', 'understand', 'Of', 'thing', 'mutually', 'contact', '. ']

join function in python:

c1,c2,c3 = cut_word(cont1,cont2,cont3)

# The Python join() method is used to connect elements in a sequence with the specified character to generate a new string.
s1 = '*'.join(c1) #Connect with *
s2 = ' '.join(c1) #Connect with spaces
print(s1)
print(s2)

result

Complete code for Chinese text feature extraction

import jieba
def cut_word(str1, str2, str3):
    """Segment three strings into word lists with jieba."""
    # jieba.cut returns a generator, so convert each result to a list
    c1 = list(jieba.cut(str1))
    c2 = list(jieba.cut(str2))
    c3 = list(jieba.cut(str3))
    return c1, c2, c3


cont1 = "Today is cruel, tomorrow is more cruel, the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today."
cont2 = "We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past."
cont3 = "If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know."


c1,c2,c3 = cut_word(cont1,cont2,cont3)

# The Python join() method is used to connect elements in a sequence with the specified character to generate a new string.
c1 = ' '.join(c1) #Connect with spaces
c2 = ' '.join(c2) #Connect with spaces
c3 = ' '.join(c3) #Connect with spaces

from sklearn.feature_extraction.text import CountVectorizer
content = [c1, c2, c3]
vectorizer = CountVectorizer()
res = vectorizer.fit_transform(content)
print(vectorizer.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()
res = res.toarray()
print(res)

result:

CountVectorizer counts the number of times each word appears. Used on its own for text classification it does not work well, because every article contains common neutral words, such as "we", "tomorrow", "now" and so on. These words appear frequently in every article, and it is obviously wrong to put two articles of different types into the same category just because they share these neutral words. For this reason CountVectorizer is used less often; TfidfVectorizer is more commonly used.

TF-IDF

tf (term frequency) is how often a word occurs in a document, counted the same way CountVectorizer does. idf (inverse document frequency) equals log(total number of documents / number of documents containing the word).
TF-IDF multiplies the two: the frequency of a word times log(total number of documents / number of documents in which the word appears). The result is the importance of the word to a document.
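
To make the formula concrete, here is a minimal sketch that computes it by hand. It takes tf as the raw count of a word in a document and uses the natural logarithm; the three short documents are made up and assumed to be already segmented into space-separated words, and the helper name tf_idf is hypothetical.

```python
import math
from collections import Counter

# Hypothetical, already-segmented documents (words separated by spaces).
docs = [
    "today cruel tomorrow cruel",
    "we see universe we see past",
    "we know thing know meaning",
]

tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents in which each word appears.
df = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(tokens):
    """TF-IDF as stated in the text: count * log(N / df)."""
    counts = Counter(tokens)
    return {word: count * math.log(n_docs / df[word]) for word, count in counts.items()}

scores = [tf_idf(tokens) for tokens in tokenized]
for s in scores:
    print(s)
```

Here "cruel" occurs twice in one document only and gets a high score, while "we" occurs in two of the three documents and is weighted down. scikit-learn's TfidfVectorizer follows the same idea, but by default it uses a smoothed idf, ln((1 + N) / (1 + df)) + 1, and scales each row to unit length, so its numbers differ from this plain formula.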
code

import jieba
def cut_word(str1, str2, str3):
    """Segment three strings into word lists with jieba."""
    # jieba.cut returns a generator, so convert each result to a list
    c1 = list(jieba.cut(str1))
    c2 = list(jieba.cut(str2))
    c3 = list(jieba.cut(str3))
    return c1, c2, c3


cont1 = "Today is cruel, tomorrow is more cruel, the day after tomorrow is beautiful, but most of them die tomorrow night, so everyone should not give up today."
cont2 = "We see light coming from distant galaxies millions of years ago, so when we see the universe, we are looking at its past."
cont3 = "If you only know something in one way, you won't really know it. The secret of knowing the true meaning of things depends on how we relate them to what we know."


c1,c2,c3 = cut_word(cont1,cont2,cont3)

# The Python join() method is used to connect elements in a sequence with the specified character to generate a new string.
c1 = ' '.join(c1) #Connect with spaces
c2 = ' '.join(c2) #Connect with spaces
c3 = ' '.join(c3) #Connect with spaces

from sklearn.feature_extraction.text import TfidfVectorizer
content = [c1, c2, c3]
tf = TfidfVectorizer()
res = tf.fit_transform(content)
print(tf.get_feature_names_out())  # scikit-learn >= 1.0; older versions use get_feature_names()
res = res.toarray()
print(res)

result:

['one kind', 'can't', 'No', 'before', 'understand', 'thing', 'today', 'It's just', 'Millions of years', 'issue', 'Depending on', 'only need', 'the day after tomorrow', 'meaning', 'gross', 'how', 'If', 'universe', 'We', 'therefore', 'give up', 'mode', 'tomorrow', 'Galaxy', 'night', 'Something', 'cruel', 'each', 'notice', 'real', 'secret', 'absolutely', 'fine', 'contact', 'past times', 'such']
[[0.         0.         0.21821789 0.         0.         0.
  0.43643578 0.         0.         0.         0.         0.
  0.21821789 0.         0.21821789 0.         0.         0.
  0.         0.21821789 0.21821789 0.         0.43643578 0.
  0.21821789 0.         0.43643578 0.21821789 0.         0.
  0.         0.21821789 0.21821789 0.         0.         0.        ]
 [0.         0.         0.         0.2410822  0.         0.
  0.         0.2410822  0.2410822  0.2410822  0.         0.
  0.         0.         0.         0.         0.         0.2410822
  0.55004769 0.         0.         0.         0.         0.2410822
  0.         0.         0.         0.         0.48216441 0.
  0.         0.         0.         0.         0.2410822  0.2410822 ]
 [0.15698297 0.15698297 0.         0.         0.62793188 0.47094891
  0.         0.         0.         0.         0.15698297 0.15698297
  0.         0.15698297 0.         0.15698297 0.15698297 0.
  0.1193896  0.         0.         0.15698297 0.         0.
  0.         0.15698297 0.         0.         0.         0.31396594
  0.15698297 0.         0.         0.15698297 0.         0.        ]]

Posted on Fri, 12 Jun 2020 00:23:35 -0400 by Goose