Article Directory
- Job List (4/20)
- Job List (4/22)
- csv, linear regression
- Job List (4/29, 5/4)
- Experimental Report
- 1. Univariate Regression--Predicting House Price by Area
- 2. Establish Multiple Regression Model-Boston House Price Forecast
- data set
- Third-party libraries used
- Read and process data
- View data
- View data fragmentation - plot box charts
- Data Set Split
- Establishing a multiple regression model
- test
- Graphic representation of results
- Analysis of Experimental Results
- Summary of knowledge points
- Overfitting and Underfitting
- Data cleaning
- Job List (5/11)
- Job List (5/13)
- KMeans Experimental Report (K=2)
- Experimental purpose
- Experimental steps
- 1. Data preparation
- 2. KMeans algorithm implementation
- 3. Set parameters, call functions, and get experimental results
- experimental result
- KMeans Experimental Report (K=3)
- Experimental purpose
- Experimental steps
- 1. Data preparation
- 2. KMeans algorithm implementation
- 3. Set parameters, call functions, and get results
- experimental result
- Standard KNN Classes for Sklearn
- Sklearn's standard K-means class
- Visualization of K-means--3D Scatter Chart
- K-means Visualization - Multiple Subgraphs
- Defects and Improvements of KMeans Algorithm
- Defects of KNN algorithm and improvement methods
- Job List (5/20)
- Job List (5/27)
[1] Have you finished installing the Python 3.X and Orange3 software?
Completed
[2] Have you completed the classroom experiment (given the teacher data, determine whether each teacher is tenured)?
Completed
data = [{'NAME': 'Mike', 'RANK': 'Assistant Prof', 'YEARS': 3, 'TENURED': None},
        {'NAME': 'Mary', 'RANK': 'Assistant Prof', 'YEARS': 7, 'TENURED': None},
        {'NAME': 'Bill', 'RANK': 'Professor', 'YEARS': 2, 'TENURED': None},
        {'NAME': 'Jim', 'RANK': 'Assistant Prof', 'YEARS': 7, 'TENURED': None},
        {'NAME': 'Dave', 'RANK': 'Assistant Prof', 'YEARS': 6, 'TENURED': None},
        {'NAME': 'Anne', 'RANK': 'Assistant Prof', 'YEARS': 3, 'TENURED': None}]

for record in data:
    if record['RANK'] == 'Professor' or record['YEARS'] > 6:
        record['TENURED'] = 'yes'
    else:
        record['TENURED'] = 'no'
    print(record['NAME'], record['TENURED'])
[3] Review the main functions of Numpy (follow PPT in class to complete related experiments)
-
Find the mean, median, and mode, and sort arrays
>>> from numpy import mean, median, sort
>>> from scipy.stats import mode
>>> a = [1, 2, 3, 4, 5, 5, 6]
>>> print(mean(a))
3.7142857142857144
>>> print(median(a))
4.0
>>> print(mode(a)[0])
[5]
>>> print(sort(a))
[1 2 3 4 5 5 6]
-
Create Matrix
>>> import numpy as np
>>> b = np.arange(3, 15)
>>> b
array([ 3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
>>> b.reshape(3, 4)
array([[ 3,  4,  5,  6],
       [ 7,  8,  9, 10],
       [11, 12, 13, 14]])
-
Matrix sorting
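The original gives no example for matrix sorting; a minimal sketch using numpy.sort on a small assumed 2x3 array might look like this:
>>> import numpy as np
>>> m = np.array([[3, 1, 2], [0, 7, 8]])
>>> np.sort(m)          # sorts along the last axis, i.e. each row
array([[1, 2, 3],
       [0, 7, 8]])
>>> np.sort(m, axis=0)  # sorts each column
array([[0, 1, 2],
       [3, 7, 8]])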
-
matrix multiplication
>>> import numpy as np
>>> b = np.arange(12, 24).reshape([4, 3])
>>> a = np.arange(0, 12).reshape([3, 4])
>>> p = np.dot(a, b)
>>> p
array([[114, 120, 126],
       [378, 400, 422],
       [642, 680, 718]])
[4] Complete topics 1 and 4 (transcription) of Chapter 1.9 of the reference textbook
Problem:
1.1 What is data mining? In your answer, address the following:
(a) Is it another hype?
(b) Is it a simple transformation or application of technology developed from databases, statistics, machine learning, and pattern recognition?
(c) We have presented a view that data mining is the result of the evolution of database technology. Do you think that data mining is also the result of the evolution of machine learning research? Can you present such views based on the historical progress of this discipline? Address the same for the fields of statistics and pattern recognition.
(d) Describe the steps involved in data mining when viewed as a process of knowledge discovery.
Answer:
(a) No. Data mining is the process of extracting potentially useful information and knowledge hidden in large, incomplete, noisy, fuzzy, and random data drawn from practical applications.
(b) No
Data mining began in the second half of the 20th century and developed on the basis of several disciplines of that time. With the development and application of database technology, the accumulation of data kept expanding, so simple queries and statistics could no longer meet the business needs of enterprises, and revolutionary techniques were urgently needed to mine the information behind the data. At the same time, artificial intelligence made great progress and entered the stage of machine learning. People therefore combined the two: storing data with database management systems, analyzing it with machine learning, and trying to dig out the information behind the data.
Pattern recognition methods are also increasingly used in data mining.
Later, people found that much of the work in data mining can be done by statistical methods, and concluded that the best strategy is to combine statistical methods with data mining.
Data mining is influenced by many disciplines, among which databases, machine learning, and statistics undoubtedly have the greatest impact. In short, databases provide data management techniques, while machine learning and statistics provide data analysis techniques.
Because statistics is often preoccupied with theoretical elegance and neglects practical utility, many techniques from the statistics community are usually studied further by the machine learning community, and only enter the data mining field after they become effective machine learning algorithms. In this sense, statistics exerts its influence on data mining mainly through machine learning, while machine learning and databases are the two main supporting technologies of data mining. From the point of view of data analysis, most data mining techniques come from the field of machine learning, but machine learning research often does not take very large amounts of data as its processing object, so data mining must adapt these algorithms so that their time and space performance reaches a practical level. At the same time, data mining has its own unique content, namely association analysis.
Problem:
Present an example where data mining is crucial to the success of a business. What data mining functionalities does this business need (e.g., think of the kinds of patterns that could be mined)? Can such patterns be generated alternatively by data query processing or simple statistical analysis?
Answer:
China Petroleum and Gas Group Limited Company uses "Magic Mirror" for data mining.
Firstly, Magic Mirror analyzed PetroChina's supply chain data, making it easier to monitor supply time, inventory, cost, logistics, where orders are placed, the amounts involved, and so on.
Secondly, Magic Mirror analyzed the sales situation, covering orders, turnover, trading areas, product types sold, sales forecasts, target attainment rates, costs, and so on.
Thirdly, Magic Mirror analyzed PetroChina's logistics, including the number of deliveries, delivery days, order completion time, undelivered orders, and so on, to understand purchasing in different regions and time periods.
Fourth, Magic Mirror analyzed PetroChina's customers. It uses radar charts to measure the overall quality of users across dimensions such as purchase cycle, average purchase volume, loyalty, average purchase sales, and average purchase profit: whether a customer's longest, shortest, and average purchase cycles indicate a stable customer or a volatile agent or an individual's occasional purchases, and which regions' customers purchase most steadily. It can compare different users side by side, analyze customer value comprehensively, and directly find the weak points. Finally, it can anticipate these users' next purchase time, send alerts, and support timely decisions.
Fifth, Magic Mirror analyzed production lines, product types, product status, inventory, and product quality and quantity. From its visual charts, you can see at a glance each product's share of overall inventory and of each order, how much of each product is consumed in each process, and how many products are purchased in each area, and finally turn these observations into conclusions about which products sell well and which do not.
Sixth, Magic Mirror enables accurate analysis of PetroChina's sales, costs, budgets, profits, balances, and other financial aspects. It shows visually how funds flow through each part of the business, where they come from and where they go, and it also verifies that debits and credits balance.
Finally, it analyzes job performance, organizational structure, employee satisfaction, and so on. From the charts we can see PetroChina's global organizational structure, each employee's KPI sales performance and contribution level, and the distribution of organizations and personnel at different levels, distinguished by color.
These are difficult to do with data query processing and simple data analysis.
Job List (4/22)
[1] Have you finished installing the Python 3.X and Orange3 software?
Completed
[2] Complete common probability distribution codes, as shown below
- Normal Distribution
import numpy as np
import matplotlib.pyplot as plt
import math

# mean value
u = 0
# standard deviation
sig = math.sqrt(0.2)

x = np.linspace(u - 3 * sig, u + 3 * sig, 50)
# probability density of the normal distribution
y_sig = np.exp(-(x - u) ** 2 / (2 * sig ** 2)) / (math.sqrt(2 * math.pi) * sig)
plt.plot(x, y_sig, "r-", linewidth=2)
plt.grid(True)
plt.show()
Normal distribution probability density function curve
- Two Point Distribution/Bernoulli Distribution
import numpy as np
import matplotlib.pyplot as plt

p = 0.7
x = [0, 1]
y = [1 - p, p]
plt.scatter(x, y)  # Scatter plot
plt.grid(True)
plt.show()
- Binomial Distribution
The binomial distribution arises from n repeated Bernoulli trials; let ξ denote the number of times the event occurs. If the probability that the event occurs in one trial is p (so the probability that it does not occur is q = 1 - p), then the probability of exactly k occurrences in n independent repeated trials is
P(ξ = k) = C(n, k) * p^k * (1 - p)^(n - k), where C(n, k) = n! / (k! (n - k)!).
This is the binomial distribution; p is called the probability of success, written ξ ~ B(n, p).
Expectation: E(ξ) = np
Variance: D(ξ) = npq, where q = 1 - p
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

p = 0.4
n = 10
x = np.linspace(0, n, n + 1)
y = comb(n, x) * p ** x * (1 - p) ** (n - x)
print(x)
print(y)
plt.scatter(x, y)
plt.grid(True)
plt.show()
- Geometric Distribution
import numpy as np
import matplotlib.pyplot as plt

p = 0.4
n = 10
x = np.linspace(1, n, n)
y = p * (1 - p) ** (x - 1)
print(x)
print(y)
plt.scatter(x, y)
plt.grid(True)
plt.show()
- Poisson distribution
import numpy as np
import matplotlib.pyplot as plt

x = np.random.poisson(lam=5, size=10000)
pillar = 15
a = plt.hist(x, pillar, color='g')
plt.plot(a[1][0:pillar], a[0], 'r')
plt.grid()
plt.show()
- Uniform Distribution
import numpy as np
import matplotlib.pyplot as plt

a = 3
b = 5
x = np.linspace(a, b, 50)
y = []
for i in range(0, 50):
    y.append(1 / (b - a))
plt.plot(x, y, "r-", linewidth=2)
plt.grid(True)
plt.show()
- Exponential Distribution
import numpy as np
import matplotlib.pyplot as plt

x = np.random.exponential(scale=100, size=10000)
pillar = 25
a = plt.hist(x, pillar, color='g')
plt.plot(a[1][0:pillar], a[0], 'r')
plt.grid()
plt.show()
[3] Python code to perform maximum likelihood estimation (MLE)
Estimate the mean and variance of a normal distribution using the maximum likelihood estimation method
import numpy as np

mu = 30     # mean of distribution
sigma = 2   # standard deviation of distribution
x = mu + sigma * np.random.randn(10000)

def mle(x):
    u = np.mean(x)
    return u, np.sqrt(np.dot(x - u, (x - u).T) / x.shape[0])

print(mle(x))
[4] DM Lab2 experimental results using orange3 mining (optional)
csv, linear regression
[1] Familiarize yourself with opening, reading, and writing CSV files.
The csv file is a text file where values in columns are separated by commas
- Write data to csv file
Write data to the DataFrame using the pandas package and store the DataFrame as a csv file
import pandas as pd

# Several lists
no = [1, 2, 3, 4, 5, 6, 7]
square_feet = [150, 200, 250, 300, 350, 400, 400]
price = [6450, 7450, 8450, 9450, 11450, 15450, 18450]

# The keys in the dictionary become the column names in the csv
data = pd.DataFrame({'No': no, 'square_feet': square_feet, 'price': price})

# Store the DataFrame as csv; index indicates whether the row index is written, default True
data.to_csv("data/housing price.csv", index=False, sep=',')
- Read data from csv file
pandas library provides read_csv
import pandas as pd

data = pd.read_csv('data/housing price.csv')
print(data)
[2] Complete linear regression using Orange3 and Python orange methods
import Orange
import matplotlib.pyplot as plt

data = Orange.data.Table(r"F:\Big Three\data mining\DM_Lab\housing price")
out_learner = Orange.regression.LinearRegressionLearner()
model = out_learner(data)
[3] Think about the relationship between maximum likelihood estimation MLE and least squares?
The least squares method takes the sum of squared differences between the estimated and observed values as its loss function, while maximum likelihood estimation takes the likelihood function of the target values as its objective function and treats linear regression from the viewpoint of probability and statistics. Under the assumption that the likelihood is Gaussian, maximizing the likelihood leads to the same solution as least squares.
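As a sketch of this connection (assuming the linear model y_i = w^T x_i + ε_i with i.i.d. Gaussian noise ε_i ~ N(0, σ²), which is not stated explicitly above), the log-likelihood is

\ln L(w) = \sum_{i=1}^{n} \ln \frac{1}{\sqrt{2\pi}\,\sigma} \exp\!\left( -\frac{(y_i - w^T x_i)^2}{2\sigma^2} \right) = -\frac{n}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - w^T x_i)^2

so maximizing the likelihood over w is equivalent to minimizing the least-squares loss \sum_{i=1}^{n} (y_i - w^T x_i)^2.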
[4] Based on the scatterplot of DM Lab3 data, a univariate regression line is drawn.
import numpy
from pandas import read_csv
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Import data
data = read_csv('test1.csv')
u = data.corr()  # Correlation coefficients
print(u)

lrModel = LinearRegression()
# The column names contain spaces, so use bracket indexing instead of attribute access
x = data['Activity promotion fee'].values.reshape(-1, 1)
lrModel.fit(x, data['Sales volume'])
print(lrModel.predict([[75]]))

alpha = lrModel.intercept_
beta = lrModel.coef_
new_r = alpha + beta * numpy.array([75])

plt.scatter(data['Activity promotion fee'], data['Sales volume'])
plt.plot(data['Activity promotion fee'], lrModel.predict(x))
plt.show()
[5] According to the DM Lab3 experiment process, the height of the mother is 167cm, what is the predicted height of the child?
import numpy as np
from sklearn.linear_model import LinearRegression

x = [154, 157, 158, 159, 160, 161, 162, 163]
y = [155, 156, 159, 162, 161, 164, 165, 166]
model = LinearRegression()
x = np.array(x).reshape(-1, 1)
print(x)
model.fit(x, y)
print(model.predict([[167]]))
The predicted height of the child is 171.4 cm.
Job List (4/29, 5/4)
[1] Build a linear regression model based on the following dataset (the data table is stored in csv format).
(1) Predicted house prices for 1,000 square feet.
Requirements: complete this twice. The first time you may refer to class notes, online materials, and so on; the second time, complete the regression model by writing the code independently within 15 minutes, without any reference material.
(2) Establish a multivariate regression model. Add at least two more features that affect the house price, such as location and age of the house.
(3) Organize (1) and (2) into an experimental report. The report will be checked in class on May 6.
Experimental Report
1. Univariate Regression--Predicting House Price by Area
- Dataset: csv format
No,square_feet,price
1,150,6450
2,200,7450
3,250,8450
4,300,9450
5,350,11450
6,400,15450
7,500,18450
- Read data from csv file
import pandas as pd

data = pd.read_csv('data/housing price.csv')
- Establishing Linear Regression Model
from sklearn.linear_model import LinearRegression

model = LinearRegression()
x = data['square_feet'].values.reshape(-1, 1)
y = data['price']
model.fit(x, y)
- Predicted house price for 1,000 square feet
print(model.predict([[1000]]))
- Drawing
import matplotlib.pyplot as plt

plt.scatter(data['square_feet'], y, color='blue')
plt.plot(x, model.predict(x), 'r-')
plt.grid(True)
plt.show()
- experimental result
- The predicted price of a 1,000-square-foot house is 35839.34.
- From the pictures we get, we can see that the model fits well.
2. Establish Multiple Regression Model-Boston House Price Forecast
data set

| Attribute | Meaning |
| --- | --- |
| CRIM | Per capita crime rate in cities and towns |
| ZN | Proportion of residential land zoned for lots over 25,000 sq. ft. |
| INDUS | Proportion of non-retail land in urban areas |
| CHAS | Charles River dummy variable (1 if the boundary is the river; otherwise 0) |
| NOX | Nitric oxide concentration |
| RM | Average number of residential rooms |
| DIS | Weighted distance to the five central areas of Boston |
| RAD | Accessibility index of radial highways |
| TAX | Full-value property tax rate per $10,000 |
| PTRATIO | Ratio of teachers to students in cities and towns |
| B | 1000(Bk - 0.63)^2, where Bk refers to the proportion of Black residents in cities and towns |
| LSTAT | Proportion of the lower-status population |
| MEDV | Median home price, in thousands of dollars |

Third-party libraries used

- pandas: csv file reading, data handling
- sklearn.linear_model.LinearRegression: Linear Regression Model
- sklearn.model_selection.train_test_split: splits the dataset into training and test sets
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Read and process data
- Read data
data = pd.read_csv('data/housing.csv')
- Ignoring the per capita crime rate column gives the model a higher score
# Do not use the data from the first column
new_data = data.iloc[:, 1:]

View data
- View processed datasets
# Get the dataset and view it
print('head:', new_data.head(), '\nShape:', new_data.shape)
- Check for missing values
# Missing value test
print(new_data[new_data.isnull() == True].count())

View data fragmentation - plot box charts
- Output row count, mean, standard deviation std, minimum min, maximum max, upper quartile 75%, median 50%, lower quartile 25%
print(new_data.describe())
- Box plotting code
new_data.boxplot()
plt.show()

Data Set Split
Split the original data into a test set and a training set in a 2:8 ratio
X_train, X_test, Y_train, Y_test = train_test_split(new_data.iloc[:, :13], new_data.MEDV, train_size=.80)

Establishing a multiple regression model
Modeling from training sets
model = LinearRegression()
model.fit(X_train, Y_train)
a = model.intercept_
b = model.coef_
print("Best Fit Line:intercept", a, ",Regression coefficient:", b)
score = model.score(X_test, Y_test)
print(score)

Output:

Raw Data Features: (506, 13), Training Data Features: (404, 13), Test Data Features: (102, 13)
Raw Data Label: (506,), Training Data Label: (404,), Test Data Label: (102,)
Best fit line: intercept 0.0, regression coefficient: [-1.74325842e-16  1.11629233e-16 -1.79794258e-15  7.04652389e-15 -2.92277767e-15  2.97853711e-17 -8.23334194e-16  1.17159575e-16  1.88696229e-17 -3.41643920e-16 -1.28401929e-17 -5.78208730e-17  1.00000000e+00]
1.0

test
Y_pred = model.predict(X_test)
print(Y_pred)
plt.plot(range(len(Y_pred)), Y_pred, 'b', label="predict")
plt.show()

Graphic representation of results
Draw polylines for actual and predicted values, respectively
X_train, X_test, Y_train, Y_test = train_test_split(new_data.iloc[:, :13], new_data.MEDV, train_size=.80)
plt.figure()
plt.plot(range(len(X_test)), Y_pred, 'b', label="predict")
plt.plot(range(len(X_test)), Y_test, 'r', label="test")
plt.legend(loc="upper right")
plt.xlabel("the number of MEDV")
plt.ylabel('value of MEDV')
plt.show()

Analysis of Experimental Results
- The polyline drawn from the predicted value roughly matches the trend of the polyline drawn from the actual value. It can be seen that the training model works well, and this can be confirmed by the score of 1.0 given by sklearn's score.
Summary of knowledge points
-
regression analysis
The purpose of regression analysis is to examine the quantitative relationships among variables and to solve the following problems:
(1) Determine the mathematical relationship between variables by using a set of sample data;
(2) Perform various statistical tests on the reliability of these relationships to find out which variables are significant and which are not.
(3) Estimate the value of another variable based on the value of one or more variables by using the relationship, and give the reliability of the estimation.
-
Univariate linear regression
Regression involving only one independent variable is called univariate regression, and the equation describing the relationship between two variables with linear relationship is called regression model, which can be expressed as:
y = ax + b

The above formula, called the theoretical regression model, assumes the following:
(1) there is a linear relationship between y and x;
(2) x is nonrandom and the value of x is fixed in repeated sampling. The estimated regression equation of univariate linear regression is as follows:
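The estimated equation is not reproduced in the original text; using the y = ax + b notation above, the standard least-squares estimates (standard textbook formulas, added here for completeness) are:

\hat{y} = \hat{a} x + \hat{b}, \qquad \hat{a} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{b} = \bar{y} - \hat{a}\bar{x}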
-
multiple regression analysis
Quantitatively characterizing the linear dependence between a dependent variable and multiple independent variables using regression equations is called multivariate regression analysis, or multivariate regression for short.
-
DataFrame.iloc Common usage
Get property name, first row of data, data type
print(data.iloc[0])
No                1
square_feet     150
loaction          4
built            10
price          6450
Name: 0, dtype: int64
Get property name, second row of data, data type
print(data.iloc[1])
No                2
square_feet     200
loaction          5
built             9
price          7450
Name: 1, dtype: int64
Get all the data
Method One
print(data.iloc[:])
Method 2
print(data.iloc[0:])
Method Three
print(data.iloc[:, :])
   No  square_feet  loaction  built  price
0   1          150         4     10   6450
1   2          200         5      9   7450
2   3          250         3      7   8450
3   4          300         3      4   9450
4   5          350         4      3  11450
5   6          400         2      4  15450
6   7          400         1      2  18450
Get the data starting with the second row
print(data.iloc[1:])
   No  square_feet  loaction  built  price
1   2          200         5      9   7450
2   3          250         3      7   8450
3   4          300         3      4   9450
4   5          350         4      3  11450
5   6          400         2      4  15450
6   7          400         1      2  18450
Get data for rows 3-n and columns 4-m (assuming there are n rows and m columns)
print(data.iloc[2:, 3:])
   built  price
2      7   8450
3      4   9450
4      3  11450
5      4  15450
6      2  18450
-
A box plot is a statistical graph used to display the dispersion of a set of data.
-
Box plots are drawn by first finding the upper edge, lower edge, median, and two quartiles of a set of data, then joining the two quartiles to draw the box, and then joining the upper and lower edges to the box with the median in the middle.
Overfitting and Underfitting
[2] Using figures (a) and (b) below, explain what overfitting and underfitting are. What are the usual ways to solve these two problems?
-
Overfit
Overfitting refers to the phenomenon that a model chosen during learning contains so many parameters that it predicts well for known data and poorly for unknown data.
-
Reasons for Overfit
- Modeling sample selection errors, such as too few samples, wrong sampling methods, wrong sample labels, etc., result in the selected sample data not representing the intended classification rules
- Sample noise interferes so much that the machine considers part of the noise to be a feature that disrupts the preset classification rules
- The assumed model does not hold, or the conditions under which the assumptions are true are not satisfied
- Too many parameters, too complex model
- For decision tree models, if we place no reasonable restrictions on their growth, they may grow freely until nodes contain only event data or only non-event data, which makes them fit the training data perfectly but unable to adapt to other datasets.
- For neural network models:
a) The classification decision surface for the sample data may not be unique; as learning progresses, the BP algorithm may make the weights converge to an overly complex decision surface.
b) Training with so many iterations that the model fits the noise in the training data and the unrepresentative features of the training samples.
-
Overfit Solution
-
regularization
Regularization adds a regularization term to the objective (cost) function being optimized. The common choices are L1 regularization and L2 regularization.
-
L1 regularization:
L1 regularization adds the L1 norm of the parameter vector to the original loss function; the coefficient of the regularization term balances the original loss against the regularization term.
-
L2 regularization
L2 regularization adds the L2 norm of the parameter vector to the original loss function.
Adding a regularization term to the loss function follows Occam's razor principle: among all candidate models, a simple model that explains the known data well is the best.
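As a sketch, writing the original loss as L(w) and the balancing coefficient as λ (symbols assumed here, not taken from the text), the two regularized objectives are:

J_{L1}(w) = L(w) + \lambda \|w\|_1 = L(w) + \lambda \sum_{j} |w_j|
J_{L2}(w) = L(w) + \lambda \|w\|_2^2 = L(w) + \lambda \sum_{j} w_j^2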
-
-
Cross-validation
If the given sample data is sufficient, an easy way to choose a model is to randomly divide the dataset into three parts: the training set, the validation set, and the test set.
The training set is used to train the model, the validation set is used to select the model, and the test set is used to evaluate the method.
Choose the model with the least prediction error for the validation set when learning about models with different complexity.
Cross-validation has the following methods:
1. Simple cross-validation
Randomly divide the given data into two parts, one as the training set and one as the test set, then use the training set to train models under the various candidate settings, evaluate the test error of each model on the test set, and select the model with the smallest test error.

2. S-fold cross-validation

First randomly divide the given data into S disjoint subsets of the same size, then use the data of S-1 subsets to train the model and the remaining subset to test it, repeat this process for the S possible choices, and finally select the model with the smallest average test error over the S evaluations.
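A minimal sketch of S-fold cross-validation (S = 5) using sklearn's cross_val_score; the model and the toy data below are illustrative assumptions, not part of the course material.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X = np.random.rand(100, 3)                                 # toy feature matrix
y = X @ np.array([1.0, 2.0, 3.0])                          # toy target with a linear relation
scores = cross_val_score(LinearRegression(), X, y, cv=5)   # 5-fold cross-validation
print(scores, scores.mean())                               # per-fold scores and their average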
3. Leave-one-out cross-validation

A special case of S-fold cross-validation where S = N (N is the number of training samples).
-
Early stopping
The process of training a model is a process of learning and updating the parameters of the model. This process of learning parameters often uses some iterative methods, such as Gradient descent learning algorithm.
Early stopping is a method of truncating the number of iterations (epochs) to prevent over-fitting by stopping the iteration before the model converges iteratively over the training dataset.
This is done by computing the accuracy on the validation set at the end of each epoch (one pass through all the training data) and stopping training when the accuracy no longer improves.
During training, record the best validation accuracy seen so far; when the accuracy has not improved for 10 (or more) consecutive epochs, you can assume it will no longer improve and stop training at that point. This strategy is also called "no-improvement-in-n", where n is the number of epochs and can be chosen according to the situation, e.g. 10, 20, 30...
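A minimal sketch of the "no-improvement-in-n" rule described above; train_one_epoch and evaluate are hypothetical placeholders rather than a real API.

def train_with_early_stopping(model, train_data, val_data, max_epochs=200, patience=10):
    best_acc, best_model, no_improve = 0.0, None, 0
    for epoch in range(max_epochs):
        train_one_epoch(model, train_data)   # hypothetical: one pass over the training data
        acc = evaluate(model, val_data)      # hypothetical: validation accuracy after this epoch
        if acc > best_acc:
            best_acc, best_model, no_improve = acc, model, 0
        else:
            no_improve += 1
        if no_improve >= patience:           # no improvement in n consecutive epochs: stop
            break
    return best_model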
-
Data Set Extension
A popular phrase in the field of data mining is, "Sometimes more data is better than a good model."
This works when the training data and future data are independently and identically distributed.
There are generally the following methods:
- Collect more data from data sources
- Copy the original data and add random noise
- Resampling
- Estimate data distribution parameters based on the current dataset, use this distribution to generate more data, and so on
-
Dropout
The Dropout method prevents a neural network (ANN) from overfitting by randomly dropping some of the hidden-layer neurons during training.
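A minimal numpy sketch of (inverted) dropout applied to a layer's activations, added for illustration; the activation array and the drop probability are assumptions.

import numpy as np

def dropout(activations, p=0.5, training=True):
    # Randomly zero out a fraction p of the activations during training and
    # rescale the rest so the expected value stays the same; do nothing at test time.
    if not training:
        return activations
    mask = (np.random.rand(*activations.shape) >= p) / (1.0 - p)
    return activations * mask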
-
-
Underfitting
Underfitting often occurs when the model's learning ability is weak relative to the complexity of the data: the model cannot learn the general patterns in the dataset, which leads to weak generalization ability.
-
Reasons for underfitting
- Model complexity is too low
- Too few features
-
Common Solutions
- Add new features, consider incorporating feature combinations, higher order features, to increase the hypothesis space
- Add polynomial features, which is common in machine learning algorithms; for example, adding quadratic or cubic terms to a linear model improves its ability to generalize (see the sketch after this list)
- Reduce the regularization coefficient: the purpose of regularization is to prevent overfitting, but when the model underfits, the regularization coefficient needs to be reduced
- Use non-linear models, such as kernel SVM, decision trees, deep learning, and so on
- Adjusting the capacity of a model, which, in general, refers to its ability to fit various functions
- Models with low capacity may have difficulty fitting the training set; ensemble learning methods such as Bagging can be used to combine multiple weak learners
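A minimal sketch of the polynomial-feature idea from the list above, using sklearn on assumed toy one-dimensional data (the cubic trend is only an illustration):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

x = np.linspace(-3, 3, 50).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 + np.random.normal(scale=1.0, size=50)    # cubic trend plus noise

linear = LinearRegression().fit(x, y)                               # a plain line underfits
cubic = make_pipeline(PolynomialFeatures(degree=3), LinearRegression()).fit(x, y)
print(linear.score(x, y), cubic.score(x, y))                        # the cubic model scores higher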
-
Reference Article
Overfit
Reasons and solutions for underfitting and overfitting
[3] Predict the age of abalones. Download the abalone dataset online (see WeChat group, abalone dataset.csv), build a linear regression model, point out the problems of using a simple linear regression model for prediction, and think about how to solve them.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read abalone dataset (download link: https://aistudio.baidu.com/aistudio/datasetdetail/361)
data = pd.read_csv('Abalone Age.csv')

# Clean up data
new_data = data.iloc[:, 1:]

# Get the dataset and view it
print('head:', new_data.head(), '\nShape:', new_data.shape)
print(new_data.describe())

# Missing value test
print(new_data[new_data.isnull() == True].count())

new_data.boxplot()
plt.show()

print(data.corr())
print(new_data.corr())

X_train, X_test, Y_train, Y_test = train_test_split(new_data.iloc[:, :8], new_data.Age, train_size=.80)
print("Raw data characteristics:", new_data.iloc[:, :8].shape,
      ",Training data characteristics:", X_train.shape,
      ",Test data characteristics:", X_test.shape)
print("Raw Data Label:", new_data.Age.shape,
      ",Training Data Label:", Y_train.shape,
      ",Test Data Label:", Y_test.shape)

model = LinearRegression()
model.fit(X_train, Y_train)
a = model.intercept_
b = model.coef_
print("Best Fit Line:intercept", a, ",Regression coefficient:", b)

# Model score
score = model.score(X_test, Y_test)
print(score)

Y_pred = model.predict(X_test)
print(Y_pred)
plt.plot(range(len(Y_pred)), Y_pred, 'b', label="predict")
plt.show()

X_train, X_test, Y_train, Y_test = train_test_split(new_data.iloc[:, :8], new_data.Age, train_size=.80)
plt.figure()
plt.plot(range(len(Y_pred)), Y_pred, 'b', label="predict")
plt.plot(range(len(X_test)), Y_test, 'r', label="test")
plt.legend(loc="upper right")
plt.xlabel("the number of age")
plt.ylabel('value of age')
plt.show()
-
problem
(1) With a small amount of data, linear regression is prone to overfitting.
(2) It is difficult to model data that is non-linear or whose features are correlated with each other.
(3) It is difficult to represent highly complex data well.
-
Solutions
- Ridge and Lasso regression can alleviate the overfitting problem to some extent.
[4] Python implements a new coronavirus transmission model and prediction (optional)
Data cleaning
[1] Is data cleaning an important step in data mining model building? Generally, what are the cleaning methods?
-
Data cleaning is an important step in building a data mining model. It is the final procedure for discovering and correcting identifiable errors in data files, including checking data consistency and handling invalid values and missing values.
-
Data cleaning methods:
Data cleaning is usually application-specific, so it is difficult to give unified methods and steps, but corresponding cleaning methods can be given for different kinds of dirty data.
1. Solutions for incomplete data (i.e., missing values)
In most cases, missing values must be filled in manually (that is, cleaned up by hand). Of course, some missing values can be inferred from the same or other data sources and can be replaced by the mean, maximum, minimum, or a more complex probability estimate for the purpose of cleaning.
2. Detection and Resolution of Error Value
Use statistical analysis to identify possible errors or outliers, for example deviation analysis to identify values that do not conform to the distribution or to a regression equation; check data values against a simple rule base (common-sense rules, business-specific rules, and so on); or use constraints between attributes and external data to detect and clean the data.
3. Detection and Elimination of Duplicate Records
Records with identical attribute values in the database are considered duplicates. Duplicate records are detected by checking whether their attribute values are equal, and equal records are merged into one record (merge/purge). Merge/purge is the basic method of de-duplication.
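A minimal pandas sketch of the merge/purge idea above; the toy DataFrame is an assumption, not one of the course datasets.

import pandas as pd

df = pd.DataFrame({'NAME': ['Mike', 'Mike', 'Mary'],
                   'RANK': ['Assistant Prof', 'Assistant Prof', 'Professor']})
deduplicated = df.drop_duplicates()   # keep only one record for each group of identical rows
print(deduplicated)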
4. Detection and resolution of inconsistencies (within and between data sources)
Data integrated from multiple data sources may have semantic conflicts, integrity constraints can be defined to detect inconsistencies, and connections can be discovered by analyzing the data to make the data consistent.
[2] Clean the data by deletion and by filling in missing values.
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

data = pd.read_csv('Iris.csv')
data1 = data.dropna()                # Delete rows with missing data
data2 = data.dropna(axis=1)          # Delete columns with missing data
print(data.fillna(method='pad'))     # Fill with the previous value in the same column
print(data.fillna(0))                # Fill with 0
print(data.fillna(data.mean()))      # Fill with the column mean
print(data.fillna(data.median()))    # Fill with the column median
print(data.fillna(data.mode()))      # Fill with the column mode
[3] Use a regression method to fill in the missing data from question [2].
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer

data = pd.read_csv('Iris.csv')
new_data = data.iloc[:, 0:4]
model = IterativeImputer(max_iter=10, random_state=0)
model.fit(new_data)
print(model.transform(new_data))

Job List (5/11)
[1] What is a Pandas Series, and what is a DataFrame in Pandas? How do I convert NumPy data into DataFrame format? How do I convert Series data to DataFrame format? How do I convert a DataFrame to a NumPy array? How do I sort a DataFrame? What is data aggregation? (Note: give an example for each question.)
What is Pandas?
Pandas is a powerful toolset for analyzing structured data; it is based on Numpy, which provides high-performance matrix operations; used for data mining and analysis; and provides data cleaning capabilities.
- Import
import pandas as pd
- Pandas Sort Method
- sort_index(): sort by index
- sort_values(): sort by actual value
- Default ascending order, descending when parameter ascending=False
[One-dimensional data: Pandas Series]
It is a one-dimensional array-like object consisting of a set of data (various NumPy data types) and a set of associated data labels (i.e. indexes).Simple Series objects can also be generated from a single set of data.
>>> import pandas as pd
>>> import numpy as np
>>> b = pd.Series(np.arange(5))
>>> b
0    0
1    1
2    2
3    3
4    4
dtype: int32
>>> s = pd.Series(np.arange(5), index=['a','b','c','d','e'])
>>> print(s)
a    0
b    1
c    2
d    3
e    4
dtype: int32
- Creating Series objects from a dictionary
>>> dict = {'apple': 45, 'orange': 23, 'peach': 12}
>>> s = pd.Series(dict)
>>> s
apple     45
orange    23
peach     12
dtype: int64
- Value
>>> a = pd.Series([1, 2, 3])
>>> a.values
array([1, 2, 3], dtype=int64)
>>> print(a.values)
[1 2 3]
- Value by index
>>> a[2]
3
- Sort by index (ascending)
>>> s = pd.Series([56, 34, 45, 67, 12, 67, 84], index=list('gbcadfe'))
>>> s.sort_index()
a    67
b    34
c    45
d    12
e    84
f    67
g    56
dtype: int64
- Sort by value (ascending)
>>> s.sort_values()
d    12
b    34
c    45
g    56
a    67
f    67
e    84
dtype: int64
[Two-dimensional data: DataFrame]
The DataFrame is a tabular data structure in Pandas that contains an ordered set of columns, each of which can be of different value types (numeric, string, Boolean, and so on). The DataFrame has both row and column indexes and can be thought of as a dictionary made up of Series.
- Establish
import pandas as pd

a = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
df = pd.DataFrame(a, index=list('123'), columns=['A', 'B', 'C'])
print(df)
Run Results
   A  B  C
1  1  2  3
2  4  5  6
3  7  8  9
- Convert numpy data to DataFrame format
>>> import pandas as pd
>>> import numpy as np
>>> c = np.arange(20, 32).reshape(3, 4)
>>> c
array([[20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])
>>> c = pd.DataFrame(c)
>>> c
    0   1   2   3
0  20  21  22  23
1  24  25  26  27
2  28  29  30  31
>>> c = pd.DataFrame(c, index=list('abc'), columns=list('ABCD'))
>>> c
    A   B   C   D
a NaN NaN NaN NaN
b NaN NaN NaN NaN
c NaN NaN NaN NaN
>>> d = np.arange(20, 32).reshape(3, 4)
>>> c = pd.DataFrame(d, index=list('abc'), columns=list('ABCD'))
>>> c
    A   B   C   D
a  20  21  22  23
b  24  25  26  27
c  28  29  30  31
- Convert DataFrame to NumPy
>>> a = [[4, 6, 1], [3, 5, 2], [9, 5, 7]]
>>> df = pd.DataFrame(a, index=list('cbd'), columns=list('ahg'))
>>> df.values
array([[4, 6, 1],
       [3, 5, 2],
       [9, 5, 7]], dtype=int64)
- Convert Series data to DataFrame format
>>> import pandas as pd
>>> import numpy as np
>>> s = pd.Series(np.arange(5), index=list('abcde'))
>>> s
a    0
b    1
c    2
d    3
e    4
dtype: int32
>>> s = pd.DataFrame(s)
>>> s
   0
a  0
b  1
c  2
d  3
e  4
- Sorting a DataFrame
- Sort by index
>>> a = [[4, 6, 1], [3, 5, 2], [9, 5, 7]]
>>> df = pd.DataFrame(a, index=list('cbd'), columns=list('ahg'))
>>> df.sort_index()                         # Sort ascending by row index
   a  h  g
b  3  5  2
c  4  6  1
d  9  5  7
>>> df.sort_index(ascending=False)          # Sort descending by row index
   a  h  g
d  9  5  7
c  4  6  1
b  3  5  2
>>> df.sort_index(axis=1)                   # Sort ascending by column index
   a  g  h
c  4  1  6
b  3  2  5
d  9  7  5
>>> df.sort_index(axis=1, ascending=False)  # Sort descending by column index
   h  g  a
c  6  1  4
b  5  2  3
d  5  7  9
- Sort by actual value
Multiple columns can be specified by specifying column values that need to be sorted with the by parameter
>>> df.sort_values(by='h')                          # Sort ascending by the values in column h
   a  h  g
b  3  5  2
d  9  5  7
c  4  6  1
>>> df.sort_values(by='h', ascending=False)         # Sort descending by the values in column h
   a  h  g
c  4  6  1
b  3  5  2
d  9  5  7
>>> df.sort_values(by=['h', 'g'], ascending=False)  # Columns listed first take priority
   a  h  g
c  4  6  1
d  9  5  7
b  3  5  2
[2] Using the Iris.csv dataset, build a KNN model and predict which category of iris a sample with Sepal.Length, Sepal.Width, Petal.Length, Petal.Width equal to (6.3, 3.1, 4.8, 1.4) belongs to. Write the KNN source code yourself.
KNN (K-Nearest Neighbor) algorithm idea:
Let D be the training dataset. When a test sample d appears, compare d with all the samples in D and compute the similarity (or distance) between them. After the k most similar samples are selected from D, the category of d is determined by the category that appears most often among these k nearest neighbors.
from numpy import *
import pandas as pd

data = pd.read_csv('Iris.csv')
dataSet = data.iloc[:, 0:4]
labels = data['Species']
print(labels)

# Number of rows
numSamples = dataSet.shape[0]
print(numSamples)

# Test data (Iris-versicolor)
new_t = array([6.3, 3.1, 4.8, 1.4])

# Euclidean distance to every training sample
diff = tile(new_t, (numSamples, 1)) - dataSet
squreDiff = diff ** 2
squreDist = sum(squreDiff, axis=1)
distance = squreDist ** 0.5
print(distance)

# Sort from smallest to largest
sortedDistIndices = argsort(distance)
print(sortedDistIndices)

classCount = {}
K = 4
for i in range(K):
    voteLabel = labels[sortedDistIndices[i]]
    print(voteLabel)
    classCount[voteLabel] = classCount.get(voteLabel, 0) + 1
print(classCount)

maxCount = 0
for k, v in classCount.items():
    if v > maxCount:
        maxCount = v
        maxIndex = k

print("Your input is:", new_t, "and classified to class: ", maxIndex)
- The sample with Sepal.Length, Sepal.Width, Petal.Length, Petal.Width equal to (6.3, 3.1, 4.8, 1.4) is predicted to belong to Iris setosa.
[3] Compute the Manhattan distance, Chebyshev distance, Minkowski distance, standardized Euclidean distance, and Mahalanobis distance for X = [1,2,3] and Y = [0,1,2]. Give the formula for each and compute according to it, then implement these distances in Python.
[Manhattan Distance]
- Formula:
d(x,y) = \sum_{k=1}^{n} |x_k - y_k|
-
python code
def ManhattanDist(A, B):
    return sum([abs(a - b) for (a, b) in zip(A, B)])
[Chebyshev distance]
- formula
d(x,y) = \max_{k} |x_k - y_k|
Where k = 1,2,3,...,n
-
python code
# Chebyshev distance
def cheDist(A, B):
    return max([abs(a - b) for (a, b) in zip(A, B)])
[Minkowski distance]
- formula
d(x,y) = \left( \sum_{k=1}^{n} |x_k - y_k|^p \right)^{1/p}
- p = 1:Manhattan Distance
- p = 2:Euclidean distance
-
python code
# Minkowski distance
def minkowskiDist(A, B, p):
    return sum([abs(a - b) ** p for (a, b) in zip(A, B)]) ** (1 / p)
[Euclidean Distance]
-
Formula:
d(x,y) = \sqrt{ \sum_{k=1}^{n} (x_k - y_k)^2 }
-
python code
import math

def eculiDist(A, B):
    return math.sqrt(sum([(a - b) ** 2 for (a, b) in zip(A, B)]))
[Standardized Euclidean Distance]
-
formula
d(x,y) = \sqrt{ \sum_{k=1}^{n} \frac{(x_k - y_k)^2}{s_k} } , where s_k is the variance of the k-th component
-
python code
import numpy as np

def standardEucliDist(A, B):
    X = np.vstack([A, B])
    sk = np.var(X, axis=0, ddof=1)  # variance of each component
    return np.sqrt(sum([(x - y) ** 2 / s for (x, y, s) in zip(A, B, sk)]))
[Mahalanobis distance]
- formula
d(x,y) = \sqrt{ (x - y)^T \Sigma^{-1} (x - y) } , where \Sigma is the covariance matrix
If \Sigma is the identity matrix, the Mahalanobis distance degenerates to the Euclidean distance.
If \Sigma is a diagonal matrix, it is called the normalized (standardized) Euclidean distance.
def mashi_distance(x, y):
    X = np.vstack([x, y])
    print(X)
    XT = X.T
    print(XT)
    S = np.cov(X)           # Covariance matrix between the two dimensions
    SI = np.linalg.inv(S)   # Inverse matrix of the covariance matrix
    # The Mahalanobis distance is computed between pairs of samples; here there are
    # four samples, taken two at a time, giving six distances.
    n = XT.shape[0]
    d1 = []
    for i in range(0, n):
        for j in range(i + 1, n):
            delta = XT[i] - XT[j]
            d = np.sqrt(np.dot(np.dot(delta, SI), delta.T))
            print(d)
            d1.append(d)

Job List (5/13)
[1] Four students A, B, C, D were selected and their quiz results were divided into two categories, "excellent" and "pass", using the KMeans algorithm. @Note: the KMeans function of the sklearn third-party library cannot be called directly; write the code yourself following the classification process taught in class. Write an experiment report.
| Student Name | Quiz 1 | Quiz 2 |
| --- | --- | --- |
| A | 1 | 1 |
| B | 2 | 1 |
| C | 4 | 3 |
| D | 5 | 4 |

KMeans Experimental Report (K=2)
Experimental purpose
-
Four students A, B, C, D were selected and their quiz results were divided into "Excellent" and "Pass" categories using Kmeans algorithm.
-
Limitation: The KMeans function of the sklearn third-party library cannot be called directly.
Experimental steps
1. Data preparation
-
Store data as a dictionary object
data = {'A': [1, 1], 'B': [2, 1], 'C': [4, 3], 'D': [5, 4]}
-
To facilitate value, further transform the data into a Series object
import pandas as pd

# Convert to a Series for easy access
data = pd.Series(data)
# print(data.values[1])
# print(data.index[2])
-
The KMeans algorithm involves calculating the distance between two points. We have written a function in advance: Enter the coordinates of two points and return the Euclidean distance between them.
-
import math

def eucliDist(A, B):
    return math.sqrt(sum([(a - b) ** 2 for (a, b) in zip(A, B)]))
-
Function k_means(c,data) implements the KMeans algorithm:
a. Enter centroid list c, to cluster Series object data
b. Calculate the distance from each point in the data to two centroids and get a matrix, such as
[[0.0, 1.0, 3.605551275463989, 5.0], [1.0, 0.0, 2.8284271247461903, 4.242640687119285]]

c. Compare the values in the same column of the matrix, assign each student to the class whose centroid is closer, and store the labels in a list, such as
['pass', 'excellent', 'excellent', 'excellent']
d. Recalculate the coordinates of the centroid, the coordinates of the new centroid = the average value of the coordinates assigned to the same class of points
e. Repeat b~d until the centroid coordinates do not change
f. Return to label list
- The complete function is as follows
def k_means(c, data):
    # a. Input: centroid list c, Series object data to be clustered
    # b. Calculate the distance from each point in the data to the two centroids to get a matrix
    metrix = [[eucliDist(a, b) for a in data.values] for b in c]
    print(metrix)
    # c. Compare the values in the same column of the matrix, assign each student to the class
    #    whose centroid is closer, and store the labels in a list
    classifier = ['pass' if a < b else 'excellent' for (a, b) in zip(metrix[0], metrix[1])]
    print(classifier)
    # d. Recalculate the centroids: the new centroid = the mean of the points assigned to that class
    n1 = 0
    c1, c2 = [0, 0], [0, 0]
    num = len(data)
    for i in range(0, num):
        if classifier[i] == 'pass':
            c1 = [a + b for (a, b) in zip(c1, data.values[i])]
            n1 = n1 + 1
        elif classifier[i] == 'excellent':
            c2 = [a + b for (a, b) in zip(c2, data.values[i])]
    c1 = [a / n1 for a in c1]
    c2 = [a / (num - n1) for a in c2]
    # e. Repeat b~d until the centroid coordinates no longer change
    if c != [c1, c2]:
        c = [c1, c2]
        print("center:" + str(c))
        k_means(c, data)
    return classifier
-
Because we want to classify the data into two categories, we will select K=2 points as the initial centroids, which are (1, 1), (2, 1) respectively.
- Selecting different initial centroids will result in different clustering results
# Select K=2 points as the initial centroids
c = [[1, 1], [2, 1]]
-
Call function
label = k_means(c, data)
-
Organize results: Output in Series format
print(pd.Series((label), index = data.index))
experimental result
- When the initial centroids are [1, 1] and [2, 1], the result is
[2] According to the following report card, classify the five students' results into class A, class B, and class C using the KMeans algorithm. @Note: the KMeans function of the sklearn third-party library cannot be called directly; write the code yourself following the classification process taught in class. Write an experiment report.
| Student Name | Quiz 1 | Quiz 2 | Quiz 3 | Final exam | Project Defense | Grade |
| --- | --- | --- | --- | --- | --- | --- |
| Zhang San | 12 | 15 | 13 | 28 | 24 | ? |
| Li Si | 7 | 11 | 10 | 19 | 21 | ? |
| King Five | 12 | 14 | 11 | 27 | 23 | ? |
| Zhao Six | 6 | 7 | 4 | 13 | 20 | ? |
| Liu Qi | 13 | 14 | 13 | 27 | 25 | ? |

KMeans Experimental Report (K=3)
Experimental purpose
-
According to the following report cards, the results of five students are classified as A, B and C.
-
Limitation: Implemented using the Kmeans algorithm, but not directly calling the KMeans function of the sklearn third-party library.
Experimental steps
1. Data preparation
-
Store the data as a csv file in the following format
Student Name,Quiz 1,Quiz 2,Quiz 3,Final Results,Project Defense
Zhang San,12,15,13,28,24
Li Si,7,11,10,19,21
King Five,12,14,11,27,23
Zhao Liu,6,7,4,13,20
Liu Qi,13,14,13,27,25
-
Read data from csv file and select available data (exclude name column)
data = pd.read_csv('grade.csv')
new_data = data.iloc[:, 1:].values
-
The KMeans algorithm involves calculating the distance between two points. We have written a function in advance: Enter the coordinates of two points and return the Euclidean distance between them.
import math

def eucliDist(A, B):
    return math.sqrt(sum([(a - b) ** 2 for (a, b) in zip(A, B)]))
-
Function k_means(c,data,max,label) implements the KMeans algorithm:
a. Input: centroid list c, data to be clustered, max iterations, label list
b. Calculate Euclidean distances from each point in the data to three centroids to get a matrix metrix
metrix = [[eucliDist(a, b) for a in data] for b in c]
c. Comparing the values of the same column in the matrix metrix, classify the corresponding students into the classes of the shorter centroids, and store the labels as lists.
classifier = []
for (d, e, f) in zip(metrix[0], metrix[1], metrix[2]):
    m = min(d, e, f)
    if d == m:
        classifier.append(label[0])
    elif e == m:
        classifier.append(label[1])
    else:
        classifier.append(label[2])
d. Recalculate the coordinates of the centroid, the coordinates of the new centroid = the average value of the coordinates assigned to the same class of points
n1, n2 = 0, 0
c1 = [0, 0, 0, 0, 0]
c2 = c1
c3 = c1
for i in range(0, num):
    if classifier[i] == label[0]:
        c1 = [a + b for (a, b) in zip(c1, data[i])]
        n1 = n1 + 1
    elif classifier[i] == label[1]:
        c2 = [a + b for (a, b) in zip(c2, data[i])]
        n2 = n2 + 1
    else:
        c3 = [a + b for (a, b) in zip(c3, data[i])]
c1 = [a / n1 for a in c1]
c2 = [a / n2 for a in c2]
c3 = [a / (num - n1 - n2) for a in c3]
e. Repeat b~d until the centroid coordinates do not change or the maximum number of iterations is reached
f. Return to label list
- The complete function is as follows
def k_means(c, data, max, label):
    # a. Input: centroid list c, data to be clustered, maximum iterations max, label list
    max = max - 1
    num = len(data)
    # b. Calculate the distance from each point in the data to the k centroids to get a matrix
    metrix = [[eucliDist(a, b) for a in data] for b in c]
    print(metrix)
    # c. Compare the values in the same column of the matrix, assign each student to the class
    #    whose centroid is closer, and store the labels in a list
    classifier = []
    for (d, e, f) in zip(metrix[0], metrix[1], metrix[2]):
        m = min(d, e, f)
        if d == m:
            classifier.append(label[0])
        elif e == m:
            classifier.append(label[1])
        else:
            classifier.append(label[2])
    print(classifier)
    # d. Recalculate the centroids: the new centroid = the mean of the points assigned to that class
    n1, n2 = 0, 0
    c1 = [0, 0, 0, 0, 0]
    c2 = c1
    c3 = c1
    for i in range(0, num):
        if classifier[i] == label[0]:
            c1 = [a + b for (a, b) in zip(c1, data[i])]
            n1 = n1 + 1
        elif classifier[i] == label[1]:
            c2 = [a + b for (a, b) in zip(c2, data[i])]
            n2 = n2 + 1
        else:
            c3 = [a + b for (a, b) in zip(c3, data[i])]
    c1 = [a / n1 for a in c1]
    c2 = [a / n2 for a in c2]
    c3 = [a / (num - n1 - n2) for a in c3]
    print(max)
    print([c1, c2, c3])
    # e. Repeat b~d until the centroid coordinates no longer change or the maximum number of iterations is reached
    if c != [c1, c2, c3] and max > 0:
        c = [c1, c2, c3]
        print(c)
        k_means(c, data, max, label)
    return classifier
-
Set initial centroid, label list, maximum number of iterations
# Select K points as the initial centroids
c = [[12, 15, 13, 28, 24], [7, 11, 10, 19, 21], [12, 14, 11, 27, 23]]
label = ['A', 'B', 'C']
max = 20
-
Call the function and sort out the results
grade = k_means(c, new_data, max, label)
grade = pd.Series(grade, index=data['Student Name'])
print(grade)
experimental result
- When the initial centroids are [12, 15, 13, 28, 24], [7, 11, 10, 19, 21], [12, 14, 11, 27, 23], the iteration converges after two rounds, and the results are as follows:
[3] Use sklearn's standard KNN and KMeans methods on the dataset "wine.csv" (see WeChat group). Use the KNN algorithm to label the wine test set, then obtain the prediction accuracy of KNN by comparing the predicted and known label values. Use the KMeans algorithm to cluster the labeled "wine.csv", setting the K value and the initial center points yourself.
Standard KNN Classes for Sklearn
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
import pandas as pd

data = pd.read_csv('wine.csv')
x_train, x_test, y_train, y_test = train_test_split(data.iloc[:, 0:13], data.iloc[:, 13])

# Standardization
transfer = StandardScaler()
x_train = transfer.fit_transform(x_train)
x_test = transfer.fit_transform(x_test)

estimator = KNeighborsClassifier()
estimator.fit(x_train, y_train)
predict = estimator.predict(x_test)
score = estimator.score(x_test, y_test)
print(score)
- Prediction accuracy: 0.93333333333333
Sklearn's standard K-means class
KMeans class main parameters

- max_iter: the maximum number of iterations. (Generally, if the dataset is convex, this value can be ignored; if the dataset is not convex, convergence may be difficult, and specifying a maximum number of iterations lets the algorithm exit the loop in time.)
- n_init: the number of times the algorithm is run with different initial centroids. (Since K-Means is an iterative algorithm that only reaches a local optimum, the result is affected by the initial values, so it needs to be run several times to pick the best clustering result. The default is 10 and generally does not need to be changed; if your k value is large, you can increase it appropriately.)
- init: how the initial centroids are chosen. (You can choose 'random' for completely random selection, the optimized 'k-means++', or specify your own K initial centroids. The default 'k-means++' is generally recommended.)
- algorithm: one of 'auto', 'full', or 'elkan'. 'full' is the traditional K-Means algorithm and 'elkan' is the elkan K-Means algorithm. The default 'auto' chooses between 'full' and 'elkan' according to whether the data is sparse: dense data uses 'elkan', otherwise 'full'. In general, the default 'auto' is recommended.
- Set k = 3; the initial centroids are the first three rows of data
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
import pandas as pd

data = pd.read_csv('wine.csv')
x = data.values[:, 0:13]
# plt.scatter(x[:,0], x[:,:1], c='red', marker='o')

estimator = KMeans(n_clusters=3, init=x[0:3], n_init=1)
estimator.fit(x)

# Labels
Label_pred = estimator.labels_
x0 = x[Label_pred == 0]
x1 = x[Label_pred == 1]
x2 = x[Label_pred == 2]

plt.scatter(x0[:, 3], x0[:, 12], c='blue', marker='o')
plt.scatter(x1[:, 3], x1[:, 12], c='green', marker='o')
plt.scatter(x2[:, 3], x2[:, 12], c='red', marker='o')
plt.show()
[4] Use the KMeans algorithm to divide the unlabeled data of the "Iris.csv" dataset into three categories, and visualize the clustering results with a three-dimensional plot.
Visualization of K-means--3D Scatter Chart
-
Code
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd

iris = pd.read_csv('iris.csv')
x = iris.iloc[:, 1:5]
# plt.scatter(x[:,0], x[:,:1], c='red', marker='o')

estimator = KMeans(n_clusters=3)
estimator.fit(x)

# Labels
Label_pred = estimator.labels_
x0 = x[Label_pred == 0]
x1 = x[Label_pred == 1]
x2 = x[Label_pred == 2]

# 3-D scatter chart
fig = plt.figure()
ax = fig.gca(projection='3d')
ax.scatter(x0.iloc[:, 0], x0.iloc[:, 2], x0.iloc[:, 3], c='blue', marker='o')
ax.scatter(x1.iloc[:, 0], x1.iloc[:, 2], x1.iloc[:, 3], c='green', marker='o')
ax.scatter(x2.iloc[:, 0], x2.iloc[:, 2], x2.iloc[:, 3], c='red', marker='o')
ax.set_xlabel(iris.columns[1] + "(cm)")
ax.set_ylabel(iris.columns[3] + "(cm)")
ax.set_zlabel(iris.columns[4] + "(cm)")
plt.show()
-
Get a three-dimensional scatterplot
[5] Use the KMeans algorithm to divide the unlabeled data of the "iris.csv" dataset into three categories, and choose two features at a time to display the clustering results.
K-means Visualization - Multiple Subgraphs
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
import numpy as np

iris = pd.read_csv('iris.csv')
x = iris.iloc[:, 1:5]
# plt.scatter(x[:,0], x[:,:1], c='red', marker='o')

estimator = KMeans(n_clusters=3)
estimator.fit(x)

# Labels
Label_pred = estimator.labels_
x0 = x[Label_pred == 0]
x1 = x[Label_pred == 1]
x2 = x[Label_pred == 2]

# Partition into subplots
fig, axes = plt.subplots(4, 4, sharex='col', figsize=(10, 10))
# plt.ylim(0, 10)
for i in range(0, 4):
    for j in range(0, 4):
        if i != j:
            axes[i, j].scatter(x0.iloc[:, j], x0.iloc[:, i], c='blue', marker='.')
            axes[i, j].scatter(x1.iloc[:, j], x1.iloc[:, i], c='red', marker='.')
            axes[i, j].scatter(x2.iloc[:, j], x2.iloc[:, i], c='green', marker='.')
    axes[i, i].hist(x.iloc[:, i], bins=20, facecolor="blue", edgecolor="black", alpha=0.7)
    axes[i, 0].set_ylabel(iris.columns[i + 1] + "(cm)")
    axes[3, i].set_xlabel(iris.columns[i + 1] + "(cm)")
# plt.tight_layout()
plt.show()
[6] What do you think are the drawbacks of the KMeans and KNN algorithms? In view of these shortcomings, what improvements can be made (consult the literature)?
Defects and Improvements of KMeans Algorithm
(1) The K-Means clustering algorithm requires the user to specify the number of clusters k in advance. In many cases, when clustering a dataset, the user does not know at first how many classes the dataset should be divided into, so the k value is difficult to estimate.
An improved method, a k-Means adaptive optimization method, is described in Li Fang's master's thesis at Anhui University. Starting from a suitable initial k, one run of the K-Means algorithm yields a set of cluster centers. The nearest classes among the K clusters are then merged according to the distances between their centers, so the number of cluster centers decreases; the next round of clustering therefore uses fewer clusters, gradually arriving at an appropriate number. An evaluation criterion E decides when to stop merging cluster centers. The loop is repeated until the evaluation function converges, finally producing a better clustering result.
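A minimal sketch of this adaptive idea, not Li Fang's original procedure: the merge-distance threshold and the use of sklearn's KMeans for each re-clustering round are my own assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import pdist, squareform

def adaptive_kmeans(X, k_init=8, merge_dist=1.0, max_rounds=10):
    """Start from a generous k, then repeatedly 'merge' the two closest
    cluster centers (by re-clustering with one fewer cluster) until no
    pair of centers is closer than merge_dist."""
    k = k_init
    km = None
    for _ in range(max_rounds):
        km = KMeans(n_clusters=k, n_init=10).fit(X)
        d = squareform(pdist(km.cluster_centers_))
        np.fill_diagonal(d, np.inf)
        if d.min() >= merge_dist or k <= 2:
            return km          # stopping criterion satisfied: keep this clustering
        k -= 1                 # two centers are too close -> use one fewer cluster
    return km
```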
(2) Sensitive to the initial cluster centers. Selecting different initial centers yields different clustering results and different accuracy; random selection of the initial centers makes the algorithm unstable and may trap it in a local optimum.
Improvement method: the K-Means++ algorithm. Assuming that n initial cluster centers (0 < n < k) have already been selected, when the (n+1)-th center is chosen, points farther away from the current n centers have a higher probability of being selected.
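A minimal sketch of the k-means++ seeding rule described above (squared-distance weighting is the standard choice; the function and variable names here are my own):

```python
import numpy as np

def kmeans_pp_init(X, k, rng=np.random.default_rng(0)):
    """Pick k initial centers: the first uniformly at random, each later one
    with probability proportional to its squared distance to the nearest
    center chosen so far."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance of every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - np.array(centers)[None, :, :]) ** 2).sum(-1), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```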
(3) Sensitive to noise and isolated points. The K-Means algorithm takes the centroid of a cluster as the cluster center for the next round of computation, so even a small amount of such data can pull the mean far off, producing unstable or even wrong results.
An improved method is to run an outlier-detection algorithm such as LOF first: removing noise points and outliers before clustering reduces their influence on the clustering result.
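One hedged way to realize this with standard tools, filtering outliers with sklearn's LocalOutlierFactor and then clustering the remaining points (the n_neighbors value is an assumption, not taken from the cited material):

```python
from sklearn.neighbors import LocalOutlierFactor
from sklearn.cluster import KMeans

def lof_then_kmeans(X, k=3, n_neighbors=20):
    # LocalOutlierFactor labels inliers +1 and outliers -1
    inlier_mask = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X) == 1
    km = KMeans(n_clusters=k, n_init=10).fit(X[inlier_mask])
    return km, inlier_mask
```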
(4) When clusters have complex geometric shapes, K-Means cannot cluster the data well.
The main reason lies in the distance measure. K-Means uses the Euclidean distance to measure similarity between data objects and the sum of squared errors as its criterion function, so it usually only finds spherical clusters with uniformly distributed data objects.
Improvement: a density-based clustering algorithm such as DBSCAN can be used when irregularly shaped clusters need to be handled.
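A minimal sketch using sklearn's DBSCAN as the density-based alternative (eps and min_samples are placeholder values that must be tuned per dataset; the data here is random and only for illustration):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.random.rand(200, 2)                       # toy data for illustration
labels = DBSCAN(eps=0.1, min_samples=5).fit_predict(X)
print(set(labels))                               # -1 marks points treated as noise
```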
- References:
  - Zhou Yinping, Li Weiguo. Study on the K-Means Clustering Algorithm, Its Evaluation Methods and Improvement Directions. North China Electric Power University, Beijing 102206.
  - https://blog.csdn.net/u013129109/article/details/80063111
  - https://blog.csdn.net/u010536377/article/details/50884416
Defects of KNN algorithm and improvement methods
(1) A more accurate distance function is needed instead of the Euclidean distance
- Improvements:
- Eliminate irrelevant attributes: wrapper methods, greedy search, genetic search
- Attribute weighted distance function
- Frequency-based distance function, called the measure of dissimilarity
- Distance function for nominal attributes: Value Difference Measure (VDM)
(2) Search for the optimal neighborhood size instead of a fixed k
- Improvements:
- Cross validation (see the sketch after this list)
- DKNN: determine the value of k dynamically
(3) Find more accurate estimates of class probability instead of simple voting methods.
- Improvements:
- Probability-based local class model, i.e. combining it with the Naive Bayes (NB) algorithm
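A hedged sketch of the cross-validation improvement mentioned above, using sklearn's GridSearchCV to pick the neighborhood size and distance weighting (the parameter grid and the iris data are assumptions for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
param_grid = {
    'n_neighbors': list(range(1, 21)),        # candidate neighborhood sizes
    'weights': ['uniform', 'distance'],       # distance-weighted voting vs. plain voting
}
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```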
[1] The dataset is shown in the table below. Based on our understanding of decision trees, design a decision tree, feed in the test data, and check whether the output is consistent with the expected result. Note: directly calling the decision tree methods provided by sklearn is not allowed.
| Age | Salary | STU | Credit | Buy Computer |
| --- | --- | --- | --- | --- |
| <30 | H | No | OK | No |
| <30 | H | No | Good | No |
| 30-40 | H | No | OK | Yes |
| >40 | M | No | OK | Yes |
| >40 | L | Yes | OK | Yes |
| >40 | L | Yes | Good | No |
| 30-40 | L | Yes | Good | Yes |
| <30 | M | No | OK | No |
| <30 | L | Yes | OK | Yes |
| >40 | M | Yes | OK | Yes |
| <30 | M | Yes | Good | Yes |
| 30-40 | M | No | Good | Yes |
| 30-40 | H | Yes | OK | Yes |
| >40 | M | No | Good | No |
Decision Tree Algorithmic Ideas
- Use the information gain to select the best attribute to split the dataset.
- Make this attribute a decision node and divide the dataset into smaller subsets.
- Start building the tree by repeating this process recursively for each child node until one of the conditions matches:
- All tuples belong to the same attribute value.
- There are no more remaining attributes.
- There are no more instances.
- Collate the data into CSV format
```
Age,Salary,STU,Credit,BuyComputer
<30,H,No,OK,No
<30,H,No,Good,No
30-40,H,No,OK,Yes
>40,M,No,OK,Yes
>40,L,Yes,OK,Yes
>40,L,Yes,Good,No
30-40,L,Yes,Good,Yes
<30,M,No,OK,No
<30,L,Yes,OK,Yes
>40,M,Yes,OK,Yes
<30,M,Yes,Good,Yes
30-40,M,No,Good,Yes
30-40,H,Yes,OK,Yes
>40,M,No,Good,No
```
- Code implementation
```python
from math import log
import pandas as pd

# Compute the information entropy of a sequence of class labels
def Ent(dataset):
    n = len(dataset)
    label_counts = {}
    for item in dataset:
        label_current = item          # each item is a class label (e.g. 'Yes'/'No')
        if label_current not in label_counts.keys():
            label_counts[label_current] = 0
        label_counts[label_current] += 1
    ent = 0.0
    for key in label_counts:
        prob = label_counts[key] / n
        ent -= prob * log(prob, 2)
    return ent

# Information entropy of a branch weighted by its proportion of samples
def sum_weight(grouped, total_len):
    weight = len(grouped) / total_len
    return weight * Ent(grouped.iloc[:, -1])

# Information gain computed from the formula
def Gain(column, data):
    lenth = len(data)
    ent_sum = data.groupby(column).apply(lambda x: sum_weight(x, lenth)).sum()
    ent_D = Ent(data.iloc[:, -1])
    return ent_D - ent_sum

# Find the feature with maximum information gain; input is a dataframe, return is a column name
def get_max_gain(data):
    max_gain = 0
    cols = data.columns[:-1]
    for col in cols:
        gain = Gain(col, data)
        if gain > max_gain:
            max_gain = gain
            max_label = col
    return max_label

# Most frequent category in the data, used as the leaf classification; input a Series, return its index value
def get_most_label(label_list):
    return label_list.value_counts().idxmax()

# Create a decision tree; pass in a dataframe whose last column is the label
def TreeGenerate(data):
    feature = data.columns[:-1]
    label_list = data.iloc[:, -1]
    # If all samples belong to the same category C, mark this node as a class-C leaf node
    if len(pd.unique(label_list)) == 1:
        return label_list.values[0]
    # If the attribute set A is empty, or all samples take the same values on A,
    # mark the node as a leaf with the majority class
    elif len(feature) == 0 or len(data.loc[:, feature].drop_duplicates()) == 1:
        return get_most_label(label_list)
    # Select the optimal partition attribute from A
    best_attr = get_max_gain(data)
    tree = {best_attr: {}}
    # Generate a branch for each value of the optimal partition attribute
    for attr, gb_data in data.groupby(by=best_attr):
        # print(gb_data)
        if len(gb_data) == 0:
            tree[best_attr][attr] = get_most_label(label_list)
        else:
            # Remove the partition attribute from the data
            new_data = gb_data.drop(best_attr, axis=1)
            # Recursively construct the decision tree
            tree[best_attr][attr] = TreeGenerate(new_data)
    return tree

# Classify a sample by walking the tree recursively
def tree_predict(tree, data):
    feature = list(tree.keys())[0]
    label = data[feature]
    next_tree = tree[feature][label]
    if type(next_tree) == str:
        return next_tree
    else:
        return tree_predict(next_tree, data)

data = pd.read_csv('computer.csv')
# Train the decision tree
mytree = TreeGenerate(data)
print(mytree)
test_data = {'Age': '30-40', 'Salary': 'H', 'STU': 'No', 'Credit': 'OK'}
predict = tree_predict(mytree, test_data)
print(predict)
```
- Experimental result
- Test result: Yes
[2] The dataset is shown in the figure below. Calculate the class information entropy of this dataset and the conditional entropy based on each feature; compute the information gain of each feature according to the formula, and draw the decision tree (by hand, with a signature pen).
[3] For the following dataset, code the key steps of a decision tree implementation, such as information entropy, conditional entropy and information gain, and write an experimental report. (The last column is the category: whether to provide a loan.)
DataSet = [[0, 0, 0, 0,'no'], #Dataset
[0, 0, 0, 1, 'no'],
[0, 1, 0, 1, 'yes'],
[0, 1, 1, 0, 'yes'],
[0, 0, 0, 0, 'no'],
[1, 0, 0, 0, 'no'],
[1, 0, 0, 1, 'no'],
[1, 1, 1, 1, 'yes'],
[1, 0, 1, 2, 'yes'],
[1, 0, 1, 2, 'yes'],
[2, 0, 1, 2, 'yes'],
[2, 0, 1, 1, 'yes'],
[2, 1, 0, 1, 'yes'],
[2, 1, 0, 2, 'yes'],
[2, 0, 0, 0, 'no']]
labels = ['age','job','Own house','credit status']
Experimental Report
- Third-party libraries used in the experiment:
  - math: computing logarithms
  - pandas: organizing and partitioning the data
```python
from math import log
import pandas as pd
```
- Computing information entropy
  - Formula:

$Info(D) = -\sum_{i=1}^{m} p_i \log_2 p_i$
```python
def Ent(dataset):
    n = len(dataset)
    label_counts = {}
    for item in dataset:
        label_current = item
        if label_current not in label_counts.keys():
            label_counts[label_current] = 0
        label_counts[label_current] += 1
    ent = 0.0
    for key in label_counts:
        prob = label_counts[key] / n
        ent -= prob * log(prob, 2)
    return ent
```
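As a quick sanity check of this formula on the dataset from question [3], which has 9 'yes' and 6 'no' samples:

$Info(D) = -\frac{9}{15}\log_2\frac{9}{15} - \frac{6}{15}\log_2\frac{6}{15} \approx 0.971$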
- Calculating conditional entropy
  - Formula:

$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$

```python
# Information entropy of a branch weighted by its proportion of samples
def sum_weight(grouped, total_len):
    weight = len(grouped) / total_len
    return weight * Ent(grouped.iloc[:, -1])
```
  - Use `DataFrame.groupby` together with `apply` to group the data when computing the conditional entropy:

```python
# Conditional entropy: group by the attribute column and sum the weighted branch entropies
ent_sum = data.groupby(column).apply(lambda x: sum_weight(x, lenth)).sum()
```
- Calculating information gain
  - Formula:

$Gain(A) = Info(D) - Info_A(D)$

```python
# Information gain computed from the formula
def Gain(column, data):
    lenth = len(data)
    ent_sum = data.groupby(column).apply(lambda x: sum_weight(x, lenth)).sum()
    ent_D = Ent(data.iloc[:, -1])
    print(column, ent_D - ent_sum)
    return ent_D - ent_sum
```
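For example, for the 'Own house' feature of this dataset (6 samples own a house, all labelled 'yes'; 9 do not, of which 3 are 'yes'):

$Info_{house}(D) = \frac{6}{15}\times 0 + \frac{9}{15}\times\left(-\frac{1}{3}\log_2\frac{1}{3} - \frac{2}{3}\log_2\frac{2}{3}\right) \approx 0.551$

$Gain(house) = 0.971 - 0.551 \approx 0.420$

which is consistent with 'Have your own house' being chosen as the root node in the result below.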
- Recursively generating the decision tree
- Use the information gain to select the best attribute to split the dataset.
- Make this attribute a decision node and divide the dataset into smaller subsets.
- Start building the tree by repeating this process recursively for each child node until one of the conditions matches:
- All tuples belong to the same attribute value.
- There are no more remaining attributes.
- No more instances
```python
# Find the feature with maximum information gain; input is a dataframe, return is a column name
def get_max_gain(data):
    max_gain = 0
    cols = data.columns[:-1]
    for col in cols:
        gain = Gain(col, data)
        if gain > max_gain:
            max_gain = gain
            max_label = col
    return max_label

# Most frequent category in the data, used as the leaf classification; input a Series, return its index value
def get_most_label(label_list):
    return label_list.value_counts().idxmax()

# Create a decision tree; pass in a dataframe whose last column is the label
def TreeGenerate(data):
    feature = data.columns[:-1]
    label_list = data.iloc[:, -1]
    # If all samples belong to the same category C, mark this node as a class-C leaf node
    if len(pd.unique(label_list)) == 1:
        return label_list.values[0]
    # If the attribute set A is empty, or all samples take the same values on A,
    # mark the node as a leaf with the majority class
    elif len(feature) == 0 or len(data.loc[:, feature].drop_duplicates()) == 1:
        return get_most_label(label_list)
    # Select the optimal partition attribute from A
    best_attr = get_max_gain(data)
    tree = {best_attr: {}}
    # Generate a branch for each value of the optimal partition attribute
    for attr, gb_data in data.groupby(by=best_attr):
        # print(gb_data)
        if len(gb_data) == 0:
            tree[best_attr][attr] = get_most_label(label_list)
        else:
            # Remove the partition attribute from the data
            new_data = gb_data.drop(best_attr, axis=1)
            # Recursively construct the decision tree
            tree[best_attr][attr] = TreeGenerate(new_data)
    return tree
```
- Input the data and generate the decision tree:
```python
dataSet = [[0, 0, 0, 0, 'no'], [0, 0, 0, 1, 'no'], [0, 1, 0, 1, 'yes'], [0, 1, 1, 0, 'yes'], [0, 0, 0, 0, 'no'],
           [1, 0, 0, 0, 'no'], [1, 0, 0, 1, 'no'], [1, 1, 1, 1, 'yes'], [1, 0, 1, 2, 'yes'], [1, 0, 1, 2, 'yes'],
           [2, 0, 1, 2, 'yes'], [2, 0, 1, 1, 'yes'], [2, 1, 0, 1, 'yes'], [2, 1, 0, 2, 'yes'], [2, 0, 0, 0, 'no']]
labels = ['Age', 'Working', 'Have your own house', 'Credit situation', 'Whether to provide loans']
data = pd.DataFrame(dataSet, columns=labels)
tree = TreeGenerate(data)
print(tree)
```
- The result is output as a dictionary:
{'Have your own house': {0: {'Working': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
-
[1] Visualize the decision tree implemented in question [3]. (Optional)
```python
import matplotlib.pyplot as plt

# For normal display of Chinese in matplotlib, set the font to SimHei
plt.rcParams['font.sans-serif'] = ['SimHei']
plt.rcParams['font.family'] = 'sans-serif'

# Get the number of leaf nodes of a tree
def get_num_leafs(decision_tree):
    num_leafs = 0
    first_str = next(iter(decision_tree))
    second_dict = decision_tree[first_str]
    for k in second_dict.keys():
        if isinstance(second_dict[k], dict):
            num_leafs += get_num_leafs(second_dict[k])
        else:
            num_leafs += 1
    return num_leafs

# Get the depth of the tree
def get_tree_depth(decision_tree):
    max_depth = 0
    first_str = next(iter(decision_tree))
    second_dict = decision_tree[first_str]
    for k in second_dict.keys():
        if isinstance(second_dict[k], dict):
            this_depth = 1 + get_tree_depth(second_dict[k])
        else:
            this_depth = 1
        if this_depth > max_depth:
            max_depth = this_depth
    return max_depth

# Draw a node
def plot_node(node_txt, center_pt, parent_pt, node_type):
    arrow_args = dict(arrowstyle='<-')
    create_plot.ax1.annotate(node_txt, xy=parent_pt, xycoords='axes fraction',
                             xytext=center_pt, textcoords='axes fraction',
                             va="center", ha="center", bbox=node_type, arrowprops=arrow_args)

# Label the partition attribute value on the branch
def plot_mid_text(cntr_pt, parent_pt, txt_str):
    x_mid = (parent_pt[0] - cntr_pt[0]) / 2.0 + cntr_pt[0]
    y_mid = (parent_pt[1] - cntr_pt[1]) / 2.0 + cntr_pt[1]
    create_plot.ax1.text(x_mid, y_mid, txt_str, va="center", ha="center", color='red')

# Draw the decision tree
def plot_tree(decision_tree, parent_pt, node_txt):
    d_node = dict(boxstyle="sawtooth", fc="0.8")
    leaf_node = dict(boxstyle="round4", fc='0.8')
    num_leafs = get_num_leafs(decision_tree)
    first_str = next(iter(decision_tree))
    cntr_pt = (plot_tree.xoff + (1.0 + float(num_leafs)) / 2.0 / plot_tree.totalW, plot_tree.yoff)
    plot_mid_text(cntr_pt, parent_pt, node_txt)
    plot_node(first_str, cntr_pt, parent_pt, d_node)
    second_dict = decision_tree[first_str]
    plot_tree.yoff = plot_tree.yoff - 1.0 / plot_tree.totalD
    for k in second_dict.keys():
        if isinstance(second_dict[k], dict):
            plot_tree(second_dict[k], cntr_pt, k)
        else:
            plot_tree.xoff = plot_tree.xoff + 1.0 / plot_tree.totalW
            plot_node(second_dict[k], (plot_tree.xoff, plot_tree.yoff), cntr_pt, leaf_node)
            plot_mid_text((plot_tree.xoff, plot_tree.yoff), cntr_pt, k)
    plot_tree.yoff = plot_tree.yoff + 1.0 / plot_tree.totalD

def create_plot(dtree):
    fig = plt.figure(1, facecolor='white')
    fig.clf()
    axprops = dict(xticks=[], yticks=[])
    create_plot.ax1 = plt.subplot(111, frameon=False, **axprops)
    plot_tree.totalW = float(get_num_leafs(dtree))
    plot_tree.totalD = float(get_tree_depth(dtree))
    plot_tree.xoff = -0.5 / plot_tree.totalW
    plot_tree.yoff = 1.0
    plot_tree(dtree, (0.5, 1.0), '')
    plt.show()

tree = {'Have your own house': {0: {'Working': {0: 'no', 1: 'yes'}}, 1: 'yes'}}
create_plot(tree)
```
- Reference material: https://zhuanlan.zhihu.com/p/43819989
[1] In the following transaction dataset:
| TID | Itemset |
| --- | --- |
| 1 | |
| 2 | |
| 3 | |
| 4 | |
| 5 | |
Itemset has 2/5 = 40% support
-
If the minimum support is set to 3, the frequent itemsets in the dataset are
L1
L2
| Frequent Itemset | Support |
| --- | --- |
| | 3 |
| | 3 |
| | 3 |
| | 3 |

- The rule has 2/5 = 40% support and 2/3 ≈ 66.7% confidence.
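For reference, the standard definitions behind these figures are:

$support(X) = \frac{\text{number of transactions containing } X}{\text{total number of transactions}}$, e.g. $\frac{2}{5} = 40\%$

$confidence(X \Rightarrow Y) = \frac{support(X \cup Y)}{support(X)}$, e.g. $\frac{2}{3} \approx 66.7\%$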
[2] Read the "Association Rule Example.py" published by WeChat Group and according to the transaction sheet (T1, T2, T3, T3, T4, T5, T6, T7, T8, T9), the goods list for each transaction is {{,,,,,,,,, , {, ,}, write code to get Association rules.
A strong rule is a rule that is frequent and whose confidence is higher than the minimum confidence Φ.
- Modeling
- Model initialization and parameter settings (minimum support = 0.1, minimum confidence = 0.5)
```python
def __init__(self, minSupport=0.1, minConfidence=0.5):
    '''
    minSupport: minimum support
    minConfidence: minimum confidence
    dataset: data set
    count: stores the frequent itemsets and their support counts
    associationRules: association rules satisfying minConfidence
    num: number of elements
    threshold = num * minSupport: threshold calculated from num and minSupport
    '''
    self.minSupport = minSupport
    self.minConfidence = minConfidence
    self.dataset = None
    self.count = None
    self.associationRules = None
    self.num = 0
    self.threshold = 0
```
- Generating frequent itemsets
  - Idea:
1. First traverse the dataset and count the number of elements `num` to get the candidate set C1;
2. Obtain the threshold from `num * minSupport`;
3. Remove itemsets whose support is below the threshold to get L1, the frequent itemsets of size 1;
4. Combine the frequent itemsets from the previous step to get a new candidate set;
5. Remove itemsets whose support is below the threshold to get the new frequent itemsets;
6. Repeat steps 4-5 until all frequent itemsets are found.
Modify "Association Rule Example.py" published by WeChat Group:
1. In the original example, the number of elements was incorrectly calculated and corrected in the operation. 2. Modify "Association Rule Example.py" published by WeChat Group to avoid collecting data items from itemsets that will not be split-
- Original example:

```python
tmp = set(list(element[i]))
tmp.update(list(element[j]))
```

  ['I1'] and ['I2'] merge into ['I', '1', '2']

- Modified to:

```python
tmp = set([element[i], element[j]])
```

  ['I1'] and ['I2'] merge into ['I1', 'I2']
```python
class Association_rules:
    # Count candidate itemsets of size elength and keep the frequent ones
    def countItem(self, upDict, elength):
        currentDict = {}
        element = list(upDict.keys())
        for i in range(len(element) - 1):
            for j in range(i + 1, len(element)):
                tmp = set([element[i], element[j]])
                if len(tmp) > elength:
                    continue
                if tmp in list(set(item) for item in currentDict.keys()):
                    continue
                for item in self.dataset:
                    if tmp.issubset(set(item)):
                        if tmp in list(set(item) for item in currentDict.keys()):
                            currentDict[tuple(tmp)] += 1
                        else:
                            currentDict[tuple(tmp)] = 1
        for item in list(currentDict.keys()):
            if currentDict[item] < self.threshold:
                del currentDict[item]
        if len(list(currentDict.keys())) < 1:
            return None
        else:
            return currentDict

    # Generate frequent itemsets
    def fit(self, dataset):
        self.dataset = dataset
        count = []
        count.append({})
        for item in self.dataset:
            for i in range(len(item)):
                if item[i] in list(count[0].keys()):
                    count[0][item[i]] += 1
                else:
                    count[0][item[i]] = 1
                self.num += 1
        self.threshold = self.num * self.minSupport
        print(self.num, self.threshold)
        for item in list(count[0].keys()):
            if count[0][item] < self.threshold:
                del count[0][item]
        i = 0
        while True:
            if len(count[i]) < 2:
                break
            else:
                tmp = self.countItem(count[i], i + 2)
                if tmp == None:
                    break
                else:
                    count.append(tmp)
                i += 1
        self.count = count

    # Print and return the frequent itemsets
    def frequentItemsets(self):
        # print('threshold:', self.threshold)
        for item in self.count:
            print(item)
            print()
        return self.count
```
- Calculating confidence
  - Formula:
$Confidence(X \Rightarrow Y) = \frac{|X, Y|}{|X\text{-Items}|}$ (the number of transactions containing both X and Y, divided by the number containing X)

```python
class Association_rules:
    # Calculate the confidence level; set1 = (X), set2 = (X ∪ Y)
    def countConfidence(self, set1, set2):
        len1 = len(set1)
        len2 = len(set2)
        # Remove element-order interference: e.g. set2 = ('a','b') may be stored in self.count as ('b','a')
        if not tuple(set2) in self.count[len2 - 1].keys():
            set2[0], set2[1] = set2[1], set2[0]
        # Note: count stores a plain string when an itemset has one element, and a tuple otherwise
        if len1 == 1:
            return self.count[len2 - 1][tuple(set2)] / self.count[len1 - 1][set1[0]]
        else:
            if not tuple(set1) in self.count[len1 - 1].keys():
                set1[0], set1[1] = set1[1], set1[0]
            return self.count[len2 - 1][tuple(set2)] / self.count[len1 - 1][tuple(set1)]
```
- Finding subsets by binary enumeration
  - Algorithmic idea:
For example, to find the subsets of the 4 elements indexed 3, 2, 1, 0, use the binary digits 0/1 to indicate whether each position is selected:

| Decimal | Binary | Meaning |
| --- | --- | --- |
| 0 | 0000 | the empty set |
| 1 | 0001 | ... |
| 2 | 0010 | ... |
| 3 | 0011 | ... |
| 4 | 0100 | ... |
| ... | ... | ... |
| 14 | 1110 | ... |
| 15 | 1111 | the full itemset |

```python
# Binary enumeration of all subsets of an itemset
# (the range 1 .. 2**N-2 skips the empty set and the full itemset)
def subsets(self, itemset):
    N = len(itemset)
    subsets = []
    for i in range(1, 2 ** N - 1):
        tmp = []
        for j in range(N):
            if (i >> j) % 2 == 1:
                tmp.append(itemset[j])
        subsets.append(tmp)
    return subsets
```
- Generating association rules
  - For each frequent itemset, determine whether the confidence of every rule derived from it exceeds the minimum confidence:
```python
def associationRule(self):
    associationRules = []
    for i in range(1, len(self.count)):
        for itemset in list(self.count[i].keys()):
            # Store the association rules of each itemset in a dictionary
            tmp = {}
            subset = self.subsets(itemset)
            for i in range(len(subset) - 1):
                for j in range(i + 1, len(subset)):
                    # subset[i] and subset[j] must form the complete itemset and share no elements
                    if len(subset[i]) + len(subset[j]) == len(itemset) and len(set(subset[i]) & set(subset[j])) == 0:
                        confidence = self.countConfidence(subset[i], itemset)
                        if confidence > self.minConfidence:
                            # Generate the corresponding key-value pair
                            tmpstr = str(subset[i]) + ' > ' + str(subset[j])
                            tmp[tmpstr] = confidence
                        # Generate the reverse rule by swapping subset[i] and subset[j]
                        confidence = self.countConfidence(subset[j], itemset)
                        if confidence > self.minConfidence:
                            tmpstr = str(subset[j]) + ' > ' + str(subset[i])
                            tmp[tmpstr] = confidence
            if tmp.keys():
                associationRules.append(tmp)
    for item in associationRules:
        print(item)
    return associationRules
```
- Call the model and get the result:
```python
if __name__ == '__main__':
    num = 10
    # dataset = set_data(num)
    dataset = [['I1', 'I2', 'I5'], ['I2', 'I4'], ['I2', 'I3'],
               ['I1', 'I2', 'I4'], ['I1', 'I3'], ['I2', 'I3'],
               ['I1', 'I3'], ['I1', 'I2', 'I3', 'I5'], ['I1', 'I2', 'I3']]
    for item in dataset:
        print(item)
    ar = Association_rules()
    ar.fit(dataset)
    freItemsets = ar.frequentItemsets()
    associationRules = ar.associationRule()
```
- The association rules obtained:
| Association Rule | Support | Confidence |
| --- | --- | --- |
| --> | 4 | 0.67 |
| --> | 4 | 0.57 |
| --> | 4 | 0.67 |
| --> | 4 | 0.67 |
| --> | 4 | 0.67 |
| --> | 4 | 0.57 |