See video: Python Tutorial 4 days to start Python data mining quickly
1 Introduction to Matplotlib
1.1 what is Matplotlib
- Designed mainly for 2D charts (3D charts are also supported)
- Very simple to use
- Supports progressive, interactive data visualization
The name matplotlib comes from:
- mat - matrix (2D data, 2D charts)
- plot - drawing
- lib - library
The name follows MATLAB, which stands for "matrix laboratory":
- mat - matrix
- lab - laboratory
1.2 function of Matplotlib
Visualization is a key supporting tool throughout data mining: it helps us understand the data clearly and adjust our analysis methods accordingly.
- It presents data visually and more intuitively
- It makes data more objective and persuasive
For dynamic visualization on web pages, JavaScript libraries such as D3 and ECharts are commonly used.
1.3 first Matplotlib diagram
    import matplotlib.pyplot as plt
    %matplotlib inline                # Jupyter notebook magic; omit in a plain script

    plt.figure()                      # Create the canvas
    plt.plot([1, 0, 9], [4, 5, 6])    # Draw the line
    plt.show()                        # Display the figure
1.4 three layer structure of Matplotlib
1.4.1 container layer
The container layer is mainly composed of Canvas, Figure and Axes.
- Canvas is the lowest, system-level layer. It acts as the sketchpad, that is, the board on which the drawing sheet is placed.
- Figure is the first layer above Canvas and the first application-level layer that users operate on. It acts as the drawing sheet on the sketchpad; the code comments in this article call it the canvas.
- Axes is the second application-level layer. It acts as the coordinate system / drawing area on the sheet.
Explanation:
- Figure: the whole figure (plt.figure() sets its size and resolution)
- Axes: the drawing area for data (created, for example, with plt.subplots())
- Axis: one axis of a coordinate system, including its limits, ticks and tick labels
Characteristics:
- A Figure can contain multiple Axes (coordinate systems / drawing areas), but an Axes can belong to only one Figure
- An Axes can contain multiple Axis objects (coordinate axes): two for a 2D coordinate system and three for a 3D one
1.4.2 auxiliary display layer
The auxiliary display layer is everything inside Axes (the drawing area) other than the image drawn from the data. It mainly includes the Axes appearance (facecolor), spines, axis, axis label, tick, tick label, grid, legend and title.
Settings in this layer make the image more intuitive and easier to understand, but they do not change the image itself.
1.4.3 image layer
The image layer is the image drawn inside Axes from the data by functions such as plot, scatter, bar, hist and pie.
Summary:
- Canvas (sketchpad) is the bottom layer, which users usually do not touch
- Figure is built on Canvas
- Axes (drawing area) is built on Figure
- The auxiliary display layer (axis, legend, etc.) and the image layer are both built on Axes
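The container hierarchy can be inspected directly. The following is a minimal sketch, assuming a non-interactive Agg backend so that it runs without a display; the variable names are illustrative only.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend, no window is opened
import matplotlib.pyplot as plt

# One Figure holding two Axes (drawing areas)
fig, axes = plt.subplots(nrows=1, ncols=2)
n_axes = len(fig.axes)

# Each Axes belongs to exactly one Figure
belongs = [ax.figure is fig for ax in axes]

# A 2D Axes owns two Axis objects: xaxis and yaxis
axis_names = [type(axes[0].xaxis).__name__, type(axes[0].yaxis).__name__]

# The Canvas sits underneath the Figure and is rarely touched directly
canvas_name = type(fig.canvas).__name__

print(n_axes, belongs, axis_names, canvas_name)
plt.close(fig)
```

Printing these objects is only a way to see the layers; ordinary plotting code rarely needs to touch the Canvas at all.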
2. Plot and basic drawing functions
2.1 drawing and saving of line chart
To get a full picture of the basic plotting functions, we work through the basic API with one running example: plotting weather and temperature changes.
2.1.1 matplotlib.pyplot module
matplotlib.pyplot contains a series of plotting functions similar to those of MATLAB. Each function acts on the current Axes (coordinate system) of the current Figure.
    import matplotlib.pyplot as plt
2.1.2 line drawing and display
Show the weather in Shanghai for one week. The temperatures from Monday to Sunday are plotted as follows:
    # 1. Create the canvas
    # plt.figure()
    plt.figure(figsize=(20, 8), dpi=80)

    # 2. Draw the image
    plt.plot([1, 2, 3, 4, 5, 6, 7], [17, 17, 18, 15, 11, 11, 13])

    # Save the image (must come before plt.show())
    plt.savefig("test.png")

    # 3. Display the image
    plt.show()
Note: plt.show() releases the figure resources. If you save the image after displaying it, only an empty image is saved, so plt.savefig("xx.png") must be called before plt.show().
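The save-before-show order can be checked without writing a file. This is a minimal sketch, assuming the Agg backend and an in-memory buffer in place of a real path such as "test.png"; the plt.show() call is left as a comment because a headless backend has nothing to display.

```python
import io
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(20, 8), dpi=80)
plt.plot([1, 2, 3, 4, 5, 6, 7], [17, 17, 18, 15, 11, 11, 13])

# Save first, while the figure still holds the drawing
buffer = io.BytesIO()                # in-memory stand-in for a file path
plt.savefig(buffer, format="png")
png_bytes = buffer.getvalue()

# Pixel size of the figure = figsize (in inches) * dpi
width_px, height_px = fig.get_size_inches() * fig.dpi

# plt.show() would come here, after the save
plt.close(fig)
print(len(png_bytes) > 0, round(width_px), round(height_px))
```

With figsize=(20, 8) and dpi=80 the figure is 1600 by 640 pixels, which is why the figure size and the dpi together determine how sharp the saved image looks.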
2.1.3 setting canvas properties and image saving
    plt.figure(figsize=(), dpi=)
        figsize: width and height of the figure, in inches
        dpi: dots per inch, which determines the sharpness of the image
        Returns a Figure object
    plt.savefig(path)
        path: where to store the image
2.2 improve the original line chart 1 (auxiliary display layer)
Case: display temperature change
Demand: draw a line chart of the temperature change every minute from 11:00 to 12:00 in a city; the temperature ranges from 15 ℃ to 18 ℃.
2.2.1 prepare data and draw initial line chart
    import random
    import matplotlib.pyplot as plt

    # 1. Prepare the x and y data
    x = range(60)
    y_shanghai = [random.uniform(15, 18) for i in x]

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)

    # 3. Draw the image
    plt.plot(x, y_shanghai)

    # 4. Display the image
    plt.show()
2.2.2 add custom x,y scale
    plt.xticks(x, [labels], **kwargs)
        x: positions of the ticks to display
        [labels]: label text to show at each tick position
        **kwargs: appearance properties of the labels, such as font, rotation and color
    plt.yticks(y, [labels], **kwargs)
        y: positions of the ticks to display
        [labels]: label text to show at each tick position
        **kwargs: appearance properties of the labels, such as font, rotation and color
Example:
    # Modify the x and y ticks
    # Prepare the tick labels for x, e.g. "11:05"
    x_label = ["11:{:02d}".format(i) for i in x]
    plt.xticks(x[::5], x_label[::5])
    # Prepare the ticks for y
    plt.yticks(range(0, 40, 5))  # same as plt.yticks(range(40)[::5])
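The slicing in the example keeps tick positions and tick labels paired one-to-one. A small pure-Python sketch of that pairing follows; the "11:MM" label format is an assumption chosen for readability.

```python
x = range(60)                                   # minutes 0..59
x_label = ["11:{:02d}".format(i) for i in x]    # one label per minute (assumed format)

tick_positions = list(x[::5])                   # every 5th minute: 0, 5, ..., 55
tick_labels = x_label[::5]                      # the labels at those same positions

# Positions and labels must have the same length and line up pairwise
pairs = list(zip(tick_positions, tick_labels))
print(pairs[:3])   # [(0, '11:00'), (5, '11:05'), (10, '11:10')]
```

Applying the same step (here 5) to both sequences is what keeps each label attached to the right tick.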
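2.2.3 solving the Chinese display problem
If the Chinese font problem has not been solved, Chinese characters in labels and titles are not rendered correctly (they usually appear as empty boxes).
Add two lines of code:
    plt.rcParams['font.sans-serif'] = ['SimHei']    # Display Chinese labels correctly (the SimHei font must be installed)
    plt.rcParams['axes.unicode_minus'] = False      # Display the minus sign correctly
2.2.4 add grid display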
To read off the values on the chart more easily, add a grid:
    plt.grid(linestyle="--", alpha=0.5)
2.2.5 add description information
Add descriptions for the x-axis and y-axis, and a title:
    plt.xlabel("Time change")
    plt.ylabel("Temperature change")
    plt.title("Temperature change of a city from 11:00 to 12:00 every minute")
2.3 improve the original line chart 2 (image layer)
Demand: add another city's temperature change
The temperature changes of the day in Beijing were collected, ranging from 1 ℃ to 3 ℃.
2.3.1 multiple plot
Adding another line to the same coordinate system is simple: call plot again. The lines then need to be told apart, as shown below.
    import random
    import matplotlib.pyplot as plt

    # 1. Prepare the x and y data
    x = range(60)
    y_shanghai = [random.uniform(15, 18) for i in x]
    y_beijing = [random.uniform(1, 3) for i in x]   # Add temperature data for Beijing

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)

    # 3. Draw the image; calling plot several times draws several lines
    plt.plot(x, y_shanghai, color="r", linestyle="-.", label="Shanghai")
    plt.plot(x, y_beijing, color="b", label="Beijing")

    # Show the legend (the key that identifies each line)
    plt.legend()

    # Modify the x and y ticks
    # Prepare the tick labels for x, e.g. "11:05"
    x_label = ["11:{:02d}".format(i) for i in x]
    plt.xticks(x[::5], x_label[::5])
    plt.yticks(range(0, 40, 5))

    # Add grid display
    plt.grid(linestyle="--", alpha=0.5)

    # Add description
    plt.xlabel("Time change")
    plt.ylabel("Temperature change")
    plt.title("Temperature changes in Shanghai and Beijing from 11:00 to 12:00 every minute")

    # 4. Display the image
    plt.show()
Two new things are used here: different line styles to distinguish the lines, and a legend.
2.3.2 setting graphic style
    Color character    Style character
    r  red             -   solid line
    g  green           --  dashed line
    b  blue            -.  dash-dot line
    w  white           :   dotted line
    c  cyan            ''  blank (no line)
    m  magenta
    y  yellow
    k  black
2.3.3 display legend
Note: setting label in plt.plot() alone does not display the legend. You must also call plt.legend().
    plt.legend()            # Default: best
    plt.legend(loc="best")
    plt.legend(loc=0)
loc: location of the legend
    Location String     Location Code
    'best' (default)    0
    'upper right'       1
    'upper left'        2
    'lower left'        3
    'lower right'       4
    'right'             5
    'center left'       6
    'center right'      7
    'lower center'      8
    'upper center'      9
    'center'            10
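A short sketch of how the color and style characters and the legend fit together, assuming the Agg backend and a few made-up temperature values for illustration:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
x = [0, 1, 2, 3]
# Illustrative values only, not real measurements
line_sh, = ax.plot(x, [16, 17, 15, 18], color="r", linestyle="-.", label="Shanghai")
line_bj, = ax.plot(x, [1, 3, 2, 2], color="b", linestyle="--", label="Beijing")

# Without this call the labels are stored on the lines but never drawn
legend = ax.legend(loc="upper right")   # same location as loc=1

legend_texts = [t.get_text() for t in legend.get_texts()]
styles = [(line.get_color(), line.get_linestyle()) for line in (line_sh, line_bj)]
print(legend_texts, styles)
plt.close(fig)
```

The label passed to plot is only metadata on the line; the legend call is what collects those labels and draws them at the chosen location.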
2.4 multiple coordinate system display - plt.subplots (object-oriented drawing method)
Demand: display the temperature charts of Shanghai and Beijing in different coordinate systems of the same figure
This can be done with the subplots function. (The older subplot function also exists, but it is less convenient.) The subplots function is recommended.
- matplotlib.pyplot.subplots(nrows=1, ncols=1, **fig_kw) creates a figure with multiple Axes
More methods on the Axes sub coordinate system: see https://matplotlib.org/api/axes_api.html#matplotlib.axes.Axes
    figure, axes = plt.subplots(nrows=1, ncols=2, **fig_kw)   # 1 row and 2 columns
    axes[0].set_<method name>()   # first plot
    axes[1]                       # second plot
Note: plt.<function name>() is the procedural drawing style; axes.set_<method name>() is the object-oriented drawing style.
    import random
    import matplotlib.pyplot as plt

    # 1. Prepare the x and y data
    x = range(60)
    y_shanghai = [random.uniform(15, 18) for i in x]
    y_beijing = [random.uniform(1, 3) for i in x]

    # 2. Create the canvas
    # plt.figure(figsize=(20, 8), dpi=80)
    figure, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), dpi=80)

    # 3. Draw the image
    axes[0].plot(x, y_shanghai, color="r", linestyle="-.", label="Shanghai")
    axes[1].plot(x, y_beijing, color="b", label="Beijing")

    # Show the legend
    axes[0].legend()
    axes[1].legend()

    # Modify the x and y ticks
    # Prepare the tick labels for x, e.g. "11:05"
    x_label = ["11:{:02d}".format(i) for i in x]
    # plt.xticks(x[::5], x_label[::5]) is replaced by the following two lines
    axes[0].set_xticks(x[::5])
    axes[0].set_xticklabels(x_label[::5])   # one label per tick position
    axes[0].set_yticks(range(0, 40, 5))
    axes[1].set_xticks(x[::5])
    axes[1].set_xticklabels(x_label[::5])
    axes[1].set_yticks(range(0, 40, 5))

    # Add grid display
    axes[0].grid(linestyle="--", alpha=0.5)
    axes[1].grid(linestyle="--", alpha=0.5)

    # Add description
    axes[0].set_xlabel("Time change")
    axes[0].set_ylabel("Temperature change")
    axes[0].set_title("The temperature change of every minute from 11:00 to 12:00 in Shanghai")
    axes[1].set_xlabel("Time change")
    axes[1].set_ylabel("Temperature change")
    axes[1].set_title("The temperature change of every minute from 11:00 to 12:00 in Beijing")

    # 4. Display the image
    plt.show()
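Because every Axes object exposes the same methods, the per-axes calls can also be written once inside a loop. The following is a sketch of that variation, not the tutorial's own code; the cities dictionary, the "11:MM" label format and the Agg backend are assumptions made for illustration.

```python
import random
import matplotlib
matplotlib.use("Agg")  # assumption: headless backend
import matplotlib.pyplot as plt

x = range(60)
x_label = ["11:{:02d}".format(i) for i in x]

# Hypothetical helper structure: city name -> (temperatures, line color)
cities = {
    "Shanghai": ([random.uniform(15, 18) for _ in x], "r"),
    "Beijing": ([random.uniform(1, 3) for _ in x], "b"),
}

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(20, 8), dpi=80)
for ax, (city, (temps, color)) in zip(axes, cities.items()):
    ax.plot(x, temps, color=color, label=city)
    ax.legend()
    ax.set_xticks(list(x[::5]))          # tick positions
    ax.set_xticklabels(x_label[::5])     # one label per tick position
    ax.set_yticks(list(range(0, 40, 5)))
    ax.grid(linestyle="--", alpha=0.5)
    ax.set_xlabel("Time change")
    ax.set_ylabel("Temperature change")
    ax.set_title("Temperature in {} from 11:00 to 12:00".format(city))

titles = [ax.get_title() for ax in axes]
plt.close(fig)
```

The loop keeps the two subplots consistent: a change to the grid or tick settings is made in one place and applies to both.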
2.5 application scenario of line chart
An index changes with time:
- Show the number of active users of the company's products (different regions) every day
- Number of app downloads per day
- Show how the number of user clicks changes over time after a new product feature goes online
- Extension: drawing graphs of various mathematical functions
Note: besides line charts, plt.plot() can also draw graphs of various mathematical functions.
Drawing the graph of a mathematical function:
    import numpy as np
    import matplotlib.pyplot as plt

    # 1. Prepare the x and y data
    # Sine function data (uncomment these two lines to plot it instead)
    # x = np.linspace(-10, 10, 1000)   # 1000 evenly spaced numbers in [-10, 10]
    # y = np.sin(x)
    # Quadratic function data
    x = np.linspace(-1, 1, 1000)
    y = 2 * x * x

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)

    # 3. Draw the image
    plt.plot(x, y)

    # Add grid display
    plt.grid(linestyle="--", alpha=0.5)

    # 4. Display the image
    plt.show()
3 types and significance of common figures
3.1 line chart
- Line chart: a chart that shows the increase or decrease of a quantity through the rise or fall of a line
Features: shows the trend of the data and reflects how things change. (change)
3.2 scatter diagram
- Scatter diagram: two sets of data form multiple coordinate points; by inspecting the distribution of the points you can judge whether the two variables are associated, or summarize the distribution pattern of the points
Features: judge whether there is a quantitative correlation between variables, and reveal outliers (distribution pattern)
3.3 bar chart
- Bar chart: data arranged in the columns or rows of a worksheet can be drawn as a bar chart.
Features: plots discrete data; the size of each item is visible at a glance, which makes differences between items easy to compare. (statistics / comparison)
3.4 histogram
- Histogram: the distribution of data represented by a series of vertical bars or line segments of different heights. Generally, the horizontal axis represents the data range and the vertical axis represents the distribution.
Features:
- Plots continuous data to show the distribution of one or more groups of data (statistics)
- A histogram can also be used to see where the data are concentrated and where abnormal or isolated values lie
3.5 pie chart
- Pie chart: represents the proportions of different categories; categories are compared by the size of their arcs.
Features: the share of each category (percentage)
4 scatter
Demand: exploring the relationship between housing area and housing price
    import matplotlib.pyplot as plt

    # 1. Prepare data
    # Housing area data
    x = [225.98, 247.07, 253.14, 457.85, 241.58, 301.01, 20.67, 288.64,
         163.56, 120.06, 207.83, 342.75, 147.9, 53.06, 224.72, 29.51,
         21.61, 483.21, 245.25, 399.25, 343.35]
    # Housing price data
    y = [196.63, 203.88, 210.75, 372.74, 202.41, 247.61, 24.9, 239.34,
         140.32, 104.15, 176.84, 288.23, 128.79, 49.64, 191.74, 33.1,
         30.74, 400.02, 205.35, 330.64, 283.45]

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)

    # 3. Draw the scatter diagram
    plt.scatter(x, y)

    # 4. Display the image
    plt.show()
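A scatter diagram shows the association visually. As a numerical companion, the Pearson correlation coefficient can be computed for the same data. This is an optional sketch using NumPy's corrcoef, which is an addition for illustration and not part of the original example.

```python
import numpy as np

# Housing area data (same values as the scatter example)
x = [225.98, 247.07, 253.14, 457.85, 241.58, 301.01, 20.67, 288.64,
     163.56, 120.06, 207.83, 342.75, 147.9, 53.06, 224.72, 29.51,
     21.61, 483.21, 245.25, 399.25, 343.35]
# Housing price data
y = [196.63, 203.88, 210.75, 372.74, 202.41, 247.61, 24.9, 239.34,
     140.32, 104.15, 176.84, 288.23, 128.79, 49.64, 191.74, 33.1,
     30.74, 400.02, 205.35, 330.64, 283.45]

# corrcoef returns a 2x2 matrix; the off-diagonal entry is r between x and y
r = np.corrcoef(x, y)[0, 1]
print(round(float(r), 3))
```

A value of r close to 1 indicates a strong positive linear relationship, which matches the upward-sloping cloud of points in the scatter diagram.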
5 bar
    matplotlib.pyplot.bar(x, height, width=0.8, align='center', **kwargs)
Parameters:
    x: sequence of scalars, the x coordinates of the bars
    height: scalar or sequence of scalars, the heights of the bars (the y values)
    width: scalar or array-like, optional, the width of the bars
    align: {'center', 'edge'}, optional, default: 'center'
        Alignment of the bars to the x coordinates
        'center': center the base on the x positions
        'edge': align the left edges of the bars with the x positions
    **kwargs:
        color: the color of the bars
Returns:
    BarContainer: container with all the bars and optionally errorbars
5.1 demand 1 - compare box office revenue of each film
    import matplotlib.pyplot as plt

    # 1. Prepare data
    movie_names = ['Raytheon 3: dusk of gods', 'Justice League: Injustice for All',
                   'Murder of Orient Express', 'Journey to dream seeking circle',
                   'Global Storm', 'Demon subduing biography', 'chase',
                   'Seventy seven days', 'Secret War', 'Berserker', 'other']
    tickets = [73853, 57767, 22354, 15969, 14839, 8725, 8716, 8318, 7916, 6764, 52222]

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)
    # Display Chinese labels correctly
    plt.rcParams['font.sans-serif'] = ['SimHei']

    # 3. Draw the bar chart
    x_ticks = range(len(movie_names))
    plt.bar(x_ticks, tickets, color=['b', 'r', 'g', 'y', 'c', 'm', 'y', 'k', 'c', 'g', 'b'])

    # Modify the x ticks
    plt.xticks(x_ticks, movie_names)

    # Add title
    plt.title("Box office revenue comparison")

    # Add grid display
    plt.grid(linestyle="--", alpha=0.5)

    # 4. Display the image
    plt.show()
5.2 demand 2 - how to make the box office comparison more persuasive
Compare the box office over the same number of days
To be fair, we sometimes need to compare the first-day and first-week box office of different films.
    import matplotlib.pyplot as plt

    # 1. Prepare data
    movie_name = ['Raytheon 3: dusk of gods', 'Justice League: Injustice for All', 'Journey to dream seeking circle']
    first_day = [10587.6, 10062.5, 1275.7]
    first_weekend = [36224.9, 34479.6, 11830]
    x = range(len(movie_name))

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)

    # 3. Draw the bar chart: two bars side by side for each film
    plt.bar(x, first_day, width=0.2, label="First day box office")
    # Same as plt.bar([0.2, 1.2, 2.2], first_weekend, width=0.2, label="First week box office")
    plt.bar([i + 0.2 for i in x], first_weekend, width=0.2, label="First week box office")

    # Show the legend
    plt.legend()

    # Modify the x ticks: place each label between its two bars
    plt.xticks([0.1, 1.1, 2.1], movie_name)

    # 4. Display the image
    plt.show()
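The tick positions 0.1, 1.1 and 2.1 in the example follow from the bar width. A small pure-Python sketch of that arithmetic, with short placeholder film names as an assumption:

```python
movie_name = ["Film A", "Film B", "Film C"]   # hypothetical placeholder names
width = 0.2

x = list(range(len(movie_name)))              # first bars are centred on 0, 1, 2
x_shifted = [i + width for i in x]            # second bars sit one bar width to the right
tick_centers = [i + width / 2 for i in x]     # midpoint between each pair of bars

print(x_shifted, tick_centers)
```

Since bars are centre-aligned by default, the first bar of each pair spans from i - 0.1 to i + 0.1 and the second from i + 0.1 to i + 0.3, so the shared edge at i + 0.1 is the natural place for the film label.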
6 histogram
6.1 histogram introduction
A histogram looks similar to a bar chart but has a completely different meaning. A histogram involves statistical concepts: the data are first divided into groups, and then the number of data elements in each group is counted. In the coordinate system, the horizontal axis marks the endpoints of each group, the vertical axis represents the count, and the height of each rectangle represents the count of the corresponding group. Such a statistical chart is called a frequency distribution histogram.
Example:
- The frequency distribution histogram of the height of 36 students in class 1, grade 3 of a school is shown in the figure below
(1) Which group has the most students in height?
(2) How many students are over 160.5cm tall?
Related concepts:
- Number of groups: in statistics, data are divided into groups according to value ranges; how many groups there are is called the number of groups (bins)
- Group spacing: the difference between the two endpoints of a group (the bin width)
6.2 comparison between histogram and bar chart
- Bar chart: the length of each rectangle represents the frequency or count of each category, and the width (which only marks the category) is fixed. Suited to analysing small data sets.
- Histogram: describes the frequency distribution of a set of data. The length of each rectangle represents the frequency or count of each group, and the width represents the group spacing, so both height and width are meaningful. Suited to presenting statistics of large data sets.
- Histograms help in understanding the distribution of data, such as the mode, the approximate location of the median, and whether there are gaps or outliers.
1. A histogram shows the distribution of data, while a bar chart compares the sizes of data. This is the most fundamental difference.
- A histogram shows how a set of data is distributed over the divided intervals, but the size of an individual data point within an interval cannot be read from it.
- In a bar chart, the size of each item can be read and compared.
2. The x-axis of a histogram is quantitative data; the x-axis of a bar chart is categorical data.
- In a histogram, the variable on the x-axis is divided into continuous intervals, usually expressed as numbers, such as "0-10g, 10-20g, ..." for apple weight or "0-10min, 10-20min, ..." for duration.
- In a bar chart, the variable on the x-axis is categorical, such as country names or game types.
- The columns of a histogram cannot be moved: the intervals on the x-axis are continuous and fixed.
- The columns of a bar chart can be reordered freely: sometimes by category name, sometimes by value.
3. Histogram columns have no gaps between them; bar chart columns are separated by gaps
- This is because the intervals of a histogram are continuous, while the categories of a bar chart are discrete.
4. The column widths of a histogram may differ; the column widths of a bar chart must be the same
- All columns of a bar chart must have the same width, because the width has no numerical meaning.
- In a histogram, the width of a column represents the length of its interval. Columns can therefore have different widths, although in theory each width should be a multiple of a unit length.
For example, the U.S. Census Bureau surveyed the commuting time of 124 million people. Because few people commute for 45-150 minutes, that range was divided into wider intervals of 45-60, 60-90 and 90-150 minutes, while all other intervals are 5 minutes wide.
- In that chart the y-axis shows "number of people / group spacing". The area of each column is then meaningful, and the areas of all columns sum to the total number of people surveyed.
- When the y-axis instead shows "number in interval / total number / group spacing", the chart is the "frequency distribution histogram" taught in junior high school, where the frequency means "number in interval / total number". In such a histogram, the areas of all columns sum to 1.
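The two area statements can be checked numerically. The following sketch uses NumPy and a handful of made-up commute times (illustrative values, not census data) with unequal interval widths like those described above.

```python
import numpy as np

# Hypothetical commute times in minutes (illustrative values only)
times = np.array([3, 7, 12, 14, 18, 22, 23, 27, 31, 38, 44, 50, 58, 75, 120])
# Interval endpoints: 5-minute groups up to 45, then wider groups
edges = np.array([0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 60, 90, 150])

counts, _ = np.histogram(times, bins=edges)
widths = np.diff(edges)                        # group spacing of each interval

per_width = counts / widths                    # "number of people / group spacing"
density = counts / counts.sum() / widths       # "number in interval / total / group spacing"

area_counts = float((per_width * widths).sum())    # sums to the number of people
area_density = float((density * widths).sum())     # sums to 1

# NumPy's density=True option computes the same normalisation
np_density, _ = np.histogram(times, bins=edges, density=True)
print(area_counts, area_density)
```

Dividing by the group spacing is what makes columns of different widths comparable: a wide interval no longer looks tall merely because it collects more values.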
6.3 histogram drawing
Demand: film duration distribution
Now there are 250 movie durations. I want to count the distribution of these movie durations, such as the number of movies with durations ranging from 100 minutes to 120 minutes, and the frequency of their occurrence. How do you present these data?
6.3.1 histogram drawing api
    matplotlib.pyplot.hist(x, bins=None, density=False, **kwargs)
Parameters:
    x: (n,) array or sequence of (n,) arrays, the data
    bins: integer or sequence or 'auto', optional (the number of groups)
    density: whether to display relative frequency; by default counts are displayed (older Matplotlib versions called this parameter normed, which has since been removed in favour of density)
6.3.2 drawing
- Set the group spacing
- Set the number of groups (with little data, 5-12 groups are usual; with more data, adjust the way the chart is displayed)
- The number of groups generally follows the formula: number of groups (bins) = range / group spacing = (max - min) / group spacing (rounded down with integer division //)
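The formula for the number of groups can be worked through on a small sample. The durations below are a hypothetical subset chosen for illustration, with a group spacing of 2 minutes.

```python
durations = [78, 95, 101, 110, 117, 123, 131, 139, 144, 156]  # hypothetical sample, minutes
bin_width = 2                                                 # group spacing

data_range = max(durations) - min(durations)   # 156 - 78 = 78
num_bins = data_range // bin_width             # 78 // 2 = 39 groups

print(data_range, num_bins)
```

Integer division rounds down, so when the range is not an exact multiple of the group spacing the last group ends up slightly wider than the others.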
    import matplotlib.pyplot as plt

    # 1. Prepare data
    time = [131, 98, 125, 131, 124, 139, 131, 117, 128, 108, 135, 138, 131,
            102, 107, 114, 119, 128, 121, 142, 127, 130, 124, 101, 110, 116,
            117, 110, 128, 128, 115, 99, 136, 126, 134, 95, 138, 117, 111,
            78, 132, 124, 113, 150, 110, 117, 86, 95, 144, 105, 126, 130,
            126, 130, 126, 116, 123, 106, 112, 138, 123, 86, 101, 99, 136,
            123, 117, 119, 105, 137, 123, 128, 125, 104, 109, 134, 125, 127,
            105, 120, 107, 129, 116, 108, 132, 103, 136, 118, 102, 120, 114,
            105, 115, 132, 145, 119, 121, 112, 139, 125, 138, 109, 132, 134,
            156, 106, 117, 127, 144, 139, 139, 119, 140, 83, 110, 102, 123,
            107, 143, 115, 136, 118, 139, 123, 112, 118, 125, 109, 119, 133,
            112, 114, 122, 109, 106, 123, 116, 131, 127, 115, 118, 112, 135,
            115, 146, 137, 116, 103, 144, 83, 123, 111, 110, 111, 100, 154,
            136, 100, 118, 119, 133, 134, 106, 129, 126, 110, 111, 109, 141,
            120, 117, 106, 149, 122, 122, 110, 118, 127, 121, 114, 125, 126,
            114, 140, 103, 130, 141, 117, 106, 114, 121, 114, 133, 137, 92,
            121, 112, 146, 97, 137, 105, 98, 117, 112, 81, 97, 139, 113,
            134, 106, 144, 110, 137, 137, 111, 104, 117, 100, 111, 101, 110,
            105, 129, 137, 112, 120, 113, 133, 112, 83, 94, 146, 133, 101,
            131, 116, 111, 84, 137, 115, 122, 106, 144, 109, 123, 116, 111,
            111, 133, 150]

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)

    # 3. Draw the histogram
    distance = 2                                             # group spacing
    group_num = int((max(time) - min(time)) / distance)      # number of groups, rounded down
    plt.hist(time, bins=group_num, density=True)

    # Modify the x ticks
    plt.xticks(range(min(time), max(time) + 2, distance))

    # Add grid display
    plt.grid(linestyle="--", alpha=0.5)

    # 4. Display the image
    plt.show()
6.3.3 histogram points for attention
- Pay attention to the group spacing
The group spacing affects the distribution the histogram presents, so try several group spacings when drawing a histogram.
- Pay attention to the variable on the y-axis
The y-axis can show the count (how many times the data appear), the relative frequency (count / total) or the relative frequency / group spacing. Each choice gives the distribution shown by the histogram a different meaning.
7 pie
7.1 pie chart introduction
A pie chart represents the proportions of different categories; categories are compared by the size of their arcs.
The pie is divided into sectors according to the proportion of each category. The whole pie represents the total, each sector (arc) represents the share of one category in the total, and all sectors add up to 100%.
7.2 pie drawing
Introduction to the pie API (note how the number of displayed decimal places is set):
    plt.pie(x, labels=, autopct=, colors=)
- x: the quantities; the percentages are computed automatically
- labels: the name of each sector
- autopct: format of the displayed percentage, e.g. %1.2f%%
- %1.2f%%: the first % starts the format specifier; in 1.2f, 1 is the minimum field width and .2f keeps two decimal places of a floating-point number; the final %% is an escaped % that prints a literal percent sign
- colors: the color of each sector
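The autopct format string is an ordinary Python %-format, so its effect can be tried without drawing anything. A minimal sketch with three made-up quantities:

```python
counts = [1, 2, 5]                       # hypothetical quantities for three sectors
total = sum(counts)

# Share of each sector in percent, as a pie chart computes it from x
shares = [100.0 * c / total for c in counts]

# Apply the same format string that would be passed as autopct
labels = ["%1.2f%%" % s for s in shares]
print(labels)   # ['12.50%', '25.00%', '62.50%']
```

Changing .2f to .1f would show one decimal place instead of two, and the shares always add up to 100 because each is computed relative to the same total.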
Demand: display the arrangement proportion of different films
    import matplotlib.pyplot as plt

    # 1. Prepare data
    movie_name = ['Raytheon 3: dusk of gods', 'Justice League: Injustice for All',
                  'Murder of Orient Express', 'Journey to dream seeking circle',
                  'Global Storm', 'Demon subduing biography', 'chase',
                  'Seventy seven days', 'Secret War', 'Berserker', 'other']
    place_count = [60605, 54546, 45819, 28243, 13270, 9945, 7679, 6799, 6101, 4621, 20105]

    # 2. Create the canvas
    plt.figure(figsize=(20, 8), dpi=80)

    # 3. Draw the pie chart
    plt.pie(place_count, labels=movie_name,
            colors=['b', 'r', 'g', 'y', 'c', 'm', 'y', 'k', 'c', 'g', 'y'],
            autopct="%1.2f%%")

    # Show the legend
    plt.legend()

    # Keep the displayed pie chart round
    plt.axis('equal')

    # 4. Display the image
    plt.show()
7.3 add axis
To keep the displayed pie chart round, call plt.axis('equal') so that both axes use the same scale; otherwise the pie may be drawn as an ellipse.
8 summary
Video: Black Horse Python tutorial, 4 days to start Python data mining quickly: https://www.bilibili.com/video/BV1xt411v7z9?from=search&seid=1374736475069929050