Array Stitching
When stacking vertically, each column must carry the same meaning in both arrays; otherwise the combined data makes no sense.
If the columns are in a different order, swap the columns of one array first so that its layout matches the other.
Array row-column exchange
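A minimal sketch of swapping columns before stacking. The two small arrays and their column meanings are made up for illustration; they are not the YouTube data.

```python
import numpy as np

# Hypothetical data: columns of a are (views, likes),
# columns of b are (likes, views), so the order differs.
a = np.array([[100, 10],
              [200, 20]])
b = np.array([[5, 50],
              [6, 60]])

# Swap columns 0 and 1 of b so its layout matches a.
# The right-hand side is a copy, so the assignment is safe.
b[:, [0, 1]] = b[:, [1, 0]]

# Each column now means the same thing, so stacking is meaningful.
stacked = np.vstack((a, b))
print(stacked)
```

Rows can be swapped the same way by indexing the first axis, for example t[[0, 1], :] = t[[1, 0], :].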
Exercise: Building on the previous case, how can we combine the data from the two countries for analysis while keeping track of which country each row came from?
import numpy as np

us_data = "./youtube_video_data/US_video_data_numbers.csv"
uk_data = "./youtube_video_data/GB_video_data_numbers.csv"

# Load the data for each country
us_data = np.loadtxt(us_data, delimiter=",", dtype=int)
uk_data = np.loadtxt(uk_data, delimiter=",", dtype=int)

# Add country information
# Construct a column of zeros (marks the US rows)
zeros_data = np.zeros((us_data.shape[0], 1)).astype(int)
# Construct a column of ones (marks the UK rows)
ones_data = np.ones((uk_data.shape[0], 1)).astype(int)

# Append the marker column to each array
us_data = np.hstack((us_data, zeros_data))
uk_data = np.hstack((uk_data, ones_data))

# Stack the two data sets vertically
final_data = np.vstack((us_data, uk_data))
print(final_data)
Run result:
[[4394029  320053    5931   46245       0]
 [7860119  185853   26679       0       0]
 [5845909  576597   39774  170708       0]
 ...
 [ 109222    4840      35     212       1]
 [ 626223   22962     532    1559       1]
 [  99228    1699      23     135       1]]
numpy More Methods
- Get the positions of the maximum and minimum:
np.argmax(t,axis=0) #along axis 0
np.argmin(t,axis=1) #along axis 1
- Create an array of all zeros:
np.zeros((3,4))
- Create an array of all ones:
np.ones((3,4))
- Create a square array with ones on the diagonal (an identity matrix):
np.eye(3)
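The calls listed above can be tried on a small array. The array t below is made up for illustration.

```python
import numpy as np

t = np.array([[3, 9, 1],
              [7, 2, 8]])

# axis=0: for each column, the row index of its maximum
col_max_pos = np.argmax(t, axis=0)
# axis=1: for each row, the column index of its minimum
row_min_pos = np.argmin(t, axis=1)
print(col_max_pos)  # [1 0 1]
print(row_min_pos)  # [2 1]

zeros = np.zeros((3, 4))  # 3x4 array of 0.0 (float by default)
ones = np.ones((3, 4))    # 3x4 array of 1.0
eye = np.eye(3)           # 3x3 identity matrix
print(eye)
```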
numpy generates random numbers
np.random.seed(s) usage:
np.random.seed(0)
np.random.rand(10)
Out[357]:
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152])
np.random.rand(10)
Out[358]:
array([0.79172504, 0.52889492, 0.56804456, 0.92559664, 0.07103606,
       0.0871293 , 0.0202184 , 0.83261985, 0.77815675, 0.87001215])
The second call to np.random.rand(10) does not restart from seed 0. It continues the same random sequence from where the first call stopped, so it returns different values from the first call.
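A small check of this behaviour, as a sketch using the same legacy np.random functions as the session above. The variable names are only for illustration.

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(10)
second = np.random.rand(10)  # continues the sequence, no re-seeding

# Seeding again restarts the sequence from the beginning.
np.random.seed(0)
both = np.random.rand(20)

print(np.array_equal(both[:10], first))   # first call reproduced
print(np.array_equal(both[10:], second))  # second call is the next 10 draws
print(np.array_equal(first, second))      # the two calls differ
```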
To get the same values again, call np.random.seed(0) once more before generating:
np.random.seed(0)
np.random.rand(4,3)
Out[362]:
array([[0.5488135 , 0.71518937, 0.60276338],
       [0.54488318, 0.4236548 , 0.64589411],
       [0.43758721, 0.891773  , 0.96366276],
       [0.38344152, 0.79172504, 0.52889492]])
np.random.seed(0)
np.random.rand(4,3)
Out[364]:
array([[0.5488135 , 0.71518937, 0.60276338],
       [0.54488318, 0.4236548 , 0.64589411],
       [0.43758721, 0.891773  , 0.96366276],
       [0.38344152, 0.79172504, 0.52889492]])
pandas Learning
numpy helps us process numeric data, but that is not enough. Real data often contains strings, time series, and other types in addition to numbers.
For example, data that we obtain by crawling and store in a database.
In the earlier YouTube example, besides the numeric values there was also country information, video category (tag) information, title information, and so on.
So numpy handles numeric data, while pandas, which is built on numpy, also handles these other types of data.
Common pandas data types
- Series: one-dimensional, a labeled array
- DataFrame: two-dimensional, a container of Series
pandas Series Creation
import pandas as pd

t1 = pd.Series([1,2,31,12,3,4])
print(t1)
print(type(t1))

# The index can be set with the index parameter
t2 = pd.Series([1,2,31,12,3,4],index=list('abcdef'))
print(t2)
Run result:
0     1
1     2
2    31
3    12
4     3
5     4
dtype: int64
<class 'pandas.core.series.Series'>
a     1
b     2
c    31
d    12
e     3
f     4
dtype: int64
You can also create a Series from a dictionary; the dictionary keys become the index:
temp_dict = {"name":"Heiko","age":24,"tel":10086}
t3 = pd.Series(temp_dict)
print(t3)
Run result:
name    Heiko
age        24
tel     10086
dtype: object
Modify dtype:
t2f = t2.astype(float)
print(t2f)
Run result:
a     1.0
b     2.0
c    31.0
d    12.0
e     3.0
f     4.0
dtype: float64
Series slices and indexes for pandas
- Slicing: pass in start, end, and optionally step;
- Indexing: pass a single position or label to get one value, or a list of positions or labels to get several values.
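A short sketch of both forms on the Series t2 from earlier (rebuilt here as t so the block is self-contained). Using .iloc for positions is a choice made in this sketch, so that positions and labels cannot be confused.

```python
import pandas as pd

t = pd.Series([1, 2, 31, 12, 3, 4], index=list("abcdef"))

# Slicing by position: start, end (exclusive), step
by_pos_slice = t.iloc[1:5:2]      # positions 1 and 3 -> labels b, d
# Slicing by label: the end label is included
by_label_slice = t["b":"d"]       # labels b, c, d

# Indexing: one label, a list of labels, a list of positions
one_value = t["c"]                # 31
by_labels = t[["a", "f"]]         # labels a and f
by_positions = t.iloc[[0, 2]]     # positions 0 and 2

print(by_pos_slice)
print(by_label_slice)
print(one_value)
print(by_labels)
print(by_positions)
```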
Index and Value of Series of pandas
import pandas as pd

a = pd.Series(range(5))
print(a.where(a>0))     # keep values greater than 0; the rest become NaN
print(a.mask(a>0))      # replace values greater than 0 with NaN
print(a.where(a>0,10))  # keep values greater than 0; the rest become 10
Run result:
0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
0    10
1     1
2     2
3     3
4     4
dtype: int64
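The heading of this part refers to the index and the values of a Series. As a minimal sketch, the two can be read from the .index and .values attributes; the Series below repeats t2 from the creation example.

```python
import pandas as pd

t = pd.Series([1, 2, 31, 12, 3, 4], index=list("abcdef"))

print(t.index)         # the labels, as a pandas Index object
print(t.values)        # the data, as a NumPy array
print(type(t.index))
print(type(t.values))

# Both can be iterated or converted to plain lists
labels = list(t.index)
values = list(t.values)
print(labels)
print(values)
```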
pandas Read External Data
Suppose we have a set of statistics about dog names. How should we inspect this data?
The data is stored in a CSV file, so we can read it directly with pd.read_csv.
The result is a little different from what we might expect: it is not a Series but a DataFrame, so let's look at that data type next.
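A minimal sketch of reading CSV data. To keep it self-contained, the CSV text is held in a string and wrapped in io.StringIO as a stand-in for a file on disk; the three rows are taken from the dog name output shown later in this article.

```python
import io
import pandas as pd

# Stand-in for a CSV file; with a real file, pass its path to read_csv.
csv_text = "Row_Labels,Count_AnimalName\nBELLA,1195\nMAX,1153\nCHARLIE,856\n"

df = pd.read_csv(io.StringIO(csv_text))
print(type(df))  # a DataFrame, not a Series
print(df)
```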
However, one more question remains:
How do we read data from databases such as MySQL or MongoDB?
pd.read_sql(sql_sentence,connection)
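A sketch of pd.read_sql in use. It assumes an in-memory SQLite database from the standard library as a stand-in, so that the block runs without a database server; the table name dogs and its columns are hypothetical. For MySQL the connection object would come from a MySQL driver or SQLAlchemy instead.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database used only for illustration
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE dogs (name TEXT, uses INTEGER)")
connection.executemany(
    "INSERT INTO dogs VALUES (?, ?)",
    [("MAX", 1153), ("BELLA", 1195)],
)
connection.commit()

# The query result comes back as a DataFrame
sql_sentence = "SELECT name, uses FROM dogs ORDER BY uses DESC"
df = pd.read_sql(sql_sentence, connection)
print(df)

connection.close()
```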
So what about mongodb?
DataFrame for pandas
A DataFrame object has both a row index and a column index:
Row index: identifies the different rows; called index; axis 0 (axis=0)
Column index: identifies the different columns; called columns; axis 1 (axis=1)
- A DataFrame is equivalent to a container of Series
- A DataFrame can be built from a dictionary, or from a list of dictionaries
import pandas as pd

d1 = {"name":["Heiko","Lee"],"age":[24,30],"tel":[10086,10010]}
t1 = pd.DataFrame(d1)
print(t1)
print(type(t1))

# A list of dictionaries also works; missing keys become NaN
d2 = [{"name":"Heiko","age":24,"tel":10086},{"name":"Lee","tel":10086},{"name":"Lyle","age":27}]
t2 = pd.DataFrame(d2)
print(d2)
print(t2)
print(type(t2))
Run result:
    name  age    tel
0  Heiko   24  10086
1    Lee   30  10010
<class 'pandas.core.frame.DataFrame'>
[{'name': 'Heiko', 'age': 24, 'tel': 10086}, {'name': 'Lee', 'tel': 10086}, {'name': 'Lyle', 'age': 27}]
    name   age      tel
0  Heiko  24.0  10086.0
1    Lee   NaN  10086.0
2   Lyle  27.0      NaN
DataFrame Basic Properties and Situation Query
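A short sketch of the attributes and overview methods that are commonly used to inspect a DataFrame. The small frame below is made up from the names used in the previous example.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Heiko", "Lee", "Lyle"], "age": [24, 30, 27]})

# Basic attributes
print(df.shape)    # (rows, columns)
print(df.ndim)     # number of dimensions
print(df.dtypes)   # dtype of each column
print(df.index)    # row index
print(df.columns)  # column index
print(df.values)   # data as a NumPy array

# Overview methods
print(df.head(2))     # first 2 rows (default is 5)
print(df.tail(1))     # last row
df.info()             # row count, column dtypes, non-null counts, memory
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns
```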
Exercise: Using the dog name statistics we just read, find the most frequently used names.
import pandas as pd

df = pd.read_csv("./dogNames2.csv")

# Sorting in a DataFrame:
# sort by Count_AnimalName in descending order
df = df.sort_values(by="Count_AnimalName",ascending=False)
print(df.head())  # show the first 5 rows
Run result:
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
Row or column of pandas
Notes on row or column selection in pandas
- A slice inside the square brackets selects rows
- A string inside the square brackets is a column label and selects that column
import pandas as pd

df = pd.read_csv("./dogNames2.csv")
# The output below comes from the frame sorted as in the previous exercise
df = df.sort_values(by="Count_AnimalName",ascending=False)
print(df.head())

print(df[:10])                   # a slice selects rows: the first 10
print(df["Row_Labels"])          # a string selects a column
print(type(df["Row_Labels"]))    # only one column is taken, so it is a Series
Run result:
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
8417        LOLA               795
8552       LUCKY               723
8560        LUCY               710
2032       BUDDY               677
3641       DAISY               649
1156       BELLA
9140         MAX
2660     CHARLIE
3251        COCO
12368      ROCKY
          ...
6884        J-LO
6888       JOANN
6890        JOAO
6891     JOAQUIN
16219      39743
Name: Row_Labels, Length: 16220, dtype: object
<class 'pandas.core.series.Series'>
loc and iloc of pandas
pandas also provides selection methods that are optimized for this:
- df.loc selects rows (and columns) by label
- df.iloc selects rows (and columns) by integer position
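A small sketch contrasting the two. The frame, its row labels a to c, and its column labels W to Z are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("abc"), columns=list("WXYZ"))

# loc: select by label; label slices include the end label
cell = df.loc["a", "Z"]                    # row a, column Z
by_label = df.loc["a":"b", ["W", "Z"]]     # rows a and b, columns W and Z

# iloc: select by position; position slices exclude the end
second_row = df.iloc[1]                    # row at position 1, as a Series
by_position = df.iloc[0:2, [0, 3]]         # rows 0 and 1, columns 0 and 3

print(cell)
print(by_label)
print(second_row)
print(by_position)
```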
Boolean Index of pandas
Returning to the dog names: how would we find all names that were used more than 800 times?
And how would we find all names that were used more than 700 times and whose name string is longer than 4 characters?
In a DataFrame, multiple conditions are combined with the element-wise operators & (and) and | (or), and each condition must be wrapped in parentheses.
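Both questions can be sketched as follows. The frame holds five rows copied from the dog name output above rather than the full file, so the results only illustrate the technique.

```python
import pandas as pd

df = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "LUCY"],
    "Count_AnimalName": [1195, 1153, 856, 852, 710],
})

# Names used more than 800 times
over_800 = df[df["Count_AnimalName"] > 800]
print(over_800)

# Names used more than 700 times AND longer than 4 characters.
# Each condition is parenthesised and joined with &.
popular_long = df[(df["Count_AnimalName"] > 700)
                  & (df["Row_Labels"].str.len() > 4)]
print(popular_long)
```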
String method for pandas
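String methods are reached through the .str accessor of a Series. The sketch below shows a few of them on made-up values; it is not a complete list.

```python
import pandas as pd

names = pd.Series(["bella", "Max", "Charlie Brown"])

lengths = names.str.len()              # length of each string
upper = names.str.upper()              # upper-case copy
starts_m = names.str.startswith("M")   # boolean Series
parts = names.str.split(" ").tolist()  # split each string into a list

print(lengths)
print(upper)
print(starts_m)
print(parts)
```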
Processing missing data
Data is usually missing in one of two ways:
Either the value is empty (None and the like), which pandas represents as NaN (the same as np.nan), or a missing value has deliberately been recorded as 0.
Determine if the data is NaN:
- pd.isnull(df)
- pd.notnull(df)
- Approach 1: drop the rows or columns that contain NaN with dropna(axis=0, how='any', inplace=False)
  - how='any': drop a row or column if it contains any NaN;
  - how='all': drop a row or column only if all of its values are NaN;
  - inplace: modify the object in place, so the result does not need to be assigned back.
- Approach 2: fill in the missing data
  - t.fillna(t.mean()): fill with the mean;
  - t.fillna(t.median()): fill with the median;
  - t.fillna(0): fill with 0.
Handling data that was recorded as 0:
t[t==0]=np.nan
Of course, not every 0 needs this treatment. It matters when computing statistics such as the mean: NaN is excluded from the calculation, but 0 is included.
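The steps above can be sketched on a small frame. The values are made up, and column c contains a 0 that stands for a missing value.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                  "b": [4.0, 5.0, np.nan],
                  "c": [0.0, 8.0, 9.0]})

print(pd.isnull(t))  # True where a value is NaN

# Approach 1: drop every row that contains any NaN
dropped = t.dropna(axis=0, how="any")

# Approach 2: fill NaN with the mean of each column
filled = t.fillna(t.mean())

# Treat 0 as missing: work on a copy so t itself is unchanged
t_zero_as_nan = t.copy()
t_zero_as_nan[t_zero_as_nan == 0] = np.nan

# NaN is skipped by mean(), 0 is not
print(t["c"].mean())              # (0 + 8 + 9) / 3
print(t_zero_as_nan["c"].mean())  # (8 + 9) / 2
```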
Statistical methods commonly used in pandas
Suppose we have data on the most popular movies from 2006 to 2016 and want to know the average rating, the number of directors, and similar figures. How do we get this information?
import pandas as pd

file_path = "IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
print(df.head(1))

# Get the average rating
print(df["Rating"].mean())

# Number of distinct directors
# print(len(set(df["Director"].tolist())))
print(len(df["Director"].unique()))

# Number of distinct actors
temp_actors_list = df["Actors"].str.split(", ").tolist()  # a list of lists
actors_list = [i for j in temp_actors_list for i in j]    # flatten into one list
actors_num = len(set(actors_list))
print(actors_num)
unique(): Series.unique() returns the distinct values of a column as an array, in order of first appearance; it does not sort them. (np.unique, by contrast, returns the distinct values sorted in ascending order.)
set(): The built-in set() function creates an unordered collection of distinct elements. It can be used for membership tests, removing duplicates, and computing intersections, differences, unions, and so on.
Run result:
   Rank                    Title  ...  Revenue (Millions)  Metascore
0     1  Guardians of the Galaxy  ...              333.13       76.0

[1 rows x 12 columns]
6.723200000000003
644
2015
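A small sketch of the counting techniques used in the code above, on made-up values instead of the movie file. The names directors and actors are placeholders.

```python
import pandas as pd

directors = pd.Series(["C", "A", "B", "A", "C"])

# unique() keeps the order of first appearance
distinct = directors.unique()
print(distinct)
print(len(distinct))                 # number of distinct values
print(directors.nunique())           # same count in one call
print(len(set(directors.tolist())))  # same count through a set

# Each cell holds several names separated by ", "
actors = pd.Series(["x, y", "y, z"])
nested = actors.str.split(", ").tolist()        # list of lists
flat = [name for row in nested for name in row]  # one flat list
actors_num = len(set(flat))
print(actors_num)
```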
Exercise: For this movie data, how should we present the distributions of rating and runtime?
Runtime (Minutes):
import pandas as pd
from matplotlib import pyplot as plt

file_path = "./IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
# print(df.head(1))
# print(df.info())

# Distribution of rating and runtime
# Chart type: histogram
# Prepare the data
runtime_data = df["Runtime (Minutes)"].values
print(runtime_data)
max_runtime = runtime_data.max()
min_runtime = runtime_data.min()

# Compute the number of bins, using a bin width of 5
print(max_runtime-min_runtime)
num_bin = (max_runtime-min_runtime)//5

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin)
plt.xticks(range(min_runtime,max_runtime+5,5))
plt.grid()
plt.show()
Run result:
125
Rating:
import numpy as np
from matplotlib import pyplot as plt

# Note: despite its name, runtime_data holds the Rating values here
runtime_data = np.array([8.1, 7.0, 7.3, 7.2, 6.2, 6.1, 8.3, 6.4, 7.1, 7.0, 7.5, 7.8, 7.9, 7.7, 6.4, 6.6, 8.2, 6.7, 8.1, 8.0, 6.7, 7.9, 6.7, 6.5, 5.3, 6.8, 8.3, 4.7, 6.2, 5.9, 6.3, 7.5, 7.1, 8.0, 5.6, 7.9, 8.6, 7.6, 6.9, 7.1, 6.3, 7.5, 2.7, 7.2, 6.3, 6.7, 7.3, 5.6, 7.1, 3.7, 8.1, 5.8, 5.6, 7.2, 9.0, 7.3, 7.2, 7.4, 7.0, 7.5, 6.7, 6.8, 6.5, 4.1, 8.5, 7.7, 7.4, 8.1, 7.5, 7.2, 5.9, 7.1, 7.5, 6.8, 8.1, 7.1, 8.1, 8.3, 7.3, 5.3, 8.8, 7.9, 8.2, 8.1, 7.2, 7.0, 6.4, 7.8, 7.8, 7.4, 8.1, 7.0, 8.1, 7.1, 7.4, 7.4, 8.6, 5.8, 6.3, 8.5, 7.0, 7.0, 8.0, 7.9, 7.3, 7.7, 5.4, 6.3, 5.8, 7.7, 6.3, 8.1, 6.1, 7.7, 8.1, 5.8, 6.2, 8.8, 7.2, 7.4, 6.7, 6.7, 6.0, 7.4, 8.5, 7.5, 5.7, 6.6, 6.4, 8.0, 7.3, 6.0, 6.4, 8.5, 7.1, 7.3, 8.1, 7.3, 8.1, 7.1, 8.0, 6.2, 7.8, 8.2, 8.4, 8.1, 7.4, 7.6, 7.6, 6.2, 6.4, 7.2, 5.8, 7.6, 8.1, 4.7, 7.0, 7.4, 7.5, 7.9, 6.0, 7.0, 8.0, 6.1, 8.0, 5.2, 6.5, 7.3, 7.3, 6.8, 7.9, 7.9, 5.2, 8.0, 7.5, 6.5, 7.6, 7.0, 7.4, 7.3, 6.7, 6.8, 7.0, 5.9, 8.0, 6.0, 6.3, 6.6, 7.8, 6.3, 7.2, 5.6, 8.1, 5.8, 8.2, 6.9, 6.3, 8.1, 8.1, 6.3, 7.9, 6.5, 7.3, 7.9, 5.7, 7.8, 7.5, 7.5, 6.8, 6.7, 6.1, 5.3, 7.1, 5.8, 7.0, 5.5, 7.8, 5.7, 6.1, 7.7, 6.7, 7.1, 6.9, 7.8, 7.0, 7.0, 7.1, 6.4, 7.0, 4.8, 8.2, 5.2, 7.8, 7.4, 6.1, 8.0, 6.8, 3.9, 8.1, 5.9, 7.6, 8.2, 5.8, 6.5, 5.9, 7.6, 7.9, 7.4, 7.1, 8.6, 4.9, 7.3, 7.9, 6.7, 7.5, 7.8, 5.8, 7.6, 6.4, 7.1, 7.8, 8.0, 6.2, 7.0, 6.0, 4.9, 6.0, 7.5, 6.7, 3.7, 7.8, 7.9, 7.2, 8.0, 6.8, 7.0, 7.1, 7.7, 7.0, 7.2, 7.3, 7.6, 7.1, 7.0, 6.0, 6.1, 5.8, 5.3, 5.8, 6.1, 7.5, 7.2, 5.7, 7.7, 7.1, 6.6, 5.7, 6.8, 7.1, 8.1, 7.2, 7.5, 7.0, 5.5, 6.4, 6.7, 6.2, 5.5, 6.0, 6.1, 7.7, 7.8, 6.8, 7.4, 7.5, 7.0, 5.2, 5.3, 6.2, 7.3, 6.5, 6.4, 7.3, 6.7, 7.7, 6.0, 6.0, 7.4, 7.0, 5.4, 6.9, 7.3, 8.0, 7.4, 8.1, 6.1, 7.8, 5.9, 7.8, 6.5, 6.6, 7.4, 6.4, 6.8, 6.2, 5.8, 7.7, 7.3, 5.1, 7.7, 7.3, 6.6, 7.1, 6.7, 6.3, 5.5, 7.4, 7.7, 6.6, 7.8, 6.9, 5.7, 7.8, 7.7, 6.3, 8.0, 5.5, 6.9, 7.0, 5.7, 6.0, 6.8, 6.3, 6.7, 6.9, 5.7, 6.9, 7.6, 7.1, 6.1, 7.6, 7.4, 6.6, 
7.6, 7.8, 7.1, 5.6, 6.7, 6.7, 6.6, 6.3, 5.8, 7.2, 5.0, 5.4, 7.2, 6.8, 5.5, 6.0, 6.1, 6.4, 3.9, 7.1, 7.7, 6.7, 6.7, 7.4, 7.8, 6.6, 6.1, 7.8, 6.5, 7.3, 7.2, 5.6, 5.4, 6.9, 7.8, 7.7, 7.2, 6.8, 5.7, 5.8, 6.2, 5.9, 7.8, 6.5, 8.1, 5.2, 6.0, 8.4, 4.7, 7.0, 7.4, 6.4, 7.1, 7.1, 7.6, 6.6, 5.6, 6.3, 7.5, 7.7, 7.4, 6.0, 6.6, 7.1, 7.9, 7.8, 5.9, 7.0, 7.0, 6.8, 6.5, 6.1, 8.3, 6.7, 6.0, 6.4, 7.3, 7.6, 6.0, 6.6, 7.5, 6.3, 7.5, 6.4, 6.9, 8.0, 6.7, 7.8, 6.4, 5.8, 7.5, 7.7, 7.4, 8.5, 5.7, 8.3, 6.7, 7.2, 6.5, 6.3, 7.7, 6.3, 7.8, 6.7, 6.7, 6.6, 8.0, 6.5, 6.9, 7.0, 5.3, 6.3, 7.2, 6.8, 7.1, 7.4, 8.3, 6.3, 7.2, 6.5, 7.3, 7.9, 5.7, 6.5, 7.7, 4.3, 7.8, 7.8, 7.2, 5.0, 7.1, 5.7, 7.1, 6.0, 6.9, 7.9, 6.2, 7.2, 5.3, 4.7, 6.6, 7.0, 3.9, 6.6, 5.4, 6.4, 6.7, 6.9, 5.4, 7.0, 6.4, 7.2, 6.5, 7.0, 5.7, 7.3, 6.1, 7.2, 7.4, 6.3, 7.1, 5.7, 6.7, 6.8, 6.5, 6.8, 7.9, 5.8, 7.1, 4.3, 6.3, 7.1, 4.6, 7.1, 6.3, 6.9, 6.6, 6.5, 6.5, 6.8, 7.8, 6.1, 5.8, 6.3, 7.5, 6.1, 6.5, 6.0, 7.1, 7.1, 7.8, 6.8, 5.8, 6.8, 6.8, 7.6, 6.3, 4.9, 4.2, 5.1, 5.7, 7.6, 5.2, 7.2, 6.0, 7.3, 7.2, 7.8, 6.2, 7.1, 6.4, 6.1, 7.2, 6.6, 6.2, 7.9, 7.3, 6.7, 6.4, 6.4, 7.2, 5.1, 7.4, 7.2, 6.9, 8.1, 7.0, 6.2, 7.6, 6.7, 7.5, 6.6, 6.3, 4.0, 6.9, 6.3, 7.3, 7.3, 6.4, 6.6, 5.6, 6.0, 6.3, 6.7, 6.0, 6.1, 6.2, 6.7, 6.6, 7.0, 4.9, 8.4, 7.0, 7.5, 7.3, 5.6, 6.7, 8.0, 8.1, 4.8, 7.5, 5.5, 8.2, 6.6, 3.2, 5.3, 5.6, 7.4, 6.4, 6.8, 6.7, 6.4, 7.0, 7.9, 5.9, 7.7, 6.7, 7.0, 6.9, 7.7, 6.6, 7.1, 6.6, 5.7, 6.3, 6.5, 8.0, 6.1, 6.5, 7.6, 5.6, 5.9, 7.2, 6.7, 7.2, 6.5, 7.2, 6.7, 7.5, 6.5, 5.9, 7.7, 8.0, 7.6, 6.1, 8.3, 7.1, 5.4, 7.8, 6.5, 5.5, 7.9, 8.1, 6.1, 7.3, 7.2, 5.5, 6.5, 7.0, 7.1, 6.6, 6.5, 5.8, 7.1, 6.5, 7.4, 6.2, 6.0, 7.6, 7.3, 8.2, 5.8, 6.5, 6.6, 6.2, 5.8, 6.4, 6.7, 7.1, 6.0, 5.1, 6.2, 6.2, 6.6, 7.6, 6.8, 6.7, 6.3, 7.0, 6.9, 6.6, 7.7, 7.5, 5.6, 7.1, 5.7, 5.2, 5.4, 6.6, 8.2, 7.6, 6.2, 6.1, 4.6, 5.7, 6.1, 5.9, 7.2, 6.5, 7.9, 6.3, 5.0, 7.3, 5.2, 6.6, 5.2, 7.8, 7.5, 7.3, 7.3, 6.6, 5.7, 8.2, 6.7, 6.2, 6.3, 5.7, 6.6, 4.5, 8.1, 5.6, 7.3, 6.2, 5.1, 4.7, 4.8, 7.2, 6.9, 6.5, 7.3, 
6.5, 6.9, 7.8, 6.8, 4.6, 6.7, 6.4, 6.0, 6.3, 6.6, 7.8, 6.6, 6.2, 7.3, 7.4, 6.5, 7.0, 4.3, 7.2, 6.2, 6.2, 6.8, 6.0, 6.6, 7.1, 6.8, 5.2, 6.7, 6.2, 7.0, 6.3, 7.8, 7.6, 5.4, 7.6, 5.4, 4.6, 6.9, 6.8, 5.8, 7.0, 5.8, 5.3, 4.6, 5.3, 7.6, 1.9, 7.2, 6.4, 7.4, 5.7, 6.4, 6.3, 7.5, 5.5, 4.2, 7.8, 6.3, 6.4, 7.1, 7.1, 6.8, 7.3, 6.7, 7.8, 6.3, 7.5, 6.8, 7.4, 6.8, 7.1, 7.6, 5.9, 6.6, 7.5, 6.4, 7.8, 7.2, 8.4, 6.2, 7.1, 6.3, 6.5, 6.9, 6.9, 6.6, 6.9, 7.7, 2.7, 5.4, 7.0, 6.6, 7.0, 6.9, 7.3, 5.8, 5.8, 6.9, 7.5, 6.3, 6.9, 6.1, 7.5, 6.8, 6.5, 5.5, 7.7, 3.5, 6.2, 7.1, 5.5, 7.1, 7.1, 7.1, 7.9, 6.5, 5.5, 6.5, 5.6, 6.8, 7.9, 6.2, 6.2, 6.7, 6.9, 6.5, 6.6, 6.4, 4.7, 7.2, 7.2, 6.7, 7.5, 6.6, 6.7, 7.5, 6.1, 6.4, 6.3, 6.4, 6.8, 6.1, 4.9, 7.3, 5.9, 6.1, 7.1, 5.9, 6.8, 5.4, 6.3, 6.2, 6.6, 4.4, 6.8, 7.3, 7.4, 6.1, 4.9, 5.8, 6.1, 6.4, 6.9, 7.2, 5.6, 4.9, 6.1, 7.8, 7.3, 4.3, 7.2, 6.4, 6.2, 5.2, 7.7, 6.2, 7.8, 7.0, 5.9, 6.7, 6.3, 6.9, 7.0, 6.7, 7.3, 3.5, 6.5, 4.8, 6.9, 5.9, 6.2, 7.4, 6.0, 6.2, 5.0, 7.0, 7.6, 7.0, 5.3, 7.4, 6.5, 6.8, 5.6, 5.9, 6.3, 7.1, 7.5, 6.6, 8.5, 6.3, 5.9, 6.7, 6.2, 5.5, 6.2, 5.6, 5.3])

max_runtime = runtime_data.max()
min_runtime = runtime_data.min()
print(min_runtime,max_runtime)

# Use unequal-width bins. hist treats each bin as left-closed, right-open
# (only the last bin also includes its right edge); the first bin is [1.9, 3.5)
num_bin_list = [1.9,3.5]
i=3.5
# Add edges in steps of 0.5; the loop appends one edge past the maximum (9.5)
while i<=max_runtime:
    i += 0.5
    num_bin_list.append(i)
print(num_bin_list)

# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin_list)

# Put the x ticks on the bin edges so the ticks line up with the bars
plt.xticks(num_bin_list)
plt.show()
Run result:
1.9 9.0
[1.9, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
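plt.hist computes its counts with np.histogram, so the bin-edge rule can be checked without drawing anything. The sketch below uses a handful of made-up ratings and a shortened edge list rather than the full data.

```python
import numpy as np

# Made-up values placed exactly on and next to the bin edges
ratings = np.array([1.9, 3.4, 3.5, 3.9, 4.0, 9.0])
bin_edges = [1.9, 3.5, 4.0, 9.0]

counts, edges = np.histogram(ratings, bins=bin_edges)

# [1.9, 3.5) holds 1.9 and 3.4
# [3.5, 4.0) holds 3.5 and 3.9
# [4.0, 9.0] holds 4.0 and 9.0 (the last bin includes its right edge)
print(counts)
print(edges)
```

A value equal to an inner edge, such as 3.5, is counted in the bin that starts at that edge, not in the one that ends there.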