Data Science Library (HM) DAY 4

Array Stitching

When stitching vertically, each column must carry the same meaning in both arrays; otherwise the stacked rows cannot be compared and the result makes no sense.

If the columns of the two arrays are in a different order, swap the columns of one array so that they match the other before stitching.
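A minimal sketch of the rule above, assuming two small made-up arrays (`a`, `b`, and their values are hypothetical and not part of the YouTube data): the columns of `b` are reordered first, then the arrays are stitched vertically with `np.vstack` and horizontally with `np.hstack`.

```python
import numpy as np

# Hypothetical data: both arrays hold views and likes, but in a different column order.
a = np.array([[100, 10],
              [200, 20]])   # columns: views, likes
b = np.array([[5, 500],
              [6, 600]])    # columns: likes, views

# Reorder the columns of b so that they line up with a.
b_fixed = b[:, [1, 0]]

# Vertical stitching: the rows of b_fixed are appended below the rows of a.
stacked_rows = np.vstack((a, b_fixed))

# Horizontal stitching: the arrays are placed side by side.
stacked_cols = np.hstack((a, b_fixed))

print(stacked_rows)
print(stacked_cols)
```

Vertical stitching requires the same number of columns in both arrays, and horizontal stitching requires the same number of rows.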

Array row-column exchange
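One common way to exchange rows or columns is fancy indexing on both sides of an assignment. The sketch below uses a hypothetical 3x4 array `t`; it is an illustration, not necessarily the approach the original course used.

```python
import numpy as np

t = np.arange(12).reshape(3, 4)

# Exchange rows 0 and 2. The right-hand side is copied before assignment,
# so the two rows do not overwrite each other.
t[[0, 2], :] = t[[2, 0], :]

# Exchange columns 1 and 3 in the same way.
t[:, [1, 3]] = t[:, [3, 1]]

print(t)
```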

Exercise: Using the data of the two countries from the previous case, how can we study and analyze both together while preserving the country information (the source country of each row)?

import numpy as np
us_data = "./youtube_video_data/US_video_data_numbers.csv"
uk_data = "./youtube_video_data/GB_video_data_numbers.csv"
# Load the data of each country
us_data = np.loadtxt(us_data,delimiter=",",dtype=int)
uk_data = np.loadtxt(uk_data,delimiter=",",dtype=int)
# Add country information
# Construct a column of all zeros (marks US rows)
zeros_data = np.zeros((us_data.shape[0],1)).astype(int)
# Construct a column of all ones (marks GB rows)
ones_data = np.ones((uk_data.shape[0],1)).astype(int)
# Append the column of zeros or ones to each array
us_data = np.hstack((us_data,zeros_data))
uk_data = np.hstack((uk_data,ones_data))
# Stitch the two sets of data together
final_data = np.vstack((us_data,uk_data))
print(final_data)

Run result:

[[4394029  320053    5931   46245       0]
 [7860119  185853   26679       0       0]
 [5845909  576597   39774  170708       0]
 ...
 [ 109222    4840      35     212       1]
 [ 626223   22962     532    1559       1]
 [  99228    1699      23     135       1]]

numpy More Methods

Get the position of the maximum and minimum:

np.argmax(t,axis=0) # position of the maximum along axis 0

np.argmin(t,axis=1) # position of the minimum along axis 1

Create an array of all zeros:

np.zeros((3,4))

Create an array of all ones:

np.ones((3,4))

Create an identity matrix (a square array with ones on the diagonal):

np.eye(3)
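The calls above can be tried on a small array. In this sketch `t` is a hypothetical 2x3 array chosen only for illustration.

```python
import numpy as np

t = np.array([[3, 9, 1],
              [7, 2, 8]])

# For each column, the row index of the largest value.
col_max_pos = np.argmax(t, axis=0)
# For each row, the column index of the smallest value.
row_min_pos = np.argmin(t, axis=1)

zeros = np.zeros((3, 4))   # 3x4 array of 0.0 (float64 by default)
ones = np.ones((3, 4))     # 3x4 array of 1.0
identity = np.eye(3)       # 3x3 array with ones on the diagonal, zeros elsewhere

print(col_max_pos)  # [1 0 1]
print(row_min_pos)  # [2 1]
```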

numpy generates random numbers

np.random.seed(s) usage:

np.random.seed(0)
np.random.rand(10)
Out[357]:
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152])
np.random.rand(10)
Out[358]:
array([0.79172504, 0.52889492, 0.56804456, 0.92559664, 0.07103606,
       0.0871293 , 0.0202184 , 0.83261985, 0.77815675, 0.87001215])

The second call to np.random.rand(10) does not start again from the seed. It continues the sequence started by np.random.seed(0), so it returns the next ten values instead of repeating the first ten.
To get the same values again, call np.random.seed(0) again before generating:

np.random.seed(0)
np.random.rand(4,3)
Out[362]:
array([[0.5488135 , 0.71518937, 0.60276338],
       [0.54488318, 0.4236548 , 0.64589411],
       [0.43758721, 0.891773  , 0.96366276],
       [0.38344152, 0.79172504, 0.52889492]])
np.random.seed(0)
np.random.rand(4,3)
Out[364]:
array([[0.5488135 , 0.71518937, 0.60276338],
       [0.54488318, 0.4236548 , 0.64589411],
       [0.43758721, 0.891773  , 0.96366276],
       [0.38344152, 0.79172504, 0.52889492]])
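The seeding behaviour can also be checked programmatically. This is a small sketch; the variable names `first`, `second`, and `again` are chosen here for illustration.

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(3)
second = np.random.rand(3)   # continues the same stream, so it differs from first

np.random.seed(0)            # reset the generator to the same starting state
again = np.random.rand(3)    # reproduces first exactly

print(np.array_equal(first, again))   # True
print(np.array_equal(first, second))  # False
```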

pandas Learning

numpy helps us process numeric data, but that is not enough. Very often our data contains strings, time series, and other types in addition to numeric values.

For example, data that we collect by crawling and store in a database.

For example, the earlier YouTube data contains, besides the numeric values, country information, video category (tag) information, title information, and so on.

So numpy helps us work with numeric data, while pandas (which is built on numpy) helps us work with other types of data in addition to numbers.

Common data types for pandas

Series: one-dimensional, a labeled array
DataFrame: two-dimensional, a container of Series

pandas Series Creation

import pandas as pd
t1 = pd.Series([1,2,31,12,3,4])
print(t1)
print(type(t1))
# The index can be set with the index parameter
t2 = pd.Series([1,2,31,12,3,4],index=list('abcdef'))
print(t2)

Run result:

0     1
1     2
2    31
3    12
4     3
5     4
dtype: int64
<class 'pandas.core.series.Series'>
a     1
b     2
c    31
d    12
e     3
f     4
dtype: int64

A Series can also be created from a dictionary.
The dictionary keys become the index:

temp_dict = {"name":"Heiko","age":24,"tel":10086}
t3 = pd.Series(temp_dict)
print(t3)

Run result:

name    Heiko
age        24
tel     10086
dtype: object

Modify dtype:

t2f = t2.astype(float)
print(t2f)

Run result:

a     1.0
b     2.0
c    31.0
d    12.0
e     3.0
f     4.0
dtype: float64

Series slices and indexes for pandas

Slices: pass in the start and end, optionally with a step;
Index: pass in a single position or label to take one element, or a list of positions or labels to take several at once.
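A short sketch of these rules on a hypothetical Series `t`. It uses `.iloc` for positions, an assumption made here because plain integer positions inside `[]` on a labeled Series are deprecated in recent pandas versions.

```python
import pandas as pd

t = pd.Series([1, 2, 31, 12, 3, 4], index=list("abcdef"))

by_position = t.iloc[1:4]   # positions 1 to 3, end position excluded
by_step = t.iloc[::2]       # every second element
by_label = t["b":"d"]       # label slice, end label included
one_value = t["c"]          # a single label returns a scalar
several = t[["a", "f"]]     # a list of labels returns a Series

print(by_position.tolist())  # [2, 31, 12]
print(by_label.tolist())     # [2, 31, 12]
print(one_value)             # 31
```

Note the difference between the two slice styles: a position slice excludes its end, while a label slice includes it.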

Index and Value of Series of pandas

a = pd.Series(range(5))
print(a.where(a>0))     # keep values greater than 0; the others become NaN
print(a.mask(a>0))      # replace values greater than 0 with NaN
print(a.where(a>0,10))  # keep values greater than 0; the others become 10

Run result:

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
0    10
1     1
2     2
3     3
4     4
dtype: int64
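The index and the values of a Series can also be read directly through its `index` and `values` attributes. A minimal sketch on a hypothetical Series:

```python
import pandas as pd

t = pd.Series([1, 2, 31], index=list("abc"))

labels = t.index    # a pandas Index object holding the labels
values = t.values   # a NumPy array holding the data

print(list(labels))   # ['a', 'b', 'c']
print(type(values))   # <class 'numpy.ndarray'>
```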

pandas Read External Data

Suppose we have a set of statistics on dog names. How can we inspect this data?

The data is stored in a csv file, so reading it directly with pd.read_csv is sufficient.

The result is a little different from what we might expect: it is not a Series but a DataFrame, so let's look at that data type next.
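A self-contained sketch of `pd.read_csv`. To avoid depending on a file, it reads from an in-memory string; `csv_text` is a made-up two-row stand-in that only borrows the column names of the dog name file.

```python
import io
import pandas as pd

# Hypothetical stand-in for the contents of a csv file.
csv_text = "Row_Labels,Count_AnimalName\nBELLA,1195\nMAX,1153\n"

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

print(type(df))
print(df)
```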

However, one question remains:
how do we use data stored in databases such as mysql or mongodb?

pd.read_sql(sql_sentence,connection)
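As a runnable illustration of `pd.read_sql`, the sketch below uses an in-memory SQLite database from the Python standard library instead of mysql, which is an assumption made so that no database server is needed. The table `dogs` and its columns are hypothetical.

```python
import sqlite3
import pandas as pd

# Build a small throwaway database in memory.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE dogs (name TEXT, times INTEGER)")
connection.executemany("INSERT INTO dogs VALUES (?, ?)",
                       [("MAX", 1153), ("BELLA", 1195)])
connection.commit()

# The query result is returned as a DataFrame.
sql_sentence = "SELECT name, times FROM dogs ORDER BY times DESC"
df = pd.read_sql(sql_sentence, connection)
connection.close()

print(df)
```

For other databases the call has the same shape, but the connection object comes from that database's own driver or from SQLAlchemy.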

So what about mongodb?

DataFrame for pandas

DataFrame objects have both a row index and a column index.
Row index: distinguishes the rows, called index, axis 0 (axis=0).
Column index: distinguishes the columns, called columns, axis 1 (axis=1).

A DataFrame is equivalent to a container of Series.
A DataFrame can be created from a dictionary, or from a list of dictionaries:

import pandas as pd
d1 = {"name":["Heiko","Lee"],"age":[24,30],"tel":[10086,10010]}
t1 = pd.DataFrame(d1)
print(t1)
print(type(t1))
d2 = [{"name":"Heiko","age":24,"tel":10086},{"name":"Lee","tel":10086},{"name":"Lyle","age":27}]
t2 = pd.DataFrame(d2)
print(d2)
print(t2)
print(type(t2))

Run result:

    name  age    tel
0  Heiko   24  10086
1    Lee   30  10010
<class 'pandas.core.frame.DataFrame'>
[{'name': 'Heiko', 'age': 24, 'tel': 10086}, {'name': 'Lee', 'tel': 10086}, {'name': 'Lyle', 'age': 27}]
    name   age      tel
0  Heiko  24.0  10086.0
1    Lee   NaN  10086.0
2   Lyle  27.0      NaN

DataFrame Basic Properties and Situation Query

Exercise: From the dog name statistics read earlier, find the names that are used most often.

import pandas as pd
df = pd.read_csv("./dogNames2.csv")
# Sorting method of a DataFrame
# Sort by Count_AnimalName in descending order
df = df.sort_values(by="Count_AnimalName",ascending=False)
print(df.head())  # show the first 5 rows

Run result:

      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
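Besides `sort_values` and `head`, a DataFrame exposes several basic attributes and summary methods. The sketch below lists common ones on a hypothetical three-row frame that only borrows the column names of the dog name data.

```python
import pandas as pd

df = pd.DataFrame({"Row_Labels": ["BELLA", "MAX", "CHARLIE"],
                   "Count_AnimalName": [1195, 1153, 856]})

print(df.shape)       # (number of rows, number of columns)
print(df.dtypes)      # data type of each column
print(df.index)       # row index
print(df.columns)     # column index
print(df.values)      # the data as a NumPy array
print(df.head(2))     # first 2 rows (5 by default)
print(df.tail(2))     # last 2 rows (5 by default)
df.info()             # overview: row count, columns, non-null counts, memory usage
print(df.describe())  # count, mean, std, min, quartiles, max of numeric columns
```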

Row or column of pandas

Notes on row or column selection in pandas

Square brackets containing a slice select rows and operate on rows.
Square brackets containing a string select a column by its column index and operate on that column.

import pandas as pd
df = pd.read_csv("./dogNames2.csv")
print(df[:10])
print(df["Row_Labels"])
print(type(df["Row_Labels"]))  # only one column is taken, so the result is a Series

Run result:

      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
8417        LOLA               795
8552       LUCKY               723
8560        LUCY               710
2032       BUDDY               677
3641       DAISY               649
1156       BELLA
9140         MAX
2660     CHARLIE
3251        COCO
12368      ROCKY
          ...
6884        J-LO
6888       JOANN
6890        JOAO
6891     JOAQUIN
16219      39743
Name: Row_Labels, Length: 16220, dtype: object
<class 'pandas.core.series.Series'>

loc and iloc of pandas

pandas also provides optimized ways of selecting data:

df.loc selects data by label
df.iloc selects data by integer position
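A minimal sketch of the difference, assuming a hypothetical 3x4 frame with row labels `a` to `c` and column labels `W` to `Z`:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(3, 4),
                  index=list("abc"), columns=list("WXYZ"))

cell = df.loc["a", "Z"]                  # one value, selected by row and column label
row = df.loc["a"]                        # one row, returned as a Series
block = df.loc[["a", "c"], ["W", "Z"]]   # several rows and columns by label
label_slice = df.loc["a":"b", "W":"Y"]   # label slices include both ends
pos_block = df.iloc[0:2, 1:3]            # position slices exclude the end

print(cell)        # 3
print(block)
print(pos_block)
```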

Boolean Index of pandas

Returning to the dog name data: how can we find all the names that were used more than 800 times?

And how can we find all the names that were used more than 700 times and whose name string is longer than 4 characters?

In a DataFrame, each condition must be wrapped in parentheses, and multiple conditions are joined with the element-wise operators & (and) and | (or).
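Both questions can be answered with a boolean index. The sketch below runs on a hypothetical five-row frame instead of the full csv file; the column names follow the dog name data.

```python
import pandas as pd

df = pd.DataFrame({"Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "LUCY"],
                   "Count_AnimalName": [1195, 1153, 856, 852, 710]})

# Names used more than 800 times.
over_800 = df[df["Count_AnimalName"] > 800]

# Names used more than 700 times AND longer than 4 characters.
# Each condition is wrapped in parentheses and joined with &.
popular_long = df[(df["Count_AnimalName"] > 700) & (df["Row_Labels"].str.len() > 4)]

print(over_800["Row_Labels"].tolist())      # ['BELLA', 'MAX', 'CHARLIE', 'COCO']
print(popular_long["Row_Labels"].tolist())  # ['BELLA', 'CHARLIE']
```

Python's `and` and `or` do not work here because they cannot compare whole columns element by element.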

String method for pandas
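String methods are reached through the `.str` accessor of a Series and are applied to every element. The sketch below shows a few commonly used ones on a hypothetical Series; it is an illustration, not a complete list.

```python
import pandas as pd

names = pd.Series(["Bella", "max", "Charlie Brown"])

upper = names.str.upper()          # upper-case every string
lengths = names.str.len()          # length of every string
has_ll = names.str.contains("ll")  # True where the string contains "ll"
parts = names.str.split(" ")       # split every string into a list

print(upper.tolist())    # ['BELLA', 'MAX', 'CHARLIE BROWN']
print(lengths.tolist())  # [5, 3, 13]
print(has_ll.tolist())   # [True, False, False]
print(parts.tolist())    # [['Bella'], ['max'], ['Charlie', 'Brown']]
```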

Processing missing data

Data is usually missing in one of two ways:
one is an empty value such as None, which pandas represents as NaN (the same as np.nan); the other is a missing value that was deliberately recorded as 0.

Determine if the data is NaN:

pd.isnull(df)
pd.notnull(df)

Approach 1: delete the rows or columns that contain NaN with dropna(axis=0, how='any', inplace=False)

how='any': delete a row or column if it contains any NaN;
how='all': delete a row or column only if all of its values are NaN;
inplace: modify the object in place, which saves reassigning the result to the variable.

Approach 2: fill in the missing data

t.fillna(t.mean()): fill with the mean;
t.fillna(t.median()): fill with the median;
t.fillna(0): fill with 0.

Handling missing data that was recorded as 0:

t[t==0]=np.nan

Of course, not every 0 needs to be handled this way. It matters when computing statistics such as the mean: NaN values are excluded from the calculation, but 0 values are included.
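The approaches above can be compared on a small frame. In this sketch `t` is a hypothetical 3x3 DataFrame with two missing values.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                  "b": [4.0, 5.0, np.nan],
                  "c": [7.0, 8.0, 9.0]})

nan_mask = pd.isnull(t)                  # True where a value is NaN
rows_kept = t.dropna(axis=0, how="any")  # drop every row that contains a NaN
filled_mean = t.fillna(t.mean())         # fill each column with its own mean
filled_zero = t.fillna(0)                # fill with 0

# NaN is skipped by mean(), but 0 is counted.
print(t["a"].mean())            # 2.0
print(filled_zero["a"].mean())  # (1 + 0 + 3) / 3, about 1.33
print(rows_kept.index.tolist()) # [0]
```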

Statistical methods commonly used in pandas

Suppose we have data on the most popular movies from 2006 to 2016. We want to know the average rating, the number of directors, and the number of actors in this data. How do we get them?

import pandas as pd
file_path = "IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
print(df.head(1))
# Get the average rating
print(df["Rating"].mean())
# Number of directors
# print(len(set(df["Director"].tolist())))
print(len(df["Director"].unique()))
# Number of actors
temp_actors_list = df["Actors"].str.split(", ").tolist()  # a list of lists, one per movie
actors_list = [i for j in temp_actors_list for i in j]  # flatten the list of lists into one list
actors_num = len(set(actors_list))
print(actors_num)

unique(): for a Series, unique() removes duplicate elements and returns the unique values as a NumPy array, in order of first appearance (the result is not sorted).

set(): the set() function creates an unordered collection of non-repeating elements. It can be used for membership testing, removing duplicates, and computing intersections, differences, unions, and so on.

Run result:

   Rank                    Title  ...  Revenue (Millions)  Metascore
0     1  Guardians of the Galaxy  ...              333.13       76.0

[1 rows x 12 columns]
6.723200000000003
644
2015

Exercise: For this set of movie data, how should we present the data if we want to show the distribution of rating and runtime?

Runtime (Minutes):

import pandas as pd
from matplotlib import pyplot as plt
file_path = "./IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)
#print(df.head(1))
# print(df.info())
# Distribution of rating and runtime
# Chart type: histogram
# Prepare the data
runtime_data = df["Runtime (Minutes)"].values
print(runtime_data)
max_runtime = runtime_data.max()
min_runtime = runtime_data.min()
# Calculate the number of bins
print(max_runtime-min_runtime)
num_bin = (max_runtime-min_runtime)//5
# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin)
plt.xticks(range(min_runtime,max_runtime+5,5))
plt.grid()
plt.show()

Run result:

125

Rating:

import numpy as np
from matplotlib import pyplot as plt
runtime_data = np.array([8.1, 7.0, 7.3, 7.2, 6.2, 6.1, 8.3, 6.4, 7.1, 7.0, 7.5, 7.8, 7.9, 7.7, 6.4, 6.6, 8.2, 6.7, 8.1, 8.0, 6.7, 7.9, 6.7, 6.5, 5.3, 6.8, 8.3, 4.7, 6.2, 5.9, 6.3, 7.5, 7.1, 8.0, 5.6, 7.9, 8.6, 7.6, 6.9, 7.1, 6.3, 7.5, 2.7, 7.2, 6.3, 6.7, 7.3, 5.6, 7.1, 3.7, 8.1, 5.8, 5.6, 7.2, 9.0, 7.3, 7.2, 7.4, 7.0, 7.5, 6.7, 6.8, 6.5, 4.1, 8.5, 7.7, 7.4, 8.1, 7.5, 7.2, 5.9, 7.1, 7.5, 6.8, 8.1, 7.1, 8.1, 8.3, 7.3, 5.3, 8.8, 7.9, 8.2, 8.1, 7.2, 7.0, 6.4, 7.8, 7.8, 7.4, 8.1, 7.0, 8.1, 7.1, 7.4, 7.4, 8.6, 5.8, 6.3, 8.5, 7.0, 7.0, 8.0, 7.9, 7.3, 7.7, 5.4, 6.3, 5.8, 7.7, 6.3, 8.1, 6.1, 7.7, 8.1, 5.8, 6.2, 8.8, 7.2, 7.4, 6.7, 6.7, 6.0, 7.4, 8.5, 7.5, 5.7, 6.6, 6.4, 8.0, 7.3, 6.0, 6.4, 8.5, 7.1, 7.3, 8.1, 7.3, 8.1, 7.1, 8.0, 6.2, 7.8, 8.2, 8.4, 8.1, 7.4, 7.6, 7.6, 6.2, 6.4, 7.2, 5.8, 7.6, 8.1, 4.7, 7.0, 7.4, 7.5, 7.9, 6.0, 7.0, 8.0, 6.1, 8.0, 5.2, 6.5, 7.3, 7.3, 6.8, 7.9, 7.9, 5.2, 8.0, 7.5, 6.5, 7.6, 7.0, 7.4, 7.3, 6.7, 6.8, 7.0, 5.9, 8.0, 6.0, 6.3, 6.6, 7.8, 6.3, 7.2, 5.6, 8.1, 5.8, 8.2, 6.9, 6.3, 8.1, 8.1, 6.3, 7.9, 6.5, 7.3, 7.9, 5.7, 7.8, 7.5, 7.5, 6.8, 6.7, 6.1, 5.3, 7.1, 5.8, 7.0, 5.5, 7.8, 5.7, 6.1, 7.7, 6.7, 7.1, 6.9, 7.8, 7.0, 7.0, 7.1, 6.4, 7.0, 4.8, 8.2, 5.2, 7.8, 7.4, 6.1, 8.0, 6.8, 3.9, 8.1, 5.9, 7.6, 8.2, 5.8, 6.5, 5.9, 7.6, 7.9, 7.4, 7.1, 8.6, 4.9, 7.3, 7.9, 6.7, 7.5, 7.8, 5.8, 7.6, 6.4, 7.1, 7.8, 8.0, 6.2, 7.0, 6.0, 4.9, 6.0, 7.5, 6.7, 3.7, 7.8, 7.9, 7.2, 8.0, 6.8, 7.0, 7.1, 7.7, 7.0, 7.2, 7.3, 7.6, 7.1, 7.0, 6.0, 6.1, 5.8, 5.3, 5.8, 6.1, 7.5, 7.2, 5.7, 7.7, 7.1, 6.6, 5.7, 6.8, 7.1, 8.1, 7.2, 7.5, 7.0, 5.5, 6.4, 6.7, 6.2, 5.5, 6.0, 6.1, 7.7, 7.8, 6.8, 7.4, 7.5, 7.0, 5.2, 5.3, 6.2, 7.3, 6.5, 6.4, 7.3, 6.7, 7.7, 6.0, 6.0, 7.4, 7.0, 5.4, 6.9, 7.3, 8.0, 7.4, 8.1, 6.1, 7.8, 5.9, 7.8, 6.5, 6.6, 7.4, 6.4, 6.8, 6.2, 5.8, 7.7, 7.3, 5.1, 7.7, 7.3, 6.6, 7.1, 6.7, 6.3, 5.5, 7.4, 7.7, 6.6, 7.8, 6.9, 5.7, 7.8, 7.7, 6.3, 8.0, 5.5, 6.9, 7.0, 5.7, 6.0, 6.8, 6.3, 6.7, 6.9, 5.7, 6.9, 7.6, 7.1, 6.1, 7.6, 7.4, 6.6, 
7.6, 7.8, 7.1, 5.6, 6.7, 6.7, 6.6, 6.3, 5.8, 7.2, 5.0, 5.4, 7.2, 6.8, 5.5, 6.0, 6.1, 6.4, 3.9, 7.1, 7.7, 6.7, 6.7, 7.4, 7.8, 6.6, 6.1, 7.8, 6.5, 7.3, 7.2, 5.6, 5.4, 6.9, 7.8, 7.7, 7.2, 6.8, 5.7, 5.8, 6.2, 5.9, 7.8, 6.5, 8.1, 5.2, 6.0, 8.4, 4.7, 7.0, 7.4, 6.4, 7.1, 7.1, 7.6, 6.6, 5.6, 6.3, 7.5, 7.7, 7.4, 6.0, 6.6, 7.1, 7.9, 7.8, 5.9, 7.0, 7.0, 6.8, 6.5, 6.1, 8.3, 6.7, 6.0, 6.4, 7.3, 7.6, 6.0, 6.6, 7.5, 6.3, 7.5, 6.4, 6.9, 8.0, 6.7, 7.8, 6.4, 5.8, 7.5, 7.7, 7.4, 8.5, 5.7, 8.3, 6.7, 7.2, 6.5, 6.3, 7.7, 6.3, 7.8, 6.7, 6.7, 6.6, 8.0, 6.5, 6.9, 7.0, 5.3, 6.3, 7.2, 6.8, 7.1, 7.4, 8.3, 6.3, 7.2, 6.5, 7.3, 7.9, 5.7, 6.5, 7.7, 4.3, 7.8, 7.8, 7.2, 5.0, 7.1, 5.7, 7.1, 6.0, 6.9, 7.9, 6.2, 7.2, 5.3, 4.7, 6.6, 7.0, 3.9, 6.6, 5.4, 6.4, 6.7, 6.9, 5.4, 7.0, 6.4, 7.2, 6.5, 7.0, 5.7, 7.3, 6.1, 7.2, 7.4, 6.3, 7.1, 5.7, 6.7, 6.8, 6.5, 6.8, 7.9, 5.8, 7.1, 4.3, 6.3, 7.1, 4.6, 7.1, 6.3, 6.9, 6.6, 6.5, 6.5, 6.8, 7.8, 6.1, 5.8, 6.3, 7.5, 6.1, 6.5, 6.0, 7.1, 7.1, 7.8, 6.8, 5.8, 6.8, 6.8, 7.6, 6.3, 4.9, 4.2, 5.1, 5.7, 7.6, 5.2, 7.2, 6.0, 7.3, 7.2, 7.8, 6.2, 7.1, 6.4, 6.1, 7.2, 6.6, 6.2, 7.9, 7.3, 6.7, 6.4, 6.4, 7.2, 5.1, 7.4, 7.2, 6.9, 8.1, 7.0, 6.2, 7.6, 6.7, 7.5, 6.6, 6.3, 4.0, 6.9, 6.3, 7.3, 7.3, 6.4, 6.6, 5.6, 6.0, 6.3, 6.7, 6.0, 6.1, 6.2, 6.7, 6.6, 7.0, 4.9, 8.4, 7.0, 7.5, 7.3, 5.6, 6.7, 8.0, 8.1, 4.8, 7.5, 5.5, 8.2, 6.6, 3.2, 5.3, 5.6, 7.4, 6.4, 6.8, 6.7, 6.4, 7.0, 7.9, 5.9, 7.7, 6.7, 7.0, 6.9, 7.7, 6.6, 7.1, 6.6, 5.7, 6.3, 6.5, 8.0, 6.1, 6.5, 7.6, 5.6, 5.9, 7.2, 6.7, 7.2, 6.5, 7.2, 6.7, 7.5, 6.5, 5.9, 7.7, 8.0, 7.6, 6.1, 8.3, 7.1, 5.4, 7.8, 6.5, 5.5, 7.9, 8.1, 6.1, 7.3, 7.2, 5.5, 6.5, 7.0, 7.1, 6.6, 6.5, 5.8, 7.1, 6.5, 7.4, 6.2, 6.0, 7.6, 7.3, 8.2, 5.8, 6.5, 6.6, 6.2, 5.8, 6.4, 6.7, 7.1, 6.0, 5.1, 6.2, 6.2, 6.6, 7.6, 6.8, 6.7, 6.3, 7.0, 6.9, 6.6, 7.7, 7.5, 5.6, 7.1, 5.7, 5.2, 5.4, 6.6, 8.2, 7.6, 6.2, 6.1, 4.6, 5.7, 6.1, 5.9, 7.2, 6.5, 7.9, 6.3, 5.0, 7.3, 5.2, 6.6, 5.2, 7.8, 7.5, 7.3, 7.3, 6.6, 5.7, 8.2, 6.7, 6.2, 6.3, 5.7, 6.6, 4.5, 8.1, 5.6, 7.3, 6.2, 5.1, 4.7, 4.8, 7.2, 6.9, 6.5, 7.3, 
6.5, 6.9, 7.8, 6.8, 4.6, 6.7, 6.4, 6.0, 6.3, 6.6, 7.8, 6.6, 6.2, 7.3, 7.4, 6.5, 7.0, 4.3, 7.2, 6.2, 6.2, 6.8, 6.0, 6.6, 7.1, 6.8, 5.2, 6.7, 6.2, 7.0, 6.3, 7.8, 7.6, 5.4, 7.6, 5.4, 4.6, 6.9, 6.8, 5.8, 7.0, 5.8, 5.3, 4.6, 5.3, 7.6, 1.9, 7.2, 6.4, 7.4, 5.7, 6.4, 6.3, 7.5, 5.5, 4.2, 7.8, 6.3, 6.4, 7.1, 7.1, 6.8, 7.3, 6.7, 7.8, 6.3, 7.5, 6.8, 7.4, 6.8, 7.1, 7.6, 5.9, 6.6, 7.5, 6.4, 7.8, 7.2, 8.4, 6.2, 7.1, 6.3, 6.5, 6.9, 6.9, 6.6, 6.9, 7.7, 2.7, 5.4, 7.0, 6.6, 7.0, 6.9, 7.3, 5.8, 5.8, 6.9, 7.5, 6.3, 6.9, 6.1, 7.5, 6.8, 6.5, 5.5, 7.7, 3.5, 6.2, 7.1, 5.5, 7.1, 7.1, 7.1, 7.9, 6.5, 5.5, 6.5, 5.6, 6.8, 7.9, 6.2, 6.2, 6.7, 6.9, 6.5, 6.6, 6.4, 4.7, 7.2, 7.2, 6.7, 7.5, 6.6, 6.7, 7.5, 6.1, 6.4, 6.3, 6.4, 6.8, 6.1, 4.9, 7.3, 5.9, 6.1, 7.1, 5.9, 6.8, 5.4, 6.3, 6.2, 6.6, 4.4, 6.8, 7.3, 7.4, 6.1, 4.9, 5.8, 6.1, 6.4, 6.9, 7.2, 5.6, 4.9, 6.1, 7.8, 7.3, 4.3, 7.2, 6.4, 6.2, 5.2, 7.7, 6.2, 7.8, 7.0, 5.9, 6.7, 6.3, 6.9, 7.0, 6.7, 7.3, 3.5, 6.5, 4.8, 6.9, 5.9, 6.2, 7.4, 6.0, 6.2, 5.0, 7.0, 7.6, 7.0, 5.3, 7.4, 6.5, 6.8, 5.6, 5.9, 6.3, 7.1, 7.5, 6.6, 8.5, 6.3, 5.9, 6.7, 6.2, 5.5, 6.2, 5.6, 5.3])
max_runtime = runtime_data.max()
min_runtime = runtime_data.min()
print(min_runtime,max_runtime)
# Use unequal-width bins; hist treats each bin as left-closed and right-open, for example [1.9, 3.5)
num_bin_list = [1.9,3.5]
i=3.5
while i<=max_runtime:
    i += 0.5
    num_bin_list.append(i)
print(num_bin_list)
# Set the figure size
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin_list)
# Use the bin edges as xticks so that the ticks line up with the bins
plt.xticks(num_bin_list)
plt.show()

Run result:

1.9 9.0
[1.9, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
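The left-closed, right-open rule for the bins can be checked without drawing anything by using `np.histogram`, which `plt.hist` uses internally to do the binning. The values and bin edges below are a hypothetical sample chosen so that some values fall exactly on an edge.

```python
import numpy as np

data = np.array([1.9, 3.5, 3.6, 9.0])
edges = [1.9, 3.5, 4.0, 9.0]

# counts[k] is the number of values in the bin that starts at edges[k].
counts, returned_edges = np.histogram(data, bins=edges)

print(counts)  # [1 2 1]
```

The value 3.5 is counted in the second bin, not the first, because the right edge of a bin is excluded. The last bin is the exception: it also includes its right edge, which is why 9.0 is still counted.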


21 October 2021, 09:26