Data Science Library (HM) DAY 4

Array Stitching

When stitching (stacking) arrays vertically, each column must mean the same thing in both arrays; otherwise the combined data makes no sense.

If the columns of the two arrays are in different orders, swap the columns of one array first so that they line up with the other.

Array row-column exchange
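
As a minimal sketch of swapping rows and columns before stitching, the snippet below uses two small made-up arrays (the values and the column meanings in the comments are assumptions for illustration only). Indexing with a list of positions on both sides of the assignment exchanges the selected columns or rows.

```python
import numpy as np

# Hypothetical 2x3 arrays; imagine the columns of a are (views, likes, comments)
a = np.arange(6).reshape(2, 3)
# In b the first two columns arrive in the opposite order: (likes, views, comments)
b = np.array([[10, 100, 1],
              [20, 200, 2]])

# Swap columns 0 and 1 of b so that its layout matches a
b[:, [0, 1]] = b[:, [1, 0]]

# Rows are swapped the same way, on the first axis
a[[0, 1], :] = a[[1, 0], :]

# Now the columns agree, so vertical stitching is meaningful
stacked = np.vstack((a, b))
print(stacked)
```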

Exercise: In the previous case, how can we combine the data from the two countries for analysis while preserving the country information (which country each row came from)?

import numpy as np

us_data = "./youtube_video_data/US_video_data_numbers.csv"
uk_data = "./youtube_video_data/GB_video_data_numbers.csv"

#Load country data
us_data = np.loadtxt(us_data,delimiter=",",dtype=int)
uk_data = np.loadtxt(uk_data,delimiter=",",dtype=int)

# Add country information
# Build a column of zeros to mark the US rows
zeros_data = np.zeros((us_data.shape[0],1)).astype(int)
# Build a column of ones to mark the GB rows
ones_data = np.ones((uk_data.shape[0],1)).astype(int)

# Append the marker column to each data set
us_data = np.hstack((us_data,zeros_data))
uk_data = np.hstack((uk_data,ones_data))


# Stitching two sets of data
final_data = np.vstack((us_data,uk_data))
print(final_data)

Run result:

[[4394029  320053    5931   46245       0]
 [7860119  185853   26679       0       0]
 [5845909  576597   39774  170708       0]
 ...
 [ 109222    4840      35     212       1]
 [ 626223   22962     532    1559       1]
 [  99228    1699      23     135       1]]

numpy More Methods

  • Get the position of the maximum and minimum

    np.argmax(t,axis=0) #0 axis

    np.argmin(t,axis=1) #1 axis

  • Create an array of zeros:

    np.zeros((3,4))

  • Create an array of ones:

    np.ones((3,4))

  • Create a square array with ones on the diagonal (an identity matrix):

    np.eye(3)
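
The methods listed above can be tried on a small array. This is only a sketch; the array t and its values are made up for illustration.

```python
import numpy as np

# Hypothetical 2x3 array
t = np.array([[3, 9, 1],
              [7, 2, 8]])

# Along axis 0: for each column, the row index of its largest value
col_max_pos = np.argmax(t, axis=0)
# Along axis 1: for each row, the column index of its smallest value
row_min_pos = np.argmin(t, axis=1)
print(col_max_pos, row_min_pos)

zeros = np.zeros((3, 4))            # float64 by default
ones = np.ones((3, 4), dtype=int)   # dtype can be given directly
identity = np.eye(3)                # ones on the diagonal, zeros elsewhere
print(np.argmax(identity, axis=0))  # the diagonal positions
```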

numpy generates random numbers

np.random.seed(s) usage:

np.random.seed(0)
np.random.rand(10)
Out[357]: 
array([0.5488135 , 0.71518937, 0.60276338, 0.54488318, 0.4236548 ,
       0.64589411, 0.43758721, 0.891773  , 0.96366276, 0.38344152])
np.random.rand(10)
Out[358]: 
array([0.79172504, 0.52889492, 0.56804456, 0.92559664, 0.07103606,
       0.0871293 , 0.0202184 , 0.83261985, 0.77815675, 0.87001215])

The second call to np.random.rand(10) does not restart from np.random.seed(0); it continues the same random sequence from where the first call stopped, so it returns different values.
To get the same values again, call np.random.seed(0) again before generating:

np.random.seed(0)
np.random.rand(4,3)
Out[362]: 
array([[0.5488135 , 0.71518937, 0.60276338],
       [0.54488318, 0.4236548 , 0.64589411],
       [0.43758721, 0.891773  , 0.96366276],
       [0.38344152, 0.79172504, 0.52889492]])
np.random.seed(0)
np.random.rand(4,3)
Out[364]: 
array([[0.5488135 , 0.71518937, 0.60276338],
       [0.54488318, 0.4236548 , 0.64589411],
       [0.43758721, 0.891773  , 0.96366276],
       [0.38344152, 0.79172504, 0.52889492]])
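
The behaviour shown above can be checked in a short script. This sketch uses the same legacy np.random.seed / np.random.rand functions as the text; the variable names are only illustrative.

```python
import numpy as np

np.random.seed(0)
first = np.random.rand(10)
second = np.random.rand(10)   # continues the same stream, so the values differ

np.random.seed(0)
both = np.random.rand(20)     # re-seeded: replays the stream from the start

print(np.array_equal(both[:10], first))
print(np.array_equal(both[10:], second))
```

Both comparisons are expected to be True: re-seeding replays the whole sequence, and the second batch of ten is simply the next ten values of that sequence.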

pandas Learning

numpy helps us process numeric data, but that is not enough. Very often our data contains strings, time series, and other types in addition to numbers.

For example, data collected by a crawler and stored in a database.

For example, the earlier YouTube data contains not only numeric values but also country information, video category (tag) information, titles, and so on.

So numpy handles numeric data, while pandas (which is built on numpy) also handles the other data types.

Common data types for pandas

  1. Series: one-dimensional, a labeled array
  2. DataFrame: two-dimensional, a container of Series

pandas Series Creation

import pandas as pd

t1 = pd.Series([1,2,31,12,3,4])
print(t1)
print(type(t1))

# A custom index can be set with the index argument
t2 = pd.Series([1,2,31,12,3,4],index=list('abcdef'))
print(t2)

Run result:

0     1
1     2
2    31
3    12
4     3
5     4
dtype: int64
<class 'pandas.core.series.Series'>
a     1
b     2
c    31
d    12
e     3
f     4
dtype: int64

You can also create a Series from a dictionary:
The dictionary keys become the index

temp_dict = {"name":"Heiko","age":24,"tel":10086}
t3 = pd.Series(temp_dict)
print(t3)

Run result:

name    Heiko
age        24
tel     10086
dtype: object

Modify dtype:

t2f = t2.astype(float)
print(t2f)

Run result:

a     1.0
b     2.0
c    31.0
d    12.0
e     3.0
f     4.0
dtype: float64

Series slices and indexes for pandas

  • Slices: pass in start, end, and optionally step;
  • Index: pass in a single position or label to get one element, or a list of positions or labels to get several.
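
A minimal sketch of these access patterns, reusing the t2 data from above under the name t. Position-based access is written with .iloc here, which is an addition to the text, because plain integers inside [] are ambiguous when the index holds labels.

```python
import pandas as pd

t = pd.Series([1, 2, 31, 12, 3, 4], index=list("abcdef"))

by_label = t["c"]             # one label -> one value
by_position = t.iloc[2]       # one position -> one value
some_labels = t[["a", "f"]]   # list of labels -> Series
pos_slice = t.iloc[1:5:2]     # start:end:step by position, end excluded
label_slice = t["b":"d"]      # label slices include the end label
filtered = t[t > 3]           # boolean condition

print(by_label, by_position)
print(pos_slice)
print(label_slice)
```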

Index and Value of Series of pandas

a = pd.Series(range(5))
print(a.where(a>0))# Keep values greater than 0; the rest become NaN
print(a.mask(a>0))# Replace values greater than 0 with NaN
print(a.where(a>0,10))# Keep values greater than 0; the rest become 10

Run result:

0    NaN
1    1.0
2    2.0
3    3.0
4    4.0
dtype: float64
0    0.0
1    NaN
2    NaN
3    NaN
4    NaN
dtype: float64
0    10
1     1
2     2
3     3
4     4
dtype: int64
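
The section heading refers to the index and values of a Series, which are available as the index and values attributes. A small sketch, reusing the dictionary from the earlier example:

```python
import numpy as np
import pandas as pd

t3 = pd.Series({"name": "Heiko", "age": 24, "tel": 10086})

print(t3.index)          # the labels, as a pandas Index object
print(t3.values)         # the data
print(type(t3.values))   # the values come back as a NumPy array

# Both can be iterated over or converted to lists
labels = list(t3.index)
for label in t3.index:
    print(label, t3[label])
```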

pandas Read External Data

Suppose we have a set of statistics on dog names. How do we load and inspect this data?

The data is stored in a CSV file, so we can read it directly with pd.read_csv.

The result is a little different from what we might expect: it is not a Series but a DataFrame, so let's look at that data type next.
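
As a self-contained sketch, the snippet below reads CSV text from memory instead of from the dogNames2.csv file. The three rows are copied from the run results shown later in this post, and io.StringIO is only a stand-in for the real file.

```python
import io
import pandas as pd

# Stand-in for the file: same two columns as dogNames2.csv
csv_text = "Row_Labels,Count_AnimalName\nBELLA,1195\nMAX,1153\nCHARLIE,856\n"

df = pd.read_csv(io.StringIO(csv_text))
print(df)
print(type(df))                 # a DataFrame, not a Series
print(type(df["Row_Labels"]))   # a single column is a Series
```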

However, there is one more problem:
How do we use data in databases such as mysql or mongodb?

pd.read_sql(sql_sentence,connection)
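
A minimal sketch of pd.read_sql, assuming an in-memory SQLite database in place of mysql so that it runs without a server. The table name dogs and its columns are made up for illustration.

```python
import sqlite3
import pandas as pd

# Hypothetical in-memory database standing in for a real server
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE dogs (name TEXT, times_used INTEGER)")
connection.executemany(
    "INSERT INTO dogs VALUES (?, ?)",
    [("BELLA", 1195), ("MAX", 1153)],
)
connection.commit()

sql_sentence = "SELECT name, times_used FROM dogs ORDER BY times_used DESC"
df = pd.read_sql(sql_sentence, connection)  # the query result becomes a DataFrame
connection.close()

print(df)
```

For other databases, pandas generally expects a SQLAlchemy connection rather than a raw driver connection.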

So what about mongodb?

DataFrame for pandas

A DataFrame has both a row index and a column index:
Row index: identifies the rows; it is called index and corresponds to axis 0 (axis=0).
Column index: identifies the columns; it is called columns and corresponds to axis 1 (axis=1).

  • A DataFrame is effectively a container of Series
  • A DataFrame can be built from a dictionary (or a list of dictionaries)
import pandas as pd

d1 = {"name":["Heiko","Lee"],"age":[24,30],"tel":[10086,10010]}
t1 = pd.DataFrame(d1)
print(t1)
print(type(t1))

d2 = [{"name":"Heiko","age":24,"tel":10086},{"name":"Lee","tel":10086},{"name":"Lyle","age":27}]
t2 = pd.DataFrame(d2)
print(d2)
print(t2)
print(type(t2))

Run result:

    name  age    tel
0  Heiko   24  10086
1    Lee   30  10010
<class 'pandas.core.frame.DataFrame'>
[{'name': 'Heiko', 'age': 24, 'tel': 10086}, {'name': 'Lee', 'tel': 10086}, {'name': 'Lyle', 'age': 27}]
    name   age      tel
0  Heiko  24.0  10086.0
1    Lee   NaN  10086.0
2   Lyle  27.0      NaN

DataFrame Basic Attributes and Overview Methods
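
A short sketch of the attributes and methods commonly used to get an overview of a DataFrame. The data is a small made-up frame in the style of the earlier example (the third tel value is invented).

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["Heiko", "Lee", "Lyle"],
    "age": [24, 30, 27],
    "tel": [10086, 10010, 10000],  # last value is made up
})

# Basic attributes
print(df.shape)    # (rows, columns)
print(df.dtypes)   # data type of each column
print(df.ndim)     # number of dimensions
print(df.index)    # row index
print(df.columns)  # column index

# Overview methods
print(df.head(2))  # first rows (5 by default)
print(df.tail(2))  # last rows (5 by default)
df.info()          # row count, column types, non-null counts, memory usage
summary = df.describe()  # count, mean, std, min, quartiles, max of numeric columns
print(summary)
```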

Exercise: Find the most frequently used names in the dog name statistics.

import pandas as pd

df = pd.read_csv("./dogNames2.csv")

# Sorting a DataFrame
# Sort by Count_AnimalName in descending order
df = df.sort_values(by="Count_AnimalName",ascending=False)
print(df.head())# Show the first 5 rows

Run result:

      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823

Row or column of pandas

Notes on selecting rows or columns in pandas

  1. A slice inside square brackets selects rows
  2. A string inside square brackets is a column label and selects that column

import pandas as pd

df = pd.read_csv("./dogNames2.csv")
# Sort as in the previous example so the output below matches
df = df.sort_values(by="Count_AnimalName",ascending=False)

print(df.head())
print(df[:10])
print(df["Row_Labels"])
print(type(df["Row_Labels"])) # Only one column is taken, so the result is a Series

Run result:

      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
      Row_Labels  Count_AnimalName
1156       BELLA              1195
9140         MAX              1153
2660     CHARLIE               856
3251        COCO               852
12368      ROCKY               823
8417        LOLA               795
8552       LUCKY               723
8560        LUCY               710
2032       BUDDY               677
3641       DAISY               649
1156       BELLA
9140         MAX
2660     CHARLIE
3251        COCO
12368      ROCKY
          ...   
6884        J-LO
6888       JOANN
6890        JOAO
6891     JOAQUIN
16219      39743
Name: Row_Labels, Length: 16220, dtype: object
<class 'pandas.core.series.Series'>

loc and iloc of pandas

pandas also provides dedicated, optimized indexers:

  1. df.loc selects rows (and columns) by label
  2. df.iloc selects rows (and columns) by integer position
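
A minimal sketch of both indexers on a small made-up DataFrame (the labels a, b, c and W, X, Y are assumptions for illustration):

```python
import pandas as pd

df = pd.DataFrame(
    {"W": [0, 4, 8], "X": [1, 5, 9], "Y": [2, 6, 10]},
    index=["a", "b", "c"],
)

# loc: labels; a label slice includes its end label
one_value = df.loc["a", "Y"]
row_a = df.loc["a"]                         # one row as a Series
sub_labels = df.loc[["a", "c"], ["W", "Y"]]
label_slice = df.loc["a":"b", "X"]

# iloc: integer positions; a position slice excludes its end
one_value_pos = df.iloc[1, 2]
sub_positions = df.iloc[0:2, [0, 2]]

print(sub_labels)
print(sub_positions)
```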

Boolean Index of pandas

Returning to the dog names: how would we find all the names that were used more than 800 times?

And how would we find all the names that were used more than 700 times and are longer than 4 characters?

In a DataFrame, multiple conditions must be combined with the element-wise operators & (and) and | (or), with each condition wrapped in parentheses.
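
Both questions can be sketched as follows. To keep the example self-contained it uses five rows taken from the run results above rather than the full dogNames2.csv file.

```python
import pandas as pd

df = pd.DataFrame({
    "Row_Labels": ["BELLA", "MAX", "CHARLIE", "COCO", "LUCY"],
    "Count_AnimalName": [1195, 1153, 856, 852, 710],
})

# Names used more than 800 times
over_800 = df[df["Count_AnimalName"] > 800]

# Names used more than 700 times AND longer than 4 characters;
# each condition needs its own parentheses around it
popular_long = df[(df["Count_AnimalName"] > 700) & (df["Row_Labels"].str.len() > 4)]

print(over_800)
print(popular_long)
```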

String method for pandas
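
String methods are reached through the .str accessor of a Series and are applied to every element. A brief sketch with a few commonly used ones, on names taken from the dog data above:

```python
import pandas as pd

names = pd.Series(["BELLA", "MAX", "J-LO"])

lower = names.str.lower()          # lowercase every element
lengths = names.str.len()          # length of every element
has_dash = names.str.contains("-") # True where the pattern occurs
parts = names.str.split("-")       # every element becomes a list

print(lower.tolist())
print(lengths.tolist())
print(has_dash.tolist())
print(parts.tolist())
```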

Processing missing data

Data is usually missing in one of two ways:
Either the value is empty (None and the like), which pandas represents as NaN (the same as np.nan), or the missing value was deliberately recorded as 0.

Determine whether the data is NaN:

  • pd.isnull(df)
  • pd.notnull(df)
  1. Approach 1: delete the rows (or columns) that contain NaN with dropna(axis=0, how='any', inplace=False)
  • how='any': delete a row or column if it contains any NaN;
  • how='all': delete a row or column only if all of its values are NaN;
  • inplace: modify the object in place, so the result does not need to be assigned back.
  2. Approach 2: fill in the data
  • t.fillna(t.mean()): fill with the mean;
  • t.fillna(t.median()): fill with the median;
  • t.fillna(0): fill with 0.

For data where missing values were recorded as 0, convert them to NaN:

t[t==0]=np.nan

Of course, not every 0 needs this treatment; it matters for calculations such as the mean, where NaN is skipped but 0 is counted.
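
The options above can be compared on a tiny frame. This is a sketch on made-up data; the column names follow the earlier DataFrame example.

```python
import numpy as np
import pandas as pd

t = pd.DataFrame({
    "age": [24.0, np.nan, 27.0],
    "tel": [10086.0, 10086.0, np.nan],
})

print(pd.isnull(t))                         # True where a value is NaN

dropped_any = t.dropna(axis=0, how="any")   # keeps only rows without any NaN
dropped_all = t.dropna(axis=0, how="all")   # drops only rows that are entirely NaN
filled_mean = t.fillna(t.mean())            # each column filled with its own mean
filled_zero = t.fillna(0)

# NaN is skipped by mean(), 0 is not
print(t["age"].mean(), filled_zero["age"].mean())

# Turning 0 back into NaN
restored = filled_zero.copy()
restored[restored == 0] = np.nan
```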

Statistical methods commonly used in pandas

Suppose we have data on the most popular movies from 2006 to 2016. We want to know the average rating, the number of directors, and the number of actors in this data. How do we get them?

import pandas as pd

file_path = "IMDB-Movie-Data.csv"
df = pd.read_csv(file_path)

print(df.head(1))

#Get average score
print(df["Rating"].mean())

#Number of directors
# print(len(set(df["Director"].tolist())))
print(len(df["Director"].unique()))

# Get the number of actors
temp_actors_list = df["Actors"].str.split(", ").tolist() # one list of names per movie
actors_list = [i for j in temp_actors_list for i in j] # flatten the nested lists into one large list
actors_num = len(set(actors_list))
print(actors_num)

unique(): Series.unique() removes duplicate elements and returns the distinct values as a NumPy array in order of first appearance, without sorting. (np.unique, by contrast, returns the distinct values sorted in ascending order.)

set(): the set() function creates an unordered collection of unique elements that can be used for membership tests, removing duplicates, and computing intersections, differences, unions, and so on.
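
The difference between these ways of counting distinct values can be sketched on a few placeholder names (the names below are invented, not taken from the movie data):

```python
import numpy as np
import pandas as pd

directors = pd.Series(["N", "G", "N", "B"])

in_order = directors.unique()              # order of first appearance
in_sorted = np.unique(directors.tolist())  # sorted ascending
n_directors = directors.nunique()          # shortcut for the count

# Actors are stored as one comma-separated string per movie
actors = pd.Series(["A, B", "B, C"])
nested = actors.str.split(", ").tolist()               # one list per movie
flat = [name for names in nested for name in names]    # flatten
n_actors = len(set(flat))                              # set() drops duplicates

print(list(in_order), list(in_sorted), n_directors, n_actors)
```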

Run result:

 Rank                    Title  ... Revenue (Millions) Metascore
0     1  Guardians of the Galaxy  ...             333.13      76.0

[1 rows x 12 columns]
6.723200000000003
644
2015

Exercise: For this movie data, how should we present the data if we want to see the distributions of rating and runtime?

Runtime (Minutes):

import pandas as pd
from matplotlib import pyplot as plt
file_path = "./IMDB-Movie-Data.csv"

df = pd.read_csv(file_path)
#print(df.head(1))
# print(df.info())

# Distribution of rating and runtime
# Chart type: histogram
# Prepare the data
runtime_data = df["Runtime (Minutes)"].values
print(runtime_data)
max_runtime = runtime_data.max()
min_runtime = runtime_data.min()

# Compute the number of bins (bin width of 5)
print(max_runtime-min_runtime)
num_bin = (max_runtime-min_runtime)//5

#Setting the size of a graphic
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin)

plt.xticks(range(min_runtime,max_runtime+5,5))

plt.grid()
plt.show()

Run result:

125

Rating:

import numpy as np
from matplotlib import pyplot as plt

runtime_data = np.array([8.1, 7.0, 7.3, 7.2, 6.2, 6.1, 8.3, 6.4, 7.1, 7.0, 7.5, 7.8, 7.9, 7.7, 6.4, 6.6, 8.2, 6.7, 8.1, 8.0, 6.7, 7.9, 6.7, 6.5, 5.3, 6.8, 8.3, 4.7, 6.2, 5.9, 6.3, 7.5, 7.1, 8.0, 5.6, 7.9, 8.6, 7.6, 6.9, 7.1, 6.3, 7.5, 2.7, 7.2, 6.3, 6.7, 7.3, 5.6, 7.1, 3.7, 8.1, 5.8, 5.6, 7.2, 9.0, 7.3, 7.2, 7.4, 7.0, 7.5, 6.7, 6.8, 6.5, 4.1, 8.5, 7.7, 7.4, 8.1, 7.5, 7.2, 5.9, 7.1, 7.5, 6.8, 8.1, 7.1, 8.1, 8.3, 7.3, 5.3, 8.8, 7.9, 8.2, 8.1, 7.2, 7.0, 6.4, 7.8, 7.8, 7.4, 8.1, 7.0, 8.1, 7.1, 7.4, 7.4, 8.6, 5.8, 6.3, 8.5, 7.0, 7.0, 8.0, 7.9, 7.3, 7.7, 5.4, 6.3, 5.8, 7.7, 6.3, 8.1, 6.1, 7.7, 8.1, 5.8, 6.2, 8.8, 7.2, 7.4, 6.7, 6.7, 6.0, 7.4, 8.5, 7.5, 5.7, 6.6, 6.4, 8.0, 7.3, 6.0, 6.4, 8.5, 7.1, 7.3, 8.1, 7.3, 8.1, 7.1, 8.0, 6.2, 7.8, 8.2, 8.4, 8.1, 7.4, 7.6, 7.6, 6.2, 6.4, 7.2, 5.8, 7.6, 8.1, 4.7, 7.0, 7.4, 7.5, 7.9, 6.0, 7.0, 8.0, 6.1, 8.0, 5.2, 6.5, 7.3, 7.3, 6.8, 7.9, 7.9, 5.2, 8.0, 7.5, 6.5, 7.6, 7.0, 7.4, 7.3, 6.7, 6.8, 7.0, 5.9, 8.0, 6.0, 6.3, 6.6, 7.8, 6.3, 7.2, 5.6, 8.1, 5.8, 8.2, 6.9, 6.3, 8.1, 8.1, 6.3, 7.9, 6.5, 7.3, 7.9, 5.7, 7.8, 7.5, 7.5, 6.8, 6.7, 6.1, 5.3, 7.1, 5.8, 7.0, 5.5, 7.8, 5.7, 6.1, 7.7, 6.7, 7.1, 6.9, 7.8, 7.0, 7.0, 7.1, 6.4, 7.0, 4.8, 8.2, 5.2, 7.8, 7.4, 6.1, 8.0, 6.8, 3.9, 8.1, 5.9, 7.6, 8.2, 5.8, 6.5, 5.9, 7.6, 7.9, 7.4, 7.1, 8.6, 4.9, 7.3, 7.9, 6.7, 7.5, 7.8, 5.8, 7.6, 6.4, 7.1, 7.8, 8.0, 6.2, 7.0, 6.0, 4.9, 6.0, 7.5, 6.7, 3.7, 7.8, 7.9, 7.2, 8.0, 6.8, 7.0, 7.1, 7.7, 7.0, 7.2, 7.3, 7.6, 7.1, 7.0, 6.0, 6.1, 5.8, 5.3, 5.8, 6.1, 7.5, 7.2, 5.7, 7.7, 7.1, 6.6, 5.7, 6.8, 7.1, 8.1, 7.2, 7.5, 7.0, 5.5, 6.4, 6.7, 6.2, 5.5, 6.0, 6.1, 7.7, 7.8, 6.8, 7.4, 7.5, 7.0, 5.2, 5.3, 6.2, 7.3, 6.5, 6.4, 7.3, 6.7, 7.7, 6.0, 6.0, 7.4, 7.0, 5.4, 6.9, 7.3, 8.0, 7.4, 8.1, 6.1, 7.8, 5.9, 7.8, 6.5, 6.6, 7.4, 6.4, 6.8, 6.2, 5.8, 7.7, 7.3, 5.1, 7.7, 7.3, 6.6, 7.1, 6.7, 6.3, 5.5, 7.4, 7.7, 6.6, 7.8, 6.9, 5.7, 7.8, 7.7, 6.3, 8.0, 5.5, 6.9, 7.0, 5.7, 6.0, 6.8, 6.3, 6.7, 6.9, 5.7, 6.9, 7.6, 7.1, 6.1, 7.6, 7.4, 6.6, 7.6, 7.8, 7.1, 5.6, 6.7, 6.7, 6.6, 6.3, 5.8, 7.2, 5.0, 5.4, 
7.2, 6.8, 5.5, 6.0, 6.1, 6.4, 3.9, 7.1, 7.7, 6.7, 6.7, 7.4, 7.8, 6.6, 6.1, 7.8, 6.5, 7.3, 7.2, 5.6, 5.4, 6.9, 7.8, 7.7, 7.2, 6.8, 5.7, 5.8, 6.2, 5.9, 7.8, 6.5, 8.1, 5.2, 6.0, 8.4, 4.7, 7.0, 7.4, 6.4, 7.1, 7.1, 7.6, 6.6, 5.6, 6.3, 7.5, 7.7, 7.4, 6.0, 6.6, 7.1, 7.9, 7.8, 5.9, 7.0, 7.0, 6.8, 6.5, 6.1, 8.3, 6.7, 6.0, 6.4, 7.3, 7.6, 6.0, 6.6, 7.5, 6.3, 7.5, 6.4, 6.9, 8.0, 6.7, 7.8, 6.4, 5.8, 7.5, 7.7, 7.4, 8.5, 5.7, 8.3, 6.7, 7.2, 6.5, 6.3, 7.7, 6.3, 7.8, 6.7, 6.7, 6.6, 8.0, 6.5, 6.9, 7.0, 5.3, 6.3, 7.2, 6.8, 7.1, 7.4, 8.3, 6.3, 7.2, 6.5, 7.3, 7.9, 5.7, 6.5, 7.7, 4.3, 7.8, 7.8, 7.2, 5.0, 7.1, 5.7, 7.1, 6.0, 6.9, 7.9, 6.2, 7.2, 5.3, 4.7, 6.6, 7.0, 3.9, 6.6, 5.4, 6.4, 6.7, 6.9, 5.4, 7.0, 6.4, 7.2, 6.5, 7.0, 5.7, 7.3, 6.1, 7.2, 7.4, 6.3, 7.1, 5.7, 6.7, 6.8, 6.5, 6.8, 7.9, 5.8, 7.1, 4.3, 6.3, 7.1, 4.6, 7.1, 6.3, 6.9, 6.6, 6.5, 6.5, 6.8, 7.8, 6.1, 5.8, 6.3, 7.5, 6.1, 6.5, 6.0, 7.1, 7.1, 7.8, 6.8, 5.8, 6.8, 6.8, 7.6, 6.3, 4.9, 4.2, 5.1, 5.7, 7.6, 5.2, 7.2, 6.0, 7.3, 7.2, 7.8, 6.2, 7.1, 6.4, 6.1, 7.2, 6.6, 6.2, 7.9, 7.3, 6.7, 6.4, 6.4, 7.2, 5.1, 7.4, 7.2, 6.9, 8.1, 7.0, 6.2, 7.6, 6.7, 7.5, 6.6, 6.3, 4.0, 6.9, 6.3, 7.3, 7.3, 6.4, 6.6, 5.6, 6.0, 6.3, 6.7, 6.0, 6.1, 6.2, 6.7, 6.6, 7.0, 4.9, 8.4, 7.0, 7.5, 7.3, 5.6, 6.7, 8.0, 8.1, 4.8, 7.5, 5.5, 8.2, 6.6, 3.2, 5.3, 5.6, 7.4, 6.4, 6.8, 6.7, 6.4, 7.0, 7.9, 5.9, 7.7, 6.7, 7.0, 6.9, 7.7, 6.6, 7.1, 6.6, 5.7, 6.3, 6.5, 8.0, 6.1, 6.5, 7.6, 5.6, 5.9, 7.2, 6.7, 7.2, 6.5, 7.2, 6.7, 7.5, 6.5, 5.9, 7.7, 8.0, 7.6, 6.1, 8.3, 7.1, 5.4, 7.8, 6.5, 5.5, 7.9, 8.1, 6.1, 7.3, 7.2, 5.5, 6.5, 7.0, 7.1, 6.6, 6.5, 5.8, 7.1, 6.5, 7.4, 6.2, 6.0, 7.6, 7.3, 8.2, 5.8, 6.5, 6.6, 6.2, 5.8, 6.4, 6.7, 7.1, 6.0, 5.1, 6.2, 6.2, 6.6, 7.6, 6.8, 6.7, 6.3, 7.0, 6.9, 6.6, 7.7, 7.5, 5.6, 7.1, 5.7, 5.2, 5.4, 6.6, 8.2, 7.6, 6.2, 6.1, 4.6, 5.7, 6.1, 5.9, 7.2, 6.5, 7.9, 6.3, 5.0, 7.3, 5.2, 6.6, 5.2, 7.8, 7.5, 7.3, 7.3, 6.6, 5.7, 8.2, 6.7, 6.2, 6.3, 5.7, 6.6, 4.5, 8.1, 5.6, 7.3, 6.2, 5.1, 4.7, 4.8, 7.2, 6.9, 6.5, 7.3, 6.5, 6.9, 7.8, 6.8, 4.6, 6.7, 6.4, 6.0, 6.3, 6.6, 7.8, 6.6, 
6.2, 7.3, 7.4, 6.5, 7.0, 4.3, 7.2, 6.2, 6.2, 6.8, 6.0, 6.6, 7.1, 6.8, 5.2, 6.7, 6.2, 7.0, 6.3, 7.8, 7.6, 5.4, 7.6, 5.4, 4.6, 6.9, 6.8, 5.8, 7.0, 5.8, 5.3, 4.6, 5.3, 7.6, 1.9, 7.2, 6.4, 7.4, 5.7, 6.4, 6.3, 7.5, 5.5, 4.2, 7.8, 6.3, 6.4, 7.1, 7.1, 6.8, 7.3, 6.7, 7.8, 6.3, 7.5, 6.8, 7.4, 6.8, 7.1, 7.6, 5.9, 6.6, 7.5, 6.4, 7.8, 7.2, 8.4, 6.2, 7.1, 6.3, 6.5, 6.9, 6.9, 6.6, 6.9, 7.7, 2.7, 5.4, 7.0, 6.6, 7.0, 6.9, 7.3, 5.8, 5.8, 6.9, 7.5, 6.3, 6.9, 6.1, 7.5, 6.8, 6.5, 5.5, 7.7, 3.5, 6.2, 7.1, 5.5, 7.1, 7.1, 7.1, 7.9, 6.5, 5.5, 6.5, 5.6, 6.8, 7.9, 6.2, 6.2, 6.7, 6.9, 6.5, 6.6, 6.4, 4.7, 7.2, 7.2, 6.7, 7.5, 6.6, 6.7, 7.5, 6.1, 6.4, 6.3, 6.4, 6.8, 6.1, 4.9, 7.3, 5.9, 6.1, 7.1, 5.9, 6.8, 5.4, 6.3, 6.2, 6.6, 4.4, 6.8, 7.3, 7.4, 6.1, 4.9, 5.8, 6.1, 6.4, 6.9, 7.2, 5.6, 4.9, 6.1, 7.8, 7.3, 4.3, 7.2, 6.4, 6.2, 5.2, 7.7, 6.2, 7.8, 7.0, 5.9, 6.7, 6.3, 6.9, 7.0, 6.7, 7.3, 3.5, 6.5, 4.8, 6.9, 5.9, 6.2, 7.4, 6.0, 6.2, 5.0, 7.0, 7.6, 7.0, 5.3, 7.4, 6.5, 6.8, 5.6, 5.9, 6.3, 7.1, 7.5, 6.6, 8.5, 6.3, 5.9, 6.7, 6.2, 5.5, 6.2, 5.6, 5.3])
max_runtime = runtime_data.max()
min_runtime = runtime_data.min()
print(min_runtime,max_runtime)

# Use bins of unequal width; hist treats every bin as left-closed, right-open, e.g. [1.9, 3.5), except the last, which also includes its right edge
num_bin_list = [1.9,3.5]
i=3.5
while i<=max_runtime:
    i += 0.5
    num_bin_list.append(i)
print(num_bin_list)

#Setting the size of a graphic
plt.figure(figsize=(20,8),dpi=80)
plt.hist(runtime_data,num_bin_list)

# Put the x ticks on the bin edges so that they line up with the bars
plt.xticks(num_bin_list)

plt.show()

Run result:

1.9 9.0
[1.9, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5]
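
How values are assigned to unequal-width bins can be checked without drawing anything, using np.histogram, which applies the same binning rule as plt.hist. The ratings and edges below are a small made-up sample, not the full data set.

```python
import numpy as np

# Hypothetical ratings chosen to sit exactly on some bin edges
ratings = np.array([1.9, 3.4, 3.5, 3.9, 4.0, 9.0])
edges = [1.9, 3.5, 4.0, 9.0]

counts, returned_edges = np.histogram(ratings, bins=edges)
print(counts)

# [1.9, 3.5) holds 1.9 and 3.4
# [3.5, 4.0) holds 3.5 and 3.9
# [4.0, 9.0] holds 4.0 and 9.0: the last bin also includes its right edge
```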

Posted on Thu, 21 Oct 2021 09:26:02 -0400 by jayarsee