Data analysis notes: pandas

Pandas

A container type that represents tabular data

  • Lesson 4: Jupyter (21:22)
  • Lesson 5: pandas (24:31)
  • Lesson 6: Series (38:19)
  • Lesson 7: DataFrame (25:50)
# Load pandas Library
import pandas as pd
import numpy as np
s = pd.Series([2,4,6,8,10])
s
0     2
1     4
2     6
3     8
4    10
dtype: int64
d = pd.DataFrame([
 [2,4,6,8,10],
 [7,3,4,7,15],
])

d
   0  1  2  3   4
0  2  4  6  8  10
1  7  3  4  7  15
d[0]
0    2
1    7
Name: 0, dtype: int64

Note that indexing a DataFrame directly with brackets returns a column, not a row. This matches the common case: to get the age attribute of a table, we usually want the whole age column. To get a single value, apply brackets a second time to the selected column, for example d[0][1].
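A minimal sketch of the access patterns described above, reusing the same two-row frame. The `.loc[row, column]` form is shown as a common alternative to two pairs of brackets; the variable names are only illustrative.

```python
import pandas as pd

d = pd.DataFrame([
    [2, 4, 6, 8, 10],
    [7, 3, 4, 7, 15],
])

col = d[0]               # single brackets select the column labelled 0
cell = d[0][1]           # a second pair of brackets picks row label 1 from that column
same_cell = d.loc[1, 0]  # .loc takes (row label, column label) in one step

print(col.tolist())      # [2, 7]
print(cell, same_cell)   # 7 7
```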

How to get a row

d.loc[0]
0     2
1     4
2     6
3     8
4    10
Name: 0, dtype: int64

The result is a Series.
In fact, a DataFrame is made up of multiple Series,
so we can also build one like this:

d2 = pd.DataFrame([
 pd.Series([2,4,6,8,10]),
 pd.Series([7,3,4,7,15]),
])
d2
   0  1  2  3   4
0  2  4  6  8  10
1  7  3  4  7  15
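To check the claim that a frame is made of Series, the sketch below inspects the type of one row and one column, and then builds a second frame from a dict of Series, which fills it column by column instead of row by row. The column names "a" and "b" are made up for the illustration.

```python
import pandas as pd

d2 = pd.DataFrame([
    pd.Series([2, 4, 6, 8, 10]),
    pd.Series([7, 3, 4, 7, 15]),
])

row = d2.loc[0]   # one row, returned as a Series
column = d2[3]    # one column, also a Series

print(type(row).__name__, type(column).__name__)  # Series Series
print(column.tolist())                            # [8, 7]

# A dict of Series builds the frame column by column
by_columns = pd.DataFrame({"a": pd.Series([2, 7]), "b": pd.Series([4, 3])})
print(by_columns.shape)                           # (2, 2)
```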
class1 = pd.Series({'hong': 50, 'huang': 90, 'qing': 60})

# Build a Series from a dict with an explicit index
class1_values = {'hong': 50, 'huang': 90, 'qing': 60}
class1_index = ['hong', 'lv', 'lan']
# The index parameter decides the labels: a label found among the dict keys keeps its value, a label missing from the dict becomes NaN, and dict keys not listed in the index are dropped
class1 = pd.Series(class1_values, index=class1_index)
class1
hong    50.0
lv       NaN
lan      NaN
dtype: float64
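The output above switches to float64 because NaN is a floating-point value. A small sketch of how the missing labels can be detected and filled; the fill value 0 is an arbitrary choice for the example, not something the notes prescribe.

```python
import pandas as pd

scores = {"hong": 50, "huang": 90, "qing": 60}

aligned = pd.Series(scores, index=["hong", "lv", "lan"])
print(aligned.isna().tolist())   # [False, True, True]
print(aligned.dtype)             # float64, because NaN is a float

# Filling the gaps and casting back recovers an integer Series
filled = aligned.fillna(0).astype("int64")
print(filled.tolist())           # [50, 0, 0]
```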
class1

# The values, returned as a NumPy ndarray
class1.values

# The index, returned as a pandas Index object (backed by an ndarray)
class1.index

class1.index[2]
class1.index.values


array(['hong', 'lv', 'lan'], dtype=object)
class1_index
class1.hong



50.0
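class1.iloc[[1, 2, 0]]  # select by position; newer pandas versions deprecate integer positions in plain brackets on a labelled index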


lv       NaN
lan      NaN
hong    50.0
dtype: float64
class1[0:1]
hong    50.0
dtype: float64
# Element-wise comparison: every value is tested against 6
class1 > 6
# A NaN value compares as False
hong     True
lv      False
lan     False
dtype: bool
# The boolean Series can be used as a filter
# This reads much like a WHERE clause in a database query
class1[class1>6]
hong    50.0
dtype: float64
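A short sketch extending the filter above: conditions can be combined with `&` and `|` (each wrapped in parentheses), and NaN entries have to be selected explicitly with `isna()` because they never satisfy a comparison. The Series is rebuilt here so the block runs on its own.

```python
import numpy as np
import pandas as pd

class1 = pd.Series([50.0, np.nan, np.nan], index=["hong", "lv", "lan"])

mask = class1 > 6                 # NaN > 6 evaluates to False
print(mask.tolist())              # [True, False, False]

# Combine conditions with & and |, wrapping each one in parentheses
in_range = class1[(class1 > 6) & (class1 < 100)]
missing = class1[class1.isna()]   # select the NaN entries explicitly

print(in_range.index.tolist())    # ['hong']
print(missing.index.tolist())     # ['lv', 'lan']
```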
# Add one to every element at once
class1+1

hong    51.0
lv       NaN
lan      NaN
dtype: float64
  • Adding one to the whole Series like this is very efficient.
  • With a plain Python list, getting the same effect means looping over the list:
    take each item, add 1, and append the result to a new list.
  • Here the operation is vectorized: + 1 is applied to the whole underlying array in one call (conceptually as if all elements were processed at once),
    and the results are returned in a new Series.
  • We can use the %%timeit magic command to measure the running time and verify this.
%%timeit
# Build a small Series and add one to every element
class2_values = [1024,3,5,7,9,10,13,115,127,149,221]
# This timing includes constructing the Series as well as the addition
class2 = pd.Series(class2_values)
class2+1
198 µs ± 9.37 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
class2+1
100 µs ± 3.56 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit
for i in range(100000):
    i+=1
4.12 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit
a = pd.Series(range(100000))
a+1

562 µs ± 72 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

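My guess is that the small examples show little advantage because the data volume is too low for the library's strengths to appear. With 100,000 elements the Series version is already about seven times faster than the plain loop (562 µs vs. 4.12 ms), even though that timing still includes building the Series. More sizes are worth trying.
Some workloads are moved to a GPU, because running them on the CPU would keep it fully busy. A GPU is good at huge numbers of small, simple calculations: it is like a large group of primary school students who, working together on plain addition, subtraction, multiplication and division, finish sooner than a single mathematician (the CPU).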
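The %%timeit cells above mix Series construction with the arithmetic itself. The sketch below is one way to separate the two outside a notebook, using the standard-library `timeit` module; the function names and the choice of 20 runs are assumptions made for the example, and the timings will vary by machine, so none are quoted here.

```python
import timeit

import pandas as pd

n = 100_000
as_list = list(range(n))
as_series = pd.Series(as_list)   # built once, outside the timed code

def add_one_list():
    # Plain Python: visit every element and build a new list
    return [x + 1 for x in as_list]

def add_one_series():
    # pandas: one vectorized operation over the whole Series
    return as_series + 1

# The two approaches agree on the result
assert add_one_series().tolist() == add_one_list()

list_time = timeit.timeit(add_one_list, number=20)
series_time = timeit.timeit(add_one_series, number=20)
print(f"list comprehension: {list_time:.4f} s for 20 runs")
print(f"Series + 1:         {series_time:.4f} s for 20 runs")
```

Repeating the measurement with different values of `n` shows how the gap changes with data volume.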

# Besides addition and subtraction, a Series also supports multiplication, division, modulo (%) and floor division (//)
print(class2 // 2) 
11.0
11.0
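For reference, a small sketch of the element-wise operators on a three-element Series. The values are chosen only to make the results easy to check by hand.

```python
import pandas as pd

s = pd.Series([7, 10, 13])

print((s - 2).tolist())    # [5, 8, 11]
print((s * 2).tolist())    # [14, 20, 26]
print((s / 2).tolist())    # [3.5, 5.0, 6.5]
print((s % 3).tolist())    # [1, 1, 1]
print((s // 2).tolist())   # [3, 5, 6]
```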
class2 = pd.Series([1024,3,5,7,9,10,13,115,127,149,221])
# Average
print(class2.mean())
print(np.mean(class2))
class2

153.0
153.0





0     1024
1        3
2        5
3        7
4        9
5       10
6       13
7      115
8      127
9      149
10     221
dtype: int64
class3 = pd.Series([1024,13,5,7,9,10,1,115,127,149,221])
# Median
# Called as a NumPy library function
print(np.median(class3))
# Called as a method of the Series itself
print(class3.median())
# With an even number of values, the median is the mean of the two middle values

13.0
13.0
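A quick check of the even-count rule mentioned in the comment above, on two tiny hypothetical Series:

```python
import pandas as pd

odd = pd.Series([1, 5, 13, 115, 1024])
even = pd.Series([1, 5, 13, 115])

print(odd.median())    # 13.0, the middle value
print(even.median())   # 9.0, the mean of 5 and 13
```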
# variance
class2.var()
89190.6
# standard deviation
class2.std()

298.6479532827908
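The figure 89190.6 is the sample variance: pandas divides by n - 1 by default (`ddof=1`), whereas NumPy's `np.var` on a plain array divides by n (`ddof=0`). A sketch of the difference on the same data:

```python
import numpy as np
import pandas as pd

class2 = pd.Series([1024, 3, 5, 7, 9, 10, 13, 115, 127, 149, 221])

sample_var = class2.var()                 # pandas default: ddof=1
population_var = class2.var(ddof=0)       # divide by n instead
numpy_var = np.var(class2.to_numpy())     # NumPy default: ddof=0

print(round(sample_var, 1))                   # 89190.6
print(np.isclose(population_var, numpy_var))  # True
print(population_var < sample_var)            # True
```

The same `ddof` parameter applies to `std()`.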
print(class2)
print("-"*50)
print(class2+1)
print("-"*50)
# Membership test with in
# Like a dictionary, a Series checks in against its index labels (the keys) only, not against its values
print(10 in class2)
print("-"*50)
print(5 in class2 + 1)
# On the inaccuracy of floating-point calculation

0     1024
1        3
2        5
3        7
4        9
5       10
6       13
7      115
8      127
9      149
10     221
dtype: int64
--------------------------------------------------
0     1025
1        4
2        6
3        8
4       10
5       11
6       14
7      116
8      128
9      150
10     222
dtype: int64
--------------------------------------------------
True
--------------------------------------------------
True
# in checks the index; to test whether a number is among the values, use .values
print(4 in class2) 
print(4 in class2.values)
True
False
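A compact sketch of the membership checks, rebuilt on the same data so it runs on its own. `isin` and `eq` are shown as vectorized alternatives for testing the values.

```python
import pandas as pd

class2 = pd.Series([1024, 3, 5, 7, 9, 10, 13, 115, 127, 149, 221])

print(4 in class2)                 # True: 4 is an index label
print(4 in class2.values)          # False: no element equals 4
print(class2.isin([4, 5]).any())   # True: 5 is among the values
print(class2.eq(1024).any())       # True
```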
# Modify values by label; assigning to a label that does not exist yet appends a new entry
class2['ming'] = 0
class2['hua'] = 0
class2['hong'] = 0

class2[['hua','hong']] = 55
class2[['hua','hong']] = [35, 55]
class2['hua','hong'] = [1, 2]  # A single pair of brackets also worked in the pandas version used here
class2

0       1024
1          3
2          5
3          7
4          9
5         10
6         13
7        115
8        127
9        149
10       221
ming       0
hua        1
hong       2
dtype: int64
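The same assignments can be written with `.loc`, which makes it explicit that the keys are labels. This is a sketch on a small hypothetical Series, not the notebook's `class2`.

```python
import pandas as pd

scores = pd.Series({"ming": 0, "hua": 0, "hong": 0})

scores.loc["ming"] = 10                  # overwrite one existing label
scores.loc[["hua", "hong"]] = [35, 55]   # one value per listed label
scores.loc["lan"] = 70                   # assigning to a new label appends it

print(scores.to_dict())
# {'ming': 10, 'hua': 35, 'hong': 55, 'lan': 70}
```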
# Deep copy: class4 gets its own data, so class2 is unchanged below
class4 = class2.copy()
class4 = class4+1
print(class2)
class4
0       1024
1          3
2          5
3          7
4          9
5         10
6         13
7        115
8        127
9        149
10       221
ming       0
hua        1
hong       2
dtype: int64





0       1025
1          4
2          6
3          8
4         10
5         11
6         14
7        116
8        128
9        150
10       222
ming       1
hua        2
hong       3
dtype: int64
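A sketch of the difference between a second name for the same Series and a real copy. Note that arithmetic such as `+ 1` returns a new Series anyway, so the copy matters mainly when the data is modified in place.

```python
import pandas as pd

original = pd.Series([1, 2, 3])

alias = original             # a second name for the same object
duplicate = original.copy()  # independent data (deep=True is the default)

duplicate.loc[0] = 100
print(original.tolist())     # [1, 2, 3]: the copy does not write through
print(alias is original)     # True

shifted = original + 1       # arithmetic already returns a new Series
print(original.tolist())     # [1, 2, 3]
print(shifted.tolist())      # [2, 3, 4]
```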
# The index can also be replaced as a whole (one label per element; duplicates such as 24 are allowed)
class2.index = [22,23,24,28,24,29,1,2,3,4,8,5,9,21]
class2
22    1024
23       3
24       5
28       7
24       9
29      10
1       13
2      115
3      127
4      149
8      221
5        0
9        1
21       2
dtype: int64
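The new index above contains the label 24 twice. A small sketch of what that implies for lookups, and of the length check pandas applies when an index is replaced; the three-element Series is made up for the example.

```python
import pandas as pd

s = pd.Series([5, 9, 13], index=[24, 24, 1])

print(s.index.is_unique)     # False
print(s.loc[24].tolist())    # [5, 9]: every entry labelled 24
print(s.loc[1])              # 13: a unique label still returns a scalar

try:
    s.index = [1, 2]         # wrong length: three values, two labels
except ValueError as exc:
    print("rejected:", type(exc).__name__)   # rejected: ValueError
```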
# On my setup the CSV path must not contain Chinese characters, otherwise reading fails
df = pd.read_csv("./source/test.csv")
df

   ro  c1  c2  c3  c4  c5  c6  c7  c8  c9  c10  c11  c12  c13  c14  c15  c16  c17  c18
0   a   0   5  10  10  10  10  10  10  10   10   10   10   10   10   10   10   10   10
1   b   1   6  11  11  11  11  11  11  11   11   11   11   11   11   11   11   11   11
2   c   2   7  12  12  12  12  12  12  12   12   12   12   12   12   12   12   12   12
3   d   3   8  13  13  13  13  13  13  13   13   13   13   13   13   13   13   13   13
4   e   4   9  14  14  14  14  14  14  14   14   14   14   14   14   14   14   14   14

The fields in the CSV file are separated by commas. The example comes from the article:
python: pandas read_csv method
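Since the original test.csv is not included in full, the sketch below reads a hypothetical three-column excerpt from an in-memory string instead of a file. It shows that the comma is the default separator and that `index_col` can turn one column into the row labels.

```python
import io

import pandas as pd

# Hypothetical stand-in for ./source/test.csv; only a few columns are reproduced
csv_text = "ro,c1,c2,c3\na,0,5,10\nb,1,6,11\nc,2,7,12\n"

df = pd.read_csv(io.StringIO(csv_text))   # sep="," is the default
print(df.shape)                           # (3, 4)
print(df["c2"].tolist())                  # [5, 6, 7]

# Use one column as the row labels instead of the default 0..n-1 index
by_ro = pd.read_csv(io.StringIO(csv_text), index_col="ro")
print(by_ro.loc["b", "c3"])               # 11
```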



Posted on Wed, 15 Jan 2020 01:33:57 -0500 by mtb211