Data analysis day01

1. numpy module

What is data analysis?

  • Data analysis extracts the information hidden in seemingly disordered data and summarizes the underlying patterns of the subject being studied.
  • It means analyzing a large amount of collected data with appropriate methods to help people make judgments and take appropriate actions, for example:
    • What is the purchase volume of goods?
    • What is the delivery volume from headquarters to agents in different regions?
    • ...

Why learn data analysis

  • Job requirements
  • It is the foundation of data science in Python
  • It is the foundation of the machine learning course

Data analysis workflow

  • Raise questions
  • Prepare the data
  • Analyze the data
  • Draw conclusions
  • Visualize the results

The three core libraries of data analysis

  • numpy
  • pandas
  • matplotlib

numpy module: one-dimensional or multi-dimensional arrays (comparable to Python lists)

  • NumPy (Numerical Python) is the fundamental library for scientific computing in Python. It focuses on numerical computation and underlies most Python scientific computing libraries. It is mostly used for numerical operations on large, multidimensional arrays.

Creating numpy arrays

  • Create with np.array()

  • Create with plt (read an image into an array with plt.imread())

  • Create with np routines (zeros(), ones(), linspace(), ...)

  • Using array() to create a one-dimensional array

In [3]:

import numpy as np
arr = np.array([1,2,3,4,5,6])
arr

Out[3]:

array([1, 2, 3, 4, 5, 6])
  • Using array() to create a multidimensional array

In [4]:

np.array([[1,2,3,4],[5,6,7,8],[9,9,9,9]])

Out[4]:

array([[1, 2, 3, 4],
       [5, 6, 7, 8],
       [9, 9, 9, 9]])
  • What's the difference between arrays and lists?
    • All elements stored in an array must have the same data type
    • When types are mixed, they are converted by priority:
      • str>float>int

In [6]:

arr = np.array([1,2.2,3,4,5,6])
arr

Out[6]:

array([1. , 2.2, 3. , 4. , 5. , 6. ])
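
The priority rule above can be checked directly. The following is a minimal sketch; the variable names mixed_num and mixed_str are illustrative and do not come from the original notebook.

```python
import numpy as np

# int mixed with float: every element becomes a float
mixed_num = np.array([1, 2.2, 3])
print(mixed_num.dtype)        # float64

# numbers mixed with a string: every element becomes a string
mixed_str = np.array([1, 2.2, 'three'])
print(mixed_str.dtype.kind)   # 'U' means a Unicode string dtype
print(mixed_str)
```
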
  • Read an external image into a numpy array, then change the values of the array elements to see the effect on the image

In [10]:

import matplotlib.pyplot as plt
img_arr = plt.imread('./1.jpg')
plt.imshow(img_arr)

Out[10]:

In [11]:

plt.imshow(img_arr-100)

Out[11]:

<matplotlib.image.AxesImage at 0x165794d4e48>
  • zeros()
  • ones()
  • linspace()
  • arange()
  • random family (np.random)

In [12]:

np.zeros((3,4))

Out[12]:

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [13]:

np.linspace(0,100,num=20)

Out[13]:

array([  0.        ,   5.26315789,  10.52631579,  15.78947368,
        21.05263158,  26.31578947,  31.57894737,  36.84210526,
        42.10526316,  47.36842105,  52.63157895,  57.89473684,
        63.15789474,  68.42105263,  73.68421053,  78.94736842,
        84.21052632,  89.47368421,  94.73684211, 100.        ])

In [14]:

np.arange(0,100,step=3)

Out[14]:

array([ 0,  3,  6,  9, 12, 15, 18, 21, 24, 27, 30, 33, 36, 39, 42, 45, 48,
       51, 54, 57, 60, 63, 66, 69, 72, 75, 78, 81, 84, 87, 90, 93, 96, 99])
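
linspace() and arange() look similar but differ in what you specify and in how the end value is treated. A small sketch of the difference, with illustrative variable names:

```python
import numpy as np

# linspace: you choose how many points; the stop value is included by default
pts = np.linspace(0, 1, num=5)
print(pts)              # [0.   0.25 0.5  0.75 1.  ]

# arange: you choose the step; the stop value is excluded
steps = np.arange(0, 10, step=3)
print(steps)            # [0 3 6 9]
```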

In [16]:

np.random.randint(0,100,size=(5,6))

Out[16]:

array([[71, 76, 47, 11,  7,  6],
       [47, 89, 70, 44, 41, 96],
       [58, 42, 36, 53, 49, 55],
       [13, 32, 64, 58, 15,  7],
       [78, 56, 40, 71, 45, 63]])

In [18]:

np.random.random((3,4))

Out[18]:

array([[0.24913375, 0.91988476, 0.36386714, 0.58404557],
       [0.15544885, 0.73892461, 0.82189615, 0.80368295],
       [0.07230386, 0.45535116, 0.75370029, 0.03377829]])
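  • Randomness:
    • Random factor (seed): a value that normally changes from run to run (for example, the time); fixing it makes the generated numbers reproducible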

In [23]:

#Fix the random seed so that the same numbers are generated on every run
np.random.seed(10)
np.random.randint(0,100,size=(5,6))

Out[23]:

array([[ 9, 15, 64, 28, 89, 93],
       [29,  8, 73,  0, 40, 36],
       [16, 11, 54, 88, 62, 33],
       [72, 78, 49, 51, 54, 77],
       [69, 13, 25, 13, 92, 86]])
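
To see that fixing the seed really makes the output repeatable, the same draw can be made twice. This is a minimal sketch with illustrative names (first, second):

```python
import numpy as np

np.random.seed(10)
first = np.random.randint(0, 100, size=(2, 3))

np.random.seed(10)           # reset to the same seed
second = np.random.randint(0, 100, size=(2, 3))

print((first == second).all())   # True: same seed, same numbers
```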

Common properties of numpy

  • shape
  • ndim
  • size
  • dtype

In [30]:

img_arr.shape
img_arr.ndim
img_arr.size
img_arr.dtype
type(img_arr)

Out[30]:

numpy.ndarray

In [32]:

arr = np.array([1,2,3],dtype='uint8')

Data type of numpy

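  • array(dtype=?): the data type can be set when the array is created
  • arr.dtype = '?': the data type can be changed afterwards; note that this reinterprets the existing bytes rather than converting the values (arr.astype('?') returns a converted copy)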

In [38]:

arr = np.array([1,2,3])

In [39]:

arr.dtype = 'int32'
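
Changing the dtype can mean two different things: converting the values, or reinterpreting the same bytes. The sketch below contrasts the two, using astype() for conversion and view() as a stand-in for assigning to arr.dtype (both reinterpret the underlying memory). The int64 starting type is an assumption made so the result is the same on every platform.

```python
import numpy as np

arr = np.array([1, 2, 3], dtype='int64')

# astype() converts the values and returns a new array
as_float = arr.astype('float64')
print(as_float)            # [1. 2. 3.]

# view() reinterprets the same bytes without converting anything
as_bytes = arr.view('uint8')
print(as_bytes.shape)      # (24,): 3 elements x 8 bytes each
```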

Index and slice operation of numpy

  • Indexing works the same way as with lists

In [40]:

arr = np.random.randint(0,100,size=(6,8))
arr

Out[40]:

array([[30, 30, 89, 12, 65, 31, 57, 36],
       [27, 18, 93, 77, 22, 23, 94, 11],
       [28, 74, 88,  9, 15, 18, 80, 71],
       [88, 11, 17, 46,  7, 75, 28, 33],
       [84, 96, 88, 44,  5,  4, 71, 88],
       [88, 50, 54, 34, 15, 77, 88, 15]])

In [41]:

arr[1]

Out[41]:

array([27, 18, 93, 77, 22, 23, 94, 11])
  • Slicing operation
    • Slice out the first two columns
    • Slice out the first two rows
    • Slice out the first two columns of the first two rows
    • Flip the array data
    • Exercise: flip a picture up/down and left/right
    • Exercise: crop the picture to a specified area

In [44]:

arr.shape

Out[44]:

(6, 8)

In [43]:

#Slice out the first two rows
arr[0:2]

Out[43]:

array([[30, 30, 89, 12, 65, 31, 57, 36],
       [27, 18, 93, 77, 22, 23, 94, 11]])

In [45]:

#Slice out the first two columns: arr[row, column]
arr[:,0:2]

Out[45]:

array([[30, 30],
       [27, 18],
       [28, 74],
       [88, 11],
       [84, 96],
       [88, 50]])

In [46]:

#Cut out the data of the first two columns of the first two rows
arr[0:2,0:2]

Out[46]:

array([[30, 30],
       [27, 18]])

In [49]:

#Array data flip
plt.imshow(img_arr)

Out[49]:

<matplotlib.image.AxesImage at 0x16578fb6a20>

In [50]:

img_arr.shape #The first two dimensions represent pixels, and the last dimension represents color

Out[50]:

(426, 640, 3)

In [51]:

#Flip the picture up and down
plt.imshow(img_arr[::-1,:,:])

Out[51]:

<matplotlib.image.AxesImage at 0x16578b53080>

In [52]:

#Flip the picture left and right
plt.imshow(img_arr[:,::-1,:])

Out[52]:

<matplotlib.image.AxesImage at 0x16579085f28>

In [53]:

#Flip up/down and left/right, and also reverse the order of the color channels
plt.imshow(img_arr[::-1,::-1,::-1])

Out[53]:

<matplotlib.image.AxesImage at 0x16578bfc588>

In [54]:

#Cropping
plt.imshow(img_arr)

Out[54]:

<matplotlib.image.AxesImage at 0x165791662b0>

In [55]:

plt.imshow(img_arr[50:200,50:300,:])

Out[55]:

<matplotlib.image.AxesImage at 0x16578f80c88>
  • Slice summary:
    • Slice rows: arr[index1:index3]
    • Slice columns: arr[row slice, column slice]
    • Flip: arr[::-1]
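
The slicing rules summarized above can be tried on a small array without needing an image file. In this sketch a 3x4 grid of integers stands in for the picture; all variable names are illustrative.

```python
import numpy as np

# A small 3x4 grid stands in for the image array
grid = np.arange(12).reshape(3, 4)

rows = grid[0:2]          # first two rows
cols = grid[:, 0:2]       # first two columns
block = grid[0:2, 0:2]    # first two columns of the first two rows
upside_down = grid[::-1]  # reverse the row order (flip up/down)
mirrored = grid[:, ::-1]  # reverse the column order (flip left/right)

print(block)
print(mirrored)
```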

reshape

  • The number of elements must be the same before and after reshaping

In [57]:

arr = np.array([1,2,3,4,5,6])
arr

Out[57]:

array([1, 2, 3, 4, 5, 6])

In [60]:

#Reshape a one-dimensional array into two dimensions
arr.reshape((2,3))

Out[60]:

array([[1, 2, 3],
       [4, 5, 6]])

In [61]:

arr.reshape((-1,2))

Out[61]:

array([[1, 2],
       [3, 4],
       [5, 6]])
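
The -1 in the last cell tells NumPy to infer that dimension from the element count. A short sketch of that inference, and of what happens when the element counts do not match (the flag name reshape_failed is illustrative):

```python
import numpy as np

arr = np.arange(1, 7)            # six elements: 1..6

# -1 asks NumPy to work out that dimension: 6 elements / 2 columns = 3 rows
auto = arr.reshape((-1, 2))
print(auto.shape)                # (3, 2)

# A shape whose element count differs from 6 is rejected
try:
    arr.reshape((4, 2))
    reshape_failed = False
except ValueError:
    reshape_failed = True
print(reshape_failed)            # True
```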

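Concatenation: concatenate

  • Joins numpy arrays horizontally or vertically
  • Understanding axis (for two-dimensional arrays)
    • 0: join vertically, stacking rows on top of each other
    • 1: join horizontally, placing columns side by side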

In [64]:

arr1 = np.array([[1,2,3],[4,5,6]])
arr1

Out[64]:

array([[1, 2, 3],
       [4, 5, 6]])

In [68]:

np.concatenate((arr1,arr1),axis=1)

Out[68]:

array([[1, 2, 3, 1, 2, 3],
       [4, 5, 6, 4, 5, 6]])

In [69]:

arr2 = np.array([[1,2,3,3],[4,5,6,6]])
arr2

Out[69]:

array([[1, 2, 3, 3],
       [4, 5, 6, 6]])
  • Matching concatenation
    • The concatenated arrays have the same shape
  • Mismatched concatenation
    • The concatenated arrays have different shapes (the number of dimensions must still be the same)
      • If the arrays have the same number of rows, they can be joined horizontally (axis=1)
      • If the arrays have the same number of columns, they can be joined vertically (axis=0)

In [71]:

#Concatenate arr1 and arr2 horizontally
np.concatenate((arr1,arr2),axis=1)

Out[71]:

array([[1, 2, 3, 1, 2, 3, 3],
       [4, 5, 6, 4, 5, 6, 6]])
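
The mismatched case can be made concrete with the same arr1 and arr2: they share a row count but not a column count, so only one direction works. A minimal sketch (wide and stacked_ok are illustrative names):

```python
import numpy as np

arr1 = np.array([[1, 2, 3], [4, 5, 6]])        # shape (2, 3)
arr2 = np.array([[1, 2, 3, 3], [4, 5, 6, 6]])  # shape (2, 4)

# Same number of rows: joining side by side (axis=1) works
wide = np.concatenate((arr1, arr2), axis=1)
print(wide.shape)          # (2, 7)

# Different number of columns: stacking vertically (axis=0) fails
try:
    np.concatenate((arr1, arr2), axis=0)
    stacked_ok = True
except ValueError:
    stacked_ok = False
print(stacked_ok)          # False
```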

Common aggregation operations

  • sum,max,min,mean

In [74]:

arr = np.random.randint(0,10,size=(4,5))
arr

Out[74]:

array([[6, 6, 5, 6, 0],
       [0, 6, 9, 1, 8],
       [9, 1, 2, 8, 9],
       [9, 5, 0, 2, 7]])

In [77]:

arr.sum(axis=1)

Out[77]:

array([23, 24, 29, 23])
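
The axis argument decides which direction is collapsed. The sketch below reuses the values printed in Out[74] (typed in by hand so the result is reproducible) and compares the whole-array, per-row and per-column results.

```python
import numpy as np

arr = np.array([[6, 6, 5, 6, 0],
                [0, 6, 9, 1, 8],
                [9, 1, 2, 8, 9],
                [9, 5, 0, 2, 7]])

print(arr.sum())          # 99: all elements
print(arr.sum(axis=1))    # [23 24 29 23]: one total per row
print(arr.sum(axis=0))    # [24 18 16 17 24]: one total per column
print(arr.max(axis=0))    # [9 6 9 8 9]: largest value in each column
```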

Common mathematical functions

  • NumPy provides standard trigonometric functions: sin(), cos(), tan()
  • The numpy.around(a, decimals) function returns the values rounded to the specified number of decimal places.
    • Parameter description:
      • a: array
      • decimals: number of decimal places to round to. The default is 0. If negative, rounding applies to positions to the left of the decimal point

In [78]:

np.sin(arr)

Out[78]:

array([[-0.2794155 , -0.2794155 , -0.95892427, -0.2794155 ,  0.        ],
       [ 0.        , -0.2794155 ,  0.41211849,  0.84147098,  0.98935825],
       [ 0.41211849,  0.84147098,  0.90929743,  0.98935825,  0.41211849],
       [ 0.41211849, -0.95892427,  0.        ,  0.90929743,  0.6569866 ]])

In [81]:

arr = np.random.random(size=(3,4))
arr

Out[81]:

array([[0.07961309, 0.30545992, 0.33071931, 0.7738303 ],
       [0.03995921, 0.42949218, 0.31492687, 0.63649114],
       [0.34634715, 0.04309736, 0.87991517, 0.76324059]])

In [83]:

np.around(arr,decimals=2)

Out[83]:

array([[0.08, 0.31, 0.33, 0.77],
       [0.04, 0.43, 0.31, 0.64],
       [0.35, 0.04, 0.88, 0.76]])
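
The effect of a negative decimals value is easiest to see side by side with positive ones. A small sketch with made-up input values:

```python
import numpy as np

values = np.array([1234.567, 15.5, 0.004])

two_places = np.around(values, decimals=2)    # two decimal places
hundreds = np.around(values, decimals=-2)     # nearest hundred

print(two_places)
print(hundreds)
```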

Common statistical functions

  • numpy.amin() and numpy.amax() calculate the minimum and maximum values of the elements in an array along a specified axis.
  • numpy.ptp(): calculates the difference between the maximum and minimum values of the elements in an array (maximum - minimum).
  • numpy.median() calculates the median of the elements in array a.
  • Standard deviation std(): the standard deviation measures how far a set of data is spread out from its mean.
    • Formula: std = sqrt(mean((x - x.mean())**2))
    • If the array is [1, 2, 3, 4], its mean is 2.5. The squared differences are therefore [2.25, 0.25, 0.25, 2.25]; the square root of their mean, sqrt(5/4), gives 1.1180339887498949.
  • Variance var(): the variance is the mean of the squared differences between each value and the mean of all values, that is, mean((x - x.mean())**2). With the default ddof=0 this is the population variance. In other words, the standard deviation is the square root of the variance.
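
The formulas above can be verified step by step on the [1, 2, 3, 4] example. This is a sketch with illustrative variable names; the ddof=1 line is an addition showing the n - 1 divisor commonly used for sample estimates.

```python
import numpy as np

x = np.array([1, 2, 3, 4])

deviations = x - x.mean()              # [-1.5 -0.5  0.5  1.5]
variance = (deviations ** 2).mean()    # 1.25
std = np.sqrt(variance)                # about 1.118

print(variance, np.var(x))             # both follow mean((x - x.mean())**2)
print(std, np.std(x))

# ddof=1 divides by n - 1 instead of n
sample_variance = np.var(x, ddof=1)
print(sample_variance)                 # 5 / 3
```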

In [85]:

arr = np.random.randint(0,20,size=(5,3))
arr

Out[85]:

array([[12, 18, 17],
       [17, 16,  0],
       [ 5,  9,  0],
       [ 6,  0,  2],
       [ 3,  3, 18]])

In [86]:

np.amin(arr,axis=0)

Out[86]:

array([3, 0, 0])

In [87]:

np.ptp(arr,axis=0)

Out[87]:

array([14, 18, 18])

In [88]:

np.median(arr,axis=0)

Out[88]:

array([6., 9., 2.])

In [93]:

np.std(arr,axis=0)

Out[93]:

array([5.16139516, 7.02566723, 8.28492607])

In [94]:

np.var(arr,axis=0)

Out[94]:

array([26.64, 49.36, 68.64])

Matrix-related operations

  • NumPy contains a matrix library, numpy.matlib. The functions in this module return a matrix rather than an ndarray object. A matrix is a rectangular array of elements arranged in rows and columns.

  • The matlib.empty() function returns a new matrix without initializing its entries, so it contains arbitrary values. The syntax is: numpy.matlib.empty(shape, dtype)

    • Parameters:
      • shape: an integer or tuple of integers defining the shape of the new matrix
      • dtype: optional, data type

In [98]:

import numpy.matlib as matlib
matlib.empty(shape=(4,5))

Out[98]:

matrix([[-0.2794155 , -0.2794155 , -0.95892427, -0.2794155 ,  0.        ],
        [ 0.        , -0.2794155 ,  0.41211849,  0.84147098,  0.98935825],
        [ 0.41211849,  0.84147098,  0.90929743,  0.98935825,  0.41211849],
        [ 0.41211849, -0.95892427,  0.        ,  0.90929743,  0.6569866 ]])
  • numpy.matlib.zeros() and numpy.matlib.ones() return a matrix filled with 0 or 1

  • The function numpy.matlib.eye() returns a matrix with ones on a diagonal and zeros elsewhere.
    • numpy.matlib.eye(n, M, k, dtype)
      • n: number of rows in the returned matrix
      • M: number of columns in the returned matrix; defaults to n
      • k: index of the diagonal (0 is the main diagonal, positive values are above it, negative values below it)
      • dtype: data type

In [101]:

matlib.eye(5,5,1)

Out[101]:

matrix([[0., 1., 0., 0., 0.],
        [0., 0., 1., 0., 0.],
        [0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 0.]])
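
The k parameter is easier to grasp on a smaller matrix. The sketch below uses np.eye, the ndarray counterpart of matlib.eye, which takes the same row, column and k arguments; choosing np.eye here is a substitution for illustration, not what the notebook uses.

```python
import numpy as np

main = np.eye(3)            # ones on the main diagonal
above = np.eye(3, k=1)      # ones one step above the main diagonal
below = np.eye(3, k=-1)     # ones one step below the main diagonal
wide = np.eye(2, 4)         # non-square: 2 rows, 4 columns

print(above)
print(below)
```
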
  • The numpy.matlib.identity() function returns an identity matrix of a given size. An identity matrix is a square matrix in which the elements on the main diagonal (from the upper left corner to the lower right corner) are all 1 and all other elements are 0.

In [99]:

matlib.identity(6)

Out[99]:

matrix([[1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 1.]])
  • Transpose matrix
    • .T

In [103]:

arr = matlib.identity(6)
arr

Out[103]:

matrix([[1., 0., 0., 0., 0., 0.],
        [0., 1., 0., 0., 0., 0.],
        [0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1., 0.],
        [0., 0., 0., 0., 0., 1.]])

In [106]:

a = np.array([[1,2,3],[4,5,6]])
a

Out[106]:

array([[1, 2, 3],
       [4, 5, 6]])

In [107]:

a.T

Out[107]:

array([[1, 4],
       [2, 5],
       [3, 6]])
  • Matrix multiplication
    • numpy.dot(a, b, out=None)

      • a: ndarray
      • b: ndarray
    • For example, each number in the first row of the first matrix (2 and 1) is multiplied by the corresponding number in the first column of the second matrix (1 and 1), and the products are added (2 x 1 + 1 x 1) to give the value 3 in the upper left corner of the result matrix. In general, the value at row m and column n of the result matrix equals the sum of the products of the elements of row m of the first matrix and column n of the second matrix.

    • Background on linear algebra and matrices:

      • https://www.cnblogs.com/alantu2018/p/8528299.html

In [109]:

arr_1 = np.array([[1,2,3],[4,5,6]])  #2 rows and 3 columns
arr_2 = np.array([[1,2,3],[4,5,6]]) 
arr_2 = arr_2.T

In [110]:

arr_1

Out[110]:

array([[1, 2, 3],
       [4, 5, 6]])

In [111]:

arr_2

Out[111]:

array([[1, 4],
       [2, 5],
       [3, 6]])

In [112]:

np.dot(arr_1,arr_2)

Out[112]:

array([[14, 32],
       [32, 77]])
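
The row-times-column rule can be written out with explicit loops and compared against np.dot. This is only an illustrative sketch: matmul_by_hand is a hypothetical helper written for this example, not a NumPy function, and in practice np.dot should be used.

```python
import numpy as np

def matmul_by_hand(a, b):
    """Multiply two 2-D arrays with explicit loops (for illustration only)."""
    rows, inner = a.shape
    inner_b, cols = b.shape
    assert inner == inner_b, "columns of a must equal rows of b"
    result = np.zeros((rows, cols), dtype=a.dtype)
    for m in range(rows):
        for n in range(cols):
            # Sum of products of row m of a and column n of b
            result[m, n] = sum(a[m, k] * b[k, n] for k in range(inner))
    return result

arr_1 = np.array([[1, 2, 3], [4, 5, 6]])   # 2 rows, 3 columns
arr_2 = arr_1.T                            # 3 rows, 2 columns

by_hand = matmul_by_hand(arr_1, arr_2)
print(by_hand)
print(np.array_equal(by_hand, np.dot(arr_1, arr_2)))   # True
```
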
  • Key points:
    • Array creation
    • Indexing and slicing of arrays
    • Concatenation and reshaping of data
    • Aggregation (sum, max, mean) and statistical functions (std()) of numpy
    • How matrix multiplication works

2. Basic operation of pandas

Why learn pandas

  • numpy can already help us with data processing, so what is the point of learning pandas?
    • numpy handles numerical data. Data analysis also involves many other types of data (strings, time series) besides numerical data, and pandas handles those very well.

What is pandas?

  • First, let's get to know two common classes in pandas
    • Series
    • DataFrame

In [8]:

import pandas as pd
from pandas import Series,DataFrame
import numpy as np

Series

  • A Series is an object similar to a one-dimensional array. It consists of two parts:

    • values: a set of data (ndarray type)
    • index: the index labels associated with the data
  • Series creation

    • Created from a list or numpy array
    • Created from a dictionary

In [3]:

s = Series(data=[1,2,3,4,5])
s

Out[3]:

0    1
1    2
2    3
3    4
4    5
dtype: int64
  • Index of a Series
    • Implicit index: the default integer positions
    • Explicit index: improves the readability of the data
      • Specified with the index parameter

In [4]:

s1 = Series(data=[1,2,3],index=['a','b','c'])
s1

Out[4]:

a    1
b    2
c    3
dtype: int64

In [7]:

dic = {
    'Mathematics':100,
    'Synthesis':188
}
s3 = Series(data=dic)
s3

Out[7]:

Mathematics    100
Synthesis      188
dtype: int64

In [10]:

s4 = Series(data=np.random.randint(0,100,size=(3,)))
s4

Out[10]:

0     9
1    91
2    24
dtype: int32
  • Index and slice of Series

In [12]:

s1

Out[12]:

a    1
b    2
c    3
dtype: int64

In [16]:

s1['a']    #explicit index (label)
s1[0]      #implicit index (position)
s1.a       #attribute-style access by label

Out[16]:

1

In [19]:

s1[0:2]        #slice by position: the end position is excluded
s1['a':'c']    #slice by label: the end label is included

Out[19]:

a    1
b    2
c    3
dtype: int64
  • Common properties of Series
    • shape
    • size
    • index
    • values

In [24]:

s1.shape
s1.size
s1.index
s1.values

Out[24]:

array([1, 2, 3], dtype=int64)
  • Common methods of Series
    • head(),tail()
    • unique()
    • isnull(),notnull()
    • add() sub() mul() div()

In [26]:

s1.head(2) #Show only the first two elements
s1.tail(2) #Show only the last two elements

Out[26]:

b    2
c    3
dtype: int64
  • Arithmetic operation of Series

In [27]:

s1 = Series(data=[1,2,3,4],index=['a','b','c','d'])
s2 = Series(data=[1,2,3,4],index=['a','b','e','d'])
s1

Out[27]:

a    1
b    2
c    3
d    4
dtype: int64

In [28]:

s2

Out[28]:

a    1
b    2
e    3
d    4
dtype: int64
  • Series arithmetic rule:
    • Elements whose index labels match are combined; where a label exists in only one Series, the result is NaN

In [29]:

s = s1+s2
s

Out[29]:

a    2.0
b    4.0
c    NaN
d    8.0
e    NaN
dtype: float64
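
If NaN is not wanted for unmatched labels, the add() method listed earlier accepts a fill_value. A minimal sketch using the same s1 and s2 (plain and filled are illustrative names):

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'c', 'd'])
s2 = pd.Series([1, 2, 3, 4], index=['a', 'b', 'e', 'd'])

# Plain + leaves NaN where a label exists on only one side
plain = s1 + s2

# add() with fill_value=0 treats the missing side as 0 instead
filled = s1.add(s2, fill_value=0)
print(filled)
```
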
  • Filtering null (missing) values in a Series
    • isnull, notnull: check whether each element is null

In [30]:

s.isnull()

Out[30]:

a    False
b    False
c     True
d    False
e     True
dtype: bool

In [33]:

#Using implicit and explicit indexes
s[[0,1,2]]
s[['a','c']]

Out[33]:

a    2.0
c    NaN
dtype: float64

In [35]:

s

Out[35]:

a    2.0
b    4.0
c    NaN
d    8.0
e    NaN
dtype: float64

In [36]:

#Use Boolean as index
s[[True,True,False,True,False]]

Out[36]:

a    2.0
b    4.0
d    8.0
dtype: float64

In [37]:

s.notnull()

Out[37]:

a     True
b     True
c    False
d     True
e    False
dtype: bool

In [38]:

s[s.notnull()]

Out[38]:

a    2.0
b    4.0
d    8.0
dtype: float64
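
The boolean filter s[s.notnull()] has built-in shortcuts. The sketch below rebuilds the same Series by hand and compares the filter with dropna(), and shows fillna() as an alternative that keeps every label; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

s = pd.Series([2.0, 4.0, np.nan, 8.0, np.nan], index=list('abcde'))

kept = s[s.notnull()]      # boolean filtering, as above
dropped = s.dropna()       # built-in shortcut for the same result
zeroed = s.fillna(0)       # keep every label, replace NaN with 0

print(kept.equals(dropped))   # True
print(zeroed)
```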

DataFrame

  • A DataFrame is a tabular data structure. It consists of several columns of data arranged in a certain order. It was designed to extend the use of Series from one dimension to multiple dimensions. A DataFrame has both a row index and a column index.

    • Row index: index
    • Column index: columns
    • Values: values
  • Creation of DataFrame

    • Created from an ndarray
    • Created from a dictionary

In [39]:

DataFrame(data=np.random.randint(0,100,size=(4,6)))

Out[39]:

    0   1   2   3   4   5
0  93  24  73  95  46  36
1  17  98   7  13  79  34
2  82  51  52  21   4  50
3  77  23  91  31   6  12

In [42]:

dic = {
    'name':['Zhang San','Li Si','Bachelor'],
    'salary':[10000,20000,15555]
}
df = DataFrame(data=dic,index=['a','b','c'])
df

Out[42]:

        name  salary
a  Zhang San   10000
b      Li Si   20000
c   Bachelor   15555
  • Properties of DataFrame
    • values,columns,index,shape

In [46]:

df.values
df.columns
df.index
df.shape

Out[46]:

(3, 2)

============================================

Exercise 4:

Create a DataFrame named df based on the following exam result table:

             Zhang San  Li Si
Language           150      0
Mathematics        150      0
English            150      0
Synthesis          300      0

============================================

  • DataFrame index operation
    • Index rows
    • Index columns
    • Index elements

In [49]:

df

Out[49]:

        name  salary
a  Zhang San   10000
b      Li Si   20000
c   Bachelor   15555

In [51]:

#Take out the first column
df['name']

Out[51]:

a    Zhang San
b        Li Si
c     Bachelor
Name: name, dtype: object

In [52]:

#Take out multiple columns
df[['name','salary']]

Out[52]:

        name  salary
a  Zhang San   10000
b      Li Si   20000
c   Bachelor   15555

In [56]:

#Take out a row
df.loc['a']

Out[56]:

name      Zhang San
salary        10000
Name: a, dtype: object

In [54]:

#Take multiple rows
df.loc[['a','c']]

Out[54]:

        name  salary
a  Zhang San   10000
c   Bachelor   15555

In [58]:

df.iloc[[1,2]]

Out[58]:

        name  salary
b      Li Si   20000
c   Bachelor   15555
  • loc[explicit index]: select by label
  • iloc[implicit index]: select by integer position

In [59]:

df

Out[59]:

        name  salary
a  Zhang San   10000
b      Li Si   20000
c   Bachelor   15555

In [61]:

#Take a single element (Li Si's salary)
df.iloc[1,1]
df.loc['b','salary']

Out[61]:

20000

In [62]:

#Take multiple elements
df.loc[['a','c'],'salary']

Out[62]:

a    10000
c    15555
Name: salary, dtype: int64
  • Slicing operation of DataFrame
    • Slice rows
    • Slice columns

In [63]:

#Slice out the first two rows
df[0:2]

Out[63]:

        name  salary
a  Zhang San   10000
b      Li Si   20000

In [65]:

#Slice out the first two columns
df.iloc[:,0:2]

Out[65]:

        name  salary
a  Zhang San   10000
b      Li Si   20000
c   Bachelor   15555
  • Summary of indexing and slicing
    • Indexing:
      • df[col]: take a single column
      • df[[col1,col2]]: take multiple columns
      • df.loc[row]: take a single row
      • df.loc[[row1,row2]]: take multiple rows
      • df.loc[row,col]: take a single element
    • Slicing:
      • Slice rows: df[row1:row3]
      • Slice columns: df.loc[:,col1:col3]
  • Operations on a DataFrame: the same as on a Series

============================================

Practice:

  1. Suppose ddd is the mid-term exam result and ddd2 is the final exam result. Create ddd2 freely, add it to ddd, and find the average of the mid-term and final results.
  2. Suppose Zhang San was caught cheating in the mid-term mathematics exam and his score must be recorded as 0. How can this be done?
  3. Li Si reported Zhang San's cheating and is rewarded with 100 points in every subject of the mid-term exam. How can this be done?
  4. Later, the teacher found a mistake in one question. To calm the students down, every student is given 10 extra points in every subject. How can this be done?

============================================

  • Conversion of time data type

    • pd.to_datetime(col)
  • Set a column as the row index

    • df.set_index()
  • Stock exercise:

    • Use the tushare package to get the historical market data of a stock.
      • tushare financial data interface package: provides various historical financial trading data
      • Install tushare: pip install tushare
    • Output all dates on which the stock closed more than 3% higher than it opened.
    • Output all dates on which the stock opened more than 2% lower than the previous day's close.
    • If I buy one share of the stock on the first trading day of each month starting from January 1, 2010, and sell all holdings on the last trading day of each year, what is my return to date?
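
The first two stock questions come down to boolean filtering, plus shift(1) to line each day up with the previous day's close. The sketch below is one possible approach, not the notebook's own solution. It uses a tiny table of invented prices (not real market data) so that it runs without tushare; the column names open and close match the data shown in this section.

```python
import pandas as pd

# Toy prices invented for this sketch, not real market data
toy = pd.DataFrame(
    {'open':  [10.0, 10.0, 10.5, 9.0],
     'close': [10.5, 10.1, 10.0, 9.1]},
    index=pd.to_datetime(['2020-01-02', '2020-01-03',
                          '2020-01-06', '2020-01-07']),
)

# Days that closed more than 3% above the open
up_days = toy.loc[(toy['close'] - toy['open']) / toy['open'] > 0.03].index

# Days that opened more than 2% below the previous day's close;
# shift(1) moves each close down one row so it lines up with the next day
prev_close = toy['close'].shift(1)
gap_down_days = toy.loc[(toy['open'] - prev_close) / prev_close < -0.02].index

print(list(up_days.strftime('%Y-%m-%d')))        # ['2020-01-02']
print(list(gap_down_days.strftime('%Y-%m-%d')))  # ['2020-01-03', '2020-01-07']
```

The first row has no previous close, so its comparison is against NaN and it is never selected.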

In [70]:

import tushare as ts
df = ts.get_k_data('600519',start='2000-01-01')

In [72]:

#Write to file
df.to_csv('./maotai.csv')

In [74]:

#Read local data to df
df = pd.read_csv('./maotai.csv')
df.head(5)

Out[74]:

   Unnamed: 0        date   open  close   high    low     volume    code
0           0  2001-08-27  5.392  5.554  5.902  5.132  406318.00  600519
1           1  2001-08-28  5.467  5.759  5.781  5.407  129647.79  600519
2           2  2001-08-29  5.777  5.684  5.781  5.640   53252.75  600519
3           3  2001-08-30  5.668  5.796  5.860  5.624   48013.06  600519
4           4  2001-08-31  5.804  5.782  5.877  5.749   23231.48  600519

In [79]:

#Delete the useless column (in the drop family of functions, axis=0 means rows and axis=1 means columns)
df.drop(labels='Unnamed: 0',axis=1,inplace=True) #inplace=True modifies the original DataFrame in place
  • df.info():
    • Returns summary information about df
      • Number of rows
      • Data type of each column
      • Whether any column has missing data

In [83]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4385 entries, 0 to 4384
Data columns (total 7 columns):
date      4385 non-null object
open      4385 non-null float64
close     4385 non-null float64
high      4385 non-null float64
low       4385 non-null float64
volume    4385 non-null float64
code      4385 non-null int64
dtypes: float64(5), int64(1), object(1)
memory usage: 239.9+ KB

In [88]:

#Convert the date column to a datetime type
df['date'] = pd.to_datetime(df['date'])

In [90]:

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4385 entries, 0 to 4384
Data columns (total 7 columns):
date      4385 non-null datetime64[ns]
open      4385 non-null float64
close     4385 non-null float64
high      4385 non-null float64
low       4385 non-null float64
volume    4385 non-null float64
code      4385 non-null int64
dtypes: datetime64[ns](1), float64(5), int64(1)
memory usage: 239.9 KB

In [94]:

#Use the date column as the row index of the source data
df.set_index('date',inplace=True)
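
The conversion and set_index steps above can also be done while reading the file, using the parse_dates and index_col parameters of pd.read_csv. The sketch below shows this alternative on an in-memory two-row stand-in for maotai.csv (values copied from the table shown earlier), so it runs without the real file.

```python
import io
import pandas as pd

# A two-row stand-in for maotai.csv
csv_text = """date,open,close
2001-08-27,5.392,5.554
2001-08-28,5.467,5.759
"""

# parse_dates converts the column while reading; index_col makes it the index
df = pd.read_csv(io.StringIO(csv_text), parse_dates=['date'], index_col='date')

print(type(df.index).__name__)   # DatetimeIndex
print(df.loc[pd.Timestamp('2001-08-27'), 'close'])
```
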

Tags: Python pip

Posted on Mon, 13 Jan 2020 02:42:18 -0500 by mady