2-5 Pandas String Operations

Besides numeric data, much of the data we work with is text, and it matters just as much, so this section covers string processing in pandas.

# -*- coding: utf-8 -*-
# Author: XiangLin
# Creation time: 11/02/2020 20:15
# File: 2-5_Pandas_String_Operations.py
# IDE      :PyCharm
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import warnings
warnings.filterwarnings('ignore')
plt.style.use("bmh")
plt.rc('font', family='SimHei', size=25) #Show Chinese
pd.set_option('display.max_columns',1000)
pd.set_option('display.width', 1000)
pd.set_option('display.max_colwidth',1000)

As we saw earlier, pandas handles numeric types like a fish in water. Here is a secret: it is just as capable with strings.
Let's read in a copy of the weather data.

weather_2012 = pd.read_csv('weather_2012.csv',parse_dates=True,index_col='Date/Time')
print(weather_2012.head())
print(weather_2012.index)
Output:
                     Temp (C)  Dew Point Temp (C)  Rel Hum (%)  Wind Spd (km/h)  Visibility (km)  Stn Press (kPa)               Weather
Date/Time                                                                                                                              
2012-01-01 00:00:00      -1.8                -3.9           86                4              8.0           101.24                   Fog
2012-01-01 01:00:00      -1.8                -3.7           87                4              8.0           101.24                   Fog
2012-01-01 02:00:00      -1.8                -3.4           89                7              4.0           101.26  Freezing Drizzle,Fog
2012-01-01 03:00:00      -1.5                -3.2           88                6              4.0           101.27  Freezing Drizzle,Fog
2012-01-01 04:00:00      -1.5                -3.3           88                7              4.8           101.23                   Fog
DatetimeIndex(['2012-01-01 00:00:00', '2012-01-01 01:00:00', '2012-01-01 02:00:00', '2012-01-01 03:00:00', '2012-01-01 04:00:00', '2012-01-01 05:00:00', '2012-01-01 06:00:00', '2012-01-01 07:00:00', '2012-01-01 08:00:00', '2012-01-01 09:00:00',
               ...
               '2012-12-31 14:00:00', '2012-12-31 15:00:00', '2012-12-31 16:00:00', '2012-12-31 17:00:00', '2012-12-31 18:00:00', '2012-12-31 19:00:00', '2012-12-31 20:00:00', '2012-12-31 21:00:00', '2012-12-31 22:00:00', '2012-12-31 23:00:00'], dtype='datetime64[ns]', name='Date/Time', length=8784, freq=None)              

5.1 String Operation

You can see from the data above that there is a column called 'Weather'. Let's treat an observation as snowy when its description contains the word 'Snow'.

The pandas str accessor provides a set of convenient vectorized string functions, such as contains, used here:

weather_description = weather_2012['Weather']
is_snowing = weather_description.str.contains('Snow')
# contains returns a boolean Series: True where the description mentions 'Snow'
# Cast it to int so it prints and plots as 0/1
print(is_snowing.astype(int)[:5])
is_snowing.astype(int).plot(figsize = (20,6))
plt.show()

w = weather_2012.loc[weather_2012['Weather'].str.contains('Snow'),'Weather'].head()
print(w)
Output:
Date/Time
2012-01-01 00:00:00    0
2012-01-01 01:00:00    0
2012-01-01 02:00:00    0
2012-01-01 03:00:00    0
2012-01-01 04:00:00    0
Name: Weather, dtype: int32
Date/Time
2012-01-02 17:00:00    Snow Showers
2012-01-02 20:00:00    Snow Showers
2012-01-02 21:00:00    Snow Showers
2012-01-02 23:00:00    Snow Showers
2012-01-03 00:00:00    Snow Showers
Name: Weather, dtype: object
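
To see how `str.contains` behaves in isolation, here is a minimal sketch on a made-up Series. The values are hypothetical and not taken from weather_2012.csv. It shows the plain substring match, a case-insensitive match via `case=False`, and `na=False`, which treats a missing description as a non-match.

```python
import pandas as pd

# Hypothetical weather descriptions, invented for illustration only
descriptions = pd.Series(
    ["Fog", "Snow Showers", "Freezing Drizzle,Fog", None, "Rain,Snow"],
    name="Weather",
)

# Plain substring match; na=False maps the missing value to False
has_snow = descriptions.str.contains("Snow", na=False)

# Case-insensitive variant of the same test
has_snow_any_case = descriptions.str.contains("snow", case=False, na=False)

# A boolean mask can be used directly to filter the Series
snowy = descriptions[has_snow]

print(has_snow.tolist())
print(snowy.tolist())
```

Without `na=False`, the missing entry would stay missing in the result, and the mask could then not be used for filtering as-is.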

5.2 Average Temperature

If we want to know the median temperature for each month, there is a handy method called resample():

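# Group the hourly readings into month-end bins and take the median of each bin
# ('M' is the month-end alias; pandas 2.2 and later spell it 'ME')
weather_med = weather_2012['Temp (C)'].resample('M').median()
print(weather_med)
weather_med.plot(figsize = (20,10),kind = 'bar')
plt.show()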
Output:
Date/Time
2012-01-31    -7.05
2012-02-29    -4.10
2012-03-31     2.60
2012-04-30     6.30
2012-05-31    16.05
2012-06-30    19.60
2012-07-31    22.90
2012-08-31    22.20
2012-09-30    16.10
2012-10-31    11.30
2012-11-30     1.05
2012-12-31    -2.85
Freq: M, Name: Temp (C), dtype: float64


As expected, July and August are the hottest months.
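
As a rough sketch of what the monthly resampling does, the example below builds a small made-up daily series and takes the median of each calendar month. The data is hypothetical. The snippet uses the month-start alias `'MS'` rather than the month-end alias from the text, purely as an assumption so that it runs unchanged across pandas versions (the month-end alias was renamed from `'M'` to `'ME'` in newer releases); the bins are the same calendar months, only the labels differ.

```python
import pandas as pd

# Hypothetical daily readings for January and February 2021 (31 + 28 days)
index = pd.date_range("2021-01-01", "2021-02-28", freq="D")

# Use the day of the month as a stand-in "temperature" so the medians are easy to check
temps = pd.Series(index.day.astype(float), index=index, name="Temp (C)")

# One bin per calendar month, labelled by the first day of the month
monthly_median = temps.resample("MS").median()

print(monthly_median)
```

January holds the values 1 to 31, so its median is 16.0; February holds 1 to 28, so its median is 14.5.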

Boolean True and False values are awkward to aggregate directly. Since they correspond to 1 and 0, we can convert them to float first:

print(is_snowing.astype(float)[:5])
Output:
Date/Time
2012-01-01 00:00:00    0.0
2012-01-01 01:00:00    0.0
2012-01-01 02:00:00    0.0
2012-01-01 03:00:00    0.0
2012-01-01 04:00:00    0.0
Name: Weather, dtype: float64
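
The reason this conversion is useful can be shown with a tiny sketch: once True/False become 1.0/0.0, the mean of the values is simply the fraction of True entries. The flags below are hypothetical.

```python
import pandas as pd

# Hypothetical "is it snowing" flags, invented for illustration
flags = pd.Series([True, False, False, True, False])

as_float = flags.astype(float)   # True -> 1.0, False -> 0.0
fraction_true = as_float.mean()  # 2 of the 5 flags are True

print(as_float.tolist())
print(fraction_true)
```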

Then we use resample again to find the fraction of hourly observations in each month that report snow (it may feel like a dull exercise, but now we know how snowy each month is...)

# The mean of the 0/1 values is the fraction of snowy hours in each month
me_snow = is_snowing.astype(float).resample('M').mean()
print(me_snow)
me_snow.plot(figsize = (20,10),kind = 'bar')
plt.show()
Output:
Date/Time
2012-01-31    0.240591
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
Freq: M, Name: Weather, dtype: float64


So, as you can see, December is the snowiest month in this Canadian weather data (snow is reported in about 25% of the hourly observations), with January close behind. There are other clues too: snow arrives abruptly in November and then lingers for a long time. Although the proportion of snowy hours falls steadily after January, it does not reach zero until May.

5.3 Draw temperature and snow

We put temperature and snow proportion together as two columns of a DataFrame and draw a graph.

# Monthly median temperature and monthly fraction of snowy hours
temperature = weather_2012['Temp (C)'].resample('M').median()
is_snowing = weather_2012['Weather'].str.contains('Snow')
snowiness = is_snowing.astype(float).resample('M').mean()
# Name the column
temperature.name = "Temperature"
snowiness.name = "Snowiness"
print(temperature)
print(snowiness)
Output:
Date/Time
2012-01-31    -7.05
2012-02-29    -4.10
2012-03-31     2.60
2012-04-30     6.30
2012-05-31    16.05
2012-06-30    19.60
2012-07-31    22.90
2012-08-31    22.20
2012-09-30    16.10
2012-10-31    11.30
2012-11-30     1.05
2012-12-31    -2.85
Freq: M, Name: Temperature, dtype: float64
Date/Time
2012-01-31    0.240591
2012-02-29    0.162356
2012-03-31    0.087366
2012-04-30    0.015278
2012-05-31    0.000000
2012-06-30    0.000000
2012-07-31    0.000000
2012-08-31    0.000000
2012-09-30    0.000000
2012-10-31    0.000000
2012-11-30    0.038889
2012-12-31    0.251344
Freq: M, Name: Snowiness, dtype: float64

Use concat with axis=1 to join the two Series side by side as the columns of a new DataFrame:

stats = pd.concat([temperature,snowiness],axis=1)
print(stats)
stats.plot(figsize = (20,10),kind = "bar")
plt.show()
Output:
            Temperature  Snowiness
Date/Time                         
2012-01-31        -7.05   0.240591
2012-02-29        -4.10   0.162356
2012-03-31         2.60   0.087366
2012-04-30         6.30   0.015278
2012-05-31        16.05   0.000000
2012-06-30        19.60   0.000000
2012-07-31        22.90   0.000000
2012-08-31        22.20   0.000000
2012-09-30        16.10   0.000000
2012-10-31        11.30   0.000000
2012-11-30         1.05   0.038889
2012-12-31        -2.85   0.251344
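
A minimal sketch of what `pd.concat(..., axis=1)` does with the Series names and indexes, using made-up numbers rather than the real monthly values: each Series becomes one column named after the Series, and rows are aligned on the index. The second Series deliberately lacks one date to show that the gap is filled with NaN.

```python
import pandas as pd

# Hypothetical month-end dates and values, for illustration only
dates = pd.to_datetime(["2021-01-31", "2021-02-28", "2021-03-31"])
temperature = pd.Series([-5.0, -2.0, 3.0], index=dates, name="Temperature")

# This Series has no March entry, to show index alignment
snowiness = pd.Series([0.30, 0.20], index=dates[:2], name="Snowiness")

# axis=1 places the Series side by side as columns
stats = pd.concat([temperature, snowiness], axis=1)

print(stats)
```

In the weather example both Series share exactly the same twelve month-end dates, so no missing values appear there.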


What on earth? The red snowiness bars are barely visible!
That is because the two columns have very different ranges (temperature runs from about -7 to 23, snowiness from 0 to about 0.25), so draw them in separate subplots.

# kind is case-sensitive: it must be 'bar', not 'Bar'
stats.plot(kind='bar', subplots=True, figsize=(15, 10))
plt.show()


Data link: https://pan.baidu.com/s/1caOMOZO0y5xOQD1mSjyLiA
Extraction Code: rpxn
Online Data Mining Algorithms from July
Xianglin
February 12, 2020, Chengkou, Chongqing
Learn well, go up every day, and get results


Tags: Pycharm

Posted on Tue, 11 Feb 2020 22:54:36 -0500 by funkyres