Datawhale team learning - Pandas - task two

Happy to learn pandas, you can see joyful pandas tutorial

joyful-pandas

1. Questions

[question 1] what is the difference between str object method and df/Series object method?

​ str.replace For object type or string type, the default operation is regular expression. Currently, it is not supported on DataFrame

Replace is for any type of sequence or data box. If you want to replace it with regular expression, you need to set regex=True. This method can support multi column replacement through dictionary

[question 2] given a column of string type, how to determine whether the cell is numerical data?

You can match with regular expressions

str.contains(r'-?([0-9]\d*)(\.\d+)?$')

[question 3] what is the function of rsplit? Where does it apply?

Method to split a string by specifying a separator and return a list

[question 4] in sections 2 to 4 of this chapter, 5 types of string operations are introduced respectively. Please think about the scenarios in which they are applied respectively?

Split: when a string series has the same separator, split can be used to separate

Splicing: it is understood as the reverse operation of splitting. When you want to use a symbol as a separator to connect several strings

Replacement: it is very useful in data cleaning. You can use regular expressions to remove some unnecessary characters, numbers, etc

String matching: it has the same effect, but it doesn't need to be extracted, just to see if there is one.

String extraction: when a string series has certain rules, it can be used to extract a certain part, for example, year, month, day, year, which is to extract the pattern you need

2. Practice

[exercise 1] there is a dataset about strings, please solve the following problems:

(a) The string code stores the personnel information (ID column is added after the number) in the following format: "XX XX (name): × Chinese, gender XX, born on XX, mm, DD, yyyy"

df['full name'].str.cat(df['nationality'],sep=':').str.cat(df['Gender'],sep='Chinese, gender').str.cat(df['Year of birth'],sep=',Born in').str.cat(df['Birth month'],sep='year').str.cat(df['Date of birth'],sep='month')+'day'

(b) Change the birthday information of the person in (a) to Chinese (for example, October 23, 1974), and the rest of the return format remains unchanged.

Don't do it. Analyze the big guy's answer

L_year = list('zero one two three four five six seven eight nine')
L_one = [s.strip() for s in list('  two three four five six seven eight nine')]
L_two = [s.strip() for s in list(' one two three four five six seven eight nine')]

Year part:

df['Year of birth'].str.replace(r'\d',lambda x:L_year[int(x.group(0))])
Personnel No
 1942
 1985
 3 1946
 1999
 52010
        ... 
1996 1984
 1997 1943
 1998 2018
 1999 2005
 2000 1962
 Name: year of birth, Length: 2000, dtype: object

Month part:

The whole is

1. Convert to two digits first (because there is 10)

2. Convert all two digits to null, so there must be two blanks before character group one

3. First, use bool to determine whether the character group One (0-9) is satisfied in the string, str.replace Converts non STR types to NAN

4. The latter is obvious.

df_new = (df['full name']+':'+df['nationality']+'Chinese, gender'+df['Gender']+',Born in'
          +df['Year of birth'].str.replace(r'\d',lambda x:L_year[int(x.group(0))])+'year'
          +df['Birth month'].apply(lambda x:x if len(x)==2 else '0'+x)\
                      .str.replace(r'(?P<one>[\d])(?P<two>\d?)',lambda x:L_one[int(x.group('one'))]
                      +bool(int(x.group('one')))*'Ten'+L_two[int(x.group('two'))])+'month'
          +df['Date of birth'].apply(lambda x:x if len(x)==2 else '0'+x)\
                      .str.replace(r'(?P<one>[\d])(?P<two>\d?)',lambda x:L_one[int(x.group('one'))]
                      +bool(int(x.group('one')))*'Ten'+L_two[int(x.group('two'))])+'day')\
          .to_frame().rename(columns={0:'ID'})
df_new.head()

(c) Split the ID column result in (b) into 5 columns corresponding to the original list, and use equals to verify the consistency.

It's simpler here, so we won't analyze it

dic_year = {i[0]:i[1] for i in zip(list('zero one two three four five six seven eight nine'),list('0123456789'))}
dic_two = {i[0]:i[1] for i in zip(list('Eleven two three four five six seven eight nine'),list('0123456789'))}
dic_one = {'Ten':'1','twenty':'2','thirty':'3',None:''}
df_res = df_new['ID'].str.extract(r'(?P<full name>[a-zA-Z]+):(?P<nationality>[\d])Chinese, gender(?P<Gender>[\w]),Born in(?P<Year of birth>[\w]{4})year(?P<Birth month>[\w]+)month(?P<Date of birth>[\w]+)day')
df_res['Year of birth'] = df_res['Year of birth'].str.replace(r'(\w)+',lambda x:''.join([dic_year[x.group(0)[i]] for i in range(4)]))
df_res['Birth month'] = df_res['Birth month'].str.replace(r'(?P<one>\w?Ten)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')]).str.replace(r'0','10')
df_res['Date of birth'] = df_res['Date of birth'].str.replace(r'(?P<one>\w?Ten)?(?P<two>[\w])',lambda x:dic_one[x.group('one')]+dic_two[x.group('two')]).str.replace(r'^0','10')
df_res.head()

Exercise two] novel coronavirus is included in the first half of the data set. The first column contains some news headlines for new coronavirus.

(a) Select all the lines about the news headlines of Beijing and Shanghai.

(b) Find the mean value of col2.

(c) Find the mean value of col3.

pd.read_csv('data/String_data_two.csv').head()
col1 col2 col3
0 The second confirmed patient in Ordos City was cured and discharged 19 363.6923
1 2 new cases in Yunnan, 124 in total -67 -152.281
2 14 infectious medical staff discharged from Wuhan Union Medical College Hospital -86 325.6221
3 9 new cases in Shandong, 307 in total -74 -204.9313
4 School opening date in Shanghai extended to March -95 4.05
df=pd.read_csv('data/String_data_two.csv')
df[df['col1'].str.contains('Beijing|Shanghai')]
# First use astype to convert to int, you will be prompted with an error, see where it is wrong, and then use the following code
df[df['col2'].str.contains('0-')]
col1 col2 col3
309 2 new cases were found in Tianjin, 34 cases were confirmed 0- -376.6524
df.loc[[309,396,485],'col2']=[0,9,7]
df['col2'].astype('int').mean()
-0.984
df.astype('string')
col1 col2 col3
0 The second confirmed patient in Ordos City was cured and discharged 19 363.6923
1 2 new cases in Yunnan, 124 in total -67 -152.281
2 14 infectious medical staff discharged from Wuhan Union Medical College Hospital -86 325.6221
3 9 new cases in Shandong, 307 in total -74 -204.9313
4 School opening date in Shanghai extended to March -95 4.05
... ... ... ...
495 34 new cases in Sichuan, 142 in total -55 55.8904
496 20 new cases in Guangxi, 78 in total -99 133.6509
497 17 new cases in Hebei, 65 in total -54 69.6604
498 1737 newly diagnosed cases were found in 31 provinces, regions and cities in China, with a total of 7711 cases 70 -336.9622
499 5 new cases in Shanghai, 101 in total -95 157.951

500 rows × 3 columns

df.columns=df.columns.str.strip()
df['col3'][~(df['col3'].str.replace(r'-?\d+\.?\d+','True')=='True')]
28      355`.3567
37             -5
73              1
122    9056.\2253
332    3534.6554{
370             7
Name: col3, dtype: object
df.loc[[28,122,332],'col3'] = [355.3567,9056.2253, 3534.6554]
df['col3'].astype(float).mean()
24.707484999999988

Tags: Lambda REST

Posted on Fri, 26 Jun 2020 04:04:18 -0400 by mATOK