Data mining -- common data preprocessing

0. Read and write data

(1) Read txt file

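    # Read a comma-delimited txt file, then left-join it with two other tables
    import pandas as pd
    data = pd.read_table('1.txt', delimiter=',', dtype={'id': 'int32', 'index': 'int8'})
    # data2 and data3 are other DataFrames, loaded the same way
    # (named merged rather than all, which would shadow the Python builtin)
    merged = data.merge(data2, on='id', how='left').merge(data3, on=['id', 'index'], how='left')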
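To make the read-then-merge pattern easier to try out, here is a minimal sketch that swaps the files for in-memory text. The three tables, their columns, and their values are hypothetical stand-ins; only the `read_table` and `merge` calls mirror the snippet.

```python
import io
import pandas as pd

# Hypothetical stand-ins for three comma-delimited txt files
txt1 = io.StringIO("id,index,value\n1,0,10\n2,1,20\n3,0,30\n")
txt2 = io.StringIO("id,city\n1,A\n2,B\n")
txt3 = io.StringIO("id,index,score\n1,0,0.5\n3,0,0.9\n")

dtypes = {'id': 'int32', 'index': 'int8'}
data = pd.read_table(txt1, delimiter=',', dtype=dtypes)
data2 = pd.read_table(txt2, delimiter=',', dtype={'id': 'int32'})
data3 = pd.read_table(txt3, delimiter=',', dtype=dtypes)

# Left joins keep every row of data; keys with no match get NaN
merged = data.merge(data2, on='id', how='left').merge(data3, on=['id', 'index'], how='left')
print(merged)
```

With `how='left'`, the row count of the first table is preserved as long as the join keys are unique in the right-hand tables.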

(2) Read and save list file

Note: a list written to a txt file must first be converted to a string, and it is read back as a string as well. To keep the list structure intact, convert the list to an np.array and store it with np.save.

    # Save
    import numpy as np
    a = np.array(test_courier_ids)
    np.save('test_courier_ids.npy', a)
    # Load
    a = np.load('test_courier_ids.npy')
    test_courier_ids = a.tolist()

(3) Save dataframe to csv

    # Save the DataFrame as a csv file, without the row index
    data.to_csv(path_or_buf="data.csv", index=False)

1. Sort

DataFrame.sort_values(by='##',axis=0,ascending=True, inplace=False, na_position='last')

Parameter description:
by: the column name(s) to sort by when axis=0 or 'index', or the index label(s) to sort by when axis=1 or 'columns'
axis: with axis=0 or 'index', rows are ordered by the values in the specified column; with axis=1 or 'columns', columns are ordered by the values in the specified row. The default is axis=0
ascending: whether to sort in ascending order. The default is True, that is, ascending order
inplace: whether to modify the original DataFrame instead of returning a sorted copy. The default is False, that is, the original is left unchanged
na_position: {'first', 'last'}, where missing values are placed. The default is 'last'
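The parameters can be seen in action with a small sketch. The DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing score
df = pd.DataFrame({'name': ['a', 'b', 'c', 'd'],
                   'score': [3.0, np.nan, 1.0, 2.0]})

# Defaults: ascending order, missing values placed last
asc = df.sort_values(by='score')

# Descending order, missing values placed first
desc_na_first = df.sort_values(by='score', ascending=False, na_position='first')

print(asc)
print(desc_na_first)
```

Because `inplace` defaults to False, `df` itself keeps its original row order in both calls.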

2. Append new rows or columns to a DataFrame

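    import pandas as pd
    # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
    data = pd.DataFrame()
    a = {"x": 1, "y": 2}  # one row
    data = pd.concat([data, pd.DataFrame([a])], ignore_index=True)
    print(data)
    a = [[1, 2, 3], [4, 5, 6]]  # two rows; the new columns are named 0, 1, 2
    data = pd.concat([data, pd.DataFrame(a)], ignore_index=True)
    a = [[7, 8, 9], [10, 11, 12]]
    data = pd.concat([data, pd.DataFrame(a)], ignore_index=True)
    print(data)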

3. Combining data: merge and join

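Note: join aligns rows on the row index by default, so it is the natural choice when the DataFrames share the same index (for example, the same rows in the same order); use merge to combine on column values.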

Reference: https://blog.csdn.net/winnertakeall/article/details/86662669
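As a rough illustration of the difference between the two methods, the sketch below runs the same left join both ways. The two tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables sharing an 'id' key
left = pd.DataFrame({'id': [1, 2, 3], 'x': [10, 20, 30]})
right = pd.DataFrame({'id': [2, 3, 4], 'y': ['b', 'c', 'd']})

# merge matches rows on the values of a column
by_column = left.merge(right, on='id', how='left')

# join matches rows on the index, so the key is moved into the index first
by_index = left.set_index('id').join(right.set_index('id'), how='left')

print(by_column)
print(by_index)
```

Both results hold the same data; the only difference is whether `id` ends up as a regular column or as the index.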

4. Sorting within groups

    df.groupby('B', group_keys=False).apply(lambda x: x.sort_values('C', ascending=False))

    [Out]
       A  B       C
    3  4  a  201003
    0  2  a  200801
    1  3  b  200902
    2  5  b  200704
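The same ordering can usually be obtained without `apply` by sorting on the group column and the value column together. This is an alternative sketch, not the method used in the snippet; the DataFrame is rebuilt from the values in the printed output.

```python
import pandas as pd

# Rows reconstructed from the output shown above (original index 0..3)
df = pd.DataFrame({'A': [2, 3, 5, 4],
                   'B': ['a', 'b', 'b', 'a'],
                   'C': [200801, 200902, 200704, 201003]})

# Sort by group key ascending, then by C descending within each group
sorted_df = df.sort_values(['B', 'C'], ascending=[True, False])

# Once sorted, the top row of each group is easy to take
top1 = sorted_df.groupby('B').head(1)

print(sorted_df)
print(top1)
```

A plain multi-column sort tends to be faster than `groupby(...).apply(...)` on large tables, since it avoids calling a Python function once per group.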

5. Initializing a new column

Assign the same value to every row of the new column:

day1_all_train_data['date']='0201'

Assign values conditionally with a lambda: lambda x: value if condition else (...)

sourcedf['region']=sourcedf['exam_district'].apply(lambda x:"whole country" if x==1 else ("Beijing" if x==3 else("Shanghai" if x==24 else "Other regions")) )
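When the nested conditions are plain value lookups, a dictionary passed to `map` may be easier to read than a chain of `if ... else`. The sketch below reuses the district codes and region names from the lambda; the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample rows; codes 1, 3 and 24 come from the lambda above
sourcedf = pd.DataFrame({'exam_district': [1, 3, 24, 7]})

region_names = {1: 'whole country', 3: 'Beijing', 24: 'Shanghai'}

# Codes missing from the dictionary become NaN, which fillna turns into the default
sourcedf['region'] = sourcedf['exam_district'].map(region_names).fillna('Other regions')

print(sourcedf)
```

Adding a new region then only means adding one dictionary entry.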

6. Too many columns in a pandas DataFrame: output is truncated with an ellipsis

Solution: set the maximum number of columns to display

pd.set_option('display.max_columns',20)

Similarly, you can set how many rows to display:

pd.set_option('display.max_rows',10)
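If the wider display is only needed for one print, `pd.option_context` applies the options temporarily and restores them afterwards. A minimal sketch, with a made-up 30-column DataFrame and an arbitrary limit of 50:

```python
import pandas as pd

# Hypothetical wide table: one row, 30 columns named c0..c29
wide = pd.DataFrame([range(30)], columns=[f'c{i}' for i in range(30)])

default_cols = pd.get_option('display.max_columns')

# The options apply only inside the with block
with pd.option_context('display.max_columns', 50, 'display.max_rows', 10):
    inside = pd.get_option('display.max_columns')
    shown = repr(wide)

print(shown)
```

Outside the block, `display.max_columns` is back to whatever it was before, so other output is unaffected.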

7. Encode a column

Label (integer) encoding: 1,2,3,4,5

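    from sklearn import preprocessing
    # x is the name of the column to encode; LabelEncoder assigns codes starting from 0
    label = preprocessing.LabelEncoder()
    data_raw[x] = label.fit_transform(data_raw[x])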

    # Encode the feature values as consecutive integers starting from 0 (or 1).
    # For example, if color takes three values, they are encoded as 1, 2, 3.
    colorMap = {...}  # dict mapping each color value to its integer code
    df['color'] = df['color'].map(colorMap)

One-hot encoding: 001 010 100

    # Expand every distinct value of a field into its own column;
    # each row has 1 in the column that matches its value
    data1 = pd.get_dummies(df[["color"]])
    # One-hot for multiple features: df[[fea1, fea2, ...]]
    # To merge the one-hot columns back into the original data, join them directly
    res = df.join(data1)
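The two encodings can be compared side by side on a tiny example. The color values and the mapping below are hypothetical, chosen only to make the result easy to check.

```python
import pandas as pd

# Hypothetical data: three distinct colors
df = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# Integer encoding through an explicit mapping (assumed values)
color_map = {'red': 1, 'green': 2, 'blue': 3}
df['color_code'] = df['color'].map(color_map)

# One-hot encoding: one 0/1 column per distinct color
onehot = pd.get_dummies(df[['color']], dtype=int)

# Join the one-hot columns back onto the original rows by index
res = df.join(onehot)

print(res)
```

Integer codes imply an order (1 < 2 < 3) that colors do not have, which is the usual reason to prefer one-hot columns for unordered categories.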

8. Convert between Unix timestamps and ordinary date-times

http://tool.chinaz.com/Tools/unixtime.aspx
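The conversion can also be done directly in pandas. The sketch below assumes the timestamps are in seconds; the sample values are chosen for illustration, and the resulting date-times are in UTC.

```python
import pandas as pd

# Hypothetical Unix timestamps in seconds
ts = pd.Series([0, 1580515200])

# Timestamp -> datetime (UTC, without timezone information)
dt = pd.to_datetime(ts, unit='s')

# Datetime -> timestamp: whole seconds elapsed since the epoch
back = (dt - pd.Timestamp('1970-01-01')) // pd.Timedelta(seconds=1)

print(dt)
print(back)
```

For millisecond timestamps, `unit='ms'` and `pd.Timedelta(milliseconds=1)` would be used instead.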

9. pandas: replace values in one column based on a condition on another

    # Where condition > 0.99, replace the value of old with the value of new
    df['old'] = df['old'].mask(df['condition'] > 0.99, df['new'])

https://blog.csdn.net/mym_74/article/details/102887459
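A small worked example of the replacement, together with an equivalent form using `.loc`. The column names follow the snippet; the values are made up.

```python
import pandas as pd

# Hypothetical data: rows 1 and 2 satisfy condition > 0.99
df = pd.DataFrame({'old': [1, 2, 3],
                   'new': [10, 20, 30],
                   'condition': [0.5, 0.995, 1.0]})
df2 = df.copy()

# mask replaces values where the condition is True
df['old'] = df['old'].mask(df['condition'] > 0.99, df['new'])

# Equivalent assignment with a boolean row selector
df2.loc[df2['condition'] > 0.99, 'old'] = df2['new']

print(df)
```

`mask` returns a new Series, while the `.loc` form modifies the DataFrame in place.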

10. Cast

    # Cast a single column
    df[' Min Humidity'] = df[' Min Humidity'].astype('float64')
    # Cast several columns at once with a dict of column name -> dtype
    df = df.astype({'Max Humidity': 'float64', 'Max Dew PointF': 'float64'})
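`astype` raises an error if any value cannot be converted, and column names with stray leading spaces (such as ' Min Humidity') are easy to mistype. The sketch below shows one way to handle both; the data is hypothetical, and the '-' placeholder for a missing reading is an assumption.

```python
import pandas as pd

# Hypothetical raw data: values read as strings, one unparseable entry
df = pd.DataFrame({' Min Humidity': ['40', '35', '-'],
                   'Max Humidity': ['93', '87', '90']})

# Remove leading and trailing spaces from the column names
df.columns = df.columns.str.strip()

# Clean column: a direct cast works
df['Max Humidity'] = df['Max Humidity'].astype('float64')

# Dirty column: values that cannot be parsed become NaN instead of raising
df['Min Humidity'] = pd.to_numeric(df['Min Humidity'], errors='coerce')

print(df.dtypes)
```

After the conversion, the number of NaN values in a column shows how many entries failed to parse.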

11. Latitude and longitude geohash

https://www.jianshu.com/p/2fd0cf12e5ba

https://blog.csdn.net/wangyaninglm/article/details/78936475#31_python3__geohash_522

12. Determine whether the key exists in the dictionary

    d = {'name': {}, 'age': {}, 'sex': {}}
    # d.keys() lists all the keys in the dictionary; 'name' in d works the same way
    print('name' in d.keys())
    print('name' not in d.keys())
    # Prints True, then False