Data mining -- common data preprocessing

0. Read and write data

(1) Read txt file

# Read a comma-delimited txt file; explicit dtypes reduce memory use
import pandas as pd
data = pd.read_table('1.txt', delimiter=',', dtype={'id': 'int32', 'index': 'int8'})

(2) Save and read a list

Note: a list written to a txt file must first be converted to str, and it is read back as a str as well. To keep the list structure intact, convert it to an np.array and store that instead.

import numpy as np
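
The snippet for this step is missing from the post. A minimal sketch of the idea in the note, assuming np.save and np.load are an acceptable way to store the array (the list `values` and the temporary file path are made up for illustration):

```python
import os
import tempfile

import numpy as np

values = [1, 2, 3, 4]  # hypothetical list to persist

# Write the list as a .npy file; the temporary path is only for illustration
path = os.path.join(tempfile.mkdtemp(), "values.npy")
np.save(path, np.array(values))

# Load the array back and convert it to a plain Python list
restored = np.load(path).tolist()
print(restored)
```

Unlike a txt round trip, the loaded object is an array of numbers, not a string that has to be parsed.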


(3) Save dataframe to csv

# Save the DataFrame as a csv file without writing the index column
data.to_csv(path_or_buf="data.csv", index=False)

1. Sort

    DataFrame.sort_values(by='##',axis=0,ascending=True, inplace=False, na_position='last')

Parameter description:
by: the column name(s) to sort by when axis=0 (or 'index'), or the index label(s) to sort by when axis=1 (or 'columns')
axis: with axis=0 or 'index', rows are ordered by the values in the specified column(s); with axis=1 or 'columns', columns are ordered by the values in the specified row(s). The default is axis=0
ascending: whether to sort in ascending order. The default is True, that is, ascending order
inplace: whether to replace the original data with the sorted data. The default is False, that is, a sorted copy is returned
na_position: {'first', 'last'}, where missing values are placed. The default is 'last'
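
As a small illustration of these parameters, here is a sketch on made-up data (the column names `name` and `score` are assumptions, not from the original):

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with one missing score
df = pd.DataFrame({"name": ["a", "b", "c"], "score": [2.0, np.nan, 1.0]})

# Sort rows by score, largest first, with missing values placed at the top
sorted_df = df.sort_values(by="score", ascending=False, na_position="first")
print(sorted_df)
```

Because inplace defaults to False, `df` keeps its original order and the sorted result is a new DataFrame.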

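2. Append a new row or column to a DataFrame

    import pandas as pd
    # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
    data = pd.DataFrame()
    a = {"x": 1, "y": 2}  # one row
    data = pd.concat([data, pd.DataFrame([a])], ignore_index=True)

    a = [[1, 2, 3], [4, 5, 6]]  # two rows, three columns
    data = pd.concat([data, pd.DataFrame(a)], ignore_index=True)
    a = [[7, 8, 9], [10, 11, 12]]
    data = pd.concat([data, pd.DataFrame(a)], ignore_index=True)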


3. Combining data: merge and join

Note: join combines DataFrames by row index, so it suits frames whose rows already line up (for example, the same rows in the same order); merge matches rows on key columns instead.
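
The post gives no code for this section. A minimal sketch of the difference, using two made-up frames (`left`, `right`, and the column `key` are hypothetical names):

```python
import pandas as pd

left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["a", "b", "d"], "y": [10, 20, 40]})

# merge: match rows on the shared "key" column (inner join here)
merged = left.merge(right, on="key", how="inner")

# join: align rows on the index (left join by default)
joined = left.set_index("key").join(right.set_index("key"))

print(merged)
print(joined)
```

With merge only the keys present in both frames survive; with the default left join, every row of `left` is kept and unmatched rows get NaN.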


4. Sorting within groups

# Within each group of B, sort the rows by C in descending order
df.groupby('B', group_keys=False).apply(lambda x: x.sort_values('C', ascending=False))
[Out]
       A  B       C
    3  4  a  201003
    0  2  a  200801
    1  3  b  200902
    2  5  b  200704
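
For this particular case, a two-key sort_values call appears to give the same ordering without groupby. In the sketch below the input frame is reconstructed from the output above, which is an assumption about the original data:

```python
import pandas as pd

# DataFrame reconstructed from the printed output (an assumption)
df = pd.DataFrame({
    "A": [2, 3, 5, 4],
    "B": ["a", "b", "b", "a"],
    "C": [200801, 200902, 200704, 201003],
})

# Sort by group key B ascending, then by C descending inside each group
result = df.sort_values(["B", "C"], ascending=[True, False])
print(result)
```

This avoids groupby.apply, whose handling of the grouping column has changed between pandas versions.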


5. Initializing a new column

Assign the same value to every row of the new column:
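
The snippet is missing here. A minimal sketch, assuming a frame named `sourcedf` as in the lambda example later in this section (the sample rows and the new column names are made up):

```python
import pandas as pd

sourcedf = pd.DataFrame({"exam_district": [1, 3, 24, 7]})  # hypothetical sample data

# Assigning a scalar broadcasts it to every row of the new column
sourcedf["region"] = "unknown"
sourcedf["count"] = 0
print(sourcedf)
```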


Use a lambda for conditional assignment: lambda x: (value if condition else (...))

sourcedf['region'] = sourcedf['exam_district'].apply(lambda x: "whole country" if x == 1 else ("Beijing" if x == 3 else ("Shanghai" if x == 24 else "Other regions")))
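
When the conditions are plain equality checks like these, a dictionary lookup may be easier to read than nested conditionals. A sketch of that alternative, which is not the original author's method (the sample rows are made up):

```python
import pandas as pd

sourcedf = pd.DataFrame({"exam_district": [1, 3, 24, 7]})  # hypothetical sample data

# Codes taken from the lambda above; anything else falls back to "Other regions"
region_map = {1: "whole country", 3: "Beijing", 24: "Shanghai"}
sourcedf["region"] = sourcedf["exam_district"].map(region_map).fillna("Other regions")
print(sourcedf)
```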

6. Too many columns in a pandas DataFrame: output is truncated with an ellipsis

Solution: set the maximum number of columns to display
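
The snippet is missing here. One way to do this is pandas' display options, for example:

```python
import pandas as pd

# None removes the limit, so every column is printed
pd.set_option("display.max_columns", None)
```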


Similarly, you can set how many rows to display:
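
A sketch using the corresponding row option (the value 200 is an arbitrary choice for illustration):

```python
import pandas as pd

# Print up to 200 rows before truncating; the number is illustrative
pd.set_option("display.max_rows", 200)
```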


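7. Encode a column

Integer (label) encoding: 1,2,3,4,5

from sklearn import preprocessing
# LabelEncoder maps each distinct value to an integer code, starting from 0
label = preprocessing.LabelEncoder()
data_raw[x] = label.fit_transform(data_raw[x])  # x is the name of the column to encode

# The same idea by hand, with codes starting from 1. For example, if color has three distinct values, they are encoded as 1, 2, 3
# sorted() keeps the mapping stable between runs, since set order is not guaranteed
colorMap = {elem: index + 1 for index, elem in enumerate(sorted(set(df["color"])))}
df['color'] = df['color'].map(colorMap)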

One-hot encoding: 001 010 100

# Expand all values of a field into separate columns. For each row, the column matching its value is 1
data1 = pd.get_dummies(df[["color"]])

# One-hot encoding for multiple features:
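
The snippet is missing here. A possible sketch using the columns argument of pd.get_dummies; the `size` and `price` columns and the sample rows are assumptions for illustration:

```python
import pandas as pd

# Hypothetical data with two categorical features and one numeric feature
df = pd.DataFrame({
    "color": ["red", "green", "red"],
    "size": ["S", "M", "L"],
    "price": [1.0, 2.0, 3.0],
})

# Encode both categorical columns in one call; price is left untouched
encoded = pd.get_dummies(df, columns=["color", "size"])
print(encoded.columns.tolist())
```

Each new column is named after the original column and the value, such as color_red.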

# To merge the one-hot columns back into the original data, join them on the index
res = df.join(data1)

8. Implement Unix timestamp and common time conversion
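
The post leaves this section empty. One common way to do the conversion in pandas is sketched below, assuming timestamps in seconds stored in a column named `ts` (the column names and values are made up):

```python
import pandas as pd

df = pd.DataFrame({"ts": [0, 1591944085]})  # hypothetical Unix timestamps in seconds

# Unix timestamp -> datetime (timezone-naive, interpreted as UTC)
df["time"] = pd.to_datetime(df["ts"], unit="s")

# datetime -> Unix timestamp in whole seconds
df["ts_back"] = (df["time"] - pd.Timestamp("1970-01-01")) // pd.Timedelta(seconds=1)
print(df)
```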

9. pandas: replace values in one column based on a condition on another

# Where condition > 0.99, the value of old is replaced by the value of new; other rows keep old
df['old'] = df['old'].mask(df['condition'] > 0.99, df['new'])
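
A small worked example of the line above on made-up data, to show which rows change:

```python
import pandas as pd

# Hypothetical sample data
df = pd.DataFrame({
    "old": [1, 2, 3],
    "new": [10, 20, 30],
    "condition": [0.5, 0.995, 1.0],
})

# Rows where condition > 0.99 take the value from new
df["old"] = df["old"].mask(df["condition"] > 0.99, df["new"])
print(df["old"].tolist())
```

Only the second and third rows meet the condition, so only they are replaced.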

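10. Cast column types

# Cast a single column (this column name begins with a space)
df[' Min Humidity'] = df[' Min Humidity'].astype('float64')
# Cast several columns at once by passing a {column: dtype} dict
df = df.astype({'Max Humidity': 'float64', 'Max Dew PointF': 'float64'})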

11. Latitude and longitude geohash
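
The post gives no code for this section. Below is a minimal pure-Python sketch of the standard geohash encoding, not necessarily what the author used; a dedicated library may be preferable in practice. The function name `geohash_encode` is made up for illustration.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision=6):
    """Encode a latitude/longitude pair as a geohash string (illustrative helper)."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, bit_count = 0, 0
    use_lon = True  # bits alternate between longitude and latitude, longitude first
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


print(geohash_encode(42.6, -5.6, precision=5))  # -> ezs42
```

Nearby points usually share a prefix, so the hash can be used as a coarse location feature or grouping key.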

12. Check whether a key exists in a dictionary

d = {'name': {}, 'age': {}, 'sex': {}}
# d.keys() lists all the keys in the dictionary; 'name' in d works as well
print('name' in d.keys())
print('name' not in d.keys())
# Prints True, then False



Posted on Fri, 12 Jun 2020 02:41:25 -0400 by catalinus