Python pandas data cleaning: the sample() function

DataFrame.sample

The DataFrame.sample method performs simple random sampling on a DataFrame.

PS: "simple random sampling" means that the method does not, by itself, perform systematic sampling or stratified sampling.

DataFrame.sample can randomly extract rows or columns from a DataFrame. Its signature is as follows:

DataFrame.sample(n=None,
                 frac=None,
                 replace=False,
                 weights=None,
                 random_state=None,
                 axis=None)

The six parameters listed here are covered one by one below (pandas 1.3 and later also accept a seventh parameter, ignore_index, which is not covered here). Before introducing them in detail, let's create a sample DataFrame:

import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
print(df)

Output:

        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
dog            4          0                  2
spider         8          0                  1
fish           0          0                  8

Parameter interpretation:

--n    Set the sample size

The first parameter, n, is an int that specifies how many rows (or columns) to extract at random. Rows are sampled by default. n cannot be used together with frac; if neither n nor frac is specified, n defaults to 1.

print(df.sample())   # n not specified: one random row

print(df.sample(2))  # two random rows

Output:
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10


        num_legs  num_wings  num_specimen_seen
fish           0          0                  8
spider         8          0                  1
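
The rules for n described above can be checked with a short sketch. It rebuilds the sample DataFrame so that it runs on its own; the variable names (one_row, two_rows, both_rejected) are only illustrative and not part of pandas.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# With neither n nor frac given, exactly one row is returned.
one_row = df.sample()
print(len(one_row))   # 1

# n controls how many rows are returned.
two_rows = df.sample(n=2)
print(len(two_rows))  # 2

# Passing n and frac together is rejected with a ValueError.
try:
    df.sample(n=2, frac=0.5)
    both_rejected = False
except ValueError as err:
    both_rejected = True
    print('ValueError:', err)
```

Which rows are drawn differs from run to run; only the row counts are fixed.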

--frac     Set the sampling fraction

The frac parameter receives a float that specifies the proportion of rows or columns to extract at random. It cannot be used together with the n parameter. For example, to randomly extract 80% of the rows, you can use the following code (note that the rows come back in random order, not in their original order):

print(df.sample(frac=0.8))
print(df.sample(frac=0.9))
# The number of rows returned is (number of rows * frac), rounded to the nearest integer:
# 4 * 0.8 = 3.2 gives 3 rows, and 4 * 0.9 = 3.6 gives 4 rows

Output:
        num_legs  num_wings  num_specimen_seen
fish           0          0                  8
dog            4          0                  2
spider         8          0                  1

        num_legs  num_wings  num_specimen_seen
spider         8          0                  1
dog            4          0                  2
falcon         2          2                 10
fish           0          0                  8
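
As a minimal sketch of the two points above (the rounding of the row count and the shuffled order), the following can be run on its own. The names sample_80, sample_90, shuffled and restored are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# 4 rows * 0.8 = 3.2 -> 3 rows; 4 rows * 0.9 = 3.6 -> 4 rows.
sample_80 = df.sample(frac=0.8)
sample_90 = df.sample(frac=0.9)
print(len(sample_80), len(sample_90))  # 3 4

# frac=1 returns every row, in random order (a shuffle).
shuffled = df.sample(frac=1)

# Selecting by the original index labels puts the rows back in order.
restored = shuffled.loc[df.index]
print(restored.equals(df))  # True
```

Using frac=1 this way is a common idiom for shuffling a DataFrame.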

--replace     Set whether to sample with replacement

replace receives a bool. False means sampling without replacement (a row can be drawn at most once), and True means sampling with replacement (the same row can be drawn more than once). The default value is False, that is, sampling without replacement.

df.sample(3,replace=True)

output
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
falcon         2          2                 10
fish           0          0                  8

When frac is greater than 1, the sample returned by the sample method is larger than the original data. This is only possible when sampling with replacement, so replace must be set to True in this case:

df.sample(frac=1.5,replace=True)

output
        num_legs  num_wings  num_specimen_seen
spider         8          0                  1
dog            4          0                  2
dog            4          0                  2
spider         8          0                  1
dog            4          0                  2
dog            4          0                  2
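
The following sketch illustrates the constraint described above: asking for more rows than the data contains fails unless replace=True is set. The flag names (n_too_large_rejected, frac_too_large_rejected) are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Without replacement, 5 rows cannot be drawn from a 4-row DataFrame.
try:
    df.sample(n=5)
    n_too_large_rejected = False
except ValueError as err:
    n_too_large_rejected = True
    print('ValueError:', err)

# frac > 1 without replace=True is rejected as well.
try:
    df.sample(frac=1.5)
    frac_too_large_rejected = False
except ValueError as err:
    frac_too_large_rejected = True
    print('ValueError:', err)

# With replacement, both requests succeed; rows may repeat.
oversampled = df.sample(frac=1.5, replace=True)
print(len(oversampled))  # 6, i.e. 4 rows * 1.5
```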

--weights     Set the sampling weights

This parameter specifies the sampling weights. The greater the weight, the greater the probability that the row or column will be drawn. The default value is None, which means equal-probability sampling: every row or column is equally likely to be drawn.

weights accepts a str or an array-like object such as a Series. Both cases are described below.

str

The str must be the name of a column in the DataFrame (so this form applies to row sampling).

pandas takes the value in that column as the sampling weight of each row.

If the values in the column do not sum to 1, they are normalized so that they do. If the column contains a missing value, the sampling weight of that row is treated as 0, that is, the row is never drawn. In addition, infinite values are not allowed in the column.

For example, use the num_wings column of the sample data as the sampling weight. The four values of the num_wings column are 2, 0, 0 and 0. Their sum is 2, not 1, so they are first normalized to sum to 1, which gives final weights of 1, 0, 0 and 0. This means that the last three rows can never be drawn, because their sampling weight is 0.

df.sample(4,weights='num_wings',replace=True)

output
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
falcon         2          2                 10
falcon         2          2                 10
falcon         2          2                 10
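
The normalization described above can be reproduced by hand, as a sketch. The division by the column sum mirrors what the text says pandas does; the names normalized and drawn are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Manual normalization of the weight column: 2, 0, 0, 0 -> 1, 0, 0, 0.
normalized = df['num_wings'] / df['num_wings'].sum()
print(normalized.tolist())  # [1.0, 0.0, 0.0, 0.0]

# Only falcon has a non-zero weight, so every draw returns falcon.
drawn = df.sample(n=4, weights='num_wings', replace=True)
print(drawn.index.tolist())  # ['falcon', 'falcon', 'falcon', 'falcon']
```

replace=True is needed here because only one row has a non-zero weight, so four distinct rows could not be drawn.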

Series

The length of the Series can differ from the number of rows or columns in the data. Take row sampling as an example. Before sampling, pandas first aligns the indexes, which is equivalent to a left join between the DataFrame and the Series. Rows of the DataFrame whose index label has no match in the Series get a sampling weight of 0. Let's illustrate with an example:

First, we create a Series s with three elements, indexed by falcon, dog and cat. Note that the label cat has no match in the row index of df. Next, we pass s to weights as the sampling weights:

s = pd.Series([0.2,0.5,0.3],index=['falcon','dog','cat'])
df.sample(4,weights=s,replace=True)

The rows drawn vary from run to run, but every one of them is either falcon or dog: spider and fish have no entry in s, so their weight is 0 and they can never be drawn.

This can be understood as follows: pandas first left joins df and s, with df as the left table and s as the right table, which gives the following result:

        num_legs  num_wings  num_specimen_seen  weights
falcon         2          2                 10     0.2
dog            4          0                  2     0.5
spider         8          0                  1     nan
fish           0          0                  8     nan

The first and second rows of df find a matching sampling weight in s, 0.2 and 0.5 respectively, while the third and fourth rows find no match, so their weights are missing values. The weights column is then used as the sampling weight. Since the weights of the last two rows are missing, their sampling weight is 0, and these two rows are excluded from sampling. The rows that can actually be drawn are:

        num_legs  num_wings  num_specimen_seen  weights
falcon         2          2                 10     0.2
dog            4          0                  2     0.5

The sampling weights of these two rows are 0.2 and 0.5, which sum to 0.7 rather than 1, so they are normalized to sum to 1 before sampling. The final sampling weights are 0.2 / 0.7 ≈ 0.286 and 0.5 / 0.7 ≈ 0.714, that is, the first row has a 28.6% probability of being drawn on each draw and the second row a 71.4% probability, and the sample can contain only these two rows.
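
The alignment and normalization steps described above can be sketched by hand and compared with what sample returns. The reindex/fillna steps are a manual imitation of the behavior described in the text, not a claim about pandas internals; the names aligned, normalized, drawn and dog_share are illustrative, and random_state=0 is used only to make the run repeatable.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])
s = pd.Series([0.2, 0.5, 0.3], index=['falcon', 'dog', 'cat'])

# Manual alignment: keep df's labels, drop 'cat', fill unmatched rows with 0.
aligned = s.reindex(df.index).fillna(0)

# Manual normalization: 0.2 and 0.5 become roughly 0.286 and 0.714.
normalized = aligned / aligned.sum()
print(normalized.round(3).tolist())  # [0.286, 0.714, 0.0, 0.0]

# Draw many rows and compare the observed share of 'dog' with 0.714.
drawn = df.sample(n=10000, weights=s, replace=True, random_state=0)
dog_share = (drawn.index == 'dog').mean()
print(sorted(set(drawn.index)))  # ['dog', 'falcon']
print(round(dog_share, 2))
```

With a large number of draws, the observed share of dog should be close to 0.714, and spider and fish should never appear.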

--random_state     Set the random number seed

The random_state parameter makes sampling results reproducible. For example, if you sample a data set today and want to get exactly the same sample when you sample the same data tomorrow, you can use this parameter. It is typically given as an int.

For the first sampling, take a sample at random:

df.sample(random_state=1)

Output:
      num_legs  num_wings  num_specimen_seen
fish         0          0                  8

For the second sampling, if you want to get the same result, specify the same random_state:

df.sample(random_state=1)

Output:
      num_legs  num_wings  num_specimen_seen
fish         0          0                  8

If random_state is not specified, the returned rows may differ from one call to the next:

df.sample()

output
        num_legs  num_wings  num_specimen_seen
spider         8          0                  1
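
Reproducibility can be checked directly, as in this small sketch. The seed value 1 follows the example above; the names first and second are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Two calls with the same seed return the same rows in the same order.
first = df.sample(n=2, random_state=1)
second = df.sample(n=2, random_state=1)
print(first.equals(second))  # True
```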

--axis

The sample method can sample rows or columns. The parameter that controls this behavior is axis. When axis is 0 or 'index', rows are sampled. When axis is 1 or 'columns', columns are sampled. Row sampling is performed by default.

#Row sampling
df.sample(axis=0)

Output:
        num_legs  num_wings  num_specimen_seen
spider         8          0                  1

#Row sampling
df.sample(axis='index')

Output:
      num_legs  num_wings  num_specimen_seen
fish         0          0                  8

#Column sampling
df.sample(axis=1)

Output:
        num_legs
falcon         2
dog            4
spider         8
fish           0

#Column sampling
df.sample(axis='columns')

Output:
        num_specimen_seen
falcon                 10
dog                     2
spider                  1
fish                    8
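
The effect of axis on the shape of the result can be seen in this sketch. The names rows and cols are illustrative, and random_state=0 is used only to make the run repeatable.

```python
import pandas as pd

df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
                   'num_wings': [2, 0, 0, 0],
                   'num_specimen_seen': [10, 2, 1, 8]},
                  index=['falcon', 'dog', 'spider', 'fish'])

# Row sampling: 2 of the 4 rows, all 3 columns kept.
rows = df.sample(n=2, axis='index', random_state=0)
print(rows.shape)  # (2, 3)

# Column sampling: all 4 rows kept, 2 of the 3 columns.
cols = df.sample(n=2, axis='columns', random_state=0)
print(cols.shape)  # (4, 2)
```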

Sources: Pandas data cleaning series: detailed explanation of the DataFrame.sample method - Zhihu

What does the frac parameter in the sample method do? - CDA Q&A community


Posted on Fri, 26 Nov 2021 05:40:08 -0500 by penguinmasta