pandas.cut usage summary

https://www.cnblogs.com/sench/p/10128216.html

pandas.cut is used to divide a set of data into discrete intervals. For example, given a group of ages, pandas.cut can divide them into age groups and label each group.

Prototype

pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise')  # as of pandas 0.23.4

Parameter meaning

x: the array-like data to be binned. It must be 1-dimensional (a DataFrame cannot be used).
bins: the criteria for cutting (the resulting intervals are also called "buckets", "boxes" or "bins"). It takes one of three forms: an int scalar, a sequence of scalars (array), or a pandas.IntervalIndex.

  • An int scalar
    When bins is an int scalar, the range of x is divided into that many equal-width intervals. The range of x is extended by 0.1% so that the minimum and maximum values of x are included.
  • A sequence of scalars
    The sequence defines the edges of each bin, which allows non-uniform widths. The range of x is not extended in this case.
  • pandas.IntervalIndex
    Defines the exact intervals to use.
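The examples later in this post cover the int and sequence forms but not the IntervalIndex form, so here is a minimal sketch of it. The ages and the two intervals are made up for illustration. Values that fall in none of the intervals come back as missing.

```python
import numpy as np
import pandas as pd

# Illustrative data, not taken from the examples below.
ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100])

# Two explicit intervals, closed on the right by default: (0, 18] and (18, 65].
bins = pd.IntervalIndex.from_tuples([(0, 18), (18, 65)])

result = pd.cut(ages, bins)

# Each value is mapped to the interval that contains it.
# 77, 89 and 100 are in neither interval, so they become NaN (code -1).
print(result)
print(result.codes)
```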

right: bool, default True. Indicates whether each interval includes its right edge. For example, with bins=[1,2,3], right=True gives the intervals (1,2], (2,3]; right=False gives [1,2), [2,3).
labels: labels for the resulting bins. For example, after splitting the ages in x into age-group bins, you can label the groups "youth", "middle age", and so on. The number of labels must equal the number of intervals: if the split produces the two intervals (1,2], (2,3], labels must have length 2. If labels=False, only the integer index (starting from 0) of the bin that each value of x falls in is returned.
retbins: bool, default False. Indicates whether to also return the bin edges. This is useful when bins is an int scalar, because it lets you see the intervals that were computed.
precision: the number of decimal places kept when storing and displaying the interval edges. The default is 3.
include_lowest: bool, default False. Indicates whether the first interval is closed on the left. By default its left edge is open, so a value equal to the lowest edge is not included.
duplicates: what to do when the bin edges are not unique. There are two options: 'raise' (the default) raises a ValueError; 'drop' removes the duplicate edges.
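To see how right, include_lowest and duplicates affect which values land in a bin, the following sketch cuts the tiny array [1, 2, 3] with the edges [1, 2, 3]. The data is chosen only so that every value sits exactly on an edge. labels=False is used so the result is a plain array of bin indices, with NaN for values that fall outside every interval.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3])
edges = [1, 2, 3]

# Default right=True: intervals are (1, 2] and (2, 3], so 1 is left out.
default_codes = pd.cut(x, edges, labels=False)           # [nan, 0., 1.]

# include_lowest=True closes the first interval on the left: [1, 2], (2, 3].
lowest_codes = pd.cut(x, edges, labels=False, include_lowest=True)  # [0, 0, 1]

# right=False: intervals are [1, 2) and [2, 3), so 3 is left out.
left_codes = pd.cut(x, edges, labels=False, right=False)  # [0., 1., nan]

# Repeated edges raise a ValueError unless duplicates="drop" is given.
try:
    pd.cut(x, [1, 2, 2, 3])
    raised = False
except ValueError:
    raised = True

dropped_codes = pd.cut(x, [1, 2, 2, 3], labels=False, duplicates="drop")
```

With duplicates="drop" the edges collapse to [1, 2, 3], so dropped_codes matches default_codes.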

Return value

out: a pandas.Categorical, Series, or ndarray that gives, for each value in x, the bin (interval) it falls in. If labels are specified, the corresponding label is returned instead.
bins: the bin edges. Returned only when retbins=True.
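Which of these types comes back depends on the input type and on labels. The sketch below uses a small made-up list to show the common cases.

```python
import numpy as np
import pandas as pd

values = [1, 7, 5, 4, 6, 3]  # illustrative data

# ndarray in -> Categorical out
from_array = pd.cut(np.array(values), 3)

# Series in -> Series (with category dtype) out
from_series = pd.cut(pd.Series(values), 3)

# labels=False -> ndarray of integer bin indices
as_codes = pd.cut(np.array(values), 3, labels=False)

# retbins=True -> a tuple (out, bins); 3 bins means 4 edges
out, bin_edges = pd.cut(np.array(values), 3, retbins=True)

print(type(from_array).__name__, type(from_series).__name__, type(as_codes).__name__)
```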

Example

Let's take the age group as an example.

import numpy as np
import pandas as pd

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) #Age data

Divide ages into 5 intervals

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) 
pd.cut(ages, 5)

Output:

[(0.901, 20.8], (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], ..., (0.901, 20.8], (0.901, 20.8], (20.8, 40.6], (20.8, 40.6], (20.8, 40.6]]
Length: 16
Categories (5, interval[float64]): [(0.901, 20.8] < (20.8, 40.6] < (40.6, 60.4] < (60.4, 80.2] < (80.2, 100.0]]

As the output shows, ages is divided into five equal-width intervals, and the lowest edge is pushed slightly below the minimum (0.901 rather than 1) so that the minimum value falls inside the first interval.
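To check where these edges come from, you can ask for them with retbins=True. In this sketch the interior edges are spaced (100 - 1) / 5 = 19.8 apart, and the lowest edge is lowered by 0.1% of the range (0.099). The per-bin counts are computed from the category codes.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100, 18, 20, 25, 30, 32])

out, edges = pd.cut(ages, 5, retbins=True)

# Approximately [0.901, 20.8, 40.6, 60.4, 80.2, 100.0]
print(edges)

# How many ages fall in each of the five intervals.
counts = np.bincount(out.codes, minlength=5)
print(counts)  # [6 5 1 2 2]
```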

Divide ages into 5 intervals and specify labels

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) #Age data
pd.cut(ages, 5, labels=[u"baby",u"youth",u"middle age",u"Prime of life",u"old age"])

Output:

[baby, baby, baby, youth, youth, ..., baby, baby, youth, youth, youth]
Length: 16
Categories (5, object): [baby < youth < middle age < Prime of life < old age]

Divide ages by specifying interval

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) #Age data
pd.cut(ages, [0,5,20,30,50,100], labels=[u"baby",u"youth",u"middle age",u"Prime of life",u"old age"])

Output:

[baby, baby, youth, Prime of life, Prime of life, ..., youth, youth, middle age, middle age, Prime of life]
Length: 16
Categories (5, object): [baby < youth < middle age < Prime of life < old age]

Here, instead of dividing ages into equal-width intervals, we divide it into the five intervals (0, 5], (5, 20], (20, 30], (30, 50], (50, 100].

Return the split bins

Set retbins=True

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) #Age data
pd.cut(ages, [0,5,20,30,50,100], labels=[u"baby",u"youth",u"middle age",u"Prime of life",u"old age"],retbins=True)

Output:

([baby, baby, youth, Prime of life, Prime of life, ..., youth, youth, middle age, middle age, Prime of life]
 Length: 16
 Categories (5, object): [baby < youth < middle age < Prime of life < old age],
 array([  0,   5,  20,  30,  50, 100]))

Only return which bin the data in x is in

Just set labels=False

ages = np.array([1,5,10,40,36,12,58,62,77,89,100,18,20,25,30,32]) #Age data
pd.cut(ages, [0,5,20,30,50,100], labels=False)

Output:

array([0, 0, 1, 3, 3, 1, 4, 4, 4, 4, 4, 1, 1, 2, 2, 3], dtype=int64)

The first 0 indicates that the first age, 1, falls in bin 0, that is, (0, 5].
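Because these indices are plain integers, they can be used directly with NumPy. The sketch below looks up a label for each index and counts how many ages fall in each bin. It is one possible use of the indices, reusing the labels from the earlier examples.

```python
import numpy as np
import pandas as pd

ages = np.array([1, 5, 10, 40, 36, 12, 58, 62, 77, 89, 100, 18, 20, 25, 30, 32])
edges = [0, 5, 20, 30, 50, 100]
names = ["baby", "youth", "middle age", "Prime of life", "old age"]

codes = pd.cut(ages, edges, labels=False)

# Use each bin index to look up the matching label.
looked_up = np.array(names)[codes]

# Count how many ages fall in each bin.
counts = np.bincount(codes, minlength=len(names))
print(dict(zip(names, counts.tolist())))
```

The looked-up labels are the same ones that pd.cut returns when labels=names is passed directly.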

Posted on Thu, 28 May 2020 03:25:08 -0400 by gottes_tod