sklearn -- Transformer and estimator

Among the sklearn tools I have worked with so far are SimpleImputer, OrdinalEncoder and OneHotEncoder. What they have in common is that they are all used for feature preprocessing.

When using them, you inevitably run into fit_transform and transform. At first I didn't fully understand them and simply memorized the behavior. Later I learned about the Transformer and Estimator concepts in sklearn and thought it was worth pulling the pieces together.

Transformer (converter)

Before fitting a model, the dataset needs to be preprocessed, including handling null values and encoding object (categorical) variables.

In other words, the dataset needs to be transformed, and that is what a Transformer is for. SimpleImputer, OrdinalEncoder and OneHotEncoder are all, in essence, Transformers.

For example, after the original data X_train passes through a SimpleImputer, you get a new dataset X_train_new with the null values filled in.
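
As a minimal sketch of that flow (the toy array below and the strategy='mean' choice are just for illustration):

import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0, np.nan],
                    [2.0, 3.0]])             # toy training data with one missing value
X_train_new = SimpleImputer(strategy='mean').fit_transform(X_train)
print(X_train_new)                           # the nan is replaced by the column mean 3.0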

fit, fit_transform and transform

Take SimpleImputer as an example. If mean imputation is required, you first need to know the mean of each column containing null values before you can fill them in.

Therefore, fit first computes the relevant statistic for each column with null (missing) values, such as the mean, the median or the most frequent value.

transform transforms the input data set according to the value calculated by fit.

Therefore, before processing a new dataset you need to fit once to obtain these values and then apply them with transform. In practice, however, we usually call fit_transform directly.

Inside fit_transform, the dataset is fit first and then transform is called on it.

>>> import numpy as np
>>> from sklearn.impute import SimpleImputer
>>> imp = SimpleImputer(missing_values=np.nan, strategy='mean')
>>> imp.fit([[1, 8], [np.nan, 4], [7, 6]])
SimpleImputer()
>>> X = np.array([[np.nan, 2.], [6., np.nan], [7., np.nan]])
>>> print(imp.transform(X))
[[4. 2.]
 [6. 6.]
 [7. 6.]]
>>> print(imp.fit_transform(X))
[[6.5 2. ]
 [6.  2. ]
 [7.  2. ]]

The imp.transform(X) call transforms X according to the means saved by fit on the data passed in earlier. The mean of the first column of that fitted array is (1 + 7) / 2 = 4 and the mean of the second column is (8 + 4 + 6) / 3 = 6, i.e. the divisor is the number of non-null values. Once the means are known, the null entries of X are filled with them directly, regardless of the data in X itself.

The imp.fit_transform(X) call fits X first and then transforms X, so this time the means come from X itself: (6 + 7) / 2 = 6.5 for the first column and 2 for the second. The principle is the same as above.
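
Where do those means live? After each fit, SimpleImputer stores the per-column fill values in its statistics_ attribute. Continuing the session above (the last fit was the fit_transform(X) call, so the stored values now come from X itself):

>>> imp.statistics_
array([6.5, 2. ])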

It is worth noting that

When transforming a dataset that is split into a training set and a validation set, you only need to call fit on the training set; for the validation set, call transform() directly. To keep the datasets consistent, the transformation of the validation set should match the training set. For example, if the training set is encoded with OrdinalEncoder and the validation set also calls fit_transform, the same category value may well receive different codes in the two sets: the color red of a ball might be encoded as 3 in the training set but as 6 in the validation set.
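
Here is a minimal sketch of that pattern; the column name and category values are made up for illustration:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X_train = pd.DataFrame({"color": ["red", "blue", "green", "red"]})
X_valid = pd.DataFrame({"color": ["green", "red"]})

enc = OrdinalEncoder()
X_train_enc = enc.fit_transform(X_train)   # learn the categories on the training set
X_valid_enc = enc.transform(X_valid)       # reuse those categories for the validation set
print(enc.categories_)                     # the single mapping shared by both splits

Because only the training set is ever fit, "green" and "red" map to the same codes in both splits.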

Digging into the source code (you can skip this part and just take the conclusion above)

Looking at the source, OrdinalEncoder and OneHotEncoder share the base class _BaseEncoder(TransformerMixin, BaseEstimator), and SimpleImputer likewise inherits from TransformerMixin and BaseEstimator. I don't fully follow the code, but in TransformerMixin you can see return self.fit(X, **fit_params).transform(X): fit first, then transform is called directly.

So, simply put, fit_transform means that fit is called first and then transform is called on the same data.

class TransformerMixin:
    """Mixin class for all transformers in scikit-learn."""

    def fit_transform(self, X, y=None, **fit_params):
        """
        Fit to data, then transform it.
        Fits transformer to `X` and `y` with optional parameters `fit_params`
        and returns a transformed version of `X`.
        Parameters
        ----------
        X : array-like of shape (n_samples, n_features)
            Input samples.
        y :  array-like of shape (n_samples,) or (n_samples, n_outputs), \
                default=None
            Target values (None for unsupervised transformations).
        **fit_params : dict
            Additional fit parameters.
        Returns
        -------
        X_new : ndarray array of shape (n_samples, n_features_new)
            Transformed array.
        """
        # non-optimized default implementation; override when a better
        # method is possible for a given clustering algorithm
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
            return self.fit(X, **fit_params).transform(X)
        else:
            # fit method of arity 2 (supervised transformation)
            return self.fit(X, y, **fit_params).transform(X)

Take OneHotEncoder as an example. If you look directly at its fit and transform, you will find that they both call the base class _BaseEncoder's _fit() and _transform().

    def _fit(self, X, handle_unknown="error", force_all_finite=True):
        self._check_n_features(X, reset=True)
        self._check_feature_names(X, reset=True)
        X_list, n_samples, n_features = self._check_X(
            X, force_all_finite=force_all_finite
        )
        self.n_features_in_ = n_features

        if self.categories != "auto":
            if len(self.categories) != n_features:
                raise ValueError(
                    "Shape mismatch: if categories is an array,"
                    " it has to be of shape (n_features,)."
                )

        self.categories_ = []

        for i in range(n_features):
            Xi = X_list[i]
            if self.categories == "auto":
                cats = _unique(Xi)
            else:
                cats = np.array(self.categories[i], dtype=Xi.dtype)
                if Xi.dtype.kind not in "OUS":
                    sorted_cats = np.sort(cats)
                    error_msg = (
                        "Unsorted categories are not supported for numerical categories"
                    )
                    # if there are nans, nan should be the last element
                    stop_idx = -1 if np.isnan(sorted_cats[-1]) else None
                    if np.any(sorted_cats[:stop_idx] != cats[:stop_idx]) or (
                        np.isnan(sorted_cats[-1]) and not np.isnan(cats[-1])
                    ):
                        raise ValueError(error_msg)

                if handle_unknown == "error":
                    diff = _check_unknown(Xi, cats)
                    if diff:
                        msg = (
                            "Found unknown categories {0} in column {1}"
                            " during fit".format(diff, i)
                        )
                        raise ValueError(msg)
            self.categories_.append(cats)

    def _transform(
        self, X, handle_unknown="error", force_all_finite=True, warn_on_unknown=False
    ):
        self._check_feature_names(X, reset=False)
        self._check_n_features(X, reset=False)
        X_list, n_samples, n_features = self._check_X(
            X, force_all_finite=force_all_finite
        )

        X_int = np.zeros((n_samples, n_features), dtype=int)
        X_mask = np.ones((n_samples, n_features), dtype=bool)

        columns_with_unknown = []
        for i in range(n_features):
            Xi = X_list[i]
            diff, valid_mask = _check_unknown(Xi, self.categories_[i], return_mask=True)

            if not np.all(valid_mask):
                if handle_unknown == "error":
                    msg = (
                        "Found unknown categories {0} in column {1}"
                        " during transform".format(diff, i)
                    )
                    raise ValueError(msg)
                else:
                    if warn_on_unknown:
                        columns_with_unknown.append(i)
                    # Set the problematic rows to an acceptable value and
                    # continue. The rows are marked `X_mask` and will be
                    # removed later.
                    X_mask[:, i] = valid_mask
                    # cast Xi into the largest string type necessary
                    # to handle different lengths of numpy strings
                    if (
                        self.categories_[i].dtype.kind in ("U", "S")
                        and self.categories_[i].itemsize > Xi.itemsize
                    ):
                        Xi = Xi.astype(self.categories_[i].dtype)
                    elif self.categories_[i].dtype.kind == "O" and Xi.dtype.kind == "U":
                        # categories are objects and Xi are numpy strings.
                        # Cast Xi to an object dtype to prevent truncation
                        # when setting invalid values.
                        Xi = Xi.astype("O")
                    else:
                        Xi = Xi.copy()

                    Xi[~valid_mask] = self.categories_[i][0]
            # We use check_unknown=False, since _check_unknown was
            # already called above.
            X_int[:, i] = _encode(Xi, uniques=self.categories_[i], check_unknown=False)
        if columns_with_unknown:
            warnings.warn(
                "Found unknown categories in columns "
                f"{columns_with_unknown} during transform. These "
                "unknown categories will be encoded as all zeros",
                UserWarning,
            )

        return X_int, X_mask

I only skimmed the code above and, to be honest, didn't follow all of it, but it is easy to see that the main loop of _transform repeatedly uses the variable self.categories_, and that variable is populated in self._fit(). In other words, transform directly uses the values that fit stored on the object.
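
You can see that stored state from the outside as well. A small check with made-up data: after fit, the learned categories sit on the encoder as categories_, and transform encodes new rows against exactly those categories:

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder()
enc.fit([["red"], ["blue"], ["red"]])
print(enc.categories_)                      # the categories learned during fit: ['blue', 'red']
print(enc.transform([["blue"]]).toarray())  # [[1. 0.]] -- encoded against those stored categories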

Estimator (predictor)

Estimators are the various algorithm models in sklearn; classifiers and regressors both belong to this category.

Common basic examples include decision trees and random forests.

# Decision tree
from sklearn.tree import DecisionTreeRegressor      # for regression problems
from sklearn.tree import DecisionTreeClassifier     # for classification problems
# Random forest
from sklearn.ensemble import RandomForestRegressor
# XGBoost
from xgboost import XGBRegressor

These predictors are all used in a similar way. Take the decision tree as an example.

  • First instantiate an estimator
  • Then fit it. Note that the fit here is the estimator's own fit.
  • Then call predict to make predictions.
from sklearn.tree import DecisionTreeRegressor      # for regression problems
from sklearn.tree import DecisionTreeClassifier     # for classification problems

melbourne_model = DecisionTreeRegressor(random_state=1)

# Fit the model
melbourne_model.fit(X, y)

melbourne_model.predict(X.head())   # model prediction
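
Putting the transformer and the estimator together, a rough end-to-end sketch (the data here is randomly generated just to make the snippet runnable; the point is fit/fit_transform on the training split and plain transform/predict on the validation split):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.tree import DecisionTreeRegressor

# toy data with some missing values
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
X[rng.rand(100, 3) < 0.1] = np.nan
y = rng.rand(100)

X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=1)

imputer = SimpleImputer(strategy='mean')
X_train_new = imputer.fit_transform(X_train)   # transformer: fit on the training set only
X_valid_new = imputer.transform(X_valid)       # reuse the training-set means

model = DecisionTreeRegressor(random_state=1)
model.fit(X_train_new, y_train)                # estimator: fit, then predict
preds = model.predict(X_valid_new)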

The comparison and detailed explanation of each model are not repeated here.

Remember: the foundation of ML is always mathematics.

Tags: sklearn Transformer

Posted on Thu, 28 Oct 2021 20:53:02 -0400 by bulldorc