Machine learning - interview selection

  1. What is overfitting in machine learning?

    Overfitting: the model performs very well on the training set but predicts poorly on the test set. Generally, the bias is low and the variance is high.

  2. How can overfitting be avoided?

    1. Resampling (bootstrap)

    2. L1 and L2 regularization

    3. Pruning of decision trees

    4. Cross validation

3. What is underfitting in machine learning?

    Underfitting: the model complexity is too low or the data set is too small, so the model does not fit the data well and performs poorly even on the training set. Generally, the bias is high and the variance is low.

4. How can underfitting be avoided?

    1. Increase the number of samples

    2. Increase the number of sample features

    3. Expand the feature dimensions

5. What is cross validation? What is the role of cross validation? What are the main methods of cross validation?

    Cross validation: the original dataset is divided into two parts. One part is the training set, used to train the model; the other part is the test set, used to evaluate the model.

    Role of cross validation:

        1) Cross validation is used to evaluate how well the model predicts on new data, and it can also reduce overfitting to a certain extent.

        2) It extracts as much useful information as possible from limited data.

    Main methods of cross validation:

        ① Hold-out method. The original dataset is simply split into three parts: training set, validation set, and test set.

        ② k-fold cross validation (5-fold or 10-fold cross validation is generally used).

        ③ Leave-one-out method (only one sample is held out as the test set and the rest are used as the training set). Only practical for small datasets.

        ④ Bootstrap method (introduces sampling bias). Upsampling and downsampling.
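The k-fold split described above can be sketched in plain Python. This is a minimal illustration, not a library implementation; the function name `k_fold_indices` and the `seed` parameter are assumptions made for this example.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # shuffle once so the folds are random
    # Spread the remainder so that fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Each sample appears in exactly one test fold
folds = list(k_fold_indices(n_samples=10, k=5))
```

Setting `k` equal to `n_samples` gives the leave-one-out method, which is why it is only practical for small datasets: the model has to be trained once per sample.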

6. What is the difference between supervised algorithms and unsupervised algorithms?

  Algorithms trained on labeled data are called supervised algorithms; those trained on unlabeled data are called unsupervised algorithms.

7. What are the common distance measures?

1) Minkowski distance

--- p = 1: Manhattan distance (city block distance)

--- p = 2: Euclidean distance

--- p = ∞: Chebyshev distance

2) Cosine similarity

3) KL divergence (relative entropy)

4) Jaccard similarity coefficient (for sets)

5) Pearson correlation coefficient ρ (distance = 1 - ρ)
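Several of these measures are short enough to write out directly. The following is a minimal sketch in plain Python; the function names are chosen for this example, and the Jaccard function assumes at least one of the two sets is non-empty.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance; p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Limit of the Minkowski distance as p goes to infinity."""
    return max(abs(a - b) for a, b in zip(x, y))

def cosine_similarity(x, y):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def jaccard_similarity(s, t):
    """Size of the intersection divided by size of the union."""
    s, t = set(s), set(t)
    return len(s & t) / len(s | t)

x, y = [0.0, 0.0], [3.0, 4.0]
manhattan = minkowski(x, y, 1)   # |3| + |4|
euclidean = minkowski(x, y, 2)   # sqrt(9 + 16)
```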

8. What are the ways of processing text data, and how do they differ?

1) Bag-of-words method (BOW/TF) and set-of-words method (SOW)

--- The grammar and word order of the text are ignored; only the number of times each word occurs (BOW/TF) or whether it occurs at all (SOW) is considered.

2) TF-IDF

--- Considers both the term frequency in the text and the inverse document frequency (the basic idea: the importance of a word is directly proportional to the number of times it appears in the document and inversely proportional to the number of documents in the corpus that contain it).

3) HashTF-IDF (words are not counted directly; each word is hashed and the occurrences of each hash value are counted)

4) One-hot encoding (OneHotEncoder)

5) Word2Vec (analyzes all the words in the documents ->> obtains the degree of association between words ->> forms the word vector matrix)
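The difference between BOW, SOW, and TF-IDF can be seen on a toy corpus. This is a sketch only: the three documents are made up for illustration, and the weighting used here (tf = count / document length, idf = log(N / df)) is one common variant assumed for the example; libraries often use smoothed versions.

```python
import math
from collections import Counter

# Hypothetical toy corpus, already tokenized
docs = [
    "the cat sat on the mat".split(),
    "the dog sat".split(),
    "a cat and a dog".split(),
]

def bag_of_words(doc):
    """BOW/TF: how many times each word occurs."""
    return Counter(doc)

def set_of_words(doc):
    """SOW: only whether each word occurs."""
    return set(doc)

def tf_idf(doc, corpus):
    """TF-IDF with tf = count / len(doc) and idf = log(N / df)."""
    n_docs = len(corpus)
    scores = {}
    for word, count in Counter(doc).items():
        df = sum(1 for d in corpus if word in d)  # documents containing the word
        scores[word] = (count / len(doc)) * math.log(n_docs / df)
    return scores

scores = tf_idf(docs[0], docs)
```

In the first document, "mat" scores higher than "cat" even though both occur once, because "mat" appears in only one document of the corpus while "cat" appears in two.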

9. What is the least squares method?

    Minimize the sum of squared errors between the predicted values and the true values.
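For a straight line y = slope * x + intercept, the least squares solution has a closed form. The sketch below assumes a single input feature; `fit_line` and the sample data are chosen for this example.

```python
def fit_line(xs, ys):
    """Least squares fit of y = slope * x + intercept for one feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Setting the derivatives of the squared error to zero gives these formulas
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Points that lie exactly on y = 2x + 1
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(xs, ys)
```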

10. What are the common optimization algorithms, and what are their advantages and disadvantages?

1) Gradient descent

--- BGD ---> Batch gradient descent. Each iteration traverses all the sample data, which is slow, but for a convex objective it converges to the global optimum.

--- SGD ---> Stochastic gradient descent. Iterations are fast, but it may only reach a local optimum (for a convex function it approaches the global optimum).

--- MBGD ---> Mini-batch gradient descent. Each iteration uses a small batch of samples, a compromise between BGD and SGD.

2) Newton's method and quasi-Newton methods

Newton's method ---> Newton's method has second-order convergence, so it converges quickly, but each iteration must compute the inverse of the Hessian matrix of the objective function, which is expensive.

Quasi-Newton methods ---> These address the drawback that Newton's method must compute the inverse of the Hessian matrix at every step: a positive definite matrix is used to approximate the inverse of the Hessian, which reduces the computational cost.

3) Conjugate gradient method

The conjugate gradient method lies between the steepest descent method and Newton's method. It needs only first-derivative information, yet it overcomes the slow convergence of steepest descent and avoids Newton's method's need to store, compute, and invert the Hessian matrix. Its advantages are that it needs little storage and no external parameters.

4) Heuristic methods

There are many heuristic optimization methods, including the classical simulated annealing, genetic algorithms, ant colony algorithms, and particle swarm optimization.

5) Lagrange multiplier method
11. Please describe the Lagrange multiplier method and the KKT conditions.

    The Lagrange multiplier method transforms a constrained objective function into an unconstrained one. Inequality constraints are required to be written in the form f(x) <= 0.
    KKT conditions: α*f(x) = 0, f(x) <= 0, α >= 0, and the derivative of the Lagrangian with respect to each parameter must be 0.
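As a small worked example of these conditions, consider minimizing g(x) = x^2 subject to f(x) = 1 - x <= 0. Both the objective g and this particular constraint are chosen here purely for illustration; they do not come from the original text.

```latex
\begin{align*}
L(x, \alpha) &= x^{2} + \alpha\,(1 - x) \\
\frac{\partial L}{\partial x} &= 2x - \alpha = 0
  && \text{(stationarity)} \\
\alpha\,(1 - x) &= 0
  && \text{(complementary slackness)} \\
1 - x &\le 0, \quad \alpha \ge 0
  && \text{(feasibility)}
\end{align*}
% Case alpha = 0: stationarity gives x = 0, but then f(x) = 1 > 0, infeasible.
% Case 1 - x = 0: x = 1 and alpha = 2x = 2 >= 0, so all conditions hold.
% The constrained minimum is therefore x = 1 with g(x) = 1.
```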

12. How to deal with data imbalance?

    1) Bootstrap (resampling) ---> upsampling and downsampling

    2) Data synthesis --> generate more samples from the existing samples

    3) Data weighting

    4) Treat it as a one-class classification or anomaly detection problem

    Very few positive and negative samples --> data synthesis

    Enough negative samples but very few positive samples, with a very large difference in proportion --> one-class classification

    Enough positive and negative samples, with no particularly large difference in proportion --> sampling or weighting
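Upsampling by resampling with replacement can be sketched as follows. This is a minimal illustration under the assumption that every class is simply grown to the size of the largest one; `random_oversample` and the toy data are made up for this example.

```python
import random

def random_oversample(samples, labels, seed=0):
    """Resample minority classes with replacement until all classes match the largest."""
    rng = random.Random(seed)
    by_class = {}
    for sample, label in zip(samples, labels):
        by_class.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_class.values())
    out_samples, out_labels = [], []
    for label, group in by_class.items():
        # Draw extra copies at random from the existing samples of this class
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels

# Four negative samples and one positive sample
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
```

Because the extra samples are exact copies, this adds no new information; data synthesis differs in that it generates new samples from the existing ones.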

13. What generally causes the error of an algorithm?

    1) Bias: the model is not complex enough to represent the underlying data. ---Underfitting

    2) Variance: the model overfits the training set data. ---Overfitting

14. How to deal with samples that have missing feature values?

    Depending on how the values are missing, we generally:

    1) Fill in the data with the mean, median, maximum, or minimum value

    2) Fill in the data with an empirical value

    3) Estimate the missing value from correlated features

    4) If there are enough samples, delete the samples with missing values directly

15. What are the advantages of discretizing continuous data?

    1) Makes fast model iteration easier

    2) Strong robustness

    3) Introduces nonlinearity and improves the expressive power of the model

    4) Reduces the risk of overfitting

16. What is the role of one-hot encoding?

    One-hot encoding (dummy coding) converts string-typed (categorical) features into numeric features.
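The conversion can be sketched in a few lines of plain Python. The function `one_hot_encode` is written for this example and orders the categories alphabetically, which is an assumption of the sketch rather than a requirement of the technique.

```python
def one_hot_encode(values):
    """Map each distinct string to a 0/1 vector with a single 1."""
    categories = sorted(set(values))  # fixed column order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot_encode(["red", "green", "red", "blue"])
```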

17. What are the common decision tree algorithms? What are the differences between them?

    The common ones are three algorithms: ID3, C4.5, and CART.

    1) Different purity measures. ID3 --> information gain, C4.5 --> information gain ratio, CART --> Gini index

    2) Different data handling capabilities. ID3 --> discrete data only; C4.5, CART ---> continuous data (by discretization) and pruning

    3) Different tree shapes. ID3 and C4.5 -> multiway trees; CART -> binary trees only.
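The three purity measures can be computed directly from label counts. The following is a minimal sketch; the function names and the two toy splits are chosen for this example, and `gain_ratio` assumes the split produces at least two non-empty groups.

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index used by CART."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def information_gain(labels, groups):
    """ID3 criterion: entropy before the split minus weighted entropy after it."""
    n = len(labels)
    weighted = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - weighted

def gain_ratio(labels, groups):
    """C4.5 criterion: information gain divided by the split information."""
    n = len(labels)
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups if g)
    return information_gain(labels, groups) / split_info

labels = ["yes", "yes", "no", "no"]
perfect_split = [["yes", "yes"], ["no", "no"]]   # children are pure
useless_split = [["yes", "no"], ["yes", "no"]]   # children are as mixed as the parent
```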

18. Data splitting principle or process of a decision tree

    1) Treat all samples as one node

    2) Using the purity measure, compute the purity obtained by splitting on each feature, and split the data on the feature that gives the purest child nodes

    3) Repeat the above steps until every leaf node is sufficiently pure or a stopping condition is reached

19. How to avoid overfitting and underfitting in decision tree algorithms?

    Overfitting: 1. select a training set that reflects the business logic to generate the decision tree; 2. pruning (pre-pruning and post-pruning); 3. limit the height of the tree (chosen by K-fold cross validation)

    Underfitting: increase the depth of the tree, or use RF

20. What are the differences and connections between regression decision trees and classification decision trees?

    A regression decision tree solves regression problems; a classification decision tree solves classification problems. Both build the tree in the same way (by searching for the optimal split). A regression tree predicts the mean or weighted mean of the leaf node that the query point falls into; a classification tree predicts by majority vote or weighted majority vote. The split criterion for a regression tree is MAE or MSE; for a classification tree it is information entropy, the Gini index, or the error rate.

21. What are the functions of data standardization, data normalization, and interval scaling, and what are the differences between them?

    Data standardization rescales each feature to zero mean and unit variance; it does not by itself make the data normally distributed. (It is often used for highly dispersed data.)

    Data normalization converts each sample vector into a unit vector (a row operation on the matrix).

    Interval scaling is used for data sets with a wide range of values; each feature is scaled proportionally into feature_range=(0,1).
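The three transformations can be compared side by side. This is a sketch in plain Python with function names chosen for the example; it assumes the column is not constant and the row is not all zeros, so no division by zero occurs.

```python
import math

def standardize(column):
    """Standardization: zero mean and unit variance for one feature column."""
    n = len(column)
    mean = sum(column) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / n)
    return [(v - mean) / std for v in column]

def min_max_scale(column, low=0.0, high=1.0):
    """Interval scaling: map one feature column linearly into [low, high]."""
    lo, hi = min(column), max(column)
    return [low + (v - lo) * (high - low) / (hi - lo) for v in column]

def normalize_row(row):
    """Normalization: scale one sample (row) to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

column = [1.0, 2.0, 3.0, 4.0, 5.0]
z = standardize(column)
scaled = min_max_scale(column)
unit = normalize_row([3.0, 4.0])
```

Note that the first two functions operate on a column (one feature across all samples), while the third operates on a row (one sample across all features).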

22. What is the goal of feature selection?

    Select features with high variance (divergence) that are strongly correlated with the target.

23. What are the common feature selection methods?

    ① Filter method

    The filter method scores each feature by its divergence or correlation and selects features by a score threshold or by the number of features to keep. Common methods include variance selection, the correlation coefficient method, the chi-square test, and the mutual information method.

    ② Wrapper method

    The wrapper method selects or excludes several features at a time according to an objective function (usually the prediction score). A common method is recursive feature elimination.

    ③ Embedded method

    The embedded method trains a machine learning model to obtain a weight coefficient for each feature and selects the features in descending order of coefficient. A common method is feature selection based on a penalty term.

24. What is the role of dimensionality reduction?

    ① Dimensionality reduction can alleviate the curse of dimensionality

    ② Dimensionality reduction can compress data while minimizing information loss

    ③ Low-dimensional data is easier to understand through visualization

25. What are the common dimensionality reduction methods?

    PCA (principal component analysis) --> unsupervised dimensionality reduction (no class information) --> projects onto the directions with large variance; the greater the variance, the more information is retained and the less is lost. It can be used for feature extraction and feature selection.

    LDA (linear discriminant analysis) --> supervised dimensionality reduction (uses class information) --> chooses a projection with small within-class variance and large between-class variance.

26. What is the principle of PCA dimensionality reduction?

    PCA is an unsupervised dimensionality reduction algorithm. It projects the sample data from a high dimension to a low dimension through a linear transformation such that the projected data has the largest variance.

27. What is the dimensionality reduction principle of LDA (linear discriminant analysis)?

    LDA is a supervised dimensionality reduction algorithm. It projects high-dimensional data into a low-dimensional space such that, after projection, the within-class variance is smallest and the between-class variance is largest (the projected points of each class are as close together as possible, and the class centers of different classes are as far apart as possible).

28. What is the difference between model parameters and hyperparameters in machine learning? How are hyperparameter values chosen?

    Model parameters are configuration variables internal to the model (their values can be estimated from data).

    Model hyperparameters are configurations external to the model, and their values must be set manually (for example, by grid search with cross validation; they can also be chosen from experience).

29. What are the commonly used machine learning tools?

30. What is the role of SVD (singular value decomposition) in machine learning?

    LSA (latent semantic analysis) and PCA dimensionality reduction

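31. What is the objective function of the logistic regression algorithm?

    The sigmoid function gives the predicted probability; the parameters are found by maximum likelihood estimation (MLE), i.e. by maximizing the likelihood function L(θ).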
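Written out, the pieces named above fit together as follows. The notation (m training samples (x_i, y_i) with y_i in {0, 1}, hypothesis h_θ, and cost J) is assumed here for illustration; the original only names the sigmoid function and L(θ).

```latex
\begin{align*}
h_\theta(x) &= \frac{1}{1 + e^{-\theta^{T} x}}
  && \text{(sigmoid)} \\
L(\theta) &= \prod_{i=1}^{m} h_\theta(x_i)^{y_i}\,\bigl(1 - h_\theta(x_i)\bigr)^{1 - y_i}
  && \text{(likelihood)} \\
\ell(\theta) = \log L(\theta) &= \sum_{i=1}^{m} \Bigl[\, y_i \log h_\theta(x_i)
  + (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \Bigr]
  && \text{(log-likelihood)} \\
J(\theta) &= -\frac{1}{m}\,\ell(\theta)
  && \text{(objective to minimize)}
\end{align*}
```

Maximizing L(θ) is equivalent to minimizing J(θ), which is the cross-entropy loss.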

32. Please describe the principle of SVM

    SVM finds a hyperplane such that the points closest to it (the support vectors) are as far from it as possible.

33. Why is margin maximization used in SVM?

    Margin maximization increases the confidence of the classification, and the classifier is more robust.
    When the training data are linearly separable, there are infinitely many separating hyperplanes that correctly separate the two classes.
    The perceptron uses the misclassification minimization strategy to obtain a separating hyperplane, but it then has infinitely many solutions.

    The linearly separable SVM instead uses margin maximization, which yields a unique separating hyperplane. The classification result produced by this hyperplane is the most robust and generalizes best to unseen instances.

    One should then elaborate on the geometric margin, the functional margin, and how minimizing 1/2 ||w||^2 over w and b follows from the functional margin; this is the origin of the linearly separable SVM learning algorithm, the maximum margin method.
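The step from the margins to the 1/2 ||w||^2 objective can be summarized as follows. The symbols for the functional margin and the geometric margin, and the sample index i over N training points, are notation assumed for this sketch.

```latex
\begin{align*}
\hat{\gamma}_i &= y_i\,(w \cdot x_i + b)
  && \text{(functional margin)} \\
\gamma_i &= \frac{y_i\,(w \cdot x_i + b)}{\lVert w \rVert}
  && \text{(geometric margin)} \\
\max_{w,\,b}\ \frac{\hat{\gamma}}{\lVert w \rVert}
  \quad &\text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge \hat{\gamma},
  \quad i = 1, \dots, N \\
\min_{w,\,b}\ \frac{1}{2}\lVert w \rVert^{2}
  \quad &\text{s.t.}\quad y_i\,(w \cdot x_i + b) \ge 1,
  \quad i = 1, \dots, N
\end{align*}
% Rescaling (w, b) rescales the functional margin but not the hyperplane,
% so the functional margin can be fixed to 1. Maximizing 1/||w|| is then
% equivalent to minimizing (1/2)||w||^2.
```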

34. Please derive the linearly separable SVM algorithm

   Watermelon Book formula derivation...

35. Please describe the principle and application scenario of the soft-margin linear SVM algorithm

    For sample points that do not satisfy the requirement that the functional margin be at least 1, we add a slack variable ξ and a penalty coefficient C on top of the linearly separable formulation.
    The larger C is, the greater the penalty for misclassification (the less error is tolerated); as C tends to infinity, it becomes the linearly separable (hard-margin) problem.
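In formula form, the soft-margin problem with slack variables ξ and penalty coefficient C reads as below. The sample index i over N training points is notation assumed for this sketch.

```latex
\begin{align*}
\min_{w,\,b,\,\xi}\quad & \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{N} \xi_i \\
\text{s.t.}\quad & y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \\
                 & \xi_i \ge 0, \qquad i = 1, \dots, N
\end{align*}
% Each xi_i measures how far sample i falls short of functional margin 1.
% A large C makes any xi_i > 0 expensive, pushing the solution toward the hard margin.
```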

36. Please describe how SVM solves the problem of nonlinear separability

   By introducing a kernel function, the inner product in the high-dimensional feature space is computed directly from the low-dimensional inputs.

37. What are the functions of kernel functions? What are the common kernel functions?

    A kernel function (rbf, linear, poly) corresponds to a mapping from low dimension to high dimension: the computation is done in the low dimension, but the classification has the effect of working in the high dimension.

    The common kernel functions are the linear kernel, the polynomial kernel, the Gaussian kernel, and the Laplace kernel.
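The idea of computing in the low dimension while getting the high-dimensional inner product can be checked numerically. The sketch below uses the homogeneous degree-2 polynomial kernel on 2-D inputs, for which the explicit feature map is known; the function names and sample vectors are chosen for this example.

```python
import math

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: (x . z)^2, computed in 2-D."""
    return sum(a * b for a, b in zip(x, z)) ** 2

def feature_map(x):
    """Explicit 3-D feature map whose inner product equals poly_kernel."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2]

x, z = [1.0, 2.0], [3.0, 4.0]
k_low = poly_kernel(x, z)  # inner product in 2-D, then squared
k_high = sum(a * b for a, b in zip(feature_map(x), feature_map(z)))  # inner product in 3-D
```

Both values agree (up to floating-point rounding), but `poly_kernel` never constructs the 3-D vectors. For kernels such as the Gaussian kernel the corresponding feature space is infinite-dimensional, so the kernel is the only practical way to compute the inner product.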
38. How does SVM deal with multi-class classification problems?

    OvO (one-vs-one) or OvR (one-vs-rest)

39. What are the differences and connections between logistic regression and the SVM algorithm?

    Similarities:
    1) Logistic regression and SVM are both classification algorithms.
    2) Without kernel functions, LR and SVM are both linear classification models, i.e. their decision boundaries are linear.
    3) Both are supervised learning algorithms.
    4) Both LR and SVM are discriminative models.
    Differences:
    1) Different loss functions.
    LR is based on probability theory: the parameters are estimated by maximum likelihood, the class probabilities are then computed, and the class with the higher probability is taken as the result. SVM is based on maximizing the geometric margin: the hyperplane with the maximum geometric margin is taken as the optimal decision boundary.
    2) SVM considers only the points near the decision boundary, i.e. the support vectors, whereas LR considers all points; points far from the decision boundary also affect the result, although only slightly.
    3) For nonlinear classification problems, SVM uses kernel functions, while LR usually does not.
    4) SVM is not invariant to feature scaling, whereas LR (without regularization) is.
    5) Logistic regression classifies with the sigmoid function and labels the classes {0,1}; SVM classifies with the sign function and labels the classes {+1,-1}.
40. What are the differences and connections between logistic regression and linear regression?

    1) Classical linear regression assumes that the errors follow a normal distribution, while logistic regression makes no such assumption about the distribution of the variables.
    2) Linear regression requires the dependent variable to be a continuous numerical variable, while logistic regression requires the dependent variable to be a categorical variable.
    3) Linear regression requires a linear relationship between the independent variables and the dependent variable, while logistic regression does not (it models a linear relationship with the log-odds instead).
    4) Logistic regression analyzes the relationship between the independent variables and the probability that the dependent variable takes a given value, while linear regression analyzes the relationship between the independent variables and the dependent variable directly.

41. How to evaluate the effect of an algorithm model?

    Classification algorithm evaluation: accuracy, precision, recall, F1 score, ROC curve (AUC value)
    Regression algorithm evaluation: MAE, MSE, explained variance score (explained_variance_score)

42. What are the evaluation metrics for classification algorithms?

    Accuracy, precision, recall, F1 score, ROC curve (AUC value)

43. What is the role of the confusion matrix? (model effect evaluation)

  The confusion matrix shows clearly where the true values and the predicted values agree and where they disagree. Each sample is counted in the cell indexed by its true class and its predicted class, so matches fall on the diagonal and mismatches fall off it. The resulting matrix is called the confusion matrix.

44. What are the functions of, and differences between, recall, precision, accuracy, and the F1 score?

    Recall --> correctly predicted positives / all actual positives
    Precision --> correctly predicted positives / all samples predicted as positive
    Accuracy --> correctly predicted positive and negative samples / total number of samples
    F1 score --> 2/(1/Precision + 1/Recall) --> harmonic mean of precision and recall
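These four metrics follow directly from the four cells of a binary confusion matrix. The sketch below is a minimal illustration; `classification_metrics` and the toy labels are made up for this example, and it assumes at least one predicted positive and one actual positive so that no division by zero occurs.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute precision, recall, accuracy, and F1 from binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    accuracy = (tp + tn) / len(pairs)
    f1 = 2 / (1 / precision + 1 / recall)  # harmonic mean of precision and recall
    return {"precision": precision, "recall": recall, "accuracy": accuracy, "f1": f1}

# 4 actual positives and 6 actual negatives: TP=3, FN=1, FP=1, TN=5
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
```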
45. What are the roles of ROC and AUC?

    TPR (true positive rate): TP/(TP+FN), which is the same as recall.
    FPR (false positive rate): FP/(FP+TN), the probability that a negative sample is classified as positive.
    The ROC curve is drawn with FPR on the horizontal axis and TPR on the vertical axis. AUC is the area under the ROC curve.
    The ROC curve is a common way to evaluate a classification model. The larger the AUC, the better the model.
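The area under the ROC curve equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, which gives a short way to compute it without drawing the curve. The sketch below uses that pairwise formulation; the function name `auc` and the toy scores are chosen for this example, and ties are counted as one half.

```python
def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # a tie counts as half a correct ranking
    return wins / (len(pos) * len(neg))

# One positive (score 0.6) is ranked below one negative (score 0.7)
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
area = auc(y_true, scores)
```

Here 8 of the 9 positive/negative pairs are ordered correctly. The pairwise loop is quadratic in the number of samples, so for large datasets a rank-based computation is preferable.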

46. What are the evaluation metrics for regression algorithms?

    MAE, MSE, R2

47. What are the functions of, and differences between, the MSE, MAE, and R2 metrics?

    MSE ---> mean squared error.
    MAE ---> mean absolute error.
    R2 ---> R2 = 1 - (sum of squared errors between the predictions and the original data) / (sum of squared deviations of the original data from their mean). The denominator measures the dispersion of the original data and the numerator the prediction error; dividing the two removes the influence of the dispersion of the original data.
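The three metrics can be computed together in a few lines. This is a minimal sketch; `regression_metrics` and the toy values are chosen for this example, and it assumes the true values are not all identical so the R2 denominator is non-zero.

```python
def regression_metrics(y_true, y_pred):
    """Compute MSE, MAE, and R2 for paired true and predicted values."""
    n = len(y_true)
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # prediction error
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # dispersion of the data
    r2 = 1.0 - ss_res / ss_tot
    return {"mse": mse, "mae": mae, "r2": r2}

# Three exact predictions and one that is off by 2
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
metrics = regression_metrics(y_true, y_pred)
```

The single large error contributes 4 to the squared-error sum but only 2 to the absolute-error sum, which illustrates why MSE is more sensitive to outliers than MAE.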
48. What is the difference between a loss function and an objective function?

    Loss function (cost function) --> the smaller the loss function, the better the model fits.
    Objective function --> the function to be maximized or minimized is called the objective function. A loss function with a regularization term added is an objective function.

49. In machine learning, what does model selection mean, and how is it done?

    Model selection is the selection of the algorithm model and its parameters (that is, which algorithm to choose).
    How to select a model ---> use cross validation and select the model that performs well.

50. In machine learning, how to debug the learning algorithm when the model performs poorly?

    1) Get more data

    2) Add or remove features

    3) Tune the model hyperparameters

    4) Model ensembling (boosting algorithms)

51. Please describe the principle of random forest (RF)

    1) Draw m sample sets from the dataset by bootstrap (resampling).

    2) For each of the m sample sets, select features at random and train a decision tree, giving m decision tree models.

    3) Combine the m decision trees: voting or weighted voting for classification, mean or weighted mean for regression.
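Steps 1 and 3 (bootstrap resampling and majority voting) can be sketched on their own. Training the individual trees is deliberately left out here; the helper names, the toy rows, and the list of per-tree predictions are all hypothetical and serve only to illustrate the two steps.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement (step 1)."""
    return [rng.choice(rows) for _ in rows]

def majority_vote(predictions):
    """Return the class predicted by the most trees (step 3, classification)."""
    return Counter(predictions).most_common(1)[0][0]

rng = random.Random(0)
rows = [("sunny", "no"), ("rainy", "yes"), ("cloudy", "yes"), ("sunny", "yes")]

# One bootstrap sample per tree; some rows repeat and some are left out
m = 3
samples = [bootstrap_sample(rows, rng) for _ in range(m)]

# Hypothetical predictions from m trained trees for a single query point
tree_predictions = ["yes", "no", "yes"]
final = majority_vote(tree_predictions)
```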

52. Please describe the similarities and differences between RF and GBDT

    Similarities between RF and GBDT:

    1) Both use decision trees as the base model.

    2) Both are ensemble algorithms.

    3) Both consist of multiple trees, and the final result is determined by all of the trees together.

    Differences between RF and GBDT:

   1. The trees in a random forest can be classification trees or regression trees, whereas GBDT consists only of regression trees.

   2. The trees in a random forest can be built in parallel, whereas the trees in GBDT can only be built serially (each tree depends on the previous ones).

   3. For the final output, random forest uses majority voting and similar schemes, whereas GBDT sums or weighted-sums all the results.

   4. Random forest is not sensitive to outliers; GBDT is very sensitive to outliers.

   5. Random forest treats the training samples equally; GBDT is a weight-based ensemble of weak classifiers.

   6. Random forest improves performance by reducing model variance; GBDT improves performance by reducing model bias.

53. What is the role of XGBoost?


54. In XGBoost, what optimization methods are used when building the tree


55. What is the principle / function of ensemble learning?

    Principle: 'learn widely from others' strong points', i.e. combine the strengths of many learners.

    Effect: multiple weak learners are combined by some combination strategy into a strong learner.

56. What is the role of RF and GBDT in feature selection?

    For feature selection, we can first train an RF or GBDT model, obtain the feature_importance values, and then select features with a given threshold.

57. Introduction to cross validation method


Hold-out method:

Leave-one-out method:

k-fold cross validation:

58. The role of L1 and L2 regularization?

59. How to prune the decision tree?

60. Principles of Newton method and quasi Newton method?

Copyright notice: This is the original article of CSDN blogger "vegetable sprouts after frost", which follows the CC 4.0 BY-SA copyright agreement. Please attach the original source link and this notice for reprint.
Original link:

Tags: Algorithm Machine Learning

Posted on Sat, 20 Nov 2021 00:00:34 -0500 by matafy