Machine Learning Model Training Scheme in Mass Data Scenarios

In real machine learning engineering, training a model on a single machine is often very difficult. Typical scenarios include online recommendation, CTR estimation, and lookalike marketing. When a dataset has hundreds of millions of rows and tens of thousands of feature dimensions, and the data volume exceeds 10 GB or even reaches the TB level, how do you train a model on it?

Incremental Learning and Feature Selection

Incremental learning

Incremental learning (sometimes referred to as online learning) trains on small batches of data. Batching is the core of this approach, because only a small amount of data has to be held in memory at any one time. The final result is then produced by voting, as shown in the reference code below.

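import math

# cnt is the number of slices and train is the full training set (both assumed to be defined)
# Number of rows per slice
size = math.ceil(len(train) / cnt)
for i in range(cnt):
    start = size * i
    end = min((i + 1) * size, len(train))
    # Batch data for this slice (the training step itself is not shown here)
    batch = train[start:end]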
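To make the slicing and voting steps concrete, here is a minimal plain-Python sketch. The helper names `iter_slices` and `majority_vote` are hypothetical and do not come from the original code. The sketch assumes that one model is trained per slice and that every model predicts a label for each row, so the final label is simply the most common vote.

```python
import math
from collections import Counter


def iter_slices(n_rows, cnt):
    """Yield (start, end) index pairs that split n_rows into at most cnt slices."""
    size = math.ceil(n_rows / cnt)
    for i in range(cnt):
        start = size * i
        end = min(start + size, n_rows)
        if start < end:  # skip empty slices when cnt does not divide n_rows well
            yield start, end


def majority_vote(predictions):
    """Combine per-model label lists into one label per row by majority vote."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Example: three models (one per slice) each predict labels for the same three rows.
per_model_predictions = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
final_labels = majority_vote(per_model_predictions)
```

Using an odd number of slices avoids ties when the labels are binary.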

Feature selection

Feature selection drops the sparse features whose importance is 0, ranks the remaining features by importance, and persists the selected features in npz format for later loading. The core code is as follows:

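import pandas as pd
from lightgbm import LGBMClassifier

# Assumptions (not shown in the original): clf is an LGBMClassifier already fitted
# with early stopping on an eval set named 'valid' using the 'auc' metric, and
# train_x, train_y, valid_x, valid_y are the training and validation matrices and labels.
se = pd.Series(clf.feature_importances_)
se = se[se > 0]
# Sort features by importance, highest first
col = list(se.sort_values(ascending=False).index)
# Print the number of features with non-zero importance
print('Number of features with non-zero importance:', len(se))
n = clf.best_iteration_
baseloss = clf.best_score_['valid']['auc']
# Find the best number of features by retraining on the top-ranked features
clf = LGBMClassifier(boosting_type='gbdt',
                     num_leaves=31, max_depth=-1,
                     learning_rate=0.1, n_estimators=n,
                     subsample_for_bin=200000, objective=None,
                     class_weight=None, min_split_gain=0.0,
                     min_child_samples=20, subsample=1.0, subsample_freq=1,
                     reg_alpha=0.0, reg_lambda=0.0, random_state=None,
                     n_jobs=-1, verbose=-1)  # verbose=-1 is used in place of silent=True, which recent LightGBM releases no longer accept

print('Start feature selection calculations...')
all_num = int(len(se) / 100) * 100
print('A total of', all_num, 'features to evaluate')
loss = []
break_num = 0
best_num = len(col)
for i in range(100, all_num, 100):
    # Assumed training step (missing from the original): fit on the top-i features
    clf.fit(train_x[:, col[:i]], train_y,
            eval_set=[(valid_x[:, col[:i]], valid_y)],
            eval_names=['valid'], eval_metric='auc')
    loss.append(clf.best_score_['valid']['auc'])
    if loss[-1] > baseloss:
        best_num = i
        baseloss = loss[-1]
        break_num = 0
    else:
        break_num += 1
    print('Score with the top', i, 'features is', loss[-1], 'and the best score so far is', baseloss)
    # Assumed stopping rule: stop after two consecutive rounds without improvement
    if break_num == 2:
        break
print('The best number of features selected is', best_num, 'so the final training can be much faster')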
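The selection loop can also be sketched without any library, which makes the ranking and stopping logic easier to see. This is a sketch under assumptions, not the author's implementation: `select_top_features`, `score_fn`, `step` and `patience` are hypothetical names. `score_fn` stands in for retraining the model on a candidate feature set and returning a validation score where higher is better. Stopping after `patience` rounds without improvement is an assumed rule.

```python
def select_top_features(importances, score_fn, base_score, step=100, patience=2):
    """Return (selected_feature_indices, best_score).

    Features with zero importance are dropped, the rest are ranked by
    importance, and growing prefixes of the ranking are scored with score_fn.
    """
    ranked = sorted(
        (i for i, value in enumerate(importances) if value > 0),
        key=lambda i: importances[i],
        reverse=True,
    )
    best_k, best_score, stale = len(ranked), base_score, 0
    for k in range(step, len(ranked) + 1, step):
        score = score_fn(ranked[:k])
        if score > best_score:
            best_k, best_score, stale = k, score, 0
        else:
            stale += 1
            if stale == patience:
                break  # no improvement for `patience` rounds in a row
    return ranked[:best_k], best_score


# Toy scorer (hypothetical): features 1 and 2 carry the signal, extra features cost a point each.
def toy_score(columns):
    return (100 if {1, 2} <= set(columns) else 50) - len(columns)


selected, score = select_top_features(
    [0.0, 5.0, 3.0, 0.0, 1.0, 2.0], toy_score, base_score=80, step=1
)
```

If no prefix beats `base_score`, the function keeps every feature with non-zero importance.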

The Incremental Learning + Feature Selection method described above suits scenarios where training runs on a single machine with a data volume of around 10 GB. It also suits developers who validate models on Alibaba Cloud PAI with MaxCompute as the computing resource. This method has the following advantages:


  • The model generalizes well and performs strongly.
  • Deployment and model iteration are easy.
  • It supports more complex tree models, which are highly interpretable.

It has the following disadvantage:

  • It works only in single-machine scenarios; data over 100 GB is difficult to train on.


Spark and MMLSpark

Spark is designed for general data processing rather than for machine learning tasks, so it is not really a machine learning framework. To run machine learning tasks on Spark, you can use Spark MLlib; however, this approach usually has the following limitations:

  • Support for complex models, such as ensemble tree models like LightGBM, is limited.
  • Most algorithms are baseline implementations with limited parameter choices, so developers have to reimplement them for anything else. For example, Spark's k-means measures the distance between two vectors with Euclidean distance by default; switching to a metric that Spark does not provide, such as Mahalanobis distance, means reimplementing the algorithm.
  • Support for grid-search tuning is not ideal.
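The point about distance metrics can be illustrated with a small plain-Python sketch of the k-means assignment step. It is not Spark code. The functions `euclidean`, `cosine_distance` and `assign` are hypothetical helpers, shown only to illustrate that swapping the metric can change which centroid a point is assigned to. The cosine version assumes non-zero vectors.

```python
import math


def euclidean(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def assign(points, centroids, distance):
    """Return, for each point, the index of its nearest centroid under `distance`."""
    return [
        min(range(len(centroids)), key=lambda j: distance(p, centroids[j]))
        for p in points
    ]


centroids = [(1.0, 0.0), (8.0, 8.0)]
points = [(10.0, 1.0)]
by_euclidean = assign(points, centroids, euclidean)    # nearest in position
by_cosine = assign(points, centroids, cosine_distance)  # nearest in direction
```

The same point lands in different clusters because Euclidean distance compares positions while cosine distance compares directions.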

In view of this, Microsoft developed MMLSpark, which adds deep learning and data science tools to Apache Spark and integrates CNTK and LightGBM with Spark. With MMLSpark:

  • Ensemble tree models can be trained on Spark.
  • Complex parameters, such as leaf node settings, can be adjusted.
  • Hyperparameter search is supported.

The sample code is as follows:

# Import path as used by MMLSpark 1.0 releases (an assumption; other versions use a different path)
from mmlspark.lightgbm import LightGBMRegressor

# Instantiate a LightGBM regressor. Its parameters are similar to, but not the same as,
# those of the single-machine version; see the MMLSpark documentation for details.
lgbm = LightGBMRegressor(numIterations=120, objective='binary',
        learningRate=0.007, baggingSeed=50,
        boostingType="goss", lambdaL1=0.4, lambdaL2=0.4,
        baggingFraction=0.87, minSumHessianInLeaf=0.003,
        maxDepth=9, featureFraction=0.66, numLeaves=47,
        )  # any further arguments are not shown
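Hyperparameter search over parameters like these can be pictured as enumerating a grid and keeping the best-scoring combination. The sketch below is plain Python rather than MMLSpark's tuning API. `iter_grid`, `grid_search`, `toy_evaluate` and `param_grid` are hypothetical names, and `toy_evaluate` stands in for training a model with the given parameters and returning a validation score.

```python
from itertools import product


def iter_grid(param_grid):
    """Yield one dict per combination of the listed parameter values."""
    names = sorted(param_grid)
    for values in product(*(param_grid[name] for name in names)):
        yield dict(zip(names, values))


def grid_search(param_grid, evaluate):
    """Return (best_params, best_score); a higher score is treated as better."""
    best_params, best_score = None, float("-inf")
    for params in iter_grid(param_grid):
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy stand-in for "train and validate": it prefers numLeaves=47 and maxDepth=9.
def toy_evaluate(params):
    return -abs(params["numLeaves"] - 47) - abs(params["maxDepth"] - 9)


param_grid = {"numLeaves": [31, 47, 63], "maxDepth": [6, 9]}
best_params, best_score = grid_search(param_grid, toy_evaluate)
```

The number of combinations is the product of the list lengths, so the cost of a full grid grows quickly as parameters are added.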

MMLSpark has the following advantages over incremental learning:

  • Distributed training.
  • Support for training and modeling on PB-scale data.

It has the following disadvantage:

  • Setting up and maintaining the environment is complex and costly.

Deep Learning Frameworks such as TensorFlow

The dataset is split evenly across the nodes of the system, and each node holds a copy of the neural network with its own local weights. Each node processes a different subset of the dataset and updates its local weights. The local weights are then shared across the cluster, and a new global set of weights is computed by an aggregation algorithm. These global weights are distributed to all nodes, which then process the next batch of data.
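The cycle described above (local update on each node, then aggregation into global weights) can be simulated in a few lines of plain Python. This is a toy sketch under assumptions: the model is a single weight `w` fitting `y = w * x`, the aggregation is a simple average, and `local_update` and `train_data_parallel` are hypothetical names. Real frameworks run the nodes in parallel on separate machines.

```python
def local_update(w, shard, lr):
    """One gradient-descent step on one node's shard, for squared error of y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in shard) / len(shard)
    return w - lr * grad


def train_data_parallel(data, n_nodes, rounds, lr=0.05, w=0.0):
    """Simulate data-parallel training; assumes len(data) >= n_nodes."""
    # Split the dataset across the nodes (round-robin).
    shards = [data[i::n_nodes] for i in range(n_nodes)]
    for _ in range(rounds):
        # Each node starts from the global weight and updates its local copy.
        local_weights = [local_update(w, shard, lr) for shard in shards]
        # Aggregate the local weights into the new global weight by averaging.
        w = sum(local_weights) / len(local_weights)
    return w


# Data generated from y = 3 * x, split across two simulated nodes.
data = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
w_final = train_data_parallel(data, n_nodes=2, rounds=50)
```

With this toy data the averaged weight approaches 3.0, the slope used to generate the data.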

As shown in the figure above, TensorFlow uses the parameter server (PS) method for distributed machine learning training. It processes the dataset in parallel and shares the parameters globally.
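The parameter server idea can be sketched as a single object that owns the global parameters, with workers that pull them, compute gradients on their own shard, and push the gradients back. The code below is a synchronous, single-process toy and not TensorFlow's API. `ParameterServer`, `pull`, `push` and `worker_gradient` are hypothetical names, and the model is assumed to be a simple line `y = w0 + w1 * x`.

```python
class ParameterServer:
    """Holds the global parameters; workers pull them and push gradients back."""

    def __init__(self, weights, lr):
        self.weights = list(weights)
        self.lr = lr

    def pull(self):
        """Return a copy of the current global weights."""
        return list(self.weights)

    def push(self, grads):
        """Apply the average of the workers' gradients to the global weights."""
        n_workers = len(grads)
        for j in range(len(self.weights)):
            self.weights[j] -= self.lr * sum(g[j] for g in grads) / n_workers


def worker_gradient(weights, shard):
    """Gradient of the mean squared error of y = w0 + w1 * x on one shard."""
    w0, w1 = weights
    g0 = sum(2 * (w0 + w1 * x - y) for x, y in shard) / len(shard)
    g1 = sum(2 * (w0 + w1 * x - y) * x for x, y in shard) / len(shard)
    return [g0, g1]


# Data generated from y = 1 + 2 * x, split across two simulated workers.
data = [(x, 1.0 + 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
shards = [data[0::2], data[1::2]]
ps = ParameterServer([0.0, 0.0], lr=0.05)
for _ in range(2000):
    current = ps.pull()                                   # workers fetch global weights
    ps.push([worker_gradient(current, s) for s in shards])  # workers send gradients
```

Keeping the parameters in one place is what gives every worker the same view of the model between steps.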

The advantages of using a deep learning framework are as follows:

  • TensorFlow, MXNet and PyTorch natively support distributed training with simple, flexible configuration.

The disadvantages are as follows:

  • Lack of interpretability in machine learning scenarios.
  • Data-parallel computing raises the barrier to entry for development.

Apart from the methods above, if you use MaxCompute + DataWorks on Alibaba Cloud, you can try machine learning PAI for modeling on large volumes of data. PAI hides the development complexity of a distributed environment and requires no environment configuration or maintenance, so it can also be a cost-effective choice.

In the latest version, PAI supports running models on small datasets, so developers can run through the whole process first and move on to large-scale computation after validation.

Posted on Mon, 11 May 2020 23:50:09 -0400 by les48