Auto-scheduling mechanism in TVM (Ansor): learning notes

Background

TVM inherits Halide's idea of separating the algorithm from the schedule. The user defines the computation (the algorithm) in a DSL such as TE (Tensor Expression); the compiler then optimizes the corresponding schedule and finally generates code for the target platform according to that schedule. To automatically generate a high-performance operator implementation for a given computation, the core problem is finding a good schedule. This schedule depends not only on the computation and its inputs but also on the hardware platform, and the search space for the optimal schedule is huge; it is essentially an NP-hard combinatorial optimization problem. To tackle this problem, TVM introduced an auto-tuning mechanism, which greatly improves the performance of the compiled operator kernels.

The first generation of this auto-tuning mechanism is called AutoTVM, described in the paper <Learning to Optimize Tensor Programs>. It requires the user to write a schedule template, leaving some parameters open for tuning. Writing such templates still demands domain knowledge and time. Therefore, auto-tuning 2.0, namely Ansor (AutoScheduler, or AutoTVM 2.0), was introduced into TVM, described in the paper <Ansor: Generating High-Performance Tensor Programs for Deep Learning>. Compared with AutoTVM, one of Ansor's main advantages is that no schedule template needs to be written, making schedule generation more automated. Table 1 of the official blog <Introducing TVM Auto-scheduler (a.k.a. Ansor)> summarizes its characteristics and differences from the previous generation, and Figure 1 of the same blog shows the overall search process; the paper also has accompanying slides. This article is a set of informal notes on the implementation of this mechanism; let's walk through the implementation in combination with the code.

Workflow

There are some examples of Ansor on the official website (see <Use AutoScheduler for Template-Free Scheduling> ). For example, for Nvidia GPU platform:

Define the computation

First, you need to define the computation. For a single operator, you can register the computation as a task with the register_workload() function (defined in the workload_registry.py file), as in the tune_conv2d_layer_cuda.py file:

@auto_scheduler.register_workload
def conv2d_layer(N, H, W, CO, CI, KH, KW, stride, padding):
    data = te.placeholder((N, CI, H, W), name="data")
    kernel = te.placeholder((CO, CI, KH, KW), name="kernel")
    bias = te.placeholder((1, CO, 1, 1), name="bias")
    conv = topi.nn.conv2d_nchw(data, kernel, stride, padding, dilation=1, out_dtype="float32")
    out = topi.nn.relu(conv + bias)
    return [data, kernel, bias, out]

The main job of this decorator is to store the decorated function in the WORKLOAD_FUNC_REGISTRY mapping. Then create the corresponding search task:

N, H, W, CO, CI, KH, KW, strides, padding = 1, 7, 7, 512, 512, 3, 3, (1, 1), (1, 1)
task = auto_scheduler.SearchTask(
    func=conv2d_layer, args=(N, H, W, CO, CI, KH, KW, strides, padding), target=target
)

The SearchTask() function is defined in the search_task.py file; the object of the same name in the C++ layer is defined in the search_task.* files. When this object is built, a ComputeDAG object is first constructed from the workload. ComputeDAG is an IR designed specifically for Ansor; it is transformed from the computation declaration described in TE (which can be a single operator or a subgraph). The Python and C++ objects of ComputeDAG are defined in the compute_dag.* files. Taking matmul and conv2d as examples, their ComputeDAGs are:

# matmul
A = PLACEHOLDER [512, 512]
B = PLACEHOLDER [512, 512]
C(i, j) += (A[i, k]*B[k, j])

# conv2d
data = PLACEHOLDER [1, 512, 7, 7]
pad_temp(i0, i1, i2, i3) = tir.if_then_else(((((i2 >= 1) && (i2 < 8)) && (i3 >= 1)) && (i3 < 8)), data[i0, i1, (i2 - 1), (i3 - 1)], 0f)
kernel = PLACEHOLDER [512, 512, 3, 3]
compute(nn, ff, yy, xx) += (pad_temp[nn, rc, (yy + ry), (xx + rx)]*kernel[ff, rc, ry, rx])
bias = PLACEHOLDER [1, 512, 1, 1]
T_add(ax0, ax1, ax2, ax3) = (compute[ax0, ax1, ax2, ax3] + bias[ax0, ax1, 0, 0])
compute(i0, i1, i2, i3) = max(T_add[i0, i1, i2, i3], 0f)
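
Dumps like these can be reproduced by simply printing a task's ComputeDAG. A minimal sketch (the matmul workload below is a hypothetical registration, written the same way as conv2d_layer above):

import tvm
from tvm import te, auto_scheduler

# Hypothetical matmul workload, registered the same way as conv2d_layer above.
@auto_scheduler.register_workload
def matmul(N, L, M):
    A = te.placeholder((N, L), name="A")
    B = te.placeholder((L, M), name="B")
    k = te.reduce_axis((0, L), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

task = auto_scheduler.SearchTask(
    func=matmul, args=(512, 512, 512), target=tvm.target.Target("llvm")
)
print(task.compute_dag)  # prints a textual DAG like the matmul dump above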

Unlike AutoTVM, which searches within predetermined template parameters, Ansor constructs the search space by modifying the loop structure. To change the loop structure flexibly, it implements a lightweight loop-structure IR on top of the original IR, dedicated to schedule search. It contains State and Action: the former represents the state of the schedule search, i.e., the loop structure defined by the schedule so far; the latter consists of one or more schedule primitives. As for why a separate sketch IR was built instead of extending the existing IR, there was discussion in the community; the main reasons are fast incremental modification of the loop structure and traceability of the transform history. ComputeDAG also implements some analyses, such as the total floating-point operation count, consumer/producer relationships, and whether a stage should be tiled.
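
To get a feel for this loop-structure IR, a State can also be manipulated by hand from Python. A small sketch, reusing the matmul task from the previous snippet (the split sizes are arbitrary):

# Reusing `task` from the matmul snippet above; split sizes are arbitrary.
state = task.compute_dag.get_init_state()  # initial State: the naive loop nest
C = state.stage_ops[2]                     # the stage computing C

# Each call records a transform step and incrementally rewrites the loop nest.
i, j, k = state[C].iters
io, ii = state.split(C, i, [16])           # adds a SplitStep to the history
jo, ji = state.split(C, j, [16])
state.reorder(C, [io, jo, k, ii, ji])      # adds a ReorderStep
print(state)                               # dump the current loop structure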

If the object to be tuned is an entire network, you need to call the extract_tasks() function, which extracts tuning tasks from a network represented in Relay.

# Extract tasks from the network
print("Extract tasks...")
mod, params, input_shape, output_shape = get_network(network, batch_size, layout, dtype=dtype)
tasks, task_weights = auto_scheduler.extract_tasks(mod["main"], params, target)

for idx, task in enumerate(tasks):
    print("========== Task %d  (workload key: %s) ==========" % (idx, task.workload_key))
    print(task.compute_dag)

For example, for the ResNet-50 model, part of the output is:

========== Task 0  (workload key: ["dac19035dd5fe9424ee8617421b9c817", 1, 28, 28, 128, 4, 4, 128, 128, 1, 28, 28, 128, 1, 28, 28, 128]) ==========
placeholder = PLACEHOLDER [1, 28, 28, 128]
data_pad(i0, i1, i2, i3) = tir.if_then_else(((((i1 >= 1) && (i1 < 29)) && (i2 >= 1)) && (i2 < 29)), placeholder[i0, (i1 - 1), (i2 - 1), i3], 0f)
input_tile(eps, nu, p, ci) = data_pad[floordiv(p, 196), ((floormod(floordiv(p, 14), 14)*2) + eps), ((floormod(p, 14)*2) + nu), ci]
B(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 4) == 3)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 4) == 2)),  ..(OMITTED).. ormod(i, 4) == 0) && (floormod(j, 4) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 4) == 0)), 1f, 0f))))))))))))))))
data_pack(eps, nu, p, ci) += ((input_tile[r_a, r_b, p, ci]*B[r_a, eps])*B[r_b, nu])
placeholder = PLACEHOLDER [4, 4, 128, 128]
bgemm(eps, nu, p, co) += (data_pack[eps, nu, p, ci]*placeholder[eps, nu, co, ci])
A(i, j) = select(((floormod(i, 4) == 3) && (floormod(j, 2) == 1)), 1f, select(((floormod(i, 4) == 3) && (floormod(j, 2) == 0)),  ..(OMITTED).. ct(((floormod(i, 4) == 0) && (floormod(j, 2) == 1)), 0f, select(((floormod(i, 4) == 0) && (floormod(j, 2) == 0)), 1f, 0f))))))))
inverse(vh, vw, p, co) += ((bgemm[r_a, r_b, p, co]*A[r_a, vh])*A[r_b, vw])
conv2d_winograd(n, h, w, co) = inverse[floormod(h, 2), floormod(w, 2), ((((n*14)*14) + (floordiv(h, 2)*14)) + floordiv(w, 2)), co]
placeholder = PLACEHOLDER [1, 28, 28, 128]
T_add(ax0, ax1, ax2, ax3) = (conv2d_winograd[ax0, ax1, ax2, ax3] + placeholder[ax0, ax1, ax2, ax3])

========== Task 1  (workload key: ["7006235cfc29b73be524cf390ed5a977", 1, 56, 56, 64, 1, 1, 64, 64, 1, 56, 56, 64]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 64]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, (yy + ry), (xx + rx), rc]*placeholder[ry, rx, rc, ff])

========== Task 2  (workload key: ["d7b65649a4dd54becea0a52aabbc5af5", 1, 1000, 1, 1000]) ==========
placeholder = PLACEHOLDER [1, 1000]
T_softmax_maxelem(i0) max= placeholder[i0, k]
T_softmax_exp(i0, i1) = tir.exp((placeholder[i0, i1] - T_softmax_maxelem[i0]))
T_softmax_expsum(i0) += T_softmax_exp[i0, k]
T_softmax_norm(i0, i1) = (T_softmax_exp[i0, i1]/T_softmax_expsum[i0])

========== Task 3  (workload key: ["f4380bb1dc62422a69ad4a1a9771f927", 1, 56, 56, 64, 1, 1, 64, 128, 1, 28, 28, 128]) ==========
placeholder = PLACEHOLDER [1, 56, 56, 64]
PaddedInput(i0, i1, i2, i3) = placeholder[i0, i1, i2, i3]
placeholder = PLACEHOLDER [1, 1, 64, 128]
Conv2dOutput(nn, yy, xx, ff) += (PaddedInput[nn, ((yy*2) + ry), ((xx*2) + rx), rc]*placeholder[ry, rx, rc, ff])

The extract_tasks() function is defined in the relay_integration.py file. It calls the call_all_topi_funcs() function, during which a TracingEnvironment object collects TOPI (TVM Operator Inventory) calls. After completion, each task's workload key and weight are recorded in the wkl_key_to_weight member of the TracingEnvironment object. For each task and its corresponding weight, a SearchTask object is then created.

For each CallNode, a ComputeDAG object is created and registered. The AccessAnalyzer performs static analysis on the computation: for each operator, it checks whether its access pattern is simple (element-wise), whether it needs multi-level tiling, whether it is an output, whether it is strictly inlinable, etc. This information is mainly used for the later sketch generation. The specific implementation is in the AccessAnalyzer::AccessAnalyzer() function (in the compute_dag.cc file).

Tuning

Next, set some parameters of tuning, and then call the tune() function to start the tuning process. Corresponding code:

def run_tuning():
    print("Begin tuning...")
    measure_ctx = auto_scheduler.LocalRPCMeasureContext(repeat=1, min_repeat_ms=300, timeout=10)

    tuner = auto_scheduler.TaskScheduler(tasks, task_weights)
    tune_option = auto_scheduler.TuningOptions(
        num_measure_trials=200,  # change this to 20000 to achieve the best performance
        runner=measure_ctx.runner,
        measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
    )

    tuner.tune(tune_option)

There are several key components in the whole tuning process. The task scheduler is responsible for task scheduling: as mentioned earlier, the whole network is divided into multiple subgraphs, each corresponding to a task, and the scheduler prioritizes the tasks that are most likely to improve end-to-end performance. For each task (subgraph), the sketch generation step produces sketches, which can be understood as coarse-grained schedules in which the rough loop structure is determined. The annotation step then refines the schedule: it randomly fills in finer-grained decisions (parallelization, unrolling, tile sizes, etc.) to form complete schedules, mainly providing the initial population for the next step. Finally, the search is carried out by an evolutionary algorithm. To reduce the cost of performance measurement, the search also trains a cost model to estimate the performance of generated programs; high-scoring candidates are selected for real measurement, and the performance data measured on real hardware is fed back to train the cost model and improve its accuracy, and the loop repeats.

Task Scheduler

For a neural network, task extraction yields many subgraphs (for example, ResNet-50 has about 30 subgraphs after graph partitioning). How to allocate tuning time effectively across these subgraphs to improve performance as much as possible is a real problem. Page 20 of the official slides graphically compares Ansor's approach with previous methods.

The method adopted in Ansor is to estimate the impact of each task on the objective function (which represents the end-to-end latency), and then use the derivative to select the task that makes the objective function decrease fastest. Suppose there are $n$ tasks, and let $t = [t_1, t_2, ..., t_n]$ be the allocation vector, where $t_i$ is the time spent on task $i$. Let $g_i(t)$ be the latency of the subgraph corresponding to task $i$, which is a function of the allocation vector $t$. The objective function can be written as $f = \sum_{i=1}^n w_i \times g_i(t)$, where $w_i$ is the weight of subgraph $i$ (the number of times the subgraph is repeated in the network). At each step, the task with the smallest (most negative) derivative is selected. However, because $t$ is discrete and cannot be differentiated directly, the derivative needs to be approximated (the following formula is from the appendix of the paper):
$$\frac{\partial f}{\partial t_i} = \frac{\partial f}{\partial g_i} \frac{\partial g_i}{\partial t_i} \approx \frac{\partial f}{\partial g_i} \left( \alpha \frac{g_i(t_i) - g_i(t_i - \Delta t)}{\Delta t} + (1 - \alpha) \frac{g_i(t_i + \Delta t) - g_i(t_i)}{\Delta t} \right) \approx \frac{\partial f}{\partial g_i} \left( \alpha \frac{g_i(t_i) - g_i(t_i - \Delta t)}{\Delta t} + (1 - \alpha) \big( g_i(t_i + 1) - g_i(t_i) \big) \right)$$

Here $g_i(t_i)$ and $g_i(t_i - \Delta t)$ can be obtained from historical information, but $g_i(t_i + 1)$ cannot, because it is future information and therefore needs to be predicted. There are two methods:

  • Optimistic guess: assume the latency keeps decreasing at the current average rate and reaches $g_i = 0$ after another $t_i$ units of time; that is, $g_i(t_i + 1) = g_i(t_i) - g_i(t_i) / t_i$.
  • Similarity between tasks: if subgraphs are similar, their latency should also be similar. Therefore, the latency of similar subgraphs can be used as an approximation.

Considering these two factors, the following approximation can be obtained (the following formula is from the appendix of the paper):
$$g_i(t_i + 1) \approx \min \left( g_i(t_i) - \frac{g_i(t_i)}{t_i},\; \beta \frac{C_i}{\max_{k \in N(i)} V_k} \right)$$

where $C_i$ is the number of floating-point operations in task $i$, $V_k$ is the number of floating-point operations per second achieved by task $k$, and $N(i)$ is the set of tasks similar to task $i$. In the implementation, $g_i(t_i)$ uses the current best latency of task $i$, and $t_i$ is the number of tuning rounds allocated to task $i$ (i.e., the number of times the _tune_task() function has been called for it).
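
Putting the pieces together, the per-task gradient can be sketched numerically as follows (variable names are illustrative, not those in task_scheduler.py; the backward term assumes t_i >= delta_t, which the warm-up round described below guarantees):

def approx_gradient(w_i, g_hist, t_i, delta_t, C_i, V_neighbors, alpha=0.2, beta=2.0):
    """Illustrative sketch of the appendix formula (not TVM's exact code).
    g_hist[r]: best latency of this task after r tuning rounds."""
    g_t = g_hist[t_i]
    chain_grad = w_i  # df/dg_i = w_i, since f = sum_i w_i * g_i
    # Backward difference, computable from history.
    backward_grad = (g_t - g_hist[t_i - delta_t]) / delta_t
    # Forward difference: g_i(t_i + 1) is predicted as the min of the two guesses.
    optimistic = g_t - g_t / t_i                  # latency reaches 0 after t_i more rounds
    similarity = beta * C_i / max(V_neighbors)    # FLOPs / best FLOPS among similar tasks
    forward_grad = min(optimistic, similarity) - g_t
    return chain_grad * (alpha * backward_grad + (1 - alpha) * forward_grad)

# The scheduler would then pick the task with the most negative gradient, e.g.:
# task_idx = min(range(len(tasks)), key=lambda i: grads[i])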

The main implementation of this mechanism is the TaskScheduler class (in the task_scheduler.py file). Two strategies are implemented: round-robin and gradient. Round-robin simply takes turns; gradient is the derivative-based method described above. The main flow of the TaskScheduler.tune() function is as follows:

tune() # In task_scheduler.py
    search_policies = make_search_policies(search_policy, ...)
        cost_model = XGBModel(...)
        search_policies = [SketchPolicy() for task in tasks]
    
    # do a round robin first to warm up
    for idx in range(len(self.tasks)):
        _tune_task(idx)
    
    # use the specific strategy to choose workload to tune
    while self.ct < tune_option.num_measure_trials and len(self.dead_tasks) < len(self.tasks):
        if strategy is round_robin:
            ...
        elif strategy is gradient:
            for each task:
                # compute gradient from chain rule : (delta f / delta g_i)
                ...
                # compute (g_i(t_i) - g(t_i - \Delta t)) / (\Delta t)
                ...
                # compute (g_i(t_i + \Delta t) - g(t_i)) / (\Delta t)
                ...
                # combine all grads
                ...
                gradients.append(grad)
            task_idx = np.argmin(gradients)
        _tune_task(task_idx)
            search_policies[task_idx].continue_search_one_round()

Tuning starts with a round of warm-up, i.e., one round of tuning for each task. Then the main loop begins, selecting tasks based on the gradient method described above. Each selected task is tuned for one round. The number of trials in a round is num_measures_per_round, which is the minimum of the default value 64 and num_measure_trials divided by the number of tasks (so that every task is tuned at least once). The loop ends when the specified number of measurement trials is reached, or when all tasks are "dead" ("dead" means the task's search space has been fully explored or it has shown no improvement for a while).
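
The round-size logic just described amounts to the following (a sketch, not the exact code in task_scheduler.py; num_measure_trials and tasks are the objects defined earlier):

# Sketch of the round-size logic (not the exact code in task_scheduler.py).
DEFAULT_MEASURES_PER_ROUND = 64
num_measures_per_round = min(DEFAULT_MEASURES_PER_ROUND,
                             num_measure_trials // len(tasks))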

Program Sampler

The main goal of this step is to automatically construct the search space and sample it evenly. The method is a two-level hierarchical search space (sketch + annotation): the first stage, sketch generation, produces a small number of sketches; the second stage, random annotation, produces a comparatively large number of annotated programs. Page 12 onward of the official slides illustrates this graphically.

Sketches are generated per subgraph. The continue_search_one_round() function mentioned above performs one round of search. The main flow is as follows:

continue_search_one_round() // In search_policy.py
    SketchPolicy::ContinueSearchOneRound() // In sketch_policy.cc
        // Search one round to get promising states
        best_states = SearchOneRound()
            // 1. Generate sketches
            sketch_cache_ = GenerateSketches()
            // 2. Sample the init population
            init_population = SampleInitPopulation(sketch_cache_)
            // 3. Perform evolutionary search
            // Also insert already measured good states to the initial population
            init_population.push_back(...)
            // Sample some random states for eps-greedy
            RandomSampleStates(init_population, ...)
            return EvolutionarySearch(...)
        // Pick 'num_measure_per_iter' states to measure, also pick some random states
        inputs = PickStatesWithEpsGreedy(best_states, random_states, num_measures)
        // Measure candidate states
        results = ProgramMeasurer::Measure(search_task, ..., inputs)
        // Update the cost model
        CostModel::Update()
            update_func()
                update() // In xgb_model.py

The SketchPolicyNode::GenerateSketches() function (implemented in the sketch_policy.cc file) generates the sketches. Starting from the initial State of the ComputeDAG, it tries to apply derivation rules to each Stage of each State; if a rule applies, the resulting state is added to the list and awaits further derivation. This is an enumerative traversal. A State in the implementation (in the loop_state.* files) contains the current loop structure and the list of transform steps used to build it; each State corresponds to a specific schedule of the ComputeDAG.
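
The enumeration can be paraphrased as the following worklist traversal (Python pseudocode, not the actual C++ implementation; the condition kinds are illustrative stand-ins for ConditionKind in the source):

# Python pseudocode paraphrasing SketchPolicyNode::GenerateSketches().
SKIP, APPLY, APPLY_AND_SKIP_REST = range(3)  # stand-ins for ConditionKind

def generate_sketches(init_state, num_stages, sketch_rules):
    work_list = [(init_state, num_stages - 1)]  # work from the last stage backwards
    sketches = []
    while work_list:
        state, stage_id = work_list.pop()
        if stage_id < 0:                 # all stages handled: a complete sketch
            sketches.append(state)
            continue
        for rule in sketch_rules:
            cond = rule.meet_condition(state, stage_id)
            if cond == SKIP:
                continue
            # Apply() may yield several derived states, each with the next
            # stage to process; all of them are queued for further derivation.
            for new_state, next_stage_id in rule.apply(state, stage_id):
                work_list.append((new_state, next_stage_id))
            if cond == APPLY_AND_SKIP_REST:
                break                    # do not try the remaining rules
    return sketches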

For the derivation rules, see the SketchPolicy::SketchPolicy() function, which initializes the derivation rule list (sketch_rules in SketchPolicyNode). Note that this rule list is platform-dependent. The derivation rules are described in detail in Table 1 of the paper; the corresponding part of the code is in the sketch_policy.cc file:

/********** Sketch generation rules **********/
static RuleSkipStage rule_skip_stage;
static RuleAlwaysInline rule_always_inline;
static RuleMultiLevelTiling rule_multi_level_tiling;
static RuleMultiLevelTilingWithFusion rule_multi_level_tiling_with_fusion;
static RuleAddCacheRead rule_add_cache_read_stage;
static RuleAddCacheWrite rule_add_cache_write_stage;
static RuleAddRfactor rule_add_rfactor;
static RuleCrossThreadReduction rule_cross_thread_reduction;
static RuleSimplifyComputeWithConstTensor rule_simplify_compute_with_const_tensor;
static RuleSpecialComputeLocationGPU rule_special_compute_location_gpu;

Each rule implements two functions: MeetCondition(), which decides whether the rule's application condition is met, and Apply(), which applies the rule. The condition checks rely on the information collected by the AccessAnalyzer when the ComputeDAG was built earlier. The rules include the following (a Python sketch of the same MeetCondition/Apply pattern follows the list):

  1. RuleSkipStage: skip the stage, i.e., do nothing.
  2. RuleAlwaysInline: for simple element-wise operations, such as element-wise add, relu, etc.
  3. RuleMultiLevelTiling: for compute-intensive operators with internal reuse opportunities, such as matmul and conv2d.
  4. RuleMultiLevelTilingWithFusion: same as the above, but the consumer can be fused in, such as conv2d + relu.
  5. RuleAddCacheRead: applied when the consumer of the data is an operation that requires multi-level tiling.
  6. RuleAddCacheWrite: applied when the operation requires multi-level tiling and its consumers cannot be fused.
  7. RuleAddRfactor: splits the reduce axis; used when the space axes are short and the reduce axis is long.
  8. RuleCrossThreadReduction: parallelizes the split reduce axis above across multiple threads.
  9. RuleSimplifyComputeWithConstTensor: mainly used for the Winograd convolution algorithm.
  10. RuleSpecialComputeLocationGPU: similar to the above, mainly used for the Winograd algorithm on GPU.
Multi-level tiling is a classical transformation with a fixed pattern: "SSRSRS" on CPU and "SSSRRSRS" on GPU, where "S" stands for one tile level of the space loops and "R" stands for one tile level of the reduction loops. It is suitable for compute-intensive operators (such as matmul, conv2d, conv3d).
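
As an illustration, the CPU "SSRSRS" pattern for a matmul corresponds to a loop nest of the following shape (tile sizes here are arbitrary placeholders, not tuned values):

import numpy as np

N = 64
A, B = np.random.rand(N, N), np.random.rand(N, N)
C = np.zeros((N, N))
T1, T2, T3 = 2, 4, 4   # space tile sizes per level (placeholders)
K1 = 16                # reduction tile size (placeholder)

for i0 in range(N // (T1 * T2 * T3)):              # S: space tile level 1
    for j0 in range(N // (T1 * T2 * T3)):
        for i1 in range(T1):                       # S: space tile level 2
            for j1 in range(T1):
                for k0 in range(N // K1):          # R: reduction tile level 1
                    for i2 in range(T2):           # S: space tile level 3
                        for j2 in range(T2):
                            for k1 in range(K1):   # R: reduction tile level 2
                                for i3 in range(T3):        # S: space tile level 4
                                    for j3 in range(T3):
                                        i = ((i0 * T1 + i1) * T2 + i2) * T3 + i3
                                        j = ((j0 * T1 + j1) * T2 + j2) * T3 + j3
                                        k = k0 * K1 + k1
                                        C[i, j] += A[i, k] * B[k, j]

assert np.allclose(C, A @ B)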

To apply an evolutionary algorithm, an initial population is needed, and every individual in the population must be a complete schedule. The previous step generated the loop structure, but some details are still undetermined, such as tile sizes, GPU thread binding, loop unrolling, vectorization, and parallelization. This information is therefore added randomly to make each schedule complete; this step is called annotation. For examples, see page 15 of the official slides.

The SketchPolicyNode::SampleInitPopulation() function (implemented in the sketch_policy.cc file) generates annotations randomly. It first randomly selects one of the given sketches, and then makes random modifications, such as filling in tile sizes, parallelizing outer loops, vectorizing inner loops, and unrolling inner loops, to produce a schedule. It stops when the number of generated schedules reaches a specified count (the default is 50). The init_rules list in the code contains the rules for generating annotations; the corresponding declarations are:

/********** Init population rules **********/
static InitFillTileSize init_fill_tile_size;
static InitChangeComputeLocation init_change_compute_location;
static InitParallel init_parallel;
static InitUnroll init_unroll;
static InitVectorization init_vectorization;
static InitThreadBind init_thread_bind;

Take InitFillTileSize as an example: its Apply() function scans the transform history and randomly fills in tile sizes for all SplitSteps.
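
The idea can be pictured with a few lines (a sketch of the behavior, not the TVM implementation):

import random

def random_tile_sizes(extent, num_levels):
    """Randomly factor a loop extent into `num_levels` tile sizes
    whose product equals the extent (sketch of InitFillTileSize's idea)."""
    sizes = []
    remaining = extent
    for _ in range(num_levels - 1):
        divisors = [d for d in range(1, remaining + 1) if remaining % d == 0]
        f = random.choice(divisors)
        sizes.append(f)
        remaining //= f
    sizes.append(remaining)  # the last level takes whatever is left
    return sizes

print(random_tile_sizes(512, 4))  # e.g. [4, 8, 2, 8]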

Performance Tuner

The number of randomly sampled programs is small, and their performance is not guaranteed. Therefore, an evolutionary algorithm combined with a cost model is used to tune the sampled programs. Once the initial population has been generated, the evolutionary search can begin. This part is mainly implemented in the SketchPolicyNode::EvolutionarySearch() function (in the sketch_policy.cc file). Over multiple rounds, the evolutionary algorithm uses mutation to generate candidate sets and outputs the programs with the highest scores as determined by the cost model. These programs are compiled and run on the target machine to obtain real performance data, which is fed back to train the cost model; the accuracy of the cost model therefore gradually improves as data accumulates.

According to the general routine of evolutionary algorithm, there are two important steps:

  • Crossover
    New offspring are produced by fusing the genes of two or more parents. See page 17 of the official slides. This does not appear to be implemented in the code yet.

  • Mutation
    This step mainly applies the following rules:

    • Tile size: randomly select a tiled loop, divide the tile size of one tile level by a random factor, and multiply that factor into another tile level (keeping the product unchanged).
    • Parallel / unroll / vectorize factor and granularity: randomly select a loop marked parallel and change its granularity by splitting it or fusing adjacent loop levels; the maximum unroll step count is also chosen randomly.
    • Computation location: randomly select a node that is not multi-level tiled and randomly change its computation location to another legal attach point.

The code corresponding to these rules is in sketch_policy_rules.h:

/*! \brief The rule that mutates tile size by randomly dividing a tile size by a factor
    and multiplying it to another tile size. */
DEFINE_MUTATE_POPULATION_RULE(MutateTileSize);

/*! \brief The rule that mutates the number of fused outer iterators annotated by parallel. */
DEFINE_MUTATE_POPULATION_RULE(MutateParallel);
  
/*! \brief The rule that randomly changes the computation location for some stages that do not
 * need tiling and are not strictly inlineable(e.g. data padding). */
DEFINE_MUTATE_POPULATION_RULE(MutateComputeLocation);
  
/*! \brief The rule that mutates the value of a randomly selected auto unroll pragma step. */
DEFINE_MUTATE_POPULATION_RULE(MutateAutoUnroll);

The overall evolutionary search process is as follows:

The whole iteration runs for 5 rounds by default. In each round, a new population is generated from the previous round's population according to the mutation rules (the first round uses the initial population generated earlier). In each round, the cost model predicts the performance of the population and produces scores; the higher the score, the better the predicted performance, and the greater the probability that the individual is selected for mutation. In addition, a heap is maintained throughout the process, holding the individuals with the best predicted performance so far. After several rounds of iteration, the heap contains the "winning" individuals, and the rest are eliminated. The winners then undergo performance measurement on real hardware, and the measured data is used to train the cost model, which therefore becomes more accurate for subsequent searches.
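
The loop can be condensed into the following sketch (a simplified paraphrase of EvolutionarySearch(); the selection, mutation, and heap logic are reduced to their essentials):

import heapq
import random

def evolutionary_search(init_population, cost_model, mutation_rules,
                        num_iters=5, population_size=2048, num_best=128):
    """Simplified paraphrase of SketchPolicyNode::EvolutionarySearch()."""
    population = init_population
    heap = []  # min-heap of (score, tiebreak, state); keeps the num_best highest scores
    for it in range(num_iters + 1):
        scores = cost_model.predict(population)   # predicted performance scores
        for state, score in zip(population, scores):
            heapq.heappush(heap, (score, id(state), state))
            if len(heap) > num_best:
                heapq.heappop(heap)               # evict the worst-scored state
        if it == num_iters:
            break
        # Higher score => higher probability of being chosen as a mutation parent.
        total = sum(scores)
        weights = [s / total for s in scores]
        population = [
            random.choice(mutation_rules).apply(
                random.choices(population, weights=weights)[0])
            for _ in range(population_size)
        ]
    # Best predicted states first; these go on to real measurement.
    return [state for _, _, state in sorted(heap, reverse=True)]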

As described above, during candidate selection the evolutionary algorithm uses the cost model to predict performance. Specifically, the cost model predicts a score for each innermost non-loop statement (see page 18 of the official slides for a diagram). It extracts the features of each innermost non-loop statement (the number of floating-point and integer operations, vectorization, loop unrolling, parallelization, GPU-thread-binding-related features, arithmetic intensity, buffer access features, allocation features, and others) and encodes them into a fixed-length feature vector. The model is trained on the performance measurement data. The default model is XGBModel, and the corresponding code is in xgb_model.py. Because XGBoost does not support incremental training, it retrains with all accumulated data each time. Its update() function is used for training and its predict() function for prediction. The feature extraction part is implemented in the feature.cc file, mainly in the GetPerStoreFeaturesFromStates() function.
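
The retrain-from-scratch behavior can be sketched with plain XGBoost as follows (a simplified stand-in for XGBModel; the real model uses a custom ranking loss and the feature extraction from feature.cc, both abstracted away here):

import numpy as np
import xgboost as xgb

class TinyCostModel:
    """Simplified stand-in for XGBModel: accumulates all measured data and
    retrains from scratch on every update (no incremental training)."""
    def __init__(self):
        self.feats, self.labels = [], []  # (feature vector, throughput) history
        self.bst = None

    def update(self, features, throughputs):
        self.feats.extend(features)
        self.labels.extend(throughputs)
        dtrain = xgb.DMatrix(np.array(self.feats), label=np.array(self.labels))
        self.bst = xgb.train({"max_depth": 6, "eta": 0.2}, dtrain, num_boost_round=50)

    def predict(self, features):
        if self.bst is None:  # no data yet: return random scores
            return np.random.uniform(0, 1, len(features))
        return self.bst.predict(xgb.DMatrix(np.array(features)))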

Apply the Result

When the tuning process completes, the search history is recorded in the log file. The next step is to apply the best configuration from the history to generate a schedule.

# Apply the best schedule
sch, args = task.apply_best(log_file)

Then lower and build are called to generate the target code. There is nothing special here; it is the same as the standard TE workflow:

print(tvm.lower(sch, args, simple_mode=True))
func = tvm.build(sch, args, target)
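
For the whole-network case, the records in the log file are applied through ApplyHistoryBest when building the model with Relay (this pattern is from the official tutorials; mod, params, target, and log_file are the objects defined earlier):

from tvm import relay

# Apply the best schedules found during tuning while building the network.
with auto_scheduler.ApplyHistoryBest(log_file):
    with tvm.transform.PassContext(
        opt_level=3, config={"relay.backend.use_auto_scheduler": True}
    ):
        lib = relay.build(mod, target=target, params=params)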

Epilogue

Compared with the first-generation auto-tuning mechanism in TVM, Ansor has several advantages:

  • Better usability: a large amount of expert experience is encoded as rules in the code, so developers no longer need to write schedule templates, making the process more automated and lowering the barrier to entry.
  • Better schedules: the hierarchical search design (sketch + annotation) enlarges the search space, making it possible to find schedules with better performance.
  • More efficient search: the evolutionary algorithm (AutoTVM uses simulated annealing) combined with the machine-learning-based cost model makes the search more efficient. In addition, the task scheduler spends more time on the subgraph tasks with the greatest potential performance gain (AutoTVM has no such component). With these techniques, Ansor's search time is greatly reduced compared with AutoTVM; still, given the huge space to explore, further reducing search time remains a direction worth researching.
