[Original] Linux Process Scheduler-CFS Scheduler

Background

Read the fucking source code! -- By Lu Xun
A picture is worth a thousand words. -- By Gorky

Notes:

Kernel version: 4.14
ARM64 processor, Cortex-A53, dual core
Tools used: Source Insight 3.5, Visio

1. Overview

The Completely Fair Scheduler (CFS) is used to schedule ordinary processes in Linux systems.
CFS uses a red-black tree to manage all scheduling entities (sched_entity), so operations on the tree take O(log n) time. CFS tracks the virtual run time (vruntime) of each scheduling entity, treats all scheduling entities on the run queue equally, and places the entities with less virtual run time toward the left of the red-black tree.
Scheduling entities are inserted into and removed from the red-black tree through enqueue_entity() and dequeue_entity().

As usual, let's start with a picture to illustrate the principle:

During each sched_latency cycle, the runtime of each task is calculated from its weight.
The runtime is then converted to virtual run time.
Scheduling entities are inserted into the CFS red-black tree ordered by virtual run time, so the entity with less virtual run time is placed further to the left.
The next time a task is scheduled, the scheduling entity with the least virtual run time is selected to run.
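The "smallest vruntime runs next" rule can be sketched in a few lines. This is only an illustration: toy_entity, toy_entity_before and toy_pick_leftmost are made-up names, and a linear scan over an array stands in for the red-black tree that CFS really uses. The signed subtraction is the usual trick to keep the ordering correct when the 64-bit counter wraps around.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, stripped-down entity: only the field this sketch needs. */
struct toy_entity {
    uint64_t vruntime;   /* virtual run time in nanoseconds */
};

/*
 * Returns nonzero when a should sort to the left of b.
 * Comparing the signed difference (instead of a < b) keeps the
 * ordering meaningful even after vruntime wraps around.
 */
int toy_entity_before(const struct toy_entity *a, const struct toy_entity *b)
{
    return (int64_t)(a->vruntime - b->vruntime) < 0;
}

/* Linear scan standing in for "take the leftmost node of the tree". */
const struct toy_entity *toy_pick_leftmost(const struct toy_entity *v, size_t n)
{
    const struct toy_entity *best = NULL;

    for (size_t i = 0; i < n; i++)
        if (best == NULL || toy_entity_before(&v[i], best))
            best = &v[i];
    return best;
}
```

With three entities whose vruntime values are 300, 100 and 200, toy_pick_leftmost returns the one holding 100.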

Before starting this article, it is recommended that you read (1) Linux Process Scheduler-Base.

Start exploring!

2. Data structure

2.1 Scheduling Class

The Linux kernel abstracts a scheduling class, struct sched_class, which is a typical object-oriented design. The common features are abstracted into a class, and each scheduler implements them according to its own scheduling algorithm when it is instantiated. This approach achieves high cohesion and low coupling while making it easy to add new schedulers.

The core scheduling code in kernel/sched/core.c makes calls of the form task->sched_class->xxx_func, where task is the struct task_struct describing the task. This structure records the scheduling class used by the task, so the corresponding function pointer can be found and called, somewhat like the polymorphism mechanism in C++.
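The function-pointer dispatch described above can be sketched as follows. All toy_ names are invented for this illustration, and the struct is reduced to a single hook; the real struct sched_class has many more members.

```c
struct toy_task;

/*
 * Hypothetical, reduced counterpart of struct sched_class:
 * a table of function pointers that each scheduler fills in.
 */
struct toy_sched_class {
    const char *name;
    int (*task_tick)(struct toy_task *p);
};

struct toy_task {
    const struct toy_sched_class *sched_class; /* scheduler used by this task */
    int ticks;
};

/* Two "schedulers" implementing the same hook differently. */
int toy_tick_fair(struct toy_task *p) { p->ticks += 1;  return p->ticks; }
int toy_tick_rt(struct toy_task *p)   { p->ticks += 10; return p->ticks; }

const struct toy_sched_class toy_fair_class = { "fair", toy_tick_fair };
const struct toy_sched_class toy_rt_class   = { "rt",   toy_tick_rt };

/*
 * Core code only knows the interface, in the spirit of
 * task->sched_class->xxx_func in kernel/sched/core.c.
 */
int toy_core_tick(struct toy_task *p)
{
    return p->sched_class->task_tick(p);
}
```

The caller never checks which scheduler a task uses; swapping the sched_class pointer is enough to change the behavior.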

2.2 rq/cfs_rq/task_struct/task_group/sched_entity

struct rq: The run queue; each CPU has one;
struct cfs_rq: The CFS run queue, which contains a struct rb_root_cached red-black tree that links the scheduling entities (struct sched_entity). Each rq run queue has one CFS run queue, and each task_group structure additionally maintains one CFS run queue per CPU;
struct task_struct: The task descriptor, which holds all the information about a process. It embeds a struct sched_entity, which is what takes part in CFS scheduling;
struct task_group: Group scheduling (see the previous article). Linux supports grouping tasks to manage the allocation of CPU resources. For each CPU in the system, a task group is given a struct sched_entity scheduling entity and a struct cfs_rq run queue, where the struct sched_entity takes part in CFS scheduling;
struct sched_entity: The scheduling entity, which is the object that CFS actually manages and schedules;
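Because the scheduling entity is embedded in the task descriptor rather than pointed to, the scheduler can get from an entity back to its task with pointer arithmetic. The sketch below shows that idea with made-up toy_ types; toy_container_of mimics the kernel's container_of macro, and toy_task_of is only an approximation of the kernel helper of a similar name.

```c
#include <stddef.h>

/* Hypothetical, stripped-down versions of the two structures. */
struct toy_sched_entity {
    unsigned long long vruntime;
};

struct toy_task_struct {
    int pid;
    struct toy_sched_entity se;   /* embedded, not a pointer */
};

/* Recover the address of the enclosing structure from a member address. */
#define toy_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Map a scheduling entity back to the task that embeds it. */
struct toy_task_struct *toy_task_of(struct toy_sched_entity *se)
{
    return toy_container_of(se, struct toy_task_struct, se);
}
```

This only works for entities that really live inside a task; an entity that represents a task group has no enclosing task.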

Take a look at the organizational relationship between them:

The fields of struct sched_entity are annotated as follows:

struct sched_entity {
    /* For load-balancing: */
    struct load_weight load;        // Load weight of the scheduling entity
    struct rb_node run_node;        // Node that links the entity into the red-black tree of the CFS run queue
    struct list_head group_node;    // Node that links the entity into the cfs_tasks list of the run queue
    unsigned int on_rq;             // Whether the entity is currently on a run queue

    u64 exec_start;                 // Time at which the entity's current run started
    u64 sum_exec_runtime;           // Total time the entity has executed
    u64 vruntime;                   // Virtual run time, used as the ordering key in the CFS run queue
    u64 prev_sum_exec_runtime;      // Value of sum_exec_runtime when the entity was last picked to run

    u64 nr_migrations;              // Migration count, related to load balancing

    struct sched_statistics statistics; // Statistics

#ifdef CONFIG_FAIR_GROUP_SCHED
    int depth;                      // Depth in the task group hierarchy; the root task group has depth 0, increasing by one per level
    struct sched_entity *parent;    // Parent of this scheduling entity
    /* rq on which this entity is (to be) queued: */
    struct cfs_rq *cfs_rq;          // CFS run queue the entity belongs to, that is, the queue it is enqueued on
    /* rq "owned" by this entity/group: */
    struct cfs_rq *my_q;            // CFS run queue owned by this entity, holding its child tasks or child task groups
#endif

#ifdef CONFIG_SMP
    /*
     * Per entity load average tracking.
     *
     * Put into separate cache line so it does not
     * collide with read-mostly values above.
     */
    struct sched_avg avg ____cacheline_aligned_in_smp; // Load tracking for the scheduling entity (PELT)
#endif
};

The key fields of struct cfs_rq are annotated as follows:

/* CFS-related fields in a runqueue */
struct cfs_rq {
    struct load_weight load;                // Load weight of the CFS run queue
    unsigned int nr_running, h_nr_running;  // nr_running: number of runnable scheduling entities (those taking part in time slice calculation)

    u64 exec_clock;                         // Run time
    u64 min_vruntime;                       // Minimum virtual run time; added to or subtracted from an entity's vruntime when the entity is enqueued or dequeued
#ifndef CONFIG_64BIT
    u64 min_vruntime_copy;
#endif

    struct rb_root_cached tasks_timeline;   // Red-black tree that stores the scheduling entities

    /*
     * 'curr' points to currently running entity on this cfs_rq.
     * It is set to NULL otherwise (i.e when none are currently running).
     */
    struct sched_entity *curr, *next, *last, *skip; // The currently running entity, the next entity, the last entity in the CFS run queue, and the entity to skip

#ifdef CONFIG_SCHED_DEBUG
    unsigned int nr_spread_over;
#endif

#ifdef CONFIG_SMP
    /*
     * CFS load tracking
     */
    struct sched_avg avg;                   // Load tracking
    u64 runnable_load_sum;
    unsigned long runnable_load_avg;        // Average runnable load, based on PELT
#ifdef CONFIG_FAIR_GROUP_SCHED
    unsigned long tg_load_avg_contrib;      // Load contribution of the task group
    unsigned long propagate_avg;
#endif
    atomic_long_t removed_load_avg, removed_util_avg;
#ifndef CONFIG_64BIT
    u64 load_last_update_time_copy;
#endif

#ifdef CONFIG_FAIR_GROUP_SCHED
    /*
     *   h_load = weight * f(tg)
     *
     * Where f(tg) is the recursive weight fraction assigned to
     * this group.
     */
    unsigned long h_load;
    u64 last_h_load_update;
    struct sched_entity *h_load_next;
#endif /* CONFIG_FAIR_GROUP_SCHED */
#endif /* CONFIG_SMP */

#ifdef CONFIG_FAIR_GROUP_SCHED
    struct rq *rq; /* cpu runqueue to which this cfs_rq is attached */ // The CPU run queue (rq) this CFS run queue belongs to

    /*
     * leaf cfs_rqs are those that hold tasks (lowest schedulable entity in
     * a hierarchy). Non-leaf lrqs hold other higher schedulable entities
     * (like users, containers etc.)
     *
     * leaf_cfs_rq_list ties together list of leaf cfs_rq's in a cpu. This
     * list is used during load balance.
     */
    int on_list;
    struct list_head leaf_cfs_rq_list;
    struct task_group *tg; /* group that "owns" this runqueue */ // Task group this CFS run queue belongs to

#ifdef CONFIG_CFS_BANDWIDTH
    int runtime_enabled;                    // Whether CFS bandwidth control is used on this CFS run queue
    u64 runtime_expires;                    // Expiry time of the runtime
    s64 runtime_remaining;                  // Remaining runtime

    u64 throttled_clock, throttled_clock_task; // Throttling timestamps
    u64 throttled_clock_task_time;
    int throttled, throttle_count;          // throttled: whether the CFS run queue is throttled; throttle_count: number of times the CFS run queue has been throttled
    struct list_head throttled_list;        // List node used to add this run queue to the throttled_cfs_rq list in the cfs_bandwidth structure
#endif /* CONFIG_CFS_BANDWIDTH */
#endif /* CONFIG_FAIR_GROUP_SCHED */
};
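The note on min_vruntime says that it is added or subtracted when entities are enqueued or dequeued. One way to picture this: each CFS run queue has its own vruntime "floor", so an entity that leaves one queue keeps only its offset above the old floor and is rebased onto the new floor when it joins another queue. The sketch below assumes exactly that simplified model; the toy_ names are invented, and the conditions under which the kernel actually applies the adjustment are left out.

```c
#include <stdint.h>

/* Hypothetical, reduced structures holding only what the sketch needs. */
struct toy_cfs_rq {
    uint64_t min_vruntime;   /* vruntime floor of this queue */
};

struct toy_entity {
    uint64_t vruntime;
};

/* Leaving a queue: keep only the offset relative to that queue's floor. */
void toy_dequeue_normalize(const struct toy_cfs_rq *rq, struct toy_entity *se)
{
    se->vruntime -= rq->min_vruntime;
}

/* Joining a queue: rebase the stored offset onto the new queue's floor. */
void toy_enqueue_renormalize(const struct toy_cfs_rq *rq, struct toy_entity *se)
{
    se->vruntime += rq->min_vruntime;
}
```

For example, an entity with vruntime 1200 on a queue whose min_vruntime is 1000 carries an offset of 200; after moving to a queue whose min_vruntime is 50000 it ends up at 50200, so it is neither punished nor favored by the move.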

3. Process analysis

The process analysis below is organized around the key functions of fair_sched_class, the CFS instance of the scheduling class.

Let's first see what functions fair_sched_class contains:

/*
 * All the scheduling class methods:
 */
const struct sched_class fair_sched_class = {
    .next               = &idle_sched_class,
    .enqueue_task       = enqueue_task_fair,
    .dequeue_task       = dequeue_task_fair,
    .yield_task         = yield_task_fair,
    .yield_to_task      = yield_to_task_fair,

    .check_preempt_curr = check_preempt_wakeup,

    .pick_next_task     = pick_next_task_fair,
    .put_prev_task      = put_prev_task_fair,

#ifdef CONFIG_SMP
    .select_task_rq     = select_task_rq_fair,
    .migrate_task_rq    = migrate_task_rq_fair,

    .rq_online          = rq_online_fair,
    .rq_offline         = rq_offline_fair,

    .task_dead          = task_dead_fair,
    .set_cpus_allowed   = set_cpus_allowed_common,
#endif

    .set_curr_task      = set_curr_task_fair,
    .task_tick          = task_tick_fair,
    .task_fork          = task_fork_fair,

    .prio_changed       = prio_changed_fair,
    .switched_from      = switched_from_fair,
    .switched_to        = switched_to_fair,

    .get_rr_interval    = get_rr_interval_fair,

    .update_curr        = update_curr_fair,

#ifdef CONFIG_FAIR_GROUP_SCHED
    .task_change_group  = task_change_group_fair,
#endif
};
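The .next member chains the scheduling classes together in priority order, which lets the core code try each class in turn until one of them has something to run. A minimal sketch of such a walk, with invented toy_ names and an nr_runnable counter standing in for "this class can supply a task" (an assumption made only for this example):

```c
#include <stddef.h>

/* Hypothetical, reduced scheduling class: just the chain and a counter. */
struct toy_class {
    const struct toy_class *next;   /* next lower-priority class */
    int nr_runnable;                /* tasks this class could run right now */
    const char *name;
};

/* Walk from the highest-priority class down and stop at the first one
 * that has a runnable task. Returns NULL when nothing is runnable. */
const struct toy_class *toy_pick_class(const struct toy_class *highest)
{
    for (const struct toy_class *c = highest; c != NULL; c = c->next)
        if (c->nr_runnable > 0)
            return c;
    return NULL;
}
```

In a chain rt -> fair -> idle, the fair class is chosen only when the rt class has nothing runnable, and idle is the fallback when fair has nothing either.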

3.1 runtime and vruntime

The CFS scheduler no longer has the traditional concept of a fixed time slice. Instead, it tracks the actual and virtual run times of tasks and schedules them in order of virtual run time.
So how are the runtime and the virtual run time calculated? Take a look at the call flow:

The default sysctl_sched_latency in the Linux kernel is 6ms, and it can be configured from user space. Within one sched_period, every runnable task is guaranteed to run at least once.
When there are more than eight runnable tasks, sched_period is instead calculated as the number of tasks multiplied by the minimum scheduling granularity, which is 0.75ms by default.
The runtime of each task is calculated by multiplying sched_period by the ratio of the task's weight to the total weight of the CFS run queue.
Virtual run time = actual run time * NICE_0_LOAD / weight of the task;
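The three rules above translate almost directly into arithmetic. The sketch below uses the defaults quoted in the text (6ms, 0.75ms, eight tasks) and plain 64-bit integer division; the toy_ function names are made up, and the kernel itself uses precomputed inverse weights and shifts instead of dividing, so treat this as an approximation of the idea rather than the actual implementation.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_SCHED_LATENCY_NS    6000000ULL  /* 6 ms, from the text */
#define TOY_MIN_GRANULARITY_NS   750000ULL  /* 0.75 ms, from the text */
#define TOY_NR_LATENCY                 8U   /* 6 ms / 0.75 ms */
#define TOY_NICE_0_LOAD             1024ULL /* weight of a nice 0 task */

/* Length of one scheduling period for nr_running runnable tasks. */
uint64_t toy_sched_period(unsigned int nr_running)
{
    if (nr_running > TOY_NR_LATENCY)
        return (uint64_t)nr_running * TOY_MIN_GRANULARITY_NS;
    return TOY_SCHED_LATENCY_NS;
}

/* Share of the period a task gets: period * weight / total weight. */
uint64_t toy_sched_slice(uint64_t period, uint64_t weight, uint64_t total_weight)
{
    assert(total_weight != 0);
    return period * weight / total_weight;
}

/* Virtual run time = actual run time * NICE_0_LOAD / weight. */
uint64_t toy_calc_delta_fair(uint64_t delta_exec, uint64_t weight)
{
    assert(weight != 0);
    return delta_exec * TOY_NICE_0_LOAD / weight;
}
```

With ten runnable tasks the period stretches to 7.5ms. A task with twice the nice 0 weight accumulates virtual run time at half the rate, which is exactly why it ends up running about twice as long.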

Let's take an example of five tasks, each with a different nice value (that is, a different priority). The kernel provides a lookup array that maps nice values to weights:

const int sched_prio_to_weight[40] = {
 /* -20 */     88761,     71755,     56483,     46273,     36291,
 /* -15 */     29154,     23254,     18705,     14949,     11916,
 /* -10 */      9548,      7620,      6100,      4904,      3906,
 /*  -5 */      3121,      2501,      1991,      1586,      1277,
 /*   0 */      1024,       820,       655,       526,       423,
 /*   5 */       335,       272,       215,       172,       137,
 /*  10 */       110,        87,        70,        56,        45,
 /*  15 */        36,        29,        23,        18,        15,
};
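To make the five-task example concrete, here is a small sketch that splits one period among tasks according to this table. The nice values used in the usage note (-2, -1, 0, 1, 2) are chosen only for illustration and are not necessarily the ones in the original picture; the toy_ names are likewise invented.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copy of the kernel's nice-to-weight table shown above. */
const int toy_prio_to_weight[40] = {
    88761, 71755, 56483, 46273, 36291,
    29154, 23254, 18705, 14949, 11916,
     9548,  7620,  6100,  4904,  3906,
     3121,  2501,  1991,  1586,  1277,
     1024,   820,   655,   526,   423,
      335,   272,   215,   172,   137,
      110,    87,    70,    56,    45,
       36,    29,    23,    18,    15,
};

/* nice ranges from -20 to 19; index 0 of the table is nice -20. */
int toy_nice_to_weight(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return toy_prio_to_weight[nice + 20];
}

/*
 * Fill slice_ns[i] with task i's share of period_ns and
 * return the total weight of all n tasks.
 */
uint64_t toy_split_period(const int *nice, size_t n,
                          uint64_t period_ns, uint64_t *slice_ns)
{
    uint64_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += (uint64_t)toy_nice_to_weight(nice[i]);
    for (size_t i = 0; i < n; i++)
        slice_ns[i] = period_ns * (uint64_t)toy_nice_to_weight(nice[i]) / total;
    return total;
}
```

For nice values -2, -1, 0, 1 and 2 the total weight is 1586 + 1277 + 1024 + 820 + 655 = 5362, so with a 6ms period the nice 0 task gets 6000000 * 1024 / 5362, roughly 1.15ms, and the lower the nice value, the larger the share.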

Here's the picture:

3.2 CFS Scheduler Tick

The tick function in the CFS scheduler is task_tick_fair, which is called on every scheduler tick in the system and also from the hrtimer path.
The process is as follows:

The main work includes:

Update the runtime statistics, such as vruntime, runtime, load values and weight values.
Check whether preemption is required, mainly by checking whether the current entity has used up its runtime for this period, and whether its vruntime exceeds the vruntime of the leftmost entity in the red-black tree by more than that runtime.
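The preemption check can be sketched as a pure function of four numbers. This follows the general shape of the kernel's tick-time check as I understand it, but the function name, the parameter list and the hard-coded 0.75ms granularity are simplifications made for this example.

```c
#include <stdint.h>

#define TOY_MIN_GRANULARITY_NS 750000ULL   /* 0.75 ms, from the text */

/*
 * Returns 1 when the current entity should be rescheduled, else 0.
 *   ideal_runtime     - the entity's share of the current period
 *   delta_exec        - how long it has run since it was last picked
 *   curr_vruntime     - vruntime of the current entity
 *   leftmost_vruntime - vruntime of the leftmost entity in the tree
 */
int toy_check_preempt_tick(uint64_t ideal_runtime, uint64_t delta_exec,
                           uint64_t curr_vruntime, uint64_t leftmost_vruntime)
{
    int64_t delta;

    if (delta_exec > ideal_runtime)
        return 1;                   /* share of the period is used up */

    if (delta_exec < TOY_MIN_GRANULARITY_NS)
        return 0;                   /* ran too briefly to be preempted */

    delta = (int64_t)(curr_vruntime - leftmost_vruntime);
    if (delta < 0)
        return 0;                   /* current entity is still the most deserving */

    return (uint64_t)delta > ideal_runtime;
}
```

The second test protects against over-scheduling: even if a waiting entity is far behind, the current one is allowed to run for at least the minimum granularity.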

Let's take a look at the updates to the update_curr function:
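In rough terms, update_curr charges the time that has passed since the last update to the currently running entity. A minimal sketch, assuming a made-up toy_entity that carries its own weight and using plain division for the weight scaling (the kernel also updates min_vruntime and various statistics here, which are left out):

```c
#include <stdint.h>

#define TOY_NICE_0_LOAD 1024ULL   /* weight of a nice 0 task */

/* Hypothetical entity holding only the fields touched by this sketch. */
struct toy_entity {
    uint64_t weight;             /* load weight of the entity */
    uint64_t exec_start;         /* time of the last update */
    uint64_t sum_exec_runtime;   /* total actual run time */
    uint64_t vruntime;           /* total weighted (virtual) run time */
};

/* Charge the time elapsed since the last update to curr. */
void toy_update_curr(struct toy_entity *curr, uint64_t now)
{
    uint64_t delta_exec = now - curr->exec_start;

    if ((int64_t)delta_exec <= 0)
        return;                              /* no time has passed */

    curr->exec_start = now;
    curr->sum_exec_runtime += delta_exec;    /* actual time */
    curr->vruntime += delta_exec * TOY_NICE_0_LOAD / curr->weight; /* virtual time */
}
```

An entity with weight 2048 that runs for 4000ns gains 4000ns of sum_exec_runtime but only 2000ns of vruntime.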

3.3 Task Enqueue and Dequeue

When a task enters the runnable state, its scheduling entity needs to be placed in the red-black tree, which is the enqueue operation.
When a task leaves the runnable state, its scheduling entity needs to be removed from the red-black tree, which is the dequeue operation.
The CFS scheduler uses the enqueue_task_fair function to add tasks to the CFS queue and the dequeue_task_fair function to remove tasks from the CFS queue.

In both enqueuing and dequeuing, the core logic can be divided into two parts: 1) updating the runtime data, such as load, weight and the group scheduling share; 2) inserting the sched_entity into the red-black tree or removing it from the tree;
dequeue_task_fair follows broadly the same logic as enqueue_task_fair, so it is not analyzed separately.
This process involves CPU load calculation, task_group scheduling, CFS bandwidth control and so on, which were analyzed in the previous articles and can be read together with this one.
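With group scheduling, enqueuing a task may also mean enqueuing the group entities above it. The sketch below shows only that walk up the hierarchy, under heavy simplification: the toy_ names are invented, counters replace the red-black tree insertion, and load updates and bandwidth throttling are ignored.

```c
#include <stddef.h>

/* Hypothetical, reduced CFS run queue: two counters only. */
struct toy_cfs_rq {
    unsigned int nr_running;     /* entities queued directly on this queue */
    unsigned int h_nr_running;   /* tasks queued here or anywhere below */
};

struct toy_entity {
    struct toy_entity *parent;   /* group entity above; NULL at the top */
    struct toy_cfs_rq *cfs_rq;   /* queue this entity is queued on */
    int on_rq;                   /* already queued? */
};

void toy_enqueue_task(struct toy_entity *se)
{
    /* Phase 1: queue every level that is not queued yet. */
    for (; se != NULL; se = se->parent) {
        if (se->on_rq)
            break;               /* this group is already in its parent's queue */
        se->on_rq = 1;           /* stands in for the red-black tree insertion */
        se->cfs_rq->nr_running++;
        se->cfs_rq->h_nr_running++;
    }

    /* Phase 2: levels that were already queued just count one more task. */
    for (; se != NULL; se = se->parent)
        se->cfs_rq->h_nr_running++;
}
```

Enqueuing the first task of a group also queues the group entity on the parent queue; enqueuing a second task of the same group only bumps the hierarchical counter on the parent.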

3.4 Task Creation

The task_fork_fair function is called when a parent process creates a child process through fork, and its parameter is the task_struct of the child process. The main job of this function is to determine the vruntime of the child task, which in turn determines where the child's scheduling entity will sit in the red-black tree.

task_fork_fair itself is relatively simple, as shown in the following diagram:
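In code form, the steps can be sketched roughly as follows. This is my reading of the 4.14 logic reduced to plain numbers: the toy_ names are invented, vslice_ns is a hypothetical precomputed virtual slice for the child, and child_runs_first stands in for the corresponding sysctl. Locking, update_curr and the reschedule request are omitted.

```c
#include <stdint.h>

/* Hypothetical entity with only the field this sketch needs. */
struct toy_entity {
    uint64_t vruntime;
};

/* Wraparound-safe maximum of two vruntime values. */
uint64_t toy_max_vruntime(uint64_t a, uint64_t b)
{
    return (int64_t)(b - a) > 0 ? b : a;
}

void toy_task_fork(struct toy_entity *parent, struct toy_entity *child,
                   uint64_t min_vruntime, uint64_t vslice_ns,
                   int child_runs_first)
{
    /* 1. The child starts from the parent's vruntime. */
    child->vruntime = parent->vruntime;

    /* 2. A new task is charged one virtual slice up front, so forking
     *    cannot be used to jump ahead of tasks that are already waiting. */
    child->vruntime = toy_max_vruntime(child->vruntime,
                                       min_vruntime + vslice_ns);

    /* 3. Optionally swap so that the child sorts before the parent. */
    if (child_runs_first &&
        (int64_t)(parent->vruntime - child->vruntime) < 0) {
        uint64_t tmp = parent->vruntime;

        parent->vruntime = child->vruntime;
        child->vruntime = tmp;
    }

    /* 4. Store the value relative to the queue floor; it is made
     *    absolute again when the child is actually enqueued. */
    child->vruntime -= min_vruntime;
}
```

With a parent at 10000, a queue floor of 9000 and a virtual slice of 3000, the child is placed at 12000 and stored as the relative value 3000.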

3.5 Task Selection

Whenever a task switch happens, that is, whenever the schedule function executes, the scheduler needs to select the next task to run.
In the CFS scheduler, this is done by the pick_next_task_fair function, which works as follows:

When a task switch is required, the parameters passed to pick_next_task_fair include the task to be switched out, prev;
When prev is not a normal process, that is, its scheduling class is not CFS, it does not take part in scheduling through a sched_entity, so the simple branch is executed: the system is notified that the current task is being switched out through the put_prev_task function rather than through the put_prev_entity function;
When prev is a normal process, pick_next_entity is called to select the next task to run. There are two cases: when the selected scheduling entity corresponds to a task, the do while() loop runs only once; when it corresponds to a task_group, the loop keeps descending into the task group until it selects a task;
put_prev_entity does the preparation before switching away from a task: it updates the runtime statistics without performing a dequeue, and sets the curr pointer of the CFS queue to NULL;
set_next_entity sets the next scheduling entity to run and points the curr pointer of the CFS queue at it;
If the hrtimer is enabled, its expiration time is set to the remaining run time of the scheduling entity;
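The do while() descent through task groups can be sketched like this. Everything here is simplified for illustration: the toy_ names are invented, each queue is a plain array scanned linearly instead of a red-black tree, and the next/last/skip hints are ignored.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_cfs_rq;

struct toy_entity {
    uint64_t vruntime;
    struct toy_cfs_rq *my_q;     /* non-NULL only for a task group entity */
};

struct toy_cfs_rq {
    struct toy_entity **entities;  /* entities queued on this run queue */
    size_t nr;
};

/* Pick the entity with the smallest vruntime on one queue. */
struct toy_entity *toy_pick_next_entity(const struct toy_cfs_rq *q)
{
    struct toy_entity *best = NULL;

    for (size_t i = 0; i < q->nr; i++) {
        struct toy_entity *se = q->entities[i];

        if (best == NULL || (int64_t)(se->vruntime - best->vruntime) < 0)
            best = se;
    }
    return best;
}

/* Descend from the top-level queue until a task (not a group) is found. */
struct toy_entity *toy_pick_next_task(struct toy_cfs_rq *q)
{
    struct toy_entity *se;

    do {
        se = toy_pick_next_entity(q);
        if (se == NULL)
            return NULL;         /* empty queue: nothing to run */
        q = se->my_q;            /* NULL for a task, so the loop ends */
    } while (q != NULL);

    return se;
}
```

If the top-level queue holds a group entity with vruntime 200 and a task with vruntime 300, the group wins at the first level and the loop then picks the smallest vruntime inside the group's own queue.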

That's all for now. The CFS scheduler covers a lot of ground: the fair.c file alone has nearly 10,000 lines of code. The analysis of related topics is spread across the previous articles, so feel free to read whichever parts interest you.

That's a wrap. Time to wash up and go to bed.