Implementation principles of BPF TCP congestion control algorithms


1. Preface

eBPF keeps gaining momentum. Since version 5.6, the Linux kernel has allowed eBPF programs to implement TCP congestion control algorithms: a program loaded from user space supplies the function pointers of the kernel's congestion control operations structure. Version 5.13 extended this by letting programs of this type call selected kernel functions directly, so an eBPF program no longer has to reimplement the TCP congestion control helpers that already exist in the kernel.

Together, these two features open a path for Linux to evolve from a monolithic kernel toward a smart microkernel. Although for now they cover only TCP congestion control, they leave a great deal of room for imagination, because many facilities in the Linux kernel are built on structures of function pointers. Once an eBPF program written by a user can redirect the functions in such a kernel structure, the kernel can be extended and enhanced flexibly. Combined with the ability to call kernel functions directly, this amounts to giving ordinary users a way to customize the kernel. It is a small step for eBPF, but it may later be seen as a big step for the kernel ecosystem.

This article focuses on the STRUCT_OPS capability introduced in version 5.6 for customizing TCP congestion control algorithms. The ability of this program type to call Linux kernel functions will be described in detail in the next article.

2. eBPF enabled TCP congestion control algorithm

To make it possible to change the TCP congestion control algorithm from an eBPF program, Martin KaFai Lau, an engineer at Facebook, submitted a series of 11 small patches on January 8, 2020. The implementation adds a new map type, BPF_MAP_TYPE_STRUCT_OPS, and a new program type, BPF_PROG_TYPE_STRUCT_OPS. At the current stage, the only kernel structure they support is tcp_congestion_ops, the TCP congestion control operations structure.

Figure 1: Related structures and code fragments of the overall implementation

Let's start with how the sample program is used (for the complete code, see here). Parts that are irrelevant to this discussion are omitted:

SEC("struct_ops/dctcp_init")
void BPF_PROG(dctcp_init, struct sock *sk)
{
	const struct tcp_sock *tp = tcp_sk(sk);
	struct dctcp *ca = inet_csk_ca(sk);

	ca->prior_rcv_nxt = tp->rcv_nxt;
	ca->dctcp_alpha = min(dctcp_alpha_on_init, DCTCP_MAX_ALPHA);
	ca->loss_cwnd = 0;
	ca->ce_state = 0;

	dctcp_reset(tp, ca);
}

SEC("struct_ops/dctcp_ssthresh")
__u32 BPF_PROG(dctcp_ssthresh, struct sock *sk)
{
	struct dctcp *ca = inet_csk_ca(sk);
	struct tcp_sock *tp = tcp_sk(sk);

	ca->loss_cwnd = tp->snd_cwnd;
	return max(tp->snd_cwnd - ((tp->snd_cwnd * ca->dctcp_alpha) >> 11U), 2U);
}

// ....

SEC(".struct_ops")
struct tcp_congestion_ops dctcp_nouse = {
	.init		= (void *)dctcp_init,
	.set_state	= (void *)dctcp_state,
	.flags		= TCP_CONG_NEEDS_ECN,
	.name		= "bpf_dctcp_nouse",
};

/* The structure defined in the BPF program need not be identical to the
 * one used in the kernel; it may contain only the fields that are needed. */
SEC(".struct_ops")
struct tcp_congestion_ops dctcp = {
	.init		= (void *)dctcp_init,
	.in_ack_event   = (void *)dctcp_update_alpha,
	.cwnd_event	= (void *)dctcp_cwnd_event,
	.ssthresh	= (void *)dctcp_ssthresh,
	.cong_avoid	= (void *)tcp_reno_cong_avoid,
	.undo_cwnd	= (void *)dctcp_cwnd_undo,
	.set_state	= (void *)dctcp_state,
	.flags		= TCP_CONG_NEEDS_ECN,
	.name		= "bpf_dctcp",
};

Two points are worth noting here:

  1. The tcp_congestion_ops structure used here is not the one from the kernel header files. It contains only the fields of the kernel structure that the TCP CC algorithm uses, so it is a subset of the kernel structure of the same name.
  2. Some structures (such as tcp_sock) carry the preserve_access_index attribute. It indicates that when the eBPF bytecode is loaded, accesses to the fields of the structure are relocated to the offsets of the same-named fields in the running kernel version.

Note that the tcp_congestion_ops structure defined in the BPF program (referred to as the bpf_prog BTF type) may be identical to the structure defined in the kernel (the btf_vmlinux BTF type), or it may contain only the necessary fields of the kernel structure. The members may appear in a different order than in the kernel, but each member's name and type, or its function declaration (parameters and return value), must match. A translation from the bpf_prog BTF type to the btf_vmlinux BTF type is therefore required, and BTF is what makes this translation possible. At present, members are looked up and matched by member name, BTF kind, and size; if a member cannot be matched, libbpf returns an error. The whole process resembles the reflection mechanism for types in Go and is implemented mainly in the function bpf_map__init_kern_struct_ops (see section 4 for details).
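
The name-and-size matching rule can be illustrated with a small user-space sketch. It is a toy model rather than libbpf's code: the member_desc type, the kern_members table, and the find_member and match_members functions are hypothetical names invented here, and real libbpf walks BTF type information instead of hand-written tables.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical description of one structure member (stands in for BTF). */
struct member_desc {
	const char *name;
	size_t size;
};

/* Kernel-side layout: every member of the (toy) kernel structure. */
static const struct member_desc kern_members[] = {
	{ "init", sizeof(void *) },
	{ "ssthresh", sizeof(void *) },
	{ "cong_avoid", sizeof(void *) },
	{ "flags", sizeof(unsigned int) },
};

/* Look up a member by name; returns its index, or -1 when it is absent. */
static int find_member(const struct member_desc *tbl, size_t n, const char *name)
{
	for (size_t i = 0; i < n; i++)
		if (strcmp(tbl[i].name, name) == 0)
			return (int)i;
	return -1;
}

/*
 * Check that every program-side member exists in the kernel table with the
 * same size. Order does not matter and the program side may be a subset.
 * Returns 0 on success and -1 on the first mismatch.
 */
static int match_members(const struct member_desc *prog, size_t n_prog)
{
	size_t n_kern = sizeof(kern_members) / sizeof(kern_members[0]);

	for (size_t i = 0; i < n_prog; i++) {
		int k = find_member(kern_members, n_kern, prog[i].name);

		if (k < 0 || kern_members[k].size != prog[i].size)
			return -1;
	}
	return 0;
}
```

A program-side table that lists flags before init still matches, while a member with an unknown name or a different size causes the whole structure to be rejected.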

In the eBPF program, the section name .struct_ops marks the struct_ops structures that the BPF program implements, such as the currently supported tcp_congestion_ops structure.

Multiple struct_ops structures can be defined at the same time under SEC(".struct_ops"). Each one is defined as a global variable under SEC(".struct_ops"), and libbpf creates a map for each variable. The name of the map is the name of the variable: dctcp_nouse and dctcp in this example.

For the complete user-space code, see here; for the generated skeleton code, see here. The core code related to dctcp is as follows:

static void test_dctcp(void)
{
	struct bpf_dctcp *dctcp_skel;
	struct bpf_link *link;

	// Skeleton-generated function: opens and loads the BPF object
	dctcp_skel = bpf_dctcp__open_and_load();
	if (CHECK(!dctcp_skel, "bpf_dctcp__open_and_load", "failed\n"))
		return;

	// bpf_map__attach_struct_ops registers a struct_ops map with the kernel subsystem;
	// here the map is the struct tcp_congestion_ops dctcp variable defined above
	link = bpf_map__attach_struct_ops(dctcp_skel->maps.dctcp);
	if (CHECK(IS_ERR(link), "bpf_map__attach_struct_ops", "err:%ld\n",
		  PTR_ERR(link))) {
		bpf_dctcp__destroy(dctcp_skel);
		return;
	}

	do_test("bpf_dctcp");	// selftest helper: runs a transfer that uses the new algorithm

	// Destroy related data structures
	bpf_link__destroy(link);
	bpf_dctcp__destroy(dctcp_skel);
}

The detailed process is explained as follows:

  • In the bpf_object__open phase, libbpf looks for the SEC(".struct_ops") section and finds out which BTF type each struct_ops variable implements. Note that the BTF type here is a type in the BTF of bpf_prog.o. A "struct bpf_map" is added through bpf_object__add_map(), as for other map types. libbpf then collects (through SHT_REL) the locations of the BPF programs (functions defined with SEC("struct_ops/xyz")) that the function pointers refer to. btf_vmlinux is not required in the open phase.

  • In the bpf_object__load phase, the fields of the map that depend on btf_vmlinux are initialized by bpf_map__init_kern_struct_ops(). During the load phase, libbpf also sets prog->type, prog->attach_btf_id, and prog->expected_attach_type. Therefore, the attributes of a program do not depend on its section name.

    Currently, the bpf_prog BTF type ==> btf_vmlinux BTF type matching process is very simple: member name matching + BTF kind matching + size matching.
    If any of these criteria fails, libbpf rejects the object. The only target currently supported is "struct tcp_congestion_ops", most of whose members are function pointers.
    The member ordering of the bpf_prog BTF type can differ from that of the btf_vmlinux BTF type.

    Then, all obj->maps are created as usual (in bpf_object__create_maps()). Once the maps are created and the attributes of the programs are set, libbpf proceeds to load all the programs.

  • bpf_map__attach_struct_ops() is used to register a struct_ops map to the kernel subsystem.
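
Once bpf_map__attach_struct_ops() has registered the map, the algorithm is known to the TCP stack under the name given in the .name field, and a socket can select it in the same way as a built-in algorithm. The sketch below shows this with the standard TCP_CONGESTION socket option. use_tcp_cc is a hypothetical helper name introduced here, and a name such as "bpf_dctcp" is assumed to be usable only while the struct_ops map from the example stays attached.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Ask the kernel to use the named congestion control algorithm on fd.
 * Returns 0 on success, or -1 with errno set (for example ENOENT when no
 * algorithm of that name is registered).
 */
static int use_tcp_cc(int fd, const char *name)
{
	size_t len = strlen(name);

	/* The kernel limits algorithm names to TCP_CA_NAME_MAX (16) bytes,
	 * including the terminating NUL. */
	if (len == 0 || len >= 16) {
		errno = EINVAL;
		return -1;
	}
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, (socklen_t)len);
}
```

A caller would typically create a TCP socket, call use_tcp_cc(fd, "bpf_dctcp") before connecting, and fall back to the system default when the call fails.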

For the complete patch set that adds TCP congestion control support, see here.

3. Generating the skeleton code

An example of generating a skeleton is shown below (for the commit that introduced skeletons, see here; you can search for the relevant keywords here):

$ cd tools/bpf/runqslower && make V=1  # prints the commands it runs, shown below
$ .output/sbin/bpftool btf dump file /sys/kernel/btf/vmlinux format c > .output/vmlinux.h
$ clang -g -O2 -target bpf -I.output -I.output -I/home/vagrant/linux-5.8/tools/lib -I/home/vagrant/linux-5.8/tools/include/uapi \
	 -c runqslower.bpf.c -o .output/runqslower.bpf.o
$ llvm-strip -g .output/runqslower.bpf.o
$ .output/sbin/bpftool gen skeleton .output/runqslower.bpf.o > .output/runqslower.skel.h
$ cc -g -Wall -I.output -I.output -I/home/vagrant/linux-5.8/tools/lib -I/home/vagrant/linux-5.8/tools/include/uapi -c runqslower.c -o .output/runqslower.o

$ cc -g -Wall .output/runqslower.o .output/libbpf.a -lelf -lz -o .output/runqslower

4. Underlying implementation of BPF struct_ops

The sections above covered the user-space code and the main flow in the kernel. If you are not interested in how the kernel implements this underneath, you can skip this part.

4.1 The ops structure in the kernel (bpf_tcp_ca.c)

As shown in Figure 1, this feature needs basic supporting infrastructure in the kernel code. The operations object (ops structure) that corresponds to the kernel structure is bpf_tcp_congestion_ops, defined in net/ipv4/bpf_tcp_ca.c (see here):

/* Avoid sparse warning.  It is only used in bpf_struct_ops.c. */
extern struct bpf_struct_ops bpf_tcp_congestion_ops;

struct bpf_struct_ops bpf_tcp_congestion_ops = {
	.verifier_ops = &bpf_tcp_ca_verifier_ops,
	.reg = bpf_tcp_ca_reg,
	.unreg = bpf_tcp_ca_unreg,
	.check_member = bpf_tcp_ca_check_member,
	.init_member = bpf_tcp_ca_init_member,
	.init = bpf_tcp_ca_init,
	.name = "tcp_congestion_ops",
};

The members of the bpf_tcp_congestion_ops structure are described as follows:

  • The init() function is called first to do any required global setup;
  • init_member() verifies the exact value of any field in the structure. In particular, init_member() can validate non-function fields (for example, flags fields);
  • check_member() determines whether a specific member of the target structure is allowed to be implemented in BPF;
  • The reg() function actually registers the replacement structure after the checks have passed; in the congestion control case, it installs the tcp_congestion_ops structure (with the appropriate BPF trampolines for the function pointers) where the network stack will use it;
  • The unreg() function undoes the registration;
  • The verifier_ops structure has a number of functions for verifying that each replacement function is safe to execute;
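
How these callbacks cooperate can be sketched with a toy user-space model. Nothing below is kernel code: every toy_, cc_, and sample_ name is invented for illustration, the call order is simplified, and the rule that get_info may not be implemented is only this toy's policy.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a congestion control ops structure (not the kernel's). */
struct toy_cc_ops {
	void (*init)(void);
	unsigned int (*ssthresh)(unsigned int cwnd);
	void (*get_info)(void);
	unsigned int flags;
};

enum toy_member { M_INIT, M_SSTHRESH, M_GET_INFO, M_COUNT };

#define TOY_FLAG_NEEDS_ECN 0x2u

/* Toy counterpart of the callbacks described above. */
struct toy_struct_ops {
	int (*check_member)(enum toy_member m);           /* may m be implemented? */
	int (*init_member)(const struct toy_cc_ops *ops); /* validate data fields */
	int (*reg)(const struct toy_cc_ops *ops);         /* install the structure */
	void (*unreg)(void);                              /* undo reg() */
};

static const struct toy_cc_ops *active_ops;	/* what the toy "stack" uses */

static int cc_check_member(enum toy_member m)
{
	return m == M_GET_INFO ? -1 : 0;	/* toy policy: get_info is off limits */
}

static int cc_init_member(const struct toy_cc_ops *ops)
{
	return (ops->flags & ~TOY_FLAG_NEEDS_ECN) ? -1 : 0;	/* reject unknown flag bits */
}

static int cc_reg(const struct toy_cc_ops *ops)
{
	if (active_ops)
		return -1;	/* something is already registered */
	active_ops = ops;
	return 0;
}

static void cc_unreg(void)
{
	active_ops = NULL;
}

static const struct toy_struct_ops toy_cc_struct_ops = {
	.check_member = cc_check_member,
	.init_member = cc_init_member,
	.reg = cc_reg,
	.unreg = cc_unreg,
};

/* Sample implementations that a user could plug in. */
static unsigned int sample_ssthresh(unsigned int cwnd)
{
	return cwnd / 2 > 2 ? cwnd / 2 : 2;
}

static void sample_get_info(void)
{
}

/* Run the checks, then register; returns 0 on success, -1 on rejection. */
static int toy_register(const struct toy_struct_ops *st, const struct toy_cc_ops *ops)
{
	int set[M_COUNT] = { ops->init != NULL, ops->ssthresh != NULL,
			     ops->get_info != NULL };

	for (int m = 0; m < M_COUNT; m++)
		if (set[m] && st->check_member((enum toy_member)m) < 0)
			return -1;
	if (st->init_member(ops) < 0)
		return -1;
	return st->reg(ops);
}
```

The point of the split is that policy (which members and which field values are acceptable) lives in the per-structure callbacks, while the generic registration path stays the same for every structure type.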

Of these, the verifier_ops structure is used mainly by the verifier. Its functions are defined as follows:

static const struct bpf_verifier_ops bpf_tcp_ca_verifier_ops = {
	// Returns the prototype of a kernel helper so that the verifier can check
	// whether the eBPF program may BPF_CALL it; after verification, the imm32
	// field of the BPF_CALL instruction is adjusted
	.get_func_proto		= bpf_tcp_ca_get_func_proto,
	.is_valid_access	= bpf_tcp_ca_is_valid_access,	// whether an access is legal
	.btf_struct_access	= bpf_tcp_ca_btf_struct_access,	// whether a structure described by BTF may be accessed
};

Finally, a line is added to kernel/bpf/bpf_struct_ops_types.h:

BPF_STRUCT_OPS_TYPE(tcp_congestion_ops)

4.2 Definition and management of kernel ops object structures (bpf_struct_ops.c)

The file bpf_struct_ops.c includes "bpf_struct_ops_types.h" four times, each time with a different definition of the BPF_STRUCT_OPS_TYPE macro. This produces the definition of the map value structure, the management of the array of ops objects defined in the kernel, and the BTF definitions of the corresponding data structures.

/* bpf_struct_ops_##_name (e.g. bpf_struct_ops_tcp_congestion_ops) is
 * the map's value exposed to the userspace and its btf-type-id is
 * stored at the map->btf_vmlinux_value_type_id.
 */
#define BPF_STRUCT_OPS_TYPE(_name)				\
extern struct bpf_struct_ops bpf_##_name;			\
								\
struct bpf_struct_ops_##_name {						\
	BPF_STRUCT_OPS_COMMON_VALUE;				\
	struct _name data ____cacheline_aligned_in_smp;		\
};
#include "bpf_struct_ops_types.h" // ① generates struct bpf_struct_ops_tcp_congestion_ops
#undef BPF_STRUCT_OPS_TYPE

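enum {
#define BPF_STRUCT_OPS_TYPE(_name) BPF_STRUCT_OPS_TYPE_##_name,
#include "bpf_struct_ops_types.h" // ② generates one enum member per type
#undef BPF_STRUCT_OPS_TYPE
	__NR_BPF_STRUCT_OPS_TYPE,
};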

static struct bpf_struct_ops * const bpf_struct_ops[] = {
#define BPF_STRUCT_OPS_TYPE(_name)				\
	[BPF_STRUCT_OPS_TYPE_##_name] = &bpf_##_name,
#include "bpf_struct_ops_types.h" // ③ generates the array member
                                  // [BPF_STRUCT_OPS_TYPE_tcp_congestion_ops] = &bpf_tcp_congestion_ops
#undef BPF_STRUCT_OPS_TYPE
};

void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
{
	/* Ensure BTF type is emitted for "struct bpf_struct_ops_##_name" */
#define BPF_STRUCT_OPS_TYPE(_name) BTF_TYPE_EMIT(struct bpf_struct_ops_##_name);
#include "bpf_struct_ops_types.h"  // ④ BTF_TYPE_EMIT(struct bpf_struct_ops_tcp_congestion_ops): BTF registration
#undef BPF_STRUCT_OPS_TYPE
	// ...
}

The resulting structures after full macro expansion:

extern struct bpf_struct_ops bpf_tcp_congestion_ops;
struct bpf_struct_ops_tcp_congestion_ops {		// ① stored as the value object of the map
	refcount_t refcnt;
	enum bpf_struct_ops_state state;
	struct tcp_congestion_ops data ____cacheline_aligned_in_smp;	// the kernel's tcp_congestion_ops object
};

enum {
	BPF_STRUCT_OPS_TYPE_tcp_congestion_ops,  // ② index declaration
	__NR_BPF_STRUCT_OPS_TYPE,
};

static struct bpf_struct_ops * const bpf_struct_ops[] = { // ③ the array of ops objects
	// bpf_tcp_congestion_ops is the variable defined in net/ipv4/bpf_tcp_ca.c (it holds the function pointers for the various operations)
	[BPF_STRUCT_OPS_TYPE_tcp_congestion_ops] = &bpf_tcp_congestion_ops,
};

void bpf_struct_ops_init(struct btf *btf, struct bpf_verifier_log *log)
{
	// #define BTF_TYPE_EMIT(type) ((void)(type *)0)
	((void)(struct bpf_struct_ops_tcp_congestion_ops *)0); // ④ BTF type registration

	// ...
}

At this point, the kernel has generated and registered the ops structure types and set up the management of the ops object array.
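
Including the same list several times under different macro definitions is the technique commonly called an X-macro. The sketch below reproduces the idea in a single file, assuming a list macro (TOY_OPS_TYPES) in place of the re-included header; all toy_ and TOY_ names, and the demo_ops entry, are invented for illustration.

```c
#include <assert.h>
#include <string.h>

/* Toy ops object; the real struct bpf_struct_ops has many more fields. */
struct toy_ops {
	const char *name;
};

/* Stands in for bpf_struct_ops_types.h: one entry per supported type.
 * demo_ops is a made-up second entry that shows how the list scales. */
#define TOY_OPS_TYPES(X) \
	X(tcp_congestion_ops) \
	X(demo_ops)

/* Expansion 1: define one ops object per entry. */
#define X(_name) static struct toy_ops toy_##_name = { .name = #_name };
TOY_OPS_TYPES(X)
#undef X

/* Expansion 2: one enum member per entry, plus a trailing count. */
enum {
#define X(_name) TOY_OPS_TYPE_##_name,
	TOY_OPS_TYPES(X)
#undef X
	NR_TOY_OPS_TYPE,
};

/* Expansion 3: an array indexed by the enum, pointing at the objects. */
static struct toy_ops *const toy_ops_table[] = {
#define X(_name) [TOY_OPS_TYPE_##_name] = &toy_##_name,
	TOY_OPS_TYPES(X)
#undef X
};

/* Look up an ops object by name, which a registry of this shape allows. */
static const struct toy_ops *toy_find_ops(const char *name)
{
	for (int i = 0; i < NR_TOY_OPS_TYPE; i++)
		if (strcmp(toy_ops_table[i]->name, name) == 0)
			return toy_ops_table[i];
	return NULL;
}
```

Supporting another structure type then takes a single new line in the list, and the object, the enum member, and the array slot stay in sync automatically.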

4.3 Initialization of the kernel structure value in the map

This step initializes the kernel-side value from the variable defined in the BPF program. It is implemented in the bpf_map__init_kern_struct_ops function of the libbpf library, whose prototype is:

/* Init the map's fields that depend on kern_btf */
static int bpf_map__init_kern_struct_ops(struct bpf_map *map,
					 const struct btf *btf,
					 const struct btf *kern_btf)

The main steps for initializing the map's structure value from the BPF program's structure are as follows:

  • The BPF_MAP_TYPE_STRUCT_OPS map objects defined in the BPF program are recognized while the program is loaded;
  • From the type of the variable defined under struct_ops (for example, struct tcp_congestion_ops dctcp), libbpf obtains the tcp_congestion_ops type and stores the resulting tname/type/type_id in the st_ops object of the map structure;
  • Using the tname set in the previous step, it looks up the type_id and type of the kernel's tcp_congestion_ops in the kernel BTF, and also obtains the vtype_id and vtype of the map value type bpf_struct_ops_tcp_congestion_ops;
  • At this point, it has the variable defined in the BPF program together with its bpf_prog BTF type tcp_congestion_ops, the tcp_congestion_ops type defined in the kernel, and the map value type bpf_struct_ops_tcp_congestion_ops;
  • Next, the bpf_prog BTF type variable is converted into a bpf_struct_ops_tcp_congestion_ops value according to the BTF matching rules (name, call parameters, return type, and so on), and the kernel-layout result is stored in st_ops->kern_vdata (the bpf_map__attach_struct_ops() function later updates the map with st_ops->kern_vdata; the map key is fixed at 0, meaning the first slot);
  • Finally, btf_vmlinux_value_type_id in the map structure is set to the vtype_id for later checks: map->btf_vmlinux_value_type_id = kern_vtype_id;
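
The conversion step, in which values move from the program-side layout into the kernel-side layout, can be sketched for plain data members as follows. This is a simplified illustration and not libbpf's implementation: prog_ops, kern_ops, member_map, MEMBER, and init_kern_data are hypothetical names, the offsets come from offsetof rather than from BTF, and function-pointer members are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Program-side structure: a subset of the members, in a different order. */
struct prog_ops {
	unsigned int flags;
	char name[16];
};

/* Kernel-side structure (toy): more members, at different offsets. */
struct kern_ops {
	char name[16];
	void *owner;
	unsigned int flags;
};

/* One matched member: where it lives on each side and how large it is. */
struct member_map {
	const char *name;
	size_t prog_off, kern_off, size;
};

#define MEMBER(m) { #m, offsetof(struct prog_ops, m), \
		    offsetof(struct kern_ops, m), sizeof(((struct kern_ops *)0)->m) }

static const struct member_map members[] = { MEMBER(flags), MEMBER(name) };

/* Zero the kernel-layout value, then copy each matched member into place. */
static void init_kern_data(struct kern_ops *kern, const struct prog_ops *prog)
{
	memset(kern, 0, sizeof(*kern));
	for (size_t i = 0; i < sizeof(members) / sizeof(members[0]); i++)
		memcpy((char *)kern + members[i].kern_off,
		       (const char *)prog + members[i].prog_off,
		       members[i].size);
}
```

Members that exist only on the kernel side, such as owner in this toy, simply stay zeroed, which is why the program-side structure can be a subset.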

5. Summary

On the surface, congestion control is an important new capability for BPF, but the underlying implementation shows that the mechanism is far more general than this one use. We expect richer applications to appear in the near future, and defining kernel functionality in loadable software will bring a different experience.

Specifically, this infrastructure can be used to let a BPF program replace any "operations structure" of function pointers in the kernel, and a large part of the kernel's code is invoked through at least one such structure. If all or part of the security_hook_heads structure could be replaced, security policies could be modified in arbitrary ways, along the lines of the KRSI proposal. Replacing a file_operations structure could rewire any part of the kernel's I/O subsystem.

At present, nobody is proposing to do these things, but this capability will certainly attract interested users. One day, almost any kernel functionality may be hooked or replaced by BPF code loaded from user space. In such a world, users will have great power to change how their systems operate, but what we think of as the "Linux kernel" will become more amorphous, because much functionality may depend on which code has been loaded from user space.

6. References

  1. Kernel operations structures in BPF
  2. Introduce BPF STRUCT_OPS
  3. Writing TCP congestion control algorithm with eBPF


Posted on Wed, 22 Sep 2021 09:33:06 -0400 by Kifebear