TSO process analysis

1. TSO(transimit segment offload) is for tcp. It refers to the ability of the protocol stack to offload the tcp segment operation to the hardware, which needs the support of the hardware. When the network card has TSO capability, the upper layer protocol stack can directly send a packet over MTU, and the action of packet splitting is left to the hardware to do, saving cpu resources. In addition to TSO, there is another GSO in the kernel. GSO does not distinguish protocol types. GSO is enabled by default. GSO is a delay segmentation technology implemented in software. Compared with TSO, GSO ultimately needs the protocol stack to complete segmentation processing by itself.

Even if the network card does not have the Tso capability, the transmission layer can still encapsulate a packet that exceeds the MTU. Before the packet is sent to the driver, check whether the network card has the Tso capability. If not, call the ip layer and transmission layer segmentation processing functions to complete the packet segmentation. In this way, the kernel delays the packet segmentation to the dev link layer, improving the packet processing efficiency . When GSO/TSO is supported, the data storage format of skb is as follows. After skb - > end, there is an skb_ In the share area, the non-linear area data of skb is stored here. The processing of GSO/TSO segmentation is to treat the skb data (including linear area and non-linear area) as GSO_ This paper takes the virtual network card as an example to introduce the whole process of TSO.

2. Driver initialization process

When the virtio driver is loaded, it will judge whether the virtual network card has TSO capability according to the negotiation results of features at the front and back of qemu/vhost. If yes, it will be judged in dev - > HW_ feature or upper netif_ F_ The TSO flag is then assigned to dev - > features.

static int virtnet_probe(struct virtio_device *vdev)
{
		/* Individual feature bits: what can host handle? */
		if (virtio_has_feature(vdev, VIRTIO_NET_F_HOST_TSO4))
			dev->hw_features |= NETIF_F_TSO;

                //gso defaults to True
		if (gso)
			dev->features |= dev->hw_features & (NETIF_F_ALL_TSO|NETIF_F_UFO);
}

3. When registering a virtual network card device, set the GSO capability.

virtnet_probe  ---> register_netdev  ---->register_netdevice

int register_netdevice(struct net_device *dev)
{
	dev->hw_features |= NETIF_F_SOFT_FEATURES;
        //Dev - > features is for the protocol stack
	dev->features |= NETIF_F_SOFT_FEATURES;
}

4. Initiate a connect connection or three handshake establishment at the sender (tcp_v4_syn_recv_sock), GSO will be turned on.

int tcp_v4_connect(struct sock *sk, struct sockaddr *uaddr, int addr_len)
{
	rt = ip_route_newports(fl4, rt, orig_sport, orig_dport,
			       inet->inet_sport, inet->inet_dport, sk);
	if (IS_ERR(rt)) {
		err = PTR_ERR(rt);
		rt = NULL;
		goto failure;
	}
	/* OK, now commit destination to socket.  */
        //Set GSO type to tcpv4
	sk->sk_gso_type = SKB_GSO_TCPV4;
	sk_setup_caps(sk, &rt->dst);
}

tcp_v4_connect the GSO of the sock_ Type is set to tcpv4 type, then called sk_. setup_ Caps according to net_gso_ok returns a value to determine whether GSO capability is supported. Normally, True is returned.

static inline bool net_gso_ok(netdev_features_t features, int gso_type)
{
        //This function can be considered to check whether there is tso capability. There are two main calling places:
        //1. It is called when the tcp layer is connect ed or the three handshakes are completed. In this call process, if GSO is enabled, the features will be set to TSO at the same time,
        //   GSO is on by default, so when the tcp layer calls this interface, it will return true
        //2, the dev layer sends skb to the pre drive call to determine whether TSO is needed. In this calling process, features is directly equal to dev->features.
        //   If the network card does not have the TSO capability, the features will not have the TSO flag, and this function will return false
    
	netdev_features_t feature = gso_type & SKB_GSO1_MASK;

	feature <<= NETIF_F_GSO_SHIFT;

	if (gso_type & SKB_GSO2_MASK) {
		netdev_features_t f = gso_type & SKB_GSO2_MASK;
		f <<= NETIF_F_GSO2_SHIFT;
		feature |= f;
	}

	/* check flags correspondence */
	BUILD_BUG_ON(SKB_GSO_TCPV4   != (NETIF_F_TSO >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_UDP     != (NETIF_F_UFO >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_DODGY   != (NETIF_F_GSO_ROBUST >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_TCP_ECN != (NETIF_F_TSO_ECN >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_TCPV6   != (NETIF_F_TSO6 >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_FCOE    != (NETIF_F_FSO >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_GRE     != (NETIF_F_GSO_GRE >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_IPIP    != (NETIF_F_GSO_IPIP >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_SIT     != (NETIF_F_GSO_SIT >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL != (NETIF_F_GSO_UDP_TUNNEL >> NETIF_F_GSO_SHIFT));
	BUILD_BUG_ON(SKB_GSO_MPLS    != (NETIF_F_GSO_MPLS >> NETIF_F_GSO_SHIFT));

	/* GSO2 flags, see netdev_features.h */
	BUILD_BUG_ON(SKB_GSO_GRE_CSUM != (NETIF_F_GSO_GRE_CSUM >> NETIF_F_GSO2_SHIFT));
	BUILD_BUG_ON(SKB_GSO_UDP_TUNNEL_CSUM != (NETIF_F_GSO_UDP_TUNNEL_CSUM >> NETIF_F_GSO2_SHIFT));
	BUILD_BUG_ON(SKB_GSO_PARTIAL != (NETIF_F_GSO_PARTIAL >> NETIF_F_GSO2_SHIFT));
	BUILD_BUG_ON(SKB_GSO_SCTP    != (NETIF_F_GSO_SCTP >> NETIF_F_GSO2_SHIFT));
	BUILD_BUG_ON(SKB_GSO_TCP_FIXEDID != (NETIF_F_TSO_MANGLEID >> NETIF_F_GSO2_SHIFT));

	return (features & feature) == feature;
}

After gso is supported in protocol layer verification, the ability of dispersion, aggregation and csum verification will be enabled at the same time.

void sk_setup_caps(struct sock *sk, struct dst_entry *dst)
{
	sk_dst_set(sk, dst);
	sk->sk_route_caps = dst->dev->features;
	if (sk->sk_route_caps & NETIF_F_GSO)
		sk->sk_route_caps |= NETIF_F_GSO_SOFTWARE;
	sk->sk_route_caps &= ~sk->sk_route_nocaps;
	if (sk_can_gso(sk)) {
		//skb head needs extra space, close GSO
		if (dst->header_len) {
			sk->sk_route_caps &= ~NETIF_F_GSO_MASK;
		} else {
			//Enable the functions of dispersion, aggregation and csum of skb, because the network card needs to support the functions of dispersion, aggregation and the recalculation of csum while doing TSO
			sk->sk_route_caps |= NETIF_F_SG | NETIF_F_HW_CSUM;
			sk->sk_gso_max_size = dst->dev->gso_max_size;
			sk->sk_gso_max_segs = dst->dev->gso_max_segs;
		}
	}
}

5. The application program calls send to send the packet, and the send system calls TCP finally_ Sendmsg, in tcp_sendmsg decides whether to support GSO. If it supports, encapsulate the user data information into the linear or nonlinear area of skb. After encapsulating, the skb packet is a large package, and then calls tcp_. push_ One is sent to IP layer, of course, check function will be called before sending, and csum of TCP layer will be calculated according to the type of csum. When GSO and TSO are supported, only csum of pseudo header will be calculated by TCP layer.

int tcp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
		size_t size)
{
	struct iovec *iov;
	struct tcp_sock *tp = tcp_sk(sk);
	struct sk_buff *skb;
	int iovlen, flags, err, copied = 0;
	int mss_now = 0, size_goal, copied_syn = 0, offset = 0;
	bool sg;
	long timeo;

	lock_sock(sk);

	flags = msg->msg_flags;
	if (flags & MSG_FASTOPEN) {
		err = tcp_sendmsg_fastopen(sk, msg, &copied_syn, size);
		if (err == -EINPROGRESS && copied_syn > 0)
			goto out;
		else if (err)
			goto out_err;
		offset = copied_syn;
	}

	timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);

	/* Wait for a connection to finish. One exception is TCP Fast Open
	 * (passive side) where data is allowed to be sent before a connection
	 * is fully established.
	 */
	if (((1 << sk->sk_state) & ~(TCPF_ESTABLISHED | TCPF_CLOSE_WAIT)) &&
	    !tcp_passive_fastopen(sk)) {
		if ((err = sk_stream_wait_connect(sk, &timeo)) != 0)
			goto do_error;
	}

	if (unlikely(tp->repair)) {
		if (tp->repair_queue == TCP_RECV_QUEUE) {
			copied = tcp_send_rcvq(sk, msg, size);
			goto out_nopush;
		}

		err = -EINVAL;
		if (tp->repair_queue == TCP_NO_QUEUE)
			goto out_err;

		/* 'common' sending to sendq */
	}

	/* This should be in poll */
	clear_bit(SOCK_ASYNC_NOSPACE, &sk->sk_socket->flags);

	//Obtain mss. If GSO is supported, the integer multiple of the negotiated mss is obtained here
	mss_now = tcp_send_mss(sk, &size_goal, flags);

	/* Ok commence sending. */
	iovlen = msg->msg_iovlen;
	iov = msg->msg_iov;
	copied = 0;

	err = -EPIPE;
	if (sk->sk_err || (sk->sk_shutdown & SEND_SHUTDOWN))
		goto out_err;

        //Judge whether there is dispersing and aggregating ability
	sg = !!(sk->sk_route_caps & NETIF_F_SG);

	while (--iovlen >= 0) {
		size_t seglen = iov->iov_len;
		unsigned char __user *from = iov->iov_base;

		iov++;
		if (unlikely(offset > 0)) {  /* Skip bytes copied in SYN */
			if (offset >= seglen) {
				offset -= seglen;
				continue;
			}
			seglen -= offset;
			from += offset;
			offset = 0;
		}

		while (seglen > 0) {
			int copy = 0;
			int max = size_goal;

			//Get the last skb of the write queue
			skb = tcp_write_queue_tail(sk);
			if (tcp_send_head(sk)) {
				if (skb->ip_summed == CHECKSUM_NONE)
					max = mss_now;
				//copy indicates the maximum data length that skb can store
				copy = max - skb->len;
			}

			//If skb - > len > = max, the skb is full of data; a new skb needs to be allocated
			if (copy <= 0) {
new_segment:
				/* Allocate new segment. If the interface is SG,
				 * allocate skb fitting to single page.
				 */
				if (!sk_stream_memory_free(sk))
					goto wait_for_sndbuf;

				skb = sk_stream_alloc_skb(sk,
							  //Get the length of the skb header
							  select_size(sk, sg),
							  sk->sk_allocation);
				if (!skb)
					goto wait_for_memory;

				/*
				 * Check whether we can use HW checksum.
				 */
				//In sk_setup_caps, csum capability already set, set IP_ The sum mode is CHECKSUM_PARTIAL
				//It means that the protocol stack only does the check sum of ip header and pseudo header, and the palyload needs the help of hardware
				if (sk->sk_route_caps & NETIF_F_CSUM_MASK)
					skb->ip_summed = CHECKSUM_PARTIAL;

				//After dividing an skb, add it to SK - > sk_ write_ Queue in queue
				skb_entail(sk, skb);
				copy = size_goal;
				max = size_goal;

				/* All packets are restored as if they have
				 * already been sent. skb_mstamp isn't set to
				 * avoid wrong rtt estimation.
				 */
				if (tp->repair)
					TCP_SKB_CB(skb)->sacked |= TCPCB_REPAIRED;
			}

			/* Try to append data to the end of skb. */
			//The copied data cannot exceed the message length sent by the user at most
			if (copy > seglen)
				copy = seglen;

			/* Where to copy to? */
			//Judge whether there is space in the linear area
			if (skb_availroom(skb) > 0) {
				/* We have some space in skb head. Superb! */
				copy = min_t(int, copy, skb_availroom(skb));
				//Copy user data to linear area and update SKB - > len synchronously
				err = skb_add_data_nocache(sk, skb, from, copy);
				if (err)
					goto do_fault;
			} else {
				//There is no space in the linear area. Put the message information into skinfo
				bool merge = true;
				int i = skb_shinfo(skb)->nr_frags;
				struct page_frag *pfrag = sk_page_frag(sk);

				//sk_page_frag represents the current SKB_ The last frags of shinfo. Here, judge whether there is any page of the last frags
				//Space can store data (32 bytes minimum). If not, reassign a page and put it in pfrag - > page
				if (!sk_page_frag_refill(sk, pfrag))
					goto wait_for_memory;

				//Judge whether pfrag - > page is sk_ page_ The last page pointed by frag. If it is, it indicates the previous step
				//Sk_ page_ Page in frag has enough space to store data;
				//If not, it indicates that the page page has been reassigned in the previous step. Set the merge to false. Next, you need to reassign the newly assigned page page
				//Add to skb_shinfo
				if (!skb_can_coalesce(skb, i, pfrag->page,
						      pfrag->offset)) {
					if (i == MAX_SKB_FRAGS || !sg) {
						tcp_mark_push(tp, skb);
						goto new_segment;
					}
					merge = false;
				}

				//Take the minimum value of packet length and page remaining space to be copied
				copy = min_t(int, copy, pfrag->size - pfrag->offset);

				if (!sk_wmem_schedule(sk, copy))
					goto wait_for_memory;

				//Copy user data to pfrag - > page and update SKB - > len and SKB - > data synchronously_ Len, so SKB - > Data_ Len only represents the length of nonlinear region
				err = skb_copy_to_page_nocache(sk, from, skb,
							       pfrag->page,
							       pfrag->offset,
							       copy);
				if (err)
					goto do_error;

				/* Update the skb. */
				//If it is a merge operation, modify the size information of the last page
				if (merge) {
					skb_frag_size_add(&skb_shinfo(skb)->frags[i - 1], copy);
				} else {
					//New assigned page, add to SKB_ Shinfo (SKB) - > frags [i]
					//At the same time, SKB_ shinfo(skb)->nr_ Frags value increased by 1
					skb_fill_page_desc(skb, i, pfrag->page,
							   pfrag->offset, copy);
					//page reference count plus 1
					get_page(pfrag->page);
				}
				pfrag->offset += copy;
			}

			if (!copied)
				TCP_SKB_CB(skb)->tcp_flags &= ~TCPHDR_PSH;

			tp->write_seq += copy;
			TCP_SKB_CB(skb)->end_seq += copy;
			skb_shinfo(skb)->gso_segs = 0;

			from += copy;
			copied += copy;
			if ((seglen -= copy) == 0 && iovlen == 0)
				goto out;

			if (skb->len < max || (flags & MSG_OOB) || unlikely(tp->repair))
				continue;

			if (forced_push(tp)) {
				tcp_mark_push(tp, skb);
				__tcp_push_pending_frames(sk, mss_now, TCP_NAGLE_PUSH);
			} else if (skb == tcp_send_head(sk))
				tcp_push_one(sk, mss_now);
			continue;

wait_for_sndbuf:
			set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
wait_for_memory:
			if (copied)
				tcp_push(sk, flags & ~MSG_MORE, mss_now,
					 TCP_NAGLE_PUSH, size_goal);

			if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
				goto do_error;

			mss_now = tcp_send_mss(sk, &size_goal, flags);
		}
	}

out:
	if (copied)
		tcp_push(sk, flags, mss_now, tp->nonagle, size_goal);
out_nopush:
	release_sock(sk);
	return copied + copied_syn;

do_fault:
	if (!skb->len) {
		tcp_unlink_write_queue(skb, sk);
		/* It is the one place in all of TCP, except connection
		 * reset, where we can be unlinking the send_head.
		 */
		tcp_check_send_head(sk, skb);
		sk_wmem_free_skb(sk, skb);
	}

do_error:
	if (copied + copied_syn)
		goto out;
out_err:
	err = sk_stream_error(sk, flags, err);
	release_sock(sk);
	return err;
}

6. In TCP_ write_ In Xmit process, through tcp_init_tso_segs set gso_size and number of segments gso_segs, where gso_size is the mss value. These two parameters are used to tell the hardware the number and length of packets to be split into when Tso is split.

static void tcp_set_skb_tso_segs(const struct sock *sk, struct sk_buff *skb,
				 unsigned int mss_now)
{
	struct skb_shared_info *shinfo = skb_shinfo(skb);

	/* Make sure we own this skb before messing gso_size/gso_segs */
	WARN_ON_ONCE(skb_cloned(skb));

	if (skb->len <= mss_now || !sk_can_gso(sk) ||
	    skb->ip_summed == CHECKSUM_NONE) {
		/* Avoid the costly divide in the normal
		 * non-TSO case.
		 */
		shinfo->gso_segs = 1;
		shinfo->gso_size = 0;
		shinfo->gso_type = 0;
	} else {
		//gso_segs is the total packet length divided by mss
		shinfo->gso_segs = DIV_ROUND_UP(skb->len, mss_now);
		//gso_size is the mss value. When the hardware is split, GSO will be used_ Size to split each packet
		shinfo->gso_size = mss_now;
		shinfo->gso_type = sk->sk_gso_type;
	}
}

7. Before the dev layer is sent to the driver, further verify whether the network card has TSO capability. If not, call back the segmentation function of tcp to complete the segmentation processing of skb. If supported, send it directly to the driver;

static struct sk_buff *validate_xmit_skb(struct sk_buff *skb, struct net_device *dev)
{
	netdev_features_t features;

	if (skb->next)
		return skb;

	features = netif_skb_features(skb);
	skb = validate_xmit_vlan(skb, features);
	if (unlikely(!skb))
		goto out_null;

	//The features here are dev - > features. If the network card does not have TSO, dev - > features will not have the TSO flag. This function returns true
	if (netif_needs_gso(skb, features)) {
		struct sk_buff *segs;

		segs = skb_gso_segment(skb, features);
		if (IS_ERR(segs)) {
			goto out_kfree_skb;
		} else if (segs) {
			consume_skb(skb);
			skb = segs;
		}
	} else {
		if (skb_needs_linearize(skb, features) &&
		    __skb_linearize(skb))
			goto out_kfree_skb;

		/* If packet is not checksummed and device does not
		 * support checksumming for this protocol, complete
		 * checksumming here.
		 */
		if (skb->ip_summed == CHECKSUM_PARTIAL) {
			if (skb->encapsulation)
				skb_set_inner_transport_header(skb,
							       skb_checksum_start_offset(skb));
			else
				skb_set_transport_header(skb,
							 skb_checksum_start_offset(skb));
			if (skb_csum_hwoffload_help(skb, features))
				goto out_kfree_skb;
		}
	}

	return skb;

out_kfree_skb:
	kfree_skb(skb);
out_null:
	return NULL;
}

8. If it is necessary to do gso segmentation, first enter the ip layer segmentation processing. In the ip layer segmentation processing function, the main work is to call the tcp layer segmentation processing function. After the tcp layer segmentation is completed, check the ip header of the segmented skb again;

static struct sk_buff *inet_gso_segment(struct sk_buff *skb,
					netdev_features_t features)
{
	bool udpfrag = false, fixedid = false, gso_partial, encap;
	struct sk_buff *segs = ERR_PTR(-EINVAL);
	const struct net_offload *ops;
	unsigned int offset = 0;
	struct iphdr *iph;
	int proto, tot_len;
	int nhoff;
	int ihl;
	int id;

	//Set the offset of ip header based on head
	skb_reset_network_header(skb);
	//The ip header is based on the offset of the mac header, even if it is the length of the mac header
	nhoff = skb_network_header(skb) - skb_mac_header(skb);
	if (unlikely(!pskb_may_pull(skb, sizeof(*iph))))
		goto out;

	iph = ip_hdr(skb);
	ihl = iph->ihl * 4;
	if (ihl < sizeof(*iph))
		goto out;

	id = ntohs(iph->id);
	proto = iph->protocol;

	/* Warning: after this point, iph might be no longer valid */
	if (unlikely(!pskb_may_pull(skb, ihl)))
		goto out;
	//Split ip head
	__skb_pull(skb, ihl);

	encap = SKB_GSO_CB(skb)->encap_level > 0;
	if (encap)
		features &= skb->dev->hw_enc_features;
	SKB_GSO_CB(skb)->encap_level += ihl;

	//Set the offset of tcp header based on head
	skb_reset_transport_header(skb);

	segs = ERR_PTR(-EPROTONOSUPPORT);

	if (!skb->encapsulation || encap) {
		udpfrag = !!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP);
		fixedid = !!(skb_shinfo(skb)->gso_type & SKB_GSO_TCP_FIXEDID);

		/* fixed ID is invalid if DF bit is not set */
		if (fixedid && !(ip_hdr(skb)->frag_off & htons(IP_DF)))
			goto out;
	}

	ops = rcu_dereference(inet_offloads[proto]);
	if (likely(ops && ops->callbacks.gso_segment))
		segs = ops->callbacks.gso_segment(skb, features);

	if (IS_ERR_OR_NULL(segs))
		goto out;

	gso_partial = !!(skb_shinfo(segs)->gso_type & SKB_GSO_PARTIAL);

	skb = segs;
	do {
		//Reset ip header information for each segment skb
		iph = (struct iphdr *)(skb_mac_header(skb) + nhoff);
		if (udpfrag) {
			iph->frag_off = htons(offset >> 3);
			if (skb->next != NULL)
				iph->frag_off |= htons(IP_MF);
			offset += skb->len - nhoff - ihl;
			tot_len = skb->len - nhoff;
		} else if (skb_is_gso(skb)) {
			if (!fixedid) {
				iph->id = htons(id);
				id += skb_shinfo(skb)->gso_segs;
			}

			if (gso_partial)
				tot_len = skb_shinfo(skb)->gso_size +
					  SKB_GSO_CB(skb)->data_offset +
					  skb->head - (unsigned char *)iph;
			else
				tot_len = skb->len - nhoff;
		} else {
			if (!fixedid)
				iph->id = htons(id++);
			tot_len = skb->len - nhoff;
		}
		iph->tot_len = htons(tot_len);
		//Check the ip header of each segment skb
		ip_send_check(iph);
		if (encap)
			skb_reset_inner_headers(skb);
		skb->network_header = (u8 *)iph - skb->head;
	} while ((skb = skb->next));

out:
	return segs;
}

9. After entering the tcp layer segmentation function, tcp will be called_ gso_ Segment completes the segmentation of skb. After the segmentation is completed, check the tcp layer for each segment of skb, and reallocate the seq serial number for each segment of skb;

struct sk_buff *tcp_gso_segment(struct sk_buff *skb,
				netdev_features_t features)
{
	struct sk_buff *segs = ERR_PTR(-EINVAL);
	unsigned int sum_truesize = 0;
	struct tcphdr *th;
	unsigned int thlen;
	unsigned int seq;
	__be32 delta;
	unsigned int oldlen;
	unsigned int mss;
	struct sk_buff *gso_skb = skb;
	__sum16 newcheck;
	bool ooo_okay, copy_destructor;

	th = tcp_hdr(skb);
	thlen = th->doff * 4;
	if (thlen < sizeof(*th))
		goto out;

	if (!pskb_may_pull(skb, thlen))
		goto out;

	oldlen = (u16)~skb->len;
	__skb_pull(skb, thlen);

	mss = skb_shinfo(skb)->gso_size;
	if (unlikely(skb->len <= mss))
		goto out;

	if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
		/* Packet is from an untrusted source, reset gso_segs. */

		skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);

		segs = NULL;
		goto out;
	}

	copy_destructor = gso_skb->destructor == tcp_wfree;
	ooo_okay = gso_skb->ooo_okay;
	/* All segments but the first should have ooo_okay cleared */
	skb->ooo_okay = 0;

        //Real segmentation processing function
	segs = skb_segment(skb, features);
	if (IS_ERR(segs))
		goto out;

	/* Only first segment might have ooo_okay set */
	segs->ooo_okay = ooo_okay;

	/* GSO partial and frag_list segmentation only requires splitting
	 * the frame into an MSS multiple and possibly a remainder, both
	 * cases return a GSO skb. So update the mss now.
	 */
	if (skb_is_gso(segs))
		mss *= skb_shinfo(segs)->gso_segs;

	delta = htonl(oldlen + (thlen + mss));

	skb = segs;
	th = tcp_hdr(skb);
	seq = ntohl(th->seq);

	newcheck = ~csum_fold((__force __wsum)((__force u32)th->check +
					       (__force u32)delta));

	while (skb->next) {
		th->fin = th->psh = 0;
		th->check = newcheck;

		//Check the tcp layer for each segment of skb
		if (skb->ip_summed == CHECKSUM_PARTIAL)
			gso_reset_checksum(skb, ~th->check);
		else
			th->check = gso_make_checksum(skb, ~th->check);

		//Set the serial number of the skb. After splitting, except the last skb, all sizes are mss
		seq += mss;
		if (copy_destructor) {
			skb->destructor = gso_skb->destructor;
			skb->sk = gso_skb->sk;
			sum_truesize += skb->truesize;
		}
		skb = skb->next;
		th = tcp_hdr(skb);

		th->seq = htonl(seq);
		th->cwr = 0;
	}

	/* Following permits TCP Small Queues to work well with GSO :
	 * The callback to TCP stack will be called at the time last frag
	 * is freed at TX completion, and not right now when gso_skb
	 * is freed by GSO engine
	 */
	if (copy_destructor) {
		swap(gso_skb->sk, skb->sk);
		swap(gso_skb->destructor, skb->destructor);
		sum_truesize += skb->truesize;
		atomic_add(sum_truesize - gso_skb->truesize,
			   &skb->sk->sk_wmem_alloc);
	}

	delta = htonl(oldlen + (skb_tail_pointer(skb) -
				skb_transport_header(skb)) +
		      skb->data_len);
	th->check = ~csum_fold((__force __wsum)((__force u32)th->check +
				(__force u32)delta));
	if (skb->ip_summed == CHECKSUM_PARTIAL)
		gso_reset_checksum(skb, ~th->check);
	else
		th->check = gso_make_checksum(skb, ~th->check);
out:
	return segs;
}

You can see TCP_ gso_ In segment, the real way to do SKB segmentation is in skb_segment,skb_ In segment, the SKB of tso is segmented according to the length of mss. For the data of linear region, it is copied directly to the linear region of segmented SKB. For the data of nonlinear region, the frags pointer is directed to the frags of segmented SKB directly;

struct sk_buff *skb_segment(struct sk_buff *head_skb,
			    netdev_features_t features)
{
	struct sk_buff *segs = NULL;
	struct sk_buff *tail = NULL;
	//flag_list is used to store ip packets in ip_ Do_ It will be set in fragment
	struct sk_buff *list_skb = skb_shinfo(head_skb)->frag_list;
	//frags stores distributed and aggregated non-linear data packets
	skb_frag_t *frag = skb_shinfo(head_skb)->frags;
	unsigned int mss = skb_shinfo(head_skb)->gso_size;
	//doffset is the sum of ip header and mac header
	unsigned int doffset = head_skb->data - skb_mac_header(head_skb);
	struct sk_buff *frag_skb = head_skb;
	unsigned int offset = doffset;
	unsigned int tnl_hlen = skb_tnl_header_len(head_skb);
	unsigned int partial_segs = 0;
	unsigned int headroom;
	unsigned int len = head_skb->len;
	__be16 proto;
	bool csum, sg;
	int nfrags = skb_shinfo(head_skb)->nr_frags;
	int err = -ENOMEM;
	int i = 0;
	int pos;
	int dummy;

	//The first skb allocates ip and mac header space
	__skb_push(head_skb, doffset);
	proto = skb_network_protocol(head_skb, &dummy);
	if (unlikely(!proto))
		return ERR_PTR(-EINVAL);

	sg = !!(features & NETIF_F_SG);
	csum = !!can_checksum_protocol(features, proto);

	if (sg && csum && (mss != GSO_BY_FRAGS))  {
		if (!(features & NETIF_F_GSO_PARTIAL)) {
			struct sk_buff *iter;

			if (!list_skb ||
			    !net_gso_ok(features, skb_shinfo(head_skb)->gso_type))
				goto normal;

			/* Split the buffer at the frag_list pointer.
			 * This is based on the assumption that all
			 * buffers in the chain excluding the last
			 * containing the same amount of data.
			 */
			skb_walk_frags(head_skb, iter) {
				if (skb_headlen(iter))
					goto normal;

				len -= iter->len;
			}
		}

		/* GSO partial only requires that we trim off any excess that
		 * doesn't fit into an MSS sized block, so take care of that
		 * now.
		 */
		partial_segs = len / mss;
		if (partial_segs > 1)
			mss *= partial_segs;
		else
			partial_segs = 0;
	}

normal:
	headroom = skb_headroom(head_skb);
	//Get the length of linear area SKB - > len - SKB - > Data_ Len
	pos = skb_headlen(head_skb);

	do {
		struct sk_buff *nskb;
		skb_frag_t *nskb_frag;
		int hsize;
		int size;

		if (unlikely(mss == GSO_BY_FRAGS)) {
			len = list_skb->len;
		} else {
			//No new segment skb is added. offset accumulates the length of segment skb
			//len is the length of the next new segment skb, which should not exceed the mss value at most
			len = head_skb->len - offset;
			if (len > mss)
				len = mss;
		}

		//skb_ headlen = skb->len - skb->data_ Len is the length of linear region of SKB. When the first segmentation,
		//In theory, the linear region is copied, so the hsize should be less than 0 in the second time;
		//When hsize is less than 0, assign it to 0 directly. Next, the newly assigned segment skb will not apply for skb - > data
		//Linear space, but directly copy the non-linear data
		hsize = skb_headlen(head_skb) - offset;
		if (hsize < 0)
			hsize = 0;
		if (hsize > len || !sg)
			hsize = len;

		//Copy ip packet
		if (!hsize && i >= nfrags && skb_headlen(list_skb) &&
		    (skb_headlen(list_skb) == len || sg)) {
			BUG_ON(skb_headlen(list_skb) > len);

			i = 0;
			nfrags = skb_shinfo(list_skb)->nr_frags;
			frag = skb_shinfo(list_skb)->frags;
			frag_skb = list_skb;
			pos += skb_headlen(list_skb);

			while (pos < offset + len) {
				BUG_ON(i >= nfrags);

				size = skb_frag_size(frag);
				if (pos + size > offset + len)
					break;

				i++;
				pos += size;
				frag++;
			}

			nskb = skb_clone(list_skb, GFP_ATOMIC);
			list_skb = list_skb->next;

			if (unlikely(!nskb))
				goto err;

			if (unlikely(pskb_trim(nskb, len))) {
				kfree_skb(nskb);
				goto err;
			}

			hsize = skb_end_offset(nskb);
			if (skb_cow_head(nskb, doffset + headroom)) {
				kfree_skb(nskb);
				goto err;
			}

			nskb->truesize += skb_end_offset(nskb) - hsize;
			skb_release_head_state(nskb);
			__skb_push(nskb, doffset);
		} else {
			//Linear and nonlinear regions of copy skb
			nskb = __alloc_skb(hsize + doffset + headroom,
					   GFP_ATOMIC, skb_alloc_rx_flag(head_skb),
					   NUMA_NO_NODE);

			if (unlikely(!nskb))
				goto err;

			skb_reserve(nskb, headroom);
			__skb_put(nskb, doffset);
		}

		//When segs is empty, nskb is assigned to segs as the first skb, otherwise the new nskb will be put into next
		if (segs)
			tail->next = nskb;
		else
			segs = nskb;
		tail = nskb;

		//Copy the skb header information from the first skb
		__copy_skb_header(nskb, head_skb);

		skb_headers_offset_update(nskb, skb_headroom(nskb) - headroom);
		//Set mac head length
		skb_reset_mac_len(nskb);

		skb_copy_from_linear_data_offset(head_skb, -tnl_hlen,
						 nskb->data - tnl_hlen,
						 doffset + tnl_hlen);

		if (nskb->len == len + doffset)
			goto perform_csum_check;

		if (!sg) {
			if (!nskb->remcsum_offload)
				nskb->ip_summed = CHECKSUM_NONE;
			SKB_GSO_CB(nskb)->csum =
				skb_copy_and_csum_bits(head_skb, offset,
						       skb_put(nskb, len),
						       len, 0);
			SKB_GSO_CB(nskb)->csum_start =
				skb_headroom(nskb) + doffset;
			continue;
		}

		nskb_frag = skb_shinfo(nskb)->frags;

		//Copy the linear region of skb to nskb
		//When hsize=0, no linear area data needs to be copied
		skb_copy_from_linear_data_offset(head_skb, offset,
						 skb_put(nskb, hsize), hsize);

		skb_shinfo(nskb)->tx_flags = skb_shinfo(head_skb)->tx_flags &
			SKBTX_SHARED_FRAG;

		//The initial value of pos is the linear area length. offset+len indicates the total data length to be copied after the local segment skb is completed,
		//When pos<offset+len, it means that the linear area has been copied, but the skb data of each segment has not been copied,
		//In succession, you can only copy from the non-linear area. After each frags is copied, the pos length increases correspondingly
		//The frags length of until the skb data copy of this segment is completed
		while (pos < offset + len) {
			//If the non-linear area frags are also copied, they will be copied from the ip area
			if (i >= nfrags) {
				BUG_ON(skb_headlen(list_skb));

				i = 0;
				nfrags = skb_shinfo(list_skb)->nr_frags;
				frag = skb_shinfo(list_skb)->frags;
				frag_skb = list_skb;

				BUG_ON(!nfrags);

				list_skb = list_skb->next;
			}

			if (unlikely(skb_shinfo(nskb)->nr_frags >=
				     MAX_SKB_FRAGS)) {
				net_warn_ratelimited(
					"skb_segment: too many frags: %u %u\n",
					pos, mss);
				goto err;
			}

			if (unlikely(skb_orphan_frags(frag_skb, GFP_ATOMIC)))
				goto err;

			//When copying the non-linear area, it is not to copy the original non-linear area data to the linear area of the new segmented skb
			//Instead, point the frags pointer of the segmented skb directly to the frags of the original skb
			*nskb_frag = *frag;
			__skb_frag_ref(nskb_frag);
			size = skb_frag_size(nskb_frag);

			if (pos < offset) {
				nskb_frag->page_offset += offset - pos;
				skb_frag_size_sub(nskb_frag, offset - pos);
			}

			skb_shinfo(nskb)->nr_frags++;
			//Copy a frags and modify the pos length
			if (pos + size <= offset + len) {
				i++;
				frag++;
				pos += size;
			} else {
				skb_frag_size_sub(nskb_frag, pos + size - (offset + len));
				goto skip_fraglist;
			}

			nskb_frag++;
		}

skip_fraglist:
		nskb->data_len = len - hsize;
		nskb->len += nskb->data_len;
		nskb->truesize += nskb->data_len;

perform_csum_check:
		if (!csum) {
			if (skb_has_shared_frag(nskb)) {
				err = __skb_linearize(nskb);
				if (err)
					goto err;
			}
			if (!nskb->remcsum_offload)
				nskb->ip_summed = CHECKSUM_NONE;
			SKB_GSO_CB(nskb)->csum =
				skb_checksum(nskb, doffset,
					     nskb->len - doffset, 0);
			SKB_GSO_CB(nskb)->csum_start =
				skb_headroom(nskb) + doffset;
		}
	} while ((offset += len) < head_skb->len); //The length of copied data has not reached the length of the whole skb, and it will enter the next segmentation

	/* Some callers want to get the end of the list.
	 * Put it in segs->prev to avoid walking the list.
	 * (see validate_xmit_skb_list() for example)
	 */
	segs->prev = tail;

	if (partial_segs) {
		struct sk_buff *iter;
		int type = skb_shinfo(head_skb)->gso_type;
		unsigned short gso_size = skb_shinfo(head_skb)->gso_size;

		/* Update type to add partial and then remove dodgy if set */
		type |= (features & NETIF_F_GSO_PARTIAL) / NETIF_F_GSO_PARTIAL * SKB_GSO_PARTIAL;
		type &= ~SKB_GSO_DODGY;

		/* Update GSO info and prepare to start updating headers on
		 * our way back down the stack of protocols.
		 */
		for (iter = segs; iter; iter = iter->next) {
			skb_shinfo(iter)->gso_size = gso_size;
			skb_shinfo(iter)->gso_segs = partial_segs;
			skb_shinfo(iter)->gso_type = type;
			SKB_GSO_CB(iter)->data_offset = skb_headroom(iter) + doffset;
		}

		if (tail->len - doffset <= gso_size)
			skb_shinfo(tail)->gso_size = 0;
		else if (tail != segs)
			skb_shinfo(tail)->gso_segs = DIV_ROUND_UP(tail->len - doffset, gso_size);
	}

	/* Following permits correct backpressure, for protocols
	 * using skb_set_owner_w().
	 * Idea is to tranfert ownership from head_skb to last segment.
	 */
	if (head_skb->destructor == sock_wfree) {
		swap(tail->truesize, head_skb->truesize);
		swap(tail->destructor, head_skb->destructor);
		swap(tail->sk, head_skb->sk);
	}
	return segs;

err:
	kfree_skb_list(segs);
	return ERR_PTR(err);
}

10. After the segmentation processing, return the segmented skb list, and send the segmented skb list further (dev (qdisc) - > drive ----- > network card).

Tags: network Mac less socket

Posted on Sat, 20 Jun 2020 01:21:48 -0400 by Nicholas Reed