Standalone container network

Based on Linux kernel 5.4.54.
Yesterday I shared how veth works:
veth principle -- the send and receive path of packets when two containers communicate through veth
In general, containers do not communicate directly over a veth pair but through the docker0 bridge.
Today, let's analyze the communication path of a container through veth and the docker0 bridge.

Single-machine container network structure

When two containers are created on the host with docker, the network structure shown in the figure is generated automatically:

  • A docker0 bridge is created on the host
  • Container 1 is connected to the docker0 bridge through a veth pair, and so is Container 2

A quick look at Namespaces

Namespace of a network device:

When a network device is registered, its Net Namespace is set through net_device->nd_net (a field of the network device structure).
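
The kernel accesses this field through small helpers; a trimmed sketch of how they look around kernel 5.4 (include/linux/netdevice.h):

/* include/linux/netdevice.h, kernel ~5.4, trimmed */
static inline struct net *dev_net(const struct net_device *dev)
{
    return read_pnet(&dev->nd_net);
}

static inline void dev_net_set(struct net_device *dev, struct net *net)
{
    write_pnet(&dev->nd_net, net);
}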

Namespaces of the devices in the structure diagram:
  • veth0 belongs to Namespace1; veth1 belongs to Namespace2
  • eth0, docker0, and the two veth devices attached to docker0 belong to the host Namespace

Namespace of a packet:

The Namespace of a packet is determined by sk_buff->dev->nd_net (the Namespace of the packet's destination device).

Namespace of a process:

When a process is created through clone(), its Namespaces are set through task_struct->nsproxy (a field of the process structure); nsproxy->net_ns determines the Net Namespace of the process.

/* The nsproxy structure bundles the various namespaces plus the cgroup namespace; I'll dig into it further later */
struct nsproxy {
    atomic_t count;
    struct uts_namespace *uts_ns;
    struct ipc_namespace *ipc_ns;
    struct mnt_namespace *mnt_ns;
    struct pid_namespace *pid_ns_for_children;
    struct net       *net_ns;
    struct cgroup_namespace *cgroup_ns;
};
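
To observe a process's Net Namespace from user space, here is a minimal stand-alone demo of my own (not from the kernel sources): it prints /proc/self/ns/net before and after unshare(CLONE_NEWNET). It needs CAP_SYS_ADMIN, so run it as root.

/* nsdemo.c -- minimal demo: the process's net namespace changes after unshare(CLONE_NEWNET) */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static void print_netns(const char *when)
{
    char link[64];
    ssize_t n = readlink("/proc/self/ns/net", link, sizeof(link) - 1);

    if (n < 0) {
        perror("readlink");
        return;
    }
    link[n] = '\0';
    printf("%s: %s\n", when, link);   /* e.g. net:[4026531992] */
}

int main(void)
{
    print_netns("before unshare");

    /* Create and enter a new net namespace for this process */
    if (unshare(CLONE_NEWNET) < 0) {
        perror("unshare");
        return 1;
    }

    print_netns("after unshare");
    return 0;
}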

Namespace of a socket:

When a process creates a socket, sock->sk_net is set to current->nsproxy->net_ns, i.e. the Net Namespace of the current process is handed to the sock structure.
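
Two trimmed kernel excerpts (around 5.4) show this hand-off: sock_create() passes current->nsproxy->net_ns down when the socket is created, and sock_net() reads it back from the sock later:

/* net/socket.c, kernel ~5.4, trimmed */
int sock_create(int family, int type, int protocol, struct socket **res)
{
    /* the new socket inherits the net namespace of the calling process */
    return __sock_create(current->nsproxy->net_ns, family, type, protocol, res, 0);
}

/* include/net/sock.h, kernel ~5.4, trimmed */
static inline struct net *sock_net(const struct sock *sk)
{
    return read_pnet(&sk->sk_net);
}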

Two cases to analyze

1. Container 1 sends data packets to container 2 through the bridge

Send/receive path:

process (Container 1)
|--enters the kernel via the socket system call and runs Namespace1's network protocol stack
kernel layer: build the skb structure and copy the data from user space to kernel space
TCP/UDP packet
IP packet: run Namespace1's routing and netfilter
|--leave the protocol stack and enter the network device layer
 call the transmit function registered by the network device driver
|
veth_xmit: the transmit function registered by the veth driver
    |
    veth_forward_skb
        |
        __dev_forward_skb: scrub from the skb everything that could break namespace isolation,
        |          and update the device the packet is destined for (skb->dev) from veth0 to docker-veth0.
        |          The protocol stack (network namespace) the packet will run through is decided by the
        |          nd_net field of skb->dev (see the trimmed sketch after this call chain)
        |
        XDP hook point
        |
        netif_rx
            |
            netif_rx_internal: cpu soft-interrupt load balancing
                |
                enqueue_to_backlog: append the skb to the tail of the assigned cpu's input_pkt_queue.
                                    If the queue was empty, raise the network soft interrupt;
                                    if it was not empty, there is no need to raise it, because the soft interrupt
                                    is triggered again automatically before the cpu drains the queue
    Each cpu has its own input_pkt_queue (receive queue, default length 1000, tunable) and process_queue (processing queue). After the soft-interrupt handler has processed all the skbs in process_queue, it splices input_pkt_queue onto process_queue.
    input_pkt_queue and process_queue are the cpu's queues for non-NAPI devices; NAPI devices have their own queues.

    Up to this point the packet path is exactly the same as the transmit phase of veth-to-veth communication in the previous veth article; the docker0 bridge does its packet processing mainly in __netif_receive_skb_core.
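
A trimmed sketch of __dev_forward_skb and its helper as they look around kernel 5.4 (exact lines may differ slightly between versions). The namespace crossing works because eth_type_trans() sets skb->dev to the peer device (docker-veth0):

/* include/linux/netdevice.h and net/core/dev.c, kernel ~5.4, trimmed */
static __always_inline int ____dev_forward_skb(struct net_device *dev,
                                               struct sk_buff *skb)
{
    if (skb_orphan_frags(skb, GFP_ATOMIC) ||
        unlikely(!is_skb_forwardable(dev, skb))) {
        atomic_long_inc(&dev->rx_dropped);
        kfree_skb(skb);
        return NET_RX_DROP;
    }

    /* drop state (mark, socket, conntrack, ...) that must not leak
     * across the namespace boundary */
    skb_scrub_packet(skb, true);
    skb->priority = 0;
    return 0;
}

int __dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
{
    int ret = ____dev_forward_skb(dev, skb);

    if (likely(!ret)) {
        /* eth_type_trans() also sets skb->dev = dev, i.e. the peer
         * (docker-veth0); from here on dev_net(skb->dev) is the
         * peer's namespace */
        skb->protocol = eth_type_trans(skb, dev);
        skb_postpull_rcsum(skb, eth_hdr(skb), ETH_HLEN);
    }

    return ret;
}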

How the cpu processes network packets:

do_softirq()
|
net_rx_action: network soft-interrupt handler
    |
    napi_poll
        |
        n->poll: call the poll function of the destination network device's driver.
            |    The veth device does not define its own poll, so the default poll function, process_backlog, is used
            |
            process_backlog: the cpu takes skbs off process_queue in a loop and processes them, up to its
                |            poll quota per run (net.core.dev_weight, 64 by default; the whole softirq run is
                |            further capped by netdev_budget, 300 packets by default).
                |            Once the processing queue is drained, input_pkt_queue is spliced onto the tail of
                |            process_queue (see the trimmed sketch below)
                |
                __netif_receive_skb
                    |
                    ...
                    |
                    __netif_receive_skb_core
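
A trimmed sketch of process_backlog as it looks around kernel 5.4 (RPS locking and IPI handling removed):

/* net/core/dev.c, kernel ~5.4, trimmed */
static int process_backlog(struct napi_struct *napi, int quota)
{
    struct softnet_data *sd = container_of(napi, struct softnet_data, backlog);
    bool again = true;
    int work = 0;

    napi->weight = dev_rx_weight;
    while (again) {
        struct sk_buff *skb;

        /* drain the processing queue, at most 'quota' packets per call */
        while ((skb = __skb_dequeue(&sd->process_queue))) {
            rcu_read_lock();
            __netif_receive_skb(skb);
            rcu_read_unlock();
            input_queue_head_incr(sd);
            if (++work >= quota)
                return work;
        }

        /* process_queue is empty: either stop polling, or splice the skbs
         * queued by enqueue_to_backlog (input_pkt_queue) onto its tail */
        if (skb_queue_empty(&sd->input_pkt_queue)) {
            napi->state = 0;
            again = false;
        } else {
            skb_queue_splice_tail_init(&sd->input_pkt_queue,
                                       &sd->process_queue);
        }
    }

    return work;
}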

Packet processing code analysis:

/*
* __netif_receive_skb_core code analysis
* The function is heavily trimmed here: only the bridge handling and the hand-off
* of the packet to the upper protocol layers are kept.
* Many other parts, such as vlan, xdp and tcpdump handling, have been removed.
*/
static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
                    struct packet_type **ppt_prev)
{
    struct packet_type *ptype, *pt_prev;
    rx_handler_func_t *rx_handler;
    struct sk_buff *skb = *pskb;
    struct net_device *orig_dev;
    bool deliver_exact = false;
    int ret = NET_RX_DROP;
    __be16 type;

    /* Record the device the skb arrived on */
    orig_dev = skb->dev;

    /* Reset the skb's protocol header pointers */
    skb_reset_network_header(skb);
    if (!skb_transport_header_was_set(skb))
        skb_reset_transport_header(skb);
    skb_reset_mac_len(skb);

    pt_prev = NULL;

another_round:
...
    /**
    * The skb's destination device is docker-veth0, which is a port of the bridge.
    * When docker-veth0 was attached to the bridge, its rx_handler was set to the
    * bridge's receive function br_handle_frame (see the br_add_if sketch after
    * this function), so the rx_handler call below goes into br_handle_frame.
    */
    rx_handler = rcu_dereference(skb->dev->rx_handler);
    if (rx_handler) {
        ...
        switch (rx_handler(&skb)) {
        case RX_HANDLER_CONSUMED: /* Processed, no further processing required */
            ret = NET_RX_SUCCESS;
            goto out;
        case RX_HANDLER_ANOTHER: /* Do another round of processing */
            goto another_round;
        case RX_HANDLER_EXACT: /* Deliver only to handlers with ptype->dev == skb->dev */
            deliver_exact = true;
        case RX_HANDLER_PASS:
            break;
        default:
            BUG();
        }
    }
...
    /* Get layer 3 protocol */
    type = skb->protocol;

    /* 
    * Call the protocol handler registered for this protocol (e.g. ip_rcv for IPv4)
    * to hand the packet to the upper protocol layer.
    * ip_rcv is the entry function of the IP protocol stack; from there the packet
    * goes through netfilter and routing and is finally forwarded or delivered
    * to the upper protocol layers.
    */
    deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
                   &orig_dev->ptype_specific);
...
    if (pt_prev) {
        if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
            goto drop;
        *ppt_prev = pt_prev;
    } else {
drop:
        if (!deliver_exact)
            atomic_long_inc(&skb->dev->rx_dropped);
        else
            atomic_long_inc(&skb->dev->rx_nohandler);
        kfree_skb(skb);
        ret = NET_RX_DROP;
    }

out:
    *pskb = skb;
    return ret;
}
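
For context on the rx_handler comment above: when a port such as docker-veth0 is attached to the bridge, br_add_if() registers br_handle_frame as that port's receive handler. A trimmed sketch (kernel ~5.4):

/* net/bridge/br_if.c, kernel ~5.4, trimmed */
int br_add_if(struct net_bridge *br, struct net_device *dev,
              struct netlink_ext_ack *extack)
{
    struct net_bridge_port *p;
    int err;
...
    /* from now on, packets received on this port are handed to the bridge:
     * skb->dev->rx_handler == br_handle_frame */
    err = netdev_rx_handler_register(dev, br_handle_frame, p);
...
}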

Bridge processing code analysis:

/* br_handle_frame, trimmed */
rx_handler_result_t br_handle_frame(struct sk_buff **pskb)
{
    struct net_bridge_port *p;
    struct sk_buff *skb = *pskb;
    const unsigned char *dest = eth_hdr(skb)->h_dest;
...
forward:
    switch (p->state) {
    case BR_STATE_FORWARDING:
    case BR_STATE_LEARNING:
        /* Is the destination MAC the bridge device's own link-layer address? */
        if (ether_addr_equal(p->br->dev->dev_addr, dest))
            skb->pkt_type = PACKET_HOST;

        return nf_hook_bridge_pre(skb, pskb);
    default:
drop:
        kfree_skb(skb);
    }
    return RX_HANDLER_CONSUMED;
}

nf_hook_bridge_pre
    |
    br_handle_frame_finish

/* br_handle_frame_finish, trimmed */
int br_handle_frame_finish(struct net *net, struct sock *sk, struct sk_buff *skb)
{
    struct net_bridge_port *p = br_port_get_rcu(skb->dev);
    enum br_pkt_type pkt_type = BR_PKT_UNICAST;
    struct net_bridge_fdb_entry *dst = NULL;
    struct net_bridge_mdb_entry *mdst;
    bool local_rcv, mcast_hit = false;
    struct net_bridge *br;
    u16 vid = 0;
...
    if (dst) {
        unsigned long now = jiffies;

        /* If the destination is the host */
        if (dst->is_local)
            /*
            * This path eventually returns into __netif_receive_skb_core and hands
            * the skb to the layer-3 protocol stack of the host's Net Namespace
            * (see the trimmed sketch of br_pass_frame_up after this function)
            */
            return br_pass_frame_up(skb);

        if (now != dst->used)
            dst->used = now;
        /*
        * The destination is not the host: the bridge forwards the packet out of the
        * matching port by calling the transmit function of that port's device driver,
        * here docker-veth1's veth_xmit.
        * As analyzed above, veth_xmit changes the packet's destination device from
        * docker-veth1 to veth1 and queues the skb on a cpu queue.
        * When the cpu processes the packet it runs veth1's network protocol stack
        * (that is, Namespace2), and finally the Container 2 process receives it.
        */
        br_forward(dst->dst, skb, local_rcv, false);
    }
...

out:
    return 0;
drop:
    kfree_skb(skb);
    goto out;
}
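
A trimmed sketch of br_pass_frame_up around kernel 5.4, for the is_local branch above: the skb's device is switched to the bridge device itself (docker0) and the packet re-enters the receive path in the host namespace:

/* net/bridge/br_input.c, kernel ~5.4, trimmed */
static int br_pass_frame_up(struct sk_buff *skb)
{
    struct net_device *indev, *brdev = BR_INPUT_SKB_CB(skb)->brdev;
...
    indev = skb->dev;
    /* from the host stack's point of view the packet arrived on docker0 */
    skb->dev = brdev;
...
    return NF_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,
                   dev_net(indev), NULL, skb, indev, NULL,
                   br_netif_receive_skb);
}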

Summary of the path:

Container 1 process generates a packet
|
The packet goes through Namespace1's protocol stack and is handed to veth0
|
The veth0 driver changes the skb's destination device to docker-veth0 and queues the skb on a cpu queue
|
The cpu processes the packet; because docker-veth0 is a port of the bridge, the bridge's receive function is called
|
The bridge changes the skb's destination device to docker-veth1 and calls the docker-veth1 driver
|
The docker-veth1 driver changes the skb's destination device to veth1 and queues the skb on a cpu queue
|
The cpu processes the packet and hands it to veth1's network protocol stack (Namespace2)
|
The Container 2 process receives it

2. Container 1 sends packets to the host through the bridge

The code has already been analyzed above, so here is just the summary.

Summary of the path:

Container 1 process generates a packet
|
The packet goes through Namespace1's protocol stack and is handed to veth0
|
The veth0 driver changes the skb's destination device to docker-veth0 and queues the skb on a cpu queue
|
The cpu processes the packet; because docker-veth0 is a port of the bridge, the bridge's receive function is called
|
The bridge determines that the destination is the host itself and hands the packet straight to the host namespace's protocol stack
