How to Do Containerization Well? A Tencent Architect Walks You Through Docker's Principles

1, The implementation principle of containers

In essence, a container is a sandbox technology: it is like isolating an application in a box and letting it run there. Because the box has a boundary, applications do not interfere with one another; and like a shipping container, it can be picked up and run anywhere. This, in fact, is the ideal state of PaaS.

The core of implementing a container is creating the boundary that confines the application at runtime. As we know, compiled executable code plus data is called a program; once running, a program becomes a process, which is what we call the application. If we can attach a boundary to the application as it starts, haven't we achieved the desired sandbox?

In Linux, two main technologies implement the container boundary: Cgroups and Namespaces. Cgroups limit the resources available to the running container, while Namespaces isolate the container and form the boundary.

Seen this way, a container is just a special, restricted process.

2, Container isolation: Namespace

Before introducing Namespace, let's look at an experiment:

# Using the official Python 3.6.8 image, set up an environment running Django
# After entering the container, use the ps command to view the running processes
root@8729260f784a:/src# ps -A
  PID TTY          TIME CMD
    1 ?        00:01:22 gunicorn
   22 ?        00:01:20 gunicorn
   23 ?        00:01:24 gunicorn
   25 ?        00:01:30 gunicorn
   27 ?        00:01:16 gunicorn
   41 pts/0    00:00:00 bash
   55 pts/0    00:00:00 ps

As you can see, the process with PID=1 in the container is the Django application started by gunicorn. Anyone familiar with Linux knows that PID 1 belongs to the first process started when the system boots, also known as the init process, under which all other processes are spawned and managed. Yet here, PID 1 is apparently a Django process.

Next, exit the container and execute the ps command on the host:

# The environment is Centos7
[root@localhost ~]# ps -ef | grep gunicorn
root      9623  8409  0 21:29 pts/0    00:00:00 grep --color=auto gunicorn
root     30828 30804  0 May28 ?        00:01:22 /usr/local/bin/python /usr/local/bin/gunicorn -c gunicorn_config.py ctg.wsgi
root     31171 30828  0 May28 ?        00:01:20 /usr/local/bin/python /usr/local/bin/gunicorn -c gunicorn_config.py ctg.wsgi
root     31172 30828  0 May28 ?        00:01:24 /usr/local/bin/python /usr/local/bin/gunicorn -c gunicorn_config.py ctg.wsgi
root     31174 30828  0 May28 ?        00:01:30 /usr/local/bin/python /usr/local/bin/gunicorn -c gunicorn_config.py ctg.wsgi
root     31176 30828  0 May28 ?        00:01:16 /usr/local/bin/python /usr/local/bin/gunicorn -c gunicorn_config.py ctg.wsgi

From the host's perspective, the Django master process actually has PID 30828. So clearly the container does some processing: a process that is plainly 30828 on the host becomes the first process inside the container, and none of the host's other processes are visible from within. This shows that the environment inside the container is indeed isolated.

This processing is Linux's Namespace mechanism. The PID-1 trick above, for example, is done through the PID Namespace. In Linux, processes are created with the clone() system call; when the CLONE_NEWPID flag is specified, the newly created process sees a brand-new process space in which it becomes the process with PID=1.

/* Create the child process in a new PID namespace (simplified call) */
int pid = clone(main_function, stack_size, CLONE_NEWPID | SIGCHLD, NULL);

Besides the PID Namespace, Linux provides several other Namespaces, each selected by its own clone() flag, such as:

  • CLONE_NEWNS: Mount Namespace, isolating file system mount points
  • CLONE_NEWUTS: UTS Namespace, isolating hostname and domain name
  • CLONE_NEWIPC: IPC Namespace, isolating System V IPC and message queues
  • CLONE_NEWPID: PID Namespace, isolating process IDs
  • CLONE_NEWNET: Network Namespace, isolating network devices, stacks and ports
  • CLONE_NEWUSER: User Namespace, isolating user and group IDs
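
To see PID isolation without writing any C, you can use the unshare command from util-linux, which wraps these clone() flags. A minimal sketch (requires root; --mount-proc remounts /proc so ps reflects the new namespace, and the exact output will vary on your machine):

# Start a bash in a new PID namespace
$ sudo unshare --fork --pid --mount-proc /bin/bash
# Inside the namespace, bash itself is PID 1
$ ps -A
  PID TTY          TIME CMD
    1 pts/0    00:00:00 bash
    2 pts/0    00:00:00 ps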

3, Container restrictions: Cgroups

Namespace technology gives us isolation between containers, and between a container and the host. But that is not enough. Imagine two containers running on one host: although they are isolated from each other, from the host's point of view they are just two special processes, and processes naturally compete with each other and can eat up all of the system's resources. Of course, we can't allow that.

Cgroups is the Linux kernel technology for setting resource limits on processes.

The full name of Linux Cgroups is Linux Control Groups. Their main function is to cap the resources a process group may use, including CPU, memory, disk I/O and network bandwidth.

They can also set process priority, and support auditing as well as suspending and resuming processes.

In earlier versions, cgroups could be managed with the libcgroup tools; since RedHat 7 they are managed through systemd (systemctl) instead.

We know that systemd's role in Linux is to manage system resources. For ease of management it introduces the concept of a unit, which is defined quite broadly: a unit can be an abstract service, a network resource, a device, a mounted file system, and so on. To tell them apart, systemd divides units into 12 types.

Of these, the unit types relevant to Cgroups are mainly slice, scope and service.

For example, let's create a transient cgroup and then restrict the resources of the process it starts:

 # Create a service called toptest and run it in a slice named test
[root@localhost ~]# systemd-run --unit=toptest --slice=test top -b
Running as unit toptest.service.

Now the toptest service is running in the background.

# View the cgroup hierarchy with systemd-cgls
[root@localhost ~]#  systemd-cgls
├─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 22
├─test.slice
│ └─toptest.service
│   └─6490 /usr/bin/top -b

# View the status of the service through systemctl status
[root@localhost ~]# systemctl status toptest
● toptest.service - /usr/bin/top -b
   Loaded: loaded (/run/systemd/system/toptest.service; static; vendor preset: disabled)
  Drop-In: /run/systemd/system/toptest.service.d
           └─50-Description.conf, 50-ExecStart.conf, 50-Slice.conf
   Active: active (running) since Tue 2020-06-02 14:01:01 CST; 3min 50s ago
 Main PID: 6490 (top)
   CGroup: /test.slice/toptest.service
           └─6490 /usr/bin/top -b

Now let's restrict the resources of the toptest service.

# First, look at the information of Cgroup before restriction. 6490 is the process PID
[root@localhost ~]# cat /proc/6490/cgroup
11:pids:/test.slice
10:blkio:/test.slice
9:hugetlb:/
8:cpuset:/
7:memory:/test.slice
6:devices:/test.slice
5:net_prio,net_cls:/
4:perf_event:/
3:freezer:/
2:cpuacct,cpu:/test.slice
1:name=systemd:/test.slice/toptest.service

# Limit the CPU and memory it uses
systemctl set-property toptest.service CPUShares=600 MemoryLimit=500M

# Check the Cgroup information again: the cpu and memory entries now point into toptest.service
[root@localhost ~]# cat /proc/6490/cgroup
11:pids:/test.slice
10:blkio:/test.slice
9:hugetlb:/
8:cpuset:/
7:memory:/test.slice/toptest.service
6:devices:/test.slice
5:net_prio,net_cls:/
4:perf_event:/
3:freezer:/
2:cpuacct,cpu:/test.slice/toptest.service
1:name=systemd:/test.slice/toptest.service

At this point, under both /sys/fs/cgroup/memory/test.slice and /sys/fs/cgroup/cpu/test.slice, an extra directory named toptest.service has appeared.

Running cat on cpu.shares inside that toptest.service directory shows that the CPU share is now limited to 600.
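
A quick way to verify both limits, assuming the cgroup v1 layout that CentOS 7 uses (paths may differ on other distributions; 500M is 524288000 bytes):

[root@localhost ~]# cat /sys/fs/cgroup/cpu/test.slice/toptest.service/cpu.shares
600
[root@localhost ~]# cat /sys/fs/cgroup/memory/test.slice/toptest.service/memory.limit_in_bytes
524288000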

Back to Docker: what Docker does is essentially the same as the operations above. The specific resources to restrict are passed to docker run:

$ docker run -it --cpu-period=100000 --cpu-quota=20000 ubuntu /bin/bash

Docker's concrete limits can be viewed under /sys/fs/cgroup/cpu/docker/ and the other subsystem directories.
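
For example, after the docker run above, the CPU quota can be read back from the container's cgroup. A sketch assuming cgroup v1, where <container-id> stands for the full ID shown by docker ps --no-trunc:

# 20000us of quota per 100000us period = at most 20% of one CPU
$ cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_period_us
100000
$ cat /sys/fs/cgroup/cpu/docker/<container-id>/cpu.cfs_quota_us
20000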

4, The container's file system: the container image (rootfs)

Now we know that the core of container technology is to limit the container's view with Namespaces and to limit the resources it can reach with Cgroups. But the Mount Namespace has a few special points that deserve attention.

What makes the Mount Namespace special is that a process's view of the file system only changes when an actual mount operation happens: enabling the namespace is not enough, you must explicitly declare which directories should be mounted into it. Linux also has a command called chroot, which changes a process's root directory to a specified location; the Mount Namespace was in fact developed from the idea behind chroot.
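
A minimal chroot sketch, assuming $HOME/test already contains a usable root file system (for example, extracted from an image, with its own bin/ and lib/ directories):

# Make $HOME/test the root directory of the new bash process;
# it can no longer see anything outside that directory
$ sudo chroot $HOME/test /bin/bash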

Inside the container, you should see a completely independent file system, unaffected by the host or by other containers. This independent file system is called the container image; it also has a more professional name, rootfs. A rootfs contains the files, configuration and directories an operating system needs, but not the kernel: in Linux, files and the kernel are stored separately, and the kernel is only loaded at boot time. This means that all containers share the kernel of the host operating system.

In the PaaS era, packaging applications was always painful because the cloud environment differs from the local one. With rootfs this problem is solved nicely: the image packages not only the application but also all the dependencies it needs, so the application runs well no matter where it is deployed.

Beyond that, rootfs also solves reusability. Imagine this scenario: you package a CentOS image containing a Java environment, and someone else needs to run an Apache service in a container. Do they have to rebuild the Java environment from scratch? Docker solves this by introducing the concept of layers: each modification to the rootfs saves only the incremental content instead of forking an entirely new image.

The idea of layering also comes from Linux: union file systems, whose main function is to mount directories from different locations into one directory. Correspondingly, Docker uses different union file systems in different environments; for example, recent CentOS 7 releases use overlay2, while Ubuntu 16.04 with Docker CE 18.05 uses AuFS.

You can query the storage driver in use with docker info; here it is overlay2.

[root@localhost ~]# docker info
Client:
 Debug Mode: false

Server:
 Containers: 4
  Running: 4
  Paused: 0
  Stopped: 0
 Images: 4
 Server Version: 19.03.8
 Storage Driver: overlay2

Next, let's see how the overlay2 file system is used in Docker.

Overlay2

On Linux hosts, OverlayFS typically works with two directories but presents them as a single one. The two directories are called layers, and the process of combining them is called a union mount. The lower directory is called lowerdir, the upper one upperdir, and the combined view exposed after merging is called merged. This sounds a little abstract, so first look at the overall structure:

As you can see, lowerdir corresponds to the image layers, while upperdir corresponds to the container layer, and merged corresponds to the union-mounted view of the two. Note that when the image layer and the container layer contain the same file, the container layer's copy prevails (the topmost file wins). overlay2 generally supports up to 128 lower layers.
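
You can try OverlayFS by hand, without Docker. A minimal sketch (requires root and a kernel with overlay support; all directory and file names here are invented for the demo):

# Prepare the directories OverlayFS needs: two layers, a work dir and a mount point
$ mkdir lower upper work merged
$ echo "from lower" > lower/a.txt
$ echo "from upper" > upper/a.txt

# Union-mount them: in the merged view, upper covers lower
$ sudo mount -t overlay overlay -o lowerdir=lower,upperdir=upper,workdir=work merged

$ cat merged/a.txt
from upper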

Next, let's see how the container layer and image layers actually appear on disk. On my Linux host, four containers are running.

Docker's default storage location is /var/lib/docker. First, look at its structure:

[root@localhost docker]# ls -l /var/lib/docker
total 16
drwx------.  2 root root   24 Mar  4 03:39 builder
drwx--x--x.  4 root root   92 Mar  4 03:39 buildkit
drwx------.  7 root root 4096 Jun  1 10:36 containers
drwx------.  3 root root   22 Mar  4 03:39 image
drwxr-x---.  3 root root   19 Mar  4 03:39 network
drwx------. 69 root root 8192 Jun  1 15:01 overlay2
drwx------.  4 root root   32 Mar  4 03:39 plugins
drwx------.  2 root root    6 Jun  1 15:00 runtimes
drwx------.  2 root root    6 Mar  4 03:39 swarm
drwx------.  2 root root    6 Jun  1 15:01 tmp
drwx------.  2 root root    6 Mar  4 03:39 trust
drwx------.  3 root root   45 May 18 10:28 volumes

We need to focus on the folders containers, image and overlay2.

  • containers: as the name suggests, running or created containers live in this directory.
  • image: stores the metadata of images.
  • overlay2: stores the actual layer data (the lowerdirs) that make up each image.

As mentioned before, unionfs has several possible implementations, such as overlay2, aufs and devicemapper. Accordingly, the image folder contains one subfolder per storage driver:

image/
└── overlay2
    ├── distribution
    ├── imagedb
    │   ├── content
    │   └── metadata
    ├── layerdb
    │   ├── mounts
    │   ├── sha256
    │   └── tmp
    └── repositories.json

Here, imagedb and layerdb are where the metadata is stored. As we learned earlier, a container's file system consists of image layers plus a container layer, and each image may consist of multiple layers, which means a single layer may be referenced by multiple images. So how are images and layers associated? The answer is in the imagedb files.

Here I take the mysql image as an example:

# View the image id of mysql
[root@localhost docker]# docker image ls
REPOSITORY          TAG                 IMAGE ID            CREATED             SIZE
ctg/mysql           5.7.29              84164b03fa2e        3 months ago        456MB

# Enter the imagedb/content/sha256 directory to find the corresponding image id
[root@localhost docker]# ls -l image/overlay2/imagedb/content/sha256/
...
-rw-------. 1 root root  6995 Apr 27 02:45 84164b03fa2ecb33e8b4c1f2636ec3286e90786819faa4d1c103ae147824196a

# Next, take a look at the contents recorded in it; only the useful part is shown here
cat  image/overlay2/imagedb/content/sha256/84164b03fa2ecb33e8b4c1f2636ec3286e90786819faa4d1c103ae147824196a
{
.........
  "os": "linux",
  "rootfs": {
    "type": "layers",
    "diff_ids": [
      "sha256:f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da",
      "sha256:a9f6b7c7101b86ffaa53dc29638e577dabf5b24150577a59199d8554d7ce2921",
      "sha256:0c615b40cc37ed667e9cbaf33b726fe986d23e5b2588b7acbd9288c92b8716b6",
      "sha256:ad160f341db9317284bba805a3fe9112d868b272041933552df5ea14647ec54a",
      "sha256:1ea6ef84dc3af6506c26753e9e2cf7c0d6c1c743102b85ebd3ee5e357d7e9bc4",
      "sha256:6fce4d95d4af3777f3e3452e5d17612b7396a36bf0cb588ba2ae1b71d139bab9",
      "sha256:6de3946ea0137e75dcc43a3a081d10dda2fec0d065627a03800a99e4abe2ede4",
      "sha256:a35a4bacba4d5402b85ee6e898b95cc71462bc071078941cbe8c77a6ce2fca62",
      "sha256:1ff9500bdff4455fa89a808685622b64790c321da101d27c17b710f7be2e0e7e",
      "sha256:1cf663d0cb7a52a3a33a7c84ff5290b80966921ee8d3cb11592da332b4a9e016",
      "sha256:bcb387cbc5bcbc8b5c33fbfadbce4287522719db43d3e3a286da74492b7d6eca"
    ]
  }
}

You can see that the mysql image is composed of 11 layers, with f2cb… the lowest and bcb3… the highest.
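
The same layer list can also be read without digging through /var/lib/docker, using docker inspect (a sketch; the image name is the one from the listing above):

# Print the diff_ids of the image layers, lowest first
$ docker image inspect --format '{{json .RootFS.Layers}}' ctg/mysql:5.7.29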

Next, let's look at the content of layerdb:

[root@localhost docker]# ls -l  image/overlay2/layerdb/
total 8
drwxr-xr-x.  6 root root 4096 May 13 13:38 mounts
drwxr-xr-x. 39 root root 4096 Apr 27 02:51 sha256
drwxr-xr-x.  2 root root    6 Apr 27 02:51 tmp

# First, look at the contents of sha256 directory
[root@localhost docker]# ls -l  image/overlay2/layerdb/sha256/
total 0
....
drwx------. 2 root root 71 Apr 27 02:45 bbb9cccab59a16cb6da78f8879e9d07a19e3a8d49010ab9c98a2c348fa116c87
drwx------. 2 root root 71 Apr 27 02:45 f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da
....

Notice that only the lowest layer's id appears here unchanged, because the association between layers is stored as chainIDs: in short, each layer's id in this directory is calculated from its parent's id and its own diff_id using the sha256 algorithm.

For example, the lowest layer's id here is f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da, and the diff_id of the layer above it is a9f6b7c7101b86ffaa53dc29638e577dabf5b24150577a59199d8554d7ce2921. The id of that next layer under the sha256 directory is then calculated as follows:

[root@localhost docker]# echo -n "sha256:f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da sha256:a9f6b7c7101b86ffaa53dc29638e577dabf5b24150577a59199d8554d7ce2921" | sha256sum
bbb9cccab59a16cb6da78f8879e9d07a19e3a8d49010ab9c98a2c348fa116c87  -

And indeed we can find an entry bbb9… with exactly that content in the sha256 directory.
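
To state the rule explicitly: chainID(bottom) = diff_id(bottom), and chainID(n) = sha256 of "chainID(n-1) + space + diff_id(n)". Below is a small hypothetical helper script under that assumption (not part of Docker; pass the diff_ids lowest first):

#!/bin/bash
# compute-chain-ids.sh -- hypothetical helper illustrating the chainID rule
# Usage: ./compute-chain-ids.sh sha256:f2cb... sha256:a9f6... [more diff_ids]
chain=""
for diff in "$@"; do
  if [ -z "$chain" ]; then
    chain="$diff"    # bottom layer: chainID equals its diff_id
  else
    sum=$(echo -n "$chain $diff" | sha256sum | awk '{print $1}')
    chain="sha256:$sum"
  fi
  echo "$chain"
done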

OK, now we have associated the image with its layers. But remember, the image directory only stores metadata; the real rootfs data lives elsewhere, under /var/lib/docker/overlay2.

# By querying the cache ID, we get the real rootfs layer
[root@localhost docker]# cat  image/overlay2/layerdb/sha256/f2cb0ecef392f2a630fa1205b874ab2e2aedf96de04d0b8838e4e728e28142da/cache-id
2996b24990e75cbd304093139e665a45d96df8d7e49334527827dcff820dbf16

Enter /var/lib/docker/overlay2 to view:

[root@localhost docker]# ls -l overlay2/
total 4
...
drwx------. 3 root root   47 Apr 27 02:45 2996b24990e75cbd304093139e665a45d96df8d7e49334527827dcff820dbf16
...
drwx------. 2 root root 4096 May 13 13:38 l

So we have found the real rootfs layer as well.

To recap: first, the image id under image/overlay2/imagedb/content/sha256/ gives us all of the image's layer diff_ids; then, starting from the lowest layer, each layer's chainID is calculated via sha256 from its parent's chainID and its own diff_id, which chains all the layers together; finally, each layer's cache-id maps this metadata to the real rootfs layer data under overlay2/.

Finally, let's look at how a rootfs is composed.

Each rootfs is composed of lower dirs (the image layers) and an upper dir (the container layer). The image layers are read-only, the container layer is read-write, and an image can have up to 128 lower layers.

In fact, there is one more layer in the rootfs composition. It is not added when committing or building, so it is not counted as part of the rootfs, but it does exist.

Looking back at the ls -l /var/lib/docker/overlay2/ output earlier, you can see several directories ending in -init, and their number exactly equals the number of containers. This layer sits between the image layers and the container layer. It is a separate layer generated by Docker, dedicated to files such as /etc/hosts and /etc/resolv.conf. It exists because users need to configure container-specific values like hostname and DNS at start-up; these settings are only valid for the current container and would naturally differ in other environments, so they are split out into their own layer, which is ignored when the image is committed.
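
All of these directories for a running container can be seen at once with docker inspect (a sketch; substitute your own container id):

# LowerDir lists the image layers plus the -init layer, UpperDir is the
# container layer, MergedDir is the union view presented to the container
$ docker inspect --format '{{json .GraphDriver.Data}}' <container-id>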

5, Comparison between container and virtual machine technology

The following picture is taken from the official Docker documentation. Based on what we learned above, let's re-analyze the differences between Docker and a traditional VM:

Mobility and performance:

  • Traditional VM: requires hypervisor-based hardware virtualization to emulate CPU, memory and other hardware, with a complete operating system built on top, which naturally costs a lot of performance. Migration is no lighter: a traditional OVA export is a complete operating system.
  • Docker: Docker replaces the hypervisor with its own Docker Engine. A running container is just a special process, so little performance is lost, and the application plus the system files it needs can be packaged into an image that runs normally wherever it can be read; compared with an OVA, the volume is also much smaller. (Matching kernel support on the host is required.)

Generally speaking, an unoptimized KVM guest running CentOS occupies 100-200 MB of memory right after boot. In addition, the guest's calls to the host must be intercepted and processed by the virtualization software, which is another layer of performance loss, especially for computing resources, network, and disk I/O.

Isolation:

  • Traditional VM: since a complete operating system is virtualized, isolation is excellent. Microsoft's Azure platform, for example, virtualizes large numbers of Linux virtual machines on Windows servers.

  • Docker: isolation is considerably weaker, because a container is itself just a process, and all containers share the host's system kernel.
    • This means a Linux container cannot run directly on Windows, nor can a container require a newer kernel than the Linux host provides.
    • Many resources and objects in the Linux kernel cannot be namespaced, time being one example: if a process in a container changes the time through the settimeofday(2) system call, the host's time actually changes.
    • Security: because the host kernel is shared, a container exposes a larger attack surface.

Resource limitations:

  • Traditional VM: very easy to manage and control resource usage, relying on the virtualized operating system.
  • Docker: resource limits in Docker are implemented through Cgroups, which still has imperfections, for example:
    • /proc handling: after entering a container, the top command shows the same information as on the host, not the data of the limited container (this can be corrected with lxcfs).
    • Java programs: if the container's memory is set to 4g but the JVM runs with its default configuration, the JVM reads the host's memory (possibly much more than 4g), so OOM kills can occur; see the sketch after this list.
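
For the JVM case, a common mitigation is to let the JVM size itself from the cgroup limit rather than the host's RAM. A sketch assuming a container-aware JDK (container support is on by default since JDK 10 and was backported to 8u191); the image and jar names are placeholders:

# Cap the heap at 75% of the container's 4g memory limit
$ docker run -m 4g my-java-image \
    java -XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -jar app.jar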

6, Questions answered

1. How is the container isolated?

When a new process is created, isolation is achieved through Namespace technology, such as the PID Namespace, so that the running container sees only its own contents.

For example, when a container runs, Docker by default sets up the PID, UTS, network, mount and IPC Namespaces for it (the user and cgroup Namespaces can also be enabled, but are optional).
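
You can list those namespaces from the host with util-linux's lsns, given the container's main process PID (a sketch; the container id is a placeholder):

# Find the container's main process on the host, then list its namespaces
$ PID=$(docker inspect --format '{{.State.Pid}}' <container-id>)
$ sudo lsns -p "$PID"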

2. How do containers restrict resources?

Through Linux Cgroups: limits on CPU, memory and other resources can be set for each process group, capping the resources the processes can use.

3. Briefly describe the file system of docker?

Docker's file system is called rootfs, and its design comes from Linux unionfs: mounting different directories together to form a single unified view. On this basis, Docker introduces the concept of layers to solve the problem of reusability.

In concrete implementations, rootfs storage is backed by overlay2, overlay, aufs, devicemapper and so on, depending on the Linux kernel and the Docker version. A rootfs (image) is actually a stack of multiple layers; when several layers contain the same file, the file in the upper layer covers the one below.

4. Container startup process?

  1. Enable the specified Linux Namespace configuration
  2. Set the specified Cgroups parameters
  3. Switch the root directory of the process (chroot or pivot_root)

5. The problem of running multiple applications in the container?

First, let's correct a concept. We often say that a container is a single-process application. The "single process" here does not mean that only one process is allowed in the container, but that only one process is controllable. You can run ping, ssh and other processes in a container, but these processes are not controlled by Docker.

The main process in the container, that is, the process with PID 1, is generally specified by ENTRYPOINT or CMD in the Dockerfile. If one container hosts multiple services (processes), the main process may keep running while a child process exits and its service silently dies. Since Docker only supervises the main process, it cannot handle this situation: the container appears to run normally although the service has stopped, which makes orchestration very difficult. Multiple services are also harder to troubleshoot and manage.

So if you really want to run multiple services in one container, you usually manage them with a tool such as systemd or supervisord, or via the --init option. The essence of all these approaches is to give the service processes a common parent process.
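
A minimal sketch of the --init approach (the image and script names are placeholders): Docker injects the tiny tini init as PID 1, which forwards signals and reaps orphaned children.

# tini becomes PID 1 and acts as the common parent of all services
$ docker run --init -d my-image ./start-services.sh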

But considering the container's own design philosophy, where the container and its service are meant to share one life cycle, this goes somewhat against the grain and complicates control (process reaping and life-cycle management).
