Thread creation process


Creating a thread in user mode

Processes and threads are both tasks in the kernel, so aren't they the same thing? But if they were exactly the same, why would the programs we wrote in the previous two sections look so different? And if they are not the same, how does the kernel tell them apart?

In fact, threads are not implemented entirely by the kernel; they are the result of cooperation between kernel mode and user mode. pthread_create is not a system call but a function in the Glibc library, so we have to look for clues in Glibc.

Sure enough, the function is found in nptl/pthread_create.c. The parameters here should look familiar.

int __pthread_create_2_1 (pthread_t *newthread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg)
versioned_symbol (libpthread, __pthread_create_2_1, pthread_create, GLIBC_2_1);
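
To tie the four parameters to something concrete, here is a minimal sketch of a caller. The helper names square and run_in_thread are made up for this illustration; only pthread_create and pthread_join come from the library.

```c
#include <pthread.h>
#include <stdint.h>

/* Thread entry point: receives the `arg` pointer given to pthread_create. */
static void *square(void *arg)
{
    intptr_t n = (intptr_t) arg;
    return (void *) (n * n);   /* handed back to the caller of pthread_join */
}

/* Hypothetical helper: runs square(n) in a new thread and returns its
   result, or -1 if the thread could not be created or joined. */
long run_in_thread(long n)
{
    pthread_t tid;
    void *result = NULL;

    /* attr == NULL asks for the default thread attributes. */
    if (pthread_create(&tid, NULL, square, (void *) (intptr_t) n) != 0)
        return -1;
    if (pthread_join(tid, &result) != 0)
        return -1;
    return (long) (intptr_t) result;
}
```

newthread receives the thread handle, attr is NULL here, start_routine is square, and arg carries the number to square.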

Let's walk through what this function does, step by step.

The first thing it handles is the thread's attribute parameters, such as the thread stack size we set when writing the program. If no thread attributes are passed in, default values are used.

const struct pthread_attr *iattr = (struct pthread_attr *) attr;
struct pthread_attr default_attr;
if (iattr == NULL)
  iattr = &default_attr;
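
From the caller's side, the non-NULL case looks roughly like the sketch below. The helper spawn_with_stack_size and the thread body noop are hypothetical names; the pthread_attr_* calls are the standard POSIX ones.

```c
#include <pthread.h>
#include <stddef.h>

/* Thread body: nothing to do, we only care about how it was created. */
static void *noop(void *arg)
{
    return arg;
}

/* Hypothetical helper: creates a thread with the requested stack size and
   returns the size recorded in the attribute object, or 0 on failure. */
size_t spawn_with_stack_size(size_t requested)
{
    pthread_attr_t attr;
    pthread_t tid;
    size_t recorded = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    if (pthread_attr_setstacksize(&attr, requested) == 0
        && pthread_attr_getstacksize(&attr, &recorded) == 0
        && pthread_create(&tid, &attr, noop, NULL) == 0)
        pthread_join(tid, NULL);
    else
        recorded = 0;
    pthread_attr_destroy(&attr);
    return recorded;
}
```

Passing NULL instead of &attr would take the default path shown in the excerpt above.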

Next, just as every process or thread in the kernel has a task_struct, user mode also keeps a structure for each thread: the pthread structure.

 struct pthread *pd = NULL;

Every function call uses the stack, and each thread has its own stack, so the next step is to create the thread stack.

 int err = ALLOCATE_STACK (iattr, &pd);

ALLOCATE_STACK is a macro; looking up its definition shows that it simply calls the function allocate_stack. That function is fairly complicated, so only the main code is listed here.

# define ALLOCATE_STACK(attr, pd) allocate_stack (attr, pd, &stackaddr)
static int
allocate_stack (const struct pthread_attr *attr, struct pthread **pdp,
                ALLOCATE_STACK_PARMS)   /* here: void **stack */
{
  struct pthread *pd;
  size_t size;
  size_t pagesize_m1 = __getpagesize () - 1;
  /* ... */
  size = attr->stacksize;
  /* ... */
  /* Allocate some anonymous memory.  If possible use the cache.  */
  size_t guardsize;
  void *mem;
  const int prot = (PROT_READ | PROT_WRITE
                   | ((GL(dl_stack_flags) & PF_X) ? PROT_EXEC : 0));
  /* Adjust the stack size for alignment.  */
  size &= ~__static_tls_align_m1;
  /* Make sure the size of the stack is enough for the guard and
  eventually the thread descriptor.  */
  guardsize = (attr->guardsize + pagesize_m1) & ~pagesize_m1;
  size += guardsize;
  pd = get_cached_stack (&size, &mem);
  if (pd == NULL)
  {
    /* If a guard page is required, avoid committing memory by first
    allocate with PROT_NONE and then reserve with required permission
    excluding the guard page.  */
    mem = __mmap (NULL, size, (guardsize == 0) ? prot : PROT_NONE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    /* Place the thread descriptor at the end of the stack.  */
#if TLS_TCB_AT_TP
    pd = (struct pthread *) ((char *) mem + size) - 1;
#elif TLS_DTV_AT_TP
    pd = (struct pthread *) ((((uintptr_t) mem + size - __static_tls_size) & ~__static_tls_align_m1) - TLS_PRE_TCB_SIZE);
#endif
    /* Now mprotect the required region excluding the guard area. */
    char *guard = guard_position (mem, size, guardsize, pd, pagesize_m1);
    setup_stack_prot (mem, size, guard, guardsize, prot);
    pd->stackblock = mem;
    pd->stackblock_size = size;
    pd->guardsize = guardsize;
    pd->specific[0] = pd->specific_1stblock;
    /* And add to the list of stacks in use.  */
    stack_list_add (&pd->list, &stack_used);
  }
  *pdp = pd;
  void *stacktop;
# if TLS_TCB_AT_TP
  /* The stack begins before the TCB and the static TLS block.  */
  stacktop = ((char *) (pd + 1) - __static_tls_size);
# elif TLS_DTV_AT_TP
  stacktop = (char *) (pd - 1);
# endif
  *stack = stacktop;
  /* ... */
}

allocate_stack mainly does the following:

  • If a stack size was set in the thread attributes, that value is read out;

  • To catch accesses that run past the end of the stack, a region of guardsize bytes is reserved at the end of the stack. Any access to this region triggers a fault;

  • In fact, the thread stack is allocated dynamically inside the process's own address space, much like heap memory. If a process keeps creating and destroying threads, we don't want to keep requesting and releasing the memory used for thread stacks, so a cache is needed. get_cached_stack checks, based on the computed size, whether the existing cache already holds a stack that fits;

  • If the cache has nothing suitable, __mmap is called to create a new block of memory. As discussed in the system call section, when malloc needs a large block of memory it uses mmap;

  • The thread stack also grows from high addresses to low addresses. Remember that every thread needs a pthread structure? It is placed in the stack space too, at the bottom of the stack, which is actually the highest address;

  • The location of the guard memory is computed, and setup_stack_prot is called to mark that memory as protected;

  • Next, the members of the pthread structure are filled in: stackblock, stackblock_size, guardsize, and specific. specific stores thread-specific data, that is, "global" variables that belong to a single thread;

  • The thread stack is added to the stack_used list. Two lists are actually used to manage thread stacks: stack_used holds the stacks currently in use, and stack_cache is the cache mentioned above. When a thread ends, its stack is cached rather than released, so that it can be reused when another thread is created.

That takes care of the user-mode stack, which is roughly half of the user-mode work.

Creating the task in the kernel

Next, let's go back to pthread_create. Now that the user-mode stack exists, the next problem is where the user-mode program should start running.

pd->start_routine = start_routine;
pd->arg = arg;
pd->schedpolicy = self->schedpolicy;
pd->schedparam = self->schedparam;
/* Pass the descriptor to the caller.  */
*newthread = (pthread_t) pd;
atomic_increment (&__nptl_nthreads);
retval = create_thread (pd, iattr, &stopped_start, STACK_VARIABLES_ARGS, &thread_ran);

start_routine is the function we gave the thread to run. start_routine, its argument arg, and the scheduling policy are all stored in the pthread structure.

Next, __nptl_nthreads is incremented to record that there is one more thread.

What actually creates the thread is the call to create_thread, which is defined as follows:

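static int create_thread (struct pthread *pd, const struct pthread_attr *attr,
bool *stopped_start, STACK_VARIABLES_PARMS, bool *thread_ran)
{
  const int clone_flags = (CLONE_VM | CLONE_FS | CLONE_FILES | CLONE_SYSVSEM | CLONE_SIGHAND | CLONE_THREAD | CLONE_SETTLS | CLONE_PARENT_SETTID | CLONE_CHILD_CLEARTID | 0);
  ARCH_CLONE (&start_thread, STACK_VARIABLES_ARGS, clone_flags, pd, &pd->tid, tp, &pd->tid);
  /* It's started now, so if we fail below, we'll have to cancel it
     and let it clean itself up.  */
  *thread_ran = true;
  /* ... */
}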

Note the long clone_flags here. We haven't paid attention to these flags before, but they matter a great deal in what follows.

ARCH_CLONE in turn calls __clone. At this point you should sense that a system call is close.

 # define ARCH_CLONE __clone
/* The userland implementation is:
   int clone (int (*fn)(void *arg), void *child_stack, int flags, void *arg),
   the kernel entry is:
   int clone (long flags, void *child_stack).
   The parameters are passed in register and on the stack from userland:
   rdi: fn
   rsi: child_stack
   rdx: flags
   rcx: arg
   r8d: TID field in parent
   r9d: thread pointer
%esp+8: TID field in child
   The kernel expects:
   rax: system call number
   rdi: flags
   rsi: child_stack
   rdx: TID field in parent
   r10: TID field in child
   r8:  thread pointer  */
ENTRY (__clone)
        movq    $-EINVAL,%rax
        /* Insert the argument onto the new stack.  */
        subq    $16,%rsi
        movq    %rcx,8(%rsi)
        /* Save the function pointer.  It will be popped off in the
           child in the ebx frobbing below.  */
        movq    %rdi,0(%rsi)
        /* Do the system call.  */
        movq    %rdx, %rdi
        movq    %r8, %rdx
        movq    %r9, %r8
        mov     8(%rsp), %R10_LP
        movl    $SYS_ify(clone),%eax
        /* ... */
        syscall
        /* ... */
PSEUDO_END (__clone)

If you are not familiar with assembly, it doesn't matter. You can focus on the comments above.

We can see that syscall is finally invoked, just as in the other system calls we are already familiar with. For clone, however, there are a few differences.

When any other system call is made from the main thread of a process, the current user-mode stack is the stack of the whole process, the stack pointer points into that stack, and the instruction pointer points into the main thread's code. At the moment clone is called, the same is true: the user-mode stack, stack pointer, and instruction pointer all belong to the main thread.

For the new thread, though, these have to change. Once the clone system call succeeds, the kernel has a task_struct for this thread, and when the system call returns to user mode in the new thread, the user-mode stack should be the thread's own stack, the stack pointer should point into it, and the instruction pointer should point to the function the thread is meant to run.

So we have to arrange all of this ourselves: the address of the function the thread is to execute and its argument are pushed onto the new stack. When the child returns from the kernel, it pops them off the stack and starts executing that function with that argument.
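
The push-then-pop trick can be mimicked in plain C. The sketch below only simulates the stack layout in an ordinary array; it makes no system call, the function names are hypothetical, and it assumes pointer-sized stack slots as on x86-64.

```c
#include <stdint.h>

typedef int (*thread_fn)(void *);

/* Mirrors `subq $16,%rsi; movq %rcx,8(%rsi); movq %rdi,0(%rsi)`:
   reserve two slots below the stack top and store fn and arg there. */
void *push_entry(void *stack_top, thread_fn fn, void *arg)
{
    uintptr_t *sp = (uintptr_t *) stack_top - 2;
    sp[0] = (uintptr_t) fn;    /* popped first in the child */
    sp[1] = (uintptr_t) arg;   /* popped second, becomes the argument */
    return sp;
}

/* What the child does once it is running on the new stack:
   pop the function, pop the argument, call. */
int pop_and_call(void *child_sp)
{
    uintptr_t *sp = child_sp;
    thread_fn fn = (thread_fn) *sp++;
    void *arg = (void *) *sp++;
    return fn(arg);
}

/* Example entry point used only for this sketch. */
int add_one(void *arg)
{
    return *(int *) arg + 1;
}
```

In the real __clone the value returned by push_entry is what gets passed to the kernel as child_stack, so the new task wakes up with its stack pointer already aimed at these two slots.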

Next, we're going to enter the kernel. The definition of clone system call in the kernel is as follows:

SYSCALL_DEFINE5(clone, unsigned long, clone_flags, unsigned long, newsp,
		 int __user *, parent_tidptr,
		 int __user *, child_tidptr,
		 unsigned long, tls)
{
	return _do_fork(clone_flags, newsp, 0, parent_tidptr, child_tidptr, tls);
}

Here we meet a familiar face, _do_fork. Does that make things easier? We followed its logic in the previous section, so here we focus on the differences.

The first is the effect of the complicated flags set above. Let's see what they change.

For copy_files, dup_fd used to be called to copy a files_struct. Now, because the CLONE_FILES flag is set, the reference count of the original files_struct is simply incremented.

static int copy_files(unsigned long clone_flags, struct task_struct *tsk)
{
	struct files_struct *oldf, *newf;
	int error = 0;
	oldf = current->files;
	if (clone_flags & CLONE_FILES) {
		atomic_inc(&oldf->count);
		goto out;
	}
	newf = dup_fd(oldf, &error);
	tsk->files = newf;
out:
	return error;
}
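
The difference is visible from user space. In the sketch below (all function names are hypothetical), a descriptor is closed once in a forked child and once in a thread. The child works on its own copied table, while the thread works on the shared one.

```c
#include <fcntl.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Thread body: closes the descriptor whose number is passed in. */
static void *close_in_thread(void *arg)
{
    close(*(int *) arg);
    return NULL;
}

/* Returns 1 if fd is open in the calling process, 0 otherwise. */
static int fd_is_open(int fd)
{
    return fcntl(fd, F_GETFD) != -1;
}

/* A forked child closes the read end of a pipe. Returns 1 if the parent's
   descriptor is still open afterwards, 0 if not, -1 on error. */
int parent_fd_survives_child_close(void)
{
    int fds[2];
    pid_t pid;
    int still_open;

    if (pipe(fds) != 0)
        return -1;
    pid = fork();
    if (pid == 0) {
        close(fds[0]);   /* only affects the child's copied table */
        _exit(0);
    }
    if (pid > 0)
        waitpid(pid, NULL, 0);
    still_open = (pid > 0) ? fd_is_open(fds[0]) : -1;
    close(fds[0]);
    close(fds[1]);
    return still_open;
}

/* Same experiment with a thread. Returns 1 if the descriptor is still
   open after the thread closed it, 0 if not, -1 on error. */
int fd_survives_thread_close(void)
{
    int fds[2];
    pthread_t tid;
    int still_open = -1;

    if (pipe(fds) != 0)
        return -1;
    if (pthread_create(&tid, NULL, close_in_thread, &fds[0]) == 0) {
        pthread_join(tid, NULL);
        still_open = fd_is_open(fds[0]);
    }
    if (still_open != 0)
        close(fds[0]);
    close(fds[1]);
    return still_open;
}
```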

For copy_fs, copy_fs_struct used to be called to copy an fs_struct. Now, because the CLONE_FS flag is set, the user count of the original fs_struct is simply incremented.

For copy_sighand, a new sighand_struct used to be created. Now, because the CLONE_SIGHAND flag is set, the reference count of the original sighand_struct is simply incremented.

static int copy_sighand(unsigned long clone_flags, struct task_struct *tsk)
{
	struct sighand_struct *sig;
	if (clone_flags & CLONE_SIGHAND) {
		atomic_inc(&current->sighand->count);
		return 0;
	}
	sig = kmem_cache_alloc(sighand_cachep, GFP_KERNEL);
	/* ... */
	atomic_set(&sig->count, 1);
	memcpy(sig->action, current->sighand->action, sizeof(sig->action));
	return 0;
}

For copy_signal, a new signal_struct used to be created. Now, because CLONE_THREAD is set, the function returns immediately.

static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
	struct signal_struct *sig;
	if (clone_flags & CLONE_THREAD)
		return 0;
	sig = kmem_cache_zalloc(signal_cachep, GFP_KERNEL);
	tsk->signal = sig;
	/* ... */
	return 0;
}

For copy_mm, dup_mm used to be called to copy an mm_struct. Now, because the CLONE_VM flag is set, the task points directly at the original mm_struct.

static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
	struct mm_struct *mm, *oldmm;
	oldmm = current->mm;
	if (clone_flags & CLONE_VM) {
		/* ... */
		mm = oldmm;
		goto good_mm;
	}
	mm = dup_mm(tsk);
good_mm:
	tsk->mm = mm;
	tsk->active_mm = mm;
	return 0;
}
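
Sharing versus copying the address space is easy to observe from user space. The following sketch uses hypothetical function names: it increments a global variable once in a forked child and once in a thread, then reports what the caller sees.

```c
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static int shared_counter;   /* one copy per address space */

/* Thread body: increments the global in the shared address space. */
static void *bump(void *arg)
{
    (void) arg;
    shared_counter += 1;
    return NULL;
}

/* Returns the caller's view of the counter after a forked child
   incremented its own copy, or -1 on error. */
int counter_after_fork(void)
{
    pid_t pid;

    shared_counter = 0;
    pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        shared_counter += 1;   /* touches the child's private copy */
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return shared_counter;
}

/* Returns the caller's view of the counter after a thread incremented it,
   or -1 on error. */
int counter_after_thread(void)
{
    pthread_t tid;

    shared_counter = 0;
    if (pthread_create(&tid, NULL, bump, NULL) != 0)
        return -1;
    pthread_join(tid, NULL);
    return shared_counter;
}
```

The forked child's increment is lost to the parent, while the thread's increment is visible, which is what pointing at the original mm_struct implies.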

The second difference concerns family relationships. After all, we need to be able to tell whether several threads belong to the same process.

p->pid = pid_nr(pid);
if (clone_flags & CLONE_THREAD) {
	p->exit_signal = -1;
	p->group_leader = current->group_leader;
	p->tgid = current->tgid;
} else {
	if (clone_flags & CLONE_PARENT)
		p->exit_signal = current->group_leader->exit_signal;
	else
		p->exit_signal = (clone_flags & CSIGNAL);
	p->group_leader = p;
	p->tgid = p->pid;
}
/* ... */
/* CLONE_PARENT re-uses the old parent */
if (clone_flags & (CLONE_PARENT|CLONE_THREAD)) {
	p->real_parent = current->real_parent;
	p->parent_exec_id = current->parent_exec_id;
} else {
	p->real_parent = current;
	p->parent_exec_id = current->self_exec_id;
}

As the code above shows, once the CLONE_THREAD flag is set, the family relationships change.

  • If a new process is being created, its group_leader is itself and its tgid is its own pid: it starts afresh as the leader of a new thread group. If a new thread is being created, its group_leader is the group_leader of the current task and its tgid is the tgid of the current task, that is, the pid of the process's main thread. The original process remains the boss.

  • If a new process is being created, its real_parent is the current process, which adds a generation to the process tree. If a new thread is being created, its real_parent is the real_parent of the current task, so the two are in fact siblings.
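
User space can see the tgid and the per-task pid directly: getpid() reports the tgid, and the gettid system call reports the task's own pid. The sketch below records both in the caller and in a new thread. The struct and function names are hypothetical, and the raw syscall is used so the code does not depend on a gettid() wrapper being available.

```c
#include <pthread.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

struct ids {
    pid_t pid;   /* getpid(): the kernel's tgid */
    pid_t tid;   /* gettid: the kernel's per-task pid */
};

/* Raw system call for the calling task's own id. */
static pid_t current_tid(void)
{
    return (pid_t) syscall(SYS_gettid);
}

/* Stores the ids of whichever task runs this function. */
static void *record_ids(void *arg)
{
    struct ids *out = arg;
    out->pid = getpid();
    out->tid = current_tid();
    return NULL;
}

/* Fills `caller_ids` for the calling thread and `thread_ids` for a new
   thread. Returns 0 on success. */
int compare_ids(struct ids *caller_ids, struct ids *thread_ids)
{
    pthread_t t;

    record_ids(caller_ids);
    if (pthread_create(&t, NULL, record_ids, thread_ids) != 0)
        return -1;
    return pthread_join(t, NULL);
}
```

Both tasks report the same pid but different tids. When the caller is the main thread, its tid equals its pid, since it is the thread group leader.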

Third, signal handling. How do we make sure that a signal sent to a process can be handled by any one thread while its effect covers the whole process? For example, kill on a process should kill all of its threads. Conversely, if pthread_kill sends a signal to one thread, only that thread should receive it.

In the main flow of copy_process, whether a process or a thread is being created, struct sigpending pending is initialized; every task_struct has this member. It is a list of signals. If the task_struct is a thread, the signals in it were sent to that thread; if the task_struct is a process, the signals in it were sent to the main thread.


In addition, as we saw in copy_signal above, creating a process initializes struct sigpending shared_pending inside signal_struct. Creating a thread does not: the thread shares the existing signal_struct. In other words, all threads of a process share one shared_pending, which is also a list of signals, but these are signals sent to the process as a whole, and any thread may handle them.


At this point, the work of clone inside the kernel is done, and the system call returns to user mode.

Running the thread in user mode

Following the first parameter of __clone, when execution returns to user mode it does not run the user-specified function directly. Instead it runs a generic start_thread, the common entry point of all threads in user mode.

static int __attribute__ ((noreturn)) start_thread (void *arg)
{
    struct pthread *pd = START_THREAD_SELF;
    /* ... */
    /* Run the code the user provided.  */
    THREAD_SETMEM (pd, result, pd->start_routine (pd->arg));
    /* Call destructors for the thread_local TLS variables.  */
    /* ... */
    /* Run the destructor for the thread-local data.  */
    __nptl_deallocate_tsd ();
    if (__glibc_unlikely (atomic_decrement_and_test (&__nptl_nthreads)))
        /* This was the last thread.  */
        exit (0);
    /* ... */
    __free_tcb (pd);
    __exit_thread ();
}

The start_thread entry function is where the user-supplied function is actually called. After the user's function finishes, thread-related data is released, for example thread-local data such as thread_local variables, and the thread count is decremented. If this was the last thread, the whole process exits. In addition, __free_tcb is used to release the pthread structure.

void __free_tcb (struct pthread *pd)
{
  /* ... */
  __deallocate_stack (pd);
}
void __deallocate_stack (struct pthread *pd)
{
  /* Remove the thread from the list of threads with user defined
     stacks.  */
  stack_list_del (&pd->list);
  /* Not much to do.  Just free the mmap()ed memory.  Note that we do
     not reset the 'used' flag in the 'tid' field.  This is done by
     the kernel.  If no thread has been created yet this field is
     still zero.  */
  if (__glibc_likely (! pd->user_stack))
    (void) queue_stack (pd);
  /* ... */
}


__free_tcb calls __deallocate_stack to release the whole thread stack: the stack is removed from stack_used, the list of stacks currently in use, and put into stack_cache, the list of cached stacks.
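
The movement between the two lists can be sketched with ordinary singly linked lists. This is a toy model rather than the Glibc implementation: struct stack_node, acquire_stack and release_stack are hypothetical names, malloc stands in for mmap, and there is no locking, so it is not thread-safe as written.

```c
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for the stack bookkeeping kept in struct pthread. */
struct stack_node {
    struct stack_node *next;
    void *block;
    size_t size;
};

static struct stack_node *stack_used;    /* stacks owned by live threads */
static struct stack_node *stack_cache;   /* stacks kept around for reuse */

/* Unlinks `n` from the list starting at *head, if it is there. */
static void list_remove(struct stack_node **head, struct stack_node *n)
{
    for (; *head != NULL; head = &(*head)->next)
        if (*head == n) {
            *head = n->next;
            return;
        }
}

/* Thread creation: reuse a cached stack that is large enough,
   otherwise allocate a new one. */
struct stack_node *acquire_stack(size_t size)
{
    struct stack_node *n;

    for (n = stack_cache; n != NULL; n = n->next)
        if (n->size >= size)
            break;
    if (n != NULL) {
        list_remove(&stack_cache, n);
    } else {
        n = malloc(sizeof *n);
        if (n == NULL)
            return NULL;
        n->block = malloc(size);   /* the real code maps memory instead */
        if (n->block == NULL) {
            free(n);
            return NULL;
        }
        n->size = size;
    }
    n->next = stack_used;
    stack_used = n;
    return n;
}

/* Thread exit: move the stack from the used list to the cache. */
void release_stack(struct stack_node *n)
{
    list_remove(&stack_used, n);
    n->next = stack_cache;
    stack_cache = n;
}
```

A released stack is handed out again on the next request it can satisfy, which is what saves the repeated mapping and unmapping.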

Well, the whole thread life cycle is over here.

Summary

That completes our walk through thread creation. I drew a diagram to summarize it; the figure compares creating a process with creating a thread, in both user mode and kernel mode.

When a process is created, the system call is fork. In copy_process, the five structures files_struct, fs_struct, sighand_struct, signal_struct, and mm_struct are all copied, so that the parent and child each use their own data structures. When a thread is created, the system call is clone. In copy_process, the five structures are shared rather than copied, in most cases just by incrementing a reference count, that is, the thread shares the process's data structures.

Do you know how to view the threads of a process and how much of each thread stack is in use? Look up the relevant commands and APIs and try them out.

Each process has a directory under /proc, and inside it the task subdirectory holds a directory for each of the process's threads. Entering a thread's directory shows the details of that thread; this is how threads are inspected through the proc file system on the command line. The pthread library should also provide an API for obtaining information about each thread, but I have not checked this yet.
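
As a starting point for the /proc approach, a program can list its own threads by reading /proc/self/task. This is a small sketch that assumes the proc file system is mounted; count_threads is a hypothetical helper name.

```c
#include <dirent.h>
#include <stddef.h>

/* Counts the entries of /proc/self/task, one per thread of the calling
   process. Returns -1 if the directory cannot be opened. */
int count_threads(void)
{
    DIR *dir = opendir("/proc/self/task");
    struct dirent *entry;
    int count = 0;

    if (dir == NULL)
        return -1;
    while ((entry = readdir(dir)) != NULL)
        if (entry->d_name[0] != '.')   /* skip "." and ".." */
            count++;
    closedir(dir);
    return count;
}
```

Each directory name found this way is a thread id, and the files inside it describe that one thread.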


Tags: Linux thread

Posted on Thu, 23 Sep 2021 07:16:18 -0400 by josephman1988