Exploration of Golang Source Code (3): Implementation Principle of GC


Golang has used a tricolor GC since version 1.5. After several rounds of improvements, GC pause times in the current 1.9 version can be very short.
Shorter pause times mean a shorter "maximum response time", which makes go more suitable for writing network service programs.
This article explains the implementation principle of the tricolor GC in go by analyzing the golang source code.

The Golang source code analyzed in this series is version 1.9.2 of Google's official implementation; the analysis does not apply to other versions or to other implementations such as gccgo.
The operating environment is Ubuntu 16.04 LTS 64bit.
I will first explain the basic concepts, then the allocator, and finally the implementation of the collector.

Basic concepts

Memory structure

go reserves a contiguous range of virtual memory addresses when the program starts. The structure is as follows:

This block of memory is divided into three regions, whose sizes on x64 are 512MB, 16GB and 512GB. Their functions are as follows:

arena

The arena region is what we usually call the heap; the memory that go allocates from the heap is located here.

bitmap

The bitmap region is used to indicate which addresses in the arena region hold objects, and which addresses within those objects contain pointers.
One byte (8 bits) in the bitmap region corresponds to four pointer-sized words in the arena region, that is, 2 bits correspond to one pointer-sized word.
So the size of the bitmap region is 512GB / pointer size (8 bytes) / 4 = 16GB.

One byte in the bitmap region corresponds to four pointer-sized words in the arena region, as shown below.
Each pointer-sized word has two bits, indicating whether scanning should continue past it and whether it contains a pointer:

The mapping between bitmap bytes and arena addresses starts from the end of the bitmap (the boundary with the arena), so as memory is allocated, the arena usage and its bitmap grow away from each other in opposite directions:
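To make the mapping concrete, here is a simplified sketch of how an arena address is translated to its bitmap byte, loosely modeled on the runtime's heapBitsForAddr in mbitmap.go; the constant and function names are illustrative and the exact bit layout inside the byte is omitted:

// Simplified sketch (not the actual runtime code): locate the bitmap byte
// that describes a pointer-sized word in the arena.
const ptrSize = 8 // pointer size on x64

// bitmapByteFor assumes arenaStart is the base of the arena region and
// bitmapEnd is the end (highest address) of the bitmap region, which is
// where the mapping starts, growing downwards.
func bitmapByteFor(addr, arenaStart, bitmapEnd uintptr) (byteAddr uintptr, slot uint) {
	wordIndex := (addr - arenaStart) / ptrSize // which pointer-sized word
	byteAddr = bitmapEnd - wordIndex/4 - 1     // four words share one bitmap byte
	slot = uint(wordIndex % 4)                 // which of the four words within that byte
	return
}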

spans

The spans region is used to indicate which span a page in the arena region belongs to (what a span is will be described below).
One pointer (8 bytes) in the spans region corresponds to one page (8KB in go) in the arena region.
So the size of the spans region is 512GB / page size (8KB) * pointer size (8 bytes) = 512MB.

A pointer in the spans region corresponds to a page in the arena region; unlike the bitmap, this mapping starts from the beginning.
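Because the mapping is one pointer per page, looking up the span that owns an address reduces to a single index calculation; the following hedged sketch mirrors the idea behind the runtime's spanOf, with illustrative names:

// Simplified sketch (not the actual runtime code): the spans region keeps
// one pointer per 8KB arena page, so the owning span of an address is
// found by indexing with the page number.
const pageShift = 13 // 8KB pages

// spanIndexFor assumes arenaStart is the base address of the arena region;
// the runtime would then use this index into its spans array.
func spanIndexFor(arenaStart, addr uintptr) uintptr {
	return (addr - arenaStart) >> pageShift
}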

When objects are allocated from the heap

As mentioned in many articles and books on go, go automatically determines which objects should be placed on the stack and which on the heap.
Simply put, if the content of an object may still be accessed after the function that created it returns, the object is allocated on the heap.
Typical cases where an object is allocated on the heap include:

  • Returning a pointer to the object
  • Passing the object's pointer to another function
  • The object is used in a closure and needs to be modified
  • Using new

In C, it is very dangerous for a function to return a pointer to an object on its stack, but it is safe in go because such objects are automatically allocated on the heap.
The process by which go decides whether to allocate an object on the heap is called escape analysis.
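For example, in the hypothetical program below the local variable escapes because its address is stored somewhere that outlives the function, so it is moved to the heap; compiling with go build -gcflags "-m" asks the compiler to report such decisions (the exact wording of the report may vary by version):

package main

// sink is a package-level variable; storing a pointer here forces the
// pointed-to object to outlive the function that created it.
var sink *User

type User struct {
	Name string
}

// newUser returns the address of a local variable; escape analysis
// therefore allocates u on the heap instead of the stack.
func newUser(name string) *User {
	u := User{Name: name}
	return &u
}

func main() {
	sink = newUser("bob")
	println(sink.Name)
}

Building this with go build -gcflags "-m" should print a line similar to "moved to heap: u", confirming the heap allocation.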

GC Bitmap

The GC needs to know which memory locations contain pointers when marking. For example, the bitmap region mentioned above covers the pointer information for the arena region.
In addition, the GC needs to know where pointers are located in stack space.
Because stack space does not belong to the arena region, the pointer information for stacks is kept in the function information.
The GC also needs to set up the bitmap region according to the type of the object being allocated; the source of that pointer information is the type information.

In summary, go has the following GC bitmaps:

  • bitmap region: covers the arena region, using 2 bits per pointer-sized word
  • Function information: covers a function's stack space, using 1 bit per pointer-sized word (located in stackmap.bytedata)
  • Type information: copied into the bitmap region when an object is allocated, using 1 bit per pointer-sized word (located in _type.gcdata)

Span

A span is a block used for allocating objects. The following figure briefly illustrates the internal structure of a span:

Usually a span contains several elements of the same size, and one element holds one object, except that:

  • A span used to hold a large object has only one element
  • A span used to hold tiny objects without pointers (tiny allocation) can hold multiple objects in one element

A span has a freeindex marker indicating where to start searching for a free address the next time an object is allocated; after an allocation, freeindex is advanced.
Elements before freeindex are all allocated; elements after freeindex may or may not be allocated.

A span may have some of its elements reclaimed after each GC. allocBits is used to mark which elements have been allocated and which have not.
Using freeindex together with allocBits, allocation can skip the already allocated elements and place objects in the unallocated ones.
But because reading allocBits on every allocation is slow, the span also has an integer allocCache that caches the portion of the bitmap starting at freeindex, with the cached bit values inverted relative to the originals.

gcmarkBits is used to mark which objects survive the GC; after each GC, gcmarkBits becomes the new allocBits.
Note that the memory for the span structure itself is allocated from the system; the spans and bitmap regions described above are only indexes.
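To summarize the fields discussed so far, here is a trimmed-down sketch of the span descriptor; the real mspan struct in runtime/mheap.go has many more fields, and the comments below only restate the roles described above:

// Trimmed-down sketch of the runtime's span descriptor (illustrative,
// not the full definition in runtime/mheap.go).
type gcBits uint8 // stand-in for the runtime's bitmap storage type

type mspan struct {
	startAddr uintptr // base address of the span inside the arena
	npages    uintptr // number of 8KB pages the span occupies
	elemsize  uintptr // size of each element in the span

	freeindex uintptr // start searching for a free element from here
	nelems    uintptr // total number of elements in the span

	allocCache uint64  // inverted cache of allocBits starting at freeindex
	allocBits  *gcBits // which elements are currently allocated
	gcmarkBits *gcBits // which elements were marked in the current GC
	allocCount uint16  // number of allocated elements
}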

Types of spans

Spans are divided into 67 types (size classes) according to the size of their elements, as follows:

// class bytes/obj bytes/span objects tail waste max waste
// 1 8 8192 1024 0 87.50%
// 2 16 8192 512 0 43.75%
// 3 32 8192 256 0 46.88%
// 4 48 8192 170 32 31.52%
// 5 64 8192 128 0 23.44%
// 6 80 8192 102 32 19.07%
// 7 96 8192 85 32 15.95%
// 8 112 8192 73 16 13.56%
// 9 128 8192 64 0 11.72%
// 10 144 8192 56 128 11.82%
// 11 160 8192 51 32 9.73%
// 12 176 8192 46 96 9.59%
// 13 192 8192 42 128 9.25%
// 14 208 8192 39 80 8.12%
// 15 224 8192 36 128 8.15%
// 16 240 8192 34 32 6.62%
// 17 256 8192 32 0 5.86%
// 18 288 8192 28 128 12.16%
// 19 320 8192 25 192 11.80%
// 20 352 8192 23 96 9.88%
// 21 384 8192 21 128 9.51%
// 22 416 8192 19 288 10.71%
// 23 448 8192 18 128 8.37%
// 24 480 8192 17 32 6.82%
// 25 512 8192 16 0 6.05%
// 26 576 8192 14 128 12.33%
// 27 640 8192 12 512 15.48%
// 28 704 8192 11 448 13.93%
// 29 768 8192 10 512 13.94%
// 30 896 8192 9 128 15.52%
// 31 1024 8192 8 0 12.40%
// 32 1152 8192 7 128 12.41%
// 33 1280 8192 6 512 15.55%
// 34 1408 16384 11 896 14.00%
// 35 1536 8192 5 512 14.00%
// 36 1792 16384 9 256 15.57%
// 37 2048 8192 4 0 12.45%
// 38 2304 16384 7 256 12.46%
// 39 2688 8192 3 128 15.59%
// 40 3072 24576 8 0 12.47%
// 41 3200 16384 5 384 6.22%
// 42 3456 24576 7 384 8.83%
// 43 4096 8192 2 0 15.60%
// 44 4864 24576 5 256 16.65%
// 45 5376 16384 3 256 10.92%
// 46 6144 24576 4 0 12.48%
// 47 6528 32768 5 128 6.23%
// 48 6784 40960 6 256 4.36%
// 49 6912 49152 7 768 3.37%
// 50 8192 8192 1 0 15.61%
// 51 9472 57344 6 512 14.28%
// 52 9728 49152 5 512 3.64%
// 53 10240 40960 4 0 4.99%
// 54 10880 32768 3 128 6.24%
// 55 12288 24576 2 0 11.45%
// 56 13568 40960 3 256 9.99%
// 57 14336 57344 4 0 5.35%
// 58 16384 16384 1 0 12.49%
// 59 18432 73728 4 0 11.11%
// 60 19072 57344 3 128 3.57%
// 61 20480 40960 2 0 6.87%
// 62 21760 65536 3 256 6.25%
// 63 24576 24576 1 0 11.45%
// 64 27264 81920 3 128 10.00%
// 65 28672 57344 2 0 4.91%
// 66 32768 32768 1 0 12.50%

Take a span of type (class) 1 as an example:
the size of each element in the span is 8 bytes, the span itself occupies one page (8K), and it can hold 1024 objects.

When allocating an object, the span type is chosen according to the size of the object.
For example, a 16 byte object will use a span of class 2, a 17 byte object will use a span of class 3, and a 32 byte object will also use a span of class 3.
As this example shows, 17 byte and 32 byte objects both use class 3 spans, which means objects of some sizes waste a certain amount of space when allocated.
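A hedged sketch of this rounding follows; the table below is a tiny invented stand-in for the runtime's size_to_class8/class_to_size lookup, covering only the first few classes mentioned in the text:

// Illustrative only: a tiny stand-in for the runtime's size class tables.
var classToSize = []uintptr{0, 8, 16, 32, 48, 64}

// sizeToClass returns the smallest class whose element size fits the object,
// mimicking the lookup mallocgc performs via its size class tables.
func sizeToClass(size uintptr) (class uint8, rounded uintptr) {
	for c := 1; c < len(classToSize); c++ {
		if size <= classToSize[c] {
			return uint8(c), classToSize[c]
		}
	}
	return 0, 0 // larger sizes use higher classes or, above 32K, class 0
}

For example, sizeToClass(17) returns class 3 with a rounded size of 32, so a 17 byte object occupies a 32 byte element and wastes 15 bytes.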

One might notice that the largest span class above has an element size of 32K, so where are objects larger than 32K allocated?
Objects larger than 32K are called "large objects"; when a large object is allocated, a special span is allocated for it directly from the heap.
This special span has type (class) 0 and contains only one large object; the size of the span is determined by the size of the object.

This special span plus the 66 standard classes make up the 67 span types.

Location of Span

In the previous article I mentioned that P is a virtual resource: only one thread can access the same P at a time, so the data in P does not need to be locked.
To achieve better performance when allocating objects, each P has its own cache of spans (also called mcache), structured as follows:

Each P holds 67*2 = 134 spans, grouped by span type.

The difference between scan and noscan is:
if the object contains pointers, a scan span is used when allocating it;
if the object does not contain pointers, a noscan span is used.
The point of dividing spans into scan and noscan is that
when the GC scans objects, it does not need to consult the bitmap region for objects in noscan spans (they contain no pointers to mark), which greatly improves marking efficiency.

When allocating an object, a suitable span is obtained from the following locations, in order:

  • First from P's cache (mcache): if there is a cached span that is not full, it is used. This step does not require a lock.
  • Then from the global cache (mcentral): on success the span is set into P's mcache. This step requires a lock.
  • Finally from mheap: the obtained span is set into the global cache. This step requires a lock.

Caching spans in P is similar to the Allocation Context in CoreCLR.
Most of the time allocating an object does not need a thread lock, which improves allocation performance.
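The lookup order can be summed up with the following simplified sketch; the types and function names are stand-ins for runtime.mcache, runtime.mcentral and runtime.mheap, and locking is only indicated in comments:

// Illustrative stand-ins for the runtime's allocation tiers.
type spanStub struct{ freeSlots int }

type mcacheStub struct{ alloc [134]*spanStub } // per-P cache, lock-free
type mcentralStub struct{ nonempty []*spanStub } // global, guarded by its own lock
type mheapStub struct{} // global heap, guarded by the heap lock

func (h *mheapStub) newSpan(class int) *spanStub { return &spanStub{freeSlots: 8} }

func (c *mcentralStub) takeSpan() *spanStub {
	if len(c.nonempty) == 0 {
		return nil
	}
	s := c.nonempty[0]
	c.nonempty = c.nonempty[1:]
	return s
}

// allocFrom sketches the three steps listed above: try the mcache first,
// then refill from mcentral, and finally grow from mheap, caching whatever
// span was obtained back into the mcache.
func allocFrom(mc *mcacheStub, central *mcentralStub, heap *mheapStub, class int) *spanStub {
	if s := mc.alloc[class]; s != nil && s.freeSlots > 0 {
		s.freeSlots-- // step 1: allocate from the cached span, no lock
		return s
	}
	s := central.takeSpan() // step 2: refill from mcentral (locked in the runtime)
	if s == nil {
		s = heap.newSpan(class) // step 3: get a new span from mheap (locked in the runtime)
	}
	mc.alloc[class] = s
	s.freeSlots--
	return s
}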

Processing of Object Allocation

Process of allocating objects

When go allocates an object from the heap, the newobject function is called. The flow of this function is roughly as follows:

First, it checks whether the GC is currently working. If the GC is working and the current G has allocated a certain amount of memory, the G needs to assist the GC with some of its work.
This mechanism, called GC Assist, prevents memory from being allocated faster than the GC can reclaim it.

It then determines whether the object is a small object or a large object. If it is a large object, largeAlloc is called to allocate it directly from the heap.
If it is a small object, an available span is obtained in three stages, and the object is then allocated from an element of that span:

  • First from P's cache (mcache)
  • Then from mcentral, the global cache, which keeps lists of spans with free space
  • Finally from mheap, which also keeps free lists of spans; if both lookups fail, a new span is allocated from the arena region

The detailed structure of these three stages is as follows:

Data type definitions

The data types involved in allocating objects are:

p: as mentioned in the previous article, P is a virtual resource used for running go code
m: as mentioned in the previous article, M represents a system thread
g: as mentioned in the previous article, G is a goroutine
mspan: a block used for allocating objects
mcentral: the global mspan caches; there are 67*2 = 134 of them
mheap: manages the heap; there is only one global instance

Source code analysis

When go allocates an object from the heap it calls the newobject function, so let's start there:

// implementation of new builtin
// compiler (both frontend and SSA backend) knows the signature
// of this function
func newobject(typ *_type) unsafe.Pointer {
	return mallocgc(typ.size, typ, true)
}

newobject calls the mallocgc function:

// Allocate an object of size bytes. // Small objects are allocated from the per-P cache's free lists. // Large objects (> 32 kB) are allocated straight from the heap. func mallocgc(size uintptr, typ *_type, needzero bool) unsafe.Pointer { if gcphase == _GCmarktermination { throw("mallocgc called with gcphase == _GCmarktermination") } if size == 0 { return unsafe.Pointer(&zerobase) } if debug.sbrk != 0 { align := uintptr(16) if typ != nil { align = uintptr(typ.align) } return persistentalloc(size, align, &memstats.other_sys) } // Judging whether to assist GC work // GC Blacken Enabled opens during the GC markup phase // assistG is the G to charge for this allocation, or nil if // GC is not currently active. var assistG *g if gcBlackenEnabled != 0 { // Charge the current user G for this allocation. assistG = getg() if assistG.m.curg != nil { assistG = assistG.m.curg } // Charge the allocation against the G. We'll account // for internal fragmentation at the end of mallocgc. assistG.gcAssistBytes -= int64(size) // The size of the assignment will determine how much work the GC needs to assist in. // The specific algorithm will be explained below when the collector is explained. if assistG.gcAssistBytes < 0 { // This G is in debt. Assist the GC to correct // this before allocating. This must happen // before disabling preemption. gcAssistAlloc(assistG) } } // Increase the lock count of the M corresponding to the current G to prevent the G from being preempted // Set mp.mallocing to keep from being preempted by GC. mp := acquirem() if mp.mallocing != 0 { throw("malloc deadlock") } if mp.gsignal == getg() { throw("malloc during signal") } mp.mallocing = 1 shouldhelpgc := false dataSize := size // Get the local span cache (mcache) of P corresponding to M corresponding to the current G // Because M sets the mcache of P to M after it has P, the return here is getg().m.mcache. c := gomcache() var x unsafe.Pointer noscan := typ == nil || typ.kind&kindNoPointers != 0 // The current value of maxSmallSize for small objects is 32K if size <= maxSmallSize { // If the object does not contain pointers and the size of the object is less than 16 bytes, special processing can be done. // This is an optimization for very small objects, because the minimum element of span is only 8 byte. If the object is smaller, a lot of space will be wasted. // Very small objects can be integrated into elements of "class 2 noscan" (size 16 byte) if noscan && size < maxTinySize { // Tiny allocator. // // Tiny allocator combines several tiny allocation requests // into a single memory block. The resulting memory block // is freed when all subobjects are unreachable. The subobjects // must be noscan (don't have pointers), this ensures that // the amount of potentially wasted memory is bounded. // // Size of the memory block used for combining (maxTinySize) is tunable. // Current setting is 16 bytes, which relates to 2x worst case memory // wastage (when all but one subobjects are unreachable). // 8 bytes would result in no wastage at all, but provides less // opportunities for combining. // 32 bytes provides more opportunities for combining, // but can lead to 4x worst case wastage. // The best case winning is 8x regardless of block size. // // Objects obtained from tiny allocator must not be freed explicitly. // So when an object will be freed explicitly, we ensure that // its size >= maxTinySize. 
// // SetFinalizer has a special case for objects potentially coming // from tiny allocator, it such case it allows to set finalizers // for an inner byte of a memory block. // // The main targets of tiny allocator are small strings and // standalone escaping variables. On a json benchmark // the allocator reduces number of allocations by ~12% and // reduces heap size by ~20%. off := c.tinyoffset // Align tiny pointer for required (conservative) alignment. if size&7 == 0 { off = round(off, 8) } else if size&3 == 0 { off = round(off, 4) } else if size&1 == 0 { off = round(off, 2) } if off+size <= maxTinySize && c.tiny != 0 { // The object fits into existing tiny block. x = unsafe.Pointer(c.tiny + off) c.tinyoffset = off + size c.local_tinyallocs++ mp.mallocing = 0 releasem(mp) return x } // Allocate a new maxTinySize block. span := c.alloc[tinySpanClass] v := nextFreeFast(span) if v == 0 { v, _, shouldhelpgc = c.nextFree(tinySpanClass) } x = unsafe.Pointer(v) (*[2]uint64)(x)[0] = 0 (*[2]uint64)(x)[1] = 0 // See if we need to replace the existing tiny block with the new one // based on amount of remaining free space. if size < c.tinyoffset || c.tiny == 0 { c.tiny = uintptr(x) c.tinyoffset = size } size = maxTinySize } else { // Otherwise, they are allocated by ordinary small objects. // First, which span type should be used to get the size of the object var sizeclass uint8 if size <= smallSizeMax-8 { sizeclass = size_to_class8[(size+smallSizeDiv-1)/smallSizeDiv] } else { sizeclass = size_to_class128[(size-smallSizeMax+largeSizeDiv-1)/largeSizeDiv] } size = uintptr(class_to_size[sizeclass]) // Equivalent to sizeclass * 2 + (noscan?1:0) spc := makeSpanClass(sizeclass, noscan) span := c.alloc[spc] // Try to quickly allocate from this span v := nextFreeFast(span) if v == 0 { // Failure to allocate may need to be retrieved from mcentral or mheap // If a new span is obtained from mcentral or mheap, shouldhelpgc equals true // Should help GC equals true, the following judges whether to trigger GC v, span, shouldhelpgc = c.nextFree(spc) } x = unsafe.Pointer(v) if needzero && span.needzero != 0 { memclrNoHeapPointers(unsafe.Pointer(v), size) } } } else { // Large objects are allocated directly from mheap, where s is a special span whose class is 0 var s *mspan shouldhelpgc = true systemstack(func() { s = largeAlloc(size, needzero, noscan) }) s.freeindex = 1 s.allocCount = 1 x = unsafe.Pointer(s.base()) size = s.elemsize } // Set up the bitmap corresponding to arena, record which locations contain pointers, and GC scans all reachable objects using bitmap var scanSize uintptr if !noscan { // If allocating a defer+arg block, now that we've picked a malloc size // large enough to hold everything, cut the "asked for" size down to // just the defer header, so that the GC bitmap will record the arg block // as containing nothing at all (as if it were unused space at the end of // a malloc block caused by size rounding). // The defer arg areas are scanned as part of scanstack. if typ == deferType { dataSize = unsafe.Sizeof(_defer{}) } // This function is very long and interesting to see. // https://github.com/golang/go/blob/go1.9.2/src/runtime/mbitmap.go#L855 // Although the code is long, the settings are the same as the structure of the bitmap area mentioned above. // scan bit and pointer bit are set according to the type information. scan bit is set up to indicate that it should continue to scan, and pointer bit is set to indicate that the location is a pointer. // There are some points to be noted. 
// - If a type contains pointers only at the beginning, such as [ptr, ptr, large non-pointer data] // Then the scan bit of the latter part will be 0, which can greatly improve the efficiency of markup. // - The scan bit of the second slot has a special purpose. It is not used to mark whether to continue scanning, but to mark checkmark. // What is checkmark? // - Because the parallel GC of go is complex, in order to check whether the implementation is correct, go needs to have a mechanism to check whether all objects that should be tagged are tagged. // This mechanism is checkmark. When checkmark is opened, go stops the whole world at the end of the markup phase and executes the markup again. // The scan bit of the second slot above is used to mark whether the object is marked in the checkmark tag. // - Some people may find that the second slot requires an object to have at least two pointer sizes, so what about an object with only one pointer? // Objects with only one pointer can be divided into two cases // The object is the pointer, because the size is exactly one pointer, so you don't need to look at the bitmap area. The first slot is checkmark. // Objects are not pointers, because there is a tiny alloc mechanism. Objects that are not pointers and have only one pointer size are allocated in the span of two pointers. // You don't need to look at the bitmap area at this point, so as above, the first slot is checkmark. heapBitsSetType(uintptr(x), size, dataSize, typ) if dataSize > typ.size { // Array allocation. If there are any // pointers, GC has to scan to the last // element. if typ.ptrdata != 0 { scanSize = dataSize - typ.size + typ.ptrdata } } else { scanSize = typ.ptrdata } c.local_scan += scanSize } // Memory barrier, because store s of x86 and x64 are not out of order, so here's just a barrier for compilers. The assembly is ret. // Ensure that the stores above that initialize x to // type-safe memory and set the heap bits occur before // the caller can make x observable to the garbage // collector. Otherwise, on weakly ordered machines, // the garbage collector could follow a pointer to x, // but see uninitialized memory or stale heap bits. publicationBarrier() // If you are currently in GC, you need to immediately mark the allocated object as "black" to prevent it from being recycled. // Allocate black during GC. // All slots hold nil so no scanning is needed. // This may be racing with GC so do it atomically if there can be // a race marking the bit. if gcphase != _GCoff { gcmarknewobject(uintptr(x), size, scanSize) } // Processing of Race Detector (for thread conflict detection) if raceenabled { racemalloc(x, size) } // Processing of Memory Sanitizer (for detecting memory problems such as dangerous pointers) if msanenabled { msanmalloc(x, size) } // Re-allow the current G to be preempted mp.mallocing = 0 releasem(mp) // Debug Logs if debug.allocfreetrace != 0 { tracealloc(x, size, typ) } // Profiler record if rate := MemProfileRate; rate > 0 { if size < uintptr(rate) && int32(size) < c.next_sample { c.next_sample -= int32(size) } else { mp := acquirem() profilealloc(mp, x, size) releasem(mp) } } // gcAssistBytes subtracts "actual allocation size - required allocation size" and adjusts to the exact value if assistG != nil { // Account for internal fragmentation in the assist // debt now that we know it. assistG.gcAssistBytes -= int64(size - dataSize) } // If a new span has been obtained before, it is determined whether a background start GC is needed. 
// The GC Trigger here is explained in detail below. if shouldhelpgc { if t := (gcTrigger); t.test() { gcStart(gcBackgroundMode, t) } } return x }

Next, let's look at how objects are allocated from a span. First, nextFreeFast is called to attempt a fast allocation:

// nextFreeFast returns the next free object if one is quickly available. // Otherwise it returns 0. func nextFreeFast(s *mspan) gclinkptr { // Getting the first non-zero bit is the first bit, that is, which element is unassigned theBit := sys.Ctz64(s.allocCache) // Is there a free object in the allocCache? // Find unallocated elements if theBit < 64 { result := s.freeindex + uintptr(theBit) // Require that the index value be less than the number of elements if result < s.nelems { // Next free index freeidx := result + 1 // Special processing is required when divisible by 64 (refer to nextFree) if freeidx%64 == 0 && freeidx != s.nelems { return 0 } // Update freeindex and allocCache (both high-level 0, updated after exhaustion) s.allocCache >>= uint(theBit + 1) s.freeindex = freeidx // Returns the address of the element v := gclinkptr(result*s.elemsize + s.base()) // Add the assigned element count s.allocCount++ return v } } return 0 }

If an unallocated element cannot be found quickly starting from freeindex, nextFree is called for more complex processing:

// nextFree returns the next free object from the cached span if one is available. // Otherwise it refills the cache with a span with an available object and // returns that object along with a flag indicating that this was a heavy // weight allocation. If it is a heavy weight allocation the caller must // determine whether a new GC cycle needs to be started or if the GC is active // whether this goroutine needs to assist the GC. func (c *mcache) nextFree(spc spanClass) (v gclinkptr, s *mspan, shouldhelpgc bool) { // Find the next freeindex and update allocCache s = c.alloc[spc] shouldhelpgc = false freeIndex := s.nextFreeIndex() // If all the elements in the span have been allocated, you need to get a new span if freeIndex == s.nelems { // The span is full. if uintptr(s.allocCount) != s.nelems { println("runtime: s.allocCount=", s.allocCount, "s.nelems=", s.nelems) throw("s.allocCount != s.nelems && freeIndex == s.nelems") } // Apply for a new span systemstack(func() { c.refill(spc) }) // Get the new span after the application and set the need to check whether GC is executed shouldhelpgc = true s = c.alloc[spc] freeIndex = s.nextFreeIndex() } if freeIndex >= s.nelems { throw("freeIndex is not valid") } // Returns the address of the element v = gclinkptr(freeIndex*s.elemsize + s.base()) // Add the assigned element count s.allocCount++ if uintptr(s.allocCount) > s.nelems { println("s.allocCount=", s.allocCount, "s.nelems=", s.nelems) throw("s.allocCount > s.nelems") } return }

If the span of the given class in the mcache is full, the refill function is called to request a new span:

// Gets a span that has a free object in it and assigns it // to be the cached span for the given sizeclass. Returns this span. func (c *mcache) refill(spc spanClass) *mspan { _g_ := getg() // Prevent G from being preempted _g_.m.locks++ // Return the current cached span to the central lists. s := c.alloc[spc] // Ensure that all elements of the current span are allocated if uintptr(s.allocCount) != s.nelems { throw("refill of span with free space remaining") } // Set the incache attribute of span, unless it is an empty span used globally (that is, the default value of the span pointer in mcache) if s != &emptymspan { s.incache = false } // Apply to mcentral for a new span // Get a new cached span from the central lists. s = mheap_.central[spc].mcentral.cacheSpan() if s == nil { throw("out of memory") } if uintptr(s.allocCount) == s.nelems { throw("span has no free space") } // Set up a new span to mcache c.alloc[spc] = s // Allow G to be preempted _g_.m.locks-- return s }

Requesting a new span from mcentral goes through the cacheSpan function:
mcentral first tries to reuse an existing span from its internal lists, and requests one from mheap if reuse fails.

// Allocate a span to use in an MCache. func (c *mcentral) cacheSpan() *mspan { // Let current G assist part of sweep's work // Deduct credit for this span allocation and sweep if necessary. spanBytes := uintptr(class_to_allocnpages[c.spanclass.sizeclass()]) * _PageSize deductSweepCredit(spanBytes, 0) // Lock mcentral because there may be multiple M(P) accesses at the same time lock(&c.lock) traceDone := false if trace.enabled { traceGCSweepStart() } sg := mheap_.sweepgen retry: // There are two span s in mcentral // - nonempty means determining that the span has at least one unallocated element // - empty denotes uncertainty that the span has at least one unallocated element // Find nonempty's list first here // Swepgen adds 2 per GC // - sweepgen = global sweepgen, indicating that span has sweeped // - sweepgen = global sweepgen-1, indicating that span is sweep // - sweepgen = global sweepgen-2, which means span waits for sweep var s *mspan for s = c.nonempty.first; s != nil; s = s.next { // If span waits for sweep, try atomic modification of sweepgen to global sweepgen-1 if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) { // If the modification is successful, move span to empty list, sweep it and jump to havespan c.nonempty.remove(s) c.empty.insertBack(s) unlock(&c.lock) s.sweep(true) goto havespan } // If this span is being sweep by other threads, skip if s.sweepgen == sg-1 { // the span is being swept by background sweeper, skip continue } // span has been sweep // Because span in the nonempty list determines that there is at least one unallocated element, it can be used directly here. // we have a nonempty span that does not require sweeping, allocate from it c.nonempty.remove(s) c.empty.insertBack(s) unlock(&c.lock) goto havespan } // Find empty's list for s = c.empty.first; s != nil; s = s.next { // If span waits for sweep, try atomic modification of sweepgen to global sweepgen-1 if s.sweepgen == sg-2 && atomic.Cas(&s.sweepgen, sg-2, sg-1) { // Put span at the end of empty list // we have an empty span that requires sweeping, // sweep it and see if we can free some space in it c.empty.remove(s) // swept spans are at the end of the list c.empty.insertBack(s) unlock(&c.lock) // Try sweep s.sweep(true) // Swep also needs to detect whether there are unassigned objects, and if so, it can be used freeIndex := s.nextFreeIndex() if freeIndex != s.nelems { s.freeindex = freeIndex goto havespan } lock(&c.lock) // the span is still empty after sweep // it is already in the empty list, so just retry goto retry } // If this span is being sweep by other threads, skip if s.sweepgen == sg-1 { // the span is being swept by background sweeper, skip continue } // Cannot find span with unassigned objects // already swept empty span, // all subsequent ones must also be either swept or in process of sweeping break } if trace.enabled { traceGCSweepDone() traceDone = true } unlock(&c.lock) // Spans with unassigned objects cannot be found and need to be allocated from mheap // Add to empty list after assignment is completed // Replenish central list if empty. s = c.grow() if s == nil { return nil } lock(&c.lock) c.empty.insertBack(s) unlock(&c.lock) // At this point s is a non-empty span, queued at the end of the empty list, // c is unlocked. 
havespan: if trace.enabled && !traceDone { traceGCSweepDone() } // Statistically count the number of unallocated elements in span and add them to mcentral.nmalloc // Statistics the total size of unallocated elements in span and add them to memstats.heap_live cap := int32((s.npages << _PageShift) / s.elemsize) n := cap - int32(s.allocCount) if n == 0 || s.freeindex == s.nelems || uintptr(s.allocCount) == s.nelems { throw("span has no free objects") } // Assume all objects from this span will be allocated in the // mcache. If it gets uncached, we'll adjust this. atomic.Xadd64(&c.nmalloc, int64(n)) usedBytes := uintptr(s.allocCount) * s.elemsize atomic.Xadd64(&memstats.heap_live, int64(spanBytes)-int64(usedBytes)) // Tracking processing if trace.enabled { // heap_live changed. traceHeapAlloc() } // If currently in GC, because heap_live has changed, adjust the value of G auxiliary markup work // For more details, see the following parsing of the revise function if gcBlackenEnabled != 0 { // heap_live changed. gcController.revise() } // Set the incache attribute of span to indicate that span is in mcache s.incache = true // Update allocCache according to freeindex freeByteBase := s.freeindex &^ (64 - 1) whichByte := freeByteBase / 8 // Init alloc bits cache. s.refillAllocCache(whichByte) // Adjust the allocCache so that s.freeindex corresponds to the low bit in // s.allocCache. s.allocCache >>= s.freeindex % 64 return s }

mcentral requests a new span from mheap using the grow function:

// grow allocates a new empty span from the heap and initializes it for c's size class. func (c *mcentral) grow() *mspan { // Calculate the size of the span to be applied (divided by 8K = how many pages) and how many elements can be saved according to the type of mcentral npages := uintptr(class_to_allocnpages[c.spanclass.sizeclass()]) size := uintptr(class_to_size[c.spanclass.sizeclass()]) n := (npages << _PageShift) / size // Apply to mheap for a new span in pages (8K) s := mheap_.alloc(npages, c.spanclass, false, true) if s == nil { return nil } p := s.base() s.limit = p + size*n // allocBits and gcmarkBits for allocating and initializing span s heapBitsForSpan(s.base()).initSpan(s) return s }

The function mheap uses to allocate a span is alloc:

func (h *mheap) alloc(npage uintptr, spanclass spanClass, large bool, needzero bool) *mspan { // Calling the alloc_m function in the stack space of g0 // See the previous article for a description of systemstack // Don't do any operations that lock the heap on the G stack. // It might trigger stack growth, and the stack growth code needs // to be able to allocate heap. var s *mspan systemstack(func() { s = h.alloc_m(npage, spanclass, large) }) if s != nil { if needzero && s.needzero != 0 { memclrNoHeapPointers(unsafe.Pointer(s.base()), s.npages<<_PageShift) } s.needzero = 0 } return s }

The alloc function calls the alloc_m function on the stack space of g0:

// Allocate a new span of npage pages from the heap for GC'd memory // and record its size class in the HeapMap and HeapMapCache. func (h *mheap) alloc_m(npage uintptr, spanclass spanClass, large bool) *mspan { _g_ := getg() if _g_ != _g_.m.g0 { throw("_mheap_alloc not on g0 stack") } // For mheap locks, the locks here are global locks lock(&h.lock) // To prevent heap from growing too fast, sweep and recycle n pages before allocating n pages // busy lists are first enumerated and then busyLarge lists are enumerated for sweep, referring specifically to reclaim and reclaimList functions. // To prevent excessive heap growth, before allocating n pages // we need to sweep and reclaim at least n pages. if h.sweepdone == 0 { // TODO(austin): This tends to sweep a large number of // spans in order to find a few completely free spans // (for example, in the garbage benchmark, this sweeps // ~30x the number of pages its trying to allocate). // If GC kept a bit for whether there were any marks // in a span, we could release these free spans // at the end of GC and eliminate this entirely. if trace.enabled { traceGCSweepStart() } h.reclaim(npage) if trace.enabled { traceGCSweepDone() } } // Add local statistics from mcache to the global // transfer stats from cache to global memstats.heap_scan += uint64(_g_.m.mcache.local_scan) _g_.m.mcache.local_scan = 0 memstats.tinyallocs += uint64(_g_.m.mcache.local_tinyallocs) _g_.m.mcache.local_tinyallocs = 0 // Call allocSpanLocked to allocate span, and the allocSpanLocked function requires that mheap is currently locked s := h.allocSpanLocked(npage, &memstats.heap_inuse) if s != nil { // Record span info, because gc needs to be // able to map interior pointer to containing span. // Setting span sweepgen = global sweepgen atomic.Store(&s.sweepgen, h.sweepgen) // Put it in the global span list, where the length of sweepSpans is 2 // sweepSpans[h.sweepgen/2%2] saves the list of span s currently in use // SwepSpans [1-h.sweepgen/2%2] saves the list of span s waiting for sweep // Because every time gcsweepgen adds 2, each time the two lists of gc are exchanged h.sweepSpans[h.sweepgen/2%2].push(s) // Add to swept in-use list. // Initialize span members s.state = _MSpanInUse s.allocCount = 0 s.spanclass = spanclass if sizeclass := spanclass.sizeclass(); sizeclass == 0 { s.elemsize = s.npages << _PageShift s.divShift = 0 s.divMul = 0 s.divShift2 = 0 s.baseMask = 0 } else { s.elemsize = uintptr(class_to_size[sizeclass]) m := &class_to_divmagic[sizeclass] s.divShift = m.shift s.divMul = m.mul s.divShift2 = m.shift2 s.baseMask = m.baseMask } // update stats, sweep lists h.pagesInUse += uint64(npage) // The above grow function passes in true, which means that calling large s here through grow equals true. // Add the allocated span to the busy list and place it in the busylarge list if the number of pages exceeds _MaxMHeapList(128 pages = 8K*128=1M) if large { memstats.heap_objects++ mheap_.largealloc += uint64(s.elemsize) mheap_.nlargealloc++ atomic.Xadd64(&memstats.heap_live, int64(npage<<_PageShift)) // Swept spans are at the end of lists. if s.npages < uintptr(len(h.busy)) { h.busy[s.npages].insertBack(s) } else { h.busylarge.insertBack(s) } } } // If currently in GC, because heap_live has changed, adjust the value of G auxiliary markup work // For more details, see the following parsing of the revise function // heap_scan and heap_live were updated. 
if gcBlackenEnabled != 0 { gcController.revise() } // Tracking processing if trace.enabled { traceHeapAlloc() } // h.spans is accessed concurrently without synchronization // from other threads. Hence, there must be a store/store // barrier here to ensure the writes to h.spans above happen // before the caller can publish a pointer p to an object // allocated from s. As soon as this happens, the garbage // collector running on another processor could read p and // look up s in h.spans. The unlock acts as the barrier to // order these writes. On the read side, the data dependency // between p and the index in h.spans orders the reads. unlock(&h.lock) return s }

Next, look at the allocSpanLocked function:

// Allocates a span of the given size. h must be locked. // The returned span has been removed from the // free list, but its state is still MSpanFree. func (h *mheap) allocSpanLocked(npage uintptr, stat *uint64) *mspan { var list *mSpanList var s *mspan // Trying to allocate free lists in mheap // Free span with pages less than _MaxMHeapList(128 pages = 1M) will be in the free list // Free span with pages greater than _MaxMHeapList will be in the freelarge list // Try in fixed-size lists up to max. for i := int(npage); i < len(h.free); i++ { list = &h.free[i] if !list.isEmpty() { s = list.first list.remove(s) goto HaveSpan } } // Find the freelarge list if the free list is not found // If you can't find it, apply for a new span in the arena area to add to freelarges, and then look for the list of freelarges // Best fit in list of large spans. s = h.allocLarge(npage) // allocLarge removed s from h.freelarge for us if s == nil { if !h.grow(npage) { return nil } s = h.allocLarge(npage) if s == nil { return nil } } HaveSpan: // Mark span in use. if s.state != _MSpanFree { throw("MHeap_AllocLocked - MSpan not free") } if s.npages < npage { throw("MHeap_AllocLocked - bad npages") } // If the span has pages that have been released (de-virtual memory versus physical memory), remind them that they will be used and update statistics if s.npreleased > 0 { sysUsed(unsafe.Pointer(s.base()), s.npages<<_PageShift) memstats.heap_released -= uint64(s.npreleased << _PageShift) s.npreleased = 0 } // If you get more span pages than you want // Divide the remaining pages into another span and place them in a free list if s.npages > npage { // Trim extra and put it back in the heap. t := (*mspan)(h.spanalloc.alloc()) t.init(s.base()+npage<<_PageShift, s.npages-npage) s.npages = npage p := (t.base() - h.arena_start) >> _PageShift if p > 0 { h.spans[p-1] = s } h.spans[p] = t h.spans[p+t.npages-1] = t t.needzero = s.needzero s.state = _MSpanManual // prevent coalescing with s t.state = _MSpanManual h.freeSpanLocked(t, false, false, s.unusedsince) s.state = _MSpanFree } s.unusedsince = 0 // Set the spans area, which addresses correspond to which mspan object p := (s.base() - h.arena_start) >> _PageShift for n := uintptr(0); n < npage; n++ { h.spans[p+n] = s } // Update statistical data *stat += uint64(npage << _PageShift) memstats.heap_idle -= uint64(npage << _PageShift) //println("spanalloc", hex(s.start<<_PageShift)) if s.inList() { throw("still in list") } return s }

Next, look at the allocLarge function:

// allocLarge allocates a span of at least npage pages from the treap of large spans.
// Returns nil if no such span currently exists.
func (h *mheap) allocLarge(npage uintptr) *mspan {
	// Search treap for smallest span with >= npage pages.
	return h.freelarge.remove(npage)
}

The type of freelarge is mTreap; its remove function searches the tree for the smallest span with at least npage pages and returns it:

// remove searches for, finds, removes from the treap, and returns the smallest // span that can hold npages. If no span has at least npages return nil. // This is slightly more complicated than a simple binary tree search // since if an exact match is not found the next larger node is // returned. // If the last node inspected > npagesKey not holding // a left node (a smaller npages) is the "best fit" node. func (root *mTreap) remove(npages uintptr) *mspan { t := root.treap for t != nil { if t.spanKey == nil { throw("treap node with nil spanKey found") } if t.npagesKey < npages { t = t.right } else if t.left != nil && t.left.npagesKey >= npages { t = t.left } else { result := t.spanKey root.removeNode(t) return result } } return nil }

The function that requests a new span from the arena region is mheap's grow function:

// Try to add at least npage pages of memory to the heap, // returning whether it worked. // // h must be locked. func (h *mheap) grow(npage uintptr) bool { // Ask for a big chunk, to reduce the number of mappings // the operating system needs to track; also amortizes // the overhead of an operating system mapping. // Allocate a multiple of 64kB. npage = round(npage, (64<<10)/_PageSize) ask := npage << _PageShift if ask < _HeapAllocChunk { ask = _HeapAllocChunk } // Call the mheap.sysAlloc function to apply v := h.sysAlloc(ask) if v == nil { if ask > npage<<_PageShift { ask = npage << _PageShift v = h.sysAlloc(ask) } if v == nil { print("runtime: out of memory: cannot allocate ", ask, "-byte block (", memstats.heap_sys, " in use)\n") return false } } // Create a new span and add it to the free list // Create a fake "in use" span and free it, so that the // right coalescing happens. s := (*mspan)(h.spanalloc.alloc()) s.init(uintptr(v), ask>>_PageShift) p := (s.base() - h.arena_start) >> _PageShift for i := p; i < p+s.npages; i++ { h.spans[i] = s } atomic.Store(&s.sweepgen, h.sweepgen) s.state = _MSpanInUse h.pagesInUse += uint64(s.npages) h.freeSpanLocked(s, false, true, 0) return true }

Next, look at mheap's sysAlloc function:

// sysAlloc allocates the next n bytes from the heap arena. The // returned pointer is always _PageSize aligned and between // h.arena_start and h.arena_end. sysAlloc returns nil on failure. // There is no corresponding free function. func (h *mheap) sysAlloc(n uintptr) unsafe.Pointer { // strandLimit is the maximum number of bytes to strand from // the current arena block. If we would need to strand more // than this, we fall back to sysAlloc'ing just enough for // this allocation. const strandLimit = 16 << 20 // If the arena region is currently insufficient, call sysReserve to reserve more space, and then update arena_end // sysReserve calls mmap function on linux // mmap(v, n, _PROT_NONE, _MAP_ANON|_MAP_PRIVATE, -1, 0) if n > h.arena_end-h.arena_alloc { // If we haven't grown the arena to _MaxMem yet, try // to reserve some more address space. p_size := round(n+_PageSize, 256<<20) new_end := h.arena_end + p_size // Careful: can overflow if h.arena_end <= new_end && new_end-h.arena_start-1 <= _MaxMem { // TODO: It would be bad if part of the arena // is reserved and part is not. var reserved bool p := uintptr(sysReserve(unsafe.Pointer(h.arena_end), p_size, &reserved)) if p == 0 { // TODO: Try smaller reservation // growths in case we're in a crowded // 32-bit address space. goto reservationFailed } // p can be just about anywhere in the address // space, including before arena_end. if p == h.arena_end { // The new block is contiguous with // the current block. Extend the // current arena block. h.arena_end = new_end h.arena_reserved = reserved } else if h.arena_start <= p && p+p_size-h.arena_start-1 <= _MaxMem && h.arena_end-h.arena_alloc < strandLimit { // We were able to reserve more memory // within the arena space, but it's // not contiguous with our previous // reservation. It could be before or // after our current arena_used. // // Keep everything page-aligned. // Our pages are bigger than hardware pages. h.arena_end = p + p_size p = round(p, _PageSize) h.arena_alloc = p h.arena_reserved = reserved } else { // We got a mapping, but either // // 1) It's not in the arena, so we // can't use it. (This should never // happen on 32-bit.) // // 2) We would need to discard too // much of our current arena block to // use it. // // We haven't added this allocation to // the stats, so subtract it from a // fake stat (but avoid underflow). // // We'll fall back to a small sysAlloc. stat := uint64(p_size) sysFree(unsafe.Pointer(p), p_size, &stat) } } } // When the reserved space is enough, only arena_alloc needs to be added. if n <= h.arena_end-h.arena_alloc { // Keep taking from our reservation. p := h.arena_alloc sysMap(unsafe.Pointer(p), n, h.arena_reserved, &memstats.heap_sys) h.arena_alloc += n if h.arena_alloc > h.arena_used { h.setArenaUsed(h.arena_alloc, true) } if p&(_PageSize-1) != 0 { throw("misrounded allocation in MHeap_SysAlloc") } return unsafe.Pointer(p) } // Processing after Reserved Space Failure reservationFailed: // If using 64-bit, our reservation is all we have. if sys.PtrSize != 4 { return nil } // On 32-bit, once the reservation is gone we can // try to get memory at a location chosen by the OS. p_size := round(n, _PageSize) + _PageSize p := uintptr(sysAlloc(p_size, &memstats.heap_sys)) if p == 0 { return nil } if p < h.arena_start || p+p_size-h.arena_start > _MaxMem { // This shouldn't be possible because _MaxMem is the // whole address space on 32-bit. 
top := uint64(h.arena_start) + _MaxMem print("runtime: memory allocated by OS (", hex(p), ") not in usable range [", hex(h.arena_start), ",", hex(top), ")\n") sysFree(unsafe.Pointer(p), p_size, &memstats.heap_sys) return nil } p += -p & (_PageSize - 1) if p+n > h.arena_used { h.setArenaUsed(p+n, true) } if p&(_PageSize-1) != 0 { throw("misrounded allocation in MHeap_SysAlloc") } return unsafe.Pointer(p) }

This is the complete process of allocating objects. Next, we analyze GC marking and the reclamation of objects.

Processing of Object Reclamation

Process of reclaiming objects

go's GC is a parallel GC, that is, most of the GC work runs at the same time as ordinary go code, which makes go's GC process fairly complex.
First, the GC has four phases:

  • Sweep Termination: spans not yet swept by the previous GC cycle must be swept before a new GC cycle can start
  • Mark: scan all root objects, and all objects reachable from the root objects, and mark them so they are not reclaimed
  • Mark Termination: complete the marking and rescan part of the root objects (requires STW)
  • Sweep: sweep the spans according to the marking results

The following figure shows a relatively complete GC flow, with the four phases distinguished by color:

There are two kinds of background tasks (G) in the GC process: background tasks for marking and background tasks for sweeping.
Background mark tasks are started when needed; the number that can work at the same time is about 25% of the number of P, which is the basis of go's claim that GC uses 25% of the CPU.
The background sweep task is started when the program starts and is woken up when the GC enters the sweep phase.

At present, the whole GC flow undergoes two STWs (Stop The World): the first at the beginning of the Mark phase and the second in the Mark Termination phase.
The first STW prepares the scanning of root objects and enables the Write Barrier and mutator assist.
The second STW rescans part of the root objects and disables the Write Barrier and mutator assist.
Note that not all root-object scanning requires STW; for example, scanning objects on a stack only requires stopping the G that owns that stack.
Starting from go 1.9, the write barrier implementation uses the Hybrid Write Barrier, which greatly reduces the time of the second STW.
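The phases and their pauses can be observed in practice with the gctrace debug option. The toy program below only exists to generate garbage; running it with GODEBUG=gctrace=1 prints one summary line per GC cycle, and the three wall-clock numbers in that line correspond roughly to the first STW, the concurrent mark phase and the second STW (the exact output format is described in the runtime package documentation):

package main

import "runtime"

func main() {
	// Allocate in a loop so several GC cycles are triggered, then run with:
	//   GODEBUG=gctrace=1 go run main.go
	var keep [][]byte
	for i := 0; i < 1000; i++ {
		keep = append(keep, make([]byte, 1<<20)) // 1MB per iteration
		if len(keep) > 64 {
			keep = keep[1:] // drop old buffers so they become garbage
		}
	}
	runtime.KeepAlive(keep)
}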

Triggering Conditions of GC

GC is triggered when certain conditions are met. The trigger conditions are as follows:

  • gcTriggerAlways: forcibly trigger GC
  • gcTriggerHeap: trigger GC when the currently allocated memory reaches a certain value
  • gcTriggerTime: trigger GC when GC has not been executed for a certain period of time
  • gcTriggerCycle: requests that a new round of GC be started, and is skipped if one has already started; runtime.GC(), which triggers GC manually, uses this condition

Whether a trigger condition is met is judged by the test method of gcTrigger.
Among them, gcTriggerHeap and gcTriggerTime are triggered naturally. The judgment code of gcTriggerHeap is as follows:

return memstats.heap_live >= memstats.gc_trigger

How heap_live grows can be seen in the allocator code analysis above; when its value reaches gc_trigger, GC is triggered. So how is gc_trigger decided?
gc_trigger is calculated in the gcSetTriggerRatio function, with the formula:

trigger = uint64(float64(memstats.heap_marked) * (1 + triggerRatio))

That is, the amount marked as live by the previous GC, multiplied by 1 + the coefficient triggerRatio, is the allocation level at which the next GC is triggered.
triggerRatio is adjusted after each GC; the function that calculates it is endCycle, and the formula is:

const triggerGain = 0.5

// Target heap growth rate, default 1.0
goalGrowthRatio := float64(gcpercent) / 100

// Actual heap growth rate, equal to total size / live size - 1
actualGrowthRatio := float64(memstats.heap_live)/float64(memstats.heap_marked) - 1

// Time used by the GC mark phase (endCycle is invoked in the Mark Termination phase)
assistDuration := nanotime() - c.markStartTime

// CPU utilization of the GC mark phase, target value is 0.25
utilization := gcGoalUtilization
if assistDuration > 0 {
	// assistTime is the total time G's spent assisting GC marking
	// (nanoseconds spent in mutator assists during this cycle)
	// Additional CPU utilization = assist time / (mark phase duration * number of P)
	utilization += float64(c.assistTime) / float64(assistDuration*int64(gomaxprocs))
}

// Trigger coefficient offset = target growth rate - original trigger coefficient -
//   CPU utilization / target CPU utilization * (actual growth rate - original trigger coefficient)
// Analysis of the parameters:
// - the larger the actual growth rate, the smaller the offset; below zero the next GC triggers earlier
// - the larger the CPU utilization, the smaller the offset; below zero the next GC triggers earlier
// - the larger the original trigger coefficient, the smaller the offset; below zero the next GC triggers earlier
triggerError := goalGrowthRatio - memstats.triggerRatio - utilization/gcGoalUtilization*(actualGrowthRatio-memstats.triggerRatio)

// Adjust the trigger coefficient by the offset, but only by half of it at a time (gradual adjustment)
triggerRatio := memstats.triggerRatio + triggerGain*triggerError

The "target Heap growth rate" in the formula can be adjusted by setting the environment variable GOGC. The default value is 100. Increasing its value can reduce the trigger of GC.
Setting "GOGC=off" can completely turn off GC.

The judgment code of gcTriggerTime is as follows:

lastgc := int64(atomic.Load64(&memstats.last_gc_nanotime))
return lastgc != 0 && t.now-lastgc > forcegcperiod

forcegcperiod is defined as 2 minutes, that is, GC is forcibly triggered if it has not been executed for 2 minutes.

Definition of tricolor (black, grey, white)

The best article I have read explaining the concept of the "three colors" in tricolor GC is this article; I strongly recommend reading its explanation first.
The "three colors" can be simply understood as:

  • Black: The object has been marked in this GC, and the child objects contained in this object have also been marked.
  • Gray: The object is marked in this GC, but the child objects contained in this object are not marked.
  • White: Objects are not marked in this GC

In go, objects do not have a color attribute; the three colors are just descriptions of their state.
A white object has a corresponding bit of 0 in its span's gcmarkBits.
A gray object has a corresponding bit of 1 in its span's gcmarkBits, and the object is in a mark queue.
A black object has a corresponding bit of 1 in its span's gcmarkBits, and the object has been removed from the mark queue and processed.
When GC completes, gcmarkBits is moved to allocBits and a fresh all-zero bitmap is allocated, so all black objects become white again.
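Conceptually, the marking driven by the grey queue can be sketched as follows; this is a simplification with an explicit marked flag, whereas the runtime uses gcmarkBits plus per-P work buffers rather than a single queue:

// Conceptual sketch of tricolor marking, not the runtime implementation.
type object struct {
	marked   bool      // corresponds to the object's bit in gcmarkBits
	children []*object // pointers found via the bitmap region
}

// markAll greys the roots, then repeatedly blackens grey objects by
// scanning their children; anything never reached stays white.
func markAll(roots []*object) {
	grey := append([]*object(nil), roots...) // grey = marked and still queued
	for _, r := range roots {
		r.marked = true
	}
	for len(grey) > 0 {
		obj := grey[len(grey)-1] // take a grey object from the queue
		grey = grey[:len(grey)-1]
		for _, child := range obj.children {
			if !child.marked { // a white child: mark it and queue it
				child.marked = true
				grey = append(grey, child)
			}
		}
		// obj is now black: marked and no longer in the queue.
	}
}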

Write Barrier

Because go supports parallel GC, the GC scan and go code can run at the same time. The problem is that the go code may change references between objects while the GC is scanning.
For example, suppose that when scanning starts the root objects are A and B, and B holds a pointer to C. The GC scans A first, then the go code makes B hand C's pointer over to A; when the GC then scans B, C will never be scanned.
To avoid this problem, go enables the Write Barrier during the GC mark phase.

With the Write Barrier enabled, when B hands C's pointer to A, the GC treats C as live in this round of scanning.
Even if A drops C later, C will only be reclaimed in the next round.
The write barrier is enabled only for pointers and only during the GC mark phase; at other times the value is simply written to the target address.

Since 1.9, go has enabled the Hybrid Write Barrier, whose pseudocode is as follows:

writePointer(slot, ptr):
    shade(*slot)
    if any stack is grey:
        shade(ptr)
    *slot = ptr

When a pointer is written to a slot, the hybrid write barrier marks both the "original pointer" (the slot's old value) and the "new pointer" being written.

The reason for marking the original pointer is that other running threads may at the same time have copied the pointer's value into a register or a local variable on a stack.
Copying a pointer into a register or a stack local does not go through the write barrier, so it is possible for a pointer to escape marking. Consider the following scenario:

[go] b = obj
[go] oldx = nil
[gc] scan oldx...
[go] oldx = b.x  // copy b.x to a local variable, bypassing the write barrier
[go] b.x = ptr   // the write barrier should mark the original value of b.x
[gc] scan b...   // if the write barrier did not mark the original value, oldx will never be scanned

The reason for marking the new pointer is that other running threads may move the pointer to a location that has already been scanned. Consider the following scenario:

[go] a = ptr
[go] b = obj
[gc] scan b...
[go] b.x = a     // the write barrier should mark the new value of b.x
[go] a = nil
[gc] scan a...   // if the write barrier did not mark the new value, ptr will never be scanned

Because the hybrid write barrier removes the need for the GC to rescan the stack of each G after concurrent marking, it reduces the STW time of Mark Termination.
Besides the write barrier, all objects newly allocated during GC immediately turn black, as can be seen in the mallocgc function above.

Assisted GC(mutator assist)

To prevent the heap from growing too fast, if a concurrently running G allocates memory while the GC is executing, that G is required to assist the GC with part of its work.
A G that runs concurrently during GC is called a "mutator"; the "mutator assist" mechanism is how such a G assists the GC with part of its work.

There are two kinds of assist work: Mark and Sweep.
The trigger for assisted marking can be seen in the mallocgc function above; when triggered, the G helps scan a "workload" amount of memory, calculated as:

debtBytes * assistWorkPerByte

That is, the allocated size multiplied by the coefficient assistWorkPerByte. assistWorkPerByte is calculated in the revise function, and the formula is:

// Remaining scan work = bytes expected to need scanning - scan work already done
scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
if scanWorkExpected < 1000 {
	scanWorkExpected = 1000
}

// Heap distance to the GC goal = expected heap size when GC should complete - current heap size
// Note that next_gc is calculated differently from gc_trigger: next_gc equals heap_marked * (1 + gcpercent/100)
heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
if heapDistance <= 0 {
	heapDistance = 1
}

// Scan work required per byte allocated = remaining scan work / heap distance to the goal
c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)

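As a rough illustration of how this coefficient ends up being used, here is a standalone calculation with purely hypothetical numbers (the values of scanWorkExpected, heapDistance and debtBytes below are made up; the real accounting lives inside the runtime):

package main

import "fmt"

func main() {
    // Hypothetical figures, only to show the shape of the formula above.
    const scanWorkExpected = 4 << 20 // 4MB of heap still expected to need scanning
    const heapDistance = 16 << 20    // 16MB of allocation headroom before the heap goal

    assistWorkPerByte := float64(scanWorkExpected) / float64(heapDistance)

    // A goroutine that has allocated 64KB while in debt would be asked to
    // help scan roughly debtBytes * assistWorkPerByte bytes of heap.
    debtBytes := 64 << 10
    fmt.Printf("assist scan work: %.0f bytes\n", float64(debtBytes)*assistWorkPerByte)
}
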
Assist sweeping, unlike assist marking, is checked only when a new span is requested, whereas assist marking is checked on every object allocation.
The trigger for assist sweeping can be seen in the cacheSpan function above. When triggered, the G helps sweep an amount of pages calculated with the following formula:

spanBytes * sweepPagesPerByte // Not exactly the same. Look specifically at the deductSweepCredit function

That is, the number of bytes allocated multiplied by the coefficient sweepPagesPerByte. sweepPagesPerByte is calculated in the gcSetTriggerRatio function, with the following formula:

// Current heap size
heapLiveBasis := atomic.Load64(&memstats.heap_live)
// Heap distance to the trigger = heap size at which the next GC triggers - current heap size
heapDistance := int64(trigger) - int64(heapLiveBasis)
heapDistance -= 1024 * 1024
if heapDistance < _PageSize {
    heapDistance = _PageSize
}
// Number of pages already swept
pagesSwept := atomic.Load64(&mheap_.pagesSwept)
// Unswept pages = pages in use - pages swept
sweepDistancePages := int64(mheap_.pagesInUse) - int64(pagesSwept)
if sweepDistancePages <= 0 {
    mheap_.sweepPagesPerByte = 0
} else {
    // Pages to sweep per byte allocated = unswept pages / heap distance to the GC trigger
    mheap_.sweepPagesPerByte = float64(sweepDistancePages) / float64(heapDistance)
}

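For a feel of the magnitudes involved here as well, a standalone calculation with made-up numbers (the precise credit handling is in the deductSweepCredit function mentioned above):

package main

import "fmt"

func main() {
    // Hypothetical figures, only to show the shape of the formula above.
    const pageSize = 8 << 10          // 8KB pages, as used by the go heap
    sweepDistancePages := int64(2048) // pages still unswept
    heapDistance := int64(64 << 20)   // 64MB of allocation left before the GC trigger

    sweepPagesPerByte := float64(sweepDistancePages) / float64(heapDistance)

    // A goroutine that requests a 4-page (32KB) span would be asked to help
    // sweep roughly spanBytes * sweepPagesPerByte pages.
    spanBytes := int64(4 * pageSize)
    fmt.Printf("assist sweep: about %.2f pages\n", float64(spanBytes)*sweepPagesPerByte)
}
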
Root object

In the mark phase of GC, the first things to be marked are the "root objects"; every object reachable from a root object is considered alive.
The root objects include global variables, the variables on each G's stack, and so on. The GC scans the roots first and then every object reachable from them.
Scanning the root objects involves a series of jobs, defined in https://github.com/golang/go/blob/go1.9.2/src/runtime/mgcmark.go#L54:

  • Fixed Roots: special scanning work
    • Fixed Root Finalizers: scan the finalizer queue
    • Fixed Root FreeGStacks: free the stacks of Gs that have terminated
  • Flush Cache Roots: release all spans in mcache (requires STW)
  • Data Roots: scan the readable and writable global variables in the data section
  • BSS Roots: scan the global variables in the bss section
  • Span Roots: scan the special objects in each span (the finalizer list)
  • Stack Roots: scan the stack of each G

The Mark phase performs "Fixed Roots", "Data Roots", "BSS Roots", "Span Roots" and "Stack Roots".
The Mark Termination phase performs "Fixed Roots" and "Flush Cache Roots".

Mark queue

In the mark phase, the GC uses a "mark queue" to ensure that every object reachable from the roots is marked; the "grey" objects mentioned earlier are exactly the objects sitting in this queue.
For example, if there are currently three root objects [A, B, C], they are put into the mark queue when the roots are scanned:

work queue: [A, B, C]

The background mark worker takes A out of the mark queue; A references D, so D is put into the queue:

work queue: [B, C, D]

The background mark worker takes B out of the queue; B also references D, but D's bit in gcmarkBits is already 1, so D is skipped:

work queue: [C, D]

If go code running concurrently allocates an object E, E is marked immediately but is not put into the mark queue (the runtime knows that E does not yet reference any other object).
The concurrently running go code then assigns object F to a field of object E; the write barrier marks F and adds it to the mark queue:

work queue: [C, D, F]

The background mark worker takes C out of the queue; C references no other objects, so nothing more needs to be done:

work queue: [D, F]

The background mark worker takes D out of the queue; D references X, so X is put into the queue:

work queue: [F, X]

The background mark worker takes F out of the queue; F references no other objects, so nothing needs to be done.
The worker then takes X out of the queue; X also references no other objects, so nothing needs to be done.
The mark queue is now empty, marking is complete, and the surviving objects are [A, B, C, D, E, F, X].

The real implementation is a little more involved than this description.
The mark queue is split into a global mark queue and a local mark queue for each P, similar to the goroutine run queues.
And once the mark queue becomes empty, the runtime still has to stop the world, disable the write barrier, and then check again whether the queue really is empty.
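
The queue-driven marking described above can be modelled with a few lines of ordinary Go. This is only a sketch: the real queue is the runtime's gcWork structure, split into per-P and global parts, and the mark bit lives in the span's gcmarkBits rather than in the object; obj, mark and the field names below are invented for the illustration.

package main

import "fmt"

// obj is a hypothetical heap object for this illustration only.
type obj struct {
    name   string
    refs   []*obj
    marked bool
}

// mark greys the roots, then drains the work queue, blackening each grey
// object and greying (enqueuing) any unmarked object it references.
func mark(roots []*obj) {
    queue := append([]*obj(nil), roots...)
    for _, r := range roots {
        r.marked = true
    }
    for len(queue) > 0 {
        o := queue[0] // take a grey object out of the queue
        queue = queue[1:]
        for _, ref := range o.refs {
            if !ref.marked { // the equivalent of checking gcmarkBits
                ref.marked = true
                queue = append(queue, ref)
            }
        }
    }
}

func main() {
    d := &obj{name: "D"}
    x := &obj{name: "X"}
    d.refs = []*obj{x}
    a := &obj{name: "A", refs: []*obj{d}}
    b := &obj{name: "B", refs: []*obj{d}}
    c := &obj{name: "C"}

    mark([]*obj{a, b, c})
    for _, o := range []*obj{a, b, c, d, x} {
        fmt.Println(o.name, "marked:", o.marked)
    }
}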

Source code analysis

go triggers GC starting from the gcStart function:

// gcStart transitions the GC from _GCoff to _GCmark (if // !mode.stwMark) or _GCmarktermination (if mode.stwMark) by // performing sweep termination and GC initialization. // // This may return without performing this transition in some cases, // such as when called on a system stack or with locks held. func gcStart(mode gcMode, trigger gcTrigger) { // Judging whether the current G is preemptive and not triggering GC when it is not preemptive // Since this is called from malloc and malloc is called in // the guts of a number of libraries that might be holding // locks, don't attempt to start GC in non-preemptible or // potentially unstable situations. mp := acquirem() if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" { releasem(mp) return } releasem(mp) mp = nil // Parallel Cleaning of Spans Not Cleaned by the Last GC Round // Pick up the remaining unswept/not being swept spans concurrently // // This shouldn't happen if we're being invoked in background // mode since proportional sweep should have just finished // sweeping everything, but rounding errors, etc, may leave a // few spans unswept. In forced mode, this is necessary since // GC can be forced at any point in the sweeping cycle. // // We check the transition condition continuously here in case // this G gets delayed in to the next GC cycle. for trigger.test() && gosweepone() != ^uintptr(0) { sweep.nbgsweep++ } // Lock, and then re-check whether the gcTrigger condition is valid. If not, no GC is triggered. // Perform GC initialization and the sweep termination // transition. semacquire(&work.startSema) // Re-check transition condition under transition lock. if !trigger.test() { semrelease(&work.startSema) return } // Record whether the trigger is mandatory. gcTriggerCycle is used by runtime.GC // For stats, check if this GC was forced by the user. work.userForced = trigger.kind == gcTriggerAlways || trigger.kind == gcTriggerCycle // Determine whether a parameter forbidding parallel GC is specified // In gcstoptheworld debug mode, upgrade the mode accordingly. // We do this after re-checking the transition condition so // that multiple goroutines that detect the heap trigger don't // start multiple STW GCs. if mode == gcBackgroundMode { if debug.gcstoptheworld == 1 { mode = gcForceMode } else if debug.gcstoptheworld == 2 { mode = gcForceBlockMode } } // Ok, we're doing it! Stop everybody else semacquire(&worldsema) // Tracking processing if trace.enabled { traceGCStart() } // Start Background Scan Task (G) if mode == gcBackgroundMode { gcBgMarkStartWorkers() } // Reset tag-related status gcResetMarkState() // reset parameters work.stwprocs, work.maxprocs = gcprocs(), gomaxprocs work.heap0 = atomic.Load64(&memstats.heap_live) work.pauseNS = 0 work.mode = mode // Record start time now := nanotime() work.tSweepTerm = now work.pauseStart = now // Stop all running G and prohibit them from running systemstack(stopTheWorldWithSema) // !!!!!!!!!!!!!!!! // The world has stopped (STW)... // !!!!!!!!!!!!!!!! // Clean up the span that was not cleaned by the last GC to ensure that the last GC is completed // Finish sweep before we start concurrent scan. systemstack(func() { finishsweep_m() }) // Clean sched.sudogcache and sched.deferpool // clearpools before we start the GC. If we wait they memory will not be // reclaimed until the next GC cycle. 
clearpools() // Increase GC count work.cycles++ // Judging Parallel GC Mode if mode == gcBackgroundMode { // Do as much work concurrently as possible // Marking a new round of GC has begun gcController.startCycle() work.heapGoal = memstats.next_gc // Set the GC state in the global variable to _GCmark // Then enable the write barrier // Enter concurrent mark phase and enable // write barriers. // // Because the world is stopped, all Ps will // observe that write barriers are enabled by // the time we start the world and begin // scanning. // // Write barriers must be enabled before assists are // enabled because they must be enabled before // any non-leaf heap objects are marked. Since // allocations are blocked until assists can // happen, we want enable assists as early as // possible. setGCPhase(_GCmark) // Counting of Reset Background Markup Tasks gcBgMarkPrepare() // Must happen before assist enable. // Calculate the number of tasks to scan the root object gcMarkRootPrepare() // Mark all tiny alloc objects waiting for merge // Mark all active tinyalloc blocks. Since we're // allocating from these, they need to be black like // other allocations. The alternative is to blacken // the tiny block on every allocation from it, which // would slow down the tiny allocator. gcMarkTinyAllocs() // Enabling Auxiliary GC // At this point all Ps have enabled the write // barrier, thus maintaining the no white to // black invariant. Enable mutator assists to // put back-pressure on fast allocating // mutators. atomic.Store(&gcBlackenEnabled, 1) // Record the start time of the tag // Assists and workers can start the moment we start // the world. gcController.markStartTime = now // Restart the world // The background markup task created earlier will start working, and after all the background markup tasks have completed their work, they will enter the completion markup stage. // Concurrent mark. systemstack(startTheWorldWithSema) // !!!!!!!!!!!!!!! // The world has been restarted. // !!!!!!!!!!!!!!! // How long was the record stopped, and when the markup phase started now = nanotime() work.pauseNS += now - work.pauseStart work.tMark = now } else { // Not Parallel GC Mode // Record the time at which the markup phase begins t := nanotime() work.tMark, work.tMarkTerm = t, t work.heapGoal = work.heap0 // Skip the markup phase and execute the completion markup phase // All markup work will be performed in a state where the world has stopped // (The markup phase sets work.markrootDone=true, if skipped, its value is false, and the completion of the markup phase performs all the work.) // Completing the markup phase will restart the world // Perform mark termination. This will restart the world. gcMarkTermination(memstats.triggerRatio) } semrelease(&work.startSema) }

Next, the functions called by gcStart are analyzed one by one. It helps to follow along with the diagram in the "Process of Recycling Objects" section above.

The gcBgMarkStartWorkers function starts the background mark workers, one for each P:

// gcBgMarkStartWorkers prepares background mark worker goroutines.
// These goroutines will not run until the mark phase, but they must
// be started while the work is not stopped and from a regular G
// stack. The caller must hold worldsema.
func gcBgMarkStartWorkers() {
    // Background marking is performed by per-P G's. Ensure that
    // each P has a background GC G.
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        // If it has already been started, do not start it again
        if p.gcBgMarkWorker == 0 {
            go gcBgMarkWorker(p)
            // Wait for the worker to signal the bgMarkReady semaphore before continuing
            notetsleepg(&work.bgMarkReady, -1)
            noteclear(&work.bgMarkReady)
        }
    }
}

Although a background mark worker is started for each P, only about 25% of them work at the same time. This is enforced in findRunnableGCWorker, which is called when an M acquires a G to run:

// findRunnableGCWorker returns the background mark worker for _p_ if it // should be run. This must only be called when gcBlackenEnabled != 0. func (c *gcControllerState) findRunnableGCWorker(_p_ *p) *g { if gcBlackenEnabled == 0 { throw("gcControllerState.findRunnable: blackening not enabled") } if _p_.gcBgMarkWorker == 0 { // The mark worker associated with this P is blocked // performing a mark transition. We can't run it // because it may be on some other run or wait queue. return nil } if !gcMarkWorkAvailable(_p_) { // No work to be done right now. This can happen at // the end of the mark phase when there are still // assists tapering off. Don't bother running a worker // now because it'll just return immediately. return nil } // The corresponding value of atomic reduction is returned true if the reduction is greater than or equal to zero, otherwise false is returned. decIfPositive := func(ptr *int64) bool { if *ptr > 0 { if atomic.Xaddint64(ptr, -1) >= 0 { return true } // We lost a race atomic.Xaddint64(ptr, +1) } return false } // Reduce dedicated Mark Workers Needed, and the pattern for background markup tasks when successful is Dedicated // Dedicated Mark Workers Needed is 25% of the current number of P to remove decimal points // See startCycle function for more details if decIfPositive(&c.dedicatedMarkWorkersNeeded) { // This P is now dedicated to marking until the end of // the concurrent mark phase. _p_.gcMarkWorkerMode = gcMarkWorkerDedicatedMode } else { // Reduce fractional Mark Workers Needed. The pattern for successful background markup tasks is Fractional // The above calculation shows that fractional Mark Workers Needed is 1 if the decimal point is followed by a numerical value (not divisible), otherwise it is 0. // See startCycle function for more details // For example, four P will perform tasks in one Dedicated mode, and five P will perform tasks in one Dedicated mode and one Fractional mode. if !decIfPositive(&c.fractionalMarkWorkersNeeded) { // No more workers are need right now. return nil } // Judge whether the cpu occupancy rate exceeds the budgeted value according to the execution time of tasks in Dedicated mode, and do not start when the cpu occupancy rate exceeds the budgeted value // This P has picked the token for the fractional worker. // Is the GC currently under or at the utilization goal? // If so, do more work. // // We used to check whether doing one time slice of work // would remain under the utilization goal, but that has the // effect of delaying work until the mutator has run for // enough time slices to pay for the work. During those time // slices, write barriers are enabled, so the mutator is running slower. // Now instead we do the work whenever we're under or at the // utilization work and pay for it by letting the mutator run later. // This doesn't change the overall utilization averages, but it // front loads the GC work so that the GC finishes earlier and // write barriers can be turned off sooner, effectively giving // the mutator a faster machine. // // The old, slower behavior can be restored by setting // gcForcePreemptNS = forcePreemptNS. const gcForcePreemptNS = 0 // TODO(austin): We could fast path this and basically // eliminate contention on c.fractionalMarkWorkersNeeded by // precomputing the minimum time at which it's worth // next scheduling the fractional worker. Then Ps // don't have to fight in the window where we've // passed that deadline and no one has started the // worker yet. 
// // TODO(austin): Shorter preemption interval for mark // worker to improve fairness and give this // finer-grained control over schedule? now := nanotime() - gcController.markStartTime then := now + gcForcePreemptNS timeUsed := c.fractionalMarkTime + gcForcePreemptNS if then > 0 && float64(timeUsed)/float64(then) > c.fractionalUtilizationGoal { // Nope, we'd overshoot the utilization goal atomic.Xaddint64(&c.fractionalMarkWorkersNeeded, +1) return nil } _p_.gcMarkWorkerMode = gcMarkWorkerFractionalMode } // Arrange background markup task execution // Run the background mark worker gp := _p_.gcBgMarkWorker.ptr() casgstatus(gp, _Gwaiting, _Grunnable) if trace.enabled { traceGoUnpark(gp, 0) } return gp }

The gcResetMarkState function resets mark-related state:

// gcResetMarkState resets global state prior to marking (concurrent
// or STW) and resets the stack scan state of all Gs.
//
// This is safe to do without the world stopped because any Gs created
// during or after this will start out in the reset state.
func gcResetMarkState() {
    // This may be called during a concurrent phase, so make sure
    // allgs doesn't change.
    lock(&allglock)
    for _, gp := range allgs {
        gp.gcscandone = false  // set to true in gcphasework
        gp.gcscanvalid = false // stack has not been scanned
        gp.gcAssistBytes = 0
    }
    unlock(&allglock)
    work.bytesMarked = 0
    work.initialHeapLive = atomic.Load64(&memstats.heap_live)
    work.markrootDone = false
}

The stopTheWorldWithSema function stops the world; it must be run on g0:

// stopTheWorldWithSema is the core implementation of stopTheWorld. // The caller is responsible for acquiring worldsema and disabling // preemption first and then should stopTheWorldWithSema on the system // stack: // // semacquire(&worldsema, 0) // m.preemptoff = "reason" // systemstack(stopTheWorldWithSema) // // When finished, the caller must either call startTheWorld or undo // these three operations separately: // // m.preemptoff = "" // systemstack(startTheWorldWithSema) // semrelease(&worldsema) // // It is allowed to acquire worldsema once and then execute multiple // startTheWorldWithSema/stopTheWorldWithSema pairs. // Other P's are able to execute between successive calls to // startTheWorldWithSema and stopTheWorldWithSema. // Holding worldsema causes any other goroutines invoking // stopTheWorld to block. func stopTheWorldWithSema() { _g_ := getg() // If we hold a lock, then we won't be able to stop another M // that is blocked trying to acquire the lock. if _g_.m.locks > 0 { throw("stopTheWorld: holding locks") } lock(&sched.lock) // Number of P to stop sched.stopwait = gomaxprocs // Set the gc wait tag, which will enter the wait when you see it during scheduling atomic.Store(&sched.gcwaiting, 1) // Preempt all running G preemptall() // Stop the current P // stop current P _g_.m.p.ptr().status = _Pgcstop // Pgcstop is only diagnostic. // Reduce the number of P that needs to stop (current P counts one) sched.stopwait-- // Preempt all P in Psyscall state to prevent them from re-participating in scheduling // try to retake all P's in Psyscall status for i := 0; i < int(gomaxprocs); i++ { p := allp[i] s := p.status if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) { if trace.enabled { traceGoSysBlock(p) traceProcStop(p) } p.syscalltick++ sched.stopwait-- } } // Prevent all idle P from re-participating in scheduling // stop idle P's for { p := pidleget() if p == nil { break } p.status = _Pgcstop sched.stopwait-- } wait := sched.stopwait > 0 unlock(&sched.lock) // If P still needs to stop, wait for them to stop // wait for remaining P's to stop voluntarily if wait { for { // Loop Wait + Preempt All Running G // wait for 100us, then try to re-preempt in case of any races if notetsleep(&sched.stopnote, 100*1000) { noteclear(&sched.stopnote) break } preemptall() } } // Logical Correctness Check // sanity checks bad := "" if sched.stopwait != 0 { bad = "stopTheWorld: not stopped (stopwait != 0)" } else { for i := 0; i < int(gomaxprocs); i++ { p := allp[i] if p.status != _Pgcstop { bad = "stopTheWorld: not stopped (status != _Pgcstop)" } } } if atomic.Load(&freezing) != 0 { // Some other thread is panicking. This can cause the // sanity checks above to fail if the panic happens in // the signal handler on a stopped thread. Either way, // we should halt this thread. lock(&deadlock) lock(&deadlock) } if bad != "" { throw(bad) } // At this point, all running G becomes to be run, and all P cannot be acquired by M. // That is to say, all go codes (except the current ones) will stop running and new go codes will not be able to run. }

The finishsweep_m function sweeps any spans left unswept by the previous GC round, to make sure the previous cycle is fully finished:

// finishsweep_m ensures that all spans are swept.
//
// The world must be stopped. This ensures there are no sweeps in
// progress.
//
//go:nowritebarrier
func finishsweep_m() {
    // sweepone takes one unswept span and sweeps it
    // The details are analyzed in the sweep phase below
    // Sweeping must be complete before marking commences, so
    // sweep any unswept spans. If this is a concurrent GC, there
    // shouldn't be any spans left to sweep, so this should finish
    // instantly. If GC was forced before the concurrent sweep
    // finished, there may be spans to sweep.
    for sweepone() != ^uintptr(0) {
        sweep.npausesweep++
    }
    // After all spans have been swept, a new markbit epoch is started
    // This function is the key to how gcmarkBits and allocBits are allocated and reused in a span. The flow is:
    // - the span is given gcmarkBits and allocBits
    // - the span finishes sweeping
    //   - the old allocBits are no longer used
    //   - gcmarkBits becomes allocBits
    //   - a new gcmarkBits is allocated
    // - a new markbit epoch starts
    // - the span finishes sweeping, as above
    // - another new markbit epoch starts
    //   - bitmaps from two epochs ago are no longer referenced and can be reused
    nextMarkBitArenaEpoch()
}

The clearpools function clears sched.sudogcache and sched.deferpool so that their memory can be reclaimed:

func clearpools() { // clear sync.Pools if poolcleanup != nil { poolcleanup() } // Clear central sudog cache. // Leave per-P caches alone, they have strictly bounded size. // Disconnect cached list before dropping it on the floor, // so that a dangling ref to one entry does not pin all of them. lock(&sched.sudoglock) var sg, sgnext *sudog for sg = sched.sudogcache; sg != nil; sg = sgnext { sgnext = sg.next sg.next = nil } sched.sudogcache = nil unlock(&sched.sudoglock) // Clear central defer pools. // Leave per-P pools alone, they have strictly bounded size. lock(&sched.deferlock) for i := range sched.deferpool { // disconnect cached list before dropping it on the floor, // so that a dangling ref to one entry does not pin all of them. var d, dlink *_defer for d = sched.deferpool[i]; d != nil; d = dlink { dlink = d.link d.link = nil } sched.deferpool[i] = nil } unlock(&sched.deferlock) }

The startCycle function marks the beginning of a new GC cycle and computes estimates for it:

// startCycle resets the GC controller's state and computes estimates // for a new GC cycle. The caller must hold worldsema. func (c *gcControllerState) startCycle() { c.scanWork = 0 c.bgScanCredit = 0 c.assistTime = 0 c.dedicatedMarkTime = 0 c.fractionalMarkTime = 0 c.idleMarkTime = 0 // Camouflage the value of heap_marked if the value of gc_trigger is small, to prevent subsequent misadjustment of triggerRatio // If this is the first GC cycle or we're operating on a very // small heap, fake heap_marked so it looks like gc_trigger is // the appropriate growth from heap_marked, even though the // real heap_marked may not have a meaningful value (on the // first cycle) or may be much smaller (resulting in a large // error response). if memstats.gc_trigger <= heapminimum { memstats.heap_marked = uint64(float64(memstats.gc_trigger) / (1 + memstats.triggerRatio)) } // Recalculate next_gc. Note that the calculation of next_gc is different from that of gc_trigger. // Re-compute the heap goal for this cycle in case something // changed. This is the same calculation we use elsewhere. memstats.next_gc = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100 if gcpercent < 0 { memstats.next_gc = ^uint64(0) } // Ensure that there is at least 1MB between next_gc and heap_live // Ensure that the heap goal is at least a little larger than // the current live heap size. This may not be the case if GC // start is delayed or if the allocation that pushed heap_live // over gc_trigger is large or if the trigger is really close to // GOGC. Assist is proportional to this distance, so enforce a // minimum distance, even if it means going over the GOGC goal // by a tiny bit. if memstats.next_gc < memstats.heap_live+1024*1024 { memstats.next_gc = memstats.heap_live + 1024*1024 } // Calculate the number of background markup tasks that can be performed simultaneously // Designed Mark Workers Needed equals 25% of the number of P to remove decimal points // If divisible, fractional Mark Workers Needed equals zero or 1 // Total Utilization Goal is the target value of P occupied by GC (for example, when P is five, the target is 1.25 P) // Fractional Utilization Goal is the target value of P for Fractiona-mode tasks (for example, when P is five, the target is 0.25 P). // Compute the total mark utilization goal and divide it among // dedicated and fractional workers. totalUtilizationGoal := float64(gomaxprocs) * gcGoalUtilization c.dedicatedMarkWorkersNeeded = int64(totalUtilizationGoal) c.fractionalUtilizationGoal = totalUtilizationGoal - float64(c.dedicatedMarkWorkersNeeded) if c.fractionalUtilizationGoal > 0 { c.fractionalMarkWorkersNeeded = 1 } else { c.fractionalMarkWorkersNeeded = 0 } // Time Statistics for Auxiliary GC in Reset P // Clear per-P state for _, p := range &allp { if p == nil { break } p.gcAssistTime = 0 } // Calculating parameters of auxiliary GC // Refer to the above analysis of the formula for calculating assistWorkPerByte // Compute initial values for controls that are updated // throughout the cycle. c.revise() if debug.gcpacertrace > 0 { print("pacer: assist ratio=", c.assistWorkPerByte, " (scan ", memstats.heap_scan>>20, " MB in ", work.initialHeapLive>>20, "->", memstats.next_gc>>20, " MB)", " workers=", c.dedicatedMarkWorkersNeeded, "+", c.fractionalMarkWorkersNeeded, "\n") } }

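As a quick check of the comments above, the worker-count arithmetic can be reproduced on its own. gcGoalUtilization is the constant 0.25 in this version of the runtime; markWorkerPlan is a made-up helper for illustration, not runtime code:

package main

import "fmt"

// markWorkerPlan mirrors the arithmetic in startCycle: the dedicated worker
// count is the integer part of GOMAXPROCS * 0.25, and one fractional worker
// is added whenever there is a remainder.
func markWorkerPlan(gomaxprocs int) (dedicated, fractional int, fractionalGoal float64) {
    const gcGoalUtilization = 0.25
    totalUtilizationGoal := float64(gomaxprocs) * gcGoalUtilization
    dedicated = int(totalUtilizationGoal)
    fractionalGoal = totalUtilizationGoal - float64(dedicated)
    if fractionalGoal > 0 {
        fractional = 1
    }
    return
}

func main() {
    for _, procs := range []int{1, 4, 5, 8} {
        d, f, g := markWorkerPlan(procs)
        fmt.Printf("GOMAXPROCS=%d: dedicated=%d fractional=%d (fractional goal %.2f P)\n", procs, d, f, g)
    }
}
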
The setGCPhase function updates the global variables that record the current GC phase and whether the write barrier is enabled:

//go:nosplit
func setGCPhase(x uint32) {
    atomic.Store(&gcphase, x)
    writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
    writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
}

The gcBgMarkPrepare function resets the counters used by the background mark workers:

// gcBgMarkPrepare sets up state for background marking.
// Mutator assists must not yet be enabled.
func gcBgMarkPrepare() {
    // Background marking will stop when the work queues are empty
    // and there are no more workers (note that, since this is
    // concurrent, this may be a transient state, but mark
    // termination will clean it up). Between background workers
    // and assists, we don't really know how many workers there
    // will be, so we pretend to have an arbitrarily large number
    // of workers, almost all of which are "waiting". While a
    // worker is working it decrements nwait. If nproc == nwait,
    // there are no workers.
    work.nproc = ^uint32(0)
    work.nwait = ^uint32(0)
}

The gcMarkRootPrepare function calculates the number of root-scanning jobs:

// gcMarkRootPrepare queues root scanning jobs (stacks, globals, and // some miscellany) and initializes scanning-related state. // // The caller must have call gcCopySpans(). // // The world must be stopped. // //go:nowritebarrier func gcMarkRootPrepare() { // The task of releasing all span s in mcache is performed only in the mark termination phase if gcphase == _GCmarktermination { work.nFlushCacheRoots = int(gomaxprocs) } else { work.nFlushCacheRoots = 0 } // The function for calculating the number of block s, rootBlockBytes, is 256KB // Compute how many data and BSS root blocks there are. nBlocks := func(bytes uintptr) int { return int((bytes + rootBlockBytes - 1) / rootBlockBytes) } work.nDataRoots = 0 work.nBSSRoots = 0 // data and bss are scanned only once per round of GC // Parallel GC scans in the background markup task and does not scan in the markup termination phase. // Non-parallel GC scans during mark termination // Only scan globals once per cycle; preferably concurrently. if !work.markrootDone { // Calculating the number of tasks for scanning readable and writable global variables for _, datap := range activeModules() { nDataRoots := nBlocks(datap.edata - datap.data) if nDataRoots > work.nDataRoots { work.nDataRoots = nDataRoots } } // Calculate the number of tasks that scan read-only global variables for _, datap := range activeModules() { nBSSRoots := nBlocks(datap.ebss - datap.bss) if nBSSRoots > work.nBSSRoots { work.nBSSRoots = nBSSRoots } } } // The finalizer in span and the stacks of each G are scanned only once per round of GC // Ditto if !work.markrootDone { // Calculate the number of finalizer tasks in the scan span // On the first markroot, we need to scan span roots. // In concurrent GC, this happens during concurrent // mark and we depend on addfinalizer to ensure the // above invariants for objects that get finalizers // after concurrent mark. In STW GC, this will happen // during mark termination. // // We're only interested in scanning the in-use spans, // which will all be swept at this point. More spans // may be added to this list during concurrent GC, but // we only care about spans that were allocated before // this mark phase. work.nSpanRoots = mheap_.sweepSpans[mheap_.sweepgen/2%2].numBlocks() // Calculate the number of tasks that scan the stacks of each G // On the first markroot, we need to scan all Gs. Gs // may be created after this point, but it's okay that // we ignore them because they begin life without any // roots, so there's nothing to scan, and any roots // they create during the concurrent phase will be // scanned during mark termination. During mark // termination, allglen isn't changing, so we'll scan // all Gs. work.nStackRoots = int(atomic.Loaduintptr(&allglen)) } else { // We've already scanned span roots and kept the scan // up-to-date during concurrent mark. work.nSpanRoots = 0 // The hybrid barrier ensures that stacks can't // contain pointers to unmarked objects, so on the // second markroot, there's no need to scan stacks. work.nStackRoots = 0 if debug.gcrescanstacks > 0 { // Scan stacks anyway for debugging. work.nStackRoots = int(atomic.Loaduintptr(&allglen)) } } // Calculate the total number of tasks // Background markup tasks increment the markrootNext atomically to decide which task to do // This numerical approach to locking free queues is clever, even though google engineers don't feel good about it (see the analysis of the markroot function below). 
work.markrootNext = 0 work.markrootJobs = uint32(fixedRootCount + work.nFlushCacheRoots + work.nDataRoots + work.nBSSRoots + work.nSpanRoots + work.nStackRoots) }

The gcMarkTinyAllocs function marks all tiny-alloc blocks that are waiting to be merged:

// gcMarkTinyAllocs greys all active tiny alloc blocks.
//
// The world must be stopped.
func gcMarkTinyAllocs() {
    for _, p := range &allp {
        if p == nil || p.status == _Pdead {
            break
        }
        c := p.mcache
        if c == nil || c.tiny == 0 {
            continue
        }
        // Mark the tiny block in each P's mcache
        // As seen in the mallocgc function above, tiny is the block currently used for merged tiny allocations
        _, hbits, span, objIndex := heapBitsForObject(c.tiny, 0, 0)
        gcw := &p.gcw
        // Mark the object alive and add it to the mark queue (the object turns grey)
        greyobject(c.tiny, 0, 0, hbits, span, gcw, objIndex)
        // gcBlackenPromptly indicates that local queues are disabled; if so, flush the work to the global queue
        if gcBlackenPromptly {
            gcw.dispose()
        }
    }
}

The startTheWorldWithSema function restarts the world:

func startTheWorldWithSema() { _g_ := getg() // Prohibit G from being preempted _g_.m.locks++ // disable preemption because it can be holding p in a local var // Determine the received network events (fd readable, writable, or error) and add the corresponding G to the queue to be run gp := netpoll(false) // non-blocking injectglist(gp) // Determine whether to start gc helper add := needaddgcproc() lock(&sched.lock) // Adjust the number of P if you want to change gomaxprocs // procresize returns the linked list of P with runnable tasks procs := gomaxprocs if newprocs != 0 { procs = newprocs newprocs = 0 } p1 := procresize(procs) // Cancel GC Waiting Mark sched.gcwaiting = 0 // If sysmon is waiting, wake it up if sched.sysmonwait != 0 { sched.sysmonwait = 0 notewakeup(&sched.sysmonnote) } unlock(&sched.lock) // Wake up P with a runnable task for p1 != nil { p := p1 p1 = p1.link.ptr() if p.m != 0 { mp := p.m.ptr() p.m = 0 if mp.nextp != 0 { throw("startTheWorld: inconsistent mp->nextp") } mp.nextp.set(p) notewakeup(&mp.park) } else { // Start M to run P. Do not start another M below. newm(nil, p) add = false } } // If there is an idle P and no spinning M, wake up or create an M. // Wakeup an additional proc in case we have excessive runnable goroutines // in local queues or in the global queue. If we don't, the proc will park itself. // If we have lots of excessive work, resetspinning will unpark additional procs as necessary. if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 { wakep() } // Start gc helper if add { // If GC could have used another helper proc, start one now, // in the hope that it will be available next time. // It would have been even better to start it before the collection, // but doing so requires allocating memory, so it's tricky to // coordinate. This lazy approach works out in practice: // we don't mind if the first couple gc rounds don't have quite // the maximum number of procs. newm(mhelpgc, nil) } // Allow G to be preempted _g_.m.locks-- // If the current G request is preempted, try again if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack _g_.stackguard0 = stackPreempt } }

After the world restarts, each M begins scheduling again. The scheduler prefers the findRunnableGCWorker function mentioned above, so from this point on roughly 25% of the Ps will be running background mark workers.
The background mark worker function is gcBgMarkWorker:

func gcBgMarkWorker(_p_ *p) { gp := getg() // Constructs for retrieving P after dormancy type parkInfo struct { m muintptr // Release this m on park. attach puintptr // If non-nil, attach to this p on park. } // We pass park to a gopark unlock function, so it can't be on // the stack (see gopark). Prevent deadlock from recursively // starting GC by disabling preemption. gp.m.preemptoff = "GC worker init" park := new(parkInfo) gp.m.preemptoff = "" // Set up the current M and prohibit preemption park.m.set(acquirem()) // Set the current P (P to be associated with) park.attach.set(_p_) // Notify gcBgMarkStartWorkers that they can continue processing // Inform gcBgMarkStartWorkers that this worker is ready. // After this point, the background mark worker is scheduled // cooperatively by gcController.findRunnable. Hence, it must // never be preempted, as this would put it into _Grunnable // and put it on a run queue. Instead, when the preempt flag // is set, this puts itself into _Gwaiting to be woken up by // gcController.findRunnable at the appropriate time. notewakeup(&work.bgMarkReady) for { // Let the current G go to sleep // Go to sleep until woken by gcController.findRunnable. // We can't releasem yet since even the call to gopark // may be preempted. gopark(func(g *g, parkp unsafe.Pointer) bool { park := (*parkInfo)(parkp) // Re-allow preemption // The worker G is no longer running, so it's // now safe to allow preemption. releasem(park.m.ptr()) // Setting the associated P // Set the current G to the gcBgMarkWorker member of P, and the next find Runnable GCWorker will use it // Do not sleep when settings fail // If the worker isn't attached to its P, // attach now. During initialization and after // a phase change, the worker may have been // running on a different P. As soon as we // attach, the owner P may schedule the // worker, so this must be done after the G is // stopped. if park.attach != 0 { p := park.attach.ptr() park.attach.set(nil) // cas the worker because we may be // racing with a new worker starting // on this P. if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) { // The P got a new worker. // Exit this worker. return false } } return true }, unsafe.Pointer(park), "GC worker (idle)", traceEvGoBlock, 0) // Check if the gcBgMarkWorker of P is consistent with the current G and terminate the current task when inconsistent // Loop until the P dies and disassociates this // worker (the P may later be reused, in which case // it will get a new worker) or we failed to associate. if _p_.gcBgMarkWorker.ptr() != gp { break } // Prohibit G from being preempted // Disable preemption so we can use the gcw. If the // scheduler wants to preempt us, we'll stop draining, // dispose the gcw, and then preempt. park.m.set(acquirem()) if gcBlackenEnabled == 0 { throw("gcBgMarkWorker: blackening not enabled") } // Record start time startTime := nanotime() decnwait := atomic.Xadd(&work.nwait, -1) if decnwait == work.nproc { println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc) throw("work.nwait was > work.nproc") } // Switch to g0 systemstack(func() { // Set the state of G to wait so that its stack can be scanned (two background tag tasks can scan each other's stack) // Mark our goroutine preemptible so its stack // can be scanned. This lets two mark workers // scan each other (otherwise, they would // deadlock). We must not modify anything on // the G stack. However, stack shrinking is // disabled for mark workers, so it is safe to // read from the G stack. 
casgstatus(gp, _Grunning, _Gwaiting) // Judging the Patterns of Background Markup Tasks switch _p_.gcMarkWorkerMode { default: throw("gcBgMarkWorker: unexpected gcMarkWorkerMode") case gcMarkWorkerDedicatedMode: // In this mode, P should concentrate on markup execution // Execute the tag until it is preempted, and you need to calculate the amount of scans in the background to reduce the number of auxiliary GC and wake-up waiting G gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit) // When preempted, kick all G s in the local run queue into the global run queue if gp.preempt { // We were preempted. This is // a useful signal to kick // everything out of the run // queue so it can run // somewhere else. lock(&sched.lock) for { gp, _ := runqget(_p_) if gp == nil { break } globrunqput(gp) } unlock(&sched.lock) } // Continue to execute the markup until there are no more tasks, and you need to calculate the amount of scans in the background to reduce the auxiliary GC and wake-up waiting G // Go back to draining, this time // without preemption. gcDrain(&_p_.gcw, gcDrainNoBlock|gcDrainFlushBgCredit) case gcMarkWorkerFractionalMode: // In this mode, P should execute markup appropriately // Execute the tag until it is preempted, and you need to calculate the amount of scans in the background to reduce the number of auxiliary GC and wake-up waiting G gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit) case gcMarkWorkerIdleMode: // In this mode, P executes markup only when it is free. // Execute markup until it is preempted or reaches a certain amount, and you need to calculate the amount of backend scanning to reduce auxiliary GC and G in wake-up waiting gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit) } // Restore the state of G to run casgstatus(gp, _Gwaiting, _Grunning) }) // If the local tag queue is marked forbidden, flush goes to the global tag queue // If we are nearing the end of mark, dispose // of the cache promptly. We must do this // before signaling that we're no longer // working so that other workers can't observe // no workers and no work while we have this // cached, and before we compute done. if gcBlackenPromptly { _p_.gcw.dispose() } // Cumulative time // Account for time. duration := nanotime() - startTime switch _p_.gcMarkWorkerMode { case gcMarkWorkerDedicatedMode: atomic.Xaddint64(&gcController.dedicatedMarkTime, duration) atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1) case gcMarkWorkerFractionalMode: atomic.Xaddint64(&gcController.fractionalMarkTime, duration) atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 1) case gcMarkWorkerIdleMode: atomic.Xaddint64(&gcController.idleMarkTime, duration) } // Was this the last worker and did we run out // of work? incnwait := atomic.Xadd(&work.nwait, +1) if incnwait > work.nproc { println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode, "work.nwait=", incnwait, "work.nproc=", work.nproc) throw("work.nwait > work.nproc") } // Determine whether all background markup tasks are completed and there are no more tasks // If this worker reached a background mark completion // point, signal the main GC goroutine. if incnwait == work.nproc && !gcMarkWorkAvailable(nil) { // Cancellation of association with P // Make this G preemptible and disassociate it // as the worker for this P so // findRunnableGCWorker doesn't try to // schedule it. 
_p_.gcBgMarkWorker.set(nil) // Allow G to be preempted releasem(park.m.ptr()) // Ready to enter the completion marking phase gcMarkDone() // Re-associate P before hibernation // Because the above is allowed to be preempted, it may become other P's by the time it arrives here. // If the relatedness of P fails, the task ends // Disable preemption and prepare to reattach // to the P. // // We may be running on a different P at this // point, so we can't reattach until this G is // parked. park.m.set(acquirem()) park.attach.set(_p_) } } }

The gcDrain function performs the marking work:

// gcDrain scans roots and objects in work buffers, blackening grey // objects until all roots and work buffers have been drained. // // If flags&gcDrainUntilPreempt != 0, gcDrain returns when g.preempt // is set. This implies gcDrainNoBlock. // // If flags&gcDrainIdle != 0, gcDrain returns when there is other work // to do. This implies gcDrainNoBlock. // // If flags&gcDrainNoBlock != 0, gcDrain returns as soon as it is // unable to get more work. Otherwise, it will block until all // blocking calls are blocked in gcDrain. // // If flags&gcDrainFlushBgCredit != 0, gcDrain flushes scan work // credit to gcController.bgScanCredit every gcCreditSlack units of // scan work. // //go:nowritebarrier func gcDrain(gcw *gcWork, flags gcDrainFlags) { if !writeBarrier.needed { throw("gcDrain phase incorrect") } gp := getg().m.curg // Do you want to return when you see the preemption sign preemptible := flags&gcDrainUntilPreempt != 0 // Whether to wait for a task when there is no task blocking := flags&(gcDrainUntilPreempt|gcDrainIdle|gcDrainNoBlock) == 0 // Whether to calculate the amount of backstage scanning to reduce auxiliary GC and G in wake-up waiting flushBgCredit := flags&gcDrainFlushBgCredit != 0 // Do you only perform a certain amount of work? idle := flags&gcDrainIdle != 0 // Record the initial number of scans initScanWork := gcw.scanWork // Scanning idleCheckThreshold(100000) objects checks for returns // idleCheck is the scan work at which to perform the next // idle check with the scheduler. idleCheck := initScanWork + idleCheckThreshold // If the root object is not scanned, the root object is scanned first // Drain root marking jobs. if work.markrootNext < work.markrootJobs { // If preemptible is marked, loops until preempted for !(preemptible && gp.preempt) { // Extract a value from the root object scan queue (atomic increment) job := atomic.Xadd(&work.markrootNext, +1) - 1 if job >= work.markrootJobs { break } // Perform root object scanning markroot(gcw, job) // If it is idle mode and has other work, it returns if idle && pollWork() { goto done } } } // The root object is already in the tag queue, consuming the tag queue // If preemptible is marked, loops until preempted // Drain heap marking jobs. for !(preemptible && gp.preempt) { // If the global tag queue is empty, divide part of the local tag queue work over // (If wbuf2 is not empty, move wbuf2 over, otherwise move half of wbuf1 over) // Try to keep work available on the global queue. We used to // check if there were waiting workers, but it's better to // just keep work available than to make workers wait. In the // worst case, we'll do O(log(_WorkbufSize)) unnecessary // balances. if work.full == 0 { gcw.balance() } // Get objects from the local tag queue, but not from the global tag queue var b uintptr if blocking { // Congestion acquisition b = gcw.get() } else { // Non-blocking acquisition b = gcw.tryGetFast() if b == 0 { b = gcw.tryGet() } } // Objects cannot be retrieved, the tag queue is empty, jumping out of the loop if b == 0 { // work barrier reached or tryGet failed. break } // Scan the acquired object scanobject(b, gcw) // If a certain number of objects have been scanned (the value of gcCreditSlack is 2000) // Flush background scan work credit to the global // account if we've accumulated enough locally so // mutator assists can draw on it. 
if gcw.scanWork >= gcCreditSlack { // Add the number of scanned objects to the global atomic.Xaddint64(&gcController.scanWork, gcw.scanWork) // Reduce the workload of auxiliary GC and G in wake-up waiting if flushBgCredit { gcFlushBgCredit(gcw.scanWork - initScanWork) initScanWork = 0 } idleCheck -= gcw.scanWork gcw.scanWork = 0 // If idle mode is in place and the amount of scans checked is reached, check if there are other tasks (G), and if so, jump out of the loop. if idle && idleCheck <= 0 { idleCheck += idleCheckThreshold if pollWork() { break } } } } // In blocking mode, write barriers are not allowed after this // point because we must preserve the condition that the work // buffers are empty. done: // Add the number of scanned objects to the global // Flush remaining scan work credit. if gcw.scanWork > 0 { atomic.Xaddint64(&gcController.scanWork, gcw.scanWork) // Reduce the workload of auxiliary GC and G in wake-up waiting if flushBgCredit { gcFlushBgCredit(gcw.scanWork - initScanWork) } gcw.scanWork = 0 } }

The markroot function performs the root object scanning:

// markroot scans the i'th root. // // Preemption must be disabled (because this uses a gcWork). // // nowritebarrier is only advisory here. // //go:nowritebarrier func markroot(gcw *gcWork, i uint32) { // Determine which tasks the extracted values correspond to // (google engineers find this ridiculous) // TODO(austin): This is a bit ridiculous. Compute and store // the bases in gcMarkRootPrepare instead of the counts. baseFlushCache := uint32(fixedRootCount) baseData := baseFlushCache + uint32(work.nFlushCacheRoots) baseBSS := baseData + uint32(work.nDataRoots) baseSpans := baseBSS + uint32(work.nBSSRoots) baseStacks := baseSpans + uint32(work.nSpanRoots) end := baseStacks + uint32(work.nStackRoots) // Note: if you add a case here, please also update heapdump.go:dumproots. switch { // Release all span s in mcache, requiring STW case baseFlushCache <= i && i < baseData: flushmcache(int(i - baseFlushCache)) // Scanning Readable and Writable Global Variables // Here, only the block corresponding to i is scanned, and bitmap data containing where the pointer is passed in at the time of scanning. case baseData <= i && i < baseBSS: for _, datap := range activeModules() { markrootBlock(datap.data, datap.edata-datap.data, datap.gcdatamask.bytedata, gcw, int(i-baseData)) } // Scanning read-only global variables // Here, only the block corresponding to i is scanned, and bitmap data containing where the pointer is passed in at the time of scanning. case baseBSS <= i && i < baseSpans: for _, datap := range activeModules() { markrootBlock(datap.bss, datap.ebss-datap.bss, datap.gcbssmask.bytedata, gcw, int(i-baseBSS)) } // Scanning destructor queue case i == fixedRootFinalizers: // Only do this once per GC cycle since we don't call // queuefinalizer during marking. if work.markrootDone { break } for fb := allfin; fb != nil; fb = fb.alllink { cnt := uintptr(atomic.Load(&fb.cnt)) scanblock(uintptr(unsafe.Pointer(&fb.fin[0])), cnt*unsafe.Sizeof(fb.fin[0]), &finptrmask[0], gcw) } // Release the stack of aborted G case i == fixedRootFreeGStacks: // Only do this once per GC cycle; preferably // concurrently. if !work.markrootDone { // Switch to the system stack so we can call // stackfree. systemstack(markrootFreeGStacks) } // Scanning for special objects in various span s (destructor list) case baseSpans <= i && i < baseStacks: // mark MSpan.specials markrootSpans(gcw, int(i-baseSpans)) // Scanning stacks for each G default: // Get the G that needs to be scanned // the rest is scanning goroutine stacks var gp *g if baseStacks <= i && i < end { gp = allgs[i-baseStacks] } else { throw("markroot: bad index") } // Record the waiting time for the start // remember when we've first observed the G blocked // needed only to output in traceback status := readgstatus(gp) // We are not in a scan state if (status == _Gwaiting || status == _Gsyscall) && gp.waitsince == 0 { gp.waitsince = work.tstart } // Switch to g0 (which may sweep to your own stack) // scang must be done on the system stack in case // we're trying to scan our own stack. systemstack(func() { // Determine whether the scanned stack is its own // If this is a self-scan, put the user G in // _Gwaiting to prevent self-deadlock. It may // already be in _Gwaiting if this is a mark // worker or we're in mark termination. 
userG := getg().m.curg selfScan := gp == userG && readgstatus(userG) == _Grunning // If you are scanning your own stack, switch status to wait to prevent deadlock if selfScan { casgstatus(userG, _Grunning, _Gwaiting) userG.waitreason = "garbage collection scan" } // Scanning G stack // TODO: scang blocks until gp's stack has // been scanned, which may take a while for // running goroutines. Consider doing this in // two phases where the first is non-blocking: // we scan the stacks we can and ask running // goroutines to scan themselves; and the // second blocks. scang(gp, gcw) // If you are scanning your own stack, switch back to running if selfScan { casgstatus(userG, _Gwaiting, _Grunning) } }) } }

The scang function scans a G's stack:

// scang blocks until gp's stack has been scanned. // It might be scanned by scang or it might be scanned by the goroutine itself. // Either way, the stack scan has completed when scang returns. func scang(gp *g, gcw *gcWork) { // Invariant; we (the caller, markroot for a specific goroutine) own gp.gcscandone. // Nothing is racing with us now, but gcscandone might be set to true left over // from an earlier round of stack scanning (we scan twice per GC). // We use gcscandone to record whether the scan has been done during this round. // Marker Scanning Not Completed gp.gcscandone = false // See http://golang.org/cl/21503 for justification of the yield delay. const yieldDelay = 10 * 1000 var nextYield int64 // Cycle until the scan is complete // Endeavor to get gcscandone set to true, // either by doing the stack scan ourselves or by coercing gp to scan itself. // gp.gcscandone can transition from false to true when we're not looking // (if we asked for preemption), so any time we lock the status using // castogscanstatus we have to double-check that the scan is still not done. loop: for i := 0; !gp.gcscandone; i++ { // Judging the current state of G switch s := readgstatus(gp); s { default: dumpgstatus(gp) throw("stopg: invalid status") // G has been aborted and does not need to be scanned case _Gdead: // No stack. gp.gcscandone = true break loop // G's stack is expanding, and the next round of retries case _Gcopystack: // Stack being switched. Go around again. // G is not in operation. First, it needs to be prevented from running. case _Grunnable, _Gsyscall, _Gwaiting: // Claim goroutine by setting scan bit. // Racing with execution or readying of gp. // The scan bit keeps them from running // the goroutine until we're done. if castogscanstatus(gp, s, s|_Gscan) { // Scanning the stack of an atom when its switching state is successful if !gp.gcscandone { scanstack(gp, gcw) gp.gcscandone = true } // Restore the state of G and jump out of the loop restartg(gp) break loop } // G is scanning itself, waiting for the scan to be finished. case _Gscanwaiting: // newstack is doing a scan for us right now. Wait. // G is running case _Grunning: // Goroutine running. Try to preempt execution so it can scan itself. // The preemption handler (in newstack) does the actual scan. // If a preemption request already exists, it will be handled for us if the preemption succeeds. // Optimization: if there is already a pending preemption request // (from the previous loop iteration), don't bother with the atomics. if gp.preemptscan && gp.preempt && gp.stackguard0 == stackPreempt { break } // Grab G. When it succeeds, G scans itself. // Ask for preemption and self scan. if castogscanstatus(gp, _Grunning, _Gscanrunning) { if !gp.gcscandone { gp.preemptscan = true gp.preempt = true gp.stackguard0 = stackPreempt } casfrom_Gscanstatus(gp, _Gscanrunning, _Grunning) } } // The first dormancy is 10 milliseconds and the second dormancy is 5 milliseconds. if i == 0 { nextYield = nanotime() + yieldDelay } if nanotime() < nextYield { procyield(10) } else { osyield() nextYield = nanotime() + yieldDelay/2 } } // Scanning Completed, Cancel Preemptive Scanning Request gp.preemptscan = false // cancel scan request if no longer needed }

After preemptscan is set, once the G is successfully preempted it calls scanstack to scan its own stack; that code lives in the newstack preemption handler.
The function that scans the stack is scanstack:

// scanstack scans gp's stack, greying all pointers found on the stack. // // scanstack is marked go:systemstack because it must not be preempted // while using a workbuf. // //go:nowritebarrier //go:systemstack func scanstack(gp *g, gcw *gcWork) { if gp.gcscanvalid { return } if readgstatus(gp)&_Gscan == 0 { print("runtime:scanstack: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", hex(readgstatus(gp)), "\n") throw("scanstack - bad status") } switch readgstatus(gp) &^ _Gscan { default: print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n") throw("mark - bad status") case _Gdead: return case _Grunning: print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n") throw("scanstack: goroutine not stopped") case _Grunnable, _Gsyscall, _Gwaiting: // ok } if gp == getg() { throw("can't scan our own stack") } mp := gp.m if mp != nil && mp.helpgc != 0 { throw("can't scan gchelper stack") } // Shrink the stack if not much of it is being used. During // concurrent GC, we can do this during concurrent mark. if !work.markrootDone { shrinkstack(gp) } // Scan the stack. var cache pcvalueCache scanframe := func(frame *stkframe, unused unsafe.Pointer) bool { // scanframeworker retrieves function information based on code address (pc) // Then find stackmap.bytedata in the function information, which saves where the pointer is on the function stack. // Call scanblock again to scan the stack space of the function, and the parameters of the function will also be scanned in this way. scanframeworker(frame, &cache, gcw) return true } // Enumerate all calling frames and call the scanframe function separately gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, scanframe, nil, 0) // Enumerate all defer call frames and call the scanframe function separately tracebackdefers(gp, scanframe, nil) gp.gcscanvalid = true }

The scanblock function is a general-purpose scanning routine used to scan global variables and stack space. Unlike scanobject, the bitmap has to be passed in explicitly:

// scanblock scans b as scanobject would, but using an explicit // pointer bitmap instead of the heap bitmap. // // This is used to scan non-heap roots, so it does not update // gcw.bytesMarked or gcw.scanWork. // //go:nowritebarrier func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) { // Use local copies of original parameters, so that a stack trace // due to one of the throws below shows the original block // base and extent. b := b0 n := n0 arena_start := mheap_.arena_start arena_used := mheap_.arena_used // Enumeration of scanned addresses for i := uintptr(0); i < n; { // Find the corresponding byte in bitmap // Find bits for the next word. bits := uint32(*addb(ptrmask, i/(sys.PtrSize*8))) if bits == 0 { i += sys.PtrSize * 8 continue } // Enumeration of byte for j := 0; j < 8 && i < n; j++ { // If the address contains a pointer if bits&1 != 0 { // The object marked at that address survives and is added to the tag queue (the object becomes grey) // Same work as in scanobject; see comments there. obj := *(*uintptr)(unsafe.Pointer(b + i)) if obj != 0 && arena_start <= obj && obj < arena_used { // Find span s and bitmap s for this object if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 { // Mark an object alive and add it to the tag queue (the object greys) greyobject(obj, b, i, hbits, span, gcw, objIndex) } } } // Processing the next bit of the next pointer bits >>= 1 i += sys.PtrSize } } }

greyobject marks an object as alive and adds it to the mark queue (the object turns grey):

// obj is the start of an object with mark mbits. // If it isn't already marked, mark it and enqueue into gcw. // base and off are for debugging only and could be removed. //go:nowritebarrierrec func greyobject(obj, base, off uintptr, hbits heapBits, span *mspan, gcw *gcWork, objIndex uintptr) { // obj should be start of allocation, and so must be at least pointer-aligned. if obj&(sys.PtrSize-1) != 0 { throw("greyobject: obj not pointer-aligned") } mbits := span.markBitsForIndex(objIndex) if useCheckmark { // checkmark is a mechanism for checking whether all reachable objects are correctly marked, using only debugging if !mbits.isMarked() { printlock() print("runtime:greyobject: checkmarks finds unexpected unmarked object obj=", hex(obj), "\n") print("runtime: found obj at *(", hex(base), "+", hex(off), ")\n") // Dump the source (base) object gcDumpObject("base", base, off) // Dump the object gcDumpObject("obj", obj, ^uintptr(0)) getg().m.traceback = 2 throw("checkmark found unmarked object") } if hbits.isCheckmarked(span.elemsize) { return } hbits.setCheckmarked(span.elemsize) if !hbits.isCheckmarked(span.elemsize) { throw("setCheckmarked and isCheckmarked disagree") } } else { if debug.gccheckmark > 0 && span.isFree(objIndex) { print("runtime: marking free object ", hex(obj), " found at *(", hex(base), "+", hex(off), ")\n") gcDumpObject("base", base, off) gcDumpObject("obj", obj, ^uintptr(0)) getg().m.traceback = 2 throw("marking free object") } // If the bit corresponding to gcmarkBits in the span where the object is located has been set to 1, processing can be skipped // If marked we have nothing to do. if mbits.isMarked() { return } // Set the bit corresponding to gcmarkBits in the span where the object is located to 1 // mbits.setMarked() // Avoid extra call overhead with manual inlining. atomic.Or8(mbits.bytep, mbits.mask) // If it is determined that the object does not contain pointers (the type of span in which it is located is noscan), it is not necessary to put the object in the tag queue. // If this is a noscan object, fast-track it to black // instead of greying it. if span.spanclass.noscan() { gcw.bytesMarked += uint64(span.elemsize) return } } // Put objects in tag queues // Place in the local tag queue first, transfer part of the local tag queue work to the global tag queue when it fails, and then put in the local tag queue. // Queue the obj for scanning. The PREFETCH(obj) logic has been removed but // seems like a nice optimization that can be added back in. // There needs to be time between the PREFETCH and the use. // Previously we put the obj in an 8 element buffer that is drained at a rate // to give the PREFETCH time to do its work. // Use of PREFETCHNTA might be more appropriate than PREFETCH if !gcw.putFast(obj) { gcw.put(obj) } }

After gcDrain has scanned the root objects, it starts consuming the mark queue and calls the scanobject function on each object taken from the queue:

// scanobject scans the object starting at b, adding pointers to gcw. // b must point to the beginning of a heap object or an oblet. // scanobject consults the GC bitmap for the pointer mask and the // spans for the size of the object. // //go:nowritebarrier func scanobject(b uintptr, gcw *gcWork) { // Note that arena_used may change concurrently during // scanobject and hence scanobject may encounter a pointer to // a newly allocated heap object that is *not* in // [start,used). It will not mark this object; however, we // know that it was just installed by a mutator, which means // that mutator will execute a write barrier and take care of // marking it. This is even more pronounced on relaxed memory // architectures since we access arena_used without barriers // or synchronization, but the same logic applies. arena_start := mheap_.arena_start arena_used := mheap_.arena_used // Find the bits for b and the size of the object at b. // // b is either the beginning of an object, in which case this // is the size of the object to scan, or it points to an // oblet, in which case we compute the size to scan below. // Get the bitmap corresponding to the object hbits := heapBitsForAddr(b) // Get the span where the object is located s := spanOfUnchecked(b) // Get the size of the object n := s.elemsize if n == 0 { throw("scanobject n == 0") } // When the object size is too large (maxObletBytes is 128KB), it needs to be segmented and scanned. // Scan 128KB at most at a time if n > maxObletBytes { // Large object. Break into oblets for better // parallelism and lower latency. if b == s.base() { // It's possible this is a noscan object (not // from greyobject, but from other code // paths), in which case we must *not* enqueue // oblets since their bitmaps will be // uninitialized. if s.spanclass.noscan() { // Bypass the whole scan. gcw.bytesMarked += uint64(n) return } // Enqueue the other oblets to scan later. // Some oblets may be in b's scalar tail, but // these will be marked as "no more pointers", // so we'll drop out immediately when we go to // scan those. for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes { if !gcw.putFast(oblet) { gcw.put(oblet) } } } // Compute the size of the oblet. Since this object // must be a large object, s.base() is the beginning // of the object. n = s.base() + s.elemsize - b if n > maxObletBytes { n = maxObletBytes } } // Pointer in Scanning Object var i uintptr for i = 0; i < n; i += sys.PtrSize { // Get the corresponding bit // Find bits for this word. if i != 0 { // Avoid needless hbits.next() on last iteration. hbits = hbits.next() } // Load bits once. See CL 22712 and issue 16973 for discussion. bits := hbits.bits() // Check scan bit to determine whether to continue scanning. Note that the second scan bit is checkmark. // During checkmarking, 1-word objects store the checkmark // in the type bit for the one word. The only one-word objects // are pointers, or else they'd be merged with other non-pointer // data into larger allocations. if i != 1*sys.PtrSize && bits&bitScan == 0 { break // no more pointers in this object } // Check pointer bit, not pointer, then continue if bits&bitPointer == 0 { continue // not a pointer } // Remove the value of the pointer // Work here is duplicated in scanblock and above. // If you make changes here, make changes there too. 
obj := *(*uintptr)(unsafe.Pointer(b + i)) // If the pointer is in the arena area, the greyobject tag object is called and placed in the tag queue // At this point we have extracted the next potential pointer. // Check if it points into heap and not back at the current object. if obj != 0 && arena_start <= obj && obj < arena_used && obj-b >= n { // Mark the object. if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 { greyobject(obj, b, i, hbits, span, gcw, objIndex) } } } // Statistical Scanned Size and Number of Objects gcw.bytesMarked += uint64(n) gcw.scanWork += int64(i) }

When the background mark workers have drained the mark queue, the gcMarkDone function is executed to prepare the transition to the mark termination phase:
In concurrent GC, gcMarkDone is executed twice: the first call disables the local mark queues and restarts the background mark workers; the second call enters the mark termination phase.

// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2 // to mark termination. // // This should be called when all mark work has been drained. In mark // 1, this includes all root marking jobs, global work buffers, and // active work buffers in assists and background workers; however, // work may still be cached in per-P work buffers. In mark 2, per-P // caches are disabled. // // The calling context must be preemptible. // // Note that it is explicitly okay to have write barriers in this // function because completion of concurrent mark is best-effort // anyway. Any work created by write barriers here will be cleaned up // by mark termination. func gcMarkDone() { top: semacquire(&work.markDoneSema) // Re-check transition condition under transition lock. if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) { semrelease(&work.markDoneSema) return } // Temporarily prohibit starting new background markup tasks // Disallow starting new workers so that any remaining workers // in the current mark phase will drain out. // // TODO(austin): Should dedicated workers keep an eye on this // and exit gcDrain promptly? atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff) atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, -0xffffffff) // Determine whether the local tag queue is disabled if !gcBlackenPromptly { // Whether the local tag queue is not disabled, disable and restart the background tag task // Transition from mark 1 to mark 2. // // The global work list is empty, but there can still be work // sitting in the per-P work caches. // Flush and disable work caches. // Disable local tag queues // Disallow caching workbufs and indicate that we're in mark 2. gcBlackenPromptly = true // Prevent completion of mark 2 until we've flushed // cached workbufs. atomic.Xadd(&work.nwait, -1) // GC is set up for mark 2. Let Gs blocked on the // transition lock go while we flush caches. semrelease(&work.markDoneSema) // Push all objects in the local tag queue to the global tag queue systemstack(func() { // Flush all currently cached workbufs and // ensure all Ps see gcBlackenPromptly. This // also blocks until any remaining mark 1 // workers have exited their loop so we can // start new mark 2 workers. forEachP(func(_p_ *p) { _p_.gcw.dispose() }) }) // Except for misuse // Check that roots are marked. We should be able to // do this before the forEachP, but based on issue // #16083 there may be a (harmless) race where we can // enter mark 2 while some workers are still scanning // stacks. The forEachP ensures these scans are done. // // TODO(austin): Figure out the race and fix this // properly. gcMarkRootCheck() // Allow new background markup tasks to be started // Now we can start up mark 2 workers. atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff) atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 0xffffffff) // If you're sure you don't have more tasks, you can jump directly to the top of the function. // That's the second call. incnwait := atomic.Xadd(&work.nwait, +1) if incnwait == work.nproc && !gcMarkWorkAvailable(nil) { // This loop will make progress because // gcBlackenPromptly is now true, so it won't // take this same "if" branch. goto top } } else { // Record the start time of the completion markup phase and the start time of the STW // Transition to mark termination. 
now := nanotime() work.tMarkTerm = now work.pauseStart = now // Prohibit G from being preempted getg().m.preemptoff = "gcing" // Stop all running G and prohibit them from running systemstack(stopTheWorldWithSema) // !!!!!!!!!!!!!!!! // The world has stopped (STW)... // !!!!!!!!!!!!!!!! // The gcphase is _GCmark, it will transition to _GCmarktermination // below. The important thing is that the wb remains active until // all marking is complete. This includes writes made by the GC. // Scanning of the root object by tags has been completed, affecting the processing in gcMarkRootPrepare // Record that one root marking pass has completed. work.markrootDone = true // Disallow the operation of auxiliary GC and background markup tasks // Disable assists and background workers. We must do // this before waking blocked assists. atomic.Store(&gcBlackenEnabled, 0) // Wake up all G's that are dormant due to assistant GC // Wake all blocked assists. These will run when we // start the world again. gcWakeAllAssists() // Likewise, release the transition lock. Blocked // workers and assists will run when we start the // world again. semrelease(&work.markDoneSema) // Calculate the heap size needed to trigger the next gc // endCycle depends on all gcWork cache stats being // flushed. This is ensured by mark 2. nextTriggerRatio := gcController.endCycle() // Entering the completion markup phase will restart the world // Perform mark termination. This will restart the world. gcMarkTermination(nextTriggerRatio) } }

The gcMarkTermination function enters the mark termination phase:

func gcMarkTermination(nextTriggerRatio float64) { // World is stopped. // Start marktermination which includes enabling the write barrier. // Disallow the operation of auxiliary GC and background markup tasks atomic.Store(&gcBlackenEnabled, 0) // Re-allow local tag queues (next GC use) gcBlackenPromptly = false // Set the current GC stage to the completion tag stage and enable write barriers setGCPhase(_GCmarktermination) // Record start time work.heap1 = memstats.heap_live startTime := nanotime() // Prohibit G from being preempted mp := acquirem() mp.preemptoff = "gcing" _g_ := getg() _g_.m.traceback = 2 // Set the state of G to wait so that its stack can be scanned gp := _g_.m.curg casgstatus(gp, _Grunning, _Gwaiting) gp.waitreason = "garbage collection" // Switch to g0 // Run gc on the g0 stack. We do this so that the g stack // we're currently running on will no longer change. Cuts // the root set down a bit (g0 stacks are not scanned, and // we don't need to scan gc's internal state). We also // need to switch to g0 so we can shrink the stack. systemstack(func() { // Start the tag in STW gcMark(startTime) // It must be returned immediately, because the stack of external G may be moved and the external variables cannot be accessed after that. // Must return immediately. // The outer function's stack may have moved // during gcMark (it shrinks stacks, including the // outer function's stack), so we must not refer // to any of its variables. Return back to the // non-system stack to pick up the new addresses // before continuing. }) // Switch back to g0 systemstack(func() { work.heap2 = work.bytesMarked // If checkmark is enabled, check to see if all reachable objects are marked if debug.gccheckmark > 0 { // Run a full stop-the-world mark using checkmark bits, // to check that we didn't forget to mark anything during // the concurrent mark process. gcResetMarkState() initCheckmarks() gcMark(startTime) clearCheckmarks() } // Set the current GC phase to close and disable the write barrier // marking is complete so we can turn the write barrier off setGCPhase(_GCoff) // Wake up the background cleaning task, which will start running after STW gcSweep(work.mode) // Except for misuse if debug.gctrace > 1 { startTime = nanotime() // The g stacks have been scanned so // they have gcscanvalid==true and gcworkdone==true. // Reset these so that all stacks will be rescanned. gcResetMarkState() finishsweep_m() // Still in STW but gcphase is _GCoff, reset to _GCmarktermination // At this point all objects will be found during the gcMark which // does a complete STW mark and object scan. setGCPhase(_GCmarktermination) gcMark(startTime) setGCPhase(_GCoff) // marking is done, turn off wb. gcSweep(work.mode) } }) // Set the state of G to running _g_.m.traceback = 0 casgstatus(gp, _Gwaiting, _Grunning) // Tracking processing if trace.enabled { traceGCDone() } // all done mp.preemptoff = "" if gcphase != _GCoff { throw("gc done but gcphase != _GCoff") } // Update the heap size required to trigger the next GC (gc_trigger) // Update GC trigger and pacing for the next cycle. 
gcSetTriggerRatio(nextTriggerRatio) // Update Time Record // Update timing memstats now := nanotime() sec, nsec, _ := time_now() unixNow := sec*1e9 + int64(nsec) work.pauseNS += now - work.pauseStart work.tEnd = now atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS) memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow) memstats.pause_total_ns += uint64(work.pauseNS) // Update cpu records used // Update work.totaltime. sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm) // We report idle marking time below, but omit it from the // overall utilization here since it's "free". markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm) cycleCpu := sweepTermCpu + markCpu + markTermCpu work.totaltime += cycleCpu // Compute overall GC CPU utilization. totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs) memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu) // Reset Cleaning State // Reset sweep state. sweep.nbgsweep = 0 sweep.npausesweep = 0 // Statistical Forced Start GC Number if work.userForced { memstats.numforcedgc++ } // Count the number of times GC is executed and wake up the G waiting to be cleaned // Bump GC cycle count and wake goroutines waiting on sweep. lock(&work.sweepWaiters.lock) memstats.numgc++ injectglist(work.sweepWaiters.head.ptr()) work.sweepWaiters.head = 0 unlock(&work.sweepWaiters.lock) // Performance statistics // Finish the current heap profiling cycle and start a new // heap profiling cycle. We do this before starting the world // so events don't leak into the wrong cycle. mProf_NextCycle() // Restart the world systemstack(startTheWorldWithSema) // !!!!!!!!!!!!!!! // The world has been restarted. // !!!!!!!!!!!!!!! // Performance statistics // Flush the heap profile so we can start a new cycle next GC. // This is relatively expensive, so we don't do it with the // world stopped. mProf_Flush() // Buffers used by mobile tag queues to free lists so that they can be recycled // Prepare workbufs for freeing by the sweeper. We do this // asynchronously because it can take non-trivial time. prepareFreeWorkbufs() // Release unused stacks // Free stack spans. This must be done between GC cycles. systemstack(freeStackSpans) // Except for misuse // Print gctrace before dropping worldsema. As soon as we drop // worldsema another cycle could start and smash the stats // we're trying to print. if debug.gctrace > 0 { util := int(memstats.gc_cpu_fraction * 100) var sbuf [24]byte printlock() print("gc ", memstats.numgc, " @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ", util, "%: ") prev := work.tSweepTerm for i, ns := range []int64 { if i != 0 { print("+") } print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev)))) prev = ns } print(" ms clock, ") for i, ns := range []int64 { if i == 2 || i == 3 { // Separate mark time components with /. 
print("/") } else if i != 0 { print("+") } print(string(fmtNSAsMS(sbuf[:], uint64(ns)))) } print(" ms cpu, ", work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ", work.heapGoal>>20, " MB goal, ", work.maxprocs, " P") if work.userForced { print(" (forced)") } print("\n") printunlock() } semrelease(&worldsema) // Careful: another GC cycle may start now. // Re-allow the current G to be preempted releasem(mp) mp = nil // If it's a parallel GC, let the current M continue to run (returning to gcBgMarkWorker and then hibernating) // If it's not parallel GC, let the current M start scheduling // now that gc is done, kick off finalizer thread if needed if !concurrentSweep { // give the queued finalizers, if any, a chance to run Gosched() } }

The gcSweep function wakes up the background sweep task:
The background sweep task is started at program startup by the gcenable function.

func gcSweep(mode gcMode) {
    if gcphase != _GCoff {
        throw("gcSweep being done but phase is not GCoff")
    }

    // Increment sweepgen so the two queues in sweepSpans swap roles and every span becomes an "unswept" span
    lock(&mheap_.lock)
    mheap_.sweepgen += 2
    mheap_.sweepdone = 0
    if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
        // We should have drained this list during the last
        // sweep phase. We certainly need to start this phase
        // with an empty swept list.
        throw("non-empty swept list")
    }
    mheap_.pagesSwept = 0
    unlock(&mheap_.lock)

    // If the GC is not concurrent, do all the sweep work here (inside the STW)
    if !_ConcurrentSweep || mode == gcForceBlockMode {
        // Special case synchronous sweep.
        // Record that no proportional sweeping has to happen.
        lock(&mheap_.lock)
        mheap_.sweepPagesPerByte = 0
        unlock(&mheap_.lock)
        // Sweep all spans eagerly.
        for sweepone() != ^uintptr(0) {
            sweep.npausesweep++
        }
        // Free workbufs eagerly.
        prepareFreeWorkbufs()
        for freeSomeWbufs(false) {
        }
        // All "free" events for this mark/sweep cycle have
        // now happened, so we can make this profile cycle
        // available immediately.
        mProf_NextCycle()
        mProf_Flush()
        return
    }

    // Wake up the background sweep task
    // Background sweep.
    lock(&sweep.lock)
    if sweep.parked {
        sweep.parked = false
        ready(sweep.g, 0, true)
    }
    unlock(&sweep.lock)
}

The background sweep task runs the bgsweep function:

func bgsweep(c chan int) {
    sweep.g = getg()

    // Notify the caller (gcenable) that the sweeper is ready, then park until a GC wakes it
    lock(&sweep.lock)
    sweep.parked = true
    c <- 1
    goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)

    // Sweep loop
    for {
        // Sweep one span, then yield to the scheduler (only a small amount of work is done per step)
        for gosweepone() != ^uintptr(0) {
            sweep.nbgsweep++
            Gosched()
        }
        // Release some unused mark queue buffers back to the heap
        for freeSomeWbufs(true) {
            Gosched()
        }
        // If sweeping is not finished yet, continue the loop
        lock(&sweep.lock)
        if !gosweepdone() {
            // This can happen if a GC runs between
            // gosweepone returning ^0 above
            // and the lock being acquired.
            unlock(&sweep.lock)
            continue
        }
        // Otherwise the background sweeper parks and the current M goes back to scheduling
        sweep.parked = true
        goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)
    }
}
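
As mentioned above, this sweeper goroutine is created during runtime initialization by gcenable. For reference, a simplified sketch of gcenable follows; this is a paraphrase rather than an exact listing, so consult the 1.9.2 runtime source for the precise code:

// gcenable runs near the end of runtime initialization, just before
// user code is allowed to run. It starts the background sweeper
// goroutine and then enables GC.
func gcenable() {
    c := make(chan int, 1)
    go bgsweep(c)
    // Wait until bgsweep has registered itself and parked for the first time
    <-c
    memstats.enablegc = true // now that the runtime is initialized, GC is okay
}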

The gosweepone function takes a single span out of sweepSpans and sweeps it:

//go:nowritebarrier
func gosweepone() uintptr {
    var ret uintptr
    // Switch to g0
    systemstack(func() {
        ret = sweepone()
    })
    return ret
}

The sweepone function is as follows:

// sweeps one span
// returns number of pages returned to heap, or ^uintptr(0) if there is nothing to sweep
//go:nowritebarrier
func sweepone() uintptr {
    _g_ := getg()
    sweepRatio := mheap_.sweepPagesPerByte // For debugging

    // Disallow preemption of the current G
    // increment locks to ensure that the goroutine is not preempted
    // in the middle of sweep thus leaving the span in an inconsistent state for next GC
    _g_.m.locks++

    // Check whether sweeping has already finished
    if atomic.Load(&mheap_.sweepdone) != 0 {
        _g_.m.locks--
        return ^uintptr(0)
    }

    // Update the number of tasks performing sweep at the same time
    atomic.Xadd(&mheap_.sweepers, +1)

    npages := ^uintptr(0)
    sg := mheap_.sweepgen
    for {
        // Take a span from sweepSpans
        s := mheap_.sweepSpans[1-sg/2%2].pop()
        // Break out of the loop when all sweeping is done
        if s == nil {
            atomic.Store(&mheap_.sweepdone, 1)
            break
        }
        // Skip the span if it is not in use (for example, it was already swept directly)
        if s.state != mSpanInUse {
            // This can happen if direct sweeping already
            // swept this span, but in that case the sweep
            // generation should always be up-to-date.
            if s.sweepgen != sg {
                print("runtime: bad span s.state=", s.state, " s.sweepgen=", s.sweepgen, " sweepgen=", sg, "\n")
                throw("non in-use span in unswept list")
            }
            continue
        }
        // Atomically advance the span's sweepgen; failure means another M has started sweeping this span, so skip it
        if s.sweepgen != sg-2 || !atomic.Cas(&s.sweepgen, sg-2, sg-1) {
            continue
        }
        // Sweep the span, then break out of the loop
        npages = s.npages
        if !s.sweep(false) {
            // Span is still in-use, so this returned no
            // pages to the heap and the span needs to
            // move to the swept in-use list.
            npages = 0
        }
        break
    }

    // Update the number of tasks performing sweep at the same time
    // Decrement the number of active sweepers and if this is the
    // last one print trace information.
    if atomic.Xadd(&mheap_.sweepers, -1) == 0 && atomic.Load(&mheap_.sweepdone) != 0 {
        if debug.gcpacertrace > 0 {
            print("pacer: sweep done at heap size ", memstats.heap_live>>20, "MB; allocated ",
                (memstats.heap_live-mheap_.sweepHeapLiveBasis)>>20, "MB during sweep; swept ",
                mheap_.pagesSwept, " pages at ", sweepRatio, " pages/byte\n")
        }
    }

    // Allow the G to be preempted again
    _g_.m.locks--

    // Return the number of pages swept
    return npages
}

The sweep method of mspan is used to sweep a single span:

// Sweep frees or collects finalizers for blocks not marked in the mark phase. // It clears the mark bits in preparation for the next GC round. // Returns true if the span was returned to heap. // If preserve=true, don't return it to heap nor relink in MCentral lists; // caller takes care of it. //TODO go:nowritebarrier func (s *mspan) sweep(preserve bool) bool { // It's critical that we enter this function with preemption disabled, // GC must not start while we are in the middle of this function. _g_ := getg() if _g_.m.locks == 0 && _g_.m.mallocing == 0 && _g_ != _g_.m.g0 { throw("MSpan_Sweep: m is not locked") } sweepgen := mheap_.sweepgen if s.state != mSpanInUse || s.sweepgen != sweepgen-1 { print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n") throw("MSpan_Sweep: bad span state") } if trace.enabled { traceGCSweepSpan(s.npages * _PageSize) } // Statistics of cleaned pages atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages)) spc := s.spanclass size := s.elemsize res := false c := _g_.m.mcache freeToHeap := false // The allocBits indicate which unmarked objects don't need to be // processed since they were free at the end of the last GC cycle // and were not allocated since then. // If the allocBits index is >= s.freeindex and the bit // is not marked then the object remains unallocated // since the last GC. // This situation is analogous to being on a freelist. // Judging the destructor in special s, if the corresponding object is no longer alive, mark the object alive to prevent recovery, and then move the destructor to the running queue // Unlink & free special records for any objects we're about to free. // Two complications here: // 1. An object can have both finalizer and profile special records. // In such case we need to queue finalizer for execution, // mark the object as live and preserve the profile special. // 2. A tiny object can have several finalizers setup for different offsets. // If such object is not marked, we need to queue all finalizers at once. // Both 1 and 2 are possible at the same time. specialp := &s.specials special := *specialp for special != nil { // A finalizer can be set for an inner byte of an object, find object beginning. objIndex := uintptr(special.offset) / size p := s.base() + objIndex*size mbits := s.markBitsForIndex(objIndex) if !mbits.isMarked() { // This object is not marked and has at least one special record. // Pass 1: see if it has at least one finalizer. hasFin := false endOffset := p - s.base() + size for tmp := special; tmp != nil && uintptr(tmp.offset) < endOffset; tmp = tmp.next { if tmp.kind == _KindSpecialFinalizer { // Stop freeing of object if it has a finalizer. mbits.setMarkedNonAtomic() hasFin = true break } } // Pass 2: queue all finalizers _or_ handle profile record. for special != nil && uintptr(special.offset) < endOffset { // Find the exact byte for which the special was setup // (as opposed to object beginning). p := s.base() + uintptr(special.offset) if special.kind == _KindSpecialFinalizer || !hasFin { // Splice out special record. y := special special = special.next *specialp = special freespecial(y, unsafe.Pointer(p), size) } else { // This is profile record, but the object has finalizers (so kept alive). // Keep special record. 
specialp = &special.next special = *specialp } } } else { // object is still live: keep special record specialp = &special.next special = *specialp } } // Except for misuse if debug.allocfreetrace != 0 || raceenabled || msanenabled { // Find all newly freed objects. This doesn't have to // efficient; allocfreetrace has massive overhead. mbits := s.markBitsForBase() abits := s.allocBitsForIndex(0) for i := uintptr(0); i < s.nelems; i++ { if !mbits.isMarked() && (abits.index < s.freeindex || abits.isMarked()) { x := s.base() + i*s.elemsize if debug.allocfreetrace != 0 { tracefree(unsafe.Pointer(x), size) } if raceenabled { racefree(unsafe.Pointer(x), size) } if msanenabled { msanfree(unsafe.Pointer(x), size) } } mbits.advance() abits.advance() } } // Calculate the number of objects released // Count the number of free objects in this span. nalloc := uint16(s.countAlloc()) if spc.sizeclass() == 0 && nalloc == 0 { // If the span type is 0 (large object) and the object in it is no longer alive, it is released to heap. s.needzero = 1 freeToHeap = true } nfreed := s.allocCount - nalloc if nalloc > s.allocCount { print("runtime: nelems=", s.nelems, " nalloc=", nalloc, " previous allocCount=", s.allocCount, " nfreed=", nfreed, "\n") throw("sweep increased allocation count") } // Setting up a new allocCount s.allocCount = nalloc // Judging whether span has no unassigned objects wasempty := s.nextFreeIndex() == s.nelems // Reset the free index and start searching from 0 for the next assignment s.freeindex = 0 // reset allocation index to start of span. if trace.enabled { getg().m.p.ptr().traceReclaimed += uintptr(nfreed) * s.elemsize } // gcmarkBits to New allocBits // Then redistribute a gcmarkBits with all zeros // Next time you assign objects, you can know which elements are unallocated based on allocBits // gcmarkBits becomes the allocBits. // get a fresh cleared gcmarkBits in preparation for next GC s.allocBits = s.gcmarkBits s.gcmarkBits = newMarkBits(s.nelems) // allocCache to Update freeindex Start // Initialize alloc bits cache. s.refillAllocCache(0) // If there are no surviving objects in span, update sweepgen to the latest // Now add span to mcentral or mheap // We need to set s.sweepgen = h.sweepgen only when all blocks are swept, // because of the potential for a concurrent free/SetFinalizer. // But we need to set it before we make the span available for allocation // (return it to heap or mcentral), because allocation code assumes that a // span is already swept if available for allocation. if freeToHeap || nfreed == 0 { // The span must be in our exclusive ownership until we update sweepgen, // check for potential races. if s.state != mSpanInUse || s.sweepgen != sweepgen-1 { print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n") throw("MSpan_Sweep: bad span state after sweep") } // Serialization point. // At this point the mark bits are cleared and allocation ready // to go so release the span. 
atomic.Store(&s.sweepgen, sweepgen) } if nfreed > 0 && spc.sizeclass() != 0 { // Add span to mcentral, res equals success c.local_nsmallfree[spc.sizeclass()] += uintptr(nfreed) res = mheap_.central[spc].mcentral.freeSpan(s, preserve, wasempty) // freeSpan will update sweepgen // MCentral_FreeSpan updates sweepgen } else if freeToHeap { // Release span to mheap // Free large span to heap // NOTE(rsc,dvyukov): The original implementation of efence // in CL 22060046 used SysFree instead of SysFault, so that // the operating system would eventually give the memory // back to us again, so that an efence program could run // longer without running out of memory. Unfortunately, // calling SysFree here without any kind of adjustment of the // heap data structures means that when the memory does // come back to us, we have the wrong metadata for it, either in // the MSpan structures or in the garbage collection bitmap. // Using SysFault here means that the program will run out of // memory fairly quickly in efence mode, but at least it won't // have mysterious crashes due to confused memory reuse. // It should be possible to switch back to SysFree if we also // implement and then call some kind of MHeap_DeleteSpan. if debug.efence > 0 { s.limit = 0 // prevent mlookup from finding this span sysFault(unsafe.Pointer(s.base()), size) } else { mheap_.freeSpan(s, 1) } c.local_nlargefree++ c.local_largefree += size res = true } // If span is not added to mcentral or released to mheap, then span is still in use if !res { // Add span s that are still in use to sweepSpans'Cleaned Queue // The span has been swept and is still in-use, so put // it on the swept in-use list. mheap_.sweepSpans[sweepgen/2%2].push(s) } return res }

From bgsweep and the allocator described earlier, we can see that the sweep phase is lazy.
In fact, a new GC cycle may need to start while the previous cycle's sweeping is still unfinished.
So before each GC cycle begins, the previous cycle's sweep work must be finished first (this is the Sweep Termination phase).
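
The toy program below is only a conceptual sketch of this lazy behaviour (the names toySpan and toyHeap are made up and have nothing to do with the runtime): sweep work left over from the previous cycle is paid off bit by bit as allocation happens, and whatever remains is finished before the next cycle may start.

package main

import "fmt"

// A toy model of lazy sweeping: spans left over from the previous GC cycle
// are swept on demand by the allocator (and by a background loop), rather
// than all at once while the world is stopped.
type toySpan struct {
    id    int
    swept bool
}

type toyHeap struct {
    unswept []*toySpan
}

// sweepOne sweeps a single leftover span, returning false when none remain.
// The runtime's sweepone works span by span in the same spirit.
func (h *toyHeap) sweepOne() bool {
    if len(h.unswept) == 0 {
        return false
    }
    s := h.unswept[0]
    h.unswept = h.unswept[1:]
    s.swept = true
    fmt.Println("swept span", s.id)
    return true
}

// allocate pays a little "sweep debt" before allocating, which is how the
// cost of sweeping is spread across allocations instead of one long pause.
func (h *toyHeap) allocate() {
    h.sweepOne()
    fmt.Println("allocated an object")
}

func main() {
    h := &toyHeap{unswept: []*toySpan{{id: 1}, {id: 2}, {id: 3}}}
    h.allocate()
    h.allocate()
    // A new GC cycle must finish the remaining sweep work first
    // (the sweep termination phase) before marking can start again.
    for h.sweepOne() {
    }
    fmt.Println("sweep termination done, a new cycle may start")
}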

That completes the walkthrough of the GC process. Finally, let's look at the implementation of the write barrier function writebarrierptr:

// NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
    if writeBarrier.cgo {
        cgoCheckWriteBarrier(dst, src)
    }
    if !writeBarrier.needed {
        *dst = src
        return
    }
    if src != 0 && src < minPhysPageSize {
        systemstack(func() {
            print("runtime: writebarrierptr *", dst, " = ", hex(src), "\n")
            throw("bad pointer in write barrier")
        })
    }
    // Shade the relevant pointers before the write
    writebarrierptr_prewrite1(dst, src)
    // Perform the actual pointer write
    *dst = src
}

The writebarrierptr_prewrite1 function is as follows:

// writebarrierptr_prewrite1 invokes a write barrier for *dst = src
// prior to the write happening.
//
// Write barrier calls must not happen during critical GC and scheduler
// related operations. In particular there are times when the GC assumes
// that the world is stopped but scheduler related code is still being
// executed, dealing with syscalls, dealing with putting gs on runnable
// queues and so forth. This code cannot execute write barriers because
// the GC might drop them on the floor. Stopping the world involves removing
// the p associated with an m. We use the fact that m.p == nil to indicate
// that we are in one these critical section and throw if the write is of
// a pointer to a heap object.
//go:nosplit
func writebarrierptr_prewrite1(dst *uintptr, src uintptr) {
    mp := acquirem()
    if mp.inwb || mp.dying > 0 {
        releasem(mp)
        return
    }
    systemstack(func() {
        if mp.p == 0 && memstats.enablegc && !mp.inwb && inheap(src) {
            throw("writebarrierptr_prewrite1 called with mp.p == nil")
        }
        mp.inwb = true
        gcmarkwb_m(dst, src)
    })
    mp.inwb = false
    releasem(mp)
}

The gcmarkwb_m function is as follows:

func gcmarkwb_m(slot *uintptr, ptr uintptr) {
    if writeBarrier.needed {
        // Note: This turns bad pointer writes into bad
        // pointer reads, which could be confusing. We avoid
        // reading from obviously bad pointers, which should
        // take care of the vast majority of these. We could
        // patch this up in the signal handler, or use XCHG to
        // combine the read and the write. Checking inheap is
        // insufficient since we need to track changes to
        // roots outside the heap.
        //
        // Note: profbuf.go omits a barrier during signal handler
        // profile logging; that's safe only because this deletion barrier exists.
        // If we remove the deletion barrier, we'll have to work out
        // a new way to handle the profile logging.
        if slot1 := uintptr(unsafe.Pointer(slot)); slot1 >= minPhysPageSize {
            if optr := *slot; optr != 0 {
                // Mark the old pointer
                shade(optr)
            }
        }
        // TODO: Make this conditional on the caller's stack color.
        if ptr != 0 && inheap(ptr) {
            // Mark the new pointer
            shade(ptr)
        }
    }
}

The shade function is as follows:

// Shade the object if it isn't already.
// The object is not nil and known to be in the heap.
// Preemption must be disabled.
//go:nowritebarrier
func shade(b uintptr) {
    if obj, hbits, span, objIndex := heapBitsForObject(b, 0, 0); obj != 0 {
        gcw := &getg().m.p.ptr().gcw
        // Mark the object alive and add it to the mark queue (the object turns grey)
        greyobject(obj, 0, 0, hbits, span, gcw, objIndex)
        // If use of the local mark queue has been disabled, flush it to the global mark queue
        if gcphase == _GCmarktermination || gcBlackenPromptly {
            // Ps aren't allowed to cache work during mark
            // termination.
            gcw.dispose()
        }
    }
}
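
To see where these barriers come from in ordinary code, consider the tiny program below (my own example, not from the runtime): the pointer store in link is compiled with a write-barrier check, and while marking is active it goes through writebarrierptr and gcmarkwb_m, shading both the old and the new pointee as shown above. Compiling with go build -gcflags=-S should show the generated barrier call, although the exact instruction sequence depends on the compiler version.

package main

type node struct {
    next *node
    data int
}

//go:noinline
func link(a, b *node) {
    // A pointer store into a (potentially) heap object: when
    // runtime.writeBarrier.needed is true, i.e. during concurrent marking,
    // this store runs the write barrier, which shades both the old value
    // of a.next and the new value b.
    a.next = b
}

func main() {
    a := &node{data: 1}
    b := &node{data: 2}
    link(a, b)
    _ = a.next
}
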
Reference links

https://github.com/golang/go
https://making.pusher.com/golangs-real-time-gc-in-theory-and-practice
https://github.com/golang/proposal/blob/master/design/17503-eliminate-rescan.md
https://golang.org/s/go15gcpacing
https://golang.org/ref/mem
https://talks.golang.org/2015/go-gc.pdf
https://docs.google.com/document/d/1ETuA2IOmnaQ4j81AtTGT40Y4_Jr6_IDASEKg0t0dBR8/edit#heading=h.x4kziklnb8fr
https://go-review.googlesource.com/c/go/+/21503
http://www.cnblogs.com/diegodu/p/5803202.html
http://legendtkl.com/2017/04/28/golang-gc
https://lengzzz.com/note/gc-in-golang

Comparison between Golang's GC and CoreCLR's GC

Since I have analyzed CoreCLR's GC before (see my two earlier articles on the subject), I can briefly compare the GC implementations of CoreCLR and Go here:

  • CoreCLR objects carry type information, while Go objects do not; Go instead records which words contain pointers in the bitmap area.
  • CoreCLR allocates objects much faster; allocating in Go requires finding a suitable span and writing to the bitmap area.
  • The CoreCLR collector has to do considerably more work than the Go collector:
    • CoreCLR places objects of different sizes in the same segment, so it can only scan a segment linearly.
    • CoreCLR has to consult type information to find an object's references, while Go only needs to read the bitmap area.
    • CoreCLR sweeps dead objects one by one to mark them free, while Go only needs to swap in the new allocBits.
  • CoreCLR's pauses are longer than Go's.
    • Although CoreCLR supports concurrent GC, it is not as thorough as Go's: Go does not even need to stop the world completely to scan the root objects.
  • CoreCLR supports generational GC.
    • Although CoreCLR's full GC is less efficient than Go's collection, most of the time CoreCLR only needs to scan generation 0 and 1 objects.
    • Thanks to generational GC, CoreCLR usually spends less CPU time on GC than Go.

CoreCLR's allocator and collector are generally more efficient than Go's, which means CoreCLR can achieve higher throughput.
But CoreCLR's maximum pause time is not as short as Go's, because the entire design of Go's GC centres on reducing pause time.

Distributed computing and horizontal scaling are becoming more and more popular.
Rather than chasing single-machine throughput, it is arguably wiser to pursue low latency and let distributed solutions take care of throughput.
Go's design goals make it better suited than many other languages to writing network service programs.
