Understanding Go Internals: GC Principles and Source Code Analysis

The Go runtime plays a role similar to the Java virtual machine: it manages memory allocation, garbage collection, stack handling, goroutines, channels, slices, maps, reflection, and more. Go executables are much larger than their corresponding source files because the Go runtime is embedded in every executable.

Several common GC algorithms:

Reference counting: maintain a reference count for each object; when an object that references it is destroyed, the count is decremented by 1, and when the count reaches 0 the object is reclaimed.

Advantages: objects are reclaimed promptly; collection does not wait until memory is exhausted or some threshold is reached.

Disadvantages: cyclic references cannot be handled well, and maintaining the counts on every reference change carries a certain cost.

Representative languages: Python, PHP, Swift
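
Go itself does not use reference counting, but the idea is easy to sketch. The following is a minimal illustration only (the RefCounted type and its methods are hypothetical, not part of any runtime):

package main

import "fmt"

// RefCounted is a hypothetical wrapper that carries an explicit reference count.
type RefCounted struct {
	refs    int
	payload []byte
}

// Retain records one more owner of the object.
func (r *RefCounted) Retain() { r.refs++ }

// Release drops one owner and reclaims the object as soon as the count hits zero.
func (r *RefCounted) Release() {
	r.refs--
	if r.refs == 0 {
		r.payload = nil // reclaimed immediately; no separate collection pass needed
		fmt.Println("object reclaimed")
	}
}

func main() {
	obj := &RefCounted{refs: 1, payload: make([]byte, 1<<20)}
	obj.Retain()  // a second owner appears
	obj.Release() // first owner gone: refs == 1, object still alive
	obj.Release() // last owner gone: refs == 0, reclaimed right here
}

The cycle problem is also visible in this model: two objects that Retain each other can never reach zero, which is why languages such as Python pair reference counting with a separate cycle collector.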

Mark-and-sweep: starting from the root variables, traverse all referenced objects and mark them as "referenced"; anything left unmarked is reclaimed.

Advantages: avoids the drawbacks of reference counting.

Disadvantages: it requires STW, i.e. the program must be paused temporarily.

Representative language: Go (which uses tri-color marking)

Generational collection: divide the heap into generations by object lifetime; long-lived objects go into the old generation and short-lived ones into the young generation, and different generations use different collection algorithms and collection frequencies.

Advantages: good collection performance.

Disadvantages: a more complex algorithm.

Representative language: Java

None of these algorithms is perfect; each is a trade-off.

GC flow diagram:

Stack scan: collect the root objects (global variables and the G stacks) and enable the write barrier. Handling the global variables and enabling the write barrier require STW; scanning a G stack only requires stopping that G, so it takes relatively little time.

Mark: scan all root objects and every object reachable from them, marking them so they are not reclaimed.

Mark Termination: finish the marking work and rescan part of the root objects (requires STW).

Sweep: sweep the spans according to the mark results.

As the diagram above shows, a full GC cycle performs two STWs (Stop The World): the first at the start of the Mark phase and the second in the Mark Termination phase.
The first STW prepares the scan of the root objects and enables the write barrier and mutator assists.
The second STW rescans part of the root objects and disables the write barrier and mutator assists.
Note that not all root scanning requires STW; for example, scanning the objects on a stack only requires stopping the G that owns that stack.
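
These pauses and cycle counts can be observed from user code through runtime.MemStats (or by running the program with GODEBUG=gctrace=1); a minimal sketch:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	var before, after runtime.MemStats
	runtime.ReadMemStats(&before)

	runtime.GC() // force one full cycle, including both STW phases

	runtime.ReadMemStats(&after)
	fmt.Printf("completed GC cycles: %d -> %d\n", before.NumGC, after.NumGC)
	fmt.Printf("total STW pause so far: %d ns\n", after.PauseTotalNs)
	// PauseNs is a ring buffer of 256 recent pauses; the latest one sits here:
	fmt.Printf("latest pause: %d ns\n", after.PauseNs[(after.NumGC+255)%256])
}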

Tri-color marking

There are three sets of objects, black, gray, and white, with the following meanings:

White: the object has not been marked; its bit in gcmarkBits is 0.

Gray: the object has been marked, but the objects it references have not yet been marked; its bit in gcmarkBits is 1.

Black: the object has been marked, and the objects it references have also been marked; its bit in gcmarkBits is 1.

Both gray and black objects have their gcmarkBits bit set to 1, so how are the two distinguished?

Marking works through a mark queue: marked objects that are still in the mark queue are gray, and marked objects no longer in the queue are black. The marking process is shown in the figure below:

 

In the figure above, root object A is allocated on a stack and H is a global variable allocated on the heap. Roots A and H each reference other objects, which may in turn reference further objects; the relationships between the objects are as shown in the figure.

  1. In the initial state, all objects are white.
  2. The scan of the root objects begins: A and H are roots, so they are scanned and turn gray.
  3. Next the gray objects are scanned: through A we reach B, so B is marked gray, and once A has been fully scanned it is marked black. Likewise J and K are marked gray and H is marked black.
  4. Scanning of the gray objects continues: through B we reach C, so C is marked gray and B is marked black. J and K reference no other objects, so they are marked black and are done.
  5. In the end the black objects are kept, and the white objects D, E, and F are reclaimed.
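
The walkthrough above can be reproduced with a small self-contained sketch. It is a simplified model, plain structs plus a slice standing in for the mark queue, not the runtime's real gcmarkBits/gcWork machinery:

package main

import "fmt"

type color int

const (
	white color = iota // not yet marked
	gray               // marked, children not yet scanned (i.e. still in the mark queue)
	black              // marked, children scanned
)

type object struct {
	name string
	refs []*object
	col  color
}

// mark performs a tri-color mark starting from the given roots.
func mark(roots []*object) {
	var queue []*object // the mark queue: every object in here is gray
	for _, r := range roots {
		r.col = gray
		queue = append(queue, r)
	}
	for len(queue) > 0 {
		obj := queue[0]
		queue = queue[1:]
		for _, child := range obj.refs {
			if child.col == white { // grey each reachable white child
				child.col = gray
				queue = append(queue, child)
			}
		}
		obj.col = black // all children queued, so the object leaves the queue as black
	}
}

func main() {
	d := &object{name: "D"} // unreachable: stays white and would be swept
	c := &object{name: "C"}
	b := &object{name: "B", refs: []*object{c}}
	a := &object{name: "A", refs: []*object{b}} // root
	mark([]*object{a})
	for _, o := range []*object{a, b, c, d} {
		fmt.Printf("%s white=%v\n", o.name, o.col == white)
	}
}

Running it shows that A, B, and C end up non-white while D stays white, matching the five steps above.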

Write barriers

                        

In the figure above, suppose that after object B has turned black, a new pointer from B to object G is installed. B has already been scanned and will not be scanned again, so G stays white and would be reclaimed by mistake. How can this be solved?

The simplest approach is STW (stop the world), i.e. stopping all goroutines. That is rather brute-force, makes the program stall, and is not friendly. Instead, if the collector keeps one of the following two conditions true, no reachable object can be lost. This leads to the strong and weak tri-color invariants:

Strong tri-color invariant: a black object must never reference a white object.

Weak tri-color invariant: every white object referenced by a black object is still protected by (reachable from) some gray object.

How are these two invariants enforced? That is what the barrier mechanism is for.

Go before 1.8 used an insertion (Dijkstra-style) write barrier; Go 1.8 introduced the hybrid write barrier, which also incorporates a deletion (Yuasa-style) barrier. A black object's pointer slots live in two kinds of places: the stack and the heap. The stack is small but must respond very fast, because it is pushed and popped constantly during function calls, so the insertion barrier is not applied to writes into stack objects and is only used for writes into heap objects.

Insertion barrier: it applies only to objects on the heap. Stacks are scanned once without it; at the end of marking an STW is started, the stacks are rescanned, and then the STW ends. An object whose pointer is installed while the insertion barrier is active is shaded automatically, so it does not need to be rescanned.

Deletion barrier: it works for both stack and heap. Under the deletion barrier, when a reference to a node is deleted, that node is shaded gray, and the children of that gray object will still be scanned later. Its drawback is lower precision: objects that die during the cycle may only be reclaimed in the next one.
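
In Go-style pseudocode the two barriers wrap every pointer store as follows. This is a conceptual sketch that reuses the object type and colors from the tri-color example above; shade and markQueue are stand-ins for "mark gray and push onto the mark queue", not real runtime APIs:

// shade marks an object gray and enqueues it; markQueue plays the role of the
// gcWork buffers in the real runtime.
var markQueue []*object

func shade(o *object) {
	if o.col == white {
		o.col = gray
		markQueue = append(markQueue, o)
	}
}

// Dijkstra-style insertion barrier: shade the NEW pointee before installing it,
// so a black object can never end up pointing at an unprotected white object
// (this enforces the strong tri-color invariant).
func writeWithInsertionBarrier(slot **object, ptr *object) {
	if ptr != nil {
		shade(ptr)
	}
	*slot = ptr
}

// Yuasa-style deletion barrier: shade the OLD pointee before overwriting it,
// so an object reachable at the start of the cycle cannot be hidden by cutting
// its last gray path (this enforces the weak invariant / snapshot-at-the-beginning).
func writeWithDeletionBarrier(slot **object, ptr *object) {
	if old := *slot; old != nil {
		shade(old)
	}
	*slot = ptr
}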

Hybrid barrier:

Shortcomings of the insertion and deletion write barriers:

Insertion write barrier: at the end of marking, STW is needed to rescan the stacks and mark the white objects they still reference as live.

Deletion write barrier: collection precision is low; at the start of GC, STW is used to scan the stacks and record an initial snapshot, and that snapshot protects every object that was live at that moment.

Hybrid write barrier rules

Concretely:

1. At the start of GC, all objects on the stacks are scanned and marked black (no second rescan of the stacks is needed afterwards, so no STW for that).

2. During GC, any new object created on a stack is black.

3. The object whose reference is deleted is shaded gray.

4. The object whose reference is added is shaded gray.

Together these satisfy a variant of the weak tri-color invariant.

Pseudocode:

// hybrid write barrier: install new pointee ptr into slot
writePointer(slot, ptr) {
      // 1. shade the current pointee: as soon as it is unlinked from slot, it turns gray
      shade(*slot)

      // 2. shade the new pointee being installed
      shade(ptr)

      // 3. perform the actual pointer write
      *slot = ptr
}

As noted above, a GC cycle has two STWs; with the hybrid barrier the second STW can be shortened dramatically, because the stacks no longer need a full rescan.

GC pacer

When GC is triggered:

Threshold (gcTriggerHeap): by default, a new GC starts once the heap has grown to roughly twice its size after the previous collection.

Periodic (gcTriggerTime): by default a GC is forced every 2 minutes; see forcegcperiod in src/runtime/proc.go.

Manual (gcTriggerCycle): runtime.GC().

The threshold, of course, adjusts dynamically as memory use grows. If the heap in use after the previous GC, Hm(n-1), is 1 GB and GOGC has its default value of 100, the next cycle will be started close to Hg (2 GB), as shown in the figure below:

GC starts at Ht and finishes at Ha, with Ha very close to Hg.
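
From user code the three triggers correspond to runtime.GC(), the GOGC environment variable / debug.SetGCPercent, and the 2-minute forced GC (which needs no user action). A small sketch for inspecting the current heap goal; the actual numbers depend entirely on the program:

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func main() {
	// gcTriggerCycle: trigger a collection manually.
	runtime.GC()

	// gcTriggerHeap: the growth target is GOGC, adjustable at runtime.
	// With the default of 100, the goal is roughly "live heap after the last GC" * 2.
	prev := debug.SetGCPercent(100)
	fmt.Println("previous GOGC value:", prev)

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	// NextGC is the heap-size goal for the next cycle (Hg in the figure above).
	fmt.Printf("live heap %d bytes, next GC goal %d bytes\n", ms.HeapAlloc, ms.NextGC)
}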

(1) How do we ensure that all spans have been swept by the time GC starts at Ht?

Besides the background sweeper goroutine, user allocations also perform assist sweeping to guarantee that all spans are swept before the next GC cycle starts. Suppose k pages of spans still need to be swept and Ht - Hm(n-1) bytes can still be allocated before the next GC; then on average each allocated byte must sweep k / (Ht - Hm(n-1)) pages of spans (k changes as sweeping progresses).

Assist sweeping is only checked when a new span is requested; its trigger can be seen in the cacheSpan function. When triggered, the G helps sweep a "work amount" of pages, computed as:

spanBytes * sweepPagesPerByte

That is, the allocation size multiplied by the coefficient sweepPagesPerByte. sweepPagesPerByte is computed in the gcSetTriggerRatio function as follows:

// the current heap size
heapLiveBasis := atomic.Load64(&memstats.heap_live)
// heap distance to the GC trigger = heap size that triggers the next GC - current heap size
heapDistance := int64(trigger) - int64(heapLiveBasis)
heapDistance -= 1024 * 1024
if heapDistance < _PageSize {
	heapDistance = _PageSize
}
// pages already swept
pagesSwept := atomic.Load64(&mheap_.pagesSwept)
// pages not yet swept = pages in use - pages already swept
sweepDistancePages := int64(mheap_.pagesInUse) - int64(pagesSwept)
if sweepDistancePages <= 0 {
	mheap_.sweepPagesPerByte = 0
} else {
	// pages to assist-sweep per allocated byte = unswept pages / heap distance to the trigger
	mheap_.sweepPagesPerByte = float64(sweepDistancePages) / float64(heapDistance)
}

 

(2) How do we ensure that marking is finished by Ha?

GC starts at Ht and tries to have all objects marked by the time the heap reaches Hg. Besides the background mark workers, allocations must also perform assist marking. Between Ht and Hg there are Hg - Ht bytes left to allocate, and at that point scanWorkExpected units of scan work remain, so on average each allocated byte must contribute scanWorkExpected / (Hg - Ht) units of assist marking (scanWorkExpected changes as marking progresses).

The trigger for assist marking can be seen in the mallocgc function; when triggered, the G helps scan a "work amount" of objects, computed as:

debtBytes * assistWorkPerByte

That is, the allocation size multiplied by the coefficient assistWorkPerByte, which is computed in the revise function as follows:

// expected remaining scan work = scannable heap bytes - scan work already done
scanWorkExpected := int64(memstats.heap_scan) - c.scanWork
if scanWorkExpected < 1000 {
	scanWorkExpected = 1000
}
// heap distance to the goal = heap size at which GC should finish - current heap size
// note that next_gc is computed differently from gc_trigger:
// next_gc equals heap_marked * (1 + gcpercent / 100)
heapDistance := int64(memstats.next_gc) - int64(atomic.Load64(&memstats.heap_live))
if heapDistance <= 0 {
	heapDistance = 1
}
// scan work to assist per allocated byte = expected remaining scan work / heap distance to the goal
c.assistWorkPerByte = float64(scanWorkExpected) / float64(heapDistance)
c.assistBytesPerWork = float64(heapDistance) / float64(scanWorkExpected)

Root objects

In the mark phase, the first things that need to be marked are the "root objects"; everything reachable from the roots is considered live.
The roots include the global variables and the variables on each G's stack; GC scans the roots first and then everything reachable from them.

Fixed Roots: special scanning jobs:

fixedRootFinalizers: scan the finalizer queue

fixedRootFreeGStacks: free the stacks of terminated Gs

Flush Cache Roots: release all spans in the mcaches (requires STW)

Data Roots: scan the initialized, writable global variables (the data section)

BSS Roots: scan the uninitialized global variables (the BSS section)

Span Roots: scan the special objects in each span (the finalizer lists)

Stack Roots: scan each G's stack

The mark phase (Mark) performs "Fixed Roots", "Data Roots", "BSS Roots", "Span Roots", and "Stack Roots".
The mark termination phase (Mark Termination) performs "Fixed Roots" and "Flush Cache Roots".

Object scanning

Given a pointer p to an object, how do we find the object's span and its heap bits? The following analysis is based on Go 1.10.

As introduced in the memory-allocation part, 2 bits describe one word, so one byte covers 4 words. Of the 2 bits, one says whether scanning should continue and the other says whether the word holds a pointer. From the address p, a fixed-offset computation yields the corresponding heapBits:

func heapBitsForAddr(addr uintptr) heapBits {
    // 2 bits per word, 4 pairs per byte, and a mask is hard coded.
    off := (addr - mheap_.arena_start) / sys.PtrSize
    return heapBits{(*uint8)(unsafe.Pointer(mheap_.bitmap - off/4 - 1)), uint32(off & 3)}
}

Finding the span for p is even simpler. As described earlier, the spans area records the span structure for every page, so shifting p's offset right by the page size gives the page index, which directly yields the span pointer:

mheap_.spans[(p-mheap_.arena_start)>>_PageShift]

The following analysis applies to Go 1.11 and later.

Go 1.11 switched to a sparse index for managing the heap as a whole: the heap may exceed 512 GB, and its address space no longer has to stay contiguous as it grows. The global mheap struct holds a two-level arenas array; on linux/amd64 the first level has a single slot and the second level has 4M slots, each pointing to a heapArena structure that manages 64 MB of memory. In the new scheme Go can therefore manage 4M * 64M = 256 TB, i.e. all of the 256 TB addressable through the 48-bit address bus of today's 64-bit machines. Given a pointer, a fixed offset calculation tells you which 64 MB heapArena it falls into; the remainder within those 64 MB, combined with the arena's spans array, tells you which mspan it belongs to; and the heapArena's bitmap, indexed by the pointer's offset in 8-byte units, tells you whether each 8-byte word of the object is a pointer or plain data.
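
The lookup arithmetic can be sketched as follows. The constants are hard-coded for linux/amd64 as described above (one L1 slot, 4M L2 slots, 64 MB arenas, 8 KB pages); this is an illustration only, and it omits the arenaBaseOffset that the real runtime applies on some platforms:

package main

import "fmt"

const (
	heapArenaBytes = 64 << 20 // each heapArena manages 64 MB
	arenaL2Bits    = 22       // 1<<22 = 4M second-level slots on linux/amd64
	pageSize       = 8 << 10  // 8 KB pages inside an arena
)

// arenaIndexes mirrors the two-level lookup described above: which heapArena a
// pointer belongs to, and which page (hence which mspan) inside that arena.
func arenaIndexes(p uintptr) (l1, l2, page uintptr) {
	ai := p / heapArenaBytes               // which 64 MB arena
	l1 = ai >> arenaL2Bits                 // first-level index (always 0 here)
	l2 = ai & (1<<arenaL2Bits - 1)         // second-level index into the 4M slots
	page = (p % heapArenaBytes) / pageSize // index into the arena's spans array
	return
}

func main() {
	p := uintptr(0xc000010100) // a typical Go heap address
	l1, l2, page := arenaIndexes(p)
	fmt.Printf("arena L1=%d L2=%d page=%d\n", l1, l2, page)
}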

Source code analysis

The source code walkthrough below is quoted from https://www.cnblogs.com/zkweb/p/7880099.html, which explains it in great detail:

Go triggers GC starting from the gcStart function:

// gcStart transitions the GC from _GCoff to _GCmark (if
// !mode.stwMark) or _GCmarktermination (if mode.stwMark) by
// performing sweep termination and GC initialization.
//
// This may return without performing this transition in some cases,
// such as when called on a system stack or with locks held.
func gcStart(mode gcMode, trigger gcTrigger) {
	// 判斷當前G是否可搶佔, 不可搶佔時不觸發GC
	// Since this is called from malloc and malloc is called in
	// the guts of a number of libraries that might be holding
	// locks, don't attempt to start GC in non-preemptible or
	// potentially unstable situations.
	mp := acquirem()
	if gp := getg(); gp == mp.g0 || mp.locks > 1 || mp.preemptoff != "" {
		releasem(mp)
		return
	}
	releasem(mp)
	mp = nil

	// 並行清掃上一輪GC未清掃的span
	// Pick up the remaining unswept/not being swept spans concurrently
	//
	// This shouldn't happen if we're being invoked in background
	// mode since proportional sweep should have just finished
	// sweeping everything, but rounding errors, etc, may leave a
	// few spans unswept. In forced mode, this is necessary since
	// GC can be forced at any point in the sweeping cycle.
	//
	// We check the transition condition continuously here in case
	// this G gets delayed in to the next GC cycle.
	for trigger.test() && gosweepone() != ^uintptr(0) {
		sweep.nbgsweep++
	}

	// 上鎖, 然後重新檢查gcTrigger的條件是否成立, 不成立時不觸發GC
	// Perform GC initialization and the sweep termination
	// transition.
	semacquire(&work.startSema)
	// Re-check transition condition under transition lock.
	if !trigger.test() {
		semrelease(&work.startSema)
		return
	}

	// 記錄是否強制觸發, gcTriggerCycle是runtime.GC用的
	// For stats, check if this GC was forced by the user.
	work.userForced = trigger.kind == gcTriggerAlways || trigger.kind == gcTriggerCycle

	// 判斷是否指定了禁止並行GC的參數
	// In gcstoptheworld debug mode, upgrade the mode accordingly.
	// We do this after re-checking the transition condition so
	// that multiple goroutines that detect the heap trigger don't
	// start multiple STW GCs.
	if mode == gcBackgroundMode {
		if debug.gcstoptheworld == 1 {
			mode = gcForceMode
		} else if debug.gcstoptheworld == 2 {
			mode = gcForceBlockMode
		}
	}

	// Ok, we're doing it!  Stop everybody else
	semacquire(&worldsema)

	// 跟蹤處理
	if trace.enabled {
		traceGCStart()
	}

	// 啓動後臺掃描任務(G)
	if mode == gcBackgroundMode {
		gcBgMarkStartWorkers()
	}

	// 重置標記相關的狀態
	gcResetMarkState()

	// 重置參數
	work.stwprocs, work.maxprocs = gcprocs(), gomaxprocs
	work.heap0 = atomic.Load64(&memstats.heap_live)
	work.pauseNS = 0
	work.mode = mode

	// 記錄開始時間
	now := nanotime()
	work.tSweepTerm = now
	work.pauseStart = now
	
	// 停止所有運行中的G, 並禁止它們運行
	systemstack(stopTheWorldWithSema)
	
	// !!!!!!!!!!!!!!!!
	// 世界已停止(STW)...
	// !!!!!!!!!!!!!!!!
	
	// 清掃上一輪GC未清掃的span, 確保上一輪GC已完成
	// Finish sweep before we start concurrent scan.
	systemstack(func() {
		finishsweep_m()
	})
	// 清掃sched.sudogcache和sched.deferpool
	// clearpools before we start the GC. If we wait they memory will not be
	// reclaimed until the next GC cycle.
	clearpools()

	// 增加GC計數
	work.cycles++
	
	// 判斷是否並行GC模式
	if mode == gcBackgroundMode { // Do as much work concurrently as possible
		// 標記新一輪GC已開始
		gcController.startCycle()
		work.heapGoal = memstats.next_gc

		// 設置全局變量中的GC狀態爲_GCmark
		// 然後啓用寫屏障
		// Enter concurrent mark phase and enable
		// write barriers.
		//
		// Because the world is stopped, all Ps will
		// observe that write barriers are enabled by
		// the time we start the world and begin
		// scanning.
		//
		// Write barriers must be enabled before assists are
		// enabled because they must be enabled before
		// any non-leaf heap objects are marked. Since
		// allocations are blocked until assists can
		// happen, we want enable assists as early as
		// possible.
		setGCPhase(_GCmark)

		// 重置後臺標記任務的計數
		gcBgMarkPrepare() // Must happen before assist enable.

		// 計算掃描根對象的任務數量
		gcMarkRootPrepare()

		// 標記所有tiny alloc等待合併的對象
		// Mark all active tinyalloc blocks. Since we're
		// allocating from these, they need to be black like
		// other allocations. The alternative is to blacken
		// the tiny block on every allocation from it, which
		// would slow down the tiny allocator.
		gcMarkTinyAllocs()

		// 啓用輔助GC
		// At this point all Ps have enabled the write
		// barrier, thus maintaining the no white to
		// black invariant. Enable mutator assists to
		// put back-pressure on fast allocating
		// mutators.
		atomic.Store(&gcBlackenEnabled, 1)

		// 記錄標記開始的時間
		// Assists and workers can start the moment we start
		// the world.
		gcController.markStartTime = now

		// 重新啓動世界
		// 前面創建的後臺標記任務會開始工作, 所有後臺標記任務都完成工作後, 進入完成標記階段
		// Concurrent mark.
		systemstack(startTheWorldWithSema)
		
		// !!!!!!!!!!!!!!!
		// 世界已重新啓動...
		// !!!!!!!!!!!!!!!
		
		// 記錄停止了多久, 和標記階段開始的時間
		now = nanotime()
		work.pauseNS += now - work.pauseStart
		work.tMark = now
	} else {
		// 不是並行GC模式
		// 記錄完成標記階段開始的時間
		t := nanotime()
		work.tMark, work.tMarkTerm = t, t
		work.heapGoal = work.heap0

		// 跳過標記階段, 執行完成標記階段
		// 所有標記工作都會在世界已停止的狀態執行
		// (標記階段會設置work.markrootDone=true, 如果跳過則它的值是false, 完成標記階段會執行所有工作)
		// 完成標記階段會重新啓動世界
		// Perform mark termination. This will restart the world.
		gcMarkTermination(memstats.triggerRatio)
	}

	semrelease(&work.startSema)
}

Next we analyze the functions called by gcStart one by one; they are best read together with the GC flow diagram above.

The gcBgMarkStartWorkers function starts the background mark workers, one for each P:

// gcBgMarkStartWorkers prepares background mark worker goroutines.
// These goroutines will not run until the mark phase, but they must
// be started while the work is not stopped and from a regular G
// stack. The caller must hold worldsema.
func gcBgMarkStartWorkers() {
	// Background marking is performed by per-P G's. Ensure that
	// each P has a background GC G.
	for _, p := range &allp {
		if p == nil || p.status == _Pdead {
			break
		}
		// 如果已啓動則不重複啓動
		if p.gcBgMarkWorker == 0 {
			go gcBgMarkWorker(p)
			// 啓動後等待該任務通知信號量bgMarkReady再繼續
			notetsleepg(&work.bgMarkReady, -1)
			noteclear(&work.bgMarkReady)
		}
	}
}

Although a background mark worker is started for every P, only about 25% of them can work at the same time; that logic is in findRunnableGCWorker, which is called when an M is looking for a G to run:

// findRunnableGCWorker returns the background mark worker for _p_ if it
// should be run. This must only be called when gcBlackenEnabled != 0.
func (c *gcControllerState) findRunnableGCWorker(_p_ *p) *g {
	if gcBlackenEnabled == 0 {
		throw("gcControllerState.findRunnable: blackening not enabled")
	}
	if _p_.gcBgMarkWorker == 0 {
		// The mark worker associated with this P is blocked
		// performing a mark transition. We can't run it
		// because it may be on some other run or wait queue.
		return nil
	}

	if !gcMarkWorkAvailable(_p_) {
		// No work to be done right now. This can happen at
		// the end of the mark phase when there are still
		// assists tapering off. Don't bother running a worker
		// now because it'll just return immediately.
		return nil
	}

	// 原子減少對應的值, 如果減少後大於等於0則返回true, 否則返回false
	decIfPositive := func(ptr *int64) bool {
		if *ptr > 0 {
			if atomic.Xaddint64(ptr, -1) >= 0 {
				return true
			}
			// We lost a race
			atomic.Xaddint64(ptr, +1)
		}
		return false
	}

	// 減少dedicatedMarkWorkersNeeded, 成功時後臺標記任務的模式是Dedicated
	// dedicatedMarkWorkersNeeded是當前P的數量的25%去除小數點
	// 詳見startCycle函數
	if decIfPositive(&c.dedicatedMarkWorkersNeeded) {
		// This P is now dedicated to marking until the end of
		// the concurrent mark phase.
		_p_.gcMarkWorkerMode = gcMarkWorkerDedicatedMode
	} else {
		// 減少fractionalMarkWorkersNeeded, 成功是後臺標記任務的模式是Fractional
		// 上面的計算如果小數點後有數值(不能夠整除)則fractionalMarkWorkersNeeded爲1, 否則爲0
		// 詳見startCycle函數
		// 舉例來說, 4個P時會執行1個Dedicated模式的任務, 5個P時會執行1個Dedicated模式和1個Fractional模式的任務
		if !decIfPositive(&c.fractionalMarkWorkersNeeded) {
			// No more workers are need right now.
			return nil
		}

		// 按Dedicated模式的任務的執行時間判斷cpu佔用率是否超過預算值, 超過時不啓動
		// This P has picked the token for the fractional worker.
		// Is the GC currently under or at the utilization goal?
		// If so, do more work.
		//
		// We used to check whether doing one time slice of work
		// would remain under the utilization goal, but that has the
		// effect of delaying work until the mutator has run for
		// enough time slices to pay for the work. During those time
		// slices, write barriers are enabled, so the mutator is running slower.
		// Now instead we do the work whenever we're under or at the
		// utilization work and pay for it by letting the mutator run later.
		// This doesn't change the overall utilization averages, but it
		// front loads the GC work so that the GC finishes earlier and
		// write barriers can be turned off sooner, effectively giving
		// the mutator a faster machine.
		//
		// The old, slower behavior can be restored by setting
		//	gcForcePreemptNS = forcePreemptNS.
		const gcForcePreemptNS = 0

		// TODO(austin): We could fast path this and basically
		// eliminate contention on c.fractionalMarkWorkersNeeded by
		// precomputing the minimum time at which it's worth
		// next scheduling the fractional worker. Then Ps
		// don't have to fight in the window where we've
		// passed that deadline and no one has started the
		// worker yet.
		//
		// TODO(austin): Shorter preemption interval for mark
		// worker to improve fairness and give this
		// finer-grained control over schedule?
		now := nanotime() - gcController.markStartTime
		then := now + gcForcePreemptNS
		timeUsed := c.fractionalMarkTime + gcForcePreemptNS
		if then > 0 && float64(timeUsed)/float64(then) > c.fractionalUtilizationGoal {
			// Nope, we'd overshoot the utilization goal
			atomic.Xaddint64(&c.fractionalMarkWorkersNeeded, +1)
			return nil
		}
		_p_.gcMarkWorkerMode = gcMarkWorkerFractionalMode
	}

	// 安排後臺標記任務執行
	// Run the background mark worker
	gp := _p_.gcBgMarkWorker.ptr()
	casgstatus(gp, _Gwaiting, _Grunnable)
	if trace.enabled {
		traceGoUnpark(gp, 0)
	}
	return gp
}

The gcResetMarkState function resets the mark-related state:

// gcResetMarkState resets global state prior to marking (concurrent
// or STW) and resets the stack scan state of all Gs.
//
// This is safe to do without the world stopped because any Gs created
// during or after this will start out in the reset state.
func gcResetMarkState() {
	// This may be called during a concurrent phase, so make sure
	// allgs doesn't change.
	lock(&allglock)
	for _, gp := range allgs {
		gp.gcscandone = false  // set to true in gcphasework
		gp.gcscanvalid = false // stack has not been scanned
		gp.gcAssistBytes = 0
	}
	unlock(&allglock)

	work.bytesMarked = 0
	work.initialHeapLive = atomic.Load64(&memstats.heap_live)
	work.markrootDone = false
}

The stopTheWorldWithSema function stops the whole world; it must run on g0:

// stopTheWorldWithSema is the core implementation of stopTheWorld.
// The caller is responsible for acquiring worldsema and disabling
// preemption first and then should stopTheWorldWithSema on the system
// stack:
//
//	semacquire(&worldsema, 0)
//	m.preemptoff = "reason"
//	systemstack(stopTheWorldWithSema)
//
// When finished, the caller must either call startTheWorld or undo
// these three operations separately:
//
//	m.preemptoff = ""
//	systemstack(startTheWorldWithSema)
//	semrelease(&worldsema)
//
// It is allowed to acquire worldsema once and then execute multiple
// startTheWorldWithSema/stopTheWorldWithSema pairs.
// Other P's are able to execute between successive calls to
// startTheWorldWithSema and stopTheWorldWithSema.
// Holding worldsema causes any other goroutines invoking
// stopTheWorld to block.
func stopTheWorldWithSema() {
	_g_ := getg()

	// If we hold a lock, then we won't be able to stop another M
	// that is blocked trying to acquire the lock.
	if _g_.m.locks > 0 {
		throw("stopTheWorld: holding locks")
	}

	lock(&sched.lock)
	
	// 需要停止的P數量
	sched.stopwait = gomaxprocs
	
	// 設置gc等待標記, 調度時看見此標記會進入等待
	atomic.Store(&sched.gcwaiting, 1)
	
	// 搶佔所有運行中的G
	preemptall()
	
	// 停止當前的P
	// stop current P
	_g_.m.p.ptr().status = _Pgcstop // Pgcstop is only diagnostic.
	
	// 減少需要停止的P數量(當前的P算一個)
	sched.stopwait--
	
	// 搶佔所有在Psyscall狀態的P, 防止它們重新參與調度
	// try to retake all P's in Psyscall status
	for i := 0; i < int(gomaxprocs); i++ {
		p := allp[i]
		s := p.status
		if s == _Psyscall && atomic.Cas(&p.status, s, _Pgcstop) {
			if trace.enabled {
				traceGoSysBlock(p)
				traceProcStop(p)
			}
			p.syscalltick++
			sched.stopwait--
		}
	}
	
	// 防止所有空閒的P重新參與調度
	// stop idle P's
	for {
		p := pidleget()
		if p == nil {
			break
		}
		p.status = _Pgcstop
		sched.stopwait--
	}
	wait := sched.stopwait > 0
	unlock(&sched.lock)

	// 如果仍有需要停止的P, 則等待它們停止
	// wait for remaining P's to stop voluntarily
	if wait {
		for {
			// 循環等待 + 搶佔所有運行中的G
			// wait for 100us, then try to re-preempt in case of any races
			if notetsleep(&sched.stopnote, 100*1000) {
				noteclear(&sched.stopnote)
				break
			}
			preemptall()
		}
	}

	// 邏輯正確性檢查
	// sanity checks
	bad := ""
	if sched.stopwait != 0 {
		bad = "stopTheWorld: not stopped (stopwait != 0)"
	} else {
		for i := 0; i < int(gomaxprocs); i++ {
			p := allp[i]
			if p.status != _Pgcstop {
				bad = "stopTheWorld: not stopped (status != _Pgcstop)"
			}
		}
	}
	if atomic.Load(&freezing) != 0 {
		// Some other thread is panicking. This can cause the
		// sanity checks above to fail if the panic happens in
		// the signal handler on a stopped thread. Either way,
		// we should halt this thread.
		lock(&deadlock)
		lock(&deadlock)
	}
	if bad != "" {
		throw(bad)
	}
	
	// 到這裏所有運行中的G都會變爲待運行, 並且所有的P都不能被M獲取
	// 也就是說所有的go代碼(除了當前的)都會停止運行, 並且不能運行新的go代碼
}

The finishsweep_m function sweeps the spans left unswept by the previous GC cycle, making sure the previous cycle has completed:

// finishsweep_m ensures that all spans are swept.
//
// The world must be stopped. This ensures there are no sweeps in
// progress.
//
//go:nowritebarrier
func finishsweep_m() {
	// sweepone會取出一個未sweep的span然後執行sweep
	// 詳細將在下面sweep階段時分析
	// Sweeping must be complete before marking commences, so
	// sweep any unswept spans. If this is a concurrent GC, there
	// shouldn't be any spans left to sweep, so this should finish
	// instantly. If GC was forced before the concurrent sweep
	// finished, there may be spans to sweep.
	for sweepone() != ^uintptr(0) {
		sweep.npausesweep++
	}

	// 所有span都sweep完成後, 啓動一個新的markbit時代
	// 這個函數是實現span的gcmarkBits和allocBits的分配和複用的關鍵, 流程如下
	// - span分配gcmarkBits和allocBits
	// - span完成sweep
	//   - 原allocBits不再被使用
	//   - gcmarkBits變爲allocBits
	//   - 分配新的gcmarkBits
	// - 開啓新的markbit時代
	// - span完成sweep, 同上
	// - 開啓新的markbit時代
	//   - 2個時代之前的bitmap將不再被使用, 可以複用這些bitmap
	nextMarkBitArenaEpoch()
}

The clearpools function clears sched.sudogcache and sched.deferpool so that their memory can be reclaimed:

func clearpools() {
	// clear sync.Pools
	if poolcleanup != nil {
		poolcleanup()
	}

	// Clear central sudog cache.
	// Leave per-P caches alone, they have strictly bounded size.
	// Disconnect cached list before dropping it on the floor,
	// so that a dangling ref to one entry does not pin all of them.
	lock(&sched.sudoglock)
	var sg, sgnext *sudog
	for sg = sched.sudogcache; sg != nil; sg = sgnext {
		sgnext = sg.next
		sg.next = nil
	}
	sched.sudogcache = nil
	unlock(&sched.sudoglock)

	// Clear central defer pools.
	// Leave per-P pools alone, they have strictly bounded size.
	lock(&sched.deferlock)
	for i := range sched.deferpool {
		// disconnect cached list before dropping it on the floor,
		// so that a dangling ref to one entry does not pin all of them.
		var d, dlink *_defer
		for d = sched.deferpool[i]; d != nil; d = dlink {
			dlink = d.link
			d.link = nil
		}
		sched.deferpool[i] = nil
	}
	unlock(&sched.deferlock)
}

startCycle marks the start of a new GC cycle:

// startCycle resets the GC controller's state and computes estimates
// for a new GC cycle. The caller must hold worldsema.
func (c *gcControllerState) startCycle() {
	c.scanWork = 0
	c.bgScanCredit = 0
	c.assistTime = 0
	c.dedicatedMarkTime = 0
	c.fractionalMarkTime = 0
	c.idleMarkTime = 0

	// 僞裝heap_marked的值如果gc_trigger的值很小, 防止後面對triggerRatio做出錯誤的調整
	// If this is the first GC cycle or we're operating on a very
	// small heap, fake heap_marked so it looks like gc_trigger is
	// the appropriate growth from heap_marked, even though the
	// real heap_marked may not have a meaningful value (on the
	// first cycle) or may be much smaller (resulting in a large
	// error response).
	if memstats.gc_trigger <= heapminimum {
		memstats.heap_marked = uint64(float64(memstats.gc_trigger) / (1 + memstats.triggerRatio))
	}

	// 重新計算next_gc, 注意next_gc的計算跟gc_trigger不一樣
	// Re-compute the heap goal for this cycle in case something
	// changed. This is the same calculation we use elsewhere.
	memstats.next_gc = memstats.heap_marked + memstats.heap_marked*uint64(gcpercent)/100
	if gcpercent < 0 {
		memstats.next_gc = ^uint64(0)
	}

	// 確保next_gc和heap_live之間最少有1MB
	// Ensure that the heap goal is at least a little larger than
	// the current live heap size. This may not be the case if GC
	// start is delayed or if the allocation that pushed heap_live
	// over gc_trigger is large or if the trigger is really close to
	// GOGC. Assist is proportional to this distance, so enforce a
	// minimum distance, even if it means going over the GOGC goal
	// by a tiny bit.
	if memstats.next_gc < memstats.heap_live+1024*1024 {
		memstats.next_gc = memstats.heap_live + 1024*1024
	}

	// 計算可以同時執行的後臺標記任務的數量
	// dedicatedMarkWorkersNeeded等於P的數量的25%去除小數點
	// 如果可以整除則fractionalMarkWorkersNeeded等於0否則等於1
	// totalUtilizationGoal是GC所佔的P的目標值(例如P一共有5個時目標是1.25個P)
	// fractionalUtilizationGoal是Fractiona模式的任務所佔的P的目標值(例如P一共有5個時目標是0.25個P)
	// Compute the total mark utilization goal and divide it among
	// dedicated and fractional workers.
	totalUtilizationGoal := float64(gomaxprocs) * gcGoalUtilization
	c.dedicatedMarkWorkersNeeded = int64(totalUtilizationGoal)
	c.fractionalUtilizationGoal = totalUtilizationGoal - float64(c.dedicatedMarkWorkersNeeded)
	if c.fractionalUtilizationGoal > 0 {
		c.fractionalMarkWorkersNeeded = 1
	} else {
		c.fractionalMarkWorkersNeeded = 0
	}

	// 重置P中的輔助GC所用的時間統計
	// Clear per-P state
	for _, p := range &allp {
		if p == nil {
			break
		}
		p.gcAssistTime = 0
	}

	// 計算輔助GC的參數
	// 參考上面對計算assistWorkPerByte的公式的分析
	// Compute initial values for controls that are updated
	// throughout the cycle.
	c.revise()

	if debug.gcpacertrace > 0 {
		print("pacer: assist ratio=", c.assistWorkPerByte,
			" (scan ", memstats.heap_scan>>20, " MB in ",
			work.initialHeapLive>>20, "->",
			memstats.next_gc>>20, " MB)",
			" workers=", c.dedicatedMarkWorkersNeeded,
			"+", c.fractionalMarkWorkersNeeded, "\n")
	}
}

The setGCPhase function updates the global variable recording the current GC phase and the globals that control whether the write barrier is enabled:

//go:nosplit
func setGCPhase(x uint32) {
	atomic.Store(&gcphase, x)
	writeBarrier.needed = gcphase == _GCmark || gcphase == _GCmarktermination
	writeBarrier.enabled = writeBarrier.needed || writeBarrier.cgo
}

The gcBgMarkPrepare function resets the counters used by the background mark workers:

// gcBgMarkPrepare sets up state for background marking.
// Mutator assists must not yet be enabled.
func gcBgMarkPrepare() {
	// Background marking will stop when the work queues are empty
	// and there are no more workers (note that, since this is
	// concurrent, this may be a transient state, but mark
	// termination will clean it up). Between background workers
	// and assists, we don't really know how many workers there
	// will be, so we pretend to have an arbitrarily large number
	// of workers, almost all of which are "waiting". While a
	// worker is working it decrements nwait. If nproc == nwait,
	// there are no workers.
	work.nproc = ^uint32(0)
	work.nwait = ^uint32(0)
}

The gcMarkRootPrepare function computes the number of root-scanning jobs:

// gcMarkRootPrepare queues root scanning jobs (stacks, globals, and
// some miscellany) and initializes scanning-related state.
//
// The caller must have call gcCopySpans().
//
// The world must be stopped.
//
//go:nowritebarrier
func gcMarkRootPrepare() {
	// 釋放mcache中的所有span的任務, 只在完成標記階段(mark termination)中執行
	if gcphase == _GCmarktermination {
		work.nFlushCacheRoots = int(gomaxprocs)
	} else {
		work.nFlushCacheRoots = 0
	}

	// 計算block數量的函數, rootBlockBytes是256KB
	// Compute how many data and BSS root blocks there are.
	nBlocks := func(bytes uintptr) int {
		return int((bytes + rootBlockBytes - 1) / rootBlockBytes)
	}

	work.nDataRoots = 0
	work.nBSSRoots = 0

	// data和bss每一輪GC只掃描一次
	// 並行GC中會在後臺標記任務中掃描, 完成標記階段(mark termination)中不掃描
	// 非並行GC會在完成標記階段(mark termination)中掃描
	// Only scan globals once per cycle; preferably concurrently.
	if !work.markrootDone {
		// 計算掃描可讀寫的全局變量的任務數量
		for _, datap := range activeModules() {
			nDataRoots := nBlocks(datap.edata - datap.data)
			if nDataRoots > work.nDataRoots {
				work.nDataRoots = nDataRoots
			}
		}

		// 計算掃描只讀的全局變量的任務數量
		for _, datap := range activeModules() {
			nBSSRoots := nBlocks(datap.ebss - datap.bss)
			if nBSSRoots > work.nBSSRoots {
				work.nBSSRoots = nBSSRoots
			}
		}
	}

	// span中的finalizer和各個G的棧每一輪GC只掃描一次
	// 同上
	if !work.markrootDone {
		// 計算掃描span中的finalizer的任務數量
		// On the first markroot, we need to scan span roots.
		// In concurrent GC, this happens during concurrent
		// mark and we depend on addfinalizer to ensure the
		// above invariants for objects that get finalizers
		// after concurrent mark. In STW GC, this will happen
		// during mark termination.
		//
		// We're only interested in scanning the in-use spans,
		// which will all be swept at this point. More spans
		// may be added to this list during concurrent GC, but
		// we only care about spans that were allocated before
		// this mark phase.
		work.nSpanRoots = mheap_.sweepSpans[mheap_.sweepgen/2%2].numBlocks()

		// 計算掃描各個G的棧的任務數量
		// On the first markroot, we need to scan all Gs. Gs
		// may be created after this point, but it's okay that
		// we ignore them because they begin life without any
		// roots, so there's nothing to scan, and any roots
		// they create during the concurrent phase will be
		// scanned during mark termination. During mark
		// termination, allglen isn't changing, so we'll scan
		// all Gs.
		work.nStackRoots = int(atomic.Loaduintptr(&allglen))
	} else {
		// We've already scanned span roots and kept the scan
		// up-to-date during concurrent mark.
		work.nSpanRoots = 0

		// The hybrid barrier ensures that stacks can't
		// contain pointers to unmarked objects, so on the
		// second markroot, there's no need to scan stacks.
		work.nStackRoots = 0

		if debug.gcrescanstacks > 0 {
			// Scan stacks anyway for debugging.
			work.nStackRoots = int(atomic.Loaduintptr(&allglen))
		}
	}

	// 計算總任務數量
	// 後臺標記任務會對markrootNext進行原子遞增, 來決定做哪個任務
	// 這種用數值來實現鎖自由隊列的辦法挺聰明的, 儘管google工程師覺得不好(看後面markroot函數的分析)
	work.markrootNext = 0
	work.markrootJobs = uint32(fixedRootCount + work.nFlushCacheRoots + work.nDataRoots + work.nBSSRoots + work.nSpanRoots + work.nStackRoots)
}

The gcMarkTinyAllocs function marks all objects waiting to be merged in the tiny allocator:

// gcMarkTinyAllocs greys all active tiny alloc blocks.
//
// The world must be stopped.
func gcMarkTinyAllocs() {
	for _, p := range &allp {
		if p == nil || p.status == _Pdead {
			break
		}
		c := p.mcache
		if c == nil || c.tiny == 0 {
			continue
		}
		// 標記各個P中的mcache中的tiny
		// 在上面的mallocgc函數中可以看到tiny是當前等待合併的對象
		_, hbits, span, objIndex := heapBitsForObject(c.tiny, 0, 0)
		gcw := &p.gcw
		// 標記一個對象存活, 並把它加到標記隊列(該對象變爲灰色)
		greyobject(c.tiny, 0, 0, hbits, span, gcw, objIndex)
		// gcBlackenPromptly變量表示當前是否禁止本地隊列, 如果已禁止則把標記任務flush到全局隊列
		if gcBlackenPromptly {
			gcw.dispose()
		}
	}
}

The startTheWorldWithSema function restarts the world:

func startTheWorldWithSema() {
	_g_ := getg()
	
	// 禁止G被搶佔
	_g_.m.locks++        // disable preemption because it can be holding p in a local var
	
	// 判斷收到的網絡事件(fd可讀可寫或錯誤)並添加對應的G到待運行隊列
	gp := netpoll(false) // non-blocking
	injectglist(gp)
	
	// 判斷是否要啓動gc helper
	add := needaddgcproc()
	lock(&sched.lock)
	
	// 如果要求改變gomaxprocs則調整P的數量
	// procresize會返回有可運行任務的P的鏈表
	procs := gomaxprocs
	if newprocs != 0 {
		procs = newprocs
		newprocs = 0
	}
	p1 := procresize(procs)
	
	// 取消GC等待標記
	sched.gcwaiting = 0
	
	// 如果sysmon在等待則喚醒它
	if sched.sysmonwait != 0 {
		sched.sysmonwait = 0
		notewakeup(&sched.sysmonnote)
	}
	unlock(&sched.lock)
	
	// 喚醒有可運行任務的P
	for p1 != nil {
		p := p1
		p1 = p1.link.ptr()
		if p.m != 0 {
			mp := p.m.ptr()
			p.m = 0
			if mp.nextp != 0 {
				throw("startTheWorld: inconsistent mp->nextp")
			}
			mp.nextp.set(p)
			notewakeup(&mp.park)
		} else {
			// Start M to run P.  Do not start another M below.
			newm(nil, p)
			add = false
		}
	}
	
	// 如果有空閒的P,並且沒有自旋中的M則喚醒或者創建一個M
	// Wakeup an additional proc in case we have excessive runnable goroutines
	// in local queues or in the global queue. If we don't, the proc will park itself.
	// If we have lots of excessive work, resetspinning will unpark additional procs as necessary.
	if atomic.Load(&sched.npidle) != 0 && atomic.Load(&sched.nmspinning) == 0 {
		wakep()
	}
	
	// 啓動gc helper
	if add {
		// If GC could have used another helper proc, start one now,
		// in the hope that it will be available next time.
		// It would have been even better to start it before the collection,
		// but doing so requires allocating memory, so it's tricky to
		// coordinate. This lazy approach works out in practice:
		// we don't mind if the first couple gc rounds don't have quite
		// the maximum number of procs.
		newm(mhelpgc, nil)
	}
	
	// 允許G被搶佔
	_g_.m.locks--
	
	// 如果當前G要求被搶佔則重新嘗試
	if _g_.m.locks == 0 && _g_.preempt { // restore the preemption request in case we've cleared it in newstack
		_g_.stackguard0 = stackPreempt
	}
}

After the world is restarted, the Ms resume scheduling. The scheduler consults the findRunnableGCWorker function mentioned above first, so roughly 25% of the Ps end up running background mark workers.
The background mark worker function is gcBgMarkWorker:

func gcBgMarkWorker(_p_ *p) {
	gp := getg()
	
	// 用於休眠後重新獲取P的構造體
	type parkInfo struct {
		m      muintptr // Release this m on park.
		attach puintptr // If non-nil, attach to this p on park.
	}
	// We pass park to a gopark unlock function, so it can't be on
	// the stack (see gopark). Prevent deadlock from recursively
	// starting GC by disabling preemption.
	gp.m.preemptoff = "GC worker init"
	park := new(parkInfo)
	gp.m.preemptoff = ""
	
	// 設置當前的M並禁止搶佔
	park.m.set(acquirem())
	// 設置當前的P(需要關聯到的P)
	park.attach.set(_p_)
	
	// 通知gcBgMarkStartWorkers可以繼續處理
	// Inform gcBgMarkStartWorkers that this worker is ready.
	// After this point, the background mark worker is scheduled
	// cooperatively by gcController.findRunnable. Hence, it must
	// never be preempted, as this would put it into _Grunnable
	// and put it on a run queue. Instead, when the preempt flag
	// is set, this puts itself into _Gwaiting to be woken up by
	// gcController.findRunnable at the appropriate time.
	notewakeup(&work.bgMarkReady)
	
	for {
		// 讓當前G進入休眠
		// Go to sleep until woken by gcController.findRunnable.
		// We can't releasem yet since even the call to gopark
		// may be preempted.
		gopark(func(g *g, parkp unsafe.Pointer) bool {
			park := (*parkInfo)(parkp)
			
			// 重新允許搶佔
			// The worker G is no longer running, so it's
			// now safe to allow preemption.
			releasem(park.m.ptr())
			
			// 設置關聯的P
			// 把當前的G設到P的gcBgMarkWorker成員, 下次findRunnableGCWorker會使用
			// 設置失敗時不休眠
			// If the worker isn't attached to its P,
			// attach now. During initialization and after
			// a phase change, the worker may have been
			// running on a different P. As soon as we
			// attach, the owner P may schedule the
			// worker, so this must be done after the G is
			// stopped.
			if park.attach != 0 {
				p := park.attach.ptr()
				park.attach.set(nil)
				// cas the worker because we may be
				// racing with a new worker starting
				// on this P.
				if !p.gcBgMarkWorker.cas(0, guintptr(unsafe.Pointer(g))) {
					// The P got a new worker.
					// Exit this worker.
					return false
				}
			}
			return true
		}, unsafe.Pointer(park), "GC worker (idle)", traceEvGoBlock, 0)
		
		// 檢查P的gcBgMarkWorker是否和當前的G一致, 不一致時結束當前的任務
		// Loop until the P dies and disassociates this
		// worker (the P may later be reused, in which case
		// it will get a new worker) or we failed to associate.
		if _p_.gcBgMarkWorker.ptr() != gp {
			break
		}
		
		// 禁止G被搶佔
		// Disable preemption so we can use the gcw. If the
		// scheduler wants to preempt us, we'll stop draining,
		// dispose the gcw, and then preempt.
		park.m.set(acquirem())
		
		if gcBlackenEnabled == 0 {
			throw("gcBgMarkWorker: blackening not enabled")
		}
		
		// 記錄開始時間
		startTime := nanotime()
		
		decnwait := atomic.Xadd(&work.nwait, -1)
		if decnwait == work.nproc {
			println("runtime: work.nwait=", decnwait, "work.nproc=", work.nproc)
			throw("work.nwait was > work.nproc")
		}
		
		// 切換到g0運行
		systemstack(func() {
			// 設置G的狀態爲等待中這樣它的棧可以被掃描(兩個後臺標記任務可以互相掃描對方的棧)
			// Mark our goroutine preemptible so its stack
			// can be scanned. This lets two mark workers
			// scan each other (otherwise, they would
			// deadlock). We must not modify anything on
			// the G stack. However, stack shrinking is
			// disabled for mark workers, so it is safe to
			// read from the G stack.
			casgstatus(gp, _Grunning, _Gwaiting)
			
			// 判斷後臺標記任務的模式
			switch _p_.gcMarkWorkerMode {
			default:
				throw("gcBgMarkWorker: unexpected gcMarkWorkerMode")
			case gcMarkWorkerDedicatedMode:
				// 這個模式下P應該專心執行標記
				// 執行標記, 直到被搶佔, 並且需要計算後臺的掃描量來減少輔助GC和喚醒等待中的G
				gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
				// 被搶佔時把本地運行隊列中的所有G都踢到全局運行隊列
				if gp.preempt {
					// We were preempted. This is
					// a useful signal to kick
					// everything out of the run
					// queue so it can run
					// somewhere else.
					lock(&sched.lock)
					for {
						gp, _ := runqget(_p_)
						if gp == nil {
							break
						}
						globrunqput(gp)
					}
					unlock(&sched.lock)
				}
				// 繼續執行標記, 直到無更多任務, 並且需要計算後臺的掃描量來減少輔助GC和喚醒等待中的G
				// Go back to draining, this time
				// without preemption.
				gcDrain(&_p_.gcw, gcDrainNoBlock|gcDrainFlushBgCredit)
			case gcMarkWorkerFractionalMode:
				// 這個模式下P應該適當執行標記
				// 執行標記, 直到被搶佔, 並且需要計算後臺的掃描量來減少輔助GC和喚醒等待中的G
				gcDrain(&_p_.gcw, gcDrainUntilPreempt|gcDrainFlushBgCredit)
			case gcMarkWorkerIdleMode:
				// 這個模式下P只在空閒時執行標記
				// 執行標記, 直到被搶佔或者達到一定的量, 並且需要計算後臺的掃描量來減少輔助GC和喚醒等待中的G
				gcDrain(&_p_.gcw, gcDrainIdle|gcDrainUntilPreempt|gcDrainFlushBgCredit)
			}
			
			// 恢復G的狀態到運行中
			casgstatus(gp, _Gwaiting, _Grunning)
		})
		
		// 如果標記了禁止本地標記隊列則flush到全局標記隊列
		// If we are nearing the end of mark, dispose
		// of the cache promptly. We must do this
		// before signaling that we're no longer
		// working so that other workers can't observe
		// no workers and no work while we have this
		// cached, and before we compute done.
		if gcBlackenPromptly {
			_p_.gcw.dispose()
		}
		
		// 累加所用時間
		// Account for time.
		duration := nanotime() - startTime
		switch _p_.gcMarkWorkerMode {
		case gcMarkWorkerDedicatedMode:
			atomic.Xaddint64(&gcController.dedicatedMarkTime, duration)
			atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 1)
		case gcMarkWorkerFractionalMode:
			atomic.Xaddint64(&gcController.fractionalMarkTime, duration)
			atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 1)
		case gcMarkWorkerIdleMode:
			atomic.Xaddint64(&gcController.idleMarkTime, duration)
		}
		
		// Was this the last worker and did we run out
		// of work?
		incnwait := atomic.Xadd(&work.nwait, +1)
		if incnwait > work.nproc {
			println("runtime: p.gcMarkWorkerMode=", _p_.gcMarkWorkerMode,
				"work.nwait=", incnwait, "work.nproc=", work.nproc)
			throw("work.nwait > work.nproc")
		}
		
		// 判斷是否所有後臺標記任務都完成, 並且沒有更多的任務
		// If this worker reached a background mark completion
		// point, signal the main GC goroutine.
		if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
			// 取消和P的關聯
			// Make this G preemptible and disassociate it
			// as the worker for this P so
			// findRunnableGCWorker doesn't try to
			// schedule it.
			_p_.gcBgMarkWorker.set(nil)
			
			// 允許G被搶佔
			releasem(park.m.ptr())
			
			// 準備進入完成標記階段
			gcMarkDone()
			
			// 休眠之前會重新關聯P
			// 因爲上面允許被搶佔, 到這裏的時候可能就會變成其他P
			// 如果重新關聯P失敗則這個任務會結束
			// Disable preemption and prepare to reattach
			// to the P.
			//
			// We may be running on a different P at this
			// point, so we can't reattach until this G is
			// parked.
			park.m.set(acquirem())
			park.attach.set(_p_)
		}
	}
}

The gcDrain function performs the marking work:

// gcDrain scans roots and objects in work buffers, blackening grey
// objects until all roots and work buffers have been drained.
//
// If flags&gcDrainUntilPreempt != 0, gcDrain returns when g.preempt
// is set. This implies gcDrainNoBlock.
//
// If flags&gcDrainIdle != 0, gcDrain returns when there is other work
// to do. This implies gcDrainNoBlock.
//
// If flags&gcDrainNoBlock != 0, gcDrain returns as soon as it is
// unable to get more work. Otherwise, it will block until all
// blocking calls are blocked in gcDrain.
//
// If flags&gcDrainFlushBgCredit != 0, gcDrain flushes scan work
// credit to gcController.bgScanCredit every gcCreditSlack units of
// scan work.
//
//go:nowritebarrier
func gcDrain(gcw *gcWork, flags gcDrainFlags) {
	if !writeBarrier.needed {
		throw("gcDrain phase incorrect")
	}
	
	gp := getg().m.curg
	
	// 看到搶佔標誌時是否要返回
	preemptible := flags&gcDrainUntilPreempt != 0
	
	// 沒有任務時是否要等待任務
	blocking := flags&(gcDrainUntilPreempt|gcDrainIdle|gcDrainNoBlock) == 0
	
	// 是否計算後臺的掃描量來減少輔助GC和喚醒等待中的G
	flushBgCredit := flags&gcDrainFlushBgCredit != 0
	
	// 是否只執行一定量的工作
	idle := flags&gcDrainIdle != 0
	
	// 記錄初始的已掃描數量
	initScanWork := gcw.scanWork
	
	// 掃描idleCheckThreshold(100000)個對象以後檢查是否要返回
	// idleCheck is the scan work at which to perform the next
	// idle check with the scheduler.
	idleCheck := initScanWork + idleCheckThreshold
	
	// 如果根對象未掃描完, 則先掃描根對象
	// Drain root marking jobs.
	if work.markrootNext < work.markrootJobs {
		// 如果標記了preemptible, 循環直到被搶佔
		for !(preemptible && gp.preempt) {
			// 從根對象掃描隊列取出一個值(原子遞增)
			job := atomic.Xadd(&work.markrootNext, +1) - 1
			if job >= work.markrootJobs {
				break
			}
			// 執行根對象掃描工作
			markroot(gcw, job)
			// 如果是idle模式並且有其他工作, 則返回
			if idle && pollWork() {
				goto done
			}
		}
	}
	
	// 根對象已經在標記隊列中, 消費標記隊列
	// 如果標記了preemptible, 循環直到被搶佔
	// Drain heap marking jobs.
	for !(preemptible && gp.preempt) {
		// 如果全局標記隊列爲空, 把本地標記隊列的一部分工作分過去
		// (如果wbuf2不爲空則移動wbuf2過去, 否則移動wbuf1的一半過去)
		// Try to keep work available on the global queue. We used to
		// check if there were waiting workers, but it's better to
		// just keep work available than to make workers wait. In the
		// worst case, we'll do O(log(_WorkbufSize)) unnecessary
		// balances.
		if work.full == 0 {
			gcw.balance()
		}
		
		// 從本地標記隊列中獲取對象, 獲取不到則從全局標記隊列獲取
		var b uintptr
		if blocking {
			// 阻塞獲取
			b = gcw.get()
		} else {
			// 非阻塞獲取
			b = gcw.tryGetFast()
			if b == 0 {
				b = gcw.tryGet()
			}
		}
		
		// 獲取不到對象, 標記隊列已爲空, 跳出循環
		if b == 0 {
			// work barrier reached or tryGet failed.
			break
		}
		
		// 掃描獲取到的對象
		scanobject(b, gcw)
		
		// 如果已經掃描了一定數量的對象(gcCreditSlack的值是2000)
		// Flush background scan work credit to the global
		// account if we've accumulated enough locally so
		// mutator assists can draw on it.
		if gcw.scanWork >= gcCreditSlack {
			// 把掃描的對象數量添加到全局
			atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
			// 減少輔助GC的工作量和喚醒等待中的G
			if flushBgCredit {
				gcFlushBgCredit(gcw.scanWork - initScanWork)
				initScanWork = 0
			}
			idleCheck -= gcw.scanWork
			gcw.scanWork = 0
			
			// 如果是idle模式且達到了檢查的掃描量, 則檢查是否有其他任務(G), 如果有則跳出循環
			if idle && idleCheck <= 0 {
				idleCheck += idleCheckThreshold
				if pollWork() {
					break
				}
			}
		}
	}
	
	// In blocking mode, write barriers are not allowed after this
	// point because we must preserve the condition that the work
	// buffers are empty.
	
done:
	// 把掃描的對象數量添加到全局
	// Flush remaining scan work credit.
	if gcw.scanWork > 0 {
		atomic.Xaddint64(&gcController.scanWork, gcw.scanWork)
		// 減少輔助GC的工作量和喚醒等待中的G
		if flushBgCredit {
			gcFlushBgCredit(gcw.scanWork - initScanWork)
		}
		gcw.scanWork = 0
	}
}

The markroot function performs the root-scanning jobs:

// markroot scans the i'th root.
//
// Preemption must be disabled (because this uses a gcWork).
//
// nowritebarrier is only advisory here.
//
//go:nowritebarrier
func markroot(gcw *gcWork, i uint32) {
	// 判斷取出的數值對應哪種任務
	// (google的工程師覺得這種辦法可笑)
	// TODO(austin): This is a bit ridiculous. Compute and store
	// the bases in gcMarkRootPrepare instead of the counts.
	baseFlushCache := uint32(fixedRootCount)
	baseData := baseFlushCache + uint32(work.nFlushCacheRoots)
	baseBSS := baseData + uint32(work.nDataRoots)
	baseSpans := baseBSS + uint32(work.nBSSRoots)
	baseStacks := baseSpans + uint32(work.nSpanRoots)
	end := baseStacks + uint32(work.nStackRoots)

	// Note: if you add a case here, please also update heapdump.go:dumproots.
	switch {
	// 釋放mcache中的所有span, 要求STW
	case baseFlushCache <= i && i < baseData:
		flushmcache(int(i - baseFlushCache))

	// 掃描可讀寫的全局變量
	// 這裏只會掃描i對應的block, 掃描時傳入包含哪裏有指針的bitmap數據
	case baseData <= i && i < baseBSS:
		for _, datap := range activeModules() {
			markrootBlock(datap.data, datap.edata-datap.data, datap.gcdatamask.bytedata, gcw, int(i-baseData))
		}

	// 掃描只讀的全局變量
	// 這裏只會掃描i對應的block, 掃描時傳入包含哪裏有指針的bitmap數據
	case baseBSS <= i && i < baseSpans:
		for _, datap := range activeModules() {
			markrootBlock(datap.bss, datap.ebss-datap.bss, datap.gcbssmask.bytedata, gcw, int(i-baseBSS))
		}

	// 掃描析構器隊列
	case i == fixedRootFinalizers:
		// Only do this once per GC cycle since we don't call
		// queuefinalizer during marking.
		if work.markrootDone {
			break
		}
		for fb := allfin; fb != nil; fb = fb.alllink {
			cnt := uintptr(atomic.Load(&fb.cnt))
			scanblock(uintptr(unsafe.Pointer(&fb.fin[0])), cnt*unsafe.Sizeof(fb.fin[0]), &finptrmask[0], gcw)
		}

	// 釋放已中止的G的棧
	case i == fixedRootFreeGStacks:
		// Only do this once per GC cycle; preferably
		// concurrently.
		if !work.markrootDone {
			// Switch to the system stack so we can call
			// stackfree.
			systemstack(markrootFreeGStacks)
		}

	// 掃描各個span中特殊對象(析構器列表)
	case baseSpans <= i && i < baseStacks:
		// mark MSpan.specials
		markrootSpans(gcw, int(i-baseSpans))

	// 掃描各個G的棧
	default:
		// 獲取需要掃描的G
		// the rest is scanning goroutine stacks
		var gp *g
		if baseStacks <= i && i < end {
			gp = allgs[i-baseStacks]
		} else {
			throw("markroot: bad index")
		}

		// 記錄等待開始的時間
		// remember when we've first observed the G blocked
		// needed only to output in traceback
		status := readgstatus(gp) // We are not in a scan state
		if (status == _Gwaiting || status == _Gsyscall) && gp.waitsince == 0 {
			gp.waitsince = work.tstart
		}

		// 切換到g0運行(有可能會掃到自己的棧)
		// scang must be done on the system stack in case
		// we're trying to scan our own stack.
		systemstack(func() {
			// 判斷掃描的棧是否自己的
			// If this is a self-scan, put the user G in
			// _Gwaiting to prevent self-deadlock. It may
			// already be in _Gwaiting if this is a mark
			// worker or we're in mark termination.
			userG := getg().m.curg
			selfScan := gp == userG && readgstatus(userG) == _Grunning
			
			// 如果正在掃描自己的棧則切換狀態到等待中防止死鎖
			if selfScan {
				casgstatus(userG, _Grunning, _Gwaiting)
				userG.waitreason = "garbage collection scan"
			}
			
			// 掃描G的棧
			// TODO: scang blocks until gp's stack has
			// been scanned, which may take a while for
			// running goroutines. Consider doing this in
			// two phases where the first is non-blocking:
			// we scan the stacks we can and ask running
			// goroutines to scan themselves; and the
			// second blocks.
			scang(gp, gcw)
			
			// 如果正在掃描自己的棧則把狀態切換回運行中
			if selfScan {
				casgstatus(userG, _Gwaiting, _Grunning)
			}
		})
	}
}

The scang function is responsible for scanning a G's stack:

// scang blocks until gp's stack has been scanned.
// It might be scanned by scang or it might be scanned by the goroutine itself.
// Either way, the stack scan has completed when scang returns.
func scang(gp *g, gcw *gcWork) {
	// Invariant; we (the caller, markroot for a specific goroutine) own gp.gcscandone.
	// Nothing is racing with us now, but gcscandone might be set to true left over
	// from an earlier round of stack scanning (we scan twice per GC).
	// We use gcscandone to record whether the scan has been done during this round.

	// 標記掃描未完成
	gp.gcscandone = false

	// See http://golang.org/cl/21503 for justification of the yield delay.
	const yieldDelay = 10 * 1000
	var nextYield int64

	// 循環直到掃描完成
	// Endeavor to get gcscandone set to true,
	// either by doing the stack scan ourselves or by coercing gp to scan itself.
	// gp.gcscandone can transition from false to true when we're not looking
	// (if we asked for preemption), so any time we lock the status using
	// castogscanstatus we have to double-check that the scan is still not done.
loop:
	for i := 0; !gp.gcscandone; i++ {
		// 判斷G的當前狀態
		switch s := readgstatus(gp); s {
		default:
			dumpgstatus(gp)
			throw("stopg: invalid status")

		// G已中止, 不需要掃描它
		case _Gdead:
			// No stack.
			gp.gcscandone = true
			break loop

		// G的棧正在擴展, 下一輪重試
		case _Gcopystack:
		// Stack being switched. Go around again.

		// G不是運行中, 首先需要防止它運行
		case _Grunnable, _Gsyscall, _Gwaiting:
			// Claim goroutine by setting scan bit.
			// Racing with execution or readying of gp.
			// The scan bit keeps them from running
			// the goroutine until we're done.
			if castogscanstatus(gp, s, s|_Gscan) {
				// 原子切換狀態成功時掃描它的棧
				if !gp.gcscandone {
					scanstack(gp, gcw)
					gp.gcscandone = true
				}
				// 恢復G的狀態, 並跳出循環
				restartg(gp)
				break loop
			}

		// G正在掃描它自己, 等待掃描完畢
		case _Gscanwaiting:
		// newstack is doing a scan for us right now. Wait.

		// G正在運行
		case _Grunning:
			// Goroutine running. Try to preempt execution so it can scan itself.
			// The preemption handler (in newstack) does the actual scan.

			// 如果已經有搶佔請求, 則搶佔成功時會幫我們處理
			// Optimization: if there is already a pending preemption request
			// (from the previous loop iteration), don't bother with the atomics.
			if gp.preemptscan && gp.preempt && gp.stackguard0 == stackPreempt {
				break
			}

			// 搶佔G, 搶佔成功時G會掃描它自己
			// Ask for preemption and self scan.
			if castogscanstatus(gp, _Grunning, _Gscanrunning) {
				if !gp.gcscandone {
					gp.preemptscan = true
					gp.preempt = true
					gp.stackguard0 = stackPreempt
				}
				casfrom_Gscanstatus(gp, _Gscanrunning, _Grunning)
			}
		}

		// 第一輪休眠10毫秒, 第二輪休眠5毫秒
		if i == 0 {
			nextYield = nanotime() + yieldDelay
		}
		if nanotime() < nextYield {
			procyield(10)
		} else {
			osyield()
			nextYield = nanotime() + yieldDelay/2
		}
	}

	// 掃描完成, 取消搶佔掃描的請求
	gp.preemptscan = false // cancel scan request if no longer needed
}

Once preemptscan is set, the G scans its own stack via scanstack when the preemption succeeds (the preemption handler in newstack does the actual scan).
The function used to scan a stack is scanstack:

// scanstack scans gp's stack, greying all pointers found on the stack.
//
// scanstack is marked go:systemstack because it must not be preempted
// while using a workbuf.
//
//go:nowritebarrier
//go:systemstack
func scanstack(gp *g, gcw *gcWork) {
	if gp.gcscanvalid {
		return
	}

	if readgstatus(gp)&_Gscan == 0 {
		print("runtime:scanstack: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", hex(readgstatus(gp)), "\n")
		throw("scanstack - bad status")
	}

	switch readgstatus(gp) &^ _Gscan {
	default:
		print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
		throw("mark - bad status")
	case _Gdead:
		return
	case _Grunning:
		print("runtime: gp=", gp, ", goid=", gp.goid, ", gp->atomicstatus=", readgstatus(gp), "\n")
		throw("scanstack: goroutine not stopped")
	case _Grunnable, _Gsyscall, _Gwaiting:
		// ok
	}

	if gp == getg() {
		throw("can't scan our own stack")
	}
	mp := gp.m
	if mp != nil && mp.helpgc != 0 {
		throw("can't scan gchelper stack")
	}

	// Shrink the stack if not much of it is being used. During
	// concurrent GC, we can do this during concurrent mark.
	if !work.markrootDone {
		shrinkstack(gp)
	}

	// Scan the stack.
	var cache pcvalueCache
	scanframe := func(frame *stkframe, unused unsafe.Pointer) bool {
		// scanframeworker會根據代碼地址(pc)獲取函數信息
		// 然後找到函數信息中的stackmap.bytedata, 它保存了函數的棧上哪些地方有指針
		// 再調用scanblock來掃描函數的棧空間, 同時函數的參數也會這樣掃描
		scanframeworker(frame, &cache, gcw)
		return true
	}
	// 枚舉所有調用幀, 分別調用scanframe函數
	gentraceback(^uintptr(0), ^uintptr(0), 0, gp, 0, nil, 0x7fffffff, scanframe, nil, 0)
	// 枚舉所有defer的調用幀, 分別調用scanframe函數
	tracebackdefers(gp, scanframe, nil)
	gp.gcscanvalid = true
}

The scanblock function is a general-purpose scanner used for both global variables and stack space; unlike scanobject, the pointer bitmap must be passed in explicitly:

// scanblock scans b as scanobject would, but using an explicit
// pointer bitmap instead of the heap bitmap.
//
// This is used to scan non-heap roots, so it does not update
// gcw.bytesMarked or gcw.scanWork.
//
//go:nowritebarrier
func scanblock(b0, n0 uintptr, ptrmask *uint8, gcw *gcWork) {
	// Use local copies of original parameters, so that a stack trace
	// due to one of the throws below shows the original block
	// base and extent.
	b := b0
	n := n0

	arena_start := mheap_.arena_start
	arena_used := mheap_.arena_used

	// 枚舉掃描的地址
	for i := uintptr(0); i < n; {
		// 找到bitmap中對應的byte
		// Find bits for the next word.
		bits := uint32(*addb(ptrmask, i/(sys.PtrSize*8)))
		if bits == 0 {
			i += sys.PtrSize * 8
			continue
		}
		// 枚舉byte
		for j := 0; j < 8 && i < n; j++ {
			// 如果該地址包含指針
			if bits&1 != 0 {
				// 標記在該地址的對象存活, 並把它加到標記隊列(該對象變爲灰色)
				// Same work as in scanobject; see comments there.
				obj := *(*uintptr)(unsafe.Pointer(b + i))
				if obj != 0 && arena_start <= obj && obj < arena_used {
					// 找到該對象對應的span和bitmap
					if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
						// 標記一個對象存活, 並把它加到標記隊列(該對象變爲灰色)
						greyobject(obj, b, i, hbits, span, gcw, objIndex)
					}
				}
			}
			// 處理下一個指針下一個bit
			bits >>= 1
			i += sys.PtrSize
		}
	}
}

greyobject marks an object as live and adds it to the mark queue (the object turns gray):

// obj is the start of an object with mark mbits.
// If it isn't already marked, mark it and enqueue into gcw.
// base and off are for debugging only and could be removed.
//go:nowritebarrierrec
func greyobject(obj, base, off uintptr, hbits heapBits, span *mspan, gcw *gcWork, objIndex uintptr) {
	// obj should be start of allocation, and so must be at least pointer-aligned.
	if obj&(sys.PtrSize-1) != 0 {
		throw("greyobject: obj not pointer-aligned")
	}
	mbits := span.markBitsForIndex(objIndex)

	if useCheckmark {
		// checkmark是用於檢查是否所有可到達的對象都被正確標記的機制, 僅除錯使用
		if !mbits.isMarked() {
			printlock()
			print("runtime:greyobject: checkmarks finds unexpected unmarked object obj=", hex(obj), "\n")
			print("runtime: found obj at *(", hex(base), "+", hex(off), ")\n")

			// Dump the source (base) object
			gcDumpObject("base", base, off)

			// Dump the object
			gcDumpObject("obj", obj, ^uintptr(0))

			getg().m.traceback = 2
			throw("checkmark found unmarked object")
		}
		if hbits.isCheckmarked(span.elemsize) {
			return
		}
		hbits.setCheckmarked(span.elemsize)
		if !hbits.isCheckmarked(span.elemsize) {
			throw("setCheckmarked and isCheckmarked disagree")
		}
	} else {
		if debug.gccheckmark > 0 && span.isFree(objIndex) {
			print("runtime: marking free object ", hex(obj), " found at *(", hex(base), "+", hex(off), ")\n")
			gcDumpObject("base", base, off)
			gcDumpObject("obj", obj, ^uintptr(0))
			getg().m.traceback = 2
			throw("marking free object")
		}

		// 如果對象所在的span中的gcmarkBits對應的bit已經設置爲1則可以跳過處理
		// If marked we have nothing to do.
		if mbits.isMarked() {
			return
		}
		
		// 設置對象所在的span中的gcmarkBits對應的bit爲1
		// mbits.setMarked() // Avoid extra call overhead with manual inlining.
		atomic.Or8(mbits.bytep, mbits.mask)
		
		// 如果確定對象不包含指針(所在span的類型是noscan), 則不需要把對象放入標記隊列
		// If this is a noscan object, fast-track it to black
		// instead of greying it.
		if span.spanclass.noscan() {
			gcw.bytesMarked += uint64(span.elemsize)
			return
		}
	}

	// 把對象放入標記隊列
	// 先放入本地標記隊列, 失敗時把本地標記隊列中的部分工作轉移到全局標記隊列, 再放入本地標記隊列
	// Queue the obj for scanning. The PREFETCH(obj) logic has been removed but
	// seems like a nice optimization that can be added back in.
	// There needs to be time between the PREFETCH and the use.
	// Previously we put the obj in an 8 element buffer that is drained at a rate
	// to give the PREFETCH time to do its work.
	// Use of PREFETCHNTA might be more appropriate than PREFETCH
	if !gcw.putFast(obj) {
		gcw.put(obj)
	}
}

Once gcDrain has finished scanning the root objects, it starts consuming the mark queue, calling scanobject on each object taken from the queue:

// scanobject scans the object starting at b, adding pointers to gcw.
// b must point to the beginning of a heap object or an oblet.
// scanobject consults the GC bitmap for the pointer mask and the
// spans for the size of the object.
//
//go:nowritebarrier
func scanobject(b uintptr, gcw *gcWork) {
	// Note that arena_used may change concurrently during
	// scanobject and hence scanobject may encounter a pointer to
	// a newly allocated heap object that is *not* in
	// [start,used). It will not mark this object; however, we
	// know that it was just installed by a mutator, which means
	// that mutator will execute a write barrier and take care of
	// marking it. This is even more pronounced on relaxed memory
	// architectures since we access arena_used without barriers
	// or synchronization, but the same logic applies.
	arena_start := mheap_.arena_start
	arena_used := mheap_.arena_used

	// Find the bits for b and the size of the object at b.
	//
	// b is either the beginning of an object, in which case this
	// is the size of the object to scan, or it points to an
	// oblet, in which case we compute the size to scan below.
	// 獲取對象對應的bitmap
	hbits := heapBitsForAddr(b)
	
	// 獲取對象所在的span
	s := spanOfUnchecked(b)
	
	// 獲取對象的大小
	n := s.elemsize
	if n == 0 {
		throw("scanobject n == 0")
	}

	// 對象大小過大時(maxObletBytes是128KB)需要分割掃描
	// 每次最多隻掃描128KB
	if n > maxObletBytes {
		// Large object. Break into oblets for better
		// parallelism and lower latency.
		if b == s.base() {
			// It's possible this is a noscan object (not
			// from greyobject, but from other code
			// paths), in which case we must *not* enqueue
			// oblets since their bitmaps will be
			// uninitialized.
			if s.spanclass.noscan() {
				// Bypass the whole scan.
				gcw.bytesMarked += uint64(n)
				return
			}

			// Enqueue the other oblets to scan later.
			// Some oblets may be in b's scalar tail, but
			// these will be marked as "no more pointers",
			// so we'll drop out immediately when we go to
			// scan those.
			for oblet := b + maxObletBytes; oblet < s.base()+s.elemsize; oblet += maxObletBytes {
				if !gcw.putFast(oblet) {
					gcw.put(oblet)
				}
			}
		}

		// Compute the size of the oblet. Since this object
		// must be a large object, s.base() is the beginning
		// of the object.
		n = s.base() + s.elemsize - b
		if n > maxObletBytes {
			n = maxObletBytes
		}
	}

	// 掃描對象中的指針
	var i uintptr
	for i = 0; i < n; i += sys.PtrSize {
		// 獲取對應的bit
		// Find bits for this word.
		if i != 0 {
			// Avoid needless hbits.next() on last iteration.
			hbits = hbits.next()
		}
		// Load bits once. See CL 22712 and issue 16973 for discussion.
		bits := hbits.bits()
		
		// Check the scan bit to decide whether to keep scanning; note the second word's scan bit is the checkmark
		// During checkmarking, 1-word objects store the checkmark
		// in the type bit for the one word. The only one-word objects
		// are pointers, or else they'd be merged with other non-pointer
		// data into larger allocations.
		if i != 1*sys.PtrSize && bits&bitScan == 0 {
			break // no more pointers in this object
		}
		
		// Check the pointer bit; if this word is not a pointer, continue
		if bits&bitPointer == 0 {
			continue // not a pointer
		}

		// Load the pointer value
		// Work here is duplicated in scanblock and above.
		// If you make changes here, make changes there too.
		obj := *(*uintptr)(unsafe.Pointer(b + i))

		// If the pointer falls inside the arena, call greyobject to mark the object and put it on the mark queue
		// At this point we have extracted the next potential pointer.
		// Check if it points into heap and not back at the current object.
		if obj != 0 && arena_start <= obj && obj < arena_used && obj-b >= n {
			// Mark the object.
			if obj, hbits, span, objIndex := heapBitsForObject(obj, b, i); obj != 0 {
				greyobject(obj, b, i, hbits, span, gcw, objIndex)
			}
		}
	}
	
	// Record the number of bytes marked and the amount of scan work done
	gcw.bytesMarked += uint64(n)
	gcw.scanWork += int64(i)
}
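scanobject splits any object larger than maxObletBytes into 128KB "oblets" so that one huge object cannot monopolize a mark worker. The chunking arithmetic can be illustrated with a tiny standalone program; splitIntoOblets and its list of chunks are illustrative, not runtime code:

package main

import "fmt"

const maxOblet = 128 << 10 // 128KB, mirroring maxObletBytes

// splitIntoOblets returns the (start, length) chunks a scanner would
// process for an object of size `size` starting at `base`. The first
// chunk is scanned immediately; the rest would be queued for later.
func splitIntoOblets(base, size uintptr) [][2]uintptr {
	var chunks [][2]uintptr
	for off := uintptr(0); off < size; off += maxOblet {
		n := size - off
		if n > maxOblet {
			n = maxOblet
		}
		chunks = append(chunks, [2]uintptr{base + off, n})
	}
	return chunks
}

func main() {
	// A 300KB object is scanned as 128KB + 128KB + 44KB oblets.
	for _, c := range splitIntoOblets(0x1000, 300<<10) {
		fmt.Printf("scan %#x..%#x (%d bytes)\n", c[0], c[0]+c[1], c[1])
	}
}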

When all background mark workers have drained the mark queue, gcMarkDone is called to prepare the transition into mark termination.
In a concurrent GC cycle gcMarkDone runs twice: the first call disables the per-P (local) mark queues and restarts the background mark workers, and the second call moves the GC into mark termination.

// gcMarkDone transitions the GC from mark 1 to mark 2 and from mark 2
// to mark termination.
//
// This should be called when all mark work has been drained. In mark
// 1, this includes all root marking jobs, global work buffers, and
// active work buffers in assists and background workers; however,
// work may still be cached in per-P work buffers. In mark 2, per-P
// caches are disabled.
//
// The calling context must be preemptible.
//
// Note that it is explicitly okay to have write barriers in this
// function because completion of concurrent mark is best-effort
// anyway. Any work created by write barriers here will be cleaned up
// by mark termination.
func gcMarkDone() {
top:
	semacquire(&work.markDoneSema)

	// Re-check transition condition under transition lock.
	if !(gcphase == _GCmark && work.nwait == work.nproc && !gcMarkWorkAvailable(nil)) {
		semrelease(&work.markDoneSema)
		return
	}

	// Temporarily disallow starting new background mark workers
	// Disallow starting new workers so that any remaining workers
	// in the current mark phase will drain out.
	//
	// TODO(austin): Should dedicated workers keep an eye on this
	// and exit gcDrain promptly?
	atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, -0xffffffff)
	atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, -0xffffffff)

	// Check whether the local mark queues have already been disabled
	if !gcBlackenPromptly {
		// The local mark queues are still enabled: disable them, then restart the background mark workers
		// Transition from mark 1 to mark 2.
		//
		// The global work list is empty, but there can still be work
		// sitting in the per-P work caches.
		// Flush and disable work caches.

		// Disable the local mark queues
		// Disallow caching workbufs and indicate that we're in mark 2.
		gcBlackenPromptly = true

		// Prevent completion of mark 2 until we've flushed
		// cached workbufs.
		atomic.Xadd(&work.nwait, -1)

		// GC is set up for mark 2. Let Gs blocked on the
		// transition lock go while we flush caches.
		semrelease(&work.markDoneSema)

		// Flush every object cached in the local mark queues to the global mark queue
		systemstack(func() {
			// Flush all currently cached workbufs and
			// ensure all Ps see gcBlackenPromptly. This
			// also blocks until any remaining mark 1
			// workers have exited their loop so we can
			// start new mark 2 workers.
			forEachP(func(_p_ *p) {
				_p_.gcw.dispose()
			})
		})

		// For debugging
		// Check that roots are marked. We should be able to
		// do this before the forEachP, but based on issue
		// #16083 there may be a (harmless) race where we can
		// enter mark 2 while some workers are still scanning
		// stacks. The forEachP ensures these scans are done.
		//
		// TODO(austin): Figure out the race and fix this
		// properly.
		gcMarkRootCheck()

		// Allow starting new background mark workers again
		// Now we can start up mark 2 workers.
		atomic.Xaddint64(&gcController.dedicatedMarkWorkersNeeded, 0xffffffff)
		atomic.Xaddint64(&gcController.fractionalMarkWorkersNeeded, 0xffffffff)

		// If it is certain there is no more mark work, jump straight back to the top of the function,
		// which then behaves as the second call
		incnwait := atomic.Xadd(&work.nwait, +1)
		if incnwait == work.nproc && !gcMarkWorkAvailable(nil) {
			// This loop will make progress because
			// gcBlackenPromptly is now true, so it won't
			// take this same "if" branch.
			goto top
		}
	} else {
		// Record the start time of mark termination and of the STW pause
		// Transition to mark termination.
		now := nanotime()
		work.tMarkTerm = now
		work.pauseStart = now
		
		// Prevent this G from being preempted
		getg().m.preemptoff = "gcing"
		
		// Stop all running Gs and keep them from running (stop the world)
		systemstack(stopTheWorldWithSema)
		
		// !!!!!!!!!!!!!!!!
		// The world is stopped (STW)...
		// !!!!!!!!!!!!!!!!
		
		// The gcphase is _GCmark, it will transition to _GCmarktermination
		// below. The important thing is that the wb remains active until
		// all marking is complete. This includes writes made by the GC.
		
		// Record that root scanning is complete; this affects the handling in gcMarkRootPrepare
		// Record that one root marking pass has completed.
		work.markrootDone = true
		
		// Disable mutator assists and background mark workers
		// Disable assists and background workers. We must do
		// this before waking blocked assists.
		atomic.Store(&gcBlackenEnabled, 0)
		
		// Wake all Gs that are blocked on GC assists
		// Wake all blocked assists. These will run when we
		// start the world again.
		gcWakeAllAssists()
		
		// Likewise, release the transition lock. Blocked
		// workers and assists will run when we start the
		// world again.
		semrelease(&work.markDoneSema)
		
		// Compute the heap size that will trigger the next GC cycle
		// endCycle depends on all gcWork cache stats being
		// flushed. This is ensured by mark 2.
		nextTriggerRatio := gcController.endCycle()
		
		// Enter mark termination; this will restart the world
		// Perform mark termination. This will restart the world.
		gcMarkTermination(nextTriggerRatio)
	}
}
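The transition logic above hinges on an atomic worker count: only when the last worker goes idle (work.nwait reaches work.nproc) and no mark work remains does gcMarkDone proceed. A minimal sketch of that completion-detection pattern, assuming a fixed worker count and a toy work counter (the names here are illustrative, not the runtime's):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

const nproc = 4

var (
	nwait    int32 // workers currently idle, mirroring work.nwait
	workLeft int32 = 100
)

// markWorker drains work; the worker that goes idle last and sees no
// remaining work is the one that triggers the "mark done" transition.
func markWorker(id int, done *sync.Once, wg *sync.WaitGroup) {
	defer wg.Done()
	for atomic.AddInt32(&workLeft, -1) >= 0 {
		// ... scan one object ...
	}
	if atomic.AddInt32(&nwait, 1) == nproc && atomic.LoadInt32(&workLeft) < 0 {
		done.Do(func() { fmt.Printf("worker %d: all mark work drained, transition\n", id) })
	}
}

func main() {
	var wg sync.WaitGroup
	var done sync.Once
	for i := 0; i < nproc; i++ {
		wg.Add(1)
		go markWorker(i, &done, &wg)
	}
	wg.Wait()
}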

gcMarkTermination performs the mark termination phase:

func gcMarkTermination(nextTriggerRatio float64) {
	// World is stopped.
	// Start marktermination which includes enabling the write barrier.
	// Disable mutator assists and background mark workers
	atomic.Store(&gcBlackenEnabled, 0)
	
	// Re-enable the local mark queues (for the next GC cycle)
	gcBlackenPromptly = false
	
	// Move the GC phase to mark termination and enable the write barrier
	setGCPhase(_GCmarktermination)

	// Record the start time
	work.heap1 = memstats.heap_live
	startTime := nanotime()

	// Prevent the current G from being preempted
	mp := acquirem()
	mp.preemptoff = "gcing"
	_g_ := getg()
	_g_.m.traceback = 2
	
	// Set the G's status to waiting so that its stack can be scanned
	gp := _g_.m.curg
	casgstatus(gp, _Grunning, _Gwaiting)
	gp.waitreason = "garbage collection"

	// Switch to the g0 stack
	// Run gc on the g0 stack. We do this so that the g stack
	// we're currently running on will no longer change. Cuts
	// the root set down a bit (g0 stacks are not scanned, and
	// we don't need to scan gc's internal state).  We also
	// need to switch to g0 so we can shrink the stack.
	systemstack(func() {
		// Perform the mark work done under STW
		gcMark(startTime)
		
		// Must return immediately: the outer G's stack may have been moved, so its variables must not be touched after this point
		// Must return immediately.
		// The outer function's stack may have moved
		// during gcMark (it shrinks stacks, including the
		// outer function's stack), so we must not refer
		// to any of its variables. Return back to the
		// non-system stack to pick up the new addresses
		// before continuing.
	})

	// Switch to the g0 stack again
	systemstack(func() {
		work.heap2 = work.bytesMarked
		
		// If checkmark is enabled, verify that every reachable object was marked
		if debug.gccheckmark > 0 {
			// Run a full stop-the-world mark using checkmark bits,
			// to check that we didn't forget to mark anything during
			// the concurrent mark process.
			gcResetMarkState()
			initCheckmarks()
			gcMark(startTime)
			clearCheckmarks()
		}

		// Move the GC phase to off and disable the write barrier
		// marking is complete so we can turn the write barrier off
		setGCPhase(_GCoff)
		
		// Wake the background sweeper; it starts running once the STW pause ends
		gcSweep(work.mode)

		// For debugging
		if debug.gctrace > 1 {
			startTime = nanotime()
			// The g stacks have been scanned so
			// they have gcscanvalid==true and gcworkdone==true.
			// Reset these so that all stacks will be rescanned.
			gcResetMarkState()
			finishsweep_m()

			// Still in STW but gcphase is _GCoff, reset to _GCmarktermination
			// At this point all objects will be found during the gcMark which
			// does a complete STW mark and object scan.
			setGCPhase(_GCmarktermination)
			gcMark(startTime)
			setGCPhase(_GCoff) // marking is done, turn off wb.
			gcSweep(work.mode)
		}
	})

	// Set the G's status back to running
	_g_.m.traceback = 0
	casgstatus(gp, _Gwaiting, _Grunning)

	// Tracing
	if trace.enabled {
		traceGCDone()
	}

	// all done
	mp.preemptoff = ""

	if gcphase != _GCoff {
		throw("gc done but gcphase != _GCoff")
	}

	// Update the heap size that triggers the next GC (gc_trigger)
	// Update GC trigger and pacing for the next cycle.
	gcSetTriggerRatio(nextTriggerRatio)

	// Update the timing records
	// Update timing memstats
	now := nanotime()
	sec, nsec, _ := time_now()
	unixNow := sec*1e9 + int64(nsec)
	work.pauseNS += now - work.pauseStart
	work.tEnd = now
	atomic.Store64(&memstats.last_gc_unix, uint64(unixNow)) // must be Unix time to make sense to user
	atomic.Store64(&memstats.last_gc_nanotime, uint64(now)) // monotonic time for us
	memstats.pause_ns[memstats.numgc%uint32(len(memstats.pause_ns))] = uint64(work.pauseNS)
	memstats.pause_end[memstats.numgc%uint32(len(memstats.pause_end))] = uint64(unixNow)
	memstats.pause_total_ns += uint64(work.pauseNS)

	// Update the CPU usage records
	// Update work.totaltime.
	sweepTermCpu := int64(work.stwprocs) * (work.tMark - work.tSweepTerm)
	// We report idle marking time below, but omit it from the
	// overall utilization here since it's "free".
	markCpu := gcController.assistTime + gcController.dedicatedMarkTime + gcController.fractionalMarkTime
	markTermCpu := int64(work.stwprocs) * (work.tEnd - work.tMarkTerm)
	cycleCpu := sweepTermCpu + markCpu + markTermCpu
	work.totaltime += cycleCpu

	// Compute overall GC CPU utilization.
	totalCpu := sched.totaltime + (now-sched.procresizetime)*int64(gomaxprocs)
	memstats.gc_cpu_fraction = float64(work.totaltime) / float64(totalCpu)

	// Reset sweep state.
	sweep.nbgsweep = 0
	sweep.npausesweep = 0

	// Count forced GC cycles
	if work.userForced {
		memstats.numforcedgc++
	}

	// Bump the GC cycle count, then wake the Gs waiting for the sweep
	// Bump GC cycle count and wake goroutines waiting on sweep.
	lock(&work.sweepWaiters.lock)
	memstats.numgc++
	injectglist(work.sweepWaiters.head.ptr())
	work.sweepWaiters.head = 0
	unlock(&work.sweepWaiters.lock)

	// For heap profiling
	// Finish the current heap profiling cycle and start a new
	// heap profiling cycle. We do this before starting the world
	// so events don't leak into the wrong cycle.
	mProf_NextCycle()

	// Restart the world
	systemstack(startTheWorldWithSema)

	// !!!!!!!!!!!!!!!
	// The world has been restarted...
	// !!!!!!!!!!!!!!!

	// For heap profiling
	// Flush the heap profile so we can start a new cycle next GC.
	// This is relatively expensive, so we don't do it with the
	// world stopped.
	mProf_Flush()

	// Move the work buffers used by the mark queue to the free list so they can be reclaimed
	// Prepare workbufs for freeing by the sweeper. We do this
	// asynchronously because it can take non-trivial time.
	prepareFreeWorkbufs()

	// Free unused stack spans
	// Free stack spans. This must be done between GC cycles.
	systemstack(freeStackSpans)

	// For debugging (gctrace output)
	// Print gctrace before dropping worldsema. As soon as we drop
	// worldsema another cycle could start and smash the stats
	// we're trying to print.
	if debug.gctrace > 0 {
		util := int(memstats.gc_cpu_fraction * 100)

		var sbuf [24]byte
		printlock()
		print("gc ", memstats.numgc,
			" @", string(itoaDiv(sbuf[:], uint64(work.tSweepTerm-runtimeInitTime)/1e6, 3)), "s ",
			util, "%: ")
		prev := work.tSweepTerm
		for i, ns := range []int64{work.tMark, work.tMarkTerm, work.tEnd} {
			if i != 0 {
				print("+")
			}
			print(string(fmtNSAsMS(sbuf[:], uint64(ns-prev))))
			prev = ns
		}
		print(" ms clock, ")
		for i, ns := range []int64{sweepTermCpu, gcController.assistTime, gcController.dedicatedMarkTime + gcController.fractionalMarkTime, gcController.idleMarkTime, markTermCpu} {
			if i == 2 || i == 3 {
				// Separate mark time components with /.
				print("/")
			} else if i != 0 {
				print("+")
			}
			print(string(fmtNSAsMS(sbuf[:], uint64(ns))))
		}
		print(" ms cpu, ",
			work.heap0>>20, "->", work.heap1>>20, "->", work.heap2>>20, " MB, ",
			work.heapGoal>>20, " MB goal, ",
			work.maxprocs, " P")
		if work.userForced {
			print(" (forced)")
		}
		print("\n")
		printunlock()
	}

	semrelease(&worldsema)
	// Careful: another GC cycle may start now.

	// Allow the current G to be preempted again
	releasem(mp)
	mp = nil

	// With concurrent sweep, the current M keeps running (it returns to gcBgMarkWorker and then sleeps)
	// Without concurrent sweep, the current M yields to the scheduler
	// now that gc is done, kick off finalizer thread if needed
	if !concurrentSweep {
		// give the queued finalizers, if any, a chance to run
		Gosched()
	}
}
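The bookkeeping above (pause_ns, last_gc, gc_cpu_fraction and the gctrace line) is exactly what the public runtime API exposes. A small program that forces a cycle and reads those statistics back; only standard runtime APIs are used, and the PauseNs indexing follows the runtime.MemStats documentation:

package main

import (
	"fmt"
	"runtime"
	"time"
)

func main() {
	// Allocate something so the cycle has work to do.
	data := make([][]byte, 0, 1024)
	for i := 0; i < 1024; i++ {
		data = append(data, make([]byte, 64<<10))
	}

	runtime.GC() // force a full cycle (gcTriggerCycle)

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	fmt.Println("completed GC cycles:", ms.NumGC)
	fmt.Println("last GC at:", time.Unix(0, int64(ms.LastGC)))
	fmt.Println("last pause:", time.Duration(ms.PauseNs[(ms.NumGC+255)%256]))
	fmt.Println("total pause:", time.Duration(ms.PauseTotalNs))
	fmt.Println("GC CPU fraction:", ms.GCCPUFraction)
	fmt.Println("next GC target (heap goal):", ms.NextGC, "bytes")

	runtime.KeepAlive(data)
}

Running the same program with GODEBUG=gctrace=1 prints one line per cycle in the format assembled at the end of gcMarkTermination above.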

gcSweep wakes up the background sweep task:

(The background sweep goroutine itself is started by the gcenable function, which is called during program startup.)

func gcSweep(mode gcMode) {
	if gcphase != _GCoff {
		throw("gcSweep being done but phase is not GCoff")
	}

	// Increment sweepgen by 2 so the two sweepSpans queues swap roles and every span becomes a "to be swept" span
	lock(&mheap_.lock)
	mheap_.sweepgen += 2
	mheap_.sweepdone = 0
	if mheap_.sweepSpans[mheap_.sweepgen/2%2].index != 0 {
		// We should have drained this list during the last
		// sweep phase. We certainly need to start this phase
		// with an empty swept list.
		throw("non-empty swept list")
	}
	mheap_.pagesSwept = 0
	unlock(&mheap_.lock)

	// For a non-concurrent GC, do all of the sweeping here (still inside STW)
	if !_ConcurrentSweep || mode == gcForceBlockMode {
		// Special case synchronous sweep.
		// Record that no proportional sweeping has to happen.
		lock(&mheap_.lock)
		mheap_.sweepPagesPerByte = 0
		unlock(&mheap_.lock)
		// Sweep all spans eagerly.
		for sweepone() != ^uintptr(0) {
			sweep.npausesweep++
		}
		// Free workbufs eagerly.
		prepareFreeWorkbufs()
		for freeSomeWbufs(false) {
		}
		// All "free" events for this mark/sweep cycle have
		// now happened, so we can make this profile cycle
		// available immediately.
		mProf_NextCycle()
		mProf_Flush()
		return
	}

	// Wake the background sweeper
	// Background sweep.
	lock(&sweep.lock)
	if sweep.parked {
		sweep.parked = false
		ready(sweep.g, 0, true)
	}
	unlock(&sweep.lock)
}
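The sweepgen += 2 step is what makes the two sweepSpans queues swap roles each cycle: the queue indexed by sweepgen/2%2 holds swept spans and the other one holds unswept spans. A toy illustration of that indexing (the span names and string queues are made up for the example):

package main

import "fmt"

func main() {
	// Two queues of spans; which one is "swept" flips every cycle.
	var queues [2][]string
	queues[0] = []string{"span A", "span B"} // swept in the previous cycle

	for cycle, sweepgen := 1, uint32(2); cycle <= 3; cycle, sweepgen = cycle+1, sweepgen+2 {
		swept := sweepgen / 2 % 2 // index of the "swept" queue for this generation
		unswept := 1 - swept      // the other queue holds spans still to sweep
		fmt.Printf("cycle %d: sweepgen=%d unswept queue=%d swept queue=%d, to sweep: %v\n",
			cycle, sweepgen, unswept, swept, queues[unswept])
		// Sweeping pops from the unswept queue and pushes onto the swept queue.
		queues[swept] = append(queues[swept], queues[unswept]...)
		queues[unswept] = nil
	}
}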

The function for the background sweep task is bgsweep:

func bgsweep(c chan int) {
	sweep.g = getg()

	// Park and wait to be woken
	lock(&sweep.lock)
	sweep.parked = true
	c <- 1
	goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)

	// Sweep loop
	for {
		// Sweep one span, then yield to the scheduler (do only a little work at a time)
		for gosweepone() != ^uintptr(0) {
			sweep.nbgsweep++
			Gosched()
		}
		// Free some unused mark queue work buffers back to the heap
		for freeSomeWbufs(true) {
			Gosched()
		}
		// If sweeping is not finished yet, keep looping
		lock(&sweep.lock)
		if !gosweepdone() {
			// This can happen if a GC runs between
			// gosweepone returning ^0 above
			// and the lock being acquired.
			unlock(&sweep.lock)
			continue
		}
		// Otherwise park the background sweeper; the current M goes back to the scheduler
		sweep.parked = true
		goparkunlock(&sweep.lock, "GC sweep wait", traceEvGoBlock, 1)
	}
}
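bgsweep is a long-lived goroutine that does one small unit of work, yields with Gosched, and parks itself when nothing is left. The same pattern can be sketched in user code with a channel standing in for goparkunlock/ready, which are not available outside the runtime (pending, sweepOne and the channels below are illustrative):

package main

import (
	"fmt"
	"runtime"
	"sync/atomic"
)

var pending int32 = 20 // spans waiting to be swept (toy counter)

// sweepOne sweeps one span; it returns false when nothing is left,
// mirroring sweepone returning ^uintptr(0).
func sweepOne() bool {
	return atomic.AddInt32(&pending, -1) >= 0
}

func backgroundSweeper(wake <-chan struct{}, done chan<- struct{}) {
	for range wake { // "park" until woken at the end of each mark termination
		for sweepOne() {
			runtime.Gosched() // do a little work, then let other goroutines run
		}
		done <- struct{}{}
	}
}

func main() {
	wake := make(chan struct{})
	done := make(chan struct{})
	go backgroundSweeper(wake, done)

	wake <- struct{}{} // corresponds to gcSweep's ready(sweep.g, ...)
	<-done
	fmt.Println("background sweep finished")
}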

gosweepone takes a single span out of sweepSpans and sweeps it:

//go:nowritebarrier
func gosweepone() uintptr {
	var ret uintptr
	// Run on the g0 stack
	systemstack(func() {
		ret = sweepone()
	})
	return ret
}

sweepone is as follows:

// sweeps one span
// returns number of pages returned to heap, or ^uintptr(0) if there is nothing to sweep
//go:nowritebarrier
func sweepone() uintptr {
	_g_ := getg()
	sweepRatio := mheap_.sweepPagesPerByte // For debugging

	// Prevent the G from being preempted
	// increment locks to ensure that the goroutine is not preempted
	// in the middle of sweep thus leaving the span in an inconsistent state for next GC
	_g_.m.locks++
	
	// Check whether sweeping is already complete
	if atomic.Load(&mheap_.sweepdone) != 0 {
		_g_.m.locks--
		return ^uintptr(0)
	}
	
	// Bump the count of concurrently running sweepers
	atomic.Xadd(&mheap_.sweepers, +1)

	npages := ^uintptr(0)
	sg := mheap_.sweepgen
	for {
		// Pop one span from the unswept sweepSpans queue
		s := mheap_.sweepSpans[1-sg/2%2].pop()
		// All spans have been swept; leave the loop
		if s == nil {
			atomic.Store(&mheap_.sweepdone, 1)
			break
		}
		// Skip this span if another M has already swept it
		if s.state != mSpanInUse {
			// This can happen if direct sweeping already
			// swept this span, but in that case the sweep
			// generation should always be up-to-date.
			if s.sweepgen != sg {
				print("runtime: bad span s.state=", s.state, " s.sweepgen=", s.sweepgen, " sweepgen=", sg, "\n")
				throw("non in-use span in unswept list")
			}
			continue
		}
		// Atomically advance the span's sweepgen; failure means another M has started sweeping this span, so skip it
		if s.sweepgen != sg-2 || !atomic.Cas(&s.sweepgen, sg-2, sg-1) {
			continue
		}
		// Sweep this span, then leave the loop
		npages = s.npages
		if !s.sweep(false) {
			// Span is still in-use, so this returned no
			// pages to the heap and the span needs to
			// move to the swept in-use list.
			npages = 0
		}
		break
	}

	// Decrement the count of concurrently running sweepers
	// Decrement the number of active sweepers and if this is the
	// last one print trace information.
	if atomic.Xadd(&mheap_.sweepers, -1) == 0 && atomic.Load(&mheap_.sweepdone) != 0 {
		if debug.gcpacertrace > 0 {
			print("pacer: sweep done at heap size ", memstats.heap_live>>20, "MB; allocated ", (memstats.heap_live-mheap_.sweepHeapLiveBasis)>>20, "MB during sweep; swept ", mheap_.pagesSwept, " pages at ", sweepRatio, " pages/byte\n")
		}
	}
	// Allow the G to be preempted again
	_g_.m.locks--
	// Return the number of pages swept
	return npages
}
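Several Ps can sweep concurrently, so ownership of a span is claimed with a single CAS on its sweepgen (sg-2 → sg-1); a goroutine that loses the race simply moves on to the next span. A standalone sketch of that claiming pattern (the span struct and generation values here are illustrative):

package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type span struct {
	id       int
	sweepgen uint32 // sg-2: needs sweeping, sg-1: being swept, sg: swept
}

func main() {
	const sg = 4 // current heap sweep generation
	spans := make([]*span, 8)
	for i := range spans {
		spans[i] = &span{id: i, sweepgen: sg - 2}
	}

	var wg sync.WaitGroup
	for worker := 0; worker < 3; worker++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for _, s := range spans {
				// Only the goroutine that wins this CAS sweeps the span.
				if atomic.CompareAndSwapUint32(&s.sweepgen, sg-2, sg-1) {
					fmt.Printf("worker %d sweeps span %d\n", w, s.id)
					atomic.StoreUint32(&s.sweepgen, sg) // mark it swept
				}
			}
		}(worker)
	}
	wg.Wait()
}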

A span's sweep method is used to sweep a single span:

// Sweep frees or collects finalizers for blocks not marked in the mark phase.
// It clears the mark bits in preparation for the next GC round.
// Returns true if the span was returned to heap.
// If preserve=true, don't return it to heap nor relink in MCentral lists;
// caller takes care of it.
//TODO go:nowritebarrier
func (s *mspan) sweep(preserve bool) bool {
	// It's critical that we enter this function with preemption disabled,
	// GC must not start while we are in the middle of this function.
	_g_ := getg()
	if _g_.m.locks == 0 && _g_.m.mallocing == 0 && _g_ != _g_.m.g0 {
		throw("MSpan_Sweep: m is not locked")
	}
	sweepgen := mheap_.sweepgen
	if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
		print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
		throw("MSpan_Sweep: bad span state")
	}

	if trace.enabled {
		traceGCSweepSpan(s.npages * _PageSize)
	}

	// Account for the pages swept
	atomic.Xadd64(&mheap_.pagesSwept, int64(s.npages))

	spc := s.spanclass
	size := s.elemsize
	res := false

	c := _g_.m.mcache
	freeToHeap := false

	// The allocBits indicate which unmarked objects don't need to be
	// processed since they were free at the end of the last GC cycle
	// and were not allocated since then.
	// If the allocBits index is >= s.freeindex and the bit
	// is not marked then the object remains unallocated
	// since the last GC.
	// This situation is analogous to being on a freelist.

	// Handle finalizers in the special records: if the object is no longer marked, mark it live to keep it from being freed, and queue its finalizer to run
	// Unlink & free special records for any objects we're about to free.
	// Two complications here:
	// 1. An object can have both finalizer and profile special records.
	//    In such case we need to queue finalizer for execution,
	//    mark the object as live and preserve the profile special.
	// 2. A tiny object can have several finalizers setup for different offsets.
	//    If such object is not marked, we need to queue all finalizers at once.
	// Both 1 and 2 are possible at the same time.
	specialp := &s.specials
	special := *specialp
	for special != nil {
		// A finalizer can be set for an inner byte of an object, find object beginning.
		objIndex := uintptr(special.offset) / size
		p := s.base() + objIndex*size
		mbits := s.markBitsForIndex(objIndex)
		if !mbits.isMarked() {
			// This object is not marked and has at least one special record.
			// Pass 1: see if it has at least one finalizer.
			hasFin := false
			endOffset := p - s.base() + size
			for tmp := special; tmp != nil && uintptr(tmp.offset) < endOffset; tmp = tmp.next {
				if tmp.kind == _KindSpecialFinalizer {
					// Stop freeing of object if it has a finalizer.
					mbits.setMarkedNonAtomic()
					hasFin = true
					break
				}
			}
			// Pass 2: queue all finalizers _or_ handle profile record.
			for special != nil && uintptr(special.offset) < endOffset {
				// Find the exact byte for which the special was setup
				// (as opposed to object beginning).
				p := s.base() + uintptr(special.offset)
				if special.kind == _KindSpecialFinalizer || !hasFin {
					// Splice out special record.
					y := special
					special = special.next
					*specialp = special
					freespecial(y, unsafe.Pointer(p), size)
				} else {
					// This is profile record, but the object has finalizers (so kept alive).
					// Keep special record.
					specialp = &special.next
					special = *specialp
				}
			}
		} else {
			// object is still live: keep special record
			specialp = &special.next
			special = *specialp
		}
	}

	// For debugging
	if debug.allocfreetrace != 0 || raceenabled || msanenabled {
		// Find all newly freed objects. This doesn't have to
		// be efficient; allocfreetrace has massive overhead.
		mbits := s.markBitsForBase()
		abits := s.allocBitsForIndex(0)
		for i := uintptr(0); i < s.nelems; i++ {
			if !mbits.isMarked() && (abits.index < s.freeindex || abits.isMarked()) {
				x := s.base() + i*s.elemsize
				if debug.allocfreetrace != 0 {
					tracefree(unsafe.Pointer(x), size)
				}
				if raceenabled {
					racefree(unsafe.Pointer(x), size)
				}
				if msanenabled {
					msanfree(unsafe.Pointer(x), size)
				}
			}
			mbits.advance()
			abits.advance()
		}
	}

	// Count how many objects were freed
	// Count the number of free objects in this span.
	nalloc := uint16(s.countAlloc())
	if spc.sizeclass() == 0 && nalloc == 0 {
		// If the span's size class is 0 (large object) and the object is no longer live, release the span back to the heap
		s.needzero = 1
		freeToHeap = true
	}
	nfreed := s.allocCount - nalloc
	if nalloc > s.allocCount {
		print("runtime: nelems=", s.nelems, " nalloc=", nalloc, " previous allocCount=", s.allocCount, " nfreed=", nfreed, "\n")
		throw("sweep increased allocation count")
	}

	// Set the new allocCount
	s.allocCount = nalloc

	// Check whether the span had no unallocated objects left (was full)
	wasempty := s.nextFreeIndex() == s.nelems

	// Reset freeindex so the next allocation search starts at 0
	s.freeindex = 0 // reset allocation index to start of span.
	if trace.enabled {
		getg().m.p.ptr().traceReclaimed += uintptr(nfreed) * s.elemsize
	}

	// gcmarkBits becomes the new allocBits,
	// then a fresh, all-zero gcmarkBits is allocated;
	// the next allocation can tell from allocBits which slots are still free
	// gcmarkBits becomes the allocBits.
	// get a fresh cleared gcmarkBits in preparation for next GC
	s.allocBits = s.gcmarkBits
	s.gcmarkBits = newMarkBits(s.nelems)

	// Refill the allocCache starting from freeindex
	// Initialize alloc bits cache.
	s.refillAllocCache(0)

	// If the span is being freed to the heap or no objects were freed, update its sweepgen to the latest generation here
	// (below, the span is handed to mcentral or returned to mheap)
	// We need to set s.sweepgen = h.sweepgen only when all blocks are swept,
	// because of the potential for a concurrent free/SetFinalizer.
	// But we need to set it before we make the span available for allocation
	// (return it to heap or mcentral), because allocation code assumes that a
	// span is already swept if available for allocation.
	if freeToHeap || nfreed == 0 {
		// The span must be in our exclusive ownership until we update sweepgen,
		// check for potential races.
		if s.state != mSpanInUse || s.sweepgen != sweepgen-1 {
			print("MSpan_Sweep: state=", s.state, " sweepgen=", s.sweepgen, " mheap.sweepgen=", sweepgen, "\n")
			throw("MSpan_Sweep: bad span state after sweep")
		}
		// Serialization point.
		// At this point the mark bits are cleared and allocation ready
		// to go so release the span.
		atomic.Store(&s.sweepgen, sweepgen)
	}

	if nfreed > 0 && spc.sizeclass() != 0 {
		// Hand the span back to mcentral; res records whether it was accepted
		c.local_nsmallfree[spc.sizeclass()] += uintptr(nfreed)
		res = mheap_.central[spc].mcentral.freeSpan(s, preserve, wasempty)
		// freeSpan will update sweepgen
		// MCentral_FreeSpan updates sweepgen
	} else if freeToHeap {
		// Release the large span back to mheap
		// Free large span to heap

		// NOTE(rsc,dvyukov): The original implementation of efence
		// in CL 22060046 used SysFree instead of SysFault, so that
		// the operating system would eventually give the memory
		// back to us again, so that an efence program could run
		// longer without running out of memory. Unfortunately,
		// calling SysFree here without any kind of adjustment of the
		// heap data structures means that when the memory does
		// come back to us, we have the wrong metadata for it, either in
		// the MSpan structures or in the garbage collection bitmap.
		// Using SysFault here means that the program will run out of
		// memory fairly quickly in efence mode, but at least it won't
		// have mysterious crashes due to confused memory reuse.
		// It should be possible to switch back to SysFree if we also
		// implement and then call some kind of MHeap_DeleteSpan.
		if debug.efence > 0 {
			s.limit = 0 // prevent mlookup from finding this span
			sysFault(unsafe.Pointer(s.base()), size)
		} else {
			mheap_.freeSpan(s, 1)
		}
		c.local_nlargefree++
		c.local_largefree += size
		res = true
	}
	
	// If the span was neither returned to mcentral nor freed to mheap, it is still in use
	if !res {
		// Push the still-in-use span onto the "swept" sweepSpans queue
		// The span has been swept and is still in-use, so put
		// it on the swept in-use list.
		mheap_.sweepSpans[sweepgen/2%2].push(s)
	}
	return res
}
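The special-record handling at the top of sweep is what backs runtime.SetFinalizer: an unmarked object that still has a finalizer is re-marked to keep it alive for one more cycle while the finalizer is queued to run. A small program that exercises this behaviour (whether and exactly when the finalizer fires is scheduler- and version-dependent, so this is a demonstration rather than a guarantee):

package main

import (
	"fmt"
	"runtime"
	"time"
)

type resource struct{ name string }

func main() {
	r := &resource{name: "demo"}
	runtime.SetFinalizer(r, func(r *resource) {
		fmt.Println("finalizer ran for", r.name)
	})

	r = nil      // drop the last reference
	runtime.GC() // the GC finds the unmarked object and queues its finalizer

	// Finalizers run on a separate goroutine; give it a moment.
	// (An extra runtime.GC() may be needed in some cases.)
	time.Sleep(100 * time.Millisecond)
	fmt.Println("done")
}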

As bgsweep and the allocator code seen earlier show, the sweep phase is quite lazy:
it can happen that the previous cycle's sweeping is still unfinished when a new GC cycle needs to start,
so before each new cycle begins, the previous cycle's sweep work must first be completed (the Sweep Termination phase).

That completes the analysis of the whole GC flow. Finally, here is the implementation of the write barrier function writebarrierptr:

// NOTE: Really dst *unsafe.Pointer, src unsafe.Pointer,
// but if we do that, Go inserts a write barrier on *dst = src.
//go:nosplit
func writebarrierptr(dst *uintptr, src uintptr) {
	if writeBarrier.cgo {
		cgoCheckWriteBarrier(dst, src)
	}
	if !writeBarrier.needed {
		*dst = src
		return
	}
	if src != 0 && src < minPhysPageSize {
		systemstack(func() {
			print("runtime: writebarrierptr *", dst, " = ", hex(src), "\n")
			throw("bad pointer in write barrier")
		})
	}
	// Mark (shade) the pointers
	writebarrierptr_prewrite1(dst, src)
	// Perform the actual pointer write
	*dst = src
}

writebarrierptr_prewrite1 is as follows:

// writebarrierptr_prewrite1 invokes a write barrier for *dst = src
// prior to the write happening.
//
// Write barrier calls must not happen during critical GC and scheduler
// related operations. In particular there are times when the GC assumes
// that the world is stopped but scheduler related code is still being
// executed, dealing with syscalls, dealing with putting gs on runnable
// queues and so forth. This code cannot execute write barriers because
// the GC might drop them on the floor. Stopping the world involves removing
// the p associated with an m. We use the fact that m.p == nil to indicate
// that we are in one these critical section and throw if the write is of
// a pointer to a heap object.
//go:nosplit
func writebarrierptr_prewrite1(dst *uintptr, src uintptr) {
	mp := acquirem()
	if mp.inwb || mp.dying > 0 {
		releasem(mp)
		return
	}
	systemstack(func() {
		if mp.p == 0 && memstats.enablegc && !mp.inwb && inheap(src) {
			throw("writebarrierptr_prewrite1 called with mp.p == nil")
		}
		mp.inwb = true
		gcmarkwb_m(dst, src)
	})
	mp.inwb = false
	releasem(mp)
}

gcmarkwb_m is as follows:

func gcmarkwb_m(slot *uintptr, ptr uintptr) {
	if writeBarrier.needed {
		// Note: This turns bad pointer writes into bad
		// pointer reads, which could be confusing. We avoid
		// reading from obviously bad pointers, which should
		// take care of the vast majority of these. We could
		// patch this up in the signal handler, or use XCHG to
		// combine the read and the write. Checking inheap is
		// insufficient since we need to track changes to
		// roots outside the heap.
		//
		// Note: profbuf.go omits a barrier during signal handler
		// profile logging; that's safe only because this deletion barrier exists.
		// If we remove the deletion barrier, we'll have to work out
		// a new way to handle the profile logging.
		if slot1 := uintptr(unsafe.Pointer(slot)); slot1 >= minPhysPageSize {
			if optr := *slot; optr != 0 {
				// Mark the old pointer (the deletion-barrier half)
				shade(optr)
			}
		}
		// TODO: Make this conditional on the caller's stack color.
		if ptr != 0 && inheap(ptr) {
			// Mark the new pointer (the insertion-barrier half)
			shade(ptr)
		}
	}
}
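gcmarkwb_m shades both the pointer being overwritten (the deletion-barrier half) and the pointer being installed (the insertion-barrier half), which is exactly the combination the hybrid write barrier relies on. The rule can be modelled with a toy object graph and a grey work list; this sketch only illustrates the barrier rule and is not the runtime's implementation:

package main

import "fmt"

type color int

const (
	white color = iota
	grey
	black
)

type object struct {
	name string
	col  color
	ref  *object // a single pointer field, for simplicity
}

var greyQueue []*object

// shadeIfWhite moves a white object to grey and queues it, like runtime.shade.
func shadeIfWhite(o *object) {
	if o != nil && o.col == white {
		o.col = grey
		greyQueue = append(greyQueue, o)
	}
}

// writePointer is the barrier-protected write: shade the old target and
// the new target, then perform the actual store (*slot = ptr).
func writePointer(slot **object, ptr *object) {
	shadeIfWhite(*slot) // old value stays visible to the GC
	shadeIfWhite(ptr)   // new value cannot be missed
	*slot = ptr
}

func main() {
	a := &object{name: "A", col: black}
	b := &object{name: "B", col: white}
	c := &object{name: "C", col: white}
	a.ref = b

	// The mutator rewires A.ref from B to C while marking is in progress.
	writePointer(&a.ref, c)

	for _, o := range greyQueue {
		fmt.Println("queued for scanning:", o.name)
	}
}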

shade is as follows:

// Shade the object if it isn't already.
// The object is not nil and known to be in the heap.
// Preemption must be disabled.
//go:nowritebarrier
func shade(b uintptr) {
	if obj, hbits, span, objIndex := heapBitsForObject(b, 0, 0); obj != 0 {
		gcw := &getg().m.p.ptr().gcw
		// Mark the object as live and add it to the mark queue (the object turns grey)
		greyobject(obj, 0, 0, hbits, span, gcw, objIndex)
		// If local mark queues have been disabled, flush to the global mark queue
		if gcphase == _GCmarktermination || gcBlackenPromptly {
			// Ps aren't allowed to cache work during mark
			// termination.
			gcw.dispose()
		}
	}
}
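From user code, the trigger that gcSetTriggerRatio recomputes at the end of each cycle is anchored to GOGC (the heap-growth target), which can also be adjusted at runtime with debug.SetGCPercent; the resulting goal is visible as MemStats.NextGC. A short example showing how the trigger tracks the live heap (the exact numbers vary with Go version and allocation pattern):

package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
)

func heapGoal() (live, next uint64) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	return ms.HeapAlloc, ms.NextGC
}

func main() {
	// Keep ~64MB of live data so the pacer has something to track.
	retained := make([][]byte, 64)
	for i := range retained {
		retained[i] = make([]byte, 1<<20)
	}

	debug.SetGCPercent(100) // default: next GC when the heap roughly doubles
	runtime.GC()
	live, next := heapGoal()
	fmt.Printf("GOGC=100: live=%d MB, next trigger around %d MB\n", live>>20, next>>20)

	debug.SetGCPercent(50) // trigger after roughly 50% growth instead
	runtime.GC()
	live, next = heapGoal()
	fmt.Printf("GOGC=50:  live=%d MB, next trigger around %d MB\n", live>>20, next>>20)

	runtime.KeepAlive(retained)
}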

 
