Linux Operating System Study Notes (7): Task Scheduling

1. Introduction

In the previous article we analyzed task_struct, the structure the kernel uses to represent both processes and threads, and looked at how processes and threads are created and forked. In this article we examine task scheduling in detail, covering its principles and the complete execution flow; with that, the broad framework of the process/thread subsystem is complete. The article has three parts: the scheduling policies commonly used in the Linux kernel, the core data structures involved in scheduling, and the full flow of how scheduling happens. Let's go through them in turn.

2. Scheduling Policies

Linux divides tasks into real-time tasks and normal tasks. Real-time tasks need results as quickly as possible, while normal tasks have no such strict requirement. As mentioned in the previous article, the scheduling policy is stored in the policy field of task_struct, and the scheduling priorities are kept in prio, static_prio, normal_prio and rt_priority. A priority is simply a number: for real-time processes the range is 0~99, for normal processes it is 100~139, and a smaller number means a higher priority.
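
As a small aside (user-space C, not part of the kernel walkthrough in this article), the snippet below mirrors how the kernel maps nice values (-20~19) into the 100~139 normal-priority range; the macro names follow the kernel's prio.h header but are re-declared here purely for demonstration.

#include <stdio.h>

/* Re-declared for illustration; the kernel defines the equivalents in
 * include/linux/sched/prio.h. */
#define MAX_RT_PRIO		100			/* prio 0..99 is the real-time range */
#define DEFAULT_PRIO		(MAX_RT_PRIO + 20)	/* nice 0 maps to prio 120 */
#define NICE_TO_PRIO(nice)	((nice) + DEFAULT_PRIO)

int main(void)
{
	for (int nice = -20; nice <= 19; nice += 13)
		printf("nice %3d -> prio %d\n", nice, NICE_TO_PRIO(nice));
	return 0;	/* prints prio 100, 113, 126, 139 */
}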

2.1 Real-Time Scheduling Policies

The real-time scheduling policies are the following (a small usage sketch follows the list):

  • SCHED_FIFO: first in, first out. As the name suggests, among tasks of the same priority, whoever arrives first runs first.
  • SCHED_RR: round robin. It emphasizes fairness: tasks of the same priority share the CPU in equal time slices, taking turns.
  • SCHED_DEADLINE: scheduling driven by task deadlines; the task whose deadline is closest gets the higher priority.
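
As a quick usage sketch (user-space, not from the original walkthrough), a process requests one of these policies through the POSIX sched_setscheduler() call; the example below assumes the caller has the required privilege (CAP_SYS_NICE or root) and picks an arbitrary real-time priority of 50.

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* Ask for SCHED_FIFO with real-time priority 50 for the calling
	 * process (pid 0); fails with EPERM without sufficient privilege. */
	struct sched_param param = { .sched_priority = 50 };

	if (sched_setscheduler(0, SCHED_FIFO, &param) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	printf("now running under SCHED_FIFO\n");
	return 0;
}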

2.2 Normal Scheduling Policies

The normal scheduling policies are the following:

  • SCHED_NORMAL: ordinary tasks.
  • SCHED_BATCH: batch/background tasks with lower priority.
  • SCHED_IDLE: tasks that only run when the CPU would otherwise be idle.
  • CFS: the Completely Fair Scheduler. Strictly speaking it is not a policy value but the scheduler that implements the normal policies above, and it deserves special mention. CFS assigns every task a virtual runtime, vruntime. While a task is running, its vruntime keeps growing with every CPU clock tick, whereas the vruntime of tasks that are not running stays unchanged; when it is time to schedule, the task with the smallest vruntime effectively has the highest priority. The actual vruntime calculation is weighted, which guarantees that higher-priority tasks receive proportionally more execution time, achieving "complete fairness" (see the sketch after this list).
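
To make the weighting concrete, here is a minimal sketch in plain C. It is a simplified model, not the kernel's calc_delta_fair() (which uses fixed-point arithmetic), but the proportional idea is the same: vruntime advances by the real execution time scaled by the ratio of the nice-0 weight to the task's own weight, so heavier tasks accumulate vruntime more slowly and therefore get more CPU time.

#include <stdio.h>

#define NICE_0_LOAD 1024ULL	/* weight of a nice-0 task in the kernel's weight table */

/* Simplified CFS accounting: delta_vruntime = delta_exec * NICE_0_LOAD / weight */
static unsigned long long vruntime_delta(unsigned long long delta_exec_ns,
					 unsigned long long weight)
{
	return delta_exec_ns * NICE_0_LOAD / weight;
}

int main(void)
{
	/* 88761, 1024 and 15 are the kernel's weights for nice -20, 0 and +19. */
	printf("nice -20: +%llu ns vruntime per 1 ms of CPU\n", vruntime_delta(1000000, 88761));
	printf("nice   0: +%llu ns vruntime per 1 ms of CPU\n", vruntime_delta(1000000, 1024));
	printf("nice +19: +%llu ns vruntime per 1 ms of CPU\n", vruntime_delta(1000000, 15));
	return 0;
}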

3. Scheduling-Related Data Structures

First, we need a structure that implements the scheduling policies: the scheduling class, sched_class. It has several implementations:

  • stop_sched_class: used by the highest-priority tasks; such a task preempts everything else and cannot itself be preempted by any other task;
  • dl_sched_class: corresponds to the deadline policy above;
  • rt_sched_class: corresponds to the RR and FIFO policies; which one is used is determined by the task's task_struct->policy;
  • fair_sched_class: the scheduling class for normal tasks;
  • idle_sched_class: the scheduling class for idle tasks.

Second, we need a scheduling-entity structure that gathers the per-task information used for scheduling, i.e. sched_entity. task_struct embeds one for each class:

  • struct sched_entity se: the normal (CFS) scheduling entity
  • struct sched_rt_entity rt: the real-time scheduling entity
  • struct sched_dl_entity dl: the deadline scheduling entity

The source of the normal scheduling entity is shown below; it contains the vruntime, the weight (load_weight), and statistics about run time.

struct sched_entity {
	/* For load-balancing: */
	struct load_weight		load;
	unsigned long			runnable_weight;
	struct rb_node			run_node;
	struct list_head		group_node;
	unsigned int			on_rq;
	u64				exec_start;
	u64				sum_exec_runtime;
	u64				vruntime;
	u64				prev_sum_exec_runtime;
	u64				nr_migrations;
	struct sched_statistics		statistics;
#ifdef CONFIG_FAIR_GROUP_SCHED
	int				depth;
	struct sched_entity		*parent;
	/* rq on which this entity is (to be) queued: */
	struct cfs_rq			*cfs_rq;
	/* rq "owned" by this entity/group: */
	struct cfs_rq			*my_q;
#endif
#ifdef CONFIG_SMP
	/*
	 * Per entity load average tracking.
	 *
	 * Put into separate cache line so it does not
	 * collide with read-mostly values above.
	 */
	struct sched_avg		avg;
#endif
};

At scheduling time, scheduling entities are first separated by class (real-time vs. normal). The CFS entities are kept in a red-black tree ordered by vruntime: the entity with the smallest vruntime sits leftmost in the tree, the one with the largest vruntime rightmost, and CFS picks the leftmost node as the next task to receive the CPU. This tree lives inside the CFS sub-queue of the per-CPU run queue, struct rq (shown after the sketch below).
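
Picking the leftmost entity is cheap because the cached rb-tree root remembers its leftmost node; the helper that does it, __pick_first_entity() (we will meet it again in section 4.2.1), looks roughly like this (lightly simplified from kernel/sched/fair.c, shown only to illustrate the lookup):

struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
{
	/* rb_first_cached() returns the cached leftmost (smallest-vruntime) node */
	struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);

	if (!left)
		return NULL;

	return rb_entry(left, struct sched_entity, run_node);
}

The per-CPU struct rq that contains this CFS sub-queue (as well as the RT and deadline sub-queues) is defined as follows: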

/*
 * This is the main, per-CPU runqueue data structure.
 *
 * Locking rule: those places that want to lock multiple runqueues
 * (such as the load balancing or the thread migration code), lock
 * acquire operations must be ordered by ascending &runqueue.
 */
struct rq {
	/* runqueue lock: */
	raw_spinlock_t		lock;
	/*
	 * nr_running and cpu_load should be in the same cacheline because
	 * remote CPUs use both these fields when doing load calculation.
	 */
	unsigned int		nr_running;
......
	#define CPU_LOAD_IDX_MAX 5
	unsigned long		cpu_load[CPU_LOAD_IDX_MAX];
......
	/* capture load from *all* tasks on this CPU: */
	struct load_weight	load;
	unsigned long		nr_load_updates;
	u64			nr_switches;
	struct cfs_rq		cfs;
	struct rt_rq		rt;
	struct dl_rq		dl;
......
	/*
	 * This is part of a global counter where only the total sum
	 * over all CPUs matters. A task can increase this counter on
	 * one CPU and if it got migrated afterwards it may decrease
	 * it on another CPU. Always updated under the runqueue lock:
	 */
	unsigned long		nr_uninterruptible;
	struct task_struct	*curr;
	struct task_struct	*idle;
	struct task_struct	*stop;
	unsigned long		next_balance;
	struct mm_struct	*prev_mm;
	unsigned int		clock_update_flags;
	u64			clock;
	/* Ensure that all clocks are in the same cache line */
	u64			clock_task ____cacheline_aligned;
	u64			clock_pelt;
	unsigned long		lost_idle_time;
	atomic_t		nr_iowait;
......
	/* calc_load related fields */
	unsigned long		calc_load_update;
	long			calc_load_active;
......
};

struct rq contains a cfs_rq, defined below. It holds the CFS-related fields: the load weights, the vruntime bookkeeping, and the red-black tree; the rb_root_cached member tasks_timeline is the root of that tree, with the leftmost node cached.

/* CFS-related fields in a runqueue */
struct cfs_rq {
	struct load_weight	load;
	unsigned long		runnable_weight;
	unsigned int		nr_running;
	unsigned int		h_nr_running;
	u64			exec_clock;
	u64			min_vruntime;
#ifndef CONFIG_64BIT
	u64			min_vruntime_copy;
#endif
	struct rb_root_cached	tasks_timeline;
	/*
	 * 'curr' points to currently running entity on this cfs_rq.
	 * It is set to NULL otherwise (i.e when none are currently running).
	 */
	struct sched_entity	*curr;
	struct sched_entity	*next;
	struct sched_entity	*last;
	struct sched_entity	*skip;
......
};

struct dl_rq is defined in a similar way: its run queue is also a red-black tree, ordered and managed by task deadlines.

/* Deadline class' related fields in a runqueue */
struct dl_rq {
	/* runqueue is an rbtree, ordered by deadline */
	struct rb_root_cached	root;
	unsigned long		dl_nr_running;
#ifdef CONFIG_SMP
	/*
	 * Deadline values of the currently executing and the
	 * earliest ready task on this rq. Caching these facilitates
	 * the decision whether or not a ready but not running task
	 * should migrate somewhere else.
	 */
	struct {
		u64		curr;
		u64		next;
	} earliest_dl;
	unsigned long		dl_nr_migratory;
	int			overloaded;
	/*
	 * Tasks on this rq that can be pushed away. They are kept in
	 * an rb-tree, ordered by tasks' deadlines, with caching
	 * of the leftmost (earliest deadline) element.
	 */
	struct rb_root_cached	pushable_dl_tasks_root;
#else
	struct dl_bw		dl_bw;
#endif
	/*
	 * "Active utilization" for this runqueue: increased when a
	 * task wakes up (becomes TASK_RUNNING) and decreased when a
	 * task blocks
	 */
	u64			running_bw;
	/*
	 * Utilization of the tasks "assigned" to this runqueue (including
	 * the tasks that are in runqueue and the tasks that executed on this
	 * CPU and blocked). Increased when a task moves to this runqueue, and
	 * decreased when the task moves away (migrates, changes scheduling
	 * policy, or terminates).
	 * This is needed to compute the "inactive utilization" for the
	 * runqueue (inactive utilization = this_bw - running_bw).
	 */
	u64			this_bw;
	u64			extra_bw;
	/*
	 * Inverse of the fraction of CPU utilization that can be reclaimed
	 * by the GRUB algorithm.
	 */
	u64			bw_ratio;
};

The real-time run queue rt_rq is different: it is not implemented as a red-black tree, but as an array of priority queues (rt_prio_array).

/* Real-Time classes' related field in a runqueue: */
struct rt_rq {
	struct rt_prio_array	active;
	unsigned int		rt_nr_running;
	unsigned int		rr_nr_running;
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
	struct {
		int		curr; /* highest queued rt task prio */
#ifdef CONFIG_SMP
		int		next; /* next highest */
#endif
	} highest_prio;
#endif
#ifdef CONFIG_SMP
	unsigned long		rt_nr_migratory;
	unsigned long		rt_nr_total;
	int			overloaded;
	struct plist_head	pushable_tasks;
#endif /* CONFIG_SMP */
	int			rt_queued;
	int			rt_throttled;
	u64			rt_time;
	u64			rt_runtime;
	/* Nests inside the rq lock: */
	raw_spinlock_t		rt_runtime_lock;
#ifdef CONFIG_RT_GROUP_SCHED
	unsigned long		rt_nr_boosted;
	struct rq		*rq;
	struct task_group	*tg;
#endif
};

Now let's look at the scheduling class sched_class. It defines the queue operations as a set of function pointers, for example:

  • enqueue_task: adds a task to the ready queue; called when a task becomes runnable;
  • dequeue_task: removes a task from the ready queue;
  • yield_task: voluntarily gives up the CPU;
  • yield_to_task: gives up the CPU in favour of a specified task_struct;
  • check_preempt_curr: checks whether the currently running task should be preempted (e.g., by a newly woken task);
  • pick_next_task: selects the next task to run;
  • put_prev_task: puts the previously running task back into the queue before another task takes its place;
  • set_curr_task: called when the scheduling policy or group of the current task changes, so the class can (re)install it as the current task;
  • task_tick: called on every periodic scheduler tick; may trigger rescheduling;
  • task_dead: called when a task exits;
  • switched_from, switched_to: called when a task changes scheduling class;
  • prio_changed: called when a task's priority changes.

struct sched_class {
	const struct sched_class *next;
	void (*enqueue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*dequeue_task) (struct rq *rq, struct task_struct *p, int flags);
	void (*yield_task)   (struct rq *rq);
	bool (*yield_to_task)(struct rq *rq, struct task_struct *p, bool preempt);
	void (*check_preempt_curr)(struct rq *rq, struct task_struct *p, int flags);
	/*
	 * It is the responsibility of the pick_next_task() method that will
	 * return the next task to call put_prev_task() on the @prev task or
	 * something equivalent.
	 *
	 * May return RETRY_TASK when it finds a higher prio class has runnable
	 * tasks.
	 */
	struct task_struct * (*pick_next_task)(struct rq *rq,
					       struct task_struct *prev,
					       struct rq_flags *rf);
	void (*put_prev_task)(struct rq *rq, struct task_struct *p);
......
	void (*set_curr_task)(struct rq *rq);
	void (*task_tick)(struct rq *rq, struct task_struct *p, int queued);
	void (*task_fork)(struct task_struct *p);
	void (*task_dead)(struct task_struct *p);
	/*
	 * The switched_from() call is allowed to drop rq->lock, therefore we
	 * cannot assume the switched_from/switched_to pair is serliazed by
	 * rq->lock. They are however serialized by p->pi_lock.
	 */
	void (*switched_from)(struct rq *this_rq, struct task_struct *task);
	void (*switched_to)  (struct rq *this_rq, struct task_struct *task);
	void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
			      int oldprio);
	unsigned int (*get_rr_interval)(struct rq *rq,
					struct task_struct *task);
	void (*update_curr)(struct rq *rq);
#define TASK_SET_GROUP		0
#define TASK_MOVE_GROUP		1
......
};

The scheduling classes are declared as:

extern const struct sched_class stop_sched_class;
extern const struct sched_class dl_sched_class;
extern const struct sched_class rt_sched_class;
extern const struct sched_class fair_sched_class;
extern const struct sched_class idle_sched_class;

These function pointers point to the actual per-policy implementations. Under linux/kernel/sched/, files such as fair.c, idle.c and rt.c implement the callbacks for the corresponding classes; for example, fair.c defines:

/*
 * All the scheduling class methods:
 */
const struct sched_class fair_sched_class = {
	.next			    = &idle_sched_class,
	.enqueue_task		= enqueue_task_fair,
	.dequeue_task		= dequeue_task_fair,
	.yield_task		    = yield_task_fair,
	.yield_to_task		= yield_to_task_fair,
	.check_preempt_curr	 = check_preempt_wakeup,
	.pick_next_task		 = pick_next_task_fair,
	.put_prev_task		 = put_prev_task_fair,
......
	.set_curr_task       = set_curr_task_fair,
	.task_tick		    = task_tick_fair,
	.task_fork		    = task_fork_fair,
	.prio_changed		= prio_changed_fair,
	.switched_from		= switched_from_fair,
	.switched_to		= switched_to_fair,
	.get_rr_interval	= get_rr_interval_fair,
	.update_curr		= update_curr_fair,
......
};

Taking the selection of the next task as an example: CFS supplies pick_next_task_fair, the real-time class supplies pick_next_task_rt, and so on.

To summarize:

  • Every CPU has a struct rq run queue, which contains the cfs_rq, rt_rq and other sub-queues.
  • The CFS (and deadline) sub-queues are organized as red-black trees whose nodes are scheduling entities (sched_entity).
  • Each sched_entity is embedded in a task_struct, i.e., corresponds to one task.
  • The sched_class referenced by a task_struct supplies the policy-specific callbacks that do the actual scheduling work (the sketch below shows how an entity maps back to its task).
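
Regarding the last two bullets: a sched_entity is literally embedded in its task_struct, so getting from an entity back to its task is just pointer arithmetic. With CONFIG_FAIR_GROUP_SCHED disabled, the task_of() helper in fair.c is essentially the following (shown for illustration):

static inline struct task_struct *task_of(struct sched_entity *se)
{
	/* the entity is the 'se' member embedded in task_struct */
	return container_of(se, struct task_struct, se);
}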

4. The Scheduling Flow

With the policies and data structures above we have the skeleton; what remains is the core scheduling flow that ties everything into a working scheduler. Scheduling happens in two ways: voluntary scheduling and preemptive scheduling.

  • Voluntary scheduling: a task runs for a while, then voluntarily gives up the CPU, and the scheduling policy selects a suitable next task to run.
  • Preemptive scheduling: a running task is interrupted on behalf of another task, stops executing, and the CPU switches to the next task.

4.1 Voluntary Scheduling

When talking about scheduling we cannot avoid the core function schedule(). Inside it, sched_submit_work() first finishes off pending work of the current task (for example flushing plugged block I/O) to avoid problems such as deadlock. schedule() then disables preemption, calls __schedule() to do the actual scheduling, and re-enables preemption; it repeats this as long as rescheduling is still needed, and returns otherwise.

asmlinkage __visible void __sched schedule(void)
{
	struct task_struct *tsk = current;
	sched_submit_work(tsk);
	do {
		preempt_disable();
		__schedule(false);
		sched_preempt_enable_no_resched();
	} while (need_resched());
}
EXPORT_SYMBOL(schedule);

__schedule() is the actual core scheduling function. Its main jobs are picking the next task and performing the context switch, where the context switch in turn covers switching the task's address space (the user-space part) and switching the kernel state. For the details, see the original English source comments together with the per-step annotations added in the listing below.

/*
 * __schedule() is the main scheduler function.
 * The main means of driving the scheduler and thus entering this function are:
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
 * WARNING: must be called with preemption disabled!
 */
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;
    
    // take the run queue rq of the current CPU; prev is set to the currently running task
	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;
    
    // sanity-check that scheduling from here is valid
	schedule_debug(prev);
	if (sched_feat(HRTICK))
		hrtick_clear(rq);
    
    // disable local IRQs, notify RCU of the context switch, and lock the run queue (SMP barrier follows)
	local_irq_disable();
	rcu_note_context_switch(preempt);
	/*
	 * Make sure that signal_pending_state()->signal_pending() below
	 * can't be reordered with __set_current_state(TASK_INTERRUPTIBLE)
	 * done by the caller to avoid the race with signal_wake_up().
	 *
	 * The membarrier system call requires a full memory barrier
	 * after coming from user-space, before storing to rq->curr.
	 */
	rq_lock(rq, &rf);
	smp_mb__after_spinlock();
    
	/* Promote REQ to ACT */
	rq->clock_update_flags <<= 1;
	update_rq_clock(rq);
	switch_count = &prev->nivcsw;
    
	if (!preempt && prev->state) {
        // prev is going to sleep voluntarily; if a signal is pending, keep it TASK_RUNNING instead
		if (signal_pending_state(prev->state, prev)) {
			prev->state = TASK_RUNNING;
		} else {
            // otherwise dequeue prev from rq, clear on_rq, and start delay accounting if it was waiting on I/O
			deactivate_task(rq, prev, DEQUEUE_SLEEP | DEQUEUE_NOCLOCK);
			prev->on_rq = 0;
			if (prev->in_iowait) {
				atomic_inc(&rq->nr_iowait);
				delayacct_blkio_start();
			}
			/*
			 * If a worker went to sleep, notify and ask workqueue
			 * whether it wants to wake up a task to maintain
			 * concurrency.
			 */
			if (prev->flags & PF_WQ_WORKER) {
				struct task_struct *to_wakeup;
				to_wakeup = wq_worker_sleeping(prev);
				if (to_wakeup)
					try_to_wake_up_local(to_wakeup, &rf);
			}
		}
		switch_count = &prev->nvcsw;
	}
    
    // call pick_next_task() to choose the next task and store it in next
	next = pick_next_task(rq, prev, &rf);
	clear_tsk_need_resched(prev);
	clear_preempt_need_resched();
    
    // if the task actually changed, switch the context
	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		/*
		 * The membarrier system call requires each architecture
		 * to have a full memory barrier after updating
		 * rq->curr, before returning to user-space.
		 *
		 * Here are the schemes providing that barrier on the
		 * various architectures:
		 * - mm ? switch_mm() : mmdrop() for x86, s390, sparc, PowerPC.
		 *   switch_mm() rely on membarrier_arch_switch_mm() on PowerPC.
		 * - finish_lock_switch() for weakly-ordered
		 *   architectures where spin_unlock is a full barrier,
		 * - switch_to() for arm64 (weakly-ordered, spin_unlock
		 *   is a RELEASE barrier),
		 */
		++*switch_count;
		trace_sched_switch(preempt, prev, next);
		/* Also unlocks the rq: */
		rq = context_switch(rq, prev, next, &rf);
	} else {
        // no switch: clear the clock-update flags and re-enable interrupts
		rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
		rq_unlock_irq(rq, &rf);
	}
    // run any queued balance callbacks (e.g. RT/DL push/pull balancing)
	balance_callback(rq);
}

The two key functions here are pick_next_task(), which chooses the next task, and context_switch(), which switches contexts; let's look at each in turn. pick_next_task() dispatches by scheduling class and calls the class's own pick function to select the next scheduling entity. As we saw above, for the tree-based classes this ultimately means taking the leftmost node of the corresponding red-black tree.

/*
 * Pick up the highest-prio task:
 */
static inline struct task_struct *
pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
{
	const struct sched_class *class;
	struct task_struct *p;
	/*
	 * Optimization: we know that if all tasks are in the fair class we can
	 * call that function directly, but only if the @prev task wasn't of a
	 * higher scheduling class, because otherwise those loose the
	 * opportunity to pull in more work from other CPUs.
	 */
	if (likely((prev->sched_class == &idle_sched_class ||
		    prev->sched_class == &fair_sched_class) &&
		   rq->nr_running == rq->cfs.h_nr_running)) {
		p = fair_sched_class.pick_next_task(rq, prev, rf);
		if (unlikely(p == RETRY_TASK))
			goto again;
		/* Assumes fair_sched_class->next == idle_sched_class */
		if (unlikely(!p))
			p = idle_sched_class.pick_next_task(rq, prev, rf);
		return p;
	}
again:
    // otherwise walk the classes in priority order and return the first task found
	for_each_class(class) {
		p = class->pick_next_task(rq, prev, rf);
		if (p) {
			if (unlikely(p == RETRY_TASK))
				goto again;
			return p;
		}
	}
	/* The idle class should always have a runnable task: */
	BUG();
}

Now for the context switch. It does two things: it switches the task's address space (the virtual memory), and it switches the registers and CPU context. Address-space switching will be covered in detail in the memory-management articles, so we set it aside here; it is what accomplishes the user-space part of the context switch. Below we focus on the kernel-side switch, i.e., switching registers and the CPU context.

/*
 * context_switch - switch to the new MM and the new thread's register state.
 */
static __always_inline struct rq *
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next, struct rq_flags *rf)
{
	struct mm_struct *mm, *oldmm;
	prepare_task_switch(rq, prev, next);
	mm = next->mm;
	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_start_context_switch(prev);
	/*
	 * If mm is non-NULL, we pass through switch_mm(). If mm is
	 * NULL, we will pass through mmdrop() in finish_task_switch().
	 * Both of these contain the full memory barrier required by
	 * membarrier after storing to rq->curr, before returning to
	 * user-space.
	 */
	if (!mm) {
		next->active_mm = oldmm;
		mmgrab(oldmm);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm_irqs_off(oldmm, mm, next);
	if (!prev->mm) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
	rq->clock_update_flags &= ~(RQCF_ACT_SKIP|RQCF_REQ_SKIP);
	prepare_lock_switch(rq, next, rf);
	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);
    // barrier() is a compiler barrier: it keeps the compiler from reordering switch_to() and finish_task_switch() during optimization
	barrier();
	return finish_task_switch(prev);
}

switch_to() switches the registers and the stack. It calls __switch_to_asm, a short piece of assembly mainly responsible for switching the stack: the 32-bit version uses esp as the stack pointer and the 64-bit version uses rsp, but the code is otherwise the same. This assembly switches the stack pointer and then jumps to __switch_to to finish the switch (including the TSS-related state). Note that switch_to actually takes three parameters, prev, next and last, and in practice it is invoked with the same variable for both prev and last. The reason for this design is best explained with an example. Suppose there are three tasks A, B and C, and scheduling goes A to B, B to C, and finally C back to A; assume for a moment that only prev and next were saved. Then:

  • A saves its kernel stack and registers and switches to B. At this point prev = A, next = B; these values are saved on A's stack and will be restored the next time A runs. Execution continues in B, which runs finish_task_switch() and returns B's run queue rq.
  • B saves its kernel stack and registers and switches to C.
  • C saves its kernel stack and registers and switches to A. A resumes right after barrier(), but the prev = A, next = B values restored from step 1 know nothing about C, so C's identity would be lost. This is exactly what the last parameter is for: after __switch_to_asm returns, A's stack has restored its own old prev and next, but the return value carries the address of C, which is stored into last so that finish_task_switch() can do the cleanup on behalf of C.


#define switch_to(prev, next, last)			      \
do {									       \
	prepare_switch_to(next);					\
									           \
	((last) = __switch_to_asm((prev), (next)));	  \
} while (0)

/*
 * %eax: prev task
 * %edx: next task
 */
ENTRY(__switch_to_asm)
......
  /* switch stack */
  movl  %esp, TASK_threadsp(%eax)
  movl  TASK_threadsp(%edx), %esp
......
  jmp  __switch_to
END(__switch_to_asm)

Finally __switch_to() is called. It involves the TSS (Task State Segment), a structure that can hold a full set of registers, and the special TR (Task Register) that points to the current TSS. In the hardware task-switching scheme, writing a new value into TR makes the CPU save all of its registers into the current TSS and load registers from the new one, i.e. a context switch driven entirely by hardware. Linux, however, does not use hardware task switching (see the source comment below): during system initialization, cpu_init() associates one TSS with each CPU and points TR at it, and TR is never switched afterwards; on a task switch, __switch_to() instead updates the relevant fields of that per-CPU TSS (such as the kernel stack pointer) and the rest of the thread state in software.

/*
 *	switch_to(x,y) should switch tasks from x to y.
 *
 * We fsave/fwait so that an exception goes off at the right time
 * (as a call from the fsave or fwait in effect) rather than to
 * the wrong process. Lazy FP saving no longer makes any sense
 * with modern CPU's, and this simplifies a lot of things (SMP
 * and UP become the same).
 *
 * NOTE! We used to use the x86 hardware context switching. The
 * reason for not using it any more becomes apparent when you
 * try to recover gracefully from saved state that is no longer
 * valid (stale segment register values in particular). With the
 * hardware task-switch, there is no way to fix up bad state in
 * a reasonable manner.
 *
 * The fact that Intel documents the hardware task-switching to
 * be slow is a fairly red herring - this code is not noticeably
 * faster. However, there _is_ some room for improvement here,
 * so the performance issues may eventually be a valid point.
 * More important, however, is the fact that this allows us much
 * more flexibility.
 *
 * The return value (in %ax) will be the "prev" task after
 * the task-switch, and shows up in ret_from_fork in entry.S,
 * for example.
 */
__visible __notrace_funcgraph struct task_struct *
__switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	struct thread_struct *prev = &prev_p->thread,
			     *next = &next_p->thread;
	struct fpu *prev_fpu = &prev->fpu;
	struct fpu *next_fpu = &next->fpu;
	int cpu = smp_processor_id();
	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */
	switch_fpu_prepare(prev_fpu, cpu);
	/*
	 * Save away %gs. No need to save %fs, as it was saved on the
	 * stack on entry.  No need to save %es and %ds, as those are
	 * always kernel segments while inside the kernel.  Doing this
	 * before setting the new TLS descriptors avoids the situation
	 * where we temporarily have non-reloadable segments in %fs
	 * and %gs.  This could be an issue if the NMI handler ever
	 * used %fs or %gs (it does not today), or if the kernel is
	 * running inside of a hypervisor layer.
	 */
	lazy_save_gs(prev->gs);
	/*
	 * Load the per-thread Thread-Local Storage descriptor.
	 */
	load_TLS(next, cpu);
	/*
	 * Restore IOPL if needed.  In normal use, the flags restore
	 * in the switch assembly will handle this.  But if the kernel
	 * is running virtualized at a non-zero CPL, the popf will
	 * not restore flags, so it must be done in a separate step.
	 */
	if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
		set_iopl_mask(next->iopl);
	switch_to_extra(prev_p, next_p);
	/*
	 * Leave lazy mode, flushing any hypercalls made here.
	 * This must be done before restoring TLS segments so
	 * the GDT and LDT are properly updated, and must be
	 * done before fpu__restore(), so the TS bit is up
	 * to date.
	 */
	arch_end_context_switch(next_p);
	/*
	 * Reload esp0 and cpu_current_top_of_stack.  This changes
	 * current_thread_info().  Refresh the SYSENTER configuration in
	 * case prev or next is vm86.
	 */
	update_task_stack(next_p);
	refresh_sysenter_cs(next);
	this_cpu_write(cpu_current_top_of_stack,
		       (unsigned long)task_stack_page(next_p) +
		       THREAD_SIZE);
	/*
	 * Restore %gs if needed (which is common)
	 */
	if (prev->gs | next->gs)
		lazy_load_gs(next->gs);
	switch_fpu_finish(next_fpu, cpu);
	this_cpu_write(current_task, next_p);
	/* Load the Intel cache allocation PQR MSR. */
	resctrl_sched_in();
	return prev_p;
}

After switch_to() has completed the kernel-side switch, one more important function, finish_task_switch(), takes care of the cleanup. The importance of the third parameter last was explained above when we introduced switch_to's three parameters; the reason prev and last are passed as the same variable is that prev is not needed afterwards, so the same slot can be reused to receive last, saving one variable.

/**
 * finish_task_switch - clean up after a task-switch
 * @prev: the thread we just switched away from.
 *
 * finish_task_switch must be called after the context switch, paired
 * with a prepare_task_switch call before the context switch.
 * finish_task_switch will reconcile locking set up by prepare_task_switch,
 * and do any other architecture-specific cleanup actions.
 *
 * Note that we may have delayed dropping an mm in context_switch(). If
 * so, we finish that here outside of the runqueue lock. (Doing it
 * with the lock held can cause deadlocks; see schedule() for
 * details.)
 *
 * The context switch have flipped the stack from under us and restored the
 * local variables which were saved when this task called schedule() in the
 * past. prev == current is still correct but we need to recalculate this_rq
 * because prev may have moved to another CPU.
 */
static struct rq *finish_task_switch(struct task_struct *prev)
	__releases(rq->lock)
{
	struct rq *rq = this_rq();
	struct mm_struct *mm = rq->prev_mm;
	long prev_state;
	/*
	 * The previous task will have left us with a preempt_count of 2
	 * because it left us after:
	 *
	 *	schedule()
	 *	  preempt_disable();			// 1
	 *	  __schedule()
	 *	    raw_spin_lock_irq(&rq->lock)	// 2
	 *
	 * Also, see FORK_PREEMPT_COUNT.
	 */
	if (WARN_ONCE(preempt_count() != 2*PREEMPT_DISABLE_OFFSET,
		      "corrupted preempt_count: %s/%d/0x%x\n",
		      current->comm, current->pid, preempt_count()))
		preempt_count_set(FORK_PREEMPT_COUNT);
	rq->prev_mm = NULL;
	/*
	 * A task struct has one reference for the use as "current".
	 * If a task dies, then it sets TASK_DEAD in tsk->state and calls
	 * schedule one last time. The schedule call will never return, and
	 * the scheduled task must drop that reference.
	 *
	 * We must observe prev->state before clearing prev->on_cpu (in
	 * finish_task), otherwise a concurrent wakeup can get prev
	 * running on another CPU and we could race with its RUNNING -> DEAD
	 * transition, resulting in a double drop.
	 */
	prev_state = prev->state;
	vtime_task_switch(prev);
	perf_event_task_sched_in(prev, current);
	finish_task(prev);
	finish_lock_switch(rq);
	finish_arch_post_lock_switch();
	kcov_finish_switch(current);
	fire_sched_in_preempt_notifiers(current);
	/*
	 * When switching through a kernel thread, the loop in
	 * membarrier_{private,global}_expedited() may have observed that
	 * kernel thread and not issued an IPI. It is therefore possible to
	 * schedule between user->kernel->user threads without passing though
	 * switch_mm(). Membarrier requires a barrier after storing to
	 * rq->curr, before returning to userspace, so provide them here:
	 *
	 * - a full memory barrier for {PRIVATE,GLOBAL}_EXPEDITED, implicitly
	 *   provided by mmdrop(),
	 * - a sync_core for SYNC_CORE.
	 */
	if (mm) {
		membarrier_mm_sync_core_before_usermode(mm);
		mmdrop(mm);
	}
	if (unlikely(prev_state == TASK_DEAD)) {
		if (prev->sched_class->task_dead)
			prev->sched_class->task_dead(prev);
		/*
		 * Remove function-return probe instances associated with this
		 * task and put them back on the free list.
		 */
		kprobe_flush_task(prev);
		/* Task is done with its stack. */
		put_task_stack(prev);
		put_task_struct(prev);
	}
	tick_nohz_task_switch();
	return rq;
}

With that, the kernel-side switch is done and the whole voluntary scheduling path is complete.

4.2 Preemptive Scheduling

Preemptive scheduling is usually triggered in two situations: a task has been running for too long, or a task has just been woken up. Let's start with the first case.

4.2.1 Checking Task Run Time

This case requires measuring how long a task has been running and triggering preemption if it has run too long. The machine has a clock that raises a timer interrupt at regular intervals, telling the operating system that another tick has passed; this is the point at which the kernel checks whether it is time to preempt.

The timer interrupt handler calls scheduler_tick(). It first obtains the current CPU and, from it, the run queue rq and the current task curr, then invokes the task_tick() callback of the task's sched_class to handle the timer event.

/*
 * This function gets called by the timer code, with HZ frequency.
 * We call it with interrupts disabled.
 */
void scheduler_tick(void)
{
	int cpu = smp_processor_id();
	struct rq *rq = cpu_rq(cpu);
	struct task_struct *curr = rq->curr;
	struct rq_flags rf;
	sched_clock_tick();
	rq_lock(rq, &rf);
	update_rq_clock(rq);
	curr->sched_class->task_tick(rq, curr, 0);
	cpu_load_update_active(rq);
	calc_global_load_tick(rq);
	psi_task_tick(rq);
	rq_unlock(rq, &rf);
	perf_event_task_tick();
......
}

Taking normal tasks as the example, the scheduling class is fair_sched_class and the tick handler is task_tick_fair(). It walks the current task's scheduling-entity hierarchy, obtains the corresponding run queue, and calls entity_tick() to update the timing.

/*
 * scheduler tick hitting a task of our scheduling class.
 * NOTE: This function can be called remotely by the tick offload that
 * goes along full dynticks. Therefore no local assumption can be made
 * and everything must be accessed through the @rq and @curr passed in
 * parameters.
 */
static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
{
	struct cfs_rq *cfs_rq;
	struct sched_entity *se = &curr->se;
	for_each_sched_entity(se) {
		cfs_rq = cfs_rq_of(se);
		entity_tick(cfs_rq, se, queued);
	}
	if (static_branch_unlikely(&sched_numa_balancing))
		task_tick_numa(rq, curr);
	update_misfit_status(curr, rq);
	update_overutilized_status(task_rq(curr));
}

In entity_tick(), update_curr() is called first to update the current task's vruntime, and then check_preempt_tick() decides whether preemption should be triggered now.

static void
entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
{
	/*
	 * Update run-time statistics of the 'current'.
	 */
	update_curr(cfs_rq);
	/*
	 * Ensure that runnable average is periodically updated.
	 */
	update_load_avg(cfs_rq, curr, UPDATE_TG);
	update_cfs_group(curr);
......
	if (cfs_rq->nr_running > 1)
		check_preempt_tick(cfs_rq, curr);
}

check_preempt_tick() first calls sched_slice() to compute ideal_runtime, the time the task should be given within one scheduling period. sum_exec_runtime is the total real time the task has executed, and prev_sum_exec_runtime is the value it had when the task was last scheduled in, so sum_exec_runtime - prev_sum_exec_runtime is the real time used during this scheduling round. If that exceeds ideal_runtime, the task should be preempted. In addition, __pick_first_entity() fetches the entity with the smallest vruntime from the red-black tree; if the current task's vruntime exceeds that minimum by more than ideal_runtime, it should also be preempted.

/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
	unsigned long ideal_runtime, delta_exec;
	struct sched_entity *se;
	s64 delta;
	ideal_runtime = sched_slice(cfs_rq, curr);
	delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
	if (delta_exec > ideal_runtime) {
		resched_curr(rq_of(cfs_rq));
		/*
		 * The current task ran long enough, ensure it doesn't get
		 * re-elected due to buddy favours.
		 */
		clear_buddies(cfs_rq, curr);
		return;
	}
	/*
	 * Ensure that a task that missed wakeup preemption by a
	 * narrow margin doesn't have to wait for a full slice.
	 * This also mitigates buddy induced latencies under load.
	 */
	if (delta_exec < sysctl_sched_min_granularity)
		return;
	se = __pick_first_entity(cfs_rq);
	delta = curr->vruntime - se->vruntime;
	if (delta < 0)
		return;
	if (delta > ideal_runtime)
		resched_curr(rq_of(cfs_rq));
}

If preemption is needed, resched_curr() is called. It uses set_tsk_need_resched() to set the TIF_NEED_RESCHED flag on the task, marking it as one that should be preempted.

/*
 * resched_curr - mark rq's current task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
void resched_curr(struct rq *rq)
{
	struct task_struct *curr = rq->curr;
	int cpu;
.......
	cpu = cpu_of(rq);
	if (cpu == smp_processor_id()) {
		set_tsk_need_resched(curr);
		set_preempt_need_resched();
		return;
	}
	if (set_nr_and_not_polling(curr))
		smp_send_reschedule(cpu);
	else
		trace_sched_wake_idle_without_ipi(cpu);
}

4.2.2 Task Wakeup

Some tasks are woken up by interrupts, for example an I/O-bound task when its I/O completes. If the woken task has a higher priority than the task currently on the CPU, preemption should be triggered. try_to_wake_up() calls ttwu_queue() to put the woken task on a queue; ttwu_queue() calls ttwu_do_activate() to activate it, which in turn calls ttwu_do_wakeup(), where check_preempt_curr() checks whether preemption should happen. Even then the current task is not kicked off the CPU immediately; it is merely marked as needing to be preempted.

static void ttwu_do_wakeup(struct rq *rq, struct task_struct *p, int wake_flags,
         struct rq_flags *rf)
{
  check_preempt_curr(rq, p, wake_flags);
  p->state = TASK_RUNNING;
  trace_sched_wakeup(p);
......
}

4.2.3 When Preemption Actually Happens

From the analysis above we know that in both cases, whether the current task has run too long or a newly woken task should take over, the current task is only marked with TIF_NEED_RESCHED. Now let's see where preemption actually takes place: real preemption still needs a specific point at which the running process gets a chance to call __schedule() and let the scheduler take over.

__schedule() is actually reached at the following points:

  • Returning from a system call to user space. Taking 64-bit as the example, the return path is do_syscall_64 -> syscall_return_slowpath -> prepare_exit_to_usermode -> exit_to_usermode_loop. exit_to_usermode_loop() checks _TIF_NEED_RESCHED and, if it is set, calls schedule():
static void exit_to_usermode_loop(struct pt_regs *regs, u32 cached_flags)
{
    while (true) {
        /* We have work to do. */
        local_irq_enable();

        if (cached_flags & _TIF_NEED_RESCHED)
          schedule();
......
  }
}

  • Kernel preemption. In kernel code, preemption typically happens in preempt_enable(). Some kernel operations must not be interrupted, so they are bracketed by preempt_disable(); the moment preemption is re-enabled is an opportunity for the kernel code to be preempted. preempt_enable() calls preempt_count_dec_and_test() to check preempt_count and TIF_NEED_RESCHED; if preemption is allowed, it goes through preempt_schedule -> preempt_schedule_common -> __schedule to reschedule.
#define preempt_enable() \
do { \
  if (unlikely(preempt_count_dec_and_test())) \
    __preempt_schedule(); \
} while (0)

#define preempt_count_dec_and_test() \
  ({ preempt_count_sub(1); should_resched(0); })

static __always_inline bool should_resched(int preempt_offset)
{
  return unlikely(preempt_count() == preempt_offset &&
      tif_need_resched());
}

#define tif_need_resched() test_thread_flag(TIF_NEED_RESCHED)

static void __sched notrace preempt_schedule_common(void)
{
  do {
......
    __schedule(true);
......
  } while (need_resched());
}

  • Returning from an interrupt to kernel or user space. Interrupts are handled by do_IRQ(); when the handler finishes there are two cases, returning to user space or returning to kernel space.
    • Returning to user space goes through prepare_exit_to_usermode(), which ends up in exit_to_usermode_loop() as above.
    • Returning to kernel space goes through preempt_schedule_irq(), which ends up calling __schedule():
common_interrupt:
        ASM_CLAC
        addq    $-0x80, (%rsp) 
        interrupt do_IRQ
ret_from_intr:
        popq    %rsp
        testb   $3, CS(%rsp)
        jz      retint_kernel
/* Interrupt came from user space */
GLOBAL(retint_user)
        mov     %rsp,%rdi
        call    prepare_exit_to_usermode
        TRACE_IRQS_IRETQ
        SWAPGS
        jmp     restore_regs_and_iret
/* Returning to kernel space */
retint_kernel:
#ifdef CONFIG_PREEMPT
        bt      $9, EFLAGS(%rsp)  
        jnc     1f
0:      cmpl    $0, PER_CPU_VAR(__preempt_count)
        jnz     1f
        call    preempt_schedule_irq
        jmp     0b
asmlinkage __visible void __sched preempt_schedule_irq(void)
{
......
  do {
    preempt_disable();
    local_irq_enable();
    __schedule(true);
    local_irq_disable();
    sched_preempt_enable_no_resched();
  } while (need_resched());
......
}

5. Summary

This article analyzed the scheduling policies, the data structures, and the complete scheduling flow. The memory/address-space part of the context switch has not been covered in detail yet; it will be examined in the memory-management articles.

