Linux Kernel Analysis (8): Process Scheduling and Process Switching in Linux

This article covers the following:

1. When process scheduling happens in Linux

2. How the scheduling function schedule() works

3. How the process context switch works

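I. When Process Scheduling Happens in Linux

The scheduling function schedule() is called from a great many places in the Linux source tree, including many drivers and subsystems (network devices, file systems, sound cards, and so on); cscope finds more than 500 call sites. Here we look only at the core kernel, that is, the calls to schedule() in code under the kernel directory. There are 53 of them, as the following two screenshots show:

As for why scheduling happens at these points, I took a shortcut and read the comment that precedes the scheduling function in the source. It turns out to be very detailed and very helpful for understanding process scheduling.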

 * __schedule() is the main scheduler function.
 *
 * The main means of driving the scheduler and thus entering this function are:
 *
 *   1. Explicit blocking: mutex, semaphore, waitqueue, etc.
 *
 *   2. TIF_NEED_RESCHED flag is checked on interrupt and userspace return
 *      paths. For example, see arch/x86/entry_64.S.
 *
 *      To drive preemption between tasks, the scheduler sets the flag in timer
 *      interrupt handler scheduler_tick().
 *
 *   3. Wakeups don't really cause entry into schedule(). They add a
 *      task to the run-queue and that's it.
 *
 *      Now, if the new task added to the run-queue preempts the current
 *      task, then the wakeup sets TIF_NEED_RESCHED and schedule() gets
 *      called on the nearest possible occasion:
 *
 *       - If the kernel is preemptible (CONFIG_PREEMPT=y):
 *
 *         - in syscall or exception context, at the next outmost
 *           preempt_enable(). (this might be as soon as the wake_up()'s
 *           spin_unlock()!)
 *
 *         - in IRQ context, return from interrupt-handler to
 *           preemptible context
 *
 *       - If the kernel is not preemptible (CONFIG_PREEMPT is not set)
 *         then at the next:
 *
 *          - cond_resched() call
 *          - explicit schedule() call
 *          - return from syscall or exception to user-space
 *          - return from interrupt-handler to user-space
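The distinction in item 2 is easy to miss: the timer tick only sets the flag, and the scheduler actually runs later, when the flag is checked on a return path. The following is a minimal user-space sketch of that split. Every name in it (toy_task, toy_scheduler_tick, toy_return_to_user, and so on) is hypothetical and only models the idea; none of it is kernel code.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for kernel state; none of these are kernel APIs. */
struct toy_task {
    const char *name;
    bool need_resched;   /* models TIF_NEED_RESCHED */
    int  ticks_left;     /* remaining time slice, in timer ticks */
};

static int toy_schedule_calls;   /* counts how often the "scheduler" ran */

/* Models schedule(): here it only clears the flag and refills the slice. */
static void toy_schedule(struct toy_task *t)
{
    t->need_resched = false;
    t->ticks_left = 3;
    toy_schedule_calls++;
}

/* Models scheduler_tick(): the timer interrupt only SETS the flag. */
static void toy_scheduler_tick(struct toy_task *t)
{
    if (t->ticks_left > 0 && --t->ticks_left == 0)
        t->need_resched = true;
}

/* Models the return-to-user path: the flag is CHECKED here. */
static void toy_return_to_user(struct toy_task *t)
{
    if (t->need_resched)
        toy_schedule(t);
}

/* Runs n timer ticks, each followed by a return to user space. */
static int toy_run(struct toy_task *t, int n)
{
    toy_schedule_calls = 0;
    for (int i = 0; i < n; i++) {
        toy_scheduler_tick(t);
        toy_return_to_user(t);
    }
    return toy_schedule_calls;
}
```

With a slice of 3 ticks, calling toy_run() for 7 ticks enters toy_schedule() twice, once after the third tick and once after the sixth.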

II. How the Scheduling Function schedule() Works

schedule() is implemented in kernel/sched/core.c:

asmlinkage __visible void __sched schedule(void)
{
    struct task_struct *tsk = current;

    sched_submit_work(tsk);  /* submit pending I/O requests first, to avoid deadlocks */
    __schedule();            /* the actual scheduling work */
}

The implementation of __schedule() follows, with my comments on the key steps:

static void __sched __schedule(void)
{
    struct task_struct *prev, *next;
    unsigned long *switch_count;
    struct rq *rq;
    int cpu;

need_resched:
    preempt_disable();
    cpu = smp_processor_id();
    rq = cpu_rq(cpu);               /* run queue of the current CPU */
    rcu_note_context_switch(cpu);
    prev = rq->curr;                /* the task running now becomes prev */

    schedule_debug(prev);           /* extra sanity checks for debugging */

    if (sched_feat(HRTICK))
        hrtick_clear(rq);

    /*
     * Memory barrier: keeps the signal_pending_state() check below from
     * being reordered with the caller's __set_current_state(), which
     * avoids a race with signal_wake_up().
     */
    smp_mb__before_spinlock();
    raw_spin_lock_irq(&rq->lock);   /* lock the run queue we are about to modify */

    switch_count = &prev->nivcsw;   /* assume an involuntary switch by default */
    if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
        if (unlikely(signal_pending_state(prev->state, prev))) {
            prev->state = TASK_RUNNING;
        } else {
            deactivate_task(rq, prev, DEQUEUE_SLEEP);
            prev->on_rq = 0;        /* prev is off the run queue, i.e. suspended */

            if (prev->flags & PF_WQ_WORKER) {
                struct task_struct *to_wakeup;

                to_wakeup = wq_worker_sleeping(prev, cpu);
                if (to_wakeup)
                    try_to_wake_up_local(to_wakeup);
            }
        }
        switch_count = &prev->nvcsw;    /* prev blocked: count a voluntary switch */
    }

    if (task_on_rq_queued(prev) || rq->skip_clock_update < 0)
        update_rq_clock(rq);

    next = pick_next_task(rq, prev);    /* ask the scheduling classes for the next task to run */
    clear_tsk_need_resched(prev);
    clear_preempt_need_resched();       /* clear the reschedule flags */
    rq->skip_clock_update = 0;

    if (likely(prev != next)) {
        rq->nr_switches++;
        rq->curr = next;
        ++*switch_count;

        context_switch(rq, prev, next); /* perform the switch; also unlocks the rq */

        /*
         * We resume here on the stack of the task that was switched in,
         * possibly on a different CPU, so look up cpu and rq again.
         */
        cpu = smp_processor_id();
        rq = cpu_rq(cpu);
    } else
        raw_spin_unlock_irq(&rq->lock);

    post_schedule(rq);

    sched_preempt_enable_no_resched();
    if (need_resched())
        goto need_resched;
}
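The two counters that switch_count points at, nvcsw (voluntary) and nivcsw (involuntary), are visible from user space: getrusage() reports them as ru_nvcsw and ru_nivcsw. The sketch below reads them before and after a short sleep. The helper names read_csw and block_briefly are my own, and the expectation that a sleep shows up as a voluntary switch is an assumption about the environment the program runs in.

```c
#define _POSIX_C_SOURCE 200809L
#include <sys/resource.h>
#include <time.h>

struct csw_counts {
    long voluntary;     /* ru_nvcsw:  the task blocked and gave up the CPU */
    long involuntary;   /* ru_nivcsw: the task was preempted */
};

/* Reads the calling process's context-switch counters; returns 0 on success. */
static int read_csw(struct csw_counts *out)
{
    struct rusage ru;

    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    out->voluntary = ru.ru_nvcsw;
    out->involuntary = ru.ru_nivcsw;
    return 0;
}

/* Sleeps for 10 ms so the task has to block (explicit blocking, case 1). */
static void block_briefly(void)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = 10 * 1000 * 1000 };

    nanosleep(&ts, NULL);
}
```

Calling read_csw(), then block_briefly(), then read_csw() again should normally show the voluntary count going up while the involuntary count stays about the same.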

The implementation of context_switch() is shown below. To make the overall structure easier to see, I removed some long comments and annotated the key steps:

static inline void context_switch(struct rq *rq, struct task_struct *prev, struct task_struct *next)
{
    struct mm_struct *mm, *oldmm;

    prepare_task_switch(rq, prev, next);

    mm = next->mm;
    oldmm = prev->active_mm;
    arch_start_context_switch(prev);

    if (!mm) {
        /* next is a kernel thread: borrow prev's address space */
        next->active_mm = oldmm;
        atomic_inc(&oldmm->mm_count);
        enter_lazy_tlb(oldmm, next);
    } else
        switch_mm(oldmm, mm, next);     /* switch to next's address space */

    if (!prev->mm) {
        /* prev was a kernel thread: give back the borrowed mm */
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }

    spin_release(&rq->lock.dep_map, 1, _THIS_IP_);

    context_tracking_task_switch(prev, next);

    switch_to(prev, next, prev);        /* switch registers and stack; see Part III */

    barrier();
    finish_task_switch(this_rq(), prev);
}
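The address-space half of context_switch() is mostly bookkeeping on mm and active_mm, and it can be modelled in a few lines of plain C. This is only a sketch with simplified, hypothetical types (toy_mm, mm_task, toy_switch_mm). In particular, the real switch_mm() loads the new page tables, which the model reduces to a pointer assignment.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-ins for mm_struct and task_struct. */
struct toy_mm {
    int mm_count;              /* reference count, like mm_struct.mm_count */
};

struct mm_task {
    struct toy_mm *mm;         /* NULL for a kernel thread */
    struct toy_mm *active_mm;  /* address space actually in use */
};

/*
 * Mirrors the mm handling in context_switch().
 * Returns the mm that the real code would stash in rq->prev_mm
 * (to be released after the switch), or NULL if there is none.
 */
static struct toy_mm *toy_switch_mm(struct mm_task *prev, struct mm_task *next)
{
    struct toy_mm *oldmm = prev->active_mm;
    struct toy_mm *prev_mm = NULL;

    if (!next->mm) {
        /* Kernel thread: borrow the previous address space. */
        next->active_mm = oldmm;
        oldmm->mm_count++;
    } else {
        /* User task: stands in for switch_mm(oldmm, mm, next). */
        next->active_mm = next->mm;
    }

    if (!prev->mm) {
        /* prev was a kernel thread: it gives the borrowed mm back. */
        prev->active_mm = NULL;
        prev_mm = oldmm;
    }
    return prev_mm;
}
```

Switching from a user task to a kernel thread leaves the kernel thread running on the user task's mm with the reference count raised by one. Switching back hands that mm to the caller so the extra reference can be dropped.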

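Tracing schedule() is also very simple. Because so many places call it, you only need to boot the kernel as in the earlier articles and set a breakpoint on schedule; the kernel stops at the breakpoint the next time schedule() is called.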

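III. The Context-Switch Macro switch_to

The most critical part of the process switch above, switch_to, swaps the CPU register contents of the two processes. It is hardware dependent, so we analyze the implementation for 32-bit x86. (For tidiness, I removed the long comments from the source file.)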

#define switch_to(prev, next, last) \
do { \
    unsigned long ebx, ecx, edx, esi, edi; \
    \
    asm volatile( \
        "pushfl\n\t"                 /* save flags on prev's stack */ \
        "pushl %%ebp\n\t"            /* save EBP (frame pointer) on prev's stack */ \
        "movl %%esp,%[prev_sp]\n\t"  /* save ESP in prev->thread.sp */ \
        "movl %[next_sp],%%esp\n\t"  /* load ESP from next->thread.sp: now on next's stack */ \
        "movl $1f,%[prev_ip]\n\t"    /* save the address of label 1 in prev->thread.ip */ \
        "pushl %[next_ip]\n\t"       /* push next's saved IP as the return address */ \
        __switch_canary              /* copies next's stack canary if CONFIG_CC_STACKPROTECTOR is set, else empty */ \
        "jmp __switch_to\n"          /* regparm call; its ret pops next_ip into EIP */ \
        "1:\t"                       /* a task that was switched out resumes here */ \
        "popl %%ebp\n\t"             /* restore EBP */ \
        "popfl\n"                    /* restore flags; the task then continues after its earlier call into schedule() */ \
        \
        /* output parameters */ \
        : [prev_sp] "=m" (prev->thread.sp), \
          [prev_ip] "=m" (prev->thread.ip), \
          "=a" (last), \
          \
          /* clobbered output registers: */ \
          "=b" (ebx), "=c" (ecx), "=d" (edx), \
          "=S" (esi), "=D" (edi) \
          \
          __switch_canary_oparam     /* , [stack_canary] "=m" (stack_canary.canary) */ \
          \
          /* input parameters: */ \
        : [next_sp] "m" (next->thread.sp), \
          [next_ip] "m" (next->thread.ip), \
          \
          /* regparm parameters for __switch_to(): */ \
          [prev] "a" (prev), \
          [next] "d" (next) \
          \
          __switch_canary_iparam     /* , [task_canary] "i" (offsetof(struct task_struct, stack_canary)) */ \
          \
        : /* reloaded segment registers */ \
          "memory");                 /* everything above is the operand list of the inline assembly */ \
} while (0)
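The core idea of switch_to, saving the current stack pointer and resume address and then loading another task's, has a user-space analogue in the ucontext API of glibc (getcontext, makecontext, swapcontext). The sketch below is only an analogy and not how the kernel does it: run_tasks plays the role of the scheduler, and the names task_body, run_tasks, and trace are made up for this example.

```c
#include <ucontext.h>

#define STACK_SIZE (64 * 1024)

static ucontext_t main_ctx, task_ctx[2];
static char stacks[2][STACK_SIZE];   /* each task gets its own stack */
static char trace[16];               /* records which task ran, in order */
static int  trace_len;

/* Each task logs its id, then hands the CPU back to the scheduler, twice. */
static void task_body(int id)
{
    for (int round = 0; round < 2; round++) {
        trace[trace_len++] = (char)('A' + id);
        /* Save our registers and stack pointer, then resume main_ctx. */
        swapcontext(&task_ctx[id], &main_ctx);
    }
}

/* Acts as the "scheduler": alternates between the two tasks. */
static const char *run_tasks(void)
{
    trace_len = 0;
    for (int id = 0; id < 2; id++) {
        getcontext(&task_ctx[id]);
        task_ctx[id].uc_stack.ss_sp = stacks[id];
        task_ctx[id].uc_stack.ss_size = STACK_SIZE;
        task_ctx[id].uc_link = &main_ctx;
        makecontext(&task_ctx[id], (void (*)(void))task_body, 1, id);
    }
    for (int round = 0; round < 2; round++)
        for (int id = 0; id < 2; id++)
            swapcontext(&main_ctx, &task_ctx[id]);   /* "switch_to" task id */
    trace[trace_len] = '\0';
    return trace;
}
```

Each swapcontext() call returns only when some other context switches back to it, in the same way that a task that was switched out by switch_to continues at label 1 the next time it is chosen to run.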
