Context Switching in the Linux Kernel

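After the scheduler has picked a new runnable process, that process cannot start running right away. The kernel first has to take care of several steps related to multitasking, and together these steps make up the context switch.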

The scheduling function schedule(), whose main job is to find the next runnable process on the run queue, contains the following fragment:

asmlinkage void __sched schedule(void)
{
        struct rq *rq; /* the run queue */
......
	if (likely(prev != next)) {
		rq->nr_switches++;
		rq->curr = next;
		++*switch_count;

		context_switch(rq, prev, next); /* unlocks the rq */
	} else
		spin_unlock_irq(&rq->lock);
......
}

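Here prev is the process that was running until now and next is the process chosen to run next. If the two differ, context_switch is called to switch contexts. If the scheduler picks the process that was just running, no context switch is needed and only the run-queue lock is released.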
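To make the prev/next decision concrete, here is a minimal user-space sketch. The toy_task and toy_rq types and the toy_schedule_tail function are hypothetical stand-ins I made up for illustration; they are not kernel code and leave out locking entirely.

```c
#include <assert.h>

/* Hypothetical, simplified stand-ins for task_struct and struct rq. */
struct toy_task { int pid; };

struct toy_rq {
    unsigned long nr_switches;  /* number of context switches so far */
    struct toy_task *curr;      /* task currently running on this queue */
};

/* Mirrors the "prev != next" test; returns 1 if a switch was performed. */
static int toy_schedule_tail(struct toy_rq *rq, struct toy_task *prev,
                             struct toy_task *next)
{
    if (prev != next) {
        rq->nr_switches++;
        rq->curr = next;
        /* the real kernel calls context_switch(rq, prev, next) here */
        return 1;
    }
    /* same task picked again: the real kernel only unlocks rq->lock */
    return 0;
}
```

Calling toy_schedule_tail with the same task twice leaves nr_switches untouched, which is the cheap path described above.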

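Before going into details, let us get a rough idea of what a context switch has to accomplish. This is based on my own understanding from reading the source, so corrections are welcome.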

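The memory-management context.
The page tables: the page global directory is reloaded, which installs a new virtual address space for the process.
The kernel stack: the switch happens while the process is in kernel mode, so it is the kernel-mode stack that gets switched.
The hardware context: mainly the processor registers of the process and the CPU's task state, that is, the fields of the TSS. To reduce the cost of switching, many parts are switched only when necessary, the so-called lazy principle.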

The discussion below is kept simple. It does not trace down to the assembly level or to the state of each register at compile time.

1. Code

static inline void
context_switch(struct rq *rq, struct task_struct *prev,
	       struct task_struct *next)
{
	struct mm_struct *mm, *oldmm;

	prepare_task_switch(rq, prev, next);
	mm = next->mm;
	oldmm = prev->active_mm;
	/*
	 * For paravirt, this is coupled with an exit in switch_to to
	 * combine the page table reload and the switch backend into
	 * one hypercall.
	 */
	arch_enter_lazy_cpu_mode();

	if (unlikely(!mm)) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);

	if (unlikely(!prev->mm)) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
	/*
	 * Since the runqueue lock will be released by the next
	 * task (which is an invalid locking op but in the case
	 * of the scheduler it's an obvious special-case), so we
	 * do an early lockdep release here:
	 */
#ifndef __ARCH_WANT_UNLOCKED_CTXSW
	spin_release(&rq->lock.dep_map, 1, _THIS_IP_);
#endif

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);

	barrier();
	/*
	 * this_rq must be evaluated again because prev may have moved
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
	finish_task_switch(this_rq(), prev);
}

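The code above does two main things: it switches the memory-management context, and it switches the processor state and the kernel stack. These two parts are discussed in turn.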

2. Switching the memory-management context

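This work is architecture-specific. It mainly consists of loading the page tables, flushing the translation lookaside buffer (TLB), and supplying the memory management unit (MMU) with the new information.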

	if (unlikely(!mm)) {
		next->active_mm = oldmm;
		atomic_inc(&oldmm->mm_count);
		enter_lazy_tlb(oldmm, next);
	} else
		switch_mm(oldmm, mm, next);

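For a kernel thread, the mm field of task_struct is NULL. A kernel thread has no memory context of its own, which means it never accesses user space. When a kernel thread is scheduled in, the kernel therefore leaves the user-space part of the memory context alone, because the thread will not use it. If the process that runs after the kernel thread is the same one that ran before it, all of that data is still valid. The kernel thread's active_mm is pointed at the outgoing process's active_mm, and the thread runs on that address space.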

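enter_lazy_tlb tells the architecture-specific code that the user-space part of the virtual address space does not need to be switched. This is lazy TLB handling, implemented as follows: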

static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
#ifdef CONFIG_SMP
	unsigned cpu = smp_processor_id();
	if (per_cpu(cpu_tlbstate, cpu).state == TLBSTATE_OK)
		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_LAZY;
#endif
}

To see what this function does we first need to understand cpu_tlbstate, which is a per-CPU variable. Let us look at it first.

cpu_tlbstate is defined as a per-CPU variable:

DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate) ____cacheline_aligned = { &init_mm, 0, };
#define DEFINE_PER_CPU(type, name) \
    __attribute__((__section__(".data.percpu"))) __typeof__(type) per_cpu__##name

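This defines a variable struct tlb_state per_cpu__cpu_tlbstate (note the double underscore produced by the macro), which is placed in the .data.percpu section at build time.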

Next, the per_cpu macro:

#define per_cpu(var, cpu) (*({				\
	extern int simple_indentifier_##var(void);	\
	RELOC_HIDE(&per_cpu__##var, __per_cpu_offset[cpu]); }))
#define RELOC_HIDE(ptr, off)					\
  ({ unsigned long __ptr;					\
     __ptr = (unsigned long) (ptr);				\
    (typeof(ptr)) (__ptr + (off)); })

__per_cpu_offset supplies an offset, so the CPU number (smp_processor_id()) selects that CPU's own copy of the variable, an instance of struct tlb_state, defined as:

struct tlb_state
{
	struct mm_struct *active_mm;
	int state;
	char __cacheline_padding[L1_CACHE_BYTES-8];
};

enter_lazy_tlb then changes this CPU's state from TLBSTATE_OK to TLBSTATE_LAZY.
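The per-CPU lookup and the state change can be sketched in user space as follows. This is only an illustration under my own assumptions: a plain array indexed by CPU number replaces the .data.percpu section and __per_cpu_offset, and all toy_ names and the constant values are made up.

```c
#include <assert.h>

#define TOY_NR_CPUS 4

/* Illustrative values; only the distinction between the states matters. */
enum { TOY_TLBSTATE_OK = 1, TOY_TLBSTATE_LAZY = 2 };

struct toy_mm { int id; };

struct toy_tlb_state {
    struct toy_mm *active_mm;
    int state;
};

/* One copy per CPU. The kernel reaches each copy through
 * __per_cpu_offset[cpu]; a plain array plays that role here. */
static struct toy_tlb_state toy_cpu_tlbstate[TOY_NR_CPUS];

#define toy_per_cpu(var, cpu) (var[(cpu)])

/* Same shape as enter_lazy_tlb: only an OK state becomes LAZY. */
static void toy_enter_lazy_tlb(int cpu)
{
    if (toy_per_cpu(toy_cpu_tlbstate, cpu).state == TOY_TLBSTATE_OK)
        toy_per_cpu(toy_cpu_tlbstate, cpu).state = TOY_TLBSTATE_LAZY;
}
```

Each CPU touches only its own slot, so no locking is needed for this state in the sketch, which is the point of keeping such data per CPU.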

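As noted, a kernel thread has no user address space of its own. It runs on the address space of whichever process happened to run before it, in other words it borrows that address space. When the kernel thread is switched out, the borrowed address space has to be given back. rq->prev_mm records it so that finish_task_switch can drop the reference taken with atomic_inc once the switch is complete.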

	if (unlikely(!prev->mm)) {
		prev->active_mm = NULL;
		rq->prev_mm = oldmm;
	}
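The borrow-and-return cycle can be modelled with a plain reference counter. The sketch below is a simplification under my own assumptions: the toy_ types and functions are hypothetical, a plain int replaces the atomic mm_count, and toy_finish_task_switch only mimics the part of finish_task_switch that drops the borrowed reference.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mm { int mm_count; };               /* reference count on the mm */

struct toy_task {
    struct toy_mm *mm;         /* NULL for a kernel thread */
    struct toy_mm *active_mm;  /* address space actually in use */
};

struct toy_rq { struct toy_mm *prev_mm; };

/* The mm-related part of context_switch, without any hardware access. */
static void toy_switch_mm_context(struct toy_rq *rq, struct toy_task *prev,
                                  struct toy_task *next)
{
    struct toy_mm *mm = next->mm;
    struct toy_mm *oldmm = prev->active_mm;

    if (!mm) {                     /* next is a kernel thread: borrow */
        next->active_mm = oldmm;
        oldmm->mm_count++;
    }
    /* else: the real kernel calls switch_mm(oldmm, mm, next) */

    if (!prev->mm) {               /* prev was a kernel thread: hand back */
        prev->active_mm = NULL;
        rq->prev_mm = oldmm;
    }
}

/* After the switch: drop the reference the kernel thread was holding. */
static void toy_finish_task_switch(struct toy_rq *rq)
{
    struct toy_mm *mm = rq->prev_mm;

    rq->prev_mm = NULL;
    if (mm)
        mm->mm_count--;
}
```

Switching from a user process to a kernel thread raises the count by one; switching back and finishing the switch brings it down again.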

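If the mm field is not NULL, that is, next is an ordinary process, switch_mm is called. This function reaches into CPU-specific details, but its main tasks are: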

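Set the per-CPU state (cpu_vm_mask and cpu_tlbstate).
Load the page tables by writing the new page global directory to CR3, and load the local descriptor table (LDT) if it differs.
Update the TLB state.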

static inline void switch_mm(struct mm_struct *prev,
			     struct mm_struct *next,
			     struct task_struct *tsk)
{
	int cpu = smp_processor_id();

	if (likely(prev != next)) {
		/* stop flush ipis for the previous mm */
		cpu_clear(cpu, prev->cpu_vm_mask);
#ifdef CONFIG_SMP
		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
		per_cpu(cpu_tlbstate, cpu).active_mm = next;
#endif
		cpu_set(cpu, next->cpu_vm_mask);

		/* Re-load page tables */
		load_cr3(next->pgd);	/* load the page tables */

		/*
		 * load the LDT, if the LDT is different:
		 */
		if (unlikely(prev->context.ldt != next->context.ldt))
			load_LDT_nolock(&next->context);
	}
#ifdef CONFIG_SMP
	else {
		per_cpu(cpu_tlbstate, cpu).state = TLBSTATE_OK;
		BUG_ON(per_cpu(cpu_tlbstate, cpu).active_mm != next);

		if (!cpu_test_and_set(cpu, next->cpu_vm_mask)) {
			/* We were in lazy tlb mode and leave_mm disabled 
			 * tlb flush IPI delivery. We must reload %cr3.
			 */
			load_cr3(next->pgd);
			load_LDT_nolock(&next->context);
		}
	}
#endif
}

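This part is fairly simple. It is enough to know what it has to accomplish, because the actual work is architecture-specific. In essence it switches the page tables and flushes the TLB, plus a few details.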
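The skeleton of switch_mm can be modelled without touching real hardware. In this sketch, which rests on my own simplifications, toy_load_cr3 merely records the value and counts calls, the CPU mask is a single unsigned long, and the SMP-only else branch is left out.

```c
#include <assert.h>

struct toy_mm {
    int pgd;                     /* stands in for the page global directory */
    unsigned long cpu_vm_mask;   /* bit n set: CPU n is using this mm */
};

static int toy_cr3;              /* simulated CR3 register */
static int toy_cr3_loads;        /* how often the page tables were reloaded */

static void toy_load_cr3(int pgd)
{
    toy_cr3 = pgd;
    toy_cr3_loads++;
}

/* Uniprocessor-style outline of switch_mm. */
static void toy_switch_mm(struct toy_mm *prev, struct toy_mm *next, int cpu)
{
    if (prev != next) {
        prev->cpu_vm_mask &= ~(1UL << cpu);  /* this CPU leaves prev */
        next->cpu_vm_mask |= 1UL << cpu;     /* and starts using next */
        toy_load_cr3(next->pgd);             /* reload the page tables */
    }
    /* prev == next: nothing to reload in this simplified model */
}
```

Switching between two threads that share one mm never reloads the simulated CR3, which is why such switches are cheaper than switches between unrelated processes.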

3. Switching the processor state

The main code for this part is:

	/* Here we just switch the register state and the stack. */
	switch_to(prev, next, prev);

	barrier();
	/*
	 * this_rq must be evaluated again because prev may have moved
	 * CPUs since it called schedule(), thus the 'rq' on its stack
	 * frame will be invalid.
	 */
	finish_task_switch(this_rq(), prev);
}

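First, a note on the structure of this code. Once switch_to has been called, the process switch is complete, because the registers and the stack now belong to another process. The code that follows switch_to therefore runs only when the current process is next selected to run.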
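The same "the line after the switch runs only when we are switched back" behaviour can be observed in user space with the ucontext functions (getcontext, makecontext, swapcontext). This is an analogy rather than what the kernel does, and it assumes Linux with glibc; the function names run_switch_demo and worker and the trace array are mine.

```c
#include <assert.h>
#include <ucontext.h>

static ucontext_t main_ctx, worker_ctx;
static char worker_stack[64 * 1024];   /* separate stack for the worker */
static int trace[4];                   /* order in which steps executed */
static int trace_len;

static void worker(void)
{
    trace[trace_len++] = 2;               /* runs after the first switch */
    swapcontext(&worker_ctx, &main_ctx);  /* switch back to the caller */
    /* not reached in this demo: nobody switches to the worker again */
}

/* Returns the number of recorded steps. */
static int run_switch_demo(void)
{
    trace_len = 0;
    getcontext(&worker_ctx);
    worker_ctx.uc_stack.ss_sp = worker_stack;
    worker_ctx.uc_stack.ss_size = sizeof worker_stack;
    worker_ctx.uc_link = &main_ctx;
    makecontext(&worker_ctx, worker, 0);

    trace[trace_len++] = 1;
    swapcontext(&main_ctx, &worker_ctx);  /* plays the role of switch_to */
    /* Executed only once the worker has switched back to us. */
    trace[trace_len++] = 3;
    return trace_len;
}
```

The recorded order is 1, 2, 3: step 3 sits right after the switch in the source, yet it runs only after the worker has given control back.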

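barrier() is a compiler directive that acts as a memory barrier at the compiler level. It emits no machine instruction and only constrains how the compiler may optimize.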

#define barrier() __asm__ __volatile__("": : :"memory")

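This statement tells the compiler that any memory values it has cached in CPU registers before barrier() must be treated as stale afterwards, so they are reloaded from memory on their next use. Here it also ensures that the compiler does not reorder switch_to and finish_task_switch.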
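A small example of the same macro in ordinary C, assuming GCC or Clang since it relies on their inline assembly syntax. The variable shared_flag and the function read_twice_with_barrier are made up for illustration.

```c
#include <assert.h>

/* Compiler-only barrier: emits no instruction, but the "memory" clobber
 * forbids the compiler from keeping memory values cached across it. */
#define barrier() __asm__ __volatile__("" : : : "memory")

static int shared_flag;

static int read_twice_with_barrier(void)
{
    int first = shared_flag;
    barrier();                 /* 'first' may not be reused for the next read */
    int second = shared_flag;  /* the compiler must load from memory again */
    return first + second;
}
```

Note that this does not order memory accesses as seen by other CPUs; for that the kernel has separate primitives, which are outside the scope of this post.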

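switch_to takes three parameters, yet only two distinct values are passed (prev appears twice). The third parameter, last, is an output: when this process is eventually resumed, it receives the process that was running immediately before it, which is not necessarily the one it originally switched to. switch_to is defined as a macro: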

#define switch_to(prev,next,last) do {					\
	unsigned long esi,edi;						\
	asm volatile("pushfl\n\t"		/* Save flags */	\
		     "pushl %%ebp\n\t"					\
		     "movl %%esp,%0\n\t"	/* save ESP */		\
		     "movl %5,%%esp\n\t"	/* restore ESP */	\
		     "movl $1f,%1\n\t"		/* save EIP */		\
		     "pushl %6\n\t"		/* restore EIP */	\
		     "jmp __switch_to\n"				\
		     "1:\t"						\
		     "popl %%ebp\n\t"					\
		     "popfl"						\
		     :"=m" (prev->thread.esp),"=m" (prev->thread.eip),	\
		      "=a" (last),"=S" (esi),"=D" (edi)			\
		     :"m" (next->thread.esp),"m" (next->thread.eip),	\
		      "2" (prev), "d" (next));				\
} while (0)

Before analyzing this code, let us look at the thread member of task_struct:

struct task_struct{
......
      struct thread_struct thread;
......
};
struct thread_struct {
/* cached TLS descriptors. */
	struct desc_struct tls_array[GDT_ENTRY_TLS_ENTRIES];
	unsigned long	esp0;
	unsigned long	sysenter_cs;
	unsigned long	eip;
	unsigned long	esp;
	unsigned long	fs;
	unsigned long	gs;
/* Hardware debugging registers */
	unsigned long	debugreg[8];  /* %%db0-7 debug registers */
/* fault info */
	unsigned long	cr2, trap_no, error_code;
/* floating point info */
	union i387_union	i387;
/* virtual 86 mode info */
	struct vm86_struct __user * vm86_info;
	unsigned long		screen_bitmap;
	unsigned long		v86flags, v86mask, saved_esp0;
	unsigned int		saved_fs, saved_gs;
/* IO permissions */
	unsigned long	*io_bitmap_ptr;
 	unsigned long	iopl;
/* max allowed port in the bitmap, in bytes: */
	unsigned long	io_bitmap_max;
};

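In thread_struct, the two fields that matter below are eip, the saved instruction pointer (the offset within the code segment at which the process will resume), and esp, the saved stack pointer (the top of the process's kernel-mode stack).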

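Push the flags register and ebp onto the kernel stack.
Store esp in prev->thread.esp, saving the kernel stack pointer of the outgoing process.
Load next->thread.esp into esp, restoring the kernel stack pointer of the incoming process. This changes which stack is addressed, so from this point on the code runs on next's kernel stack.
Store the address of label 1 (the popl %%ebp; popfl sequence) in prev->thread.eip. Why? Because when prev is scheduled again later, this is where it has to resume, popping the ebp and flags it pushed in the first step.
Push next->thread.eip onto the kernel stack, which is now next's stack.
Jump to __switch_to. It is entered with jmp rather than call, so its ret pops the address pushed in the previous step and execution continues in next.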

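__switch_to is entered from assembly. It uses the fastcall convention, so it takes its arguments from registers: prev in eax and next in edx, matching the input constraints of the macro above.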

struct task_struct fastcall * __switch_to(struct task_struct *prev_p, struct task_struct *next_p)
{
	struct thread_struct *prev = &prev_p->thread,
				 *next = &next_p->thread;
	int cpu = smp_processor_id();
	struct tss_struct *tss = &per_cpu(init_tss, cpu);

	/* never put a printk in __switch_to... printk() calls wake_up*() indirectly */

	__unlazy_fpu(prev_p);


	/* we're going to use this soon, after a few expensive things */
	if (next_p->fpu_counter > 5)
		prefetch(&next->i387.fxsave);

	/*
	 * Reload esp0.
	 */
	load_esp0(tss, next);

	/*
	 * Save away %gs. No need to save %fs, as it was saved on the
	 * stack on entry.  No need to save %es and %ds, as those are
	 * always kernel segments while inside the kernel.  Doing this
	 * before setting the new TLS descriptors avoids the situation
	 * where we temporarily have non-reloadable segments in %fs
	 * and %gs.  This could be an issue if the NMI handler ever
	 * used %fs or %gs (it does not today), or if the kernel is
	 * running inside of a hypervisor layer.
	 */
	savesegment(gs, prev->gs);

	/*
	 * Load the per-thread Thread-Local Storage descriptor.
	 */
	load_TLS(next, cpu);

	/*
	 * Restore IOPL if needed.  In normal use, the flags restore
	 * in the switch assembly will handle this.  But if the kernel
	 * is running virtualized at a non-zero CPL, the popf will
	 * not restore flags, so it must be done in a separate step.
	 */
	if (get_kernel_rpl() && unlikely(prev->iopl != next->iopl))
		set_iopl_mask(next->iopl);

	/*
	 * Now maybe handle debug registers and/or IO bitmaps
	 */
	if (unlikely(task_thread_info(prev_p)->flags & _TIF_WORK_CTXSW_PREV ||
		     task_thread_info(next_p)->flags & _TIF_WORK_CTXSW_NEXT))
		__switch_to_xtra(prev_p, next_p, tss);

	/*
	 * Leave lazy mode, flushing any hypercalls made here.
	 * This must be done before restoring TLS segments so
	 * the GDT and LDT are properly updated, and must be
	 * done before math_state_restore, so the TS bit is up
	 * to date.
	 */
	arch_leave_lazy_cpu_mode();

	/* If the task has used fpu the last 5 timeslices, just do a full
	 * restore of the math state immediately to avoid the trap; the
	 * chances of needing FPU soon are obviously high now
	 */
	if (next_p->fpu_counter > 5)
		math_state_restore();

	/*
	 * Restore %gs if needed (which is common)
	 */
	if (prev->gs | next->gs)
		loadsegment(gs, next->gs);

	x86_write_percpu(current_task, next_p);

	return prev_p;
}

This function completes the switch from prev_p to next_p. Its main steps are:

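Call __unlazy_fpu, which saves the FPU (floating-point unit), MMX and related registers, but only if prev actually used the FPU. A rough idea of what happens here is enough.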

#define __unlazy_fpu( tsk ) do {				\
	if (task_thread_info(tsk)->status & TS_USEDFPU) {	\
		__save_init_fpu(tsk);				\
		stts();						\
	} else							\
		tsk->fpu_counter = 0;				\
} while (0)
static inline void __save_init_fpu( struct task_struct *tsk )
{
	/* Use more nops than strictly needed in case the compiler
	   varies code */
	alternative_input(
		"fnsave %[fx] ;fwait;" GENERIC_NOP8 GENERIC_NOP4,
		"fxsave %[fx]\n"
		"bt $7,%[fsw] ; jnc 1f ; fnclex\n1:",
		X86_FEATURE_FXSR,
		[fx] "m" (tsk->thread.i387.fxsave),
		[fsw] "m" (tsk->thread.i387.fxsave.swd) : "memory");
	/* AMD K7/K8 CPUs don't save/restore FDP/FIP/FOP unless an exception
	   is pending.  Clear the x87 state here by setting it to fixed
   	   values. safe_address is a random variable that should be in L1 */
	alternative_input(
		GENERIC_NOP8 GENERIC_NOP2,
		"emms\n\t"	  	/* clear stack tags */
		"fildl %[addr]", 	/* set F?P to defined value */
		X86_FEATURE_FXSAVE_LEAK,
		[addr] "m" (safe_address));
	task_thread_info(tsk)->status &= ~TS_USEDFPU;
}
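The lazy save decision can be reduced to a few lines. The sketch below is a model under my own assumptions: toy_fpu_hw stands in for the single physical FPU register set, toy_cr0_ts for the TS bit that stts() sets in CR0, and all toy_ names are hypothetical.

```c
#include <assert.h>

#define TOY_TS_USEDFPU 0x1u

struct toy_task {
    unsigned status;   /* TOY_TS_USEDFPU set if the task touched the FPU */
    int fpu_counter;   /* consecutive time slices with FPU use */
    int saved_fpu;     /* stands in for the thread.i387 save area */
};

static int toy_fpu_hw;     /* contents of the one physical FPU */
static int toy_cr0_ts;     /* when set, the next FPU instruction traps */
static int toy_fpu_saves;  /* how many times the state was really saved */

/* Same decision as __unlazy_fpu: save only if the FPU was used. */
static void toy_unlazy_fpu(struct toy_task *tsk)
{
    if (tsk->status & TOY_TS_USEDFPU) {
        tsk->saved_fpu = toy_fpu_hw;      /* __save_init_fpu */
        tsk->status &= ~TOY_TS_USEDFPU;
        toy_cr0_ts = 1;                   /* stts() */
        toy_fpu_saves++;
    } else {
        tsk->fpu_counter = 0;
    }
}
```

A task that never touched the FPU costs nothing here, and setting the TS bit means the next task pays for a restore only if it actually executes an FPU instruction.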

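Get the TSS of the current CPU and store next_p->thread.esp0 in it (load_esp0), so that the CPU finds next's kernel stack on the next transition from user mode to kernel mode.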

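load_TLS. Copies next's thread-local storage (TLS) descriptors into the current CPU's global descriptor table (GDT).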

#define load_TLS(t, cpu) native_load_tls(t, cpu)
static inline void native_load_tls(struct thread_struct *t, unsigned int cpu)
{
	unsigned int i;
	struct desc_struct *gdt = get_cpu_gdt_table(cpu);

	for (i = 0; i < GDT_ENTRY_TLS_ENTRIES; i++)
		gdt[GDT_ENTRY_TLS_MIN + i] = t->tls_array[i];
}
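As a user-space illustration of what the loop does, the sketch below copies a thread's TLS descriptors into fixed slots of a table. The sizes, the slot index and every toy_ name are illustrative assumptions of mine, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative sizes only. */
#define TOY_GDT_ENTRIES 32
#define TOY_TLS_MIN      6
#define TOY_TLS_ENTRIES  3

struct toy_desc { uint32_t a, b; };   /* one 8-byte segment descriptor */

struct toy_thread { struct toy_desc tls_array[TOY_TLS_ENTRIES]; };

static struct toy_desc toy_gdt[TOY_GDT_ENTRIES];   /* this CPU's table */

/* Copy the incoming thread's TLS descriptors into the fixed TLS slots. */
static void toy_load_tls(const struct toy_thread *t)
{
    for (unsigned i = 0; i < TOY_TLS_ENTRIES; i++)
        toy_gdt[TOY_TLS_MIN + i] = t->tls_array[i];
}
```

Only the TLS slots are overwritten on each switch; the rest of the table is shared by every process that runs on that CPU.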

savesegment. Saves the outgoing process's gs segment register in task_struct->thread.gs.
```
#define savesegment(seg, value) \
	asm volatile("mov %%" #seg ",%0":"=rm" (value))
```
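Update the IO permission bitmap in the TSS if necessary, that is, if the process uses an IO permission bitmap of its own.
Handle the debug registers, again only if necessary.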

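The part of a process's hardware context that the CPU itself reads, such as esp0 and the IO permission bitmap, is kept in the TSS of the CPU on which the process runs.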

/* This is the TSS defined by the hardware. */
struct i386_hw_tss {
	unsigned short	back_link,__blh;
	unsigned long	esp0;
	unsigned short	ss0,__ss0h;
	unsigned long	esp1;
	unsigned short	ss1,__ss1h;	/* ss1 is used to cache MSR_IA32_SYSENTER_CS */
	unsigned long	esp2;
	unsigned short	ss2,__ss2h;
	unsigned long	__cr3;
	unsigned long	eip;
	unsigned long	eflags;
	unsigned long	eax,ecx,edx,ebx;
	unsigned long	esp;
	unsigned long	ebp;
	unsigned long	esi;
	unsigned long	edi;
	unsigned short	es, __esh;
	unsigned short	cs, __csh;
	unsigned short	ss, __ssh;
	unsigned short	ds, __dsh;
	unsigned short	fs, __fsh;
	unsigned short	gs, __gsh;
	unsigned short	ldt, __ldth;
	unsigned short	trace, io_bitmap_base;
} __attribute__((packed));

struct tss_struct {
	struct i386_hw_tss x86_tss;

	/*
	 * The extra 1 is there because the CPU will access an
	 * additional byte beyond the end of the IO permission
	 * bitmap. The extra byte must be all 1 bits, and must
	 * be within the limit.
	 */
	unsigned long	io_bitmap[IO_BITMAP_LONGS + 1];
	/*
	 * Cache the current maximum and the last task that used the bitmap:
	 */
	unsigned long io_bitmap_max;
	struct thread_struct *io_bitmap_owner;
	/*
	 * pads the TSS to be cacheline-aligned (size is 0x100)
	 */
	unsigned long __cacheline_filler[35];
	/*
	 * .. and then another 0x100 bytes for emergency kernel stack
	 */
	unsigned long stack[64];
} __attribute__((packed));
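The layout of the hardware-defined part can be checked in user space. The sketch below rewrites the structure with fixed-width types, because unsigned long is 8 bytes on a 64-bit host, and renames the padding fields; toy_hw_tss and toy_load_esp0 are my own names and the helper only mimics storing esp0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 32-bit hardware TSS layout with fixed-width types. */
struct toy_hw_tss {
    uint16_t back_link, pad_bl;
    uint32_t esp0;
    uint16_t ss0, pad_ss0;
    uint32_t esp1;
    uint16_t ss1, pad_ss1;
    uint32_t esp2;
    uint16_t ss2, pad_ss2;
    uint32_t cr3;
    uint32_t eip;
    uint32_t eflags;
    uint32_t eax, ecx, edx, ebx;
    uint32_t esp, ebp, esi, edi;
    uint16_t es, pad_es;
    uint16_t cs, pad_cs;
    uint16_t ss, pad_ss;
    uint16_t ds, pad_ds;
    uint16_t fs, pad_fs;
    uint16_t gs, pad_gs;
    uint16_t ldt, pad_ldt;
    uint16_t trace, io_bitmap_base;
} __attribute__((packed));

/* Mimics the esp0 update done on each switch. */
static void toy_load_esp0(struct toy_hw_tss *tss, uint32_t esp0)
{
    tss->esp0 = esp0;
}
```

With these types the structure is 104 bytes long, esp0 sits at offset 4, and io_bitmap_base occupies the last two bytes at offset 102.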

zmxiangde_88
