Reposted from: http://blog.csdn.net/sailor_8318/article/details/2870184

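[Abstract] This article explains in detail how the Linux kernel implements kernel preemption. It first introduces the concepts of kernel preemption and user preemption and the difference between them, then looks at the characteristics of a non-preemptible kernel and why real-time systems need kernel preemption. It then examines the situations in which kernel preemption is forbidden and the points at which it can occur, and finally describes the changes made to implement a preemptible kernel and when rescheduling is needed.

[Keywords] kernel preemption, user preemption, interrupts, real-time behavior, spinlock, preemption points, scheduling points, schedule, preempt_count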

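1 Overview of kernel preemption

The new preemptible kernel in 2.6 refers to kernel preemption: when a process is running in kernel space and a higher-priority task becomes runnable, the current task can be suspended and the higher-priority process run instead, provided the kernel currently allows preemption.

Before version 2.5.4 the Linux kernel was not preemptible: a high-priority process could not interrupt a low-priority process running in the kernel and take over the CPU. Once a process entered kernel mode (for example, a user process executing a system call), it kept running until it finished or left the kernel, unless it gave up the CPU voluntarily. A preemptible Linux kernel, by contrast, lets kernel code be preempted much as user-space code is. When a high-priority process becomes runnable, a preemptible kernel schedules it regardless of whether the current process is in user mode or kernel mode, as long as preemption is currently allowed.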

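2 User preemption

User preemption occurs when the kernel is about to return to user space and the need_resched flag is set, which causes schedule() to be called. When the kernel is returning to user space it knows it is in a safe state, so it checks need_resched whether it is returning from an interrupt handler or from a system call. If the flag is set, the kernel picks another (more suitable) process to run.

In short, user preemption occurs in the following cases:

When returning to user space from a system call.

When returning to user space from an interrupt handler.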

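3 Characteristics of a non-preemptible kernel

In a kernel without kernel preemption, kernel code runs until it completes. In other words, the scheduler cannot reschedule while a kernel-level task is executing: tasks in the kernel are scheduled cooperatively, not preemptively. Kernel code runs until it finishes (returns to user space) or explicitly blocks.

On a single CPU this arrangement greatly simplifies the kernel's synchronization and protection mechanisms. The analysis can be done in two steps:

First, ignore the case where a process voluntarily gives up the CPU inside the kernel (that is, assume no process switch happens in the kernel). Once a process enters the kernel it keeps running until it finishes or leaves the kernel, and until then no other process can enter the kernel. Execution in the kernel is therefore serial: several processes can never be running in the kernel at the same time, so kernel code does not have to deal with the concurrency problems that would cause. Kernel developers need not worry about mutual exclusion between concurrently running processes on critical resources, and a process needs no lock when it reads or modifies kernel data structures. Only interrupts remain to be considered: if an interrupt handler may access the data structure the process is working on, the process just disables interrupts before entering the critical section and re-enables them on leaving it.

Next, consider the case where a process gives up the CPU voluntarily. Because giving up the CPU is voluntary and deliberate, every process switch inside the kernel is known in advance; none happens unexpectedly. Concurrency between processes therefore has to be considered only at the places where a switch can occur, not throughout the whole kernel.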
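The interrupt-masking pattern from the first step can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not kernel code: every name in it (`irq_enabled`, `irq_pending`, `model_raise_irq()`, `model_irq_disable()`, `model_irq_enable()`, `model_critical_update()`) is hypothetical, and the "interrupt" is an ordinary function call that is latched while interrupts are masked.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy user-space model of a uniprocessor, non-preemptible kernel.
 * All names here are hypothetical; this is not kernel code. */
static bool irq_enabled = true;   /* models the CPU interrupt-enable flag */
static bool irq_pending = false;  /* an interrupt arrived while masked */
static int shared_counter = 0;    /* data touched by process and handler */
static int handler_runs = 0;      /* how many times the handler has run */

static void model_irq_handler(void)
{
    shared_counter += 100;
    handler_runs++;
}

/* A device raises an interrupt: run it now, or latch it if masked. */
static void model_raise_irq(void)
{
    if (irq_enabled)
        model_irq_handler();
    else
        irq_pending = true;
}

static void model_irq_disable(void)
{
    irq_enabled = false;
}

static void model_irq_enable(void)
{
    assert(!irq_enabled);           /* must pair with a disable */
    irq_enabled = true;
    if (irq_pending) {
        irq_pending = false;
        model_irq_handler();        /* deliver the latched interrupt */
    }
}

/* Process-context critical section: a read-modify-write that the
 * handler must not interleave with. Returns the value it wrote. */
static int model_critical_update(void)
{
    int snapshot;

    model_irq_disable();
    snapshot = shared_counter;      /* read */
    model_raise_irq();              /* interrupt arrives mid-update: latched */
    shared_counter = snapshot + 1;  /* write back; handler has not run yet */
    model_irq_enable();             /* latched interrupt is delivered here */
    return snapshot + 1;
}
```

In this model the handler's update is applied only after the process has written its own value back, so neither update is lost. Without the disable/enable pair, the handler would run between the read and the write and its increment would be overwritten.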

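4 Why is kernel preemption needed?

Making the kernel preemptible matters a great deal for Linux. First, it is a prerequisite for using Linux in real-time systems. A real-time system places strict limits on response time: when a real-time process is woken by a hardware interrupt from a real-time device, it must be scheduled within a bounded time. A non-preemptible Linux kernel cannot meet this requirement, because there is no bound on how long the system stays in the kernel. In practice, when the kernel is executing a long system call, the real-time process has to wait until the process running in the kernel leaves it, and the resulting latency can reach the order of 100 ms even on present-day hardware.

That is unacceptable for systems that need fast real-time response. A preemptible kernel is not only essential for real-time uses of Linux; it also addresses Linux's weak support for low-latency applications such as multimedia (video, audio).

Because a preemptible kernel is so important, preemption was merged into the kernel with the release of Linux 2.5.4 and, like SMP, became a standard optional configuration item.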

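5 When kernel preemption is not allowed

There are a few situations in which the Linux kernel must not be preempted; outside of them it can be preempted at any point. They are:

- The kernel is handling an interrupt. In Linux a process cannot preempt an interrupt (an interrupt can be interrupted or preempted only by another interrupt), and process scheduling is not allowed inside an interrupt handler. The scheduling function schedule() checks for this and prints an error message if it is called from interrupt context.

- The kernel is running bottom-half processing in interrupt context. Softirqs are run before a hardware interrupt returns, and at that point execution is still in interrupt context.

- The kernel code holds a lock such as a spinlock or a reader/writer lock and is inside the region it protects. These locks exist to keep processes running concurrently on different CPUs of an SMP system correct for short periods. The kernel must not be preempted while holding one; otherwise other CPUs could spin for a long time waiting for a lock they cannot obtain.

- The kernel is executing the scheduler. Preemption exists in order to run the scheduler, so there is no point in preempting the scheduler just to run it again.

- The kernel is operating on per-CPU data structures. On SMP, per-CPU data structures are not protected by spinlocks because they are protected implicitly: each CPU has its own copy, and a process running on one CPU does not touch another CPU's per-CPU data. If preemption were allowed, however, a preempted process could be rescheduled onto a different CPU, and the per-CPU variable it was using would no longer be the right one. Preemption must therefore be disabled here.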
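The per-CPU case in the last item can be made concrete with a toy model. The sketch below is an illustration only: all names are hypothetical, "migration" is simulated by a plain function call, and `model_get_cpu()` / `model_put_cpu()` are named after the kernel's `get_cpu()` / `put_cpu()` helpers, which disable and re-enable preemption around per-CPU access.

```c
#include <assert.h>

#define MODEL_NR_CPUS 2

/* Hypothetical model state: one counter slot per CPU. */
static int model_percpu_hits[MODEL_NR_CPUS];
static int model_current_cpu = 0;    /* CPU the task is running on */
static int model_preempt_count = 0;  /* > 0: task may not be migrated */

/* Stand-in for "preempted and rescheduled on another CPU".
 * Migration is refused while preemption is disabled. */
static void model_try_migrate(int cpu)
{
    assert(cpu >= 0 && cpu < MODEL_NR_CPUS);
    if (model_preempt_count == 0)
        model_current_cpu = cpu;
}

/* Pin the task and return its CPU, in the spirit of get_cpu(). */
static int model_get_cpu(void)
{
    model_preempt_count++;
    return model_current_cpu;
}

static void model_put_cpu(void)
{
    assert(model_preempt_count > 0);  /* unbalanced put is a bug */
    model_preempt_count--;
}

/* Increment this CPU's slot; a migration attempt arrives in the middle. */
static int model_count_hit(void)
{
    int cpu = model_get_cpu();

    model_try_migrate(1);             /* ignored: preemption is disabled */
    model_percpu_hits[cpu]++;         /* still the slot of the CPU we run on */
    model_put_cpu();
    return cpu == model_current_cpu;  /* 1 if slot and CPU still agree */
}
```

If the increment were done without the get/put pair, the simulated migration would succeed between reading the CPU number and updating the slot, and the task would modify a slot belonging to a CPU it is no longer running on.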

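To make sure the kernel is not preempted in these situations, the preemptible kernel uses a variable, preempt_count, known as the kernel preemption lock. In the 2.6 kernels discussed here it lives in each process's thread_info structure. Whenever the kernel enters one of the states above, preempt_count is incremented, indicating that preemption is not allowed. Whenever the kernel leaves one of these states, preempt_count is decremented and the kernel checks whether it can be preempted and whether to reschedule.

When returning from an interrupt to kernel space, the kernel checks need_resched and preempt_count. If need_resched is set and preempt_count is 0, a more important task may be runnable and preemption is safe, so the scheduler is invoked. If preempt_count is nonzero, the kernel is in a non-preemptible state and cannot reschedule; the interrupt returns as usual to the currently executing process. Once the current process has released all the locks it holds, preempt_count drops back to 0. At that point the unlock code checks whether need_resched is set and, if so, invokes the scheduler.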
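The decision made on return from an interrupt can be written as a small pure function. This is a sketch of the rule as the text states it, not the kernel's entry code (which is architecture-specific assembly); `struct model_task`, `enum irq_return_action` and `model_irq_return()` are hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-task state for this sketch; the field names echo the
 * article, but the struct is not the kernel's thread_info. */
struct model_task {
    bool need_resched;   /* a more suitable task is runnable */
    int  preempt_count;  /* > 0 while preemption is disabled */
};

enum irq_return_action {
    RESUME_CURRENT,  /* return to the interrupted code */
    CALL_SCHEDULER   /* reschedule before returning */
};

/* Decide what to do at the end of interrupt handling.
 * returning_to_user: true if the interrupted code was user-mode code. */
static enum irq_return_action
model_irq_return(const struct model_task *t, bool returning_to_user)
{
    assert(t->preempt_count >= 0);

    if (!t->need_resched)
        return RESUME_CURRENT;      /* nothing better to run */
    if (returning_to_user)
        return CALL_SCHEDULER;      /* user preemption */
    if (t->preempt_count == 0)
        return CALL_SCHEDULER;      /* kernel preemption */
    return RESUME_CURRENT;          /* kernel not preemptible right now */
}
```

A task interrupted in kernel mode with need_resched set and a nonzero preempt_count simply resumes; the pending reschedule is picked up later, when the count returns to zero.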

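6 When kernel preemption happens

The 2.6 kernel introduced preemption: the kernel can now preempt a running task at any time, as long as rescheduling is safe.

So when is rescheduling safe? The kernel can preempt whenever preempt_count is 0. Locks and interrupts normally mark the non-preemptible regions. Because the kernel is SMP-safe, code that holds no lock is reentrant and can therefore be preempted.

Kernel preemption also happens explicitly when a process in the kernel blocks or calls schedule() directly. This form of kernel preemption has always been supported (it is really a voluntary yield of the CPU), because no extra logic is needed to guarantee that the kernel can be preempted safely. Code that calls schedule() explicitly is expected to know that it is safe to be preempted.

Kernel preemption can occur:

When an interrupt handler finishes, before returning to kernel space.

When kernel code becomes preemptible again, for example after releasing a lock or re-enabling softirqs.

When a task in the kernel explicitly calls schedule().

When a task in the kernel blocks (which also results in a call to schedule()).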

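7 How the preemptible kernel is supported

The preemptible Linux kernel required two main changes. The first is to the interrupt entry and return code. On interrupt entry the preemption lock preempt_count is incremented to forbid kernel preemption; on interrupt return it is decremented, which makes preemption possible again.

When we say that a preemptible Linux kernel can be preempted at any point, the main reason is that an interrupt can arrive at any point, and every time one does, the preemptible kernel checks on return from the interrupt whether it may be preempted. If its current state allows preemption and a reschedule is pending, the kernel reschedules and picks the highest-priority process to run. This differs from a non-preemptible kernel, where rescheduling on return from a hardware interrupt happens only if the interrupted process was in user mode; if it was in kernel mode, no scheduling takes place and the interrupted process simply resumes.

The other basic change is the redefinition of spinlocks and read/write locks so that lock operations also manipulate preempt_count. Acquiring one of these locks increments preempt_count to forbid kernel preemption; releasing it decrements preempt_count and, if the preemption conditions are met and a reschedule is needed, triggers a preemptive reschedule. spin_lock() and spin_unlock() illustrate this: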

/////////////////////////////////////////////////////////////////////////

/linux+v2.6.19/kernel/spinlock.c

void __lockfunc _spin_lock(spinlock_t *lock)
{
    preempt_disable();    /* raise preempt_count before taking the lock */
    spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
    _raw_spin_lock(lock);
}

EXPORT_SYMBOL(_spin_lock);

void __lockfunc _spin_unlock(spinlock_t *lock)
{
    spin_release(&lock->dep_map, 1, _RET_IP_);
    _raw_spin_unlock(lock);
    preempt_enable();     /* drop preempt_count; may reschedule */
}
EXPORT_SYMBOL(_spin_unlock);

/////////////////////////////////////////////////////////////////////////

/linux+v2.6.19/include/linux/preempt.h

#define preempt_disable() \
do { \
    inc_preempt_count(); \
    barrier(); \
} while (0)

#define preempt_enable_no_resched() \
do { \
    barrier(); \
    dec_preempt_count(); \
} while (0)

#define preempt_check_resched() \
do { \
    if (unlikely(test_thread_flag(TIF_NEED_RESCHED))) \
        preempt_schedule(); \
} while (0)

#define preempt_enable() \
do { \
    preempt_enable_no_resched(); \
    barrier(); \
    preempt_check_resched(); \
} while (0)
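How the macros above behave under nested locking can be tried out in a single-threaded user-space model. The sketch is an assumption-laden simplification: the `model_` names are hypothetical, there is no real lock or second CPU, and the model folds the "count is zero" test into the enable path for brevity.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded model of the preempt counter; the model_ prefix
 * marks every name as hypothetical rather than a kernel symbol. */
static int  model_preempt_count = 0;
static bool model_need_resched = false;
static int  model_schedule_calls = 0;

static void model_preempt_schedule(void)
{
    model_need_resched = false;
    model_schedule_calls++;
}

static void model_preempt_disable(void)
{
    model_preempt_count++;
}

static void model_preempt_enable(void)
{
    assert(model_preempt_count > 0);  /* unbalanced enable is a bug */
    model_preempt_count--;
    /* Reschedule only once the outermost section has been left. */
    if (model_preempt_count == 0 && model_need_resched)
        model_preempt_schedule();
}

/* Lock and unlock reduce to counter operations in this model. */
static void model_spin_lock(void)   { model_preempt_disable(); }
static void model_spin_unlock(void) { model_preempt_enable(); }
```

Taking two locks, setting the reschedule flag, and then releasing them shows the deferral: the first unlock leaves the count at 1 and does nothing, and the second brings it to 0 and runs the scheduler once.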

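Another way to implement a preemptible kernel is to insert preemption points into kernel code. With this approach, the code paths in the kernel that cause long latencies are identified first, and preemption points are then inserted at suitable places in them so that the system can reschedule without waiting for the whole path to finish. For events that need a fast response, the servicing process can then be scheduled onto the CPU sooner. A preemption point is simply a call to the scheduling function:

if (current->need_resched) schedule();

Such a code path is usually a loop body. The preemption-point approach repeatedly tests need_resched inside the loop and, when necessary, calls schedule() to make the current process give up the CPU.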
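A preemption point inside a long loop can be sketched in user space as follows. This is a toy model with hypothetical names: `model_schedule()` only counts how often the loop would have yielded, and the wake-up that sets the flag is simulated by a loop index rather than a real interrupt.

```c
#include <assert.h>
#include <stdbool.h>

static bool model_need_resched = false;  /* hypothetical reschedule flag */
static int  model_yields = 0;            /* times the CPU was given up */

/* Stand-in for schedule(): clear the flag and record the yield. */
static void model_schedule(void)
{
    model_need_resched = false;
    model_yields++;
}

/* Sum the integers 0..n-1. A wake-up "arrives" while item wake_at is
 * being processed (pass -1 for none). Returns the sum. */
static long model_long_loop(int n, int wake_at)
{
    long sum = 0;

    assert(n >= 0);
    for (int i = 0; i < n; i++) {
        sum += i;
        if (i == wake_at)
            model_need_resched = true;   /* stands in for an interrupt */
        if (model_need_resched)          /* the preemption point */
            model_schedule();
    }
    return sum;
}
```

The loop still computes the same result; the only difference is that the waiting task gets the CPU within one iteration of the flag being set, not after the whole loop has finished.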

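8 When rescheduling is needed

The kernel must know when to call schedule(). If schedule() were called only when code did so explicitly, user programs could run forever. Instead, the kernel provides a need_resched flag that signals whether a reschedule should be performed. scheduler_tick() sets the flag when a process exhausts its timeslice, and try_to_wake_up() sets it when a higher-priority process becomes runnable.

set_tsk_need_resched(): sets the need_resched flag in the given process.

clear_tsk_need_resched(): clears the need_resched flag in the given process.

need_resched(): tests the need_resched flag; returns true if it is set, false otherwise.

Wake-ups from semaphores, wait queues, completions and similar mechanisms are all built on wait queues. The default wake function for a wait queue is default_wake_function(), which calls try_to_wake_up() to make the process runnable and set the pending-reschedule flag.

The kernel also checks need_resched when returning to user space and when returning from an interrupt. If the flag is set, the kernel calls the scheduler before continuing.

Each process has its own need_resched flag because reading a value in the process descriptor is faster than reading a global variable (the current macro is fast, and the descriptor is usually in cache). In kernels before 2.2 the flag was a global variable. From 2.2 through 2.4 it was a field in task_struct. In 2.6 it moved into the thread_info structure, where it is a single bit in a flags word. Kernel developers keep refining things.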

/linux+v2.6.19/include/linux/sched.h

static inline void set_tsk_need_resched(struct task_struct *tsk)
{
    set_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

static inline void clear_tsk_need_resched(struct task_struct *tsk)
{
    clear_tsk_thread_flag(tsk,TIF_NEED_RESCHED);
}

static inline int signal_pending(struct task_struct *p)
{
    return unlikely(test_tsk_thread_flag(p,TIF_SIGPENDING));
}

static inline int need_resched(void)
{
    return unlikely(test_thread_flag(TIF_NEED_RESCHED));
}
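The helpers above each touch one bit of a per-thread flags word. A minimal user-space sketch of that representation follows. The names and bit positions are hypothetical (real TIF_* values are architecture-specific), and the operations here are plain, non-atomic bit manipulation, unlike the kernel's.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical bit positions; real TIF_* values are architecture-specific. */
#define MODEL_TIF_SIGPENDING   2
#define MODEL_TIF_NEED_RESCHED 3

struct model_thread_info {
    unsigned long flags;  /* one bit per low-level thread flag */
};

static void model_set_flag(struct model_thread_info *ti, int bit)
{
    assert(bit >= 0 && bit < 32);
    ti->flags |= 1UL << bit;
}

static void model_clear_flag(struct model_thread_info *ti, int bit)
{
    assert(bit >= 0 && bit < 32);
    ti->flags &= ~(1UL << bit);
}

static bool model_test_flag(const struct model_thread_info *ti, int bit)
{
    assert(bit >= 0 && bit < 32);
    return (ti->flags >> bit) & 1UL;
}
```

Packing the flag into a word that the return-to-user and interrupt-return paths already examine means several conditions (pending signal, pending reschedule) can be tested with a single load.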

///////////////////////////////////////////////////////////////////////////////

/linux+v2.6.19/kernel/sched.c

/*
 * resched_task - mark a task 'to be rescheduled now'.
 *
 * On UP this means the setting of the need_resched flag, on SMP it
 * might also involve a cross-CPU call to trigger the scheduler on
 * the target CPU.
 */
#ifdef CONFIG_SMP

#ifndef tsk_is_polling
#define tsk_is_polling(t) test_tsk_thread_flag(t, TIF_POLLING_NRFLAG)
#endif

static void resched_task(struct task_struct *p)
{
    int cpu;

    assert_spin_locked(&task_rq(p)->lock);

    if (unlikely(test_tsk_thread_flag(p, TIF_NEED_RESCHED)))
        return;

    set_tsk_thread_flag(p, TIF_NEED_RESCHED);

    cpu = task_cpu(p);
    if (cpu == smp_processor_id())
        return;

    /* NEED_RESCHED must be visible before we test polling */
    smp_mb();
    if (!tsk_is_polling(p))
        smp_send_reschedule(cpu);
}
#else
static inline void resched_task(struct task_struct *p)
{
    assert_spin_locked(&task_rq(p)->lock);
    set_tsk_need_resched(p);
}
#endif

///////////////////////////////////////////////////////////////////////////////

/***
 * try_to_wake_up - wake up a thread
 * @p: the to-be-woken-up thread
 * @state: the mask of task states that can be woken
 * @sync: do a synchronous wakeup?
 *
 * Put it on the run-queue if it's not already there. The "current"
 * thread is always on the run-queue (except when the actual
 * re-schedule is in progress), and as such you're allowed to do
 * the simpler "current->state = TASK_RUNNING" to mark yourself
 * runnable without the overhead of this.
 *
 * returns failure only if the task is already active.
 */
static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)

///////////////////////////////////////////////////////////////////////////////

int fastcall wake_up_process(struct task_struct *p)
{
    return try_to_wake_up(p, TASK_STOPPED | TASK_TRACED |
                          TASK_INTERRUPTIBLE | TASK_UNINTERRUPTIBLE, 0);
}
EXPORT_SYMBOL(wake_up_process);

int fastcall wake_up_state(struct task_struct *p, unsigned int state)
{
    return try_to_wake_up(p, state, 0);
}

/*
 * wake_up_new_task - wake up a newly created task for the first time.
 *
 * This function will do some initial scheduler statistics housekeeping
 * that must be done for every newly created context, then puts the task
 * on the runqueue and wakes it.
 */
void fastcall wake_up_new_task(struct task_struct *p, unsigned long clone_flags)

/*
 * The core wakeup function. Non-exclusive wakeups (nr_exclusive == 0) just
 * wake everything up. If it's an exclusive wakeup (nr_exclusive == small +ve
 * number) then we wake all the non-exclusive tasks and one exclusive task.
 *
 * There are circumstances in which we can try to wake a task which has already
 * started to run but is not in state TASK_RUNNING. try_to_wake_up() returns
 * zero in this (rare) case, and we handle it by continuing to scan the queue.
 */
static void __wake_up_common(wait_queue_head_t *q, unsigned int mode,
                             int nr_exclusive, int sync, void *key)

///////////////////////////////////////////////////////////////////////////////

/**
 * __wake_up - wake up threads blocked on a waitqueue.
 * @q: the waitqueue
 * @mode: which threads
 * @nr_exclusive: how many wake-one or wake-many threads to wake up
 * @key: is directly passed to the wakeup function
 */
void fastcall __wake_up(wait_queue_head_t *q, unsigned int mode,
                        int nr_exclusive, void *key)
{
    unsigned long flags;

    spin_lock_irqsave(&q->lock, flags);
    __wake_up_common(q, mode, nr_exclusive, 0, key);
    spin_unlock_irqrestore(&q->lock, flags);
}
EXPORT_SYMBOL(__wake_up);

int default_wake_function(wait_queue_t *curr, unsigned mode, int sync,
                          void *key)
{
    return try_to_wake_up(curr->private, mode, sync);
}
EXPORT_SYMBOL(default_wake_function);

void fastcall complete(struct completion *x)
{
    unsigned long flags;

    spin_lock_irqsave(&x->wait.lock, flags);
    x->done++;
    __wake_up_common(&x->wait, TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE,
                     1, 0, NULL);
    spin_unlock_irqrestore(&x->wait.lock, flags);
}
EXPORT_SYMBOL(complete);
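The exclusive/non-exclusive rule described in the __wake_up_common() comment can be modelled with an array standing in for the wait queue. This is a sketch under stated assumptions, not the kernel's list-based implementation: `struct model_waiter` and `model_wake_up()` are hypothetical, and "waking" a waiter just sets a boolean.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct model_waiter {
    bool exclusive;  /* queued with "wake-one" semantics */
    bool awake;      /* set once the waiter has been woken */
};

/* Wake waiters in queue order. Every non-exclusive waiter reached is woken;
 * the scan stops once nr_exclusive exclusive waiters have been woken.
 * nr_exclusive == 0 wakes everything. Returns the number newly woken. */
static int model_wake_up(struct model_waiter *q, size_t len, int nr_exclusive)
{
    int woken = 0;

    assert(nr_exclusive >= 0);
    for (size_t i = 0; i < len; i++) {
        if (q[i].awake)
            continue;               /* already running: keep scanning */
        q[i].awake = true;
        woken++;
        if (q[i].exclusive && nr_exclusive > 0 && --nr_exclusive == 0)
            break;                  /* enough exclusive waiters woken */
    }
    return woken;
}
```

With two non-exclusive waiters queued ahead of two exclusive ones, a call with nr_exclusive set to 1 wakes the first three entries and leaves the last exclusive waiter asleep, which matches the single-waiter wake-up that complete() requests by passing 1.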

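9 References

Explain the differences and connections between preemptible and non-preemptible kernels, http://oldlinux.org/oldlinux/viewthread.php?tid=3024

Locking issues in a preemptible kernel, http://hi.baidu.com/juventus/blog/item/a71c8701960454d2277fb5f0.html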

http://www.linuxforum.net/forum/showflat.php?Cat=&Board=linuxK&Number=610932&page=

http://linux.chinaunix.net/bbs/viewthread.php?tid=912039

Robert Love, Linux Kernel Development

Kernel preemption in Linux was implemented by Robert Love. His book describes it as follows:

-----------

User Preemption

User preemption occurs when the kernel is about to return to user-space, need_resched is set, and therefore, the scheduler is invoked. If the kernel is returning to user-space, it knows it is in a safe quiescent state. In other words, if it is safe to continue executing the current task, it is also safe to pick a new task to execute. Consequently, whenever the kernel is preparing to return to user-space either on return from an interrupt or after a system call, the value of need_resched is checked. If it is set, the scheduler is invoked to select a new (more fit) process to execute. Both the return paths for return from interrupt and return from system call are architecture dependent and typically implemented in assembly in entry.S (which, aside from kernel entry code, also contains kernel exit code).

In short, user preemption can occur

When returning to user-space from a system call

When returning to user-space from an interrupt handler

Kernel Preemption

The Linux kernel, unlike most other Unix variants and many other operating systems, is a fully preemptive kernel. In non-preemptive kernels, kernel code runs until completion. That is, the scheduler is not capable of rescheduling a task while it is in the kernel. Kernel code is scheduled cooperatively, not preemptively. Kernel code runs until it finishes (returns to user-space) or explicitly blocks. In the 2.6 kernel, however, the Linux kernel became preemptive: It is now possible to preempt a task at any point, so long as the kernel is in a state in which it is safe to reschedule.

So when is it safe to reschedule? The kernel is capable of preempting a task running in the kernel so long as it does not hold a lock. That is, locks are used as markers of regions of non-preemptibility. Because the kernel is SMP-safe, if a lock is not held, the current code is reentrant and capable of being preempted.

The first change in supporting kernel preemption was the addition of a preemption counter, preempt_count, to each process's thread_info. This counter begins at zero and increments once for each lock that is acquired and decrements once for each lock that is released. When the counter is zero, the kernel is preemptible. Upon return from interrupt, if returning to kernel-space, the kernel checks the values of need_resched and preempt_count. If need_resched is set and preempt_count is zero, then a more important task is runnable and it is safe to preempt. Thus, the scheduler is invoked. If preempt_count is nonzero, a lock is held and it is unsafe to reschedule. In that case, the interrupt returns as usual to the currently executing task. When all the locks that the current task is holding are released, preempt_count returns to zero. At that time, the unlock code checks whether need_resched is set. If so, the scheduler is invoked. Enabling and disabling kernel preemption is sometimes required in kernel code and is discussed in Chapter 9.

Kernel preemption can also occur explicitly, when a task in the kernel blocks or explicitly calls schedule(). This form of kernel preemption has always been supported because no additional logic is required to ensure that the kernel is in a state that is safe to preempt. It is assumed that the code that explicitly calls schedule() knows it is safe to reschedule.

Kernel preemption can occur

When an interrupt handler exits, before returning to kernel-space

When kernel code becomes preemptible again

If a task in the kernel explicitly calls schedule()

If a task in the kernel blocks (which results in a call to schedule())
