How the Linux kernel function schedule() implements process scheduling

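The schedule() function implements process scheduling. Its job is to find a process in the run queue rq and then assign the CPU to it. schedule() can be invoked either directly or in a lazy (deferred) way.
1 Direct invocation
If the current process has to block right away because a resource it needs is not available, it invokes the scheduler directly:
a. Insert current into the proper wait queue.
b. Change the state of current to TASK_INTERRUPTIBLE or TASK_UNINTERRUPTIBLE.
c. Call schedule().
d. Check whether the resource is available; if it is not, go back to step b.
e. Once the resource is available, remove current from the wait queue.
The kernel checks repeatedly whether the resource needed by the process is available. As long as it is not, it calls schedule() to hand the CPU to other processes. These steps are very similar to those performed by wait_event().
Many device drivers that execute long iterative tasks also invoke the scheduler directly. On each iteration the driver checks the TIF_NEED_RESCHED flag and, if necessary, calls schedule() to give up the CPU voluntarily.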
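
Steps a to e above can be sketched as a small user-space simulation. This is only an illustration under assumed names: sim_task, sim_resource, sim_schedule() and sim_wait_for() are hypothetical stand-ins, not kernel APIs, and sim_schedule() merely counts how often the CPU would have been given up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel task states. */
enum sim_state { SIM_RUNNING, SIM_INTERRUPTIBLE };

struct sim_task {
    enum sim_state state;
    bool on_wait_queue;
    int yields;                 /* how many times the CPU was given up */
};

/* A resource that becomes available after a number of yields. */
struct sim_resource {
    int yields_until_ready;
};

/* Stand-in for schedule(): other processes run, then we are woken up. */
void sim_schedule(struct sim_task *t, struct sim_resource *r)
{
    t->yields++;
    if (r->yields_until_ready > 0)
        r->yields_until_ready--;
    t->state = SIM_RUNNING;     /* a waker would put us back to running */
}

/* Follows steps a to e of the text; returns the number of yields. */
int sim_wait_for(struct sim_task *t, struct sim_resource *r)
{
    assert(t != NULL && r != NULL);

    t->on_wait_queue = true;                /* step a */
    do {
        t->state = SIM_INTERRUPTIBLE;       /* step b */
        sim_schedule(t, r);                 /* step c */
    } while (r->yields_until_ready > 0);    /* step d */
    t->on_wait_queue = false;               /* step e */

    return t->yields;
}
```

With a resource that needs three yields before it is ready, sim_wait_for() goes around the loop three times and leaves the task running and off the wait queue.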

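2 Lazy invocation
Lazy invocation works by setting the TIF_NEED_RESCHED flag (in thread_info) to 1, so that the scheduler function schedule() is called at some later point. Because the value of this flag is always checked before the execution of a User Mode process is resumed, schedule() is certain to be called in the near future.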

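Typical examples of lazy invocation, which are also the three most important scheduling cases:
a When current has used up its CPU time slice; the deferred invocation is requested by scheduler_tick().
b When a process that is woken up has a higher priority than the current process; the deferred invocation is requested by try_to_wake_up().
c When a sched_setscheduler() system call is issued.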
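
The lazy path can be illustrated with a small user-space sketch. Everything here is an assumption made for the example: sim_thread, SIM_TIF_NEED_RESCHED, sim_scheduler_tick() and sim_return_to_user() are hypothetical names, and the "scheduler" is reduced to a counter. It only shows the idea that one piece of code sets the flag and another, later, acts on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SIM_TIF_NEED_RESCHED (1u << 0)   /* hypothetical bit position */

struct sim_thread {
    unsigned int flags;      /* stand-in for thread_info flags */
    int time_slice;          /* remaining ticks */
    int schedule_calls;      /* how often the scheduler was invoked */
};

/* Timer tick: when the time slice runs out, only request a reschedule. */
void sim_scheduler_tick(struct sim_thread *t)
{
    assert(t != NULL);
    if (t->time_slice > 0 && --t->time_slice == 0)
        t->flags |= SIM_TIF_NEED_RESCHED;
}

/*
 * Called just before resuming User Mode: if the flag is set, clear it
 * and run the (simulated) scheduler. Returns true if it was invoked.
 */
bool sim_return_to_user(struct sim_thread *t)
{
    assert(t != NULL);
    if (!(t->flags & SIM_TIF_NEED_RESCHED))
        return false;
    t->flags &= ~SIM_TIF_NEED_RESCHED;
    t->schedule_calls++;
    return true;
}
```

The tick handler never switches processes itself; the switch is deferred until the flag is checked on the way back to User Mode.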

The main job of schedule() is to select the highest-priority process among the runnable ones and run it in place of the current process.
/*
 * schedule() is the main scheduler function.
 */
asmlinkage void __sched schedule(void)
{
       long *switch_count;
       task_t *prev, *next;
       runqueue_t *rq;
       prio_array_t *array;
       struct list_head *queue;
       unsigned long long now;
       unsigned long run_time;
       int cpu, idx, new_prio;
 
       /*
        * Test if we are atomic.  Since do_exit() needs to call into
        * schedule() atomically, we ignore that path for now.
        * Otherwise, whine if we are scheduling when we should not be.
        */
       if (likely(!current->exit_state)) {
              if (unlikely(in_atomic())) {
                     printk(KERN_ERR "scheduling while atomic: "
                            "%s/0x%08x/%d\n",
                            current->comm, preempt_count(), current->pid);
                     dump_stack();
              }
       }
       profile_hit(SCHED_PROFILING, __builtin_return_address(0));
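// First disable kernel preemption and initialize a few local variables: the pointer returned by
// current is saved in prev, and the address of the run queue data structure of the local CPU is saved in rq.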
need_resched:
       preempt_disable();
       prev = current;
       release_kernel_lock(prev);
need_resched_nonpreemptible:
       rq = this_rq();
 
 
       /*
        * The idle thread is not allowed to schedule!
        * Remove this check after it has been exercised a bit.
        */
       if (unlikely(prev == rq->idle) && prev->state != TASK_RUNNING) {
              printk(KERN_ERR "bad: scheduling from the idle thread!\n");
              dump_stack();
       }
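// Call sched_clock() to read the TSC and convert its value to nanoseconds; the resulting timestamp
// is stored in the local variable now. schedule() then computes how long prev has been running.
// The value is capped at 1 second (expressed in nanoseconds). run_time is what gets charged to the
// process for its CPU usage. Processes with a long average sleep time are favored, though:
// run_time /= (CURRENT_BONUS(prev) ? : 1); uses a GCC extension to the conditional operator,
// where "x ? : y" means "x ? x : y" with x evaluated only once.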
       schedstat_inc(rq, sched_cnt);
       now = sched_clock();
       if (likely((long long)(now - prev->timestamp) < NS_MAX_SLEEP_AVG)) {
              run_time = now - prev->timestamp;
              if (unlikely((long long)(now - prev->timestamp) < 0))
                     run_time = 0;
       } else
              run_time = NS_MAX_SLEEP_AVG;
 
       /*
        * Tasks charged proportionately less run_time at high sleep_avg to
        * delay them losing their interactive status
        */
       run_time /= (CURRENT_BONUS(prev) ? : 1);
// Disable local interrupts and acquire the spin lock that protects the run queue
       spin_lock_irq(&rq->lock);
// prev may be a process that is being terminated; schedule() checks the PF_DEAD flag to find out
       if (unlikely(prev->flags & PF_DEAD))
              prev->state = EXIT_DEAD;
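// Check the state of prev. If it is not runnable and it has not been preempted in Kernel Mode, prev
// should be removed from the run queue. However, if it has nonblocked pending signals and its state is
// TASK_INTERRUPTIBLE, the function sets its state back to TASK_RUNNING and leaves it on the run queue.
// This is not the same as assigning the processor to prev; it only gives prev a chance to be selected.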
       switch_count = &prev->nivcsw;
       if (prev->state && !(preempt_count() & PREEMPT_ACTIVE)) {
              switch_count = &prev->nvcsw;
              if (unlikely((prev->state & TASK_INTERRUPTIBLE) &&
                            unlikely(signal_pending(prev))))
                     prev->state = TASK_RUNNING;// gets another chance to run, though not right away
              else {// otherwise prev really has to leave the run queue
                     if (prev->state == TASK_UNINTERRUPTIBLE)
                            rq->nr_uninterruptible++;
                     deactivate_task(prev, rq);
              }
       }
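// Check how many runnable processes are left in the run queue. If there are some, schedule() calls
// dependent_sleeper(); in most cases that function returns 0 immediately. If the kernel supports
// hyper-threading, however, it checks whether the process about to be selected has a lower priority
// than a sibling process already running on another logical CPU of the same physical CPU. In that
// special case schedule() refuses to select the lower-priority process and runs the swapper process
// instead. If the run queue holds no runnable process, the function calls idle_balance() to migrate
// some runnable processes from another run queue to the local one; idle_balance() is similar to load_balance().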
 
       cpu = smp_processor_id();
       if (unlikely(!rq->nr_running)) {
go_idle:
              idle_balance(cpu, rq);
              if (!rq->nr_running) {
                     next = rq->idle;
                     rq->expired_timestamp = 0;
                     wake_sleeping_dependent(cpu, rq);
                     /*
                      * wake_sleeping_dependent() might have released
                      * the runqueue, so break out if we got new
                      * tasks meanwhile:
                      */
                     if (!rq->nr_running)
                            goto switch_tasks;
              }
       } else {
              if (dependent_sleeper(cpu, rq)) {
                     next = rq->idle;
                     goto switch_tasks;
              }
              /*
               * dependent_sleeper() releases and reacquires the runqueue
               * lock, hence go into the idle loop if the rq went
               * empty meanwhile:
               */
              if (unlikely(!rq->nr_running))
                     goto go_idle;
       }
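// If idle_balance() fails to migrate any process to the local run queue, schedule() calls
// wake_sleeping_dependent() to reschedule runnable processes on idle CPUs (that is, every CPU running
// the swapper process); this can typically happen when the kernel supports hyper-threading. On a
// uniprocessor system, or when every attempt to move processes to the local run queue has failed,
// the function picks the swapper process as next and continues with the next step.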
 
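// Check whether at least one of the runnable processes is active. If none is, the function exchanges
// the active and expired array pointers of the run queue, so that all expired processes become active
// and the empty set is ready to receive the processes that will expire later.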
       array = rq->active;
       if (unlikely(!array->nr_active)) {
              /*
               * Switch the active and expired arrays.
               */
              schedstat_inc(rq, sched_switch);
              rq->active = rq->expired;
              rq->expired = array;
              array = rq->active;
              rq->expired_timestamp = 0;
              rq->best_expired_prio = MAX_PRIO;
       }
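// Now look for a runnable process in the active prio_array_t. First, schedule() searches the bitmask
// of the active set for the first nonzero bit. A bit in the bitmask is set whenever the corresponding
// priority list is not empty, so the index of the first nonzero bit identifies the list that contains
// the best process to run; the first process descriptor in that list is then taken. uC/OS does the same thing.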
       idx = sched_find_first_bit(array->bitmap);
       queue = array->queue + idx;
       next = list_entry(queue->next, task_t, run_list);
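// If next is a non-real-time process that is being woken up from the TASK_INTERRUPTIBLE or TASK_STOPPED
// state, the scheduler adds the nanoseconds elapsed since the process was inserted into the run queue
// to its average sleep time. In other words, the sleep time is increased to cover the time the process
// spent on the run queue waiting for the CPU. Real-time processes do not use the average sleep time
// to compute their priority; they are simply privileged that way.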
       if (!rt_task(next) && next->activated > 0) {
              unsigned long long delta = now - next->timestamp;
              if (unlikely((long long)(now - next->timestamp) < 0))
                     delta = 0;
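// Interactive processes and CPU-hungry processes have to be told apart, otherwise it would be unfair.
// With activated == 1 (woken up by a system call or a kernel thread rather than by an interrupt),
// only a fraction of the waiting time, ON_RUNQUEUE_WEIGHT percent, is credited.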
              if (next->activated == 1)
                     delta = delta * (ON_RUNQUEUE_WEIGHT * 128 / 100) / 128;
 
              array = next->array;
              new_prio = recalc_task_prio(next, next->timestamp + delta);
 
              if (unlikely(next->prio != new_prio)) {
                     dequeue_task(next, array);
                     next->prio = new_prio;
                     enqueue_task(next, array);
              } else
                     requeue_task(next, array);
       }
       next->activated = 0;
// Let the next process run
switch_tasks:
       if (next == rq->idle)
              schedstat_inc(rq, sched_goidle);
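// The prefetch macro loads the first fields of the process descriptor of next into the hardware cache.
// This improves the performance of schedule(), because the data is moved in parallel with the
// execution of the following instructions, which do not touch next.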
       prefetch(next);
       prefetch_stack(next);
// The TIF_NEED_RESCHED flag of prev must be cleared
       clear_tsk_need_resched(prev);
       rcu_qsctr_inc(task_cpu(prev));
 
       update_cpu_clock(prev, rq, now);
// Decrease the average sleep time of prev, charging to it the CPU time the process has just used
       prev->sleep_avg -= run_time;
       if ((long)prev->sleep_avg <= 0)
              prev->sleep_avg = 0;
       prev->timestamp = prev->last_ran = now;
// If prev != next, a process switch really takes place
       sched_info_switch(prev, next);
       if (likely(prev != next)) {
              next->timestamp = now;
              rq->nr_switches++;
              rq->curr = next;
              ++*switch_count;
 
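              prepare_task_switch(rq, next);
// context_switch() sets up the address space of next and performs the switch
              prev = context_switch(rq, prev, next);
// By the time we get here the switch has happened. The code now runs in the context of the process
// that was just given the CPU, and prev refers to the process that was running on this CPU right
// before it, which is not necessarily the prev this process saw when it called schedule().
// The barrier() macro yields a compiler optimization barrier
              barrier();
              /*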
               * this_rq must be evaluated again because prev may have moved
               * CPUs since it called schedule(), thus the 'rq' on its stack
               * frame will be invalid.
               */
              finish_task_switch(this_rq(), prev);
       } else   // prev was already the best candidate, so no switch is needed
              spin_unlock_irq(&rq->lock);
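// If necessary, schedule() reacquires the big kernel lock, re-enables kernel preemption, and checks
// whether some other process has set the TIF_NEED_RESCHED flag of the current process. If so, the
// whole schedule() function is executed again; otherwise the function ends.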
       prev = current;
       if (unlikely(reacquire_kernel_lock(prev) < 0))
              goto need_resched_nonpreemptible;
       preempt_enable_no_resched();
       if (unlikely(test_thread_flag(TIF_NEED_RESCHED)))
              goto need_resched;
}
// At this point schedule() has completed
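
The selection step in the listing (swap the active and expired arrays when the active one is empty, then take the first set bit of the bitmap) can be tried out in user space. The following is a minimal sketch under simplifying assumptions, not the kernel's code: sim_prio_array, sim_rq and the sim_* functions are hypothetical names, each priority list is reduced to a counter, and the bit search is a plain loop rather than the optimized sched_find_first_bit().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define SIM_MAX_PRIO 140    /* number of priority levels assumed here */

struct sim_prio_array {
    int nr_active;                                  /* tasks in this array */
    unsigned char bitmap[(SIM_MAX_PRIO + 7) / 8];   /* bit set = list not empty */
    int queue_len[SIM_MAX_PRIO];                    /* counters instead of lists */
};

struct sim_rq {
    struct sim_prio_array arrays[2];
    struct sim_prio_array *active, *expired;
};

void sim_rq_init(struct sim_rq *rq)
{
    memset(rq, 0, sizeof(*rq));
    rq->active = &rq->arrays[0];
    rq->expired = &rq->arrays[1];
}

/* Add one task at the given priority and mark its list as non-empty. */
void sim_enqueue(struct sim_prio_array *a, int prio)
{
    assert(prio >= 0 && prio < SIM_MAX_PRIO);
    a->queue_len[prio]++;
    a->bitmap[prio / 8] |= (unsigned char)(1u << (prio % 8));
    a->nr_active++;
}

/* Index of the first set bit, or SIM_MAX_PRIO if the bitmap is empty. */
int sim_find_first_bit(const struct sim_prio_array *a)
{
    for (int i = 0; i < SIM_MAX_PRIO; i++)
        if (a->bitmap[i / 8] & (1u << (i % 8)))
            return i;
    return SIM_MAX_PRIO;
}

/* Swap the arrays if the active one is empty, then pick the best priority. */
int sim_pick_next_prio(struct sim_rq *rq)
{
    if (rq->active->nr_active == 0) {
        struct sim_prio_array *tmp = rq->active;
        rq->active = rq->expired;
        rq->expired = tmp;
    }
    return sim_find_first_bit(rq->active);
}
```

A lower index means a higher priority, so with tasks at priorities 120 and 100 sitting in the expired array, the first pick swaps the arrays and returns 100. The cost of the pick does not depend on how many tasks are queued, only on the fixed number of priority levels.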

MyAnqi
