Linux進程優先級的處理--Linux進程的管理與調度(二十二)

1. linux優先級的表示

1.1 優先級的內核表示

linux優先級概述

在用戶空間通過nice命令設置進程的靜態優先級, 這在內部會調用nice系統調用, 進程的nice值在-20~+19之間. 值越低優先級越高. setpriority系統調用也可以用來設置進程的優先級. 它不僅能夠修改單個線程的優先級, 還能修改進程組中所有進程的優先級, 或者通過制定UID來修改特定用戶的所有進程的優先級

內核使用一些簡單的數值範圍0~139表示內部優先級, 數值越低, 優先級越高。

從0~99的範圍專供實時進程使用, nice的值[-20,19]則映射到範圍100~139

linux2.6內核將任務優先級進行了一個劃分, 實時優先級範圍是0到MAX_RT_PRIO-1（即99），而普通進程的靜態優先級範圍是從MAX_RT_PRIO到MAX_PRIO-1（即100到139）.

優先級範圍	描述
0——99	實時進程
100——139	非實時進程

內核的優先級表示

內核表示優先級的所有信息基本都放在include/linux/sched/prio.h中, 其中定義了一些表示優先級的宏和函數.

優先級數值通過宏來定義, 如下所示,

其中MAX_NICE和MIN_NICE定義了nice的最大最小值

而MAX_RT_PRIO指定了實時進程的最大優先級,而MAX_PRIO則是普通進程的最大優先級數值

/*  http://lxr.free-electrons.com/source/include/linux/sched/prio.h?v=4.6#L4 */
#define MAX_NICE        19
#define MIN_NICE        -20
#define NICE_WIDTH      (MAX_NICE - MIN_NICE + 1)

/* http://lxr.free-electrons.com/source/include/linux/sched/prio.h?v=4.6#L24  */
#define MAX_PRIO        (MAX_RT_PRIO + 40)
#define DEFAULT_PRIO        (MAX_RT_PRIO + 20)

宏	值	描述
MIN_NICE	-20	對應於優先級100, 可以使用NICE_TO_PRIO和PRIO_TO_NICE轉換
NICE_WIDTH	40	nice值得範圍寬度, 即[-20, 19]共40個數字的寬度
MAX_RT_PRIO, MAX_USER_RT_PRIO	10 0	實時進程的最大優先級
MAX_PRIO	14 0	普通進程的最大優先級
DEFAULT_PRIO	12 0	進程的默認優先級, 對應於nice=0
MAX_DL_PRIO	0	使用EDF最早截止時間優先調度算法的實時進程最大的優先級

而內核提供了一組宏將優先級在各種不同的表示形之間轉移

//  http://lxr.free-electrons.com/source/include/linux/sched/prio.h?v=4.6#L27
/*
 * Convert user-nice values [ -20 ... 0 ... 19 ]
 * to static priority [ MAX_RT_PRIO..MAX_PRIO-1 ],
 * and back.
 */
#define NICE_TO_PRIO(nice)      ((nice) + DEFAULT_PRIO)
#define PRIO_TO_NICE(prio)      ((prio) - DEFAULT_PRIO)

/*
 * 'User priority' is the nice value converted to something we
 * can work with better when scaling various scheduler parameters,
 * it's a [ 0 ... 39 ] range.
 */
#define USER_PRIO(p)            ((p)-MAX_RT_PRIO)
#define TASK_USER_PRIO(p)       USER_PRIO((p)->static_prio)
#define MAX_USER_PRIO           (USER_PRIO(MAX_PRIO))

還有一些nice值和rlimit值之間相互轉換的函數nice_to_rlimit和rlimit_to_nice, 這在nice系統調用進行檢查的時候很有用, 他們定義在include/linux/sched/prio.h, L47中, 如下所示

/*
 * Convert nice value [19,-20] to rlimit style value [1,40].
 */
static inline long nice_to_rlimit(long nice)
{
    return (MAX_NICE - nice + 1);
}

/*
 * Convert rlimit style value [1,40] to nice value [-20, 19].
 */
static inline long rlimit_to_nice(long prio)
{
    return (MAX_NICE - prio + 1);
}

DEF最早截至時間優先實時調度算法的優先級描述

此外新版本的內核還引入了EDF實時調度算法, 它的優先級比RT進程和NORMAL/BATCH進程的優先級都要高, 關於EDF的優先級的設置信息都早內核頭文件include/linux/sched/deadline.h

因此內核將MAX_DL_PRIO設置爲0, 可以參見內核文件include/linux/sched/deadline.h

#define MAX_DL_PRIO             0

此外也提供了一些EDF優先級處理所需的函數, 如下所示, 可以參見內核文件include/linux/sched/deadline.h

static inline int dl_prio(int prio)
{
    if (unlikely(prio < MAX_DL_PRIO))
            return 1;
    return 0;
}

static inline int dl_task(struct task_struct *p)
{
    return dl_prio(p->prio);
}

static inline bool dl_time_before(u64 a, u64 b)
{
    return (s64)(a - b) < 0;
}

1.2 進程的優先級表示

struct task_struct
{
    /* 進程優先級
     * prio: 動態優先級，範圍爲100~139，與靜態優先級和補償(bonus)有關
     * static_prio: 靜態優先級，static_prio = 100 + nice + 20 (nice值爲-20~19,所以static_prio值爲100~139)
     * normal_prio: 沒有受優先級繼承影響的常規優先級，具體見normal_prio函數，跟屬於什麼類型的進程有關
     */
    int prio, static_prio, normal_prio;
    /* 實時進程優先級 */
    unsigned int rt_priority;
}

動態優先級靜態優先級實時優先級

其中task_struct採用了三個成員表示進程的優先級:prio和normal_prio表示動態優先級, static_prio表示進程的靜態優先級.

爲什麼表示動態優先級需要兩個值prio和normal_prio 調度器會考慮的優先級則保存在prio. 由於在某些情況下內核需要暫時提高進程的優先級, 因此需要用prio表示. 由於這些改變不是持久的, 因此靜態優先級static_prio和普通優先級normal_prio不受影響.

此外還用了一個字段rt_priority保存了實時進程的優先級

字段	描述
static_prio	用於保存靜態優先級, 是進程啓動時分配的優先級, ，可以通過nice和sched_setscheduler系統調用來進行修改, 否則在進程運行期間會一直保持恆定
rt_priority	用於保存實時優先級
normal_prio	表示基於進程的靜態優先級static_prio和調度策略計算出的優先級. 因此即使普通進程和實時進程具有相同的靜態優先級, 其普通優先級也是不同的, 進程分叉(fork)時, 子進程會繼承父進程的普通優先級
prio	保存進程的動態優先級

實時進程的優先級用實時優先級rt_priority來表示

2 進程優先級的計算

前面說了task_struct中的幾個優先級的字段

靜態優先級	實時優先級	普通優先級	動態優先級
static_prio	rt_priority	normal_prio	prio

但是這些優先級是如何關聯的呢, 動態優先級prio又是如何計算的呢?

2.1 normal_prio函數設置普通優先級normal_prio

靜態優先級static_prio(普通進程)和實時優先級rt_priority(實時進程)是計算的起點

因此他們也是進程創建的時候設定好的, 我們通過nice修改的就是普通進程的靜態優先級static_prio

首先通過靜態優先級static_prio計算出普通優先級normal_prio, 該工作可以由nromal_prio來完成, 該函數定義在kernel/sched/core.c#L861

/*
 * __normal_prio - return the priority that is based on the static prio
 * 普通進程(非實時進程)的普通優先級normal_prio就是靜態優先級static_prio
 */
static inline int __normal_prio(struct task_struct *p)
{
    return p->static_prio;
}

/*
 * Calculate the expected normal priority: i.e. priority
 * without taking RT-inheritance into account. Might be
 * boosted by interactivity modifiers. Changes upon fork,
 * setprio syscalls, and whenever the interactivity
 * estimator recalculates.
 */
static inline int normal_prio(struct task_struct *p)
{
    int prio;

    if (task_has_dl_policy(p))              /*  EDF調度的實時進程  */
            prio = MAX_DL_PRIO-1;
    else if (task_has_rt_policy(p))       /*  普通實時進程的優先級  */
            prio = MAX_RT_PRIO-1 - p->rt_priority;
    else                                              /*  普通進程的優先級  */
            prio = __normal_prio(p);
    return prio;
}

進程類型	調度器	普通優先級normal_prio
EDF實時進程	EDF	MAX_DL_PRIO-1 = -1
普通實時進程	RT	MAX_RT_PRIO-1 - p->rt_priority = 99 - rt_priority
普通進程	CFS	__normal_prio(p) = static_prio

普通優先級normal_prio需要根據普通進程和實時進程進行不同的計算, 其中__normal_prio適用於普通進程, 直接將普通優先級normal_prio設置爲靜態優先級static_prio. 而實時進程的普通優先級計算依據其實時優先級rt_priority.

3.1.1 輔助函數task_has_dl_policy和task_has_rt_policy

定義在kernel/sched/sched.h#L117 中

其本質其實就是傳入task->policy調度策略字段看其值等於SCHED_NORMAL, SCHED_BATCH, SCHED_IDLE, SCHED_FIFO, SCHED_RR, SCHED_DEADLINE中的哪個, 從而確定其所屬的調度類, 進一步就確定了其進程類型

static inline int idle_policy(int policy)
{
    return policy == SCHED_IDLE;
}
static inline int fair_policy(int policy)
{
    return policy == SCHED_NORMAL || policy == SCHED_BATCH;
}

static inline int rt_policy(int policy)
{
    return policy == SCHED_FIFO || policy == SCHED_RR;
}

static inline int dl_policy(int policy)
{
        return policy == SCHED_DEADLINE;
}
static inline bool valid_policy(int policy)
{
        return idle_policy(policy) || fair_policy(policy) ||
                rt_policy(policy) || dl_policy(policy);
}

static inline int task_has_rt_policy(struct task_struct *p)
{
        return rt_policy(p->policy);
}

static inline int task_has_dl_policy(struct task_struct *p)
{
        return dl_policy(p->policy);
}

2.1.2 關於rt_priority數值越大, 實時進程優先級越高的問題

我們前面提到了數值越小, 優先級越高, 但是此處我們會發現rt_priority的值越大, 其普通優先級越小, 從而優先級越高.

因此網上出現了一種說法, 優先級越高？這又是怎麼回事？難道有一種說法錯了嗎？

實際的原因是這樣的，對於一個實時進程，他有兩個參數來表明優先級——prio 和 rt_priority，

prio纔是調度所用的最終優先級數值，這個值越小，優先級越高；

而rt_priority 被稱作實時進程優先級，他要經過轉化——prio=MAX_RT_PRIO - 1- p->rt_priority;

MAX_RT_PRIO = 100, ;這樣意味着rt_priority值越大，優先級越高；

而內核提供的修改優先級的函數，是修改rt_priority的值，所以越大，優先級越高。

所以用戶在使用實時進程或線程，在修改優先級時，就會有“優先級值越大，優先級越高的說法”，也是對的。

2.1.3 爲什麼需要__normal_prio函數我們肯定會奇怪, 爲什麼增加了一個__normal_prio函數做了這麼簡單的工作, 這個其實是有歷史原因的: 在早期的o(1)調度器中, 普通優先級的計算涉及相當多技巧性地工作, 必須檢測交互式進程並提高其優先級, 而必須”懲罰”非交互進程, 以便是得系統獲得更好的交互體驗. 這需要很多啓發式的計算, 他們可能完成的很好, 也可能不工作

2.2 effective_prio函數設置動態優先級prio

可以通過函數effective_prio用靜態優先級static_prio計算動態優先級prio, 即·

p->prio = effective_prio(p);

該函數定義在kernel/sched/core.c, line 861

/*
 * Calculate the current priority, i.e. the priority
 * taken into account by the scheduler. This value might
 * be boosted by RT tasks, or might be boosted by
 * interactivity modifiers. Will be RT if the task got
 * RT-boosted. If not then it returns p->normal_prio.
 */
static int effective_prio(struct task_struct *p)
{
    p->normal_prio = normal_prio(p);
    /*
     * If we are RT tasks or we were boosted to RT priority,
     * keep the priority unchanged. Otherwise, update priority
     * to the normal priority:
     */
    if (!rt_prio(p->prio))
            return p->normal_prio;
    return p->prio;
}

我們會發現函數首先effective_prio設置了普通優先級, 顯然我們用effective_prio同時設置了兩個優先級(普通優先級normal_prio和動態優先級prio)

因此計算動態優先級的流程如下

設置進程的普通優先級(實時進程99-rt_priority, 普通進程爲static_priority)
計算進程的動態優先級(實時進程則維持動態優先級的prio不變, 普通進程的動態優先級即爲其普通優先級)

最後, 我們綜述一下在針對不同類型進程的計算結果

進程類型	實時優先級rt_priority	靜態優先級static_prio	普通優先級normal_prio	動態優先級prio
EDF調度的實時進程	rt_priority	不使用	MAX_DL_PRIO-1	維持原prio不變
RT算法調度的實時進程	rt_priority	不使用	MAX_RT_PRIO-1-rt_priority	維持原prio不變
普通進程	不使用	static_prio	static_prio	static_prio

2.2.1 爲什麼effective_prio使用優先級數值檢測實時進程

t_prio會檢測普通優先級是否在實時範圍內, 即是否小於MAX_RT_PRIO.參見include/linux/sched/rt.h#L6

static inline int rt_prio(int prio)
{
    if (unlikely(prio < MAX_RT_PRIO))
        return 1;
    return 0;
}

而前面我們在normal_prio的時候, 則通過task_has_rt_policy來判斷其policy屬性來確定

policy == SCHED_FIFO || policy == SCHED_RR;

那麼爲什麼effective_prio重檢測實時進程是rt_prio基於優先級數值, 而非task_has_rt_policy或者rt_policy?

對於臨時提高至實時優先級的非實時進程來說, 這個是必要的, 這種情況可能發生在是哦那個實時互斥量(RT-Mutex)時.

2.3 設置prio的時機

在新進程用wake_up_new_task喚醒時, 或者使用nice系統調用改變其靜態優先級時, 則會通過effective_prio的方法設置p->prio

wake_up_new_task(), 計算此進程的優先級和其他調度參數，將新的進程加入到進程調度隊列並設此進程爲可被調度的，以後這個進程可以被進程調度模塊調度執行。

進程創建時copy_process通過調用sched_fork來初始化和設置調度器的過程中會設置子進程的優先級

2.4 nice系統調用的實現

nice系統調用是的內核實現是sys_nice, 其定義在kernel/sched/core.c#L7498，

它在通過一系列檢測後, 通過set_user_nice函數，其定義在kernel/sched/core.c#L3497

關於其具體實現我們會在另外一篇博客裏面詳細講

2.5 fork時優先級的繼承

在進程分叉處子進程時, 子進程的靜態優先級繼承自父進程. 子進程的動態優先級p->prio則被設置爲父進程的普通優先級, 這確保了實時互斥量引起的優先級提高不會傳遞到子進程.

可以參照sched_fork函數, 在進程複製的過程中copy_process通過調用sched_fork來設置子進程優先級, 參見sched_fork函數

/*
 * fork()/clone()-time setup:
 */
int sched_fork(unsigned long clone_flags, struct task_struct *p)
{
    /*  ......  */
    /*
     * Make sure we do not leak PI boosting priority to the child.
     * 子進程的動態優先級被設置爲父進程普通優先級 
     */
    p->prio = current->normal_prio;

    /*
     * Revert to default priority/policy on fork if requested.
     * sched_reset_on_fork標識用於判斷是否恢復默認的優先級或調度策略

     */
    if (unlikely(p->sched_reset_on_fork))  /*  如果要恢復默認的調度策略, 即SCHED_NORMAL  */
    {
        /*   首先是設置靜態優先級static_prio
         *   由於要恢復默認的調度策略
         *   對於父進程是實時進程的情況, 靜態優先級就設置爲DEFAULT_PRIO
         *
         *   對於父進程是非實時進程的情況, 要保證子進程優先級不小於DEFAULT_PRIO
         *   父進程nice < 0即static_prio < 的重新設置爲DEFAULT_PRIO的重新設置爲DEFAULT_PRIO
         *   父進程nice > 0的時候, 則什麼也沒做
         *   */
        if (task_has_dl_policy(p) || task_has_rt_policy(p))
        {
            p->policy = SCHED_NORMAL;           /*  普通進程調度策略  */
            p->static_prio = NICE_TO_PRIO(0);   /*  靜態優先級爲nice = 0 即DEFAULT_PRIO*/
            p->rt_priority = 0;                             /*  實時優先級爲0  */
        }
        else if (PRIO_TO_NICE(p->static_prio) < 0)  /*  */
            p->static_prio = NICE_TO_PRIO(0);   /*  */

        /*  接着就通過__normal_prio設置其普通優先級和動態優先級
          *  這裏做了一個優化, 因爲用sched_reset_on_fork標識設置恢復默認調度策略後
          *  創建的子進程是是SCHED_NORMAL的非實時進程
          *  因此就不需要繞一大圈用effective_prio設置normal_prio和prio了 
          *  直接用__normal_prio設置就可  */
        p->prio = p->normal_prio = __normal_prio(p); /*  設置*/

        /*  設置負荷權重  */
        set_load_weight(p);

        /*
         * We don't need the reset flag anymore after the fork. It has
         * fulfilled its duty:
         */
        p->sched_reset_on_fork = 0;
    }
    /*  ......  */
}

3 總結

task_struct採用了四個成員表示進程的優先級:prio和normal_prio表示動態優先級, static_prio表示進程的靜態優先級. 同時還用了rt_priority表示實時進程的優先級

字段	描述
static_prio	用於保存靜態優先級, 是進程啓動時分配的優先級, ，可以通過nice和sched_setscheduler系統調用來進行修改, 否則在進程運行期間會一直保持恆定
prio	進程的動態優先級, 這個有顯示纔是調度器重點考慮的進程優先級
normal_prio	普通進程的靜態優先級static_prio和調度策略計算出的優先級. 因此即使普通進程和實時進程具有相同的靜態優先級, 其普通優先級也是不同的, 進程分叉(fork)時, 子進程會繼承父進程的普通優先級, 可以通過normal_prio來計算(非實時進程用static_prIo計算, 實時進程用rt_priority計算)
rt_priority	實時進程的靜態優先級

調度器會考慮的優先級則保存在prio. 由於在某些情況下內核需要暫時提高進程的優先級, 因此需要用prio表示. 由於這些改變不是持久的, 因此靜態優先級static_prio和普通優先級normal_prio不受影響.

此外還用了一個字段rt_priority保存了實時進程的優先級靜態優先級static_prio(普通進程)和實時優先級rt_priority(實時進程)是計算的起點, 通過他們計算進程的普通優先級normal_prio和動態優先級prio.

內核通過normal_prIo函數計算普通優先級normal_prio

通過effective_prio函數計算動態優先級prio

參考進程調度之sys_nice()系統調用 linux調度器源碼研究 - 概述（一）深入 Linux 的進程優先級

Linux進程優先級的處理--Linux進程的管理與調度(二十二)

1. linux優先級的表示

1.1 優先級的內核表示

1.2 進程的優先級表示

2 進程優先級的計算

2.1 normal_prio函數設置普通優先級normal_prio

3.1.1 輔助函數task_has_dl_policy和task_has_rt_policy

2.1.2 關於rt_priority數值越大, 實時進程優先級越高的問題

2.2 effective_prio函數設置動態優先級prio

2.2.1 爲什麼effective_prio使用優先級數值檢測實時進程

2.3 設置prio的時機

2.4 nice系統調用的實現

2.5 fork時優先級的繼承

3 總結

Linux分頁機制之概述--Linux內存管理(六)

Linux內存描述之內存區域zone--Linux內存管理(三)

Linux內存描述之概述--Linux內存管理(一)

android 休眠喚醒機制分析(一) — wake_lock【轉】

服務器體系(SMP, NUMA, MPP)與共享存儲器架構(UMA和NUMA)

https://yachay.unat.edu.pe/blog/index.php?comment_area=format_blog&comment_component=blog&comment_co

linux以太網驅動總結