Linux scheduler: third-generation CFS (4), complete manuscript, final part

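These are notes I wrote some time ago. The formatting looked different when I opened them in UE, so they may be a little messy here, but they should still be readable. Have a look if you are interested.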

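I spent a long time reading the scheduler a while ago and feel I have forgotten most of it again, so let me sort it out.
1. First, what I can still remember:
a. goto fits well here and can generate optimal assembly code;
b. Virtual time is a brilliant idea.
c. First generation: walk from the first process to the last and see which one has the highest priority;
Second generation: split the priorities of normal processes into forty levels and search from the highest one down; with many processes, though, the granularity is too coarse, and an associate professor ends up in the same bucket as a full professor;
Third generation: very impressive. It has been in use from Linux 2.6.23 (kernel/sched.c) through Linux 4.0.1; I checked, and 4.0.1 does still use it (linux-4.0.1\kernel\sched\fair.c). Author: Ingo Molnar.
d. While a process sleeps, its virtual time does not change; after it wakes up, its virtual runtime is set again so that it cannot hog the CPU. Processes are kept in a red-black tree.
e. The third generation takes the system's point of view and picks the process that is most in need of the CPU, instead of taking the process's point of view, as before, and picking whichever has the highest priority.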

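All right, let me go through it with the book and the source code.
wait_runtime? The theory behind fairness.
There are two ways to activate the scheduler: 1. directly, for example when a process intends to sleep or gives up the CPU for some other reason;
2. through a periodic mechanism that runs at a fixed frequency and checks from time to time whether a process switch is needed.
The kernel supports different scheduling policies:
1. completely fair scheduling;
2. real-time scheduling;
3. scheduling the idle process when there is nothing to do.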
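Several members of each process's task_struct are related to scheduling:
struct task_struct
{
    ...
    int prio, static_prio, normal_prio;
    // static_prio: static priority, assigned when the process starts; it can be changed with nice() or sched_setscheduler(), otherwise it stays constant
    // normal_prio: priority computed from the static priority and the scheduling policy. A child process inherits this priority from its parent
    // prio: the priority the scheduler actually considers (the other two are only inputs).
    unsigned int rt_priority; // real-time priority, lowest is 0, highest is 99
    struct list_head run_list; // list head; used by the round-robin real-time scheduler, not by CFS
    const struct sched_class *sched_class; // the scheduler class the process belongs to
    struct sched_entity se; // the scheduler is not limited to processes; it can also schedule larger entities, as in group scheduling
    unsigned int policy;
    // scheduling policy, 5 kinds
    /*
     * Scheduling policies
     */
    #define SCHED_NORMAL 0 // the main subject of these notes
    #define SCHED_FIFO 1 // soft real-time, first in first out
    #define SCHED_RR 2 // soft real-time, round robin
    #define SCHED_BATCH 3 // for non-interactive, CPU-intensive batch processes
    /* SCHED_ISO: reserved but not implemented yet */
    #define SCHED_IDLE 5 // rarely used; low importance, its relative weight is always the smallest

    cpumask_t cpus_allowed; // bit field, used on multiprocessor systems to restrict which CPUs the process may run on
    unsigned int time_slice; // remaining CPU time; used by the round-robin real-time scheduler, not by CFS
    ...
};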

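Scheduler classes
They provide the link between the generic scheduler and the individual scheduling methods. The names mostly say what the members do (enqueue, dequeue and so on, so I will not go through them): it is a collection of function pointers.
struct sched_class
{
    const struct sched_class *next;
    void (*enqueue_task) (struct rq *rq, struct task_struct *p, int wakeup);
    void (*dequeue_task) (struct rq *rq, struct task_struct *p, int sleep);
    ...
};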

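Part two: dissecting priorities
This should be one of the highlights. I am almost afraid it is too beautiful to look at: like a work of art, like an expensive mechanical watch put together piece by piece from a single gear and a single hand.
The process itself is a pleasure. After spending a long time assembling it, you watch the second hand tick and the gears turn, you wipe the sweat off your forehead, and the oil stains do nothing to dim your smile.
I can hardly wait, and I have not even started taking it apart. Looking at the time, I will do some listening practice now and take the Linux kernel priorities apart tomorrow evening.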

Static priority
The kernel uses 0-139 (inclusive) for its internal priorities; a lower value means a higher priority. 0-99 are for real-time processes, and 100-139 map exactly onto the 40 nice values (-20 to 19) of normal processes.
A real-time process therefore always has a higher priority than a normal process.
// Priority of a process goes from 0..MAX_PRIO-1, i.e. 0-139
// RT priority is 0..MAX_RT_PRIO-1, i.e. 0..(100-1)
// This allows kernel threads to set their priority to a value higher than any user task
#define MAX_USER_RT_PRIO 100
#define MAX_RT_PRIO MAX_USER_RT_PRIO

#define MAX_PRIO (MAX_RT_PRIO + 40) // 100 + 40
#define DEFAULT_PRIO (MAX_RT_PRIO + 20)
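
The mapping between nice values and the 100-139 range can be written down directly. The following is a minimal sketch under the assumption that the mapping is a plain offset, as the macros above suggest; the helper names nice_to_prio and prio_to_nice are mine, chosen for illustration.

```c
#include <assert.h>

#define MAX_RT_PRIO 100
#define MAX_PRIO    (MAX_RT_PRIO + 40)

/* Map a nice value in [-20, 19] to an internal static priority in [100, 139]. */
static int nice_to_prio(int nice)
{
    assert(nice >= -20 && nice <= 19);
    return MAX_RT_PRIO + nice + 20;
}

/* Inverse mapping: internal static priority back to a nice value. */
static int prio_to_nice(int prio)
{
    assert(prio >= MAX_RT_PRIO && prio < MAX_PRIO);
    return prio - MAX_RT_PRIO - 20;
}
```

With this mapping nice 0 lands on 120, which is the value of DEFAULT_PRIO.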

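The static priority alone is not enough. Three priorities have to be considered: the dynamic priority (task_struct->prio), the normal priority (task_struct->normal_prio) and the static priority (task_struct->static_prio). The static priority is the starting point.

The dynamic priority is computed with p->prio = effective_prio(p);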

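static int effective_prio(struct task_struct *p)
{
    p->normal_prio = normal_prio(p); // compute the normal priority, see the next function
    /*
     * If we are RT tasks or we were boosted to RT priority,
     * keep the priority unchanged. Otherwise, update priority
     * to the normal priority:
     */
    if (!rt_prio(p->prio)) // not in the real-time range
        return p->normal_prio; // dynamic priority = normal priority
    return p->prio; // real-time or boosted: keep the dynamic priority, which is set elsewhere
}
// compute the normal priority
static inline int normal_prio(struct task_struct *p)
{
    int prio;

    if (task_has_rt_policy(p)) // real-time process
        prio = MAX_RT_PRIO-1 - p->rt_priority; // 100-1 minus the real-time priority of the process
    else
        prio = __normal_prio(p); // simply returns the static priority
    return prio;
}

// Why the extra function? Historical reasons: in the old O(1) scheduler this
// computation involved a lot of heuristics, detecting and boosting interactive
// processes and "penalizing" non-interactive ones. To be explored further;
// thankfully the current scheduler needs none of that.
static inline int __normal_prio(struct task_struct *p)
{
    return p->static_prio;
}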

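Two different ways of deciding whether a process is real-time
static inline int rt_prio(int prio) // called from effective_prio(); compares against the boundary value 100, so it also covers tasks temporarily boosted into the real-time range (the priority inversion case discussed later)
{
    if (unlikely(prio < MAX_RT_PRIO))
        return 1;
    return 0;
}

static inline int task_has_rt_policy(struct task_struct *p) // called from normal_prio(); checks the process's own policy attribute
{
    return rt_policy(p->policy);
}
static inline int rt_policy(int policy)
{
    if (unlikely(policy == SCHED_FIFO) || unlikely(policy == SCHED_RR))
        return 1;
    return 0;
}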

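That more or less covers how the priorities are computed:
                                   static_prio   normal_prio                      prio
Non-real-time process              static_prio   static_prio                      static_prio
Priority-boosted non-real-time     static_prio   static_prio                      prio unchanged
Real-time process                  static_prio   MAX_RT_PRIO-1 - p->rt_priority   prio unchanged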
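
The three rows of the table can be reproduced with a small user-space model. This is only a sketch: struct demo_task and the demo_* functions are hypothetical stand-ins for task_struct, normal_prio() and effective_prio(), reduced to the fields used above.

```c
#include <assert.h>

#define MAX_RT_PRIO 100
enum { POLICY_NORMAL = 0, POLICY_FIFO = 1, POLICY_RR = 2 }; /* mirrors SCHED_* */

struct demo_task {      /* hypothetical, reduced stand-in for task_struct */
    int policy;
    int static_prio;
    int rt_priority;
    int prio;
};

/* Normal priority: derived from the policy and the static priority only. */
static int demo_normal_prio(const struct demo_task *p)
{
    if (p->policy == POLICY_FIFO || p->policy == POLICY_RR) {
        assert(p->rt_priority >= 0 && p->rt_priority < MAX_RT_PRIO);
        return MAX_RT_PRIO - 1 - p->rt_priority;
    }
    return p->static_prio;
}

/* Dynamic priority: keep a real-time or boosted value, else use normal_prio. */
static int demo_effective_prio(const struct demo_task *p)
{
    if (p->prio >= MAX_RT_PRIO)     /* neither real-time nor boosted */
        return demo_normal_prio(p);
    return p->prio;
}
```

A boosted normal task (prio below 100, policy still POLICY_NORMAL) keeps its boosted prio, while its normal_prio stays equal to static_prio, which is exactly the second row of the table.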

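How important a process is depends not only on its priority but also on the load weight stored in task_struct->se.load.

Weights
struct load_weight
{
    unsigned long weight, inv_weight; // a side note: besides the load weight itself, the kernel keeps a second value used when dividing by the weight;
                                      // the fields are integers (long), so 1/weight cannot be stored directly
};
Each step down in nice value gains a process roughly 10% more CPU time; each step up gives up roughly 10%.
Nice values matter only for non-real-time processes; real-time processes are ordered by rt_priority.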

/*
* Nice levels are multiplicative, with a gentle 10% change for every
* nice level changed. I.e. when a CPU-bound task goes from nice 0 to
* nice 1, it will get ~10% less CPU time than another CPU-bound task
* that remained on nice 0.
*
* The "10% effect" is relative and cumulative: from _any_ nice level,
* if you go up 1 level, it's -10% CPU usage, if you go down 1 level
* it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
* If a task goes up by ~10% and another task goes down by ~10% then
* the relative distance between them is ~25%.)
*/
static const int prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};
/*
* Inverse (2^32/x) values of the prio_to_weight[] array, precalculated.
*
* In cases where the weight does not change often, we can use the
* precalculated inverse to speed up arithmetics by turning divisions
* into multiplications:
*/
static const u32 prio_to_wmult[40] = {
/* -20 */ 48388, 59856, 76040, 92818, 118348,
/* -15 */ 147320, 184698, 229616, 287308, 360437,
/* -10 */ 449829, 563644, 704093, 875809, 1099582,
/* -5 */ 1376151, 1717300, 2157191, 2708050, 3363326,
/* 0 */ 4194304, 5237765, 6557202, 8165337, 10153587,
/* 5 */ 12820798, 15790321, 19976592, 24970740, 31350126,
/* 10 */ 39045157, 49367440, 61356676, 76695844, 95443717,
/* 15 */ 119304647, 148102320, 186737708, 238609294, 286331153,
};
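
To see why the precomputed inverse helps, here is a minimal sketch of replacing a division by a multiplication and a shift. The helper names are mine, rounding to nearest is my own choice, and the sketch assumes delta is small enough that the 64-bit product does not overflow; the kernel's own helper handles these details differently.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u

/* Precompute 2^32 / weight (rounded to nearest), the kind of value
 * stored in prio_to_wmult[]. */
static uint32_t inverse_weight(uint32_t weight)
{
    assert(weight > 1); /* the inverse of 1 would not fit in 32 bits */
    return (uint32_t)(((UINT64_C(1) << 32) + weight / 2) / weight);
}

/* Approximates delta * NICE_0_LOAD / weight without a division.
 * Assumes delta * NICE_0_LOAD * inv_weight fits in 64 bits. */
static uint64_t scale_by_inverse(uint64_t delta, uint32_t inv_weight)
{
    return (delta * NICE_0_LOAD * inv_weight) >> 32;
}
```

For weight 1024 the inverse is 4194304 and for weight 820 it is 5237765, matching the nice 0 and nice 1 entries of prio_to_wmult[] above.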

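// Gist of the comment above: when a process's nice value drops by one level it gets about 10% more CPU time, and this 10% is relative.
// An example: only processes A and B are running, both at nice 0, so both have weight (load) 1024.
// A then gets 1024/(1024+1024) = 50% of the CPU. To open a 10% gap, one gains what the other loses: A = 55%, B = 45%.
// If A's nice value (and thus its weight) stays the same and B's nice value rises by one level, what should B's weight become? That is where the weight table comes from.
// 1024/(1024 + B's weight) ≈ 55% gives B's weight = 1024/0.55 - 1024 ≈ 838, which is not far from the table value 820.
// The kernel simply uses a factor of 1.25 between neighbouring levels: B's share is 1/(1+1.25) ≈ 0.4444 and A's is about 0.5556, which approximates the 45%/55% split above.
// The table is presumably extended in both directions from the value 1024 at nice 0.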
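The conversion code also has to handle real-time processes: a real-time process gets twice the weight of the highest-priority normal process (nice -20), while a SCHED_IDLE process always gets a very small weight, as mentioned earlier.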

/*
* To aid in avoiding the subversion of "niceness" due to uneven distribution
* of tasks with abnormal "nice" values across CPUs the contribution that
* each task makes to its run queue's load is weighted according to its
* scheduling class and "nice" value. For SCHED_NORMAL tasks this is just a
* scaled version of the new time slice allocation that they receive on time
* slice expiry etc.
*/
// The last sentence says that for SCHED_NORMAL (normal) tasks the weight is just a scaled version of the time slice allocation they would receive.

#define WEIGHT_IDLEPRIO 2
#define WMULT_IDLEPRIO (1 << 31)

static void set_load_weight(struct task_struct *p)
{
    if (task_has_rt_policy(p)) {
        p->se.load.weight = prio_to_weight[0] * 2; // real-time: twice the largest weight of a normal process
        p->se.load.inv_weight = prio_to_wmult[0] >> 1; // the matching inverse; precomputed because the weight rarely changes ("does not change often"), for fast arithmetic
        return;
    }

    /*
     * SCHED_IDLE tasks get minimal weight:
     */
    if (p->policy == SCHED_IDLE) {
        p->se.load.weight = WEIGHT_IDLEPRIO;
        p->se.load.inv_weight = WMULT_IDLEPRIO;
        return;
    }

    p->se.load.weight = prio_to_weight[p->static_prio - MAX_RT_PRIO]; // normal processes: table lookup
    p->se.load.inv_weight = prio_to_wmult[p->static_prio - MAX_RT_PRIO];
}
// Every time a process is added to the run queue the kernel calls inc_nr_running(). This not only lets the run queue
// keep track of how many processes are runnable, it also adds the weight of the process to the weight of the run queue.
static void inc_nr_running(struct task_struct *p, struct rq *rq)
{
    rq->nr_running++;
    inc_load(rq, p);
}

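Core scheduler
1. Periodic scheduler
If the current process should be rescheduled, the TIF_NEED_RESCHED flag is set for the task.
2. Main scheduler
void __sched schedule(void) { ... } // the __sched prefix is used for functions that may call schedule()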

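The completely fair scheduling class (important)

Everything the core scheduler needs to know about the completely fair scheduler is contained in fair_sched_class:
/*
 * All the scheduling class methods:
 */
static const struct sched_class fair_sched_class = {
    .next = &idle_sched_class,
    .enqueue_task = enqueue_task_fair,
    .dequeue_task = dequeue_task_fair,
    ...
};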

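CFS data structures
/* CFS-related fields in a runqueue */
struct cfs_rq
{
    struct load_weight load;
    unsigned long nr_running; // number of runnable processes on the queue

    u64 min_vruntime; // tracks the minimum virtual runtime of all processes on the queue; it may be larger than the vruntime of the leftmost node of the red-black tree

    struct rb_root tasks_timeline; // root of the time-ordered red-black tree that manages all processes
    ... // some group scheduling fields omitted

};
The completely fair algorithm relies on a virtual clock, but there is no such variable in the data structures, because the virtual clock can be derived from the real clock and the load weight. Hence the name "virtual".
The function that computes virtual time is update_curr().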

static void update_curr(struct cfs_rq *cfs_rq)
{
    struct sched_entity *curr = cfs_rq->curr; // the entity currently executing on this run queue
    u64 now = rq_of(cfs_rq)->clock; // the real clock value of the main scheduler run queue
    unsigned long delta_exec;

    if (unlikely(!curr)) // nothing is running on this run queue: nothing to do
        return;

    /*
     * Get the amount of time the current task was running
     * since the last time we changed load (this cannot
     * overflow on 32 bits):
     */
    delta_exec = (unsigned long)(now - curr->exec_start); // effectively the else branch of the if above: the time elapsed between now and the last update

    __update_curr(cfs_rq, curr, delta_exec); // update the physical and virtual time the current process has spent on the CPU
    curr->exec_start = now;

}

static inline void
__update_curr(struct cfs_rq *cfs_rq, struct sched_entity *curr,
              unsigned long delta_exec)
{
    unsigned long delta_exec_weighted;
    u64 vruntime;

    schedstat_set(curr->exec_max, max((u64)delta_exec, curr->exec_max));

    curr->sum_exec_runtime += delta_exec; // physical time is easy: just add the time difference
    schedstat_add(cfs_rq, exec_clock, delta_exec);
    delta_exec_weighted = delta_exec; // for a process at nice level 0, virtual time equals physical time by definition
    if (unlikely(curr->load.weight != NICE_0_LOAD)) {
        delta_exec_weighted = calc_delta_fair(delta_exec_weighted, // for other nice values, weight the execution delta
                                              &curr->load);
    }
    curr->vruntime += delta_exec_weighted;

    /*
     * maintain cfs_rq->min_vruntime to be a monotonic increasing
     * value tracking the leftmost vruntime in the tree.
     */
    if (first_fair(cfs_rq)) {
        vruntime = min_vruntime(curr->vruntime,
                                __pick_next_entity(cfs_rq)->vruntime);
    } else
        vruntime = curr->vruntime;

    cfs_rq->min_vruntime =
        max_vruntime(cfs_rq->min_vruntime, vruntime); // ensures min_vruntime only ever increases
}

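Based on the book 《深入理解linux內核架構》 (Professional Linux Kernel Architecture).
Call chain:
calc_delta_fair(delta_exec_weighted, &curr->load)
    -> calc_delta_mine(delta_exec, NICE_0_LOAD, lw)    // the implementation is a bit convoluted
Ignoring rounding and overflow handling, the result is:
delta_exec_weighted = delta_exec * NICE_0_LOAD / curr->load.weight;
// Note that the more important a process is, the larger its weight, so delta_exec_weighted is smaller and its virtual runtime (curr->vruntime += delta_exec_weighted;) grows more slowly.
// All of this applies to non-real-time processes only. It shows that the virtual runtime of a more important process grows more slowly, so the process stays further to the left in the tree and has a better chance of running next.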
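
The formula can be tried out in user space. The sketch below uses a plain 64-bit division instead of the kernel's multiply-and-shift path, so it illustrates the proportion only; weighted_delta is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define NICE_0_LOAD 1024u

/* Virtual-time increment for delta_exec nanoseconds of real run time.
 * A plain division stands in for the kernel's inverse-weight arithmetic. */
static uint64_t weighted_delta(uint64_t delta_exec, uint32_t weight)
{
    assert(weight > 0);
    if (weight == NICE_0_LOAD)  /* nice 0: virtual time equals real time */
        return delta_exec;
    return delta_exec * NICE_0_LOAD / weight;
}
```

For 1000 ns of real time, a nice 0 task (weight 1024) advances by 1000, a nice 1 task (weight 820) by 1248, and a nice -20 task (weight 88761) by only 11.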

// So where is curr->exec_start set in the first place?
delta_exec = (unsigned long)(now - curr->exec_start);
/*
 * We are picking a new current task - update its stats:
 */
static inline void
update_stats_curr_start(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    /*
     * We are starting a new run period:
     */
    se->exec_start = rq_of(cfs_rq)->clock; // rq_of is a helper that finds the struct rq instance belonging to a CFS run queue; this records the time at which the process starts running
}
// After that, update_curr() computes delta_exec = (unsigned long)(now - curr->exec_start);
// and then refreshes curr->exec_start = now;

// The kernel must guarantee that min_vruntime increases monotonically.
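
One way to keep such a counter monotonic even when the 64-bit value eventually wraps around is to compare through a signed difference. The following is a sketch of that idea; advance_min_vruntime is a name I chose, and the code assumes the usual two's-complement conversion from uint64_t to int64_t.

```c
#include <assert.h>
#include <stdint.h>

/* Return the later of min_vruntime and candidate, where "later" is
 * decided by the sign of their difference so that wrap-around is handled. */
static uint64_t advance_min_vruntime(uint64_t min_vruntime, uint64_t candidate)
{
    uint64_t result = min_vruntime;
    int64_t delta = (int64_t)(candidate - min_vruntime);

    if (delta > 0)
        result = candidate;
    /* postcondition: the value never moves backwards */
    assert((int64_t)(result - min_vruntime) >= 0);
    return result;
}
```

A candidate that lies behind the current minimum is ignored, so the stored value can only stay where it is or move forward.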

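Here is the core idea: the red-black tree is sorted by the following key
static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
    return se->vruntime - cfs_rq->min_vruntime;
}
// Nodes with a smaller key sit further to the left and are therefore scheduled sooner.
1. While a process runs, its vruntime increases steadily, so it keeps moving to the right in the red-black tree.
2. While a process sleeps, its vruntime stays unchanged. Because the min_vruntime of the queue keeps increasing, the process ends up further to the left in the tree when it wakes up, since its key has become smaller.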
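// Periodic scheduler
// Can be switched off when power is low
Main tasks:
1. Manage the scheduling-related statistics.
2. Invoke the periodic scheduling method of the scheduler class responsible for the current process.

CFS no longer has a time slice in the traditional sense; how long a process runs varies with its weight and with the number of runnable processes. The kernel source says:
/*
 * NOTE: this latency value is not the same as the concept of
 * 'timeslice length' - timeslices in CFS are of variable length
 * and have no persistent notion like in traditional, time-slice
 * based scheduling concepts.
 */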

void scheduler_tick(void) // looks much shorter overall than the second-generation version
{
    int cpu = smp_processor_id();
    struct rq *rq = cpu_rq(cpu);
    struct task_struct *curr = rq->curr;
    u64 next_tick = rq->tick_timestamp + TICK_NSEC;

    spin_lock(&rq->lock);
    __update_rq_clock(rq);
    /*
     * Let rq->clock advance by at least TICK_NSEC:
     */
    if (unlikely(rq->clock < next_tick))
        rq->clock = next_tick;
    rq->tick_timestamp = rq->clock; // update the timestamp
    update_cpu_load(rq);
    if (curr != rq->idle) /* FIXME: needed? */
        curr->sched_class->task_tick(rq, curr); // implementation depends on the scheduler class: task_tick_fair, task_tick_idle or task_tick_rt
    spin_unlock(&rq->lock); // task_tick_fair is covered first

}

// First task_tick_fair, which is formally responsible
static void task_tick_fair(struct rq *rq, struct task_struct *curr)
{
    struct cfs_rq *cfs_rq;
    struct sched_entity *se = &curr->se;

    for_each_sched_entity(se) {
        cfs_rq = cfs_rq_of(se);
        entity_tick(cfs_rq, se); // the real work is delegated to this function
    }
}
// The function that does the real work
static void entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    /*
     * Update run-time statistics of the 'current'.
     */
    update_curr(cfs_rq); // update the statistics

    if (cfs_rq->nr_running > 1 || !sched_feat(WAKEUP_PREEMPT)) // with more than one runnable process, check for preemption; otherwise there is nothing to do
        check_preempt_tick(cfs_rq, curr);
}
// What happens when preemption is possible: make sure no process runs longer than the share assigned to it within the latency period.
/*
 * Preempt the current task with a newly woken task if needed:
 */
static void
check_preempt_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr)
{
    unsigned long ideal_runtime, delta_exec;

    ideal_runtime = sched_slice(cfs_rq, curr); // compute the ideal run time
    delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
    if (delta_exec > ideal_runtime) // the process has run longer than its ideal time (it exceeded the latency limit)
        resched_task(rq_of(cfs_rq)->curr);
}

// Computing the ideal run time starts from the period, which stretches elastically with the number of runnable processes.
/*
 * The idea is to set a period in which each task runs once.
 *
 * When there are too many tasks (sysctl_sched_nr_latency) we have to stretch
 * this period because otherwise the slices get too small.
 *
 * p = (nr <= nl) ? l : l*nr/nl
 */
static u64 __sched_period(unsigned long nr_running)
{
    u64 period = sysctl_sched_latency;
    unsigned long nr_latency = sched_nr_latency;

    if (unlikely(nr_running > nr_latency)) {
        period *= nr_running;
        do_div(period, nr_latency);
    }

    return period;
}
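
To get a feel for the numbers, here is a user-space sketch of the period formula together with a proportional slice. The constants (20 ms latency, 5 tasks) are assumed values for illustration, and the slice rule (period times weight over total weight) is my reading of what sched_slice() does; neither is shown in the excerpts above.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_SCHED_LATENCY_NS 20000000ULL /* assumed: 20 ms */
#define DEMO_NR_LATENCY       5UL         /* assumed */

/* p = (nr <= nl) ? l : l*nr/nl */
static uint64_t demo_sched_period(unsigned long nr_running)
{
    uint64_t period = DEMO_SCHED_LATENCY_NS;

    if (nr_running > DEMO_NR_LATENCY)
        period = period * nr_running / DEMO_NR_LATENCY;
    return period;
}

/* Assumed rule: a task's slice is its weight's share of the period. */
static uint64_t demo_sched_slice(unsigned long nr_running,
                                 unsigned long weight,
                                 unsigned long total_weight)
{
    assert(total_weight > 0 && weight <= total_weight);
    return demo_sched_period(nr_running) * weight / total_weight;
}
```

Under these assumptions, three runnable tasks share a 20 ms period, ten tasks stretch it to 40 ms, and two nice 0 tasks get 10 ms each.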

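The real-time scheduling class
Real-time processes differ from normal processes in one fundamental way: if a real-time process exists and is runnable, the scheduler will always pick it, unless there is another real-time process with a higher priority.
Round robin (SCHED_RR).
First in, first out (SCHED_FIFO).

This one is fairly simple: the process with the higher priority runs.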
