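Reposted from: https://blog.csdn.net/xiaoqiaoq0/article/details/107135747/
Preface
This article continues my notes on CPU scheduling, this time on WALT. It covers:
- What is WALT?
- How is WALT computed?
- How is the data WALT computes used?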
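1. What is WALT?
WALT stands for Window Assisted Load Tracking:
- Read literally, it tracks CPU load using windows as an aid;
- In essence it is a calculation method that expresses the CPU's current loading as data, which later feeds task scheduling, migration, load balancing, and similar functions;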
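1.1 Why is WALT needed?
A new technique, and especially a new way of computing something, is usually introduced because the old one no longer fits current needs, or because the new one lets people be lazier;
1.1.1 Shortcomings of the PELT calculation
When PELT was introduced, Linux was used mainly on servers, where the focus was raw performance and power consumption was not yet a priority. With the rise of mobile devices, power consumption and responsiveness are things users perceive directly, and they have become the main considerations:
- On today's mobile devices, UI work must be handled quickly, otherwise the user notices jank;
- Power consumption cannot be ignored: a phone that needs frequent charging will not sell well;
- Whether a task counts as heavy should depend on the user scenario. For example, a task's importance changes with what is being displayed, so tasks of the same category need to be reclassified dynamically;
PELT-based scheduling (built on decay) reflects continuous trends well, but it is less friendly to fast, abrupt changes:
- It responds slowly to rapid rises and falls. Because of the decay calculation, a real rise or fall in load only shows up in the data after a number of periods;
- Because of the decay mechanism, a task's computed load shrinks after it sleeps for a while. If that task is something like a network transfer, which periodically needs CPU capacity and frequency, PELT cannot respond quickly (the calculation favors trends and averages);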
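To make the slow ramp-up concrete, here is a minimal sketch, not kernel code. It assumes the commonly described PELT parameters (periods of roughly 1 ms and a per-period decay factor y with y^32 = 0.5) and a task that starts from zero load and then runs continuously. `PELT_Y` and `pelt_periods_to_reach` are hypothetical names introduced only for this illustration.

```c
/* Per-period decay factor, chosen so that y^32 = 0.5 (assumed PELT constant). */
#define PELT_Y 0.97857206208770

/*
 * Hypothetical helper: number of consecutive fully busy periods a task that
 * starts from zero load needs before its normalized load reaches `target`
 * (0 < target < 1). After n busy periods the normalized load is 1 - y^n.
 */
int pelt_periods_to_reach(double target)
{
    double remaining = 1.0; /* y^n: the share of full load not yet accumulated */
    int n = 0;

    while (1.0 - remaining < target) {
        remaining *= PELT_Y;
        n++;
    }
    return n;
}
```

Under these assumptions `pelt_periods_to_reach(0.8)` is 75, so a task has to run flat out for roughly 75 ms before its tracked load reaches 80%. That is the lag the two bullets above describe.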
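1.2 How WALT handles this
From the reasons above, what is needed is a calculation that keeps PELT's benefits while fitting current requirements better:
- Data is reported more promptly;
- Data directly reflects the current state;
- No extra compute cost;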
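1.2.1 What WALT does
Here is my summary of what WALT can (and needs to) achieve:
- Keep tracking every task entity;
- On top of the earlier usage (load), also record demand, which is used for prediction later;
- The overall load of each CPU's runqueue is still the sum over all of its tasks;
- The core difference is in the calculation: decay is replaced by division into windows, so the collected data reflects actual changes faster (compared with PELT's trend). Some points from the official Linux material:
- A task's demand is the maximum of its contribution to the most recently completed window and its average demand over the past N windows.
- WALT "forgets" blocked time entirely: only runnable and running time is counted, which gives a more accurate measure of a task's actual time cost and allows prediction through demand;
- CPU busy time - The sum of execution times of all tasks in the most recently completed window;
- WALT "forgets" cpu utilization as soon as tasks are taken off of the runqueue;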
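The demand rule quoted above (maximum of the most recent window and the average over the past N windows) can be sketched as follows. This is an illustration under assumptions, not the kernel implementation: `walt_demand_sketch` is a hypothetical helper, `hist[0]` is taken to be the most recently completed window, and the values are busy time in any consistent unit.

```c
#include <stdint.h>

/*
 * Hypothetical helper: demand = max(most recent window, average of the
 * last n windows). hist[0] is the most recently completed window.
 */
uint32_t walt_demand_sketch(const uint32_t *hist, int n)
{
    uint64_t sum = 0;
    uint32_t avg;
    int i;

    if (n <= 0)
        return 0;
    for (i = 0; i < n; i++)
        sum += hist[i];
    avg = (uint32_t)(sum / (uint64_t)n);

    /* A sudden burst wins immediately; a sudden drop is smoothed by the average. */
    return hist[0] > avg ? hist[0] : avg;
}
```

With a history of {18, 2, 2, 2, 2} the result is 18, so a burst is visible after a single window. With {2, 18, 18, 18, 18} the result is 14 (integer average of 74 / 5), so one quiet window does not immediately collapse the demand.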
1.2.2 Where it is applied
- Per-CPU and per-task load statistics before a task is placed;
- Task migration;
- Big/little core placement;
- EAS placement;
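1.3 Version history
- Introduced after Linux 4.8.2 (but browsing the code on bootlin, the corresponding file is still absent in the latest 5.8, so it does not appear to be in the mainline tree);
- Introduced in Android after 4.4 (the Android kernel 4.9 tree does contain this code);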
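2. Enabling WALT in the kernel
The Android kernel code already integrates WALT, but depending on the vendor it may not be enabled:
- Enable the config option for testing:
- menuconfig ==> General setup ==> CPU/Task time and stats accounting ==> support window based load tracking
- Illustration: (menuconfig screenshot not reproduced here)
- Or edit the config directly:
- Add CONFIG_SCHED_WALT=y to kernel/arch/arm64/config/defconfig
- Build the image and verify that the change took effect:
demo:/sys/kernel/tracing # zcat /proc/config.gz | grep WALT
CONFIG_SCHED_WALT=y
CONFIG_HID_WALTOP=y
- Testing:
So far I have only confirmed in ftrace that WALT statistics are being collected. I have no real workload to confirm whether there is an improvement, and no other measurements (the Linux material does include some numbers, but they are not from local testing);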
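3. WALT calculation
This section explains the calculation WALT uses, from both the principle and the code:
- How are windows divided?
- How are tasks classified, and how is each class handled?
- How is the WALT data updated?
- How is the updated WALT data used by the scheduler and EAS?
3.1 Window division
First, how are the windows that assist the calculation divided?
Put simply, time since boot is cut into periods of a fixed length; each task's loading is accumulated per period and propagated to the runqueue;
What else has to be decided?
- How long should one period (window) be? This is tuned per project; the kernel's current default is 20 ms;
- How many windows of loading are kept? This is also tuned per project; the kernel currently keeps 5 windows;
For a given task and window, the following cases can occur:
ps: ms = mark_start (the task's start mark, which the code sets to the time of the task's last update), ws = window_start (start of the current window), wc = wallclock (current system time)
- The task started in this window and is still in this window when it is accounted, so the task lies within one window;
- The task started in the previous window and is accounted in the current window, so the task spans two windows;
- The task started in some earlier window and is accounted in the current window, so the task spans several full windows;
These three cases are the only ways a task can relate to the windows, and all calculations are based on them;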
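The three cases can be told apart from ms, ws, and the window size alone. The sketch below is illustrative only: `enum walt_span` and `classify_span` are hypothetical names, and it assumes ws has already been rolled forward so that it is the start of the window containing wc.

```c
#include <stdint.h>

/* Hypothetical labels for the three cases described above. */
enum walt_span {
    SPAN_ONE_WINDOW,   /* ms >= ws: the task's segment lies inside the current window */
    SPAN_TWO_WINDOWS,  /* ms < ws and no full window lies between them */
    SPAN_MANY_WINDOWS  /* ms < ws and at least one full window lies between them */
};

/* All times share one unit (for example nanoseconds). */
enum walt_span classify_span(uint64_t ms, uint64_t ws, uint64_t window_size)
{
    if (ms >= ws)
        return SPAN_ONE_WINDOW;
    if (ws - ms < window_size)
        return SPAN_TWO_WINDOWS;
    return SPAN_MANY_WINDOWS;
}
```

For example, with a 20 ms window and ws = 100 ms: ms = 105 ms gives the first case, ms = 95 ms the second, and ms = 50 ms the third (two full windows fit between ms and ws).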
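3.2 Task classification
As you would expect, different kinds of task, or tasks in different states, are computed differently. WALT classifies them by task event:
(Figure not reproduced here: it lists each task event together with the function that raises it. The events that appear in the code below are TASK_WAKE, PICK_NEXT_TASK, PUT_PREV_TASK, TASK_MIGRATE, TASK_UPDATE and IRQ_UPDATE.)
3.2.1 Deciding whether to update demand
When updating demand, the task event is checked first to decide whether an update is needed:
Corresponding function: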
static int account_busy_for_task_demand(struct task_struct *p, int event)
{
    /* No need to bother updating task demand for exiting tasks
     * or the idle task. */
    // The task has exited or is the idle task: nothing to compute
    if (exiting_task(p) || is_idle_task(p))
        return 0;
    /* When a task is waking up it is completing a segment of non-busy
     * time. Likewise, if wait time is not treated as busy time, then
     * when a task begins to run or is migrated, it is not running and
     * is completing a segment of non-busy time. */
    // walt_account_wait_time defaults to 1, so by default only TASK_WAKE returns 0 here
    if (event == TASK_WAKE || (!walt_account_wait_time &&
        (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
        return 0;
    return 1;
}
3.2.2 Deciding whether to update CPU busy time
When updating CPU busy time, the task event is checked first to decide whether an update is needed:
Corresponding function:
static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
                                     u64 irqtime, int event)
{
    // Is this the idle task or some other task?
    if (is_idle_task(p)) {
        /* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
        // schedule() picked the idle task as the next task
        if (event == PICK_NEXT_TASK)
            return 0;
        /* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
        // For the idle task, time still counts as busy if there was IRQ time or the CPU is waiting on IO
        return irqtime || cpu_is_waiting_on_io(rq);
    }
    // Wakeups are not accounted
    if (event == TASK_WAKE)
        return 0;
    // For non-idle tasks, the following event types are accounted
    if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
        event == TASK_UPDATE)
        return 1;
    /* Only TASK_MIGRATE && PICK_NEXT_TASK left */
    // Defaults to 0
    return walt_freq_account_wait_time;
}
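3.3 How is the data updated? (call flow)
The previous two subsections covered how a task is accounted against windows and how the task event decides which data gets updated. This part looks at the core call flow, starting with a diagram:
(Figure not reproduced here: a structure diagram exported from XMind.) The flow is:
- The entry function walt_update_task_ravg
- The demand update function
- The CPU busy time update function
3.3.1 The entry function
Corresponding function: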
/* Reflect task activity on its demand and cpu's busy time statistics */
void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
                           int event, u64 wallclock, u64 irqtime)
{
    // Return early if WALT is disabled or the rq window has not been started
    if (walt_disabled || !rq->window_start)
        return;
    lockdep_assert_held(&rq->lock);
    // Update window_start and cum_window_demand
    update_window_start(rq, wallclock);
    if (!p->ravg.mark_start)
        goto done;
    // Update the data: demand and busy time
    update_task_demand(p, rq, event, wallclock);
    update_cpu_busy_time(p, rq, event, wallclock, irqtime);
done:
    // trace
    trace_walt_update_task_ravg(p, rq, event, wallclock, irqtime);
    // Update mark_start
    p->ravg.mark_start = wallclock;
}
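The function does three main things:
- Updates the current window start time, in preparation for the data updates that follow;
- Updates the task's demand; note that this also updates the corresponding data in the RQ;
- Updates the task's share of CPU busy time;
This function is the main entry point of the WALT calculation and is called from many places (the leftmost part of the diagram above). In short, loading is updated on interrupts, wakeups, migration, and scheduling. I won't go through each case in detail:
- task wakeup
- task starts executing
- task stops executing
- task exit
- window rollover
- interrupt
- scheduler_tick
- task migration
- frequency change
3.3.2 Updating window start
Before the calculation, window_start is updated so that the rq's window start value is accurate:
Corresponding function: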
static void
update_window_start(struct rq *rq, u64 wallclock)
{
    s64 delta;
    int nr_windows;
    // Time elapsed since the start of the rq's current window
    delta = wallclock - rq->window_start;
    /* If the MPM global timer is cleared, set delta as 0 to avoid kernel BUG happening */
    if (delta < 0) {
        delta = 0;
        /*
         * WARN_ONCE(1,
         * "WALT wallclock appears to have gone backwards or reset\n");
         */
    }
    if (delta < walt_ravg_window) // less than one full window has elapsed: nothing to do
        return;
    nr_windows = div64_u64(delta, walt_ravg_window); // number of full windows elapsed
    rq->window_start += (u64)nr_windows * (u64)walt_ravg_window; // advance window_start by that many windows
    rq->cum_window_demand = rq->cumulative_runnable_avg; // on rollover, reset to cumulative_runnable_avg (tasks on the runqueue)
}
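As a worked example of the rollover arithmetic above, here is a standalone sketch that mirrors only the window_start update (it leaves out cum_window_demand). `roll_window_start` is a hypothetical helper written for this illustration, and the zero window size guard is an addition of the sketch, not something taken from the kernel code.

```c
#include <stdint.h>

/*
 * Hypothetical helper: advance window_start by the number of whole windows
 * that fit between it and wallclock. Returns the new window_start and stores
 * the number of windows rolled over in *nr_windows.
 */
uint64_t roll_window_start(uint64_t window_start, uint64_t wallclock,
                           uint64_t window_size, uint64_t *nr_windows)
{
    uint64_t delta;

    *nr_windows = 0;
    /* Clock went backwards (or bad window size): treat as "no time elapsed". */
    if (window_size == 0 || wallclock < window_start)
        return window_start;

    delta = wallclock - window_start;
    if (delta < window_size)
        return window_start; /* still inside the current window */

    *nr_windows = delta / window_size;
    return window_start + *nr_windows * window_size;
}
```

With a 20 ms window, window_start = 100 ms and wallclock = 167 ms, three whole windows have elapsed, so the new window_start is 160 ms. With wallclock = 119 ms nothing changes.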
3.3.3 Updating demand
3.3.3.1 Main demand logic:
Corresponding function:
/*
 * Account cpu demand of task and/or update task's cpu demand history
 *
 * ms = p->ravg.mark_start;
 * wc = wallclock
 * ws = rq->window_start
 *
 * Three possibilities:
 *
 * a) Task event is contained within one window.
 *    window_start < mark_start < wallclock
 *
 *      ws   ms  wc
 *      |    |   |
 *      V    V   V
 *      |---------------|
 *
 * In this case, p->ravg.sum is updated *iff* event is appropriate
 * (ex: event == PUT_PREV_TASK)
 *
 * b) Task event spans two windows.
 *    mark_start < window_start < wallclock
 *
 *        ms ws   wc
 *        |  |    |
 *        V  V    V
 *      -----|-------------------
 *
 * In this case, p->ravg.sum is updated with (ws - ms) *iff* event
 * is appropriate, then a new window sample is recorded followed
 * by p->ravg.sum being set to (wc - ws) *iff* event is appropriate.
 *
 * c) Task event spans more than two windows.
 *
 *      ms ws_tmp                          ws  wc
 *      |  |                               |   |
 *      V  V                               V   V
 *      ---|-------|-------|-------|-------|------
 *         |                               |
 *         |<------ nr_full_windows ------>|
 *
 * In this case, p->ravg.sum is updated with (ws_tmp - ms) first *iff*
 * event is appropriate, window sample of p->ravg.sum is recorded,
 * 'nr_full_window' samples of window_size is also recorded *iff*
 * event is appropriate and finally p->ravg.sum is set to (wc - ws)
 * *iff* event is appropriate.
 *
 * IMPORTANT : Leave p->ravg.mark_start unchanged, as update_cpu_busy_time()
 * depends on it!
 */
static void update_task_demand(struct task_struct *p, struct rq *rq,
                               int event, u64 wallclock)
{
    u64 mark_start = p->ravg.mark_start; // mark_start is a per-task value
    u64 delta, window_start = rq->window_start; // window_start is a per-rq value
    int new_window, nr_full_windows;
    u32 window_size = walt_ravg_window;
    // First check, ms vs ws: does the task's mark_start lie before the current window?
    new_window = mark_start < window_start;
    if (!account_busy_for_task_demand(p, event)) {
        if (new_window)
            /* If the time accounted isn't being accounted as
             * busy time, and a new window started, only the
             * previous window need be closed out with the
             * pre-existing demand. Multiple windows may have
             * elapsed, but since empty windows are dropped,
             * it is not necessary to account those. */
            update_history(rq, p, p->ravg.sum, 1, event);
        return;
    }
    // If ms >= ws this is case a: add wc - ms, the actual execution time within this window
    if (!new_window) {
        /* The simple case - busy time contained within the existing
         * window. */
        add_to_task_demand(rq, p, wallclock - mark_start);
        return;
    }
    // The busy time spans more than one window
    /* Busy time spans at least two windows. Temporarily rewind
     * window_start to first window boundary after mark_start. */
    // Time from ms to ws, which may contain several full windows
    delta = window_start - mark_start;
    nr_full_windows = div64_u64(delta, window_size);
    window_start -= (u64)nr_full_windows * (u64)window_size;
    // window_start now points at ws_tmp
    /* Process (window_start - mark_start) first */
    // First add the demand of the leading partial window
    add_to_task_demand(rq, p, window_start - mark_start);
    /* Push new sample(s) into task's demand history */
    // Update the history
    update_history(rq, p, p->ravg.sum, 1, event);
    if (nr_full_windows)
        update_history(rq, p, scale_exec_time(window_size, rq),
                       nr_full_windows, event);
    /* Roll window_start back to current to process any remainder
     * in current window. */
    // Restore window_start
    window_start += (u64)nr_full_windows * (u64)window_size;
    /* Process (wallclock - window_start) next */
    // Account the final (current) window; overall this resembles the PELT calculation, with history handling added
    mark_start = window_start;
    add_to_task_demand(rq, p, wallclock - mark_start);
}
// Demand accumulation:
static void add_to_task_demand(struct rq *rq, struct task_struct *p,
                               u64 delta)
{
    // Convert actual run time into a capacity-relative value; typically the CPU's capcurr is applied and the result divided by 1024
    delta = scale_exec_time(delta, rq);
    p->ravg.sum += delta;
    // Clamp sum to the window size if it would exceed it
    if (unlikely(p->ravg.sum > walt_ravg_window))
        p->ravg.sum = walt_ravg_window;
}
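The way update_task_demand carves a busy interval into a leading partial window, some number of full windows, and the current window can be checked with plain arithmetic. The sketch below is a simplification under assumptions: `struct demand_split` and `split_busy_time` are hypothetical, the event is assumed to count as busy time, and both the scale_exec_time conversion and the clamp to the window size are left out.

```c
#include <stdint.h>

/* Hypothetical result type: where the busy time between ms and wc is charged. */
struct demand_split {
    uint64_t prev_partial; /* ws_tmp - ms: tail of the window containing ms */
    uint64_t full_windows; /* whole windows between ws_tmp and ws */
    uint64_t current;      /* time charged to the window starting at ws */
};

/* All times share one unit. Assumes ws is the start of the window containing wc. */
struct demand_split split_busy_time(uint64_t ms, uint64_t ws, uint64_t wc,
                                    uint64_t window_size)
{
    struct demand_split s = { 0, 0, 0 };

    if (window_size == 0 || wc < ms)
        return s;
    if (ms >= ws) {
        /* Case a: everything happened inside the current window. */
        s.current = wc - ms;
        return s;
    }
    if (wc < ws)
        return s; /* inconsistent input for this sketch */

    /* Cases b and c: rewind ws by the number of full windows to find ws_tmp. */
    s.full_windows = (ws - ms) / window_size;
    s.prev_partial = (ws - ms) - s.full_windows * window_size;
    s.current = wc - ws;
    return s;
}
```

With a 20 ms window, ms = 95 ms, ws = 160 ms and wc = 167 ms, the split is 5 ms for the old window, 3 full windows, and 7 ms for the current window. The three parts add up to wc - ms = 72 ms, as expected.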
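3.3.3.2 update_history logic:
Summary of update_history:
- It is called when a task enters a new window;
- It updates the task's demand based on the past several windows;
- It updates the rq's usage in step, based on the newly computed demand;
Corresponding function: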
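The history itself is a small fixed-size array with the newest sample at index 0. The following sketch shows only the shifting step, under assumptions: `HIST_SIZE` is fixed at 5 to match the five windows mentioned earlier, and `push_history` is a hypothetical helper that leaves out the sum, max, and demand calculation.

```c
#include <stdint.h>

#define HIST_SIZE 5 /* assumed history depth for this sketch */

/*
 * Hypothetical helper: push `samples` copies of `runtime` onto the front of
 * the history, shifting older entries toward the end and dropping the oldest.
 */
void push_history(uint32_t hist[HIST_SIZE], uint32_t runtime, int samples)
{
    int widx = HIST_SIZE - 1;  /* write index: starts at the oldest slot */
    int ridx;                  /* read index: entry that survives into widx */

    if (samples <= 0)
        return;
    if (samples > HIST_SIZE)
        samples = HIST_SIZE;   /* more samples than slots: fill everything */

    /* Shift surviving entries toward the end of the array. */
    for (ridx = widx - samples; ridx >= 0; --widx, --ridx)
        hist[widx] = hist[ridx];

    /* Fill the freed front slots with the new sample value. */
    for (widx = 0; widx < samples; widx++)
        hist[widx] = runtime;
}
```

Starting from {10, 8, 6, 4, 2}, pushing one sample of 20 gives {20, 10, 8, 6, 4}. Pushing three samples of 20 onto the same starting array gives {20, 20, 20, 10, 8}.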
/*
* Called when new window is starting for a task, to record cpu usage over
* recently concluded window(s). Normally 'samples' should be 1. It can be > 1
* when, say, a real-time task runs without preemption for several windows at a
* stretch.
*/
static void update_history(struct rq *rq, struct task_struct *p,
                           u32 runtime, int samples, int event)
{
    u32 *hist = &p->ravg.sum_history[0]; // pointer to the task's per-window history array
    int ridx, widx;
    u32 max = 0, avg, demand;
    u64 sum = 0;
    /* Ignore windows where task had no activity */
    if (!runtime || is_idle_task(p) || exiting_task(p) || !samples)
        goto done;
    /* Push new 'runtime' value onto stack */
    widx = walt_ravg_hist_size - 1; // write index: last (oldest) slot of the history
    ridx = widx - samples; // read index: the oldest 'samples' entries are dropped
    // The two loops below shift the existing entries, insert the new window(s) at the front, and update sum and max
    for (; ridx >= 0; --widx, --ridx) {
        hist[widx] = hist[ridx];
        sum += hist[widx];
        if (hist[widx] > max)
            max = hist[widx];
    }
    for (widx = 0; widx < samples && widx < walt_ravg_hist_size; widx++) {
        hist[widx] = runtime;
        sum += hist[widx];
        if (hist[widx] > max)
            max = hist[widx];
    }
    // Reset the task's running sum for the new window
    p->ravg.sum = 0;
    // demand is derived from the history windows according to the policy; our default is policy 2, WINDOW_STATS_MAX_RECENT_AVG, which takes the larger of the historical average and the most recent value
    if (walt_window_stats_policy == WINDOW_STATS_RECENT) {
        demand = runtime;
    } else if (walt_window_stats_policy == WINDOW_STATS_MAX) {
        demand = max;
    } else {
        avg = div64_u64(sum, walt_ravg_hist_size);
        if (walt_window_stats_policy == WINDOW_STATS_AVG)
            demand = avg;
        else
            demand = max(avg, runtime);
    }
    /*
     * A throttled deadline sched class task gets dequeued without
     * changing p->on_rq. Since the dequeue decrements hmp stats
     * avoid decrementing it here again.
     *
     * When window is rolled over, the cumulative window demand
     * is reset to the cumulative runnable average (contribution from
     * the tasks on the runqueue). If the current task is dequeued
     * already, it's demand is not included in the cumulative runnable
     * average. So add the task demand separately to cumulative window
     * demand.
     */
    // Fix up the runnable_avg statistics, unless this is a throttled deadline task
    if (!task_has_dl_policy(p) || !p->dl.dl_throttled) {
        if (task_on_rq_queued(p)) // the task is still on the runqueue
            fixup_cumulative_runnable_avg(rq, p, demand); // add the difference between the new demand and the demand recorded in the task to cumulative_runnable_avg
        else if (rq->curr == p) // the task is the current task but has already been dequeued
            fixup_cum_window_demand(rq, demand); // add demand to the rq's cumulative window demand
    }
    // Finally store the computed demand in the task
    p->ravg.demand = demand;
done:
    trace_walt_update_history(rq, p, runtime, samples, event);
    return;
}
// Update the value of cumulative_runnable_avg:
static void