Notes on WALT for CPU Load Balancing [repost]

Reposted from: https://blog.csdn.net/xiaoqiaoq0/article/details/107135747/

Preface

Continuing the series on CPU scheduling, this post covers WALT in three parts:

  1. What is WALT?
  2. How does WALT compute its statistics?
  3. How are WALT's statistics used?

1. What is WALT?

WALT stands for Window Assisted Load Tracking:
- As the name suggests, it tracks CPU load using windows of time as the bookkeeping unit;
- In essence it is a calculation method that expresses a CPU's current load as numbers, which then feed task placement, migration, load balancing, and other decisions;

1.1 Why is WALT needed?

A new technique, especially a new way of computing something, usually appears because the old technique no longer fits current needs (or because it makes life easier);

1.1.1 Where does PELT fall short?

When PELT was introduced, Linux was used mainly on servers, where raw performance mattered most and power consumption was not yet a primary concern. With the rise of mobile devices, power and responsiveness became factors users perceive directly, and they now drive scheduler design:

  1. On today's mobile devices, UI-related workloads must be served as quickly as possible, otherwise the user notices jank;
  2. Power consumption is unavoidable: a phone that needs constant charging will not sell well;
  3. Whether a task is "heavy" should depend on the usage scenario: the same class of task matters more or less depending on what is being displayed, so its weight needs to change dynamically;

PELT's geometrically decaying average is good at capturing a continuous trend, but it is unfriendly to sharp, sudden changes:

  1. It responds slowly to rapid rises and falls: because of the decay, an actual rise or fall in load only shows up in the numbers after several periods;
  2. Because of the decay, a task that sleeps for a while sees its computed load shrink; if that task is something like periodic network I/O, which intermittently needs CPU time and frequency, the response is too slow (the decayed average captures a trend and an average rather than the current state).

1.2 What WALT does about it

Given the above, we need a calculation that keeps PELT's benefits while better fitting current needs:

  1. report data more promptly;
  2. have the data reflect the current state directly;
  3. do not increase the computational cost.

1.2.1 How WALT handles it

Here is a summary of what WALT achieves (and needs to achieve):

  1. It keeps tracking every task entity;
  2. On top of the existing usage (load), it records a demand value used for later prediction;
  3. The overall load of each CPU's runqueue is still the sum over all its tasks;
  4. The core difference is in the calculation: the decaying average is replaced by time windows, so the sampled data tracks actual changes quickly (versus PELT's trend). Some notes from the upstream documentation:
    1. A task's demand is the maximum of its contribution to the most recently completed window and its average demand over the past N windows.
    2. WALT "forgets" blocked time entirely: only runnable and running time are accounted, which gives a more accurate picture of a task's real cost and allows prediction via demand;
    3. CPU busy time - The sum of execution times of all tasks in the most recently completed window;
    4. WALT "forgets" cpu utilization as soon as tasks are taken off of the runqueue.

1.2.2 Where it is applied

  1. Per-CPU and per-task load statistics consulted before task placement;
  2. Task migration;
  3. Placement across big and little cores;
  4. EAS (Energy Aware Scheduling) decisions.

1.3 When it was introduced

  1. Mainline Linux: after 4.8.2 (though browsing the code on bootlin, even the latest 5.8 has no corresponding file);
  2. Android: after android 4.4 (the android kernel 4.9 tree does contain it).

2. Enabling WALT in the kernel

The android kernel source already contains this code, but depending on the vendor it may not be enabled:

  1. Enable via menuconfig:
    1. menuconfig ==> General setup ==> CPU/Task time and stats accounting ==> support window based load tracking
    2. (figure: kernel config)
  2. Or edit the config directly:
    1. add CONFIG_SCHED_WALT=y in kernel/arch/arm64/config/defconfig
  3. Build the image and verify the change took effect:
    demo:/sys/kernel/tracing # zcat /proc/config.gz | grep WALT

    CONFIG_SCHED_WALT=y
    CONFIG_HID_WALTOP=y

  4. Testing
    So far I have only confirmed in ftrace that WALT statistics are indeed being collected; I have no real application measurements confirming an actual improvement (the upstream documentation has some numbers, but those are not local tests).

3. How WALT computes

This section explains WALT's calculation from both the principle and the code:

  1. How are windows divided?
  2. How are tasks classified, and how is each class handled?
  3. How is the WALT data updated?
  4. How is the updated data consumed by the scheduler and EAS?

3.1 Window division

First, how is the auxiliary bookkeeping unit, the window, divided?
Roughly: starting from boot, time is cut into fixed-length periods; each task's load is accounted per period and rolled up into its runqueue.

Two parameters need choosing:

  1. How long should one window be? This is tuned per project; the kernel's current default is 20ms;
  2. How many windows of history should be kept? Also tuned per project; the kernel currently keeps 5 windows.

So for a given task and window, the following cases are possible
(ms = mark_start, when the task segment started; ws = window_start, start of the current window; wc = wallclock, the current time):

  1. The task started in this window and is still in it when accounting happens, i.e. the task stays within one window;
  2. The task started in the previous window and accounting happens in the current one, i.e. the task spans two windows;
  3. The task started several windows back and accounting happens in the current one, i.e. the task spans one or more complete windows in between;

A task's position relative to windows always falls into one of these three cases, and all of the calculations below are based on this split.

3.2 Task classification

As you would expect, different task classes and states use different formulas. WALT classifies tasks by scheduler event (figure: task classification, listing the function called for each task event).

3.2.1 Deciding whether to update demand

When updating demand, the task event is first checked to decide whether an update is needed at all (the per-event differences are shown in the figure). The corresponding function:

static int account_busy_for_task_demand(struct task_struct *p, int event)
{
	/* No need to bother updating task demand for exiting tasks
	 * or the idle task. */
	// an exiting task or the idle task needs no accounting
	if (exiting_task(p) || is_idle_task(p))
		return 0;

	/* When a task is waking up it is completing a segment of non-busy
	 * time. Likewise, if wait time is not treated as busy time, then
	 * when a task begins to run or is migrated, it is not running and
	 * is completing a segment of non-busy time. */
	// walt_account_wait_time defaults to 1, so only TASK_WAKE is filtered out
	if (event == TASK_WAKE || (!walt_account_wait_time &&
			 (event == PICK_NEXT_TASK || event == TASK_MIGRATE)))
		return 0;

	return 1;
}

3.2.2 Deciding whether to update CPU busy time

When updating CPU busy time, the task event is likewise checked first (the per-event differences are shown in the figure). The corresponding function:

static int account_busy_for_cpu_time(struct rq *rq, struct task_struct *p,
				     u64 irqtime, int event)
{
	// is this the idle task or a normal task?
	if (is_idle_task(p)) {
		/* TASK_WAKE && TASK_MIGRATE is not possible on idle task! */
		// the scheduler picked the idle task as next
		if (event == PICK_NEXT_TASK)
			return 0;

		/* PUT_PREV_TASK, TASK_UPDATE && IRQ_UPDATE are left */
		// idle time spent in irq or waiting on IO does count as busy time
		return irqtime || cpu_is_waiting_on_io(rq);
	}

	// a wakeup does not count
	if (event == TASK_WAKE)
		return 0;

	// for non-idle tasks, these event types count
	if (event == PUT_PREV_TASK || event == IRQ_UPDATE ||
					 event == TASK_UPDATE)
		return 1;

	/* Only TASK_MIGRATE && PICK_NEXT_TASK left */
	// defaults to 0
	return walt_freq_account_wait_time;
}

 

3.3 How the data is updated (call flow)

The previous two subsections covered how a task is accounted against windows and which events trigger which updates. Now for the core call flow; first, an overview figure (exported from xmind; it may not be legible when zoomed). The flow is:

  1. the entry function walt_update_task_ravg;
  2. the demand update function;
  3. the cpu busy time update function.

3.3.1 The entry function

walt_update_task_ravg
The corresponding function:

/* Reflect task activity on its demand and cpu's busy time statistics */
void walt_update_task_ravg(struct task_struct *p, struct rq *rq,
		 int event, u64 wallclock, u64 irqtime)
{
	// bail out early
	if (walt_disabled || !rq->window_start)
		return;
	lockdep_assert_held(&rq->lock);
	// update window_start and cum_window_demand
	update_window_start(rq, wallclock);

	if (!p->ravg.mark_start)
		goto done;
	// update the statistics: demand and busy time
	update_task_demand(p, rq, event, wallclock);
	update_cpu_busy_time(p, rq, event, wallclock, irqtime);

done:
	// trace
	trace_walt_update_task_ravg(p, rq, event, wallclock, irqtime);
	// update mark_start
	p->ravg.mark_start = wallclock;
}

 

The function does three things:

  1. updates the current window start time, in preparation for the updates that follow;
  2. updates the task's demand value (note this also updates the corresponding runqueue data);
  3. updates the task's cpu busy time contribution.

This function is the main entry point of the WALT calculation. It is called from many places (the left side of the figure above); in short, load is updated on interrupts, wakeups, migrations, and scheduling, so I will not walk through every call site:

  1. task awakened
  2. task starts executing
  3. task stops executing
  4. task exit
  5. window rollover
  6. interrupt
  7. scheduler_tick
  8. task migration
  9. freq change

3.3.2 Updating window start

Before any accounting, window_start is brought up to date so the runqueue's window origin is accurate. The corresponding function:

static void
update_window_start(struct rq *rq, u64 wallclock)
{
	s64 delta;
	int nr_windows;
	// elapsed time since the window started
	delta = wallclock - rq->window_start;
	/* If the MPM global timer is cleared, set delta as 0 to avoid kernel BUG happening */
	if (delta < 0) {
		delta = 0;
		/*
		 * WARN_ONCE(1,
		 * "WALT wallclock appears to have gone backwards or reset\n");
		 */
	}

	if (delta < walt_ravg_window) // still within one window period: nothing to do
		return;

	nr_windows = div64_u64(delta, walt_ravg_window); // number of whole windows elapsed
	rq->window_start += (u64)nr_windows * (u64)walt_ravg_window; // advance window_start

	rq->cum_window_demand = rq->cumulative_runnable_avg; // reseed from cumulative_runnable_avg
}

 

3.3.3 Updating demand

3.3.3.1 Main demand logic

The corresponding function:

/*
 * Account cpu demand of task and/or update task's cpu demand history
 *
 * ms = p->ravg.mark_start;
 * wc = wallclock
 * ws = rq->window_start
 *
 * Three possibilities:
 *
 *	a) Task event is contained within one window.
 *		window_start < mark_start < wallclock
 *
 *		ws   ms  wc
 *		|    |   |
 *		V    V   V
 *		|---------------|
 *
 *	In this case, p->ravg.sum is updated *iff* event is appropriate
 *	(ex: event == PUT_PREV_TASK)
 *
 *	b) Task event spans two windows.
 *		mark_start < window_start < wallclock
 *
 *		ms   ws   wc
 *		|    |    |
 *		V    V    V
 *		-----|-------------------
 *
 *	In this case, p->ravg.sum is updated with (ws - ms) *iff* event
 *	is appropriate, then a new window sample is recorded followed
 *	by p->ravg.sum being set to (wc - ws) *iff* event is appropriate.
 *
 *	c) Task event spans more than two windows.
 *
 *		ms ws_tmp			   ws  wc
 *		|  |				   |   |
 *		V  V				   V   V
 *		---|-------|-------|-------|-------|------
 *		   |				   |
 *		   |<------ nr_full_windows ------>|
 *
 *	In this case, p->ravg.sum is updated with (ws_tmp - ms) first *iff*
 *	event is appropriate, window sample of p->ravg.sum is recorded,
 *	'nr_full_window' samples of window_size is also recorded *iff*
 *	event is appropriate and finally p->ravg.sum is set to (wc - ws)
 *	*iff* event is appropriate.
 *
 * IMPORTANT : Leave p->ravg.mark_start unchanged, as update_cpu_busy_time()
 * depends on it!
 */
static void update_task_demand(struct task_struct *p, struct rq *rq,
		 int event, u64 wallclock)
{
	u64 mark_start = p->ravg.mark_start; // mark_start belongs to the task
	u64 delta, window_start = rq->window_start; // window_start belongs to the rq
	int new_window, nr_full_windows;
	u32 window_size = walt_ravg_window;

	// first check: did this task segment start before the current window?
	new_window = mark_start < window_start;
	if (!account_busy_for_task_demand(p, event)) {
		if (new_window)
			/* If the time accounted isn't being accounted as
			 * busy time, and a new window started, only the
			 * previous window need be closed out with the
			 * pre-existing demand. Multiple windows may have
			 * elapsed, but since empty windows are dropped,
			 * it is not necessary to account those. */
			update_history(rq, p, p->ravg.sum, 1, event);
		return;
	}

	// if ms >= ws this is case a: account wc - ms, the time run within this window
	if (!new_window) {
		/* The simple case - busy time contained within the existing
		 * window. */
		add_to_task_demand(rq, p, wallclock - mark_start);
		return;
	}

	// the segment spans more than one window
	/* Busy time spans at least two windows. Temporarily rewind
	 * window_start to first window boundary after mark_start. */
	// the time from ms to ws may contain several complete windows
	delta = window_start - mark_start;
	nr_full_windows = div64_u64(delta, window_size);
	window_start -= (u64)nr_full_windows * (u64)window_size;
	// window_start is now rewound to ws_tmp

	/* Process (window_start - mark_start) first */
	// first add the partial window at the start
	add_to_task_demand(rq, p, window_start - mark_start);

	/* Push new sample(s) into task's demand history */
	// update the history
	update_history(rq, p, p->ravg.sum, 1, event);
	if (nr_full_windows)
		update_history(rq, p, scale_exec_time(window_size, rq),
				   nr_full_windows, event);

	/* Roll window_start back to current to process any remainder
	 * in current window. */
	// restore window_start
	window_start += (u64)nr_full_windows * (u64)window_size;

	/* Process (wallclock - window_start) next */
	// account the final partial window; overall this replaces PELT's decay with
	// window samples plus a history
	mark_start = window_start;
	add_to_task_demand(rq, p, wallclock - mark_start);
}

// demand accumulation:
static void add_to_task_demand(struct rq *rq, struct task_struct *p,
		u64 delta)
{
	// convert real runtime into a capacity-scaled amount, roughly by taking the
	// CPU's capcurr and dividing by 1024
	delta = scale_exec_time(delta, rq);
	p->ravg.sum += delta;
	// clamp the sum to one window size
	if (unlikely(p->ravg.sum > walt_ravg_window))
		p->ravg.sum = walt_ravg_window;
}

3.3.3.2 The update_history logic

Summary of update_history:

  1. it is called when a task enters a new window;
  2. it updates the task's demand based on the last few windows;
  3. it also updates the runqueue's usage from the newly computed demand.

The corresponding function:
/*
 * Called when new window is starting for a task, to record cpu usage over
 * recently concluded window(s). Normally 'samples' should be 1. It can be > 1
 * when, say, a real-time task runs without preemption for several windows at a
 * stretch.
 */

static void update_history(struct rq *rq, struct task_struct *p,
			 u32 runtime, int samples, int event)
{
	u32 *hist = &p->ravg.sum_history[0]; // per-window history array
	int ridx, widx;
	u32 max = 0, avg, demand;
	u64 sum = 0;

	/* Ignore windows where task had no activity */
	if (!runtime || is_idle_task(p) || exiting_task(p) || !samples)
			goto done;

	/* Push new 'runtime' value onto stack */
	widx = walt_ravg_hist_size - 1; // last slot of the history
	ridx = widx - samples; // number of old windows to drop from the array

	// the two loops below push the new window(s) into the history and track sum and max
	for (; ridx >= 0; --widx, --ridx) {
		hist[widx] = hist[ridx];
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}

	for (widx = 0; widx < samples && widx < walt_ravg_hist_size; widx++) {
		hist[widx] = runtime;
		sum += hist[widx];
		if (hist[widx] > max)
			max = hist[widx];
	}
	// reset the task's running sum
	p->ravg.sum = 0;

	// demand is derived from the history according to the policy; the default is
	// policy 2, WINDOW_STATS_MAX_RECENT_AVG: take the larger of the past average
	// and the latest window
	if (walt_window_stats_policy == WINDOW_STATS_RECENT) {
		demand = runtime;
	} else if (walt_window_stats_policy == WINDOW_STATS_MAX) {
		demand = max;
	} else {
		avg = div64_u64(sum, walt_ravg_hist_size);
		if (walt_window_stats_policy == WINDOW_STATS_AVG)
			demand = avg;
		else
			demand = max(avg, runtime);
	}

	/*
	 * A throttled deadline sched class task gets dequeued without
	 * changing p->on_rq. Since the dequeue decrements hmp stats
	 * avoid decrementing it here again.
	 *
	 * When window is rolled over, the cumulative window demand
	 * is reset to the cumulative runnable average (contribution from
	 * the tasks on the runqueue). If the current task is dequeued
	 * already, it's demand is not included in the cumulative runnable
	 * average. So add the task demand separately to cumulative window
	 * demand.
	 */
	// fix up the runnable averages, unless this is a throttled deadline task
	if (!task_has_dl_policy(p) || !p->dl.dl_throttled) {
		if (task_on_rq_queued(p)) // queued on the runqueue but not currently running
			fixup_cumulative_runnable_avg(rq, p, demand); // fold the delta between new and recorded demand into cumulative_runnable_avg
		else if (rq->curr == p) // the task is the one currently running
			fixup_cum_window_demand(rq, demand); // add the demand to the rq
	}
	// finally store the computed demand back into the task
	p->ravg.demand = demand;

done:
	trace_walt_update_history(rq, p, runtime, samples, event);
	return;
}
