Database Internals Digest (19): Self-Driving Databases - Workload Forecasting

Welcome to another installment of Database Internals Digest. Apologies to everyone who has been asking for updates; work has been hectic and this one is long overdue. Continuing from the previous installment on the NoisePage project (https://noise.page/about/) and self-driving databases, this time we look at the paper "Query-based Workload Forecasting for Self-Driving Database Management Systems", published at SIGMOD 2018. (A side note: that SIGMOD was held in Houston and I attended, but I missed this paper because I got sick and spent two days in the hotel. That year's best paper was also by one of Professor Pavlo's students.)

Why? Why forecast workloads at all?

The most important part of any research is explaining why it is worth doing. The benefit of forecasting workloads is clear: it lets the database prepare for incoming workloads ahead of time, whether in terms of resources (e.g., pre-allocating compute) or workload-specific optimizations (e.g., building indexes in advance, or setting the right optimizer options). For cloud-native databases with separated storage and compute in particular, accurate forecasting means the system can gear up before a flood of workloads arrives, and scale down to save cost during quiet periods.

What makes forecasting hard

Having covered the why, let us consider the difficulty of the problem. The paper does not directly address it, but I think it is an important question. If workloads in production were highly regular, would it even be worth the effort to forecast them? This question highlights a difference between academia and industry: in industry, resources are committed only to problems that are real and high priority.

As a data-platform practitioner, I can say with reasonable confidence that the large majority of workloads (perhaps 80-90%) are very regular, because most of them are scheduled ETL jobs. These jobs run daily, triggered by a scheduler (usually in the early morning), to process that day's data. Once the pipelines reach a steady state, the daily data volume and the compute it requires are highly predictable. Simply using yesterday's history to predict today's workloads could already be close to 80-90% accurate. A data platform that focuses on optimizing the database for this large, regular majority is already a big success, and may never spend effort forecasting the remaining long tail.

The paper does not dwell on the difficulty, but it does introduce the three workloads used in the experiments:

1) Admissions: a university admissions website, where students submit applications and professors review them;
2) BusTracker: a mobile app for a public bus system that helps riders find nearby stops (how much does this really exercise a database? It looks almost entirely read-only);
3) MOOC: an online-learning web application from CMU.

My own view is that these use cases could be chosen better. They are not very representative, and the data volumes involved cannot be large, which weakens the persuasiveness of the prediction experiments. For this kind of research, collaborating with an internet company would make the results far more convincing.

With that, let us look at the system and methods the paper proposes.

QueryBot5000 overview

The forecasting system is called QueryBot5000 (https://github.com/malin1993ml/QueryBot5000; QB5 below). It can be deployed as a standalone application that receives query statements from the database system. Its operation is best understood alongside the architecture diagram.

(Figure: QueryBot5000 architecture - https://static001.geekbang.org/infoq/02/02bbdb4199172a78feed35c083a7546c.png)

QB5's workflow has three stages:

1) Pre-Processor: incoming queries are preprocessed into statement templates, for example by replacing constants with an abstract placeholder (SELECT * FROM foo WHERE id = 5; ===> SELECT * FROM foo WHERE id = $;). For each template, the system records the arrival time of every matching query.

2) Clusterer: even after templatization, the sheer number of templates makes predicting future patterns computationally expensive. To improve efficiency further, QB5 clusters semantically similar templates into groups, using an online clustering algorithm to merge templates whose arrival-time patterns are alike.

3) Forecaster: QB5 trains forecasting models for the large clusters (groups containing many statements). A forecast says how many queries from a given cluster will be submitted, and when, in the future.

Within this pipeline, the Pre-Processor records new statements and their arrival times online in real time, while the Clusterer and Forecaster are periodic jobs that read the latest template and cluster information to re-cluster and re-forecast. The following sections cover each component in turn.

Pre-Processor

Preprocessing has two steps:

1) Replace all constants with a placeholder (such as $): the filter predicates as in the example above, and also the values assigned to columns in UPDATE statements, the VALUES of INSERT statements, and so on.

2) Normalize the statement's formatting: strip redundant whitespace and parentheses, unify keyword casing, and so on, so that templates are as standardized as possible. QB5 then records each template's query history as tuples of the form (time interval, count). The interval is a tunable parameter; it could be an hour or a minute, and the experiments at the end compare different interval settings.

Stepping outside the paper for a moment, the choice of using the preprocessed template text as a query's unique identifier deserves some discussion. The advantage is simplicity of implementation: it is mostly string processing. The drawback is that text matching, even after normalization, will inevitably miss some matches (for example, it handles certain syntactic sugar poorly). How could this be improved? One idea is to run the statement through a parser into a syntax tree, then through binding and standard transformations into a logical plan. A logical plan is a far more normalized representation than text: by that stage syntactic sugar has been desugared, and the transformations should unify queries that are semantically identical but syntactically different. Representing queries by their logical plans should therefore work better. The paper mentions other research on semantic equivalence, but does not use it here.

Clusterer

The clusterer acts as a second round of preprocessing, compressing the number of templates further. The paper explains why: training a model costs time and compute (over three minutes per model, per the paper), so efficiency has to be improved further. The clusterer merges similar templates into groups by first defining and extracting high-dimensional features for each template, then grouping templates with similar feature vectors.

Three groups of features are extracted: physical-execution features, logical-plan features, and the statement's arrival history. Physical-execution features cover the resources the database used to run the statement, such as CPU and memory, plus result-related metrics such as how many tuples were read and the size of the final result set. As the paper notes, resource usage depends on the database instance and its current state: the same statement run on different instances, or on the same instance under a different workload, produces different numbers, so these features can introduce noise.

Logical-plan features are in the same spirit as the pre-processing improvement discussed above: which tables and columns the statement reads, which join patterns it contains, and so on. The paper notes that the accuracy of these features varies as well.

The last feature group, and one of the paper's innovations, is the statement's arrival history: when queries of this template typically arrive, and how many times they run when they do. The figure below shows several different statements with very similar histories; such statements are more likely to be grouped together.

(Figure: statements with similar arrival histories - https://static001.geekbang.org/infoq/c2/c2dcd563da2742f6368101c25218f822.png)

Clustering algorithm: QB5 uses a modified version of DBSCAN, changed so that the algorithm is not thrown off by small clusters or by cluster density. The modification is that, when deciding whether a point belongs to a cluster, the algorithm compares the point against the cluster's center (center of a cluster) rather than against the cluster's other points.

The algorithm has three steps:

1) For a newly arrived template, compute its similarity score against each existing cluster. If any score exceeds a threshold, add the template to the cluster with the highest score and update that cluster's center. If no cluster's score passes the threshold, create a new cluster from this template.

2) After a cluster's information is updated, every template already in the cluster is re-compared against the new center, and templates that no longer meet the threshold are evicted (the cluster is then recomputed, and evicted templates are re-assigned to other clusters).

3) QB5 also compares the center points of different clusters, and merges two clusters whose similarity score exceeds the threshold.

The figure below summarizes the three steps.

(Figure: the three clustering steps - https://static001.geekbang.org/infoq/9d/9d98a89642aef42736085f0787e9be83.png)

This is an online clustering algorithm: as new templates arrive, it keeps repeating the steps above, adding templates to clusters and recomputing, until convergence.

Cluster pruning: because of the long tail, the algorithm produces many small clusters, so clusters whose statements occur too infrequently are excluded from model training.

At this point we have all the clusters, along with each cluster's query-volume history, which the paper records as queries per minute.

Forecaster

The final step is to use forecasting models to predict each cluster's future pattern, expressed as query counts per minute-granularity interval.

(Figure: comparison of forecasting models - https://static001.geekbang.org/infoq/a6/a6dd7f8fe6b7777851a19db94d717c5d.png)

The paper covers six forecasting models and compares them along three dimensions, shown in the figure above.

The three dimensions first:

1) Linear: whether the model's complexity is linear in its input.
2) Memory: whether the model can keep its input and current state in memory (which makes it more efficient).
3) Kernel: whether the model can use kernel methods. (My own machine-learning knowledge is shallow here; I mainly know kernels from SVMs, where they map data into a different feature space.)

The paper then briefly introduces the models (I do not fully understand all of them either): 1) Linear Regression (LR): familiar to most readers; a relatively simple model that is cheap to compute. 2) RNN: recurrent neural networks; variants such as LSTM (long short-term memory) are widely used in natural language processing. 3) KR: kernel regression. The paper gives little detail on the remaining models; presumably they are RNN variants.

With query samples in hand, workloads grouped via preprocessing and clustering, and forecasting models chosen, what remains is a large body of experiments. The rest of the paper is a multi-dimensional reading of those experiments; interested readers should dig into it. From a quick look, prediction is very accurate for highly regular workloads. But I still believe that most workloads (especially those in production on big-data platforms) are strongly periodic to begin with, so predicting them should not be hard.

Closing thoughts

Although this paper's direct impact on large internet data platforms is limited (it probably cannot be lifted as-is into industrial practice), making data platforms intelligent, especially on top of cloud-native infrastructure, is clearly the trend. As data volumes grow, data dimensions multiply, and business logic gets more complex, there is plenty of room to imagine how artificial intelligence can help manage, adapt, and upgrade data platforms.

That is all for today. We learned how NoisePage forecasts future workloads: reduce the computation through preprocessing and clustering, then use forecasting models to predict the arrival pattern of the statements in each cluster. The paper also systematically compares how different forecasting models perform on different workloads.

Thanks for reading this installment! One of my goals for 2021 is to write more about technology and management. If you would like to interact more, follow my Zhishi Xingqiu group: Dr.ZZZ 聊技術和管理 (https://t.zsxq.com/feEUfay).
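To make the Pre-Processor's templatization step concrete, here is a minimal sketch in Python. It is not QB5's actual implementation (which works at the SQL level inside the system); the two regexes and the blanket lower-casing are simplifying assumptions, and a real implementation would use a proper SQL parser.

```python
import re

def templatize(sql: str) -> str:
    """Turn a raw SQL string into a statement template, roughly as the
    Pre-Processor is described: constants become '$' and the text is
    normalized. A sketch only; real systems templatize at the parser level."""
    s = sql.strip()
    # Replace string literals first ('' is an escaped quote), then bare numbers.
    s = re.sub(r"'(?:[^']|'')*'", "$", s)
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "$", s)
    # Normalize formatting: collapse whitespace; lower-case everything
    # (cruder than only unifying keyword casing, but fine for a sketch).
    s = re.sub(r"\s+", " ", s).lower()
    return s
```

For example, `templatize("SELECT * FROM foo WHERE id = 5;")` yields the template `select * from foo where id = $;`, so repeated submissions with different ids all map to one template whose (interval, count) history can be tracked.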
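The center-based online clustering loop can likewise be sketched. This toy version implements only step 1 of the algorithm described above (assign a template's feature vector to the most similar cluster by center similarity, or start a new cluster), using cosine similarity and a threshold rho; eviction and cluster merging (steps 2 and 3) are omitted, and the paper's actual similarity measure and feature vectors differ.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class OnlineClusterer:
    """Toy center-based online clustering: a template joins the most
    similar cluster if similarity with the cluster *center* beats rho,
    otherwise it starts a new cluster."""

    def __init__(self, rho=0.8):
        self.rho = rho
        self.clusters = []  # each: {"center": [...], "members": [[...], ...]}

    def add(self, vec):
        best, best_score = None, self.rho
        for c in self.clusters:
            s = cosine(vec, c["center"])
            if s >= best_score:
                best, best_score = c, s
        if best is None:
            self.clusters.append({"center": list(vec), "members": [vec]})
        else:
            best["members"].append(vec)
            n = len(best["members"])
            # Recompute the center as the member mean after every insertion.
            best["center"] = [sum(m[i] for m in best["members"]) / n
                              for i in range(len(vec))]
```

Comparing only against the center keeps each assignment O(number of clusters) rather than O(number of templates), which is the efficiency point the modification is after.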
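Finally, a toy stand-in for the linear-regression forecaster: fit y[t] = a + b*y[t-1] by ordinary least squares over a cluster's per-minute counts, then roll the prediction forward. The paper's LR model uses many past intervals as features, not a single lag; this one-lag version is purely illustrative.

```python
def fit_ar1(series):
    """Fit y[t] = a + b * y[t-1] by ordinary least squares."""
    xs, ys = series[:-1], series[1:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    a = my - b * mx
    return a, b

def forecast(series, steps=3):
    """Predict the next `steps` per-interval counts for one cluster."""
    a, b = fit_ar1(series)
    out, last = [], series[-1]
    for _ in range(steps):
        last = a + b * last
        out.append(last)
    return out
```

On the steadily growing history [1, 2, 3, 4, 5], the fit recovers a = 1, b = 1 and the forecast continues 6, 7, 8; real query-volume series are noisier and periodic, which is why the paper also evaluates RNNs and kernel regression.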