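Given the table below, use Scala functions to complete the following task.
Task: for each user and each month, compute the highest single-month visit count up to and including that month, and the cumulative visit count up to that month.
Approach:
- Load the table: read every line of the file with Source.fromFile(...).getLines() and convert the result to an Array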
scala> import scala.io.Source
import scala.io.Source
scala> val lines = Source.fromFile("path/to/local/file").getLines().toArray
lines: Array[String] = Array(
A,2015-01,5, A,2015-01,15, B,2015-01,5,
A,2015-01,8, B,2015-01,25, A,2015-01,5,
A,2015-02,4, A,2015-02,6, B,2015-02,10,
B,2015-02,5, A,2015-03,16, A,2015-03,22,
B,2015-03,23, B,2015-03,10, B,2015-03,11)
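The REPL call above never closes the file. As a minimal sketch, assuming Scala 2.13 or later (where scala.util.Using is available), the read can be wrapped so the file handle is released once the lines are in memory. The object name ReadLines and the method readAll are hypothetical, introduced only for illustration.

```scala
import scala.io.Source
import scala.util.Using

object ReadLines {
  // Reads every line of a text file into an Array and closes the file afterwards.
  // `path` stands for whatever local file holds the visit records (an assumption).
  def readAll(path: String): Array[String] =
    Using.resource(Source.fromFile(path, "UTF-8"))(_.getLines().toArray)
}
```

The lines are copied into the Array before the source is closed, so the result stays usable after the call returns.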
- Use map to transform each line: split it on the "," delimiter and, because the columns have different types, pack the fields into a tuple
scala> lines.map(x=>{
| val y=x.split(",")
| (y(0),y(1),y(2).toInt)
| })
res0: Array[(String, String, Int)] = Array(
(A,2015-01,5), (A,2015-01,15), (B,2015-01,5),
(A,2015-01,8), (B,2015-01,25), (A,2015-01,5),
(A,2015-02,4), (A,2015-02,6), (B,2015-02,10),
(B,2015-02,5), (A,2015-03,16), (A,2015-03,22),
(B,2015-03,23), (B,2015-03,10), (B,2015-03,11))
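The split-and-index approach throws on a short line or a non-numeric count. A small sketch of a more defensive parse is shown here; the Visit case class and the parse function are hypothetical names, not part of the original solution, which keeps plain (String, String, Int) tuples.

```scala
import scala.util.Try

object ParseVisits {
  // Hypothetical record type standing in for the (String, String, Int) tuple.
  final case class Visit(user: String, month: String, count: Int)

  // Splits a line such as "A,2015-01,5" into its three fields.
  // Returns None when the line does not have exactly three fields
  // or when the third field is not an integer.
  def parse(line: String): Option[Visit] =
    line.split(",").map(_.trim) match {
      case Array(user, month, n) => Try(n.toInt).toOption.map(Visit(user, month, _))
      case _                     => None
    }
}
```

With this shape, `lines.flatMap(l => ParseVisits.parse(l))` would silently skip malformed rows; whether skipping or failing is the right behaviour depends on the data.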
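- Group by the first column with groupBy, which yields a Map: each key (K) is a distinct value of the first column, and each value (V) is an Array holding the tuples of the rows that carry that key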
scala> lines.map(x=>{
| val y=x.split(",")
| (y(0),y(1),y(2).toInt)
| }).groupBy(x=>x._1)
res1: scala.collection.immutable.Map[String,Array[(String, String, Int)]] = Map(
A -> Array((A,2015-01,5), (A,2015-01,15), (A,2015-01,8), (A,2015-01,5), (A,2015-02,4), (A,2015-02,6), (A,2015-03,16), (A,2015-03,22)),
B -> Array((B,2015-01,5), (B,2015-01,25), (B,2015-02,10), (B,2015-02,5), (B,2015-03,23), (B,2015-03,10), (B,2015-03,11)))
- Leave K unchanged and process V: within each V, group by month, sort by month, and sum all the counts of each month
scala> lines.map(x=>{
| val y=x.split(",")
| (y(0),y(1),y(2).toInt)
| }).groupBy(x=>x._1).mapValues(x=>x.groupBy(x=>x._2).
| toArray.sortWith((x,y)=>x._1<y._1).
| map(x=>(x._1,x._2.map(x=>x._3).sum)))
res2: scala.collection.immutable.Map[String,Array[(String, Int)]] = Map(
A -> Array((2015-01,33), (2015-02,10), (2015-03,38)),
B -> Array((2015-01,30), (2015-02,15), (2015-03,44)))
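The per-user monthly totals computed above can also be written as a standalone function with pattern matching instead of positional accessors. This is only a sketch: MonthlyTotals and monthlyTotals are hypothetical names, and it uses map on the grouped Map rather than mapValues, which is an assumption made so the same code compiles without deprecation warnings on Scala 2.13.

```scala
object MonthlyTotals {
  // Input: (user, month, count) tuples, as produced by the parsing step.
  // Output: user -> (month, total) pairs sorted by month.
  def monthlyTotals(rows: Seq[(String, String, Int)]): Map[String, Seq[(String, Int)]] =
    rows.groupBy(_._1).map { case (user, userRows) =>
      val perMonth = userRows
        .groupBy(_._2)                                         // month -> rows of that month
        .map { case (month, rs) => month -> rs.map(_._3).sum } // month -> total
        .toSeq
        .sortBy(_._1)                                          // "yyyy-MM" sorts correctly as text
      user -> perMonth
    }
}
```

Sorting the month as a string works here only because the months are zero-padded in yyyy-MM form.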
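- At this point the cleaned data has taken shape. What remains is the cumulative sum and the running maximum. Because the task asks for the sum and the max up to the current month, we use scan here. scan emits its seed value 0 first, so tail drops that leading element and leaves exactly one running value per month
- zip then pairs each month with its running sum and its running maximum, giving elements of the form ((month, cumulative sum), max so far)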
scala> lines.map(x=>{
| val y=x.split(",")
| (y(0),y(1),y(2).toInt)
| }).groupBy(x=>x._1).mapValues(x=>x.groupBy(x=>x._2).
| toArray.sortWith((x,y)=>x._1<y._1).
| map(x=>(x._1,x._2.map(x=>x._3).sum))).
| toArray.map(x=>(x._1,x._2.map(x=>x._1).
| zip(x._2.map(x=>x._2).scan(0)(_+_).tail).
| zip(x._2.map(x=>x._2).scan(0)(_.max(_)).tail)))
res3: Array[(String, Array[((String, Int), Int)])] = Array(
(A,Array(((2015-01,33),33), ((2015-02,43),33), ((2015-03,81),38))),
(B,Array(((2015-01,30),30), ((2015-02,45),30), ((2015-03,89),44))))
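The scan-then-tail step is easier to see in isolation. The sketch below uses scanLeft, which is the left-to-right form of scan, on a plain sequence of monthly totals; RunningStats and its method names are hypothetical. Like the original, the running maximum is seeded with 0, which is safe only under the assumption that counts are never negative.

```scala
object RunningStats {
  // scanLeft emits the seed first, e.g. Seq(0, 33, 43, 81), so .tail drops it.
  def runningSum(xs: Seq[Int]): Seq[Int] = xs.scanLeft(0)(_ + _).tail

  // Running maximum; the seed 0 assumes non-negative values.
  def runningMax(xs: Seq[Int]): Seq[Int] = xs.scanLeft(0)(_ max _).tail

  // Attaches both running values to each (month, total) pair and
  // flattens the nested ((month, sum), max) tuples produced by zip.
  def withRunning(monthly: Seq[(String, Int)]): Seq[(String, Int, Int)] = {
    val totals = monthly.map(_._2)
    monthly.map(_._1)
      .zip(runningSum(totals))
      .zip(runningMax(totals))
      .map { case ((month, sum), max) => (month, sum, max) }
  }
}
```

For user A's monthly totals 33, 10, 38, runningSum gives 33, 43, 81 and runningMax gives 33, 33, 38, matching the REPL output above.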
- Iterate over the result and print it. The complete code and the result from IDEA follow
import scala.io.Source
object pv {
  def main(args: Array[String]): Unit = {
    val lines = Source.fromFile("files/03pv/pv.txt").getLines().toArray
    lines.map(x => {
      // split each line into (user, month, count)
      val y = x.split(",")
      (y(0), y(1), y(2).toInt)
    }).groupBy(x => x._1).mapValues(x => x.groupBy(x => x._2).
      // per user: total visits per month, sorted by month
      toArray.sortWith((x, y) => x._1 < y._1).
      map(x => (x._1, x._2.map(x => x._3).sum))).
      // pair each month with its running sum, then with its running max
      toArray.map(x => (x._1, x._2.map(x => x._1).
        zip(x._2.map(x => x._2).scan(0)(_ + _).tail).
        zip(x._2.map(x => x._2).scan(0)(_.max(_)).tail))).
      foreach(x => {
        x._2.foreach(y => println(
          s"User: ${x._1}, month: ${y._1._1}, highest monthly visits so far: ${y._2}, total visits up to this month: ${y._1._2}"))
      })
  }
}