Example 1:
Data format (consumer ID, consumption time, consumption amount; fields are tab-separated):
1 12:01 100
1 12:02 200
1 12:50 100
2 12:50 100
3 13:01 200
Requirement: for each hour, compute each user's total consumption.
Approach:
1. Use the user ID plus the hour part of the time (the first two characters) as the key.
2. Use Spark SQL's groupBy(...).agg(...) method: groupBy("id", "time").agg(sum("consume")); an equivalent SQL formulation is sketched below.
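For comparison, here is a minimal SQL sketch of the same aggregation. It assumes the parsed DataFrame infoDF from the full program below; the temp-view name consumption is hypothetical:

infoDF.createOrReplaceTempView("consumption")  // hypothetical view name
spark.sql(
  """SELECT id, time AS hour, SUM(consume) AS total
    |FROM consumption
    |GROUP BY id, time
    |ORDER BY id""".stripMargin).show()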
Code:
package com.soul.spark.SparkSQL

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/**
 * @author soulChun
 * @create 2019-01-04-15:22
 */
object TextFileApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("TextFileApp").getOrCreate()
    import spark.implicits._

    /**
     * ID  Time   Consume
     * user ID, consumption time, consumption amount
     * 1   12:01  100
     * 1   12:02  200
     * 1   12:50  100
     * 2   12:50  100
     * 3   13:01  200
     * Requirement: for each hour, compute each user's total consumption.
     */
    val infoDF = spark.read.textFile("/Users/mac/soul/1.txt")
      .map(_.split("\t"))
      // keep only the hour part (the first two characters) of the time
      .map(x => User(x(0), x(1).substring(0, 2), x(2).toInt))
      .toDF()
    infoDF.show()

    // group by user and hour, then sum the amounts per group
    infoDF.groupBy("id", "time").agg(sum("consume")).sort("id").show()

    // keep the app alive for five minutes so the Spark UI stays reachable
    Thread.sleep(5 * 60 * 1000)
    spark.stop()
  }
}

case class User(id: String, time: String, consume: Int)
Result (expected output for the sample data; formatting approximate):
+---+----+------------+
| id|time|sum(consume)|
+---+----+------------+
|  1|  12|         400|
|  2|  12|         100|
|  3|  13|         200|
+---+----+------------+
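An alternative sketch (same assumptions as the program above, i.e. spark.implicits._ and functions._ are in scope, and the same file path): parse the raw columns into a DataFrame first and extract the hour with the built-in substring column function instead of inside the map:

val rawDF = spark.read.textFile("/Users/mac/soul/1.txt")
  .map(_.split("\t"))
  .map(x => (x(0), x(1), x(2).toInt))
  .toDF("id", "time", "consume")
  // substring(col, pos, len) is 1-based, so this takes "12" from "12:01"
  .withColumn("hour", substring($"time", 1, 2))
rawDF.groupBy("id", "hour").agg(sum("consume").as("total")).sort("id").show()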
Example 2:
Data format (name, visit count):
yy,1001
panda,1001
kuaishou,1002
yy,1001
yy,1003
panda,1003
kuaishou,1003
yy,1003
Requirement: compute each user's total visit count.
Code:
package com.soul.spark.SparkSQL

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

/**
 * @author soulChun
 * @create 2019-01-08-16:03
 * yy,1001
 * panda,1001
 * kuaishou,1002
 * yy,1001
 * yy,1003
 * panda,1003
 * kuaishou,1003
 * yy,1003
 * Requirement: compute each user's total visit count.
 */
object PvuvApp {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[2]").appName("PvuvApp").getOrCreate()
    runBuiltInExample(spark)
    spark.stop()
  }

  case class LogInfo(name: String, count: Int)

  def runBuiltInExample(spark: SparkSession): Unit = {
    import spark.implicits._
    val logDF = spark.sparkContext.textFile("/Users/mac/soul/data/tmp/pvuv.txt")
      .map(_.split(","))
      .map(x => LogInfo(x(0), x(1).toInt))
      .toDF()
    // count() would only count the number of records per name:
    // logDF.groupBy("name").count().show()
    // sum("count") adds up the visit counts themselves
    logDF.groupBy("name").agg(sum("count").as("totalTimes")).show()
  }
}
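For reference, with the sample data the totals work out to yy = 1001+1001+1003+1003 = 4008, panda = 1001+1003 = 2004, kuaishou = 1002+1003 = 2005, so show() prints something like (formatting approximate, row order may vary):

+--------+----------+
|    name|totalTimes|
+--------+----------+
|      yy|      4008|
|   panda|      2004|
|kuaishou|      2005|
+--------+----------+

The same result can also be obtained with the classic RDD approach; a minimal sketch, assuming the same input file:

spark.sparkContext.textFile("/Users/mac/soul/data/tmp/pvuv.txt")
  .map(_.split(","))
  .map(x => (x(0), x(1).toInt))
  .reduceByKey(_ + _) // add up the counts per name
  .collect()
  .foreach(println)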