join
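join is similar to SQL's inner join: the result contains only the keys that match in both the calling RDD and the argument RDD, and records without a match are dropped.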
leftOuterJoin
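leftOuterJoin is similar to SQL's left outer join: every record of the calling (left) RDD is kept, and where the right RDD has no matching key, the right-hand value is empty (None).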
rightOuterJoin
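rightOuterJoin is similar to SQL's right outer join: every record of the argument (right) RDD is kept, and where the left RDD has no matching key, the left-hand value is empty (None).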
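The difference between the three operations is easiest to see on a tiny data set. The following is a minimal sketch that models the semantics with plain Scala collections rather than Spark itself; the helpers innerJoin, leftOuter and rightOuter and the score key "9" are hypothetical and exist only for illustration. On real pair RDDs the corresponding calls are a.join(b), a.leftOuterJoin(b) and a.rightOuterJoin(b), where the side that may be missing is wrapped in Option.

```scala
object JoinSemantics {
  // Inner join: keep only keys that appear on both sides.
  def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
    left.flatMap { case (k, v) =>
      right.collect { case (rk, w) if rk == k => (k, (v, w)) }
    }

  // Left outer join: keep every left record; a missing right value becomes None.
  def leftOuter[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, Option[W]))] =
    left.flatMap { case (k, v) =>
      val matches: Seq[Option[W]] = right.collect { case (rk, w) if rk == k => Some(w) }
      val padded: Seq[Option[W]] = if (matches.isEmpty) Seq(None) else matches
      padded.map(w => (k, (v, w)))
    }

  // Right outer join: the mirror image, so reuse leftOuter with the sides swapped.
  def rightOuter[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (Option[V], W))] =
    leftOuter(right, left).map { case (k, (w, v)) => (k, (v, w)) }

  def main(args: Array[String]): Unit = {
    val students = Seq(("1", "zhangsan"), ("2", "lisi"))
    // Key "9" is a made-up score row with no matching student.
    val scores = Seq(("1", "yuwen"), ("9", "shuxue"))

    println(innerJoin(students, scores))  // only key 1 survives
    println(leftOuter(students, scores))  // keys 1 and 2; key 2 has no score
    println(rightOuter(students, scores)) // keys 1 and 9; key 9 has no student
  }
}
```

Because the outer joins return Option values, downstream code typically needs getOrElse or pattern matching before the missing side can be formatted.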
Next, we combine SQL and code to see how to implement join queries with Spark.
Student table: stu_info(stu_id,stu_name,sex,province), with the following data:
1,zhangsan,1,zj
2,lisi,0,gs
3,wangwu,1,bj
4,zhaoliu,0,sh
Score table: score_info(score_id,stu_id,course_name,score_num), with the following data:
1,1,yuwen,56
2,1,shuxue,98
3,2,yuwen,76
4,2,shuxue,45
5,3,yuwen,89
6,3,shuxue,99
7,4,yuwen,34
8,4,shuxue,76
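Requirement: for every course and score, list the corresponding student's name, sex, and province.
First, let us describe the desired result in SQL: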
select
stu.stu_id,stu.stu_name,stu.sex,stu.province,score.course_name,score.score_num
from
stu_info stu
join
score_info score
on
stu.stu_id=score.stu_id
How do we implement this by calling Spark functions from Scala?
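// Assumes sc is an existing SparkContext; args(0) and args(1) are the
// input files and args(2) is the output directory.
val stuData = sc.textFile(args(0))
val scoreData = sc.textFile(args(1))

// Key students by stu_id: (stu_id, (stu_name, sex, province))
val stuRdd = stuData.map(line => {
  val cells = line.split(",")
  (cells(0), (cells(1), cells(2), cells(3)))
})

// Key scores by stu_id: (stu_id, (course_name, score_num))
val scoreRdd = scoreData.map(line => {
  val cells = line.split(",")
  (cells(1), (cells(2), cells(3)))
})

// Scores below 60, parsed as integers; not used by this join
val failedScoreRdd = scoreData.map(line => {
  val cells = line.split(",")
  (cells(0).toInt, cells(1).toInt, cells(2), cells(3).toInt)
}).filter(line => line._4 < 60)

// Inner join on stu_id, then flatten each record into a single tuple
val joinRdd = stuRdd.join(scoreRdd)
val resultRdd = joinRdd.map(rdd => {
  val stu = rdd._2._1
  val score = rdd._2._2
  (rdd._1, stu._1, stu._2, stu._3, score._1, score._2)
})

// Save the flattened result, not the raw joinRdd
resultRdd.repartition(1).saveAsTextFile(args(2))

Contents of the output file (row order may vary):
(4,zhaoliu,0,sh,yuwen,34)
(4,zhaoliu,0,sh,shuxue,76)
(2,lisi,0,gs,yuwen,76)
(2,lisi,0,gs,shuxue,45)
(3,wangwu,1,bj,yuwen,89)
(3,wangwu,1,bj,shuxue,99)
(1,zhangsan,1,zj,yuwen,56)
(1,zhangsan,1,zj,shuxue,98)

This is the default tuple output, without any formatting. To write plain comma-separated lines instead, change the code as follows:

val joinRdd = stuRdd.join(scoreRdd).map(rdd => {
  val stu = rdd._2._1
  val score = rdd._2._2
  s"${rdd._1},${stu._1},${stu._2},${stu._3},${score._1},${score._2}"
})

The output is now:
4,zhaoliu,0,sh,yuwen,34
4,zhaoliu,0,sh,shuxue,76
2,lisi,0,gs,yuwen,76
2,lisi,0,gs,shuxue,45
3,wangwu,1,bj,yuwen,89
3,wangwu,1,bj,shuxue,99
1,zhangsan,1,zj,yuwen,56
1,zhangsan,1,zj,shuxue,98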
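Note:
The s"" string interpolator builds the string; inside it, ${variable} inserts the value of a variable or expression into the output.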
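As a small standalone illustration of the interpolator, the sketch below formats one joined record outside Spark. The object name InterpolationSketch and the helper formatRow are hypothetical; the tuple shapes simply mirror the (stu_name, sex, province) and (course_name, score_num) values used in this article.

```scala
object InterpolationSketch {
  // Build one comma-separated line from a student id, a student tuple
  // and a score tuple. $id works for a simple name; ${...} is needed
  // for expressions such as tuple accessors.
  def formatRow(id: String,
                stu: (String, String, String),
                score: (String, String)): String =
    s"$id,${stu._1},${stu._2},${stu._3},${score._1},${score._2}"

  def main(args: Array[String]): Unit = {
    println(formatRow("1", ("zhangsan", "1", "zj"), ("yuwen", "56")))
  }
}
```

An alternative with the same result is to collect the fields in a Seq and call mkString(","), which can be easier to maintain when the number of columns grows.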
Requirement: list the failing scores (below 60).
select
stu.stu_id,stu.stu_name,stu.sex,stu.province,score.course_name,score.score_num
from
stu_info stu
join
score_info score
on
stu.stu_id=score.stu_id
where
score.score_num < 60
The Scala code is as follows:
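// Key students by stu_id: (stu_id, (stu_name, sex, province))
val stuRdd = stuData.map(line => {
  val cells = line.split(",")
  (cells(0), (cells(1), cells(2), cells(3)))
})

// Key scores by stu_id and keep only the scores below 60
val scoreRdd = scoreData.map(line => {
  val cells = line.split(",")
  (cells(1), (cells(2), cells(3)))
}).filter(rdd => rdd._2._2.toInt < 60)

// Inner join on stu_id and format each record as a comma-separated line
val joinRdd = stuRdd.join(scoreRdd).map(rdd => {
  val stu = rdd._2._1
  val score = rdd._2._2
  s"${rdd._1},${stu._1},${stu._2},${stu._3},${score._1},${score._2}"
})

The output is:
1,zhangsan,1,zj,yuwen,56
4,zhaoliu,0,sh,yuwen,34
2,lisi,0,gs,shuxue,45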
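The expected rows of the failing-score query can be checked without a cluster. The sketch below replays the same filter-then-join logic on the sample data with plain Scala collections. It is not Spark code: the object FailedScoresSketch and the helper failedRows are hypothetical, and the sketch assumes that stu_id is unique in stu_info and that the fourth student field (province) is what the query should return.

```scala
object FailedScoresSketch {
  // Sample data copied from the stu_info and score_info listings.
  val stuLines: Seq[String] = Seq(
    "1,zhangsan,1,zj", "2,lisi,0,gs", "3,wangwu,1,bj", "4,zhaoliu,0,sh")
  val scoreLines: Seq[String] = Seq(
    "1,1,yuwen,56", "2,1,shuxue,98", "3,2,yuwen,76", "4,2,shuxue,45",
    "5,3,yuwen,89", "6,3,shuxue,99", "7,4,yuwen,34", "8,4,shuxue,76")

  def failedRows(stu: Seq[String], scores: Seq[String]): Seq[String] = {
    // stu_id -> (stu_name, sex, province); assumes stu_id is unique.
    val students: Map[String, (String, String, String)] = stu.map { line =>
      val cells = line.split(",")
      (cells(0), (cells(1), cells(2), cells(3)))
    }.toMap

    scores
      .map(_.split(","))
      .filter(cells => cells(3).toInt < 60) // where score_num < 60
      .flatMap { cells =>
        // Inner join: scores whose stu_id has no student are dropped.
        students.get(cells(1)).map { case (name, sex, province) =>
          s"${cells(1)},$name,$sex,$province,${cells(2)},${cells(3)}"
        }
      }
  }

  def main(args: Array[String]): Unit =
    failedRows(stuLines, scoreLines).foreach(println)
}
```

Filtering before the join, as the Spark version also does, keeps the joined data small; applying the same predicate after the join would give the same rows for an inner join.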
Copyright of this article belongs to https://www.mulhyac.com; please credit the source when reposting.