[Big Data Learning - Lab 6] Spark Applications

1. Count how many lines match a condition

1. The file test.txt stores a number of user records, one record per line. Filter out the users whose gender is "male" (男) and count how many lines match.

18375,2011-5-20,2013-6-5,,4,廣州,廣東,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,,4,佛山,廣東,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,,4,廣州,廣東,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,,4,廣州,廣東,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,,4,上海,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0

Approach:
(1) Read the data and create an RDD.
(2) Filter the data with the filter operation; the filter function checks whether a line contains the "male" marker, which can be done with the "contains" method.
(3) Use count on the result of step (2) to get the number of matching lines.
A sketch that strings these three steps together follows this list.
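A minimal spark-shell sketch of the whole pipeline, assuming the file has already been uploaded to HDFS at /myspark/wordcount/test.txt (the path used in the preparation steps below) and that the gender "男" is written as "M" in the data:

val rdd = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt")
val maleLines = rdd.filter(line => line.contains("M"))  // keep lines whose gender marker is M
maleLines.count()                                       // number of matching lines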

1) Preparation

Create the local directory /data/spark/wordcount:

mkdir -p /data/spark/wordcount

Create the file test.txt in that directory:

cd /data/spark/wordcount
vim test.txt

Write the following into test.txt. Because the environment cannot handle Chinese characters, "男" (male) is replaced with M, "女" (female) with W, and the city/province names are abbreviated (GZ, FS, SH, GD):

18375,2011-5-20,2013-6-5,W,4,GZ,GD,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,M,4,FS,GD,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,W,4,GZ,GD,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,W,4,GZ,GD,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,W,4,SH,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0

Start Hadoop (start-all.sh starts both HDFS and YARN, so the separate start-dfs.sh call is optional):

cd /apps/hadoop
./sbin/start-dfs.sh
/apps/hadoop/sbin/start-all.sh

Start Spark:

/apps/spark/sbin/start-all.sh

Upload the local file to HDFS (create the target directory first with hadoop fs -mkdir -p /myspark/wordcount if it does not exist):

hadoop fs -put /data/spark/wordcount/test.txt /myspark/wordcount

Start spark-shell:

spark-shell

2) Read the data and create an RDD

Read the data from HDFS into an RDD:

val rdd = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt");

Verify that the read succeeded by displaying the loaded data. The records are comma-separated, so split on ',' (not '\t') to extract the first field (the user ID):

rdd.map(line => (line.split(',')(0), 1)).reduceByKey(_ + _).collect
rdd.count()
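For a quicker sanity check (a small sketch, not part of the original steps), the first couple of raw lines can be printed directly:

rdd.take(2).foreach(println)  // print the first two lines as read from HDFS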


3) Filter the data with the filter operation. The filter function checks whether a line contains the "male" marker (here the letter "M"), using the "contains" method:

val linesWithSpark=rdd.filter(line=>line.contains("M"))
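Note that contains("M") matches the letter anywhere in the line, so it could in principle match other fields too. A more precise alternative (a sketch, not part of the original lab) splits each comma-separated record and checks the gender column, which is the fourth field (index 3) in this data:

val maleUsers = rdd.filter { line =>
  val fields = line.split(",", -1)       // -1 keeps trailing empty fields
  fields.length > 3 && fields(3) == "M"  // gender column is at index 3
}
maleUsers.count()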


4) Use count on the result of step (3) to get the number of matching lines

Count the lines that contain the "male" marker:

linesWithSpark.count

Display the filtered data (again splitting on ',' to extract the user ID):

linesWithSpark.map(line => (line.split(',')(0), 1)).reduceByKey(_ + _).collect


2. Word count in a document

2. The data file words.txt contains several sentences. Count the words in the document and store the words whose count exceeds 3 on HDFS.

WHat is going on there?
I talked to John on email.  We talked about some computer stuff that's it.

I went bike riding in the rain, it was not that cold.

We went to the museum in SF yesterday it was $3 to get in and they had
free food.  At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.

Approach:
(1) Read the data with the textFile method.
(2) Split each line into words with flatMap.
(3) Map each word to the form (word, 1).
(4) Sum all the values of the same word with reduceByKey.
(5) Filter out the words whose count is greater than 3.
(6) Write the result to HDFS with saveAsTextFile.
The whole pipeline is sketched right after this list.
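Putting steps (1) through (6) together, a minimal spark-shell sketch looks like this (the input path matches the upload step below; the output directory /myspark/wordcount/wc_result is an assumed name and must not already exist on HDFS):

val textFile = sc.textFile("hdfs://localhost:9000/myspark/wordcount/words.txt")
val wordCounts = textFile
  .flatMap(line => line.split(" "))         // split each line into words
  .map(word => (word, 1))                   // (word, 1) pairs
  .reduceByKey(_ + _)                       // sum the counts per word
val frequent = wordCounts.filter(_._2 > 3)  // keep words appearing more than 3 times
frequent.saveAsTextFile("hdfs://localhost:9000/myspark/wordcount/wc_result")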

1) Preparation

Create words.txt:

cd /data/spark/wordcount
vim words.txt

Fill words.txt with the following:

WHat is going on there?
I talked to John on email.  We talked about some computer stuff that's it.

I went bike riding in the rain, it was not that cold.

We went to the museum in SF yesterday it was $3 to get in and they had
free food.  At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.

Upload the data to HDFS:

hadoop fs -put /data/spark/wordcount/words.txt /myspark/wordcount

Start spark-shell:

spark-shell

2) Read the data with the textFile method:

val textFile = sc.textFile("hdfs://localhost:9000/myspark/wordcount/words.txt");


3) Split each line into words with flatMap, map each word to the form (word, 1), and sum all the values of the same word with reduceByKey:

val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)


4) Filter out the words whose count is greater than 3

val saveAsFile = wordCounts.filter(_._2 > 3)  // keep it as an RDD so it can be saved in the next step
saveAsFile.collect()                          // inspect the filtered result


5) Write the result to HDFS with saveAsTextFile.
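A minimal sketch of this step, assuming the output directory /myspark/wordcount/result does not already exist on HDFS (saveAsTextFile fails if it does):

saveAsFile.saveAsTextFile("hdfs://localhost:9000/myspark/wordcount/result")

The output can then be checked from a terminal with hadoop fs -cat /myspark/wordcount/result/part-*.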


3. Count the goods favorited by each user

3. An e-commerce site recorded a large amount of data about users favoriting goods and stored it in a file named buyer_favorite1. The data format and content are as follows:

user ID (buyer_id), goods ID (goods_id), favorite date (dt)
buyer_id  goods_id  dt
10181  1000481  2010-04-04 16:54:31
20001  1001597  2010-04-07 15:07:52
20001  1001560  2010-04-07 15:08:27
20042  1001368  2010-04-08 08:20:30
20067  1002061  2010-04-08 16:45:33

The task is to run a wordcount-style job over the favorite data using the Spark Scala API or Spark Java API and count the number of goods favorited by each user. A standalone sketch follows; the interactive spark-shell steps come after it.
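Besides the spark-shell steps below, the same job can be written as a small standalone Scala application (a sketch only; the object name, paths and submission details are assumptions, and it would be packaged and run with spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

object BuyerFavoriteCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("BuyerFavoriteCount")
    val sc = new SparkContext(conf)
    val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite")
    val counts = rdd
      .map(line => (line.split("\\s+")(0), 1))  // first whitespace-separated field is buyer_id
      .reduceByKey(_ + _)                       // favorites per user
    counts.collect().foreach(println)
    sc.stop()
  }
}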

1) Preparation

Create the file:

mkdir -p /data/spark3/wordcount
cd /data/spark3/wordcount
vim buyer_favorite
Press i to enter insert mode and paste the following lines:
10181  1000481  2010-04-04 16:54:31
20001  1001597  2010-04-07 15:07:52
20001  1001560  2010-04-07 15:08:27
20042  1001368  2010-04-08 08:20:30
20067  1002061  2010-04-08 16:45:33
Press Esc, then type :wq to save and exit.

Upload the file to HDFS (create /myspark3/wordcount first with hadoop fs -mkdir -p if it does not exist):

hadoop fs -put /data/spark3/wordcount/buyer_favorite /myspark3/wordcount

Start spark-shell:

spark-shell

Read the data into an RDD:

val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite")


Count the favorites per user by mapping each line to (buyer_id, 1) and summing with reduceByKey:

rdd.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _).collect
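To make the output easier to read, the result can also be sorted by count before collecting (a small sketch, not required by the task):

rdd.map(line => (line.split(" ")(0), 1))
  .reduceByKey(_ + _)
  .sortBy(_._2, ascending = false)  // most active users first
  .collect
  .foreach(println)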

