1. Counting how many lines match a condition
1. The file test.txt stores a number of user records, one record per line. Filter out the users whose gender is "male" (男) and count how many lines match this condition.
18375,2011-5-20,2013-6-5,女,4,廣州,廣東,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,男,4,佛山,廣東,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,女,4,廣州,廣東,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,女,4,廣州,廣東,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,女,4,上海,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Implementation approach:
(1) Read the data and create an RDD.
(2) Filter the data with the filter operation; the filter function checks whether a line contains the character "男" (male), which can be done with the contains method.
(3) Use count on the result of step (2) to get the number of matching lines (a compact, chained version of all three steps is sketched right after this list).
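As a preview, the three steps can be chained into one expression in spark-shell. This is a minimal sketch that assumes the file has already been uploaded to HDFS at the path used later in this section, and that the gender value 男 has been replaced with M as described in the preparation step below:
// Read the file, keep only the lines containing the male marker, and count them.
val maleCount = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt").filter(line => line.contains("M")).count()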
1) Preparation
Create a local directory /data/spark/wordcount:
mkdir -p /data/spark/wordcount
Create the file test.txt in that directory:
cd /data/spark/wordcount
vim test.txt
Write the following data into test.txt. Because the environment cannot handle Chinese characters, 男 (male) is replaced with M and 女 (female) with W:
18375,2011-5-20,2013-6-5,W,4,GZ,GD,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,M,4,FS,GD,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,W,4,GZ,GD,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,W,4,GZ,GD,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,W,4,SH,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Start Hadoop (start-all.sh launches both HDFS and YARN, so running start-dfs.sh separately is not needed):
cd /apps/hadoop
./sbin/start-all.sh
Start Spark:
/apps/spark/sbin/start-all.sh
Upload the local file to HDFS:
hadoop fs -put /data/spark/wordcount/test.txt /myspark/wordcount
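If the target directory does not yet exist on HDFS, it may need to be created first (an extra step not spelled out in the original instructions):
hadoop fs -mkdir -p /myspark/wordcount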
Start spark-shell:
spark-shell
2) Read the data and create an RDD.
Read the data from HDFS into an RDD:
val rdd = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt");
Verify that the read succeeded by displaying the data that was read and counting the lines:
rdd.collect().foreach(println)
rdd.count()
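For a larger file, collecting everything to the driver can be expensive; taking just a few lines is a lighter-weight check (an optional alternative, not one of the original steps):
rdd.take(5).foreach(println)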
3) Filter the data with the filter operation; the filter function checks whether a line contains the male marker, which can be done with the contains method. Since 男 was replaced with M in the prepared file, filter on "M":
val maleLines = rdd.filter(line => line.contains("M"))
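Filtering with contains("M") is enough for this sample, but it would also match an M appearing in any other field. A slightly stricter variant (a sketch, assuming the gender is always the fourth comma-separated field) splits each line and checks that field directly:
// Index 3 is the gender column in this data layout.
val maleLinesStrict = rdd.filter(line => line.split(",")(3) == "M")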
4) Use count on the result of step (3) to get the number of lines.
Count the lines that contain the male marker M:
maleLines.count()
Display the filtered data:
maleLines.collect().foreach(println)
2. Word count in a document
2. The data file words.txt contains several lines of sentences. Count the words in the document and store the words whose count exceeds 3 to HDFS.
WHat is going on there?
I talked to John on email. We talked about some computer stuff that's it.
I went bike riding in the rain, it was not that cold.
We went to the museum in SF yesterday it was $3 to get in and they had
free food. At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.
Implementation approach:
(1) Read the data with textFile.
(2) Split each line into words with flatMap.
(3) Map each word to the form (word, 1).
(4) Sum all the values of the same word with reduceByKey.
(5) Filter out the words whose count is greater than 3 with filter.
(6) Write the result to HDFS with saveAsTextFile.
1) Preparation
Create words.txt:
cd /data/spark/wordcount
vim words.txt
Enter the following into words.txt:
WHat is going on there?
I talked to John on email. We talked about some computer stuff that's it.
I went bike riding in the rain, it was not that cold.
We went to the museum in SF yesterday it was $3 to get in and they had
free food. At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.
Upload the data to HDFS:
hadoop fs -put /data/spark/wordcount/words.txt /myspark/wordcount
Start spark-shell:
spark-shell
2) Read the data with textFile:
val textFile = sc.textFile("hdfs://localhost:9000/myspark/wordcount/words.txt");
3) Split each line into words with flatMap, map each word to (word, 1), and sum all the values of the same word with reduceByKey:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
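To inspect the intermediate word counts before filtering (an optional check, not part of the original steps):
wordCounts.collect().foreach(println)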
4) Filter out the words whose count is greater than 3. Note that collect() is not called here, because the result must remain an RDD for saveAsTextFile in the next step:
val result = wordCounts.filter(_._2 > 3)
5) Write the result to HDFS with saveAsTextFile.
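The original text does not give an output path, so the call below uses an illustrative directory next to the input (adjust as needed); note that saveAsTextFile requires the target directory not to exist yet:
result.saveAsTextFile("hdfs://localhost:9000/myspark/wordcount/wordcount_output")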
3. Count the number of items each user has favorited
3. An e-commerce site records a large amount of data about users favoriting products, stored in a file named buyer_favorite1. The data format and content are as follows:
User ID (buyer_id), product ID (goods_id), favorite date (dt)
buyer_id    goods_id    dt
10181 1000481 2010-04-04 16:54:31
20001 1001597 2010-04-07 15:07:52
20001 1001560 2010-04-07 15:08:27
20042 1001368 2010-04-08 08:20:30
20067 1002061 2010-04-08 16:45:33
Use the Spark Scala API or the Spark Java API to run a wordcount-style operation on the favorite data and count how many products each user has favorited.
1) Preparation
Create the file:
mkdir -p /data/spark3/wordcount
cd /data/spark3/wordcount
vim buyer_favorite1
Press i to enter insert mode and paste the following data:
10181 1000481 2010-04-04 16:54:31
20001 1001597 2010-04-07 15:07:52
20001 1001560 2010-04-07 15:08:27
20042 1001368 2010-04-08 08:20:30
20067 1002061 2010-04-08 16:45:33
Press Esc, then type :wq to save and exit.
Upload the file to HDFS:
hadoop fs -put /data/spark3/wordcount/buyer_favorite1 /myspark3/wordcount
Start spark-shell:
spark-shell
2) Read the data, map each line to (buyer_id, 1), and sum the values per user with reduceByKey; mapping each line to 1 rather than 0 is what makes the sum a per-user count:
val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite1")
rdd.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _).collect
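To make the result easier to read, the (buyer_id, count) pairs can be printed one per line, and they can also be written back to HDFS in the same way as in exercise 2 (the output path below is illustrative):
val favoriteCounts = rdd.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _)
favoriteCounts.collect().foreach(println)
favoriteCounts.saveAsTextFile("hdfs://localhost:9000/myspark3/wordcount/output")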