1. Counting how many lines match a condition
1. The file test.txt stores a number of user records, one record per line. Filter out the users whose gender is "male" (男) and count how many lines match this condition.
18375,2011-5-20,2013-6-5,女,4,廣州,廣東,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,男,4,佛山,廣東,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,女,4,廣州,廣東,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,女,4,廣州,廣東,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,女,4,上海,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Implementation approach:
(1) Read the data and create an RDD.
(2) Filter the data with the filter operation; the filter function checks whether a line contains the character "男" (male), which can be done with the contains method.
(3) Use count on the result of step (2) to get the number of matching lines (a compact, chained version of all three steps is sketched right after this list).
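As a preview, the three steps can be chained into one expression in spark-shell. This is a minimal sketch that assumes the file has already been uploaded to HDFS at the path used later in this section, and that the gender value 男 has been replaced with M as described in the preparation step below:
// Read the file, keep only the lines containing the male marker, and count them.
val maleCount = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt").filter(line => line.contains("M")).count()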
1) Preparation
Create a local directory /data/spark/wordcount:
mkdir -p /data/spark/wordcount
Create the file test.txt in that directory:
cd /data/spark/wordcount
vim test.txt
Write the following data into test.txt. Because the environment cannot handle Chinese characters, 男 (male) is replaced with M and 女 (female) with W:
18375,2011-5-20,2013-6-5,W,4,GZ,GD,CN,25,2014-3-31,2,0,0,0,100,0,1134,0,2013-6-9,0.25,0,430,297,4,4,195,12123,1,0,0,2,0,0,0,12318,12318,12123,12318,12123,1,0,0,0,22
36041,2010-3-8,2013-9-14,M,4,FS,GD,CN,38,2014-3-31,4,0,0,0,100,0,8016,0,2014-1-3,0.5,0,531,89,37,60,50466,56506,14,0,0,4,0,0,0,106972,106972,56506,106972,56506,1,0,0,0,43
45690,2006-3-30,2006-12-2,W,4,GZ,GD,CN,43,2014-3-31,2,0,0,0,100,0,2594,0,2014-3-3,0.25,0,536,29,166,166,0,0,0,0,0,2,0,0,0,0,0,0,0,0,1,0,0,0,0
61027,2013-2-6,2013-2-14,W,4,GZ,GD,CN,36,2014-3-31,2,0,0,0,100,0,3934,0,2013-2-26,0.4,0,8,400,12,12,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
61340,2013-2-17,2013-2-17,W,4,SH,.,CN,29,2014-3-31,2,0,0,0,,0,4222,0,2013-2-23,0.4,0,0,403,6,6,0,0,0,0,2,0,0,0,0,0,0,0,0,0,0,1,0,0,0
Start Hadoop (start-all.sh launches both HDFS and YARN, so running start-dfs.sh separately is not needed):
cd /apps/hadoop
./sbin/start-all.sh
Start Spark:
/apps/spark/sbin/start-all.sh
Upload the local file to HDFS:
hadoop fs -put /data/spark/wordcount/test.txt /myspark/wordcount
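If the target directory does not yet exist on HDFS, it may need to be created first (an extra step not spelled out in the original instructions):
hadoop fs -mkdir -p /myspark/wordcount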
Start spark-shell:
spark-shell
2) Read the data and create an RDD.
Read the data from HDFS into an RDD:
val rdd = sc.textFile("hdfs://localhost:9000/myspark/wordcount/test.txt");
Verify that the read succeeded by displaying the data that was read and counting the lines:
rdd.collect().foreach(println)
rdd.count()
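For a larger file, collecting everything to the driver can be expensive; taking just a few lines is a lighter-weight check (an optional alternative, not one of the original steps):
rdd.take(5).foreach(println)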
3) Filter the data with the filter operation; the filter function checks whether a line contains the male marker, which can be done with the contains method. Since 男 was replaced with M in the prepared file, filter on "M":
val maleLines = rdd.filter(line => line.contains("M"))
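Filtering with contains("M") is enough for this sample, but it would also match an M appearing in any other field. A slightly stricter variant (a sketch, assuming the gender is always the fourth comma-separated field) splits each line and checks that field directly:
// Index 3 is the gender column in this data layout.
val maleLinesStrict = rdd.filter(line => line.split(",")(3) == "M")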
4) Use count on the result of step (3) to get the number of lines.
Count the lines that contain the male marker M:
maleLines.count()
Display the filtered data:
maleLines.collect().foreach(println)
2. Word count in a document
2. The data file words.txt contains several lines of sentences. Count the words in the document and store the words whose count exceeds 3 to HDFS.
WHat is going on there?
I talked to John on email. We talked about some computer stuff that's it.
I went bike riding in the rain, it was not that cold.
We went to the museum in SF yesterday it was $3 to get in and they had
free food. At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.
Implementation approach:
(1) Read the data with textFile.
(2) Split each line into words with flatMap.
(3) Map each word to the form (word, 1).
(4) Sum all the values of the same word with reduceByKey.
(5) Filter out the words whose count is greater than 3 with filter.
(6) Write the result to HDFS with saveAsTextFile.
1) Preparation
Create words.txt:
cd /data/spark/wordcount
vim words.txt
Enter the following into words.txt:
WHat is going on there?
I talked to John on email. We talked about some computer stuff that's it.
I went bike riding in the rain, it was not that cold.
We went to the museum in SF yesterday it was $3 to get in and they had
free food. At the same time was a SF Giants game, when we got done we
had to take the train with all the Giants fans, they are 1/2 drunk.
Upload the data to HDFS:
hadoop fs -put /data/spark/wordcount/words.txt /myspark/wordcount
Start spark-shell:
spark-shell
2) Read the data with textFile:
val textFile = sc.textFile("hdfs://localhost:9000/myspark/wordcount/words.txt");
3) Split each line into words with flatMap, map each word to (word, 1), and sum all the values of the same word with reduceByKey:
val wordCounts = textFile.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey((a, b) => a + b)
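To inspect the intermediate word counts before filtering (an optional check, not part of the original steps):
wordCounts.collect().foreach(println)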
4) Filter out the words whose count is greater than 3. Note that collect() is not called here, because the result must remain an RDD for saveAsTextFile in the next step:
val result = wordCounts.filter(_._2 > 3)
5) Write the result to HDFS with saveAsTextFile.
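The original text does not give an output path, so the call below uses an illustrative directory next to the input (adjust as needed); note that saveAsTextFile requires the target directory not to exist yet:
result.saveAsTextFile("hdfs://localhost:9000/myspark/wordcount/wordcount_output")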
3. Count the number of items each user has favorited
3. An e-commerce site records a large amount of data about users favoriting products, stored in a file named buyer_favorite1. The data format and content are as follows:
User ID (buyer_id), product ID (goods_id), favorite date (dt)
buyer_id    goods_id    dt
10181 1000481 2010-04-04 16:54:31
20001 1001597 2010-04-07 15:07:52
20001 1001560 2010-04-07 15:08:27
20042 1001368 2010-04-08 08:20:30
20067 1002061 2010-04-08 16:45:33
Use the Spark Scala API or the Spark Java API to run a wordcount-style operation on the favorite data and count how many products each user has favorited.
1) Preparation
Create the file:
mkdir -p /data/spark3/wordcount
cd /data/spark3/wordcount
vim buyer_favorite1
Press i to enter insert mode and paste the following data:
10181 1000481 2010-04-04 16:54:31
20001 1001597 2010-04-07 15:07:52
20001 1001560 2010-04-07 15:08:27
20042 1001368 2010-04-08 08:20:30
20067 1002061 2010-04-08 16:45:33
Press Esc, then type :wq to save and exit.
Upload the file to HDFS:
hadoop fs -put /data/spark3/wordcount/buyer_favorite1 /myspark3/wordcount
Start spark-shell:
spark-shell
2) Read the data, map each line to (buyer_id, 1), and sum the values per user with reduceByKey; mapping each line to 1 rather than 0 is what makes the sum a per-user count:
val rdd = sc.textFile("hdfs://localhost:9000/myspark3/wordcount/buyer_favorite1")
rdd.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _).collect
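To make the result easier to read, the (buyer_id, count) pairs can be printed one per line, and they can also be written back to HDFS in the same way as in exercise 2 (the output path below is illustrative):
val favoriteCounts = rdd.map(line => (line.split("\\s+")(0), 1)).reduceByKey(_ + _)
favoriteCounts.collect().foreach(println)
favoriteCounts.saveAsTextFile("hdfs://localhost:9000/myspark3/wordcount/output")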