Word count with Spark

Word count:

The official example is on the Spark examples page:

http://spark.apache.org/examples.html
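
The Scala word-count example on that page looks roughly like the sketch below (paraphrased, assuming an existing SparkContext named sc; the hdfs:// paths are placeholders, not paths from this project):

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")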

This is a small case study that builds further on that example; I implemented it in two languages.

 

Main files:

words.txt:

hello me
hello you
hello her
hello me
hello you
hello her
hello me
hello you
hello her
hello me
hello you
hello her

pom.xml (pulls in the required dependencies):

<!-- Repository locations: aliyun, cloudera, and jboss, in that order -->
    <repositories>
        <repository>
            <id>aliyun</id>
            <url>http://maven.aliyun.com/nexus/content/groups/public/</url>
        </repository>
        <repository>
            <id>cloudera</id>
            <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>jboss</id>
            <url>http://repository.jboss.com/nexus/content/groups/public</url>
        </repository>
    </repositories>
    <properties>
        <maven.compiler.source>1.8</maven.compiler.source>
        <maven.compiler.target>1.8</maven.compiler.target>
        <encoding>UTF-8</encoding>
        <scala.version>2.11.8</scala.version>
        <scala.compat.version>2.11</scala.compat.version>
        <hadoop.version>2.7.4</hadoop.version>
        <spark.version>2.2.0</spark.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>${scala.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-hive-thriftserver_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <!-- <dependency>
             <groupId>org.apache.spark</groupId>
             <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
             <version>${spark.version}</version>
         </dependency>-->
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-streaming-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>
        <dependency>
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-sql-kafka-0-10_2.11</artifactId>
            <version>${spark.version}</version>
        </dependency>

        <!--<dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.6.0-mr1-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.2.0-cdh5.14.0</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.2.0-cdh5.14.0</version>
        </dependency>-->

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.7.4</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-client</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>org.apache.hbase</groupId>
            <artifactId>hbase-server</artifactId>
            <version>1.3.1</version>
        </dependency>
        <dependency>
            <groupId>com.typesafe</groupId>
            <artifactId>config</artifactId>
            <version>1.3.3</version>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>5.1.38</version>
        </dependency>
    </dependencies>

    <build>
        <sourceDirectory>src/main/scala</sourceDirectory>
      <!--  <testSourceDirectory>src/test/scala</testSourceDirectory>-->
        <plugins>
            <!-- Plugin for compiling Java -->
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.5.1</version>
            </plugin>
            <!-- Plugin for compiling Scala -->
            <plugin>
                <groupId>net.alchim31.maven</groupId>
                <artifactId>scala-maven-plugin</artifactId>
                <version>3.2.2</version>
                <executions>
                    <execution>
                        <goals>
                            <goal>compile</goal>
                            <goal>testCompile</goal>
                        </goals>
                        <configuration>
                            <args>
                                <arg>-dependencyfile</arg>
                                <arg>${project.build.directory}/.scala_dependencies</arg>
                            </args>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-surefire-plugin</artifactId>
                <version>2.18.1</version>
                <configuration>
                    <useFile>false</useFile>
                    <disableXmlReport>true</disableXmlReport>
                    <includes>
                        <include>**/*Test.*</include>
                        <include>**/*Suite.*</include>
                    </includes>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>2.3</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <filters>
                                <filter>
                                    <artifact>*:*</artifact>
                                    <excludes>
                                        <exclude>META-INF/*.SF</exclude>
                                        <exclude>META-INF/*.DSA</exclude>
                                        <exclude>META-INF/*.RSA</exclude>
                                    </excludes>
                                </filter>
                            </filters>
                            <transformers>
                                <transformer
                                        implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                    <mainClass></mainClass>
                                </transformer>
                            </transformers>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
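
For reference: with this build section, mvn package compiles the Scala sources under src/main/scala and the maven-shade-plugin bundles the classes and dependencies into a single runnable jar. Since mainClass is left empty here, the class to run has to be named explicitly when the jar is submitted (for example via spark-submit's --class option).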

Java implementation:

package cn.itcast;

import org.apache.spark.SparkConf;
import org.apache.spark.SparkContext;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import scala.Tuple2;

import java.util.Arrays;

public class Demo {
    public static void main(String[] args) {
        //1. Create the SparkContext
        SparkConf conf = new SparkConf().setAppName("wc").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);
        //2. Read the file
        JavaRDD<String> stringJavaRDD = sc.textFile("F:data\\words.txt");
        //3. Split each line on spaces
        //In Java these operators take an object implementing a functional interface; the source shows flatMap expects:
        //public interface FlatMapFunction<T, R> extends Serializable {
        //  Iterator<R> call(T t) throws Exception;
        //}
        //T is String, i.e. each input line
        //Iterator<R> is Iterator<String>: the result is an iterator over the words of that line
        //Java lambda syntax: (parameters) -> { body }
        JavaRDD<String> wordRDD = stringJavaRDD.flatMap(s -> Arrays.asList(s.split(" ")).iterator());
        //4. Map each word to (word, 1)
        //public interface PairFunction<T, K, V> extends Serializable {
        //  Tuple2<K, V> call(T t) throws Exception;
        //}
        JavaPairRDD<String, Integer> wordAndOne = wordRDD.mapToPair(w -> new Tuple2<>(w, 1));
        //5. Aggregate by key
        JavaPairRDD<String, Integer> result = wordAndOne.reduceByKey((a, b) -> a + b);

        //6. Collect and print the results
        //result.foreach(System.out::println);
        result.foreach(t -> System.out.println(t));
        sc.close();
    }
}

Scala implementation:

package cn.itcast.sparkhello

import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object WordCount_3 {
  def main(args: Array[String]): Unit = {
    //1. Create the SparkContext
    val conf = new SparkConf().setAppName("wc").setMaster("local[*]")
    val sc = new SparkContext(conf)
    //2. Read the file
    //A Resilient Distributed Dataset (RDD) can be thought of as a distributed collection,
    //i.e. Spark's abstraction over a local collection: it lets you work with
    //distributed data as easily as with a local collection
    //RDD[line]
    val fileRDD: RDD[String] = sc.textFile("F:data\\words.txt")
    //3. Process the data
    //3.1 Split each line on spaces and flatten
    //RDD[word]
    val wordRDD: RDD[String] = fileRDD.flatMap(line => line.split(" "))
    //3.2 Map each word to 1
    //RDD[(word, 1)]
    val wordAndOneRDD = wordRDD.map(word => (word, 1))
    //3.3 Aggregate by key
    val wordAndCount: RDD[(String, Int)] = wordAndOneRDD.reduceByKey((a, b) => a + b)
    //4. Print the results
    wordAndCount.foreach(println(_))

    sc.stop()

  }
}
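
A note on step 4: with master local[*] the println output shows up in the local console, but on a real cluster foreach runs on the executors, so the lines would land in the executor logs rather than on the driver. A small variant (just a sketch, replacing the foreach line inside main above) collects the counts to the driver, sorts them, and prints them there:

    //Bring the (word, count) pairs back to the driver, sorted by count in descending order
    val sorted: Array[(String, Int)] = wordAndCount.sortBy(_._2, ascending = false).collect()
    //An ordinary local println on the driver
    sorted.foreach(println)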

Run output (Java and Scala):
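
With the words.txt above, hello appears 12 times and me, you, and her 4 times each, so both versions should print output along these lines (the order of the tuples may differ between runs):

(hello,12)
(me,4)
(you,4)
(her,4)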
