Packaging a Spark Maven project and running it with spark-submit

  • Project directory name: countjpgs
  • pom.xml (located in the project root)
  • countjpgs => src => main => scala => stubs => CountJPGs.scala
  • The weblogs data lives under the /loudacre directory in HDFS: web log files containing requests of various kinds.

Contents of pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>com.cloudera.training.dev1</groupId>
  <artifactId>countjpgs</artifactId>
  <version>1.0</version>
  <packaging>jar</packaging>
  <name>Count JPGs</name>
  
  <properties>
    <spark-assembly>/usr/lib/spark/lib/spark-assembly.jar</spark-assembly>
    <hadoop-mapreduce-client-common>/usr/lib/hadoop/client/hadoop-mapreduce-client-common.jar</hadoop-mapreduce-client-common>
    <hadoop-mapreduce-client-core>/usr/lib/hadoop/client/hadoop-mapreduce-client-core.jar</hadoop-mapreduce-client-core>
    <hadoop-common>/usr/lib/hadoop/client/hadoop-common.jar</hadoop-common>
    <avro>/usr/lib/hadoop/client/avro.jar</avro>
    <commons-lang>/usr/lib/hadoop/client/commons-lang.jar</commons-lang>
    <guava>/usr/lib/hadoop/client/guava.jar</guava>
    <slf4j-api>/usr/lib/hadoop/client/slf4j-api.jar</slf4j-api>
    <slf4j-log4j12>/usr/lib/hadoop/client/slf4j-log4j12.jar</slf4j-log4j12>
    <hadoop-annotations>/usr/lib/hadoop/client/hadoop-annotations.jar</hadoop-annotations>
  </properties>
  
  <repositories>
    <repository>
      <id>apache-repo</id>
      <name>Apache Repository</name>
      <url>https://repository.apache.org/content/repositories/releases</url>
      <releases>
        <enabled>true</enabled>
      </releases>
      <snapshots>
        <enabled>false</enabled>
      </snapshots>
    </repository>
   <repository>
     <id>cloudera-repo-releases</id>
     <url>https://repository.cloudera.com/artifactory/repo/</url>
   </repository> 
  </repositories>

  <build>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
	    <version>2.15.2</version>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <artifactId>maven-compiler-plugin</artifactId>
	    <version>2.5.1</version>
        <configuration>
          <source>1.7</source>
          <target>1.7</target>
        </configuration>
      </plugin>
    </plugins>  
  </build>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>2.10.5</version>
      <scope>system</scope>
      <systemPath>${spark-assembly}</systemPath>
    </dependency>
    <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.10</artifactId>
       <version>local</version>
       <scope>system</scope>
       <systemPath>${spark-assembly}</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-common</artifactId>
        <version>local</version>
        <scope>system</scope>
        <systemPath>${hadoop-common}</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-mapreduce-client-common</artifactId>
        <version>local</version>
        <scope>system</scope>
        <systemPath>${hadoop-mapreduce-client-common}</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-annotations</artifactId>
        <version>local</version>
        <scope>system</scope>
        <systemPath>${hadoop-annotations}</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>avro</artifactId>
        <version>local</version>
        <scope>system</scope>
        <systemPath>${avro}</systemPath>
    </dependency>
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>slf4j-log4j12</artifactId>
        <version>local</version>
        <scope>system</scope>
        <systemPath>${slf4j-log4j12}</systemPath>
    </dependency>

  </dependencies>
</project>

Contents of CountJPGs.scala:

package stubs

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object CountJPGs {
  def main(args: Array[String]): Unit = {
    if (args.length < 1) {
      System.err.println("Usage: CountJPGs <file>")
      System.exit(1)
    }
    // Master, app name, etc. are supplied by spark-submit at launch time
    val sc = new SparkContext()
    val logfile = args(0)
    val weblogs = sc.textFile(logfile)
    // Keep only requests for .jpg files, then count them
    val weblogsJpg = weblogs.filter(_.contains(".jpg"))
    val weblogsJpgCount = weblogsJpg.count()
    println("JPG Count : " + weblogsJpgCount)
    sc.stop()
  }
}
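The filter-and-count step is ordinary Scala, so the logic can be sanity-checked locally on a plain collection before submitting to the cluster. A minimal sketch (the sample log lines below are made up for illustration):

```scala
object FilterCheck {
  def main(args: Array[String]): Unit = {
    // Hypothetical lines standing in for weblog records
    val lines = Seq(
      "GET /images/logo.jpg HTTP/1.1",
      "GET /index.html HTTP/1.1",
      "GET /photos/cat.jpg HTTP/1.1"
    )
    // Same predicate used in CountJPGs: keep lines containing ".jpg"
    val jpgCount = lines.filter(_.contains(".jpg")).size
    println("JPG Count : " + jpgCount)
  }
}
```

The RDD version behaves the same way, except `filter` and `count` run distributed across the cluster.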

Change into the project root directory, countjpgs:

$ cd <path-to-project>/countjpgs

Package the application:

$ mvn package

After a successful build, the jar is generated in the target directory, named after the artifactId and version (here, countjpgs-1.0.jar).

From the project root directory, countjpgs, again:

$ cd <path-to-project>/countjpgs

Run the program with spark-submit:

$ spark-submit --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*

Output: a line of the form `JPG Count : <count>`.

Addendum: the command to submit the job to a YARN cluster:

$ spark-submit --class stubs.CountJPGs --master yarn-client --name 'Count JPGs' target/countjpgs-1.0.jar /loudacre/weblogs/*

Alternatively, you can create a configuration file in the project root and pass it to spark-submit:

$ vim myspark.conf

Contents of this file:

spark.app.name My Spark App
spark.master yarn-client
spark.executor.memory 400M
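Spark reads this file in java.util.Properties format: each line is a key, the first run of whitespace, then the value (which may itself contain spaces). A quick way to verify the file parses as intended, sketched in plain Scala:

```scala
import java.io.StringReader
import java.util.Properties

object ConfCheck {
  def main(args: Array[String]): Unit = {
    // Contents of myspark.conf as shown above
    val conf =
      """spark.app.name My Spark App
        |spark.master yarn-client
        |spark.executor.memory 400M""".stripMargin
    val props = new Properties()
    // The first unescaped whitespace separates key from value
    props.load(new StringReader(conf))
    println(props.getProperty("spark.app.name"))      // My Spark App
    println(props.getProperty("spark.master"))        // yarn-client
    println(props.getProperty("spark.executor.memory")) // 400M
  }
}
```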

Launch command:

$ spark-submit --properties-file myspark.conf --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*

The corresponding configuration then shows up in the YARN web UI.
