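- Project directory name: countjpgs
- pom.xml file (located in the project directory)
- countjpgs => src => main => scala => stubs => CountJPGs.scala
- The weblogs files are stored in HDFS under the /loudacre directory. They are web server log files containing many kinds of requests.
Contents of pom.xml: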
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.cloudera.training.dev1</groupId>
<artifactId>countjpgs</artifactId>
<version>1.0</version>
<packaging>jar</packaging>
<name>Count JPGs</name>
<properties>
<spark-assembly>/usr/lib/spark/lib/spark-assembly.jar</spark-assembly>
<hadoop-mapreduce-client-common>/usr/lib/hadoop/client/hadoop-mapreduce-client-common.jar</hadoop-mapreduce-client-common>
<hadoop-mapreduce-client-core>/usr/lib/hadoop/client/hadoop-mapreduce-client-core.jar</hadoop-mapreduce-client-core>
<hadoop-common>/usr/lib/hadoop/client/hadoop-common.jar</hadoop-common>
<avro>/usr/lib/hadoop/client/avro.jar</avro>
<commons-lang>/usr/lib/hadoop/client/commons-lang.jar</commons-lang>
<guava>/usr/lib/hadoop/client/guava.jar</guava>
<slf4j-api>/usr/lib/hadoop/client/slf4j-api.jar</slf4j-api>
<slf4j-log4j12>/usr/lib/hadoop/client/slf4j-log4j12.jar</slf4j-log4j12>
<hadoop-annotations>/usr/lib/hadoop/client/hadoop-annotations.jar</hadoop-annotations>
</properties>
<repositories>
<repository>
<id>apache-repo</id>
<name>Apache Repository</name>
<url>https://repository.apache.org/content/repositories/releases</url>
<releases>
<enabled>true</enabled>
</releases>
<snapshots>
<enabled>false</enabled>
</snapshots>
</repository>
<repository>
<id>cloudera-repo-releases</id>
<url>https://repository.cloudera.com/artifactory/repo/</url>
</repository>
</repositories>
<build>
<plugins>
<plugin>
<groupId>org.scala-tools</groupId>
<artifactId>maven-scala-plugin</artifactId>
<version>2.15.2</version>
<executions>
<execution>
<goals>
<goal>compile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>2.5.1</version>
<configuration>
<source>1.7</source>
<target>1.7</target>
</configuration>
</plugin>
</plugins>
</build>
<dependencies>
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<version>2.10.5</version>
<scope>system</scope>
<systemPath>${spark-assembly}</systemPath>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>local</version>
<scope>system</scope>
<systemPath>${spark-assembly}</systemPath>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>local</version>
<scope>system</scope>
<systemPath>${hadoop-common}</systemPath>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-mapreduce-client-common</artifactId>
<version>local</version>
<scope>system</scope>
<systemPath>${hadoop-mapreduce-client-common}</systemPath>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-annotations</artifactId>
<version>local</version>
<scope>system</scope>
<systemPath>${hadoop-annotations}</systemPath>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>avro</artifactId>
<version>local</version>
<scope>system</scope>
<systemPath>${avro}</systemPath>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>slf4j-log4j12</artifactId>
<version>local</version>
<scope>system</scope>
<systemPath>${slf4j-log4j12}</systemPath>
</dependency>
</dependencies>
</project>
Contents of CountJPGs.scala:
package stubs
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object CountJPGs {
def main(args: Array[String]): Unit = {
if (args.length < 1) {
System.err.println("Usage: CountJPGs <file>")
System.exit(1)
}
// The master and app name are supplied by spark-submit, so the no-argument constructor is enough
val sc = new SparkContext()
// The input path is the first command-line argument, e.g. /loudacre/weblogs/*
val logfile = args(0)
val weblogs = sc.textFile(logfile)
// Keep only the log lines that mention a .jpg file
val weblogsJpg = weblogs.filter(_.contains(".jpg"))
val weblogsJpgCount = weblogsJpg.count()
println("JPG Count : " + weblogsJpgCount)
sc.stop()
}
}
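The filter-and-count step above can be tried without a cluster. The following is a minimal sketch, not part of the original exercise: it applies the same `contains(".jpg")` test to an ordinary in-memory Scala collection. The object name `CountJPGsLocal`, the helper `countJpgs`, and the sample log lines are all made up for illustration.

```scala
// Standalone sketch: the same filter-and-count logic on an in-memory collection.
// CountJPGsLocal, countJpgs and the sample lines are hypothetical, for illustration only.
object CountJPGsLocal {
  // Count the lines that mention a .jpg file, mirroring filter(...).count() on an RDD
  def countJpgs(lines: Seq[String]): Long =
    lines.count(_.contains(".jpg")).toLong

  def main(args: Array[String]): Unit = {
    val sampleLines = Seq(
      "GET /images/logo.jpg HTTP/1.0",
      "GET /index.html HTTP/1.0",
      "GET /photos/banner.jpg HTTP/1.0"
    )
    // Two of the three sample lines contain ".jpg"
    println("JPG Count : " + countJpgs(sampleLines))
  }
}
```

Note that a plain substring test also matches lines where ".jpg" appears anywhere in the text, such as in a referrer URL, which is the same behaviour as the Spark version.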
Change to the project root directory, countjpgs:
$ cd <path-to-project>/countjpgs
Package the program:
$ mvn package
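After a successful build, the JAR is generated in the target folder. Its name is derived from the project's artifactId and version, here countjpgs-1.0.jar.
Staying in the project root directory, countjpgs:
$ cd <path-to-project>/countjpgs
Run the program with the spark-submit command: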
$ spark-submit --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*
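Output: the driver prints a line beginning with "JPG Count : " followed by the number of matching log lines.
Addendum: the command to submit the job to a YARN cluster: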
$ spark-submit --class stubs.CountJPGs --master yarn-client --name 'Count JPGs' target/countjpgs-1.0.jar /loudacre/weblogs/*
Alternatively, create a properties file in the project root and pass it to spark-submit:
$ vim myspark.conf
Contents of this file:
spark.app.name My Spark App
spark.master yarn-client
spark.executor.memory 400M
Launch command:
$ spark-submit --properties-file myspark.conf --class stubs.CountJPGs target/countjpgs-1.0.jar /loudacre/weblogs/*
The corresponding settings are then visible in the YARN web UI.
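The properties file used above is a list of lines, each holding a key and a value separated by whitespace. As a rough way to check how such lines split into keys and values, the sketch below parses the same three lines with Java's `java.util.Properties`, which accepts this shape. The object name `ShowSparkProps` and its `parse` helper are hypothetical, and this is only an illustration of the file format, not a description of how spark-submit itself is implemented.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical helper for inspecting a whitespace-separated properties file.
object ShowSparkProps {
  // Parse "key value" lines; the key ends at the first whitespace,
  // and the rest of the line (including inner spaces) is the value
  def parse(text: String): Properties = {
    val props = new Properties()
    props.load(new StringReader(text))
    props
  }

  def main(args: Array[String]): Unit = {
    val text =
      """spark.app.name My Spark App
        |spark.master yarn-client
        |spark.executor.memory 400M""".stripMargin
    val props = parse(text)
    // Print each key with the value it was paired with
    Seq("spark.app.name", "spark.master", "spark.executor.memory").foreach { key =>
      println(key + " -> " + props.getProperty(key))
    }
  }
}
```

The point to notice is that the app name keeps its inner spaces without any quoting, because only the first run of whitespace on the line separates the key from the value.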