This is my first time writing a Spark program with IntelliJ IDEA. My company requires Java, but most of the material online writes Spark programs in Scala (which is, admittedly, a much better fit than Java), so after some trial and error I decided to write the whole workflow down.
Development environment:
- IntelliJ IDEA 14
- JDK: 1.7.71
- Spark: 1.1.0
- Hadoop: 2.4.0
- Scala: 2.11.1
- Maven: 3.2.5
Steps:
1. Create a Maven project
Under the src directory, create a main/java source folder (in File --> Project Structure... --> Modules --> Sources, right-click to add the directory and mark it as a Sources root)
In File --> Project Structure... --> Libraries, add spark-assembly-1.1.0-hadoop2.4.0 as a dependency
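Alternatively, instead of attaching the assembly jar by hand, the Spark dependency can be declared in pom.xml and resolved by Maven. The snippet below is only a sketch of that approach, not part of the original setup: Spark 1.1.0 was published against Scala 2.10, so spark-core_2.10 is assumed here, and provided scope keeps Spark out of your own jar since the cluster already ships it.

<!-- Assumed alternative to manually adding the spark-assembly jar; adjust coordinates to your cluster -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.10</artifactId>
    <version>1.1.0</version>
    <scope>provided</scope>
</dependency>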
2. Write the WordCount example program under the java directory
Code:
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.FlatMapFunction;
import org.apache.spark.api.java.function.Function2;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/**
 * Created by yhao on 2015/3/12.
 */
public class JavaWordCount {
    private static final Pattern SPACE = Pattern.compile(" ");

    public static void main(String[] args) throws Exception {
        SparkConf sparkConf = new SparkConf().setAppName("JavaWordCount");
        String srcPath = null;
        String desPath = "/apps/ca/yanh/output";

        if (args.length == 1) {
            srcPath = args[0];
        } else if (args.length == 2) {
            srcPath = args[0];
            desPath = args[1];
        } else {
            System.out.println("Usage: JavaWordCount <srcPath> [desPath]");
            System.exit(1);
        }

        JavaSparkContext jsc = new JavaSparkContext(sparkConf);
        JavaRDD<String> lines = jsc.textFile(srcPath, 1);

        System.out.println("Begin to split!");
        // split each line on spaces to get individual words
        JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
            @Override
            public Iterable<String> call(String s) throws Exception {
                return Arrays.asList(SPACE.split(s));
            }
        });

        System.out.println("Begin to map!");
        // map each word to a (word, 1) pair
        JavaPairRDD<String, Integer> ones = words.mapToPair(new PairFunction<String, String, Integer>() {
            @Override
            public Tuple2<String, Integer> call(String s) throws Exception {
                return new Tuple2<String, Integer>(s, 1);
            }
        });

        System.out.println("Begin to reduce!");
        // sum the counts for each word
        JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer i1, Integer i2) throws Exception {
                return i1 + i2;
            }
        });

        System.out.println("Begin to save!");
        /* To print the result to the console instead of saving it:
        List<Tuple2<String, Integer>> output = counts.collect();
        for (Tuple2<?, ?> tuple : output) {
            System.out.println(tuple._1() + ": " + tuple._2());
        }*/
        counts.saveAsTextFile(desPath);

        jsc.stop();
    }
}
3. Package the program into a jar
In File --> Project Structure... --> Artifacts, click the green "+", then Add --> JAR --> From Modules with Dependencies
Enter the main class (the entry point), delete all the jars under Output Layout (the Spark runtime environment already contains them), then Apply, OK
Build the program: Build --> Build Artifacts..., then select the artifact to build
The generated jar can be found in the project's out directory
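If you prefer building from the command line rather than the IDEA artifact dialog, a plain mvn clean package does the same job, producing a thin jar under target/ that contains only your classes (assuming the Spark dependency is marked provided as in the pom snippet above). The pom addition below is an optional assumption of mine, not from the original setup; it just records the entry class in the manifest and is not strictly required since spark-submit is given --class explicitly.

<!-- Assumed optional pom.xml addition: record the entry class in the jar manifest -->
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-jar-plugin</artifactId>
            <configuration>
                <archive>
                    <manifest>
                        <mainClass>JavaWordCount</mainClass>
                    </manifest>
                </archive>
            </configuration>
        </plugin>
    </plugins>
</build>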
4. Run the program
Upload the jar to the Spark cluster and submit it with spark-submit (see the spark-submit help for the full set of options)
Submit command: spark-submit --class JavaWordCount ~/JavaWordCount.jar /apps/ca/yanh/data/README.md
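For a quick test without a cluster (my own suggestion, not part of the original run), the same jar can also be submitted to a local master; the output path /tmp/wordcount-output below is just a placeholder:
spark-submit --master local[2] --class JavaWordCount ~/JavaWordCount.jar /apps/ca/yanh/data/README.md /tmp/wordcount-output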
FAQ: If the job fails with an error complaining about missing native Hadoop libraries or the LZO codec, it is caused by the missing native library dependencies and compression jar.
The required jar is available here: http://pan.baidu.com/s/1rqkQa
spark-submit command: spark-submit --driver-library-path /usr/lib/hadoop/lib/native/ --jars /usr/lib/hadoop/lib/hadoop-lzo-0.6.0.jar --class JavaWordCount ~/JavaWordCount.jar /apps/ca/yanh/data/README.md