Software environment:
spark-1.6.3-bin-hadoop2.6, hadoop-2.6.4, jdk1.7.0_67, IDEA 14.1.5
The Hadoop cluster is a pseudo-distributed installation, and only HDFS is started during these experiments; Spark starts a single Worker. The Hadoop and Spark clusters run inside a virtual machine, while IDEA is installed directly on Win10. The VM's IP is 192.168.128.128; the local machine's IP is 192.168.0.183.
For connecting to a Spark cluster from Java via YARN, see: Java Web提交任務到Spark. The motivation for this post is that, in practice, submitting to the Spark cluster through YARN proved too slow, so I tried having Java connect directly to a Spark Standalone cluster. Note that a single node is used here; with multiple nodes the behavior may differ.
Five experiments were run in total, ultimately arriving at a setup that can both connect to the Spark Standalone cluster and monitor the submitted job. All code can be downloaded from https://github.com/fansy1990/JavaConnectSaprk01 .
Task 1: Connect by setting the master directly
1.1 Create a Scala project
1.2 Create the sample program
package demo

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by fansy on 2017/7/5.
 */
object WordCount {
  def main(args: Array[String]) {
    val input = "hdfs://192.168.128.128:8020/user/root/magic"
    val output = ""
    val appName = "word count"
    val master = "spark://192.168.128.128:7077"

    val conf = new SparkConf().setAppName(appName).setMaster(master)
    val sc = new SparkContext(conf)
    val line = sc.textFile(input)
    line.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}
The master URL here points straight at the Spark cluster; the job is run against it, and sc.stop() is called at the end to close the SparkContext. Running it, however, fails with an error.
1.3 Problem analysis
The driver runs inside IDEA on the local Windows machine, but the jar containing the WordCount classes is never shipped to the cluster, so the executors cannot load the application classes (typically surfacing as a ClassNotFoundException). The fix, tried in Task 2, is to register the application jar explicitly.
Task 2: Add the application jar path and connect to the Master
2.1 Modify the code
val jars = Array("C:\\Users\\fansy\\workspace_idea_tmp\\JavaConnectSaprk01\\out\\artifacts\\wordcount\\wordcount.jar")
val conf = new SparkConf().setAppName(appName).setMaster(master).setJars(jars)
2.2 Run the code and observe the result
With the application jar registered via setJars, the executors can now load the WordCount classes, and the job runs to completion. A full listing with this change applied is sketched below.
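For reference, a sketch of the complete Task 1 program with the jar registered (the jar path is the one used in the post; adjust it to your own artifact output):

package demo

import org.apache.spark.{SparkContext, SparkConf}

// Task 1's WordCount with the application jar registered via setJars,
// so the cluster executors can load the application classes.
object WordCount {
  def main(args: Array[String]) {
    val input = "hdfs://192.168.128.128:8020/user/root/magic"
    val appName = "word count"
    val master = "spark://192.168.128.128:7077"
    // jar built from this project (assumption: same artifact as in the post)
    val jars = Array("C:\\Users\\fansy\\workspace_idea_tmp\\JavaConnectSaprk01\\out\\artifacts\\wordcount\\wordcount.jar")

    val conf = new SparkConf().setAppName(appName).setMaster(master).setJars(jars)
    val sc = new SparkContext(conf)
    sc.textFile(input).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
    sc.stop()
  }
}

Remember to rebuild the artifact jar before each run, so the classes shipped to the cluster match the code being executed.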
Task 3: Trying to run the Driver on a different node
3.1 Task description
In Tasks 1 and 2 the driver runs on the local Windows machine. Here the idea is to try moving the driver onto a cluster node instead, by pointing spark.driver.host at the VM.
3.2 Try modifying the driver.host parameter
val conf = new SparkConf().setAppName(appName).setMaster(master).setJars(jars)
  .set("spark.eventLog.enabled", "true")
  .set("spark.eventLog.dir", "hdfs://node10:8020/eventLog")
  .set("spark.driver.host", "192.168.128.128")
  .set("spark.driver.port", "8993")
val sc = new SparkContext(conf)
Running this, the job cannot even be submitted, and I have not yet found the cause. It appears that placing the Driver on another node this way is problematic (at least for Standalone mode at present).
Task 4: Submitting a Spark job to the Spark Standalone cluster from a Java thread
4.1 Implementation approach
Wrap the Spark job in a thread task with a return value: the job class exposes a run method returning a Boolean, a Callable wraps that call, and the main class runs the Callable on a separate thread via a FutureTask, polling it until it completes.
4.2 Implementation
package demo03;

import java.util.concurrent.Callable;

/**
 * Thread task
 * Created by fansy on 2017/7/5.
 */
public class RunTool implements Callable<Boolean> {
    private String input;
    private String output;
    private String appName;
    private String master;
    private String jars;
    private String logEnabled;
    private String logDir;

    public RunTool() {}

    public RunTool(String[] args) {
        this.input = args[0];
        this.output = args[1];
        this.appName = args[2];
        this.master = args[3];
        this.jars = args[4];
        this.logEnabled = args[5];
        this.logDir = args[6];
    }

    @Override
    public Boolean call() throws Exception {
        return WordCount.run(new String[]{input, output, appName, master, jars, logEnabled, logDir});
    }
}
The thread class implements the Callable interface so that it has a return value, which the main class uses to decide whether the job succeeded.

package demo03;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

/**
 * Created by fansy on 2017/7/5.
 */
public class Driver {
    public static void main(String[] args) {
        // <input> <output> <appName> <master>
        // <jars> <logEnabled> <logDir>
        String[] arg = new String[]{
                "hdfs://node10:8020/user/root/magic",
                "",
                "wordcount" + System.currentTimeMillis(),
                "spark://node10:7077",
                "C:\\Users\\fansy\\workspace_idea_tmp\\JavaConnectSaprk01\\out\\artifacts\\wordcount\\wordcount.jar",
                "true",
                "hdfs://node10:8020/eventLog"
        };
        FutureTask<Boolean> future = new FutureTask<>(new RunTool(arg));
        new Thread(future).start();
        boolean flag = true;
        while (flag) {
            try {
                Thread.sleep(2000);
                System.out.println("Job running ...");
                if (future.isDone()) {
                    flag = false;
                    if (future.get().booleanValue()) {
                        System.out.println("Job done with success state");
                    } else {
                        System.out.println("Job failed!");
                    }
                }
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        }
    }
}
The main class polls every 2 seconds to check the thread task's state, and finally uses the task's return value to decide whether the job succeeded.
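The demo03.WordCount whose run method RunTool invokes is not listed in the post. A minimal sketch of what it might look like, reusing the word-count logic from Tasks 1 and 2 (everything except the run(String[]) signature is an assumption):

package demo03

import org.apache.spark.{SparkContext, SparkConf}

// Hypothetical sketch of demo03.WordCount: builds its own SparkContext
// from the seven arguments passed by RunTool and returns true on success.
object WordCount {
  // args: <input> <output> <appName> <master> <jars> <logEnabled> <logDir>
  def run(args: Array[String]): Boolean = {
    val conf = new SparkConf()
      .setAppName(args(2))
      .setMaster(args(3))
      .setJars(args(4).split(","))
      .set("spark.eventLog.enabled", args(5))
      .set("spark.eventLog.dir", args(6))
    val sc = new SparkContext(conf)
    try {
      sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
      true
    } catch {
      case e: Exception =>
        e.printStackTrace()
        false
    } finally {
      sc.stop()
    }
  }
}

Because a Scala object exposes its methods as static forwarders, the Java call WordCount.run(new String[]{...}) in RunTool compiles against this object directly.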
Task 5: Add richer monitoring
5.1 Task description
Task 4 only reports whether the whole application succeeded or failed; here the goal is to also monitor the state of each Job inside the running application.
5.2 Implementation approach
Create a single SparkContext in the main thread, hand it to the job thread, and use the SparkContext's SparkStatusTracker in the main thread to poll the status of every Job until the context is stopped.
5.3 Concrete implementation
package demo04;

import org.apache.spark.SparkContext;
import org.apache.spark.SparkJobInfo;
import org.apache.spark.SparkStatusTracker;

import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;

/**
 * Created by fansy on 2017/7/5.
 */
public class Driver {
    public static void main(String[] args) throws InterruptedException {
        String master = "spark://node10:7077";
        String appName = "wordcount" + System.currentTimeMillis();
        String[] jars = "C:\\Users\\fansy\\workspace_idea_tmp\\JavaConnectSaprk01\\out\\artifacts\\wordcount\\wordcount.jar".split(",");
        String logEnabled = "true";
        String logDir = "hdfs://node10:8020/eventLog";
        String[] arg = new String[]{
                "hdfs://node10:8020/user/root/magic",
                ""
        };
        // 1. Get the SparkContext
        SparkContext sc = Utils.getSc(master, appName, jars, logEnabled, logDir);
        // 2. Submit the job on a separate thread
        FutureTask<Boolean> future = new FutureTask<>(new WordCount(sc, arg));
        new Thread(future).start();
        // 3. Monitor
        String appId = sc.applicationId();
        System.out.println("AppId:" + appId);
        SparkStatusTracker sparkStatusTracker = null;
        int[] jobIds;
        SparkJobInfo jobInfo;
        while (!sc.isStopped()) { // keep monitoring while sc has not been stopped
            Thread.sleep(2000);
            // fetch all Jobs
            sparkStatusTracker = sc.statusTracker();
            jobIds = sparkStatusTracker.getJobIdsForGroup(null);
            for (int jobId : jobIds) {
                // getJobInfo returns a Scala Option; check isDefined instead of
                // calling getOrElse(null) from Java, which would NPE on None
                scala.Option<SparkJobInfo> jobInfoOpt = sparkStatusTracker.getJobInfo(jobId);
                jobInfo = jobInfoOpt.isDefined() ? jobInfoOpt.get() : null;
                if (jobInfo == null) {
                    System.out.println("JobId:" + jobId + ", no job info available!");
                } else {
                    System.out.println("JobId:" + jobId + ", job status:" + jobInfo.status().name());
                }
            }
        }
        // 4. Check whether the thread task returned true
        boolean flag = true;
        while (flag) {
            try {
                Thread.sleep(200);
                System.out.println("Job closing ...");
                if (future.isDone()) {
                    flag = false;
                    if (future.get().booleanValue()) {
                        System.out.println("Job " + appId + " done with success state");
                    } else {
                        System.out.println("Job " + appId + " failed!");
                    }
                }
            } catch (InterruptedException | ExecutionException e) {
                e.printStackTrace();
            }
        }
    }
}
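The demo04.WordCount job thread is also not listed in the post. Here is a minimal sketch under the following assumptions: it is a Callable that reuses the SparkContext created by the Driver and stops it when the job finishes, which is what ends the monitoring loop above:

package demo04

import java.util.concurrent.Callable

import org.apache.spark.SparkContext

// Hypothetical sketch of demo04.WordCount: runs the word count on the
// shared SparkContext and stops it afterwards, so that sc.isStopped
// becomes true and the Driver's monitoring loop exits.
class WordCount(sc: SparkContext, args: Array[String]) extends Callable[java.lang.Boolean] {
  override def call(): java.lang.Boolean = {
    try {
      sc.textFile(args(0)).flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).collect().foreach(println)
      true
    } catch {
      case e: Exception =>
        e.printStackTrace()
        false
    } finally {
      sc.stop() // ends the while (!sc.isStopped()) loop in Driver
    }
  }
}

Declaring Callable[java.lang.Boolean] (rather than Scala's Boolean) keeps the generic signature compatible with the Java-side FutureTask<Boolean>.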
The Utils helper mainly creates the SparkContext; each call returns a new SparkContext, as follows:
package demo04

import org.apache.spark.{SparkContext, SparkConf}

/**
 * Created by fansy on 2017/7/6.
 */
object Utils {
  /**
   * Get a new SparkContext
   * @param master
   * @param appName
   * @param jars
   * @return
   */
  def getSc(master: String, appName: String, jars: Array[String], logEnabled: String, logDir: String): SparkContext = {
    val conf = new SparkConf().setMaster(master).setAppName(appName).setJars(jars)
      .set("spark.eventLog.enabled", logEnabled)
      .set("spark.eventLog.dir", logDir)
    new SparkContext(conf)
  }
}
Run it again, and the detailed status of every Job inside the application can now be monitored.
Thoughts
Stay down-to-earth and focused.
Please credit this blog when reposting: http://blog.csdn.net/fansy1990