When developing a Spark application, there are two ways to submit a job to the cluster: the spark-submit script, which is the method documented on the official Spark site, and Spark's hidden REST API.
I. Spark Submit
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
  --class "SimpleApp" \
  --master local[4] \
  target/scala-2.12/simple-project_2.12-1.0.jar
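When the submission needs to happen from another program rather than a terminal, the same invocation can be driven from Python via subprocess. This is only a sketch: the helper name and the /opt/spark path stand in for YOUR_SPARK_HOME and are not part of Spark.

```python
import subprocess

def spark_submit_cmd(spark_home, main_class, master, app_jar, *app_args):
    """Build the spark-submit argv list matching the shell example above."""
    return [
        f"{spark_home}/bin/spark-submit",
        "--class", main_class,
        "--master", master,
        app_jar,
        *app_args,
    ]

cmd = spark_submit_cmd(
    "/opt/spark",  # stand-in for YOUR_SPARK_HOME
    "SimpleApp",
    "local[4]",
    "target/scala-2.12/simple-project_2.12-1.0.jar",
)
# On a machine with Spark installed, run it with:
# subprocess.run(cmd, check=True)
```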
II. REST API from outside the Spark cluster
1. Submitting a job to the Spark cluster
curl -X POST http://spark-cluster-ip:6066/v1/submissions/create \
  --header "Content-Type:application/json;charset=UTF-8" \
  --data '{
  "action" : "CreateSubmissionRequest",
  "appArgs" : [ "myAppArgument1" ],
  "appResource" : "file:/myfilepath/spark-job-1.0.jar",
  "clientSparkVersion" : "2.4.4",
  "environmentVariables" : {
    "SPARK_ENV_LOADED" : "1"
  },
  "mainClass" : "com.mycompany.MyJob",
  "sparkProperties" : {
    "spark.jars" : "file:/myfilepath/spark-job-1.0.jar",
    "spark.driver.supervise" : "false",
    "spark.app.name" : "MyJob",
    "spark.eventLog.enabled" : "true",
    "spark.submit.deployMode" : "cluster",
    "spark.master" : "spark://spark-cluster-ip:7077"
  }
}'
Parameters:
spark-cluster-ip: the Spark master's address. The REST service listens on port 6066 by default; if that port is occupied, Spark tries 6067, 6068, and so on.
"action" : "CreateSubmissionRequest": marks the request as a job submission; this value is fixed.
"appArgs" : [ "args1", "args2", … ]: the arguments the application jar needs, such as a Kafka topic or the model to use. (Note: if the program takes no arguments, write "appArgs" : [] — the field cannot be omitted, or the entry after appResource is parsed as appArgs, causing obscure errors.)
"appResource" : "file:/spark.jar": the path to the application jar.
"clientSparkVersion" : "2.4.4": the Spark version.
"environmentVariables" : {"SPARK_ENV_LOADED" : "1"}: whether to load the Spark environment variables. This field is required; omitting it causes a NullPointerException.
"mainClass" : "mainClass": the main class containing the program's main method.
"sparkProperties" : {…}: Spark configuration properties.
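The rules above (appArgs must be present even when empty, and environmentVariables must carry SPARK_ENV_LOADED) are easy to get wrong when assembling the JSON by hand. A small helper can encode them; the function name and defaults here are illustrative, not part of Spark.

```python
def build_submission(app_resource, main_class, master,
                     app_args=None, spark_version="2.4.4",
                     extra_props=None):
    """Assemble a CreateSubmissionRequest body for the hidden REST API."""
    props = {
        "spark.jars": app_resource,
        "spark.driver.supervise": "false",
        "spark.app.name": main_class.rsplit(".", 1)[-1],
        "spark.submit.deployMode": "cluster",
        "spark.master": master,
    }
    props.update(extra_props or {})
    return {
        "action": "CreateSubmissionRequest",
        # appArgs must be present even when empty, or the entry after
        # appResource is mis-parsed as the arguments
        "appArgs": list(app_args or []),
        "appResource": app_resource,
        "clientSparkVersion": spark_version,
        # omitting this triggers a NullPointerException on the server
        "environmentVariables": {"SPARK_ENV_LOADED": "1"},
        "mainClass": main_class,
        "sparkProperties": props,
    }
```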
Response:
{
  "action" : "CreateSubmissionResponse",
  "message" : "Driver successfully submitted as driver-20200115102452-0000",
  "serverSparkVersion" : "2.4.4",
  "submissionId" : "driver-20200115102452-0000",
  "success" : true
}
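The curl call and its response can be scripted with the Python standard library alone. The payload argument mirrors the JSON shown above; the host is a placeholder to fill in with your master's address.

```python
import json
import urllib.request

def submit_job(master_rest_url, payload):
    """POST a CreateSubmissionRequest and return the parsed response."""
    req = urllib.request.Request(
        f"{master_rest_url}/v1/submissions/create",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json;charset=UTF-8"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def submission_id(response):
    """Extract the submission ID from a CreateSubmissionResponse."""
    if not response.get("success"):
        raise RuntimeError(response.get("message", "submission failed"))
    return response["submissionId"]
```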
2. Checking the status of a submitted job
curl http://spark-cluster-ip:6066/v1/submissions/status/driver-20200115102452-0000
Here driver-20200115102452-0000 is the submission ID returned when the job was submitted; it can also be found on the Spark monitoring page, as the submission ID of a running or completed driver.
Response:
{
  "action" : "SubmissionStatusResponse",
  "driverState" : "FINISHED",
  "serverSparkVersion" : "2.4.4",
  "submissionId" : "driver-20200115102452-0000",
  "success" : true,
  "workerHostPort" : "128.96.104.10:37588",
  "workerId" : "worker-20201016084158-128.96.104.10-37588"
}
driverState reports the job's state and takes one of the following values:
ERROR (the submission failed; the error message is shown),
SUBMITTED (submitted but not yet running),
RUNNING (currently running),
FAILED (execution failed; an exception is thrown),
FINISHED (completed successfully)
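A client typically polls the status endpoint until one of the terminal states above (ERROR, FAILED, FINISHED) is reached. The sketch below assumes a fixed poll interval; the helper names are illustrative.

```python
import json
import time
import urllib.request

TERMINAL_STATES = {"ERROR", "FAILED", "FINISHED"}

def get_status(master_rest_url, submission_id):
    """GET the SubmissionStatusResponse for a submission ID."""
    url = f"{master_rest_url}/v1/submissions/status/{submission_id}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def wait_for_completion(master_rest_url, submission_id,
                        interval=5.0, fetch=get_status):
    """Poll until driverState becomes ERROR, FAILED, or FINISHED."""
    while True:
        status = fetch(master_rest_url, submission_id)
        if status.get("driverState") in TERMINAL_STATES:
            return status
        time.sleep(interval)
```

The `fetch` parameter exists only so the loop can be exercised without a live cluster; in normal use the default `get_status` is enough.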
3. Killing a submitted job
curl -X POST http://spark-cluster-ip:6066/v1/submissions/kill/driver-20200115102452-0000
Response:
{
  "action" : "KillSubmissionResponse",
  "message" : "Kill request for driver-20200115102452-0000 submitted",
  "serverSparkVersion" : "2.4.4",
  "submissionId" : "driver-20200115102452-0000",
  "success" : true
}
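The kill endpoint takes the submission ID in the URL path and a POST with no body, matching the curl command above. A stdlib-only sketch (host placeholder as before):

```python
import json
import urllib.request

def kill_url(master_rest_url, submission_id):
    """The URL the kill request is POSTed to."""
    return f"{master_rest_url}/v1/submissions/kill/{submission_id}"

def kill_job(master_rest_url, submission_id):
    """POST to the kill endpoint; returns the parsed KillSubmissionResponse."""
    req = urllib.request.Request(
        kill_url(master_rest_url, submission_id),
        data=b"",  # the endpoint expects a POST with an empty body
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```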
4. Viewing the Spark cluster's worker information
curl http://spark-cluster-ip:8080/json/
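This hits the master web UI's JSON endpoint (port 8080 by default), which returns the cluster state including the worker list. The field names used below ("workers", "state") reflect this endpoint's usual output, but should be verified against your Spark version.

```python
import json
import urllib.request

def cluster_state(master_web_url):
    """Fetch the master web UI's JSON status page (default port 8080)."""
    with urllib.request.urlopen(f"{master_web_url}/json/") as resp:
        return json.load(resp)

def alive_workers(state):
    # "workers"/"state" are the field names this endpoint commonly
    # returns; adjust if your Spark version reports them differently
    return [w for w in state.get("workers", []) if w.get("state") == "ALIVE"]
```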