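1. Submitting Applications

The spark-submit script in Spark's bin directory is used to launch applications on a cluster. It can use all of the cluster managers Spark supports through a uniform interface, so you do not have to configure your application specially for each one.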

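2. Bundling Your Application's Dependencies

If your code depends on other projects, you need to package them alongside your application so that the code can be distributed to the cluster. To do this, create an assembly jar (or "uber" jar) containing your code and its dependencies. Both sbt and Maven have assembly plugins. When building the jar, Spark and Hadoop dependencies need not be bundled, because the cluster manager provides them at runtime. Once the jar is built, you can submit the application with the bin/spark-submit script. For Python, use the --py-files argument of spark-submit to add .py, .zip or .egg files to be distributed with your application. If you depend on multiple Python files, packaging them into a single .zip or .egg is recommended.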
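As a rough illustration of the last point, the sketch below zips every .py file under a directory into one archive suitable for --py-files. The function name bundle_python_deps and the paths in the usage note are hypothetical, and only the Python standard library is used.

```python
import zipfile
from pathlib import Path


def bundle_python_deps(source_dir, archive_path):
    """Zip every .py file under source_dir (hypothetical helper).

    Relative paths are preserved inside the archive so that package
    imports still resolve once the zip is on the Python search path.
    """
    source_dir = Path(source_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for py_file in sorted(source_dir.rglob("*.py")):
            # Store each file under its path relative to source_dir.
            archive.write(py_file, py_file.relative_to(source_dir).as_posix())
    return archive_path
```

Assuming a directory named deps_src and a driver script main.py, one would call bundle_python_deps("deps_src", "deps.zip") and then submit with spark-submit --py-files deps.zip main.py.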

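3. Launching Applications with spark-submit

Once a user application is bundled, it can be launched using the bin/spark-submit script. This script takes care of setting up the classpath with Spark and its dependencies, and supports the different cluster managers and deploy modes that Spark supports: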
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]

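Some of the commonly used options are:
--class: The entry point for your application (e.g. org.apache.spark.examples.SparkPi)
--master: The master URL for the cluster (e.g. spark://23.195.26.187:7077)
--deploy-mode: Whether to deploy your driver on the worker nodes (cluster) or locally as an external client (client) (default: client)
--conf: Arbitrary Spark configuration property in key=value format. For values that contain spaces, wrap "key=value" in quotes.
application-jar: Path to a bundled jar including your application and all dependencies. The URL must be globally visible inside your cluster, for instance an hdfs:// path or a file:// path that is present on all nodes.
application-arguments: Arguments passed to the main method of your main class, if any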
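To make the ordering of these options concrete, here is a minimal sketch that assembles a spark-submit argument list in Python. The helper build_spark_submit is hypothetical (it is not part of Spark); it only emits the flags listed above and leaves out any option that is not set.

```python
import shlex


def build_spark_submit(app, app_args=(), main_class=None, master=None,
                       deploy_mode=None, conf=None,
                       spark_submit="./bin/spark-submit"):
    """Return an argv list for spark-submit (hypothetical helper)."""
    argv = [spark_submit]
    if main_class:
        argv += ["--class", main_class]
    if master:
        argv += ["--master", master]
    if deploy_mode:
        argv += ["--deploy-mode", deploy_mode]
    for key, value in (conf or {}).items():
        # Each property becomes its own --conf key=value pair.
        argv += ["--conf", f"{key}={value}"]
    # The application jar (or .py file) comes after all options,
    # followed by the arguments for the application itself.
    argv.append(app)
    argv += [str(arg) for arg in app_args]
    return argv


cmd = build_spark_submit(
    "/path/to/examples.jar", [100],
    main_class="org.apache.spark.examples.SparkPi",
    master="local[8]",
    conf={"spark.app.name": "My App"},
)
# shlex.join quotes values that contain spaces or shell metacharacters.
print(shlex.join(cmd))
```

Passing the list to subprocess.run(cmd) avoids shell quoting issues altogether, because a --conf value containing spaces then travels as a single argument.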
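A common deployment strategy is to submit your application from a gateway machine that is physically co-located with your worker machines (e.g. the master node in a standalone EC2 cluster). In this setup, client mode is appropriate. In client mode, the driver is launched directly within the spark-submit process, which acts as a client to the cluster. The input and output of the application are attached to the console, so this mode is especially suitable for applications that involve a REPL (e.g. the Spark shell).
Alternatively, if your application is submitted from a machine far from the worker machines (e.g. locally on your laptop), it is common to use cluster mode to minimize network latency between the driver and the executors. At the time of writing, only YARN supports cluster mode for Python applications.
For Python applications, simply pass a .py file in the place of <application-jar> instead of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.
A few options are specific to the cluster manager in use. For example, on a Spark standalone cluster in cluster deploy mode, you can also specify --supervise to make sure the driver is automatically restarted if it fails with a non-zero exit code. To list all options available to spark-submit, run it with --help. Here are a few examples of common options: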
# Run application locally on 8 cores
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master local[8] \
  /path/to/examples.jar \
  100

# Run on a Spark standalone cluster in client deploy mode
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a Spark standalone cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master spark://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  /path/to/examples.jar \
  1000

# Run on a YARN cluster (use --deploy-mode client for client mode)
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn \
  --deploy-mode cluster \
  --executor-memory 20G \
  --num-executors 50 \
  /path/to/examples.jar \
  1000

# Run a Python application on a Spark standalone cluster
./bin/spark-submit \
  --master spark://207.184.161.138:7077 \
  examples/src/main/python/pi.py \
  1000

# Run on a Mesos cluster in cluster deploy mode with supervise
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master mesos://207.184.161.138:7077 \
  --deploy-mode cluster \
  --supervise \
  --executor-memory 20G \
  --total-executor-cores 100 \
  http://path/to/examples.jar \
  1000

4. Master URLs

The master URL passed to Spark can be in one of the following formats:

Master URL	Meaning
local	Run Spark locally with one worker thread (i.e. no parallelism at all).
local[K]	Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
local[*]	Run Spark locally with as many worker threads as logical cores on your machine.
spark://HOST:PORT	Connect to the given Spark standalone cluster master. The port must be whichever one your master is configured to use, which is 7077 by default.
mesos://HOST:PORT	Connect to the given Mesos cluster. The port must be whichever one your cluster is configured to use, which is 5050 by default. Or, for a Mesos cluster using ZooKeeper, use mesos://zk://…. To submit with --deploy-mode cluster, the HOST:PORT should be configured to connect to the MesosClusterDispatcher.
yarn	Connect to a YARN cluster in client or cluster mode depending on the value of --deploy-mode. The cluster location will be found based on the HADOOP_CONF_DIR or YARN_CONF_DIR variable.
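The formats in the table can be told apart mechanically. The sketch below is a hypothetical classifier (describe_master is not a Spark API) that maps a master URL string to a (kind, detail) pair. Filling in the default port when none is given is a convenience of this sketch, not something Spark is claimed to do.

```python
import re


def describe_master(master):
    """Classify a master URL into (kind, detail); hypothetical helper."""
    if master == "local":
        return ("local", 1)  # one worker thread, no parallelism
    match = re.fullmatch(r"local\[(\*|\d+)\]", master)
    if match:
        threads = match.group(1)
        return ("local", "all logical cores" if threads == "*" else int(threads))
    if master == "yarn":
        # Cluster location comes from HADOOP_CONF_DIR or YARN_CONF_DIR.
        return ("yarn", None)
    for scheme, default_port in (("spark", 7077), ("mesos", 5050)):
        prefix = scheme + "://"
        if master.startswith(prefix):
            rest = master[len(prefix):]
            if scheme == "mesos" and rest.startswith("zk://"):
                return ("mesos", rest)  # ZooKeeper form, kept as-is
            host, _, port = rest.partition(":")
            # Assumption of this sketch: use the table's default port if omitted.
            return (scheme, (host, int(port) if port else default_port))
    raise ValueError(f"unrecognised master URL: {master!r}")
```

A check like this can catch a mistyped master URL before spark-submit is invoked at all.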

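5. Loading Configuration from a File

The spark-submit script can load default Spark configuration values from a properties file and pass them on to your application. By default, it reads options from conf/spark-defaults.conf in the Spark directory. For more detail, see http://spark.apache.org/docs/latest/configuration.html#loading-default-configurations.
Loading default Spark configurations this way can remove the need for certain flags to spark-submit. For instance, if the spark.master property is set, you can safely omit the --master flag from spark-submit. In general, configuration values explicitly set on a SparkConf take the highest precedence, then flags passed to spark-submit, then values in the defaults file.
If you are ever unclear where configuration options are coming from, you can print fine-grained debugging information by running spark-submit with the --verbose option.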
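The precedence order described above can be modelled with a layered lookup. This is only an illustration in plain Python, not Spark's actual resolution code: parse_defaults and effective_conf are hypothetical helpers, and the parser assumes the simple whitespace-separated "key value" layout of spark-defaults.conf.

```python
from collections import ChainMap


def parse_defaults(text):
    """Parse 'key value' lines; blank lines and '#' comments are skipped."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)  # split on the first run of whitespace
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props


def effective_conf(spark_conf, submit_flags, defaults):
    """Earlier mappings win: SparkConf, then spark-submit flags, then defaults."""
    return ChainMap(spark_conf, submit_flags, defaults)


defaults = parse_defaults(
    "# example defaults file\n"
    "spark.master            spark://5.6.7.8:7077\n"
    "spark.executor.memory   4g\n"
    "spark.eventLog.enabled  true\n"
)
conf = effective_conf(
    {"spark.executor.memory": "8g"},   # set explicitly on SparkConf
    {"spark.master": "yarn"},          # passed as --master to spark-submit
    defaults,
)
print(dict(conf))
```

Here spark.executor.memory resolves to the SparkConf value, spark.master to the spark-submit flag, and spark.eventLog.enabled falls through to the defaults file.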

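6. Advanced Dependency Management

When using spark-submit, the application jar along with any jars included with the --jars option are automatically transferred to the cluster. URLs supplied after --jars must be separated by commas. That list is included on the driver and executor classpaths. Directory expansion does not work with --jars.
Spark uses the following URL schemes to allow different strategies for distributing jars: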
- file: - Absolute paths and file:/ URIs are served by the driver’s HTTP file server, and every executor pulls the file from the driver HTTP server.
- hdfs:, http:, https:, ftp: - these pull down files and JARs from the URI as expected
- local: - a URI starting with local:/ is expected to exist as a local file on each worker node. This means that no network IO will be incurred, and works well for large files/JARs that are pushed to each worker, or shared via NFS, GlusterFS, etc.
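The scheme-to-strategy mapping in the list above can be sketched as a small lookup, together with the comma-joining that --jars expects. Both helper names are hypothetical, and the returned descriptions merely paraphrase the list.

```python
from urllib.parse import urlsplit


def distribution_strategy(uri):
    """Describe how a jar URI would be distributed (hypothetical helper)."""
    scheme = urlsplit(uri).scheme
    if scheme in ("", "file"):
        # Absolute paths have no scheme and are treated like file: URIs.
        return "served by the driver's HTTP file server"
    if scheme in ("hdfs", "http", "https", "ftp"):
        return "pulled down from the URI"
    if scheme == "local":
        return "expected to exist locally on each worker node"
    raise ValueError(f"unsupported scheme in {uri!r}")


def jars_flag(jars):
    """Build the --jars argument: one comma-separated value, no spaces."""
    return ["--jars", ",".join(jars)]
```

For example, jars_flag(["/opt/a.jar", "hdfs://nn/libs/b.jar"]) yields a single --jars value with both entries separated by a comma (the paths are placeholders).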
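Note that JARs and files are copied to the working directory for each SparkContext on the executor nodes. This can use up a significant amount of space over time and needs to be cleaned up periodically. With YARN, cleanup is handled automatically; with Spark standalone, automatic cleanup can be configured with the spark.worker.cleanup.appDataTtl property.
Users may also include other dependencies by supplying a comma-delimited list of Maven coordinates with --packages. All transitive dependencies are handled when using this option. Additional repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the --repositories flag. These options can be used with pyspark, spark-shell and spark-submit.
For Python, the equivalent --py-files option can be used to distribute .egg, .zip and .py libraries to executors.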
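Maven coordinates take the form groupId:artifactId:version. A minimal sketch of validating a list of coordinates and joining it into a --packages argument follows. The helper packages_flag and the coordinate com.example:mylib:1.0.0 are hypothetical.

```python
def packages_flag(coordinates):
    """Validate groupId:artifactId:version strings and build --packages."""
    for coord in coordinates:
        parts = coord.split(":")
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected groupId:artifactId:version, got {coord!r}")
    # spark-submit expects one comma-delimited value with no spaces.
    return ["--packages", ",".join(coordinates)]


print(packages_flag(["com.example:mylib:1.0.0"]))
```

Rejecting malformed coordinates up front gives a clearer error than a dependency resolution failure at submit time.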
