1. Spark-Submit Parameters

1.1 Additional Operators

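transformations:

(1) mapPartitionsWithIndex: similar to mapPartitions, but the function also receives the index of the partition it is processing.

(2) repartition: increases or decreases the number of partitions. It always triggers a shuffle. (To merge several partitions into fewer without a shuffle, use coalesce.)

(3) coalesce: commonly used to reduce the number of partitions. The second parameter controls whether the reduction triggers a shuffle.

true means shuffle; false means no shuffle. The default is false.

If coalesce is asked for more partitions than the original RDD has, setting the second parameter to false has no effect and the partition count stays unchanged; setting it to true behaves like repartition. That is, repartition(numPartitions) = coalesce(numPartitions, true).

(4) groupByKey: operates on an RDD of (K, V) pairs and groups the values by key. Given (K, V), it returns (K, Iterable<V>).

(5) zip: combines the elements of two RDDs (KV or non-KV format) into one KV-format RDD. The two RDDs must have the same number of partitions and the same number of elements in each partition.

(6) zipWithIndex: pairs each element of the RDD with its index in the RDD (starting from 0), producing (K, V) pairs.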
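The partition-level behavior of these transformations can be modeled without a cluster. The sketch below is plain Java, not the Spark API: an RDD is represented as a list of partitions, and the class name PartitionModel and its method names are hypothetical. The coalesce model simply merges neighboring partitions; real Spark also takes data locality into account.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java model of partitioned data; hypothetical names, not the Spark API.
class PartitionModel {

    // Models mapPartitionsWithIndex: every element is tagged with its partition index.
    static List<List<String>> tagWithPartitionIndex(List<List<String>> partitions) {
        List<List<String>> out = new ArrayList<>();
        for (int index = 0; index < partitions.size(); index++) {
            List<String> tagged = new ArrayList<>();
            for (String element : partitions.get(index)) {
                tagged.add("partition " + index + ": " + element);
            }
            out.add(tagged);
        }
        return out;
    }

    // Models coalesce(n, false): whole partitions are merged, none is split,
    // so asking for more partitions than exist changes nothing.
    static <T> List<List<T>> coalesceNoShuffle(List<List<T>> partitions, int numPartitions) {
        if (numPartitions < 1) {
            throw new IllegalArgumentException("numPartitions must be positive");
        }
        if (numPartitions >= partitions.size()) {
            return partitions;
        }
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            out.add(new ArrayList<T>());
        }
        for (int i = 0; i < partitions.size(); i++) {
            // Simplification: neighboring parent partitions share a target partition.
            out.get(i * numPartitions / partitions.size()).addAll(partitions.get(i));
        }
        return out;
    }

    // Models zip: partitions are paired up one by one, so both the partition count
    // and the element count of each partition must match.
    static <A, B> List<List<Map.Entry<A, B>>> zip(List<List<A>> left, List<List<B>> right) {
        if (left.size() != right.size()) {
            throw new IllegalArgumentException("different number of partitions");
        }
        List<List<Map.Entry<A, B>>> out = new ArrayList<>();
        for (int p = 0; p < left.size(); p++) {
            if (left.get(p).size() != right.get(p).size()) {
                throw new IllegalArgumentException("partition " + p + " sizes differ");
            }
            List<Map.Entry<A, B>> pairs = new ArrayList<>();
            for (int i = 0; i < left.get(p).size(); i++) {
                pairs.add(new AbstractMap.SimpleEntry<A, B>(left.get(p).get(i), right.get(p).get(i)));
            }
            out.add(pairs);
        }
        return out;
    }

    // Models zipWithIndex: indexes start at 0 and run across partitions in order.
    static <T> List<Map.Entry<T, Long>> zipWithIndex(List<List<T>> partitions) {
        List<Map.Entry<T, Long>> out = new ArrayList<>();
        long index = 0;
        for (List<T> partition : partitions) {
            for (T element : partition) {
                out.add(new AbstractMap.SimpleEntry<T, Long>(element, index++));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> parts = Arrays.asList(
                Arrays.asList("a", "b"), Arrays.asList("c"), Arrays.asList("d", "e"));
        System.out.println(tagWithPartitionIndex(parts));
        System.out.println(coalesceNoShuffle(parts, 2));
        System.out.println(zipWithIndex(parts));
    }
}
```

Because the no-shuffle path only concatenates existing partitions, it can never produce more partitions than it started with, which is why growing the partition count requires the shuffle variant.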

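Action:

(1) countByKey: operates on an RDD of (K, V) pairs and counts the number of elements for each key.

(2) countByValue: counts how many times each distinct element occurs in the dataset and returns each element together with its count.

(3) reduce: aggregates all elements of the dataset using the supplied aggregation function.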
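The three actions differ mainly in what they count or combine. The following plain-Java sketch models their results on an in-memory list; it is not the Spark API, and ActionModel and its method names are hypothetical.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BinaryOperator;

// Plain-Java model of three actions; hypothetical names, not the Spark API.
class ActionModel {

    // countByKey: number of elements per key; the values are ignored.
    static <K, V> Map<K, Long> countByKey(List<Map.Entry<K, V>> pairs) {
        Map<K, Long> counts = new HashMap<>();
        for (Map.Entry<K, V> pair : pairs) {
            counts.merge(pair.getKey(), 1L, Long::sum);
        }
        return counts;
    }

    // countByValue: number of occurrences of each whole element.
    static <T> Map<T, Long> countByValue(List<T> elements) {
        Map<T, Long> counts = new HashMap<>();
        for (T element : elements) {
            counts.merge(element, 1L, Long::sum);
        }
        return counts;
    }

    // reduce: folds all elements into one value with a binary function.
    static <T> T reduce(List<T> elements, BinaryOperator<T> op) {
        if (elements.isEmpty()) {
            throw new IllegalArgumentException("cannot reduce an empty collection");
        }
        T result = elements.get(0);
        for (int i = 1; i < elements.size(); i++) {
            result = op.apply(result, elements.get(i));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.add(new AbstractMap.SimpleEntry<>("a", 1));
        pairs.add(new AbstractMap.SimpleEntry<>("b", 2));
        pairs.add(new AbstractMap.SimpleEntry<>("a", 3));
        System.out.println(countByKey(pairs));
        System.out.println(countByValue(Arrays.asList("x", "y", "x")));
        System.out.println(reduce(Arrays.asList(1, 2, 3, 4), Integer::sum));
    }
}
```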

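1.2 Background: PV and UV

(1) PV

What is PV: PV (page views) is a web-analytics term that measures the number of pages visitors request from a site. For advertisers, PV indicates how much advertising revenue a site can be expected to bring in. PV generally grows with the number of visitors, but it does not determine the true number of visitors to a page: a single visitor who keeps refreshing a page can generate a very high PV.

What is the PV value:

The number of pages of a site, or the number of views of a particular page, requested by all visitors within 24 hours (0:00 to 24:00). PV counts page loads: every page refresh counts as one PV.

How it is measured:

The browser sends a request to the web server; when the server receives the request, it sends the corresponding page back to the browser, and that produces one PV. As long as the page has been sent to the browser, it counts as 1 PV, whether or not the page finishes opening (downloading).

(2) UV

UV (unique visitors) is the number of distinct IP addresses that visit a site or click a page. Within one day, UV records only the first visit from each distinct IP; further visits to the site on the same day are not counted. UV measures how many different visitors arrive in a given period, but it does not reflect the overall activity on the site. Because several users can share one IP address, many analytics tools identify visitors by cookie instead.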
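The difference between the two metrics is easiest to see on a small access log. The sketch below is plain Java; the record layout (visitor IP, URL), the class name PvUv, and the method names are assumptions made for illustration. PV counts every matching record, while UV counts distinct visitors.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical one-day access log; each record is {visitorIp, url}.
class PvUv {

    // PV: every request for the page counts, including refreshes.
    static long pageViews(List<String[]> log, String url) {
        long count = 0;
        for (String[] record : log) {
            if (record[1].equals(url)) {
                count++;
            }
        }
        return count;
    }

    // UV: each distinct visitor IP counts once, however often it returns.
    static long uniqueVisitors(List<String[]> log, String url) {
        Set<String> visitors = new HashSet<>();
        for (String[] record : log) {
            if (record[1].equals(url)) {
                visitors.add(record[0]);
            }
        }
        return visitors.size();
    }

    public static void main(String[] args) {
        List<String[]> log = Arrays.asList(
                new String[] {"10.0.0.1", "/index"},
                new String[] {"10.0.0.1", "/index"},
                new String[] {"10.0.0.2", "/index"},
                new String[] {"10.0.0.1", "/about"});
        System.out.println("PV=" + pageViews(log, "/index"));
        System.out.println("UV=" + uniqueVisitors(log, "/index"));
    }
}
```

In Spark terms, PV corresponds to counting records per URL, and UV to de-duplicating (URL, visitor) pairs before counting.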

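1.3 Spark-Submit Parameters

(1) Options:

--master: MASTER_URL; one of spark://host:port, mesos://host:port, yarn, yarn-cluster, yarn-client, or local. (yarn-cluster and yarn-client are deprecated since Spark 2.0; use --master yarn together with --deploy-mode.)
--deploy-mode: DEPLOY_MODE; where the Driver program runs, either client or cluster. The default is client.
--class: CLASS_NAME; the fully qualified name of the main class, including the package.
--jars: comma-separated list of local jars; third-party jars that the Driver and the executors depend on.
--files: comma-separated list of files to be placed in the working directory of each executor.
--conf: a Spark configuration property.
--driver-memory: memory for the Driver program (e.g. 1000M, 5G). The default is 1024M.
--executor-memory: memory per executor (e.g. 1000M, 2G). The default is 1G.

(2) Spark standalone with cluster deploy mode only

--driver-cores: number of cores used by the Driver program (default 1); Spark standalone mode only.

(3) Spark standalone or Mesos with cluster deploy mode only

--supervise: restart the Driver if it fails; Spark standalone or Mesos mode only.

(4) Spark standalone and Mesos only

--total-executor-cores: total number of cores across all executors; Spark standalone and Spark on Mesos only.

(5) Spark standalone and YARN only

--executor-cores: number of cores per executor. The default is 1 on YARN, and all available cores on the worker in standalone mode.

(6) YARN-only

--driver-cores: number of cores used by the driver, in cluster mode only. The default is 1.
--queue: QUEUE_NAME; the name of the resource queue to submit to. The default is default.
--num-executors: total number of executors to launch. The default is 2.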
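To show how these options combine on one command line, the following Java sketch assembles a spark-submit invocation for YARN as a list of arguments. The main class com.example.WordCount, the jar name wordcount.jar, the input argument, and all resource sizes are hypothetical placeholders; only the option names come from the list above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Builds a spark-submit argument list; SubmitCommand is a hypothetical helper.
class SubmitCommand {

    static List<String> build(String master, String deployMode, String mainClass,
                              String appJar, String... appArgs) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "spark-submit",
                "--master", master,
                "--deploy-mode", deployMode,
                "--class", mainClass,
                "--driver-memory", "1g",     // placeholder sizes, tune per job
                "--executor-memory", "2g",
                "--executor-cores", "2",
                "--num-executors", "4",      // YARN-only option
                "--queue", "default"));      // YARN-only option
        // Options must come before the application jar; anything after the jar
        // is passed to the application's main method.
        cmd.add(appJar);
        cmd.addAll(Arrays.asList(appArgs));
        return cmd;
    }

    public static void main(String[] args) {
        List<String> cmd = build("yarn", "cluster", "com.example.WordCount",
                "wordcount.jar", "input.txt");
        System.out.println(String.join(" ", cmd));
    }
}
```

The resulting list could be handed to java.lang.ProcessBuilder on a machine where spark-submit is on the PATH, or simply printed and pasted into a shell.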

2. Coding

2.1 Secondary Sort

// Imports needed by this example.
import java.io.Serializable;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

SparkConf sparkConf = new SparkConf()
        .setMaster("local")
        .setAppName("SecondarySortTest");
final JavaSparkContext sc = new JavaSparkContext(sparkConf);

// Each line of secondSort.txt holds two integers separated by a space.
JavaRDD<String> secondRDD = sc.textFile("secondSort.txt");

// Wrap both numbers in a composite key and keep the original line as the value.
JavaPairRDD<SecondSortKey, String> pairSecondRDD = secondRDD.mapToPair(
        new PairFunction<String, SecondSortKey, String>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<SecondSortKey, String> call(String line) throws Exception {
        String[] splited = line.split(" ");
        int first = Integer.parseInt(splited[0]);
        int second = Integer.parseInt(splited[1]);
        SecondSortKey secondSortKey = new SecondSortKey(first, second);
        return new Tuple2<SecondSortKey, String>(secondSortKey, line);
    }
});

// Sort by the composite key in descending order and print the original lines.
pairSecondRDD.sortByKey(false).foreach(
        new VoidFunction<Tuple2<SecondSortKey, String>>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(Tuple2<SecondSortKey, String> tuple) throws Exception {
        System.out.println(tuple._2);
    }
});

sc.stop();

// Composite key: ordered by first, then by second when first is equal.
public class SecondSortKey implements Serializable, Comparable<SecondSortKey> {
    private static final long serialVersionUID = 1L;
    private int first;
    private int second;

    public SecondSortKey(int first, int second) {
        this.first = first;
        this.second = second;
    }

    public int getFirst() {
        return first;
    }

    public void setFirst(int first) {
        this.first = first;
    }

    public int getSecond() {
        return second;
    }

    public void setSecond(int second) {
        this.second = second;
    }

    @Override
    public int compareTo(SecondSortKey other) {
        // Integer.compare avoids the overflow that subtraction can cause.
        if (first != other.first) {
            return Integer.compare(first, other.first);
        }
        return Integer.compare(second, other.second);
    }
}
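The ordering rule itself (compare the first field, fall back to the second on ties) can be checked without Spark. The sketch below is plain Java with a hypothetical class name, SecondarySortDemo; it sorts space-separated pairs in descending order, matching sortByKey(false). It builds the comparator from comparingInt, because a comparison written as a subtraction overflows when the operands are far apart (for example Integer.MIN_VALUE and 1).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Plain-Java check of the secondary-sort ordering; hypothetical helper class.
class SecondarySortDemo {

    // Ascending by first field, then ascending by second field on ties.
    static final Comparator<int[]> BY_FIRST_THEN_SECOND =
            Comparator.<int[]>comparingInt(pair -> pair[0])
                      .thenComparingInt(pair -> pair[1]);

    // Parses "first second" lines and returns them sorted in descending order.
    static List<String> sortDescending(List<String> lines) {
        List<int[]> pairs = new ArrayList<>();
        for (String line : lines) {
            String[] fields = line.split(" ");
            pairs.add(new int[] {Integer.parseInt(fields[0]), Integer.parseInt(fields[1])});
        }
        pairs.sort(BY_FIRST_THEN_SECOND.reversed());
        List<String> sorted = new ArrayList<>();
        for (int[] pair : pairs) {
            sorted.add(pair[0] + " " + pair[1]);
        }
        return sorted;
    }

    public static void main(String[] args) {
        for (String line : sortDescending(Arrays.asList("3 1", "5 2", "3 8", "5 9"))) {
            System.out.println(line);
        }
    }
}
```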

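2.2 Top N and Grouped Top N

// Imports needed by this example.
import java.util.Iterator;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.function.VoidFunction;
import scala.Tuple2;

SparkConf conf = new SparkConf()
        .setMaster("local")
        .setAppName("TopOps");
JavaSparkContext sc = new JavaSparkContext(conf);

// Each line of scores.txt holds a class name and a score separated by a tab.
JavaRDD<String> linesRDD = sc.textFile("scores.txt");

JavaPairRDD<String, Integer> pairRDD = linesRDD.mapToPair(
        new PairFunction<String, String, Integer>() {
    private static final long serialVersionUID = 1L;

    @Override
    public Tuple2<String, Integer> call(String str) throws Exception {
        String[] splited = str.split("\t");
        String clazzName = splited[0];
        Integer score = Integer.valueOf(splited[1]);
        return new Tuple2<String, Integer>(clazzName, score);
    }
});

// Group the scores by class, then keep the three highest scores of each class.
pairRDD.groupByKey().foreach(
        new VoidFunction<Tuple2<String, Iterable<Integer>>>() {
    private static final long serialVersionUID = 1L;

    @Override
    public void call(Tuple2<String, Iterable<Integer>> tuple) throws Exception {
        String clazzName = tuple._1;
        Iterator<Integer> iterator = tuple._2.iterator();

        // top3 is kept in descending order; null marks an empty slot.
        Integer[] top3 = new Integer[3];

        while (iterator.hasNext()) {
            Integer score = iterator.next();

            for (int i = 0; i < top3.length; i++) {
                if (top3[i] == null) {
                    top3[i] = score;
                    break;
                } else if (score > top3[i]) {
                    // Shift the lower scores down one slot, then insert.
                    for (int j = top3.length - 1; j > i; j--) {
                        top3[j] = top3[j - 1];
                    }
                    top3[i] = score;
                    break;
                }
            }
        }
        System.out.println("class Name:" + clazzName);
        for (Integer sscore : top3) {
            System.out.println(sscore);
        }
    }
});

sc.stop();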
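The same idea generalizes from 3 to any N. As an alternative to the fixed-size array, the sketch below keeps the N largest values in a min-heap; this is a swapped-in technique, not the approach used in the Spark example, and the class TopN and its methods are hypothetical. It is plain Java and covers both cases named in the heading: top N of a single collection and top N per group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Plain-Java top-N helpers based on a min-heap; hypothetical names.
class TopN {

    // Top N of one collection, returned in descending order.
    static List<Integer> topN(Iterable<Integer> scores, int n) {
        PriorityQueue<Integer> heap = new PriorityQueue<>(); // smallest value on top
        for (Integer score : scores) {
            heap.offer(score);
            if (heap.size() > n) {
                heap.poll(); // drop the smallest so at most n values remain
            }
        }
        List<Integer> result = new ArrayList<>(heap);
        result.sort(Collections.reverseOrder());
        return result;
    }

    // Grouped top N: lines are "className<TAB>score", as in scores.txt.
    static Map<String, List<Integer>> topNPerGroup(List<String> lines, int n) {
        Map<String, List<Integer>> grouped = new HashMap<>();
        for (String line : lines) {
            String[] fields = line.split("\t");
            grouped.computeIfAbsent(fields[0], key -> new ArrayList<>())
                   .add(Integer.parseInt(fields[1]));
        }
        Map<String, List<Integer>> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), topN(entry.getValue(), n));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "class1\t90", "class2\t56", "class1\t87", "class1\t76",
                "class2\t88", "class1\t95", "class2\t67");
        System.out.println(topNPerGroup(lines, 3));
        System.out.println(topN(Arrays.asList(5, 1, 9, 3), 2));
    }
}
```

The heap never holds more than N + 1 values, so memory per group stays bounded even when a group contains many scores.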

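3. Using SparkShell

(1) Concept

SparkShell is a rapid-prototyping tool that ships with Spark. It is Spark's Scala REPL (Read-Eval-Print Loop), an interactive shell for programming Spark in Scala.

(2) Usage

Start the Standalone cluster (from Spark's sbin directory):
    ./start-all.sh
Start spark-shell on the client (from Spark's bin directory):
    ./spark-shell --master spark://node1:7077
Start HDFS, create the directory /spark/test, and upload the file wc.txt.
Start the HDFS cluster:
    start-all.sh
Create the directory:
    hdfs dfs -mkdir -p /spark/test
Upload wc.txt:
    hdfs dfs -put /root/test/wc.txt /spark/test/

Run the word count. collect() brings the results back to the driver so that they print in the shell; with a bare foreach(println) on a cluster, the output goes to the executors' stdout instead.
sc.textFile("hdfs://node1:9000/spark/test/wc.txt").
  flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _).
  collect().foreach(println)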
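For readers following the Java examples in this article, the same pipeline (split lines into words, then count per word) can be modeled with Java streams. This is a local, plain-Java sketch with a hypothetical class name, WordCountModel; it does not read from HDFS or use Spark.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Local model of the word count pipeline; not the Spark API.
class WordCountModel {

    static Map<String, Long> count(List<String> lines) {
        return lines.stream()
                // flatMap step: one line becomes many words
                .flatMap(line -> Arrays.stream(line.split(" ")))
                // map + reduceByKey steps: group equal words and count them
                .collect(Collectors.groupingBy(Function.identity(), Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count(Arrays.asList("hello spark", "hello hdfs")));
    }
}
```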
