Writing a Spark job in Scala that reads from a PostgreSQL database and writes into Elasticsearch

The PostgreSQL table holds roughly 1.6 million rows (a bit over 200 MB). Method 1 took about five and a half minutes; Method 2 took about four and a half minutes.

Method 1:



import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

/**
  * @Author: sss
  * @Date: 2020/6/2 10:46
  * @Description: read the PostgreSQL table with a plain (non-partitioned) JDBC load and write it to ES
  */
object WriteES02 {
  def main(args: Array[String]): Unit = {


    val session = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .config("xpack.security.user", "elastic:elastic123")
      .config("es.net.http.auth.user", "elastic")
      .config("es.net.http.auth.pass", "elastic123")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("es.nodes.wan.only", "true")
      .config("es.nodes", "emg4032")
      .config("es.port", "9200")
      .getOrCreate()

    val jdbcDF = session.read.format("jdbc")
      .option("url", "jdbc:postgresql://xxxx:38888/bigdata")
      .option("dbtable", "public.yk_photo_gps")
      .option("user", "postgres")
      .option("password", "password")
      .load()

    jdbcDF.createTempView("test")
    val result = session.sql("select  *  from  test  where utc >  1590163200000  and utc < 1590184800000 ")

    val esConf = Map(
      "es.resource" -> "pg02/_doc" //es索引
    )

    EsSparkSQL.saveToEs(result, esConf)
    session.stop()

  }
}

 

Method 2:


import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

/**
  * @Author: sss
  * @Date: 2020/6/2 15:18
  * @Description: read the PostgreSQL table with a partitioned JDBC load (partition column + bounds + numPartitions) and write it to ES
  */
object WriteES04 {
  def main(args: Array[String]): Unit = {


    val a = System.currentTimeMillis()


    val session = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .config("xpack.security.user", "elastic:elastic123")
      .config("es.net.http.auth.user", "elastic")
      .config("es.net.http.auth.pass", "elastic123")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("es.nodes.wan.only", "true")
      .config("es.nodes", "emg4032")
      .config("es.port", "9200")
      .getOrCreate()

    val url = "jdbc:postgresql://xxxx:38888/bigdata"
    val tableName = "public.yk_photo_gps"
    val columnName = "terminalid"
    val lowerBound = 2913588
    val upperBound = 4534684
    val numPartitions = 16

    val properties = new Properties()
    properties.put("driver", "org.postgresql.Driver")
    properties.put("user", "postgres")
    properties.put("password", "password")
    //properties.put("fetchsize", "30")   // better to leave fetchsize at its default; setting it explicitly did not help here



    val jdbcDF = session.read.jdbc(url, tableName, columnName, lowerBound, upperBound, numPartitions, properties)

    jdbcDF.createTempView("test")

    val result = session.sql("select  *  from  test  where utc >  1590163200000  and utc < 1590184800000 ")

    val esConf = Map(
      "es.resource" -> "pg04/_doc" //es索引
    )

    EsSparkSQL.saveToEs(result, esConf)
    session.stop()

    println("启动时间:" + a + "|||||结束时间:" + System.currentTimeMillis())
  }
}
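
As an aside, the same partitioned read can also be expressed in the format("jdbc") / .option() style used in Method 1. The sketch below is an assumed equivalent that reuses the connection details and bounds from Method 2; it is not an additional method from the original post:

// Sketch: the Method 2 partitioned read expressed with reader options (assumed equivalent).
val partitionedDF = session.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://xxxx:38888/bigdata")
  .option("dbtable", "public.yk_photo_gps")
  .option("user", "postgres")
  .option("password", "password")
  .option("driver", "org.postgresql.Driver")
  .option("partitionColumn", "terminalid") // numeric column used to split the read
  .option("lowerBound", "2913588")         // minimum value of the partition column
  .option("upperBound", "4534684")         // maximum value of the partition column
  .option("numPartitions", "16")           // number of parallel JDBC reads
  .load()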

================================================================================================

Later, when I read about 4 GB of data, Method 2's speed advantage really showed.

Based on the total volume (12 million rows, about 2.5 GB), I set numPartitions to 3600, partitioned on the tid column, and used the minimum tid value as the lower bound and the maximum tid value as the upper bound.
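
A minimal sketch of deriving those bounds from the table itself instead of hardcoding them, assuming session, url, and properties are set up as in Method 2 and that tid is a numeric column (the column name follows the post's wording; the exact query is illustrative):

// Sketch: compute lowerBound/upperBound from the source table.
val bounds = session.read
  .jdbc(url, "(select min(tid)::bigint as lo, max(tid)::bigint as hi from public.yk_photo_gps) as b", properties)
  .first()

val lowerBound = bounds.getAs[Long]("lo")
val upperBound = bounds.getAs[Long]("hi")
val numPartitions = 3600   // sized from the total volume (12 million rows / ~2.5 GB here)

val jdbcDF = session.read.jdbc(url, "public.yk_photo_gps", "tid",
  lowerBound, upperBound, numPartitions, properties)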

The submit script is as follows:

#!/bin/bash

nohup spark-submit --master yarn --deploy-mode cluster --class es.WriteES02 --name pg_es02 --driver-memory 1G --executor-memory 4G --executor-cores 4 --num-executors 10 --jars  /sparkapp/cxb/kafka_write_es/elasticsearch-spark-20_2.11-7.7.0.jar /sparkapp/cxb/kafka_write_es/pg/es-demo-1.0-SNAPSHOT.jar

In the Spark UI, Method 2 finished in 2 minutes 13 seconds.

 

Method 1, run with the same resources, fails instead: because no partitioning is configured, the entire read lands in a single partition and only one executor actually does any work (a quick way to verify this is sketched after the error log below)!

ERROR cluster.YarnClusterScheduler: Lost executor 10 on 106.emapgo.com: Executor heartbeat timed out after 131426 ms

20/06/08 17:20:56 INFO scheduler.DAGScheduler: ResultStage 0 (runJob at EsSparkSQL.scala:101) failed in 827.396 s due to Job aborted due to stage failure:
Aborting TaskSet 0.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.2 in stage 0.0 (TID 2, emg106.emapgo.com, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 168371 ms

Blacklisting behavior can be configured via spark.blacklist.*.
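
Not from the original post, but as a suggested check: inspecting the DataFrame's partition count after the JDBC read confirms the behavior described above, since a plain read.format("jdbc") load yields a single partition. Repartitioning before the ES write at least spreads the write across executors, though the JDBC read itself still runs in one task, so Method 2's partitioned read remains the real fix:

// Sketch: diagnose the single-partition read from Method 1 and spread out the ES write.
// Assumes `jdbcDF` and `result` are the DataFrames from Method 1.
println(s"JDBC read partitions: ${jdbcDF.rdd.getNumPartitions}")   // typically 1 for a non-partitioned read

// Workaround: shuffle the data across more partitions before writing to ES.
val spread = result.repartition(40)
EsSparkSQL.saveToEs(spread, Map("es.resource" -> "pg02/_doc"))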

========================================================================================

Tuned again: this run finished in 2 minutes 45 seconds with the following executor settings:

--executor-memory 2G --executor-cores 2 --num-executors 20

#!/bin/bash

nohup spark-submit --master yarn --deploy-mode cluster --class es.WriteES02 --name pg_es02 --driver-memory 1G --executor-memory 2G --executor-cores 2 --num-executors 20 --jars  /sparkapp/cxb/kafka_write_es/elasticsearch-spark-20_2.11-7.7.0.jar /sparkapp/cxb/kafka_write_es/pg/es-demo-1.0-SNAPSHOT.jar

 

 

References:

https://www.jianshu.com/p/c18a8197e6bf

https://www.cnblogs.com/Kaivenblog/p/12622008.html

 
