Writing Spark in Scala to read data from a PostgreSQL database and write it to Elasticsearch

The PG table holds roughly 1.6 million rows (a bit over 200 MB). Method 1 took about five and a half minutes; Method 2 took about four and a half minutes.

Method 1:



import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

/**
  * @Author: sss
  * @Date: 2020/6/2 10:46
  * @Description: 
  */
object WriteES02 {
  def main(args: Array[String]): Unit = {


    val session = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .config("xpack.security.user", "elastic:elastic123")
      .config("es.net.http.auth.user", "elastic")
      .config("es.net.http.auth.pass", "elastic123")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("es.nodes.wan.only", "true")
      .config("es.nodes", "emg4032")
      .config("es.port", "9200")
      .getOrCreate()

    val jdbcDF = session.read.format("jdbc")
      .option("url", "jdbc:postgresql://xxxx:38888/bigdata")
      .option("dbtable", "public.yk_photo_gps")
      .option("user", "postgres")
      .option("password", "password")
      .load()

    jdbcDF.createTempView("test")
    val result = session.sql("select  *  from  test  where utc >  1590163200000  and utc < 1590184800000 ")

    val esConf = Map(
      "es.resource" -> "pg02/_doc" //es索引
    )

    EsSparkSQL.saveToEs(result, esConf)
    session.stop()

  }
}
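Note that a plain format("jdbc") read with no partitioning options produces a DataFrame backed by a single partition, so one task ends up doing all of the reading and all of the ES writing. A quick way to confirm this, as a small sketch against the jdbcDF above:

    // Without partitionColumn/lowerBound/upperBound/numPartitions, Spark's JDBC
    // source pulls the whole table through a single partition, so the ES write
    // is also driven by a single task.
    println(s"jdbc partitions = ${jdbcDF.rdd.getNumPartitions}") // expected: 1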

 

Method 2:


import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

/**
  * @Author: sss
  * @Date: 2020/6/2 15:18
  * @Description:
  */
object WriteES04 {
  def main(args: Array[String]): Unit = {


    val a = System.currentTimeMillis()


    val session = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .config("xpack.security.user", "elastic:elastic123")
      .config("es.net.http.auth.user", "elastic")
      .config("es.net.http.auth.pass", "elastic123")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("es.nodes.wan.only", "true")
      .config("es.nodes", "emg4032")
      .config("es.port", "9200")
      .getOrCreate()

    val url = "jdbc:postgresql://xxxx:38888/bigdata"
    val tableName = "public.yk_photo_gps"
    val columnName = "terminalid"
    val lowerBound = 2913588
    val upperBound = 4534684
    val numPartitions = 16

    val properties = new Properties()
    properties.put("driver", "org.postgresql.Driver")
    properties.put("user", "postgres")
    properties.put("password", "password")
    //properties.put("fetchsize", "30")   // the default is fine; better not to set this



    val jdbcDF = session.read.jdbc(url, tableName, columnName, lowerBound, upperBound, numPartitions, properties)

    jdbcDF.createTempView("test")

    val result = session.sql("select  *  from  test  where utc >  1590163200000  and utc < 1590184800000 ")

    val esConf = Map(
      "es.resource" -> "pg04/_doc" //es索引
    )

    EsSparkSQL.saveToEs(result, esConf)
    session.stop()

    println("啓動時間:" + a + "|||||結束時間:" + System.currentTimeMillis())
  }
}
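Under the hood, the partitioned read.jdbc(...) overload splits the range [lowerBound, upperBound] into numPartitions strides and issues one query per partition with a WHERE predicate on the partition column (roughly "terminalid >= x AND terminalid < y"), so the 16 partitions read from PostgreSQL in parallel; the bounds only control the split points, they do not filter rows. The same read can also be expressed with the option-based DataFrameReader API. A sketch, reusing the session and values defined above:

    // Equivalent partitioned JDBC read using DataFrameReader options.
    val jdbcDF2 = session.read.format("jdbc")
      .option("url", url)
      .option("dbtable", tableName)
      .option("driver", "org.postgresql.Driver")
      .option("user", "postgres")
      .option("password", "password")
      .option("partitionColumn", columnName)
      .option("lowerBound", lowerBound.toString)
      .option("upperBound", upperBound.toString)
      .option("numPartitions", numPartitions.toString)
      .load()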

================================================================================================

Later, when I read about 4 GB of data, Method 2's speed advantage really showed.

Based on the total row count (12 million rows, 2.5 GB) I set numPartitions to 3600 and partitioned on the tid column, with the lower bound set to the minimum tid value and the upper bound to the maximum tid value.
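Instead of hard-coding the bounds, the min/max of the partition column can be pushed down to PostgreSQL first and then fed into the partitioned read. A minimal sketch, assuming the same session, url, properties, and table as Method 2, with the tid column named in the paragraph above:

    // Push a min/max aggregation down to PostgreSQL as a subquery,
    // then use the result as lowerBound/upperBound for the partitioned read.
    val bounds = session.read
      .jdbc(url, "(select min(tid) as lo, max(tid) as hi from public.yk_photo_gps) t", properties)
      .collect()(0)
    val lo = bounds.getAs[Number]("lo").longValue()
    val hi = bounds.getAs[Number]("hi").longValue()

    val partitionedDF = session.read
      .jdbc(url, "public.yk_photo_gps", "tid", lo, hi, 3600, properties)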

The submit script is as follows:

#!/bin/bash

nohup spark-submit --master yarn --deploy-mode cluster --class es.WriteES02 --name pg_es02 --driver-memory 1G --executor-memory 4G --executor-cores 4 --num-executors 10 --jars  /sparkapp/cxb/kafka_write_es/elasticsearch-spark-20_2.11-7.7.0.jar /sparkapp/cxb/kafka_write_es/pg/es-demo-1.0-SNAPSHOT.jar

Method 2 in the Spark UI: the job finished in 2 minutes 13 seconds.

 

Method 1, run with the same resources, fails at runtime: because no partitioning was configured, only one executor actually runs the task!

ERROR cluster.YarnClusterScheduler: Lost executor 10 on 106.emapgo.com: Executor heartbeat timed out after 131426 ms

20/06/08 17:20:56 INFO scheduler.DAGScheduler: ResultStage 0 (runJob at EsSparkSQL.scala:101) failed in 827.396 s due to Job aborted due to stage failure:
Aborting TaskSet 0.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.2 in stage 0.0 (TID 2, emg106.emapgo.com, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 168371 ms

Blacklisting behavior can be configured via spark.blacklist.*.
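For reference, the heartbeat timeout and blacklist behavior mentioned in the log can be adjusted through standard Spark settings, although the real fix here was partitioning the JDBC read. An illustrative sketch of how those could be set on the SparkSession builder (values are examples only, not what was used here):

    // Illustrative only: these loosen the symptoms (heartbeat timeouts, blacklisting)
    // but do not fix the root cause, which is the single-partition read.
    import org.apache.spark.sql.SparkSession

    val tunedSession = SparkSession.builder()
      .config("spark.network.timeout", "300s")            // raise the overall RPC timeout ceiling
      .config("spark.executor.heartbeatInterval", "30s")  // must stay well below spark.network.timeout
      .config("spark.blacklist.enabled", "false")         // disable executor blacklisting (Spark 2.x name)
      .getOrCreate()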

========================================================================================

Further improvement: finished in 2 minutes 45 seconds with

--executor-memory 2G --executor-cores 2 --num-executors 20

#!/bin/bash

nohup spark-submit --master yarn --deploy-mode cluster --class es.WriteES02 --name pg_es02 --driver-memory 1G --executor-memory 2G --executor-cores 2 --num-executors 20 --jars  /sparkapp/cxb/kafka_write_es/elasticsearch-spark-20_2.11-7.7.0.jar /sparkapp/cxb/kafka_write_es/pg/es-demo-1.0-SNAPSHOT.jar

 

 

References:

https://www.jianshu.com/p/c18a8197e6bf

https://www.cnblogs.com/Kaivenblog/p/12622008.html

 
