The PostgreSQL table holds roughly 1.6 million rows (a bit over 200 MB). Method 1 took five and a half minutes; Method 2 took four and a half.
Method 1:

import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

/**
 * @Author: sss
 * @Date: 2020/6/2 10:46
 * @Description:
 */
object WriteES02 {
  def main(args: Array[String]): Unit = {
    val session = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .config("xpack.security.user", "elastic:elastic123")
      .config("es.net.http.auth.user", "elastic")
      .config("es.net.http.auth.pass", "elastic123")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("es.nodes.wan.only", "true")
      .config("es.nodes", "emg4032")
      .config("es.port", "9200")
      .getOrCreate()
    // Single (unpartitioned) JDBC read
    val jdbcDF = session.read.format("jdbc")
      .option("url", "jdbc:postgresql://xxxx:38888/bigdata")
      .option("dbtable", "public.yk_photo_gps")
      .option("user", "postgres")
      .option("password", "password")
      .load()
    jdbcDF.createTempView("test")
    val result = session.sql("select * from test where utc > 1590163200000 and utc < 1590184800000")
    val esConf = Map(
      "es.resource" -> "pg02/_doc" // ES index
    )
    EsSparkSQL.saveToEs(result, esConf)
    session.stop()
  }
}
Method 2:

import java.util.Properties
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

/**
 * @Author: sss
 * @Date: 2020/6/2 15:18
 * @Description:
 */
object WriteES04 {
  def main(args: Array[String]): Unit = {
    val start = System.currentTimeMillis()
    val session = SparkSession.builder()
      .master("local[*]")
      .appName("test")
      .config("xpack.security.user", "elastic:elastic123")
      .config("es.net.http.auth.user", "elastic")
      .config("es.net.http.auth.pass", "elastic123")
      .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      .config("es.nodes.wan.only", "true")
      .config("es.nodes", "emg4032")
      .config("es.port", "9200")
      .getOrCreate()
    val url = "jdbc:postgresql://xxxx:38888/bigdata"
    val tableName = "public.yk_photo_gps"
    val columnName = "terminalid"
    val lowerBound = 2913588
    val upperBound = 4534684
    val numPartitions = 16
    val properties = new Properties()
    properties.put("driver", "org.postgresql.Driver")
    properties.put("user", "postgres")
    properties.put("password", "password")
    // properties.put("fetchsize", "30") // the default is fine; better not to set this
    // Partitioned JDBC read: parallel reads split on columnName between the bounds
    val jdbcDF = session.read.jdbc(url, tableName, columnName, lowerBound, upperBound, numPartitions, properties)
    jdbcDF.createTempView("test")
    val result = session.sql("select * from test where utc > 1590163200000 and utc < 1590184800000")
    val esConf = Map(
      "es.resource" -> "pg04/_doc" // ES index
    )
    EsSparkSQL.saveToEs(result, esConf)
    session.stop()
    println("Start time: " + start + " ||||| end time: " + System.currentTimeMillis())
  }
}
================================================================================================
Later, when I read 4 GB of data, Method 2's speed advantage really showed.
Based on the total row count (12 million rows, 2.5 GB) I set numPartitions to 3600, partitioning on the tid field, with lowerBound set to the minimum tid value and upperBound set to the maximum.
The submit script:
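The bounds don't have to be hard-coded. A minimal sketch (assuming the same `url`, `tableName`, and `properties` as in the code above, and that `tid` is a numeric column on `public.yk_photo_gps`) that derives them from the table before the partitioned read:

```scala
// Sketch: fetch min/max of the partition column via a pushed-down subquery,
// then use them as lowerBound/upperBound for the partitioned read.
val bounds = session.read.jdbc(
  url,
  "(select min(tid) as lo, max(tid) as hi from public.yk_photo_gps) b",
  properties
).first()
val lowerBound = bounds.getAs[Number]("lo").longValue()
val upperBound = bounds.getAs[Number]("hi").longValue()
val jdbcDF = session.read.jdbc(url, tableName, "tid",
  lowerBound, upperBound, 3600, properties)
```

This costs one extra (index-friendly) min/max query, but keeps the bounds correct as the table grows.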
#!/bin/bash
nohup spark-submit --master yarn --deploy-mode cluster --class es.WriteES02 --name pg_es02 --driver-memory 1G --executor-memory 4G --executor-cores 4 --num-executors 10 --jars /sparkapp/cxb/kafka_write_es/elasticsearch-spark-20_2.11-7.7.0.jar /sparkapp/cxb/kafka_write_es/pg/es-demo-1.0-SNAPSHOT.jar
Method 2's Spark UI shows the job finishing in 2 minutes 13 seconds.
Method 1 fails at runtime with the same resources: since no partitioning was set, only one executor is running the task!
ERROR cluster.YarnClusterScheduler: Lost executor 10 on 106.emapgo.com: Executor heartbeat timed out after 131426 ms
20/06/08 17:20:56 INFO scheduler.DAGScheduler: ResultStage 0 (runJob at EsSparkSQL.scala:101) failed in 827.396 s due to Job aborted due to stage failure:
Aborting TaskSet 0.0 because task 0 (partition 0)
cannot run anywhere due to node and executor blacklist.
Most recent failure:
Lost task 0.2 in stage 0.0 (TID 2, emg106.emapgo.com, executor 12): ExecutorLostFailure (executor 12 exited caused by one of the running tasks) Reason: Executor heartbeat timed out after 168371 ms
Blacklisting behavior can be configured via spark.blacklist.*.
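If the unpartitioned read of Method 1 can't be changed, one mitigation (my sketch, not from the original post; the partition count 40 is an arbitrary example) is to repartition before the ES write. The JDBC read is still a single task, but the bulk indexing is then spread across executors instead of piling onto one:

```scala
// Sketch: repartition so the Elasticsearch bulk writes run in parallel,
// even though the upstream JDBC read remains single-threaded.
EsSparkSQL.saveToEs(result.repartition(40), esConf)
```

A partitioned JDBC read (Method 2) is still preferable, since it parallelizes the read itself.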
========================================================================================
Tuned again, finishing in 2 minutes 45 seconds with:
--executor-memory 2G --executor-cores 2 --num-executors 20
#!/bin/bash
nohup spark-submit --master yarn --deploy-mode cluster --class es.WriteES02 --name pg_es02 --driver-memory 1G --executor-memory 2G --executor-cores 2 --num-executors 20 --jars /sparkapp/cxb/kafka_write_es/elasticsearch-spark-20_2.11-7.7.0.jar /sparkapp/cxb/kafka_write_es/pg/es-demo-1.0-SNAPSHOT.jar
References:
https://www.jianshu.com/p/c18a8197e6bf
https://www.cnblogs.com/Kaivenblog/p/12622008.html