Connecting Spark directly to a PostgreSQL database

Notes:

Things to watch when writing data into MySQL or PostgreSQL
A. Set the save mode explicitly up front
The default is SaveMode.ErrorIfExists. In this mode, if the table already exists in the database, an exception is thrown and the data is not written. The other three modes are:
SaveMode.Append: if the table already exists, the data is appended to it; if it does not exist, the table is created first and the data is then inserted;
SaveMode.Overwrite: overwrite mode, which in effect drops the existing table together with its data, recreates the table, and then inserts the new data;
SaveMode.Ignore: if the table does not exist, it is created and the data is written; if the table already exists, the write is silently skipped and no error is raised.
B. Setting the save mode looks like this (a fuller, self-contained sketch follows after these notes):
import org.apache.spark.sql.SaveMode
......
df.write.mode(SaveMode.Append)
C. If you create the table manually in the database beforehand, pay attention to the column names and data types:
the field names in the Spark SQL schema must match the column names in MySQL exactly!

Special note: Scala's String type maps to MySQL's Text type (verified through my own testing).
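
To tie points A, B and C together, here is a minimal, self-contained sketch of writing a DataFrame to PostgreSQL with an explicit SaveMode. The table name, connection URL and credentials below are placeholders, not values from the original job:

import java.util.Properties
import org.apache.spark.sql.{SaveMode, SparkSession}

val spark = SparkSession.builder().appName("pg-write-demo").getOrCreate()
import spark.implicits._

// toy DataFrame; in practice df comes from your own pipeline
// column names must match the target table if it was created by hand (point C)
val df = Seq((1L, "a"), (2L, "b")).toDF("id", "name")

val props = new Properties()
props.put("user", "postgres")                  // database user
props.put("password", "password")              // database password
props.put("driver", "org.postgresql.Driver")   // PostgreSQL JDBC driver class

// SaveMode.Append: create the table if it is missing, otherwise append rows (point A)
// note that a Scala String column ends up as a text column on the database side
df.write
  .mode(SaveMode.Append)
  .jdbc("jdbc:postgresql://<host>:<port>/<database>", "demo_table", props)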

The above are the pitfalls I ran into while reading from and writing to MySQL with Spark SQL; I am noting them here for future reference.
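
Reading such a table back with Spark SQL is symmetric; a minimal sketch reusing the placeholder connection details and props from the sketch above:

val readBack = spark.read
  .jdbc("jdbc:postgresql://<host>:<port>/<database>", "demo_table", props)
readBack.printSchema()
readBack.show(10)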


 

package com.emg.rec_match

import java.util.Properties

import com.emg.real.RealDataDAO.{WayAllDAO, WayRealDAO}
import com.emg.rec_match.model._
import com.emg.rec_match.pool.CreatePGSqlPoolForBigdata
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.mutable.ListBuffer


object RecMatch {

  private val logger = Logger.getLogger(RecMatch.getClass)

  def main(args: Array[String]): Unit = {

    // create the SparkSession
    val conf = new SparkConf().setAppName("matcher2") //.setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      //.set("spark.kryoserializer.buffer.max", "512m")
      .registerKryoClasses(Array[Class[_]](AvgSpeedBuffer.getClass, Point.getClass, UsefulFields.getClass,
      DistAndDegree.getClass, Uuid2oriData.getClass, Uuid2midData.getClass, Uuid2extData.getClass, RoadPart.getClass, Uuid.getClass))
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext

    // make sure the hourly target table (cover_<hour>) exists before writing
    CreateTableDAODay.createTable(utils.CurrentTimeUtil.getFrontHour())

    import spark.implicits._
    // read the input data
    val dataRDD = sc.textFile(args(0))

    val value = dataRDD.map(
      line => {
        // split on commas, keeping trailing empty fields; only columns 0, 3, 4, 7, 8 and 9 are used
        val fields = line.split(",", line.length)
        val uuid = fields(0).toLong
        val lat = fields(3).toDouble
        val lon = fields(4).toDouble
        val dir = fields(7).toDouble
        val speed = fields(8).toDouble
        val utc = fields(9).toLong
        UsefulFields(uuid, lat, lon, dir, speed, utc)
      }
    )


    val len = sc.longAccumulator // accumulator, used for counting

    val rdds: RDD[Uuid2extData] = value.mapPartitions(partition => {
      val datas: List[Uuid2oriData] = WayAllDAO.getRoadIdBatch(partition.toList)
      val result: List[Uuid2extData] = WayRealDAO.getMatchRoadID(datas)
      //len.add(result.length)
      /*      result.map(uuid => {
              Uuid(uuid.uuid, uuid.point.lat, uuid.point.lon, uuid.dir, uuid.speed, uuid.utc)
            }).iterator*/
      result.iterator
    })

    val pro = new Properties
    // the database user (user) and password (password) must be set, and the PostgreSQL driver (driver) must be specified
    pro.put("user", "postgres")
    pro.put("password", "password")
    pro.put("driver", "org.postgresql.Driver")
    // flatten the nested point struct and append the rows to the hourly table
    rdds.toDF()
      .select("uuid", "point.lat", "point.lon", "dir", "speed", "utc")
      .write.mode("append")
      .jdbc("jdbc:postgresql://192.168.00.00:38888/bigdata", s"cover_${utils.CurrentTimeUtil.getFrontHour().toString}", pro)
    sc.stop()
    spark.stop()
  }
}
 /*
        // alternative write path: insert row by row per partition through a pooled JDBC connection
        val unit = rdd.foreachPartition(partition => {
          val poolForBigdata = CreatePGSqlPoolForBigdata()
          val clientForBigdata = poolForBigdata.borrowObject()
          partition.foreach(line => {
            val sql = s"INSERT INTO cover_${utils.CurrentTimeUtil.getFrontHour().toString} VALUES(?,?,?,?,?,?)" //uuid,roadid,roadpart,point,dir,speed,utc
            clientForBigdata.executeUpdate(sql, Array(line.uuid, line.point.lat, line.point.lon, line.dir, line.speed, line.utc)) //
          })
          // return the client to the pool after use
          poolForBigdata.returnObject(clientForBigdata)
        })*/
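
One practical note on running the job: the PostgreSQL JDBC driver has to be on both the driver and executor classpath, otherwise the org.postgresql.Driver class referenced in the properties cannot be loaded. A hedged submission example; the master, jar names, paths and driver version are placeholders, not from the original setup:

spark-submit \
  --class com.emg.rec_match.RecMatch \
  --master yarn \
  --jars /path/to/postgresql-42.x.jar \
  rec_match.jar /path/to/input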

 

References: https://blog.csdn.net/dream_an/article/details/54962464

     https://blog.csdn.net/dai451954706/article/details/52840011

 
