Notes:
Things to watch out for when saving data to MySQL or PostgreSQL

A. Set the save mode explicitly up front.
The default is SaveMode.ErrorIfExists: if the table already exists in the database, an exception is thrown and the data is not saved. The other three modes are:
SaveMode.Append — if the table already exists, rows are appended to it; if not, the table is created first and the data is then inserted;
SaveMode.Overwrite — overwrite mode: the existing table and all of its data are dropped, the table is recreated, and the new data is inserted;
SaveMode.Ignore — if the table does not exist, it is created and the data is saved; if the table exists, the write is silently skipped and no error is raised.
B. Setting the save mode takes two steps:
import org.apache.spark.sql.SaveMode
......
df.write.mode(SaveMode.Append)
C. If you create the table manually in the database ahead of time, pay attention to the column names and data types:
the field names in the Spark SQL schema must match the column names in MySQL exactly!
Note in particular: Scala's String type maps to the TEXT type in MySQL (verified by testing this myself).
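Because of that default String → TEXT mapping, it is often easier to pre-create the table yourself with explicit types. A minimal, Spark-free sketch that builds such a DDL statement — the table and column names here are made-up examples, not from the job below:

```scala
// Sketch: pre-creating a table whose column names and types match the
// DataFrame schema. Table/column names are hypothetical. Listing the types
// explicitly lets a String column become VARCHAR instead of the default TEXT.
object TableDDL {
  // Build a CREATE TABLE statement from (columnName, sqlType) pairs.
  def createTableSql(table: String, columns: Seq[(String, String)]): String = {
    val cols = columns.map { case (name, tpe) => s"$name $tpe" }.mkString(", ")
    s"CREATE TABLE IF NOT EXISTS $table ($cols)"
  }
}

val ddl = TableDDL.createTableSql(
  "cover_example",
  Seq("uuid" -> "BIGINT", "lat" -> "DOUBLE PRECISION", "lon" -> "DOUBLE PRECISION",
      "plate" -> "VARCHAR(32)") // a String field mapped to VARCHAR, not TEXT
)
```

When you let Spark create the table instead, a similar effect is available through the writer option `createTableColumnTypes` (since Spark 2.2), e.g. `.option("createTableColumnTypes", "plate VARCHAR(32)")` before the `.jdbc(...)` call.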
The above are the pitfalls I ran into while reading from and writing to MySQL with Spark SQL, recorded here for future reference.
package com.emg.rec_match

import java.util.Properties

import com.emg.real.RealDataDAO.{WayAllDAO, WayRealDAO}
import com.emg.rec_match.model._
import com.emg.rec_match.pool.CreatePGSqlPoolForBigdata
import org.apache.log4j.Logger
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.mutable.ListBuffer

object RecMatch {
  private val logger = Logger.getLogger(RecMatch.getClass)

  def main(args: Array[String]): Unit = {
    // create the SparkSession
    val conf = new SparkConf().setAppName("matcher2") //.setMaster("local[*]")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      //.set("spark.kryoserializer.buffer.max", "512m")
      .registerKryoClasses(Array[Class[_]](AvgSpeedBuffer.getClass, Point.getClass, UsefulFields.getClass,
        DistAndDegree.getClass, Uuid2oriData.getClass, Uuid2midData.getClass, Uuid2extData.getClass, RoadPart.getClass, Uuid.getClass))
    val spark = SparkSession.builder().config(conf).getOrCreate()
    val sc = spark.sparkContext
    CreateTableDAODay.createTable(utils.CurrentTimeUtil.getFrontHour())
    import spark.implicits._

    // read the input data
    val dataRDD = sc.textFile(args(0))
    val value = dataRDD.map(
      line => {
        val fields = line.split(",", line.length)
        val uuid = fields(0).toLong
        val lat = fields(3).toDouble
        val lon = fields(4).toDouble
        val dir = fields(7).toDouble
        val speed = fields(8).toDouble
        val utc = fields(9).toLong
        UsefulFields(uuid, lat, lon, dir, speed, utc)
      }
    )

    val len = sc.longAccumulator // accumulator, used for counting
    val rdds: RDD[Uuid2extData] = value.mapPartitions(partition => {
      val datas: List[Uuid2oriData] = WayAllDAO.getRoadIdBatch(partition.toList)
      val result: List[Uuid2extData] = WayRealDAO.getMatchRoadID(datas)
      //len.add(result.length)
      /* result.map(uuid => {
        Uuid(uuid.uuid, uuid.point.lat, uuid.point.lon, uuid.dir, uuid.speed, uuid.utc)
      }).iterator */
      result.iterator
    })

    val pro = new Properties
    // the database user name (user) and password (password) are required,
    // and the PostgreSQL driver (driver) must be specified explicitly
    pro.put("user", "postgres")
    pro.put("password", "password")
    pro.put("driver", "org.postgresql.Driver")

    rdds.toDF()
      .select("uuid", "point.lat", "point.lon", "dir", "speed", "utc")
      .write.mode("append")
      .jdbc("jdbc:postgresql://192.168.00.00:38888/bigdata", s"cover_${utils.CurrentTimeUtil.getFrontHour().toString}", pro)

    sc.stop()
    spark.stop()
  }
}
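The field extraction inside the map() above can be exercised without a Spark cluster. A small self-contained sketch with a fabricated input line — the field positions follow the job's code (uuid at index 0, lat/lon at 3/4, dir/speed at 7/8, utc at 9), but the sample values are invented:

```scala
// Spark-free sketch of the record parsing done in the map() above.
// The sample line is fabricated for illustration only.
case class UsefulFields(uuid: Long, lat: Double, lon: Double,
                        dir: Double, speed: Double, utc: Long)

def parseLine(line: String): UsefulFields = {
  // same split call as the job: the limit just prevents trailing-field loss
  val fields = line.split(",", line.length)
  UsefulFields(fields(0).toLong, fields(3).toDouble, fields(4).toDouble,
               fields(7).toDouble, fields(8).toDouble, fields(9).toLong)
}

// indices:  0    1 2   3      4     5 6   7     8       9
val sample = "42,x,x,39.90,116.40,x,x,180.0,60.5,1546300800"
val parsed = parseLine(sample)
```

Keeping the parsing in a plain function like this makes it trivial to unit-test the index arithmetic before shipping the Spark job.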
An alternative write path, kept here commented out, that borrows a client from a connection pool and inserts row by row:

/*
val unit = rdd.foreachPartition(partition => {
  val poolForBigdata = CreatePGSqlPoolForBigdata()
  val clientForBigdata = poolForBigdata.borrowObject()
  partition.foreach(line => {
    // columns: uuid, roadid, roadpart, point, dir, speed, utc
    val sql = s"INSERT INTO cover_${utils.CurrentTimeUtil.getFrontHour().toString} VALUES(?,?,?,?,?,?)"
    clientForBigdata.executeUpdate(sql, Array(line.uuid, line.point.lat, line.point.lon, line.dir, line.speed, line.utc))
  })
  // return the client to the pool when done
  poolForBigdata.returnObject(clientForBigdata)
})
*/
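One executeUpdate per row means one round trip per row. With plain JDBC the usual fix is a batched PreparedStatement. A sketch of that pattern — it assumes a standard java.sql.Connection, which is an assumption on my part: the pooled client in the commented block has its own API, and this is not it:

```scala
// Sketch: batched insert with plain JDBC as an alternative to row-at-a-time
// executeUpdate. Assumes a standard java.sql.Connection (hypothetical here).
import java.sql.Connection

// Build "INSERT INTO <table> VALUES (?,?,...)" with one placeholder per column.
def insertSql(table: String, columnCount: Int): String =
  s"INSERT INTO $table VALUES (${Seq.fill(columnCount)("?").mkString(",")})"

def writeBatch(conn: Connection, table: String, rows: Seq[Array[Any]]): Unit = {
  val stmt = conn.prepareStatement(insertSql(table, rows.head.length))
  try {
    rows.foreach { row =>
      row.zipWithIndex.foreach { case (v, i) => stmt.setObject(i + 1, v) }
      stmt.addBatch()
    }
    stmt.executeBatch() // one round trip per batch instead of per row
  } finally stmt.close()
}

val sql = insertSql("cover_example", 6)
```

In the foreachPartition pattern above, one batch per partition (or per few thousand rows) is typically enough to make the insert path stop dominating the job's runtime.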
References:
https://blog.csdn.net/dream_an/article/details/54962464
https://blog.csdn.net/dai451954706/article/details/52840011