Data cleaning is a common first step in data processing. The example below cleans and transforms a large log dataset; what each step does is described in the comments.
package etl
import java.io.File
import java.text.SimpleDateFormat
import java.util.Date
import bean.{Logs, logSchema}
import config.ConfigHelper
import org.apache.commons.io.FileUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.sql
import org.apache.spark.sql.{Row, SaveMode, SparkSession}
/**
 * 1. Clean the raw records
 * 2. Transform them to Parquet
 * */
object log2Parquet {
def main(args: Array[String]): Unit = {
//Two arguments are expected: an input path and an output path.
//args.length < 2 means one is missing, so report it and stop.
if(args.length<2){
println("Usage: <inputPath> <outputPath>")
return
}
//Create a SparkSession
val session=SparkSession
//builder
.builder()
//run locally; * means one worker thread per core
.master("local[*]")
//application name
.appName(this.getClass.getName)
//Parquet compression codec
.config("spark.sql.parquet.compression.codec",ConfigHelper.parquetCode)
//serializer
.config("spark.serializer",ConfigHelper.serializer)
//reuse an existing session if there is one, otherwise create a new one
.getOrCreate()
//import implicit conversions
import session.implicits._
//read the input
val sources: RDD[String] = session.sparkContext.textFile(args(0))
//Split each line on "|" and keep only records with at least 55 fields
//(the format defines 55 fields, so fewer than 55 means the record was corrupted or mis-split)
//About split's limit parameter:
//  with no limit, trailing empty strings are dropped after the split
//  with limit -1, trailing empty strings ("" "" "" ...) are kept
//filter is relatively expensive, so use it sparingly
//Reusing an RDD is cheaper with caching:
//  the cache itself is not fault-tolerant; the lineage from before the cache is kept for recomputation
//  caching comes as cache and persist
//  cache: calls persist under the hood with the memory-only level
//  persist supports twelve storage levels:
//  memory, memory serialized, memory+disk, memory+disk serialized, disk, off-heap - each x2 (replicated)
//Checkpointing stores intermediate data on a reliable file system such as HDFS
//(the data is then safe, and the lineage from before the checkpoint is cut)
//In split's regex, "\\" is the escape; metacharacters such as "|" must be written "\\|",
//while plain delimiters such as "," or ";" need no escaping
val filted = sources.map(line=>line.split("\\|",line.length)).filter(_.length>=55)
//Process the 300T and 300TATO records
//get the current time to compare against each record's data time
val format = new SimpleDateFormat("yyyyMMddHHmmss")
val date = new Date()
val currentTime: Long = date.getTime
//Clean 300T and 300TATO
//a record must pass all four filters to be kept
//Filter 1: drop records whose first field contains neither "300T" nor "300TATO"
//(contains("300T") already matches "300TATO", so the second test is redundant but harmless)
val rdd300T = filted.filter(arr=> arr(0).contains("300T") || arr(0).contains("300TATO"))
//Filter 2: drop records whose 18th field contains "復位" (reset), "SB待機" (SB standby) or "常用制動報警" (service-brake alarm)
//note: this must be a conjunction of negations - with || the predicate would be true for almost every record
.filter(arr => !arr(17).contains("復位") && !arr(17).contains("SB待機") && !arr(17).contains("常用制動報警"))
//Filter 3: the 8th field is the record's data time; drop records older than five years
//currentTime: the current time in milliseconds
//format.parse(arr(7)).getTime: arr(7) is the data time; parse() reads it with the format
//defined above and getTime converts it to milliseconds, the same unit as currentTime
//5L * 365 * 24 * 60 * 60 * 1000 is five years in milliseconds:
//5 years x 365 days x 24 hours x 60 minutes x 60 seconds x 1000 ms
//(the L suffix matters: as a plain Int the product overflows)
.filter(arr => (currentTime - format.parse(arr(7)).getTime < 5L * 365 * 24 * 60 * 60 * 1000))
//Filter 4: drop records whose 18th field contains "休眠" (sleep), "未知" (unknown), "無致命錯誤" (no fatal error),
//"一致性消息錯誤" (consistency-message error) or "NVMEN故障" (NVMEN fault) - again a conjunction of negations
.filter(arr => !arr(17).contains("休眠") && !arr(17).contains("未知") && !arr(17).contains("無致命錯誤")
&& !arr(17).contains("一致性消息錯誤") && !arr(17).contains("NVMEN故障"))
//Pick out the records whose ATP system is currently in a fault state ("前ATP系統處於故障中")
//they get their seconds zeroed and are deduplicated with distinct() below
val rdd300TSF = rdd300T.filter(arr=>arr(17).contains("前ATP系統處於故障中"))
.map(arr => {
//arr(7) is a 14-digit data time, e.g. 20150109102323 = 2015-01-09 10:23:23
//substring(0, 12) keeps the first twelve characters (yyyyMMddHHmm, end index exclusive)
//and "00" zeroes the seconds, so records that differ only in seconds become equal
arr(7) = arr(7).substring(0, 12) + "00"
//re-join the 55 fields with "|" so distinct() can deduplicate whole records,
//then split back into Array[String] (limit -1 keeps trailing empty fields)
arr.take(55).mkString("|")
}).distinct().map(_.split("\\|", -1))
//Clean 300S and 300SATO
//a record must pass all three filters to be kept
//Filter 1: drop records whose first field contains neither "300S" nor "300SATO"
val rdd300S = filted.filter(arr => arr(0).contains("300S") || arr(0).contains("300SATO"))
//Filter 2: drop records whose 18th field contains "休眠" (sleep)
.filter(arr => !arr(17).contains("休眠"))
//Filter 3: drop records whose 10th field contains "CTCS-3" AND whose 18th field contains "SB待機" or "備系"
//(i.e. keep everything that does not match both conditions at once)
.filter(arr => !(arr(9).contains("CTCS-3") && (arr(17).contains("SB待機") || arr(17).contains("備系"))))
//200H
//keep records whose first field contains "200H", then drop those whose 18th field contains "休眠" or "VC2報"
val rdd200H = filted.filter(arr=>arr(0).contains("200H")).filter(arr=> !arr(17).contains("休眠") && !arr(17).contains("VC2報"))
//300H
//keep records whose first field contains "300H", then drop those whose 18th field contains "休眠"
val rdd300H = filted.filter(arr=>arr(0).contains("300H")).filter(arr=> !arr(17).contains("休眠"))
//Union everything back together
val rddAll: RDD[Array[String]] = rdd300T.filter(arr => !arr(17).contains("前ATP系統處於故障中"))
.union(rdd300TSF)
.union(rdd300H)
.union(rdd200H)
.union(rdd300S)
//catch-all for records whose first field matched none of the prefixes above
.union(filted.filter(arr => !arr(0).contains("300T") && !arr(0).contains("300S") && !arr(0).contains("H")))
//Convert the RDD to a DataFrame
//first turn each Array[String] into a Row
val rowRDD = rddAll.map(arr => Row.fromSeq(arr.take(55)))
//val frame = session.createDataFrame(rowRDD, logSchema.schema)
//Alternative API:
//map each record onto a case class
val rowRDDLogs: RDD[Logs] = rddAll.map(arr => Logs(
arr(0), arr(1), arr(2), arr(3), arr(4), arr(5), arr(6), arr(7), arr(8), arr(9),
arr(10), arr(11), arr(12), arr(13), arr(14), arr(15), arr(16), arr(17), arr(18), arr(19),
arr(20), arr(21), arr(22), arr(23), arr(24), arr(25), arr(26), arr(27), arr(28), arr(29),
arr(30), arr(31), arr(32), arr(33), arr(34), arr(35), arr(36), arr(37), arr(38), arr(39),
arr(40), arr(41), arr(42), arr(43), arr(44), arr(45), arr(46), arr(47), arr(48), arr(49),
arr(50), arr(51), arr(52), arr(53), arr(54)
))
val frame = session.createDataFrame(rowRDDLogs)
//show(true) truncates long cell values to 20 characters:
//a value such as sssssssssssssssssssssssss prints as a prefix followed by "..."
//show(false) prints every value in full
frame.show(true)
//Check whether the output directory already exists
val file = new File(args(1))
if (file.exists()){
//if it does, delete it: the Parquet writer below creates the output directory from scratch
FileUtils.deleteDirectory(file)
}
//write the Parquet files, partitioned by ATP type
frame.write.partitionBy("MPacketHead_ATPType").mode(SaveMode.Overwrite).parquet(args(1))
//release resources
session.stop()
}
}
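The comments in the listing mention RDD reuse, caching, and checkpointing, but `filted` is never actually persisted, so each of the five branches (300T, 300S, 200H, 300H, and the catch-all union) re-reads and re-splits the input file. A minimal sketch of applying those notes to this job (the storage level and checkpoint directory are assumptions, not part of the original code):

```scala
import org.apache.spark.storage.StorageLevel

// Persist the shared parent RDD once; the downstream filters for 300T, 300S,
// 200H, 300H and the catch-all then reuse the cached partitions instead of
// re-reading and re-splitting the text file on every branch.
val filted = sources
  .map(line => line.split("\\|", line.length))
  .filter(_.length >= 55)
  .persist(StorageLevel.MEMORY_AND_DISK)

// ...build rdd300T / rdd300S / rdd200H / rdd300H / rddAll as above...

// To also cut the lineage, checkpoint to a reliable file system instead:
// session.sparkContext.setCheckpointDir("hdfs:///tmp/etl-checkpoint")
// filted.checkpoint()

// Release the cache once the DataFrame has been written out.
filted.unpersist()
```

`MEMORY_AND_DISK` spills partitions that do not fit in memory to local disk, which suits a wide reuse pattern like this better than the memory-only level that `cache()` implies.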
Related configuration files:
ConfigHelper.scala:
package config
import com.typesafe.config.{Config, ConfigFactory}
object ConfigHelper {
//load the configuration file
private lazy val load: Config = ConfigFactory.load()
//compression codec to use for Parquet
val parquetCode: String = load.getString("parquet.code")
//serialization backend
val serializer: String = load.getString("spark.serializer")
}
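`getString` throws `ConfigException.Missing` when a key is absent from `application.conf`. A hardened variant of `ConfigHelper` could ship built-in defaults via `withFallback`; this variant is a sketch and not part of the original code (the default values mirror the `application.conf` at the end of this post):

```scala
import com.typesafe.config.{Config, ConfigFactory}

object ConfigHelperWithDefaults {
  // Defaults matching application.conf; used only when a key is missing.
  private val defaults: Config = ConfigFactory.parseString(
    """
      |parquet.code = "snappy"
      |spark.serializer = "org.apache.spark.serializer.KryoSerializer"
    """.stripMargin)

  // Keys found in application.conf win; anything missing falls back to the
  // defaults, so getString no longer throws for these two keys.
  private lazy val load: Config = ConfigFactory.load().withFallback(defaults)

  val parquetCode: String = load.getString("parquet.code")
  val serializer: String = load.getString("spark.serializer")
}
```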
logSchema.scala:
package bean
import org.apache.spark.sql.types.{StringType, StructField, StructType}
object logSchema {
val schema=StructType(
Array(
//packetName_fieldName
StructField("MPacketHead_ATPType",StringType),
StructField("MPacketHead_TrainID",StringType),
StructField("MPacketHead_TrainNum",StringType),
StructField("MPacketHead_AttachRWBureau",StringType),
StructField("MPacketHead_ViaRWBureau",StringType),
StructField("MPacketHead_CrossDayTrainNum",StringType),
StructField("MPacketHead_DriverID",StringType),
StructField("MATPBaseInfo_DataTime",StringType),
StructField("MATPBaseInfo_Speed",StringType),
StructField("MATPBaseInfo_Level",StringType),
StructField("MATPBaseInfo_Mileage",StringType),
StructField("MATPBaseInfo_Braking",StringType),
StructField("MATPBaseInfo_EmergentBrakSpd",StringType),
StructField("MATPBaseInfo_CommonBrakSpd",StringType),
StructField("MATPBaseInfo_RunDistance",StringType),
StructField("MATPBaseInfo_Direction",StringType),
StructField("MATPBaseInfo_LineID",StringType),
StructField("MATPBaseInfo_AtpError",StringType),
StructField("MBalisePocket_BaliseID",StringType),
StructField("MBalisePocket_BaliseMile",StringType),
StructField("MBalisePocket_BaliseType",StringType),
StructField("MBalisePocket_Direction",StringType),
StructField("MBalisePocket_LineID",StringType),
StructField("MBalisePocket_AttachRWBureau",StringType),
StructField("MBalisePocket_BaliseNum",StringType),
StructField("MBalisePocket_Station",StringType),
StructField("MBalisePocket_BaliseError",StringType),
StructField("Signal_SignalID",StringType),
StructField("Signal_SignalName",StringType),
StructField("Signal_Station",StringType),
StructField("Signal_SignalMile",StringType),
StructField("Signal_Direction",StringType),
StructField("Signal_LineID",StringType),
StructField("Signal_Longitude",StringType),
StructField("Signal_Latitude",StringType),
StructField("Signal_SignalError",StringType),
StructField("RunNextSignal_SignalID",StringType),
StructField("RunNextSignal_SignalName",StringType),
StructField("RunNextSignal_Station",StringType),
StructField("RunNextSignal_SignalMile",StringType),
StructField("RunNextSignal_Direction",StringType),
StructField("RunNextSignal_LineID",StringType),
StructField("RunNextSignal_Longitude",StringType),
StructField("RunNextSignal_Latitude",StringType),
StructField("DriverInfo_DriverID",StringType),
StructField("DriverInfo_DriverName",StringType),
StructField("DriverInfo_DriverPhone",StringType),
StructField("DriverInfo_DriverOption",StringType),
StructField("DriverInfo_Validit",StringType),
StructField("RunDirection",StringType),
StructField("UUID",StringType),
StructField("Temperature",StringType),
StructField("Road",StringType),
StructField("Weather",StringType),
StructField("Humidity",StringType)
)
)
}
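Writing out 55 `StructField`s by hand is error-prone. Since every column is a `StringType` and the `Logs` case class below declares the same 55 field names, the schema can also be derived from the case class; this shortcut is a sketch and not in the original code:

```scala
import bean.Logs
import org.apache.spark.sql.Encoders
import org.apache.spark.sql.types.StructType

// Derive the StructType from the Logs case class: field names become column
// names and each String field maps to a nullable StringType, so the result
// matches the hand-written logSchema.schema above.
val derivedSchema: StructType = Encoders.product[Logs].schema
```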
Logs.scala:
package bean
case class Logs (
val MPacketHead_ATPType:String,
val MPacketHead_TrainID:String,
val MPacketHead_TrainNum:String,
val MPacketHead_AttachRWBureau:String,
val MPacketHead_ViaRWBureau:String,
val MPacketHead_CrossDayTrainNum:String,
val MPacketHead_DriverID:String,
val MATPBaseInfo_DataTime:String,
val MATPBaseInfo_Speed:String,
val MATPBaseInfo_Level:String,
val MATPBaseInfo_Mileage:String,
val MATPBaseInfo_Braking:String,
val MATPBaseInfo_EmergentBrakSpd:String,
val MATPBaseInfo_CommonBrakSpd:String,
val MATPBaseInfo_RunDistance:String,
val MATPBaseInfo_Direction:String,
val MATPBaseInfo_LineID:String,
val MATPBaseInfo_AtpError:String,
val MBalisePocket_BaliseID:String,
val MBalisePocket_BaliseMile:String,
val MBalisePocket_BaliseType:String,
val MBalisePocket_Direction:String,
val MBalisePocket_LineID:String,
val MBalisePocket_AttachRWBureau:String,
val MBalisePocket_BaliseNum:String,
val MBalisePocket_Station:String,
val MBalisePocket_BaliseError:String,
val Signal_SignalID:String,
val Signal_SignalName:String,
val Signal_Station:String,
val Signal_SignalMile:String,
val Signal_Direction:String,
val Signal_LineID:String,
val Signal_Longitude:String,
val Signal_Latitude:String,
val Signal_SignalError:String,
val RunNextSignal_SignalID:String,
val RunNextSignal_SignalName:String,
val RunNextSignal_Station:String,
val RunNextSignal_SignalMile:String,
val RunNextSignal_Direction:String,
val RunNextSignal_LineID:String,
val RunNextSignal_Longitude:String,
val RunNextSignal_Latitude:String,
val DriverInfo_DriverID:String,
val DriverInfo_DriverName:String,
val DriverInfo_DriverPhone:String,
val DriverInfo_DriverOption:String,
val DriverInfo_Validit:String,
val RunDirection:String,
val UUID:String,
val Temperature:String,
val Road:String,
val Weather:String,
val Humidity:String
)
//(case classes extend Product automatically, so this is only a reminder of what it provides)
// extends Product{
// //productArity: the number of fields in the record
// //productElement(i): given an index, return the corresponding value
// //used to match fields to columns
//}
application.conf:
#configuration file
#Parquet compression codec
parquet.code="snappy"
#serialization backend
spark.serializer="org.apache.spark.serializer.KryoSerializer"
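To check the result, the Parquet directory written by `log2Parquet` can be read back. Because the write is partitioned by `MPacketHead_ATPType`, a filter on that column only scans the matching partition directories. A sketch (the output path is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

val session = SparkSession.builder()
  .master("local[*]")
  .appName("readLogParquet")
  .getOrCreate()

// Partition discovery restores MPacketHead_ATPType as a regular column.
val logs = session.read.parquet("/path/to/output")

// Partition pruning: only the MPacketHead_ATPType=300T directories are scanned.
logs.filter(logs("MPacketHead_ATPType") === "300T").show(20, truncate = true)

session.stop()
```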