Spark DataFrame date format problem

Background

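None of the US and UK PLA data from the 5th onward was stored in ES. It later turned out that the date format did not match.
When the data is read, the column is already of timestamp type (see the dataFrame.printSchema output below).
The key issue now is the format conversion between the DataFrame and the CSV.
Explicit conversion looks like the more reliable option: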

    val userBiddingResultSchema = FbUserBiddingResultPojo.structType
    val userBiddingResultDf = sparkSession.read.schema(userBiddingResultSchema).csv(mdlResultPath)
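It was a data format problem. Among the several hundred thousand US rows, one row has a comma inside its URL, which shifts every following column by one. In Spark we currently set inferSchema to true, that is, we rely on Spark to parse each column's data and decide its type. When the US data contains a row like this, the column is no longer uniform, and Spark infers it as string.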
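The column shift can be reproduced without Spark. The following is a minimal sketch, assuming a made-up four-column row layout and a made-up URL (neither is from the real data): a naive comma split of a row whose URL contains a comma yields one extra field, and a piece of the URL lands in the column that should hold a number.

```scala
import scala.util.Try

object ColumnShiftDemo {
  // Naive CSV split: every comma is treated as a field separator.
  // The -1 limit keeps trailing empty fields.
  def naiveSplit(line: String): Array[String] = line.split(",", -1)

  // True when the field can be read as a double, i.e. looks numeric.
  def looksNumeric(field: String): Boolean = Try(field.toDouble).isSuccess

  def main(args: Array[String]): Unit = {
    // Hypothetical rows: partner, channel, url, price
    val good = "Google,PLA,http://example.com/item?tags=a,47.99"
    val bad  = "Google,PLA,http://example.com/item?tags=a,b,47.99"

    println(naiveSplit(good).length)          // 4 fields, price is at index 3
    println(naiveSplit(bad).length)           // 5 fields, everything after the URL is shifted
    println(looksNumeric(naiveSplit(good)(3))) // true
    println(looksNumeric(naiveSplit(bad)(3)))  // false: index 3 now holds "b"
  }
}
```

A single non-numeric (or non-date) value in a column is enough for type inference to fall back to string for that whole column, which is why one bad row out of several hundred thousand changed the schema.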


Lessons learned

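  1. When reading CSV in Spark, do not set inferSchema to true and rely on Spark's own parsing; specify the schema explicitly.
  2. For the ES index: put double (already done); date fields are parsed automatically by ES. In fact, putInteger alone is enough, and there is no need to create the ES index in advance (see the note "ES創建和更改index.md", on creating and changing an ES index).
  3. The timestamp matching problem still needs to be solved soon. Right now PLA, text and display are all matched on clickId plus timestamp, but the two sides differ in time zone. My workaround at the time was to drop the hour, which means records around midnight fail to match. I will work out a proper solution over the weekend.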
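The midnight problem in item 3 can be illustrated with java.time. This is only a sketch: the two timestamps below are invented, and the assumption that one side is written with a -07:00 offset and the other in UTC is mine, not stated in the notes. It shows that a key which keeps only the calendar date disagrees near midnight, while a key normalized to epoch milliseconds does not depend on the offset the string was written in.

```scala
import java.time.{LocalDate, OffsetDateTime}

object MidnightMatchDemo {
  // Key that drops the time of day: the calendar date exactly as written in the string.
  def dateOnlyKey(iso: String): LocalDate =
    OffsetDateTime.parse(iso).toLocalDate

  // Key that keeps the full instant, normalized to epoch milliseconds.
  def instantKey(iso: String): Long =
    OffsetDateTime.parse(iso).toInstant.toEpochMilli

  def main(args: Array[String]): Unit = {
    // The same moment, written with two different offsets (hypothetical values).
    val sideA = "2019-11-05T23:30:00.000-07:00"
    val sideB = "2019-11-06T06:30:00.000Z"

    println(dateOnlyKey(sideA) == dateOnlyKey(sideB)) // false: 2019-11-05 vs 2019-11-06
    println(instantKey(sideA) == instantKey(sideB))   // true: same instant
  }
}
```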

Solving the timestamp matching problem

ISO 8601 https://blog.csdn.net/dai451954706/article/details/46930167
convert String to date; and convert date to String
https://www.cnblogs.com/mlfh1234/p/9210046.html

my code:

import java.sql.Timestamp
import java.util.Locale
import java.text.SimpleDateFormat

object Demo {
  def main(args: Array[String]): Unit = {
    println("begin...")
    val loc = new Locale("en")
    // ISO 8601 pattern with milliseconds and a zone offset such as -07:00
    val fm = new SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX", loc)

    // String -> Date: parse the text and print it as epoch milliseconds
    val tm = "2019-10-14T11:35:41.005-07:00"
    val dt2 = fm.parse(tm)
    println(dt2.getTime())

    // Reverse, Timestamp -> String: the output uses the JVM default time zone
    val ts = new Timestamp(System.currentTimeMillis())
    println(ts)
    println(fm.format(ts))
  }
}
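Because the reverse direction above formats in the JVM default time zone, its output differs from machine to machine. A small variation, sketched here with helper names of my own (`formatIn`, `parseMillis`), pins the time zone on the formatter so the round trip is reproducible:

```scala
import java.text.SimpleDateFormat
import java.util.{Date, Locale, TimeZone}

object FixedZoneFormatDemo {
  val Pattern = "yyyy-MM-dd'T'HH:mm:ss.SSSXXX"

  // String -> epoch milliseconds. The offset in the text decides the instant.
  // A new formatter is built per call because SimpleDateFormat is not thread-safe.
  def parseMillis(text: String): Long =
    new SimpleDateFormat(Pattern, Locale.ENGLISH).parse(text).getTime

  // Epoch milliseconds -> String, rendered in an explicitly chosen time zone.
  def formatIn(epochMillis: Long, zoneId: String): String = {
    val fm = new SimpleDateFormat(Pattern, Locale.ENGLISH)
    fm.setTimeZone(TimeZone.getTimeZone(zoneId))
    fm.format(new Date(epochMillis))
  }

  def main(args: Array[String]): Unit = {
    val millis = parseMillis("2019-10-14T11:35:41.005-07:00")
    println(millis)                        // 1571078141005
    println(formatIn(millis, "GMT-07:00")) // 2019-10-14T11:35:41.005-07:00
    println(formatIn(millis, "UTC"))       // 2019-10-14T18:35:41.005Z
  }
}
```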

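The comparison test, however, showed no difference between converting to dates and handling the strings directly: both produce 966 rows, while the source file actually has 962 rows, which means 2 records are duplicated (each duplicated key appears twice, so it presumably matches 2 × 2 = 4 times instead of 2, which accounts for the 4 extra rows).
I confirmed this by writing a Java program to parse the data: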

1572995047000Cj0KCQiAtf_tBRDtARIsAIbAKe187WBetzRXJ-It8hL053IZKhOiIs2eWazHPBlxLNeUkDsLCjwoZ4waAlHREALw_wcB | count : 4
1573091679000Cj0KCQjwr-_tBRCMARIsAN413WTp0pgFVL16yWU9VowmLmPL9gvuLox0DSBHS0yeCNcNQSJkbnsim84aAlIjEALw_wcB | count : 4
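The Java program itself is not shown in these notes. A comparable check can be sketched in Scala as below, assuming the key is the click time in epoch milliseconds concatenated with the click ID, as the output above suggests. The `Row` type and the click IDs `clickA` and `clickB` are placeholders of mine, and this sketch counts occurrences in the input rows only.

```scala
import java.time.OffsetDateTime

object DuplicateKeyDemo {
  // Hypothetical minimal record: only the two fields that form the key.
  final case class Row(clickId: String, clickTime: String)

  // Key = epoch milliseconds of the click time, followed by the click ID.
  def key(r: Row): String =
    OffsetDateTime.parse(r.clickTime).toInstant.toEpochMilli.toString + r.clickId

  // Returns every key that occurs more than once, with its number of occurrences.
  def duplicates(rows: Seq[Row]): Map[String, Int] =
    rows.groupBy(key).map { case (k, rs) => k -> rs.size }.filter(_._2 > 1)

  def main(args: Array[String]): Unit = {
    val rows = Seq(
      Row("clickA", "2019-11-05T16:04:07.000-07:00"),
      Row("clickB", "2019-11-05T16:04:07.000-07:00"),
      Row("clickA", "2019-11-05T16:04:07.000-07:00") // same key as the first row
    )
    duplicates(rows).foreach { case (k, n) => println(s"$k | count : $n") }
    // prints: 1572995047000clickA | count : 2
  }
}
```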

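I found the duplicate records in the original file (the primary key is exactly the same, but the trailing amount fields differ, and some are 0):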

Line 121:
Google,136,PLA,2,Cj0KCQiAtf_tBRDtARIsAIbAKe187WBetzRXJ-It8hL053IZKhOiIs2eWazHPBlxLNeUkDsLCjwoZ4waAlHREALw_wcB,2156222200820561,1140946,1824172998,1,2019-11-05,2019-11-05T16:04:07.000-07:00,1,101,47.99,48.85382,48.85382,,1,154.3780712
Line 606:
Google,136,PLA,2,Cj0KCQiAtf_tBRDtARIsAIbAKe187WBetzRXJ-It8hL053IZKhOiIs2eWazHPBlxLNeUkDsLCjwoZ4waAlHREALw_wcB,2156222200820561,1140946,1824172998,0,2019-11-05,2019-11-05T16:04:07.000-07:00,1,101,47.99,0.0,0.0,0.0,1,0.0

In memory, the time is already held in its long timestamp form:

2019-11-08 15:35:44 INFO CodeGenerator:54 - Code generated in 11.354684 ms
+-------+-------+-----------+----------------+------+-----------+--------------------+-------------------+----------------------+------+---+------------------+------------------+------------------+------------------+--------------------+----------+
|partner|channel|tgt_site_id|     rotation_id|abc_id|campaign_id|         ck_trans_dt|     conversionName|conversionCurrencyCode|status|cnt|           igmbsum|            gmbsum|           dgmbsum|         ibuyersum|conversion_value_sum| batchDate|
+-------+-------+-----------+----------------+------+-----------+--------------------+-------------------+----------------------+------+---+------------------+------------------+------------------+------------------+--------------------+----------+
|   Bing|    PLA|         71|7091533165446554|  null|  350769209|2019-10-29 00:00:...|offline_conversions|                   EUR|     0|  6|          37.38476| 49.21999999999999|12.041509999999999|1.5713719999999998|  118.13584159999999|2019-10-10|
|   Bing|    PLA|         71|7091533165446554|  null|  350769219|2019-10-27 00:00:...|offline_conversions|                   EUR|     0| 44|2645.4809920000007|       2878.731397|2544.0942840000007|11.541987999999998|   8359.719934719998|2019-10-10|
+-------+-------+-----------+----------------+------+-----------+--------------------+-------------------+----------------------+------+---+------------------+------------------+------------------+------------------+--------------------+----------+
dataFrame.printSchema prints the following:
root
|-- partner: string (nullable = true)
|-- channel: string (nullable = true)
|-- tgt_site_id: integer (nullable = true)
|-- rotation_id: long (nullable = true)
|-- abc_id: string (nullable = true)
|-- campaign_id: integer (nullable = true)
|-- ck_trans_dt: timestamp (nullable = true)
|-- conversionName: string (nullable = true)
|-- conversionCurrencyCode: string (nullable = true)
|-- status: integer (nullable = true)
|-- cnt: long (nullable = false)
|-- igmbsum: double (nullable = true)
|-- gmbsum: double (nullable = true)
|-- dgmbsum: double (nullable = true)
|-- ibuyersum: double (nullable = true)
|-- conversion_value_sum: double (nullable = true)
|-- batchDate: string (nullable = false)
