Using Spark to extract and transform 182 users' GPS trajectories into ES, visualized with Kibana

GeoLife GPS Trajectories

This GPS trajectory dataset comes from the Microsoft Research GeoLife project. It contains the trajectories of 182 users collected from April 2007 to August 2012. Each trajectory is a time-ordered sequence of points, and each point carries latitude, longitude, altitude, and other information. The dataset holds 17,621 trajectories, with a total distance of more than 1.2 million kilometers and a total duration of more than 48,000 hours. It records not only trips between home and the workplace, but also a wide range of outdoor activities such as shopping, sightseeing, hiking, and cycling.
Download: https://www.microsoft.com/en-us/download/details.aspx?id=52367


1. File layout and data format

├── Data
│   ├── 000
│   │   └── Trajectory
│   │       ├── 20081023025304.plt
│   │       ├── 20081024020959.plt
│   │       ├── 20090521231053.plt
│   │       └── 20090705025307.plt
│   ├── 001
│   │   └── Trajectory
│   │       ├── 20081023055305.plt
│   │       ├── 20081023234104.plt

Data format example:

39.906631,116.385564,0,492,40097.5864583333,2009-10-11,14:04:30
39.906554,116.385625,0,492,40097.5865162037,2009-10-11,14:04:35
Lines 1-6 of each file are header lines and can be ignored. The points are described in the following lines, one point per line.
Field 1: Latitude in decimal degrees.
Field 2: Longitude in decimal degrees.
Field 3: All set to 0 for this dataset.
Field 4: Altitude in feet (-777 if not valid).
Field 5: Date - number of days (with fractional part) that have passed since 12/30/1899.
Field 6: Date as a string.
Field 7: Time as a string.
Note that field 5 and fields 6 & 7 represent the same date/time in this dataset; you may use either of them.
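
To check that field 5 really encodes the same instant as fields 6 and 7, you can convert the day serial by hand. A minimal Scala sketch, using only the epoch (12/30/1899) given in the field description above:

import java.time.LocalDate

// Field 5 counts days (with a fractional part) since 1899-12-30.
val serialDays = 40097.5864583333
val epoch = LocalDate.of(1899, 12, 30).atStartOfDay()
val dateTime = epoch.plusSeconds(math.round(serialDays * 86400))
println(dateTime) // 2009-10-11T14:04:30, matching fields 6 and 7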

2. Extract and transform with Spark

2.1 Add the Spark and ES dependencies to Maven
   <dependency>
       <groupId>org.elasticsearch</groupId>
       <artifactId>elasticsearch-spark-20_2.11</artifactId>
       <version>7.2.0</version>
   </dependency>
   <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
   <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-core_2.11</artifactId>
       <version>2.4.3</version>
   </dependency>
   <!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
   <dependency>
       <groupId>org.apache.spark</groupId>
       <artifactId>spark-sql_2.11</artifactId>
       <version>2.4.3</version>
   </dependency>
2.2 Spark program

import org.apache.spark.sql.SparkSession

import scala.collection.mutable.ArrayBuffer
import org.elasticsearch.spark.sql._

object GeoToES {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .config("es.index.auto.create", "true")
      .config("es.nodes", "127.0.0.1")
      .config("es.port", "9200")
      .appName("log")
      .getOrCreate()

    val sc = spark.sparkContext

    // Read each .plt file as a whole (path, content) pair; the path is kept so the
    // user id can be recovered, and each file is flattened into one record per GPS point.
    // Each user's files sit under <userId>/Trajectory/ (see section 1).
    val rdd = sc.wholeTextFiles("./data/geoData/*/Trajectory/*.plt")
    val rdd2 = rdd.map(x => FileAndContext(x._1, x._2))
    val rdd3 = rdd2.flatMap(splitContext)

    val dataFrame = spark.createDataFrame(rdd3)
    dataFrame.show(false)
    // Write to the "person_geo_time_location" index created in section 3.1.
    dataFrame.saveToEs("person_geo_time_location")
    sc.stop()
    spark.close()
  }

  def splitContext(x: FileAndContext): List[PersonRealTimePosition] = {
    val space = " "
    val arr = ArrayBuffer[PersonRealTimePosition]()
    // The path ends with .../<userId>/Trajectory/<timestamp>.plt,
    // so the user id is the third segment from the end.
    val parts = x.file.split("/")
    val personName = parts(parts.length - 3)
    // The first 6 lines of every .plt file are headers (see section 1); skip them.
    val lines = x.context.split("\r\n").drop(6)
    for (line <- lines if line.nonEmpty) {
      //39.99999,116.327396,0,92,39752.4790277778,2008-10-31,11:29:48
      val ss = line.split(",")
      // Elasticsearch expects geo_point arrays in [lon, lat] order.
      arr += PersonRealTimePosition(personName, Array(ss(1).toDouble, ss(0).toDouble), ss(5) + space + ss(6))
    }
    arr.toList
  }

  // Alternative flat schema, kept for reference:
  //case class PersonRealTimePosition(personName: String, latitude: Double, longitude: Double, time: String)
  // "position" holds [lon, lat], the order Elasticsearch expects for geo_point arrays.
  case class PersonRealTimePosition(personName: String, position: Array[Double], time: String)

  case class FileAndContext(file: String, context: String)

}
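
After the job finishes, a quick sanity check is to read the index back through the same connector. A minimal sketch (run it before the spark.close() call, or in a spark-shell started with the same es.* settings; the format name is the connector's data-source identifier):

    // Read the index back as a DataFrame via the ES-Hadoop connector.
    val esDf = spark.read
      .format("org.elasticsearch.spark.sql")
      .load("person_geo_time_location")
    println(esDf.count()) // should equal the number of GPS points written
    esDf.show(5, false)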

3. ES

3.1 Create the index

The index is created up front with an explicit mapping so that position becomes a geo_point and time a date; a dynamically auto-created mapping would not infer these types.
PUT person_geo_time_location
{
  "mappings": {
    "properties": {
      "personName": {
        "type": "keyword"
      },
      "position": {
        "type": "geo_point"
      },
      "time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
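Before running the full job, it can be worth indexing one document by hand to confirm the mapping accepts the exact shapes the Spark job produces; note that a geo_point written as an array is ordered [lon, lat]. A minimal sketch using the sample point from section 1 (the document id mapping-test is arbitrary):

PUT person_geo_time_location/_doc/mapping-test
{
  "personName": "000",
  "position": [116.385564, 39.906631],
  "time": "2009-10-11 14:04:30"
}
// Remove the test document afterwards
DELETE person_geo_time_location/_doc/mapping-test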
// Inspect the index
GET person_geo_time_location/_search
// Delete the index
DELETE person_geo_time_location
3.2 Run the Spark program

With the mapping in place, run GeoToES from section 2 (from the IDE, or via spark-submit with the packaged jar); the connector bulk-indexes every parsed point into person_geo_time_location.

4. Kibana visualization

In Kibana, create an index pattern for person_geo_time_location (with time as the time field) and plot the position field on a Coordinate Map, filtering by personName to follow a single user.

[Figure: Kibana map visualization of the indexed trajectory points]
