GeoLife GPS Trajectories
This GPS trajectory dataset comes from the Microsoft Research GeoLife project. It contains trajectories of 182 users collected from April 2007 to August 2012. Each trajectory is a time-ordered sequence of points, and each point carries latitude, longitude, altitude, and other information. The dataset holds 17,621 trajectories, covering a total distance of over 1.2 million kilometers and a total duration of more than 48,000 hours. Beyond movements between home and the workplace, it also records a wide range of outdoor activities such as shopping, sightseeing, hiking, and cycling.
Download: https://www.microsoft.com/en-us/download/details.aspx?id=52367
1. File and Data Structure
├── Data
│ ├── 000
│ │ └── Trajectory
│ │ ├── 20081023025304.plt
│ │ ├── 20081024020959.plt
│ │ ├── 20090521231053.plt
│ │ └── 20090705025307.plt
│ ├── 001
│ │ └── Trajectory
│ │ ├── 20081023055305.plt
│ │ ├── 20081023234104.plt
Data format example:
39.906631,116.385564,0,492,40097.5864583333,2009-10-11,14:04:30
39.906554,116.385625,0,492,40097.5865162037,2009-10-11,14:04:35
Lines 1-6 of each .plt file are header lines and can be ignored. Each subsequent line describes one point:
Field 1: Latitude in decimal degrees.
Field 2: Longitude in decimal degrees.
Field 3: All set to 0 for this dataset.
Field 4: Altitude in feet (-777 if not valid).
Field 5: Date as the number of days (with a fractional part) that have passed since 12/30/1899.
Field 6: Date as a string.
Field 7: Time as a string.
Note that field 5 and fields 6-7 represent the same date/time in this dataset. You may use either representation.
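To make the relationship between field 5 and fields 6-7 concrete, here is a minimal plain-Scala sketch (no Spark required) that reconstructs the timestamp from the serial-day value and compares it against the string fields. The object and function names are illustrative, not part of the dataset tooling:

```scala
import java.time.{LocalDate, LocalDateTime}
import java.time.format.DateTimeFormatter

object PltFieldDemo {
  // Field 5 counts days (with a fractional part) since 1899-12-30.
  private val Epoch = LocalDate.of(1899, 12, 30)

  // Convert the serial-day value into a LocalDateTime.
  def serialToDateTime(serial: Double): LocalDateTime = {
    val days = serial.toLong
    val secondsOfDay = math.round((serial - days) * 86400)
    Epoch.plusDays(days).atStartOfDay.plusSeconds(secondsOfDay)
  }

  def main(args: Array[String]): Unit = {
    val line = "39.906631,116.385564,0,492,40097.5864583333,2009-10-11,14:04:30"
    val f = line.split(",")
    val fromSerial  = serialToDateTime(f(4).toDouble)
    val fromStrings = LocalDateTime.parse(
      f(5) + " " + f(6), DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss"))
    // Both representations decode to the same instant.
    println(fromSerial == fromStrings)
  }
}
```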
2. Extract and Transform with Spark
2.1 Add the Spark and Elasticsearch dependencies in Maven
<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-20_2.11</artifactId>
    <version>7.2.0</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-core -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-core_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
<!-- https://mvnrepository.com/artifact/org.apache.spark/spark-sql -->
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.4.3</version>
</dependency>
import org.apache.spark.sql.SparkSession
import scala.collection.mutable.ArrayBuffer
import org.elasticsearch.spark.sql._

object GeoToES {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .config("es.index.auto.create", "true")
      .config("es.nodes", "127.0.0.1")
      .config("es.port", "9200")
      .appName("log")
      .getOrCreate()
    val sc = spark.sparkContext
    // wholeTextFiles yields (path, fileContent) pairs, one per .plt file
    val rdd = sc.wholeTextFiles("./data/geoData/*/*.plt")
    val rdd2 = rdd.map(x => FileAndContext(x._1, x._2))
    val rdd3 = rdd2.flatMap(splitContext)
    val dataFrame = spark.createDataFrame(rdd3)
    dataFrame.show(false)
    dataFrame.saveToEs("person_geo_time_location")
    sc.stop()
    spark.close()
  }

  def splitContext(x: FileAndContext): List[PersonRealTimePosition] = {
    val space = " "
    val arr = ArrayBuffer[PersonRealTimePosition]()
    // the user id is taken from the file path; the index depends on the
    // directory the job is launched from
    val personName = x.file.split("/")(6)
    // drop the 6 header lines; split on either \r\n or \n line endings
    val lines = x.context.split("\r?\n").drop(6)
    for (line <- lines) {
      // e.g. 39.999999,116.327396,0,92,39752.4790277778,2008-10-31,11:29:48
      val ss = line.split(",")
      if (ss.length == 7) {
        // Elasticsearch expects geo_point arrays in [lon, lat] order
        arr += PersonRealTimePosition(personName,
          Array(ss(1).toDouble, ss(0).toDouble), ss(5) + space + ss(6))
      }
    }
    arr.toList
  }

  //case class PersonRealTimePosition(personName: String, latitude: Double, longitude: Double, time: String)
  case class PersonRealTimePosition(personName: String, position: Array[Double], time: String)
  case class FileAndContext(file: String, context: String)
}
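The fixed index in `x.file.split("/")(6)` above only works when the job is launched from one particular directory. A more portable sketch takes the user id from its position relative to the end of the path; this assumes the `geoData/<userId>/<file>.plt` layout implied by the glob, and the example path is hypothetical:

```scala
object UserIdFromPath {
  // With paths shaped like .../geoData/<userId>/<file>.plt, the user id is
  // always the second-to-last path component, wherever the job is launched.
  def userId(path: String): String = {
    val parts = path.split("/")
    parts(parts.length - 2)
  }

  def main(args: Array[String]): Unit =
    // hypothetical absolute path as produced by wholeTextFiles
    println(userId("file:/home/me/project/data/geoData/000/20081023025304.plt"))
}
```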
3. Elasticsearch
3.1 Create the index
PUT person_geo_time_location
{
  "mappings": {
    "properties": {
      "personName": {
        "type": "keyword"
      },
      "position": {
        "type": "geo_point"
      },
      "time": {
        "type": "date",
        "format": "yyyy-MM-dd HH:mm:ss"
      }
    }
  }
}
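With the geo_point and date mappings in place, indexed positions can be filtered both spatially and temporally. A sample query in the same console style; the center point, radius, and time window are illustrative values taken from the example data above:

```
GET person_geo_time_location/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "geo_distance": {
            "distance": "1km",
            "position": [116.3855, 39.9066]
          }
        },
        {
          "range": {
            "time": {
              "gte": "2009-10-11 00:00:00",
              "lte": "2009-10-11 23:59:59"
            }
          }
        }
      ]
    }
  }
}
```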
// Search the index
GET person_geo_time_location/_search
// Delete the index
DELETE person_geo_time_location