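Scenario:
1. The data is scraped from Meituan and Anjuke by a Python crawler, cleaned by a Java ETL job, and bulk-inserted into MySQL. As the data grew to tens or even hundreds of millions of rows, MySQL queries became slow, and some features the business needs, such as word segmentation (tokenized search), are not supported.
2. Evaluated Hive and found its GIS support weak. Hive cannot declare a column of type geometry when creating a table, and GIS functions are not supported officially. They can still be used by extending Hive with UDFs; see https://blog.csdn.net/qq_32252917/article/details/105378848
3. Evaluated ElasticSearch and chose version 7.0.0
Environment setup: install Logstash, ElasticSearch, and Kibana
Reference:
Installing logstash and logstash-input-jdbc
Findings: ElasticSearch supports geo-location queries and tokenized full-text queries, offers a SQL solution, and integrates with Spring Boot, which fits our needs well.
Sample of the cleaned data: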
1 1070934115905789 張亮麻辣燙 30.147708 120.078346 POINT (120.078346 30.147708) 杭州市轉塘街道金街美地商業中心3-133 4816 4.4000001 8 20 20 快餐小喫 轉塘 10:00-23:59 330100 杭州市 2020-03-27 15:16:37.0
2 1088436107667681 石小吞 30.139899 120.072293 POINT (120.072293 30.139899) 轉塘街道之江泰景大廈2號樓113室-2 4802 4.5 25 20 24 中式簡餐 首爾印象 10:00-21:00 330100 杭州市 2020-03-27 15:16:37.0
3 969590067570034 蜀匯香麻辣香鍋 30.141992 120.071456 POINT (120.071456 30.141992) 杭州市轉塘街道霞鳴街159號、161號(之江商務中心1號樓商117、118) 4141 3.9000001 5 32 21 麻辣香鍋 首爾印象 09:40-22:30 330100 杭州市 2020-03-27 15:16:37.0
4 879266905368275 JIMU佶慕創意生日蛋糕 30.276493 120.095462 POINT (120.095462 30.276493) 五聯西苑51號103室 3986 4.5999999 0 0 115 生日蛋糕 06:00-21:00 330100 杭州市 2020-03-27 15:16:37.0
5 1007591938247898 暖愛蛙蝦跳 30.150440 120.078756 POINT (120.078756 30.15044) 轉塘鎮美院南街象山國際西面2號樓 3983 4.4000001 8 20 19 中式簡餐 轉塘 10:00-23:00 330100 杭州市 2020-03-27 15:16:37.0
6 953075918354303 杭粥西糊 30.150265 120.078895 POINT (120.078895 30.150265) 浙江省杭州市西湖區轉塘街道美院南街89號2號樓2樓216室 3071 4.69999981 8 0 15 快餐小喫 轉塘 07:00-21:00 330100 杭州市 2020-03-27 15:16:37.0
7 1026945060845891 七號の茶 30.147755 120.079269 POINT (120.079269 30.147755) 轉塘街道金街美地商業中心2號樓118商鋪 3050 4.5999999 55 20 12 奶茶果汁 轉塘 09:45-20:45 330100 杭州市 2020-03-27 15:16:37.0
8 891503267213030 二條輕食 30.143201 120.069550 POINT (120.06955 30.143201) 轉塘街道萬美商務中心5號樓313號 2715 4.69999981 75 18 23 沙拉 10:00-20:00 330100 杭州市 2020-03-27 15:16:37.0
9 885451658268086 韓味購炸雞啤酒屋 30.147508 120.078562 POINT (120.078562 30.147508) 轉塘金街美的商業中心3號樓206室(一點點樓上) 2375 4.5999999 品牌 55 20 30 炸雞炸串 轉塘 00:00-02:00,09:40-21:00 330100 杭州市 2020-03-27 15:16:37.0
10 1010946307684654 鍋sir時尚火鍋外賣 30.299309 120.113058 POINT (120.113058 30.299309) 拱墅區塘萍路157號 2137 4.30000019 品牌 68 30 83 小火鍋 城西銀泰 00:00-03:00,09:30-23:59 330100 杭州市 2020-03-27 15:16:37.0
Syncing MySQL data to ElasticSearch
1. First install ElasticSearch, Logstash, and Kibana (the head plugin is used with ElasticSearch)
2. Create the index in ElasticSearch:
curl -XPUT "http://127.0.0.1:9200/mt"
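Creating it directly with the head plugin also works.
To import geographic coordinates, the gis field must be mapped as geo_point. There are several ways to do this; two of them are:
1) Use a template to map the gis field as a geographic coordinate type (geo_point)
2) Map the gis field as geo_point directly in the Kibana console
I use the second approach here (sending the request with Postman):
POST http://ip:9200/mt/_mapping
{
  "properties": {
    "gis": {
      "type": "geo_point"
    }
  }
}
Only this one special field needs to be mapped explicitly; the remaining fields are mapped dynamically to matching types during import.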
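A geo_point field accepts the same point in several encodings, and the axis order differs between them, which is a common source of mistakes. The sketch below is illustrative only (the helper name geo_point_forms is hypothetical); it prints three equivalent encodings of the first sample shop's coordinates.

```python
import json


def geo_point_forms(lat, lon):
    """Return equivalent encodings of one point for a geo_point field."""
    return {
        "object": {"lat": lat, "lon": lon},  # keys must be named lat and lon
        "string": f"{lat},{lon}",            # string form is "lat,lon"
        "array": [lon, lat],                 # array form is [lon, lat]
    }


forms = geo_point_forms(30.147708, 120.078346)
for name, value in forms.items():
    # Each line is a document fragment that could be indexed into "mt"
    print(name, json.dumps({"gis": value}))
```

The string form is the one produced later by concat_ws(',',latitude,longitude) in jdbc.sql, so latitude has to come first there.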
3. Install the Logstash plugins:
logstash-input-jdbc
logstash-output-elasticsearch
Change to the logstash-7.0.0 home directory:
mkdir templete
cd templete
vi logstash.json
# Logstash template: string fields are analyzed on import into ElasticSearch, using the IK analyzer plugin
{
  "index_patterns": ["*"],
  "order": 0,
  "version": 1,
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "date_detection": true,
    "numeric_detection": true,
    "dynamic_templates": [
      {
        "string_fields": {
          "match": "*",
          "match_mapping_type": "string",
          "mapping": {
            "type": "text",
            "norms": false,
            "analyzer": "ik_max_word",
            "fields": {
              "keyword": {
                "type": "keyword"
              }
            }
          }
        }
      }
    ]
  }
}
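The steps here do not show how logstash.json reaches ElasticSearch. One option, sketched below under assumptions, is to register it through the index template REST endpoint (PUT _template/<name>). The template name logstash_ik and the helper build_template_request are hypothetical, and the template body is abridged. Index templates only apply to indices created after the template is registered, and the IK analyzer plugin must already be installed in ElasticSearch.

```python
import json
import urllib.request


def build_template_request(base_url, name, template):
    """Build (but do not send) a PUT request that registers an index template."""
    return urllib.request.Request(
        url=f"{base_url}/_template/{name}",
        data=json.dumps(template).encode("utf-8"),
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Abridged copy of logstash.json; in practice load the file with json.load()
template = {
    "index_patterns": ["*"],
    "mappings": {
        "dynamic_templates": [
            {
                "string_fields": {
                    "match": "*",
                    "match_mapping_type": "string",
                    "mapping": {"type": "text", "analyzer": "ik_max_word"},
                }
            }
        ]
    },
}

req = build_template_request("http://127.0.0.1:9200", "logstash_ik", template)
print(req.get_method(), req.full_url)
# To send it against a running cluster: urllib.request.urlopen(req)
```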
cd ../bin
mkdir config
cd config
#====================================================
vi jdbc.conf
input {
  jdbc {
    jdbc_connection_string => "jdbc:mysql://xxx:3306/xxx?characterEncoding=UTF-8&useSSL=false&autoReconnect=true"
    jdbc_user => "xxx"
    jdbc_password => "xxx"
    jdbc_driver_library => "/data/app/mysql-connector-java-5.1.46.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_paging_enabled => "true"
    jdbc_page_size => "50000"
    jdbc_default_timezone => "Asia/Shanghai"
    statement_filepath => "/data/app/logstash-7.0.0/bin/config/jdbc.sql"
    schedule => "* * * * *"
    type => "jdbc"
  }
}
filter {
  # Cast the two coordinate columns returned by the SQL to float
  mutate {
    convert => {
      "longitude" => "float"
      "latitude" => "float"
    }
  }
  # Merge the two coordinate fields into one field. Note: the field names must be lon and lat, otherwise an error is raised.
  # This rename only acts on events that carry fields named lon and lat. jdbc.sql returns longitude and latitude,
  # so here the gis value comes from concat_ws in jdbc.sql as a "lat,lon" string.
  mutate {
    rename => {
      "lon" => "[gis][longitude]"
      "lat" => "[gis][latitude]"
    }
  }
}
# ElasticSearch 7.x allows only one type per index
output {
  elasticsearch {
    hosts => "localhost:9200"
    index => "mt"
    document_type => "_doc"
    document_id => "%{id}"
  }
}
#====================================================
vi jdbc.sql
select id,poi_id, shop_name,latitude,longitude,concat_ws(',',latitude,longitude) as gis,address, month_sales, score, type_icon,ship_fee,min_price,average_price,third_category,trade_area,ship_time,city_code,city_name,crawl_time,tag from id_mt_shoplist_test
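With schedule => "* * * * *" the statement runs every minute and reads the whole table again each time. Because document_id is set to "%{id}", a row that is read again overwrites its existing document instead of creating a duplicate. The sketch below simulates that behaviour with a plain dict standing in for the index; sync_once is a hypothetical helper, not part of Logstash.

```python
def sync_once(rows, index):
    """Index every row under its id; an existing id is overwritten, not duplicated."""
    for row in rows:
        doc = dict(row)
        # Same shape as concat_ws(',', latitude, longitude): a "lat,lon" string
        doc["gis"] = f'{row["latitude"]},{row["longitude"]}'
        index[str(row["id"])] = doc
    return index


rows = [
    {"id": 1, "shop_name": "張亮麻辣燙", "latitude": 30.147708, "longitude": 120.078346},
    {"id": 2, "shop_name": "石小吞", "latitude": 30.139899, "longitude": 120.072293},
]

index = {}
sync_once(rows, index)  # first scheduled run
sync_once(rows, index)  # second run re-reads the same rows
print(len(index), index["1"]["gis"])
```

Rows deleted in MySQL are not removed from the index by this scheme; only inserts and updates are carried over.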
Start Logstash (from the logstash-7.0.0 home directory):
bin/logstash -f bin/config/jdbc.conf &
Syncing then starts. Once it finishes, run a query (with Postman):
# Find how many shops lie within 1 km
GET http://ip:9200/mt/_search
{
  "query": {
    "bool": {
      "must": {
        "match_all": {}
      },
      "filter": {
        "geo_distance": {
          "distance": "1km",
          "gis": {
            "lat": 31.299600,
            "lon": 121.156099
          }
        }
      }
    }
  }
}
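The query above can also be built programmatically, and a returned shop can be checked by hand with the haversine formula. This is a minimal sketch: geo_distance_query and haversine_km are hypothetical helpers, and a spherical Earth radius of 6371 km is assumed, so values may differ slightly from what ElasticSearch computes. The shop coordinates are those of one returned document (id 27924).

```python
import json
import math


def geo_distance_query(lat, lon, distance="1km", field="gis"):
    """Build a bool query that keeps documents within `distance` of a point."""
    return {
        "query": {
            "bool": {
                "must": {"match_all": {}},
                "filter": {
                    "geo_distance": {
                        "distance": distance,
                        field: {"lat": lat, "lon": lon},
                    }
                },
            }
        }
    }


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between two points on a sphere, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


center = (31.299600, 121.156099)
shop = (31.290725, 121.155756)

body = geo_distance_query(*center)
print(json.dumps(body))

d = haversine_km(*center, *shop)
print(f"{d:.3f} km, inside 1 km: {d <= 1.0}")
```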
Result:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"skipped": 0,
"failed": 0
},
"hits": {
"total": {
"value": 96,
"relation": "eq"
},
"max_score": 1.0,
"hits": [
{
"_index": "mt",
"_type": "_doc",
"_id": "27924",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.290725,00121.155756",
"address": "安亭鎮澤普路600號",
"third_category": "小龍蝦",
"type": "jdbc",
"latitude": 31.290725,
"trade_area": "新源路",
"average_price": 85.0,
"longitude": 121.155756,
"city_name": "蘇州市",
"poi_id": "1003790892242891",
"id": 27924,
"@version": "1",
"score": 4.300000190734863,
"ship_time": "00:00-01:00,09:00-23:59",
"month_sales": 155,
"min_price": 300.0,
"@timestamp": "2020-04-08T01:53:50.402Z",
"ship_fee": 48.0,
"shop_name": "盱眙兄弟龍蝦",
"crawl_time": "2020-04-07T10:47:00.000Z",
"city_code": "320500",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "28332",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.290918,00121.156806",
"address": "澤浦路599號",
"third_category": "小龍蝦",
"type": "jdbc",
"latitude": 31.290918,
"trade_area": "新源路",
"average_price": 248.0,
"longitude": 121.156806,
"city_name": "蘇州市",
"poi_id": "921039757353389",
"id": 28332,
"@version": "1",
"score": 0.0,
"ship_time": "00:00-01:00,08:50-23:59",
"month_sales": 70,
"min_price": 300.0,
"@timestamp": "2020-04-08T01:53:50.470Z",
"ship_fee": 48.0,
"shop_name": "辣首龍蝦",
"crawl_time": "2020-04-07T10:47:12.000Z",
"city_code": "320500",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "78062",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.300276,00121.153649",
"address": "安亭鎮新源路796號1層",
"third_category": "地方小喫",
"type": "jdbc",
"latitude": 31.300276,
"trade_area": "新源路",
"average_price": 23.0,
"longitude": 121.153649,
"city_name": "上海市",
"poi_id": "1003872496642121",
"id": 78062,
"@version": "1",
"score": 4.400000095367432,
"ship_time": "07:00-21:35",
"month_sales": 238,
"min_price": 20.0,
"@timestamp": "2020-04-08T01:54:02.687Z",
"ship_fee": 2.0,
"shop_name": "安亭老街湯糰",
"crawl_time": "2020-04-07T11:12:15.000Z",
"city_code": "310100",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "78066",
"_score": 1.0,
"_source": {
"type_icon": "品牌",
"gis": "31.293330,00121.163573",
"address": "上海市嘉定區安亭鎮墨玉路73-75號4幢1層101室",
"third_category": "生日蛋糕",
"type": "jdbc",
"latitude": 31.29333,
"trade_area": "安亭",
"average_price": 226.0,
"longitude": 121.163573,
"city_name": "上海市",
"poi_id": "1012702949403599",
"id": 78066,
"@version": "1",
"score": 4.300000190734863,
"ship_time": "08:00-18:00",
"month_sales": 192,
"min_price": 100.0,
"@timestamp": "2020-04-08T01:54:02.687Z",
"ship_fee": 0.0,
"shop_name": "GANSO元祖蛋糕",
"crawl_time": "2020-04-07T11:12:15.000Z",
"city_code": "310100",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "78067",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.298283,00121.157455",
"address": "安亭鎮阜康路199弄213號1-1號(安亭幼兒園對面)",
"third_category": "奶茶果汁",
"type": "jdbc",
"latitude": 31.298283,
"trade_area": "安亭",
"average_price": 17.0,
"longitude": 121.157455,
"city_name": "上海市",
"poi_id": "1084059536052905",
"id": 78067,
"@version": "1",
"score": 5.0,
"ship_time": "00:00-10:00,10:00-23:59",
"month_sales": 176,
"min_price": 85.0,
"@timestamp": "2020-04-08T01:54:02.687Z",
"ship_fee": 20.0,
"shop_name": "MaxSee熱麥喜",
"crawl_time": "2020-04-07T11:12:15.000Z",
"city_code": "310100",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "78070",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.291620,00121.158791",
"address": "新源路198號104室",
"third_category": "湯類",
"type": "jdbc",
"latitude": 31.29162,
"trade_area": "新源路",
"average_price": 28.0,
"longitude": 121.158791,
"city_name": "上海市",
"poi_id": "882432296311281",
"id": 78070,
"@version": "1",
"score": 4.099999904632568,
"ship_time": "00:00-04:00,04:00-23:58",
"month_sales": 161,
"min_price": 30.0,
"@timestamp": "2020-04-08T01:54:02.687Z",
"ship_fee": 5.0,
"shop_name": "馬氏古法牛肉湯",
"crawl_time": "2020-04-07T11:12:16.000Z",
"city_code": "310100",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "79067",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.306135,00121.155105",
"address": "安亭鎮民豐路950號",
"third_category": "麪館",
"type": "jdbc",
"latitude": 31.306135,
"trade_area": "安亭",
"average_price": 27.0,
"longitude": 121.155105,
"city_name": "上海市",
"poi_id": "890824662441434",
"id": 79067,
"@version": "1",
"score": 0.0,
"ship_time": "00:00-01:45,09:30-23:59",
"month_sales": 36,
"min_price": 20.0,
"@timestamp": "2020-04-08T01:54:02.822Z",
"ship_fee": 2.0,
"shop_name": "新柴浜麪館",
"crawl_time": "2020-04-07T11:12:47.000Z",
"city_code": "310100",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "79068",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.305605,00121.152790",
"address": "安亭鎮民豐路999號蘭塘菜市場內8號",
"third_category": "滷味熟食",
"type": "jdbc",
"latitude": 31.305605,
"trade_area": "新源路",
"average_price": 60.0,
"longitude": 121.15279,
"city_name": "上海市",
"poi_id": "896953580776452",
"id": 79068,
"@version": "1",
"score": 0.0,
"ship_time": "06:30-20:35",
"month_sales": 36,
"min_price": 20.0,
"@timestamp": "2020-04-08T01:54:02.822Z",
"ship_fee": 1.0,
"shop_name": "南京鹽水鴨夫妻肺片",
"crawl_time": "2020-04-07T11:12:47.000Z",
"city_code": "310100",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "79075",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.293785,00121.156816",
"address": "安亭鎮新源路274號1層",
"third_category": "",
"type": "jdbc",
"latitude": 31.293785,
"trade_area": "安亭",
"average_price": 0.0,
"longitude": 121.156816,
"city_name": "上海市",
"poi_id": "974713963614325",
"id": 79075,
"@version": "1",
"score": 0.0,
"ship_time": "00:00-23:59",
"month_sales": 26,
"min_price": 0.0,
"@timestamp": "2020-04-08T01:54:02.822Z",
"ship_fee": 0.0,
"shop_name": "愛尚花藝慶典",
"crawl_time": "2020-04-07T11:12:47.000Z",
"city_code": "310100",
"tag": ""
}
},
{
"_index": "mt",
"_type": "_doc",
"_id": "78856",
"_score": 1.0,
"_source": {
"type_icon": "",
"gis": "31.298157,00121.155791",
"address": "安亭鎮阜康西路269號1層",
"third_category": "麪包/小蛋糕",
"type": "jdbc",
"latitude": 31.298157,
"trade_area": "新源路",
"average_price": 38.0,
"longitude": 121.155791,
"city_name": "上海市",
"poi_id": "931498002711234",
"id": 78856,
"@version": "1",
"score": 5.0,
"ship_time": "07:30-21:00",
"month_sales": 75,
"min_price": 15.0,
"@timestamp": "2020-04-08T01:54:02.798Z",
"ship_fee": 2.0,
"shop_name": "緹小貝麪包坊",
"crawl_time": "2020-04-07T11:12:40.000Z",
"city_code": "310100",
"tag": ""
}
}
]
}
}
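The response reports 96 matching shops in hits.total.value but lists only 10 documents, because a search returns 10 hits by default unless size is set. The sketch below shows one way to read such a response; the response text is abridged to two of the hits above, and summarize is a hypothetical helper.

```python
import json

# Abridged copy of the response above (two hits, a few fields each)
response_text = """
{
  "hits": {
    "total": {"value": 96, "relation": "eq"},
    "hits": [
      {"_id": "27924", "_source": {"shop_name": "盱眙兄弟龍蝦", "gis": "31.290725,00121.155756"}},
      {"_id": "78062", "_source": {"shop_name": "安亭老街湯糰", "gis": "31.300276,00121.153649"}}
    ]
  }
}
"""


def summarize(response):
    """Return the total match count and (id, name, lat, lon) for each listed hit."""
    total = response["hits"]["total"]["value"]
    shops = []
    for hit in response["hits"]["hits"]:
        source = hit["_source"]
        # gis is stored as a "lat,lon" string; float() tolerates the zero padding
        lat, lon = (float(part) for part in source["gis"].split(","))
        shops.append((hit["_id"], source["shop_name"], lat, lon))
    return total, shops


total, shops = summarize(json.loads(response_text))
print(total, "shops match;", len(shops), "listed here")
for shop in shops:
    print(shop)
```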