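Multi-collection joins in MongoDB
How do you join collections that each hold on the order of ten million documents?
1. Loop over the current collection and write the matching documents from the other collections into it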
from pymongo import MongoClient

client = MongoClient("mongodb://192.168.123.64:27017/")
db = client["gd_raw_data"]
temp = db["temp"]

# Collections whose matching document is embedded into each temp document.
related_names = [
    "prplregistex", "repairfee", "prplcitemcar", "lossthirdparty_lossmain",
    "lossthirdparty", "lossmain", "citemkind", "check",
]

# no_cursor_timeout stops the server from closing the long-running cursor.
cursor = temp.find({}, no_cursor_timeout=True)
try:
    for doc in cursor:
        registno = doc["registno"]
        print("Report number: {}".format(registno))
        # One find_one per related collection: eight round trips per document.
        fields = {
            name + "_info": db[name].find_one({"registno": registno})
            for name in related_names
        }
        temp.update_one({"registno": registno}, {"$set": fields})
finally:
    # A cursor opened with no_cursor_timeout must be closed explicitly.
    cursor.close()
    client.close()
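On my PC (6th-generation i7), joining 17 million documents across these collections this way takes about 125 hours, a little over five days and nights, and the server tends to hang partway through.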
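2. Optimizations
Either use multiple threads or go distributed.
2.1. MongoDB's $lookup, i.e. the aggregation framework
Before running it, be sure to create an index on the join field in every collection involved.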
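As one possible reading of the multithreading option, the sketch below splits the report numbers into chunks and hands each chunk to a worker thread. The names `chunked`, `join_in_threads`, and the `join_chunk` callback are hypothetical and not part of the original script. In practice `join_chunk` would run the find_one/update_one loop from section 1 for its share of report numbers; a single pymongo MongoClient is thread-safe and can be shared by the workers.

```python
from concurrent.futures import ThreadPoolExecutor


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def join_in_threads(registnos, join_chunk, chunk_size=1000, workers=8):
    """Run join_chunk(chunk) on each chunk in a thread pool.

    join_chunk is a hypothetical callback that joins one chunk of report
    numbers and returns how many documents it updated.
    """
    chunks = chunked(list(registnos), chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps chunk order and re-raises any worker exception.
        return sum(pool.map(join_chunk, chunks))
```

Because the work is dominated by network round trips rather than CPU, threads can overlap the waiting time, but the MongoDB server itself remains the shared bottleneck.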
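The index step can be sketched in Python as follows, assuming a pymongo `Database` handle is passed in as `db`. `ensure_join_indexes` is a hypothetical helper name; `create_index` is the pymongo `Collection` method, and calling it again for an index that already exists with the same definition is harmless.

```python
def ensure_join_indexes(db, collection_names, field="registno"):
    """Create an ascending index on `field` in every listed collection.

    `db` is assumed to be a pymongo Database (any object supporting
    db[name].create_index(...) works). Returns {collection: index name}.
    """
    index_names = {}
    for name in collection_names:
        # A list of (field, direction) pairs; 1 means ascending.
        index_names[name] = db[name].create_index([(field, 1)])
    return index_names
```

For the aggregation in this section, the call would list the main collection and every `from` collection, for example `ensure_join_indexes(db, ["prplcmain", "lida", "prpldriver", "prplinjured", "prplinsured", "regist"])`.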
db.getCollection("prplcmain").aggregate(
    [
        { "$lookup": { "from": "lida", "localField": "registno", "foreignField": "registno", "as": "carinfo" } },
        { "$lookup": { "from": "prpldriver", "localField": "registno", "foreignField": "registno", "as": "prpldriver" } },
        { "$lookup": { "from": "prplinjured", "localField": "registno", "foreignField": "registno", "as": "prplinjured" } },
        { "$lookup": { "from": "prplinsured", "localField": "registno", "foreignField": "registno", "as": "prplinsured" } },
        { "$lookup": { "from": "regist", "localField": "registno", "foreignField": "registno", "as": "regist" } },
        { "$out": "total" }
    ],
    { "allowDiskUse": true }
);
On the same hardware, this finishes in under 2 hours.
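To run the same aggregation from Python instead of the mongo shell, the pipeline can be built as plain dictionaries. This is a minimal sketch: `build_join_pipeline` is a hypothetical helper, and the commented usage line assumes a pymongo database handle named `db`.

```python
def build_join_pipeline(joins, key="registno", out="total"):
    """Build one $lookup stage per (collection, output_field) pair, then $out."""
    pipeline = [
        {
            "$lookup": {
                "from": collection,
                "localField": key,
                "foreignField": key,
                "as": output_field,
            }
        }
        for collection, output_field in joins
    ]
    # $out must be the last stage; it writes the joined documents to `out`.
    pipeline.append({"$out": out})
    return pipeline


# The same joins as the shell command above.
pipeline = build_join_pipeline([
    ("lida", "carinfo"),
    ("prpldriver", "prpldriver"),
    ("prplinjured", "prplinjured"),
    ("prplinsured", "prplinsured"),
    ("regist", "regist"),
])

# Assumed usage with pymongo:
# db["prplcmain"].aggregate(pipeline, allowDiskUse=True)
```

Keeping the join list as data makes it easy to add or drop a collection without editing five nearly identical stages by hand.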
2.2. Distributed multi-collection join with mapReduce
I have not fully worked this one out yet.
https://stackoverflow.com/questions/38882184/join-two-collections-with-mapreduce-in-mongodb
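The general idea behind a map-reduce join can be illustrated in plain Python, without MongoDB: the map step emits every document keyed by `registno`, and the reduce step merges all values that share a key. This is only an in-memory illustration of the concept and does not call MongoDB's mapReduce command; `map_docs`, `reduce_by_key`, and the toy data are made up for the example.

```python
from collections import defaultdict


def map_docs(source, docs, key="registno"):
    """Map step: emit (key, {source: doc}) for every document."""
    for doc in docs:
        yield doc[key], {source: doc}


def reduce_by_key(pairs):
    """Reduce step: merge all values emitted under the same key.

    If one source emits several documents for a key, the last one wins.
    """
    merged = defaultdict(dict)
    for key, value in pairs:
        merged[key].update(value)
    return dict(merged)


# Toy data standing in for two collections (invented for illustration).
main = [{"registno": "R1", "amount": 10}, {"registno": "R2", "amount": 20}]
drivers = [{"registno": "R1", "driver": "A"}]

pairs = list(map_docs("prplcmain", main)) + list(map_docs("prpldriver", drivers))
joined = reduce_by_key(pairs)
```

Here `joined["R1"]` holds both the `prplcmain` and `prpldriver` documents, while `joined["R2"]` holds only the `prplcmain` one, which mirrors a left join on `registno`.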