Implementing multi-table joins in MongoDB

How do you join tables (collections) with tens of millions of records?

1. Iterate over the other tables and insert the results into the current table

from pymongo import MongoClient

client = MongoClient("mongodb://192.168.123.64:27017/")
temp = client["gd_raw_data"]["temp"]
prplregistex = client["gd_raw_data"]["prplregistex"]
repairfee = client["gd_raw_data"]["repairfee"]
prplcitemcar = client["gd_raw_data"]["prplcitemcar"]
lossthirdparty_lossmain = client["gd_raw_data"]["lossthirdparty_lossmain"]
lossthirdparty = client["gd_raw_data"]["lossthirdparty"]
lossmain = client["gd_raw_data"]["lossmain"]
citemkind = client["gd_raw_data"]["citemkind"]
check = client["gd_raw_data"]["check"]

query = {}
# no_cursor_timeout keeps the server-side cursor alive during the long scan,
# so it has to be closed explicitly once the loop is done.
cursor = temp.find(query, no_cursor_timeout=True)
try:
    for doc in cursor:
        registno = doc['registno']
        print("Claim report number: {}".format(registno))
        # find_one returns a single document (or None) from each related table.
        prplregistex_info = prplregistex.find_one({"registno": registno})
        repairfee_info = repairfee.find_one({"registno": registno})
        prplcitemcar_info = prplcitemcar.find_one({"registno": registno})
        lossthirdparty_lossmain_info = lossthirdparty_lossmain.find_one({"registno": registno})
        lossthirdparty_info = lossthirdparty.find_one({"registno": registno})
        lossmain_info = lossmain.find_one({"registno": registno})
        citemkind_info = citemkind.find_one({"registno": registno})
        check_info = check.find_one({"registno": registno})

        # Embed the matched documents into the current record in temp.
        newvalues = {"$set": {"prplregistex_info": prplregistex_info, "repairfee_info": repairfee_info,
                              "prplcitemcar_info": prplcitemcar_info,
                              "lossthirdparty_lossmain_info": lossthirdparty_lossmain_info,
                              "lossthirdparty_info": lossthirdparty_info,
                              "lossmain_info": lossmain_info, "citemkind_info": citemkind_info,
                              "check_info": check_info}}
        temp.update_one({"registno": registno}, newvalues)
finally:
    cursor.close()
    client.close()

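On my PC (6th-generation i7), joining 17 million records across these tables this way takes about 125 hours, a little over 5 days and nights, and the server tends to hang partway through.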

2. Ways to optimize

Either go multithreaded or go distributed.
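As a rough sketch of the multithreaded option, the per-record work from section 1 could be spread over a thread pool. The helper names `enrich_one` and `enrich_all`, the worker count, and the batch size are all assumptions made for illustration; the original post does not show a threaded version. The collection arguments are expected to behave like pymongo collections (`find_one`, `update_one`).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def enrich_one(registno, target, sources):
    """Look up registno in every source collection and embed the results in target."""
    fields = {
        name + "_info": coll.find_one({"registno": registno})
        for name, coll in sources.items()
    }
    target.update_one({"registno": registno}, {"$set": fields})
    return registno


def enrich_all(registnos, target, sources, max_workers=8, batch_size=10000):
    """Process registnos on a thread pool, batch by batch; returns the processed count."""
    done = 0
    it = iter(registnos)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while True:
            # Executor.map submits its whole input at once, so feed it in
            # bounded batches instead of all records in one go.
            batch = list(islice(it, batch_size))
            if not batch:
                break
            done += sum(1 for _ in pool.map(
                lambda r: enrich_one(r, target, sources), batch))
    return done
```

With pymongo, `target` would be the `temp` collection and `sources` a dict such as `{"check": check, "lossmain": lossmain}`; a single `MongoClient` can be shared between threads. Whether this actually helps depends on where the bottleneck is, so treat it as a starting point to measure rather than a guaranteed speedup.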

2.1 MongoDB's $lookup, i.e. the aggregation framework

Before running it, be sure to create an index on the join field in each joined collection.
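A minimal sketch of that indexing step from Python, assuming `db` is a pymongo `Database` handle and the join field is `registno` as in the examples here. The helper name `ensure_join_indexes` is hypothetical.

```python
def ensure_join_indexes(db, collection_names, field="registno"):
    """Create an ascending single-field index on `field` in each named collection.

    Returns a dict mapping collection name to the created index name.
    """
    return {name: db[name].create_index(field) for name in collection_names}
```

For example, `ensure_join_indexes(db, ["lida", "prpldriver", "prplinjured", "prplinsured", "regist"])` would cover the collections that appear as `from` in a pipeline joining on `registno`. Creating an index that already exists is a no-op, so the call is safe to repeat.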

db.getCollection("prplcmain").aggregate(
    [
        {
            "$lookup": {
                "from": "lida",
                "localField": "registno",
                "foreignField": "registno",
                "as": "carinfo"
            }
        },
        {
            "$lookup": {
                "from": "prpldriver",
                "localField": "registno",
                "foreignField": "registno",
                "as": "prpldriver"
            }
        },
        {
            "$lookup": {
                "from": "prplinjured",
                "localField": "registno",
                "foreignField": "registno",
                "as": "prplinjured"
            }
        },
        {
            "$lookup": {
                "from": "prplinsured",
                "localField": "registno",
                "foreignField": "registno",
                "as": "prplinsured"
            }
        },
        {
            "$lookup": {
                "from": "regist",
                "localField": "registno",
                "foreignField": "registno",
                "as": "regist"
            }
        },

        {"$out" : "total"}
    ],
    {
        "allowDiskUse" : true
    }
);

On the same hardware, this finishes in under 2 hours.
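The same pipeline can also be built from Python, which avoids repeating the `$lookup` stage by hand. This is only a sketch: the helper `build_lookup_pipeline` and the `JOINS` list are names introduced here for illustration, mirroring the collections and aliases used in the shell command.

```python
def build_lookup_pipeline(joins, key="registno", out="total"):
    """Build one $lookup stage per (source, alias) pair, followed by a final $out stage."""
    pipeline = [
        {"$lookup": {
            "from": source,
            "localField": key,
            "foreignField": key,
            "as": alias,
        }}
        for source, alias in joins
    ]
    pipeline.append({"$out": out})
    return pipeline


# (source collection, output field) pairs taken from the shell pipeline
JOINS = [
    ("lida", "carinfo"),
    ("prpldriver", "prpldriver"),
    ("prplinjured", "prplinjured"),
    ("prplinsured", "prplinsured"),
    ("regist", "regist"),
]
```

With pymongo it would be run as `db["prplcmain"].aggregate(build_lookup_pipeline(JOINS), allowDiskUse=True)`, where `db` is assumed to be the database holding these collections.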

2.2 MapReduce: distributed multi-table join

I have not fully worked this one out yet. See:
https://stackoverflow.com/questions/38882184/join-two-collections-with-mapreduce-in-mongodb
