Deduplicating a Field in a Large MongoDB Collection


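Scenario

Database
- mongo
Scale
- 3 million documents
- growing by 5,000 to 10,000 per day
Indexes
- all already created
The business needs to deduplicate the tel field of the collection in real time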

Attempt 1 (distinct)

Using mongo distinct fails with an error
- distinct too big, 16mb cap
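
As a rough sanity check on why this fails, the size of the result can be estimated from the BSON array layout. The sketch below assumes 11-character phone numbers (an assumption, the length is not stated here) and uses the 1306650 distinct values reported in Attempt 2; `estimateBsonStringArrayBytes` is a hypothetical helper, not part of any driver.

```php
<?php
// Estimate the BSON size of an array holding $count strings of $strLen bytes each.
// Per element: 1 type byte + array index as a C string (digits + NUL)
//              + 4-byte length prefix + string bytes + trailing NUL.
function estimateBsonStringArrayBytes(int $count, int $strLen): int
{
    $bytes = 4 + 1; // document length prefix + terminating NUL
    for ($i = 0; $i < $count; $i++) {
        $bytes += 1 + strlen((string) $i) + 1 + 4 + $strLen + 1;
    }
    return $bytes;
}

$limit = 16 * 1024 * 1024; // MongoDB's 16 MB document cap
$estimate = estimateBsonStringArrayBytes(1306650, 11); // 11 characters is assumed
printf("estimated %.1f MB, limit %.1f MB\n", $estimate / 1048576, $limit / 1048576);
```

Under these assumptions the estimate lands well above the cap, which is consistent with the error above.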

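Attempt 2 (aggregate)

The first group stage collects the distinct tel values
The second group stage counts those numbers
Extremely slow
- collection of 3 million documents
- 1749988 matched via the index
- 1306650 after deduplication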

$pipeline = [
    [
        '$match' => $where
    ],
    [
        '$group' => [
            '_id' => '$tel'
        ]
    ],
    [
        '$group' => [
            '_id' => 1,
            'count' => ['$sum' => 1]
        ]
    ],
];
// Allow the stages to spill to disk so large data sets do not hit the memory limit
$allowDiskUse = ['allowDiskUse' => true];


$list = MongoCrawlerTels::raw(function ($collection) use ($pipeline, $allowDiskUse) {
    return $collection->aggregate($pipeline, $allowDiskUse);
});
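
The second `$group` stage only counts the groups produced by the first. A possible variation, assuming MongoDB 3.4 or later, replaces it with the `$count` stage. `buildDistinctCountPipeline` and the sample filter are hypothetical names used only for illustration, and this variant has not been measured against the data set above.

```php
<?php
// Build a pipeline that counts the distinct tel values matching $where.
function buildDistinctCountPipeline(array $where): array
{
    return [
        ['$match' => $where],
        ['$group' => ['_id' => '$tel']], // one document per distinct tel
        ['$count' => 'count'],           // number of documents from the previous stage
    ];
}

$pipeline = buildDistinctCountPipeline(['product_id' => 1]); // sample filter, assumed
echo json_encode($pipeline), PHP_EOL;
```

The pipeline would be passed to `aggregate` together with `allowDiskUse`, in the same way as above.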

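Attempt 3 (write to a file)

Fetch the 1749988 records with a mongo cursor and write one number per line to a file
sort target.txt | uniq | wc -l
- sort orders the lines, uniq removes duplicates, wc counts the lines
The fatal problem is the time spent writing the file: more than 17 seconds, so this cannot meet the performance requirement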
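
For reference, the counting step can be driven from PHP as sketched below. It assumes a Unix-like host with `sort` and `wc` on the PATH; `countUniqueLines` is a hypothetical helper, and `sort -u` is used as shorthand for `sort | uniq`.

```php
<?php
// Count the unique lines of a file by delegating to sort and wc.
function countUniqueLines(string $path): int
{
    $cmd = 'sort -u ' . escapeshellarg($path) . ' | wc -l';
    $output = shell_exec($cmd);
    return (int) trim((string) $output);
}

// Small sample file: three lines, two distinct numbers.
$path = tempnam(sys_get_temp_dir(), 'tel');
file_put_contents($path, "13800000001\n13800000002\n13800000001\n");
$unique = countUniqueLines($path);
echo $unique, PHP_EOL;
unlink($path);
```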

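Attempt 4 (Redis Set)

The plan was to use a Redis set (unordered), since SCARD then returns the number of distinct phone numbers directly
Performance problem: writing to Redis alone already took more than 17 seconds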

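Attempt 5 (array)

An array is of course the most convenient option, but memory usage needs to be tested to make sure it does not blow up
Fetching 1.76 million records from mongo with a cursor takes 9.5 seconds, and writing them into the array takes 200-300 ms. This basically meets the requirement, but needs further optimization (next step: look at elasticsearch)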
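
The idea is to use the phone number as the array key, so duplicates overwrite each other and `count()` gives the distinct total. A minimal sketch, with a generator standing in for the mongo cursor; `fakeTelCursor` and `countDistinct` are hypothetical names and the sample numbers are made up.

```php
<?php
// Stand-in for the mongo cursor: yields tel values one at a time.
function fakeTelCursor(): Generator
{
    foreach (['13800000001', '13800000002', '13800000001', '13800000003'] as $tel) {
        yield $tel;
    }
}

// Use each value as an array key; repeated values collapse into one entry.
function countDistinct(iterable $tels): int
{
    $seen = [];
    foreach ($tels as $tel) {
        $seen[$tel] = '';
    }
    return count($seen);
}

$distinct = countDistinct(fakeTelCursor());
echo $distinct, PHP_EOL;
```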

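Test (current memory consumption)

Current worst case: all 3 million documents are fetched and duplicates account for 30%, leaving 2100000 distinct values
Memory used: 128.00398254395 MB, peak allocation: 128.37540435791 MB, current usage: 128.37540435791 MB, inserted: 2100000, elapsed: 0.42975521087646 s, count after dedup: 2100000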


$memory_start  = memory_get_usage();

// Build the phone number list
$start_time = microtime(true);


$arr = [];
//$total = (3000000 + 7000*365) * 0.7;
$total = 3000000* 0.7;
for ($i = 0; $i < $total; $i++) {
    $arr[$i] = '';
}

$end_time = microtime(true);
$memory_end = memory_get_usage();
$memory_now = $memory_end/(1024*1024);
$memory_need = ($memory_end - $memory_start)/(1024*1024);
$memory_get_peak_usage  = memory_get_peak_usage()/(1024*1024);

$msg = 'Memory used: ' . $memory_need . ' MB, peak allocation: ' . $memory_get_peak_usage . ' MB, current usage: ' . $memory_now . ' MB, inserted: ' . $i . ', elapsed: ' . ($end_time - $start_time) . ' s, count after dedup: ' . count($arr);
echo $msg . PHP_EOL;

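Projected memory usage after one year

+ daily growth of 7000
+ duplicates account for 30%

Memory used: 128.00398254395 MB, peak allocation: 128.37540435791 MB, current usage: 128.37540435791 MB, inserted: 3888500, elapsed: 0.71930503845215 s, count after dedup: 3888500
Odd
- an array with 3888500 indexes uses almost exactly the same memory as one with 2100000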
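
One likely explanation, stated here as an assumption about PHP 7.x internals rather than something measured in this post: a PHP array grows its storage in powers of two, and an integer-indexed (packed) array uses 32 bytes per slot. Both 2100000 and 3888500 round up to 4194304 slots, which at 32 bytes each is exactly 128 MB. `estimatePackedArrayBytes` is a hypothetical helper that illustrates the arithmetic.

```php
<?php
// Estimate array storage assuming power-of-two capacity growth
// and a fixed number of bytes per slot (32 is assumed for PHP 7.x).
function estimatePackedArrayBytes(int $elements, int $bytesPerSlot = 32): int
{
    $capacity = 8; // assumed minimum table size
    while ($capacity < $elements) {
        $capacity *= 2;
    }
    return $capacity * $bytesPerSlot;
}

foreach ([2100000, 3888500] as $n) {
    printf("%d elements -> %.2f MB\n", $n, estimatePackedArrayBytes($n) / 1048576);
}
```

If this assumption holds, memory would jump to roughly 256 MB once the array passes 4194304 elements.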


$memory_start  = memory_get_usage();

// Build the phone number list
$start_time = microtime(true);


$arr = [];
$total = (3000000 + 7000*365) * 0.7;
for ($i = 0; $i < $total; $i++) {
    $arr[$i] = '';
}

$end_time = microtime(true);
$memory_end = memory_get_usage();
$memory_now = $memory_end/(1024*1024);
$memory_need = ($memory_end - $memory_start)/(1024*1024);
$memory_get_peak_usage  = memory_get_peak_usage()/(1024*1024);

$msg = 'Memory used: ' . $memory_need . ' MB, peak allocation: ' . $memory_get_peak_usage . ' MB, current usage: ' . $memory_now . ' MB, inserted: ' . $i . ', elapsed: ' . ($end_time - $start_time) . ' s, count after dedup: ' . count($arr);
echo $msg . PHP_EOL;

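Conclusion: holding the numbers in an array will not be a problem for the coming year
- after all, searches rarely cover a time range this large (2 years)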


    /**
     * Count the distinct phone numbers
     * @param array $where
     */
    private function _setUniqueTelNumberForList(array $where)
    {
        // Initialize the property
        $this->setListTelList();

        // Get the mongo cursor
        $option = [
            'projection' => [
                'tel' => 1,
                'apikey' => 1,
                '_id' => 0
            ],
        ];
        $cursor = DB::connection('mongodb_backend')->collection('crawler_tels')->raw(function ($collection) use ($where, $option) {
            return $collection->find($where, $option);
        });
        $memory_start  = memory_get_usage();

        // Build the phone number list
        $start_time = microtime(true);
        $i = 0;
        foreach ($cursor as $item) {
            $i++;
            $this->list_tel_list['all'][$item->tel] = '';
            $this->list_tel_list['list_apikey'][$item->apikey][$item->tel] = '';
        }

        $end_time = microtime(true);
        $memory_end = memory_get_usage();
        $memory_now = $memory_end/(1024*1024);
        $memory_need = ($memory_end - $memory_start)/(1024*1024);
        $memory_get_peak_usage  = memory_get_peak_usage()/(1024*1024);

        $msg = 'Memory used: ' . $memory_need . ' MB, peak allocation: ' . $memory_get_peak_usage . ' MB, current usage: ' . $memory_now . ' MB, ' . $this->product_id . ' inserted: ' . $i . ', elapsed: ' . ($end_time - $start_time) . ' s, count after dedup: ' . count($this->list_tel_list['all'] ?? []);
        $action = 'unqiue';
        $params = request()->post();
        MongoLog::create(compact('msg', 'action', 'params'));
    }
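
Once the loop has run, the distinct counts can be read straight off the nested array. A minimal sketch, assuming the structure built above (the `all` and `list_apikey` keys); the sample values and the `summarizeTelList` helper are hypothetical.

```php
<?php
// Derive the overall and per-apikey distinct counts from the nested tel list.
function summarizeTelList(array $telList): array
{
    return [
        'all' => count($telList['all'] ?? []),
        'by_apikey' => array_map('count', $telList['list_apikey'] ?? []),
    ];
}

// Made-up sample data in the same shape as $this->list_tel_list.
$telList = [
    'all' => ['13800000001' => '', '13800000002' => ''],
    'list_apikey' => [
        'key_a' => ['13800000001' => '', '13800000002' => ''],
        'key_b' => ['13800000001' => ''],
    ],
];
$summary = summarizeTelList($telList);
print_r($summary);
```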

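PHP multithreading

PHP multithreading cannot be used inside a web service
Having several threads share one cursor, and then collecting the results from each thread, is also quite cumbersome
Gave up after trying it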

Elasticsearch

ES vs. mongo: speed of pulling data with a cursor

ES conditions
- size: started at 1000 and increased by 1000 per experiment up to 10000; 2000 took the least time
- query written as a filter


        $params = [
            'index' => $this->index,
            'size' => 2000,
            'scroll' => '30s',
            'body'   => [
                'query' => [
                        'bool' => [
                            'filter' => [
                                [
                                    "range" => ["amount_date" => ["gte" => "20190101"]]
                                ],
                            ]
                        ]
                ]
            ]
        ];
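
The page-size experiment described above can be organized with a small timing harness. This is only a sketch: `timePageSizes` is a hypothetical helper, and the stub callable stands in for a full scroll through the index with the given `size`, which is not reproduced here.

```php
<?php
// Time a fetch callable for each page size; returns elapsed seconds keyed by size.
function timePageSizes(callable $fetchAll, array $sizes): array
{
    $elapsed = [];
    foreach ($sizes as $size) {
        $start = microtime(true);
        $fetchAll($size);
        $elapsed[$size] = microtime(true) - $start;
    }
    return $elapsed;
}

// Stub standing in for scrolling through the whole result set (assumed).
$stubFetch = function (int $size): void {
    usleep(1000);
};

$timings = timePageSizes($stubFetch, range(1000, 10000, 1000));
asort($timings); // fastest first
echo 'fastest size: ', array_keys($timings)[0], PHP_EOL;
```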

Simply paging through the data in elasticsearch is actually slower than mongo

Mongo aggregation

Still to be tried
