Using scroll and bulk to export and re-upload a tens-of-millions-document Elasticsearch index

I recently had an index-migration requirement. Normally this would be done with ES's reindex, but business constraints and other reasons ruled it out, so I completed the migration with scroll and bulk instead. The code is recorded below.

Scroll: export the index to a CSV file

from elasticsearch import Elasticsearch
from elasticsearch import helpers
import json
# 188 -> the production (source) cluster

class Get_es:

    def __init__(self):
        self.es188 = Elasticsearch(hosts="xxx", port=9200, http_auth=('xxx', 'xxx'), timeout=100000)

    def GetValue(self):
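        # stream every document from the source index via the scroll API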

        query = {
            "query": {"match_all": {}}
        }
        scanResp = helpers.scan(client=self.es188, query=query, size=5000, scroll="3m", index='xxx',
                                doc_type='xx',
                                timeout="3m")

        for k in scanResp:
            yield k

    def write_file(self, k):
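        # each output line is "<_id>,<json-encoded _source>"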
        with open('new_es_all.csv', 'ab') as f:
            k = dict(k)
            f.write(k['_id'].encode('utf-8'))
            f.write(b',')
            f.write(json.dumps(k['_source']).encode('utf-8'))
            f.write(b'\n')

    def run(self):
        list1 = self.GetValue()
        for index, k in enumerate(list1, 1):

            self.write_file(k)


if __name__ == "__main__":
    S = Get_es()
    S.run()
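
Not in the original post, but a quick sanity check before re-uploading: compare the number of exported lines against the source index's document count. A minimal sketch, reusing the same placeholder host, credentials and index name as above; one line per document holds because json.dumps escapes embedded newlines.

from elasticsearch import Elasticsearch

es188 = Elasticsearch(hosts="xxx", port=9200, http_auth=('xxx', 'xxx'), timeout=100000)
index_count = es188.count(index='xxx')['count']  # total docs in the source index
with open('new_es_all.csv', 'rb') as f:
    exported = sum(1 for _ in f)                 # one line was written per document
print('docs in source index:', index_count, 'lines exported:', exported)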

Bulk: upload the CSV back into ES

import json
from json import JSONDecodeError
from elasticsearch import helpers
from elasticsearch import Elasticsearch

es = Elasticsearch(host="xxx",
                   port=9200,
                   http_auth=('xx', 'xxx!'),
                   use_ssl=True,
                   verify_certs=False,
                   ssl_assert_hostname=False,
                   timeout=10000
                   )

with open('new_es_all.csv', 'r') as f:
    actions = []
    count = 0
    for line in f:
        count += 1
        # each line is "<_id>,<json-encoded _source>"; the JSON itself can contain commas,
        # so split only on the first one instead of parsing the line as CSV fields
        doc_id, _, payload = line.rstrip('\n').partition(',')
        try:
            source = json.loads(payload)
        except JSONDecodeError as e:
            print(e)
            continue

        source["id"] = doc_id
        action = {
            "_index": "xxx",
            "_type": "_doc",
            "_source": source
        }
        actions.append(action)
        if len(actions) >= 5000:
            # flush this batch of 5000 actions with bulk, then clear the buffer
            helpers.bulk(es, actions)
            actions = []
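    # flush the final partial batch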
    helpers.bulk(es, actions)
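
The script above lets ES assign fresh _id values and only keeps the old id inside _source as "id". If the target index should keep the original document ids, the bulk action can also carry an "_id" field; a minimal drop-in variant for the action dict above (not in the original code):

action = {
    "_index": "xxx",
    "_type": "_doc",
    "_id": doc_id,  # reuse the id exported from the source index
    "_source": source
}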

The throughput is acceptable: roughly 72 million documents per day on average, which is enough for the current requirement.
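
If more throughput is ever needed, helpers.parallel_bulk could replace the manual batching: it indexes chunks from several worker threads and yields an (ok, info) tuple per action. A minimal sketch, assuming the same es client, json import and CSV line format as the upload script; "xxx" is the same placeholder index name.

def gen_actions():
    # stream the same action dicts as above straight from the CSV
    with open('new_es_all.csv', 'r') as f:
        for line in f:
            doc_id, _, payload = line.rstrip('\n').partition(',')
            source = json.loads(payload)
            source["id"] = doc_id
            yield {"_index": "xxx", "_type": "_doc", "_source": source}

# the generator must be consumed for any indexing to happen
for ok, info in helpers.parallel_bulk(es, gen_actions(), thread_count=4, chunk_size=5000):
    if not ok:
        print(info)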
