Bulk loading data into Phoenix with Python

From the official documentation:

Phoenix provides two methods for bulk loading data into Phoenix tables:
· Single-threaded client loading tool for CSV formatted data via the psql command
· MapReduce-based bulk load tool for CSV and JSON formatted data
The psql tool is typically appropriate for tens of megabytes, while the MapReduce-based loader is typically better for larger load volumes.

In short: Phoenix offers two ways to bulk load data. One is the single-threaded psql tool, the other is the MapReduce-based distributed loader. The single-threaded tool suits files of a few tens of megabytes, while the MapReduce loader is better for larger volumes.

Below, both methods are tested for bulk loading data.

1. Preparation

1) Create the Phoenix table (the corresponding HBase table does not exist yet)

CREATE TABLE example (
    my_pk bigint not null,
    m.first_name varchar(50),
    m.last_name varchar(50)
    CONSTRAINT pk PRIMARY KEY (my_pk)
);

2) Create a secondary index

create index example_first_name_index on example(m.first_name);

3) Generate the CSV data file

# -*- coding: UTF-8 -*-
import csv

# On Python 2 you can use file() instead of open(); change mode "w" to "a" to append
with open("test.csv", "w") as csvfile:
    writer = csv.writer(csvfile)

    # Optionally write the column names first
    # writer.writerow(["class_id", "f.student_info"])
    # Use writerows to write many rows at once, writerow to write them one at a time
    # writer.writerows([[1, 'aaaa'], [2, 'bbbbb'], [3, 'ccccccc']])

    list_all = []
    for i in range(0, 3):
        row = []
        row.append(i)
        row.append(str(i) + 'aaaaaa')
        list_all.append(row)

    writer.writerows(list_all)
# No explicit close() is needed: the with block closes the file automatically
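Note that the snippet above targets Python 2 (hence the comment about file() and open()). On Python 3 the csv module expects the file to be opened with newline='', otherwise blank lines can appear between rows on some platforms. A minimal Python 3 sketch of the same generator, keeping the same file name and row contents:

# -*- coding: UTF-8 -*-
# Python 3 variant: open with newline='' so the csv module controls line endings
import csv

with open("test.csv", "w", newline="", encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile)
    rows = [[i, str(i) + 'aaaaaa'] for i in range(3)]
    writer.writerows(rows)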

 

4) Upload the file to HDFS

hadoop fs -rm /kk/kk_test.csv
hadoop fs -put /root/kangkai/kk_test.csv /kk

2. Single-threaded psql method

[root@hdp18 Templates]# /usr/hdp/2.5.3.0-37/phoenix/bin/psql.py -t EXAMPLE hdp14:2181 /root/Templates/data.csv

Notes:

(1) /root/Templates/data.csv is a local file (not an HDFS path).

(2) hdp14:2181 is the ZooKeeper host and port.

(3) The command supports a number of other options, e.g. -t for the table name and -d for the field delimiter, which defaults to a comma.

Verify that the data was written correctly and that the index table was updated in sync.

The results show that the bulk load works and that it automatically keeps the index table up to date. A quick way to double-check is to query both the data table and the index table, for example with phoenixdb as sketched below.
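A minimal verification sketch using the phoenixdb client (the same library used in the complete example in section 4). The Phoenix Query Server URL http://hdp14:8765/ is an assumption here; substitute the address of your own query server:

# -*- coding: UTF-8 -*-
# Minimal verification sketch (assumes a Phoenix Query Server at http://hdp14:8765/)
import phoenixdb

conn = phoenixdb.connect('http://hdp14:8765/', autocommit=True)
try:
    cur = conn.cursor()
    # Data table: rows loaded by psql.py / CsvBulkLoadTool should appear here
    cur.execute("SELECT * FROM EXAMPLE")
    print(cur.fetchall())
    # Index table: the same load should have kept the secondary index in sync
    cur.execute("SELECT * FROM EXAMPLE_FIRST_NAME_INDEX")
    print(cur.fetchall())
finally:
    conn.close()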

 

3. MapReduce bulk load method

[root@hdp14 ~]# hadoop jar /home/hadoop/apache-phoenix-4.14.0-cdh5.11.2-bin/phoenix-4.14.0-cdh5.11.2-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table kk_test --input /kk/kk_test.csv

Notes:

1) The official documentation states that for Phoenix 4.0 and above the tool should be invoked as follows: (HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv)

2) Run the hadoop commands for creating and uploading files as the root user; /tmp/YCB/data.csv is the corresponding file path on HDFS.

3) The command can be run on any machine in the cluster; you can also point it at a specific ZooKeeper quorum by adding -z host:port, e.g. (HADOOP_CLASSPATH=/usr/hdp/2.5.3.0-37/hbase/lib/hbase-protocol.jar:/usr/hdp/2.5.3.0-37/hbase/conf/ hadoop jar /usr/hdp/2.5.3.0-37/phoenix/phoenix-4.7.0.2.5.3.0-37-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -z hdp15:2181 --table EXAMPLE --input /tmp/YCB/data.csv)

Verify the result:

0: jdbc:phoenix:hdp14,hdp15> SELECT * FROM example1_first_name_index;
+---------------+---------+
| M:FIRST_NAME | :MY_PK |
+---------------+---------+
| Joddhn | 12345 |
| Joddhn | 123452 |
| Maryddd | 67890 |
| Maryddd | 678902 |
+---------------+---------+
4 rows selected (0.042 seconds)
0: jdbc:phoenix:hdp14,hdp15> SELECT * FROM example1;
+---------+-------------+---------------+
| MY_PK | FIRST_NAME | LAST_NAME |
+---------+-------------+---------------+
| 12345 | Joddhn | Dois |
| 67890 | Maryddd | Poppssssins |
| 123452 | Joddhn | Dois |
| 678902 | Maryddd | Poppssssins2 |
+---------+-------------+---------------+


This confirms that the index table is populated at the same time the data is loaded.

 

Testing bulk load speed

In a fairly ordinary environment (5 nodes, 64 GB of RAM), bulk loading about 2 million rows took roughly two minutes.

4. Complete example

This implements inserting processed results into Phoenix.

import csv
import os
import phoenixdb
import phoenixdb.cursor

# logger is assumed to be configured elsewhere in this module

    def flush_data(self):
        logger.info('flush teacher start')
        database_url = 'https://10.9.9.9:8765/'
        conn_phoenix = phoenixdb.connect(database_url, autocommit=True)
        cur_phoenix = conn_phoenix.cursor()
        try:
            # cur_ss_gp, cur_user_ol, cur_ss_ods and conn_user_ol are cursors/connections
            # to the source databases, created elsewhere in this class
            self._insert_phoenix_data(conn_phoenix, cur_phoenix, cur_ss_gp, cur_user_ol, cur_ss_ods, conn_user_ol)
        finally:
            cur_phoenix.close()
            conn_phoenix.close()
        logger.info('end flush')


    def _insert_phoenix_class_data(self, conn_phoenix, cur_phoenix):
        class_file_name = "crm_teacher_detail"
        logger.info('insert _class_detail start')
        # Remove any stale local CSV file from a previous run
        command = "rm " + class_file_name + ".csv"
        logger.info(command)
        os.system(command)
        list_row = []
        i_count = 0
        # results, get_value and _BATCH_INSERT_MAX are defined elsewhere in this module
        for i, r in enumerate(results):
            class_id = get_value(r[0], 0)
            res_student_info = get_value(r[1], '')
            student_info = ''
            if not res_student_info == '':
                # Turn "a;b;c" into the string "[{a},{b},{c}]"
                res_student_infos = str(res_student_info).split(';')
                for res in res_student_infos:
                    student_info = student_info + '{' + res + '},'
                student_info = '[' + student_info[:-1] + ']'

            list_col = []
            list_col.append(class_id)
            list_col.append(student_info)
            list_row.append(list_col)

            # Row-by-row alternative:
            # cur_phoenix.execute("UPSERT INTO crm_teacher_class_detail VALUES (?,?)", list_col)

            i_count = i + 1
            if i_count % _BATCH_INSERT_MAX == 0:
                # Flush a full batch to the local CSV file
                with open(class_file_name + ".csv", "a") as csvfile:
                    writer = csv.writer(csvfile)
                    writer.writerows(list_row)

                list_row = []
                print i_count

        if list_row:
            # Flush the remaining partial batch
            with open(class_file_name + ".csv", "a") as csvfile:
                writer = csv.writer(csvfile)
                writer.writerows(list_row)
            print i_count

        # No explicit close() needed: the with blocks above already closed the file

        # Re-upload the CSV to HDFS and run the MapReduce bulk load tool
        command = ("hadoop fs -rm /kk/" + class_file_name + ".csv; "
                   "hadoop fs -put /data/furion_feature_db/" + class_file_name + ".csv /kk; "
                   "hadoop jar /home/hadoop/apache-phoenix-4.14.0-cdh5.11.2-bin/phoenix-4.14.0-cdh5.11.2-client.jar "
                   "org.apache.phoenix.mapreduce.CsvBulkLoadTool --table " + class_file_name +
                   " --input /kk/" + class_file_name + ".csv")
        logger.info(command)
        os.system(command)

        # Attempt to capture the shell output in the log; unfortunately this kept failing
        # (it would also require "import subprocess" at the top of the module)
        # p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
        # out = p.stdout.readlines()
        # for line in out:
        #     logger.info(line.strip())

        logger.info("insert _class_detail finished")

5. Summary

1) Speed:

CSV data can be bulk loaded with built in utility named psql. Typical upsert rates are 20K - 50K rows per second (depends on how wide are the rows).

In other words, bulk loading with psql typically achieves upsert rates of 20K-50K rows per second, depending on row width. The author has only measured this roughly, but the speed is quite good. Source: https://phoenix.apache.org/faq.html

2) Testing confirms that bulk loading automatically updates Phoenix secondary indexes (regardless of whether the HBase table existed beforehand).

3) The import file encoding defaults to UTF-8.

4) The MR loader supports a number of other parameters; see the bulk data loading page linked below for the full list.

5) By default the MR loader updates all index tables of the target table; to update only specific index tables, pass them with the -it parameter. The default field delimiter for the input file is a comma, which can be changed with -d.

6) To bulk load from code, you can have the code write the data to HDFS first, put the MR bulk load command into a shell script, and then invoke that script from the code. (The author has not tried this, but in principle it should work; a minimal sketch follows.)
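For reference, a minimal sketch of that approach using Python's subprocess module. The script path bulk_load.sh and its arguments are hypothetical placeholders; the script itself would contain the hadoop fs -put and hadoop jar ... CsvBulkLoadTool commands shown earlier:

# -*- coding: UTF-8 -*-
# Minimal sketch: call a shell script that wraps the CsvBulkLoadTool command.
# The script path and arguments below are placeholders for illustration.
import subprocess

def run_bulk_load(script="/root/kangkai/bulk_load.sh", table="EXAMPLE", csv_path="/kk/data.csv"):
    cmd = ["bash", script, table, csv_path]
    # check_output raises CalledProcessError if the script exits non-zero,
    # so failures in the hadoop jar step are not silently ignored
    output = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
    for line in output.splitlines():
        print(line)

if __name__ == "__main__":
    run_bulk_load()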

Reference: official documentation

https://phoenix.apache.org/bulk_dataload.html

 

 
