Bulk Loading Data into Phoenix with Python

From the official documentation:

Phoenix provides two methods for bulk loading data into Phoenix tables:
· Single-threaded client loading tool for CSV formatted data via the psql command
· MapReduce-based bulk load tool for CSV and JSON formatted data
The psql tool is typically appropriate for tens of megabytes, while the MapReduce-based loader is typically better for larger load volumes.

In short: Phoenix offers two ways to bulk load data. One is the single-threaded psql tool, the other is a MapReduce-based distributed loader. The single-threaded tool suits files of a few tens of megabytes, while the MapReduce loader is better for larger volumes.

Below, both approaches are tested for bulk loading data.

1. Preparation

1. Create the Phoenix table (the corresponding HBase table does not exist yet)

CREATE TABLE example (
    my_pk bigint not null,
    m.first_name varchar(50),
    m.last_name varchar(50)
    CONSTRAINT pk PRIMARY KEY (my_pk));

2. Create a secondary index

create index example_first_name_index on example(m.first_name);
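These two statements can also be issued from Python through the Phoenix Query Server, the same kind of connection the complete example in section 4 uses. A minimal sketch, assuming phoenixdb is installed and the query server is reachable (the host and port below are placeholders):

import phoenixdb

# placeholder Phoenix Query Server address; adjust host/port for your cluster
conn = phoenixdb.connect('http://hdp14:8765/', autocommit=True)
cur = conn.cursor()

cur.execute("""
    CREATE TABLE IF NOT EXISTS example (
        my_pk bigint not null,
        m.first_name varchar(50),
        m.last_name varchar(50)
        CONSTRAINT pk PRIMARY KEY (my_pk))
""")
cur.execute("CREATE INDEX IF NOT EXISTS example_first_name_index ON example (m.first_name)")

cur.close()
conn.close()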

3. Create the CSV data file

# -*- coding: UTF-8 -*-
import csv

# In Python 2, file() can be used instead of open(); change mode "w" to "a" to append
with open("test.csv", "w") as csvfile:
    writer = csv.writer(csvfile)

    # optionally write the column names first
    # writer.writerow(["class_id", "f.student_info"])
    # writerows writes several rows at once; writerow writes a single row
    # writer.writerows([[1, 'aaaa'], [2, 'bbbbb'], [3, 'ccccccc']])

    list_all = []
    for i in range(0, 3):
        row = []
        row.append(i)
        row.append(str(i) + 'aaaaaa')
        list_all.append(row)

    writer.writerows(list_all)
    # the with-block closes the file automatically, no explicit close() is needed

 

4. Upload to HDFS

hadoop fs -rm /kk/kk_test.csv
hadoop fs -put /root/kangkai/kk_test.csv /kk

2. Single-threaded psql approach

[root@hdp18 Templates]# /usr/hdp/2.5.3.0-37/phoenix/bin/psql.py -t EXAMPLE hdp14:2181 /root/Templates/data.csv

Notes:

(1) /root/Templates/data.csv is a local file.

(2) hdp14:2181 is the ZooKeeper host and port.

(3) The command supports a number of other options, e.g. -t for the table name and -d for the field delimiter (a comma by default).

Verify that the data was written correctly and that the index table was updated in sync.

Querying the data table and the index table shows that the bulk import succeeded and that the index table is updated automatically.
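The same check can be scripted with phoenixdb instead of opening sqlline; a minimal sketch, again with a placeholder query-server address:

import phoenixdb

conn = phoenixdb.connect('http://hdp14:8765/', autocommit=True)  # placeholder address
cur = conn.cursor()

# row count in the data table
cur.execute("SELECT COUNT(*) FROM example")
print(cur.fetchone()[0])

# if the index was kept in sync, the index table holds the same number of rows
cur.execute("SELECT COUNT(*) FROM example_first_name_index")
print(cur.fetchone()[0])

cur.close()
conn.close()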

 

3. MapReduce bulk load approach

[root@hdp14 ~]# hadoop jar /home/hadoop/apache-phoenix-4.14.0-cdh5.11.2-bin/phoenix-4.14.0-cdh5.11.2-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table kk_test --input /kk/kk_test.csv

Notes:

1. The official docs point out that for Phoenix 4.0 and later the tool should be invoked as follows: (HADOOP_CLASSPATH=/path/to/hbase-protocol.jar:/path/to/hbase/conf hadoop jar phoenix-<version>-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table EXAMPLE --input /data/example.csv)

2. Run the hadoop commands (creating directories, uploading files, etc.) as the root user; /tmp/YCB/data.csv is the corresponding file path on HDFS.

3. The command can be run from any machine in the cluster; a specific ZooKeeper quorum can also be targeted by adding -z host:port, e.g. (HADOOP_CLASSPATH=/usr/hdp/2.5.3.0-37/hbase/lib/hbase-protocol.jar:/usr/hdp/2.5.3.0-37/hbase/conf/ hadoop jar /usr/hdp/2.5.3.0-37/phoenix/phoenix-4.7.0.2.5.3.0-37-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool -z hdp15:2181 --table EXAMPLE --input /tmp/YCB/data.csv)

Verification:

0: jdbc:phoenix:hdp14,hdp15> SELECT * FROM example1_first_name_index;
+---------------+----------+
| M:FIRST_NAME  | :MY_PK   |
+---------------+----------+
| Joddhn        | 12345    |
| Joddhn        | 123452   |
| Maryddd       | 67890    |
| Maryddd       | 678902   |
+---------------+----------+
4 rows selected (0.042 seconds)
0: jdbc:phoenix:hdp14,hdp15> SELECT * FROM example1;
+---------+-------------+---------------+
| MY_PK   | FIRST_NAME  | LAST_NAME     |
+---------+-------------+---------------+
| 12345   | Joddhn      | Dois          |
| 67890   | Maryddd     | Poppssssins   |
| 123452  | Joddhn      | Dois          |
| 678902  | Maryddd     | Poppssssins2  |
+---------+-------------+---------------+


This shows that the indexes are populated at the same time the data is imported.

 

Bulk load speed test

On an average cluster (5 nodes, 64 GB of RAM each), bulk loading roughly 2 million rows took about two minutes.
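For a rough reproduction of this timing, the CSV from the preparation step just needs more rows; a sketch of generating about 2 million rows matching the example table's three columns (file name and values are arbitrary):

# -*- coding: UTF-8 -*-
import csv

ROWS = 2000000   # roughly the volume used in the timing above
BATCH = 10000    # flush in batches to keep memory use flat

with open("big_test.csv", "w") as csvfile:
    writer = csv.writer(csvfile)
    batch = []
    for i in range(ROWS):
        batch.append([i, 'first_%d' % i, 'last_%d' % i])
        if len(batch) == BATCH:
            writer.writerows(batch)
            batch = []
    if batch:
        writer.writerows(batch)

The file is then uploaded with hadoop fs -put and loaded with the CsvBulkLoadTool exactly as in section 3.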

4. Complete example

The code below writes processed results into Phoenix.

import csv
import logging
import os

import phoenixdb
import phoenixdb.cursor

logger = logging.getLogger(__name__)
# batch size used when flushing rows to the CSV file; an example value,
# defined at module level in the original project
_BATCH_INSERT_MAX = 10000

    def flush_data(self):
        logger.info('flush teacher start')
        database_url = 'https://10.9.9.9:8765/'  # Phoenix Query Server address
        conn_phoenix = phoenixdb.connect(database_url, autocommit=True)
        cur_phoenix = conn_phoenix.cursor()
        try:
            # the extra cursors/connections point at the upstream data sources and,
            # like _insert_phoenix_data itself, are defined elsewhere in the original
            # class (only _insert_phoenix_class_data is shown below)
            self._insert_phoenix_data(conn_phoenix, cur_phoenix, cur_ss_gp, cur_user_ol, cur_ss_ods, conn_user_ol)
        finally:
            cur_phoenix.close()   # .closed is only an attribute; call close() to release resources
            conn_phoenix.close()
        logger.info('end flush')


    def _insert_phoenix_class_data(self, conn_phoenix, cur_phoenix):
        class_file_name = "crm_teacher_detail"
        logger.info('insert _class_detail start')
        # remove any leftover CSV file from a previous run
        command = "rm " + class_file_name + ".csv"
        logger.info(command)
        os.system(command)
        list_row = []
        i_count = 0
        # 'results' is the upstream query result set and 'get_value' is a small
        # null-handling helper; both are defined elsewhere in the original project
        for i, r in enumerate(results):
            class_id = get_value(r[0], 0)
            res_student_info = get_value(r[1], '')
            student_info = ''
            if not res_student_info == '':
                # turn "a;b;c" into "[{a},{b},{c}]"
                res_student_infos = str(res_student_info).split(';')
                for res in res_student_infos:
                    student_info = student_info + '{' + res + '},'
                student_info = '[' + student_info[:-1] + ']'

            list_col = []
            list_col.append(class_id)
            list_col.append(student_info)
            list_row.append(list_col)

            # row-by-row UPSERT through the query server is possible but much slower:
            # cur_phoenix.execute("UPSERT INTO crm_teacher_class_detail VALUES (?,?)", list_col)

            i_count = i + 1
            if i_count % _BATCH_INSERT_MAX == 0:
                # append the buffered rows to the CSV file in batches
                with open(class_file_name + ".csv", "a") as csvfile:
                    writer = csv.writer(csvfile)
                    writer.writerows(list_row)

                list_row = []
                print i_count

        # write whatever is left in the buffer
        if list_row:
            with open(class_file_name + ".csv", "a") as csvfile:
                writer = csv.writer(csvfile)
                writer.writerows(list_row)
            print i_count

        # no explicit close() is needed: each with-block closes the file

        # push the CSV to HDFS and run the MapReduce bulk load tool
        command = "hadoop fs -rm /kk/" + class_file_name + ".csv; hadoop fs -put /data/furion_feature_db/" + class_file_name + ".csv /kk; hadoop jar /home/hadoop/apache-phoenix-4.14.0-cdh5.11.2-bin/phoenix-4.14.0-cdh5.11.2-client.jar org.apache.phoenix.mapreduce.CsvBulkLoadTool --table " + class_file_name + " --input /kk/" + class_file_name + ".csv"
        logger.info(command)
        os.system(command)

        # An attempt to capture the shell output kept failing, so it is left commented out:
        # p = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
        # out = p.stdout.readlines()
        # for line in out:
        #     logger.info(line.strip())

        logger.info("insert _class_detail finished")

5. Summary

1. Speed:

CSV data can be bulk loaded with built in utility named psql. Typical upsert rates are 20K - 50K rows per second (depends on how wide are the rows).

That is, bulk loading writes roughly 20K-50K rows per second. I have only measured this roughly myself, and the speed is indeed quite good. Source: https://phoenix.apache.org/faq.html

2. Testing confirms that a bulk load automatically updates the Phoenix secondary indexes (regardless of whether the HBase table existed beforehand).

3. The input file is expected to be UTF-8 encoded by default.

4. The MapReduce tool supports a number of additional options; see the bulk_dataload page referenced below for the full list.

5. By default, a MapReduce load updates all index tables of the target table; to update only a specific index table, pass it with -it. The default field delimiter is a comma and can be changed with -d (see the sketch after this list).

6. To drive a bulk load from code, first write the data to HDFS, wrap the MapReduce import command in a shell script, and call that script from the code (I have not tried this exact setup myself, but in principle it should work; a sketch follows below).
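As a sketch of points 5 and 6 (paths, table and index names are placeholders; the flag names are the long forms listed on the bulk_dataload page), the loader can be driven from Python the same way section 4 does:

import os

csv_file = "example.csv"      # placeholder local CSV produced by the code
hdfs_dir = "/tmp/bulkload"    # placeholder HDFS directory
client_jar = "/home/hadoop/apache-phoenix-4.14.0-cdh5.11.2-bin/phoenix-4.14.0-cdh5.11.2-client.jar"

command = (
    "hadoop fs -put -f %s %s; " % (csv_file, hdfs_dir)
    + "hadoop jar %s org.apache.phoenix.mapreduce.CsvBulkLoadTool" % client_jar
    + " --table EXAMPLE"
    + " --index-table EXAMPLE_FIRST_NAME_INDEX"  # -it: update only this index table
    + " --delimiter ','"                         # -d: field separator, comma is the default
    + " --input %s/%s" % (hdfs_dir, csv_file)
)
os.system(command)  # or wrap the command in a shell script and call that instead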

Reference (official documentation):

https://phoenix.apache.org/bulk_dataload.html

 

 
