Sqoop in practice: importing from MySQL into Hive, exporting back to MySQL, and the problems encountered

Sqoop version: Sqoop 1.4.6-cdh5.15.1

Keep the structure of the MySQL table cron_task in mind, because many of the problems below are tied to it:

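1. Importing data from MySQL into Hive

1.1 The first attempt uses an ASCII comma ',' as the field delimiter. If --hive-table default.xxxx is not specified, the Hive table name defaults to the MySQL table name: cron_task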

sqoop import --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task --fields-terminated-by ',' --hive-import --hive-table default.cron_task_hive --hive-overwrite -m 1

The whole process went smoothly and the job succeeded.

Check the Hive table structure: all the columns are there.

hive> desc cron_task_hive;
OK
id                  	string
cluster_name        	string
create_time         	string
cron                	string
cron_schedule_id    	string
date_number         	int
date_unit           	string
job_id              	string
norminal_time       	string
pipeline            	string
pipeline_id         	string
pipeline_ids        	string
start_time          	string
status              	boolean
submitted           	boolean
update_time         	string
username            	string
workspace_id        	string
Time taken: 0.059 seconds, Fetched: 18 row(s)

Querying the data shows that the id column contains the entire row, while every other column is NULL.

hive> select cluster_name,create_time  from cron_task_hive;
OK
NULL	NULL
NULL	NULL
NULL	NULL
NULL	NULL
NULL	NULL
NULL	NULL
NULL	NULL
NULL	NULL
NULL	NULL
NULL	NULL
Time taken: 0.06 seconds, Fetched: 10 row(s)
hive> select id from cron_task_hive limit 1;
OK
CronTask_414473843221581824,Cluster 1,2019-07-25 17:31:05.0,0

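This is puzzling. Why does it happen?

Looking at the data afterwards, I found that one column stores JSON-serialized strings. Since we used an ASCII comma (',') as the delimiter when writing to Hive, that may well have broken the field splitting.

The column that stores the JSON string

Fix: now that we know the problem, let's solve it. Use '\001' as the delimiter (if you use \001, Hive conveniently converts it to \u0001) and re-import the data into the Hive table. Note: drop the Hive table before you start. Otherwise the old data stays in place, and even --hive-overwrite does not help.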
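To illustrate the suspected cause, here is a minimal Python sketch. It is not part of the Sqoop job, and the sample row and its JSON value are made up for illustration. It joins one row with each delimiter and splits it back naively, which is roughly how a delimited text table is read.

```python
import json

# Hypothetical row: three plain columns plus one column holding a JSON string.
row = ["CronTask_1", "Cluster 1", "2019-07-25 17:31:05.0",
       json.dumps({"a": 1, "b": [2, 3]})]


def round_trip(fields, delimiter):
    """Join the fields into one text line, then split it back on the delimiter."""
    line = delimiter.join(fields)
    return line.split(delimiter)


comma_fields = round_trip(row, ",")
ctrl_a_fields = round_trip(row, "\x01")  # octal '\001' is the Ctrl-A character

# The JSON value contains commas, so the comma version yields extra fields.
print(len(row), len(comma_fields), len(ctrl_a_fields))  # 4 6 4
```

With the comma, the JSON value is cut into several pieces and the columns no longer line up; with Ctrl-A the row comes back unchanged, assuming the data itself never contains that character.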

1.2 Using '\001' as the delimiter

drop table cron_task_hive;

sqoop import --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task --fields-terminated-by '\001' --hive-import --hive-table default.cron_task_hive --hive-overwrite -m 1

Query again, and it works!

hive> select cluster_name,create_time  from cron_task;
OK
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Cluster 1	2019-07-26 06:31:05.0
Time taken: 0.071 seconds, Fetched: 10 row(s)

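Troubles never come alone: a new problem appeared. No worries, one hurdle at a time. create_time in MySQL shows 2019-07-25 17:31:05, but the value written to Hive is 2019-07-26 06:31:05.0, a full 13 hours off. In other words, every time value shifts by 13 hours during the Sqoop import from MySQL into Hive.

The following blog posts helped me solve this: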

https://blog.csdn.net/weixin_43079984/article/details/89567962

https://blog.csdn.net/nwpu_geeker/article/details/80155423

The offset is caused by MySQL. Check the time zone in MySQL (note: if you are not root, switch to root; a user like my hive above does not have permission to change it).
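A 13-hour shift is what you would get if the same wall-clock time were read as US Central Daylight Time (UTC-5) and then shown in UTC+8. This is a commonly reported cause when MySQL's time zone is SYSTEM and the system zone is reported as the ambiguous abbreviation CST, but nothing above confirms it for this cluster, so treat the following Python sketch as an assumption. The fixed offsets are chosen purely for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumed offsets, for illustration only.
US_CENTRAL_DAYLIGHT = timezone(timedelta(hours=-5))  # "CST" misread as US Central, in July
UTC_PLUS_8 = timezone(timedelta(hours=8))            # the local zone

# The value stored in MySQL, as shown above.
mysql_value = datetime(2019, 7, 25, 17, 31, 5)

# Misread the wall-clock time as US Central, then render it in UTC+8.
misread = mysql_value.replace(tzinfo=US_CENTRAL_DAYLIGHT)
hive_value = misread.astimezone(UTC_PLUS_8).replace(tzinfo=None)

print(hive_value)                # 2019-07-26 06:31:05
print(hive_value - mysql_value)  # 13:00:00
```

The result matches the shifted value seen in Hive, which is why pinning MySQL to an explicit offset such as '+08:00' removes the ambiguity.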

Next, check the local time zone, which is usually UTC+8:

date -R

MySQL command to query the time zone:

show VARIABLES like "%time_zone";

Change the time zone from SYSTEM to UTC+8:

set global time_zone = '+08:00';
set time_zone = '+08:00';
flush privileges;

show VARIABLES like "%time_zone";

OK! Re-run the Sqoop script above, and the problem is fixed.

Check the Hive data: the times now match.

hive> select cluster_name,create_time  from cron_task_hive;
OK
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Cluster 1	2019-07-25 17:31:05.0
Time taken: 0.071 seconds, Fetched: 10 row(s)

2. Exporting data from Hive to MySQL

2.1 Command

sqoop export --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task_local --fields-terminated-by '\001'  --export-dir /user/hive/warehouse/cron_task_hive

The job failed because the table had not been created in MySQL.

Create a table in MySQL with the same structure as the earlier one:

create table cron_task_test like cron_task_local;

Once it is created, run the Sqoop statement above again:

sqoop export --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task_local --fields-terminated-by '\001'  --export-dir /user/hive/warehouse/cron_task_hive

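The export succeeded!

3. Extras

3.1 Some readers are perfectionists and want to get rid of the trailing .0 on the time values stored in Hive. It actually comes from the Timestamp format (yyyy-mm-dd hh:mm:ss[.fffffffff]). We can deal with it by adding --map-column-java and --map-column-hive on import, and --map-column-java on export. Note the MySQL table structure shown at the top: create_time is timestamp, while norminal_time, start_time, and update_time are datetime.

After the first run of the command below, norminal_time, start_time, and update_time were all empty: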

sqoop import --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task --fields-terminated-by '\001' --hive-import --map-column-java create_time=java.sql.Timestamp,norminal_time=java.sql.Date,start_time=java.sql.Date,update_time=java.sql.Date --map-column-hive create_time=TIMESTAMP,norminal_time=TIMESTAMP,start_time=TIMESTAMP,update_time=TIMESTAMP --hive-table default.cron_task_stamp --hive-overwrite -m 1

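I consulted a blog post whose author walks through the source code, mainly the column type mappings in toJava (MySQL) and toHive. Based on that, remove norminal_time, start_time, and update_time from --map-column-hive (https://blog.csdn.net/MuQianHuanHuoZhe/article/details/104423768)

Drop the Hive table: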

drop table cron_task_stamp;

Run the Sqoop statement:

sqoop import --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task --fields-terminated-by '\001' --hive-import --map-column-java create_time=java.sql.Timestamp,norminal_time=java.sql.Date,start_time=java.sql.Date,update_time=java.sql.Date --map-column-hive create_time=TIMESTAMP --hive-table default.cron_task_stamp --hive-overwrite -m 1

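Check the Hive contents: create_time has indeed lost the .0, and the other three columns now have values, but without HH:MM:ss. If your requirements allow that, there is no problem (but... when I ran the export later, yet another error appeared).

Run the sqoop export statement that succeeded earlier: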

sqoop export --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task_test --fields-terminated-by '\001'  --export-dir /user/hive/warehouse/cron_task_stamp

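It failed. The error says the value must match the format yyyy-mm-dd hh:mm:ss[.fffffffff]. That is odd: when exporting from Hive to MySQL, only create_time is a timestamp in Hive and everything else is a string. Do we need to specify the types when writing to MySQL?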
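As a rough illustration of why a date-only string would fail that format check, here is a Python sketch. The regular expression is written for this example and only approximates the format in the error message; it is not Sqoop's or the JDBC driver's actual parser. Presumably the values that tripped the check were the date-only ones in norminal_time, start_time, and update_time.

```python
import re

# Approximation of yyyy-mm-dd hh:mm:ss[.fffffffff], written for this example.
TIMESTAMP_FORMAT = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(\.\d{1,9})?")


def looks_like_timestamp(text):
    """Return True if the whole string matches the timestamp format."""
    return TIMESTAMP_FORMAT.fullmatch(text) is not None


results = {
    value: looks_like_timestamp(value)
    for value in ["2019-07-25 17:31:05.0", "2019-07-25 17:31:05", "2019-07-25"]
}

for value, ok in results.items():
    print(value, ok)
```

Only the date-only string is rejected, which fits the idea that the target columns expect a full timestamp unless told otherwise.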

sqoop export statement with the column types specified:

sqoop export --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task_test --fields-terminated-by '\001' --map-column-java create_time=java.sql.Timestamp,update_time=java.sql.Date,norminal_time=java.sql.Date,start_time=java.sql.Date --export-dir /user/hive/warehouse/cron_task_stamp

It ran successfully. Let's see what the data written to MySQL looks like:

select create_time,norminal_time,start_time,update_time from cron_task_test

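All the data was written. It matches the Hive data, with the hours, minutes, and seconds replaced by 0. Done.

At this point some readers will ask: what if I need the data to stay identical, hours, minutes, and seconds included? Look back at the import statement above. We specified java.sql.Date in --map-column-java create_time=java.sql.Timestamp,norminal_time=java.sql.Date,start_time=java.sql.Date,update_time=java.sql.Date. A Date is just yyyy-mm-dd, so the time of day is dropped on write. If we leave those columns unmapped, they are written to Hive as String, so shouldn't that keep yyyy-mm-dd HH:MM:ss? Let's roll up our sleeves and try it.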
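The truncation can be pictured with a small Python sketch. It uses Python's own date and datetime types as stand-ins for the Java types, so it only mirrors the behaviour described here rather than reproducing what Sqoop does internally.

```python
from datetime import datetime

original = datetime(2019, 7, 25, 17, 31, 5)

# Mapping the column to a date type keeps only year, month, and day.
as_date = original.date()

# Writing that date back into a datetime column fills the time with zeros.
restored = datetime.combine(as_date, datetime.min.time())

# Keeping the value as a plain string preserves the full time of day.
as_string = original.strftime("%Y-%m-%d %H:%M:%S")

print(as_date)    # 2019-07-25
print(restored)   # 2019-07-25 00:00:00
print(as_string)  # 2019-07-25 17:31:05
```

Once the time of day has been dropped on import, no export option can bring it back, so the choice has to be made at import time.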

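This changes the structure of the Hive table cron_task_stamp. If you want to keep using the same table name, remember to drop the table first, otherwise the change does not take effect. You can also use a new table name.

-- Drop the Hive table
drop table cron_task_stamp;

Run the sqoop import statement: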

sqoop import --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task --fields-terminated-by '\001' --hive-import --map-column-java create_time=java.sql.Timestamp --map-column-hive create_time=TIMESTAMP --hive-table default.cron_task_stamp --hive-overwrite -m 1

It succeeded. Check the Hive data: the hours, minutes, and seconds are all there.

Run the sqoop export statement:

sqoop export --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task_test --fields-terminated-by '\001' --map-column-java create_time=java.sql.Timestamp --export-dir /user/hive/warehouse/cron_task_stamp

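The run failed with a primary key conflict. I will cover update mode (--update-key and --update-mode) later. For now, delete the data in the MySQL table and run sqoop export again.

The script succeeded and the data written to MySQL is correct. Perfect.

Some readers may now have another idea. Originally we did not specify a type when writing to Hive, and specifying Date keeps only the year, month, and day. What if I map norminal_time to timestamp: will the .0 suffix disappear in Hive? Yes, that works. Let's try it.

sqoop import script; remember to drop the Hive table before running it: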

sqoop import --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task --fields-terminated-by '\001' --hive-import --map-column-java create_time=java.sql.Timestamp,norminal_time=java.sql.Timestamp --map-column-hive create_time=TIMESTAMP,norminal_time=Timestamp --hive-table default.cron_task_stamp --hive-overwrite -m 1

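The write succeeded. Check the Hive data: very good, this is exactly what I wanted.

3.2 The export above hit a primary key conflict: Duplicate entry 'CronTask_414473843561320448' for key 'PRIMARY'

Let's solve that now.

--update-mode has two modes:

updateonly: update only. Only rows that already exist in MySQL are updated; new rows in Hive are not written to MySQL.

allowinsert: updates existing rows and also inserts new rows into MySQL.

--update-key: the primary key, id. When there is a conflict, an update is performed.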
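The difference between the two modes can be modelled with plain dictionaries. The Python sketch below only mirrors the behaviour described above; the function name and sample rows are made up, and it says nothing about how Sqoop implements the modes.

```python
def export_rows(target, rows, update_mode):
    """Apply rows (keyed by id) to a copy of target under the given update mode."""
    if update_mode not in ("updateonly", "allowinsert"):
        raise ValueError(f"unknown update mode: {update_mode}")
    result = dict(target)
    for key, value in rows.items():
        if key in result:
            result[key] = value  # existing row: updated in both modes
        elif update_mode == "allowinsert":
            result[key] = value  # new row: inserted only with allowinsert
    return result


# Hypothetical data: one row already in MySQL, two rows coming from Hive.
mysql_table = {"id1": "old"}
hive_rows = {"id1": "new", "id2": "new"}

update_only = export_rows(mysql_table, hive_rows, "updateonly")
allow_insert = export_rows(mysql_table, hive_rows, "allowinsert")

print(update_only)   # {'id1': 'new'}
print(allow_insert)  # {'id1': 'new', 'id2': 'new'}
```

In this model, updateonly silently skips id2, which is the case to watch for when Hive holds rows that MySQL has never seen.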

sqoop export command:

sqoop export --connect jdbc:mysql://192.168.1.1:3306/hive --username hive --password hive --table cron_task_test --fields-terminated-by '\001' --map-column-java create_time=java.sql.Timestamp --export-dir /user/hive/warehouse/cron_task_stamp --update-key id --update-mode allowinsert

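You can also refer to this blog post (https://blog.csdn.net/wiborgite/article/details/80958201) and try both modes yourself.

3.3 For problems with Oracle, see the following blog post:

Sqoop import from Oracle into Hive: hours, minutes, and seconds truncated in date columns

Enabling vertical (row-to-column) display in Hive:
> set hive.cli.print.header=true; // print column names
> set hive.cli.print.row.to.vertical=true; // enable vertical display; printing column names must be enabled first
> set hive.cli.print.row.to.vertical.num=1; // number of columns shown per line
> select * from example_table where pt='2012-03-31' limit 2;

Viewing a Hive table's structure and column types: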

> desc TableName;
