Contents
Introduction
- DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, ODPS, and others.
- As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target. In principle, the framework can synchronize data between arbitrary data source types. The plugin system also forms an ecosystem: each newly added data source immediately becomes interoperable with all existing ones.
Basic Environment Setup
Requirements
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended; CentOS 7 ships Python 2.7 by default)
- Apache Maven 3.x (only needed if you compile DataX yourself)
MySQL Installation
Already covered in the earlier Hadoop installation article; refer back to it.
- By default, Hive keeps its metadata in the embedded Derby database; in production, storing Hive metadata in MySQL is the more reliable choice.
- Reference: https://www.cnblogs.com/luohanguo/p/9045391.html
- Log in to MySQL: mysql -uroot -p
- Grant MySQL privileges:
- Grant access from a specific host: grant all privileges on *.* to 'root'@'<host>' identified by 'root' with grant option;
- Grant access from any host: grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;
JDK Installation
Already covered in the earlier Hadoop installation article; refer back to it.
Installation steps are easy to find online.
Configure JAVA_HOME: vim /etc/profile
export JAVA_HOME=/usr/local/java_1.8.0_121
export JAVA_BIN=$JAVA_HOME/bin
export JAVA_LIB=$JAVA_HOME/lib
export CLASSPATH=.:$JAVA_LIB/dt.jar
export PATH=$JAVA_BIN:$PATH
Verify: java -version
Make sure the installed version is JDK 1.8 or later.
DataX Installation and Configuration
Official documentation: https://github.com/alibaba/DataX
Download
Configuration
- After downloading, simply extract it into the target directory:
tar -zxvf datax.tar.gz -C ./
- Environment variables:
# DataX configuration
export DATAX_HOME=/opt/bigdata/datax/default
export PATH=${DATAX_HOME}/bin:$PATH
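Once the variables are exported, a quick sanity check of the resulting layout can be sketched as follows (a minimal illustration; the `resolve_datax_bin` helper is my own, not part of DataX):

```python
import os

def resolve_datax_bin(env):
    """Return the expected path of datax.py for a given environment
    mapping, or None when DATAX_HOME is not set."""
    home = env.get("DATAX_HOME")
    if not home:
        return None
    return os.path.join(home, "bin", "datax.py")

# Mirror the export lines above with a plain dict:
env = {"DATAX_HOME": "/opt/bigdata/datax/default"}
print(resolve_datax_bin(env))  # /opt/bigdata/datax/default/bin/datax.py
```

If the printed path exists and `bin` is on PATH, `datax.py` can be invoked directly from any directory, as the test below does.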
Basic Test
- Read data with a stream reader and print it to the console
The configuration is as follows:
vim stream2stream.json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,你好,世界-DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 5
}
}
}
}
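Before submitting a job, its file can be sanity-checked programmatically. A minimal sketch (the validation rules here are my own shorthand, not a DataX API):

```python
import json

def check_job(conf):
    """Minimal structural check for a DataX job config:
    non-empty content, named reader/writer, positive channel count."""
    job = conf["job"]
    assert job["content"], "job.content must not be empty"
    for item in job["content"]:
        assert "reader" in item and "name" in item["reader"]
        assert "writer" in item and "name" in item["writer"]
    channel = job["setting"]["speed"]["channel"]
    assert isinstance(channel, int) and channel > 0
    return True

# A trimmed-down version of stream2stream.json above:
conf = json.loads("""
{"job": {"content": [{"reader": {"name": "streamreader",
  "parameter": {"sliceRecordCount": 10}},
  "writer": {"name": "streamwriter",
  "parameter": {"encoding": "UTF-8", "print": true}}}],
 "setting": {"speed": {"channel": 5}}}}
""")
print(check_job(conf))  # True
```

For this job, the expected record count is sliceRecordCount × channel = 10 × 5 = 50, which matches the `Total 50 records` line in the log output that follows.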
- Run
[root@ecs-6531-0002 conf]# datax.py stream2stream.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2019-07-27 20:58:40.474 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2019-07-27 20:58:40.481 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.121-b13
jvmInfo: Linux amd64 3.10.0-862.9.1.el7.x86_64
cpu num: 16
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2019-07-27 20:58:40.497 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"streamreader",
"parameter":{
"column":[
{
"type":"long",
"value":"10"
},
{
"type":"string",
"value":"hello,你好,世界-DataX"
}
],
"sliceRecordCount":10
}
},
"writer":{
"name":"streamwriter",
"parameter":{
"encoding":"UTF-8",
"print":true
}
}
}
],
"setting":{
"speed":{
"channel":5
}
}
}
2019-07-27 20:58:40.512 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2019-07-27 20:58:40.514 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2019-07-27 20:58:40.514 [main] INFO JobContainer - DataX jobContainer starts job.
2019-07-27 20:58:40.515 [main] INFO JobContainer - Set jobId = 0
2019-07-27 20:58:40.531 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - jobContainer starts to do split ...
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels.
2019-07-27 20:58:40.533 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2019-07-27 20:58:40.533 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2019-07-27 20:58:40.558 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2019-07-27 20:58:40.576 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2019-07-27 20:58:40.584 [job-0] INFO JobContainer - Running by standalone Mode.
2019-07-27 20:58:40.606 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2019-07-27 20:58:40.616 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2019-07-27 20:58:40.617 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2019-07-27 20:58:40.629 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2019-07-27 20:58:40.632 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
2019-07-27 20:58:40.635 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-07-27 20:58:40.651 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-07-27 20:58:40.656 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[110]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[105]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[133]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[126]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[128]ms
2019-07-27 20:58:40.761 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2019-07-27 20:58:50.621 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.003s | Percentage 100.00%
2019-07-27 20:58:50.621 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2019-07-27 20:58:50.622 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2019-07-27 20:58:50.622 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2019-07-27 20:58:50.622 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2019-07-27 20:58:50.623 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/bigdata/datax/default/hook
2019-07-27 20:58:50.625 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2019-07-27 20:58:50.628 [job-0] INFO JobContainer - PerfTrace not enable!
2019-07-27 20:58:50.629 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.003s | Percentage 100.00%
2019-07-27 20:58:50.634 [job-0] INFO JobContainer -
Job start time            : 2019-07-27 20:58:40
Job end time              : 2019-07-27 20:58:50
Total elapsed time        : 10s
Average traffic           : 95B/s
Record write speed        : 5rec/s
Total records read        : 50
Total read/write failures : 0
The lines above are the job's basic summary statistics.
- OK; at this point DataX is in a basic working state.
Importing a Hive or Spark Table into MySQL with DataX
Hive Table Creation Requirements
- A Hive-to-MySQL import actually reads the data straight from HDFS and writes it into MySQL. Because Hive tables use a special delimiter by default, a direct DataX import will throw errors, so the table must be created with an explicit delimiter:
CREATE TABLE IF NOT EXISTS `${hivevar:target_table}`
(
datekey string comment 'date'
,project_id string comment ''
,building_id string comment ''
,unit_id string comment ''
,building_name string comment ''
... (many columns omitted here)
,note string comment ''
) comment ''
partitioned by (dt string)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='\t','serialization.null.format'='') -- Key point: set the field delimiter to tab and the NULL serialization format to the empty string; the null setting is critical, otherwise the import into MySQL throws conversion exceptions
STORED AS TEXTFILE; -- TEXTFILE storage is used here; DataX also supports other formats, configurable as needed
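A quick illustration of why the empty-string NULL format matters when a text row is split back into fields (a pure-Python sketch, not DataX code; the sample row is invented):

```python
# With field.delim='\t' and serialization.null.format='',
# a Hive TEXTFILE row serializes NULL columns as empty strings:
row = "20190720\tP001\t\tB-01"   # third column is NULL

fields = row.split("\t")
print(fields)  # ['20190720', 'P001', '', 'B-01']

# Mapping empty strings back to None (the equivalent of SQL NULL)
# is what keeps the writer from hitting type-conversion errors
# on empty numeric or date columns:
values = [f if f != "" else None for f in fields]
print(values)  # ['20190720', 'P001', None, 'B-01']
```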
- Hive-to-MySQL configuration, run.json (the `--` annotations below are explanatory only and should be removed from the actual file, since JSON does not allow comments)
{
"job": {
"setting": {
"speed": {
"channel": 3
}
},
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"path": "/user/root/warehouse/real_estate.db/dim_building/dt=20190720/*", -- HDFS path
"defaultFS": "hdfs://ecs-6531-0002.novalocal:9000", -- NameNode address
"column": [ -- column index/type declarations; keep consistent with the Hive table
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "string"
},
{
"index": 5,
"type": "string"
},
{
"index": 6,
"type": "string"
},
{
"index": 7,
"type": "string"
},
{
"index": 8,
"type": "string"
},
{
"index": 9,
"type": "string"
},
{
"index": 10,
"type": "string"
},
{
"index": 11,
"type": "string"
},
{
"index": 12,
"type": "string"
},
{
"index": 13,
"type": "string"
},
{
"index": 14,
"type": "string"
},
{
"index": 15,
"type": "string"
},
{
"index": 16,
"type": "string"
},
{
"index": 17,
"type": "string"
},
{
"index": 18,
"type": "string"
},
{
"index": 19,
"type": "string"
},
{
"index": 20,
"type": "string"
},
{
"index": 21,
"type": "string"
},
{
"index": 22,
"type": "string"
},
{
"index": 23,
"type": "string"
},
{
"index": 24,
"type": "string"
},
{
"index": 25,
"type": "string"
},
{
"index": 26,
"type": "string"
},
{
"index": 27,
"type": "string"
},
{
"index": 28,
"type": "string"
},
{
"index": 29,
"type": "string"
},
{
"index": 30,
"type": "string"
},
{
"index": 31,
"type": "string"
},
{
"index": 32,
"type": "string"
},
{
"index": 33,
"type": "string"
},
{
"index": 34,
"type": "string"
},
{
"index": 35,
"type": "string"
},
{
"index": 36,
"type": "string"
},
{
"index": 37,
"type": "string"
}
],
"fileType": "text",
"encoding": "UTF-8",
"fieldDelimiter": "\t" -- field delimiter
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"writeMode": "insert",
"username": "root",
"password": "xxx",
"column": [ -- MySQL column list
"dt"
,"project_id"
,"building_id"
,"unit_id"
,"building_name"
,"building_unit"
,"unit_name"
,"developer"
,"design_unit"
,"construction_unit"
,"engineering_supervision_unit"
,"project_cost"
,"start_date"
,"end_date"
,"delivery_date"
,"building_structure_type"
,"building_height"
,"total_floor_cnt"
,"onground_floor_cnt"
,"underground_floor_cnt"
,"building_floor_area"
,"building_construction_area"
,"building_onground_area"
,"building_underground_area"
,"usage_area"
,"green_area"
,"floor_area"
,"corridor_area"
,"exterior_wall_area"
,"top_area"
,"base_area"
,"basis_sharing_coefficient"
,"common_area"
,"common_area_sharing_coefficient"
,"elevator_cnt"
,"house_cnt"
,"parking_cnt"
,"note"
],
"session": [
"set session sql_mode='ANSI'"
],
"preSql": [
"delete from dim_building where dt='20190720'"
],
"connection": [
{
"jdbcUrl": "jdbc:mysql://localhost:13306/real_estate?useUnicode=true&characterEncoding=utf-8", -- MySQL JDBC URL
"table": [ -- target table name
"dim_building"
]
}
]
}
}
}
]
}
}
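The 38 identical string-column entries in the hdfsreader block above are tedious to write by hand; a small sketch that generates them (the `make_columns` helper is illustrative, not part of DataX):

```python
import json

def make_columns(n, col_type="string"):
    """Generate an hdfsreader column list: n sequentially
    indexed columns of the same type."""
    return [{"index": i, "type": col_type} for i in range(n)]

cols = make_columns(38)
print(len(cols))            # 38
print(json.dumps(cols[0]))  # {"index": 0, "type": "string"}
```

The generated list can then be spliced into the job JSON with `json.dumps`, instead of maintaining 38 near-identical blocks by hand.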
- Run the import
datax.py run.json