Contents
Introduction
- DataX is an offline data synchronization tool/platform widely used inside Alibaba Group. It implements efficient data synchronization between heterogeneous data sources, including MySQL, SQL Server, Oracle, PostgreSQL, HDFS, Hive, HBase, OTS, ODPS, and others.
- As a data synchronization framework, DataX abstracts the synchronization of different data sources into Reader plugins, which read data from a source, and Writer plugins, which write data to a target. In principle, the framework can synchronize data between arbitrary data source types. The plugin system also forms an ecosystem: each newly added data source immediately becomes interoperable with all existing ones.
Basic Environment Setup
Requirements
- Linux
- JDK (1.8 or later; 1.8 recommended)
- Python (Python 2.6.x recommended; CentOS 7 ships Python 2.7 by default)
- Apache Maven 3.x (only needed if you compile DataX yourself)
MySQL Installation
Already covered in the earlier Hadoop installation article; refer back to it.
- By default, Hive keeps its metadata in the embedded Derby database; in production, storing Hive metadata in MySQL is the more reliable choice.
- Reference: https://www.cnblogs.com/luohanguo/p/9045391.html
- Log in to MySQL: mysql -uroot -p
- Grant MySQL privileges:
- Grant access from a specific host: grant all privileges on *.* to 'root'@'<host>' identified by 'root' with grant option;
- Grant access from any host: grant all privileges on *.* to 'root'@'%' identified by 'root' with grant option;
JDK Installation
Already covered in the earlier Hadoop installation article; refer back to it.
Installation steps are easy to find online.
Configure JAVA_HOME: vim /etc/profile
export JAVA_HOME=/usr/local/java_1.8.0_121
export JAVA_BIN=$JAVA_HOME/bin
export JAVA_LIB=$JAVA_HOME/lib
export CLASSPATH=.:$JAVA_LIB/dt.jar
export PATH=$JAVA_BIN:$PATH
Verify: java -version
Make sure the installed version is JDK 1.8 or later.
DataX Installation and Configuration
Official documentation: https://github.com/alibaba/DataX
Download
Configuration
- After downloading, simply extract it into the target directory:
tar -zxvf datax.tar.gz -C ./
- Environment variables:
# DataX configuration
export DATAX_HOME=/opt/bigdata/datax/default
export PATH=${DATAX_HOME}/bin:$PATH
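Once the variables are exported, a quick sanity check of the resulting layout can be sketched as follows (a minimal illustration; the `resolve_datax_bin` helper is my own, not part of DataX):

```python
import os

def resolve_datax_bin(env):
    """Return the expected path of datax.py for a given environment
    mapping, or None when DATAX_HOME is not set."""
    home = env.get("DATAX_HOME")
    if not home:
        return None
    return os.path.join(home, "bin", "datax.py")

# Mirror the export lines above with a plain dict:
env = {"DATAX_HOME": "/opt/bigdata/datax/default"}
print(resolve_datax_bin(env))  # /opt/bigdata/datax/default/bin/datax.py
```

If the printed path exists and `bin` is on PATH, `datax.py` can be invoked directly from any directory, as the test below does.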
Basic Test
- Read data with a stream reader and print it to the console
The configuration is as follows:
vim stream2stream.json
{
"job": {
"content": [
{
"reader": {
"name": "streamreader",
"parameter": {
"sliceRecordCount": 10,
"column": [
{
"type": "long",
"value": "10"
},
{
"type": "string",
"value": "hello,你好,世界-DataX"
}
]
}
},
"writer": {
"name": "streamwriter",
"parameter": {
"encoding": "UTF-8",
"print": true
}
}
}
],
"setting": {
"speed": {
"channel": 5
}
}
}
}
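Before submitting a job, its file can be sanity-checked programmatically. A minimal sketch (the validation rules here are my own shorthand, not a DataX API):

```python
import json

def check_job(conf):
    """Minimal structural check for a DataX job config:
    non-empty content, named reader/writer, positive channel count."""
    job = conf["job"]
    assert job["content"], "job.content must not be empty"
    for item in job["content"]:
        assert "reader" in item and "name" in item["reader"]
        assert "writer" in item and "name" in item["writer"]
    channel = job["setting"]["speed"]["channel"]
    assert isinstance(channel, int) and channel > 0
    return True

# A trimmed-down version of stream2stream.json above:
conf = json.loads("""
{"job": {"content": [{"reader": {"name": "streamreader",
  "parameter": {"sliceRecordCount": 10}},
  "writer": {"name": "streamwriter",
  "parameter": {"encoding": "UTF-8", "print": true}}}],
 "setting": {"speed": {"channel": 5}}}}
""")
print(check_job(conf))  # True
```

For this job, the expected record count is sliceRecordCount × channel = 10 × 5 = 50, which matches the `Total 50 records` line in the log output that follows.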
- Run
[root@ecs-6531-0002 conf]# datax.py stream2stream.json
DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.
2019-07-27 20:58:40.474 [main] INFO VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2019-07-27 20:58:40.481 [main] INFO Engine - the machine info =>
osInfo: Oracle Corporation 1.8 25.121-b13
jvmInfo: Linux amd64 3.10.0-862.9.1.el7.x86_64
cpu num: 16
totalPhysicalMemory: -0.00G
freePhysicalMemory: -0.00G
maxFileDescriptorCount: -1
currentOpenFileDescriptorCount: -1
GC Names [PS MarkSweep, PS Scavenge]
MEMORY_NAME | allocation_size | init_size
PS Eden Space | 256.00MB | 256.00MB
Code Cache | 240.00MB | 2.44MB
Compressed Class Space | 1,024.00MB | 0.00MB
PS Survivor Space | 42.50MB | 42.50MB
PS Old Gen | 683.00MB | 683.00MB
Metaspace | -0.00MB | 0.00MB
2019-07-27 20:58:40.497 [main] INFO Engine -
{
"content":[
{
"reader":{
"name":"streamreader",
"parameter":{
"column":[
{
"type":"long",
"value":"10"
},
{
"type":"string",
"value":"hello,你好,世界-DataX"
}
],
"sliceRecordCount":10
}
},
"writer":{
"name":"streamwriter",
"parameter":{
"encoding":"UTF-8",
"print":true
}
}
}
],
"setting":{
"speed":{
"channel":5
}
}
}
2019-07-27 20:58:40.512 [main] WARN Engine - prioriy set to 0, because NumberFormatException, the value is: null
2019-07-27 20:58:40.514 [main] INFO PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2019-07-27 20:58:40.514 [main] INFO JobContainer - DataX jobContainer starts job.
2019-07-27 20:58:40.515 [main] INFO JobContainer - Set jobId = 0
2019-07-27 20:58:40.531 [job-0] INFO JobContainer - jobContainer starts to do prepare ...
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do prepare work .
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do prepare work .
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - jobContainer starts to do split ...
2019-07-27 20:58:40.532 [job-0] INFO JobContainer - Job set Channel-Number to 5 channels.
2019-07-27 20:58:40.533 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] splits to [5] tasks.
2019-07-27 20:58:40.533 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] splits to [5] tasks.
2019-07-27 20:58:40.558 [job-0] INFO JobContainer - jobContainer starts to do schedule ...
2019-07-27 20:58:40.576 [job-0] INFO JobContainer - Scheduler starts [1] taskGroups.
2019-07-27 20:58:40.584 [job-0] INFO JobContainer - Running by standalone Mode.
2019-07-27 20:58:40.606 [taskGroup-0] INFO TaskGroupContainer - taskGroupId=[0] start [5] channels for [5] tasks.
2019-07-27 20:58:40.616 [taskGroup-0] INFO Channel - Channel set byte_speed_limit to -1, No bps activated.
2019-07-27 20:58:40.617 [taskGroup-0] INFO Channel - Channel set record_speed_limit to -1, No tps activated.
2019-07-27 20:58:40.629 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] attemptCount[1] is started
2019-07-27 20:58:40.632 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] attemptCount[1] is started
2019-07-27 20:58:40.635 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-07-27 20:58:40.651 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-07-27 20:58:40.656 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] attemptCount[1] is started
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
10 hello,你好,世界-DataX
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[110]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[1] is successed, used[105]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[2] is successed, used[133]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[3] is successed, used[126]ms
2019-07-27 20:58:40.760 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] taskId[4] is successed, used[128]ms
2019-07-27 20:58:40.761 [taskGroup-0] INFO TaskGroupContainer - taskGroup[0] completed it's tasks.
2019-07-27 20:58:50.621 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.003s | Percentage 100.00%
2019-07-27 20:58:50.621 [job-0] INFO AbstractScheduler - Scheduler accomplished all tasks.
2019-07-27 20:58:50.622 [job-0] INFO JobContainer - DataX Writer.Job [streamwriter] do post work.
2019-07-27 20:58:50.622 [job-0] INFO JobContainer - DataX Reader.Job [streamreader] do post work.
2019-07-27 20:58:50.622 [job-0] INFO JobContainer - DataX jobId [0] completed successfully.
2019-07-27 20:58:50.623 [job-0] INFO HookInvoker - No hook invoked, because base dir not exists or is a file: /opt/bigdata/datax/default/hook
2019-07-27 20:58:50.625 [job-0] INFO JobContainer -
[total cpu info] =>
averageCpu | maxDeltaCpu | minDeltaCpu
-1.00% | -1.00% | -1.00%
[total gc info] =>
NAME | totalGCCount | maxDeltaGCCount | minDeltaGCCount | totalGCTime | maxDeltaGCTime | minDeltaGCTime
PS MarkSweep | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
PS Scavenge | 0 | 0 | 0 | 0.000s | 0.000s | 0.000s
2019-07-27 20:58:50.628 [job-0] INFO JobContainer - PerfTrace not enable!
2019-07-27 20:58:50.629 [job-0] INFO StandAloneJobContainerCommunicator - Total 50 records, 950 bytes | Speed 95B/s, 5 records/s | Error 0 records, 0 bytes | All Task WaitWriterTime 0.000s | All Task WaitReaderTime 0.003s | Percentage 100.00%
2019-07-27 20:58:50.634 [job-0] INFO JobContainer -
Job start time            : 2019-07-27 20:58:40
Job end time              : 2019-07-27 20:58:50
Total elapsed time        : 10s
Average traffic           : 95B/s
Record write speed        : 5rec/s
Total records read        : 50
Total read/write failures : 0
The lines above are the job's basic summary statistics.
- OK; at this point DataX is in a basic working state.
Importing a Hive or Spark Table into MySQL with DataX
Hive Table Creation Requirements
- A Hive-to-MySQL import actually reads the data straight from HDFS and writes it into MySQL. Because Hive tables use a special delimiter by default, a direct DataX import will throw errors, so the table must be created with an explicit delimiter:
CREATE TABLE IF NOT EXISTS `${hivevar:target_table}`
(
datekey string comment 'date'
,project_id string comment ''
,building_id string comment ''
,unit_id string comment ''
,building_name string comment ''
... (many columns omitted here)
,note string comment ''
) comment ''
partitioned by (dt string)
ROW FORMAT
SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES ('field.delim'='\t','serialization.null.format'='') -- Key point: set the field delimiter to tab and the NULL serialization format to the empty string; the null setting is critical, otherwise the import into MySQL throws conversion exceptions
STORED AS TEXTFILE; -- TEXTFILE storage is used here; DataX also supports other formats, configurable as needed
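A quick illustration of why the empty-string NULL format matters when a text row is split back into fields (a pure-Python sketch, not DataX code; the sample row is invented):

```python
# With field.delim='\t' and serialization.null.format='',
# a Hive TEXTFILE row serializes NULL columns as empty strings:
row = "20190720\tP001\t\tB-01"   # third column is NULL

fields = row.split("\t")
print(fields)  # ['20190720', 'P001', '', 'B-01']

# Mapping empty strings back to None (the equivalent of SQL NULL)
# is what keeps the writer from hitting type-conversion errors
# on empty numeric or date columns:
values = [f if f != "" else None for f in fields]
print(values)  # ['20190720', 'P001', None, 'B-01']
```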
- Hive-to-MySQL configuration, run.json (the `--` annotations below are explanatory only and should be removed from the actual file, since JSON does not allow comments)
{
"job": {
"setting": {
"speed": {
"channel": 3
}
},
"content": [
{
"reader": {
"name": "hdfsreader",
"parameter": {
"path": "/user/root/warehouse/real_estate.db/dim_building/dt=20190720/*", -- HDFS path
"defaultFS": "hdfs://ecs-6531-0002.novalocal:9000", -- NameNode address
"column": [ -- column index/type declarations; keep consistent with the Hive table
{
"index": 0,
"type": "string"
},
{
"index": 1,
"type": "string"
},
{
"index": 2,
"type": "string"
},
{
"index": 3,
"type": "string"
},
{
"index": 4,
"type": "string"
},
{
"index": 5,
"type": "string"
},
{
"index": 6,
"type": "string"
},
{
"index": 7,
"type": "string"
},
{
"index": 8,
"type": "string"
},
{
"index": 9,
"type": "string"
},
{
"index": 10,
"type": "string"
},
{
"index": 11,
"type": "string"
},
{
"index": 12,
"type": "string"
},
{
"index": 13,
"type": "string"
},
{
"index": 14,
"type": "string"
},
{
"index": 15,
"type": "string"
},
{
"index": 16,
"type": "string"
},
{
"index": 17,
"type": "string"
},
{
"index": 18,
"type": "string"
},
{
"index": 19,
"type": "string"
},
{
"index": 20,
"type": "string"
},
{
"index": 21,
"type": "string"
},
{
"index": 22,
"type": "string"
},
{
"index": 23,
"type": "string"
},
{
"index": 24,
"type": "string"
},
{
"index": 25,
"type": "string"
},
{
"index": 26,
"type": "string"
},
{
"index": 27,
"type": "string"
},
{
"index": 28,
"type": "string"
},
{
"index": 29,
"type": "string"
},
{
"index": 30,
"type": "string"
},
{
"index": 31,
"type": "string"
},
{
"index": 32,
"type": "string"
},
{
"index": 33,
"type": "string"
},
{
"index": 34,
"type": "string"
},
{
"index": 35,
"type": "string"
},
{
"index": 36,
"type": "string"
},
{
"index": 37,
"type": "string"
}
],
"fileType": "text",
"encoding": "UTF-8",
"fieldDelimiter": "\t" -- field delimiter
}
},
"writer": {
"name": "mysqlwriter",
"parameter": {
"writeMode": "insert",
"username": "root",
"password": "xxx",
"column": [ -- MySQL column list
"dt"
,"project_id"
,"building_id"
,"unit_id"
,"building_name"
,"building_unit"
,"unit_name"
,"developer"
,"design_unit"
,"construction_unit"
,"engineering_supervision_unit"
,"project_cost"
,"start_date"
,"end_date"
,"delivery_date"
,"building_structure_type"
,"building_height"
,"total_floor_cnt"
,"onground_floor_cnt"
,"underground_floor_cnt"
,"building_floor_area"
,"building_construction_area"
,"building_onground_area"
,"building_underground_area"
,"usage_area"
,"green_area"
,"floor_area"
,"corridor_area"
,"exterior_wall_area"
,"top_area"
,"base_area"
,"basis_sharing_coefficient"
,"common_area"
,"common_area_sharing_coefficient"
,"elevator_cnt"
,"house_cnt"
,"parking_cnt"
,"note"
],
"session": [
"set session sql_mode='ANSI'"
],
"preSql": [
"delete from dim_building where dt='20190720'"
],
"connection": [
{
"jdbcUrl": "jdbc:mysql://localhost:13306/real_estate?useUnicode=true&characterEncoding=utf-8", -- MySQL JDBC URL
"table": [ -- target table name
"dim_building"
]
}
]
}
}
}
]
}
}
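The 38 identical string-column entries in the hdfsreader block above are tedious to write by hand; a small sketch that generates them (the `make_columns` helper is illustrative, not part of DataX):

```python
import json

def make_columns(n, col_type="string"):
    """Generate an hdfsreader column list: n sequentially
    indexed columns of the same type."""
    return [{"index": i, "type": col_type} for i in range(n)]

cols = make_columns(38)
print(len(cols))            # 38
print(json.dumps(cols[0]))  # {"index": 0, "type": "string"}
```

The generated list can then be spliced into the job JSON with `json.dumps`, instead of maintaining 38 near-identical blocks by hand.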
- Run the import
datax.py run.json