DataX Series 4: An Introduction to TxtFileWriter. Contents: 1. Quick Introduction; 2. Features and Limitations; 3. Feature Description; 4. Test Case; References

1. Quick Introduction

  In real production environments, for reasons such as data security and data exchange between different companies, data is often exchanged via file formats such as txt and csv.

  TxtFileWriter writes one or more CSV-like table files to the local file system. Its primary users are DataX developers and testers.

  The content written to the local file represents a logical two-dimensional table, for example text in CSV format.

2. Features and Limitations

  TxtFileWriter converts data from the DataX protocol into local TXT files; the local file itself is unstructured storage.

TxtFileWriter follows these conventions:

  1. Supports writing TXT files only, and requires the schema of the TXT content to be a two-dimensional table.

  2. Supports CSV-like files with a custom field delimiter.

  3. Supports text compression; the currently available formats are gzip and bzip2.

  4. Supports multi-threaded writing, with each thread writing to a different sub-file.

  5. File rolling: when a file exceeds a certain size or row count, writing switches to a new file. [not yet supported]

What it cannot do:

  1. Concurrent writes to a single file are not supported.

3. Feature Description

3.1 Configuration Example

{
    "setting": {},
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "txtfilereader",
                    "parameter": {
                        "path": ["/home/haiwei.luo/case00/data"],
                        "encoding": "UTF-8",
                        "column": [
                            {
                                "index": 0,
                                "type": "long"
                            },
                            {
                                "index": 1,
                                "type": "boolean"
                            },
                            {
                                "index": 2,
                                "type": "double"
                            },
                            {
                                "index": 3,
                                "type": "string"
                            },
                            {
                                "index": 4,
                                "type": "date",
                                "format": "yyyy.MM.dd"
                            }
                        ],
                        "fieldDelimiter": ","
                    }
                },
                "writer": {
                    "name": "txtfilewriter",
                    "parameter": {
                        "path": "/home/haiwei.luo/case00/result",
                        "fileName": "luohw",
                        "writeMode": "truncate",
                        "dateFormat": "yyyy-MM-dd"
                    }
                }
            }
        ]
    }
}

3.2 Parameter Description

3.2.1 path

  • Description: path on the local file system; TxtFileWriter writes multiple files under this directory.

  • Required: yes

  • Default: none

3.2.2 fileName

  • Description: the name of the file TxtFileWriter writes; a random suffix is appended to it to form the actual file name written by each thread.

  • Required: yes

  • Default: none
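As a rough sketch of what that means (the exact suffix scheme is internal to DataX; the UUID-style suffix and the helper name `build_actual_file_name` here are assumptions based on the file names visible in the run log later in this article):

```python
import uuid

def build_actual_file_name(file_name: str) -> str:
    # Hypothetical sketch: append a random UUID-style suffix to the configured
    # fileName, similar to "fact_sale_new__7b975784_087a_4270_94a4_11d55d290a68"
    # seen in the run log. Not DataX source code.
    suffix = str(uuid.uuid4()).replace("-", "_")
    return f"{file_name}__{suffix}"

print(build_actual_file_name("fact_sale_new"))
```

Because each writer thread gets its own randomized name, no two threads ever write to the same file.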

3.2.3 writeMode

  • Description: how TxtFileWriter cleans up existing data before writing:
  1. truncate: before writing, remove all files in the directory whose names are prefixed with fileName.
  2. append: no cleanup before writing; DataX TxtFileWriter writes using fileName directly and guarantees that file names do not conflict.
  3. nonConflict: if files prefixed with fileName already exist in the directory, report an error.
  • Required: yes

  • Default: none
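The three modes can be sketched as follows (an illustration of the behavior described above, not DataX source code; `prepare_dir` is a hypothetical helper name):

```python
import glob
import os

def prepare_dir(path: str, file_name: str, write_mode: str) -> None:
    # Files whose names start with the configured fileName prefix.
    existing = glob.glob(os.path.join(path, file_name + "*"))
    if write_mode == "truncate":
        for f in existing:           # remove every prefixed file before writing
            os.remove(f)
    elif write_mode == "nonConflict":
        if existing:                 # refuse to write when prefixed files exist
            raise RuntimeError(f"files with prefix {file_name!r} already exist")
    elif write_mode == "append":
        pass                         # keep existing files; new names will not clash
    else:
        raise ValueError(f"unsupported writeMode: {write_mode}")
```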

3.2.4 fieldDelimiter

  • Description: the field delimiter placed between written fields.

  • Required: no

  • Default: ,

3.2.5 compress

  • Description: compression type for the output text; leaving it unset means no compression. Supported types: zip, lzo, lzop, tgz, bzip2.

  • Required: no

  • Default: no compression

3.2.6 encoding

  • Description: the encoding of the written file.

  • Required: no

  • Default: utf-8

3.2.7 nullFormat

  • Description: a text file has no standard string that denotes null (a null pointer), so DataX provides nullFormat to define which string represents null.

For example, with nullFormat="\N" configured, DataX treats a source field whose value is "\N" as a null field.

  • Required: no

  • Default: \N
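A minimal sketch of this convention (`decode_null` and `encode_null` are hypothetical names for illustration, not DataX APIs):

```python
def decode_null(raw, null_format="\\N"):
    # A field whose raw text exactly equals nullFormat is treated as null.
    return None if raw == null_format else raw

def encode_null(value, null_format="\\N"):
    # Symmetrically, a writer emits nullFormat in place of a null value.
    return null_format if value is None else str(value)
```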

3.2.8 dateFormat

  • Description: the format used to serialize date-typed values into the file, e.g. "dateFormat": "yyyy-MM-dd".

  • Required: no

  • Default: none

3.2.9 fileFormat

  • Description: the output file format, either csv (http://zh.wikipedia.org/wiki/%E9%80%97%E5%8F%B7%E5%88%86%E9%9A%94%E5%80%BC) or text. csv is strict CSV: if a value to be written contains the field delimiter, it is escaped according to CSV quoting rules, with the double quote (") as the quote character. text simply joins the values with the field delimiter and does no escaping even when a value contains the delimiter.

  • Required: no

  • Default: text
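The difference shows up as soon as a value contains the delimiter. Here is a small sketch that uses Python's csv module to stand in for strict CSV quoting (`render_line` is a hypothetical helper, not DataX code):

```python
import csv
import io

def render_line(fields, file_format="text", delimiter=","):
    if file_format == "csv":
        # Strict CSV: fields containing the delimiter are wrapped in double quotes.
        buf = io.StringIO()
        csv.writer(buf, delimiter=delimiter, lineterminator="").writerow(fields)
        return buf.getvalue()
    # text: naive join, no escaping at all.
    return delimiter.join(fields)

print(render_line(["a,b", "c"], "csv"))   # "a,b",c
print(render_line(["a,b", "c"], "text"))  # a,b,c
```

With fileFormat=text the written line is ambiguous to any downstream CSV parser, which is why csv is the safer choice when values may contain the delimiter.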

3.2.10 header

  • Description: the header row written at the top of the file, e.g. ['id', 'name', 'age'].

  • Required: no

  • Default: none

3.3 Type Conversion

The local file itself carries no type information; the types below are defined by DataX TxtFileWriter:

  DataX internal type | Local-file data type
  Long                | Long
  Double              | Double
  String              | String
  Boolean             | Boolean
  Date                | Date

Where:

  1. Local-file Long means an integer in string form in the file text, e.g. "19901219".
  2. Local-file Double means a double in string form, e.g. "3.1415".
  3. Local-file Boolean means a boolean in string form, e.g. "true" or "false" (case-insensitive).
  4. Local-file Date means a date in string form, e.g. "2014-12-31"; a format can be specified for Date.
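The mapping above can be sketched as a field parser (`parse_field` is a hypothetical helper; note that Python's strptime format "%Y-%m-%d" corresponds to the Java-style "yyyy-MM-dd" used in DataX configs):

```python
from datetime import datetime

def parse_field(raw, col_type, date_format="%Y-%m-%d"):
    if col_type == "long":
        return int(raw)                       # e.g. "19901219" -> 19901219
    if col_type == "double":
        return float(raw)                     # e.g. "3.1415" -> 3.1415
    if col_type == "boolean":
        return raw.lower() == "true"          # case-insensitive, per item 3
    if col_type == "date":
        return datetime.strptime(raw, date_format)
    return raw                                # "string" passes through unchanged
```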

4. Test Case

  Now let's test a job that reads from a csv file and writes the data back out as csv.

4.1 Data Preparation

Use the test data from the earlier Superset article:
Test data

Write the mysql table data out to a csv file:

mysql> use test;
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> 
mysql> select * from  fact_sale INTO OUTFILE '/home/backup/fact_sale.csv' FIELDS TERMINATED BY ',';
Query OK, 684371 rows affected (0.69 sec)

mysql> 

Inspect the format of the csv file:


Copy the file to the server where DataX is deployed:

4.2 Preparing the json File

txtfilereader:

{
    "setting": {},
    "job": {
        "setting": {
            "speed": {
                "channel": 2
            }
        },
        "content": [
            {
                "reader": {
                    "name": "txtfilereader",
                    "parameter": {
                        "path": ["/home/backup"],
                        "encoding": "UTF-8",
                        "column": [
                            {
                                "index": 0,
                                "type": "date",
                                "format": "yyyy-MM-dd"
                            },
                            {
                                "index": 1,
                                "type": "string"
                            },
                            {
                                "index": 2,
                                "type": "string"
                            },
                            {
                                "index": 3,
                                "type": "long"
                            },
                            {
                                "index": 4,
                                "type": "long"
                            }
                        ],
                        "fieldDelimiter": ","
                    }
                },
                "writer": {
                    "name": "txtfilewriter",
                    "parameter": {
                        "path": "/home/backup",
                        "fileName": "fact_sale_new",
                        "writeMode": "truncate",
                        "format": "yyyy-MM-dd"
                    }
                }
            }
        ]
    }
}

4.3 Running the Script

cd $datax_home/bin
python datax.py ./txtfilereader.json

Run log:

[root@10-31-1-119 bin]# python datax.py ./txtfilereader.json 

DataX (DATAX-OPENSOURCE-3.0), From Alibaba !
Copyright (C) 2010-2017, Alibaba Group. All Rights Reserved.


2021-11-23 11:16:33.256 [main] INFO  VMInfo - VMInfo# operatingSystem class => sun.management.OperatingSystemImpl
2021-11-23 11:16:33.263 [main] INFO  Engine - the machine info  => 

        osInfo: Oracle Corporation 1.8 25.242-b08
        jvmInfo:        Linux amd64 3.10.0-1127.el7.x86_64
        cpu num:        8

        totalPhysicalMemory:    -0.00G
        freePhysicalMemory:     -0.00G
        maxFileDescriptorCount: -1
        currentOpenFileDescriptorCount: -1

        GC Names        [PS MarkSweep, PS Scavenge]

        MEMORY_NAME                    | allocation_size                | init_size                      
        PS Eden Space                  | 256.00MB                       | 256.00MB                       
        Code Cache                     | 240.00MB                       | 2.44MB                         
        Compressed Class Space         | 1,024.00MB                     | 0.00MB                         
        PS Survivor Space              | 42.50MB                        | 42.50MB                        
        PS Old Gen                     | 683.00MB                       | 683.00MB                       
        Metaspace                      | -0.00MB                        | 0.00MB                         


2021-11-23 11:16:33.277 [main] INFO  Engine - 
{
        "content":[
                {
                        "reader":{
                                "name":"txtfilereader",
                                "parameter":{
                                        "column":[
                                                {
                                                        "format":"yyyy-MM-dd",
                                                        "index":0,
                                                        "type":"date"
                                                },
                                                {
                                                        "index":1,
                                                        "type":"string"
                                                },
                                                {
                                                        "index":2,
                                                        "type":"string"
                                                },
                                                {
                                                        "index":3,
                                                        "type":"long"
                                                },
                                                {
                                                        "index":4,
                                                        "type":"long"
                                                }
                                        ],
                                        "encoding":"UTF-8",
                                        "fieldDelimiter":",",
                                        "path":[
                                                "/home/backup"
                                        ]
                                }
                        },
                        "writer":{
                                "name":"txtfilewriter",
                                "parameter":{
                                        "fileName":"fact_sale_new",
                                        "format":"yyyy-MM-dd",
                                        "path":"/home/backup",
                                        "writeMode":"truncate"
                                }
                        }
                }
        ],
        "setting":{
                "speed":{
                        "channel":2
                }
        }
}

2021-11-23 11:16:33.291 [main] WARN  Engine - prioriy set to 0, because NumberFormatException, the value is: null
2021-11-23 11:16:33.295 [main] INFO  PerfTrace - PerfTrace traceId=job_-1, isEnable=false, priority=0
2021-11-23 11:16:33.295 [main] INFO  JobContainer - DataX jobContainer starts job.
2021-11-23 11:16:33.299 [main] INFO  JobContainer - Set jobId = 0
2021-11-23 11:16:33.319 [job-0] WARN  TxtFileWriter$Job - 您使用format配置日期格式化, 這是不推薦的行爲, 請優先使用dateFormat配置項, 兩項同時存在則使用dateFormat.
2021-11-23 11:16:33.339 [job-0] WARN  UnstructuredStorageWriterUtil - 您的encoding配置爲空, 將使用默認值[UTF-8]
2021-11-23 11:16:33.340 [job-0] WARN  UnstructuredStorageWriterUtil - 您沒有配置列分隔符, 使用默認值[,]
2021-11-23 11:16:33.340 [job-0] INFO  JobContainer - jobContainer starts to do prepare ...
2021-11-23 11:16:33.341 [job-0] INFO  JobContainer - DataX Reader.Job [txtfilereader] do prepare work .
2021-11-23 11:16:33.343 [job-0] INFO  TxtFileReader$Job - add file [/home/backup/fact_sale.csv] as a candidate to be read.
2021-11-23 11:16:33.344 [job-0] INFO  TxtFileReader$Job - 您即將讀取的文件數爲: [1]
2021-11-23 11:16:33.345 [job-0] INFO  JobContainer - DataX Writer.Job [txtfilewriter] do prepare work .
2021-11-23 11:16:33.345 [job-0] INFO  TxtFileWriter$Job - 由於您配置了writeMode truncate, 開始清理 [/home/backup] 下面以 [fact_sale_new] 開頭的內容
2021-11-23 11:16:33.348 [job-0] INFO  JobContainer - jobContainer starts to do split ...
2021-11-23 11:16:33.349 [job-0] INFO  JobContainer - Job set Channel-Number to 2 channels.
2021-11-23 11:16:33.351 [job-0] INFO  JobContainer - DataX Reader.Job [txtfilereader] splits to [1] tasks.
2021-11-23 11:16:33.351 [job-0] INFO  TxtFileWriter$Job - begin do split...
2021-11-23 11:16:33.365 [job-0] INFO  TxtFileWriter$Job - splited write file name:[fact_sale_new__7b975784_087a_4270_94a4_11d55d290a68]
2021-11-23 11:16:33.365 [job-0] INFO  TxtFileWriter$Job - end do split.
2021-11-23 11:16:33.365 [job-0] INFO  JobContainer - DataX Writer.Job [txtfilewriter] splits to [1] tasks.
2021-11-23 11:16:33.387 [job-0] INFO  JobContainer - jobContainer starts to do schedule ...
2021-11-23 11:16:33.390 [job-0] INFO  JobContainer - Scheduler starts [1] taskGroups.
2021-11-23 11:16:33.391 [job-0] INFO  JobContainer - Running by standalone Mode.
2021-11-23 11:16:33.399 [taskGroup-0] INFO  TaskGroupContainer - taskGroupId=[0] start [1] channels for [1] tasks.
2021-11-23 11:16:33.407 [taskGroup-0] INFO  Channel - Channel set byte_speed_limit to -1, No bps activated.
2021-11-23 11:16:33.407 [taskGroup-0] INFO  Channel - Channel set record_speed_limit to -1, No tps activated.
2021-11-23 11:16:33.416 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] attemptCount[1] is started
2021-11-23 11:16:33.418 [0-0-0-writer] INFO  TxtFileWriter$Task - begin do write...
2021-11-23 11:16:33.418 [0-0-0-reader] INFO  TxtFileReader$Task - reading file : [/home/backup/fact_sale.csv]
2021-11-23 11:16:33.418 [0-0-0-writer] INFO  TxtFileWriter$Task - write to file : [/home/backup/fact_sale_new__7b975784_087a_4270_94a4_11d55d290a68]
2021-11-23 11:16:33.457 [0-0-0-reader] INFO  UnstructuredStorageReaderUtil - CsvReader使用默認值[{"captureRawRecord":true,"columnCount":0,"comment":"#","currentRecord":-1,"delimiter":",","escapeMode":1,"headerCount":0,"rawRecord":"","recordDelimiter":"\u0000","safetySwitch":false,"skipEmptyRecords":true,"textQualifier":"\"","trimWhitespace":true,"useComments":false,"useTextQualifier":true,"values":[]}],csvReaderConfig值爲[null]
2021-11-23 11:16:35.753 [0-0-0-writer] INFO  TxtFileWriter$Task - end do write
2021-11-23 11:16:35.821 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] taskId[0] is successed, used[2406]ms
2021-11-23 11:16:35.822 [taskGroup-0] INFO  TaskGroupContainer - taskGroup[0] completed it's tasks.
2021-11-23 11:16:43.415 [job-0] INFO  StandAloneJobContainerCommunicator - Total 684371 records, 16363343 bytes | Speed 1.56MB/s, 68437 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.652s |  All Task WaitReaderTime 0.076s | Percentage 100.00%
2021-11-23 11:16:43.415 [job-0] INFO  AbstractScheduler - Scheduler accomplished all tasks.
2021-11-23 11:16:43.416 [job-0] INFO  JobContainer - DataX Writer.Job [txtfilewriter] do post work.
2021-11-23 11:16:43.416 [job-0] INFO  JobContainer - DataX Reader.Job [txtfilereader] do post work.
2021-11-23 11:16:43.416 [job-0] INFO  JobContainer - DataX jobId [0] completed successfully.
2021-11-23 11:16:43.418 [job-0] INFO  HookInvoker - No hook invoked, because base dir not exists or is a file: /home/software/datax/hook
2021-11-23 11:16:43.420 [job-0] INFO  JobContainer - 
         [total cpu info] => 
                averageCpu                     | maxDeltaCpu                    | minDeltaCpu                    
                -1.00%                         | -1.00%                         | -1.00%
                        

         [total gc info] => 
                 NAME                 | totalGCCount       | maxDeltaGCCount    | minDeltaGCCount    | totalGCTime        | maxDeltaGCTime     | minDeltaGCTime     
                 PS MarkSweep         | 0                  | 0                  | 0                  | 0.000s             | 0.000s             | 0.000s             
                 PS Scavenge          | 11                 | 11                 | 11                 | 0.067s             | 0.067s             | 0.067s             

2021-11-23 11:16:43.420 [job-0] INFO  JobContainer - PerfTrace not enable!
2021-11-23 11:16:43.421 [job-0] INFO  StandAloneJobContainerCommunicator - Total 684371 records, 16363343 bytes | Speed 1.56MB/s, 68437 records/s | Error 0 records, 0 bytes |  All Task WaitWriterTime 0.652s |  All Task WaitReaderTime 0.076s | Percentage 100.00%
2021-11-23 11:16:43.422 [job-0] INFO  JobContainer - 
任務啓動時刻                    : 2021-11-23 11:16:33
任務結束時刻                    : 2021-11-23 11:16:43
任務總計耗時                    :                 10s
任務平均流量                    :            1.56MB/s
記錄寫入速度                    :          68437rec/s
讀出記錄總數                    :              684371
讀寫失敗總數                    :                   0

[root@10-31-1-119 bin]# 

The file that was written has the per-thread random suffix appended to form its actual name:


References:

  1. https://github.com/alibaba/DataX/blob/master/txtfilewriter/doc/txtfilewriter.md