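The new project uses OGG (Oracle GoldenGate) to extract data and store it on HDFS, organized by year-month-day plus hour. Because OGG records keep both the pre-change and the post-change data, the text is saved in JSON format: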
{"table":"TEST.TT_SALES_RECORDS","op_type":"U","op_ts":"2020-05-19 02:05:03.000701","current_ts":"2020-05-19T10:05:10.427000","before":{"ID":178733,"PS_ORDER_NO":"PV2003110002","PS_ORDER_STATUS":35121004,"PS_ORDER_TYPE":null,"PS_ORDER_DATE":"2020-03-11 17:52:20","SALES_CONSULTANT":null,"BUYER_NAME":"猪哈哈"},"after":{"ID":178733,"PS_ORDER_NO":"PV2003110002","PS_ORDER_STATUS":35121004,"PS_ORDER_TYPE":null,"PS_ORDER_DATE":"2020-03-12 17:52:20","SALES_CONSULTANT":null,"BUYER_NAME":"猪坚强"}}
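Problem 1: the data has no primary key, so row_id must be configured
Configuration details omitted
Problem 2: the JSON records are written with no newline between them, so a delimiter has to be set on the handler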
gg.handler.name.format=json
gg.handler.name.format.jsonDelimiter=CDATA[\n]
Final text format (one record per line, with the row id in the DBROWID token):
{"table":"TEST.TT_SALES_RECORDS","op_type":"U","op_ts":"2020-05-19 02:05:03.000701","current_ts":"2020-05-19T10:05:10.427000","tokens":{"DBROWID":"AABIl2AAiAABwGZAAB"},"before":{"ID":178733,"PS_ORDER_NO":"PV2003110002","PS_ORDER_STATUS":35121004,"PS_ORDER_TYPE":null,"PS_ORDER_DATE":"2020-03-11 17:52:20","SALES_CONSULTANT":null,"BUYER_NAME":"猪哈哈"},"after":{"ID":178733,"PS_ORDER_NO":"PV2003110002","PS_ORDER_STATUS":35121004,"PS_ORDER_TYPE":null,"PS_ORDER_DATE":"2020-03-12 17:52:20","SALES_CONSULTANT":null,"BUYER_NAME":"猪坚强"}}
Then use sqoop to extract the full table to HDFS, with the rowid included as a column:
sqoop import --connect jdbc:oracle:thin:@//IP:1521/DB --username u --password p --query "SELECT rowidtochar(t.ROWID) as ROW_ID,t.* FROM SBPOPT.TT_TEST t where \$CONDITIONS " --delete-target-dir --target-dir /user/asmp/hive/ogg/tt_test --split-by bill_no --as-parquetfile -m 16
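After that, Spark can read the data, parse the change records, and merge them into the full data set.
Notes:
Usually the source side enables supplemental logging only for primary key, unique key and foreign key columns;
Tables without a primary key have all-column supplemental logging enabled (it is not enabled for every table because all-column supplemental logging makes the log volume much larger and slows down commits on the source database)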
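The merge logic can be sketched in plain Python to show the idea. This is not the Spark job itself. It assumes that change records have already been parsed into dicts with the hypothetical keys `row_id`, `op_type`, `op_ts` and `after`, that `op_type` is "I", "U" or "D", and that every after image is complete.

```python
def merge_changes(snapshot, changes):
    """Apply change records to a full snapshot keyed by row id.

    snapshot: dict mapping row_id -> row dict (the sqoop full extract)
    changes:  list of parsed change records
    """
    merged = {rid: dict(row) for rid, row in snapshot.items()}
    # op_ts strings like "2020-05-19 02:05:03.000701" sort chronologically.
    for rec in sorted(changes, key=lambda r: r["op_ts"]):
        rid = rec["row_id"]
        if rec["op_type"] == "D":
            merged.pop(rid, None)             # delete: drop the row
        else:
            merged[rid] = dict(rec["after"])  # insert/update: take the after image
    return merged

snapshot = {
    "AAA": {"ID": 1, "BUYER_NAME": "old"},
    "AAB": {"ID": 2, "BUYER_NAME": "keep"},
}
changes = [
    {"row_id": "AAA", "op_type": "U", "op_ts": "2020-05-19 02:05:03.000701",
     "after": {"ID": 1, "BUYER_NAME": "new"}},
    {"row_id": "AAC", "op_type": "I", "op_ts": "2020-05-19 02:06:00.000000",
     "after": {"ID": 3, "BUYER_NAME": "added"}},
    {"row_id": "AAB", "op_type": "D", "op_ts": "2020-05-19 02:07:00.000000",
     "after": None},
]
result = merge_changes(snapshot, changes)
print(sorted(result))
```

In Spark the same idea would typically be expressed as keeping the latest change per row id and joining it against the full extract. The dict version only illustrates the ordering and the per-operation handling.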
Tips:
Given the above, the parsing code has to check whether a record carries only the primary key columns!
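One way to make that check is sketched below, under assumptions: the key column list is supplied by the caller (here `["ID"]`, chosen only for illustration), and columns that were not logged are simply absent from the image rather than present as null. The helper names are hypothetical.

```python
def has_only_keys(image, key_cols):
    """True when a before/after image carries nothing but key columns."""
    return image is not None and set(image) <= set(key_cols)

def apply_after_image(existing, after):
    """Overlay a possibly partial after image on the existing row.

    Columns missing from the after image keep their previous values.
    """
    row = dict(existing or {})
    row.update(after or {})
    return row

existing = {"ID": 178733, "PS_ORDER_NO": "PV2003110002", "BUYER_NAME": "猪哈哈"}
partial_after = {"ID": 178733, "BUYER_NAME": "猪坚强"}
key_only_before = {"ID": 178733}

print(has_only_keys(key_only_before, ["ID"]))
updated = apply_after_image(existing, partial_after)
print(updated)
```

When an image contains only keys, its other columns should not be read as nulls. Overlaying the after image on the existing row, instead of replacing the row, avoids wiping columns that were never logged.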