I have just started working with Flume and wanted to write MySQL data into Kafka for downstream processing. Kafka is already installed on my cluster (install it first if you have not). After reading through the Flume documentation, the tool turned out to be fairly easy to use, but I still ran into quite a few problems along the way, so I am listing them here one by one to save later users some detours!
First, some searching showed two ways to get MySQL data into Kafka. One is to read MySQL's binlog, but from what I found you apparently have to parse the log yourself, and incremental loads are awkward. The other is a jar called flume-ng-sql-source, which can read data from relational databases into Kafka. See the project page!
https://github.com/mvalleavila/flume-ng-sql-source has a fairly detailed introduction. I downloaded the jar,
flume-ng-sql-source-1.4.4.jar, from the following address:
https://pan.baidu.com/s/1dlWpLmt-pstoxmoUHeW9IA  extraction code: thf7
To read from MySQL you also need the JDBC driver, mysql-connector-java-5.1.43-bin.jar, available at:
https://pan.baidu.com/s/14O40gXY1r9iQmQb9TCvy6g  extraction code: d834
Both jars must be placed in the lib directory of the Flume installation. Since I installed via CDH, my directory is:
/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/flume-ng/lib
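Before moving on, it is worth checking that both jars actually landed in Flume's lib directory, since a missing jar only surfaces later as a ClassNotFoundException at startup. Here is a minimal sketch; the helper function is my own illustration, and the CDH path is just the one from my install:

```python
import os

def jars_missing(lib_dir, jar_names):
    """Return the jars from jar_names that are not present in lib_dir."""
    present = set(os.listdir(lib_dir))
    return [j for j in jar_names if j not in present]

# Hypothetical check against the CDH lib directory used above;
# this does nothing on machines where that path does not exist.
lib_dir = "/opt/cloudera/parcels/CDH-6.0.1-1.cdh6.0.1.p0.590678/lib/flume-ng/lib"
if os.path.isdir(lib_dir):
    print(jars_missing(lib_dir, [
        "flume-ng-sql-source-1.4.4.jar",
        "mysql-connector-java-5.1.43-bin.jar",
    ]))
```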
Configuration
Then create the flume_mysql.conf file with the following content:
a1.channels = c1
a1.sources = s1
a1.sinks = k1
###########sources#################
####s1######
a1.sources.s1.type = org.keedio.flume.source.SQLSource
a1.sources.s1.hibernate.connection.url = jdbc:mysql://slave2:3306/aaa
a1.sources.s1.hibernate.connection.user = root
a1.sources.s1.hibernate.connection.password = 123456
a1.sources.s1.hibernate.connection.autocommit = true
a1.sources.s1.table = test2
a1.sources.s1.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
a1.sources.s1.hibernate.connection.driver_class = com.mysql.jdbc.Driver
a1.sources.s1.run.query.delay=10000
a1.sources.s1.status.file.path = /opt/flume/flume_status
a1.sources.s1.status.file.name = sqlSource.status
a1.sources.s1.batch.size = 1000
a1.sources.s1.max.rows = 1000
a1.sources.s1.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
a1.sources.s1.hibernate.c3p0.min_size=1
a1.sources.s1.hibernate.c3p0.max_size=100000
############channels###############
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
############sinks##################
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = my-topic
a1.sinks.k1.brokerList = slave2:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1
a1.sources.s1.channels=c1
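The status.file.path/status.file.name settings above tell the source where to record how far it has read, which is what makes incremental extraction possible across restarts. The file is a small JSON document; the sketch below shows how to read it. Note that the key names are an assumption from inspecting my own status file, not something guaranteed by the project, so check yours before relying on them:

```python
import json

# Example of what sqlSource.status can look like after a run.
# The exact key names are an ASSUMPTION based on my own file;
# verify against your /opt/flume/flume_status/sqlSource.status.
sample_status = ('{"SourceName":"s1",'
                 '"URL":"jdbc:mysql://slave2:3306/aaa",'
                 '"LastIndex":"1000",'
                 '"Query":"select * from test2"}')

def last_index(status_text):
    """Read the position the source will resume from on its next poll."""
    return int(json.loads(status_text)["LastIndex"])

print(last_index(sample_status))  # → 1000
```

In my experience, deleting the status file makes the source re-read the table from the beginning, which is handy when re-testing.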
Then start it with the following command:
flume-ng agent -c conf -f /opt/flume/flume_mysql.conf -n a1 -Dflume.root.logger=INFO,console
It then failed with an error. After searching for quite a while, it turned out to be a CDH-specific problem. I could not find a workaround, so I installed vanilla Apache Flume instead, and the problem was solved!
Next, to test incremental extraction from MySQL, modify flume_mysql.conf as follows:
a1.channels = c1
a1.sources = s1
a1.sinks = k1
###########sources#################
####s1######
a1.sources.s1.type = org.keedio.flume.source.SQLSource
a1.sources.s1.hibernate.connection.url = jdbc:mysql://slave2:3306/aaa
a1.sources.s1.hibernate.connection.user = root
a1.sources.s1.hibernate.connection.password = 123456
a1.sources.s1.hibernate.connection.autocommit = true
a1.sources.s1.table = test2
a1.sources.s1.hibernate.dialect = org.hibernate.dialect.MySQL5Dialect
a1.sources.s1.hibernate.connection.driver_class = com.mysql.jdbc.Driver
a1.sources.s1.run.query.delay=10000
a1.sources.s1.status.file.path = /opt/flume/flume_status
a1.sources.s1.status.file.name = sqlSource.status
a1.sources.s1.batch.size = 1000
a1.sources.s1.max.rows = 1000
a1.sources.s1.hibernate.connection.provider_class = org.hibernate.connection.C3P0ConnectionProvider
a1.sources.s1.hibernate.c3p0.min_size=1
a1.sources.s1.hibernate.c3p0.max_size=100000
a1.sources.s1.columns.to.select=create_time,id,name
a1.sources.s1.custom.query=select create_time,id,name from test2 where UNIX_TIMESTAMP(create_time)>UNIX_TIMESTAMP('$@$') order by create_time
############channels###############
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 10000
a1.channels.c1.byteCapacityBufferPercentage = 20
a1.channels.c1.byteCapacity = 800000
############sinks##################
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.topic = my-topic
a1.sinks.k1.brokerList = slave2:9092
a1.sinks.k1.requiredAcks = 1
a1.sinks.k1.batchSize = 20
a1.sinks.k1.channel = c1
a1.sources.s1.channels=c1
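The key line in this config is custom.query: the $@$ placeholder is replaced before each poll with the last value saved in the status file, so every run only fetches rows newer than the previous one. A rough illustration of that substitution (my own sketch, not the library's actual code):

```python
def build_incremental_query(template, last_value):
    """Substitute the last saved value for the $@$ placeholder,
    illustrating (as an assumption) what flume-ng-sql-source
    does before each poll."""
    return template.replace("$@$", last_value)

template = ("select create_time,id,name from test2 "
            "where UNIX_TIMESTAMP(create_time)>UNIX_TIMESTAMP('$@$') "
            "order by create_time")

query = build_incremental_query(template, "2019-01-01 00:00:00")
print(query)
```

The ORDER BY on create_time matters here: rows must arrive in increasing order of the incremental column, or later rows could be recorded as "already read" before earlier ones were fetched.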
The test2 table in MySQL has the columns create_time, id, and name (the original screenshots of the table and its rows are not reproduced here). This way, as long as create_time is updated to the latest value, the data is extracted incrementally!
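To watch the incremental pickup in action, you can insert rows with fresh create_time values and see them appear on the Kafka topic. A quick generator for such INSERT statements; the table and column names come from the config above, while the timestamps and names are purely illustrative:

```python
from datetime import datetime, timedelta

def make_inserts(start, names):
    """Build INSERT statements for test2 with strictly increasing
    create_time values, so each new batch passes the
    UNIX_TIMESTAMP(create_time) > UNIX_TIMESTAMP('$@$') filter."""
    stmts = []
    for i, name in enumerate(names):
        ts = (start + timedelta(seconds=i)).strftime("%Y-%m-%d %H:%M:%S")
        stmts.append("insert into test2 (create_time,id,name) "
                     f"values ('{ts}',{i + 1},'{name}');")
    return stmts

for stmt in make_inserts(datetime(2019, 1, 1), ["tom", "jerry"]):
    print(stmt)
```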