Requirements
Suppose we need to migrate the song table, song, whose structure is as follows:
-- table song
create table song (
    id bigint(28) auto_increment comment 'id' primary key,
    program_series_id bigint(28) not null comment 'program series id',
    program_series_name varchar(256) null comment 'program series name',
    program_id bigint(28) not null comment 'program id',
    name varchar(256) null comment 'song name',
    singer_name varchar(256) null comment 'singer name (denormalized)',
    lyric text null comment 'lyrics',
    enable int(1) default 0 not null comment 'status: 0 = normal, -1 = disabled',
    need_pay int(1) default 0 not null comment 'paid content: 0 = free, 1 = paid',
    description varchar(255) null comment 'description',
    create_time datetime not null comment 'creation time',
    last_time datetime not null comment 'last update time',
    del_flag int(1) default 0 not null comment 'soft-delete flag: 0 = live, 1 = deleted'
);
There are two things to watch when syncing the data: because the table is large (tens of millions of rows), we need to read it in pages, and we need to record the position of the last read so that the next run can resume from there instead of starting over.
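The pagination-plus-checkpoint idea can be sketched in a few lines of Python. This is only a conceptual illustration: `rows` stands in for the song table and `checkpoint` for the persisted last_time value; both names are made up for the sketch.

```python
# In-memory stand-in for the song table; last_time is the tracked column.
rows = [
    {"id": 1, "last_time": 10},
    {"id": 2, "last_time": 20},
    {"id": 3, "last_time": 30},
]

def sync(rows, checkpoint, page_size=2):
    """Read rows newer than `checkpoint` page by page; return (synced, new_checkpoint)."""
    # Only rows at or after the checkpoint are considered, ordered by last_time.
    pending = sorted((r for r in rows if r["last_time"] >= checkpoint),
                     key=lambda r: r["last_time"])
    synced = []
    for start in range(0, len(pending), page_size):
        page = pending[start:start + page_size]   # one "page" of the large table
        synced.extend(page)
        checkpoint = page[-1]["last_time"]        # record the last position read
    return synced, checkpoint

# Resuming from checkpoint 20 re-reads only rows with last_time >= 20.
synced, cp = sync(rows, checkpoint=20)
```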
Define the Mapping
We need to define the corresponding mapping in ES:
PUT /song
{
  "settings": {
    "index": {
      "number_of_shards": "3",
      "number_of_replicas": "1"
    }
  },
  "mappings": {
    "properties": {
      "createTime": {
        "type": "date",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "delFlag": {
        "type": "long"
      },
      "description": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "enable": {
        "type": "long"
      },
      "id": {
        "type": "long"
      },
      "lastTime": {
        "type": "date",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "lyric": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "name": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "needPay": {
        "type": "long"
      },
      "programId": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "programSeriesId": {
        "type": "long"
      },
      "programSeriesName": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      },
      "singerName": {
        "type": "text",
        "analyzer": "ik_max_word",
        "fields": {
          "keyword": {
            "type": "keyword",
            "ignore_above": 256
          }
        }
      }
    }
  }
}
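For reference, a document indexed into this mapping carries the camelCase field names (the Logstash filter section later renames the snake_case columns accordingly). A sample document, with all values invented for illustration:

```python
import json

# Example document shaped to match the mapping above; every value is made up.
doc = {
    "id": 1,
    "programSeriesId": 100,
    "programSeriesName": "Demo Series",
    "programId": 1000,
    "name": "Demo Song",
    "singerName": "Demo Singer",
    "lyric": "la la la",
    "enable": 0,
    "needPay": 0,
    "description": "demo",
    "createTime": "2019-01-01T00:00:00",
    "lastTime": "2019-01-01T00:00:00",
    "delFlag": 0,
}
body = json.dumps(doc)  # the JSON body that would be indexed
```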
Logstash
First, download Logstash; its version should ideally match the ES version. I'm using 7.1.0 here. We'll use the JDBC input plugin to do the sync.
Copy the JDBC Driver
Logstash does not ship with a JDBC driver, so we need to supply one ourselves: put the JDBC driver jar into the logstash-7.1.0/logstash-core/lib/jars directory.
Prepare the SQL File
Prepare our song.sql file:
select id,
program_series_id,
program_series_name,
program_id,
name,
singer_name,
lyric,
enable,
need_pay,
description,
create_time,
last_time,
del_flag
from song
where ifnull(`last_time`, str_to_date('1970-01-01 00:00:00', '%Y-%m-%d %H:%i:%s')) >= :sql_last_value
JDBC Input Plugin Configuration
input {
  jdbc {
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_connection_string => "jdbc:mysql://topic.com:3306/oos_topic?useUnicode=true&characterEncoding=UTF-8&autoReconnect=true&useSSL=false&zeroDateTimeBehavior=convertToNull&serverTimezone=Asia/Shanghai"
    jdbc_user => "root"
    jdbc_password => "ysten123"
    # enable value tracking; when true, tracking_column must be specified
    use_column_value => true
    # the column to track
    tracking_column => "last_time"
    # type of the tracking column: only numeric and timestamp are supported; numeric is the default
    tracking_column_type => "timestamp"
    # record the value from the last run
    record_last_run => true
    # where that last-run value is stored
    last_run_metadata_path => "/Users/wu/tools/logstash-7.1.0/jdbc-position.txt"
    # statement => "select id, program_series_id, program_series_name, program_id, name, singer_name, lyric, enable, need_pay, description, create_time, last_time, del_flag from song where id > :id"
    statement_filepath => "/Users/wu/tools/logstash-7.1.0/song.sql"
    schedule => "* * * * * *"
    jdbc_paging_enabled => true
    jdbc_page_size => 50000
  }
}
filter {
  mutate {
    rename => { "create_time" => "createTime" }
    rename => { "last_time" => "lastTime" }
    rename => { "program_series_id" => "programSeriesId" }
    rename => { "program_series_name" => "programSeriesName" }
    rename => { "program_id" => "programId" }
    rename => { "singer_name" => "singerName" }
    rename => { "need_pay" => "needPay" }
    rename => { "del_flag" => "delFlag" }
  }
}
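The mutate/rename pairs above are just a mechanical snake_case to camelCase conversion. The same mapping could be produced by a small helper like this (the helper name is hypothetical, written only to show the pattern):

```python
def snake_to_camel(name):
    """Convert snake_case to camelCase, e.g. program_series_id -> programSeriesId."""
    head, *rest = name.split("_")
    return head + "".join(part.capitalize() for part in rest)

# Reproduce the rename table from the filter block.
renames = {col: snake_to_camel(col)
           for col in ["create_time", "last_time", "program_series_id",
                       "program_series_name", "program_id", "singer_name",
                       "need_pay", "del_flag"]}
```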
output {
  elasticsearch {
    document_id => "%{id}"
    document_type => "_doc"
    index => "song"
    hosts => ["http://localhost:9200", "http://localhost:9201", "http://localhost:9202"]
  }
  stdout {
    codec => rubydebug
  }
}
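One detail in the output block worth calling out: document_id => "%{id}" reuses the table's primary key as the Elasticsearch document id, so re-running the sync updates existing documents in place instead of creating duplicates. A minimal sketch of that behavior, with a plain dict standing in for the index:

```python
# A dict keyed by document id stands in for the ES index.
index = {}

def index_doc(doc):
    # Indexing with a fixed id overwrites any previous version of the document,
    # which is what makes repeated syncs idempotent.
    index[doc["id"]] = doc

index_doc({"id": 1, "name": "song A"})
index_doc({"id": 1, "name": "song A (renamed)"})  # same row, updated content
```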
A few things to note:
– Do not end the SQL with a semicolon. Because the sync uses pagination, Logstash wraps our SQL in a count query, and a trailing ; makes that count statement fail.
– The tracking column type only supports numeric and timestamp; the default is numeric.
– It's best to set the database time zone explicitly.
– The SQL must explicitly select the tracking_column.
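The semicolon issue can be seen by sketching how a paging layer derives a count query from the statement (this is an illustrative reconstruction, not the exact SQL Logstash generates):

```python
def count_query(statement):
    """Wrap a SELECT in a count subquery, as a paging layer must do."""
    return f"SELECT count(*) AS count FROM ({statement}) AS t"

good = count_query("select id from song")
bad = count_query("select id from song;")  # the trailing ';' ends up mid-query
# `bad` now contains "song;) AS t", which is invalid SQL -- the error the note warns about
```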
Start
Logstash is fairly slow to start up.
bin/logstash -f song.yaml