A Technical Exploration of Simulating Real-Time Data from Offline Data

Background

Business scenario: the test environment has no real data, so offline data must be made to look like real-time data. Directories named by year-month-day, and the date fields inside the text files, all need to be rewritten to today's date or to some future day.

The raw data is CSV, stored as plain text (character sequences, no binary) with tab as the field separator, roughly 5-6 million records per day. The test environment ships with MySQL 5.1, whose default storage engine is MyISAM. Below I briefly compare InnoDB and MyISAM on a few aspects of performance.

CSV sample: (screenshot not reproduced here)

Because the data volume is large, loading it into memory and sorting gives poor retrieval performance. Instead, a database server can be borrowed to persist the day's data, with an index on the field that represents time. The records between 1 a.m. and 6 a.m. are discarded, and that window is used as the loading time; since loading must run at a fixed time each day, the Quartz framework can drive it. For bulk inserts within a short time it is better to insert first and build indexes afterwards, so the temporary table is dropped, created, loaded, and indexed once per day. If the remote server's realtime table is fed every X seconds, two steps must complete within X seconds: (1) retrieve the data from X seconds ago up to the current time; (2) insert it into the remote table.
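
Step (1) of each cycle can be sketched as follows. This is a minimal illustration, assuming the day's data lives in a table like realtemp2 with a TRADE_DATE column (names taken from the processlist output later in this article); it only builds the window query, and does not perform the remote insert:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

public class WindowQueryBuilder {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Step 1 of each cycle: select rows whose TRADE_DATE falls in
    // the half-open window [now - windowSeconds, now).
    public static String buildWindowQuery(LocalDateTime now, int windowSeconds) {
        LocalDateTime from = now.minusSeconds(windowSeconds);
        return "SELECT * FROM realtemp2 WHERE TRADE_DATE >= '"
                + FMT.format(from) + "' AND TRADE_DATE < '" + FMT.format(now) + "'";
    }
}
```

The half-open window matters: it prevents a row whose timestamp lands exactly on a cycle boundary from being picked up twice or skipped.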

First, consider inserting a large volume of data in a short time. Reading a single text file is IO-bound, so whether multithreaded reading helps depends on the server's disk layout (one, two, or three platters): with a single read head, multiple reader threads gain nothing because they all share that head, and the context-switching overhead can actually drag performance down rather than improve it. So we use a single reader thread and multiple writer threads. How much data per write? Given the CSV format, we use JDBC + C3P0 and issue one INSERT INTO table VALUES (…),(…),… statement for every 10,000 rows.

import org.apache.log4j.Logger;

public class BatchInsertThread implements Runnable {
    private String sql;
    private static Logger log = Logger.getLogger(BatchInsertThread.class);

    public BatchInsertThread(String sql) {
        super();
        this.sql = sql;
    }

    @Override
    public void run() {
        BaseDao.batchInsertData(sql); // keep going even if one batch throws
    }
}
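
For reference, here is a rough sketch of how such a 10,000-row batched statement can be assembled from tab-separated lines. The column list is taken from the INSERT statements visible in the processlist output below; the quoting here is deliberately naive, and real code should use proper escaping or a driver-side mechanism:

```java
import java.util.List;

public class SqlBatchBuilder {
    // Assemble one multi-row INSERT from tab-separated CSV lines.
    // Column list matches the realtemp2 table seen in the processlist output.
    public static String buildInsert(List<String> lines) {
        StringBuilder sb = new StringBuilder(
                "INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,"
                + "TRADE_TYPE,START_ADDRESS,DESTINATION) VALUES ");
        for (int i = 0; i < lines.size(); i++) {
            String[] cols = lines.get(i).split("\t", -1);
            if (i > 0) sb.append(',');
            sb.append('(');
            for (int j = 0; j < cols.length; j++) {
                if (j > 0) sb.append(',');
                // naive quoting: double any single quote in the value
                sb.append('\'').append(cols[j].replace("'", "''")).append('\'');
            }
            sb.append(')');
        }
        return sb.toString();
    }
}
```

A single multi-row statement amortizes the per-statement round trip and parse cost, which is why it beats 10,000 individual INSERTs.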

We then tried the actual load and found the timing hard to control: before the first batch had finished inserting, the thread for the second batch had already started, then the thread for the third, and so on…

mysql> show processlist;
+----+------+-----------------+----------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
| Id | User | Host | db | Command | Time | State | Info |
+----+------+-----------------+----------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
| 14 | root | localhost | realtime | Sleep | 7715 | | NULL |
| 15 | root | localhost:47508 | realtime | Query | 5025 | Waiting for table metadata lock | ALTER TABLE realtemp2 ADD INDEX idx_realt_tradedate ( TRADE_DATE) USING BTREE |
| 16 | root | localhost:47510 | realtime | Sleep | 7072 | | NULL |
| 17 | root | localhost:47509 | realtime | Sleep | 7081 | | NULL |
| 18 | root | localhost:47512 | realtime | Sleep | 7473 | | NULL |
| 19 | root | localhost:47514 | realtime | Sleep | 7233 | | NULL |
| 20 | root | localhost:47515 | realtime | Sleep | 6262 | | NULL |
| 21 | root | localhost:47513 | realtime | Sleep | 6745 | | NULL |
| 22 | root | localhost:47516 | realtime | Sleep | 7273 | | NULL |
| 23 | root | localhost:47517 | realtime | Sleep | 7286 | | NULL |
| 24 | root | localhost:47518 | realtime | Sleep | 7344 | | NULL |
| 25 | root | localhost:47519 | realtime | Sleep | 7305 | | NULL |
| 26 | root | localhost:47520 | realtime | Sleep | 7032 | | NULL |
| 27 | root | localhost:47521 | realtime | Sleep | 6551 | | NULL |
| 28 | root | localhost:47522 | realtime | Sleep | 6276 | | NULL |
| 29 | root | localhost:47523 | realtime | Sleep | 6580 | | NULL |
| 30 | root | localhost:47524 | realtime | Sleep | 6117 | | NULL |
| 31 | root | localhost:47525 | realtime | Sleep | 6671 | | NULL |
| 32 | root | localhost:47526 | realtime | Sleep | 5981 | | NULL |
| 33 | root | localhost:47527 | realtime | Sleep | 5202 | | NULL |
| 34 | root | localhost:47528 | realtime | Sleep | 5514 | | NULL |
| 35 | root | localhost:47529 | realtime | Sleep | 2875 | | NULL |
| 36 | root | localhost:47531 | realtime | Query | 3151 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 37 | root | localhost:47530 | realtime | Sleep | 3030 | | NULL |
| 38 | root | localhost:47533 | realtime | Sleep | 6470 | | NULL |
| 39 | root | localhost:47534 | realtime | Sleep | 6510 | | NULL |
| 40 | root | localhost:47535 | realtime | Sleep | 6550 | | NULL |
| 41 | root | localhost:47537 | realtime | Sleep | 6229 | | NULL |
| 42 | root | localhost:47536 | realtime | Sleep | 6390 | | NULL |
| 43 | root | localhost:47538 | realtime | Query | 2900 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 44 | root | localhost:47539 | realtime | Query | 2927 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 45 | root | localhost:47540 | realtime | Sleep | 6350 | | NULL |
| 46 | root | localhost:47541 | realtime | Sleep | 6189 | | NULL |
| 47 | root | localhost:47542 | realtime | Sleep | 6028 | | NULL |
| 48 | root | localhost:47543 | realtime | Sleep | 6069 | | NULL |
| 49 | root | localhost:47544 | realtime | Query | 2882 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 50 | root | localhost:47545 | realtime | Sleep | 3971 | | NULL |
| 51 | root | localhost:47546 | realtime | Sleep | 5788 | | NULL |
| 52 | root | localhost:47547 | realtime | Query | 3401 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 53 | root | localhost:47548 | realtime | Query | 4761 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 54 | root | localhost:47549 | realtime | Sleep | 5908 | | NULL |
| 55 | root | localhost:47550 | realtime | Sleep | 5948 | | NULL |
| 56 | root | localhost:47551 | realtime | Sleep | 4568 | | NULL |
| 57 | root | localhost:47552 | realtime | Sleep | 3293 | | NULL |
| 58 | root | localhost:47553 | realtime | Sleep | 4817 | | NULL |
| 59 | root | localhost:47554 | realtime | Sleep | 5302 | | NULL |
| 60 | root | localhost:47555 | realtime | Sleep | 5707 | | NULL |
| 61 | root | localhost:47556 | realtime | Sleep | 5426 | | NULL |
| 62 | root | localhost:47557 | realtime | Query | 3114 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 63 | root | localhost:47558 | realtime | Query | 2836 | Waiting for table metadata lock | INSERT INTO realtemp2 (CARD_ID, TRADE_DATE, TRADE_ADDRESS,TRADE_TYPE,START_ADDRESS,DESTINATION) VALU |
| 64 | root | localhost:47559 | realtime | Sleep | 3167 | | NULL |
| 65 | root | localhost:47560 | realtime | Sleep | 3181 | | NULL |
| 66 | root | localhost | realtime | Query | 0 | init | show processlist |
+----+------+-----------------+----------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
53 rows in set (0.00 sec)

We found that a large number of connections had been created in a short time, and several of them easily ended up stalled on Waiting for table metadata lock, effectively deadlocking the load. Releasing the locks required restarting the database or killing the offending process IDs (sometimes even kill does not work). The experiment above was only for testing; in practice a single-threaded blocking read/write loop is enough to meet the requirement.

Next, we made the card number (varchar) and trade time (datetime) a composite primary key, and additionally built a separate index on trade time after the data had been inserted. We then tried loading the day's data on the MyISAM storage engine.
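
For clarity, the table being loaded looked roughly like this. The column names come from the processlist output above; the exact column types and lengths are an assumption for illustration:

```sql
CREATE TABLE realtemp2 (
    CARD_ID       VARCHAR(32)  NOT NULL,   -- card number
    TRADE_DATE    DATETIME     NOT NULL,   -- trade time
    TRADE_ADDRESS VARCHAR(64),
    TRADE_TYPE    VARCHAR(8),
    START_ADDRESS VARCHAR(64),
    DESTINATION   VARCHAR(64),
    PRIMARY KEY (CARD_ID, TRADE_DATE)      -- composite primary key
);

-- built separately, after the bulk load:
ALTER TABLE realtemp2 ADD INDEX idx_realt_tradedate (TRADE_DATE) USING BTREE;
```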

……
2017 Mar 22 15:28:59,702 DEBUG [TextUtil:47] success at 535
2017 Mar 22 15:29:32,052 DEBUG [TextUtil:47] success at 536
2017 Mar 22 15:30:06,576 DEBUG [TextUtil:47] success at 537
2017 Mar 22 15:30:41,522 DEBUG [TextUtil:47] success at 538
2017 Mar 22 15:31:14,240 DEBUG [TextUtil:47] success at 539
2017 Mar 22 15:31:47,557 DEBUG [TextUtil:47] success at 540
2017 Mar 22 15:32:23,786 DEBUG [TextUtil:47] success at 541
2017 Mar 22 15:32:57,608 DEBUG [TextUtil:47] success at 542
2017 Mar 22 15:33:29,735 DEBUG [TextUtil:47] success at 543
2017 Mar 22 15:34:03,540 DEBUG [TextUtil:47] success at 544
2017 Mar 22 15:34:39,732 DEBUG [TextUtil:47] success at 545
2017 Mar 22 15:35:13,377 DEBUG [TextUtil:47] success at 546
2017 Mar 22 15:35:47,047 DEBUG [TextUtil:47] success at 547
2017 Mar 22 15:35:47,198 DEBUG [Action:55] inserted data: 18051s
2017 Mar 22 16:19:03,794 DEBUG [Action:58] built index: 2596s

On the MyISAM storage engine, with 10,000 rows per INSERT and the composite primary key, inserting the data took 18,051 s ≈ 5.01 hours and building the index took 2,596 s ≈ 43.27 minutes.

MySQL 5.1:
mysql> SHOW variables like "have_%";
Connection id: 150899
Current database: realtime
+-------------------------+----------+
| Variable_name           | Value    |
+-------------------------+----------+
| have_community_features | YES      |
| have_compress           | YES      |
| have_crypt              | YES      |
| have_csv                | YES      |
| have_dynamic_loading    | YES      |
| have_geometry           | YES      |
| have_innodb             | YES      |
| have_ndbcluster         | NO       |
| have_openssl            | DISABLED |
| have_partitioning       | YES      |
| have_query_cache        | YES      |
| have_rtree_keys         | YES      |
| have_ssl                | DISABLED |
| have_symlink            | DISABLED |
+-------------------------+----------+
14 rows in set (0.05 sec)

MySQL 5.1 supports switching to InnoDB. To make InnoDB the default engine, add the line default-storage-engine=INNODB under the [mysqld] section of /etc/my.cnf, save, and restart.
Verify with: mysql> show engines;
After switching storage engines, however, the same insert scheme still took about 5 hours.

We uninstalled 5.1 and installed 5.6 (see the end of this article for the steps),
and raised the batch size to 50,000 rows per INSERT. The results:

……
2017 Mar 23 15:41:59,626 DEBUG [TextUtil:54] success at 95
2017 Mar 23 15:43:53,861 DEBUG [TextUtil:54] success at 96
2017 Mar 23 15:45:51,042 DEBUG [TextUtil:54] success at 97
2017 Mar 23 15:47:47,207 DEBUG [TextUtil:54] success at 98
2017 Mar 23 15:49:47,012 DEBUG [TextUtil:54] success at 99
2017 Mar 23 15:51:48,821 DEBUG [TextUtil:54] success at 100
2017 Mar 23 15:53:52,469 DEBUG [TextUtil:54] success at 101
2017 Mar 23 15:55:46,813 DEBUG [TextUtil:54] success at 102
2017 Mar 23 15:57:42,407 DEBUG [TextUtil:54] success at 103
2017 Mar 23 15:59:42,079 DEBUG [TextUtil:54] success at 104
2017 Mar 23 16:01:40,422 DEBUG [TextUtil:54] success at 105
2017 Mar 23 16:01:40,500 DEBUG [Action:56] inserted data: 5867s
2017 Mar 23 16:02:44,538 DEBUG [Action:59] built index: 64s

On MySQL 5.6's InnoDB engine, with 50,000 rows per INSERT and the composite primary key, inserting the data took 5,867 s ≈ 1.63 hours and building the index took 64 s.
Comparing the first 4, middle 4, and last 4 log entries:

**First 4 entries**
2017 Mar 23 14:23:58,981 DEBUG [TextUtil:54] success at 1
2017 Mar 23 14:24:01,351 DEBUG [TextUtil:54] success at 2
2017 Mar 23 14:24:03,611 DEBUG [TextUtil:54] success at 3
2017 Mar 23 14:24:05,813 DEBUG [TextUtil:54] success at 4
**Middle 4 entries**
2017 Mar 23 14:41:07,941 DEBUG [TextUtil:54] success at 53
2017 Mar 23 14:42:05,892 DEBUG [TextUtil:54] success at 54
2017 Mar 23 14:43:05,115 DEBUG [TextUtil:54] success at 55
2017 Mar 23 14:44:01,821 DEBUG [TextUtil:54] success at 56
**Last 4 entries**
2017 Mar 23 15:55:46,813 DEBUG [TextUtil:54] success at 102
2017 Mar 23 15:57:42,407 DEBUG [TextUtil:54] success at 103
2017 Mar 23 15:59:42,079 DEBUG [TextUtil:54] success at 104
2017 Mar 23 16:01:40,422 DEBUG [TextUtil:54] success at 105

At the start, with little data in the table, a 50,000-row batch took only a few seconds; by around 2.5 million rows it took about 1 minute per batch; above 5 million rows, about 2 minutes per batch. The composite primary key begins with a varchar column, which the database can only compare character by character, so every inserted row pays for an n-character varchar comparison plus a datetime comparison while the key is maintained. We dropped the composite primary key, replaced it with an auto-increment int id, kept the batch size at 50,000 rows, and got the following:
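
The schema change amounts to something like the following (again a sketch; the exact definition of the id column is an assumption):

```sql
-- replace the composite primary key with a surrogate key
CREATE TABLE realtemp2 (
    id            INT          NOT NULL AUTO_INCREMENT,
    CARD_ID       VARCHAR(32)  NOT NULL,
    TRADE_DATE    DATETIME     NOT NULL,
    TRADE_ADDRESS VARCHAR(64),
    TRADE_TYPE    VARCHAR(8),
    START_ADDRESS VARCHAR(64),
    DESTINATION   VARCHAR(64),
    PRIMARY KEY (id)
);
```

With InnoDB this also means rows are appended in key order during the load, rather than being inserted at effectively random positions in a varchar-ordered clustered index.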

2017 Mar 25 01:17:05,881 DEBUG [TextUtil:54] success at 110
2017 Mar 25 01:17:07,628 DEBUG [TextUtil:54] success at 111
2017 Mar 25 01:17:08,725 DEBUG [TextUtil:54] success at 112
2017 Mar 25 01:17:09,698 DEBUG [TextUtil:54] success at 113
2017 Mar 25 01:17:11,140 DEBUG [TextUtil:54] success at 114
2017 Mar 25 01:17:13,409 DEBUG [TextUtil:54] success at 115
2017 Mar 25 01:17:15,346 DEBUG [TextUtil:54] success at 116
2017 Mar 25 01:17:15,365 DEBUG [LoadNextDayDataJob:43] inserted data: 735s
2017 Mar 25 01:36:33,916 DEBUG [LoadNextDayDataJob:46] built index: 1158s
2017 Mar 25 01:36:33,917 INFO  [LoadNextDayDataJob:48] MyJob  is end .....................

This log comes from the nightly Quartz job. On MySQL 5.6's InnoDB engine, with an auto-increment int id and 50,000 rows per INSERT, inserting the data took 735 s ≈ 12.25 minutes and building the index took 1,158 s ≈ 19.3 minutes.
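
For a rough comparison, using the batch counts visible in the three log excerpts above (547, 105, and 116 batches respectively), and assuming every batch held the full batch size, the runs work out to roughly 300, 900, and 7,900 rows per second:

```java
public class ThroughputCompare {
    // rows per second, in integer arithmetic
    static long rowsPerSecond(long batches, long rowsPerBatch, long seconds) {
        return batches * rowsPerBatch / seconds;
    }

    public static void main(String[] args) {
        // figures read from the log excerpts above
        System.out.println(rowsPerSecond(547, 10_000, 18_051)); // MyISAM, composite PK: ~303 rows/s
        System.out.println(rowsPerSecond(105, 50_000, 5_867));  // InnoDB, composite PK: ~894 rows/s
        System.out.println(rowsPerSecond(116, 50_000, 735));    // InnoDB, auto-increment id: ~7891 rows/s
    }
}
```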

At peak times, the transmitted packets exceeded the database's threshold (screenshot not reproduced here).
MySQL limits the size of the packets the server will accept according to its configuration file.
Large inserts and updates can hit the max_allowed_packet limit, causing the write or update to fail.

MySQL 5.1:
mysql> show VARIABLES like '%max_allowed_packet%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 1048576    |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+

MySQL 5.6:
mysql> show VARIABLES like '%max_allowed_packet%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 4194304    |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+

So the remote MySQL 5.1 server's max_allowed_packet needs to be raised.
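
This can be done persistently in /etc/my.cnf (followed by a restart); the 64M value here is only an example and should be sized to comfortably fit the largest batched INSERT:

```ini
[mysqld]
max_allowed_packet=64M
```

It can also be changed at runtime with SET GLOBAL max_allowed_packet = 67108864; new connections pick the value up.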


Command-line notes

Connect to a remote database: mysql -h 192.168.40.128 -P 3306 -u root -p
Show a table's definition: show create table tablename;
Show the current process list: mysql> show processlist;
Show a table's indexes: show index from realtemp2;
List the storage engines MySQL provides: mysql> show engines;
Show MySQL's current default storage engine: mysql> show variables like '%storage_engine%';
Check whether MySQL supports the InnoDB engine: mysql> SHOW variables like "have_%";
Show the packet size the server accepts: show VARIABLES like '%max_allowed_packet%';


Dual data sources with the Quartz framework and MyBatis

The Quartz framework

public class QuartzManager {
    private static String JOB_GROUP_NAME = "group1";
    private static String TRIGGER_GROUP_NAME = "trigger1";

    // Start a schedule driven by a cron expression
    public static void startSchedule(String jobName, Job job, String time) {
        try {
            // 1. Create a JobDetail instance: the job class, with its job name and job group
            JobDetail jobDetail = JobBuilder.newJob(job.getClass())
                    .withIdentity(jobName, JOB_GROUP_NAME)
                    .build();
            // e.g. a cron expression that fires daily after 1 a.m.
            CronScheduleBuilder builder = CronScheduleBuilder.cronSchedule(time);
            // 2. Create the Trigger
            Trigger trigger = TriggerBuilder.newTrigger()
                    .withIdentity(TRIGGER_GROUP_NAME, JOB_GROUP_NAME)
                    .startNow()
                    .withSchedule(builder)
                    .build();
            // 3. Create the Scheduler
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
            scheduler.start();
            // 4. Schedule the job
            scheduler.scheduleJob(jobDetail, trigger);
        } catch (SchedulerException e) {
            e.printStackTrace();
        }
    }
}
public class LoadNextDayDataJob implements Job {
    private static final Logger log = Logger.getLogger(LoadNextDayDataJob.class);

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        log.info("MyJob  is start ..................");
        log.info("Hello quartz  " + new SimpleDateFormat("yyyy-MM-dd HH:mm:ss ").format(new Date()));
        RealtDao rd = new RealtDaoImpl(
                DataSourceSqlSessionFactory.getSqlSessionFactory(DataSourceSqlSessionFactory.LOCAL_ENVIRONMENT_ID));
        long begin = System.currentTimeMillis();
        log.debug(begin);
        rd.createNewTable("realtemp2");
        String fileName = DateUtil.parseDateToString(new Date(), DateUtil.PATTERN_yyyy_MM_dd);
        try {
            TextUtil.parseText2SQLBatchInsert(Cfg.cfgMap.get(Cfg.DATA_PATH) + fileName + "/part-r-00000");
        } catch (IOException e) {
            log.debug(e);
            e.printStackTrace();
        }
        log.debug("inserted data: " + (System.currentTimeMillis() - begin) / 1000 + "s");
        begin = System.currentTimeMillis();
        rd.createIdxRealtTradeDate("realtemp2");
        log.debug("built index: " + (System.currentTimeMillis() - begin) / 1000 + "s");
        log.info("MyJob  is end .....................");
    }
}

Dual data sources with MyBatis

SqlMapConfig.xml
<!-- After integrating with Spring, this environments configuration will be discarded -->
    <environments default="LOCAL">
        <environment id="LOCAL">
            <!-- JDBC transaction management, controlled by MyBatis -->
            <transactionManager type="JDBC" />
            <!-- Connection pool managed by MyBatis -->
            <dataSource type="POOLED">
                <property name="driver" value="${local.jdbc.driverClassName}" />
                <property name="url" value="${local.jdbc.url}"/>
                <property name="password" value="${local.jdbc.password}"/>
                <property name="username" value="${local.jdbc.username}" />
            </dataSource>
        </environment>
        <environment id="REMOTE">
            <!-- JDBC transaction management, controlled by MyBatis -->
            <transactionManager type="JDBC" />
            <!-- Connection pool managed by MyBatis -->
            <dataSource type="POOLED">
                <property name="driver" value="${remote.jdbc.driverClassName}" />
                <property name="url" value="${remote.jdbc.url}"/>
                <property name="password" value="${remote.jdbc.password}"/>
                <property name="username" value="${remote.jdbc.username}" />
            </dataSource>
        </environment>
    </environments>
// See: http://zhangbo-peipei-163-com.iteye.com/blog/2052924
// Create the SqlSessionFactory corresponding to each environment configured in the mybatis XML
public final class DataSourceSqlSessionFactory {
    private static Logger logger = Logger.getLogger(DataSourceSqlSessionFactory.class);
    private static final String CONFIGURATION_PATH = "SqlMapConfig.xml";
    public final static String LOCAL_ENVIRONMENT_ID = "LOCAL";
    public final static String REMOTE_ENVIRONMENT_ID = "REMOTE";

    public static SqlSessionFactory getSqlSessionFactory(String environment) {
        InputStream inputStream = null;
        SqlSessionFactory sqlSessionFactory = null;
        try {
            inputStream = Resources.getResourceAsStream(CONFIGURATION_PATH);
            sqlSessionFactory = new SqlSessionFactoryBuilder().build(inputStream, environment);
            inputStream.close();
            logger.info("Connected to the [ " + environment + " ] data source");
        } catch (IOException e) {
            logger.error("Failed to connect to the [ " + environment + " ] data source, error: " + e);
        }
        return sqlSessionFactory;
    }
}

Installing MySQL 5.6

Uninstalling MySQL

Check whether MySQL packages are installed:
rpm -qa|grep mysql
Uninstall MySQL:
yum remove mysql mysql-server mysql-libs mysql-common
rm -rf /var/lib/mysql
rm /etc/my.cnf

Check again for remaining MySQL packages and remove any that are left.
After the software is uninstalled, the MySQL data directory /var/lib/mysql can also be deleted if it is no longer needed.

Installing with yum

Download the rpm package:
To install MySQL with yum, you need MySQL's yum repository; first download the repository package that matches your system from the official site.
We chose mysql-community-release-el6-5.noarch.rpm.
Install the repository:
yum localinstall mysql-community-release-el6-5.noarch.rpm
Install MySQL:
yum install mysql-community-server

Start, set the password, grant access

Start MySQL:
sudo service mysqld start
Set the root user's password:

After installation, MySQL has only the root administrator account, and no password has been set for it yet; the first start of the MySQL service performs some database initialization work.

/usr/bin/mysqladmin -u root password 'new-password'
Grant remote access:
GRANT ALL PRIVILEGES ON *.* TO 'username'@'%' IDENTIFIED BY 'password' WITH GRANT OPTION;

See also:
Common MySQL command-line operations
Causes of and fixes for MySQL's "Waiting for table metadata lock"
Enabling InnoDB support in MySQL on Linux
Disabling MySQL's reverse DNS resolution


Author: @nanphonfy
Source: http://blog.csdn.net/Nanphonfy

