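Big-data batch processing compared: Spring Batch vs Flink vs parallelStream
Abstract: This article compares the approaches on a real example to help you choose the batch-processing technology that fits your own large-data workload.
Why use batch processing?
Scenarios
- Data import
  - An import must run inside a transaction to keep the data consistent: either every row succeeds or every row fails
  - It is processed online in real time, so response time matters
- Batch queries
  - They are processed online in real time, so response time matters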
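The all-or-nothing requirement for imports can be sketched with plain JDBC. This is only an illustration, not code from the project: the class name `AllOrNothingImport` and the method `importNames` are made up here, the SQL assumes a `student` table with `name` and `age` columns, and a live `Connection` has to be supplied by the caller.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

final class AllOrNothingImport {
    private static final String INSERT_SQL = "INSERT INTO student (name, age) VALUES (?, ?)";

    /** Inserts every name in one transaction; if any row fails, none of them is kept. */
    static void importNames(Connection conn, List<String> names) {
        try {
            conn.setAutoCommit(false); // start a transaction: nothing is permanent until commit()
            try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
                for (int i = 0; i < names.size(); i++) {
                    ps.setString(1, names.get(i));
                    ps.setInt(2, i);   // illustrative value only
                    ps.addBatch();     // queue the row instead of sending it on its own
                }
                ps.executeBatch();     // send all queued rows together
                conn.commit();         // every row succeeded: make them permanent
            } catch (SQLException e) {
                conn.rollback();       // any failure: undo every row of this import
                throw e;
            }
        } catch (SQLException e) {
            throw new IllegalStateException("import failed, nothing was saved", e);
        }
    }
}
```

The caller keeps ownership of the connection, so closing it and restoring its auto-commit setting is left to the caller in this sketch.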
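For these scenarios, the traditional row-by-row approach cannot meet the response-time requirement. Multithreading is the first thing that comes to mind, and it does work, but hand-written parallel code is demanding and hard to get right. So is there a ready-made open-source solution? The answer is yes.
This article compares several batch-processing approaches and summarizes the results:
Scenario: save 10,000 rows into the student table of a MySQL database
Table structure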
CREATE TABLE `student` (
`id` int(20) NOT NULL AUTO_INCREMENT,
`name` varchar(20) DEFAULT NULL,
`age` int(11) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=220000 DEFAULT CHARSET=utf8;
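The Java code later in this article works with a `StudentEntity` whose source is not shown. Below is a plain-Java sketch of what such a class could look like, inferred only from the three table columns and from the getters and setters the later code calls. The real class in the repository may differ, for example by using Lombok, and would normally be `public` in its own file under the model package.

```java
import java.io.Serializable;

/** Hypothetical reconstruction of the entity mapped to the student table. */
class StudentEntity implements Serializable {
    private static final long serialVersionUID = 1L;

    private Integer id;   // student.id, AUTO_INCREMENT primary key
    private String name;  // student.name, varchar(20)
    private Integer age;  // student.age, nullable int

    public Integer getId() { return id; }
    public void setId(Integer id) { this.id = id; }

    public String getName() { return name; }
    public void setName(String name) { this.name = name; }

    public Integer getAge() { return age; }
    public void setAge(Integer age) { this.age = age; }
}
```

Wrapper types are used instead of `int` so that a not-yet-assigned `id` and a NULL `age` can both be represented.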
Spring Batch vs Flink vs parallelStream: performance comparison
Approach | Execution time | Notes |
---|---|---|
for loop | 34391s | |
parallelStream | 8384s | |
Spring Batch (Spring Boot) | 1035s | |
Flink | 4310s | |
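To make the gap easier to read, the timings can be turned into speedups relative to the for loop. This is a small sketch that uses only the four numbers from the table; the class name `SpeedupTable` is made up for this illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SpeedupTable {
    /** How many times faster the candidate is than the baseline. */
    static double speedup(long baselineTime, long candidateTime) {
        return (double) baselineTime / candidateTime;
    }

    public static void main(String[] args) {
        long loopTime = 34391; // the for-loop row of the table
        Map<String, Long> others = new LinkedHashMap<>();
        others.put("parallelStream", 8384L);
        others.put("Spring Batch", 1035L);
        others.put("Flink", 4310L);
        // Print one line per approach, keeping the table's order.
        others.forEach((name, time) ->
                System.out.printf("%-14s %.1fx faster than the loop%n", name, speedup(loopTime, time)));
    }
}
```

On these numbers parallelStream is roughly 4 times faster than the loop, Flink roughly 8 times, and Spring Batch roughly 33 times.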
Preparation
Spring Boot project pom.xml configuration
Dependencies
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>2.0.1.RELEASE</version>
<relativePath /> <!-- lookup parent from repository -->
</parent>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>5.1.40</version><!--$NO-MVN-MAN-VER$ -->
<scope>runtime</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>druid</artifactId>
<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter</artifactId>
<exclusions>
<exclusion>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-logging</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-log4j2</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
<scope>compile</scope>
</dependency>
<dependency>
<groupId>junit</groupId>
<artifactId>junit</artifactId>
<scope>test</scope>
</dependency>
<dependency>
<groupId>org.mybatis.spring.boot</groupId>
<artifactId>mybatis-spring-boot-starter</artifactId>
<version>2.0.1</version>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-batch</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-data-jpa</artifactId>
</dependency>
<dependency>
<groupId>javax.xml.bind</groupId>
<artifactId>jaxb-api</artifactId>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-java</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java_2.11</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-jdbc_2.11</artifactId>
<version>${flink.version}</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-api-java-bridge_2.11</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-planner-blink_2.11</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-scala_2.11</artifactId>
<version>1.9.0</version>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-table-common</artifactId>
<version>1.9.0</version>
</dependency>
Tip: use Spring Boot 2.0.1.RELEASE; other versions run into compatibility problems.
application.properties
server.port=8070
spring.application.name=sea-spring-boot-batch
spring.batch.initialize-schema=always
spring.jpa.generate-ddl=true
mybatis.config-location=classpath:mybatis-config.xml
mybatis.mapper-locations=classpath:mybatis/*.xml
mybatis.type-aliases-package=org.sea.spring.cloud.nacos.model
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8&useSSL=false
spring.datasource.username=root
spring.datasource.password=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
mybatis-config.xml
<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN" "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
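<!-- Global settings -->
<settings>
<!-- Globally enables or disables caching for all mappers. -->
<setting name="cacheEnabled" value="true"/>
<!-- Globally enables or disables lazy loading. When disabled, all associated objects are loaded eagerly. -->
<setting name="lazyLoadingEnabled" value="true"/>
<!-- When enabled, an object with lazily loaded properties loads all of them as soon as it is called. Otherwise each property is loaded on demand. -->
<setting name="aggressiveLazyLoading" value="true"/>
<!-- Whether a single SQL statement may return multiple result sets (depends on driver compatibility). Default: true -->
<setting name="multipleResultSetsEnabled" value="true"/>
<!-- Whether column aliases may be used (depends on driver compatibility). Default: true -->
<setting name="useColumnLabel" value="true"/>
<!-- Allows JDBC to generate primary keys; requires driver support. When set to true, generated keys are forced; some drivers are incompatible but still work. Default: false -->
<setting name="useGeneratedKeys" value="true"/>
<!-- How MyBatis automatically maps table columns. NONE: no mapping; PARTIAL: partial; FULL: all -->
<setting name="autoMappingBehavior" value="PARTIAL"/>
<!-- The default executor type (SIMPLE: plain executor; REUSE: the executor may reuse prepared statements; BATCH: the executor may reuse statements and batch updates) -->
<setting name="defaultExecutorType" value="SIMPLE"/>
<!-- Converts column names to camel-case property names. -->
<setting name="mapUnderscoreToCamelCase" value="true"/>
<!-- Local cache scope. SESSION: data is shared within the session; STATEMENT: statement scope, so no data is shared. Default: SESSION -->
<setting name="localCacheScope" value="SESSION"/>
<!-- JDBC type to use for null values, which some drivers require. Default: OTHER. With NULL, inserting a null value needs no explicit type. -->
<setting name="jdbcTypeForNull" value="NULL"/>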
</settings>
</configuration>
Writing the tests
The for-loop and parallelStream test cases are left for you to write and are not covered in detail; the focus here is on Spring Batch and Flink.
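For readers who want a starting point for those two baseline cases, here is a minimal sketch of how they could be structured. It is not the author's test code: the class and method names are made up, and an in-memory queue stands in for the real insert into the student table, so the timings it prints say nothing about database performance.

```java
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.function.Consumer;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

final class LoopVsParallelStream {
    /** Baseline: save one record at a time on the calling thread. Returns elapsed milliseconds. */
    static <T> long saveWithLoop(List<T> items, Consumer<T> save) {
        long start = System.nanoTime();
        for (T item : items) {
            save.accept(item);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    /** Same work spread over the common fork-join pool; the save action must be thread-safe. */
    static <T> long saveWithParallelStream(List<T> items, Consumer<T> save) {
        long start = System.nanoTime();
        items.parallelStream().forEach(save);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        List<String> names = IntStream.range(0, 10_000)
                .mapToObj(i -> "name" + i)
                .collect(Collectors.toList());
        Queue<String> fakeTable = new ConcurrentLinkedQueue<>(); // stand-in for the student table
        long loopMs = saveWithLoop(names, fakeTable::add);
        long parallelMs = saveWithParallelStream(names, fakeTable::add);
        System.out.println("loop=" + loopMs + "ms, parallelStream=" + parallelMs
                + "ms, rows=" + fakeTable.size());
    }
}
```

In a real test the `save` argument would be the call that inserts one student, and that call has to be safe to invoke from several threads at once in the parallelStream variant.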
Spring Batch test case
package org.sea.spring.boot.batch.job;
import java.util.Iterator;
import java.util.List;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.NonTransientResourceException;
import org.springframework.batch.item.ParseException;
import org.springframework.batch.item.UnexpectedInputException;
public class InputStudentItemReader implements ItemReader<StudentEntity>{
private final Iterator<StudentEntity> iterator;
public InputStudentItemReader(List<StudentEntity> data) {
this.iterator = data.iterator();
}
@Override
public StudentEntity read()
throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
if (iterator.hasNext()) {
return this.iterator.next();
} else {
return null;
}
}
}
package org.sea.spring.boot.batch.job;
import java.util.Set;
import javax.validation.ConstraintViolation;
import javax.validation.Validation;
import javax.validation.ValidatorFactory;
import org.springframework.batch.item.validator.ValidationException;
import org.springframework.batch.item.validator.Validator;
import org.springframework.beans.factory.InitializingBean;
public class StudentBeanValidator <T> implements Validator<T>, InitializingBean{
private javax.validation.Validator validator;
@Override
public void afterPropertiesSet() throws Exception {
ValidatorFactory validatorFactory = Validation.buildDefaultValidatorFactory();
validator = validatorFactory.usingContext().getValidator();
}
@Override
public void validate(T value) throws ValidationException {
Set<ConstraintViolation<T>> constraintViolations = validator.validate(value);
if (constraintViolations.size() > 0) {
StringBuilder message = new StringBuilder();
for (ConstraintViolation<T> constraintViolation : constraintViolations) {
message.append(constraintViolation.getMessage()).append("\n");
}
throw new ValidationException(message.toString());
}
}
}
package org.sea.spring.boot.batch.job;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.item.validator.ValidatingItemProcessor;
import org.springframework.batch.item.validator.ValidationException;
public class StudentItemProcessor extends ValidatingItemProcessor<StudentEntity> {
@Override
public StudentEntity process(StudentEntity item) throws ValidationException {
super.process(item);
return item;
}
}
package org.sea.spring.boot.batch.job;
import java.util.ArrayList;
import java.util.List;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.core.launch.support.SimpleJobLauncher;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.validator.Validator;
import org.springframework.batch.support.DatabaseType;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;
import com.alibaba.druid.pool.DruidDataSource;
@Configuration
@EnableBatchProcessing
public class StudentBatchConfig {
/**
* ItemReader definition: supplies the data to be read
* @return an InputStudentItemReader over 10,000 students built in memory
*/
@Bean
@StepScope
public InputStudentItemReader reader() {
List<StudentEntity> list =new ArrayList<StudentEntity>(10000);
for(int i=0;i<10000;i++) {
list.add(init(i));
}
return new InputStudentItemReader(list);
}
private StudentEntity init(int i) {
StudentEntity student=new StudentEntity();
student.setName("name"+i);
student.setAge(i);
return student;
}
/**
* ItemProcessor definition: processes (here, validates) each item
*
* @return
*/
@Bean
public ItemProcessor<StudentEntity, StudentEntity> processor() {
StudentItemProcessor processor = new StudentItemProcessor();
processor.setValidator(studentBeanValidator());
return processor;
}
@Bean
public Validator<StudentEntity> studentBeanValidator() {
return new StudentBeanValidator<>();
}
/**
* ItemWriter definition: writes the data out
* Spring injects beans that already exist in the container as method parameters; Spring Boot has already defined the dataSource for us
*
* @param dataSource
* @return
*/
@Bean
public ItemWriter<StudentEntity> writer(DruidDataSource dataSource) {
JdbcBatchItemWriter<StudentEntity> writer = new JdbcBatchItemWriter<>();
// Use JdbcBatchItemWriter, which relies on JDBC batching, to write the data to the database
writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());
String sql="INSERT INTO student (name,age) values(:name,:age)" ;
// Set the SQL statement that the batch will execute
writer.setSql(sql);
writer.setDataSource(dataSource);
return writer;
}
/**
*
* @param dataSource
* @param transactionManager
* @return
* @throws Exception
*/
@Bean
public JobRepository jobRepository(DruidDataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
JobRepositoryFactoryBean jobRepositoryFactoryBean = new JobRepositoryFactoryBean();
jobRepositoryFactoryBean.setDataSource(dataSource);
jobRepositoryFactoryBean.setTransactionManager(transactionManager);
jobRepositoryFactoryBean.setDatabaseType(String.valueOf(DatabaseType.MYSQL));
// The transaction isolation level setting below is only needed for Oracle
// jobRepositoryFactoryBean.setIsolationLevelForCreate(isolationLevelForCreate);
jobRepositoryFactoryBean.afterPropertiesSet();
return jobRepositoryFactoryBean.getObject();
}
/**
* JobLauncher definition: the interface used to launch a Job
*
* @param dataSource
* @param transactionManager
* @return
* @throws Exception
*/
@Bean
public SimpleJobLauncher jobLauncher(DruidDataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
jobLauncher.setJobRepository(jobRepository(dataSource, transactionManager));
return jobLauncher;
}
/**
* Job definition: the task we actually run, made up of one or more Steps
*
* @param jobBuilderFactory
* @param s1
* @return
*/
@Bean
public Job importJob(JobBuilderFactory jobBuilderFactory, Step s1) {
return jobBuilderFactory.get("importJob")
.incrementer(new RunIdIncrementer())
.flow(s1)// assign the Step to the Job
.end()
.build();
}
/**
* Step definition: combines the ItemReader, ItemProcessor and ItemWriter
*
* @param stepBuilderFactory
* @param reader
* @param writer
* @param processor
* @return
*/
@Bean
public Step step1(StepBuilderFactory stepBuilderFactory, ItemReader<StudentEntity> reader, ItemWriter<StudentEntity> writer,
ItemProcessor<StudentEntity, StudentEntity> processor) {
return stepBuilderFactory
.get("step1")
.<StudentEntity, StudentEntity>chunk(1000)// commit 1,000 items per chunk
.reader(reader)// bind the reader to the step
.processor(processor)// bind the processor to the step
.writer(writer)// bind the writer to the step
.build();
}
}
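What `chunk(1000)` does in the step above can be modelled in a few lines of plain Java: items are read one at a time until the reader returns null, each item goes through the processor, and the writer receives a whole chunk at once. This is a simplified model written for illustration, not Spring Batch's actual implementation. It leaves out transactions, retry, skip and restart metadata, and the names `ChunkLoopSketch` and `run` are made up.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Supplier;
import java.util.function.UnaryOperator;

final class ChunkLoopSketch {
    /** Drains the reader and hands the writer one chunk at a time; returns the number of chunks written. */
    static <T> int run(Supplier<T> reader, UnaryOperator<T> processor,
                       Consumer<List<T>> writer, int chunkSize) {
        int chunks = 0;
        List<T> buffer = new ArrayList<>(chunkSize);
        T item;
        while ((item = reader.get()) != null) {      // null means "no more input"
            buffer.add(processor.apply(item));
            if (buffer.size() == chunkSize) {
                writer.accept(new ArrayList<>(buffer)); // one write (one commit in the real framework)
                buffer.clear();
                chunks++;
            }
        }
        if (!buffer.isEmpty()) {                      // final, partially filled chunk
            writer.accept(new ArrayList<>(buffer));
            chunks++;
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<Integer> data = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) {
            data.add(i);
        }
        Iterator<Integer> it = data.iterator();
        int chunks = ChunkLoopSketch.<Integer>run(
                () -> it.hasNext() ? it.next() : null, x -> x, chunk -> { }, 1000);
        System.out.println("chunks written: " + chunks); // prints "chunks written: 10"
    }
}
```

With 10,000 items and a chunk size of 1,000 the writer is called 10 times instead of 10,000, which is a large part of why the chunked approach beats the row-by-row loop.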
package org.sea.spring.boot.batch.test;
import org.junit.Test;
import org.junit.runner.RunWith;
import org.sea.spring.boot.batch.SpringBootBathApplication;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.util.StopWatch;
import lombok.extern.slf4j.Slf4j;
@RunWith(SpringRunner.class)
@SpringBootTest(classes=SpringBootBathApplication.class)
@Slf4j
public class TestBatchService {
@Autowired
private JobLauncher jobLauncher;
@Autowired
private Job importJob;
@Test
public void testBatch1() throws Exception {
StopWatch watch = new StopWatch("testAdd1");
watch.start("save");
JobParameters jobParameters = new JobParametersBuilder()
.addLong("time", System.currentTimeMillis())
.toJobParameters();
jobLauncher.run(importJob, jobParameters);
watch.stop();
log.info(watch.prettyPrint());
}
}
Flink test case
package org.sea.spring.boot.batch;
import java.util.List;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;
import org.sea.spring.boot.batch.model.StudentEntity;
public class FLink2Mysql {
private static String driverClass = "com.mysql.jdbc.Driver";
private static String dbUrl = "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8&useSSL=false";
private static String userName = "root";
private static String passWord = "mysql";
public static void add(List<StudentEntity> students) {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<StudentEntity> input = env.fromCollection(students);
DataStream<Row> ds = input.map(new RichMapFunction<StudentEntity, Row>() {
private static final long serialVersionUID = 1L;
@Override
public Row map(StudentEntity student) throws Exception {
return Row.of(student.getId(), student.getName(), student.getAge());
}
});
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] { BasicTypeInfo.INT_TYPE_INFO ,BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.INT_TYPE_INFO };
JDBCAppendTableSink sink = JDBCAppendTableSink.builder().setDrivername(driverClass).setDBUrl(dbUrl)
.setUsername(userName).setPassword(passWord).setParameterTypes(fieldTypes)
.setQuery("insert into student values(?,?,?)").build();
sink.emitDataStream(ds);
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
public static void query() {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
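// id, name, age: the field types must match the columns returned by "select * from student"
TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] { BasicTypeInfo.INT_TYPE_INFO, BasicTypeInfo.STRING_TYPE_INFO,
BasicTypeInfo.INT_TYPE_INFO };
RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
// Query MySQL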
JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat().setDrivername(driverClass)
.setDBUrl(dbUrl).setUsername(userName).setPassword(passWord).setQuery("select * from student")
.setRowTypeInfo(rowTypeInfo).finish();
DataStreamSource<Row> input1 = env.createInput(jdbcInputFormat);
input1.print();
try {
env.execute();
} catch (Exception e) {
e.printStackTrace();
}
}
}
package org.sea.spring.boot.batch.test;
import java.util.ArrayList;
import java.util.List;
import org.junit.Test;
import org.sea.spring.boot.batch.FLink2Mysql;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.util.StopWatch;
import lombok.extern.log4j.Log4j2;
@Log4j2
public class TestFlink {
@Test
public void test() {
// Build the test data
StopWatch watch = new StopWatch("testAdd1");
watch.start("build");
List<StudentEntity> list =new ArrayList<StudentEntity>(10000);
for(int i=0;i<10000;i++) {
list.add(init(i+210000));
}
watch.stop();
// Save
watch.start("save");
FLink2Mysql.add(list);
watch.stop();
log.info(watch.prettyPrint());
}
private StudentEntity init(int i) {
StudentEntity student=new StudentEntity();
student.setId(i);
student.setName("name"+i);
student.setAge(i);
return student;
}
}
GitHub source: https://github.com/loveseaone/sea-spring-boot-batch.git