Big data batch processing compared: Spring Batch vs Flink vs parallelStream

Abstract: through a comparison of concrete test cases, this article helps you choose the batch-processing technology that best fits your big data workload.


Why use batch processing?

Scenarios

  • Data import
    • The import must run inside a transaction to guarantee consistency: either all records are saved or none are
    • Processing happens online in real time, with strict response-time requirements
  • Bulk queries
    • Processing happens online in real time, with strict response-time requirements

For these scenarios, a traditional record-by-record approach cannot meet the response-time requirements. The first idea that comes to mind is multithreading, and a hand-rolled multithreaded solution does work; the catch is that parallel programming with threads is demanding and error-prone. So, does the open-source world offer ready-made solutions? The answer is: yes.

This article compares and summarizes the batch-processing options:

Scenario: save 10,000 records into the student table of a MySQL database.

Table structure

CREATE TABLE `student` (
  `id` int(20) NOT NULL AUTO_INCREMENT,
  `name` varchar(20) DEFAULT NULL,
  `age` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=220000 DEFAULT CHARSET=utf8;
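
The test code below references a StudentEntity model class that this post never shows (the real class lives in the GitHub repo linked at the end). A minimal sketch consistent with the table schema, assuming Lombok and JPA annotations (both dependencies are in the pom), might look like this:

package org.sea.spring.boot.batch.model;

import javax.persistence.Entity;
import javax.persistence.GeneratedValue;
import javax.persistence.GenerationType;
import javax.persistence.Id;
import javax.persistence.Table;

import lombok.Data;

// Hypothetical reconstruction of the model used by all the tests below.
@Data // Lombok generates the getters and setters the tests rely on
@Entity
@Table(name = "student")
public class StudentEntity {

	@Id
	@GeneratedValue(strategy = GenerationType.IDENTITY) // matches the AUTO_INCREMENT column
	private Integer id;

	private String name;

	private Integer age;
}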

Performance comparison: Spring Batch vs Flink vs parallelStream

Approach             Execution time
Plain for loop       34391 ms
parallelStream        8384 ms
Spring Boot Batch     1035 ms
Flink                 4310 ms

(Times as reported by Spring's StopWatch, in milliseconds; lower is better.)

Setup

Spring Boot project pom.xml configuration

Dependencies

<parent>
    <groupId>org.springframework.boot</groupId>
    <artifactId>spring-boot-starter-parent</artifactId>
    <version>2.0.1.RELEASE</version>
    <relativePath /> <!-- lookup parent from repository -->
</parent>

<!-- assumption: <flink.version>1.9.0</flink.version> is defined in <properties>,
     matching the explicit 1.9.0 versions used further down -->

<dependencies>
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>5.1.40</version>
        <scope>runtime</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-test</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>com.alibaba</groupId>
        <artifactId>druid</artifactId>
        <version>1.1.0</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter</artifactId>
        <exclusions>
            <exclusion>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-starter-logging</artifactId>
            </exclusion>
        </exclusions>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-log4j2</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-actuator</artifactId>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-web</artifactId>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>org.projectlombok</groupId>
        <artifactId>lombok</artifactId>
        <optional>true</optional>
        <scope>compile</scope>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.mybatis.spring.boot</groupId>
        <artifactId>mybatis-spring-boot-starter</artifactId>
        <version>2.0.1</version>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-batch</artifactId>
    </dependency>
    <dependency>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-data-jpa</artifactId>
    </dependency>
    <dependency>
        <groupId>javax.xml.bind</groupId>
        <artifactId>jaxb-api</artifactId>
    </dependency>

    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-java</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-java_2.11</artifactId>
        <version>${flink.version}</version>
        <scope>provided</scope>
    </dependency>
    <!-- note: the _2.12 suffix below mixes Scala binary versions with the
         _2.11 artifacts around it; align them in a real build -->
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-jdbc_2.12</artifactId>
        <version>${flink.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-api-java-bridge_2.11</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-planner-blink_2.11</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-streaming-scala_2.11</artifactId>
        <version>1.9.0</version>
    </dependency>
    <dependency>
        <groupId>org.apache.flink</groupId>
        <artifactId>flink-table-common</artifactId>
        <version>1.9.0</version>
    </dependency>
</dependencies>

Tip: pick Spring Boot version 2.0.1.RELEASE; other versions run into compatibility problems with this dependency set.

application.properties

server.port=8070
spring.application.name=sea-spring-boot-batch
spring.batch.initialize-schema=always
spring.jpa.generate-ddl=true
mybatis.config-location=classpath:mybatis-config.xml
mybatis.mapper-locations=classpath:mybatis/*.xml
mybatis.type-aliases-package=org.sea.spring.cloud.nacos.model

# com.mysql.jdbc.Driver matches the 5.1.x connector declared in the pom (the 8.x connector uses com.mysql.cj.jdbc.Driver)
spring.datasource.driverClassName=com.mysql.jdbc.Driver
spring.datasource.url=jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8&useSSL=false
spring.datasource.username=root
spring.datasource.password=mysql
spring.datasource.type=com.alibaba.druid.pool.DruidDataSource
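
Note: spring.batch.initialize-schema=always makes Spring Boot create the Spring Batch metadata tables (BATCH_JOB_INSTANCE, BATCH_JOB_EXECUTION, and so on) in the target database at startup; the JobRepository configured below stores job and step execution state in them.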

mybatis-config.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE configuration PUBLIC "-//mybatis.org//DTD Config 3.0//EN" "http://mybatis.org/dtd/mybatis-3-config.dtd">
<configuration>
    <!-- Global settings -->
    <settings>
        <!-- Globally enable or disable caching in every mapper. -->
        <setting name="cacheEnabled" value="true"/>
        <!-- Globally enable or disable lazy loading. When disabled, all associations are loaded eagerly. -->
        <setting name="lazyLoadingEnabled" value="true"/>
        <!-- When enabled, touching any lazy property loads all of the object's lazy properties; otherwise each property is loaded on demand. -->
        <setting name="aggressiveLazyLoading" value="true"/>
        <!-- Whether a single SQL statement may return multiple result sets (depends on driver support). Default: true -->
        <setting name="multipleResultSetsEnabled" value="true"/>
        <!-- Whether column aliases may be used (depends on driver support). Default: true -->
        <setting name="useColumnLabel" value="true"/>
        <!-- Allow JDBC generated keys (requires driver support). If true, generated keys are used; some incompatible drivers still work. Default: false -->
        <setting name="useGeneratedKeys" value="true"/>
        <!-- How MyBatis auto-maps columns to fields. NONE: no auto-mapping; PARTIAL: partial; FULL: everything -->
        <setting name="autoMappingBehavior" value="PARTIAL"/>
        <!-- Default executor type (SIMPLE: plain; REUSE: the executor may reuse prepared statements; BATCH: the executor batches statements and updates) -->
        <setting name="defaultExecutorType" value="SIMPLE"/>
        <!-- Map underscored column names to camelCase field names. -->
        <setting name="mapUnderscoreToCamelCase" value="true"/>
        <!-- Local cache scope. SESSION: data is shared within a session; STATEMENT: statement scope, no sharing. Default: SESSION -->
        <setting name="localCacheScope" value="SESSION"/>
        <!-- JDBC type to use when a parameter is null; some drivers require one. NULL avoids having to specify a type when inserting nulls. Default: OTHER -->
        <setting name="jdbcTypeForNull" value="NULL"/>
    </settings>
</configuration>

Writing the tests

The for-loop and parallelStream test cases are simple enough to write yourself and are not covered in detail (a sketch follows below); the focus here is on Spring Boot Batch and Flink.
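
For reference, here is a minimal sketch of those two baseline tests. It assumes the StudentEntity shown earlier and Spring Boot's auto-configured JdbcTemplate; the class name TestLoopAndParallelStream is illustrative, not from the original repo:

package org.sea.spring.boot.batch.test;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.sea.spring.boot.batch.SpringBootBathApplication;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.util.StopWatch;

import lombok.extern.slf4j.Slf4j;

@RunWith(SpringRunner.class)
@SpringBootTest(classes = SpringBootBathApplication.class)
@Slf4j
public class TestLoopAndParallelStream {

	@Autowired
	private JdbcTemplate jdbcTemplate;

	private static final String SQL = "INSERT INTO student (name, age) VALUES (?, ?)";

	// Baseline: one INSERT per iteration, single thread.
	@Test
	public void testForLoop() {
		StopWatch watch = new StopWatch("forLoop");
		watch.start("save");
		for (int i = 0; i < 10000; i++) {
			jdbcTemplate.update(SQL, "name" + i, i);
		}
		watch.stop();
		log.info(watch.prettyPrint());
	}

	// parallelStream: the same inserts fanned out across the common ForkJoinPool.
	@Test
	public void testParallelStream() {
		List<StudentEntity> list = new ArrayList<>(10000);
		for (int i = 0; i < 10000; i++) {
			StudentEntity s = new StudentEntity();
			s.setName("name" + i);
			s.setAge(i);
			list.add(s);
		}
		StopWatch watch = new StopWatch("parallelStream");
		watch.start("save");
		list.parallelStream()
				.forEach(s -> jdbcTemplate.update(SQL, s.getName(), s.getAge()));
		watch.stop();
		log.info(watch.prettyPrint());
	}
}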

Spring Boot Batch test case
package org.sea.spring.boot.batch.job;

import java.util.Iterator;
import java.util.List;

import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.NonTransientResourceException;
import org.springframework.batch.item.ParseException;
import org.springframework.batch.item.UnexpectedInputException;

public class InputStudentItemReader implements ItemReader<StudentEntity>{
	
	private final Iterator<StudentEntity> iterator;

	public InputStudentItemReader(List<StudentEntity> data) {
	        this.iterator = data.iterator();
	}

	@Override
	public StudentEntity read()
			throws Exception, UnexpectedInputException, ParseException, NonTransientResourceException {
		// returning null signals to Spring Batch that the input is exhausted
		if (iterator.hasNext()) {
			return this.iterator.next();
		} else {
			return null;
		}
	}

}

package org.sea.spring.boot.batch.job;
import java.util.Set;

import javax.validation.ConstraintViolation;
import javax.validation.Validation;
import javax.validation.ValidatorFactory;

import org.springframework.batch.item.validator.ValidationException;
import org.springframework.batch.item.validator.Validator;
import org.springframework.beans.factory.InitializingBean;

public class StudentBeanValidator <T> implements Validator<T>, InitializingBean{

	private javax.validation.Validator validator;
	
	@Override
	public void afterPropertiesSet() throws Exception {
		 ValidatorFactory validatorFactory = Validation.buildDefaultValidatorFactory();
	     validator = validatorFactory.usingContext().getValidator();
		
	}

	@Override
	public void validate(T value) throws ValidationException {
	 
        Set<ConstraintViolation<T>> constraintViolations = validator.validate(value);
        if (constraintViolations.size() > 0) {
            StringBuilder message = new StringBuilder();
            for (ConstraintViolation<T> constraintViolation : constraintViolations) {
                message.append(constraintViolation.getMessage()).append("\n");
            }
            throw new ValidationException(message.toString());
        }
		
	}


}

package org.sea.spring.boot.batch.job;

import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.item.validator.ValidatingItemProcessor;
import org.springframework.batch.item.validator.ValidationException;

public class StudentItemProcessor extends ValidatingItemProcessor<StudentEntity> {
	 @Override
	    public StudentEntity process(StudentEntity item) throws ValidationException {
	        super.process(item);
	        return item;
	    }
}

package org.sea.spring.boot.batch.job;

import java.util.ArrayList;
import java.util.List;

import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.Step;
import org.springframework.batch.core.configuration.annotation.EnableBatchProcessing;
import org.springframework.batch.core.configuration.annotation.JobBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepBuilderFactory;
import org.springframework.batch.core.configuration.annotation.StepScope;
import org.springframework.batch.core.launch.support.RunIdIncrementer;
import org.springframework.batch.core.launch.support.SimpleJobLauncher;
import org.springframework.batch.core.repository.JobRepository;
import org.springframework.batch.core.repository.support.JobRepositoryFactoryBean;
import org.springframework.batch.item.ItemProcessor;
import org.springframework.batch.item.ItemReader;
import org.springframework.batch.item.ItemWriter;
import org.springframework.batch.item.database.BeanPropertyItemSqlParameterSourceProvider;
import org.springframework.batch.item.database.JdbcBatchItemWriter;
import org.springframework.batch.item.validator.Validator;
import org.springframework.batch.support.DatabaseType;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.transaction.PlatformTransactionManager;

import com.alibaba.druid.pool.DruidDataSource;

@Configuration
@EnableBatchProcessing
public class StudentBatchConfig {

    /**
     * ItemReader definition, used to read the input data
     * @return InputStudentItemReader
     */
    @Bean
    @StepScope
    public InputStudentItemReader reader() {
    	List<StudentEntity> list =new ArrayList<StudentEntity>(10000);
		for(int i=0;i<10000;i++) {
			list.add(init(i));
		}
        return new InputStudentItemReader(list);
        
    }
    private StudentEntity init(int i) {
        StudentEntity student = new StudentEntity();
        student.setName("name" + i);
        student.setAge(i);
        return student;
    }
    /**
     * ItemProcessor definition, used to process (here: validate) each item
     *
     * @return
     */
    @Bean
    public ItemProcessor<StudentEntity, StudentEntity> processor() {
        StudentItemProcessor processor = new StudentItemProcessor();
        processor.setValidator(studentBeanValidator());
        return processor;
    }
    
    @Bean
    public Validator<StudentEntity> studentBeanValidator() {
        return new StudentBeanValidator<>();
    }
    /**
     * ItemWriter definition, used to write the output.
     * Spring injects beans already present in the container as method parameters;
     * Spring Boot has already defined the dataSource for us.
     *
     * @param dataSource
     * @return
     */
    @Bean
    public ItemWriter<StudentEntity> writer(DruidDataSource dataSource) {
    	
        // use JdbcBatchItemWriter, which writes to the database via JDBC batching
        JdbcBatchItemWriter<StudentEntity> writer = new JdbcBatchItemWriter<>();
        writer.setItemSqlParameterSourceProvider(new BeanPropertyItemSqlParameterSourceProvider<>());

        // the SQL statement the batch will execute
        String sql = "INSERT INTO student (name,age) values(:name,:age)";
        writer.setSql(sql);
        writer.setDataSource(dataSource);
        return writer;
    }

    /**
     * JobRepository definition, used to persist job and step execution metadata
     *
     * @param dataSource
     * @param transactionManager
     * @return
     * @throws Exception
     */
    @Bean
    public JobRepository jobRepository(DruidDataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {

        JobRepositoryFactoryBean jobRepositoryFactoryBean = new JobRepositoryFactoryBean();
        jobRepositoryFactoryBean.setDataSource(dataSource);
        jobRepositoryFactoryBean.setTransactionManager(transactionManager);
        jobRepositoryFactoryBean.setDatabaseType(String.valueOf(DatabaseType.MYSQL));
        // the isolation-level setting below is Oracle-specific
//        jobRepositoryFactoryBean.setIsolationLevelForCreate(isolationLevelForCreate);
        jobRepositoryFactoryBean.afterPropertiesSet();
        return jobRepositoryFactoryBean.getObject();
    }

    /**
     * JobLauncher definition, the interface used to launch jobs
     *
     * @param dataSource
     * @param transactionManager
     * @return
     * @throws Exception
     */
    @Bean
    public SimpleJobLauncher jobLauncher(DruidDataSource dataSource, PlatformTransactionManager transactionManager) throws Exception {
        SimpleJobLauncher jobLauncher = new SimpleJobLauncher();
        jobLauncher.setJobRepository(jobRepository(dataSource, transactionManager));
        return jobLauncher;
    }

    /**
     * Job definition: the task we actually run, made up of one or more Steps
     *
     * @param jobBuilderFactory
     * @param s1
     * @return
     */
    @Bean
    public Job importJob(JobBuilderFactory jobBuilderFactory, Step s1) {
        return jobBuilderFactory.get("importJob")
                .incrementer(new RunIdIncrementer())
                .flow(s1) // assign the Step to the Job
                .end()
                .build();
    }

    /**
     * Step definition, combining the ItemReader, ItemProcessor and ItemWriter
     *
     * @param stepBuilderFactory
     * @param reader
     * @param writer
     * @param processor
     * @return
     */
    @Bean
    public Step step1(StepBuilderFactory stepBuilderFactory, ItemReader<StudentEntity> reader, ItemWriter<StudentEntity> writer,
                      ItemProcessor<StudentEntity, StudentEntity> processor) {
        return stepBuilderFactory
                .get("step1")
                .<StudentEntity, StudentEntity>chunk(1000) // commit in chunks of 1,000 items
                .reader(reader)       // bind the reader to the step
                .processor(processor) // bind the processor to the step
                .writer(writer)       // bind the writer to the step
                .build();
    } 
}
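
Why is Spring Batch so much faster here than the naive loop? Each 1,000-item chunk is written by JdbcBatchItemWriter as a single JDBC batch inside one transaction, so the 10,000 inserts cost ten commits instead of 10,000. With MySQL you can often go further by appending rewriteBatchedStatements=true to the JDBC URL, which lets the driver rewrite each batch into multi-row INSERT statements.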

package org.sea.spring.boot.batch.test;

import org.junit.Test;
import org.junit.runner.RunWith;
import org.sea.spring.boot.batch.SpringBootBathApplication;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.boot.test.context.SpringBootTest;
import org.springframework.test.context.junit4.SpringRunner;
import org.springframework.util.StopWatch;

import lombok.extern.slf4j.Slf4j;

@RunWith(SpringRunner.class)
@SpringBootTest(classes=SpringBootBathApplication.class) 
@Slf4j
public class TestBatchService {

 
    @Autowired
    private JobLauncher jobLauncher;
    @Autowired
    private Job importJob;

    @Test
    public void testBatch1() throws Exception {
    	StopWatch watch = new StopWatch("testAdd1");
    	watch.start("save");
        // a fresh "time" parameter makes each run a new JobInstance
        JobParameters jobParameters = new JobParametersBuilder()
                .addLong("time", System.currentTimeMillis())
                .toJobParameters();
        jobLauncher.run(importJob, jobParameters);
        watch.stop();
        log.info(watch.prettyPrint());
    }
}


Flink test case
package org.sea.spring.boot.batch;

import java.util.List;

import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.api.common.typeinfo.BasicTypeInfo;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.io.jdbc.JDBCAppendTableSink;
import org.apache.flink.api.java.io.jdbc.JDBCInputFormat;
import org.apache.flink.api.java.typeutils.RowTypeInfo;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.types.Row;
import org.sea.spring.boot.batch.model.StudentEntity;

public class FLink2Mysql {

	private static String driverClass = "com.mysql.jdbc.Driver";
	private static String dbUrl = "jdbc:mysql://localhost:3306/test?useUnicode=true&characterEncoding=utf-8&useSSL=false";
	private static String userName = "root";
	private static String passWord = "mysql";

	public static void add(List<StudentEntity> students) {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

		DataStreamSource<StudentEntity> input = env.fromCollection(students);

		DataStream<Row> ds = input.map(new RichMapFunction<StudentEntity, Row>() {

			private static final long serialVersionUID = 1L;

			@Override
			public Row map(StudentEntity student) throws Exception {
				return Row.of(student.getId(), student.getName(), student.getAge());
			}
		});
		// column types: id (INT), name (STRING), age (INT)
		TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] { BasicTypeInfo.INT_TYPE_INFO,
				BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO };

		JDBCAppendTableSink sink = JDBCAppendTableSink.builder().setDrivername(driverClass).setDBUrl(dbUrl)
				.setUsername(userName).setPassword(passWord).setParameterTypes(fieldTypes)
				.setQuery("insert into student values(?,?,?)").build();

		sink.emitDataStream(ds);

		try {
			env.execute();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}

	public static void query() {
		StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
		// select * returns three columns (id, name, age), so the row type needs three matching fields
		TypeInformation<?>[] fieldTypes = new TypeInformation<?>[] { BasicTypeInfo.INT_TYPE_INFO,
				BasicTypeInfo.STRING_TYPE_INFO, BasicTypeInfo.INT_TYPE_INFO };

		RowTypeInfo rowTypeInfo = new RowTypeInfo(fieldTypes);
		// query MySQL
		JDBCInputFormat jdbcInputFormat = JDBCInputFormat.buildJDBCInputFormat().setDrivername(driverClass)
				.setDBUrl(dbUrl).setUsername(userName).setPassword(passWord).setQuery("select * from student")
				.setRowTypeInfo(rowTypeInfo).finish();
		DataStreamSource<Row> input1 = env.createInput(jdbcInputFormat);
		input1.print();
		try {
			env.execute();
		} catch (Exception e) {
			e.printStackTrace();
		}
	}
}
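
One tuning knob worth knowing: JDBCAppendTableSink buffers rows and flushes them as JDBC batches (5,000 rows per flush by default in Flink 1.9). If you want to experiment with the flush size, the builder exposes it; a sketch:

		JDBCAppendTableSink sink = JDBCAppendTableSink.builder()
				.setDrivername(driverClass)
				.setDBUrl(dbUrl)
				.setUsername(userName)
				.setPassword(passWord)
				.setParameterTypes(fieldTypes)
				.setQuery("insert into student values(?,?,?)")
				.setBatchSize(1000) // flush every 1,000 rows instead of the default
				.build();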

package org.sea.spring.boot.batch.test;

import java.util.ArrayList;
import java.util.List;

import org.junit.Test;
import org.sea.spring.boot.batch.FLink2Mysql;
import org.sea.spring.boot.batch.model.StudentEntity;
import org.springframework.util.StopWatch;

import lombok.extern.log4j.Log4j2;
@Log4j2
public class TestFlink {

	@Test
	public void test() {
		
		// build the test data
		StopWatch watch = new StopWatch("testAdd1");
		watch.start("build");
		List<StudentEntity> list =new ArrayList<StudentEntity>(10000);
		for(int i=0;i<10000;i++) {
			list.add(init(i + 210000)); // offset ids to avoid colliding with existing rows (the table's AUTO_INCREMENT is already at 220000)
		}
		watch.stop();
	 
		// save
		watch.start("save");
		FLink2Mysql.add(list);
		watch.stop();
		log.info(watch.prettyPrint());
	}
	 
	private StudentEntity init(int i) {
		StudentEntity student = new StudentEntity();
		student.setId(i);
		student.setName("name" + i);
		student.setAge(i);
		return student;
	}
}


GitHub source: https://github.com/loveseaone/sea-spring-boot-batch.git

If this article helped you, follow the author's WeChat official account and give it a like; it is much appreciated!
You can also download the e-books the author has collected over more than ten years of work.
WeChat official account: TalkNewClass
