Greenplum bulk operations: deletes and updates run fastest inside the database

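This is a write-up of how I fixed a large amount of duplicate data in a production Greenplum (GP) database: 1. create a temporary table and back up the duplicate rows into it; 2. use the backed-up rows as the query condition to delete from the production table.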

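I have recently been working with Greenplum, a distributed database with an MPP architecture. It has strengths and some rough edges, but overall it feels good. Under the hood it is still PostgreSQL: I am on GP 4.2, which is based on PostgreSQL 8.2, while the newer GP 6.0 is based on PostgreSQL 9.4 and performs considerably better.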

First, the pitfalls and data problems I ran into, covering deleting data, updating data, and partitioned tables.

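Addendum: today, while committing data over JDBC, the database reported an out-of-memory error. Each batch holds 3000 rows. Committing everything as one batch has the advantage that data consistency is guaranteed, but it makes memory spike. In the end I had to split the save into two separate statements. By then Kafka had built up a large backlog, which was consumed normally after the change.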


            conn = JdbcUtil.getConnection();
            // Execute the two statements one after the other instead of as one large batch
            PreparedStatement pst = conn.prepareStatement(sql);
            pst.executeUpdate();

            PreparedStatement pstsql = conn.prepareStatement(sql2);
            pstsql.executeUpdate();

            // A single commit covers both statements
            conn.commit();
            // No addBatch() was used, so close the statements rather than clearing an empty batch
            pst.close();
            pstsql.close();
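
The note above about memory spiking on one very large commit can also be handled by committing in fixed-size chunks. The following is a minimal sketch, not the code used in this post: the class name `ChunkedBatch`, its method names, and the chunk size passed in are all assumptions for illustration, and only standard `java.sql` calls are used.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: commits rows in fixed-size chunks instead of one huge batch.
final class ChunkedBatch {

    // Split a list into consecutive sublists of at most chunkSize elements.
    static <T> List<List<T>> partition(List<T> items, int chunkSize) {
        if (chunkSize <= 0) {
            throw new IllegalArgumentException("chunkSize must be positive");
        }
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    // Run the same parameterized statement for every row, committing once per chunk.
    // Assumes auto-commit has already been disabled on the connection.
    static void executeInChunks(Connection conn, String sql,
                                List<Object[]> rows, int chunkSize) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            for (List<Object[]> chunk : partition(rows, chunkSize)) {
                for (Object[] row : chunk) {
                    for (int i = 0; i < row.length; i++) {
                        ps.setObject(i + 1, row[i]); // JDBC parameters are 1-based
                    }
                    ps.addBatch();
                }
                ps.executeBatch();
                conn.commit();
            }
        }
    }
}
```

The trade-off is the one described above: each chunk is its own transaction, so a failure part-way through leaves the earlier chunks committed.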

At first the driver jar was placed under the project's lib directory and referenced with system scope:

    <dependency>
        <groupId>com.fbcds</groupId>
        <artifactId>fbcds</artifactId>
        <version>1.0</version>
        <scope>system</scope>
        <systemPath>${project.basedir}/src/main/resources/lib/greenplum.jar</systemPath>
    </dependency>

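When the jar is loaded manually like this, the driver cannot be found once the project is packaged and deployed to a container. Installing the jar into the local Maven repository works instead: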

Step 1: download greenplum.jar
    Download link: http://download.csdn.net/download/enterings/10039723?web=web
Step 2: install the local jar into Maven manually
    Run the following at the cmd command line:
    mvn install:install-file -Dfile=D:\maven\greenplum.jar -DgroupId=com.huicai -DartifactId=greenplum -Dversion=1.0 -Dpackaging=jar
    What the options mean:
    -Dfile: the location of the downloaded jar (the local directory plus the jar file name);
    -DgroupId, -DartifactId, -Dversion: the three coordinates under which the jar is stored in the Maven repository;
    -Dpackaging: the packaging type of the file, here jar;
Step 3: add the dependency to pom.xml and run the Maven install again
    Reference the jar like this:
    <!-- greenplum -->
    <dependency>
      <groupId>com.huicai</groupId>
      <artifactId>greenplum</artifactId>
      <version>1.0</version>
    </dependency>
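
Since the symptom here was a driver that could not be found after packaging, a quick classpath check can confirm whether the jar actually made it into the deployed artifact. This is only a diagnostic sketch: `DriverCheck` and `isOnClasspath` are hypothetical names, not part of Maven or the Greenplum driver. The driver class name is the one used in the connection code in this post.

```java
// Hypothetical diagnostic helper; not part of any Greenplum or Maven API.
final class DriverCheck {

    // Returns true if the named class can be loaded from the current classpath.
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String driver = "com.pivotal.jdbc.GreenplumDriver"; // driver class used in this post
        System.out.println(driver + " on classpath: " + isOnClasspath(driver));
    }
}
```

Running this inside the container before the pool is created makes a missing jar show up as a clear message rather than a failure on the first connection.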

Connecting to the database (the same class also wraps other query helpers):

    // Eager initialization: the pool is created when the class is loaded
    private static DruidDataSource dataSource = null;

    static {
        dataSource = new DruidDataSource();
        dataSource.setDriverClassName("com.pivotal.jdbc.GreenplumDriver");
        dataSource.setUsername("gptest");
        dataSource.setPassword("123456");
        dataSource.setUrl("jdbc:pivotal:greenplum://192.168.0.1:8088;DatabaseName=npp_db");

        dataSource.setInitialSize(10);
        dataSource.setMinIdle(10);
        dataSource.setMaxActive(50); // maximum number of active connections
        // Interval between idle-connection eviction runs, in milliseconds
        dataSource.setTimeBetweenEvictionRunsMillis(300000);
        dataSource.setValidationQuery("SELECT 'x'");
        // Minimum time a connection must stay idle in the pool before it can be evicted, in milliseconds
        dataSource.setMinEvictableIdleTimeMillis(300000);
    }

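The bulk-operation code exists mainly to clean up production data. There were a lot of duplicates, and partitioned tables are extremely awkward to handle, so I first create a temporary table that holds the rows to be deleted.

Then, relying on the relationship between the two tables, I query the IDs from the temporary table and use them to delete from the production table. Like an INSERT ... SELECT, this runs inside the database, which is much faster. That is the whole idea: do the delete inside the database instead of pulling the data into the program and working on it there.

The approach is what matters: work out which way of processing is fastest and apply it.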
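
One way to express "delete inside the database" is a DELETE that joins the production table to the staging table, so no IDs travel through the application. The sketch below only builds the SQL text. `StagingDelete` and `buildDeleteUsing` are hypothetical names, the aliases `t` and `s` are arbitrary, and the statement is an illustration of the idea rather than the exact statement run in this post.

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper that builds a DELETE ... USING statement joining on key columns.
final class StagingDelete {

    // Table and column names are concatenated as-is, so they must come from
    // trusted code, never from user input.
    static String buildDeleteUsing(String target, String staging, List<String> keyColumns) {
        if (keyColumns.isEmpty()) {
            throw new IllegalArgumentException("at least one key column is required");
        }
        String join = keyColumns.stream()
                .map(c -> "t.\"" + c + "\" = s.\"" + c + "\"")
                .collect(Collectors.joining(" AND "));
        return "DELETE FROM " + target + " AS t USING " + staging + " AS s WHERE " + join;
    }
}
```

Called with the production table, the backup table, and the columns `snmp_index`, `snmp`, `id`, it produces a single statement that the database can execute as a join.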

        String tablename = Snmp_Table;
        // Copy one row per duplicated ("timestamp", snmp_index) group into the backup table
        String sql_info =
                " insert into da_his.test_five_data_his_1 select min(\"id\") \"id\",min(\"timestamp\") as \"timestamp\",max(snmp_index) "
                        + "as snmp_index,max(snmp) as snmp,max(\"in\") as \"in\",max(\"out\") as \"out\"  from " + tablename
                        + " where \"timestamp\">=? " + " and \"timestamp\"<=? and  (" + sl.toString()
                        + ") and \"in\"<'1250000000000' and  \"out\"<'1250000000000' group by "
                        + "\"timestamp\" ,snmp_index  having count(\"timestamp\")>1 ";
        //List<Map<String, Object>> list_info = (List<Map<String, Object>>) Gp_tools.query(sql_info, new Object[] {obj[0], obj[1]},
        //       new AllObjectMapper());
        // obj[0] and obj[1] are the lower and upper "timestamp" bounds
        Gp_tools.updateTemp(sql_info, new Object[] {obj[0], obj[1]}, true);

    public static boolean updateTemp(String sql, Object[] obj, boolean isGenerateKey) {
        Connection conn = null;

        boolean bFlag = true;
        try {
            conn = JdbcUtil.getConnection();
            // try-with-resources closes the statement even if execution fails
            try (PreparedStatement pstmt = conn.prepareStatement(sql)) {
                // Bind every parameter in order (JDBC indexes start at 1)
                for (int i = 0; i < obj.length; i++) {
                    pstmt.setObject(i + 1, obj[i]);
                }
                pstmt.executeUpdate();
                conn.commit();
            }
        } catch (SQLException ex) {
            ex.printStackTrace();
            bFlag = false; // report the failure to the caller
        } finally {
            try {
                JdbcUtil.releaseConnection(conn);
            } catch (SQLException ex) {
                ex.printStackTrace();
            }
        }
        return bFlag;
    }

Another way to bulk delete, which is also slow:

delete from da_his.tb_dev_five_data_1 as devfive using (values ('877|54',277,1444413654),('877|54',277,1444425009),('877|54',277,1444413674)) as tmp(snmp_index,snmp,"id") where 
 devfive.snmp_index=tmp.snmp_index and devfive.snmp=tmp.snmp and devfive.id=tmp.id ;

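Bulk update works on the same principle, but with billions of rows split across several hundred partitions, this kind of statement can bring the database down within minutes: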

 update test set info=tmp.info from (values (1,'new1'),(2,'new2'),(6,'new6')) as tmp (id,info) where test.id=tmp.id;

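Duplicates can also be deleted as described in the official documentation below; the columns in the final GROUP BY must be the ones that uniquely identify a row. Note that in Greenplum ctid is only unique within a single segment, so this form generally needs gp_segment_id alongside ctid.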

DELETE FROM test
WHERE ctid NOT IN (
SELECT min(ctid)
FROM test
GROUP BY x);
(where 'x' is the unique column list)

The final solution: query the rows to delete from the backup of the temporary table, and have the program build the following SQL:

delete from da_his.tb_dev_five_data_1  where snmp=472 and "id" in(5357929218,5357927657,5357929285,5357928550,5357929262);
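
Building that statement in the program can be done with placeholders rather than by concatenating the id values. The following is a sketch under assumptions: `InListDelete`, `buildDeleteByIds`, and `bind` are hypothetical names, and the `snmp` and `"id"` columns are taken from the statement above.

```java
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Collections;
import java.util.List;

// Hypothetical helper for the "snmp = ? and id in (...)" delete shown above.
final class InListDelete {

    // Builds the delete with one '?' per id, so values are bound rather than concatenated.
    static String buildDeleteByIds(String table, int idCount) {
        if (idCount <= 0) {
            throw new IllegalArgumentException("idCount must be positive");
        }
        String placeholders = String.join(",", Collections.nCopies(idCount, "?"));
        return "delete from " + table + " where snmp=? and \"id\" in (" + placeholders + ")";
    }

    // Binds snmp as parameter 1 and the ids as parameters 2..n+1.
    static void bind(PreparedStatement ps, int snmp, List<Long> ids) throws SQLException {
        ps.setInt(1, snmp);
        for (int i = 0; i < ids.size(); i++) {
            ps.setLong(i + 2, ids.get(i));
        }
    }
}
```

Keeping each id list to a moderate size keeps the individual statements short, and each list can be run as its own statement.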

Run from multiple threads, the deletes finish quickly. It may also have helped that few people use the system at night.
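
The multi-threaded run can be sketched with a fixed-size thread pool from `java.util.concurrent`. This is an illustration only: `ParallelDeletes` and `runAll` are hypothetical names, and each task is assumed to obtain its own connection, execute one delete, and return the affected row count.

```java
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical runner that executes independent delete tasks in parallel.
final class ParallelDeletes {

    // Runs every task on a fixed-size pool and returns the sum of their results,
    // for example the row counts returned by executeUpdate().
    static int runAll(List<Callable<Integer>> tasks, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            int total = 0;
            for (Future<Integer> f : pool.invokeAll(tasks)) {
                total += f.get(); // a failed task surfaces here as ExecutionException
            }
            return total;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for deletes", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a delete task failed", e.getCause());
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread count should stay at or below the pool's maximum number of connections, otherwise tasks simply wait for a free connection.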

大樹168
