From an Excel Export Crash to a First Look at Apache POI

Why I Started Learning This

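It started with an OOM at work. I spent two days on it and got as far as a direction: the OOM happened during Excel export because the objects involved took up too much memory. But when I worked out the data volume afterwards, it was nowhere near enough to cause an OOM. The sticking point was that the conclusion was right, yet I could not even convince myself of it. Without knowing the root cause, the first-pass optimization I had in mind was unlikely to help much.

After that I had no leads and was stuck. Later, an expert on the team tracked it down: the large Excel export used POI's XSSFWorkbook, which drove the system's memory usage too high and eventually caused the OOM and the crash. Here is the expert's analysis:

Once the data exceeds 65536 rows, a program using HSSFWorkbook or XSSFWorkbook reports OutOfMemoryError: Java heap space. In that case SXSSFWorkbook should be used.

Well, I knew nothing about what POI was, let alone XSSFWorkbook and friends. From this troubleshooting I learned only that POI is a tool for file I/O; how it works internally and how to code against it were a complete blank to me. @_@

For a programmer this is a simple, widely used framework, so I have to know it. Besides, this is how I will earn my living, so I have to be professional about it. As the old saying goes, mend the fold after the sheep are lost: I'll learn the basic operations, with a focus on export.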

Why should I use Apache POI?

Source: Apache POI
A major use of the Apache POI api is for Text Extraction applications such as web spiders, index builders, and content management systems.

So why should you use POIFS, HSSF or XSSF?

You’d use POIFS if you had a document written in OLE 2 Compound Document Format, probably written using MFC, that you needed to read in Java. Alternatively, you’d use POIFS to write OLE 2 Compound Document Format if you needed to inter-operate with software running on the Windows platform. We are not just bragging when we say that POIFS is the most complete and correct implementation of this file format to date!

You’d use HSSF if you needed to read or write an Excel file using Java (XLS). You’d use XSSF if you need to read or write an OOXML Excel file using Java (XLSX). The combined SS interface allows you to easily read and write all kinds of Excel files (XLS and XLSX) using Java. Additionally there is a specialized SXSSF implementation which allows to write very large Excel (XLSX) files in a memory optimized way.

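OLE: Object Linking and Embedding

MFC: Microsoft Foundation Classes

OOXML: Office Open XML, a technical specification developed by Microsoft for the Office 2007 products. It has since become an international document format standard, and is compatible with the earlier international standard OpenDocument Format and the Chinese document standard Uniform Office Format (UOF, 標文通).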

Excel workbooks (SS=HSSF+XSSF)

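POI is mainly used for text extraction applications such as web crawlers, index builders, and content management systems.

Use POIFS to read and write OLE 2 format documents in Java.

Use HSSF to read and write .xls Excel files in Java, and XSSF for .xlsx files; the SS interface handles both formats.
For exporting large amounts of data, use the memory-optimized SXSSF.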
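
As a rough illustration of the summary above, the sketch below picks a POI component from the target file extension and the expected amount of data. The enum, its method names, and the "large export" threshold are all made up for this example and are not part of POI. Only the per-sheet limits come from the Excel file formats themselves: 65,536 rows by 256 columns for .xls, and 1,048,576 rows by 16,384 columns for .xlsx.

```java
import java.util.Locale;

/** Hypothetical helper (not part of POI): maps a target file to the POI component to use. */
enum PoiComponent {
    HSSF(65_536, 256),          // .xls, the binary Excel 97-2003 format
    XSSF(1_048_576, 16_384),    // .xlsx (OOXML), whole workbook kept in memory
    SXSSF(1_048_576, 16_384);   // .xlsx, streaming writer for large exports

    /** Assumed threshold: above this many cells, prefer the streaming writer. */
    static final long LARGE_EXPORT_CELLS = 1_000_000L;

    final int maxRows;
    final int maxColumns;

    PoiComponent(int maxRows, int maxColumns) {
        this.maxRows = maxRows;
        this.maxColumns = maxColumns;
    }

    /** True if one sheet of the given shape fits within the format limits. */
    boolean fits(int rows, int columns) {
        return rows <= maxRows && columns <= maxColumns;
    }

    /** Chooses a component from the file extension and the total number of cells. */
    static PoiComponent choose(String fileName, long totalCells) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".xls")) {
            return HSSF;
        }
        if (lower.endsWith(".xlsx")) {
            return totalCells > LARGE_EXPORT_CELLS ? SXSSF : XSSF;
        }
        throw new IllegalArgumentException("Not an Excel file: " + fileName);
    }
}
```

With the test data used later in this post (8 sheets, 5000 rows, 500 columns), `PoiComponent.choose("report.xlsx", 8L * 5000 * 500)` returns `SXSSF`, and `PoiComponent.HSSF.fits(5000, 500)` is false because 500 columns exceed the 256-column limit of .xls.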

Test Environment

Operating system: Windows 10 Pro

Processor: Intel Core™ i5-4200M CPU @ 2.5GHz

Memory: 12 GB

JVM: Java HotSpot™ 64-Bit Server VM (25.72-b15, mixed mode)

Java: version 1.8.0_72, vendor Oracle Corporation

Test Code

package my.poi;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Random;


public class XSSF {

    // These constants were not shown in the original listing; the values are placeholders.
    private static final String FILE_PATH = "./";
    private static final String FILE_NAME_PREFIX = "xssf-";
    private static final String FILE_NAME_SUFFIX = ".xlsx";
    private static final String SHEET_NAME_PREFIX = "sheet";
    private static final String SEPARATOR = "-";

    public static void main(String[] args) throws IOException, InterruptedException {
        // Thread.sleep(3 * 1000);  // gives a monitoring tool time to attach and record memory usage
        (new XSSF()).generateXLSX(8, 5000, 500);
    }

    public void generateXLSX(int sheetNum, int rowNum, int column) throws IOException {
        String fileName = FILE_PATH + FILE_NAME_PREFIX + (rowNum * sheetNum) + SEPARATOR + new Random().nextLong() + FILE_NAME_SUFFIX;
        // try-with-resources closes both the workbook and the stream, even if write() fails
        try (Workbook workbook = generateSheet(sheetNum, rowNum, column);
             OutputStream out = new FileOutputStream(fileName)) {
            workbook.write(out);
        }
    }

    private Workbook generateSheet(int sheetNum, int rowNum, int column) throws IOException {
        Workbook workbook = new XSSFWorkbook();  // really, it all comes down to which object you new up @_@
        for (int sheetIndex = 0; sheetIndex < sheetNum; sheetIndex++) {
            String sheetName = SHEET_NAME_PREFIX + SEPARATOR + sheetIndex;
            Sheet sheet = workbook.createSheet(sheetName);
            for (int i = 0; i < rowNum; i++) {
                Row row = sheet.createRow(i);
                for (int j = 0; j < column; j++) {
                    Cell cell = row.createCell(j);
                    cell.setCellValue(sheetName + "-" + i + "-" + j);
                }
            }
        }
        return workbook;
    }
}


package my.poi;

import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.Random;


public class SXSSF {

    // These constants were not shown in the original listing; the values are placeholders.
    private static final String FILE_PATH = "./";
    private static final String FILE_NAME_PREFIX = "sxssf-";
    private static final String FILE_NAME_SUFFIX = ".xlsx";
    private static final String SHEET_NAME_PREFIX = "sheet";
    private static final String SEPARATOR = "-";

    public static void main(String[] args) throws IOException, InterruptedException {
        // Thread.sleep(3 * 1000);  // gives a monitoring tool time to attach and record memory usage
        (new SXSSF()).generateXLSX(8, 5000, 500);
    }

    public void generateXLSX(int sheetNum, int rowNum, int column) throws IOException {
        String fileName = FILE_PATH + FILE_NAME_PREFIX + (rowNum * sheetNum) + SEPARATOR + new Random().nextLong() + FILE_NAME_SUFFIX;
        SXSSFWorkbook workbook = generateSheet(sheetNum, rowNum, column);
        try (OutputStream out = new FileOutputStream(fileName)) {
            workbook.write(out);
        } finally {
            workbook.dispose();  // delete the temporary files backing this workbook
            workbook.close();
        }
    }

    private SXSSFWorkbook generateSheet(int sheetNum, int rowNum, int column) throws IOException {
        SXSSFWorkbook workbook = new SXSSFWorkbook();  // really, it all comes down to which object you new up @_@
        for (int sheetIndex = 0; sheetIndex < sheetNum; sheetIndex++) {
            String sheetName = SHEET_NAME_PREFIX + SEPARATOR + sheetIndex;
            Sheet sheet = workbook.createSheet(sheetName);
            for (int i = 0; i < rowNum; i++) {
                Row row = sheet.createRow(i);
                for (int j = 0; j < column; j++) {
                    Cell cell = row.createCell(j);
                    cell.setCellValue(sheetName + "-" + i + "-" + j);
                }
            }
        }
        return workbook;
    }
}

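Results:

Each run generates one Excel file with 8 sheets, each sheet with 5000 rows and 500 columns.

Metric	SXSSF	XSSF
Heap size	1,198,522,368 bytes	3,165,650,944 bytes
Heap used	434,768,696 bytes	2,687,171,360 bytes
Heap max	3,193,962,496 bytes	3,193,962,496 bytes
Execution time	2 min	2 h+ (did not finish; I had to go to bed)
Generated file size	98.7 MB (103,504,645 bytes)	expected to be the same (never finished generating)
Classes	1,345	943
Instances	10,322,964	52,630,412
Bytes	422,870,624	2,687,336,880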

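Analysis:

From this we can conclude that with XSSF every object stays in memory during export, so the memory needed keeps growing with the data until it exceeds the limit and causes an OOM. SXSSF instead persists rows to disk: once the objects in memory reach the window limit, adding more rows to the sheet does not take up more memory, which avoids the crash caused by OOM.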
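
To put the numbers from the results table side by side, here is a small back-of-the-envelope calculation. It uses only the figures reported above; the class and method names are made up for illustration. Keep in mind that the XSSF run never finished, so its heap figure is a snapshot taken while the export was still running.

```java
/** Back-of-the-envelope arithmetic using only the figures from the results table. */
final class ExportFootprint {
    static final long SHEETS = 8;
    static final long ROWS = 5_000;
    static final long COLUMNS = 500;

    static final long XSSF_USED_BYTES = 2_687_171_360L;
    static final long SXSSF_USED_BYTES = 434_768_696L;
    static final long MAX_HEAP_BYTES = 3_193_962_496L;

    /** Total number of cells written by one test run. */
    static long totalCells() {
        return SHEETS * ROWS * COLUMNS;
    }

    /** Fraction of the maximum heap that was in use. */
    static double heapFraction(long usedBytes) {
        return (double) usedBytes / MAX_HEAP_BYTES;
    }

    public static void main(String[] args) {
        System.out.printf("cells to write: %d%n", totalCells());
        System.out.printf("XSSF  used/max heap: %.0f%%%n", 100 * heapFraction(XSSF_USED_BYTES));
        System.out.printf("SXSSF used/max heap: %.0f%%%n", 100 * heapFraction(SXSSF_USED_BYTES));
        System.out.printf("XSSF/SXSSF used heap: %.1fx%n",
                (double) XSSF_USED_BYTES / SXSSF_USED_BYTES);
    }
}
```

The run writes 20 million cells. XSSF was already using about 84% of the maximum heap when the snapshot was taken, while SXSSF stayed at about 14%, roughly a sixfold difference.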

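Why SXSSF Uses Less Memory Than XSSF

SXSSF is the replacement for XSSF when exporting large amounts of data to Excel, or when the JVM heap limit is low. Unlike XSSF, SXSSF does not load all of the data into JVM memory; it keeps only a fixed number of rows in memory. That threshold, windowSize, is set when the workbook is created (the default is 100). When a new row would exceed the limit, the oldest row in memory is persisted to disk in a temporary file. This bounds the data held in memory, so memory usage does not grow with the amount of data.

Note in particular that the temporary files written to disk must be cleaned up by calling dispose().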
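
The sliding-window idea can be sketched with nothing but the JDK. The class below is a toy model, not POI's actual implementation: the name `RowWindow`, the tab-separated temp file format, and the use of `close()` in place of `dispose()` are all assumptions made for illustration. It keeps at most windowSize rows in a map, writes the lowest-index row to a temporary file when a new row would exceed the window, and deletes that file on cleanup.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Map;
import java.util.TreeMap;

/** Toy model of a row window (NOT POI's code): keeps at most windowSize rows in memory. */
final class RowWindow implements AutoCloseable {
    private final int windowSize;
    private final TreeMap<Integer, String> rows = new TreeMap<>();
    private final Path tempFile;
    private final BufferedWriter spill;

    RowWindow(int windowSize) {
        if (windowSize < 1) {
            throw new IllegalArgumentException("windowSize must be at least 1");
        }
        this.windowSize = windowSize;
        try {
            this.tempFile = Files.createTempFile("row-window-", ".tmp");
            this.spill = Files.newBufferedWriter(tempFile, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Adds a row; if the window is full, the lowest-index row is written to disk first. */
    void createRow(int rowNum, String content) {
        if (rows.size() >= windowSize) {
            Map.Entry<Integer, String> oldest = rows.pollFirstEntry();
            try {
                spill.write(oldest.getKey() + "\t" + oldest.getValue());
                spill.newLine();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        rows.put(rowNum, content);
    }

    /** Returns the row if it is still in the window, or null once it has been flushed. */
    String getRow(int rowNum) {
        return rows.get(rowNum);
    }

    int rowsInMemory() {
        return rows.size();
    }

    Path tempFile() {
        return tempFile;
    }

    /** Plays the role of dispose(): closes and deletes the temporary file. */
    @Override
    public void close() {
        try {
            spill.close();
            Files.deleteIfExists(tempFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        try (RowWindow window = new RowWindow(100)) {
            for (int i = 0; i < 1000; i++) {
                window.createRow(i, "row-" + i);
            }
            System.out.println(window.getRow(0));    // null: flushed to the temp file
            System.out.println(window.getRow(999));  // row-999: still in memory
        }
    }
}
```

However many rows are created, the map never holds more than windowSize entries, which is why memory stays flat as the export grows. If `close()` is skipped, the temporary file is left behind on disk, which is the same failure mode as forgetting dispose() on a real SXSSFWorkbook.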

From the official documentation, SXSSF (Streaming Usermodel API):

SXSSF (package: org.apache.poi.xssf.streaming) is an API-compatible streaming extension of XSSF to be used when very large spreadsheets have to be produced, and heap space is limited. SXSSF achieves its low memory footprint by limiting access to the rows that are within a sliding window, while XSSF gives access to all rows in the document. Older rows that are no longer in the window become inaccessible, as they are written to the disk.

You can specify the window size at workbook construction time via new SXSSFWorkbook(int windowSize) or you can set it per-sheet via SXSSFSheet#setRandomAccessWindowSize(int windowSize)

When a new row is created via createRow() and the total number of unflushed records would exceed the specified window size, then the row with the lowest index value is flushed and cannot be accessed via getRow() anymore.

The default window size is 100 and defined by SXSSFWorkbook.DEFAULT_WINDOW_SIZE.

A windowSize of -1 indicates unlimited access. In this case all records that have not been flushed by a call to flushRows() are available for random access.

Note that SXSSF allocates temporary files that you must always clean up explicitly, by calling the dispose method.

SXSSFWorkbook defaults to using inline strings instead of a shared strings table. This is very efficient, since no document content needs to be kept in memory, but is also known to produce documents that are incompatible with some clients. With shared strings enabled all unique strings in the document has to be kept in memory. Depending on your document content this could use a lot more resources than with shared strings disabled.

Please note that there are still things that still may consume a large amount of memory based on which features you are using, e.g. merged regions, hyperlinks, comments, … are still only stored in memory and thus may require a lot of memory if used extensively.

Carefully review your memory budget and compatibility needs before deciding whether to enable shared strings or not.
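
The trade-off around shared strings can be pictured with a small model. This is only a sketch of the general idea of a string table (each distinct string stored once and referenced by index), not POI's implementation, and `SharedStringsModel` is a made-up name. The point is that the table grows with the number of distinct strings, so data where every cell value is different gains nothing from sharing and pays the full memory cost.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of a shared strings table (not POI's implementation). */
final class SharedStringsModel {
    private final Map<String, Integer> indexByString = new LinkedHashMap<>();

    /** Returns the table index for the value, adding it on first use. */
    int intern(String value) {
        Integer existing = indexByString.get(value);
        if (existing != null) {
            return existing;
        }
        int index = indexByString.size();
        indexByString.put(value, index);
        return index;
    }

    /** Number of distinct strings the table has to keep in memory. */
    int uniqueCount() {
        return indexByString.size();
    }

    public static void main(String[] args) {
        SharedStringsModel repeated = new SharedStringsModel();
        SharedStringsModel unique = new SharedStringsModel();
        for (int i = 0; i < 10_000; i++) {
            repeated.intern("status-" + (i % 3));  // only 3 distinct values
            unique.intern("sheet-0-" + i + "-0");  // every value differs
        }
        System.out.println(repeated.uniqueCount()); // 3
        System.out.println(unique.uniqueCount());   // 10000
    }
}
```

The test data in this post writes a different string into every cell (sheet name, row, and column), so it resembles the second case: a shared table would end up with one entry per cell.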

Verification test code:

package my.poi;

import org.junit.Assert;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.util.CellReference;
import org.apache.poi.xssf.streaming.SXSSFSheet;
import org.apache.poi.xssf.streaming.SXSSFWorkbook;
import org.junit.Test;

import java.io.FileOutputStream;
import java.io.IOException;

public class SXSSFTest {
  
    @Test
    public void autoFlush() throws Throwable {
        // keep 100 rows in memory, exceeding rows will be flushed to disk
        try (SXSSFWorkbook wb = new SXSSFWorkbook(100);) {
            Sheet sh = wb.createSheet();
            for (int rownum = 0; rownum < 1000; rownum++) {
                Row row = sh.createRow(rownum);
                for (int cellnum = 0; cellnum < 10; cellnum++) {
                    Cell cell = row.createCell(cellnum);
                    String address = new CellReference(cell).formatAsString();
                    cell.setCellValue(address);
                }
            }
            // Rows with rownum < 900 are flushed and not accessible
            for (int rownum = 0; rownum < 900; rownum++) {
                Assert.assertNull(sh.getRow(rownum));
            }
            // the last 100 rows are still in memory
            for (int rownum = 900; rownum < 1000; rownum++) {
                Assert.assertNotNull(sh.getRow(rownum));
            }

            try (FileOutputStream out = new FileOutputStream("sxssf.xlsx")) {
                wb.write(out);
            }

            // dispose of temporary files backing this workbook on disk
            wb.dispose();
        }
    }


    @Test
    public void manuallyFlush() throws Throwable {
        // turn off auto-flushing and accumulate all rows in memory
        try (SXSSFWorkbook wb = new SXSSFWorkbook(-1)) {
            SXSSFSheet sh = (SXSSFSheet)wb.createSheet();
            for (int rownum = 0; rownum < 1000; rownum++) {
                Row row = sh.createRow(rownum);
                for (int cellnum = 0; cellnum < 10; cellnum++) {
                    Cell cell = row.createCell(cellnum);
                    String address = new CellReference(cell).formatAsString();
                    cell.setCellValue(address);
                }
                // manually control how rows are flushed to disk
                if (rownum % 100 == 0) {
                    sh.flushRows(100); // retain 100 last rows and flush all others
                    // ((SXSSFSheet)sh).flushRows() is a shortcut for ((SXSSFSheet)sh).flushRows(0),
                    // this method flushes all rows
                }
            }

            try (FileOutputStream out = new FileOutputStream("sxssf.xlsx")) {
                wb.write(out);
            }
            // dispose of temporary files backing this workbook on disk
            wb.dispose();
        }
    }

}

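To be fair, it all comes down to not having read enough, and paying for it @_@. This youngster still needs to study hard and pick up some know-how (^_^)
Hmm, the code we changed at work does not seem to call dispose() on wb at the end. Off to quietly add it, and keep the credit hidden.

Appendix:

Maven dependencies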

        <!-- https://mvnrepository.com/artifact/org.apache.poi/poi -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi</artifactId>
            <version>3.11</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.11</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml-schemas -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml-schemas</artifactId>
            <version>3.11</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.poi/poi-scratchpad -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.11</version>
        </dependency>
        <!-- https://mvnrepository.com/artifact/org.apache.poi/poi-excelant -->
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-excelant</artifactId>
            <version>3.11</version>
        </dependency>
