Parsing messages exported from QQ and Enterprise QQ (mht format, large files supported) into HTML with Java, including embedded images

Original

itrider

2020-06-02 08:14

The code repository is linked at the end of this article.

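For particular reasons I switched to a different messaging tool, so I needed to back up my old chat messages and still be able to browse or search them.

I found that QQ can export messages in mht format. Internally, this format simply writes the HTML, CSS, and images (base64-encoded) into a single mht file following a fixed pattern, so the file can be parsed by following that pattern.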
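As a minimal sketch of the image step described above: each image part carries a line-wrapped base64 body, which can be decoded with the JDK's MIME decoder. The class name `MhtPartDecoder` and its methods are hypothetical and are not taken from the project; the project's own `generateImage` method may work differently.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Base64;

/** Hypothetical helper for illustration; not part of the original project. */
final class MhtPartDecoder {

    /** Decodes the line-wrapped base64 body of one MIME part into raw bytes. */
    static byte[] decodeBody(String base64Body) {
        // The MIME decoder skips line breaks and other characters outside the base64 alphabet.
        return Base64.getMimeDecoder().decode(base64Body);
    }

    /** Writes the decoded bytes to targetDir/fileName, creating the directory if needed. */
    static Path writePart(String base64Body, Path targetDir, String fileName) throws IOException {
        Files.createDirectories(targetDir);
        return Files.write(targetDir.resolve(fileName), decodeBody(base64Body));
    }
}
```

The MIME decoder is used here rather than the basic decoder because the body of a part is collected line by line, so it still contains newline characters.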

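When the file is large, pagination has to be considered, otherwise the generated HTML file becomes huge. The largest single HTML file I got after parsing was 500 MB (with all messages exported), which is very inconvenient to browse, so I added a pagination feature.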

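First, how it looks in use:

1. Put the program into the folder that contains the mht file.

The program automatically looks for mht files in the current folder, converts them, and saves the output into a folder with the same name as the mht file.

2. Double-click run.bat to run it; pagination is optional.

Note: if the mht file is smaller than 100 MB, the program does not paginate even when pagination is selected.

3. Preview the result.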
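The discovery and pagination rules in the steps above could be sketched roughly as follows. This is an illustration under assumptions, not the project's code: the class `MhtFileFinder`, its method names, and the reading of "100 MB" as 100 * 1024 * 1024 bytes are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of the original project. */
final class MhtFileFinder {

    /** Assumed interpretation of the 100 MB limit mentioned in the article. */
    static final long PAGING_THRESHOLD_BYTES = 100L * 1024 * 1024;

    /** Lists the regular files in dir whose name ends with ".mht" (case-insensitive). */
    static List<Path> findMhtFiles(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().toLowerCase(Locale.ROOT).endsWith(".mht"))
                    .sorted()
                    .collect(Collectors.toList());
        }
    }

    /** Paginates only when the user asked for it and the file reaches the threshold. */
    static boolean shouldPaginate(long fileSizeBytes, boolean userWantsPaging) {
        return userWantsPaging && fileSizeBytes >= PAGING_THRESHOLD_BYTES;
    }

    /** Output folder next to the mht file, named after it without the ".mht" extension. */
    static Path outputDirFor(Path mhtFile) {
        String name = mhtFile.getFileName().toString();
        String base = name.substring(0, name.length() - ".mht".length());
        return mhtFile.resolveSibling(base);
    }
}
```

A caller would typically combine these as `shouldPaginate(Files.size(file), userChoice)` for each path returned by `findMhtFiles`.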

Selected code walkthrough:

1. Generating a single-file HTML page

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;

    /**
     * Creates a single-file HTML page from an exported mht file.
     * @param inputFile      path of the mht file to read
     * @param outputFilePath directory that receives the HTML file and its images
     */
    public static void readAndCreateFile(String inputFile, String outputFilePath) {
        String htmlFileName = parseHtmlFileName(inputFile);

        // try-with-resources closes the reader and the underlying stream on every path.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8),
                5 * 1024 * 1024)) {

            boolean isCreatedHtml = false, isHtmlContent = false;
            String line;
            StringBuilder sb = null;
            String resName = null;
            boolean isGetResName = false;
            StringBuilder resSb = new StringBuilder();
            while ((line = reader.readLine()) != null) {
                // Phase 1: collect everything between the HTML start and end tags.
                if (!isCreatedHtml) {
                    if (isHtmlStartTag(line)) {
                        isHtmlContent = true;
                        sb = new StringBuilder(line).append("\n");
                    } else if (isHtmlContent) {
                        sb.append(line).append("\n");
                        if (isHtmlEndTag(line)) {
                            createHtmlFile(outputFilePath, htmlFileName, sb.toString(), 0, true);
                            sb.setLength(0);
                            isCreatedHtml = true;
                        }
                    }
                }

                // Phase 2: parse the resource (image) parts that follow the HTML.
                if (isCreatedHtml) {
                    if (!isGetResName) {
                        resName = parseResourceName(line);
                        if (resName != null) {
                            isGetResName = true;
                            resSb.setLength(0);
                            continue;
                        }
                    }
                    if (isGetResName && line.length() > 0) {
                        if (line.contains("------=_NextPart_")) {
                            // A boundary line ends the current part: write the image out.
                            isGetResName = false;
                            generateImage(resSb.toString(), (outputFilePath + File.separator + htmlFileName), resName);
                        } else {
                            resSb.append(line).append("\n");
                        }
                    }
                }
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException.
            e.printStackTrace();
        }
    }
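The excerpt above calls `isHtmlStartTag`, `isHtmlEndTag`, and `parseResourceName`, whose bodies are not shown in this article. The following is only a guess at what such helpers might look like. It assumes that each resource part names its file in a `Content-Location:` header line, which should be checked against a real export; the class name `MhtLineHelpers` is hypothetical, and the actual implementations live in the repository linked at the end.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-ins for helpers that the article calls but does not show. */
final class MhtLineHelpers {

    // Assumption: a resource part announces its file name as "Content-Location:<name>".
    private static final Pattern CONTENT_LOCATION =
            Pattern.compile("^Content-Location:\\s*(.+?)\\s*$", Pattern.CASE_INSENSITIVE);

    /** True if the line begins with an opening html tag, ignoring case and indentation. */
    static boolean isHtmlStartTag(String line) {
        return line.trim().toLowerCase(Locale.ROOT).startsWith("<html");
    }

    /** True if the line contains the closing html tag, ignoring case. */
    static boolean isHtmlEndTag(String line) {
        return line.toLowerCase(Locale.ROOT).contains("</html>");
    }

    /** Returns the resource name from a Content-Location header line, or null for any other line. */
    static String parseResourceName(String line) {
        Matcher m = CONTENT_LOCATION.matcher(line);
        return m.find() ? m.group(1) : null;
    }
}
```

Returning `null` for non-matching lines mirrors how the loop above uses the result: it keeps scanning until a name is found, then starts collecting the base64 body.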

2. Generating paginated HTML pages

    import java.io.BufferedReader;
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;

    /**
     * Creates paginated HTML pages from an exported mht file.
     * @param inputFile      path of the mht file to read
     * @param outputFilePath directory that receives the HTML pages and their images
     */
    public static void readAndCreateMultFile(String inputFile, String outputFilePath) {
        String htmlFileName = parseHtmlFileName(inputFile);

        // try-with-resources closes the reader and the underlying stream on every path.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(inputFile), StandardCharsets.UTF_8),
                5 * 1024 * 1024)) {

            boolean isCreatedHtml = false, isTableContent = false;
            String line;
            StringBuilder sb = null;
            int trLine = 1;  // number of lines read inside the table so far
            int htmlNo = 1;  // number of the page currently being built
            String resName = null;
            boolean isGetResName = false;
            StringBuilder resSb = new StringBuilder();
            while ((line = reader.readLine()) != null) {
                // Phase 1: split the table content into pages.
                if (!isCreatedHtml) {
                    // tableStartPattern and tableEndPattern are Pattern fields defined elsewhere in the class.
                    Matcher startMatcher = tableStartPattern.matcher(line);
                    if (startMatcher.find()) {
                        isTableContent = true;
                        // Keep only the content that follows the opening table tag.
                        sb = new StringBuilder(line.substring(startMatcher.end())).append("\n");
                    } else if (isTableContent) {
                        trLine++;
                        Matcher endMatcher = tableEndPattern.matcher(line);
                        if (endMatcher.find()) {
                            // End of the table: write the last page.
                            sb.append(line.substring(0, endMatcher.start()));
                            createHtmlFile(outputFilePath, htmlFileName, sb.toString(), htmlNo, true);
                            isCreatedHtml = true;
                        } else {
                            sb.append(line).append("\n");
                            // Write out a page after every 1000 lines of table content.
                            if (trLine % 1000 == 0) {
                                createHtmlFile(outputFilePath, htmlFileName, sb.toString(), htmlNo, false);
                                htmlNo++;
                                sb.setLength(0);
                            }
                        }
                    }
                }

                // Phase 2: parse the resource (image) parts that follow the HTML.
                if (isCreatedHtml) {
                    if (!isGetResName) {
                        resName = parseResourceName(line);
                        if (resName != null) {
                            isGetResName = true;
                            resSb.setLength(0);
                            continue;
                        }
                    }
                    if (isGetResName && line.length() > 0) {
                        if (line.contains("------=_NextPart_")) {
                            // A boundary line ends the current part: write the image out.
                            isGetResName = false;
                            generateImage(resSb.toString(), (outputFilePath + File.separator + htmlFileName), resName);
                        } else {
                            resSb.append(line).append("\n");
                        }
                    }
                }
            }
        } catch (IOException e) {
            // Also covers FileNotFoundException, which is a subclass of IOException.
            e.printStackTrace();
        }
    }
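The method above passes a page number and an "is last page" flag to `createHtmlFile`, which is not shown here. As a rough sketch of how previous/next links could be derived from those two values, consider the following. The class `PageNav`, the `name_N.html` file naming scheme, and the bare page template are assumptions made for illustration; the project's real output keeps the styling of the original export.

```java
/** Hypothetical sketch of previous/next navigation; not the project's createHtmlFile. */
final class PageNav {

    /** Assumed naming scheme: baseName_1.html, baseName_2.html, ... */
    static String pageFileName(String baseName, int pageNo) {
        return baseName + "_" + pageNo + ".html";
    }

    /** Builds the navigation bar: no "Previous" on page 1, no "Next" on the last page. */
    static String buildNav(String baseName, int pageNo, boolean isLastPage) {
        StringBuilder nav = new StringBuilder("<div class=\"nav\">");
        if (pageNo > 1) {
            nav.append("<a href=\"").append(pageFileName(baseName, pageNo - 1)).append("\">Previous</a>");
        }
        if (!isLastPage) {
            if (pageNo > 1) {
                nav.append(" | ");
            }
            nav.append("<a href=\"").append(pageFileName(baseName, pageNo + 1)).append("\">Next</a>");
        }
        return nav.append("</div>").toString();
    }

    /** Wraps one chunk of table rows in a minimal page with the navigation bar below it. */
    static String wrapPage(String baseName, String rows, int pageNo, boolean isLastPage) {
        return "<html><body><table>" + rows + "</table>"
                + buildNav(baseName, pageNo, isLastPage)
                + "</body></html>";
    }
}
```

Because each page only needs to know its own number and whether it is the last one, pages can be written as soon as a chunk is full, without knowing the total page count in advance. File names containing special characters would need URL encoding in the `href`, which this sketch leaves out.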

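Notes:

1. For performance reasons, the paginated output only offers "previous page" and "next page" links. I personally find that sufficient, so I did not implement page numbers. Readers can extend this; the rough idea is to read the HTML part of the mht file (mainly the tr rows inside the table), put the rows into a list one by one, and then paginate based on that list.

2. After converting, please still keep your original mht file. The feature has been tested and a certain amount of the data has been verified, but I cannot guarantee that no data is lost during conversion because of bugs or unknown factors. So please, please, please keep the original mht file!

3. This was developed in a short amount of time and the code has not been optimized. Some parts contain duplicated code, which readers are welcome to clean up.

4. Files of roughly 6 GB and 10 GB have been tested; larger files have not been tested yet.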
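The page-number extension suggested in note 1 might look roughly like the sketch below: collect the rows in a list, split the list into fixed-size pages, and render a numbered link bar for each page. `RowPaginator`, its methods, and the `name_N.html` naming scheme are hypothetical. Holding every row in memory is also an assumption that is unlikely to suit multi-gigabyte exports, where a first pass that only counts rows may be preferable.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of list-based pagination with page numbers. */
final class RowPaginator {

    /** Splits rows into consecutive pages of at most pageSize elements each. */
    static <T> List<List<T>> partition(List<T> rows, int pageSize) {
        if (pageSize <= 0) {
            throw new IllegalArgumentException("pageSize must be positive");
        }
        List<List<T>> pages = new ArrayList<>();
        for (int from = 0; from < rows.size(); from += pageSize) {
            int to = Math.min(from + pageSize, rows.size());
            // Copy the view so each page is independent of the source list.
            pages.add(new ArrayList<>(rows.subList(from, to)));
        }
        return pages;
    }

    /** Renders "1 2 3 ..." links; the current page is emphasized instead of linked. */
    static String pageNumberLinks(String baseName, int currentPage, int totalPages) {
        StringBuilder links = new StringBuilder();
        for (int page = 1; page <= totalPages; page++) {
            if (page > 1) {
                links.append(' ');
            }
            if (page == currentPage) {
                links.append("<strong>").append(page).append("</strong>");
            } else {
                links.append("<a href=\"").append(baseName).append('_').append(page)
                        .append(".html\">").append(page).append("</a>");
            }
        }
        return links.toString();
    }
}
```

With the pages known up front, `pages.size()` gives the total page count that the streaming previous/next approach cannot provide.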

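Follow-up:

If I find spare time later, I may make some improvements, such as using Lucene to index the generated HTML and support search, so that messages are easier to find.

Code repository: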

https://github.com/itriders/mht2html
