Parsing EPUB E-books in Java with epublib

Original

光滑的秃头

2020-06-28 09:28

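What is the epub format

Just as video files come in MP4, AVI, RMVB, and other formats, e-books also come in many formats. See: "Understanding the mobi, azw3, and epub e-book formats in one article".

An epub file is a ZIP archive, so you can change its extension (for example to .zip), unzip it, and inspect the files inside.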
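
Because an epub file is a ZIP archive, the same inspection can be done from Java without renaming anything. The following is a minimal sketch that uses only `java.util.zip` from the standard library; the class name `EpubZipListing` and the method `listEntries` are made up for illustration and are not part of epublib.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of epublib: lists the files packed inside an EPUB.
class EpubZipListing {

    // Reads the stream as a ZIP archive and returns the entry names in archive order.
    static List<String> listEntries(InputStream in) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Pass the path of an .epub file as the first argument.
        try (InputStream in = new FileInputStream(args[0])) {
            for (String name : listEntries(in)) {
                System.out.println(name);
            }
        }
    }
}
```

For a typical epub the listing includes a `mimetype` entry, `META-INF/container.xml`, a package document (`.opf`), and the HTML, CSS, and image files that make up the book.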

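Parsing epub e-books in Java

When I first got this requirement, I searched online for a long time without finding a good way to parse epub files. I eventually found the epublib parsing library, but downloading the matching jar was a hassle; in the end I found it by searching the Maven repository.

The epublib parsing library

epublib: a Java library for reading and writing epub files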

GitHub: https://github.com/psiegman/epublib

Official site: http://www.siegmann.nl/epublib

API docs: http://www.siegmann.nl/static/epublib/apidocs/ (in English)

Step 1: Add the dependencies to the pom file

        <dependency>
            <groupId>com.positiondev.epublib</groupId>
            <artifactId>epublib-core</artifactId>
            <version>3.1</version>
        </dependency>
        <!-- HTML parsing -->
        <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.12.1</version>
        </dependency>

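Step 2: Key classes

1. Book: represents the e-book. From a Book object you can get its Resources, Metadata, and other content.
2. Resource: a content resource of the e-book. Each Resource is one part of the book's content and can be HTML, CSS, JS, an image, and so on.
3. Resources: all Resource objects of the e-book. A Resource can be looked up by id, href, or MediaType.
4. Metadata: the descriptive information of the e-book, such as author, publisher, and language.
5. Spine: the order of the e-book's resources. Some people describe it as the table of contents, but it is not; it is the reading order of the resources and is a linear structure.
6. TableOfContents: the table of contents of the e-book, a tree structure. Each entry gives access to the Resource it points to.
7. MediaType: describes the type of a Resource (CSS, JS, image, HTML, video, etc.).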
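
The relationship between these classes is easier to see on a small book built in memory. The sketch below is an illustration under stated assumptions, not code from the library's documentation: the class `KeyClassesDemo` and its methods are hypothetical, the classes live in the `nl.siegmann.epublib.domain` package, and `Book.addSection` is assumed to register the resource, append it to the spine, and add a table-of-contents entry in one call.

```java
import java.nio.charset.StandardCharsets;

import nl.siegmann.epublib.domain.Book;
import nl.siegmann.epublib.domain.Resource;

// Hypothetical demo class: builds a tiny Book in memory to show how the key classes relate.
class KeyClassesDemo {

    static Book buildSampleBook() {
        Book book = new Book();
        book.getMetadata().addTitle("Demo Book");

        // addSection registers the resource, appends it to the spine and adds a TOC entry.
        byte[] html = "<html><body><p>Hello</p></body></html>".getBytes(StandardCharsets.UTF_8);
        book.addSection("Chapter 1", new Resource(html, "chapter1.html"));

        // A stylesheet is a resource too, but it is in neither the spine nor the TOC.
        byte[] css = "p { margin: 0; }".getBytes(StandardCharsets.UTF_8);
        book.getResources().add(new Resource(css, "style.css"));
        return book;
    }

    // Collects the title and the sizes of the three containers into one line.
    static String summarize(Book book) {
        return "title=" + book.getMetadata().getFirstTitle()
                + ", resources=" + book.getResources().size()
                + ", spine=" + book.getSpine().size()
                + ", toc=" + book.getTableOfContents().size();
    }

    public static void main(String[] args) {
        Book book = buildSampleBook();
        System.out.println(summarize(book));

        // Resources can be looked up by href, as described in the list above.
        Resource css = book.getResources().getByHref("style.css");
        System.out.println(css.getHref() + " -> " + css.getMediaType());
    }
}
```

With this sample, Resources should hold two entries while the spine and the table of contents hold one each, which is why content such as stylesheets and images has to be fetched through Resources rather than through the spine.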

Step 3: Parse an epub file

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.Collection;
    import java.util.List;

    import nl.siegmann.epublib.domain.Book;
    import nl.siegmann.epublib.domain.MediaType;
    import nl.siegmann.epublib.domain.Metadata;
    import nl.siegmann.epublib.domain.Resource;
    import nl.siegmann.epublib.domain.Resources;
    import nl.siegmann.epublib.domain.Spine;
    import nl.siegmann.epublib.domain.SpineReference;
    import nl.siegmann.epublib.domain.TOCReference;
    import nl.siegmann.epublib.domain.TableOfContents;
    import nl.siegmann.epublib.epub.EpubReader;

    public class EpubParseDemo {

        public static void main(String[] args) {
            // try-with-resources closes the stream even if parsing fails
            try (InputStream in = new FileInputStream("E:\\Download\\紅樓夢.epub")) {
                // Read the epub file from the input stream
                Book book = new EpubReader().readEpub(in);

                // Metadata of the book (title, author, ...)
                Metadata metadata = book.getMetadata();
                System.out.println("First title: " + metadata.getFirstTitle());

                // All resources of the book
                Resources resources = book.getResources();
                System.out.println("Total resources: " + resources.size());
                Collection<String> allHrefs = resources.getAllHrefs();
                for (String href : allHrefs) {
                    Resource resource = resources.getByHref(href);
                    // Raw content of the resource: may be CSS, HTML, an image, etc.
                    byte[] data = resource.getData();
                    // Type of the content: CSS, HTML, image, ...
                    MediaType mediaType = resource.getMediaType();
                }

                // Content resources of the book
                List<Resource> contents = book.getContents();
                System.out.println("Content resources: " + contents.size());

                // Spine of the book: resources in linear reading order
                Spine spine = book.getSpine();
                System.out.println("Spine resources: " + spine.size());
                List<SpineReference> spineReferences = spine.getSpineReferences();
                for (SpineReference spineReference : spineReferences) {
                    Resource resource = spineReference.getResource();
                    byte[] data = resource.getData();
                    MediaType mediaType = resource.getMediaType();
                }

                // Table of contents of the book
                TableOfContents tableOfContents = book.getTableOfContents();
                System.out.println("TOC entries: " + tableOfContents.size());
                List<TOCReference> tocReferences = tableOfContents.getTocReferences();
                for (TOCReference tocReference : tocReferences) {
                    // Resource that this TOC entry points to
                    Resource resource = tocReference.getResource();
                    byte[] data = resource.getData();
                    MediaType mediaType = resource.getMediaType();
                    if (!tocReference.getChildren().isEmpty()) {
                        // Child entries: process them the same way, for example recursively
                    }
                }
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
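
The loop above stops at the top-level entries and leaves the child entries as a comment. One way to handle them is a depth-first recursion over `getChildren()`. The following is a minimal sketch, assuming the epublib domain classes used above; `TocWalker`, `outline`, `collect`, and `page` are hypothetical names, and the sample TOC is built in memory so that the example runs without an epub file.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

import nl.siegmann.epublib.domain.Book;
import nl.siegmann.epublib.domain.Resource;
import nl.siegmann.epublib.domain.TOCReference;

// Hypothetical helper: flattens the TOC tree into indented title lines.
class TocWalker {

    static List<String> outline(List<TOCReference> refs) {
        List<String> lines = new ArrayList<>();
        collect(refs, 0, lines);
        return lines;
    }

    // Depth-first walk: emit the entry, then recurse into its children one level deeper.
    private static void collect(List<TOCReference> refs, int depth, List<String> lines) {
        for (TOCReference ref : refs) {
            StringBuilder line = new StringBuilder();
            for (int i = 0; i < depth; i++) {
                line.append("  ");
            }
            lines.add(line.append(ref.getTitle()).toString());
            collect(ref.getChildren(), depth + 1, lines);
        }
    }

    // Creates a placeholder HTML resource for the sample TOC.
    private static Resource page(String href) {
        return new Resource("<html/>".getBytes(StandardCharsets.UTF_8), href);
    }

    public static void main(String[] args) {
        // Build a two-level TOC in memory so the example runs without an epub file.
        Book book = new Book();
        TOCReference chapter = book.addSection("Chapter 1", page("chapter1.html"));
        book.addSection(chapter, "Section 1.1", page("section1_1.html"));
        for (String line : outline(book.getTableOfContents().getTocReferences())) {
            System.out.println(line);
        }
    }
}
```

For a book read with `EpubReader`, the same `outline` call can be given `book.getTableOfContents().getTocReferences()`; inside `collect`, `ref.getResource()` gives access to the content of each entry.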

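Notes

1. The data obtained after parsing is rich-text content in HTML format. If you need plain text, you can use jsoup to extract the text of the P tags, but the resulting plain text loses the original layout.
2. The resources may include images, CSS, and other content that is not in the table of contents or the spine; these can be retrieved with methods such as Resources.getByHref.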
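
The plain-text extraction from note 1 can be sketched with jsoup as follows. This is an illustration rather than the author's code: `ChapterText` and `toPlainText` are hypothetical names, and joining the paragraphs with newlines is an assumption chosen to keep at least the paragraph breaks, since everything else about the layout (headings, emphasis, images) is still lost.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Hypothetical helper: extracts paragraph text from a chapter's HTML with jsoup.
class ChapterText {

    // Returns the text of every non-empty <p> element, one paragraph per line.
    static String toPlainText(String html) {
        Document doc = Jsoup.parse(html);
        List<String> paragraphs = new ArrayList<>();
        for (Element p : doc.select("p")) {
            String text = p.text().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return String.join("\n", paragraphs);
    }

    public static void main(String[] args) {
        String html = "<html><body><h1>Title</h1><p>First paragraph.</p>"
                + "<p>Second <b>paragraph</b>.</p></body></html>";
        System.out.println(toPlainText(html));
    }
}
```

To apply this to a chapter read by epublib, the bytes from `resource.getData()` first have to be decoded into a string, for example with `new String(resource.getData(), resource.getInputEncoding())`, assuming the resource reports its encoding correctly.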
