Several ways to strip fully closed and partially unclosed HTML tags in Java

Original

2020-06-29 22:25

1. Stripping HTML tags with regular expressions

    import java.util.Objects;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    /**
     * Strips HTML tags.
     *
     * @param html
     *            the HTML content
     * @return the text with the tags removed
     */
    public static String dislodgeHtmlLabel(String html) {

        // null input yields an empty string
        if (Objects.isNull(html)) {
            return "";
        }

        // Pattern for script elements, including their content
        Pattern pscript = Pattern.compile("<script[^>]*?>[\\s\\S]*?<\\/script>", Pattern.CASE_INSENSITIVE);

        // Pattern for style elements, including their content
        Pattern pstyle = Pattern.compile("<style[^>]*?>[\\s\\S]*?<\\/style>", Pattern.CASE_INSENSITIVE);

        // Pattern for any remaining HTML tag that has a closing '>'
        Pattern phtml = Pattern.compile("<[^>]+>", Pattern.CASE_INSENSITIVE);

        // Remove script elements
        Matcher mscript = pscript.matcher(html);
        html = mscript.replaceAll("");

        // Remove style elements
        Matcher mstyle = pstyle.matcher(html);
        html = mstyle.replaceAll("");

        // Remove the remaining HTML tags
        Matcher mHtml = phtml.matcher(html);
        html = mHtml.replaceAll("");

        // Return the plain text
        return html.trim();
    }
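
The method above compiles its three patterns on every call. As a minimal sketch of the same technique, assuming the method is called often enough for that to matter, the patterns can be compiled once and kept in static fields. The class name HtmlTagStripper and the method name stripTags are hypothetical and not part of the original code.

```java
import java.util.regex.Pattern;

public class HtmlTagStripper {

    // Compiled once and reused; Pattern instances are immutable and safe to share between threads.
    private static final Pattern SCRIPT =
            Pattern.compile("<script[^>]*>[\\s\\S]*?</script>", Pattern.CASE_INSENSITIVE);
    private static final Pattern STYLE =
            Pattern.compile("<style[^>]*>[\\s\\S]*?</style>", Pattern.CASE_INSENSITIVE);
    // This pattern contains no letters, so it needs no case-insensitive flag.
    private static final Pattern TAG = Pattern.compile("<[^>]+>");

    // Hypothetical name; same steps as dislodgeHtmlLabel, in the same order.
    public static String stripTags(String html) {
        if (html == null) {
            return "";
        }
        String text = SCRIPT.matcher(html).replaceAll("");
        text = STYLE.matcher(text).replaceAll("");
        text = TAG.matcher(text).replaceAll("");
        return text.trim();
    }

    public static void main(String[] args) {
        // prints "Hi"
        System.out.println(stripTags("<script>alert(1)</script><p>Hi</p>"));
    }
}
```

The behavior is meant to match the original method; only the point at which the patterns are compiled changes.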

2. Stripping HTML tags with Jsoup

Maven dependency (pom.xml):

      <dependency>
            <groupId>org.jsoup</groupId>
            <artifactId>jsoup</artifactId>
            <version>1.12.1</version>
      </dependency>

  
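    import java.util.Objects;
    import org.jsoup.Jsoup;

    /**
     * Strips HTML tags.
     *
     * @param html
     *            the HTML content
     * @return the text with the tags removed
     */
    public static String wipeOffHtmlLabel(String html) {

        // null input yields an empty string
        if (Objects.isNull(html)) {
            return "";
        }

        // If parsing fails, fall back to returning the input unchanged
        try {
            return Jsoup.parse(html).text();
        } catch (Exception e) {
            return html;
        }
    }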

3. Test run


    public static void main(String[] args) {

        // HTML that was cut off in the middle of a tag (unclosed)
        String notClosed = "<p>Damn you</p><p><a href=\"http://www.baidu.com\" rel=\"noopener noreferrer\" target=\"_blank\">Baidu</a><br><br><img alt=\"\" src=\"http://www.baidu.c";

        // HTML whose tags are all properly closed
        String normal = "<p><strong><span style=\"font-size: 18px;\">You silly boy</span></strong></p>";

        // 1. Regular expressions
        String notClosed1 = dislodgeHtmlLabel(notClosed);
        String normal1 = dislodgeHtmlLabel(normal);
        System.out.println(notClosed1);
        System.out.println(normal1);

        System.out.println("*****************************************");
        // 2. Jsoup
        String notClosed2 = wipeOffHtmlLabel(notClosed);
        String normal2 = wipeOffHtmlLabel(normal);
        System.out.println(notClosed2);
        System.out.println(normal2);

    }

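Result: both methods return plain text for the properly closed input. For the truncated input, the regex version still leaves the trailing fragment <img alt="" src="http://www.baidu.c in its output, because that fragment has no closing > for the pattern to match, whereas Jsoup discards it.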

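4. Conclusion

For HTML whose tags are all properly closed, the regular expression approach is recommended: it is lightweight and needs no extra dependency.
For truncated HTML, such as text that was cut off before being read back from a database, Jsoup is the better choice because it also removes unclosed tags.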
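
To decide between the two approaches at runtime, one option is to check whether the input ends inside a tag. The sketch below is only an illustration under that assumption; the class TruncatedHtmlCheck and the helpers endsInsideTag and dropDanglingTag are hypothetical names, not part of the code above. It treats the input as truncated when the last '<' appears after the last '>'.

```java
public class TruncatedHtmlCheck {

    // Hypothetical helper: true when the last '<' comes after the last '>',
    // meaning the input ends inside a tag that was never closed.
    static boolean endsInsideTag(String html) {
        if (html == null) {
            return false;
        }
        return html.lastIndexOf('<') > html.lastIndexOf('>');
    }

    // Hypothetical helper: cuts off the dangling fragment so that a plain
    // "<[^>]+>" regex can handle the rest of the input.
    static String dropDanglingTag(String html) {
        return endsInsideTag(html) ? html.substring(0, html.lastIndexOf('<')) : html;
    }

    public static void main(String[] args) {
        String truncated = "<p>Baidu</p><img alt=\"\" src=\"http://www.baidu.c";
        String complete = "<p><strong>text</strong></p>";

        System.out.println(endsInsideTag(truncated));   // true
        System.out.println(endsInsideTag(complete));    // false
        System.out.println(dropDanglingTag(truncated)); // <p>Baidu</p>
    }
}
```

This check is deliberately naive: a literal '<' in the text content, as in "a < b", would also be reported as an unclosed tag, so a real parser such as Jsoup remains the safer choice for untrusted or messy input.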
