Jsoup Quick Start

Jsoup

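Jsoup is to HTML what XML parsers are to XML: it parses HTML, including messy real-world HTML. Its selector syntax closely resembles jQuery's and is flexible and easy to use for getting the results you need. This tutorial walks through a number of Jsoup examples.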

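What can you do with Jsoup?

  • Scrape and parse HTML from a URL, file, or string
  • Find and extract data, using DOM traversal or CSS selectors
  • Manipulate HTML elements, attributes, and text
  • Clean user-submitted content against a safe whitelist, to prevent XSS attacks
  • Output tidy HTML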

Installation - runtime dependency

You can add the Jsoup jar to your project with the Maven dependency below.

```xml
<dependency>
  <!-- jsoup HTML parser library @ http://jsoup.org/ -->
  <groupId>org.jsoup</groupId>
  <artifactId>jsoup</artifactId>
  <version>1.10.2</version>
</dependency>
```

The main Jsoup classes

Although the full library contains many classes, in most cases the following three are the ones we need to focus on.

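1. The org.jsoup.Jsoup class

The Jsoup class is the entry point of any Jsoup program. It provides methods for loading and parsing HTML documents from a variety of sources.

Some important methods of the Jsoup class are: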

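| Method | Description |
| --- | --- |
| static Connection connect(String url) | Creates and returns a connection to the URL. |
| static Document parse(File in, String charsetName) | Parses the contents of a file, read with the given character set, into a Document. |
| static Document parse(String html) | Parses the given HTML string into a Document. |
| static String clean(String bodyHtml, Whitelist whitelist) | Returns safe HTML from the input HTML, by parsing the input and filtering it through a whitelist of permitted tags and attributes. |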

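2. The org.jsoup.nodes.Document class

This class represents an HTML document loaded through the Jsoup library. Use it for operations that apply to the HTML document as a whole.

For the important methods of the Document class, see http://jsoup.org/apidocs/org/jsoup/nodes/Document.html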
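As a rough illustration of document-level operations, the sketch below parses an inline HTML string and reads or changes properties of the whole document. The class name `DocumentSketch` and the sample markup are assumptions made up for this example, not part of the original tutorial.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class DocumentSketch {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
        "<html><head><title>Old title</title></head>"
        + "<body><p>Hello, <b>jsoup</b>.</p></body></html>";

    static Document load() {
        return Jsoup.parse(HTML);
    }

    public static void main(String[] args) {
        Document document = load();
        System.out.println(document.title());        // text of the <title> element
        System.out.println(document.body().text());  // combined text of <body>
        document.title("New title");                 // replaces the <title> text
        System.out.println(document.head().html());  // inner HTML of <head>
    }
}
```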

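3. The org.jsoup.nodes.Element class

An HTML element consists of a tag name, attributes, and child nodes. With the Element class you can extract data, traverse nodes, and manipulate the HTML.

For the important methods of the Element class, see http://jsoup.org/apidocs/org/jsoup/nodes/Element.html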
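The three uses named above (extracting data, traversing nodes, manipulating HTML) might look roughly like this on a single element. The class name `ElementSketch`, the markup, and the added "Contact" link are hypothetical and chosen only for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ElementSketch {
    // Hypothetical sample markup used only for this illustration.
    static final String HTML =
        "<div id=\"menu\" class=\"nav\">"
        + "<a href=\"/home\">Home</a><a href=\"/about\">About</a></div>";

    static Element menu() {
        Document document = Jsoup.parse(HTML);
        return document.getElementById("menu");
    }

    public static void main(String[] args) {
        Element menu = menu();

        // Extract data: tag name and an attribute value.
        System.out.println(menu.tagName() + " class=" + menu.attr("class"));

        // Traverse: iterate over the direct child elements.
        for (Element child : menu.children()) {
            System.out.println(child.attr("href") + " -> " + child.text());
        }

        // Manipulate: add a class and append a new child element.
        menu.addClass("active");
        menu.appendElement("a").attr("href", "/contact").text("Contact");
        System.out.println(menu.children().size());
    }
}
```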

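Examples

Now let's look at some examples of working with HTML documents through the Jsoup API.

1. Loading a document from a URL

Use the Jsoup.connect() method to load HTML from a URL.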

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.IOException
try
{
    // Fetch the page over HTTP and parse the response into a Document.
    Document document = Jsoup.connect("http://www.yiibai.com").get();
    System.out.println(document.title());
}
catch (IOException e)
{
    e.printStackTrace();
}
```
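Before the request is sent, the Connection returned by Jsoup.connect() can be configured. The sketch below sets a user agent and a timeout; the class name `ConnectSketch`, the agent string, and the 5000 ms value are assumptions for illustration, not recommendations from the original tutorial.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class ConnectSketch {
    // Builds a configured connection; no request is sent yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (compatible; ExampleBot/1.0)") // hypothetical agent string
                .timeout(5000); // timeout in milliseconds
    }

    public static void main(String[] args) {
        try {
            // The HTTP request is executed only when get() is called.
            Document document = prepare("http://www.yiibai.com").get();
            System.out.println(document.title());
        } catch (IOException e) {
            System.err.println("Could not fetch page: " + e.getMessage());
        }
    }
}
```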

2. Loading a document from a file

Use the Jsoup.parse() method to load HTML from a file.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File, java.io.IOException
try
{
    // Read the file as UTF-8 and parse it into a Document.
    Document document = Jsoup.parse(new File("D:/temp/index.html"), "utf-8");
    System.out.println(document.title());
}
catch (IOException e)
{
    e.printStackTrace();
}
```

3. Loading a document from a String

Use the Jsoup.parse() method to load HTML from a string.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document
// Jsoup.parse(String) performs no I/O and declares no checked exception,
// so no try/catch is needed (catching IOException here would not compile).
String html = "<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document document = Jsoup.parse(html);
System.out.println(document.title());
```
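Parsing from a string also works when the input is not a complete, well-formed page. As a small sketch (the class name `LenientParseSketch` and the broken input string are made up for this example), the parser still builds a full document tree and closes the tags left open:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class LenientParseSketch {
    static Document parseFragment() {
        // Hypothetical input: no <html>/<body> wrapper and two unclosed tags.
        return Jsoup.parse("<p>Unclosed <b>bold");
    }

    public static void main(String[] args) {
        Document document = parseFragment();
        // The body now contains a properly closed <p> and <b>.
        System.out.println(document.body().html());
        // Selectors work on the repaired tree as usual.
        System.out.println(document.select("p > b").text());
    }
}
```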

4. Getting the title from HTML

As in the examples above, call the document.title() method to get the title of the HTML page.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File, java.io.IOException
try
{
    Document document = Jsoup.parse(new File("C:/Users/xyz/Desktop/yiibai-index.html"), "utf-8");
    // title() returns the text of the <title> element.
    System.out.println(document.title());
}
catch (IOException e)
{
    e.printStackTrace();
}
```

5. Getting the favicon of an HTML page

Assuming the favicon is the first image referenced in the <head> section of the HTML document, you can use the code below.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           java.io.File, java.io.IOException
String favImage = "Not Found";
try {
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    // First choice: a <link> in <head> whose href points at an .ico or .png file.
    Element element = document.head().select("link[href~=.*\\.(ico|png)]").first();
    if (element == null)
    {
        // Fallback: a <meta itemprop="image"> tag, whose URL is in "content".
        element = document.head().select("meta[itemprop=image]").first();
        if (element != null)
        {
            favImage = element.attr("content");
        }
    }
    else
    {
        favImage = element.attr("href");
    }
}
catch (IOException e)
{
    e.printStackTrace();
}
System.out.println(favImage);
```
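The "first image in the head" assumption can pick up unrelated links. One possible alternative, shown here only as a sketch, is to select the link by its rel attribute instead. The class name `FaviconSketch`, the `link[rel*=icon]` selector choice, and the sample markup are assumptions for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FaviconSketch {
    // Returns the href of the first <link> in <head> whose rel contains "icon",
    // or "Not Found" when there is no such link.
    static String favicon(String html) {
        Document document = Jsoup.parse(html);
        Element link = document.head().select("link[rel*=icon]").first();
        return link != null ? link.attr("href") : "Not Found";
    }

    public static void main(String[] args) {
        // Hypothetical sample markup used only for this illustration.
        String html = "<html><head>"
            + "<link rel=\"stylesheet\" href=\"/site.css\">"
            + "<link rel=\"shortcut icon\" href=\"/favicon.ico\">"
            + "<title>Demo</title></head><body></body></html>";
        System.out.println(favicon(html));
    }
}
```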

6. Getting all the links in an HTML page

To get all the links in a web page, use the code below.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           org.jsoup.select.Elements, java.io.File, java.io.IOException
try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    // Select every <a> element that has an href attribute.
    Elements links = document.select("a[href]");
    for (Element link : links)
    {
        System.out.println("link : " + link.attr("href"));
        System.out.println("text : " + link.text());
    }
}
catch (IOException e)
{
    e.printStackTrace();
}
```
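Links collected this way are returned exactly as written in the page, so relative hrefs stay relative. As a sketch under the assumption that the page's base address is known, passing it as the second argument of Jsoup.parse() lets absUrl() resolve them. The class name `AbsoluteLinksSketch`, the example.com addresses, and the markup are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class AbsoluteLinksSketch {
    // Parses html relative to baseUri and returns every link as an absolute URL.
    static List<String> absoluteLinks(String html, String baseUri) {
        Document document = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        for (Element link : document.select("a[href]")) {
            urls.add(link.absUrl("href")); // equivalent to link.attr("abs:href")
        }
        return urls;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup and base address.
        String html = "<a href=\"/about\">About</a>"
            + "<a href=\"http://example.org/x\">External</a>";
        for (String url : absoluteLinks(html, "http://example.com/")) {
            System.out.println(url);
        }
    }
}
```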

7. Getting all the images in an HTML page

To get all the images displayed in a web page, use the code below.

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           org.jsoup.select.Elements, java.io.File, java.io.IOException
try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");
    // Match <img> tags whose src contains .png, .jpg, .jpeg, or .gif (case-insensitive).
    Elements images = document.select("img[src~=(?i)\\.(png|jpe?g|gif)]");
    for (Element image : images)
    {
        System.out.println("src : " + image.attr("src"));
        System.out.println("height : " + image.attr("height"));
        System.out.println("width : " + image.attr("width"));
        System.out.println("alt : " + image.attr("alt"));
    }
}
catch (IOException e)
{
    e.printStackTrace();
}
```

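8. Getting the meta information of a URL

Meta information is what search engines such as Google use to determine the content of a page for indexing purposes. It is stored as tags in the HEAD section of the HTML page. To get the meta information of a web page, use the code below.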

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, java.io.File, java.io.IOException
try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai-index.html"), "utf-8");

    // get(0) throws IndexOutOfBoundsException if the page has no such <meta> tag.
    String description = document.select("meta[name=description]").get(0).attr("content");
    System.out.println("Meta description : " + description);

    // first() returns null if nothing matches, so this line can throw NullPointerException.
    String keywords = document.select("meta[name=keywords]").first().attr("content");
    System.out.println("Meta keyword : " + keywords);
}
catch (IOException e)
{
    e.printStackTrace();
}
```
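Pages do not always carry both tags. A more defensive variant, sketched here with a hypothetical helper name `metaContent` and made-up markup, relies on Elements.attr(), which returns an empty string when the selection is empty instead of throwing:

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class MetaSketch {
    // Returns the content of <meta name="..."> or "" when the tag is absent.
    static String metaContent(Document document, String name) {
        return document.select("meta[name=" + name + "]").attr("content");
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: a description tag but no keywords tag.
        Document document = Jsoup.parse(
            "<html><head><meta name=\"description\" content=\"A demo page\">"
            + "</head><body></body></html>");
        System.out.println("Meta description : " + metaContent(document, "description"));
        System.out.println("Meta keyword : " + metaContent(document, "keywords"));
    }
}
```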

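9. Getting form attributes in an HTML page

Getting the form input elements of a web page is very simple. Find the FORM element by its unique id, then find all the INPUT elements inside that form.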

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.nodes.Element,
//           org.jsoup.select.Elements, java.io.File
// Jsoup.parse(File, String) throws IOException; handle or declare it in the enclosing method.
Document doc = Jsoup.parse(new File("c:/temp/yiibai-index.html"), "utf-8");
// getElementById returns null if the page has no element with this id.
Element formElement = doc.getElementById("loginForm");

Elements inputElements = formElement.getElementsByTag("input");
for (Element inputElement : inputElements) {
    String key = inputElement.attr("name");
    String value = inputElement.attr("value");
    System.out.println("Param name: " + key + " \nParam value: " + value);
}
```
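A self-contained variant of the same lookup, with a guard for a missing form, might look like the sketch below. The class name `FormSketch`, the helper `formParams`, and the sample form are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class FormSketch {
    // Collects name/value pairs of the <input> elements inside the form with the given id.
    // Returns an empty map when no element has that id.
    static Map<String, String> formParams(String html, String formId) {
        Map<String, String> params = new LinkedHashMap<>();
        Document doc = Jsoup.parse(html);
        Element form = doc.getElementById(formId);
        if (form == null) {
            return params;
        }
        for (Element input : form.getElementsByTag("input")) {
            String name = input.attr("name");
            if (!name.isEmpty()) { // skip inputs without a name, such as a bare submit button
                params.put(name, input.attr("value"));
            }
        }
        return params;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup used only for this illustration.
        String html = "<form id=\"loginForm\">"
            + "<input name=\"user\" value=\"alice\">"
            + "<input type=\"hidden\" name=\"token\" value=\"abc\">"
            + "<input type=\"submit\"></form>";
        System.out.println(formParams(html, "loginForm"));
    }
}
```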

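10. Updating the attributes or content of elements

Once you have found the elements you want with the methods above, you can use the Jsoup API to update their attributes or inner HTML. For example, suppose you want to set rel="nofollow" on every link in the document.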

```java
// Requires: org.jsoup.Jsoup, org.jsoup.nodes.Document, org.jsoup.select.Elements,
//           java.io.File, java.io.IOException
try
{
    Document document = Jsoup.parse(new File("C:/Users/zkpkhua/Desktop/yiibai.com.html"), "utf-8");
    Elements links = document.select("a[href]");
    // Sets rel="nofollow" on every selected link; the change is made in the in-memory Document.
    links.attr("rel", "nofollow");
}
catch (IOException e)
{
    e.printStackTrace();
}
```
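To see the effect of such an update, the sketch below applies it to an inline string and prints the result; it also shows replacing the inner HTML of an element. The class name `UpdateSketch` and the sample markup are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class UpdateSketch {
    // Parses html and sets rel="nofollow" on every <a> that has an href.
    static Document addNofollow(String html) {
        Document document = Jsoup.parse(html);
        document.select("a[href]").attr("rel", "nofollow");
        return document;
    }

    public static void main(String[] args) {
        // Hypothetical sample markup: two links and one named anchor without href.
        Document document = addNofollow(
            "<p><a href=\"/one\">One</a> <a href=\"/two\">Two</a> <a name=\"top\">Top</a></p>");
        System.out.println(document.body().html());

        // Replace the inner HTML of every <p> element.
        document.select("p").html("<em>Updated</em>");
        System.out.println(document.body().html());
    }
}
```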

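11. Sanitizing untrusted HTML (to prevent XSS)

Suppose your application displays HTML snippets submitted by users, for example HTML typed into a comment box. Displaying that HTML directly, without cleaning it first, can cause serious problems: a user can embed a malicious script that redirects your visitors to another, malicious site.

To clean such HTML, Jsoup provides the Jsoup.clean() method. It takes a string of HTML and returns cleaned HTML. To do this, Jsoup uses a whitelist sanitizer, which parses the input HTML (in a safe, sandboxed environment), then walks the parse tree and lets only known-safe tags and attributes (and values) through to the cleaned output.

It does not use regular expressions, which are inappropriate for this task.

The cleaner is useful not only for avoiding XSS but also for limiting the range of elements a user can provide: you may be fine with textual elements such as a and strong, but not with structural elements such as div or table.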

```java
// Requires: org.jsoup.Jsoup, org.jsoup.safety.Whitelist
String dirtyHTML = "<p><a href='http://www.yiibai.com/' onclick='sendCookiesToMe()'>Link</a></p>";

// Whitelist.basic() allows simple text formatting and links, and drops the onclick handler.
String cleanHTML = Jsoup.clean(dirtyHTML, Whitelist.basic());

System.out.println(cleanHTML);
```

Running it produces the following output:

```html
<p><a href="http://www.yiibai.com/" rel="nofollow">Link</a></p>
```
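To limit the range of permitted elements further, a whitelist can also be built up from an empty one. The sketch below assumes the jsoup 1.10.2 API used in this tutorial (later jsoup releases renamed this class to Safelist); the class name `CleanSketch`, the chosen tag list, and the sample input are hypothetical.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

class CleanSketch {
    // Allows only a few text-level tags; everything else is stripped.
    static String cleanComment(String dirtyHtml) {
        Whitelist allowed = Whitelist.none().addTags("p", "b", "strong", "em");
        return Jsoup.clean(dirtyHtml, allowed);
    }

    public static void main(String[] args) {
        // Hypothetical user input with a structural <div> and a <script>.
        String dirty = "<div><p>Hello <strong>world</strong>"
            + "<script>alert(1)</script></p></div>";
        // The <div> wrapper and the <script> are removed; <p> and <strong> survive.
        System.out.println(cleanComment(dirty));
    }
}
```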