Java jsoup: handling the "\n" that is added by default when parsing a string

Original article

2020-06-19 04:11

import java.util.regex.Pattern;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupNewlineDemo {

	// Matches opening and closing <a> tags only (the \b keeps tags such as <abbr> from matching).
	private static final Pattern ANCHOR_TAG = Pattern.compile("</?a\\b[^>]*>", Pattern.CASE_INSENSITIVE);

	public static void main(String[] args) {
		String str="<p>供應t1紫<a href='http://www.cncu.cn' target='_blank'>銅</a>板、<a href='http://www.cncu.cn/product/tjthj_ct_zt/' target='_blank'>紫銅</a>帶，質量很優質，歡迎新老客戶前來採購。</p><p><a href='http://www.cncu.cn/product/tjthj_qt/ ' target='_blank'>紫銅帶</a>用途：高純度，組織細密，含氧量極低。無氣孔、沙眼、疏鬆，導電性能極佳，電蝕出的模具表面精度高，經熱處理工藝，電極無方向性，適合精打，細打，具有良好的熱電道性、加工性、延展性、防蝕性及耐候性等。有良好的導電、導熱、耐蝕和加工性能,可以焊接和釺焊 的<br /></p>";
		System.out.println("--------------------------------");
		System.out.println(removeImgAContent(str));
	}

	/**
	 * Strips <a> tags from the alt and title attributes of every <img>,
	 * then returns the re-serialized body HTML.
	 */
	public static String removeImgAContent(String initContent) {
		Document doc = Jsoup.parseBodyFragment(initContent);
		for (Element image : doc.select("img")) {
			image.attr("alt", ANCHOR_TAG.matcher(image.attr("alt")).replaceAll(""));
			image.attr("title", ANCHOR_TAG.matcher(image.attr("title")).replaceAll(""));
		}
		// The body is serialized back to a string here; this output is where the extra "\n" shows up.
		return doc.select("body").html();
	}
}

As the example above shows, the HTML that jsoup generates after parsing contains an extra "\n" by default.

Compare the two screenshots.
Before parsing:

After parsing:

Notice the extra "\n" in the second one.
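A line break is easy to miss in console output, so one way to make the comparison explicit is to print both strings with their line breaks escaped. The following is a minimal sketch using only the JDK; the class name `VisibleNewlines`, the helper `showNewlines`, and both sample strings are made up for illustration and are not captured jsoup output.

```java
class VisibleNewlines {

    // Hypothetical helper: turns real line breaks into the visible two-character
    // sequences \r and \n so that they stand out when printed.
    static String showNewlines(String s) {
        return s.replace("\r", "\\r").replace("\n", "\\n");
    }

    public static void main(String[] args) {
        String before = "<p>text</p>";    // illustrative input string
        String after = "<p>\n text</p>";  // illustrative stand-in for re-serialized output
        System.out.println(showNewlines(before));
        System.out.println(showNewlines(after));
    }
}
```

Running the same helper on the input string and on the value returned by the method above should show exactly where the line break was inserted.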
After stepping through the jsoup source in a debugger, I found that it comes from this method:


package org.jsoup.nodes;
....
void outerHtmlHead(StringBuilder accum, int depth, Document.OutputSettings out) {
        String html = Entities.escape(getWholeText(), out);
        if (out.prettyPrint() && parent() instanceof Element && !Element.preserveWhitespace((Element) parent())) {
            html = normaliseWhitespace(html);
        }

        if (out.prettyPrint() && ((siblingIndex() == 0 && parentNode instanceof Element && ((Element) parentNode).tag().formatAsBlock() && !isBlank()) || (out.outline() && siblingNodes().size()>0 && !isBlank()) ))
            indent(accum, depth, out);
        accum.append(html);
    }

This is the indent method that the method above calls:


/**
     * 
     * @param accum
     * @param depth
     * @param out
     */
    protected void indent(StringBuilder accum, int depth, Document.OutputSettings out) {
           	accum.append("\n").append(StringUtil.padding(depth * out.indentAmount()));
    }

This is the method that adds the "\n".
I then looked at the condition that guards the call:

 if (out.prettyPrint() && ((siblingIndex() == 0 && parentNode instanceof Element && ((Element) parentNode).tag().formatAsBlock() && !isBlank()) || (out.outline() && siblingNodes().size()>0 && !isBlank()) ))

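out.prettyPrint() tells whether the output should be pretty-printed (formatted). Looking at its implementation, it is true by default.
siblingIndex() == 0 checks that this text node is the first child of its parent.

parentNode instanceof Element checks that the parent node is an Element, and the tag().formatAsBlock() call that follows it checks that the parent's tag is formatted as a block.

!isBlank() checks that the text is not blank. The alternative branch after || only applies when out.outline() is on.

In my example all of these conditions are true, so indent is called.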
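To make the condition easier to follow, here is a small stand-alone model of it, using only the JDK. It is a sketch and not jsoup code: the class `IndentModel` and its parameters are hypothetical stand-ins, with each jsoup call in the quoted condition replaced by a plain boolean or int.

```java
class IndentModel {

    // Simplified model of the guard in outerHtmlHead. Each argument stands in
    // for the corresponding call in the quoted jsoup condition.
    static boolean shouldIndent(boolean prettyPrint, int siblingIndex,
                                boolean parentIsElement, boolean parentFormatAsBlock,
                                boolean blank, boolean outline, int siblingCount) {
        boolean firstChildOfBlock =
                siblingIndex == 0 && parentIsElement && parentFormatAsBlock && !blank;
        boolean outlineCase = outline && siblingCount > 0 && !blank;
        return prettyPrint && (firstChildOfBlock || outlineCase);
    }

    // Models indent(): a newline followed by depth * indentAmount spaces.
    static String indent(int depth, int indentAmount) {
        StringBuilder sb = new StringBuilder("\n");
        for (int i = 0; i < depth * indentAmount; i++) {
            sb.append(' ');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Non-blank first text child of a block-formatted parent, pretty-print on.
        System.out.println(shouldIndent(true, 0, true, true, false, false, 1));  // true
        // Same node with pretty-print off: the guard fails immediately.
        System.out.println(shouldIndent(false, 0, true, true, false, false, 1)); // false
    }
}
```

The model shows that every path into indent requires pretty-printing to be on, which is why the non-blank first text child of a block element picks up a leading line break.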

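To stop jsoup from adding the "\n" to my string by default,
my workaround was to comment out the body of the indent method so that it becomes an empty shell.

This is not a perfect fix, but it solves the problem.

The ideal solution would be to override the indent method.

Unfortunately I have not worked out how to override it.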
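Since the condition quoted earlier starts with out.prettyPrint(), another option that avoids patching the library is to turn pretty-printing off on the document's output settings before serializing. The sketch below assumes the jsoup jar is on the classpath; the class name `PrettyPrintOff`, the method name, and the sample fragment are made up for illustration.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class PrettyPrintOff {

    // Parses an HTML fragment and serializes the body with pretty-printing off,
    // so the prettyPrint() check in the guard fails and indent is never reached.
    static String bodyHtmlWithoutPrettyPrint(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.outputSettings().prettyPrint(false);
        return doc.body().html();
    }

    public static void main(String[] args) {
        // Illustrative input; it contains no line breaks of its own.
        String html = "<p>Hello <a href='http://www.cncu.cn' target='_blank'>world</a>!</p>";
        System.out.println(bodyHtmlWithoutPrettyPrint(html));
    }
}
```

Note that this switches off all indentation and whitespace normalization for the document, not only this one line break, so any line breaks already present in the input are kept as they are.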

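By the way, I downloaded the jsoup source from the official site and unpacked it into my project to study it.
The jsoup source package can be downloaded from the official site: http://jsoup.org/download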
