Processing XML and HTML with Regular Expressions

<tr>
<td>5345454354</td><td>2010-3-29 13:48:33</td><td>周杰倫</td>
</tr>
<tr>
<td>6565465466</td><td>2010-3-29 15:34:38</td><td>張學友</td>
</tr>
<tr>
<td>6546546546</td><td>2010-3-30 19:30:50</td><td>劉德華</td>
</tr>
<tr>
<td>9875646545</td><td>2010-3-31 2:20:58</td><td>郭富城</td>
</tr>
<tr>
<td>7868768768</td><td>2010-3-31 8:03:11</td><td>梁朝偉</td>
</tr>

To extract the content between the <td> and </td> tags, you can match it like this:

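<td>(.*?)</td>

// Approach 1: capture group. str holds the HTML shown above.
string str = "..........";
string pstr = "<td>(.*?)</td>";
MatchCollection mc = Regex.Matches(str, pstr);
for (int i = 0; i < mc.Count; i++)
{
    // $1 is the text captured by (.*?), i.e. the cell content
    Response.Write(mc[i].Result("$1"));
}

// Approach 2: lookaround, so the whole match is already the cell content.
MatchCollection mc2 = Regex.Matches(str, @"(?is)(?<=<td>).+?(?=</td>)");
foreach (Match m in mc2)
{
    //Response.Write(m.Value); // web
    MessageBox.Show(m.Value);  // desktop
}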

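Explanation of the expressions

  • (?<=Expression) positive lookbehind: the text to the left of the current position must match Expression
  • (?<!Expression) negative lookbehind: the text to the left of the current position must not match Expression
  • (?=Expression) positive lookahead: the text to the right of the current position must match Expression
  • (?!Expression) negative lookahead: the text to the right of the current position must not match Expression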
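
To make the four lookaround forms concrete, here is a minimal Java sketch. The class name `LookaroundDemo`, the helper `findAll`, and the sample string are made up for illustration; only `java.util.regex` from the JDK is used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: shows what each lookaround form selects.
class LookaroundDemo {
    // Collect every match of regex in input, in order of appearance.
    static List<String> findAll(String regex, String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String s = "$10 20 $30 40%";
        // Positive lookbehind: digits directly preceded by '$'
        System.out.println(findAll("(?<=\\$)\\d+", s));      // [10, 30]
        // Negative lookbehind: digits not preceded by '$' or another digit
        System.out.println(findAll("(?<![$\\d])\\d+", s));   // [20, 40]
        // Positive lookahead: digits directly followed by '%'
        System.out.println(findAll("\\d+(?=%)", s));         // [40]
        // Negative lookahead: digits not followed by '%' or another digit
        System.out.println(findAll("\\d+(?![%\\d])", s));    // [10, 20, 30]
    }
}
```

In every case the `$` or `%` is only tested, never consumed, so it does not appear in the result.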
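(?is)(?<=<td>).+?(?=</td>)
  • (?is) inline mode modifiers: i ignores case; s is single-line mode, in which . also matches line breaks
  • (?<=<td>) positive lookbehind: the match must be immediately preceded by <td>, but <td> itself is not consumed, so it is not part of the result
  • .+? any character, one or more times, lazily: after each character it consumes, the engine tries the rest of the expression and consumes another character only if that attempt fails, so the match stops at the first </td>
  • (?=</td>) positive lookahead: the match must be immediately followed by </td>, but </td> itself is not consumed, so it is not part of the result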
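
The effect of each piece of this pattern can be checked with a small Java sketch. It is only an illustration: the class `TdExtractor`, its constants, and the shortened sample row (with an upper-case tag and a line break added on purpose) are assumptions, not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing three variants of the <td> pattern.
class TdExtractor {
    // The pattern discussed above: case-insensitive, dot matches line breaks, lazy.
    static final Pattern LAZY = Pattern.compile("(?is)(?<=<td>).+?(?=</td>)");
    // Same pattern with a greedy quantifier.
    static final Pattern GREEDY = Pattern.compile("(?is)(?<=<td>).+(?=</td>)");
    // Same pattern without the (?is) modifiers.
    static final Pattern NO_FLAGS = Pattern.compile("(?<=<td>).+?(?=</td>)");

    // One row with an upper-case tag pair and a cell that spans two lines.
    static final String SAMPLE =
        "<tr>\n<TD>5345454354</TD><td>2010-3-29\n13:48:33</td>\n</tr>";

    // Return every match of p in html.
    static List<String> cells(Pattern p, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = p.matcher(html);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(cells(LAZY, SAMPLE).size());     // 2: one match per cell
        System.out.println(cells(GREEDY, SAMPLE).size());   // 1: runs to the last </td>
        System.out.println(cells(NO_FLAGS, SAMPLE).size()); // 0: <TD> and the line break both block it
    }
}
```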

Is extracting XML content with regex 50 times faster than dom4j?

long t1 = System.nanoTime();
String str = "<xml><ToUserName><![CDATA[gh_520f99dff7cc]]></ToUserName><FromUserName><![CDATA[oBAMOs3aZB0dkbILsBR1wksbmli4]]></FromUserName><CreateTime>1416900555</CreateTime><MsgType><![CDATA[event]]></MsgType><Event><![CDATA[MASSSENDJOBFINISH]]></Event><MsgID>2348714844</MsgID><Status><![CDATA[send success]]></Status><TotalCount>1</TotalCount><FilterCount>1</FilterCount><SentCount>1</SentCount><ErrorCount>0</ErrorCount></xml>";
//			Document doc = null;
//			try {
//				doc = DocumentHelper.parseText(str);
//			} catch (DocumentException e) {
//				log.error("Failed to parse mass-send XML: "+e.getMessage(), e);
//			}
//			
//			Element root = doc.getRootElement();
//			String msgid = root.elementTextTrim("MsgID");
//			String Status = root.elementTextTrim("Status");
//			String TotalCount = root.elementTextTrim("TotalCount");
//			String FilterCount = root.elementTextTrim("FilterCount");
//			String SentCount = root.elementTextTrim("SentCount");
//			String ErrorCount = root.elementTextTrim("ErrorCount");
			String msgid = RegExp.getString(str,
					"(?<=<MsgID>)[\\s\\S]*?(?=</MsgID>)").trim();
			String Status = RegExp.getString(str,
				"(?<=<Status><!\\[CDATA\\[)[\\s\\S]*?(?=\\]\\]></Status>)")
				.trim();
			String TotalCount = RegExp.getString(str,
				"(?<=<TotalCount>)[\\s\\S]*?(?=</TotalCount>)")
				.trim();
			String FilterCount = RegExp.getString(str,
				"(?<=<FilterCount>)[\\s\\S]*?(?=</FilterCount>)")
				.trim();
			String SentCount = RegExp.getString(str,
				"(?<=<SentCount>)[\\s\\S]*?(?=</SentCount>)")
				.trim();
			String ErrorCount = RegExp.getString(str,
				"(?<=<ErrorCount>)[\\s\\S]*?(?=</ErrorCount>)")
				.trim();
			long t2 = System.nanoTime();
			log.info(t2-t1);            // elapsed time in nanoseconds
			log.info((t2-t1)*0.000001); // elapsed time in milliseconds
			log.info(msgid+", "+Status+", "+TotalCount+", "+FilterCount+", "+SentCount+", "+ErrorCount);

dom4j result:

2014-11-26 15:25:29,716 INFO [Test] 70 - <220279310>
2014-11-26 15:25:29,719 INFO [Test] 71 - <220.27930999999998> <== look here (milliseconds)
2014-11-26 15:25:29,719 INFO [Test] 72 - <2348714844, send success, 1, 1, 1, 0>

Regex result:

2014-11-26 15:28:08,575 INFO [Test] 70 - <4633684>
2014-11-26 15:28:08,578 INFO [Test] 71 - <4.633684> <== look here (milliseconds)
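2014-11-26 15:28:08,578 INFO [Test] 72 - <2348714844</MsgID>, <![CDATA[send success]]></Status>, 1</TotalCount>, 1</FilterCount>, 1</SentCount>, 0</ErrorCount>>

Note: the values in this last line still carry closing tags and the CDATA wrapper, which the lookaround patterns shown above would not return, so this log was presumably produced with an earlier version of the patterns.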
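
The field extraction used in this comparison can also be written with one capture-group pattern that accepts both plain text and a CDATA section. The following is a minimal sketch under that assumption, not the code that produced the timings above; the class `XmlFieldExtractor`, the method `field`, and the shortened sample message are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls one flat element out of a small XML message.
class XmlFieldExtractor {
    // Shortened version of the message used in the benchmark above.
    static final String SAMPLE =
        "<xml><MsgType><![CDATA[event]]></MsgType><MsgID>2348714844</MsgID>"
        + "<Status><![CDATA[send success]]></Status><TotalCount>1</TotalCount></xml>";

    // Return the trimmed text of the first <tag>...</tag> element, unwrapping an
    // optional CDATA section. Returns "" when the element is not found.
    static String field(String xml, String tag) {
        String t = Pattern.quote(tag); // treat the tag name literally
        Pattern p = Pattern.compile(
            "<" + t + ">\\s*(?:<!\\[CDATA\\[(.*?)\\]\\]>|([^<]*))\\s*</" + t + ">",
            Pattern.DOTALL);
        Matcher m = p.matcher(xml);
        if (!m.find()) {
            return "";
        }
        // Group 1 is the CDATA body, group 2 the plain-text body.
        String value = m.group(1) != null ? m.group(1) : m.group(2);
        return value.trim();
    }

    public static void main(String[] args) {
        System.out.println(field(SAMPLE, "MsgID"));  // 2348714844
        System.out.println(field(SAMPLE, "Status")); // send success
    }
}
```

This only works for flat messages like the one above. It does not handle attributes, nested elements with the same name, or entity references, which is part of what a real parser such as dom4j spends its time on.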

Regex helper code:

import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegExp {
    public static ArrayList<String> getStrs(String source, String regex) {
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(source);
        ArrayList<String> list = new ArrayList<>();

        while (m.find()) {
            list.add(source.substring(m.start(), m.end()));
        }

        return list;
    }

    public static String getString(String source, String regex) {
        ArrayList<String> list = getStrs(source, regex);

        if (list.size() > 0) {
            return list.get(0);
        }

        return "";
    }

    public static ArrayList<String> getStrs(String source, String beginStr,
        String endStr, boolean isLong) {
        if (isLong) {
            return getStrs(source,
                "(?<=" + replay(beginStr) + ")[\\s\\S]*(?=" + replay(endStr) +
                ")");
        }

        return getStrs(source,
            "(?<=" + replay(beginStr) + ")[\\s\\S]*?(?=" + replay(endStr) +
            ")");
    }

    public static String getString(String source, String beginStr,
        String endStr, boolean isLong) {
        if (isLong) {
            return getString(source,
                "(?<=" + replay(beginStr) + ")[\\s\\S]*(?=" + replay(endStr) +
                ")");
        }

        return getString(source,
            "(?<=" + replay(beginStr) + ")[\\s\\S]*?(?=" + replay(endStr) +
            ")");
    }

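    // Escape regex metacharacters so that source is matched literally.
    private static String replay(String source) {
        String result = "";
        // Backslashes must be escaped first, and every later step must work on
        // result (not source), otherwise the earlier escaping is thrown away.
        result = source.replace("\\", "\\\\");
        result = result.replace(".", "\\.");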
        result = result.replace("(", "\\(");
        result = result.replace(")", "\\)");
        result = result.replace("[", "\\[");
        result = result.replace("]", "\\]");
        result = result.replace("{", "\\{");
        result = result.replace("}", "\\}");
        result = result.replace("$", "\\$");
        result = result.replace("?", "\\?");
        result = result.replace("&", "\\&");
        result = result.replace("*", "\\*");
        result = result.replace("!", "\\!");
        result = result.replace("^", "\\^");
        result = result.replace("+", "\\+");
        result = result.replace("#", "\\#");

        return result;
    }
}
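
The hand-written escaping in `replay` does not cover every metacharacter (for example `|` is missing). As an alternative sketch, the JDK's `Pattern.quote` can quote the delimiters, and a capture group can replace the lookarounds. This is a different technique from the class above, and the class `Between` and its method `first` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical alternative to RegExp.getString(source, beginStr, endStr, isLong).
class Between {
    // Return the text between the first begin delimiter and an end delimiter.
    // greedy = true extends to the last end delimiter, false stops at the first.
    // Returns "" when nothing matches.
    static String first(String source, String begin, String end, boolean greedy) {
        String body = greedy ? "(.*)" : "(.*?)";
        // Pattern.quote makes both delimiters literal, whatever they contain.
        Pattern p = Pattern.compile(
            Pattern.quote(begin) + body + Pattern.quote(end), Pattern.DOTALL);
        Matcher m = p.matcher(source);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        String s = "a[1.5|x]b[2]";
        System.out.println(first(s, "[", "]", false)); // 1.5|x
        System.out.println(first(s, "[", "]", true));  // 1.5|x]b[2
    }
}
```

Unlike the lookaround version, the delimiters are consumed here, so repeated matching over shared delimiters would behave differently. For a single first match the result is the same.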
