Regular Expressions in Java Web Crawlers (Notes)

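I have recently been writing a small web crawler in Java and ran into quite a few small problems along the way, so I plan to write down these problems and how I solved them.

This is the first installment. What we usually care about is the content inside a page: once a page has been fetched, what we actually need is to extract the parts that are relevant to our topic.

When a Java crawler extracts page content, regular expressions are often used to pick out only the parts we want. In computer science, a regular expression is a single string that describes or matches a set of strings conforming to certain syntactic rules. In most cases regular expressions are used to search for and/or replace text that matches a pattern. Many programming languages support string operations based on regular expressions.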

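The main uses of regular expressions in a web crawler are:

1. Filtering URL links, extracting only those that match a specific format;

2. Extracting page content.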
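To make the first use case concrete, here is a minimal sketch of URL filtering. The pattern, the example.com domain, and the UrlFilterDemo class name are my own assumptions for illustration; a real crawler would use whatever link format its target site follows.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class UrlFilterDemo {
    // Hypothetical rule: keep only article pages such as
    // http://example.com/article/123.html (http or https).
    private static final Pattern ARTICLE_URL =
            Pattern.compile("https?://example\\.com/article/\\d+\\.html");

    // Returns the URLs whose whole text matches the pattern, in input order.
    static List<String> filter(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (ARTICLE_URL.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://example.com/article/42.html",
                "http://example.com/about.html",
                "https://example.com/article/7.html");
        System.out.println(filter(urls));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it for every link, which matters when a crawler checks thousands of URLs.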

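I won't go into the mathematical theory behind regular expressions here.

Java regular expressions: the java.util.regex package has provided regular expression support since JDK 1.4. In a crawler it is used most often for finding specific content, removing specific content, replacing text, and extracting substrings. Examples of each follow.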

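Searching:

Pattern pattern = Pattern.compile("Java.*"); // "Java" followed by any characters
Matcher matcher = pattern.matcher("Java is a programming language");
boolean b = matcher.matches(); // matches() requires the whole input to match the pattern
System.out.println(b); // true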
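The search example above uses matches(), which only succeeds when the entire input matches. Matcher also offers lookingAt() and find(), and the difference is easy to trip over, so here is a small sketch comparing the three. The MatchModesDemo class and its modes helper are names I made up for this note.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

class MatchModesDemo {
    // Returns {matches, lookingAt, find} for the given regex and input.
    // A fresh Matcher is used for each call so no state carries over.
    static boolean[] modes(String regex, String input) {
        Pattern pattern = Pattern.compile(regex);
        boolean whole = pattern.matcher(input).matches();    // entire input must match
        boolean prefix = pattern.matcher(input).lookingAt(); // match must start at index 0
        boolean anywhere = pattern.matcher(input).find();    // match may start anywhere
        return new boolean[] {whole, prefix, anywhere};
    }

    public static void main(String[] args) {
        // prints [false, true, true]
        System.out.println(Arrays.toString(modes("Java", "Java is a programming language")));
        // prints [false, false, true]
        System.out.println(Arrays.toString(modes("Java", "I like Java")));
    }
}
```

For scanning page content, find() is usually the one you want; matches() fits validation, where the whole string has to conform.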

Splitting a string on multiple delimiters:

Pattern pattern = Pattern.compile("[,|]+"); // one or more commas or pipes
String[] strs = pattern.split("Java Hello World Java ,Hello,,World|Sun");
for (int i = 0; i < strs.length; i++) {
    System.out.println(strs[i]);
}

Text replacement (first occurrence):

Pattern pattern = Pattern.compile("regular expression");
Matcher matcher = pattern.matcher("regular expression hello world regular expression hello world!!");
System.out.println(matcher.replaceFirst("Java")); // replaces only the first match

Text replacement (all occurrences):

Pattern pattern = Pattern.compile("regular expression");
Matcher matcher = pattern.matcher("regular expression hello world regular expression hello world!!");
System.out.println(matcher.replaceAll("Java")); // replaces every match

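Text replacement (match by match):

Pattern pattern = Pattern.compile("regular expression");
Matcher matcher = pattern.matcher("regular expression hello world regular expression hello world!!");
StringBuffer sbr = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sbr, "Java"); // replace the current match only
}
matcher.appendTail(sbr); // copy the text after the last match
System.out.println(sbr.toString());

// Difference between replaceAll() and appendReplacement(): replaceAll() finds and replaces every match in a single call, whereas appendReplacement() is called once per match inside a find() loop, so matches are replaced one at a time.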
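The find() loop above earns its keep when each match needs a different replacement, which a single fixed replacement string cannot express. The sketch below numbers each occurrence; the AppendReplacementDemo class and numberMatches helper are hypothetical names of mine, not part of any library.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class AppendReplacementDemo {
    // Appends "#1", "#2", ... to successive matches of regex in input.
    static String numberMatches(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        StringBuffer out = new StringBuffer();
        int count = 0;
        while (matcher.find()) {
            count++;
            String replacement = matcher.group() + "#" + count;
            // quoteReplacement keeps any '$' or '\' in the computed text
            // from being read as a group reference or escape.
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // prints Java#1 hello Java#2 world
        System.out.println(numberMatches("Java", "Java hello Java world"));
    }
}
```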

Checking whether a string is an email address:

String str = "user@example.com"; // placeholder address for illustration
Pattern pattern = Pattern.compile("[\\w\\.\\-]+@([\\w\\-]+\\.)+[\\w\\-]+", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches());

Removing HTML tags:

Pattern pattern = Pattern.compile("<.+?>", Pattern.DOTALL); // DOTALL lets "." also match line breaks inside a tag
Matcher matcher = pattern.matcher("<a href=\"index.html\">Home</a>");
String string = matcher.replaceAll("");
System.out.println(string); // Home

Extracting http:// addresses:

Pattern pattern = Pattern.compile("(http://|https://){1}[\\w\\.\\-/:]+");
Matcher matcher = pattern.matcher("dsdsds<http://dsds//gfgffdfd>fdf");
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    buffer.append(matcher.group());
    buffer.append("\r\n");
}
System.out.println(buffer.toString()); // print once, after all matches have been collected

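Finding strings in HTML that match a condition:

Pattern pattern = Pattern.compile("href=\"(.+?)\"");
Matcher matcher = pattern.matcher("<a href=\"index.html\">Home</a>");
while (matcher.find()) {
    System.out.println(matcher.group()); // whole match: href="index.html"; group(1) would return only index.html
}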
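In a crawler you usually want just the link target rather than the whole href="..." text, and that is what the capture group in the pattern above is for. Here is a minimal sketch that collects group(1) for every match. HrefExtractorDemo and extractHrefs are names I chose for illustration, and the pattern only handles double-quoted attributes, so treat it as a quick-and-dirty approach rather than a robust HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HrefExtractorDemo {
    // Same pattern as above: group 1 captures the text between the quotes.
    private static final Pattern HREF = Pattern.compile("href=\"(.+?)\"");

    // Returns every double-quoted href value found in html, in document order.
    static List<String> extractHrefs(String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            hrefs.add(matcher.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        String html = "<a href=\"index.html\">Home</a> <a href=\"about.html\">About</a>";
        // prints [index.html, about.html]
        System.out.println(extractHrefs(html));
    }
}
```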

Replacing text inside {}:

public static String replace(String sourceString, Object[] object) {
    String temp = sourceString;
    for (int i = 0; i < object.length; i++) {
        String[] result = (String[]) object[i]; // result[0] is the regex, result[1] the replacement
        Pattern pattern = Pattern.compile(result[0]);
        Matcher matcher = pattern.matcher(temp);
        temp = matcher.replaceAll(result[1]);
    }
    return temp;
}

// Usage:
String str = "Java's history so far runs from {0} to {1}";
String[][] object = {new String[] {"\\{0\\}", "1995"}, new String[] {"\\{1\\}", "2007"}};
System.out.println(replace(str, object));

There is more to cover, for example using regular expressions to find files in a specified directory.
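As a rough preview of that file search, here is one way it might look using java.io.File.listFiles with a FilenameFilter. This is only a sketch under my own assumptions: the FileFinderDemo class, the findFiles helper, the "." directory, and the ".*\\.java" pattern are all placeholders, and it looks only at files directly inside the directory, not in subdirectories.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.regex.Pattern;

class FileFinderDemo {
    // Returns the sorted names of regular files directly inside dir
    // whose whole name matches regex.
    static List<String> findFiles(File dir, String regex) {
        Pattern pattern = Pattern.compile(regex);
        File[] files = dir.listFiles(
                (parent, name) -> pattern.matcher(name).matches());
        List<String> names = new ArrayList<>();
        if (files != null) { // null if dir is not a directory or cannot be read
            for (File file : files) {
                if (file.isFile()) {
                    names.add(file.getName());
                }
            }
        }
        Collections.sort(names); // listFiles makes no ordering guarantee
        return names;
    }

    public static void main(String[] args) {
        // Placeholder directory and pattern: list .java files in the working directory.
        System.out.println(findFiles(new File("."), ".*\\.java"));
    }
}
```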
