Regular Expressions in a Java Web Crawler (Notes)

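I have recently been writing a small web crawler in Java and ran into quite a few small problems along the way, so I plan to write down each problem and how I solved it.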

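This is the first post. What we care about is usually the content inside a page: once a page has been fetched, what we really need is to pull out the parts that are relevant to our topic.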

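When a Java crawler extracts page content, it often relies on regular expressions to pick out selected parts of the page. In computer science, a regular expression is a single string that describes, or matches, a set of strings conforming to some syntactic rule. In most cases regular expressions are used to search for and/or replace text that matches a pattern. Many programming languages support string manipulation with regular expressions.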

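In a web crawler, regular expressions have two main uses:

1. Filtering URL links, so that only links in a particular format are extracted;

2. Extracting page content.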
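As a minimal sketch of the first use, the class below keeps only the links that match one fixed format. The class name UrlFilter, the pattern, and the example.com URLs are illustrative assumptions, not part of the crawler described in this post.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: keeps only the URLs that match a fixed format.
class UrlFilter {
    // Assumed format: http(s) article pages whose path ends in a numeric id plus ".html".
    private static final Pattern ARTICLE_URL =
            Pattern.compile("https?://www\\.example\\.com/article/\\d+\\.html");

    static List<String> filter(List<String> urls) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            // matches() requires the whole URL to fit the pattern
            if (ARTICLE_URL.matcher(url).matches()) {
                kept.add(url);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
                "http://www.example.com/article/42.html",
                "http://www.example.com/about.html",
                "https://www.example.com/article/7.html");
        // Prints only the two article URLs
        System.out.println(filter(urls));
    }
}
```

Compiling the pattern once as a constant avoids recompiling it for every link, which matters when a crawler checks many URLs.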

I won't go into the mathematical theory behind regular expressions here.

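Java regular expressions: since JDK 1.4, the java.util.regex package has provided regular expression support. The operations a crawler uses most are finding specific content, removing specific content, replacing text, and extracting substrings. An example of each follows: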

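Find:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

Pattern pattern = Pattern.compile("Java.*"); // matches a string that starts with "Java" followed by any characters
Matcher matcher = pattern.matcher("Java is a programming language");
boolean b = matcher.matches(); // matches() tests the whole input against the pattern
System.out.println(b); // true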
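One detail worth spelling out: matches() succeeds only when the entire input fits the pattern, while find() succeeds when any part of the input does. The sketch below contrasts the two. The class and method names are hypothetical, chosen only for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting Matcher.matches() and Matcher.find().
class MatchModes {
    // true only if the entire input matches the regex
    static boolean wholeMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    // true if some substring of the input matches the regex
    static boolean containsMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "I like Java a lot";
        System.out.println(wholeMatch("Java.*", text));    // false: the text does not start with "Java"
        System.out.println(containsMatch("Java.*", text)); // true: "Java a lot" occurs inside the text
        System.out.println(wholeMatch(".*Java.*", text));  // true: the pattern now covers the whole text
    }
}
```

When scanning page source, find() is usually the call to reach for, because the target text is surrounded by other markup.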

Split a string on multiple delimiters:

Pattern pattern = Pattern.compile("[,|]+"); // one or more commas or vertical bars
String[] strs = pattern.split("Java Hello World Java ,Hello,,World|Sun");
for (int i = 0; i < strs.length; i++) {
    System.out.println(strs[i]);
}

Text replacement (first occurrence):

Pattern pattern = Pattern.compile("regular expression");
Matcher matcher = pattern.matcher("regular expression hello world regular expression hello world!!");
System.out.println(matcher.replaceFirst("Java")); // replaces only the first match

Text replacement (all occurrences):

Pattern pattern = Pattern.compile("regular expression");
Matcher matcher = pattern.matcher("regular expression hello world regular expression hello world!!");
System.out.println(matcher.replaceAll("Java")); // replaces every match

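Text replacement (match by match):

Pattern pattern = Pattern.compile("regular expression");
Matcher matcher = pattern.matcher("regular expression hello world regular expression hello world!!");
StringBuffer sbr = new StringBuffer();
while (matcher.find()) {
    matcher.appendReplacement(sbr, "Java"); // copies the text before this match, then the replacement
}
matcher.appendTail(sbr); // copies whatever follows the last match
System.out.println(sbr.toString());

// Difference between replaceAll() and appendReplacement(): replaceAll() replaces every match in a single call, while appendReplacement() handles one match per find() iteration, so each match can be given a different replacement.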
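The find() / appendReplacement() / appendTail() loop earns its keep when the replacement depends on the matched text. As a small sketch under that assumption, the hypothetical method below doubles every number it finds. PerMatchReplace and doubleNumbers are made-up names for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: compute a different replacement for each match.
class PerMatchReplace {
    private static final Pattern NUMBER = Pattern.compile("\\d+");

    static String doubleNumbers(String input) {
        Matcher matcher = NUMBER.matcher(input);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            int doubled = Integer.parseInt(matcher.group()) * 2;
            // quoteReplacement() keeps '$' and '\' in the replacement literal
            matcher.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(doubled)));
        }
        matcher.appendTail(out); // copy the text after the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(doubleNumbers("3 apples and 12 pears")); // 6 apples and 24 pears
    }
}
```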

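Validate an email address:

String str = "[email protected]"; // the address is obscured in the source; substitute the address to check
Pattern pattern = Pattern.compile("[\\w\\.\\-]+@([\\w\\-]+\\.)+[\\w\\-]+", Pattern.CASE_INSENSITIVE);
Matcher matcher = pattern.matcher(str);
System.out.println(matcher.matches()); // true only if the whole string has the form name@domain.tld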

Strip HTML tags:

Pattern pattern = Pattern.compile("<.+?>", Pattern.DOTALL); // lazy match, so each tag is matched separately
Matcher matcher = pattern.matcher("<a href=\"index.html\">Home</a>");
String string = matcher.replaceAll("");
System.out.println(string); // Home

Extract http:// addresses:

Pattern pattern = Pattern.compile("(http://|https://)[\\w\\.\\-/:]+");
Matcher matcher = pattern.matcher("dsdsds<http://dsds//gfgffdfd>fdf");
StringBuffer buffer = new StringBuffer();
while (matcher.find()) {
    buffer.append(matcher.group()); // the full text of the current match
    buffer.append("\r\n");
}
System.out.println(buffer.toString()); // print once, after all matches have been collected

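Find strings in HTML that match a condition:

Pattern pattern = Pattern.compile("href=\"(.+?)\"");
Matcher matcher = pattern.matcher("<a href=\"index.html\">Home</a>");
while (matcher.find()) {
    System.out.println(matcher.group(1)); // group(1) is the captured link; group() would return the whole match, href="index.html"
}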
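In a crawler the extracted links usually need to be collected rather than printed. The sketch below gathers the capture group of every href match into a list. HrefExtractor is a hypothetical name, and the pattern assumes double-quoted attribute values, so it will miss single-quoted or unquoted ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: collects the value of every double-quoted href attribute.
class HrefExtractor {
    private static final Pattern HREF =
            Pattern.compile("href=\"(.+?)\"", Pattern.CASE_INSENSITIVE);

    static List<String> extract(String html) {
        List<String> links = new ArrayList<>();
        Matcher matcher = HREF.matcher(html);
        while (matcher.find()) {
            links.add(matcher.group(1)); // group(1) is the text between the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"index.html\">Home</a> <a href=\"news/1.html\">News</a>";
        System.out.println(extract(html)); // [index.html, news/1.html]
    }
}
```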

Replace the text inside {}:

public static String replace(String sourceString, String[][] replacements) {
    String temp = sourceString;
    for (int i = 0; i < replacements.length; i++) {
        String[] result = replacements[i]; // result[0] is the regex, result[1] is the replacement text
        Pattern pattern = Pattern.compile(result[0]);
        Matcher matcher = pattern.matcher(temp);
        temp = matcher.replaceAll(result[1]);
    }
    return temp;
}

String str = "The history of Java so far runs from {0} to {1}";
String[][] object = {new String[] {"\\{0\\}", "1995"}, new String[] {"\\{1\\}", "2007"}};
System.out.println(replace(str, object));
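The loop above compiles one pattern per placeholder and rescans the string each time. As an alternative sketch, a single pattern can capture the index inside every {n} and look the value up in one pass. PlaceholderFormatter and format are hypothetical names, and leaving out-of-range placeholders untouched is an assumed policy rather than anything this post prescribes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: fills {0}, {1}, ... from an array of values in one pass.
class PlaceholderFormatter {
    // Up to 9 digits, so Integer.parseInt cannot overflow
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\d{1,9})\\}");

    static String format(String template, String... values) {
        Matcher matcher = PLACEHOLDER.matcher(template);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            int index = Integer.parseInt(matcher.group(1));
            // Assumed policy: a placeholder with no matching value is left as-is
            String replacement = index < values.length ? values[index] : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(format("From {0} to {1}", "1995", "2007")); // From 1995 to 2007
    }
}
```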

There is more to cover...

For example, using a regular expression to search for files in a specified directory...
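The post does not show how this is done, so the following is only one possible sketch: list the names in a directory with File.list() and keep those that match a pattern. RegexFileFinder, matchingNames, and find are hypothetical names, and the search is assumed to be non-recursive.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: finds entries in one directory whose names match a regex.
class RegexFileFinder {
    // Keeps the names that match the regex as a whole
    static List<String> matchingNames(String[] names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> matched = new ArrayList<>();
        for (String name : names) {
            if (pattern.matcher(name).matches()) {
                matched.add(name);
            }
        }
        return matched;
    }

    // Non-recursive: only looks at the direct children of dir
    static List<String> find(File dir, String regex) {
        String[] names = dir.list(); // null if dir is not a readable directory
        return matchingNames(names != null ? names : new String[0], regex);
    }

    public static void main(String[] args) {
        // Lists the .html files in the current directory, ignoring case
        System.out.println(find(new File("."), "(?i).*\\.html"));
    }
}
```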
