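The goal is to find the links (mainly http links), hashtags (#content#), and @user mentions in the body of a Weibo post, using regular expressions. The solutions I have found so far are as follows.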
1. Links
Regular expression (written as a Java string literal, so every backslash is doubled)
(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)
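Java example
// Imports needed by the snippets in this post.
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
/**
 * Regular expression for URLs.
 * Group 1 is the scheme (http://, https://, ftp://, ftps://) or the "www." prefix.
 */
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
        + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
        + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);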
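/**
 * Removes the URLs from a text.
 * @param text the post text
 * @return the text with every URL match removed and the ends trimmed
 */
public static String removeURLs(String text) {
    Matcher matcher;
    String newTweet = text.trim();
    String cleanedText = "";
    // Repeat until a pass changes nothing.
    // Note: each match includes the non-word character consumed by (?:^|[\\W]),
    // so that character is removed together with the URL.
    while (!newTweet.equals(cleanedText)) {
        cleanedText = newTweet;
        matcher = urlPattern.matcher(cleanedText);
        newTweet = matcher.replaceAll("");
        newTweet = newTweet.trim();
    }
    return cleanedText;
}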
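/**
 * Returns the list of URLs found in a text.
 * @param originalString the post text
 * @return the URLs, in the order in which they are found
 */
public static List<String> getURLs(String originalString) {
    List<String> urlsSet = new ArrayList<String>();
    Matcher matcher = urlPattern.matcher(originalString);
    while (matcher.find()) {
        // Start at group 1 to skip the character matched before the URL.
        int matchStart = matcher.start(1);
        int matchEnd = matcher.end();
        // matchStart and matchEnd are the offsets of a URL match.
        String tmpUrl = originalString.substring(matchStart, matchEnd);
        urlsSet.add(tmpUrl);
        // Remove every occurrence of this URL from the working copy, then rescan from the start.
        originalString = originalString.replace(tmpUrl, "");
        matcher = urlPattern.matcher(originalString);
    }
    return urlsSet;
}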
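The leading (?:^|[\\W]) in the expression consumes the character in front of a URL, which is why getURLs reads the start offset from group 1. As a minimal sketch of the same idea, the class below collects the URLs in a single pass without modifying the input, and removes URLs while keeping the character before each one. The class name UrlSinglePassDemo and the method names extractUrls and removeUrlsKeepingPrefix are hypothetical, and the sample addresses are made up. Unlike getURLs above, this version reports a URL once per occurrence.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the expression is the same one used by urlPattern.
final class UrlSinglePassDemo {
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
            + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
            + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    /** Collects every URL occurrence, reading each one from group 1 to the end of the match. */
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            urls.add(text.substring(m.start(1), m.end()));
        }
        return urls;
    }

    /** Removes the URLs but keeps the single character matched in front of each one. */
    static String removeUrlsKeepingPrefix(String text) {
        StringBuilder cleaned = new StringBuilder();
        Matcher m = URL_PATTERN.matcher(text);
        int last = 0;
        while (m.find()) {
            // Copy everything up to where the URL itself starts (group 1).
            cleaned.append(text, last, m.start(1));
            last = m.end();
        }
        cleaned.append(text, last, text.length());
        return cleaned.toString();
    }

    public static void main(String[] args) {
        String post = "Read:http://t.cn/RxYz12 or www.example.com/a?b=1 end";
        System.out.println(extractUrls(post));
        System.out.println(removeUrlsKeepingPrefix(post));
    }
}
```

Keeping the prefix character matters most for Chinese text: without Pattern.UNICODE_CHARACTER_CLASS, \W matches any Chinese character, so a URL written directly after a Chinese word would otherwise take the last character of that word with it.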
2. Hashtags
Regular expression
#[^#]+#
Java example
/**
 * Regular expression for hashtags.
 */
// Alternative for tags written without a closing '#', kept for reference:
// private static final Pattern hashtagPattern =
//         Pattern.compile("(?:^|\\s|[\\p{Punct}&&[^/]])(#[\\p{L}0-9-_]+)");
private static final Pattern hashtagPattern =
        Pattern.compile("#[^#]+#");
/**
 * Removes the hashtags (#content#) from a text.
 */
private static String removeHashtags(String text) {
    Matcher matcher;
    String newTweet = text.trim();
    String cleanedText = "";
    // Repeat until a pass changes nothing.
    while (!newTweet.equals(cleanedText)) {
        cleanedText = newTweet;
        matcher = hashtagPattern.matcher(cleanedText);
        newTweet = matcher.replaceAll("");
        newTweet = newTweet.trim();
    }
    return cleanedText;
}
/**
 * Returns the list of hashtags found in a text, including the surrounding '#' characters.
 */
public static List<String> getHashtags(String originalString) {
    List<String> hashtagSet = new ArrayList<String>();
    Matcher matcher = hashtagPattern.matcher(originalString);
    while (matcher.find()) {
        // This pattern has no capturing group, so use the offsets of the whole match.
        int matchStart = matcher.start();
        int matchEnd = matcher.end();
        String tmpHashtag = originalString.substring(matchStart, matchEnd);
        hashtagSet.add(tmpHashtag);
        // Remove every occurrence of this hashtag from the working copy, then rescan from the start.
        originalString = originalString.replace(tmpHashtag, "");
        matcher = hashtagPattern.matcher(originalString);
    }
    return hashtagSet;
}
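getHashtags returns each tag with its two '#' characters. If only the topic text is wanted, one option is to add a capturing group to the same expression and read group 1. The following is a minimal sketch under that assumption; the class name HashtagTopicDemo, the method name extractTopics, and the sample strings are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
final class HashtagTopicDemo {
    // Same expression as hashtagPattern, plus a capturing group around the topic text.
    private static final Pattern TOPIC = Pattern.compile("#([^#]+)#");

    /** Returns the text between each pair of '#' characters, in order of appearance. */
    static List<String> extractTopics(String text) {
        List<String> topics = new ArrayList<>();
        Matcher m = TOPIC.matcher(text);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    public static void main(String[] args) {
        System.out.println(extractTopics("#Java# regex notes #Weibo API#"));
        // A stray '#' shifts the pairing; see the note below.
        System.out.println(extractTopics("C# and #Java#"));
    }
}
```

Because the expression pairs up any two '#' characters, a stray '#' changes what is matched: in "C# and #Java#" the only match is "# and #", so the intended tag is missed. The same applies to hashtagPattern itself.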
3. @ mentions
Regular expression
@[\u4e00-\u9fa5a-zA-Z0-9_-]{2,30}
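Java example
/**
 * Regular expression for @ mentions.
 * Sina Weibo describes its username format as "4-30 characters; English letters, digits, "_" or the minus sign are supported".
 * In other words, a name may contain Chinese characters, letters, digits, underscores and hyphens,
 * and is 4 to 30 characters long (for now, a Chinese character is counted as one character).
 * A matching expression can therefore be written as: @[\u4e00-\u9fa5a-zA-Z0-9_-]{4,30}
 * The pattern actually used below lowers the minimum length to 2.
 */
// Alternative that also checks the character before the '@', kept for reference:
// private static final Pattern usermentionPattern =
//         Pattern.compile("(?:^|\\s|[\\p{Punct}&&[^/]])(@[\\p{L}0-9-_]+)");
private static final Pattern usermentionPattern =
        Pattern.compile("@[\u4e00-\u9fa5a-zA-Z0-9_-]{2,30}");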
/**
 * Removes the @ mentions from a text.
 */
public static String removeUserMentions(String text) {
    Matcher matcher;
    String newTweet = text.trim();
    String cleanedText = "";
    // Repeat until a pass changes nothing.
    while (!newTweet.equals(cleanedText)) {
        cleanedText = newTweet;
        matcher = usermentionPattern.matcher(cleanedText);
        newTweet = matcher.replaceAll("");
        newTweet = newTweet.trim();
    }
    return cleanedText;
}
/**
 * Returns the list of @ mentions found in a text, including the leading '@'.
 */
public static List<String> getUsermentions(String originalString) {
    List<String> usermentionsSet = new ArrayList<String>();
    Matcher matcher = usermentionPattern.matcher(originalString);
    while (matcher.find()) {
        // This pattern has no capturing group, so use the offsets of the whole match.
        int matchStart = matcher.start();
        int matchEnd = matcher.end();
        String tmpUsermention = originalString.substring(matchStart, matchEnd);
        usermentionsSet.add(tmpUsermention);
        // Remove every occurrence of this mention from the working copy, then rescan from the start.
        originalString = originalString.replace(tmpUsermention, "");
        matcher = usermentionPattern.matcher(originalString);
    }
    return usermentionsSet;
}
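To see how the {2,30} bound and the character class behave, here is a small sketch that uses the same expression with a capturing group, so the names come back without the '@'. The class name MentionDemo, the method name extractNames, and the sample names are hypothetical. Chinese characters are written as Unicode escapes only to keep the source ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class.
final class MentionDemo {
    // Same character class and length bounds as usermentionPattern,
    // plus a capturing group around the name.
    private static final Pattern MENTION =
            Pattern.compile("@([\u4e00-\u9fa5a-zA-Z0-9_-]{2,30})");

    /** Returns the mentioned names without the leading '@', in order of appearance. */
    static List<String> extractNames(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = MENTION.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // A two-character Chinese name, an ASCII name, and a one-character name that is too short.
        System.out.println(extractNames("@\u5f20\u4e09 hello @bob_01: hi @x"));
        // The expression does not look at the character before '@'.
        System.out.println(extractNames("mail bob@example.com"));
    }
}
```

Since the expression does not check what precedes the '@', an e-mail address such as bob@example.com also produces a match ("example"). The commented-out alternative above guards against this by requiring the start of the text, whitespace, or punctuation before the '@'.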