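Matching links, topic tags and @mentions in Weibo posts with regular expressions

The task is to find the links (mainly http links), topic tags (#content#) and @user mentions in the body of a Weibo post. Regular expressions handle this; the solutions I have found so far are listed below.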

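1. Links

Regular expression (written as it appears inside a Java string literal, so every backslash is doubled)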

(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)

Java example

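import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * URL regular expression.
 * Group 1 is the scheme ("http://", "https://", "ftp://") or the "www." prefix.
 * The leading (?:^|[\\W]) is part of the match but lies outside group 1.
 */
private static final Pattern urlPattern = Pattern.compile(
        "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
        + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
        + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
        Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

/**
 * Removes the URLs from the text.
 * Note: the non-word character directly before a URL, if any, belongs to the
 * match and is removed together with the URL.
 * @param text the text of the Weibo post
 * @return the trimmed text with every URL match removed
 */
public static String removeURLs(String text) {
    String newTweet = text.trim();
    String cleanedText = "";
    // Repeat until a pass no longer changes the text.
    while (!newTweet.equals(cleanedText)) {
        cleanedText = newTweet;
        newTweet = urlPattern.matcher(cleanedText).replaceAll("").trim();
    }
    return cleanedText;
}

/**
 * Returns the list of URLs in the text.
 * @param originalString the text of the Weibo post
 * @return the distinct URLs, in order of first appearance
 */
public static List<String> getURLs(String originalString) {
    List<String> urls = new ArrayList<String>();
    Matcher matcher = urlPattern.matcher(originalString);
    // Scan once without modifying the input; deleting matches while scanning
    // could damage a longer URL that contains a shorter one.
    while (matcher.find()) {
        // start(1) skips the non-word character consumed before the URL.
        String url = originalString.substring(matcher.start(1), matcher.end());
        if (!urls.contains(url)) {
            urls.add(url);
        }
    }
    return urls;
}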
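To see what the link pattern actually captures, here is a minimal, self-contained sketch. The class name `WeiboUrlDemo`, the method name `extractUrls` and the sample post are assumptions made for illustration; only the pattern itself comes from the section above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern is taken from the article.
class WeiboUrlDemo {
    private static final Pattern URL_PATTERN = Pattern.compile(
            "(?:^|[\\W])((ht|f)tp(s?):\\/\\/|www\\.)"
            + "(([\\w\\-]+\\.){1,}?([\\w\\-.~]+\\/?)*"
            + "[\\p{Alnum}.,%_=?&#\\-+()\\[\\]\\*$~@!:/{};']*)",
            Pattern.CASE_INSENSITIVE | Pattern.MULTILINE | Pattern.DOTALL);

    // Collects every URL in order of appearance, duplicates included.
    static List<String> extractUrls(String text) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(text);
        while (m.find()) {
            // Group 1 starts at the scheme or "www.", after the leading separator.
            urls.add(text.substring(m.start(1), m.end()));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up sample post with a short link and a scheme-less link.
        String post = "Read http://t.cn/AbC123 and www.example.com/page?id=1 today";
        System.out.println(extractUrls(post));
    }
}
```

Because the trailing character class accepts punctuation such as `.` and `,`, a URL that is immediately followed by a full stop or comma will include that character in the match; trimming it afterwards may be necessary.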

2. Topic tags

Regular expression

#[^#]+#

Java example

// Alternative for Twitter-style hashtags, which have no closing '#':
// private static final Pattern hashtagPattern =
//    Pattern.compile("(?:^|\\s|[\\p{Punct}&&[^/]])(#[\\p{L}0-9-_]+)");

/**
 * Hashtag regular expression.
 * A Weibo topic is wrapped in a pair of '#' signs: #topic#
 */
private static final Pattern hashtagPattern =
        Pattern.compile("#[^#]+#");

/**
 * Removes the topic tags from the text.
 * @param text the text of the Weibo post
 * @return the trimmed text with every topic tag removed
 */
private static String removeHashtags(String text) {
    String newTweet = text.trim();
    String cleanedText = "";
    // Repeat until a pass no longer changes the text.
    while (!newTweet.equals(cleanedText)) {
        cleanedText = newTweet;
        newTweet = hashtagPattern.matcher(cleanedText).replaceAll("").trim();
    }
    return cleanedText;
}

/**
 * Returns the list of topic tags in the text, including the '#' signs.
 * @param originalString the text of the Weibo post
 * @return the distinct topic tags, in order of first appearance
 */
public static List<String> getHashtags(String originalString) {
    List<String> hashtags = new ArrayList<String>();
    Matcher matcher = hashtagPattern.matcher(originalString);
    // Scan once without modifying the input.
    while (matcher.find()) {
        // The whole match is the tag, so no capturing group is involved.
        String hashtag = matcher.group();
        if (!hashtags.contains(hashtag)) {
            hashtags.add(hashtag);
        }
    }
    return hashtags;
}
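Often only the topic text is wanted, without the surrounding `#` signs. One possible variation, sketched here under that assumption, wraps the inner part of the pattern in a capturing group. The class name `WeiboHashtagDemo`, the method name `extractTopics` and the sample post are hypothetical and not part of the original code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class: same pattern as above, plus one capturing group.
class WeiboHashtagDemo {
    // Group 1 captures the text between the opening and the closing '#'.
    private static final Pattern TOPIC_PATTERN = Pattern.compile("#([^#]+)#");

    // Returns the topic texts in order of appearance, without the '#' signs.
    static List<String> extractTopics(String text) {
        List<String> topics = new ArrayList<>();
        Matcher m = TOPIC_PATTERN.matcher(text);
        while (m.find()) {
            topics.add(m.group(1));
        }
        return topics;
    }

    public static void main(String[] args) {
        // Made-up sample post with two topics.
        String post = "#Java regex# notes for #weibo# parsing";
        System.out.println(extractTopics(post));
    }
}
```

Each `#` is consumed by at most one match, so the text between two neighbouring topics (here " notes for ") is not mistaken for a topic of its own.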

3. @user mentions

Regular expression

@[\u4e00-\u9fa5a-zA-Z0-9_-]{2,30}

Java example

// Alternative for Twitter-style mentions:
// private static final Pattern usermentionPattern =
//      Pattern.compile("(?:^|\\s|[\\p{Punct}&&[^/]])(@[\\p{L}0-9-_]+)");

/**
 * User mention (@) regular expression.
 * Sina Weibo describes the user name format as "4-30 characters; English letters,
 * digits, "_" or hyphen are supported". In other words, Chinese characters, letters,
 * digits, underscores and hyphens are allowed, and the length is 4 to 30 characters
 * (for now a Chinese character is counted as one character).
 * That rule suggests the pattern @[\u4e00-\u9fa5a-zA-Z0-9_-]{4,30};
 * the pattern compiled below uses a lower bound of 2 instead of 4.
 */
private static final Pattern usermentionPattern =
        Pattern.compile("@[\u4e00-\u9fa5a-zA-Z0-9_-]{2,30}");

/**
 * Removes the @mentions from the text.
 * @param text the text of the Weibo post
 * @return the trimmed text with every @mention removed
 */
public static String removeUserMentions(String text) {
    String newTweet = text.trim();
    String cleanedText = "";
    // Repeat until a pass no longer changes the text.
    while (!newTweet.equals(cleanedText)) {
        cleanedText = newTweet;
        newTweet = usermentionPattern.matcher(cleanedText).replaceAll("").trim();
    }
    return cleanedText;
}

/**
 * Returns the list of @mentions in the text, including the '@' sign.
 * @param originalString the text of the Weibo post
 * @return the distinct @mentions, in order of first appearance
 */
public static List<String> getUsermentions(String originalString) {
    List<String> usermentions = new ArrayList<String>();
    Matcher matcher = usermentionPattern.matcher(originalString);
    // Scan once without modifying the input; deleting "@ab" while scanning
    // would also cut into a later "@abc".
    while (matcher.find()) {
        String usermention = matcher.group();
        if (!usermentions.contains(usermention)) {
            usermentions.add(usermention);
        }
    }
    return usermentions;
}
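The mention pattern can be tried out in the same way. The sketch below is only an illustration: the class name `WeiboMentionDemo`, the method name `extractMentions` and the sample strings are assumptions, and the Chinese nickname is written with Unicode escapes so that the source file stays ASCII-only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the pattern is taken from the article.
class WeiboMentionDemo {
    private static final Pattern MENTION_PATTERN =
            Pattern.compile("@[\u4e00-\u9fa5a-zA-Z0-9_-]{2,30}");

    // Collects every @mention in order of appearance, duplicates included.
    static List<String> extractMentions(String text) {
        List<String> mentions = new ArrayList<>();
        Matcher m = MENTION_PATTERN.matcher(text);
        while (m.find()) {
            mentions.add(m.group());
        }
        return mentions;
    }

    public static void main(String[] args) {
        // Made-up post: one ASCII name and one two-character Chinese name.
        String post = "Thanks @user_01 and @\u5f20\u4e09 for the tips";
        System.out.println(extractMentions(post));

        // The pattern has no left boundary, so an e-mail address also matches.
        System.out.println(extractMentions("mail someone@example.com"));
    }
}
```

Two limitations follow directly from the pattern. The domain part of an e-mail address is reported as a mention (`@example` in the second call), and a mention that is followed by Chinese text without a space absorbs that text, because Chinese characters are allowed in the name. If that matters, a check on the character before `@` and a delimiter after the name would be worth adding.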

hfut_jf
