Matching Chinese Characters with Java Regular Expressions

Original

AceMa

2020-06-23 17:45

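To match Chinese characters with Java regular expressions, the main thing is to understand how Chinese characters are encoded in Unicode.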

Chinese characters (the basic CJK Unified Ideographs range): [\u4e00-\u9fa5]; full-width forms and punctuation: [\ufe30-\uffa0]

Chinese punctuation: the marks 。；，：“”（）、？《》 correspond to the code points "[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b]"

English letters: [a-zA-Z]
Digits: [0-9]
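As a rough illustration of how these character classes might be used together, the sketch below counts how many characters of a string fall into each class. The class name CharClassDemo, the constant names, and the helper count are hypothetical names chosen for this example; only java.util.regex is assumed.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; all names here are illustrative only.
final class CharClassDemo {
    // The character classes listed above, compiled once for reuse.
    static final Pattern HANZI = Pattern.compile("[\u4e00-\u9fa5]");
    static final Pattern PUNCT = Pattern.compile(
            "[\u3002\uff1b\uff0c\uff1a\u201c\u201d\uff08\uff09\u3001\uff1f\u300a\u300b]");
    static final Pattern LETTER = Pattern.compile("[a-zA-Z]");
    static final Pattern DIGIT = Pattern.compile("[0-9]");

    // Counts how many times the pattern matches in the input.
    static int count(Pattern pattern, String input) {
        Matcher matcher = pattern.matcher(input);
        int n = 0;
        while (matcher.find()) {
            n++;
        }
        return n;
    }

    public static void main(String[] args) {
        String input = "《Java編程》第2版，好書。";
        System.out.println("Chinese characters: " + count(HANZI, input));  // 6
        System.out.println("Chinese punctuation: " + count(PUNCT, input)); // 4
        System.out.println("English letters: " + count(LETTER, input));    // 4
        System.out.println("Digits: " + count(DIGIT, input));              // 1
    }
}
```

Each class matches exactly one character per find() call, so the count equals the number of characters of that kind in the input.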

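The example below finds a substring of the form 《...》 (Chinese or English characters enclosed in book-title marks) in a string and prints the matched text. In the pattern, "." matches any character except a line terminator.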

	import java.util.regex.Matcher;
	import java.util.regex.Pattern;

	public class ChineseRegexDemo {
		public static void main(String[] args) {
			// \u300a and \u300b are the book-title marks 《 and 》
			String patternStr = "\u300a.+\u300b";
			Pattern pattern = Pattern.compile(patternStr);
			String input = "《21世紀經濟報道》記者";
			Matcher matcher = pattern.matcher(input);
			if (matcher.find()) {
				// group() returns the text of the current match
				System.out.println(matcher.group());
			} else {
				System.out.println("not found");
			}
			// output: 《21世紀經濟報道》
		}
	}

Because ".+" is greedy, the patternStr above returns the longest possible match. For example, if input = "莫言作品《豐乳肥臀》，《紅高粱》", the output is "《豐乳肥臀》，《紅高粱》".

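If patternStr is changed to "\u300a[^\u300a]+\u300b", so that the characters between 《 and 》 cannot include another 《, the match can no longer run across the start of a second title, and the output is 《豐乳肥臀》.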
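To make the difference between the two patterns concrete, here is a minimal sketch that collects every match rather than only the first. The class TitleFinder and the helper findAll are hypothetical names introduced for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class TitleFinder {
    // Returns every non-overlapping match of the regex in the input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> matches = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            matches.add(matcher.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "莫言作品《豐乳肥臀》，《紅高粱》";
        // Greedy ".+": a single match from the first 《 to the last 》.
        System.out.println(findAll("\u300a.+\u300b", input));
        // prints [《豐乳肥臀》，《紅高粱》]

        // Negated class: one match per title.
        System.out.println(findAll("\u300a[^\u300a]+\u300b", input));
        // prints [《豐乳肥臀》, 《紅高粱》]
    }
}
```

For this input, excluding the closing mark instead ("\u300a[^\u300b]+\u300b") or using the reluctant quantifier ("\u300a.+?\u300b") would give the same two matches.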

Of course, if the data has more specific characteristics, the pattern string can be refined further.
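One possible refinement, sketched here under the assumption that titles consist only of Chinese characters, English letters, and digits, is to restrict the text between the marks to those classes and use a capturing group to return the title without the surrounding 《》. The class TitleExtractor and the method extractTitles are hypothetical names for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
final class TitleExtractor {
    // Assumption: a title contains only Chinese characters, letters, and digits.
    // Group 1 captures the title text without the enclosing marks.
    static final Pattern TITLE =
            Pattern.compile("\u300a([\u4e00-\u9fa5a-zA-Z0-9]+)\u300b");

    // Returns the text of every title found in the input, without 《 and 》.
    static List<String> extractTitles(String input) {
        List<String> titles = new ArrayList<>();
        Matcher matcher = TITLE.matcher(input);
        while (matcher.find()) {
            titles.add(matcher.group(1));
        }
        return titles;
    }

    public static void main(String[] args) {
        String input = "《21世紀經濟報道》記者，莫言作品《紅高粱》";
        System.out.println(extractTitles(input));
        // prints [21世紀經濟報道, 紅高粱]
    }
}
```

A title containing anything outside the assumed classes, such as a space or punctuation, will not match this pattern, so the character class would need to be widened for such data.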

I recently had to process some Chinese-language data, and looking into this turned out to be quite interesting.

