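Regular expression basics
A regular expression is a pattern that describes a set of strings. Regular expressions are used to search for and replace text.
Three regular-expression syntaxes are in common use: “basic” (BRE), “extended” (ERE), and “perl” (PCRE). Perl regular expressions offer richer functionality than extended ones, but not every feature they define is available on every platform. For the syntax of Perl regular expressions, see man 3 pcrepattern. This article focuses mainly on the extended syntax.
For an introduction to regular-expression syntax, run man 1 grep in an Ubuntu terminal and read the “REGULAR EXPRESSIONS” section.
Summary of regular-expression metacharacters
The tables below are reproduced from MSDN. They summarize the regular-expression metacharacters well and are handy as a quick reference.
Original source: Regular Expression Syntax
Single-character metacharacters
A regular expression consists of ordinary characters (for example, the letters a through z) and special characters, known as metacharacters.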
Metacharacter | Behavior | Example |
---|---|---|
* | Matches the preceding character or subexpression zero or more times. Equivalent to {0,}. | zo* matches “z” and “zoo”. |
+ | Matches the preceding character or subexpression one or more times. Equivalent to {1,}. | zo+ matches “zo” and “zoo”, but not “z”. |
? | Matches the preceding character or subexpression zero or one time. Equivalent to {0,1}. When ? immediately follows any other quantifier (*, +, ?, {n}, {n,}, or {n,m}), the matching pattern is non-greedy. A non-greedy pattern matches as little of the searched string as possible. The default greedy pattern matches as much of the searched string as possible. | zo? matches “z” and “zo”, but not “zoo”. o+? matches a single “o” in “oooo”, and o+ matches all “o”s. do(es)? matches the “do” in “do” or “does”. |
^ | Matches the position at the start of the searched string. If the m (multiline search) character is included with the flags, ^ also matches the position following \n or \r. When used as the first character in a bracket expression, ^ negates the character set. | ^\d{3} matches 3 numeric digits at the start of the searched string. [^abc] matches any character except a, b, and c. |
$ | Matches the position at the end of the searched string. If the m (multiline search) character is included with the flags, $ also matches the position before \n or \r. | \d{3}$ matches 3 numeric digits at the end of the searched string. |
. | Matches any single character except the newline character \n. To match any character including the \n, use a pattern like [\s\S]. | a.c matches “abc”, “a1c”, and “a-c”. |
[] | Marks the start and end of a bracket expression. | [1-4] matches “1”, “2”, “3”, or “4”. [^aAeEiIoOuU] matches any non-vowel character. |
{} | Marks the start and end of a quantifier expression. | a{2,3} matches “aa” and “aaa”. |
() | Marks the start and end of a subexpression. Subexpressions can be saved for later use. | A(\d) matches “A0” to “A9”. The digit is saved for later use. |
\| | Indicates a choice between two or more items. | z\|food matches “z” or “food”. (z\|f)ood matches “zood” or “food”. |
/ | Denotes the start or end of a literal regular expression pattern in JScript. After the second “/”, single-character flags can be added to specify search behavior. | /abc/gi is a JScript literal regular expression that matches “abc”. The g (global) flag specifies to find all occurrences of the pattern, and the i (ignore case) flag makes the search case-insensitive. |
\ | Marks the next character as a special character, a literal, a backreference, or an octal escape. | \n matches a newline character. \( matches “(”. \\ matches “\”. |
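Notes
1. To match one of the special characters in the table above literally, you must escape it. For example, to match “+” use the regular expression “\+”, and to match “\” use the regular expression “\\”.
2. Most metacharacters lose the special meaning described in the table above when they appear inside a bracket expression (such as [a-b]). Three characters remain special there. To match them literally inside brackets: ] must come first (as in []abc]), because in any later position it closes the bracket expression opened by [; ^ must not come first; and - should come first or last (for example [-a-z] or [a-z-]).
3. Greedy and non-greedy matching: a greedy quantifier matches as many characters as possible, and a non-greedy quantifier matches as few as possible. Typical greedy quantifiers are *, +, and {n,}; their non-greedy counterparts are *?, +?, and {n,}?. See the entry for ? in the table above for details.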
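The three notes above can be checked with a short sketch. It uses Python's re module, which is an assumption on my part (the article itself targets grep-style syntax), and the sample strings are made up for illustration.

```python
import re

# Note 1: escape a metacharacter to match it literally.
plus = re.findall(r"\+", "1+2+3")
backslash = re.findall(r"\\", r"C:\temp")

# Note 2: literal ], ^ and - inside a bracket expression.
closing = re.findall(r"[]abc]", "a]b")         # ] placed first is literal
caret = re.findall(r"[a^]", "a^b")             # ^ not placed first is literal
hyphen = re.findall(r"[a-z-]+", "well-known")  # - placed last is literal

# Note 3: greedy versus non-greedy quantifiers.
greedy = re.search(r"o+", "oooo").group()
lazy = re.search(r"o+?", "oooo").group()

print(plus, backslash)         # ['+', '+'] ['\\']
print(closing, caret, hyphen)  # ['a', ']', 'b'] ['a', '^'] ['well-known']
print(greedy, lazy)            # oooo o
```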
Multi-character metacharacters
Metacharacter | Behavior | Example |
---|---|---|
\b | Matches a word boundary; that is, the position between a word and a space. | er\b matches the “er” in “never” but not the “er” in “verb”. |
\B | Matches a word non-boundary. | er\B matches the “er” in “verb” but not the “er” in “never”. |
\d | Matches a digit character. Equivalent to [0-9]. | In the searched string “12 345”, \d{2} matches “12” and “34”. \d matches “1”, “2”, “3”, “4”, and “5”. |
\D | Matches a nondigit character. Equivalent to [^0-9]. | \D+ matches “abc” and “ def” in “abc123 def”. |
\w | Matches any of the following characters: A-Z, a-z, 0-9, and underscore. Equivalent to [A-Za-z0-9_]. | In the searched string “The quick brown fox…”, \w+ matches “The”, “quick”, “brown”, and “fox”. |
\W | Matches any character except A-Z, a-z, 0-9, and underscore. Equivalent to [^A-Za-z0-9_]. | In the searched string “The quick brown fox…”, \W+ matches “…” and all of the spaces. |
[xyz] | A character set. Matches any one of the specified characters. | [abc] matches the “a” in “plain”. |
[^xyz] | A negative character set. Matches any character that is not specified. | [^abc] matches the “p”, “l”, “i”, and “n” in “plain”. |
[a-z] | A range of characters. Matches any character in the specified range. | [a-z] matches any lowercase alphabetical character in the range “a” through “z”. |
[^a-z] | A negative range of characters. Matches any character that is not in the specified range. | [^a-z] matches any character that is not in the range “a” through “z”. |
{n} | Matches exactly n times. n is a nonnegative integer. | o{2} does not match the “o” in “Bob”, but does match the two “o”s in “food”. |
{n,} | Matches at least n times. n is a nonnegative integer. * is equivalent to {0,}. + is equivalent to {1,}. | o{2,} does not match the “o” in “Bob” but does match all the “o”s in “foooood”. |
{n,m} | Matches at least n and at most m times. n and m are nonnegative integers, where n <= m. There cannot be a space between the comma and the numbers. ? is equivalent to {0,1}. | In the searched string “1234567”, \d{1,3} matches “123”, “456”, and “7”. |
(pattern) | Matches pattern and saves the match. You can retrieve the saved match from array elements returned by the exec Method in JScript. To match parentheses characters ( ), use “\(” or “\)”. | (Chapter\|Section) [1-9] matches “Chapter 5”, and “Chapter” is saved for later use. |
(?:pattern) | Matches pattern but does not save the match; that is, the match is not stored for possible later use. This is useful for combining parts of a pattern with the “or” character (\|). | industr(?:y\|ies) is equivalent to industry\|industries. |
(?=pattern) | Positive lookahead. After a match is found, the search for the next match starts before the matched text. The match is not saved for later use. | ^(?=.*\d).{4,8}$ applies a restriction that a password must be 4 to 8 characters long, and must contain at least one digit. Within the pattern, .*\d finds any number of characters followed by a digit. For the searched string “abc3qr”, this matches “abc3”. Starting before instead of after that match, .{4,8} matches a 4-8 character string. This matches “abc3qr”. The ^ and $ specify the positions at the start and end of the searched string. This is to prevent a match if the searched string contains any characters outside of the matched characters. |
(?!pattern) | Negative lookahead. Matches a search string that does not match pattern. After a match is found, the search for the next match starts before the matched text. The match is not saved for later use. | \b(?!th)\w+\b matches words that do not start with “th”. Within the pattern, \b matches a word boundary. For the searched string “ quick ”, this matches the first space. (?!th) matches a string that is not “th”. This matches “qu”. Starting before that match, \w+ matches a word. This matches “quick”. |
\cx | Matches the control character indicated by x. The value of x must be in the range of A-Z or a-z. If it is not, c is assumed to be a literal “c” character. | \cM matches a CTRL+M or carriage return character. |
\xn | Matches n, where n is a hexadecimal escape value. Hexadecimal escape values must be exactly two digits long. Allows ASCII codes to be used in regular expressions. | \x41 matches “A”. \x041 is equivalent to “\x04” followed by “1” (because n must be exactly 2 digits). |
\num | Matches num, where num is a positive integer. This is a reference to saved matches. | (.)\1 matches two consecutive identical characters. |
\n | Identifies either an octal escape value or a backreference. If \n is preceded by at least n captured subexpressions, n is a backreference. Otherwise, n is an octal escape value if n is an octal digit (0-7). | (\d)\1 matches two consecutive identical digits. |
\nm | Identifies either an octal escape value or a backreference. If \nm is preceded by at least nm captured subexpressions, nm is a backreference. If \nm is preceded by at least n captured subexpressions, n is a backreference followed by literal m. If neither of those conditions exist, \nm matches octal escape value nm when n and m are octal digits (0-7). | \11 matches a tab character. |
\nml | Matches octal escape value nml when n is an octal digit (0-3) and m and l are octal digits (0-7). | \011 matches a tab character. |
\un | Matches n, where n is a Unicode character expressed as four hexadecimal digits. | \u00A9 matches the copyright symbol (©). |
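A few rows of this table can be tried out directly. The following is a minimal sketch using Python's re module (an assumption; the table itself documents JScript), with input strings taken from the table where possible and otherwise made up.

```python
import re

# \b matches at a word boundary, \B everywhere else.
boundary = [bool(re.search(r"er\b", w)) for w in ("never", "verb")]
non_boundary = [bool(re.search(r"er\B", w)) for w in ("never", "verb")]

# {n,m} is greedy, so each match takes up to three digits.
chunks = re.findall(r"\d{1,3}", "1234567")

# (?:...) groups without capturing, so findall returns whole matches.
words = re.findall(r"industr(?:y|ies)", "industry and industries")

# (?=...) positive lookahead: 4 to 8 characters with at least one digit.
password = re.compile(r"^(?=.*\d).{4,8}$")
checks = {p: bool(password.search(p)) for p in ("abc3qr", "abcdef", "a1")}

# (?!...) negative lookahead: words that do not start with "th".
not_th = re.findall(r"\b(?!th)\w+\b", "the quick thin fox")

print(boundary, non_boundary)  # [True, False] [False, True]
print(chunks)                  # ['123', '456', '7']
print(words)                   # ['industry', 'industries']
print(checks)                  # {'abc3qr': True, 'abcdef': False, 'a1': False}
print(not_th)                  # ['quick', 'fox']
```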
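Supplementary notes:
1. Subexpressions and backreferences: enclosing part of a regular expression in parentheses defines a subexpression, which can be referenced later as \1, \2, \3, and so on for the first, second, and third subexpression. For example, take the text “This is a block of of text, several words here are are repeated, and and they should not be.” The regular expression [ ]+(\w+)[ ]+\1 matches consecutive repeated words: [ ]+ matches one or more spaces, (\w+) is a subexpression that matches a word, and \1 refers back to that subexpression, so “of of”, “are are”, and “and and” are all found. A backreference guarantees that the later text is identical to the earlier text, because it refers to whatever the subexpression actually matched. Some implementations reference subexpressions with $ rather than \. Where it is supported, \0 refers to the text matched by the entire regular expression.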
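The repeated-word example can be run as written. This is a sketch in Python's re module (my choice of implementation, not the article's); the replacement step at the end is an added illustration of using \1 in replacement text.

```python
import re

text = ("This is a block of of text, several words here are are "
        "repeated, and and they should not be.")

# [ ]+ matches one or more spaces, (\w+) captures a word,
# and \1 requires the same word to appear again.
pattern = re.compile(r"[ ]+(\w+)[ ]+\1")

repeated = [m.group(1) for m in pattern.finditer(text)]
print(repeated)  # ['of', 'are', 'and']

# In the replacement text, \1 inserts the captured word once.
cleaned = pattern.sub(r" \1", text)
print(cleaned)
```

Because the pattern has no \b after \1, it would also accept a word followed by a longer word that merely starts with it; appending \b tightens the match.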
Representing non-printable characters in regular expressions
Character | Matches | Equivalent to |
---|---|---|
\f | Form-feed character. | \x0c and \cL |
\n | Newline character. | \x0a and \cJ |
\r | Carriage-return character. | \x0d and \cM |
\s | Any white-space character. This includes space, tab, and form feed. | [ \f\n\r\t\v] |
\S | Any non-white-space character. | [^ \f\n\r\t\v] |
\t | Tab character. | \x09 and \cI |
\v | Vertical tab character. | \x0b and \cK |
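The equivalences in this table can be spot-checked with a small sketch. It assumes Python's re module, which accepts the \x and octal forms but, as far as I know, not the \cX control-character notation, so that column is left out.

```python
import re

# \t, \x09 and the octal escape \011 all denote the tab character.
tab_forms = [bool(re.fullmatch(p, "\t")) for p in (r"\t", r"\x09", r"\011")]

# \s covers every white-space character listed in the table.
whitespace = " \f\n\r\t\v"
all_space = all(re.fullmatch(r"\s", ch) for ch in whitespace)

# Splitting on \s+ leaves only the runs that \S would match.
fields = re.split(r"\s+", "a\tb\r\nc  d")

print(tab_forms)  # [True, True, True]
print(all_space)  # True
print(fields)     # ['a', 'b', 'c', 'd']
```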
Operator precedence in regular expressions
Operators are listed from highest to lowest precedence.
Operator or operators | Description |
---|---|
\ | Escape |
(), (?:), (?=), [] | Parentheses and brackets |
*, +, ?, {n}, {n,}, {n,m} | Quantifiers |
^, $, \anymetacharacter | Anchors and sequences |
\| | Alternation |
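The practical effect of this ordering is that alternation applies to whole sequences and a quantifier applies only to the item immediately before it, unless parentheses say otherwise. A minimal sketch in Python's re module (an assumed implementation) with made-up test strings:

```python
import re

samples = ("z", "food", "zood")

# Alternation binds loosest: z|food means "z" or "food".
alt = [bool(re.fullmatch(r"z|food", s)) for s in samples]

# Parentheses bind tighter: (z|f)ood means "zood" or "food".
grouped = [bool(re.fullmatch(r"(z|f)ood", s)) for s in samples]

# A quantifier applies to the preceding item only: ab* is "a" then any "b"s.
single = [bool(re.fullmatch(r"ab*", s)) for s in ("abbb", "abab")]

# Grouping makes the quantifier repeat the whole sequence "ab".
whole = [bool(re.fullmatch(r"(ab)*", s)) for s in ("abbb", "abab")]

print(alt)      # [True, True, False]
print(grouped)  # [False, True, True]
print(single)   # [True, False]
print(whole)    # [False, True]
```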
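POSIX character classes
A POSIX character class is an extension to bracket expressions that groups characters into named categories.
Character class | Description |
---|---|
[:alnum:] | Any letter or digit; equivalent to [a-zA-Z0-9] |
[:alpha:] | Any letter; equivalent to [a-zA-Z] |
[:blank:] | A space or a tab; equivalent to [ \t] |
[:cntrl:] | The ASCII control characters, that is, codes 0 through 31, plus 127 |
[:digit:] | Any digit; equivalent to [0-9] |
[:graph:] | Same as [:print:], but excluding the space |
[:lower:] | Any lowercase letter; equivalent to [a-z] |
[:print:] | Any printable character (in ASCII, codes 32 through 126) |
[:punct:] | Any punctuation character, that is, any [:graph:] character that is not in [:alnum:] |
[:space:] | Any white-space character, including the space; equivalent to [ \f\n\r\t\v] |
[:upper:] | Any uppercase letter; equivalent to [A-Z] |
[:xdigit:] | Any hexadecimal digit; equivalent to [a-fA-F0-9] |
Note:
A POSIX character class name is delimited by [: and :]. In [:alnum:], the [ and ] are part of the class name itself, so the class still has to be placed inside a bracket expression: in a pattern, write [[:alnum:]].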
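Not every engine understands these names. Python's re module, for one, does not, so the sketch below rewrites a POSIX-style pattern into plain ranges before compiling it. The POSIX_CLASSES table and the expand_posix helper are hypothetical names introduced here, and the ranges are the ASCII equivalents from the table above, so this is an illustration rather than a faithful POSIX implementation.

```python
import re

# ASCII equivalents copied from the table above (hypothetical lookup table).
POSIX_CLASSES = {
    "alnum": "a-zA-Z0-9",
    "alpha": "a-zA-Z",
    "blank": " \\t",
    "digit": "0-9",
    "lower": "a-z",
    "space": " \\f\\n\\r\\t\\v",
    "upper": "A-Z",
    "xdigit": "a-fA-F0-9",
}

def expand_posix(pattern: str) -> str:
    """Replace each [:name:] with its ASCII range text (illustrative only)."""
    return re.sub(r"\[:(\w+):\]", lambda m: POSIX_CLASSES[m.group(1)], pattern)

# The outer brackets of [[:upper:]] form the bracket expression itself.
posix_pattern = "[[:upper:]][[:lower:][:digit:]]+"
python_pattern = expand_posix(posix_pattern)
matches = re.findall(python_pattern, "Abc1 xyz Q9")

print(python_pattern)  # [A-Z][a-z0-9]+
print(matches)         # ['Abc1', 'Q9']
```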
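Converting letter case with regular expressions
Some regular-expression implementations let you convert the case of letters with the metacharacters in the table below.
Metacharacter | Description |
---|---|
\E | Ends a \L or \U conversion |
\l | Converts the next character to lowercase |
\L | Converts all characters between \L and \E to lowercase |
\u | Converts the next character to uppercase |
\U | Converts all characters between \U and \E to uppercase |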
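Python's re module does not support these metacharacters in replacement text, so the sketch below swaps in a different technique: a replacement function that changes case itself. It only emulates the effect of \U...\E and \u; the HTML-like sample string is made up.

```python
import re

text = "<h1>regular expressions</h1>"

# Emulates wrapping the captured group in \U ... \E:
# everything inside the tags becomes uppercase.
upper = re.sub(
    r"<h1>(.*?)</h1>",
    lambda m: "<h1>" + m.group(1).upper() + "</h1>",
    text,
)

# Emulates \u in front of a captured letter:
# only the first letter of each word is uppercased.
title = re.sub(r"\b([a-z])", lambda m: m.group(1).upper(), "regular expressions")

print(upper)  # <h1>REGULAR EXPRESSIONS</h1>
print(title)  # Regular Expressions
```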
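Lookaround in regular expressions
A lookaround pattern defines something that must match but is not returned as part of the result. “Ahead” and “behind” describe where the lookaround pattern sits relative to the text that is returned: a lookahead pattern must follow that text (on its right), and a lookbehind pattern must precede it (on its left).
Lookahead: a lookahead pattern is a subexpression that begins with ?=, with the text to match following the =. The syntax is (?=pattern).
Example:
Text
http://www.forta.com/
ftp://ftp.fforta.com/
The regular expression .+(?=:) matches http and ftp. (?=:) defines a lookahead that matches the : but does not include it in the result, so the whole expression returns the characters before the :.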
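The lookahead example can be reproduced as follows, assuming Python's re module. The second pattern is added for contrast: it consumes the colon instead of only looking ahead for it.

```python
import re

text = "http://www.forta.com/\nftp://ftp.fforta.com/"

# The lookahead requires a ":" after the match but does not consume it.
schemes = re.findall(r".+(?=:)", text)

# Without the lookahead, the ":" becomes part of the result.
with_colon = re.findall(r".+:", text)

print(schemes)     # ['http', 'ftp']
print(with_colon)  # ['http:', 'ftp:']
```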
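Lookbehind: a lookbehind pattern is a subexpression that begins with ?<=, with the text to match following the <=. The syntax is (?<=pattern).
Example:
Text
ABC01: $23.45
HGG42: $5.31
CFMX1: $899.00
The regular expression (?<=\$)[0-9.]+ matches 23.45, 5.31, and 899.00. The $ must be escaped as \$ here, because an unescaped $ is the end-of-string anchor.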
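A minimal sketch of the price example in Python's re module (an assumed implementation), with the dollar sign escaped so that it is read as a literal character. The second pattern shows what the lookbehind filters out.

```python
import re

text = "ABC01: $23.45\nHGG42: $5.31\nCFMX1: $899.00"

# The lookbehind requires a literal "$" immediately before the match.
prices = re.findall(r"(?<=\$)[0-9.]+", text)

# Without the lookbehind, digits in the product codes match as well.
all_numbers = re.findall(r"[0-9.]+", text)

print(prices)       # ['23.45', '5.31', '899.00']
print(all_numbers)  # ['01', '23.45', '42', '5.31', '1', '899.00']
```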
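Negative lookaround: the lookaround patterns shown so far are used for positioning. They match a specific pattern to locate a position in the text and then match forward or backward from that position. This usage is called positive lookahead and positive lookbehind.
A less common usage is negative lookaround. A negative lookahead looks ahead for text that does not match the given pattern, and a negative lookbehind looks behind for text that does not match the given pattern.
Operator | Description |
---|---|
(?=pattern) | Positive lookahead |
(?!pattern) | Negative lookahead |
(?<=pattern) | Positive lookbehind |
(?<!pattern) | Negative lookbehind |
Example:
Text
I paid $30 for 100 apples,
50 oranges, and 60 pears.
I saved $5 on this order.
The positive lookbehind pattern (?<=\$)\d+ matches 30 and 5.
The negative lookbehind pattern \b(?<!\$)\d+\b matches 100, 50, and 60.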
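Both lookbehind patterns from this example can be run side by side. The sketch assumes Python's re module; the third pattern is an added variation showing why the \b anchors matter in the negative case.

```python
import re

text = ("I paid $30 for 100 apples,\n"
        "50 oranges, and 60 pears.\n"
        "I saved $5 on this order.")

# Positive lookbehind: numbers directly preceded by "$".
prices = re.findall(r"(?<=\$)\d+", text)

# Negative lookbehind: whole numbers NOT preceded by "$".
quantities = re.findall(r"\b(?<!\$)\d+\b", text)

# Without \b, the "0" of "$30" slips through, because it is preceded
# by "3" rather than "$".
loose = re.findall(r"(?<!\$)\d+", text)

print(prices)      # ['30', '5']
print(quantities)  # ['100', '50', '60']
print(loose)       # ['0', '100', '50', '60']
```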