Regular Expressions in JavaScript

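 (Content mainly excerpted from "JavaScript: The Definitive Guide", 5th edition)
    Validating page input on the client with regular expressions, using the methods JavaScript provides, is a common and efficient practice. To compare a string against a regular-expression pattern, you can use either the pattern-matching methods of strings or the methods of the regular-expression object (RegExp).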

Defining Regular Expressions and Their Syntax

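    In JavaScript, you can create a regular-expression object with the RegExp() constructor. More commonly, you define one with literal syntax: just as a string literal is written between quotation marks, the body of a regular expression is written between a pair of slashes (/).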
   
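    Literal characters

    Alphanumeric characters match themselves literally, and certain sequences that begin with a backslash have special meanings: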
   

Character                Matches
Alphanumeric character   Itself
\0                       The NUL character (\u0000)
\t                       Tab (\u0009)
\n                       Newline (\u000A)
\v                       Vertical tab (\u000B)
\f                       Form feed (\u000C)
\r                       Carriage return (\u000D)
\xnn                     The Latin character specified by the hexadecimal number nn; for example, \x0A is the same as \n
\uxxxx                   The Unicode character specified by the hexadecimal number xxxx; for example, \u0009 is the same as \t
\cX                      The control character ^X; for example, \cJ is equivalent to the newline character \n
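
As a quick check of the equivalences listed above, the following minimal sketch tests each regular-expression escape against the same character written as a string escape. The names escapePairs and escapeResults are illustrative only.

```javascript
// Each pair names the same character twice: once as a regular-expression
// escape and once as a string escape.
const escapePairs = [
  [/\x0A/, "\n"],     // hexadecimal escape for newline
  [/\u0009/, "\t"],   // Unicode escape for tab
  [/\cJ/, "\n"],      // control character ^J is newline
  [/\0/, "\u0000"],   // NUL character
];

const escapeResults = escapePairs.map(([re, ch]) => re.test(ch));
console.log(escapeResults); // [ true, true, true, true ]
```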


    Some punctuation characters also have special meanings:
       ^ $ . * + ? = ! : | \ / ( ) [ ] { }
   
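    Character classes

    Individual characters can be combined into a character class by placing them within square brackets. A character class matches any one character it contains, and only one character. For example, /[abc]/ matches any one of the letters a, b, or c. The caret (^) negates the class: /[^abc]/ matches any one character other than a, b, or c. A hyphen denotes a range of characters: /[a-z]/ matches any one lowercase letter from a to z.
    Because certain character classes are commonly used, JavaScript defines special characters and escape sequences to represent them.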

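Character   Matches
[...]       Any one character between the brackets.
[^...]      Any one character not between the brackets.
.           Any character except newline or another Unicode line terminator.
\w          Any ASCII word character. Equivalent to [a-zA-Z0-9_].
\W          Any character that is not an ASCII word character. Equivalent to [^a-zA-Z0-9_].
\s          Any Unicode whitespace character.
\S          Any character that is not Unicode whitespace. Note that \w and \S are not the same thing.
\d          Any ASCII digit. Equivalent to [0-9].
\D          Any character other than an ASCII digit. Equivalent to [^0-9].
[\b]        A literal backspace (special case).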


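    Escape sequences can be used inside [ ]. Note the special case of \b: inside square brackets it means the backspace character, but outside them it matches a word boundary.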
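
The character classes described above can be tried out with a short sketch like the one below. The sample strings and the name classChecks are made up for illustration.

```javascript
const classChecks = {
  anyOfAbc: /[abc]/.test("cab"),             // true: "c" is in the class
  noneOfAbc: /[^abc]/.test("abc"),           // false: every character is a, b, or c
  lowercase: /[a-z]/.test("JAVA"),           // false: no lowercase letter
  digitThenNonDigit: /\d\D/.test("7x"),      // true: a digit followed by a non-digit
  backspace: /[\b]/.test("\u0008"),          // true: [\b] is a literal backspace
  wholeWord: /\bcat\b/.test("a cat sat"),    // true: \b outside brackets is a word boundary
  insideWord: /\bcat\b/.test("concatenate"), // false: "cat" is inside a longer word
};
console.log(classChecks);
```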
   
    Repetition

With the regular expression syntax you've learned so far, you can describe a two-digit number as /\d\d/ and a four-digit number as /\d\d\d\d/. But you don't have any way to describe, for example, a number that can have any number of digits or a string of three letters followed by an optional digit. These more complex patterns use regular-expression syntax that specifies how many times an element of a regular expression may be repeated.

The characters that specify repetition always follow the pattern to which they are being applied. Because certain types of repetition are quite commonly used, there are special characters to represent these cases. For example, + matches one or more occurrences of the previous pattern. Table 11-3 summarizes the repetition syntax.

Table 11-3. Regular expression repetition characters

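Character   Meaning
{n,m}       Match the previous item at least n times but no more than m times.
{n,}        Match the previous item n or more times.
{n}         Match exactly n occurrences of the previous item.
?           Match zero or one occurrences of the previous item. That is, the previous item is optional. Equivalent to {0,1}.
+           Match one or more occurrences of the previous item. Equivalent to {1,}.
*           Match zero or more occurrences of the previous item. Equivalent to {0,}.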


Here are some examples:

/\d{2,4}/     // Match between two and four digits
/\w{3}\d?/    // Match exactly three word characters and an optional digit
/\s+java\s+/  // Match "java" with one or more spaces before and after
/[^"]*/       // Match zero or more non-quote characters

Be careful when using the * and ? repetition characters. Since these characters may match zero instances of whatever precedes them, they are allowed to match nothing. For example, the regular expression /a*/ actually matches the string "bbbb" because the string contains zero occurrences of the letter a!
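
The zero-length match described above is easy to observe directly. This is a minimal sketch; the variable names and the sample strings other than "bbbb" are illustrative.

```javascript
// /a*/ succeeds on "bbbb" by matching the empty string at position 0.
const emptyMatch = "bbbb".match(/a*/);
console.log(emptyMatch[0] === "", emptyMatch.index); // true 0

// /a+/ needs at least one "a", so it fails outright.
const noMatch = "bbbb".match(/a+/);
console.log(noMatch); // null

// Repetition is greedy: {2,4} takes four digits when four are available.
const greedy = "x12345".match(/\d{2,4}/)[0];
console.log(greedy); // "1234"

// Three word characters followed by an optional digit.
const optionalDigit = "abc123".match(/\w{3}\d?/)[0];
console.log(optionalDigit); // "abc1"
```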



    Alternation, grouping, and references
   

The regular-expression grammar includes special characters for specifying alternatives, grouping subexpressions, and referring to previous subexpressions. The | character separates alternatives. For example, /ab|cd|ef/ matches the string "ab" or the string "cd" or the string "ef". And /\d{3}|[a-z]{4}/ matches either three digits or four lowercase letters.

Note that alternatives are considered left to right until a match is found. If the left alternative matches, the right alternative is ignored, even if it would have produced a "better" match. Thus, when the pattern /a|ab/ is applied to the string "ab", it matches only the first letter.
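
A small sketch of the left-to-right rule: the order in which alternatives are written decides which one wins. The names shortFirst and longFirst are illustrative.

```javascript
// The left alternative "a" matches first, so "ab" is never tried.
const shortFirst = "ab".match(/a|ab/)[0];
console.log(shortFirst); // "a"

// Listing the longer alternative first yields the longer match.
const longFirst = "ab".match(/ab|a/)[0];
console.log(longFirst); // "ab"
```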

Parentheses have several purposes in regular expressions. One purpose is to group separate items into a single subexpression so that the items can be treated as a single unit by |, *, +, ?, and so on. For example, /java(script)?/ matches "java" followed by the optional "script". And /((ab|cd)+|ef)/ matches either the string "ef" or one or more repetitions of either of the strings "ab" or "cd".

Another purpose of parentheses in regular expressions is to define subpatterns within the complete pattern. When a regular expression is successfully matched against a target string, it is possible to extract the portions of the target string that matched any particular parenthesized subpattern. (You'll see how these matching substrings are obtained later in the chapter.) For example, suppose you are looking for one or more lowercase letters followed by one or more digits. You might use the pattern /[a-z]+\d+/. But suppose you only really care about the digits at the end of each match. If you put that part of the pattern in parentheses (/[a-z]+(\d+)/), you can extract the digits from any matches you find, as explained later.
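
The two uses of parentheses described so far, grouping and capturing, can be sketched as follows. The sample strings are made up, and the ^ and $ anchors in the second pattern are an addition here so that test() has to match the whole string.

```javascript
// Capturing: element 0 is the whole match, element 1 is the text
// matched by the parenthesized (\d+).
const captured = "item42 box7".match(/[a-z]+(\d+)/);
console.log(captured[0], captured[1]); // "item42" "42"

// Grouping: the outer parentheses let | choose between "(ab|cd)+" and "ef".
const unit = /^((ab|cd)+|ef)$/;
const groupingResults = ["abcdab", "ef", "abef"].map((s) => unit.test(s));
console.log(groupingResults); // [ true, true, false ]
```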

A related use of parenthesized subexpressions is to allow you to refer back to a subexpression later in the same regular expression. This is done by following a \ character by a digit or digits. The digits refer to the position of the parenthesized subexpression within the regular expression. For example, \1 refers back to the first subexpression, and \3 refers to the third. Note that, because subexpressions can be nested within others, it is the position of the left parenthesis that is counted. In the following regular expression, for example, the nested subexpression ([Ss]cript) is referred to as \2:

/([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/

A reference to a previous subexpression of a regular expression does not refer to the pattern for that subexpression but rather to the text that matched the pattern. Thus, references can be used to enforce a constraint that separate portions of a string contain exactly the same characters. For example, the following regular expression matches zero or more characters within single or double quotes. However, it does not require the opening and closing quotes to match (i.e., both single quotes or both double quotes):

/['"][^'"]*['"]/

To require the quotes to match, use a reference:

/(['"])[^'"]*\1/

The \1 matches whatever the first parenthesized subexpression matched. In this example, it enforces the constraint that the closing quote match the opening quote. This regular expression does not allow single quotes within double-quoted strings or vice versa. It is not legal to use a reference within a character class, so you cannot write:

/(['"])[^\1]*\1/

Later in this chapter, you'll see that this kind of reference to a parenthesized subexpression is a powerful feature of regular-expression search-and-replace operations.
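
To see the difference between the two quote patterns above, here is a minimal sketch that runs both against a string with mismatched quotes. The names loose, strict, and mismatched are illustrative.

```javascript
const loose = /['"][^'"]*['"]/;   // any opening quote, any closing quote
const strict = /(['"])[^'"]*\1/;  // closing quote must equal the opening quote

const mismatched = "'mixed\"";    // opens with ' and closes with "
console.log(loose.test(mismatched));  // true
console.log(strict.test(mismatched)); // false
console.log(strict.test('"ok"'));     // true
console.log(strict.test("'ok'"));     // true
```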

In JavaScript 1.5 (but not JavaScript 1.2), it is possible to group items in a regular expression without creating a numbered reference to those items. Instead of simply grouping the items within ( and ), begin the group with (?: and end it with ). Consider the following pattern, for example:

/([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/

Here, the subexpression (?:[Ss]cript) is used simply for grouping, so the ? repetition character can be applied to the group. These modified parentheses do not produce a reference, so in this regular expression, \2 refers to the text matched by (fun\w*).
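
The effect of (?: on group numbering shows up in the array returned by match(). In this sketch the sample sentence and the variable names are illustrative.

```javascript
const capturing = /([Jj]ava([Ss]cript)?)\sis\s(fun\w*)/;
const groupingOnly = /([Jj]ava(?:[Ss]cript)?)\sis\s(fun\w*)/;
const sentence = "JavaScript is funny";

// Three numbered groups: "JavaScript", "Script", "funny".
const withCapture = sentence.match(capturing).slice(1);
console.log(withCapture); // [ "JavaScript", "Script", "funny" ]

// Only two numbered groups: (?:...) is skipped, so group 2 is "funny".
const withoutCapture = sentence.match(groupingOnly).slice(1);
console.log(withoutCapture); // [ "JavaScript", "funny" ]
```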

Table 11-4 summarizes the regular-expression alternation, grouping, and referencing operators.

Table 11-4. Regular expression alternation, grouping, and reference characters

Character   Meaning
|           Alternation. Match either the subexpression to the left or the subexpression to the right.
(...)       Grouping. Group items into a single unit that can be used with *, +, ?, |, and so on. Also remember the characters that match this group for use with later references.
(?:...)     Grouping only. Group items into a single unit, but do not remember the characters that match this group.
\n          Match the same characters that were matched when group number n was first matched. Groups are subexpressions within (possibly nested) parentheses. Group numbers are assigned by counting left parentheses from left to right. Groups formed with (?: are not numbered.


    Specifying match position

    Pinning down where a match begins and ends is also key to matching precisely.

   

As described earlier, many elements of a regular expression match a single character in a string. For example, \s matches a single character of whitespace. Other regular expression elements match the positions between characters, instead of actual characters. \b, for example, matches a word boundary, that is, the boundary between a \w (ASCII word character) and a \W (nonword character), or the boundary between an ASCII word character and the beginning or end of a string.[*] Elements such as \b do not specify any characters to be used in a matched string; what they do specify, however, is legal positions at which a match can occur. Sometimes these elements are called regular-expression anchors because they anchor the pattern to a specific position in the search string. The most commonly used anchor elements are ^, which ties the pattern to the beginning of the string, and $, which anchors the pattern to the end of the string.

[*] Except within a character class (square brackets), where \b matches the backspace character.

For example, to match the word "JavaScript" on a line by itself, you can use the regular expression /^JavaScript$/. If you want to search for "Java" used as a word by itself (not as a prefix, as it is in "JavaScript"), you can try the pattern /\sJava\s/, which requires a space before and after the word. But there are two problems with this solution. First, it does not match "Java" if that word appears at the beginning or the end of a string, but only if it appears with space on either side. Second, when this pattern does find a match, the matched string it returns has leading and trailing spaces, which is not quite what's needed. So instead of matching actual space characters with \s, match (or anchor to) word boundaries with \b. The resulting expression is /\bJava\b/. The element \B anchors the match to a location that is not a word boundary. Thus, the pattern /\B[Ss]cript/ matches "JavaScript" and "postscript", but not "script" or "Scripting".
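
The word-boundary behavior described above can be checked with a short sketch. The sentence "I like Java" and the variable names are illustrative; the four words tested against \B come from the paragraph.

```javascript
// \s needs real whitespace on both sides, so it misses "Java" at the end.
const spaced = /\sJava\s/.test("I like Java");
// \b only needs a word boundary, which the end of the string provides.
const bounded = /\bJava\b/.test("I like Java");
console.log(spaced, bounded); // false true

// \b also rejects "Java" when it is only a prefix of a longer word.
const prefixOnly = "JavaScript".match(/\bJava\b/);
console.log(prefixOnly); // null

// \B requires that "script" does NOT start at a word boundary.
const notAtBoundary = ["JavaScript", "postscript", "script", "Scripting"]
  .map((word) => /\B[Ss]cript/.test(word));
console.log(notAtBoundary); // [ true, true, false, false ]
```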

In JavaScript 1.5 (but not JavaScript 1.2), you can also use arbitrary regular expressions as anchor conditions. If you include an expression within (?= and ) characters, it is a lookahead assertion, and it specifies that the enclosed characters must match, without actually matching them. For example, to match the name of a common programming language, but only if it is followed by a colon, you could use /[Jj]ava([Ss]cript)?(?=\:)/. This pattern matches the word "JavaScript" in "JavaScript: The Definitive Guide", but it does not match "Java" in "Java in a Nutshell" because it is not followed by a colon.

If you instead introduce an assertion with (?!, it is a negative lookahead assertion, which specifies that the following characters must not match. For example, /Java(?!Script)([A-Z]\w*)/ matches "Java" followed by a capital letter and any number of additional ASCII word characters, as long as "Java" is not followed by "Script". It matches "JavaBeans" but not "Javanese", and it matches "JavaScrip" but not "JavaScript" or "JavaScripter".
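
Both kinds of lookahead can be exercised with the strings mentioned above. This is a minimal sketch; the helper firstMatch and the variable names are illustrative.

```javascript
// Return the matched text, or null when the pattern does not match.
function firstMatch(pattern, text) {
  const m = text.match(pattern);
  return m === null ? null : m[0];
}

// Positive lookahead: the colon must follow but is not part of the match.
const beforeColon = /[Jj]ava([Ss]cript)?(?=\:)/;
const positive = [
  firstMatch(beforeColon, "JavaScript: The Definitive Guide"), // "JavaScript"
  firstMatch(beforeColon, "Java in a Nutshell"),               // null
];

// Negative lookahead: "Java" must not be followed by "Script".
const notScript = /Java(?!Script)([A-Z]\w*)/;
const negative = ["JavaBeans", "Javanese", "JavaScrip", "JavaScript", "JavaScripter"]
  .map((word) => firstMatch(notScript, word));

console.log(positive); // [ "JavaScript", null ]
console.log(negative); // [ "JavaBeans", null, "JavaScrip", null, null ]
```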

Table 11-5 summarizes regular-expression anchors.

Table 11-5. Regular-expression anchor characters

Character   Meaning
^           Match the beginning of the string and, in multiline searches, the beginning of a line.
$           Match the end of the string and, in multiline searches, the end of a line.
\b          Match a word boundary. That is, match the position between a \w character and a \W character or between a \w character and the beginning or end of a string. (Note, however, that [\b] matches backspace.)
\B          Match a position that is not a word boundary.
(?=p)       A positive lookahead assertion. Require that the following characters match the pattern p, but do not include those characters in the match.
(?!p)       A negative lookahead assertion. Require that the following characters do not match the pattern p.


    Flags

    The last element of regular-expression syntax is flags. There are three of them, listed in the table below:

Character   Meaning
i           Perform case-insensitive matching.
g           Perform a global match; that is, find all matches rather than stopping after the first match.
m           Multiline mode. ^ matches beginning of line or beginning of string, and $ matches end of line or end of string.
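
A brief sketch of how each flag changes the result of a match. The sample strings and variable names are illustrative.

```javascript
const sample = "JAVA java";

const caseSensitive = sample.match(/java/)[0];    // "java": only the lowercase word
const caseInsensitive = sample.match(/java/i)[0]; // "JAVA": first match, ignoring case
const allMatches = sample.match(/java/gi);        // [ "JAVA", "java" ]

// Without m, ^ and $ match only at the ends of the whole string.
const lines = "one\ntwo";
const singleLine = /^two$/.test(lines);  // false
const multiLine = /^two$/m.test(lines);  // true

console.log(caseSensitive, caseInsensitive, allMatches, singleLine, multiLine);
```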




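String Methods for Pattern Matching

    JavaScript provides four string methods that use regular expressions:
    String.search();
    String.replace();
    String.match();
    String.split();
    search() takes a regular expression as its argument. If the argument is not a regular expression, it is first passed to the RegExp() constructor and converted to one.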

        "JavaScript".search(/script/i);

   
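    search() ignores the g flag and never searches globally. It returns the character position of the start of the first match, or -1 if there is no match. The example above returns 4.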
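
The behavior of search() can be sketched as follows, including the conversion of a string argument into a regular expression. The string "a.b" and the variable names are illustrative.

```javascript
const found = "JavaScript".search(/script/i);      // 4
const withGlobal = "JavaScript".search(/script/gi); // 4: the g flag is ignored
const missing = "JavaScript".search(/script/);      // -1: case-sensitive, no match

// A string argument is converted with RegExp(), so "." means "any
// character" and matches at position 0, unlike indexOf(".").
const dotAsPattern = "a.b".search(".");  // 0
const dotAsText = "a.b".indexOf(".");    // 1

console.log(found, withGlobal, missing, dotAsPattern, dotAsText);
```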

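    replace() performs a search-and-replace operation. Its first argument is a regular expression and its second is the replacement string. With the g flag, every match is replaced; without it, only the first match is replaced. In the replacement string, $1 stands for the text matched by the first parenthesized subexpression.
    replace() is very useful. The example below replaces the double quotes around each piece of quoted text with two single quotes on each side: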

        var quote = /"([^"]*)"/g;
        text.replace(quote, "''$1''");
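
Here is a runnable version of the same replacement, assuming a sample sentence for the text variable, which the snippet above leaves undefined. Note that replace() returns a new string and leaves the original unchanged.

```javascript
const quote = /"([^"]*)"/g;
const text = 'He said "hello" and then "goodbye".'; // sample input (assumed)

// $1 inserts the text captured between the double quotes.
const replaced = text.replace(quote, "''$1''");
console.log(replaced); // He said ''hello'' and then ''goodbye''.
console.log(text);     // the original string is unchanged
```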


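    match() is the most general of these methods.
       "1 plus 2 equals 3".match(/\d+/g) // returns ["1", "2", "3"]
    If the regular expression does not have the g flag, match() does not search globally. It finds only the first match and returns an array, or null if there is no match. The first element, array[0], holds the matched string. The next element, array[1], holds the substring that matched the first parenthesized expression, and so on for the remaining elements.
    To draw a parallel with replace(), array[n] holds the contents of $n.
    For example: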

            var url = /(\w+):\/\/([\w.]+)\/(\S*)/;
            var text = "Visit my blog at http://www.example.com/~david";
            var result = text.match(url);
            if (result != null)
            {
                var fullurl = result[0]; // Contains "http://www.example.com/~david"
                var protocol = result[1]; // Contains "http"
                var host = result[2]; // Contains "www.example.com"
                var path = result[3]; // Contains "~david"
            }

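    If the regular expression has the g flag, match() searches globally and returns an array in which each element holds one substring that matched the regular expression.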
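
The two modes of match() can be compared side by side with the same pattern. In this sketch the second URL and the variable names are made up for illustration.

```javascript
const input = "Visit http://www.example.com/~david and https://example.org/docs";

// Without g: one match, followed by its parenthesized substrings.
const single = input.match(/(\w+):\/\/([\w.]+)\/(\S*)/);
console.log(single[0]); // "http://www.example.com/~david"
console.log(single[1], single[2], single[3]); // "http" "www.example.com" "~david"

// With g: every complete match, but no parenthesized substrings.
const every = input.match(/(\w+):\/\/([\w.]+)\/(\S*)/g);
console.log(every); // [ "http://www.example.com/~david", "https://example.org/docs" ]
```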

    The argument to split() is usually a delimiter string that marks where to break the string apart. For example:

       "123,456,789".split(","); // Returns ["123","456","789"]
   
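    The argument can also be a regular expression, which makes the method much more powerful. For example, you can use a regular expression to allow optional whitespace on either side of the delimiter: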

       "1, 2, 3, 4, 5".split(/\s*,\s*/); // Returns ["1","2","3","4","5"]


RegExp Methods for Pattern Matching


   
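    A RegExp object can also be created with the RegExp() constructor, which is a good way to build regular expressions dynamically. It takes one or two string arguments: the first is the body of the regular expression, and the optional second is the flags, such as g, i, and m.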
   
            // Find all five-digit numbers in a string. Note the double \\ in this case.
            var zipcode = new RegExp("\\d{5}", "g");
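
The need for the doubled backslash can be seen by comparing the two spellings. This is a minimal sketch; the sample string and the name singleBackslash are illustrative.

```javascript
// Inside a string literal, "\\d" is a backslash followed by "d",
// which is what the regular expression needs.
const zipcode = new RegExp("\\d{5}", "g");
const zips = "02139 and 94105".match(zipcode);
console.log(zips); // [ "02139", "94105" ]

// With a single backslash the string is just "d{5}", so the pattern
// matches five letter d's instead of five digits.
const singleBackslash = new RegExp("\d{5}");
console.log(singleBackslash.test("12345")); // false
console.log(singleBackslash.test("ddddd")); // true
```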

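    RegExp objects have two methods for testing whether a string matches a regular-expression pattern. The first is exec(), which is similar to match(). Unlike match(), exec() returns the same kind of array whether or not the g flag is set, or null if no match is found. The first element, array[0], holds the complete match, and the following elements hold, in order, the substrings that matched the parenthesized subexpressions. The array's index property holds the position at which the match begins. When the pattern has the g flag, exec() sets the lastIndex property of the RegExp object to the position immediately after the matched substring.
    When exec() is called again on the same regular expression, it starts searching at lastIndex rather than at position 0. If exec() finds no match, it resets lastIndex to 0. This behavior makes it easy to loop through a whole string and find every matching substring.
    If you stop before the last match has been found and want to use the same RegExp object on another string, set lastIndex to 0 yourself first.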

    var pattern = /Java/g;
    var text = "JavaScript is more fun than Java!";
    var result;
    while((result = pattern.exec(text)) != null)
    {
        alert("Matched '" + result[0] + "'" + " at position " + result.index + "; next search begins at " + pattern.lastIndex);
    }
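
The same loop can be written to collect its results instead of calling alert(), which is convenient outside a browser. This sketch reuses the pattern and text from above under different names; the found array is an illustrative addition.

```javascript
const javaPattern = /Java/g;
const sentence = "JavaScript is more fun than Java!";
const found = [];

let hit;
while ((hit = javaPattern.exec(sentence)) !== null) {
  // lastIndex now points just past the match that was found.
  found.push({ text: hit[0], index: hit.index, next: javaPattern.lastIndex });
}

console.log(found);
// [ { text: "Java", index: 0, next: 4 }, { text: "Java", index: 28, next: 32 } ]
console.log(javaPattern.lastIndex); // 0: reset after the failed final search
```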

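    The other RegExp method for pattern matching is test(), which is much simpler than exec(). It takes a string as its only argument and returns true if the string contains a match and false otherwise. When the RegExp has the g flag, test() updates lastIndex in the same way exec() does.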
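
Because test() shares lastIndex with exec(), calling it repeatedly on a RegExp with the g flag gives alternating results. A minimal sketch, with illustrative names:

```javascript
const globalJava = /Java/g;

const first = globalJava.test("Java");   // true; lastIndex is now 4
const second = globalJava.test("Java");  // false: search began at 4; lastIndex reset to 0
const third = globalJava.test("Java");   // true again
console.log(first, second, third);       // true false true

// A failed test() returns the boolean false, not null.
const noDigit = /\d/.test("abc");
console.log(noDigit === false);          // true
```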


Example:

    Pass the textbox to the checkDate method as the value of Object, and check whether the date entered in the textbox has the form mm/dd/yyyy:
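
The original code for this example is not included here. The following is only a sketch of what such a check might look like, assuming the textbox is any object with a string value property. The parameter name, the pattern, and the range checks are assumptions, not the article's own implementation.

```javascript
// Hypothetical reconstruction: returns true when textbox.value has the
// form mm/dd/yyyy. It checks the format and basic ranges only, not
// whether the day exists in that month (02/31/2024 passes).
function checkDate(textbox) {
  const datePattern = /^(0[1-9]|1[0-2])\/(0[1-9]|[12]\d|3[01])\/(\d{4})$/;
  return datePattern.test(textbox.value);
}

console.log(checkDate({ value: "02/29/2024" })); // true
console.log(checkDate({ value: "2/9/2024" }));   // false: month and day need two digits
console.log(checkDate({ value: "13/01/2024" })); // false: there is no month 13
```

In a page, the function would typically be called with the input element itself, for example from a form's submit handler.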