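Note: I just discovered that I had mishandled the * metacharacter. I have fixed the code and re-uploaded it to CSDN, and the sample code in this article has been updated as well.
The earlier, incorrect version treated * as "the preceding character occurs at least once"; the corrected version treats it as "zero or more times".
If you found this page through a CSDN download, be aware that the code you downloaded may not be the final version. The final version can be downloaded from: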
http://download.csdn.net/detail/sun2043430/5333836
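After reading "A Regular Expression Matcher", the first chapter of Beautiful Code, I had been itching to write one myself, and a free weekend gave me the chance to try.
The first step was choosing a set of regular expression metacharacters. I settled on the following: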
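// c  matches any letter (upper or lower case)
// .  matches any single character
// *  repeats the preceding character zero or more times, as many as possible (greedy)
// ^  matches the start of the string
// $  matches the end of the string
// \  escape character (\c is the literal letter 'c', \\ is a literal '\')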
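The Beautiful Code version has no escape character, but a complete regular expression syntax cannot do without one (otherwise some characters could never be expressed), so my implementation adds it.
In addition, the book's code handles the * metacharacter with a simple lazy (shortest) match, whereas standard regular expressions treat *, + and ? as greedy by default. Greedy matching is slightly harder to implement than lazy matching.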
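To make the lazy/greedy distinction concrete, here is a minimal sketch that is separate from the code in this article. The names `matchesChar`, `matchHere` and `demoGreedyVsLazy` are made up for illustration, and only literals, '.' and '*' are supported. The only difference between the two modes is the order in which the run lengths for `x*` are tried.

```cpp
#include <cassert>

// True if pattern character pat accepts text character ch.
// '.' accepts anything except the terminating '\0'.
static bool matchesChar(char pat, char ch) {
    return ch != '\0' && (pat == '.' || pat == ch);
}

// Tries to match re at the very start of text.
// Returns the number of characters consumed, or -1 if there is no match.
static int matchHere(const char* re, const char* text, bool greedy) {
    if (re[0] == '\0') return 0;
    if (re[1] == '*') {
        // Count how many characters the starred element could consume.
        int maxRun = 0;
        while (matchesChar(re[0], text[maxRun])) ++maxRun;
        // Greedy: try the longest run first. Lazy: try the shortest first.
        int n = greedy ? maxRun : 0;
        const int step = greedy ? -1 : 1;
        for (; n >= 0 && n <= maxRun; n += step) {
            int rest = matchHere(re + 2, text + n, greedy);
            if (rest >= 0) return n + rest;
        }
        return -1;
    }
    if (matchesChar(re[0], text[0])) {
        int rest = matchHere(re + 1, text + 1, greedy);
        return rest >= 0 ? 1 + rest : -1;
    }
    return -1;
}

static void demoGreedyVsLazy() {
    assert(matchHere("g*o*", "ggggo abc", true) == 5);   // greedy takes "ggggo"
    assert(matchHere("g*o*", "ggggo abc", false) == 0);  // lazy settles for ""
    assert(matchHere("g*o", "ggggo abc", true) == 5);    // same result when the
    assert(matchHere("g*o", "ggggo abc", false) == 5);   // tail forces the length
}
```

The two modes only disagree when more than one run length lets the rest of the pattern succeed. The greedy version needs the extra step of finding the longest run before it can start backing off.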
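The book also notes that once escape characters or bracketed character classes are involved, it is better to represent each element of the regular expression with a struct.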
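As a rough illustration of that idea, the sketch below splits a pattern into elements that record whether each character was escaped. `ReElement`, `parseElements` and `demoParseElements` are hypothetical names chosen for this example, and the sketch assumes the pattern has already been validated.

```cpp
#include <cassert>
#include <vector>

// Illustrative stand-in for one pattern element (names are assumptions).
struct ReElement {
    char ch;        // the character itself
    bool isEscape;  // true if it was written as \x in the pattern
};

// Splits an already validated pattern into elements.
// A backslash and the character after it become a single escaped element.
static std::vector<ReElement> parseElements(const char* re) {
    std::vector<ReElement> out;
    while (*re != '\0') {
        if (re[0] == '\\' && re[1] != '\0') {
            out.push_back({re[1], true});
            re += 2;
        } else {
            out.push_back({re[0], false});
            re += 1;
        }
    }
    return out;
}

static void demoParseElements() {
    // The pattern \.c* means: a literal dot, then zero or more letters.
    std::vector<ReElement> e = parseElements("\\.c*");
    assert(e.size() == 3);
    assert(e[0].ch == '.' && e[0].isEscape);    // literal '.', not "any character"
    assert(e[1].ch == 'c' && !e[1].isEscape);   // metacharacter: any letter
    assert(e[2].ch == '*' && !e[2].isEscape);   // metacharacter: repetition
}
```

With the escape flag stored next to the character, the matcher can tell a literal `.` from the "any character" metacharacter without looking at the pattern text again.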
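Before matching, my code first validates the regular expression. The rules are:
- The metacharacter * cannot appear at the start; ^ may only appear at the start of the expression; $ may only appear at the end.
- The escape character \ may only be followed by a metacharacter.
- The metacharacter ^ cannot be followed by the metacharacter *.
- The metacharacter * cannot be followed by another *.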
"^^now err" is error regular expression!
"^*abcd" is error regular expression!
"ababcd\" is error regular expression!
"^^*bcd" is error regular expression!
"a^*bcd" is error regular expression!
"^ab$cd" is error regular expression!
"a**abcd" is error regular expression!
"*abcd" is error regular expression!
"\a*bcd" is error regular expression!
"now ok" is correct regular expression!
"\\a*bcd" is correct regular expression!
"\**bcd" is correct regular expression!
".*abcd" is correct regular expression!
"^a*bcd" is correct regular expression!
"^abcd$" is correct regular expression!
".abc d" is correct regular expression!
".*abcd" is correct regular expression!
"\c*abcd" is correct regular expression!
"\*abcd" is correct regular expression!
"c*abcd" is correct regular expression!
"\.c*abcd" is correct regular expression!
"abc\.c*abcd" is correct regular expression!
"abc d" is correct regular expression!
"\^*abcd" is correct regular expression!
The code that checks whether a regular expression is valid:
bool CheckRegExpr(const char *pReg)
{
    const char *pBegin = pReg;
    if ('*' == *pBegin)
        return false;
    while (*pReg)
    {
        if ( ('^' == *pReg && pReg != pBegin) ||
             ('^' == *pReg && pReg[1] == '*') ||
             ('$' == *pReg && pReg[1] != '\0') )
            return false;
        if ('*' == *pReg && '*' == pReg[1])
            return false;
        if ('\\' == *pReg)
        {
            if (!IsRegMetacharacter(pReg[1]))
                return false;
            else
                pReg++;
        }
        pReg++;
    }
    return true;
}
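`CheckRegExpr` calls `IsRegMetacharacter`, and `MatchAt` calls `IsAlpha`, but neither helper is listed in this article. The sketch below is my guess at what they could look like, assuming the metacharacter set is exactly the six characters listed at the top (c . * ^ $ \). The real implementations are in the download and may differ.

```cpp
#include <cassert>
#include <cctype>
#include <cstring>

// Assumed helper: true if ch is one of the six metacharacters c . * ^ $ \ .
static bool IsRegMetacharacter(char ch) {
    // strchr also finds the terminating '\0', so reject it explicitly.
    // This is what makes a trailing backslash such as "ababcd\" invalid.
    return ch != '\0' && std::strchr("c.*^$\\", ch) != nullptr;
}

// Assumed helper: true for letters of either case (in the current C locale).
static bool IsAlpha(char ch) {
    // Cast first: passing a negative char to isalpha is undefined behaviour.
    return std::isalpha(static_cast<unsigned char>(ch)) != 0;
}

static void demoHelpers() {
    assert(IsRegMetacharacter('c'));    // so "\c*abcd" is a valid pattern
    assert(!IsRegMetacharacter('a'));   // so "\a*bcd" is rejected
    assert(!IsRegMetacharacter('\0'));  // so "ababcd\" is rejected
    assert(IsAlpha('L') && IsAlpha('t') && !IsAlpha('2'));
}
```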
The core matching code:
const char* RegExprFind(const char *pText, const char *pReg)
{
    const char *pCur = pText;
    if (!CheckRegExpr(pReg))
        return (char*)-1;
    do
    {
        g_pBeg = pCur;
        MATCH_STATE eResult = Match(pCur, pReg);
        if (MATCH_OK == eResult)
            return pCur;
        else if (MATCH_ERROR == eResult)
            return (char*)-1;
        else if (MATCH_FAIL == eResult)
            return NULL;
    }
    while (*pCur++);
    return NULL;
}
MATCH_STATE Match(const char *pCur, const char *pReg)
{
    g_pEnd = pCur;
    if ('\0' == *pReg)
        return (g_pEnd != g_pBeg) ? MATCH_OK : MATCH_NEXT_POSITION;
    if ('$' == *pReg)
        return ('\0' == *pCur && g_pEnd != g_pBeg) ? MATCH_OK : MATCH_FAIL;
    if ('^' == *pReg)
    {
        // return Match(pCur, pReg+1);
        if (MATCH_OK == Match(pCur, pReg+1))
            return MATCH_OK;
        else
            return MATCH_FAIL;
    }
    st_RE_Element elementCur;   // more complex cases need the help of a struct
    st_RE_Element elementNext;
    int nElementLen1 = 0;
    int nElementLen2 = 0;
    nElementLen1 = GetReElement(&elementCur, pReg);
    nElementLen2 = GetReElement(&elementNext, pReg+nElementLen1);
    if (!elementNext.isEscape && elementNext.ch == '*') // greedy matching takes more work
    {
        const char *pStart = pCur;
        // first consume as many characters as possible
        while (*pCur && MatchAt(elementCur, *pCur))
            pCur++;
        // then back off one character at a time until the rest of the pattern matches
        for (;;)
        {
            if (MATCH_OK == Match(pCur, pReg+nElementLen1+nElementLen2))
                return MATCH_OK;
            if (pCur == pStart) // stop here so the pointer never moves before pStart
                break;
            pCur--;
        }
    }
    else
    {
        // check *pCur so that '.' cannot consume the terminating '\0'
        if (*pCur && MatchAt(elementCur, *pCur))
            return Match(pCur+1, pReg+nElementLen1);
    }
    return MATCH_NEXT_POSITION;
}
int GetReElement(pst_RE_Element pElement, const char *pReg)
{
    if (*pReg == '\\')
    {
        pElement->isEscape = true;
        pElement->ch = pReg[1];
        return 2;
    }
    else
    {
        pElement->isEscape = false;
        pElement->ch = *pReg;
        return 1;
    }
}
bool MatchAt(st_RE_Element regChar, char ch)
{
    if (regChar.isEscape) // \c \\ etc.
    {
        return ch == regChar.ch;
    }
    else // a . c
    {
        if ('.' == regChar.ch || ('c' == regChar.ch && IsAlpha(ch)) )
        {
            return true;
        }
        return ch == regChar.ch;
    }
}
The struct and enum constants used to represent the regular expression:
typedef struct _st_RE_Element
{
    char ch;
    bool isEscape;
}st_RE_Element, *pst_RE_Element;
enum MATCH_STATE
{
    MATCH_OK,
    MATCH_NEXT_POSITION,
    MATCH_FAIL,
    MATCH_ERROR, // regular expression syntax error, like "a**b"
};
Demonstration of the matching results:
        0         1         2         3         4         5
        012345678901234567890123456789012345678901234567890
Text is Let's ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 00 RegExpr = "Let" Let
Find at 00 RegExpr = "^Let.*$" Let's ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 00 RegExpr = "c*" Let
Find at 06 RegExpr = "g*o*" ggggo
Find at 06 RegExpr = "g*a*" gggg
Find at 30 RegExpr = "k*2*3*k*" 22333333
Find at 30 RegExpr = "k*2*k*3*" 22333333
Find at 22 RegExpr = "\$" $
Find at 20 RegExpr = "\." .
Find at 24 RegExpr = "\\*" \\\
Find at 20 RegExpr = "\..*23" .^$*\\\fg 2233333321222223
Find at 00 RegExpr = "c*" Let
Find at 14 RegExpr = "\c*" c
Find at 49 RegExpr = "2*$" 22222
Find at 32 RegExpr = "3*" 333333
Find at 30 RegExpr = "2*223" 223
Find at 30 RegExpr = "2*23" 223
Find at 06 RegExpr = "g*o.*3*21" ggggo abcdeccc.^$*\\\fg 2233333321
Find at 06 RegExpr = "g*o.*3*2" ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 06 RegExpr = "g*o.*3*" ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 06 RegExpr = "g*o.*3" ggggo abcdeccc.^$*\\\fg 2233333321222223333
Not Found! RegExpr = "Below are Not Found!------------------------------------------"
Not Found! RegExpr = "k*"
Not Found! RegExpr = "g*o.2*3"
Find at 54 RegExpr = "3*$"
Not Found! RegExpr = "^3*"
Not Found! RegExpr = "Below are Syntax Error!---------------------------------------"
Syntax error! RegExpr = "^*abc"
Syntax error! RegExpr = ".*a$bc"
The complete code can be downloaded from:
http://download.csdn.net/detail/sun2043430/5333836
Note: the code is written as plain functions; global variables are used to make it easier to print and display the results.