A Complete Miniature Regular Expression Engine [Source Code Implementation]

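Note: I just found that the * metacharacter was handled incorrectly. The code has been fixed and re-uploaded to CSDN, and the sample code in this article has been updated as well.

      The earlier, buggy version treated * as "the preceding character occurs at least once"; after the fix it means "zero or more times".

      If you reached this page through a CSDN download, please be aware that what you downloaded may not be the final version of the code. The final version can be downloaded from: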

      http://download.csdn.net/detail/sun2043430/5333836


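After reading the first chapter of Beautiful Code, on regular expressions, I had been itching to write a matcher of my own, and I finally tried it out over a free weekend.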


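The first step is choosing the set of metacharacters. I picked the following:

// c matches any letter (upper or lower case)
// . matches any character
// * repeats the preceding character zero or more times, as many times as possible (greedy)
// ^ matches the beginning of the string
// $ matches the end of the string
// \ escape character (\c means the literal letter 'c', \\ means a literal '\')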

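The matcher in Beautiful Code has no escape character, but a complete regular expression syntax cannot do without one (otherwise some characters could not be expressed), so I added escaping to my implementation.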

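In addition, the code in Beautiful Code uses simple lazy matching for the * metacharacter, whereas in standard regular expressions *, + and ? are all greedy. Greedy matching is slightly harder to implement than lazy matching.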
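
To make the difference concrete, here is a minimal sketch that matches "c*" followed by a literal suffix in both styles. StarLazy and StarGreedy are hypothetical helpers written for this illustration only (they are not functions from my source code), and the suffix is treated as plain text rather than as a pattern.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helpers for illustration; not part of the article's source.
// Both match "c*" followed by the literal string `rest` at the start of
// `text` and return the number of characters consumed, or -1 on failure.

// True if `text` begins with the literal string `rest`.
static bool StartsWith(const char *text, const char *rest)
{
    return std::strncmp(text, rest, std::strlen(rest)) == 0;
}

// Lazy: try zero repetitions first, then one, then two, and so on.
int StarLazy(char c, const char *rest, const char *text)
{
    assert(rest && text);
    const char *p = text;
    for (;;)
    {
        if (StartsWith(p, rest))
            return (int)(p - text) + (int)std::strlen(rest);
        if (*p == '\0' || *p != c)
            return -1;          // cannot extend the repetition any further
        p++;
    }
}

// Greedy: consume as many repetitions as possible, then back off one
// character at a time until the rest of the pattern matches.
int StarGreedy(char c, const char *rest, const char *text)
{
    assert(rest && text);
    int n = 0;
    while (text[n] != '\0' && text[n] == c)
        n++;
    for (; n >= 0; n--)
    {
        if (StartsWith(text + n, rest))
            return n + (int)std::strlen(rest);
    }
    return -1;
}
```

For the pattern "a*a" on the text "aaab", the lazy version stops after one character while the greedy version consumes three. The extra work in the greedy version is the backtracking loop, which is why it is a little harder to write.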


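The book also points out that once escape characters or bracketed character classes are involved, it is best to represent each element of the regular expression with a struct.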


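Before doing any matching, my code first validates the regular expression. The rules are:

  1. The metacharacter * cannot appear at the start; the metacharacter ^ may only appear at the start of the expression; the metacharacter $ may only appear at the end.
  2. The escape character \ may only be followed by a metacharacter.
  3. The metacharacter ^ cannot be followed by the metacharacter *.
  4. The metacharacter * cannot be followed by the metacharacter *.

Some test samples for regular expression validity: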
"^^now err"     is error regular expression!
"^*abcd"        is error regular expression!
"ababcd\"       is error regular expression!
"^^*bcd"        is error regular expression!
"a^*bcd"        is error regular expression!
"^ab$cd"        is error regular expression!
"a**abcd"       is error regular expression!
"*abcd"         is error regular expression!
"\a*bcd"        is error regular expression!
"now ok"        is correct regular expression!
"\\a*bcd"       is correct regular expression!
"\**bcd"        is correct regular expression!
".*abcd"        is correct regular expression!
"^a*bcd"        is correct regular expression!
"^abcd$"        is correct regular expression!
".abc d"        is correct regular expression!
".*abcd"        is correct regular expression!
"\c*abcd"       is correct regular expression!
"\*abcd"        is correct regular expression!
"c*abcd"        is correct regular expression!
"\.c*abcd"      is correct regular expression!
"abc\.c*abcd"   is correct regular expression!
"abc  d"        is correct regular expression!
"\^*abcd"       is correct regular expression!

The code that checks whether a regular expression is valid:

bool CheckRegExpr(const char *pReg)
{
    const char *pBegin = pReg;
    if ('*' == *pBegin)
        return false;

    while (*pReg)
    {
        if ( ('^' == *pReg && pReg != pBegin) || 
            ('^' == *pReg && pReg[1] == '*') || 
            ('$' == *pReg && pReg[1] != '\0') )
            return false;

        if ('*' == *pReg && '*' == pReg[1])
            return false;

        if ('\\' == *pReg)
        {
            if (!IsRegMetacharacter(pReg[1]))
                return false;
            else
                pReg++;
        }

        pReg++;
    }
    return true;
}
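
CheckRegExpr relies on IsRegMetacharacter, which is not listed in this article. The sketch below is an assumption about what it does, based on the six metacharacters chosen above (c . * ^ $ \); the version in the downloadable source may differ. CheckRegExpr is repeated unchanged, apart from a null check, so that the block compiles on its own.

```cpp
#include <cassert>

// Assumed implementation: true for the six metacharacters c . * ^ $ and \.
// A switch is used instead of strchr so that '\0' is never reported as a
// metacharacter (strchr would find the terminator of the lookup string).
bool IsRegMetacharacter(char ch)
{
    switch (ch)
    {
    case 'c': case '.': case '*': case '^': case '$': case '\\':
        return true;
    default:
        return false;
    }
}

// Same logic as the listing above.
bool CheckRegExpr(const char *pReg)
{
    assert(pReg);
    const char *pBegin = pReg;
    if ('*' == *pBegin)                       // rule 1: * cannot come first
        return false;

    while (*pReg)
    {
        if ( ('^' == *pReg && pReg != pBegin) ||   // rule 1: ^ only at start
            ('^' == *pReg && pReg[1] == '*') ||    // rule 3: no ^*
            ('$' == *pReg && pReg[1] != '\0') )    // rule 1: $ only at end
            return false;

        if ('*' == *pReg && '*' == pReg[1])        // rule 4: no **
            return false;

        if ('\\' == *pReg)                         // rule 2: \ + metacharacter
        {
            if (!IsRegMetacharacter(pReg[1]))
                return false;
            else
                pReg++;                            // skip the escaped character
        }

        pReg++;
    }
    return true;
}
```

Under this assumption, a trailing backslash such as "ababcd\" is rejected because the character after it is the string terminator, and "\**bcd" is accepted because the first * is escaped and is therefore an ordinary character that the second * may repeat.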

The core matching code:

const char* RegExprFind(const char *pText, const char *pReg)
{
    const char *pCur = pText;
    if (!CheckRegExpr(pReg))
        return (char*)-1;

    do 
    {
        g_pBeg = pCur;
        MATCH_STATE eResult = Match(pCur, pReg);
        if (MATCH_OK == eResult)
            return pCur;
        else if (MATCH_ERROR == eResult)
            return (char*)-1;
        else if (MATCH_FAIL == eResult)
            return NULL;
    }
    while (*pCur++);
    return NULL;
}

MATCH_STATE Match(const char *pCur, const char *pReg)
{
    g_pEnd = pCur;
    if ('\0' == *pReg)
        return (g_pEnd != g_pBeg) ? MATCH_OK : MATCH_NEXT_POSITION;

    if ('$' == *pReg)
        return ('\0' == *pCur && g_pEnd != g_pBeg) ? MATCH_OK : MATCH_FAIL;

    if ('^' == *pReg)
    {
//        return Match(pCur, pReg+1);
        if (MATCH_OK == Match(pCur, pReg+1))
            return MATCH_OK;
        else
            return MATCH_FAIL;
    }

    st_RE_Element elementCur; // more complex cases need the help of a struct
    st_RE_Element elementNext;
    int nElementLen1 = 0;
    int nElementLen2 = 0;
    nElementLen1 = GetReElement(&elementCur, pReg);
    nElementLen2 = GetReElement(&elementNext, pReg+nElementLen1);

    if (!elementNext.isEscape && elementNext.ch == '*') // greedy matching is the tricky part
    {
        const char *pStart = pCur;
        while(*pCur && MatchAt(elementCur, *pCur))
            pCur++;
        while (pCur >= pStart)
        {
            if (MATCH_OK == Match(pCur, pReg+nElementLen1+nElementLen2))
                return MATCH_OK;
            pCur--;
        }
    }
    else
    {
        if (*pCur && MatchAt(elementCur, *pCur)) // do not let '.' match the terminating '\0'
            return Match(pCur+1, pReg+nElementLen1);
    }

    return MATCH_NEXT_POSITION;
}

int GetReElement(pst_RE_Element pElement, const char *pReg)
{
    if (*pReg == '\\')
    {
        pElement->isEscape = true;
        pElement->ch = pReg[1];
        return 2;
    }
    else
    {
        pElement->isEscape = false;
        pElement->ch = *pReg;
        return 1;
    }
}

bool MatchAt(st_RE_Element regChar, char ch)
{
    if (regChar.isEscape) // \c \\ etc.
    {
        return ch == regChar.ch;
    }
    else // a . c
    {
        if ('.' == regChar.ch || ('c' == regChar.ch && IsAlpha(ch)) )
        {
            return true;
        }
        return ch == regChar.ch;
    }
}


The struct and enum constants used to represent the regular expression:

typedef struct _st_RE_Element
{
    char ch;
    bool isEscape;
}st_RE_Element, *pst_RE_Element;

enum MATCH_STATE
{
    MATCH_OK,
    MATCH_NEXT_POSITION,
    MATCH_FAIL,
    MATCH_ERROR, //regular expression syntax error, like "a**b"
};
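
With the struct in place, element parsing and single-character matching can be tried in isolation. The block below is a small sketch: GetReElement and MatchAt follow the listings above, IsAlpha is assumed to behave like std::isalpha (its real definition is not shown in this article), and FirstElementMatches is a hypothetical wrapper added only for this example.

```cpp
#include <cassert>
#include <cctype>

typedef struct _st_RE_Element
{
    char ch;        // the pattern character itself
    bool isEscape;  // true if it was written with a leading backslash
} st_RE_Element, *pst_RE_Element;

// Assumption: modelled on std::isalpha; the article's IsAlpha is not shown.
static bool IsAlpha(char ch)
{
    return std::isalpha(static_cast<unsigned char>(ch)) != 0;
}

// Reads one element from the pattern and returns how many pattern
// characters it occupies (2 for an escape sequence, otherwise 1).
int GetReElement(pst_RE_Element pElement, const char *pReg)
{
    if (*pReg == '\\')
    {
        pElement->isEscape = true;
        pElement->ch = pReg[1];
        return 2;
    }
    pElement->isEscape = false;
    pElement->ch = *pReg;
    return 1;
}

// Tests one text character against one pattern element.
bool MatchAt(st_RE_Element regChar, char ch)
{
    if (regChar.isEscape)               // \c \. \\ etc. match literally
        return ch == regChar.ch;
    if ('.' == regChar.ch || ('c' == regChar.ch && IsAlpha(ch)))
        return true;                    // wildcard or "any letter"
    return ch == regChar.ch;            // ordinary character
}

// Hypothetical wrapper: does the first element of pReg match ch?
bool FirstElementMatches(const char *pReg, char ch)
{
    assert(pReg && *pReg);
    st_RE_Element element;
    GetReElement(&element, pReg);
    return MatchAt(element, ch);
}
```

The isEscape flag is what separates the metacharacter c (any letter) from the escaped \c (only the letter 'c'), which a bare char could not express.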

Demonstration of the matching results:

                                        0         1         2         3         4         5
                                        012345678901234567890123456789012345678901234567890
Text is                                 Let's ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 00      RegExpr  =  "Let"       Let
Find at 00      RegExpr  =  "^Let.*$"   Let's ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 00      RegExpr  =  "c*"        Let
Find at 06      RegExpr  =  "g*o*"      ggggo
Find at 06      RegExpr  =  "g*a*"      gggg
Find at 30      RegExpr  =  "k*2*3*k*"  22333333
Find at 30      RegExpr  =  "k*2*k*3*"  22333333
Find at 22      RegExpr  =  "\$"        $
Find at 20      RegExpr  =  "\."        .
Find at 24      RegExpr  =  "\\*"       \\\
Find at 20      RegExpr  =  "\..*23"    .^$*\\\fg 2233333321222223
Find at 00      RegExpr  =  "c*"        Let
Find at 14      RegExpr  =  "\c*"       c
Find at 49      RegExpr  =  "2*$"       22222
Find at 32      RegExpr  =  "3*"        333333
Find at 30      RegExpr  =  "2*223"     223
Find at 30      RegExpr  =  "2*23"      223
Find at 06      RegExpr  =  "g*o.*3*21" ggggo abcdeccc.^$*\\\fg 2233333321
Find at 06      RegExpr  =  "g*o.*3*2"  ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 06      RegExpr  =  "g*o.*3*"   ggggo abcdeccc.^$*\\\fg 223333332122222333322222
Find at 06      RegExpr  =  "g*o.*3"    ggggo abcdeccc.^$*\\\fg 2233333321222223333
Not Found!      RegExpr  =  "Below are Not Found!------------------------------------------"
Not Found!      RegExpr  =  "k*"
Not Found!      RegExpr  =  "g*o.2*3"
Find at 54      RegExpr  =  "3*$"
Not Found!      RegExpr  =  "^3*"
Not Found!      RegExpr  =  "Below are Syntax Error!---------------------------------------"
Syntax error!   RegExpr  =  "^*abc"
Syntax error!   RegExpr  =  ".*a$bc"

The complete code can be downloaded from:

http://download.csdn.net/detail/sun2043430/5333836

Note: the code is written as plain functions; global variables are used to make it easier to print and display the results.

