[Good for Enterprise] How we monitor GFE - updated 2014-08-09

This post will also be published in English: http://www.cnblogs.com/LarryAtCNBlog/p/3900870.html

Posts on GFE monitoring, in order of publication:

http://www.cnblogs.com/LarryAtCNBlog/p/3890033.html  Eng: http://www.cnblogs.com/LarryAtCNBlog/p/3890743.html

Shortly after I published the post above, the issue came back. It had no user-level impact, though; only our internal team knew about it.

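The issue involves events 5662/5669/5675/5733, which I mentioned in the first GFE monitoring post. I said then that these events should never appear, and that if they do, there really is a NOC communication problem. That is still true: this time the issue did generate one or two of these events. But the connection between GFE and the NOC recovered immediately, so there was no business impact, and only the internal team received an alert. The same problem occurred a few months ago: occasional communication failures with the NOC delayed email by a few minutes. Cases like this should be filtered out, otherwise whoever is on call gets the alert in the middle of the night and has to get up to check what happened.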

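I opened a case with Good for this issue too, but they gave no specific cause, because too many network factors are involved. After leaving our company proxy, packets pass through various ISPs before reaching Good, and a problem at any hop makes communication with the NOC fail. Because the outage is short and the alert is delayed, it is practically impossible to capture a trace or ping log, so I am not surprised that Good could not offer a solution.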

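So in the end I changed direction and asked Good for their web service addresses and for the method of decoding Good logs. I wanted to write a script that rules out this kind of false alert from two angles: probing the web services and analyzing the Good logs.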

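PS: Some of Good's key logs are encoded. Good can provide a way to decode them, but that requires consent from the person at our company who signed the agreement with Good. In my case that is a manager in the UK; I figured there was no chance of getting approval, so I dropped the Good log analysis.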

That left the Good web services, so I obtained several Good web service URLs.

https://xml28.good.com/
https://xml29.good.com/
https://xml30.good.com/

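The basic idea is to modify the existing script so that, after it catches any of the NOC communication failure events, it calls a script that probes the web services to check whether they were really unreachable at that time.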

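Hence the Test-NOC.ps1 script below. It returns true if every URL is reachable and false if any one is not, because in my experience GFE uses all of the web services rather than just one; Good probably has some load-balancing mechanism behind them.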

### This script is invoked by EventID.Monitoring.ps1
### It tests connectivity to the Good NOC

$GoodNOC_Url = @(
    'https://xml28.good.com/',
    'https://xml29.good.com/',
    'https://xml30.good.com/'
)

$WebProxy = New-Object 'System.Net.WebProxy'
# Change below proxy to your own proxy server and port
$WebProxy.Address = 'http://ProxyServer:Port'

$WebClient = New-Object 'System.Net.WebClient'
$WebClient.Proxy = $WebProxy

$Result = $true
foreach($Url in $GoodNOC_Url)
{
    $LoopCount = 0
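    do
    {
        $LoopResult = $false
        $LoopCount++
        try
        {
            # DownloadString throws on network errors, so catch the error and retry instead of aborting
            if(($WebClient.DownloadString($Url)).Contains('Congratulations!  You have successfully connected to the GoodLink Service.'))
            {
                $LoopResult = $true
                break
            }
        }
        catch
        {
            Add-Log -Path $strLogFile_e -Value "NOC Testing attempt $LoopCount failed: [$Url], cause: $($_.Exception.Message)" -Type Warning
        }
    }
    while($LoopCount -lt 3)
    $Result = $Result -and $LoopResult
    # Log this URL's own result, not the accumulated one
    if($LoopResult)
    {
        Add-Log -Path $strLogFile_e -Value "NOC Testing succeeded: [$Url]" -Type Info
    }
    else
    {
        Add-Log -Path $strLogFile_e -Value "NOC Testing failed: [$Url]" -Type Warning
    }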
}

return $Result
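
The "every URL must pass, with a few retries each" logic can also be tried out without touching the network. The following is only a minimal sketch: Test-AllEndpoint, its -Probe parameter and the example.* URLs are hypothetical names made up for illustration, and the stub probes stand in for the real HTTP request.

```powershell
# Sketch only: Test-AllEndpoint and -Probe are hypothetical, not part of Test-NOC.ps1
function Test-AllEndpoint
{
    PARAM(
        [String[]]$Url,
        [ScriptBlock]$Probe,      # receives one URL, returns $true when it is reachable
        [Int]$MaxAttempts = 3
    )
    $Result = $true
    foreach($u in $Url)
    {
        $UrlResult = $false
        for($i = 1; ($i -le $MaxAttempts) -and (-not $UrlResult); $i++)
        {
            try
            {
                $UrlResult = [bool](& $Probe $u)
            }
            catch
            {
                # A thrown error counts as one failed attempt, not as a script failure
                $UrlResult = $false
            }
        }
        # One unreachable URL makes the overall result false
        $Result = $Result -and $UrlResult
    }
    return $Result
}

# Stub probes standing in for the real web request
$AlwaysUp = { param($u) $true }
$OneDown = {
    param($u)
    if($u -eq 'https://b.example/')
    {
        throw 'timeout'
    }
    $true
}

$AllUp = Test-AllEndpoint -Url 'https://a.example/', 'https://b.example/' -Probe $AlwaysUp
$SomeDown = Test-AllEndpoint -Url 'https://a.example/', 'https://b.example/' -Probe $OneDown
```

With these stubs, $AllUp ends up true and $SomeDown ends up false, which is the behaviour the monitoring script relies on: the alert is suppressed only when every endpoint answers.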

The main monitoring script then has to be updated to call the sub-script above:

# Change the working directory
Set-Location (Get-Item ($MyInvocation.MyCommand.Definition)).DirectoryName

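# Define the events to monitor and their properties
# EventClass groups similar events; when an event of a class reaches its threshold, the extra check script of that class is triggered
# ID is the event ID to monitor. An array such as @(xx,yy) means the two event IDs are counted together: if xx occurs 10 times and yy 10 times, the sum 20 is compared with the threshold
# Pattern is a .NET (C#) regular expression that selects events whose message contains specific text
# MinusPattern is also a regular expression; events matching it are subtracted from the count
# If both Pattern and MinusPattern are set and Pattern matches 100 events while MinusPattern matches 90, the difference, 10, is compared with the threshold; this excludes cases that recovered automatically
# Threshold is compared with the final count after the filters above; an alert is raised when the count reaches the threshold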
$Events = @(
    @{EventClass = 1; ID = 3563; Pattern = '\bPausing .*MAPI error'; MinusPattern = 'Unpausing'; Threshold = 100;},
    @{EventClass = 2; ID = @(1299, 1300, 1301); Pattern = $null; Threshold = 100;},
    @{EventClass = 1; ID = 3386; Pattern = 'GDMAPI_OpenMsgStore failed'; Threshold = 100;},
    @{EventClass = 3; ID = @(5662, 5669); Pattern = $null; Threshold = 1;},
    @{EventClass = 3; ID = 5675; Pattern = 'errNetConnect'; Threshold = 1;},
    @{EventClass = 3; ID = 5733; Pattern = 'errNetTimeout'; Threshold = 1;}
)

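# If Script is empty, no extra check script is run and the threshold alone decides
# If Script is set, the extra check script is run and its return value decides: true suppresses the alert, false sends it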
$EventClass = @{
    1 = @{Script = $null; Description = 'MAPI Error'};
    2 = @{Script = $null; Description = 'Good thread hung up'};
    3 = @{Script = 'Test-NOC.ps1'; Description = 'Failed to contact NOC'};
}

$Date = Get-Date
$strDate = $Date.ToString("yyyy-MM-dd")

$End_time = $Date
$Start_time = $Date.AddMinutes(-15)
$strLogFile = "${strDate}.log.txt"
$strLogFile_e = "${strDate}_Error.log.txt"

# Define the mail sending properties
$Mail_From = "$($env:COMPUTERNAME)@fil.com"
$Mail_To = 'xxxxx@xxx.xxx'
$Mail_Subject = 'Good event IDs warning'
$Mail_SMTPServer = 'smtpserver'

Set-Content -Path $strLogFile_e -Value $null 

function Add-Log
{
    PARAM(
        [String]$Path,
        [String]$Value,
        [String]$Type
    )
    $Type = $Type.ToUpper()
    Write-Host "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
    if($Path){
        Add-Content -Path $Path -Value "$((Get-Date).ToString('[HH:mm:ss] '))[$Type] $Value"
    }
}

Add-Log -Path $strLogFile_e -Value "Catch logs after : $($Start_time.ToString('HH:mm:ss'))" -Type Info
Add-Log -Path $strLogFile_e -Value "Catch logs before: $($End_time.ToString('HH:mm:ss'))" -Type Info
Add-Log -Path $strLogFile_e -Value "Working directory: $($PWD.Path)" -Type Info

$EventsCache = @(Get-EventLog -LogName Application -After $Start_time -Before $End_time.AddMinutes(5))
Add-Log -Path $strLogFile_e -Value "Total logs count : $($EventsCache.Count)" -Type Info
$Error_Array = @()
foreach($e in $Events)
{
    $Events_e_ALL = $null
    $Events_e_Matched = $null
    $Events_e_NMatched = $null
    $Events_e_FinalCount = 0

    $Events_e_ALL = @($EventsCache | ?{$e.ID -contains $_.EventID})
    Add-Log -Path $strLogFile_e -Value "Captured [$($e.ID -join '], [')], count: $($Events_e_ALL.Count)" -Type Info
    $Events_e_Matched = @($Events_e_ALL | ?{$_.Message -imatch $e.Pattern})
    Add-Log -Path $strLogFile_e -Value "Pattern matched, count: $($Events_e_Matched.Count)" -Type Info
    
    if($e.MinusPattern)
    {
        $Events_e_NMatched = @($Events_e_ALL | ?{$_.Message -imatch $e.MinusPattern})
        Add-Log -Path $strLogFile_e -Value "Minus pattern matched, count: $($Events_e_NMatched.Count)" -Type Info
    }

    $Events_e_FinalCount = $Events_e_Matched.Count - [int]$Events_e_NMatched.Count
    Add-Log -Path $strLogFile_e -Value "Final matched, count: $Events_e_FinalCount" -Type Info
    if($Events_e_FinalCount -ge $e.Threshold)
    {
        Add-Log -Path $strLogFile_e -Value "Over threshold: $($e.Threshold)" -Type Warning
        if($Error_Array -notcontains $e.EventClass)
        {
            $Error_Array += $e.EventClass
        }
    }
}

Add-Log -Path $strLogFile_e -Value "Alert classes captured: [$($Error_Array -join '], [')]" -Type Info
for($e = 0; $e -lt $Error_Array.Count; $e++)
{
    # Log the class number itself rather than the loop index
    Add-Log -Path $strLogFile_e -Value "Process class: [$($Error_Array[$e])]" -Type Info
    if($EventClass.$($Error_Array[$e]).Script -imatch '^$')
    {
        Add-Log -Path $strLogFile_e -Value 'Final script not set, need to send alert.' -Type Warning
    }
    else
    {
        Add-Log -Path $strLogFile_e -Value "Run final script: [$($EventClass.$($Error_Array[$e]).Script)]" -Type Info
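        # PowerShell does not run scripts from the current location by bare name, so build the full path
        if((& (Join-Path $PWD.Path ($EventClass.$($Error_Array[$e]).Script))) -eq $true)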
        {
            Add-Log -Path $strLogFile_e -Value 'Final script: [Positive], no need to send alert.' -Type Info
            $Error_Array[$e] = $null
        }
        else
        {
            Add-Log -Path $strLogFile_e -Value 'Final script: [Negative], need to send alert' -Type Warning
        }
    }
}

$Error_Array | %{$Mail_Body = @()}{
    if($_)
    {
        $Mail_Body += $EventClass.$_.Description
    }
}
$Mail_Body = $Mail_Body -join "`n"

Add-Log -Path $strLogFile_e -Value "===================split line====================" -Type Info
Get-Content -Path $strLogFile_e | Add-Content -Path $strLogFile

If($Mail_Body)
{
    try
    {
        Send-MailMessage -From $Mail_From -To $Mail_To -Subject $Mail_Subject -Body $Mail_Body -SmtpServer $Mail_SMTPServer -Attachments $strLogFile_e
    }
    catch
    {
        Add-Log -Path $strLogFile -Value "Failed to send mail, cause: $($Error[0])" -Type Error
    }
}
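
To make the Pattern / MinusPattern arithmetic concrete, here is a small worked example. It is only a sketch: Get-NetEventCount is a hypothetical helper and the sample messages are invented, not real GFE event text. It mirrors the counting that the main loop does on the Message property of each event.

```powershell
# Sketch only: Get-NetEventCount is hypothetical, not part of the monitoring script
function Get-NetEventCount
{
    PARAM(
        [String[]]$Message,
        [String]$Pattern,
        [String]$MinusPattern
    )
    # Count the messages matching Pattern (an empty pattern matches everything)
    $Matched = @($Message | Where-Object { $_ -imatch $Pattern }).Count
    $Minus = 0
    if($MinusPattern)
    {
        # Count the messages that indicate an automatic recovery
        $Minus = @($Message | Where-Object { $_ -imatch $MinusPattern }).Count
    }
    return ($Matched - $Minus)
}

# Invented sample: three pauses, two of which were followed by a recovery
$Sample = @(
    'Pausing user a: MAPI error',
    'Pausing user b: MAPI error',
    'Pausing user c: MAPI error',
    'Unpausing user a',
    'Unpausing user b'
)
$Net = Get-NetEventCount -Message $Sample -Pattern '\bPausing .*MAPI error' -MinusPattern 'Unpausing'
```

Here Pattern matches three messages and MinusPattern matches two, so $Net is 1. The \b in the pattern keeps "Unpausing" from being counted as a pause. With a Threshold of 100, as in class 1 above, this would not raise an alert.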

 
