Bosun Alert Configuration

Expressions

Data Types

  1. Scalar: The simplest type: a single numeric value with no group associated with it. Keep in mind that an empty group, "{}", is still a group.
  2. NumberSet: A group of tagged numeric values with one value per unique grouping. As a special case, a scalar may be used in place of a numberSet that has a single member with an empty group.
  3. SeriesSet: A set of series, where each series is an array of timestamp-value pairs with an associated group.
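
The three types can be pictured with plain Python containers. This is only a sketch for intuition: the names `group`, `NumberSet`, `SeriesSet`, and `as_number_set` are hypothetical and are not part of Bosun, and the sample values are made up.

```python
from typing import Dict, List, Tuple

# A group is a set of tag key/value pairs; the empty tuple models the empty group "{}".
Group = Tuple[Tuple[str, str], ...]


def group(**tags: str) -> Group:
    """Build a hashable group from tag key/value pairs."""
    return tuple(sorted(tags.items()))


Scalar = float
NumberSet = Dict[Group, float]                     # one value per group
SeriesSet = Dict[Group, List[Tuple[int, float]]]   # (timestamp, value) pairs per group

scalar: Scalar = 42.0
number_set: NumberSet = {group(host="vs123"): 61.5, group(host="vs124"): 48.0}
series_set: SeriesSet = {
    group(host="vs123"): [(1700000000, 60.0), (1700000030, 63.0)],
    group(host="vs124"): [(1700000000, 47.0), (1700000030, 49.0)],
}


def as_number_set(value: Scalar) -> NumberSet:
    """A scalar used as a numberSet acts as a single member with the empty group."""
    return {group(): value}
```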

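Operators

Categories

  1. Standard arithmetic operators: +, -, *, /, %
  2. Relational operators: <, >, ==, !=, >=, <=
  3. Logical operators: &&, ||, !

Precedence

From highest to lowest:
1. (), and the unary operators ! and -
2. *, /, %
3. +, -
4. ==, !=, >, >=, <, <=
5. &&
6. ||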
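
To see what the precedence table implies, the sketch below writes out the grouping of one expression by hand. It is an illustration in Python, not Bosun code: Python's `not`, `and`, and `or` stand in for `!`, `&&`, and `||`, and the variable values are hypothetical.

```python
# Hypothetical sample values, chosen only for illustration.
a, b, c, d, e = 0, 2, 3, 4, 15

# Bosun expression (for reference): !a && b + c * d > e || a
# Python's "not" binds more loosely than Bosun's unary "!", so every
# grouping implied by the precedence table is written out explicitly.
grouped = ((not a) and ((b + (c * d)) > e)) or bool(a)

# Reading the arithmetic strictly left to right would give a different answer.
left_to_right = ((b + c) * d) > e
```

With these values the multiplication is applied first, so the comparison is 14 > 15 and the whole expression is false, whereas a naive left-to-right reading would compare 20 > 15.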

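Common Functions

  • q(query string, startDuration string, endDuration string)
    Queries the data in the window that begins startDuration ago and ends endDuration ago; if the third argument is an empty string, the window ends at the current time. This is the commonly used query function for OpenTSDB. For example, to query the memory used by every host over the last minute: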
q("avg:os.mem.used{host=*}", "1m", "")

The result column shows the memory used on each host. The return value is a seriesSet, with one series per host.
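
How the two duration arguments map to a time window can be sketched as follows. This is a simplified model, not Bosun's implementation: `parse_duration` and `query_window` are hypothetical helpers, and only the suffixes s, m, h, d, and w are handled here.

```python
import re
from datetime import datetime, timedelta, timezone

# Seconds per unit for a subset of duration suffixes (an assumption of this sketch).
_UNIT_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400, "w": 604800}


def parse_duration(text: str) -> timedelta:
    """Parse strings such as "1m" or "2h"; the empty string means zero, i.e. now."""
    if text == "":
        return timedelta(0)
    match = re.fullmatch(r"(\d+)([smhdw])", text)
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    return timedelta(seconds=int(match.group(1)) * _UNIT_SECONDS[match.group(2)])


def query_window(start_duration: str, end_duration: str, now: datetime):
    """Return the (start, end) instants of the window, both relative to now."""
    return now - parse_duration(start_duration), now - parse_duration(end_duration)


now = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
start, end = query_window("1m", "", now)  # the last minute, ending now
```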

  • avg(seriesSet): Returns the average of each series as a numberSet. For example, to compute the average memory used by host "vs123" over the last minute:
avg(q("avg:os.mem.used{host=vs123}", "1m", ""))

  • max(seriesSet): Returns the maximum value of each series as a numberSet.
max(q("avg:os.mem.used{host=vs123}", "1m", ""))

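  • min(seriesSet): Returns the minimum value of each series as a numberSet.
  • sum(seriesSet): Returns the sum of the values in each series as a numberSet. For example, the raw series:
q("avg:os.mem.used{host=vs123}", "1m", "")

and the sum of its values:
sum(q("avg:os.mem.used{host=vs123}", "1m", ""))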
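
The four reduction functions share one shape: each collapses every series in a seriesSet to a single number, which yields a numberSet. A minimal sketch in Python, with made-up data and a hypothetical `reduce_series` helper:

```python
from statistics import mean

# Hypothetical seriesSet: one list of (timestamp, value) pairs per host.
series_set = {
    "vs123": [(0, 10.0), (30, 20.0), (60, 30.0)],
    "vs124": [(0, 5.0), (30, 5.0), (60, 8.0)],
}


def reduce_series(series_set, reducer):
    """Apply reducer to the values of each series, giving one number per group."""
    return {
        grp: reducer([value for _, value in points])
        for grp, points in series_set.items()
    }


avg_set = reduce_series(series_set, mean)
max_set = reduce_series(series_set, max)
min_set = reduce_series(series_set, min)
sum_set = reduce_series(series_set, sum)
```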

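  • t(numberSet, group string): Transpose (grouping) function.
    For example, to view the memory usage of hosts whose names begin with "vs12", before the transformation:
avg(q("avg:os.mem.used{host=vs12*}", "1m", ""))

After applying the transpose function:

t(avg(q("avg:os.mem.used{host=vs12*}", "1m", "")),"")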
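
The effect of t with an empty group string can be modelled roughly as below: the per-host values of a numberSet are gathered into a single series under the empty group. This is a simplified sketch with hypothetical data; the `transpose` helper is not Bosun code, and timestamps are omitted.

```python
# Hypothetical numberSet as produced by avg(q(...)): one value per host.
number_set = {
    (("host", "vs121"),): 40.0,
    (("host", "vs122"),): 55.0,
    (("host", "vs123"),): 61.0,
}


def transpose(number_set, group_keys=""):
    """Collect values into series keyed by the requested tag keys.

    With group_keys == "" every value lands in one series under the empty group.
    """
    keys = [k for k in group_keys.split(",") if k]
    series_set = {}
    for grp, value in number_set.items():
        tags = dict(grp)
        new_group = tuple((k, tags[k]) for k in keys)
        series_set.setdefault(new_group, []).append(value)
    return series_set


merged = transpose(number_set, "")
```

Because the result is a seriesSet again, it can be fed to a reduction such as avg or sum to obtain one aggregate across all matching hosts.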

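  • limit(numberSet, count scalar): Limits the number of results.
  • filter(seriesSet, numberSet): Filters the results.
    For example, the following keeps the 10 hosts with the highest CPU usage among hosts whose names begin with "vs":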
filter(q("sum:os.cpu{host=regexp(^vs)}", "1m", ""),limit(sort(avg(q("sum:os.cpu{host=regexp(^vs)}", "1m", "")),"desc"),10))
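
The avg, sort, limit, and filter chain in that expression can be traced step by step with a small Python model. The data and the function bodies are assumptions made for illustration; only the order of operations mirrors the expression above.

```python
# Hypothetical seriesSet: CPU samples per host (timestamps omitted).
series_set = {
    "vs1": [90.0, 95.0],
    "vs2": [10.0, 20.0],
    "vs3": [50.0, 70.0],
}


def avg(series_set):
    """Reduce each series to its mean, giving a numberSet."""
    return {g: sum(v) / len(v) for g, v in series_set.items()}


def sort(number_set, order):
    """Order a numberSet by value, "desc" or "asc"."""
    items = sorted(number_set.items(), key=lambda kv: kv[1], reverse=(order == "desc"))
    return dict(items)


def limit(number_set, count):
    """Keep only the first count members."""
    return dict(list(number_set.items())[:count])


def filter_series(series_set, number_set):
    """Keep the series whose group also appears in the numberSet."""
    return {g: s for g, s in series_set.items() if g in number_set}


top = limit(sort(avg(series_set), "desc"), 2)
kept = filter_series(series_set, top)
```

The numberSet acts purely as a selector here: filter returns the original series, not the averages, which is why the same query appears twice in the Bosun expression.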

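Alert Configuration

An alert configuration is divided into five kinds of sections: alert, template, lookup, notification, and macro. Each section must be enclosed in "{}". A basic alert needs three of them: template, alert, and notification (the email configuration).

Variables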

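Definition rule: a variable name begins with "$", and a variable is referenced as $var or ${var}. Environment variables are referenced with the env. prefix, for example: tsdbHost = ${env.TSDBHOST}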
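
Variable references behave like textual substitution. The sketch below imitates that in Python for $name, ${name}, and ${env.NAME}; the `expand` function, the variable values, and the host name are hypothetical, and Bosun's own parser is not shown.

```python
import re


def expand(text, variables, environ):
    """Replace $name, ${name} and ${env.NAME} references in text."""

    def lookup(match):
        name = match.group(1) or match.group(2)
        if name.startswith("env."):
            # Environment variables are looked up without the "env." prefix.
            return environ[name[len("env."):]]
        return variables[name]

    return re.sub(r"\$\{([\w.]+)\}|\$(\w+)", lookup, text)


variables = {"queryTime": "1h", "limit": "10"}
environ = {"TSDBHOST": "tsdb.example.com:4242"}

query = expand('q("avg:os.cpu{host=*}", "$queryTime", "")', variables, environ)
host_line = expand("tsdbHost = ${env.TSDBHOST}", variables, environ)
```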

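Templates (template)

A template defines the format in which an alert message is sent. For example, when alert notifications go out by email, the subject and body of the message are rendered from the specified template, so the email is sent in the style you have set up.

A simple template example:

#Template name: unknownTemp
template unknownTemp {
    #Template subject
    subject = {{.Name}}: {{.Group | len}} unknown alerts
    #Template body (similar to HTML)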
    body = `
    <p>Time: {{.Time}} 
    <p>Name: {{.Name}} 
    <p>Alerts: {{range .Group}}
        <br>{{.}}
    {{end}}` 
}
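
To make the placeholders concrete, the following Python sketch reproduces roughly what this template would render for some hypothetical values. Bosun itself renders templates with Go's template packages; `render_unknown` is an invented stand-in, and whitespace in the real output may differ.

```python
from html import escape


def render_unknown(name, time, group):
    """Mimic the output of the unknownTemp template for the given values."""
    # {{.Group | len}} becomes the number of alert keys in the group.
    subject = f"{name}: {len(group)} unknown alerts"
    # {{range .Group}} emits one <br> line per alert key.
    alerts = "".join(f"\n        <br>{escape(item)}" for item in group)
    body = (
        f"\n    <p>Time: {escape(time)}"
        f"\n    <p>Name: {escape(name)}"
        f"\n    <p>Alerts: {alerts}"
    )
    return subject, body


subject, body = render_unknown(
    "os.low.memory", "2024-01-01 12:00:00 UTC", ["{host=vs123}", "{host=vs124}"]
)
```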

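Alerts (alert)

The alert section contains the alert expressions, which trigger actions such as sending email or writing logs.
The available parameters are:

  • crit: The expression for a critical alert.
  • critNotification: The notification to use when a critical alert fires.
  • warn: The expression for a warning alert (a lower severity than crit).
  • warnNotification: The notification to use when a warning alert fires.
    For example: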
notification email {
    #Multiple email addresses may be listed, separated by commas
    email = email.email1@example.com, email.email2@example.com
    print = true
}
alert{
    ……
    #Refer to the notification by name
    critNotification = email
    warnNotification = email
}
  • ignoreUnknown: Ignores unknown alerts.
alert{
    ignoreUnknown = true
}
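  • depends: An expression that this alert depends on.
  • unknownIsNormal: Treats the unknown state as normal.
  • runEvery: How often the alert is evaluated.
  • template: The name of the template to use.
  • unjoinedOk: When set, unjoined-expression errors are ignored.
  • unknown: The time after which an alert that cannot be evaluated is marked unknown.
  • log: If log = true, the alert becomes a log alert.
  • maxLogFrequency: The maximum frequency of log alerts.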
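
How crit and warn combine into a status can be sketched as below: crit is checked first, and warn only matters when crit does not fire. The thresholds, host names, and the `alert_status` helper are hypothetical; the sketch leaves out unknown handling and notifications.

```python
def alert_status(value, crit, warn):
    """Return "critical", "warning" or "normal" for one group's value."""
    if crit(value):
        return "critical"
    if warn(value):
        return "warning"
    return "normal"


# Hypothetical per-host averages, as a numberSet.
avg_cpu = {"vs1": 92.5, "vs2": 15.0, "vs3": 75.0}

statuses = {
    host: alert_status(v, crit=lambda x: x > 80, warn=lambda x: x > 70)
    for host, v in avg_cpu.items()
}
```

Each group in the numberSet gets its own status, so one alert definition can be critical for one host and normal for another at the same time.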

Alert Examples

CPU Alert

template cpuTemplate {
    subject = {{.Last.Status}}: {{.Alert.Name}} on {{.Group.host}}
    body = `<p>Notes:{{.Alert.Vars.notes }}</p>
    <p>Alert: {{.Alert.Name}} triggered on {{.Group.host}}
    <hr>
    <p><strong>Computation</strong>
    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>
    <p><strong>All Hosts CPU Information</strong>
    <p>(Red color means unhealthy, green color means healthy)</p>
    <table>
    {{range $f := .EvalAll .Alert.Vars.avgcpu}}
        <tr><td>{{ $f.Group.host}}</td>
        {{if gt $f.Value 70.0}}
            <td style="color: red;">
            {{else}}
                <td style="color: green;">
            {{end}}
        {{ $f.Value | printf "%.0f" }}</td></tr>
    {{end}}
    </table>
    <hr>
    {{ .GraphAll .Alert.Vars.filteResult }}
    <hr>
    <p><strong>Relevant Tags</strong>
    <table>
        {{range $k, $v := .Group}}
            <tr><td>{{$k}}</td><td>:</td><td>{{$v}}</td></tr>
        {{end}}
    </table>
    <p>Attention: The time in the graph is <font color="red">UTC</font> time</p>
    <p>The X axis means the time from now to {{.Alert.Vars.queryTime}} ago.</p>`
}
alert cpu.is.too.high {
    template = cpuTemplate
    $notes = This alert monitors the average CPU usage of each host and fires when it rises above the warn or crit threshold.
    $queryTime = 1h
    $limit = 10
    $metric = q("sum:rate{counter,,1}:os.cpu{host=regexp(^vs)}", "$queryTime", "")
    $avgcpu = avg($metric)
    $orderCPU = limit(sort($avgcpu, "desc"), $limit)
    $filteResult = filter($metric, $orderCPU)
    crit = $avgcpu > 80
    warn = $avgcpu > 70
    ignoreUnknown = true
    critNotification = email
    warnNotification = email
}
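
The host table in cpuTemplate colors each average red when it is above 70 and green otherwise. The Python sketch below mirrors that branch for hypothetical values; `host_rows` is an invented helper, and the real rows are produced by the template's range over .EvalAll.

```python
def host_rows(avg_cpu, threshold=70.0):
    """Build one table row per host, red above the threshold and green otherwise."""
    rows = []
    for host, value in avg_cpu.items():
        color = "red" if value > threshold else "green"
        # The value is rounded to a whole number, like printf "%.0f" in the template.
        rows.append(
            f'<tr><td>{host}</td><td style="color: {color};">{value:.0f}</td></tr>'
        )
    return rows


rows = host_rows({"vs1": 92.4, "vs2": 15.0})
```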




Disk Alert

template diskTemplate {
    subject = {{.Last.Status}}: {{.Alert.Name}} on {{.Group.host}}
    body = `<p>Notes:{{.Alert.Vars.notes }}</p>
    <p>Alert: {{.Alert.Name}} triggered on {{.Group.host}}
    <hr>
    <p><strong>Computation</strong>
    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>
    <p><strong>All Hosts Disk Information</strong>
    <p>(Red color means unhealthy, green color means healthy)</p>
    <table>
    {{range $f := .EvalAll .Alert.Vars.avgDiskPercent}}
        <tr><td>{{ $f.Group.host}}</td>
        {{if lt $f.Value 10.0}}
            <td style="color: red;">
            {{else}}
                <td style="color: green;">
            {{end}}
        {{ $f.Value | printf "%.0f" }}</td></tr>
    {{end}}
    </table>
    <hr>
    {{ .GraphAll .Alert.Vars.filteResult }}
    <hr>
    <p><strong>Relevant Tags</strong>
    <table>
        {{range $k, $v := .Group}}
            <tr><td>{{$k}}</td><td>:</td><td>{{$v}}</td></tr>
        {{end}}
    </table>
    <p>Attention: The time in the graph is <font color="red">UTC</font> time</p>
    <p>The X axis means the time from now to {{.Alert.Vars.queryTime}} ago.</p>`
}
alert disk.free.space.is.too.small {
    template = diskTemplate
    $notes = This alert monitors the percentage of disk free space 
    $queryTime = 1h
    $limit = 10
    $diskPercentFree = q("avg:os.disk.fs.percent_free{host=regexp(^vs)}", "$queryTime", "")
    $avgDiskPercent = avg($diskPercentFree)
    $orderDisk = limit(sort($avgDiskPercent, "asc"), $limit)
    $filteResult = filter($diskPercentFree, $orderDisk)
    ignoreUnknown = true
    crit = $avgDiskPercent < 5
    warn = $avgDiskPercent < 10
    critNotification = email
    warnNotification = email
}

Memory Alert

template memroyTemplate {
    body = `{{if .Alert.Vars.notes}}
    <p>Notes: {{.Alert.Vars.notes}}
    {{end}}
    {{if .Group.host}}

    {{end}}
    <hr>
    <p><strong>Alert definition:</strong>
    <table>
        <tr>
            <td>Name:</td>
            <td>{{replace .Alert.Name "." " " -1}}</td></tr>
        <tr>
            <td>Warn:</td>
            <td>{{.Alert.Warn}}</td></tr>
        <tr>
            <td>Crit:</td>
            <td>{{.Alert.Crit}}</td></tr>
    </table>
    <hr>
    <p><strong>All Hosts Memory Information</strong>
    <p>(Red color means unhealthy, green color means healthy)</p>
    <table>
    {{range $f := .EvalAll .Alert.Vars.avgfree}}
        <tr><td>{{ $f.Group.host}}</td>
        {{if lt $f.Value 30.0}}
            <td style="color: red;">
            {{else}}
                <td style="color: green;">
            {{end}}
        {{ $f.Value | printf "%.0f" }}</td></tr>
    {{end}}
    </table>
    <p><strong>Tags</strong>

    <table>
        {{range $k, $v := .Group}}
            {{if eq $k "host"}}
                <tr><td>{{$k}}</td><td>:</td><td><a href="{{$.HostView $v}}">{{$v}}</a></td></tr>
            {{else}}
                <tr><td>{{$k}}</td><td>{{$v}}</td></tr>
            {{end}}
        {{end}}
    </table>
    <p><strong>Computation</strong>
    <table>
        {{range .Computations}}
            <tr><td><a href="{{$.Expr .Text}}">{{.Text}}</a></td><td>{{.Value}}</td></tr>
        {{end}}
    </table>
    <hr>
    {{ .GraphAll .Alert.Vars.filteResult }}
    <hr>
    <p>Attention: The time in the graph is <font color="red">UTC</font> time</p>
    <p>The X axis means the time from now to {{.Alert.Vars.queryTime}} ago.</p>`
    subject = {{.Last.Status}}: {{replace .Alert.Name "." " " -1}}: {{.Eval .Alert.Vars.avgfree | printf "%.2f"}}{{if .Alert.Vars.unit_string}}{{.Alert.Vars.unit_string}}{{end}} on {{.Group.host}}
}
alert os.low.memory {
    template = memroyTemplate
    $notes = In Linux, Buffers and Cache are considered "Free Memory". This alert monitors the percentage of memory free space.
    $unit_string = % Free Memory
    $queryTime = 1h
    $limit = 10
    $memory = q("avg:os.mem.percent_free{host=regexp(^vs)}", "$queryTime", "")
    $avgfree = avg($memory)
    $orderMemory = limit(sort($avgfree, "asc"), $limit) 
    $filteResult = filter($memory, $orderMemory)
    ignoreUnknown = true
    crit = $avgfree < 20
    warn = $avgfree < 30
    critNotification = email
    warnNotification = email
}

Ignoring Unknown

template unknownTemp {
    subject = {{.Name}}: {{.Group | len}} unknown alerts 
    body = `
    <p>Time: {{.Time}} 
    <p>Name: {{.Name}} 
    <p>Alerts: {{range .Group}}
        <br>{{.}}
    {{end}}` 
}
unknownTemplate = unknownTemp

Email Configuration

smtpHost = mail.example.com:25
emailFrom = username@163.com
smtpUsername = username@163.com
smtpPassword = password
notification email {
    email = example1@example1.com, example2@example2.com
    print = true
}
