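In this article, we look at how to use the Elasticsearch ingest node to structure and process data.

Data sets

In real-world data collection, data may come from many different sources and arrive in many different forms:

The data may be ingested in an already well-structured form, such as rows from a database, or in its rawest unstructured form, such as logs. For unstructured data, how do we give it structure so that it can be analyzed with Elasticsearch?

Structuring data

As shown above, data is often unstructured at ingest time. It usually carries a single field called message. To structure it, we need to parse and transform this message field into the fields we actually need. Let's look at an example. Suppose we have the following message: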

{
    "message": "2019-09-29T00:39:02.9122 [Debug] MyApp stopped"
}

Clearly, this message is unstructured: it contains only a single field, message. We would like to turn it into:

{
    "@timestamp": "2019-09-29T00:39:02.9122",
    "loglevel": "Debug",
    "status": "MyApp stopped"
}

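This is a structured document, which is much easier to analyze, for example by running aggregations on it or visualizing it in Kibana.

Next, let's look at a typical Elastic Stack architecture diagram:

In the diagram above, there are two places where we can process the data:

We can use either Logstash or the ingest node to process our data. If you are unsure which of the two to choose, see my earlier article "Should I use Logstash or Elasticsearch ingest nodes?".

If your logs are not in an already supported format such as Apache or nginx, you need to build your own pipeline to process them. In this article, we show how to use Elasticsearch ingest processors to turn unstructured data into structured data: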

split
dissect
kv
grok
...

Ingest pipelines

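An Elasticsearch pipeline is a set of processors:

It lets us pre-process data before it is indexed
Each processor can modify the documents that pass through it
Processing takes place on the Elasticsearch ingest node

Defining an Elasticsearch ingest pipeline

We can use the Ingest API to define pipelines:

We can test a pipeline with the _simulate endpoint: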

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK"
      }
    }
  ]
}

The result of running the request above is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : [
            "2019-09-29T00:39:02.912Z",
            "AppServer1",
            "STATUS_OK"
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T08:40:43.059569Z"
        }
      }
    }
  ]
}

We can see that the split processor above turned the unstructured message into structured data. message is now an array, so how do we reference the elements in it?

Let's modify the pipeline further:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": " "
        }
      },
      {
        "set": {
          "field": "timestamp",
          "value": "{{message.0}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK"
      }
    }
  ]
}

Above, we use {{message.0}} to access the first element of the array. The result of running the request is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : [
            "2019-09-29T00:39:02.912Z",
            "AppServer1",
            "STATUS_OK"
          ],
          "timestamp" : "2019-09-29T00:39:02.912Z"
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T02:08:25.004644Z"
        }
      }
    }
  ]
}

We can see a new field called timestamp.

In practice, we can also use target_field to store the split result under a different field name:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": " ",
          "target_field": "new"
        }
      },
      {
        "set": {
          "field": "timestamp",
          "value": "{{message.0}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK 2000"
      }
    }
  ]
}

The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "new" : [
            "2019-09-29T00:39:02.912Z",
            "AppServer1",
            "STATUS_OK",
            "2000"
          ],
          "message" : "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK 2000",
          "timestamp" : ""
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T02:13:43.697296Z"
        }
      }
    }
  ]
}

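We can see that a field called new now holds the split result in place of message. Because we appended a new token "2000" to the message, the new field contains an additional string "2000". (Note that timestamp is empty in this output: message is still the original string here, not an array, so {{message.0}} has nothing to resolve.) Suppose we want to convert the "2000" string to an integer. We can do so as follows: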

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": " ",
          "target_field": "new"
        }
      },
      {
        "set": {
          "field": "timestamp",
          "value": "{{message.0}}"
        }
      },
      {
        "convert": {
          "field": "new.3",
          "type": "integer"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK 2000"
      }
    }
  ]
}

Above, we use new.3 to denote the element we want to convert. The output is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "new" : [
            "2019-09-29T00:39:02.912Z",
            "AppServer1",
            "STATUS_OK",
            2000
          ],
          "message" : "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK 2000",
          "timestamp" : ""
        },
        "_ingest" : {
          "timestamp" : "2020-12-09T02:16:30.144772Z"
        }
      }
    }
  ]
}

From the above we can see that "2000" has become 2000.
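
To make the data flow concrete, here is a minimal Python sketch of what the split (with target_field) and convert steps do to the document. It is only an illustration of the transformation, not how Elasticsearch implements it; the helper names split_field and convert_element_to_int are made up for this example. The real split processor treats separator as a regular expression, whereas plain str.split is enough for a single space.

```python
def split_field(doc, field, separator, target_field=None):
    """Split a string field into a list, like a simplified split processor."""
    result = dict(doc)
    # Without target_field the original field is overwritten in place.
    result[target_field or field] = doc[field].split(separator)
    return result


def convert_element_to_int(doc, field, index):
    """Convert one element of a list field to an integer (compare "new.3")."""
    result = dict(doc)
    values = list(doc[field])
    values[index] = int(values[index])
    result[field] = values
    return result


doc = {"message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK 2000"}
doc = split_field(doc, "message", " ", target_field="new")
doc = convert_element_to_int(doc, "new", 3)
print(doc["new"])
```

As in the simulated output, the last element of new ends up as the number 2000 while message keeps the original string.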

How to use a pipeline

Once you have defined a pipeline, and if you use Filebeat to ship data into Elasticsearch, you can reference the pipeline in the Filebeat configuration file like this:

output.elasticsearch:
   hosts: ["http://localhost:9200"]
   pipeline: my_pipeline

You can also define a default pipeline directly on your Elasticsearch index:

PUT my_index
{
  "settings": {
    "default_pipeline": "my_pipeline"
  }
}

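This way, my_pipeline is invoked automatically whenever data is indexed into my_index.

Examples

Dissect

Let's look at a slightly more complex example. You would need to use both the split and kv processors to structure this message:

As shown above, we want to extract the parts marked in red, but we do not need the square brackets [ and ] in the message. We can use the dissect processor: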

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} [%{loglevel}] %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"
      }
    }
  ]
}

The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "loglevel" : "Debug",
          "message" : "2019-09-29T00:39:02.912Z [Debug] MyApp stopped",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T09:10:33.720665Z"
        }
      }
    }
  ]
}
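
The idea behind dissect, splitting on the literal text between keys rather than running a general regular expression, can be sketched in Python. This is a simplified illustration under stated assumptions: the dissect function below is hypothetical, handles only plain %{key} fields (no modifiers), and happens to use Python's re module internally for brevity, which the real processor does not.

```python
import re

KEY = re.compile(r"%\{([^}]+)\}")


def dissect(pattern, text):
    """Apply a simple dissect-style pattern; return a dict or None."""
    regex = ""
    position = 0
    for key in KEY.finditer(pattern):
        # Literal text between keys must match exactly.
        regex += re.escape(pattern[position:key.start()])
        # Each key captures as little as possible up to the next literal.
        regex += "(.*?)"
        position = key.end()
    regex += re.escape(pattern[position:])
    found = re.fullmatch(regex, text)
    if found is None:
        return None
    return dict(zip(KEY.findall(pattern), found.groups()))


fields = dissect(
    "%{@timestamp} [%{loglevel}] %{status}",
    "2019-09-29T00:39:02.912Z [Debug] MyApp stopped",
)
print(fields)
```

The square brackets are part of the literal delimiters, so they are consumed by the match and do not appear in the extracted loglevel value.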

Next, consider a message made up of key-value pairs:

{
  "message": "2019009-29T00:39:02.912Z host=AppServer status=STATUS_OK"
}

We can again use the dissect processor to handle it:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor key-value",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{*field1}=%{&field1} %{*field2}=%{&field2}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019009-29T00:39:02.912Z host=AppServer status=STATUS_OK"
      }
    }
  ]
}

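Above, * and & are reference key modifiers, which change the behavior of dissect: the text captured by %{*field1} becomes the name of the output field, and the text captured by %{&field1} becomes that field's value. The result is: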

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019009-29T00:39:02.912Z",
          "host" : "AppServer",
          "message" : "2019009-29T00:39:02.912Z host=AppServer status=STATUS_OK",
          "status" : "STATUS_OK"
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T14:04:38.639127Z"
        }
      }
    }
  ]
}
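
The effect of the reference keys can be pictured with a few lines of Python: the left side of each "=" becomes the field name and the right side becomes its value. This is a rough sketch of the outcome only, assuming the message layout shown above (a timestamp followed by space-separated key=value pairs); parse_key_values is a hypothetical helper, not an Elasticsearch API.

```python
def parse_key_values(message):
    """Turn 'timestamp k1=v1 k2=v2' into a flat dictionary."""
    timestamp, *pairs = message.split(" ")
    doc = {"@timestamp": timestamp}
    for pair in pairs:
        # Left of "=" is the field name (like *), right is the value (like &).
        key, value = pair.split("=", 1)
        doc[key] = value
    return doc


# Sample message copied verbatim from the request above.
doc = parse_key_values("2019009-29T00:39:02.912Z host=AppServer status=STATUS_OK")
print(doc)
```

Unlike the dissect pattern, which names exactly two pairs, this sketch accepts any number of pairs, which is closer to what the kv processor is meant for.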

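Many new developers are unclear about the difference between dissect and grok. On the surface the two overlap a great deal, but dissect executes much faster than grok, so prefer dissect whenever it can do the job. In some situations, however, grok is still required. We cover these in the grok section below.

Script processor

Although the existing processors are very convenient, in practice many features we need are not among the built-in capabilities of Logstash or Elasticsearch. One option is to write your own plugin, but that can be a huge undertaking. Instead, we can write a script to do the job, usually in Elasticsearch's Painless scripting language. If you want to learn more about Painless, you can find several introductory articles on the language in "Elastic：菜鳥上手指南" (Elastic: A Beginner's Guide).

There are two ways to run a Painless script: inline or stored.

Inline scripts

The following example shows an inline script that sets a field called new_field: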

PUT /_ingest/pipeline/my_script_pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx['new_field'] = params.value",
        "params": {
          "value": "Hello world"
        }
      }
    }
  ]
}

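Above, we pass the value in through params. The advantage is that the source code never changes, so it is compiled only once. If the source changed from one invocation to the next, it would be recompiled every time, which is wasteful.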
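
The compile-once argument can be illustrated with a small Python analogy. This is not Painless and not Elasticsearch's actual script cache; it is a sketch that assumes a cache keyed by the script source string, which is enough to show why parameterized sources are reused while hard-coded values are not.

```python
compiled_scripts = {}
compile_count = 0


def get_compiled(source):
    """Compile a script source once and reuse it on later calls."""
    global compile_count
    if source not in compiled_scripts:
        compile_count += 1
        compiled_scripts[source] = compile(source, "<script>", "exec")
    return compiled_scripts[source]


def run_script(source, ctx, params):
    exec(get_compiled(source), {"ctx": ctx, "params": params})
    return ctx


# Parameterized: one source string, so one compilation for all values.
for value in ["Hello world", "Hello again"]:
    run_script("ctx['new_field'] = params['value']", {}, {"value": value})
parameterized_compiles = compile_count

# Hard-coded values: every distinct source string is compiled separately.
for value in ["Hello world", "Hello again"]:
    run_script(f"ctx['new_field'] = {value!r}", {}, {})
hardcoded_compiles = compile_count - parameterized_compiles

print(parameterized_compiles, hardcoded_compiles)  # 1 2
```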

Stored scripts

Scripts can also be stored in the cluster state and invoked later by referencing the script ID:

PUT _scripts/my_script
{
  "script": {
    "lang": "painless",
    "source": "ctx['new_field'] = params.value"
  }
}

PUT /_ingest/pipeline/my_script_pipeline
{
  "processors": [
    {
      "script": {
        "id": "my_script",
        "params": {
          "value": "Hello world!"
        }
      }
    }
  ]
}

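The two requests above achieve the same thing as before. In an ingest node context, we access a document's field with ctx['new_field']. We can also access its metadata fields, for example ctx['_id'] = ctx['my_field'].

Let's start with a few exercises: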

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "lang": "painless",
          "source": "ctx['new_value'] = ctx['current_value'] + 1"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "current_value": 2
      }
    }
  ]
}

When run, the script above creates a new field called new_value whose value is the value of current_value plus 1. The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "new_value" : 3,
          "current_value" : 2
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T14:49:35.775395Z"
        }
      }
    }
  ]
}

Our next example creates a stored script:

PUT _scripts/my_script
{
  "script": {
    "lang": "painless",
    "source": "ctx['new_value'] = ctx['current_value'] + params.value"
  }
}

PUT /_ingest/pipeline/my_script_pipeline
{
  "processors": [
    {
      "script": {
        "id": "my_script",
        "params": {
          "value": 1
        }
      }
    }
  ]
}

This does the same thing as the previous example. Run the two requests above first. To test whether the pipeline works, let's create two documents:

POST test_docs/_doc
{
  "current_value": 34
}

POST test_docs/_doc
{
  "current_value": 80
}

Then we run the following request:

POST test_docs/_update_by_query?pipeline=my_script_pipeline
{
  "query": {
    "range": {
      "current_value": {
        "gt": 30
      }
    }
  }
}

Above, we use _update_by_query together with the pipeline to update our documents. Only documents whose current_value is greater than 30 are affected. After it runs:

{
  "took" : 25,
  "timed_out" : false,
  "total" : 2,
  "updated" : 2,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

It shows that two documents have been updated. We can check them with the following request:

GET test_docs/_search

The result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_docs",
        "_type" : "_doc",
        "_id" : "EIEnvHEBQHMgxFmxZyBq",
        "_score" : 1.0,
        "_source" : {
          "new_value" : 35,
          "current_value" : 34
        }
      },
      {
        "_index" : "test_docs",
        "_type" : "_doc",
        "_id" : "D4EnvHEBQHMgxFmxXyBp",
        "_score" : 1.0,
        "_source" : {
          "new_value" : 81,
          "current_value" : 80
        }
      }
    ]
  }
}

From the above we can see that the value of new_value is the value of current_value plus 1.

Let's look at one more example:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": " ", 
          "target_field": "split_message"
        }
      },
      {
        "set": {
          "field": "environment",
          "value": "prod"
        }
      },
      {
        "set": {
          "field": "@timestamp",
          "value": "{{split_message.0}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK"
      }
    }
  ]
}

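In the first processor, split, we split message on " " and assign the result, which is an array, to the field split_message. Next, a set processor adds a field called environment with the value prod. Then another set processor takes the first element of the split_message array and assigns it to @timestamp, which is a newly added field. The result is as follows: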

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "environment" : "prod",
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "message" : "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK",
          "split_message" : [
            "2019-09-29T00:39:02.912Z",
            "AppServer1",
            "STATUS_OK"
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T15:35:00.922037Z"
        }
      }
    }
  ]
}

Grok processor

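The grok processor uses regular-expression matching to match a pattern against the message and extract structured data from it. Compared with dissect, grok is less efficient, which is something to keep in mind. So why do we still need grok? Let's first look at the following example: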

157.97.192.70 2019 09 29 00:39:02.912 AppServer1 Process 11111 Init
157.97.192.70 2019 09 29 00:39:06.554 AppServer1 22222 Stopped 3.642

With these two log lines, the dissect processor cannot help, because the process ID appears at a different position in each line. Grok, however, handles this case well.
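
The reason a regular expression copes with these two lines is that it can mark parts of the line as optional. The Python sketch below shows this with the standard re module rather than grok itself. The field names, and the reading of "Process" as an optional literal and of the trailing number as a duration, are assumptions made for illustration.

```python
import re

LOG_LINE = re.compile(
    r"(?P<client>\S+) "
    r"(?P<date>\d{4} \d{2} \d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<host>\S+) "
    r"(?:Process )?"                 # optional literal before the process id
    r"(?P<pid>\d+) "
    r"(?P<state>\w+)"
    r"(?: (?P<duration>\d+\.\d+))?"  # optional trailing duration
)

lines = [
    "157.97.192.70 2019 09 29 00:39:02.912 AppServer1 Process 11111 Init",
    "157.97.192.70 2019 09 29 00:39:06.554 AppServer1 22222 Stopped 3.642",
]

parsed = [LOG_LINE.fullmatch(line).groupdict() for line in lines]
for fields in parsed:
    print(fields["pid"], fields["state"], fields["duration"])
```

Both lines match the same expression even though the process ID sits in a different column, which a purely positional, delimiter-based pattern cannot express.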

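In Kibana, we can run the following request to list the predefined grok patterns:

GET /_ingest/processor/grok

We can see that more than 300 predefined grok patterns are available. The following request uses a few of them: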

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "%{TIMESTAMP_ISO8601:@timestamp} %{IP:client} \\[%{WORD:status}\\] %{NUMBER:duration}"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z 55.3.241.1 [OK] 0.043"
      }
    }
  ]
}

The response is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "duration" : "0.043",
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "client" : "55.3.241.1",
          "message" : "2019-09-29T00:39:02.912Z 55.3.241.1 [OK] 0.043",
          "status" : "OK"
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:16:52.155688Z"
        }
      }
    }
  ]
}

The grok processor also handles multi-line events well. For example:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "text",
          "patterns": ["%{GREEDYMULTILINE:allMyData}"],
          "pattern_definitions": {
            "GREEDYMULTILINE": "(.|\n)*"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "text": "This is a text \n secondline"
      }
    }
  ]
}

The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "text" : """This is a text 
 secondline""",
          "allMyData" : """This is a text 
 secondline"""
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:31:38.913929Z"
        }
      }
    }
  ]
}

Above, we can see that allMyData captures the multi-line data into a single field. If instead we use only one of the alternatives in pattern_definitions, for example .*:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "text",
          "patterns": ["%{GREEDYMULTILINE:allMyData}"],
          "pattern_definitions": {
            "GREEDYMULTILINE": ".*"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "text": "This is a text \n secondline"
      }
    }
  ]
}

then we get:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "text" : """This is a text 
 secondline""",
          "allMyData" : "This is a text "
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:35:59.67759Z"
        }
      }
    }
  ]
}

That is, only the first line is extracted.
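
The difference comes from the fact that, by default, "." does not match a newline character. The same behavior can be observed with Python's re module, shown here purely as an illustration; grok uses a different regular-expression engine, so this is an analogy rather than a reproduction.

```python
import re

text = "This is a text \n secondline"

# "." stops at the newline, so only the first line is captured.
first_line_only = re.match(r".*", text).group(0)

# Adding "\n" as an alternative lets the match run across lines.
all_lines = re.match(r"(.|\n)*", text).group(0)

print(repr(first_line_only))
print(repr(all_lines))
```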

Date processor

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "date": {
          "field": "date",
          "formats": [
            "MM/dd/yyyy HH:mm",
            "dd-MM-yyyy HH:mm:ssz"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "date": "03/25/2019 03:39"
      }
    },
    {
      "_source": {
        "date": "25-03-2019 03:39:00+01:00"
      }
    }
  ]
}

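Above, we define two date formats. If either of them matches, the date is parsed correctly and automatically assigned to the @timestamp field, which is the default target field. This works the same way as the date filter in Logstash. The result is: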

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "date" : "03/25/2019 03:39",
          "@timestamp" : "2019-03-25T03:39:00.000Z"
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:24:24.802381Z"
        }
      }
    },
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "date" : "25-03-2019 03:39:00+01:00",
          "@timestamp" : "2019-03-25T02:39:00.000Z"
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:24:24.802396Z"
        }
      }
    }
  ]
}
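
The "try each format in turn" behavior can be sketched with Python's datetime module. The format strings below are Python strptime equivalents of the two formats in the request, chosen for this illustration, and parse_date is a hypothetical helper. Values without an offset are treated as UTC here, matching the processor's default time zone.

```python
from datetime import datetime, timezone

# Python equivalents of "MM/dd/yyyy HH:mm" and "dd-MM-yyyy HH:mm:ssz".
FORMATS = ["%m/%d/%Y %H:%M", "%d-%m-%Y %H:%M:%S%z"]


def parse_date(value, formats=FORMATS):
    """Return the first successful parse, normalized to UTC."""
    for fmt in formats:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        if parsed.tzinfo is None:
            # No offset in the input: assume UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"no format matched {value!r}")


results = [
    parse_date(raw).isoformat()
    for raw in ["03/25/2019 03:39", "25-03-2019 03:39:00+01:00"]
]
print(results)
```

The second value carries a +01:00 offset, so it comes out one hour earlier once normalized to UTC, just as in the simulated output.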

Elasticsearch: Elastic Observability - Structuring data with pipelines
