In this article, we look at how to use Elasticsearch ingest nodes to structure and process data.

Dataset

In real-world data collection, data can come from many different sources and arrive in many different forms.

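The data may be ingested in a well-structured form, such as rows from a database, or it may be raw, unstructured data such as logs. How do we give unstructured data a structure so that we can analyze it with Elasticsearch?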

Structured data

In many cases, data is unstructured at the time it is ingested. Such data usually carries a single field called message. To structure it, we need to parse and transform the message field and turn its content into the fields we need. Let's look at an example. Suppose we have the following event:

{
    "message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"
}

Clearly this is unstructured: it contains just one field, message. We would like to turn it into:

{
    "@timestamp": "2019-09-29T00:39:02.912Z",
    "loglevel": "Debug",
    "status": "MyApp stopped"
}

This document is structured, which makes the data much easier to analyze, for example when aggregating it or visualizing it in Kibana.
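
To make the goal concrete, here is a minimal Python sketch of the same transformation done by hand. It is only an illustration of the target shape, not how Elasticsearch does it; the `structure` helper and its regular expression are assumptions introduced for this example.

```python
import re

# Hypothetical helper (not part of Elasticsearch): splits a raw log line of the
# form "<timestamp> [<level>] <text>" into the three fields we are after.
LINE = re.compile(r"^(?P<timestamp>\S+) \[(?P<loglevel>[^\]]+)\] (?P<status>.*)$")

def structure(doc):
    match = LINE.match(doc["message"])
    if match is None:
        return doc  # leave documents we cannot parse untouched
    return {
        "@timestamp": match.group("timestamp"),
        "loglevel": match.group("loglevel"),
        "status": match.group("status"),
    }

raw = {"message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"}
structured = structure(raw)
print(structured)
```

The rest of the article gets the same result with ingest processors, so no custom code has to run outside the cluster.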

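Next, consider a typical Elastic Stack architecture.

In it, there are two places where we can process the data:

Logstash and the ingest node. If you are not sure whether to choose Logstash or an ingest node, see my earlier article “我應該使用Logstash或是Elasticsearch ingest 節點?” (Should I use Logstash or an Elasticsearch ingest node?).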

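If your log data is not in a well-known format such as Apache or nginx, you need to build your own pipeline to process it. In this article, we show how to use Elasticsearch ingest processors to turn unstructured data into structured data, including: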

split
dissect
kv
grok
...

Ingest pipelines

An Elasticsearch pipeline is a set of processors:

It lets us preprocess data before it is indexed
Each processor can modify the documents that pass through it
Processing takes place on Elasticsearch's ingest nodes
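
The idea of a pipeline as an ordered chain of processors can be sketched in a few lines of Python. This is a conceptual model only; `run_pipeline`, `split_message`, and `add_environment` are hypothetical names made up for the illustration and do not correspond to any Elasticsearch API.

```python
from typing import Callable, Dict, List

Doc = Dict[str, object]
Processor = Callable[[Doc], Doc]

def split_message(doc: Doc) -> Doc:
    # Mimics a split processor: replace the string with a list of tokens.
    doc["message"] = str(doc["message"]).split(" ")
    return doc

def add_environment(doc: Doc) -> Doc:
    # Mimics a set processor: add a constant field.
    doc["environment"] = "prod"
    return doc

def run_pipeline(processors: List[Processor], doc: Doc) -> Doc:
    # Each processor receives the document produced by the previous one.
    for processor in processors:
        doc = processor(doc)
    return doc

result = run_pipeline(
    [split_message, add_environment],
    {"message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK"},
)
print(result)
```

The order matters: a processor only sees the fields that earlier processors have already produced.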

Defining an Elasticsearch ingest pipeline

We can define pipelines with the Ingest API, using a PUT request to _ingest/pipeline/ followed by the pipeline ID.
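
As a rough sketch, the request that defines a pipeline could be assembled like this in Python. The cluster address http://localhost:9200, the pipeline ID my_pipeline, and the description text are assumptions chosen to match the later examples. The code only builds the request and does not send it.

```python
import json
import urllib.request

# Assumed values for illustration: a local cluster and a pipeline id "my_pipeline".
ES_URL = "http://localhost:9200"
PIPELINE_ID = "my_pipeline"

pipeline = {
    "description": "Split the message field on spaces",
    "processors": [
        {"split": {"field": "message", "separator": " "}}
    ],
}

def build_put_pipeline_request(es_url, pipeline_id, body):
    # Builds (but does not send) the HTTP request for PUT _ingest/pipeline/<id>.
    return urllib.request.Request(
        url=f"{es_url}/_ingest/pipeline/{pipeline_id}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

request = build_put_pipeline_request(ES_URL, PIPELINE_ID, pipeline)
print(request.get_method(), request.full_url)
# To send it to a running cluster, pass the request to urllib.request.urlopen.
```

The same JSON body can be pasted into the Kibana Dev Tools console after a PUT _ingest/pipeline/my_pipeline line.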

We can test a pipeline with the _simulate endpoint:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": " "
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK"
      }
    }
  ]
}

Running the request above returns:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "message" : [
            "2019-09-29T00:39:02.912Z",
            "AppServer1",
            "STATUS_OK"
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T08:40:43.059569Z"
        }
      }
    }
  ]
}

How to use a pipeline

Once a pipeline is defined, and if you ship data to Elasticsearch with Filebeat, you can reference the pipeline in the Filebeat configuration file like this:

output.elasticsearch:
   hosts: ["http://localhost:9200"]
   pipeline: my_pipeline

You can also set a default pipeline directly on an Elasticsearch index:

PUT my_index
{
  "settings": {
    "default_pipeline": "my_pipeline"
  }
}

With this setting, my_pipeline is invoked automatically whenever data is indexed into my_index.

Examples

Dissect

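Let's look at a slightly more complex example, a message that would otherwise need both the split and kv processors to structure.

We want to extract the timestamp, the log level, and the status from the message, but we do not need the square brackets [ and ] around the log level. We can use the dissect processor: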

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} [%{loglevel}] %{status}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z [Debug] MyApp stopped"
      }
    }
  ]
}

The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "loglevel" : "Debug",
          "message" : "2019-09-29T00:39:02.912Z [Debug] MyApp stopped",
          "status" : "MyApp stopped"
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T09:10:33.720665Z"
        }
      }
    }
  ]
}
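
To build intuition for how a dissect pattern lines up with the message, here is a simplified Python stand-in. It is not Elasticsearch's implementation: the `dissect` function is hypothetical, it handles plain %{key} placeholders only, and it ignores every modifier.

```python
import re

def dissect(pattern, text):
    # Simplified stand-in for the dissect processor: every %{key} becomes a
    # capture group and the literal text between keys must match exactly.
    # Modifiers such as *, &, ? and -> are NOT handled here.
    keys = re.findall(r"%\{([^}]+)\}", pattern)
    literals = re.split(r"%\{[^}]+\}", pattern)
    regex = re.escape(literals[0])
    for index, literal in enumerate(literals[1:]):
        # Non-greedy groups stop at the next literal; the last group takes the rest.
        regex += "(.*)" if index == len(keys) - 1 else "(.*?)"
        regex += re.escape(literal)
    match = re.fullmatch(regex, text)
    if match is None:
        raise ValueError("pattern does not match")
    return dict(zip(keys, match.groups()))

fields = dissect(
    "%{@timestamp} [%{loglevel}] %{status}",
    "2019-09-29T00:39:02.912Z [Debug] MyApp stopped",
)
print(fields)
```

The literal characters in the pattern, here the spaces and the square brackets, act as delimiters and are dropped from the output, which is why the brackets do not appear in loglevel.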

Next, consider a message made up of key-value pairs:

{
  "message": "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK"
}

We can handle it with the dissect processor as well:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "Example using dissect processor key-value",
    "processors": [
      {
        "dissect": {
          "field": "message",
          "pattern": "%{@timestamp} %{*field1}=%{&field1} %{*field2}=%{&field2}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK"
      }
    }
  ]
}

Here, * and & are reference key modifiers, which change how dissect behaves: the text matched by %{*field1} becomes the name of a field, and the text matched by %{&field1} becomes that field's value. The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "host" : "AppServer",
          "message" : "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK",
          "status" : "STATUS_OK"
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T14:04:38.639127Z"
        }
      }
    }
  ]
}
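
The pairing of * and & keys can be illustrated with a small Python sketch. Again, this is a simplified model rather than the real dissect implementation, and `dissect_with_references` is a hypothetical function written for this example.

```python
import re

def dissect_with_references(pattern, text):
    # Same simplified matching as a plain dissect: one capture group per key.
    keys = re.findall(r"%\{([^}]+)\}", pattern)
    literals = re.split(r"%\{[^}]+\}", pattern)
    regex = re.escape(literals[0])
    for index, literal in enumerate(literals[1:]):
        regex += "(.*)" if index == len(keys) - 1 else "(.*?)"
        regex += re.escape(literal)
    match = re.fullmatch(regex, text)
    if match is None:
        raise ValueError("pattern does not match")

    fields, names, values = {}, {}, {}
    for key, captured in zip(keys, match.groups()):
        if key.startswith("*"):
            names[key[1:]] = captured    # captured text becomes a field NAME
        elif key.startswith("&"):
            values[key[1:]] = captured   # captured text becomes that field's VALUE
        else:
            fields[key] = captured
    # Join each *ref name with the value captured by the matching &ref key.
    for ref, name in names.items():
        fields[name] = values[ref]
    return fields

fields = dissect_with_references(
    "%{@timestamp} %{*field1}=%{&field1} %{*field2}=%{&field2}",
    "2019-09-29T00:39:02.912Z host=AppServer status=STATUS_OK",
)
print(fields)
```

The placeholder names field1 and field2 never show up in the output; they only tie each name capture to its value capture.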

Script processor

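Although the existing processors are very convenient, in practice many features are not among the built-in capabilities of Logstash or Elasticsearch. One option is to write your own plugin, but that can be a big undertaking. Instead, we can write a script to do the job, usually in Elasticsearch's Painless scripting language. If you want to learn more about Painless, you can find several introductory articles in “Elastic：菜鳥上手指南”.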

There are two ways to run a Painless script: inline or stored.

Inline scripts

The following example shows an inline script that sets a field called new_field:

PUT /_ingest/pipeline/my_script_pipeline
{
  "processors": [
    {
      "script": {
        "source": "ctx['new_field'] = params.value",
        "params": {
          "value": "Hello world"
        }
      }
    }
  ]
}

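Here we pass the value in through params. The advantage is that the source text never changes, so it is compiled only once. If the source changed from call to call, it would have to be recompiled every time, which is wasteful.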

Stored scripts

Scripts can also be stored in the cluster state and invoked later by script ID:

PUT _scripts/my_script
{
  "script": {
    "lang": "painless",
    "source": "ctx['new_field'] = params.value"
  }
}

PUT /_ingest/pipeline/my_script_pipeline
{
  "processors": [
    {
      "script": {
        "id": "my_script",
        "params": {
          "value": "Hello world!"
        }
      }
    }
  ]
}

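The two commands above do the same job as the inline version. In an ingest node context, we access a document's fields with ctx['new_field']. We can also access its metadata fields, for example ctx['_id'] = ctx['my_field'].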

Let's try a few exercises:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "script": {
          "lang": "painless",
          "source": "ctx['new_value'] = ctx['current_value'] + 1"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "current_value": 2
      }
    }
  ]
}

When the script runs, it creates a new field called new_value whose value is the value of current_value plus 1. The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "new_value" : 3,
          "current_value" : 2
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T14:49:35.775395Z"
        }
      }
    }
  ]
}

Our next example creates a stored script:

PUT _scripts/my_script
{
  "script": {
    "lang": "painless",
    "source": "ctx['new_value'] = ctx['current_value'] + params.value"
  }
}

PUT /_ingest/pipeline/my_script_pipeline
{
  "processors": [
    {
      "script": {
        "id": "my_script",
        "params": {
          "value": 1
        }
      }
    }
  ]
}

This does the same thing as the previous example. Run the two commands above first. To check that the pipeline works, we create two documents:

POST test_docs/_doc
{
  "current_value": 34
}

POST test_docs/_doc
{
  "current_value": 80
}

Then we run the following command:

POST test_docs/_update_by_query?pipeline=my_script_pipeline
{
  "query": {
    "range": {
      "current_value": {
        "gt": 30
      }
    }
  }
}

Here we use _update_by_query together with the pipeline to update our documents. Only documents whose current_value is greater than 30 are affected. The response is:

{
  "took" : 25,
  "timed_out" : false,
  "total" : 2,
  "updated" : 2,
  "deleted" : 0,
  "batches" : 1,
  "version_conflicts" : 0,
  "noops" : 0,
  "retries" : {
    "bulk" : 0,
    "search" : 0
  },
  "throttled_millis" : 0,
  "requests_per_second" : -1.0,
  "throttled_until_millis" : 0,
  "failures" : [ ]
}

It shows that two documents were updated. We can check them with:

GET test_docs/_search

The result:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test_docs",
        "_type" : "_doc",
        "_id" : "EIEnvHEBQHMgxFmxZyBq",
        "_score" : 1.0,
        "_source" : {
          "new_value" : 35,
          "current_value" : 34
        }
      },
      {
        "_index" : "test_docs",
        "_type" : "_doc",
        "_id" : "D4EnvHEBQHMgxFmxXyBp",
        "_score" : 1.0,
        "_source" : {
          "new_value" : 81,
          "current_value" : 80
        }
      }
    ]
  }
}

As we can see, the value of new_value is the value of current_value plus 1.
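
The combined effect of the range query and the stored script can be mimicked locally with plain Python. This is only a mental model: `my_script` here is a Python rendering of the Painless source, and the third document, with current_value 5, is a hypothetical addition that shows what the range filter leaves alone.

```python
def my_script(ctx, params):
    # Python rendering of the Painless source:
    # ctx['new_value'] = ctx['current_value'] + params.value
    ctx["new_value"] = ctx["current_value"] + params["value"]

# The first two documents come from the example; the third is hypothetical.
docs = [{"current_value": 34}, {"current_value": 80}, {"current_value": 5}]

updated = 0
for doc in docs:
    if doc["current_value"] > 30:  # mirrors the range query with "gt": 30
        my_script(doc, {"value": 1})
        updated += 1

print(updated, docs)
```

Documents that do not match the query never reach the pipeline, so they keep their original content.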

Let's look at another example:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "split": {
          "field": "message",
          "separator": " ", 
          "target_field": "split_message"
        }
      },
      {
        "set": {
          "field": "environment",
          "value": "prod"
        }
      },
      {
        "set": {
          "field": "@timestamp",
          "value": "{{split_message.0}}"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK"
      }
    }
  ]
}

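The first processor, split, splits message on " " and stores the result, an array, in the field split_message. Next, a set processor adds a field called environment with the value prod. Finally, a second set processor takes the first element of the split_message array and assigns it to @timestamp, which is a newly added field. The result is: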

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "environment" : "prod",
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "message" : "2019-09-29T00:39:02.912Z AppServer1 STATUS_OK",
          "split_message" : [
            "2019-09-29T00:39:02.912Z",
            "AppServer1",
            "STATUS_OK"
          ]
        },
        "_ingest" : {
          "timestamp" : "2020-04-27T15:35:00.922037Z"
        }
      }
    }
  ]
}

Grok processor

The grok processor matches a pattern against the message using regular expressions, which lets us extract structured data from it:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": [
            "%{TIMESTAMP_ISO8601:@timestamp} %{IP:client} \\[%{WORD:status}\\] %{NUMBER:duration}"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "2019-09-29T00:39:02.912Z 55.3.241.1 [OK] 0.043"
      }
    }
  ]
}

The response is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "duration" : "0.043",
          "@timestamp" : "2019-09-29T00:39:02.912Z",
          "client" : "55.3.241.1",
          "message" : "2019-09-29T00:39:02.912Z 55.3.241.1 [OK] 0.043",
          "status" : "OK"
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:16:52.155688Z"
        }
      }
    }
  ]
}
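
Since grok patterns are named regular expressions underneath, a rough Python equivalent of the pattern above may help to show what each part captures. The sub-patterns below are simplified stand-ins written for this example; the real TIMESTAMP_ISO8601, IP, WORD, and NUMBER definitions are more thorough.

```python
import re

# Rough stand-ins for the grok patterns used above (assumptions, not the
# actual definitions that ship with Elasticsearch).
LINE = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?) "
    r"(?P<client>\d{1,3}(?:\.\d{1,3}){3}) "
    r"\[(?P<status>\w+)\] "
    r"(?P<duration>\d+(?:\.\d+)?)"
)

message = "2019-09-29T00:39:02.912Z 55.3.241.1 [OK] 0.043"
match = LINE.match(message)
fields = match.groupdict()
# Python group names cannot contain "@", so rename after matching.
fields["@timestamp"] = fields.pop("timestamp")
print(fields)
```

As in the grok output, duration comes back as the string "0.043" because a regular expression capture is always text.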

The grok processor can also handle multi-line events well. For example:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "text",
          "patterns": ["%{GREEDYMULTILINE:allMyData}"],
          "pattern_definitions": {
            "GREEDYMULTILINE": "(.|\n)*"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "text": "This is a text \n secondline"
      }
    }
  ]
}

The result is:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "text" : """This is a text 
 secondline""",
          "allMyData" : """This is a text 
 secondline"""
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:31:38.913929Z"
        }
      }
    }
  ]
}

We can see that allMyData captures the data from every line in a single field. If instead we define the pattern with .* alone:

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "grok": {
          "field": "text",
          "patterns": ["%{GREEDYMULTILINE:allMyData}"],
          "pattern_definitions": {
            "GREEDYMULTILINE": ".*"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "text": "This is a text \n secondline"
      }
    }
  ]
}

then we get:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "text" : """This is a text 
 secondline""",
          "allMyData" : "This is a text "
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:35:59.67759Z"
        }
      }
    }
  ]
}

That is, it extracts only the first line, because . does not match a newline by default.
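
The same behaviour can be reproduced with Python's re module, which also excludes the newline from . by default. This is an analogy only, since grok uses a different regular expression engine.

```python
import re

text = "This is a text \n secondline"

# "." stops at the newline, so only the first line is captured.
first_line_only = re.match(r".*", text).group(0)

# Adding "\n" as an alternative lets the match run across lines.
all_lines = re.match(r"(.|\n)*", text).group(0)

print(repr(first_line_only))
print(repr(all_lines))
```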

Date processor

POST /_ingest/pipeline/_simulate
{
  "pipeline": {
    "processors": [
      {
        "date": {
          "field": "date",
          "formats": [
            "MM/dd/yyyy HH:mm",
            "dd-MM-yyyy HH:mm:ssz"
          ]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "date": "03/25/2019 03:39"
      }
    },
    {
      "_source": {
        "date": "25-03-2019 03:39:00+01:00"
      }
    }
  ]
}

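Here we define two date formats. If either one matches, the date is parsed and the result is written to the @timestamp field by default. This works the same way as the date filter in Logstash. The result is: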

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "date" : "03/25/2019 03:39",
          "@timestamp" : "2019-03-25T03:39:00.000Z"
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:24:24.802381Z"
        }
      }
    },
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "date" : "25-03-2019 03:39:00+01:00",
          "@timestamp" : "2019-03-25T02:39:00.000Z"
        },
        "_ingest" : {
          "timestamp" : "2020-04-28T00:24:24.802396Z"
        }
      }
    }
  ]
}
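
The try-each-format logic can be sketched with Python's datetime module. The strptime patterns below are approximate translations of the two Java-style formats, and treating a value without an offset as UTC is an assumption that matches the output above. `parse_date` is a hypothetical helper, not an Elasticsearch API, and the colon in the offset needs Python 3.7 or later.

```python
from datetime import datetime, timezone

# Approximate strptime equivalents of "MM/dd/yyyy HH:mm" and "dd-MM-yyyy HH:mm:ssz".
FORMATS = ["%m/%d/%Y %H:%M", "%d-%m-%Y %H:%M:%S%z"]

def parse_date(value):
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(value, fmt)
        except ValueError:
            continue  # this format did not match, try the next one
        if parsed.tzinfo is None:
            # No offset in the input: assume UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        # Normalize to UTC; these inputs carry no sub-second part, hence ".000".
        return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z")
    raise ValueError(f"no format matched {value!r}")

first = parse_date("03/25/2019 03:39")
second = parse_date("25-03-2019 03:39:00+01:00")
print(first)
print(second)
```

The second value carries a +01:00 offset, so it lands one hour earlier once converted to UTC.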
