Reading Elasticsearch nested arrays with Spark

anton

Elasticsearch arrays

In Elasticsearch there is no dedicated array type. By default, any field can hold zero or more values, as long as all values in the array share the same data type.
This means that when writing data we can ignore the distinction between an array and a single value. For example:

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}

PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : {
    "first" : "John",
    "last" :  "Smith"
  },
  "tag": "read"
}

Writing documents this way works fine, and queries return results as expected.
Query:

GET my_index/_search
{
  "query": {
    "match": {
      "user.first": "john"
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": {
            "first": "John",
            "last": "Smith"
          },
          "tag": "read"
        }
      },
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": [
            {
              "first": "John",
              "last": "Smith"
            },
            {
              "first": "Alice",
              "last": "White"
            }
          ],
          "tag": [
            "read",
            "write"
          ]
        }
      }
    ]
  }
}

Notice that user is an array in document 1 but a single object in document 2.

Reading Elasticsearch arrays with Spark

A problem appears as soon as we read this data with Spark. For example:

from pyspark.sql import SQLContext
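# `sc` is the SparkContext that the pyspark shell provides automatically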
sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.load("my_index/_doc")
df_test.show()

Error:

org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'user.first' not found; typically this occurs with arrays which are not mapped as single value

Since Elasticsearch can map one or more values to a field, elasticsearch-hadoop cannot determine from the mapping alone whether a field should be instantiated as a single value or as an array type. We therefore have to declare array fields explicitly, which is done with the es.read.field.as.array.include option.
Query again:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.option("es.read.field.as.array.include", "user, tag")\
.load("my_index/_doc")
df_test.show()

This time the error is:

scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)

Presumably this is because the user field is written inconsistently: as an array of objects in document 1 but as a bare object in document 2, so Spark cannot settle on a single schema.
Update document 2:

PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : [{
    "first" : "John",
    "last" :  "Smith"
  }],
  "tag": "read"
}

After updating the document, the result is:

+-----+-------------+--------------------+
|group|          tag|                user|
+-----+-------------+--------------------+
| fans|       [read]|      [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+

This shows that for arrays of primitive types it makes no difference to reading whether a single value is written as the value itself or as a one-element array; arrays of objects, however, should always be written as arrays.
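
Once the DataFrame loads correctly, the array columns can be processed with ordinary Spark SQL functions. A minimal sketch (reusing the df_test DataFrame and the column names from the example above) that flattens the user array with explode:

from pyspark.sql.functions import col, explode

# One output row per element of the user array.
df_users = df_test.select("group", explode("user").alias("u"))

# Pull the struct fields out into plain columns.
df_users.select("group", col("u.first"), col("u.last")).show()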

Reading nested arrays with Spark

What happens if the object array is declared as nested?

PUT my_nested_index
{
    "mappings": {
      "doc": {
        "properties": {
          "group": {
            "type": "text"
          },
          "tag": {
            "type": "text"
          },
          "user": {
            "type": "nested",
            "properties": {
              "first": {
                "type": "text"
              },
              "last": {
                "type": "text"
              }
            }
          }
        }
      }
    }
}

PUT my_nested_index/doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}

PUT my_nested_index/doc/2
{
  "group" : "fans",
  "user" : [{
    "last" :  "Smith",
    "first" : "John"
  }],
  "tag": "read"
}

Query:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.option("es.read.field.as.array.include", "user, tag")\
.load("my_nested_index/doc")

df_nested.show()

Error:

scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)

Remove user from es.read.field.as.array.include and query again:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.option("es.read.field.as.array.include", "tag")\
.load("my_nested_index/doc")

df_nested.show()

Result:

+-----+-------------+--------------------+
|group|          tag|                user|
+-----+-------------+--------------------+
| fans|       [read]|      [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+

If user is already declared as nested in the mapping, it should not (and need not) be listed in es.read.field.as.array.include; nested fields come back as arrays automatically.
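
To confirm this, the inferred schema can be inspected with printSchema; the nested user field should appear as an array of structs even though it was not listed in es.read.field.as.array.include. The output below is sketched from the mapping above, and the exact nullability flags may differ:

# Nested fields are mapped to array<struct<...>> by the connector.
df_nested.printSchema()
# root
#  |-- group: string (nullable = true)
#  |-- tag: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- user: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- first: string (nullable = true)
#  |    |    |-- last: string (nullable = true)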

Conclusions

  1. It is best to write all array fields to ES as arrays, even one-element ones; this keeps parsing simple and makes Spark reads reliable (see the normalization sketch after this list).
  2. If a field is an array, set es.read.field.as.array.include when reading; if the array is defined as nested, es.read.field.as.array.include is not needed.
  3. Design mappings sensibly and, where possible, avoid complex index structures.
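
A minimal sketch of point 1, using a hypothetical as_list helper (not from the original post) to normalize fields before indexing, so single values and arrays are always written in the same shape:

def as_list(value):
    # Wrap single values so every document stores the field as an array.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

doc = {
    "group": "fans",
    "user": as_list({"first": "John", "last": "Smith"}),  # always [{...}]
    "tag": as_list("read"),                               # always ["read"]
}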