Reading Elasticsearch nested arrays with Spark

anton

Elasticsearch arrays

In Elasticsearch there is no dedicated array type. By default, any field can hold zero or more values, as long as all values in the array share the same data type.
This means that when writing data we can ignore the distinction between an array and a single value. For example:

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}

PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : {
    "first" : "John",
    "last" :  "Smith"
  },
  "tag": "read"
}

Writing documents this way works fine, and queries return results as expected.
Query:

GET my_index/_search
{
  "query": {
    "match": {
      "user.first": "john"
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": {
            "first": "John",
            "last": "Smith"
          },
          "tag": "read"
        }
      },
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": [
            {
              "first": "John",
              "last": "Smith"
            },
            {
              "first": "Alice",
              "last": "White"
            }
          ],
          "tag": [
            "read",
            "write"
          ]
        }
      }
    ]
  }
}

Notice that user is an array in document 1 but a single object in document 2.

Reading Elasticsearch arrays with Spark

A problem appears as soon as we read this data with Spark. For example:

from pyspark.sql import SQLContext
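# `sc` is the SparkContext that the pyspark shell provides automatically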
sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.load("my_index/_doc")
df_test.show()

Error:

org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'user.first' not found; typically this occurs with arrays which are not mapped as single value

Since Elasticsearch can map one or more values to a field, elasticsearch-hadoop cannot determine from the mapping alone whether a field should be instantiated as a single value or as an array type. We therefore have to declare array fields explicitly, which is done with the es.read.field.as.array.include option.
Query again:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.option("es.read.field.as.array.include", "user, tag")\
.load("my_index/_doc")
df_test.show()

This time the error is:

scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)

Presumably this is because the user field is written inconsistently: as an array of objects in document 1 but as a bare object in document 2, so Spark cannot settle on a single schema.
Update document 2:

PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : [{
    "first" : "John",
    "last" :  "Smith"
  }],
  "tag": "read"
}

After updating the document, the result is:

+-----+-------------+--------------------+
|group|          tag|                user|
+-----+-------------+--------------------+
| fans|       [read]|      [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+

This shows that for arrays of primitive types it makes no difference to reading whether a single value is written as the value itself or as a one-element array; arrays of objects, however, should always be written as arrays.
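
Once the DataFrame loads correctly, the array columns can be processed with ordinary Spark SQL functions. A minimal sketch (reusing the df_test DataFrame and the column names from the example above) that flattens the user array with explode:

from pyspark.sql.functions import col, explode

# One output row per element of the user array.
df_users = df_test.select("group", explode("user").alias("u"))

# Pull the struct fields out into plain columns.
df_users.select("group", col("u.first"), col("u.last")).show()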

Reading nested arrays with Spark

What happens if the object array is declared as nested?

PUT my_nested_index
{
    "mappings": {
      "doc": {
        "properties": {
          "group": {
            "type": "text"
          },
          "tag": {
            "type": "text"
          },
          "user": {
            "type": "nested",
            "properties": {
              "first": {
                "type": "text"
              },
              "last": {
                "type": "text"
              }
            }
          }
        }
      }
    }
}

PUT my_nested_index/doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}

PUT my_nested_index/doc/2
{
  "group" : "fans",
  "user" : [{
    "last" :  "Smith",
    "first" : "John"
  }],
  "tag": "read"
}

Query:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.option("es.read.field.as.array.include", "user, tag")\
.load("my_nested_index/doc")

df_nested.show()

Error:

scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)

Remove user from es.read.field.as.array.include and query again:

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
.option("es.nodes", "xx.xx.xx.xx")\
.option("es.nodes.wan.only", "true")\
.option("es.port", "9200")\
.option("es.read.field.as.array.include", "tag")\
.load("my_nested_index/doc")

df_nested.show()

Result:

+-----+-------------+--------------------+
|group|          tag|                user|
+-----+-------------+--------------------+
| fans|       [read]|      [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+

If user is already declared as nested in the mapping, it should not (and need not) be listed in es.read.field.as.array.include; nested fields come back as arrays automatically.
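
To confirm this, the inferred schema can be inspected with printSchema; the nested user field should appear as an array of structs even though it was not listed in es.read.field.as.array.include. The output below is sketched from the mapping above, and the exact nullability flags may differ:

# Nested fields are mapped to array<struct<...>> by the connector.
df_nested.printSchema()
# root
#  |-- group: string (nullable = true)
#  |-- tag: array (nullable = true)
#  |    |-- element: string (containsNull = true)
#  |-- user: array (nullable = true)
#  |    |-- element: struct (containsNull = true)
#  |    |    |-- first: string (nullable = true)
#  |    |    |-- last: string (nullable = true)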

Conclusions

  1. It is best to write all array fields to ES as arrays, even one-element ones; this keeps parsing simple and makes Spark reads reliable (see the normalization sketch after this list).
  2. If a field is an array, set es.read.field.as.array.include when reading; if the array is defined as nested, es.read.field.as.array.include is not needed.
  3. Design mappings sensibly and, where possible, avoid complex index structures.
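
A minimal sketch of point 1, using a hypothetical as_list helper (not from the original post) to normalize fields before indexing, so single values and arrays are always written in the same shape:

def as_list(value):
    # Wrap single values so every document stores the field as an array.
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

doc = {
    "group": "fans",
    "user": as_list({"first": "John", "last": "Smith"}),  # always [{...}]
    "tag": as_list("read"),                               # always ["read"]
}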