Spark Reading Elasticsearch Nested Arrays

anton

Elasticsearch Arrays

In Elasticsearch there is no dedicated array type. By default, any field can contain zero or more values, as long as all values in the array share the same data type.
When writing data, we can therefore ignore the distinction between an array and a single value. For example:

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}

PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : {
    "first" : "John",
    "last" :  "Smith"
  },
  "tag": "read"
}

Writing documents this way works fine, and queries return the expected results.
Query:

GET my_index/_search
{
  "query": {
    "match": {
      "user.first": "john"
    }
  }
}

Result:

{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": {
            "first": "John",
            "last": "Smith"
          },
          "tag": "read"
        }
      },
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": [
            {
              "first": "John",
              "last": "Smith"
            },
            {
              "first": "Alice",
              "last": "White"
            }
          ],
          "tag": [
            "read",
            "write"
          ]
        }
      }
    ]
  }
}

Note that in document 1 the user field is an array, while in document 2 it is a single object.

Spark Reading Elasticsearch Arrays

Problems appear as soon as we read this data with Spark. For example:

from pyspark.sql import SQLContext

# `sc` is an existing SparkContext (e.g. from the pyspark shell)
sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .load("my_index/_doc")
df_test.show()

Error:

org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'user.first' not found; typically this occurs with arrays which are not mapped as single value

Since Elasticsearch can map one or multiple values to a field, elasticsearch-hadoop cannot determine from the mapping whether to instantiate a single value or an array type (which depends on the library in use). We therefore need to declare array fields explicitly, via the es.read.field.as.array.include setting.
Querying again:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .option("es.read.field.as.array.include", "user, tag")\
    .load("my_index/_doc")
df_test.show()

This time the error is:

scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)

Presumably this is caused by the inconsistent shape of the user field: with the array option set, Spark expects user to be an array of structs, but document 2 stores it as a single object, so the row fails to match the inferred schema.
Updating document 2 so that user is an array as well:

PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : [{
    "first" : "John",
    "last" :  "Smith"
  }],
  "tag": "read"
}

After updating the document, the read succeeds:

+-----+-------------+--------------------+
|group|          tag|                user|
+-----+-------------+--------------------+
| fans|       [read]|      [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+

As we can see, for arrays of primitive types it makes no difference whether a single value is written as the value itself or as a one-element array; for arrays of objects, it is best to always write them as arrays.
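
To avoid the shape mismatch at the source, single objects can be wrapped in a list at indexing time. Below is a minimal sketch, assuming the official elasticsearch Python client; the host address and the index_doc helper are illustrative, not part of the original setup:

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://xx.xx.xx.xx:9200"])  # illustrative address

def index_doc(doc_id, group, users, tags):
    # Hypothetical helper: always store `user` and `tag` as lists, even
    # for a single element, so every document has the same array shape.
    body = {
        "group": group,
        "user": users if isinstance(users, list) else [users],
        "tag": tags if isinstance(tags, list) else [tags],
    }
    es.index(index="my_index", doc_type="_doc", id=doc_id, body=body)

index_doc(2, "fans", {"first": "John", "last": "Smith"}, "read")

With this convention, document 2's user field is written as a one-element array and the Spark read no longer hits the MatchError.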

Spark Reading Nested Arrays

What happens if the object array is declared as nested?

PUT my_nested_index
{
    "mappings": {
      "doc": {
        "properties": {
          "group": {
            "type": "text"
          },
          "tag": {
            "type": "text"
          },
          "user": {
            "type": "nested",
            "properties": {
              "first": {
                "type": "text"
              },
              "last": {
                "type": "text"
              }
            }
          }
        }
      }
    }
}

PUT my_nested_index/doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}

PUT my_nested_index/doc/2
{
  "group" : "fans",
  "user" : [{
    "last" :  "Smith",
    "first" : "John"
  }],
  "tag": "read"
}

Query:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .option("es.read.field.as.array.include", "user, tag")\
    .load("my_nested_index/doc")

df_nested.show()

Error:

scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)

Removing user from es.read.field.as.array.include and querying again:

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .option("es.read.field.as.array.include", "tag")\
    .load("my_nested_index/doc")

df_nested.show()

Result:

+-----+-------------+--------------------+
|group|          tag|                user|
+-----+-------------+--------------------+
| fans|       [read]|      [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+

If user is already declared as nested in the mapping, it should not be listed in es.read.field.as.array.include: the connector already reads nested fields as arrays of structs.
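
Once loaded, the nested user column can be flattened with standard Spark SQL functions. A minimal sketch, assuming the df_nested DataFrame from the previous read:

from pyspark.sql.functions import col, explode

# Inspect the inferred schema: with a nested mapping, user is read as
# an array of structs.
df_nested.printSchema()

# Produce one row per user, pulling the struct fields out as columns.
df_flat = df_nested.select(
    "group",
    "tag",
    explode(col("user")).alias("u"),
).select("group", "tag", "u.first", "u.last")
df_flat.show()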

Conclusions

  1. Arrays are best always written to ES as arrays; this keeps parsing simple and makes the data easy to read from Spark.
  2. If a field is an array, set es.read.field.as.array.include when reading; if the array is defined as nested, es.read.field.as.array.include is not needed.
  3. Design the mapping carefully and, where possible, avoid overly complex index structures.