anton
Spark reading Elasticsearch arrays
Elasticsearch arrays
In Elasticsearch there is no dedicated array data type. By default, any field can contain zero or more values, though all values in an array must have the same data type.
So when writing data, we can ignore the distinction between an array and a single value. For example:
PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}
PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : {
    "first" : "John",
    "last" : "Smith"
  },
  "tag": "read"
}
Writing the documents this way causes no problems, and queries return results as expected.
Query:
GET my_index/_search
{
  "query": {
    "match": {
      "user.first": "john"
    }
  }
}
Result:
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": {
            "first": "John",
            "last": "Smith"
          },
          "tag": "read"
        }
      },
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "group": "fans",
          "user": [
            {
              "first": "John",
              "last": "Smith"
            },
            {
              "first": "Alice",
              "last": "White"
            }
          ],
          "tag": [
            "read",
            "write"
          ]
        }
      }
    ]
  }
}
Notice that document 1's user field is an array, while document 2's user field is a single object.
Spark reading Elasticsearch arrays
Problems appear when we read this data with Spark. For example:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .load("my_index/_doc")
df_test.show()
Error:
org.elasticsearch.hadoop.EsHadoopIllegalStateException: Field 'user.first' not found; typically this occurs with arrays which are not mapped as single value
Since Elasticsearch can map one or more values to a field, elasticsearch-hadoop cannot determine from the mapping alone whether to instantiate a single value or an array type (this depends on the library type). We therefore need to declare array fields explicitly, via the es.read.field.as.array.include setting.
Query again:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_test = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .option("es.read.field.as.array.include", "user, tag")\
    .load("my_index/_doc")
df_test.show()
The error now is:
scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)
Presumably this is caused by the inconsistent type of the user field across documents.
Update document 2:
PUT my_index/_doc/2
{
  "group" : "fans",
  "user" : [{
    "first" : "John",
    "last" : "Smith"
  }],
  "tag": "read"
}
After the document is updated, the read succeeds:
+-----+-------------+--------------------+
|group| tag| user|
+-----+-------------+--------------------+
| fans| [read]| [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+
As we can see, for arrays of primitive types it makes no difference on read whether a single value was written as a bare value or as a one-element array; for arrays of objects, it is best to always write them as arrays.
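Following the observation above, one way to avoid the type mismatch is to normalize documents before indexing, so that object fields are always written as arrays. A minimal sketch (the normalize_array_fields helper and the ARRAY_FIELDS set are hypothetical, not part of elasticsearch-hadoop or the Elasticsearch client):

```python
# Hypothetical per-index configuration: fields that should always be arrays.
ARRAY_FIELDS = {"user", "tag"}

def normalize_array_fields(doc, array_fields=ARRAY_FIELDS):
    """Wrap single values in a list for fields that should be arrays,
    leaving fields that are already lists (or absent) untouched."""
    out = dict(doc)
    for field in array_fields:
        value = out.get(field)
        if value is not None and not isinstance(value, list):
            out[field] = [value]
    return out

doc = {
    "group": "fans",
    "user": {"first": "John", "last": "Smith"},
    "tag": "read",
}
normalized = normalize_array_fields(doc)
# user and tag become single-element arrays; group is untouched
```

Running every document through such a helper before the PUT keeps the stored shape consistent, which is exactly what the Spark read above requires.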
Spark reading nested arrays
What happens if the object array is declared as nested?
PUT my_nested_index
{
  "mappings": {
    "doc": {
      "properties": {
        "group": {
          "type": "text"
        },
        "tag": {
          "type": "text"
        },
        "user": {
          "type": "nested",
          "properties": {
            "first": {
              "type": "text"
            },
            "last": {
              "type": "text"
            }
          }
        }
      }
    }
  }
}
PUT my_nested_index/doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" : "Smith"
    },
    {
      "first" : "Alice",
      "last" : "White"
    }
  ],
  "tag": [
    "read",
    "write"
  ]
}
PUT my_nested_index/doc/2
{
  "group" : "fans",
  "user" : [{
    "last" : "Smith",
    "first" : "John"
  }],
  "tag": "read"
}
Query:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .option("es.read.field.as.array.include", "user, tag")\
    .load("my_nested_index/doc")
df_nested.show()
This fails with:
scala.MatchError: [John,Smith] (of class org.elasticsearch.spark.sql.ScalaEsRow)
Remove user from es.read.field.as.array.include and query again:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df_nested = sqlContext.read.format("org.elasticsearch.spark.sql")\
    .option("es.nodes", "xx.xx.xx.xx")\
    .option("es.nodes.wan.only", "true")\
    .option("es.port", "9200")\
    .option("es.read.field.as.array.include", "tag")\
    .load("my_nested_index/doc")
df_nested.show()
Result:
+-----+-------------+--------------------+
|group| tag| user|
+-----+-------------+--------------------+
| fans| [read]| [[John,Smith]]|
| fans|[read, write]|[[John,Smith], [A...|
+-----+-------------+--------------------+
If user is already declared as nested in the mapping, it must not be included in es.read.field.as.array.include.
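This observation suggests a simple rule for deriving the read option from an index mapping: list a field in es.read.field.as.array.include only when it holds multiple values and is not mapped as nested. A sketch under that assumption (the array_include_fields helper is hypothetical; only the option key itself comes from elasticsearch-hadoop):

```python
def array_include_fields(mapping_properties, array_fields):
    """Return the comma-separated value for es.read.field.as.array.include:
    nested fields are excluded, since elasticsearch-hadoop already
    treats them as arrays of structs."""
    include = []
    for field in array_fields:
        props = mapping_properties.get(field, {})
        if props.get("type") != "nested":
            include.append(field)
    return ",".join(include)

# The properties section of the my_nested_index mapping shown above
# (inner sub-field definitions omitted for brevity).
properties = {
    "group": {"type": "text"},
    "tag": {"type": "text"},
    "user": {"type": "nested"},
}
opt = array_include_fields(properties, ["user", "tag"])
# user is nested, so only tag needs to be declared as an array
```

The resulting string can then be passed to .option("es.read.field.as.array.include", opt) in the reader shown above.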
Conclusion
- It is best to write arrays to Elasticsearch consistently as arrays; this simplifies both parsing and reading with Spark.
- If a field holds arrays, set es.read.field.as.array.include when reading; if the array is mapped as nested, it must not be listed in es.read.field.as.array.include.
- Design mappings carefully and, where possible, avoid overly complex index structures.