Reading and Writing Elasticsearch with Spark

Integrating Elasticsearch with Scala code is already quite common.

Just add a Maven dependency:

<dependency>
	<groupId>org.elasticsearch</groupId>
	<artifactId>elasticsearch-hadoop</artifactId>
	<version>6.1.0</version>
</dependency>

Then, with a small piece of code, you can write MySQL data into Elasticsearch. Very convenient:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

val sconf = new SparkConf()
  .setAppName(this.getClass.getName)
  .setMaster("local[5]")
  .set("spark.testing.memory", "471859200")
  .set("es.nodes", "xxx")
  .set("es.port", "9200")
  .set("es.index.auto.create", "true")
  .set("es.nodes.wan.only", "true")
val spark = SparkSession.builder().config(sconf).getOrCreate()
spark.sparkContext.setLogLevel("WARN")

// Read the MySQL table into a DataFrame over JDBC
val dataDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://xxx:3306/database?characterEncoding=utf8&useSSL=false")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "123")
  .option("dbtable", "table")
  .load()

// Write the DataFrame to the Elasticsearch index "test_index", type "doc"
EsSparkSQL.saveToEs(dataDF, "test_index/doc")

Reading and writing with PySpark is also very simple; you just need to download the corresponding jar package.

Download address:
https://www.elastic.co/cn/downloads/past-releases/elasticsearch-apache-hadoop-6-4-1

This package contains several jars for integrating with Hive, Pig, MapReduce, Storm, Spark, and other frameworks. What we need this time is the Spark integration, so copy the elasticsearch-hadoop-6.4.1.jar inside it into Spark's jars folder.
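If you prefer not to copy the jar into Spark's jars folder, you can also point the session at it through the spark.jars configuration. A minimal PySpark sketch (the jar path below is just an assumption; adjust it to wherever you unpacked the download):

from pyspark.sql import SparkSession

# Hypothetical local path to the downloaded ES-Hadoop jar (adjust as needed)
es_jar = r'D:\data\elasticsearch-hadoop-6.4.1\dist\elasticsearch-hadoop-6.4.1.jar'

spark = SparkSession.builder \
    .appName('es_demo') \
    .config('spark.jars', es_jar) \
    .getOrCreate()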

A few examples from the official documentation:

Scala

Reading

Create an esRDD and specify the query:

import org.elasticsearch.spark._

..
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")

Spark SQL

import org.elasticsearch.spark.sql._

// DataFrame schema automatically inferred
val df = sqlContext.read.format("es").load("buckethead/albums")

// operations get pushed down and translated at runtime to Elasticsearch QueryDSL
val playlist = df.filter(df("category").equalTo("pikes").and(df("year").geq(2016)))

Writing

Import the org.elasticsearch.spark._ package to gain saveToEs methods on your RDDs:

import org.elasticsearch.spark._        

val conf = ...
val sc = new SparkContext(conf)         

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")

sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")

Spark SQL

import org.elasticsearch.spark.sql._

val df = sqlContext.read.json("examples/people.json")
df.saveToEs("spark/people")

Java

In a Java environment, use the org.elasticsearch.spark.rdd.api.java package, in particular the JavaEsSpark class.

Reading

To read data from ES, create a dedicated RDD and specify the query as an argument.

import org.apache.spark.api.java.JavaSparkContext;   
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark; 

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);   

JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists");

Spark SQL

SQLContext sql = new SQLContext(sc);
DataFrame df = sql.read().format("es").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2016)));

Writing

Use JavaEsSpark to index any RDD to Elasticsearch:

import org.elasticsearch.spark.rdd.api.java.JavaEsSpark; 

SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf); 

Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);     
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");

JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");

Spark SQL

import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;

DataFrame df = sqlContext.read().json("examples/people.json");
JavaEsSparkSQL.saveToEs(df, "spark/docs");

This time, open PyCharm and write the test file data into Elasticsearch:

import os
import sys
from pyspark.sql import SparkSession, Row

os.environ['SPARK_HOME'] =r'D:\data\spark-2.3.3-bin-hadoop2.6'
sys.path.append(r'D:\data\spark-2.3.3-bin-hadoop2.6\python')

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
test_json_file = sc.textFile("xxx.file")
# get_base_info is the parsing function (a hypothetical sketch is given below)
file_map = test_json_file.map(lambda line: get_base_info(line))
# convert the parsed records into a DataFrame
file_info = spark.createDataFrame(file_map)

# write the result into Elasticsearch
file_info.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "xxx") \
    .option("es.resource", "xxx/doc") \
    .mode('append') \
    .save()
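The get_base_info function above is only described as a parsing method; its body is not shown in the original. A minimal hypothetical version, assuming each line of the test file is a JSON document and that the field names below are placeholders, might look like this:

import json
from pyspark.sql import Row

def get_base_info(line):
    # Hypothetical parser: turn one JSON line into a Row for createDataFrame
    record = json.loads(line)
    # Field names here are assumptions; sum_rec matches the query used later
    return Row(
        doc_id=record.get("id"),
        title=record.get("title"),
        sum_rec=record.get("sum_rec"),
    )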

You can also read from Elasticsearch whenever you like and load the result as a table:

query = """
    {   
     "query": {
        "match": {
          "sum_rec":"xxx"
        }
      }
    }"""
spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "ip") \
    .option("es.resource", "xxx/doc") \
    .option("es.input.json","yes") \
    .option("es.index.read.missing.as.empty","true") \
    .option("es.query",query) \
    .load().registerTempTable("temp")
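Once the temp table is registered, it can be queried with ordinary Spark SQL. A small sketch (sum_rec comes from the query above; the aggregation itself is just for illustration):

# query the registered temp table like any other table
result = spark.sql("SELECT sum_rec, COUNT(*) AS cnt FROM temp GROUP BY sum_rec")
result.show(10)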

According to the Elasticsearch official documentation:
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html

it can also be combined with Spark Streaming:

import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import scala.collection.mutable

import org.elasticsearch.spark.streaming._

...

val conf = ...
val sc = new SparkContext(conf)                      
val ssc = new StreamingContext(sc, Seconds(1))       

val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("arrival" -> "Otopeni", "SFO" -> "San Fran")

val rdd = sc.makeRDD(Seq(numbers, airports))
val microbatches = mutable.Queue(rdd)                

ssc.queueStream(microbatches).saveToEs("spark/docs") 

ssc.start()
ssc.awaitTermination() 

Or:

import org.apache.spark.SparkContext
import org.elasticsearch.spark.streaming.EsSparkStreaming
import scala.collection.mutable

// reuse the sc and ssc created in the previous example

// define a case class
case class Trip(departure: String, arrival: String)

val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")

val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
val microbatches = mutable.Queue(rdd)                             
val dstream = ssc.queueStream(microbatches)

EsSparkStreaming.saveToEs(dstream, "spark/docs")                  

ssc.start()    
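The same documentation page also covers Spark Structured Streaming support in ES-Hadoop 6.x. A minimal PySpark sketch, assuming a socket source purely for illustration and using illustrative option values:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es_structured_streaming").getOrCreate()

# illustrative streaming source; in practice this would be Kafka, files, etc.
lines = spark.readStream \
    .format("socket") \
    .option("host", "localhost") \
    .option("port", 9999) \
    .load()

# "es" is the ES-Hadoop data source; checkpointLocation is required for the sink
query = lines.writeStream \
    .format("es") \
    .option("checkpointLocation", "/tmp/es-checkpoint") \
    .option("es.nodes", "xxx") \
    .start("spark/docs")

query.awaitTermination()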

For deployment to the production environment, just put the jar into the production Spark client and restart Spark, or specify it with --jars.

When running PySpark in production, you may also need to add the spark.driver.memory configuration:

spark = SparkSession.builder.config("spark.driver.memory","2g").getOrCreate()