Integrating Scala code with Elasticsearch is already very common. Just add a Maven dependency:
<dependency>
<groupId>org.elasticsearch</groupId>
<artifactId>elasticsearch-hadoop</artifactId>
<version>6.1.0</version>
</dependency>
Then a short snippet of code writes MySQL data into Elasticsearch, which is very convenient:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.elasticsearch.spark.sql.EsSparkSQL

val sconf = new SparkConf()
  .setAppName(this.getClass.getName)
  .setMaster("local[5]")
  .set("spark.testing.memory", "471859200")
  .set("es.nodes", "xxx")
  .set("es.port", "9200")
  .set("es.index.auto.create", "true")
  .set("es.nodes.wan.only", "true")
val spark = SparkSession.builder().config(sconf).getOrCreate()
spark.sparkContext.setLogLevel("WARN")
val dataDF = spark.read.format("jdbc")
  .option("url", "jdbc:mysql://xxx:3306/database?characterEncoding=utf8&useSSL=false")
  .option("driver", "com.mysql.jdbc.Driver")
  .option("user", "root")
  .option("password", "123")
  .option("dbtable", "table")
  .load()
EsSparkSQL.saveToEs(dataDF, "test_index/doc")
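Since the rest of this post works in PySpark, here is a rough sketch of the same MySQL-to-Elasticsearch pipeline expressed as PySpark option dictionaries. All hosts, credentials, and names are placeholders copied from the Scala snippet above, and `build_pipeline` is a hypothetical helper that needs a live MySQL instance and ES cluster to actually run.

```python
# Placeholder connection settings, mirroring the Scala example above.
jdbc_options = {
    "url": "jdbc:mysql://xxx:3306/database?characterEncoding=utf8&useSSL=false",
    "driver": "com.mysql.jdbc.Driver",
    "user": "root",
    "password": "123",
    "dbtable": "table",
}
es_options = {
    "es.nodes": "xxx",
    "es.port": "9200",
    "es.index.auto.create": "true",
    "es.nodes.wan.only": "true",
    "es.resource": "test_index/doc",
}

def build_pipeline(spark):
    # Read the MySQL table, then append every row into the ES index.
    df = spark.read.format("jdbc").options(**jdbc_options).load()
    (df.write
       .format("org.elasticsearch.spark.sql")
       .options(**es_options)
       .mode("append")
       .save())
```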
Reading and writing from PySpark is also straightforward; you just need to download the corresponding jar package.
Download link:
https://www.elastic.co/cn/downloads/past-releases/elasticsearch-apache-hadoop-6-4-1
This archive contains several jars for integrating with Hive, Pig, MapReduce, Storm, Spark, and other frameworks. Here we need the Spark integration, so copy elasticsearch-hadoop-6.4.1.jar into Spark's jars folder.
A few examples from the official documentation:
Scala
Reading
Create an esRDD and specify the query:
import org.elasticsearch.spark._
...
val conf = ...
val sc = new SparkContext(conf)
sc.esRDD("radio/artists", "?q=me*")
Spark SQL
import org.elasticsearch.spark.sql._
// DataFrame schema automatically inferred
val df = sqlContext.read.format("es").load("buckethead/albums")
// operations get pushed down and translated at runtime to Elasticsearch QueryDSL
val playlist = df.filter(df("category").equalTo("pikes").and(df("year").geq(2016)))
Writing
Import the org.elasticsearch.spark._ package to gain saveToEs methods on your RDDs:
import org.elasticsearch.spark._
val conf = ...
val sc = new SparkContext(conf)
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")
sc.makeRDD(Seq(numbers, airports)).saveToEs("spark/docs")
Spark SQL
import org.elasticsearch.spark.sql._
val df = sqlContext.read.json("examples/people.json")
df.saveToEs("spark/people")
Java
In a Java environment, use the org.elasticsearch.spark.rdd.api.java package, in particular the JavaEsSpark class.
Reading
To read data from ES, create a dedicated RDD and specify the query as an argument.
import org.apache.spark.api.java.JavaSparkContext;
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);
JavaPairRDD<String, Map<String, Object>> esRDD = JavaEsSpark.esRDD(jsc, "radio/artists");
Spark SQL
SQLContext sql = new SQLContext(sc);
DataFrame df = sql.read().format("es").load("buckethead/albums");
DataFrame playlist = df.filter(df.col("category").equalTo("pikes").and(df.col("year").geq(2016)));
Writing
Use JavaEsSpark
to index any RDD
to Elasticsearch:
import org.elasticsearch.spark.rdd.api.java.JavaEsSpark;
SparkConf conf = ...
JavaSparkContext jsc = new JavaSparkContext(conf);
Map<String, ?> numbers = ImmutableMap.of("one", 1, "two", 2);
Map<String, ?> airports = ImmutableMap.of("OTP", "Otopeni", "SFO", "San Fran");
JavaRDD<Map<String, ?>> javaRDD = jsc.parallelize(ImmutableList.of(numbers, airports));
JavaEsSpark.saveToEs(javaRDD, "spark/docs");
Spark SQL
import org.elasticsearch.spark.sql.api.java.JavaEsSparkSQL;
DataFrame df = sqlContext.read().json("examples/people.json");
JavaEsSparkSQL.saveToEs(df, "spark/docs");
This time, open PyCharm and write the test file's data into Elasticsearch:
import os
import sys
from pyspark.sql import SparkSession, Row

os.environ['SPARK_HOME'] = r'D:\data\spark-2.3.3-bin-hadoop2.6'
sys.path.append(r'D:\data\spark-2.3.3-bin-hadoop2.6\python')

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
test_json_file = sc.textFile("xxx.file")
# get_base_info is the parsing function
file_map = test_json_file.map(lambda line: get_base_info(line))
# convert the parsed data into a DataFrame
file_info = spark.createDataFrame(file_map)
# write the result into Elasticsearch
file_info.write \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "xxx") \
    .option("es.resource", "xxx/doc") \
    .mode('append') \
    .save()
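The get_base_info function above is the author's own parser and is not shown; a minimal sketch of what such a function might look like, assuming each input line is a JSON string (field names here are hypothetical):

```python
import json

def get_base_info(line):
    """Hypothetical parser: turn one JSON line into a flat dict
    that spark.createDataFrame can consume."""
    record = json.loads(line)
    return {
        "id": record.get("id"),
        "name": record.get("name"),
        "sum_rec": record.get("sum_rec"),
    }

row = get_base_info('{"id": 1, "name": "test", "sum_rec": "xxx"}')
```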
You can also read from Elasticsearch however you like and load the result as a table:
query = """
{
"query": {
"match": {
"sum_rec":"xxx"
}
}
}"""
spark.read \
    .format("org.elasticsearch.spark.sql") \
    .option("es.nodes", "ip") \
    .option("es.resource", "xxx/doc") \
    .option("es.input.json", "yes") \
    .option("es.index.read.missing.as.empty", "true") \
    .option("es.query", query) \
    .load().createOrReplaceTempView("temp")
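Rather than hand-writing the es.query string, it can help to build it from a dict with json.dumps, which guarantees valid JSON before it reaches Elasticsearch (the field name sum_rec is just the placeholder from the example above):

```python
import json

# Build the es.query body from a dict instead of a raw string,
# so malformed JSON is caught in Python rather than by Elasticsearch.
query_body = {"query": {"match": {"sum_rec": "xxx"}}}
query = json.dumps(query_body)
# query can then be passed to .option("es.query", query)
```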
According to the official Elasticsearch documentation
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
elasticsearch-hadoop can also be combined with Spark Streaming:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.StreamingContext._
import org.elasticsearch.spark.streaming._
import scala.collection.mutable
...
val conf = ...
val sc = new SparkContext(conf)
val ssc = new StreamingContext(sc, Seconds(1))
val numbers = Map("one" -> 1, "two" -> 2, "three" -> 3)
val airports = Map("OTP" -> "Otopeni", "SFO" -> "San Fran")
val rdd = sc.makeRDD(Seq(numbers, airports))
val microbatches = mutable.Queue(rdd)
ssc.queueStream(microbatches).saveToEs("spark/docs")
ssc.start()
ssc.awaitTermination()
Or:
import org.apache.spark.SparkContext
import org.apache.spark.streaming.StreamingContext
import org.elasticsearch.spark.streaming.EsSparkStreaming
import scala.collection.mutable
// define a case class
case class Trip(departure: String, arrival: String)
val upcomingTrip = Trip("OTP", "SFO")
val lastWeekTrip = Trip("MUC", "OTP")
val rdd = sc.makeRDD(Seq(upcomingTrip, lastWeekTrip))
val microbatches = mutable.Queue(rdd)
val dstream = ssc.queueStream(microbatches)
EsSparkStreaming.saveToEs(dstream, "spark/docs")
ssc.start()
For a production deployment, simply place the jar into the production Spark client's jars directory and restart Spark, or specify it with --jars.
When running PySpark in production you may also need to set spark.driver.memory:
spark = SparkSession.builder.config("spark.driver.memory", "2g").getOrCreate()