JanusGraph Bulk Loading Optimization

Bulk import tool:

https://github.com/dengziming/janusgraph-util

Bulk loading configuration options

storage.batch-loading=true
  • With batch loading enabled, the imported data must be internally consistent and consistent with the data already in the graph. (For example, if the name property has a unique composite index, the imported data must not duplicate existing values of name.)

  • The configuration options below all serve one goal: reducing total bulk-loading time.
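For reference, a unique composite index like the one mentioned above is declared through the JanusGraph management API. A minimal sketch (the property and index names are illustrative, not taken from the article's schema):

```groovy
// Run once in the Gremlin console against an open JanusGraph instance.
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).make()
// unique() makes JanusGraph reject duplicate values of name,
// which is exactly the constraint batch loading will NOT check for you.
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
```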

ID allocation optimization

ID Block Size

ids.block-size
  • Each JanusGraph instance assigns unique ids to newly added vertices and edges by reserving id blocks through its id pool manager. Acquiring an id block is expensive, because cluster-wide uniqueness must be guaranteed while multiple instances compete for blocks. Increasing ids.block-size therefore reduces how often blocks must be acquired, but a value that is too large wastes the unused ids in each block.
  • For ordinary transactional workloads the default value of ids.block-size is sufficient. For bulk loading, set it to roughly 10 times the number of vertices and edges each JanusGraph instance will add.
  • This option must be set to the same value on every JanusGraph instance in the cluster.
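As a sizing sketch: if a single instance is expected to load tens of millions of elements, a block size in the hundreds of millions keeps block acquisitions rare (the figure below matches the value the tuning section of this article ends up using):

```properties
# Each instance will load ~40M vertices/edges, so reserve id blocks
# about 10x that size to make block acquisition a rare event.
ids.block-size=100000000
```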

ID Acquisition Process

1) ids.authority.wait-time
  • In milliseconds: the maximum time the id pool manager waits for an id block acquisition to complete; if the block has not been obtained in time, the attempt fails. A suggested value is the sum of the 95th-percentile read and write times of the storage backend.
  • Set the same value on every instance in the cluster.
2) ids.renew-timeout
  • How many milliseconds the id pool manager waits after a failed id block acquisition before starting a new attempt. Set this as large as feasible.
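Putting the two options together, a hedged sketch (the latency figure is an assumption for illustration, not a measurement from the article's cluster):

```properties
# Assuming the backend's 95th-percentile read + write latency sums to ~300 ms.
ids.authority.wait-time=300
# Wait a long time (1 hour) before giving up on id block renewal entirely.
ids.renew-timeout=3600000
```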

Read and write optimization

Buffer Size

storage.buffer-size  (write buffer size)
  • JanusGraph buffers write operations and sends them to the backend in batches. This reduces the number of requests and avoids the failures caused by hitting the server with too many write requests in a short time.
  • If the buffer is too large, write latency increases, and with it the probability of request failure.
  • Recommendation: increase this value cautiously.

Read and Write Robustness

  • When a read or write against the storage backend fails (a large storage.buffer-size increases the chance of failure), these options control how many times the operation is retried before giving up.
storage.read-attempts   number of read attempts
storage.write-attempts  number of write attempts
storage.attempt-wait    interval in milliseconds between two attempts; for bulk loading this can be set higher
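A combined sketch of the read/write settings for a bulk-loading run (the values are illustrative assumptions, not recommendations from the article):

```properties
# Larger write buffer: fewer, bigger batches sent to the backend.
storage.buffer-size=10240
# Retry generously, since a loaded backend will fail transiently.
storage.read-attempts=10
storage.write-attempts=10
# Wait 1s between attempts to let the backend recover.
storage.attempt-wait=1000
```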

Strategies

Parallelizing the Load

  • If the storage backend can absorb enough requests, the bulk load can run in parallel on multiple machines, reducing total import time.
  • Chapter 35, JanusGraph with TinkerPop’s Hadoop-Gremlin, covers bulk loading through MapReduce.
  • Without Hadoop, a large graph can be split into smaller subgraphs and imported in parallel.
  • If the graph cannot be split that way, vertices and edges can be separated and loaded in parallel.

Hands-on walkthrough:

Basic setup

  1. Download JanusGraph from the official site and unpack it to /data/janusgraph/.
  2. Configure the storage and index backends. Since we use es + hbase, edit /data/janusgraph/conf/janusgraph-hbase-es.properties directly:
# important
gremlin.graph=org.janusgraph.core.JanusGraphFactory
# hbase settings
storage.batch-loading=true
storage.backend=hbase
storage.hostname=c1-nn1.bdp.idc,c1-nn2.bdp.idc,c1-nn3.bdp.idc
storage.hbase.ext.hbase.zookeeper.property.clientPort=2181
storage.hbase.table=yisou:test_graph
# es settings
index.search.backend=elasticsearch
# ES is installed locally only; this is the local machine's IP.
index.search.hostname=10.120.64.69
index.search.elasticsearch.client-only=true
index.search.index-name=yisou_test_graph
# default cache settings
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

3. Adjust the jars under /data/janusgraph/lib. Running the bulk import through yarn-client hit conflicts with guava and other jars, so I adjusted the jars in lib accordingly. Three jars were changed:

  1. hbase-client-1.2.4.jar ==> yisou-hbase-1.0-SNAPSHOT.jar
    The guava used by the bundled hbase-client-1.2.4.jar conflicts with the guava version on our yarn cluster, so we used our company-internal hbase-client with guava stripped out, yisou-hbase-1.0-SNAPSHOT.jar.
    Without the replacement, the job fails with "Caused by: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator"
  2. spark-assembly-1.6.1-hadoop2.6.0.jar ==> spark-assembly-1.6.2-hadoop2.6.0.jar
    The bundled spark-assembly-1.6.1-hadoop2.6.0.jar also causes a guava conflict; I replaced it with spark-assembly-1.6.2-hadoop2.6.0.jar.
    Without the replacement, the job fails with "java.lang.NoSuchMethodError: groovy.lang.MetaClassImpl.hasCustomStaticInvokeMethod()Z"
  3. Delete hbase-protocol-1.2.4.jar.
    Without the deletion, the job fails with "com.google.protobuf.ServiceException: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.protobuf.generated.RPCProtos$ConnectionHeader$Builder.setVersionInfo(Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$VersionInfo;)Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$ConnectionHeader$Builder;"

4. Define the vertex and edge properties of the graph; see the official documentation for details, which this article does not cover.

Configuring the bulk import

Because the import program runs on yarn, the hadoop environment must be set up. Two files need changes: the JanusGraph startup script /data/janusgraph/lib/gremlin.sh, and the hadoop/spark settings in /data/janusgraph/conf/hadoop-graph/hadoop-script.properties.

1. Copy /data/janusgraph/lib/gremlin.sh, say to yarn-gremlin.sh, then add the hadoop settings to JAVA_OPTIONS and CLASSPATH. This ensures the hadoop configuration is visible to the program, so the spark-on-yarn job can start normally.

#!/bin/bash
# HADOOP_HOME must be set before HADOOP_CONF_DIR, which derives from it.
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_OPTIONS="$JAVA_OPTIONS -Djava.library.path=$HADOOP_HOME/lib/native"
export CLASSPATH=$HADOOP_CONF_DIR
# JANUSGRAPH_HOME is the JanusGraph install directory, /data/janusgraph/
cd $JANUSGRAPH_HOME
./bin/gremlin.sh

2. Edit /data/janusgraph/conf/hadoop-graph/hadoop-script.properties.
The main changes: choose the inputFormat matching the files to import, point inputLocation at the hdfs path of the data, point scriptInputFormat.script at the parse script, and set the spark master to yarn-client.

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true

# hdfs path of the input file. Can also be set after loading this config file.
gremlin.hadoop.inputLocation=/user/yisou/taotian1/janus/data/fewData.test.dup
# path of the parse script that interprets each hdfs line. Can also be set after loading this config file.
gremlin.hadoop.scriptInputFormat.script=/user/yisou/taotian1/janus/data/conf/vertex_parse.groovy
#gremlin.hadoop.outputLocation=output

#
# SparkGraphComputer with Yarn Configuration
#
spark.master=yarn-client
spark.executor.memory=6g
spark.executor.instances=10
spark.executor.cores=2
spark.serializer=org.apache.spark.serializer.KryoSerializer
# spark.kryo.registrationRequired=true
# spark.storage.memoryFraction=0.2
# spark.eventLog.enabled=true
# spark.eventLog.dir=/tmp/spark-event-logs
# spark.ui.killEnabled=true

#cache config
gremlin.spark.persistContext=true
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
#gremlin.spark.persistStorageLevel=DISK_ONLY


#####################################
# GiraphGraphComputer Configuration #
#####################################
giraph.minWorkers=2
giraph.maxWorkers=3
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
# giraph.maxPartitionsInMemory=1
# giraph.userPartitionCount=2
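The vertex_parse.groovy referenced above must expose a parse(line, factory) method, which ScriptInputFormat invokes once per input line. The sketch below assumes a hypothetical tab-separated layout (id, label, name, then outgoing edges as label:targetId pairs); the article's actual script depends on its own file format and is not shown:

```groovy
// vertex_parse.groovy -- sketch of a ScriptInputFormat parse function.
// Assumed line format (illustrative): id<TAB>label<TAB>name<TAB>edgeLabel:targetId,...
def parse(String line, def factory) {
    def fields = line.split('\t')
    // Create the star vertex for this line.
    def v = factory.vertex(Long.valueOf(fields[0]), fields[1])
    v.property('name', fields[2])
    // Optional fourth field: comma-separated out-edges.
    if (fields.length > 3 && !fields[3].isEmpty()) {
        fields[3].split(',').each {
            def (edgeLabel, targetId) = it.split(':').toList()
            // Out-edge to a placeholder vertex carrying only the target id.
            factory.edge(v, factory.vertex(Long.valueOf(targetId)), edgeLabel)
        }
    }
    return v
}
```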

Running the bulk import

Startup command:

sh /data/janusgraph/lib/yarn-gremlin.sh

Bulk import commands:

local_root="/data/janusgraph"
hdfs_root="/user/yisou/taotian1/janus"
social_graph="${local_root}/conf/janusgraph-hbase-es.properties"
graph = GraphFactory.open("${local_root}/conf/hadoop-script.properties")
graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/data/fewData.test.dup")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/conf/vertex_parse.groovy")
blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()

Output:

sh /data/janusgraph/lib/yarn-gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/yisou-hbase-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21:22:00,392  INFO HadoopGraph:87 - HADOOP_GREMLIN_LIBS is set to: /data2/janusgraph-0.1.1-hadoop2/lib
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin>
gremlin> local_root="/data2/janusgraph-0.1.1-hadoop2/social"
==>/data2/janusgraph-0.1.1-hadoop2/social
gremlin> hdfs_root="/user/yisou/taotian1/janus"
==>/user/yisou/taotian1/janus
gremlin> social_graph="${local_root}/conf/janusgraph-hbase-es-social.properties"
==>/data2/janusgraph-0.1.1-hadoop2/social/conf/janusgraph-hbase-es-social.properties
gremlin> graph = GraphFactory.open("${local_root}/conf/hadoop-yarn.properties")
==>hadoopgraph[scriptinputformat->graphsonoutputformat]
gremlin> graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/tmp1person/")
==>null
gremlin> graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/person_parse.groovy")
==>null
gremlin> blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader, vertexIdProperty=bulkLoader.vertex.id, userSuppliedIds=false, keepOriginalIds=true, batchSize=0]
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
21:25:04,666  INFO deprecation:1173 - mapred.reduce.child.java.opts is deprecated. Instead, use mapreduce.reduce.java.opts
21:25:04,667  INFO deprecation:1173 - mapred.map.child.java.opts is deprecated. Instead, use mapreduce.map.java.opts
21:25:04,680  INFO KryoShimServiceLoader:117 - Set KryoShimService provider to org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c (class org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService) because its priority value (0) is the highest available
21:25:04,680  INFO KryoShimServiceLoader:123 - Configuring KryoShimService provider org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c with user-provided configuration
  21:25:10,479  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,505  INFO SparkContext:58 - Running Spark version 1.6.2
21:25:10,524  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,564  INFO SecurityManager:58 - Changing view acls to: yisou
21:25:10,565  INFO SecurityManager:58 - Changing modify acls to: yisou
21:25:10,566  INFO SecurityManager:58 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yisou); users with modify permissions: Set(yisou)
21:25:10,833  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,835  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:11,035  INFO Utils:58 - Successfully started service 'sparkDriver' on port 36502.
21:25:11,576  INFO Slf4jLogger:80 - Slf4jLogger started
  21:25:11,646  INFO Remoting:74 - Starting remoting
............
21:25:20,736  INFO Client:58 - Submitting application 2727164 to ResourceManager
21:25:20,771  INFO YarnClientImpl:273 - Submitted application application_1466564207556_2727164
21:25:21,780  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:21,785  INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.yisou
start time: 1500297920750
final status: UNDEFINED
tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
21:25:22,787  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:23,789  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:24,791  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:25,793  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:39,585  INFO JettyUtils:58 - Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
21:25:39,823  INFO Client:58 - Application report for application_1466564207556_2727164 (state: RUNNING)
21:25:39,824  INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.130.1.50
ApplicationMaster RPC port: 0
queue: root.yisou
start time: 1500297920750
final status: UNDEFINED
tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
..........
21:25:42,864  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-codec-1.7.jar at http://10.130.64.69:38209/jars/commons-codec-1.7.jar with timestamp 1500297942864
21:25:42,866  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-lang-2.5.jar at http://10.130.64.69:38209/jars/commons-lang-2.5.jar with timestamp 1500297942866
21:25:42,869  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-collections-3.2.2.jar at http://10.130.64.69:38209/jars/commons-collections-3.2.2.jar with timestamp 1500297942869
21:25:42,872  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-io-2.3.jar at http://10.130.64.69:38209/jars/commons-io-2.3.jar with timestamp 1500297942872
21:25:42,874  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/jetty-util-6.1.26.jar at http://10.130.64.69:38209/jars/jetty-util-6.1.26.jar with timestamp 1500297942874
21:25:42,879  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/htrace-core-3.1.0-incubating.jar at http://10.130.64.69:38209/jars/htrace-core-3.1.0-incubating.jar with timestamp 1
............
21:26:14,751  INFO MapOutputTrackerMaster:58 - Size of output statuses for shuffle 2 is 146 bytes
21:26:14,767  INFO TaskSetManager:58 - Finished task 0.0 in stage 6.0 (TID 4) in 40 ms on c1-dn31.bdp.idc (1/1)
21:26:14,767  INFO YarnScheduler:58 - Removed TaskSet 6.0, whose tasks have all completed, from pool
21:26:14,767  INFO DAGScheduler:58 - ResultStage 6 (foreachPartition at SparkExecutor.java:173) finished in 0.042 s
21:26:14,768  INFO DAGScheduler:58 - Job 1 finished: foreachPartition at SparkExecutor.java:173, took 1.776125 s
21:26:14,775  INFO ShuffledRDD:58 - Removing RDD 2 from persistence list
21:26:14,785  INFO BlockManager:58 - Removing RDD 2
==>result[hadoopgraph[scriptinputformat->graphsonoutputformat],memory[size:0]]
gremlin> 21:26:22,515  INFO YarnClientSchedulerBackend:58 - Registered executor NettyRpcEndpointRef(null) (c1-dn9.bdp.idc:60762) with ID 8

Bulk loading performance tuning

Without any tuning, JanusGraph bulk loading is very slow: importing 40 million records took about 3.5 hours. After the tuning below, it drops to about 1 hour.
1. Increase ids.block-size and storage.buffer-size (in janusgraph-hbase-es.properties).
ids.block-size=100000000
storage.buffer-size=102400

2. Specify the initial number of hbase regions (in janusgraph-hbase-es.properties).
storage.hbase.region-count=50

3. Import vertices and edges together, rather than splitting them into separate files loaded separately. For the format, see /data/janusgraph/data/grateful-dead.txt.

Summary

This article showed how to configure JanusGraph to bulk load vertices and edges through yarn-client.

The walkthrough has two parts: basic setup and bulk-import configuration. In the basic setup, watch out for conflicts between the jars bundled with JanusGraph and the jars in your yarn environment; replace or delete the offending jars.

The key point of the bulk-import configuration is adding the hadoop settings to gremlin.sh, exporting the hadoop environment into JAVA_OPTIONS and CLASSPATH.

