JanusGraph Bulk Loading Optimization

Bulk import tool:

https://github.com/dengziming/janusgraph-util

Bulk loading configuration options

storage.batch-loading=true
  • With batch loading enabled, the imported data must be internally consistent and consistent with the data already in the graph. (For example, if the name property has a unique composite index, the imported data must not duplicate existing values of name.)

  • The configuration options below all serve one goal: reducing total bulk-loading time.
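For reference, a unique composite index like the one mentioned above is declared through the JanusGraph management API. A minimal sketch (the property and index names are illustrative, not taken from the article's schema):

```groovy
// Run once in the Gremlin console against an open JanusGraph instance.
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).make()
// unique() makes JanusGraph reject duplicate values of name,
// which is exactly the constraint batch loading will NOT check for you.
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
```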

ID allocation optimization

ID Block Size

ids.block-size
  • Each JanusGraph instance assigns unique ids to newly added vertices and edges by reserving id blocks through its id pool manager. Acquiring an id block is expensive, because cluster-wide uniqueness must be guaranteed while multiple instances compete for blocks. Increasing ids.block-size therefore reduces how often blocks must be acquired, but a value that is too large wastes the unused ids in each block.
  • For ordinary transactional workloads the default value of ids.block-size is sufficient. For bulk loading, set it to roughly 10 times the number of vertices and edges each JanusGraph instance will add.
  • This option must be set to the same value on every JanusGraph instance in the cluster.
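As a sizing sketch: if a single instance is expected to load tens of millions of elements, a block size in the hundreds of millions keeps block acquisitions rare (the figure below matches the value the tuning section of this article ends up using):

```properties
# Each instance will load ~40M vertices/edges, so reserve id blocks
# about 10x that size to make block acquisition a rare event.
ids.block-size=100000000
```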

ID Acquisition Process

1) ids.authority.wait-time
  • In milliseconds: the maximum time the id pool manager waits for an id block acquisition to complete; if the block has not been obtained in time, the attempt fails. A suggested value is the sum of the 95th-percentile read and write times of the storage backend.
  • Set the same value on every instance in the cluster.
2) ids.renew-timeout
  • How many milliseconds the id pool manager waits after a failed id block acquisition before starting a new attempt. Set this as large as feasible.
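Putting the two options together, a hedged sketch (the latency figure is an assumption for illustration, not a measurement from the article's cluster):

```properties
# Assuming the backend's 95th-percentile read + write latency sums to ~300 ms.
ids.authority.wait-time=300
# Wait a long time (1 hour) before giving up on id block renewal entirely.
ids.renew-timeout=3600000
```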

Read and write optimization

Buffer Size

storage.buffer-size  (write buffer size)
  • JanusGraph buffers write operations and sends them to the backend in batches. This reduces the number of requests and avoids the failures caused by hitting the server with too many write requests in a short time.
  • If the buffer is too large, write latency increases, and with it the probability of request failure.
  • Recommendation: increase this value cautiously.

Read and Write Robustness

  • When a read or write against the storage backend fails (a large storage.buffer-size increases the chance of failure), these options control how many times the operation is retried before giving up.
storage.read-attempts   number of read attempts
storage.write-attempts  number of write attempts
storage.attempt-wait    interval in milliseconds between two attempts; for bulk loading this can be set higher
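A combined sketch of the read/write settings for a bulk-loading run (the values are illustrative assumptions, not recommendations from the article):

```properties
# Larger write buffer: fewer, bigger batches sent to the backend.
storage.buffer-size=10240
# Retry generously, since a loaded backend will fail transiently.
storage.read-attempts=10
storage.write-attempts=10
# Wait 1s between attempts to let the backend recover.
storage.attempt-wait=1000
```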

Strategies

Parallelizing the Load

  • If the storage backend can absorb enough requests, the bulk load can run in parallel on multiple machines, reducing total import time.
  • Chapter 35, JanusGraph with TinkerPop’s Hadoop-Gremlin, covers bulk loading through MapReduce.
  • Without Hadoop, a large graph can be split into smaller subgraphs and imported in parallel.
  • If the graph cannot be split that way, vertices and edges can be separated and loaded in parallel.

Hands-on walkthrough:

Basic setup

  1. Download JanusGraph from the official site and unpack it to /data/janusgraph/.
  2. Configure the storage and index backends. Since we use es + hbase, edit /data/janusgraph/conf/janusgraph-hbase-es.properties directly:
# important
gremlin.graph=org.janusgraph.core.JanusGraphFactory
# hbase settings
storage.batch-loading=true
storage.backend=hbase
storage.hostname=c1-nn1.bdp.idc,c1-nn2.bdp.idc,c1-nn3.bdp.idc
storage.hbase.ext.hbase.zookeeper.property.clientPort=2181
storage.hbase.table=yisou:test_graph
# es settings
index.search.backend=elasticsearch
# ES is installed locally only; this is the local machine's IP.
index.search.hostname=10.120.64.69
index.search.elasticsearch.client-only=true
index.search.index-name=yisou_test_graph
# default cache settings
cache.db-cache=true
cache.db-cache-clean-wait=20
cache.db-cache-time=180000
cache.db-cache-size=0.5

3. Adjust the jars under /data/janusgraph/lib. Running the bulk import through yarn-client hit conflicts with guava and other jars, so I adjusted the jars in lib accordingly. Three jars were changed:

  1. hbase-client-1.2.4.jar ==> yisou-hbase-1.0-SNAPSHOT.jar
    The guava used by the bundled hbase-client-1.2.4.jar conflicts with the guava version on our yarn cluster, so we used our company-internal hbase-client with guava stripped out, yisou-hbase-1.0-SNAPSHOT.jar.
    Without the replacement, the job fails with "Caused by: java.lang.IllegalAccessError: tried to access method com.google.common.base.Stopwatch.<init>()V from class org.apache.hadoop.hbase.zookeeper.MetaTableLocator"
  2. spark-assembly-1.6.1-hadoop2.6.0.jar ==> spark-assembly-1.6.2-hadoop2.6.0.jar
    The bundled spark-assembly-1.6.1-hadoop2.6.0.jar also causes a guava conflict; I replaced it with spark-assembly-1.6.2-hadoop2.6.0.jar.
    Without the replacement, the job fails with "java.lang.NoSuchMethodError: groovy.lang.MetaClassImpl.hasCustomStaticInvokeMethod()Z"
  3. Delete hbase-protocol-1.2.4.jar.
    Without the deletion, the job fails with "com.google.protobuf.ServiceException: java.lang.NoSuchMethodError: org.apache.hadoop.hbase.protobuf.generated.RPCProtos$ConnectionHeader$Builder.setVersionInfo(Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$VersionInfo;)Lorg/apache/hadoop/hbase/protobuf/generated/RPCProtos$ConnectionHeader$Builder;"

4. Define the vertex and edge properties of the graph; see the official documentation for details, which this article does not cover.

Configuring the bulk import

Because the import program runs on yarn, the hadoop environment must be set up. Two files need changes: the JanusGraph startup script /data/janusgraph/lib/gremlin.sh, and the hadoop/spark settings in /data/janusgraph/conf/hadoop-graph/hadoop-script.properties.

1. Copy /data/janusgraph/lib/gremlin.sh, say to yarn-gremlin.sh, then add the hadoop settings to JAVA_OPTIONS and CLASSPATH. This ensures the hadoop configuration is visible to the program, so the spark-on-yarn job can start normally.

#!/bin/bash
# HADOOP_HOME must be set before HADOOP_CONF_DIR, which derives from it.
export HADOOP_HOME=/usr/local/hadoop-2.7.1
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export JAVA_OPTIONS="$JAVA_OPTIONS -Djava.library.path=$HADOOP_HOME/lib/native"
export CLASSPATH=$HADOOP_CONF_DIR
# JANUSGRAPH_HOME is the JanusGraph install directory, /data/janusgraph/
cd $JANUSGRAPH_HOME
./bin/gremlin.sh

2. Edit /data/janusgraph/conf/hadoop-graph/hadoop-script.properties.
The main changes: choose the inputFormat matching the files to import, point inputLocation at the hdfs path of the data, point scriptInputFormat.script at the parse script, and set the spark master to yarn-client.

#
# Hadoop Graph Configuration
#
gremlin.graph=org.apache.tinkerpop.gremlin.hadoop.structure.HadoopGraph
gremlin.hadoop.graphInputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.script.ScriptInputFormat
gremlin.hadoop.graphOutputFormat=org.apache.tinkerpop.gremlin.hadoop.structure.io.graphson.GraphSONOutputFormat
gremlin.hadoop.jarsInDistributedCache=true

# hdfs path of the input file. Can also be set after loading this config file.
gremlin.hadoop.inputLocation=/user/yisou/taotian1/janus/data/fewData.test.dup
# path of the parse script that interprets each hdfs line. Can also be set after loading this config file.
gremlin.hadoop.scriptInputFormat.script=/user/yisou/taotian1/janus/data/conf/vertex_parse.groovy
#gremlin.hadoop.outputLocation=output

#
# SparkGraphComputer with Yarn Configuration
#
spark.master=yarn-client
spark.executor.memory=6g
spark.executor.instances=10
spark.executor.cores=2
spark.serializer=org.apache.spark.serializer.KryoSerializer
# spark.kryo.registrationRequired=true
# spark.storage.memoryFraction=0.2
# spark.eventLog.enabled=true
# spark.eventLog.dir=/tmp/spark-event-logs
# spark.ui.killEnabled=true

#cache config
gremlin.spark.persistContext=true
gremlin.spark.graphStorageLevel=MEMORY_AND_DISK
#gremlin.spark.persistStorageLevel=DISK_ONLY


#####################################
# GiraphGraphComputer Configuration #
#####################################
giraph.minWorkers=2
giraph.maxWorkers=3
giraph.useOutOfCoreGraph=true
giraph.useOutOfCoreMessages=true
mapred.map.child.java.opts=-Xmx1024m
mapred.reduce.child.java.opts=-Xmx1024m
giraph.numInputThreads=4
giraph.numComputeThreads=4
# giraph.maxPartitionsInMemory=1
# giraph.userPartitionCount=2
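The vertex_parse.groovy referenced above must expose a parse(line, factory) method, which ScriptInputFormat invokes once per input line. The sketch below assumes a hypothetical tab-separated layout (id, label, name, then outgoing edges as label:targetId pairs); the article's actual script depends on its own file format and is not shown:

```groovy
// vertex_parse.groovy -- sketch of a ScriptInputFormat parse function.
// Assumed line format (illustrative): id<TAB>label<TAB>name<TAB>edgeLabel:targetId,...
def parse(String line, def factory) {
    def fields = line.split('\t')
    // Create the star vertex for this line.
    def v = factory.vertex(Long.valueOf(fields[0]), fields[1])
    v.property('name', fields[2])
    // Optional fourth field: comma-separated out-edges.
    if (fields.length > 3 && !fields[3].isEmpty()) {
        fields[3].split(',').each {
            def (edgeLabel, targetId) = it.split(':').toList()
            // Out-edge to a placeholder vertex carrying only the target id.
            factory.edge(v, factory.vertex(Long.valueOf(targetId)), edgeLabel)
        }
    }
    return v
}
```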

Running the bulk import

Startup command:

sh /data/janusgraph/lib/yarn-gremlin.sh

Bulk import commands:

local_root="/data/janusgraph"
hdfs_root="/user/yisou/taotian1/janus"
social_graph="${local_root}/conf/janusgraph-hbase-es.properties"
graph = GraphFactory.open("${local_root}/conf/hadoop-script.properties")
graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/data/fewData.test.dup")
graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/conf/vertex_parse.groovy")
blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
graph.compute(SparkGraphComputer).program(blvp).submit().get()

Output:

sh /data/janusgraph/lib/yarn-gremlin.sh
\,,,/
(o o)
-----oOOo-(3)-oOOo-----
plugin activated: janusgraph.imports
plugin activated: tinkerpop.server
plugin activated: tinkerpop.utilities
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/slf4j-log4j12-1.7.12.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/logback-classic-1.1.2.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/spark-assembly-1.6.2-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/data2/janusgraph-0.1.1-hadoop2/lib/yisou-hbase-1.0-SNAPSHOT.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
21:22:00,392  INFO HadoopGraph:87 - HADOOP_GREMLIN_LIBS is set to: /data2/janusgraph-0.1.1-hadoop2/lib
plugin activated: tinkerpop.hadoop
plugin activated: tinkerpop.spark
plugin activated: tinkerpop.tinkergraph
gremlin>
gremlin> local_root="/data2/janusgraph-0.1.1-hadoop2/social"
==>/data2/janusgraph-0.1.1-hadoop2/social
gremlin> hdfs_root="/user/yisou/taotian1/janus"
==>/user/yisou/taotian1/janus
gremlin> social_graph="${local_root}/conf/janusgraph-hbase-es-social.properties"
==>/data2/janusgraph-0.1.1-hadoop2/social/conf/janusgraph-hbase-es-social.properties
gremlin> graph = GraphFactory.open("${local_root}/conf/hadoop-yarn.properties")
==>hadoopgraph[scriptinputformat->graphsonoutputformat]
gremlin> graph.configuration().setProperty("gremlin.hadoop.inputLocation","/user/yisou/taotian1/janus/tmp1person/")
==>null
gremlin> graph.configuration().setProperty("gremlin.hadoop.scriptInputFormat.script", "${hdfs_root}/person_parse.groovy")
==>null
gremlin> blvp = BulkLoaderVertexProgram.build().writeGraph(social_graph).create(graph)
==>BulkLoaderVertexProgram[bulkLoader=IncrementalBulkLoader, vertexIdProperty=bulkLoader.vertex.id, userSuppliedIds=false, keepOriginalIds=true, batchSize=0]
gremlin> graph.compute(SparkGraphComputer).program(blvp).submit().get()
21:25:04,666  INFO deprecation:1173 - mapred.reduce.child.java.opts is deprecated. Instead, use mapreduce.reduce.java.opts
21:25:04,667  INFO deprecation:1173 - mapred.map.child.java.opts is deprecated. Instead, use mapreduce.map.java.opts
21:25:04,680  INFO KryoShimServiceLoader:117 - Set KryoShimService provider to org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c (class org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService) because its priority value (0) is the highest available
21:25:04,680  INFO KryoShimServiceLoader:123 - Configuring KryoShimService provider org.apache.tinkerpop.gremlin.hadoop.structure.io.HadoopPoolShimService@4cb2918c with user-provided configuration
  21:25:10,479  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,505  INFO SparkContext:58 - Running Spark version 1.6.2
21:25:10,524  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,564  INFO SecurityManager:58 - Changing view acls to: yisou
21:25:10,565  INFO SecurityManager:58 - Changing modify acls to: yisou
21:25:10,566  INFO SecurityManager:58 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(yisou); users with modify permissions: Set(yisou)
21:25:10,833  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:10,835  WARN SparkConf:70 - The configuration key 'spark.yarn.user.classpath.first' has been deprecated as of Spark 1.3 and may be removed in the future. Please use spark.{driver,executor}.userClassPathFirst instead.
21:25:11,035  INFO Utils:58 - Successfully started service 'sparkDriver' on port 36502.
21:25:11,576  INFO Slf4jLogger:80 - Slf4jLogger started
  21:25:11,646  INFO Remoting:74 - Starting remoting
............
21:25:20,736  INFO Client:58 - Submitting application 2727164 to ResourceManager
21:25:20,771  INFO YarnClientImpl:273 - Submitted application application_1466564207556_2727164
21:25:21,780  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:21,785  INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: N/A
ApplicationMaster RPC port: -1
queue: root.yisou
start time: 1500297920750
final status: UNDEFINED
tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
21:25:22,787  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:23,789  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:24,791  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:25,793  INFO Client:58 - Application report for application_1466564207556_2727164 (state: ACCEPTED)
21:25:39,585  INFO JettyUtils:58 - Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
21:25:39,823  INFO Client:58 - Application report for application_1466564207556_2727164 (state: RUNNING)
21:25:39,824  INFO Client:58 -
client token: N/A
diagnostics: N/A
ApplicationMaster host: 10.130.1.50
ApplicationMaster RPC port: 0
queue: root.yisou
start time: 1500297920750
final status: UNDEFINED
tracking URL: http://c1-nn3.bdp.idc:8981/proxy/application_1466564207556_2727164/
..........
21:25:42,864  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-codec-1.7.jar at http://10.130.64.69:38209/jars/commons-codec-1.7.jar with timestamp 1500297942864
21:25:42,866  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-lang-2.5.jar at http://10.130.64.69:38209/jars/commons-lang-2.5.jar with timestamp 1500297942866
21:25:42,869  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-collections-3.2.2.jar at http://10.130.64.69:38209/jars/commons-collections-3.2.2.jar with timestamp 1500297942869
21:25:42,872  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/commons-io-2.3.jar at http://10.130.64.69:38209/jars/commons-io-2.3.jar with timestamp 1500297942872
21:25:42,874  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/jetty-util-6.1.26.jar at http://10.130.64.69:38209/jars/jetty-util-6.1.26.jar with timestamp 1500297942874
21:25:42,879  INFO SparkContext:58 - Added JAR /data2/janusgraph-0.1.1-hadoop2/lib/htrace-core-3.1.0-incubating.jar at http://10.130.64.69:38209/jars/htrace-core-3.1.0-incubating.jar with timestamp 1
............
21:26:14,751  INFO MapOutputTrackerMaster:58 - Size of output statuses for shuffle 2 is 146 bytes
21:26:14,767  INFO TaskSetManager:58 - Finished task 0.0 in stage 6.0 (TID 4) in 40 ms on c1-dn31.bdp.idc (1/1)
21:26:14,767  INFO YarnScheduler:58 - Removed TaskSet 6.0, whose tasks have all completed, from pool
21:26:14,767  INFO DAGScheduler:58 - ResultStage 6 (foreachPartition at SparkExecutor.java:173) finished in 0.042 s
21:26:14,768  INFO DAGScheduler:58 - Job 1 finished: foreachPartition at SparkExecutor.java:173, took 1.776125 s
21:26:14,775  INFO ShuffledRDD:58 - Removing RDD 2 from persistence list
21:26:14,785  INFO BlockManager:58 - Removing RDD 2
==>result[hadoopgraph[scriptinputformat->graphsonoutputformat],memory[size:0]]
gremlin> 21:26:22,515  INFO YarnClientSchedulerBackend:58 - Registered executor NettyRpcEndpointRef(null) (c1-dn9.bdp.idc:60762) with ID 8

Bulk loading performance tuning

Without any tuning, JanusGraph bulk loading is very slow: importing 40 million records took about 3.5 hours. After the tuning below, it drops to about 1 hour.
1. Increase ids.block-size and storage.buffer-size (in janusgraph-hbase-es.properties).
ids.block-size=100000000
storage.buffer-size=102400

2. Specify the initial number of hbase regions (in janusgraph-hbase-es.properties).
storage.hbase.region-count=50

3. Import vertices and edges together, rather than splitting them into separate files loaded separately. For the format, see /data/janusgraph/data/grateful-dead.txt.

Summary

This article showed how to configure JanusGraph to bulk load vertices and edges through yarn-client.

The walkthrough has two parts: basic setup and bulk-import configuration. In the basic setup, watch out for conflicts between the jars bundled with JanusGraph and the jars in your yarn environment; replace or delete the offending jars.

The key point of the bulk-import configuration is adding the hadoop settings to gremlin.sh, exporting the hadoop environment into JAVA_OPTIONS and CLASSPATH.

