Parameters, Setup, and Backend Code Frameworks for the Mainstream Big Data Stack (about 70% complete)

The earlier rounds of source-code reading, performance testing, and tuning have basically wrapped up, and the project is nearing its end, so here is a consolidated reference of Spark configuration parameters and optimization strategies for future development and configuration:

Spark Installation, Configuration, and Code Framework

 

spark-defaults.conf configuration

spark.executor.instances : number of executor instances (containers) requested from YARN

spark.executor.cores : number of cores in each executor container

spark.executor.memory : memory allocated to each executor

spark.driver.memory : memory used by the driver

spark.storage.memoryFraction : fraction of memory used for cached data versus data being computed

spark.kryoserializer.buffer.max : maximum Kryo serialization buffer size, default 64M

spark.shuffle.consolidateFiles : whether shuffle output files are consolidated

spark.rdd.compress : whether serialized RDD partitions are compressed

spark.sql.shuffle.partitions : number of partitions created when Spark SQL shuffles data

spark.reducer.maxSizeInFlight : buffer size of a shuffle read task, which determines how much data is pulled from remote executors at a time; the fetch runs as five parallel requests against different executors, which improves fetch efficiency

spark.shuffle.io.retryWait : how long to wait between shuffle fetch retries

spark.shuffle.manager : set to hash together with spark.shuffle.consolidateFiles=true; intermediate results are not sorted and intermediate files are merged, which reduces the cost of opening many files. In Spark 2.0.2 hash can no longer be configured directly (it throws an error); the remaining options are Sort Shuffle and Tungsten Sort, chosen according to data volume. For the trade-offs see the post "Spark Shuffle 詳細過程" on this blog.

spark.executor.heartbeatInterval : interval of executor heartbeats to the driver, so the driver knows whether an executor is still alive

spark.driver.maxResultSize : limit on the total size of the serialized results of all partitions

spark.yarn.am.cores : number of CPU cores requested for the YARN Application Master in yarn-client mode

spark.master : master URL / deployment mode

spark.task.maxFailures : number of times a task may fail before the job is abandoned (failures caused by transient problems such as network I/O are retried)

spark.shuffle.file.buffer : size of the in-memory buffer a map task writes to before spilling to disk
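For reference, a minimal spark-defaults.conf assembled from the parameters above might look as follows. The values are illustrative assumptions, not recommendations, and spark.serializer is included only because the Kryo buffer setting takes effect when the Kryo serializer is enabled:

spark.master                      yarn-client
spark.executor.instances          3
spark.executor.cores              2
spark.executor.memory             10g
spark.driver.memory               4g
spark.serializer                  org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max   128m
spark.rdd.compress                true
spark.shuffle.consolidateFiles    true
spark.sql.shuffle.partitions      200
spark.reducer.maxSizeInFlight     48m
spark.shuffle.file.buffer         64k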

 

spark-env.sh configuration

export HADOOP_CONF_DIR : path to the Hadoop configuration files

export HADOOP_HOME : path of the Hadoop client installation

export JAVA_HOME : Java installation path

export SPARK_YARN_APP_NAME : application name shown in YARN

export SPARK_LOG_DIR : directory for Spark log output

export SPARK_PID_DIR : directory for Spark PID files
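A spark-env.sh along these lines would cover the variables above; the paths are placeholders for your own environment:

export JAVA_HOME=/usr/local/jdk1.7.0_55
export HADOOP_HOME=/usr/local/hadoop-2.6.0
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
export SPARK_YARN_APP_NAME=sparkApp
export SPARK_LOG_DIR=/var/log/spark
export SPARK_PID_DIR=/var/run/spark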

 

Copy hive-site.xml into Spark's conf directory. Change the Spark Thrift port so that it does not collide with Hive's Thrift port, configure the MySQL data source (Hive's metastore is kept in MySQL), and set the HDFS path used by the metastore warehouse:

<property>
  <name>hive.server2.thrift.port</name>
  <value>10000</value>
  <description>Port number of HiveServer2 Thrift interface.
  Can be overridden by setting $HIVE_SERVER2_THRIFT_PORT</description>
</property>

 

<property>
  <name>hive.metastore.warehouse.dir</name>
  <value>/user/hive/warehouse</value>
</property>

 

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

 

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>username to use against metastore database</description>
</property>

 

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>root</value>
  <description>password to use against metastore database</description>
</property>
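Because the metastore lives in MySQL, hive-site.xml also needs the JDBC connection URL alongside the driver class and credentials shown above; the host, port, and database name below are placeholders:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://master:3306/hive?createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>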

 

Spark dynamic resource allocation (tested settings):

spark.dynamicAllocation.cachedExecutorIdleTimeout 360000000   do not remove an executor that still holds cached data (effectively never)

spark.dynamicAllocation.executorIdleTimeout 60s   remove an executor once it has been idle for this long

spark.dynamicAllocation.initialExecutors 3   number of executors to start with when requesting again after all executors have been removed

spark.dynamicAllocation.maxExecutors 30   maximum number of executors that may be started

spark.dynamicAllocation.minExecutors 1   minimum number of executors to keep

spark.dynamicAllocation.schedulerBacklogTimeout 1s   start requesting executors once tasks have been waiting longer than this

spark.dynamicAllocation.enabled true   enable dynamic allocation

spark.dynamicAllocation.sustainedSchedulerBacklogTimeout 1s   interval between subsequent executor requests while the backlog persists

 

Startup script:

/usr/local/spark-1.6.1/sbin/start-thriftserver.sh \
--conf spark.shuffle.service.enabled=true \
--conf spark.dynamicAllocation.enabled=true \
--conf spark.dynamicAllocation.minExecutors=2 \
--conf spark.dynamicAllocation.maxExecutors=30 \
--conf spark.dynamicAllocation.sustainedSchedulerBacklogTimeout=5s \
--conf spark.dynamicAllocation.schedulerBacklogTimeout=1s \
--conf spark.dynamicAllocation.initialExecutors=2 \
--conf spark.dynamicAllocation.executorIdleTimeout=60s \
--conf spark.dynamicAllocation.cachedExecutorIdleTimeout=360000000s \
--conf spark.driver.memory=50g

 

Code framework:

First, pull in the required dependencies (Hadoop, Spark, HBase, and so on) in pom.xml:

<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>sparkApp</groupId>
  <artifactId>sparkApp</artifactId>
  <version>1.0-SNAPSHOT</version>
  <inceptionYear>2008</inceptionYear>
  <properties>
      <scala.version>2.10.0</scala.version>
      <spring.version>4.0.2.RELEASE</spring.version>
      <hadoop.version>2.6.0</hadoop.version>
      <jedis.version>2.8.1</jedis.version>
  </properties>

  <repositories>
    <repository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </repository>
  </repositories>

  <pluginRepositories>
    <pluginRepository>
      <id>scala-tools.org</id>
      <name>Scala-Tools Maven2 Repository</name>
      <url>http://scala-tools.org/repo-releases</url>
    </pluginRepository>
  </pluginRepositories>

  <dependencies>
    <dependency>
      <groupId>org.scala-lang</groupId>
      <artifactId>scala-library</artifactId>
      <version>${scala.version}</version>
    </dependency>
    <dependency>
      <groupId>junit</groupId>
      <artifactId>junit</artifactId>
      <version>4.4</version>
      <scope>test</scope>
    </dependency>

      <!-- SPARK START -->
      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-core_2.10</artifactId>
          <version>1.6.1</version>
      </dependency>

      <!--<dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-streaming-kafka_2.10</artifactId>
          <version>1.6.1</version>
      </dependency>-->

      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-sql_2.10</artifactId>
          <version>1.6.1</version>
      </dependency>

      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-mllib_2.10</artifactId>
          <version>1.6.1</version>
      </dependency>

      <dependency>
          <groupId>org.apache.spark</groupId>
          <artifactId>spark-hive_2.10</artifactId>
          <version>1.6.1</version>
      </dependency>

      <!-- SPARK END -->

      <!-- HADOOP START -->
      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-hdfs</artifactId>
          <version>${hadoop.version}</version>
          <exclusions>
              <exclusion>
                  <groupId>commons-logging</groupId>
                  <artifactId>commons-logging</artifactId>
              </exclusion>
              <exclusion>
                  <groupId>log4j</groupId>
                  <artifactId>log4j</artifactId>
              </exclusion>
              <exclusion>
                  <groupId>org.slf4j</groupId>
                  <artifactId>slf4j-log4j12</artifactId>
              </exclusion>
              <exclusion>
                  <groupId>org.slf4j</groupId>
                  <artifactId>slf4j-api</artifactId>
              </exclusion>
              <exclusion>
                  <groupId>javax.servlet.jsp</groupId>
                  <artifactId>jsp-api</artifactId>
              </exclusion>
          </exclusions>
      </dependency>

      <dependency>
          <groupId>org.apache.hadoop</groupId>
          <artifactId>hadoop-common</artifactId>
          <version>${hadoop.version}</version>
      </dependency>

      <!-- HADOOP END -->


      <!-- hbase START  -->
      <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase-server</artifactId>
          <version>1.0.2</version>
      </dependency>

      <dependency>
          <groupId>org.apache.htrace</groupId>
          <artifactId>htrace-core</artifactId>
          <version>3.1.0-incubating</version>
      </dependency>

      <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase-client</artifactId>
          <version>1.0.2</version>
      </dependency>

      <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase-common</artifactId>
          <version>1.0.2</version>
      </dependency>

      <dependency>
          <groupId>org.apache.hbase</groupId>
          <artifactId>hbase-protocol</artifactId>
          <version>1.0.2</version>
      </dependency>

      <!-- hbase END  -->

      <dependency>
          <groupId>com.google.guava</groupId>
          <artifactId>guava</artifactId>
          <version>11.0.2</version>
      </dependency>

      <dependency>
          <groupId>org.slf4j</groupId>
          <artifactId>slf4j-api</artifactId>
          <version>1.7.12</version>
      </dependency>


      <!-- REDIS START -->
      <dependency>
          <groupId>redis.clients</groupId>
          <artifactId>jedis</artifactId>
          <version>${jedis.version}</version>
      </dependency>
      <!-- REDIS END -->

      <dependency>
          <groupId>org.apache.commons</groupId>
          <artifactId>commons-lang3</artifactId>
          <version>3.1</version>
      </dependency>

      <dependency>
          <groupId>org.apache.commons</groupId>
          <artifactId>commons-pool2</artifactId>
          <version>2.3</version>
      </dependency>

      <dependency>
          <groupId>mysql</groupId>
          <artifactId>mysql-connector-java</artifactId>
          <version>5.1.26</version>
      </dependency>

      <!--<dependency>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-api-jdo</artifactId>
          <version>3.2.6</version>
      </dependency>

      <dependency>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-core</artifactId>
          <version>3.2.1</version>
      </dependency>

      <dependency>
          <groupId>org.datanucleus</groupId>
          <artifactId>datanucleus-rdbms</artifactId>
          <version>3.2.9</version>
      </dependency>-->

      <dependency>
          <groupId>io.netty</groupId>
          <artifactId>netty-all</artifactId>
          <version>5.0.0.Alpha1</version>
      </dependency>

     <!-- <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-service</artifactId>
          <version>1.2.1</version>
      </dependency>-->

      <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-jdbc</artifactId>
          <version>2.1.0</version>
      </dependency>

      <dependency>
          <groupId>org.apache.hive</groupId>
          <artifactId>hive-service</artifactId>
          <version>1.2.1</version>
      </dependency>

  </dependencies>

  <build>
<!--    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>-->
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <executions>
          <execution>
            <goals>
              <goal>compile</goal>
              <goal>testCompile</goal>
            </goals>
          </execution>
        </executions>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
          <args>
            <arg>-target:jvm-1.5</arg>
          </args>
        </configuration>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-eclipse-plugin</artifactId>
        <configuration>
          <downloadSources>true</downloadSources>
          <buildcommands>
            <buildcommand>ch.epfl.lamp.sdt.core.scalabuilder</buildcommand>
          </buildcommands>
          <additionalProjectnatures>
            <projectnature>ch.epfl.lamp.sdt.core.scalanature</projectnature>
          </additionalProjectnatures>
          <classpathContainers>
            <classpathContainer>org.eclipse.jdt.launching.JRE_CONTAINER</classpathContainer>
            <classpathContainer>ch.epfl.lamp.sdt.launching.SCALA_CONTAINER</classpathContainer>
          </classpathContainers>
        </configuration>
      </plugin>
    </plugins>
  </build>
  <reporting>
    <plugins>
      <plugin>
        <groupId>org.scala-tools</groupId>
        <artifactId>maven-scala-plugin</artifactId>
        <configuration>
          <scalaVersion>${scala.version}</scalaVersion>
        </configuration>
      </plugin>
    </plugins>
  </reporting>
</project>

Then copy the cluster's hive-site.xml, hdfs-site.xml, and hbase-site.xml into the project.

HBase helper methods (excerpt):

  1 package hbase
  2 
  3 import java.util.{Calendar, Date}
  4 
  5 import org.apache.hadoop.hbase.HBaseConfiguration
  6 import org.apache.hadoop.hbase.client.{Result, Scan}
  7 import org.apache.hadoop.hbase.filter._
  8 import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  9 import org.apache.hadoop.hbase.mapreduce.TableInputFormat
 10 import org.apache.hadoop.hbase.protobuf.ProtobufUtil
 11 import org.apache.hadoop.hbase.util.{Base64, Bytes}
 12 import org.apache.spark.{Logging, SparkContext}
 13 import org.apache.spark.rdd.RDD
 14 import org.slf4j.{Logger, LoggerFactory}
 15 
 16 /**
 17   * Created by ysy on 2016/11/6.
 18   */
 19 object HBaseTableHelper extends Serializable {
 20 
 21   val logger: Logger = LoggerFactory.getLogger(HBaseTableHelper.getClass)
 22 
 23   // Load HBase data filtered by a timestamp range
 24   def tableInitByTime(sc : SparkContext,tablename:String,columns :String,fromdate: Date,todate:Date):RDD[(ImmutableBytesWritable,Result)] = {
 25     val configuration = HBaseConfiguration.create()
 26     configuration.set(TableInputFormat.INPUT_TABLE, tablename)
 27 
 28     val scan = new Scan
 29     scan.setTimeRange(fromdate.getTime,todate.getTime)
 30     val column = columns.split(",")
 31     for(columnName <- column){
 32       scan.addColumn("f1".getBytes, columnName.getBytes)
 33     }
 34     configuration.set(TableInputFormat.SCAN, convertScanToString(scan))
 35     val hbaseRDD = sc.newAPIHadoopRDD(configuration, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
 36     logger.info("-------count-------" + hbaseRDD.count() + "------------------")
 37     hbaseRDD
 38   }
 39 
 40   def convertScanToString(scan : Scan) = {
 41     val proto = ProtobufUtil.toScan(scan)
 42     Base64.encodeBytes(proto.toByteArray)
 43   }
 44 
 45   // Filter data by a time condition on the row key
 46   def tableInitByFilter(sc : SparkContext,tablename : String,columns : String,time : String) : RDD[(ImmutableBytesWritable,Result)] = {
 47     val configuration = HBaseConfiguration.create()
 48     configuration.set(TableInputFormat.INPUT_TABLE,tablename)
 49     val filter: Filter = new RowFilter(CompareFilter.CompareOp.GREATER_OR_EQUAL, new SubstringComparator(time))
 50     val scan = new Scan
 51     scan.setFilter(filter)
 52     val column = columns.split(",")
 53     for(columnName <- column){
 54       scan.addColumn("f1".getBytes, columnName.getBytes)
 55     }
 56     val hbaseRDD = sc.newAPIHadoopRDD(configuration, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
 57     logger.info("-------count-------" + hbaseRDD.count() + "------------------")
 58     hbaseRDD
 59   }
 60 
 61   def HBaseTableInit(): Unit ={
 62 
 63   }
 64 
 65   def hbaseToHiveTable(): Unit ={
 66 
 67   }
 68 
 69   // Get the Date from N days ago
 70   def getPassDays(beforeDay : Int): Date ={
 71     val calendar = Calendar.getInstance()
 72     var year = calendar.get(Calendar.YEAR)
 73     var dayOfYear  = calendar.get(Calendar.DAY_OF_YEAR)
 74     var j = 0
 75     for(i <- 0 to beforeDay){
 76       calendar.set(Calendar.DAY_OF_YEAR, dayOfYear - j);
 77       if (calendar.get(Calendar.YEAR) < year) {
 78         // crossed into the previous year
 79         j = 1;
 80         // update the marker year
 81         year = year + 1;
 82         // reset the calendar
 83         calendar.set(year,Calendar.DECEMBER,31);
 84         // re-read dayOfYear
 85         dayOfYear = calendar.get(Calendar.DAY_OF_YEAR);
 86       }else{
 87         j = j + 1
 88       }
 89     }
 90     calendar.getTime()
 91   }
 92 
 93   // Scan rows between startRow and stopRow
 94   def scanHbaseByStartAndEndRow(sc : SparkContext,startRow : String,stopRow : String,tableName : String) : RDD[(ImmutableBytesWritable,Result)] ={
 95     val configuration = HBaseConfiguration.create()
 96     val scan = new Scan()
 97     scan.setCacheBlocks(false)
 98     scan.setStartRow(Bytes.toBytes(startRow))
 99     scan.setStopRow(Bytes.toBytes(stopRow))
100     val filterList = new FilterList()
101     filterList.addFilter(new KeyOnlyFilter())
102     filterList.addFilter(new InclusiveStopFilter(Bytes.toBytes(stopRow)))
103     scan.setFilter(filterList)
104     configuration.set(TableInputFormat.INPUT_TABLE,tableName)
105     configuration.set(TableInputFormat.SCAN, convertScanToString(scan))
106     val hbaseRDD = sc.newAPIHadoopRDD(configuration, classOf[TableInputFormat], classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable], classOf[org.apache.hadoop.hbase.client.Result])
107     logger.info("-------ScanHbaseCount-------" + hbaseRDD.count() + "------------------")
108     hbaseRDD
109   }
110 
111 }
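A rough usage sketch of the helper above; the table name, column list, and application name are made-up examples:

import java.util.Date
import org.apache.spark.{SparkConf, SparkContext}
import hbase.HBaseTableHelper

object HBaseHelperDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HBaseHelperDemo"))
    // load rows written during the last 7 days, reading two columns of family f1
    val from = HBaseTableHelper.getPassDays(7)
    val rdd = HBaseTableHelper.tableInitByTime(sc, "user_event", "col1,col2", from, new Date())
    // the RDD contains (rowkey, Result) pairs
    rdd.take(10).foreach { case (key, result) => println(new String(key.get()) + " : " + result) }
    sc.stop()
  }
}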

Hive helper methods (excerpt):

 1 package hive
 2 
 3 import org.apache.spark.{Logging, SparkContext}
 4 import org.apache.spark.rdd.RDD
 5 import org.apache.spark.sql.Row
 6 import org.apache.spark.sql.hive.HiveContext
 7 import org.apache.spark.sql.types.{StringType, StructField, StructType}
 8 
 9 /**
10   * Created by uatcaiwy on 2016/11/6.
11   */
12 object HiveTableHelper extends Logging {
13 
14     def hiveTableInit(sc:SparkContext): HiveContext ={
15       val sqlContext = new HiveContext(sc)
16       sqlContext
17     }
18 
19     def writePartitionTable(HCtx:HiveContext,inputRdd:RDD[Row],tabName:String,colNames:String):Unit ={
20       val schema = StructType(
21         colNames.split(" ").map(fieldName => StructField(fieldName,StringType,true))
22       )
23       val table = colNames.replace(" dt","").split(" ").map(name => name + " String").toList.toString().replace("List(","").replace(")","")
24       val df = HCtx.createDataFrame(inputRdd,schema)
25       df.show(20)
26       logInfo("----------------------------------begin write table-----------------------------------")
27       val temptb = "temp" + tabName
28       HCtx.sql("drop table if exists " + tabName)
29       df.registerTempTable(temptb)
 30       HCtx.sql("CREATE EXTERNAL TABLE if not exists " + tabName +" ("+ table+ ") PARTITIONED BY (`dt` string) ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.SequenceFileInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat'")
31       HCtx.sql("set hive.exec.dynamic.partition.mode = nonstrict")
32       HCtx.sql("insert overwrite table " + tabName + " partition(`dt`)" + " select * from " + temptb)
33     }
34 }
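A sketch of how the helper might be called; the table and column names are invented, and colNames is space separated with dt as the partition column, which is what writePartitionTable expects:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.Row
import hive.HiveTableHelper

object HiveHelperDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("HiveHelperDemo"))
    val hctx = HiveTableHelper.hiveTableInit(sc)
    // each Row must match the column list, with the partition value dt as the last field
    val rows = sc.parallelize(Seq(Row("u001", "beijing", "20161106"), Row("u002", "shanghai", "20161106")))
    HiveTableHelper.writePartitionTable(hctx, rows, "user_city", "user_id city dt")
    sc.stop()
  }
}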

Reading files from HDFS: sometimes a file has to be read in its original encoding, otherwise the text comes out garbled. Helper for reading a file with a given character encoding:

 1 package importSplitFiletoHive
 2 
 3 import org.apache.hadoop.io.{LongWritable, Text}
 4 import org.apache.hadoop.mapred.TextInputFormat
 5 import org.apache.spark.SparkContext
 6 import org.apache.spark.rdd.RDD
 7 
 8 /**
 9   * Created by ysy on 2016/12/7.
10   */
11 object changeEncode {
12 
13   def changeFileEncoding(sc:SparkContext,path:String,encode : String):RDD[String]={
14     sc.hadoopFile(path,classOf[TextInputFormat],classOf[LongWritable],classOf[Text],1)
15       .map(p => new String(p._2.getBytes,0,p._2.getLength,encode))
 16   }
 17 }
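A quick sketch of calling it, assuming a GBK-encoded file on HDFS; the path and encoding are examples:

import org.apache.spark.{SparkConf, SparkContext}
import importSplitFiletoHive.changeEncode

object ChangeEncodeDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ChangeEncodeDemo"))
    // read the raw bytes and decode them as GBK instead of the platform default
    val lines = changeEncode.changeFileEncoding(sc, "hdfs://master/data/input.txt", "GBK")
    lines.take(5).foreach(println)
    sc.stop()
  }
}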

Parsing XML with Spark (excerpt):

 1 import hive.HiveTableHelper
 2 import org.apache.spark.Logging
 3 import org.apache.spark.rdd.RDD
 4 import org.apache.spark.sql.Row
 5 import org.apache.spark.sql.hive.HiveContext
 6 import org.apache.spark.sql.types.{StringType, StructField, StructType}
 7 import org.slf4j.LoggerFactory
 8 
 9 import scala.xml._
10 
11 object xmlParse extends Logging{
12   val schemaString = "column1,column2...."
13   def getPBOC_V1_F1(HCtx:HiveContext,rdd:RDD[String],outputTablename:String):Unit = {
14     val tbrdd = rdd.filter(_.split("\t").length == 9)
15       .filter(_.split("\t")(8) != "RESULTS")
16       .map(data => {
17         val sp = data.split("\t")
18         val dt = sp(5).substring(0, 10).replaceAll("-", "")
19         ((sp(0), sp(1), sp(2), sp(3), sp(4), sp(5), sp(6), "RN"), sp(8), dt)
20       }).filter(_._2 != "")
21       .filter(_._2.split("<").length > 2)
22       .filter(data => !(data._2.indexOf("SingleQueryResultMessage0009") == -1 || data._2.indexOf("ReportMessage") == -1))
23       .map(data => {
24         val xml = if (XML.loadString(data._2) != null) XML.loadString(data._2) else null
25         logDebug("%%%%%%%%%%%%%%%%%%finding xml-1:" + xml + "%%%%%%%%%%%%%%%%%%")
26         val column1 = if ((xml \ "PBOC" \ "TYPE") != null) (xml \ "PBOC" \ "TYPE").text else "null"
27         val column2 = if ((xml \ "HEAD" \ "VER") != null) (xml \ "HEAD" \ "VER").text else "null"
28         val column3 = if ((xml \ "HEAD" \ "SRC") != null) (xml \ "HEAD" \ "SRC").text else "null"
29         val column4 = if ((xml \ "HEAD" \ "DES") != null) (xml \ "HEAD" \ "DES").text else "null"
30         ....
31         (data._1,column1,column2,column3...)
32       })
33  ROW(....)
34     HiveTableHelper.writePartitionTable(HCtx, tbrdd, outputTablename, schemaString)

Redis helper methods (excerpt):

 1 package redis
 2 
 3 import org.slf4j.{Logger, LoggerFactory}
 4 import redis.clients.jedis.{Jedis, JedisPool, JedisPoolConfig}
 5 
 6 import scala.collection.mutable.ArrayBuffer
 7 import scala.util.Random
 8 
 9 /**
10   * Created by ysy on 2016/11/21.
11   */
12 object RedisClient extends Serializable{
13   val logger: Logger = LoggerFactory.getLogger(RedisClient.getClass)
14   @transient private var jedisPool : JedisPool = null
15     private var jedisPoolList  = new ArrayBuffer[JedisPool]
16     private val poolSize = 0
17     makePool
18 
19   def makePool() ={
20       val pc = new PropConfig("redis.properties");
21     if(jedisPool == null){
22       val poolConfig : JedisPoolConfig = new JedisPoolConfig
23       poolConfig.setMaxIdle(pc.getProperty("redis.pool.maxActive").toInt)
24       poolConfig.setMaxTotal(pc.getProperty("redis.pool.maxActive").toInt)
25       poolConfig.setMaxWaitMillis(pc.getProperty("redis.pool.maxWait").toInt)
26       poolConfig.setTestOnBorrow(pc.getProperty("redis.pool.testOnBorrow").toBoolean)
27       poolConfig.setTestOnReturn(pc.getProperty("redis.pool.testOnReturn").toBoolean)
28       val hosts = pc.getProperty("redis.pool.servers").split(",")
29         .map(data => data.split(":"))
30       for(host <- hosts){
31         jedisPool = new JedisPool(poolConfig,host(0),host(1).toInt,pc.getProperty("redis.server.timeout").toInt)
32         jedisPoolList += jedisPool
33       }
34     }
35 
36   }
37 
38 
39   def getForString(key : String) : String = {
40       var value = ""
41     if(key != null && !key.isEmpty()){
42       val jedis = getJedis
43       value = jedis.get(key)
44     }
45     value
46   }
47 
48   def setForString(key : String,value : String) ={
49     if(key != null && !key.isEmpty){
50       var jedis = getJedis
51        jedis.set(key,value)
52     }else{
53 
54     }
55   }
56 
57   def zexsist(key : String) : Boolean ={
58       var flag = false
59       if(key != null && !key.isEmpty){
60         val jedis = getJedis
61 
62         val resultNum = jedis.zcard(key)
 63         if(resultNum == 0){
64             flag = true
65         }
66       }
67     flag
68   }
69 
70   def getJedis() : Jedis ={
71       var ramNub = 0
72       if(poolSize == 1){
73         ramNub = 0
74       }else{
75         val random = new Random
76         ramNub = Math.abs(random.nextInt % 1)
77       }
78     jedisPool = jedisPoolList(ramNub)
79     jedisPool.getResource
80   }
81 
82   def returnJedisResource(redis : Jedis): Unit ={
83     if(redis != null){
84       redis.close()
85     }
86   }
87 
88   def close: Unit = {
89       for(jedisPool <- jedisPoolList){
90           jedisPool.close
91       }
92     if(jedisPool!=null && !jedisPool.isClosed()){
93       jedisPool.close
94     }else{
95       jedisPool=null
96     }
97   }
98 
99 }
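The client above might be exercised like this; the key and value are illustrative, and the connection settings are read from redis.properties when the object is first touched:

import redis.RedisClient

object RedisClientDemo {
  def main(args: Array[String]): Unit = {
    // makePool runs when the object is initialized and builds the JedisPool(s) from redis.properties
    RedisClient.setForString("user:1001:city", "beijing")
    println(RedisClient.getForString("user:1001:city"))
    RedisClient.close
  }
}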

I will not spell out every detail; with this, the complete project skeleton is in place.

Next, create the SparkContext in a main method and start the data analysis and processing. Run the job from Spark's bin directory, or wrap the command in a script:

./spark-submit --conf spark.ui.port=5566 --name "sparkApp" --master yarn-client --num-executors 3 --executor-cores 2 --executor-memory 10g --class impl.spark /usr/local/spark1.6.1/sparkApp/sparkApp.jar

(Note: the parameters given here override the values configured in spark-defaults.conf. spark.ui.port is re-declared because submitting a job while the Spark Thrift server is running would otherwise fight over the UI port. That wraps up the Spark part.)
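As a minimal sketch, the entry class named in the submit command (impl.spark) could look roughly like this; only the SparkContext creation is implied by the command above, the rest is an assumption:

package impl

import org.apache.spark.{SparkConf, SparkContext}
import hive.HiveTableHelper

object spark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("sparkApp"))
    val hctx = HiveTableHelper.hiveTableInit(sc)
    // load data from HBase/HDFS, transform it, and write the result tables here
    sc.stop()
  }
}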

 

Hadoop Installation, Configuration, and MapReduce Code Framework

Install the build dependencies:

yum install gcc
yum install gcc-c++
yum install make
yum install autoconf automake libtool cmake
yum install ncurses-devel
yum install openssl-devel

Install protoc (as root):

tar -xvf protobuf-2.5.0.tar.bz2
cd protobuf-2.5.0
./configure --prefix=/opt/protoc/
make && make install

Build Hadoop:

mvn clean package -Pdist,native -DskipTests -Dtar

The built distribution ends up under /home/hadoop/ocdc/hadoop-2.6.0-src/hadoop-dist/target.

Configure the hosts file:

10.1.245.244 master
10.1.245.243 slave1

Then run hostname master on the master node.

Passwordless SSH login:

Generate a key pair: ssh-keygen -t rsa -P ""
Enter the key directory: cd .ssh (run ls -a there to see the files)
Append the public key id_rsa.pub to authorized_keys: cat id_rsa.pub >> authorized_keys

 

core-site.xml

<configuration>

 <!-- default file system (HDFS) URI -->
 <property>
  <name>fs.defaultFS</name>
  <value>hdfs://master</value>
 </property>

 <property>
  <name>io.file.buffer.size</name>
  <value>131072</value>
 </property>

 <!-- directory where Hadoop keeps its temporary data -->
 <property>
  <name>hadoop.tmp.dir</name>
  <value>/home/hadoop/ocdc/hadoop-2.6.0/tmp</value>
  <description>A base for other temporary directories.</description>
 </property>

 <property>
  <name>hadoop.proxyuser.spark.hosts</name>
  <value>*</value>
 </property>

 <property>
  <name>hadoop.proxyuser.spark.groups</name>
  <value>*</value>
 </property>

 <!-- ZooKeeper quorum (used for HA) -->
 <property>
   <name>ha.zookeeper.quorum</name>
   <value>h4:2181,h5:2181,h6:2181</value>
 </property>

</configuration>

 

hdfs-site.xml

<configuration>

 <property>

  <name>dfs.namenode.secondary.http-address</name>

  <value>master:9001</value>

 </property>

 

  <property>

   <name>dfs.namenode.name.dir</name>

   <value>/home/hadoop/ocdc/hadoop-2.6.0/name</value>

 </property>

 

 <property>

  <name>dfs.datanode.data.dir</name>

  <value>/home/hadoop/ocdc/hadoop-2.6.0/data</value>

  </property>

 

 <property>

  <name>dfs.replication</name>

  <value>3</value>

 </property>

 

 <property>

  <name>dfs.webhdfs.enabled</name>

  <value>true</value>

 </property>

 

<property>

   <name>dfs.nameservices</name>

   <value>ns1</value>

</property>

 

<!-- nameservice ns1 has two NameNodes: nn1 and nn2 -->

<property>

   <name>dfs.ha.namenodes.ns1</name>

   <value>nn1,nn2</value>

</property>

</configuration>

 

yarn-site.xml

<configuration>

<!-- Site specific YARN configuration properties -->

<!-- have the NodeManager load the MapReduce shuffle service -->

<property>

   <name>yarn.nodemanager.aux-services</name>

   <value>mapreduce_shuffle</value>

  </property>

  <property>

   <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>

   <value>org.apache.hadoop.mapred.ShuffleHandler</value>

  </property>

  <property>

   <name>yarn.resourcemanager.address</name>

   <value>master:8032</value>

  </property>

  <property>

   <name>yarn.resourcemanager.scheduler.address</name>

   <value>master:8030</value>

  </property>

  <property>

   <name>yarn.resourcemanager.resource-tracker.address</name>

   <value>master:8035</value>

  </property>

  <property>

   <name>yarn.resourcemanager.admin.address</name>

   <value>master:8033</value>

  </property>

  <property>

   <name>yarn.resourcemanager.webapp.address</name>

   <value>master:8088</value>

  </property>

<property>

<name>yarn.nodemanager.resource.memory-mb</name>

<value>16384</value>

</property>

<!-- ResourceManager hostname -->

   <property>

       <name>yarn.resourcemanager.hostname</name>

            <value>h3</value>

   </property>

</configuration>

 

mapred-site.xml

<configuration>

<property>

<name>mapreduce.framework.name</name>

   <value>yarn</value>

 </property>

 <property>

  <name>mapreduce.jobhistory.address</name>

  <value>master:10020</value>

 </property>

 <property>

  <name>mapreduce.jobhistory.webapp.address</name>

  <value>master:19888</value>

 </property>

<property>

    <name>yarn.scheduler.maximum-allocation-mb</name>

    <value>16384</value>

  </property>

</configuration>

 

A year or so ago MapReduce still seemed mysterious to most people, but it simply means doing the data processing inside the Map or Reduce phase; you can just as well read from or write to HBase or Redis inside the Map. In other words, the business logic is written inside MapReduce. Sample code:

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
    // constant 1 used to emit <word, 1> pairs
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // Hadoop calls map() once per input line: as many lines as there are, that many calls
        String line = value.toString();
        // split the line into words (a custom delimiter can be passed to StringTokenizer)
        StringTokenizer tokenizer = new StringTokenizer(line);
        // iterate over the words
        while (tokenizer.hasMoreTokens()) {
            // emit <word, 1>
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }
    }
}

// Note that reduce() receives all values for the same key (the word) as <word, list of 1> and sums the 1s
public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
        // running count for this word
        int sum = 0;
        while (values.hasNext()) {
            // add up the counts
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

 

HBase Installation, Configuration, and Code Framework

Since I use an external ZooKeeper ensemble, HBASE_MANAGES_ZK is set to false in hbase-env.sh:

export JAVA_HOME=/usr/local/yangsy/jdk1.7.0_55

export HBASE_CLASSPATH=/usr/local/hbase-1.0.2/conf

export HBASE_MANAGES_ZK=false

 

hbase-site.xml

<configuration>

<!-- HDFS directory where HBase stores its data -->
<property>
  <name>hbase.rootdir</name>
  <value>hdfs://master:9000/usr/local/hadoop-2.6.0/hbaseData</value>
</property>

<!-- run HBase in fully distributed (cluster) mode -->
<property>
  <name>hbase.cluster.distributed</name>
  <value>true</value>
</property>

<!-- HBase master address and port -->
<property>
  <name>hbase.master</name>
  <value>hdfs://master:60000</value>
</property>

<!-- port the HBase Master web UI binds to; the default bind address is 0.0.0.0 -->
<property>
  <name>hbase.master.info.port</name>
  <value>60010</value>
</property>

<!-- ZooKeeper client port -->
<property>
  <name>hbase.zookeeper.property.clientPort</name>
  <value>2183</value>
</property>

<!-- ZooKeeper quorum nodes (use an odd number of servers) -->
<property>
  <name>hbase.zookeeper.quorum</name>
  <value>master,slave1,slave2</value>
</property>

<!-- ZooKeeper data directory -->
<property>
  <name>hbase.zookeeper.property.dataDir</name>
  <value>/usr/local/zookeeper-3.4.6/data</value>
</property>

<!-- ZooKeeper session timeout -->
<property>
  <name>zookeeper.session.timeout</name>
  <value>60000</value>
</property>

</configuration>

Note that when an external ZooKeeper ensemble is used, ZooKeeper's zoo.cfg has to be copied into HBase's conf directory; HBase loads it automatically at startup.

List the regionserver node addresses in the regionservers file.

 

Code structure:

  1 package HbaseTest;
  2 
  4 import org.apache.hadoop.conf.Configuration;
  5 import org.apache.hadoop.hbase.*;
  6 import org.apache.hadoop.hbase.client.*;
  7 
  8 import java.util.ArrayList;
  9 import java.util.List;
 10 
 11 /**
 12  * Created by root on 5/30/16.
 13  */
 14 public class HbaseTest {
 15     private Configuration conf;
 16     public void init(){
 17         conf = HBaseConfiguration.create();
 18     }
 19 
 20    public void createTable(){
 21        Connection conn = null;
 22        try{
 23            conn = ConnectionFactory.createConnection(conf);
 24            HBaseAdmin hadmin = (HBaseAdmin)conn.getAdmin();
 25            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("yangsy"));
 26 
 27            desc.addFamily(new HColumnDescriptor("f1"));
 28            if(hadmin.tableExists("yangsy")){
 29                System.out.println("table is exists!");
 30                System.exit(0);
 31            }else{
 32                hadmin.createTable(desc);
 33                System.out.println("create table success");
 34            }
 35        }catch (Exception e){
 36            e.printStackTrace();
 37        }finally {
 38            {
 39                if(null != conn){
 40                    try{
 41                        conn.close();
 42                    }catch(Exception e){
 43                        e.printStackTrace();
 44                    }
 45                }
 46            }
 47        }
 48    }
 49 
 50     public void query(){
 51         Connection conn = null;
 52         HTable table = null;
 53         ResultScanner scan = null;
 54         try{
 55             conn = ConnectionFactory.createConnection(conf);
 56             table = (HTable)conn.getTable(TableName.valueOf("yangsy"));
 57 
 58             scan = table.getScanner(new Scan());
 59 
 60             for(Result rs : scan){
 61                 System.out.println("rowkey:" + new String(rs.getRow()));
 62 
 63                 for(Cell cell : rs.rawCells()){
 64                     System.out.println("column:" + new String(CellUtil.cloneFamily(cell)));
 65 
 66                     System.out.println("columnQualifier:"+new String(CellUtil.cloneQualifier(cell)));
 67 
 68                     System.out.println("columnValue:" + new String(CellUtil.cloneValue(cell)));
 69 
 70                     System.out.println("----------------------------");
 71                 }
 72             }
 73         }catch(Exception e){
 74             e.printStackTrace();
 75         }finally{
 76             try {
 77                 table.close();
 78                 if(null != conn) {
 79                     conn.close();
 80                 }
 81             }catch (Exception e){
 82                 e.printStackTrace();
 83             }
 84         }
 85     }
 86 
 87     public void queryByRowKey(){
 88         Connection conn = null;
 89         ResultScanner scann = null;
 90         HTable table = null;
 91         try {
 92             conn = ConnectionFactory.createConnection(conf);
 93             table = (HTable)conn.getTable(TableName.valueOf("yangsy"));
 94 
 95             Result rs = table.get(new Get("1445320222118".getBytes()));
 96             System.out.println("yangsy the value of rokey:1445320222118");
 97             for(Cell cell : rs.rawCells()){
 98                 System.out.println("family" + new String(CellUtil.cloneFamily(cell)));
 99                 System.out.println("value:"+new String(CellUtil.cloneValue(cell)));
100             }
101         }catch (Exception e){
102             e.printStackTrace();
103         }finally{
104             if(null != table){
105                 try{
106                     table.close();
107                 }catch (Exception e){
108                     e.printStackTrace();
109                 }
110             }
111         }
112     }
113 
114     public void insertData(){
115         Connection conn = null;
116         HTable hTable = null;
117         try{
118             conn = ConnectionFactory.createConnection(conf);
119             hTable = (HTable)conn.getTable(TableName.valueOf("yangsy"));
120 
121             Put put1 = new Put(String.valueOf("1445320222118").getBytes());
122 
123             put1.addColumn("f1".getBytes(),"Column_1".getBytes(),"123".getBytes());
124             put1.addColumn("f1".getBytes(),"Column_2".getBytes(),"456".getBytes());
125             put1.addColumn("f1".getBytes(),"Column_3".getBytes(),"789".getBytes());
126 
127             Put put2 = new Put(String.valueOf("1445320222119").getBytes());
128 
129             put2.addColumn("f1".getBytes(),"Column_1".getBytes(),"321".getBytes());
130             put2.addColumn("f1".getBytes(),"Column_2".getBytes(),"654".getBytes());
131             put2.addColumn("f1".getBytes(),"Column_3".getBytes(),"987".getBytes());
132 
133             List<Put> puts = new ArrayList<Put>();
134             puts.add(put1);
135             puts.add(put2);
136             hTable.put(puts);
137         }catch(Exception e){
138             e.printStackTrace();
139         }finally{
140             try {
141                 if (null != hTable) {
142                     hTable.close();
143                 }
144             }catch(Exception e){
145                 e.printStackTrace();
146             }
147         }
148     }
149 
150 public static void main(String args[]){
151     HbaseTest test = new HbaseTest();
152     test.init();
153     test.createTable();
154     test.insertData();
155     test.query();
156 }
157 
158 
159 }

 

 

Storm Installation, Configuration, and Code Framework

Topology construction:
Write a topology entry class whose constructor sets the configuration parameters, serialization, and so on. The number of workers and calc bolts is read at launch time through CommandLine, and finally StormSubmitter.submitTopologyWithProgressBar is called with the topology name, the config, and the TopologyBuilder instance.


Data-receiving spout:

Storm pulls its data from Kafka; the spout is created in BasicTopology with the following parameters:

kafka.brokerZkStr   ZooKeeper servers used by Kafka
kafka.brokerZkPath   ZooKeeper path under which the Kafka brokers are registered
kafka.offset.zkPort   ZooKeeper port used for storing offsets, default 2181
kafka.offset.zkRoot   ZooKeeper path used for storing offsets, e.g. /kafka-offset
stateUpdateIntervalMs   interval at which offset information is written to ZooKeeper, 30000

When the spout and bolts are initialized they build a stormBeanFactory object and later obtain objects from that bean factory with getBean.


Bolt processing:

A custom bolt extends BaseRichBolt and implements its prepare and declareOutputFields methods (plus execute). In prepare it reads the StormBeanFactory configuration and loads the business classes.

 

 

Storm configuration parameters:

 

dev.zookeeper.path : '/tmp/dev-storm-zookeeper'     local directory used when Storm launches its own ZooKeeper in dev mode, listening on the port given by storm.zookeeper.port
drpc.childopts: '-Xms768m'
drpc.invocations.port 3773
drpc.port : 3772
drpc.queue.size : 128
drpc.request.timeout.secs : 600
drpc.worker.threads : 64
java.library.path : ''
logviewer.appender.name : 'A1'
logviewer.childopts : '-Xms128m'
logviewer.port : 8000
metrics.reporter.register : 'org.apache.hadoop.metrics2.sink.storm.StormTimelineMetricsReporter'
nimbus.childopts : '-Xmx1024m -javaagent:/usr/hdp/current/storm-nimbus/contrib/storm-jmxetric/lib/jmxetric-1.0.4.jar=host=localhost,port=8649,wireformat31x-true...'
nimbus.cleanup.inbox.freq.secs : 600
nimbus.file.copy.expiration.secs : 600

storm.yaml

nimbus.host   address of the nimbus node

nimbus.inbox.jar.expiration.secs   how long jars in nimbus's inbox are kept after their last modification before being cleaned up, default 3600 seconds

nimbus.monitor.freq.secs : 120
nimbus.reassign : true  whether nimbus reassigns a task when it is found to have failed; defaults to true, changing it is not recommended
nimbus.supervisor.timeout.secs : 60
nimbus.task.launch.secs : 120
nimbus.task.timeout.secs : 30  task heartbeat timeout; after it nimbus assumes the task is dead and reassigns it to another node
nimbus.thrift.max_buffer_size : 1048576
nimbus.thrift.port : 6627
nimbus.topology.validator : 'backtype.storm.nimbus.DefaultTopologyValidator'
storm.cluster.metrics.consumer.register : [{"class" : "org.apache.hadoop.metrics2.sink.storm.StormTimelineMetricsReporter"}]
storm.cluster.mode : 'distributed'
storm.local.dir : '/storage/disk1/hadoop/storm'
storm.local.mode.zmq : false
storm.log.dir : '/var/log/storm'
storm.messaging.netty.buffer_size : 5242880  size, after serialization, of each batch of task messages (tuples) sent at once
storm.messaging.netty.client_worker_threads : 10  number of netty client worker threads
storm.messaging.netty.max_retries : 60  maximum number of retries
storm.messaging.netty.max_wait_ms : 2000  maximum wait between retries (ms)
storm.messaging.netty.min_wait_ms : 100 minimum wait between retries (ms)
storm.messaging.netty.server_worker_threads : 10 number of netty server worker threads
storm.messaging.transport : 'backtype.storm.messaging.netty.Context' transport implementation
storm.thrift.transport : 'backtype.storm.security.auth.SimpleTransportPlugin'
storm.zookeeper.connection.timeout : 15000
storm.zookeeper.port : 2181
storm.zookeeper.retry.interval : 1000
storm.zookeeper.retry.intervalceiling.millis : 30000
storm.zookeeper.retry.times : 5
storm.zookeeper.root : '/storm'  root path for Storm in ZooKeeper
storm.zookeeper.servers : ['',''.....]   list of ZooKeeper servers
storm.zookeeper.session.timeout : 20000  ZooKeeper client session timeout
supervisor.childopts :    used by the storm-deploy project to set JVM options for the supervisor daemon
supervisor.heartbeat.frequency.secs : 5  how often the supervisor sends heartbeats
supervisor.monitor.frequency.secs : 3  how often the supervisor checks worker heartbeats
supervisor.slots.ports : [6700,6701,....]   ports on which this supervisor can run workers; each worker uses one port and each port runs at most one worker, so this list controls how many workers run per machine (slots per machine)
supervisor.worker.start.timeout.secs : 120
supervisor.worker.timeout.secs : 30  worker heartbeat timeout; once exceeded, the supervisor tries to restart the worker process
task.heartbeat.frequency.secs : 3  how often tasks report their status heartbeat
task.refresh.poll.secs : 10  how often tasks refresh their connections to other tasks (if a task is reassigned, the tasks sending to it need new connections); normally the reassignment is broadcast quickly, so this is only a safety net for missed notifications
topology.acker.executors : null
topology.builtin.metrics.bucket.size.secs : 60
topology.debug : false
topology.disruptor.wait.strategy : 'com.lmax.disruptor.BlockingWaitStrategy'
topology.enable.message.timeouts : true
topology.error.throttle.interval.secs : 10
topology.executor.receive.buffer.size : 1024
topology.executor.send.buffer.size : 1024
topology.fall.back.on.java.serialization : true
topology.kryo.factory : 'backtype.storm.serialization.DefaultKryoFactory'
topology.max.error.report.per.interval : 5
topology.max.spout.pending : null  maximum number of pending (unacked) tuples allowed for a single spout task; applies per task, not per spout or per topology
topology.max.task.parallelism : null  maximum component parallelism allowed in a topology; mainly used to cap thread counts when testing in local mode
topology.message.timeout.secs : 30  maximum time for a message emitted by a spout to be fully processed; if it is not acked within this window, Storm reports the message to the spout as failed, and some spouts replay failed messages
topology.metrics.aggregate.metric.evict.secs : 5
topology.metrics.aggregate.per.worker : true
topology.metrics.consumer.register :
topology.metrics.expand.map.type : true
topology.metrics.metric.name.separator : ','
topology.optimize : true
topology.skip.missing.kryo.registrations : false  whether Storm should skip Kryo registrations it cannot resolve; if set to false, tasks may fail to load or throw errors at runtime
topology.sleep.spout.wait.strategy.time.ms : 1
topology.spout.wait.strategy : 'backtype.storm.spout.SleepSpoutWaitStrategy'
topology.state.synchronization.timeout.secs : 60
topology.stats.sample.rate : 0.05
topology.tick.tuple.freq.secs : null
topology.transfer.buffer.size : 1024
topology.trident.batch.emit.interval.millis : 500
topology.tuple.serializer : 'backtype.storm.serialization.types.ListDelegateSerializer'
topology.worker.childopts : null
topology.worker.shared.thread.pool.size : 4
topology.workers : 40  number of worker processes to start across the cluster for this topology; each process runs some number of tasks as threads, so tune this together with the parallelism hints of the components
transactional.zookeeper.port : null
transactional.zookeeper.root : '/transactional'
transactional.zookeeper.servers : null
ui.childopts : '-Xmx2048m'
ui.filter : null
ui.port : 8744  port the Storm UI listens on
worker.childopts :
worker.heartbeat.frequency.secs : 1
zmq.hwm : 0
zmq.linger.millis : 5000  how long a closing connection keeps trying to resend pending messages to the target host; a rarely used advanced option that can usually be ignored
zmq.threads : 1  number of threads used for ZeroMQ messaging inside each worker process

 

 

Storm code framework summary

Base topology class, BasicTopology.java:

  1 import java.io.UnsupportedEncodingException;
  2 import java.math.BigDecimal;
  3 import java.util.Date;
  4 import java.util.HashMap;
  5 import java.util.List;
  6 import java.util.Map;
  7 
  8 import org.apache.commons.cli.CommandLine;
  9 import org.apache.commons.cli.CommandLineParser;
 10 import org.apache.commons.cli.DefaultParser;
 11 import org.apache.commons.cli.HelpFormatter;
 12 import org.apache.commons.cli.Options;
 13 import org.apache.commons.lang3.StringUtils;
 14 import org.apache.storm.guava.base.Preconditions;
 15 import org.edm.storm.topo.util.DelayKafkaSpout;
 16 import org.edm.storm.topo.util.StormBeanFactory;
 17 import org.joda.time.DateTime;
 18 
 19 import storm.kafka.BrokerHosts;
 20 import storm.kafka.KafkaSpout;
 21 import storm.kafka.SpoutConfig;
 22 import storm.kafka.ZkHosts;
 23 import backtype.storm.Config;
 24 import backtype.storm.LocalCluster;
 25 import backtype.storm.StormSubmitter;
 26 import backtype.storm.topology.TopologyBuilder;
 27 import backtype.storm.tuple.Fields;
 28 
 29 public abstract class BasicTopology {
 30     
 31     public static final String HASH_TAG = "hashTag";
 32     
 33     public static final Fields HASH_FIELDS = new Fields(HASH_TAG);
 34     
 35     protected Options options = new Options();
 36     
 37     protected StormBeanFactory stormBeanFactory;
 38     
 39     protected Config config = new Config();
 40     
 41     protected String configFile;
 42     
 43     public BasicTopology(){
 44         config.setFallBackOnJavaSerialization(false);
 45         
 46         config.setSkipMissingKryoRegistrations(false);
 47         config.registerSerialization(Date.class);
 48         config.registerSerialization(BigDecimal.class);
 49         config.registerSerialization(HashMap.class);
 50         config.registerSerialization(Map.class);
 51 
 52         options.addOption("name", true, "name of the running topology");
 53         options.addOption("conf", false, "path to the configuration file");
 54         options.addOption("workers", true, "number of worker JVMs");
 55     }
 56     
 57     protected void setupConfig(CommandLine cmd) throws UnsupportedEncodingException{
 58         //name of the configuration file
 59         String confLocation = cmd.getOptionValue("conf",getConfigName());
 60         //create the stormBeanFactory
 61         stormBeanFactory = new StormBeanFactory(confLocation);
 62         Map<String,Object> stormConfig = stormBeanFactory.getBean("stormConfig",Map.class);
 63         Preconditions.checkNotNull(stormConfig);
 64         config.putAll(stormConfig);
 65         config.put(StormBeanFactory.SPRING_BEAN_FACTORY_XML, stormBeanFactory.getXml());
 66         //load the defaults first, then override from the command line
 67         String numWorkers = cmd.getOptionValue("workers");
 68         if(numWorkers != null){
 69             config.setNumWorkers(Integer.parseInt(numWorkers));
 70         }else{
 71             config.setNumWorkers(getNumWorkers());
 72         }
 73         config.put(Config.TOPOLOGY_MESSAGE_TIMEOUT_SECS, 180);
 74         config.put(Config.TOPOLOGY_MAX_SPOUT_PENDING, 2000);
 75         config.put(Config.TOPOLOGY_EXECUTOR_RECEIVE_BUFFER_SIZE, 16384);
 76         config.put(Config.TOPOLOGY_EXECUTOR_SEND_BUFFER_SIZE, 16384);
 77         config.put(Config.TOPOLOGY_ACKER_EXECUTORS, config.get(Config.TOPOLOGY_WORKERS));
 78     }
 79     
 80     @SuppressWarnings("unchecked")
 81     protected SpoutConfig getSpoutConfig(String topic){
 82         String brokerZkStr = (String)config.get("kafka.brokerZkStr");
 83         String brokerZkPath = (String)config.get("kafka.brokerZkPath");
 84         
 85         List<String> zkServers = (List<String>)config.get("kafka.offset.zkServers");
 86         Integer zkPort = Integer.parseInt(String.valueOf(config.get("kafka.offset.zkPort")));
 87         String zkRoot = (String)config.get("kafka.offset.zkRoot");
 88         String id = StringUtils.join(getTopoName(),"-",topic);
 89         BrokerHosts kafkaBrokerZk = new ZkHosts(brokerZkStr, brokerZkPath);
 90         SpoutConfig spoutConfig = new SpoutConfig(kafkaBrokerZk, topic, zkRoot, id);
 91         spoutConfig.zkServers = zkServers;
 92         spoutConfig.zkPort = zkPort;
 93         spoutConfig.zkRoot = zkRoot;
 94         spoutConfig.stateUpdateIntervalMs = 30000;
 95         return spoutConfig;
 96     }
 97     
 98     //create the kafka spout
 99     public KafkaSpout getKafkaSpout(String topic){
100         SpoutConfig spoutConfig = getSpoutConfig(topic);
101         return new DelayKafkaSpout(spoutConfig);
102     }
103     
104     
105     /**
106      * The topology can be deployed multiple times; this name is used when reading from Kafka, for exactly-once filtering, and so on.
107      * 
108      * @return
109      */
110     public abstract String getTopoName();
111     
112     public abstract String getConfigName();
113     
114     public abstract int getNumWorkers();
115     
116     public void registerKryo(Config config){
117         
118     }
119     
120     public abstract void addOptions(Options options);
121         
122     public abstract void setupOptionValue(CommandLine cmd);
123     
124     public abstract void createTopology(TopologyBuilder builder);
125     
126     public void runLocat(String[] args) throws Exception{
127         CommandLineParser parser = new DefaultParser();
128         CommandLine cmd = parser.parse(options, args);
129         HelpFormatter formatter = new HelpFormatter();
130         formatter.printHelp("topology", options);
131         setupConfig(cmd);
132         
133         config.setDebug(true);
134         config.setNumWorkers(1);
135         TopologyBuilder builder = new TopologyBuilder();
136         createTopology(builder);
137         LocalCluster cluster = new LocalCluster();
138         String topoName = cmd.getOptionValue("name",
139                 StringUtils.join(getTopoName(), "-", new DateTime().toString("yyyyMMdd-HHmmss")));
140         cluster.submitTopology(topoName,config,builder.createTopology());
141         
142     }
143     
144     public void run(String args[]) throws Exception{
145         CommandLineParser parser = new DefaultParser();
146         CommandLine cmd = parser.parse(options, args);
147         HelpFormatter formatter = new HelpFormatter();
148         formatter.printHelp("topology", options);
149         setupConfig(cmd);
150         setupOptionValue(cmd);
151         
152         TopologyBuilder builder = new TopologyBuilder();
153         createTopology(builder);
154         String topoName = cmd.getOptionValue("name",StringUtils.join(getTopoName(),new DateTime().toString("yyyyMMdd-HHmmss")));
155         StormSubmitter.submitTopologyWithProgressBar(topoName, config, builder.createTopology());
156     }
157 }

 

A concrete topology that wires the spout and bolts into the business data flow:

 1 import org.apache.commons.cli.CommandLine;
 2 import org.apache.commons.cli.Options;
 3 
 4 import storm.kafka.KafkaSpout;
 5 
 6 import backtype.storm.topology.TopologyBuilder;
 7 
 8 import com.practice.impl.BasicTopology;
 9 
10 public class LbsCalcTopology extends BasicTopology{
11     
12     private int spoutParallelism = 1;
13     
14     private int onceParallelism = 1;
15     
16     private int calculateParallelism = 1;
17     
18 
19     @Override
20     public void addOptions(Options options) {
21         options.addOption("spout",true,"spoutParallelism");
22         options.addOption("once",true,"onceParallelism");
23         options.addOption("calc",true,"calculateParallelism");
24     }
25 
26     @Override
27     public void setupOptionValue(CommandLine cmd) {
28         spoutParallelism = Integer.parseInt(cmd.getOptionValue("spout","1"));
29         onceParallelism = Integer.parseInt(cmd.getOptionValue("once","1"));
30         calculateParallelism = Integer.parseInt(cmd.getOptionValue("calc","1"));
31     }
32 
33     @Override
34     public void createTopology(TopologyBuilder builder) {
35         KafkaSpout kafkaSpout = getKafkaSpout("lbs");
36         LbsOnceBolt exactlyOnceBolt = new LbsOnceBolt(getTopoName(),"lbs","lbs");
37         LbsCalcBolt calculateBolt = new LbsCalcBolt();
38         
39         builder.setSpout("kafkaSpout", kafkaSpout,spoutParallelism);
40         builder.setBolt("onceParallelism", exactlyOnceBolt,onceParallelism).shuffleGrouping("kafkaSpout");
41         builder.setBolt("calculateBolt", calculateBolt, calculateParallelism).fieldsGrouping("onceParallelism", BasicTopology.HASH_FIELDS);
42     }
43     
44     @Override
45     public String getTopoName() {
46         return "lbs";
47     }
48 
49     @Override
50     public String getConfigName() {
51         return "lbs-topology.xml";
52     }
53 
54     @Override
55     public int getNumWorkers() {
56         return 3;
57     }
58     
59     public static void main(String args[]) throws Exception{
60         LbsCalcTopology topo = new LbsCalcTopology();
61         topo.run(args);
62     }
63 }
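Once the topology is packaged into a jar, it would be submitted with the Storm client roughly as follows; the jar name and option values are placeholders, and the options correspond to the ones registered in BasicTopology and addOptions above:

storm jar lbs-topology.jar LbsCalcTopology -name lbs-20161121 -workers 3 -spout 2 -once 2 -calc 4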

 

The bolt implementing the actual business logic:

 1 import java.util.Map;
 2 
 3 import backtype.storm.task.OutputCollector;
 4 import backtype.storm.task.TopologyContext;
 5 import backtype.storm.topology.OutputFieldsDeclarer;
 6 import backtype.storm.topology.base.BaseRichBolt;
 7 import backtype.storm.tuple.Tuple;
 8 
 9 public class LbsCalcBolt extends BaseRichBolt{
10 
11     @Override
12     public void execute(Tuple input) {
13         try{
14             String msg = input.getString(0);
15             if(msg != null){
16                 System.out.println(msg);
17             }
18         }catch(Exception e){
19             e.printStackTrace();
20         }
21     }
22 
23     @Override
24     public void prepare(Map arg0, TopologyContext arg1, OutputCollector arg2) {
25         // TODO Auto-generated method stub
26         
27     }
28 
29     @Override
30     public void declareOutputFields(OutputFieldsDeclarer arg0) {
31         // TODO Auto-generated method stub
32         
33     }
34 
35 }

 

 

Kafka Installation, Configuration, and Code Framework


1. Key configuration parameters

-- directory where the log (message) data is stored
log.dirs = /storage/disk17/kafka-logs

-- Kafka server port
port = 6667

-- maximum time before a log segment file is rolled (the segment size is controlled by log.segment.bytes, default 1073741824 bytes)
log.roll.hours = 168

-- how long log files are retained
log.retention.hours = 168

-- maximum number of messages accumulated before flushing to disk; too large a value makes disk writes long (I/O stalls) and causes client latency
log.flush.interval.messages=10000

-- interval between flushes
log.flush.scheduler.interval.ms=3000

-- ZooKeeper connection timeout
zookeeper.connection.timeout.ms = 20000

-- ZooKeeper ensemble nodes, comma separated
zookeeper.connect=127.0.0.1:2181

-- sync time between the ZooKeeper leader and its followers
zookeeper.sync.time.ms=2000

-- whether topics may be created automatically
auto.create.topics.enable=true

-- default replication factor
default.replication.factor=1

-- default number of partitions
num.partitions=1


2. Message and log retention

Unlike JMS (Java Message Service) implementations such as ActiveMQ, Kafka does not delete a message as soon as it is consumed. Log files are kept for the period configured on the broker and then deleted; with log.retention.hours=168, for example, files are removed after seven days whether or not their messages have been consumed. With this simple scheme Kafka frees disk space and avoids the disk I/O that modifying file contents after consumption would cost.

The consumer has to keep track of the offsets of consumed messages (if no other store is specified, they are kept in ZooKeeper under the default kafka-broker path); saving and using the offset is entirely under the consumer's control. When a consumer processes messages normally, the offset advances "linearly", i.e. messages are consumed in order; in fact a consumer can read messages in any order simply by resetting the offset to an arbitrary value.

The Kafka cluster itself keeps almost no consumer or producer state; that information lives in ZooKeeper. Producer and consumer clients are therefore very lightweight and can come and go freely without putting extra load on the cluster.

3. Common commands

3.1 Create a topic

./kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic topicName --partitions 2 --replication-factor 3 --create

--zookeeper           ZooKeeper server address
--topic               topic name
--partitions          number of partitions (falls back to the broker default if omitted)
--replication-factor  number of replicas (falls back to the broker default if omitted)
--create              create the topic

3.2 Delete a topic

* Use with caution: this command only removes the topic's metadata from ZooKeeper; the log files are not deleted, and deleting them directly is not recommended. If the topic really has to go, the safe procedure is to stop the Kafka cluster, run this command, delete the log files, and then restart the service; afterwards run --describe (3.3) to check that the topic is gone.

./kafka-run-class.sh kafka.admin.DeleteTopicCommand --zookeeper 127.0.0.1:2181 --topic topicName

3.3 Describe a topic

./kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic topicName --describe

--zookeeper  ZooKeeper server address
--topic      topic name
--describe   show topic details

The command prints output like the following: topicName has 2 partitions, a replication factor of 3, and the leader of each partition is shown.

Topic:topicName PartitionCount:2 ReplicationFactor:3 Configs:
Topic: topicName Partition: 0 Leader: 2 Replicas: 2,1,0 Isr: 2,1,0
Topic: topicName Partition: 1 Leader: 0 Replicas: 0,2,1 Isr: 0,2,1

3.4 Other commands

-- list all topics
./kafka-topics.sh --zookeeper 127.0.0.1:2181 --list

-- change the number of partitions of a topic
./kafka-topics.sh --zookeeper 127.0.0.1:2181 --topic topicName --partitions 4 --alter
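For quick end-to-end checks, the console producer and consumer that ship with Kafka are handy; the broker port 6667 matches the server configuration above, and the addresses are examples:

-- produce test messages to a topic
./kafka-console-producer.sh --broker-list 127.0.0.1:6667 --topic topicName

-- consume a topic from the beginning
./kafka-console-consumer.sh --zookeeper 127.0.0.1:2181 --topic topicName --from-beginning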

 

 

Redis Installation, Configuration, and Code Framework

 

 

Jetty Configuration and Code Framework

The Jetty service starts eight threads to ingest data. Incoming payloads are stored in a BlockingQueue and consumed from there; the point of running several threads is to consume the incoming messages concurrently. Multiple instances of the same service are started and load-balanced through nginx.
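A bare-bones sketch of that ingest pattern, written in Scala like the other examples here; the eight consumer threads and the BlockingQueue come from the description above, while the queue capacity and message handling are assumptions:

import java.util.concurrent.{Executors, LinkedBlockingQueue}

object IngestQueueDemo {
  // the Jetty handlers would put each received payload on this queue
  val queue = new LinkedBlockingQueue[String](100000)

  def main(args: Array[String]): Unit = {
    val pool = Executors.newFixedThreadPool(8)
    // eight consumer threads drain the queue concurrently
    for (_ <- 1 to 8) {
      pool.execute(new Runnable {
        override def run(): Unit = {
          while (true) {
            val msg = queue.take() // blocks until a message arrives
            // parse the payload and hand it over to the business logic here
            println(msg)
          }
        }
      })
    }
    // simulate one incoming message so the demo prints something
    queue.put("{\"event\":\"demo\"}")
  }
}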

I. Starting the Jetty service

1. Create the ApplicationContext via AbstractApplicationContext:
AbstractApplicationContext applicationContext = new FileSystemXmlApplicationContext(configXmlFile);

2. Obtain the platform MBeanServer:
MBeanServer mbs = java.lang.management.ManagementFactory.getPlatformMBeanServer();

3. Register the ApplicationServer MBean:

ObjectName mbeanName = getApplicationObjectName();
ApplicationServer applicationServer = new ApplicationServer();
....
mbs.registerMBean(applicationServer, mbeanName);

 

II. Stopping the Jetty service
......

 

 

 

Hive Installation, Configuration, and Code Framework

 

 

RStudio and Spark Installation, Configuration, and Code Framework

 

 

Nginx Configuration

 

user www-data; #運行用戶
worker_processes 1; #啓動進程,通常設置成和cpu的數量相等
error_log /var/log/nginx/error.log; #全局錯誤日誌及PID文件
pid /var/run/nginx.pid;#PID文件

#工作模式及連接數上限
events {
use epoll; #epoll是多路複用IO(I/O Multiplexing)中的一種方式,但是僅用於linux2.6以上內核,可以大大提高nginx的性能
worker_connections 1024;#單個後臺worker process進程的最大併發鏈接數
# multi_accept on;
}

#設定http服務器,利用它的反向代理功能提供負載均衡支持
http {
#設定mime類型,類型由mime.type文件定義
include /etc/nginx/mime.types;
default_type application/octet-stream;
#設定日誌格式
access_log /var/log/nginx/access.log;

#sendfile 指令指定 nginx 是否調用 sendfile 函數(zero copy 方式)來輸出文件,對於普通應用,
#必須設爲 on,如果用來進行下載等應用磁盤IO重負載應用,可設置爲 off,以平衡磁盤與網絡I/O處理速度,降低系統的uptime.
sendfile on;
#tcp_nopush on;

#連接超時時間
#keepalive_timeout 0;
keepalive_timeout 65;
tcp_nodelay on;

#開啓gzip壓縮
gzip on;
gzip_disable "MSIE [1-6]\.(?!.*SV1)";

#設定請求緩衝
client_header_buffer_size 1k;
large_client_header_buffers 4 4k;

include /etc/nginx/conf.d/*.conf;
include /etc/nginx/sites-enabled/*;

#設定負載均衡的服務器列表
upstream mysvr {
#weigth參數表示權值,權值越高被分配到的機率越大
#本機上的Squid開啓3128端口
server 192.168.8.1:3128 weight=5;
server 192.168.8.2:80 weight=1;
server 192.168.8.3:80 weight=6;
}


server {
# listen on port 80
listen 80;
# serve requests for www.xx.com
server_name www.xx.com; # virtual host name

# access log for this virtual host
access_log logs/www.xx.com.access.log main;

# default location
location / {
root /root; # document root of the site
index index.php index.html index.htm; # index file names

fastcgi_pass www.xx.com;
fastcgi_param SCRIPT_FILENAME $document_root/$fastcgi_script_name;
include /etc/nginx/fastcgi_params;
}

# error pages
error_page 500 502 503 504 /50x.html;
location = /50x.html {
root /root;
}

# static files are served by nginx itself
location ~ ^/(images|javascript|js|css|flash|media|static)/ {
root /var/www/virtual/htdocs;
# expire after 30 days; static files change rarely, so this can be high; lower it if they are updated frequently
expires 30d;
}
# forward all PHP requests to FastCGI, using the default FastCGI configuration
location ~ \.php$ {
root /root;
fastcgi_pass 127.0.0.1:9000;
fastcgi_index index.php;
fastcgi_param SCRIPT_FILENAME /home/www/www$fastcgi_script_name;
include fastcgi_params;
}
# location for the Nginx status page
location /NginxStatus {
stub_status on;
access_log on;
auth_basic "NginxStatus";
auth_basic_user_file conf/htpasswd;
}
# deny access to .ht* files
location ~ /\.ht {
deny all;
}

}
}

 

 

Sentry configuration

 

 

Hue configuration (version 3.11)

Installing Hue
rpm -ivh hue-* --force --nodeps
rpm -ivh sentry-* --force --nodeps

Starting Hue
/usr/lib/hue/build/env/bin/hue runserver ip:port > /var/log/hue/hue-start.out 2> /var/log/hue/hue-error.out &


Spark Livy integration

Known issue: Hue reports "sqlite db is locked" when the default SQLite backend is used by several users; the MySQL [[database]] configuration shown below is used instead.

Key [desktop] properties:

Section   Property                 Value         Description
desktop   default_hdfs_superuser   hadoop        HDFS superuser
desktop   http_host                10.10.4.125   host/IP the Hue web server runs on
desktop   http_port                8000          Hue web server port
desktop   server_user              hadoop        process user running the Hue web server
desktop   server_group             hadoop        process group running the Hue web server
desktop   default_user             yanjun        Hue administrator


1. Create the hue user
# useradd hue
Put the packages under /home/hue/ (any location works)
# rpm -ivh hue-* --force --nodeps
After installation, Hue lives in /usr/lib/hue and its configuration file is /etc/hue/conf/hue.ini
2. Configure Hue (/etc/hue/conf/hue.ini)
[beeswax]
# Host where HiveServer2 is running.
# If Kerberos security is enabled, use fully-qualified domain name (FQDN).
hive_server_host=172.19.189.50 (address of HiveServer2)

# Port where HiveServer2 Thrift server runs on.
hive_server_port=10010 (HiveServer2 port)

# Hive configuration directory, where hive-site.xml is located
hive_conf_dir=/etc/conf (path to the Hive configuration files)

# Timeout in seconds for thrift calls to Hive service
server_conn_timeout=120 (connection timeout to Hive, in seconds)

# Choose whether to use the old GetLog() thrift call from before Hive 0.14 to retrieve the logs.
# If false, use the FetchResults() thrift call from Hive 1.0 or more instead.
## use_get_log_api=false

# Limit the number of partitions that can be listed.
list_partitions_limit=10000 (limit on the number of partitions listed in the UI)

# The maximum number of partitions that will be included in the SELECT * LIMIT sample query for partitioned tables.
query_partitions_limit=10 (at most 10 partitions are included per sample query)

# A limit to the number of cells (rows * columns) that can be downloaded from a query
# (e.g. - 10K rows * 1K columns = 10M cells.)
# A value of -1 means there will be no limit.
download_cell_limit=10000000 (maximum download size, in cells)

# Hue will try to close the Hive query when the user leaves the editor page.
# This will free all the query resources in HiveServer2, but also make its results inaccessible.
close_queries=true (close the user's HiveServer2 query when the editor page is left)

# Thrift version to use when communicating with HiveServer2.
# New column format is from version 7.
thrift_version=7 (default value; changing it is not recommended and causes errors)

[[yarn_clusters]]
[[[default]]]
# Enter the host on which you are running the ResourceManager
resourcemanager_host= (ResourceManager address)

# The port where the ResourceManager IPC listens on
resourcemanager_port=8032 (ResourceManager port)

# Whether to submit jobs to this cluster
submit_to=True

# Resource Manager logical name (required for HA)
logical_name=ActiveResourceManager (HA: logical name for the active ResourceManager)

# Change this if your YARN cluster is Kerberos-secured
## security_enabled=false

# URL of the ResourceManager API
resourcemanager_api_url=http://172.19.xxx.xx:8088 (ResourceManager URL)

# URL of the ProxyServer API
## proxy_api_url=http://localhost:8088

# URL of the HistoryServer API
history_server_api_url=http://172.19.xx.xx:19888

# URL of the Spark History Server
#spark_history_server_url=http://172.30.xx.xx:18088

# In secure mode (HTTPS), if SSL certificates from YARN Rest APIs
# have to be verified against certificate authority
## ssl_cert_ca_verify=True

# HA support by specifying multiple clusters.
# Redefine different properties there.
# e.g.

[[[ha]]]
# Resource Manager logical name (required for HA)
logical_name=StandbyResourceManager (logical name of the standby ResourceManager in the HA pair)

# Un-comment to enable
submit_to=True

# URL of the ResourceManager API
resourcemanager_api_url=http://172.19.189.50:8088 (address of the standby ResourceManager)

[hadoop]
# Configuration for HDFS NameNode
# ------------------------------------------------------------------------
[[hdfs_clusters]]
# HA support by using HttpFs
[[[default]]]
# Enter the filesystem uri
fs_defaultfs=hdfs://hdp (HDFS filesystem name)

# NameNode logical name.
logical_name=MasterNamenode (NameNode logical name)

# Use WebHdfs/HttpFs as the communication mechanism.
# Domain should be the NameNode or HttpFs host.
# Default port is 14000 for HttpFs.
webhdfs_url=http://172.19.189.50:50070/webhdfs/v1 (Hue follows NameNode HA failover through WebHDFS; this is the address where WebHDFS is running)

# Change this if your HDFS cluster is Kerberos-secured
## security_enabled=false

# In secure mode (HTTPS), if SSL certificates from YARN Rest APIs
# have to be verified against certificate authority
## ssl_cert_ca_verify=True

# Directory of the Hadoop configuration
hadoop_conf_dir=/etc/hadoop/conf (path to the Hadoop configuration files)

[desktop]
# Set this to a random string, the longer the better.
# This is used for secure hashing in the session store.
## secret_key=

# Execute this script to produce the Django secret key. This will be used when
# 'secret_key' is not set.
## secret_key_script=

# Webserver listens on this address and port
http_host=0.0.0.0
http_port=8888

# Time zone name
time_zone=Asia/Shanghai

# Enable or disable Django debug mode.
#django_debug_mode=false

# Enable or disable database debug mode.
#database_logging=false

# Whether to send debug messages from JavaScript to the server logs.
#send_dbug_messages=true

# Enable or disable backtrace for server error
http_500_debug_mode=true

# Enable or disable memory profiling.
## memory_profiler=false

# Server email for internal error messages
## django_server_email='[email protected]'

# Email backend
## django_email_backend=django.core.mail.backends.smtp.EmailBackend

# Webserver runs as this user
server_user=hue
server_group=hadoop (group for the Hue web server process)

# This should be the Hue admin and proxy user
default_user=hue (Hue administrator user)

# This should be the hadoop cluster admin
default_hdfs_superuser=hue (Hue's superuser on HDFS)
[[database]]
#engine=sqlite3
#name=/var/lib/hue/desktop.db
# Database engine is typically one of:
# postgresql_psycopg2, mysql, sqlite3 or oracle.
# Note that for sqlite3, 'name', below is a path to the filename. For other backends, it is the database name
# Note for Oracle, options={"threaded":true} must be set in order to avoid crashes.
# Note for Oracle, you can use the Oracle Service Name by setting "host=" and "port=" and then "name=<host>:<port>/<service_name>".
# Note for MariaDB use the 'mysql' engine.
engine=mysql
host=172.30.115.60
port=3306
user=hue
password=hue
name=hue
(location of Hue's own metadata database)
# Execute this script to produce the database password. This will be used when 'password' is not set.
## password_script=/path/script
## name=desktop/desktop.db
## options={}

[librdbms]
# The RDBMS app can have any number of databases configured in the databases
# section. A database is known by its section name
# (IE sqlite, mysql, psql, and oracle in the list below).
[[databases]]
# sqlite configuration.
## [[[sqlite]]]
# Name to show in the UI.
## nice_name=SQLite

# For SQLite, name defines the path to the database.
## name=/tmp/sqlite.db

# Database backend to use.
## engine=sqlite

# Database options to send to the server when connecting.
# https://docs.djangoproject.com/en/1.4/ref/databases/
## options={}

# mysql, oracle, or postgresql configuration.
[[[mysql]]] (enable the MySQL entry)
# Name to show in the UI.
nice_name="HUE DB" (name shown in the UI for this database)

# For MySQL and PostgreSQL, name is the name of the database.
# For Oracle, Name is instance of the Oracle server. For express edition
# this is 'xe' by default.
name=hue (MySQL database to connect to)

# Database backend to use. This can be:
# 1. mysql
# 2. postgresql
# 3. oracle
engine=mysql (database engine)

# IP or hostname of the database to connect to.
host=172.30.115.60 (MySQL address)

# Port the database server is listening to. Defaults are:
# 1. MySQL: 3306
# 2. PostgreSQL: 5432
# 3. Oracle Express Edition: 1521
port=3306 (port)

# Username to authenticate with when connecting to the database.
user=hue (database user)

# Password matching the username to authenticate with when
# connecting to the database.
password=hue (database password)

# Database options to send to the server when connecting.
# https://docs.djangoproject.com/en/1.4/ref/databases/
## options={}

Modifying the Hue source code
1. /usr/lib/hue/desktop/libs/notebook/src/notebook/connectors/hiveserver2.py
Add a _get_db method that picks the query_server configuration:
def _get_db(self, snippet):
  LOG.info('log by ck get_db')
  if snippet['type'] == 'hive':
    name = 'beeswax'
  elif snippet['type'] == 'impala':
    name = 'impala'
  else:
    name = 'spark-sql'

  return dbms.get_by_type(self.user, 'data', query_server=get_query_server_config_for_meta(name=name))

Modify the execute method of the executor:
@query_error_handler
def execute(self, notebook, snippet):
  db = self._get_db(snippet)

  statement = self._get_current_statement(db, snippet)
  session = self._get_session(notebook, snippet['type'])
  query = self._prepare_hql_query(snippet, statement['statement'], session)

  try:
    if statement.get('statement_id') == 0:
      db.use(query.database)
    handle = db.client.query(query)
  except QueryServerException, ex:
    raise QueryError(ex.message, handle=statement)

  # All good
  server_id, server_guid = handle.get()
  response = {
    'secret': server_id,
    'guid': server_guid,
    'operation_type': handle.operation_type,
    'has_result_set': handle.has_result_set,
    'modified_row_count': handle.modified_row_count,
    'log_context': handle.log_context,
  }
  response.update(statement)

  return response
2. /usr/lib/hue/apps/beeswax/src/beeswax/server/dbms.py
Add the query-server routing (change the IP and port to the Spark Thrift server's address):
def get_query_server_config_for_meta(name='beeswax', server=None):
  if name == 'impala':
    from impala.dbms import get_query_server_config as impala_query_server_config
    query_server = impala_query_server_config()
  else:
    kerberos_principal = hive_site.get_hiveserver2_kerberos_principal(HIVE_SERVER_HOST.get())

    query_server = {
      'server_name': 'beeswax',  # Aka HiveServer2 now
      'server_host': '172.30.115.58',  # HIVE_SERVER_HOST.get(),
      'server_port': '10010',  # HIVE_SERVER_PORT.get(),
      'principal': kerberos_principal,
      'http_url': '%(protocol)s://%(host)s:%(port)s/%(end_point)s' % {
        'protocol': 'https' if hiveserver2_use_ssl() else 'http',
        'host': '172.30.115.58',  # HIVE_SERVER_HOST.get(),
        'port': hive_site.hiveserver2_thrift_http_port(),
        'end_point': hive_site.hiveserver2_thrift_http_path()
      },
      'transport_mode': 'http' if hive_site.hiveserver2_transport_mode() == 'HTTP' else 'socket',
      'auth_username': AUTH_USERNAME.get(),
      'auth_password': AUTH_PASSWORD.get()
    }

  if name == 'sparksql':  # Spark SQL is almost the same as Hive
    from spark.conf import SQL_SERVER_HOST as SPARK_SERVER_HOST, SQL_SERVER_PORT as SPARK_SERVER_PORT

    query_server.update({
      'server_name': 'sparksql',
      'server_host': SPARK_SERVER_HOST.get(),
      'server_port': SPARK_SERVER_PORT.get()
    })

  debug_query_server = query_server.copy()
  debug_query_server['auth_password_used'] = bool(debug_query_server.pop('auth_password'))
  LOG.info("Query Server: %s" % debug_query_server)

  return query_server

Add per-type caches and locks for the query-server connections:
DBMS_CACHE = {}
DBMS_CACHE_LOCK = threading.Lock()

DBMS_META_CACHE = {}
DBMS_META_CACHE_LOCK = threading.Lock()

DBMS_DATA_CACHE = {}
DBMS_DATA_CACHE_LOCK = threading.Lock()


def get(user, query_server=None):
  global DBMS_CACHE
  global DBMS_CACHE_LOCK

  if query_server is None:
    query_server = get_query_server_config()

  DBMS_CACHE_LOCK.acquire()
  try:
    DBMS_CACHE.setdefault(user.username, {})

    if query_server['server_name'] not in DBMS_CACHE[user.username]:
      # Avoid circular dependency
      from beeswax.server.hive_server2_lib import HiveServerClientCompatible

      if query_server['server_name'] == 'impala':
        from impala.dbms import ImpalaDbms
        from impala.server import ImpalaServerClient
        DBMS_CACHE[user.username][query_server['server_name']] = ImpalaDbms(HiveServerClientCompatible(ImpalaServerClient(query_server, user)), QueryHistory.SERVER_TYPE[1][0])
      else:
        from beeswax.server.hive_server2_lib import HiveServerClient
        DBMS_CACHE[user.username][query_server['server_name']] = HiveServer2Dbms(HiveServerClientCompatible(HiveServerClient(query_server, user)), QueryHistory.SERVER_TYPE[1][0])

    return DBMS_CACHE[user.username][query_server['server_name']]
  finally:
    DBMS_CACHE_LOCK.release()


def get_by_type(user, type, query_server=None):
  global DBMS_META_CACHE
  global DBMS_META_CACHE_LOCK
  global DBMS_DATA_CACHE
  global DBMS_DATA_CACHE_LOCK

  if query_server is None:
    query_server = get_query_server_config()

  if type == 'meta':
    DBMS_META_CACHE_LOCK.acquire()
    try:
      DBMS_META_CACHE.setdefault(user.username, {})

      if query_server['server_name'] not in DBMS_META_CACHE[user.username]:
        # Avoid circular dependency
        from beeswax.server.hive_server2_lib import HiveServerClientCompatible

        if query_server['server_name'] == 'impala':
          from impala.dbms import ImpalaDbms
          from impala.server import ImpalaServerClient
          DBMS_META_CACHE[user.username][query_server['server_name']] = ImpalaDbms(HiveServerClientCompatible(ImpalaServerClient(query_server, user)), QueryHistory.SERVER_TYPE[1][0])
        else:
          from beeswax.server.hive_server2_lib import HiveServerClient
          DBMS_META_CACHE[user.username][query_server['server_name']] = HiveServer2Dbms(HiveServerClientCompatible(HiveServerClient(query_server, user)), QueryHistory.SERVER_TYPE[1][0])

      return DBMS_META_CACHE[user.username][query_server['server_name']]
    finally:
      DBMS_META_CACHE_LOCK.release()
  else:
    DBMS_DATA_CACHE_LOCK.acquire()
    try:
      DBMS_DATA_CACHE.setdefault(user.username, {})

      if query_server['server_name'] not in DBMS_DATA_CACHE[user.username]:
        # Avoid circular dependency
        from beeswax.server.hive_server2_lib import HiveServerClientCompatible

        if query_server['server_name'] == 'impala':
          from impala.dbms import ImpalaDbms
          from impala.server import ImpalaServerClient
          DBMS_DATA_CACHE[user.username][query_server['server_name']] = ImpalaDbms(HiveServerClientCompatible(ImpalaServerClient(query_server, user)), QueryHistory.SERVER_TYPE[1][0])
        else:
          from beeswax.server.hive_server2_lib import HiveServerClient
          DBMS_DATA_CACHE[user.username][query_server['server_name']] = HiveServer2Dbms(HiveServerClientCompatible(HiveServerClient(query_server, user)), QueryHistory.SERVER_TYPE[1][0])

      return DBMS_DATA_CACHE[user.username][query_server['server_name']]
    finally:
      DBMS_DATA_CACHE_LOCK.release()

3. /usr/lib/hue/apps/filebrowser/src/filebrowser/templates/listdir.mako
Remove the following delete/drop controls from the file browser template:
<!-- ko if: $root.isS3 -->
<button class="btn fileToolbarBtn delete-link" title="${_('Delete forever')}"
        data-bind="enable: selectedFiles().length > 0, click: deleteSelected"><i class="fa fa-bolt"></i> ${_('Delete forever')}</button>
<!-- /ko -->

<ul class="dropdown-menu">
  <li><a href="#" class="delete-link" title="${_('Delete forever')}"
         data-bind="enable: selectedFiles().length > 0, click: deleteSelected"><i class="fa fa-bolt"></i> ${_('Delete forever')}</a></li>
</ul>

4. /usr/lib/hue/apps/filebrowser/src/filebrowser/templates/listdir_components.mako
<li><a href="#" class="delete-link" title="${_('Delete forever')}" data-bind="enable: $root.selectedFiles().length > 0, click: $root.deleteSelected"><i class="fa fa-fw fa-bolt"></i> ${_('Delete forever')}</a></li>

5. /usr/lib/hue/apps/filebrowser/src/filebrowser/views.py
request.fs.do_as_user(request.user, request.fs.rmtree, arg['path'], True) # the old code was ('skip_trash' in request.GET)

Modifying the HDFS configuration
In hdfs-site.xml:
Change: fs.permissions.umask-mode = 000
(Tables created from Hue are owned by the web login user; when Spark is the execution engine it writes files into the table directory on HDFS, and the backend Spark user, hive, would otherwise not be able to write there.)

Add: make the hue user a Hadoop proxy user (superuser):
hadoop.proxyuser.hue.hosts *
hadoop.proxyuser.hue.groups *




 
