Web/Application Server(Nginx)========>HDFS
collect
Log types: access logs: visits, requests, client and agent information; unrelated to the business logic;
ugc logs: business-related logs;

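Collection methods:
1. If the machine has a GATEWAY (Hadoop client) installed, upload the logs with hdfs dfs -put, wrapped in a crontab job that sends them on a schedule
Drawbacks: the data arrives with a delay;
compressing the data is cumbersome;
the machines that produce the logs are client machines and are unlikely to have a GATEWAY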

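2. Flume: a framework for large-scale data collection
collecting, aggregating, and moving large amounts of log data
  collect     aggregate           move
  origin      temporary storage   destination
  Source      Channel             Sink
	from many different sources to a centralized data store
	Flume can do a lot during collection: compression, fault tolerance, monitoring
	Flume can take (logs)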
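3. Stream processing:
	Flume==>Kafka==>Streaming system
		Kafka acts as a buffer
		A message queue is needed between Flume and the Streaming system
		Frameworks with a role similar to Flume include Logstash and Filebeat from the Elastic Stack;
						both are more lightweight than Flume

	Flume==>Streaming directly is possible. It works when the data volume is small, but at high volume
					   the Streaming side cannot process the data fast enough when fed straight from Flume:
					   data backs up and can even crash the Streaming job.
					   Too much is produced and it is not consumed in time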
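
The Flume==>Kafka leg can be sketched with Flume's Kafka Sink, using the same property syntax as the examples further down. This is a minimal sketch, assuming Flume 1.7 or later; the broker address hadoop001:9092 and the topic name access-log are placeholders, not values from these notes.

```properties
# Sketch only: broker address and topic name are placeholders
a1.sinks = kafka-sink
a1.sinks.kafka-sink.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.kafka-sink.kafka.bootstrap.servers = hadoop001:9092
a1.sinks.kafka-sink.kafka.topic = access-log
# Number of events handed to Kafka in one batch
a1.sinks.kafka-sink.flumeBatchSize = 100
a1.sinks.kafka-sink.channel = memory-channel
```

The Streaming system then consumes from the topic at its own pace, which is what lets Kafka absorb bursts.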

Learning Flume mostly means looking things up in the user guide, like consulting a dictionary

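Agent: the Flume process that hosts the sources, channels, and sinks; in practice it is defined by a Flume configuration file
Source:
Avro Source: listens on a defined port to receive data from clients
Avro is a serialization framework
Thrift Source: Thrift is also a serialization framework

Exec Source: runs a command line

Spooling Directory: monitors a directory

Taildir Source: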
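
As an illustration of "defining a port to receive client data", an Avro Source might be declared as below. This is a hedged sketch: the agent name a1, the component names, and port 4141 are placeholders chosen for the example.

```properties
# Sketch only: names and port are placeholders
a1.sources = avro-source
a1.sources.avro-source.type = avro
# Listen on all interfaces for Avro clients (or the Avro Sink of another agent)
a1.sources.avro-source.bind = 0.0.0.0
a1.sources.avro-source.port = 4141
a1.sources.avro-source.channels = memory-channel
```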

Channels: where events are buffered
Memory Channel:
JDBC Channel:
File Channel:
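
The trade-off between the two most common channels shows up directly in their configuration. The following is a minimal sketch; the capacities and the local directories under /home/hadoop/flume are assumed values for illustration only.

```properties
a1.channels = memory-channel file-channel

# Memory channel: fast, but queued events are lost if the agent process dies
a1.channels.memory-channel.type = memory
a1.channels.memory-channel.capacity = 10000
a1.channels.memory-channel.transactionCapacity = 1000

# File channel: events are persisted on local disk and survive a restart
# (directories below are placeholders)
a1.channels.file-channel.type = file
a1.channels.file-channel.checkpointDir = /home/hadoop/flume/checkpoint
a1.channels.file-channel.dataDirs = /home/hadoop/flume/data
```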

Sinks:
HDFS Sink
Hive Sink
Logger Sink: writes to the console
HBase Sink
Kafka Sink

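Example 1. Collect data from a given network port and sink it to the console

1. Choose the Agent components
	NetCat Source: listens on a given port and emits each line of text as an event
	Memory Channel: because the data volume is small
	Logger Sink: writes to the console
   ***** The key to using Flume is writing the Flume configuration file
2. Name the Agent; name the Source/Channel/Sink (the names are arbitrary labels)
	a1.sources = exec-source
	a1.sinks = hdfs-sink
	a1.channels = memory-channel
3. Configure the Source
	# bind: the hostname or IP to bind to
	a1.sources.exec-source.type=netcat
	a1.sources.exec-source.bind=hadoop001
	a1.sources.exec-source.port=44444
4. Configure the Channel
	a1.channels.memory-channel.type=memory
5. Configure the Sink
	a1.sinks.hdfs-sink.type=logger
6. Attach the Source and the Sink to the Channel
	a1.sources.exec-source.channels=memory-channel
	a1.sinks.hdfs-sink.channel=memory-channel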

The complete configuration file (flume-simple.conf):
a1.sources = exec-source
a1.sinks = hdfs-sink
a1.channels = memory-channel

a1.sources.exec-source.type=netcat
a1.sources.exec-source.bind=hadoop001
a1.sources.exec-source.port=44444

a1.channels.memory-channel.type=memory

a1.sinks.hdfs-sink.type=logger
a1.sources.exec-source.channels=memory-channel
a1.sinks.hdfs-sink.channel=memory-channel

Usage:
flume-ng agent -n a1 \
-c $FLUME_HOME/conf \
-f $FLUME_HOME/conf/flume-simple.conf \
-Dflume.root.logger=INFO,console

Event: the basic unit of data transfer in Flume
headers + body
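
To see the headers half of an event, interceptors can be attached to the source of Example 1 so that each event carries extra headers. This is a sketch under assumptions: the interceptor names i1 and i2 and the logType/access pair are made up for illustration.

```properties
# Sketch only: i1, i2 and the logType header are illustrative
a1.sources.exec-source.interceptors = i1 i2
# Adds a "timestamp" header holding the time the event was processed
a1.sources.exec-source.interceptors.i1.type = timestamp
# Adds a fixed header to every event
a1.sources.exec-source.interceptors.i2.type = static
a1.sources.exec-source.interceptors.i2.key = logType
a1.sources.exec-source.interceptors.i2.value = access
```

With the Logger Sink, each event is then printed with both its headers and its body, which makes the structure easy to inspect.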

Example 2. WebServer --> Flume --> HDFS
log: tail -F
Use ExecSource to run a Linux command
MemoryChannel
HDFS Sink
exec-hdfs-agent.sources = exec-source
exec-hdfs-agent.sinks = hdfs-sink
exec-hdfs-agent.channels = memory-channel

exec-hdfs-agent.sources.exec-source.type=exec
exec-hdfs-agent.sources.exec-source.command=tail -F /home/hadoop/data/g6/data.log
exec-hdfs-agent.sources.exec-source.shell=/bin/sh -c

exec-hdfs-agent.channels.memory-channel.type=memory

exec-hdfs-agent.sinks.hdfs-sink.type=hdfs
exec-hdfs-agent.sinks.hdfs-sink.hdfs.path=hdfs://hadoop001:9000/g6flume/tail
exec-hdfs-agent.sinks.hdfs-sink.hdfs.fileType=DataStream
exec-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat=Text

exec-hdfs-agent.sources.exec-source.channels=memory-channel
exec-hdfs-agent.sinks.hdfs-sink.channel=memory-channel

flume-ng agent -n exec-hdfs-agent  \
-c $FLUME_HOME/conf \
-f $FLUME_HOME/conf/flume-exec-hdfs.conf \
-Dflume.root.logger=INFO,console

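Writing to HDFS directly from an ExecSource running tail -F tends to produce many small files with the HDFS Sink's default roll settings (rollInterval=30, rollSize=1024, rollCount=10), so it is not suitable for production
This approach also monitors only a single file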
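Example 3. Monitoring a directory: use the Spooling Directory Source (type spooldir, not ExecSource) to pick up the files in a given directory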

exec-hdfs-agent.sources = exec-source
exec-hdfs-agent.sinks = hdfs-sink
exec-hdfs-agent.channels = memory-channel

exec-hdfs-agent.sources.exec-source.type=spooldir
exec-hdfs-agent.sources.exec-source.spoolDir=/home/hadoop/data/g6/spooldir

exec-hdfs-agent.channels.memory-channel.type=memory

exec-hdfs-agent.sinks.hdfs-sink.type=hdfs

# Enable the local timestamp and partition the written files by minute (minutes with no data stay empty)

exec-hdfs-agent.sinks.hdfs-sink.hdfs.path=hdfs://hadoop001:9000/g6flume/logs/%Y%m%d%H%M
exec-hdfs-agent.sinks.hdfs-sink.hdfs.useLocalTimeStamp=true
exec-hdfs-agent.sinks.hdfs-sink.hdfs.filePrefix=access
exec-hdfs-agent.sinks.hdfs-sink.hdfs.fileType=CompressedStream
exec-hdfs-agent.sinks.hdfs-sink.hdfs.writeFormat=Text

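# Tune the HDFS roll settings to avoid producing small files:
# roll every 30 seconds or at about 1 MB, and never by event count

exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollInterval=30
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollSize=1000000
exec-hdfs-agent.sinks.hdfs-sink.hdfs.rollCount=0

# Use gzip compression

exec-hdfs-agent.sinks.hdfs-sink.hdfs.codeC=gzip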

exec-hdfs-agent.sources.exec-source.channels=memory-channel
exec-hdfs-agent.sinks.hdfs-sink.channel=memory-channel

flume-ng agent -n exec-hdfs-agent  \
-c $FLUME_HOME/conf \
-f $FLUME_HOME/conf/flume-exec-hdfs.conf \
-Dflume.root.logger=INFO,console

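Example 4. Taildir Source: available in Apache Flume from 1.7 onward, although the CDH build of 1.6 already includes it
ExecSource can only tail -F a single file
The Spooling Directory Source can monitor the files in a directory,
but it cannot pick up files in subdirectories of that directory, and it cannot recover if it crashes
So the Taildir Source is what production actually needs

The Taildir Source is reliable and does not lose data, even if the tailing process dies, because it periodically
records the offset of each file it is reading in a JSON position file. After a crash it uses that JSON file
to resume reading from the point where it stopped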

a1.sources = r1
a1.channels = c1
a1.sources.r1.type = TAILDIR
a1.sources.r1.channels = c1
a1.sources.r1.positionFile = /var/log/flume/taildir_position.json
# Space-separated list of file groups; each file group names a set of files to tail
a1.sources.r1.filegroups = f1 f2
a1.sources.r1.filegroups.f1 = /var/log/test1/example.log
a1.sources.r1.headers.f1.headerKey1 = value1
a1.sources.r1.filegroups.f2 = /var/log/test2/.*log.*
a1.sources.r1.headers.f2.headerKey1 = value2
a1.sources.r1.headers.f2.headerKey2 = value2-2
a1.sources.r1.fileHeader = true
a1.sources.r1.maxBatchCount = 1000
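
The Taildir snippet above declares the channel c1 but gives it no type and defines no sink, so it will not start as written. One way to complete it for a quick test is sketched below; the sink name k1 and the choice of a memory channel with a logger sink are assumptions, not part of the original notes.

```properties
# Sketch only: completes agent a1 for a console test
a1.channels.c1.type = memory

a1.sinks = k1
a1.sinks.k1.type = logger
a1.sinks.k1.channel = c1
```

For production, the logger sink would typically be swapped for the HDFS Sink settings from Example 3.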
