The previous chapter covered the web functionality design; this chapter moves on to the real-time log analysis design.
"Real-time analysis of nginx request logs with spark and automatic IP banning, part 1: web functionality design"
Overall approach: read the request-log stream from the kafka cluster, analyze it in real time with spark structured streaming, and write any records that trip a rule's thresholds to mysql.
1. Data source
nginx logs are shipped over the syslog protocol to logstash, which writes them to both es and kafka:
nginx -> logstash -> elasticsearch and kafka
logstash configuration:
output {
  elasticsearch {
    ……
  }
  ## Filter out useless logs as close to the source as possible to lighten the load on spark
  if [log_source] in ["ip1","ip2"] and ## only analyze logs from the public-facing nginx servers
     [real_ip] !~ /10\.\d{1,3}\.\d{1,3}\.\d{1,3}/ and ## exclude internal IPs
     [domain] != "" and ## exclude domains that need no analysis
     [status] != "403" { ## banned IPs still generate request logs, just with status 403, so exclude those
    kafka {
      bootstrap_servers => 'ip1:9092,ip2:9092,ip3:9092' ## kafka cluster
      topic_id => 'topic_nginxlog'
      codec => plain {
        format => "%{time_local},%{real_ip},%{domain},%{url_short}"
      }
    }
  }
}
The fields we need from each log line: request time, client IP, domain, and URL (without the query string after "?").
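As a plain-Python illustration of that line format (not part of the spark job), a record can be parsed like this; parse_log_line and the sample values are hypothetical:

```python
# Minimal sketch: parse one line of the "time_local,real_ip,domain,url_short"
# format produced by the logstash kafka output above.
def parse_log_line(line):
    fields = line.split(",")
    if len(fields) != 4:
        return None  # malformed line: skip it rather than crash the job
    req_time, real_ip, domain, url = fields
    return {"req_time": req_time, "real_ip": real_ip, "domain": domain, "url": url}

record = parse_log_line("2020-05-14T13:44:27+08:00,1.2.3.4,www.example.com,/login")
```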
2. Spark structured streaming and sliding windows
Structured streaming is a scalable, fault-tolerant data-processing engine built on spark sql; it lets you work with streaming data the same way you would with static data.
Under the hood, a micro-batch engine splits the stream into a series of small batch jobs, achieving latencies as low as 100 ms with exactly-once semantics.
The whole pipeline revolves around one key feature: the sliding event-time window (a window keyed on the log's event time).
The sliding-window concept is illustrated in the figure below:
The "window size" corresponds to win_duration (time range) in the rule table, and the "slide interval" to slide_duration (monitoring frequency).
These two parameters determine how much spark compute a rule consumes: the larger the window, the more data each batch has to process; the smaller the slide interval, the more batches there are to process.
In this project the window size is set by the user, and the slide interval is derived from it to save resources (slide_duration = math.ceil(win_duration/5)).
The watermark delay lets data arrive late by a bounded amount; set to 1 minute, for example, a record with event time 00:00:05 that arrives at 00:00:18 in the figure above is still folded into the computation.
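To make the window semantics concrete, here is a small pure-Python simulation (not spark code) of sliding event-time windows aligned to time 0: an event at time t falls into every window [start, start + win) whose start is a non-negative multiple of the slide interval. windows_for and count_per_window are illustrative helpers, not part of the job:

```python
from collections import Counter

def windows_for(t, win, slide):
    """All window start times whose [start, start + win) interval covers t."""
    starts = []
    s = (t // slide) * slide          # latest window start at or before t
    while s > t - win:                # walk back while the window still covers t
        if s >= 0:
            starts.append(s)
        s -= slide
    return sorted(starts)

def count_per_window(events, win, slide):
    """Count events per (ip, window start), like groupBy(real_ip, window)."""
    counts = Counter()
    for ip, t in events:
        for s in windows_for(t, win, slide):
            counts[(ip, s)] += 1
    return counts

events = [("1.2.3.4", t) for t in (1, 2, 3, 6, 7)]
counts = count_per_window(events, win=10, slide=5)
# window [0, 10) sees all 5 events; window [5, 15) sees only t = 6 and 7
```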
3. Data-processing logic
3.1. Read the data from kafka and cast the binary payload to a string
selectExpr("CAST(value AS STRING)")
3.2. Split the data
Split on "," into 4 fields: req_time, real_ip, domain, url; req_time is cast to timestamp type and becomes the event time.
.select(regexp_replace(regexp_replace(split("value",",")[0],"\+08:00",""),"T"," ").alias("req_time"),
split("value",",")[1].alias("real_ip"),
split("value",",")[2].alias("domain"),
split("value",",")[3].alias("url"))\
.selectExpr("CAST(req_time AS timestamp)","real_ip", "domain", "url")
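The two nested regexp_replace calls only normalize a timestamp such as 2020-05-14T13:44:27+08:00 into a CAST-able form; the same transformation in plain Python, for illustration (normalize_req_time is a hypothetical helper):

```python
import re

def normalize_req_time(raw):
    # mirror the two regexp_replace calls above: drop the "+08:00" offset,
    # then turn the ISO "T" separator into a space so a timestamp cast works
    no_offset = re.sub(r"\+08:00", "", raw)
    return re.sub("T", " ", no_offset)

normalized = normalize_req_time("2020-05-14T13:44:27+08:00")
```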
3.3. Allow data up to 60 seconds late; anything later is dropped
withWatermark("req_time", "60 seconds")
3.4. Aggregate the data
group by (real_ip, window) having count(*) >= request threshold
.groupBy(
"real_ip",
window("req_time", str(i_rule.win_duration) + " seconds", str(i_rule.slide_duration) + " seconds")
).agg(F.count("*").alias("cnt")).filter("cnt >= " + str(i_rule.req_threshold))
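The groupBy/agg/filter chain is the HAVING clause from the SQL sketch above; in plain-Python terms (illustrative sample data, not spark):

```python
from collections import Counter

# hypothetical per-(real_ip, window) request counts that one batch might yield
counts = Counter({
    ("117.157.183.228", ("13:44:27", "13:44:57")): 11,
    ("5.6.7.8", ("13:44:27", "13:44:57")): 3,
})
req_threshold = 10  # the rule's request threshold

# keep only the (ip, window) pairs at or above the threshold,
# i.e. the HAVING count(*) >= req_threshold step
matches = {k: v for k, v in counts.items() if v >= req_threshold}
```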
3.5. Fill in rule_id, a field the mysql table needs on insert
windows.withColumn("id",F.lit(i_rule.id))
3.6. Set the outputMode
There are three output modes:
- append mode: each row is emitted exactly once, after it is finalized; with aggregations this means waiting for the watermark to close the window
- complete mode: the entire result table accumulated since the job started is emitted on every trigger
- update mode: only rows that changed since the last trigger are emitted
update mode is the right fit here
outputMode("update")
3.7. Remove duplicates within a batch
Within a single batch, the same IP frequently shows up in several overlapping windows, all above the threshold, so one IP can appear multiple times, e.g.:
Batch: 21
+---------------+------------------------------------------+---+---+
|real_ip |window |cnt|id |
+---------------+------------------------------------------+---+---+
|117.157.183.228|[2020-05-14 13:44:27, 2020-05-14 13:44:57]|11 |3 |
|117.157.183.228|[2020-05-14 13:44:28, 2020-05-14 13:44:58]|11 |3 |
|117.157.183.228|[2020-05-14 13:44:30, 2020-05-14 13:45:00]|10 |3 |
+---------------+------------------------------------------+---+---+
Each IP only needs to be written once, so the batch must be deduplicated.
Because we need to keep id, we can't simply select id without grouping by it the way mysql loosely permits; a join is the workaround:
join the grouped result back to the result of the previous action, which therefore needs to be cached with persist().
Code:
batchDf.persist()
batchDfGrp = batchDf.groupBy("real_ip").agg(F.min("window").alias("window"))
pd = batchDfGrp.join(
        batchDf,
        ["real_ip", "window"],
        'inner'
    ).select("id", batchDfGrp.real_ip, batchDfGrp.window, "cnt").toPandas()
insert_match_record(pd)
batchDf.unpersist()
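What the groupBy + join achieves is "keep only the earliest window per IP"; a plain-Python sketch of the same logic, with sample rows modeled on the batch output above:

```python
# sample batch rows: one IP matched in three overlapping windows
rows = [
    {"real_ip": "117.157.183.228", "window": ("13:44:27", "13:44:57"), "cnt": 11, "id": 3},
    {"real_ip": "117.157.183.228", "window": ("13:44:28", "13:44:58"), "cnt": 11, "id": 3},
    {"real_ip": "117.157.183.228", "window": ("13:44:30", "13:45:00"), "cnt": 10, "id": 3},
]

# step 1: per IP, find the earliest window (the groupBy + F.min("window"))
earliest = {}
for r in rows:
    ip = r["real_ip"]
    if ip not in earliest or r["window"] < earliest[ip]:
        earliest[ip] = r["window"]

# step 2: keep only the row matching (real_ip, earliest window) -- the inner join
deduped = [r for r in rows if r["window"] == earliest[r["real_ip"]]]
```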
3.8. Set the job trigger interval
The slide interval doubles as the job's trigger interval.
trigger(processingTime=str(i_rule.slide_duration) + ' seconds')
3.9. Write to mysql
The steps above have already converted the result into a pandas object; loop over its rows and insert them into the mysql table (remember to use a connection pool):
import pymysql
from DBUtils.PooledDB import PooledDB

pool = PooledDB(pymysql,
                maxconnections=10,
                mincached=5,
                maxcached=5,
                host='myip',
                port=3306,
                db='waf',
                user='user',
                passwd='password',
                setsession=['set autocommit = 1'])

def insert_match_record(record_pandas):
    connect = pool.connection()
    cursor = connect.cursor()
    sql = "insert into match_record(rule_id,ip_addr,win_begin,win_end,request_cnt) values(%s,%s,%s,%s,%s)"
    for row in record_pandas.itertuples():
        cursor.execute(sql, (row.id, row.real_ip,
                             row.window[0].strftime('%Y-%m-%d %H:%M:%S'),
                             row.window[1].strftime('%Y-%m-%d %H:%M:%S'),
                             row.cnt))
    connect.commit()
    cursor.close()
    connect.close()
4. Scheduling
4.1. Adding rules
Each rule's job is launched in its own thread, with the thread name derived from rule_id.
Every 3 seconds, check whether any rule is missing from the current thread list (i.e. a newly added rule); if so, create a job thread for that rule:
import threading
import time

if __name__ == '__main__':
    while True:
        time.sleep(3)
        curThreadName = []
        for t in threading.enumerate():
            curThreadName.append(t.getName())
        curRuleList = get_rules()
        for r in curRuleList:
            if "rule" + str(r.id) not in curThreadName:
                t = threading.Thread(target=sub_job, args=(r,), name="rule" + str(r.id))
                t.start()
4.2. Modifying rules
While a rule job runs, it checks every 3 seconds whether the parameters it is using still match the database; if they don't, it stops its own job (the new-rule loop above then recreates the job from the rules currently in the database):
while True:
    time.sleep(3)
    # assume this rule is no longer in the rule list
    ruleValidFlag = 1
    curRuleList = get_rules()
    for r in curRuleList:
        if i_rule.domain == r.domain \
                and i_rule.url == r.url \
                and i_rule.match_type == r.match_type \
                and i_rule.win_duration == r.win_duration \
                and i_rule.slide_duration == r.slide_duration \
                and i_rule.req_threshold == r.req_threshold:
            # this rule is still in the rule list, unchanged
            ruleValidFlag = 0
    if ruleValidFlag == 1:
        query.stop()
        break
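The comparison chain above can be expressed compactly as tuple equality; a sketch using hypothetical namedtuple rules (Rule, rule_key and the sample values are made up for illustration):

```python
from collections import namedtuple

Rule = namedtuple("Rule", "id domain url match_type win_duration slide_duration req_threshold")

def rule_key(r):
    # the fields whose change should force the job to restart
    return (r.domain, r.url, r.match_type, r.win_duration, r.slide_duration, r.req_threshold)

running = Rule(3, "www.example.com", "/login", "prefix", 30, 6, 10)
db_rules = [Rule(3, "www.example.com", "/login", "prefix", 60, 12, 10)]  # window changed in DB

# the job should stop if no rule in the DB matches the running parameters
should_stop = all(rule_key(r) != rule_key(running) for r in db_rules)
```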
5. Submitting the job to spark on yarn
Versions used: python3.6, hadoop-2.6.5, spark_2.11-2.4.4
Installing and configuring hadoop and yarn is out of scope; a few notes worth adding:
5.1. Log level
The default is INFO, which eats a lot of disk space, so change it to WARN.
Edit these two files:
# vi etc/hadoop/log4j.properties
hadoop.root.logger=WARN,console
# vi hadoop-daemon.sh
export HADOOP_ROOT_LOGGER=${HADOOP_ROOT_LOGGER:-"WARN,RFA"}
export HADOOP_SECURITY_LOGGER=${HADOOP_SECURITY_LOGGER:-"WARN,RFAS"}
export HDFS_AUDIT_LOGGER=${HDFS_AUDIT_LOGGER:-"WARN,NullAppender"}
5.2. yarn configuration notes
Use the capacity scheduler and create a dedicated queue so this job's resources are isolated from other jobs.
yarn.resourcemanager.scheduler.class: org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler
yarn.scheduler.capacity.root.queues: default,waf
yarn.scheduler.capacity.root.waf.capacity: 90
yarn.scheduler.capacity.root.default.capacity: 10
yarn.scheduler.capacity.root.waf.maximum-capacity: 100
yarn.scheduler.capacity.root.default.maximum-capacity: 100
Minimum and maximum memory allocated per container:
yarn.scheduler.minimum-allocation-mb: 100
yarn.scheduler.maximum-allocation-mb: 5120
With driver-memory=4G, the driver's container actually uses 4G + max(4G*0.1, 384M), rounded up to a multiple of minimum-allocation-mb = 4600M, which is why maximum-allocation-mb is set to 5120M.
With spark.executor.memory=1G, each executor container actually uses 1G + max(1G*0.1, 384M), rounded up the same way = 1500M.
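The arithmetic above can be sketched in a few lines, assuming spark's default overhead rule (max(10% of requested memory, 384M)) and yarn rounding each request up to a multiple of minimum-allocation-mb; yarn_container_mb is an illustrative helper:

```python
import math

def yarn_container_mb(mem_mb, min_alloc_mb=100, overhead_floor_mb=384):
    # spark adds max(10% of requested memory, 384M) as off-heap overhead;
    # yarn then rounds the total request up to a multiple of minimum-allocation-mb
    overhead = max(mem_mb * 0.1, overhead_floor_mb)
    return math.ceil((mem_mb + overhead) / min_alloc_mb) * min_alloc_mb

driver_mb = yarn_container_mb(4096)    # driver-memory=4G  -> 4600M
executor_mb = yarn_container_mb(1024)  # executor.memory=1G -> 1500M
```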
5.3. Environment prep: spark jars and configuration
Deploy the spark-2.4.4 distribution on the client (the server that submits jobs; I use the master), under /root/spark244/.
Upload the jars to hdfs:
# hadoop fs -mkdir -p hdfs://hdfs_master_ip:9000/system/spark/jars/
# hadoop fs -put $SPARK_HOME/jars/* hdfs://hdfs_master_ip:9000/system/spark/jars/
# cat /root/spark244/conf/spark-defaults.conf
spark.yarn.jars hdfs://hdfs_master_ip:9000/system/spark/jars/*.jar
# cat /root/spark244/spark-env.sh
export SPARK_DIST_CLASSPATH=$(/root/hadoop/hadoop-2.6.5/bin/hadoop classpath)
export HADOOP_CONF_DIR=/root/hadoop/hadoop-2.6.5/etc/hadoop
export YARN_CONF_DIR=/root/hadoop/hadoop-2.6.5/etc/hadoop
export SPARK_CONF_DIR=/root/spark244/conf
5.4. Environment prep: python libraries
To avoid installing python libraries on every node, package a python virtual environment and upload it to hdfs.
First create the virtualenv and install the required libraries into it: pip install DBUtils PyMySQL backports.lzma pandas
Then work around an lzma compatibility issue (otherwise the application errors out after submission):
Wrap the imports in a try/except block:
# vi lib/python3.6/site-packages/backports/lzma/__init__.py
try:
    from ._lzma import *
    from ._lzma import _encode_filter_properties, _decode_filter_properties
except ImportError:
    from backports.lzma import *
    from backports.lzma import _encode_filter_properties, _decode_filter_properties
Add the backports prefix:
# vi lib/python3.6/site-packages/pandas/compat/__init__.py
try:
    import backports.lzma
    return backports.lzma
Zip it up and upload:
# zip -r pyspark-env.zip pyspark-env/
# hadoop fs -mkdir -p /env/pyspark
# hadoop fs -put pyspark-env.zip /env/pyspark
5.5. Runtime parameters
# cat waf.py
ssConf = SparkConf()
ssConf.setMaster("yarn")
ssConf.setAppName("nginx-cc-waf")
ssConf.set("spark.executor.cores", "1")
ssConf.set("spark.executor.memory", "1024M")
ssConf.set("spark.dynamicAllocation.enabled", False)
## with a moderate data volume, don't set the executor count too high, or performance suffers badly
ssConf.set("spark.executor.instances", "4")
ssConf.set("spark.scheduler.mode", "FIFO")
ssConf.set("spark.default.parallelism", "4")
ssConf.set("spark.sql.shuffle.partitions", "4")
ssConf.set("spark.debug.maxToStringFields", "1000")
ssConf.set("spark.sql.codegen.wholeStage", False)
ssConf.set("spark.jars.packages","org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.4")
spark = SparkSession.builder.config(conf=ssConf).getOrCreate()
Submit:
# cat job_waf_start.sh
#!/bin/bash
## in cluster mode the queue set in code does not take effect; it must be specified at spark-submit time (--queue)
spark-submit \
--name waf \
--master yarn \
--deploy-mode cluster \
--queue waf \
--driver-memory 4G \
--py-files job_waf/db_mysql.py \
--conf spark.yarn.dist.archives=hdfs://hdfs_master_ip:9000/env/pyspark/pyspark-env.zip#pyenv \
--conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=pyenv/pyspark-env/bin/python \
job_waf/waf.py