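A while ago, on one of our projects, management asked for a real-time view of IP access details broken down by province. To meet that requirement, nginx logs were collected in real time with Flume/Logstash and produced to Kafka, then consumed and analyzed in real time by Spark, with the results saved to Redis/MySQL. Finally, the front end displayed them in real time using Baidu's ECharts charts.
First, you need a table of IP-to-location rules. It can be a local file, or it can be distributed across multiple machines (for example on HDFS).
Part of the IP rule table looks like this: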
1.0.1.0|1.0.3.255|16777472|16778239|亞洲|中國|福建|福州||電信|350100|China|CN|119.306239|26.075302
1.0.8.0|1.0.15.255|16779264|16781311|亞洲|中國|廣東|廣州||電信|440100|China|CN|113.280637|23.125178
1.0.32.0|1.0.63.255|16785408|16793599|亞洲|中國|廣東|廣州||電信|440100|China|CN|113.280637|23.125178
1.1.0.0|1.1.0.255|16842752|16843007|亞洲|中國|福建|福州||電信|350100|China|CN|119.306239|26.075302
1.1.2.0|1.1.7.255|16843264|16844799|亞洲|中國|福建|福州||電信|350100|China|CN|119.306239|26.075302
1.1.8.0|1.1.63.255|16844800|16859135|亞洲|中國|廣東|廣州||電信|440100|China|CN|113.280637|23.125178
1.2.0.0|1.2.1.255|16908288|16908799|亞洲|中國|福建|福州||電信|350100|China|CN|119.306239|26.075302
1.2.2.0|1.2.2.255|16908800|16909055|亞洲|中國|北京|北京|海淀|北龍中網|110108|China|CN|116.29812|39.95931
1.2.4.0|1.2.4.255|16909312|16909567|亞洲|中國|北京|北京||中國互聯網信息中心|110100|China|CN|116.405285|39.904989
1.2.5.0|1.2.7.255|16909568|16910335|亞洲|中國|福建|福州||電信|350100|China|CN|119.306239|26.075302
1.2.8.0|1.2.8.255|16910336|16910591|亞洲|中國|北京|北京||中國互聯網信息中心|110100|China|CN|116.405285|39.904989
1.2.9.0|1.2.127.255|16910592|16941055|亞洲|中國|廣東|廣州||電信|440100|China|CN|113.280637|23.125178
1.3.0.0|1.3.255.255|16973824|17039359|亞洲|中國|廣東|廣州||電信|440100|China|CN|113.280637|23.125178
1.4.1.0|1.4.3.255|17039616|17040383|亞洲|中國|福建|福州||電信|350100|China|CN|119.306239|26.075302
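Each rule line is pipe-delimited. Going by the field indexes used in the code in this post, the third and fourth fields hold the start and end of the range as integers, and the seventh field is the province. The following is a minimal sketch of how one line can be parsed and how the numeric columns relate to the dotted addresses; `IpRule`, `parseRule`, and `RuleFormatDemo` are hypothetical names introduced only for illustration.

```scala
// Hypothetical helper names (IpRule, parseRule, RuleFormatDemo), for illustration only.
final case class IpRule(start: Long, end: Long, province: String)

object RuleFormatDemo {
  // Convert a dotted-quad IPv4 string to its numeric value, one octet at a time.
  def ip2Long(ip: String): Long =
    ip.split("[.]").foldLeft(0L)((acc, part) => (acc << 8) | part.toLong)

  // Split one rule line on '|' and keep the numeric range and the province.
  def parseRule(line: String): IpRule = {
    val fields = line.split("\\|")
    IpRule(fields(2).toLong, fields(3).toLong, fields(6))
  }

  def main(args: Array[String]): Unit = {
    val line = "1.0.1.0|1.0.3.255|16777472|16778239|亞洲|中國|福建|福州||電信|350100|China|CN|119.306239|26.075302"
    val rule = parseRule(line)
    // The numeric columns are the dotted addresses converted to integers.
    println(ip2Long("1.0.1.0") == rule.start)
    println(ip2Long("1.0.3.255") == rule.end)
    println(rule.province)
  }
}
```

For this sample line both comparisons hold, since 1.0.1.0 is 1 × 2^24 + 1 × 2^8 = 16777472. Because the table already carries the numeric bounds, only the incoming log IPs need converting at lookup time.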
Local mode
import java.sql.{Connection, Date, DriverManager, PreparedStatement}
import org.apache.spark.{SparkConf, SparkContext}

/**
 * Computes the location (province) each IP address belongs to.
 * Created by tianjun on 2017/2/13.
 */
object IpLocation {

  // Convert a dotted-quad IPv4 address to its numeric value.
  def ip2Long(ip: String): Long = {
    val fragments = ip.split("[.]")
    var ipNum = 0L
    for (i <- 0 until fragments.length) {
      ipNum = fragments(i).toLong | (ipNum << 8)
    }
    ipNum
  }

  // Binary search over rules sorted by range start.
  // Returns the index of the range containing ip, or -1 if no range matches.
  def binarySearch(lines: Array[(Long, Long, String)], ip: Long): Int = {
    var low = 0
    var high = lines.length - 1
    while (low <= high) {
      val middle = low + (high - low) / 2
      if (ip >= lines(middle)._1 && ip <= lines(middle)._2) {
        return middle
      }
      if (ip < lines(middle)._1) {
        high = middle - 1
      } else {
        low = middle + 1
      }
    }
    -1
  }

  // Write one partition of (province, count) pairs to MySQL over a single connection.
  val data2MySql = (iterator: Iterator[(String, Int)]) => {
    var conn: Connection = null
    var ps: PreparedStatement = null
    val sql = "INSERT INTO location_info(location,counts,access_date) values(?,?,?)"
    try {
      conn = DriverManager.getConnection("jdbc:mysql://localhost:3306/bigdata?useUnicode=true&characterEncoding=utf-8", "root", "123")
      // Prepare the statement once and reuse it for every row in the partition.
      ps = conn.prepareStatement(sql)
      iterator.foreach(line => {
        ps.setString(1, line._1)
        ps.setInt(2, line._2)
        ps.setDate(3, new Date(System.currentTimeMillis()))
        ps.executeUpdate()
      })
    } catch {
      case e: Exception => e.printStackTrace()
    } finally {
      if (ps != null)
        ps.close()
      if (conn != null)
        conn.close()
    }
  }

  def main(args: Array[String]): Unit = {
    // Only added to work around an error on Windows; not needed on Linux.
    System.setProperty("hadoop.home.dir", "C:\\tianjun\\winutil\\")
    val conf = new SparkConf().setMaster("local").setAppName("IpLocation")
    val sc = new SparkContext(conf)

    // Load the IP location rules (these could also be read from multiple machines).
    val ipRulesRdd = sc.textFile("c://ip.txt").map(line => {
      val fields = line.split("\\|")
      val startNum = fields(2).toLong
      val endNum = fields(3).toLong
      val province = fields(6)
      (startNum, endNum, province)
    })
    // Collect the complete set of IP mapping rules on the driver.
    val ipRulesArray = ipRulesRdd.collect()
    // Broadcast the rules to the executors.
    val ipRulesBroadcast = sc.broadcast(ipRulesArray)

    // Load the data to process; the IP is the second '|'-separated field.
    val ipsRDD = sc.textFile("c://log").map(line => {
      val fields = line.split("\\|")
      fields(1)
    })

    val result = ipsRDD.map(ip => {
      val ipNum = ip2Long(ip)
      binarySearch(ipRulesBroadcast.value, ipNum)
    })
      // Skip IPs that no rule covers (index -1) instead of failing on the array lookup.
      .filter(_ >= 0)
      // Each rule is (range start, range end, province); count hits per province.
      .map(index => (ipRulesBroadcast.value(index)._3, 1))
      .reduceByKey(_ + _)

    result.foreachPartition(data2MySql)
    // println(result.collect().toBuffer)
    sc.stop()
  }
}
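The rule table has gaps: in the sample above nothing covers 1.0.4.0 through 1.0.7.255, so a lookup can miss, and `binarySearch` returns -1 in that case. One way to make the miss explicit is to return an `Option` rather than an index. The sketch below assumes the rules are already parsed into `Long` bounds and sorted by range start; `LookupDemo` and `lookupProvince` are hypothetical names and not part of the program above.

```scala
// Hypothetical names (LookupDemo, lookupProvince), for illustration only.
object LookupDemo {
  // Rules as (start, end, province), sorted by start and non-overlapping.
  val rules: Array[(Long, Long, String)] = Array(
    (16777472L, 16778239L, "福建"), // 1.0.1.0  - 1.0.3.255
    (16779264L, 16781311L, "廣東"), // 1.0.8.0  - 1.0.15.255
    (16785408L, 16793599L, "廣東")  // 1.0.32.0 - 1.0.63.255
  )

  // Convert a dotted-quad IPv4 string to its numeric value.
  def ip2Long(ip: String): Long =
    ip.split("[.]").foldLeft(0L)((acc, part) => (acc << 8) | part.toLong)

  // Binary search that yields Some(province) on a hit and None on a miss.
  def lookupProvince(rules: Array[(Long, Long, String)], ip: Long): Option[String] = {
    var low = 0
    var high = rules.length - 1
    while (low <= high) {
      val middle = low + (high - low) / 2
      val (rangeStart, rangeEnd, province) = rules(middle)
      if (ip < rangeStart) high = middle - 1
      else if (ip > rangeEnd) low = middle + 1
      else return Some(province)
    }
    None
  }

  def main(args: Array[String]): Unit = {
    println(lookupProvince(rules, ip2Long("1.0.2.7"))) // inside the first range
    println(lookupProvince(rules, ip2Long("1.0.4.1"))) // falls in the gap between ranges
  }
}
```

With an `Option` result, unmatched addresses can be dropped with `flatMap` or counted under a separate key, depending on whether they should appear in the report.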
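As you can see, analyzing data with Spark's operators takes very little code.
The official Spark documentation also shows that connecting Spark to Kafka, databases, and other systems is straightforward.
Now let's look at the results this example wrote to the database: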
+----+----------+--------+---------------------+
| id | location | counts | access_date         |
+----+----------+--------+---------------------+
|  7 | 陝西     |   1824 | 2017-02-13 00:00:00 |
|  8 | 河北     |    383 | 2017-02-13 00:00:00 |
|  9 | 雲南     |    126 | 2017-02-13 00:00:00 |
| 10 | 重慶     |    868 | 2017-02-13 00:00:00 |
| 11 | 北京     |   1535 | 2017-02-13 00:00:00 |
+----+----------+--------+---------------------+
For this test, only about 4,700 lines were taken from the nginx log, a file of roughly 1.9 MB.