1. Data collection and cleaning
· 1000 cryptolocker domains
· 1000 post-tovar-goz domains
· Top 1000 Alexa domains
Extract the domain names from a DGA file:
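def load_dga(filename):
    # MIN_LEN (minimum domain length to keep) is assumed to be defined elsewhere
    domain_list = []
    with open(filename) as f:
        for line in f:
            # The domain is the first comma-separated field
            domain = line.split(",")[0]
            # Keep only domains of at least MIN_LEN characters
            if len(domain) >= MIN_LEN:
                domain_list.append(domain)
    return domain_list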
The Alexa file is a CSV that stores each domain's rank and name; the domains are extracted as follows:
import csv

def load_alexa(filename):
    domain_list = []
    with open(filename) as f:
        # Each row is [rank, domain]
        for row in csv.reader(f):
            domain = row[1]
            # Keep only domains of at least MIN_LEN characters
            if len(domain) >= MIN_LEN:
                domain_list.append(domain)
    return domain_list
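As a quick check of the length filter, here is a minimal sketch that parses a few in-memory rows. The sample rows, the helper name `load_alexa_rows` and the `MIN_LEN` value of 10 are all assumptions made for illustration; the original threshold is not shown here.

```python
import csv
import io

MIN_LEN = 10  # hypothetical threshold; the original value is not shown

# Hypothetical sample in the "rank,domain" layout described above
SAMPLE = "1,google.com\n2,youtube.com\n3,qq.com\n"

def load_alexa_rows(handle, min_len=MIN_LEN):
    """Return the domain column of every row whose domain is long enough."""
    domains = []
    for row in csv.reader(handle):
        if len(row) < 2:
            continue  # skip malformed rows
        domain = row[1].strip()
        if len(domain) >= min_len:
            domains.append(domain)
    return domains

kept = load_alexa_rows(io.StringIO(SAMPLE))
print(kept)  # "qq.com" is shorter than MIN_LEN and is dropped
```

Passing a file-like object instead of a filename keeps the sketch testable without touching the disk.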
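2. Feature extraction
1) Proportion of vowels
When people pick a domain name they tend to favor letter combinations that are easy to pronounce, which makes the proportion of English vowels relatively high. A DGA generates names randomly, so this vowel pattern is much weaker. We can use this difference to test the idea.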
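To make the intuition concrete, here is a minimal sketch that computes the vowel ratio for two strings. Both sample strings and the helper name `vowel_ratio` are hypothetical, chosen only to illustrate the contrast between a pronounceable name and a random-looking one.

```python
import re

def vowel_ratio(name):
    """Fraction of the characters in `name` that are a, e, i, o or u."""
    if not name:
        return 0.0
    return len(re.findall(r"[aeiou]", name.lower())) / len(name)

readable = "facebook"       # hypothetical human-chosen label
random_like = "xkqzvbnmtr"  # hypothetical DGA-style label

print(vowel_ratio(readable))     # 4 vowels out of 8 characters
print(vowel_ratio(random_like))  # no vowels at all
```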
Load the Alexa domain data:
x1_domain_list = load_alexa("...")
Compute the vowel ratio:
import re

def get_aeiou(domain_list):
    x = []  # domain lengths
    y = []  # vowel ratios
    for domain in domain_list:
        x.append(len(domain))
        count = len(re.findall(r'[aeiou]', domain.lower()))
        count = (0.0 + count) / len(domain)
        y.append(count)
    return x, y
Load the two botnet DGA domain sets and the Alexa domains, and compute the vowel ratio for each:
x1_domain_list = load_alexa("...")
x_1,y_1 = get_aeiou(x1_domain_list)
x2_domain_list = load_dga("...")
x_2,y_2 = get_aeiou(x2_domain_list)
x3_domain_list = load_dga("...")
x_3,y_3 = get_aeiou(x3_domain_list)
Plot the result with domain length on the horizontal axis and vowel ratio on the vertical axis:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.set_xlabel('Domain Length')
ax.set_ylabel('AEIOU Score')
ax.scatter(x_3, y_3, color='b', label="dga_post-tovar-goz", marker='o')
ax.scatter(x_2, y_2, color='g', label="dga_cryptolock", marker='v')
ax.scatter(x_1, y_1, color='r', label="alexa", marker='*')
ax.legend(loc='best')
plt.show()
2) Ratio of unique alphanumeric characters to domain length
def get_uniq_char_num(domain_list):
    x = []  # domain lengths
    y = []  # unique-character ratios
    for domain in domain_list:
        x.append(len(domain))
        count = len(set(domain))
        count = (0.0 + count) / len(domain)
        y.append(count)
    return x, y
x1_domain_list = load_alexa("...")
x_1,y_1 = get_uniq_char_num(x1_domain_list)
x2_domain_list = load_dga("...")
x_2,y_2 = get_uniq_char_num(x2_domain_list)
x3_domain_list = load_dga("...")
x_3,y_3 = get_uniq_char_num(x3_domain_list)
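For a feel of what this ratio measures, here is a minimal sketch on two hypothetical strings. The helper name `unique_char_ratio` is an assumption for illustration. A name that repeats letters scores below 1, while a name with no repeated character scores exactly 1.

```python
def unique_char_ratio(name):
    """Number of distinct characters in `name` divided by its length."""
    if not name:
        return 0.0
    return len(set(name)) / len(name)

print(unique_char_ratio("google"))    # distinct: g, o, l, e -> 4 / 6
print(unique_char_ratio("xkqzvbnm"))  # every character is distinct -> 8 / 8
```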
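3) Average Jaccard index: the Jaccard index is the ratio of the size of the intersection of two sets to the size of their union; here it is computed over 2-grams.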
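def count2string_jarccard_index(a, b):
    # 2-gram sets, padded with a leading and a trailing space
    x = {' ' + a[0]}
    y = {' ' + b[0]}
    for i in range(0, len(a) - 1):
        x.add(a[i] + a[i+1])
    x.add(a[len(a)-1] + ' ')
    for i in range(0, len(b) - 1):
        y.add(b[i] + b[i+1])
    y.add(b[len(b)-1] + ' ')
    # Jaccard index: |intersection| / |union|
    return (0.0 + len(x & y)) / len(x | y)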
def get_jarccard_index(a_list, b_list):
    x = []  # domain lengths
    y = []  # average Jaccard index against every domain in b_list
    for a in a_list:
        j = 0.0
        for b in b_list:
            j += count2string_jarccard_index(a, b)
        x.append(len(a))
        y.append(j / len(b_list))
    return x, y
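The 2-gram Jaccard computation can be checked by hand on short strings. The following is a minimal sketch with hypothetical helper names (`bigrams`, `jaccard`, `mean_jaccard`). It uses the same space padding at both ends and takes intersection over union, as in the definition above.

```python
def bigrams(s):
    """Set of 2-grams of `s`, padded with one space at each end."""
    padded = " " + s + " "
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def jaccard(a, b):
    """Jaccard index of the 2-gram sets of `a` and `b`."""
    x, y = bigrams(a), bigrams(b)
    return len(x & y) / len(x | y)

def mean_jaccard(a, references):
    """Average Jaccard index of `a` against every string in `references`."""
    return sum(jaccard(a, b) for b in references) / len(references)

# "abc" -> {" a", "ab", "bc", "c "}; "abd" -> {" a", "ab", "bd", "d "}
# They share 2 of 6 distinct 2-grams.
print(jaccard("abc", "abd"))
print(mean_jaccard("abc", ["abc", "abd"]))
```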
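4) HMM score
When people choose a domain name they tend to combine a few common words. Put in mathematical terms: if an HMM is trained on common words, normal domains should get a relatively high HMM score, while botnet DGA domains, being randomly generated, should get a relatively low one.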
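To show how such a score can be computed, here is a minimal sketch of the forward algorithm for a two-state discrete HMM. It assumes the HMM score is the log-likelihood of the name under the model, normalized here by length. The parameters are hand-set and hypothetical rather than trained on common words, so the sketch only illustrates the scoring step, not the training.

```python
import math

VOWELS = "aeiou"
LETTERS = "abcdefghijklmnopqrstuvwxyz"

# Hypothetical hand-set parameters: hidden state "V" mostly emits vowels,
# hidden state "C" mostly emits consonants. A real model would learn these
# values from a corpus of common words.
START = {"V": 0.3, "C": 0.7}
TRANS = {"V": {"V": 0.2, "C": 0.8},
         "C": {"V": 0.7, "C": 0.3}}

def emit(state, ch):
    """Emission probability of letter `ch` in `state` (sums to 1 per state)."""
    favoured = (ch in VOWELS) == (state == "V")
    group_size = 5 if ch in VOWELS else 21
    return (0.9 if favoured else 0.1) / group_size

def hmm_score(name):
    """Per-character log-likelihood of `name` via the scaled forward algorithm."""
    chars = [c for c in name.lower() if c in LETTERS]
    if not chars:
        return float("-inf")
    alpha = {s: START[s] * emit(s, chars[0]) for s in START}
    log_like = 0.0
    for ch in chars[1:]:
        total = sum(alpha.values())
        log_like += math.log(total)  # accumulate the scaling factor
        alpha = {s: emit(s, ch) * sum(alpha[p] / total * TRANS[p][s] for p in alpha)
                 for s in START}
    log_like += math.log(sum(alpha.values()))
    return log_like / len(chars)

# Hypothetical sample strings: a pronounceable name and a random-looking one
print(hmm_score("google"))
print(hmm_score("xkqzvbnm"))
```

With these toy parameters the pronounceable string scores higher than the consonant-only one, which is the direction the feature relies on.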
3. Model validation: the SVM classification principle is the same as in the previous post.