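Baidu interview question:

Given a massive volume of log data, extract the IP that visited Baidu the most times on a given day.

Solution approach:

Because the data set is massive, reading all the log data into memory, sorting it, and then picking the most frequent entry is clearly not feasible.

(1) First assume that memory is sufficient. In that case a few lines of code are enough to get the final result.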

from collections import Counter


if __name__ == '__main__':
    # For simplicity, a list stands in for the log data.
    ip_list = ["192.168.1.2", "192.168.1.3", "192.168.1.3", "192.168.1.4", "192.168.1.3", "192.168.1.2"]
    ip_counter = Counter(ip_list)  # Counter from the standard library tallies each IP
    print(ip_counter.most_common(1)[0][0])  # most_common(1) returns [(ip, count)] for the top entry

192.168.1.3
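
When the IPs come from a file rather than a list, the same counting can be done without building the list first. The following is a minimal sketch, assuming one IP per line; `count_ips` and the sample lines are hypothetical, and `io.StringIO` stands in for a real log file.

```python
import io
from collections import Counter


def count_ips(lines):
    """Count IPs from any iterable of lines (one IP per line), skipping blank lines."""
    return Counter(line.strip() for line in lines if line.strip())


# io.StringIO stands in for an open log file (hypothetical sample data).
sample_log = io.StringIO("192.168.1.2\n192.168.1.3\n192.168.1.3\n")
ip_counter = count_ips(sample_log)
top_ip, top_count = ip_counter.most_common(1)[0]
print(top_ip, top_count)
```

The Counter still keeps one entry per distinct IP, so this only helps while the set of distinct IPs fits in memory.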

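(2) What if memory is limited and cannot hold all the log data?
Since the data does not fit in memory, we cannot count or sort it in memory in one go, so we take a "break the whole into parts" approach:
Suppose the data is 100G in size and the available memory is 1G.
We can split the data into 1000 pieces (any number comfortably above 100 works), so that each piece is about 100M and can be read into memory and processed on its own.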
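
The arithmetic behind these numbers can be checked with a few lines. This is a rough sketch that assumes "G" and "M" mean binary units (GiB and MiB); `min_partitions` is a hypothetical helper name.

```python
import math

GIB = 1024 ** 3  # bytes in one GiB (assumed meaning of "G")
MIB = 1024 ** 2  # bytes in one MiB (assumed meaning of "M")


def min_partitions(data_bytes, memory_bytes):
    """Smallest number of equal pieces such that one piece is no larger than memory."""
    return math.ceil(data_bytes / memory_bytes)


data_bytes = 100 * GIB
memory_bytes = 1 * GIB
num_pieces = 1000

lower_bound = min_partitions(data_bytes, memory_bytes)
avg_piece_mib = data_bytes / num_pieces / MIB
print(lower_bound)              # 100
print(round(avg_piece_mib, 1))  # 102.4
```

With exactly 100 pieces each one would fill the whole 1G, leaving no room for the counting structure itself, which is why a larger number such as 1000 is the safer choice. Pieces produced by hashing are also only equal on average, so some headroom is useful.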

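The key question, though, is: how do we split the 100G of data into 1000 pieces?
This is where the hash function we learned about earlier comes in.
Definition of a hash function: for an input string, it returns a fixed-length integer.
What makes the hash function so useful here:
the same string always produces the same hash value, so every occurrence of a given IP ends up in the same piece;
different strings may produce the same hash value (usually with very low probability) or different ones.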
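
These two properties can be seen in a small sketch. It swaps in `zlib.crc32` as the hash function instead of Python's built-in `hash()`, because `hash()` on strings is salted per process by default and so gives different values from one run to the next (it is still consistent within a single run). `bucket_of` and `NUM_BUCKETS` are hypothetical names.

```python
import zlib

NUM_BUCKETS = 1000


def bucket_of(ip, num_buckets=NUM_BUCKETS):
    """Map an IP string to a bucket index that is stable across runs."""
    return zlib.crc32(ip.encode("utf-8")) % num_buckets


a = bucket_of("192.168.1.3")
b = bucket_of("192.168.1.3")
c = bucket_of("192.168.1.4")

print(a == b)                # the same string always maps to the same bucket
print(0 <= c < NUM_BUCKETS)  # every index is a valid bucket number
```

Two different IPs may share a bucket; that is harmless here, because each bucket is counted per IP afterwards.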

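The approach is as follows:

For each IP in the data, compute hash(ip)%1000 and write the IP to the corresponding one of 1000 files;
for each of these 1000 files, find the IP that appears most often;
compare the 1000 IPs found this way by their counts and take the one with the highest count (1000 candidates fit easily in memory, so no external sort is needed).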
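
The steps above can be sketched end to end on a tiny data set. This is an illustration under stated assumptions, not the full solution: it uses 8 buckets instead of 1000, a temporary directory instead of fixed paths, and `zlib.crc32` as the hash; `most_frequent_ip` is a hypothetical name.

```python
import os
import tempfile
import zlib
from collections import Counter


def most_frequent_ip(ips, num_buckets=8):
    """Partition IPs into bucket files by hash, count each bucket, return the overall winner."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        paths = [os.path.join(tmp_dir, f"{i}.txt") for i in range(num_buckets)]
        buckets = [open(path, "w") for path in paths]
        try:
            # Step 1: all occurrences of one IP land in the same bucket file.
            for ip in ips:
                index = zlib.crc32(ip.encode("utf-8")) % num_buckets
                buckets[index].write(ip + "\n")
        finally:
            for bucket in buckets:
                bucket.close()

        # Steps 2 and 3: find each bucket's winner and keep the best one seen so far.
        best_ip, best_count = None, 0
        for path in paths:
            with open(path) as f:
                counts = Counter(line.strip() for line in f)
            if counts:  # empty buckets contribute no candidate
                ip, count = counts.most_common(1)[0]
                if count > best_count:
                    best_ip, best_count = ip, count
        return best_ip, best_count


sample = ["192.168.1.2", "192.168.1.3", "192.168.1.3",
          "192.168.1.4", "192.168.1.3", "192.168.1.2"]
result = most_frequent_ip(sample)
print(result)  # ('192.168.1.3', 3)
```

Comparing only the per-bucket winners is enough because the globally most frequent IP has all of its occurrences in one bucket, where it must also be the winner.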

Python code

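import os
from collections import Counter


source_file = r'C:\Users\13721\Documents\most_ip_temp\bigdata.txt'
# A raw string cannot end with a single backslash, so two are written here;
# the path therefore ends with two backslashes, which Windows treats as one separator.
temp_files = r'C:\Users\13721\Documents\most_ip_temp\temp\\'
top_1000_ip = []  # list of (IP, count) tuples, one per non-empty temp file


def hash_file():
    """Create the temp folder and split the source file into 1000 files by hash."""
    if not os.path.exists(temp_files):
        os.makedirs(temp_files)

    temp_path_list = []
    for i in range(1000):
        temp_path_list.append(open(temp_files + str(i) + '.txt', mode='w'))

    try:
        with open(source_file) as f:
            # Key step: compute hash(ip)%1000 and write to one of the 1000 files.
            # hash() of a str is consistent within one run, which is all that is needed here.
            for line in f:
                ip = line.strip()
                if not ip:
                    continue  # skip blank lines
                line = ip + '\n'  # normalize so a final line without '\n' hashes like the rest
                temp_path_list[hash(str(line))%1000].write(line)
    finally:
        for i in range(1000):
            temp_path_list[i].close()  # closing flushes buffered writes before the files are read back


def cal_query_frequency():
    """Find the most frequent IP in each temp file and record it in top_1000_ip."""
    for root, dirs, files in os.walk(temp_files):
        for file in files:
            real_path = os.path.join(root, file)
            with open(real_path) as f:
                ip_counter = Counter(line.replace('\n', '') for line in f)
            if ip_counter:  # an empty temp file yields no candidate
                top_1000_ip.append(ip_counter.most_common(1)[0])


def get_ip():
    """Return the IP with the highest count among the per-file winners."""
    return max(top_1000_ip, key=lambda a: a[1])[0]


if __name__ == '__main__':
    hash_file()
    cal_query_frequency()
    print(get_ip())

192.168.1.3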

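Summary

This solution applies the "divide and conquer" idea of breaking the whole into parts.
The key code is: use the hash function to compute hash(ip)%1000 and write each IP to one of 1000 files. Note that the operator must be "%" (remainder), not "/" (division): only the remainder is guaranteed to be a valid file index from 0 to 999.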

temp_path_list[hash(str(line))%1000].write(line)
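
A short sketch of why the remainder is the right operation here. The hash values below are made up for illustration, and `bucket_index` is a hypothetical helper; note that Python's `hash()` can return negative numbers, and `%` with a positive divisor still yields a non-negative result.

```python
NUM_FILES = 1000


def bucket_index(hash_value, num_files=NUM_FILES):
    """Map any integer hash value to a file index in range(num_files)."""
    return hash_value % num_files


hash_values = [123456789, -7, 999, 1000]  # hypothetical hash values
for h in hash_values:
    # The remainder always stays in 0..999; the quotient grows with the hash value.
    print(h, bucket_index(h), h // NUM_FILES)
```

For 123456789 the remainder is 789, a valid index, while the quotient is 123456, far outside the list of 1000 open files.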

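An easy pitfall is the double backslash after the temp folder name: a raw string cannot end with a single backslash, so two are written, and the resulting path really does end with two backslashes.

temp_files = r'C:\Users\13721\Documents\most_ip_temp\temp\\'  # a raw string cannot end with a single backslash, hence two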
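
The behaviour of the trailing backslashes can be checked directly. The sketch below uses a shortened, hypothetical path (`C:\temp`) and shows one possible alternative: joining path components instead of concatenating strings. `ntpath` is the Windows flavour of `os.path` and is used so the result is the same on any platform.

```python
import ntpath  # Windows implementation of os.path, importable on every platform

# Shortened hypothetical path; the raw string keeps BOTH trailing backslashes.
temp_files = r'C:\temp\\'
print(temp_files)       # C:\temp\\
print(len(temp_files))  # 9

# Alternative: let the path module insert the separator.
first_file = ntpath.join(r'C:\temp', '0.txt')
print(first_file)       # C:\temp\0.txt
```

On Windows, `os.path.join` behaves like `ntpath.join`, so the folder can be stored without any trailing backslash at all.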
