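Background
We need to collect the contents of some configuration files on a schedule, compare each collection with the previous one, and persist the line-level changes to a database.
The benefit is that we can review these change records at any time: we know which changes were made and when, which makes it easier to work out which change affected the normal running of a service.
Below, we use the difflib module to implement this requirement.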
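Introduction to difflib
Official documentation: https://docs.python.org/3/library/difflib.html
Chinese version: https://docs.python.org/zh-cn/3/library/difflib.html
difflib is a module in Python's standard library. Its classes and functions compare two sequences and produce the differences either as text or as an HTML page that highlights them.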
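As a quick illustration of the two output styles just mentioned, here is a minimal sketch using `difflib.unified_diff` for text and `difflib.HtmlDiff` for HTML. The sample lines and the `old.conf` / `new.conf` labels are made up for the example.

```python
import difflib

# Hypothetical sample data: two versions of a tiny config file.
old = ["port = 8080\n", "debug = false\n"]
new = ["port = 9090\n", "debug = false\n"]

# unified_diff yields the diff one line at a time, headers included.
diff_lines = list(
    difflib.unified_diff(old, new, fromfile="old.conf", tofile="new.conf")
)
print("".join(diff_lines), end="")

# HtmlDiff.make_file returns a complete HTML document as a string.
html = difflib.HtmlDiff().make_file(old, new)
print(len(html) > 0)
```

Both forms are meant for people to read. The rest of this article is about getting a result that a program can store.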
Using the Differ class
We start by comparing two sequences of text lines with the Differ class.
Code example
from difflib import Differ

text1 = ''' 1. Beautiful is better than ugly.
 2. Explicit is better than implicit.
 3. Simple is better than complex.
 4. Complex is better than complicated.
'''.splitlines(keepends=True)
text2 = ''' 1. Beautiful is better than ugly.
 3. Simple is better than complex.
 4. Complicated is better than complicated.
 5. Flat is better than nested.
'''.splitlines(keepends=True)
differ = Differ()
# Every yielded line already ends with a newline, so suppress print's own.
for line in differ.compare(text1, text2):
    print(line, end='')
Output
   1. Beautiful is better than ugly.
-  2. Explicit is better than implicit.
   3. Simple is better than complex.
-  4. Complex is better than complicated.
?           ^
+  4. Complicated is better than complicated.
?          ++++ ^
+  5. Flat is better than nested.
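The result from this method contains both inter-line and intra-line differences. We don't actually care about the intra-line differences, and the format is hard to parse.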
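To show what "hard to parse" means in practice, here is a rough sketch that parses Differ's text output by its two-character prefix (`'- '`, `'+ '`, `'  '`, `'? '`). The two short input lists are cut down from the sample above for illustration.

```python
from difflib import Differ

a = [" 3. Simple is better than complex.\n",
     " 4. Complex is better than complicated.\n"]
b = [" 3. Simple is better than complex.\n",
     " 4. Complicated is better than complicated.\n"]

records = []
for line in Differ().compare(a, b):
    prefix, content = line[:2], line[2:]
    if prefix in ("? ", "  "):
        continue  # skip intra-line hints and unchanged lines
    records.append((prefix.strip(), content))

print(records)
# [('-', ' 4. Complex is better than complicated.\n'),
#  ('+', ' 4. Complicated is better than complicated.\n')]
```

A modified line arrives as a separate `-` and `+` entry, so the caller would still have to pair them up again to recognise a replacement.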
Using the SequenceMatcher class
The get_opcodes method of SequenceMatcher returns a list of 5-tuples (tag, i1, i2, j1, j2) that describe how to turn a into b.
Code example
from difflib import SequenceMatcher

# text1 and text2 are the lists of lines defined in the previous example.
matcher = SequenceMatcher(None, text1, text2)
for tag, alo, ahi, blo, bhi in matcher.get_opcodes():
    if tag == 'replace':
        print('replace\n{}\n{}'.format(text1[alo:ahi], text2[blo:bhi]))
    elif tag == 'delete':
        print('delete\n{}'.format(text1[alo:ahi]))
    elif tag == 'insert':
        print('insert\n{}'.format(text2[blo:bhi]))
    elif tag == 'equal':
        print('equal\n{}\n{}'.format(text1[alo:ahi], text2[blo:bhi]))
Output
equal
[' 1. Beautiful is better than ugly.\n']
[' 1. Beautiful is better than ugly.\n']
delete
[' 2. Explicit is better than implicit.\n']
equal
[' 3. Simple is better than complex.\n']
[' 3. Simple is better than complex.\n']
replace
[' 4. Complex is better than complicated.\n']
[' 4. Complicated is better than complicated.\n', ' 5. Flat is better than nested.\n']
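To make the index ranges in each opcode concrete, the sketch below rebuilds b from a by applying the opcodes in order. The short lists `a` and `b` are made-up sample data.

```python
from difflib import SequenceMatcher

a = ["x\n", "y\n", "z\n"]
b = ["x\n", "z\n", "w\n"]

rebuilt = []
for tag, i1, i2, j1, j2 in SequenceMatcher(None, a, b).get_opcodes():
    if tag == "equal":
        rebuilt.extend(a[i1:i2])   # unchanged lines are copied from a
    elif tag in ("replace", "insert"):
        rebuilt.extend(b[j1:j2])   # new content always comes from b
    # 'delete': a[i1:i2] is simply dropped

print(rebuilt == b)
```

The ranges a[i1:i2] and b[j1:j2] are ordinary slices, which is why one opcode can cover several lines at once.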
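Splitting changes into individual changes
The SequenceMatcher result is already close to what we want. It would be even better if each block of changes were split into individual changes.
Let's try writing a function to do that.
Code example
from difflib import SequenceMatcher

def diff(text1, text2):
    """Return the line-level changes between two lists of lines as tuples."""
    change_list = []
    matcher = SequenceMatcher(None, text1, text2)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == 'replace':
            l1, l2 = text1[i1:i2], text2[j1:j2]
            # Pair old and new lines by position; zip stops at the shorter side.
            change_list.extend((tag, old, new) for old, new in zip(l1, l2))
            # Whatever is left on the longer side is a plain delete or insert.
            if len(l1) > len(l2):
                change_list.extend(('delete', line) for line in l1[len(l2):])
            elif len(l2) > len(l1):
                change_list.extend(('insert', line) for line in l2[len(l1):])
        elif tag == 'delete':
            change_list.extend((tag, line) for line in text1[i1:i2])
        elif tag == 'insert':
            change_list.extend((tag, line) for line in text2[j1:j2])
        # 'equal' blocks are unchanged lines and are skipped.
    return change_list

for change in diff(text1, text2):
    print(change)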
Output
('delete', ' 2. Explicit is better than implicit.\n')
('replace', ' 4. Complex is better than complicated.\n', ' 4. Complicated is better than complicated.\n')
('insert', ' 5. Flat is better than nested.\n')
This result is ready to be parsed and persisted.
However, it does not give a correct comparison in some special cases.
Finding the problem
Testing showed that if the test data is changed to the following, the lines become misaligned.
Code example
text1 = ''' 1. Beautiful is better than ugly.
 2. Explicit is better than implicit.
 3. Simple is better than complex.
 4. Complex is better than complicated.
'''.splitlines(keepends=True)
text2 = ''' 1. Beautiful is better than ugly.
 3. Simple is better than complexed.
 4. Complicated is better than complicated.
 5. Flat is better than nested.
'''.splitlines(keepends=True)
for change in diff(text1, text2):
    print(change)
Output
('replace', ' 2. Explicit is better than implicit.\n', ' 3. Simple is better than complexed.\n')
('replace', ' 3. Simple is better than complex.\n', ' 4. Complicated is better than complicated.\n')
('replace', ' 4. Complex is better than complicated.\n', ' 5. Flat is better than nested.\n')
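A custom CustomDiffer class
So the approach above is not reliable either: only line 1 is identical in both versions, so SequenceMatcher reports everything after it as a single replace block, and diff() pairs the two sides purely by position. I decided to go back to the Differ class.
Internally, Differ also uses SequenceMatcher, but it breaks a replace block down by looking for the best-matching pair of lines, which neatly solves the problem we just ran into.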
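The "best-matching pair" idea can be sketched with `SequenceMatcher.ratio()`, which scores how similar two strings are on a scale from 0 to 1. This is only an illustration of the principle, not Differ's actual code. In CPython, Differ only pairs lines whose score is above a cutoff of roughly 0.75, and that value is an implementation detail.

```python
from difflib import SequenceMatcher

# Lines from the replace block of the failing example above.
old_lines = [
    " 2. Explicit is better than implicit.\n",
    " 3. Simple is better than complex.\n",
    " 4. Complex is better than complicated.\n",
]
new_line = " 3. Simple is better than complexed.\n"

# Score every old line against the new line and keep the best one.
scores = [(SequenceMatcher(None, old, new_line).ratio(), old)
          for old in old_lines]
for score, old in scores:
    print("{:.3f} {!r}".format(score, old))

best_score, best_old = max(scores)
print("best match:", repr(best_old))
```

The old line 3 scores highest against the new line 3, so those two are paired as a replacement instead of being matched by position.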
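Next, I defined a CustomDiffer class that inherits from Differ and overrides the parent's formatting methods, mainly so that the Differ-style results come out in one uniform format. Note that _dump and _qformat are private methods of Differ, so this relies on implementation details that may change between Python versions.
Code
from difflib import Differ

class CustomDiffer(Differ):
    def _dump(self, tag, x, lo, hi):
        # Called for blocks of inserted ('+'), deleted ('-') or equal (' ') lines.
        if tag == '+':
            kind = 'insert'
        elif tag == '-':
            kind = 'delete'
        else:
            return  # unchanged lines are not reported
        for i in range(lo, hi):
            yield kind, x[i]

    def _qformat(self, aline, bline, atags, btags):
        # Called for a pair of similar lines; drop the intra-line hints.
        yield 'replace', aline, bline

# Compare the second pair of samples from the previous section.
for change in CustomDiffer().compare(text1, text2):
    print(change)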
Output
('delete', ' 2. Explicit is better than implicit.\n')
('replace', ' 3. Simple is better than complex.\n', ' 3. Simple is better than complexed.\n')
('replace', ' 4. Complex is better than complicated.\n', ' 4. Complicated is better than complicated.\n')
('insert', ' 5. Flat is better than nested.\n')
As you can see, each change is now a tuple, which makes it much easier to parse and process.
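To close the loop with the original requirement, here is a minimal sketch of persisting these tuples using the standard sqlite3 module. The `save_changes` helper, the `config_changes` table, its columns, and the `/etc/example.conf` path are all hypothetical names chosen for illustration; a real deployment would use its own database and schema.

```python
import sqlite3
from datetime import datetime, timezone
from difflib import Differ


class CustomDiffer(Differ):
    """Same overrides as above: emit tuples instead of text lines."""

    def _dump(self, tag, x, lo, hi):
        kinds = {'+': 'insert', '-': 'delete'}
        if tag not in kinds:
            return
        for i in range(lo, hi):
            yield kinds[tag], x[i]

    def _qformat(self, aline, bline, atags, btags):
        yield 'replace', aline, bline


def save_changes(conn, path, old_lines, new_lines):
    """Store one row per change (hypothetical schema) and return the count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS config_changes ("
        "collected_at TEXT, path TEXT, kind TEXT, old_line TEXT, new_line TEXT)"
    )
    now = datetime.now(timezone.utc).isoformat()
    rows = []
    for change in CustomDiffer().compare(old_lines, new_lines):
        if isinstance(change, str):
            continue  # defensive: Differ can still yield an unchanged line as text
        kind = change[0]
        if kind == 'replace':
            old, new = change[1], change[2]
        elif kind == 'delete':
            old, new = change[1], None
        else:  # 'insert'
            old, new = None, change[1]
        rows.append((now, path, kind, old, new))
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT INTO config_changes VALUES (?, ?, ?, ?, ?)", rows
        )
    return len(rows)


old = [" 1. Beautiful is better than ugly.\n",
       " 2. Explicit is better than implicit.\n",
       " 3. Simple is better than complex.\n",
       " 4. Complex is better than complicated.\n"]
new = [" 1. Beautiful is better than ugly.\n",
       " 3. Simple is better than complexed.\n",
       " 4. Complicated is better than complicated.\n",
       " 5. Flat is better than nested.\n"]

conn = sqlite3.connect(":memory:")  # in-memory database, for the example only
saved = save_changes(conn, "/etc/example.conf", old, new)
for row in conn.execute("SELECT kind, old_line, new_line FROM config_changes"):
    print(row)
```

Storing the old and new line in separate nullable columns keeps all three kinds of change in one table, so a later query can filter by kind or by time.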