[雪峯磁針石 blog] Python library introduction: collections, high-performance container datatypes

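Introduction

New in Python 2.4.

Source code: Lib/collections.py and Lib/_abcoll.py (Python 2); in Python 3 the code lives in Lib/collections/__init__.py and Lib/_collections_abc.py.

The module provides alternatives to the built-in dict, list, set, and tuple types.

The main types are: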

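namedtuple(): a factory function that creates tuple subclasses with named fields. New in Python 2.6.
deque: a double-ended queue, similar to a list, with fast appends and pops at both ends. New in Python 2.4.
Counter: a dict subclass for counting hashable objects. New in Python 2.7.
OrderedDict: a dict subclass that remembers the order in which entries were added. New in Python 2.7.
defaultdict: a dict subclass that calls a factory function to supply values for missing keys. New in Python 2.5.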
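
The rest of this article covers only Counter, namedtuple, and ChainMap, so here is a minimal sketch of the other three types from the list above. The sample data is made up for illustration.

```python
import collections

# deque: cheap appends and pops at both ends
d = collections.deque([1, 2, 3])
d.appendleft(0)
d.append(4)
first = d.popleft()   # removes the 0 added on the left
last = d.pop()        # removes the 4 added on the right
print(first, last, list(d))

# defaultdict: the factory (here list) supplies a value for a missing key
groups = collections.defaultdict(list)
for word in ['apple', 'avocado', 'banana']:
    groups[word[0]].append(word)
print(dict(groups))

# OrderedDict: remembers insertion order and can reorder entries
od = collections.OrderedDict([('x', 1), ('y', 2), ('z', 3)])
od.move_to_end('x')
print(list(od))
```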

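It also provides abstract base classes for testing whether a class offers a particular interface, for example whether it is hashable or a mapping. In Python 3 these classes live in collections.abc.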
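
As a rough illustration of such interface tests, the sketch below checks a few objects against the Mapping and Hashable abstract base classes. It assumes Python 3, where they are imported from collections.abc; the list of sample objects is arbitrary.

```python
import collections
import collections.abc

# Arbitrary sample objects, chosen only for illustration.
samples = [{}, [], (), 'text', collections.Counter(), collections.deque()]

for obj in samples:
    print(type(obj).__name__,
          'mapping:', isinstance(obj, collections.abc.Mapping),
          'hashable:', isinstance(obj, collections.abc.Hashable))
```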

Counter

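A Counter is a container that tracks how many times each value occurs. It is similar to a bag or multiset in other languages.

A Counter can be initialized in three ways: the constructor accepts a sequence of items, a dictionary of keys and counts, or keyword arguments.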


import collections

print(collections.Counter(['a', 'b', 'c', 'a', 'b', 'b']))
print(collections.Counter({'a': 2, 'b': 3, 'c': 1}))
print(collections.Counter(a=2, b=3, c=1))

Output:

$ python3 collections_counter_init.py 
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})
Counter({'b': 3, 'a': 2, 'c': 1})

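Note that the printed representation lists the keys from highest count to lowest.

You can also create an empty Counter and fill it with update():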


import collections

c = collections.Counter()
print('Initial :{0}'.format(c))

c.update('abcdaab')
print('Sequence:{0}'.format(c))

c.update({'a': 1, 'd': 5})
print('Dict    :{0}'.format(c))

Output:

$ python3.5 collections_counter_update.py
Initial :Counter()
Sequence:Counter({'a': 3, 'b': 2, 'c': 1, 'd': 1})
Dict    :Counter({'d': 6, 'a': 4, 'b': 2, 'c': 1})

Accessing counts


import collections

c = collections.Counter('abcdaab')

for letter in 'abcde':
    print('{0} : {1}'.format(letter, c[letter]))

Output:


$ python3.5 collections_counter_get_values.py 
a : 3
b : 2
c : 1
d : 1
e : 0

Note that an element that is not in the Counter is reported with a count of 0 instead of raising KeyError.
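
A small sketch of this behaviour: reading a missing key returns 0 but does not add the key to the Counter. The variable names are only illustrative.

```python
import collections

c = collections.Counter('abcdaab')

missing = c['e']            # no KeyError; a missing key reads as 0
key_was_added = 'e' in c    # the lookup above did not insert the key

print(missing, key_was_added, len(c))
```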

The elements() method returns an iterator over all the elements:


import collections

c = collections.Counter('extremely')
c['z'] = 0
print(c)
print(list(c.elements()))

Output:


$ python3.5 collections_counter_elements.py 
Counter({'e': 3, 'y': 1, 'r': 1, 'x': 1, 'm': 1, 'l': 1, 't': 1, 'z': 0})
['y', 'r', 'x', 'm', 'l', 't', 'e', 'e', 'e']

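Note that the element with a count of 0 does not appear in the elements() output.

most_common() returns the most frequent elements. The example below counts the characters in /etc/adduser.conf.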


import collections

c = collections.Counter()
with open('/etc/adduser.conf', 'rt') as f:
    for line in f:
        c.update(line.rstrip().lower())

print('Most common:')
for letter, count in c.most_common(3):
    print('{0}: {1}'.format(letter, count))

Output:


$ python3.5 collections_counter_most_common.py 
Most common:
 : 401
e: 310
s: 221

Counter also supports arithmetic and set operations. All of them keep only the keys whose counts are positive.


import collections
import pprint

c1 = collections.Counter(['a', 'b', 'c', 'a', 'b', 'b'])
c2 = collections.Counter('alphabet')

print('C1:')
pprint.pprint(c1)
print('C2:')
pprint.pprint(c2)

print('\nCombined counts:')
print(c1 + c2)

print('\nSubtraction:')
print(c1 - c2)

print('\nIntersection (taking positive minimums):')
print(c1 & c2)

print('\nUnion (taking maximums):')
print(c1 | c2)

Output:


$ python3 collections_counter_arithmetic.py
C1:
Counter({'b': 3, 'a': 2, 'c': 1})
C2:
Counter({'a': 2, 't': 1, 'l': 1, 'e': 1, 'b': 1, 'p': 1, 'h': 1})

Combined counts:
Counter({'b': 4, 'a': 4, 'p': 1, 'e': 1, 'c': 1, 't': 1, 'l': 1, 'h': 1})

Subtraction:
Counter({'b': 2, 'c': 1})

Intersection (taking positive minimums):
Counter({'a': 2, 'b': 1})

Union (taking maximums):
Counter({'b': 3, 'a': 2, 'p': 1, 'e': 1, 'c': 1, 't': 1, 'l': 1, 'h': 1})

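The examples above might suggest that Counter can only handle single characters. That is not the case, as this example from the standard library documentation shows.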


from collections import Counter
import pprint
import re

cnt = Counter()

for word in ['red', 'blue', 'red', 'green', 'blue', 'blue']:
    cnt[word] += 1
pprint.pprint(cnt)
cnt = Counter(['red', 'blue', 'red', 'green', 'blue', 'blue'])
pprint.pprint(cnt)

with open('/etc/adduser.conf') as f:
    words = re.findall(r'\w+', f.read().lower())
print(Counter(words).most_common(10))

Output:


$ python3 collections_counter_normal.py
Counter({'blue': 3, 'red': 2, 'green': 1})
Counter({'blue': 3, 'red': 2, 'green': 1})
[('the', 27), ('is', 13), ('be', 12), ('if', 12), ('will', 12), ('user', 10), ('home', 9), ('default', 9), ('to', 9), ('users', 8)]

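The first two snippets have the same effect. The last snippet uses Counter for a simple word count. It answers, for example, this interview question: use Python to print the 10 most frequent words in /etc/ssh/sshd_config along with their counts.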
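
The word count can be wrapped in a small helper. This is only a sketch: the function name top_words and the sample text are made up for illustration, and the sample string stands in for the contents of a real file.

```python
import collections
import re


def top_words(text, n=10):
    """Return the n most common words in text with their counts."""
    words = re.findall(r'\w+', text.lower())
    return collections.Counter(words).most_common(n)


# Hypothetical sample text standing in for the contents of a config file.
sample = 'Port 22\nport 22 is the default\nThe default port'
result = top_words(sample, 3)
print(result)
```

For the interview question, read /etc/ssh/sshd_config inside a with open(...) block and pass the text to the helper.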

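Here is how Counter is defined:

class collections.Counter([iterable-or-mapping]). Note that a Counter is an unordered dictionary (since Python 3.7 it remembers insertion order, like dict). Looking up a key that does not exist returns 0. Setting a count to 0, as in c['sausage'] = 0, does not remove the element; use del c['sausage'] for that.

In addition to the standard dictionary methods, it adds:

elements(): returns an iterator over all elements, ignoring counts less than 1.

most_common([n]): returns a list of the n most common elements and their counts, from most to least common. By default it returns all elements.

subtract([iterable-or-mapping]): subtracts counts in place. Unlike the - operator, it keeps zero and negative counts.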
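
To make the difference between subtract() and the - operator concrete, here is a short sketch with made-up counts:

```python
import collections

c1 = collections.Counter(a=4, b=2, c=0)
c2 = collections.Counter(a=1, b=2, c=3)

# The - operator returns a new Counter and drops counts below 1.
diff = c1 - c2
print(diff)

# subtract() updates c1 in place and keeps zero and negative counts.
c1.subtract(c2)
print(c1)
```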

namedtuple

A named tuple is about as memory-efficient as a regular tuple, because it does not create a dictionary for each instance.


import collections

Person = collections.namedtuple('Person', 'name age gender')

print('Type of Person:{0}'.format(type(Person)))

bob = Person(name='Bob', age=30, gender='male')
print('\nRepresentation: {0}'.format(bob))

jane = Person(name='Jane', age=29, gender='female')
print('\nField by name: {0}'.format(jane.name))

print('\nFields by index:')
for p in [bob, jane]:
    print('{0} is a {1} year old {2}'.format(*p))

Output:


$ python3 collections_namedtuple_person.py
Type of Person:<class 'type'>

Representation: Person(name='Bob', age=30, gender='male')

Field by name: Jane

Fields by index:
Bob is a 30 year old male
Jane is a 29 year old female

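As the example shows, the Person named tuple class works like the header row of an Excel sheet: it gives each column a name, while the actual rows are stored in instances of Person. The benefit is that you can write jane.name, which is easier to read than remembering a tuple index.

Field names are identifiers internally, so they must not clash with Python keywords, must start with a letter (names that start with an underscore or a digit are rejected), and must not be repeated. Both definitions below raise ValueError: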


import collections

try:
    collections.namedtuple('Person', 'name class age gender')
except ValueError as err:
    print(err)

try:
    collections.namedtuple('Person', 'name age gender age')
except ValueError as err:
    print(err)

Output:


$ python3 collections_namedtuple_bad_fields.py 
Type names and field names cannot be a keyword: 'class'
Encountered duplicate field name: 'age'

With rename=True, invalid field names are automatically replaced with positional names, although the generated names are not pretty.


import collections

with_class = collections.namedtuple('Person', 'name class age gender',
                                    rename=True)
print(with_class._fields)

two_ages = collections.namedtuple('Person', 'name age gender age',
                                  rename=True)
print(two_ages._fields)

Output:


$ python collections_namedtuple_rename.py
('name', '_1', 'age', 'gender')
('name', 'age', 'gender', '_3')

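Definition

collections.namedtuple(typename, field_names, verbose=False, rename=False) returns a named tuple class. If verbose is True, the class definition is printed. The verbose parameter was removed in Python 3.7.

Named tuples are particularly useful for handling database rows.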
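
As a minimal sketch of the database use case, the example below maps rows from an in-memory SQLite database onto a named tuple. The Employee type, the employees table, and its rows are all hypothetical and exist only for illustration.

```python
import collections
import sqlite3

# Hypothetical record type and table, used only for illustration.
Employee = collections.namedtuple('Employee', 'name age title')

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE employees (name TEXT, age INTEGER, title TEXT)')
conn.executemany('INSERT INTO employees VALUES (?, ?, ?)',
                 [('Bob', 30, 'engineer'), ('Jane', 29, 'manager')])

# _make() builds a named tuple from any iterable, such as a result row.
cursor = conn.execute('SELECT name, age, title FROM employees ORDER BY name')
employees = [Employee._make(row) for row in cursor]
conn.close()

for emp in employees:
    print(emp.name, emp.age, emp.title)
```

Each row can then be read by field name (emp.title) instead of by column index (row[2]).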

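ChainMap (chained mappings)

ChainMap is used to search several dictionaries as if they were one. New in Python 3.3.

A ChainMap manages a sequence of dictionaries and looks up a key in each of them in order, returning the first value found.

Accessing values:

The API is similar to that of a dictionary.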

collections_chainmap_read.py


import collections

a = {'a': 'A', 'c': 'C'}
b = {'b': 'B', 'c': 'D'}

m = collections.ChainMap(a, b)

print('Individual Values')
print('a = {}'.format(m['a']))
print('b = {}'.format(m['b']))
print('c = {}'.format(m['c']))
print()

print('m = {}'.format(m))
print('Keys = {}'.format(list(m.keys())))
print('Values = {}'.format(list(m.values())))
print()

print('Items:')
for k, v in m.items():
    print('{} = {}'.format(k, v))
print()

print('"d" in m: {}'.format(('d' in m)))

Output:


$ python3 collections_chainmap_read.py 
Individual Values
a = A
b = B
c = C

m = ChainMap({'c': 'C', 'a': 'A'}, {'c': 'D', 'b': 'B'})
Keys = ['c', 'a', 'b']
Values = ['C', 'A', 'B']

Items:
c = C
a = A
b = B

"d" in m: False

Reordering

collections_chainmap_reorder.py


import collections

a = {'a': 'A', 'c': 'C'}
b = {'b': 'B', 'c': 'D'}

m = collections.ChainMap(a, b)

print(m.maps)
print('c = {}\n'.format(m['c']))

# reverse the list
m.maps = list(reversed(m.maps))

print(m.maps)
print('c = {}'.format(m['c']))

Output:


$ python3 collections_chainmap_reorder.py
[{'c': 'C', 'a': 'A'}, {'c': 'D', 'b': 'B'}]
c = C

[{'c': 'D', 'b': 'B'}, {'c': 'C', 'a': 'A'}]
c = D

Updating values

Updating an underlying dictionary:

collections_chainmap_update_behind.py


import collections

a = {'a': 'A', 'c': 'C'}
b = {'b': 'B', 'c': 'D'}

m = collections.ChainMap(a, b)
print('Before: {}'.format(m['c']))
a['c'] = 'E'
print('After : {}'.format(m['c']))

Output:


$ python3 collections_chainmap_update_behind.py

Before: C
After : E

Updating the ChainMap directly (the change is written to the first dictionary in the chain):

collections_chainmap_update_directly.py


import collections

a = {'a': 'A', 'c': 'C'}
b = {'b': 'B', 'c': 'D'}

m = collections.ChainMap(a, b)
print('Before:', m)
m['c'] = 'E'
print('After :', m)
print('a:', a)

Output:


$ python3 collections_chainmap_update_directly.py

Before: ChainMap({'c': 'C', 'a': 'A'}, {'c': 'D', 'b': 'B'})
After : ChainMap({'c': 'E', 'a': 'A'}, {'c': 'D', 'b': 'B'})
a: {'c': 'E', 'a': 'A'}

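With new_child(), ChainMap makes it easy to insert a new dictionary at the front of the chain, so that writes do not modify the original dictionaries.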

collections_chainmap_new_child.py


import collections

a = {'a': 'A', 'c': 'C'}
b = {'b': 'B', 'c': 'D'}

m1 = collections.ChainMap(a, b)
m2 = m1.new_child()

print('m1 before:', m1)
print('m2 before:', m2)

m2['c'] = 'E'

print('m1 after:', m1)
print('m2 after:', m2)

Output:


$ python3 collections_chainmap_new_child.py
m1 before: ChainMap({'a': 'A', 'c': 'C'}, {'b': 'B', 'c': 'D'})
m2 before: ChainMap({}, {'a': 'A', 'c': 'C'}, {'b': 'B', 'c': 'D'})
m1 after: ChainMap({'a': 'A', 'c': 'C'}, {'b': 'B', 'c': 'D'})
m2 after: ChainMap({'c': 'E'}, {'a': 'A', 'c': 'C'}, {'b': 'B', 'c': 'D'})

You can also pass the new dictionary to new_child() explicitly:

collections_chainmap_new_child_explicit.py


import collections

a = {'a': 'A', 'c': 'C'}
b = {'b': 'B', 'c': 'D'}
c = {'c': 'E'}

m1 = collections.ChainMap(a, b)
m2 = m1.new_child(c)

print('m1["c"] = {}'.format(m1['c']))
print('m2["c"] = {}'.format(m2['c']))

Output:


$ python3 collections_chainmap_new_child_explicit.py
m1["c"] = C
m2["c"] = E

An equivalent way to write this:


m2 = collections.ChainMap(c, *m1.maps)
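
The following sketch checks that the two forms build the same chain. The names via_method and via_constructor are illustrative only.

```python
import collections

a = {'a': 'A', 'c': 'C'}
b = {'b': 'B', 'c': 'D'}
c = {'c': 'E'}

m1 = collections.ChainMap(a, b)
via_method = m1.new_child(c)
via_constructor = collections.ChainMap(c, *m1.maps)

# Both chains hold the same dictionaries in the same order.
print(via_method.maps == via_constructor.maps)
print(via_method['c'], via_constructor['c'])
```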
