The implementation of Python's Counter()

The collections.Counter source code

The source for Counter lives in collections.py under Lib. The source discussed in this article is the Python 2.7 version, which can be found on GitHub.

__init__

class Counter(dict):
    '''Dict subclass for counting hashable items.  Sometimes called a bag
    or multiset.  Elements are stored as dictionary keys and their counts
    are stored as dictionary values.
    '''
    def __init__(*args, **kwds):
        '''Create a new, empty Counter object.  And if given, count elements
        from an input iterable.  Or, initialize the count from another mapping
        of elements to their counts.

        >>> c = Counter()                           # a new, empty counter
        >>> c = Counter('gallahad')                 # a new counter from an iterable
        >>> c = Counter({'a': 4, 'b': 2})           # a new counter from a mapping
        >>> c = Counter(a=4, b=2)                   # a new counter from keyword args

        '''
        if not args:
            raise TypeError("descriptor '__init__' of 'Counter' object "
                            "needs an argument")
        self = args[0]
        args = args[1:]
        if len(args) > 1:
            raise TypeError('expected at most 1 arguments, got %d' % len(args))
        super(Counter, self).__init__()
        self.update(*args, **kwds)

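Counter is implemented as a subclass of dict. __init__ validates its arguments: besides self, args accepts at most one positional argument. Once validation passes, it calls the object's own update method, which does the actual work of building the data structure.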
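To make the validation concrete, here is a minimal sketch that exercises the constructor. It assumes Python 3 rather than the 2.7 source quoted above (the behaviour shown is the same in both). Because self is taken from *args instead of being a named parameter, 'self' can also be used as an ordinary keyword key.

```python
from collections import Counter

# The four construction styles listed in the docstring.
empty = Counter()
from_iterable = Counter('gallahad')
from_mapping = Counter({'a': 4, 'b': 2})
from_kwargs = Counter(a=4, b=2)

# More than one positional argument (besides self) is rejected.
try:
    Counter('ab', 'cd')
    too_many_rejected = False
except TypeError:
    too_many_rejected = True

# 'self' is just another key when it is passed as a keyword argument.
keyword_self = Counter(self=2)
```

The mapping form and the keyword form produce equal counters, since both end up in update.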

update

def update(*args, **kwds):
    '''Like dict.update() but add counts instead of replacing them.
    '''
    if not args:
        raise TypeError("descriptor 'update' of 'Counter' object "
                        "needs an argument")
    self = args[0]
    args = args[1:]
    if len(args) > 1:
        raise TypeError('expected at most 1 arguments, got %d' % len(args))
    iterable = args[0] if args else None
    if iterable is not None:
        if isinstance(iterable, Mapping):
            if self:
                self_get = self.get
                for elem, count in iterable.iteritems():
                    self[elem] = self_get(elem, 0) + count
            else:
                super(Counter, self).update(iterable) # fast path when counter is empty
        else:
            self_get = self.get
            for elem in iterable:
                self[elem] = self_get(elem, 0) + 1
    if kwds:
        self.update(kwds)

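The update method first checks its arguments: besides self, only one positional argument is allowed. It then inspects what was passed in. If it was called in the form Counter(a=1, b=2), it takes kwds ({'a': 1, 'b': 2}) and calls itself again, so that keyword arguments are handled as a single positional mapping argument.
If the positional argument is a mapping, which corresponds to a call such as Counter({'a': 1, 'b': 2}), the method checks whether self is empty. During initialization self is always empty, so the counts can be copied in directly with dict.update (the fast path). The check is needed because update is not only called from __init__(); it can also be called like this: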

import collections

x1 = collections.Counter({'a': 1, 'b': 2})
x2 = collections.Counter(a=1, b=2)
x1.update(x2)      # isinstance(iterable, Mapping) is also True for a Counter

# or like this
x1 = collections.Counter({'a': 1, 'b': 2})
x1.update('aab')

If the argument is not a mapping, update iterates over it and adds each item to the Counter as a key, increasing that key's count by 1.
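The three branches of update can be observed directly. The snippet below is a small illustration, assuming Python 3; the variable names are only for the example.

```python
from collections import Counter

x1 = Counter({'a': 1, 'b': 2})

# Mapping branch: counts are added, not replaced.
x1.update(Counter(a=1, b=2))
after_mapping = dict(x1)

# Iterable branch: every element adds 1 to its key.
x1.update('aab')
after_iterable = dict(x1)

# Keyword arguments are passed back into update as a mapping.
x1.update(c=3)
after_kwargs = dict(x1)
```

Unlike dict.update, none of these calls overwrites an existing count.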

most_common

def most_common(self, n=None):
    '''List the n most common elements and their counts from the most
    common to the least.  If n is None, then list all element counts.

    >>> Counter('abcdeabcdabcaba').most_common(3)
    [('a', 5), ('b', 4), ('c', 3)]

    '''
    # Emulate Bag.sortedByCount from Smalltalk
    if n is None:
        return sorted(self.iteritems(), key=_itemgetter(1), reverse=True)
    return _heapq.nlargest(n, self.iteritems(), key=_itemgetter(1))

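If most_common is called without n, it returns a list of all (key, value) pairs sorted by value in descending order. If n is given, only the n pairs with the largest values are returned, using heapq.nlargest.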
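The two return paths can be reproduced outside the class. The function below is a hypothetical stand-in written for Python 3 (items() instead of iteritems()); it is only a sketch of the logic quoted above, not the library code itself.

```python
import heapq
from collections import Counter
from operator import itemgetter


def most_common_sketch(counter, n=None):
    """Hypothetical re-creation of most_common for illustration."""
    if n is None:
        # Full sort by count, largest first.
        return sorted(counter.items(), key=itemgetter(1), reverse=True)
    # Only the n largest counts are needed, so a heap avoids a full sort.
    return heapq.nlargest(n, counter.items(), key=itemgetter(1))


c = Counter('abcdeabcdabcaba')
top_three = most_common_sketch(c, 3)
everything = most_common_sketch(c)
```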

itemgetter

An interesting helper used here is itemgetter (aliased as _itemgetter in the code). It is a function from the operator module; the code below gives a feel for what it does:

# Examples taken from the Python documentation:
#   After f = itemgetter(1), the call f(r) returns r[1].
#   After g = itemgetter(2, 5, 3), the call g(r) returns (r[2], r[5], r[3]).
# Equivalent implementation:
def itemgetter(*items):
    if len(items) == 1:
        item = items[0]
        def g(obj):
            return obj[item]
    else:
        def g(obj):
            return tuple(obj[item] for item in items)
    return g

# Common usage:
>>> itemgetter(1)('ABCDEFG')
'B'
>>> itemgetter(1,3,5)('ABCDEFG')
('B', 'D', 'F')
>>> itemgetter(slice(2,None))('ABCDEFG')
'CDEFG'

>>> inventory = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]
>>> getcount = itemgetter(1)
>>> map(getcount, inventory)
[3, 2, 5, 1]
>>> sorted(inventory, key=getcount)
[('orange', 1), ('banana', 2), ('apple', 3), ('pear', 5)]

heapq

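heapq is Python's implementation of the heap queue algorithm, also known as the priority queue algorithm. Calling _heapq.nlargest() returns a list of the n (key, value) tuples with the largest values, ordered from largest to smallest. See the documentation for details on using heapq.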
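As a small illustration of nlargest on its own, the snippet below reuses the inventory data from the itemgetter examples. It assumes Python 3 and compares the result with sorting everything and slicing.

```python
import heapq
from operator import itemgetter

pairs = [('apple', 3), ('banana', 2), ('pear', 5), ('orange', 1)]

# The two pairs with the largest counts, largest first.
top_two = heapq.nlargest(2, pairs, key=itemgetter(1))

# Same result via a full sort followed by a slice.
full_sort_top_two = sorted(pairs, key=itemgetter(1), reverse=True)[:2]

# nsmallest is the mirror image of nlargest.
bottom_one = heapq.nsmallest(1, pairs, key=itemgetter(1))
```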

elements

The elements method returns each key repeated as many times as its value. The implementation is elegant and takes a single line:

def elements(self):
    '''Iterator over elements repeating each as many times as its count.

    >>> c = Counter('ABCABC')
    >>> sorted(c.elements())
    ['A', 'A', 'B', 'B', 'C', 'C']
    '''
    return _chain.from_iterable(_starmap(_repeat, self.iteritems()))

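This implementation uses three functions from itertools: repeat, starmap and chain. Each item is repeated as many times as its count, and the results are chained together into a single iterator (not a list).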
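The one-liner is easier to follow when it is split into stages. The sketch below assumes Python 3 (items() instead of iteritems()) and uses a hypothetical counter that includes a zero and a negative count, to show that those keys produce no elements.

```python
from collections import Counter
from itertools import chain, repeat, starmap

c = Counter(a=2, b=1, c=0, d=-1)

# Stage 1: starmap calls repeat(key, count) for every (key, count) pair.
repeated = [list(r) for r in starmap(repeat, c.items())]

# Stage 2: chain.from_iterable flattens those iterators into one stream.
flattened = list(chain.from_iterable(starmap(repeat, c.items())))
```

Keys with a count of zero or less contribute an empty iterator in stage 1, so they vanish after flattening.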

repeat

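repeat creates an iterator that returns its first argument over and over, as many times as the second argument specifies (endlessly if the second argument is omitted). The implementation makes this easy to see; it is roughly equivalent to: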

def repeat(object, times=None):
    # repeat(10, 3) --> 10 10 10
    if times is None:
        while True:
            yield object
    else:
        for i in xrange(times):
            yield object

starmap

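starmap takes a function as its first argument and creates an iterator that calls the function once for each item of its second argument, unpacking that item as the call's arguments (this sounds abstract; the example makes it clearer). It is roughly equivalent to: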

def starmap(function, iterable):
    # starmap(pow, [(2,5), (3,2), (10,3)]) --> 32 9 1000
    for args in iterable:
        yield function(*args)

chain.from_iterable

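chain.from_iterable takes a single iterable of iterables and returns an iterator that yields every element of each inner iterable in turn. It is roughly equivalent to: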

def from_iterable(iterables):
    # chain.from_iterable(['ABC', 'DEF']) --> A B C D E F
    for it in iterables:
        for element in it:
            yield element

subtract

subtract is implemented much like update. The difference is that the counts of matching items are subtracted instead of added.

def subtract(*args, **kwds):
    '''Like dict.update() but subtracts counts instead of replacing them.
    Counts can be reduced below zero.  Both the inputs and outputs are
    allowed to contain zero and negative counts.

    Source can be an iterable, a dictionary, or another Counter instance.

    >>> c = Counter('which')
    >>> c.subtract('witch')             # subtract elements from another iterable
    >>> c.subtract(Counter('watch'))    # subtract elements from another counter
    >>> c['h']                          # 2 in which, minus 1 in witch, minus 1 in watch
    0
    >>> c['w']                          # 1 in which, minus 1 in witch, minus 1 in watch
    -1

    '''
    if not args:
        raise TypeError("descriptor 'subtract' of 'Counter' object "
                        "needs an argument")
    self = args[0]
    args = args[1:]
    if len(args) > 1:
        raise TypeError('expected at most 1 arguments, got %d' % len(args))
    iterable = args[0] if args else None
    if iterable is not None:
        self_get = self.get
        if isinstance(iterable, Mapping):
            for elem, count in iterable.items():
                self[elem] = self_get(elem, 0) - count
        else:
            for elem in iterable:
                self[elem] = self_get(elem, 0) - 1
    if kwds:
        self.subtract(kwds)
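The docstring example can be run as is. The snippet below, assuming Python 3, repeats it and also checks that keys whose count falls to zero or below stay in the counter.

```python
from collections import Counter

c = Counter('which')
c.subtract('witch')             # subtract elements from an iterable
c.subtract(Counter('watch'))    # subtract counts from another Counter

# Keys with zero or negative counts are kept in the counter.
non_positive = {key: count for key, count in c.items() if count <= 0}
```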


+, -, &, |

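By defining __add__, __sub__, __or__ and __and__, Counter overloads the +, -, | and & operators to provide set-like operations between Counters. The code is easy to follow; the point worth noting is that non-positive results are dropped: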

    def __add__(self, other):
        '''Add counts from two counters.

        >>> Counter('abbb') + Counter('bcc')
        Counter({'b': 4, 'c': 2, 'a': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            newcount = count + other[elem]
            if newcount > 0:
                result[elem] = newcount
        for elem, count in other.items():
            if elem not in self and count > 0:
                result[elem] = count
        return result

    def __sub__(self, other):
        ''' Subtract count, but keep only results with positive counts.

        >>> Counter('abbbc') - Counter('bccd')
        Counter({'b': 2, 'a': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            newcount = count - other[elem]
            if newcount > 0:
                result[elem] = newcount
        for elem, count in other.items():
            if elem not in self and count < 0:
                result[elem] = 0 - count
        return result

    def __or__(self, other):
        '''Union is the maximum of value in either of the input counters.

        >>> Counter('abbb') | Counter('bcc')
        Counter({'b': 3, 'c': 2, 'a': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            other_count = other[elem]
            newcount = other_count if count < other_count else count
            if newcount > 0:
                result[elem] = newcount
        for elem, count in other.items():
            if elem not in self and count > 0:
                result[elem] = count
        return result

    def __and__(self, other):
        ''' Intersection is the minimum of corresponding counts.

        >>> Counter('abbb') & Counter('bcc')
        Counter({'b': 1})

        '''
        if not isinstance(other, Counter):
            return NotImplemented
        result = Counter()
        for elem, count in self.items():
            other_count = other[elem]
            newcount = count if count < other_count else other_count
            if newcount > 0:
                result[elem] = newcount
        return result
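The four operators can be tried with the docstring inputs. The sketch below assumes Python 3; the last two lines are additional illustrative cases showing how non-positive results are dropped and how a negative count on the right-hand side of - turns positive.

```python
from collections import Counter

a = Counter('abbb')
b = Counter('bcc')

added = a + b                                    # counts are summed
subtracted = Counter('abbbc') - Counter('bccd')  # only positive results kept
union = a | b                                    # maximum of each count
intersection = a & b                             # minimum of each count

# A result of -2 is not positive, so the key disappears entirely.
dropped = Counter(a=1) - Counter(a=3)

# A negative count in the right operand becomes a positive result.
flipped = Counter() - Counter(a=-2)
```

In contrast to the subtract method, these operators always return a new Counter that contains only positive counts.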

Other methods

# When an object is pickled, the Pickler calls __reduce__ to find out how to serialize it
def __reduce__(self):
    return self.__class__, (dict(self),)


# Override deletion: delete only when the Counter has the key, which avoids a KeyError
def __delitem__(self, elem):
    'Like dict.__delitem__() but does not raise KeyError for missing values.'
    if elem in self:
        super(Counter, self).__delitem__(elem)

# %s : String (converts any Python object using str()).
# %r : String (converts any Python object using repr()).
def __repr__(self):
    if not self:
        return '%s()' % self.__class__.__name__
    items = ', '.join(map('%r: %r'.__mod__, self.most_common()))
    return '%s({%s})' % (self.__class__.__name__, items)

@classmethod
def fromkeys(cls, iterable, v=None):
    # There is no equivalent method for counters because setting v=1
    # means that no element can have a count greater than one.
    raise NotImplementedError(
        'Counter.fromkeys() is undefined.  Use Counter(iterable) instead.')


# Implement __missing__ so that Counter['no_field'] => 0; a plain dict does not define __missing__, so a missing key raises KeyError
def __missing__(self, key):
    'The count of elements not in the Counter is zero.'
    # Needed so that self[missing_item] does not raise KeyError
    return 0
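The effect of these overrides can be checked with a few calls. This is a short sketch assuming Python 3, where the behaviour shown matches the 2.7 source above; the key names are made up for the example.

```python
import pickle
from collections import Counter

c = Counter('aab')

# __missing__: an absent key reads as 0 and is not inserted.
missing_count = c['z']
z_was_added = 'z' in c

# __delitem__: deleting an absent key is silently ignored.
del c['not-there']

# __reduce__: the counter survives a pickle round trip.
restored = pickle.loads(pickle.dumps(c))

# __repr__: items are listed from most common to least common.
text = repr(c)

# fromkeys is deliberately unsupported.
try:
    Counter.fromkeys('ab')
    fromkeys_rejected = False
except NotImplementedError:
    fromkeys_rejected = True
```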

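Summary

Overall, Counter is implemented by subclassing the built-in dict and overriding a handful of its methods. The code is concise and its logic is clear. Reading the source is also a good way to learn several lesser-known standard library functions that can make code shorter and easier to follow.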
