Programming Collective Intelligence (Chinese Edition) --- Chapter 2

English-Chinese Terminology Table

English / Chinese
clustering 聚类
computationally intensive 计算量很大的
cross-product 叉乘
dendrogram 树状图
groups 群组
kernel methods, kernel tricks 核方法,核技法
k-nearest neighbors k-最近邻
multidimensional scaling 多维缩放
pattern 模式
solution (题)解
collective intelligence 集体智慧
crawl (网页)检索
data-intensive 数据量很大的
dot-product 点积
inbound link,incoming link 外部回指链接
K-Means K-均值
list comprehension 列表推导式
observation 观测数据,观测值
similarity 相似度,相似性
vertical search engine 垂直搜索引擎

Preface

The aim of this book is to take you beyond simple database-backed applications and teach you how to write smarter programs using the information that you and others collect every day.

Chapter 2

  • Collecting Preferences
    The first thing we need is a way to represent different people and their preferences.

    In Python, a very simple way to do this is to use a nested dictionary. If you plan to run the examples in this section, create a new file named recommendation.py (the interactive sessions below import it under this name) and add the following code to build the dataset:

critics = {'Lisa Rose':{'Lady in the Water':2.5, 'Snake on a Plane':3.5,
            'Just My Luck':3.0, 'Superman Returns':3.5, 'You, Me and Dupree':2.5,
            'The Night Listener':3.0},
            
            'Gene Seymour':{'Lady in the Water':3.0, 'Snake on a Plane':3.5,
            'Just My Luck':1.5, 'Superman Returns':5.0, 'The Night Listener':3.0,
            'You, Me and Dupree':3.5,},
            
            'Michael Phillips':{'Lady in the Water':2.5, 'Snake on a Plane':3.0,
            'Superman Returns':3.5, 'The Night Listener':4.0},
            
            'Claudia Puig':{'Snake on a Plane':3.5,'Just My Luck':3.0,
            'The Night Listener':4.5, 'Superman Returns':4.0, 
            'You, Me and Dupree':2.5},
            
            'Mick LaSalle':{'Lady in the Water':3.0, 'Snake on a Plane':4.0,
            'Just My Luck':2.0, 'Superman Returns':3.0, 'The Night Listener':3.0,
            'You, Me and Dupree':2.0},
            
            'Jack Matthews':{'Lady in the Water':3.0, 'Snake on a Plane':4.0,
            'The Night Listener':3.0, 'Superman Returns':5.0, 'You, Me and Dupree':3.5},
            
            'Toby':{'Snake on a Plane':4.5, 'You, Me and Dupree':1.0, 'Superman Returns':4.0}}
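
Once the file is saved, the dictionary can be queried and modified interactively (assuming the interpreter is started in the same directory as recommendation.py):

>>> from recommendation import critics
>>> critics['Lisa Rose']['Lady in the Water']
2.5
>>> critics['Toby']['Snake on a Plane']
4.5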

  • Finding Similar Users
    Having collected everyone's preferences, we need a way to determine how similar people are in their tastes. We do this by comparing each person with every other person and computing a similarity score. There are several ways to do this; in this section we will look at two systems for computing similarity scores: Euclidean distance and the Pearson correlation.

Euclidean Distance Score

This method uses the items that people have rated in common as axes, plots the people on a chart, and looks at how far apart they are.

The chart shows people positioned in "preference space": Toby sits at 4.5 on the Snakes axis and at 1.0 on the Dupree axis. The closer two people are in preference space, the more similar their preferences. Because the chart is two-dimensional, you can only look at two rankings at a time, but the principle works just the same for larger numbers of rankings.

Having calculated the distance, smaller values mean more similar preferences. However, we want a function that gives higher values for people who are more similar. We can do this by adding 1 to the distance (so we never divide by zero) and taking the reciprocal:

>>> from math import sqrt
>>> 1/(1+sqrt(pow(4.5-4,2)+pow(1-2,2)))
0.4721359549995794

This new function always returns a value between 0 and 1, where 1 means that two people have identical preferences. Putting everything together gives us the similarity function. Add the following code to recommendation.py:

from math import sqrt

# Returns a distance-based similarity score for person1 and person2
def sim_distance(prefs, person1, person2):
    # Get the list of shared_items
    si = {}
    for item in prefs[person1]:
        if item in prefs[person2]:
            si[item] = 1

    # If they have no ratings in common, return 0
    if len(si) == 0: return 0

    # Add up the squares of all the differences
    sum_of_squares = sum([pow(prefs[person1][item]-prefs[person2][item], 2)
                          for item in prefs[person1] if item in prefs[person2]])

    return 1/(1 + sqrt(sum_of_squares))
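
As a quick check (run from the directory containing recommendation.py): Lisa Rose and Gene Seymour share six rated movies, and their squared rating differences sum to 5.75, so the score should be 1/(1+√5.75):

>>> import recommendation
>>> recommendation.sim_distance(recommendation.critics,'Lisa Rose','Gene Seymour')
0.29429805508554946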

Pearson Correlation Score

The correlation coefficient is a measure of how well two sets of data fit onto a straight line. Its formula is more complicated than that of the Euclidean distance score, but it tends to give better results when the data isn't well normalized (for example, when critics' ratings routinely deviate strongly from the average).

To visualize this method, we can plot the ratings of two critics on a chart, as shown below. Mick LaSalle rated Superman 3 and Gene Seymour rated it 5, so the movie is placed at (3, 5) on the chart.

(Figure: scatter plot of the two critics' ratings, with a best-fit line)
One notable thing about using the Pearson score is that it corrects for "grade inflation". In this chart, Jack Matthews consistently gives higher scores than Lisa Rose, yet the line still fits well because the two of them have relatively similar preferences. If one person always gives higher scores than the other, and the difference between their scores is consistent, they can still be well correlated. The Euclidean distance score described earlier would conclude that two people are dissimilar simply because one of them is consistently "harsher" (and therefore rates everything relatively lower), even if their tastes are very similar. Whether this behavior is what you want depends on the application.

The Pearson score algorithm first finds the items rated by both critics. It then computes the sums and the sums of squares of the two critics' ratings, along with the sum of the products of their paired ratings.
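
Expressed with the running totals computed in the code below (n shared items; rating sums sum1 and sum2; sums of squares sum1Sq and sum2Sq; product sum pSum), the score is:

r = (pSum - sum1*sum2/n) / sqrt((sum1Sq - sum1^2/n) * (sum2Sq - sum2^2/n))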

# Returns the Pearson correlation coefficient for p1 and p2
def sim_pearson(prefs, p1, p2):
    # Get the list of items rated by both
    si = {}
    for item in prefs[p1]:
        if item in prefs[p2]: si[item] = 1

    # Number of shared items
    n = len(si)

    # If they have no ratings in common, return 0
    # (some printings of the book return 1 here, which would wrongly treat
    # two people with nothing in common as perfectly correlated)
    if n == 0: return 0

    # Add up all the preferences
    sum1 = sum([prefs[p1][it] for it in si])
    sum2 = sum([prefs[p2][it] for it in si])

    # Sum up the squares
    sum1Sq = sum([pow(prefs[p1][it], 2) for it in si])
    sum2Sq = sum([pow(prefs[p2][it], 2) for it in si])

    # Sum up the products
    pSum = sum([prefs[p1][it] * prefs[p2][it] for it in si])

    # Calculate the Pearson score
    num = pSum - (sum1 * sum2 / n)
    den = sqrt((sum1Sq - pow(sum1, 2) / n) * (sum2Sq - pow(sum2, 2) / n))
    if den == 0: return 0

    r = num/den

    return r

This function returns a value between -1 and 1. A value of 1 means the two people rate every item identically.

import recommendation
print(recommendation.sim_pearson(recommendation.critics,'Lisa Rose','Gene Seymour') )
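
For the critics data above, the numerator works out to exactly 1.0 and the denominator to √6.375, so this prints approximately 0.3961, a noticeably different value from the 0.2943 that the distance-based score gives the same pair.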

Ranking the Critics
Now that we have functions for comparing two people, we can score everyone against a given person and find the closest matches.

# Returns the best matches for person from the prefs dictionary
# The number of results and the similarity function are optional parameters
def topMatches(prefs, person, n = 5, similarity = sim_pearson):
    scores = [(similarity(prefs, person, other), other) for other in prefs if other != person]
    # (a list comprehension; see the glossary)
    # Sort the list so the highest scores appear at the top
    scores.sort()    # sort ascending
    scores.reverse() # reverse, so the best match comes first
    return scores[0:n]

Calling this method with your own name returns a list of critics along with their similarity scores:

>>> import recommendation
>>> recommendation.topMatches(recommendation.critics,'Toby',n=3)
[(0.9912407071619299, 'Lisa Rose'), (0.9244734516419049, 'Mick LaSalle'), (0.8934051474415647, 'Claudia Puig')]

From the results we learn that we should be reading the reviews written by Lisa Rose, since her tastes are most similar to ours.

Recommending Items

Finding a critic with similar tastes and reading the reviews they write is nice, but what we really want right now is not that: it's a movie recommendation.

To get around problems such as cold starts and oddball ratings, we score the movies using a weighted value, so that the critics' ratings produce an ordered ranking. To do this, we take the ratings from all the other critics, obtain each critic's similarity to us, and multiply it by the rating they gave each movie.

The columns headed S.x show the similarities multiplied by the ratings; this way, critics who are similar to us contribute more to the overall score than critics who are not. The Total row gives the sum of all these weighted scores.

A movie reviewed by more people would otherwise have a larger influence on the result. To correct for this, we divide by the row labeled Sim.Sum, the sum of the similarities of all the critics who reviewed that movie.
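
As a worked check, using the book's rounded similarity values for Toby (Rose 0.99, Seymour 0.38, Puig 0.89, LaSalle 0.92, Matthews 0.66), the score for The Night Listener comes out as

(0.99*3.0 + 0.38*3.0 + 0.89*4.5 + 0.92*3.0 + 0.66*3.0) / (0.99 + 0.38 + 0.89 + 0.92 + 0.66) ≈ 12.86 / 3.84 ≈ 3.35

which matches the first entry of the getRecommendations output shown further below.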

# Gets recommendations for a person by using a weighted average
# of every other user's rankings
def getRecommendations(prefs, person, similarity = sim_pearson):
    totals = {}
    simSums = {}
    for other in prefs:
        # Don't compare me to myself
        if other == person: continue
        sim = similarity(prefs, person, other)
        # Ignore similarity scores of zero or below
        if sim <= 0: continue
        for item in prefs[other]:
            # Only score movies I haven't seen yet
            if item not in prefs[person] or prefs[person][item] == 0:
                # Similarity * rating
                totals.setdefault(item, 0)
                # dict.setdefault(key, default=None) adds the key with the
                # given default value only if the key is not already present
                totals[item] += prefs[other][item] * sim
                # Sum of similarities
                simSums.setdefault(item, 0)
                simSums[item] += sim

    # Create the normalized list (a list comprehension)
    rankings = [(total / simSums[item], item) for item, total in totals.items()]
    # Return the sorted list
    rankings.sort()
    rankings.reverse()
    return rankings

The code above loops over every other person in the prefs dictionary. For each of them it computes how similar they are to the person specified by the person parameter, then loops over every item they have rated.

Each item's rating is multiplied by the similarity, and these products are accumulated. At the end, each total is divided by the sum of the similarities to normalize it, and a sorted result is returned.

>>> import recommendation
>>> recommendation.getRecommendations(recommendation.critics,'Toby')
[(3.3477895267131017, 'The Night Listener'), (2.8325499182641614, 'Lady in the Water'), (2.530980703765565, 'Just My Luck')]
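
The similarity function is pluggable, so the Euclidean score can be swapped in without any other changes. For this small dataset the top results should come out in much the same order, though the weights will differ:

>>> recommendation.getRecommendations(recommendation.critics,'Toby',similarity=recommendation.sim_distance)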

We have now built a complete recommendation system, and it works for any type of product or link. All we have to do is set up a dictionary of people, items, and scores, and we can use it to make recommendations for anyone.

Matching Products

We just need to transpose the person-to-item rating matrix used so far into an item-to-person rating matrix:

# This function simply swaps the people and the items in the dictionary
def transformPrefs(prefs):
    result = {}
    for person in prefs:
        for item in prefs[person]:
            result.setdefault(item, {})
            # Flip item and person
            result[item][person] = prefs[person][item]
    return result

Now calling the topMatches function we used earlier returns the movies most similar to a given movie:

>>> import recommendation
>>> movies=recommendation.transformPrefs(recommendation.critics)
>>> recommendation.topMatches(movies,'Snake on a Plane')
[(0.7637626158259785, 'Lady in the Water'), (0.11180339887498941, 'Superman Returns'), (-0.3333333333333333, 'Just My Luck'), (-0.5663521139548527, 'The Night Listener'), (-0.6454972243679047, 'You, Me and Dupree')]

Note that in this example some of the correlation scores are actually negative, which indicates that those who like the given movie tend to dislike the other one.

Above we produced recommendations related to a movie; beyond that, we can even recommend critics for a movie. For example, perhaps we are deciding whom to invite to a premiere:

>>> recommendation.getRecommendations(movies,'Just My Luck')
[(4.0, 'Michael Phillips'), (3.0, 'Jack Matthews')]
>>>

Building a del.icio.us Link Recommender

The del.icio.us API can no longer be used from here and the pydelicious module cannot be imported; since no good workaround was found, this part has to be skipped. The pydelicious.py source is kept below for reference.

pydelicious.py

"""Library to access del.icio.us data via Python.

:examples:

  Using the API class directly:

  >>> a = pydelicious.apiNew('user', 'passwd')
  >>> # or:
  >>> a = DeliciousAPI('user', 'passwd')
  >>> a.tags_get() # Same as:
  >>> a.request('tags/get', )

  Or by calling the 'convenience' methods on the module.

  - def add(user, passwd, url, description, tags = "", extended = "", dt = "", replace="no"):
  - def get(user, passwd, tag="", dt="",  count = 0):
  - def get_all(user, passwd, tag = ""):
  - def delete(user, passwd, url):
  - def rename_tag(user, passwd, oldtag, newtag):
  - def get_tags(user, passwd):

  >>> a = apiNew(user, passwd)
  >>> a.posts_add(url="http://my.com/", description="my.com", extended="the url is my.moc", tags="my com")
  True
  >>> len(a.posts_all())
  1
  >>> get_all(user, passwd)
  1

  These are short functions for getrss calls.

  >>> rss_

def get_userposts(user):
def get_tagposts(tag):
def get_urlposts(url):
def get_popular(tag = ""):

  >>> json_posts()
  >>> json_tags()
  >>> json_network()
  >>> json_fans()

:License: pydelicious is released under the BSD license. See 'license.txt'
 for more information.

:berend:
 - Rewriting comments to english. More documentation, examples.
 - Added JSON-like return values for XML data (del.icio.us also serves some JSON...)
 - better error/exception classes and handling, work in progress.
 - Encoding seems to be working (using UTF-8 here).

:@todo:
 - Source code SHOULD BE ASCII!
 - More tests.
 - Parse datetimes in XML.
 - Salvage and test RSS functionality?
 - Setup not used, Still works? Should setup.py be tested?
 - API functions need required argument checks.

 * include the license and also distribute it via setup.py
 * also write a readme and distribute it via setup.py
 * test on other systems as well (linux -> uni)
 * build releases automatically, name them correctly and move them
   into the right directory.
 * what else can the other libraries do? (ruby, java, perl, etc)
 * what do the people who use it want?
 * what could I use it for?
 * slim it down?

:done:
 * Refactored the API class, much cleaner now and functions dlcs_api_request, dlcs_parse_xml are available for who wants them.
 * is this right? probably still need to convert t[a]g with str2utf8
   >>> pydelicious.getrss(tag="t[a]g")
   url: http://del.icio.us/rss/tag/t[a]g
 * the requester must wait one second between calls
 * __init__.py re-exports the functions
 * the html parser does not work yet, at all
 * old functions are missing, get_posts_by_url, etc.
 * create a post function that also adds the missing attribs.
 * the api still needs more work
 * the requester must catch 503 errors
 * the rss parser must be adapted to many possible formats
"""
import sys
import os
import time
import datetime
import md5, httplib
import urllib, urllib2, time
from StringIO import StringIO

try:
    from elementtree.ElementTree import parse as parse_xml
except ImportError:
    from  xml.etree.ElementTree import parse as parse_xml

import feedparser


### Static config

__version__ = '0.5.0'
__author__ = 'Frank Timmermann <regenkind_at_gmx_dot_de>' # GP: does not respond to emails
__contributors__ = [
    'Greg Pinero',
    'Berend van Berkum <[email protected]>']
__url__ = 'http://code.google.com/p/pydelicious/'
__author_email__ = ""
# Old URL: 'http://deliciouspython.python-hosting.com/'

__description__ = '''pydelicious.py allows you to access the web service of del.icio.us via its API through python.'''
__long_description__ = '''the goal is to design an easy to use and fully functional python interface to del.icio.us. '''

DLCS_OK_MESSAGES = ('done', 'ok') # Known text values of positive del.icio.us <result> answers
DLCS_WAIT_TIME = 4
DLCS_REQUEST_TIMEOUT = 444 # Seconds before socket triggers timeout
#DLCS_API_REALM = 'del.icio.us API'
DLCS_API_HOST = 'https://api.del.icio.us'
DLCS_API_PATH = 'v1'
DLCS_API = "%s/%s" % (DLCS_API_HOST, DLCS_API_PATH)
DLCS_RSS = 'http://del.icio.us/rss/'

ISO_8601_DATETIME = '%Y-%m-%dT%H:%M:%SZ'

USER_AGENT = 'pydelicious.py/%s %s' % (__version__, __url__)

DEBUG = 0
if 'DLCS_DEBUG' in os.environ:
    DEBUG = int(os.environ['DLCS_DEBUG'])


# Taken from FeedParser.py
# timeoutsocket allows feedparser to time out rather than hang forever on ultra-slow servers.
# Python 2.3 now has this functionality available in the standard socket library, so under
# 2.3 you don't need to install anything.  But you probably should anyway, because the socket
# module is buggy and timeoutsocket is better.
try:
    import timeoutsocket # http://www.timo-tasi.org/python/timeoutsocket.py
    timeoutsocket.setDefaultSocketTimeout(DLCS_REQUEST_TIMEOUT)
except ImportError:
    import socket
    if hasattr(socket, 'setdefaulttimeout'): socket.setdefaulttimeout(DLCS_REQUEST_TIMEOUT)
if DEBUG: print >>sys.stderr, "Set socket timeout to %s seconds" % DLCS_REQUEST_TIMEOUT


### Utility classes

class _Waiter:
    """Waiter makes sure a certain amount of time passes between
    successive calls of `Waiter()`.

    Some attributes:
    :last: time of last call
    :wait: the minimum time needed between calls
    :waited: the number of calls throttled

    pydelicious.Waiter is an instance created when the module is loaded.
    """
    def __init__(self, wait):
        self.wait = wait
        self.waited = 0
        self.lastcall = 0;

    def __call__(self):
        tt = time.time()

        timeago = tt - self.lastcall

        if self.lastcall and DEBUG>2:
            print >>sys.stderr, "Lastcall: %s seconds ago." % lastcall

        if timeago <= self.wait:
            if DEBUG>0: print >>sys.stderr, "Waiting %s seconds." % self.wait
            time.sleep(self.wait)
            self.waited += 1
            self.lastcall = tt + self.wait
        else:
            self.lastcall = tt

Waiter = _Waiter(DLCS_WAIT_TIME)

class PyDeliciousException(Exception):
    '''Std. pydelicious error'''
    pass

class DeliciousError(Exception):
	"""Raised when the server responds with a negative answer"""


class DefaultErrorHandler(urllib2.HTTPDefaultErrorHandler):
    '''@xxx:bvb: Where is this used? should it be registered somewhere with urllib2?

    Handles HTTP Error, currently only 503.
    '''
    def http_error_503(self, req, fp, code, msg, headers):
        raise urllib2.HTTPError(req, code, 'del.icio.us throttled the request', headers, fp)


class post(dict):
    """Post object, contains href, description, hash, dt, tags,
    extended, user, count(, shared).

    @xxx:bvb: Is this needed? Right now this is superfluous,
    """
    def __init__(self, href = "", description = "", hash = "", time = "", tag = "", extended = "", user = "", count = "",
                 tags = "", url = "", dt = ""): # tags or tag?
        self["href"] = href
        if url != "": self["href"] = url
        self["description"] = description
        self["hash"] = hash
        self["dt"] = dt
        if time != "": self["dt"] = time
        self["tags"] = tags
        if tag != "":  self["tags"] = tag     # tag or tags? # !! tags
        self["extended"] = extended
        self["user"] = user
        self["count"] = count

    def __getattr__(self, name):
        try: return self[name]
        except KeyError: return object.__getattribute__(self, name)


class posts(list):
    """@xxx:bvb: idem as class post, python structures (dict/list) might
    suffice or a more generic solution is needed.
    """
    def __init__(self, *args):
        for i in args: self.append(i)

    def __getattr__(self, attr):
        try: return [p[attr] for p in self]
        except KeyError: return object.__getattribute__(self, attr)

### Utility functions

def str2uni(s):
    # type(in) str or unicode
    # type(out) unicode
    return ("".join([unichr(ord(i)) for i in s]))

def str2utf8(s):
    # type(in) str or unicode
    # type(out) str
    return ("".join([unichr(ord(i)).encode("utf-8") for i in s]))

def str2quote(s):
    return urllib.quote_plus("".join([unichr(ord(i)).encode("utf-8") for i in s]))

def dict0(d):
    # Trims empty dict entries
    # {'a':'a', 'b':'', 'c': 'c'} => {'a': 'a', 'c': 'c'}
    dd = dict()
    for i in d:
            if d[i] != "": dd[i] = d[i]
    return dd

def delicious_datetime(str):
    """Parse a ISO 8601 formatted string to a Python datetime ...
    """
    return datetime.datetime(*time.strptime(str, ISO_8601_DATETIME)[0:6])

def http_request(url, user_agent=USER_AGENT, retry=4):
    """Retrieve the contents referenced by the URL using urllib2.

    Retries up to four times (default) on exceptions.
    """
    request = urllib2.Request(url, headers={'User-Agent':user_agent})

    # Remember last error
    e = None

    # Repeat request on time-out errors
    tries = retry
    while tries:
        try:
            return urllib2.urlopen(request)

        except urllib2.HTTPError, e: # protocol errors,
            raise PyDeliciousException, "%s" % e

        except urllib2.URLError, e:
            # @xxx: Ugly check for time-out errors
            #if len(e)>0 and 'timed out' in arg[0]:
            print >> sys.stderr, "%s, %s tries left." % (e, tries)
            Waiter()
            tries = tries - 1
            #else:
            #    tries = None

    # Give up
    raise PyDeliciousException, \
            "Unable to retrieve data at '%s', %s" % (url, e)

def http_auth_request(url, host, user, passwd, user_agent=USER_AGENT):
    """Call an HTTP server with authorization credentials using urllib2.
    """
    if DEBUG: httplib.HTTPConnection.debuglevel = 1

    # Hook up handler/opener to urllib2
    password_manager = urllib2.HTTPPasswordMgrWithDefaultRealm()
    password_manager.add_password(None, host, user, passwd)
    auth_handler = urllib2.HTTPBasicAuthHandler(password_manager)
    opener = urllib2.build_opener(auth_handler)
    urllib2.install_opener(opener)

    return http_request(url, user_agent)

def dlcs_api_request(path, params='', user='', passwd='', throttle=True):
    """Retrieve/query a path within the del.icio.us API.

    This implements a minimum interval between calls to avoid
    throttling. [#]_ Use param 'throttle' to turn this behaviour off.

    @todo: back off on 503's (HTTPError, URLError? @todo: testing).

    Returned XML does not always correspond with given del.icio.us examples
    @todo: (cf. help/api/... and post's attributes)

    .. [#] http://del.icio.us/help/api/
    """
    if throttle:
        Waiter()

    if params:
        # params come as a dict, strip empty entries and urlencode
        url = "%s/%s?%s" % (DLCS_API, path, urllib.urlencode(dict0(params)))
    else:
        url = "%s/%s" % (DLCS_API, path)

    if DEBUG: print >>sys.stderr, "dlcs_api_request: %s" % url

    try:
        return http_auth_request(url, DLCS_API_HOST, user, passwd, USER_AGENT)

    # @bvb: Is this ever raised? When?
    except DefaultErrorHandler, e:
        print >>sys.stderr, "%s" % e

def dlcs_parse_xml(data, split_tags=False):
    """Parse any del.icio.us XML document and return Python data structure.

    Recognizes all XML document formats as returned by the version 1 API and
    translates to a JSON-like data structure (dicts 'n lists).

    Returned instance is always a dictionary. Examples::

     {'posts': [{'url':'...','hash':'...',},],}
     {'tags':['tag1', 'tag2',]}
     {'dates': [{'count':'...','date':'...'},], 'tag':'', 'user':'...'}
	 {'result':(True, "done")}
     # etcetera.
    """

    if DEBUG>3: print >>sys.stderr, "dlcs_parse_xml: parsing from ", data

    if not hasattr(data, 'read'):
        data = StringIO(data)

    doc = parse_xml(data)
    root = doc.getroot()
    fmt = root.tag

    # Split up into three cases: Data, Result or Update
    if fmt in ('tags', 'posts', 'dates', 'bundles'):

        # Data: expect a list of data elements, 'resources'.
        # Use `fmt` (without last 's') to find data elements, elements
        # don't have contents, attributes contain all the data we need:
        # append to list
        elist = [el.attrib for el in doc.findall(fmt[:-1])]

        # Return list in dict, use tagname of rootnode as keyname.
        data = {fmt: elist}

        # Root element might have attributes too, append dict.
        data.update(root.attrib)

        return data

    elif fmt == 'result':

        # Result: answer to operations
        if root.attrib.has_key('code'):
            msg = root.attrib['code']
        else:
            msg = root.text

        # Return {'result':(True, msg)} for /known/ O.K. messages,
        # use (False, msg) otherwise
        v = msg in DLCS_OK_MESSAGES
        return {fmt: (v, msg)}

    elif fmt == 'update':

        # Update: "time"
        #return {fmt: root.attrib}
		return {fmt: {'time':time.strptime(root.attrib['time'], ISO_8601_DATETIME)}}

    else:
        raise PyDeliciousException, "Unknown XML document format '%s'" % fmt

def dlcs_rss_request(tag = "", popular = 0, user = "", url = ''):
    """Handle a request for RSS

    (translated from the original German notes:)

    RSS should work again now, but this try/except mess is not pretty.

    The RSS feeds are assembled differently per type; there is no single
    consistent mapping between the data (url, desc, ext, etc.) and the
    feed. Why can't they make this uniform?
    """
    tag = str2quote(tag)
    user = str2quote(user)
    if url != '':
        # http://del.icio.us/rss/url/efbfb246d886393d48065551434dab54
        url = DLCS_RSS + '''url/%s'''%md5.new(url).hexdigest()
    elif user != '' and tag != '':
        url = DLCS_RSS + '''%(user)s/%(tag)s'''%dict(user=user, tag=tag)
    elif user != '' and tag == '':
        # http://del.icio.us/rss/delpy
        url = DLCS_RSS + '''%s'''%user
    elif popular == 0 and tag == '':
        url = DLCS_RSS
    elif popular == 0 and tag != '':
        # http://del.icio.us/rss/tag/apple
        # http://del.icio.us/rss/tag/web2.0
        url = DLCS_RSS + "tag/%s"%tag
    elif popular == 1 and tag == '':
        url = DLCS_RSS + '''popular/'''
    elif popular == 1 and tag != '':
        url = DLCS_RSS + '''popular/%s'''%tag
    rss = http_request(url).read()
    rss = feedparser.parse(rss)
    # print rss
#     for e in rss.entries: print e;print
    l = posts()
    for e in rss.entries:
        if e.has_key("links") and e["links"]!=[] and e["links"][0].has_key("href"):
            url = e["links"][0]["href"]
        elif e.has_key("link"):
            url = e["link"]
        elif e.has_key("id"):
            url = e["id"]
        else:
            url = ""
        if e.has_key("title"):
            description = e['title']
        elif e.has_key("title_detail") and e["title_detail"].has_key("title"):
            description = e["title_detail"]['value']
        else:
            description = ''
        try: tags = e['categories'][0][1]
        except:
            try: tags = e["category"]
            except: tags = ""
        if e.has_key("modified"):
            dt = e['modified']
        else:
            dt = ""
        if e.has_key("summary"):
            extended = e['summary']
        elif e.has_key("summary_detail"):
            e['summary_detail']["value"]
        else:
            extended = ""
        if e.has_key("author"):
            user = e['author']
        else:
            user = ""
#  (translated from German:) "time = dt" points to a problem here:
#  the variable naming is not consistent;
#  sending to the api and getting xml back
#  are two different pairs of shoes :(
        l.append(post(url = url, description = description, tags = tags, dt = dt, extended = extended, user = user))
    return l


### Main module class

class DeliciousAPI:
    """Class providing main interace to del.icio.us API.

    Methods ``request`` and ``request_raw`` represent the core. For all API
    paths there are furthermore methods (e.g. posts_add for 'posts/all') with
    an explicit declaration of the parameters and documentation. These all call
    ``request`` and pass on extra keywords like ``_raw``.
    """

    def __init__(self, user, passwd, codec='iso-8859-1', api_request=dlcs_api_request, xml_parser=dlcs_parse_xml):
        """Initialize access to the API with ``user`` and ``passwd``.

        ``codec`` sets the encoding of the arguments.

        The ``api_request`` and ``xml_parser`` parameters by default point to
        functions within this package with standard implementations to
        request and parse a resource. See ``dlcs_api_request()`` and
        ``dlcs_parse_xml()``. Note that ``api_request`` should return a
        file-like instance with an HTTPMessage instance under ``info()``,
        see ``urllib2.openurl`` for more info.
        """
        assert user != ""
        self.user = user
        self.passwd = passwd
        self.codec = codec

        # Implement communication to server and parsing of respons messages:
        assert callable(api_request)
        self._api_request = api_request
        assert callable(xml_parser)
        self._parse_response = xml_parser

    def _call_server(self, path, **params):
        params = dict0(params)
        for key in params:
            params[key] = params[key].encode(self.codec)

        # see __init__ for _api_request()
        return self._api_request(path, params, self.user, self.passwd)


    ### Core functionality

    def request(self, path, _raw=False, **params):
        """Calls a path in the API, parses the answer to a JSON-like structure by
        default. Use with ``_raw=True`` or ``call request_raw()`` directly to
        get the filehandler and process the response message manually.

        Calls to some paths will return a `result` message, i.e.::

            <result code="..." />

        or::

            <result>...</result>

        These are all parsed to ``{'result':(Boolean, MessageString)}`` and this
        method will raise ``DeliciousError`` on negative `result` answers. Using
        ``_raw=True`` bypasses all parsing and will never raise ``DeliciousError``.

        See ``dlcs_parse_xml()`` and ``self.request_raw()``."""

        # method _parse_response is bound in `__init__()`, `_call_server`
        # uses `_api_request` also set in `__init__()`
        if _raw:
            # return answer
            return self.request_raw(path, **params)

        else:
            # get answer and parse
            fl = self._call_server(path, **params)
            rs = self._parse_response(fl)

            # Raise an error for negative 'result' answers
            if type(rs) == dict and 'result' in rs and not rs['result'][0]:
                errmsg = ""
                if len(rs['result'])>0:
                    errmsg = rs['result'][1:]
                raise DeliciousError, errmsg

            return rs

    def request_raw(self, path, **params):
        """Calls the path in the API, returns the filehandle. Returned
        file-like instances have an ``HTTPMessage`` instance with HTTP header
        information available. Use ``filehandle.info()`` or refer to the
        ``urllib2.openurl`` documentation.
        """
        # see `request()` on how the response can be handled
        return self._call_server(path, **params)

    ### Explicit declarations of API paths, their parameters and docs

    # Tags
    def tags_get(self, **kwds):
        """Returns a list of tags and the number of times it is used by the user.
        ::

            <tags>
                <tag tag="TagName" count="888">
        """
        return self.request("tags/get", **kwds)

    def tags_rename(self, old, new, **kwds):
        """Rename an existing tag with a new tag name. Returns a `result`
        message or raises an ``DeliciousError``. See ``self.request()``.

        &old (required)
            Tag to rename.
        &new (required)
            New name.
        """
        return self.request("tags/rename", old=old, new=new, **kwds)

    # Posts
    def posts_update(self, **kwds):
        """Returns the last update time for the user. Use this before calling
        `posts_all` to see if the data has changed since the last fetch.
        ::

            <update time="CCYY-MM-DDThh:mm:ssZ">
		"""
        return self.request("posts/update", **kwds)

    def posts_dates(self, tag="", **kwds):
        """Returns a list of dates with the number of posts at each date.
        ::

            <dates>
                <date date="CCYY-MM-DD" count="888">

        &tag (optional).
            Filter by this tag.
        """
        return self.request("posts/dates", tag=tag, **kwds)

    def posts_get(self, tag="", dt="", url="", **kwds):
        """Returns posts matching the arguments. If no date or url is given,
        most recent date will be used.
        ::

            <posts dt="CCYY-MM-DD" tag="..." user="...">
                <post ...>

        &tag (optional).
            Filter by this tag.
        &dt (optional).
            Filter by this date (CCYY-MM-DDThh:mm:ssZ).
        &url (optional).
            Filter by this url.
        """
        return self.request("posts/get", tag=tag, dt=dt, url=url, **kwds)

    def posts_recent(self, tag="", count="", **kwds):
        """Returns a list of the most recent posts, filtered by argument.
        ::

            <posts tag="..." user="...">
                <post ...>

        &tag (optional).
            Filter by this tag.
        &count (optional).
            Number of items to retrieve (Default:15, Maximum:100).
        """
        return self.request("posts/recent", tag=tag, count=count, **kwds)

    def posts_all(self, tag="", **kwds):
        """Returns all posts. Please use sparingly. Call the `posts_update`
        method to see if you need to fetch this at all.
        ::

            <posts tag="..." user="..." update="CCYY-MM-DDThh:mm:ssZ">
                <post ...>

        &tag (optional).
            Filter by this tag.
        """
        return self.request("posts/all", tag=tag, **kwds)

    def posts_add(self, url, description, extended="", tags="", dt="",
            replace="no", shared="yes", **kwds):
        """Add a post to del.icio.us. Returns a `result` message or raises an
        ``DeliciousError``. See ``self.request()``.

        &url (required)
            the url of the item.
        &description (required)
            the description of the item.
        &extended (optional)
            notes for the item.
        &tags (optional)
            tags for the item (space delimited).
        &dt (optional)
            datestamp of the item (format "CCYY-MM-DDThh:mm:ssZ").

        Requires a LITERAL "T" and "Z" like in ISO8601 at http://www.cl.cam.ac.uk/~mgk25/iso-time.html for example: "1984-09-01T14:21:31Z"
        &replace=no (optional) - don't replace post if given url has already been posted.
        &shared=no (optional) - make the item private
        """
        return self.request("posts/add", url=url, description=description,
                extended=extended, tags=tags, dt=dt,
                replace=replace, shared=shared, **kwds)

    def posts_delete(self, url, **kwds):
        """Delete a post from del.icio.us. Returns a `result` message or
        raises an ``DeliciousError``. See ``self.request()``.

        &url (required)
            the url of the item.
        """
        return self.request("posts/delete", url=url, **kwds)

    # Bundles
    def bundles_all(self, **kwds):
        """Retrieve user bundles from del.icio.us.
        ::

            <bundles>
                <bundle name="..." tags=...">
        """
        return self.request("tags/bundles/all", **kwds)

    def bundles_set(self, bundle, tags, **kwds):
        """Assign a set of tags to a single bundle, wipes away previous
        settings for bundle. Returns a `result` messages or raises an
        ``DeliciousError``. See ``self.request()``.

        &bundle (required)
            the bundle name.
        &tags (required)
            list of tags (space separated).
        """
        if type(tags)==list:
            tags = " ".join(tags)
        return self.request("tags/bundles/set", bundle=bundle, tags=tags,
                **kwds)

    def bundles_delete(self, bundle, **kwds):
        """Delete a bundle from del.icio.us. Returns a `result` message or
        raises an ``DeliciousError``. See ``self.request()``.

        &bundle (required)
            the bundle name.
        """
        return self.request("tags/bundles/delete", bundle=bundle, **kwds)

    ### Utils

    # Lookup table for del.icio.us url-path to DeliciousAPI method.
    paths = {
        'tags/get': tags_get,
        'tags/rename': tags_rename,
        'posts/update': posts_update,
        'posts/dates': posts_dates,
        'posts/get': posts_get,
        'posts/recent': posts_recent,
        'posts/all': posts_all,
        'posts/add': posts_add,
        'posts/delete': posts_delete,
        'tags/bundles/all': bundles_all,
        'tags/bundles/set': bundles_set,
        'tags/bundles/delete': bundles_delete,
    }

    def get_url(self, url):
        """Return the del.icio.us url at which the HTML page with posts for
        ``url`` can be found.
        """
        return "http://del.icio.us/url/?url=%s" % (url,)


### Convenience functions on this package

def apiNew(user, passwd):
    """creates a new DeliciousAPI object.
    requires user(name) and passwd
	"""
    return DeliciousAPI(user=user, passwd=passwd)

def add(user, passwd, url, description, tags="", extended="", dt="", replace="no"):
    return apiNew(user, passwd).posts_add(url=url, description=description, extended=extended, tags=tags, dt=dt, replace=replace)

def get(user, passwd, tag="", dt="",  count = 0):
    posts = apiNew(user, passwd).posts_get(tag=tag,dt=dt)
    if count != 0: posts = posts[0:count]
    return posts

def get_all(user, passwd, tag=""):
    return apiNew(user, passwd).posts_all(tag=tag)

def delete(user, passwd, url):
    return apiNew(user, passwd).posts_delete(url=url)

def rename_tag(user, passwd, oldtag, newtag):
    return apiNew(user=user, passwd=passwd).tags_rename(old=oldtag, new=newtag)

def get_tags(user, passwd):
    return apiNew(user=user, passwd=passwd).tags_get()


### RSS functions @bvb: still working...?
def getrss(tag="", popular=0, url='', user=""):
    """get posts from del.icio.us via parsing RSS @bvb[or HTML]

	@bvb[not tested]

    tag (opt) sort by tag
    popular (opt) look for the popular stuff
    user (opt) get the posts by a user, this overrides popular
    url (opt) get the posts by url
	"""
    return dlcs_rss_request(tag=tag, popular=popular, user=user, url=url)

def get_userposts(user):
    return getrss(user = user)

def get_tagposts(tag):
    return getrss(tag = tag)

def get_urlposts(url):
    return getrss(url = url)

def get_popular(tag = ""):
    return getrss(tag = tag, popular = 1)


### @TODO: implement JSON fetching
def json_posts(user, count=15):
    """http://del.icio.us/feeds/json/mpe
    http://del.icio.us/feeds/json/mpe/art+history
    count=###   the number of posts you want to get (default is 15, maximum is 100)
    raw         a raw JSON object is returned, instead of an object named Delicious.posts
    """

def json_tags(user, atleast, count, sort='alpha'):
    """http://del.icio.us/feeds/json/tags/mpe
    atleast=###         include only tags for which there are at least ### number of posts
    count=###           include ### tags, counting down from the top
    sort={alpha|count}  construct the object with tags in alphabetic order (alpha), or by count of posts (count)
    callback=NAME       wrap the object definition in a function call NAME(...), thus invoking that function when the feed is executed
    raw                 a pure JSON object is returned, instead of code that will construct an object named Delicious.tags
    """

def json_network(user):
    """http://del.icio.us/feeds/json/network/mpe
    callback=NAME       wrap the object definition in a function call NAME(...)
    ?raw         a raw JSON object is returned, instead of an object named Delicious.posts
    """

def json_fans(user):
    """http://del.icio.us/feeds/json/fans/mpe
    callback=NAME       wrap the object definition in a function call NAME(...)
    ?raw         a pure JSON object is returned, instead of an object named Delicious.
    """


deliciousrec.py

from pydelicious import get_popular,get_userposts,get_urlposts
import time

def initializeUserDict(tag,count=5):
  user_dict={}
  # get the top `count` popular posts
  for p1 in get_popular(tag=tag)[0:count]:
    # find all users who posted this
    for p2 in get_urlposts(p1['href']):
      user=p2['user']
      user_dict[user]={}
  return user_dict

def fillItems(user_dict):
  all_items={}
  # Find links posted by all users
  for user in user_dict:
    posts=[]  # stays empty if all three attempts fail
    for i in range(3):
      try:
        posts=get_userposts(user)
        break
      except:
        print("Failed user "+user+", retrying")
        time.sleep(4)
    for post in posts:
      url=post['href']
      user_dict[user][url]=1.0
      all_items[url]=1
  
  # Fill in missing items with 0
  for ratings in user_dict.values():
    for item in all_items:
      if item not in ratings:
        ratings[item]=0.0
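
If the API were still reachable, the dataset would be built as in the book (the 'programming' tag here is just an example):

>>> from deliciousrec import *
>>> delusers=initializeUserDict('programming')
>>> fillItems(delusers)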

Recommending Neighbors and Links

To pick a user at random and find other users whose tastes are similar to theirs, we can call topMatches.

We can also get recommended links for that user by calling the getRecommendations function. Since the call returns every item in ranked order, it is best to limit the result to the first 10.

The entries in the preference list can also be swapped around, so that we can search by link rather than by person. A minimal sketch of these steps follows below.
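
Here is that sketch, assuming delusers was built and filled as above and that recommendation.py is importable:

import random
import recommendation

user = random.choice(list(delusers.keys()))
print(recommendation.topMatches(delusers, user))                # users with similar tastes
print(recommendation.getRecommendations(delusers, user)[0:10])  # top 10 recommended links
urls = recommendation.transformPrefs(delusers)                  # transposed, to search by link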

Item-Based Filtering

The general idea is to precompute, for every item, the other items most similar to it. Then, when we want to make recommendations for a user, we look at their top-rated items and build a weighted list of the items most similar to those. Comparisons between items do not change as often as comparisons between users, so instead of continually recomputing each item's nearest items, we can schedule this computation for times of low network traffic, or run it on a machine separate from the main application.

Building the Item Comparison Dataset

def calculateSimilarItems(prefs, n = 10):
    # Create a dictionary of items showing which other items they are most similar to
    result = {}
    # Invert the preference matrix to be item-centric
    itemPrefs = transformPrefs(prefs)
    c = 0
    for item in itemPrefs:
        # Status updates for large datasets
        # (note: this print is Python 2 syntax; the Python 3 fix appears later in this post)
        c += 1
        if c % 100 == 0:print ("%d / %d") % (c, len(itemPrefs))
        # Find the most similar items to this one
        scores = topMatches(itemPrefs, item, n = n, similarity = sim_distance)
        result[item] = scores
    return result # a dictionary of items mapped to their most similar items

This function first uses the previously defined transformPrefs function to invert the ratings dictionary, producing a list of items along with the way each user rated them. It then loops over every item and passes the transformed dictionary to topMatches to obtain the most similar items and their similarity scores. Finally, it builds and returns a dictionary mapping each item to its list of most similar items.

>>> import recommendation
>>> itemsim=recommendation.calculateSimilarItems(recommendation.critics)
>>> itemsim
{'Lady in the Water': [(0.4494897427831781, 'You, Me and Dupree'), (0.38742588672279304, 'The Night Listener'), (0.3483314773547883, 'Snake on a Plane'), (0.3483314773547883, 'Just My Luck'), (0.2402530733520421, 'Superman Returns')], 'Snake on a Plane': [(0.3483314773547883, 'Lady in the Water'), (0.32037724101704074, 'The Night Listener'), (0.3090169943749474, 'Superman Returns'), (0.2553967929896867, 'Just My Luck'), (0.1886378647726465, 'You, Me and Dupree')], 'Just My Luck': [(0.3483314773547883, 'Lady in the Water'), (0.32037724101704074, 'You, Me and Dupree'), (0.2989350844248255, 'The Night Listener'), (0.2553967929896867, 'Snake on a Plane'), (0.20799159651347807, 'Superman Returns')], 'Superman Returns': [(0.3090169943749474, 'Snake on a Plane'), (0.252650308587072, 'The Night Listener'), (0.2402530733520421, 'Lady in the Water'), (0.20799159651347807, 'Just My Luck'), (0.1918253663634734, 'You, Me and Dupree')], 'You, Me and Dupree': [(0.4494897427831781, 'Lady in the Water'), (0.32037724101704074, 'Just My Luck'), (0.29429805508554946, 'The Night Listener'), (0.1918253663634734, 'Superman Returns'), (0.1886378647726465, 'Snake on a Plane')], 'The Night Listener': [(0.38742588672279304, 'Lady in the Water'), (0.32037724101704074, 'Snake on a Plane'), (0.2989350844248255, 'Just My Luck'), (0.29429805508554946, 'You, Me and Dupree'), (0.252650308587072, 'Superman Returns')]}

This function only needs to be run often enough to keep the item similarities from going stale. It has to be run more frequently while the user base and the number of ratings are still small; as the number of users grows, the similarity scores between items usually become more and more stable.

Getting Recommendations

def getRecommendedItems(prefs, itemMatch, user):  # itemMatch is the item-similarity dictionary
    userRatings = prefs[user]
    scores = {}
    totalSim = {}
    # Loop over items rated by this user
    for (item, rating) in userRatings.items():
        # Loop over items similar to this one
        for (similarity, item2) in itemMatch[item]:
            # Ignore if this user has already rated the item
            if item2 in userRatings: continue
            # Weighted sum of rating times similarity
            scores.setdefault(item2, 0)  # see the earlier note on setdefault
            scores[item2] += similarity * rating
            # Sum of all the similarities
            totalSim.setdefault(item2, 0)
            totalSim[item2] += similarity
    # Divide each total score by the total weighting to get an average
    rankings = [(score / totalSim[item], item) for item, score in scores.items()]
    # Return the rankings, highest score first
    rankings.sort()
    rankings.reverse()
    return rankings

Getting a new set of recommendations for Toby:

>>> import recommendation
>>> itemsim=recommendation.calculateSimilarItems(recommendation.critics)
>>> recommendation.getRecommendedItems(recommendation.critics,itemsim,'Toby')
[(3.1667425234070894, 'The Night Listener'), (2.9366294028444346, 'Just My Luck'), (2.868767392626467, 'Lady in the Water')]

Using the MovieLens Dataset

The dataset can be downloaded from the MovieLens website; note that the file used here is the small one (ml-latest-small).

The data format we need looks like this:

Each line contains a user ID, a movie ID, the rating the user gave that movie, and the time of the rating. We can get a movie's title from its ID, but the user data is anonymous, so in this section we work only with user IDs.
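
For reference, both files in ml-latest-small start with a header row and use comma-separated fields, roughly like this (the data rows below are illustrative):

movies.csv:
movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy

ratings.csv:
userId,movieId,rating,timestamp
1,1,4.0,964982703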

def loadMovieLens(path = '/data/ml-latest-small'):
    # Get movie titles
    movies = {}
    for line in open(path + '/movies.csv'):
        (movieId, title) = line.split('\t')[0:2]  # the third column of the file is the genre, so this differs slightly from the book
        movies[movieId] = title  # map the title to the id
    # Load the ratings
    prefs = {}
    for line in open(path + '/ratings.csv'):
        (user, movieid, rating, ts) = line.split('\t')  # split the fields
        prefs.setdefault(user, {})
        prefs[user][movies[movieid]] = float(rating)
    return prefs

Since the CSV files downloaded from the site are comma-separated and movie titles can themselves contain commas, the data was preprocessed in Excel and converted to tab-separated form (see reference 1).

However, an error came up while converting the rating values (screenshot of the error). After reading some other posts on processing this data, the code was changed as follows (see references 1 and 2):

def loadMovieLens(path='./data/my-small'):
    import csv
    # Get movie titles
    movies = {}
    with open(path + '/movies.csv') as movies_file:
        row = csv.reader(movies_file, delimiter='\t')
        next(row)  # skip the header row
        id = []    # list to hold the movie ids
        title = [] # list to hold the movie titles
        for r in row:
            id.append(r[0])
            title.append(r[1])
            movies[id] = title  # map the title to the id
    # Load the ratings
    prefs = {}
    with open(path + '/ratings.csv') as ratings_file:
        row = csv.reader(ratings_file, delimiter=',')
        next(row)  # skip the header row
        user = []    # list to hold the user ids
        movieid = [] # list to hold the movie ids
        rating = []  # list to hold the users' ratings
        ts = []      # list to hold the timestamps
        # line = line.strip()
        # read every row after the header and append its fields to the lists
        for r in row:
            user.append(r[0])
            movieid.append(r[1])
            rating.append(float(r[2]))
            ts.append(r[3])
            prefs.setdefault(user, {})
            #prefs[user][movies[movieid]] = float(rating)
            prefs[user][movies[movieid]] = rating
        # print(rating[50])
        #print(type(rating))
    #print('done')
    return prefs

But it still errored; Python was clearly getting the better of me. (The underlying problem: id, title, user, and movieid here are lists, and a list cannot be used as a dictionary key, so movies[id] and prefs.setdefault(user, {}) raise TypeError: unhashable type.)

For the error above, see the article referenced earlier. Since the str-to-float conversion kept failing, I settled for simply reading the ratings column into a list first, and the program finally ran. Below is the final code for loading the movie data.

def loadMovieLens(path='./data/my-small'):
    import csv
    # Get movie titles
    movies = {}
    for line in open(path + '/movies.csv'):
        (id, title) = line.split('\t')[0:2]  # the third column of the file is the genre, so this differs slightly from the book
        movies[id] = title  # map the title to the id
    # Load the ratings
    prefs = {}
    with open('./data/my-small/ratings.csv') as csv_file:
        row = csv.reader(csv_file, delimiter=',')
        next(row)  # skip the header row
        ratings = []  # list to hold the rating values
        # read the third column of every row after the header into ratings
        for r in row:
            ratings.append(float(r[2]))  # convert the string to a float and append
    #print(len(ratings))
    #print(ratings[0])
    #print(ratings[100835])
    i = 0
    for line in open(path + '/ratings.csv'):
        if(i == len(ratings)): break
        (user, movieid, rating, ts) = line.split(',')  # split the fields
        prefs.setdefault(user, {})
        prefs[user][movies[movieid]] = ratings[i]
        i += 1
        #print(i)
    return prefs

The output finally matches the book. A randomly chosen user's ratings (screenshot of the ratings dictionary):

Now the user-based recommendations finally work:

import recommendation
prefs=recommendation.loadMovieLens()
print(prefs['87'])
print(recommendation.getRecommendations(prefs,'87')[0:30])

(screenshot of the recommendation output)

I'm not sure why values like 5.00001 appear; perhaps a precision issue crept in during the data conversion?
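
Both the comma workaround and, quite possibly, those precision artifacts come from the Excel preprocessing. For comparison, here is a minimal sketch that reads the original comma-separated files directly with Python's csv module, which handles quoted titles containing commas and avoids pairing a separately read ratings list with a second pass over the file (the function name loadMovieLensCsv is mine):

import csv

def loadMovieLensCsv(path='./data/ml-latest-small'):
    # Map movieId -> title
    movies = {}
    with open(path + '/movies.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for movieId, title, genres in reader:
            movies[movieId] = title
    # Build the nested preference dictionary
    prefs = {}
    with open(path + '/ratings.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        for user, movieId, rating, ts in reader:
            prefs.setdefault(user, {})
            prefs[user][movies[movieId]] = float(rating)
    return prefs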

Item-based recommendations:

itemsim = recommendation.calculateSimilarItems(prefs,50)
print(recommendation.getRecommendedItems(prefs,itemsim,'87')[0:30])

But it errored again (screenshot of the traceback). The problem here is the Python version: the status print inside calculateSimilarItems uses Python 2 syntax, and in Python 3 it must be written as:

if c % 100 == 0: print("%d / %d" % (c, len(itemPrefs)))

(screenshot of the item-based recommendation output)

Building the item similarity dataset really did take a long time, but once the data was built, the recommendation itself finished almost instantly. Better still, the time it takes to get recommendations does not grow with the number of users.

When generating recommendation lists against large datasets, item-based filtering is clearly faster than user-based filtering, although it does carry the extra overhead of maintaining the item similarity table. Accuracy also differs depending on how "sparse" the dataset is. In the movie example the data is dense (not sparse), since every critic has rated nearly every movie. Finding two users with similar bookmarks is a different matter: most bookmarks are saved by small niche groups, which makes for a sparse dataset. Item-based filtering usually outperforms user-based filtering on sparse datasets, while on dense datasets the two perform about equally well.
